Distinct-value synopses for multiset operations

Authors: Rainer Gemulla, Berthold Reinwald, Yannis Sismanis

Communications of the ACM, Volume 52, Issue 10

Pages 87 - 95

https://doi.org/10.1145/1562764.1562787

Published: 01 October 2009

Abstract

The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques for the case in which the dataset of interest is split into partitions. We create for each partition a synopsis that can be used to estimate the number of DVs in the partition. By combining and extending a number of results in the literature, we obtain both suitable synopses and DV estimators. The synopses can be created in parallel, and can be easily combined to yield synopses and DV estimates for "compound" partitions that are created from the base partitions via arbitrary multiset union, intersection, or difference operations. Our synopses can also handle deletions of individual partition elements. We prove that our DV estimators are unbiased, provide error bounds, and show how to select synopsis sizes in order to achieve a desired estimation accuracy. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
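The abstract does not say which synopsis is used, so the following is only a minimal sketch under an explicit assumption: each partition keeps the k smallest hash values of its elements (a k-minimum-values style synopsis), and the DV count is estimated as (k - 1) / U_(k), where U_(k) is the k-th smallest hash. All function names are hypothetical, the hash is assumed to behave like a uniform random function, and deletions and error bounds are left out.

```python
import hashlib


def h(value):
    """Map a value to a float in (0, 1]; assumed to act like a uniform random hash."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def kmv_synopsis(values, k):
    """Keep the k smallest distinct hash values of one partition.

    Duplicates hash to the same value, so multiplicities do not matter here.
    """
    return sorted({h(v) for v in values})[:k]


def estimate_dv(synopsis, k):
    """Estimate the number of distinct values from a synopsis of size at most k (k >= 2)."""
    if len(synopsis) < k:
        # Fewer than k hashes seen: the synopsis holds every DV, so the count is exact.
        return float(len(synopsis))
    return (k - 1) / synopsis[k - 1]


def union_synopsis(syn_a, syn_b, k):
    """Synopsis of the union: the k smallest hashes across both inputs."""
    return sorted(set(syn_a) | set(syn_b))[:k]


def estimate_intersection(syn_a, syn_b, k):
    """Scale the union estimate by the fraction of combined hashes present in both inputs."""
    combined = union_synopsis(syn_a, syn_b, k)
    if not combined:
        return 0.0
    in_both = len(set(combined) & set(syn_a) & set(syn_b))
    return in_both / len(combined) * estimate_dv(combined, k)


if __name__ == "__main__":
    k = 1024
    a = kmv_synopsis(range(0, 20000), k)       # partition A, built independently
    b = kmv_synopsis(range(10000, 30000), k)   # partition B, built independently
    print("DVs in A (true 20000):", round(estimate_dv(a, k)))
    print("DVs in A union B (true 30000):", round(estimate_dv(union_synopsis(a, b, k), k)))
    print("DVs in A intersect B (true 10000):", round(estimate_intersection(a, b, k)))
```

Because each synopsis depends only on its own partition, the synopses in this sketch can be built in parallel and merged afterwards, which mirrors the combination property the abstract describes.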

Index Terms

1. Information systems
  1. Data management systems

Published In

Communications of the ACM Volume 52, Issue 10

A View of Parallel Computing

October 2009

134 pages

ISSN:0001-0782

EISSN:1557-7317

DOI:10.1145/1562764

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Popular
Refereed

Cited By

Tench D, West E, Zhang V, Bender M, Chowdhury A, Delayo D, Dellas J, Farach-Colton M, Seip T, Zhang K (2024). GraphZeppelin: How to Find Connected Components (Even When Graphs Are Dense, Dynamic, and Massive). ACM Transactions on Database Systems 49:3 (1-31). Online publication date: 16-May-2024.
https://dl.acm.org/doi/10.1145/3643846

Lemiesz J (2023). Efficient Framework for Operating on Data Sketches. Proceedings of the VLDB Endowment 16:8 (1967-1978). Online publication date: 1-Apr-2023.
https://dl.acm.org/doi/10.14778/3594512.3594526

Wang D, Pettie S, Geerts F, Ngo H, Sintos S (2023). Better Cardinality Estimators for HyperLogLog, PCSA, and Beyond. Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (317-327). Online publication date: 18-Jun-2023.
https://dl.acm.org/doi/10.1145/3584372.3588680

Dickens C, Bax E (2023). Matching Noisy Keys for Obfuscation. 2023 IEEE International Conference on Big Data (BigData) (5485-5492). Online publication date: 15-Dec-2023.
https://doi.org/10.1109/BigData59044.2023.10386396

Dickens C, Thaler J, Ting D, Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A (2022). Order-invariant cardinality estimators are differentially private. Proceedings of the 36th International Conference on Neural Information Processing Systems (15204-15216). Online publication date: 28-Nov-2022.
https://dl.acm.org/doi/10.5555/3600270.3601376

Ting D, Ives Z, Bonifati A, El Abbadi A (2022). Adaptive Threshold Sampling. Proceedings of the 2022 International Conference on Management of Data (1612-1625). Online publication date: 10-Jun-2022.
https://dl.acm.org/doi/10.1145/3514221.3526122

Reviriego P, Sanchez-Macian A, Liu S, Lombardi F (2022). On the Security of the K Minimum Values (KMV) Sketch. IEEE Transactions on Dependable and Secure Computing 19:5 (3539-3545). Online publication date: 1-Sep-2022.
https://doi.org/10.1109/TDSC.2021.3101280

Lemiesz J (2021). On the algebra of data sketches. Proceedings of the VLDB Endowment 14:9 (1655-1667). Online publication date: 1-May-2021.
https://dl.acm.org/doi/10.14778/3461535.3461553

Santos A, Bessa A, Chirigati F, Musco C, Freire J, Li G, Li Z, Idreos S, Srivastava D (2021). Correlation Sketches for Approximate Join-Correlation Queries. Proceedings of the 2021 International Conference on Management of Data (1531-1544). Online publication date: 9-Jun-2021.
https://dl.acm.org/doi/10.1145/3448016.3458456

Pettie S, Wang D, Khuller S, Vassilevska Williams V (2021). Information theoretic limits of cardinality estimation: Fisher meets Shannon. Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (556-569). Online publication date: 15-Jun-2021.
https://dl.acm.org/doi/10.1145/3406325.3451032
