Distinct-value synopses for multiset operations

Published: 01 October 2009

Abstract

The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques for the case in which the dataset of interest is split into partitions. We create for each partition a synopsis that can be used to estimate the number of DVs in the partition. By combining and extending a number of results in the literature, we obtain both suitable synopses and DV estimators. The synopses can be created in parallel, and can be easily combined to yield synopses and DV estimates for "compound" partitions that are created from the base partitions via arbitrary multiset union, intersection, or difference operations. Our synopses can also handle deletions of individual partition elements. We prove that our DV estimators are unbiased, provide error bounds, and show how to select synopsis sizes in order to achieve a desired estimation accuracy. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
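
The idea of per-partition synopses that combine under union and intersection can be sketched with a k-minimum-values (KMV) style synopsis: hash every element to a pseudo-uniform number in [0, 1), keep only the k smallest distinct hash values, and estimate the distinct count as (k - 1)/U_(k), where U_(k) is the k-th smallest retained hash. The abstract does not spell out the construction, so the hash function, the function names, and the choice of k below are illustrative assumptions, not the authors' implementation. Deletion handling is omitted.

```python
import hashlib


def unit_hash(value):
    """Map a value to a pseudo-uniform number in [0, 1).

    SHA-256 is an illustrative choice; any well-mixing hash would do.
    """
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_synopsis(values, k):
    """Return the k smallest distinct hash values of a partition, sorted."""
    return sorted({unit_hash(v) for v in values})[:k]


def union_synopsis(syn_a, syn_b, k):
    """Combine two size-k synopses into a size-k synopsis of the union."""
    return sorted(set(syn_a) | set(syn_b))[:k]


def estimate_dv(synopsis, k):
    """Estimate the number of distinct values from a synopsis of size k."""
    if len(synopsis) < k:
        # Fewer than k distinct values were seen, so the count is exact
        # (ignoring the negligible chance of hash collisions).
        return float(len(synopsis))
    return (k - 1) / synopsis[k - 1]


def estimate_intersection(syn_a, syn_b, k):
    """Estimate distinct values common to both partitions.

    Scales the union estimate by the fraction of the combined synopsis
    whose hash values appear in both input synopses.
    """
    combined = union_synopsis(syn_a, syn_b, k)
    common = set(syn_a) & set(syn_b)
    hits = sum(1 for h in combined if h in common)
    if len(combined) < k:
        return float(hits)
    return (hits / k) * (k - 1) / combined[k - 1]


if __name__ == "__main__":
    k = 256
    part_a = kmv_synopsis(range(0, 20000), k)
    part_b = kmv_synopsis(range(10000, 30000), k)
    print("A:", estimate_dv(part_a, k))
    print("A union B:", estimate_dv(union_synopsis(part_a, part_b, k), k))
    print("A intersect B:", estimate_intersection(part_a, part_b, k))
```

Because each synopsis depends only on its own partition, the synopses can be built independently (for example, in parallel) and merged afterward. Larger k costs more space per partition but reduces the estimator's variance.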

    Published In

    Communications of the ACM  Volume 52, Issue 10
    A View of Parallel Computing
    October 2009
    134 pages
    ISSN:0001-0782
    EISSN:1557-7317
    DOI:10.1145/1562764
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article
    • Popular
    • Refereed
