DOI: 10.1145/1247480.1247504

On synopses for distinct-value estimation under multiset operations

Published: 11 June 2007

Abstract

The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques that are designed for use within a flexible and scalable "synopsis warehouse" architecture. In this setting, incoming data is split into partitions and a synopsis is created for each partition; each synopsis can then be used to quickly estimate the number of DVs in its corresponding partition. By combining and extending a number of results in the literature, we obtain both appropriate synopses and novel DV estimators to use in conjunction with these synopses. Our synopses can be created in parallel, and can then be easily combined to yield synopses and DV estimates for arbitrary unions, intersections or differences of partitions. Our synopses can also handle deletions of individual partition elements. We use the theory of order statistics to show that our DV estimators are unbiased, and to establish moment formulas and sharp error bounds. Based on a novel limit theorem, we can exploit results due to Cohen in order to select synopsis sizes when initially designing the warehouse. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
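The abstract does not spell out the synopsis construction, so the following is only a minimal sketch of the general idea, assuming a k-minimum-values style synopsis: hash each value to a pseudo-uniform number in [0, 1), keep the k smallest distinct hash values per partition, and estimate the DV count from the k-th smallest value with the order-statistics estimator (k - 1) / U_(k). The function names (`unit_hash`, `build_synopsis`, `union_synopsis`, `estimate_dv`) and the choice of SHA-256 are illustrative assumptions, not taken from the paper. Only the union operation is shown; intersections, differences, and deletions need additional bookkeeping that is not sketched here.

```python
import hashlib


def unit_hash(value):
    """Map a value to a pseudo-uniform number in [0, 1) using SHA-256 (assumed hash)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_synopsis(partition, k):
    """Return the k smallest distinct hash values of a partition, in sorted order."""
    return sorted({unit_hash(v) for v in partition})[:k]


def union_synopsis(syn_a, syn_b, k):
    """Combine two size-k synopses into a size-k synopsis for the union.

    The k smallest hashes of A union B are always among the k smallest
    hashes of A together with the k smallest hashes of B.
    """
    return sorted(set(syn_a) | set(syn_b))[:k]


def estimate_dv(synopsis, k):
    """Estimate the number of distinct values from a synopsis.

    With fewer than k hashes, the synopsis holds every distinct value,
    so the count is exact. Otherwise use (k - 1) / U_(k), where U_(k)
    is the k-th smallest hash value.
    """
    if len(synopsis) < k:
        return float(len(synopsis))
    return (k - 1) / synopsis[k - 1]


if __name__ == "__main__":
    k = 1024
    part_a = range(0, 6000)       # hypothetical partition A
    part_b = range(4000, 10000)   # hypothetical partition B, overlapping A
    syn_a = build_synopsis(part_a, k)
    syn_b = build_synopsis(part_b, k)
    syn_union = union_synopsis(syn_a, syn_b, k)
    print("estimated DVs in union:", round(estimate_dv(syn_union, k)))
```

Because each partition's synopsis depends only on that partition's data, the synopses can be built in parallel and merged later, which is the property the abstract highlights. In this sketch both synopses are assumed to share the same k.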

References

[1]
N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. Sys. Sci., 58:137--147, 1999.
[2]
M. Astrahan, M. Schkolnick, and K. Whang. Approximating the number of unique values of an attribute without sorting. Inf. Sys., 12:11--15, 1987.
[3]
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. RANDOM, pages 1--10, 2002.
[4]
P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World, pages 161--180. Springer, 2005.
[5]
P. G. Brown and P. J. Haas. Techniques for warehousing of sample data. In Proc. ICDE, 2006.
[6]
M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. In Proc. ACM PODS, pages 268--279, 2000.
[7]
E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Sys. Sci., 55:441--453, 1997.
[8]
T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In Proc. ACM SIGMOD, pages 240--251, 2002.
[9]
H. A. David and H. N. Nagaraja. Order Statistics. Wiley, third edition, 2003.
[10]
A. R. Didonato and A. H. Morris, Jr. Algorithm 708: significant digit computation of the incomplete beta function ratios. ACM Trans. Math. Software, 18(3):360--373, 1992.
[11]
M. Durand and P. Flajolet. Loglog counting of large cardinalities. In Proc. 11th Eur. Symp. Algorithms (ESA 2003), volume 2832 of Lecture Notes in Computer Science. Springer, 2003.
[12]
C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high speed links. In Proc. SIGCOMM '02, pages 323--336, 2002.
[13]
P. Flajolet. Adaptive sampling. In M. Hazewinkel, editor, Encyclopaedia of Mathematics, Supplement I. Kluwer, 1997.
[14]
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Sys. Sci., 31:182--209, 1985.
[15]
S. Ganguly, M. Garofalakis, and R. Rastogi. Tracking set-expression cardinalities over continuous update streams. VLDB J., 13:354--369, 2004.
[16]
P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. VLDB, pages 541--550, 2001.
[17]
P. B. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. ACM Symp. Parallel Algorithms and Architecture, pages 281--291, 2001.
[18]
F. Giroire. Order statistics and estimating cardinalities of massive data sets. In Proc. Intl. Conf. Analysis Algorithms, pages 157--166, 2005.
[19]
P. J. Haas, Y. Liu, and L. Stokes. An estimator of the number of species from quadrat sampling. Biometrics, 62:135--141, 2006.
[20]
P. J. Haas and L. Stokes. Estimating the number of classes in a finite population. J. Amer. Statist. Assoc., 93:1475--1487, 1998.
[21]
P. Hellekalek and S. Wegenkittl. Empirical evidence concerning AES. ACM Trans. Modelling Comput. Simulation, 13:322--333, 2003.
[22]
Y. E. Ioannidis. The history of histograms (abridged). In Proc. VLDB, pages 19--30, 2003.
[23]
N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions-2. Wiley, 2nd edition, 1995.
[24]
S. Karlin and H. M. Taylor. A Second Course in Stochastic Processes. Academic Press, 1981.
[25]
D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, 1973.
[26]
M. Matsumoto and T. Nishimura. Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. Modeling Computer Simulation, 8(1):3--30, 1998.
[27]
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[28]
S. Padmanabhan, B. Bhattacharjee, T. Malkemus, L. Cranston, and M. Huras. Multi-dimensional clustering: a new data layout scheme in DB2. In Proc. ACM SIGMOD, pages 637--641, 2003.
[29]
P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proc. ACM SIGMOD, pages 23--34, 1979.
[30]
R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
[31]
A. Shukla, P. Deshpande, J. F. Naughton, and K. Ramasamy. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. VLDB, pages 522--531, 1996.
[32]
J. Vitter. Random Sampling with a Reservoir. ACM Trans. Math. Software, 11(1):37--57, 1985.
[33]
K. Whang, B. T. Vander-Zanden, and H. M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Sys., 15:208--229, 1990.

    Published In

    SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
    June 2007
    1210 pages
    ISBN:9781595936868
    DOI:10.1145/1247480
    General Chairs: Lizhu Zhou, Tok Wang Ling
    Program Chair: Beng Chin Ooi
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. distinct-value estimation
    2. synopsis warehouse

    Qualifiers

    • Article

    Conference

    SIGMOD/PODS07

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
