article

Order statistics and estimating cardinalities of massive data sets

Author:

Frédéric GiroireAuthors Info & Claims

Discrete Applied Mathematics, Volume 157, Issue 2

Pages 406 - 427

https://doi.org/10.1016/j.dam.2008.06.020

Published: 20 January 2009 Publication History

Abstract

A new class of algorithms to estimate the cardinality of very large multisets using constant memory and doing only one pass on the data is introduced here. It is based on order statistics rather than on bit patterns in binary representations of numbers. Three families of estimators are analyzed. They attain a standard error of 1M using M units of storage, which places them in the same class as the best known algorithms so far. The algorithms have a very simple internal loop, which gives them an advantage in terms of processing speed. For instance, a memory of only 12 kB and only few seconds are sufficient to process a multiset with several million elements and to build an estimate with accuracy of order 2 percent. The algorithms are validated both by mathematical analysis and by experimentations on real internet traffic.

References

[1]

Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D. and Trevisan, Luca, Counting distinct elements in a data stream. In: RANDOM '02: Proceedings of the 6th International Workshop on Randomization and Approximation Techniques, Springer-Verlag. pp. 1-10.

[2]

Broder, A., On the resemblance and containment of documents. In: SEQUENCES '97: Proceedings of the Compression and Complexity of Sequences 1997, IEEE Computer Society, Washington, DC, USA. pp. 21

Digital Library

[3]

Broder, Andrei Z., Identifying and filtering near-duplicate documents. In: COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, Springer-Verlag, London, UK. pp. 1-10.

[4]

P. Chassaing, L. Gerin, Efficient estimation of the cardinality of large data sets, in: math.ST/0701347, 2007. Extended abstract in the proceedings of the 4th Colloquium on Mathematics and Computer Science, 2007, pp.¿419-422

[5]

Considine, J., Li, F., Kollios, G. and Byers, J., Approximate aggregation techniques for sensor databases. In: ICDE '04: Proceedings of the 20th International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA. pp. 449

[6]

Durand, M. and Flajolet, P., Loglog counting of large cardinalities. In: Di Battista, G., Zwick, U. (Eds.), Lecture Notes in Computer Science, vol.~2832. pp. 605-617.

[7]

Estan, C., Varghese, G. and Fisk, M., Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw. v14 i5. 925-937.

[8]

Flajolet, P., Adaptive sampling. In: Hazewinkel, M. (Ed.), Encyclopaedia of Mathematics, vol.~Suppl.~I. Kluwer Academic Publishers, Dordrecht. pp. 28

[9]

P. Flajolet, E. Fusy, O. Gandouet, F. Meunier, Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm, in: Philippe Jacquet (Ed.) Analysis of Algorithm 2007, AofA07, Discrete Mathematics and Theoretical Computer Science Proceedings, 2007 (in press)

[10]

Flajolet, P. and Martin, P.N., Probabilistic counting. In: Proceedings of the 24th Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press. pp. 76-82.

Digital Library

[11]

C. Fraleigh, C. Diot, B. Lyles, S. Moon, P. Owezarski, K. Papagiannaki, F. Tobagi, Design and deployment of a passive monitoring infrastructure, in: Passive and Active Measurement Workshop, Amsterdam, April 2001

[12]

Fusy, í. and Giroire, F., Estimating the number of active flows in a data stream over a sliding window. In: Appelgate, David (Ed.), Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments and the Fourth Workshop on Analytic Algorithmics and Combinatorics, SIAM Press. pp. 223-231.

[13]

L. Getoor, B. Taskar, D. Koller, Selectivity estimation using probabilistic models, in: SIGMOD Conference, 2001

[14]

P.B. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, VLDB J. (2001) 541-550

[15]

Giroire, F., Order statistics and estimating cardinalities of massive data sets. In: Martnez, Conrado (Ed.), International Conference on Analysis of Algorithms, volume AD of DMTCS Proceedings, pp. 157-166.

[16]

F. Giroire, Directions to use probabilistic algorithms for cardinality for DNA analysis, in: Journées Ouvertes Biologie Informatique Mathématiques, JOBIM 2006, pp.¿3-5, July 2006

[17]

F. Giroire, Réseaux, algorithmique et analyse combinatoire de grands ensembles, Ph.D. Thesis, Université Paris VI, November 2006

[18]

G. Iannaccone, C. Diot, I. Graham, N. McKeown, Monitoring very high speed links, in: ACM SIGCOMM Internet Measurement Workshop, San Francisco, CA, November 2001

Digital Library

[19]

Knuth, D.E., The Art of Computer Programming, Volume~3: Sorting and Searching. 1973. Addison-Wesley.

[20]

Whang, K.-Y., Zanden, B.T.V. and Taylor, H.M., A linear-time probabilistic counting algorithm for database applications. TODS. v15 i2. 208-229.

Cited By

Qi YLi RWang PSun YXing RBaeza-Yates RBonchi F(2024)QSketch: An Efficient Sketch for Weighted Cardinality Estimation in StreamsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671695(2432-2443)Online publication date: 25-Aug-2024
https://dl.acm.org/doi/10.1145/3637528.3671695
Sultan SShakiba KLee AChen PStumm M(2024)TTLs Matter: Efficient Cache Sizing with TTL-Aware Miss Ratio Curves and Working Set SizesProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3650066(387-404)Online publication date: 22-Apr-2024
https://dl.acm.org/doi/10.1145/3627703.3650066
Hehir JTing DCormode GKrause ABrunskill ECho KEngelhardt BSabato SScarlett J(2023)Sketch-flip-mergeProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3618930(12846-12865)Online publication date: 23-Jul-2023
https://dl.acm.org/doi/10.5555/3618408.3618930
Show More Cited By

Index Terms

Order statistics and estimating cardinalities of massive data sets

Recommendations

On cardinalities of k-abelian equivalence classes

Two words u and v are k-abelian equivalent if for each word x of length at most k, x occurs equally many times as a factor in both u and v. The notion of k-abelian equivalence is an intermediate notion between the abelian equivalence and the equality of ...
The Use of Order Statistics in Estimation

<P>A survey is given of recent advances, including a number of original contributions by the author, in the use of order statistics to obtain point and interval estimates of the parameters of various statistical populations from complete and from ...
Cache-oblivious data structures for massive data sets

Comments

Information & Contributors

Information

Published In

Copyright © Elsevier B.V. © 2008.

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 20 January 2009

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

38
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 16 Oct 2024

Other Metrics

View Author Metrics

Citations

Cited By

Qi YLi RWang PSun YXing RBaeza-Yates RBonchi F(2024)QSketch: An Efficient Sketch for Weighted Cardinality Estimation in StreamsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671695(2432-2443)Online publication date: 25-Aug-2024
https://dl.acm.org/doi/10.1145/3637528.3671695
Sultan SShakiba KLee AChen PStumm M(2024)TTLs Matter: Efficient Cache Sizing with TTL-Aware Miss Ratio Curves and Working Set SizesProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3650066(387-404)Online publication date: 22-Apr-2024
https://dl.acm.org/doi/10.1145/3627703.3650066
Hehir JTing DCormode GKrause ABrunskill ECho KEngelhardt BSabato SScarlett J(2023)Sketch-flip-mergeProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3618930(12846-12865)Online publication date: 23-Jul-2023
https://dl.acm.org/doi/10.5555/3618408.3618930
Wang DPettie SGeerts FNgo HSintos S(2023)Better Cardinality Estimators for HyperLogLog, PCSA, and BeyondProceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3584372.3588680(317-327)Online publication date: 18-Jun-2023
https://dl.acm.org/doi/10.1145/3584372.3588680
Harris DRinberg ARottenstreich O(2023)Compressing Distributed Network Sketches With Traffic-Aware SummariesIEEE Transactions on Network and Service Management10.1109/TNSM.2022.317229920:2(1962-1975)Online publication date: 1-Jun-2023
https://dl.acm.org/doi/10.1109/TNSM.2022.3172299
Lin YXu ZZhang YLi YZhang J(2023)Cardinality estimation with smoothing autoregressive modelsWorld Wide Web10.1007/s11280-023-01195-726:5(3441-3461)Online publication date: 28-Jul-2023
https://dl.acm.org/doi/10.1007/s11280-023-01195-7
Dickens CThaler JTing DKoyejo SMohamed SAgarwal ABelgrave DCho KOh A(2022)Order-invariant cardinality estimators are differentially privateProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3601376(15204-15216)Online publication date: 28-Nov-2022
https://dl.acm.org/doi/10.5555/3600270.3601376
Zhou XLi GChai CFeng J(2022)A learned query rewrite system using Monte Carlo tree searchProceedings of the VLDB Endowment10.14778/3485450.348545615:1(46-58)Online publication date: 14-Jan-2022
https://dl.acm.org/doi/10.14778/3485450.3485456
Wu YFan ZShi QZhang YYang TChen CZhong ZLi JShtul ATu Y(2022)SHE: A Generic Framework for Data Stream Mining over Sliding WindowsProceedings of the 51st International Conference on Parallel Processing10.1145/3545008.3545009(1-12)Online publication date: 29-Aug-2022
https://dl.acm.org/doi/10.1145/3545008.3545009
Karppa MPagh RZhang ARangwala H(2022)HyperLogLogLogProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3534678.3539246(753-761)Online publication date: 14-Aug-2022
https://dl.acm.org/doi/10.1145/3534678.3539246
Show More Cited By

View Options

View options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents