skip to main content
article

Order statistics and estimating cardinalities of massive data sets

Published: 20 January 2009 Publication History

Abstract

A new class of algorithms to estimate the cardinality of very large multisets using constant memory and doing only one pass on the data is introduced here. It is based on order statistics rather than on bit patterns in binary representations of numbers. Three families of estimators are analyzed. They attain a standard error of 1M using M units of storage, which places them in the same class as the best known algorithms so far. The algorithms have a very simple internal loop, which gives them an advantage in terms of processing speed. For instance, a memory of only 12 kB and only few seconds are sufficient to process a multiset with several million elements and to build an estimate with accuracy of order 2 percent. The algorithms are validated both by mathematical analysis and by experimentations on real internet traffic.

References

[1]
Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D. and Trevisan, Luca, Counting distinct elements in a data stream. In: RANDOM '02: Proceedings of the 6th International Workshop on Randomization and Approximation Techniques, Springer-Verlag. pp. 1-10.
[2]
Broder, A., On the resemblance and containment of documents. In: SEQUENCES '97: Proceedings of the Compression and Complexity of Sequences 1997, IEEE Computer Society, Washington, DC, USA. pp. 21
[3]
Broder, Andrei Z., Identifying and filtering near-duplicate documents. In: COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, Springer-Verlag, London, UK. pp. 1-10.
[4]
P. Chassaing, L. Gerin, Efficient estimation of the cardinality of large data sets, in: math.ST/0701347, 2007. Extended abstract in the proceedings of the 4th Colloquium on Mathematics and Computer Science, 2007, pp.¿419-422
[5]
Considine, J., Li, F., Kollios, G. and Byers, J., Approximate aggregation techniques for sensor databases. In: ICDE '04: Proceedings of the 20th International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA. pp. 449
[6]
Durand, M. and Flajolet, P., Loglog counting of large cardinalities. In: Di Battista, G., Zwick, U. (Eds.), Lecture Notes in Computer Science, vol.~2832. pp. 605-617.
[7]
Estan, C., Varghese, G. and Fisk, M., Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw. v14 i5. 925-937.
[8]
Flajolet, P., Adaptive sampling. In: Hazewinkel, M. (Ed.), Encyclopaedia of Mathematics, vol.~Suppl.~I. Kluwer Academic Publishers, Dordrecht. pp. 28
[9]
P. Flajolet, E. Fusy, O. Gandouet, F. Meunier, Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm, in: Philippe Jacquet (Ed.) Analysis of Algorithm 2007, AofA07, Discrete Mathematics and Theoretical Computer Science Proceedings, 2007 (in press)
[10]
Flajolet, P. and Martin, P.N., Probabilistic counting. In: Proceedings of the 24th Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press. pp. 76-82.
[11]
C. Fraleigh, C. Diot, B. Lyles, S. Moon, P. Owezarski, K. Papagiannaki, F. Tobagi, Design and deployment of a passive monitoring infrastructure, in: Passive and Active Measurement Workshop, Amsterdam, April 2001
[12]
Fusy, í. and Giroire, F., Estimating the number of active flows in a data stream over a sliding window. In: Appelgate, David (Ed.), Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments and the Fourth Workshop on Analytic Algorithmics and Combinatorics, SIAM Press. pp. 223-231.
[13]
L. Getoor, B. Taskar, D. Koller, Selectivity estimation using probabilistic models, in: SIGMOD Conference, 2001
[14]
P.B. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, VLDB J. (2001) 541-550
[15]
Giroire, F., Order statistics and estimating cardinalities of massive data sets. In: Martnez, Conrado (Ed.), International Conference on Analysis of Algorithms, volume AD of DMTCS Proceedings, pp. 157-166.
[16]
F. Giroire, Directions to use probabilistic algorithms for cardinality for DNA analysis, in: Journées Ouvertes Biologie Informatique Mathématiques, JOBIM 2006, pp.¿3-5, July 2006
[17]
F. Giroire, Réseaux, algorithmique et analyse combinatoire de grands ensembles, Ph.D. Thesis, Université Paris VI, November 2006
[18]
G. Iannaccone, C. Diot, I. Graham, N. McKeown, Monitoring very high speed links, in: ACM SIGCOMM Internet Measurement Workshop, San Francisco, CA, November 2001
[19]
Knuth, D.E., The Art of Computer Programming, Volume~3: Sorting and Searching. 1973. Addison-Wesley.
[20]
Whang, K.-Y., Zanden, B.T.V. and Taylor, H.M., A linear-time probabilistic counting algorithm for database applications. TODS. v15 i2. 208-229.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 20 January 2009

Author Tags

  1. Algorithm analysis
  2. Cardinality estimates
  3. Traffic analysis
  4. Very large multisets

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 16 Oct 2024

Other Metrics

Citations

Cited By

View all

View Options

View options

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media