research-article

Distributed hash sketches: Scalable, efficient, and accurate cardinality estimation for distributed multisets

Authors:

P. Triantafillou,

G. WeikumAuthors Info & Claims

ACM Transactions on Computer Systems (TOCS), Volume 27, Issue 1

Article No.: 2, Pages 1 - 53

https://doi.org/10.1145/1482619.1482621

Published: 13 February 2009 Publication History

Abstract

Counting items in a distributed system, and estimating the cardinality of multisets in particular, is important for a large variety of applications and a fundamental building block for emerging Internet-scale information systems. Examples of such applications range from optimizing query access plans in peer-to-peer data sharing, to computing the significance (rank/score) of data items in distributed information retrieval. The general formal problem addressed in this article is computing the network-wide distinct number of items with some property (e.g., distinct files with file name containing “spiderman”) where each node in the network holds an arbitrary subset, possibly overlapping the subsets of other nodes. The key requirements that a viable approach must satisfy are: (1) scalability towards very large network size, (2) efficiency regarding messaging overhead, (3) load balance of storage and access, (4) accuracy of the cardinality estimation, and (5) simplicity and easy integration in applications. This article contributes the DHS (Distributed Hash Sketches) method for this problem setting: a distributed, scalable, efficient, and accurate multiset cardinality estimator. DHS is based on hash sketches for probabilistic counting, but distributes the bits of each counter across network nodes in a judicious manner based on principles of Distributed Hash Tables, paying careful attention to fast access and aggregation as well as update costs. The article discusses various design choices, exhibiting tunable trade-offs between estimation accuracy, hop-count efficiency, and load distribution fairness. We further contribute a full-fledged, publicly available, open-source implementation of all our methods, and a comprehensive experimental evaluation for various settings.

References

[1]

Aberer, K., Datta, A., Hauswirth, M., and Schmidt, R. 2005. Indexing data-oriented overlay networks. In Proceedings of the International Conference on Very Large Databases (VLDB).

Digital Library

[2]

Alon, N., Gibbons, P. B., Matias, Y., and Szegedy, M. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS).

Digital Library

[3]

Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC).

Digital Library

[4]

Aspnes, J. and Shah, G. 2003. Skip Graphs. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA).

Digital Library

[5]

Babaoǧlu, Ö., Meling, H., and Montresor, A. 2002. Anthill: A framework for the development of agent-based peer-to-peer systems. In Proceedings of the IEEE International Conference on Distributed Computing and Systems (ICDCS).

Digital Library

[6]

Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D., and Trevisan, L. 2002. Counting distinct elements in a data stream. In Proceedings of the International Workshop on Randomization and Approximation Techniques (RANDOM).

Digital Library

[7]

Bawa, M., Garcia-Molina, H., Gionis, A., and Motwani, R. 2003. Estimating aggregates on a peer-to-peer network. Tech. rep., Computer Science Department, Stanford University.

[8]

Bawa, M., Gionis, A., Garcia-Molina, H., and Motwani, R. 2004. The price of validity in dynamic networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD).

Digital Library

[9]

Beyer, K., Haas, P. J., Reinwald, B., Sismanis, Y., and Gemulla, R. 2007. On synopses for distinct-value estimation under multiset operations. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD).

Digital Library

[10]

Bharambe, A., Agrawal, M., and Seshan, S. 2004. Mercury: Supporting scalable multi-attribute range queries. In Proceedings of the ACM Symposium on Communications Architectures and Protocols (SIGCOMM).

Digital Library

[11]

Chaudhuri, S., Motwani, R., and Narasayya, R. 1998. Random sampling for histogram construction: How much is enough&quest; In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD).

Digital Library

[12]

Considine, J., Li, F., Kollios, G., and Byers, J. 2004. Approximate aggregation techniques for sensor databases. In Proceedings of the International Conference on Data Engineering (ICDE).

Digital Library

[13]

Cormode, G. and Garofalakis, M. N. 2005. Sketching streams through the net: Distributed approximate query tracking. In Proceedings of the International Conference on Very Large Databases (VLDB).

Digital Library

[14]

Cormode, G. and Muthukrishnan, S. 2004. An improved data stream summary: The count-min sketch and its applications. In Proceedings of the Latin American Symposium on Theoretical Informatics (LATIN).

[15]

Dabek, F., Li, J., Sit, E., Robertson, J., Kaashoek, M. F., and Morris, R. 2004. Designing a DHT for low latency and high throughput. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).

Digital Library

[16]

Damgaard, C. and Weiner, J. 2000. Describing inequality in plant size or fecundity. Ecology 81, 1139--1142.

[17]

Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2004. Sketch-Based multi-query processing over data streams. In Proceedings of the International Conference on Extending Database Technology (EDBT).

[18]

Druschel, P. and Rowstron, A. 2001. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the IFIP/ACM IFIP/ACM International Conference on Distributed Systems Platforms (Middleware).

Digital Library

[19]

Durand, M. and Flajolet, P. 2003. Loglog counting of large cardinalities. In Proceedings of the Annual European Symposium on Algorithms (ESA).

[20]

Flajolet, P. and Martin, G. N. 1985. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31, 2, 182--209.

Digital Library

[21]

FreeDHS. 2006. Homepage. http://netcins.ceid.upatras.gr/DHS.php.

[22]

FreePastry. 2002. Homepage. http://freepastry.org/FreePastry/.

[23]

Ganguly, S., Garofalakis, M., and Rastogi, R. 2003. Processing set expressions over continuous update streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD).

Digital Library

[24]

Ganguly, S., Gibbons, P. B., Matias, Y., and Silberschatz, A. 1996. Bifocal sampling for skew-resistant join size estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD).

Digital Library

[25]

Gnutella. 2001. Homepage. http://gnutella.wego.com/.

[26]

Gummadi, K., Gummadi, R., Gribble, S., Ratnasamy, S., Shenker, S., and Stoica, I. 2003. The impact of DHT routing geometry on resilience and proximity. In Proceedings of the ACM Symposium on Communications Architectures and Protocols (SIGCOMM).

Digital Library

[27]

Gupta, A., Agrawal, D., and El Abbadi, A. 2003. Approximate range selection queries in peer-to-peer systems. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).

[28]

Hadjieleftheriou, M., Byers, J. W., and Kollios, G. 2005. Robust sketching and aggregation of distributed data streams. Tech. rep. 2005-011, Computer Science Department, Boston University.

[29]

Harren, M., Hellerstein, J. M., Huebsch, R., Loo, B. T., Shenker, S., and Stoica, I. 2002. Complex queries in DHT-based peer-to-peer networks. In Proceedings of the International Workshop on Peer-to-Peer Systems (IPTPS).

Digital Library

[30]

Harvey, N., Jones, M., Saroiu, S., Theimer, M., and Wolman, A. 2003. Skipnet: A scalable overlay network with practical locality properties. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS).

Digital Library

[31]

Huebsch, R., Chun, B. N., Hellerstein, J. M., Loo, B. T., Maniatis, P., Roscoe, T., Shenker, S., Stoica, I., and Yumerefendi, A. R. 2005. The architecture of PIER: An Internet-scale query processor. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).

[32]

Huebsch, R., Hellerstein, J. M., Lanham, N., Loo, B. T., Shenker, S., and Stoica, I. 2003. Querying the Internet with PIER. In Proceedings of the International Conference on Very Large Data Bases (VLDB).

Digital Library

[33]

Ives, Z., Khandelwal, N., Kapur, A., and Cakir, M. 2005. ORCHESTRA: Rapid, collaborative sharing of dynamic data. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).

[34]

Jelasity, M. and Montresor, A. 2004. Epidemic-Style proactive aggregation in large overlay networks. In Proceedings of the IEEE International Conference on Distributed Computing and Systems (ICDCS).

Digital Library

[35]

Karger, D., Lehman, E., Leighton, T., Levine, M., Lewin, D., and Panigrahy, R. 1997. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'97).

Digital Library

[36]

Kempe, D., Dobra, A., and Gehrke, J. 2003. Computing aggregate information using gossip. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS).

Digital Library

[37]

Koloniari, G. and Pitoura, E. 2004. Content-based routing of path queries in peer-to-peer systems. In Proceedings of the International Conference on Extending Database Technology (EDBT).

[38]

Krishnan, P. 1995. Online prediction algorithms for databases and operating systems. Ph.D. thesis, Brown University.

Digital Library

[39]

Lipton, R. and Naughton, J. F. 1995. Query size estimation by adaptive sampling. J. Comput. Syst. Sci. 51, 1, 18--25.

Digital Library

[40]

Lipton, R. J., Naughton, J. F., and Schneider, D. A. 1990. Practical selectivity estimation through adaptive sampling. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD).

Digital Library

[41]

Manku, G. 2003. Routing networks for distributed hash tables. In Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC).

Digital Library

[42]

Maymouknov, P. and Mazières, D. 2002. Kademlia: A peer-to-peer information system based on the XOR metric. In Proceedings of the International Workshop on Peer-to-Peer Systems (IPTPS).

Digital Library

[43]

Michel, S., Bender, M., Ntarmos, N., Triantafillou, P., Weikum, G., and Zimmer, C. 2006. Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM).

Digital Library

[44]

Montresor, A., Meling, H., and Babaoǧlu, Ö. 2002. Messor: Load-Balancing through a swarm of autonomous agents. In Proceedings of the Workshop on Agent and Peer-to-Peer Systems.

Digital Library

[45]

Ng, W. S., Ooi, B. C., Tan, K. L., and Zhou, A. 2003. PeerDB: A P2P-based system for distributed data sharing. In Proceedings of the International Conference on Data Engineering (ICDE).

[46]

Ntarmos, N. and Triantafillou, P. 2004. AESOP: Altruism-Endowed self-organizing peers. In Proceedings of the International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P).

Digital Library

[47]

Ntarmos, N., Triantafillou, P., and Weikum, G. 2006. Counting at large: Efficient cardinality estimation in Internet-scale data networks. In Proceedings of the International Conference on Data Engineering (ICDE).

Digital Library

[48]

Palmer, C. R., Siganos, G., Faloutsos, M., Faloutsos, C., and Gibbons, P. B. 2001. The connectivity and fault-tolerance of the Internet topology. In Proceedings of the Workshop on Network-Related Data Management (NRDM).

[49]

Papadimos, V., Maier, D., and Tufte, K. 2003. Distributed query processing and catalogs for peer-to-peer systems. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).

[50]

Pitoura, T., Ntarmos, N., and Triantafillou, P. 2006. Replication, load balancing, and efficient range query processing in DHT data networks. In Proceedings of the International Conference on Extending Database Technology (EDBT).

Digital Library

[51]

Pitoura, T. and Triantafillou, P. 2007. Load distribution fairness in p2p data management systems. In Proceedings of the International Conference on Data Engineering (ICDE).

[52]

Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. 2001. A scalable content-addressable network. In Proceedings of the ACM Symposium on Communications Architectures and Protocols (SIGCOMM).

Digital Library

[53]

Rhea, S., Geels, D., Roscoe, T., and Kubiatowicz, J. 2004. Handling churn in a DHT. In Proceedings of the USENIX Annual Technical Conference (USENIX).

Digital Library

[54]

Saroiu, S., Gummadi, P. K., and Gribble, S. D. 2002. A measurement study of peer-to-peer file sharing systems. In Proceedings of the Multimedia Computing and Networking Conference (MMCN).

[55]

Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. 2001. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of the ACM Symposium on Communications Architectures and Protocols (SIGCOMM).

Digital Library

[56]

Triantafillou, P. and Pitoura, T. 2003. Towards a unifying framework for complex query processing over structured peer-to-peer data networks. In Proceedings of the International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P).

[57]

van Renesse, R., Birman, K. P., and Vogels, W. 2003. Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining. ACM Trans. Comput. Syst. 21, 2, 164--206.

Digital Library

[58]

Yalagandula, P. and Dahlin, M. 2004. A scalable distributed information management system. In Proceedings of the ACM Symposium on Communications Architectures and Protocols (SIGCOMM).

Digital Library

[59]

Yang, B. and Garcia-Molina, H. 2001. Comparing hybrid peer-to-peer systems. In Proceedings of the International Conference on Very Large Data Bases (VLDB).

Digital Library

[60]

Zhao, B. Y., Kubiatowicz, J. D., and Joseph, A. D. 2001. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Tech. rep. UCB/CSD-01-1141, University of California at Berkeley, Computer Science Department.

Digital Library

Cited By

Abedjan Z(2022)Data ProfilingEncyclopedia of Big Data Technologies10.1007/978-3-319-63962-8_8-2(1-6)Online publication date: 17-Feb-2022
https://doi.org/10.1007/978-3-319-63962-8_8-2
Abedjan Z(2019)Data ProfilingEncyclopedia of Big Data Technologies10.1007/978-3-319-77525-8_8(563-568)Online publication date: 20-Feb-2019
https://doi.org/10.1007/978-3-319-77525-8_8
Kaur MRiaz SMushtaq A(2019)Cyber-Physical Cloud Computing Systems and Internet of EverythingPrinciples of Internet of Things (IoT) Ecosystem: Insight Paradigm10.1007/978-3-030-33596-0_8(201-227)Online publication date: 14-Nov-2019
https://doi.org/10.1007/978-3-030-33596-0_8
Show More Cited By

Index Terms

Distributed hash sketches: Scalable, efficient, and accurate cardinality estimation for distributed multisets
1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability
      1. Distributed retrieval
      2. Peer-to-peer retrieval
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage
2. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles
        Organizing principles for web applications

Recommendations

Incrementally improving lookup latency in distributed hash table systems
SIGMETRICS '03: Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems

Distributed hash table (DHT) systems are an important class of peer-to-peer routing infrastructures. They enable scalable wide-area storage and retrieval of information, and will support the rapid development of a wide variety of Internet-scale ...
Incrementally improving lookup latency in distributed hash table systems

Distributed hash table (DHT) systems are an important class of peer-to-peer routing infrastructures. They enable scalable wide-area storage and retrieval of information, and will support the rapid development of a wide variety of Internet-scale ...
Stealth distributed hash table: a robust and flexible super-peered DHT
CoNEXT '06: Proceedings of the 2006 ACM CoNEXT conference

Most Distributed Hash Tables (DHTs) simply consider interconnecting homogeneous nodes on the same overlay. However, realistically nodes on a network are heterogeneous in terms of their capabilities. Because of this, traditional DHTs have been shown to ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Computer Systems

ACM Transactions on Computer Systems Volume 27, Issue 1

February 2009

100 pages

ISSN:0734-2071

EISSN:1557-7333

DOI:10.1145/1482619

Issue’s Table of Contents

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 February 2009

Accepted: 01 December 2008

Received: 01 February 2008

Published in TOCS Volume 27, Issue 1

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Sixth Framework Programme

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

19
Total Citations
View Citations
1,065
Total Downloads

Downloads (Last 12 months)15
Downloads (Last 6 weeks)5

Reflects downloads up to 06 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Abedjan Z(2022)Data ProfilingEncyclopedia of Big Data Technologies10.1007/978-3-319-63962-8_8-2(1-6)Online publication date: 17-Feb-2022
https://doi.org/10.1007/978-3-319-63962-8_8-2
Abedjan Z(2019)Data ProfilingEncyclopedia of Big Data Technologies10.1007/978-3-319-77525-8_8(563-568)Online publication date: 20-Feb-2019
https://doi.org/10.1007/978-3-319-77525-8_8
Kaur MRiaz SMushtaq A(2019)Cyber-Physical Cloud Computing Systems and Internet of EverythingPrinciples of Internet of Things (IoT) Ecosystem: Insight Paradigm10.1007/978-3-030-33596-0_8(201-227)Online publication date: 14-Nov-2019
https://doi.org/10.1007/978-3-030-33596-0_8
Abedjan ZGolab LNaumann FPapenbrock T(2018)Data ProfilingSynthesis Lectures on Data Management10.2200/S00878ED1V01Y201810DTM05210:4(1-154)Online publication date: 7-Nov-2018
https://doi.org/10.2200/S00878ED1V01Y201810DTM052
Rizwan PBabu MBalamurugan BSuresh K(2018)Real-time big data computing for Internet of Things and cyber physical system aided medical devices for better healthcare2018 Majan International Conference (MIC)10.1109/MINTC.2018.8363160(1-8)Online publication date: Mar-2018
https://doi.org/10.1109/MINTC.2018.8363160
Abedjan Z(2018)Data ProfilingEncyclopedia of Big Data Technologies10.1007/978-3-319-63962-8_8-1(1-6)Online publication date: 1-Feb-2018
https://doi.org/10.1007/978-3-319-63962-8_8-1
Chang SLiu JLiu FYang J(2016)A Parallel Space Saving Algorithm and Performance Test2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC)10.1109/IHMSC.2016.112(198-202)Online publication date: Aug-2016
https://doi.org/10.1109/IHMSC.2016.112
Jesus PBaquero CAlmeida P(2015)A Survey of Distributed Data Aggregation AlgorithmsIEEE Communications Surveys & Tutorials10.1109/COMST.2014.235439817:1(381-404)Online publication date: Sep-2016
https://doi.org/10.1109/COMST.2014.2354398
Abedjan ZGolab LNaumann F(2015)Profiling relational dataThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-015-0389-y24:4(557-581)Online publication date: 1-Aug-2015
https://dl.acm.org/doi/10.1007/s00778-015-0389-y
Malazi HZamanifar KPruteanu ADulman S(2014)Gossip-based density estimation in dynamic heterogeneous wireless sensor networksInternational Journal of Autonomous and Adaptive Communications Systems10.1504/IJAACS.2014.0580197:1/2(151-168)Online publication date: 1-Nov-2014
https://dl.acm.org/doi/10.1504/IJAACS.2014.058019
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents