DOI: 10.1145/2949689.2949715
research-article

Bermuda: An Efficient MapReduce Triangle Listing Algorithm for Web-Scale Graphs

Published: 18 July 2016

Abstract

Triangle listing plays an important role in graph analysis and has numerous graph mining applications. With the rapid growth of graph data, distributed methods for listing triangles over massive graphs are urgently needed, and the triangle listing problem has therefore been studied on several distributed infrastructures, including MapReduce. However, existing algorithms generate and shuffle huge amounts of intermediate data, a large percentage of which is redundant. Inspired by this observation, we present the "Bermuda" method, an efficient MapReduce-based triangle listing technique for massive graphs.
Unlike existing approaches, Bermuda reduces the size of the intermediate data by eliminating redundancy and sharing messages whenever possible. As a result, Bermuda achieves orders-of-magnitude speedup and can process larger graphs that other techniques fail to process under the same resources. Bermuda exploits the locality of processing, i.e., knowledge of which reduce instance will process each graph vertex, to avoid generating redundant messages from mappers to reducers. Bermuda also proposes novel message-sharing techniques within each reduce instance to increase the usability of the received messages. We present and analyze several reduce-side caching strategies that dynamically learn the expected access patterns of the shared messages and adaptively deploy the appropriate technique for better sharing. Extensive experiments on real-world large-scale graphs show that Bermuda speeds up triangle listing by factors of up to 10x. Moreover, with a relatively small cluster, Bermuda scales to large datasets, e.g., the ClueWeb graph dataset (688GB), on which other techniques fail to finish.
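
For orientation, the intermediate-data problem described above can be illustrated with a minimal single-machine sketch of the generic wedge-checking pattern that is often used for MapReduce triangle listing. This is not the Bermuda algorithm, and it does not implement redundancy elimination, message sharing, or reduce-side caching. The function name `list_triangles`, the degree-based vertex ordering, and the in-memory "map" and "reduce" phases are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def list_triangles(edges):
    """Return (sorted triangles, number of intermediate messages).

    Hypothetical helper: simulates a map phase that emits open wedges
    and a reduce phase that checks whether each wedge is closed.
    """
    # Build adjacency sets for an undirected simple graph.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Assumed total order on vertices: by degree, ties broken by id.
    def rank(v):
        return (len(adj[v]), v)

    # "Map" phase: each vertex emits one message per pair of its
    # higher-ranked neighbors. These messages are the intermediate data
    # that a real MapReduce job would shuffle across the network.
    messages = []
    for v in adj:
        higher = sorted((w for w in adj[v] if rank(w) > rank(v)), key=rank)
        for a, b in combinations(higher, 2):
            messages.append(((a, b), v))

    # "Reduce" phase: a wedge (a, v, b) is a triangle if edge (a, b) exists.
    triangles = [
        tuple(sorted((v, a, b))) for (a, b), v in messages if b in adj[a]
    ]
    return sorted(triangles), len(messages)
```

In this sketch, every emitted wedge is a message that must be shuffled even when it never closes into a triangle (a 4-cycle, for instance, emits one wedge and yields no triangles), which gives a rough sense of why the volume of intermediate data matters at scale.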



Information & Contributors

Information

Published In

SSDBM '16: Proceedings of the 28th International Conference on Scientific and Statistical Database Management
July 2016
290 pages

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Distributed Triangle Listing
  2. Graph Analytics
  3. MapReduce

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SSDBM '16

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%
