research-article

PTE: Enumerating Trillion Triangles On Distributed Systems

Authors:

Sung-Hyon Myaeng,

U. Kang

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 1115 - 1124

https://doi.org/10.1145/2939672.2939757

Published: 13 August 2016

Abstract

How can we enumerate triangles from an enormous graph with billions of vertices and edges? Triangle enumeration is an important task for graph data analysis with many applications, including identifying suspicious users in social networks, detecting web spam, and finding communities. However, recent networks are so large that most previous algorithms fail to process them. Several MapReduce algorithms have recently been proposed to address such large networks; however, they suffer from massive shuffled data, resulting in very long processing times. In this paper, we propose PTE (Pre-partitioned Triangle Enumeration), a new distributed algorithm for enumerating triangles in enormous graphs by resolving the structural inefficiency of the previous MapReduce algorithms. PTE enumerates trillions of triangles in a billion-scale graph by decreasing three factors: the amount of shuffled data, total work, and network read.

Experimental results show that PTE is up to 47 times faster than recent distributed algorithms on real-world graphs, and succeeds in enumerating more than 3 trillion triangles on the ClueWeb12 graph with 6.3 billion vertices and 72 billion edges, which no previous triangle computation algorithm has been able to process.
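
To make the task concrete, the following is a minimal single-machine sketch of triangle enumeration. It is not the PTE algorithm and does not reflect its partitioning or distributed execution; it uses a common degree-ordering technique, and the function name `enumerate_triangles` and the edge-list input format are assumptions made for illustration only.

```python
from collections import defaultdict


def enumerate_triangles(edges):
    """Yield each triangle of an undirected simple graph exactly once.

    Illustrative single-machine baseline (hypothetical helper, not PTE).
    `edges` is an iterable of (u, v) pairs; vertex ids must be hashable
    and mutually comparable so that degree ties can be broken.
    """
    # Build undirected adjacency sets, ignoring self-loops and duplicates.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Total order on vertices: by degree, ties broken by vertex id.
    def rank(x):
        return (len(adj[x]), x)

    # Orient every edge from the lower-ranked to the higher-ranked endpoint.
    out = {u: {v for v in nbrs if rank(u) < rank(v)} for u, nbrs in adj.items()}

    for u in out:
        for v in out[u]:
            # A common out-neighbor w closes the triangle (u, v, w),
            # with rank(u) < rank(v) < rank(w), so it is reported once.
            for w in out[u] & out[v]:
                yield (u, v, w)
```

Calling `list(enumerate_triangles([(1, 2), (2, 3), (1, 3)]))` returns a single triple containing vertices 1, 2, and 3. Orienting edges by degree keeps the out-neighbor sets of high-degree vertices small, which is one standard way to limit total work on skewed graphs; the shuffled-data and network-read costs that the abstract mentions only arise in a distributed setting and are not modeled here.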

Index Terms

1. Information systems → Information systems applications → Data mining
2. Theory of computation → Design and analysis of algorithms

Published In

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2016

2176 pages

ISBN:9781450342322

DOI:10.1145/2939672

General Chairs: Balaji Krishnapuram (IBM), Mohak Shah (Bosch)

Program Chairs: Alex Smola (Amazon), Charu Aggarwal (IBM), Dou Shen (Baidu), Rajeev Rastogi (Amazon)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Ministry of Science, ICT and Future Planning

Conference

KDD '16

KDD '16: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 13 - 17, 2016

San Francisco, California, USA

Acceptance Rates

KDD '16 Paper Acceptance Rate: 66 of 1,115 submissions (6%)

Overall Acceptance Rate: 1,133 of 8,635 submissions (13%)

Bibliometrics

Total Citations: 42
Total Downloads: 277
Downloads (Last 12 months): 19
Downloads (Last 6 weeks): 1

Reflects downloads up to 27 Dec 2024

Cited By

Park S, Oh S, Kim M, Lee I, Chabbi M, Steuwer M (2024). INFINEL: An efficient GPU-based processing method for unpredictable large output graph queries. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 147-159. doi:10.1145/3627535.3638490. Online publication date: 2-Mar-2024.
https://dl.acm.org/doi/10.1145/3627535.3638490
Farouzi A, Zhou X, Bellatreche L, Malki M, Ordonez C (2024). Balanced parallel triangle enumeration with an adaptive algorithm. Distributed and Parallel Databases 42:1, 103-141. doi:10.1007/s10619-023-07437-x. Online publication date: 1-Mar-2024.
https://dl.acm.org/doi/10.1007/s10619-023-07437-x
Lakhotia K, Kannan R, Prasanna V (2023). Parallel Peeling of Bipartite Networks for Hierarchical Dense Subgraph Discovery. ACM Transactions on Parallel Computing 10:2, 1-35. doi:10.1145/3583084. Online publication date: 20-Jun-2023.
https://dl.acm.org/doi/10.1145/3583084
Gou X, Zou L (2023). Sliding window-based approximate triangle counting with bounded memory usage. The VLDB Journal 32:5, 1087-1110. doi:10.1007/s00778-023-00783-3. Online publication date: 9-Mar-2023.
https://doi.org/10.1007/s00778-023-00783-3
Taniguchi R, Amagata D, Hara T (2022). Efficient Retrieval of Top-k Weighted Triangles on Static and Dynamic Spatial Data. IEEE Access 10, 55298-55307. doi:10.1109/ACCESS.2022.3177620. Online publication date: 2022.
https://doi.org/10.1109/ACCESS.2022.3177620
Levinas I, Scherz R, Louzoun Y (2022). BFS-based distributed algorithm for parallel local-directed subgraph enumeration. Journal of Complex Networks 10:6. doi:10.1093/comnet/cnac051. Online publication date: 1-Dec-2022.
https://doi.org/10.1093/comnet/cnac051
Santoso Y, Liu X, Srinivasan V, Thomo A (2022). Four node graphlet and triad enumeration on distributed platforms. Distributed and Parallel Databases 40:2-3, 335-372. doi:10.1007/s10619-022-07416-8. Online publication date: 30-Jul-2022.
https://doi.org/10.1007/s10619-022-07416-8
Taniguchi R, Amagata D, Hara T (2022). Efficient Retrieval of Top-k Weighted Spatial Triangles. Database Systems for Advanced Applications, 224-231. doi:10.1007/978-3-031-00123-9_17. Online publication date: 8-Apr-2022.
https://doi.org/10.1007/978-3-031-00123-9_17
Liu X, Santoso Y, Srinivasan V, Thomo A (2022). Practical Survey on MapReduce Subgraph Enumeration Algorithms. Advances in Internet, Data & Web Technologies, 430-444. doi:10.1007/978-3-030-95903-6_45. Online publication date: 2-Feb-2022.
https://doi.org/10.1007/978-3-030-95903-6_45
Lakhotia K, Kannan R, Prasanna V, De Rose C (2021). Receipt. Proceedings of the VLDB Endowment 14:3, 404-417. doi:10.14778/3430915.3430929. Online publication date: 9-Dec-2021.
https://dl.acm.org/doi/10.14778/3430915.3430929
