Enumerating Trillion Subgraphs On Distributed Systems

Published: 01 October 2018

Abstract

How can we find patterns in an enormous graph with billions of vertices and edges? Subgraph enumeration, the task of finding all occurrences of a pattern in a graph, is an important problem in graph data analysis with many applications, including analyzing the evolution of social networks, measuring the significance of motifs in biological networks, and observing the dynamics of the Internet. In particular, triangle enumeration, the special case of subgraph enumeration in which the pattern is a triangle, has applications such as identifying suspicious users in social networks, detecting web spam, and finding communities. However, recent networks are so large that most previous algorithms fail to process them. Several MapReduce algorithms have recently been proposed to handle such large networks, but they shuffle massive amounts of data, which results in very long processing times.
In this article, we propose scalable methods for enumerating trillions of subgraphs on distributed systems. We first propose PTE (Pre-partitioned Triangle Enumeration), a new distributed algorithm for enumerating triangles in enormous graphs that resolves the structural inefficiency of previous MapReduce algorithms. PTE enumerates trillions of triangles in a billion-scale graph by reducing three factors: the amount of shuffled data, the total work, and the network read. We also propose PSE (Pre-partitioned Subgraph Enumeration), a generalization of PTE that enumerates subgraphs matching an arbitrary query graph. Experimental results show that PTE is 79 times faster than recent distributed algorithms on real-world graphs and succeeds in enumerating more than 3 trillion triangles on the ClueWeb12 graph, which has 6.3 billion vertices and 72 billion edges. Furthermore, PSE successfully enumerates 265 trillion 4-vertex clique subgraphs from a subdomain hyperlink network, running 47 times faster than the state-of-the-art distributed subgraph enumeration algorithm.
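
To make the task concrete, the sketch below enumerates triangles on a single machine and then repeats the enumeration over independent subproblems obtained by partitioning the vertices. It is an illustration only and is not the PTE algorithm, whose details are not given in this abstract. The function names, the integer vertex IDs, and the coloring rule (vertex ID modulo `num_colors`, standing in for a hash function) are all assumptions made for the example.

```python
from itertools import combinations


def enumerate_triangles(edges):
    """Return every triangle (u, v, w) with u < v < w in an undirected simple graph."""
    # Orient each edge from its smaller to its larger endpoint so that
    # every triangle is discovered exactly once.
    out = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        a, b = (u, v) if u < v else (v, u)
        out.setdefault(a, set()).add(b)
    triangles = []
    for u, nbrs in out.items():
        for v in nbrs:
            # A common out-neighbor w of u and v closes the triangle u < v < w.
            for w in nbrs & out.get(v, set()):
                triangles.append((u, v, w))
    return sorted(triangles)


def enumerate_triangles_partitioned(edges, num_colors):
    """Enumerate the same triangles through independent, color-based subproblems.

    Hypothetical scheme for illustration: each vertex gets the color
    (id mod num_colors); one subproblem exists per set of 1, 2, or 3 colors.
    """
    def color_of(v):
        return v % num_colors

    found = []
    for size in (1, 2, 3):
        for colors in combinations(range(num_colors), size):
            color_set = set(colors)
            # A subproblem only needs edges whose endpoints both use its colors.
            sub_edges = [(u, v) for u, v in edges
                         if color_of(u) in color_set and color_of(v) in color_set]
            for tri in enumerate_triangles(sub_edges):
                # Emit a triangle only in the subproblem matching its exact
                # color set, so no triangle is reported twice.
                if {color_of(x) for x in tri} == color_set:
                    found.append(tri)
    return sorted(found)


# A 4-clique on vertices 0..3 plus one pendant edge (3, 4).
example_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
all_triangles = enumerate_triangles(example_edges)
partitioned_triangles = enumerate_triangles_partitioned(example_edges, num_colors=3)
```

Each subproblem in the second function reads only its own edge subset, so the subproblems could in principle run on different machines. How to keep the data shuffled to those machines small is the kind of cost the abstract says PTE reduces.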

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 12, Issue 6
December 2018
327 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3271478
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 October 2018
Accepted: 01 July 2018
Revised: 01 February 2018
Received: 01 June 2017
Published in TKDD Volume 12, Issue 6

Author Tags

  1. Triangle enumeration
  2. big data
  3. distributed algorithm
  4. graph algorithm
  5. network analysis
  6. scalable algorithm
  7. subgraph enumeration

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • project SID2017 of the University of Padova
  • Institute for Information & communications Technology Promotion (IITP)
  • the Korea government (MSIT)
  • Chongqing Liangjiang KAIST International Program
  • Chongqing Research Program of Basic Research and Frontier Technology
