Faster Motif Counting via Succinct Color Coding and Adaptive Sampling

Published: 19 May 2021

Abstract

We address the problem of computing the distribution of induced connected subgraphs, also known as graphlets or motifs, in large graphs. The current state-of-the-art algorithms estimate the motif counts via uniform sampling, leveraging the color coding technique of Alon, Yuster, and Zwick. In this work, we extend the applicability of this approach by introducing a set of algorithmic optimizations and techniques that reduce the running time and space usage of color coding and improve the accuracy of the counts. To this end, we first show how to optimize color coding to efficiently build a compact table of a representative subsample of all graphlets in the input graph. For 8-node motifs, we can build such a table in one hour for a graph with 65M nodes and 1.8B edges, which is times larger than the state of the art. We then introduce a novel adaptive sampling scheme that breaks the “additive error barrier” of uniform sampling, guaranteeing multiplicative approximations instead of just additive ones. This allows us to count not only the most frequent motifs, but also extremely rare ones. For instance, on one graph we accurately count nearly 10,000 distinct 8-node motifs whose relative frequency is so small that uniform sampling would take centuries to find them. Our results show that color coding is still the most promising approach to scalable motif counting.
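As a rough illustration of the color coding idea the abstract builds on, the following minimal sketch estimates the number of k-node simple paths in a small undirected graph. It is not the paper's implementation: it handles only paths (not general treelets or induced motifs), it uses a plain dictionary instead of a succinct table, and the function names `count_colorful_paths` and `estimate_paths` are hypothetical. Each node receives one of k random colors, a dynamic program counts the paths whose nodes all have distinct colors, and the count is rescaled by the probability k!/k^k that a fixed k-node path is colorful.

```python
import random
from math import factorial


def count_colorful_paths(adj, colors, k):
    """Count k-node paths whose nodes all have distinct colors.

    adj: dict mapping each node to an iterable of its neighbors (undirected).
    colors: dict mapping each node to a color in range(k).
    Each undirected path is counted once.
    """
    # table[v][mask] = number of colorful paths ending at v whose nodes
    # use exactly the colors in the bitmask `mask`.
    table = {v: {1 << colors[v]: 1} for v in adj}
    for _ in range(k - 1):
        nxt = {v: {} for v in adj}
        for v in adj:
            bit = 1 << colors[v]
            for u in adj[v]:
                for mask, cnt in table[u].items():
                    if not mask & bit:  # v's color is still unused
                        new = mask | bit
                        nxt[v][new] = nxt[v].get(new, 0) + cnt
        table = nxt
    total = sum(cnt for v in adj for cnt in table[v].values())
    # For k >= 2 every undirected path is found once from each endpoint.
    return total // 2 if k > 1 else total


def estimate_paths(adj, k, trials, seed=0):
    """Average the rescaled colorful-path count over random colorings."""
    rng = random.Random(seed)
    p_colorful = factorial(k) / k**k  # chance a fixed k-path is colorful
    total = 0
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, colors, k)
    return total / (trials * p_colorful)


if __name__ == "__main__":
    triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
    # With all three colors distinct, every 3-node path is colorful.
    print(count_colorful_paths(triangle, {0: 0, 1: 1, 2: 2}, 3))
    print(estimate_paths(triangle, 3, trials=200))
```

Because distinct colors imply distinct nodes, the dynamic program never has to check explicitly that a path is simple. That is what keeps the table indexed by (node, color set) rather than by (node, node set).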

References

[1]
Ahmed F. Abdelzaher, Ahmad F. Al-Musawi, Preetam Ghosh, Michael L. Mayo, and Edward J. Perkins. 2015. Transcriptional network growing models using motif-based preferential attachment. Front Bioeng Biotechnol 3 (2015), 157.
[2]
Matteo Agostini, Marco Bressan, and Shahrzad Haddadan. 2019. Mixing time bounds for graphlet random walks. Inform. Process. Lett. 152 (2019), 105851.
[3]
Nesreen K. Ahmed, Jennifer Neville, Ryan A. Rossi, and Nick Duffield. 2015. Efficient graphlet counting for large networks. In Proceedings of the 2015 IEEE International Conference on Data Mining. 1–10.
[4]
N. Alon, P. Dao, I. Hajirasouliha, F. Hormozdiari, and S. C. Sahinalp. 2008. Biomolecular network motif counting and discovery by color coding. Bioinformatics 24, 13 (Jul. 2008), i241–249.
[5]
Noga Alon, Ori Gurel-Gurevich, and Eyal Lubetzky. 2010. Choice-memory tradeoff in allocations. Ann. Appl. Probab. 20, 4 (2010), 1470–1511.
[6]
Noga Alon, Raphael Yuster, and Uri Zwick. 1995. Color-coding. J. ACM 42, 4 (1995), 844–856.
[7]
Toufik Baroudi, Rachid Seghir, and Vincent Loechner. 2017. Optimization of triangular and banded matrix operations using 2d-packed layouts. ACM TACO 14, 4 (2017), 55:1–55:19.
[8]
Mansurul A. Bhuiyan, Mahmudur Rahman, Mahmuda Rahman, and Mohammad Al Hasan. 2012. GUISE: Uniform sampling of graphlets for large graph analysis. In Proceedings of the 2012 IEEE International Conference on Data Mining. 91–100.
[9]
Marco Bressan, Flavio Chierichetti, Ravi Kumar, Stefano Leucci, and Alessandro Panconesi. 2017. Counting graphlets: Space vs time. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining. 557–566.
[10]
Marco Bressan, Flavio Chierichetti, Ravi Kumar, Stefano Leucci, and Alessandro Panconesi. 2018. Motif counting beyond five nodes. ACM TKDD 12, 4, Article 48 (2018), 25 pages.
[11]
Marco Bressan, Stefano Leucci, and Alessandro Panconesi. 2019. Motivo: Fast motif counting via succinct color coding and adaptive sampling. Proc. VLDB Endow. 12, 11 (Jul. 2019), 1651–1663.
[12]
Venkatesan T. Chakaravarthy, Michael Kapralov, Prakash Murali, Fabrizio Petrini, Xinyu Que, Yogish Sabharwal, and Baruch Schieber. 2016. Subgraph counting: Color coding beyond trees. In Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium. 2–11.
[13]
Jianer Chen, Xiuzhen Huang, Iyad A. Kanj, and Ge Xia. 2006. Strong computational lower bounds via parameterized complexity. J. Comput. System Sci. 72, 8 (2006), 1346–1367.
[14]
Xuhao Chen, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. 2020. Pangolin: An efficient and flexible graph mining system on CPU and GPU. Proc. VLDB Endow. 13, 8 (Apr. 2020), 1190–1205.
[15]
Xiaowei Chen, Yongkun Li, Pinghui Wang, and John C. S. Lui. 2016. A general framework for estimating graphlet statistics via random walk. Proc. VLDB Endow. 10, 3 (2016), 253–264.
[16]
Vinicius Dias, Carlos H. C. Teixeira, Dorgival Guedes, Wagner Meira, and Srinivasan Parthasarathy. 2019. Fractal: A general-purpose graph pattern mining system. In Proceedings of the 2019 International Conference on Management of Data. 1357–1374.
[17]
Devdatt Dubhashi and Alessandro Panconesi. 2009. Concentration of Measure for the Analysis of Randomized Algorithms (1st ed.). Cambridge University Press, New York, NY.
[18]
P. Elias. 1975. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21, 2 (1975), 194–203.
[19]
Irene Finocchi, Marco Finocchi, and Emanuele G. Fusco. 2015. Clique counting in MapReduce: Algorithms and experiments. ACM J. Exp. Algorithmics 20, Article 1.7 (Oct. 2015), 20 pages.
[20]
Guyue Han and Harish Sethu. 2016. Waddling random walk: Fast and accurate mining of motif statistics in large graphs. In Proceedings of the 2016 IEEE International Conference on Data Mining. 181–190.
[21]
Falk Hüffner, Sebastian Wernicke, and Thomas Zichner. 2008. Algorithm engineering for color-coding with applications to signaling pathway detection. Algorithmica 52, 2 (2008), 114–132.
[22]
Shweta Jain and C. Seshadhri. 2017. A fast and provable method for estimating clique counts using Turán’s theorem. In Proceedings of the 2017 World Wide Web Conference. 441–449.
[23]
Shweta Jain and C. Seshadhri. 2020. The power of pivoting for exact clique counting. In Proceedings of the 2020 ACM International Conference on Web Search and Data Mining. 268–276.
[24]
H. Jeong, S. P. Mason, A.-L. Barabási, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411, 6833 (May 2001), 41–42.
[25]
Madhav Jha, C. Seshadhri, and Ali Pinar. 2015. Path sampling: A fast and provable method for estimating 4-vertex subgraph counts. In Proceedings of the 2015 World Wide Web Conference. 495–505.
[26]
Daniel Mawhirter and Bo Wu. 2019. AutoMine: Harmonizing high-level abstraction and high performance for graph mining. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 509–523.
[27]
Brendan D. McKay and Adolfo Piperno. 2014. Practical graph isomorphism, II. J. Symb. Comput. 60, 0 (2014), 94–112.
[28]
Richard Otter. 1948. The number of trees. Ann. Math. 49, 3 (1948), 583–599.
[29]
Kirill Paramonov, Dmitry Shemetov, and James Sharpnack. 2019. Estimating graphlet statistics via lifting. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 587–595.
[30]
Ali Pinar, C. Seshadhri, and Vaidyanathan Vishal. 2017. ESCAPE: Efficiently counting all 5-vertex subgraphs. In Proceedings of the 2017 World Wide Web Conference. 1431–1440.
[31]
S. Ranu and A. K. Singh. 2009. GraphSig: A scalable approach to mining significant subgraphs in large graph databases. In Proceedings of the 2009 IEEE 25th International Conference on Data Engineering. 844–855.
[32]
G. M. Slota and K. Madduri. 2013. Fast approximate subgraph counting and enumeration. In Proceedings of the 2013 42nd International Conference on Parallel Processing. 210–219.
[33]
Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, and Ashraf Aboulnaga. 2015. Arabesque: A system for distributed graph mining. In Proceedings of the 25th Symposium on Operating Systems Principles. 425–440.
[34]
Ngoc Hieu Tran, Kwok Pui Choi, and Louxin Zhang. 2013. Counting motifs in the human interactome. Nat Commun 4, 2241 (2013).
[35]
Michael D. Vose. 1991. A linear algorithm for generating random numbers with a given distribution. IEEE Trans. Software Eng. 17, 9 (1991), 972–975.
[36]
Pinghui Wang, John C. S. Lui, Bruno Ribeiro, Don Towsley, Junzhou Zhao, and Xiaohong Guan. 2014. Efficiently estimating motif statistics of large networks. ACM TKDD 9, 2 (2014), 27 pages.
[37]
Pinghui Wang, Jing Tao, Junzhou Zhao, and Xiaohong Guan. 2018. Moss: A scalable tool for efficiently sampling and counting 4- and 5-node graphlets. IEEE Transactions on Knowledge and Data Engineering 30, 1 (2018), 73–86.
[38]
Pinghui Wang, Xiangliang Zhang, Zhenguo Li, Jiefeng Cheng, John C. S. Lui, Don Towsley, Junzhou Zhao, Jing Tao, and Xiaohong Guan. 2019. A fast sampling method of exploring graphlet degrees of large directed and undirected graphs. Knowledge and Information Systems 61, 1 (2019), 301–326.
[39]
Ömer Nebil Yaveroğlu, Noël Malod-Dognin, Darren Davis, Zoran Levnajic, Vuk Janjic, Rasa Karapandza, Aleksandar Stojmirovic, and Nataša Pržulj. 2014. Revealing the hidden language of complex networks. Sci Rep 4, Article 4547 (2014).
[40]
Esti Yeger-Lotem, Shmuel Sattath, Nadav Kashtan, Shalev Itzkovitz, Ron Milo, Ron Y. Pinter, Uri Alon, and Hanah Margalit. 2004. Network motifs in integrated cellular networks of transcription–regulation and protein–protein interaction. Proceedings of the National Academy of Sciences 101, 16 (2004), 5934–5939.
[41]
Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich. 2017. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 555–564.
[42]
Hao Zhang, Jeffrey Xu Yu, Yikai Zhang, Kangfei Zhao, and Hong Cheng. 2020. Distributed subgraph counting: A general approach. Proc. VLDB Endow. 13, 12 (Aug. 2020), 2493–2507.
[43]
Zhao Zhao, Maleq Khan, V. S. Anil Kumar, and Madhav V. Marathe. 2010. Subgraph enumeration in large social contact networks using parallel color coding and streaming. In Proceedings of the 2010 39th International Conference on Parallel Processing. 594–603.
[44]
Z. Zhao, G. Wang, A. R. Butt, M. Khan, V. S. A. Kumar, and M. V. Marathe. 2012. SAHAD: Subgraph analysis in massive networks using Hadoop. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium. 390–401.

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 15, Issue 6
June 2021
474 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3465438
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 May 2021
Accepted: 01 January 2021
Received: 01 September 2020
Published in TKDD Volume 15, Issue 6

Author Tags

  1. Graphlets
  2. motifs
  3. color coding
  4. subgraph counting
  5. graph mining

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Google Focused Award “Algorithms and Learning for AI” (ALL4AI)
  • Bertinoro International Center for Informatics (BICI)
  • ERC Starting
  • Dipartimenti di Eccellenza 2018-2022
  • Department of Computer Science at Sapienza University of Rome
