Mining Graphlet Counts in Online Social Networks

Published: 16 April 2018

Abstract

Counting subgraphs is a fundamental analysis task for online social networks (OSNs). Given the sheer size of OSNs and the restricted access to them, computing subgraph counts efficiently is highly challenging. Although a number of algorithms have been proposed to estimate the relative counts of subgraphs in OSNs with restricted access, only a few works address the more general problem of counting subgraph frequencies. In this article, we propose an efficient random walk-based framework to estimate subgraph counts. Our framework generates samples by leveraging consecutive steps of the random walk as well as by observing neighbors of visited nodes. Using the importance sampling technique, we derive unbiased estimators of the subgraph counts. To make better use of the degree information of visited nodes, we also design improved estimators, which increase the accuracy of the estimation at no additional cost. We conduct an extensive experimental evaluation on real-world OSNs to confirm our theoretical claims. The experimental results show that our estimators are unbiased, accurate, efficient, and better than the state-of-the-art algorithms. For the Weibo graph, which has more than 58 million nodes, our method produces an estimate of the triangle count with an error of less than 5% using only 20,000 sampled nodes. Detailed comparison with the state-of-the-art methods demonstrates that our algorithm is 2 to 10 times more accurate.
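To make the idea of combining consecutive random-walk steps with importance sampling concrete, the following is a minimal sketch of a textbook-style random-walk triangle estimator. It is not the estimator derived in the article. It assumes the number of edges is known, that the graph is undirected, connected, and non-bipartite, and that adjacency can be queried freely. The function names `estimate_triangles`, `exact_triangles`, and `complete_graph` are hypothetical. Under the stationary distribution, an ordered walk segment (u, v, w) occurs with probability 1 / (2|E| d(v)), and every triangle yields six such segments, so weighting each closed segment by d(v) gives an expectation of 3T / |E|.

```python
import random
from itertools import combinations


def complete_graph(n):
    """Adjacency sets of the complete graph on n nodes (toy input)."""
    return {v: {u for u in range(n) if u != v} for v in range(n)}


def exact_triangles(adj):
    """Brute-force triangle count, used only to check the estimate."""
    return sum(
        1
        for u, v, w in combinations(sorted(adj), 3)
        if v in adj[u] and w in adj[u] and w in adj[v]
    )


def estimate_triangles(adj, num_edges, steps, seed=0):
    """Estimate the triangle count from a simple random walk.

    Assumption: num_edges is known. For each consecutive triple
    (prev, cur, nxt) that closes a triangle, add the importance
    weight d(cur); the average of these weights tends to 3T / |E|.
    """
    rng = random.Random(seed)
    nbrs = {v: sorted(adj[v]) for v in adj}  # fixed order for reproducibility
    prev = rng.choice(sorted(adj))           # no burn-in: fine for a toy graph
    cur = rng.choice(nbrs[prev])
    total = 0.0
    for _ in range(steps):
        nxt = rng.choice(nbrs[cur])
        # The triple closes a triangle if the endpoints differ and are adjacent.
        if nxt != prev and nxt in adj[prev]:
            total += len(adj[cur])
        prev, cur = cur, nxt
    return num_edges * total / (3.0 * steps)


graph = complete_graph(5)
edges = sum(len(ns) for ns in graph.values()) // 2
print(exact_triangles(graph), estimate_triangles(graph, edges, steps=20000))
```

On a real OSN the walk would start from an arbitrary node, so a burn-in period would be needed before samples are collected, and the edge count would itself have to be estimated or obtained separately.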


Published In

ACM Transactions on Knowledge Discovery from Data  Volume 12, Issue 4
August 2018
354 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3208362
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 April 2018
Accepted: 01 January 2018
Revised: 01 January 2018
Received: 01 April 2017
Published in TKDD Volume 12, Issue 4

Author Tags

  1. Graphlet count
  2. Markov chain Monte Carlo
  3. online social networks
  4. random walk

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • GRF
