Mining Graphlet Counts in Online Social Networks

Authors:

John C. S. Lui

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 12, Issue 4

Article No.: 41, Pages 1 - 38

https://doi.org/10.1145/3182392

Published: 16 April 2018

Abstract

Counting subgraphs is a fundamental analysis task for online social networks (OSNs). Given the sheer size and restricted access of OSNs, efficient computation of subgraph counts is highly challenging. Although a number of algorithms have been proposed to estimate the relative counts of subgraphs in OSNs with restricted access, only a few works address the more general problem of counting subgraph frequencies. In this article, we propose an efficient random walk-based framework to estimate subgraph counts. Our framework generates samples by leveraging consecutive steps of the random walk as well as by observing neighbors of visited nodes. Using the importance sampling technique, we derive unbiased estimators of the subgraph counts. To make better use of the degree information of visited nodes, we also design improved estimators, which increase the accuracy of the estimation at no additional cost. We conduct an extensive experimental evaluation on real-world OSNs to confirm our theoretical claims. The experimental results show that our estimators are unbiased, accurate, efficient, and better than the state-of-the-art algorithms. For the Weibo graph with more than 58 million nodes, our method produces an estimate of the triangle count with an error of less than 5% using only 20,000 sampled nodes. A detailed comparison with the state-of-the-art methods demonstrates that our algorithm is 2 to 10 times more accurate.

Published In

ACM Transactions on Knowledge Discovery from Data Volume 12, Issue 4

August 2018

354 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3208362

Editors:
Charu Aggarwal, IBM T. J. Watson Research, USA
Xindong Wu, University of Louisiana at Lafayette, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 April 2018

Accepted: 01 January 2018

Revised: 01 January 2018

Received: 01 April 2017

Published in TKDD Volume 12, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

GRF
