DOI: 10.1145/3442381.3449875
research-article

Effective and Scalable Clustering on Massive Attributed Graphs

Published: 03 June 2021

Abstract

Given a graph G where each node is associated with a set of attributes, and a parameter k specifying the number of output clusters, k-attributed graph clustering (k-AGC) groups nodes in G into k disjoint clusters, such that nodes within the same cluster share similar topological and attribute characteristics, while those in different clusters are dissimilar. This problem is challenging on massive graphs, e.g., with millions of nodes and billions of attribute values. For such graphs, existing solutions either incur prohibitively high costs, or produce clustering results with compromised quality.
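The topological half of this objective is commonly quantified with conductance: the fraction of a cluster's edge endpoints that leave the cluster. The sketch below computes the classic single-hop, attribute-free conductance on an undirected, unweighted graph. It is only an illustration of the general idea and is not the measure used in this paper; the names `conductance` and `two_triangles` are hypothetical.

```python
def conductance(adj, cluster):
    """Classic conductance of `cluster` in an undirected, unweighted graph.

    adj     -- dict mapping each node to the set of its neighbours
    cluster -- iterable of nodes forming the candidate cluster
    Returns cut(S, V \\ S) / min(vol(S), vol(V \\ S)).
    """
    cluster = set(cluster)
    # Edges with exactly one endpoint inside the cluster.
    cut = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    # Volume = sum of degrees on each side of the partition.
    vol_in = sum(len(adj[u]) for u in cluster)
    vol_out = sum(len(adj[u]) for u in adj if u not in cluster)
    denom = min(vol_in, vol_out)
    if denom == 0:
        raise ValueError("conductance is undefined when one side has no edges")
    return cut / denom


# Toy graph (an assumption for illustration): two triangles joined by edge 2-3.
two_triangles = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}

print(conductance(two_triangles, {0, 1, 2}))  # one triangle: 1 cut edge, volume 7
print(conductance(two_triangles, {0, 1}))     # splits a triangle: 2 cut edges, volume 4
```

Lower values indicate a more coherent cluster: keeping a triangle intact gives 1/7, while splitting it gives 1/2.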
In this paper, we propose an efficient approach to k-AGC that yields high-quality clusters at a cost linear in the size of the input graph G. Its main contributions are twofold: (i) a novel formulation of the k-AGC problem based on an attributed multi-hop conductance quality measure custom-made for this problem setting, which effectively captures cluster coherence in terms of both topological proximities and attribute similarities, and (ii) a linear-time optimization solver that obtains high-quality clusters iteratively, based on efficient matrix operations such as orthogonal iterations, an alternative optimization approach, as well as an initialization technique that significantly speeds up convergence in practice.
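For readers unfamiliar with orthogonal iteration, the following is a minimal textbook sketch: it repeatedly multiplies a block of k vectors by a symmetric matrix and re-orthonormalizes them, so that the block converges to the k dominant eigenvectors. It assumes a small dense matrix with distinct leading eigenvalues and uses plain Python lists; the function names are hypothetical, and this is not the solver described in the paper.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def orthonormalize(m):
    """Modified Gram-Schmidt on the columns of m (assumed linearly independent)."""
    n, k = len(m), len(m[0])
    basis = []
    for j in range(k):
        v = [m[i][j] for i in range(n)]
        for q in basis:
            dot = sum(x * y for x, y in zip(q, v))
            v = [x - dot * y for x, y in zip(v, q)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    # Return the basis vectors as columns again.
    return [[basis[j][i] for j in range(k)] for i in range(n)]


def orthogonal_iteration(a, k, iters=200):
    """Approximate the k dominant eigenpairs of a symmetric matrix a."""
    n = len(a)
    # Start from the first k columns of the identity (an assumption; any
    # block with a generic component along the dominant subspace works).
    q = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(n)]
    for _ in range(iters):
        q = orthonormalize(matmul(a, q))
    # Rayleigh quotients of the converged columns estimate the eigenvalues.
    aq = matmul(a, q)
    eigvals = [sum(q[i][j] * aq[i][j] for i in range(n)) for j in range(k)]
    return eigvals, q


# Small symmetric example whose eigenvalues are 3 + sqrt(3), 3, and 3 - sqrt(3).
example = [[4.0, 1.0, 0.0],
           [1.0, 3.0, 1.0],
           [0.0, 1.0, 2.0]]
values, vectors = orthogonal_iteration(example, 2)
print(values)
```

Each iteration costs one matrix product plus one orthonormalization, which is why the technique pairs well with sparse matrices in large-scale settings.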
Extensive experiments, comparing against 11 competitors on 6 real datasets, demonstrate that the proposed approach consistently outperforms all competitors in terms of result quality measured against ground-truth labels, while being up to orders of magnitude faster. In particular, on the Microsoft Academic Knowledge Graph dataset with 265.2 million edges and 1.1 billion attribute values, it outputs high-quality results for 5-AGC within 1.68 hours using a single CPU core, while none of the 11 competitors finishes within 3 days.

    Published In

    WWW '21: Proceedings of the Web Conference 2021
    April 2021
    4054 pages
    ISBN:9781450383127
    DOI:10.1145/3442381

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. attributed graph
    2. graph clustering
    3. random walk

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    WWW '21
    WWW '21: The Web Conference 2021
    April 19 - 23, 2021
    Ljubljana, Slovenia

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
