Effective and Scalable Clustering on Massive Attributed Graphs

Authors:

Xiaokui Xiao

WWW '21: Proceedings of the Web Conference 2021

Pages 3675 - 3687

https://doi.org/10.1145/3442381.3449875

Published: 03 June 2021

Abstract

Given a graph G where each node is associated with a set of attributes, and a parameter k specifying the number of output clusters, k-attributed graph clustering (k-AGC) groups nodes in G into k disjoint clusters, such that nodes within the same cluster share similar topological and attribute characteristics, while those in different clusters are dissimilar. This problem is challenging on massive graphs, e.g., with millions of nodes and billions of attribute values. For such graphs, existing solutions either incur prohibitively high costs, or produce clustering results with compromised quality.
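
To make the notion of cluster quality concrete, the following is a minimal sketch of the classic, topology-only conductance of a node set, computed on a toy graph. It is not the attributed multi-hop conductance that the abstract refers to (that measure is not defined on this page); the function name `conductance` and the toy graph are assumptions made purely for illustration.

```python
def conductance(adj, cluster):
    """Classic (single-hop, topology-only) conductance of `cluster`.

    adj: dict mapping each node to the set of its neighbours (undirected graph).
    cluster: iterable of nodes forming the candidate cluster.
    Returns cut(S, V \\ S) / min(vol(S), vol(V \\ S)); lower is better.
    """
    inside = set(cluster)
    cut = 0      # number of edges leaving the cluster
    vol_in = 0   # sum of degrees of nodes inside the cluster
    vol_out = 0  # sum of degrees of nodes outside the cluster
    for node, neighbours in adj.items():
        if node in inside:
            vol_in += len(neighbours)
            cut += sum(1 for v in neighbours if v not in inside)
        else:
            vol_out += len(neighbours)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
toy_graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
natural_cluster = conductance(toy_graph, {0, 1, 2})  # one triangle
mixed_cluster = conductance(toy_graph, {0, 3, 4})    # nodes from both triangles
```

On this toy graph the triangle {0, 1, 2} has conductance 1/7, while the mixed set {0, 3, 4} has conductance 5/7, matching the intuition that a good cluster has few edges leaving it relative to its volume.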

In this paper, we propose an efficient approach to k-AGC that yields high-quality clusters at a cost linear in the size of the input graph G. Its main contributions are twofold: (i) a novel formulation of the k-AGC problem based on an attributed multi-hop conductance quality measure custom-made for this problem setting, which effectively captures cluster coherence in terms of both topological proximities and attribute similarities, and (ii) a linear-time optimization solver that obtains high-quality clusters iteratively, based on efficient matrix operations such as orthogonal iterations, an alternative optimization approach, as well as an initialization technique that significantly speeds up convergence in practice.
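
As a rough illustration of the "orthogonal iterations" mentioned above, here is a minimal sketch of textbook orthogonal (subspace) iteration for the top-k eigenpairs of a small symmetric matrix. It is a generic version written in plain Python, not the paper's solver; the helper names (`matmul`, `orthonormalize`, `orthogonal_iteration`), the dense list-of-rows representation, and the toy matrix are all assumptions for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x k) matrix, both given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def orthonormalize(q):
    """Orthonormalize the columns of q (n x k) with modified Gram-Schmidt."""
    basis = []
    for col in zip(*q):
        col = list(col)
        for b in basis:
            dot = sum(x * y for x, y in zip(col, b))
            col = [x - dot * y for x, y in zip(col, b)]
        norm = math.sqrt(sum(x * x for x in col))
        basis.append([x / norm for x in col])
    return [list(row) for row in zip(*basis)]


def orthogonal_iteration(a, k, iters=200, seed=0):
    """Approximate the k dominant eigenpairs of a symmetric matrix `a`."""
    rng = random.Random(seed)
    n = len(a)
    # Start from a random orthonormal n x k basis.
    q = orthonormalize([[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)])
    for _ in range(iters):
        # One multiplication by `a`, then re-orthonormalization.
        q = orthonormalize(matmul(a, q))
    # Rayleigh quotients of the converged columns estimate the eigenvalues.
    aq = matmul(a, q)
    eigvals = [sum(q[i][j] * aq[i][j] for i in range(n)) for j in range(k)]
    return eigvals, q


# Hypothetical toy matrix with eigenvalues 3, 1 and 0.5.
toy_matrix = [[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 0.5]]
eigvals, basis = orthogonal_iteration(toy_matrix, k=2)
```

Here `eigvals` comes out close to 3 and 1. Each iteration costs one multiplication by the matrix plus a re-orthonormalization of k columns, so with a sparse matrix representation and a fixed k the per-iteration cost grows linearly with the number of nonzero entries, which is what makes this family of methods attractive for large graphs.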

Extensive experiments, comparing against 11 competitors on 6 real datasets, demonstrate that the proposed approach consistently outperforms all competitors in terms of result quality measured against ground truth labels, while being up to orders of magnitude faster. In particular, on the Microsoft Academic Knowledge Graph dataset with 265.2 million edges and 1.1 billion attribute values, it outputs high-quality results for 5-AGC within 1.68 hours using a single CPU core, while none of the 11 competitors finishes within 3 days.

Published In

WWW '21: Proceedings of the Web Conference 2021

April 2021

4054 pages

ISBN:9781450383127

DOI:10.1145/3442381

Editors: Jure Leskovec (Stanford), Marko Grobelnik (Jožef Stefan Institute), Marc Najork (Google), Jie Tang (Tsinghua University), Leila Zia (Wikimedia Foundation)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '21

Sponsor:

SIGWEB

WWW '21: The Web Conference 2021

April 19 - 23, 2021

Ljubljana, Slovenia

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Total Citations: 7
Total Downloads: 434
Downloads (Last 12 months): 76
Downloads (Last 6 weeks): 4

Reflects downloads up to 22 Dec 2024


Citations

Cited By

Wei J, Yang Z, Luo Q, Zhang Y, Qin L, Zhang W. (2024). High-Order Local Clustering on Hypergraphs. ICST Transactions on Scalable Information Systems 11:6. DOI: 10.4108/eetsis.7431. Online publication date: 15-Nov-2024.
https://doi.org/10.4108/eetsis.7431
陈丹. (2024). Algorithm Research of Attention Mechanism in Graph Neural Network Model. Modeling and Simulation 13:01 (225-238). DOI: 10.12677/MOS.2024.131022. Online publication date: 2024.
https://doi.org/10.12677/MOS.2024.131022
Yang R, Shi J. (2024). Efficient High-Quality Clustering for Large Bipartite Graphs. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639278. Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639278
Wang H, Yang R, Xiao X, Baeza-Yates R, Bonchi F. (2024). Effective Edge-wise Representation Learning in Edge-Attributed Bipartite Graphs. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (3081-3091). DOI: 10.1145/3637528.3671805. Online publication date: 25-Aug-2024.
https://dl.acm.org/doi/10.1145/3637528.3671805
Li Y, Guo G, Shi J, Yang R, Shen S, Li Q, Luo J. (2024). A versatile framework for attributed network clustering via K-nearest neighbor augmentation. The VLDB Journal 33:6 (1913-1943). DOI: 10.1007/s00778-024-00875-8. Online publication date: 16-Sep-2024.
https://doi.org/10.1007/s00778-024-00875-8
Li Y, Yang R, Shi J. (2023). Efficient and Effective Attributed Hypergraph Clustering via K-Nearest Neighbor Augmentation. Proceedings of the ACM on Management of Data 1:2 (1-23). DOI: 10.1145/3589261. Online publication date: 20-Jun-2023.
https://doi.org/10.1145/3589261
Yang R, Shi J, Xiao X, Yang Y, Bhowmick S, Liu J. (2023). PANE: scalable and effective attributed network embedding. The VLDB Journal: The International Journal on Very Large Data Bases 32:6 (1237-1262). DOI: 10.1007/s00778-023-00790-4. Online publication date: 24-Mar-2023.
https://dl.acm.org/doi/10.1007/s00778-023-00790-4
