Clustering Large Attributed Graphs: A Balance between Structural and Attribute Similarities

Published: 01 February 2011

Abstract

Social networks, sensor networks, biological networks, and many other information networks can be modeled as large graphs. Graph vertices represent entities, and graph edges represent their relationships or interactions. In many large graphs, each vertex has one or more attributes that describe its properties. In many application domains, graph clustering techniques are very useful for detecting densely connected groups in a large graph as well as for understanding and visualizing it. The goal of graph clustering is to partition the vertices of a large graph into clusters based on criteria such as vertex connectivity or neighborhood similarity. Many existing graph clustering methods focus mainly on the topological structure and largely ignore the vertex properties, which are often heterogeneous. In this article, we propose a novel graph clustering algorithm, SA-Cluster, which achieves a good balance between structural and attribute similarities through a unified distance measure. Our method partitions a large graph associated with attributes into k clusters so that each cluster contains a densely connected subgraph with homogeneous attribute values. An effective method is proposed to automatically learn the degree of contribution of structural similarity and attribute similarity. Theoretical analysis shows that SA-Cluster converges quickly through iterative cluster refinement. Optimization techniques for matrix computation are proposed to further improve the efficiency of SA-Cluster on large graphs. Extensive experimental results demonstrate the effectiveness of SA-Cluster through comparisons with state-of-the-art graph clustering and summarization methods.
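
The abstract does not spell out the unified distance measure, so the following is only an illustrative sketch of one way structural and attribute similarity could be folded into a single score. It assumes an attribute-augmented graph (one extra vertex per distinct attribute value, linked to the vertices that carry it) and a truncated random walk with restart. The function names, the single attribute per vertex, the restart probability, and the walk length are all assumptions made for this example, not details taken from the abstract.

```python
import numpy as np

def augmented_transition_matrix(adj, attrs, attr_weight):
    """Row-normalized transition matrix of an attribute-augmented graph.

    adj         : (n, n) symmetric adjacency matrix of the original graph
    attrs       : list of n attribute values, one per vertex (assumed)
    attr_weight : weight of every vertex-to-attribute edge (assumed scalar)
    """
    n = adj.shape[0]
    values = sorted(set(attrs))
    m = len(values)
    W = np.zeros((n + m, n + m))
    W[:n, :n] = adj  # structural edges keep weight 1
    for v, a in enumerate(attrs):
        j = n + values.index(a)
        W[v, j] = attr_weight  # vertex -> attribute vertex
        W[j, v] = attr_weight  # attribute vertex -> vertex
    row_sums = W.sum(axis=1, keepdims=True)
    # Rows with no outgoing weight stay all-zero instead of dividing by zero.
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

def random_walk_proximity(P, n, restart=0.15, steps=4):
    """Truncated random-walk proximity: sum over l of c * (1 - c)^l * P^l."""
    R = np.zeros_like(P)
    P_l = np.eye(P.shape[0])
    for l in range(1, steps + 1):
        P_l = P_l @ P
        R += restart * (1 - restart) ** l * P_l
    return R[:n, :n]  # keep only the original (structure) vertices

# Two disconnected edges: 0-1 and 2-3. Vertices 0 and 2 share an attribute.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
attrs = ["db", "ml", "db", "ml"]

structure_only = random_walk_proximity(
    augmented_transition_matrix(adj, attrs, attr_weight=0.0), n=4)
combined = random_walk_proximity(
    augmented_transition_matrix(adj, attrs, attr_weight=1.0), n=4)
```

With `attr_weight=0.0` vertices 0 and 2 have zero proximity because no path connects them; with a positive weight the shared attribute vertex creates a two-step path, so their proximity becomes positive. A proximity matrix of this kind could then be handed to any distance-based clustering routine, and the attribute weight is the knob that trades structural against attribute similarity.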

    Published In

    ACM Transactions on Knowledge Discovery from Data  Volume 5, Issue 2
    February 2011
    192 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/1921632
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 February 2011
    Accepted: 01 July 2010
    Revised: 01 May 2010
    Received: 01 January 2010
    Published in TKDD Volume 5, Issue 2

    Author Tags

    1. Graph clustering
    2. attribute similarity
    3. structural proximity

    Qualifiers

    • Research-article
    • Research
    • Refereed
