poster

Representing document as dependency graph for document clustering

Authors:

Zheng Chen

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

Pages 2177 - 2180

https://doi.org/10.1145/2063576.2063920

Published: 24 October 2011

Abstract

In traditional clustering methods, a document is often represented as a "bag of words" (in the BOW model) or as n-grams (in the suffix tree document model), without considering the natural language relationships between the words. In this paper, we propose a novel approach, DGDC (Dependency Graph-based Document Clustering), to address this issue. In our algorithm, each document is represented as a dependency graph in which the nodes correspond to words, which can be seen as meta-descriptions of the document, and the edges stand for the relations between pairs of words. A new similarity measure is proposed to compute the pairwise similarity of documents based on their corresponding dependency graphs. Applying the new similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm yields the final clusters of documents. Experiments were carried out on five public document datasets. The empirical results indicate that the DGDC algorithm achieves better performance in document clustering tasks than other approaches based on the BOW model and the suffix tree document model.
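
The pipeline described in the abstract (dependency graphs, a pairwise graph similarity, then group-average agglomerative clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the abstract does not define the DGDC similarity measure, so a plain Jaccard overlap of dependency edges stands in for it; the edge sets in `docs` are hand-written rather than produced by a parser; and the names `edge_jaccard`, `group_average`, and `gahc` are hypothetical. The linkage shown averages similarities between members of two clusters, which is one common reading of "group-average".

```python
from itertools import combinations

# Hypothetical hand-written dependency edges (head, dependent) per document.
# A real pipeline would obtain these from a dependency parser.
docs = {
    "d1": {("chased", "cat"), ("chased", "mouse"), ("cat", "black")},
    "d2": {("chased", "cat"), ("chased", "mouse"), ("mouse", "small")},
    "d3": {("rose", "stocks"), ("rose", "sharply")},
    "d4": {("rose", "stocks"), ("rose", "today")},
}


def edge_jaccard(a, b):
    """Stand-in similarity (NOT the paper's measure): Jaccard overlap of edge sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_average(c1, c2, sim):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(sim[frozenset((x, y))] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def gahc(graphs, k):
    """Repeatedly merge the most similar pair of clusters until k remain."""
    # Precompute the similarity of every pair of documents once.
    sim = {
        frozenset((x, y)): edge_jaccard(graphs[x], graphs[y])
        for x, y in combinations(graphs, 2)
    }
    clusters = [(name,) for name in graphs]
    while len(clusters) > k:
        # Pick the pair of clusters with the highest group-average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]], clusters[p[1]], sim),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return sorted(tuple(sorted(c)) for c in clusters)


print(gahc(docs, 2))  # [('d1', 'd2'), ('d3', 'd4')]
```

In this toy input the two "chased" documents share two of four distinct edges (similarity 0.5) and the two "rose" documents share one of three, while cross-topic pairs share none, so the two topics separate cleanly. A bag-of-words view would treat "cat chased mouse" and "mouse chased cat" as identical, whereas directed dependency edges keep that distinction, which is the kind of relational information the abstract argues for.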

Index Terms

Representing document as dependency graph for document clustering
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Document representation

Published In

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

October 2011

2712 pages

ISBN:9781450307178

DOI:10.1145/2063576

Editors:
Bettina Berendt,
Arjen de Vries,
Wenfei Fan,
Craig Macdonald (University of Glasgow, UK),
Iadh Ounis (University of Glasgow, UK),
Ian Ruthven (University of Strathclyde, UK)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM '11

CIKM '11: International Conference on Information and Knowledge Management

October 24 - 28, 2011

Glasgow, Scotland, UK

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Total Citations: 8
Total Downloads: 519
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 1

Reflects downloads up to 23 Dec 2024

Cited By

Pan K, Zhang G, Liao M, Xu J (2023). Graph-Based Interactive Matching for Pairs of News Articles. Cognitive Computation 16:2 (507-516). DOI: 10.1007/s12559-023-10208-6. Online publication date: 30-Oct-2023
https://doi.org/10.1007/s12559-023-10208-6
Ferilli S, Redavid D, Di Pierro D (2022). Holistic graph-based document representation and management for open science. International Journal on Digital Libraries 24:4 (205-227). DOI: 10.1007/s00799-022-00328-z. Online publication date: 29-Jun-2022
https://doi.org/10.1007/s00799-022-00328-z
Pereira V, de Castro L (2022). Natural Language Processing Based on a Text Graph Convolutional Network. Distributed Computing and Artificial Intelligence, 19th International Conference (1-10). DOI: 10.1007/978-3-031-20859-1_1. Online publication date: 13-Dec-2022
https://doi.org/10.1007/978-3-031-20859-1_1
Singh K, Dorendro A, Devi H, Mahanta A (2021). Analysis of Changing Trends in Textual Data Representation. Recent Trends in Image Processing and Pattern Recognition (237-251). DOI: 10.1007/978-981-16-0507-9_21. Online publication date: 26-Feb-2021
https://doi.org/10.1007/978-981-16-0507-9_21
Osman A, Barukub O (2020). Graph-Based Text Representation and Matching: A Review of the State of the Art and Future Challenges. IEEE Access 8 (87562-87583). DOI: 10.1109/ACCESS.2020.2993191. Online publication date: 2020
https://doi.org/10.1109/ACCESS.2020.2993191
Zhang T, Liu B, Niu D, Lai K, Xu Y, Cuzzocrea A, Allan J, Paton N, Srivastava D, Agrawal R, Broder A, Zaki M, Candan S, Labrinidis A, Schuster A, Wang H (2018). Multiresolution Graph Attention Networks for Relevance Matching. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (933-942). DOI: 10.1145/3269206.3271806. Online publication date: 17-Oct-2018
https://dl.acm.org/doi/10.1145/3269206.3271806
Rafi M, Sharif M, Arshad W, Mohsin S, Rafay H (2016). Multi-Layer Semantics Based Document Clustering. Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics (1-4). DOI: 10.1145/2912845.2912880. Online publication date: 13-Jun-2016
https://dl.acm.org/doi/10.1145/2912845.2912880
Thakkar H, Endris K, Gimenez-Garcia J, Debattista J, Lange C, Auer S (2016). Are Linked Datasets fit for Open-domain Question Answering? A Quality Assessment. Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics (1-12). DOI: 10.1145/2912845.2912857. Online publication date: 13-Jun-2016
https://dl.acm.org/doi/10.1145/2912845.2912857
