DOI: 10.1145/2063576.2063920
poster

Representing document as dependency graph for document clustering

Published: 24 October 2011

Abstract

In traditional clustering methods, a document is often represented as a "bag of words" (the BOW model) or as n-grams (the suffix tree document model), without considering the natural language relationships between the words. In this paper, we propose a novel approach, DGDC (Dependency Graph-based Document Clustering), to address this issue. In our algorithm, each document is represented as a dependency graph in which the nodes correspond to words, which can be seen as meta-descriptions of the document, and the edges stand for the relations between pairs of words. We propose a new similarity measure that computes the pairwise similarity of documents from their corresponding dependency graphs. Applying this measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm yields the final document clusters. Experiments were carried out on five public document datasets. The empirical results indicate that DGDC achieves better document clustering performance than approaches based on the BOW model and the suffix tree document model.
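The pipeline described in the abstract (graph representation, pairwise graph similarity, group-average agglomerative clustering) can be illustrated with a minimal Python sketch. Several parts are assumptions and not taken from the paper: the abstract does not define the DGDC similarity measure, so a weighted Jaccard overlap of node and edge sets stands in for it; the dependency pairs are hand-written toy data in place of a real parser's output; and "group-average" is read here as average linkage between clusters. The names build_graph, graph_similarity, and group_average_clustering are hypothetical.

```python
from itertools import combinations


def build_graph(dependencies):
    """Build a (nodes, edges) graph from (head, dependent) word pairs."""
    nodes, edges = set(), set()
    for head, dependent in dependencies:
        nodes.update((head, dependent))
        edges.add(frozenset((head, dependent)))  # edges treated as undirected
    return nodes, edges


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def graph_similarity(g1, g2, alpha=0.5):
    """Stand-in measure (an assumption, not the DGDC measure):
    weighted mix of node-set and edge-set Jaccard overlap."""
    return alpha * jaccard(g1[0], g2[0]) + (1 - alpha) * jaccard(g1[1], g2[1])


def group_average_clustering(graphs, num_clusters):
    """Naive agglomerative clustering with average linkage."""
    sim = {
        (i, j): graph_similarity(graphs[i], graphs[j])
        for i, j in combinations(range(len(graphs)), 2)
    }

    def average_similarity(c1, c2):
        total = sum(sim[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(graphs))]
    while len(clusters) > num_clusters:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_similarity(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Toy (head, dependent) pairs; a real system would obtain these from a parser.
documents = [
    [("clusters", "algorithm"), ("clusters", "documents")],
    [("clusters", "algorithm"), ("clusters", "documents"), ("documents", "text")],
    [("parses", "parser"), ("parses", "sentences")],
    [("parses", "parser"), ("parses", "sentences"), ("sentences", "long")],
]
graphs = [build_graph(doc) for doc in documents]
clusters = group_average_clustering(graphs, num_clusters=2)
print(sorted(clusters))  # prints [[0, 1], [2, 3]]
```

On this toy input the two clustering-related documents and the two parsing-related documents end up in separate clusters, because documents from different groups share no words or edges. The sketch recomputes average similarities on every merge, which is adequate only for small collections.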


Published In

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011
2712 pages
ISBN: 9781450307178
DOI: 10.1145/2063576

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dependency graph
  2. document clustering
  3. document representation model
  4. similarity measure

Qualifiers

  • Poster

Conference

CIKM '11

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
