DOI: 10.1145/1651587.1651599
research-article

Efficient approach for incremental Vietnamese document clustering

Published: 02 November 2009

Abstract

In this paper, we show how to use a graph model to cluster Vietnamese documents incrementally. A graph-based model lets us fully model the structure of not only each document but also the whole document collection. The graph structure is easily updated when a new document arrives. While building the graph incrementally, we can identify representative subgraph features, which are later used to calculate a hybrid pairwise document similarity. These subgraph features make the clustering process less sensitive to the Vietnamese word segmentation step. Based on the hybrid similarity measure, documents are grouped into clusters on the fly, without any assumption about the number of clusters and without retrieving previous documents.
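The abstract does not give the graph construction or the similarity formula, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that the graph is a word-adjacency graph shared by all documents, that the "subgraph features" are shared edges (adjacent token pairs), and that the hybrid similarity is a weighted mix of edge overlap and term cosine. The class name `IncrementalClusterer` and the parameters `threshold` and `alpha` are hypothetical. The sketch keeps per-document feature summaries rather than the original texts.

```python
from collections import Counter
from math import sqrt


class IncrementalClusterer:
    """Toy incremental clusterer over a shared word-adjacency graph (illustrative only)."""

    def __init__(self, threshold=0.3, alpha=0.5):
        self.threshold = threshold  # hypothetical similarity cut-off for joining a cluster
        self.alpha = alpha          # hypothetical weight of the edge-overlap part
        self.edges = {}             # (token, next_token) -> ids of documents containing it
        self.doc_terms = []         # per-document term counts
        self.doc_edges = []         # per-document edge sets
        self.labels = []            # cluster label per document

    @staticmethod
    def _cosine(a, b):
        dot = sum(count * b[term] for term, count in a.items() if term in b)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def add(self, tokens):
        """Insert one tokenized document and return its cluster label."""
        doc_id = len(self.labels)
        terms = Counter(tokens)
        edges = set(zip(tokens, tokens[1:]))

        # Find earlier documents that share at least one edge, using the graph only.
        shared = Counter()
        for edge in edges:
            for other in self.edges.get(edge, ()):
                shared[other] += 1

        # Assumed hybrid similarity: edge Jaccard mixed with term cosine.
        best_doc, best_sim = None, 0.0
        for other, n_shared in shared.items():
            edge_sim = n_shared / len(edges | self.doc_edges[other])
            term_sim = self._cosine(terms, self.doc_terms[other])
            sim = self.alpha * edge_sim + (1 - self.alpha) * term_sim
            if sim > best_sim:
                best_doc, best_sim = other, sim

        # Join the closest document's cluster, or open a new one (no preset cluster count).
        if best_doc is not None and best_sim >= self.threshold:
            label = self.labels[best_doc]
        else:
            label = max(self.labels, default=-1) + 1

        # Update the shared graph with the new document.
        for edge in edges:
            self.edges.setdefault(edge, set()).add(doc_id)
        self.doc_terms.append(terms)
        self.doc_edges.append(edges)
        self.labels.append(label)
        return label


clusterer = IncrementalClusterer()
docs = ["giá vàng tăng mạnh", "giá vàng tăng nhẹ", "đội tuyển bóng đá thắng"]
labels = [clusterer.add(doc.split()) for doc in docs]
print(labels)  # [0, 0, 1]
```

The example splits on whitespace, which yields syllables rather than segmented Vietnamese words. Adjacent-pair edges still capture multi-syllable units in this toy setting, which is one plausible reading of why subgraph features reduce the dependence on word segmentation.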


Published In

WIDM '09: Proceedings of the eleventh international workshop on Web information and data management
November 2009
104 pages
ISBN:9781605588087
DOI:10.1145/1651587

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. graph model
  2. incremental document clustering
  3. shared phrases
  4. vietnamese

Conference

CIKM '09