research-article

Efficient approach for incremental Vietnamese document clustering

Authors:

Tu Anh Nguyen Hoang,

Kiem Hoang

WIDM '09: Proceedings of the eleventh international workshop on Web information and data management

Pages 47 - 54

https://doi.org/10.1145/1651587.1651599

Published: 02 November 2009

Abstract

In this paper, we present how to use a graph model to cluster Vietnamese documents incrementally. A graph-based model allows us to capture the complete structure of not only each document but also the whole document collection. The graph structure is easily updated when a new document arrives. While building the graph incrementally, we can identify representative subgraph features, which are later used to calculate a hybrid pairwise document similarity. These subgraph features make the clustering process less sensitive to the Vietnamese word segmentation step. Based on the hybrid similarity measure, the documents are grouped into clusters on the fly, without any assumption about the number of clusters and without retrieving previous documents.
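The abstract does not give the algorithm itself, so the following Python sketch only illustrates the general idea under loudly stated assumptions. It is not the authors' method. The assumptions are: a document is a list of syllable tokens (so no word segmenter is needed), adjacent-token pairs stand in for the "subgraph features", the hybrid similarity is a weighted mix of Jaccard overlap on edges and on terms, and a simple single-pass threshold rule assigns clusters. The names `IncrementalClusterer`, `Cluster`, `alpha`, and `threshold` are hypothetical.

```python
from dataclasses import dataclass, field


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class Cluster:
    """Per-cluster summary, so earlier documents never need to be re-read."""
    terms: set = field(default_factory=set)  # graph vertices seen so far
    edges: set = field(default_factory=set)  # adjacent-token pairs seen so far
    size: int = 0


class IncrementalClusterer:
    """Single-pass clustering; the number of clusters is not fixed in advance."""

    def __init__(self, alpha: float = 0.5, threshold: float = 0.3):
        self.alpha = alpha          # assumed weight of the structural (edge) part
        self.threshold = threshold  # assumed minimum similarity to join a cluster
        self.clusters = []

    def add(self, tokens: list) -> int:
        """Assign one new document and return the index of its cluster."""
        terms = set(tokens)
        edges = set(zip(tokens, tokens[1:]))  # edges between consecutive tokens
        best_idx, best_sim = -1, 0.0
        for i, c in enumerate(self.clusters):
            # Hybrid similarity: structural overlap mixed with term overlap.
            sim = (self.alpha * jaccard(edges, c.edges)
                   + (1 - self.alpha) * jaccard(terms, c.terms))
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx < 0 or best_sim < self.threshold:
            # Nothing is similar enough: open a new cluster.
            self.clusters.append(Cluster())
            best_idx = len(self.clusters) - 1
        c = self.clusters[best_idx]
        c.terms |= terms  # update the cluster summary in place
        c.edges |= edges
        c.size += 1
        return best_idx


clusterer = IncrementalClusterer(alpha=0.5, threshold=0.3)
# Unaccented toy syllables, used only for illustration.
labels = [clusterer.add(doc.split()) for doc in
          ["hoc sinh di hoc", "hoc sinh di choi", "gia vang tang manh"]]
```

In this toy run the first two documents share most syllables and adjacent pairs and fall into one cluster, while the third shares nothing and opens a new one. Because each cluster keeps only a summary of its terms and edges, a new document is compared against summaries rather than against stored documents.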

Cited By

Kathiria, P., Arolkar, H. 2022. Trend analysis and forecasting of publication activities by Indian computer science researchers during the period of 2010–23. Expert Systems, 39:10. Online publication date: 19-Jun-2022.
https://doi.org/10.1111/exsy.13070

Published In

WIDM '09: Proceedings of the eleventh international workshop on Web information and data management

November 2009

104 pages

ISBN:9781605588087

DOI:10.1145/1651587

Program Chairs: Chee Yong Chan (National University of Singapore), Prasenjit Mitra (The Pennsylvania State University)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM '09: Conference on Information and Knowledge Management

November 2, 2009

Hong Kong, China
