Text Document Clustering Using Community Discovery Approach

Authors:

S. Durga Bhavani

Distributed Computing and Internet Technology: 16th International Conference, ICDCIT 2020, Bhubaneswar, India, January 9–12, 2020, Proceedings

Pages 336 - 346

https://doi.org/10.1007/978-3-030-36987-3_22

Published: 09 January 2020

Abstract

Document clustering is the automatic grouping of text documents so that each group contains similar documents. In a supervised setting the problem yields good results, whereas for unannotated data the unsupervised machine learning approach does not always do so. Algorithms such as K-Means clustering are the most popular choice when the class labels are not known. The objective of this work is to apply community discovery algorithms from the social network analysis literature to detect the underlying groups in text data.

We model the corpus of documents as a graph whose nodes are the distinct non-trivial words of the whole corpus; an edge is added between two nodes if the corresponding words occur together in at least one document. The weight of an edge between two word nodes is defined as the number of documents in which the two words co-occur. We apply the fast Louvain community discovery algorithm to detect communities. The challenge is to interpret the communities as classes. If the number of communities obtained is greater than the required number of classes, a technique for merging is proposed. A document is assigned to the community that has the maximum number of similar words with it. The main thrust of the paper is to show a novel approach to document clustering using community discovery algorithms. The proposed algorithm is evaluated on a few benchmark data sets, and we find that it gives competitive results on the majority of them when compared with standard clustering algorithms.
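The graph construction and the document assignment step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the whitespace tokenizer, the function names, and the toy corpus are all assumptions made for illustration, and "similar words" is read here simply as shared words. The two communities are hand-written stand-ins for the output of a Louvain run (for example, NetworkX provides `louvain_communities`), and the merging technique is omitted because the abstract does not specify it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; the abstract only says "non-trivial words".
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}


def tokenize(doc):
    """Return the set of lowercase, whitespace-separated non-stop words."""
    return {w for w in doc.lower().split() if w not in STOP_WORDS}


def cooccurrence_graph(docs):
    """Map each unordered word pair to the number of documents containing both.

    A pair appears as a key (an edge) only if the two words share
    at least one document; the count is the edge weight.
    """
    weights = Counter()
    for doc in docs:
        # Sorting makes (u, v) a canonical key for the undirected edge.
        for u, v in combinations(sorted(tokenize(doc)), 2):
            weights[(u, v)] += 1
    return weights


def assign_community(doc, communities):
    """Return the index of the community sharing the most words with doc.

    Ties are broken in favour of the lowest index (an assumption).
    """
    words = tokenize(doc)
    overlaps = [len(words & community) for community in communities]
    return overlaps.index(max(overlaps))


# Toy corpus, invented for illustration only.
docs = [
    "the striker scored a goal",
    "goal scored in the final match",
    "the bank raised interest rates",
    "interest rates and the bank profit",
]
graph = cooccurrence_graph(docs)

# Stand-in for the word communities a Louvain run might return.
communities = [
    {"striker", "scored", "goal", "final", "match"},
    {"bank", "raised", "interest", "rates", "profit"},
]
labels = [assign_community(d, communities) for d in docs]
```

In this toy corpus, "goal" and "scored" co-occur in two documents, so their edge has weight 2, while "bank" and "goal" never co-occur and therefore have no edge. The first two documents land in community 0 and the last two in community 1.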

Index Terms

Text Document Clustering Using Community Discovery Approach
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

Distributed Computing and Internet Technology: 16th International Conference, ICDCIT 2020, Bhubaneswar, India, January 9–12, 2020, Proceedings

Jan 2020

442 pages

ISBN:978-3-030-36986-6

DOI:10.1007/978-3-030-36987-3

Editors:
Dang Van Hung (Vietnam National University, Hanoi, Vietnam),
Meenakshi D'Souza (International Institute of Information Technology, Bangalore, India)

© Springer Nature Switzerland AG 2020.

Publisher

Springer-Verlag

Berlin, Heidelberg
