DOI: 10.1007/978-3-030-36987-3_22
Article

Text Document Clustering Using Community Discovery Approach

Published: 09 January 2020

Abstract

Document clustering is the problem of automatically grouping text documents so that each group contains similar documents. In a supervised setting the problem can be solved with good results, whereas for unannotated data unsupervised machine learning approaches do not always perform well. Algorithms such as K-Means clustering are the most popular choice when the class labels are not known. The objective of this work is to apply community discovery algorithms from the social network analysis literature to detect the underlying groups in text data.
We model the corpus of documents as a graph whose nodes are the distinct non-trivial words of the whole corpus; an edge is added between two nodes if the corresponding words occur together in at least one document. The weight of the edge between two word nodes is the number of documents in which the two words co-occur. We apply the fast Louvain community discovery algorithm to detect communities. The challenge is to interpret the communities as classes. If the number of communities obtained is greater than the required number of classes, a merging technique is proposed. Each document is assigned to the community with which it shares the maximum number of words. The main thrust of the paper is to present a novel approach to document clustering using community discovery algorithms. The proposed algorithm is evaluated on a few benchmark data sets, and we find that it gives competitive results on the majority of the data sets when compared with standard clustering algorithms.
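The graph construction and the document assignment described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the whitespace tokenizer, the `stopwords` parameter (standing in for the removal of trivial words), and the tie-breaking rule are all assumptions. The Louvain step itself is not implemented; the `communities` list is a hand-written placeholder for its output, and the merging technique from the paper is not shown.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(docs, stopwords=frozenset()):
    """Return a Counter mapping (word_a, word_b) -> number of documents
    in which both words appear. Pairs are stored in sorted order."""
    edges = Counter()
    for text in docs:
        # Distinct words per document, so each document counts once per pair.
        words = sorted({w for w in text.lower().split() if w not in stopwords})
        for pair in combinations(words, 2):
            edges[pair] += 1
    return edges


def assign_documents(docs, communities):
    """Assign each document to the community sharing the most words with it.
    Ties go to the first such community (an assumption; the abstract does
    not say how ties are resolved)."""
    labels = []
    for text in docs:
        words = set(text.lower().split())
        overlaps = [len(words & community) for community in communities]
        labels.append(overlaps.index(max(overlaps)))
    return labels


docs = [
    "graph node edge",
    "graph edge weight",
    "music guitar song",
    "guitar song concert",
]
edges = build_cooccurrence_graph(docs)

# Hypothetical communities, standing in for the output of a Louvain run.
communities = [
    {"graph", "node", "edge", "weight"},
    {"music", "guitar", "song", "concert"},
]
labels = assign_documents(docs, communities)
```

In practice the weighted edges could be loaded into a graph library and the communities obtained from an existing Louvain implementation, for example `louvain_communities` in NetworkX, before the assignment step is applied.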


Published In

Distributed Computing and Internet Technology: 16th International Conference, ICDCIT 2020, Bhubaneswar, India, January 9–12, 2020, Proceedings
Jan 2020
442 pages
ISBN:978-3-030-36986-6
DOI:10.1007/978-3-030-36987-3

Publisher

Springer-Verlag

Berlin, Heidelberg


Author Tags

  1. Social networks
  2. Louvain community discovery algorithm
  3. Clustering
