
CluWords: Exploiting Semantic Word Clustering Representation for Enhanced Topic Modeling

Published: 30 January 2019

Abstract

In this paper, we advance the state-of-the-art in topic modeling by means of a new document representation based on pre-trained word embeddings for non-probabilistic matrix factorization. Specifically, our strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information. The novel contributions of our solution include: (i) the introduction of a novel data representation for topic modeling based on syntactic and semantic relationships derived from distances calculated within a pre-trained word embedding space, and (ii) the proposal of a new TF-IDF-based strategy developed specifically to weight the CluWords. In our extensive experimental evaluation, covering 12 datasets and 8 state-of-the-art baselines, we outperform the baselines (with a few ties) in almost all cases, with gains of more than 50% over the best baselines (and up to 80% over some runner-ups). Finally, we show that our method is able to improve document representation for the task of automatic text classification.
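The CluWord idea described in the abstract can be sketched roughly as follows: for each vocabulary word, the set of its embedding-space neighbours above a cosine-similarity threshold acts as a meta-word; raw term counts are projected into this meta-word space and then reweighted with a TF-IDF-style scheme. The code below is a minimal illustrative sketch only: the toy two-dimensional embeddings, the threshold `alpha`, and the plain IDF formula are assumptions for illustration, not the paper's exact formulation (the paper proposes a specialised TF-IDF variant over real pre-trained vectors such as fastText).

```python
import numpy as np

# Toy pre-trained embeddings (in practice these would come from a large
# pre-trained model such as fastText or word2vec).
vocab = ["cat", "dog", "pet", "car", "road"]
emb = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.8, 0.3],
    [0.1, 1.0],
    [0.2, 0.9],
])

def cosine_sim_matrix(E):
    """Pairwise cosine similarities between all embedding rows."""
    U = E / np.linalg.norm(E, axis=1, keepdims=True)
    return U @ U.T

# Step 1: CluWord membership matrix C. Row i is the meta-word centred on
# vocab[i]; C[i, j] keeps the similarity of word j to that centre when it
# exceeds the threshold alpha (alpha is an illustrative choice here).
alpha = 0.85
S = cosine_sim_matrix(emb)
C = np.where(S >= alpha, S, 0.0)

# Step 2: project raw per-document term counts into CluWord space, so each
# document's mass is spread over semantically related words.
docs_counts = np.array([
    [2, 1, 0, 0, 0],   # document about pets
    [0, 0, 0, 1, 2],   # document about traffic
], dtype=float)
tf = docs_counts @ C.T   # CluWord term frequencies

# Step 3: TF-IDF-style reweighting of CluWords (standard smoothed IDF here;
# the paper develops its own weighting tailored to CluWords).
df = (tf > 0).sum(axis=0)
idf = np.log((1 + tf.shape[0]) / (1 + df)) + 1.0
tfidf = tf * idf

# The resulting matrix would then feed a factorization method such as NMF.
print(tfidf.shape)
```

Note how the first document gets a non-zero weight on the "pet" meta-word even though "pet" never occurs in it: the semantic expansion through embedding neighbours is what distinguishes this representation from a plain bag of words.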



      Published In

      WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining
      January 2019
      874 pages
      ISBN:9781450359405
      DOI:10.1145/3289600
      © 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. data representation
      2. topic modeling
      3. word embedding

      Qualifiers

      • Research-article

      Funding Sources

      • CNPq
      • MASWeb
      • Mundiale
      • CAPES

      Conference

      WSDM '19

      Acceptance Rates

WSDM '19 paper acceptance rate: 84 of 511 submissions (16%)
Overall acceptance rate: 498 of 2,863 submissions (17%)
