
CluWords: Exploiting Semantic Word Clustering Representation for Enhanced Topic Modeling

Published: 30 January 2019

Abstract

In this paper, we advance the state-of-the-art in topic modeling by means of a new document representation based on pre-trained word embeddings for non-probabilistic matrix factorization. Specifically, our strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information. The novel contributions of our solution include: (i) the introduction of a novel data representation for topic modeling based on syntactic and semantic relationships derived from distances calculated within a pre-trained word embedding space, and (ii) the proposal of a new TF-IDF-based strategy developed specifically to weight the CluWords. In our extensive experimental evaluation, covering 12 datasets and 8 state-of-the-art baselines, we outperform the baselines (with a few ties) in almost all cases, with gains of more than 50% over the best baselines (and up to 80% over some runner-ups). Finally, we show that our method is able to improve document representation for the task of automatic text classification.
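The CluWord idea described in the abstract can be sketched roughly as follows: for each vocabulary word, the set of its embedding-space neighbours above a cosine-similarity threshold acts as a meta-word; raw term counts are projected into this meta-word space and then reweighted with a TF-IDF-style scheme. The code below is a minimal illustrative sketch only: the toy two-dimensional embeddings, the threshold `alpha`, and the plain IDF formula are assumptions for illustration, not the paper's exact formulation (the paper proposes a specialised TF-IDF variant over real pre-trained vectors such as fastText).

```python
import numpy as np

# Toy pre-trained embeddings (in practice these would come from a large
# pre-trained model such as fastText or word2vec).
vocab = ["cat", "dog", "pet", "car", "road"]
emb = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.8, 0.3],
    [0.1, 1.0],
    [0.2, 0.9],
])

def cosine_sim_matrix(E):
    """Pairwise cosine similarities between all embedding rows."""
    U = E / np.linalg.norm(E, axis=1, keepdims=True)
    return U @ U.T

# Step 1: CluWord membership matrix C. Row i is the meta-word centred on
# vocab[i]; C[i, j] keeps the similarity of word j to that centre when it
# exceeds the threshold alpha (alpha is an illustrative choice here).
alpha = 0.85
S = cosine_sim_matrix(emb)
C = np.where(S >= alpha, S, 0.0)

# Step 2: project raw per-document term counts into CluWord space, so each
# document's mass is spread over semantically related words.
docs_counts = np.array([
    [2, 1, 0, 0, 0],   # document about pets
    [0, 0, 0, 1, 2],   # document about traffic
], dtype=float)
tf = docs_counts @ C.T   # CluWord term frequencies

# Step 3: TF-IDF-style reweighting of CluWords (standard smoothed IDF here;
# the paper develops its own weighting tailored to CluWords).
df = (tf > 0).sum(axis=0)
idf = np.log((1 + tf.shape[0]) / (1 + df)) + 1.0
tfidf = tf * idf

# The resulting matrix would then feed a factorization method such as NMF.
print(tfidf.shape)
```

Note how the first document gets a non-zero weight on the "pet" meta-word even though "pet" never occurs in it: the semantic expansion through embedding neighbours is what distinguishes this representation from a plain bag of words.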



      Published In

      WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining
      January 2019
      874 pages
      ISBN:9781450359405
      DOI:10.1145/3289600
      © 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. data representation
      2. topic modeling
      3. word embedding

      Qualifiers

      • Research-article

      Funding Sources

      • CNPq
      • MASWeb
      • Mundiale
      • CAPES

      Conference

      WSDM '19

      Acceptance Rates

WSDM '19 paper acceptance rate: 84 of 511 submissions (16%)
Overall acceptance rate: 498 of 2,863 submissions (17%)
