DOI: 10.1145/1553374.1553410

Accounting for burstiness in topic models

Published: 14 June 2009

Abstract

Many different topic models have been used successfully for a variety of applications. However, even state-of-the-art topic models suffer from the important flaw that they do not capture the tendency of words to appear in bursts; it is a fundamental property of language that if a word is used once in a document, it is more likely to be used again. We introduce a topic model that uses Dirichlet compound multinomial (DCM) distributions to model this burstiness phenomenon. On both text and non-text datasets, the new model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA). It is straightforward to incorporate the DCM extension into topic models that are more complex than LDA.
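
The burstiness property the abstract refers to can be made concrete with the Dirichlet compound multinomial itself: under a DCM, each occurrence of a word raises the probability of seeing that word again in the same document, which a plain multinomial cannot do. The following is a minimal, self-contained sketch (not the authors' code; the vocabulary size, count vectors, and alpha values are arbitrary toy choices) comparing DCM and multinomial log-likelihoods for a "bursty" document and an evenly spread document of the same length.

```python
# Hedged illustration only: toy numbers, not the paper's model or data.
import numpy as np
from scipy.special import gammaln

def dcm_log_likelihood(x, alpha):
    """Log P(x | alpha) under the DCM (multivariate Polya) distribution."""
    x = np.asarray(x, dtype=float)
    n, A = x.sum(), alpha.sum()
    log_coef = gammaln(n + 1) - gammaln(x + 1).sum()   # multinomial coefficient
    return (log_coef
            + gammaln(A) - gammaln(A + n)
            + (gammaln(alpha + x) - gammaln(alpha)).sum())

def multinomial_log_likelihood(x, p):
    """Log P(x | p) under a plain multinomial with fixed word probabilities p."""
    x = np.asarray(x, dtype=float)
    log_coef = gammaln(x.sum() + 1) - gammaln(x + 1).sum()
    return log_coef + (x * np.log(p)).sum()

# Toy vocabulary of 4 words; p is the DCM's mean word proportions (alpha / sum(alpha)),
# so the two models agree on expected proportions and differ only in burstiness.
alpha = np.array([0.5, 0.5, 0.5, 0.5])
p = alpha / alpha.sum()

bursty = np.array([8, 0, 0, 0])   # one word repeated many times
spread = np.array([2, 2, 2, 2])   # same document length, no repetition

for name, x in [("bursty", bursty), ("spread", spread)]:
    print(name,
          "DCM:", round(dcm_log_likelihood(x, alpha), 3),
          "multinomial:", round(multinomial_log_likelihood(x, p), 3))
```

With these toy numbers the DCM assigns the bursty counts a higher log-likelihood than the evenly spread counts, while the multinomial does the opposite; that gap is the property the paper's DCM-based topic model is designed to exploit.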



Published In

ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
June 2009
1331 pages
ISBN: 9781605585161
DOI: 10.1145/1553374

Sponsors

  • NSF
  • Microsoft Research
  • MITACS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2009


Qualifiers

  • Research-article


Conference

ICML '09
Sponsor:
  • Microsoft Research

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%

