research-article

Differential Topic Models

Published: 01 February 2015

Abstract

In applications we may want to compare different document collections: they may share common content but also have aspects unique to particular collections. This task has been called comparative text mining or cross-collection modeling. We present a differential topic model for this application that models both topic differences and similarities, using hierarchical Bayesian nonparametric models. We also found it important to properly model power-law phenomena in topic-word distributions, and therefore use the full Pitman-Yor process rather than just a Dirichlet process. Furthermore, we propose the transformed Pitman-Yor process (TPYP) to incorporate prior knowledge, such as vocabulary variation across collections, into the model. To deal with the non-conjugacy between the prior and the likelihood in the TPYP, we propose an efficient sampling algorithm based on a data augmentation technique derived from the multinomial theorem. Experimental results show that the model discovers interesting aspects of different collections. We also show that the proposed MCMC-based algorithm achieves dramatically reduced test perplexity compared to existing topic models. Finally, our model outperforms the state of the art for document classification/ideology prediction on a number of text collections.
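To make the power-law point concrete, the following is a minimal sketch (not the paper's inference algorithm) of the Chinese-restaurant-process view of the two-parameter Pitman-Yor process. With discount = 0 it reduces to the Dirichlet process, whose number of clusters grows only logarithmically; a positive discount produces the power-law growth in cluster counts and sizes that motivates using the full Pitman-Yor process for topic-word distributions. The function name pitman_yor_crp and its parameters are illustrative, not taken from the paper.

import random

def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    # Chinese-restaurant-process view of the two-parameter Pitman-Yor
    # process: discount in [0, 1), concentration > -discount.
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for n in range(n_customers):
        # A new table opens with probability
        # (concentration + discount * K) / (n + concentration).
        if rng.random() * (n + concentration) < concentration + discount * len(tables):
            tables.append(1)
        else:
            # Join existing table k with probability proportional to (n_k - discount).
            r = rng.random() * (n - discount * len(tables))
            for k, n_k in enumerate(tables):
                r -= n_k - discount
                if r < 0:
                    tables[k] += 1
                    break
            else:  # guard against floating-point round-off
                tables[-1] += 1
    return sorted(tables, reverse=True)

# With discount > 0 the number of tables grows like n^discount and the
# size distribution has a power-law tail; with discount = 0 (Dirichlet
# process) the number of tables grows only logarithmically in n.
py = pitman_yor_crp(10000, discount=0.7, concentration=10.0)
dp = pitman_yor_crp(10000, discount=0.0, concentration=10.0)
print("Pitman-Yor: %d tables, largest %d" % (len(py), py[0]))
print("Dirichlet : %d tables, largest %d" % (len(dp), dp[0]))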



Publisher

IEEE Computer Society, United States


