research-article

Differential Topic Models

Published: 01 February 2015

Abstract

In applications we may want to compare different document collections: they may share common content but also have aspects unique to particular collections. This task has been called comparative text mining or cross-collection modeling. We present a differential topic model for this application that models both topic differences and similarities, using hierarchical Bayesian nonparametric models. We also found it important to properly model power-law phenomena in topic-word distributions, and therefore use the full Pitman-Yor process rather than just a Dirichlet process. Furthermore, we propose the transformed Pitman-Yor process (TPYP) to incorporate prior knowledge, such as vocabulary variation across collections, into the model. To deal with the non-conjugacy between the prior and the likelihood in the TPYP, we propose an efficient sampling algorithm based on a data augmentation technique derived from the multinomial theorem. Experimental results show that the model discovers interesting aspects of different collections. We also show that the proposed MCMC-based algorithm achieves dramatically reduced test perplexity compared to existing topic models. Finally, our model outperforms the state of the art for document classification/ideology prediction on a number of text collections.
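To make the power-law point concrete, the following is a minimal sketch (not the paper's inference algorithm) of the Chinese-restaurant-process view of the two-parameter Pitman-Yor process. With discount = 0 it reduces to the Dirichlet process, whose number of clusters grows only logarithmically; a positive discount produces the power-law growth in cluster counts and sizes that motivates using the full Pitman-Yor process for topic-word distributions. The function name pitman_yor_crp and its parameters are illustrative, not taken from the paper.

import random

def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    # Chinese-restaurant-process view of the two-parameter Pitman-Yor
    # process: discount in [0, 1), concentration > -discount.
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for n in range(n_customers):
        # A new table opens with probability
        # (concentration + discount * K) / (n + concentration).
        if rng.random() * (n + concentration) < concentration + discount * len(tables):
            tables.append(1)
        else:
            # Join existing table k with probability proportional to (n_k - discount).
            r = rng.random() * (n - discount * len(tables))
            for k, n_k in enumerate(tables):
                r -= n_k - discount
                if r < 0:
                    tables[k] += 1
                    break
            else:  # guard against floating-point round-off
                tables[-1] += 1
    return sorted(tables, reverse=True)

# With discount > 0 the number of tables grows like n^discount and the
# size distribution has a power-law tail; with discount = 0 (Dirichlet
# process) the number of tables grows only logarithmically in n.
py = pitman_yor_crp(10000, discount=0.7, concentration=10.0)
dp = pitman_yor_crp(10000, discount=0.0, concentration=10.0)
print("Pitman-Yor: %d tables, largest %d" % (len(py), py[0]))
print("Dirichlet : %d tables, largest %d" % (len(dp), dp[0]))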



Publisher

IEEE Computer Society, United States


