DOI: 10.1145/2783258.2783411
research-article
Open access

Dirichlet-Hawkes Processes with Applications to Clustering Continuous-Time Document Streams

Published: 10 August 2015

Abstract

Clusters in document streams, such as online news articles, can be induced by their textual content as well as by the temporal dynamics of their arrival patterns. Can we leverage both sources of information to obtain a better clustering of the documents and to distill information that cannot be extracted from content alone? In this paper, we propose a novel random process, referred to as the Dirichlet-Hawkes process, that takes both sources of information into account in a unified framework. A distinctive feature of the proposed model is that the preferential attachment of items to clusters, which in Dirichlet processes depends on cluster sizes, is instead driven by the intensities of cluster-wise self-exciting temporal point processes, the Hawkes processes. This new model establishes a previously unexplored connection between Bayesian nonparametrics and temporal point processes: the number of clusters grows to accommodate the increasing complexity of online streaming content, while the model adapts to the ever-changing dynamics of the corresponding continuous arrival times. We conducted large-scale experiments on both synthetic data and real-world news articles, and show that Dirichlet-Hawkes processes can recover both meaningful topics and temporal dynamics, which leads to better predictive performance in terms of content perplexity and the arrival times of future documents.
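
The preferential-attachment idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation and rests on several assumptions that the abstract does not state: an exponential triggering kernel, a constant base intensity `lambda0` for opening a new cluster, and fixed values for `alpha` and `decay`. It also ignores textual content entirely and uses arrival times only. All function and parameter names are hypothetical.

```python
import math
import random


def hawkes_intensity(t, history, alpha=0.8, decay=1.0):
    """Self-exciting intensity of one cluster at time t.

    Assumes an exponential kernel alpha * exp(-decay * (t - s)),
    summed over the cluster's past arrival times s < t.
    """
    return sum(alpha * math.exp(-decay * (t - s)) for s in history if s < t)


def assignment_probs(t, clusters, lambda0=0.5, alpha=0.8, decay=1.0):
    """Probability of joining each existing cluster or a new one.

    Existing clusters are weighted by their current intensity rather
    than by their size; the key None stands for "open a new cluster"
    and is weighted by the assumed base intensity lambda0.
    """
    weights = {k: hawkes_intensity(t, times, alpha, decay)
               for k, times in clusters.items()}
    weights[None] = lambda0
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def simulate_assignments(arrival_times, lambda0=0.5, alpha=0.8, decay=1.0, seed=0):
    """Assign each arrival to a cluster by sampling from assignment_probs."""
    rng = random.Random(seed)
    clusters = {}  # cluster id -> list of arrival times
    labels = []
    for t in arrival_times:
        probs = assignment_probs(t, clusters, lambda0, alpha, decay)
        keys = list(probs)
        choice = rng.choices(keys, weights=[probs[k] for k in keys])[0]
        if choice is None:  # open a new cluster
            choice = len(clusters)
            clusters[choice] = []
        clusters[choice].append(t)
        labels.append(choice)
    return labels


if __name__ == "__main__":
    # Cluster 1 was active recently, so it attracts the new arrival more
    # strongly than cluster 0, even though both hold two documents.
    print(assignment_probs(10.0, {0: [1.0, 2.0], 1: [9.0, 9.5]}))
    print(simulate_assignments([0.0, 0.1, 0.2, 5.0, 5.1]))
```

Under these assumptions, two clusters of equal size receive different attachment probabilities when one of them has been active more recently, which is the contrast with size-based attachment that the abstract describes.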

Supplementary Material

MP4 File (p219.mp4)

      Published In

      KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
      August 2015
      2378 pages
      ISBN:9781450336642
      DOI:10.1145/2783258

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. dirichlet process
      2. document modeling
      3. hawkes process

      Funding Sources

      • NSF
      • NSF/NIH
      • NSF CAREER

      Conference

      KDD '15

      Acceptance Rates

      KDD '15 Paper Acceptance Rate 160 of 819 submissions, 20%;
      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
