research-article

Dynamic Clustering of Streaming Short Documents

Authors:

Shangsong Liang,

Emine Yilmaz,

Evangelos KanoulasAuthors Info & Claims

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 995 - 1004

https://doi.org/10.1145/2939672.2939748

Published: 13 August 2016 Publication History

Get Access

Abstract

Clustering technology has found numerous applications in mining textual data. It was shown to enhance the performance of retrieval systems in various different ways, such as identifying different query aspects in search result diversification, improving smoothing in the context of language modeling, matching queries with documents in a latent topic space in ad-hoc retrieval, summarizing documents etc. The vast majority of clustering methods have been developed under the assumption of a static corpus of long (and hence textually rich) documents. Little attention has been given to streaming corpora of short text, which is the predominant type of data in Web 2.0 applications, such as social media, forums, and blogs. In this paper, we consider the problem of dynamically clustering a streaming corpus of short documents. The short length of documents makes the inference of the latent topic distribution challenging, while the temporal dynamics of streams allow topic distributions to change over time. To tackle these two challenges we propose a new dynamic clustering topic model - DCT - that enables tracking the time-varying distributions of topics over documents and words over topics. DCT models temporal dynamics by a short-term or long-term dependency model over sequential data, and overcomes the difficulty of handling short text by assigning a single topic to each short document and using the distributions inferred at a certain point in time as priors for the next inference, allowing the aggregation of information. At the same time, taking a Bayesian approach allows evidence obtained from new streaming documents to change the topic distribution. Our experimental results demonstrate that the proposed clustering algorithm outperforms state-of-the-art dynamic and non-dynamic clustering topic models in terms of perplexity and when integrated in a cluster-based query likelihood model it also outperforms state-of-the-art models in terms of retrieval quality.

Supplementary Material

MP4 File (kdd2016_zhang_dynamic_clustering_01-acm.mp4)

Download
256.66 MB

References

[1]

N. Begum, L. Ulanova, J. Wang, and E. Keogh. Accelerating dynamic time warping clustering with a novel admissible pruning strategy. In KDD, pages 49--58, 2013.

Abstract

Supplementary Material

References

Cited By

Index Terms

Recommendations

Modeling online reviews with multi-grain topic models

LJST: A Semi-supervised Joint Sentiment-Topic Model for Short Texts

Keyword Assisted Embedded Topic Model

Comments

Information

Published In

Sponsors

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Conference

Acceptance Rates

Upcoming Conference

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Login options

Full Access

View options

PDF

eReader

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations