DOI: 10.1145/3107411.3108195

Best Setting of Model Parameters in Applying Topic Modeling on Textual Documents.

Published: 20 August 2017

Abstract

Probabilistic topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining. It offers a viable approach to organizing huge document collections into latent topic themes that aid text mining. Latent Dirichlet Allocation (LDA) is the most commonly used topic modeling method across a wide range of technical fields. However, model development can be arduous and tedious, requiring burdensome, systematic sensitivity studies to find the best set of model parameters. In this study, we use a heuristic approach to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of the number of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method on three markedly different types of ground-truth datasets: Salmonella next-generation sequencing, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed. We then describe extensive sensitivity studies to determine best practices for generating effective topic models. To test the effectiveness and validity of topic models, we constructed a ground-truth dataset from PubMed containing some 40 health-related themes, including negative controls, and mixed it with a dataset of unstructured documents. We found that obtaining the most useful model, tuned to the desired sensitivity versus specificity, requires an iterative process in which preprocessing steps, the type of topic modeling algorithm, and the algorithm's model parameters are systematically varied. Models need to be compared with both qualitative, subjective assessments and quantitative, objective assessments, and care is required to ensure that Gibbs sampling in model estimation is sufficient to yield stable solutions. With a high-quality model, documents can be rank-ordered according to their probability of being associated with a complex regulatory query string, greatly lessening text mining work. Importantly, topic models are agnostic about how words and documents are defined, and thus our findings extend to topic models where samples are defined as documents, and genes, proteins, or their sequences are the words.
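
As a concrete illustration of the RPC heuristic summarized in the abstract (and defined in reference [2]), the Python sketch below computes the rate of perplexity change over a set of candidate topic numbers and reports the change point as the suggested number of topics. The toy corpus, the candidate topic counts, and the use of scikit-learn's variational-Bayes LDA in place of a Gibbs-sampling implementation are illustrative assumptions, not the authors' exact pipeline.

# Minimal sketch of the RPC (rate of perplexity change) heuristic,
# RPC(i) = |(P_i - P_{i-1}) / (t_i - t_{i-1})|, following Zhao et al. (2015).
# The corpus and model settings below are illustrative assumptions only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for a real document collection.
docs = [
    "salmonella outbreak traced by next generation sequencing",
    "adverse drug reactions and pharmacological side effects",
    "topic modeling of pubmed abstracts in computational biology",
    "latent dirichlet allocation for text mining of clinical notes",
    "genome sequencing reveals antibiotic resistance genes",
    "machine learning methods for biomedical text classification",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)

candidate_topics = [2, 4, 6, 8, 10]   # candidate numbers of topics t_i
perplexities = []
for k in candidate_topics:
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X)
    perplexities.append(lda.perplexity(X))  # held-out data would be used in practice

# RPC between consecutive candidates.
rpc = [
    abs((perplexities[i] - perplexities[i - 1]) /
        (candidate_topics[i] - candidate_topics[i - 1]))
    for i in range(1, len(candidate_topics))
]

# Change-point rule (assumption based on the cited heuristic; verify against [2]):
# take the first candidate where RPC stops decreasing.  If RPC keeps
# decreasing, fall back to the largest candidate and widen the search range.
best = candidate_topics[-1]
for i in range(1, len(rpc)):
    if rpc[i] > rpc[i - 1]:
        best = candidate_topics[i]
        break

print(list(zip(candidate_topics[1:], rpc)))
print("suggested number of topics:", best)

In practice, perplexity would be evaluated on held-out documents and the candidate grid would span a much wider range; the point of the sketch is only the RPC computation and the change-point selection it supports.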

References

[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (January 2003), 993--1022.
[2] Weizhong Zhao, James J. Chen, Roger Perkins, Zhichao Liu, Weigong Ge, Yijun Ding, and Wen Zou. 2015. A heuristic approach to determine an appropriate number of topics in topic modeling. BMC Bioinformatics 16, Suppl 13 (December 2015), S8.

Published In

ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
August 2017
800 pages
ISBN:9781450347228
DOI:10.1145/3107411
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. LDA
  2. parameter setting
  3. text mining
  4. topic modeling

Qualifiers

  • Poster

Conference

BCB '17

Acceptance Rates

ACM-BCB '17 paper acceptance rate: 42 of 132 submissions (32%)
Overall acceptance rate: 254 of 885 submissions (29%)
