Research article · Open access
DOI: 10.1145/3534678.3539247

Sparse Conditional Hidden Markov Model for Weakly Supervised Named Entity Recognition

Published: 14 August 2022

Abstract

Weakly supervised named entity recognition methods train label models to aggregate the token annotations of multiple noisy labeling functions (LFs) without seeing any manually annotated labels. To work well, the label model needs to contextually identify and emphasize well-performing LFs while down-weighting the under-performers. However, evaluating the LFs is challenging due to the lack of ground truth. To address this issue, we propose the sparse conditional hidden Markov model (Sparse-CHMM). Instead of predicting the entire emission matrix as other HMM-based methods do, Sparse-CHMM focuses on estimating its diagonal elements, which are treated as the reliability scores of the LFs. These sparse scores are then expanded to the full-fledged emission matrix with pre-defined expansion functions. We also augment the emission with weighted XOR scores, which track the probabilities of an LF observing incorrect entities. Sparse-CHMM is optimized through unsupervised learning with a three-stage training pipeline that reduces training difficulty and prevents the model from falling into local optima. Compared with the baselines in the Wrench benchmark, Sparse-CHMM achieves an average F1 score improvement of 3.01 on five comprehensive datasets. Experiments show that each component of Sparse-CHMM is effective and that the estimated LF reliabilities correlate strongly with the true LF F1 scores.
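The core idea described above — estimating only the diagonal of each LF's emission matrix as a reliability score and expanding it to a full matrix — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual expansion functions or XOR augmentation: the function name `expand_emission` and the uniform spreading of off-diagonal mass are assumptions for demonstration only.

```python
import numpy as np

def expand_emission(reliabilities: np.ndarray, n_labels: int) -> np.ndarray:
    """Expand per-LF diagonal reliability scores into full emission matrices.

    Each LF k gets an (n_labels x n_labels) emission matrix whose diagonal
    holds its reliability score r_k (probability of emitting the true label)
    and whose remaining mass (1 - r_k) is, in this simplified sketch, spread
    uniformly over the other labels so every row is a valid distribution.
    """
    n_lfs = reliabilities.shape[0]
    emissions = np.empty((n_lfs, n_labels, n_labels))
    for k, r in enumerate(reliabilities):
        off_diagonal = (1.0 - r) / (n_labels - 1)
        emissions[k] = np.full((n_labels, n_labels), off_diagonal)
        np.fill_diagonal(emissions[k], r)  # reliability on the diagonal
    return emissions

# Example: two LFs (one reliable, one noisy) over three entity labels.
E = expand_emission(np.array([0.9, 0.6]), n_labels=3)
assert np.allclose(E.sum(axis=2), 1.0)  # each row sums to one
```

Because only one scalar per LF (per label) is estimated rather than a dense matrix, the label model has far fewer free parameters to learn without ground truth, which is the motivation the abstract gives for the sparse formulation.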

Supplemental Material

MP4 File
Presentation video for KDD'22 Sparse Conditional Hidden Markov Model for Weakly Supervised Named Entity Recognition


    Published In

    KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2022
    5033 pages
    ISBN:9781450393850
    DOI:10.1145/3534678
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. hidden markov model
    2. information extraction
    3. named entity recognition
    4. weak supervision


    Acceptance Rates

    Overall acceptance rate: 1,133 of 8,635 submissions (13%)

