Active learning for part-of-speech tagging: accelerating corpus annotation

Published: 28 June 2007

Abstract

In the construction of a part-of-speech annotated corpus, we are constrained by a fixed budget. A fully annotated corpus is required, but we can afford to label only a subset. We train a Maximum Entropy Markov Model tagger from a labeled subset and automatically tag the remainder. This paper addresses the question of where to focus our manual tagging efforts in order to deliver an annotation of highest quality. In this context, we find that active learning is always helpful. We focus on Query by Uncertainty (QBU) and Query by Committee (QBC) and report on experiments with several baselines and new variations of QBC and QBU, inspired by weaknesses particular to their use in this application. Experiments on English prose and poetry test these approaches and evaluate their robustness. The results allow us to make recommendations for both types of text and raise questions that will lead to further inquiry.
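The two selection strategies named in the abstract can be illustrated with a minimal sketch. It makes assumptions that the abstract does not confirm: sentences are scored by summing per-token entropies, and all function names (`token_entropy`, `qbu_score`, `vote_entropy`, `qbc_score`, `select_batch`) are hypothetical. The paper's own QBU and QBC variants may score sentences differently.

```python
import math
from collections import Counter


def token_entropy(dist):
    """Shannon entropy (in nats) of one token's tag distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def qbu_score(sentence_dists):
    """Query by Uncertainty (assumed form): sum of per-token entropies
    under a single model's predicted tag distributions."""
    return sum(token_entropy(d) for d in sentence_dists)


def vote_entropy(votes):
    """Entropy of the committee's votes for a single token."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def qbc_score(committee_taggings):
    """Query by Committee (assumed form): sum of per-token vote entropies.

    committee_taggings holds one tag sequence per committee member,
    all for the same sentence.
    """
    return sum(vote_entropy(votes) for votes in zip(*committee_taggings))


def select_batch(scores, k):
    """Return the indices of the k highest-scoring sentences."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy data (illustrative only): two sentences, two tokens each.
confident = [{"DT": 1.0}, {"NN": 0.9, "VB": 0.1}]
uncertain = [{"DT": 0.6, "NN": 0.4}, {"NN": 0.5, "VB": 0.5}]
scores = [qbu_score(confident), qbu_score(uncertain)]
to_annotate = select_batch(scores, k=1)  # picks the more uncertain sentence
```

In both cases the sentences with the highest scores would be sent to the human annotator first, and the rest left to the automatic tagger.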


Published In

LAW '07: Proceedings of the Linguistic Annotation Workshop
June 2007
210 pages

Publisher

Association for Computational Linguistics, United States
