DOI: 10.3115/974557.974608

Mixed-initiative development of language processing systems

Published: 31 March 1997

Abstract

Historically, tailoring language processing systems to specific domains and languages for which they were not originally built has required a great deal of effort. Recent advances in corpus-based manual and automatic training methods have shown promise in reducing the time and cost of this porting process. These developments have focused even greater attention on the bottleneck of acquiring reliable, manually tagged training data. This paper describes a new set of integrated tools, collectively called the Alembic Workbench, that uses a mixed-initiative approach to "bootstrapping" the manual tagging process, with the goal of reducing the overhead associated with corpus development. Initial empirical studies using the Alembic Workbench to annotate "named entities" demonstrate that this approach can approximately double the production rate. As an added benefit, the combined efforts of machine and user produce domain-specific annotation rules that can be used to annotate similar texts automatically through the Alembic-NLP system. The ultimate goal of this project is to enable end users to generate a practical domain-specific information extraction system within a single session.

Published In

ANLC '97: Proceedings of the fifth conference on Applied natural language processing
March 1997
417 pages

Publisher

Association for Computational Linguistics

United States
