DOI: 10.3115/1075096.1075152

Is it harder to parse Chinese, or the Chinese Treebank?

Published: 07 July 2003

Abstract

We present a detailed investigation of the challenges posed when applying parsing models developed against English corpora to Chinese. We develop a factored-model statistical parser for the Penn Chinese Treebank, showing the implications of gross statistical differences between the WSJ and Chinese treebanks for the most general methods of parser adaptation. We then provide a detailed analysis of the major sources of statistical parse errors for this corpus, showing their causes and relative frequencies, and show that while some types of errors are due to difficult ambiguities inherent in Chinese grammar, others arise from treebank annotation practices. We show how each type of error can be addressed with simple, targeted changes to the independence assumptions of the maximum likelihood-estimated PCFG factor of the parsing model. Together these changes raise our F1 from 80.7% to 82.6% on our development set and achieve parse accuracy close to the best published figures for Chinese parsing.
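As a rough illustration of what "changing the independence assumptions of a maximum likelihood-estimated PCFG" can mean, the sketch below estimates rule probabilities by relative frequency and then re-estimates them with each expansion conditioned on its parent label. This is only a minimal sketch: the two toy trees, the function names `rules` and `mle_pcfg`, and the choice of parent conditioning are assumptions made for the example and are not taken from the paper, whose actual changes are described in its full text.

```python
from collections import Counter, defaultdict

# Toy trees as nested tuples: (label, child, child, ...); leaves are strings.
# Both trees are invented for illustration and come from no treebank.
TREES = [
    ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats"))),
    ("S", ("NP", "cats"), ("VP", ("V", "sleep"))),
]

def rules(tree, parent=None, annotate=False):
    """Yield (lhs, rhs) for every internal node of the tree.

    With annotate=True the left-hand side becomes "label^parent", so each
    expansion is conditioned on the parent label as well as the node label.
    """
    label, children = tree[0], tree[1:]
    lhs = f"{label}^{parent}" if annotate and parent else label
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield lhs, rhs
    for child in children:
        if not isinstance(child, str):
            yield from rules(child, label, annotate)

def mle_pcfg(trees, annotate=False):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(r for t in trees for r in rules(t, annotate=annotate))
    totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}

plain = mle_pcfg(TREES)
split = mle_pcfg(TREES, annotate=True)
```

In the plain estimate every NP shares one distribution, so `plain[("NP", ("cats",))]` pools subject and object positions. In the parent-conditioned estimate, `split[("NP^S", ("cats",))]` and `split[("NP^VP", ("cats",))]` are estimated separately, which is the kind of weakened independence assumption the abstract alludes to.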


Published In

ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
July 2003
571 pages

Publisher

Association for Computational Linguistics

United States

Acceptance Rates

Overall Acceptance Rate 85 of 443 submissions, 19%
