A Unified Model for Solving the OOV Problem of Chinese Word Segmentation

Published: 12 June 2015

Abstract

This article proposes a unified, character-based, generative model that incorporates additional resources to solve the out-of-vocabulary (OOV) problem of Chinese word segmentation. Within this model, different types of additional information can be utilized independently in corresponding submodels. The article mainly addresses three types of OOV words: unseen dictionary words, named entities, and suffix-derived words, none of which are handled well by current approaches. The results show that our approach effectively improves performance on the first two types, with positive interaction between them in F-score. We also analyze why suffix information is not helpful. After integrating the proposed generative model with the corresponding discriminative approach, our evaluation on various corpora, including SIGHAN-2005, CIPS-SIGHAN-2010, and the Chinese Treebank (CTB), shows that the integrated approach achieves the best performance reported in the literature on all test sets when additional information and resources are allowed.
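As a rough illustration of the metrics the abstract refers to, the sketch below computes word-level precision, recall, F-score, and OOV recall for a single segmented sentence. It is not the authors' evaluation script: it assumes the common bakeoff-style convention that a predicted word is correct only when both of its boundaries match the gold segmentation, and the names `to_spans`, `segmentation_scores`, and `vocabulary` are hypothetical.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_scores(gold_words, pred_words, vocabulary):
    """Return (precision, recall, f_score, oov_recall) for one sentence.

    `vocabulary` is assumed to be the set of words seen in training;
    any gold word outside it is counted as OOV.
    """
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words with both boundaries right

    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0

    # OOV recall: the share of gold OOV words recovered exactly.
    text = "".join(gold_words)
    oov = {(s, e) for (s, e) in gold if text[s:e] not in vocabulary}
    oov_recall = len(oov & pred) / len(oov) if oov else 0.0

    return precision, recall, f_score, oov_recall


if __name__ == "__main__":
    gold = ["我", "喜欢", "北京"]
    pred = ["我", "喜", "欢", "北京"]
    print(segmentation_scores(gold, pred, vocabulary={"我", "喜欢"}))
```

Comparing character spans rather than word strings keeps repeated words at different positions distinct, which matters once scores are aggregated over a whole test set.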

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 14, Issue 3
June 2015
90 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/2791399
© 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2015
Accepted: 01 October 2014
Revised: 01 July 2014
Received: 01 May 2014
Published in TALLIP Volume 14, Issue 3

Author Tags

  1. Chinese word segmentation
  2. domain adaptation
  3. model integration
  4. out-of-vocabulary words

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • International Science & Technology Cooperation Program of China
  • Hi-Tech Research and Development Program (“863” Program) of China
  • Natural Science Foundation of China
