A Unified Model for Solving the OOV Problem of Chinese Word Segmentation

Authors:

Chengqing Zong,

Keh-yih Su

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 14, Issue 3

Article No.: 12, Pages 1 - 29

https://doi.org/10.1145/2699940

Published: 12 June 2015

Abstract

This article proposes a unified, character-based generative model that incorporates additional resources to address the out-of-vocabulary (OOV) problem in Chinese word segmentation. Within the model, each type of additional information is used independently in its own submodel. We focus on three types of OOV words: unseen dictionary words, named entities, and suffix-derived words, none of which are handled well by current approaches. The results show that our approach effectively improves performance on the first two types, and that the two improvements interact positively in F-score. We also analyze why suffix information is not helpful. After integrating the proposed generative model with the corresponding discriminative approach, our evaluation on various corpora, including SIGHAN-2005, CIPS-SIGHAN-2010, and the Chinese Treebank (CTB), shows that the integrated approach achieves the best performance reported in the literature on all test sets when additional information and resources are allowed.

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 14, Issue 3

June 2015

90 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/2791399

Editor:
Richard Sproat
Google, Inc., USA

Copyright © 2015 ACM.

© 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2015

Accepted: 01 October 2014

Revised: 01 July 2014

Received: 01 May 2014

Published in TALLIP Volume 14, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

International Science & Technology Cooperation Program of China
Hi-Tech Research and Development Program (“863” Program) of China
Natural Science Foundation of China

Citations

Cited By

Hombeck, J., Voigt, H., Heggemann, T., Datta, R., and Lawonn, K. 2023. Tell Me Where To Go: Voice-Controlled Hands-Free Locomotion for Virtual Reality Systems. In 2023 IEEE Conference Virtual Reality and 3D User Interfaces (VR). 123-134. Online publication date: Mar-2023.
https://doi.org/10.1109/VR55154.2023.00028

Zhang, C. 2023. Improved Word Segmentation System for Chinese Criminal Judgment Documents. Applied Artificial Intelligence 38, 1. Online publication date: 21-Dec-2023.
https://doi.org/10.1080/08839514.2023.2297524

Neergaard, K., Xu, H., German, J., and Huang, C. 2021. Database of word-level statistics for Mandarin Chinese (DoWLS-MAN). Behavior Research Methods 54, 2, 987-1009. Online publication date: 17-Aug-2021.
https://doi.org/10.3758/s13428-021-01620-7

Chen, J., Hou, H., and Gao, J. 2020. Inside Importance Factors of Graph-Based Keyword Extraction on Chinese Short Text. ACM Transactions on Asian and Low-Resource Language Information Processing 19, 5, 1-15. Online publication date: 21-Jun-2020.
https://dl.acm.org/doi/10.1145/3388971

Neergaard, K. and Huang, C. 2019. Constructing the Mandarin Phonological Network: Novel Syllable Inventory Used to Identify Schematic Segmentation. Complexity 2019, 1-21. Online publication date: 23-Apr-2019.
https://doi.org/10.1155/2019/6979830

Liu, X. 2018. Comparisons of Features for Chinese Word Segmentation. In Geo-Spatial Knowledge and Intelligence. 492-499. Online publication date: 12-Jun-2018.
https://doi.org/10.1007/978-981-13-0896-3_49
