DOI: 10.1145/3573428.3573542
research-article

A Chinese word segmentation method based on dictionary and HMM

Published: 15 March 2023

Abstract

To address segmentation ambiguity and the low success rate of new word discovery in Chinese word segmentation, this paper proposes a method that combines a dictionary with a Hidden Markov Model (HMM). The forward maximum matching and backward maximum matching algorithms first produce coarse segmentation results, and the ambiguous fragments are collected and passed to the HMM. The HMM re-segments these fragments through word order tagging, identifies new words, and adds them to the dictionary. Experimental results show that the proposed algorithm raises the success rate of ambiguity recognition and new word discovery, improves the accuracy, recall, and F1 score of segmentation on ordinary text, and mitigates the decline in Jieba's segmentation performance on professional (domain-specific) text.
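The coarse segmentation stage can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_len` window, the toy dictionary, and the rule that a fragment counts as ambiguous when forward and backward matching disagree on its internal word boundaries are all assumptions made for illustration. The HMM stage that would re-segment the collected fragments is not shown.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right scan: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Greedy right-to-left scan: take the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def boundaries(words):
    """Return the set of character offsets at which a word ends."""
    cuts, pos = set(), 0
    for word in words:
        pos += len(word)
        cuts.add(pos)
    return cuts


def ambiguous_fragments(text, dictionary, max_len=4):
    """Collect spans where the two segmentations disagree (assumed ambiguity criterion)."""
    fwd_cuts = boundaries(forward_max_match(text, dictionary, max_len))
    bwd_cuts = boundaries(backward_max_match(text, dictionary, max_len))
    fragments, start = [], 0
    # Walk the boundaries both segmentations agree on; any span between two
    # agreed boundaries that is cut differently inside is a candidate fragment.
    for end in sorted(fwd_cuts & bwd_cuts):
        inner_fwd = {c for c in fwd_cuts if start < c < end}
        inner_bwd = {c for c in bwd_cuts if start < c < end}
        if inner_fwd != inner_bwd:
            fragments.append(text[start:end])
        start = end
    return fragments


# Hypothetical toy dictionary and sentence, chosen to trigger a disagreement.
toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}
sentence = "研究生命的起源"
fragments = ambiguous_fragments(sentence, toy_dictionary)
```

In this toy case forward matching greedily takes 研究生 while backward matching takes 生命, so the span 研究生命 is flagged; under the scheme described in the abstract, such spans would then be handed to the HMM for secondary segmentation.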


Published In

EITCE '22: Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering
October 2022
1999 pages
ISBN: 9781450397148
DOI: 10.1145/3573428
Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EITCE 2022

Acceptance Rates

Overall Acceptance Rate 508 of 972 submissions, 52%
