
Investigation of Automatic Part-of-Speech Tagging using CRF, HMM and LSTM on Misspelled and Edited Texts

Authors: Farhad Aydinov, Igbal Huseynov, Sofiya Sayadzada, Samir Rustamov

AICCC '22: Proceedings of the 2022 5th Artificial Intelligence and Cloud Computing Conference

Pages 21 - 28

https://doi.org/10.1145/3582099.3582103

Published: 20 April 2023

Abstract

Part-of-speech tagging is the process of assigning each word in a given text its appropriate part of speech, in order to reduce the ambiguity that may arise from the contextual usage of words. In this paper, the problem of word sense disambiguation in the Azerbaijani language is addressed by applying part-of-speech tagging to two different corpora, misspelled text and edited (clean) text, using three machine learning algorithms: Hidden Markov Model, Long Short-Term Memory, and Conditional Random Fields. The paper presents a comparative analysis of the outcomes of these algorithms and their accuracy scores. The misspelled dataset for the experiments was provided by Unibank from its chatbot dialogues, while the clean textual data was retrieved from books and newspapers in Azerbaijani. The experiments showed that the Bidirectional LSTM achieves the highest accuracy on both the edited (98.2%) and the noisy (96.2%) corpora. The suggested models can be used in applications of algorithms that focus on part-of-speech tags and the syntactic structure of Azerbaijani, an agglutinative language belonging to the Turkic language family, which enables the research to be extended to other agglutinative languages with similar grammatical structure.
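
For readers unfamiliar with the task, the Hidden Markov Model approach named in the abstract can be sketched as a first-order tagger decoded with the Viterbi algorithm. This is an illustrative toy and not the authors' implementation: the English training sentences, the three-tag set, the add-one smoothing, and the names `train_hmm`, `viterbi`, and `TRAIN` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (not the paper's data): lists of (word, tag) pairs.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> Counter of next tags
    emissions = defaultdict(Counter)    # tag -> Counter of words
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumed smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unknown (e.g. misspelled) words

    # best[i][tag] = (log score of the best path ending in tag at i, backpointer)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions["<s>"], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags) + emit, p)
                for p in tags
            )

    # Follow backpointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


model = train_hmm(TRAIN)
clean_tags = viterbi(["the", "cat", "barks"], *model)
noisy_tags = viterbi(["the", "dgo", "barks"], *model)  # "dgo" is a misspelling
```

In this toy setting the misspelled token is never seen in training, so its tag is decided by the smoothed transition probabilities alone. That is one simple way a sequence model can remain usable on noisy text, although the paper's own handling of misspellings may differ.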


Cited By

Phukan R, Baruah N, Sarma S, Konwar D (2024). Parts-of-Speech Tagger in Assamese Using LSTM and Bi-LSTM. Advances in Data-Driven Computing and Intelligent Systems, pp. 19-31. Online publication date: 26-Feb-2024.
https://doi.org/10.1007/978-981-99-9524-0_3
Khalid H, Siddique A, Ali R (2023). Custom Hidden Markov Models for Effective Part-of-Speech Tagging. 2023 18th International Conference on Emerging Technologies (ICET), pp. 33-37. Online publication date: 6-Nov-2023.
https://doi.org/10.1109/ICET59753.2023.10374930

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Lexical semantics


Published In


AICCC '22: Proceedings of the 2022 5th Artificial Intelligence and Cloud Computing Conference

December 2022

302 pages

ISBN:9781450398749

DOI:10.1145/3582099

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Conference

AICCC 2022

AICCC 2022: 2022 5th Artificial Intelligence and Cloud Computing Conference

December 17 - 19, 2022

Osaka, Japan
