DOI: 10.1145/3582099.3582103

Investigation of Automatic Part-of-Speech Tagging using CRF, HMM and LSTM on Misspelled and Edited Texts

Published: 20 April 2023

Abstract

Part-of-speech tagging is the process of assigning each word in a given text its appropriate part of speech in order to reduce the ambiguity that may arise from the contextual usage of the words. In this paper, the problem of word sense disambiguation in the Azerbaijani language is addressed by applying part-of-speech tagging to two different data corpora, misspelled and edited (clean) text, using three machine learning algorithms: Hidden Markov Model, Long Short-Term Memory, and Conditional Random Fields. The paper presents a comparative analysis of the outcomes of these algorithms and their accuracy scores. The misspelled dataset for the experiments was provided by Unibank from its chatbot dialogues, while the clean textual data was retrieved from books and newspapers in Azerbaijani. The experiments showed that the Bidirectional LSTM has the highest accuracy scores on both the edited (98.2%) and noisy (96.2%) data corpora. The suggested models can be used in applications of algorithms that focus on part-of-speech tags and the syntactic structure of the Azerbaijani language, an agglutinative language belonging to the Turkic language family, thus enabling the research to be extended to other agglutinative languages with similar grammatical structure.
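
For readers unfamiliar with sequence tagging, the following is a minimal sketch of how a Hidden Markov Model assigns tags with Viterbi decoding. It is not the paper's implementation: the two-tag set, the English toy vocabulary, every probability value, and the `UNSEEN` floor for out-of-vocabulary (for example, misspelled) words are assumptions chosen only for illustration.

```python
import math

# Toy HMM parameters: hypothetical values for illustration, not from the paper.
TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.7, "VERB": 0.3}
TRANS = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMIT = {
    "NOUN": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
    "VERB": {"dogs": 0.05, "bark": 0.6, "chase": 0.35},
}
UNSEEN = 1e-6  # assumed crude floor for words a tag never emitted


def emit_logp(tag, word):
    """Log emission probability, falling back to the UNSEEN floor."""
    return math.log(EMIT[tag].get(word, UNSEEN))


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []

    # scores[t] = best log-probability of any tag path ending in tag t
    scores = {t: math.log(START[t]) + emit_logp(t, words[0]) for t in TAGS}
    backpointers = []

    for word in words[1:]:
        new_scores, pointers = {}, {}
        for t in TAGS:
            # Pick the previous tag that maximises the path score into t.
            best_prev = max(TAGS, key=lambda p: scores[p] + math.log(TRANS[p][t]))
            new_scores[t] = (
                scores[best_prev] + math.log(TRANS[best_prev][t]) + emit_logp(t, word)
            )
            pointers[t] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final tag to recover the full path.
    tag = max(TAGS, key=lambda t: scores[t])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi(["dogs", "bark"]))
    print(viterbi(["dgs", "bark"]))  # misspelled first word
```

In this toy setting the misspelled token has no emission entry, so its tag is decided by the start and transition probabilities alone. This is one simple way to see why noisy text is harder for a tagger that relies on word identity.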


    Published In

    AICCC '22: Proceedings of the 2022 5th Artificial Intelligence and Cloud Computing Conference
    December 2022
    302 pages
    ISBN:9781450398749
    DOI:10.1145/3582099

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. CRFs
    2. HMMs
    3. LSTM
    4. Part of Speech Tagging
    5. PoS Tagging for agglutinative languages

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AICCC 2022
