Construction of Mizo: English Parallel Corpus for Machine Translation

Published: 24 August 2023

Abstract

A parallel corpus is a key component of both statistical and neural machine translation (NMT). While most research focuses on machine translation itself, studies on corpus creation remain limited for many languages, and no research paper on a Mizo–English corpus exists yet. A high-quality parallel corpus is required for natural language processing tasks including machine translation, chatbots, transliteration, and cross-language information retrieval. This work investigates parallel corpus creation techniques and applies them to the Mizo–English language pair. A further goal is to test machine translation on the newly constructed corpus. We contributed Mizo language support to the LF Aligner tool so that it could be used for Mizo sentence alignment during corpus development. The result is the first large-scale Mizo–English parallel corpus, with over 529K sentences. The pre-processed corpus was used for Mizo-to-English NMT, and the resulting system was evaluated using BLEU, the character n-gram F-score (ChrF), and Translation Edit Rate (TER). Our system achieved BLEU 45.08, ChrF 65.36, and TER 41.16, setting a new benchmark for Mizo-to-English translation.
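To make the ChrF figure easier to interpret, the following is a minimal sketch of a character n-gram F-score for a single hypothesis and reference. It is an illustration only and not the evaluation code used in this work: the function names `char_ngrams` and `simple_chrf` are hypothetical, and the choices of n-gram orders 1 to 6, beta = 2, and whitespace removal are assumptions taken from common ChrF defaults. It will not reproduce the reported scores, which would normally come from an established evaluation toolkit run over the whole test set.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace (an assumption)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)  # mean precision over orders
    r = sum(recalls) / len(recalls)        # mean recall over orders
    if p + r == 0:
        return 0.0
    # F-beta: beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Placeholder English strings standing in for a system output and a reference.
    print(simple_chrf("the cat sat on the mat", "the cat sat on the mat"))
    print(simple_chrf("the cat sat on a mat", "the cat sat on the mat"))
```

Because the score is built from character n-grams rather than whole words, a hypothesis that differs from the reference only in a short function word or an inflection still receives substantial credit, which is one reason ChrF is often reported alongside BLEU.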

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 22, Issue 8
August 2023
373 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3615980

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 24 August 2023
Online AM: 21 July 2023
Accepted: 13 July 2023
Revised: 13 June 2023
Received: 20 November 2022

Author Tags

1. Mizo
2. corpus construction
3. bilingual corpus
4. parallel text
5. machine translation

Qualifiers

• Short-paper
