Investigating Unsupervised Neural Machine Translation for Low-resource Language Pair English-Mizo via Lexically Enhanced Pre-trained Language Models

Published: 23 August 2023

Abstract

The vast majority of the world's languages are considered low-resource. Since the availability of large parallel corpora is crucial to the success of most modern machine translation approaches, improving machine translation for low-resource languages is a key challenge. Most unsupervised translation techniques benefit closely related languages with substantial quantities of monolingual data. To facilitate research in this direction for the extremely low-resource language pair English (en) and Mizo (lus), we have developed a parallel and monolingual corpus for the Mizo language from various news websites. We explore Unsupervised Neural Machine Translation (UNMT) based on the developed monolingual data. We observe that cross-lingual word embedding (CLWE) initializations on subword-segmented data during pre-training, based on both masked language modelling and sequence-to-sequence generation tasks, improve translation performance. We experiment with cross-lingual alignment, as well as combined alignment and joint training, for learning the cross-lingual embedding representations. We also report baseline performances and the impact of CLWE initialization using semi-supervised and supervised neural machine translation. Empirical results show that both CLWE initializations outperform the baselines for the distant language pair English-Mizo.
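The abstract describes initializing UNMT pre-training with cross-lingual word embeddings obtained by aligning independently trained monolingual embedding spaces. A common alignment recipe (the orthogonal Procrustes step used in MUSE/VecMap-style methods) solves for an orthogonal map from the source space to the target space over a seed dictionary. The sketch below is illustrative only, not the paper's exact pipeline; the helper name `procrustes_align` and the toy data are our own:

```python
import numpy as np

def procrustes_align(src, tgt):
    """Orthogonal map W minimizing ||src @ W - tgt||_F.

    src, tgt: (n, d) embedding matrices for n seed dictionary pairs.
    Closed-form solution: W = U V^T, where U S V^T = SVD(src^T tgt).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy sanity check: if the target space really is a rotation of the
# source space, the alignment should recover that rotation exactly.
rng = np.random.default_rng(0)
d = 4
q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal "rotation"
src = rng.normal(size=(50, d))                # 50 seed source embeddings
tgt = src @ q                                 # their "translations"
w = procrustes_align(src, tgt)                # w ≈ q up to float error
```

In practice the seed dictionary comes from an identical-strings heuristic or unsupervised induction, and the mapped embeddings are then used to initialize the (sub)word embedding layer of the pre-trained language model.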


Cited By

  • (2024) Automatic translation system from Classical Chinese to Modern Chinese based on large language models. International Conference on Computer Graphics, Artificial Intelligence, and Data Processing (ICCAID 2023), 119. DOI: 10.1117/12.3026580. Online publication date: 27-Mar-2024.
  • (2024) Improving translation between English, Assamese bilingual pair with monolingual data, length penalty and model averaging. International Journal of Information Technology 16, 3, 1539–1549. DOI: 10.1007/s41870-023-01714-9. Online publication date: 30-Jan-2024.
  • (2023) Investigation of Data Augmentation Techniques for Assamese-English Language Pair Machine Translation. 2023 18th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 1–6. DOI: 10.1109/iSAI-NLP60301.2023.10354679. Online publication date: 27-Nov-2023.


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 8
    August 2023
    373 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3615980

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 August 2023
    Online AM: 13 July 2023
    Accepted: 06 July 2023
    Revised: 07 April 2023
    Received: 07 July 2022
    Published in TALLIP Volume 22, Issue 8


    Author Tags

    1. Unsupervised neural machine translation
    2. cross-lingual word embeddings
    3. low resource languages
    4. Mizo

    Qualifiers

    • Research-article
