Investigating Unsupervised Neural Machine Translation for Low-resource Language Pair English-Mizo via Lexically Enhanced Pre-trained Language Models

Published: 23 August 2023

Abstract

The vast majority of the world's languages are considered low-resource. Since the availability of large parallel corpora is crucial to the success of most modern machine translation approaches, improving machine translation for low-resource languages is a key challenge. Most unsupervised translation techniques benefit closely related languages with substantial quantities of monolingual data. To facilitate research in this direction for the extremely low-resource language pair English (en) and Mizo (lus), we have developed a parallel and monolingual corpus for the Mizo language from various news websites. We explore Unsupervised Neural Machine Translation (UNMT) based on the developed monolingual data. We observe that cross-lingual word embedding (CLWE) initializations on subword-segmented data during pre-training, based on both masked language modelling and sequence-to-sequence generation tasks, improve translation performance. We experiment with cross-lingual alignment, as well as combined alignment and joint training, for learning the cross-lingual embedding representations. We also report baseline performances and the impact of CLWE initialization using semi-supervised and supervised neural machine translation. Empirical results show that both CLWE initializations outperform the baselines for the distant language pair English-Mizo.
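The abstract describes initializing UNMT pre-training with cross-lingual word embeddings obtained by aligning independently trained monolingual embedding spaces. A common alignment recipe (the orthogonal Procrustes step used in MUSE/VecMap-style methods) solves for an orthogonal map from the source space to the target space over a seed dictionary. The sketch below is illustrative only, not the paper's exact pipeline; the helper name `procrustes_align` and the toy data are our own:

```python
import numpy as np

def procrustes_align(src, tgt):
    """Orthogonal map W minimizing ||src @ W - tgt||_F.

    src, tgt: (n, d) embedding matrices for n seed dictionary pairs.
    Closed-form solution: W = U V^T, where U S V^T = SVD(src^T tgt).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy sanity check: if the target space really is a rotation of the
# source space, the alignment should recover that rotation exactly.
rng = np.random.default_rng(0)
d = 4
q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal "rotation"
src = rng.normal(size=(50, d))                # 50 seed source embeddings
tgt = src @ q                                 # their "translations"
w = procrustes_align(src, tgt)                # w ≈ q up to float error
```

In practice the seed dictionary comes from an identical-strings heuristic or unsupervised induction, and the mapped embeddings are then used to initialize the (sub)word embedding layer of the pre-trained language model.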


Cited By

  • (2024) Automatic translation system from Classical Chinese to Modern Chinese based on large language models. International Conference on Computer Graphics, Artificial Intelligence, and Data Processing (ICCAID 2023), 119. DOI: 10.1117/12.3026580. Online publication date: 27-Mar-2024.
  • (2024) Improving translation between English, Assamese bilingual pair with monolingual data, length penalty and model averaging. International Journal of Information Technology 16, 3, 1539–1549. DOI: 10.1007/s41870-023-01714-9. Online publication date: 30-Jan-2024.
  • (2023) Investigation of Data Augmentation Techniques for Assamese-English Language Pair Machine Translation. 2023 18th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 1–6. DOI: 10.1109/iSAI-NLP60301.2023.10354679. Online publication date: 27-Nov-2023.


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 8
    August 2023
    373 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3615980

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 August 2023
    Online AM: 13 July 2023
    Accepted: 06 July 2023
    Revised: 07 April 2023
    Received: 07 July 2022
    Published in TALLIP Volume 22, Issue 8


    Author Tags

    1. Unsupervised neural machine translation
    2. cross-lingual word embeddings
    3. low resource languages
    4. Mizo

    Qualifiers

    • Research-article
