DOI: 10.1145/3570991.3571068
short-paper

UDAAN - Machine Learning based Post-Editing tool for Document Translation

Published: 04 January 2023

Abstract

We introduce UDAAN, an open-source post-editing tool that reduces manual editing effort so that publishable-standard documents can be produced quickly in several Indic languages. UDAAN has an end-to-end Machine Translation (MT) plus post-editing pipeline: users upload a document to obtain raw MT output and then edit the raw translations in the tool. UDAAN offers several advantages: i. Domain-aware, vocabulary-based, lexically constrained MT. ii. Source-target and target-target lexicon suggestions for users, with replacements based on the lexicon alignment between the source and target texts. iii. Translation suggestions based on logs created during user interaction. iv. Source-target sentence alignment visualisation that reduces the cognitive load of users during editing. v. Translated outputs in multiple formats: docs, LaTeX, and PDF. We also provide the facility to use around 100 in-domain dictionaries for lexicon-aware machine translation. Although we limit our experiments to English-to-Hindi translation, our tool is independent of the source and target languages. Experimental results based on usage of the tool and users’ feedback show that our tool speeds up translation by approximately a factor of three compared to the baseline method of translating documents from scratch. Our tool is available for both Windows and Linux platforms. The tool is open-source under the MIT license, and the source code can be accessed from our website, https://www.udaanproject.org. Demonstration and tutorial videos for various features of our tool are also available. Our MT pipeline can be accessed at https://udaaniitb.aicte-india.org/udaan/translate/.
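
To make the idea of dictionary-driven lexicon suggestions more concrete, the following is a minimal sketch and not UDAAN's actual implementation. It assumes a flat source-to-target glossary and a simple substring check on the MT output. The function name `suggest_replacements` and the two glossary entries are hypothetical and chosen only for illustration; the abstract describes suggestions based on lexicon alignment, which this sketch does not attempt.

```python
import re
from typing import Dict, List, Tuple


def suggest_replacements(
    source: str, target: str, glossary: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (source_term, preferred_translation) pairs for glossary terms
    that occur in the source sentence but whose preferred translation is
    absent from the MT output.

    Assumption: single-word glossary terms and plain substring matching on
    the target side; a real system would rely on word alignment instead.
    """
    # Tokenise the source on word characters so punctuation is ignored.
    source_tokens = set(re.findall(r"\w+", source.lower()))
    suggestions = []
    for term, preferred in glossary.items():
        if term.lower() in source_tokens and preferred not in target:
            suggestions.append((term, preferred))
    return suggestions


# Hypothetical in-domain (physics) glossary entries, for illustration only.
glossary = {"force": "बल", "work": "कार्य"}
source = "Work is force times displacement."
raw_mt = "काम बल गुणा विस्थापन है।"

suggestions = suggest_replacements(source, raw_mt, glossary)
print(suggestions)
```

In this example the raw output already contains the glossary translation of "force" but uses a colloquial word for "work", so only the latter is flagged for the post-editor.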



    Published In

    CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
    January 2023
    357 pages
    ISBN:9781450397971
    DOI:10.1145/3570991
    © 2023 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. document translation
    2. machine translation
    3. post-editing software

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    CODS-COMAD 2023

    Acceptance Rates

    Overall Acceptance Rate 197 of 680 submissions, 29%
