DOI: 10.1145/3411170.3411258
research-article

Building an Italian-Chinese Parallel Corpus for Machine Translation from the Web

Published: 14 September 2020

Abstract

In an increasingly globalized world, the ability to understand texts in other languages, particularly those written in different scripts and character sets, has become a necessity. It is also valuable when travelling across countries where different languages are spoken. Bilingual corpora are therefore critical resources, since they are the basis of every state-of-the-art automatic translation system; building a parallel corpus, however, is usually a complex and very expensive operation. This paper describes an innovative approach we have defined and adopted to automatically build an Italian-Chinese parallel corpus, with the aim of using it to train an Italian-Chinese Neural Machine Translation system. Our main idea is to scrape parallel texts from the Web: we defined a general pipeline and describe each step, from the selection of appropriate data sources to the sentence alignment method. A final evaluation of the approach shows that 90% of the sentences were correctly aligned. The resulting corpus consists of more than 6,000 Italian-Chinese sentence pairs, which form the basis for building a Machine Translation system.
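To make the sentence alignment step more concrete, the following is a minimal sketch of a length-based aligner that uses dynamic programming. It is not the method used in the paper, which the abstract does not specify. The function names `bead_cost` and `align`, the penalty values, and the Italian-to-Chinese character ratio of 0.4 are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Allowed alignment "beads": (source sentences used, target sentences used, penalty).
# The penalty values are illustrative assumptions, not tuned on real data.
BEADS = [(1, 1, 0.0), (1, 0, 3.0), (0, 1, 3.0), (2, 1, 1.0), (1, 2, 1.0)]


def bead_cost(src_len: int, tgt_len: int, ratio: float) -> float:
    """Normalized gap between the target length and the length expected from the source."""
    expected = src_len * ratio
    return abs(tgt_len - expected) / max(expected, tgt_len, 1.0)


def align(src: List[str], tgt: List[str],
          ratio: float = 0.4) -> List[Tuple[List[int], List[int]]]:
    """Return a list of (source indices, target indices) pairs with minimal total cost.

    `ratio` is the assumed number of target characters per source character.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable state with each allowed bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + bead_cost(s_len, t_len, ratio) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Backward pass: recover the chosen beads from the back-pointers.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    italian = ["Buongiorno a tutti.", "Oggi parliamo di traduzione automatica."]
    chinese = ["大家好。", "今天我们讨论机器翻译。"]
    for src_ids, tgt_ids in align(italian, chinese):
        print(src_ids, tgt_ids)
```

A purely length-based cost is a deliberate simplification. In practice the character ratio would have to be estimated from data, and lexical or translation-based signals would likely be needed to approach the alignment accuracy reported above.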


      Published In

      GoodTechs '20: Proceedings of the 6th EAI International Conference on Smart Objects and Technologies for Social Good
      September 2020
      286 pages
      ISBN:9781450375597
      DOI:10.1145/3411170
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      In-Cooperation

      • EAI: The European Alliance for Innovation

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Italian-Chinese corpus
      2. bilingual annotation
      3. machine translation
      4. sentence alignment

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      GoodTechs '20
