Inducing a Bilingual Lexicon from Short Parallel Multiword Sequences

Authors:

Taisuke Harada,

Kumiko Tanaka-Ishii,

Eiichiro Sumita

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 16, Issue 3

Article No.: 15, Pages 1 - 20

https://doi.org/10.1145/3003726

Published: 17 March 2017

Abstract

This article proposes a technique for mining bilingual lexicons from pairs of short parallel word sequences. The technique builds a generative model from a training corpus of such pairs. The model is a hierarchical nonparametric Bayesian model that induces a bilingual lexicon directly during training. It learns in an unsupervised manner and is designed to exploit characteristics of the language pairs being mined. In addition to the commonly used word-pair frequency information, the model can use the character alignments within the words themselves. It can therefore mine transliterations, and it can use reliably aligned transliteration pairs to support the mining of other words in their context. The model can also reorder and delete words during alignment, and it can operate without full segmentation information. In this work, we study two mining tasks, based on the English-Japanese and English-Chinese language pairs, and compare the proposed approach to baselines based on simpler models that use only word-pair frequency information. Our results show that the proposed method mines bilingual word pairs at higher levels of precision and recall than the baselines.
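
The abstract does not describe the baselines beyond saying that they use only word-pair frequency information. As a rough illustration of that kind of signal, and not of the authors' baselines or of the proposed Bayesian model, the sketch below scores co-occurring word pairs in a toy set of short parallel sequences with the Dice coefficient. The function names `dice_scores` and `best_translations`, the choice of Dice as the association measure, and the toy English-Japanese data are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def dice_scores(pairs):
    """Score each co-occurring (source, target) word pair with the Dice coefficient.

    `pairs` is an iterable of (source_words, target_words) tuples, one per
    parallel sequence pair. Counts are per sequence pair, not per token.
    """
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src_words, tgt_words in pairs:
        src_set, tgt_set = set(src_words), set(tgt_words)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        # Every source word co-occurs with every target word in the same pair.
        joint_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2.0 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint_count.items()
    }


def best_translations(scores):
    """Keep the highest-scoring target word for each source word."""
    best = {}
    for (s, t), score in scores.items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


# Hypothetical toy data: short, pre-segmented parallel sequences.
toy_corpus = [
    (["tokyo", "tower"], ["東京", "タワー"]),
    (["tokyo", "station"], ["東京", "駅"]),
    (["kyoto", "station"], ["京都", "駅"]),
    (["kyoto", "tower"], ["京都", "タワー"]),
]

scores = dice_scores(toy_corpus)
lexicon = best_translations(scores)
for source_word, (target_word, score) in sorted(lexicon.items()):
    print(source_word, target_word, score)
```

A frequency-only scorer like this needs each word to recur in varied contexts and assumes both sides are already segmented. The abstract positions character-level alignment, reordering, deletion, and operation without full segmentation as the capabilities that the proposed model adds beyond this kind of information.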

Index Terms

Inducing a Bilingual Lexicon from Short Parallel Multiword Sequences
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
      2. Machine translation

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 16, Issue 3

September 2017

167 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3041821

Editor:
Nianwen Xue
Brandeis University, Waltham, USA

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 March 2017

Accepted: 01 September 2016

Revised: 01 June 2016

Received: 01 November 2015

Published in TALLIP Volume 16, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 4
Total Downloads: 257
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 0

Reflects downloads up to 30 Oct 2024

Cited By

Chaudhary A, Shekhar S (2024). A Framework to Find Single Language Version Using Pattern Analysis in Mixed Script Queries. 2024 2nd International Conference on Disruptive Technologies (ICDT), 1593-1601. DOI: 10.1109/ICDT61202.2024.10489697. Online publication date: 15-Mar-2024.
https://doi.org/10.1109/ICDT61202.2024.10489697
Chaudhary A, Shekhar S (2023). A Study of Transliteration Approaches. 2023 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), 147-153. DOI: 10.1109/ICCCIS60361.2023.10425698. Online publication date: 3-Nov-2023.
https://doi.org/10.1109/ICCCIS60361.2023.10425698
Liu D, Yang K, Qu Q, Lv J (2019). Ancient–Modern Chinese Translation with a New Large Training Dataset. ACM Transactions on Asian and Low-Resource Language Information Processing 19:1, 1-13. DOI: 10.1145/3325887. Online publication date: 31-May-2019.
https://dl.acm.org/doi/10.1145/3325887
Wushouer M, Lin D, Ishida T, Murakami Y (2018). A Constraint Approach to Lexicon Induction for Low-Resource Languages. Services Computing for Language Resources, 109-123. DOI: 10.1007/978-981-10-7793-7_7. Online publication date: 24-Feb-2018.
https://doi.org/10.1007/978-981-10-7793-7_7
