Inducing a Bilingual Lexicon from Short Parallel Multiword Sequences

Published: 17 March 2017

Abstract

This article proposes a technique for mining bilingual lexicons from pairs of short parallel word sequences. The technique builds a generative model from a training corpus of such pairs. The model is a hierarchical nonparametric Bayesian model that induces a bilingual lexicon directly during training. It learns in an unsupervised manner and is designed to exploit characteristics of the language pairs being mined. In addition to the commonly used word-pair frequency information, the model can employ the internal character alignments within the words themselves. It can therefore mine transliterations and use reliably aligned transliteration pairs to support the mining of other words in their context. The model can also perform word reordering and word deletion during alignment, and it can operate in the absence of full segmentation information. In this work, we study two mining tasks based on the English-Japanese and English-Chinese language pairs, and compare the proposed approach to baselines based on simpler models that use only word-pair frequency information. Our results show that the proposed method mines bilingual word pairs at higher levels of precision and recall than the baselines.
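
As a rough illustration of what "word-pair frequency information" can look like in a nonparametric Bayesian setting, the sketch below computes a Chinese-restaurant-process style predictive probability for a word pair. It is not the model from the article: the class name `PairCache`, the concentration parameter `alpha`, and the uniform base probability are all assumptions made for this example, and the character-alignment, reordering, and deletion components described in the abstract are left out.

```python
from collections import Counter


class PairCache:
    """Hypothetical CRP-style cache over (source_word, target_word) pairs.

    The predictive probability of a pair interpolates between its observed
    count and a base probability, weighted by the concentration `alpha`.
    """

    def __init__(self, alpha, base_prob):
        self.alpha = alpha            # concentration parameter (assumed value)
        self.base_prob = base_prob    # callable: (src, tgt) -> prior probability
        self.counts = Counter()       # how often each pair has been observed
        self.total = 0                # total number of observed pairs

    def prob(self, src, tgt):
        """Predictive probability: (count + alpha * base) / (total + alpha)."""
        count = self.counts[(src, tgt)]
        base = self.base_prob(src, tgt)
        return (count + self.alpha * base) / (self.total + self.alpha)

    def add(self, src, tgt):
        """Record one more observation of the pair."""
        self.counts[(src, tgt)] += 1
        self.total += 1

    def remove(self, src, tgt):
        """Undo one observation, as a sampler would before resampling."""
        if self.counts[(src, tgt)] <= 0:
            raise ValueError("pair was never added")
        self.counts[(src, tgt)] -= 1
        self.total -= 1


def uniform_base(src, tgt):
    """Placeholder prior: every pair is equally likely (illustrative only)."""
    return 0.001


cache = PairCache(alpha=1.0, base_prob=uniform_base)
for _ in range(3):
    cache.add("tokyo", "東京")
print(cache.prob("tokyo", "東京"), cache.prob("tokyo", "大阪"))
```

A pair that has been seen often receives most of its probability from its count, while an unseen pair falls back on the base probability. In a richer model, the base probability is the natural place to plug in a character-level score, which is one way a frequency-only baseline and a transliteration-aware model could differ.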

      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 16, Issue 3
      September 2017
      167 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3041821

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 17 March 2017
      Accepted: 01 September 2016
      Revised: 01 June 2016
      Received: 01 November 2015
      Published in TALLIP Volume 16, Issue 3

      Author Tags

      1. Bilingual lexicon
      2. alignment
      3. mining

      Qualifiers

      • Research-article
      • Research
      • Refereed
