Stemming to improve translation lexicon creation from bitexts

Published: 01 July 2006

Abstract

Arabic is a morphologically rich language that presents significant challenges to many natural language processing applications, because a single word often conveys a complex meaning that decomposes into several morphemes (i.e., prefix, stem, and suffix). Segmenting words into morphemes can improve the extraction of English/Arabic translation pairs from parallel texts. This paper describes two algorithms, and their combination, for automatically extracting an English/Arabic bilingual dictionary from parallel texts available in the Internet archive, using an Arabic light stemmer as a preprocessing step. Without the light stemmer, the overall system precision and recall were 88.6% and 81.5%, respectively; after applying the light stemmer to the Arabic documents, they rose to 91.6% and 82.6%. The algorithms have variables whose values can be changed to control the system's precision and recall. As with most such systems, the accuracy of our system is directly proportional to the number of sentence pairs used. However, our system can extract translation pairs from a very small parallel corpus: as few as two sentences in one language and two in the other, provided the system's requirements are met. Moreover, the system can extract word pairs that are translations of each other, synonyms, and explanations of a word in the other language. By controlling the system variables, we could achieve 100% precision for the output bilingual dictionary, at the cost of low recall.
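
To make the preprocessing step concrete, here is a minimal Python sketch of light stemming, that is, stripping a small set of frequent prefixes and suffixes without full morphological analysis. It is not the stemmer used in the paper: the affix lists, the `MIN_STEM_LEN` guard, and the function name `light_stem` are illustrative assumptions, since the abstract does not specify them.

```python
# Illustrative affix lists (an assumption, loosely modelled on published
# Arabic light stemmers); the paper's own lists are not given in the abstract.
PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")
MIN_STEM_LEN = 3  # hypothetical guard: never shrink a word below 3 letters


def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix, keeping a minimum stem length."""
    for prefix in PREFIXES:  # longer prefixes are listed before shorter ones
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:  # longer suffixes are listed before shorter ones
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[: -len(suffix)]
            break
    return word


if __name__ == "__main__":
    # "the book", "and the book", "her book": three surface forms of one stem
    for surface in ("الكتاب", "والكتاب", "كتابها"):
        print(surface, "->", light_stem(surface))
```

In this sketch the three surface forms reduce to the same stem, so their co-occurrence counts with an English candidate such as "book" could be pooled. That is one plausible reason stemming would help when the parallel corpus is small, though the abstract itself reports only the overall precision and recall gains.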

Published In

Information Processing and Management: an International Journal  Volume 42, Issue 4
July 2006
273 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. English/Arabic translation
  2. multilingual dictionaries
  3. multilingual thesaurus
  4. stemming
