Stemming to improve translation lexicon creation form bitexts

Authors:

Mohamed Abdel Fattah,

Shingo Kuroiwa

Information Processing and Management: an International Journal, Volume 42, Issue 4

Pages 1003 - 1016

https://doi.org/10.1016/j.ipm.2005.07.002

Published: 01 July 2006

Abstract

Arabic is a morphologically rich language that presents significant challenges to many natural language processing applications, because a word often conveys complex meanings that can be decomposed into several morphemes (i.e., prefix, stem, suffix). Segmenting words into morphemes can improve the extraction of English/Arabic translation pairs from parallel texts. This paper describes two algorithms, and their combination, for automatically extracting an English/Arabic bilingual dictionary from parallel texts available in the Internet archive, using an Arabic light stemmer as a preprocessing step. Before the Arabic light stemmer was applied, the overall system precision and recall were 88.6% and 81.5%, respectively; after it was applied to the Arabic documents, precision and recall increased to 91.6% and 82.6%, respectively. The algorithms have variables whose values can be changed to control the system's precision and recall. As with most such systems, the accuracy of our system is directly proportional to the number of sentence pairs used. However, our system is able to extract translation pairs from a very small parallel corpus: it can extract translations from as few as two sentences in one language and two sentences in the other, provided the system's requirements are met. Moreover, the system is able to extract word pairs that are translations of each other, synonyms, and explanations of a word in the other language. By controlling the system variables, we could achieve 100% precision for the output bilingual dictionary, albeit with low recall.
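The abstract does not spell out the two extraction algorithms or the stemmer's affix lists, so the following is only a minimal sketch of the general idea, under stated assumptions: a generic prefix/suffix-stripping light stemmer (the affix lists are illustrative, not the paper's) and a Dice co-occurrence score standing in for the paper's extraction algorithms. The names `light_stem`, `dice_pairs`, `threshold`, and `toy_bitext` are hypothetical.

```python
from collections import Counter
from itertools import product

# Illustrative affix lists (assumption): common Arabic clitics,
# not the affix inventory of the stemmer used in the paper.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def dice_pairs(bitext, threshold=0.5, stem=None):
    """Return (english, arabic) pairs whose Dice co-occurrence score >= threshold.

    Dice co-occurrence is a stand-in technique (assumption), not the
    algorithms described in the paper.
    """
    en_count, ar_count, joint = Counter(), Counter(), Counter()
    for en_sent, ar_sent in bitext:
        en_words = set(en_sent.lower().split())
        ar_words = {stem(w) if stem else w for w in ar_sent.split()}
        en_count.update(en_words)
        ar_count.update(ar_words)
        joint.update(product(en_words, ar_words))
    scores = {
        (en, ar): 2 * c / (en_count[en] + ar_count[ar])
        for (en, ar), c in joint.items()
    }
    return {pair: s for pair, s in scores.items() if s >= threshold}


# Toy aligned sentence pairs (made up for illustration only).
toy_bitext = [
    ("the book", "الكتاب"),
    ("a book", "كتاب"),
    ("the pen", "القلم"),
]

raw_pairs = dice_pairs(toy_bitext, threshold=0.9)
stemmed_pairs = dice_pairs(toy_bitext, threshold=0.9, stem=light_stem)
```

In this toy corpus, stemming merges the two surface forms of the Arabic word for "book", so the pair ("book", "كتاب") co-occurs in both sentences and clears the high threshold; without stemming its evidence is split across two forms and the pair is missed. Raising `threshold` plays the role of the tunable variables mentioned above: fewer pairs are kept, which tends to raise precision and lower recall.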

Cited By

Jabbar, A., Iqbal, S., Tamimy, M., Hussain, S., & Akhunzada, A. (2020). Empirical evaluation and study of text stemming algorithms. Artificial Intelligence Review, 53(8), 5559-5588. doi:10.1007/s10462-020-09828-3. Online publication date: 15-Apr-2020.
https://dl.acm.org/doi/10.1007/s10462-020-09828-3

Fattah, M., & Ren, F. (2008). English-Arabic proper-noun transliteration-pairs creation. Journal of the American Society for Information Science and Technology, 59(10), 1675-1687. doi:10.5555/1398018.1398038. Online publication date: 1-Aug-2008.
https://dl.acm.org/doi/10.5555/1398018.1398038

Fattah, M., Ren, F., & Kuroiwa, S. (2006). Text-based English-Arabic sentence alignment. Proceedings of the 2006 international conference on Intelligent computing: Part II (pp. 748-753). doi:10.5555/1882540.1882640. Online publication date: 16-Aug-2006.
https://dl.acm.org/doi/10.5555/1882540.1882640

Published In

Information Processing and Management: an International Journal Volume 42, Issue 4

July 2006

273 pages

ISSN:0306-4573

Publisher

Pergamon Press, Inc.

United States
