DOI: 10.3115/1118935.1118943

Learning bilingual translations from comparable corpora to cross-language information retrieval: hybrid statistics-based and linguistics-based approach

Published: 07 July 2003

Abstract

Recent years have seen increased interest in the construction and use of large corpora. With this interest has come an expansion of their application to knowledge acquisition and bilingual terminology extraction. This paper presents an approach to bilingual lexicon extraction from non-aligned comparable corpora, combined with linguistics-based pruning, and evaluates it on Cross-Language Information Retrieval. We propose and explore a two-stage translation model for acquiring bilingual terminology from comparable corpora, followed by disambiguation and selection of the best translation alternatives on the basis of morphological knowledge. Evaluations on a large-scale Japanese-English test collection, using different weighting schemes of the SMART retrieval system, confirmed the effectiveness of the proposed combination of the two-stage comparable corpora approach and linguistics-based pruning for Cross-Language Information Retrieval.
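The abstract does not spell out the two stages or the pruning rule, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that the two stages are a source-to-target pass and a target-to-source pass over context vectors mapped through a seed lexicon, that candidates are combined by multiplying the two cosine scores, and that pruning keeps candidates whose part-of-speech matches the source term. All function names, the seed lexicons, and the toy romanized data are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse context vectors (dicts)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def translate_vector(vec, seed):
    """Map a context vector into the other language via a seed lexicon."""
    out = {}
    for word, weight in vec.items():
        for t in seed.get(word, []):
            out[t] = out.get(t, 0.0) + weight
    return out

def rank(term_vec, candidate_vecs, seed):
    """Score each candidate against the translated context vector."""
    projected = translate_vector(term_vec, seed)
    return {c: cosine(projected, v) for c, v in candidate_vecs.items()}

def two_stage_candidates(src, src_vecs, tgt_vecs, seed_st, seed_ts, top_k=2):
    """Assumed combination: keep targets that also rank src highly in reverse."""
    forward = rank(src_vecs[src], tgt_vecs, seed_st)
    kept = {}
    for tgt in sorted(forward, key=forward.get, reverse=True)[:top_k]:
        backward = rank(tgt_vecs[tgt], src_vecs, seed_ts)
        if src in sorted(backward, key=backward.get, reverse=True)[:top_k]:
            kept[tgt] = forward[tgt] * backward[src]
    return kept

def prune_by_pos(src, candidates, pos_src, pos_tgt):
    """Assumed pruning rule: drop candidates whose POS differs from src's."""
    return {t: s for t, s in candidates.items()
            if pos_tgt.get(t) == pos_src.get(src)}

# Hypothetical toy data: context vectors, seed lexicons, and POS tags.
src_vecs = {"ginkou": {"kane": 2.0, "kouza": 1.0},
            "kawa": {"mizu": 2.0, "sakana": 1.0}}
tgt_vecs = {"bank": {"money": 2.0, "account": 1.0},
            "river": {"water": 2.0, "fish": 1.0},
            "deposit": {"money": 1.0, "account": 2.0}}
seed_st = {"kane": ["money"], "kouza": ["account"],
           "mizu": ["water"], "sakana": ["fish"]}
seed_ts = {"money": ["kane"], "account": ["kouza"],
           "water": ["mizu"], "fish": ["sakana"]}
pos_src = {"ginkou": "noun", "kawa": "noun"}
pos_tgt = {"bank": "noun", "river": "noun", "deposit": "verb"}

candidates = two_stage_candidates("ginkou", src_vecs, tgt_vecs, seed_st, seed_ts)
pruned = prune_by_pos("ginkou", candidates, pos_src, pos_tgt)
print(sorted(candidates), sorted(pruned))
```

In this toy setting both "bank" and "deposit" survive the bidirectional pass, and the part-of-speech filter then removes "deposit" because it is tagged as a verb. A real system would build the context vectors from corpus co-occurrence statistics and take the tags from a morphological analyzer.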

Published In

AsianIR '03: Proceedings of the sixth international workshop on Information retrieval with Asian languages - Volume 11
July 2003, 175 pages
Program Chair: Jun Adachi

Publisher

Association for Computational Linguistics, United States

Author Tags

1. comparable corpora
2. cross-language information retrieval
3. disambiguation
4. part-of-speech
5. translation
