DOI: 10.5555/2856695.2856716

Multi-language text indexing for internet retrieval

Published: 25 June 1997

Abstract

We address the issues associated with indexing multilingual collections of information, such as those found on the internet. In particular, we examine the task of language identification and the use of stemming algorithms for several European languages. We also present the lessons learned from using the SPIDER information retrieval system as a search engine over the intranet of ETH Zurich, a multilingual intranet containing documents in English, French, German, and Italian.

Published In

RIAO '97: Computer-Assisted Information Searching on Internet
June 1997, 603 pages
Conference Chairs: L. Devroye, C. Chrisment

Publisher

LE CENTRE DE HAUTES ETUDES INTERNATIONALES D'INFORMATIQUE DOCUMENTAIRE, Paris, France

Author Tags

  1. language identification
  2. multilingual retrieval
  3. stemming

Qualifiers

  • Research-article
