skip to main content
research-article

Incremental cluster-based retrieval using compressed cluster-skipping inverted files

Published: 20 June 2008 Publication History

Abstract

We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvement while yielding comparable, or sometimes better, effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency, regardless of query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size.

References

[1]
Altingovde, I. S., Can, F., and Ulusoy, Ö. 2006. Algorithms for within-cluster searches using inverted files. In Proceedings of the International Symposium on Computer and Information Sciences (ISCIS). 707--716.
[2]
Altingovde, I. S., Özcan, R., Öcalan, H. C., Can, F., and Ulusoy, Ö. 2007. Large-Scale cluster-based retrieval experiments on Turkish texts. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 891--892.
[3]
Anh, N. V., Kretser, O. de, and Moffat, A. 2001. Vector-Space ranking with effective early termination. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 35--42
[4]
Anh, V. N. and Moffat, A. 2005a. Inverted index compression using word-aligned binary codes. Inf. Retr. 8, 1, 151--166.
[5]
Anh, V. N. and Moffat, A. 2005b. Simplified similarity scoring using term ranks. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 226--233.
[6]
Anh, V. N. and Moffat, A. 2006. Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. Seattle, WA. ACM Press, 372--379.
[7]
Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., and Silvestri, F. 2007. The impact of caching on search engines. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 183--190.
[8]
Blandford, D. and Blelloch, G. 2002. Index compression through document reordering. In Proceedings of the Data Compression Conference. IEEE, 342--351.
[9]
Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. 30, 1--7, 107--117.
[10]
Brown, E. W. 1995. Fast evaluation of structured queries for information retrieval. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, New York, 30--38.
[11]
Buckley, C. and Lewit, A. F. 1985. Optimization of inverted vector searches. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, New York, 97--110.
[12]
Buckley C. and Voorhees E. M. 2000. Evaluating evaluation measure stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, New York, 33--40.
[13]
Cacheda, F. and Baeza-Yates, R. 2004. An optimistic model for searching Web directories. In Proceedings of the 26th European Conference on IR Research (ECIR). 364--377.
[14]
Cacheda, F., Carneiro, V., Guerrero, C., and Viña, Á. 2003. Optimization of restricted searches in Web directories using hybrid data structures. In Proceedings of the 25th European Conference on IR Research (ECIR). 436--451.
[15]
Cambazoglu, B. B. 2006. Models and algorithms for parallel text retrieval. Ph.D. thesis, Bilkent University, Ankara, Turkey.
[16]
Cambazoglu, B. B. and Aykanat, C. 2006. Performance of query processing implementations in ranking-based text retrieval systems using inverted indices. Inf. Process. Manage. 42, 4, 875--898.
[17]
Can, F. 1993. Incremental clustering for dynamic information processing. ACM Trans. Inf. Syst. 11, 2, 143--164.
[18]
Can, F. 1994. On the efficiency of best-match cluster searches. Inf. Process. Manage. 30, 3, 343--361.
[19]
Can, F., Altingovde, I. S., and Demir, E. 2004. Efficiency and effectiveness of query processing in cluster-based retrieval. Inf. Syst. 29, 8, 697--717.
[20]
Can, F. and Ozkarahan, E. A. 1990. Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Trans. Database Syst. 15, 4, 483--517.
[21]
Croft, W. B. 1980. A model of cluster searching based on classification. Inf. Syst. 5, 3, 189--195.
[22]
Harman, D. 1992. Ranking algorithms. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, eds. Prentice Hall, Englewood Cliffs, NJ, (Chapter 14), 363--392.
[23]
Jain, A. K. and Dubes, R. C. 1988. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.
[24]
Jardine, N. and van Rijsbergen, C. J. 1971. The use of hierarchical clustering in information retrieval. Inf. Storage Retr. 7, 217--240.
[25]
Lester, N., Moffat, A., Webber, W., and Zobel, J. 2005. Space-Limited ranked query evaluation using adaptive pruning. In Proceedings of the 6th International Conference on Web Information Systems Engineering, New York, 470--477.
[26]
Liu, X. and Croft, W. B. 2004. Cluster-Based retrieval using language models. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 186--193.
[27]
Long, X. and Suel, T. 2003. Optimized query execution in large search engines with global page ordering. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). 129--140.
[28]
Moffat, A. and Zobel, J. 1996. Self-Indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst. 14, 4, 349--379.
[29]
Murray, D. M. 1972. Document retrieval based on clustered files. Ph.D. thesis, Cornell University. Also Rep. ISR-20. National Science Foundation, National Library of Medicine.
[30]
Nuray, R. and Can, F. 2006. Automatic ranking of information retrieval systems using data fusion. Inf. Process. Manage. 42, 3, 595--614.
[31]
Persin, M. 1994. Document filtering for fast ranking. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 339--348.
[32]
Persin, M., Zobel, J., and Sacks-Davis, R. 1996. Filtered document retrieval with frequency-sorted indexes. J. Amer. Soc. Inf. Sci. 47, 10, 749--764.
[33]
Salton, G. 1975. Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.
[34]
Salton, G. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
[35]
Salton, G. and Buckley, C. 1988. Term weighting approaches in automatic text retrieval. Inf. Process. Manage. 24, 5, 513-523.
[36]
Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill, New York.
[37]
Silvestri, F., Orlando, S., and Perego, R. 2004. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 305--312.
[38]
Silvestri, F. 2007. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research (ECIR). 101--112.
[39]
Stanford University. 2007. Webbase project Web site. www-diglib.stanford.edu/~testbed/doc2/WebBase.
[40]
Strohman, T. and Croft, W. B. 2007. Efficient document retrieval in main memory. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 175--182.
[41]
Trec 2006. Trec. Web site. http://trec.nist.gov.
[42]
Tombros, A. 2002. The effectiveness of query-based hierarchic clustering of documents for information retrieval. Ph.D. thesis, University of Glasgow, Glasgow, UK.
[43]
Van Rijsbergen, C. J. 1979. Information Retrieval, 2nd ed. Butterworths, London.
[44]
Voorhees, E. M. 1985. Cluster hypothesis revisited. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 188--196.
[45]
Voorhees, E. M. 1986a. The effectiveness and efficiency of agglomerative hierarchical clustering in document retrieval. Ph.D. thesis, Cornell University, Ithaca, NY.
[46]
Voorhees, E. M. 1986b. The efficiency of inverted index and cluster searches. In Proceedings of the 9th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York. 164--174.
[47]
Willett, P. 1988. Recent trends in hierarchical document clustering: A critical review. Inf. Process. Manage. 24, 5, 577--597.
[48]
Witten, I. H., Moffat, A., and Bell, T. C. 1994. Managing Gigabytes Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York.
[49]
Zettair 2007. The Zettair search engine. http://www.seg.rmit.edu.au/zettair/.
[50]
Zobel, J. and Moffat, A. 2006. Inverted files for text search engines. ACM Comput. Surv. 38, 2, 1--56.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Information Systems
ACM Transactions on Information Systems  Volume 26, Issue 3
June 2008
236 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1361684
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 June 2008
Accepted: 01 December 2007
Revised: 01 November 2007
Received: 01 February 2006
Published in TOIS Volume 26, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Best match
  2. cluster-based retrieval (CBR)
  3. cluster-skipping inverted index structure (CS-IIS)
  4. full search (FS)
  5. index compression
  6. inverted index structure (IIS)
  7. query processing

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)11
  • Downloads (Last 6 weeks)5
Reflects downloads up to 03 Jan 2025

Other Metrics

Citations

Cited By

View all

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media