DOI: 10.1145/1718487.1718534
research-article

Revisiting globally sorted indexes for efficient document retrieval

Published: 04 February 2010

Abstract

Efficient document retrieval has been studied extensively in both the IR and web search communities. One important technique for improving retrieval efficiency is early termination, which speeds up query processing by avoiding a scan of the entire inverted lists. Most early termination techniques first build new inverted indexes by sorting the inverted lists by either term-dependent information (e.g., term frequencies or term IR scores) or term-independent information (e.g., the static rank of the document), and then apply appropriate retrieval strategies to the resulting indexes. Although methods based only on static rank have been shown to be ineffective for early termination, methods based on term-independent information still offer many advantages. In this paper, we propose new techniques for organizing inverted indexes based on term-independent information beyond static rank, and we study new retrieval strategies on the resulting indexes. We perform a detailed experimental evaluation of our new techniques and compare them with existing approaches. Our results on the TREC GOV and GOV2 data sets show that our techniques can improve query efficiency significantly.
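
To make the idea of early termination on a globally sorted index concrete, the following is a minimal sketch, not the method proposed in the paper. It assumes the simplest term-independent ordering mentioned above (every inverted list sorted by descending static rank), an additive score of static rank plus per-term scores, and conjunctive (AND) query semantics. The names `build_index`, `top_k`, `doc_terms`, and `static_rank`, as well as the toy integer scores, are hypothetical and chosen only for illustration.

```python
import heapq

def build_index(doc_terms, static_rank):
    """Build inverted lists that all follow one global order:
    descending static rank (ties broken by document id)."""
    order = sorted(doc_terms, key=lambda d: (-static_rank[d], d))
    index = {}
    for doc in order:
        for term, score in doc_terms[doc].items():
            index.setdefault(term, []).append((doc, score))
    return index

def top_k(query, index, static_rank, k):
    """Conjunctive top-k with early termination.
    Assumed scoring: static_rank[doc] + sum of the doc's query-term scores.
    Returns (results, number_of_driver_postings_scanned)."""
    lists = [index.get(term, []) for term in query]
    if k <= 0 or not lists or any(not postings for postings in lists):
        return [], 0
    # Upper bound on the term-dependent part of any document's score.
    max_term_sum = sum(max(s for _, s in postings) for postings in lists)
    driver = lists[0]
    others = [dict(postings) for postings in lists[1:]]  # lookup tables
    heap = []      # min-heap holding the best k (score, doc) pairs so far
    scanned = 0
    for doc, score in driver:
        # Every remaining document has static rank <= static_rank[doc],
        # so none of them can score above this bound.
        if len(heap) == k and static_rank[doc] + max_term_sum <= heap[0][0]:
            break
        scanned += 1
        if all(doc in table for table in others):
            total = static_rank[doc] + score + sum(t[doc] for t in others)
            if len(heap) < k:
                heapq.heappush(heap, (total, doc))
            elif total > heap[0][0]:
                heapq.heapreplace(heap, (total, doc))
    return sorted(heap, reverse=True), scanned

# Toy data (hypothetical): integer scores keep the arithmetic exact.
static_rank = {"a": 90, "b": 80, "c": 30, "d": 20, "e": 10}
doc_terms = {
    "a": {"x": 5, "y": 5},
    "b": {"x": 4, "y": 6},
    "c": {"x": 5, "y": 1},
    "d": {"x": 2, "y": 2},
    "e": {"x": 1},
}
index = build_index(doc_terms, static_rank)
results, scanned = top_k(["x", "y"], index, static_rank, k=2)
print(results, scanned)
```

In this toy run the scan stops after two postings of the five in the list for `x`, because the third document's static rank plus the largest possible term contribution cannot exceed the current second-best score. The bound is only tight when static rank dominates the score, which is consistent with the abstract's remark that ordering by static rank alone tends to be ineffective and motivates richer term-independent orderings.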

Published In

WSDM '10: Proceedings of the third ACM international conference on Web search and data mining
February 2010
468 pages
ISBN: 9781605588896
DOI: 10.1145/1718487

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. dynamic index pruning
2. globally-sorted index
3. top-k

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
