skip to main content
10.1145/1277741.1277831acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
Article

A time machine for text search

Published: 23 July 2007 Publication History

Abstract

Text search over temporally versioned document collections such as web archives has received little attention as a research problem. As a consequence, there is no scalable and principled solution to search such a collection as of a specified time. In this work, we address this shortcoming and propose an efficient solution for time-travel text search by extending the inverted file index to make it ready for temporal search. We introduce approximate temporal coalescing as a tunable method to reduce the index size without significantly affecting the quality of results. In order to further improve the performance of time-travel queries, we introduce two principled techniques to trade off index size for its performance. These techniques can be formulated as optimization problems that can be solved to near-optimality. Finally, our approach is evaluated in a comprehensive series of experiments on two large-scale real-world datasets. Results unequivocally show that our methods make it possible to build an efficient "time machine" scalable to large versioned text collections.

References

[1]
V. N. Anh and A. Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. In SIGIR, 2006.
[2]
V. N. Anh and A. Moffat. Pruning Strategies for Mixed-Mode Querying. In CIKM, 2006.
[3]
P. G. Anick and R. A. Flynn. Versioning a Full-Text Information Retrieval System. In SIGIR, 1992.
[4]
R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[5]
K. Berberich, S. Bedathur, T. Neumann, and G. Weikum. A Time Machine for Text search. Technical Report MPI-I-2007-5-002, Max-Planck Institute for Informatics, 2007.
[6]
M. H. Böhlen, R. T. Snodgrass, and M. D. Soo. Coalescing in Temporal Databases. In VLDB, 1996.
[7]
P. Boldi, M. Santini, and S. Vigna. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations. In WAW, 2004.
[8]
A. Z. Broder, N. Eiron, M. Fontoura, M. Herscovici, R. Lempel, J. McPherson, R. Qi, and E. J. Shekita. Indexing Shared Content in Information Retrieval Systems. In EDBT, 2006.
[9]
C. Buckley and A. F. Lewit. Optimization of Inverted Vector Searches. In SIGIR, 1985.
[10]
M. Burrows and A. L. Hisgen. Method and Apparatus for Generating and Searching Range-Based Index of Word Locations. U.S. Patent 5,915,251, 1999.
[11]
S. Büttcher and C. L. A. Clarke. A Document-Centric Approach to Static Index Pruning in Text Retrieval Systems. In CIKM, 2006.
[12]
D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. S. Maarek, and A. Soffer. Static Index Pruning for Information Retrieval Systems. In SIGIR, 2001.
[13]
http://www.europarchive.org.
[14]
R. Fagin, R. Kumar, and D. Sivakumar. Comparing Top k Lists. SIAM J. Discrete Math., 17(1):134--160, 2003.
[15]
R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. J. Comput. Syst. Sci., 66(4):614--656, 2003.
[16]
S. Guha, K. Shim, and J. Woo. REHIST: Relative Error Histogram Construction Algorithms. In VLDB, 2004.
[17]
M. Hersovici, R. Lempel, and S. Yogev. Efficient Indexing of Versioned Document Sequences. In ECIR, 2007.
[18]
http://www.archive.org.
[19]
Y. E. Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD, 1995.
[20]
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal Histograms with Quality Guarantees. In VLDB, 1998.
[21]
E. J. Keogh, S. Chu, D. Hart, and M. J. Pazzani. An Online Algorithm for Segmenting Time Series. In ICDM, 2001.
[22]
S. Kirkpatrick, D. G. Jr., and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671--680, 1983.
[23]
J. Kleinberg and E. Tardos. Algorithm Design. Addison-Wesley, 2005.
[24]
U. Manber. Introduction to Algorithms: A Creative Approach. Addison-Wesley, 1989.
[25]
K. Nørvåg and A. O. N. Nybø. DyST: Dynamic and Scalable Temporal Text Indexing. In TIME, 2006.
[26]
J. M. Ponte and W. B. Croft. A Language Modeling Approach to Information Retrieval. In SIGIR, 1998.
[27]
S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In TREC, 1999.
[28]
B. Salzberg and V. J. Tsotras. Comparison of Access Methods for Time-Evolving Data. ACM Comput. Surv., 31(2):158--221, 1999.
[29]
M. Stack. Full Text Search of Web Archive Collections. In IWAW, 2006.
[30]
E. Terzi and P. Tsaparas. Efficient Algorithms for Sequence Segmentation. In SIAM-DM, 2006.
[31]
M. Theobald, G. Weikum, and R. Schenkel. Top-k Query Evaluation with Probabilistic Guarantees. In VLDB, 2004.
[32]
http://www.wikipedia.org.
[33]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann publishers Inc., 1999.
[34]
J. Zhang and T. Suel. Efficient Search in Large Textual Collections with Redundancy. In WWW, 2007.
[35]
J. Zobel and A. Moffat. Inverted Files for Text Search Engines. ACM Comput. Surv., 38(2):6, 2006. 526.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN:9781595935977
DOI:10.1145/1277741
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 July 2007

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. temporal text indexing
  2. time-travel text search
  3. web archives

Qualifiers

  • Article

Conference

SIGIR07
Sponsor:
SIGIR07: The 30th Annual International SIGIR Conference
July 23 - 27, 2007
Amsterdam, The Netherlands

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)7
  • Downloads (Last 6 weeks)1
Reflects downloads up to 08 Feb 2025

Other Metrics

Citations

Cited By

View all

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media