Article

To search or to crawl?: towards a query optimizer for text-centric tasks

Authors: Panagiotis G. Ipeirotis, Eugene Agichtein, Luis Gravano

SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data

Pages 265 - 276

https://doi.org/10.1145/1142473.1142504

Published: 27 June 2006

Abstract

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or "crawl," the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this paper, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated cost-model parameters. Overall, our approach helps predict the most appropriate execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.
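As a rough illustration of what a cost-based choice between a crawl-based and a query-based plan could look like, the sketch below picks the fastest plan whose estimated recall meets a target. It is a minimal sketch under stated assumptions, not the cost model developed in the paper: the `PlanEstimate` type, the `choose_plan` function, and all numbers are hypothetical, and the time and recall estimates are assumed to be supplied by some separate estimator.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PlanEstimate:
    """Hypothetical per-plan estimates; not taken from the paper."""
    name: str              # e.g. "crawl" or "query"
    est_time_sec: float    # estimated execution time in seconds
    est_recall: float      # estimated output completeness in [0, 1]


def choose_plan(plans: Sequence[PlanEstimate],
                target_recall: float) -> Optional[PlanEstimate]:
    """Return the fastest plan whose estimated recall meets the target.

    Returns None when no plan is estimated to reach the target recall.
    """
    feasible = [p for p in plans if p.est_recall >= target_recall]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.est_time_sec)


# Illustrative numbers only (assumptions, not experimental results).
plans = [
    PlanEstimate("crawl", est_time_sec=10_000.0, est_recall=1.0),
    PlanEstimate("query", est_time_sec=1_200.0, est_recall=0.6),
]

low = choose_plan(plans, target_recall=0.5)
high = choose_plan(plans, target_recall=0.9)
print(low.name if low else None)
print(high.name if high else None)
```

With these made-up estimates, a modest recall target favors the cheaper query-based plan, while a high recall target leaves only the exhaustive crawl as feasible. A real optimizer would replace the hard-coded estimates with values derived from the database and task at hand.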



Index Terms

To search or to crawl?: towards a query optimizer for text-centric tasks
1. Computing methodologies
  1. Modeling and simulation
    1. Simulation theory
      1. Systems theory
2. Mathematics of computing
  1. Information theory



Published In


SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data

June 2006

830 pages

ISBN:1595934340

DOI:10.1145/1142473

General Chairs: Clement Yu (University of Illinois at Chicago), Peter Scheuermann (Northwestern University)

Program Chair: Surajit Chaudhuri (Microsoft Research)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SIGMOD/PODS06: International Conference on Management of Data and Symposium on Principles of Database Systems

June 27 - 29, 2006

Chicago, IL, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%


