DOI: 10.1145/1142473.1142504
Article

To search or to crawl?: towards a query optimizer for text-centric tasks

Published: 27 June 2006

Abstract

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or "crawl," the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this paper, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated cost-model parameters. Overall, our approach helps predict the most appropriate execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.
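To make the idea of a cost-based choice between crawl- and query-based plans concrete, here is a minimal Python sketch. It is not the cost model developed in the paper: the names `Database`, `Costs`, `scan_time`, `query_time`, and `choose_plan` are hypothetical, and the assumptions (useful documents spread uniformly for scanning, a fixed precision and a recall ceiling for querying) are deliberately simplistic stand-ins for the task-specific parameters the abstract mentions.

```python
from dataclasses import dataclass
from math import ceil, inf


@dataclass
class Database:
    num_docs: int      # total documents in the text database
    num_useful: int    # documents that contribute to the task output


@dataclass
class Costs:
    retrieve: float    # seconds to fetch one document
    process: float     # seconds to run the task on one document
    query: float       # seconds to issue one search query


def scan_time(db, costs, target_recall):
    # Toy assumption: useful documents are spread uniformly, so reaching
    # recall r requires scanning a fraction r of the database.
    docs = ceil(target_recall * db.num_docs)
    return docs * (costs.retrieve + costs.process)


def query_time(db, costs, target_recall, docs_per_query, precision, max_recall):
    # Toy assumption: every query returns docs_per_query new documents, a
    # fixed fraction `precision` of them useful, and queries can reach at
    # most `max_recall` of the useful documents.
    if target_recall > max_recall:
        return inf  # the query plan cannot reach the requested recall
    useful_needed = ceil(target_recall * db.num_useful)
    queries = ceil(useful_needed / (docs_per_query * precision))
    docs = queries * docs_per_query
    return queries * costs.query + docs * (costs.retrieve + costs.process)


def choose_plan(db, costs, target_recall, **query_params):
    # Pick the plan with the smaller estimated time for the target recall.
    times = {
        "scan": scan_time(db, costs, target_recall),
        "query": query_time(db, costs, target_recall, **query_params),
    }
    return min(times, key=times.get), times


db = Database(num_docs=100_000, num_useful=1_000)
costs = Costs(retrieve=0.01, process=0.1, query=0.5)
params = dict(docs_per_query=10, precision=0.5, max_recall=0.6)

plan_low, times_low = choose_plan(db, costs, 0.5, **params)
plan_high, times_high = choose_plan(db, costs, 0.9, **params)
print(plan_low, plan_high)
```

Under these made-up numbers the query plan wins at a moderate recall target, while the scan plan is the only option once the target exceeds what queries can reach. That mirrors the trade-off between execution time and completeness described above, though real parameter values would have to be estimated for each task and database.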



      Published In

      SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data
      June 2006
      830 pages
      ISBN: 1595934340
      DOI: 10.1145/1142473
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. focused crawling
      2. information extraction
      3. metasearching
      4. query optimization
      5. research
      6. text databases

      Qualifiers

      • Article

      Conference

      SIGMOD/PODS06

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%
