Processing XML streams with deterministic automata and stream indexes

Published: 12 December 2004

Abstract

We consider the problem of evaluating a large number of XPath expressions on a stream of XML packets. We contribute two novel techniques. The first is to use a single Deterministic Finite Automaton (DFA). The contribution here is to show that the DFA can be used effectively for this problem: in our experiments we achieve a constant throughput, independently of the number of XPath expressions. The major issue is the size of the DFA, which, in theory, can be exponential in the number of XPath expressions. We provide a series of theoretical results and experimental evaluations that show that the lazy DFA has a small number of states, for all practical purposes. These results are of general interest in XPath processing, beyond stream processing. The second technique is the Streaming IndeX (SIX), which consists of adding a small amount of binary data to each XML packet that allows the query processor to achieve significant speedups. As an application of these techniques we describe the XML Toolkit (XMLTK), a collection of command-line tools providing highly scalable XML data processing.
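The lazy DFA idea can be illustrated with a minimal sketch. It is not the XMLTK implementation: it assumes a small XPath subset (child and descendant axes, name tests, and the `*` wildcard, with no predicates), and the names `parse_path`, `LazyDFA`, and `match_stream` are hypothetical. Each DFA state is a set of NFA states, and a transition is computed only the first time the stream actually needs it, so the table grows with the data seen rather than with the worst-case exponential bound.

```python
import io
import re
import xml.etree.ElementTree as ET


def parse_path(path):
    """Split '/a//b/*' into [('/', 'a'), ('//', 'b'), ('/', '*')]."""
    return re.findall(r"(//|/)([^/]+)", path)


class LazyDFA:
    """One automaton shared by all path expressions (illustrative only)."""

    def __init__(self, paths):
        self.queries = [parse_path(p) for p in paths]
        # An NFA state is (query index, number of steps matched so far).
        self.start = frozenset((q, 0) for q in range(len(self.queries)))
        self.table = {}  # (dfa_state, tag) -> dfa_state, filled on demand

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:  # materialise each transition only once
            nxt = set()
            for q, i in state:
                steps = self.queries[q]
                if i == len(steps):
                    continue  # accepting NFA states have no outgoing edges
                axis, label = steps[i]
                if axis == "//":
                    nxt.add((q, i))  # descendant axis: self-loop on any tag
                if label in ("*", tag):
                    nxt.add((q, i + 1))
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def accepting(self, state):
        """Indexes of the queries fully matched in this DFA state."""
        return sorted(q for q, i in state if i == len(self.queries[q]))


def match_stream(dfa, xml_bytes):
    """Return (query index, element path) pairs in document order."""
    stack = [dfa.start]  # one DFA state per open element
    path = []
    hits = []
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            state = dfa.step(stack[-1], elem.tag)
            stack.append(state)
            path.append(elem.tag)
            for q in dfa.accepting(state):
                hits.append((q, "/" + "/".join(path)))
        else:
            stack.pop()
            path.pop()
    return hits


dfa = LazyDFA(["/a/b", "//c", "/a/*/c"])
hits = match_stream(dfa, b"<a><b><c/></b><d><c/></d></a>")
```

Per element, the work is one dictionary lookup and a stack push or pop, regardless of how many expressions were loaded, which is the intuition behind the constant-throughput claim. In this example only five transitions are ever materialised for the three expressions.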

Supplementary Material

green-appendix (p1-green.pdf)
Online Appendix to: Processing XML streams with deterministic automata and stream indexes


Published In

ACM Transactions on Database Systems, Volume 29, Issue 4
December 2004
250 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/1042046

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. XML processing
  2. stream processing

Qualifiers

  • Article
