skip to main content
research-article

Fast XML document filtering by sequencing twig patterns

Published: 14 October 2009 Publication History

Abstract

XML-enabled publish-subscribe (pub-sub) systems have emerged as an increasingly important tool for e-commerce and Internet applications. In a typical pub-sub system, subscribed users specify their interests in a profile expressed in the XPath language. Each new data content is then matched against the user profiles so that the content is delivered only to the interested subscribers. As the number of subscribed users and their profiles can grow very large, the scalability of the service is critical to the success of pub-sub systems. In this article, we propose a novel scalable filtering system called iFiST that transforms user profiles of a twig pattern expressed in XPath into sequences using the Prüfer's method. Consequently, instead of breaking a twig pattern into multiple linear paths and matching them separately, FiST performs holistic matching of twig patterns with each incoming document in a bottom-up fashion. FiST organizes the sequences into a dynamic hash-based index for efficient filtering, and exploits the commonality among user profiles to enable shared processing during the filtering phase. We demonstrate that the holistic matching approach reduces filtering cost and memory consumption, thereby improving the scalability of FiST.

References

[1]
Altinel, M. and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of the 26th VLDB Conference. 53--64.
[2]
Apache. Apache Xerces C++ parser. http://xml.apache.org/xerces-c/.
[3]
Bar-Yossef, Z., Fontoura, M., and Josifovski, V. 2004. On the memory requirements of XPath evaluation over XML streams. In Proceedings of the 23rd ACM Symposium on Principles of Database Systems. 177--188.
[4]
Bar-Yossef, Z., Fontoura, M., and Josifovski, V. 2005. Buffering in query evaluation over XML streams. In Proceedings of the 24th ACM Symposium on Principles of Database Systems. 216--227.
[5]
Berglund, A., Boag, S., Chamberlin, D., Fernáandez, M. F., Kay, M., Robie, J., and Siméon, J. XML path language (XPath) 2.0. http://www.w3.org/TR/xpath20/.
[6]
Boag, S., Chamberlin, D., Fernández, M. F., Florescu, D., Robie, J., and Siméon, J. XQuery 1.0: An XML query language. http://www.w3.org/TR/xquery/.
[7]
Bow, C., Hughes, B., and Bird, S. 2003. Towards a general model of interlinear text. In Proceedings of EMELD Workshop.
[8]
Bruno, N., Gravano, L., Koudas, N., and Srivastava, D. 2003. Navigation- vs. index-based XML multi-query processing. In Proceedings of the 19th IEEE International Conference on Data Engineering. 139--150.
[9]
Bruno, N., Koudas, N., and Srivastava, D. 2002. Holistic twig joins: Optimal XML pattern matching. In Proceedings of the 2002 ACM-SIGMOD Conference.
[10]
Candan, K. S., Hsiung, W.-P., Chen, S., Tatemura, J., and Agrawal, D. 2006. AFilter: Adaptable XML filtering with prefix-caching and suffix-clustering. In Proceedings of the 32nd VLDB Conference. 559--570.
[11]
Carzaniga, A., Rutherford, M. J., and Wolf, A. L. 2004. A routing scheme for content-based networking. In Proceedings of IEEE InfoCom 2004. 918--928.
[12]
Castro, M., Druschel, P., marie Kermarrec, A., and Rowstron, A. 2002. Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE J. Select. Areas Comm. 20, 8, 1489--1499.
[13]
Chan, C. Y., Felber, P., Garofalakis, M. N., and Rastogi, R. 2002a. Efficient filtering of XML documents with XPath expressions. In Proceedings of the 18th IEEE International Conference on Data Engineering. 235--244.
[14]
Chan, C. Y., Felber, P., Garofalakis, M. N., and Rastogi, R. 2002b. Efficient filtering of XML documents with XPath expressions. VLDB J. 11, 4, 354--379.
[15]
Chan, C. Y. and Ni, Y. 2007. Efficient XML data dissemination with piggybacking. In Proceedings of the ACM-SIGMOD Conference. 737--748.
[16]
Chandramouli, B., Phillips, J., and Yang, J. 2007. Value-Based notification conditions in large-scale publish/subscribe systems. In Proceedings of the 33rd VLDB Conference. 878--889.
[17]
Chen, S., Li, H.-G., Tatemura, J., Hsiung, W.-P., Agrawal, D., and Candan, K. S. 2006a. Twig2 Stack: Bottom-Up processing of generalized-tree-pattern queries over XML documents. In Proceedings of the 32nd VLDB Conference. 283--294.
[18]
Chen, Y., Davidson, S. B., and Zheng, Y. 2006b. An efficient XPath query processor for XML streams. In Proceedings of the 22nd IEEE International Conference on Data Engineering. 79.
[19]
Chiu, A.-T. and Hsu, J.-L. 2006. An automaton-based filtering system for streaming musicxml. In Proceedings of the International Conference on Semantic Web&Web Services. 177--178.
[20]
Clark, J. 1999. XSL transformations (XSLT) version 1.0. http://www.w3.org/TR/xslt/.
[21]
Diao, Y., Altinel, M., Michael J. Franklin, H. Z., and Fischer, P. 2003. Path sharing and predicate evaluation for high-performance XML filtering. ACM Trans. Datab. Syst. 28, 4, 467--516.
[22]
Diao, Y., Rizvi, S., and Franklin, M. J. 2004. Towards an Internet-scale xml dissemination service. In Proceedings of the International Conference on Very Large Databases. 612--623.
[23]
Diaz, A. L. and Lovell, D. 1999. XML generator. http://www.alphaworks.ibm.com/tech/xmlgenerator.
[24]
Fenner, W. and Srivastava, D. 2005. XTreeNet: Scalable overlay networks for XML content dissemination and querying. In Proceedings of the 10th International Workshop on Web Content Caching and Distribution. 41--46.
[25]
Gong, X., Yan, Y., Qian, W., and Zhou, A. 2005. Bloom filter-based XML packets filtering for millions of path queries. In Proceedings of the 21st IEEE International Conference on Data Engineering. 890--901.
[26]
Green, T. J., Gupta, A., Miklau, G., Onizuka, M., and Suciu, D. 2004. Processing XML streams with deterministic automata and stream indexes. ACM Trans. Datab. Syst. 29, 4, 752--788.
[27]
Green, T. J., Miklau, G., Onizuka, M., and Suciu, D. 2003. Processing XML streams with deterministic automata. In Proceedings of the 9th International Conference on Database Theory. 173--189.
[28]
Gupta, A. K. and Suciu, D. 2003. Stream processing of XPath queries with predicates. In Proceedings of the ACM-SIGMOD Conference. ACM Press, 419--430.
[29]
He, B., Luo, Q., and Choi, B. 2005. Cache-Conscious automata for XML filtering. In Proceedings of the 21st IEEE International Conference on Data Engineering. 878--889.
[30]
He, B., Luo, Q., and Choi, B. 2006. Cache-Conscious automata for XML filtering. IEEE Trans. Knowl. Data Eng. 18, 12, 1629--1644.
[31]
Hong, M., Demers, A. J., Gehrke, J., Koch, C., Riedewald, M., and White, W. M. 2007. Massively multi-query join processing in publish/subscribe systems. In Proceedings of the ACM-SIGMOD Conference. 761--772.
[32]
Hou, S. and Jacobsen, H.-A. 2006. Predicate-Based filtering of XPath expressions. In Proceedings of the 22nd IEEE International Conference on Data Engineering. 53.
[33]
Kwon, J., Rao, P., Moon, B., and Lee, S. 2005. FiST: Scalable XML document filtering by sequencing twig patterns. In Proceedings of the 31st VLDB Conference. 217--228.
[34]
Kwon, J., Rao, P., Moon, B., and Lee, S. 2007. Value-Based predicate filtering of streaming XML data. In Proceedings of the 1st International Workshop on Data Management in Ubiquitous Computing. 266--271.
[35]
Kwon, J., Rao, P., Moon, B., and Lee, S. 2008. Value-Based predicate filtering of XML documents. Data Knowl. Eng. 67, 1, 51--73.
[36]
Lewis, W. D. Personal communications. http://zimmer.csufresno.edu/~wlewis/.
[37]
Li, G., Hou, S., and Jacobsen, H.-A. 2008. Routing of XML and XPath queries in data dissemination networks. In Proceedings of the 28th International Conference on Distributed Computing Systems. 627--638.
[38]
Li, Q. and Moon, B. 2001. Indexing and querying XML data for regular path expressions. In Proceedings of the 27th VLDB Conference. 361--370.
[39]
Lu, J., Chen, T., and Ling, T. W. 2004. Efficient processing of xml twig patterns with parent child edges: A look-ahead approach. In Proceedings of ACM the 13th International Conference on Information and Knowledge Management. 533--542.
[40]
Ludäscher, B., Mukhopadhyay, P., and Papakonstantinou, Y. 2002. A transducer-based XML query processor. In Proceedings of the 28th VLDB Conference. 227--238.
[41]
Megginson, D. Simple API for XML. http://sax.sourceforge.net/.
[42]
Miliaraki, I., Kaoudi, Z., and Koubarakis, M. 2008. Xml data dissemination using automata on top of structured overlay networks. In Proceedings of the 17th International World Wide Web Conference. ACM, New York, 865--874.
[43]
Milo, T., Zur, T., and Verbin, E. 2007. Boosting topic-based publish-subscribe systems with dynamic clustering. In Proceedings of the ACM-SIGMOD Conference. 749--760.
[44]
Moro, M. M., Bakalov, P., and Tsotras, V. J. 2007. Early profile pruning on XML-aware publish-subscribe systems. In Proceedings of the 33rd VLDB Conference. 866--877.
[45]
Müller, K. 2004. Semi-Automatic construction of a question treebank. In Proceedings of the 4th International Conference on Language Resources and Evaluation.
[46]
MusicXML. MusicXML definition. http://www.recordare.com//xml.html.
[47]
NITF. NITF: News industry text format. http://www.nitf.org/.
[48]
Peng, F. and Chawathe, S. S. 2003. XPath queries on streaming data. In Proceedings of the ACM-SIGMOD Conference. ACM Press, 431--442.
[49]
Prüfer, H. 1918. Neuer Beweis eines satzes über permutationen. Archiv für Mathematik und Physik 27, 142--144.
[50]
Ramasubramanian, V., Peterson, R., and Sirer, E. G. 2006. Corona: A high performance publish-subscribe system for the World Wide Web. In Proceedings of the 3rd Conference on Networked Systems Design&Implementation (NSDI'06). USENIX Association, 2--2.
[51]
Rao, P., Cappos, J., Khare, V., Moon, B., and Zhang, B. 2007. Net-x: Unified data-centric Internet services. In Proceedings of 3rd International Workshop on Networking Meets Databases (NetDB'07).
[52]
Rao, P. and Moon, B. 2004. PRIX: Indexing and querying XML using Prüfer sequences. In Proceedings of the 20th IEEE International Conference on Data Engineering. 288--299.
[53]
Rao, P. and Moon, B. 2006. Sequencing XML data and query twigs for fast pattern matching. ACM Trans. Datab. Syst. 31, 1, 299--345.
[54]
Shah, R., Ramzan, Z., Jain, R., Dendukuri, R., and Anjum, F. 2004. Efficient dissemination of personalized information using content-based multicast. IEEE Trans. Mobile Comput. 3, 4, 394--408.
[55]
Tian, F., Reinwald, B., Pirahesh, H., Mayr, T., and Myllymaki, J. 2004. Implementing a scalable XML publish/subscribe system using a relational database system. In Proceedings of the ACM-SIGMOD Conference. 479--490.
[56]
Treebank. The Penn treebank project. http://www.cis.upenn.edu/~treebank/.
[57]
XMark. XMark - An XML benchmark project. http://www.xml-benchmark.org/.
[58]
Yan, T. W. and Garcia-Molina, H. 1999. The SIFT information dissemination system. ACM Trans. Datab. Syst. 24, 4, 529--565.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Internet Technology
ACM Transactions on Internet Technology  Volume 9, Issue 4
September 2009
165 pages
ISSN:1533-5399
EISSN:1557-6051
DOI:10.1145/1592446
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 October 2009
Accepted: 01 July 2009
Revised: 01 June 2009
Received: 01 October 2008
Published in TOIT Volume 9, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Prüfer sequences
  2. XML filtering
  3. selective dissemination of information
  4. twig pattern

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)3
  • Downloads (Last 6 weeks)0
Reflects downloads up to 10 Feb 2025

Other Metrics

Citations

Cited By

View all

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media