Information preserving XML schema embedding

Published: 21 March 2008

Abstract

A fundamental concern of data integration in an XML context is the ability to embed one or more source documents in a target document so that (a) the target document conforms to a target schema and (b) the information in the source documents is preserved. In this paper, information preservation for XML is formally studied, and the results of this study guide the definition of a novel notion of schema embedding between two XML DTD schemas represented as graphs. Schema embedding generalizes the conventional notion of graph similarity by allowing an edge in a source DTD schema to be mapped to a path in the target DTD. Instance-level embeddings can be derived from the schema embedding in a straightforward manner, such that conformance to a target schema and information preservation are guaranteed. We show that it is NP-complete to find an embedding between two DTD schemas. We also outline efficient heuristic algorithms to find candidate embeddings, which our experimental study shows to be effective. Together, these results yield the first systematic and effective approach to finding information preserving XML mappings.
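
The edge-to-path idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm or its full definition: it assumes each DTD is reduced to a plain directed graph of element types, and it only checks that a candidate mapping sends every source edge to a connected path in the target whose endpoints agree with the node mapping. The names `check_candidate`, `node_map`, and `path_map`, and the example schemas, are hypothetical. The paper's formal definition may impose further conditions that this sketch does not check.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]   # element type -> child element types (assumed encoding)
Edge = Tuple[str, str]         # (parent type, child type)


def is_path(graph: Graph, nodes: List[str]) -> bool:
    """Return True if consecutive nodes are joined by edges of `graph`."""
    if len(nodes) < 2:
        return False
    return all(b in graph.get(a, []) for a, b in zip(nodes, nodes[1:]))


def check_candidate(source: Graph, target: Graph,
                    node_map: Dict[str, str],
                    path_map: Dict[Edge, List[str]]) -> bool:
    """Check that every source edge maps to a target path with matching endpoints."""
    for parent, children in source.items():
        for child in children:
            path = path_map.get((parent, child))
            if not path:
                return False  # edge has no image in the target
            if path[0] != node_map.get(parent) or path[-1] != node_map.get(child):
                return False  # path endpoints disagree with the node mapping
            if not is_path(target, path):
                return False  # path is not connected in the target graph
    return True


# Hypothetical schemas: the target nests the same data more deeply.
source = {"db": ["student"], "student": ["name"]}
target = {
    "school": ["students"],
    "students": ["student"],
    "student": ["info"],
    "info": ["name"],
}
node_map = {"db": "school", "student": "student", "name": "name"}
path_map = {
    ("db", "student"): ["school", "students", "student"],
    ("student", "name"): ["student", "info", "name"],
}
# A broken candidate: student -> name is not a direct edge in the target.
bad_path_map = dict(path_map)
bad_path_map[("student", "name")] = ["student", "name"]
```

Here the single source edge from `student` to `name` is carried by the two-edge target path through `info`, which a plain edge-to-edge graph matching could not express.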

Published In

ACM Transactions on Database Systems  Volume 33, Issue 1
March 2008
211 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1331904
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 March 2008
Accepted: 01 October 2007
Revised: 01 March 2007
Received: 01 June 2006
Published in TODS Volume 33, Issue 1

Author Tags

  1. Data transformation
  2. XML
  3. XSLT
  4. information integration
  5. information preservation
  6. schema embedding
  7. schema mapping

Qualifiers

  • Research-article
  • Research
  • Refereed
