Lineage tracing in data warehouses
Publisher: Stanford University, 408 Panama Mall, Suite 217, Stanford, CA, United States
ISBN: 978-0-493-51695-0
Order Number: AAI3038081
Pages: 205
Abstract

Data warehousing systems collect data from multiple distributed data sources and store integrated and summarized information in local databases for efficient data analysis and mining. Sometimes, when analyzing data at a warehouse, it is useful to “drill down” and investigate the source data from which certain warehouse data was derived. For a given warehouse data item, identifying the exact set of source data items that produced the warehouse data item is termed the data lineage problem. This thesis presents our research results on tracing data lineage in a warehousing environment: (1) Formal definitions of data lineage for data warehouses defined as relational materialized views over relational sources, and for warehouses defined using graphs of general data transformations. (2) Algorithms for lineage tracing, again considering both relational and transformational warehouses, along with a suite of optimization techniques. (3) Performance evaluations through simulations, and a lineage tracing prototype developed within the WHIPS (WareHousing Information Processing System) project at Stanford. (4) Applying data lineage techniques to obtain improved algorithms for the well-known database view update problem.
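To make the data lineage problem concrete, here is a minimal sketch for a single aggregate view. It is an illustration only, not the thesis's algorithms: the `sales` table, its columns, and the helpers `build_view` and `trace_lineage` are all hypothetical names introduced for this example.

```python
from collections import defaultdict

# Hypothetical source table of (store, item, amount) rows; illustrative data only.
sales = [
    ("north", "pen", 3),
    ("north", "ink", 5),
    ("south", "pen", 2),
    ("south", "pad", 7),
    ("north", "pad", 0),
]


def build_view(rows, min_amount=1):
    """Warehouse view: total amount per store, over rows with amount >= min_amount."""
    totals = defaultdict(int)
    for store, _item, amount in rows:
        if amount >= min_amount:
            totals[store] += amount
    return dict(totals)


def trace_lineage(rows, store, min_amount=1):
    """Return the source rows that contributed to the view row for `store`.

    A row contributes if it passes the view's selection predicate and
    falls into the same group as the view row being traced.
    """
    return [r for r in rows if r[0] == store and r[2] >= min_amount]


view = build_view(sales)
lineage = trace_lineage(sales, "north")
```

In this sketch, recomputing the view over the traced rows alone reproduces the traced view row, which is the intuition behind "the exact set of source data items that produced the warehouse data item". The row `("north", "pad", 0)` is filtered out by the view, so it is not part of the lineage.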

Cited By

  1. Psallidas F, Agrawal A, Sugunan C, Ibrahim K, Karanasos K, Camacho-Rodríguez J, Floratou A, Curino C and Ramakrishnan R (2023). OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance from Database Query Event Logs, Proceedings of the VLDB Endowment, 16:12, (3662-3675), Online publication date: 1-Aug-2023.
  2. Pavan A, Meel K, Vinodchandran N and Bhattacharyya A Constraint optimization over semirings Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, (4070-4077)
  3. Gaur G, Bedathur S and Bhattacharya A Tracking the Impact of Fact Deletions on Knowledge Graph Queries using Provenance Polynomials Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, (2079-2082)
  4. Tuya J, Riva C, Suarez-Cabal M and Blanco R (2016). Coverage-Aware Test Database Reduction, IEEE Transactions on Software Engineering, 42:10, (941-959), Online publication date: 1-Oct-2016.
  5. Karvounarakis G, Green T, Ives Z and Tannen V (2013). Collaborative data sharing via update exchange and provenance, ACM Transactions on Database Systems (TODS), 38:3, (1-42), Online publication date: 1-Aug-2013.
  6. Tuchinda R, Knoblock C and Szekely P (2011). Building Mashups by Demonstration, ACM Transactions on the Web, 5:3, (1-45), Online publication date: 1-Jul-2011.
  7. Karvounarakis G, Ives Z and Tannen V Querying data provenance Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, (951-962)
  8. Zhang J and Jagadish H Lost source provenance Proceedings of the 13th International Conference on Extending Database Technology, (311-322)
  9. Ives Z, Green T, Karvounarakis G, Taylor N, Tannen V, Talukdar P, Jacob M and Pereira F (2008). The ORCHESTRA Collaborative Data Sharing System, ACM SIGMOD Record, 37:3, (26-32), Online publication date: 30-Sep-2008.
  10. Talukdar P, Jacob M, Mehmood M, Crammer K, Ives Z, Pereira F and Guha S (2008). Learning to create data-integrating queries, Proceedings of the VLDB Endowment, 1:1, (785-796), Online publication date: 1-Aug-2008.
  11. Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R and Ives Z DBpedia Proceedings of the 6th international The semantic web and 2nd Asian conference on Asian semantic web conference, (722-735)
  12. Green T, Karvounarakis G, Ives Z and Tannen V Update exchange with mappings and provenance Proceedings of the 33rd international conference on Very large data bases, (675-686)
  13. Fan H Data lineage tracing in data warehousing environments Proceedings of the 24th British national conference on Databases, (25-36)
  14. Green T, Karvounarakis G and Tannen V Provenance semirings Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, (31-40)
  15. Green T, Karvounarakis G, Taylor N, Biton O, Ives Z and Tannen V ORCHESTRA Proceedings of the 2007 ACM SIGMOD international conference on Management of data, (1131-1133)
  16. Taylor N and Ives Z Reconciling while tolerating disagreement in collaborative data sharing Proceedings of the 2006 ACM SIGMOD international conference on Management of data, (13-24)
  17. Barkstrom B Data product configuration management and versioning in large-scale production of satellite scientific data Proceedings of the 2001 ICSE Workshops on SCM 2001, and SCM 2003 conference on Software configuration management, (118-133)
  18. Cui Y and Widom J Lineage Tracing for General Data Warehouse Transformations Proceedings of the 27th International Conference on Very Large Data Bases, (471-480)