Data warehousing systems collect data from multiple distributed data sources and store integrated and summarized information in local databases for efficient data analysis and mining. When analyzing data at a warehouse, it is sometimes useful to “drill down” and investigate the source data from which certain warehouse data was derived. For a given warehouse data item, identifying the exact set of source data items that produced it is termed the data lineage problem. This thesis presents our research results on tracing data lineage in a warehousing environment: (1) formal definitions of data lineage, both for warehouses defined as relational materialized views over relational sources and for warehouses defined using graphs of general data transformations; (2) algorithms for lineage tracing, again considering both relational and transformational warehouses, along with a suite of optimization techniques; (3) performance evaluations through simulations, and a lineage tracing prototype developed within the WHIPS (WareHousing Information Processing System) project at Stanford; (4) an application of data lineage techniques to obtain improved algorithms for the well-known database view update problem.
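To make the data lineage problem concrete, here is a minimal Python sketch of tracing lineage for one row of a toy aggregate view. It is an illustration only, not the thesis's tracing algorithms: the `SALES` table, the `MIN_AMOUNT` filter, and the functions `build_view` and `trace_lineage` are all hypothetical names introduced for this example.

```python
from collections import defaultdict

# Hypothetical source table (assumption for illustration): (sale_id, store, amount)
SALES = [
    (1, "north", 40),
    (2, "north", 60),
    (3, "south", 25),
    (4, "south", 5),
    (5, "east", 70),
]

MIN_AMOUNT = 10  # assumed selection predicate used by the toy view


def build_view(sales):
    """Materialize a toy view: total amount per store over rows passing the filter."""
    totals = defaultdict(int)
    for _sale_id, store, amount in sales:
        if amount >= MIN_AMOUNT:
            totals[store] += amount
    return dict(totals)


def trace_lineage(sales, store):
    """Return the source rows that contributed to the view row for `store`.

    A source row contributes if it passes the view's filter and falls
    into the same group as the view row being traced.
    """
    return [row for row in sales if row[1] == store and row[2] >= MIN_AMOUNT]


view = build_view(SALES)
south_lineage = trace_lineage(SALES, "south")
```

In this sketch the lineage of the `"south"` view row excludes sale 4, because that row was filtered out before aggregation and so did not contribute to the total. Recomputing the view over only the traced rows reproduces the original view row, which is the intuition behind "the exact set of source data items that produced the warehouse data item".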
Index Terms
- Lineage tracing in data warehouses