
Putting lipstick on pig: enabling database-style workflow provenance

Published: 01 December 2011

Abstract

Workflow provenance typically assumes that each module is a "black box", so that each output depends on all inputs (coarse-grained dependencies). Furthermore, it does not model the internal state of a module, which can change between repeated executions. In practice, however, an output may depend on only a small subset of the inputs (fine-grained dependencies) as well as on the internal state of the module. We present a novel provenance framework that marries database-style and workflow-style provenance by using Pig Latin to expose the functionality of modules, thus capturing internal state and fine-grained dependencies. A critical ingredient in our solution is a novel form of provenance graph that models module invocations and yields a compact representation of fine-grained workflow provenance. It also enables a number of novel graph transformation operations, allowing users to choose the desired level of granularity in provenance querying (ZoomIn and ZoomOut) and supporting "what-if" workflow analytic queries. We implemented our approach in the Lipstick system and developed a benchmark in support of a systematic performance evaluation. Our results demonstrate the feasibility of tracking and querying fine-grained workflow provenance.
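
The distinction between coarse-grained and fine-grained dependencies can be illustrated with a small sketch. This is not the Lipstick implementation and does not use Pig Latin; the class `CountingModule`, the helper `coarse_provenance`, the row fields `id` and `group`, and the `state:...` token format are all hypothetical names chosen for illustration. The toy module counts rows per group and keeps a running total as internal state across invocations.

```python
from collections import defaultdict


def coarse_provenance(inputs, outputs):
    """Black-box view: every output depends on every input of the invocation."""
    all_ids = frozenset(row["id"] for row in inputs)
    return {out["key"]: all_ids for out in outputs}


class CountingModule:
    """Hypothetical module: counts rows per group, with state kept between calls."""

    def __init__(self):
        self.state = defaultdict(int)  # internal state: running count per group
        self.invocations = 0

    def invoke(self, inputs):
        self.invocations += 1
        groups = defaultdict(list)
        for row in inputs:
            groups[row["group"]].append(row["id"])

        outputs, fine = [], {}
        for key, ids in groups.items():
            previous = self.state[key]
            self.state[key] = previous + len(ids)
            outputs.append({"key": key, "count": self.state[key]})
            # Fine-grained view: only the rows of this group contribute,
            # plus the prior state of the group if it was non-empty.
            deps = set(ids)
            if previous:
                deps.add(f"state:{key}@{self.invocations - 1}")
            fine[key] = frozenset(deps)
        return outputs, fine


module = CountingModule()
batch1 = [
    {"id": "t1", "group": "A"},
    {"id": "t2", "group": "B"},
    {"id": "t3", "group": "A"},
]
out1, fine1 = module.invoke(batch1)
coarse1 = coarse_provenance(batch1, out1)

batch2 = [{"id": "t4", "group": "A"}]
out2, fine2 = module.invoke(batch2)
```

In this sketch the coarse-grained view attributes the output for group `A` in the first invocation to all three input rows, while the fine-grained view attributes it only to `t1` and `t3`. In the second invocation the fine-grained view also records a dependency on the state left behind by the first invocation, which the black-box view cannot express.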


Published In

Proceedings of the VLDB Endowment, Volume 5, Issue 4
December 2011
120 pages

Publisher

VLDB Endowment
