A Survey on Collecting, Managing, and Analyzing Provenance from Scripts

Published: 18 June 2019

Abstract

Scripts are widely used to design and run scientific experiments. Scripting languages are easy to learn and use, and they allow complex tasks to be specified and executed in fewer steps than with traditional programming languages. However, they also have important limitations for reproducibility and data management. As experiments are iteratively refined, it is challenging to reason about each experiment run (or trial), to keep track of the association between trials and experiment instances as well as the differences across trials, and to connect results to specific input data and parameters. Approaches have been proposed that address these limitations by collecting, managing, and analyzing the provenance of scripts. In this article, we survey the state of the art in provenance for scripts. We have identified the approaches by following an exhaustive protocol of forward and backward literature snowballing. Based on a detailed study, we propose a taxonomy and classify the approaches using this taxonomy.

Published In

ACM Computing Surveys  Volume 52, Issue 3
May 2020
734 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3341324
Editor: Sartaj Sahni
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2019
Accepted: 01 February 2019
Revised: 01 December 2018
Received: 01 September 2017
Published in CSUR Volume 52, Issue 3

Author Tags

  1. Provenance
  2. analyzing
  3. collecting
  4. managing
  5. scripts
  6. survey

Qualifiers

  • Survey
  • Research
  • Refereed
