A Survey on Collecting, Managing, and Analyzing Provenance from Scripts

Authors:

João Felipe Pimentel,

Juliana Freire,

Leonardo Murta,

Vanessa Braganholo

ACM Computing Surveys (CSUR), Volume 52, Issue 3

Article No.: 47, Pages 1 - 38

https://doi.org/10.1145/3311955

Published: 18 June 2019

Abstract

Scripts are widely used to design and run scientific experiments. Scripting languages are easy to learn and use, and they allow complex tasks to be specified and executed in fewer steps than with traditional programming languages. However, they also have important limitations for reproducibility and data management. As experiments are iteratively refined, it is challenging to reason about each experiment run (or trial), to keep track of the association between trials and experiment instances as well as the differences across trials, and to connect results to specific input data and parameters. Approaches have been proposed that address these limitations by collecting, managing, and analyzing the provenance of scripts. In this article, we survey the state of the art in provenance for scripts. We have identified the approaches by following an exhaustive protocol of forward and backward literature snowballing. Based on a detailed study, we propose a taxonomy and classify the approaches using this taxonomy.

Cited By

Mehra S, Rao M, Bansal A, Rathore N, Sidana S, Raj S, Sinha A, Rao G, Shamim R, Singh N, Kumar B. (2024). Ethical Challenges and Innovations in AI-Driven Healthcare and Engineering. Ethical Dimensions of AI Development, 323-346. Online publication date: 16-Aug-2024.
https://doi.org/10.4018/979-8-3693-4147-6.ch015

Köhler C, Ulianych D, Grün S, Decker S, Denker M. (2024). Facilitating the Sharing of Electrophysiology Data Analysis Results Through In-Depth Provenance Capture. eneuro 11:6 (ENEURO.0476-23.2024). Online publication date: 22-May-2024.
https://doi.org/10.1523/ENEURO.0476-23.2024

Chapman A, Lauro L, Missier P, Torlone R. (2024). Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance. ACM Transactions on Database Systems 49:2 (1-42). Online publication date: 10-Apr-2024.
https://dl.acm.org/doi/10.1145/3644385

Index Terms

A Survey on Collecting, Managing, and Analyzing Provenance from Scripts
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Data provenance
2. Software and its engineering
  1. Software notations and tools
    1. Context specific languages
      1. Scripting languages

Published In

ACM Computing Surveys Volume 52, Issue 3

May 2020

734 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3341324

Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2019

Accepted: 01 February 2019

Revised: 01 December 2018

Received: 01 September 2017

Published in CSUR Volume 52, Issue 3

Qualifiers

Survey
Research
Refereed

Funding Sources

AT&T
DARPA
Moore-Sloan Data Science Environment at NYU
Conselho Nacional de Desenvolvimento Científico e Tecnológico
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
National Science Foundation
Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro
