research-article

Provenance in Collaborative in Silico Scientific Research: a Survey

Published: 10 December 2020

Abstract

Science is a collaborative activity by definition. Research is usually conducted by several scientists working together, and this behavior has intensified in recent years. Furthermore, experiments are increasingly performed in silico, which demands proper support tools. Provenance-aware Workflow Management Systems and script-based tools have been popular ways of running in silico experiments, but these tools often neglect the collaboration aspect. Even solutions that aim at collaborative experiments do not always address the collaborators' needs. Existing surveys discuss subjects related to in silico experiments. However, they either focus on provenance collection and applications, treating collaboration as just another possible application, or focus on Workflow Management Systems, listing collaboration only as a possible challenge. This article surveys available tools and approaches that aim to help scientists conduct collaborative in silico experiments. In particular, we focus on challenges related to the provenance of these collaborative experiments. We devise a taxonomy of the aspects of collaboration in scientific research and discuss each of these aspects. We also identify gaps in the literature that offer opportunities for future work.



Published In

ACM SIGMOD Record  Volume 49, Issue 2
June 2020
57 pages
ISSN:0163-5808
DOI:10.1145/3442322
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States
