DOI: 10.1145/3419111.3421292
research-article
Open access

Influence-based provenance for dataflow applications with taint propagation

Published: 12 October 2020

Abstract

Debugging big data analytics often requires a root cause analysis to pinpoint the precise culprit records in an input dataset responsible for incorrect or anomalous output. Existing debugging and data provenance approaches do not track fine-grained control and data flows in user-defined application code; thus, the returned culprit data is often too large for manual inspection, and expensive post-mortem analysis is required.
We design FlowDebug to identify a highly precise set of input records based on two key insights. First, FlowDebug precisely tracks control and data flow within user-defined functions to propagate taints at a fine-grained level by inserting custom data abstractions through automated source-to-source transformation. Second, it introduces a novel notion of influence-based provenance for many-to-one dependencies, which prioritizes the input records that are more responsible than others by analyzing the semantics of the user-defined function used for aggregation. By design, our approach does not require any modification to the framework's runtime and can be applied to existing applications easily. FlowDebug significantly improves the precision of debugging results by up to 99.9 percentage points and avoids repetitive re-runs required for post-mortem analysis by a factor of 33, while incurring an instrumentation overhead of 0.4X to 6.1X on vanilla Spark.
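
The two ideas in the abstract can be illustrated with a minimal sketch in plain Python. It is not FlowDebug's implementation or API, and it does not use Spark: the names `Tainted`, `tainted_sum`, and `top_k_influence` are hypothetical, and the top-k ranking is only one assumed example of an influence function. The sketch models data-flow taint propagation only; control-flow tracking is not shown.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Tainted:
    """A value paired with the IDs of the input records it was derived from."""
    value: float
    prov: FrozenSet[int]

    def __add__(self, other: "Tainted") -> "Tainted":
        # Data flow: the result depends on both operands, so union their taints.
        return Tainted(self.value + other.value, self.prov | other.prov)

    def __mul__(self, other: "Tainted") -> "Tainted":
        return Tainted(self.value * other.value, self.prov | other.prov)


InfluenceFn = Callable[[Iterable[Tainted]], FrozenSet[int]]


def top_k_influence(k: int) -> InfluenceFn:
    """Hypothetical influence function: keep the provenance of the k largest values."""
    def select(items: Iterable[Tainted]) -> FrozenSet[int]:
        ranked = sorted(items, key=lambda t: t.value, reverse=True)
        kept: FrozenSet[int] = frozenset()
        for t in ranked[:k]:
            kept |= t.prov
        return kept
    return select


def tainted_sum(items: List[Tainted],
                influence: Optional[InfluenceFn] = None) -> Tainted:
    """Many-to-one aggregation over tainted values.

    Without an influence function, every input is reported as a culprit.
    With one, only the inputs it selects are kept in the provenance.
    """
    total = sum(t.value for t in items)
    if influence is None:
        prov = frozenset().union(*(t.prov for t in items))
    else:
        prov = influence(items)
    return Tainted(total, prov)


# Four input records; record 3 is an outlier that dominates the sum.
records = [Tainted(v, frozenset({i})) for i, v in enumerate([3.0, 2.5, 4.0, 9000.0])]
full = tainted_sum(records)
ranked = tainted_sum(records, top_k_influence(1))
```

Here `full.prov` contains all four record IDs, which is what plain many-to-one provenance would return, while `ranked.prov` contains only the ID of the dominant record. This is the kind of reduction in culprit-set size that the abstract attributes to influence-based provenance.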

Supplementary Material

MOV File (p372-teoh-presentation.mov)

Published In

SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing
October 2020
535 pages
ISBN:9781450381376
DOI:10.1145/3419111
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data systems
  2. data intensive scalable computing
  3. data provenance
  4. fault localization
  5. taint analysis

Conference

SoCC '20
SoCC '20: ACM Symposium on Cloud Computing
October 19 - 21, 2020
Virtual Event, USA

Acceptance Rates

SoCC '20 Paper Acceptance Rate 35 of 143 submissions, 24%;
Overall Acceptance Rate 169 of 722 submissions, 23%
