research-article

The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing

Editors: Chen Li, Volker Markl

Authors:

Robert Bradshaw,

Craig Chambers,

Slava Chernyak,

Rafael J. Fernández-Moctezuma,

Sam Whittle

Proceedings of the VLDB Endowment, Volume 8, Issue 12

Pages 1792 - 1803

https://doi.org/10.14778/2824032.2824076

Published: 01 August 2015

Abstract

Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves, in addition to an insatiable hunger for faster answers. Meanwhile, practicality dictates that one can never fully optimize along all dimensions of correctness, latency, and cost for these types of input. As a result, data processing practitioners are left with the quandary of how to reconcile the tensions between these seemingly competing propositions, often resulting in disparate implementations and systems.

We propose that a fundamental shift of approach is necessary to deal with these evolved requirements in modern data processing. We as a field must stop trying to groom unbounded datasets into finite pools of information that eventually become complete, and instead live and breathe under the assumption that we will never know if or when we have seen all of our data. We know only that new data will arrive and that old data may be retracted. The only way to make this problem tractable is via principled abstractions that allow the practitioner to choose appropriate tradeoffs along the axes of interest: correctness, latency, and cost.

In this paper, we present one such approach, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.


Published In

Proceedings of the VLDB Endowment, Volume 8, Issue 12

Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii

August 2015

728 pages

ISSN: 2150-8097

Editors: Chen Li (University of California, Irvine), Volker Markl (TU Berlin)

Publisher: VLDB Endowment
