research-article

The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing

Published: 01 August 2015

Abstract

Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves, in addition to an insatiable hunger for faster answers. Meanwhile, practicality dictates that one can never fully optimize along all dimensions of correctness, latency, and cost for these types of input. As a result, data processing practitioners are left with the quandary of how to reconcile the tensions between these seemingly competing propositions, often resulting in disparate implementations and systems.
We propose that a fundamental shift of approach is necessary to deal with these evolved requirements in modern data processing. We as a field must stop trying to groom unbounded datasets into finite pools of information that eventually become complete, and instead live and breathe under the assumption that we will never know if or when we have seen all of our data, only that new data will arrive, old data may be retracted, and the only way to make this problem tractable is via principled abstractions that allow the practitioner the choice of appropriate tradeoffs along the axes of interest: correctness, latency, and cost.
In this paper, we present one such approach, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.

Published In

Proceedings of the VLDB Endowment  Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015
728 pages
ISSN:2150-8097

Publisher

VLDB Endowment
