research-article

Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google Cloud Dataflow

Published: 01 July 2021

Abstract

Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks, a key tool for reasoning about temporal completeness in infinite streams.
First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches:
• Computing a single correct answer, as in notifications.
• Reasoning about a lack of data, as in dip detection.
• Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models.
• Safely and punctually garbage collecting obsolete inputs and intermediate state.
• Surfacing a reliable signal of overall pipeline health.
Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.


Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 12
July 2021
587 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
