research-article

Providing streaming joins as a service at Facebook

Authors:

Gabriela Jacques-Silva,

Guoqiang Jerry Chen,

Anirban Banerjee,

Benjamin Heintz,

Anshul Jaiswal

Proceedings of the VLDB Endowment, Volume 11, Issue 12

Pages 1809 - 1821

https://doi.org/10.14778/3229863.3229869

Published: 01 August 2018

Abstract

Stream processing applications reduce the latency of batch data pipelines and enable engineers to quickly identify production issues. Often, a service logs data to distinct streams even when the entries relate to the same real-world event (e.g., a search on Facebook's search bar). Furthermore, related events can be logged on the server side with different delays, causing one stream to be significantly behind the other in terms of logged event times for a given log entry. To stitch this information together with low latency, we need to join two different streams, where each stream may have its own characteristics regarding the degree to which its data is out of order. Doing so in a streaming fashion is challenging because a join operator consumes a large amount of memory, especially with significant data volumes. This paper describes an end-to-end streaming join service that addresses these challenges through a streaming join operator with an adaptive stream synchronization algorithm, which handles the different event-time distributions we observe in real-world streams. This synchronization scheme paces the parsing of new data and reduces the overall operator memory footprint while still providing high accuracy. We have integrated this into a streaming SQL system and have successfully reduced the latency of several batch pipelines using this approach.
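
As a rough illustration of the pacing idea described in the abstract, the sketch below joins two event-time streams while always reading from whichever stream is further behind in event time, so that neither buffer runs far ahead of the other. It is not the paper's adaptive synchronization algorithm: it assumes each stream is already ordered by event time, it uses a fixed join window, and every name in it (`paced_join`, `window`, the `(event_time, key, value)` tuple layout) is hypothetical.

```python
from collections import defaultdict


def paced_join(left, right, window):
    """Join (event_time, key, value) tuples from two streams.

    Two events match when their keys are equal and their event times
    differ by at most `window`. Assumption: each input is ordered by
    event time, which real streams only approximate.
    """
    streams = [iter(left), iter(right)]
    # Per-stream buffer: key -> list of (event_time, value).
    buffers = [defaultdict(list), defaultdict(list)]
    # Largest event time seen so far on each stream.
    progress = [float("-inf"), float("-inf")]
    done = [False, False]
    results = []

    while not all(done):
        # Pacing: read next from the live stream that lags in event time.
        live = [i for i in (0, 1) if not done[i]]
        side = min(live, key=lambda i: progress[i])
        other = 1 - side

        event = next(streams[side], None)
        if event is None:
            done[side] = True
            continue

        t, key, value = event
        progress[side] = max(progress[side], t)

        # Probe the opposite buffer for matches inside the window.
        for other_t, other_value in buffers[other].get(key, []):
            if abs(t - other_t) <= window:
                if side == 0:
                    results.append((key, value, other_value))
                else:
                    results.append((key, other_value, value))

        buffers[side][key].append((t, value))

        # Evict events that no future in-order event can still match.
        horizon = min(progress) - window
        for buf in buffers:
            for k in list(buf):
                buf[k] = [(bt, bv) for bt, bv in buf[k] if bt >= horizon]
                if not buf[k]:
                    del buf[k]

    return results
```

For example, with `left = [(1, "a", "L1"), (5, "b", "L2"), (20, "a", "L3")]`, `right = [(2, "a", "R1"), (9, "b", "R2"), (40, "a", "R3")]`, and `window=3`, the call `paced_join(left, right, 3)` returns `[("a", "L1", "R1")]`. The eviction step scans every buffered key on each event, which keeps the sketch short but would be too slow for real volumes.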

Index Terms

Theory of computation > Theory and algorithms for application domains > Database theory


Published In

Proceedings of the VLDB Endowment, Volume 11, Issue 12

August 2018

426 pages

ISSN: 2150-8097

Editors: Sihem Amer-Yahia (University of Grenoble Alpes, CNRS), Jian Pei (Simon Fraser University)

Publisher

VLDB Endowment
