DOI: 10.1145/3297858.3304031

StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory

Published: 04 April 2019

Abstract

Stream analytics has an insatiable demand for memory and performance. Emerging hybrid memories combine commodity DDR4 DRAM with 3D-stacked High Bandwidth Memory (HBM) DRAM to meet such demands. However, achieving this promise is challenging because (1) HBM is capacity-limited and (2) HBM boosts performance most for workloads with sequential access and high parallelism. At first glance, stream analytics appears to be a particularly poor match for HBM: it has high capacity demands, and data grouping operations, its most demanding computations, use random access. This paper presents the design and implementation of StreamBox-HBM, a stream analytics engine that exploits hybrid memories to achieve scalable high performance. StreamBox-HBM performs data grouping with sequential-access sorting algorithms in HBM, in contrast to the random-access hashing algorithms commonly used in DRAM. StreamBox-HBM uses HBM solely to store Key Pointer Array (KPA) data structures, which contain only partial records (keys and pointers to full records) for grouping operations. It dynamically creates and manages prodigious data and pipeline parallelism, choosing when to allocate KPAs in HBM. It dynamically optimizes for both the high bandwidth and limited capacity of HBM and the limited bandwidth and high capacity of standard DRAM. StreamBox-HBM achieves 110 million records per second and 238 GB/s memory bandwidth while effectively utilizing all 64 cores of Intel's Knights Landing, a commercial server with hybrid memory. In throughput, it outperforms stream engines that use sequential-access algorithms without KPAs by 7x, and stream engines that use random-access algorithms by an order of magnitude. To the best of our knowledge, StreamBox-HBM is the first stream engine optimized for hybrid memories.
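
The grouping idea described in the abstract can be illustrated with a minimal sketch. It is not the StreamBox-HBM implementation: the record layout, the function names `extract_kpa` and `group_sum_by_sort`, and the use of list indices as "pointers" are all assumptions made for illustration, and plain Python lists stand in for both HBM and DRAM. The sketch only shows the access pattern: copy keys and pointers into a compact array, sort that array, then aggregate in one sequential pass instead of probing a hash table.

```python
from typing import List, Tuple

# Hypothetical record layout: (key, value, payload). The full records stay
# where they are; only (key, index) pairs are copied into the KPA.
Record = Tuple[int, int, str]
KPA = List[Tuple[int, int]]  # (key, pointer); the pointer is a list index here


def extract_kpa(records: List[Record]) -> KPA:
    """Build a key pointer array: one (key, index) pair per full record."""
    return [(rec[0], i) for i, rec in enumerate(records)]


def group_sum_by_sort(records: List[Record]) -> List[Tuple[int, int]]:
    """Sum the value field per key by sorting the KPA and scanning it once."""
    kpa = extract_kpa(records)
    kpa.sort()  # sorting takes the place of hash-table probing

    groups: List[Tuple[int, int]] = []
    for key, ptr in kpa:  # sequential pass over the sorted KPA
        value = records[ptr][1]  # follow the pointer to the full record
        if groups and groups[-1][0] == key:
            # Same key as the previous entry: extend the current group.
            groups[-1] = (key, groups[-1][1] + value)
        else:
            # First entry of a new run of equal keys.
            groups.append((key, value))
    return groups


records = [(3, 10, "a"), (1, 5, "b"), (3, 7, "c"), (2, 1, "d"), (1, 2, "e")]
print(group_sum_by_sort(records))
```

Because the KPA holds only keys and pointers, it is much smaller than the full records, which is what makes it a plausible fit for a capacity-limited memory tier in this toy model.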

Published In

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
April 2019
1126 pages
ISBN: 9781450362405
DOI: 10.1145/3297858
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. KPA
  2. data analytics
  3. high bandwidth memory
  4. hybrid memory
  5. multicore
  6. stream processing

Qualifiers

  • Research-article

Conference

ASPLOS '19

Acceptance Rates

ASPLOS '19 Paper Acceptance Rate 74 of 351 submissions, 21%;
Overall Acceptance Rate 535 of 2,713 submissions, 20%
