research-article

HippogriffDB: balancing I/O and GPU bandwidth in big data analytics

Editor: Surajit Chaudhuri

Authors: Hung-Wei Tseng, Yannis Papakonstantinou, Steven Swanson

Proceedings of the VLDB Endowment, Volume 9, Issue 14

Pages 1647 - 1658

https://doi.org/10.14778/3007328.3007331

Published: 01 October 2016

Abstract

As data sets grow and conventional processor performance scaling slows, data analytics is moving toward heterogeneous architectures that incorporate hardware accelerators (notably GPUs) to continue scaling performance. However, existing GPU-based databases handle big data applications inefficiently for two reasons. First, their execution model scales poorly on GPUs, whose memory capacity is limited. Second, they do not account for the bandwidth discrepancy between fast GPUs and slow storage, which can counteract the benefit of GPU acceleration.

In this paper, we propose HippogriffDB, an efficient, scalable GPU-accelerated OLAP system. It tackles the bandwidth discrepancy using compression and an optimized data transfer path. HippogriffDB stores tables in a compressed format and uses the GPU for decompression, trading GPU cycles for improved effective I/O bandwidth. To improve data transfer efficiency, HippogriffDB introduces a peer-to-peer, multi-threaded data transfer mechanism that moves data directly from the SSD to the GPU. For scalability, HippogriffDB adopts a stream-based, query-over-block execution model, which improves kernel efficiency through operator fusion and double buffering.

We have implemented HippogriffDB using an NVMe SSD that communicates directly with a commercial GPU. Results on two popular benchmarks demonstrate its scalability and efficiency. HippogriffDB outperforms an existing GPU-based database (YDB) and an in-memory data analytics system (MonetDB) by 1-2 orders of magnitude.
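The query-over-block idea from the abstract can be illustrated with a minimal CPU-only sketch. It is not the HippogriffDB implementation: `zlib` stands in for the table compression scheme, a background thread stands in for the SSD-to-GPU transfer, and every name here (`make_blocks`, `load_block`, `fused_filter_sum`, `query_over_blocks`, `BLOCK_ROWS`) is hypothetical. It only shows how compressed blocks can be streamed through a fused filter-and-aggregate operator while the next block is loaded in parallel.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_ROWS = 4  # hypothetical block size; a real system would use far larger blocks


def make_blocks(values, block_rows=BLOCK_ROWS):
    """Split an integer column into fixed-size blocks and compress each one.

    Assumption: zlib stands in for the on-storage compressed format.
    """
    blocks = []
    for i in range(0, len(values), block_rows):
        chunk = values[i:i + block_rows]
        raw = struct.pack(f"<{len(chunk)}i", *chunk)
        blocks.append(zlib.compress(raw))
    return blocks


def load_block(block):
    """Stand-in for transferring one block and decompressing it on the accelerator."""
    raw = zlib.decompress(block)
    return struct.unpack(f"<{len(raw) // 4}i", raw)


def fused_filter_sum(rows, threshold):
    """Operator fusion: filter and aggregate in one pass, with no intermediate table."""
    return sum(v for v in rows if v > threshold)


def query_over_blocks(blocks, threshold):
    """Stream blocks through the fused operator, prefetching the next block.

    Double buffering: while the current block is processed, the next one is
    already being loaded by the worker thread.
    """
    total = 0
    if not blocks:
        return total
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_block, blocks[0])
        for next_block in blocks[1:]:
            rows = pending.result()
            pending = pool.submit(load_block, next_block)
            total += fused_filter_sum(rows, threshold)
        total += fused_filter_sum(pending.result(), threshold)
    return total
```

Because only one decompressed block and one in-flight block exist at any time, working memory stays bounded by the block size rather than the table size, which is the property that lets a block-streaming model scale past a fixed device memory capacity.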

Index Terms
1. Information systems
  1. Data management systems
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Published In

Proceedings of the VLDB Endowment Volume 9, Issue 14

October 2016

96 pages

ISSN:2150-8097

Editor:
Surajit Chaudhuri
Microsoft Research

Publisher

VLDB Endowment
