Hardware-oblivious parallelism for in-memory column-stores

Authors: Michael Saecker, Stefan Manegold, Volker Markl

Proceedings of the VLDB Endowment, Volume 6, Issue 9

Pages 709 - 720

https://doi.org/10.14778/2536360.2536370

Published: 01 July 2013

Abstract

The multi-core architectures of today's computer systems make parallelism a necessity for performance-critical applications. Writing such applications in a generic, hardware-oblivious manner is a challenging problem: current database systems thus rely on labor-intensive and error-prone manual tuning to exploit the full potential of modern parallel hardware architectures such as multi-core CPUs and graphics cards. We propose an alternative design for a parallel database engine, based on a single set of hardware-oblivious operators, which are compiled down to the actual hardware at runtime. This design reduces the development overhead for parallel database engines while achieving performance competitive with hand-tuned systems.

We provide a proof-of-concept for this design by integrating operators written using the parallel programming framework OpenCL into the open-source database MonetDB. Following this approach, we achieve efficient, yet highly portable parallel code without the need for optimization by hand. We evaluated our implementation against MonetDB using TPC-H derived queries and observed a performance that rivals that of MonetDB's query execution on the CPU and surpasses it on the GPU. In addition, we show that the same set of operators runs nearly unchanged on a GPU, demonstrating the feasibility of our approach.
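The idea of writing an operator once and binding it to a device only at launch time can be illustrated with a small sketch. This is not the paper's OpenCL code or MonetDB's API: `DeviceProfile`, `indices_for`, `select_kernel`, and `run_select` are hypothetical names, the work-items run sequentially in plain Python, and the two access patterns (contiguous chunks versus strided) are assumptions chosen only to show that the kernel body can stay the same while the launch configuration changes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class DeviceProfile:
    """Hypothetical launch parameters a runtime might choose per device."""
    name: str
    work_items: int  # number of logical threads to launch
    strided: bool    # True: interleaved access; False: contiguous chunks


def indices_for(work_item: int, n: int, dev: DeviceProfile) -> range:
    """Return the element indices that one work-item is responsible for."""
    if dev.strided:
        # Neighbouring work-items touch neighbouring elements.
        return range(work_item, n, dev.work_items)
    chunk = -(-n // dev.work_items)  # ceiling division
    return range(work_item * chunk, min(n, (work_item + 1) * chunk))


def select_kernel(work_item: int, column: List[int],
                  predicate: Callable[[int], bool],
                  bitmap: List[int], dev: DeviceProfile) -> None:
    """Device-independent selection: mark qualifying rows in a bitmap."""
    for i in indices_for(work_item, len(column), dev):
        bitmap[i] = 1 if predicate(column[i]) else 0


def run_select(column: List[int], predicate: Callable[[int], bool],
               dev: DeviceProfile) -> List[int]:
    """Launch the kernel for every work-item and return qualifying row ids."""
    bitmap = [0] * len(column)
    for work_item in range(dev.work_items):  # sequential stand-in for a parallel launch
        select_kernel(work_item, column, predicate, bitmap, dev)
    return [i for i, bit in enumerate(bitmap) if bit]


if __name__ == "__main__":
    column = [5, 12, 7, 30, 1, 18, 22, 3, 9, 15]
    cpu_like = DeviceProfile("cpu-like", work_items=4, strided=False)
    gpu_like = DeviceProfile("gpu-like", work_items=64, strided=True)
    for dev in (cpu_like, gpu_like):
        print(dev.name, run_select(column, lambda v: v > 10, dev))
```

Both profiles return the same row ids. In this sketch only the mapping from work-items to elements depends on the device, which is the property a runtime-compiled operator set would rely on.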

Index Terms
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Published In

Proceedings of the VLDB Endowment Volume 6, Issue 9

July 2013

180 pages

ISSN: 2150-8097

Publisher

VLDB Endowment
