Hardware-oblivious parallelism for in-memory column-stores

Published: 01 July 2013

Abstract

The multi-core architectures of today's computer systems make parallelism a necessity for performance-critical applications. Writing such applications in a generic, hardware-oblivious manner is challenging, so current database systems rely on labor-intensive and error-prone manual tuning to exploit the full potential of modern parallel hardware architectures such as multi-core CPUs and graphics cards. We propose an alternative design for a parallel database engine, based on a single set of hardware-oblivious operators that are compiled down to the actual hardware at runtime. This design reduces the development overhead for parallel database engines while achieving performance competitive with hand-tuned systems.
We provide a proof of concept for this design by integrating operators written in the parallel programming framework OpenCL into the open-source database MonetDB. Following this approach, we achieve efficient yet highly portable parallel code without the need for optimization by hand. We evaluated our implementation against MonetDB using TPC-H derived queries and observed performance that rivals MonetDB's query execution on the CPU and surpasses it on the GPU. In addition, we show that the same set of operators runs nearly unchanged on a GPU, demonstrating the feasibility of our approach.
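
As a rough illustration of what "hardware-oblivious" can mean for an operator, the sketch below writes a selection once, per work item, and lets a stand-in runtime decide how many work items to launch. It is plain Python rather than OpenCL, it runs the work items sequentially, and every name in it (`select_less_than`, `run_kernel`, `num_items`) is a hypothetical choice made for this example, not code from the paper or from MonetDB.

```python
from typing import Callable, List


def select_less_than(work_item: int, num_items: int, column: List[int],
                     bound: int, bitmap: List[int]) -> None:
    """Hypothetical kernel body, executed once per work item.

    It only knows its own id and the total number of work items; it never
    asks which kind of device it runs on.
    """
    # Strided partitioning: work item i handles i, i + num_items, ...
    for idx in range(work_item, len(column), num_items):
        bitmap[idx] = 1 if column[idx] < bound else 0


def run_kernel(kernel: Callable[..., None], num_items: int, *args) -> None:
    """Stand-in for a runtime. A real one would run the work items in
    parallel; here they run one after another (an assumption made to keep
    the sketch self-contained)."""
    for work_item in range(num_items):
        kernel(work_item, num_items, *args)


def select(column: List[int], bound: int, num_items: int) -> List[int]:
    """Return a 0/1 bitmap marking the values smaller than `bound`."""
    bitmap = [0] * len(column)
    run_kernel(select_less_than, num_items, column, bound, bitmap)
    return bitmap


column = [5, 12, 7, 30, 1, 18]
few_items = select(column, 10, num_items=2)    # e.g. a few CPU cores
many_items = select(column, 10, num_items=64)  # e.g. many GPU threads
```

Only `num_items` changes between the two calls; the operator itself stays the same, which is the property the abstract attributes to its operator set.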

Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 9
July 2013
180 pages

Publisher

VLDB Endowment
