Voodoo - a vector algebra for portable database performance on modern hardware

Published: 01 October 2016

Abstract

In-memory databases require careful tuning and many engineering tricks to achieve good performance. Such database performance engineering is hard: a plethora of data- and hardware-dependent optimization techniques form a design space that is difficult to navigate for a skilled engineer --- even more so for a query compiler. To facilitate performance-oriented design exploration and query plan compilation, we present Voodoo, a declarative intermediate algebra that abstracts the detailed architectural properties of the hardware, such as multi- or many-core architectures, caches and SIMD registers, without losing the ability to generate highly tuned code. Because it consists of a collection of declarative, vector-oriented operations, Voodoo is easier to reason about and tune than low-level C and related hardware-focused extensions (Intrinsics, OpenCL, CUDA, etc.). This enables our Voodoo compiler to produce (OpenCL) code that rivals and even outperforms the fastest state-of-the-art in-memory databases for both GPUs and CPUs. In addition, Voodoo makes it possible to express techniques as diverse as cache-conscious processing, predication and vectorization (again on both GPUs and CPUs) with just a few lines of code. Central to our approach is a novel idea we termed control vectors, which allows a code-generating frontend to expose parallelism to the Voodoo compiler in an abstract manner, enabling portable performance across hardware platforms.
We used Voodoo to build an alternative backend for MonetDB, a popular open-source in-memory database. Our backend allows MonetDB to perform at the same level as highly tuned in-memory databases, including HyPer and Ocelot. We also demonstrate Voodoo's usefulness when investigating hardware-conscious tuning techniques, assessing their performance on different queries, devices and data.
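
To make the abstract's claim of expressing tuning techniques "with just a few lines of code" more concrete, the sketch below shows the kind of OpenCL a backend in the spirit of Voodoo might emit for a predicated selection. This is a hypothetical illustration, not code from the paper: the kernel name, the column and parameter names, and the flag-plus-compaction strategy are assumptions made purely for exposition.

// Hypothetical sketch: a predicated selection (price < threshold) as an
// OpenCL kernel. Instead of branching per tuple, each work-item stores the
// comparison outcome as a 0/1 match flag; a separate prefix-sum pass could
// then compact the qualifying positions into a result vector.
__kernel void select_predicated(__global const float *price,
                                __global int         *match,
                                const float           threshold,
                                const int             n)
{
    int i = get_global_id(0);
    if (i < n) {
        // Predication: the predicate result becomes data, so every
        // work-item follows the same control flow regardless of outcome.
        match[i] = (price[i] < threshold) ? 1 : 0;
    }
}

The same flag-producing formulation maps naturally onto SIMD execution on CPUs, which is one reason predication is among the techniques the abstract lists as expressible on both device classes.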


Published In

Proceedings of the VLDB Endowment, Volume 9, Issue 14
October 2016, 96 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
