research-article

Fast queries over heterogeneous data through engine customization

Published: 01 August 2016

Abstract

Industry and academia are becoming increasingly data-driven and data-intensive, relying on the analysis of a wide variety of heterogeneous datasets to gain insights. The different data models and formats make it significantly harder to analyze a combination of diverse datasets. Serving all queries using a single, general-purpose query engine is slow. On the other hand, using a specialized engine for each heterogeneous dataset increases complexity: queries touching a combination of datasets require an integration layer over the different engines.
This paper presents a system design that natively supports heterogeneous data formats and also minimizes query execution times. For multi-format support, the design uses an expressive query algebra that enables operations over various data models. For minimal execution times, it uses a code generation mechanism to mimic the system and storage best suited to answering each query quickly. We validate our design by building Proteus, a query engine that natively supports queries over CSV, JSON, and relational binary data, and that specializes itself to each query, dataset, and workload via code generation. Proteus outperforms state-of-the-art open-source and commercial systems on both synthetic and real-world workloads without being tied to a single data model or format, all while exposing a single query interface to users.
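As a rough illustration of what "specializing to each query and dataset via code generation" can mean, the toy sketch below generates Python source for one fixed query shape (a filtered sum) over either CSV or line-delimited JSON, then compiles it with the built-in `compile` and `exec`. It is not Proteus's implementation or API: the names `SCAN_TEMPLATES`, `generate_sum_query`, and `compile_query`, the query shape, and the sample data are all assumptions made for this example.

```python
import csv
import io
import json

# Hypothetical per-format scan loops; each one binds `rec` to a dict per record.
SCAN_TEMPLATES = {
    "csv": "for rec in csv.DictReader(io.StringIO(raw)):",
    "json": "for rec in map(json.loads, filter(None, raw.splitlines())):",
}


def generate_sum_query(fmt, sum_field, filter_field, threshold):
    """Emit source for: SELECT SUM(sum_field) WHERE filter_field > threshold."""
    scan = SCAN_TEMPLATES[fmt]
    # Field names and the constant are baked into the generated source, so the
    # compiled function does not interpret a query plan for every record.
    return (
        "def query(raw):\n"
        "    total = 0.0\n"
        f"    {scan}\n"
        f"        if float(rec[{filter_field!r}]) > {threshold!r}:\n"
        f"            total += float(rec[{sum_field!r}])\n"
        "    return total\n"
    )


def compile_query(src):
    """Compile generated source and return the specialized `query` function."""
    namespace = {"csv": csv, "io": io, "json": json}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["query"]


# Made-up sample inputs in two formats.
csv_data = "id,price,qty\n1,10.0,3\n2,20.0,7\n3,30.0,9\n"
json_data = (
    '{"id": 1, "price": 10.0, "qty": 3}\n'
    '{"id": 2, "price": 20.0, "qty": 7}\n'
)

# The same logical query, specialized once per input format.
csv_query = compile_query(generate_sum_query("csv", "price", "qty", 5))
json_query = compile_query(generate_sum_query("json", "price", "qty", 5))

print(csv_query(csv_data))    # 50.0
print(json_query(json_data))  # 20.0
```

The caller sees one interface (`generate_sum_query` plus `compile_query`) regardless of format, while the code that actually runs contains only the scan loop and field accesses needed for that query and input.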

Published In

Proceedings of the VLDB Endowment  Volume 9, Issue 12
August 2016
345 pages
ISSN:2150-8097
Publisher

VLDB Endowment
