Split query processing in polybase

DJ DeWitt, A Halverson, R Nehme, S Shankar… - Proceedings of the …, 2013 - dl.acm.org
DJ DeWitt, A Halverson, R Nehme, S Shankar, J Aguilar-Saborit, A Avanes, M Flasza…
Proceedings of the 2013 ACM SIGMOD International Conference on Management of …, 2013dl.acm.org
This paper presents Polybase, a feature of SQL Server PDW V2 that allows users to manage
and query data stored in a Hadoop cluster using the standard SQL query language. Unlike
other database systems that provide only a relational view over HDFS-resident data through
the use of an external table mechanism, Polybase employs a split query processing
paradigm in which SQL operators on HDFS-resident data are translated into MapReduce
jobs by the PDW query optimizer and then executed on the Hadoop cluster. The paper …
This paper presents Polybase, a feature of SQL Server PDW V2 that allows users to manage and query data stored in a Hadoop cluster using the standard SQL query language. Unlike other database systems that provide only a relational view over HDFS-resident data through the use of an external table mechanism, Polybase employs a split query processing paradigm in which SQL operators on HDFS-resident data are translated into MapReduce jobs by the PDW query optimizer and then executed on the Hadoop cluster. The paper describes the design and implementation of Polybase along with a thorough performance evaluation that explores the benefits of employing a split query processing paradigm for executing queries that involve both structured data in a relational DBMS and unstructured data in Hadoop. Our results demonstrate that while the use of a split-based query execution paradigm can improve the performance of some queries by as much as 10X, one must employ a cost-based query optimizer that considers a broad set of factors when deciding whether or not it is advantageous to push a SQL operator to Hadoop. These factors include the selectivity factor of the predicate, the relative sizes of the two clusters, and whether or not their nodes are co-located. In addition, differences in the semantics of the Java and SQL languages must be carefully considered in order to avoid altering the expected results of a query.
ACM Digital Library