- Research
- Open access
- Published:
Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems
Journal of Big Data volume 6, Article number: 34 (2019)
Abstract
Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts, mainly organizing data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system like HDFS. Some studies were conducted for understanding the ways of optimizing the performance of several storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance, when using Hive as the storage technology for implementing Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying the efficiency of those strategies in query performance. The obtained results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of using adequate partitioning strategies. Defining the partitions aligned with the attributes that are frequently used in the conditions/filters of the queries can significantly increase the efficiency of the system in terms of response time. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were verified. The same is not verified with the use of bucketing strategies, which shows potential benefits in very specific scenarios, suggesting a more restricted use of this functionality, namely in the context of bucketing two tables by the join attribute of these tables.
Introduction
One of the fundamental reasons for the notoriety of the Big Data phenomenon is the current extent to which information can be generated and made available [11], mainly due to the constant innovation, transformation, globalization and personalization of the services associated with new business models. Many definitions of the Big Data concept exist, mainly aligned with the consensus that Big Data can be defined as large amounts of data, flowing at different velocities, with varying degrees of complexity, without structure and/or organization, which cannot be processed or analyzed using traditional processes or tools [11, 18, 23, 36].
One of the most popular approaches for managing large-scale datasets in a structured way is by the use of a Data Warehouse (DW), a repository with analytical purposes that is mainly responsible for integrating and storing data coming from operational systems, and that is widely considered as a fundamental enterprise asset to support decision-making. However, data volume is nowadays a major challenge for the DW, taking into consideration its traditional supporting technologies. Moreover, current data types and formats are also a major problem, since they challenge the fundamentals of DW processing, as these cannot be applied to free text, images, videos or sensor data [18]. Due to this current conceptual, technological and organizational context, the design and implementation of Big Data Warehouses (BDWs) is becoming an important area of study [6, 7, 13, 18, 20]. These repositories substantially differ from traditional DWs, since they must be based on new logical models, more flexible than the relational ones, and new technologies with higher levels of performance, scalability and fault-tolerance [14, 23].
Hadoop, an open source ecosystem for reliable, scalable and distributed computing [1], emerged as a solution to address Big Data processing on low-cost platforms, providing the computational resources to handle these large amounts of data [18]. Moreover, Hive, which is built on top of Hadoop, emerged as a system to store, query and manage large data volumes stored in distributed environments. Since its appearance, research in the area of Big Data Warehousing has been intensified, with developments aiming to bring the well-known concepts from relational databases, such as declarative query languages, tables and columns, into the unstructured environment of Hadoop. These characteristics, along with the metastore concept, i.e., the system catalog with the metadata information, contributed to the classification of Hive as a DW repository for Big Data [24]. In this sense, Hive is a distributed DW system that manages the data stored in HDFS (Hadoop Distributed File System) and provides a SQL-like language (HiveQL) for querying the data [3, 26]. For data storage, Hive has four main components for organizing data: databases, tables, partitions and buckets. Partitions and buckets can theoretically improve query performance, as tables are split by the defined partitions and/or buckets, distributing the data into smaller and more manageable parts [27].
This is a recent area of research where there is a lack of related work on the way data must be organized in Hive, as well as on the impact of that organization in query performance. Several open issues need further exploration from the scientific community, reason why the fundamental research questions of this work are expressed as follows: Are there any significant advantages in using partitions and/or buckets in Hive-based BDWs? Do these organization strategies have any impact on the efficiency of online analytical processing (OLAP) queries? What factors may influence the definition of an appropriate data organization strategy?
Given this context, this work has as main motivation verifying to what extent the way in which data is modelled and organized influences the query processing time of BDWs. Partitioning and bucketing strategies can be used when building BDWs, but they can be neglected by the practitioners or, sometimes, used in an ad hoc manner. The insights from this paper can be used to improve the knowledge-base regarding the guidelines for creating partitions and buckets, which we consider as a topic that is frequently unknown or subjective for (Big) Data Warehousing practitioners. For addressing this main concern, this study aims to understand the impact of different data organization strategies in the query processing time of BDWs, extending the preliminary work and results addressed in [10], specifically focusing on the following aspects: (i) the relationship and impact between the definition of partitions and buckets in Hive, either individually or combining these two strategies; and, (ii) how the data processing workloads are affected regarding query processing time, as the volume of data that needs to be manipulated in a specific query can be significantly reduced with the adoption of an appropriate distribution of the data. As the implementation of BDWs is a significantly recent area of research, almost no guidelines are available regarding the way these repositories can be organized for increasing the overall performance of the system. Consequently, after the presentation, evaluation and discussion of the results, this paper summarizes a set of good practices for the modelling and organization of data in Hive-based BDWs.
This paper is structured as follows: “Related work” section presents the related contributions in this topic. “Methods/experimental” section describes the technological infrastructure, the dataset and the test scenarios used in this research process. “Results” section describes the obtained results, highlighting the performed benchmarks and the needed resources, both in terms of processing time and central processing unit (CPU) usage. “Discussion” section discusses the obtained “Results” and “Conclusions” section presents the main conclusions, pointing the usefulness and applicability of the several strategies for organizing BDWs.
Related work
Data models have been key components in Business Intelligence and Analytics (BI&A) systems, ensuring that the analytical needs of the business are properly integrated and considered, allowing data analysis through different perspectives [27, 28]. In a traditional BI&A context, dimensional data models are the most popular ones [17], including star schemas for the different considered business processes. Although very useful, these logical models are not usually appropriate for Big Data contexts, requiring the adoption of new logical constructs that address the characteristics of NoSQL databases and the associated technologies available in the Hadoop environment [14]. In the work of [6, 25], the authors highlight that the design of a BDW should focus not only on the physical layer (the technological infrastructure), but also on a logical layer, giving an overall perspective on the data models, the logical components and how the data flows throughout the components. For [21], the design methodology of a BDW should be highly agile and iterative, integrating as many data sources as possible (either internal or external to the organization), and may use or not a rigid data model, aiming for a fast understanding and perception of the data.
Currently, SQL-on-Hadoop systems are significantly popular solutions for querying data available in a Hadoop cluster, of which several can be highlighted: Hive; Presto; Spark SQL; Drill; and Impala. Due to their popularity, several benchmarks compare their performance, as for instance the available in [9, 29]. However, SQL-on-Hadoop benchmarks do not usually consider the impact of the data models, addressing mostly how fast these systems can be considering different workloads.
In the context of a BDW and having into consideration that Hive is the main Data Warehousing solution in Hadoop, supporting queries in HiveQL, it is important to understand how the way data is stored and organized in this system affects the performance of the solution. Thus, as previously mentioned, this system supports three types of data structures, namely tables, partitions and buckets [12, 31], included in databases. The concept of tables in Hive is similar to the concept of tables in relational databases (common structures with columns and rows), and each table corresponds to an HDFS directory. A Hive’s table can have one or more partitions that define the distribution of the data within subdirectories of the table’s directory, splitting the data horizontally and speeding up query processing. The buckets correspond to file segments in HDFS and can only be applied to a single attribute. These structures help to organize data in each table/partition by dividing it by several files. To identify the segment to which a data record must be assigned, a hash function is applied on the bucketing column. Consequently, it is a technique for grouping data vertically, segmenting data records by a given attribute. Each bucket is stored as a file within the table’s directory or the partitions’ directories [12, 15, 27, 31].
Regarding data modelling, an evaluation of different data modelling and organization strategies for Hive-based DWs is described [9], showing the benefits of implementing a BDW based on a fully denormalized table, when compared with a dimensional structure (star schema). Moreover, [4, 5, 35] analyzed the implementation of BDWs based in NoSQL databases. While [4] studied the implementation of a DW based on a document-oriented NoSQL database and [5] explored implementations of DWs on top of column-oriented NoSQL databases, [35] proposed a transformation process for moving from a dimensional DW into a column-oriented and document-oriented NoSQL data model.
Regarding the data organization strategies, the creation of partitions and buckets in Hive has already been addressed in the literature. Kumar [19] presented a brief performance analysis and comparison of MySQL partitions, Hive Partition/Bucketing and Apache Pig, highlighting the Hive’s advantages with the use of partitioning and bucketing techniques. To [30], Hive partitioning can be used for improving the performance of a very specific set of queries, as long as the partitions are aligned with the attributes used in the queries’ filters. Moreover, in [27], it is recommended that the attribute, or attributes, used for partitioning have low cardinality, avoiding the creation of a significantly high number of subdirectories, a process that will overload HDFS. Furthermore, according to [2], partitioning can improve query performance in large datasets, when, as already mentioned, the partition scheme considers the attributes used in the queries’ filters. These benefits were also shown in [9], presenting the advantages of creating data partitions using two different data organization strategies (star schemas and fully denormalized tables).
Partitioning requires the use of an attribute that does not create a large number of small partitions, avoiding a large number of small files that typically slow down the processing time of Hadoop [30], while bucketing clusters large data sets into more manageable parts, corresponding to file segments in HDFS [2]. This means that bucketing is an ideal technique for sampling and joining tables more efficiently. For [27], buckets help to organize the data in each partition, distributing the data in several segments, being useful for attributes with high cardinality. The work of [30] highlights other useful considerations for using bucketing in Hive, namely: it is useful for fact tables in a star schema; map-side joins can be more efficient if the joining attribute is bucketed; the bucket file size should have, at least, 1 GB; the number of buckets cannot be changed after the creation of the table; processing times can also be improved by combining bucketing with sort techniques. In general, bucketing may also optimize execution times, namely when bucketing by the attributes used in the queries’ “group by” and “order by” clauses and when a bucket has at least the size of one HDFS block or a multiple of that size. Besides these contexts, the use of bucketing is usually discouraged. However, all these considerations are theoretical considerations, not corroborated by any type of practical work or performance analysis, which emphasizes the lack of studies about the real impact of the implementation of bucketing techniques.
Nowadays, and due to the youth of this research area, scientific papers related with data organization strategies in a BDW are scarce. Despite some of the mentioned studies already considering some partitioning strategies, there is a significant absence of works analyzing the impact of bucketing, the combination of partitioning and bucketing on Hive’s data models, and how the use of these techniques can be optimized. Therefore, this work, extending the work previously presented in [10], seeks to fulfil these scientific gaps by addressing different data organization strategies, i.e., by benchmarking different combinations of partitions and buckets for two different data modelling patterns, based on star schemas and fully denormalized tables, as these are the most common modelling approaches used when implementing Hive-based BDWs. To accomplish this task, several workloads were tested using different scale factors (SFs), providing a clear overview of the impact of partitioning and bucketing strategies in these data modelling patterns.
Methods/experimental
Considering that the main goal of this work is the proposal of some best practices for modelling and organizing Hive-based BDWs, it is important that the guidelines and considerations here provided are adequately validated and the results are replicable. Therefore, a benchmark that includes several workloads was conducted to evaluate the performance of a Hive BDW in different scenarios. This section describes the materials and methods used in this research process.
Technological infrastructure
For this study, a Hadoop cluster including five nodes with similar configurations was used. Each node is composed of the following components:
-
(i)
1 Intel Core i5, quad core, with a clock speed ranging between 3.1 GHz and 3.3 GHz;
-
(ii)
32 GB of 1333 MHz DDR3 Random Access Memory (RAM), with 24 GB available for query processing;
-
(iii)
1 Samsung 850 EVO 500 GB Solid State Drive (SSD) with up to 540 MB/s read speed and up to 520 MB/s write speed;
-
(iv)
1 Gigabit Ethernet card connected through Cat5e Ethernet cables and a gigabit Ethernet switch;
-
(v)
The operating system installed in all nodes is CentOS 7 with an XFS file system.
In this infrastructure, one of the nodes is configured with the HDFS NameNode and YARN ResourceManager, assuring the typical management roles in Hadoop, and the other four nodes are configured as HDFS DataNodes and YARN NodeManagers.
The Hadoop distribution used in this work is the Hortonworks Data Platform (HDP) 2.6.0 with the default configurations, excluding the HDFS replication factor, which was set to 2. Besides Hadoop (including Hive), Presto v.0180 is also available, being the coordinator installed on the NameNode and the workers on the four remaining DataNodes. All Presto’s configurations were left to their defaults, except the memory configuration, which was set to use 24 GB of the 32 GB available in each worker (similar to the memory available for YARN applications in each DataNode/NodeManager).
Dataset and queries
In this work, the well-known star schema benchmark (SSB) was used, which considers a traditional sales data mart modeled according to dimensional structures (star schemas). This benchmark is based on the TPC-H Benchmark [33], with the necessary adaptations to transform the data model into a star schema, as can be seen in [22]. From the proposal of [22] and the data schema here used, there are some particular differences, namely: i) the original TPC-H scale factor of the customer and supplier tables was left unchanged, since in real contexts it is possible to have large customer and supplier dimensions, as happens in large e-commerce enterprises and social media networks; ii) a temporal dimension with less attributes than the one used by [22] was created, maintaining only the attributes that are relevant for executing the workloads available in [22], in order to keep a leveled ground between the two types of data modelling strategies evaluated in this work (star schemas and denormalized tables).
Therefore, both SSB’s relational tables and the fully denormalized table were implemented in the Hive BDW, being stored using the Optimized Row Columnar (ORC) format and compressed using ZLIB. Besides the dataset, this work also uses the 13 queries included in the SSB benchmark, measuring the performance of the BDW in typical OLAP workloads. The 13 queries are available in the work of [22] and, also, in [8] that provides all the scripts used in this work to run the queries in Hive and Presto. For having an overall overview of the queries and their patterns, the following listing code shows the first query of each group, as the SSB includes four groups of queries, as will be seen in the following subsection.
Test scenarios
In order to understand the impact in query processing times when using different strategies for data partitioning and bucketing, several test scenarios were defined (Fig. 1). In these scenarios, two different data models (star schema and denormalized table) are tested for three different SFs (30, 100 and 300), following the application of three main data organization strategies: partitioning by multiple attributes, bucketing and the combination of both. For each SF, the SSB data is stored in HDFS, and Hive tables are created for both data organization strategies. The queries are executed in Presto and Hive (on Tez). The selection of these two SQL-on-Hadoop engines takes into consideration the results in [29]. Moreover, considering the work of [9], the broadcast join strategy was used for Presto to optimize the star schema processing times, in order to assure that they are comparable to the results of the denormalized table.
The study of the cardinality and distribution of the attributes available in the dataset was done to choose the attributes and the several combinations among them, in order to adequately plan partitioning strategies, bucketing strategies and the combinations of both. Regarding the denormalized table, for the highest SF used in this work, it was not possible to replicate all the scenarios due to the memory limitations of the infrastructure used in this work.
To obtain more rigorous results, several scripts were developed to sequentially execute each query four times. The results in this work are presented as the average of the four executions. These scripts were adapted according to the SQL-on-Hadoop system in use (Presto or Hive), the applied data model (denormalized or star schema) and the data organization strategy (with or without partitions and buckets).
Results
After the work of [9], showing the advantages of simple partitioning using the attributes more frequently used in the query filters, and considering the work described in [10], this paper extends that previous work and presents the results obtained with: (i) the use of a multiple partitioning strategy; (ii) the use of different bucketing strategies (simple and multiple); and (iii) the combination of partitioning (simple and multiple) and bucketing strategies.
Despite the results depicted in [9], regarding the advantages of using a fully denormalized table over a dimensional model based on a star schema in Hive, this work also extends the comparison between these two data modelling techniques by applying different partitioning and bucketing strategies not only to a denormalized table but also to a star schema.
To give a global overview of the efficiency of the different strategies, extending the focus of the analysis besides query processing time, the impact of the data organization strategies in the use of the CPU was also studied. Therefore, after presenting the time needed for processing the several workloads, each subsection ends with a study of CPU usage, taking as examples some scenarios used for the processing time analysis.
All the processing times for each query and for the several scenarios are presented in the next subsections without decimal places, for the sake of clarity and simplification in the visualization of results.
Multiple partitioning
As previously mentioned, the work of [9] showed that simple partitioning, using an attribute that frequently appears in the “where” clause of the queries, has benefits in terms of processing time. Having that in mind, this subsection presents the results obtained when tables are partitioned by more than one attribute, continuing to study the impact of this type of data organization strategy. Along this subsection, the fastest processing time for each query, workload, tool and data model is highlighted in italics when illustrating the results of the benchmark. Table 1 shows the results when the attributes “Od_Year” (order year) and “S_Region” (supplier region) were considered as partitioning attributes. These attributes are used as filters in 11 out of 13 SSB queries, either appearing individually or combined in the queries’ “where” clauses. As can be seen, this scenario highlights the advantages of multiple partitioning when compared with no specific data organization strategy in terms of partitions and/or bucketing. In a star schema context, the decrease in the overall processing time reaches 42% in Hive and 46% in Presto. In the context of a denormalized table, the decreases in Hive vary between 16 and 45%, while with Presto the decrease can be over 50% (54% in the best scenario).
The only queries that do not directly use any of these filters are Q1.2 and Q3.4, but they have related filters like “YearMonth” (concatenation of year and month) and “S_City” (supplier city). These results mean that, with this partitioning scheme, the same files and folders contain the “YearMonth” and “S_City” values that are related to the partitioning attributes, storing them closely and allowing the predicate pushdown at the level of the ORC stripe and file. This is a data filtering technique based on reading the headers and statistics of the ORC stripes and files created for the table. This technique first checks if the ORC stripe/file contains any line that matches the query predicate, identifying if the stripe/file needs to be scanned or if it can be ignored, advancing to another stripe/file [16]. Queries 3.1 and 4.2 present the two attributes in their “where” clauses and are, in fact, the ones with significant decreases, sometimes higher than 50%. In these results, it is also verified that the third group of queries (Q3.1 to Q3.4) does not always highlight the advantages of partitioning. This may be related to their filters that only exclude 1 year, implying a search throughout 6 of the 7 folders created by the “Od_Year” partitioning attribute, plus the 5 folders created by the “S_Region” attribute.
Nevertheless, in general, as some queries verify a decrease in the response time equal to or higher than 50% (in 15% of the cases), these balance the ones that are negatively affected by this type of partitioning, providing an overall benefit when using this data organization strategy. Its advantages are verified in all SFs, both for Hive and Presto.
After discussing the results for a two-level partitioning scheme, Table 2 shows the results considering a three-level partitioning scheme, for the star schema, studying the creation of folders based on a spatial hierarchy, in which all the attributes appear at least once in the query “where” predicate. The attributes used here are “S_Region” (supplier region), “S_Nation” (supplier nation) and “S_City” (supplier city).
This partitioning scheme is complex and, by verifying the organization of the data, it is possible to realize that with the combination of 5 regions, 5 countries, and 9 cities, the data was distributed throughout 225 folders. Considering as an example the smallest SF with around 30 GB of data, several small files were created in HDFS. In this scenario, each partition gets a total of approximately 22 MB, a data distribution context that is not adequate for enhancing HDFS performance, as already discussed in previous sections. Even in this case, it was possible to observe a reduction of the overall processing time, as shown in Table 2.
This scenario was not replicated using the denormalized tables, due to memory limitations in the used cluster. Moreover, for the same reason, the largest scaling factor was not replicated using the star schema, given the complexity of this scenario. Nevertheless, given all the performed tests, we believe that the main conclusions remain true.
As already mentioned, although this strategy involves significant complexity in its creation, as well as memory problems and the storage of several small files, there is a noticeable decrease from Q2.1 to Q4.3 (highlighted in italics in Table 2), as the partitioning attributes are often used in the “where” clauses. The decrease in processing time is even more evident in Q3.3 and Q3.4, as these queries have the filter by city (the most detailed level of partitioning). Combining the predicate pushdown technique, on the two previous partitioning levels of region and nation, with the highest level of detail provided by city, these queries need, in the best scenario, less 85% of the time to execute.
However, although the advantages may be clear, it is important to be careful with this type of data organization strategy, in order to avoid an excessive partitioning that may impact the performance of HDFS. Consequently, the creation of many partitioning levels must be carefully analyzed, including a study of the expected total data volume, the amount of data that will be stored in each folder, and the type of updating processes that may be implemented for the BDW.
In addition to the analysis of the processing time, the use of CPU by the queries was also verified, in both data models with the application of multiple partitioning. Figure 2 shows the variation in CPU usage by the partitioned tables, using as an example the partitioning by “Od_Year” and “S_Region”. The obtained results are compared with the CPU workload for processing the tables without any type of data organization strategy. In this figure, the presented results consider the use of Presto and a SF of 100 GB, showing the impact on CPU needs with an intermediate workload in terms of data volume.
Analyzing Fig. 2, and taking into consideration the results presented for the star schema, the queries that have at least one of the partitioning attributes as filter (namely Q1.1, Q1.3, Q2.1, Q.2.2, Q2.3, Q3.1, Q3.2, Q3.3, Q4.1, Q4.2 and Q4.3) are the ones that present higher decreases in CPU usage. Regarding the denormalized table, the advantage of partitioning is evident in all queries, although the decreases tend to be smaller in queries that do not have the partitioning attributes in their filters. Thus, with these results, it is possible to conclude that, in addition to the decrease in processing time, this type of strategy also achieves less CPU usage, improving the overall system performance.
Bucketing
According to the literature, although rarely mentioned or exemplified, the definition of buckets can consider the attributes with high cardinality and the way data should be grouped/sorted according to the expected queries. In addition, the number of buckets should be defined to avoid the creation of several small files [15, 16, 27, 30]. According to [30], the files must have at least 1 GB of size to optimize storage. Given the lack of strict guidelines, the empirical knowledge obtained when testing different scenarios was used to identify the number of buckets, by following this expression:
In this work, as the cluster’s minimum size of a HDFS block is 128 MB, some tested scenarios followed the previous expression, while in other cases, the size of the dataset was divided by 1 GB, mainly for larger SFs, avoiding the creation of several small files in HDFS, as this may jeopardize its performance.
Based on these two options for calculating the number of buckets, Table 3 shows the number of buckets that must be used in the creation of the tables that only use bucketing as the data organization strategy.
Considering the high number of buckets that would be created for files of 128 MB, the second approach (files with at least 1 GB of data) was followed for all the tables that were created with bucketing as the only data organization strategy.
Considering this, this subsection presents the results obtained when the tables are bucketed by attributes that have high cardinality and/or that are grouped/sorted according to those attributes. In each table, the fastest processing time for each query, workload and tool is highlighted in italics. Table 4 presents the results obtained when the table is bucketed by “Orderkey”, an attribute with high cardinality.
The results obtained with this bucketing strategy do not show any advantage when compared with a scenario without any type of data organization strategy. Apparently, there is no recognition of the data organization strategy here applied, since processing times always increase, except in the case of Hive for SF = 30. Although it does not present any significant changes (e.g., decreases of 2%), it presents minor variations that can change from execution to execution. Thus, this scenario does not seem to represent an adequate practice for organizing the data.
Given this less advantageous results for bucketing, when using a high cardinality bucketing attribute, another approach was followed, taking into consideration the documentation from Hortonworks and Hive, which states that using bucketing attributes that are sorted by a common attribute used in the queries can be advantageous for processing time, Table 5 presents the results obtained for a denormalized table with buckets created by “Od_Year” (order year) and sorted by “P_Brand” (product brand) [12, 16].
In addition to the study of this technique for sorting the data, this scenario also intends to study the definition of buckets using attributes that are used in the “group by” and “order by” clauses of the queries. Despite being a low cardinality attribute, buckets were defined by order year (“Od_Year”), creating 7 buckets in both scaling factors (SF = 30, SF = 100), with each year’s data being stored in a different file. This strategy is only applied to the denormalized table, since it is not possible to create buckets for attributes only present in the dimensions of the star schema. Additionally, this scenario could not be replicated in the SF = 300 workload, due to the cluster’s memory limitations.
In general, the results demonstrate a decrease in processing times in a context of sorted buckets. Almost all the queries, which include the sorted attribute in the “group by” and “order by” clauses (Q2.1 to Q4.3—highlighted in italics in Table 5), present advantages with this data organization strategy. Of all these queries, the third group (Q3) is the one where this decrease is not always verified. Nevertheless, it is important to recall that these queries are complex and have filters with large time intervals, reason why in contexts of larger amounts of data, additional processing time may be needed.
Even queries that do not include the sorted attribute in the “group by” and “order by” clauses verify decreases in the processing time (namely Q1.1, Q1.2, Q1.3). Analyzing this group of queries, it is possible to see that the ones with less complexity are also the ones that only deal with temporal data in almost all conditions, reason why an organization of the files per year enhances their processing, due to the optimization techniques used by the querying systems. In addition, Q2.2, Q2.3 and Q4.3 also present the “P_Brand” attribute in the “select” clause, as well as in filters and in other clauses. As the files are sorted by this attribute, performance in data processing increases less than 50% in most of the executions. This is the first scenario where this data organization strategy presents benefits in processing time.
For the star schema, as it is not possible to create sorted buckets in the fact table using dimensions’ attributes, this data organization strategy was tested creating the fact table with buckets by the attribute “Orderkey”, sorted by “Orderdate”. However, since these two attributes are not recognized in any of the “group by” clauses of the queries, the results for this scenario have no benefits when compared to the results obtained for the tables without any kind of data organization strategy. Therefore, the results for this scenario are not here presented.
As mentioned in previous sections, the works of [15, 27, 30] argue that the definition of buckets can have advantages when joining two or more tables, as long as both tables use bucketing by the same column. Thus, and only considering the star schema model, two distinct scenarios have been defined to create buckets that intend to study the bucketing advantages when using join operations, namely simple bucketing and multiple bucketing.
Table 6 presents the first scenario, which includes creating the fact table with the definition of buckets by a key that is used to perform join operations with one of its dimensions, in this case the supplier dimension. Both tables were created using as bucketing attribute “Suppkey” (supplier key).
The results presented here show the potential of Hive in this type of data organization strategy. Contrary to what has been seen so far, this type of data organization, which points out the benefits of buckets to join two or more tables, ensures better performance when using Hive, though Presto continues to have better processing times. Figure 3 highlights the results obtained using Hive for SF = 300.
In this case, Hive appears to activate the bucket map join, an appropriate join strategy for large tables with buckets using the join attribute, as long as the number of buckets in one of the tables is a multiple of the number of buckets in the other [15]. As these conditions are verified in this scenario (although in some SFs the size of the tables is not so large), there is a clear advantage, with the SF = 300 workload showing decreases of 63%. Presto probably does not recognize this type of data organization strategy and, therefore, the advantages are not verified with the use of this system, despite its faster processing times. Consequently, this would be a beneficial strategy for contexts where a BDW based on dimensional models is chosen, and where Hive is used not only for storage, but also for query processing.
The second scenario considers bucketing based on multiple attributes. Although several authors state that it is advisable to create buckets using only one attribute [2, 27, 30, 32], this work shows that it is possible to create buckets with multiple attributes, as Hive internally applies a hash function to the concatenation of these attributes as a single string. Therefore, this scenario tests if the tools recognize this strategy and if there is any kind of advantage in using it. The fact table was defined with four buckets, corresponding to the keys used to perform joins with the four dimensions (“Orderdate”, “Custkey”, “Suppkey” and “Partkey”). The dimensions were bucketed by the corresponding key, and the results are presented in Table 7.
Regarding the possible benefits that could be obtained with bucketing when joining two or more tables, and with several bucketing attributes, the results show a clear disadvantage for this type of organization strategy, since in 92% of the cases this bucketing strategy did not show any performance benefits. Even the queries that include all the attributes in the join operation (Q4.1, Q4.2 and Q4.3) did not present any benefit by having this configuration. Therefore, the disadvantages of the application of multiple bucketing are here shown, confirming that the SQL-on-Hadoop systems used in this work did not benefit from this type of data organization strategy.
Again, in addition to the study of the processing time, the CPU usage was also analyzed for each query in both data models, now with the implementation of bucketing techniques. Figure 4 shows the variation in CPU usage, obtained with Presto, by the tables using simple bucketing in relation to the tables without any data organization strategy. The first example here presented considers bucketing by “Orderkey” (Fig. 4).
As in the study of processing times, this scenario shows a disadvantage for bucketing strategies when analyzing the CPU usage. All queries, in both data models, require a higher use of the CPU, with values sometimes higher than 60% of the time needed for a star schema without buckets, and higher than 35% of the time needed for a denormalized table without any data organization strategy. As such, this data organization strategy brings disadvantages both considering processing time and the use of CPU resources.
Given the more satisfactory results obtained with the implementation of simple bucketing in two tables of the star schema by the same attribute, Fig. 5 shows the variations of CPU usage obtained by the two querying tools in SF = 100.
As shown in the variations of the processing time (Table 6), it is not for the SF = 100 that more advantages are verified. Although in these tests Hive better recognizes the bucketing strategy, when compared with Presto, this does not imply a decrease in the CPU usage. However, if we consider this strategy for the SF = 300 and only using Hive, as Presto does not recognize this strategy in any of the analyzed workloads, Fig. 6 presents the variation in CPU usage and the clear decrease in the used resources for the majority of the queries.
Considering these results, it is possible to highlight that, with Hive, significant decreases are obtained not only in the processing time, but also in CPU usage. On average, there is a decrease of about 41% with the application of this data organization strategy in the larger SF, which is when Hive seems to activate join optimization mechanisms.
Considering the use of the “sorted by” technique, which was the simple bucketing scenario for the denormalized table that obtained the best results, Fig. 7 presents the variation in CPU usage by the tables with simple bucketing and sorted data in relation to the tables without any data organization strategy. The simple bucketing example here used was bucketing by Od_Year (sorted by P_Brand) and all results presented here are the results for Presto SF = 100 and for the denormalized model.
As can be seen, the queries that recognize the “Od_Year” attribute in the “group by” clause (ranging from Q2.1 to Q4.3) are queries that usually require less CPU usage. The queries of group 3 (Q3.1 to Q3.4), as in the analysis of processing times, show some increases due to their complexity and the filters with large time intervals, which in contexts of larger amounts of data require some extra CPU usage. The queries in group 1 (Q1.1 to Q1.3) also show decreases, because these are queries that benefit from an organization of the files per year, due to their filters and temporal conditions.
Complementing the results shown so far, to conclude the analysis of bucketing as a data organization strategy, and although the results regarding query processing time did not show any additional benefit, the variation in CPU usage in a multiple bucketing context is also studied and is shown in Fig. 8.
In this case, as with processing time, the disadvantage of this scenario in terms of CPU usage is also clear. All the queries require more CPU usage, spending, in average, 40% more resources than in the case of a star schema without buckets. Consequently, the multiple bucketing strategy seems to have no advantages both in terms of processing speed and in terms of the resources needed to do such processing.
Combination of partitioning and bucketing
As the scenarios defined for this section do not always make sense for the two data models compared in this work, this section will be divided into two subsections. The first subsection presents the scenarios combining partitioning with bucketing applied to the star schema, while the second subsection presents the scenarios tested for the denormalized table.
Regarding the definition of the number of buckets, in the previous scenario, it was considered the total size of the table without any distribution of the data by partitions, for example (“Bucketing” subsection). However, the calculation in this scenario is based on the average size of a table partition, since the creation of too many small files within each subdirectory needs to be avoided. Table 8 presents the possible approaches for defining an appropriate number of buckets. In this data organization strategy, combining partitioning and bucketing, it was not possible to follow one of the approaches for all cases, as it depends on the size of the partitions. The approach used in each case is the one highlighted in italics in Table 8.
Star schema
Table 9 presents the results obtained for the scenario of simple partitioning by “Od_Year”, an attribute that is frequently used in the “where” clause of the SSB queries (Q1.1, Q1.3, Q3.1, Q3.2, Q3.3, Q4.2 and Q4.3), and as bucketing an attribute with high cardinality (“Orderkey”).
In this case, the queries presenting a decrease in the processing time are mainly those in which the attribute used as partition appears in their filters, consolidating the results obtained in the work of [9]. If the attribute used for bucketing is not present in the filters of the queries, it is important to verify whether the positive results in this configuration are related to the combination of the two techniques or if they are only related with the use of partitions. Therefore, the increase in the overall processing time verified in the highest SF can be related with the use of bucketing, adding complexity to the queries that do not include the bucketing attributes as filtering attributes, withdrawing the possible positive impact of partitioning for some of the queries.
The next scenarios explore if there are benefits of using buckets when joining two tables and having one of them partitioned. Table 10 presents the results for the scenario of simple partitioning by “S_Region”, an usual attribute in the “where” clause of the SSB queries (Q2.1, Q2.2, Q2.3, Q3.1, Q4.1 and Q4.2), and as bucketing attribute one with high cardinality typically used as the join attribute with the Supplier dimension, the “Suppkey”.
Looking into the results of Hive in the SF = 300, the bucketing technique reveals a positive impact on the processing times, being the results presented in Fig. 9. The subset of queries that do not present a join with the Supplier dimension are Q1.1, Q1.2 and Q1.3. By analyzing the results, these are effectively the only ones that do not present decreases in the processing time. All the other queries present significant decreases, ranging from 5 to 92%. Again, Hive seems to activate the bucket map join, as it did in a previous scenario (using only buckets, Table 6), showing a clear advantage for this SF, presenting, in average, a decrease of 80% of the execution time. Although this is a remarkable result for Hive, Presto continues to obtain the fastest overall processing time (452 s in SF = 300).
Complementing this last scenario and considering that multiple partitioning previously showed advantages in data processing, an analysis with the use of multiple partitioning (partitioning by “Od_Year” and “S_Region”) combined with bucketing by the same attribute (“Suppkey”) was also performed (Table 11).
As can be seen in Table 11, the obtained results are significantly similar to the ones presented in the previous scenario. Nevertheless, it is relevant to emphasize that, with the use of these two partitioning attributes, only two queries (Q1.2 and Q3.4) did not acknowledge the filter, not verifying any decrease (or at least a significant one) in the processing time. The decreases achieved in the overall processing time of this scenario were relevant, although very similar to the results obtained in the scenario of multiple partitioning without bucketing, which may question the usefulness of buckets in these cases. Besides that, just like in the previous scenario, Hive achieves a significant decrease, higher than 80%, in the highest factors, which may be justified, once again, by the join optimization mechanisms (bucket map join).
Denormalized table
When considering a fully denormalized table, there are no guidelines regarding the way bucketing can be defined to influence execution time. Thus, in order to verify if a bucketing strategy brings any advantages when considering the attributes in the “group by” clause, partitioning by “Od_Year” and bucketing by “P_Brand” was done (Table 12).
In this case, the decreases in processing time are not only due to partitioning, since the queries that present the partitioning attribute as a filter are not the only ones that show a decrease in the processing time. Queries Q1.1, Q1.2 and Q1.3 show decreases in all executions, although these decreases may be related to the low complexity of these queries or to the use of straight temporal filters. This means that they only benefit from the partitioning by year and the predicate pushdown.
In this scenario, Q2.2 and Q2.3 present decreases in some of the SFs, maintaining the same performance levels in the remaining ones. Analyzing the queries, the “P_Brand” attribute is present in the “group by” and “order by” clauses, which may influence these decreases. However, Q2.1 also presents this attribute in these clauses and does not verify any decrease in the processing time. As these queries present this attribute not only in the “group by” and “order by” clauses, but also in the “where” clause, it may happen that the files that are bucketed by “P_Brand” can be searched more easily.
In the last scenario presented in this paper, multiple partitioning is combined with bucketing by an attribute with high cardinality, namely using partitioning by “Od_Year” and “S_Region”, and bucketing by “Suppkey” (Table 13).
The obtained results show an overall increase in efficiency, considering the processing time. In this case, two attributes frequently used in the queries (isolated or combined) were used to partition the table. Only Q1.2 and Q3.4 do not include any of them in their filters. Nevertheless, even in these queries, some scenarios present decreases when using this partitioning strategy. These decreases may be related with the predicate pushdown technique applied by the two SQL-on-Hadoop engines, meaning that, although they do not present the attributes of the partitions in the “where” clause, they have filters by attributes hierarchically related to them (“S_Nation” or “YearMonth”, for example), making the search faster as only some of the folders should be considered.
Analyzing the obtained results and their implications, it seems that bucketing techniques have no significant impact in reducing processing time, besides the very specific cases already shown in this paper, as the positive results seem to be more related with the attributes used in the partitioning strategy rather than the advantages of the bucketing strategy.
CPU usage
To conclude the various performed tests, the variation in CPU usage per query in both data models is presented, now integrating the two data organization strategies. Figure 10 presents this variation for Presto and SF = 100, taking as an example the scenario that combines multiple partitioning (“Od_Year” and “S_Region”) with bucketing by “Suppkey”, since it was the scenario with the best results in terms of processing time.
As can be seen in Fig. 10, the decreases obtained for this SF in both data models are similar to the decreases obtained with the multiple partitioning scenario (with the same attributes), without using the bucketing strategy. In these results, the queries that present the partitioning attributes in the “where” clause (all except Q1.2 and Q3.4) are the ones that verified the most significant decreases in CPU usage. This shows that, in contexts of smaller data volumes, applying bucketing strategies does not seem to have any advantage when compared with the multiple partitioning strategy.
However, in the star schema with Hive and the SF = 300, the highest overall reduction in processing time was obtained, due to the possible activation of join optimization mechanisms, Fig. 11 shows the variation in CPU usage per query, measured with Hive in vcores-seconds.Footnote 1
Considering these values (Fig. 11), it is possible to notice that, in this scenario, there are not only significant decreases in processing time, but also in CPU usage. On average, there is a decrease of about 65% with the application of this data organization strategy in the highest SF, which is when Hive seems to activate the join optimization mechanisms.
Synopsis
To summarize the several results presented in this section, this subsection provides an overview of the presented scenarios, highlighting if any improvements were achieved, either in the processing time or in CPU usage. For all the data organization strategies, their impact in the processing time was analyzed. The same is not verified for CPU usage as, for some scenarios, CPU usage was not extensively analyzed due to the results achieved in the other tested scenarios. Given this context, Table 14 depicts the different data organization strategies, the tested data models, the attributes that were used on those strategies and the advantages, when verified, in the processing time and/or CPU usage. As can be seen, bucketing is the strategy in which the advantages, analyzed from the perspective of the decrease in processing time and CPU usage, are more limited and restricted to very specific use cases, as explained in more detail in the following section of this paper.
Discussion
The purpose of this work was to study the impact of implementing different data organization strategies in the processing times of Hive-based BDWs. Considering all the tested scenarios, the most adequate results for each SF and for each configuration will be presented throughout the next subsection. Afterwards, in “Guidelines for practitioners” subsection, a set of guidelines for practitioners is presented, taking into consideration the main insights retrieved from this work.
Main insights
Regarding partitioning strategies, Table 15 summarizes the best overall processing times obtained for each tested configuration, by scaling factor. These results are compared with the best results obtained when no data organization strategy is applied, presenting the difference obtained with the application of multiple partitioning techniques. It should be noted that all the results presented in this section correspond to the results obtained using Presto, as it was the system with the best processing times in all contexts. As Presto targets low-latency query execution, having constantly running daemons on each node based on in-memory data processing and avoiding costly coordination overheads, its results are significantly superior to Hive’s results, reason why the tables presented in this section are more focused on Presto’s results. However, the focus is not on the results comparison between these two engines, but on the main insights regarding different data organization and distribution strategies.
Considering the results obtained in this first scenario, it is possible to conclude that partitioning should indeed consider attributes that are often used in the queries’ predicates. Hierarchical partitioning (e.g. spatial partitioning) may also be one of the most adequate partitioning strategies to implement, and the results obtained here are aligned with the related studies referred throughout this paper. Considering most of the real organizational contexts, effectively, it is easy to observe that the queries executed on the data imply, in most cases, temporal and geographic filters, so the main guideline is to choose a partitioning scheme that includes this type of attributes. In addition, considering the HDFS performance requirements, an appropriate data organization strategy must take into account the size of the files by which the data is distributed. This means that, despite the adequate results obtained in this scenario, one must be careful about excessive partitioning, as this, in addition to the processing complexity associated with the existence of multiple levels of folders, makes the system store less data per folder and have several small files that can degrade the performance of the BDW.
Regarding the bucketing strategies, Table 16 summarizes the best overall processing times obtained for each tested configuration, by SF. Throughout the tests that were performed using bucketing, it was verified that the definition of buckets depends on the type of model applied to the data. The scenario based on the definition of buckets by attributes with high cardinality was not beneficial in any of the contexts, so it may not be considered, in these contexts, an adequate data organization strategy.
Regarding the use of bucketing techniques for star schemas, the inefficiency of the application of multiple bucketing was shown, since it is a strategy that is not advantageous in any of the used SQL-on-Hadoop systems. However, in the context of simple bucketing, there were scenarios that have shown some advantages, namely with Hive and for the larger SF. Thus, the only context in which some advantages were found in the definition of buckets is the context of the bucketing of two tables by the same attribute, the one used to join these tables. This context has seen decreases of around 80% in Hive’s query execution times compared to a scenario without any type of strategy, which indicates that Hive is an adequate tool to deal with this type of configuration due to its inherent optimization features. Regarding Presto’s tests, when some decreases were verified, most of them were related to another type of strategy (use of the sorting strategy by attributes used in the “group by” or “order by” of the queries), and in other cases, increases in processing times were observed. Nevertheless, the best results continue to be presented in scenarios using Presto, achieving faster processing times than the ones obtained with Hive.
For denormalized tables, the only scenarios where bucketing shows some benefits are the scenarios in which they are defined by the attributes that appear several times in the “group by” clause of the queries, combined with the sorting technique by an attribute that is relevant for the executed queries, most of which are frequently used in the “order by” clause. Consequently, although in one of the scenarios an attribute with low cardinality (“Od_Year”) was used, since most queries were oriented towards temporal conditions, significant decreases were verified in all the scenarios, pointing that this may be a strategy to be considered in contexts where the use of partitions is not intended.
Nevertheless, it is relevant to highlight Hive’s restriction of not allowing changes to the number of buckets after the table has been created, a feature that makes difficult the refreshment of the data. Thus, whenever there is the need to add new records to the table, and that implies a different number of buckets, a new table must be created with a reworked bucketing strategy, if necessary, and load the data from the old table to the new one. Consequently, this additional complexity shows another disadvantage of this type of data organization strategy. Taking this into consideration, it is possible to realize that only in very specific cases may exist some advantage in using buckets, although its definition is not a straightforward process.
Concerning the combination of the two data organizations strategies, Table 17 summarizes the best overall processing times obtained for each tested configuration, by scaling factor. In most cases, the increase in the queries processing efficiency seems to be highly influenced by the benefits of partitioning rather than by the benefits of bucketing, since the best results are obtained in the queries that use the partitioning attributes as filters. In the tested scenarios, the only case where there are advantages of using partitions and buckets together, not only as result of partitioning, is in the star model using Hive and using the joining attribute (between the fact table and a dimension table) for bucketing. Nevertheless, even in this case, this is not the best obtained processing time, since Hive’s processing times are higher than those obtained by Presto, where the results seem to be only influenced by the partitioning strategy.
In conclusion, and to have an overall picture of all the tested scenarios and the best obtained results, Table 18 shows the best configuration, highlighting the lowest achieved total processing time by SF.
Considering all the results presented above, the advantages associated with the use of partitioning techniques were evident, since they cause considerable decreases in the time needed for data processing. On the other hand, the use of bucketing techniques falls short of expectations, since the scenarios in which they demonstrated benefit for the attributes here studied were rare. Despite the fastest processing time of the SF 30, this was the only scenario with benefits of the bucketing strategy. In all the other scenarios and scale factors there was no evidence of the advantages of this technique. Nevertheless, its use in very specific contexts, with an in-depth study of how to define them, and even combining with partitioning techniques, can assure some advantages in the storage and processing of data.
Although it is not the focus of this work, it is important to highlight the performance of Presto as a system for querying data, since it presented the best processing times in all contexts here studied. However, the potential of Hive is highlighted in two bucketing contexts (using bucketing by the attribute used in the join), where Presto has showed that it may not recognize or have any kind of optimization mechanism to handle this data organization strategy. Still, the overall processing times obtained by Presto, even in these scenarios, are considerably lower than the ones of Hive.
To complement the discussion of the results, looking into the perspective of the scalability of the tools, the additional time needed when the size of the datasets increases in now analyzed. Both Hive and Presto, as processing tools, were designed to be scalable, implementing different strategies to achieve that. As can be seen in Table 19, from one workload to another, Hive needs between more 1.18 to 2.60× extra time to accommodate the increase between 30 GB and 100 GB, while Presto needs between 1.65× and 3.70× extra time for the same job, depending on the adopted data organization strategy. For the denormalized tables, mainly with partitions, both Hive and Presto demonstrate a significantly adequate scalability.
When the analysis looks into a more severe increase in the dataset size, namely from 30 GB to 300 GB, Table 20 shows that Presto maintains the increase in the extra time between 5.87× and 8.02×, while Hive presents values ranging from 2.06× to 13.67×. Taking this into consideration, there are three scenarios that do not seem to have advantages for data processing in Hive, and those are related with the use of star schemas and, considering this data modeling approach, the use of bucketing strategies. Besides these cases, and once again for denormalized tables, Hive presents a satisfactory behavior in terms of scalability.
Guidelines for practitioners
After analyzing all the tested scenarios, it is possible to summarize a set of guidelines that can be followed by practitioners when addressing the definition of data organization strategies in Hive-based BDWs. Before that, Table 21 shows the role of the several attributes used throughout the benchmark, pointing if the attributes are used, or not, as filters in the “where” conditions, as grouping or sorting attributes in the “group by” and “order by” clauses, or as joining attributes.
Considering all the results evaluated in this paper, it is possible to identify some good practices for the modelling and organization of data in Hive-based BDWs:
-
1.
Generally, use data models based on denormalized tables for better performance;
-
2.
Perform a study of the cardinality and distribution of the attributes that integrate the dataset, in order to identify the most appropriate attributes for partitioning (attributes with low cardinality and uniform distribution) and/or for bucketing (attributes with high cardinality). For partitioning and bucketing, it is important to recall that:
-
a.
Partitions are stored in subdirectories of a table’s directory, performing a hierarchical organization of the data, and are used to prune the data that is searched in a specific query, influencing the processing time of that query;
-
b.
Buckets can be associated to tables or to partitions, being stored in a file within the partition or table’s directory, and are used as a technique to cluster large datasets;
-
c.
The definition of partitions and buckets is constrained by the available data like, for instance, the number of different products, customers, or years, as data should not be over partitioned. If partitions are relatively small in terms of data volume, the cost of searching many directories becomes more expensive than simple scanning a file with all the data. In addition, partitions should be similar in size to prevent a single long-running operation in one of them. In contexts where two or three directories (for partitions) would contain the majority of the data and many other directories would contain small data files, the use of bucketing would be preferred as different small data files can be clustered in the same bucket;
-
a.
-
3.
For the implementation of partitioning techniques:
-
a.
Knowing the queries in advance is relevant to partition the tables by the attributes that are more frequently used in the query filters;
-
b.
Pay attention to excessive partitioning, avoiding the creation of a large number of subdirectories, as already mentioned, as it adds additional overheads on HDFS;
-
c.
Give preference to temporal, geographic or departmental partitioning, depending on the filters needed in the real contexts and how the data is updated in those contexts;
-
a.
-
4.
Bucketing techniques alone tend to not improve performance, but, if one finds them useful, for the implementation of bucketing techniques consider the following guidelines:
-
a.
Define a number of buckets that is appropriate for the size of the dataset, in order to avoid the creation of several small files;
-
b.
Give preference to the attributes that appear frequently in the “group by” or “order by” query clauses;
-
a.
-
5.
If processing speed is crucial, it is advisable to use Presto, or other similar interactive distributed SQL engine (e.g., Impala), as the querying technology.
Through the experiments, it was also possible to verify that there is a set of optimizations that can be used, such as:
-
1.
Apply to the created tables (without buckets) the “alter table concatenate” function, in order to optimize the distribution of the data throughout the several files (transforming several small files into few larger files), optimizing HDFS’ performance;
-
2.
Apply to the created tables the “analyze table compute statistics” and “analyze table compute statistics for columns” functions, in order to keep Hive’s metadata and statistics updated, optimizing the execution of the queries;
-
3.
In a star schema context, force the use of broadcast joins in Presto (map joins in Hive), since the results are better than those obtained with the distributed join (default setting in Presto v.0180).
Conclusions
This work presented an evaluation and discussion of the use of several data organization strategies in Hive-based BDWs, testing different combinations of partitions and buckets, either individually or combining these as different data organization strategies. The SSB, both the dataset and the queries, was used to evaluate the performance of a star schema and a fully denormalized table, with or without partitions and buckets, using three SFs (30, 100 and 300) and two SQL-on-Hadoop systems as query engines (Presto and Hive on Tez).
In general, the implementation of data organization strategies, mainly based on partitioning, brings benefits both in terms of storage (better organization and distribution of the data) and in terms of query processing (with lower response times). These benefits support faster decision-making processes, as well as less use of resources, as shown with the decreases in CPU usage, especially for the denormalized tables and with Presto as the query engine.
From all the results and the discussion presented throughout the paper, it is possible to infer some good practices for the modelling and organization of the data. First, it is important to perform a study of the cardinality and distribution of the attributes that integrate the dataset, as well as an analysis of the queries to be executed, in order to identify the most appropriate attributes for partitioning and/or bucketing. Then, regarding the partitioning strategies, and knowing the queries beforehand, the partitions should be defined with the attributes that frequently appear in the filters applied on the data. In addition, an adequate strategy would be to choose temporal, geographical or departmental attributes for partitioning, depending on the filters used in real contexts and on how the data is updated. Considering the level of partitioning, it is important to pay attention to excessive partitioning, avoiding the creation of many subdirectories with a high number of small files.
Concerning the use of bucketing, there was no evidence of significant advantages when using this strategy. As shown and explained in several scenarios, partitioning alone may significantly improve efficiency, since the use of the appropriate attributes enable better processing times. In several of the tested scenarios, considerable decreases in queries’ execution time were verified. For bucketing, the scenarios where clear advantages emerged from the use of this data organization strategy were limited. Nevertheless, its use in very specific contexts, with an in-depth study of how to define them, and always combined with partitioning techniques, can bring some advantages in the storage and processing of data. Therefore, when considering the implementation of bucketing strategies, the number of buckets must consider the size of the dataset, in order to avoid creating many small files, and should also consider attributes that appear frequently in the “group by” or “order by” of the queries.
Despite not being the focus of this work, it was also possible to conclude that, although the implementation of dimensional data models in Big Data Warehousing contexts is possible, they do not seem to be the most advantageous design pattern for the decision-making process, since these models need more processing time and CPU usage in all the tested scenarios. Besides that, if processing speed is crucial, aiming to achieve higher efficiency, it is advisable to use Presto, or a similar SQL-on-Hadoop technology, as the query engine.
Taking into account the results, insights and guidelines presented in this paper, we believe that this work provides more clarification to researchers and practitioners regarding the use of certain data modelling strategies, such as partitioning and bucketing techniques, through the several scenarios here depicted. Previously to this work, to the best of our knowledge, there was no attempt to solidify a set of general guidelines supported by a structured benchmark, such as the guidelines provided in “Guidelines for practitioners” subsection. Consequently, to summarize the main insights and guidelines of this work, researchers and practitioners should consider the following: (i) denormalized tables tend to outperform star schemas; (ii) partitioning performs appropriately when using low cardinality attributes, while bucketing makes more sense when applied to high cardinality attributes; (iii) partitioning must be based on the attributes that appear frequently as filters in the queries, typically considering temporal, geospatial or departmental attributes, and avoiding over partitioning; (iv) bucketing techniques did not show any significant performance advantages, but there are scenarios in which their use is possible. In these scenarios, practitioners should plan the number of buckets according to the size of the dataset, and they should select the attributes that appear frequently in “group by” or “order by” clauses; and, (v) the selection of an interactive SQL-on-Hadoop engine is crucial to accomplish certain latency requirements, as seen in the scenarios where Presto’s performance was significantly superior to Hive’s performance. Considering this, practitioners should be open to perform some preliminary tests using several SQL-on-Hadoop engines, before committing entirely to a specific technology.
For future work, the study of different approaches for identifying the number of buckets for Hive-based BDWs will be considered, as well as testing this data organization strategy in other datasets with different data contexts. It would also be interesting to extend the analysis to the real impact of the size of the denormalized tables in contexts of higher data volumes, identifying possible alternative approaches for these cases.
Notes
Vcores-seconds = number of vcores (3 per cluster node) * time worked by each vcore, in seconds.
Abbreviations
- BDW:
-
Big Data Warehouse
- BI&A:
-
Business Intelligence and Analytics
- CPU:
-
central processing unit
- DT:
-
denormalized table
- DT-B:
-
denormalized table with buckets
- DT-P:
-
denormalized table with partitions
- DW:
-
Data Warehouse
- HDFS:
-
Hadoop Distributed File System
- HDP:
-
Hortonworks Data Platform
- OLAP:
-
online analytical processing
- ORC:
-
optimized row columnar
- RAM:
-
random access memory
- SS:
-
star schema
- SS-B:
-
star schema with buckets
- SS-P:
-
star schema with partitions
- SSD:
-
Solid State Drive
- SF:
-
scale factor
- SSB:
-
star schema benchmark
References
Apache (2014) Apache Hadoop. http://hadoop.apache.org/.
Capriolo E, Wampler D, Rutherglen J. Programming Hive. O’Reilly Media, Inc. 2012.
Cassavia N, Dicosta P, Masciari E, Saccà D. Data preparation for tourist Data Big Data Warehousing. In: Proceedings of 3rd international conference on data management technologies and applications (DATA). SciTePress, 2014. p. 419–26.
Chavalier M, El Malki M, Kopliku A, et al. Document-Oriented Data Warehouses: models and extended cuboids. In: 10th international conference on research challenges in information science (RCIS). IEEE, 2016. P. 1–11.
Chevalier M, El Malki M, Kopliku A, et al. Implementation of multidimensional databases in column-oriented NoSQL systems. In: East European conference on advances in databases and information systems. 2015. p. 79–91.
Costa C, Santos MY. The SusCity big data warehousing approach for smart cities. In: Proceedings of the 21st international database engineering & applications symposium. 2017. p. 264–73.
Costa C, Santos MY. Evaluating Several Design Patterns and Trends in Big Data Warehousing Systems. In J. Krogstie & H. A. Reijers (Eds.), Advanced Information Systems Engineering (Vol. 10816, pp. 459–473). In: Proceedings of the 30th international conference on advanced information systems engineering (CAiSE’2018). Cham: Springer International Publishing; 2018.
Costa E (2018) SSB Scripts. https://github.com/EduardaCosta/ScriptsSSB. Accessed 19 Dec 2018.
Costa E, Costa C, Santos MY. Efficient Big Data Modelling and Organization for Hadoop Hive-Based Data Warehouses. In: Themistocleous M, Morabito V, editors. 14th European, Mediterranean, and Middle Eastern Conference (EMCIS). Coimbra: Springer International Publishing; 2017. p. 3–16.
Costa E, Costa C, Santos MY (2018) Partitioning and Bucketing in Hive-Based Big Data Warehouses. In: WorldCIST’18 - World Conference on Information Systems and Technologies. Springer International Publishing, pp 764–774.
De Mauro A, Greco M, Grimaldi M. What is Big Data? A Consensual Definition and a Review of Key Research Topics. In: AIP conference proceedings. AIP Publishing; 2015. p. 97–104.
Dere J (2017) Apache Hive. https://cwiki.apache.org/confluence/display/Hive/Home.
Di Tria F, Lefons E, Tangorra F. A framework for evaluating design methodologies for Big Data Warehouses: measurement of the design process. Int J Data Warehous Min. 2018;14:15–39. https://doi.org/10.4018/IJDWM.2018010102.
Di Tria F, Lefons E, Tangorra F. Design process for Big Data Warehouses. In: IEEE 2014 International conference on data science and advanced analytics (DSAA). 2014. p. 512–18.
Du D. Apache Hive Essentials. Packt Publishing Ltd. 2015.
Hortonworks I (2017) Hortonworks. https://hortonworks.com. Accessed 22 Oct 2017.
Kimball R, Ross M. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3 edn. New York: Wiley; 2013.
Krishnan K (2013) Data Warehousing in the Age of Big Data. Elsevier Inc.
Kumar AS (2016) Performance analysis of MySQL Partition, Hive Partition-Bucketing and Apache Pig. In: Information Processing (IICIP), 2016 1st India International Conference. IEEE, p. 1–6.
Martinho B, Santos MY. An architecture for Data Warehousing in Big Data environments. International conference on research and practical issues of enterprise information systems. Cham: Springer; 2016. p. 237–50.
Mohanty S, Jagadeesh M, Srivatsa H. Big data imperatives: enterprise Big Data Warehouse, BI implementations and analytics. New York: Apress; 2013.
O’Neil P, O’Neil B, Chen X. The star schema benchmark (SSB). 2007.
Philip Chen CL, Zhang CY. Data-intensive applications, challenges, techniques and technologies: a survey on Big Data. Inf Sci. 2014;275:314–47. https://doi.org/10.1016/j.ins.2014.01.015.
Ptiček M, Vrdoljak B. Big Data and New Data Warehousing Approaches. In: Proceedings of the 2017 International Conference on Cloud and Big Data Computing. ACM, 2017. p. 6–10.
Russom P. Evolving Data Warehouse Architectures in the Age of Big Data. 2014.
Sandoval LJ. Design of business intelligence applications using big data technology. In: Central American and Panama Convention (CONCAPAN XXXV), 2015 IEEE Thirty Fifth. Institute of Electrical and Electronics Engineers Inc., 2016. p. 1–6.
Santos MY, Costa C (2016a) Data Warehousing in Big Data: from multidimensional to tabular data models. In: C3S2E’16—Ninth international C* conference on computer science & software engineering. p. 10.
Santos MY, Costa C. Data models in NoSQL databases for Big Data contexts. In: Tan Y, Shi Y, editors. International Conference on Data Mining and Big Data. Cham: Springer International Publishing; 2016. p. 475–85.
Santos MY, Costa C, Galvão J, et al. Evaluating SQL-on-Hadoop for Big Data Warehousing on not-so-good hardware. In: Proceedings of the 21st international database engineering & applications symposium. ACM, New York, NY, USA. 2017. p. 242–52.
Shaw S, Vermeulen AF, Gupta A, Kjerrumgaard D. Practical Hive: a guide to Hadoop’s Data Warehouse System. New York: Apress; 2016.
Thusoo A, Sarma J Sen, Jain N, et al. Hive—a Warehousing solution over a map-reduce framework. In: Proceedings of the VLDB endowment. 2009. p. 1626–9.
Thusoo A, Sen Sarma J, Jain N, et al. Hive—a Petabyte Scale Data Warehouse using Hadoop. In: 2010 IEEE 26th international conference on Data Engineering (ICDE), 2010. p. 996–1005.
TPC (2017a) TPC. http://www.tpc.org/tpch/.
TPC (2017b) TPC-H—Homepage. http://www.tpc.org/tpch/. Accessed 16 Aug 2017.
Yangui R, Nabli A, Gargouri F. Automatic transformation of data warehouse schema to NoSQL data base: comparative study. Procedia Comput Sci. 2016;96:255–64.
Zikopoulos P, Eaton C. Understanding Big Data: analytics for enterprise class hadoop and streaming data. 1st ed. Delhi: McGraw-Hill Osborne Media; 2011.
Authors’ contributions
EC designed and executed the benchmark, having also reported and analyzed the results. CC contributed to the design of the research process, managed the distributed storage and processing infrastructure, and reviewed the work. MYS supervised the entire research process, contributing to the design of the research process and review of the work. All authors read and approved the final manuscript.
Acknowledgements
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The execution scripts used in this work are openly available on GitHub [8]. Moreover, the original TPC-H data generator used in this work can be retrieved from the TPC-H homepage [34].
Funding
This work is supported by COMPETE: POCI-01-0145- FEDER-007043 and FCT—Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013, and by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project no. 002814; Funding Reference: POCI-01-0247-FEDER-002814].
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Costa, E., Costa, C. & Santos, M.Y. Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems. J Big Data 6, 34 (2019). https://doi.org/10.1186/s40537-019-0196-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s40537-019-0196-1