HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Authors:

Kamil Bajda-Pawlikowski,

Avi Silberschatz,

Alexander Rasin

Proceedings of the VLDB Endowment, Volume 2, Issue 1

Pages 922 - 933

https://doi.org/10.14778/1687627.1687731

Published: 01 August 2009

Abstract

The production environment for analytical data management applications is rapidly changing. Many enterprises are shifting away from deploying their analytical databases on high-end proprietary machines, and moving towards cheaper, lower-end, commodity hardware, typically arranged in a shared-nothing MPP architecture, often in a virtualized environment inside public or private "clouds". At the same time, the amount of data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis.

There tend to be two schools of thought regarding what technology to use for data analysis in such an environment. Proponents of parallel databases argue that the strong emphasis on performance and efficiency of parallel databases makes them well-suited to perform such analysis. On the other hand, others argue that MapReduce-based systems are better suited due to their superior scalability, fault tolerance, and flexibility to handle unstructured data. In this paper, we explore the feasibility of building a hybrid system that takes the best features from both technologies; the prototype we built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.

Index Terms

1. Computer systems organization
  1. Architectures
    1. Distributed architectures
2. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs
  2. Information systems applications

Published In

Proceedings of the VLDB Endowment Volume 2, Issue 1

August 2009

1293 pages

ISSN: 2150-8097

Publisher

VLDB Endowment
