HAMR

Published: 01 September 2017

Abstract

As attention to big data grows, cluster computing systems for the distributed processing of large data sets have become a mainstream and critical requirement in high performance distributed systems research. One of the most successful systems is Hadoop, which uses MapReduce as its programming/execution model and relies on disks as intermediate storage when processing huge volumes of data. Spark, an in-memory computing engine, handles iterative and interactive problems more efficiently. However, there is now a consensus that neither is the final answer to big data, owing to the MapReduce-like programming model, the synchronous execution model, the restriction to batch processing, and other limitations. A new solution, in particular a fundamental evolution, is needed to bring big data processing into a new era.
In this paper, we introduce a new cluster computing system called HAMR, which supports both batch and streaming processing. To achieve better performance, HAMR integrates high performance computing approaches, namely dataflow fundamentals, into a big data solution. More specifically, HAMR is designed entirely around in-memory computing to avoid unnecessary disk access overhead; task scheduling and memory management are performed in a fine-grained manner to expose more parallelism; and asynchronous execution improves the efficiency of compute resource usage while better balancing the workload across the whole cluster. The experimental results show that HAMR outperforms Hadoop MapReduce and Spark by up to 19x and 7x, respectively, in the same cluster environment. Furthermore, HAMR handles scaling data sizes well beyond the capabilities of Spark.
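As a rough illustration of the asynchronous, in-memory, fine-grained dataflow style the abstract refers to, the sketch below chains pipeline stages on the JVM with CompletableFuture. It is a generic example under stated assumptions, not HAMR's actual API: the class name AsyncDataflowSketch, the hard-coded input, and the three-stage word-count pipeline are all hypothetical, chosen only to contrast stage-by-stage asynchronous chaining with the bulk-synchronous, disk-backed stages of MapReduce.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Generic illustration of an asynchronous, in-memory dataflow pipeline on the JVM.
// This is NOT HAMR's API; it only mirrors the execution style described in the abstract:
// each stage runs as a fine-grained task as soon as its input is ready, with no disk
// spill and no global synchronization barrier between stages.
public class AsyncDataflowSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Stage 1: materialize a partition in memory (hard-coded lines stand in for real input).
        CompletableFuture<List<String>> lines = CompletableFuture.supplyAsync(
                () -> Arrays.asList("big data dataflow", "in memory dataflow"), pool);

        // Stage 2: tokenize as soon as stage 1 completes; no barrier, no intermediate files.
        CompletableFuture<List<String>> words = lines.thenApplyAsync(
                ls -> ls.stream()
                        .flatMap(l -> Arrays.stream(l.split("\\s+")))
                        .collect(Collectors.toList()), pool);

        // Stage 3: count words, again chained asynchronously off the previous stage.
        CompletableFuture<Map<String, Long>> counts = words.thenApplyAsync(
                ws -> ws.stream().collect(
                        Collectors.groupingBy(w -> w, Collectors.counting())), pool);

        // Prints e.g. {big=1, in=1, data=1, memory=1, dataflow=2} (map order is unspecified).
        System.out.println(counts.get());
        pool.shutdown();
    }
}
```

In the real system, scheduling granularity and memory management would be handled by the runtime rather than hand-wired futures; the sketch only conveys the asynchronous, in-memory execution style that the abstract contrasts with synchronous, disk-based MapReduce stages.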

Published In

International Journal of High Performance Computing Applications, Volume 31, Issue 5 (September 2017), 108 pages

Publisher

Sage Publications, Inc., United States

Author Tags

  1. Dataflow
  2. big data
  3. fine-grain
  4. in-memory computing
  5. runtime

Qualifiers

  • Research-article
