HAMR

Published: 01 September 2017

Abstract

As attention to big data grows, cluster computing systems for the distributed processing of large data sets have become a mainstream and critical requirement in high performance distributed systems research. One of the most successful systems is Hadoop, which uses MapReduce as its programming/execution model and relies on disks as intermediate storage when processing huge volumes of data. Spark, an in-memory computing engine, handles iterative and interactive problems more efficiently. However, there is now a consensus that neither is the final answer to big data, owing to the MapReduce-like programming model, the synchronous execution model, the restriction to batch processing, and other limitations. A new solution, in particular a fundamental evolution, is needed to bring big data processing into a new era.
In this paper, we introduce a new cluster computing system called HAMR, which supports both batch and streaming processing. To achieve better performance, HAMR integrates high performance computing approaches, namely dataflow fundamentals, into a big data solution. More specifically, HAMR is designed entirely around in-memory computing to avoid unnecessary disk access overhead; task scheduling and memory management are performed in a fine-grained manner to expose more parallelism; and asynchronous execution improves the efficiency of compute resource usage while better balancing the workload across the whole cluster. The experimental results show that HAMR outperforms Hadoop MapReduce and Spark by up to 19x and 7x, respectively, in the same cluster environment. Furthermore, HAMR handles scaling data sizes well beyond the capabilities of Spark.
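As a rough illustration of the asynchronous, in-memory, fine-grained dataflow style the abstract refers to, the sketch below chains pipeline stages on the JVM with CompletableFuture. It is a generic example under stated assumptions, not HAMR's actual API: the class name AsyncDataflowSketch, the hard-coded input, and the three-stage word-count pipeline are all hypothetical, chosen only to contrast stage-by-stage asynchronous chaining with the bulk-synchronous, disk-backed stages of MapReduce.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Generic illustration of an asynchronous, in-memory dataflow pipeline on the JVM.
// This is NOT HAMR's API; it only mirrors the execution style described in the abstract:
// each stage runs as a fine-grained task as soon as its input is ready, with no disk
// spill and no global synchronization barrier between stages.
public class AsyncDataflowSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Stage 1: materialize a partition in memory (hard-coded lines stand in for real input).
        CompletableFuture<List<String>> lines = CompletableFuture.supplyAsync(
                () -> Arrays.asList("big data dataflow", "in memory dataflow"), pool);

        // Stage 2: tokenize as soon as stage 1 completes; no barrier, no intermediate files.
        CompletableFuture<List<String>> words = lines.thenApplyAsync(
                ls -> ls.stream()
                        .flatMap(l -> Arrays.stream(l.split("\\s+")))
                        .collect(Collectors.toList()), pool);

        // Stage 3: count words, again chained asynchronously off the previous stage.
        CompletableFuture<Map<String, Long>> counts = words.thenApplyAsync(
                ws -> ws.stream().collect(
                        Collectors.groupingBy(w -> w, Collectors.counting())), pool);

        // Prints e.g. {big=1, in=1, data=1, memory=1, dataflow=2} (map order is unspecified).
        System.out.println(counts.get());
        pool.shutdown();
    }
}
```

In the real system, scheduling granularity and memory management would be handled by the runtime rather than hand-wired futures; the sketch only conveys the asynchronous, in-memory execution style that the abstract contrasts with synchronous, disk-based MapReduce stages.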

Published In

International Journal of High Performance Computing Applications, Volume 31, Issue 5 (September 2017), 108 pages

Publisher

Sage Publications, Inc., United States

Author Tags

  1. Dataflow
  2. big data
  3. fine-grain
  4. in-memory computing
  5. runtime

Qualifiers

  • Research-article
