research-article

Bridging the gap between HPC and big data frameworks

Authors:

Michael Anderson,

Narayanan Sundaram,

Subramanya Dulloor,

Nadathur Satish,

Theodore L. Willke

Proceedings of the VLDB Endowment, Volume 10, Issue 8

Pages 901 - 912

https://doi.org/10.14778/3090163.3090168

Published: 01 April 2017

Abstract

Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower than native implementations written with high-performance computing tools such as MPI. There is a need to bridge this performance gap while retaining the benefits of the Spark ecosystem, such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides speedups of 3.1× to 17.7× on the four sparse applications, with all overheads included. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.


Published In

Proceedings of the VLDB Endowment Volume 10, Issue 8

April 2017

60 pages

ISSN: 2150-8097

Editor: Divesh Srivastava, AT&T Labs

Publisher

VLDB Endowment
