Apache Spark: a unified engine for big data processing

Authors:

Shivaram Venkataraman,

Ion Stoica

Communications of the ACM, Volume 59, Issue 11

Pages 56 - 65

https://doi.org/10.1145/2934664

Published: 28 October 2016

Abstract

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

Published In

Communications of the ACM Volume 59, Issue 11

November 2016

118 pages

ISSN:0001-0782

EISSN:1557-7317

DOI:10.1145/3013530

Editor:
Moshe Y. Vardi
Association for Computing Machinery, New York, NY

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Popular
Refereed

Funding Sources

National Science Foundation
