DOI: 10.1145/3447786.3456228
research-article
Open access

PaSh: light-touch data-parallel shell processing

Published: 21 April 2021

Abstract

This paper presents PaSh, a system for parallelizing POSIX shell scripts. Given a script, PaSh converts it to a dataflow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the dataflow graph back into a script: one that adds POSIX constructs to explicitly guide parallelism, coupled with PaSh-provided Unix-aware runtime primitives for addressing performance- and correctness-related issues. A lightweight annotation language allows command developers to express key parallelizability properties about their commands. An accompanying parallelizability study of POSIX and GNU commands, two large and commonly used groups, guides the annotation language and optimized aggregator library that PaSh uses. PaSh's extensive evaluation over 44 unmodified Unix scripts shows significant speedups (0.89–61.1×, avg: 6.7×) stemming from the combination of its program transformations and runtime primitives.
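
To make the idea of exposing data parallelism in a pipeline concrete, the following is a minimal hand-written sketch, not PaSh's generated output. It assumes a two-stage pipeline (`tr` followed by `sort`), a fixed two-way split, and illustrative file names (`in.txt`, `chunk0`, `sorted0`, and so on), none of which come from the paper. Each chunk runs through the same stages as a background job, and `sort -m` plays the role of an aggregator by merging the already-sorted partial results.

```shell
#!/bin/sh
# Hand-written illustration of data parallelism for a shell pipeline.
# NOT PaSh output: the split width and all file names are assumptions.
set -eu
LC_ALL=C; export LC_ALL   # fixed collation so the comparison is deterministic

tmp=$(mktemp -d)
printf '%s\n' pear apple fig apple kiwi pear apple > "$tmp/in.txt"

# Sequential baseline: a stateless stage (tr) followed by sort.
seq_out=$(tr '[:lower:]' '[:upper:]' < "$tmp/in.txt" | sort)

# Split the input into two chunks on line boundaries.
head -n 4 "$tmp/in.txt" > "$tmp/chunk0"
tail -n +5 "$tmp/in.txt" > "$tmp/chunk1"

# Run the same stages on each chunk as background jobs, then wait for both.
tr '[:lower:]' '[:upper:]' < "$tmp/chunk0" | sort > "$tmp/sorted0" &
tr '[:lower:]' '[:upper:]' < "$tmp/chunk1" | sort > "$tmp/sorted1" &
wait

# Aggregate: sort -m merges inputs that are already sorted.
par_out=$(sort -m "$tmp/sorted0" "$tmp/sorted1")

rm -rf "$tmp"
printf '%s\n' "$par_out"
```

The parallel result matches the sequential one here because `tr` processes each line independently and sorted chunks can be merged without re-sorting; a command without such properties would need a different aggregation step or could not be split this way.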

Published In

EuroSys '21: Proceedings of the Sixteenth European Conference on Computer Systems
April 2021
631 pages
ISBN: 9781450383349
DOI: 10.1145/3447786
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. POSIX
  2. Unix
  3. automatic parallelization
  4. pipelines
  5. shell
  6. source-to-source compiler

Qualifiers

  • Research-article

Conference

EuroSys '21
EuroSys '21: Sixteenth European Conference on Computer Systems
April 26 - 28, 2021
Online Event, United Kingdom

Acceptance Rates

EuroSys '21 Paper Acceptance Rate 38 of 181 submissions, 21%;
Overall Acceptance Rate 241 of 1,308 submissions, 18%
