DOI: 10.1145/2660193.2660204
research-article

Cybertron: pushing the limit on I/O reduction in data-parallel programs

Published: 15 October 2014

Abstract

I/O reduction has been a major focus in optimizing data-parallel programs for big-data processing. While current state-of-the-art techniques use static program analysis to reduce I/O, Cybertron proposes a new direction that incorporates runtime mechanisms to push the limit on I/O reduction further. In particular, Cybertron accurately tracks at runtime how data is used in the computation. This lets it dynamically filter unused data at a finer granularity than current static-analysis-based mechanisms are capable of, and it facilitates a new mechanism, called constraint-based encoding, for more efficient encoding. Cybertron has been implemented and applied to production data-parallel programs. Our extensive evaluations on real programs and real data show its effectiveness on I/O reduction over existing mechanisms at reasonable CPU cost, and its improvement of end-to-end performance in various network environments.
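The idea of filtering unused data based on what a computation actually reads at runtime can be illustrated with a small sketch. This is not Cybertron's implementation, which the abstract does not describe in detail. It is a toy model under stated assumptions: records are plain dictionaries, and the names `TrackedRecord`, `run_with_tracking`, `filter_record`, and `url_of_successful_request` are hypothetical. The example shows why per-record tracking can remove more than a per-column static analysis: a static analysis must keep both `status` and `url` for every record, while runtime tracking sees that `url` is never read when the status check fails.

```python
from typing import Callable, Dict, Set, Tuple

Record = Dict[str, object]


class TrackedRecord:
    """Hypothetical wrapper that remembers which fields a computation reads."""

    def __init__(self, fields: Record, used: Set[str]) -> None:
        self._fields = fields
        self._used = used

    def __getitem__(self, name: str) -> object:
        self._used.add(name)  # note the access before returning the value
        return self._fields[name]


def run_with_tracking(
    mapper: Callable[[TrackedRecord], object], record: Record
) -> Tuple[object, Set[str]]:
    """Run the mapper on one record and return its result and the fields it read."""
    used: Set[str] = set()
    result = mapper(TrackedRecord(record, used))
    return result, used


def filter_record(record: Record, used: Set[str]) -> Record:
    """Keep only the fields that the computation read for this record."""
    return {name: value for name, value in record.items() if name in used}


def url_of_successful_request(rec) -> object:
    """Example user-defined function: reads 'url' only when 'status' is 200."""
    if rec["status"] == 200:
        return rec["url"]
    return None


records = [
    {"status": 200, "url": "/a", "payload": "x" * 100},
    {"status": 404, "url": "/b", "payload": "y" * 100},
]

filtered = []
for record in records:
    _, used = run_with_tracking(url_of_successful_request, record)
    filtered.append(filter_record(record, used))
```

In this sketch the computation produces the same result on each filtered record as on the original, because every field it reads on that execution path is retained. The unread `payload` field is dropped from both records, and `url` is dropped from the second.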

Cited By

  • (2017) Too Big to Eat: Boosting Analytics Data Ingestion from Object Stores with Scoop. 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 309-320. DOI: 10.1109/ICDE.2017.243. Online publication date: Apr-2017
  • (2021) Tigris: A DSL and framework for monitoring software systems at runtime. Journal of Systems and Software, 177 (110963). DOI: 10.1016/j.jss.2021.110963. Online publication date: Jul-2021

Published In

OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications
October 2014
946 pages
ISBN: 9781450325851
DOI: 10.1145/2660193

ACM SIGPLAN Notices, Volume 49, Issue 10
OOPSLA '14
October 2014
907 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2714064
Editor: Andy Gill

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data-parallel
  2. execution-equivalent
  3. i/o reduction
  4. mapreduce

Qualifiers

  • Research-article

Conference

SPLASH '14
Acceptance Rates

OOPSLA '14 Paper Acceptance Rate 52 of 186 submissions, 28%;
Overall Acceptance Rate 268 of 1,244 submissions, 22%
