research-article

Limits of region-based dynamic binary parallelization

Authors:

Tobias J.K. Edler von Koch,

Björn Franke

ACM SIGPLAN Notices, Volume 48, Issue 7

Pages 13 - 22

https://doi.org/10.1145/2517326.2451518

Published: 16 March 2013

Abstract

Efficiently executing sequential legacy binaries on chip multi-processors (CMPs) composed of many, small cores is one of today's most pressing problems. Single-threaded execution is a suboptimal option due to CMPs' lower single-core performance, while multi-threaded execution relies on prior parallelization, which is severely hampered by the low-level binary representation of applications compiled and optimized for a single-core target. A recent technology to address this problem is Dynamic Binary Parallelization (DBP), which creates a Virtual Execution Environment (VEE) taking advantage of the underlying multicore host to transparently parallelize the sequential binary executable. While still in its infancy, DBP has received broad interest within the research community. The combined use of DBP and thread-level speculation (TLS) has been proposed as a technique to accelerate legacy uniprocessor code on modern CMPs. In this paper, we investigate the limits of DBP and seek to gain an understanding of the factors contributing to these limits and the costs and overheads of its implementation. We have performed an extensive evaluation using a parameterizable DBP system targeting a CMP with light-weight architectural TLS support. We demonstrate that there is room for a significant reduction of up to 54% in the number of instructions on the critical paths of legacy SPEC CPU2006 benchmarks. However, we show that it is much harder to translate these savings into actual performance improvements, with a realistic hardware-supported implementation achieving a speedup of 1.09 on average.
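The gap between the two headline figures in the abstract can be made concrete with a little arithmetic. The sketch below is not taken from the paper: it assumes, purely for illustration, that execution time is proportional to the number of instructions on the critical path, and the function names `ideal_speedup` and `equivalent_reduction` are hypothetical.

```python
def ideal_speedup(critical_path_reduction: float) -> float:
    """Upper-bound speedup if run time scaled exactly with critical-path length.

    Assumption (not from the paper): every instruction costs the same and
    parallelization itself adds no overhead.
    """
    if not 0.0 <= critical_path_reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - critical_path_reduction)


def equivalent_reduction(speedup: float) -> float:
    """Fraction of run time removed by a given speedup."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # Best-case critical-path reduction reported in the abstract: 54%.
    print(f"{ideal_speedup(0.54):.2f}")
    # Average speedup reported for the realistic implementation: 1.09.
    print(f"{equivalent_reduction(1.09):.3f}")
```

Under this simplified model, a 54% shorter critical path would allow a speedup of roughly 2.17, whereas a speedup of 1.09 corresponds to removing only about 8% of the run time. Note that the 54% figure is a best case ("up to"), so the comparison is indicative rather than exact.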

Index Terms

Limits of region-based dynamic binary parallelization
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments

Reviews

Reviewer: Andre Maximo

Parallel computation is everywhere, from small portable devices to laptops, desktops, and data centers. This wealth of available parallelism challenges computer scientists and developers alike to rethink algorithms and rewrite old sequential source code. However, we still have a long way to go. This paper offers a deep analysis of a different way to address the challenge.

The authors investigate the performance limits of dynamic binary parallelization (DBP), a technique for transparently parallelizing single-threaded binary executables. The success of instruction-level parallelism on single-core architectures with multiple arithmetic logic units (ALUs) motivates a higher level of automatic parallelization on multi-core architectures, namely DBP. In the paper, the authors first analyze this newly proposed architecture by explaining how sections of binary code are identified for parallel execution, and then present their experiments and extensive results.

The key idea of DBP is to reduce the number of critical path instructions by overlapping execution segments of the instruction stream. The overlapping is done by speculative cores launched by a master core, which is responsible for the correct execution of the original binary. However, this strategy may reduce performance if many speculative threads are invalidated, mainly because of data dependencies between threads.

The experiments cover various applications from a benchmark suite. The results, which compare an eight-core reduced instruction set computing (RISC) processor against single-core execution, show an interesting 54 percent reduction in the number of instructions on critical paths. This is accompanied by a disproportionately small average speedup of 1.43 times. That is still far from the ideal performance gain, which suggests that the main bottleneck is bandwidth rather than instruction processing.

Online Computing Reviews Service
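The master/speculative-core scheme described in the review can be illustrated with a toy model. This is a deliberately simplified sketch and not the system evaluated in the paper: the function `critical_path`, the per-segment `conflicts` flags, and the rule that a conflicting segment is squashed and restarted on the master core are all assumptions made for illustration. Squash and spawn overheads are ignored.

```python
from typing import List


def critical_path(segments: List[int], conflicts: List[bool], cores: int) -> int:
    """Critical-path length (in instructions) under a toy speculation model.

    segments[i]  -- instruction count of segment i of the sequential stream
    conflicts[i] -- True if segment i has a data dependency that invalidates
                    it whenever it is executed speculatively
    cores        -- one master core plus (cores - 1) speculative cores

    In each round the master runs the next segment non-speculatively while
    speculative cores run the following segments. The first conflicting
    segment (and everything after it) is squashed and restarts the next
    round on the master. A round costs as much as its longest committed
    segment.
    """
    if cores < 1 or len(segments) != len(conflicts):
        raise ValueError("need at least one core and matching list lengths")
    total = 0
    i = 0
    while i < len(segments):
        committed = [segments[i]]  # the master's segment always commits
        j = i + 1
        while j < len(segments) and j - i < cores and not conflicts[j]:
            committed.append(segments[j])  # speculation succeeded
            j += 1
        total += max(committed)
        i = j  # squashed or not-yet-started work begins the next round
    return total


if __name__ == "__main__":
    segs = [100] * 8
    sequential = sum(segs)
    no_conflicts = critical_path(segs, [False] * 8, cores=8)
    some_conflicts = critical_path(
        segs, [False, False, True, False, True, False, True, False], cores=8
    )
    print(sequential, no_conflicts, some_conflicts)
```

In this model, eight equal segments with no conflicts collapse to the length of a single segment, while a conflict on every other segment halves the critical path instead. Even so, the model overstates the benefit, because invalidated speculative work is treated as free.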

Information & Contributors

Information

Published In

ACM SIGPLAN Notices Volume 48, Issue 7

VEE '13

July 2013

194 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2517326

VEE '13: Proceedings of the 9th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
March 2013
210 pages
ISBN:9781450312660
DOI:10.1145/2451512
General Chair: Steve Muir (VMware, USA)
Program Chairs: Gernot Heiser (NICTA and University of New South Wales, Australia), Steve Blackburn (Australian National University, Australia)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Bibliometrics & Citations

Bibliometrics

Total Citations: 6
Total Downloads: 260
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 1

Reflects downloads up to 05 Jan 2025

Citations

Cited By

Saad M, Palmieri R, Ravindran B (2019). Lerna. ACM Transactions on Storage 15:1 (1-24). DOI: 10.1145/3310368. Online publication date: 22-Mar-2019
https://dl.acm.org/doi/10.1145/3310368
Saad M, Palmieri R, Ravindran B, Breitgand D, Yadgar G, Porter D, Eyal I (2018). Lerna. Proceedings of the 11th ACM International Systems and Storage Conference (37-48). DOI: 10.1145/3211890.3211897. Online publication date: 4-Jun-2018
https://dl.acm.org/doi/10.1145/3211890.3211897
Ying V, Jeffrey M, Sanchez D (2020). T4: Compiling Sequential Code for Effective Speculative Parallelization in Hardware. 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) (159-172). DOI: 10.1109/ISCA45697.2020.00024. Online publication date: May-2020
https://doi.org/10.1109/ISCA45697.2020.00024
Zhou R, Jones T, Kandemir M, Jimborean A, Moseley T (2019). Janus: statically-driven and profile-guided automatic dynamic binary parallelisation. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (15-25). DOI: 10.5555/3314872.3314877. Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.5555/3314872.3314877
Zhou R, Wort G, Erdős M, Jones T, Sartor J, Naik M, Rossbach C (2019). The janus triad: exploiting parallelism through dynamic binary modification. Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (88-100). DOI: 10.1145/3313808.3313812. Online publication date: 14-Apr-2019
https://dl.acm.org/doi/10.1145/3313808.3313812
Zhou R, Jones T (2019). Janus: Statically-Driven and Profile-Guided Automatic Dynamic Binary Parallelisation. 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (15-25). DOI: 10.1109/CGO.2019.8661196. Online publication date: Feb-2019
https://doi.org/10.1109/CGO.2019.8661196
