Limits of region-based dynamic binary parallelization

Published: 16 March 2013

Abstract

Efficiently executing sequential legacy binaries on chip multi-processors (CMPs) composed of many, small cores is one of today's most pressing problems. Single-threaded execution is a suboptimal option due to CMPs' lower single-core performance, while multi-threaded execution relies on prior parallelization, which is severely hampered by the low-level binary representation of applications compiled and optimized for a single-core target. A recent technology to address this problem is Dynamic Binary Parallelization (DBP), which creates a Virtual Execution Environment (VEE) taking advantage of the underlying multicore host to transparently parallelize the sequential binary executable. While still in its infancy, DBP has received broad interest within the research community. The combined use of DBP and thread-level speculation (TLS) has been proposed as a technique to accelerate legacy uniprocessor code on modern CMPs. In this paper, we investigate the limits of DBP and seek to gain an understanding of the factors contributing to these limits and the costs and overheads of its implementation. We have performed an extensive evaluation using a parameterizable DBP system targeting a CMP with light-weight architectural TLS support. We demonstrate that there is room for a significant reduction of up to 54% in the number of instructions on the critical paths of legacy SPEC CPU2006 benchmarks. However, we show that it is much harder to translate these savings into actual performance improvements, with a realistic hardware-supported implementation achieving a speedup of 1.09 on average.
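
As a rough reading of the two figures in the abstract, and assuming (this is an illustrative assumption, not a claim made by the paper) that execution time scales linearly with the number of instructions on the critical path, a 54% reduction would bound the best-case speedup at:

```latex
S_{\max} = \frac{1}{1 - 0.54} = \frac{1}{0.46} \approx 2.17
```

The reported average speedup of 1.09 with realistic hardware support sits well below this idealized bound. The 54% figure is itself a best case ("up to"), so the comparison is indicative only.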

Reviews

Andre Maximo

Parallel computation is everywhere, from small portable devices to laptops, desktops, and data centers. This wealth of available parallelism challenges computer scientists and developers alike to rethink algorithms and rewrite old sequential source code. However, we still have a long way to go. This paper offers a deep analysis of a different way to address the challenge.

The authors investigate the performance limits of dynamic binary parallelization (DBP), a technique for transparently parallelizing single-threaded binary executables. The success of instruction-level parallelism on single-core architectures with multiple arithmetic logic units (ALUs) motivates a higher level of automatic parallelization on multi-core architectures, which is what DBP provides. In the paper, the authors first analyze this newly proposed architecture by explaining how sections of binary code are identified for parallel execution, and then present their experiments and extensive results.

The key idea of DBP is to reduce the number of critical path instructions by overlapping execution segments of the instruction stream. The overlapping is done by speculative cores launched by a master core responsible for the correct execution of the original binary. However, this strategy may reduce performance if many speculative threads are invalidated, mainly due to data dependences between threads.

The experiments cover various applications from a benchmark suite. The results, which compare an eight-core reduced instruction set computing (RISC) processor against single-core execution, show an interesting 54 percent reduction in the number of instructions on critical paths. However, the associated average speedup of 1.43 times does not match this reduction. This is still far from the ideal performance gain, which suggests that the main bottleneck is bandwidth rather than instruction processing.

Online Computing Reviews Service
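
The overlap-and-invalidate mechanism described in the review can be illustrated with a toy model. The sketch below is not the paper's system: the `Region` type, the `critical_path` function, the batching of regions per group of cores, and the rule that a squashed region simply re-executes after its predecessor are all simplifying assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A hypothetical code region; both fields are illustrative assumptions."""
    instructions: int
    depends_on_previous: bool  # True: a speculative run would be squashed


def critical_path(regions, cores):
    """Count critical path instructions under a toy speculation model.

    Regions are processed in batches of `cores`. The first region in a
    batch runs non-speculatively on the master core; the rest start
    speculatively at the same time and commit in program order.
    """
    total = 0
    for start in range(0, len(regions), cores):
        batch = regions[start:start + cores]
        finish = 0  # commit time of the previous region in this batch
        for idx, region in enumerate(batch):
            if idx == 0 or region.depends_on_previous:
                # Master region, or squashed speculation: it can only
                # (re-)execute once its predecessor has committed.
                finish += region.instructions
            else:
                # Successful speculation: ran from the batch start, but
                # still commits no earlier than its predecessor.
                finish = max(finish, region.instructions)
        total += finish
    return total


regions = [
    Region(100, False),
    Region(100, False),
    Region(100, True),   # data dependence: speculation is invalidated
    Region(100, False),
]
sequential = critical_path(regions, cores=1)
parallel = critical_path(regions, cores=4)
print(sequential, parallel, 1 - parallel / sequential)
```

In this model every invalidated region adds its full length back onto the critical path, so a workload in which all regions depend on their predecessors gains nothing from the extra cores.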

Information & Contributors

Information

Published In

ACM SIGPLAN Notices, Volume 48, Issue 7
VEE '13
July 2013
194 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2517326
  • VEE '13: Proceedings of the 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments
    March 2013
    210 pages
    ISBN: 9781450312660
    DOI: 10.1145/2451512
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 March 2013
Published in SIGPLAN Volume 48, Issue 7

Author Tags

  1. automatic parallelization
  2. dynamic binary parallelization
  3. runtime systems
  4. thread-level speculation
  5. transactional memory

Qualifiers

  • Research-article

