As improving the performance of single-threaded processors becomes increasingly difficult, consumer desktop processors are moving toward multi-core designs. One way to enhance the performance of chip multiprocessors that has received considerable attention is the use of thread-level speculation (TLS). As a case study, we manually parallelized several of the SPEC CPU2000 floating-point and integer applications using TLS. Manual parallelization enabled us to apply techniques and programmer expertise that are beyond the current capabilities of automated parallelizers, and from this experience we provide insight into ways to aggressively apply TLS to parallelize applications for high performance. This information can help guide the design of future advanced TLS compilers.

For each application, we discuss how and where parallelism was located within the application, the impediments to extracting this parallelism using TLS, and the code transformations that were required to overcome these impediments. We also generalize these experiences into a discussion of common hindrances to TLS parallelization, and describe programming practices that help expose application parallelism to TLS systems. These guidelines can assist developers of uniprocessor programs in creating applications that port easily to TLS systems and yield good performance. By manually parallelizing SPEC CPU2000, we provide guidance on where thread-level parallelism exists in these well-known benchmarks, what limits its extraction, how to reduce these limitations, and what performance can be expected on these applications from a chip multiprocessor system with TLS.
The authors present results and experience gathered from using thread-level speculation (TLS) techniques to manually parallelize seven applications chosen from SPEC CPU2000, one of the most popular benchmark suites for measuring compute-intensive performance. The applications are parallelized for the Stanford Hydra chip multiprocessor (Hydra CMP), a TLS-enabled hardware architecture. The goal of these experiments is to identify TLS-based techniques that can be automated, as well as to extract a number of useful guidelines that help sequential programmers write easy-to-parallelize code.
TLS techniques allow the out-of-order execution of ordered threads while preserving their in-order appearance. The paper focuses on loop-only speculation, a TLS technique that parallelizes single-level loops by assigning each loop iteration to a thread and speculatively running several iterations in parallel. To allow the sequential code to run in parallel, two categories of parallelization techniques are used: automated ones already available in existing compilers, such as loop chunking/slicing, parallel reductions, and explicit synchronization; and expertise-based, more application-dependent ones not yet automated in present compilers, such as speculative pipelining, algorithm/data-structure adaptation, and complex value prediction.
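To make the loop-only model concrete, the sketch below shows the kind of loop it targets. This is an illustrative assumption, not code from the paper: the tls_* region markers are hypothetical placeholders, defined here as no-op stubs so the fragment compiles, and do not correspond to Hydra's actual interface.

#include <stddef.h>

/* Hypothetical TLS region markers -- illustration only, defined as
 * no-op stubs so this compiles; real TLS hardware such as Hydra
 * exposes a different mechanism. */
static void tls_begin_speculative_region(void) {}
static void tls_end_speculative_region(void) {}

/* A loop with a *possible* cross-iteration dependence through p[]:
 * if idx[i] == idx[j] for i < j, iteration j reads a value written
 * by iteration i, so a static compiler must serialize the loop. */
void scatter_add(double *p, const int *idx, size_t n)
{
    tls_begin_speculative_region();
    for (size_t i = 0; i < n; i++) {
        /* Under loop-only speculation each iteration becomes an
         * ordered speculative thread: its stores are buffered, its
         * loads are tracked, and the thread is squashed and re-run
         * only if an earlier iteration writes a location it read. */
        p[idx[i]] += 1.0;
    }
    tls_end_speculative_region();
}

The payoff is that TLS speculates that index repeats are rare and pays a squash penalty only when one actually occurs, instead of serializing every iteration.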
Based on implementation language and profiling information, the authors selected four floating-point applications and three integer applications out of the complete SPEC CPU2000 suite. They are: 177.mesa (a 3D graphics library), 179.art (an image-recognition application based on neural networks), 183.equake (a seismic wave propagation simulation), 188.ammp (a computational chemistry application), 175.vpr (a field-programmable gate array (FPGA) circuit place-and-route application), 181.mcf (a combinatorial optimization application), and 300.twolf (a place-and-route simulator). For each of these applications, the parallelization process starts with code profiling, which indicates what parts of the code (in this specific case, which loops) are the best candidates for parallelization. The identified blocks, later referred to as "speculative regions," are then transformed by applying different combinations of automated and expertise-based parallelization techniques; one such transformation is sketched below.
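As one illustration of the automated transformations in that toolkit, the fragment below sketches loop chunking. The helper body and the chunk size of 16 are assumptions made for illustration, not values taken from the paper.

#include <stddef.h>

/* Hypothetical per-iteration work; stands in for a real loop body. */
static void body(double *a, size_t i) { a[i] = a[i] * 2.0 + 1.0; }

/* Loop chunking: group iterations so that each speculative thread
 * carries enough work to amortize thread start-up and commit
 * overhead. CHUNK = 16 is an illustrative tuning value. */
#define CHUNK 16

void chunked(double *a, size_t n)
{
    /* One speculative thread per chunk instead of per iteration. */
    for (size_t c = 0; c < n; c += CHUNK) {
        size_t hi = (c + CHUNK < n) ? (c + CHUNK) : n;
        for (size_t i = c; i < hi; i++)
            body(a, i);
    }
}

Choosing the chunk size trades speculation overhead against the probability that a chunk contains a real cross-thread dependence and must be squashed.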
Specifically, no more than seven speculative regions were identified in any of the seven applications, and four of the applications exhibited only one speculative region. The execution-time coverage of these speculative regions ranges between 84 and 100 percent. Performance was measured by running the applications on a cycle-accurate simulator, using three different memory models: realistic, perfect with TLS software overheads, and perfect without TLS overheads. The results show that three of the seven applications (three of the four floating-point applications) can be efficiently parallelized with automated techniques, while the integer applications need expertise-based techniques to obtain good speedup. The full measured results are presented in the paper, and show that TLS is a valuable technique for extracting parallelism from sequential applications, allowing them to be ported to TLS chip multiprocessors (CMPs).
Based on the results of the case studies, the authors derive six guidelines for "TLS-friendly programming," advising programmers which practices to avoid when writing uniprocessor applications that are intended to be easily parallelized using TLS-driven methods; the authors explain and exemplify why such practices should be avoided in the context of TLS, offering hints and alternative solutions. Nevertheless, it must be said that avoiding such practices is quite restrictive and unnatural for sequential application programmers: idioms such as reuse of variables and varying recursion depth usually help to produce elegant and efficient sequential code. The sketch below illustrates this tension for variable reuse.
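The following is a minimal sketch of that tension, with hypothetical function and variable names: a scratch variable reused across iterations makes every speculative thread appear to conflict with its predecessor, while privatizing the scratch restores independence.

#include <stddef.h>

/* Scratch variable reused across iterations -- the pattern the
 * guidelines warn against. Every iteration writes 'tmp', so each
 * speculative thread carries a WAW/WAR hazard against its
 * predecessor even though no real data flows between iterations. */
static double tmp;

double dot_reused(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        tmp = a[i] * b[i];        /* cross-iteration false dependence */
        s += tmp;
    }
    return s;
}

/* TLS-friendly version: the scratch is declared inside the loop, so
 * iterations share no state except the accumulation into 's', which
 * a compiler or TLS system can handle as a parallel reduction. */
double dot_private(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        double t = a[i] * b[i];   /* per-iteration private copy */
        s += t;
    }
    return s;
}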
The paper is well written, well structured, and complete. All specific terminology is promptly explained and referenced, and the methods and techniques are presented clearly and augmented with examples. The experimental setup is characterized in detail, and each of the chosen applications is briefly discussed. The results of the experiments are clear and coherent. The authors present a successful series of case studies, which yield a valuable list of guidelines for parallelizing applications using TLS-based methods.