As increasing the performance of single-threaded processors becomes increasingly difficult, consumer desktop processors are moving toward multi-core designs. One way to enhance the performance of chip multiprocessors that has received considerable attention is the use of thread-level speculation (TLS). As a case study, we manually parallelized several of the SPEC CPU2000 floating point and integer applications using TLS. The use of manual parallelization enabled us to apply techniques and programmer expertise that are beyond the current capabilities of automated parallelizers. With the experience gained from this, we provide insight into ways to aggressively apply TLS to parallelize applications for high performance. This information can help guide future advanced TLS compiler design.
For each application, we discuss how and where parallelism was located within the application, the impediments to extracting this parallelism using TLS, and the code transformations that were required to overcome these impediments. We also generalize these experiences to a discussion of common hindrances to TLS parallelization, and describe methods of programming that help expose application parallelism to TLS systems. These guidelines can assist developers of uniprocessor programs to create applications that can easily port to TLS systems and yield good performance. By using manual parallelization on SPEC2000, we provide guidance on where thread-level parallelism exists in these well known benchmarks, what limits its extraction, how to reduce these limitations and what performance can be expected on these applications from a chip multiprocessor system with TLS.
PPoPP '03: Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Proceedings of the ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP 2003) and workshop on partial evaluation and semantics-based program manipulation (PEPM 2003)
The authors present results and experience gathered from using thread-level speculation (TLS) techniques to manually parallelize seven applications chosen from SPEC CPU2000, one of the most popular benchmark suites for measuring compute-intensive performance. The applications are parallelized for the Stanford Hydra chip multiprocessor (Hydra CMP), a hardware architecture with TLS support. The goal of these experiments is to identify TLS-based techniques that can be automated, as well as to extract a number of useful guidelines to help sequential programmers write easy-to-parallelize code.
TLS techniques allow the out-of-order execution of ordered threads while preserving the appearance of in-order execution. The paper focuses on loop-only speculation, a TLS technique used to parallelize single-level loops by assigning each loop iteration to a thread and speculatively running several iterations in parallel. To allow the sequential code to run in parallel, two categories of parallelization techniques are used: automated ones that are already available in existing compilers (loop chunking/slicing, parallel reductions, and explicit synchronization), and expertise-based, more application-dependent ones that are not yet automated in current compilers (speculative pipelining, adaptation of algorithms and data structures, and complex value prediction).
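As a rough illustration of two of the automated techniques named above, loop chunking and parallel reductions, the following Python sketch emulates the transformation sequentially. It is not code from the paper and does not model the Hydra hardware or any real TLS runtime; the function names and the `num_threads` and `chunk_size` parameters are assumptions chosen purely for illustration.

```python
def sequential_sum(values):
    """Original loop: every iteration reads and writes `total`,
    so each iteration depends on the one before it."""
    total = 0
    for v in values:
        total += v
    return total


def chunked_reduction(values, num_threads=4, chunk_size=8):
    """Transformed loop, emulated sequentially.

    Loop chunking: iterations are grouped into chunks so that each
    (hypothetical) speculative thread does more work per thread start.
    Parallel reduction: each thread accumulates into its own partial
    sum, removing the dependency on a single shared accumulator.
    """
    partial = [0] * num_threads
    for start in range(0, len(values), chunk_size):
        # Chunks are handed to threads round-robin (an assumption).
        thread_id = (start // chunk_size) % num_threads
        for v in values[start:start + chunk_size]:
            partial[thread_id] += v
    # The partial sums are combined once, after the loop.
    return sum(partial)


data = list(range(100))
result_sequential = sequential_sum(data)
result_chunked = chunked_reduction(data)
```

Integers are used here so that both versions agree exactly; with floating-point data, reordering the additions can change the rounding of the final result.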
Based on implementation language and profiling information, the authors selected four floating-point applications and three integer applications out of CPU2000's complete suite. They are: 177.mesa (a 3D graphics library), (an image-recognition application based on neural networks), 183.equake (a seismic wave propagation simulation), 188.ammp (a computational chemistry application), 175.vpr (a field-programmable gate array (FPGA) circuit place and route application), 181.mcf (a combinatorial optimization application), and 300.twolf (a place and route simulator). For each of these applications, the parallelization process starts with code profiling, which indicates what parts of the code (in this specific case, which loops) are the best candidates for parallelization. The identified blocks, later referred to as "speculative regions," are then transformed by applying different combinations of automated and expertise-based parallelization techniques.
Specifically, no more than seven speculative regions were identified in any of the seven applications, and four of the applications exhibited only one speculative region. The execution time coverage of these speculative regions ranges between 84 and 100 percent. Performance was measured by running the applications on a cycle-accurate simulator, using three different memory models: realistic, perfect with TLS software overheads, and perfect without TLS overheads. The results show that three out of the seven applications (three out of four floating-point applications) can be efficiently parallelized with automated techniques, while integer applications need expertise-based techniques in order to obtain good speedup. The full measured results are presented in the paper, and show that TLS is a valuable technique for extracting parallelism out of sequential applications, allowing such applications to be ported to TLS chip multiprocessors (CMPs).
Based on the results of the case studies, the authors have derived six guidelines for "TLS-friendly programming," advising programmers of practices to avoid when writing uniprocessor applications that are intended to be easily parallelized using TLS-driven methods. The authors explain, with examples, why such practices should not be used in the context of TLS, and offer hints and alternative solutions. Nevertheless, it must be said that avoiding such practices is quite restrictive and unnatural for sequential application programmers: methods such as reuse of variables and varying recursion depth usually help to produce elegant and efficient sequential code.
The paper is well written, well structured, and complete. All specific terminology is swiftly explained and referenced. Methods and techniques are also presented clearly and augmented with examples. The experimental setup is characterized in detail, while each of the chosen applications is briefly discussed. The results of the experiments are clear and coherent. The authors have presented a successful series of case studies, which provide a valuable list of guidelines for applications to be parallelized using TLS-based methods.
