As improving the performance of single-threaded processors becomes increasingly difficult, consumer desktop processors are moving toward multi-core designs. One way to enhance the performance of chip multiprocessors that has received considerable attention is the use of thread-level speculation (TLS). As a case study, we manually parallelized several of the SPEC CPU2000 floating-point and integer applications using TLS. Manual parallelization enabled us to apply techniques and programmer expertise that are beyond the current capabilities of automated parallelizers, and from this experience we provide insight into ways to aggressively apply TLS to parallelize applications for high performance. This information can help guide the design of future advanced TLS compilers.

For each application, we discuss how and where parallelism was located within the application, the impediments to extracting this parallelism using TLS, and the code transformations that were required to overcome these impediments. We also generalize these experiences into a discussion of common hindrances to TLS parallelization, and describe programming practices that help expose application parallelism to TLS systems. These guidelines can assist developers of uniprocessor programs in creating applications that port easily to TLS systems and yield good performance. By manually parallelizing SPEC CPU2000, we provide guidance on where thread-level parallelism exists in these well-known benchmarks, what limits its extraction, how to reduce these limitations, and what performance can be expected on these applications from a chip multiprocessor system with TLS.
The authors present results and experience gathered from using thread-level speculation (TLS) techniques to manually parallelize seven applications chosen from SPEC CPU2000, one of the most popular benchmark suites for measuring compute-intensive performance. The applications are parallelized for the Stanford Hydra chip multiprocessor (Hydra CMP), a TLS-enabled hardware architecture. The goal of these experiments is to identify TLS-based techniques that can be automated, as well as to extract a number of useful guidelines that help sequential programmers write easy-to-parallelize code.
TLS techniques allow the out-of-order execution of ordered threads while preserving their in-order appearance. The paper focuses on loop-only speculation, a TLS technique that parallelizes single-level loops by assigning each loop iteration to a thread and speculatively running several iterations in parallel. To allow the sequential code to run in parallel, two categories of parallelization techniques are used: automated ones already available in existing compilers, such as loop chunking/slicing, parallel reductions, and explicit synchronization; and expertise-based, more application-dependent ones not yet automated in present compilers, such as speculative pipelining, algorithm/data-structure adaptation, and complex value prediction.
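To make the loop-only model concrete, the sketch below shows the kind of loop it targets. This is an illustrative assumption, not code from the paper: the tls_* region markers are hypothetical placeholders, defined here as no-op stubs so the fragment compiles, and do not correspond to Hydra's actual interface.

#include <stddef.h>

/* Hypothetical TLS region markers -- illustration only, defined as
 * no-op stubs so this compiles; real TLS hardware such as Hydra
 * exposes a different mechanism. */
static void tls_begin_speculative_region(void) {}
static void tls_end_speculative_region(void) {}

/* A loop with a *possible* cross-iteration dependence through p[]:
 * if idx[i] == idx[j] for i < j, iteration j reads a value written
 * by iteration i, so a static compiler must serialize the loop. */
void scatter_add(double *p, const int *idx, size_t n)
{
    tls_begin_speculative_region();
    for (size_t i = 0; i < n; i++) {
        /* Under loop-only speculation each iteration becomes an
         * ordered speculative thread: its stores are buffered, its
         * loads are tracked, and the thread is squashed and re-run
         * only if an earlier iteration writes a location it read. */
        p[idx[i]] += 1.0;
    }
    tls_end_speculative_region();
}

The payoff is that TLS speculates that index repeats are rare and pays a squash penalty only when one actually occurs, instead of serializing every iteration.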
Based on implementation language and profiling information, the authors selected four floating-point applications and three integer applications out of the complete SPEC CPU2000 suite. They are: 177.mesa (a 3D graphics library), 179.art (an image-recognition application based on neural networks), 183.equake (a seismic wave propagation simulation), 188.ammp (a computational chemistry application), 175.vpr (a field-programmable gate array (FPGA) circuit place-and-route application), 181.mcf (a combinatorial optimization application), and 300.twolf (a place-and-route simulator). For each of these applications, the parallelization process starts with code profiling, which indicates what parts of the code (in this specific case, which loops) are the best candidates for parallelization. The identified blocks, later referred to as "speculative regions," are then transformed by applying different combinations of automated and expertise-based parallelization techniques; one such transformation is sketched below.
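As one illustration of the automated transformations in that toolkit, the fragment below sketches loop chunking. The helper body and the chunk size of 16 are assumptions made for illustration, not values taken from the paper.

#include <stddef.h>

/* Hypothetical per-iteration work; stands in for a real loop body. */
static void body(double *a, size_t i) { a[i] = a[i] * 2.0 + 1.0; }

/* Loop chunking: group iterations so that each speculative thread
 * carries enough work to amortize thread start-up and commit
 * overhead. CHUNK = 16 is an illustrative tuning value. */
#define CHUNK 16

void chunked(double *a, size_t n)
{
    /* One speculative thread per chunk instead of per iteration. */
    for (size_t c = 0; c < n; c += CHUNK) {
        size_t hi = (c + CHUNK < n) ? (c + CHUNK) : n;
        for (size_t i = c; i < hi; i++)
            body(a, i);
    }
}

Choosing the chunk size trades speculation overhead against the probability that a chunk contains a real cross-thread dependence and must be squashed.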
Specifically, no more than seven speculative regions were identified in any of the seven applications, and four of the applications exhibited only one speculative region. The execution-time coverage of these speculative regions ranges between 84 and 100 percent. Performance was measured by running the applications on a cycle-accurate simulator, using three different memory models: realistic, perfect with TLS software overheads, and perfect without TLS overheads. The results show that three of the seven applications (three of the four floating-point applications) can be efficiently parallelized with automated techniques, while the integer applications need expertise-based techniques to obtain good speedup. The full measured results are presented in the paper, and show that TLS is a valuable technique for extracting parallelism from sequential applications, allowing them to be ported to TLS chip multiprocessors (CMPs).
Based on the results of the case studies, the authors derive six guidelines for "TLS-friendly programming," advising programmers which practices to avoid when writing uniprocessor applications that are intended to be easily parallelized using TLS-driven methods; the authors explain and exemplify why such practices should be avoided in the context of TLS, offering hints and alternative solutions. Nevertheless, it must be said that avoiding such practices is quite restrictive and unnatural for sequential application programmers: idioms such as reuse of variables and varying recursion depth usually help to produce elegant and efficient sequential code. The sketch below illustrates this tension for variable reuse.
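The following is a minimal sketch of that tension, with hypothetical function and variable names: a scratch variable reused across iterations makes every speculative thread appear to conflict with its predecessor, while privatizing the scratch restores independence.

#include <stddef.h>

/* Scratch variable reused across iterations -- the pattern the
 * guidelines warn against. Every iteration writes 'tmp', so each
 * speculative thread carries a WAW/WAR hazard against its
 * predecessor even though no real data flows between iterations. */
static double tmp;

double dot_reused(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        tmp = a[i] * b[i];        /* cross-iteration false dependence */
        s += tmp;
    }
    return s;
}

/* TLS-friendly version: the scratch is declared inside the loop, so
 * iterations share no state except the accumulation into 's', which
 * a compiler or TLS system can handle as a parallel reduction. */
double dot_private(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        double t = a[i] * b[i];   /* per-iteration private copy */
        s += t;
    }
    return s;
}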
The paper is well written, well structured, and complete. All specific terminology is promptly explained and referenced, and the methods and techniques are presented clearly and augmented with examples. The experimental setup is characterized in detail, and each of the chosen applications is briefly discussed. The results of the experiments are clear and coherent. The authors present a successful series of case studies, which yield a valuable list of guidelines for parallelizing applications using TLS-based methods.