DOI: 10.1145/1065944.1065964 (ACM Conferences: PPoPP Conference Proceedings)
Article

Exposing speculative thread parallelism in SPEC2000

Published: 15 June 2005 Publication History

Abstract

As improving the performance of single-threaded processors becomes increasingly difficult, consumer desktop processors are moving toward multi-core designs. One way to enhance the performance of chip multiprocessors that has received considerable attention is the use of thread-level speculation (TLS). As a case study, we manually parallelized several of the SPEC CPU2000 floating point and integer applications using TLS. The use of manual parallelization enabled us to apply techniques and programmer expertise that are beyond the current capabilities of automated parallelizers. With the experience gained from this, we provide insight into ways to aggressively apply TLS to parallelize applications for high performance. This information can help guide future advanced TLS compiler design.

For each application, we discuss how and where parallelism was located within the application, the impediments to extracting this parallelism using TLS, and the code transformations that were required to overcome these impediments. We also generalize these experiences into a discussion of common hindrances to TLS parallelization, and describe methods of programming that help expose application parallelism to TLS systems. These guidelines can assist developers of uniprocessor programs in creating applications that port easily to TLS systems and yield good performance. By using manual parallelization on SPEC2000, we provide guidance on where thread-level parallelism exists in these well-known benchmarks, what limits its extraction, how to reduce these limitations, and what performance can be expected on these applications from a chip multiprocessor system with TLS.



Reviews

Henk Sips

The authors present results and experience gathered from using thread-level speculation (TLS) techniques to manually parallelize seven applications chosen from SPEC CPU2000, one of the most popular benchmark suites for measuring processor performance. The applications are parallelized for the Stanford Hydra chip multiprocessor (Hydra CMP), a hardware TLS-enabled architecture. The goal of these experiments is to identify TLS-based techniques that can be automated, and to extract useful guidelines that help sequential programmers write easy-to-parallelize code. TLS techniques allow the out-of-order execution of ordered threads while preserving their in-order appearance. The paper focuses on loop-only speculation, a TLS technique that parallelizes single-level loops by assigning each loop iteration to a thread and speculatively running several iterations in parallel. To allow the sequential code to run in parallel, two categories of parallelization techniques are used: automated ones already available in existing compilers (loop chunking/slicing, parallel reductions, and explicit synchronization), and expertise-based, more application-dependent ones not yet automated in present compilers (speculative pipelining, algorithm/data-structure adaptation, and complex value prediction). Based on implementation language and profiling information, the authors selected four floating-point applications and three integer applications from the full CPU2000 suite: 177.mesa (a 3D graphics library), 179.art (an image-recognition application based on neural networks), 183.equake (a seismic wave propagation simulation), 188.ammp (a computational chemistry application), 175.vpr (a field-programmable gate array (FPGA) circuit place-and-route application), 181.mcf (a combinatorial optimization application), and 300.twolf (a place-and-route simulator).
For each of these applications, the parallelization process starts with code profiling, which indicates which parts of the code (in this case, which loops) are the best candidates for parallelization. The identified blocks, later referred to as "speculative regions," are then transformed by applying different combinations of automated and expertise-based parallelization techniques. No application contained more than seven speculative regions, and four of the seven exhibited only one. The execution-time coverage of these speculative regions ranges between 84 and 100 percent. Performance was measured by running the applications on a cycle-accurate simulator under three memory models: realistic, perfect with TLS software overheads, and perfect without TLS overheads. The results show that three of the seven applications (three of the four floating-point applications) can be efficiently parallelized with automated techniques, while the integer applications need expertise-based techniques to obtain good speedup. The full measured results are presented in the paper, and show that TLS is a valuable technique for extracting parallelism from sequential applications, allowing them to be ported to TLS chip multiprocessors (CMPs). Based on the results of the case studies, the authors derive six guidelines for "TLS-friendly programming," advising programmers which practices to avoid when writing uniprocessor applications that are intended to be easily parallelized using TLS-driven methods; the authors explain and exemplify why such practices should not be used in the context of TLS, offering hints and alternative solutions.
Nevertheless, it must be said that avoiding such practices is quite restrictive and unnatural for sequential application programmers: methods such as the reuse of variables and varying recursion depth usually help to produce elegant and efficient sequential code. The paper is well written, well structured, and complete. All specific terminology is clearly explained and referenced. Methods and techniques are presented clearly and augmented with examples. The experimental setup is characterized in detail, and each of the chosen applications is briefly discussed. The results of the experiments are clear and coherent. The authors have presented a successful series of case studies, which provide a valuable list of guidelines for applications to be parallelized using TLS-based methods.

Online Computing Reviews Service



Published In

PPoPP '05: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
June 2005
310 pages
ISBN:1595930809
DOI:10.1145/1065944
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. SPEC CPU2000
  2. chip multiprocessors
  3. feedback-driven optimization
  4. manual parallel programming
  5. multithreading
  6. thread-level speculation

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
