DOI: 10.1145/2755573.2755603
research-article

The Cilkprof Scalability Profiler

Published: 13 June 2015

Abstract

Cilkprof is a scalability profiler for multithreaded Cilk computations. Unlike its predecessor Cilkview, which analyzes only the whole-program scalability of a Cilk computation, Cilkprof collects work (serial running time) and span (critical-path length) data for each call site in the computation to assess how much each call site contributes to the overall work and span. Profiling work and span in this way enables a programmer to quickly diagnose scalability bottlenecks in a Cilk program. Despite the detail and quantity of information required to collect these measurements, Cilkprof runs with only constant asymptotic slowdown over the serial running time of the parallel computation. As an example of Cilkprof's usefulness, we used it to diagnose a scalability bottleneck in an 1800-line parallel breadth-first search (PBFS) code. By examining Cilkprof's output in tandem with the source code, we were able to zero in on a call site within the PBFS routine that imposed a scalability bottleneck. A minor code modification then improved the parallelism of PBFS by a factor of 5. Using Cilkprof, it took us less than two hours to find and fix a scalability bug that had, until then, eluded us for months. This paper describes the Cilkprof algorithm and proves, using an amortization argument, that Cilkprof incurs only constant overhead compared with the application's native serial running time. Cilkprof is implemented through compiler instrumentation, that is, by modifying the LLVM compiler to insert instrumentation into user programs. On a suite of 16 application benchmarks, Cilkprof incurs a geometric-mean multiplicative overhead of only 1.9 and a maximum multiplicative overhead of only 7.4 compared with running the benchmarks without instrumentation.
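To make the work and span measures concrete, the following is a minimal sketch, not the Cilkprof algorithm itself, of how the two quantities combine over a series-parallel computation. The `Node` class, the cost values, and the `before`/`after` examples are hypothetical and chosen only for illustration: work adds up across all strands, while span adds across serial composition and takes the maximum across parallel composition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical series-parallel computation tree (illustration only)."""
    kind: str                      # "strand", "series", or "parallel"
    cost: float = 0.0              # running time; used only by strands
    children: List["Node"] = field(default_factory=list)  # non-empty for composites


def work(n: Node) -> float:
    """Total serial running time: the sum of all strand costs."""
    if n.kind == "strand":
        return n.cost
    return sum(work(c) for c in n.children)


def span(n: Node) -> float:
    """Critical-path length: sum over series, max over parallel children."""
    if n.kind == "strand":
        return n.cost
    child_spans = [span(c) for c in n.children]
    return sum(child_spans) if n.kind == "series" else max(child_spans)


def parallelism(n: Node) -> float:
    """Work divided by span, an upper bound on achievable speedup."""
    return work(n) / span(n)


def strand(cost: float) -> Node:
    return Node("strand", cost)


# A serial step of cost 100 followed by 8 parallel tasks of cost 100 each.
before = Node("series", children=[
    strand(100),
    Node("parallel", children=[strand(100) for _ in range(8)]),
])

# The same computation with the serial step split into 4 parallel pieces.
after = Node("series", children=[
    Node("parallel", children=[strand(25) for _ in range(4)]),
    Node("parallel", children=[strand(100) for _ in range(8)]),
])

print(work(before), span(before), parallelism(before))  # 900 200 4.5
print(work(after), span(after), parallelism(after))     # 900 125 7.2
```

In this toy example the work is unchanged at 900, but shortening the serial step reduces the span from 200 to 125 and raises the parallelism from 4.5 to 7.2. A per-call-site breakdown of the kind the abstract describes is what would point a programmer to such a step.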

Published In

SPAA '15: Proceedings of the 27th ACM symposium on Parallelism in Algorithms and Architectures
June 2015
362 pages
ISBN:9781450335881
DOI:10.1145/2755573
  • General Chair: Guy Blelloch
  • Program Chair: Kunal Agrawal

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cilk
  2. cilkprof
  3. compiler instrumentation
  4. llvm
  5. multithreading
  6. parallelism
  7. performance
  8. profiling
  9. scalability
  10. serial bottleneck
  11. span
  12. work

Qualifiers

  • Research-article

Conference

SPAA '15

Acceptance Rates

SPAA '15 Paper Acceptance Rate 31 of 131 submissions, 24%;
Overall Acceptance Rate 447 of 1,461 submissions, 31%
