TAPE: a transactional application profiling environment

Published: 20 June 2005

Abstract

Transactional Coherence and Consistency (TCC) provides a new parallel programming model that uses transactions as the basic unit of parallel work and communication. TCC simplifies the development of correct parallel code because hardware provides transaction atomicity and ordering. Nevertheless, the programmer or a dynamic compiler must still optimize the parallel code for performance. This paper presents TAPE, a hardware and software infrastructure for profiling in TCC systems. TAPE extends the hardware for transactional execution to identify performance impediments such as dependence violations, buffer overflows, and work imbalance. It filters infrequent events to reduce resource requirements and allows the programmer to focus on the most important bottlenecks. We demonstrate that TAPE introduces minimal die area and performance overhead and can be used continuously, even for production runs. Moreover, we demonstrate how to leverage the profiling information to guide optimization for a set of parallel applications. TAPE accurately identifies the source code location and type of the most important bottlenecks, allowing a programmer to achieve maximum parallel speedup with a few profiling steps.

Published In

ICS '05: Proceedings of the 19th annual international conference on Supercomputing
June 2005
414 pages
ISBN:1595931678
DOI:10.1145/1088149

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS05
ICS05: International Conference on Supercomputing 2005
June 20 - 22, 2005
Cambridge, Massachusetts

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
