
Hill-climbing SMT processor resource distribution

Published: 13 February 2009

Abstract

The key to high performance in Simultaneous MultiThreaded (SMT) processors lies in optimizing the distribution of shared resources to active threads. Existing resource distribution techniques optimize performance only indirectly. They infer potential performance bottlenecks by observing indicators, like instruction occupancy or cache miss counts, and take actions to try to alleviate them. While the corrective actions are designed to improve performance, their actual performance impact is not known since end performance is never monitored. Consequently, potential performance gains are lost whenever the corrective actions do not effectively address the actual bottlenecks occurring in the pipeline.
We propose a different approach to SMT resource distribution that optimizes end performance directly. Our approach observes the impact that resource distribution decisions have on performance at runtime, and feeds this information back to the resource distribution mechanisms to improve future decisions. By evaluating many different resource distributions, our approach tries to learn the best distribution over time. Because we perform learning online, learning time is crucial. We develop a hill-climbing algorithm that quickly learns the best distribution of resources by following the performance gradient within the resource distribution space. We also develop several ideal learning algorithms to enable deeper insights through limit studies.
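The learning loop described above can be illustrated with a minimal sketch. It is not the article's algorithm: the function names, the `delta` step size, the equal-partition starting point, and the synthetic performance function are all assumptions made for illustration. Each round tries one trial distribution per thread (shifting `delta` units from every other thread to the favored one), measures each trial, and moves to the best trial only if it beats the current distribution.

```python
from typing import Callable, List, Optional


def shift_toward(shares: List[int], favored: int, delta: int) -> Optional[List[int]]:
    """Move `delta` units from every other thread to `favored` (hypothetical helper)."""
    trial = list(shares)
    for i in range(len(trial)):
        if i == favored:
            continue
        if trial[i] < delta:
            return None  # a thread cannot give up more than it holds
        trial[i] -= delta
        trial[favored] += delta
    return trial


def hill_climb(measure: Callable[[List[int]], float], total: int, threads: int,
               delta: int = 2, max_rounds: int = 50) -> List[int]:
    """Follow the measured performance gradient over resource distributions.

    `measure` stands in for running one epoch with a given distribution and
    reading back end performance; here it is an assumed callback.
    """
    base = total // threads
    anchor = [base] * threads
    anchor[0] += total - base * threads  # hand any remainder to thread 0
    anchor_perf = measure(anchor)
    for _ in range(max_rounds):
        trials = []
        for favored in range(threads):
            trial = shift_toward(anchor, favored, delta)
            if trial is not None:
                trials.append((measure(trial), trial))
        if not trials:
            break
        best_perf, best_trial = max(trials, key=lambda t: t[0])
        if best_perf <= anchor_perf:
            break  # no neighbor improves: a (possibly local) maximum
        anchor, anchor_perf = best_trial, best_perf
    return anchor


def synthetic_perf(shares: List[int]) -> float:
    """Made-up concave performance surface peaking at [20, 6, 6]."""
    target = [20, 6, 6]
    return -float(sum((s - t) ** 2 for s, t in zip(shares, target)))
```

Calling `hill_climb(synthetic_perf, total=32, threads=3)` climbs from the equal partition toward the peak of the synthetic surface. A real processor would measure noisy, time-varying performance rather than a fixed function, and this sketch stops at the first distribution with no better neighbor.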
This article conducts an in-depth investigation of hill-climbing SMT resource distribution using a comprehensive suite of 63 multiprogrammed workloads. Our results show hill-climbing outperforms ICOUNT, FLUSH, and DCRA (three existing SMT techniques) by 11.4%, 11.5%, and 2.8%, respectively, under the weighted IPC metric. A limit study conducted using our ideal learning algorithms shows our approach can potentially outperform the same techniques by 19.2%, 18.0%, and 7.6%, respectively, thus demonstrating additional room exists for further improvement. Using our ideal algorithms, we also identify three bottlenecks that limit online learning speed: local maxima, phased behavior, and interepoch jitter. We define metrics to quantify these learning bottlenecks, and characterize the extent to which they occur in our workloads. Finally, we conduct a sensitivity study, and investigate several extensions to improve our hill-climbing technique.
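The abstract reports gains under the weighted IPC metric without giving its formula. The sketch below assumes one common definition: the average, over threads, of each thread's IPC when co-scheduled divided by its IPC when running alone. The function names and the sample numbers are hypothetical and are not taken from the article.

```python
from typing import Sequence


def weighted_ipc(smt_ipc: Sequence[float], single_ipc: Sequence[float]) -> float:
    """Average of per-thread IPC normalized by stand-alone IPC (assumed definition)."""
    if len(smt_ipc) != len(single_ipc) or not smt_ipc:
        raise ValueError("need one stand-alone IPC per co-scheduled thread")
    ratios = [smt / alone for smt, alone in zip(smt_ipc, single_ipc)]
    return sum(ratios) / len(ratios)


def percent_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0
```

For example, two threads that each run at half their stand-alone IPC give `weighted_ipc([1.0, 0.5], [2.0, 1.0]) == 0.5`. Normalizing by stand-alone IPC keeps a policy from scoring well simply by favoring the thread with the highest raw IPC.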

Published In

ACM Transactions on Computer Systems  Volume 27, Issue 1
February 2009
100 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/1482619

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 February 2009
Accepted: 01 December 2008
Received: 01 August 2007
Published in TOCS Volume 27, Issue 1

Author Tags

  1. Hill-climbing algorithm
  2. SMT processor
  3. limit study

Qualifiers

  • Research-article
  • Research
  • Refereed
