DOI: 10.1109/MICRO.2010.13

Task Superscalar: An Out-of-Order Task Pipeline

Published: 04 December 2010

Abstract

We present Task Superscalar, an abstraction of the instruction-level out-of-order pipeline that operates at the task level. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task superscalar uncovers task-level parallelism among tasks generated by a sequential thread. Using intuitive programmer annotations of task inputs and outputs, the task superscalar pipeline dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. Furthermore, we propose a design for a distributed task superscalar pipeline frontend that can be embedded into any manycore fabric and that manages cores as functional units. We show that our proposed mechanism is capable of driving hundreds of cores simultaneously with non-speculative tasks, which allows our pipeline to sustain work windows consisting of tens of thousands of tasks. We further show that our pipeline can maintain a decode rate faster than 60 ns per task and dynamically uncover data dependencies among as many as ~50,000 in-flight tasks, using 7 MB of on-chip eDRAM storage. This configuration achieves speedups of 95–255x (average 183x) over sequential execution for nine scientific benchmarks running on a simulated CMP with 256 cores. Task superscalar thus enables programmers to exploit manycore systems effectively while simplifying their programming model.
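
To make the idea of detecting inter-task dependencies from input/output annotations concrete, the following is a minimal software sketch in Python. It is not the hardware pipeline described in the paper: the class `TaskWindow`, its methods `decode` and `schedule`, and the operand names are hypothetical and chosen only for illustration. The sketch also assumes that write-after-read and write-after-write conflicts are handled by conservative ordering; the abstract does not say how the actual design treats them.

```python
from collections import defaultdict


class TaskWindow:
    """Toy model (an assumption, not the paper's design) of a task window
    that derives dependencies from per-task input/output annotations."""

    def __init__(self):
        self.deps = {}                    # task id -> ids of tasks it must wait for
        self.last_writer = {}             # operand -> id of the latest task writing it
        self.readers = defaultdict(set)   # operand -> readers since the last write

    def decode(self, task_id, inputs=(), outputs=()):
        """Register a task in program order and record what it depends on."""
        waits = set()
        for name in inputs:
            if name in self.last_writer:          # read-after-write
                waits.add(self.last_writer[name])
        for name in outputs:
            if name in self.last_writer:          # write-after-write
                waits.add(self.last_writer[name])
            waits |= self.readers[name]           # write-after-read
        waits.discard(task_id)
        for name in inputs:
            self.readers[name].add(task_id)
        for name in outputs:
            self.last_writer[name] = task_id
            self.readers[name] = set()
        self.deps[task_id] = waits

    def schedule(self):
        """Group tasks into waves; tasks within a wave can run in parallel."""
        remaining = {t: set(d) for t, d in self.deps.items()}
        waves = []
        while remaining:
            ready = sorted(t for t, d in remaining.items() if not d)
            waves.append(ready)
            for t in ready:
                del remaining[t]
            for d in remaining.values():
                d.difference_update(ready)
        return waves


# A sequential thread generates five annotated tasks in program order.
window = TaskWindow()
window.decode("t1", outputs=["A"])
window.decode("t2", outputs=["B"])
window.decode("t3", inputs=["A"], outputs=["C"])
window.decode("t4", inputs=["B"], outputs=["D"])
window.decode("t5", inputs=["C", "D"], outputs=["A"])
print(window.schedule())  # [['t1', 't2'], ['t3', 't4'], ['t5']]
```

Although the tasks are generated sequentially, the annotations alone are enough to show that t1 and t2, and later t3 and t4, are independent and could be dispatched to different cores.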

Published In

MICRO '43: Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
December 2010
542 pages
ISBN:9780769542997

Publisher

IEEE Computer Society

United States

Author Tags

  1. CMP/manycore
  2. Out-of-order execution
  3. parallel programming
  4. task superscalar

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
