Article

Task Superscalar: An Out-of-Order Task Pipeline

Authors:

Felipe Cabarcas,

Alejandro Rico,

Eduard Ayguade,

Mateo Valero

MICRO '43: Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture

Pages 89 - 100

https://doi.org/10.1109/MICRO.2010.13

Published: 04 December 2010

Abstract

We present Task Superscalar, an abstraction of the instruction-level out-of-order pipeline that operates at the task level. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task superscalar uncovers task-level parallelism among tasks generated by a sequential thread. Using intuitive programmer annotations of task inputs and outputs, the task superscalar pipeline dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. Furthermore, we propose a design for a distributed task superscalar pipeline front end that can be embedded into any many-core fabric and that manages cores as functional units. We show that our proposed mechanism is capable of driving hundreds of cores simultaneously with non-speculative tasks, which allows our pipeline to sustain work windows consisting of tens of thousands of tasks. We further show that our pipeline can maintain a decode rate faster than one task per 60 ns and dynamically uncover data dependencies among as many as ~50,000 in-flight tasks, using 7 MB of on-chip eDRAM storage. This configuration achieves speedups of 95–255x (average 183x) over sequential execution for nine scientific benchmarks, running on a simulated CMP with 256 cores. Task superscalar thus enables programmers to exploit many-core systems effectively, while simultaneously simplifying their programming model.
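
The core idea in the abstract, detecting inter-task dependencies from declared inputs and outputs and then running independent tasks out of order, can be illustrated with a small software sketch. This is not the hardware pipeline described in the paper: the `TaskWindow` class, its `decode` and `schedule` methods, and the task names are all hypothetical, and the sketch conservatively orders every read-after-write, write-after-write, and write-after-read pair, which may differ from what the actual design does.

```python
from collections import defaultdict


class TaskWindow:
    """Toy decoder (hypothetical): records dependencies between annotated tasks."""

    def __init__(self):
        self.names = []                   # task id -> label
        self.deps = {}                    # task id -> ids of tasks it must wait for
        self.last_writer = {}             # object -> id of the last task writing it
        self.readers = defaultdict(set)   # object -> ids reading it since that write

    def decode(self, name, inputs=(), outputs=()):
        """Decode one task in program order and record what it depends on."""
        tid = len(self.names)
        self.names.append(name)
        waits = set()
        for obj in inputs:                # read-after-write
            if obj in self.last_writer:
                waits.add(self.last_writer[obj])
        for obj in outputs:               # write-after-write, write-after-read
            if obj in self.last_writer:
                waits.add(self.last_writer[obj])
            waits |= self.readers[obj]
        for obj in inputs:
            self.readers[obj].add(tid)
        for obj in outputs:
            self.last_writer[obj] = tid
            self.readers[obj] = set()
        self.deps[tid] = waits
        return tid

    def schedule(self):
        """Group tasks into waves; tasks within a wave are mutually independent."""
        done, waves = set(), []
        pending = set(self.deps)
        while pending:
            ready = sorted(t for t in pending if self.deps[t] <= done)
            waves.append([self.names[t] for t in ready])
            done.update(ready)
            pending.difference_update(ready)
        return waves


if __name__ == "__main__":
    w = TaskWindow()
    w.decode("init_a", outputs=["a"])
    w.decode("init_b", outputs=["b"])
    w.decode("scale_a", inputs=["a"], outputs=["a"])
    w.decode("sum", inputs=["a", "b"], outputs=["c"])
    w.decode("norm_b", inputs=["b"], outputs=["d"])
    for i, wave in enumerate(w.schedule()):
        print(i, wave)
```

In this example `norm_b` is decoded after `sum` but lands in an earlier wave, because it only needs `b`, while `sum` must wait for `scale_a`. That reordering is the task-level analogue of out-of-order instruction issue; each wave could be dispatched to separate cores acting as functional units.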

Information & Contributors

Information

Published In

MICRO '43: Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture

December 2010

542 pages

ISBN:9780769542997

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

IEEE Computer Society

United States

Publication History

Published: 04 December 2010

Qualifiers

Article

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 36
Total Downloads: 497
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 4

Reflects downloads up to 06 Jan 2025

Citations

Cited By

Durvasula S, Zhao A, Kiguru R, Guan Y, Chen Z, Vijaykumar N (2024). ACE: Efficient GPU Kernel Concurrency for Input-Dependent Irregular Computational Graphs. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pp. 258-270. doi:10.1145/3656019.3676897. Online publication date: 14-Oct-2024.
https://dl.acm.org/doi/10.1145/3656019.3676897
Fox D, Monsalve Diaz J, Li X (2023). A gem5 Implementation of the Sequential Codelet Model: Reducing Overhead and Expanding the Software Memory Interface. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 839-846. doi:10.1145/3624062.3624152. Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3624062.3624152
Elsabbagh F, Sheikhha S, Ying V, Nguyen Q, Emer J, Sanchez D (2023). Accelerating RTL Simulation with Hardware-Software Co-Design. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 153-166. doi:10.1145/3613424.3614257. Online publication date: 28-Oct-2023.
https://dl.acm.org/doi/10.1145/3613424.3614257
Diaz J, Harms K, Guaitero R, Perdomo D, Kumaran K, Gao G, Valero-Lara P, Lee S, Kestor G (2022). The SuperCodelet architecture. Proceedings of the 1st International Workshop on Extreme Heterogeneity Solutions, pp. 1-6. doi:10.1145/3529336.3530823. Online publication date: 2-Apr-2022.
https://dl.acm.org/doi/10.1145/3529336.3530823
Posluns G, Zhu Y, Zhang G, Jeffrey M, Salapura V, Zahran M, Chong F, Tang L (2022). A scalable architecture for reprioritizing ordered parallelism. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 437-453. doi:10.1145/3470496.3527387. Online publication date: 18-Jun-2022.
https://dl.acm.org/doi/10.1145/3470496.3527387
Wang D, Kim N, Sherwood T, Berger E, Kozyrakis C (2021). DiAG: a dataflow-inspired architecture for general-purpose processors. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 93-106. doi:10.1145/3445814.3446703. Online publication date: 19-Apr-2021.
https://dl.acm.org/doi/10.1145/3445814.3446703
Wang L, Deng Y, Gong R, Shi W, Luo L, Wang Y (2020). CSMO-DSE. ACM Journal on Emerging Technologies in Computing Systems, 16:2, pp. 1-22. doi:10.1145/3371406. Online publication date: 30-Jan-2020.
https://dl.acm.org/doi/10.1145/3371406
Morais L, Silva V, Goldman A, Alvarez C, Bosch J, Frank M, Araujo G (2019). Adding Tightly-Integrated Task Scheduling Acceleration to a RISC-V Multi-core Processor. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 861-872. doi:10.1145/3352460.3358271. Online publication date: 12-Oct-2019.
https://dl.acm.org/doi/10.1145/3352460.3358271
Puthoor S, Tang X, Gross J, Beckmann B, Kaeli D, Cavazos J (2018). Oversubscribed Command Queues in GPUs. Proceedings of the 11th Workshop on General Purpose GPUs, pp. 50-60. doi:10.1145/3180270.3180271. Online publication date: 24-Feb-2018.
https://dl.acm.org/doi/10.1145/3180270.3180271
Margerm S, Sharifian A, Guha A, Shriraman A, Pokam G, Oskin M, Inoue K (2018). TAPAS. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 245-257. doi:10.1109/MICRO.2018.00028. Online publication date: 20-Oct-2018.
https://dl.acm.org/doi/10.1109/MICRO.2018.00028