Instruction scheduling for a tiled dataflow architecture

Authors:

Martha Mercaldi,

Steven Swanson,

Andrew Petersen,

Andrew Schwerin,

Susan J. Eggers

ACM SIGARCH Computer Architecture News, Volume 34, Issue 5

Pages 141 - 150

https://doi.org/10.1145/1168919.1168876

Published: 20 October 2006

Abstract

This paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned into large sections, the bottom-level algorithm must more carefully analyze program structure when producing the final schedule.

Our analysis reveals that at this bottom level, good scheduling depends upon carefully balancing instruction contention for processing elements and operand latency between producer and consumer instructions. We develop a parameterizable instruction scheduler that more effectively optimizes this trade-off. We use this scheduler to determine the contention-latency sweet spot that generates the best instruction schedule for each application. To avoid this application-specific tuning, we also determine the parameters that produce the best performance across all applications. The result is a contention-latency setting that generates instruction schedules for all applications in our workload that come within 17% of the best schedule for each.
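
As a rough illustration of the contention-latency trade-off the abstract describes, the sketch below greedily places instructions on a grid of processing elements (PEs) using one tunable weight. It is not the scheduler from the paper. The function `place`, the `weight` knob, the Manhattan-distance latency model, and the per-PE instruction count used as a contention proxy are all assumptions made for illustration.

```python
from collections import defaultdict


def manhattan(a, b):
    """Hop count between two PE coordinates (assumed latency model)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def place(instructions, producers, grid, weight):
    """Greedily assign each instruction to a processing element.

    instructions: instruction names in dependence (topological) order
    producers:    maps an instruction to the instructions it consumes from
    grid:         list of (row, col) PE coordinates
    weight:       hypothetical knob in [0, 1];
                  0 scores latency only, 1 scores contention only
    """
    placement = {}
    load = defaultdict(int)  # instructions already assigned to each PE

    for instr in instructions:
        def cost(pe):
            # Operand latency: distance from every already-placed producer.
            latency = sum(manhattan(pe, placement[p])
                          for p in producers.get(instr, []))
            # Contention proxy: how crowded this PE already is.
            contention = load[pe]
            return (1 - weight) * latency + weight * contention

        best = min(grid, key=cost)  # ties go to the earliest PE in grid
        placement[instr] = best
        load[best] += 1

    return placement


if __name__ == "__main__":
    grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
    chain = ["a", "b", "c"]
    deps = {"b": ["a"], "c": ["b"]}
    for w in (0.0, 1.0):
        print(w, place(chain, deps, grid, w))
```

With `weight=0.0` a dependent chain collapses onto one PE (no operand latency, maximum contention). With `weight=1.0` the same chain spreads across distinct PEs (no contention, longer operand paths). Settings in between trade one cost against the other, which is the kind of sweet spot the abstract refers to.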

Index Terms

Instruction scheduling for a tiled dataflow architecture
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Data flow architectures
2. Hardware
  1. Communication hardware, interfaces and storage

Published In

ACM SIGARCH Computer Architecture News Volume 34, Issue 5

Proceedings of the 2006 ASPLOS Conference

December 2006

425 pages

ISSN:0163-5964

DOI:10.1145/1168919

ASPLOS XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
October 2006
440 pages
ISBN:1595934510
DOI:10.1145/1168857
General Chair: John Paul Shen, Intel Corp.
Program Chair: Margaret R. Martonosi, Princeton University

Copyright © 2006 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States
