
The WaveScalar architecture

Published: 01 May 2007

Abstract

Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge that conventional superscalar designs will not be able to meet. We present WaveScalar as a scalable alternative to conventional designs. WaveScalar is a dataflow instruction set and execution model designed for scalable, low-complexity/high-performance processors. Unlike previous dataflow machines, WaveScalar can efficiently provide the sequential memory semantics that imperative languages require. To allow programmers to easily express parallelism, WaveScalar supports pthread-style, coarse-grain multithreading and dataflow-style, fine-grain threading. In addition, it permits blending the two styles within an application, or even a single function.
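
To make the execution model concrete, here is a minimal sketch of the dataflow firing rule in Python. It is illustrative only: the class and the instruction names are invented for this example and do not correspond to the actual WaveScalar ISA. The key property it shows is that an instruction executes as soon as all of its operands arrive; no program counter sequences execution.

    class Instruction:
        """One node in a dataflow graph; it fires when all operands arrive."""
        def __init__(self, name, op, num_inputs, consumers):
            self.name = name
            self.op = op                # Python callable standing in for an opcode
            self.num_inputs = num_inputs
            self.consumers = consumers  # list of (instruction, operand slot) pairs
            self.pending = {}           # operand slot -> value received so far

        def receive(self, slot, value, ready):
            self.pending[slot] = value
            if len(self.pending) == self.num_inputs:
                ready.append(self)      # all operands present: ready to fire

    def run(initial_tokens):
        # No program counter: execution is driven entirely by token arrival.
        ready = []
        for inst, slot, value in initial_tokens:
            inst.receive(slot, value, ready)
        while ready:
            inst = ready.pop()
            result = inst.op(*(inst.pending[i] for i in range(inst.num_inputs)))
            inst.pending.clear()
            if not inst.consumers:
                print(inst.name, "->", result)
            for consumer, slot in inst.consumers:
                consumer.receive(slot, result, ready)

    # Dataflow graph for (a + b) * (a - b), with a = 3, b = 4:
    mul = Instruction("mul", lambda x, y: x * y, 2, [])
    add = Instruction("add", lambda x, y: x + y, 2, [(mul, 0)])
    sub = Instruction("sub", lambda x, y: x - y, 2, [(mul, 1)])
    run([(add, 0, 3), (add, 1, 4), (sub, 0, 3), (sub, 1, 4)])   # prints: mul -> -7

Note that add and sub fire independently, in either order, as soon as their inputs are available; mul fires only once both partial results reach it.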
To execute WaveScalar programs, we have designed a scalable, tile-based processor architecture called the WaveCache. As a program executes, the WaveCache maps the program's instructions onto its array of processing elements (PEs). The instructions remain at their processing elements for many invocations, and as the working set of instructions changes, the WaveCache removes unused instructions and maps new ones in their place. The instructions communicate directly with one another over a scalable, hierarchical on-chip interconnect, obviating the need for long wires and broadcast communication.
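
The placement behavior described above can be sketched by modeling each PE as a small cache of resident instructions. The sketch below assumes, purely for illustration, a least-recently-used replacement policy and a per-PE capacity parameter; the WaveCache's actual placement and replacement policies are more sophisticated.

    from collections import OrderedDict

    class ProcessingElement:
        """Models a PE as a small cache of resident instructions: mapped
        instructions stay in place across invocations, and unused ones
        are evicted when the working set changes."""
        def __init__(self, capacity=4):
            self.capacity = capacity
            self.resident = OrderedDict()   # instruction ids, in LRU order

        def map_instruction(self, inst_id):
            """Map an instruction onto this PE; return an evicted id or None."""
            if inst_id in self.resident:
                self.resident.move_to_end(inst_id)   # already mapped: refresh
                return None
            evicted = None
            if len(self.resident) >= self.capacity:
                evicted, _ = self.resident.popitem(last=False)  # drop LRU entry
            self.resident[inst_id] = True
            return evicted

    pe = ProcessingElement(capacity=2)
    pe.map_instruction("add_1")
    pe.map_instruction("mul_2")
    print(pe.map_instruction("ld_3"))   # prints: add_1 (evicted to make room)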
This article presents the WaveScalar instruction set and evaluates a simulated implementation based on current technology. For single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area. For coarse-grain threaded applications, the WaveCache achieves nearly linear speedup with up to 64 threads and can sustain 7-14 multiply-accumulates per cycle on fine-grain threaded versions of well-known kernels. Finally, we apply both styles of threading to equake from Spec2000 and speed it up by 9x compared to the serial version.

Reviews

Joseph M. Arul

The WaveScalar architecture aims to exploit the advantages of dataflow to achieve more parallelism in less silicon area, along with better performance and scalability than superscalar architectures. The program counter, a bottleneck in the von Neumann architecture, has been eliminated, and the natural parallelism of dataflow is combined with fine-grain multithreading to achieve linear speedup across many processing elements. Instructions are ordered according to their execution path and stored in the WaveCache, a tiled design built for the WaveScalar architecture, in which deep speculation occurs naturally. Cache coherence is maintained with a directory-based MESI protocol. One novel idea in this architecture is that the WaveCache loads instructions from memory and assigns them to processing elements, rather than having processing elements request instructions to process. In addition, the WaveScalar architecture does not use a register file for intervening accesses; values flow directly between instructions.

In previous dataflow architectures, even dynamic ones, operand matching was a bottleneck that reduced performance. Here, the authors state that the match-and-dispatch logic is the most complex part of the design. What impact will it have on the WaveScalar architecture? Will it be a bottleneck, as it was in the past? These questions need to be addressed clearly. The WaveCache itself is designed rather well, but the compiler for this architecture is not yet finished, and completing it is tedious work. Will compiled code still achieve linear speedup? That remains to be evaluated; a hand-coded version would naturally be better optimized. It would also be worthwhile to compare WaveScalar's performance with TRIPS, which takes a similar approach with slight variations, and with stream processors, whose clusters of processing elements use local and global registers to exploit the von Neumann model without dataflow concepts. Finally, since WaveScalar includes no register file for intervening accesses, will that choice help or hurt overall performance?

The authors bring dataflow concepts closer to the von Neumann model, with less wire delay and less silicon, and show that the result can outperform superscalar architectures in both single-threaded and multithreaded settings. The paper is rather long; the introductory material on the von Neumann model and dataflow could have been omitted, since anyone reading this paper should already know the basics of both.

Online Computing Reviews Service
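
The instruction ordering the reviewer refers to is WaveScalar's wave-ordered memory, the mechanism that gives dataflow execution sequential memory semantics. The sketch below is a heavily simplified, total-order illustration in Python; the real scheme annotates each memory operation with predecessor and successor sequence numbers within a wave and uses wildcards to handle branches. The point it captures is that requests may arrive from the dataflow fabric out of order, yet are applied to memory strictly in program order.

    class OrderedMemoryInterface:
        """Applies memory requests in sequence order, even when the
        dataflow fabric delivers them out of order."""
        def __init__(self):
            self.memory = {}
            self.next_seq = 0    # next sequence number allowed to issue
            self.waiting = {}    # seq -> (op, addr, value) that arrived early

        def request(self, seq, op, addr, value=None):
            self.waiting[seq] = (op, addr, value)
            results = []
            # Drain every request that is now unblocked, in program order.
            while self.next_seq in self.waiting:
                op, addr, value = self.waiting.pop(self.next_seq)
                if op == "store":
                    self.memory[addr] = value
                else:                               # load
                    results.append(self.memory.get(addr))
                self.next_seq += 1
            return results

    mem = OrderedMemoryInterface()
    mem.request(1, "load", 0x10)               # arrives first, waits for seq 0
    print(mem.request(0, "store", 0x10, 42))   # releases the load: prints [42]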

Published In

ACM Transactions on Computer Systems, Volume 25, Issue 2
May 2007
115 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/1233307
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2007
Published in TOCS Volume 25, Issue 2

Author Tags

  1. WaveScalar
  2. dataflow computing
  3. multithreading

Qualifiers

  • Article
