Efficient compilation for queue size constrained queue processors

Authors: Arquimedes Canedo, Ben A. Abderazek, Masahiro Sowa

Parallel Computing, Volume 35, Issue 4

Pages 213 - 225

https://doi.org/10.1016/j.parco.2008.11.004

Published: 01 April 2009

Abstract

Queue computers use a FIFO data structure for data processing. The essential characteristics of a queue-based architecture are well suited to the demands of embedded systems: a compact instruction set, simple hardware logic, high parallelism, and low power consumption. The size of the queue is an important concern in the design of a realizable embedded queue processor. We introduce the relationship between parallelism, the length of data dependency edges in data flow graphs, and queue utilization requirements. This paper presents a technique that makes the compiler aware of the size of the queue register file so that it can optimize programs to use the available hardware effectively. The compiler examines the data flow graph of each program and partitions it into clusters whenever it exceeds the queue limits of the target architecture. The presented algorithm deals with the two factors that affect the utilization of the queue, namely parallelism and the length of variables' reaching definitions. We analyze how the quality of the generated code is affected for SPEC CINT95 benchmark programs and different queue size configurations. Our results show that, for reasonable queue sizes, the compiler generates code that is comparable to the code generated for infinite resources in terms of instruction count, static execution time, and instruction level parallelism.
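The link between parallelism and queue utilization can be illustrated with a small simulation. The following is a minimal sketch, not the compiler algorithm from the paper: it assumes a hypothetical three-operation instruction set (`ld`, `add`, `sub`, `mul`) in which `ld` enqueues a value at the tail and each binary operation dequeues two operands from the head and enqueues its result. The function name `run_queue_program` and the program encoding are illustrative assumptions.

```python
from collections import deque
import operator

# Hypothetical mini instruction set for illustration only; this is not
# the QueueCore ISA described in the cited papers.
BINARY_OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_queue_program(program):
    """Run a queue program and return (result, peak queue occupancy)."""
    queue = deque()
    peak = 0
    for instr in program:
        opcode = instr[0]
        if opcode == "ld":
            # Loads write their value at the tail of the queue.
            queue.append(instr[1])
        else:
            # Binary operations read two operands from the head and
            # write the result at the tail (FIFO discipline).
            lhs = queue.popleft()
            rhs = queue.popleft()
            queue.append(BINARY_OPS[opcode](lhs, rhs))
        peak = max(peak, len(queue))
    if len(queue) != 1:
        raise ValueError("program must leave exactly one value in the queue")
    return queue[0], peak


# (a + b) * (c - d) with a=1, b=2, c=7, d=4, emitted level by level
# starting from the deepest level of the expression tree.
wide = [("ld", 1), ("ld", 2), ("ld", 7), ("ld", 4),
        ("add",), ("sub",), ("mul",)]

# ((a + b) * c) - d with the same values: a dependency chain with
# little parallelism, also emitted level by level.
narrow = [("ld", 1), ("ld", 2), ("add",), ("ld", 7),
          ("mul",), ("ld", 4), ("sub",)]

print(run_queue_program(wide))    # (9, 4)
print(run_queue_program(narrow))  # (17, 2)
```

Both programs use the same four operands, but the wider expression holds four values in the queue at its peak while the chained one holds at most two. Under these assumptions, a target with a two-entry queue could run the second program as written, whereas the first would need to be split into smaller pieces.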

Index Terms
1. Software and its engineering
  1. Software notations and tools

Published In

Parallel Computing Volume 35, Issue 4

April, 2009

61 pages

ISSN: 0167-8191

Copyright © 2008 Elsevier B.V.

Publisher

Elsevier Science Publishers B. V.

Netherlands
