
On the exploitation of loop-level parallelism in embedded applications

Published: 09 February 2009

Abstract

Advances in silicon technology have enabled increasing support for hardware parallelism in embedded processors. Vector units, multiple processors/cores, multithreading, special-purpose accelerators such as DSPs or cryptographic engines, or a combination of the above have appeared in a number of processors. They serve to address the increasing performance requirements of modern embedded applications. To what extent the available hardware parallelism can be exploited depends directly on the amount of parallelism inherent in the given application and on the congruence between the granularity of hardware and application parallelism. This paper discusses how loop-level parallelism in embedded applications can be exploited in hardware and software. Specifically, it evaluates the efficacy of automatic loop parallelization and the performance potential of different types of parallelism, viz., true thread-level parallelism (TLP), speculative thread-level parallelism, and vector parallelism, when executing loops. Additionally, it discusses the interaction between parallelization and vectorization. Applications from both the industry-standard EEMBC®1 1.1 and EEMBC 2.0 suites and the academic MiBench embedded benchmark suite are analyzed using the Intel®2 C compiler. The results show the performance that can be achieved today on real hardware using a production compiler, provide upper bounds on the performance potential of the different types of thread-level parallelism, and point out a number of issues that need to be addressed to improve performance. The latter include parallelization of libraries such as libc and the design of parallel algorithms to allow maximal exploitation of parallelism. The results also point to the need for developing new benchmark suites more suitable to parallel compilation and execution.
1 Other names and brands may be claimed as the property of others.
2 Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.


Published In

ACM Transactions on Embedded Computing Systems, Volume 8, Issue 2
January 2009
243 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/1457255

Publisher

Association for Computing Machinery

New York, NY, United States


Publication History

Published: 09 February 2009
Accepted: 01 July 2008
Revised: 01 March 2008
Received: 01 June 2007
Published in TECS Volume 8, Issue 2


Author Tags

  1. multi-cores
  2. libraries
  3. multithreading
  4. parallel loops
  5. programming models
  6. system-on-chip (SoC)
  7. thread-level speculation
  8. vectorization

Qualifiers

  • Research-article
  • Research
  • Refereed
