Application-aware snoop filtering for low-power cache coherence in embedded multiprocessors

Authors: Xiangrong Zhou, Peter Petrov

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 13, Issue 1

Article No.: 16, Pages 1 - 25

https://doi.org/10.1145/1297666.1297682

Published: 06 February 2008

Abstract

Maintaining local caches coherently in shared-memory multiprocessors results in significant power consumption. The customization methodology we propose exploits the fact that in embedded systems, important knowledge is available to the system designers regarding memory sharing between tasks. We demonstrate how the snoop-induced cache probings can be significantly reduced by identifying and exploiting in a deterministic way the shared memory regions between the processors. Snoop activity is enabled only for the accesses referring to known shared regions. The hardware support is not only cost efficient, but also software programmable, which allows for reprogrammability and customization across different tasks and applications.
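
The filtering idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware: the abstract does not say how shared regions are encoded, so the base/limit register pairs, the register count `MAX_SHARED_REGIONS`, and all function names below are hypothetical. The model shows the two properties the abstract does state: software programs the known shared regions, and a cache probe is triggered only when a snooped address falls inside one of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number of programmable region registers (assumption). */
#define MAX_SHARED_REGIONS 4

/* One software-programmable register describing a shared memory region. */
typedef struct {
    uint32_t base;   /* first byte address of the region */
    uint32_t limit;  /* last byte address of the region (inclusive) */
    bool     valid;  /* false until software programs this slot */
} region_reg_t;

typedef struct {
    region_reg_t regions[MAX_SHARED_REGIONS];
} snoop_filter_t;

/* Invalidate every slot, e.g. before loading a new task's sharing map. */
void snoop_filter_reset(snoop_filter_t *f)
{
    assert(f != NULL);
    for (size_t i = 0; i < MAX_SHARED_REGIONS; i++) {
        f->regions[i].valid = false;
    }
}

/* Program one slot with a shared region; returns false on bad arguments. */
bool snoop_filter_program(snoop_filter_t *f, size_t slot,
                          uint32_t base, uint32_t limit)
{
    assert(f != NULL);
    if (slot >= MAX_SHARED_REGIONS || base > limit) {
        return false;
    }
    f->regions[slot] = (region_reg_t){ base, limit, true };
    return true;
}

/* Decide whether a snooped bus address needs a local cache probe.
 * Addresses outside every programmed region are treated as private,
 * so the tag lookup (and its energy cost) is skipped. */
bool snoop_filter_should_probe(const snoop_filter_t *f, uint32_t addr)
{
    assert(f != NULL);
    for (size_t i = 0; i < MAX_SHARED_REGIONS; i++) {
        const region_reg_t *r = &f->regions[i];
        if (r->valid && addr >= r->base && addr <= r->limit) {
            return true;
        }
    }
    return false;
}
```

In this model, a task switch would call `snoop_filter_reset` and then reprogram the slots for the new task, which mirrors the reprogrammability the abstract describes. The filter is only safe if every truly shared address is covered by some region, since an uncovered shared access would skip a probe that coherence requires.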

Index Terms

Application-aware snoop filtering for low-power cache coherence in embedded multiprocessors
1. Computer systems organization

Recommendations

The locality-aware adaptive cache coherence protocol
ISCA '13

Next generation multicore applications will process massive amounts of data with significant sharing. Data movement and management impacts memory access latency and consumes power. Therefore, harnessing data locality is of fundamental importance in ...
An efficient cache design for scalable glueless shared-memory multiprocessors
CF '06: Proceedings of the 3rd conference on Computing frontiers

Traditionally, cache coherence in large-scale shared-memory multiprocessors has been ensured by means of a distributed directory structure stored in main memory. In this way, the access to main memory to recover the sharing status of the block is ...
Boosting performance of directory-based cache coherence protocols with coherence bypass at subpage granularity and a novel on-chip page table
CF '16: Proceedings of the ACM International Conference on Computing Frontiers

Chip multiprocessors (CMPs) require effective cache coherence protocols as well as fast virtual-to-physical address translation mechanisms for high performance. Directory-based cache coherence protocols are the state-of-the-art approaches in many-core ...

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 13, Issue 1

January 2008

496 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/1297666

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 06 February 2008

Accepted: 01 July 2007

Revised: 01 May 2007

Received: 01 May 2006

Published in TODAES Volume 13, Issue 1

Qualifiers

Research-article
Research
Refereed

Bibliometrics & Citations

Article Metrics

Total Citations: 6
Total Downloads: 492
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 1

Reflects downloads up to 03 Jan 2025

Cited By

Chen C, Hsia A, Zhan Y, Liu T (2018). Energy-efficient hybrid coherence protocol for multicore processors. Cluster Computing 21:3, 1521-1541. DOI: 10.5555/3287988.3288003. Online publication date: 1-Sep-2018
https://dl.acm.org/doi/10.5555/3287988.3288003

Chen C, Hsia A, Zhan Y, Liu T (2018). Energy-efficient hybrid coherence protocol for multicore processors. Cluster Computing 21:3, 1521-1541. DOI: 10.1007/s10586-018-1947-z. Online publication date: 16-Feb-2018
https://doi.org/10.1007/s10586-018-1947-z

Ranganathan A, Bayrak A, Kluter T, Brisk P, Charbon E, Ienne P (2012). Counting stream registers: An efficient and effective snoop filter architecture. 2012 International Conference on Embedded Computer Systems (SAMOS), 120-127. DOI: 10.1109/SAMOS.2012.6404165. Online publication date: Jul-2012
https://doi.org/10.1109/SAMOS.2012.6404165

Chang K, Liao I, Liao C (2012). Improving performance of multi-core NUCA coherent systems using NoC-assisted mechanisms. The Journal of Supercomputing 62:3, 1318-1337. DOI: 10.1007/s11227-012-0793-7. Online publication date: 9-Jun-2012
https://doi.org/10.1007/s11227-012-0793-7

Yu B, Dong S, Ma Y, Lin T, Wang Y, Chen S, Goto S (2011). Network flow-based simultaneous retiming and slack budgeting for low power design. Proceedings of the 16th Asia and South Pacific Design Automation Conference, 473-478. DOI: 10.5555/1950815.1950913. Online publication date: 25-Jan-2011
https://dl.acm.org/doi/10.5555/1950815.1950913

Chang K, Liao I, Shiu B (2010). Design and implementation of a NoC supporting priority-based communications for many-core SoCs. 2010 International Computer Symposium (ICS2010), 483-488. DOI: 10.1109/COMPSYM.2010.5685465. Online publication date: Dec-2010
https://doi.org/10.1109/COMPSYM.2010.5685465
