skip to main content
research-article

A Locality Optimizer for Loop-dominated Applications Based on Reuse Distance Analysis

Published: 02 September 2020 Publication History

Abstract

Source code optimization can heavily improve software code implementation quality while still being complementary to conventional compilers’ optimizations. Source code analysis tools are very useful in supporting source code optimization. This article discusses MemAssist, a source-level optimization environment for semi-automatic locality optimization of loop-dominated code. MemAssist applies reuse distance analysis and a relevant optimization algorithm to explore the design space. It generates a set of suggestions for locality optimizing loop transformations that reduce data cache miss rate and execution time. MemAssist has been used to optimize a number of applications. Experimental results show that MemAssist leads to cache miss rate reduction at all cache layers, memory accesses reduction by up to 42%, and to a speedup of up to three times. Therefore, MemAssist can be used for efficient early-stage software optimization leading to development effort and time reduction.

References

[1]
George Almási, Cǎlin Caşcaval, and David A. Padua. 2002. Calculating stack distances efficiently. In Proceedings of the Workshop on Memory System Performance (MSP’02). ACM, 37--43.
[2]
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. 2002. Counting distinct elements in a data stream. In Randomization and Approximation Techniques in Computer Science, José D. P. Rolim and Salil Vadhan (Eds.). Lecture Notes in Computer Science, Vol. 2483. Springer Berlin, 1--10.
[3]
Brian T. Bennett and Vincent J. Kruskal. 1975. LRU stack processing. IBM J. Res. Dev. 19, 4 (July 1975), 353--357.
[4]
Erik Berg and Erik Hagersten. 2004. StatCache: A probabilistic approach to efficient and accurate data locality analysis. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’04). 20--27.
[5]
Kristof Beyls and Erik D’Hollander. 2001. Reuse distance as a metric for cache behavior. In Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems. 617--622. Retrieved from https://hdl.handle.net/1854/LU-144051.
[6]
Kristof Beyls and Erik H. D’Hollander. 2006. Discovery of locality-improving refactorings by reuse path analysis. In High Performance Computing and Communications, Michael Gerndt and Dieter Kranzlmüller (Eds.). Lecture Notes in Computer Science, Vol. 4208. Springer Berlin, 220--229.
[7]
Kristof Beyls and Erik H. D’Hollander. 2006. Intermediately executed code is the key to find refactorings that improve temporal data locality. In Proceedings of the 3rd Conference on Computing Frontiers (CF’06). ACM, 373--382.
[8]
Kristof Beyls and Erik H. D’Hollander. 2008. Refactoring intermediately executed code to reduce cache capacity misses. J. Instruct.-Level Parallel. 10 (June 2008). Retrieved from https://www.jilp.org/vol10/index.html.
[9]
Kristof Beyls and Erik H. D’Hollander. 2009. Refactoring for data locality. Computer 42, 2 (Feb. 2009), 62--71.
[10]
Martin Burtscher, Byoung-Do Kim, Jeff Diamond, John McCalpin, Lars Koesterke, and James Browne. 2010. PerfExpert: An easy-to-use performance diagnosis tool for HPC applications. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10). IEEE Computer Society, 1--11.
[11]
Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. 1994. Compiler optimizations for improving data locality. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’94). ACM, 252--262.
[12]
Francky Catthoor, Koen Danckaert, Sven Wuytack, and Nikil D. Dutt. 2001. Code transformations for data transfer and storage exploration preprocessing in multimedia processors. IEEE Des. Test Comput. 18, 3 (May 2001), 70--82.
[13]
Germán Ceballos, Erik Hagersten, and David Black-Schaffer. 2015. StatTask: Reuse distance analysis for task-based applications. In Proceedings of the Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO’15). ACM, 1:1–1:7.
[14]
Arun Chauhan and Chun-Yu Shei. 2010. Static reuse distances for locality-based optimizations in MATLAB. In Proceedings of the 24th ACM International Conference on Supercomputing (ICS’10). ACM, 295--304.
[15]
Huimin Cui, Qing Yi, Jingling Xue, Lei Wang, Yang Yang, and Xiaobing Feng. 2012. A highly parallel reuse distance analysis algorithm on GPUs. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS’12). 1080--1092.
[16]
Eric van der Deijl, Gerco Kanbier, Olivier Temam, and Elana D. Granston. 1997. A cache visualization tool. Computer 30, 7 (July 1997), 71--78.
[17]
Nameirakpam Dhanachandra, Khumanthem Manglem, and Yambem Jina Chanu. 2015. Image segmentation using K -means clustering algorithm and subtractive clustering algorithm. Proced. Comput. Sci. 54 (Jan. 2015), 764--771.
[18]
Grigoris Dimitroulakos, Christakis Lezos, and Konstantinos Masselos. 2012. MEMSCOPT: A source-to-source compiler for dynamic code analysis and loop transformations. In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP’12). 385--386. Retrieved from https://ieeexplore.ieee.org/document/6385418.
[19]
Grigoris Dimitroulakos, Theodoros Lioris, Christakis Lezos, and Konstantinos Masselos. 2012. XMSIM: A tool for early memory hierarchy evaluation. In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP’12). 405--406. Retrieved from https://ieeexplore.ieee.org/document/6385428.
[20]
Jan Edler and Mark D. Hill. 1998. Dinero IV Trace-Driven Uniprocessor Cache Simulator. Retrieved from http://pages.cs.wisc.edu/~markhill/DineroIV/.
[21]
David Eklov and Erik Hagersten. 2010. StatStack: Efficient modeling of LRU caches. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems Software (ISPASS’10). 55--65.
[22]
Xiong Fu, Yu Zhang, and Yiyun Chen. 2006. Data-layout optimization using reuse distance distribution. In Emerging Directions in Embedded and Ubiquitous Computing, Xiaobo Zhou, Oleg Sokolsky, Lu Yan, Eun-Sun Jung, Zili Shao, Yi Mu, Dong Chun Lee, Dae Young Kim, Young-Sik Jeong, and Cheng-Zhong Xu (Eds.). Lecture Notes in Computer Science, Vol. 4097. Springer Berlin, 858--867.
[23]
Jiancong Ge and Ming Ling. 2019. Fast modeling of the L2 cache reuse distance histograms from software traces. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’19). 145--146.
[24]
Phillip B. Gibbons. 2007. Distinct-values estimation over data streams. In Data Stream Management: Processing High-Speed Data Streams. Springer Berlin.
[25]
Kecheng Ji, Ming Ling, Longxing Shi, and Jianping Pan. 2018. An analytical cache performance evaluation framework for embedded out-of-order processors using software characteristics. ACM Trans. Embed. Comput. Syst. 17, 4 (Aug. 2018), 79:1–79:25.
[26]
Yunlian Jiang, Eddy Z. Zhang, Kai Tian, and Xipeng Shen. 2010. Is reuse distance applicable to data locality analysis on chip multiprocessors? In Compiler Construction, Rajiv Gupta (Ed.). Lecture Notes in Computer Science, Vol. 6011. Springer Berlin, 264--282.
[27]
Ken Kennedy and John R. Allen. 2002. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA. Retrieved from https://dl.acm.org/doi/book/10.5555/502981.
[28]
Christakis Lezos, Grigoris Dimitroulakos, Ioannis Latifis, and Konstantinos Masselos. 2016. Automatic generation of code analysis tools: The CastQL approach. In Proceedings of the 1st International Workshop on Real World Domain Specific Languages (RWDSL’16). ACM, 3:1–3:10.
[29]
Christakis Lezos, Grigoris Dimitroulakos, and Konstantinos Masselos. 2015. Reuse distance analysis for locality optimization in loop-dominated applications. In Proceedings of the Design, Automation 8 Test in Europe Conference 8 Exhibition (DATE’15). 1237--1240. Retrieved from https://dl.acm.org/doi/10.5555/2755753.2757099.
[30]
Christakis Lezos, Ioannis Latifis, Grigoris Dimitroulakos, and Konstantinos Masselos. 2016. Compiler-directed data locality optimization in MATLAB. In Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems (SCOPES’16). ACM, 6--9.
[31]
Xu Liu and John Mellor-Crummey. 2013. A data-centric profiler for parallel programs. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’13). ACM, 28:1–28:12.
[32]
Xu Liu and John Mellor-Crummey. 2013. Pinpointing data locality bottlenecks with low overhead. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’13). 183--193.
[33]
Xu Liu, Kamal Sharma, and John Mellor-Crummey. 2014. ArrayTool: A lightweight profiler to guide array regrouping. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). ACM, 405--416.
[34]
Richard L. Mattson, Jan Gecsei, Donald R. Slutz, and Irving L. Traiger. 1970. Evaluation techniques for storage hierarchies. IBM Syst. J. 9, 2 (June 1970), 78--117.
[35]
Steve McConnell. 2004. Code Complete: A Practical Handbook of Software Construction (2nd ed.). Microsoft Press, Redmond, WA. Retrieved from https://dl.acm.org/doi/book/10.5555/1096143.
[36]
Carlos Moreno and Sebastian Fischmeister. 2017. Accurate measurement of small execution times—Getting around measurement errors. IEEE Embed. Syst. Lett. PP, 99 (2017).
[37]
Steffen Mueller. 2010. Your benchmarks suck! Retrieved from http://blogs.perl.org/users/steffen_mueller/2010/09/your-benchmarks-suck.html.
[38]
Nicholas Nethercote. 2004. Dynamic Binary Analysis and Instrumentation. Ph.D. Dissertation. University of Cambridge, Cambridge, UK.
[39]
Qingpeng Niu, James Dinan, Qingda Lu, and Ponnuswamy Sadayappan. 2012. PARDA: A fast parallel reuse distance analysis algorithm. In Proceedings of the IEEE 26th International Parallel Distributed Processing Symposium (IPDPS’12). 1284--1294.
[40]
Frank Olken. 1981. Efficient Methods for Calculating the Success Function of Fixed-Space Replacement Policies. Technical Report LBL-12370. Lawrence Berkeley Lab, CA. Retrieved from https://www.osti.gov/biblio/6051879.
[41]
ParaTools Inc. 2018. ThreadSpotter. Retrieved from http://threadspotter.paratools.com/.
[42]
Miquel Pericas, Kenjiro Taura, and Satoshi Matsuoka. 2014. Scalable analysis of multicore data reuse and sharing. In Proceedings of the 28th ACM International Conference on Supercomputing (ICS’14). ACM, 353--362.
[43]
Changwoo Pyo, Kyung-Woo Lee, Hye-Kyung Han, and Gyungho Lee. 1997. Reference distance as a metric for data locality. In Proceedings of the High Performance Computing on the Information Superhighway Conference (HPC Asia’97). 151--156.
[44]
Boris Quaing, Jie Tao, and Wolfgang Karl. 2005. YACO: A user conducted visualization tool for supporting cache optimization. In High Performance Computing and Communications, Laurence T. Yang, Omer F. Rana, Beniamino Di Martino, and Jack Dongarra (Eds.). Lecture Notes in Computer Science, Vol. 3726. Springer Berlin, 694--703.
[45]
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’13). ACM, 519--530.
[46]
Probir Roy and Xu Liu. 2016. StructSlim: A lightweight profiler to guide structure splitting. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’16). ACM, 36--46.
[47]
Jasmine M. Sabarimuthu and T. G. Venkatesh. 2018. Analytical miss rate calculation of L2 cache from the RD profile of L1 cache. IEEE Trans. Comput. 67, 1 (Jan. 2018), 9--15.
[48]
Mazen A. R. Saghir. 1998. Application-Specific Instruction-Set Architectures for Embedded DSP Applications. Ph.D. Dissertation. University of Toronto, Toronto, Canada.
[49]
Derek L. Schuff, Milind Kulkarni, and Vijay S. Pai. 2010. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10). ACM, 53--64.
[50]
Rathijit Sen and David A. Wood. 2013. Reuse-based online models for caches. In Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’13). ACM, 279--292.
[51]
Xipeng Shen, Jonathan Shaw, Brian Meeker, and Chen Ding. 2007. Locality approximation using time. In Proceedings of the 34th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL’07). ACM, 55--61.
[52]
Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. 2001. The JPEG 2000 still image compression standard. IEEE Sig. Proc. Mag. 18, 5 (Sept. 2001), 36--58.
[53]
Olalekan A. Sopeju, Martin Burtscher, Ashay Rane, and James Browne. 2011. AutoSCOPE: Automatic suggestions for code optimizations using PerfExpert. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’11). 19--25. Retrieved from https://www.worldcomp-proceedings.com/proc/p2011/PDP.htm.
[54]
Rabin A. Sugumar and Santosh G. Abraham. 1993. Efficient simulation of caches under optimal replacement with applications to miss characterization. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’93). ACM, 24--35.
[55]
The Linux Kernel Archives. 2018. perf: Linux profiling with performance counters. Retrieved from https://perf.wiki.kernel.org.
[56]
Yves Tillé. 2006. Sampling Algorithms. Springer New York. Retrieved from https://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-30814-2.
[57]
Jeffrey S. Vitter. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1 (Mar. 1985), 37--57.
[58]
Gregory K. Wallace. 1992. The JPEG still picture compression standard. IEEE Trans. Consum. Electron. 38, 1 (Feb. 1992), xviii–xxxiv.
[59]
Qingsen Wang, Xu Liu, and Milind Chabbi. 2019. Featherlight reuse-distance measurement. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’19). 440--453.
[60]
Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. 1994. SUIF: An infrastructure for research on parallelizing and optimizing compilers. SIGPLAN Not. 29, 12 (Dec. 1994), 31--37.
[61]
Yutao Zhong and Wentao Chang. 2008. Sampling-based program locality approximation. In Proceedings of the 7th International Symposium on Memory Management (ISMM’08). ACM, 91--100.
[62]
Yutao Zhong, Steven G. Dropsho, Xipeng Shen, Ahren Studer, and Chen Ding. 2007. Miss rate prediction across program inputs and cache configurations. IEEE Trans. Comput. 56, 3 (Mar. 2007), 328--343.
[63]
Yutao Zhong, Maksim Orlovich, Xipeng Shen, and Chen Ding. 2004. Array regrouping and structure splitting using whole-program reference affinity. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’04). ACM, 255--266.
[64]
Yutao Zhong, Xipeng Shen, and Chen Ding. 2009. Program locality analysis using reuse distance. ACM Trans. Program. Lang. Syst. 31, 6 (Aug. 2009), 20:1–20:39.

Cited By

View all

Index Terms

  1. A Locality Optimizer for Loop-dominated Applications Based on Reuse Distance Analysis

        Recommendations

        Comments

        Information & Contributors

        Information

        Published In

        cover image ACM Transactions on Design Automation of Electronic Systems
        ACM Transactions on Design Automation of Electronic Systems  Volume 25, Issue 6
        November 2020
        164 pages
        ISSN:1084-4309
        EISSN:1557-7309
        DOI:10.1145/3417499
        Issue’s Table of Contents
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Journal Family

        Publication History

        Published: 02 September 2020
        Online AM: 07 May 2020
        Accepted: 01 April 2020
        Revised: 01 February 2020
        Received: 01 November 2019
        Published in TODAES Volume 25, Issue 6

        Permissions

        Request permissions for this article.

        Check for updates

        Author Tags

        1. MATLAB-to-C
        2. Reuse distance analysis
        3. data locality optimization
        4. source-to-source optimization

        Qualifiers

        • Research-article
        • Research
        • Refereed

        Contributors

        Other Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Downloads (Last 12 months)14
        • Downloads (Last 6 weeks)5
        Reflects downloads up to 03 Jan 2025

        Other Metrics

        Citations

        Cited By

        View all

        View Options

        Login options

        Full Access

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        HTML Format

        View this article in HTML Format.

        HTML Format

        Media

        Figures

        Other

        Tables

        Share

        Share

        Share this Publication link

        Share on social media