Research Article
DOI: 10.1145/3551349.3560423

HyperAST: Enabling Efficient Analysis of Software Histories at Scale

Published: 05 January 2023

Abstract

Abstract Syntax Trees (ASTs) are widely used beyond compilers in many tools that measure and improve code quality, such as code analysis, bug detection, code-metric mining, and refactoring. With the advent of fast software evolution and multi-stage releases, temporal analysis of an AST history is becoming useful for understanding and maintaining code.
However, analyzing thousands of AST versions independently faces scalability issues, mostly combinatorial, in both memory and CPU usage. In this paper, we propose a novel type of AST, called the HyperAST, that enables efficient temporal code analysis over a given software history by (1) leveraging code redundancy through space (between code elements) and time (between versions), and (2) reusing intermediate computation results. We show how the HyperAST can be built incrementally on a set of commits to capture all the ASTs at once in an optimized way. We evaluated the HyperAST on a curated list of large software projects. Compared to Spoon, a state-of-the-art technique, the HyperAST outperforms it by one or more orders of magnitude: from ×6 up to ×8076 in CPU construction time and from ×12 up to ×1159 in memory footprint. While the HyperAST requires at most 2 h 22 min and 7.2 GB for the biggest project, Spoon requires up to 93 h 31 min and 2.2 TB. We further compared the task of finding references of declarations with the HyperAST and Spoon, and measured average precision and recall, with no significant difference in search time.
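The redundancy-sharing idea sketched in the abstract can be illustrated with hash-consing: structurally identical subtrees are interned in a store so that code unchanged between versions (or repeated across files) is stored exactly once. The snippet below is a minimal illustrative sketch, not the authors' implementation; the names `Node` and `NodeStore` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    kind: str            # e.g. "method_declaration", "identifier"
    label: str = ""      # token text for leaves
    children: tuple = () # child Nodes, themselves already deduplicated

class NodeStore:
    """Interns structurally identical subtrees so each is stored once."""
    def __init__(self):
        self._table = {}

    def get_or_insert(self, kind, label="", children=()):
        key = (kind, label, tuple(children))
        if key not in self._table:
            self._table[key] = Node(kind, label, tuple(children))
        return self._table[key]

store = NodeStore()

# Two versions of a file: only one method changes between commits.
util = store.get_or_insert("method", "size")
main_v1 = store.get_or_insert("method", "main")
file_v1 = store.get_or_insert("file", children=(util, main_v1))

main_v2 = store.get_or_insert("method", "main_edited")
file_v2 = store.get_or_insert("file", children=(util, main_v2))

# The unchanged subtree is shared between versions, not copied.
assert file_v1.children[0] is file_v2.children[0]
print(len(store._table))  # 5 unique nodes cover both versions
```

Because unchanged subtrees are shared rather than copied per commit, the cost of storing a new version is proportional to the size of the change, not the size of the codebase, which is what makes whole-history analysis tractable.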



Published In

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
October 2022
2006 pages
ISBN: 9781450394758
DOI: 10.1145/3551349

Publisher

Association for Computing Machinery

New York, NY, United States



Badges

  • Distinguished Paper

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ASE '22

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 91
  • Downloads (last 6 weeks): 13
Reflects downloads up to 25 Dec 2024
