Research Article
DOI: 10.1145/3551349.3560423

HyperAST: Enabling Efficient Analysis of Software Histories at Scale

Published: 05 January 2023

Abstract

Abstract Syntax Trees (ASTs) are widely used beyond compilers in many tools that measure and improve code quality, such as code analysis, bug detection, code-metric mining, and refactoring. With the advent of fast software evolution and multi-stage releases, temporal analysis of an AST history is becoming useful for understanding and maintaining code.
However, analyzing thousands of AST versions independently faces scalability issues, mostly combinatorial, in both memory and CPU usage. In this paper, we propose a novel type of AST, called the HyperAST, that enables efficient temporal code analysis over a given software history by (1) leveraging code redundancy through space (between code elements) and time (between versions), and (2) reusing intermediate computation results. We show how the HyperAST can be built incrementally on a set of commits to capture all the ASTs at once in an optimized way. We evaluated the HyperAST on a curated list of large software projects. Compared to Spoon, a state-of-the-art technique, the HyperAST outperforms it by one or more orders of magnitude: from ×6 up to ×8076 in CPU construction time and from ×12 up to ×1159 in memory footprint. While the HyperAST requires at most 2 h 22 min and 7.2 GB for the biggest project, Spoon requires up to 93 h 31 min and 2.2 TB. We further compared the task of finding references of declarations with the HyperAST and Spoon, and measured average precision and recall, with no significant difference in search time.
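The redundancy-sharing idea sketched in the abstract can be illustrated with hash-consing: structurally identical subtrees are interned in a store so that code unchanged between versions (or repeated across files) is stored exactly once. The snippet below is a minimal illustrative sketch, not the authors' implementation; the names `Node` and `NodeStore` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    kind: str            # e.g. "method_declaration", "identifier"
    label: str = ""      # token text for leaves
    children: tuple = () # child Nodes, themselves already deduplicated

class NodeStore:
    """Interns structurally identical subtrees so each is stored once."""
    def __init__(self):
        self._table = {}

    def get_or_insert(self, kind, label="", children=()):
        key = (kind, label, tuple(children))
        if key not in self._table:
            self._table[key] = Node(kind, label, tuple(children))
        return self._table[key]

store = NodeStore()

# Two versions of a file: only one method changes between commits.
util = store.get_or_insert("method", "size")
main_v1 = store.get_or_insert("method", "main")
file_v1 = store.get_or_insert("file", children=(util, main_v1))

main_v2 = store.get_or_insert("method", "main_edited")
file_v2 = store.get_or_insert("file", children=(util, main_v2))

# The unchanged subtree is shared between versions, not copied.
assert file_v1.children[0] is file_v2.children[0]
print(len(store._table))  # 5 unique nodes cover both versions
```

Because unchanged subtrees are shared rather than copied per commit, the cost of storing a new version is proportional to the size of the change, not the size of the codebase, which is what makes whole-history analysis tractable.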



Published In

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
October 2022
2006 pages
ISBN: 9781450394758
DOI: 10.1145/3551349

Publisher

Association for Computing Machinery

New York, NY, United States



Badges

  • Distinguished Paper

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ASE '22

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 91
  • Downloads (last 6 weeks): 13
Reflects downloads up to 25 Dec 2024
