A Common Language of Software Evolution in Repositories (CLOSER)
Abstract
1. Introduction
- Identify the common features across VCSs and additional information for particular VCSs.
- Define a lossless isomorphically mapped DSL, CLOSER, that can store the common features while retaining any additional metadata.
- Implement parsers for major VCS output formats to and from CLOSER.
- Demonstrate the efficacy of CLOSER through the ability to convert to and from a specific VCS, proving that it is lossless and retains all data required for analysis.
2. Related Work
2.1. Version Control System Offerings
2.2. Uses and Techniques for the Mining of Software Repositories
2.3. Methods for Mining Software Repositories Using Multiple Version Control Systems in Combination
2.4. Opportunities for a Tool Such as CLOSER
3. Design and Methodology of CLOSER
3.1. Methodology
- Analysis of VCS features and differences: to understand what data are contained in different VCS stores, their data fields and formats need to be examined (Section 3.3).
- Application scenario generation: the scenarios in which the approach could be used, namely conversion from a specific format to and from the DSL, will be identified to ensure the implementation can meet these requirements (Section 3.4).
- Data model development: using commonalities (and mapping as appropriate), a set of common VCS features will be expressed in a DSL, including any VCS-specific extensions required for data retention (Section 3.5 and Section 3.6).
- Formal definition, schema mapping, and data structure definition: the DSL and model will be formally defined, and the schema mapped to the VCS elements with any limitations (potentially lossy conversions) identified (Section 3.7 and Section 3.8). The CLOSER data structure for implementation will be defined (Section 3.10).
3.2. Research Questions
- Is it possible to demonstrate lossless conversion from VCS output to the generic CLOSER data format? (Section 4.1).
- Is it possible to demonstrate effective conversion between VCS output formats using CLOSER? (Section 4.2).
- Is it possible to use the CLOSER data format directly for MSR approaches? (Section 4.3).
- Is it possible to use MSR approaches implemented for a different VCS format following conversion with CLOSER? (Section 4.4).
3.3. Identification of VCS Features
- Creation of a folder.
- Deletion of a folder.
- Creation of a new file with contents.
- Update the contents of a file.
- Delete a file.
- Rename a file with the contents remaining unchanged.
- Rename a file with a change in the contents at the same time.
- Change the type of the file.
- Move the file to a folder with the contents remaining unchanged.
- Move the file to a folder with a change in the contents at the same time.
3.4. Usage Scenarios for CLOSER
- Textual logging output of metadata from System X containing a set of features.
- Textual logging output of metadata from System Y containing a set of features, where the features in Y are a proper subset of those in System X. Therefore, System X contains all features represented in System Y, but System Y will not contain all features in System X. An example of this can be seen in the Mercurial output format in Figure 6, which contains all features of the Git repository in Figure 3.
- CLOSER format containing the VCS metadata.
3.5. Common VCS Components for CLOSER
- Each VCS has incremental versions of the code. In Git, this is known as a commit, while in SVN, these are referred to as revisions. Each VCS has a unique ID for these versions, which will be known as UID in CLOSER. For the purposes of CLOSER, we shall use the term revision to refer to a collection of changes to file(s).
- Each VCS identifies the person who made the commit, known as the committer in Git, as well as a timestamp recording when the commit was made. In Git, there is also the concept of an author and the date on which the code was authored. To retain isomorphic mapping and ensure that CLOSER is lossless, it will retain the data for both the committer and the author, along with their respective dates.
- In Git, the user identifier has a name and an email, while in SVN, the user only has a single identifier.
- The VCS outputs identify the files changed and how they were changed in each revision. This is consistent across Git and SVN; however, the coding used to indicate how the files were changed differs across VCS variants. CLOSER will maintain a set of all the variants of file changes, map each VCS's codes onto that set, and share values across VCS options where possible (the cases where this is and is not possible are identified in Section 3.6).
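As a minimal sketch of the shared set of file-change variants described above, the following Python maps the single-letter status codes printed by `git log --name-status` and `svn log -v` onto one common enumeration. The `ChangeType` names and the `to_common` helper are hypothetical and chosen for illustration only; they are not taken from the CLOSER implementation.

```python
from enum import Enum


class ChangeType(Enum):
    """Hypothetical common set of file-change variants (names are assumptions)."""
    ADDED = "added"
    MODIFIED = "modified"
    DELETED = "deleted"
    RENAMED = "renamed"            # Git only
    COPIED = "copied"              # Git only
    TYPE_CHANGED = "type_changed"  # Git only
    REPLACED = "replaced"          # SVN only


# Status letters as printed by `git log --name-status`.
GIT_CODES = {
    "A": ChangeType.ADDED,
    "M": ChangeType.MODIFIED,
    "D": ChangeType.DELETED,
    "R": ChangeType.RENAMED,
    "C": ChangeType.COPIED,
    "T": ChangeType.TYPE_CHANGED,
}

# Action letters as printed by `svn log -v`.
SVN_CODES = {
    "A": ChangeType.ADDED,
    "M": ChangeType.MODIFIED,
    "D": ChangeType.DELETED,
    "R": ChangeType.REPLACED,
}


def to_common(vcs: str, code: str) -> ChangeType:
    """Map a VCS-specific status code to the shared change type."""
    table = {"git": GIT_CODES, "svn": SVN_CODES}[vcs]
    # Git appends a score to R and C codes (e.g. "R087"),
    # so only the first letter selects the change type.
    return table[code[0]]
```

Codes that mean the same thing in both systems (add, modify, delete) share one value, while `R` resolves to a different variant depending on the source VCS, which is why the lookup is keyed on the VCS as well as the letter.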
3.6. VCS-Specific Components for CLOSER
- SVN has a line count included in the output, which is the number of lines changed. While this is not included by any other VCS being considered, it is possible to dynamically calculate this accurately from available data; therefore, this field can be generated in all scenarios mentioned in Section 3.4.
- SVN, by default, only has a single textual identifier for each user making a commit. Upon analysis of production SVN repositories, this textual field is occasionally an email address; however, there is no guarantee of this as a convention for all repositories. As VCSs such as Git and Mercurial have an identifier and email address for each user, there is no direct mapping between these and SVN output. For mapping from SVN to CLOSER, the textual identifier can be stored as both the identifier and email fields required for the other VCSs in a multivalued function. Scenario (3), as mentioned in Section 3.4, allows for lossless bidirectional mapping between SVN and CLOSER despite the mapping from CLOSER to SVN requiring a non-injective surjective function for mapping the two user fields to one in the SVN output. This mapping for Scenarios (1) and (2) is lossless because both fields in CLOSER store the single SVN identifier; therefore, regardless of which CLOSER field is mapped back to SVN in the implementation, the conversion will be accurate with full data retention. Mapping from Git or Mercurial to SVN can be understood as Scenario (3) in Section 3.4, where the conversion to SVN in both cases requires the same non-injective surjective function mapping one of the user identifier fields to the single field present in SVN. This mapping for Scenario (3) does result in data loss because the two fields in CLOSER potentially have distinct values; therefore, regardless of which field is chosen for mapping to the single field in SVN, the other is discarded and those data are lost. For the most appropriate mapping, the “author” data are mapped, and the concept of a separate committer is lost.
- Git having an author and a committer, compared to the other VCSs having a single user for each code commit, presents a similar issue to the SVN single identifier problem above. In this case, CLOSER should retain both the author and committer from Git data so that, for other VCSs with only one user, a multivalued function can be used to map to both fields. Like the issue of SVN users in Scenarios (1) and (2), this case would be lossless, while Scenario (4) would not be, due to limitations on the mapping of users similar to those for the fields of each user role.
- Git has an internal understanding of renaming operations (file movement is regarded as renaming), comprising two locations for the event and a change percentage (if the file has been changed and renamed in the same commit), as seen in Figure 7. A similar event in an SVN repository is shown as a simple deletion of a file and an addition of a new file at the new location. The shared schema should retain the details of the Git event in a single event and retain the old and new locations with the change percentage. However, an issue arises when converting these events to a VCS such as SVN, which has no direct mapping for a rename event with two locations and a change percentage. In mapping from any potential schema to a VCS output format that does not support events with these additional fields, the best option is to map the events as if these changes were made in the target VCS. For example, in SVN this would involve a multivalued function from the rename event in Git to an add and a delete event, discarding the change percentage. How these are mapped between VCSs to and from CLOSER is detailed in Section 3.10.
- All VCSs considered have a defined set of changes that can occur on a file in a given revision; however, the VCSs differ in the number and types of events supported. Any shared schema should support the superset of events across all VCS vendors so that conversion from a VCS output to the shared schema can remain lossless. One issue arises when mapping these events from the shared schema to a particular VCS output format that only supports a subset of events. The solution for this mapping can be found by considering the actions that cause each event to occur. For example, Figure 7 and Figure 8 show how the same action maps to one event in Git and two events in SVN, respectively. Using the user actions as the foundation for mapping from CLOSER to the VCS outputs ensures the accuracy of conversion and provides the most value with regard to the type of file change.
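The two lossy-direction mappings discussed in this list can be sketched in Python as follows. This is an illustrative sketch only: the class and function names are hypothetical, and the choice to emit the identifier (rather than the email) as the single SVN user field is an assumption made here, since the text notes that either field may be chosen.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CloserUser:
    """Hypothetical two-field user record, as needed for Git and Mercurial."""
    identifier: str
    email: str


def user_from_svn(svn_author: str) -> CloserUser:
    # SVN has one textual identifier, so it is stored in both fields.
    return CloserUser(identifier=svn_author, email=svn_author)


def user_to_svn(user: CloserUser) -> Tuple[str, bool]:
    """Return (svn_author, lossless). Emitting the identifier is an assumption."""
    # Lossless only when both fields hold the same value, i.e. the record
    # originally came from a single-identifier VCS such as SVN.
    return user.identifier, user.identifier == user.email


@dataclass(frozen=True)
class Change:
    """Hypothetical file-change event; the 'kind' values are assumptions."""
    kind: str                        # "add", "delete", "modify" or "rename"
    location: str
    new_location: Optional[str] = None
    change_percentage: Optional[int] = None


def change_to_svn(change: Change) -> List[Tuple[str, str]]:
    """Map one event to the SVN (action, path) entries it corresponds to."""
    if change.kind == "rename":
        # SVN has no rename event: emit a delete of the old location and an
        # add of the new one; the change percentage is discarded.
        return [("D", change.location), ("A", change.new_location)]
    code = {"add": "A", "delete": "D", "modify": "M"}[change.kind]
    return [(code, change.location)]
```

For instance, `change_to_svn(Change("rename", "a.txt", "b.txt", 90))` yields a delete of `a.txt` followed by an add of `b.txt`, while `user_to_svn` reports through its second return value whether the email field was discarded with a distinct value.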
3.7. Formal Definition of Features for a Shared Schema
- A user who has made the change.
- A unique ID for the revision.
- A time for the revision creation.
- A textual description of the changes that have occurred at that revision.
- A set of all files changed at that revision, including how the file was changed.
- A unique identifier for the user.
- The location of the file changed.
- The type of change that occurs for the file.
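The features listed above can be read as a small record structure. The Python dataclasses below are a minimal sketch of that reading; every class name, field name, and example value is hypothetical and is not taken from the CLOSER schema itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class User:
    identifier: str  # a unique identifier for the user


@dataclass(frozen=True)
class FileChange:
    location: str     # the location of the file changed
    change_type: str  # the type of change (value set is an assumption)


@dataclass(frozen=True)
class Revision:
    uid: str          # a unique ID for the revision
    user: User        # the user who made the change
    time: datetime    # the time of the revision's creation
    description: str  # textual description of the changes
    changes: Tuple[FileChange, ...] = ()  # all files changed at this revision


# Illustrative values only.
example = Revision(
    uid="r42",
    user=User("jsmith"),
    time=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    description="Add README",
    changes=(FileChange("README.md", "added"),),
)
```

Frozen dataclasses compare by value, which makes it straightforward to check that a revision parsed from one format equals the same revision after a round trip through another.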
3.8. Extension of Formal Definition for VCS-Specific Components
- Author of the revision.
- Committer of the revision.
- The time that the commit was authored.
- The time that the commit was committed.
- Merge details containing the unique identifier of the two revisions combined.
- A new location that can be applied to a file that has been moved (renamed), in which case the existing location field holds the original location. This is required for renaming events in Git, in particular.
- The percentage of change that has occurred within the given file. This is required for renaming events in Git, in particular.
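One way to picture these extensions is as optional fields added to a revision and file-change record. The sketch below is hypothetical throughout (names, field types, and the `from_single_user` helper are assumptions). Copying a single user into both the author and committer fields follows the multivalued mapping described in Section 3.6; treating the single timestamp the same way is an additional assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class ExtendedFileChange:
    location: str                            # original location
    change_type: str
    new_location: Optional[str] = None       # set for moved (renamed) files
    change_percentage: Optional[int] = None  # set for Git rename events


@dataclass(frozen=True)
class ExtendedRevision:
    uid: str
    author: str
    committer: str
    authored_time: datetime
    committed_time: datetime
    description: str
    merge_parents: Optional[Tuple[str, str]] = None  # UIDs of the two merged revisions
    changes: Tuple[ExtendedFileChange, ...] = ()


def from_single_user(uid, user, time, description, changes=()):
    """Build a revision from a VCS that records one user and one timestamp."""
    # The single user fills both roles; reusing the timestamp for both
    # dates is an assumption of this sketch.
    return ExtendedRevision(uid, user, user, time, time,
                            description, None, tuple(changes))


# Illustrative Git-style rename with a change percentage.
rename = ExtendedFileChange("src/old.py", "renamed",
                            new_location="src/new.py", change_percentage=87)
```

Keeping the extension fields optional means a revision from a VCS without renames or merges leaves them as `None` instead of inventing values.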
3.9. CLOSER Data Structure Definition
3.10. CLOSER Data Mappings
- The VCS logging output supports distinct fields for both the user identifier and email address denoted as .
- The VCS logging output supports only the user identifier denoted as .
- The file change for consideration is supported by the target VCS denoted by .
- The file change for consideration is not supported by the target VCS denoted by .
3.11. CLOSER for Mining Software Repositories
4. Results and Discussion
4.1. Validation of Lossless Conversion Between VCS Outputs and CLOSER
4.2. Conversion Between VCS Output Formats
4.3. CLOSER Data Format Used Directly for MSR Techniques
4.4. Proof of Existing MSR Technique Used After Conversion to Different VCS Output Format
4.5. Limitations and Threats to Validity
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Garrity, J.; Cutting, D. A Common Language of Software Evolution in Repositories (CLOSER). Software 2025, 4, 1. https://doi.org/10.3390/software4010001