A Common Language of Software Evolution in Repositories (CLOSER)

Garrity, Jordan; Cutting, David

doi:10.3390/software4010001

by Jordan Garrity ¹ and David Cutting ²,*

¹ iManage, Belfast BT3 9DT, UK

² School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast BT7 1NN, UK

* Author to whom correspondence should be addressed.

Software 2025, 4(1), 1; https://doi.org/10.3390/software4010001

Submission received: 29 October 2024 / Revised: 18 December 2024 / Accepted: 23 December 2024 / Published: 6 January 2025

Abstract

Version Control Systems (VCSs) are used by development teams to manage the collaborative evolution of source code, and there are several widely used industry standard VCSs. In addition to the code files themselves, metadata about the changes made are also recorded by the VCS, and this is often used with analytical tools to provide insight into the software development, a process known as Mining Software Repositories (MSR). MSR tools are numerous but most often limited to one VCS format and are, therefore, restricted in their scope of application; they also require initial effort to implement parsers for verbose textual VCS output. To address this limitation, a domain-specific language (DSL), the Common Language of Software Evolution in Repositories (CLOSER), was defined that abstracted away from specific implementations while isomorphically mapping to the data model of all major VCS formats. By using CLOSER directly as a data model or as an intermediate stage in a conversion, analysis approaches could make use of all major repositories rather than being limited to a single format. The initial barrier to adoption for MSR approaches was also lowered as CLOSER output is a concise, easily machine-readable format. CLOSER was implemented in tooling and tested against a number of common expected use cases, including a direct use in MSR analysis, proving the fidelity of the model and implementation. CLOSER was also successfully used to convert raw output logs from one VCS format to another, offering the possibility that legacy analysis tools could be used on other technologies without any changes being required. In addition to the advantages of a generic model opening all major VCS formats for analysis parsing, the CLOSER format was found to require less code and complete parsing faster than traditional VCS logging outputs.

Keywords: repository data; mining software repositories; MSR; Git; SVN; Mercurial

1. Introduction

Version Control Systems (VCSs) store and manage source code artifacts and generate evolutionary metadata, including historical information about how these artifacts have been developed. VCSs are commonly an integral component of collaborative software development where one of these systems is selected early in the development lifecycle in order to deliver an effective way for cooperation between contributors [1].

Analysis of historic changes to a system, held within the metadata and artifacts of a VCS, is widely used in industry along with other code analysis tools for a variety of purposes, collectively referred to as the Mining of Software Repositories (MSR). Example applications of such analysis tools include the identification of semantic links between software elements, bug identification, and the ability to provide projections for future code changes [2]. Research into mining software repositories has received an increased focus in recent years. It has highlighted different techniques by which the data can be attained and used in order to infer various things about the development of software [2].

Currently, billions of lines of code are stored across the various VCSs [3], where each specific VCS has different data models and output formats. Any particular MSR analysis approach that uses VCS metadata is usually limited to a single VCS without any ability to work between them due to the requirement of a complex parser to be written to understand each specific VCS output. Furthermore, any analysis performed on a specific VCS output may not guarantee consistent results in comparison with another VCS due to the discrepancies in the data and output formats for each VCS requiring parsing re-implementation [4].

Our approach is the definition and implementation of a domain-specific language (DSL) that provides an abstracted data model able to represent all major VCS metadata in a non-lossy yet compatible manner while remaining suitable for generic direct use.

The first step was to investigate the commonly used VCSs and consider potential solutions for creating a DSL. This DSL, the Common Language of Software Evolution in Repositories (CLOSER), will store the metadata produced by different repository technologies in a common yet generic domain-wide form. CLOSER will primarily store all common features found in all VCSs, such as file changes, and will additionally include any VCS-specific data in order to ensure a comprehensive output for all considered VCSs. The key driver behind using a DSL is to allow generic cross-VCS use of metadata through a clearly defined data representation regardless of the input format [5]. A further consideration in DSL definition is the issue of the verbose textual nature of VCS outputs, which can cause difficulty when implementing parsers such that implementations may be complex and inefficient. CLOSER will provide a common format that is both sensibly structured and machine-readable in order to provide for simple and efficient parsing. Consideration from the beginning was also given to the mapping functions required for conversion to and from CLOSER and specific VCS formats.

To implement CLOSER as a functional tool, a parser was created, allowing the conversion from a VCS-specific output to CLOSER and from CLOSER to a VCS-specific output. This implementation will allow MSR approaches to use CLOSER, ensuring approaches can be more easily applied to different VCSs. While new, “greenfield” MSR implementations may use the CLOSER format directly with the metadata having been converted from one or more VCS outputs of interest, existing MSR implementations for a specific VCS output format can use CLOSER as an intermediate step for conversion from a source format to the specific VCS output format required. To demonstrate the applicability of CLOSER for cross-VCS comparison, example implementations were created. This implementation approach followed these steps:

Identify the common features across VCSs and additional information for particular VCSs.
Define a lossless isomorphically mapped DSL, CLOSER, that can store the common features while retaining any additional metadata.
Implement parsers for major VCS output formats to and from CLOSER.
Demonstrate the efficacy of CLOSER through the ability to convert to and from a specific VCS, proving that it is lossless and retains all data required for analysis.

Existing approaches, such as Boa, attempt to address the issue of different data models between VCSs by implementing a DSL and providing the ability to query their unified format in order to find information across repository types [3]. This approach and others tend to make use of the actual source code artifacts (tracked files) and historical changes stored in the repository. While this approach has the merit of having full access to the data, including exactly how files have been changed, it can be very cumbersome and verbose, as well as having potential confidentiality issues when used with third-party tooling [6]. The advantages of having an approach purely based on the logging output metadata from the VCS compared to this existing approach of using tracked files are mainly in streamlining the stored data while also providing the ability to reuse existing tools that are built to target and analyze a particular logging output without the need to re-implement.

Software was written in Java to implement CLOSER, including parsers for the VCS solutions, and is available as an open-source command line tool. Analysis of CLOSER then progressed to prove the core concept that MSR approaches can be performed in a non-VCS-specific manner whereby a specific VCS output can be converted to CLOSER and then analyzed. This analysis proved that the common features shared across VCSs are retained, as well as data specific to a particular VCS output.

This implementation of CLOSER also included parsers to convert from CLOSER to a specific VCS output log to prove that not only was such conversion possible but that no data were lost and demonstrate conversion from a specific VCS to another. These parsers from CLOSER to VCS output formats allow for conversion between two specific VCS output formats using CLOSER as an intermediate stage.

The isomorphic nature of CLOSER is demonstrated by conversion from CLOSER to the original VCS format, indicating that all data and formatting are preserved. Another experiment to prove the validity of CLOSER was the comparison of an MSR technique performed on CLOSER with those implemented for a specific VCS output, where in both cases, the output of the technique was the same, showing that CLOSER retained all VCS data through the conversion.

The remainder of this article is structured as follows: in Section 2, related work is identified and discussed through a literature review, Section 3 details the design and methodology of CLOSER, the results are discussed and analyzed in Section 4, and finally Section 5 is the conclusion and identification of future work.

2. Related Work

In this review of related work, core relevant topics are covered, including the use of VCSs, MSR approaches, and specifically the problem of MSR in a generic non-VCS-specific manner, including existing work into cross-VCS approaches.

2.1. Version Control System Offerings

Version control systems (VCSs) allow for multiple contributors to share and collaborate on artifacts such as source code in a controlled manner [1,2] and are increasingly an essential component of almost all collaborative software solutions, meaning they are widely used in industry [1,7]. VCSs can be put into two main categories based on architecture: centralized and distributed. The most common centralized architecture VCS used is SVN, and the most common distributed architecture VCS used is Git [1].

2.2. Uses and Techniques for the Mining of Software Repositories

The mining of software repositories (MSR) is a technique to draw on historical data and versions held to try and produce insight, analysis, and suggestions to those developing or managing code held in a VCS [2,3], and there have been numerous efforts to understand the evolution of a software product through the mining of its repositories [8]. Repositories store a large volume of diverse forms of data, including source code, bug reports, developer logs, and other artifacts, so understanding the relationships between each element and their associated changes in this verbose, complex output can be challenging [2,8].

Hassan [2] catalogs existing techniques in the MSR field and identifies key continuing issues in the field, such as the complexity of data and difficulty of access, issues that CLOSER can potentially address. There are many useful tools already implemented and available [9,10,11] that make use of repository data, and it is useful to understand how these tools extract and use repository data as this provides insight for potential uses of CLOSER.

Examples of the type of information generated through MSR include [2,12,13], which use the data for bug prediction, and [12,13], which have found relations between the number of lines of code and the complexity of the code, where longer (more complex) code tends to be buggier. There are a good number of tools that attempt to solve the problem of flagging locations in the code that could be potential bugs so that they can be fixed [2] while [14,15] uncover undocumented correctness rules in the code and detect deviations from them.

There are many issues in the effective use of MSR, including simplifying the extraction of high-quality data as well as scaling MSR techniques to very large repositories [2]. Such issues can potentially be alleviated by a universal generic approach such as CLOSER. In the area of data extraction specifically, there are several examples of existing extraction techniques that can be considered industry standards for various problems, so they provide useful insight on which any future extraction approach can be based [16,17].

In the area of scaling to large repositories, a solution using a language model to extract key information from 350 million lines of code has been identified, which highlights how the potential for modeling at scale allows for the identification of trends on large projects and even across project domains [18].

Many other potential opportunities and areas of interest in MSR are identified in [2], such as exploring non-structured data, understanding the limitations of repository data, and simplifying access. Hassan [2] also considers the evaluation and practical benefits of any implemented MSR technique in order to help define how they should be used and how they can be of benefit to anyone performing work in this area.

2.3. Methods for Mining Software Repositories Using Multiple Version Control Systems in Combination

Dyer et al. [3] state that billions of lines of code are stored across various repositories that use different VCSs and cover several fundamental problems that people interested in MSR would need to know or consider in general. Ref. [3] also identifies the major issue that currently, nearly all solutions to answer these questions are limited to one specific VCS format. Dyer et al. [3] further attempt to address the issue of reproducibility in MSR techniques where results are highly dependent on the VCS and technique being used; therefore, there is no way to reliably compare these test metrics across all available repositories in order to validate or make useful comparisons.

The approach outlined in [3] to address the problem was to create a DSL and infrastructure named Boa, which contains the key features shared among the VCSs and then shows how Boa can be used to describe any revision generically regardless of VCS format. Using an identified DSL and the Boa language, the results of any queries are easily reproducible for the same DSL output, a good approach as reproducibility is one of the main issues identified in MSR approaches [4].

The approach in [3] to solve the problems identified has significant merit. By understanding the VCS options available, a sensible DSL was defined to abstract the data from the specific technologies being used, which is effective as it removes the need for individual approaches for each VCS offering as they require a conversion to the DSL to be compatible. However, a limitation of this approach is that Boa requires full access to the repository, including all source code artifacts and stored metadata, rather than just stripping the key data from the VCS metadata log outputs as proposed with CLOSER. Boa then contains all possible data and, therefore, is extremely verbose, requiring significant storage when considering large repositories with potentially large amounts of redundant data depending on the actual needs of the MSR technique being used. This approach also has limitations with confidentiality, as it requires full access to repository contents, including any copyrighted or proprietary code. No provision is made within Boa for the conversion to any VCS output format to use directly in legacy tooling, so it provides just another format of data, albeit one capable of representing sources from different VCSs.

2.4. Opportunities for a Tool Such as CLOSER

The related work reviewed in this section identifies the wide range of MSR approaches, including those focusing on the use of historic metadata, but also the limitation that implemented tooling is restricted to specific VCS formats. The utility of a cross-VCS approach can be seen with the implementation of Boa [3]; however, this limits analysis to the specific output format it generates, again “locking in” tooling. It can, therefore, be seen from existing work that (1) MSR tooling could potentially be used more widely if cross-VCS format conversions can be performed and (2) a generic, easily accessible format for use in metadata MSR work would be beneficial.

3. Design and Methodology of CLOSER

3.1. Methodology

The overall aim of CLOSER is to produce, as much as possible, a format-agnostic approach to the conversion from one VCS format to another and conversion to a universal computationally accessible format (the CLOSER DSL), as shown in Figure 1.

The methodology to be followed to build a full design for implementation will consist of the following steps:

Analysis of VCS features and differences: to understand what data are contained in different VCS stores, their data fields and formats need to be examined (Section 3.3).
Application scenario generation: the scenarios in which the approach could be used (conversion from a specific format to the DSL and back) will be identified to ensure the implementation can meet these requirements (Section 3.4).
Data model development: using commonalities (and mapping as appropriate), a set of common VCS features will be expressed in a DSL, including any VCS-specific extensions required for data retention (Section 3.5 and Section 3.6).
Formal definition, schema mapping, and data structure definition: the DSL and model will be formally defined, and the schema mapped to the VCS elements with any limitations (potentially lossy conversions) identified (Section 3.7 and Section 3.8). The CLOSER data structure for implementation will be defined (Section 3.10).

3.2. Research Questions

In order to frame the development of CLOSER and to validate the approach in meeting its goals and the efficacy it may have in the field of mining software repositories, we pose the following research questions (and show where in the results section of this paper they are answered):

Is it possible to demonstrate lossless conversion from VCS output to the generic CLOSER data format? (Section 4.1).
Is it possible to demonstrate effective conversion between VCS output formats using CLOSER? (Section 4.2).
Is it possible to use the CLOSER data format directly for MSR approaches? (Section 4.3).
Is it possible to use MSR approaches implemented for a different VCS format following conversion with CLOSER? (Section 4.4).

3.3. Identification of VCS Features

After considering the relevant literature and related work (Section 2), the process through which the problem of inconsistencies between VCS outputs could be addressed was defined. To begin, and to allow comparison, an example set of repositories needed to be created in different formats but containing the same series of change events. These data sets allow an initial comparison to be completed and can be used in the creation and validation of the core features of CLOSER, demonstrating cross-VCS compatibility.

To create a tool that has the widest application and usefulness, it is necessary to consider which specific VCS formats to support based on how widely each is used. Identifying the number of repositories for each type is difficult as many are private; however, using Google Trends (Figure 2) it was possible to determine a good estimate of the most popular VCS systems. This estimate is also supported by a developer study from 2015 in which Git, SVN, and Mercurial are seen as the three most popular in preceding years [19]. Based on Google Trends and [19], Git, SVN, and Mercurial are chosen for support in CLOSER.

Relevant documentation for each of the VCS solutions was considered in order to gain an understanding of the possible output formats each provided as well as potential limitations [20,21,22]. This analysis of the documentation indicates that the changes to files for each revision are one of the most complex and differentiated features across the considered VCS systems.

Owing to these differences in concrete implementation, it was necessary to consider the possible change modes abstractly. Potential change types were considered based on distinct user actions that can occur involving a file under version control. In the case of SVN, the repository also stores historical changes in folders as well as files so such changes must also be considered in the abstract model. The following list of user actions encapsulates a generalized set of the potential actions performed and logged within a repository, providing all possible change types regardless of the VCS.

Creation of a folder.
Deletion of a folder.
Creation of a new file with contents.
Update the contents of a file.
Delete a file.
Rename a file with the contents remaining unchanged.
Rename a file with a change in the contents at the same time.
Change the type of the file.
Move the file to a folder with the contents remaining unchanged.
Move the file to a folder with a change in the contents at the same time.
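The user actions listed above can be captured as a simple enumeration. The following is a minimal sketch only; the names `UserAction`, `FOLDER_ACTIONS`, and `is_folder_action` are hypothetical and are not taken from the CLOSER implementation.

```python
from enum import Enum, auto


class UserAction(Enum):
    """Abstract user actions from the list above (illustrative names)."""
    CREATE_FOLDER = auto()
    DELETE_FOLDER = auto()
    CREATE_FILE = auto()
    UPDATE_FILE = auto()
    DELETE_FILE = auto()
    RENAME_FILE = auto()
    RENAME_FILE_WITH_CHANGE = auto()
    CHANGE_FILE_TYPE = auto()
    MOVE_FILE = auto()
    MOVE_FILE_WITH_CHANGE = auto()


# Folder-level history is recorded by SVN, as noted in the text,
# so these actions are kept apart from the file-level ones.
FOLDER_ACTIONS = frozenset({UserAction.CREATE_FOLDER, UserAction.DELETE_FOLDER})


def is_folder_action(action: UserAction) -> bool:
    """Return True when the action applies to a folder rather than a file."""
    return action in FOLDER_ACTIONS
```

Keeping the actions abstract in this way means a scripted repository generator can iterate over the same list for each VCS being compared.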

Using this simple but comprehensive list of user actions in a scripted fashion, repositories containing an identical set of changes were generated in the formats of Git, SVN, and Mercurial. Figure 3, Figure 4 and Figure 5 show the logging output for the same subset of user actions at a single commit, performed on all VCSs. The output for the Git repository, shown in Figure 3, was generated using the command “git log --pretty=fuller --name-status” and contains all features of a change in the Git repository. The output for the SVN repository, shown in Figure 4, was generated using the command “svn log . -v” and contains all features of a change in an SVN repository.
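To illustrate the kind of parsing such textual output requires, the sketch below reads the per-file lines that the `--name-status` option of `git log` produces: a status letter, an optional similarity score for renames and copies, and tab-separated paths. The `FileChange` type and `parse_name_status` function are hypothetical names introduced here, and only this one line type is handled; it is not the parser used by CLOSER.

```python
import re
from typing import NamedTuple, Optional


class FileChange(NamedTuple):
    status: str               # status letter, e.g. "A", "M", "D", "R"
    path: str                 # path after the change
    old_path: Optional[str]   # previous path for renames/copies, else None
    score: Optional[int]      # similarity score for renames/copies, else None


# Status letter, optional digits, a tab, then one or two tab-separated paths.
_STATUS_LINE = re.compile(r"^([A-Z])(\d*)\t(.+)$")


def parse_name_status(line: str) -> Optional[FileChange]:
    """Parse one file line; return None for any other kind of log line."""
    match = _STATUS_LINE.match(line)
    if match is None:
        return None
    status, score, rest = match.groups()
    paths = rest.split("\t")
    if status in ("R", "C") and len(paths) == 2:
        # Renames and copies list the old path first, then the new path.
        return FileChange(status, paths[1], paths[0], int(score) if score else None)
    return FileChange(status, paths[0], None, None)


print(parse_name_status("A\tsrc/new.txt"))
print(parse_name_status("R087\told.txt\tnew.txt"))
```

Header lines such as the author and date fields would need their own handling, which is part of the per-VCS parsing effort the text describes.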

For the Mercurial repository, the most detailed standard log can be generated using the command “hg log --verbose”, which can be seen in Figure 5. As seen in the analysis of this output (and as documented in the Mercurial documentation [22]), the textual logging by default does not contain the type of file change that occurred for each file but rather a flat list. This, therefore, highlights a disparity when comparing this output and the outputs for Git and SVN, as these outputs include the type of file change.

Mercurial does, however, support the ability to generate a more detailed custom textual output, which was identified as the potential solution to the missing information. Through generating custom logging output, it becomes possible to include a subset of possible file changes such as add, update, and delete. These file changes form the core of many MSR techniques and match almost exactly the full set available in SVN [21]. Figure 6 shows the custom logging generated using a custom template for Mercurial with the command “hg log --template "{rev}:\n{author} \n{date|isodate} \n{files} \n{file_adds} \n {file_dels} \n{desc} \n\n"”. This custom output lists the added, deleted, and all changed files on separate lines, so the change type that has occurred can be detected, allowing for effective conversion from Mercurial to CLOSER or, in turn, another format.
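The inference of change types from the custom template output can be sketched as follows: any file that appears in the full file list but in neither the added nor the deleted list is treated as an update. This is a minimal sketch under stated assumptions: one record follows the line order of the template quoted above, file names are space-separated and contain no spaces, and the function name `classify_hg_record` and the returned dictionary keys are hypothetical.

```python
from typing import Dict


def classify_hg_record(record: str) -> Dict[str, object]:
    """Split one templated record into fields and infer per-file change types."""
    lines = record.split("\n")
    files = lines[3].split()          # {files}: every file touched
    added = set(lines[4].split())     # {file_adds}
    deleted = set(lines[5].split())   # {file_dels}

    changes = {}
    for path in files:
        if path in added:
            changes[path] = "add"
        elif path in deleted:
            changes[path] = "delete"
        else:
            # Touched but neither added nor deleted: treated as an update.
            changes[path] = "update"

    return {
        "rev": lines[0].strip().rstrip(":"),
        "author": lines[1].strip(),
        "date": lines[2].strip(),
        "changes": changes,
        "desc": "\n".join(lines[6:]).strip(),
    }


sample = (
    "12:\nalice \n2024-01-01 10:00 +0000 \n"
    "a.txt b.txt c.txt \nb.txt \n c.txt \nExample message \n\n"
)
print(classify_hg_record(sample)["changes"])
```

The sample record is invented for illustration and is not taken from Figure 6.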

While the term “revision” has a specific meaning for certain VCSs, throughout the design of CLOSER the term will refer, in general, to a set of changes at a given time, regardless of VCS output format.

3.4. Usage Scenarios for CLOSER

The design of CLOSER involves the identification of the core data states:

$VCS_{X}$: Textual logging output of metadata from System X containing a set of features.
$VCS_{Y}$: Textual logging output of metadata from System Y containing a set of features, where the features in Y are a proper subset of those in System X. Therefore, System X contains all features represented in System Y, but System Y will not contain all features in System X. An example of this state can be seen where the Mercurial output format in Figure 6 contains all features of the Git repository in Figure 3.
$CLOSER$: CLOSER format containing the VCS metadata.

Specific scenarios (core use cases) that will be used for planning and implementation when determining the mappings and data required for CLOSER are defined in Equations (1)–(4).

$$VCS_{X} \to CLOSER \tag{1}$$

$$VCS_{X} \to CLOSER \to VCS_{X} \tag{2}$$

$$VCS_{X} \to CLOSER \to VCS_{Y} \tag{3}$$

$$VCS_{X} \to CLOSER \to VCS_{Y} \to CLOSER \to VCS_{X} \tag{4}$$

Scenario (1) shows the conversion from a given System output to CLOSER. In this scenario, CLOSER should losslessly retain all VCS metadata, regardless of the specific input system ($VCS_{X}$) being considered. This is the primary use case of CLOSER to allow for the conversion of all VCSs to a common and consistent machine-readable format.

Scenario (2) shows the conversion from a given VCS output to CLOSER and conversion to the same source format again. This scenario should guarantee isomorphic (lossless) bidirectional conversion between the VCS output and CLOSER. While this scenario provides no immediate solution to the issues identified for MSR techniques on raw VCS logging outputs, it provides the ability to prove that upon conversion to CLOSER, all data are retained. This proof will serve to highlight that CLOSER is a viable replacement for traditional parsing techniques as it still has all the data available.

Scenario (3) shows the conversion from a given System X output format to another System Y output format using CLOSER as an intermediate step. In this scenario, as $VCS_{X}$ contains all features in $VCS_{Y}$, the conversion to System Y should be guaranteed to be “complete” in $VCS_{Y}$ (as if the changes had originally been made in System Y and the textual logging output generated directly). This conversion, however, will not be lossless as the additional features supported by System X cannot be retained upon conversion to System Y, so any features not present in System Y will be irretrievably lost.

Scenario (4) shows the conversion from a given System X to another System Y and then back to System X using CLOSER as an intermediate step. This scenario can be understood as a combination of Scenario (3) and its inverse. While this scenario may initially appear convoluted, it is required to help quantify the limitations of CLOSER and the mappings available between VCSs. For example, in reference to Scenario (3), it can be seen that after conversion to System Y, any features not contained in System Y will be lost due to no mapping existing from System X. Therefore, the conversion from System Y back to System X will require inference of the features lost in the initial conversion from CLOSER to System Y and now missing in the data. These missing features will require the best possible inference; however, because the original data have been lost, this inference cannot be guaranteed to match what would have been recorded had the changes originally been made in System X. This scenario, therefore, has data loss and cannot guarantee that the original System X textual input will be identical to the output after conversion to System Y. This is the case for all conversions between VCSs where the intermediate VCS does not contain all features in the original VCS.
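The four scenarios can be illustrated with a toy model in which each format is reduced to the set of features it can express. This is a sketch only: the feature names, the `to_closer` and `from_closer` functions, and the sample record are hypothetical, and no inference of lost features is attempted on the way back to System X.

```python
from typing import Dict, FrozenSet

Record = Dict[str, str]

# Hypothetical feature sets: the features of Y are a proper subset of those of X.
FEATURES_X: FrozenSet[str] = frozenset({"uid", "author", "committer", "message"})
FEATURES_Y: FrozenSet[str] = frozenset({"uid", "author", "message"})


def to_closer(record: Record) -> Record:
    """VCS -> CLOSER: every field is retained (Scenario 1)."""
    return dict(record)


def from_closer(closer: Record, features: FrozenSet[str]) -> Record:
    """CLOSER -> VCS: keep only what the target format can express."""
    return {key: value for key, value in closer.items() if key in features}


original = {"uid": "r1", "author": "alice", "committer": "bob", "message": "fix"}

round_trip_x = from_closer(to_closer(original), FEATURES_X)   # Scenario (2)
as_y = from_closer(to_closer(original), FEATURES_Y)           # Scenario (3)
back_to_x = from_closer(to_closer(as_y), FEATURES_X)          # Scenario (4)

print(round_trip_x == original)   # True: the round trip is lossless
print("committer" in as_y)        # False: Y cannot express this feature
print(back_to_x == original)      # False: the committer was lost via Y
```

In this toy model the loss in Scenario (4) arises entirely from the conversion to System Y, which matches the reasoning above: CLOSER itself retains everything it is given.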

3.5. Common VCS Components for CLOSER

Using the key features highlighted in the respective VCS outputs, the following differential data must be retained in the common language:

Each VCS has incremental versions of the code. In Git, this is known as a commit, while in SVN, these are referred to as revisions. Each VCS has a unique ID for these versions, which will be known as UID in CLOSER. For the purposes of CLOSER, we shall use the term revision to refer to a collection of changes to file(s).
Each VCS has an identified person who made the commit, known as the committer in Git, as well as a timestamp recording when the commit was made. In Git, there is also the concept of an author and the date on which the code was authored. In order to retain isomorphic mapping and ensure that CLOSER is lossless, it will retain the data for both the committer and the author, along with their respective dates.
In Git, the user identifier has a name and an email, while in SVN, the user only has a single identifier.
The VCS outputs identify the files changed and how they were changed in each revision. This is consistent across Git and SVN; however, the coding to indicate how the files were changed differs across VCS variants. CLOSER will maintain a set of all the variants of file changes that will then map from the respective VCS and share values across VCS options where possible (where this is and is not possible are identified in Section 3.6).
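The common components above suggest a record shape along the following lines. This is a hedged sketch rather than the CLOSER data structure itself (which is defined later in the article): the class names, field names, and the inclusion of a commit message field are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class User:
    identifier: str   # user name (Git) or the single identifier (SVN)
    email: str        # email address where the VCS records one


@dataclass(frozen=True)
class FileChange:
    change_type: str  # e.g. "add", "update", "delete"
    path: str


@dataclass
class Revision:
    uid: str                 # commit hash or revision number
    author: User
    author_date: str
    committer: User
    commit_date: str
    message: str = ""        # assumed field; not listed among the components above
    changes: List[FileChange] = field(default_factory=list)


alice = User("alice", "alice@example.com")
rev = Revision(
    uid="r42",
    author=alice,
    author_date="2024-01-01T10:00:00Z",
    committer=alice,
    commit_date="2024-01-01T10:05:00Z",
    changes=[FileChange("add", "src/main.txt")],
)
print(rev.uid, len(rev.changes))
```

Holding both an author and a committer, each with its own date, reflects the decision above to keep the Git-specific distinction so that no data are discarded on conversion.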

3.6. VCS-Specific Components for CLOSER

In addition to the commonality between features of VCSs, the generated textual logging output highlights that there are VCS-specific features included that cannot be mapped to another VCS output format. These have been collated and are considered for representation in the shared schema so that the full data from each VCS can be returned to the source output format without loss and can be inferred where possible for conversion to another format, and are as follows:

SVN has a line count included in the output, which is the number of lines changed. While this is not included by any other VCS being considered, it is possible to dynamically calculate this accurately from available data; therefore, this field can be generated in all scenarios mentioned in Section 3.4.
SVN, by default, only has a single textual identifier for each user making a commit. Upon analysis of production SVN repositories, this textual field is occasionally an email address; however, there is no guarantee of this as a convention for all repositories. As VCSs such as Git and Mercurial have an identifier and email address for each user, there is no direct mapping between these and SVN output. For mapping from SVN to CLOSER, the textual identifier can be stored as both the identifier and email fields required for the other VCSs in a multivalued function. Scenario (2), as mentioned in Section 3.4, allows for lossless bidirectional mapping between SVN and CLOSER despite the mapping from CLOSER to SVN requiring a non-injective surjective function for mapping the two user fields to one in the SVN output. This mapping for Scenarios (1) and (2) is lossless due to both fields in CLOSER storing the single SVN identifier; therefore, regardless of which field from CLOSER is mapped back to SVN in the implementation, the conversion will be accurate with full data retention. Mapping from Git or Mercurial to SVN can be understood as Scenario (3) in Section 3.4, where the conversion to SVN in both cases for the user identifiers requires the same non-injective surjective function mapping one of the fields to the single field present in SVN. This mapping for Scenario (3) does result in data loss due to both fields in CLOSER potentially having distinct values; therefore, regardless of the field chosen for mapping to the single field in SVN, the other is discarded and those data are lost. For the most appropriate mapping, the “author” data are mapped, and the concept of a separate committer is lost.
Git having an author and a committer, compared to the other VCSs having a single user for each code commit, presents a similar issue to the SVN single identifier problem above. In this case, CLOSER should retain both the author and committer from Git data so that for other VCSs with only one user, a multivalued function can be used to map to both fields. Like the issue of SVN users in Scenarios (1) and (2), this case would be lossless, while Scenario (4) would not be due to the limitations of mapping of users similar to those for the fields of each user role.
Git has an internal understanding of renaming operations (file movement is regarded as renaming), comprising two locations for the event and a change percentage (if the file has been changed and renamed in the same commit), as seen in Figure 7. A similar event in an SVN repository is shown as a simple deletion of a file and an addition of a new file at the new location. The shared schema should retain the details of the Git event in a single event and retain the old and new locations with the change percentage. However, the issue comes when converting these events to a VCS such as SVN, which has no direct mapping for a rename event having the two locations and a change percentage. In mapping from any potential schema to a VCS output format that does not support events with these additional fields, the best option is to map the events as if these changes were made in the target VCS. For example, in SVN this would involve a multivalued function from the rename event in Git to an add and a delete event, discarding the change percentage. How these events are mapped between VCSs, to and from CLOSER, is detailed in Section 3.10.
All VCSs considered have a defined set of changes that can occur on a file in a given revision; however, there is a discrepancy between each VCS in terms of the number and types of events supported. Any shared schema should support the superset of events between all VCS vendors so that conversion from a VCS output to the shared schema can remain lossless. One issue comes with the mapping of these events from the potential shared schema to a particular VCS output format that only supports a subset of events. The solution for this mapping can be found by considering the actions that cause each event to occur. For example, Figure 7 and Figure 8 show how the same action maps from one event to two in Git and SVN, respectively. Using this method to map to the VCS outputs from CLOSER, using the user actions as a foundation, ensures the accuracy of conversion and provides the most value with regard to the type of file change.

3.7. Formal Definition of Features for a Shared Schema

Based on the common features identified, a formal definition can be created to encapsulate all commonalities between VCS types, which can then be used as a foundation for the CLOSER schema. This formal definition will also consider the specific data differences present between VCSs to allow conversion, as detailed in Section 3.6.

Using set notation, we can formally define how VCS outputs are presented.

A VCS textual metadata log can be defined as an ordered list C of n distinct changes (revisions) R, shown in Equation (5).

C = [R_{0}, R_{1}, \dots, R_{n - 2}, R_{n - 1}]

(5)

A single revision R in the list can be defined as one or more actions by a particular user affecting one or more files performed at a given time. Formally, a revision in any VCS contains at least:

$R . U :$ A user who has made the change.
$R . I :$ A unique ID for the revision.
$R . T :$ A time for the revision creation.
$R . D :$ A textual description of the changes that have occurred at that revision.
$R . F :$ A set of all files changed at that revision, including how the file was changed.

The ordering of revisions is based on the time $R . T$ of each revision, with the newest changes appearing at the start of the list, shown in Equation (6).

\forall x . 1 \leq x < n . R_{x - 1} . T \leq R_{x} . T

(6)

A user U can be defined as a map of metadata that represents the individual who made the change. Formally, a user in any VCS contains at least:

$U . I :$ A unique identifier for the user.

The uniqueness of this identifier is not necessarily enforced by the VCS vendor; however, in industry and research development it is standard practice that this is observed in order to understand the changes that can be attributed to that individual.

A set of file changes F can be defined as the distinct set of individual file changes X numbering x, shown in Equation (7). Formally, a file change X in any VCS can be represented as a map of metadata for a change on a single file and contains at least:

$X . L :$ The location of the file changed.
$X . T :$ The type of change that occurs for the file.

Changes are distinct based on the location of their respective file $X . L$ such that the location of one file change is distinct within a particular revision R.

F = \{X_{0}, X_{1}, \dots, X_{x - 2}, X_{x - 1}\}

(7)

Each VCS supports a distinct set of file change types $T_{v}$ where a single type t may be shared across vendors being considered, such as the add and delete events.

The types of file changes $T_{C L O S E R}$ that can be represented in the CLOSER data model are the combination of all vendor types. Each vendor has a specific set of file change types $T_{v}$, which is a subset of those present in the CLOSER representation, where V is the set of all VCSs being considered, shown in Equation (8). This relationship of file change types in Equation (8) can be seen with respect to the specific VCS vendors being considered in Figure 9. This shows how the change types differ between vendors while also showing how CLOSER will retain all variants so that this detail can be converted back to the original VCS model.

\forall v . v \in V . T_{v} \subset T_{C L O S E R}

(8)
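The minimal definitions above can be sketched as a small data model. The following Python sketch is illustrative only: the class and field names are hypothetical, and the per-vendor change-type sets are placeholder values rather than the sets shown in Figure 9. It builds $T_{C L O S E R}$ as the union of the vendor sets and checks the containment of Equation (8), reading $\subset$ as "subset or equal" (an assumption of this sketch).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet


@dataclass(frozen=True)
class User:
    identifier: str  # U.I


@dataclass(frozen=True)
class FileChange:
    location: str     # X.L
    change_type: str  # X.T


@dataclass(frozen=True)
class Revision:
    user: User                    # R.U
    revision_id: str              # R.I
    time: datetime                # R.T
    description: str              # R.D
    files: FrozenSet[FileChange]  # R.F


def locations_are_distinct(revision: Revision) -> bool:
    """True when no two file changes in R.F share the same X.L."""
    locations = [change.location for change in revision.files]
    return len(locations) == len(set(locations))


# Placeholder vendor sets T_v (illustrative values, NOT taken from Figure 9).
VENDOR_CHANGE_TYPES = {
    "git": {"add", "modify", "delete", "rename", "copy"},
    "svn": {"add", "modify", "delete", "replace"},
    "mercurial": {"add", "modify", "delete"},
}

# T_CLOSER is the combination (union) of all vendor types.
T_CLOSER = set().union(*VENDOR_CHANGE_TYPES.values())


def satisfies_equation_8() -> bool:
    """Every vendor set T_v is contained in T_CLOSER."""
    return all(types <= T_CLOSER for types in VENDOR_CHANGE_TYPES.values())


example = Revision(
    user=User("alice"),
    revision_id="r42",
    time=datetime(2024, 1, 15, 9, 30),
    description="Rename helper module",
    files=frozenset({FileChange("src/util.py", "delete"),
                     FileChange("src/helpers.py", "add")}),
)
```

Because the union is built from the vendor sets themselves, the containment holds by construction; the check becomes useful when a new vendor set is added without updating the shared set.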

3.8. Extension of Formal Definition for VCS-Specific Components

The formal definitions in Section 3.7 cover the minimum required in order to form a basic VCS logging output; however, further extensions can be considered to incorporate the vendor-specific logging components identified in Section 3.6. While these additional elements will often not be present depending on the VCS considered, it is important to retain this information when possible in order to ensure isomorphic mapping from all VCSs to CLOSER and back to their original format again.

Starting with the definition for a single revision R, there are several extensions to this definition required to fully support all data components of the VCS vendors being considered. As identified in Section 3.6 and Figure 3, Git supports two users for each revision, known as an author and a committer, which relate to two distinct actions taken in the Git VCS during the creation of a revision. This detail can be retained through the extension of the formal definition to consider multiple users in place of the single user originally identified ($R . U$), shown as:

$R . U_{A} :$ Author of the revision.
$R . U_{C} :$ Committer of the revision.

Again, due to Git having an internal process consisting of two distinct actions of committing and authoring changes, a revision can have a distinct time for each. This detail can be retained through the extension of the formal definition to consider multiple times in place of the single time originally defined ($R . T$), shown as:

$R . T_{A} :$ The time that the commit was authored.
$R . T_{C} :$ The time that the commit was committed.

This change can also be understood to impact the ordering of revisions shown in Equation (6) as there are now multiple times for consideration. Based on the Git documentation, the two events must always happen in order; thus, a change must be authored before it can be committed [20]. Therefore, it can be seen that the ordering of revisions in the cases where these times differ can be based on the commit time, so Equation (6) can be refined to Equation (9).

\forall x . 1 \leq x < n . R_{x - 1} . T_{C} \leq R_{x} . T_{C}

(9)

Additionally, Git has the concept of a merge revision, which contains details of the two revisions that have been combined. This is a very common occurrence in Git repositories with multiple contributors, as branching and then re-combining is a common working practice. This re-combination is referred to as merging, and when merging branches, a merge commit is generated. Formally, the merge commit can be defined as:

$R . M :$ Merge details containing the unique identifiers $(R . I)$ of the two revisions combined.

Having expanded the definition for a revision, the next data object to consider is the definition of a user. Git and Mercurial present more data than SVN for a user, which includes both a unique identifier and an email address. In both systems, the email address may not be configured; however, it is necessary that the data structure can retain both fields, and so the formal definition can be expanded to include $U . E$ for the email address of a given user U.

While the current definition for a file change X contains the minimum required features, several other vendor-specific components may be present; therefore, a file change in CLOSER will also contain the following:

$X . N :$ A new location that can be applied to a file that has been moved (renamed) where $X . L$ applies to the original location. This is required for renaming events in Git, in particular.
$X . P :$ The percentage of change that has occurred within the given file. This is required for renaming events in Git, in particular.
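The extended definitions can be collected into one record per revision. The sketch below is a hypothetical Python rendering (none of the class or field names come from the CLOSER implementation): optional fields default to `None` for vendors that do not provide them, and `satisfies_equation_9` checks the ordering condition exactly as Equation (9) is written.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class User:
    identifier: str              # U.I
    email: Optional[str] = None  # U.E (may not be configured)


@dataclass
class FileChange:
    location: str                       # X.L (original location)
    change_type: str                    # X.T
    new_location: Optional[str] = None  # X.N (renames only)
    percentage: Optional[int] = None    # X.P (renames only)


@dataclass
class Revision:
    revision_id: str        # R.I
    author: User            # R.U_A
    committer: User         # R.U_C
    authored_at: datetime   # R.T_A
    committed_at: datetime  # R.T_C
    description: str        # R.D
    files: List[FileChange] = field(default_factory=list)  # R.F
    merge: Optional[Tuple[str, str]] = None  # R.M (IDs of the two merged revisions)


def satisfies_equation_9(log: List[Revision]) -> bool:
    """Check R_{x-1}.T_C <= R_x.T_C for every 1 <= x < n."""
    return all(log[x - 1].committed_at <= log[x].committed_at
               for x in range(1, len(log)))


alice = User("alice", "alice@example.org")
first = Revision("a1", alice, alice,
                 datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 5),
                 "Initial commit",
                 [FileChange("README.md", "add")])
second = Revision("b2", alice, User("bob"),
                  datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 3, 12, 0),
                  "Move docs",
                  [FileChange("README.md", "rename", "docs/README.md", 95)])
```

For a vendor with a single user and a single time, the same record can be filled by copying that value into both the author and committer fields, which mirrors the multivalued mapping discussed in Section 3.6.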

3.9. CLOSER Data Structure Definition

From the analysis of considered VCS outputs and subsequent identification of common elements to form a shared schema, an object-oriented data structure has been defined for CLOSER to encapsulate the required data identified in Section 3.7 and Section 3.8. Figure 10 shows the entity-based structure that can encapsulate the key features required for the VCS outputs in a shared format.

The goal of the CLOSER schema is to combine all shared features in a single universal manner while also including any vendor-specific data, providing the ability for lossless isomorphic mapping from any of the VCS output formats supported. This model is also easier to use for MSR techniques than traditional VCS logging outputs due to the ability to easily serialize and deserialize this structure to machine-readable JSON format. Figure 11 shows an example of this data schema converted to JSON, allowing for it to be easily used for analysis with external tools.
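As a rough illustration of the serialization step, the following Python sketch writes one revision to JSON and reads it back with the standard `json` module. The key names are assumptions made for this example and are not copied from Figure 11; the actual CLOSER schema may name or nest these fields differently.

```python
import json

# Hypothetical CLOSER-style revision; key names are illustrative only.
revision = {
    "id": "a1b2c3",
    "author": {"identifier": "alice", "email": "alice@example.org"},
    "committer": {"identifier": "bob", "email": "bob@example.org"},
    "authorTime": "2024-01-15T09:30:00+00:00",
    "commitTime": "2024-01-15T10:00:00+00:00",
    "description": "Move helper module",
    "merge": None,
    "files": [
        {
            "type": "rename",
            "location": "src/util.py",
            "newLocation": "src/helpers.py",
            "percentage": 92,
        }
    ],
}

# Serialize a one-revision log, then deserialize it again.
text = json.dumps([revision], indent=2, sort_keys=True)
restored = json.loads(text)
```

Since every value here is a JSON-native type (string, number, list, object, or null), the deserialized structure compares equal to the original, which is the property an MSR tool relies on when consuming the format.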

3.10. CLOSER Data Mappings

Having identified and formally defined the data structure required, it is now necessary to define the mappings for conversion to this structure from the VCSs as well as the inverse back to the VCS formats.

For all mappings not explicitly discussed, it can be assumed that a given element a of a supported VCS can be mapped to the corresponding element in CLOSER b using a function F as detailed in Equation (10). For each function F, there exists an inverse function $F^{'}$ with mapping functionality as shown in Equation (11). The lossless isomorphic mapping provided by these functions ensures that Scenario 1, as identified in Equation (1), is possible for all VCSs.

F (a) \to b

(10)

F^{'} (b) \to a

(11)

This mapping is the ideal implementation for all data elements of CLOSER; however, it is not possible in every case due to the differences in features between VCSs as identified in Section 3.6, such as the number of file change types supported.

The first element that this mapping is not possible for is the user data structure U. In this case, there are two variants to consider:

The VCS logging output supports distinct fields for both the user identifier and email address denoted as $V C S_{1}$ .
The VCS logging output supports only the user identifier denoted as $V C S_{2}$ .

For Case 1, where the textual logging metadata are from $V C S_{1}$, the general mapping F is possible due to the two features existing in both the input VCS and CLOSER, allowing for isomorphic bidirectional mapping with lossless conversion.

For Case 2, where the textual logging metadata are from $V C S_{2}$, no bidirectional mapping is possible as CLOSER supports two fields for a user, while the input has only a single field, the user identifier. The best available mapping, therefore, is one that can provide lossless mapping based on Scenario (1) while also providing the best opportunity for accurate inferences required for Scenarios (2)–(4). Therefore, this mapping will use a multivalued function where the single identifier in the raw VCS input is mapped to both fields in CLOSER to allow for a higher-quality conversion to a $V C S_{1}$ format. The mapping from CLOSER to $V C S_{2}$ involves a non-injective surjective function that will copy either of the fields in CLOSER to $V C S_{2}$ based on user-defined options. This mapping is lossless in the case of conversion from $V C S_{2}$ to CLOSER and the inverse: regardless of the field mapped from CLOSER, it will match the original raw $V C S_{2}$ input. The limitation of this mapping for the user fields is seen, however, in the conversion of $V C S_{1}$ output to $V C S_{2}$ output using CLOSER as an intermediate stage (Scenario (2)). This issue is due to the two distinct fields in $V C S_{1}$ being mapped to a single field in $V C S_{2}$, based on user-defined options, and will result in one of the fields being excluded from the output and the contained data being lost. This means that the conversion between VCS types, specifically the conversion from $V C S_{1}$ output to $V C S_{2}$ output, is a one-way mapping, as no perfect lossless inverse mapping function exists so that the conversion back can exactly match the original raw logging input, as detailed in Scenario (4).
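The two user-mapping cases can be illustrated with a short Python sketch. All function names are hypothetical, and the `use_email` flag stands in for the user-defined option mentioned above; the sketch only demonstrates why the single-identifier round trip is lossless while the two-field to one-field conversion is not.

```python
from typing import Optional, Tuple

# A CLOSER user reduced to its two fields: (identifier U.I, email U.E).
CloserUser = Tuple[str, Optional[str]]


def vcs1_to_closer(identifier: str, email: Optional[str]) -> CloserUser:
    """Case 1: both fields exist in the source, so they map one-to-one."""
    return (identifier, email)


def vcs2_to_closer(identifier: str) -> CloserUser:
    """Case 2: the single identifier is copied into both CLOSER fields."""
    return (identifier, identifier)


def closer_to_vcs2(user: CloserUser, use_email: bool = False) -> str:
    """Collapse the two CLOSER fields to one, chosen by a user option."""
    identifier, email = user
    if use_email and email is not None:
        return email
    return identifier


# Round trip for a single-identifier VCS: either option restores the input.
svn_id = "jsmith"
round_trip_id = closer_to_vcs2(vcs2_to_closer(svn_id))
round_trip_email = closer_to_vcs2(vcs2_to_closer(svn_id), use_email=True)

# Two-field source converted to a single-field target: the email is dropped.
git_user = vcs1_to_closer("Jane Smith", "jane@example.org")
flattened = closer_to_vcs2(git_user)
recovered = vcs2_to_closer(flattened)
```

Here `recovered` holds the identifier in both fields, so it no longer equals `git_user`; this is the one-way behavior described for conversion from $V C S_{1}$ to $V C S_{2}$.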

The next data element that the ideal mapping (function F) does not apply to is the conversion of a file change X as a result of the differences in file change types supported as shown formally in Equation (8) and visually represented for the supported VCS types in Figure 9. As with the previous issue with the user data, this issue with this mapping to CLOSER is that the range of the function is smaller than the domain; therefore, any inverse function would have a larger range than the domain, meaning not all possible domain values could be mapped. In order to support the mapping of all change types, the set of file changes in CLOSER ($T_{C L O S E R}$) contains all supported VCS types shown formally in Equation (8). Therefore, in the case of a given VCS, the mapping to CLOSER can be given using an injective function as shown for the supported VCS variants in Figure 12, which also allows for the retention of the additional fields for a file change as identified previously ($X . N$ and $X . P$) where these values are present.

The function for mapping a file change to CLOSER can, therefore, be formally defined as shown in Equation (12), where an input file change from a VCS a can be output as a single CLOSER file change b.

F_{T} (a) \to b

(12)

The next consideration is the mapping of file changes from CLOSER to the target VCS output, which is complex and nuanced. In this case, there are two variants to consider:

The file change for consideration is supported by the target VCS denoted by $V C S_{1}$ .
The file change for consideration is not supported by the target VCS denoted by $V C S_{2}$ .

In Case 1, where the file change for consideration is supported by the target VCS, there exists an inverse function $F_{T}^{'}$ so that the single file change b in CLOSER can be directly mapped to a file change in $V C S_{1}$ output format a as shown in Equation (13).

F_{T}^{'} (b) \to a

(13)

In this case, the mapping can be proven to be lossless, meaning that the conversion to CLOSER retains all details of the file change, through the existence of the inverse function and the additional fields for a file change present in CLOSER as required.

In the second case, where the change event in CLOSER is not directly supported in the VCS output $V C S_{2}$, there exists no direct inverse function by which the file change can be mapped in a one-to-one relation in all cases, as seen in Equation (13). This case, therefore, requires an abstraction based on the events causing the single file change b in CLOSER and determining what this event would be represented as in $V C S_{2}$. In many cases, this does result in a one-to-one mapping through a surjective function; however, in some cases, this may result in a multivalued function where one event in CLOSER may map to multiple events in $V C S_{2}$ as seen in Figure 13.

The general expression for this function $F_{T}^{''}$ can be seen in Equation (14), and most cases will result in a mapping equivalent to $F_{T}^{'}$ as the set will contain a single file change.

F_{T}^{''} (b) \to \{a_{0}, a_{1}, \dots, a_{n - 2}, a_{n - 1}\}

(14)

This mapping for all supported VCS variants shows that the mapping to CLOSER and back to the original VCS output format is lossless in all cases, as all data are retained in this mapping.
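The multivalued case of Equation (14) can be sketched for a rename event converted to a target that has no rename type. This Python example is a simplified assumption of how such a mapping might look: the type names, the `supported` set, and the order of the emitted events are illustrative and are not taken from Figure 13 or the CLOSER tooling.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class CloserChange:
    change_type: str                    # X.T
    location: str                       # X.L
    new_location: Optional[str] = None  # X.N
    percentage: Optional[int] = None    # X.P


@dataclass(frozen=True)
class TargetChange:
    change_type: str
    location: str


def to_target(change: CloserChange, supported: Set[str]) -> List[TargetChange]:
    """Map one CLOSER change to one or more changes in the target VCS."""
    if change.change_type in supported:
        # Directly supported type: a single-element result, as with F_T'.
        return [TargetChange(change.change_type, change.location)]
    if change.change_type == "rename":
        if change.new_location is None:
            raise ValueError("rename event requires a new location")
        # Represent the rename as the actions the target VCS would record;
        # the change percentage has no equivalent and is discarded.
        return [TargetChange("delete", change.location),
                TargetChange("add", change.new_location)]
    raise ValueError(f"no mapping defined for {change.change_type!r}")


# Hypothetical target supporting only three basic change types.
BASIC_TYPES = {"add", "modify", "delete"}

renamed = CloserChange("rename", "src/util.py", "src/helpers.py", 92)
as_target = to_target(renamed, BASIC_TYPES)
modified = to_target(CloserChange("modify", "README.md"), BASIC_TYPES)
```

The reverse direction cannot be recovered from `as_target` alone, since a delete and an add in the same revision need not have originated from a rename.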

3.11. CLOSER for Mining Software Repositories

The primary goal of the CLOSER data structure is the ability to improve the accessibility of repository data for MSR techniques; thus, the time for development, amount of code required, and potential for error in output parsing should all be significantly reduced using CLOSER and the parsing mapping identified.

The reduction in code required and development time can be understood to be due to the machine-readable nature of the object-oriented format: through the use of a standard JSON library, parsing of CLOSER JSON requires only a few lines of code compared to previously working with a verbose, less structured textual format. This allows the user's attention to be quickly focused on the MSR task they are investigating rather than the logging data preprocessing that traditionally was the initial time-consuming challenge of any MSR pipeline. The specific improvements can be seen in terms of the number of revisions considered and how the performance changes compared to traditional raw log parsing.
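To give a sense of scale, the sketch below performs a simple MSR-style query (revisions per author) over a CLOSER-like JSON log using only the Python standard library. The JSON key names and sample data are assumptions made for illustration and do not reproduce the published schema.

```python
import json
from collections import Counter

# Hypothetical CLOSER-style log; key names and values are illustrative only.
closer_json = """
[
  {"id": "c3", "author": {"identifier": "alice"},
   "files": [{"type": "modify", "location": "src/main.py"}]},
  {"id": "b2", "author": {"identifier": "bob"},
   "files": [{"type": "add", "location": "docs/guide.md"}]},
  {"id": "a1", "author": {"identifier": "alice"},
   "files": [{"type": "add", "location": "src/main.py"},
             {"type": "add", "location": "README.md"}]}
]
"""

# Parsing is a single library call; no format-specific text parser is needed.
revisions = json.loads(closer_json)

# Example analysis: number of revisions per author and changes per file.
revisions_per_author = Counter(r["author"]["identifier"] for r in revisions)
changes_per_file = Counter(f["location"] for r in revisions for f in r["files"])
```

The same two analysis lines would apply unchanged to logs originating from any VCS once converted, which is the practical benefit of a shared format.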

Figure 14 highlights an advantage of using CLOSER for implementation of an MSR pipeline: by abstracting the MSR task from the specific output format, it allows repositories stored on multiple different VCS technologies to be analyzed in a consistent and easily accessed fashion. The results of any MSR technique are therefore comparable between the VCSs, due to the lossless conversion to CLOSER from any VCS, while all VCS-specific data remain available for use.

Comparability between VCS outputs in CLOSER format is also ensured through the mappings identified in all cases for conversion to CLOSER, where the same features in the different VCSs are grouped in the same field in CLOSER. This allows the VCSs to be easily compared as part of an MSR technique, as in the raw VCS logging metadata it may not be clear, due to the VCS output structuring, that these fields are the same feature and should be considered together. It also expands the availability of data for a given MSR task, as any VCS output can be converted to CLOSER and, therefore, large-scale analysis can be performed to gain robust inferences that hold across VCS boundaries.

Given that the MSR field has been growing for years, there are already many legacy implementations of complex MSR techniques specific to a single VCS output format. The ability for CLOSER to be converted to all supported VCS formats may increase the accessibility and usability of these techniques. This can be seen in the generalized case that if a legacy MSR technique is implemented for $V C S_{1}$ output, it will only have access to repositories that are stored using $V C S_{1}$, but CLOSER could be used as an intermediate step to convert from an unsupported format $V C S_{2}$ to the supported format $V C S_{1}$ for native consumption. This ability to build on legacy implementations is possible without any need to understand or change the MSR implementation, as no direct access to the functionality of the MSR pipeline is required to use CLOSER for conversion to the required input format, as it can be prepended to the pipeline as can be seen in Figure 15.

Databases are often used to store the preprocessed VCS output data in some object-oriented format so that it can be recovered and used for MSR tasks after all required data have been preprocessed [9]. The CLOSER data structure maps well to databases and could be structured in tables similarly to the structure shown in Figure 10 or even make use of native JSON storage in some DBMS software. Such a facility allows for CLOSER to be used for large-scale cross-repository analysis using MSR techniques, for example, on high-performance computing or cloud computing infrastructure.

One limitation of the CLOSER approach is the available metadata present in a raw VCS logging output. For example, the VCS domain, project name, and description are not included; however, this is not unique to CLOSER but exists for all MSR techniques that use the VCS logging output. A potential expansion of CLOSER to store these data may mitigate this limitation; however, extraction of these data may be difficult in comparison to the generated logs.

4. Results and Discussion

4.1. Validation of Lossless Conversion Between VCS Outputs and CLOSER

This section of the results and analysis demonstrates how the CLOSER language is lossless as conversion to CLOSER and back to the original VCS output type is isomorphic. Based on the formal definition (Section 3.7), it can be understood that all supported VCSs have been considered within the CLOSER data structure such that it can retain all data from the VCS logging output.

Figure 16 shows how revision-specific data are mapped to and retained within the CLOSER data format while retaining the data in full. The shown revision, extracted from a Git repository, contains the base revision data shared across VCS vendors as well as Git-specific data, which is all retained in CLOSER. This, along with the formal definition of the data encapsulated in a revision, allows CLOSER to retain all revision data, including VCS-specific data, so that conversion back to the original input format is possible.

In order to prove that CLOSER is lossless, the conversion from CLOSER back to the original VCS format using the inverse mapping techniques must be possible, allowing the original and the reconstructed output to be compared. This is proven through the formal definition of CLOSER and the mapping used so that the conversion back to the original VCS format involves the reintroduction of the formatting that is standard across all revisions of a particular VCS output.

In our validation, this was proven through the comparison of file hashes of the input and output of Scenario (2). For a given VCS output format ($V C S_{1}$), an input log I was generated containing all types of file changes present in $V C S_{1}$. This input log I was then converted to CLOSER C and then back to $V C S_{1}$ format O. Finally, MD5 hashing was performed on both files (I and O) and the hashes were compared, as the file content, including formatting, must be identical for the MD5 hashes to be equal. For all VCSs considered, the hashes for I and O were equal; therefore, it is proven that CLOSER is lossless. A single example of this validation experiment can be seen in Figure 17.
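The validation procedure can be outlined in Python as follows. This is a sketch under stated assumptions: the converters passed in are stand-ins (the toy pair below simply splits and rejoins lines) and do not represent the CLOSER parsers; only the hash comparison of I and O follows the procedure described above.

```python
import hashlib
import json
from typing import Callable


def md5_hex(data: bytes) -> str:
    """MD5 digest of the raw bytes, formatting included."""
    return hashlib.md5(data).hexdigest()


def round_trip_matches(input_log: bytes,
                       to_closer: Callable[[bytes], str],
                       from_closer: Callable[[str], bytes]) -> bool:
    """Convert I -> C -> O and compare the hashes of I and O."""
    output_log = from_closer(to_closer(input_log))
    return md5_hex(input_log) == md5_hex(output_log)


# Toy stand-in converters (hypothetical): store each log line in a JSON list.
def toy_to_closer(log: bytes) -> str:
    return json.dumps(log.decode("utf-8").split("\n"))


def toy_from_closer(closer: str) -> bytes:
    return "\n".join(json.loads(closer)).encode("utf-8")


# A deliberately lossy inverse that drops leading and trailing whitespace.
def lossy_from_closer(closer: str) -> bytes:
    lines = [line.strip() for line in json.loads(closer)]
    return "\n".join(lines).encode("utf-8")


sample_log = b"r1 | alice | 2024-01-15\n   M /trunk/a.txt\n"
lossless = round_trip_matches(sample_log, toy_to_closer, toy_from_closer)
lossy = round_trip_matches(sample_log, toy_to_closer, lossy_from_closer)
```

Because the comparison is made on raw bytes, even a whitespace difference in the reconstructed log changes the digest, which is why equal hashes indicate that formatting as well as content was restored.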

4.2. Conversion Between VCS Output Formats

Using CLOSER as an intermediate step, this section includes results that show an effective conversion between specific VCS outputs. Such an ability allows for legacy MSR techniques developed against a specific VCS output to be used for different repositories not hosted on that VCS, expanding the range of potential applications.

One limitation of the mapping from $V C S_{1}$ output to $V C S_{2}$ output is the exclusion of VCS-specific data; therefore, the mapping is unidirectional, based on Scenario (4) and demonstrated through Figure 16 and Figure 18. While the $V C S_{2}$ output, in this case, can be converted to the $V C S_{1}$ output format, any $V C S_{1}$ vendor-specific components, such as certain file change types, will not be guaranteed to map back to the original $V C S_{1}$ output exactly as they do not exist in the $V C S_{2}$ output format. Using the same Git revision considered for conversion to CLOSER in Figure 16, this can be converted to another VCS supported by CLOSER, such as SVN, as shown in Figure 18. This mapping, however, is not lossless from the original Git logging output as concepts specific to Git have no equivalent in SVN and cannot be mapped; hence, these data are lost. An example of this can be seen in Figure 18, where various fields encapsulated in the CLOSER representation are lost through conversion to the SVN output format, specifically the file change percentage and user fields. In general, the conversion from $V C S_{1}$ output to $V C S_{2}$ output is an effective conversion between formats as the logging representation in $V C S_{2}$ is an accurate representation of the same changes occurring in a $V C S_{2}$ repository.

Another limitation of this mapping is caused by the multivalued mapping of certain fields, such as the action of renaming in Git causing two events in SVN as seen in Figure 18, an issue due to the discrepancies in how different VCS vendors handle the same user actions. For example, Git takes a more fine-grained approach to types of changes for a file, whereas SVN has a very limited pool for consideration. CLOSER supports the set of all change types for VCSs considered; however, in order to support the conversion between VCS vendors, certain decisions were required on how this mapping could best represent the changes as if the actions had been taken on that technology. Such an abstraction allowed for the definition of the multivalued functions in the case of certain file change types that map to multiple file changes in another VCS; however, this is not bidirectional because the addition and removal of a file with the same name in SVN cannot be guaranteed to be a rename event in Git. The same inference issue can also be seen with the user details, as Git has two fields while SVN stores only one. Multivalued functionality is limited for Scenario (4), as the mapping is unidirectional, where the data lost through mapping a single event in one VCS to multiple events in another cannot be accurately inverted due to the mapping from all VCSs to CLOSER involving a lossless isomorphic mapping.

The efficacy of this conversion and its utility is specifically demonstrated in Section 4.4, where an MSR approach is successfully applied following VCS format conversion with CLOSER.

4.3. CLOSER Data Format Used Directly for MSR Techniques

Using the CLOSER format directly for MSR techniques, many VCS outputs from different repository types can be compared together. In order to demonstrate this, an existing MSR tool, Java Code Relational Analysis (jcRA), was selected and updated through the addition of CLOSER as a supported input format in addition to the VCS-specific formats already implemented [23].

Once CLOSER had been implemented as an input format, a series of experiments were performed to compare the analysis output of direct Git logs and the same data but preprocessed into CLOSER format. The results of these experiments, the generation of similarity matrices, were found to be identical, using MD5 hashing (an example of which is shown in Figure 19), proving that CLOSER retained all the data required. Therefore, CLOSER is as effective for MSR techniques as the raw logging output in terms of the data available while having the advantages of machine-readability, repository availability, and ease of implementation.

The increased machine-readability can be seen in the processing times when reading the CLOSER JSON format compared to the raw VCS logging output. On a small test repository (19 revisions), a speed improvement of 5.3% was seen, while on a large test repository (4450 revisions), an improvement of 11% was seen when using CLOSER compared to Git log parsing in jcRA.

The increased repository availability can be understood due to CLOSER being abstracted from a particular VCS technology. Therefore, any repository, regardless of VCS technologies, can be mapped to CLOSER and used with repositories from another VCS technology.

Through the extension of Java Code Relational Analysis (jcRA) for CLOSER, a comparison was performed between the implementation of parsers for the raw VCS logging output and for CLOSER. In jcRA, the Git parser contains 173 lines of code, and the CLOSER parser has 88 lines of code, excluding empty lines and comments. This shows that the complexity of parsing the CLOSER format is less than that of parsing the raw VCS logging output, reducing the overhead of developing new MSR techniques.

This increase in usability while maintaining the accuracy of results means that CLOSER forms an effective format to use when considering new MSR implementations.

4.4. Proof of Existing MSR Technique Used After Conversion to Different VCS Output Format

This section shows that using CLOSER as an intermediate stage, legacy MSR techniques developed against a specific VCS output can be used for different repositories not hosted on that VCS. This shows the true value in the CLOSER language and parsers as they allow for expanded use of these MSR techniques already implemented for a single format.

Such an ability was demonstrated through the implementation of CLOSER for jcRA where previously Mercurial repositories could not be traditionally used with this technique; however, through conversion to CLOSER, comparisons can now be made between Mercurial repositories and other vendors or combined with other repositories in order to determine global trends. It also provides an effective method for retention of legacy logging data after migration of a repository, as previously these data would have been lost through migration, and they can now be retained and converted to the new vendor format for processing.

4.5. Limitations and Threats to Validity

While it seems clear that CLOSER addresses the questions set in Section 3.2, it is important to consider key limitations and threats to validity.

Internal threats to validity: The most obvious limitation is that lossless conversion between all formats is not possible, so a full use case of all MSR approaches being used generically cannot be achieved; however, the conversion to the CLOSER data format is lossless in all cases and presents an opportunity to implement for a single format to support multiple-source repository formats.

External threats to validity: Changes in the format and data storage of VCSs would inevitably require a redesign of the mapping and a rewrite of the parsing components, but this is a normal part of any software lifecycle and does not necessarily detract from the utility of CLOSER today.

5. Conclusions and Future Work

MSR approaches typically face the challenge of having to implement parsers for specific VCS output formats, which are verbose and inconsistently formatted. Commonly, approaches will implement specific parsers, enabling them to work with just one VCS system. CLOSER provides a realistic and working solution to these issues.

Through the definition and implementation of a generic data model that can cope with all types of data held in the three most widely used VCS systems, repository data can be output in a standardized, highly machine-readable format irrespective of its original source. This functionality allows for easier MSR pipeline implementation, increasing productivity by removing the need to write text parsers, the usual first time-consuming step of any MSR approach. As CLOSER provides the ability for conversion between VCS output formats, using CLOSER as an intermediate stage, this also allows unmodified legacy MSR analysis to be run on previously unavailable repositories, increasing the range and efficacy of existing approaches. In turn, this means CLOSER greatly expands the available repositories that can be used for MSR techniques, providing the opportunity to fairly compare VCS logging formats in order to determine specific trends across VCS boundaries, which until now has been lacking in many MSR approaches that target a specific system.

The inherent limitations in output format differences have been addressed through carefully defined mappings between CLOSER and VCS formats to optimize the retention of data and the formatting of changes as if they had originally occurred in a given VCS, allowing the exact hashed replica of an input file to be re-created after conversion to and back from CLOSER.

In conclusion, the design and implementation of CLOSER, as detailed in this paper and available for use under a free, open-source license [24], have successfully addressed the primary open issues in MSR relating to easy access to repository data and the application of approaches to repositories stored in different VCS formats.

Development of both the approach and the tooling will actively continue. One limitation of VCS logging outputs, in general, is the lack of additional repository information. This is an area of potential future development that could further increase the usability of CLOSER through the retention of these data in the generic format as well. Work is also underway to provide accessible wrappers around the CLOSER implementation to allow, for example, a web service for dynamic conversion or an integrated stateful database for structured querying of converted data. Further future work will include the validation and evaluation of CLOSER by other MSR researchers to determine its efficacy for MSR approaches.

CLOSER is freely available at https://closer-evolution.github.io/closer/ (accessed on 15 December 2024).

Author Contributions

D.C. defined the initial problem. J.G. performed the analysis, definition of the model and initial implementation of the system as well as the validation experiments. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data and code are freely available at https://closer-evolution.github.io/closer/ (accessed on 15 December 2024).

Conflicts of Interest

At the time of conducting this work, Jordan Garrity was a student at Queen’s University Belfast under the supervision of David Cutting. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Lanubile, F.; Ebert, C.; Prikladnicki, R.; Vizcaíno, A. Collaboration Tools for Global Software Engineering. IEEE Softw. 2010, 27, 52–55.
2. Hassan, A.E. The road ahead for Mining Software Repositories. In Proceedings of the 2008 Frontiers of Software Maintenance, Beijing, China, 28 September–4 October 2008; pp. 48–57.
3. Dyer, R.; Nguyen, H.A.; Rajan, H.; Nguyen, T.N. Boa: A Language and Infrastructure for Analyzing Ultra-Large-Scale Software Repositories. In Proceedings of the 2013 International Conference on Software Engineering, ICSE’13, San Francisco, CA, USA, 18–26 May 2013; IEEE Press: New York, NY, USA, 2013; pp. 422–431.
4. González-Barahona, J.M.; Robles, G. On the reproducibility of empirical software engineering studies based on data retrieved from development repositories. Empir. Softw. Eng. 2012, 17, 75–89.
5. Fowler, M.; Parsons, R. Domain-Specific Languages, 1st ed.; Pearson Education: London, UK, 2010.
6. Scheidgen, M.; Smidt, M.; Fischer, J. Creating and Analyzing Source Code Repository Models. In Proceedings of the 5th International Conference on Model-Driven Engineering and Software Development; SCITEPRESS-Science and Technology Publications, Lda: Berlin, Germany, 2017; pp. 329–336.
7. Spinellis, D. Version control systems. IEEE Softw. 2005, 22, 108–109.
8. Kagdi, H.; Collard, M.L.; Maletic, J.I. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. J. Softw. Maint. Evol. Res. Pract. 2007, 19, 77–131.
9. Trautsch, A.; Trautsch, F.; Herbold, S.; Ledel, B.; Grabowski, J. The smartshark ecosystem for software repository mining. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion Proceedings, Seoul, Republic of Korea, 27 June–19 July 2020; pp. 25–28.
10. Cubranic, D.; Murphy, G.C.; Singer, J.; Booth, K.S. Hipikat: A project memory for software development. IEEE Trans. Softw. Eng. 2005, 31, 446–465.
11. Hassan, A.E.; Holt, R.C. Using development history sticky notes to understand software architecture. In Proceedings of the 12th IEEE International Workshop on Program Comprehension, Bari, Italy, 26 June 2004; pp. 183–192.
12. Graves, T.L.; Karr, A.F.; Marron, J.S.; Siy, H. Predicting fault incidence using software change history. IEEE Trans. Softw. Eng. 2000, 26, 653–661.
13. Herraiz, I.; Gonzalez-Barahona, J.M.; Robles, G. Towards a Theoretical Model for Software Growth. In Proceedings of the Fourth International Workshop on Mining Software Repositories (MSR’07:ICSE Workshops 2007), Minneapolis, MN, USA, 20–26 May 2007; p. 21.
14. Engler, D.; Chen, D.Y.; Hallem, S.; Chou, A.; Chelf, B. Bugs as deviant behavior: A general approach to inferring errors in systems code. ACM SIGOPS Oper. Syst. Rev. 2001, 35, 57–72.
15. Li, Z.; Zhou, Y. PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code. ACM SIGSOFT Softw. Eng. Notes 2005, 30, 306–315.
16. Hassan, A.E. Mining Software Repositories to Assist Developers and Support Managers. In Proceedings of the 2006 22nd IEEE International Conference on Software Maintenance, Philadelphia, PA, USA, 24–27 September 2006; pp. 339–342.
17. Zimmermann, T.; Weißgerber, P. Preprocessing CVS Data for Fine-Grained Analysis. Proc. MSR 2004, 4, 2–6.
18. Allamanis, M.; Sutton, C. Mining source code repositories at massive scale using language modeling. In Proceedings of the 2013 10th Working Conference on Mining Software Repositories (MSR), San Francisco, CA, USA, 18–19 May 2013; pp. 207–216.
19. Stack Overflow. 2015. Available online: https://insights.stackoverflow.com/survey/2015 (accessed on 28 March 2024).
20. Git Documentation. 2018. Available online: https://git-scm.com/doc (accessed on 19 April 2024).
21. Fitzpatrick, B.W.; Pilato, C.M.; Collins-Sussman, B. Version Control with Subversion; O’Reilly Media: Sebastopol, CA, USA, 2004.
22. O’Sullivan, B. Mercurial: The Definitive Guide; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009.
23. Cutting, D. Enhancing Legacy Software System Analysis by Combining Behavioural and Semantic Information Sources. Ph.D. Thesis, University of East Anglia, Norwich, UK, 2016.
24. Garrity, J. CLOSER Source Code. 2021. Available online: https://github.com/closer-evolution/closer (accessed on 12 October 2024).

Figure 1. High-level overview of the proposed CLOSER input and conversion process.

Figure 2. Google Trends results showing the relative popularities of the major VCS vendors.

Figure 3. Illustration of logging output generated for a Git repository containing all the Git change features, generated with “git log --pretty=fuller --name-status”.

Figure 4. Illustration of logging output generated for an SVN repository containing all the SVN change features, generated with “svn log . -v”.

Figure 5. Illustration of default verbose logging output generated for a Mercurial repository showing the most detailed change logging, generated with “hg log --verbose”.

Figure 6. Illustration of custom logging output generated for a Mercurial repository including additional data.

Figure 7. Example of a Git textual output log with a file rename, containing the old and new locations and a change percentage.

Figure 8. Example of an SVN textual output log with a file rename, containing the old and new locations as separate events for adding and deleting.

Figure 9. Venn diagram showing the types of file changes and how they differ between VCSs.

Figure 10. UML diagram describing the object-oriented schema defined for CLOSER. The field notation in [square brackets] denotes the formal definition element each field represents (from Section 3.7 and Section 3.8).

Figure 11. Example of the CLOSER JSON format.

Figure 12. Mapping definition of file change types from Git, SVN, and Mercurial to CLOSER.

Figure 13. Mapping definition of file change types from CLOSER to Git, SVN and Mercurial.

Figure 14. Flow chart for mining software repository task implemented using CLOSER to allow for multiple VCS input types without the need for specific parsing for each.

Figure 15. Flow chart for using CLOSER to build on legacy mining software repository tasks.

Figure 16. Visual representation of how all the revision-specific data from a Git revision is retained in the CLOSER format (the colors identify the mapped features between both formats).

Figure 17. Results of MD5 hashing of raw Git logging outputs for the input and output states of Scenario (2).

Figure 18. Visual representation of how an SVN revision can be fully reconstructed from the data stored in the CLOSER format. These data are mapped from a Git revision shown in Figure 16.

Figure 19. Results of MD5 hashing jcRA outputs, performed on raw Git logging output and that same output in CLOSER format.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

