1 Introduction
Software vulnerabilities represent serious threats to software dependability, allowing malicious users to compromise the software's confidentiality, integrity, or availability [
39,
69]. Vulnerabilities require specific methods and techniques to be detected [
7,
62] and removed [
18] promptly to avoid undesired consequences [
32]. However, not all vulnerabilities are the same, as their exploitation may require different expertise and may have variable consequences. Since the number of newly discovered vulnerabilities increases rapidly [
38], both vendors (i.e., owners of vulnerable products) and clients are interested in obtaining reliable feedback on the risk that a vulnerability may be exploited. The availability of such feedback is useful for various tasks, including prioritizing security verification [
36,
81] and identifying suitable preventive actions to reduce the risk of an attack [
30], such as the replacement of vulnerable constructs or APIs [
47,
55].
Researchers have envisioned novel methods to estimate the dangerousness of newly discovered vulnerabilities. The most basic mechanisms are driven by the
CVSS (Common Vulnerability Scoring System) “Base” score assigned. In particular, a vulnerability receiving a higher CVSS
“Base” score is deemed more severe than others. To a certain extent, it is considered to have a higher chance of being exploited soon. Unfortunately, such an assumption does not always hold: a higher severity score does not necessarily imply a more likely exploitability, nor do lower scores denote less risk of being exploited [
1]. For instance, this is the case of the infamous
Heartbleed bug (CVE-2014-0160), whose CVSS 2.0
“Base” score was 5.0 (translating into medium risk), far lower than other vulnerabilities that were never observed to be exploited in the wild [
32]. The upgraded version of CVSS, i.e., version 3.0, added a number of new metrics to capture additional aspects that were neglected in the previous version and corrected some inaccuracies. The new and improved
“Base” score formula could address scenarios like the one observed for
Heartbleed. Yet, this new version still cannot be used as a proxy for the exploitability risk. Indeed, CVE-2014-6049 received a CVSS 3.0
“Base” score of 2.7 (i.e., low risk), but was exploited less than a week after its disclosure. Such a score was even lower than the CVSS 2.0
“Base” score, i.e., 5.5. Therefore, it is unfortunate to note that CVSS 3.0 does not solve all the issues affecting version 2.0. Moreover, there is no significant correlation between CVSS
“Base” score and the exploitability risk of a vulnerability. This alarming lack of correlation also happens for the more specific
“Exploitability” and
“Impact” sub-scores—i.e., the partial metrics required to compute the final
“Base” score. Further detail about the correlation between the CVSS metrics and the risk of exploitability is available in the online appendix of this paper [
49]. In summary, the values of the CVSS metrics assigned to the vulnerabilities affecting a system cannot provide a useful estimation of the probability of the system being attacked.
An alternative approach to gain insights into the risks associated with a vulnerability consists of adopting various strategies based on machine [
14,
17,
85] and deep [
45] learning models. The goal is to predict whether a new vulnerability will be exploited, either labeling it as
“likely exploitable” or estimating a probability of exploitation [
51] using a number of predictors from different data sources. In this respect, researchers have been using many sources of information connected to a vulnerability, ranging from its brief description contained in the
CVE (Common Vulnerabilities and Exposures) record—i.e., a data structure containing all the information linked to a disclosed vulnerability—to online mentions from social media or the dark web [
2,
85]. The vast majority of the proposed models rely on the complete CVE information that was obtainable when the datasets used for the experiments were built, hence also including data that became available days or weeks after the official disclosure of a new vulnerability—for instance, CVE-2020-0583 received the CVSS
“Base” score only 10 days after the disclosure. Indeed, as soon as a new CVE is added to the official database, it is only provided with a short description and at least one public reference, as required by CVE [
35]. Thus, all the CVSS scores, the appropriate
CWE (Common Weakness Enumeration) and
CPE (Common Platform Enumeration) that have been leveraged so far for exploitability prediction tasks cannot be used
before the results of the in-depth analysis made by security experts are available. During the period between the vulnerability disclosure and the experts' analysis—which can last up to about two months—the existing prediction models cannot obtain these data from anywhere, and therefore are inoperable in practice. Developers are hence left disoriented, without any estimate of the risk associated with the new vulnerability. However, the phase immediately following the disclosure of a new vulnerability is the most alarming, as practitioners must take countermeasures as soon as possible; therefore, even a small hint of its possible exploitability would be beneficial to dedicate the right effort to its resolution. We recognize the need for an exploitability analysis of software vulnerabilities to provide developers and security experts with preliminary information that could be used to assess the dangerousness of a newly disclosed vulnerability and immediately take appropriate preventive actions to limit the potential harm of an attack.
In this work, we aim to investigate the effectiveness of early exploitability prediction models that exclusively rely on what we refer to as “early data”, i.e., the data already available in the CVE record of a just-disclosed software vulnerability, to determine whether it will be exploited in the future. An early prediction model leverages only those pieces of information that were already published before the disclosure date, i.e., the short description and the referenced online discussions, e.g., mailing lists and security advisories, and in no way relies on further analyses performed by security experts or additional data to be retrieved from the vulnerable system. Our goal is to experiment with early exploitability prediction modeling to assess whether and to what extent the dangerousness of a vulnerability can be estimated with sufficient confidence as soon as the vulnerability is disclosed for the first time.
To achieve our goal, we first collect all known vulnerabilities and exploits in the National Vulnerability Database(NVD) and the Exploit Database(EDB), respectively, at the time we started this study, and enrich them with the data coming from the online discussions mentioning them. Such data are joined in seven combinations to build the text corpus from which the prediction models extract the textual features. Then, we experiment with a total of 72 models, made from the combination of six different machine learning (ML) algorithms, three data balancing settings, and four different ways to encode unstructured text into features, and we investigate the employment of five pre-trained Large Language Models (LLMs), to determine which is the best solution to employ for this kind of task.
In addition, we evaluate all these models in a
realistic scenario, i.e., simulating a real software production environment to investigate whether they can be effective in practice. To do this, we validate the models pretending to build and deploy them at different points in time, i.e.,
reference dates, and we assess how their performance changes with the evolution of the software history. In particular, we apply a
time-aware validation mechanism, sorting the full dataset of CVEs by disclosure date, and splitting it at the reference dates. At each round of validation, we use all the data before the reference date to build the dataset fold for the round—this simulates the models’ deployment scenario in which all the information available from the past is leveraged for the learning, and the base of knowledge grows over time. To evaluate the models, we split each fold into training and test sets, ensuring that the vulnerabilities in the test set are published after those falling into the training set. Then, we adopt a data labeling strategy similar to the one explained by Jimenez et al. [
54], i.e., we mark the training instances as
“exploitable” only if they were exploited before the train-test split date, and we mark the test instances as
“exploitable” if they were exploited before the reference date. However, we recognize that not all the information collected from the past is always reliable. In fact, since vulnerabilities are exploited over time, the CVEs published close to the reference date that were not exploited yet cannot be confidently labeled as
“neutral”, as not enough time has elapsed for the first exploits to arise. Therefore, we further remove from the dataset fold those vulnerabilities falling into such an “uncertainty window” with no associated exploit yet. This data labeling strategy—detailed in Section
3—aims to provide a more realistic evaluation of the early exploitability prediction models.
The results show that the text from online discussions, particularly from SecurityFocus, can significantly boost the effectiveness of prediction models that leverage only the initial CVE description. All traditional ML models reached their best performance with vulnerability data prior to 2010. Oversampling the training data generally benefits all models, which achieve their best performance when the features are weighted by their frequency (i.e., Term Frequency). The classifier with the best trade-off is Logistic Regression, reaching a weighted F-measure of 0.49 and a weighted MCC of 0.36. The most precise classifier is Random Forest, while the one with the highest recall is KNN. The pre-trained LLMs used as-is failed to perform well, behaving like constant classifiers that always predict “neutral”.
Based on the results obtained, we envision a set of research directions that aim to (i) improve the quality of exploitability prediction models with particular attention to data quality—i.e., the ground truth choice, the feature representation, and the like—and (ii) integrate exploitability prediction models into existing vulnerability assessment pipelines to better support security analysts. The experimented prediction models only assess the exploitability of publicly disclosed vulnerabilities, without addressing undisclosed vulnerabilities due to the inaccessibility of the information that enables the prediction [
13]. Our key contributions can be summarized as follows:
(i)
An evaluation method to assess the effectiveness of early exploitability prediction models that exclusively rely on the data available in a CVE record at the time of its public disclosure;
(ii)
A data cleaning strategy to remove those data instances that cannot be labeled with sufficient confidence because not enough time has passed since the vulnerability disclosure date.
(iii)
A data collection procedure to mine and aggregate online discussion data referenced by the external links in the CVE records, resulting in a novel dataset that researchers can use for further analyses.
(iv)
An empirical comparison of how different configurations of an ML pipeline can influence the performance of an early exploitability prediction model, involving 72 traditional learning configurations—built from six learning algorithms, three balancing settings, and four feature representation techniques—as well as five pre-trained LLMs used as-is.
(v)
A
comprehensive dataset—which we publicly release [
49]—containing information about disclosed vulnerabilities (CVEs), the linked initial mentions in online sources like
Security Focus and
BugTraq, and their public proof-of-concept exploits.
(vi)
A
reproducible pipeline for training, validating, and analyzing all the ML models implemented as a collection of
Python scripts, made available in an
online appendix [
49].
In this paper, we focus on the feasibility of building early exploitability prediction models, evaluating their performance in a realistic setting. We highlight that our work is not intended to provide practitioners with a ready-to-use solution, but rather to take the first steps into the empirical investigation of early exploitability prediction, which can be finally put into practice with further research effort spent by the community.
Structure of the paper. Section
2 presents background information on the life cycle of software vulnerabilities, in addition to discussing the related literature and the limitations we aim to address. Section
3 reports the research questions driving our work and the methods we applied to address them, while Section
4 presents the results we observed. Section
5 further elaborates on the insights of the study and the implications for researchers and practitioners. The potential threats to the validity of the study are discussed in Section
6. Finally, Section
7 concludes the paper, outlining our future research agenda.
3 Empirical Study Design
In Section
3.1, we formulate the goal of our study according to the
Goal-Question-Metric (GQM) template [
95]; from this goal, we distilled our
research questions (
RQs). In Section
3.2, we describe the steps we followed to collect the data needed to fuel the prediction models. In Section
3.3, we report the details of the models that were selected to participate in our experiments, and in Section
3.4 we explain how we trained and evaluated them. Lastly, in Section
3.5 we focus on the implementation details regarding the models, and we describe the infrastructure we employed to run our experiments.
3.1 Study Goal and Research Questions
The goal of this empirical study was to investigate the performance of machine learning (ML)-based classifiers to predict the exploitability of a just-disclosed vulnerability, with the purpose of providing early feedback on the exploitability of new vulnerabilities in a realistic scenario. The perspective was of both practitioners and researchers. The former are interested in obtaining as much information as possible to (i) have an initial assessment supporting the CVSS measurement and (ii) understand when and how the vulnerabilities afflicting their applications must be handled. The latter are interested in (i) comprehending the predictive capabilities of textual data—obtained from online discussions about the vulnerabilities written in natural language—represented in different ways and (ii) assessing the effectiveness of different learning configurations.
To the best of our knowledge, the current research in exploitability prediction has always involved all the data available in CVE databases at the time of the study when building the dataset used by the experimented classifiers. This scenario caused the models to rely on subjective measures like the CVSS scores to make their predictions. However, in a real-world scenario, such information is only made available at a non-negligible distance from the CVE disclosure date [
35]—as of 2021, 20 days after the disclosure on average. We hypothesize that a realistic prediction model should only leverage
early data, i.e., those pieces of information available at the
disclosure time, as soon as a vulnerability is made public through a CVE record.
Therefore, our empirical investigation focused on employing ML algorithms to train classification models that recognize potentially exploitable vulnerabilities using exclusively early data. We observed that whenever a new vulnerability is discovered, the ordinary mechanism to raise awareness consists of opening a free commentary about it on public mailing lists or other similar discussion channels [
5,
19]. Such
data sources contain valuable information that may provide additional insight into the seriousness and impact of the vulnerability, e.g., a crash stack trace [
90], the reproduction steps [
20], or even code snippets showing that an exploit is feasible in principle (a.k.a.
Proofs of Concept, PoCs) [
92]. Whilst involving data from multiple sources could provide a more comprehensive perspective of the problem [
103,
106], the effect on the accuracy of the prediction models is not always positive [
67]. Indeed, each source might contain clearly contradictory information [
25] that would not allow the models to distinguish between “exploitable” and “neutral” vulnerabilities. Hence, we are interested in investigating how different combinations of data from multiple sources affect the models’ performance. We asked:
Extracting from vulnerability data sources the relevant information that allows the models to recognize exploitable vulnerabilities is not straightforward. Not only do such sources predominantly contain
unstructured text, but no automated mining tool exists that accurately extracts the relevant pieces of information, e.g., by isolating code snippets. The available solutions work only with traditional bug reports and for specific programming languages, such as
Infozilla [
10] that only works for bug reports in
Java. We must rely on text processing techniques that automatically determine the features from a corpus of natural language text [
11] to let the prediction models learn from unstructured text.
There are many ways to achieve such a task, which we group into two main categories: (1) those encoding the tokens found in the corpus—after adequate pre-processing—as individual features, and (2) those that learn how to represent a given text as an
embedding. In the first case, once the textual features, e.g., tokens or words, are extracted, they are weighted according to different mechanisms, such as by counting the number of times a certain feature appears in a document in the corpus, or by measuring the frequency of that token over the entire corpus [
8]. The number of resulting features cannot be controlled directly and heavily depends on the actual content of the documents in the corpus and on the pre-processing steps taken before, e.g.,
stemming. In the second case, the features do not represent specific textual elements, but they are “latent variables” that the model infers from the input corpus [
59,
71]. Unlike the first category of techniques, the size of the embeddings is generally decided upfront, before launching the embedding algorithm. The choice of the specific text representation technique can greatly impact the models’ performance [
28,
43].
Furthermore, we believe other elements worth investigating can influence the models’ performance, such as the choice of the specific learning algorithm [
94] or the use of data balancing algorithms to deal with the imbalance between “exploitable” and “neutral” instances [
74]. Thus, we asked:
Due to the recent advancements in the field of Natural Language Processing (NLP) and the popularity of pre-trained Large Language Models (LLMs), we also wanted to assess their suitability for such a task, as experimented by Yin et al. [
100]. Such models come with pre-trained weights learned in a self-supervised manner from large corpora of text not directly related to the specific tasks, e.g., general English text and/or examples of code written in different programming languages. The advantage of such models lies in their ability to determine the representation for the input (depending on what was seen during the pre-training stage) and return the prediction in a single shot. Therefore, we formulated two sub-questions to answer
RQ\(_2\), the first one investigating the performance of several ML models made of the traditional key elements—i.e., feature representation, training data balancing, and learning algorithm—and the second one focusing on the use of end-to-end pre-trained LLMs.
3.2 Data Collection
The
context of this empirical study was made of publicly disclosed vulnerabilities accompanied by references to online discussions mentioning them, such as public mailing lists and security advisories. Our literature review found no readily available dataset with all the data we need, i.e., the text of online discussions linked to disclosed vulnerabilities and the dates on which each vulnerability was disclosed and exploited. Therefore, we adopted a systematic data collection procedure to link all the existing known vulnerabilities to public websites where they had likely been discussed for the first time before the official disclosure date—in the rest of the paper, we also use the wording “publication date” when referring to the date on which a CVE record is made accessible. Then, we mined a large set of public scripts and PoCs available online and linked them to the vulnerabilities they exploited to carry out (pseudo-)realistic attacks. Figure
2 depicts all the steps (from 1 to 7) we took to collect the data needed to answer our research questions, each detailed in the following. Table
2 summarizes the collected data, reporting how much information we could retrieve from each considered data source. The scripts to run the entire data collection procedure are available in the online appendix of this paper [
49].
3.2.1 Mining Known Vulnerability Data.
We relied on the
National Vulnerability Database (NVD),
a comprehensive catalog of disclosed vulnerabilities reported in the form of CVE records. NVD enriches the upstream
CVE List managed by MITRE
by adding the CVSS vectors, the labeling of known affected software versions via CPE, and the like. For such reasons, NVD has been the basis of many empirical studies on software vulnerabilities, being considered a reliable source of high-quality information [
48,
54,
65,
72]. Our study did not target any specific platform or programming language, so we collected all existing vulnerabilities available in NVD at the time of the study. Thus, we downloaded the full dump of NVD curated by the
CVE Search project
on November 03, 2021, gathering 148,299 CVE entries (Step 1 in Figure
2). We pre-processed this raw dataset to ensure that the data quality was suitable for our study. In particular, we filtered out those entries that were (i) malformed (e.g., the identifier did not point to an actually existing CVE record), (ii) rejected (i.e., the CVE identifier was allocated but never approved at the final stage), or (iii) lacking external references. We could recognize vulnerabilities falling in those cases by inspecting the content of the dump: malformed CVE identifiers did not adhere to the pattern
CVE-XXXX-YYYY; rejected CVEs had a clear statement of their rejection in the description; CVEs lacking external references were missing a list of links in the HTML page on
CVE List. In the end, the three filters led to the removal of just 214, 135, and 5 entries, respectively (Step 2 in Figure
2), resulting in 147,900 valid CVEs.
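The following minimal sketch illustrates how such a filtering step can be implemented; the field names (id, summary, references) and the rejection marker are assumptions about the dump format rather than the exact schema used in the study.

```python
import re

# Hypothetical raw entries mimicking the structure of a CVE dump.
raw_dump = [
    {"id": "CVE-2014-0160", "summary": "Heartbleed", "references": ["https://example.org/advisory"]},
    {"id": "CVE-BAD-ID", "summary": "** REJECT ** do not use", "references": []},
]

CVE_ID_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}$")

def keep_entry(entry):
    """Apply the three filters: malformed identifier, rejected CVE, missing references."""
    if not CVE_ID_PATTERN.match(entry.get("id", "")):
        return False                                  # (i) malformed identifier
    if "REJECT" in entry.get("summary", "").upper():
        return False                                  # (ii) rejection statement in the description
    if not entry.get("references"):
        return False                                  # (iii) no external references
    return True

valid_cves = [e for e in raw_dump if keep_entry(e)]   # keeps only CVE-2014-0160 here
```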
Since we were interested in using only the information available at the disclosure time, we retraced the change history of the CVE record stored in NVD; for each CVE we scraped the content of its descriptive HTML page in NVD, as it contains a set of tables reporting the changes made to the record and their dates. In doing so, we could fetch the original description that the CVE had at the time it was first disclosed, so that we could use it within our
early exploitability prediction models in a realistic scenario—indeed, the models should not be allowed to use information that was made available years after the disclosure. It is worth pointing out that the original date on which a CVE record was created is not reported in NVD but rather on the
CVE List website. Hence, we mined the HTML pages in the
CVE List website as well. The scraping of HTML pages of both NVD and
CVE List was supported by the
BeautifulSoup library for
Python.
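As a rough illustration of this scraping step, the snippet below fetches a CVE detail page and collects the rows of its tables with BeautifulSoup; the URL template and the way relevant rows are identified are illustrative assumptions, since the actual page layout of NVD and CVE List may differ.

```python
import requests
from bs4 import BeautifulSoup

def change_history_rows(cve_id):
    """Collect the text of all table rows on a CVE detail page (illustrative only)."""
    url = f"https://nvd.nist.gov/vuln/detail/{cve_id}"   # assumed entry point
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # The change history is rendered as a set of tables; keep rows that mention a
    # change to the description, from which the original text can be recovered.
    rows = [tr.get_text(" ", strip=True) for tr in soup.find_all("tr")]
    return [r for r in rows if "Description" in r]

print(change_history_rows("CVE-2014-0160")[:3])
```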
3.2.2 Mining Online Discussion Data.
Any CVE record is equipped with a continuously updated list of external reference URLs pointing to web pages concerning that specific vulnerability, e.g., online discussions, official patches, and bug reports. In this respect, our goal was to define an exploitability prediction model that leverages only those references available at the disclosure time, as we believe they could be the reason behind the allocation request of the CVE identifier. However, the references in the NVD dump are not provided with the date they were added to the CVE records, preventing us from knowing whether they had already been linked at the time of the record allocation or only at a later stage. We observed that the
CVE List website maintains the references in a different way: each link is labeled with a special keyword indicating its type—e.g., vendor advisory, mailing list—and its origin, i.e., the website it points to.
Hence, we analyzed 714,854
CVE List references among all the 147,900 CVE records (Step 3 in Figure
2). We observed that the most recurring keyword was
BID (
SecurityFocus Bugtraq ID), which refers to security advisories published on the
SecurityFocus website (43% of CVEs had at least one reference to this category). Such a website has long been considered a reliable source to report security bugs—each uniquely identified with a
BID—and tracks existing solutions and working exploits [
38]. Moreover, its plain HTML structure allowed the easy recovery of all the data using a simple HTML parser with the application of filters to exclude those references published after the vulnerability disclosure date. For all these reasons, we ignored the references listed in the NVD dump, navigated the
BID-labeled URLs in
CVE List, and parsed the content of the HTML in the response using
BeautifulSoup—this was the only available option to recover such data, as
SecurityFocus does not expose any accessible API. It is worth pointing out that since February 2020,
SecurityFocus has stopped publishing further
BID advisories; hence, the URLs stored in the CVE records are no longer accessible. We circumvented this limitation by exploiting the
Wayback Machine service provided by the
Internet Archive library [
3], which offers free access to many digital resources that were once available on the web. Therefore, we queried the API exposed by the service that returns an active URL—stored in its archives—having the same HTML content as the input URL (Step 4a in Figure
2).
We also observed that
BID reports are commonly related to a discussion on a public mailing list known as
BugTraq, one of the most popular discussion boards, where participants have discussed newly discovered vulnerabilities since 1993. All the discussions are held in natural language (commonly English) without following any specific text structure. The only consistent element across discussions is the header containing the original publication date. Consequently, we considered
BugTraq references in addition to those labeled with
BID.
BugTraq has encountered a similar fate to
SecurityFocus, as it was shut down in 2020; yet, we still considered it alongside
SecurityFocus as it was referenced by a non-negligible number of the CVEs we selected (14% of CVEs had at least one
BugTraq reference). All the discussions held in the past are now archived by third-party websites, such as
SecLists,
from which we downloaded all the discussions pointed to by the CVE records in our context. To recover the missing mailing list discussions, we exploited the format of the
BugTraq identifiers, made of eight digits representing the publication day according to the format
YYYYMMDD, plus a short text summarizing the content of the discussion. Both the year and the month allowed us to reach a page on
SecLists containing the list of
BugTraqs of that period, from which we retrieved the entry that had the highest similarity—using Gestalt Pattern Matching [
79]—with the short text in the
BugTraq identifier. Then, we mined the content of the matched discussions leveraging
BeautifulSoup (Step 4b in Figure
2). To summarize, 70,513 out of 147,900 CVEs had at least one
BID or
BugTraq type reference, linked to a total of 65,978
BID references and 26,387
BugTraq references. We considered these numbers sufficiently high to address the research goals of our investigation.
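Gestalt Pattern Matching is what Python's difflib implements, so the matching step can be sketched as follows; the identifier, candidate titles, and the way the short text is extracted from the identifier are hypothetical examples.

```python
from difflib import SequenceMatcher

def best_seclists_match(short_text, candidate_titles):
    """Return the candidate title most similar to the short text embedded in a
    BugTraq identifier, using Gestalt Pattern Matching (difflib)."""
    def similarity(title):
        return SequenceMatcher(None, short_text.lower(), title.lower()).ratio()
    return max(candidate_titles, key=similarity)

identifier = "20040312 Buffer overflow in FooServer"      # hypothetical BugTraq identifier
short_text = identifier[8:].strip()                       # text after the YYYYMMDD prefix
titles = ["Buffer overflow in FooServer 1.2", "Weekly digest", "XSS in BarCMS"]
print(best_seclists_match(short_text, titles))            # -> "Buffer overflow in FooServer 1.2"
```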
Afterward, we made sure to discard the text of those
BID and
BugTraq references (i) published after the CVE disclosure date or (ii) whose format did not allow to reliably recover their publication date. This happened for 10,973 out of 65,978
BIDs and for 5,368 out of 26,387
BugTraqs. Although these two filters caused some vulnerabilities not to have any
BID or
BugTraq reference, we still did not discard them entirely, as the sole original description provided in the CVE record might contain enough information to predict their exploitability, as observed in similar works [
45,
66].
As a result of the process of retrieving input data for our investigation, we finally considered three data sources, namely (i) CVE records, retrieved from
CVE List, (ii)
SecurityFocus Bugtraq ID reports, restored from the
Wayback Machine, and (iii) discussions on the
BugTraq mailing list, collected from
SecLists website. Each of these sources provided us with textual data that we combined to perform our experiments, as explained in Section
3.3.
Each text underwent a set of pre-processing steps to remove irrelevant pieces of information and facilitate encoding textual features (Step 5 in Figure
2). In particular, we first employed a set of regular expressions to detect and remove data that could negatively affect the process, such as websites, URLs, e-mail addresses, PGP signatures and messages, hex numbers, and words containing the same letter repeated at least three times in a row [
11,
45,
61]. Second, we converted the text to lowercase, removed non-alphabetic characters (punctuation and Unicode symbols), and split the remaining content into tokens using whitespace as a separator. Then, we removed English stop words [
26,
91], applied suffix stripping using Porter's stemmer [
76], and removed those terms having fewer than three characters.
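A minimal sketch of this pre-processing pipeline is shown below, assuming NLTK's stop-word list and Porter stemmer as stand-ins for the implementations used in the study; the noise-removal patterns are illustrative, not the exact regular expressions we employed.

```python
import re
from nltk.corpus import stopwords       # requires a one-time nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()
# Illustrative patterns for URLs, e-mail addresses, hex numbers, and words with
# the same letter repeated at least three times in a row.
NOISE = [
    re.compile(r"https?://\S+|www\.\S+"),
    re.compile(r"\S+@\S+"),
    re.compile(r"\b0x[0-9a-fA-F]+\b"),
    re.compile(r"\b\w*(\w)\1{2,}\w*\b"),
]

def preprocess(text):
    for pattern in NOISE:
        text = pattern.sub(" ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())        # lowercase, keep letters only
    tokens = [t for t in text.split() if t not in STOP]  # drop English stop words
    tokens = [STEMMER.stem(t) for t in tokens]           # Porter suffix stripping
    return [t for t in tokens if len(t) >= 3]            # drop very short terms

print(preprocess("Heap overflow at 0x41414141, see http://example.org AAAAAA"))
```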
3.2.3 Mining Exploit Data.
The CVE references labeled as
EXPLOIT-DB in the
CVE List point to
Exploit Database (EDB in short),
which is the most comprehensive collection of public exploits and Proofs of Concept (PoCs) that explicitly target known vulnerabilities. Navigating these references could establish the links between the vulnerabilities and their exploits. However, we observed that only a minimal fraction of CVE records had an explicit link to EDB (i.e., 7.9%), likely owing to improper curation of the CVE records—and not necessarily to an actual lack of an exploit. Hence, similarly to Bhatt et al. [
12], we rebuilt the links in the opposite direction, connecting all the exploits in EDB to the affected CVEs using the metadata contained in the exploit entries. To this aim, we downloaded the complete list of exploits as of November 03, 2021, from the official
GitHub repository
to obtain the list of valid exploit identifiers, queried the EDB website, and parsed the HTML pages—still using the
BeautifulSoup library—of each exploit to collect the target CVEs, if made explicit. Note that a single exploit or PoC may target more than one, often related, vulnerability; similarly, more than one exploit may affect a single vulnerability. In the end, a total of 47,742 exploits were collected, linked to 23,690 different CVEs, corresponding to 16.02% of the total CVEs with valid data (Step 6 in Figure
2). After connecting each CVE to its exploits, we could obtain the date on which the first exploit of the vulnerability described in the CVE was uploaded to the
Exploit Database. We collected all these dates and considered them as the
“exploitation dates” (Step 7 in Figure
2), which will be needed when building our ground truth (see Section
3.4).
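The derivation of the exploitation dates reduces to taking, for each CVE, the earliest upload date among its linked exploits; the sketch below illustrates this with a hypothetical table of exploit-to-CVE links.

```python
import pandas as pd

# Hypothetical flattened table of exploit-to-CVE links with their upload dates.
links = pd.DataFrame({
    "exploit_id": [101, 102, 103],
    "cve_id": ["CVE-2014-0160", "CVE-2014-6049", "CVE-2014-0160"],
    "uploaded_on": pd.to_datetime(["2014-04-10", "2014-09-05", "2014-04-08"]),
})

# The exploitation date of a CVE is the date of its earliest public exploit.
exploitation_dates = links.groupby("cve_id")["uploaded_on"].min()
print(exploitation_dates)
```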
3.3 Model Selection
Once we had collected all the required data, which is summarized in Table
2, we could select the prediction models subject to our experiments.
To answer RQ\(_{1}\), we first considered the textual data from the three sources selected—i.e., CVE, SecurityFocus(SF), and BugTraq(BT)—individually to assess which one was the most helpful in predicting the exploitability of the vulnerabilities. Then we combined them by means of string concatenation to understand whether data coming from multiple sources can provide the models with additional information on the vulnerabilities and improve their accuracy. Hence, we formed four combinations, i.e., \(\langle\)CVE + SF\(\rangle\), \(\langle\)CVE + BT\(\rangle\), \(\langle\)SF + BT\(\rangle\), \(\langle\)CVE + SF + BT\(\rangle\). In the end, we experimented with a total of seven different combinations of data sources, which we call corpora from now on.
Then, regarding RQ\(_{2.1}\), we experimented with 72 traditional learning configurations (sketched in code after the following list), determined by the combination of:
—
Six machine learning algorithms, opting for the most adopted for training binary classifiers, i.e., (i)
Logistic Regression(LR) [
70], (ii)
Naïve Bayes(NB) [
83],
(iii) K-nearest Neighbors(KNN) [
104], (iv)
Support Vector Machine(SVM) [
24], (v)
Decision Tree(DT) [
16], and (vi)
Random Forest(RF) [
15].
—
Four feature representation schemas, three of which encode each token found in the training set as an independent feature—i.e., the simple word counting (a.k.a. Bag-of-Words, BoW), term frequency(TF), term frequency-inverse document frequency(TF-IDF)—and one that automatically learns embeddings from the training set—i.e., doc2vec(DE).
—
Three ways for managing the data imbalance during the training stage, i.e., leaving the data untouched (
Original), over-sampling with SMOTE [
21], and under-sampling with
NearMiss (version 3) [
102].
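The configuration grid can be enumerated as in the sketch below; the specific Scikit-Learn estimators and hyperparameters (e.g., which Naïve Bayes or SVM variant) are assumptions made for illustration, not the exact settings of the study.

```python
from itertools import product

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss

classifiers = {
    "LR": LogisticRegression(max_iter=1000), "NB": MultinomialNB(),
    "KNN": KNeighborsClassifier(), "SVM": LinearSVC(),
    "DT": DecisionTreeClassifier(), "RF": RandomForestClassifier(),
}
representations = ["BoW", "TF", "TF-IDF", "DE"]                 # fitted as described in Section 3.4.4
balancing = {"Original": None, "SMOTE": SMOTE(), "NearMiss3": NearMiss(version=3)}

# 6 algorithms x 4 representations x 3 balancing settings = 72 configurations.
configurations = list(product(classifiers, representations, balancing))
print(len(configurations))   # 72
```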
Similarly, for
RQ\(_{2.2}\), we involved
five pre-trained LLMs, all based on the BERT architecture [
29]. Specifically, we selected: (i)
DistilBERT [
87], (ii) ALBERT [
57], (iii)
XLM-RoBERTa [
23], (iv)
CodeBERT [
37], and (v)
CodeBERTa [
64]. We selected such models because of their noticeably different pre-training backgrounds. Indeed, all models we selected have a general understanding of the English language, which was required as the text of CVE descriptions and the other data sources we considered, i.e.,
SecurityFocus and
BugTraq, were in English as well. Both ALBERT and DistilBERT were pre-trained on the
BookCorpus and
English Wikipedia corpora (just like the vanilla BERT), while
XLM-RoBERTa was pre-trained on the
CommonCrawl corpus, which contains text from 100 languages. Yet, vulnerability data often include elements that are not part of common English text, such as code snippets and many punctuation characters; for this reason, we also included two models with experience of programming languages, i.e.,
CodeBERT and
CodeBERTa. Such models were pre-trained on
the CodeSearchNet corpus, containing examples of methods from six programming languages, as well as their associated documentation (e.g.,
JavaDoc), generally written in English. To allow the models to be fine-tuned on our binary classification downstream task, we equipped them with a linear layer on top of the pooled output.
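A minimal sketch of this setup is shown below: HuggingFace's AutoModelForSequenceClassification attaches exactly such a linear head on top of the pooled output; the checkpoint, example input, and inference-only usage are illustrative, and the actual fine-tuning loop and hyperparameters are not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"          # the other checkpoints are listed in Section 3.5
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

inputs = tokenizer("Buffer overflow allows remote code execution.",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits             # shape (1, 2): "neutral" vs. "exploitable"
print(logits.softmax(dim=-1))
```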
To better contextualize the models’ performance, we also involved four baseline models, meant to determine the real usefulness of the non-trivial models. We selected: a Random(RND) classifier, predicting that a vulnerability is “exploitable” with 50% probability; a Pessimistic(PES) classifier, always predicting that a vulnerability is “exploitable”; an Optimistic(OPT) classifier, always predicting that a vulnerability is “neutral”; and a Stratified(STR) classifier, predicting the exploitability with a probability equal to the frequency of “exploitable” instances in the training set. Due to their nature, the baseline models ignore any feature representation and data balancing technique employed.
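These baselines map naturally onto Scikit-Learn's DummyClassifier strategies, as in the sketch below; the mapping is our own illustration rather than a statement about the study's implementation.

```python
from sklearn.dummy import DummyClassifier

baselines = {
    "RND": DummyClassifier(strategy="uniform", random_state=42),     # "exploitable" with 50% probability
    "PES": DummyClassifier(strategy="constant", constant=1),         # always "exploitable"
    "OPT": DummyClassifier(strategy="constant", constant=0),         # always "neutral"
    "STR": DummyClassifier(strategy="stratified", random_state=42),  # follows the training-class frequency
}
```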
To summarize, we experimented with a total of 567 models, i.e., 72 traditional ML models, five LLMs, and four baseline classifiers, all trained and tested on seven corpora.
3.4 Model Evaluation Framework
To ensure high realism in our experimentation, each of the 567 models was validated in the context of a
time-aware validation, which emulates a scenario where the prediction models are iteratively re-trained and validated at different
reference dates with different data. Such reference dates represent the moment at which the models would be put into production. To this end, we first had to sort all the instances (i.e., the vulnerabilities) previously collected (Section
3.2) by their
publication date, then create the folds to form the validations rounds, and finally determine the target
labels (i.e., the expected values the models should predict) to assign to each instance according to their
exploitation date, if any. Given the time-aware nature of the validation, the assignment of the labels could not be done for all the instances in a single shot, as doing so would break the realism of the validation, since exploitability data only becomes available over time. To better clarify this concept, let us suppose we wanted to validate the models in December 2005: we would not be allowed to look into any piece of information that came out after this date. For example, if a CVE had been published in 2003 and was first seen exploited only in 2006, we must treat that instance as
“neutral” in December 2005. Therefore, we assigned the labels to the instances at each round of validation, in line with the recommendation by Jimenez et al. [
54].
We also took into account other noise-introducing factors that could affect the labeling activity, which we explain in more detail in Section
3.4.1. The way we split the entire collection of vulnerabilities into folds to create the rounds for the time-aware validation is explained in Section
3.4.2, while in Section
3.4.3 we report how we built the training and test sets for each round. Lastly, in Section
3.4.5, we describe how the models’ performance was assessed. Figure
3 summarizes the entire framework with which we trained and tested the 567 models.
3.4.1 Data Cleaning Strategy.
Each instance in the dataset had to be labeled according to the presence of an exploit in
Exploit Database. The most straightforward strategy would have been to mark as
“exploitable” (
true class) those instances having at least one associated exploit and as
“neutral” (
false class) all those not having any reported exploit at all. This would have caused 16.02% of the vulnerabilities to be labeled as
“exploitable” and 83.98% as
“neutral”. However, such an approach is improper in the context of a realistic validation as it would produce a large number of instances with inappropriate labels. Let us consider the case of CVE-2020-14340, published in June 2021. At the time of this part of the study, i.e., with a reference date of November 2021, only five months had passed since its publication, so it was quite expected that an exploit was not yet present in
Exploit Database, as not enough time had passed since its publication to observe the first public exploit. As a matter of fact, the average time between the disclosure and the first exploitation of a vulnerability is 194 days, i.e., more than half a year. Marking as
“neutral” such a recently-disclosed vulnerability would be
too eager, causing an overabundance of
false labels. We hypothesize that if we concede some more time to make the exploits emerge, we could label the instances more confidently. In other words, if recently disclosed vulnerabilities have not been seen exploited yet, we cannot deem them as
“exploitable” or
“neutral” with sufficient confidence. Thus, we decided to completely remove those instances from the validation round having the reference date of November 2021 to avoid introducing data with noisy labels—following an analogous strategy adopted by Garg et al. [
42]. It is worth pointing out that such instances should not be used either as training data or as test data because, in the first case, they would inflate the number of
false instances, while in the second case, they would distort the models’ real performance.
We observed that the number of instances that risk being labeled improperly strongly depends on the amount of time we are willing to “concede” for exploits to emerge. Let us consider the case of CVE-2020-25649, which had been published six months before CVE-2020-14340 (seen in the previous example). Similarly to the previous case, half a year was not enough to let its first exploit manifest—indeed, this is even below the average exploitation time of 194 days. To minimize the risk of having improperly labeled instances among our training and test data, we selected the 90th percentile of the exploitation time distribution—corresponding to 532 days (about one year and a half)—as the “tolerance period” we concede for exploits to manifest. Specifically, given the reference date \(D_i\) in a validation round \(R_i\), we applied our cleaning strategy to all vulnerabilities that had been published within 532 days from \(D_i\). All vulnerabilities within this uncertainty window that were not exploited before \(D_i\) were completely excluded from the round \(R_i\). All the vulnerabilities that passed the cleaning step and those outside the uncertainty window were labeled according to the presence of an exploit in the Exploit Database reported before \(D_i\).
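The cleaning and labeling rule can be summarized in a few lines of code, as in the sketch below; the DataFrame column names (published_on, first_exploited_on) are assumptions for illustration.

```python
from datetime import timedelta
import pandas as pd

TOLERANCE = timedelta(days=532)    # 90th percentile of the exploitation-time distribution

def clean_and_label(cves, reference_date):
    """Drop unexploited CVEs inside the uncertainty window; label the rest by the
    presence of an exploit published before the reference date."""
    exploited = cves["first_exploited_on"].notna() & (cves["first_exploited_on"] <= reference_date)
    in_window = cves["published_on"] > (reference_date - TOLERANCE)
    kept = cves[~(in_window & ~exploited)].copy()      # remove the uncertain instances
    kept["exploitable"] = exploited[kept.index]
    return kept

cves = pd.DataFrame({
    "published_on": pd.to_datetime(["2019-01-10", "2021-06-01"]),
    "first_exploited_on": pd.to_datetime(["2019-05-02", pd.NaT]),
})
print(clean_and_label(cves, pd.Timestamp("2021-11-03")))   # the unexploited 2021 CVE is removed
```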
3.4.2 Validation Round Creation.
To determine the number of folds into which the dataset must be divided, thus forming the validation rounds, we used the duration of the uncertainty window set before (Section
3.4.1). Namely, starting from November 2021 (the time of this part of the study), we repeatedly went “back in time” by 532 days at a time until the date on which the first vulnerability in the dataset was published (i.e., 1989). In this way, we ended up with 22 folds, each made of vulnerabilities published in a 532-day time span. For instance, the 22nd split consists of all the CVEs published between May 19, 2020, and November 2, 2021, while the 21st split consists of the CVEs published between December 4, 2018, and May 18, 2020, and so forth. Such a splitting allowed us to evaluate the models’ behavior when the uncertainty window is made of wholly different sets of vulnerabilities. Figure
3 shows an example of what happens within each round of the time-aware validation. We observe that the 22nd round, i.e., the last one, corresponds to the case in which we use the entire dataset of vulnerabilities mined in this work to train and test the models.
3.4.3 Training & Test Set Preparation.
At each validation round
\(R_i\), we used all folds from 1 to
i to form the dataset of round
\(R_i\). Then, we applied the data cleaning strategy described in Section
3.4.1 to all the vulnerabilities admitted in round
\(R_i\), and we created the training and test sets using a
time-aware 80/20 splitting, i.e., placing the first 80% of the instances in the training set and the remaining 20% in the test set, ensuring that all the training instances were published before all the test instances. Afterward, we could proceed with the labeling strategy described by Jimenez et al. [
54]. Namely, we marked the training instances as
“exploitable” if and only if they were exploited before the
training date, which corresponds to the
latest publication date among all the training instances, while we marked as
“exploitable” the test instances if and only if they were exploited before the reference date of round
\(R_i\).
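A compact sketch of this split-and-label step follows; it assumes a pandas DataFrame sorted by publication date with the same hypothetical columns used in the previous sketch.

```python
def time_aware_split_and_label(fold, reference_date, train_fraction=0.8):
    """80/20 time-aware split followed by the labeling strategy of Jimenez et al.:
    training instances are 'exploitable' only if exploited before the training date
    (the latest publication date in the training set); test instances are
    'exploitable' if exploited before the reference date of the round."""
    cut = int(len(fold) * train_fraction)
    train, test = fold.iloc[:cut].copy(), fold.iloc[cut:].copy()
    training_date = train["published_on"].max()
    train["exploitable"] = train["first_exploited_on"].notna() & \
                           (train["first_exploited_on"] <= training_date)
    test["exploitable"] = test["first_exploited_on"].notna() & \
                          (test["first_exploited_on"] <= reference_date)
    return train, test
```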
3.4.4 Model Training & Testing.
At each validation round
\(R_i\), all the vulnerabilities in the
i-th training and test sets were linked with the content of each of the seven corpora (explained at the beginning of Section
3.3)—hence, forming seven “variants” of the
i-th training and test sets.
The five pre-trained LLMs and the four baseline classifiers were trained and tested at this stage without any further processing. On the contrary, the 72 machine learning configurations required additional processing according to the selected feature representation schema and data balancing algorithm. Consequently, for each variant of the
i-th training set, we fit the four feature representation techniques selected (Section
3.3), i.e., word counting (BoW), term frequency (TF), term frequency-inverse document frequency (TF-IDF), and
doc2vec (DE). For the first three schemas, i.e., BoW, TF, and TF-IDF, we built three
document-term matrices, where each row represents an instance and the columns represent all the tokens (using white space as the separator) found in the associated textual content. The values inside each cell are weighted depending on the specific schema (summarized formally after the list):
(i)
BoW assigns to the ij-th cell the number of times the j-th term appears in the i-th document;
(ii)
TF assigns to the ij-th element the number of times the j-th term appears in the i-th document, divided by the total number of times the j-th term appears in the corpus;
(iii)
TF-IDF assigns to the ij-th element the TF value multiplied by the IDF (inverse document frequency) of the j-th term, which is computed as the logarithm of the total number of documents divided by the number of documents in which the j-th term appears, hence lowering the weight of terms appearing in too many documents.
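Following the definitions stated in the list above (note that this TF normalization differs from some common library defaults), the three weighting schemes can be written as
\[
\mathrm{BoW}_{ij} = f_{ij}, \qquad \mathrm{TF}_{ij} = \frac{f_{ij}}{\sum_{k} f_{kj}}, \qquad \mathrm{TF\text{-}IDF}_{ij} = \mathrm{TF}_{ij} \cdot \log \frac{N}{\lvert \{\, d_k : f_{kj} > 0 \,\} \rvert},
\]
where \(f_{ij}\) is the number of occurrences of the j-th term in the i-th document and \(N\) is the total number of documents in the corpus.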
Therefore, each instance was represented as a numeric vector, which we used to represent the associated vulnerability. We observe that the final number of features varies among the seven corpora as they have different terms appearing in them; thus, each of the seven variants of the
i-th training set ended up having different dimensionalities. On the other hand, the fourth schema, i.e., DE, learns a predetermined number of features via a neural network, which learns how to represent all the documents in an unsupervised manner. We chose the
Distributed Bag-of-Words (PV-DBOW) variant, setting the size of the embeddings to 300, following the configuration that provided the best results in related work [
45] and keeping all the other settings to the default recommendations for
doc2vec. At this point, the two groups of features (tokens and embeddings) extracted from the
i-th training set were reused as-is to determine the feature space of all the corresponding seven “variants” of the
i-th test set without any modification to avoid data leakage [
6]. To do this, on the one hand, we added the test instances to the document-term matrices fitted at training time and weighted the new instances with BoW, TF, and TF-IDF; on the other hand, we fed the fitted
doc2vec model with the test instances to obtain their embedding in the same space learned at the training time. Lastly, all variants of the
i-th training set obtained so far were balanced with two algorithms, i.e., SMOTE over-sampling and
NearMiss (version 3) under-sampling.
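The fit-on-train/transform-on-test discipline and the balancing step can be sketched as follows with Scikit-Learn, Imbalanced-Learn, and Gensim; the toy corpus, labels, and hyperparameters (other than the 300-dimensional PV-DBOW embeddings) are illustrative, and TfidfVectorizer's term-frequency normalization may differ in detail from the weighting described above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from imblearn.over_sampling import SMOTE
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

train_texts = ["buffer overflow in the http parser", "remote code execution via crafted packet",
               "stack overflow triggered by long header", "cross site scripting in login form",
               "sql injection in search parameter", "denial of service through malformed request"]
y_train = [1, 1, 0, 0, 0, 0]                      # toy labels: 1 = exploitable, 0 = neutral
test_texts = ["remote buffer overflow exploit"]

bow = CountVectorizer()
X_train = bow.fit_transform(train_texts)          # vocabulary learned from the training set only
X_test = bow.transform(test_texts)                # test instances projected onto the same space

tf = TfidfVectorizer(use_idf=False)               # term-frequency weighting (no IDF component)
X_train_tf = tf.fit_transform(train_texts)

docs = [TaggedDocument(t.split(), [i]) for i, t in enumerate(train_texts)]
d2v = Doc2Vec(docs, dm=0, vector_size=300, min_count=1, epochs=20)    # PV-DBOW embeddings
test_embedding = d2v.infer_vector(test_texts[0].split())

X_balanced, y_balanced = SMOTE(k_neighbors=1).fit_resample(X_train, y_train)   # over-sampling
```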
3.4.5 Performance Assessment.
Once we had obtained the predictions of all the 567 models on the test sets across all the 22 validation rounds, we derived the
confusion matrices reporting the True/False Positive and True/False Negative predictions. From them, we computed the performance metrics commonly adopted for the binary classification task, i.e., accuracy, precision, recall, and F-measure [
77]. The F-measure represents an aggregation of precision and recall, both crucial for evaluating binary classifiers [
8]. The trade-off between such values is particularly tricky in the context of exploitability prediction, as practitioners might wish for higher precision to identify the potentially exploitable vulnerabilities correctly but also for high recall to avoid false negatives—i.e., vulnerabilities considered safe but exploitable. However, the F-measure does not consider the number of true negative instances, i.e., the neutral vulnerabilities correctly classified as
“neutral”. The problem of exploitability prediction is highly imbalanced, so we were interested in evaluating all four quadrants of the confusion matrix. To this end, we also involved
the Matthews Correlation Coefficient (MCC) [
68], an indicator of the correlation between the predicted values and the actual labels of the instances that, unlike other traditional metrics, takes into account the class imbalance in the test set.
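For reference, all of these metrics can be computed directly from the predictions with Scikit-Learn, as in the toy example below.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

y_true = [1, 0, 0, 1, 0, 0, 0, 1]    # toy ground truth (1 = exploitable, 0 = neutral)
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]    # toy predictions

scores = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
    "mcc":       matthews_corrcoef(y_true, y_pred),
}
print(scores)
```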
We computed the selected performance metrics on all the 22 time-aware validation rounds. Consequently, the models had 22 scores of a given metric, which did not allow for direct comparison. Hence, we carried out two kinds of analyses. First, we aggregated the results observed in all 22 iterations of the time-aware validation using a
weighted average, assigning a weight proportional to the size of the training set used in an iteration. In other words, the 22 scores were not treated equally, as (i) the initial iterations faced a problem that is less representative of today’s situation, and (ii) the amount of data the model worked with in the initial iterations was lower. Assigning equal weights to all the iterations, as in a simple average, would have provided unrealistic and inflated results, rewarding the models that behaved well in most iterations rather than in the most recent—and, therefore, significant for today’s practitioners—ones. We exploited the aggregated scores to depict the box plots of each of the seven corpora, to highlight the distribution of the performance of the models trained and tested using a given corpus. Moreover, we leveraged the Friedman test [
40] to discover whether the seven distributions exhibit statistically significant differences (
\(\alpha = 0.05\)). When a difference was observed, we conducted the Nemenyi post hoc test [
73] to identify the pairs of corpora having noticeable differences—indeed, the null hypothesis states that the compared groups have the same distribution. Such a test is robust to repeated comparisons and does not require the tested distributions to be normal. All of this was needed to answer
RQ\(_1\). Afterward, we plotted how the model performance varied over the 22 iterations, with the validation rounds on the
x axis and the score of a given performance metric on the
y axis. Such plots allowed us to observe the models’ general “trend” from different perspectives. Analyzing the trends was needed to answer both
RQ\(_{2.1}\) and
RQ\(_{2.2}\).
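The aggregation and statistical testing can be sketched as follows; the per-round scores and fold sizes are hypothetical placeholders for the actual results.

```python
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Hypothetical per-round scores of one model on three corpora (22 rounds in the study).
rounds = pd.DataFrame({
    "CVE":    [0.31, 0.42, 0.40, 0.38],
    "SF":     [0.22, 0.30, 0.29, 0.27],
    "CVE+SF": [0.35, 0.47, 0.45, 0.44],
})
train_sizes = np.array([500, 2000, 8000, 20000])       # hypothetical training-set sizes

# Weighted average: later, larger rounds count more than the earliest ones.
weighted = rounds.apply(lambda col: np.average(col, weights=train_sizes))
print(weighted)

# Friedman test across corpora, followed by the Nemenyi post hoc test when significant.
stat, p = friedmanchisquare(rounds["CVE"], rounds["SF"], rounds["CVE+SF"])
if p < 0.05:
    print(sp.posthoc_nemenyi_friedman(rounds))
```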
The raw results of our analyses are available in the online appendix of this paper [
49].
3.5 Implementation Details and Experimental Infrastructure
The entirety of our experimentation was implemented with a collection of
Python scripts. The traditional machine learning algorithms and the baseline models used the implementation provided by
Scikit-Learn,
while the pre-trained LLMs were downloaded from
HuggingFace using its
Transformers library. In this respect, the exact pre-trained model versions we used are
distilbert-
base-
uncased,
albert-
base-
v2,
xlm-
roberta-
base,
codebert-
base, and
codeberta-
small-
v1, all implemented with
PyTorch.
The document-term matrices and the feature weighting for BoW, TF, and TF-IDF were done using
Scikit-Learn,
while the
doc2vec model was provided by the
Gensim library.
The data balancing algorithms, i.e., SMOTE and
NearMiss, were implemented with
the Imbalanced-Learn library. All the performance metrics relied on the implementation provided by
Scikit-Learn, while the statistical tests leveraged the
SciPy and
scikit-posthocs packages.
We ran the experiments involving the baseline models and machine learning algorithms on a Linux machine equipped with a quad-core 1.50 GHz processor and 32 GB of memory. The full data collection procedure took about 11 days, while the dataset cleaning, splitting, and labeling required about 13 hours. Due to the large size of the dataset and the high number of configurations to evaluate and rounds to execute, the models’ training and testing phases were considerably time- and resource-consuming, taking a total of 65 hours to complete. To experiment with LLMs, we leveraged a GPU-equipped machine via the
Vast.ai cloud GPU rental service. The GPU was an
NVIDIA RTX 3090 with 24 GB of memory, and the execution of the experiments took about 13 days to complete.
We warmly encourage replication and verification of our work. Thus, we make all the scripts available in the online appendix of this paper [
49].
4 Analysis and Discussion of the Results
In this section, we present the results obtained in our experiments to answer our research questions (presented in Section
3.1).
4.1 The Impact of Early Data Source Combinations (RQ\(_1\))
Figures
4 and
5 show the distribution of the aggregated F-measure and MCC scored by the 77 models built on the seven corpora involved in the analysis for
RQ\(_1\). We can immediately observe interesting differences among the distributions. The models trained using the
\(\langle\)BT
\(\rangle\) corpus had the worst performance, scoring less than
\(\sim\)0.15 median weighted F-measure and less than
\(\sim\)0.10 median weighted MCC. On the contrary,
\(\langle\)CVE
\(\rangle\) and
\(\langle\)SF
\(\rangle\) obtained better performance (
\(p=0.001\)), reaching
\(\sim\)0.40 and
\(\sim\)0.30 median weighted F-measure, while scoring
\(\sim\)0.25 and
\(\sim\)0.16 median weighted MCC, respectively. According to the Nemenyi test, the difference between the two corpora is statistically significant for both metrics (
\(p\lt 0.05\) for both metrics).
Then, we observe that combining data from multiple corpora led to improvements compared to using individual ones. Specifically, adding the text data from the \(\langle\)CVE\(\rangle\) corpus to the \(\langle\)BT\(\rangle\) corpus can lead to up to a \(\sim\)0.20 median improvement in both weighted F-measure and MCC (\(p=0.001\) for both metrics). A smaller median improvement, though still statistically significant according to the Nemenyi test (\(p=0.001\) for both metrics), is observed when adding \(\langle\)CVE\(\rangle\) to the \(\langle\)SF\(\rangle\) and \(\langle\)SF + BT\(\rangle\) corpora, namely slightly less than 0.10 in both weighted F-measure and MCC. Conversely, adding the data from the \(\langle\)SF\(\rangle\) or \(\langle\)BT\(\rangle\) corpora to the \(\langle\)CVE\(\rangle\) corpus does not lead to any noticeable change, as also confirmed by the Nemenyi test (\(p\gt 0.05\) for both metrics). Interestingly, although with minimal (less than \(\sim\)0.01) and no significant differences (\(p\gt 0.05\) for both metrics), the models trained on the \(\langle\)CVE + SF\(\rangle\) corpus experienced a small drop in the median performance for both weighted F-measure and MCC when \(\langle\)BT\(\rangle\) is added. In the end, \(\langle\)CVE + SF\(\rangle\) and \(\langle\)CVE + SF + BT\(\rangle\) have been found to be the best corpora on which the models should train, reaching up to 0.48 weighted F-measure (0.40 on a median) and 0.35 weighted MCC (0.26 on a median). Yet, we cannot confidently determine which of the two is the best option since the models obtained comparable performance with negligible and non-statistically significant differences.
The weighted precision and recall (Figures
6 and
7) help better comprehend the F-measure scores observed. The precision distributions are mainly centered around 0.43, though their variance appears higher when the data from the
\(\langle\)CVE
\(\rangle\) corpus is not involved. In other terms, the central tendency seems only slightly affected by the textual content used to describe the instances, but the same does not happen for the variance—i.e., without data from the
\(\langle\)CVE
\(\rangle\) corpus, the models behave very differently in terms of precision. The best results were obtained by the models trained on the
\(\langle\)CVE + SF
\(\rangle\) corpus, reaching 0.46 on a median. The situation is somewhat different when looking at the weighted recall (Figures
6 and
7). The distributions are noticeably different, with much wider variances; this means that the recall metric is highly subject to the specific model rather than the kind of textual data used. The most “contradicting” results were seen when the text from
\(\langle\)BT
\(\rangle\) corpus is involved, where the median weighted recall is noticeably lower than the mean, and the boxes are wider. Such a scenario indicates the presence of many models having very low recall, i.e., models tending to avoid predicting
true (i.e., “exploitable”), and models that predominantly predicted
true, which easily raises their recall. The models that did not use the
\(\langle\)CVE
\(\rangle\) corpus scored less than 0.25 weighted recall on a median. Once adding
\(\langle\)CVE
\(\rangle\), the central tendency between the corpora becomes more equalized (
\(p\gt 0.05\)).
All these results indicate that the CVE description alone accounts for the majority of the performance [
4,
45]. The text from
SecurityFocus can provide additional information that further boosts the performance without changing the general trend. Despite not giving useful information on its own, the text from
BugTraq does not hinder the predictions when mixed with text from other sources.
4.2 The Performance of Different Learning Configurations (RQ\(_2\))
Once we determined the best corpora to train the prediction models, we investigated the performance scored by the experimented learning configurations to answer
RQ\(_2\). We subdivided it into
RQ\(_{2.1}\) and
RQ\(_{2.2}\) to have focused analyses on the models built with the traditional learning pipeline and those leveraging end-to-end pre-trained LLMs. For this analysis, we chose to focus on the models trained and tested on the
\(\langle\)CVE + SF
\(\rangle\) corpus as its content determined the best models overall. The raw results for the other corpora can be found in our online appendix [
49].
To answer
RQ\(_{2.1}\), we analyze the ML models built with the traditional pipeline, which is made of three key parts: (1) feature representation, (2) training data balancing, and (3) learning algorithm. Figure
8 provides a broad overview of the F-measure scores obtained by the six learning algorithms on the 12 training settings made by the combination of the four feature representation schemas and the three data balancing algorithms. We observe that all models under every training setting followed one general pattern: the F-measure steadily increases—net of sporadic drops—from the 1st round to the 12th, i.e., the point where almost all models achieve the best score of 0.97. Yet, all models start dropping their performance from that round on, reaching their lowest point of less than 0.16—excluding the very initial rounds. This phenomenon shows that the most recent “versions” of such models are not able to recognize exploitable vulnerabilities properly, despite seeing tens of thousands of examples during training. It seems the learning becomes less and less fruitful as the rounds go by, likely due to the difficulty of drawing a clear distinction between exploitable vulnerabilities and those not exploited yet among many examples. In other words, the text data were enough to recognize the exploitability of “historical” vulnerabilities but are less helpful for modern-day vulnerabilities. It is worth pointing out that there could be other reasons behind such a drop. For instance, we observe that the disclosure of public exploits has become less frequent, from over 15,000 in the period 2001-2010 to less than 7,500 in the period 2011-2020. This difference becomes even more relevant when looking at the number of disclosed vulnerabilities: around 42,000 in 2001-2010 and around 100,000 in 2011-2020. Thus, the number of disclosed vulnerabilities doubled in a decade while the published exploits halved. This inevitably affected the distribution of
true and
false instances, ending up with highly imbalanced test sets in the latest validation rounds.
Due to the limited reliability of F-measure for measuring the model performance when the number of
true instances is noticeably lower than the number of
false instances [
99], we also looked at the MCC metric (depicted in Figure
9) to observe whether a similar pattern occurred. Interestingly, the models achieve an MCC score around zero in the 12th round, indicating the absence of any correlation between the model predictions and the target variable (i.e., the exploitability). A lack of correlation means that the models make predictions utterly unrelated to the target variable, implying that the models perform no better than a fully random or constant classifier [9, 99]. The diverging MCC and F-measure results at the 12th round suggest that many models in that round behaved almost like a constant classifier always predicting true; this behavior benefited the F-measure, since over 95% of the test instances had the true label in the 12th round, but not the MCC, which does not reward models making one-way predictions.
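To make this divergence concrete, the minimal sketch below (not part of the original study) shows how a constant classifier always predicting true obtains a high F-measure but a zero MCC on a heavily imbalanced test set; the 95% class ratio mirrors the 12th-round test set described above.

```python
# Illustrative sketch: a constant "true" classifier on a ~95%-positive test set scores a
# high F-measure but a zero MCC, matching the 12th-round behavior discussed above.
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

rng = np.random.default_rng(0)
y_true = rng.random(1000) < 0.95               # ~95% true labels (assumed ratio)
y_pred = np.ones_like(y_true, dtype=bool)      # constant classifier always predicting true

print(round(f1_score(y_true, y_pred), 2))           # ~0.97, inflated by the imbalance
print(round(matthews_corrcoef(y_true, y_pred), 2))  # 0.0, no correlation with the target
```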
The best MCC scores were obtained around the 15th round, going beyond 0.60 MCC in the best-case scenario, i.e., when SMOTE balancing is employed. Such a score indicates the presence of a strong positive correlation, which suggests the model performs well. This is further confirmed by the good F-measure scores, reaching 0.80 when SMOTE is used. Thus, the 15th round provides a definitely better trade-off than the 12th. All the models in the 15th round were trained on all vulnerabilities disclosed until December 2008 and tested on those disclosed until August 2011. We observe that the amount of true and false test instances is more balanced (around 50% for both), indicating that the exploitability prediction task was easier than it is today—indeed, in the last round the number of true instances in the test set is just 5%. Unfortunately, after this round, all the models meet a demise similar to the one seen for the F-measure: they slowly converge to no more than 0.12 MCC in the last round.
Looking deeper at the effect of feature representation schema on the F-measure trends (Figure
8), we observe that the models based on the document-term matrix (i.e., BoW, TF, and TF-IDF) share the same general trends once a data balancing technique is applied. In particular, we observe that the effect of a balancer looks the same in all three schemas, favoring and hindering the same classifiers. For instance, the KNN classifiers draw many benefits from an oversampled training set (i.e., SMOTE). Moreover, all the classifiers follow closely similar trends when SMOTE is used. On the contrary, the models have highly diversified trends with document embeddings (i.e., DE), standing out from the other three feature representation schemas. The KNN classifier can still be seen benefiting from the use of SMOTE, though with lower performance than in the other schemas. The MCC trends (Figure 9) exhibit a similar effect, though with less diversification, i.e., the effect of the feature representation schema and the data balancing is smoother, particularly with TF and TF-IDF. Interestingly, the training sets undersampled with NearMiss determined models with negative MCC scores (reaching less than -0.3 in several cases), indicating the presence of moderate negative correlations between the model predictions and the target variable; yet, this happened only in the initial validation rounds, so it does not imply a general negative impact of this balancer.
We used the weighted metrics to determine the models that achieved the best results across all rounds. We found that the best model overall was a Logistic Regression classifier (LR) using the TF feature schema and with a training set oversampled by SMOTE, scoring 0.49 weighted F-measure and 0.36 weighted MCC and touching 0.82 F-measure and 0.65 MCC in the 15th round. In particular, we observed that most learning algorithms had their best F-measure and MCC scores with TF and SMOTE. The story is slightly different when looking at the precision and recall. We found that the learning algorithm with the highest weighted recall was KNN, reaching 0.80 with BoW and SMOTE—0.08 higher than the score obtained with TF and SMOTE. Conversely, the Random Forest (RF) achieved the highest weighted precision score, reaching 0.65 with TF-IDF without data balancing—only 0.04 higher than the score obtained with TF and SMOTE. In the end, we observed a generally positive trend with TF and SMOTE, though maximizing a specific metric might require a specific learning configuration.
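For reference, a minimal sketch of this best-performing traditional configuration (TF features, SMOTE oversampling, Logistic Regression) could look as follows; the library choices and the plain term-frequency weighting are our assumptions, not the study's verbatim implementation.

```python
# Hedged sketch of the TF + SMOTE + Logistic Regression configuration. The TF schema is
# approximated with TfidfVectorizer(use_idf=False), which yields term-frequency weights
# without the IDF component; imblearn's Pipeline applies SMOTE only at fit time.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

model = Pipeline([
    ("tf", TfidfVectorizer(use_idf=False)),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# train_texts/test_texts would hold the <CVE + SF> documents of a given validation round:
# model.fit(train_texts, train_labels)
# predictions = model.predict(test_texts)
```

Using imblearn's Pipeline also keeps the oversampling confined to the training data, consistently with the precaution on data balancing discussed in the threats to validity.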
Lastly, we looked at the performance scored by the four baseline models, i.e., the random (RND), the pessimistic (PES), the optimistic (OPT), and the stratified (STR) classifiers. Among the four, the best baseline classifier was PES, achieving 0.37 weighted F-measure and 0.26 weighted precision. We remark that its MCC is always zero, as the true and false negatives are always zero, while its recall is always maximum (i.e., one) for the same reason. Such results show that the experimented prediction models do make meaningful predictions, as the best model, i.e., the Logistic Regression (with TF and SMOTE), outperforms PES by 0.12 and 0.34 in weighted F-measure and precision, respectively. Nevertheless, PES outperforms all models in the 12th round in terms of F-measure. Indeed, due to the large presence of true instances in the test set, the PES model has a very low probability of making false positive predictions, obtaining high precision in return and boosting the F-measure—thanks to the recall score fixed at one. In any case, its performance drops in all the other rounds, where the test instances have less imbalanced distributions.
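The four baselines can be reproduced, for instance, with scikit-learn's DummyClassifier; the mapping of names to strategies below is our reading of the descriptions above, not the study's exact code.

```python
# Hypothetical sketch of the four baseline classifiers: PES always predicts "exploitable"
# (true), OPT always predicts "neutral" (false), RND guesses uniformly at random, and
# STR samples predictions following the class distribution seen in the training set.
from sklearn.dummy import DummyClassifier

baselines = {
    "RND": DummyClassifier(strategy="uniform", random_state=42),
    "PES": DummyClassifier(strategy="constant", constant=True),
    "OPT": DummyClassifier(strategy="constant", constant=False),
    "STR": DummyClassifier(strategy="stratified", random_state=42),
}
```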
To answer
RQ\(_{2.2}\), we analyze the models made with the pre-trained LLMs. We found that, with the exception of a few sporadic rounds, all models tend to act like perfect optimistic classifiers, i.e., always predicting
“neutral” (having
false label). Therefore, the F-measure (Figure
10) turned out to be extremely low (the weighted aggregated score did not go beyond 0.01 with
CodeBERTa) as a consequence of the recall being always zero, due to the absence of any true prediction. Such behavior had an extremely positive impact on the accuracy (Figure
11), on which all the models achieved very high performance as the rounds went on; this happened because of the scarce number of
true instances in the test sets of the later rounds.
The round with the most interesting performance in the \(\langle\)CVE + SF\(\rangle\) corpus is the 12th, the same round in which the traditional ML models achieved the highest F-measure scores. In this round, the CodeBERTa model reached 0.51 F-measure thanks to a quasi-perfect precision, i.e., 0.98, which happened because the large number of true instances in the 12th test set minimized the chances of making false positive predictions. Nevertheless, given the peculiarity of the 12th test set, it is not clear whether CodeBERTa successfully learned something in this round or was just making random predictions (with 33% true predictions and 67% false ones). Such behavior did not happen only for the \(\langle\)CVE + SF\(\rangle\) corpus but also for all the other corpora, though with different “fortunate” rounds.
Ultimately, we can conclude that the pre-trained LLMs could not learn anything from the training phases—except for the few “fortunate” rounds—despite the large amount of data available. The only models that apparently learned something were CodeBERTa and CodeBERT, both of which had experienced source code text during their pre-training stage; yet, they both tend to behave like the other models in later rounds. We believe the reasons can be attributed to the lack of a massive pre-training on text containing the typical vocabulary of the security domain, as well as to inadequate data preprocessing for the experimented models.
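For completeness, the sketch below outlines how one of these pre-trained LLMs could be fine-tuned for binary exploitability prediction with the Hugging Face transformers library; the checkpoint name and hyperparameters are assumptions and do not reproduce the exact setup used in the study.

```python
# Hedged sketch: fine-tuning an assumed CodeBERTa checkpoint as a binary classifier.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "huggingface/CodeBERTa-small-v1"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    # truncate the concatenated <CVE + SF> text to the model's maximum input length
    return tokenizer(batch["text"], truncation=True, padding="max_length")

# train_ds is assumed to be a datasets.Dataset with "text" and "label" columns:
# args = TrainingArguments(output_dir="out", num_train_epochs=3,
#                          per_device_train_batch_size=16)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds.map(tokenize, batched=True))
# trainer.train()
```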
4.3 Further Analysis
The time-aware validation setting allowed us to observe how the models behave at different points in time where the training and test sets had diversified compositions. We wanted to shed light on the composition of both training and test sets to comprehend the possible reasons for the model to make such predictions further. Thus, we employed a dimensionality reduction technique based on the
Singular Value Decomposition (SVD) [
33], which projects the data into a lower dimensional space using matrix factorization. Such a technique is better known as
Latent Semantic Analysis (LSA, a.k.a.
Latent Semantic Indexing, LSI) [
27,
31] when adapted for highly sparse data, like the textual data represented with BoW, TF, and TF-IDF. Essentially, this technique forms a lower-dimensional “semantic space” of a given size—typically vastly lower than the number of terms—where instances sharing similar concepts are mapped to the same cluster, also handling cases of synonymy and polysemy of terms.
In our case, we chose to build a semantic space of two dimensions to allow plotting in a 2D space and inspecting how the training and test instances are distributed. Specifically, we focused on the training and test sets employed in the most interesting rounds that emerged from the model performance analysis. Therefore, we inspected the 15th and the 22nd rounds due to their contrasting performance; the former achieved the best trade-off between F-measure and MCC, and the latter had the worst performance overall. In continuity with the previous analyses, we focused on the \(\langle\)CVE + SF\(\rangle\) corpus and opted for visualizing the document-term matrices made with the TF schema as it was the schema that had the best results overall.
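A minimal sketch of this projection, assuming the TF document-term matrix is built with scikit-learn, is shown below; the toy documents only stand in for the \(\langle\)CVE + SF\(\rangle\) texts of a round.

```python
# Sketch of the 2D LSA projection: TruncatedSVD is fitted on the training document-term
# matrix only, and the test instances are projected into the same semantic space.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["buffer overflow in http server",          # placeholder documents
               "cross site scripting in login form",
               "sql injection in admin panel",
               "denial of service via crafted packet"]
test_texts = ["heap overflow in image parser"]

tf = TfidfVectorizer(use_idf=False)                        # TF schema, as used above
X_train = tf.fit_transform(train_texts)
X_test = tf.transform(test_texts)

lsa = TruncatedSVD(n_components=2, random_state=42)
train_2d = lsa.fit_transform(X_train)                      # fit on training data only
test_2d = lsa.transform(X_test)                            # reuse the fitted space
```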
Figures
12 and
13 show the scatter plots of the training and test instances drawn from the 15th and the 22nd rounds, respectively. Both plots depict the large number of instances involved in training and testing. We immediately observe that, in all cases, the “exploitable” (true) and “neutral” (false) instances are somewhat “intermixed”. This could explain why many models had trouble telling the two classes of instances apart and thus opted to behave like constant classifiers in several cases. Nevertheless, we cannot exclude that the visualization algorithm failed to faithfully preserve the differences among the instances, even though it is a recommended technique for visualizing data with textual features. Such an aspect is worth further investigation. Looking deeper at the training and test sets of the 15th round, we observe that the two share a similar arrangement. This might indicate that the models, once trained, did not face a substantially different problem in the test phase, which might be the main reason for the good performance obtained in that round. On the contrary, the “neutral” (false) training and test instances of the 22nd round share the same arrangement, but the “exploitable” (true) instances do not. Indeed, the arrangement in the test phase seems like a subset of the arrangement seen at training time. This reduced presence of true instances at testing time is likely one of the reasons why the models increased their false positive rate (i.e., false instances deemed as true).
5 Discussion and Implications
The results achieved in our study shed light on several aspects that have implications for both the research community and practitioners, as discussed in the following.
Searching for a Reliable Ground Truth. The results reported in Section
4 revealed the noticeable performance drop that affected all the models—including the baselines such as the pessimistic classifier—as the validation rounds proceeded. In this respect, we made two key observations: (1) the F-measure score is directly proportional to the number of true instances (i.e., “exploitable”) appearing in the test set, independently of the composition of the training set; (2) the MCC scores revealed the existence of several rounds with positive correlations between the model predictions and the target variable. Similar findings were encountered only in related research applying a time-aware validation framework [2, 17]. Yet, such works only reported the score aggregated over all rounds, while we opted for a hybrid strategy, presenting the aggregated (weighted) scores and focusing on those rounds exhibiting particular behaviors. We suspect that one of the main reasons behind such results lies in the strategy adopted to build the ground truth. In this work, we relied on the
Exploit Database because of its good reputation and popularity among researchers in exploitability prediction [
2,
34,
50,
85]. Nevertheless, we observed a noticeable reduction in the publication rate of exploited vulnerabilities—i.e., half as many exploits were released in the period 2011-2020 as in the previous decade, while the rate of disclosed vulnerabilities doubled. Indeed, the Exploit Database seems to have been struggling to keep pace with newly disclosed vulnerabilities in recent years. This could imply that either exploits are disclosed less frequently than before or the rate of new vulnerabilities is too high to keep up with; in both cases, this phenomenon makes the Exploit Database progressively less reliable for building a solid ground truth. In this respect, our study constitutes a baseline for future re-evaluations with alternative data sources to build better ground truths [
2,
34]. Indeed, many other sources point to instances of exploits (or tentative exploits) observed in the wild. For example,
Symantec Attack Signatures collect traces of attackers’ attempts via intrusion detection systems.
29 In the context of
Project Zero, Google gathers 0-day exploits observed in the wild, enriched with a detailed root cause analysis.
30 The US
Cybersecurity & Infrastructure Security Agency (CISA) curates the
Known Exploited Vulnerabilities (KEV) catalog, containing hundreds of reports of exploited vulnerabilities.
31 Due to their newness, the size of such datasets is still limited (e.g.,
Google Project Zero has only 69 entries as of February 2024), though their increasing popularity should address this problem eventually, making them suitable for large-scale analyses like ours. We envision a
triangulation of multiple strategies to improve the reliability and quality of the labeling process. To this end, there is a need for novel monitoring solutions that automatically discover “silent” exploits on the web and map them to the related vulnerabilities; thus, the exploitability prediction models could rely on a wider and continuously growing knowledge base about real-world exploits.
Classifying Vulnerabilities for Fine-grained Inspections. Our large-scale analysis involved all vulnerabilities disclosed in NVD until November 03, 2021. In our work, we did not make any distinction between vulnerabilities, e.g., analyzing web-based and memory-related vulnerabilities separately, but treated all of them as equal. Many factors concerning vulnerabilities inevitably influence any prediction activity, especially exploitability prediction. In particular, how the distribution of vulnerability types varies over time could be one such factor influencing the prediction performance observed. To investigate this aspect, we assigned each vulnerability a category guided by its CWE. Namely, we re-mapped the given CWE according to the
“Simplified Mapping” view provided by CWE itself.
32 Such a step was meant to reduce the many weakness types into a more reasonable set of categories, decreasing their number from 180 to 89. Since this number was still large, we further assigned new categories based on our knowledge of the 89 CWE types resulting from the first re-mapping, ending up with ten broader categories like
“Authentication and Authorization” and
“Resource Protection”. Figure
14 shows the distribution trend of such categories over time (over the 22 splits of the dataset). We can immediately observe that, as the years pass, the precise type of vulnerability assigned becomes clearer. Indeed, during the initial periods of the CVE system, most vulnerabilities had their CWE not specified (“NVD-CWE-noinfo” or “NVD-CWE-Other”), resulting in many CVEs falling into the “Other” category, particularly during the first 12 rounds. Such a lack of information does not allow us to easily understand whether the vulnerability types could have played a role in the performance drop we observed at the later rounds. Future work should replicate this analysis on subsets of vulnerabilities to triangulate the issue that affected the model performance.
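As an illustration only, the two-step re-mapping can be thought of as a lookup from a (simplified) CWE identifier to one of the ten broad categories; the entries below are hypothetical examples, not the study's actual table.

```python
# Hypothetical excerpt of the CWE-to-category lookup; real CWE identifiers, assumed grouping.
BROAD_CATEGORY = {
    "CWE-287": "Authentication and Authorization",   # Improper Authentication
    "CWE-862": "Authentication and Authorization",   # Missing Authorization
    "CWE-400": "Resource Protection",                # Uncontrolled Resource Consumption
    "NVD-CWE-noinfo": "Other",
    "NVD-CWE-Other": "Other",
}

def categorize(cwe_id: str) -> str:
    """Map a re-mapped CWE identifier to its broad category, defaulting to "Other"."""
    return BROAD_CATEGORY.get(cwe_id, "Other")
```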
Engineering the Learning Configuration. As observed in the context of
RQ\(_2\), the four feature representation techniques had a relevant impact on the overall models’ performance. In this study, we relied on widespread settings to set up the text pre-processing pipeline and to configure the
doc2vec model without carrying out a profound empirical investigation. Indeed, our goal was to assess the key differences among the main techniques employed when working with textual data. Hence, our work does not declare the best feature representation technique on all fronts but rather encourages the evaluation of alternative learning configurations and techniques. The Latent Semantic Analysis (LSA) [
27,
31] used to reduce the dimensionality of the document-term matrices (Section
4.3) is a candidate technique for representing the textual features that traditional ML models can use, acting somewhat similarly to embedding strategies like
doc2vec. As regards the
time-aware validation setting, we followed Liu et al.’s [
63] approach by considering a deployment setting in which the knowledge base grows over time. In a different way, the work by Bullough et al. [
17] employed a “sliding window” to train only on a limited set of past data, i.e., only those that are temporally closer to the testing data. The rationale of their choice is that recent data might represent the reality better than older data, as some characteristics might have changed over time—i.e., a
concept drift has occurred [
97]. For instance, the style and content of discussions in
BugTraq written in 2010 might differ from those of 2000, negatively affecting the models when learning the relations between the textual features and the target variable. The traditional sliding window approach completely ignores older instances during the training based on a pre-determined or moving threshold (i.e., the “window size”). Alternatively, we could also assign a lower
weight to older instances, using a “decaying window”, so that the learners would give less importance to the old instances that likely induced the models into error. In addition, by employing a mechanism to assess the quality of the online discussions, e.g., their readability [
89] or the amount of their informative content [
22], we could assign higher weights to “good” instances and instruct the models to give more attention to them during the training, hopefully improving their overall capabilities.
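A minimal sketch of the decaying-window idea, with an illustrative exponential decay and an assumed half-life, is given below; learners exposing a sample_weight parameter (e.g., scikit-learn's LogisticRegression) could consume these weights directly.

```python
# Sketch of a "decaying window": older training instances get exponentially lower weights
# instead of being dropped. The half-life value is an arbitrary illustration.
import numpy as np

def decaying_weights(ages_in_days, half_life_days=3 * 365):
    """Return per-instance weights that halve every `half_life_days` of age."""
    ages = np.asarray(ages_in_days, dtype=float)
    return 0.5 ** (ages / half_life_days)

# ages = days between each instance's disclosure date and the round's training date
# clf.fit(X_train, y_train, sample_weight=decaying_weights(ages))
```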
On the Practical Usages of Early and Realistic EPMs. Adopting an
early exploitability prediction model provides many advantages in assessing the severity of newly discovered vulnerabilities. Let us consider a scenario where a software project adopts one. When a new vulnerability is discovered, either by internal or external parties, the developers report the issue to MITRE and request the allocation of a CVE record, where they explain the issue found. Once third-party experts verify the issue, the vulnerability is officially disclosed in a CVE record containing the first official description in natural language. Such a description can be directly fed into the early exploitability prediction model to readily generate an initial assessment of that vulnerability. Besides, if other free-form commentaries are already available—e.g., via
GitHub issues—the prediction model can integrate those pieces of information to boost the prediction accuracy further, as we also observed during the analyses for
RQ\(_1\). Should the model flag the vulnerability as potentially exploitable, the developers can take specific countermeasures [
56,
86] to (i) address the vulnerability earlier than other issues, (ii) release the software version containing the patch quickly, and (iii) adopt a better communication strategy to recommend the users to install the update as soon as possible. It is worth remarking that any countermeasure adopted in this sense is meant to hasten the
vulnerability remediation process, not to replace other forms of security assessment like “late” exploitability prediction models or the CVSS analysis. In this respect,
early assessment can also be used to support the security analysts in charge of making the CVSS measurement, who can rely on an additional “opinion” when it comes to judging the vulnerabilities’ nature. In particular, bringing forward the assessment of just-disclosed vulnerabilities facilitates the
prioritization of all the security issues found until that moment. Indeed, developers can better understand which issue requires more attention than others and allocate adequate resources accordingly in the hope of reducing the duration of the exposure window and, therefore, the risk of being attacked, as explored by Jacobs et al. with EPSS [
52]. Afterward, as soon as new information on the vulnerabilities becomes gradually available, e.g., a Proof-of-Concept is disclosed, developers can progressively leverage more reliable solutions to adjust the prioritization of their interventions, such as
Evocatio fuzzer [
53]. In such a mechanism, we envision that models based on
early data can represent the first step of a prioritization pipeline that takes advantage of the strengths of existing solutions as soon as the information they use becomes available. Although the performance observed at the latest validation rounds does not yet support the practical usefulness of such early models, we believe this work acts as a cornerstone for determining the feasibility of early vulnerability assessment, aiming to channel more attention to this topic and express its full potential.
Early Predictions and Beyond. Our empirical investigation did not aim to provide a cutting-edge exploitability prediction model but rather to evaluate its performance with early data and in a realistic scenario. We acknowledge the existence of models that achieved better results in the literature [
14,
45,
85]; yet, many of them did not consider the precautions indicated by [
17] or those we adopted in this work. In this respect, we believe that replicating previous work under a realistic validation setting is necessary to estimate the models’ real effectiveness. Moreover, we envision a combination of all existing models, both
early and
late, to develop an
incremental exploitability prediction system, i.e., an integrated framework that provides the best possible predictions according to the information available at a given time. For instance, after discovering a new vulnerability (day zero), the incremental system would just rely on the short description and the initial online discussions—as we have presented in this paper; on the day the experts make in-depth analyses, the system will consider all the features obtainable from the CVSS vector to further improve its predictive power—acting as “late” models. Such a solution may express its full usefulness in the case of borderline classifications, i.e., when the early predictions fall too close to the decision threshold, making the model unsure about the appropriate class to assign. In such scenarios, the system might recommend waiting for additional data, such as the CVSS exploitability metrics, before providing a more trustworthy response. Furthermore, this framework could be integrated with an
impact prediction module that estimates the harm that the potential exploitation of that vulnerability could cause to the confidentiality, integrity, and availability of the targeted asset. This additional piece would cover the second component of the CVSS base metrics, i.e., the “Impact” metrics, assisting human experts in gaining a broad understanding of the risks connected to keeping a vulnerability unfixed.
6 Threats to Validity
Threats to Construct Validity. We mined the full content of the National Vulnerability Database (NVD) combined with the CVE List to collect all the known vulnerabilities disclosed before November 03, 2021, being careful to avoid the inclusion of malformed and rejected CVEs. We did not perform an extensive manual validation of the retrieved dataset to detect possible curation errors, such as a CVE incorrectly pointing to an external reference related to a different vulnerability. However, we are confident that the considered data sources are reliable since both databases are known to maintain high-quality data. Besides, we deliberately focused only on the URLs labeled as BID or BugTraq for three reasons: (1) URLs of this kind point to well-known sources where developers used to discuss vulnerabilities way before their official disclosure; (2) the pointed websites were easy to mine—a common HTML parser sufficed—to gather the required data; (3) developing a generic script able to mine all the thousands of different websites reachable from the CVEs would have been impractical. We handled the shutdown of both Security Focus and BugTraq using the Wayback Machine service and the SecLists archive to recover the missing links with the CVE records. Nevertheless, we cannot guarantee the absence of missing or incorrect links caused by the Wayback Machine or by the imprecision of the pattern-matching heuristic employed to reach the right page on SecLists. In any case, the approach of early prediction models is not strictly bound to BID and BugTraq references, and it can be adapted to any other source of online discussions with just minor tweaks.
The text of the online discussions contained much irrelevant content, such as e-mail addresses, PGP signatures, and hex numbers. We applied regular expressions to capture these patterns and remove them, improving the quality of the document corpus. In addition, we adopted the recommended pre-processing steps for natural language text to allow the feature representation techniques to learn a compact and representative vocabulary. We are aware that our text cleaning procedure may not have been complete and could have left other forms of noise, such as partial code snippets; yet, to the best of our knowledge, there are no tools able to capture partial code elements for any programming language; hence, we opted not to implement an ad-hoc solution as it would have required dedicated effort and extensive validation.
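The patterns below give a simplified idea of this cleaning step; they are illustrative examples rather than the exact regular expressions used in the study.

```python
# Illustrative noise-removal patterns for the online discussions: e-mail addresses,
# PGP signature blocks, and hexadecimal numbers are replaced with whitespace.
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PGP_BLOCK = re.compile(
    r"-----BEGIN PGP SIGNATURE-----.*?-----END PGP SIGNATURE-----", re.DOTALL)
HEX_NUMBER = re.compile(r"\b0x[0-9a-fA-F]+\b")

def clean(text: str) -> str:
    for pattern in (PGP_BLOCK, EMAIL, HEX_NUMBER):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```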
When assigning the labels to the instances in our dataset, we carefully avoided labeling all instances outside the context of the
time-aware validation. To this end, we followed the strategy proposed by Jimenez et al. [
54], assigning more realistic labels to the instances at each round. Specifically, we labeled as “exploitable” (
true class) the instances in the training set that were exploited before the
training date—i.e., the latest publication date in the training set—and we marked as “exploitable” the test instances only if they were exploited before the date of the last vulnerability published in that round.
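In code, the round-wise labeling rule can be sketched as follows; the attribute names and data layout are assumptions.

```python
# Sketch of the time-aware labeling: an instance is marked "exploitable" (true) only if
# its exploit was already public at the reference date of the set it belongs to.
from datetime import date
from typing import Optional

def exploitable(exploit_date: Optional[date], reference_date: date) -> bool:
    return exploit_date is not None and exploit_date <= reference_date

# training_date = latest CVE publication date in the round's training set
# round_date    = publication date of the last vulnerability in the round's test set
# y_train = [exploitable(v.exploit_date, training_date) for v in train_vulns]
# y_test  = [exploitable(v.exploit_date, round_date) for v in test_vulns]
```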
Threats to Internal Validity. The investigation for RQ\(_1\) analyzed the impact caused by the seven corpora created from the three data sources considered, i.e., CVE, Security Focus, and BugTraq, and their combinations. The combination consisted of applying a string concatenation to create the four combined corpora before creating the document-term matrices or fitting the doc2vec model—this determined different feature spaces for each corpus. Not only did the analysis help understand the impact of each corpus, but it also showed that text-driven early prediction models can be employed with any source available, though with noticeably different performance. In other words, it is not mandatory for a vulnerability to be disclosed via CVE before running the predictions: the entire procedure can be applied to any kind of text explaining the issue.
As indicated by Bullough et al. [
17], several exploitability prediction models in literature had some issues in their machine learning setup. First, we adopted a
time-aware validation setting to simulate a realistic production scenario in which the prediction model is periodically re-trained and deployed. We deliberately avoided a fully-random cross-validation as our data had time relations among them; indeed, training on “future” data to predict data belonging to the “past” would have generated inflated and misleading results. Moreover, we were careful to avoid applying the feature encoding and data balancing (where applied) on the test data, but only on the training set made up at each iteration of the
time-aware validation. Indeed, such bad practices would produce overly optimistic results, as the models would learn information from data that should be left completely unseen before the testing phase because they represent instances to predict in a real deployment scenario [
6].
To perform the time-aware validation, we split the dataset into several folds, each made of the vulnerabilities disclosed in a time span of about one year and a half—precisely, 532 days. We started the splitting from the last published CVE in 2021 and “jumped” back in time to form the 22 folds. The size of such a time span was determined by the 90th percentile of the exploitation time distribution, i.e., the duration of the uncertainty window. We made this choice to observe how the models behave when the uncertainty window is made of totally different sets of vulnerabilities. We acknowledge that there exist different ways to create the folds, such as equally splitting the dataset by the number of CVEs. The results we obtained are also subject to the choice we made for the size of the uncertainty window. We chose the 90th percentile of the exploitation time distribution as it represents a largely sufficient time to let exploits emerge; the average exploitation time, i.e., 194 days, appeared too short for this purpose. We are aware that different, and perhaps more appropriate, widths of the uncertainty window could determine more valid results, which would be worth exploring with dedicated further analyses.
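The fold construction can be sketched as follows, assuming each vulnerability exposes its publication date; this mirrors the backward 532-day windows described above.

```python
# Sketch of the backward time splitting: starting from the most recent CVE, each fold
# collects the vulnerabilities disclosed within a 532-day window, moving back in time.
from datetime import timedelta

def time_folds(vulns, window_days=532, n_folds=22):
    """`vulns` is assumed to be a list of objects with a `published` date attribute."""
    end = max(v.published for v in vulns)
    folds = []
    for _ in range(n_folds):
        start = end - timedelta(days=window_days)
        folds.append([v for v in vulns if start < v.published <= end])
        end = start
    return list(reversed(folds))    # oldest fold first, i.e., round 1 up to round 22
```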
When generating the document embeddings from the training corpora with
doc2vec, we used the configuration that provided the best results in previous work [
45] and other settings recommended for doc2vec models, e.g., setting the embedding size to 300 or using the
Distributed Bag-of-Words variant. The results achieved by this feature representation technique might change if different configurations are employed.
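A minimal gensim sketch matching the configuration mentioned above (300-dimensional vectors, DBOW variant) is shown below; the remaining hyperparameters and the toy corpus are assumptions.

```python
# Sketch of the document-embedding setup: Doc2Vec with the Distributed Bag-of-Words
# variant (dm=0) and 300-dimensional vectors; the training corpus is a toy placeholder.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

train_tokens = [["buffer", "overflow", "in", "http", "server"],
                ["sql", "injection", "in", "login", "form"]]
documents = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(train_tokens)]

model = Doc2Vec(documents, vector_size=300, dm=0, min_count=1, epochs=20)
embedding = model.infer_vector(["cross", "site", "scripting"])   # embed an unseen document
```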
Threats to External Validity. We used the
Exploit Database as the main source to build our ground truth (i.e., to label the CVEs with
true and
false) because of its reliability and completeness. Nevertheless, it only stores Proof of Concepts (PoC) and exploits publicly released by their authors without tracing any exploit observed in the wild (e.g., via attack signature detection) as done by
Symantec Attack Signatures or Google Project Zero’s
0-days-In-The-Wild (described in Section
5). Therefore, our models can only predict whether an exploit will be publicly released, without generalizing to other forms of exploitation. Moreover, the observed results hinted at possible flaws in our ground truth that caused the model performance degradation. Thus, integrating multiple data sources could improve the quality of the ground truth and, hopefully, the performance as well.
The exploitability prediction models experimented with in this work target disclosed vulnerabilities. This choice was driven by the fact that the metadata for such vulnerabilities is available for initial assessments. Hence, the models cannot estimate the exploitability of 0-day vulnerabilities since these are supposed to be unknown to the developers or any other party involved in taking care of the system's security. Predicting the risk of undergoing 0-day exploits would likely require monitoring accesses to the application or other suspicious actions, relying on principles different from those recalled in this work [
44]. Nevertheless, the prediction of exploits to known and disclosed vulnerabilities can also be used as a proxy indicator for estimating the exploitation of other unknown vulnerabilities in the system sharing commonalities with those already disclosed [
13]. Furthermore, the experimented models are meant to predict the event of future exploitation for individual vulnerabilities, in line with all the related work presented in Section
2. Hence, the models cannot predict attacks concerning multiple vulnerabilities or chains of exploits. Achieving such a goal is indeed feasible, though it might require more mature models to make accurate predictions for individual vulnerabilities.
Threats to Conclusion Validity. For all 504 models, we computed multiple metrics capturing the classifiers’ performance from different points of view, reducing the risk of drawing erroneous conclusions. In particular, we deliberately did not consider the accuracy—except for observing the scarce performance of the LLMs—as it produces largely inflated results, leading to overly optimistic conclusions on imbalanced problems. In this paper, we largely relied on the F-measure and MCC metrics to observe how the models performed. The rest of the raw results are in the online appendix [
49].
We aggregated the results of the 22 validation rounds with the weighted average to have a single number representing the overall performance and facilitate the comparison. We preferred this aggregator over other popular choices, such as the simple average or the median, because the 22 validation rounds are not equivalent representations of the same problem. Indeed, in the 15th and 22nd rounds, all models achieved utterly different performance; nevertheless, the exploitability prediction problems of those two rounds cannot be directly compared as they are separated by the events that occurred in ten years. Therefore, we assigned more weight to the rounds having wider training sets. We also looked at the aggregated scores obtained using the simple average and the median. For instance, the model that had the largest weighted F-measure under the \(\langle\)CVE + SF\(\rangle\) corpus (i.e., Logistic Regression with TF-IDF and oversampled training data) would score 0.54 with the simple average and 0.64 with the median, noticeably higher than the 0.49 weighted score. We believe such inflated values do not accurately describe the overall model performance, motivating the use of a weighted aggregator. Still, we opted to closely inspect the model trends over the 22 validation rounds to avoid drawing conclusions from a single aggregated value.
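Under our reading of this choice (weights proportional to the training set size of each round), the aggregation can be sketched as follows.

```python
# Sketch of the weighted aggregation across the 22 rounds; the assumption is that each
# round's score is weighted by the number of training instances available in that round.
import numpy as np

def weighted_score(round_scores, training_set_sizes):
    scores = np.asarray(round_scores, dtype=float)
    weights = np.asarray(training_set_sizes, dtype=float)
    return float(np.average(scores, weights=weights))
```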
The Latent Semantic Analysis (LSA) employed in Section
4.3 allowed us to inspect the arrangement of the training and test instances at key validation rounds. We opted for this technique due to its suitability for textual data based on document-term matrices, which are known to generate a sparse feature space [
31]. We were careful to fit the semantic space only on the training data to prevent it from being influenced by future data that was supposed to be unseen at that time. In other words, the test data were projected in the same semantic space previously fitted on the corresponding training set.
7 Conclusion
This paper presented a large-scale empirical evaluation of the effectiveness of early exploitability prediction models relying on the data available in a just-disclosed vulnerability, comparing 72 learning configurations, involving six traditional ML classifiers, four feature representation schemas, and three data balancing settings, as well as five pre-trained LLMs. All models were evaluated in the context of a time-aware validation setting representing a realistic scenario where the models are periodically re-trained and deployed. Additionally, we handled possible issues connected to an unrealistic and eager assignment of labels by employing a special data cleaning strategy.
The results showed that CVE descriptions alone suffice, but the addition of online discussions from Security Focus further boosts the performance of any model. The best combination of feature representation and data balancing was with TF and SMOTE in the majority of the cases. The best classifier depends on the performance metric: the Logistic Regression achieved the best F-measure and MCC scores, the Random Forest maximized the precision, and the KNN had a quasi-perfect recall. Unfortunately, pre-trained LLMs did not achieve the expected performance, requiring further pre-training in the security domain. Nevertheless, all models fell victim to the same phenomenon, i.e., a noticeable drop at later validation rounds, likely due to the large imbalance in the test sets.
Future research directions include the experimentation of novel mechanisms to build a more reliable and sound ground truth—e.g., by combining multiple data sources—or of alternative learning configurations to improve the performance of early exploitability prediction models. We envision possible further developments to make exploitability prediction more powerful and useful, such as employing an incremental exploitability prediction system to guide the choice of which countermeasures to apply when a new vulnerability is published, e.g., helping to decide which vulnerability must be addressed before others. Such a system can also be integrated with an impact estimation module to provide a full overview of the risk connected to a newly found vulnerability. From a different perspective, we hypothesize that exploitability prediction modeling can be further improved by retrieving peculiar information from the software systems affected by just-disclosed vulnerabilities and by using fine-grained text analysis tools that extract relevant elements from unstructured text, such as code snippets or stack traces, to obtain a more relevant feature space from which the models can better learn.