1 Introduction
Software vulnerabilities represent serious threats to software dependability, allowing malicious users to compromise the software's confidentiality, integrity, or availability [
39,
69]. Vulnerabilities require specific methods and techniques to be detected [
7,
62] and removed [
18] promptly to avoid undesired consequences [
32]. However, not all vulnerabilities are the same, as their exploitation may require different expertise and may have variable consequences. Since the number of newly discovered vulnerabilities increases rapidly [
38], both vendors (i.e., owners of vulnerable products) and clients are interested in obtaining reliable feedback on the risk that a vulnerability may be exploited. The availability of such feedback is useful for various tasks, including prioritizing security verification [
36,
81] and identifying suitable preventive actions to reduce the risk of an attack [
30], such as the replacement of vulnerable constructs or APIs [
47,
55].
Researchers have envisioned novel methods to estimate the dangerousness of newly discovered vulnerabilities. The most basic mechanisms are driven by the
CVSS (Common Vulnerability Scoring System) “Base” score assigned. In particular, a vulnerability receiving a higher CVSS
“Base” score is deemed more severe than others. To a certain extent, it is considered to have a higher chance of being exploited soon. Unfortunately, such an assumption does not always hold: a higher severity score does not necessarily imply a more likely exploitability, nor do lower scores denote less risk of being exploited [
1]. For instance, this is the case of the infamous
Heartbleed bug (CVE-2014-0160), whose CVSS 2.0
“Base” score was 5.0 (translating into medium risk), far lower than other vulnerabilities that were never observed to be exploited in the wild [
32]. The upgraded version of CVSS, i.e., version 3.0, added a number of new metrics to capture additional aspects that were neglected in the previous version and corrected some inaccuracies. The new and improved
“Base” score formula could address scenarios like the one observed for
Heartbleed. Yet, this new version still cannot be used as a proxy for the exploitability risk. Indeed, CVE-2014-6049 received a CVSS 3.0
“Base” score of 2.7 (i.e., low risk), but was exploited less than a week after its disclosure. Such a score was even lower than the CVSS 2.0
“Base” score, i.e., 5.5. Therefore, it is unfortunate to note that CVSS 3.0 does not solve all the issues affecting version 2.0. Moreover, there is no significant correlation between CVSS
“Base” score and the exploitability risk of a vulnerability. This alarming lack of correlation also happens for the more specific
“Exploitability” and
“Impact” sub-scores—i.e., the partial metrics required to compute the final
“Base” score. Further detail about the correlation between the CVSS metrics and the risk of exploitability is available in the online appendix of this paper [
49]. In summary, the values of the CVSS metrics assigned to the vulnerabilities affecting a system cannot provide a useful estimation of the probability of the system being attacked.
An alternative approach to gain insights into the risks associated with a vulnerability consists of adopting various strategies based on machine [
14,
17,
85] and deep [
45] learning models. The goal is to predict whether a new vulnerability will be exploited, either labeling it as
“likely exploitable” or estimating a probability of exploitation [
51] using a number of predictors from different data sources. In this respect, researchers have been using many sources of information connected to a vulnerability, ranging from its brief description contained in the
CVE (Common Vulnerabilities and Exposures) record—i.e., a data structure containing all the information linked to a disclosed vulnerability—to online mentions from social media or the dark web [
2,
85]. The vast majority of the proposed models rely on the complete CVE information that was obtainable when the datasets used for the experiments were built, hence also including data that became available days or weeks after the official disclosure of a new vulnerability—for instance, CVE-2020-0583 received the CVSS
“Base” score only 10 days after the disclosure. Indeed, as soon as a new CVE is added to the official database, it is only provided with a short description and at least one public reference, as required by CVE [
35]. Thus, all the CVSS scores, the appropriate
CWE (Common Weakness Enumeration) and
CPE (Common Platform Enumeration) that have been leveraged so far for exploitability prediction tasks cannot be used
before the results of the in-depth analysis made by security experts are available. During the period between the vulnerability disclosure and the experts' analysis—which can last up to about two months—the existing prediction models cannot obtain these data from anywhere, and therefore are inoperable in practice. Developers are hence left disoriented, without any estimate of the risk associated with the new vulnerability. However, the phase immediately following the disclosure of a new vulnerability is the most alarming, as practitioners must take countermeasures as soon as possible; therefore, even a small hint of its possible exploitability would be beneficial to dedicate the right effort to its resolution. We recognize the need for an exploitability analysis of software vulnerabilities to provide developers and security experts with preliminary information that could be used to assess the dangerousness of a newly disclosed vulnerability and immediately take appropriate preventive actions to limit the potential harm of an attack.
In this work, we aim to investigate the effectiveness of early exploitability prediction models that exclusively rely on what we refer to as “early data”, i.e., the data already available in the CVE record of a just-disclosed software vulnerability, to determine whether it will be exploited in the future. An early prediction model leverages only those pieces of information that were already published before the disclosure date, i.e., the short description and the referenced online discussions, e.g., mailing lists and security advisories, and in no way relies on further analyses performed by security experts or additional data to be retrieved from the vulnerable system. Our goal is to experiment with early exploitability prediction modeling to assess whether and to what extent the dangerousness of a vulnerability can be estimated with sufficient confidence as soon as the vulnerability is disclosed for the first time.
To achieve our goal, we first collect all known vulnerabilities and exploits in the National Vulnerability Database(NVD) and the Exploit Database(EDB), respectively, at the time we started this study, and enrich them with the data coming from the online discussions mentioning them. Such data are joined in seven combinations to build the text corpus from which the prediction models extract the textual features. Then, we experiment with a total of 72 models, made from the combination of six different machine learning (ML) algorithms, three data balancing settings, and four different ways to encode unstructured text into features, and we investigate the employment of five pre-trained Large Language Models (LLMs), to determine which is the best solution to employ for this kind of task.
In addition, we evaluate all these models in a
realistic scenario, i.e., simulating a real software production environment to investigate whether they can be effective in practice. To do this, we validate the models pretending to build and deploy them at different points in time, i.e.,
reference dates, and we assess how their performance changes with the evolution of the software history. In particular, we apply a
time-aware validation mechanism, sorting the full dataset of CVEs by disclosure date, and splitting it at the reference dates. At each round of validation, we use all the data before the reference date to build the dataset fold for the round—this simulates the models’ deployment scenario in which all the information available from the past is leveraged for the learning, and the base of knowledge grows over time. To evaluate the models, we split each fold into training and test sets, ensuring that the vulnerabilities in the test set are published after those falling into the training set. Then, we adopt a data labeling strategy similar to the one explained by Jimenez et al. [
54], i.e., we mark the training instances as
“exploitable” only if they were exploited before the train-test split date, and we mark the test instances as
“exploitable” if they were exploited before the reference date. However, we recognize that not all the information collected from the past is always reliable. In fact, since vulnerabilities are exploited over time, the CVEs published close to the reference date that were not exploited yet cannot be confidently labeled as
“neutral”, as not enough time has elapsed for the first exploits to arise. Therefore, we further remove from the dataset fold those vulnerabilities falling into such an “uncertainty window” with no associated exploit yet. This data labeling strategy—detailed in Section
3—aims to provide a more realistic evaluation of the early exploitability prediction models.
The results show that the text from online discussions, particularly from SecurityFocus, can significantly boost the effectiveness of prediction models that leverage only the initial CVE description. All traditional ML models reached their best performance with vulnerability data prior to 2010. Oversampling the training data generally benefits all models, which achieve their best performance when the features are weighted by their frequency (i.e., Term Frequency). The classifier with the best trade-off is Logistic Regression, reaching a weighted F-measure of 0.49 and a weighted MCC of 0.36. The most precise classifier is Random Forest, while the one with the highest recall is KNN. The pre-trained LLMs used as-is failed to perform well, behaving like constant classifiers that always predict “neutral”.
Based on the results obtained, we envision a set of research directions that aim to (i) improve the quality of exploitability prediction models with particular attention to data quality—i.e., the ground truth choice, the feature representation, and the like—and (ii) integrate exploitability prediction models into existing vulnerability assessment pipelines to better support security analysts. The experimented prediction models only assess the exploitability of publicly disclosed vulnerabilities, without addressing undisclosed vulnerabilities due to the inaccessibility of the information that enables the prediction [
13]. Our key contributions can be summarized as follows:
(i)
An evaluation method to assess the effectiveness of early exploitability prediction models that exclusively rely on the data available in a CVE record at the time of its public disclosure;
(ii)
A data cleaning strategy to remove those data instances that cannot be labeled with sufficient confidence because not enough time has passed since the vulnerability disclosure date.
(iii)
A data collection procedure to mine and aggregate online discussion data referenced by the external links in the CVE records, resulting in a novel dataset that researchers can use for further analyses.
(iv)
An empirical comparison of how different configurations of an ML pipeline can influence the performance of an early exploitability prediction model, involving 72 traditional learning configurations—built from six learning algorithms, three balancing settings, and four feature representation techniques—as well as five pre-trained LLMs used as-is.
(v)
A
comprehensive dataset—which we publicly release [
49]—containing information about disclosed vulnerabilities (CVEs), the linked initial mentions in online sources like
Security Focus and
BugTraq, and their public proof-of-concept exploits.
(vi)
A
reproducible pipeline for training, validating, and analyzing all the ML models implemented as a collection of
Python scripts, made available in an
online appendix [
49].
In this paper, we focus on the feasibility of building early exploitability prediction models, evaluating their performance in a realistic setting. We highlight that our work is not intended to provide practitioners with a ready-to-use solution, but rather to take the first steps into the empirical investigation of early exploitability prediction, which can be finally put into practice with further research effort spent by the community.
Structure of the paper. Section
2 presents background information on the life cycle of software vulnerabilities, in addition to discussing the related literature and the limitations we aim to address. Section
3 reports the research questions driving our work and the methods we applied to address them, while Section
4 presents the results we observed. Section
5 further elaborates on the insights of the study and the implications for researchers and practitioners. The potential threats to the validity of the study are discussed in Section
6. Finally, Section
7 concludes the paper, outlining our future research agenda.
3 Empirical Study Design
In Section
3.1, we formulate the goal of our study according to the
Goal-Question-Metric (GQM) template [
95]; from this goal, we distilled our
research questions (
RQs). In Section
3.2, we describe the steps we followed to collect the data needed to fuel the prediction models. In Section
3.3, we report the details of the models that were selected to participate in our experiments, and in Section
3.4 we explain how we trained and evaluated them. Lastly, in Section
3.5 we focus on the implementation details regarding the models, and we describe the infrastructure we employed to run our experiments.
3.1 Study Goal and Research Questions
The goal of this empirical study was to investigate the performance of machine learning (ML)-based classifiers to predict the exploitability of a just-disclosed vulnerability, with the purpose of providing early feedback on the exploitability of new vulnerabilities in a realistic scenario. The perspective was of both practitioners and researchers. The former are interested in obtaining as much information as possible to (i) have an initial assessment supporting the CVSS measurement and (ii) understand when and how the vulnerabilities afflicting their applications must be handled. The latter are interested in (i) comprehending the predictive capabilities of textual data—obtained from online discussions about the vulnerabilities written in natural language—represented in different ways and (ii) assessing the effectiveness of different learning configurations.
To the best of our knowledge, the current research in exploitability prediction has always involved all the data available in CVE databases at the time of the study when building the dataset used by the experimented classifiers. This scenario caused the models to rely on subjective measures like the CVSS scores to make their predictions. However, in a real-world scenario, such information is only made available at a non-negligible distance from the CVE disclosure date [
35]—as of 2021, 20 days after the disclosure on average. We hypothesize that a realistic prediction model should only leverage
early data, i.e., those pieces of information available at the
disclosure time, as soon as a vulnerability is made public through a CVE record.
Therefore, our empirical investigation focused on employing ML algorithms to train classification models that recognize potentially exploitable vulnerabilities using exclusively early data. We observed that whenever a new vulnerability is discovered, the ordinary mechanism to raise awareness consists of opening a free commentary about it on public mailing lists or other similar discussion channels [
5,
19]. Such
data sources contain valuable information that may provide additional insight into the seriousness and impact of the vulnerability, e.g., a crash stack trace [
90], the reproduction steps [
20], or even code snippets showing that an exploit is feasible in principle (a.k.a.
Proofs of Concept, PoCs) [
92]. Whilst involving data from multiple sources could provide a more comprehensive perspective of the problem [
103,
106], the effect on the accuracy of the prediction models is not always positive [
67]. Indeed, each source might contain clearly contradictory information [
25] that would not allow the models to distinguish between “exploitable” and “neutral” vulnerabilities. Hence, we are interested in investigating how different combinations of data from multiple sources affect the models’ performance. We asked:
Extracting from vulnerability data sources the relevant information that allows the models to recognize exploitable vulnerabilities is not straightforward. Not only do such sources predominantly contain
unstructured text, but no automated mining tool exists that accurately extracts the relevant pieces of information, e.g., by isolating code snippets. The available solutions work only with traditional bug reports and for specific programming languages, such as
Infozilla [
10] that only works for bug reports in
Java. We must rely on text processing techniques that automatically determine the features from a corpus of natural language text [
11] to let the prediction models learn from unstructured text.
There are many ways to achieve such a task, which we group into two main categories: (1) those encoding the tokens found in the corpus—after adequate pre-processing—as individual features, and (2) those that learn how to represent a given text as an
embedding. In the first case, once the textual features, e.g., tokens or words, are extracted, they are weighted according to different mechanisms, such as by counting the number of times a certain feature appears in a document in the corpus, or by measuring the frequency of that token over the entire corpus [
8]. The number of resulting features cannot be controlled directly and heavily depends on the actual content of the documents in the corpus and on the pre-processing steps taken before, e.g.,
stemming. In the second case, the features do not represent specific textual elements, but they are “latent variables” that the model infers from the input corpus [
59,
71]. Unlike the first category of techniques, the size of the embeddings is generally decided upfront, before launching the embedding algorithm. The choice of the specific text representation technique can greatly impact the models’ performance [
28,
43].
Furthermore, we believe other elements worth investigating can influence the models’ performance, such as the choice of the specific learning algorithm [
94] or the use of data balancing algorithms to deal with the imbalance between “exploitable” and “neutral” instances [
74]. Thus, we asked:
Due to the recent advancements in the field of Natural Language Processing (NLP) and the popularity of pre-trained Large Language Models (LLMs), we also wanted to assess their suitability for such a task, as experimented by Yin et al. [
100]. Such models come with pre-trained weights learned in a self-supervised manner from large corpora of text not directly related to the specific tasks, e.g., general English text and/or examples of code written in different programming languages. The advantage of such models lies in their ability to determine the representation for the input (depending on what was seen during the pre-training stage) and return the prediction in a single shot. Therefore, we formulated two sub-questions to answer
RQ\(_2\), the first one investigating the performance of several ML models made of the traditional key elements—i.e., feature representation, training data balancing, and learning algorithm—and the second one focusing on the use of end-to-end pre-trained LLMs.
3.2 Data Collection
The
context of this empirical study was made of publicly disclosed vulnerabilities accompanied by references to online discussions mentioning them, such as public mailing lists and security advisories. Our literature review found no readily available dataset with all the data we need, i.e., the text of online discussions linked to disclosed vulnerabilities and the dates on which each vulnerability was disclosed and exploited. Therefore, we adopted a systematic data collection procedure to link all the existing known vulnerabilities to public websites where they had likely been discussed for the first time before the official disclosure date—in the rest of the paper, we also use the wording “publication date” when referring to the date on which a CVE record is made accessible. Then, we mined a large set of public scripts and PoCs available online and linked them to the vulnerabilities they exploited to carry out (pseudo-)realistic attacks. Figure
2 depicts all the steps (from 1 to 7) we took to collect the data needed to answer our research questions, each detailed in the following. Table
2 summarizes the collected data, reporting how much information we could retrieve from each considered data source. The scripts to run the entire data collection procedure are available in the online appendix of this paper [
49].
3.2.1 Mining Known Vulnerability Data.
We relied on the
National Vulnerability Database (NVD),
a comprehensive catalog of disclosed vulnerabilities reported in the form of CVE records. NVD enriches the upstream
CVE List managed by MITRE
by adding the CVSS vectors, the labeling of known affected software versions via CPE, and the like. For such reasons, NVD has been the basis of many empirical studies on software vulnerabilities, being considered a reliable source of high-quality information [
48,
54,
65,
72]. Our study did not target any specific platform or programming language, so we collected all existing vulnerabilities available in NVD at the time of the study. Thus, we downloaded the full dump of NVD curated by the
CVE Search project
on November 03, 2021, gathering 148,299 CVE entries (Step 1 in Figure
2). We pre-processed this raw dataset to ensure that the data quality was suitable for our study. In particular, we filtered out those entries that were (i) malformed (e.g., the identifier did not point to an actually existing CVE record), (ii) rejected (i.e., the CVE identifier was allocated but never approved at the final stage), or (iii) lacking external references. We could recognize vulnerabilities falling in those cases by inspecting the content of the dump: malformed CVE identifiers did not adhere to the pattern
CVE-XXXX-YYYY; rejected CVEs had a clear statement of their rejection in the description; CVEs lacking external references were missing a list of links in the HTML page on
CVE List. In the end, the three filters led to the removal of just 214, 135, and 5 entries, respectively (Step 2 in Figure
2), resulting in 147,900 valid CVEs.
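The following minimal sketch illustrates how such a filtering step can be implemented; the field names (id, summary, references) and the rejection marker are assumptions about the dump format rather than the exact schema used in the study.

```python
import re

# Hypothetical raw entries mimicking the structure of a CVE dump.
raw_dump = [
    {"id": "CVE-2014-0160", "summary": "Heartbleed", "references": ["https://example.org/advisory"]},
    {"id": "CVE-BAD-ID", "summary": "** REJECT ** do not use", "references": []},
]

CVE_ID_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}$")

def keep_entry(entry):
    """Apply the three filters: malformed identifier, rejected CVE, missing references."""
    if not CVE_ID_PATTERN.match(entry.get("id", "")):
        return False                                  # (i) malformed identifier
    if "REJECT" in entry.get("summary", "").upper():
        return False                                  # (ii) rejection statement in the description
    if not entry.get("references"):
        return False                                  # (iii) no external references
    return True

valid_cves = [e for e in raw_dump if keep_entry(e)]   # keeps only CVE-2014-0160 here
```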
Since we were interested in using only the information available at the disclosure time, we retraced the change history of the CVE record stored in NVD; for each CVE we scraped the content of its descriptive HTML page in NVD, as it contains a set of tables reporting the changes made to the record and their dates. In doing so, we could fetch the original description that the CVE had at the time it was first disclosed, so that we could use it within our
early exploitability prediction models in a realistic scenario—indeed, the models should not be allowed to use information that was made available years after the disclosure. It is worth pointing out that the original date on which a CVE record was created is not reported in NVD but rather on the
CVE List website. Hence, we mined the HTML pages in the
CVE List website as well. The scraping of HTML pages of both NVD and
CVE List was supported by the
BeautifulSoup library for
Python.
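As a rough illustration of this scraping step, the snippet below fetches a CVE detail page and collects the rows of its tables with BeautifulSoup; the URL template and the way relevant rows are identified are illustrative assumptions, since the actual page layout of NVD and CVE List may differ.

```python
import requests
from bs4 import BeautifulSoup

def change_history_rows(cve_id):
    """Collect the text of all table rows on a CVE detail page (illustrative only)."""
    url = f"https://nvd.nist.gov/vuln/detail/{cve_id}"   # assumed entry point
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # The change history is rendered as a set of tables; keep rows that mention a
    # change to the description, from which the original text can be recovered.
    rows = [tr.get_text(" ", strip=True) for tr in soup.find_all("tr")]
    return [r for r in rows if "Description" in r]

print(change_history_rows("CVE-2014-0160")[:3])
```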
3.2.2 Mining Online Discussion Data.
Any CVE record is equipped with a continuously updated list of external reference URLs pointing to web pages concerning that specific vulnerability, e.g., online discussions, official patches, and bug reports. In this respect, our goal was to define an exploitability prediction model that leverages only those references available at the disclosure time, as we believe they could be the reason behind the allocation request of the CVE identifier. However, the references in the NVD dump are not provided with the date they were added to the CVE records, preventing us from knowing whether they had already been linked at the time of the record allocation or only at a later stage. We observed that the
CVE List website maintains the references in a different way: each link is labeled with a special keyword indicating its type—e.g., vendor advisory, mailing list—and its origin, i.e., the website it points to.
Hence, we analyzed 714,854
CVE List references among all the 147,900 CVE records (Step 3 in Figure
2). We observed that the most recurring keyword was
BID (
SecurityFocus Bugtraq ID), which refers to security advisories published on the
SecurityFocus website (43% of CVEs had at least one reference to this category). Such a website has long been considered a reliable source to report security bugs—each uniquely identified with a
BID—and tracks existing solutions and working exploits [
38]. Moreover, its plain HTML structure allowed the easy recovery of all the data using a simple HTML parser with the application of filters to exclude those references published after the vulnerability disclosure date. For all these reasons, we ignored the references listed in the NVD dump, navigated the
BID-labeled URLs in
CVE List, and parsed the content of the HTML in the response using
BeautifulSoup—this was the only available option to recover such data, as
SecurityFocus does not expose any accessible API. It is worth pointing out that since February 2020,
SecurityFocus has stopped publishing further
BID advisories; hence, the URLs stored in the CVE records are no longer accessible. We circumvented this limitation by exploiting the
Wayback Machine service provided by the
Internet Archive library [
3], which offers free access to many digital resources that were once available on the web. Therefore, we queried the API exposed by the service that returns an active URL—stored in its archives—having the same HTML content as the input URL (Step 4a in Figure
2).
We also observed that
BID reports are commonly related to a discussion on a public mailing list known as
BugTraq, one of the most popular discussion boards, where participants have discussed newly discovered vulnerabilities since 1993. All the discussions are held in natural language (commonly English) without following any specific text structure. The only consistent element across discussions is the header containing the original publication date. Consequently, we considered
BugTraq references in addition to those labeled with
BID.
BugTraq has encountered a similar fate to
SecurityFocus, as it was shut down in 2020; yet, we still considered it alongside
SecurityFocus as it was referenced by a non-negligible number of the CVEs we selected (14% of CVEs had at least one
BugTraq reference). All the discussions held in the past are now archived by third-party websites, such as
SecLists,
from which we downloaded all the discussions pointed to by the CVE records in our context. To recover the missing mailing list discussions, we exploited the format of the
BugTraq identifiers, made of eight digits representing the publication day according to the format
YYYYMMDD, plus a short text summarizing the content of the discussion. Both the year and the month allowed us to reach a page on
SecLists containing the list of
BugTraqs of that period, from which we retrieved the entry that had the highest similarity—using Gestalt Pattern Matching [
79]—with the short text in the
BugTraq identifier. Then, we mined the content of the matched discussions leveraging
BeautifulSoup (Step 4b in Figure
2). To summarize, 70,513 out of 147,900 CVEs had at least one
BID or
BugTraq type reference, linked to a total of 65,978
BID references and 26,387
BugTraq references. We considered these numbers sufficiently high to address the research goals of our investigation.
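Gestalt Pattern Matching is what Python's difflib implements, so the matching step can be sketched as follows; the identifier, candidate titles, and the way the short text is extracted from the identifier are hypothetical examples.

```python
from difflib import SequenceMatcher

def best_seclists_match(short_text, candidate_titles):
    """Return the candidate title most similar to the short text embedded in a
    BugTraq identifier, using Gestalt Pattern Matching (difflib)."""
    def similarity(title):
        return SequenceMatcher(None, short_text.lower(), title.lower()).ratio()
    return max(candidate_titles, key=similarity)

identifier = "20040312 Buffer overflow in FooServer"      # hypothetical BugTraq identifier
short_text = identifier[8:].strip()                       # text after the YYYYMMDD prefix
titles = ["Buffer overflow in FooServer 1.2", "Weekly digest", "XSS in BarCMS"]
print(best_seclists_match(short_text, titles))            # -> "Buffer overflow in FooServer 1.2"
```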
Afterward, we made sure to discard the text of those
BID and
BugTraq references (i) published after the CVE disclosure date or (ii) whose format did not allow to reliably recover their publication date. This happened for 10,973 out of 65,978
BIDs and for 5,368 out of 26,387
BugTraqs. Although these two filters caused some vulnerabilities not to have any
BID or
BugTraq reference, we still did not discard them entirely, as the sole original description provided in the CVE record might contain enough information to predict their exploitability, as observed in similar works [
45,
66].
As a result of the process of retrieving input data for our investigation, we finally considered three data sources, namely (i) CVE records, retrieved from
CVE List, (ii)
SecurityFocus Bugtraq ID reports, restored from the
Wayback Machine, and (iii) discussions on the
BugTraq mailing list, collected from
SecLists website. Each of these sources provided us with textual data that we combined to perform our experiments, as explained in Section
3.3.
Each text underwent a set of pre-processing steps to remove irrelevant pieces of information and facilitate encoding textual features (Step 5 in Figure
2). In particular, we first employed a set of regular expressions to detect and remove data that could negatively affect the process, such as websites, URLs, e-mail addresses, PGP signatures and messages, hex numbers, and words containing the same letter repeated at least three times in a row [
11,
45,
61]. Second, we converted the text to lowercase, removed non-alphabetic characters (punctuation and Unicode symbols), and split the remaining content into tokens using whitespace as a separator. Then, we removed English stop words [
26,
91], applied suffix stripping using Porter's stemmer [
76], and removed those terms having fewer than three characters.
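A minimal sketch of this pre-processing pipeline is shown below, assuming NLTK's stop-word list and Porter stemmer as stand-ins for the implementations used in the study; the noise-removal patterns are illustrative, not the exact regular expressions we employed.

```python
import re
from nltk.corpus import stopwords       # requires a one-time nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()
# Illustrative patterns for URLs, e-mail addresses, hex numbers, and words with
# the same letter repeated at least three times in a row.
NOISE = [
    re.compile(r"https?://\S+|www\.\S+"),
    re.compile(r"\S+@\S+"),
    re.compile(r"\b0x[0-9a-fA-F]+\b"),
    re.compile(r"\b\w*(\w)\1{2,}\w*\b"),
]

def preprocess(text):
    for pattern in NOISE:
        text = pattern.sub(" ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())        # lowercase, keep letters only
    tokens = [t for t in text.split() if t not in STOP]  # drop English stop words
    tokens = [STEMMER.stem(t) for t in tokens]           # Porter suffix stripping
    return [t for t in tokens if len(t) >= 3]            # drop very short terms

print(preprocess("Heap overflow at 0x41414141, see http://example.org AAAAAA"))
```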
3.2.3 Mining Exploit Data.
The CVE references labeled as
EXPLOIT-DB in the
CVE List point to
Exploit Database (EDB in short),
which is the most comprehensive collection of public exploits and Proofs of Concept (PoCs) that explicitly target known vulnerabilities. Navigating these references could establish the links between the vulnerabilities and their exploits. However, we observed that only a minimal fraction of CVE records had an explicit link to EDB (i.e., 7.9%), likely owing to improper curation of the CVE records—and not necessarily to an actual lack of an exploit. Hence, similarly to Bhatt et al. [
12], we rebuilt the links in the opposite direction, connecting all the exploits in EDB to the affected CVEs using the metadata contained in the exploit entries. To this aim, we downloaded the complete list of exploits as of November 03, 2021, from the official
GitHub repository
to obtain the list of valid exploit identifiers, queried the EDB website, and parsed the HTML pages—still using the
BeautifulSoup library—of each exploit to collect the target CVEs, if made explicit. Note that a single exploit or PoC may target more than one, often related, vulnerability; similarly, more than one exploit may affect a single vulnerability. In the end, a total of 47,742 exploits were collected, linked to 23,690 different CVEs, corresponding to 16.02% of the total CVEs with valid data (Step 6 in Figure
2). After connecting each CVE to its exploits, we could obtain the date on which the first exploit of the vulnerability described in the CVE was uploaded to the
Exploit Database. We collected all these dates and considered them as the
“exploitation dates” (Step 7 in Figure
2), which will be needed when building our ground truth (see Section
3.4).
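The derivation of the exploitation dates reduces to taking, for each CVE, the earliest upload date among its linked exploits; the sketch below illustrates this with a hypothetical table of exploit-to-CVE links.

```python
import pandas as pd

# Hypothetical flattened table of exploit-to-CVE links with their upload dates.
links = pd.DataFrame({
    "exploit_id": [101, 102, 103],
    "cve_id": ["CVE-2014-0160", "CVE-2014-6049", "CVE-2014-0160"],
    "uploaded_on": pd.to_datetime(["2014-04-10", "2014-09-05", "2014-04-08"]),
})

# The exploitation date of a CVE is the date of its earliest public exploit.
exploitation_dates = links.groupby("cve_id")["uploaded_on"].min()
print(exploitation_dates)
```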
3.3 Model Selection
Once we had collected all the required data, which is summarized in Table
2, we could select the prediction models subject to our experiments.
To answer RQ\(_{1}\), we first considered the textual data from the three sources selected—i.e., CVE, SecurityFocus(SF), and BugTraq(BT)—individually to assess which one was the most helpful in predicting the exploitability of the vulnerabilities. Then we combined them by means of string concatenation to understand whether data coming from multiple sources can provide the models with additional information on the vulnerabilities and improve their accuracy. Hence, we formed four combinations, i.e., \(\langle\)CVE + SF\(\rangle\), \(\langle\)CVE + BT\(\rangle\), \(\langle\)SF + BT\(\rangle\), \(\langle\)CVE + SF + BT\(\rangle\). In the end, we experimented with a total of seven different combinations of data sources, which we call corpora from now on.
Then, regarding RQ\(_{2.1}\), we experimented with 72 traditional learning configurations (sketched in code after the following list), determined by the combination of:
—
Six machine learning algorithms, opting for the most adopted for training binary classifiers, i.e., (i)
Logistic Regression(LR) [
70], (ii)
Naïve Bayes(NB) [
83],
(iii) K-nearest Neighbors(KNN) [
104], (iv)
Support Vector Machine(SVM) [
24], (v)
Decision Tree(DT) [
16], and (vi)
Random Forest(RF) [
15].
—
Four feature representation schemas, three of which encode each token found in the training set as an independent feature—i.e., the simple word counting (a.k.a. Bag-of-Words, BoW), term frequency(TF), term frequency-inverse document frequency(TF-IDF)—and one that automatically learns embeddings from the training set—i.e., doc2vec(DE).
—
Three ways for managing the data imbalance during the training stage, i.e., leaving the data untouched (
Original), over-sampling with SMOTE [
21], and under-sampling with
NearMiss (version 3) [
102].
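The configuration grid can be enumerated as in the sketch below; the specific Scikit-Learn estimators and hyperparameters (e.g., which Naïve Bayes or SVM variant) are assumptions made for illustration, not the exact settings of the study.

```python
from itertools import product

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss

classifiers = {
    "LR": LogisticRegression(max_iter=1000), "NB": MultinomialNB(),
    "KNN": KNeighborsClassifier(), "SVM": LinearSVC(),
    "DT": DecisionTreeClassifier(), "RF": RandomForestClassifier(),
}
representations = ["BoW", "TF", "TF-IDF", "DE"]                 # fitted as described in Section 3.4.4
balancing = {"Original": None, "SMOTE": SMOTE(), "NearMiss3": NearMiss(version=3)}

# 6 algorithms x 4 representations x 3 balancing settings = 72 configurations.
configurations = list(product(classifiers, representations, balancing))
print(len(configurations))   # 72
```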
Similarly, for
RQ\(_{2.2}\), we involved
five pre-trained LLMs, all based on the BERT architecture [
29]. Specifically, we selected: (i)
DistilBERT [
87], (ii) ALBERT [
57], (iii)
XLM-RoBERTa [
23], (iv)
CodeBERT [
37], and (v)
CodeBERTa [
64]. We selected such models because of their noticeably different pre-training backgrounds. Indeed, all models we selected have a general understanding of the English language, which was required as the text of CVE descriptions and the other data sources we considered, i.e.,
SecurityFocus and
BugTraq, were in English as well. Both ALBERT and DistilBERT were pre-trained on the
BookCorpus and
English Wikipedia corpora (just like the vanilla BERT), while
XLM-RoBERTa was pre-trained on the
CommonCrawl corpus, which contains text from 100 languages. Yet, vulnerability data often include elements that are not part of common English text, such as code snippets and many punctuation characters; for this reason, we also included two models with experience of programming languages, i.e.,
CodeBERT and
CodeBERTa. Such models were pre-trained on
the CodeSearchNet corpus, containing examples of methods from six programming languages, as well as their associated documentation (e.g.,
JavaDoc), generally written in English. To allow the models to be fine-tuned on our binary classification downstream task, we equipped them with a linear layer on top of the pooled output.
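A minimal sketch of this setup is shown below: HuggingFace's AutoModelForSequenceClassification attaches exactly such a linear head on top of the pooled output; the checkpoint, example input, and inference-only usage are illustrative, and the actual fine-tuning loop and hyperparameters are not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"          # the other checkpoints are listed in Section 3.5
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

inputs = tokenizer("Buffer overflow allows remote code execution.",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits             # shape (1, 2): "neutral" vs. "exploitable"
print(logits.softmax(dim=-1))
```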
To better contextualize the models’ performance, we also involved four baseline models, meant to determine the real usefulness of the non-trivial models. We selected: a Random(RND) classifier, predicting that a vulnerability is “exploitable” with 50% probability; a Pessimistic(PES) classifier, always predicting that a vulnerability is “exploitable”; an Optimistic(OPT) classifier, always predicting that a vulnerability is “neutral”; and a Stratified(STR) classifier, predicting the exploitability with a probability equal to the frequency of “exploitable” instances in the training set. Due to their nature, the baseline models ignore any feature representation and data balancing technique employed.
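These baselines map naturally onto Scikit-Learn's DummyClassifier strategies, as in the sketch below; the mapping is our own illustration rather than a statement about the study's implementation.

```python
from sklearn.dummy import DummyClassifier

baselines = {
    "RND": DummyClassifier(strategy="uniform", random_state=42),     # "exploitable" with 50% probability
    "PES": DummyClassifier(strategy="constant", constant=1),         # always "exploitable"
    "OPT": DummyClassifier(strategy="constant", constant=0),         # always "neutral"
    "STR": DummyClassifier(strategy="stratified", random_state=42),  # follows the training-class frequency
}
```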
To summarize, we experimented with a total of 567 models, i.e., 72 traditional ML models, five LLMs, and four baseline classifiers, all trained and tested on seven corpora.
3.4 Model Evaluation Framework
To ensure high realism in our experimentation, each of the 567 models was validated in the context of a
time-aware validation, which emulates a scenario where the prediction models are iteratively re-trained and validated at different
reference dates with different data. Such reference dates represent the moment at which the models would be put into production. To this end, we first had to sort all the instances (i.e., the vulnerabilities) previously collected (Section
3.2) by their
publication date, then create the folds to form the validations rounds, and finally determine the target
labels (i.e., the expected values the models should predict) to assign to each instance according to their
exploitation date, if any. Given the time-aware nature of the validation, the assignment of the labels could not be done for all the instances in a single shot, as doing so would break the realism of the validation, since exploitability data only becomes available over time. To better clarify this concept, let us suppose we wanted to validate the models in December 2005: we would not be allowed to look into any piece of information that came out after this date. For example, if a CVE had been published in 2003 and was first seen exploited only in 2006, we must treat that instance as
“neutral” in December 2005. Therefore, we assigned the labels to the instances at each round of validation, in line with the recommendation by Jimenez et al. [
54].
We also took into account other noise-introducing factors that could affect the labeling activity, which we explain in more detail in Section
3.4.1. The way we split the entire collection of vulnerabilities into folds to create the rounds for the time-aware validation is explained in Section
3.4.2, while in Section
3.4.3 we report how we built the training and test sets for each round. Lastly, in Section
3.4.5, we describe how the models’ performance was assessed. Figure
3 summarizes the entire framework with which we trained and tested the 567 models.
3.4.1 Data Cleaning Strategy.
Each instance in the dataset had to be labeled according to the presence of an exploit in
Exploit Database. The most straightforward strategy would have been to mark as
“exploitable” (
true class) those instances having at least one associated exploit and as
“neutral” (
false class) all those not having any reported exploit at all. This would have caused 16.02% of the vulnerabilities to be labeled as
“exploitable” and 83.98% as
“neutral”. However, such an approach is improper in the context of a realistic validation as it would produce a large number of instances with inappropriate labels. Let us consider the case of CVE-2020-14340, published in June 2021. At the time of this part of the study, i.e., with a reference date of November 2021, only five months had passed since its publication, so it was quite expected that an exploit was not yet present in
Exploit Database, as not enough time had passed since its publication to observe the first public exploit. As a matter of fact, the average time between the disclosure and the first exploitation of a vulnerability is 194 days, i.e., more than half a year. Marking as
“neutral” such a recently-disclosed vulnerability would be
too eager, causing an overabundance of
false labels. We hypothesize that if we concede some more time to make the exploits emerge, we could label the instances more confidently. In other words, if recently disclosed vulnerabilities have not been seen exploited yet, we cannot deem them as
“exploitable” or
“neutral” with sufficient confidence. Thus, we decided to completely remove those instances from the validation round having the reference date of November 2021 to avoid introducing data with noisy labels—following an analogous strategy adopted by Garg et al. [
42]. It is worth pointing out that such instances should not be used either as training data or as test data because, in the first case, they would inflate the number of
false instances, while in the second case, they would distort the models’ real performance.
We observed that the number of instances that risk being labeled improperly strongly depends on the amount of time we are willing to “concede” for exploits to emerge. Let us consider the case of CVE-2020-25649, which had been published six months before CVE-2020-14340 (seen in the previous example). Similarly to the previous case, half a year was not enough to let its first exploit manifest—indeed, this is even below the average exploitation time of 194 days. To minimize the risk of having improperly labeled instances among our training and test data, we selected the 90th percentile of the exploitation time distribution—corresponding to 532 days (about one year and a half)—as the “tolerance period” we concede for exploits to manifest. Specifically, given the reference date \(D_i\) in a validation round \(R_i\), we applied our cleaning strategy to all vulnerabilities that had been published within 532 days from \(D_i\). All vulnerabilities within this uncertainty window that were not exploited before \(D_i\) were completely excluded from the round \(R_i\). All the vulnerabilities that passed the cleaning step and those outside the uncertainty window were labeled according to the presence of an exploit in the Exploit Database reported before \(D_i\).
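The cleaning and labeling rule can be summarized in a few lines of code, as in the sketch below; the DataFrame column names (published_on, first_exploited_on) are assumptions for illustration.

```python
from datetime import timedelta
import pandas as pd

TOLERANCE = timedelta(days=532)    # 90th percentile of the exploitation-time distribution

def clean_and_label(cves, reference_date):
    """Drop unexploited CVEs inside the uncertainty window; label the rest by the
    presence of an exploit published before the reference date."""
    exploited = cves["first_exploited_on"].notna() & (cves["first_exploited_on"] <= reference_date)
    in_window = cves["published_on"] > (reference_date - TOLERANCE)
    kept = cves[~(in_window & ~exploited)].copy()      # remove the uncertain instances
    kept["exploitable"] = exploited[kept.index]
    return kept

cves = pd.DataFrame({
    "published_on": pd.to_datetime(["2019-01-10", "2021-06-01"]),
    "first_exploited_on": pd.to_datetime(["2019-05-02", pd.NaT]),
})
print(clean_and_label(cves, pd.Timestamp("2021-11-03")))   # the unexploited 2021 CVE is removed
```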
3.4.2 Validation Round Creation.
To determine the number of folds into which the dataset must be divided, thus forming the validation rounds, we used the duration of the uncertainty window set before (Section
3.4.1). Namely, starting from November 2021 (the time of this part of the study), we repeatedly went “back in time” by 532 days at a time until the date on which the first vulnerability in the dataset was published (i.e., 1989). In this way, we ended up with 22 folds, each made of vulnerabilities published in a 532-day time span. For instance, the 22nd split consists of all the CVEs published between May 19, 2020, and November 2, 2021, while the 21st split consists of the CVEs published between December 4, 2018, and May 18, 2020, and so forth. Such a splitting allowed us to evaluate the models’ behavior when the uncertainty window is made of wholly different sets of vulnerabilities. Figure
3 shows an example of what happens within each round of the time-aware validation. We observe that the 22nd round, i.e., the last one, corresponds to the case in which we use the entire dataset of vulnerabilities mined in this work to train and test the models.
3.4.3 Training & Test Set Preparation.
At each validation round
\(R_i\), we used all folds from 1 to
i to form the dataset of round
\(R_i\). Then, we applied the data cleaning strategy described in Section
3.4.1 to all the vulnerabilities admitted in round
\(R_i\), and we created the training and test sets using a
time-aware 80/20 splitting, i.e., placing the first 80% of the instances in the training set and the remaining 20% in the test set, ensuring that all the training instances were published before all the test instances. Afterward, we could proceed with the labeling strategy described by Jimenez et al. [
54]. Namely, we marked the training instances as
“exploitable” if and only if they were exploited before the
training date, which corresponds to the
latest publication date among all the training instances, while we marked as
“exploitable” the test instances if and only if they were exploited before the reference date of round
\(R_i\).
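A compact sketch of this split-and-label step follows; it assumes a pandas DataFrame sorted by publication date with the same hypothetical columns used in the previous sketch.

```python
def time_aware_split_and_label(fold, reference_date, train_fraction=0.8):
    """80/20 time-aware split followed by the labeling strategy of Jimenez et al.:
    training instances are 'exploitable' only if exploited before the training date
    (the latest publication date in the training set); test instances are
    'exploitable' if exploited before the reference date of the round."""
    cut = int(len(fold) * train_fraction)
    train, test = fold.iloc[:cut].copy(), fold.iloc[cut:].copy()
    training_date = train["published_on"].max()
    train["exploitable"] = train["first_exploited_on"].notna() & \
                           (train["first_exploited_on"] <= training_date)
    test["exploitable"] = test["first_exploited_on"].notna() & \
                          (test["first_exploited_on"] <= reference_date)
    return train, test
```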
3.4.4 Model Training & Testing.
At each validation round
\(R_i\), all the vulnerabilities in the
i-th training and test sets were linked with the content of each of the seven corpora (explained at the beginning of Section
3.3)—hence, forming seven “variants” of the
i-th training and test sets.
The five pre-trained LLMs and the four baseline classifiers were trained and tested at this stage without any further processing. On the contrary, the 72 machine learning configurations required additional processing according to the selected feature representation schema and data balancing algorithm. Consequently, for each variant of the
i-th training set, we fit the four feature representation techniques selected (Section
3.3), i.e., word counting (BoW), term frequency (TF), term frequency-inverse document frequency (TF-IDF), and
doc2vec (DE). For the first three schemas, i.e., BoW, TF, and TF-IDF, we built three
document-term matrices, where each row represents an instance and the columns represent all the tokens (using white space as the separator) found in the associated textual content. The values inside each cell are weighted depending on the specific schema (summarized formally after the list):
(i)
BoW assigns to the ij-th cell the number of times the j-th term appears in the i-th document;
(ii)
TF assigns to the ij-th element the number of times the j-th term appears in the i-th document, divided by the total number of times the j-th term appears in the corpus;
(iii)
TF-IDF assigns to the ij-th element the TF value multiplied by the IDF (inverse document frequency) of the j-th term, which is computed as the logarithm of the total number of documents divided by the number of documents in which the j-th term appears, hence lowering the weight of terms appearing in too many documents.
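Following the definitions stated in the list above (note that this TF normalization differs from some common library defaults), the three weighting schemes can be written as
\[
\mathrm{BoW}_{ij} = f_{ij}, \qquad \mathrm{TF}_{ij} = \frac{f_{ij}}{\sum_{k} f_{kj}}, \qquad \mathrm{TF\text{-}IDF}_{ij} = \mathrm{TF}_{ij} \cdot \log \frac{N}{\lvert \{\, d_k : f_{kj} > 0 \,\} \rvert},
\]
where \(f_{ij}\) is the number of occurrences of the j-th term in the i-th document and \(N\) is the total number of documents in the corpus.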
Therefore, each instance was represented as a numeric vector, which we used to represent the associated vulnerability. We observe that the final number of features varies among the seven corpora as they have different terms appearing in them; thus, each of the seven variants of the
i-th training set ended up having different dimensionalities. On the other hand, the fourth schema, i.e., DE, learns a predetermined number of features via a neural network, which learns how to represent all the documents in an unsupervised manner. We chose the
Distributed Bag-of-Words (PV-DBOW) variant, setting the size of the embeddings to 300, following the configuration that provided the best results in related work [
45] and keeping all the other settings to the default recommendations for
doc2vec. At this point, the two groups of features (tokens and embeddings) extracted from the
i-th training set were reused as-is to determine the feature space of all the corresponding seven “variants” of the
i-th test set without any modification to avoid data leakage [
6]. To do this, on the one hand, we added the test instances to the document-term matrices fitted at training time and weighted the new instances with BoW, TF, and TF-IDF; on the other hand, we fed the fitted
doc2vec model with the test instances to obtain their embedding in the same space learned at the training time. Lastly, all variants of the
i-th training set obtained so far were balanced with two algorithms, i.e., SMOTE over-sampling and
NearMiss (version 3) under-sampling.
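The fit-on-train/transform-on-test discipline and the balancing step can be sketched as follows with Scikit-Learn, Imbalanced-Learn, and Gensim; the toy corpus, labels, and hyperparameters (other than the 300-dimensional PV-DBOW embeddings) are illustrative, and TfidfVectorizer's term-frequency normalization may differ in detail from the weighting described above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from imblearn.over_sampling import SMOTE
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

train_texts = ["buffer overflow in the http parser", "remote code execution via crafted packet",
               "stack overflow triggered by long header", "cross site scripting in login form",
               "sql injection in search parameter", "denial of service through malformed request"]
y_train = [1, 1, 0, 0, 0, 0]                      # toy labels: 1 = exploitable, 0 = neutral
test_texts = ["remote buffer overflow exploit"]

bow = CountVectorizer()
X_train = bow.fit_transform(train_texts)          # vocabulary learned from the training set only
X_test = bow.transform(test_texts)                # test instances projected onto the same space

tf = TfidfVectorizer(use_idf=False)               # term-frequency weighting (no IDF component)
X_train_tf = tf.fit_transform(train_texts)

docs = [TaggedDocument(t.split(), [i]) for i, t in enumerate(train_texts)]
d2v = Doc2Vec(docs, dm=0, vector_size=300, min_count=1, epochs=20)    # PV-DBOW embeddings
test_embedding = d2v.infer_vector(test_texts[0].split())

X_balanced, y_balanced = SMOTE(k_neighbors=1).fit_resample(X_train, y_train)   # over-sampling
```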
3.4.5 Performance Assessment.
Once we had obtained the predictions of all the 567 models on the test sets across all the 22 validation rounds, we derived the
confusion matrices reporting the True/False Positive and True/False Negative predictions. From them, we computed the performance metrics commonly adopted for the binary classification task, i.e., accuracy, precision, recall, and F-measure [
77]. The F-measure represents an aggregation of precision and recall, both crucial for evaluating binary classifiers [
8]. The trade-off between such values is particularly tricky in the context of exploitability prediction, as practitioners might wish for higher precision to identify the potentially exploitable vulnerabilities correctly but also for high recall to avoid false negatives—i.e., vulnerabilities considered safe but exploitable. However, the F-measure does not consider the number of true negative instances, i.e., the neutral vulnerabilities correctly classified as
“neutral”. The problem of exploitability prediction is highly imbalanced, so we were interested in evaluating all four quadrants of the confusion matrix. To this end, we also involved
the Matthews Correlation Coefficient (MCC) [
68], an indicator of the correlation between the predicted values and the actual labels of the instances that, unlike other traditional metrics, takes into account the class imbalance in the test set.
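For reference, all of these metrics can be computed directly from the predictions with Scikit-Learn, as in the toy example below.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

y_true = [1, 0, 0, 1, 0, 0, 0, 1]    # toy ground truth (1 = exploitable, 0 = neutral)
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]    # toy predictions

scores = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
    "mcc":       matthews_corrcoef(y_true, y_pred),
}
print(scores)
```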
We computed the selected performance metrics on all the 22 time-aware validation rounds. Consequently, the models had 22 scores of a given metric, which did not allow for direct comparison. Hence, we carried out two kinds of analyses. First, we aggregated the results observed in all 22 iterations of the time-aware validation using a
weighted average, assigning a weight proportional to the size of the training set used in an iteration. In other words, the 22 scores were not treated equally, as (i) the initial iterations faced a problem that is less representative of today’s situation, and (ii) the amount of data the model worked with in the initial iterations was lower. Assigning equal weights to all the iterations, as in a simple average, would have provided unrealistic and inflated results, rewarding the models that behaved well in most iterations rather than in the most recent—and, therefore, significant for today’s practitioners—ones. We exploited the aggregated scores to depict the box plots of each of the seven corpora, to highlight the distribution of the performance of the models trained and tested using a given corpus. Moreover, we leveraged the Friedman test [
40] to discover whether the seven distributions exhibit statistically significant differences (
\(\alpha = 0.05\)). When a difference was observed, we conducted the Nemenyi post hoc test [
73] to identify the pairs of corpora having noticeable differences—indeed, the null hypothesis states that the compared groups have the same distribution. Such a test is robust to repeated comparisons and does not require the tested distributions to be normal. All of this was needed to answer
RQ\(_1\). Afterward, we plotted how the model performance varied over the 22 iterations, with the validation rounds on the
x axis and the score of a given performance metric on the
y axis. Such plots allowed us to observe the models’ general “trend” from different perspectives. Analyzing the trends was needed to answer both
RQ\(_{2.1}\) and
RQ\(_{2.2}\).
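The aggregation and statistical testing can be sketched as follows; the per-round scores and fold sizes are hypothetical placeholders for the actual results.

```python
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Hypothetical per-round scores of one model on three corpora (22 rounds in the study).
rounds = pd.DataFrame({
    "CVE":    [0.31, 0.42, 0.40, 0.38],
    "SF":     [0.22, 0.30, 0.29, 0.27],
    "CVE+SF": [0.35, 0.47, 0.45, 0.44],
})
train_sizes = np.array([500, 2000, 8000, 20000])       # hypothetical training-set sizes

# Weighted average: later, larger rounds count more than the earliest ones.
weighted = rounds.apply(lambda col: np.average(col, weights=train_sizes))
print(weighted)

# Friedman test across corpora, followed by the Nemenyi post hoc test when significant.
stat, p = friedmanchisquare(rounds["CVE"], rounds["SF"], rounds["CVE+SF"])
if p < 0.05:
    print(sp.posthoc_nemenyi_friedman(rounds))
```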
The raw results of our analyses are available in the online appendix of this paper [
49].
3.5 Implementation Details and Experimental Infrastructure
The entirety of our experimentation was implemented with a collection of
Python scripts. The traditional machine learning algorithms and the baseline models used the implementation provided by
Scikit-Learn,
while the pre-trained LLMs were downloaded from
HuggingFace using its
Transformers library. In this respect, the exact pre-trained model versions we used are
distilbert-
base-
uncased,
albert-
base-
v2,
xlm-
roberta-
base,
codebert-
base, and
codeberta-
small-
v1, all implemented with
PyTorch.
The document-term matrices and the feature weighting for BoW, TF, and TF-IDF were done using
Scikit-Learn,
while the
doc2vec model was provided by the
Gensim library.
The data balancing algorithms, i.e., SMOTE and
NearMiss, were implemented with
the Imbalanced-Learn library. All the performance metrics relied on the implementation provided by
Scikit-Learn, while the statistical tests leveraged the
SciPy and
scikit-posthocs packages.
We ran the experiments involving the baseline models and machine learning algorithms on a Linux machine equipped with a quad-core 1.50 GHz processor and 32 GB of memory. The full data collection procedure took about 11 days, while the dataset cleaning, splitting, and labeling required about 13 hours. Due to the large size of the dataset and the high number of configurations to evaluate and rounds to execute, the models’ training and testing phases were considerably time- and resource-consuming, taking a total of 65 hours to complete. To experiment with LLMs, we leveraged a GPU-equipped machine via the
Vast.ai cloud GPU rental service. The GPU was an
NVIDIA RTX 3090 with 24 GB of memory, and the execution of the experiments took about 13 days to complete.
We warmly encourage replication and verification of our work. Thus, we make all the scripts available in the online appendix of this paper [
49].
4 Analysis and Discussion of the Results
In this section, we present the results obtained in our experiments to answer our research questions (presented in Section
3.1).
4.1 The Impact of Early Data Source Combinations (RQ\(_1\))
Figures
4 and
5 show the distribution of the aggregated F-measure and MCC scored by the 77 models built on the seven corpora involved in the analysis for
RQ\(_1\). We can immediately observe interesting differences among the distributions. The models trained using the
\(\langle\)BT
\(\rangle\) corpus had the worst performance, scoring less than
\(\sim\)0.15 median weighted F-measure and less than
\(\sim\)0.10 median weighted MCC. On the contrary,
\(\langle\)CVE
\(\rangle\) and
\(\langle\)SF
\(\rangle\) obtained better performance (
\(p=0.001\)), reaching
\(\sim\)0.40 and
\(\sim\)0.30 median weighted F-measure, while scoring
\(\sim\)0.25 and
\(\sim\)0.16 median weighted MCC, respectively. According to the Nemenyi test, the difference between the two corpora is statistically significant for both metrics (
\(p\lt 0.05\) for both metrics).
Then, we observe that combining data from multiple corpora led to improvements compared to using individual ones. Specifically, adding the text data from the \(\langle\)CVE\(\rangle\) corpus to the \(\langle\)BT\(\rangle\) corpus can lead to up to a \(\sim\)0.20 median improvement in both weighted F-measure and MCC (\(p=0.001\) for both metrics). A smaller median improvement, though still statistically significant according to the Nemenyi test (\(p=0.001\) for both metrics), is observed when adding \(\langle\)CVE\(\rangle\) to the \(\langle\)SF\(\rangle\) and \(\langle\)SF + BT\(\rangle\) corpora, namely slightly less than 0.10 in both weighted F-measure and MCC. Conversely, adding the data from the \(\langle\)SF\(\rangle\) or \(\langle\)BT\(\rangle\) corpora to the \(\langle\)CVE\(\rangle\) corpus does not lead to any noticeable change, as also confirmed by the Nemenyi test (\(p\gt 0.05\) for both metrics). Interestingly, although with minimal (less than \(\sim\)0.01) and no significant differences (\(p\gt 0.05\) for both metrics), the models trained on the \(\langle\)CVE + SF\(\rangle\) corpus experienced a small drop in the median performance for both weighted F-measure and MCC when \(\langle\)BT\(\rangle\) is added. In the end, \(\langle\)CVE + SF\(\rangle\) and \(\langle\)CVE + SF + BT\(\rangle\) have been found to be the best corpora on which the models should train, reaching up to 0.48 weighted F-measure (0.40 on a median) and 0.35 weighted MCC (0.26 on a median). Yet, we cannot confidently determine which of the two is the best option since the models obtained comparable performance with negligible and non-statistically significant differences.
The weighted precision and recall (Figures
6 and
7) help better comprehend the F-measure scores observed. The precision distributions are mainly centered around 0.43, though their variance appears higher when the data from the
\(\langle\)CVE
\(\rangle\) corpus is not involved. In other terms, the central tendency seems only slightly affected by the textual content used to describe the instances, but the same does not happen for the variance—i.e., without data from the
\(\langle\)CVE
\(\rangle\) corpus, the models behave very differently in terms of precision. The best results were obtained by the models trained on the
\(\langle\)CVE + SF
\(\rangle\) corpus, reaching 0.46 on a median. The situation is somewhat different when looking at the weighted recall (Figures
6 and
7). The distributions are noticeably different, with much wider variances; this means that the recall metric is highly subject to the specific model rather than the kind of textual data used. The most “contradicting” results were seen when the text from
\(\langle\)BT
\(\rangle\) corpus is involved, where the median weighted recall is noticeably lower than the mean, and the boxes are wider. Such a scenario indicates the presence of many models having very low recall, i.e., models tending to avoid predicting
true (i.e., “exploitable”), and models that predominantly predicted
true, which easily raises their recall. The models that did not use the
\(\langle\)CVE
\(\rangle\) corpus scored less than 0.25 weighted recall on a median. Once adding
\(\langle\)CVE
\(\rangle\), the central tendency between the corpora becomes more equalized (
\(p\gt 0.05\)).
All these results indicate that the CVE description alone accounts for the majority of the performance [
4,
45]. The text from
SecurityFocus can provide additional information that further boosts the performance without changing the general trend. Despite not giving useful information on its own, the text from
BugTraq does not hinder the predictions when mixed with text from other sources.
4.2 The Performance of Different Learning Configurations (RQ\(_2\))
Once we determined the best corpora to train the prediction models, we investigated the performance scored by the experimented learning configurations to answer
RQ\(_2\). We subdivided it into
RQ\(_{2.1}\) and
RQ\(_{2.2}\) to have focused analyses on the models built with the traditional learning pipeline and those leveraging end-to-end pre-trained LLMs. For this analysis, we chose to focus on the models trained and tested on the
\(\langle\)CVE + SF
\(\rangle\) corpus as its content determined the best models overall. The raw results for the other corpora can be found in our online appendix [
49].
To answer
RQ\(_{2.1}\), we analyze the ML models built with the traditional pipeline, which is made of three key parts: (1) feature representation, (2) training data balancing, and (3) learning algorithm. Figure
8 provides a broad overview of the F-measure scores obtained by the six learning algorithms on the 12 training settings made by the combination of the four feature representation schemas and the three data balancing algorithms. We observe that all models under every training setting followed one general pattern: the F-measure steadily increases—net of sporadic drops—from the 1st round to the 12th, i.e., the point where almost all models achieve the best score of 0.97. Yet, all models start dropping their performance from that round on, reaching their lowest point of less than 0.16—excluding the very initial rounds. This phenomenon shows that the most recent “versions” of such models are not able to recognize exploitable vulnerabilities properly, despite seeing tens of thousands of examples during training. It seems the learning becomes less and less fruitful as the rounds go by, likely due to the difficulty of drawing a clear distinction between exploitable vulnerabilities and those not exploited yet among many examples. In other words, the text data were enough to recognize the exploitability of “historical” vulnerabilities but are less helpful for modern-day vulnerabilities. It is worth pointing out that there could be other reasons behind such a drop. For instance, we observe that the disclosure of public exploits has become less frequent, from over 15,000 in the period 2001-2010 to less than 7,500 in the period 2011-2020. This difference becomes even more relevant when looking at the number of disclosed vulnerabilities: around 42,000 in 2001-2010 and around 100,000 in 2011-2020. Thus, the number of disclosed vulnerabilities doubled in a decade while the published exploits halved. This inevitably affected the distribution of
true and
false instances, ending up with highly imbalanced test sets in the latest validation rounds.
Due to the limited reliability of F-measure for measuring the model performance when the number of
true instances is noticeably lower than the number of
false instances [
99], we also looked at the MCC metric (depicted in Figure
9) to observe whether a similar pattern occurred. Interestingly, the models achieve an MCC score around zero in the 12th round, indicating the absence of any correlation between the model predictions and the target variable (i.e., the exploitability). A lack of correlation means that the models make predictions utterly unrelated to the target variable, implying that the models perform no better than a fully random or constant classifier [9, 99]. The diverging MCC and F-measure results at the 12th round suggest that many models in that round behaved almost like a constant classifier always predicting true; this behavior benefited the F-measure, since over 95% of the test instances had the true label in the 12th round, but not the MCC, which does not reward models making one-way predictions.
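To make this divergence concrete, the minimal sketch below (not part of the original study) shows how a constant classifier always predicting true obtains a high F-measure but a zero MCC on a heavily imbalanced test set; the 95% class ratio mirrors the 12th-round test set described above.

```python
# Illustrative sketch: a constant "true" classifier on a ~95%-positive test set scores a
# high F-measure but a zero MCC, matching the 12th-round behavior discussed above.
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

rng = np.random.default_rng(0)
y_true = rng.random(1000) < 0.95               # ~95% true labels (assumed ratio)
y_pred = np.ones_like(y_true, dtype=bool)      # constant classifier always predicting true

print(round(f1_score(y_true, y_pred), 2))           # ~0.97, inflated by the imbalance
print(round(matthews_corrcoef(y_true, y_pred), 2))  # 0.0, no correlation with the target
```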
The best MCC scores were obtained around the 15th round, going beyond 0.60 MCC in the best-case scenario, i.e., when SMOTE balancing is employed. Such a score indicates the presence of a strong positive correlation, which suggests the model performs well. This is further confirmed by the good F-measure scores, reaching 0.80 when SMOTE is used. Thus, the 15th round provides a definitely better trade-off than the 12th. All the models in the 15th round were trained on all vulnerabilities disclosed until December 2008 and tested on those disclosed until August 2011. We observe that the amount of true and false test instances is more balanced (around 50% for both), indicating that the exploitability prediction task was easier than it is today—indeed, in the last round the number of true instances in the test set is just 5%. Unfortunately, after this round, all the models meet a demise similar to the one seen for the F-measure: they slowly converge to no more than 0.12 MCC in the last round.
Looking deeper at the effect of feature representation schema on the F-measure trends (Figure
8), we observe that the models based on the document-term matrix (i.e., BoW, TF, and TF-IDF) share the same general trends once a data balancing technique is applied. In particular, we observe that the effect of a balancer looks the same in all three schemas, favoring and hindering the same classifiers. For instance, the KNN classifiers draw many benefits from an oversampled training set (i.e., SMOTE). Moreover, all the classifiers follow closely similar trends when SMOTE is used. On the contrary, the models have highly diversified trends with document embeddings (i.e., DE), standing out from the other three feature representation schemas. The KNN classifier can still be seen benefiting from the use of SMOTE, though with lower performance than in the other schemas. The MCC trends (Figure 9) exhibit a similar effect, though with less diversification, i.e., the effect of the feature representation schema and the data balancing is smoother, particularly with TF and TF-IDF. Interestingly, the training sets undersampled with NearMiss determined models with negative MCC scores (reaching less than -0.3 in several cases), indicating the presence of moderate negative correlations between the model predictions and the target variable; yet, this happened only in the initial validation rounds, so it does not imply a general negative impact of this balancer.
We used the weighted metrics to determine the models that achieved the best results across all rounds. We found that the best model overall was a Logistic Regression classifier (LR) using the TF feature schema and with a training set oversampled by SMOTE, scoring 0.49 weighted F-measure and 0.36 weighted MCC and touching 0.82 F-measure and 0.65 MCC in the 15th round. In particular, we observed that most learning algorithms had their best F-measure and MCC scores with TF and SMOTE. The story is slightly different when looking at the precision and recall. We found that the learning algorithm with the highest weighted recall was KNN, reaching 0.80 with BoW and SMOTE—0.08 higher than the score obtained with TF and SMOTE. Conversely, the Random Forest (RF) achieved the highest weighted precision score, reaching 0.65 with TF-IDF without data balancing—only 0.04 higher than the score obtained with TF and SMOTE. In the end, we observed a generally positive trend with TF and SMOTE, though maximizing a specific metric might require a specific learning configuration.
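For reference, a minimal sketch of this best-performing traditional configuration (TF features, SMOTE oversampling, Logistic Regression) could look as follows; the library choices and the plain term-frequency weighting are our assumptions, not the study's verbatim implementation.

```python
# Hedged sketch of the TF + SMOTE + Logistic Regression configuration. The TF schema is
# approximated with TfidfVectorizer(use_idf=False), which yields term-frequency weights
# without the IDF component; imblearn's Pipeline applies SMOTE only at fit time.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

model = Pipeline([
    ("tf", TfidfVectorizer(use_idf=False)),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# train_texts/test_texts would hold the <CVE + SF> documents of a given validation round:
# model.fit(train_texts, train_labels)
# predictions = model.predict(test_texts)
```

Using imblearn's Pipeline also keeps the oversampling confined to the training data, consistently with the precaution on data balancing discussed in the threats to validity.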
Lastly, we looked at the performance scored by the four baseline models, i.e., the random (RND), the pessimistic (PES), the optimistic (OPT), and the stratified (STR) classifiers. Among the four, the best baseline classifier was PES, achieving 0.37 weighted F-measure and 0.26 weighted precision. We remark that its MCC is always zero, as the true and false negatives are always zero, while its recall is always maximum (i.e., one) for the same reason. Such results show that the experimented prediction models do make meaningful predictions, as the best model, i.e., the Logistic Regression (with TF and SMOTE), outperforms PES by 0.12 and 0.34 in weighted F-measure and precision, respectively. Nevertheless, PES outperforms all models in the 12th round in terms of F-measure. Indeed, due to the large presence of true instances in the test set, the PES model has a very low probability of making false positive predictions, obtaining high precision in return and boosting the F-measure—thanks to the recall score fixed at one. In any case, its performance drops in all the other rounds, where the test instances have less imbalanced distributions.
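The four baselines can be reproduced, for instance, with scikit-learn's DummyClassifier; the mapping of names to strategies below is our reading of the descriptions above, not the study's exact code.

```python
# Hypothetical sketch of the four baseline classifiers: PES always predicts "exploitable"
# (true), OPT always predicts "neutral" (false), RND guesses uniformly at random, and
# STR samples predictions following the class distribution seen in the training set.
from sklearn.dummy import DummyClassifier

baselines = {
    "RND": DummyClassifier(strategy="uniform", random_state=42),
    "PES": DummyClassifier(strategy="constant", constant=True),
    "OPT": DummyClassifier(strategy="constant", constant=False),
    "STR": DummyClassifier(strategy="stratified", random_state=42),
}
```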
To answer
RQ\(_{2.2}\), we analyze the models made with the pre-trained LLMs. We found that, with the exception of a few sporadic rounds, all models tend to act like perfect optimistic classifiers, i.e., always predicting
“neutral” (having
false label). Therefore, the F-measure (Figure
10) turned out to be extremely low (the weighted aggregated score did not go beyond 0.01 with
CodeBERTa) as a consequence of the recall being always zero, due to the absence of any true prediction. Such behavior had an extremely positive impact on the accuracy (Figure
11), on which all the models achieved very high performance as the rounds went on; this happened because of the scarce number of
true instances in the test sets of the later rounds.
The round with the most interesting performance in the \(\langle\)CVE + SF\(\rangle\) corpus is the 12th, the same round in which the traditional ML models achieved the highest F-measure scores. In this round, the CodeBERTa model reached 0.51 F-measure thanks to a quasi-perfect precision, i.e., 0.98, which happened because the large number of true instances in the 12th test set minimized the chances of making false positive predictions. Nevertheless, given the peculiarity of the 12th test set, it is not clear whether CodeBERTa successfully learned something in this round or was just making random predictions (with 33% true predictions and 67% false ones). Such behavior did not happen only for the \(\langle\)CVE + SF\(\rangle\) corpus but also for all the other corpora, though with different “fortunate” rounds.
Ultimately, we can conclude that the pre-trained LLMs could not learn anything from the training phases—except for the few “fortunate” rounds—despite the large amount of data available. The only models that apparently learned something were CodeBERTa and CodeBERT, both of which had experienced source code text during their pre-training stage; yet, they both tend to behave like the other models in later rounds. We believe the reasons can be attributed to the lack of a massive pre-training on text containing the typical vocabulary of the security domain, as well as to inadequate data preprocessing for the experimented models.
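For completeness, the sketch below outlines how one of these pre-trained LLMs could be fine-tuned for binary exploitability prediction with the Hugging Face transformers library; the checkpoint name and hyperparameters are assumptions and do not reproduce the exact setup used in the study.

```python
# Hedged sketch: fine-tuning an assumed CodeBERTa checkpoint as a binary classifier.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "huggingface/CodeBERTa-small-v1"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    # truncate the concatenated <CVE + SF> text to the model's maximum input length
    return tokenizer(batch["text"], truncation=True, padding="max_length")

# train_ds is assumed to be a datasets.Dataset with "text" and "label" columns:
# args = TrainingArguments(output_dir="out", num_train_epochs=3,
#                          per_device_train_batch_size=16)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds.map(tokenize, batched=True))
# trainer.train()
```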
4.3 Further Analysis
The time-aware validation setting allowed us to observe how the models behave at different points in time where the training and test sets had diversified compositions. We wanted to shed light on the composition of both training and test sets to comprehend the possible reasons for the model to make such predictions further. Thus, we employed a dimensionality reduction technique based on the
Singular Value Decomposition (SVD) [
33], which projects the data into a lower dimensional space using matrix factorization. Such a technique is better known as
Latent Semantic Analysis (LSA, a.k.a.
Latent Semantic Indexing, LSI) [
27,
31] when adapted for highly sparse data, like the textual data represented with BoW, TF, and TF-IDF. Essentially, this technique forms a lower-dimensional “semantic space” of a given size—typically vastly lower than the number of terms—where instances sharing similar concepts are mapped to the same cluster, also handling cases of synonymy and polysemy of terms.
In our case, we chose to build a semantic space of two dimensions to allow plotting in a 2D space and inspecting how the training and test instances are distributed. Specifically, we focused on the training and test sets employed in the most interesting rounds that emerged from the model performance analysis. Therefore, we inspected the 15th and the 22nd rounds due to their contrasting performance; the former achieved the best trade-off between F-measure and MCC, and the latter had the worst performance overall. In continuity with the previous analyses, we focused on the \(\langle\)CVE + SF\(\rangle\) corpus and opted for visualizing the document-term matrices made with the TF schema as it was the schema that had the best results overall.
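A minimal sketch of this projection, assuming the TF document-term matrix is built with scikit-learn, is shown below; the toy documents only stand in for the \(\langle\)CVE + SF\(\rangle\) texts of a round.

```python
# Sketch of the 2D LSA projection: TruncatedSVD is fitted on the training document-term
# matrix only, and the test instances are projected into the same semantic space.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["buffer overflow in http server",          # placeholder documents
               "cross site scripting in login form",
               "sql injection in admin panel",
               "denial of service via crafted packet"]
test_texts = ["heap overflow in image parser"]

tf = TfidfVectorizer(use_idf=False)                        # TF schema, as used above
X_train = tf.fit_transform(train_texts)
X_test = tf.transform(test_texts)

lsa = TruncatedSVD(n_components=2, random_state=42)
train_2d = lsa.fit_transform(X_train)                      # fit on training data only
test_2d = lsa.transform(X_test)                            # reuse the fitted space
```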
Figures
12 and
13 show the scatter plots of the training and test instances drawn from the 15th and the 22nd rounds, respectively. Both plots depict the large number of instances involved in training and testing. We immediately observe that, in all cases, the “exploitable” (true) and “neutral” (false) instances are somewhat “intermixed”. This could explain why many models had trouble telling the two classes of instances apart and thus opted to behave like constant classifiers in several cases. Nevertheless, we cannot exclude that the visualization algorithm failed to faithfully preserve the differences among the instances, even though it is a recommended technique for visualizing data with textual features. Such an aspect is worth further investigation. Looking deeper at the training and test sets of the 15th round, we observe that the two share a similar arrangement. This might indicate that the models, once trained, did not face a substantially different problem in the test phase, which might be the main reason for the good performance obtained in that round. On the contrary, the “neutral” (false) training and test instances of the 22nd round share the same arrangement, but the “exploitable” (true) instances do not. Indeed, the arrangement in the test phase seems like a subset of the arrangement seen at training time. This reduced presence of true instances at testing time is likely one of the reasons why the models increased their false positive rate (i.e., false instances deemed as true).
5 Discussion and Implications
The results achieved in our study shed light on several aspects that have implications for both the research community and practitioners, as discussed in the following.
Searching for a Reliable Ground Truth. The results reported in Section
4 revealed the noticeable performance drop that affected all the models—including the baselines such as the pessimistic classifier—as the validation rounds proceeded. In this respect, we made two key observations: (1) the F-measure score is directly proportional to the number of true instances (i.e., “exploitable”) appearing in the test set, independently of the composition of the training set; (2) the MCC scores revealed the existence of several rounds with positive correlations between the model predictions and the target variable. Similar findings were encountered only in related research applying a time-aware validation framework [2, 17]. Yet, such works only reported the score aggregated over all rounds, while we opted for a hybrid strategy, presenting the aggregated (weighted) scores and focusing on those rounds exhibiting particular behaviors. We suspect that one of the main reasons behind such results lies in the strategy adopted to build the ground truth. In this work, we relied on the
Exploit Database because of its good reputation and popularity among researchers in exploitability prediction [
2,
34,
50,
85]. Nevertheless, we observed a noticeable reduction in the publication rate of exploited vulnerabilities—i.e., half as many exploits were released in the period 2011-2020 as in the previous decade, while the rate of disclosed vulnerabilities doubled. Indeed, the Exploit Database seems to have been struggling to keep pace with newly disclosed vulnerabilities in recent years. This could imply that either exploits are disclosed less frequently than before or the rate of new vulnerabilities is too high to keep up with; in both cases, this phenomenon makes the Exploit Database progressively less reliable for building a solid ground truth. In this respect, our study constitutes a baseline for future re-evaluations with alternative data sources to build better ground truths [
2,
34]. Indeed, many other sources point to instances of exploits (or tentative exploits) observed in the wild. For example,
Symantec Attack Signatures collect traces of attackers’ attempts via intrusion detection systems.
29 In the context of
Project Zero, Google gathers 0-day exploits observed in the wild, enriched with a detailed root cause analysis.
30 The US
Cybersecurity & Infrastructure Security Agency (CISA) curates the
Known Exploited Vulnerabilities (KEV) catalog, containing hundreds of reports of exploited vulnerabilities.
31 Due to their newness, the size of such datasets is still limited (e.g.,
Google Project Zero has only 69 entries as of February 2024), though their increasing popularity should address this problem eventually, making them suitable for large-scale analyses like ours. We envision a
triangulation of multiple strategies to improve the reliability and quality of the labeling process. To this end, there is a need for novel monitoring solutions that automatically discover “silent” exploits on the web and map them to the related vulnerabilities; thus, the exploitability prediction models could rely on a wider and continuously growing knowledge base about real-world exploits.
Classifying Vulnerabilities for Fine-grained Inspections. Our large-scale analysis involved all vulnerabilities disclosed in NVD until November 03, 2021. In our work, we did not make any distinction between vulnerabilities, e.g., analyzing web-based and memory-related vulnerabilities separately, but treated all of them as equal. Many factors concerning vulnerabilities inevitably influence any prediction activity, especially exploitability prediction. In particular, how the distribution of vulnerability types varies over time could be one such factor influencing the prediction performance observed. To investigate this aspect, we assigned each vulnerability a category guided by its CWE. Namely, we re-mapped the given CWE according to the
“Simplified Mapping” view provided by CWE itself.
32 Such a step was meant to reduce the many weakness types into a more reasonable set of categories, decreasing their number from 180 to 89. Since this number was still large, we further assigned new categories based on our knowledge of the 89 CWE types resulting from the first re-mapping, ending up with ten broader categories like
“Authentication and Authorization” and
“Resource Protection”. Figure
14 shows the distribution trend of such categories over time (over the 22 splits of the dataset). We can immediately observe that, as the years pass, the precise type of vulnerability assigned becomes clearer. Indeed, during the initial periods of the CVE system, most vulnerabilities had their CWE not specified (“NVD-CWE-noinfo” or “NVD-CWE-Other”), resulting in many CVEs falling into the “Other” category, particularly during the first 12 rounds. Such a lack of information does not allow us to easily understand whether the vulnerability types could have played a role in the performance drop we observed at the later rounds. Future work should replicate this analysis on subsets of vulnerabilities to triangulate the issue that affected the model performance.
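As an illustration only, the two-step re-mapping can be thought of as a lookup from a (simplified) CWE identifier to one of the ten broad categories; the entries below are hypothetical examples, not the study's actual table.

```python
# Hypothetical excerpt of the CWE-to-category lookup; real CWE identifiers, assumed grouping.
BROAD_CATEGORY = {
    "CWE-287": "Authentication and Authorization",   # Improper Authentication
    "CWE-862": "Authentication and Authorization",   # Missing Authorization
    "CWE-400": "Resource Protection",                # Uncontrolled Resource Consumption
    "NVD-CWE-noinfo": "Other",
    "NVD-CWE-Other": "Other",
}

def categorize(cwe_id: str) -> str:
    """Map a re-mapped CWE identifier to its broad category, defaulting to "Other"."""
    return BROAD_CATEGORY.get(cwe_id, "Other")
```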
Engineering the Learning Configuration. As observed in the context of
RQ\(_2\), the four feature representation techniques had a relevant impact on the overall models’ performance. In this study, we relied on widespread settings to set up the text pre-processing pipeline and to configure the
doc2vec model without carrying out a profound empirical investigation. Indeed, our goal was to assess the key differences among the main techniques employed when working with textual data. Hence, our work does not declare the best feature representation technique on all fronts but rather encourages the evaluation of alternative learning configurations and techniques. The Latent Semantic Analysis (LSA) [
27,
31] used to reduce the dimensionality of the document-term matrices (Section
4.3) is a candidate technique for representing the textual features that traditional ML models can use, acting somewhat similarly to embedding strategies like
doc2vec. As regards the
time-aware validation setting, we followed Liu et al.’s [
63] approach by considering a deployment setting in which the knowledge base grows over time. In a different way, the work by Bullough et al. [
17] employed a “sliding window” to train only on a limited set of past data, i.e., only those that are temporally closer to the testing data. The rationale of their choice is that recent data might represent the reality better than older data, as some characteristics might have changed over time—i.e., a
concept drift has occurred [
97]. For instance, the style and content of discussions in
BugTraq written in 2010 might differ from those of 2000, negatively affecting the models when learning the relations between the textual features and the target variable. The traditional sliding window approach completely ignores older instances during the training based on a pre-determined or moving threshold (i.e., the “window size”). Alternatively, we could also assign a lower
weight to older instances, using a “decaying window”, so that the learners would give less importance to the old instances that likely induced the models into error. In addition, by employing a mechanism to assess the quality of the online discussions, e.g., their readability [
89] or the amount of their informative content [
22], we could assign higher weights to “good” instances and instruct the models to give more attention to them during the training, hopefully improving their overall capabilities.
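A minimal sketch of the decaying-window idea, with an illustrative exponential decay and an assumed half-life, is given below; learners exposing a sample_weight parameter (e.g., scikit-learn's LogisticRegression) could consume these weights directly.

```python
# Sketch of a "decaying window": older training instances get exponentially lower weights
# instead of being dropped. The half-life value is an arbitrary illustration.
import numpy as np

def decaying_weights(ages_in_days, half_life_days=3 * 365):
    """Return per-instance weights that halve every `half_life_days` of age."""
    ages = np.asarray(ages_in_days, dtype=float)
    return 0.5 ** (ages / half_life_days)

# ages = days between each instance's disclosure date and the round's training date
# clf.fit(X_train, y_train, sample_weight=decaying_weights(ages))
```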
On the Practical Usages of Early and Realistic EPMs. Adopting an
early exploitability prediction model provides many advantages in assessing the severity of newly discovered vulnerabilities. Let us consider a scenario where a software project adopts one. When a new vulnerability is discovered, either by internal or external parties, the developers report the issue to MITRE and request the allocation of a CVE record, where they explain the issue found. Once third-party experts verify the issue, the vulnerability is officially disclosed in a CVE record containing the first official description in natural language. Such a description can be directly fed into the early exploitability prediction model to readily generate an initial assessment of that vulnerability. Besides, if other free-form commentaries are already available—e.g., via
GitHub issues—the prediction model can integrate those pieces of information to boost the prediction accuracy further, as we also observed during the analyses for
RQ\(_1\). Should the model flag the vulnerability as potentially exploitable, the developers can take specific countermeasures [
56,
86] to (i) address the vulnerability earlier than other issues, (ii) release the software version containing the patch quickly, and (iii) adopt a better communication strategy to recommend the users to install the update as soon as possible. It is worth remarking that any countermeasure adopted in this sense is meant to hasten the
vulnerability remediation process, not to replace other forms of security assessment like “late” exploitability prediction models or the CVSS analysis. In this respect,
early assessment can also be used to support the security analysts in charge of making the CVSS measurement, who can rely on an additional “opinion” when it comes to judging the vulnerabilities’ nature. In particular, bringing forward the assessment of just-disclosed vulnerabilities facilitates the
prioritization of all the security issues found until that moment. Indeed, developers can better understand which issue requires more attention than others and allocate adequate resources accordingly in the hope of reducing the duration of the exposure window and, therefore, the risk of being attacked, as explored by Jacobs et al. with EPSS [
52]. Afterward, as soon as new information on the vulnerabilities becomes gradually available, e.g., a Proof-of-Concept is disclosed, developers can progressively leverage more reliable solutions to adjust the prioritization of their interventions, such as
Evocatio fuzzer [
53]. In such a mechanism, we envision that models based on
early data can represent the first step of a prioritization pipeline that takes advantage of the strengths of existing solutions as soon as the information they use becomes available. Although the performance observed at the latest validation rounds does not yet support the practical usefulness of such early models, we believe this work acts as a cornerstone for determining the feasibility of early vulnerability assessment, aiming to channel more attention to this topic and express its full potential.
Early Predictions and Beyond. Our empirical investigation did not aim to provide a cutting-edge exploitability prediction model but rather to evaluate its performance with early data and in a realistic scenario. We acknowledge the existence of models that achieved better results in the literature [
14,
45,
85]; yet, many of them did not consider the precautions indicated by [
17] or those we adopted in this work. In this respect, we believe that replicating previous work under a realistic validation setting is necessary to estimate the models’ real effectiveness. Moreover, we envision a combination of all existing models, both
early and
late, to develop an
incremental exploitability prediction system, i.e., an integrated framework that provides the best possible predictions according to the information available at a given time. For instance, after discovering a new vulnerability (day zero), the incremental system would just rely on the short description and the initial online discussions—as we have presented in this paper; on the day the experts make in-depth analyses, the system will consider all the features obtainable from the CVSS vector to further improve its predictive power—acting as “late” models. Such a solution may express its full usefulness in the case of borderline classifications, i.e., when the early predictions fall too close to the decision threshold, making the model unsure about the appropriate class to assign. In such scenarios, the system might recommend waiting for additional data, such as the CVSS exploitability metrics, before providing a more trustworthy response. Furthermore, this framework could be integrated with an
impact prediction module that estimates the harm that the potential exploitation of that vulnerability could cause to the confidentiality, integrity, and availability of the targeted asset. This additional piece would cover the second component of the CVSS base metrics, i.e., the “Impact” metrics, assisting human experts in gaining a broad understanding of the risks connected to keeping a vulnerability unfixed.
6 Threats to Validity
Threats to Construct Validity. We mined the full content of the National Vulnerability Database (NVD) combined with the CVE List to collect all the known vulnerabilities disclosed before November 03, 2021, being careful to avoid the inclusion of malformed and rejected CVEs. We did not perform an extensive manual validation of the retrieved dataset to detect possible curation errors, such as a CVE incorrectly pointing to an external reference related to a different vulnerability. However, we are confident that the considered data sources are reliable since both databases are known to maintain high-quality data. Besides, we deliberately focused only on the URLs labeled as BID or BugTraq for three reasons: (1) URLs of this kind point to well-known sources where developers used to discuss vulnerabilities way before their official disclosure; (2) the pointed websites were easy to mine—a common HTML parser sufficed—to gather the required data; (3) developing a generic script able to mine all the thousands of different websites reachable from the CVEs would have been impractical. We handled the shutdown of both Security Focus and BugTraq using the Wayback Machine service and the SecLists archive to recover the missing links with the CVE records. Nevertheless, we cannot guarantee the absence of missing or incorrect links caused by the Wayback Machine or by the imprecision of the pattern-matching heuristic employed to reach the right page on SecLists. In any case, the approach of early prediction models is not strictly bound to BID and BugTraq references, and it can be adapted to any other source of online discussions with just minor tweaks.
The text of the online discussions contained much irrelevant content, such as e-mail addresses, PGP signatures, and hex numbers. We applied regular expressions to capture these patterns and remove them, improving the quality of the document corpus. In addition, we adopted the recommended pre-processing steps for natural language text to allow the feature representation techniques to learn a compact and representative vocabulary. We are aware that our text cleaning procedure may not have been complete and could have left other forms of noise, such as partial code snippets; yet, to the best of our knowledge, there are no tools able to capture partial code elements for any programming language; hence, we opted not to implement an ad-hoc solution as it would have required dedicated effort and extensive validation.
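The patterns below give a simplified idea of this cleaning step; they are illustrative examples rather than the exact regular expressions used in the study.

```python
# Illustrative noise-removal patterns for the online discussions: e-mail addresses,
# PGP signature blocks, and hexadecimal numbers are replaced with whitespace.
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PGP_BLOCK = re.compile(
    r"-----BEGIN PGP SIGNATURE-----.*?-----END PGP SIGNATURE-----", re.DOTALL)
HEX_NUMBER = re.compile(r"\b0x[0-9a-fA-F]+\b")

def clean(text: str) -> str:
    for pattern in (PGP_BLOCK, EMAIL, HEX_NUMBER):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```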
When assigning the labels to the instances in our dataset, we carefully avoided labeling all instances outside the context of the
time-aware validation. To this end, we followed the strategy proposed by Jimenez et al. [
54], assigning more realistic labels to the instances at each round. Specifically, we labeled as “exploitable” (
true class) the instances in the training set that were exploited before the
training date—i.e., the latest publication date in the training set—and we marked as “exploitable” the test instances only if they were exploited before the date of the last vulnerability published in that round.
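In code, the round-wise labeling rule can be sketched as follows; the attribute names and data layout are assumptions.

```python
# Sketch of the time-aware labeling: an instance is marked "exploitable" (true) only if
# its exploit was already public at the reference date of the set it belongs to.
from datetime import date
from typing import Optional

def exploitable(exploit_date: Optional[date], reference_date: date) -> bool:
    return exploit_date is not None and exploit_date <= reference_date

# training_date = latest CVE publication date in the round's training set
# round_date    = publication date of the last vulnerability in the round's test set
# y_train = [exploitable(v.exploit_date, training_date) for v in train_vulns]
# y_test  = [exploitable(v.exploit_date, round_date) for v in test_vulns]
```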
Threats to Internal Validity. The investigation for RQ\(_1\) analyzed the impact caused by the seven corpora created from the three data sources considered, i.e., CVE, Security Focus, and BugTraq, and their combinations. The combination consisted of applying a string concatenation to create the four combined corpora before creating the document-term matrices or fitting the doc2vec model—this determined different feature spaces for each corpus. Not only did the analysis help understand the impact of each corpus, but it also showed that text-driven early prediction models can be employed with any source available, though with noticeably different performance. In other words, it is not mandatory for a vulnerability to be disclosed via CVE before running the predictions: the entire procedure can be applied to any kind of text explaining the issue.
As indicated by Bullough et al. [
17], several exploitability prediction models in literature had some issues in their machine learning setup. First, we adopted a
time-aware validation setting to simulate a realistic production scenario in which the prediction model is periodically re-trained and deployed. We deliberately avoided a fully-random cross-validation as our data had time relations among them; indeed, training on “future” data to predict data belonging to the “past” would have generated inflated and misleading results. Moreover, we were careful to avoid applying the feature encoding and data balancing (where applied) on the test data, but only on the training set made up at each iteration of the
time-aware validation. Indeed, such bad practices would produce overly optimistic results, as the models would learn information from data that should be left completely unseen before the testing phase because they represent instances to predict in a real deployment scenario [
6].
To perform the time-aware validation, we split the dataset into several folds, each made of the vulnerabilities disclosed in a time span of about one year and a half—precisely, 532 days. We started the splitting from the last published CVE in 2021 and “jumped” back in time to form the 22 folds. The size of such a time span was determined by the 90th percentile of the exploitation time distribution, i.e., the duration of the uncertainty window. We made this choice to observe how the models behave when the uncertainty window is made of totally different sets of vulnerabilities. We acknowledge that there exist different ways to create the folds, such as equally splitting the dataset by the number of CVEs. The results we obtained are also subject to the choice we made for the size of the uncertainty window. We chose the 90th percentile of the exploitation time distribution as it represents a largely sufficient time to let exploits emerge; the average exploitation time, i.e., 194 days, appeared too short for this purpose. We are aware that different, and perhaps more appropriate, widths of the uncertainty window could determine more valid results, which would be worth exploring with dedicated further analyses.
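The fold construction can be sketched as follows, assuming each vulnerability exposes its publication date; this mirrors the backward 532-day windows described above.

```python
# Sketch of the backward time splitting: starting from the most recent CVE, each fold
# collects the vulnerabilities disclosed within a 532-day window, moving back in time.
from datetime import timedelta

def time_folds(vulns, window_days=532, n_folds=22):
    """`vulns` is assumed to be a list of objects with a `published` date attribute."""
    end = max(v.published for v in vulns)
    folds = []
    for _ in range(n_folds):
        start = end - timedelta(days=window_days)
        folds.append([v for v in vulns if start < v.published <= end])
        end = start
    return list(reversed(folds))    # oldest fold first, i.e., round 1 up to round 22
```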
When generating the document embeddings from the training corpora with
doc2vec, we used the configuration that provided the best results in previous work [
45] and other settings recommended for doc2vec models, e.g., setting the embedding size to 300 or using the
Distributed Bag-of-Words variant. The results achieved by this feature representation technique might change if different configurations are employed.
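A minimal gensim sketch matching the configuration mentioned above (300-dimensional vectors, DBOW variant) is shown below; the remaining hyperparameters and the toy corpus are assumptions.

```python
# Sketch of the document-embedding setup: Doc2Vec with the Distributed Bag-of-Words
# variant (dm=0) and 300-dimensional vectors; the training corpus is a toy placeholder.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

train_tokens = [["buffer", "overflow", "in", "http", "server"],
                ["sql", "injection", "in", "login", "form"]]
documents = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(train_tokens)]

model = Doc2Vec(documents, vector_size=300, dm=0, min_count=1, epochs=20)
embedding = model.infer_vector(["cross", "site", "scripting"])   # embed an unseen document
```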
Threats to External Validity. We used the
Exploit Database as the main source to build our ground truth (i.e., to label the CVEs with
true and
false) because of its reliability and completeness. Nevertheless, it only stores Proof of Concepts (PoC) and exploits publicly released by their authors without tracing any exploit observed in the wild (e.g., via attack signature detection) as done by
Symantec Attack Signatures or Google Project Zero’s
0-days-In-The-Wild (described in Section
5). Therefore, our models can only predict whether an exploit will be publicly released, without generalizing to other forms of exploitation. Moreover, the observed results hinted at possible flaws in our ground truth that caused the model performance degradation. Thus, integrating multiple data sources could improve the quality of the ground truth and, hopefully, the performance as well.
The exploitability prediction models experimented with in this work target disclosed vulnerabilities. This choice was driven by the fact that the metadata for such vulnerabilities is available for initial assessments. Hence, the models cannot estimate the exploitability of 0-day vulnerabilities since these are supposed to be unknown to the developers or any other party involved in taking care of the system's security. Predicting the risk of undergoing 0-day exploits would likely require monitoring accesses to the application or other suspicious actions, relying on principles different from those recalled in this work [
44]. Nevertheless, the prediction of exploits to known and disclosed vulnerabilities can also be used as a proxy indicator for estimating the exploitation of other unknown vulnerabilities in the system sharing commonalities with those already disclosed [
13]. Furthermore, the experimented models are meant to predict the event of future exploitation for individual vulnerabilities, in line with all the related work presented in Section
2. Hence, the models cannot predict attacks concerning multiple vulnerabilities or chains of exploits. Achieving such a goal is indeed feasible, though it might require more mature models to make accurate predictions for individual vulnerabilities.
Threats to Conclusion Validity. For all 504 models, we computed multiple metrics capturing the classifiers’ performance from different points of view, reducing the risk of drawing erroneous conclusions. In particular, we deliberately did not consider the accuracy—except for observing the scarce performance of the LLMs—as it produces largely inflated results, leading to overly optimistic conclusions on imbalanced problems. In this paper, we largely relied on the F-measure and MCC metrics to observe how the models performed. The rest of the raw results are in the online appendix [
49].
We aggregated the results of the 22 validation rounds with the weighted average to have a single number representing the overall performance and facilitate the comparison. We preferred this aggregator over other popular choices, such as the simple average or the median, because the 22 validation rounds are not equivalent representations of the same problem. Indeed, in the 15th and 22nd rounds, all models achieved utterly different performance; nevertheless, the exploitability prediction problems of those two rounds cannot be directly compared as they are separated by the events that occurred in ten years. Therefore, we assigned more weight to the rounds having wider training sets. We also looked at the aggregated scores obtained using the simple average and the median. For instance, the model that had the largest weighted F-measure under the \(\langle\)CVE + SF\(\rangle\) corpus (i.e., Logistic Regression with TF-IDF and oversampled training data) would score 0.54 with the simple average and 0.64 with the median, noticeably higher than the 0.49 weighted score. We believe such inflated values do not accurately describe the overall model performance, motivating the use of a weighted aggregator. Still, we opted to closely inspect the model trends over the 22 validation rounds to avoid drawing conclusions from a single aggregated value.
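Under our reading of this choice (weights proportional to the training set size of each round), the aggregation can be sketched as follows.

```python
# Sketch of the weighted aggregation across the 22 rounds; the assumption is that each
# round's score is weighted by the number of training instances available in that round.
import numpy as np

def weighted_score(round_scores, training_set_sizes):
    scores = np.asarray(round_scores, dtype=float)
    weights = np.asarray(training_set_sizes, dtype=float)
    return float(np.average(scores, weights=weights))
```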
The Latent Semantic Analysis (LSA) employed in Section
4.3 allowed us to inspect the arrangement of the training and test instances at key validation rounds. We opted for this technique due to its suitability for textual data based on document-term matrices, which are known to generate a sparse feature space [
31]. We were careful to fit the semantic space only on the training data to prevent it from being influenced by future data that was supposed to be unseen at that time. In other words, the test data were projected in the same semantic space previously fitted on the corresponding training set.
7 Conclusion
This paper presented a large-scale empirical evaluation of the effectiveness of early exploitability prediction models relying on the data available in a just-disclosed vulnerability, comparing 72 learning configurations, involving six traditional ML classifiers, four feature representation schemas, and three data balancing settings, as well as five pre-trained LLMs. All models were evaluated in the context of a time-aware validation setting representing a realistic scenario where the models are periodically re-trained and deployed. Additionally, we handled possible issues connected to an unrealistic and eager assignment of labels by employing a special data cleaning strategy.
The results showed that CVE descriptions alone suffice, but the addition of online discussions from Security Focus further boosts the performance of any model. The best combination of feature representation and data balancing was with TF and SMOTE in the majority of the cases. The best classifier depends on the performance metric: the Logistic Regression achieved the best F-measure and MCC scores, the Random Forest maximized the precision, and the KNN had a quasi-perfect recall. Unfortunately, pre-trained LLMs did not achieve the expected performance, requiring further pre-training in the security domain. Nevertheless, all models fell victim to the same phenomenon, i.e., a noticeable drop at later validation rounds, likely due to the large imbalance in the test sets.
Future research directions include the experimentation of novel mechanisms to build a more reliable and sound ground truth—e.g., by combining multiple data sources—or of alternative learning configurations to improve the performance of early exploitability prediction models. We envision possible further developments to make exploitability prediction more powerful and useful, such as employing an incremental exploitability prediction system to guide the choice of which countermeasures to apply when a new vulnerability is published, e.g., helping to decide which vulnerability must be addressed before others. Such a system can also be integrated with an impact estimation module to provide a full overview of the risk connected to a newly found vulnerability. From a different perspective, we hypothesize that exploitability prediction modeling can be further improved by retrieving peculiar information from the software systems affected by just-disclosed vulnerabilities and by using fine-grained text analysis tools that extract relevant elements from unstructured text, such as code snippets or stack traces, to obtain a more relevant feature space from which the models can better learn.