
Formation energy prediction of neutral single-atom impurities in 2D materials using tree-based machine learning


Published 7 August 2024 © 2024 The Author(s). Published by IOP Publishing Ltd
Focus on Explainable Machine Learning in Sciences. Citation: Aniwat Kesorn et al 2024 Mach. Learn.: Sci. Technol. 5 035039. DOI: 10.1088/2632-2153/ad66ae


Abstract

We applied tree-based machine learning algorithms to predict the formation energy of impurities in 2D materials, where adsorbates and interstitial defects are investigated. Regression models based on random forest, gradient boosting regression, histogram-based gradient-boosting regression, and light gradient-boosting machine algorithms are employed for training, testing, cross validation, and blind testing. We utilized chemical features derived from fundamental properties of atoms and supplemented them with structural features describing the interaction of the added chemical element with its neighboring host atoms via the Jacobi–Legendre (JL) polynomials. Overall, the prediction accuracy reaches an optimal $\text{MAE} \approx 0.518$ eV, $\text{RMSE} \approx 1.14$ eV, and $R^2 \approx 0.855$. When trained separately, we obtained lower residual errors (RMSE and MAE) and a higher R2 value for predicting the formation energy of the adsorbates than of the interstitial defects. In both cases, the inclusion of the structural features via the JL polynomials improves the prediction accuracy of the formation energy, decreasing RMSE and MAE and increasing R2. This work demonstrates the potential and application of physically meaningful features to obtain physical properties of impurities in 2D materials that would otherwise require a higher computational cost.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Nanostructured materials, such as impurity structures in 2D materials, have exhibited great potential in a wide range of applications, where they have become valuable resources. For instance, optical quantum technologies in solid-state systems can exploit impurities for operation [1–11]. Energy storage devices and solar cells can improve their performance by utilizing impurities [12–18]. Likewise, the presence of impurities can enhance sensing performance [19–24]. Layered materials are being explored extensively, both experimentally [11, 25–29] and computationally [30–33], to investigate their properties for specific purposes.

To design or manipulate a nanostructure to attain desirable properties, an impurity is added to the structure, where its feasibility and stability are essential. The formation energy, which traditionally denotes the energy needed to bind an impurity to a host, can give a preliminary indication: impurities with lower formation energy are more likely to exist than those with higher formation energy. In this sense, the formation energy is imperative, as it is a basic property for assessing the stability and feasibility of a nanostructure. However, the formation energy depends on many factors, such as the type of impurity and the host material's atomic composition and structure. Conventional first-principles calculations such as density functional theory (DFT) can be time-consuming when computing the formation energy [34–36]; therefore, a more efficient method is beneficial and should be explored.

Owing to the data boom from more robust data acquisition and efficient algorithms, machine learning (ML) has proven to be an effective tool for regression and classification tasks and has recently attracted great interest in many fields, including science, engineering, and health science; see for example [37, 38] and references therein. ML provides a variety of methods for classification and prediction [39, 40]. In this regard, ML algorithms are alternative approaches for preliminarily predicting the intrinsic properties of nanostructured materials. To utilize ML for investigating material properties, a proper representation of material profiles greatly enhances ML prediction accuracy and can shed more light on material properties [41] and, subsequently, on design [42]. Tree-based algorithms, such as random forest (RF) and gradient boosting, together with physics-inspired descriptors [43–45], have demonstrated improvement in predicting material properties. From our point of view, physics-based descriptors may be advantageous for two reasons: they can provide interpretability, and in some cases they may also work with a smaller dataset.

To employ ML effectively, the material representation and features are essential, and this has been demonstrated in predicting properties of 2D materials over the past decade for both impurity and pristine structures [45–47]. For this purpose, strategies for constructing feature representations, enabling them to be used as inputs or outputs for ML models, have been suggested: features should represent the similarity or difference of data points, be applicable to the entire material domain of interest, and be easier to calculate than the target property [46].

In this work, we present a small set of features consisting of chemical and structural features to predict the formation energy of neutral single-atom impurities in 2D materials available in the Impurities in 2D Materials Database (IMP2D) [48]. This database consists of adsorbates and interstitial defects in 44 host materials with nine different space groups. Each host is doped with many single-atom dopants in various configurations. The ML approaches are based on decision trees equipped with the chemical features and with structural features computed from the interaction of the dopant with neighboring host atoms via the Jacobi–Legendre (JL) polynomials. We discuss the physical insights from these features and report the regression performance of the formation energy prediction. This work illustrates the potential and application of physically meaningful features employed in ML to obtain physical properties of impurities in 2D materials that would otherwise require a higher computational cost.

2. Methodology

2.1. Data preprocessing

The data of impurity systems are taken from the IMP2D database [48], which provides 14 662 samples with the formation energy; some belong to the adsorbate type, while the others are of the interstitial type, distributed over 44 hosts. These hosts were previously identified from the Computational 2D Materials Database [48–50] as materials experimentally synthesized in 2D monolayer structures. An adsorbate impurity is defined as an impurity adsorbed onto the surface of the host atoms–for a single atom, it is also known as an adatom–whereas an interstitial defect is an impurity incorporated into the host material at an interstitial site, as depicted in figure 1.


Figure 1. Two types of impurities: (a) adsorbate and (b) interstitial defect. In these pictures, the host is MoS2 depicted in yellow and purple, and the dopant is a boron (B) atom depicted in green. The parallelogram box indicates a supercell used in computing the formation energy in IMP2D [48].


The formation energy ($E^\mathrm{f}$) in IMP2D was computed via the following equation:

$E^\mathrm{f} = E_\mathrm{tot}^\mathrm{imp} - E_\mathrm{tot}^\mathrm{pristine} - \sum_i n_i \mu_i \qquad \text{(1)}$

where $E_\mathrm{tot}^\mathrm{imp}$ and $E_\mathrm{tot}^\mathrm{pristine}$ are the total energies of the impurity and pristine (host) structures, respectively, obtained from DFT, and ni and µi are respectively the number and chemical potential of the added atoms (dopant) of species i. In this database, only one atom is added, implying $n_{i} = 1$ for the single dopant species. The Perdew, Burke, and Ernzerhof (PBE) exchange-correlation functional [51], which belongs to the class of generalized gradient approximations (GGA), is used in the DFT calculations [48]. This functional is well known to have limitations in accurately predicting certain properties, such as underestimating the band-gap energy and inaccurately describing localized states in transition metals. The formation energy can be obtained more accurately by applying GGA + Hubbard U [52–54] or a functional containing nonlocal exchange. However, the PBE functional offers advantages in (i) consistency, since data calculated at the same level of theory make a direct comparison of results more valid, and (ii) a good balance between computational efficiency and accuracy [48], especially for a large number of calculations.
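As a minimal numerical illustration of equation (1), the snippet below evaluates the formation energy from the DFT total energies and the dopant chemical potential for a single added atom; the numerical values are hypothetical placeholders, not entries from the database.

```python
# Minimal sketch of equation (1) for a single added atom (n_i = 1).
# The numerical values below are placeholders, not taken from IMP2D.

def formation_energy(e_tot_imp, e_tot_pristine, dopant_chem_pot, n_dopant=1):
    """E^f = E_tot^imp - E_tot^pristine - sum_i n_i * mu_i (one dopant species here)."""
    return e_tot_imp - e_tot_pristine - n_dopant * dopant_chem_pot

e_f = formation_energy(e_tot_imp=-245.3, e_tot_pristine=-243.1, dopant_chem_pot=-3.5)
print(f"Formation energy: {e_f:.2f} eV")  # -245.3 - (-243.1) - (-3.5) = 1.30 eV
```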

While IMP2D contains 14 662 samples with formation energy, some of them are filtered out based on the following three criteria. First, the samples hosted by TiCO2 (144 samples) are filtered out because the difference between the host and impurity energies appears unreasonably high. Second, the selected samples must have a total energy convergence within $10^{-3}$ [eV]; this was obtained from the parameter conv2 in the database [48]. Third, one impurity with an outlier formation energy is filtered out. The chemical features listed in table A1 are then extracted for all hosts and dopants, and samples with missing values are removed. As a result, only 5906 samples in 43 hosts are left for running the ML models, as listed in table B1 and supplementary section S1. The procedure is summarized in figure 2, and the distribution of the computed formation energy is shown in figure 3.
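The selection criteria above can be sketched in Python as follows; the column names (host, conv2, formation_energy) and the CSV export are assumptions for illustration, since the IMP2D data are distributed in their own database format.

```python
import pandas as pd

# Sketch of the sample-selection criteria in section 2.1 (column names assumed).
df = pd.read_csv("imp2d_samples.csv")     # hypothetical flat export of the database

df = df[df["host"] != "TiCO2"]            # criterion 1: drop TiCO2-hosted samples
df = df[df["conv2"] < 1e-3]               # criterion 2: total-energy convergence within 1e-3 eV
df = df.drop(df["formation_energy"].idxmax())  # criterion 3: drop the single outlier (illustrative choice)
df = df.dropna()                          # remove samples with missing feature values

print(len(df), "samples remain")          # 5906 samples in 43 hosts in this work
```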


Figure 2. A schematic overview of sample selection, where the data are extracted from the IMP2D database and then selected based on the established criteria. The chemical and structural features are prepared for all selected samples, to which machine learning algorithms are then applied to predict the formation energy.


Figure 3. Distribution of computed formation energy classified by the type of impurities in 5906 samples. The formation energies with frequency count less than 5 are annotated with numbers.


2.2. Feature creation

After the data selection, the next step is preparing features, as shown in figure 2. We make a distinction between two sets of features, termed chemical and structural features. A chemical feature is a property involving the chemical elements of the host and added atoms, such as the chemical potential, number of atoms, atomic radius, and electronegativity, whereas a structural feature is one that involves the physical positions or relative distances between atoms, and is prepared using the JL polynomials.

The structural features are generated from vanishing Jacobi radial functions [55] in the spirit of the Behler–Parrinello radial functions [56]. Using the added atom as a center, the structural feature can be written as a linear combination of vanishing Jacobi polynomials, $\tilde{P}_n^{(\alpha, \beta)}$, as follows:

Equation (2)

where ri is the distance between the impurity atom and neighbor atom i within the cutoff radius rc, as shown in figure 4(a). The vanishing Jacobi polynomials are defined as

Equation (3)

in such a way that the functions vanish smoothly at the distance rc, as shown in figure 4(b). Here $P_n^{(\alpha, \beta)}$ is the Jacobi polynomial of order n for $z \in [-1,1]$. These features are invariant under translation and rotation of the reference frame and they encapsulate structural information around the impurity atom which depends on symmetry, distances, and the number of neighboring atoms.


Figure 4. (a) An interstitial defect is shown with neighboring atoms within the radius $r_c = 5.0$ [Å] depicted in green. The host material is Si2 shown by blue balls, and a dopant is F shown by red balls. The parallelogram is a supercell. (b) Vanishing JL polynomials $\tilde{P}_n^{(\alpha, \beta)}$ for $n = 1, 2, 3, 4, 5$.


Here, the cutoff radius of interaction is heuristically chosen with the sphere radius of $r_c = 5$ [Å] and $n_\mathrm{max} = 5$. The hyper parameters α and β of the Jacobi polynomials are set equal to zero, which reduces them to the Legendre polynomials. In essence, we have a set of chemical features $\{x^\mathrm{chem}_i\}$ and a set of structural features $\{N_{rc},\ x^{st}_n\}$ to train the ML models.
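A sketch of this feature construction is given below for α = β = 0, where the Jacobi polynomials reduce to Legendre polynomials. The shifted "vanishing" form used here is only an illustrative stand-in, since equation (3) is not reproduced above; the exact construction should be taken from [55].

```python
import numpy as np
from scipy.special import eval_legendre

R_CUT = 5.0   # cutoff radius in angstrom, as used in this work
N_MAX = 5     # polynomial orders n = 1..5

def vanishing_legendre(n, r, r_cut=R_CUT):
    """A simple 'vanishing' Legendre polynomial that goes to zero at r_cut.
    NOTE: this shifted form is an illustrative assumption; the exact vanishing
    construction of equation (3) should be taken from the JL references [55]."""
    z = 2.0 * r / r_cut - 1.0                 # map distance r in [0, r_cut] to z in [-1, 1]
    return eval_legendre(n, z) - eval_legendre(n, 1.0)

def structural_features(distances, r_cut=R_CUT, n_max=N_MAX):
    """Sum the vanishing polynomials over all neighbours within r_cut
    (all species treated identically, as in this work)."""
    r = np.asarray(distances)
    r = r[r < r_cut]
    feats = [len(r)]                          # N_rc: number of neighbours in the sphere
    feats += [float(np.sum(vanishing_legendre(n, r))) for n in range(1, n_max + 1)]
    return feats                              # [N_rc, x_1, ..., x_5]

# Example with hypothetical impurity-neighbour distances (angstrom):
print(structural_features([1.9, 2.3, 2.3, 3.6, 4.8, 5.4]))
```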

2.3. Tree-based ML

The ML algorithms explored in this work are based on decision trees and include RF [57–59] and gradient boosting frameworks [60, 61], comprising gradient boosting regression (GBR), histogram-based gradient-boosting regression (HGBR), and the light gradient-boosting machine (LightGBM). It should be noted that LightGBM is implemented in [62] and the others in Scikit-Learn [63].

In RF, an ensemble of decision trees is built, where each tree makes its own output prediction. The ensemble of decision trees then takes the average as the final result. In Scikit-Learn [63], each tree in the ensemble can be built from a sample drawn with replacement (i.e. a bootstrap sample) from the training set for a specified sub-sample size. The splitting at each node during the construction of a tree can consider all input features or a random subset of a specified size, set by max_features. These two sources of randomness are applied to decrease the variance of the forest estimator. In this work, we employed the mean squared error (MSE) as the cost function to measure the quality of the tree splitting [63]. Additionally, the hyper parameters of the decision trees, such as the maximum depth of a tree and the minimum number of samples in a leaf node, can be tuned to optimize the model.
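A minimal Scikit-Learn sketch of the RF regressor described above is shown below; the hyper-parameter values are illustrative rather than the optimized ones, and X_train/y_train denote the feature matrix and formation energies obtained from the splitting described in section 2.4.

```python
from sklearn.ensemble import RandomForestRegressor

# Illustrative RF setup; the optimized hyper parameters are in supplementary S4.
rf = RandomForestRegressor(
    n_estimators=500,          # number of decision trees in the ensemble
    criterion="squared_error", # MSE used to measure the quality of a split
    max_features=0.5,          # random subset of features considered at each split
    bootstrap=True,            # each tree is grown on a bootstrap sample
    max_depth=None,            # tree-level hyper parameters can also be tuned
    min_samples_leaf=1,
    n_jobs=12,                 # multi-threaded training
    random_state=0,
)
rf.fit(X_train, y_train)       # X_train, y_train: features and formation energies (section 2.4)
y_pred = rf.predict(X_test)    # predicted formation energies [eV]
```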

In the gradient boosting frameworks, which generalize boosting to an arbitrary differentiable loss function, a decision tree is used as a weak learner. Boosting deploys a sequence of weak learners to correct the output: the errors of the previous weak learners are calculated, and another weak learner is built to correct these errors. The algorithms can add some level of stochasticity by sub-sampling the training data for each decision tree [63]. However, in GBR and LightGBM the subsampling is drawn without replacement, whereas in HGBR it is not implemented [63]. A random subset of features is allowed by all methods, leading to stochastic gradient boosting. Moreover, LightGBM and HGBR make histograms of samples for each feature, which allows the algorithms to leverage integer-based data structures (histograms) instead of relying on sorted continuous values when building the trees. Furthermore, LightGBM implements the gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) algorithms, which improve the training speed for large datasets. GOSS selects all samples with larger gradients and randomly selects some samples with smaller gradients for training a weak learner. EFB is used to deal with sparse feature spaces such as one-hot features. It should be noted that GOSS is not applied in this work, since simple stochastic sub-sampling is faster for a small sample size. In all the gradient boosting algorithms here, the squared-error loss function is used to measure the quality of the tree splitting [62, 63].
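The three gradient-boosting regressors can be instantiated as sketched below; again, the parameter values are illustrative only, and the optimized hyper parameters are listed in supplementary section S4.

```python
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
from lightgbm import LGBMRegressor

# Illustrative defaults only.
gbr = GradientBoostingRegressor(
    loss="squared_error",      # squared-error loss for the weak learners
    n_estimators=500,
    subsample=0.8,             # stochastic boosting: subsampling without replacement
    max_features=0.5,
)
hgbr = HistGradientBoostingRegressor(
    loss="squared_error",      # histogram-based splitting on binned feature values
    max_iter=500,
)
lgbm = LGBMRegressor(
    n_estimators=500,
    boosting_type="gbdt",      # plain stochastic boosting; GOSS ("goss") is not used here
    subsample=0.8,             # row subsampling drawn without replacement
    subsample_freq=1,
    colsample_bytree=0.5,      # random subset of features per tree
    n_jobs=12,
)
for model in (gbr, hgbr, lgbm):
    model.fit(X_train, y_train)
```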

2.4. Computational details

After preprocessing the data, we split all samples into two sets, namely the dataset used for model construction and testing, and the blind-test set; see figure 5. One host can contain many dopant types, where each impurity structure (sample) has only one dopant but can have multiple configurations. The blind-test set (371 samples) contains only the MoS2 and W2Se4 hosts, which have been separated a priori from all other samples. This blind-test set could, in principle, be composed of other samples; here, we simply selected two hosts with different space groups. The dataset containing the remaining hosts, which is used for constructing an ML model, is further randomly split into two subsets, namely the training data (80%) and the test data (20%), disregarding the hosts. Alternatively, we can employ a stratified data splitting, where the percentage of samples for each host is preserved in the training set and the test set. Notably, the blind-test set contains impurity structures that the ML models have never seen in the training data.
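A sketch of this splitting is given below, assuming the preprocessed samples are held in a pandas DataFrame df with a host column and a formation_energy target (names assumed for illustration, continuing the earlier snippet).

```python
from sklearn.model_selection import train_test_split

# Sketch of the data splitting in figure 5; "df" is the preprocessed sample table.
blind_hosts = ["MoS2", "W2Se4"]
blind = df[df["host"].isin(blind_hosts)]                  # 371 samples held out a priori
rest  = df[~df["host"].isin(blind_hosts)]

feature_cols = [c for c in df.columns if c not in ("formation_energy", "host")]
X, y = rest[feature_cols], rest["formation_energy"]

# 80/20 random split disregarding the hosts; pass stratify=rest["host"]
# instead for the stratified variant that preserves the per-host percentages.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```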


Figure 5. Computational diagram to develop ML models. The training data is used to tune hyper parameters of the ML models by the 5-fold cross-validation method. Then, the trained models are used to predict the formation energy with the test data and the blind-test data. The blind-test data contains impurity structures whose hosts are not included in the training data.


ML models used in this work have many hyper parameters, which can be selected to optimize bias and variance of the models. We used training data to optimize these hyper parameters. To that end, samples in the training data are split into five folds for cross-validation to average the root mean square error (RMSE) of each hyper-parameter set. The best hyper parameters are subsequently used to retrain the model by all training data. Finally, the trained models are used to predict the formation energy of impurities in the test data and those in the blind-test set. Three metrics are used to evaluate the regression performance, namely, the coefficient of determination (R2), the mean absolute error (MAE), and the RMSE; see appendix C. These procedures are summarized in figure 5.
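A minimal sketch of this hyper-parameter search with 5-fold cross-validation and RMSE scoring is shown below for LightGBM; the parameter grid is illustrative, not the grid actually used (see supplementary section S4).

```python
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMRegressor

# Illustrative grid; the real grids are listed in supplementary section S4.
param_grid = {
    "n_estimators": [100, 200, 500],
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    LGBMRegressor(n_jobs=12),
    param_grid,
    scoring="neg_root_mean_squared_error",   # RMSE is the selection metric
    cv=5,                                    # 5-fold cross-validation on the training data
)
search.fit(X_train, y_train)                 # refit=True retrains on all training data
print(search.best_params_, -search.best_score_)
```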

3. Results and discussion

3.1. Feature discussion

In this work, we used chemical and structural features as described in section 2.2. These features will give information to a decision tree for splitting internal nodes. We compare the prediction results in three cases: (i) using the chemical features only, (ii) using both the chemical and structural features, and (iii) using both the chemical and structural features, but the chemical features do not include hostenergy/atom and host-spacegroup.

We expect that the chemical features, such as the chemical potential of a species or the electronegativity, contain insight into the formation energy. For a moment, let us consider the chemical potential µ of a species, which is defined as the rate of change of energy with respect to the change in the particle number of the given species. We can interpret the first two terms of equation (1) as the chemical potential of a species in a mixture, i.e. the rate of change of energy (free energy) with respect to the change in the number of atoms added to the system. Therefore, the formation energy in equation (1) is obtained by comparing the chemical potential of a species in a mixture (the first two terms) with the chemical potential µ of the species (the third term). Consequently, the formation energy in equation (1) can indicate the structural stability when forming an impurity structure. Since the chemical potential of a species is computationally less expensive to obtain than the formation energy and is available in the database, it can be a promising feature.

In a similar regard, the electronegativity is a measure of the tendency of an atom to attract a bonding pair of electrons. The features natoms and Z-dopant/total determine the concentration of impurities. The other chemical features in table A1 are also straightforward to obtain and are included to add more degrees of freedom for a decision tree to split into nodes. We remark that the hostenergy/atom feature requires DFT calculations, but it is still computationally less expensive than the formation energy because it is just a host property (in this work, there are only 43 hosts). Moreover, we show in section 3.2 that even when this feature is removed, the considered ML models still yield comparably accurate predictions; see table 3 in section 3.2.

In figure 6, the energy change from the first two terms of equation (1) is depicted as a function of the dopant chemical potential. If the host and dopant interact weakly, in the sense that their formation energy is small, the energy change should be close to the dopant chemical potential. In contrast, a stronger interaction will make the energy change deviate more from the dopant chemical potential.


Figure 6. The energy change $(E_\mathrm{tot}^\mathrm{imp} - E_\mathrm{tot}^\mathrm{pristine})$ when the host materials are doped by a dopant atom is depicted as a function of the dopant chemical potential.


For all structures in this database, the host energy and the dopant chemical potential are negative. Considering the formation energy for a given host, when a dopant is added to the host and the interaction between them is weak, e.g. for the adsorbate type, the total energy of the composite system is often more negative than that of the host. If the host remains the same, the energy change (the first two terms of equation (1)) should be close and proportional to the dopant chemical potential. Consequently, if the energy change is more negative than the dopant chemical potential, the formation energy is negative, and conversely. As evident in figure S1 in supplementary section S2 for the adsorbate type with a fixed host, the magnitude of the formation energy decreases almost linearly as the magnitude of the dopant chemical potential decreases. We observed that a lower magnitude of the dopant chemical potential has a higher chance of yielding a negative formation energy.

In contrast, if the interaction between the host and the dopant is stronger, e.g. for an interstitial defect, the host is more affected. In this case, there is a higher chance that the composite system has a total energy less negative than the host energy. Consequently, the energy change is positive and thus makes the formation energy positive; see figure 6. Even when the host is fixed for the interstitial type, the relation between the formation energy and the dopant chemical potential tends to be weaker than that of the adsorbate type; see figure S2 in supplementary section S2. This makes the interstitial type more complicated to predict than the adsorbate type. However, if we plot the formation energy of all hosts together versus the dopant chemical potential, the trends are less pronounced for both types; see figure S3 in supplementary section S2.

Regarding the electronegativity, if a pair of atoms is bonded, a covalent bond is associated with a smaller relative electronegativity of the atomic pair, while an ionic bond requires a greater one. The stronger bond is expected to have a more negative formation energy. It should be noted that, in this database, a greater relative electronegativity between the dopant and its neighborhood gives a higher probability of formation. This is consistent with the previous report [64] in which the formation energy of an oxygen vacancy in metal oxide materials was predicted from the relative electronegativity and the impurity concentration (electrons of the vacated oxygen per total electrons in the structure), demonstrating that these features play essential roles in predicting the formation energy.

Since the features constitute non-linear relationships with the target, and they depend on the clustering of the hosts (e.g. the dopant chemical potential), the tree-based ML methods can deal with the problem better than a linear regression model. When including the structural features, which are constructed from the neighboring atoms within the radius rc around the dopant, we removed the features that give rise to the clustering of the hosts, i.e. hostenergy/atom and host-spacegroup, without a significant loss of the prediction score. Note that linear regression with ordinary least squares gives worse predictions in the same setting with these feature sets; see supplementary section S3.

3.2. Prediction results

The samples are divided into three data sets: training data, test data, and blind-test data. The ML models' hyper parameters are tuned by 5-fold cross-validation of the training data, as detailed in section 2.4. The optimized hyper-parameter sets are given in supplementary section S4 for the different ML models and feature sets. The prediction results for the adsorbates, the interstitial defects, and both together, with the chemical features alone or together with the structural features, are reported in tables 1 and 2, respectively. Without the structural features, the ML models give lower MAE and RMSE, and higher R2 scores, for the adsorbate type than for the interstitial type, since the interaction of the latter is more complicated, as reflected in figure 6. If the chemical and structural features are applied together, the ML models improve their performance in predicting the formation energy for both impurity types and for their union. Moreover, the models are also able to predict the formation energy for the blind test, which contains hosts not seen by the models in the training data.

Table 1. Performance of ML models with RMSE optimization when only chemical features are used.

Impurity type | CV RMSE (avg.) | CV SD | Training R2 | Training MAE [eV] | Training RMSE [eV] | Test R2 | Test MAE [eV] | Test RMSE [eV] | Blind-test R2 | Blind-test MAE [eV] | Blind-test RMSE [eV]
RF
Adsorbate | 1.0202 | 0.0672 | 0.9277 | 0.3189 | 0.6510 | 0.8503 | 0.5107 | 0.9341 | 0.9074 | 0.4242 | 0.6441
Interstitial | 2.0268 | 0.0756 | 0.8720 | 0.6688 | 1.1485 | 0.6096 | 1.3049 | 2.1512 | 0.3827 | 1.8370 | 3.2945
Total | 1.6777 | 0.1483 | 0.8644 | 0.5216 | 1.0171 | 0.6493 | 0.9074 | 1.7659 | 0.5536 | 1.2480 | 2.3676
GBR
Adsorbate | 0.9391 | 0.0627 | 0.9186 | 0.3274 | 0.6909 | 0.8739 | 0.4662 | 0.8576 | 0.9207 | 0.4062 | 0.5960
Interstitial | 1.9548 | 0.1397 | 0.8313 | 0.8709 | 1.3184 | 0.6653 | 1.2314 | 1.9918 | 0.5071 | 1.6022 | 2.9437
Total | 1.5493 | 0.1618 | 0.8648 | 0.5814 | 1.0158 | 0.6860 | 0.8617 | 1.6708 | 0.5137 | 1.4050 | 2.4713
HGBR
Adsorbate | 0.9418 | 0.0625 | 0.9270 | 0.3487 | 0.6544 | 0.8749 | 0.4706 | 0.8540 | 0.9079 | 0.4048 | 0.6423
Interstitial | 1.9768 | 0.0915 | 0.8540 | 0.8129 | 1.2267 | 0.6383 | 1.2657 | 2.0706 | 0.4447 | 1.6896 | 3.1247
Total | 1.5173 | 0.1831 | 0.8766 | 0.5902 | 0.9704 | 0.6919 | 0.8873 | 1.6550 | 0.4267 | 1.5981 | 2.6832
LightGBM
Adsorbate | 0.9391 | 0.0654 | 0.9243 | 0.3549 | 0.6664 | 0.8739 | 0.4779 | 0.8573 | 0.9034 | 0.4305 | 0.6580
Interstitial | 1.9897 | 0.1296 | 0.8389 | 0.8597 | 1.2884 | 0.6372 | 1.2751 | 2.0738 | 0.4960 | 1.6550 | 2.9769
Total | 1.5358 | 0.1810 | 0.8518 | 0.6720 | 1.0633 | 0.6904 | 0.9018 | 1.6591 | 0.5875 | 1.1730 | 2.2761

Table 2. Performance of ML models with RMSE optimization when chemical and structural features are used.

Impurity type | CV RMSE (avg.) | CV SD | Training R2 | Training MAE [eV] | Training RMSE [eV] | Test R2 | Test MAE [eV] | Test RMSE [eV] | Blind-test R2 | Blind-test MAE [eV] | Blind-test RMSE [eV]
RF
Adsorbate | 0.9424 | 0.0609 | 0.9815 | 0.1564 | 0.3296 | 0.8901 | 0.4195 | 0.8004 | 0.9371 | 0.3430 | 0.5309
Interstitial | 1.7473 | 0.1039 | 0.9623 | 0.3524 | 0.6237 | 0.7789 | 0.9621 | 1.6188 | 0.6862 | 1.1978 | 2.3487
Total | 1.2595 | 0.1802 | 0.9728 | 0.2280 | 0.4557 | 0.8005 | 0.6258 | 1.3320 | 0.8382 | 0.6231 | 1.4254
GBR
Adsorbate | 0.8264 | 0.0536 | 0.9892 | 0.1416 | 0.2513 | 0.9148 | 0.3874 | 0.7047 | 0.9439 | 0.3329 | 0.5013
Interstitial | 1.5587 | 0.1653 | 0.9949 | 0.1359 | 0.2284 | 0.8461 | 0.8255 | 1.3508 | 0.7920 | 1.0200 | 1.9122
Total | 1.0854 | 0.1483 | 0.9985 | 0.0696 | 0.1076 | 0.8549 | 0.5180 | 1.1360 | 0.8828 | 0.5314 | 1.2132
HGBR
Adsorbate | 0.8197 | 0.0477 | 0.9927 | 0.1505 | 0.2064 | 0.9223 | 0.3627 | 0.6729 | 0.8743 | 0.5268 | 0.7506
Interstitial | 1.5832 | 0.1373 | 0.9938 | 0.1708 | 0.2518 | 0.8672 | 0.7555 | 1.2545 | 0.8126 | 1.0707 | 1.8153
Total | 1.0629 | 0.1624 | 0.9937 | 0.1405 | 0.2193 | 0.8509 | 0.5237 | 1.1514 | 0.8624 | 0.6026 | 1.3143
LightGBM
Adsorbate | 0.8328 | 0.0416 | 0.9989 | 0.0611 | 0.0796 | 0.9255 | 0.3491 | 0.6588 | 0.9322 | 0.3166 | 0.5512
Interstitial | 1.5678 | 0.0868 | 0.9972 | 0.1308 | 0.1698 | 0.8530 | 0.7851 | 1.3199 | 0.8254 | 0.8764 | 1.7523
Total | 1.0682 | 0.1503 | 0.9943 | 0.1332 | 0.2080 | 0.8470 | 0.5331 | 1.1665 | 0.8850 | 0.5891 | 1.2019

It is also interesting to note that if only the structural features are used, the predictions are worse than with the chemical features alone, since we treated all atoms within the radius rc as the same species; hence, the structural features cannot differentiate between host and dopant species. However, when used together with the chemical features, they improved the ML models' performance. We also remark that the database contains only relaxed impurity structures. In practice, the structural features would have to be constructed from the initial, or at least unrelaxed, structures when making predictions. We believe that, at worst, the prediction performance is that given in table 1 with the chemical features alone, and that it should still improve when the structural features are constructed from non-relaxed structures.

In addition, we trained and tested the aforementioned ML models with the chemical and structural features together, but with hostenergy/atom and host-spacegroup excluded; see the results in table 3. In this case, the remaining features are not explicitly specific to the host names. Comparing the results in tables 2 and 3, these ML models still yield comparable prediction accuracy; see also figure 7, where the feature sets are compared for the LightGBM model.


Figure 7. Comparison of prediction R2 scores for different feature sets using the LightGBM algorithm. Set f1 denotes the chemical features, f2 the chemical and structural features, and f3 the chemical features (excluding hostenergy/atom and host-spacegroup) together with the structural features.


Table 3. Performance of ML models with RMSE optimization when the chemical features (excluding hostenergy/atom and host-spacegroup) and the structural features are used.

Impurity type | CV RMSE (avg.) | CV SD | Training R2 | Training MAE [eV] | Training RMSE [eV] | Test R2 | Test MAE [eV] | Test RMSE [eV] | Blind-test R2 | Blind-test MAE [eV] | Blind-test RMSE [eV]
RF
Adsorbate | 0.9570 | 0.0608 | 0.9810 | 0.1620 | 0.3340 | 0.8855 | 0.4357 | 0.8169 | 0.9431 | 0.3428 | 0.5051
Interstitial | 1.8131 | 0.1278 | 0.9584 | 0.3691 | 0.6548 | 0.7405 | 1.0516 | 1.7538 | 0.6961 | 1.1474 | 2.3115
Total | 1.3100 | 0.1738 | 0.9705 | 0.2396 | 0.4748 | 0.7894 | 0.6576 | 1.3684 | 0.8178 | 0.6227 | 1.5127
GBR
Adsorbate | 0.8714 | 0.0628 | 0.9749 | 0.1788 | 0.3837 | 0.8999 | 0.4157 | 0.7641 | 0.9469 | 0.3275 | 0.4876
Interstitial | 1.6972 | 0.1422 | 0.9853 | 0.1885 | 0.3898 | 0.8046 | 0.9136 | 1.5220 | 0.7839 | 1.0468 | 1.9492
Total | 1.1670 | 0.1477 | 0.9998 | 0.0241 | 0.0350 | 0.8278 | 0.5726 | 1.2374 | 0.8766 | 0.5562 | 1.2448
HGBR
Adsorbate | 0.8591 | 0.0633 | 0.9972 | 0.0958 | 0.1278 | 0.9038 | 0.3953 | 0.7490 | 0.9314 | 0.3244 | 0.5544
Interstitial | 1.6863 | 0.1032 | 0.9962 | 0.1479 | 0.1974 | 0.8209 | 0.8841 | 1.4572 | 0.8075 | 0.9954 | 1.8397
Total | 1.1481 | 0.1553 | 0.9918 | 0.1624 | 0.2505 | 0.8283 | 0.5890 | 1.2358 | 0.8471 | 0.6998 | 1.3858
LightGBM
Adsorbate | 0.8608 | 0.0555 | 0.9993 | 0.0490 | 0.0634 | 0.9223 | 0.3640 | 0.6731 | 0.9118 | 0.3357 | 0.6286
Interstitial | 1.6840 | 0.1443 | 0.9998 | 0.0397 | 0.0491 | 0.8256 | 0.8534 | 1.4379 | 0.8303 | 0.9408 | 1.7272
Total | 1.1446 | 0.1558 | 0.9924 | 0.1529 | 0.2401 | 0.8237 | 0.5914 | 1.2519 | 0.8701 | 0.6048 | 1.2771

In figure 8, we provide some examples of the formation energy predicted by LightGBM plotted against the computed formation energy from the database. We have chosen to demonstrate our proposed method with LightGBM because of its faster training than RF, GBR, and HGBR at comparable RMSE. The results from the other algorithms are presented in supplementary section S5. Furthermore, the feature importance, which describes how relevant each feature is relative to the others and is used to split the internal nodes in the decision trees, is reported in supplementary section S6.


Figure 8. LightGBM: scatter plot of predicted and computed formation energies for training data, test data, and blind test (inset). The impurity types are (a) adsorbate, (b) interstitial, and (c) the union of both. The chemical and structural features are used.


We should point out that the prediction results with stratification–where the dataset and cross-validation are split in such a way that preserves the percentage of samples for each host–yield cross-validation scores comparable to those without stratification; see supplementary section S7.

Hence, it can be concluded that a small set of chemical features from fundamental properties of atoms, together with a set of structural features encoded in the JL polynomials, can be successfully utilized to predict the formation energy of neutral single-atom impurities in 2D materials, as indicated by the high R2 scores of the test and blind test. In this work, we focused our investigation on the adsorbates and interstitial defects because they are available in IMP2D. However, since substitutional defects in 2D materials are of great interest for many applications but are not available in this database, we applied these models to them separately and included the results in appendix D.

3.3. Comparison of ML models

To compare the performance of the selected ML models, we varied the number N of decision trees as $N \in \{10, 20, 50, 100, 200, 500, 700, 1000\}$. Then, the corresponding RMSE score, training time, and prediction time are evaluated by 5-fold cross-validation of the training data. Our computing environment runs Ubuntu (version 20.04.3 LTS) on an Intel Core i9-7920X CPU (24 cores in total). RF, HGBR, and LightGBM are run with multi-threading with the number of threads fixed to 12, whereas GBR runs on a single thread only. The results for the adsorbates are shown in figure 9(a) for the default hyper parameters and in figure 9(b) for the optimized hyper parameters; see the list in supplementary section S4. For these sets of hyper parameters, the training times fall between a few tenths of a second and about ten seconds, and the prediction times are shorter. Even though the training time for a given set of parameters is on the order of a few seconds, the model with the shorter training time makes hyper-parameter optimization less time-consuming. The default hyper parameters yield RMSE similar to the optimized ones; however, the latter give more robustness against different test-data sets. Notably, figure 9 shows that LightGBM takes the shortest training time for the same RMSE.
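A simplified timing sketch is given below; unlike figure 9, it times a single fit/predict cycle rather than the full 5-fold cross-validation, and the data matrices are assumed from the earlier snippets.

```python
import time
from lightgbm import LGBMRegressor

# Sketch of the timing comparison in figure 9 for one model family.
n_trees_list = [10, 20, 50, 100, 200, 500, 700, 1000]
for n_trees in n_trees_list:
    model = LGBMRegressor(n_estimators=n_trees, n_jobs=12)

    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    t_train = time.perf_counter() - t0

    t0 = time.perf_counter()
    model.predict(X_test)
    t_pred = time.perf_counter() - t0
    print(f"{n_trees:5d} trees: train {t_train:.2f} s, predict {t_pred:.3f} s")
```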


Figure 9. Training time and prediction time when the number of decision trees is varied. Both the chemical and structural features are used. (a) Default hyper-parameters and (b) optimized hyper-parameters.


In addition, the learning curve, which shows a model's performance as the training size varies, is investigated. The results are shown in figure 10 for the selected chemical and structural feature sets; see also supplementary section S8 for the other feature sets. We observed that RF yields a lower score than the gradient-boosting frameworks (i.e. GBR, HGBR, LightGBM). The interstitial type shows a non-vanishing gradient as the number of training samples increases. Since the interstitial type has a smaller sample size, this is a possible reason for its higher RMSE. Hence, a larger training size is expected to improve the prediction performance for the interstitial defects.
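A learning curve of this kind can be obtained with Scikit-Learn as sketched below, using 5-fold cross-validation and RMSE scoring; X and y are assumed from the earlier splitting sketch.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from lightgbm import LGBMRegressor

# Sketch of a learning curve with 5-fold cross-validation and RMSE scoring.
train_sizes, train_scores, test_scores = learning_curve(
    LGBMRegressor(n_jobs=12),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="neg_root_mean_squared_error",
)
train_rmse = -train_scores.mean(axis=1)   # dashed lines in figure 10
test_rmse = -test_scores.mean(axis=1)     # solid lines in figure 10
```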


Figure 10. Learning curves, with 5-fold cross-validation as the number of training samples increases, depicted with dashed lines and solid lines for the training and testing scores, respectively. The shaded regions of different colors represent the standard deviations of the RMSE. Results shown are obtained using the feature set f2. The impurity types are (a) adsorbate, (b) interstitial, and (c) the union of both.


3.4. Dependence on types of impurities

It is also evident that the formation energy of the adsorbates is predicted more accurately than that of the interstitial defects. We believe this reflects the complexity of the formation energy arising from the environment of the added atoms. In the adsorbate case, the interaction of the dopant with the host atoms is relatively weak, so the energy of the composite system is approximately the sum of the energies of the host and the dopant. Hence, for a given host, the energy change after doping is proportional to the dopant chemical potential. In contrast, the interstitial defects experience many influences from the neighboring atoms, leading to a more complex situation. As a result, a set of chemical features may not capture the physical nature of such defects sufficiently accurately. Although the addition of the JL terms improves the prediction accuracy owing to the additional degrees of freedom in the modeling, it is likely to account for only a fraction of the interaction. Moreover, it seems that the interstitial type needs more training samples than the available data provide.

It is also interesting to note that both impurity types can involve the same dopant added into the same host; the difference is just the position of the dopant. Thus, features that include the structural information can be used to predict both types together.

3.5. Remarks on test accuracy and further improvement

As the distribution of the computed formation energy in figure 3 indicates, we selected samples in the database with converged DFT energies, using a convergence threshold of $10^{-3}$ [eV], as mentioned in section 2.1. In this distribution, the formation energies outside the range $-5.0 \lt E^\mathrm{f} \lt 15.0$ [eV] comprise only 38 of the 5906 samples. Therefore, this region has fewer samples for training the ML models. If we cut off this region, the regression performance on the test data improves, especially in terms of MAE and RMSE. This sometimes makes the blind test yield a higher score, since there may be no samples in that region.

In this work, we account for terms up to the two-body level and treat all atoms as a single species, without specifying the different pairs, owing to the presence of multiple species in each host system. The structural features can thus be improved further by multiplying the vanishing Jacobi polynomials in equation (3) by an appropriate scaling factor, depending on the pair properties, before they are summed to construct the feature $x^{st}_n$. Alternatively, summing up to the three-body terms, where the distances and angle between three bodies are taken into account, is another approach to improving the structural features.

4. Conclusion and outlook

We applied tree-based ML algorithms to predict the formation energy of impurities in 2D materials, where adsorbates and interstitial defects are investigated. RF, GBR, HGBR, and LightGBM algorithms are employed in this work. The small set of chemical features from fundamental properties of atoms, together with the constructed structural features encoded in the JL polynomials, can successfully predict the formation energy, resulting in high prediction scores on the test and blind-test sets. Even when there is no feature that explicitly identifies the host (such as hostenergy/atom and host-spacegroup), the prediction yields $R^2 \gt 82\%$. Among the considered ML models, the LightGBM algorithm takes the shortest training time while yielding comparable or better prediction scores than the other algorithms. When the data sets are trained separately, the adsorbates yield higher accuracy for the formation energy than the interstitial defects. The prediction performance for the interstitial defects is expected to improve with an increased training sample size. Potential further improvements can be explored by including three-body terms in the constructed structural features, or by applying appropriate scaling factors to the vanishing Jacobi polynomials. This work therefore demonstrates the potential and application of physically meaningful features in ML to obtain accurate predictions of physical properties that are otherwise time-consuming to compute.

Looking forward, we have demonstrated a method to incorporate physically insightful features to supplement available features in a database to predict a physical property. More interesting and high-impact properties such as energy band gap, zero-phonon line, quantum efficiency, and excited state lifetime of quantum emitters [65] for quantum technology applications [31] can be considered in future work.

Acknowledgments

We are grateful to Prof. Dr Stefano Sanvito and Dr Urvesh Patil of School of Physics, Trinity College Dublin, Ireland, for valuable comments and suggestions. This research project is supported by Mahidol University (Fundamental Fund: FF-059/2566 fiscal year 2023 by National Science Research and Innovation Fund (NSRF), Thailand) and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Projektnummer 445275953. This work is part of the Munich Quantum Valley, which is supported by the Bavarian state government with funds from the Hightech Agenda Bayern Plus. The authors acknowledge support by the German Space Agency DLR with funds provided by the Federal Ministry for Economic Affairs and Climate Action BMWK under Grant Number 50WM2165 (QUICK3) and 50RP2200 (QuVeKS). T V is funded by the Federal Ministry of Education and Research (BMBF) under Grant Number 13N16292. C C, C N and R H are grateful for the Thai government scholarships via the Development and Promotion of Science and Technology Talents Project (DPST).

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix A: Feature descriptions

It should be noted that a host space group is identified by a number, but this number is a categorical label rather than a physical quantity. In order to use this information in ML, it is represented by one-hot encoding.
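A minimal sketch of this encoding with pandas is shown below; the DataFrame and column name are assumed as in the earlier snippets.

```python
import pandas as pd

# One-hot encoding of the categorical host-spacegroup feature
# (nine distinct space groups in this work).
space_group_onehot = pd.get_dummies(df["host-spacegroup"], prefix="spg")
df = pd.concat([df.drop(columns=["host-spacegroup"]), space_group_onehot], axis=1)
```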

Table A1. Chemical features of the hosts and dopants used in this work [48, 66]. The last two rows indicate the features that are excluded in constructing the feature set f3.

Chemical feature | Description
dopant-chem-pot | Dopant chemical potential.
natoms | Number of atoms in a structure.
dopant-engt | Electronegativity of dopant (Pauling scale).
engt-nb1, 2, 3, 4 | Electronegativity of the four nearest neighbors.
Delta-engt-avg | Difference in electronegativity between the dopant and the average of its four nearest neighbors.
Z-dopant | Atomic number of dopant.
Z-dopant/total | Ratio of the dopant atomic number to the total atomic number in a structure.
val-e-dopant | Number of valence electrons of the dopant.
period-dopant | Period of the dopant in the periodic table.
atomic-radius-dopant | Atomic radius of dopant.
host-spacegroup | The space group of the host. In this work, the materials considered in the database have only nine distinct space groups, numbered 2, 11, 14, 53, 59, 156, 162, 164, and 187. This categorical data is encoded using a one-hot scheme.
hostenergy/atom | Total energy per atom of a host structure.

Appendix B: Material information

Table B1. Number of impurity structures in the IMP2D database classified by the hosts.

Host formula | Host space group number | Adsorbate (3992) | Interstitial (1914) | Total (5906) | Number of dopant elements
WS2 | 187 | 142 | 70 | 212 | 60
MoS2 | 187 | 136 | 104 | 240 | 62
WTe2 | 187 | 110 | 77 | 187 | 59
MoSe2 | 187 | 156 | 82 | 238 | 61
MoTe2 | 187 | 144 | 74 | 218 | 61
NbS2 | 187 | 121 | 129 | 250 | 58
SnS2 | 187 | 37 | 57 | 94 | 47
WSe2 | 187 | 156 | 97 | 253 | 61
NbSe2 | 187 | 105 | 169 | 274 | 60
As2 | 164 | 53 | 36 | 89 | 52
BiITe | 156 | 6 | 1 | 7 | 2
Nb4C3 | 164 | 134 | 38 | 172 | 58
Nb2CO2 | 164 | 102 | 32 | 134 | 57
C2H2 | 164 | 93 | 7 | 100 | 48
Mo2CO2 | 164 | 25 | 7 | 32 | 27
Bi2I6 | 162 | 4 | 6 | 10 | 5
Hf2Te6 | 59 | 87 | 93 | 180 | 60
Cr2I6 | 162 | 77 | 64 | 141 | 55
V2CO2 | 164 | 90 | 23 | 113 | 54
HfSe2 | 164 | 98 | 52 | 150 | 62
Ge2 | 164 | 27 | 12 | 39 | 33
Ge2H2 | 164 | 74 | 10 | 84 | 52
HfS2 | 164 | 133 | 58 | 191 | 61
PbI2 | 164 | 3 | 5 | 8 | 5
PtSe2 | 164 | 123 | 45 | 168 | 60
PtS2 | 164 | 132 | 42 | 174 | 60
MoSSe | 156 | 290 | 52 | 342 | 61
P4 | 53 | 63 | 18 | 81 | 50
TiO2 | 164 | 67 | 9 | 76 | 52
Re4S8 | 2 | 138 | 24 | 162 | 63
NiSe2 | 164 | 103 | 35 | 138 | 57
Pd2Se4 | 14 | 35 | 15 | 50 | 35
ZrS2 | 164 | 117 | 54 | 171 | 57
Re4Se8 | 2 | 131 | 37 | 168 | 64
TaS2 | 164 | 97 | 34 | 131 | 59
SnSe2 | 164 | 79 | 23 | 102 | 52
TaSe2 | 164 | 66 | 49 | 115 | 55
TiS2 | 164 | 40 | 23 | 63 | 39
W2Te4 | 11 | 142 | 37 | 179 | 59
ZrSe2 | 164 | 83 | 47 | 130 | 56
Sn2 | 164 | 27 | 22 | 49 | 33
W2Se4 | 11 | 107 | 24 | 131 | 60
Si2 | 164 | 39 | 21 | 60 | 41

Appendix C: Regression performance metrics

In this work, we evaluate the performance of a regression model by three metrics, namely the coefficient of determination (R2), the MAE, and the RMSE. The definitions are below, where $\hat{y}_i$ is the predicted value of the ith sample; yi is the corresponding actual value; and $\bar{y} = \frac{1}{n}\sum_{i = 1}^{n} y_i$ is the arithmetic mean of the actual values from the total n samples.

$R^2 = 1 - \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad \text{(C.1)}$

$\text{MAE} = \dfrac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad \text{(C.2)}$

$\text{RMSE} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad \text{(C.3)}$
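These three metrics can be computed with Scikit-Learn as sketched below, given the computed (y_test) and predicted (y_pred) formation energies from any of the trained models.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# The three regression metrics of equations (C.1)-(C.3).
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R2 = {r2:.3f}, MAE = {mae:.3f} eV, RMSE = {rmse:.3f} eV")
```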

Appendix D: Results on substitutional defects

In this section, we present the results of the ML models using the chemical and structural features on substitutional defects. The data of single-atom substitutional defects are taken from the Quantum Point Defects Database (QPOD) [67], where we selected only the substitutional defects with neutral charge and an available formation energy. Furthermore, we removed one sample with an outlier formation energy, leaving 143 samples of substitutional defects to train and test the ML models. It should be noted that this data set comprises 60 hosts (pristine structures) and a variety of defect atoms.

The chemical and structural features, which were extracted from [66–69], were prepared as outlined in section 2.2. We used the chemical features listed in table A1, excluding host-spacegroup. Since a host atom is substituted by another atom, we created and included a new feature, the difference between the chemical potentials of the added atom and the removed host atom, to accommodate the substitutional defect.

The selected data set with 143 samples is randomly split into the training data for 80% and the testing data for 20%. The former is used to optimize the hyper parameters of the ML models with the 5-fold cross-validation method. Then, the optimal hyper parameters are used to retrain the model by all training data. Finally, the testing data are used to evaluate the performance of the trained ML models.

The regression performances of the ML models for the substitutional defects are reported in table D1. The comparison plots between the computed formation energy in the QPOD database and the predicted values are depicted in figure D1. The cross-validation RMSEs averaged over the 5-fold splitting are comparable among the different ML models, and they are comparable to those of the interstitial defects, with slightly higher RMSEs than in table 2. Notably, higher standard deviations are obtained, since a smaller training set is more sensitive to the splitting of the validation set in the cross-validation. We expect that the ML performance on substitutional defects should improve with an increasing number of training samples, in analogy with the learning curves of the interstitial defects in figure 10. In this section, we have therefore demonstrated the application of the ML models to a small set of substitutional defects; this needs to be investigated more thoroughly with a larger sample size.


Figure D1. Scatter plots of predicted and computed formation energies of substitutional defects for training data and testing data in ML models: (a) RF, (b) GBR, (c) HGBR, and (d) LightGBM. Performance metrics including cross-validation (CV) RMSE, training and testing RMSEs, as well as training and testing R2 are indicated for each model.


Table D1. Performance of ML models with RMSE optimization for substitutional defects.

Model | CV RMSE (avg.) | CV SD | Training R2 | Training MAE [eV] | Training RMSE [eV] | Test R2 | Test MAE [eV] | Test RMSE [eV]
RF | 1.8323 | 0.1833 | 0.8820 | 0.6998 | 0.8946 | 0.6236 | 1.0533 | 1.4217
GBR | 1.6707 | 0.2300 | 0.9937 | 0.1603 | 0.2072 | 0.5613 | 1.0740 | 1.5347
HGBR | 1.7503 | 0.1948 | 0.9956 | 0.1088 | 0.1736 | 0.6647 | 0.9004 | 1.3416
LightGBM | 1.6576 | 0.2444 | 0.8948 | 0.6128 | 0.8448 | 0.5665 | 1.1459 | 1.5255

Supplementary data (17.3 MB PDF)