In this section, we first describe our experimental setup and then present the results of our simulations.
Setup: We assessed the effectiveness of ML-Assisted HT Detection (ML-HTD) on three major IWLS benchmarks [54] (Ethernet, S38417, and AES128). For each benchmark, we inserted 90 HTs. These are straightforward combinational HTs, each consisting of a single 2-input XOR gate that transfers the Trojan payload to a target net, a single AND-tree that generates the HT activation signal, and four input triggers attached to specific nets (the Trigger Nets). The nets are carefully picked from non-critical timing paths with enough available timing slack to accommodate the Trojan trigger and payload. The placement of the HT circuit (specifically, the first gate of the AND-tree) relative to the trigger net determines the magnitude of the Trojan Trigger's (TT's) capacitive delay impact. The HT circuit's first gate is placed within a 20 \(\mu\)m radius of the Trojan Trigger nets to minimize the trigger's impact (assuring a small delay impact). To further reduce the influence of the driver gate's gate capacitance on the delay of the Trojan trigger net, the AND-tree of the HT circuit is likewise built from the smallest AND gate available in the standard cell library. To demonstrate the sensitivity of our approach, each HT circuit is inserted into a distinct placed-and-routed netlist, yielding 90 placed-and-routed netlists per benchmark, each containing a single HT circuit. Each benchmark's physical design is hardened and timing-closed at 1.4 GHz in 32nm technology.
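To make the Trojan structure concrete, the following minimal Python sketch shows how such a combinational HT could be spliced into a netlist. The `Netlist` representation and the cell names (`AND2_X1`, `XOR2_X1`) are hypothetical stand-ins for illustration, not the actual insertion flow used in our experiments.

```python
from dataclasses import dataclass, field

@dataclass
class Gate:
    cell: str     # standard-cell name
    inputs: list  # input net names
    out: str      # output net name

@dataclass
class Netlist:
    gates: list = field(default_factory=list)
    _n: int = 0

    def add_gate(self, cell, inputs):
        self._n += 1
        out = f"ht_net_{self._n}"
        self.gates.append(Gate(cell, list(inputs), out))
        return out

def insert_ht(nl, trigger_nets, payload_net):
    """Splice a 4-input AND-tree trigger and a 2-input XOR payload into nl."""
    a1 = nl.add_gate("AND2_X1", trigger_nets[:2])  # smallest AND cell in the library
    a2 = nl.add_gate("AND2_X1", trigger_nets[2:])
    act = nl.add_gate("AND2_X1", [a1, a2])         # HT activation signal
    # The XOR flips the payload net's value only when the trigger fires.
    return nl.add_gate("XOR2_X1", [act, payload_net])

nl = Netlist()
ht_out = insert_ht(nl, ["t0", "t1", "t2", "t3"], "victim_net")
```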
During NN-watchdog training, we do not know whether a timing path chosen for training contains an HT. We therefore also assessed the effect of including Trojan-affected timing paths in the training set: we trained five NN-watchdogs whose training sets include 0, 1, 5, 10, and 15 Trojan-affected paths, respectively. Our objective is to determine whether Trojan-affected timing paths (trigger nets or payload nets) can contaminate the model to the point where HTs evade detection when evaluated with ML-Assisted HT Detection (ML-HTD).
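A minimal sketch of how the five contaminated training sets could be assembled, assuming a feature-vector representation of timing paths; the array shapes and the 8-feature layout below are illustrative assumptions, not our actual data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins: ~20K HT-free paths and a pool of HT-affected
# paths, each described by an 8-dimensional feature vector.
clean_paths  = rng.normal(size=(20_000, 8))
trojan_paths = rng.normal(size=(15, 8))

def make_training_set(n_trojan):
    """Mix n_trojan Trojan-affected timing paths into the clean training set."""
    if n_trojan == 0:
        return clean_paths
    picked = rng.choice(len(trojan_paths), size=n_trojan, replace=False)
    return np.vstack([clean_paths, trojan_paths[picked]])

# One NN-watchdog is trained per contamination level.
training_sets = {n: make_training_set(n) for n in (0, 1, 5, 10, 15)}
```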
The silicon CFST test was simulated in SPICE by adjusting the slack recorded for each timing path to the neighboring higher clock-sweep frequency step, thereby modeling the CFST step size. Modern test equipment allows step sizes as small as 10-15 ps, so we chose 15 ps for the tester's step size. We additionally account for the influence of random process variation (see Figure 4): to model chip-to-chip variation in path delays, each SPICE simulation is subjected to 200 Monte Carlo runs (modeling CFST performed on 200 different dies in the same speed bin), in which the threshold voltage (Vth), oxide thickness (Tox), and channel length (L) are varied. In keeping with [7], we have limited the random process variation to 5%.
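The slack quantization and Monte Carlo setup can be sketched as follows. Treating the 5% cap as a direct sigma on path delay is a simplifying assumption for illustration (in our flow the variation is applied to Vth, Tox, and L inside SPICE), and the nominal slack values are placeholders.

```python
import numpy as np

STEP_PS = 15.0   # tester clock-sweep step size
N_DIES  = 200    # Monte Carlo samples: dies in the same speed bin
rng = np.random.default_rng(1)

def cfst_measured_slack(true_slack_ps):
    """Snap each slack to the neighboring higher clock-sweep step."""
    return np.ceil(true_slack_ps / STEP_PS) * STEP_PS

# Hypothetical nominal path slacks (ps); 5% random variation per die.
nominal = np.array([120.0, 310.0, 45.0])
per_die = nominal * (1.0 + rng.normal(0.0, 0.05, size=(N_DIES, nominal.size)))
measured = cfst_measured_slack(per_die)   # shape: (200 dies, 3 paths)
```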
In our simulations, we assessed the effectiveness of HT detection using two mechanisms for building our reference (Golden) timing model:
5.7.2 Neural Shifted Golden Timing Model (NGTM).
The proposed NN-watchdog models the process drift and systematic process variation (see Table 3). The shift predicted by the NN-watchdog, which generates path-specific slack adjustments based on each path's topology/features, is then added to the STA results. To demonstrate the efficacy of the stacked learning model, we assessed both MLP and stacked regression as NN-watchdogs. There is no assurance that the timing path(s) impacted by the HT are excluded from the dataset used to train the NN-watchdog; we therefore examined how well NGTM performs when the training set contains 0, 1, 5, 10, and 15 HT-affected timing paths. In this method, we set the HT detection threshold to 4\(\sigma\), where \(\sigma\) is the regressor's standard deviation; choosing 4\(\sigma\) greatly reduces the false-positive rate. Comparing the standard deviations of the two models, as shown in Table 4, gives a critical insight into why the NN-watchdog built with the stacked-regression model is expected to be more sensitive/accurate than the MLP-regression model: the stacked model benefits from a lower detection threshold while retaining a statistically comparable false-positive rate.
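As a sketch, the 4\(\sigma\) rule amounts to flagging any path whose measured slack falls short of the NGTM-predicted slack by more than four residual standard deviations. The sign convention (HT-added delay reduces slack) and the residual values below are our assumptions for illustration.

```python
import numpy as np

def detection_threshold_ps(residuals_ps, k=4.0):
    """Threshold = k times the regressor's residual standard deviation."""
    return k * np.std(residuals_ps)

def flag_ht_paths(predicted_slack, measured_slack, threshold_ps):
    """Flag paths whose measured slack is worse than predicted by > threshold."""
    return (predicted_slack - measured_slack) > threshold_ps

# Example: a residual sigma of ~3 ps gives a ~12 ps detection threshold.
resid = np.random.default_rng(2).normal(0.0, 3.0, size=20_000)
thr = detection_threshold_ps(resid)
```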
To assess the validity of the threshold values used in the HT detection flow outlined in this tutorial, we extracted and report the optimal threshold from the ROC curve using Youden's method [55]. Note that the optimal threshold can only be used for quality assessment, since computing it requires ground truth (knowing precisely which timing paths are and are not affected by an HT). Because the TT-based and TP-based ROC curves differ, the Youden technique yields a distinct detection threshold for each. The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate of a classification model against its False Positive Rate. Such plots are often used in diagnostic test studies to demonstrate the trade-off between Sensitivity (True Positive Rate) and Specificity (True Negative Rate) and thereby find the cutoff point that optimizes both; the True Positive Rate and False Positive Rate of a model are calculated at different cutoff points. Youden's index is one of the most widely used metrics for finding the optimal cutoff from the ROC curve. It combines sensitivity and specificity into a single measure (\(J = \mathit{Sensitivity} + \mathit{Specificity} - 1\)) with a value between 0 and 1; in a perfect test, Youden's index equals 1. For a single decision threshold, it equals the vertical distance from the ROC curve to the diagonal no-discrimination (chance) line. The overall accuracy of the diagnostic test is summarized by the Area Under the Curve (AUC) metric. The ROC curve comes in handy when comparing different models: an excellent model has an AUC near 1, indicating a good measure of separability, whereas a model with an AUC near 0.5 has no separability (and an AUC near 0 indicates inverted predictions).
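A short sketch of extracting the Youden threshold and AUC from an ROC curve with scikit-learn. The score distributions and label counts below are synthetic placeholders, not our measured data.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)
# Synthetic scores: per-path gap between NGTM-predicted and measured slack,
# with ground-truth labels (1 = HT-affected path).
labels = np.concatenate([np.zeros(20_000), np.ones(90)])
scores = np.concatenate([rng.normal(0.0, 3.0, 20_000), rng.normal(20.0, 5.0, 90)])

fpr, tpr, thresholds = roc_curve(labels, scores)
j = tpr - fpr  # Youden's J = Sensitivity + Specificity - 1 = TPR - FPR
best = int(np.argmax(j))
print(f"Youden threshold = {thresholds[best]:.1f} ps, J = {j[best]:.3f}, "
      f"AUC = {roc_auc_score(labels, scores):.3f}")
```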
The outcome of TP identification in the Fast (X, Y) = (5, 5) speed bin is shown in Figure 15. The top row compares the accuracy of SSTA and NGTM in detecting TPs, and the false-positive detection rate of each model across the benchmarks is also shown. To predict the shift in slack, this figure compares the Stacked-regression and MLP-regression models for HT detection. Five variants of the NGTM model (NGTM-X) are evaluated, each trained with X HT-affected paths included in its training set, where X \(\in \{0, 1, 5, 10, 15\}\). As reported, including a small number of HT samples in the training set has minimal impact on the detection rate of ML-HTD (using NGTM) on the test set: the detection and false-positive rates of NGTM-0 are similar to those of NGTM-X for all X. This similarity arises simply because the number of HT samples (e.g., 15 HT data points versus 20K HT-free data points) is not statistically significant enough to affect the training process.
The outcome of our TT identification in the Fast (X, Y) = (5, 5) speed bin is shown in Figure 16. As in the TP scenario, it contrasts the efficiency of SSTA and the various NGTM variants (Trojan-tainted models) in identifying TTs. The figure illustrates how a decrease in the NN-watchdog's standard deviation greatly improves the TT detection rate (over 40% in some cases). As demonstrated, NGTM identifies TTs at a lower rate than TPs, because a TT has a smaller effect on the delay of the affected observed nets than a TP (whose impact is at least equal to one gate delay). Similar to the TP case, contamination of the training set by a small number of HT data points does not affect the trained NN-watchdog's accuracy. As shown, the accuracy of HT detection depends greatly on the learning model chosen (MLP vs. Stacked) for training the NN-watchdog: with stacking regression, a reduced threshold (chosen as 4\(\sigma _{NN}\) of the regression model's error) can increase the detection rate by 10% to 15%, yielding an HT detection rate above 95%. This is achieved while the overall solution's false-positive rate under the stacking regression model remains lower than or equal to that of its MLP-based equivalent. Although using the Youden threshold considerably increases TT detection, it produces more false positives and might not be the best method for determining the detection threshold.
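To illustrate how the choice of learning model changes \(\sigma _{NN}\) (and hence the 4\(\sigma _{NN}\) threshold), the sketch below compares an MLP regressor with a stacked regressor on synthetic data. The base learners, hyperparameters, and data generator are illustrative assumptions, not the exact configuration of our NN-watchdog.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
# Synthetic path features -> per-path slack shift (ps).
X = rng.normal(size=(5_000, 8))
y = X @ rng.normal(size=8) + rng.normal(0.0, 3.0, size=5_000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
    "Stacked": StackingRegressor(
        estimators=[("mlp", MLPRegressor(hidden_layer_sizes=(64, 32),
                                         max_iter=500, random_state=0)),
                    ("rf", RandomForestRegressor(n_estimators=100, random_state=0))],
        final_estimator=Ridge()),
}
for name, model in models.items():
    resid = y_te - model.fit(X_tr, y_tr).predict(X_te)
    sigma = resid.std()
    print(f"{name}: sigma_NN = {sigma:.2f} ps -> 4*sigma threshold = {4 * sigma:.2f} ps")
```

A lower residual sigma directly translates into a lower 4\(\sigma _{NN}\) detection threshold, which is why the stacked model can detect smaller HT-induced delay shifts at a comparable false-positive rate.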
Figure 15 illustrates the ROC curve from which the Youden threshold is extracted for NGTM-10; the Youden values for the other NGTM models are extracted from similar ROC curves. Table 4 compares the threshold values obtained with the Youden method against the 4\(\sigma _{NN}\) threshold derived from the regression model's error. Because TT and TP detection have different ROC curves, the Youden thresholds for the two differ, whereas the 4\(\sigma _{NN}\) threshold is the same for both (the same NN is used for detecting TTs and TPs). As illustrated, the Youden threshold is smaller than the 4\(\sigma _{NN}\) threshold for both TT and TP detection, which explains why HT detection with the Youden threshold yields a higher detection rate in Figures 15 and 16. But, as illustrated in these figures, the smaller threshold comes at the expense of a significantly higher false-positive rate. The table also compares the threshold values obtained with the stacking learning solution against those obtained with an MLP solution.
TTs may be engineered to have a negligible delay impact on the affected timing paths, making them more covert: an adversary can minimize the change in delay by connecting TTs to gates with high drive strength and low threshold voltage, and by shortening the TT nets to minimize their capacitive delay. However, as mentioned in section IV-C, by capping the size of standard cells used in the design, boosting utilization in regions that need to be safeguarded, and mandating high routing density across those areas, we can make the design more sensitive to Trojan Triggers. This forces the adversary to connect the TT to cells with smaller drive strength and to use longer nets to connect the Trojan Trigger to the Trojan logic (with the TT placed further away). We set up an experimental SPICE framework to evaluate how a Trojan Trigger affects delay at various distances from its driving cell, employing a distributed RC model for metal 3 in a 32nm process. We modeled the effect of increasing the TT distance (and the associated capacitive delay) on the delay of a timing path built from five NAND gates. As shown, assuming our detection threshold is 25 ps, the timing-path delay rises by 25 ps when the TT introduces extra capacitance equivalent to a net driving TT logic placed 40 \(\mu\)m away from the affected net. Sensitization may therefore be a successful strategy for raising the HT detection rate.
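A back-of-the-envelope sketch of this capacitive-delay trend: an Elmore-delay estimate of a distributed RC net between the victim driver and the TT logic. The per-micron metal-3 RC values, driver resistance, and gate load below are placeholder assumptions, not our extracted 32nm parameters, so the absolute numbers differ from the 25 ps figure above.

```python
R_PER_UM = 2.0      # wire resistance, ohm/um (assumed)
C_PER_UM = 0.2e-15  # wire capacitance, F/um (assumed)

def tt_added_delay_ps(length_um, r_drv_ohm=1_000.0, c_tt_gate_f=1e-15):
    """Elmore delay added by a distributed RC net driving the TT input gate."""
    r_w = R_PER_UM * length_um
    c_w = C_PER_UM * length_um
    # The driver resistance sees the full wire + TT gate cap; the distributed
    # wire itself contributes r_w * (c_w/2 + c_load).
    delay_s = r_drv_ohm * (c_w + c_tt_gate_f) + r_w * (c_w / 2.0 + c_tt_gate_f)
    return delay_s * 1e12

for d_um in (10, 20, 40):
    print(f"TT net of {d_um} um adds ~{tt_added_delay_ps(d_um):.2f} ps")
```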