research-article

Open access

Hyperparameter Tuning with Gaussian Processes for Optimal Abstraction Control in Simulation-based Optimization of Smart Semiconductor Manufacturing Systems

Authors:

Boyi Su,

Seon Han ChoiAuthors Info & Claims

ACM Transactions on Modeling and Computer Simulation, Volume 35, Issue 1

Article No.: 7, Pages 1 - 21

https://doi.org/10.1145/3646549

Published: 25 November 2024 Publication History

PDF eReader

Abstract

Smart manufacturing utilizes digital twins that are virtual forms of their production plants for analyzing and optimizing decisions. Digital twins have been mainly developed as discrete-event models (DEMs) to represent the detailed and stochastic dynamics of productions in the plants. The optimum decision is achieved after simulating the DEM-based digital twins under various what-if decision candidates; thus, simulation acceleration is crucial for rapid optimum determination for given problems. For the acceleration of discrete-event simulations, adaptive abstraction-level conversion approaches have been previously proposed to switch active models of each machine group between a set of DEM components and a corresponding lookup table-based mean-delay model during runtime. The switching is decided by detecting the machine group’s convergence into (or divergence from) a steady state. However, there is a tradeoff between speedup and accuracy loss in the adaptive abstraction convertible simulation (AACS), and inaccurate simulation can degrade the quality of the optimum (i.e., the distance between the calculated optimum and the actual optimum). In this article, we propose a simulation-based optimization that optimizes the problem based on a genetic algorithm while tuning specific hyperparameters (related to the tradeoff control) to maximize the speedup of AACS under a specified accuracy constraint. For each individual, the proposed method distributes the overall computing budget for multiple simulation runs (considering the digital twin’s probabilistic property) into the hyperparameter optimization (HPO) and fitness evaluation. We propose an efficient HPO method that manages multiple Gaussian process models (as speedup-estimation models) to acquire promising optimal hyperparameter candidates (that maximize the simulation speedups) with few attempts. The method also reduces each individual’s exploration overhead (as the population evolves) by estimating each hyperparameter’s expected speedup using previous exploration results of neighboring individuals without actual simulation executions. The proposed method was applied to optimize raw-material releases of a large-scale manufacturing system to prove the concept and evaluate the performance under various situations.

List of Abbreviations

AACS	Adaptive Abstraction Convertible Simulation
ALC	Abstraction-Level Converter
DEM	Discrete-Event Model
DOS	DEM-Only Simulation
GA	Genetic Algorithm
GP	Gaussian Process
HPO	Hyperparameter Optimization
KS	Kolmogorov Smirnov
MDM	Mean-Delay Model
SBO	Simulation-Based Optimization

1 Introduction

Semiconductor manufacturing systems have applied Industry 4.0–based smart manufacturing concepts to their plants to enhance the monitoring and adaptive process control capabilities. The increased monitoring capabilities provide a huge volume of real-time data from machines or Internet-of-Things devices, which helps to develop and validate the digital twin (i.e., the virtual model of the plant). Moreover, we also used the data to initialize the digital twin so that we could estimate the performances of various what-if scenarios.

For agile production control, we should quickly compute optimal control solutions after evaluating the production performances of various what-if scenarios [2, 7, 35]. Depending on the measurement of key performance indicators, some indicators, such as overall cycle time, machine groups’ maximum (or average) queue lengths, and so on, are computed using discrete-event simulation to consider the production dynamics for accurate estimation [2, 5, 10, 14]. Engineers aim to represent the target manufacturing plant’s physical entities, such as machines and automated guided vehicles, considering their stochastic behaviors in terms of operation periods, reworking, and breakdown. Those engineers virtualize each physical entity into a corresponding discrete-event model (DEM) component. The components are integrated to build the digital twin (i.e., virtual plant), and the digital twin is simulated by exchanging events among the components. The digital twin is initialized and maintains its high fidelity online using observed events to synchronize the digital twin and physical plant [29].

Using the digital twin, we can evaluate the performances of various what-if production decision candidates based on different scheduling and dispatching scenarios before applying a candidate to the actual plant. This article deals with finding one or multiple optimal solutions among different candidates through simulation-based optimization (SBO) using the stochastic DEM-based digital twin. For the rapid response to the physical plant with an optimal control solution (derived by SBO), minimizing the overall execution time for SBO is essential. For that, we have previously focused on the methods of accelerated simulation of the digital twin [26, 27, 28]. The key concept of the methods is runtime model abstraction, which adaptively switches each machine group between the group’s high-fidelity DEMs components and a corresponding lookup table-based mean-delay model (MDM).

A group’s active model is switched whenever the group converges into or diverges from a steady state in terms of the stationary of queuing jobs in the group. If the group runs in an unsteady state, then the group’s DEM components become active to perform event-based complex processing operations based on external or internal events. Among external events, the lot events (symbolizing raw materials/parts flows) are primary entities processed by machines and move across the machines. In a steady state, an MDM becomes active to handle the lot events based on saved mean delays, skipping the DEMs’ detailed operations. The switching condition is computed following the Kolmogorov–Smirnov (KS) test. The test examines a statistical similarity between two types of consecutive observation samples, consisting of arrival-interval and waiting-time observations, and aims to see the convergence of queuing jobs based on Little’s Law. Depending on the lot-release scenario from the inventory, the overall production stability changes. As the virtual plant’s overall production becomes more stable, MDMs are more likely to be active than high-fidelity DEM components, which guarantees higher speed. Even in the same stability, depending on the values of the significance level \(\alpha\) used in the KS test, the statistical similarities of the consecutive samples are different; lower \(\alpha\) allows more invariances between the samples, leading to more simulation speedup with more accuracy loss.

During the SBO, the optimizer generates many individuals who represent control (or decision) candidates. Each individual’s expected performance is measured using multiple simulations due to the stochasticity of digital twin. To shorten the simulation execution during SBO, we can set lower \(\alpha\). However, this reduces the performance accuracy of each individual, which can influence the optimal quality in terms of the closeness between the derived optimal by adaptive abstraction convertible simulation (AACS) and the actual optimal. We denote that the actual optimal is a value computed through DEM-only simulation (DOS) with a sufficiently large simulation budget (i.e., the number of simulation runs for each individual). Thus, we should carefully select the \(\alpha\) value for each individual to gain a speedup while satisfying an accuracy constraint (i.e., a defined upper bound of the allowable accuracy loss). When MDMs sample lot types whose delay was not observed during the previous sampling for the MDM activation, these will trigger the switching conditions from MDMs to DEMs. In high-mix and low-volume manufacturing scenarios, a small sample size can reduce the speedup due to the frequency of the sampling new lot types, where the frequent switching reduces the activation duration of MDMS. However, a large sample size hinders early detection of steady-state convergence because of its more extended sampling period, which might not draw a maximum speedup. Hence, the selection of appropriate sample sizes is also essential to the simulation speedup in SBO.

This article proposes an integrated approach to optimize the plant’s production control parameters (affecting the target performance indicators) and then tune two important hyperparameters: \(\alpha\) and the sample size. The approach aims to shorten the overall SBO execution time by maximizing the AACS speedups under the accuracy constraint, which aims to derive the production-control optimum(s) as accurately as those optimized through DOS. We use the genetic algorithm (GA) as the default optimization algorithm. Due to the stochastic property of semiconductor manufacturing plants, each individual (provided by the GA-based optimizer) requires a number of simulation runs using different random seeds, as shown in Figure 1. In this article, the number of runs is also known as the simulation budget, and we fix each individual’s budget for its fitness evaluation. The proposed approach allocates part of each individual’s total simulation budget for the hyperparameter optimization (HPO).

Fig. 1.

The main contributions are as follows:

—

We propose an HPO approach integrated with conventional GA-based SBO to maximize the speedup of AACS under a user-defined accuracy constraint.

—

We designed a hyperparameter tuner that (1) performs HPO with Gaussian process (GP) models (as speedup-estimation models) to acquire promising hyperparameter candidates by estimating the candidates’ speedups before actual simulations and (2) evaluates the population’s fitnesses through multiple AACSs using tuned hyperparameter values.

—

We propose an efficient hyperparameter exploration method for a given individual, which reuses the previous exploration results of the individual’s neighbors based on the following perspective: an individual’s production dynamics will be similar to its neighbors, and the hyperparameter optimums for similar decision candidates would be close.

Distinct from traditional HPO methodologies, which predominantly aim to identify a singular optimal set of hyperparameters for enhancing problem-specific training or optimization processes, our approach focuses on determining multiple hyperparameter optima, each corresponding to an individual entity. This endeavor is aimed at maximizing the speedups of individuals by adeptly managing individuals’ own GP-based speedup estimation models. As the population evolves during genetic algorithm-based optimization, our approach augments the reliability of GP models by assimilating results derived from previous simulation explorations, thus empowering the tuning mechanism to deliver more precise hyperparameter optima. The evolution of these GP models amplifies the AACS speedups and reduces the portion of the budget allocated for hyperparameter exploration within HPO. This methodology ultimately leads to a reduction in the stochastic variances of the DEM, enabling more simulation runs for fitness evaluations compared to the hyperparameter explorations within a predetermined simulation budget.

The rest of the article is organized as follows. Section 2 describes the literature overview of the proposed SBO approach. Section 3 specifies the problems of HPO during AACS-based optimization. Section 4 explains the HPO procedures during the SBO. Section 5 shows the detailed procedures of (1) preparing the data used to initialize the GP models by reusing the previous exploration results of neighboring individuals without running actual simulations and (2) the GP-based HPO method. Section 6 shows the experimental results of the proposed approach, and Section 7 concludes the article.

2 Related Works

When a new demand is triggered, the semiconductor manufacturer often needs to make rapid decisions for production scheduling or material planning to optimize the cycle time, cost, utilization, bottleneck reduction, and so on. To be responsive to the rapid demand changes, optimal solutions need to be found quickly after estimating the performances of various what-if decision candidates. DEM has been typically used to simulate complex production dynamics of semiconductor manufacturing plants [2, 5, 10, 14]. DEM has multiple advantages in accessing the key performance indicators of the plant’s various production scenarios. It enables building a complex system model to be built by including detailed subcomponents, which operate in an event-driven manner and exchange events to influence others [24]. We can couple various DEM components to achieve the desired fidelity for detailed processing, routing, and dispatching rules [20]. The process of finding the optimum candidates by simulating the decision candidates is called SBO. One typical way to speed up the SBO is to reduce the total number of simulation runs required by improving the optimization algorithm. Another way is to accelerate the simulation evaluation (the leading overhead of SBO). This article focuses on the runtime improvement of simulation evaluation during the SBO.

Although there are advantages to using DEMs, discrete-event simulations are computationally expensive due to iterative event processing [17]. This shortcoming becomes further accentuated during SBO, where achieving global optimization typically requires tens of thousands of model runs to find the global optimum within a delineated feasible parameter space, thereby obstructing thorough exploration in the optimization phase. A conventional approach to alleviating the computational burden during SBO is employing surrogate models, which are cost-efficient approximated models of compute-intensive simulation models. These surrogate models articulate the relationship between the inputs (i.e., the model’s adjustable parameters) and outputs (i.e., the performance measure of the simulation model) using specified mathematical expressions whose parameters are calibrated using sufficient input-output data derived from multiple simulation runs with varied parameter sets.

In efforts to accelerate semiconductor manufacturing simulations, researchers have formulated two main types of surrogates: (1) queuing networks [6, 11, 23] and (2) other analytic closed-form solutions for individual machine groups. Queuing networks comprise queuing nodes assigned to each machine group whose machines execute analogous tasks and share a queue. Each node is modeled using queuing parameters to represent arrival and service processes. Other surrogates incorporate empirical approximations in various closed-forms solutions as constant delays, probability distributions, or exponential/quantile functions [3, 15, 25, 32]. The parameter values for those surrogates are calibrated using simulation outcomes of DEMs. However, surrogate models inherently face significant limitations, notably their inability to assure estimation accuracy in unprecedented or markedly divergent situations (arising from varied demands or orders) compared to those in the training data scenarios. These limitations are emphasized due to the characteristic on-demand requests of SBO aimed at optimizing evolving manufacturing systems. Thus, gathering adequate data for surrogate calibration before surrogate-based optimization is a practical challenge. Moreover, those surrogate models struggle to accommodate complex scenarios involving high-mix processing, varying setups, machine groups with batching and cascading machines, and dynamic dispatching rules.

To resolve these previous surrogates’ problems, we previously proposed an adaptive abstraction-level convertible model (as a new surrogate). We presented an abstraction-level converter (ALC) for each machine group to switch the group’s active model between a set of dedicated DEMs and a corresponding lookup table-based MDM during runtime. Each detection of a group’s steady-state convergence or divergence triggers the model switching. Compared to the previous surrogate models, MDMs are not pre-biased to specific scenarios or predeveloped before simulation runs, because their runtime behaviors rely on observed mean delay information in steady states. ALC detects the steady-state convergence/divergence through the KS test based on two important hyperparameters: significance level \(\alpha\) and the sample size. For a production scenario, various hyperparameter values differently improve speedups with some accuracy losses. Moreover, a hyperparameter candidate that guarantees the maximum speedup with the desired accuracy loss for a production scenario might not be effective in another scenario. Thus, we aim to perform the HPO during the GA-based SBO to maximize the simulation speedup under the accuracy constraint when evaluating a lot of individuals’ fitnesses.

HPO approaches have been actively studied in the training of machine learning. Those hyperparameters, such as the number of hidden layers, learning rates of stochastic gradient descent, batch size, and so on, directly control the behaviors of training algorithms and significantly affect the performance of machine learning [33]. The HPO approaches require deep knowledge of machine learning algorithms and machine learning-specific appropriate techniques [34]. The HPO can provide several primary advantages: (1) reducing the costly menial work of engineers, (2) improving the accuracy and efficiency of neural network training, and (3) improving the reproducibility and fairness of scientific studies [9].

Some HPO studies have tried to find optimal hyperparameter values for the evolutionary algorithm (EA) [8, 12, 19, 31], and Aleti and Moser [1] provided a comprehensive review of HPO. The success of EA applications highly depends on the hyperparameter settings, such as the choices of representation, parent selection, recombination and mutation operators, the probabilities in selection and replacement, population size, and so on. Among the HPO strategies in Reference [1], our approach can be categorized as adaptive parameter control that searches problem-specific optimum while monitoring the quality of EA runs to adjust hyperparameters.

This article presents a unified approach, concurrently orchestrating HPO and AALCs during SBO. Unlike conventional HPO studies, which primarily aim to identify a universal optimal set of hyperparameters to enhance training and optimization for specific problems, our approach endeavors to find the individuals’ own hyperparameter optima for their maximum speedup while maintaining accuracy constraints. Our proposed approach is advancing our preceding method discussed in Reference [30], seeking to maximize the speedup by incorporating a new critical hyperparameter, the sample size, and the previously used hyperparameter, \(\alpha\). Given the expanded cardinality induced by the new hyperparameter, it becomes crucial to develop an efficient search technique to navigate the enlarged hyperparameter space with minimized simulation runs. For that, we propose a Bayesian optimization-based strategy that proficiently manages individuals’ own GP models that are calibrated using observed speedup measurements and used to estimate the speedup of various hyperparameter candidates. The aim is to curtail the overhead of hyperparameter exploration by predicting high-quality initial GP models for new individuals, leveraging neighboring individuals’ existing GP models, and underpinning the assumption that optimal hyperparameters for analogous manufacturing scenarios are likely similar. The following section elucidates the challenges in realizing this objective.

3 Problem Formulation

To detect a machine group’s steady-state convergence (or divergence) of the DEMs, the group’s ALC monitors each lot type’s inter-arrival times and waiting times in the group’s queue. The same lot type guarantees the same priority in dispatching and the same processing recipe (related to processing time). Let \(1/\lambda _{i}\) be an inter-arrival time between an i-type lot and its previous i-type lot, and let \(W_i\) be the waiting time of an i-type lot. The ALC manages two types of consecutive paired samples. One type of paired samples consists of \(\lambda _i\) observations for all lot types (i.e., \(\forall i = 1, 2, \ldots\)), and the other type of paired samples consists of \(W_i\) observations. If both paired samples become statistically indistinguishable, then we determine that the group queue converges into a steady state based on queueing theory. In the steady state, we assume that (1) the queue’s average number of i-type lots is converged to \(\lambda _i \times W_i\) and (2) the average total queue length becomes \(\sum _i \lambda _i W_i\). The steady-state convergence condition based on the KS test can be formulated as follows:

\begin{equation} \sup | F_{\lambda }^{\text{-}}(\lambda _i) - F_{\lambda }^{\text{+}}(\lambda _i) | \le b \wedge \sup | F_{W}^{\text{-}}(W_i) - F_{W}^{\text{+}}(W_i) | \le b, \end{equation}

(1)

where \(F_{\lambda }^{\text{-}}\) and \(F_{\lambda }^{\text{+}}\) are cumulative distribution functions (CDFs) of two consecutive arrival-rate samples; \(F_{W}^{\text{-}}\) and \(F_{W}^{\text{+}}\) are CDFs of two consecutive waiting-time samples; and \(b=C(\alpha , m^{\text{-}}, m^{\text{+}})\) is an allowable bound that is calculated through function C based on significance-level \(\alpha\) and two sample sizes, \(m^{\text{-}}\) and \(m^{\text{+}}\). A general form of statistical invariance bound, \(C(\alpha , m^{\text{-}}, m^{\text{+}})\), is represented as \(\sqrt {-\ln (\alpha /2) \frac{1 + m^{\text{-}}/m^{\text{+}}}{2m^{\text{-}}}}\) [18]. As hyperparameter \(\alpha\) decreases, bound \(C(\alpha , m^{\text{-}}, m^{\text{+}})\) increases inversely. If the group’s ALC in a steady state detects a statistical arrival change, then the DEMs are reactivated after generating dummy lot events to make a workload consistency [26, 27, 28].

Depending on the production stability of the target scenario candidate, different values of \(\alpha\) are required, because there is a tradeoff between speedup and accuracy. Typically, smaller \(\alpha\) leads to better speedup at the expense of accuracy. Especially in scenarios whose overall productions are highly unstable, small \(\alpha\) can induce a large number of switching between the DEMs and corresponding MDMs, which results in both higher inaccuracy and a slowdown.

Besides detecting a statistical invariance of lot-input arrivals, a steady-state machine group can reactivate the DEMs when observing a new lot type or a machine down. The new lot type means that an input lot’s mean delay has not been stored in the MDM’s lookup table. ALC sends the new lot to DEMs after reactivating the DEMs. As the sample size increases, we can reduce the chance of observing a new lot type, which reduces the number of DEM reactivations of steady-state groups. However, this leads to spending more time on collecting the consecutive samples (for the convergence testing), which can result in less speedup because of insensitive steady-state convergence detection.

In this work, we tune two hyperparameters: \(\alpha\) and the sample size m, which are used for the AACS of each individual during SBO. There are several previous hyperparameter tuning techniques to improve neural network training. Those hyperparameter tuning techniques attempt to optimize functions whose closed-form expressions and derivatives are unknown and challenging to identify. Typical hyperparameter tuning methods are grid search, random search, and Bayesian optimization [9, 34]. Grid and random searches roam the entire space of available parameter values in an isolated way without paying attention to past results. These tuning techniques can become a time-consuming challenge, especially with large parameter spaces. The search space grows exponentially with the number of parameters tuned. Thus, we employ the Bayesian optimization approach, which guarantees that fewer attempts are required to find the optimal hyperparameter set compared to the grid and random searches. This reduction of attempts is significant when the computational cost in evaluating the objective function is high. In our case, the objective-function evaluation requires the speedup measurement through simulation; thus, the cost is relatively high compared to the analytic calculation of an equation.

Our approach requires users to specify an accuracy constraint as the bound of an allowable accuracy loss. Considering the DEM’s stochastic property, each individual’s fitness evaluation requires multiple simulation executions to average the results. We prefix each individual’s total number of simulation runs (i.e., computing budget) as a user-defined number. For the Bayesian optimization of individuals’ speedup-related hyperparameters, we manage multiple GP models to acquire promising hyperparameter candidates after estimating the speedup of each candidate prior to actual simulation-based measurements. The GP models also help quantify uncertainties of the predicted speedups.

The following sections detail these three points:

—

The overall framework to tune hyperparameters in SBO.

—

Each individual’s simulation budget management procedures of the tuner to balance exploration and exploitation.

—

The reuse of previous hyperparameter exploration information among neighboring individuals.

4 Overview of Hyperparameter Tuning in Simulation-based Optimization

As shown in Figure 1, the proposed approach requires two essential entities for simultaneous HPO and SBO, which are (1) the GA-based optimizer and (2) the hyperparameter tuner. The GA-based optimizer is for the problem-specific optimization of decision candidates. It generates multiple individuals (i.e., offsprings for the next generation) based on the fitness of every individual in the current population. Each individual symbolizes a manufacturing-decision candidate; a user-defined objective function is used to measure the individual’s mean fitness based on multiple simulation results using different random seeds. For each individual \(\boldsymbol {\mathbf {p}}_{\ell }\) in the current new offsprings \(P_\text{new}\), the tuner cooperates with the GA-based optimizer for the following two important roles.

—

Exploration: allocating a partial budget to traverse various hyperparameter points over the hyperparameter space until finding the optimal point.

—

Exploitation: evaluating \(\boldsymbol {\mathbf {p}}_\ell\)’s fitness through AACS using optimal hyperparameter point.

The tuner manages a GP model for each individual \(\boldsymbol {\mathbf {p}}_\ell\) to find the optimal hyperparameter point that maximizes an unknown speedup function f:

\begin{equation} \boldsymbol {\mathbf {x}}_\text{opt} = \mathop{\arg\max}_{\boldsymbol {\mathbf {x}} \in \mathcal {A}} f_\ell (\boldsymbol {\mathbf {x}}), \end{equation}

(2)

where \(\mathcal {A}\) denotes the search space of hyperparameter \(\boldsymbol {\mathbf {x}} = \langle x^{(1)}, x^{(2)} \rangle \in [0, 1]^2\) and \(x^{(1)}\) and \(x^{(2)}\) are two encoded hyperparameters from original hyperparameters, \(\alpha\) and m, respectively. The original hyperparameters \(\alpha\) and m are under user-defined search spaces \(A \subset \lbrace \alpha \mid 0 \lt \alpha \le 2\rbrace \in \mathbb {R}\) and \(M \in \mathbb {R}\), i.e., \((\alpha , m) \in A \times M\). The maximum \(\alpha\) is 2 (i.e., \(\sup A =2\)), which signifies that the candidate does not need AACS. If \(\sup A =2\), then statistical bound \(C(\alpha , m,m)=\sqrt {-\ln (\alpha /2) \frac{1 + m/m}{2m}}\) (in Equation (3)) becomes zero, and the abstraction conversion will not occur. If bound \(C(\alpha , m,m)\) is 2, then the simulation speedup always becomes less than 1 because of unnecessary computational overheads for steady-state detection. We roughly define an \(\alpha\) range, \(A_\text{dos}=\lbrace \alpha \mid 1 \le \alpha \le 2\rbrace \subset A\), where each \(\alpha \in A_\text{dos}\) deactivates ALCs to monitor their groups’ steady-state convergences.

Each encoding from \(\alpha\) to \(x^{(1)}\) (or from m to \(x^{(2)}\)) is performed considering the scales of \([0,1]\) and A (or M). Decoding from \(x^{(1)}\) to \(\alpha\) (or from \(x^{(2)}\) to m) is also performed when we apply a candidate (derived by the GA) to AACS. These scaling-based encoding and decoding procedures allow (1) the calculation of a kernel between two hyperparameters with equal importance at the GP-based speedup estimation (discussed in Section 5) and (2) the mapping between continuous GP variable \(x^{(2)}\) and discrete sample size m.

For each individual, users should specify two types of maximum simulation budgets: the maximum number of total simulations and the maximum number of simulations for the exploration in HPO. We denote two types of budget as \(B_\text{total}\) and \(B_\text{exp} (\lt B_\text{total})\), respectively. The maximum budget \(B_\text{exp}\) is an even number, but the number of actual simulations for HPO, denoted by \(n_\text{exp} (\le B_\text{exp})\), can vary depending on the confidence of estimated hyperparameters. The remaining simulation budget, \(B_\text{total}- n_\text{exp}\), is used to run multiple simulations for an individual’s fitness evaluation using the best hyperparameter point \(x_\text{best}\) resulting from the current exploration in HPO. The individual’s fitness is the average of the simulation results, as many as \(B_\text{total} - n_\text{exp} + n_\text{exp}/2\), as shown in Figure 2. The number of simulations for the fitness evaluation is the same as the required number of random seeds.

Fig. 2.

During the exploration in HPO, multiple simulation pairs are executed for an individual, as many as \(n_\text{exp}\). In each pair, AACS and DOS for a given individual \(\boldsymbol {\mathbf {p}}_\ell\) are conducted by sharing a same random seed to measure the accuracy loss and speedup. The accuracy loss, computed using random seed s and hyperparameters \(\boldsymbol {\mathbf {x}}\), is the ratio of difference between AACS- and DOS-driven evaluated fitnesses, such that \(\mathit {loss}_\ell (\boldsymbol {\mathbf {x}}, s) = | \left(\mathit {fit}_\text{aacs}(\boldsymbol {\mathbf {p}}_\ell , \boldsymbol {\mathbf {x}}, s) - \mathit {fit}_\text{dos}(\boldsymbol {\mathbf {p}}_\ell , s) \right) / \mathit {fit}_\text{dos}(\boldsymbol {\mathbf {p}}_\ell , s) |\), where \(\mathit {fit}_\text{aacs}(\boldsymbol {\mathbf {p}}_\ell , \boldsymbol {\mathbf {x}}, s)\) and \(\mathit {fit}_\text{dos}(\boldsymbol {\mathbf {p}}_\ell , s)\) are the evaluated fitnesses by AACS and DOS, respectively, using \(\boldsymbol {\mathbf {p}}_\ell\), \(\boldsymbol {\mathbf {x}}\), and s. The speedup is conventionally gauged using actual execution times; however, we have devised an alternative approach. Our driving rationale behind this shift is to allocate a greater portion of the total budget \((= B_\text{total})\) toward application-specific optimization, thereby curtailing the number of simulation runs for HPO. When we base the speedup evaluation on actual execution times, it necessitates multiple simulation runs for each hyperparameter candidate within HPO due to the unavoidable stochastic variances present, even in measuring execution times of simulation on a standalone computing machine. To circumvent this limitation, we define an alternative metric (as a deterministic gauge for assessing the speedup of AACS versus DOS) as follows.

Definition 1.

An individual \(\boldsymbol {\mathbf {p}}_\ell\)’s AACS speedup for a given pair of hyperparameter \(\boldsymbol {\mathbf {x}}\) and random seed s is measured by the ratio of the number of events processed during AACS compared to that of DOS, i.e., the speedup is represented by \(\mathit {ev}_\text{dos} (\boldsymbol {\mathbf {p}}_\ell , s)/\mathit {ev}_\text{aacs}(\boldsymbol {\mathbf {p}}_\ell , \boldsymbol {\mathbf {x}}, s)\),

where \(\mathit {ev}_\text{dos}(\boldsymbol {\mathbf {p}}_\ell , s)\) and \(\mathit {ev}_\text{aacs}(\boldsymbol {\mathbf {p}}_\ell , \boldsymbol {\mathbf {x}}, s)\) are the number of events processed during DOS and AACS, respectively, for the given \(\boldsymbol {\mathbf {p}}_\ell\), \(\boldsymbol {\mathbf {x}}\), and s.

Given a distinct random seed, the complete set of event-driven operations for the target scenario simulation is predetermined. The AACS strategy accelerates traditional discrete-event simulations in semiconductor manufacturing by omitting repetitive event-driven activities within steady-state machine groups. As a result, the enhancement in speedup, attributable to a specific hyperparameter candidate during AACS, is deeply tied to the number of events it can evaporate. For each hyperparameter candidate, there is a deterministic reduction in the total events in the standard DOS specific to the same individual and random seed. Our previous research empirically showed the high correlation between the reduced event count and the consequent simulation acceleration [26].

Using random seed s, the unknown speedup function \(f(\boldsymbol {\mathbf {x}})\) is measured by

\begin{align} f_\ell (\boldsymbol {\mathbf {x}}) \approx f_\ell (\boldsymbol {\mathbf {x}}, s) = {\left\lbrace \begin{array}{ll} \mathit {ev}_\text{dos} (\boldsymbol {\mathbf {p}}_\ell , s)/\mathit {ev}_\text{aacs}(\boldsymbol {\mathbf {p}}_\ell , \boldsymbol {\mathbf {x}}, s) & \text{ if } \mathit {loss}_\ell (\boldsymbol {\mathbf {x}}, s) \lt \theta _\text{acc}, \\ 0 & \text{ otherwise, } \end{array}\right.} \end{align}

(3)

where \(\theta _\text{acc}\) is a user-defined accuracy constraint. Though we can reduce any noise in measuring the speedup by computing the reduced events (as Definition 1), \(f(\boldsymbol {\mathbf {x}})\) still has a variance, because speedups could change as seed s varies; thus, an amount of statistical variance in the \(f_\ell (\boldsymbol {\mathbf {x}})\) measurement is inevitable.

For the exploration in HPO, the tuner manages each individual’s GP model as a surrogate model to predict a hyperparameter point’s expected speedup based on the prior measurements before measuring the actual speedup. For the given individual \(\boldsymbol {\mathbf {p}}_\ell\) from the GA-based optimizer, the tuner’s overall procedures are described in Algorithm 1. If individual \(\boldsymbol {\mathbf {p}}_\ell\)’s GP model has not been initialized for HPO, then the tuner prepares warmup hyperparameter points (\(X_{ \text{warm}, \ell }\)) and their corresponding speedups to initialize the GP model, as Lines 6–13. This data preparation is called a warmup stage for the GP model. The tuner attempts to estimate the speedups of each point in \(X_{ \text{warm}, \ell }\) without real simulations using previous exploration information of individual \(\boldsymbol {\mathbf {p}}_\ell\)’s neighbors. If point \(\boldsymbol {\mathbf {x}}_i \in X_{ \text{warm}, \ell }\) is predictable without simulation, then the tuner saves \(\boldsymbol {\mathbf {x}}_i\) in additional set \(X_\text{pred}\). Detailed procedures are described in Section 5.1. Otherwise, \(f_\ell (\boldsymbol {\mathbf {x}}_i)\) is measured through a pair of DOS and AACS executions, as Line 12. Each point \(\boldsymbol {\mathbf {x}}_i\) and its predicted or measured speedup, \(f_\ell (\boldsymbol {\mathbf {x}}_i)\), are saved in two types of data sequences: \(X_{\text{exp},\ell }\) and \(Y_{\text{exp},\ell }\), respectively.

Before the exploration for HPO, the tuner derives the optimal hyperparameter candidate \(\boldsymbol {\mathbf {x}}_\text{best}\) based on the previous exploration information: \(X_{\text{exp},\ell }\) and \(Y_{\text{exp},\ell }\), as Line 14. Suppose that \(f_\ell (\boldsymbol {\mathbf {x}}_\text{best})\) was not measured by simulation (i.e., \(\boldsymbol {\mathbf {x}}_\text{best} \in X_\text{pred}\)). In that case, the tuner re-evaluates \(f_\ell (\boldsymbol {\mathbf {x}}_\text{best})\) through a pair of simulations to compute \(f_\ell (\boldsymbol {\mathbf {x}}_\text{best})\) accurately, because the GP-based HPO approach intensively explores the vicinity of \(\boldsymbol {\mathbf {x}}_\text{best}\). Thus, the reliability of \(\boldsymbol {\mathbf {x}}_\text{best}\)’s speedup measurement is critical for successful GP-based HPO.

At the HPO stage, the tuner explores the hyperparameter space \(\mathcal {A}\) until exhausting all of budget \(B_\text{exp}\). Based on previously collected hyperparameters and their speedups, which are saved in \(X_{\text{exp},\ell }\) and \(Y_{\text{exp},\ell }\), the tuner updates the GP prior, which helps predict the speedup incorporating prior knowledge (kernels) and provide uncertainty measures over prediction. Then, the tuner acquires the optimal hyperparameter candidate, denoted by \(\boldsymbol {\mathbf {x}}_\text{new}\). Detailed procedures to update the GP prior and acquire \(\boldsymbol {\mathbf {x}}_\text{best}\) are discussed in Sections 5.1 and 5.2, respectively.

If new hyperparameter point \(\boldsymbol {\mathbf {x}}_\text{new}\) is close to the current optimum \(\boldsymbol {\mathbf {x}}_\text{best}\), then we ensure that hyperparameter \(\boldsymbol {\mathbf {x}}\) is converged. This hyperparameter convergence is an additional termination condition to prevent further exploration and is defined as follows:

Definition 2.

\(\boldsymbol {\mathbf {x}}_\text{new}=\langle x_\text{new}^{(1)}, x_\text{new}^{(2)}\rangle\) is converged to \(\boldsymbol {\mathbf {x}}_\text{best}=\langle x_\text{best}^{(1)}, x_\text{best}^{(2)}\rangle\) if satisfying \(| \boldsymbol {\mathbf {x}}_\text{new}^{(i)} - \boldsymbol {\mathbf {x}}_\text{best}^{(i)}| \lt \theta _\text{conv}^{(i)}\) for all \(i \in \lbrace 1, 2\rbrace\),

where \(\Theta _\text{conv} = \langle \theta _\text{conv}^{(1)}, \theta _\text{conv}^{(2)} \rangle\) is a pair of user-defined thresholds for the two-dimensional hyperparameter . If \(\boldsymbol {\mathbf {x}}_\text{new}\) is not converged, then the tuner evaluates \(f_\ell (\boldsymbol {\mathbf {x}}_\text{new})\) to compare the new speedup with \(f_\ell (\boldsymbol {\mathbf {x}}_\text{best})\) for further optimization, as Lines 24–26. Both explored point \(\boldsymbol {\mathbf {x}}_\text{new}\) and its speedup are saved to \(X_{\text{exp},\ell }\) and \(Y_{\text{exp},\ell }\) at each iteration.

After the hyperparameter exploration terminates, the tuner evaluates the fitness values through AACS using the remaining simulation budget, \(B_\text{total} - n_\text{exp}\) (see Lines 27 and 28). Then, the tuner returns the mean fitness value to the GA-based optimizer.

Individual \(\boldsymbol {\mathbf {p}}_\ell\) in the new offsprings \(P_\text{new}\) could be evaluated at the previous generation but was not selected as an elite. In this case, the tuner reuses the previous exploration and evaluation results of \(X_{\text{exp},\ell }\), \(Y_{\text{exp},\ell }\), and \(\mathit {FIT}_\ell\). Then, the simulation budget for HPO and fitness evaluation is used to (1) enhance the AACS speedup by finding a more optimized hyperparameter point and (2) increase accuracy in the fitness evaluation owing to a larger number of simulation attempts considering DEMs’ stochastic properties. If \(\boldsymbol {\mathbf {p}}_\ell\)’s optimal hyperparameter point was previously converged, then the tuner skips the HPO stage.

5 Gaussian Process-based Hyperparameter Tuning

5.1 Warmup and Update of Gaussian Process Model

The overall stability of each individual’s production dynamics is the main element in deciding a hyperparameter optimum that maximizes the speedup of the individual’s AACS. If production dynamics among neighbors are similar, then a hyperparameter value yields similar speedups during the neighbors’ AACSs; thus, we assume that a hyperparameter point’s expected speedup would be similar among neighbors.

If the GP model was not initialized for the given individual \(\boldsymbol {\mathbf {p}}_\ell\) (from the GA-based optimizer), then the tuner starts a warmup stage. In the warmup stage, the tuner initializes individual \(\boldsymbol {\mathbf {p}}_\ell\)’s GP model using synthetic data (which is used to initialize the GP) by estimating the speedups of initial hyperparameter points \(X_{ \text{warm}, \ell }\) based on the previous exploration results of adjacent neighbors. \(X_{ \text{warm}, \ell }\) is divided into two subsets: \(X_{\text{fix}}\) and \(X_{\text{rnd}, \ell }\) (i.e., \(X_{\text{warm}} = X_{\text{fix}} \cup X_{\text{rnd}, \ell }\)), where \(X_{\text{fix}}\) is the set of user-specified and prefixed points that are shared by all of the population; \(X_{\text{rnd}, \ell }\) is the set of randomly sampled points for each individual. The fixed points in \(X_{\text{fix}}\) should be specified sparsely but identically distributed over the hyperparameter space to mitigate a local optimum trap. The management of the fixed points (\(X_\text{fix}\)) aims to find each point’s speedup without actual simulation executions through approximated prediction using neighboring individuals’ speedup estimates (or simulation-based measurements) based on the assumption: a point’s speedups would be similar among neighboring individuals. More speedup prediction saves the simulation budget used to constitute plenty of the warmup data.

Let \(f_k(\boldsymbol {\mathbf {x}}_i)\in Y_{\text{exp},k}\) be the previously measured speedup of hyperparameter \(\boldsymbol {\mathbf {x}}_i \in X_{\text{exp},k}\) when evaluating the fitness of an individual \(\boldsymbol {\mathbf {p}}_k\). Let \({\bf P}_{\text{nbr}, \ell }\) be the set of \(\boldsymbol {\mathbf {p}}_\ell\)’s neighbors in all current and previous population P, i.e., \({\bf P}_{ \text{nbr}, \ell } = \lbrace p_k \mid d_{(\ell ,k)} \lt d_{\max }, p_k \in P \rbrace\), where \(d_{k, \ell }\) is the distance between two individuals: \(\boldsymbol {\mathbf {p}}_k\) and \(\boldsymbol {\mathbf {p}}_\ell\); and \(d_{\max }\) is a user-defined threshold that determines the maximum radius of the neighborhood. To estimate the speedup of an individual \(\boldsymbol {\mathbf {p}}_\ell\)’s hyperparameter point \(\boldsymbol {\mathbf {x}}_i \in X_{\text{fix}}\), the tuner checks whether there are enough neighbors who previously explored \((\boldsymbol {\mathbf {x}}_i)\), i.e., \(| N_{\ell }(\boldsymbol {\mathbf {x_i}})| \gt \omega\), where \(N_{\ell }(\boldsymbol {\mathbf {x_i}}) = \lbrace k \mid p_k \in {\bf P}_{\text{nbr}, \ell }, \boldsymbol {\mathbf {x}}_i \in X_{ \text{exp}, k} \rbrace\) is the index set of \(\boldsymbol {\mathbf {p}}_\ell\)’s neighbors who previously explored \(\boldsymbol {\mathbf {x}}_i\); and \(\omega\) is the dimension of problem-specific decision space, i.e., \(\boldsymbol {\mathbf {p}}_\ell = [ p_{\ell ,1} , \ldots , p_{\ell ,\omega }]\), where \(p_{\ell ,i}\) is ith production parameter to be optimized. If \(| N_{\ell }(\boldsymbol {\mathbf {x_i}})| \ge \omega\), then the tuner estimates \(f_{\ell }(\boldsymbol {\mathbf {x}}_i)\) using multiple gradients, denoted by \(\lbrace \nabla g_i(\boldsymbol {\mathbf {p}}_k) \rbrace\), where \(\nabla g_i(\boldsymbol {\mathbf {p}}_k)\) is the gradient of the \(\boldsymbol {\mathbf {x}}_i\)’s speedup at neighbor \(\boldsymbol {\mathbf {p}}_k \in {\bf P}_{\text{nbr}, \ell }\). With respect to neighbor \(\boldsymbol {\mathbf {p}}_k\), we estimate the speedup \(f_{\ell }(\boldsymbol {\mathbf {x}}_i)\) by

\begin{equation} f_{(\ell ,k)}(\boldsymbol {\mathbf {x}}_i) \approx f_k(\boldsymbol {\mathbf {x}}_i) + \nabla g_i(\boldsymbol {\mathbf {p}}_k) \cdot (\boldsymbol {\mathbf {p}}_\ell - \boldsymbol {\mathbf {p}}_k). \end{equation}

(4)

Though the function \(f_k(\boldsymbol {\mathbf {x}}_i)\) is unknown, we can approximate the gradient \(\nabla g_i(\boldsymbol {\mathbf {p}}_k)\) using neighbors’ previously measured speedups as follows:

\begin{align} \begin{bmatrix} \boldsymbol {\mathbf {p}}_{n_1}- \boldsymbol {\mathbf {p}}_k \\ \vdots \\ \boldsymbol {\mathbf {p}}_{n_\tau }- \boldsymbol {\mathbf {p}}_k \\ \end{bmatrix} \nabla g_i(\boldsymbol {\mathbf {p}}_k) &= \begin{bmatrix} f_{n_1}(\boldsymbol {\mathbf {x}}_i) - f_{k}(\boldsymbol {\mathbf {x}}_i) \\ \vdots \\ f_{n_\tau }(\boldsymbol {\mathbf {x}}_i) - f_{k}(\boldsymbol {\mathbf {x}}_i) \\ \end{bmatrix}, \end{align}

(5)

where \(\boldsymbol {\mathbf {p}}_{n_1}, \ldots , \boldsymbol {\mathbf {p}}_{n_\tau }\) are the common neighbors of \(\boldsymbol {\mathbf {p}}_\ell\) and \(\boldsymbol {\mathbf {p}}_k\), i.e., \(p_{n_j}\in {\bf P}_{\text{nbr},\ell } \setminus \lbrace \boldsymbol {\mathbf {p}}_k \rbrace\); and \(\tau (\ge \omega)\) is the number of used previous speedups of the common neighbors. Suppose that we simplify (5) as \(\mathcal {P} \nabla g_i(\boldsymbol {\mathbf {p}}_k) = \mathcal {Y}\), where \(\mathcal {P} = [\boldsymbol {\mathbf {p}}_{n_1}- \boldsymbol {\mathbf {p}}_k; \ldots , \boldsymbol {\mathbf {p}}_{n_\tau }- \boldsymbol {\mathbf {p}}_k]\) and \(\mathcal {Y} = [ f_{n_1}(\boldsymbol {\mathbf {x}}_i) - f_{k}(\boldsymbol {\mathbf {x}}_i); \ldots ; f_{n_\tau }(\boldsymbol {\mathbf {x}}_i) - f_{k}(\boldsymbol {\mathbf {x}}_i)]\). Then, we can solve \(\nabla g_i(\boldsymbol {\mathbf {p}}_k)\) through ordinary least square approximation as follows:

\begin{equation} \nabla g_i(\boldsymbol {\mathbf {p}}_k) = (\mathcal {P}^T \mathcal {P})^{-1}\mathcal {P}^T \mathcal {Y}. \end{equation}

(6)

To reduce the matrix multiplication overhead in the least square approximation, we can set an upper bound of \(\tau\), denoted by \(\theta _\text{nbr}\). If the number of the common neighbors, \(| {\bf P}_{\text{nbr}, \ell } \setminus \lbrace \boldsymbol {\mathbf {p}}_k \rbrace |\), is larger than \(\theta _\text{nbr}\), then the tuner selects partial speedup results of the most adjacent neighboring individuals, as many as \(\theta _\text{nbr}\).

Then, overall expected speedup \(f_{\ell }(\boldsymbol {\mathbf {x}}_i)\) with respect to all gradients \(\lbrace \nabla g_i(\boldsymbol {\mathbf {p}}_{k}) \mid k \in N_{\ell }(\boldsymbol {\mathbf {x}})\rbrace\) is estimated by

\begin{equation} f_{\ell }(\boldsymbol {\mathbf {x}}_i) \approx \sum _{k\in N_\ell (\boldsymbol {\mathbf {x}}_i)} \left(w_{(\ell ,k)} f_{(\ell ,k)}(\boldsymbol {\mathbf {x}}_i) / {\sum _k} w_{(\ell ,k)}\right), \end{equation}

(7)

where \(w_{(\ell ,k)} = 1/d_{(\ell ,k)}\) is the inverse of the distance. This weight-based estimation follows the perspective: The gradients of more adjacent neighbors are most reliable. If enough speedups of hyperparameter \(\boldsymbol {\mathbf {x}}_i \in X_{\text{fix}}\) (i.e., \(N_{\ell }(\boldsymbol {\mathbf {x}}_i) \le \omega\)) cannot be collected from neighbors, then the tuner evaluates \(f_{\ell }(\boldsymbol {\mathbf {x}}_i)\) through a pair of real executions of DOS and AACS.

After an individual’s HPO and fitness evaluation end, the individual notifies his neighbors of his new executions. Among neighbors, some neighbors can predict new speedup of a hyperparameter point if the number of the point’s neighbors’ estimates reaches the minimum neighbor size (which is the same as the dimension of problem-related decision space, \(\omega\)). Some neighbors can renew a hyperparameter’s existing speedup estimate if their numbers of neighbors’ estimates are less than \(\theta _\text{nbr}\).

Unlike \(X_\text{fix}\), the tuner attempts to initialize \(X_{ \text{rnd}, \ell }\) by selecting the neighbors’ optimal hyperparameter points that maximize the speedup. Let \(x_{\text{best}, k}\) be the current best hyperparameter point \(x_\text{best}\) of individual \(\boldsymbol {\mathbf {p}}_k\), i.e., \(\boldsymbol {\mathbf {x}}_{\text{best}, k}= \mathop{\arg\max}_{\boldsymbol {\mathbf {x}}_i \in X_{\text{exp}, k}} f_k(\boldsymbol {\mathbf {x}}_i)\). Then, all possible candidates of \(X_{\text{rnd}, \ell }\), denoted by \(X_{\text{can}, \ell }\), are derived by \(X_{\text{can}, \ell } = \lbrace \boldsymbol {\mathbf {x}}_{\text{best}, k} \mid \boldsymbol {\mathbf {p}}_k \in {\bf P}_{\text{nbr}, \ell }\rbrace \setminus X_\text{fix}\). If \(| X_{\text{can}, \ell } |\) is less than a user-defined threshold, denoted by \(\theta _\text{rnd}\), the tuner generates random points to constitute candidate \(X_{\text{can}}\) whose size is to be \(\theta _\text{rnd}\). With respect to \(X_{\text{can}, \ell }\), we initialize \(X_{\text{rnd}, \ell }\) as follows: \(X_{\text{rnd}, \ell } = \mathop{\arg\max}_{X \subset X_{\text{can}, \ell }, |X| = \theta _\text{rnd} } \sum _{\boldsymbol {\mathbf {x}}_{\text{best}, k} \in X } f_k(\boldsymbol {\mathbf {x}}_{\text{best}, k})\). The tuner measures the speedup of each new hyperparameter point in \(X_{\text{rnd}}\), and then builds the GP models based on the initial hyperparameter set \(X_{\text{exp}, \ell } (= X_{\text{fix}} \cup X_{\text{rnd}}\) and their corresponding speedups \(Y_{\text{exp}, \ell }\).

At the HPO stage, the GP prior is updated whenever \(X_{\text{exp}, \ell }\) and \(Y_{\text{exp}, \ell }\) are changed. Let us denote \(X_{\text{exp}, \ell }\) and \(Y_{\text{exp}, \ell }\) by two n-dimensional vectors of hyperparameters and their spedups, respectively, such that \(X_{\text{exp}, \ell } = [ \boldsymbol {\mathbf {x}}_1, \ldots , \boldsymbol {\mathbf {x}}_n ]\) and \(Y_{\text{exp}, \ell } = f_\ell (X_{\text{exp},\ell }) = [ f_\ell (\boldsymbol {\mathbf {x}}_1), \ldots , f_\ell (\boldsymbol {\mathbf {x}}_n) ]\). We can generally represent the GP prior, \(\text{Pr}(\cdot)\), as a multivariate normal distribution of n-dimensional vector:

\begin{equation} \text{Pr}(X_{\text{exp},\ell }) \approx \mathcal {N}\left(\mu (X_{\text{exp},\ell }), \Sigma (X_{\text{exp},\ell },X_{\text{exp},\ell }) \right), \end{equation}

(8)

where \(\mu (X_{\text{exp},\ell }) = f_\ell (X_{\text{exp},\ell })\) and \(\Sigma (X_{\text{exp},\ell },X_{\text{exp},\ell }) = [ K(x_1, x_1), \ldots , K(x_1, x_n); \ldots ; K(x_n, x_1), \ldots ,\) \(K(x_n, x_n)]\). For the covariance function K (i.e., kernel function), one of the most common kernels is the squared exponential covariance function:

\begin{equation} K(\boldsymbol {\mathbf {x}}_i, \boldsymbol {\mathbf {x}}_j) = \exp (- \frac{1}{2 \beta ^2} || \boldsymbol {\mathbf {x}}_i -\boldsymbol {\mathbf {x}}_j ||^2), \end{equation}

(9)

where \(\beta\) is a user-defined characteristic length-scale of the process.

Suppose that we infer the value of \(f(\boldsymbol {\mathbf {x}}_{\text{new}})\) at new hyperparameter \(\boldsymbol {\mathbf {x}}_{\text{new}}\). We estimate \(f(\boldsymbol {\mathbf {x}}_{\text{new}})\) based on the given prior using Bayes’s rule as follows:

\begin{align} f(\boldsymbol {\mathbf {x}}_{\text{new}}) \mid \text{Pr}(X_{\text{exp},\ell }) &\approx \mathcal {N}(\mu _{\ast }(\boldsymbol {\mathbf {x}}_{\text{new}}), \sigma _{\ast }^2(\boldsymbol {\mathbf {x}}_{\text{new}})), \end{align}

(10)

\begin{align} \mu _*(\boldsymbol {\mathbf {x}}_{\text{new}}) &= \Sigma (\boldsymbol {\mathbf {x}}_{\text{new}},X_{\text{exp},\ell }) \Sigma (X_{\text{exp},\ell },X_{\text{exp},\ell })^{-1}\left(f(X_{\text{exp},\ell }) - \mu (X_{\text{exp},\ell }) \right) + \mu (X_{\text{exp},\ell }), \end{align}

(11)

\begin{align} \sigma _{\ast }^2(\boldsymbol {\mathbf {x}}_{\text{new}}) & = \Sigma (\boldsymbol {\mathbf {x}}_{\text{new}},\boldsymbol {\mathbf {x}}_{\text{new}}) - \Sigma (\boldsymbol {\mathbf {x}}_{\text{new}},X_{\text{exp},\ell }) \Sigma (X_{\text{exp},\ell },X_{\text{exp},\ell })^{-1} \Sigma (X_{\text{exp},\ell },\boldsymbol {\mathbf {x}}_{\text{new}}). \end{align}

(12)

Based on the GP-bassed \(f(\boldsymbol {\mathbf {x}}_{\text{new}})\) estimation method, the following section details how to acquire the next hyperparameter point based on the posterior distribution.

5.2 Hyperparameter Acquisition Using Gaussian Process Model

Typical GP approaches use an acquisition function to choose a new hyperparameter point. There are a few common acquisition functions. In this work, we use expected improvement (EI) as a default acquisition function. The EI function of an individual \(\boldsymbol {\mathbf {p}}_\ell\) calculates the expected speedup improvement of any hyperparameter point \(\boldsymbol {\mathbf {x}}\) when exploring the vicinity of the current optimum \(\boldsymbol {\mathbf {x}}_{\text{best}, \ell }\). EI for \(\boldsymbol {\mathbf {p}}_\ell\) is calculated by

\begin{equation} \text{EI}_\ell (\boldsymbol {\mathbf {x}}) = \mathbb {E}[I_\ell (\boldsymbol {\mathbf {x}})] = \int _{-\infty }^{\infty } I_\ell (\boldsymbol {\mathbf {x}})\phi (z) dz, \end{equation}

(13)

where \(\phi (z)\) is the probability density function of the normal distribution \(\mathcal {N}(0,1)\), i.e., \(\phi (z) = \frac{1}{\sqrt {2\pi }} \exp (- z^2 / 2)\), and \(I(\boldsymbol {\mathbf {x}})\) is the function that measures the difference between the expected speedup at \(\boldsymbol {\mathbf {x}}\) and the current optimum value \(\boldsymbol {\mathbf {x}}_{\text{best}}\), which is defined as follows:

\begin{equation} I_\ell (x) = \max \left(0, f_\ell ^{\circ }(\boldsymbol {\mathbf {x}})- f_\ell (\boldsymbol {\mathbf {x}}_{\text{best}})\right), \end{equation}

(14)

where \(f_\ell ^{\circ }(\boldsymbol {\mathbf {x}}) = f_\ell (\boldsymbol {\mathbf {x}}) \mid \text{Pr}(X_{\boldsymbol {\mathbf {exp}},\ell })\). We need to get rid of operator \(\max (\cdot)\) in Equation (14). For that, the integral can be divided into two components: one where \(f_j^{\circ }(\boldsymbol {\mathbf {x}}) - f_\ell (\boldsymbol {\mathbf {x}}_{\text{best}})\) is positive and the other where it is negative. In Reference [16], the point is calculated as \(\boldsymbol {\mathbf {z}}_o\) such that \(\boldsymbol {\mathbf {z}}_0 = \frac{ f_\ell (\boldsymbol {\mathbf {x}}_\text{best}) - \mu _*(\boldsymbol {\mathbf {x}})}{\sigma _*(\boldsymbol {\mathbf {x}})}\), where \(\mu _*(\boldsymbol {\mathbf {x}})\) and \(\sigma _*(\boldsymbol {\mathbf {x}})\) are the mean and the standard deviation of the GP posterior predictive at point \(\boldsymbol {\mathbf {x}}\), respectively (see Equations (11) and (12)). Then, function \(\text{EI}\) is calculated in Reference [16], and expressed by

\begin{align} \text{EI}_\ell (\boldsymbol {\mathbf {x}}) = \left(\mu _{*}(\boldsymbol {\mathbf {x}})- f_\ell (\boldsymbol {\mathbf {x}}_{\text{best}})\right) \Phi \left(\frac{\mu _{*}(\boldsymbol {\mathbf {x}}) - f_\ell (\boldsymbol {\mathbf {x}}_{\text{best}})}{\sigma _*(\boldsymbol {\mathbf {x}})} \right) + \sigma _*(\boldsymbol {\mathbf {x}})\phi \left(\frac{\mu _{*}(\boldsymbol {\mathbf {x}} - f_\ell (\boldsymbol {\mathbf {x}}_\text{best})}{\sigma _*(\boldsymbol {\mathbf {x}})} \right), \end{align}

(15)

where \(\Phi (\cdot)\) is the standard normal distribution function. Formulation (15) works for \(\sigma _*(\boldsymbol {\mathbf {x}}) \gt 0\); otherwise, if \(\sigma _*(\boldsymbol {\mathbf {x}}) = 0\) (as it happens at the observed data point), then it holds that \(\text{EI}(\boldsymbol {\mathbf {x}}) = 0\).

Using the \(\text{EI}\) function, the next hyperparameter point is calculated as

\begin{equation} \boldsymbol {\mathbf {x}}_\text{new} = \mathop{\arg\max}_{\boldsymbol {\mathbf {x}} \in \mathcal {A}} \text{EI}(\boldsymbol {\mathbf {x}}). \end{equation}

(16)

A variety of methods to solve (16) have been studied. Unlike the objective \(f_\ell (\boldsymbol {\mathbf {x}})\) in our speedup optimization problem in (2), \(\text{EI}_\ell (\boldsymbol {\mathbf {x}})\) is inexpensive to evaluate and allows easy evaluation of first- and second- order derivatives. We can implement the EI algorithm using a continuous first- or second-order optimization method. We utilize one common technique that has worked well to calculate first derivatives and then use the quasi-Newton method L-BFGS-B [21].

If two points \(\boldsymbol {\mathbf {x}}_\text{best}\) and \(\boldsymbol {\mathbf {x}}_\text{new}\) meet the termination condition (Definition 2), then the tuner stops exploration in HPO. Otherwise, after computing \(f_\ell (\boldsymbol {\mathbf {x}}_\text{new})\) and an individual \(p_{\ell }\)’s fitness through a pair of AACS and DOS executions, the tuner continues the exploration if the exploration budget remains.

6 Experimentation

We designed a use case to evaluate the performance of the proposed SBO that optimized the hyperparameter values using a given budget for the simulation speedup. Through various experiments, first, we aim to demonstrate the importance of selecting an optimal hyperparameter point for a given production scenario by displaying diverse changes in the AACS accuracy and speedup according to the target hyperparameter points. Then, we will show the performance of the proposed HPO-based AACS-based optimization compared to other AACS-based optimizations with fixed hyperparameter points and DOS-based optimization.

As shown in Figure 3, the model has four types of machine groups for deposition, patterning, etching, and chemical-mechanical polishing. The detailed subprocesses in each machine can differ depending on a lot’s product recipe. Model parameters are referenced from References [13, 22].

Fig. 3.

There are five types of products, each requiring different mask layers (such as \(4,8,12,16, \text{ or } 20\)). As the required number of layers increases, the number of reentrances also increases (e.g., deposition \(\rightarrow\) patterning \(\rightarrow\) etching \(\rightarrow\) deposition \(\cdots\)). Thus, each wafer lot revisits a machine group multiple times for different processing steps, and the number of reentrances depends on the target lot’s product type. Each lot contains 25 wafers (as raw materials). We attempt to produce 100-wafer lots for each product type, i.e., 500-wafer lots for five product types should be produced by different machines, according to their own processing recipes. Depending on the lot-release rate of each product type from inventory, overall performances, such as mean cycle time (MCT) or throughput, could vary considerably. Typically, as the intervals of lot releases increase, the MCT becomes longer and the throughput decreases due to the reduction in machine utilization. Rather than multi-objective optimization methods, we conducted a single objective problem by integrating the two important performance indices, MCT and throughput, because a unique control solution is typically required to optimize the target manufacturing plant.

To minimize the MCT and maximize the throughput through SBO, we define a performance index (which is expressed in an integrated form) as follows:

\begin{equation} \text{PI}(\boldsymbol {\mathbf {p}}) = w_1 k_{\text{mct}}(\boldsymbol {\mathbf {p}}) + w_2/k_{\text{tp}}(\boldsymbol {\mathbf {p}}), \end{equation}

(17)

where \(\boldsymbol {\mathbf {p}}=(p_1, p_2, \ldots , p_5)\) is an individual that indicates the combination of lot-release intervals for five product types, \(p_i\) is the lot-release interval for the ith product type, \(w_1\) and \(w_2\) are user-defined weights, \(k_{\text{mct}}(\boldsymbol {\mathbf {p}})\) is the estimated MCT through simulation for population \(\boldsymbol {\mathbf {p}}\) for lot-release candidates, and; \(k_{\text{tp}}(\boldsymbol {\mathbf {p}})\) is the estimated throughput (average lots produced per hour). SBO is intended to minimize the defined performance index, which minimizes the MCT and maximizes the throughput.

In the first experiment, we measured the speedups and accuracy errors (that is, the estimation differences between AACS and DOS) using simple lot-release scenarios: Twenty-five wafer lots for all product types (i.e., 5 lots per each product type) are released together from inventory at a fixed rate. We present the measurement results in Figure 4, varying the release interval of each product type from 100 to 180 (minutes).

Fig. 4.

When significance level \(\alpha\) becomes lower, the AACS allows more statistical invariances between two consecutive samples (related to the stability decision), which increases the chance of switching from DEMs to MDMs and decreases the chance of switching from MDMs to DEMs through less stringent decisions of steady-state convergences. Having more activated MDMs instead of their corresponding high-fidelity DEMS guarantees more speedup but decreases the accuracy. When the sample size becomes smaller, ALCs detect steady-state convergences (or divergences) more quickly, which helps increase the MDMs’ activation periods and guarantees better speedups (see Figure 4(c) and (d)). MDMs process events based on the mean-delay information; the mean information is derived using observed delay data during the sampling period. If the sampling period (that is taken time to collect all observations as many as the defined sample size) is short, then the derived means could be a biased representative of behaviors observed in a specific simulation time region, resulting in lower accuracy.

If the release interval becomes smaller, then the overall production becomes unstable; thus, we cannot get noticeable speedups through AACS because of the lower activation rate of MDMs. If the release interval is extremely low, then each ALC’s consecutive samples of queueing parameters differ evidently. In these precarious production scenarios, lower \(\alpha\) can accidentally cause ALCs to activate the MDMs. However, these mistakes are corrected afterwards through the subsequent steady-state divergence detection, which increases the active-model switches between DEMs and MDMs and could degrade the simulation execution time to be worse than DOS. As the release interval increases, the MDMs are activated more often, which leads to higher speedups. Since the MDMs guarantee higher accuracy in more stable production scenarios, the accuracy of AACS increases as the release intervals increase. If the release interval is extremely low, then the AACS accuracy also becomes higher, because high-fidelity DEMs are mainly active instead of MDMs. The AACS accuracies are the lowest in an intermediate area, which is neither very stable nor unstable (see the release interval region around 140 in Figure 4).

As such, the speedups and accuracies are different according to the hyperparameter combinations and production scenarios, as shown in Figure 4. When evaluating the fitnesses of various production candidates, we should carefully decide specific hyperparameter points that maximize the speedup under an accuracy constraint, especially in exploring the candidates around an accuracy-deficient region (e.g., the release intervals around 140). As the AACS inaccuracy becomes larger during SBO, the calculated optimal problem-related individual might be far from the actual optimum points derived through DOS-based SBO.

In the second experiment, we aim to find the optimal lot-release intervals of five product types, denoted by \(\boldsymbol {\mathbf {p}}_{\text{opt}}\), i.e., \(\boldsymbol {\mathbf {p}}_{\text{opt}} = \mathop{\arg\min}_{\boldsymbol {\mathbf {p}}\in \mathbf {P}_\text{all}} \text{PI}(\boldsymbol {\mathbf {p}})\), where \(\mathbf {P}_\text{all} = \lbrace (p_1, \ldots , p_5) \mid \forall i \in\) \(\lbrace 1, \ldots 5\rbrace , p_i \in \lbrace 100, 105, \ldots , 200\rbrace \rbrace\). We assigned two weights considering the scales of two indices, \(w_{1}\) and \(w_{2}\), to prioritize throughput over cycle time, which makes the GA-based optimizer prefer to explore the problem-related decision candidates that cause relatively unstable production whose accuracy could problematic in AACS (see Figure 4(a) and (b)). In this case, appropriate hyperparameter selection is essential to prevent a large amount of AACS accuracy.

The GA-based optimizer for the SBO is developed based on the integer-encoding and the Boltzmann selection methods. For the GA, we set the population size, the maximum number of population generations, crossover rate, swap rate, and inversion rate to \(15, 30, 0.7, 0.3\), and 0.2, respectively.

For the tuner, considering the stochastic simulation, we set the total simulation budget \(B_{\text{total}}\) as 12 and the maximum number of exploration \(B_{\text{exp}}\) as 8. The minimum number of neighbors for the gradient-based speedup estimation is 5 (which equals the dimension of the target decision parameter space). We set the maximum number of neighbors, \(\theta _{\text{nbr}}\), as 8. The maximum radius of the neighborhood \(d_{\max }\) is 7. We set the accuracy constraint for HPO, which is the allowable accuracy loss, as \(2.5\%\). The tuner explores a defined hyperparameter space \(\lbrace (\sigma , m) \rbrace\) such that \(0.02 \le \sigma \le 1.0\) and \(m = \lbrace 10, 11, \ldots , 20\rbrace\). The set of fixed points, \(X_{\text{fix}}\), consists of 16 points that are equally distributed over the defined space, i.e., \(\lbrace (\sigma , m) \rbrace = \lbrace (0.02, 10), (0.3476, 10), (0.6752, 10), (1.0, 10), (0.02, 16), \ldots , (1.0, 20) \rbrace\). If a hyperparameter point’s \(\alpha\) is one, then we equally assign the point’s measured speedup as 1.0 without executing a pair of DOS and AACS (i.e., \(1 \in A_\text{dos}\)) to trigger the DOS-based fitness evaluation when this hyperparameter point is decided as an optimal point. The size of additional random points for the GP initialization, \(X_{\text{rnd}}\), is 3. During the GP-based HP optimization using L-BFGS-B, the maximum number of iterations is 20.

Based on the specified parameter values, we traced the execution times of evaluation and exploration of each individual using an experimental machine with an Intel Xeon E5-1650 3.6 GHz CPU and 32 GB of memory. During a single execution of SBO, the averages of each individual’s budget expense for the HP exploitation changed over the population generation, as shown in Figure 5. As the number of population generations increases, the tuner can estimate the speedup of fixed points in \(X_\text{fix}\) using neighbors’ previous exploration results without simulation. In the beginning, the entire exploration budget \(B_\text{exp}\) (= 8) is consumed. In this case, the fitness is the average of four DOS results obtained during the HP exploration and four AACS results using the current optimum hyperparameter point. As the generation evolves, the expense of the HPO budget decreases, which helps allocate more budget to the fitness evaluation, considering stochastic properties. The longer an individual survives or the more newly generated by the GA-based optimizer, the likelier it is that we can reduce the possibility of wrong fitness prediction caused by stochastic variances. During the SBO execution, the hyperparameter points used to evaluate the whole population’s fitnesses through AACS are distributed as shown in Figure 6.

Fig. 5.

Fig. 6.

Based on the outlined manufacturing and AACS configurations, Figure 7 offers our comparative assessment of the SBO performances, both in terms of overall execution times and the quality of the computed optimal manufacturing solutions, with a focus on achieving minimized average fitness. This comparative assessment encompasses three distinct simulation paradigms: (1) DOSs, (2) AACS that consistently operates with fixed hyperparameters throughout SBO, and (3) AACS wherein hyperparameters undergo adaptive refinement via HPO. For the second simulation method within the SBO, we used the static hyperparameter configurations where alpha is either \(0.01, 0.05, 0.1, \text{ and } 0.5\), all paired with a sample size of 20. As shown in Figure 7(a), lower \(\alpha\) values with the fixed sample size reduce the overall executions of the SBO. We recomputed the fitness of problem-related optimal individuals by running DOS 50 times (see Figure 7(b)).

Fig. 7.

Among the fixed \(\alpha\) values, the minimum \(\alpha\) (\(=0.01\)) reduces the overall execution time to about 345 seconds (about \(22.8\%\)) compared to the optimization using DOS. However, the fitness of the optimum individual derived through AACS-based optimization using the minimum \(\alpha\) is as much as \(2.4\%\) higher than that of the individual obtained by typical DOS-based optimization. The proposed SBO approach (labeled as \(\alpha\) tune in Figure 7) adopts different hyperparameter points for each individual to maximize the speedups under the defined accuracy constraint. Our method resulted in a slightly lower speedup (about \(20.5\%\)) compared to AACS-driven SBO with the minimum \(\alpha\). However, the fitness of the optimum individual according to HPO-based SBO is closer to the actual optimum (driven by DOS-based SBO). Compared to other hyperparameter points (whose \(\alpha\) values are \(0.05, 0.1, \text{ and } 0.5\)), the proposed method guarantees a similar quality of problem-related optimum individual with more reduced execution times. Moreover, our approach eradicates the efforts or concerns about hyperparameter selection.

It is still necessary to reduce the gap between the fitness of optimal individuals obtained by our method and by DOS-based optimization. The reason is assumed to be that we fixed the accuracy constraint among individuals. In future works, we will design a new hyperparameter tuning method for SBO with adaptive control of accuracy constraints. We will also extend our approach using the optimal computing budget allocation concept [4] to adjust the simulation budget, considering both the probability of correct selection and HBO.

7 Conclusion

We previously proposed AACS through steady-state convergence or divergence detection based on two consecutive queueing-parameter distributions (sampled from DEMs of machine groups). The speedup and accuracy loss of the AACS vary according to (1) the specific hyperparameter point (a combination of significance level \(\alpha\) and the sample size) and (2) the overall stability of the simulation scenarios. When exploring problem-related individuals over decision candidates during GA-based SBO, we must carefully select hyperparameters to maximize the speedup under a given accuracy constraint. To achieve the goal, we proposed an efficient hyperparameter tuning method to find the optimal hyperparameter point used for each individual’s AACS during SBO. To find optimal hyperparameter points efficiently with few attempts, we employed the GP model for each individual, which is used to estimate the speedup of the next hyperparameter candidate in advance as a surrogate model. We proposed the discrete-gradient-based speedup estimation method to prepare the data (used for the GP initialization) without actual simulation execution based on neighboring individuals’ previous exploration results. As the population generation evolves and optimization converges, the hyperparameter exploration overheads are reduced via the estimation of more GP models’ warmup data (used to initialize the GP odels) using the exploration results of enlarged neighboring individuals. The proposed method was applied to the lot-release-interval optimization of a large-scale wafer-fabrication scenario. It showed that the proposed method generates highly optimized results (compared to other AACSs with fixed hyperparameter points) with reduced execution times.

References

[1]

Aldeida Aleti and Irene Moser. 2016. A systematic literature review of adaptive parameter control methods for evolutionary algorithms. ACM Comput. Surv. 49, 3 (2016), 1–35.

Abstract

List of Abbreviations

1 Introduction

2 Related Works

3 Problem Formulation

4 Overview of Hyperparameter Tuning in Simulation-based Optimization

5 Gaussian Process-based Hyperparameter Tuning

5.1 Warmup and Update of Gaussian Process Model

5.2 Hyperparameter Acquisition Using Gaussian Process Model

6 Experimentation

7 Conclusion

References

Cited By

Index Terms

Recommendations

Hyperparameter Tunning in Simulation-based Optimization for Adaptive Digital-Twin Abstraction Control of Smart Manufacturing System

Runtime Abstraction-Level Conversion of Discrete-Event Wafer-fabrication Models for Simulation Acceleration

Simulation-Based Scheduling of Waterway Projects Using a Parallel Genetic Algorithm

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Login options

Full Access

Share

Share this Publication link

Share on social media

Affiliations