We now extend our training framework to the specific task of gate sizing in VLSI netlists. In particular, we summarize the attributes (or features) of nodes that are extracted from the pre-recovery netlist. We also outline the configuration details and hyperparameters used in our training framework.
4.4.1 Nodewise Features.
We evaluate a comprehensive set of 22 node-level features
\(X \in \mathbb{R}^{N \times 22}\) (where
\(N\) is the number of nodes in the graph) that can be extracted from the pre-recovery timing graph. These features are a superset of the features used in previous works [
28,
32,
33]. We start with the hypothesis that these 22 features along with net connectivity information (in the form of an edge list) provide sufficient information for the DAGSizer model to learn node-level delay changes during the discrete gate-sizing optimization task. Figure
9 provides a pictorial illustration of the node-level features with reference to node E of our representative timing graph, denoted by
\(x_{E} = (f_1^{E}, f_2^{E},\ldots , f_{22}^{E})\). The following list in Table
1 summarizes our 22-dimensional feature vector (specific to node E). These features are extracted from the pre-recovery netlist by our feature extractor. Currently, we do not support MCMM (multi-corner multi-mode) analysis, and the 22 extracted features correspond to a single timing corner. In Table
1, maximum/minimum possible power changes (
\(f_{9}^{E}\) and
\(f_{11}^{E}\)) of a node (cell) refer to the maximum/minimum leakage-power change among all possible cell-swaps of the node. Likewise, maximum/minimum delay changes (
\(f_{10}^{E}\) and
\(f_{12}^{E}\)) refer to the maximum/minimum propagation-delay changes among all possible cell-swaps of the node. To extract
\(f_{10}^{E}\) and
\(f_{12}^{E}\) from the library file, we use the average of the rise and fall delay values corresponding to the input slew and output load values from the pre-recovery netlist. In addition to node-level features, we extract pin-to-pin connections from the netlist and construct the DAG, which is the other input to our framework. While extracting the edge list, combinational loop connections are excluded so that the generated graph is acyclic. To ensure that graph traversal starts at the Q pin of a flop and ends at the D pin of a flop, we make a minor modification to our graph (by including disconnected clones of flop nodes).
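The flop-cloning step above can be sketched as follows. This is an illustrative reconstruction, not the paper's feature extractor: node names, the input edge-list format, and the `__Q`/`__D` clone-naming convention are all hypothetical.

```python
# Sketch: building the timing DAG from pin-to-pin connections, splitting each
# flip-flop into a "Q-side" source clone and a "D-side" sink clone so that
# register feedback paths do not create cycles. Names are illustrative only.
from collections import defaultdict, deque

def build_timing_dag(edges, flop_nodes):
    """edges: iterable of (driver, sink) pin-to-pin connections.
    flop_nodes: set of flop instance names. Returns a cycle-free edge list."""
    dag = []
    for u, v in edges:
        # A flop's Q pin drives downstream logic: use the source clone.
        src = f"{u}__Q" if u in flop_nodes else u
        # A flop's D pin terminates a path: use the (disconnected) sink clone.
        dst = f"{v}__D" if v in flop_nodes else v
        dag.append((src, dst))
    return dag

def is_acyclic(edge_list):
    """Kahn's algorithm: True iff every node can be topologically ordered."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in edge_list:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    queue = deque(n for n in nodes if indeg[n] == 0)
    seen = 0
    while queue:
        n = queue.popleft()
        seen += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return seen == len(nodes)
```

For example, a feedback path `FF1 -> A -> FF1` becomes `FF1__Q -> A -> FF1__D`, so the clone of the flop that sources the path is disconnected from the clone that sinks it.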
4.4.2 Model Configuration.
We now describe the high-level configuration details of our model. The DAGSizer framework uses the PyTorch library to implement the encode, decode, aggregation, and combine operations.
Feature Encoder: A linear encoder
\({\bf Encode}_{\Theta _{E}}\) is implemented using
\(torch.nn.Linear(22, 32)\) to translate the 22-dimensional feature vector into a 32-dimensional vector. The purpose of this initial encoder layer is to learn the relative importance of the feature dimensions. We use this feature encoder for all predictive models that we study in Section
5.
Aggregation: A parameterized aggregation operator \({\bf Agg}_{\Theta _{DAG}}\) is implemented using the message-passing class \(torch\_geometric.nn.MessagePassing\), and is used to generate the message vector from the parent nodes. This message vector captures the feature information of the parent nodes along with the parents' labels, which are either predicted or true (teacher sampling).
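The parent-message aggregation can be sketched framework-free as below (the actual implementation uses `torch_geometric.nn.MessagePassing`, which this plain-Python version only mimics). Each parent contributes its feature vector concatenated with its label, the true label during training (teacher sampling) or the predicted label at inference, and messages are mean-aggregated; all names and the mean aggregator are illustrative assumptions.

```python
# Framework-free sketch of parent-message aggregation with teacher sampling.
# Each parent p contributes [features[p] || labels[p]]; messages from all
# parents are averaged component-wise. Illustrative only.

def aggregate_parent_messages(node, parents_of, features, labels):
    """Mean-aggregate [feature || label] vectors over the node's parents."""
    msgs = [features[p] + [labels[p]] for p in parents_of.get(node, [])]
    if not msgs:
        # Primary inputs (and source clones) have no parents: zero message.
        dim = len(next(iter(features.values()))) + 1
        return [0.0] * dim
    return [sum(col) / len(msgs) for col in zip(*msgs)]
```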
Combine: A parametric combine operator
\({\bf Comb}_{\Theta _{DAG}}\) is used to combine the message vector and the node’s feature vector in the forward graph, and generate a 64-dimensional hidden representation of each node. Likewise, the combine operator of the reverse graph generates the other 64-dimensions of the hidden node representation. The concatenation of the two 64-dimensional node representations is used to generate the final 128-dimensional node embedding, i.e.,
\({\bf Comb}_{\Theta _{DAG}}\) = {
\(torch.nn.GRUCell(32, 64)\),
\(torch.nn.GRUCell(32, 64)\)}. For a fair comparison with the previous works, we use 128 dimensions for representing the node embeddings (Equation (
2)) of the neighborhood-based aggregation schemes.
Decode: A parametric decode operator translates the hidden vector of each node to a regression label, i.e., \({\bf Decode}_{\Theta _{D}}\) = {\(torch.nn.Linear(128, 64)\), \(torch.nn.ELU()\), \(torch.nn.Linear(64, 1)\), \(torch.nn.ELU()\)}.
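The encode / combine / decode wiring described above can be put together as in the following sketch. It mirrors the listed `torch.nn` modules but is a standalone illustration, not the authors' DAGSizer implementation; in particular, the 64-dimensional message vectors are assumed to be supplied by the aggregation step, and here they are simply passed in as arguments.

```python
# Standalone sketch of the Encode -> Comb (fwd/rev) -> Decode pipeline.
# Message vectors msg_fwd / msg_rev stand in for the aggregated parent
# messages of the forward and reverse graphs (an assumption of this sketch).
import torch

class DAGSizerSketch(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.encode = torch.nn.Linear(22, 32)         # Encode
        self.comb_fwd = torch.nn.GRUCell(32, 64)      # Comb (forward graph)
        self.comb_rev = torch.nn.GRUCell(32, 64)      # Comb (reverse graph)
        self.decode = torch.nn.Sequential(            # Decode
            torch.nn.Linear(128, 64), torch.nn.ELU(),
            torch.nn.Linear(64, 1), torch.nn.ELU())

    def forward(self, x, msg_fwd, msg_rev):
        h = self.encode(x)                       # (N, 32) encoded features
        h_fwd = self.comb_fwd(h, msg_fwd)        # (N, 64) forward embedding
        h_rev = self.comb_rev(h, msg_rev)        # (N, 64) reverse embedding
        z = torch.cat([h_fwd, h_rev], dim=-1)    # (N, 128) node embedding
        return self.decode(z)                    # (N, 1) delta-delay label
```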
Loss Function: The mean-squared loss of the node-level delta-delay predictions is defined to be
\[
\mathcal{L}(\hat{Y}, \tilde{Y}) \;=\; \frac{1}{\sum _{i=1}^{N} m_{i}} \sum _{i=1}^{N} m_{i}\,\big(\hat{y}_{i} - \tilde{y}_{i}\big)^{2},
\]
where
\(\hat{y}_i \in \hat{Y}\) and
\(\tilde{y}_{i} \in \tilde{Y}\). For “don’t touch cells” (defined in Section
4.3 as cells that are disabled during the reassignment), we mask the loss (using the
\(m_{i}\) flags), which masks the corresponding gradients during backpropagation. Flops are an example of "don't touch cells" in the leakage optimization step: because leakage recovery is performed at the signoff stage, default settings in modern physical design flows recommend that registers remain untouched during leakage optimization.
Other Hyperparameters: To be consistent across the predictive models, we use a hidden dimension of 128 to represent intermediate node embeddings, and the Adam optimizer with a decaying learning rate: initialized at 0.001, with a decay factor of 1e-5 applied every 20 epochs. We use a three-layer convolution for ECO-GNN and GRA-LPO. Since DAGSizer uses sequential message-passing aggregation, we use a single hidden layer. To decompose the initial graph (subgraph batching step of Figure
7 and line 1 in Algorithm 1), we adopt the k-way cut clustering implementation of METIS [
24] via the convenience wrapper PYMETIS [
27]. Following best practices [
24], we set the number of cut attempts to be one, and the number of iterations to be 10 for all testcases. Crucially, we favor large partitions to avoid unnecessarily splitting timing paths. Since METIS encourages
balanced partitions, we set the batch-size (number of nodes) according to the available GPU memory and the expected number of nodes in each partition. In general, we select the number of partitions so that batches (subgraphs) consist of roughly 50K nodes. Furthermore, METIS includes a variety of options for seeding graph partitions. The initial partitions may significantly affect the stability of the partitioning procedure. For example, options include spectral cuts, graph growing and greedy graph growing partitions, or Kernighan-Lin-inspired algorithms. The authors of METIS note that the spectral partitioners tend to underperform with respect to speed and quality compared to graph-growing methods [
24]. Of the three graph growing methods, the authors claim that greedy graph growing and “boundary” Kernighan-Lin perform comparatively well. We select greedy graph growing to generate initial partitions for all testcases.
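The subgraph-batching setup can be sketched as below: pick the number of METIS partitions so that each (roughly balanced) partition holds about 50K nodes, then invoke PYMETIS. The `target_nodes` default and the helper names are illustrative assumptions; only the `pymetis.part_graph` call reflects the actual library API.

```python
# Sketch: choosing the partition count for roughly-50K-node batches, then
# partitioning via PYMETIS (which wraps METIS k-way cut clustering).
import math

def choose_num_partitions(num_nodes, target_nodes=50_000):
    """Favor large partitions: as few batches as the node budget allows."""
    return max(1, math.ceil(num_nodes / target_nodes))

def partition_graph(adjacency, target_nodes=50_000):
    """adjacency: list of neighbor-index lists. Returns per-node part ids."""
    import pymetis  # imported lazily so the sketch loads without it
    nparts = choose_num_partitions(len(adjacency), target_nodes)
    _cut_edges, membership = pymetis.part_graph(nparts, adjacency=adjacency)
    return membership
```

For a 61K-node design such as des_perf, this yields two partitions; a design under 50K nodes is left whole.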
To study the effect of modeling accuracy with and without partitioning, we use the
des_perf design with 61K nodes and 117K edges, whose computational graph can fit into our GPU memory without partitioning. We analyze the accuracy loss and the percentage of cut-edges (w.r.t. the total number of edges in the graph) resulting from partitioning, for batch sizes (numbers of nodes per partition) of 50K, 25K, 12K, 6K, 3K, 1K, and 0.5K. For
des_perf, we observe that
mean squared error (MSE) stays constant (0.0053) all the way down to 0.5K nodes per partition. We believe there are three possible reasons for this behavior: (1) the percentage of disconnected edges relative to the total number of edges is
\(\le\) 2.8% even for a batch size of 500 nodes; (2) cut edges might not always correspond to critical (i.e., negative-slack) timing paths, as suggested by the data in Table
2; and (3) node features (Table
2) such as arrival time, sibling capacitance and sibling slack embed some neighboring information. For the six designs used in our experiments, Figure
10 shows the cut-cost percentages on the
\(y\)-axis (= percentage of cut-edges w.r.t. the total number of edges in the graph) as a function of the batch size percentage (
\(x\)-axis). For a batch size of 50K (which can fit into our GPU memory), the cut-cost percentage values (red star in the plots) stay within 2% (
\(y\)-axis) for all of our designs. Because
des_perf did not suffer any accuracy loss up to a cut-cost percentage of 3%, we believe that even if we could fit the entire graph of megaboom (or other large graphs) into GPU memory (moving the red star toward the right), the accuracy improvement would be insignificant.
Translation to Sizing Action: Since DAGSizer predicts nodewise delta-delay labels, each label is translated to the sizing action (among all possible swaps) whose delta-delay value most closely matches the prediction, using a simple nearest-neighbor search.
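This nearest-neighbor translation can be sketched as follows. The candidate-swap table (cell-type names and delta-delay values) is a hypothetical example, not actual library data.

```python
# Sketch: map a predicted delta-delay to the sizing action (cell swap)
# whose library-derived delta-delay is closest to the prediction.

def nearest_swap(predicted_delta, candidate_swaps):
    """candidate_swaps: dict mapping cell-type name -> delta-delay value."""
    return min(candidate_swaps,
               key=lambda c: abs(candidate_swaps[c] - predicted_delta))
```

For instance, with hypothetical candidates `{"INVX1": 0.012, "INVX2": 0.005, "INVX4": -0.003}`, a predicted delta-delay of 0.004 selects `INVX2`.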
Inference Framework: After learning DAGSizer’s parameters (weights) in the training phase (as demonstrated in Figure
7), the inference flow is summarized in Figure
11. The inference flow starts with an input netlist that undergoes DAG translation and feature extraction. We then perform subgraph batching using PYMETIS to decompose the input graph into multiple smaller graphs. The sequential message-passing mechanism of the pretrained DAGSizer is applied to each of these subgraphs to predict nodewise delta-delay labels. The generated delta-delay labels are converted to cell types, and the changes are rolled back into the design to generate an ECO netlist that can be used for downstream tasks.