1 Introduction
Instruction and data prefetching [18] are commonly used in today’s processors to overcome the memory wall problem [70]. The key idea behind prefetching is to identify the current memory access pattern and predict addresses so that instructions and data can be proactively fetched into the cache, avoiding cache misses. Prefetching hides the large memory access latency and, in turn, improves processor performance. As a result, modern processors employ multiple prefetchers to cover a wide range of applications. Nevertheless, existing prefetchers do not improve the performance of all applications; in some cases they hurt application performance by prefetching the wrong memory addresses [36]. These incorrect prefetches waste precious memory bandwidth and the limited space in the cache hierarchy, which increases data and instruction access latency and hurts application performance.
To evaluate a prefetcher’s performance, we can use scope and accuracy as the metrics [6, 34]. A prefetcher with high prefetching accuracy usually has limited scope—that is, it is very good at identifying a limited number of memory access patterns and can accurately prefetch data/instructions if those specific memory access patterns exist. However, such a prefetcher fails to identify other memory access patterns. Conversely, a prefetcher with broad scope caters to a wide variety of memory access patterns, but it has low accuracy.
Effectively, we need to find a balance between the accuracy and scope of the prefetcher. One way to balance scope and accuracy is to use multiple high-accuracy prefetchers at each level of the memory hierarchy, as is the case in AMD and Intel processors [1, 2, 3, 7, 21, 22, 23, 24, 25, 26, 27, 65]. Each prefetcher is customized to identify a specific type of memory access pattern and make a prefetching prediction. However, having multiple prefetchers operating at each level in the memory hierarchy can lead to the following issues:
•
Given that each prefetcher is trained independently to track a specific type of traffic and simultaneously share microarchitectural resources, prefetchers can sabotage each other during runtime. A prefetcher may trigger prefetch requests that evict cache lines that another prefetcher has accurately prefetched. This behavior leads to a loss in performance, wasted memory access bandwidth, and increased power consumption.
•
Different prefetchers (either at the same level or across levels in the memory hierarchy) latch onto memory access patterns at different speeds. So a prefetcher’s prediction can be influenced by the traffic generated by other prefetchers. These differences in temporal behavior can cause faulty synchronization among prefetchers and lead to a drop in application performance.
Essentially, prefetchers compete for resources and, at times, sabotage each other. To validate our argument, we use traces generated from the SPEC2017, SPEC2006, and Cloud benchmark suites and run these traces on an OoO processor that uses a different prefetcher at each level of the memory hierarchy (details of the evaluation methodology are provided in Section 4). We execute these traces using two state-of-the-art prefetcher solutions, IPCP [45] and EIP [55], which were the winners of prior prefetching competitions [49, 50, 51]. These prefetcher solutions use different prefetchers at each level of the memory hierarchy—that is, different prefetcher system configurations (PSCs).
Both prefetcher solutions improve performance compared to the no-prefetching case. However, IPCP (no-ipcp-ipcp-nl) (further details on the types of prefetchers are given later in Table 5) shows better performance than EIP in 100 out of 232 traces, with the largest performance gain of 56% in 602.gcc_s-2226B. EIP (EIP-nl-spp-no) shows better performance than IPCP in the remaining 132 traces and has the largest performance gain of 89% in server_036. In Figure 1, we show a subset of traces that have a significant difference in their IPC when using the IPCP and EIP prefetcher solutions. The standard methodology would have us select EIP because it has on average a 7.8% performance gain over IPCP, but that would come at the cost of a performance loss for 100 out of the 232 traces.
One way to address this problem is to have multiple different prefetchers at each cache level and switch ON a prefetcher based on the current program phase. For example, one prior work that has attempted to leverage multiple prefetchers in the same level of the memory hierarchy is by Kondugli and Huang [34]. The authors propose a “composite prefetcher,” which uses a priority queue as the control algorithm to select a prefetcher at a single level in the memory hierarchy. Although this approach provides benefits, the composite prefetcher’s priority queue is designed offline for a given set of applications. For previously unseen applications, we will not necessarily see any performance improvement. Furthermore, the control algorithm will not scale well as we increase the number of prefetchers in the system and target different levels in the memory hierarchy.
What is really needed is a manager that can successfully manage multiple prefetchers at each level in the memory hierarchy and has a low overhead. This manager will choose the PSC that is suitable for the current phase of an application. The chosen PSC should have prefetchers that complement each other for the current phase, reduce memory access overhead, and, in turn, improve application performance. Given that multiple prefetchers are available at each level in the memory hierarchy, this “manager” will effectively determine which prefetcher should be ON/OFF at each level in the memory hierarchy, both across different phases of an application and across applications. In this article, we propose a machine learning (ML)-based hardware manager called Puppeteer that selects the PSC at runtime. Contrary to prior work [37, 52, 71] that focuses on training an ML model to improve prefetch address prediction accuracy, we train the ML model of Puppeteer to increase overall system performance (quantified as IPC). Using our unique training strategy, we specifically target training for application phases where the swing in IPC is much higher than in other regions. Thereby, we tailor Puppeteer for these phases and achieve high targeted-application-phase accuracy instead of just high overall model prediction accuracy.
To manage the prefetchers across multiple cache levels at runtime, we propose a multi-regression ML-based approach. We use the observed IPC of the various PSCs for different phases of each application to train our ML model. We train a unique random forest regressor per PSC (in our case, we have 5 different PSCs, which we pruned down from the 300 possible PSCs—more details are provided in Section 4) to create a suite of random forest regressors. As features for the ML model, we use events that can be tracked using hardware performance counters (HPCs) and whose behavior does not change with the choice of PSC (i.e., PSC-invariant events). An example of such an event is the number of branch instructions in an application; the branch instruction count does not change with the choice of PSC. Using only PSC-invariant events, we can limit the number of executions per trace we must account for during training, making it easier to train the ML model in Puppeteer (more details about the training approach are presented in Section 3.3). In summary, the contributions of our work are as follows:
•
We propose a novel ML-based runtime hardware manager called Puppeteer to manage the various prefetchers across the memory hierarchy to improve processor performance.
•
We design Puppeteer to use a set of PSC-invariant events (which can be tracked using HPCs) as inputs and predict the PSC for the next instruction window. We co-optimize the hardware design and the ML model of Puppeteer with the goal of maximizing the overall application performance while minimizing the area overhead. At runtime, at the end of an instruction window, Puppeteer predicts the IPC for each PSC and selects the PSC with the highest predicted IPC for the succeeding instruction window.
•
We train Puppeteer to maximize processor performance instead of prefetch address prediction accuracy. We use only 20% of the data for training to prevent overfitting and design Puppeteer with a low hardware overhead (\(\sim\)11 KB). For the 232 traces we experiment with, Puppeteer achieves an average performance gain of 46.0% in one-core (1C), 25.8% in four-core (4C), and 11.9% in eight-core (8C) processors compared to a system with no prefetching. Moreover, we ensure that Puppeteer reduces negative outliers. When using Puppeteer, we observe an 89% reduction in the number of outliers, down to 8 outliers from 53, and a 20% reduction in the IPC loss of the worst-case outlier compared to the state-of-the-art prior prefetcher managers.
•
Finally, to the best of our knowledge, Puppeteer is the only manager that targets all cache levels in the memory hierarchy.
3 Puppeteer Design
3.1 Puppeteer System-Level Overview
In Figure 2, we show the system-level design of an example prefetcher system that uses Puppeteer. The prefetcher system consists of eight different prefetchers (Pf1–Pf8), which is typical in modern high-performance processors such as Intel i9 [26], AMD Ryzen7 [1, 3], and AMD EPYC [2]. These eight prefetchers track different memory access patterns and prefetch data from main memory to the last-level cache (LLC), from the LLC to the L2$, and from the L2$ to the L1$. Pf1 and Pf2 target instruction lines to bring into the L1I$, Pf3 and Pf4 prefetch data into the L1D$, and Pf5 and Pf6 prefetch data into the L2$, whereas Pf7 and Pf8 target data to bring into the LLC. These prefetchers compete among themselves for cache and memory resources. Even across memory levels, a wrong prefetch request from lower levels of the memory hierarchy can harm the performance of prefetchers at higher levels. These prefetchers can sometimes act overly aggressively and adversely affect each other, in turn leading to a loss of application performance. There are many heuristics-based algorithms that use simple inputs, such as the accuracy of the prefetchers or memory bandwidth utilization, to throttle prefetchers in such adverse scenarios [14, 15, 16, 17, 23, 25, 62]. These heuristic algorithms are designed to have low overhead and hence are highly optimized for the prefetchers in a given prefetcher system.
Puppeteer works as a manager of all prefetchers and complements the heuristic algorithms used in the given prefetcher system. At runtime, Puppeteer periodically updates the PSC—that is, it sets which prefetcher should be ON and which should be OFF at each level in the memory hierarchy. To update the PSC, Puppeteer uses an ML model that takes PSC-invariant hardware events, collected from HPCs, as inputs. Although low-overhead heuristic algorithms are still required to make extremely low latency decisions at the cycle level, Puppeteer provides additional adaptability by leveraging the power of ML, thereby increasing performance. Effectively, heuristic algorithms such as throttling are used in the system to constantly regulate the short-term behavior of the prefetchers, whereas Puppeteer controls the longer-term system-level behavior (across hundreds of thousands of cycles).
3.2 Puppeteer Algorithm
For the ML-based Puppeteer algorithm, we considered a classification-based approach and a regression-based approach. The classification-based approach has been used by prior works because it is relatively easy to train offline and has lower hardware overhead compared to regression. The regression-based approach has the potential for higher performance and, when trained offline, tolerates runtime variability better than classification.
Classification vs. regression. To train Puppeteer as a classification problem, we created a dataset using a thresholding method similar to prior works [6, 29, 37, 39]. Here, a trace is run using all available PSCs. A PSC is given a label of “1” if the IPC when using that PSC is within some threshold (in the prior work, the threshold is 0.5%) of the IPC when using the ideal PSC; otherwise, the PSC receives a label of “0.” Multiple PSCs can pass the chosen threshold for a given program phase, in which case we end up using the PSC that is predicted as “1” with the highest probability. Using such a classification approach leads to sub-optimal results.
As an example, consider that we have four different traces. Let us say that we classify the first three out of the four traces correctly and the fourth one incorrectly—that is, we are able to identify the correct PSC for the first three traces but not the fourth trace. So, our classification accuracy is 75%. This means that the first three traces will have performance that is within 0.5% of their “ideal” performance. However, the performance of the fourth trace could be 100% worse or just 0.51% worse than the ideal performance. This variation in performance is not accounted for in the classification-based model. Furthermore, this problem is not unique to the given example. Any classification-based method would have a similar issue because all labeling methods used to create the dataset for the classification algorithm inevitably lose some amount of information.
In contrast, a regression-based approach accounts for the magnitude of the IPC gain/loss, not just whether there is a gain/loss, when deciding the PSC. Given that the regression algorithm is trained on the IPC values directly, the quantitative information of IPC gain/loss is not lost, and the regression algorithm can learn the magnitude of a good or bad prediction. In particular, the regression algorithm predicts an IPC value for each PSC for a given instruction window. Then, we choose the PSC with the highest predicted IPC value.
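To make the contrast concrete, the following sketch (in Python, with hypothetical IPC values and the 0.5% threshold mentioned above) shows how the same measurements become lossy binary labels under the classification formulation but remain full-magnitude targets under the regression formulation.

```python
import numpy as np

# Hypothetical per-PSC IPC measurements for one instruction window.
ipc_per_psc = np.array([1.20, 1.42, 1.415, 0.95, 1.38])

# Classification-style labels: "1" if within 0.5% of the best PSC's IPC.
# The gap between "slightly worse" and "much worse" PSCs is lost here.
best_ipc = ipc_per_psc.max()
labels = (ipc_per_psc >= best_ipc * (1 - 0.005)).astype(int)   # -> [0, 1, 1, 0, 0]

# Regression-style targets: keep the raw IPC values, so the magnitude of a
# good or bad choice is preserved and the manager simply takes the argmax.
targets = ipc_per_psc
chosen_psc = int(np.argmax(targets))                            # -> 1
print(labels, chosen_psc)
```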
Suite of regressors vs. single regressor. When using regression algorithms, we have two options: (i) use a single regressor, where all data collected from all PSCs are used to train that single regressor, or (ii) use a suite of regressors, where each PSC has a dedicated regressor. Using a suite of regressors leads to a more customized solution that has higher accuracy compared to using a single regressor. Conceptually, this is because in a single regressor we maximize the accuracy across all PSCs instead of maximizing the accuracy of each PSC separately, whereas in a suite of regressors we train a dedicated regressor for each PSC. In this way, we jointly increase the scope and accuracy of the overall prefetching system.
In our work, we use a suite of regressors where we implement each regressor using a random forest (i.e., one random forest trained per PSC) due to its simple implementation, its robustness to noise in the dataset, its lower overhead (compared to other ML algorithms, e.g., neural networks), its resilience to overfitting, and its higher accuracy (compared to other ML algorithms, e.g., decision trees) [35, 44, 53].
We have multiple trees per forest, and we allow each tree to split at locations that are unique to the PSC associated with the forest. The leaves of each tree in the forest specify the predicted IPC value for the PSC. For each forest (i.e., each PSC), we calculate the average of the predicted IPC values obtained from all trees in the forest, then choose the PSC with the highest average predicted IPC. Given that each forest has multiple decision trees, our method is more tolerant of wrong decisions by individual trees: even if some trees give wrong predictions, the other trees can compensate.
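As an offline sketch of this suite-of-regressors selection (using scikit-learn and synthetic placeholder data rather than the actual HPC traces; the five-trees-per-forest setting is the one chosen later in Section 3.3), the manager keeps one forest per PSC and picks the PSC whose forest predicts the highest IPC:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic placeholder data: rows are instruction windows, columns are the
# six PSC-invariant hardware events; y[psc] holds the measured IPC per PSC.
rng = np.random.default_rng(0)
X = rng.random((1000, 6))
y = {psc: rng.random(1000) for psc in range(5)}

# One small random forest per PSC (5 trees each); predict() internally
# averages the per-tree predictions, matching the description above.
forests = {psc: RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y[psc])
           for psc in range(5)}

def choose_psc(window_features):
    """Predict an IPC with every forest and return the PSC with the highest value."""
    preds = {psc: rf.predict(window_features.reshape(1, -1))[0]
             for psc, rf in forests.items()}
    return max(preds, key=preds.get)

print(choose_psc(rng.random(6)))
```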
In Figure 3, we conceptually show the differences between the four options discussed earlier: (a) a single classifier, (b) a single regressor, (c) a suite of classifiers, and (d) a suite of regressors. Respectively, the leaf nodes in (a) contain the PSC choice directly, in (b) the predicted highest IPC among all the PSCs, in (c) the probability of a given instruction window belonging to the given PSC (of which we choose the highest), and in (d) the predicted IPC value for a given PSC, which is averaged across the trees in a given RF and compared with the averaged predictions from the other RFs (of which we choose the highest). For (a) and (b), the PSCs are traversed simultaneously, since the PSCs share a single algorithm, whereas for (c) and (d), each PSC has a unique algorithm that we traverse.
3.3 Puppeteer Training
To train Puppeteer, we need to generate a representative dataset. Consider the case where we have a single prefetcher, \(Pf\), at only one level in the memory hierarchy. Here, the number of PSCs is \(N_{psc} = 2\) (i.e., \(Pf\) = OFF and \(Pf\) = ON). For two consecutive instruction windows, we will have \(N_{psc}^2 = 4\) possible scenarios: (i) \(Pf\) = OFF \(\rightarrow\) \(Pf\) = OFF, (ii) \(Pf\) = ON \(\rightarrow\) \(Pf\) = OFF, (iii) \(Pf\) = OFF \(\rightarrow\) \(Pf\) = ON, and (iv) \(Pf\) = ON \(\rightarrow\) \(Pf\) = ON. With \(N\) instruction windows and \(N_{trace}\) traces, the number of different possible scenarios is then \(N_{trace} \times N_{psc}^N\). As \(N\) increases, the number of different scenarios increases exponentially; hence, including each unique scenario in the training dataset is not feasible. To handle this problem, we propose to use only PSC-invariant events as our features. An example of a PSC-invariant event is the number of conditional branches, which is not affected by the choice of PSC. We check the variance of each hardware event value (for the 180 total hardware events that we can track) across PSCs. We identify 59 events whose values vary by less than \(\pm\)10% from their mean value across all PSCs. We further reduce the number of events by eliminating redundant events that track similar behavior and have high correlation with each other. Table 2 shows the final six events we choose to track trace behavior. After we have identified the PSC-invariant events that will be the features and the PSCs that will be the choices of our ML model, we collect an IPC value per PSC for each instruction window as our ground truth.
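The feature-selection step can be sketched as follows (a hypothetical offline filter in Python; the exact form of the variance test and the correlation cutoff used to drop redundant events are assumptions, since the text only states the \(\pm\)10% criterion):

```python
import pandas as pd

def select_psc_invariant(events: pd.DataFrame, psc_ids: pd.Series, corr_cut: float = 0.9):
    """events: one row per (instruction window, PSC), one column per HPC event;
    psc_ids: the PSC under which each row was collected (same index as events)."""
    overall_mean = events.mean()
    per_psc_mean = events.groupby(psc_ids).mean()

    # Keep events whose per-PSC mean stays within +/-10% of the overall mean.
    invariant = [c for c in events.columns
                 if (per_psc_mean[c].sub(overall_mean[c]).abs()
                     <= 0.1 * abs(overall_mean[c])).all()]

    # Drop redundant events that correlate strongly with an already kept event.
    corr = events[invariant].corr().abs()
    kept = []
    for ev in invariant:
        if all(corr.loc[ev, k] < corr_cut for k in kept):
            kept.append(ev)
    return kept
```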
Using the features and IPC values we have collected, we then form our suite of random forest regressors, wherein we train a separate forest for each PSC using CART (classification and regression trees) [64]. CART is a greedy recursive search algorithm that maximizes information gain by splitting the data at each node using one feature. Each child node is split recursively until there is no further information gain from splitting it. We limit the total number of decision nodes in Puppeteer to keep the size of Puppeteer smaller than the L1$. With this limitation in mind, we conduct a hyper-parameter search and determine that the number of estimators (trees per random forest) should be 5 and the maximum number of nodes should be 100 per tree.
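As an illustration of the CART-style split criterion for regression (a sketch only, not the actual training code: for regression trees the "information" being maximized reduces to variance reduction), the best threshold on a single feature can be found as follows:

```python
import numpy as np

def best_split(feature: np.ndarray, ipc: np.ndarray):
    """Return the threshold on one feature that most reduces the total IPC variance."""
    order = np.argsort(feature)
    x, y = feature[order], ipc[order]
    parent_cost = y.var() * len(y)
    best_threshold, best_gain = None, 0.0
    for i in range(1, len(y)):
        if x[i] == x[i - 1]:          # cannot split between identical feature values
            continue
        cost = y[:i].var() * i + y[i:].var() * (len(y) - i)
        gain = parent_cost - cost
        if gain > best_gain:
            best_threshold, best_gain = (x[i] + x[i - 1]) / 2, gain
    return best_threshold, best_gain   # (None, 0.0) if no useful split exists

rng = np.random.default_rng(0)
f = rng.random(200)
print(best_split(f, 2.0 * (f > 0.5) + 0.1 * rng.random(200)))
```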
3.4 Puppeteer Microarchitecture Design
Figure 4 shows the microarchitecture details of Puppeteer. We use a single-port SRAM array called Node MEM to store information about the nodes that form the trees of each random forest in Puppeteer. We train Puppeteer offline on a group of applications of interest. After the training is complete, we load the random-forest-based ML model of Puppeteer into the Node MEM at startup using firmware. If the processor unexpectedly executes a completely different set of applications than those already considered, the ML model might need to be retrained by including these new applications. This retraining process is performed offline, and the model is reloaded using firmware. As more and more applications are included, we will need to retrain the ML model less often. Each entry of Node MEM corresponds to one node in one of the random forests and consists of the following fields. First, a 3-bit HPC ID field specifies which PSC-invariant event (i.e., which HPC) is used by that node to make a decision; the 3-bit encoding enables the node to use one of six different PSC-invariant events (see Table 2). Second, a 16-bit Threshold field (whose value is determined during training) is employed by the node to decide whether the decision path should branch left or right; in our problem, 16 bits provide enough precision for the ML model weight values. Third and fourth, a 12-bit (for 2,250 node addresses) Left Node Value (LNV) field and a 12-bit Right Node Value (RNV) field represent child node indices for internal nodes of a tree. For the leaf nodes of a tree, we use the LNV and RNV fields to indicate the predicted IPC value of a PSC. Fifth, we differentiate between a child node index and a predicted IPC value using a 1-bit Type field; we use a separate 1-bit Type field for LNV and RNV.
At the end of every instruction window, Puppeteer calculates the predicted IPC for each PSC in the next instruction window by traversing the trees of the associated forest and using the PSC-invariant event values for the current window as inputs. For each forest, the controller in Puppeteer reads the Node MEM index of the root node for the first tree from Root Index Table and loads the Node MEM entry for the root node using a Load Unit into a register. Next, the HPC ID in the loaded Node MEM entry is used to load the corresponding PSC-invariant event value into a second register. Then the Threshold value, stored in the first register, and PSC-invariant event value stored in the second register are compared using the Comparator. Based on the Comparator output, we choose to traverse down to the left child or the right child. The Controller then uses the corresponding index value from LNV or RNV to find the next node in Node MEM. The Controller continues traversing the tree until it loads a predicted IPC value corresponding to a leaf from the Node MEM. The preceding steps are repeated for the remaining trees in the forest, then we calculate the average of the predicted IPC values obtained from all trees in that forest. The Best PSC Unit in the Controller stores the ID of the PSC with the highest predicted IPC value. Every time the Controller finishes traversing a forest, the predicted IPC value of that forest (i.e., PSC) is compared with the predicted IPC value stored in the Best PSC Unit using the Comparator. If the new predicted IPC value is higher than the current value, the Best PSC Unit updates the predicted IPC value and the ID of the PSC. Once all forests have been traversed (i.e., all PSCs have been evaluated), Puppeteer chooses the entry stored in the Best PSC Unit as the PSC for the next instruction window.
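The following sketch mirrors the traversal described above in software (field names follow the text; the fixed-point encoding of thresholds and predicted IPC values is abstracted away, so the data layout is only illustrative):

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Node:                 # one Node MEM entry
    hpc_id: int             # 3-bit: which PSC-invariant event to compare
    threshold: int          # 16-bit fixed-point value learned offline
    lnv: int                # left child index, or predicted IPC if left_is_leaf
    rnv: int                # right child index, or predicted IPC if right_is_leaf
    left_is_leaf: bool      # 1-bit Type field for LNV
    right_is_leaf: bool     # 1-bit Type field for RNV

def traverse_tree(node_mem: Sequence[Node], root: int, hpc_values: Sequence[int]) -> int:
    """Walk one tree from its root node until a leaf (predicted IPC) is reached."""
    idx = root
    while True:
        node = node_mem[idx]
        if hpc_values[node.hpc_id] <= node.threshold:
            idx, is_leaf = node.lnv, node.left_is_leaf
        else:
            idx, is_leaf = node.rnv, node.right_is_leaf
        if is_leaf:
            return idx      # the leaf slot stores a predicted IPC value, not an index

def choose_best_psc(node_mem: Sequence[Node], roots_per_psc: List[List[int]],
                    hpc_values: Sequence[int]) -> int:
    """Average the per-tree predictions for each forest (PSC) and keep the best PSC."""
    best_psc, best_ipc = 0, float("-inf")
    for psc, roots in enumerate(roots_per_psc):
        ipc = sum(traverse_tree(node_mem, r, hpc_values) for r in roots) / len(roots)
        if ipc > best_ipc:
            best_psc, best_ipc = psc, ipc
    return best_psc
```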
We determined that a maximum depth of 10 per tree is more than sufficient to accurately determine the best PSC. In our evaluation, we use a prefetcher system with \(N_{psc}\) = 5 (given later in Table 6 and discussed in detail in Section 4). We need a total of 2,250 nodes to design the trees in Puppeteer, and these nodes require a 10.75-KB Node MEM (compared to a typical L1$ of 32 KB). Other than Node MEM, we require a \(5 \times N_{psc}\)-entry Root Index Table where each entry is 13 bits wide (12 bits for the root node index and 1 valid bit), a 12-bit comparator containing comparison logic and two registers, a load unit, and a register to store the best PSC information in the Controller. We discuss the hardware overhead in more detail in Section 5.
4 Evaluation Methodology
We use ChampSim [50] for our analysis, where we model 1C, 4C, and 8C processors with multiple prefetchers at each level of the cache—private L1I$ and L1D$, private L2$, and shared LLC (more details are provided in Table 2). We train Puppeteer using data collected from 1C and 4C OoO processors and then evaluate it on 1C, 4C, and 8C OoO processors. In the 4C and 8C processors, we have a private Puppeteer per core. Each private Puppeteer utilizes the six PSC-invariant events of its core and makes independent decisions. For our evaluation, we use a diverse set of 232 traces generated from the SPEC2017 [12], SPEC2006 [11], and Cloud [51] benchmarks. Here, a trace is a group of instructions that represents a specific behavior. For example, a benchmark with consistent looping behavior can be represented by one trace corresponding to a single iteration of the loop. One or more unique representative traces are generated from each benchmark using clustering algorithms [48] to represent the different behaviors of a benchmark. The traces we use for our evaluation were generated by the organizers of the prior prefetching competitions [49, 51].
In Table 3, we list the notations used for all static PSCs and the runtime managers (which change the PSC at runtime) that we have evaluated. In our evaluation, we normalize all IPC values to the same state-of-the-art baseline as PPF [6]—that is, SPP [32] (no-no-spp-no). We compare Puppeteer against the best static PSCs from the competitions (i.e., IPCP (no-ipcp-ipcp-nl) and EIP (EIP-nl-spp-no)), as well as managers such as the final version of the algorithm developed by Jiménez et al. [29] that uses trial periods to latch onto the PSC (J3), Pythia, an RL-based algorithm developed by Bera et al. [5] (PY), a multi-layer perceptron similar to Bhatia et al. [6] (NN), and a binary-tree (BT) based algorithm [37], where Liao et al. tried several different ML methods (e.g., decision trees and NN) and concluded that decision trees are the best choice.
4.1 Training Approaches for ML-Based Prefetcher Managers
When evaluating any new idea, a widely used approach in industry is to use all available evaluation data. However, we are using an ML-based approach and are cognizant of the fact that we do not want to overfit our model by using all available data for training. So here we compare three different approaches for constructing the training dataset for Puppeteer. For training, we have 232 traces \(\times\) 10,000 instruction windows per trace = 2,320,000 instruction windows. For each approach, we limit the training data in different ways. Note that we still train at the instruction-window level and perform 10-fold cross-validation on the training set for all approaches. We give the percentage of data used to construct each training dataset in Table 4.
In the first approach, we severely limit the data and randomly select 20% of the instruction windows, at random time intervals, to form the training dataset (i.e., 80% of the instruction windows are not seen during training). The idea behind this approach is that we are training the algorithm to recognize low-level memory access behavior. This approach does not overfit the model because the behavior of the benchmarks when collecting the training data would be different from the behavior of the benchmarks when we use the trained Puppeteer. This difference arises because when collecting training data we do not change the PSC, whereas when using Puppeteer the PSC can potentially change for each instruction window. In the second approach, we use 80% of the traces for testing. We consider all instruction windows of the remaining 20% of traces and perform a 60–40 split, where 60% of the windows are used for training and 40% of the windows are used for validation. This avoids overfitting. Given that traces are constructed to represent unique behaviors in each application, our training set will not have traces corresponding to all of the unique behaviors of a given benchmark. This means that some of the traces from other benchmarks are assumed to be from the same distribution as these missing traces, or else there would be no way for the algorithm to train for these regions. In the third approach, we generate the training dataset using 20% of the benchmarks, and for those benchmarks we use a 60–40 split of the instruction windows. The remaining 80% of the benchmarks are used for testing. In this approach, whole benchmarks are not considered during training. Here, too, we choose a 60–40 split of the instruction windows to avoid overfitting the model to the 20% of benchmarks in the training set.
In a 1C processor, when using Puppeteer on unseen benchmarks, for the first, second, and third approaches we observe an average IPC gain of 11.3%, 9.8%, and 6.6%, respectively, over SPP. In the second and third approaches, we see that Puppeteer has a lower performance gain than in the first approach. Despite having 3% more instruction windows for training in the third approach than in the second approach, there is a 3.2% drop in performance. Furthermore, in the first approach, we use only 10% of the total number of instruction windows for training (5\(\times\) fewer instruction windows than the other approaches), yet this approach performs 1.5% and 4.7% better than the other two approaches. This implies that the second and third approaches are trained on an insufficient number of unique memory access patterns, leading to lower performance gain. Therefore, for the rest of our evaluation, we train all ML-based prefetcher managers using the first approach. As mentioned earlier, Puppeteer can always be retrained and updated via firmware.
4.2 1C, 4C, and 8C Workload Formulation
We run 1C and 4C experiments using a 200M-instruction warmup phase and a 1B-instruction detailed simulation phase, whereas for 8C we use a 200M-instruction warmup phase and a 250M-instruction detailed simulation phase. We generate a total of 232 traces from the SPEC2017, SPEC2006, and Cloud benchmarks. We create two flavors of trace sets for the 4C and 8C experiments: a single-type trace set where each core runs the same trace, and a mixed-type trace set where each core runs a unique trace. Therefore, we have five data suites in total: 1C, 4C single-type trace set (4C STS), 4C mixed-type trace set (4C MTS), 8C single-type trace set (8C STS), and 8C mixed-type trace set (8C MTS). We use the hashed-perceptron branch predictor and the least-recently-used cache replacement policy provided by ChampSim. To construct our mixed-type trace sets, we first split the 232 traces into six groups based on ascending execution latency. Then we randomly select a trace from a given group per core and construct a mixed-type trace group. We cover all permutations of the various latency groups. This way, we have diversity in the traces running on the cores. For example, in a 4C processor, we can assign a trace from group 2 to core0, a trace from group 3 to core1, a trace from group 4 to core2, and a trace from group 5 to core3. This can be represented as \(2-3-4-5\). Therefore, we have 360 experiments (6 options for core0 \(\times\) 5 options for core1 \(\times\) 4 options for core2 \(\times\) 3 options for core3) in 4C MTS for six latency groups and four selected traces (one for each core). For 8C MTS, since the number of experiments increases and the experiments take quite long, we replicate the same latency group for four of the cores. For example, we can use \(1-1-1-1-4-4-4-4\) for an 8C experiment. However, instead of the 6 groups used in the 4C system, we use 10 groups for the 8C system to cover a more granular variety of behavior, and instead of permutations we use combinations with replacement. Therefore, we have 55 experiments (two latency groups selected from 10 with replacement, i.e., \(C^R(10,2)\)) for 10 groups and two selections. Note that even if the latency group is the same, since we choose a trace randomly from a latency group, the trace can still be different for each core.
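The workload construction can be reproduced with a short enumeration (a sketch using Python's itertools; the mapping from a latency group to a concrete trace is the random per-core draw described above and is omitted here):

```python
from itertools import combinations_with_replacement, permutations

# 4C MTS: one latency group per core, all orderings of 4 distinct groups out of 6.
mts_4c = list(permutations(range(6), 4))
assert len(mts_4c) == 360          # 6 * 5 * 4 * 3 experiments

# 8C MTS: two latency groups chosen from 10 with replacement, C^R(10, 2) = 55,
# with the first group replicated on cores 0-3 and the second on cores 4-7.
mts_8c = list(combinations_with_replacement(range(10), 2))
assert len(mts_8c) == 55

g1, g2 = mts_8c[0]
core_to_group = [g1] * 4 + [g2] * 4   # e.g., a 1-1-1-1-4-4-4-4 style assignment
```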
4.3 Generating ML Models and PSC Pruning
To generate the 1C dataset (PSC and associated IPC values), we run the 1C experiments using all possible PSCs generated from the prefetcher options available in the ChampSim repository; the first-place (IPCP [45]), second-place (Bingo [4]), and third-place (MLOP [57]) winners of the Third Data Prefetching Competition (DPC3) [49]; and the first-place (EIP [55]), second-place (FNL+MMA [56]), and third-place (DJOLT [42]) winners of the First Instruction Prefetching Competition (IPC1) [51]. To reduce the hardware overhead of Puppeteer, we avoid including multiple PSCs that cover the same traces. To this end, we initially ran 20 traces for 20M instructions with all possible PSCs (5 prefetching options in L1I$ \(\times\) 5 prefetching options in L1D$ \(\times\) 6 prefetching options in L2$ \(\times\) 2 prefetching options in LLC = 300 PSCs). We summarize the different prefetchers that we evaluated in Table 5.
For each trace, we sort the PSCs based on the corresponding IPC values in descending order. We generate a new table for each trace, where the table contains the top 10 PSC entries for that trace, and combine these tables to form a super-table that contains the top 10 PSCs for all traces. Note that a PSC may be in the top 10 for more than one trace. We sort the PSCs in descending order based on the number of traces for which the PSC is in the top 10. Starting from the top, we select just enough PSCs to improve the performance of all 20 traces. We pick the PSCs that have the best performance with minimal coverage overlap while reducing the number of unique prefetchers (to reduce the hardware overhead). In Table 6, we show the final 5 PSCs that we selected, along with the prefetcher used at each cache level. These PSCs give good performance for the maximum number of traces. It is interesting to note that the best prefetchers from DPC3 and IPC1—IPCP and EIP—are not used in the top 5 choices for PSCs. This shows that state-of-the-art prefetchers designed in isolation may not be the best choice when used with other prefetchers.
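One plausible reading of this pruning procedure is sketched below (hypothetical `ipc[trace][psc]` measurements; the final manual step of trading off coverage overlap against the number of unique prefetchers is not captured by the greedy loop):

```python
from collections import Counter

def prune_pscs(ipc: dict, top_k: int = 10):
    """ipc: {trace: {psc: measured IPC}}. Return a small set of PSCs such that
    every trace has at least one of its top-k PSCs in the selection."""
    top = {t: sorted(p, key=p.get, reverse=True)[:top_k] for t, p in ipc.items()}

    # Super-table: how many traces place each PSC in their top-k.
    counts = Counter(psc for pscs in top.values() for psc in pscs)

    selected, covered = [], set()
    for psc, _ in counts.most_common():       # most frequently top-k first
        newly = {t for t, pscs in top.items() if psc in pscs} - covered
        if newly:
            selected.append(psc)
            covered |= newly
        if len(covered) == len(ipc):          # every trace is covered
            break
    return selected
```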
We collect the 4C STS and 4C MTS datasets in the same way as the 1C dataset—the only difference is that the data is collected per core. We then train BT and Puppeteer using datasets collected from 1C, 4C STS, and 4C MTS. We denote the different flavors of the two algorithms as P1C, P4CS, and P4CM for Puppeteer, and B1C, B4CS, and B4CM for BT.
5 Evaluation Results
5.1 Puppeteer in a 1C Processor
In this and the following sections, we present a processor performance analysis when using Puppeteer in 1C, 4C, and 8C processors, a sensitivity analysis of how Puppeteer's performance varies with cache size, and an exploration of the trade-off between performance improvement and hardware overhead when using Puppeteer. We evaluate Puppeteer using scope, accuracy, and power metrics. For the analysis presented in Sections 5.1, 5.2, 5.3, and 5.4, we assume a total hardware overhead budget of 11 KB for storing the Puppeteer model, and due to its simplicity (as explained in Section 2), we assume negligible evaluation overhead for the computing logic.
In Figure 5, we show the IPC distribution for various static PSCs and the prefetcher managers that change the PSC at runtime. We begin our discussion with P1C, which is Puppeteer trained using the 1C data suite. Broadly, compared to a processor with no prefetchers, P1C provides an average performance gain of 46.0% (peak value of 613%). The true benefit of Puppeteer is observed when we look more closely at the performance loss in Figure 6. We observe that when using B1C, 53 traces lose performance, and 6 of those 53 traces lose more than 10% performance. This performance loss would make the solution unacceptable. Puppeteer has a worst-case loss of only 5%, and only 8 traces in total have lower performance than SPP. In Figure 7, we show the performance loss/gain of the 10 traces that have the worst performance for Puppeteer when using five different managers: NN, B1C, J3, PY, and P1C (Puppeteer). Across these 10 traces, Puppeteer, NN, B1C, J3, and PY have an average performance loss of 1.9%, 6%, 6.6%, 4.8%, and 4.9%, respectively.
This clearly illustrates that Puppeteer provides a win-win situation: we not only see a better average performance gain but also a reduction in the maximum performance loss and in the number of traces that lose performance. Compared to the other prior works—IPCP, EIP, NN, PY, and J3—P1C provides 4.65%, 5.8%, 12.2%, 15.3%, and 5.1% average IPC gain, respectively.
We also trained two more flavors of Puppeteer and BT using the 4C STS and 4C MTS datasets. P4CS and P4CM achieve 0.5% lower average performance gain than P1C. This is because Puppeteer trained using 1C data is better suited for a 1C processor. However, the performance improvement when using Puppeteer trained with 4C data is still quite large, at 45% (average of P4CS and P4CM) compared to a system with no prefetching. This means that Puppeteer generalizes quite well and has learned the underlying architectural phenomena. For BT, B4CS and B4CM achieve 1% and 2% lower average performance than B1C, respectively. Furthermore, the performance loss for the worst-case outlier increases to 30%. Finally, one thing to note about J3 is that its average IPC gain is 2.1% lower than B1C's, yet its negative outliers are fewer in number and smaller in magnitude. This is because J3 makes more conservative changes than BT, since the algorithm cycles through the PSCs as part of a constantly ongoing trial phase. This means that in a production-capable system, J3 might have been more viable than BT, because processors have to ensure that all applications retain or increase their performance across new processor generations.
5.2 Puppeteer in a 4C Processor
Here we discuss the use of Puppeteer trained using the 1C, 4C STS, and 4C MTS datasets in a 4C processor. These cases are represented by P1C, P4CS, and P4CM, respectively. In Figure 8, we show the IPC distribution of Puppeteer and the various prior works in a 4C system while running the 4C STS data suite. We observe that the average IPC gain of P1C over the no-prefetching case is 25.8%. Compared to B1C, P1C has 4.2% higher average performance. The worst-case performance loss of B1C has gone up to 45%, whereas the worst-case loss of P1C is only 19%. The number of traces that lose performance is 61 for B1C, whereas P1C has just 24 traces that lose performance. Compared to IPCP, EIP, NN, PY, and J3, our P1C achieves 10.8%, 4.8%, 8.7%, 6.8%, and 3.7% average IPC gain, respectively. One key observation here is that although P1C has only been trained on 1C data, it is still applicable to the 4C case and provides a clear advantage over prior work. This is important given that as the number of cores increases, the number of unique combinations of different traces that we would need to run in the multi-core processor increases exponentially (for a 4C processor, we would need to cover \(232^4\) = 2.89 billion trace combinations). Therefore, generalization using only 1C data is an important aspect to consider when comparing ML-based algorithms. To further test Puppeteer and BT, we train both algorithms with the 4C STS and 4C MTS data suites. For Puppeteer, we observe that P4CS has 1.2% and P4CM has 2.2% better performance compared to P1C (see Figure 8). For BT, B4CS has 4.0% and B4CM has 2.9% better performance compared to B1C. This means that Puppeteer is better at learning the underlying (micro)architectural behavior with a variety of traces running concurrently compared to the same trace running on all cores. In contrast, BT requires data collected specifically from the same set of experiments to achieve better performance.
In Figure 9, we show the IPC distribution of Puppeteer and the various prior works when running 4C MTS. P1C achieves a 23% average performance gain over no prefetching. P1C's average performance gain is also very close (\(\lt\)0.6% difference) to the average performance gain of P4CS and P4CM. Once again, we observe that Puppeteer is superior to BT in this regard, with B1C achieving 5% lower performance gain than P1C. B1C's performance gain is also 4% lower than B4CS's and 4.5% lower than B4CM's. When we observe the outliers, B1C has 23 outliers, whereas P1C has only 3. The worst-case outlier in B1C has a 7% performance loss compared to SPP, whereas P1C has only a 0.4% loss. Compared to IPCP, EIP, NN, PY, and J3, our P1C (and also P4CS and P4CM) achieves 14.5%, 5%, 12.8%, 11.9%, and 4.4% average IPC gain, respectively.
5.3 Puppeteer in an 8C Processor
Here we discuss the use of Puppeteer trained using the 1C, 4C STS, and 4C MTS datasets in an 8C processor. These cases are represented by P1C, P4CS, and P4CM, respectively. We conduct this 8C processor analysis to test whether Puppeteer scales to a larger number of cores. For an 8C processor running STS, we observe several interesting trends (Figure 10). P1C has an 11.9% average IPC gain over no prefetching, which is 4.8% higher than B1C. With P4CS and P4CM, our IPC gain is 12.7% and 12.9%, respectively, over the no-prefetching case. This means that to train Puppeteer for a multi-core processor, we should have at least some data corresponding to a multi-core processor in our dataset. It should be noted that, compared to the no-prefetching case, P1C, P4CS, and P4CM have lower IPC gain in the 8C processor than in the 4C processor, which in turn has lower performance gain than the 1C processor. This is because prefetching is much harder in multi-core processors; the overall benefit of prefetching goes down, and Puppeteer has a lower possible peak performance. Comparatively, B1C has almost 0% performance gain over SPP. Even a static PSC, namely EIP, has 5.5% higher average IPC compared to B1C. P1C also gives comparable performance to B4CS and B4CM even though P1C was not trained using any data from a multi-core processor. When we observe the outliers, B1C has 83 outliers with a worst-case performance loss of 62%, whereas P1C has 47 outliers with a worst-case performance loss of 35%. Compared to IPCP, EIP, NN, PY, and J3, our P1C achieves 5.9%, 0.2%, 6.5%, 4.4%, and 1.2% average IPC gain, respectively.
When using the 8C MTS (Figure 11), we observe similar trends to when using the 8C STS. P1C has 2% better average IPC than B1C and comparable performance to B4CS and B4CM. P4CS and P4CM achieve 4% and 4.4% better average IPC compared to B1C. Compared to IPCP, EIP, NN, PY, and J3, our P1C achieves 6.9%, 1.6%, 6.6%, 7.6%, and 3.1% average IPC gain, respectively.
5.4 Puppeteer Performance for Different Cache Sizes
In this section, we discuss how the performance of BT, SPP, and Puppeteer varies with cache size. We train both models (BT and Puppeteer) using only data collected on a 1C processor with a 32-KB L1I$, 48-KB L1D$, 512-KB L2$, and 2-MB LLC. In Figure 12, we show the average IPC values of BT (i.e., B1C), SPP, and Puppeteer (i.e., P1C) for \(0.5\times\), \(1\times\), \(1.5\times\), and \(2\times\) the nominal L1$ and L2$ sizes. P1C is affected by the L1$ size slightly more than the L2$ size, but the difference is small (at most 0.4% more performance at \(2\times\) L1$ compared to \(2\times\) L2$). At \(0.5\times\) L1$ size, P1C has 1.5% lower performance compared to P1C at the nominal cache size. At \(2\times\) L1$ size, P1C has 1.1% better performance compared to P1C at the nominal cache size. At all cache sizes, P1C performs at least 1% better than B1C, with the largest performance difference of 2.3% at the nominal cache size. SPP performance swings by 5.5% between the \(2\times\) L1$ and the \(0.5\times\) L1$, and by 3.7% between the \(2\times\) L2$ and the \(0.5\times\) L2$. In contrast, P1C has only a 2.9% swing. This means that with P1C, we are already operating close to the possible peak performance, and so increasing the cache size does not have a significant effect on performance.
5.5 Performance Improvement for Different Model Sizes
In Figure 13, we show the average IPC for 1C for the differently sized models. As is immediately apparent, Puppeteer provides a good performance improvement even when using a small model that fits within 1 KB, whereas NN does not provide any performance improvement until we use a model that requires 11 KB. At an 11-KB model size, a 1C processor with an NN manager has 11.8% lower average IPC compared to a 1C processor with Puppeteer. The NN also has \(\sim 3\times\) larger area and \(11\times\) larger power compared to Puppeteer. If we compare the 100-KB NN to the 1-KB Puppeteer, Puppeteer provides an 11.6% performance improvement, whereas NN provides a 9.7% performance improvement, while the area and power of the 100-KB NN are \(76\times\) and \(185\times\), respectively, those of Puppeteer. Therefore, NN is not suitable for hardware prefetcher adaptation.
5.6 Puppeteer Overheads
To understand the tradeoff between the overhead of selecting a PSC using Puppeteer and the performance improvement it provides, in this section we design Puppeteer under four different hardware overhead budgets: 1 KB, 5 KB, 11 KB, and 100 KB. We train a different Puppeteer model for each hardware overhead budget.
PSC selection. To select a PSC, Puppeteer simply traverses the random forest for each PSC. If we evaluate all five forests in series, we require at most 500 comparison operations (5 forests \(\times\) 5 trees per forest \(\times\) 10 comparisons = 250 comparisons for the 1-KB and 5-KB designs; 5 forests \(\times\) 5 trees per forest \(\times\) 20 comparisons = 500 comparisons for the 11-KB and 100-KB designs), which takes less than 0.5% of the total time required to execute the 100K instructions in the instruction window (assuming each instruction takes on average one clock cycle). Thus, we end up using the chosen PSC for 99.5% of the instruction window. We would like to note that the hardware for determining the best PSC is not on the critical path; it runs in parallel to the normal application execution without stopping the prefetchers.
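A quick back-of-the-envelope check of this latency budget (a sketch using only the numbers quoted above):

```python
forests, trees_per_forest = 5, 5
comparisons_per_tree = 20                       # 10 for the 1-KB/5-KB designs
window_instructions = 100_000                   # instructions per window, ~1 cycle each
total_comparisons = forests * trees_per_forest * comparisons_per_tree   # 500
print(total_comparisons / window_instructions)  # 0.005 -> at most 0.5% of the window
```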
Power and area overheads. In Table 7, we show the total area and power required for Puppeteer. All designs in Table 7 are such that they need less than 0.5% of the instruction window time to choose a PSC. We calculate the power, area, and delay values using Cadence Genus and the SRAM array compiler for 22-nm GLOBALFOUNDRIES® (GF22FDX) [10]. These area and power overheads are much smaller than the area and power of the overall processor. In Table 8, we provide more details about the Puppeteer design. We show the number of each type of logic component and the SRAM sizes for different configurations of Puppeteer. Note that the SRAM word length changes for the different-sized Puppeteer configurations because we require additional bits to address a larger number of nodes.
We would like to note that in most modern processors, the memory subsystem has multiple prefetchers at each level of the cache hierarchy [1, 2, 3, 7, 21, 26, 27, 65]. In Table 6, we show the overhead due to the prefetchers (not including Puppeteer) at each level of the memory hierarchy. This overhead is not unusual for today's server-class CPU architectures, which use multiple prefetchers to target a variety of memory patterns.
5.7 Puppeteer Versus Other Managers
Power in caches. In Figure 14, we show the average power consumed in the caches for various static PSCs and the prior managers compared to Puppeteer. We calculate the power using Cadence Genus and the SRAM array compiler for GF22FDX® [10]. Compared to the no-prefetching case, Puppeteer has 65% higher power consumption overall. Compared to BT (average of B1C, B4CS, and B4CM), Puppeteer (average of P1C, P4CS, and P4CM) has 9% lower average power. NN and J3 have 6.3% and 2.2% lower average power than Puppeteer, respectively, and as we have discussed, they also have significantly lower average IPC. Among the static PSCs, EIP is extremely power hungry, with the highest average power, 20.4% more than Puppeteer. IPCP and NO have lower power consumption than Puppeteer, but they also have the worst performance among all options we have considered.
Scope. In Figure 15, we compare the prefetching scope of the prior works and Puppeteer. Here we determine scope as the total number of misses reduced by the prefetcher or prefetching system, divided by the total number of misses with prefetching disabled. We observe very limited scope across the four levels of the memory hierarchy for NN and J3. This is probably one of the reasons these options have lower performance than BT and Puppeteer. Comparing BT (average of B1C, B4CS, and B4CM) and Puppeteer (average of P1C, P4CS, and P4CM), in the L1D$ and L2$, BT achieves a marginal difference with 0.5% broader scope than Puppeteer. In the L1I$, BT actually has 3.3% broader scope than Puppeteer. However, Puppeteer has 4.8% broader scope than BT in the LLC. Intuitively, since LLC penalties are more important than L1I$, L1D$, and L2$ penalties, this is a better outcome. Experimentally, our results also support this outcome, since Puppeteer has better performance than BT.
Accuracy. In Figure 16, we compare the prefetching accuracy of Puppeteer and the prior works. Here, the accuracy of a prefetcher is measured as the number of misses that a prefetcher or prefetcher system has reduced compared to the case when prefetching is disabled, divided by the number of misses caused by the prefetcher. Puppeteer is better than the other options in the L1I$ and L1D$, and comparable in the L2$. In the LLC, J3 fares better than the other options, with only 1.5% lower accuracy than Puppeteer. In the case of BT and NN, Puppeteer achieves 21% better accuracy than BT and 28% better accuracy than NN. It is also of note that since the accuracy of Puppeteer in the L1I$ is 6.7% better than BT's, this better accuracy plays a role in compensating for the broader L1I$ scope of BT compared to Puppeteer.
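For reference, the two metrics as defined in this section reduce to the following (a small helper sketch; the miss counts are assumed to be gathered per cache level from the simulator):

```python
def prefetch_scope(misses_no_prefetch: int, misses_with_prefetch: int) -> float:
    """Fraction of the baseline (prefetching-disabled) misses removed by the prefetchers."""
    return (misses_no_prefetch - misses_with_prefetch) / misses_no_prefetch

def prefetch_accuracy(misses_no_prefetch: int, misses_with_prefetch: int,
                      misses_caused_by_prefetcher: int) -> float:
    """Misses removed relative to the misses the prefetchers themselves caused."""
    return (misses_no_prefetch - misses_with_prefetch) / misses_caused_by_prefetcher
```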
5.8 Temporal Variations in PSC When Using Puppeteer
In Figure 17, we show the temporal behavior of P1C, B1C, and J3 while running 429.mcf-217B as an example. Figure 17(a) shows the percentage of time for which each PSC was used by each manager algorithm when executing 1.2B instructions of 429.mcf-217B. In Figure 17(b), we show the IPC values and the PSC used in each instruction window for a small slice of the same trace. We would like to note three interesting observations. First, from Figure 17(a), both B1C and P1C use fnl-bingo-spp-no for 80% of the instruction windows, yet the performance difference between the two is around 20% over the whole trace. This means that the PSCs chosen in the remaining 20% of the instruction windows have a larger influence on the overall performance. Second, in the first 25 instruction windows after the 5,000th instruction window, there is a large variation in IPC gain when using B1C as compared to P1C, whereas all three algorithms converge to the same performance and the same PSC during the last 25 instruction windows. This shows that P1C does a better job of predicting the PSC in different regions of an application. Third, J3 uses the same PSC as P1C but has lower performance in the first 25 instruction windows. This illustrates that changing the PSC has a cumulative effect on IPC: the choices of PSC made by P1C in prior instruction windows allowed P1C to gain more performance in the given instruction windows compared to J3.
5.9 Random Forest Regressors Versus Other Regressors
In this section, we compare (see Figure 18) the random forest regressor, which is our choice of regression algorithm for Puppeteer, with three other regression algorithms: a linear regressor [67], a passive aggressive regressor [13], and a gradient boosted regressor [40]. We compared the four regression algorithms for the 1C system. When used as part of Puppeteer, the random forest regressor has 1.7%, 1.2%, and 0.55% higher average IPC compared to the linear regressor, passive aggressive regressor, and gradient boosted regressor, respectively. Furthermore, we observe 6.7%, 14.9%, and 8.2% worst-case IPC loss for the linear regressor, passive aggressive regressor, and gradient boosted regressor, respectively, compared to the 5% worst-case IPC loss for the random forest regressor. The performance of Puppeteer could conceivably be further increased by mixing and matching different regressors for each PSC.