
Finding Near-optimal Configurations in Colossal Spaces with Statistical Guarantees

Published: 23 November 2023

Abstract

A Software Product Line (SPL) is a family of similar programs. Each program is defined by a unique set of features, called a configuration, that satisfies all feature constraints. “What configuration achieves the best performance for a given workload?” is the SPL Optimization (SPLO) challenge. SPLO is daunting: just 80 unconstrained features yield \(2^{80}\approx 10^{24}\) unique configurations, roughly the estimated number of stars in the universe. We explain (a) how uniform random sampling and random search algorithms solve SPLO more efficiently and accurately than current machine-learned performance models and (b) how to compute statistical guarantees on the quality of a returned configuration; i.e., it is within x% of optimal with y% confidence.

1 Introduction

A Software Product Line (SPL) is a family of programs with similar functionalities. Each SPL program or product is defined by features, i.e., standardized increments of program functionality. Features have constraints: a feature may require and/or preclude other features. All features and their constraints are defined in a feature model. A configuration is a unique set of features that satisfies the SPL’s feature model. The configuration space or product space of an SPL, denoted \(\mathbb {C}\), is the set of all SPL configurations, exactly one program/product per configuration. A configuration space can be colossal (\(\gg\)\(10^{10}\)); a set of f unconstrained features yields a space of size \(2^{f}\). A space of size 250K, which is near the upper limit of product space enumeration [116], has f\(\approx\)18 features, which is tiny for an SPL. Most SPLs are larger. Table 1 lists the sizes of contemporary SPLs taken from [48, 50, 69, 79, 89].
Table 1. SPL Space Sizes
Clients want an SPL program to satisfy constraints. Functionality constraints declare required or forbidden features. There are environmental (hardware and platform) constraints, performance constraints on program usage workloads, and specification challenges—mutually exclusive features often implement the same functionality in different ways, each with a unique performance surface. There are also (sometimes unknown) combinations of features that are advantageous or detrimental to performance. Given these hurdles, what product of an SPL achieves the best performance? This is the challenge of SPL Optimization (SPLO).
SPLO is daunting. The complexity of feature constraints and the performance influence of features and feature interactions are beyond human reasoning. Simply using default configurations is notoriously bad [7]. To find a configuration with near-optimal performance is known to be difficult [41, 43, 49, 55, 60, 83, 84, 85, 88, 91, 94, 100, 103, 115, 127]. The Contestants. There are two known ways to find a near-optimal configuration \(c_{no}\) in a product space: (a) create a Performance Model (PM) and use an optimizer or (b) randomly search using Uniform Random Sampling (URS)—every configuration in \(\mathbb {C}\) has an equal probability of being selected (e.g., \(\frac{1}{|\mathbb {C}|}\) where \(|\mathbb {C}|\) is the cardinality of \(\mathbb {C}\)).
The upper path in Figure 1 abstracts the process of Machine Learning (ML) PMs: a configuration space is randomly (and not necessarily uniformly) sampled; samples are interleaved with model learning until a model is sufficiently accurate. An optimizer uses a PM with a workload and functionality constraints to find a \(c_{no}\).
Fig. 1. Performance modeling vs. random searching.
The bottom path abstracts random searching: a workload-and-functionality-constrained subspace is uniformly sampled until a \(c_{no}\) is found.
Why Is SPLO Hard? Three reasons:
URS is a gold standard for statistical analysis. Uniformly sampling an enumerated space is easy: randomly select an integer from [1..\(|\mathbb {C}|\)] and index to that configuration. Enumeration of colossal spaces is infeasible, so non-URS sampling methods are used instead [1, 3, 25, 27, 34, 42, 49, 62, 65]. Probabilistic models of URS are simple, but rarely so for non-URS methods. And each configuration is a solution to a propositional formula; how to index to a solution is unknown.
Building and benchmarking a configuration is very expensive. Minimizing the sample size while achieving accuracy is critical to all approaches. Today, only heuristics are known, like: use sample size (f, 2\(\cdot\)f, 3\(\cdot\)f, ...), where f is the SPL’s number of features [42, 46].
Statistical guarantees on the quality of returned \(c_{no}\)s should be required: a \(c_{no}\) is within \(x\%\) of optimal with \(y\%\) confidence. Such statistical guarantees are unknown today.

1.1 The Central Questions of SPLO

Let a sample be a set of configurations whose cardinality is its size. Let \(c_{best}\) be a product in \(\mathbb {C}\) that has the optimal performance for a given workload and functionality constraints. Then:
(1)
How does one find a \(c_{no}\) in an SPL configuration space?
(2)
How accurate (e.g., how near \(c_{best}\)) is the returned \(c_{no}\)?
(3)
What sample size should be used?

1.2 Contributions of This Article

Order statistics and URS [10, 126] provide an SPLO statistical guarantee: a returned \(c_{no}\) is within \(x\%\) of optimal with \(y\%\) confidence;
Given any two of (a) accuracy (\(x\%\)), (b) confidence (\(y\%\)), and (c) sample size, the third can be determined mathematically, which leads to standardized answer tables;
A scalable algorithm to uniformly sample colossal (\(\gg\)\(10^{10}\)) configuration spaces;
Experimental SPLO results comparing \(c_{no}\) recommendations of existing ML PMs with those of random search algorithms on enumerable SPLs with \(\le\)250K products;
Experimental SPLO results on random search algorithms in colossal SPL spaces: one has \(10^{12}\) products and another has \(10^{81}\);
The first solution to the Fixed Budget SPLO problem: given a fixed sample size, return the best \(c_{no}\) with statistical qualifications using multiple random search algorithms.

2 Results On Performance Modeling

2.1 Basic Facts

Performance Modeling. ML approaches to PM creation are enormously diverse [73, 95]; we do not try to be exhaustive or complete. Instead, we review ideas of Linear Regression (LR), a popular ML approach used in SPLO. Let \(\hat{\$}(c)\) be the estimated performance of configuration \(c\!\in \!\mathbb {C}\). A common form of \(\hat{\$}(c)\) is [33, 43, 67, 107, 108]
\begin{align} { \hat{\$}(c) ~~=~~ \beta _0 ~+~ \beta _1\cdot x_1(c) ~+~ \beta _2\cdot x_2(c) ~+~ \cdots ~+~ \beta _h\cdot x_h(c). } \end{align}
(1)
Consider any \(x_i(c)\) term in Equation (1). Either \(x_i(c)\) represents a unique feature, say \(F_j\) in Figure 2(a), meaning \(x_i(c)\) = 1 if \(F_j\) is present in \(c\) and 0 otherwise, or \(x_i(c)\) represents a \(t\)-way interaction of \(t\) \(\gt\) 1 distinct features. Suppose \(x_i(c)\) is the three-way interaction of features \(\lbrace F_j,F_k,F_q\rbrace\) in Figure 2(b), meaning \(x_i(c)\) = 1 if \(F_j,F_k,\) and \(F_q\) are all present in \(c\) and 0 otherwise. If an SPL has f features, the number of distinct \(x_i(c)\) terms is \(2^{f}-1\). For any reasonable f, \(2^{f}\) is far too big. So a typical approach finds two-way to five-way interactions that are important to performance [43, 67, 106], so that \(h\)\(\ll\)\(2^{f}\) in Equation (1). Recent research suggests three-way interactions are sufficient [67].
Fig. 2. Definitions of \(x_i(c)\).
Let \(\$(c_r)\) be the benchmarked value of configuration \(c_r\). Given a set of \(\lbrace (c_r, \$(c_r))\rbrace _{r=1..t}\) pairs, LR finds the value \(B=[\beta _0 \ldots \beta _{h}]\) that minimizes the sum of the squares of differences between measured and predicted values, i.e., \(min\lbrace \sum _{r=1}^{t} (\$(c_r) - \hat{\$}(c_r))^2 \rbrace , \text{~where~} B \in \mathbb {R}^{h+1}\) [23].
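To make Equation (1) concrete, here is a minimal sketch of fitting such an LR PM with ordinary least squares; it is not \(SPLConqueror\)'s implementation, and the feature encodings and measurements are hypothetical.

```python
import numpy as np

# Hypothetical data: each row of X encodes the x_i(c) terms (single features
# and a selected interaction) of one configuration c; y holds benchmarked $(c).
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 1],   # third column: 2-way interaction of features 1 and 2
              [0, 0, 0]], dtype=float)
y = np.array([12.3, 9.8, 15.1, 8.0])

# Prepend a column of ones for the intercept beta_0, then solve the
# least-squares problem min ||y - A.B||^2 for B = [beta_0 ... beta_h].
A = np.hstack([np.ones((X.shape[0], 1)), X])
B, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(x):
    """Estimated performance of a configuration encoded by its x_i terms."""
    return B[0] + B[1:] @ np.asarray(x, dtype=float)

print(B, predict([1, 1, 1]))
```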
Optimizer Complexity. All \(x_i(c)\) assume the value 0 or 1. When the constraints of a feature model are applied so that only legal configurations are examined, optimizing Equation (1) becomes an instance of 0-1 Linear Programming, which is NP-hard [16, 122]. Although this result is specific to LR PMs, any comparable formulation will not alter this complexity.
Conclusion: An optimizer must solve an NP-hard problem to find \(c_{best}\).
Workload and Environment Fragility. A workload is a set of tasks that are to be executed by a program. A benchmark measures one or more performance metrics (build size, completion time, maximum memory footprint, etc.) of a program when executing a workload. All ML PMs known to us are created with a fixed workload. It is well known that changing the workload alters \(c_{best}\); the same holds for changes in execution environment [6, 7, 15, 29, 128, 130].
Conclusion: A PM may need to be relearned if its workload or environment changes. More on this in Section 8.2.

2.2 PM Answers to Central Questions

Answers to Section 1.1 questions for contemporary PM research are:
(1)
A PM “fits” a line or curve through a set of observations \(\lbrace \,(c_r,\$(c_r))\,\rbrace _{r=1..t}\). Prediction errors are unavoidable, although errors are minimized.
(2)
Unless an SPL configuration space is enumerated and benchmarked, it is unknown how close a \(c_{no}\) is to \(c_{best}\). Of course, enumeration is impractical or impossible in most circumstances.
(3)
The sample size to use depends on the learning algorithm (see \(SPLConqueror\) in Section 5.1), although there are rules of thumb: Let f be the number of SPL features. Start with a sample size f, build a model, and compute its accuracy α. If α is too low, repeat the process with progressively larger samples (f, 2\(\cdot\)f, 3\(\cdot\)f, ...) until a budget of configurations is exhausted or an acceptable accuracy is reached [42, 46].

3 Results On Simple Random Searching

3.1 Performance Configuration Space Graphs

Imagine it is possible to benchmark every \(c\!\in \!\mathbb {C}\), where \(\$(c)\) is \(c\)’s measured performance. Small \(\$\) is good (efficient) and large \(\$\) is bad (inefficient). Sort all \((c,\$(c))\) pairs in increasing \(\$(c)\) order and plot them equally spaced along the \(X\)-axis. The result is a Performance Configuration Space (PCS) graph. A normalized PCS graph normalizes the \(X\)-axis to the unit interval [0..1], where \(c_{best}\) = 0 and \(c_{worst}\) = 1. The \(Y\)-axis is similarly normalized, where \(\$(c_{best})\) = 0 and \(\$(c_{worst})\) = 1. See Figure 3 [91].
Fig. 3. A PCS graph.
All SPLs have finite (perhaps colossal) configuration spaces. Consequently, their PCS graphs are discrete, discontinuous, and stair-stepped like Figure 3, because consecutive configurations along the X-axis encode discrete decisions/features that make discontinuous jumps in performance [78]. Further, every PCS graph is monotonically non-decreasing; consecutive configurations along the X-axis, like \(c_{i}\) and \(c_{i+1}\), satisfy \(\$(c_{i})\) \(\le\) \(\$(c_{i+1})\), as some features have no impact on performance.
Random search algorithms are well suited for non-differentiable and discontinuous functions, like PCS graphs.

3.2 Simple Random Search

URS requires every configuration to have equal probability \(\frac{1}{|\mathbb {C}|}\) to be selected. Given that \(|\mathbb {C}|\) is colossal, we can approximate a discrete distribution with the continuous distribution Uniform(0,1):
\begin{equation} \lim _{|\mathbb {C}|\rightarrow \infty } \frac{1}{|\mathbb {C}|} \cdot \Big [~ 1 ~..~ |\mathbb {C}|~\Big ] = \lim _{|\mathbb {C}|\rightarrow \infty } \left[~ \frac{1}{|\mathbb {C}|} ~..~ \frac{|\mathbb {C}|}{|\mathbb {C}|} ~\right] = [0..1]{} . \end{equation}
(2)
The Simple Random Search (SRS) algorithm uniformly selects \(n\) configurations from \(\mathbb {C}\), i.e., \(n\) points from [0..1]. On average, \(n\) points partition [0..1] into \(n\)\(+\)1 equal-length segments. The \(k^{th}\)-best configuration out of \(n\), denoted \(c_{k,n}\), has expected rank \(\frac{k}{n+1}\). The \(k\)\(\cdot\)\(\binom{n}{k}\) term in Equation (3) is a normalization constant [10, 126]:
\begin{equation} c_{k,n} ~=~ k\cdot \binom{n}{k} \cdot \int _0^1 x^{k-1} \cdot (1-x)^{n-k} \cdot dx ~=~ \frac{k}{n+1} . \end{equation}
(3)
The expected rank of \(c_{no}\), i.e., its distance from \(c_{best}\), is
\begin{equation} c_{1,n} ~=~ \frac{1}{n+1} . \end{equation}
(4)
Let’s pause to appreciate this result. The lone axis of Figure 4 represents the \(X\)-axis of all PCS graphs. As the sample size n increases, \(c_{no}\) progressively moves closer to \(c_{best}\) at X = 0, Figure 4(a)\(\rightarrow\)4(c). If a sample size of 99 is used, \(c_{no}\) will be 1%, on average, from \(c_{best}\) in ranking along the X-axis.
Fig. 4. URS in action.
Note: Equations (3) and (4) do not reference \(|\mathbb {C}|\); \(|\mathbb {C}|\) disappeared when the limit was taken in Equation (2). This means Equations (3) and (4) predict \(c_{no}\) \(X\)-axis ranks for an infinite-sized configuration space. Only for tiny spaces, \(|\mathbb {C}|\) \(\le\) 1,000, will predictions by Equations (3) and (4) be low. See Appendix A.
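As a sanity check, the expected rank of Equation (4) can be reproduced by a small Monte Carlo simulation; this is only an illustrative sketch, not part of the article's tooling.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 99            # sample size, as in the 1% example above
trials = 100_000  # hypothetical number of repetitions

# Each trial mimics SRS on an effectively infinite space: draw n uniform
# ranks from [0..1] and keep the best (smallest), i.e., the rank of c_no.
best_rank = rng.random((trials, n)).min(axis=1)

print("empirical mean rank :", best_rank.mean())  # ~0.01
print("theoretical 1/(n+1) :", 1 / (n + 1))       # Equation (4)
```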
How accurate is the \(\frac{1}{n+1}\) estimate? Answer: We can compute \(v_{1,n}\), the second moment of \(c_{1,n}\), and then the standard deviation of \(c_{1,n}\) [16, 81]:
\begin{align} v_{1,n} &= 1\cdot \binom{n}{1}\cdot \!\int _0^1 x^2\cdot (1-x)^{n-1}\cdot dx = \frac{2}{(n+1)\cdot (n+2)} \end{align}
(5)
\begin{align} \sigma _{1,n} &=~~\sqrt { v_{1,n} - {c_{1,n}}^2 } =~~ \sqrt {\frac{2}{(n+1)\cdot (n+2)} - \left(\frac{1}{n+1} \right)^2 } . \end{align}
(6)
For large \(n\), Equation (6) converges to \(\sqrt {\tfrac{2}{n^2} - \tfrac{1}{n^2}} = \tfrac{1}{n}\), which matches \(c_{1,n} = \frac{1}{n+1} \approx \frac{1}{n}\) from Equation (4). Figure 5 shows the convergence rate:
\begin{equation} \%diff~=~100\cdot \left(\frac{c_{1,n}}{\sigma _{1,n}} -1\right) . \end{equation}
(7)
When \(n\) = 50, \(c_{1,n}\) is 2% larger than \(\sigma _{1,n}\). For \(n\) \(\ge\) 200, there is no practical difference between theoretical \(c_{1,n}\) and \(\sigma _{1,n}\) values; i.e., the standard deviation of \(c_{1,n}\) is small.
Fig. 5. Difference of \(\sigma _{1,n}\) and \(c_{1,n}\).
Readers may have noticed that our configuration ranking is along the X-axis, not the Y-axis. This is a percentile. In SPLO, the goal is to be in the smallest percentile: \(\le\)1% means “in the top 1 percentile.”
Conclusion: To find a \(c_{no}\) in a colossal product space, SRS takes a uniform sample of size \(n\), builds and benchmarks each configuration, and returns the best-performing configuration, \(c_{no}\), which on average is in the top \(\frac{100}{n+1}\) percentile of all products, with a standard deviation of \(\frac{100}{n+1}\) percentile points.
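The SRS procedure itself is tiny. A minimal sketch follows, assuming the caller supplies a uniform sampler (e.g., a \(BDDSampler\) wrapper, Section 3.3) and a build-and-benchmark routine; both names are placeholders.

```python
def simple_random_search(sample_uniform, build_and_benchmark, n):
    """SRS sketch: uniformly sample n configurations, benchmark each,
    and return the best performer with its measured cost."""
    best_config, best_cost = None, float("inf")
    for _ in range(n):
        c = sample_uniform()           # one uniform draw from the product space
        cost = build_and_benchmark(c)  # expensive step: build + run workload
        if cost < best_cost:
            best_config, best_cost = c, cost
    return best_config, best_cost
```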

3.3 How to Uniformly Sample an SPL Configuration Space

Every SPL has a feature model \(\digamma\) that can be translated into a propositional formula \(ϕ\) [9, 13, 14]. A #SAT tool can count the number of solutions to \(ϕ\) efficiently [111]. The number of solutions of \(ϕ\) equals \(|\mathbb {C}|\). Let \(cfc\) be the client functionality constraints on \(ϕ\). The predicate for a user-constrained space is \(ϕ\)\(\wedge\)\(cfc\).
Algorithm 1 samples a configuration by assigning a Boolean value to each feature \(f_1, f_2, \ldots , f_\omega\) in \(\digamma\). First, \(f_1\) is randomly assigned according to its probability \(p_1\) = \(\frac{|\phi \wedge f_1|}{|\phi |}\) of being true in any configuration. Suppose \(f_1\) is assigned to false. Then, \(f_2\) is randomly assigned according to its probability \(p_2\) of being true in a configuration conditioned on \(f_1\)’s prior assignment: \(p_2\) = \(\frac{|\phi \wedge \lnot f_1 \wedge f_2|}{|\phi \wedge \lnot f_1|}\). This procedure advances until the last feature \(f_\omega\) is assigned, thereby completing a uniformly random configuration. A formal proof of Algorithm 1’s uniformity is given in Appendix B.
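A minimal sketch of this conditioning scheme is shown below; `count_models` is a hypothetical #SAT/BDD model-count oracle, not an actual API of \(BDDSampler\) or any specific tool.

```python
import random

def uniform_configuration(features, count_models):
    """Sketch of Algorithm 1. `count_models(assignment)` is assumed to return
    the number of configurations of the feature model consistent with the
    partial assignment (a dict from feature name to True/False)."""
    assignment = {}
    for f in features:
        total = count_models(assignment)
        with_f = count_models({**assignment, f: True})
        p = with_f / total                    # P(f is true | choices so far)
        assignment[f] = random.random() < p
    return assignment
```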
BDDSampler. A new tool, called \(BDDSampler\) [17, 50], implements an optimized version of Algorithm 1 [91]. \(BDDSampler\) is built on top of the \(CUDD\) [31] library for \(BDD\)s and is remarkably fast, even for colossal spaces. The last column in Table 2 shows the time \(BDDSampler\) needed to sample 1,000 configurations with replacement for different SPLs, averaged over 100 executions. The third column in Table 2 lists the \(BDD\) synthesis times for feature models by the procedure of [35].
Table 2. \(BDDSampler\) Sampling Time for 1,000 Configurations
SPL | \(|\mathbb {C}|\) | Synthesis (secs) | Sampling (secs)
JHipster 3.1.6 | 2.6\(\cdot\)\(10^{4}\) | 0.01 | 0.04
DellSPLOT | 7.4\(\cdot\)\(10^{6}\) | 0.29 | 0.08
Fiasco 2014092821 | 5.1\(\cdot\)\(10^{9}\) | 0.14 | 0.07
axTLS 1.5.3 | 3.9\(\cdot\)\(10^{12}\) | 0.05 | 0.04
ToyBox 0.5.2 | 1.5\(\cdot\)\(10^{17}\) | 0.02 | 0.25
uClibc 20150420 | 7.5\(\cdot\)\(10^{50}\) | 0.41 | 0.14
BusyBox 1.23.2 | 7.4\(\cdot\)\(10^{146}\) | 0.62 | 0.26
EmbToolkit 1.7.0 | 4.0\(\cdot\)\(10^{334}\) | 4304.68 | 2.61
LargeAutomotive | 5.3\(\cdot\)\(10^{1441}\) | 21.50 | 12.07
SRS requires (i) building a \(BDD\) structure, (ii) sampling configurations, (iii) building products, and (iv) benchmarking products. Actions (i)–(ii) can be done relatively quickly, but (iii)–(iv) are computationally expensive, and that is why minimizing the sample size is critical to both SPLO and ML performance [79]. For example, sampling all 26,256 configurations of \(JHipster\) with \(BDDSampler\) took 4.48 seconds. However, building and benchmarking all 26,256 configurations took 4,376 hours of CPU time (approximately 182 days, or about 10 minutes per build-and-benchmark) and needed 5.2 terabytes of disk on the INRIA supercomputer Grid’5000 [48].

3.4 What Sample Size to Use?

A basic question for any SPLO sampling method is: What sample size is needed to find a near-optimal solution for a given accuracy? As rigorous analyses are usually not given by the authors of proposed non-URS methods (e.g., [30, 34, 40, 42, 49, 82]), this question may have no answer for them. URS does have an answer. Let ρ be the desired percentile of accuracy (e.g., top 1% sets ρ = .01). Each selected configuration is a Bernoulli trial. The confidence/probability ¢ that a uniform sample of size \(n\) returns a \(c_{no}\) in the top ρ accuracy is shown in Equation (8):
\begin{equation} \text{¢} ~=~ 1-(1-\rho)^n . \end{equation}
(8)
Solving for n yields Equation (9):
\begin{equation} n ~=~ \frac{\ln (1-\text{¢})}{\ln (1-\rho)} . \end{equation}
(9)
Table 3 lists the sample size that achieves a given confidence (¢) and accuracy (ρ) for an infinite-sized space. Example: A configuration in the top 2% of \(\mathbb {C}\) with 95% confidence is returned when \(n\) = 148.
Table 3. Sample Size \(n\) Given ¢ and ρ
Other tables can be derived from Equation (8) for accuracy (ρ) and confidence (¢). Table 4(a) says a budget of 100 samples and 95% confidence returns a configuration in the top 2.95% of all solutions.
Table 4. Tables for Expected Accuracy and Confidence
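Equations (8) and (9) are easy to compute directly; the following illustrative helper (with confidence ¢ and accuracy ρ as fractions) reproduces the examples above. It is a sketch, not the script used to generate Tables 3 and 4.

```python
import math

def sample_size(conf, rho):
    """Equation (9): n = ln(1 - ¢) / ln(1 - ρ)."""
    return math.log(1 - conf) / math.log(1 - rho)

def confidence(n, rho):
    """Equation (8): probability a uniform sample of size n hits the top ρ."""
    return 1 - (1 - rho) ** n

def accuracy(n, conf):
    """Equation (8) solved for ρ: accuracy reachable with n samples."""
    return 1 - (1 - conf) ** (1 / n)

print(round(sample_size(0.95, 0.02)))        # ~148, the example above
print(round(100 * accuracy(100, 0.95), 2))   # 2.95, the Table 4(a) example
```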

3.5 Why URS Is Important

What is the mean value \(\mu\) of configuration space property \(\lambda\)? Answer: Take a uniform sample of size \(n\) and benchmark each configuration to obtain its \(\lambda\) value. Then compute the mean \(\overline{\mu }\) and standard deviation \(s\) of sampled \(\lambda\) values. By the Central Limit Theorem (CLT) [114], the true population mean \(\mu\) is contained in the following confidence interval:
\begin{equation} \left(\overline{\mu }\,-\,t\!\cdot \! \frac{s}{\sqrt {n}} \right) ~~ \le ~~ \mu ~~ \le ~~ \left(\overline{\mu }\,+\,t\!\cdot \! \frac{s}{\sqrt {n}} \right) , \end{equation}
(10)
where \(t\) is determined from Student’s \(t\)-distribution given a desired confidence level ¢ and sample size \(n\). Table 5 lists \(t\) values for some combinations of ¢ and \(n\) [114]. Note: A precondition of CLT and Equation (10) is that samples are uniform.
Table 5. \(t\)-values Given ¢ and \(n\)
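A sketch of how Equation (10) could be evaluated with SciPy is given below; the sampled values are hypothetical.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(samples, conf=0.95):
    """Equation (10): two-sided confidence interval for the population mean
    of a property measured on a *uniform* sample of configurations."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    mean, s = samples.mean(), samples.std(ddof=1)
    t = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)   # Student's t quantile
    half_width = t * s / np.sqrt(n)
    return mean - half_width, mean + half_width

# Hypothetical example: number of features present in 50 sampled configurations.
lo, hi = mean_confidence_interval(np.random.default_rng(1).integers(40, 60, 50))
print(lo, hi)
```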
Example. Let μ be the average number of features that are present in a configuration. Figure 6 plots μ estimates for two SPLs with different sampling methods and sample sizes. The \(X\)-axis is \(n\), the sample size, and the \(Y\)-axis is μ estimates. The straight line (\(\boldsymbol {-\!-}\)) indicates the correct μ, as these SPLs are small enough to enumerate and compute the correct answer. The dashed lines indicate the 95% confidence envelope for each μ estimate, Equation (10). Estimates by URS have their own marker. Sampling methods \(\blacktriangle\) and \(\Diamond\) are proposed as alternative methods to URS: \(\blacktriangle\) is QuickSampler [34] and \(\Diamond\) is DDbS [62]. Observe:
Fig. 6. Why URS is important.
All three \(\mu\) estimates converge to an answer with increasing \(n\).
URS correctly estimates \(\mu\) with increasing accuracy; other methods converge to different incorrect answers.
Method \(\diamond\) selects different sample sets each time in Figure 6(b), but oddly the same number of features occurs in all samples. Thus, the estimate by \(\diamond\) is suspicious as it lacks variability.
Population statistics (like μ) can be predicted by probability analyses. URS can confirm the correctness of these predictions.
When analytical predictions are unavailable, URS can estimate population statistics that a correct analysis would return.

3.6 PCS Graphs of Enumerable and Non-enumerable SPLs

What do real PCS graphs look like? This is not a fundamental question, but one asked out of curiosity. Several small SPLs were enumerated and benchmarked by Siegmund et al. [105, 106], which took months to complete. From their data, we computed their unnormalized PCS graphs, Figure 7.
Fig. 7. Complete PCS graphs for enumerable SPLs—raw data by Siegmund [105, 106].
Apache is an open-source Web server [8]. With nine features and 192 configurations, the maximum server load size was measured through autobench and httperf.
LLVM is a compiler infrastructure in C++ [75]. With 11 features and 1,024 configurations, test suite compilation times were measured.
H264 is a video encoder library for H.264/MPEG-4 AVC format written in \(C\) [45]. With 16 features and 1,152 configurations, Sintel trailer encoding times were measured.
BerkeleyDBC is an embedded database system written in C [19]. With 18 features and 2,560 configurations, benchmark response times were measured.
A complete PCS graph plots every point in \(\mathbb {C}\); this is possible when an SPL configuration space is enumerable. But what about spaces that are too large to enumerate? A number of techniques were tried, and the simplest worked best:
(1)
Take a uniform sample of size n = 100 or n = 200 as this (to us) yields a minimal fidelity PCS graph.
(2)
For each configuration \(c\), build and benchmark it to measure \(\$(c)\).
(3)
Sort the \((c,\$(c))\) tuples from best performing to worst.
(4)
Let \(y_i\) be the \(i^{th}\) best performance. Plot a PCS graph using these points \(\lbrace (\frac{i}{n+1}, y_i) \rbrace _{i=1}^n\).
Example. uClibc-ng is a C library for embedded Linux systems with 269 features and \(|\mathbb {C}|\) = \(\sim\)8 \(\times\) \(10^{26}\) [90]. A minimum fidelity (\(n\) = 200) PCS graph of uClibc-ng is Figure 8. Build size was measured.
Fig. 8. uClibc-ng PCS graph.
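The four-step procedure above amounts to sorting measured costs and plotting them at their estimated percentile ranks; a minimal matplotlib sketch (with hypothetical build-size data) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

def estimated_pcs_graph(costs):
    """Minimal-fidelity PCS graph from the benchmarked costs of a uniform sample."""
    y = np.sort(np.asarray(costs, dtype=float))   # step (3): best to worst
    n = y.size
    x = np.arange(1, n + 1) / (n + 1)             # step (4): estimated percentile ranks
    plt.step(x, y, where="post")
    plt.xlabel("estimated rank (fraction of space)")
    plt.ylabel("measured performance")
    plt.show()

# Hypothetical usage with 200 benchmarked build sizes:
# estimated_pcs_graph(build_sizes_of_uniform_sample)
```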

3.7 SRS Answers to Central Questions

Section 1.1 listed three questions; SRS offers elegant answers for each:
(1)
How does one find a \(c_{no}\) in an SPL configuration space? Answer: Take a uniform sample of size \(n\), benchmark each configuration, and return the best-performing configuration, \(c_{no}\).
(2)
How accurate (e.g., how near \(c_{best}\)) is the returned \(c_{no}\)? Answer: On average, the \(c_{no}\) is \(\frac{100}{n+1}\) percentiles from \(c_{best}\) with standard deviation of \(\frac{100}{n+1}\) percentiles.
(3)
What sample size should be used? Answer: Choose a desired accuracy and confidence for a \(c_{no}\), and use Table 3 to determine the sample size.

4 Recursive Random Search

We believe SRS offers a minimal performance bound for every SPLO algorithm, as more sophisticated algorithms and those that exploit domain-specific knowledge should perform better. In this section, we review another promising random search algorithm. There is no replacement for SRS yet; a true replacement would also need an SPLO statistical guarantee on the \(c_{no}\)s it returns.
Recursive Random Search. A \(c_{no}\) will be in the top \(\frac{1}{1+9}\) = 10% percentile using a uniform sample of size 9. Increasing the solution precision to the top \(\frac{1}{1+99}\) = 1% requires a sample size of 99, 11\(\times\) larger. Suppose from the first nine configurations feature \(f\) is inferred to be common to configurations in the top 10%. If the scope of the search is restricted to \((\phi \wedge f)\) and another uniform sample of size 9 is taken, a near-optimal solution would be within \(\frac{1}{1+9}\cdot \frac{1}{1+9}\) = \(\frac{1}{100}\) = 1%, for a total of 18 configurations, a 5.5\(\times\) improvement. This is Recursive Random Search (RRS).
Implementation. A rule of thumb for \(c_{best}\) is that it contains some of the top performance-enhancing features of an SPL [22]. We call such features noteworthy. The twist is that some features become noteworthy only in the presence of other noteworthy features.
Consider the PCS graph of LLVM, Figure 9(a). This graph is almost linear. Look how noteworthy features (\(f\) or \(\lnot f\)) present themselves in Figures 9(a)–(d), in order of most influential to next most influential, and so on, recursively restricting the next subspace to search.
Fig. 9. Stairs of LLVM.
We mechanized the noteworthy procedure by (1) qualifying features to consider, as not all are relevant; (2) checking selected features for compatibility; and (3) filtering remaining features based on their performance influence.
First, we found experimentally that examining only the features of the top configuration \(T_1\) was misleading—some noteworthy features of \(T_1\) do not belong to \(c_{best}\), and selecting them ensures RRS never reaches \(c_{best}\). Examining features shared by the top two configurations (\(T_1,T_2\)) was less misleading. And examining shared features in the top three (\(T_1..T_3\)) configurations was too constraining, as important features may not be in all three configurations.
Second, let \(S\) be the features common to \(T_1\) and \(T_2\). A SAT solver was not needed to validate that features of \(S\) are compatible (meaning no feature(s) of \(S\) precludes another). All features of \(T_1\) are compatible, and so too is any subset. Any shared subset among \(T_1\) and \(T_2\) must also be compatible.
Third, let \(N\) configurations be uniformly sampled per recursion. For every sampled configuration \(c\), we know its features and its measured performance is \(\$(c)\). Now, what features of \(S\) are noteworthy? Answer: Consider each feature \(f\) \(\in\) \(S\). Compute the average performance \(\overline{\$}(f)\) of configurations sampled so far with feature \(f\) and the average performance \(\overline{\$}(\lnot f)\) of configurations without \(f\). Their difference \(\$\Delta (f)\) is the performance influence of f:
\begin{equation} \$\Delta (f) ~=~ \overline{\$}(f) - \overline{\$}(\lnot f) . \end{equation}
(11)
The sign of \(\$\Delta (f)\) indicates whether \(f\) improves (negative value) or degrades (positive value) average performance. Further, a t-test [38] checks whether \(\$\Delta (f)\) is statistically significant with 95% confidence; if significant, \(f\) is noteworthy, or else it is discarded. RRS is Algorithm 2.
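The following is a minimal sketch of this filter, not the article's Algorithm 2 itself; the `shared` and `samples` structures are hypothetical, and Welch's t-test from SciPy stands in for the t-test cited above.

```python
import numpy as np
from scipy import stats

def noteworthy_features(shared, samples):
    """Sketch of the noteworthy-feature filter (Eq. (11)). `shared` is the set
    S of features common to the top two configurations; `samples` is a list of
    (feature_set, cost) pairs benchmarked so far."""
    noteworthy = []
    for f in shared:
        with_f = [cost for feats, cost in samples if f in feats]
        without_f = [cost for feats, cost in samples if f not in feats]
        if not with_f or not without_f:
            continue
        delta = np.mean(with_f) - np.mean(without_f)        # $\Delta(f)$
        _, p = stats.ttest_ind(with_f, without_f, equal_var=False)
        if p < 0.05:                                         # significant at 95%
            # negative delta suggests selecting f; positive suggests not-f
            noteworthy.append((f, delta))
    return sorted(noteworthy, key=lambda t: abs(t[1]), reverse=True)
```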
Comparison. The accuracy of SRS and RRS can be compared by experiments that compare the true average rank \(\overline{\mu }\) of returned solutions (computable for enumerable SPLs) with the theoretical accuracy of SRS for a sample size \(n\), \(\frac{1}{n+1}\), a.k.a. Equation (4). The experiment uses:
\(N\) as the number of configurations per RRS recursion and
\(n\) as the total number of configurations taken by RRS.
Figure 10 plots averages of 100 experiments for different SPLs and different \(N\). While both \(\overline{\mu }\) and \(\frac{1}{n+1}\) decrease sharply with increasing \(N\), \(\overline{\mu }\) is on average better than \(\frac{1}{n+1}\).
Fig. 10. Comparison of SRS and RRS.
Key limitations of RRS are:
It is not always better than SRS when \(N\) (the number of configurations per recursion) is too small.
It lacks analyses like \(c_{no}\) rank prediction (Equation (4)) and confidence guarantees (Equation (8)).
A solution to these limitations is given in Section 7. The next sections evaluate SRS and RRS.

5 Evaluation Using Enumerable SPLs

SPL researchers used enumerable SPLs (\(|\mathbb {C}|\) \(\le\) 250K) as benchmarks with metrics for overall accuracy (MAPE, defined in Section 5.4) and, to a lesser extent, solution accuracy (average rank of returned \(c_{no}\)s) and reliability (standard deviation of returned \(c_{no}\)s) to compare different PM algorithms [42, 46, 47, 62, 88]. We adopt these guidelines.
Using the same sample size or smaller, an SPLO algorithm is more accurate than others if it finds better solutions (\(c_{no}\)s) and is more reliable than others if its solutions have a smaller standard deviation (\(\sigma\)). A higher \(\sigma\) means solutions vary more.
We ask the following research questions about SPLO algorithms:
RQ1: Which algorithm is the most accurate across selected SPLs?
RQ2: Which algorithm is the most reliable across selected SPLs?
RQ3: Are PM accuracy and PM solution accuracy correlated?

5.1 Evaluation Setup

Enumerated spaces allow us to (a) know the true PCS rank of a \(c_{no}\) and (b) compute the difference of a \(c_{no}\)’s true performance \(\$(c_{no})\) from a PM’s estimate \(\hat{\$}(c_{no})\). Taken from [105, 106], the SPLs are:
\(BerkeleyDBC\) is an embedded database system with 18 features and 2,560 configurations [19]. Benchmark response times were measured.
\(7z\) is a file archiver with 44 features and 68,640 configurations [2]. Compression times were measured.
\(VP9\) is a video encoder with 42 features and 216,000 configurations [116]. Video encoding times were measured. To our knowledge, \(VP9\) is the largest SPL that has been enumerated.
Each successive SPL in the above list has a configuration space that is \(\sim 10\times\) larger than its predecessor. Figure 11 shows their unnormalized PCS graphs.
Fig. 11. PCS graphs of selected enumerable SPLs.
We compare SRS and RRS with two PMs: \(SPLConqueror\) [107] and \(DeepPer\!f\) [46]. \(DeepPer\!f\) is a state-of-the-art deep sparse neural network that outperformed other major PMs in 2019, including CART [43], DECART [42], Fourier [93], and \(SPLConqueror\). We include \(SPLConqueror\) as it is the state of the art in LR PMs, using linear regression as described in Section 2.
Recall the purpose of a PM is to predict the performance of any configuration in \(\mathbb {C}\). It is not to find an optimal or near-optimal solution. That is the purpose of an optimizer. We explained in Section 2.1 that finding \(c_{best}\) by an optimizer is NP-hard. To discount this difficulty, we use a perfect optimizer that returns the optimal configuration according to its PM for free by using the PM to evaluate \(\min _{c\in \mathbb {C}} ~\hat{\$}(c)\). Of course, such an optimizer is impractical but can be emulated for enumerable configuration spaces. So the conclusions of this section favor PMs.
For SRS and \(DeepPer\!f\), we ran experiments with sample sizes 50, 100, 200, 500, and 1,000. \(DeepPer\!f\) asks for the sample size to use and the number of experiments; hyperparameters for its neural network are configured automatically. For RRS, we ran experiments with \(N\) \(\in\) \(\lbrace 15,20,30,50,100,200\rbrace\) configurations per recursion and summed the total number of configurations used after RRS terminates. Remember RRS does not perform well w.r.t. SRS when too few configurations per recursion are used. RRS has a minimum sample size (MinSS) whose value is revealed by experiments RQ1 and RQ2.
For \(SPLConqueror\), the settings of Kaltenecker et al. were used [62]. All five sampling methods of \(SPLConqueror\) were evaluated, each producing a distinct PM. Diversified Distance-Based Learning, which we label as S2, was reported to have the best prediction accuracy. For each sampling method, three different sample sizes were used, corresponding to \(t\)-way population sizes \(t\!\in \!\lbrace 1,2,3\rbrace\), although some \(SPLConqueror\) algorithms used additional configurations whose numbers we could not control but did report. See [62, 107] for more details.
Each experiment was repeated 100 times, and averages are reported. The number 100 was chosen so that our evaluations would finish in 2 weeks of compute time. Statistical significance tests are reported in Appendix C for those interested. Our source code and experimental data are available at https://doi.org/10.5281/zenodo.7485062.

5.2 RQ1: Which algorithm is the most accurate across selected SPLs?

Let \(n\) be the number of configurations benchmarked by a PM in an experiment; \(\overline{n}\) is the average over 100 experiments. Let \(\mu _x\) be the percentile rank of its \(c_{no}\)s, also averaged over 100 experiments. \(\mu _x\) = 5% means that the \(c_{no}\)s returned by a PM are in the top 5% (.05 percentile), on average, from \(c_{best}\).
The lines of Figure 12 connect \((\)\(\overline{n}\), \(\mu _x\)\()\) points of each PM. \(SPLConqueror\) has five lines, one for each sampling method. Figures 12(a)–(c) show the full results; Figures 12(d)–(f) show a top 5% (.05 percentile) magnified view. Tables (not graphics) for Figure 12 are in our Zenodo download.
Fig. 12. Average percentile rank (\(\mu _x\)) vs. average sample size (\(\overline{n}\)) by SPL and SPLO.
We found:
SRS and RRS exhibited the overall best performance.
When using \(\le\)20 configs/recursion, SRS dominates RRS in all but one point in \(7z\), Figure 12(e). When \(\ge\)30 is used, RRS dominates SRS for all SPLs. This discussion continues in RQ2.
When RRS uses \(\gt\)200 configurations total, it returns a \(c_{no}\) whose normalized rank is less than 0.2% on average, compared to the theoretical SRS \(c_{no}\) normalized rank of 0.5%, Equation (4).
The \(\mu _x\) of \(SPLConqueror\) PMs varied, depending on the sampling method and sample size. S2 dominated other \(SPLConqueror\) algorithms and outperformed RRS in BerkeleyDBC and 7z. No \(SPLConqueror\) algorithm outperformed SRS or RRS in VP9 (the largest SPL space).
\(DeepPer\!f\) under-performed SRS and RRS for all sample sizes and SPLs. \(DeepPer\!f\) dominated \(SPLConqueror\) on VP9 but under-performed it on BerkeleyDBC and 7z except on three points.
With respect to better \(c_{no}\) accuracy with larger sample sizes, we observed:
SRS and RRS steadily improved \(\mu _x\) values with increasing sample sizes in all SPLs. SRS and RRS produced the most consistent results.
More configurations did not ensure better \(c_{no}\)s for PMs. \(DeepPer\!f\) found progressively better \(c_{no}\)s as sample sizes increased to 500 but did not consistently improve \(c_{no}\)s afterward.
\(SPLConqueror\) PM \(c_{no}\)s varied considerably. Only two results, S2 and S3 in VP9, showed strictly improving \(c_{no}\)s with increasing sample size.
With respect to theoretical predictions:
Figure 13(a) shows that the \(\mu _x\) of SRS with sample size \(n\) matches Equations (4)–(6). The \(\mu _x\) of \(7z\) and \(VP9\) measurements are slightly lower than theoretical \(\mu _x\) as some configurations exhibit the same performance, but the rank that we assigned measured the number of configurations that have better performance. This possibility is evident in the flat shelf of configurations approaching the origin in the PCS graphs of these SPLs (Figure 11).
Fig. 13. SRS theoretical (\(\mu _x\), \(\sigma _x\)) and experimental (\(\mu _x\), \(\sigma _x\)).
Summarizing Figure 12:
RRS generally outperforms SRS, \(DeepPer\!f\), and \(SPLConqueror\) over a wide range of different sample sizes in different SPLs.
SRS and RRS \(c_{no}\)s progressively move toward the origin (\(c_{best}\)) of each PCS graph as sample sizes increase. \(DeepPer\!f\) \(c_{no}\)s plateau for BerkeleyDBC and 7z.
We consider SRS as a “minimal performance bound” for SPLOs, as it relies only on URS. \(DeepPerf\) failed to outperform SRS for all 45 plotted points in Figure 12. \(SPLConqueror\) failed to outperform SRS in 28-of-45 = 62% of plotted points. These results raise a general concern on the \(c_{no}\) accuracy of PMs.
\(SPLConqueror\) outperformed RRS in 12-of-45 = 27% of the data points in Figure 12. However, the sampling method and sample size that yielded these results were unknown before these experiments; it is not obvious a priori which \(SPLConqueror\) algorithm to use.
A perfect optimizer was used, which biases the results of this section toward PMs.
Conclusion: Sampling (esp. RRS) produced the best \(\mu _x\) solutions in these experiments.

5.3 RQ2: Which algorithm is the most reliable across selected SPLs?

The standard deviation \(\sigma _x\) of \(\mu _x\) measures the reliability of solutions returned by SPLO algorithms. The larger the \(\sigma _x\), the less stable or more variable the result; the smaller the \(\sigma _x\), the better.
The lines of Figure 14 connect \((\)\(\overline{n}\), \(\sigma _x\)\()\) points of each SPLO algorithm. Figures 14(a)–(c) are the full results; Figures 14(d)–(f) show a magnified top 5% (.05 percentile) view. Tables (not graphics) for Figure 14 are in our Zenodo download. We found:
Fig. 14. Average reliability (\(\sigma _x\)) vs. average sample size (\(\overline{n}\)) by SPL and SPLO.
SRS and RRS demonstrated consistently small \(\sigma _x\) below 1% for \(\overline{n}\) \(\ge\) 200 in all SPLs, matching the theoretical predictions of Figure 5. Further, the \(\sigma _x\) of SRS and RRS decreased steadily—well below 1%—as the sample size increased.
When using \(\ge\)30 configs/recursion, the \(\sigma _x\) of \(\mu _x\) is clearly lower for RRS than SRS (see Figures 14(b)–(c)). Henceforth, we use MinSS = 30 configurations per recursion unless otherwise specified.
\(DeepPer\!f\) reduced \(\sigma _x\) with increasing sample sizes up to 500, but not consistently over 500 (\(\sigma _x\) of \(7z\) increased >500). Further, \(DeepPer\!f\) has a significantly higher \(\sigma _x\) than SRS and RRS for all SPLs, doing no better than \(\sigma _x\) = 6%.
\(\sigma _x\) for \(SPLConqueror\) varies considerably. S3 and S5 had the lowest \(\sigma _x\) as it approached 0. The S3 and S5 PMs were created by samples from a SAT solver, which is known to be biased [62]. We conjecture these PMs returned similar solutions. In general, larger sample sizes did not consistently lower \(\sigma _x\) and \(SPLConqueror\) \(\sigma _x\)s were higher than those of SRS and RRS.
The \(\sigma _x\) of SRS matches the theoretical \(\sigma _x\) for Order Statistics, Figure 13(b). The SRS \(\sigma _x\)s for \(7z\) and \(VP9\) measurements are slightly lower than the theoretical values of Equations (4)–(6) for the reason given earlier.
Conclusion: Sampling (esp. RRS) produced the lowest \(\sigma _x\) values and was the most reliable in these experiments.
Additional Evidence. In [91], we compared a draft of RRS (here called RRS0) with two PMs, one by Sarkar et al. [101] and a precursor to \(SPLConqueror\) [106], on small SPLs explained earlier: Apache (\(|\mathbb {C}|\) = 192), LLVM (\(|\mathbb {C}|\) = 1,024), and H264 (\(|\mathbb {C}|\) = 1,152). SRS dominated these PMs on all SPLs, and RRS0 dominated SRS, consistent with results of this section.

5.4 RQ3: Are PM accuracy and PM solution accuracy correlated?

An implicit assumption in the SPL ML PM literature is “PM accuracy is correlated to PM solution accuracy” [42, 43, 106, 107, 109], which we call conjecture \(\mathbb {K}\). To quantify \(\mathbb {K}\), we use the Mean Absolute Percentage Error (MAPE), which is widely used as the overall measure of PM accuracy in the SPL literature [42, 46]. MAPE is the average absolute difference between \(c\)’s predicted performance \(\hat{\$}(c)\) and \(c\)’s benchmarked performance \(\$(c)\). For an enumerable space \(\mathbb {C}\):
\begin{equation} {\texttt {MAPE}} = \frac{100}{|\mathbb {C}|}\cdot \sum _{c \in \mathbb {C}}~~\frac{|\$(c) - \hat{\$}(c)|}{\$(c)} . \end{equation}
(12)
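For reference, Equation (12) is a one-liner; a sketch (with NumPy, over hypothetical measured/predicted arrays) follows.

```python
import numpy as np

def mape(measured, predicted):
    """Equation (12): mean absolute percentage error of a PM's predictions."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs(measured - predicted) / measured)

print(mape([10.0, 20.0, 40.0], [11.0, 18.0, 44.0]))  # 10.0 (%)
```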
The box-plots of Figure 15(a) summarize MAPE values for the PMs obtained with \(DeepPerf\) and \(SPLConqueror\). \(DeepPerf\) consistently produces more accurate and reliable PMs than \(SPLConqueror\) (i.e., the boxes are nearer to the \(X\)-axis and narrower, respectively).
Fig. 15. MAPE accuracy of DeepPerf and SPLConqueror (S1–S5).
However, \(DeepPerf\)’s predictions are not that good and worsen as \(|\mathbb {C}|\) increases. Figure 15(b) zooms MAPE values to \(DeepPerf\)’s scale. \(DeepPerf\)’s (a) accuracy decreases with increasing SPL size \(|\mathbb {C}|\), as the median values are 3.7% (BerkeleyDBC), 9.7% (7z), and 17.5% (VP9), and (b) reliability also reduces with increasing SPL size \(|\mathbb {C}|\), as the 25th and 75th percentiles are [2.6, 5.8] for BerkeleyDBC, [7.9, 19.0] for 7z, and [10.1, 44.0] for VP9. This suggests that although PMs for increasingly larger spaces can be created with small sample sizes, PM MAPE accuracy suffers.
The solution accuracy β of a PM is the rank of the \(c_{no}\) that it returns in an RQ1 experiment. (Again, 100 such experiments were done per [PM, SPL, sample size] triplet.) A (MAPE, β) pair can be defined for each PM per experiment. The scatter-plot in Figure 16 shows the (MAPE, β) pairs collected from all RQ1 experiments. Now consider conjecture \(\mathbb {K}\): if MAPE and β were ideally correlated, there would be a one-to-one relationship between MAPE and β values; the points would follow a clear pattern, being aligned on a straight line or a curve. And if they were positively correlated, low MAPE values would correspond to low βs. If this were the case, an optimizer should return better \(c_{no}\)s with lower MAPE values.
Fig. 16. PM accuracy (MAPE) and solution accuracy (β).
Figure 16 doesn’t show this: \(DeepPerf\) in BerkeleyDBC displays a wide range of βs for the same MAPE values (i.e., the points are vertically stacked). Inversely, for \(SPLConqueror\) S1 in VP9, PMs with very different MAPE values got roughly the same βs (i.e., the points are horizontally aligned at different heights).
Table 6 lists the dependency between MAPE and β estimated with Spearman’s \(\rho\), Kendall’s \(\tau\), Hoeffding’s \(D\) [53], and Distance Correlation (\(dCor\)) [110]. The magnitude of these measures shows the strength of the dependency. The higher the magnitude, the more dependent are MAPE and β. A correlation measure \(c\) can be interpreted as very weak if \(c\)\(\lt\)0.2, weak if 0.2 \(\le\) \(c\) \(\lt\) 0.4, moderate if 0.4 \(\le\) \(c\) \(\lt\) 0.6, strong if 0.6 \(\le\) \(c\) \(\lt\) 0.8, and very strong if \(c\) \(\ge\) 0.8.
Table 6. Correlation between MAPE and β for \(DeepPerf\) and \(SPLConqueror\)
Algorithm | Spearman’s \(\rho\) | Kendall’s \(\tau\) | Hoeffding’s \(D\) | dCor
DeepPerf | 0.122 | 0.082 | 0.337 | 0.165
SPLCon. S1 | 0.214 | 0.131 | 0.345 | 0.229
SPLCon. S2 | 0.450 | 0.330 | 0.379 | 0.293
SPLCon. S3 | 0.443 | 0.323 | 0.370 | 0.342
SPLCon. S4 | 0.336 | 0.229 | 0.361 | 0.272
SPLCon. S5 | 0.559 | 0.420 | 0.433 | 0.783
Note: The magnitude of \(\rho\), \(\tau\), and \(dCor\) goes from 0 (no dependency) to 1 (total dependency), while \(D\) ranges from \(-\)0.5 (no dependency) to 1 (total dependency). To facilitate its comparison with the other measures, \(D\) was rescaled to [0...1]. Also, \(\rho\) and \(\tau\) might have a negative sign if MAPE and β had an inverse relationship (β decreasing as MAPE increases), but this didn’t occur.
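As an illustration of how such dependency measures might be computed (this is not the article's analysis script), the rank correlations are available in SciPy; Hoeffding's D and distance correlation require extra packages and are omitted from this sketch.

```python
import numpy as np
from scipy import stats

def rank_dependency(mape_values, beta_values):
    """Sketch: quantify the MAPE-vs-β dependency with rank correlations."""
    x = np.asarray(mape_values, dtype=float)
    y = np.asarray(beta_values, dtype=float)
    rho, rho_p = stats.spearmanr(x, y)   # Spearman's rho and its p-value
    tau, tau_p = stats.kendalltau(x, y)  # Kendall's tau and its p-value
    return {"spearman_rho": rho, "kendall_tau": tau,
            "spearman_p": rho_p, "kendall_p": tau_p}
```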
Conclusion: An implicit assumption in the ML PM literature is that PM accuracy is correlated with PM solution accuracy. We found evidence to the contrary, as the correlation was weak in our experiments.

5.5 Threats to Validity

There are three confounding factors: SPLs, sample size, and graph shape.
SPLs. We considered three enumerable SPLs whose sizes were \(\sim 10\times\) larger than the next. Different performances might have resulted using other SPLs. However, SRS and RRS experimental results on an additional three enumerable SPLs in [91] (smaller than the SPLs used here) were consistent with this article’s results (see end of Section 5.3.)
Sample Size. A goal or motivation of prior work was to use the smallest sample sizes possible to get accurate predictions. (Performance was the reason given in Section 3.3.) We followed a standard evaluation procedure used in prior work to compare SRS and RRS with \(DeepPer\!f\) and \(SPLConqueror\) [46]. SRS and RRS consistently exhibited the smallest \(\mu _x\) and smallest \(\sigma _x\) of \(c_{no}\)s returned across all sample sizes and SPLs considered. It is possible with larger sample sizes that \(DeepPer\!f\) and \(SPLConqueror\) might have performed better.
Graph Shape. By chance, the three PCS graphs of enumerable SPLs, Figure 11, are convex; i.e., all have a gradual descent (from right to left) to the origin, \(c_{best}\). Concave PCS graphs are harder to optimize. We delay further discussion of this topic, as Section 6 presents examples and the impact of concavity on optimization.
Research on ML PMs continues to advance, and so too their accuracy. The statements and results presented in this article are state of the art as of 2023.

5.6 Summary

SRS and RRS consistently produced the lowest-ranked and most stable \(c_{no}\)s (i.e., smallest \(\mu _x\) and \(\sigma _x\)) across diverse enumerable SPLs of different sizes and sample sizes. We noticed in the ML PM literature an implicit assumption that a more accurate PM should produce more accurate \(c_{no}\)s, but the results of RQ1 and RQ2 suggested otherwise. Upon further investigation, we found the correlation of PM model accuracy is weak w.r.t. \(c_{no}\) (solution) accuracy. We again remind readers that we used a perfect optimizer to compute our PM results; an imperfect optimizer would be unlikely to improve PM performance.
We offer an explanation for these results. Learning a function \(PM\):\(\mathbb {C}\)\(\rightarrow\)\(\mathbb {R}\) that predicts the performance of every \(c\)\(\in\)\(\mathbb {C}\) with MAPE accuracy \(\le\)8%, when \(\mathbb {C}\) is colossal (\(\gg\)\(10^{10}\)) and sample sizes are \(\lt\)5K, is unbelievable. ML PMs should do better with larger sample sizes, but this will be expensive. In contrast, finding near-optimals using a sample size \(\lt\)300 that are within 1% of optimal with 95% confidence in infinite-sized spaces is doable with sampling (see Table 3). Look carefully at Figures 12–16 to see a recurring trend: as SPL size \(|\mathbb {C}|\) increases, performance graphs become progressively more wild, meaning that the average accuracy and stability of \(c_{no}\)s degrade with increasing \(|\mathbb {C}|\) for the same sample sizes. And we learned that greater overall PM accuracy does not necessarily lead to better near-optimals. We are not optimistic that small sample sizes can produce truly accurate PMs for large SPL spaces. It is asking too much. Others, prior to us, reached a similar conclusion [128, 130].
Conclusion: Random sampling is a better technology match for SPLO than ML PMs.

6 Evaluation of SRS and RRS On Kconfig SPLs

We evaluate SRS and RRS on two SPLs that, to our knowledge, have not been evaluated in prior PM work. Both use the \(Kconfig\) configuration tool [65]:
(1)
\(axTLS\) 2.1.4 is a client-server library with 94 features and 2\(\cdot\)\(10^{12}\) configurations [11].
(2)
\(ToyBox\) 0.7.5 is a Linux command line utilities package with 316 features and 1.4\(\cdot\)\(10^{81}\) configurations [112].
Both were benchmarked for their build size. Figure 17 shows their minimum fidelity PCS graphs. We ask:
Fig. 17. PCS graph estimates using 200 configurations.
RQ4: Does RRS outperform SRS in colossal configuration spaces?
Unlike the SPLs from Section 5, we cannot measure the precise \(X\)-axis rank of configurations or the value of \(c_{best}\) as both require enumeration. We can compare the true build size of solutions of SRS and RRS from the same SPL to determine the best \(c_{no}\).
We devised an experiment to address RQ4 so that it could be completed within 2 weeks:
Compare SRS and RRS with the same total number of samples \(n\) = \(\lbrace 100, 200, 300, 400, 500\rbrace\).
SRS samples \(n\) configurations and reports the minimum build size.
RRS samples MinSS = 30 configurations per recursion.
RRS terminates once the total number of configurations it uses reaches \(n\), when it finds no noteworthy feature, or when the constricted configuration subspace has fewer than 30 configurations, in which case that subspace is enumerated.
All experiments are repeated 25 times.
Figure 18 shows the results. Observations. SRS generally found progressively better solutions as \(n\) increased for both \(axTLS\) and \(ToyBox\); solutions for \(axTLS\) seemed to reach a fixed point when \(n\) > 200.
Fig. 18. Optimization of Kconfig SPLs using different sample sizes.
RRS terminated early for \(axTLS\) for \(n\) \(\in\) \(\lbrace 300,400,500\rbrace\), where the average number of configurations benchmarked was 210, 217, and 220. At termination, the last constricted space was so small that it was enumerated. The odd shape of Figure 18(a) is simply RRS repeatedly converging on a near-minimum build size after examining >200 configurations. Overall, RRS found \(c_{no}\)s with smaller build sizes than SRS. How Good Are These Results? In an abandoned experiment prior to RQ4, we uniformly sampled and benchmarked 46,250 configurations each from \(axTLS\) and \(ToyBox\) (included in our Zenodo download). We salvaged this work for RQ4 as a 46,250-point PCS graph, Figure 19. The best solution had the percentile rank of \(1 / (46,250{} + 1)\cdot 100\% = .22\%\) or 5-sigma (\(\le\).23%), a high level of resolution [98, 119]. We then overlaid the results of Figures 18 and 19 to produce Figure 20.
Fig. 19. Estimated PCS graphs using 46,250 configurations.
Fig. 20. Estimated PCS graphs, SRS and RRS results.
Figure 20 magnifies Figure 19 to the top-performing percentiles. \(PCS_{best}\) is the best-performing of all 46,250 configurations. The dashed black line is the \(PCS_{best}\) boundary. With increasing sample sizes, SRS solutions approach \(PCS_{best}\), as expected, but are never below \(PCS_{best}\). RRS solutions appear as separate markers; they are literally inside the .22% percentile, visually on the \(Y\)-axis of each PCS graph, usually below \(PCS_{best}\). Overall, RRS solutions are better than \(PCS_{best}\) once \(n\) \(\ge\) 200.
On the Shape of PCS Graphs. A key influence on SPLO is the shape of a PCS graph near the origin. There are two possibilities: a PCS graph is convex or concave, Figure 21. The PCS graphs of enumerable SPLs, Figure 11, are convex: they have a gradual slope to the origin where the performance (\(Y\)-axis) difference between the 5%, 1%, and .05% percentile \(c_{no}\)s may not be much, and stopping a search sooner than .05% might seem acceptable.
Fig. 21. Convex vs. concave.
Concave PCS graphs are different: they have a steep drop to the origin. Concave graphs are harder to optimize, simply because the next round of sampling or recursion might produce a noticeably better \(c_{no}\), so continued optimization is worth the effort. The 200-point PCS graph of \(ToyBox\) in Figure 17(b) suggests concavity, but not so for \(axTLS\) in Figure 17(a). However, their magnified (46,250-point) PCS graphs in Figure 20 confirm both are concave.
The dilemma is this: generally, you don’t know a priori whether a PCS graph is convex or concave near the origin until you look; there is no downside to continued searching if the allotment of configurations permits.
Conclusion: RRS finds better \(\mu _y\) solutions than SRS. This has been a consistent result from small through colossal SPLs in our experiments.

7 Fixed Budget SPLO: the Essential Problem

Sections 5 and 6 compared different SPLO algorithms by averaging experiments that sampled tens of thousands of configurations per SPL. This extravagance is unlikely to be common in practice.
Instead, users are more likely to have a fixed budget (a maximum allotment of configurations for benchmarking) because of limited time, limited costs, and so forth. The challenge is that no single SPLO algorithm will outperform all other algorithms for all SPLs or sample sizes. We know that RRS-is-always-better-than-SRS is false: there are cases in this article where SRS performs better than RRS. But if a statistical bound on the quality of a solution is needed, SRS is the only game in town.
Here is a solution to the fixed budget SPLO: run both SRS and RRS with the same number of configurations. Both are executed in steps of \(N\) configurations, where \(N\) \(\ge\) MinSS.
The first step samples \(N\) configurations using SRS. These samples are reused as the first \(N\) configurations of RRS. At this point, both SRS and RRS return the same “near optimal” configuration. In subsequent steps, SRS samples another \(N\) configurations from the entire space, while RRS samples a different set of \(N\) configurations from a noteworthy constricted space. This last step is repeated until the allocation is exhausted. The best \(c_{no}\) returned by SRS or RRS is chosen, along with the statistical guarantees of the SRS \(c_{no}\). SRS guarantees give a conservative bound on the goodness of the RRS \(c_{no}\).
Example. Consider a budget of 450 configurations. The first 50 are used by both SRS and RRS; 400 configurations remain. Then 200 configurations are allocated to both SRS and RRS and are consumed in four additional rounds of 50 configurations each. A total of 450 configurations is consumed.
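A minimal sketch of this lockstep procedure is given below; the `srs_sample`, `rrs_step`, and `benchmark` callables are placeholders for a uniform sampler, a noteworthy-subspace sampler, and a build-and-benchmark routine.

```python
def fixed_budget_splo(budget, step, srs_sample, rrs_step, benchmark):
    """Sketch of the fixed-budget procedure: SRS and RRS run in lockstep,
    `step` (>= MinSS) configurations per round, until `budget` is exhausted."""
    shared = srs_sample(step)                       # first round is shared
    history = [(c, benchmark(c)) for c in shared]
    srs_results, rrs_results = list(history), list(history)
    spent = step
    while spent + 2 * step <= budget:
        srs_results += [(c, benchmark(c)) for c in srs_sample(step)]
        rrs_results += [(c, benchmark(c)) for c in rrs_step(step, rrs_results)]
        spent += 2 * step
    best_srs = min(srs_results, key=lambda t: t[1])
    best_rrs = min(rrs_results, key=lambda t: t[1])
    # The SRS statistical guarantee (Eq. (8)) is a conservative bound for both.
    return min(best_srs, best_rrs, key=lambda t: t[1])
```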
We repeated this three times; i.e., we conducted three identical experiments (\(R1\)–\(R3\)) whose results are not averaged. Figure 22 shows all three experiments return essentially the same result. Each red dot on the \(Y\)-axis indicates the first-round results for both SRS and RRS for an experiment. Each red dot is attached to two lines: one for SRS and the other for RRS.
Fig. 22. Three examples of SRS and RRS using a fixed budget.
Table 7 tallies the results of (\(R1\)–\(R3\)). The \(c_{no}\) columns list the minimum build size found, and “Best Alg” lists the algorithm that produced the solution. \(\vartheta\)-Bound is a conservative theoretical bound on the goodness of the solution, derived from Equation (8) using 250 configurations with 95% confidence, and yields 1.2% accuracy.
Table 7. Results of Fixed Budget Experiments
Conclusion: A solution to the fixed budget SPLO problem gives the same number of configurations to SRS and RRS and takes the best solution of the two. The statistical quality of the SRS solution serves as a conservative bound for RRS.

8 Related and Future Work

8.1 Highly Configurable Systems

Highly Configurable Systems (HCSs) define a broader universe in which SPLs and SPLO reside. HCSs have configuration parameters that are real and/or binary variables called options or tuning knobs. Unlike SPLs, an HCS has no feature model.
A pipeline of \(t\) tools is an example. Each tool has \(k\) (e.g., command-line) options. Selecting any or all of the \(k\) options for a tool is possible, and selecting/deselecting an option for one tool has no effect on the selection or deselection of options of other tools. The configuration space size for this problem is precisely \(2^{k\cdot t}\). HCSO, the HCS counterpart to SPLO, finds values for each of the \(k\)\(\cdot\)\(t\) options that work best together for a given workload and environment [57].
Another example is a database system with \(w\) real-valued tuning knobs [7]. A space of \(\mathbb {R}^w\) option combinations must be explored; the setting of one knob may trigger adjustments of other knobs. A challenge is to create ML models to understand the causal functional relationships among knobs [57]. HCSO finds a \(w\)-tuple that achieves a near-optimal performance [128, 130].
Yet another example is the algorithm configuration or parameter tuning problem [54], where the parameters of an algorithm are configured to achieve the algorithm’s optimal performance for a given set of problem instances.
At a high abstraction level, HCSO and SPLO look alike. Unbeknownst to us, in 2003, Ye and Kalyanaraman developed an RRS-like algorithm (also named RRS) to search contour plots for minima in network parameter configurations [128]. They uniformly sampled an \(\mathbb {R}^2\) space and used performance rankings to identify the top “noteworthy” \(2D\) points. Their RRS then recursively drills down on the areas surrounding these points to find minima. As there are no features (as in SPLs), the mechanisms of their RRS algorithm differ from ours. They also discovered Equations (8) and (9) to guide their search and to choose sample sizes. Here again, their context and use of these equations differ from ours, but much is the same.
Figure 23, taken from [128], illustrates their approach: the \(2D\) contour is randomly sampled, the top-performing regions (three in this case, shown in blue) are “noteworthy,” and RRS explores the regions around these points.
Fig. 23.
Fig. 23. Contours explored randomly.
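For intuition, here is a minimal sketch (our own, not their implementation) of recursive random search over a continuous 2D function: uniformly sample a box, keep the best point seen, shrink the box around it, and repeat:

```python
import random

def rrs_2d(f, bounds, samples_per_round=50, rounds=5, shrink=0.5):
    """Recursive random search for a minimum of f over a 2D box.
    bounds = ((xlo, xhi), (ylo, yhi)); each round uniformly samples the
    current box, then recenters and shrinks the box around the best point."""
    (xlo, xhi), (ylo, yhi) = bounds
    best = None
    for _ in range(rounds):
        pts = [(random.uniform(xlo, xhi), random.uniform(ylo, yhi))
               for _ in range(samples_per_round)]
        cand = min(pts, key=lambda p: f(*p))
        if best is None or f(*cand) < f(*best):
            best = cand
        # Constrict the search region around the current best point.
        wx, wy = (xhi - xlo) * shrink / 2, (yhi - ylo) * shrink / 2
        xlo, xhi = best[0] - wx, best[0] + wx
        ylo, yhi = best[1] - wy, best[1] + wy
    return best, f(*best)

# Example: a bowl with its minimum at (1, -2).
print(rrs_2d(lambda x, y: (x - 1) ** 2 + (y + 2) ** 2, ((-10, 10), (-10, 10))))
```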
The core differences between HCSs and SPLs are:
HCSs have no feature model.
Our SRS algorithm provides statistical guarantees on the \(c_{no}\)s it returns. Order statistics are also useful and may be relevant to HCSO algorithms.
URS of SPL spaces is much harder as configurations are solutions to propositional formulas rather than points in continuous real \(2D\) or \(n\)-\(D\) HCS spaces.
The rest of this section focuses on related work in the SPL domain.

8.2 Relevant Results in ML PMs

Other PMs for SPLOs. Guo et al. encoded a PM as a Classification and Regression Tree (CART) [43]. Sarkar et al. extended [43] with “projective sampling,” a technique that checks how performance-estimation accuracy improves with more samples [101]. Later, Guo and Shi improved the efficiency of CART with resampling and automated parameter tuning techniques [44].
Zhang et al. used Fourier learning and incrementally sampled configurations until a PM achieved a desired accuracy [129]. Ha and Zhang combined Fourier learning with LASSO regression to improve the efficiency of learning Fourier coefficients for each feature [47]. Dorn et al. used probabilistic programming to derive a PM that captures the uncertainty from benchmarking configurations and reasoning with incomplete data [33]. Martin et al. compared different ML techniques and discovered that different methods work better for different SPLs and that feature selection techniques from ML can improve learning in general [80]. These papers were evaluated using relatively small SPLs with \(\le\)60 features and \(|\mathbb {C}|\) \(\le\) 250K. A survey of other PMs for configurable systems is in [1].
Scaling PMs. As of 2022, PMs of SPLs with \(|\mathbb {C}|\) > \(10^{6}\) are rare. When attempted, a non-URS sample is taken whose size ranged from 500 to 5,000 configurations, i.e., the size of an enumerated SPL in this article. Recently, PMs for Linux were created from 85K configurations [79]. Whether accurate PMs for colossal spaces can be learned from small samples (85K\(\ll\)\(\sim\)\(10^{4000}\)) and be optimized efficiently is an interesting question beyond the scope of our article.
Improving PM Accuracy. PMs are not very accurate [79, 115]. An SPL codebase can be carved into regions (methods or groups of methods) that have the same feature presence condition (i.e., a feature qualification that must be satisfied for the region to be present in a product). By using fixed workloads and selecting configurations that cover (almost) all execution paths per region, a PM for each region is created. These PMs are then composed to produce a composite PM with improved accuracy [115].
Transfer Learning. A PM is created with a fixed workload. Should the workload change, the PM may need to be relearned (Section 2). An alternative is transfer learning [59]. Let \(\hat{\$}\):\(\mathbb {C}\)\(\rightarrow\)\(\mathbb {R}\) be the performance estimation function of a PM for space \(\mathbb {C}\). A transfer function (TF) translates a \(\hat{\$}\) learned for workload \(w\) to another function \(\hat{\$}^{\prime }\) with a different workload \(w^{\prime }\). A linear TF, \(\hat{\$}^{\prime }(c)\) = \(\alpha \cdot \hat{\$}(c)+\beta\), is postulated, \(\forall c\)\(\in\)\(\mathbb {C}\). The values of constants \(\alpha\) and \(\beta\) are learned. Linear TFs work well for small workload distortions, but existing evidence suggests otherwise for greater distortions [59, 79].
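A minimal sketch of fitting such a linear TF by least squares, assuming a small set of configurations has been benchmarked under both workloads (the data and function name below are ours):

```python
import numpy as np

def fit_linear_tf(perf_w, perf_w_prime):
    """Fit performance_w'(c) ~= alpha * performance_w(c) + beta from a small
    set of configurations measured under both workloads w and w'."""
    alpha, beta = np.polyfit(np.asarray(perf_w), np.asarray(perf_w_prime), deg=1)
    return alpha, beta

# Toy data: the new workload roughly doubles every measurement plus overhead.
perf_w       = [10.0, 12.5, 20.0, 33.0, 41.5]
perf_w_prime = [21.0, 26.1, 40.8, 67.2, 83.9]
alpha, beta = fit_linear_tf(perf_w, perf_w_prime)
print(f"alpha={alpha:.2f}, beta={beta:.2f}")   # roughly alpha ~ 2, beta ~ 1
```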
A recent paper by Martin et al. [79] presents a heterogeneous transfer learning method (tEAMS) that works surprisingly well to evolve PMs of progressive releases of Linux. MAPE values for newly learned PMs are in the 8.2%–9.2% range. When using the same budget, tEAMS produces transferred PMs with MAPE values 5.6%–7.1%. However, MAPE values tend to degrade after multiple transfers.

8.3 Optimizers

Optimizers for SPLs. Optimizers in the SPLO literature have focused on multi-objective optimization using evolutionary algorithms [32, 68, 127], active learning [131], filtered Cartesian flattening [117, 118], and integer programming [127]. Two other tools known to us used PMs specifically to learn near-optimals: rank-learners [87] and FLASH [88].
Nair et al. observed experimentally that PM accuracy improves rapidly as more configurations are used to train them. A point is eventually reached where improvement stagnates, and it is wasteful to use additional configurations. The stagnation point can be detected by measuring whether the accuracy of a PM trained with additional configurations improves only minimally (e.g., by computing the MAPE difference between successive PMs). Nair et al. claim that comparing PMs’ ability to rank configurations, rather than their accuracy, is a better stopping criterion: it detects the stagnation point earlier while still returning good near-optimal configurations. Experiments show their rank-based stopping criterion sometimes saves configurations, but Figure 7 in [87] shows it produces slightly worse rankings than conventional non-ranked approaches.
FLASH is a follow-on paper by the same authors. It relies on Sequential Model-based Optimization [56], a broad generalization of RRS for HCSO. To optimize a performance metric, FLASH builds a CART model with an initial learning set \(L\) of benchmarked configurations. Then another set \(S\) of configurations is chosen, CART estimates the performance of each \(s\) \(\in\) \(S\), and the best-performing configuration, \(c_{no}\), from \(S\) is returned. This \(c_{no}\) is then benchmarked and added to \(L\), and this cycle repeats for a budgeted number of iterations. FLASH was evaluated on tiny (\(\lt\)6 options w. \(|\mathbb {C}|\)\(\lt\)\(4K\)) and small (\(\lt\)20 options w. \(|\mathbb {C}|\)\(\lt\)\(240K\)) HCSs.
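A minimal sketch of this FLASH-style loop, using scikit-learn's DecisionTreeRegressor as a stand-in for CART; the candidate pool and benchmark function are placeholders supplied by the caller:

```python
from sklearn.tree import DecisionTreeRegressor

def flash_like(candidates, benchmark, init=20, iterations=30):
    """Sequential model-based optimization in the style of FLASH:
    train CART on benchmarked configurations, let it pick the most
    promising unmeasured candidate, benchmark it, and repeat.
    `candidates` is a list of feature vectors; `benchmark` measures one."""
    measured = {i: benchmark(candidates[i]) for i in range(init)}
    for _ in range(iterations):
        X = [candidates[i] for i in measured]
        y = [measured[i] for i in measured]
        cart = DecisionTreeRegressor().fit(X, y)
        rest = [i for i in range(len(candidates)) if i not in measured]
        if not rest:
            break
        # Pick the candidate CART predicts to perform best (lower is better).
        preds = cart.predict([candidates[i] for i in rest])
        best_i = rest[min(range(len(rest)), key=lambda k: preds[k])]
        measured[best_i] = benchmark(candidates[best_i])
    best = min(measured, key=measured.get)
    return candidates[best], measured[best]
```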
In both of these papers, no statistical guarantees (within \(x\%\) of optimal with \(y\%\) confidence) are returned, which we feel is essential for optimization.
Domain-specific Optimizers. Exploiting domain-specific knowledge can lead to better \(c_{no}\)s. \(COZART\) [72] is a tool to find a Linux kernel configuration with minimum build size. With prior knowledge of which features are necessary for booting the Linux kernel, and knowing that build size decreases as features are deselected, \(COZART\) derives a configuration that selects the necessary features and excludes as many others as possible. \(COZART\) does not search for configurations, yet it finds a smaller configuration than sampling does.
Random Search Optimizers. Random Search is a family of numerical optimization algorithms for functions that are discontinuous and non-differentiable [18, 123]. SRS and RRS are examples. Nothing prevents SRS or RRS from being used as an optimizer for a PM: replace the component that builds and benchmarks a configuration \(c\) with a component that calls a PM to estimate \(c\)’s performance. The inaccuracy of PM predictions may, however, limit the utility of SRS’s statistical guarantees.

8.4 Sampling SPL Configurations

As late as 2020, it was believed that URS of non-enumerable SPL spaces was infeasible [62, 97]. Consequently, novel sampling algorithms were proposed as substitutes. Dutra et al. devised \(QuickSampler\), which randomly selects features to form a configuration and attempts to fix the configuration using a MaxSAT solver [34], a solver that tries to maximize the number of satisfied CNF clauses. Kaltenecker et al. introduced Diversified Distance-based Sampling (DDbS), which treats configurations as vectors and derives configurations with maximum difference among them [62]. MaxSAT (and thus \(QuickSampler\)) does not achieve URS and DDbS is not scalable [92]. Many more are cited in [1].
Some build tools offer their own sampling algorithm. \(Kconfig\) [65] has the \(conf\) tool [36], whose \(randconfig\) option generates random configurations, though not uniformly. \(randconfig\) assigns values to features in the order they appear in a \(Kconfig\) specification, so the valid values for the feature being examined may be constrained by the selection of prior features. Samples are therefore biased. Recently, another tool called \(KconfigSampler\) [37] supports hierarchical random sampling of the Linux kernel. This kind of sampling is not uniform but ensures that features at the same abstraction level in the \(Kconfig\) specification have the same probability of appearing in a random configuration. \(KconfigSampler\) is implemented as a net of interconnected \(BDD\)s.
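The bias is easy to see on a toy model with one constraint (our own illustration, not the \(conf\) tool). Suppose feature a requires feature b: the valid configurations are {}, {b}, and {a, b}, so URS selects a with probability 1/3, whereas order-based assignment selects a with probability 1/2:

```python
import random

def randconfig_style():
    """Assign features in order; constraint: a requires b."""
    a = random.random() < 0.5
    b = True if a else (random.random() < 0.5)   # b is forced once a is chosen
    return a, b

trials = 100_000
biased = sum(randconfig_style()[0] for _ in range(trials)) / trials
uniform = sum(random.choice([(0, 0), (0, 1), (1, 1)])[0]
              for _ in range(trials)) / trials
print(f"P(a) order-based ~ {biased:.2f} (1/2), uniform ~ {uniform:.2f} (1/3)")
```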
Other tools partition the solution space into cells as evenly as possible using universal hashing functions. Then, the tool selects one cell at random and generates a solution with a SAT solver. \(UniWit\) [26] was the first sampler to implement this idea; it guaranteed uniformity but had serious scalability limitations. Two later iterations of \(UniWit\), called \(Unigen\) [28] and Unigen2 [25], tried to improve scalability while keeping uniformity, without much success [51, 97]. The last \(UniWit\) iteration is UniGen3 [76], which finally sacrifices uniformity to provide scalability.
Other work achieved URS by counting solutions of a propositional formula \(ϕ\). Oh et al. were the first to experimentally demonstrate URS of large SPL spaces. They used a model counting \(BDD\) to count the exact number of solutions to \(ϕ\) and functionality-constrained versions of \(ϕ\) [91]. This work was later generalized with the \(Smarch\) tool, which uses #SAT and Algorithm 1, Section 3.3.
Three other samplers based on counting are \(Spur\) [5], \(KUS\) [102], and \(BDDSampler\) [50]. \(Spur\) relies on #SAT technology, \(KUS\) on a knowledge compilation structure called Deterministic Decomposable Negation Normal Form (d-DNNF), and \(BDDSampler\) on \(BDD\)s. The evaluation of Unigen2, \(Smarch\), \(Spur\), \(KUS\), and \(BDDSampler\) was reported in [50]; models varying in size (from 14 to 18,570 variables) and application domain (automotive industry, embedded systems, a laptop customization system, a web application generator, integrated circuits, etc.) were examined. Results showed that only \(BDDSampler\) currently provides both uniformity and scalability.

8.5 Feature Models and URS

Numerical Features. This article focused on binary {0,1} features to match classical SPL feature models [9, 14]. However, the Linux build tool \(Kconfig\) [64] has feature models with binary and numerical features (NFs). An NF is a numerical value within a bounded range, which can be approximated by an integer in a corresponding range. Bit-blasting is a technique to encode numerical values as bit vectors and arithmetic operations and constraints as propositional formulas [24]. This allows NF propositional formulas to be directly analyzed “as is” by both SRS and RRS [83, 84]. As \(DeepPer\!f\) and \(SPLConqueror\) can handle NFs natively, future work should compare how SRS and RRS perform w.r.t. \(DeepPer\!f\), \(SPLConqueror\), and FLASH on NF models.
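As a small illustration of the bit-blasting idea mentioned above (our own toy example, not the encoding of any particular tool), an NF n with range 0..7 can be represented by three Boolean features b2, b1, b0 with n = 4·b2 + 2·b1 + b0; the constraint n ≤ 5 then becomes the single clause (¬b2 ∨ ¬b1), which excludes exactly the values 6 and 7:

```python
from itertools import product

# n = 4*b2 + 2*b1 + b0; constraint n <= 5 encoded as the clause (not b2 or not b1)
ok = [(b2, b1, b0) for b2, b1, b0 in product([0, 1], repeat=3)
      if (not b2) or (not b1)]
print(sorted(4 * b2 + 2 * b1 + b0 for b2, b1, b0 in ok))  # [0, 1, 2, 3, 4, 5]
```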
Scalability of URS. Our analysis of URS, Equations (2)–(9), yields results for an infinite-sized configuration space. However, the best tools today [17] cannot analyze Linux, the largest known SPL, whose estimated size exceeds \(10^{2200}\). Extending today’s #SAT and \(BDD\) technologies to analyze Linux remains a challenge.
Dimension Reduction. Not all features contribute to performance; in fact, most features of an SPL do not. There are several ways in which irrelevant features can be identified and removed from ML PMs [4, 46, 107].
SRS and RRS do something similar: they ignore non-noteworthy features. In contrast, how performance-irrelevant features can be eliminated from a feature model’s propositional formula \(ϕ\) and still admit model counting is not obvious. If this could be done, it might solve the scalability problems that remain for URS, discussed above.
Tseitin’s Transformation. Not every translation of a feature model to a propositional formula \(ϕ\) and then to a CNF formula, \(ϕ\)\(^{cnf}\), can be used with a #SAT sampling tool. Some translations do not preserve the 1:1 correspondence between products and solutions of \(ϕ\), resulting in over-counting. Tseitin’s transformation is one of several transformations that preserve the required 1:1 correspondence for URS [113]. The check: if a translation of \(ϕ\) to \(ϕ\)\(^{cnf}\) adds no additional variables (features), then \(|\mathbb {C}|\) = |\(ϕ\)\(^{cnf}\)|. \(BDD\)s do not have this problem. See Appendix D for more details.
RRS vs. SRS. A perfect RRS would constrain \(\mathbb {C}\) in each recursive iteration by selecting a subspace that always contains \(c_{best}\). Currently, RRS uses a heuristic that chooses noteworthy features with the best contributing performance in a sample. This procedure works most of the time but, as we saw, not always. An open problem remains: is there an improved RRS algorithm or analysis that always selects a subspace containing \(c_{best}\) with a computable degree of confidence?

9 Conclusions

ML is an alluring way to explore PMs for SPLs. But the lack of a scalable way to uniformly sample the highly constrained spaces of colossal (\(\gg\)\(10^{10}\)) SPLs had two consequences: (1) considerable effort was spent on non-URS methods as substitutes for URS [1, 3, 25, 27, 34, 42, 49, 62, 65], yet properly evaluating their statistical behavior required a URS gold standard, and (2) most PMs were not adequately evaluated for scalability; SPLs with enumerable spaces (\(\le\)250K) were common until recently (e.g., [79]). In Section 3.3, we diminished these problems by showing how to uniformly sample colossal SPL configuration spaces as large as \(10^{1441}\).
An initial motivation for PMs was to find SPL \(c_{no}\)s for a given workload. Typical PMs required an optimizer to find a \(c_{no}\), but the only way to determine the quality of \(c_{no}\)s (e.g., how near they are to optimal) required enumerable SPLs. In Sections 3.1–3.2, we showed how order statistics with URS provided the needed statistical guarantee for colossal SPLs: a \(c_{no}\) is within \(x\%\) of optimal with \(y\%\) confidence. Further, given any two of (accuracy \(x\%\), confidence \(y\%\), sample size \(n\)) for a \(c_{no}\), the third is computed by an equation or found in a table.
Two random search algorithms that use URS were presented, SRS and RRS. Using enumerated SPLs in Section 5, we compared them to state-of-the-art PMs, \(DeepPer\!f\) (a sparse neural network) and \(SPLConqueror\) (linear regression), on \(c_{no}\) accuracy (average distance μ from optimal) and reliability (standard deviation of μ). Experiments showed that SRS dominated both PMs and that RRS dominated SRS. Further, a common belief in the PM literature is that “a more accurate PM produces a more accurate \(c_{no}\).” We found evidence to the contrary: PM accuracy was only weakly correlated with \(c_{no}\) accuracy.
In Section 6, we demonstrated the efficacy of RRS and SRS on two colossal SPLs: \(axTLS\) (\(|\mathbb {C}|\) = \(10^{12}\)) and \(ToyBox\) (\(|\mathbb {C}|\) = \(10^{81}\)). Sampling at most 500 configurations, RRS found \(c_{no}\)s that were inside the .22 percentile (or 5-sigma) of optimal for both SPLs. In Section 7, we presented a fixed-budget algorithm that gives the same sample size to both RRS and SRS: each computes its own \(c_{no}\), and the better one is returned along with the statistical guarantees of SRS, since RRS has no guarantees of its own.
Our work encourages further research on topics of substance: (1) generalize URS to numerical features, (2) compare PMs with SRS and RRS on numerical feature models, (3) use URS to determine how well PMs scale to colossal SPLs, and (4) improve URS scalability to the largest known SPL: the Linux kernel.

Footnotes

1
This article extends two prior publications: [91] from 2017 and [16] from 2021.
2
Knuth first sketched this algorithm in 2009 [66]. Batory reinvented it in 2016 unaware of his work. Oh was first to implement it with practical improvements: using Heule’s cube-and-conquer algorithm to find an efficient ordering of features to partition the space [52], caching #SAT computations to avoid repeated evaluations, replacing the remaining last \(g\) bits to assign when they are “don’t cares” with a random \(g\)-bit number, and (optionally) caching configurations to remove duplicates, thereby achieving sampling without replacement [91].
3
The \(BDD\)s of the SPLs in Table 2 are available at https://doi.org/10.5281/zenodo.4514919.
4
An Intel(R) Core(TM) i7-6700HQ, 2.60GHz, 16GB RAM, operating Linux Ubuntu 19.10 was used.
5
The tool used to synthesize the \(BDD\)s is available at https://github.com/davidfa71/Extending-Logic.
6
We did not enumerate all configurations but sampled them by running BDDSampler -norep 26256 JHipster.dddmp, which asks \(BDDSampler\) to generate 26,256 random configurations without replacement from a \(BDD\) that encodes the \(JHipster\) feature model.
7
From prior experiments [92], we knew that samples from these methods were not uniform and Equation (10) was not applicable. We wondered how they performed w.r.t. URS as they were proposed as URS substitutes.
8
See Section 6 for additional constraints on RRS termination, which samples \(\le\) \(N\) configurations on the last recursion.
9
A typical rule of thumb [38] states that the Central Limit Theorem holds whenever the sample size is \(\ge 30\). Accordingly, the distribution of the sample means is normal, and thus the performance contribution of each feature is estimated by subtracting means and using a t-test. However, when \(N\) \(\lt 30\), a more robust estimator and a non-parametric test are required; in particular, the feature’s performance is calculated by \(\Delta (f) = \mathrm{median}(\$(f)) - \mathrm{median}(\$(\lnot f))\), and statistical significance is checked with a Mann-Whitney U-test [77].
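For illustration, a minimal sketch of this small-sample estimator using SciPy; the performance numbers are placeholders:

```python
from statistics import median
from scipy.stats import mannwhitneyu

def feature_delta(perf_with_f, perf_without_f, alpha=0.05):
    """Estimate a feature's performance contribution with medians and test
    significance with a Mann-Whitney U-test (for samples smaller than 30)."""
    delta = median(perf_with_f) - median(perf_without_f)
    _, p = mannwhitneyu(perf_with_f, perf_without_f)
    return delta, p < alpha

print(feature_delta([12.1, 11.8, 13.0, 12.4], [9.9, 10.3, 10.1, 9.7]))
```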
10
S1 is Distanced Based, S2 is Diversified Distance Based, S3 is Solver Based, S4 is uniform sampling from an enumerated configuration space, and S5 is Randomized Solver Based [62].
11
SRS bettered \(SPLConqueror\) on 6 points of BerkeleyDBC, 7 in 7z, and 15 in VP9, for a total of 28 of 45 = 62%. Some \(SPLConqueror\) experiments used smaller or larger numbers of samples than the SRS experiments. For these cases, we used order statistics (1/(n+1)) to determine whether \(SPLConqueror\) performed better than SRS.
12
\(SPLConqueror\) outperformed SRS on 7 points of BerkeleyDBC, 5 in 7z, and 0 in VP9, for 12 of 45 = 27%.
13
A box-plot encodes the values of five percentiles [121]. The bottom of the thin vertical line is the 0th percentile (or lowest value); the top denotes the 100th percentile (or highest value). The horizontal line in the box denotes the median; the box extends downward to the 25th percentile boundary and upward to the 75th percentile boundary.

A |ℂ| > 1,000: When an Infinite Space can Approximate a Discrete Space

\(|\mathbb {C}|\) is big enough when Equations (4)–() are satisfied, i.e., when the mean (\(\overline{c_{1,n}}\)) and standard deviation (\(\overline{\sigma _{1,n}}\)) of many samples converge to their theoretical counterparts, \(c_{1,n}\) and \(\sigma _{1,n}\). Figure 24 shows the result of simulating 1,000 samples, with \(|\mathbb {C}|\) configurations each, for different values of \(|\mathbb {C}|\). For each \(|\mathbb {C}|\), there is one point representing the mean, \(\overline{c_{1,n}}\), in Figure 24(a), and one point for the standard deviation, \(\overline{\sigma _{1,n}}\), in Figure 24(b). Red lines show the theoretical \(c_{1,n}\) and \(\sigma _{1,n}\) counterparts.
Fig. 24.
Fig. 24. Minimal \(|\mathbb {C}|\) to satisfy the URS continuous approximation.
The vertical (blue) line of Figure 24 shows the approximation works well for \(|\mathbb {C}|\) = 1,024 (a tiny SPL), i.e., for SPLs with 10 unconstrained optional features. A more conservative estimate was \(|\mathbb {C}|\) > 2,000 in [91].
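For intuition only, the convergence check behind Figure 24 can be simulated in a few lines. This is our own illustration, not the exact simulation used for the figure, and it assumes the standard order-statistic formulas for the minimum of \(n\) uniform samples, \(E[\min] = 1/(n+1)\) and \(\sigma = \sqrt{n/((n+1)^2(n+2))}\):

```python
import random
import statistics

def min_percentile_stats(space_size, n, reps=1000):
    """Draw `reps` samples of size n (with replacement) from a discrete space of
    `space_size` equally spaced percentiles; return the mean and standard
    deviation of the sample minimum."""
    mins = []
    for _ in range(reps):
        sample = [random.randrange(1, space_size + 1) / space_size for _ in range(n)]
        mins.append(min(sample))
    return statistics.mean(mins), statistics.pstdev(mins)

n = 50
print("continuous:", 1 / (n + 1), (n / ((n + 1) ** 2 * (n + 2))) ** 0.5)
for size in (64, 1024, 65536):
    print(size, min_percentile_stats(size, n))   # approaches the continuous values
```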

B Proof of Uniformity of the URS Algorithm

Two kinds of probabilities need to be distinguished to prove the uniformity of Algorithm 1:
(1)
The probability \(P(c)\) that configuration \(c\) is sampled and
(2)
The probability \(P(f)\) that feature \(f\) belongs to a sampled configuration, i.e., \(P(f) = \frac{|\phi \wedge f|}{|\phi |}\).
Uniformity means that every configuration has the same chance to be sampled. According to the probability definition, \(\sum _{i=1}^{|\phi |}P(c_i)=1\). Hence, uniformity is satisfied whenever \(P(c)=\frac{1}{|\phi |}\) for any \(c\).
Algorithm 1 samples a configuration by incrementally assigning true or false to each of the \(\omega\) features in a feature model. In Equation (13), \(a_i\) stands for the value assigned to feature \(f_i\). Due to feature constraints, assignments depend on each other, and so feature values must be generated following the chain rule [124] to ensure the final configuration is valid, i.e., using feature conditional probabilities, Equation (14). In each iteration \(i\), the algorithm produces a random assignment \(a_i\) by taking into account the probabilities of the previous assignments \(a_1,a_2,\ldots ,a_{i-1}\) (Equation (15)). At the end, all features are assigned and \(|\phi \wedge a_1 \wedge a_2 \wedge \cdots \wedge a_\omega | = 1\), since a complete feature assignment corresponds to a unique configuration. As a result, the probability of sampling the configuration is \(P(c)=\frac{1}{|\phi |}\) (Equation (16)), which guarantees the sampling procedure is uniform.
\begin{align} P(c) & = P(a_1 \cap a_2 \cap a_3 \cap \cdots \cap a_\omega) \end{align}
(13)
\begin{align} & = P(a_1)\cdot P(a_2|a_1)\cdot P(a_3|a_1 \cap a_2)\cdot \cdots \cdot P(a_\omega |a_1 \cap a_2 \cap a_3 \cap \cdots \cap a_{\omega -1}) \end{align}
(14)
\begin{align} & = \frac{|\phi \wedge a_1|}{|\phi |}\cdot \frac{|\phi \wedge a_1 \wedge a_2|}{|\phi \wedge a_1|}\cdot \cdots \cdot \frac{|\phi \wedge a_1 \wedge a_2 \wedge \cdots \wedge a_\omega |}{|\phi \wedge a_1 \wedge a_2 \wedge \cdots \wedge a_{\omega -1}|} \end{align}
(15)
\begin{align} & = \frac{1}{|\phi |} \end{align}
(16)
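This chain-rule computation is what makes counting-based sampling uniform in practice. Below is a minimal sketch of the idea (it is not the \(Smarch\) implementation); count and conjoin are placeholder oracles standing in for a #SAT call and for fixing one feature's value in a formula:

```python
import random

def urs_one_configuration(phi, features, count, conjoin):
    """Sample one configuration uniformly from the solutions of phi.

    count(f)              -> number of solutions of formula f (a #SAT oracle),
    conjoin(f, feat, val) -> formula f with feature feat fixed to val.
    """
    config = {}
    for feature in features:
        total = count(phi)
        with_true = count(conjoin(phi, feature, True))
        # P(feature = True | earlier assignments) = |phi AND feature| / |phi|
        value = random.random() < with_true / total
        config[feature] = value
        phi = conjoin(phi, feature, value)
    return config
```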

C Statistical Significance

The results of Sections 5 and 6 were analyzed to test their statistical significance. As usual in science, the confidence level was set to 95%.
A result is said to be statistically significant when it is unlikely to have happened by chance. In our case, Sections 5 and 6 answer RQ1–RQ4 by analyzing a sample of SPLs (BerkeleyDBC, 7z, VP9, axTLS, and ToyBox). However, we could have accidentally selected a very particular set of SPLs that does not reflect the characteristics of the whole population of SPLs. Statistical significance means rejecting that possibility, thus supporting the generality of our results.

C.1 RQ1 and RQ2

An ANOVA test is the standard way to check if the differences among each algorithm’s \(c_{no}\)s in Section 5 were statistically significant [38]. However, our experiments violated ANOVA preconditions:
\(c_{no}\)s for each algorithm were not normally distributed. Table 8 summarizes Shapiro-Wilk tests [99] conducted per algorithm; as all p-values were \(\le\)0.05, normality was rejected.
Table 8.
Table 8. Shapiro-Wilk’s Normality Tests for ANOVA
The variance of \(c_{no}\)s returned by each algorithm was highly different. In particular, the Levene test [74] for variance homogeneity produced \(F=191.5\) and p-value \(\sim\)0. As p-value \(\le\) 0.05, variance homogeneity was rejected.
The Kruskal-Wallis test [70] was used as the non-parametric alternative to ANOVA. It yielded H = 3.050 and a p-value \(\sim\)0. As the p-value was \(\le\) 0.05, the test concluded that at least one of the algorithms achieved \(c_{no}\)s significantly different from those of at least one other algorithm.
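For reference, the same battery of tests can be run with standard SciPy calls; the lists below are placeholder data, not our measurements:

```python
from scipy.stats import shapiro, levene, kruskal

# Each list stands in for the c_no values returned by one algorithm.
results_by_algorithm = {
    "SRS": [10.2, 10.5, 10.1, 10.8, 10.4],
    "RRS": [9.8, 9.9, 9.7, 10.0, 9.8],
    "DeepPerf": [11.5, 12.0, 11.1, 12.4, 11.8],
}

for name, values in results_by_algorithm.items():
    print(name, "Shapiro-Wilk p =", shapiro(values).pvalue)     # normality check

groups = list(results_by_algorithm.values())
print("Levene p =", levene(*groups).pvalue)                     # variance homogeneity
print("Kruskal-Wallis p =", kruskal(*groups).pvalue)            # any group differs?
```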
To determine precisely for which algorithms the \(c_{no}\)s differ, all pairwise comparisons in Table 9 were performed following the method described in [104]. First, all \(c_{no}\)s were ranked (i.e., the smallest \(c_{no}\) scored a rank of 1, the second smallest one a rank of 2, and so on). Then, the mean of the \(c_{no}\) ranks was computed for each algorithm. The absolute value of the difference between the means of every pair of algorithms was calculated. These absolute values, called observed differences, were compared to thresholds, named critical differences and calculated from the number of experiments carried out per algorithm and the confidence level.
Table 9.
Comparison | Observed Difference | Critical Difference | Statistically Significant?
SRS vs. RRS | 1,938.758 | 293.0368 | yes
SRS vs. DeepPerf | 2,398.176 | 306.424 | yes
SRS vs. SPLCon. S1 | 2,094.880 | 350.495 | yes
SRS vs. SPLCon. S2 | 1,491.906 | 350.495 | yes
SRS vs. SPLCon. S3 | 522.209 | 350.495 | yes
SRS vs. SPLCon. S4 | 1,230.784 | 350.495 | yes
SRS vs. SPLCon. S5 | 811.818 | 350.495 | yes
RRS vs. DeepPerf | 4,336.934 | 293.037 | yes
RRS vs. SPLCon. S1 | 4,033.646 | 338.854 | yes
RRS vs. SPLCon. S2 | 3,430.664 | 338.854 | yes
RRS vs. SPLCon. S3 | 2,460.968 | 338.854 | yes
RRS vs. SPLCon. S4 | 3,169.542 | 338.854 | yes
RRS vs. SPLCon. S5 | 1,126.934 | 338.854 | yes
DeepPerf vs. SPLCon. S1 | 303.288 | 350.495 | no
DeepPerf vs. SPLCon. S2 | 906.270 | 350.495 | yes
DeepPerf vs. SPLCon. S3 | 1,875.966 | 350.495 | yes
DeepPerf vs. SPLCon. S4 | 1,167.391 | 350.495 | yes
DeepPerf vs. SPLCon. S5 | 3,209.994 | 350.495 | yes
SPLCon. S1 vs. SPLCon. S2 | 602.982 | 389.613 | yes
SPLCon. S1 vs. SPLCon. S3 | 1,572.678 | 389.613 | yes
SPLCon. S1 vs. SPLCon. S4 | 864.104 | 389.613 | yes
SPLCon. S1 vs. SPLCon. S5 | 2,906.707 | 389.613 | yes
SPLCon. S2 vs. SPLCon. S3 | 969.696 | 389.613 | yes
SPLCon. S2 vs. SPLCon. S4 | 261.122 | 389.613 | no
SPLCon. S2 vs. SPLCon. S5 | 2,303.724 | 389.613 | yes
SPLCon. S3 vs. SPLCon. S4 | 708.574 | 389.613 | yes
SPLCon. S3 vs. SPLCon. S5 | 1,334.028 | 389.613 | yes
SPLCon. S4 vs. SPLCon. S5 | 2,042.603 | 389.613 | yes
Table 9. Multiple Comparison Test
According to [104], observed differences should be considered statistically significant whenever they are greater than or equal to their corresponding critical differences. Therefore, all observed differences were statistically significant except when comparing \(DeepPer\!f\) to \(SPLConqueror\) S1, and \(SPLConqueror\) S2 to \(SPLConqueror\) S4.
To summarize:
The Kruskal-Wallis and multiple comparison tests support the statistical significance of RQ1 (Section 5.2).
The Levene test supports the statistical significance of RQ2 (Section 5.3).

C.2 RQ3

Table 10 summarizes the significance of the correlations between MAPE and β reported in Table 6 (Section 5.4). As all p-values were \(\le\) 0.05, all correlation measures were statistically significant.
Table 10.
Algorithm | Spearman’s \(\rho\) p-value | Kendall’s \(\tau\) p-value | Hoeffding’s \(D\) p-value | dCor p-value
DeepPerf | \(2.908\cdot 10^{-6}\) | \(3.198\cdot 10^{-6}\) | \(\sim\)0 | \(\sim\)0
SPLCon. S1 | \(8.591\cdot 10^{-11}\) | \(1.650\cdot 10^{-8}\) | \(\sim\)0 | \(\sim\)0
SPLCon. S2 | \(\sim\)0 | \(\sim\)0 | \(\sim\)0 | \(\sim\)0
SPLCon. S3 | \(\sim\)0 | \(\sim\)0 | \(\sim\)0 | \(\sim\)0
SPLCon. S4 | \(\sim\)0 | \(\sim\)0 | \(\sim\)0 | \(\sim\)0
SPLCon. S5 | \(\sim\)0 | \(\sim\)0 | \(\sim\)0 | \(\sim\)0
Table 10. Significance Tests of MAPE and β Correlation

C.3 RQ4

Analogous to Appendix C.1, a t-test would be the standard way [38] to check the significance of the SRS and RRS difference reported in Section 6; however, the experimental data violated t-test preconditions:
The build sizes of the configurations obtained with SRS and RRS were not normally distributed. Table 11 summarizes Shapiro-Wilk tests [99] conducted per algorithm; as all p-values were \(\le\)0.05, normality was rejected.
Table 11.
Algorithm | \(W\) | p-value
SRS | 0.829 | \(\sim\)0
RRS | 0.741 | \(\sim\)0
Table 11. Shapiro-Wilk’s Normality Tests for T-test
The build size variance for each algorithm was heterogeneous. The Levene test [74] produced \(F\) = 28.453 and p-value = \(1.46\cdot 10^{-7}\). As p-value \(\le\) 0.05, variance homogeneity was rejected.
The Mann-Whitney U-test [38], also known as the Wilcoxon rank-sum test, was used as the non-parametric alternative to the t-test. It yielded \(W\) = 46,043 and a p-value \(\sim\) 0. As the p-value was \(\le\) 0.05, the test concluded that the SRS and RRS difference was statistically significant.

D Propositional Formula ϕ to CNF Conversion

SAT and #SAT solvers require a Conjunctive Normal Form (CNF) formula as input [20, 120]. Transforming \(ϕ\) into a CNF formula \(ϕ\)\(^{cnf}\) is straightforward with rules of logical equivalence. But doing so may increase the number of clauses exponentially [120], and simplifying \(ϕ\)\(^{cnf}\) to reduce the number of clauses is nontrivial [63, 86].
To avoid this, Equisatisfiable Transformations (ETs) are used. Two formulas are equisatisfiable when one is satisfiable if and only if the other is [125]. ETs produce a CNF formula \(ϕ\)\(^{cnf}\) that is equisatisfiable to \(ϕ\) [113]. There are many ETs [58, 96, 113], not all of which are suitable for URS.
Consider: \(\text{$ϕ $} = (a \wedge b) \vee (c \wedge d)\). An ET from Plaisted and Greenbaum [96] introduces additional variables \(x_1\) and \(x_2\) for the clauses of \(ϕ\):
\(\text{$ϕ $$^{cnf}${}} = (x_1 {\vee }x_2) {\wedge }(\lnot x_1 {\vee }a) {\wedge }(\lnot x_1 {\vee }b) {\wedge }(\lnot x_2 {\vee }c) {\wedge }(\lnot x_2 {\vee }d)\).
Each row of Table 12 is a solution of both \(ϕ\) and \(ϕ\)\(^{cnf}\). The last solution of \(ϕ\) corresponds to three solutions of \(ϕ\)\(^{cnf}\). A problem for URS is exposed: using \(ϕ\)\(^{cnf}\) yields a biased sampling of \(ϕ\). Statistical predictions by URS of \(ϕ\)\(^{cnf}\) are distorted predictions about \(ϕ\):
Table 12.
Table 12. Solution Comparison between \(ϕ\) and \(ϕ\)\(^{cnf}\)
\(\mid\)\(ϕ\)\(^{cnf}\)\(\mid\) is 9 and \(\mid\)\(ϕ\)\(\mid\) is 7, a \(28\%\) over-estimation.
The percentage of products with feature \(d\) in \(ϕ\)\(^{cnf}\) is \(78\%\)\(~=~\)\(\frac{7}{9}\), whereas the correct answer in \(ϕ\) is \(71\%\)\(~=~\)\(\frac{5}{7}\), a \(10\%\) over-estimation.
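These counts are easy to verify by brute-force enumeration (our own check, enumerating all assignments of both formulas):

```python
from itertools import product

def phi(a, b, c, d):
    return (a and b) or (c and d)

def phi_cnf(a, b, c, d, x1, x2):
    return ((x1 or x2) and (not x1 or a) and (not x1 or b)
            and (not x2 or c) and (not x2 or d))

sols = [v for v in product([False, True], repeat=4) if phi(*v)]
sols_cnf = [v for v in product([False, True], repeat=6) if phi_cnf(*v)]
print(len(sols), len(sols_cnf))                      # 7 9
print(sum(v[3] for v in sols) / len(sols),           # P(d) in phi:     5/7 ~ 0.714
      sum(v[3] for v in sols_cnf) / len(sols_cnf))   # P(d) in phi_cnf: 7/9 ~ 0.778
```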
How do redundant solutions arise? We observed empirically that if \(ϕ\)\(^{cnf}\) adds no new variables to \(ϕ\) then all is OK: URS statistics about \(ϕ\)\(^{cnf}\) match \(ϕ\) because \(\mid\)\(ϕ\)\(^{cnf}\)\(\mid\)\(~=~\)\(\mid\)\(ϕ\)\(\mid\).
Adding variables might not be a problem. Tseitin’s transformation [113], a well-known ET method, adds variables but does not increase the number of solutions. Tseitin’s transformation extends the Plaisted and Greenbaum transformation with blocked clauses [71]. The elimination of blocked clauses [61], which is a SAT preprocessing technique used in top-tier solvers, removes those clauses and introduces redundant solutions.
The example of Table 12 shows there are bad ETs that both add variables and distort statistical predictions. The pragmatic problem is this: Given a feature-model-to-propositional-formula tool, you may not know whether the tool (or the #SAT solver that uses the propositional formula) employs bad ETs if extra variables are used.
We used the Kmax tool [39] in our work, which avoids translation controversies as it adds no extra variables in translating \(ϕ\) to \(ϕ\)\(^{cnf}\).
Also, Projected Model Counters (#\(\exists\)SAT) [12] can be used as an alternative to classical #SAT solvers to prevent miscounting. #\(\exists\)SAT counts the solutions of \(ϕ\)\(^{cnf}\) with respect to an input set of relevant variables, called projection variables. If all variables in \(ϕ\) are specified as projection variables, #\(\exists\)SAT will ignore any other auxiliary variables in \(ϕ\)\(^{cnf}\), thus computing the right count. Further, sampling with \(BDD\)s avoids ET problems, as \(BDD\)s don’t require the input formula to be in any particular form. This is an advantage of using \(BDD\)s.

Acknowledgments

We thank the referees for their help in improving this paper. We also thank Prof. Marijn Heule (CMU), Daniel-Jesus Munoz (U. of Malaga), Prof. Maggie Myers (UT Austin), and Prof. Norbert Siegmund (U. Leipzig).

References

[1]
2021. Learning software configuration spaces: A systematic literature review. JSS 182 (2021).
[3]
M. Acher et al. 2019. Learning Very Large Configuration Spaces: What Matters for Linux Kernel Sizes. Technical Report hal-02314830. Inria Rennes.
[4]
M. Acher et al. 2022. Feature subset selection for learning huge configuration spaces: The case of Linux kernel size. In SPLC.
[5]
D. Achlioptas, Z. S. Hammoudeh, and P. Theodoropoulos. 2018. Fast sampling of perfectly uniform satisfying assignments. In SAT.
[6]
S. Agrawal, S. Chaudhuri, and V. Narasayya. 2000. Automated selection of materialized views and indexes in SQL databases. In VLDB.
[7]
D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. 2017. Automatic database management system tuning through large-scale machine learning. In SIGMOD.
[8]
Apache Web Server. 2002. Apache HTTP Server Project. https://httpd.apache.org/
[9]
S. Apel, D. Batory, C. Kästner, and G. Saake. 2013. Feature-oriented Software Product Lines. Springer.
[10]
B. C. Arnold, N. Balakrisnhan, and H. N. Nagaraja. 2008. A First Course in Order Statistics. SIAM.
[12]
R. A. Aziz, G. Chu, C. J. Muise, and P. J. Stuckey. 2015. #\(\exists\)SAT: Projected model counting. In SAT.
[13]
D. Batory. 2005. Feature models, grammars, and propositional formulas. In SPLC.
[14]
D. Batory. 2021. Automated Software Design: Volume 1. Lulu Press.
[15]
D. Batory and C. Gotlieb. 1982. A unifying model of physical databases. ACM TODS (Dec. 1982).
[16]
D. Batory, J. Oh, R. Heradio, and D. Benavides. 2021. Product optimization in stepwise design. In Logic, Computation and Rigorous Methods, A. Raschke, E. Riccobene, and K. D. Schewe (Eds.). Springer.
[18]
J. Bergstra and Y. Bengio. 2012. Random search for hyper-parameter optimization. JMLR (2012).
[20]
A. Biere, M. Heule, H. Maaren, and T. Walsh. 2009. Handbook of Satisfiability: Volume 185: Frontiers in Artificial Intelligence and Applications. IOS Press.
[21]
A. Bluman. 2009. Elementary Statistics: A Step by Step Approach. McGraw-Hill Higher Education.
[22]
L. Breiman, J. Friedman, R. Olshen, and C. Stone. 1984. Classification and Regression Trees. Wadsworth and Brooks/Cole Advanced Books and Software.
[23]
P. Bruce, A. Bruce, and P. Gedeck. 2020. Practical Statistics. O’Reilly.
[24]
R. Bryant et al. 2007. Deciding bit-vector arithmetic with abstraction. In TACAS.
[25]
S. Chakraborty, D. J. Fremont, K. S. Meel, S. A. Seshia, and M. Y. Vardi. 2015. On parallel scalable uniform SAT witness generation. In TACAS.
[26]
S. Chakraborty, K. S. Meel, and M. Y. Vardi. 2013. A scalable and nearly uniform generator of SAT witnesses. In CAV.
[27]
S. Chakraborty, K. Meel, and M. Vardi. 2013. A scalable approximate model counter. In CP.
[28]
S. Chakraborty, K. S. Meel, and M. Y. Vardi. 2014. Balancing scalability and uniformity in SAT witness generator. In DAC.
[29]
S. Chaudhuri. 1998. An overview of query optimization in relational systems. In PODS.
[30]
T. Y. Chen, H. Leung, and I. K. Mak. 2004. Adaptive random testing. In ASIAN.
[32]
K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE TEVC (April 2002).
[33]
J. Dorn, S. Apel, and N. Siegmund. 2020. Mastering uncertainty in performance estimations of configurable software systems. In ASE.
[34]
R. Dutra, K. Laeufer, J. Bachrach, and K. Sen. 2018. Efficient sampling of SAT solutions for testing. In ICSE.
[35]
D. Fernandez-Amoros, S. Bra, E. Aranda-Escolastico, and R. Heradio. 2020. Using extended logical primitives for efficient BDD building. Mathematics 8, 8 (2020), 1–17. https://www.mdpi.com/2227-7390/8/8/1253
[36]
D. Fernandez-Amoros, R. Heradio, C. Mayr-Dorn, and A. Egyed. 2019. A Kconfig translation to logic with one-way validation system. In SPLC.
[37]
D. Fernandez-Amoros, R. Heradio, C. Mayr-Dorn, and A. Egyed. 2022. Scalable sampling of highly-configurable systems: Generating random instances of the Linux kernel. In ASE.
[38]
A. Field, J. Miles, and Zoë Field. 2012. Discovering Statistics Using R. SAGE Publications Ltd.
[39]
P. Gazillo, J. Oh, and M. Myers. 2019. Uniform Sampling from Kconfig Feature Models. Technical Report TR-19-02. University of Texas at Austin, Department of Computer Science.
[40]
V. Gogate and R. Dechter. 2006. A new algorithm for sampling CSP solutions uniformly at random. In CP.
[41]
J. Guo et al. 2017. SMTIBEA: A hybrid multi-objective optimization algorithm for configuring large constrained software product lines. SoSyM (2017), 1–20.
[42]
J. Guo et al. 2018. Data-efficient performance learning for configurable systems. Empir. Softw. Eng. 23, 3 (2018), 1826–1867.
[43]
J. Guo, K. Czarnecki, S. Apel, N. Siegmund, and A. Wasowski. 2013. Variability-aware performance prediction: A statistical learning approach. In ASE.
[44]
J. Guo and K. Shi. 2018. To preserve or not to preserve invalid solutions in search-based software engineering: A case study in software product lines. In ICSE.
[46]
H. Ha and H. Zhang. 2019. DeepPerf: Performance prediction for configurable software with deep sparse neural network. In ICSE.
[47]
H. Ha and H. Zhang. 2019. Performance-influence model for highly configurable software with Fourier learning and lasso regression. In ICSME.
[48]
A. Halin, A. Nuttinck, M. Acher, X. Devroey, G. Perrouin, and B. Baudry. 2019. Test them all, is it worth it? Assessing confguration sampling on the JHipster web development stack. Empir. Softw. Eng. (April 2019).
[49]
C. Henard, M. Papadakis, M. Harman, and Y. Traon. 2015. Combining multi-objective search and constraint solving for configuring large software product lines. In ICSE.
[50]
R. Heradio et al. 2022. Uniform and scalable sampling of highly configurable systems. Empir. Softw. Eng. 27, 2 (2022), 44.
[51]
R. Heradio, D. Fernandez-Amoros, J. A. Galindo, and D. Benavides. 2020. Uniform and scalable SAT-sampling for configurable systems. In SPLC.
[52]
M. J. Heule, O. Kullmann, S. Wieringa, and A. Biere. 2011. Cube and conquer: Guiding CDCL SAT solvers by lookaheads. In HVC.
[53]
W. Hoeffding. 1948. A non-parametric test of independence. Ann. Math. Stat. 19, 4 (1948), 546–557.
[54]
H. H. Hoos. 2012. Automated Algorithm Configuration and Parameter Tuning.
[55]
J. M. Horcas, M. Pinto, and L. Fuentes. 2018. Variability models for generating efficient configurations of functional quality attributes. Inf. Softw. Technol. 95 (2018), 147–164.
[56]
F. Hutter, H. H. Hoos, and K. Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In LION.
[57]
M. S. Iqbal, R. Krishna, M. A. Javidian, B. Ray, and P. Jamshidi. 2022. Unicorn: Reasoning about configurable system performance through the lens of causality. In EuroSys.
[58]
P. Jackson and D. Sheridan. 2004. Clause form conversions for Boolean circuits. In SAT.
[59]
P. Jamshidi et al. 2017. Transfer learning for performance modeling of configurable systems: An exploratory analysis. In ASE.
[60]
P. Jamshidi and G. Casale. 2016. An uncertainty-aware approach to optimal configuration of stream processing systems. In MASCOTS.
[61]
M. Järvisalo, A. Biere, and M. J. H. Heule. 2010. Blocked clause elimination. In TACAS.
[62]
C. Kaltenecker, A. Grebhahn, N. Siegmund, J. Guo, and S. Apel. 2019. Distance-based sampling of software configuration spaces. In ICSE.
[63]
M. Karnaugh. 1953. The map method for synthesis of combinational logic circuits. IEEE Commun. Electron. 72, 5 (1953).
[66]
D. E. Knuth. 2009. The Art of Computer Programming, Volume 4, Fascicle 1: Bitwise Tricks & Techniques, Binary Decision Diagrams. Addison-Wesley Professional.
[67]
S. Kolesnikov, N. Siegmund, C. Kästner, A. Grebhahn, and S. Apel. 2019. Tradeoffs in modeling performance of highly configurable software systems. SoSyM 18, 3 (2019), 2265–2283.
[68]
J. Krall, T. Menzies, and M. Davies. 2015. Gale: Geometric active learning for search-based software engineering. IEEE TSE (Oct. 2015).
[69]
S. Krieter. 2019. Enabling efficient automated configuration generation and management. In SPLC.
[70]
W. Kruskal. 1952. Use of ranks in one-criterion variance analysis. J. Amer. Statist. Assoc. 47, 260 (1952), 583–621.
[71]
O. Kullmann. 1999. On a generalization of extended resolution. Discrete Appl. Math. 96–97 (1999), 149–176.
[72]
H. Kuo, J. Chen, S. Mohan, and T. Xu. 2020. Set the configuration for the heart of the os: On the practicality of operating system kernel debloating. POMACS 4, 1 (2020), 1–27.
[73]
B. Lantz. 2019. Machine Learning with R. Packt Publishing.
[74]
H. Levene. 1960. Robust Tests for Equality of Variances. Stanford University Press.
[75]
LLVM. 2020. The LLVM Compiler Infrastructure. https://llvm.org/
[76]
S. Gocht, M. Soos, and K. S. Meel. 2020. Tinted, detached, and lazy CNF-XOR solving and its applications to counting and sampling. In CAV.
[78]
B. Marker, D. Batory, and R. Geijn. 2014. Understanding performance stairs: Elucidating heuristics. In ASE.
[79]
H. Martin et al. 2021. Transfer learning across variants and versions: The case of Linux kernel size. IEEE TSE (Sept. 2021).
[80]
H. Martin, M. Acher, J. A. Pereira, and J. Jézéquel. 2021. A comparison of performance specialization learning for configurable systems. In SPLC.
[82]
F. Medeiros, C. Kästner, M. Ribeiro, R. Gheyi, and S. Apel. 2016. A comparison of 10 sampling algorithms for configurable systems. In ICSE.
[83]
D. Munoz, J. Oh, M. Pinto, L. Fuentes, and D. Batory. 2019. Uniform random sampling product configurations of feature models that have numerical features. In SPLC.
[84]
D. Munoz, J. Oh, M. Pinto, L. Fuentes, and D. Batory. 2022. Nemo: A tool to transform feature models with numerical features and arithmetic constraints. In ICSR.
[85]
D. J. Munoz, M. Pinto, and L. Fuentes. 2018. Finding correlations of features affecting energy consumption and performance of web servers using the HADAS eco-assistant. Computing 100, 11 (Nov. 2018), 1155–1173.
[86]
S. Muroga. 1979. Logic Design and Switching Theory. John Wiley & Sons, Inc.
[87]
V. Nair, T. Menzies, N. Siegmund, and S. Apel. 2017. Using bad learners to find good configurations. In FSE.
[88]
V. Nair, Z. Yu, T. Menzies, N. Siegmund, and S. Apel. 2020. Finding faster configurations using FLASH. IEEE TSE (July 2020).
[89]
A. Nöhrer and A. Egyed. 2013. C2O configurator: A tool for guided decision-making. In ASE.
[90]
J. Oh. 2022. Finding Near-optimal Configurations in Colossal Product Spaces of Highly Configurable Systems. Ph.D. Dissertation. University of Texas at Austin, Dept. of Computer Science.
[91]
J. Oh, D. Batory, M. Myers, and N. Siegmund. 2017. Finding near-optimal configurations in product lines by random sampling. In FSE.
[92]
J. Oh, P. Gazzillo, D. Batory, M. Heule, and M. Myers. 2020. Scalable Uniform Sampling for Real-world Software Product Lines. Technical Report TR-20-01. Dept. of Computer Science, University of Texas at Austin.
[93]
R. Olaechea, D. Rayside, J. Guo, and K. Czarnecki. 2014. Comparison of exact and approximate multi-objective optimization for software product lines. In ICSE.
[94]
J. Pereira, M. Acher, H. Martin, and J. M. Jézéquel. 2020. Sampling effect on performance prediction of configurable systems: A case study. In ICPE.
[95]
P. Perrotta. 2020. Programming Machine Learning: From Coding to Deep Learning. The Pragmatic Bookshelf.
[96]
D. A. Plaisted and S. Greenbaum. 1986. A structure-preserving clause form translation. Symb. Comput. 2, 3 (1986), 293–304.
[97]
Q. Plazar, M. Acher, G. Perrouin, X. Devroey, and M. Cordy. 2019. Uniform sampling of SAT solutions for configurable systems: Are we there yet? In ICST.
[99]
J. P. Royston. 1982. An extension of Shapiro and Wilk’s w test for normality to large samples. J. R. Stat. Soc. 31, 2 (1982), 115–124.
[100]
L. E. Sánchez, J. A. Diaz-Pace, and A. Zunino. 2019. A family of heuristic search algorithms for feature model optimization. Sci. Comput. Program. 172 (2019), 264–293.
[101]
A. Sarkar, J. Guo, N. Siegmund, S. Apel, and K. Czarnecki. 2015. Cost-efficient sampling for performance prediction of configurable systems. In ASE.
[102]
S. Sharma, R. Gupta, S. Roy, and K. S. Meel. 2018. Knowledge compilation meets uniform sampling. In LPAR.
[103]
K. Shi. 2017. Combining evolutionary algorithms with constraint solving for configuration optimization. In ICSME.
[104]
S. Siegel and J. Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw Hill.
[105]
N. Siegmund et al. 2012. Dataset for Siegmund2012. http://fosd.de/SPLConqueror
[106]
N. Siegmund et al. 2012. Predicting performance via automated feature-interaction detection. In ICSE.
[107]
N. Siegmund, A. Grebhahn, S. Apel, and C. Kästner. 2015. Performance-influence models for highly configurable systems. In FSE.
[108]
N. Siegmund, M. Rosenmüller, M. Kuhlemann, C. Kästner, and G. Saake. 2012. SPL conqueror: Toward optimization of non-functional properties in software product lines. Softw. Qual. J. (Sept. 2012).
[109]
N. Siegmund, S. Sobernig, and S. Apel. 2017. Attributed variability models: Outside the comfort zone. In FSE.
[110]
G. J. Székely, M. L. Rizzo, and N. K. Bakirov. 2007. Measuring and testing dependence by correlation of distances. Ann. Stat. 35, 6 (2007), 2769–2794.
[111]
M. Thurley. 2006. SharpSAT–counting models with advanced component caching and implicit BCP. In SAT.
[113]
G. S. Tseitin. 1983. On the complexity of derivation in propositional calculus. In Automation of Reasoning, J. Wrightson et al. (Eds.).
[114]
S. Vasishth and M. Broe. 2011. The Foundations of Statistics: A Simulation-based Approach. Springer, Berlin.
[115]
M. Velez, P. Jamshidi, N. Siegmund, S. Apel, and C. Kästner. 2021. White-box analysis over machine learning: Modeling performance of configurable systems. In ICSE.
[117]
J. White, B. Dougherty, and D. C. Schmidt. 2009. Selecting highly optimal architectural feature sets with filtered cartesian flattening. J. Syst. Softw. (Aug. 2009).
[118]
J. White, B. Doughtery, and D. Schmidt. 2008. Filtered Cartesian flattening: An approximation technique for optimally selecting features while adhering to resource constraints. In SPLC.
[127]
Y. Xue and Y. Li. 2018. Multi-objective integer programming approaches for solving optimal feature selection problem. In ICSE.
[128]
T. Ye and S. Kalyanaraman. 2003. A recursive random search algorithm for large-scale network parameter configuration. PER (June 2003).
[129]
Y. Zhang, J. Guo, E. Blais, and K. Czarnecki. 2015. Performance prediction of configurable software systems by Fourier learning. In ASE.
[130]
Y. Zhu et al. 2017. BestConfig: Tapping the performance potential of systems via automatic configuration tuning. In SoCC.
[131]
M. Zuluaga, G. Sergent, A. Krause, and M. Püschel. 2013. Active learning for multi-objective optimization. In PMLR.
