1 Introduction

In this paper we examine the problem of constructing classifiers that are built to minimise upper bounds on the future misclassification rate of a predictor. This is a fundamental problem in machine learning providing practitioners with a guarantee on the future performance of a trained predictor. Given its importance, there has been a significant amount of research that seeks to address this problem, one of the most prominent directions being a theory on the uniform convergence of empirical quantities to their mean (Vapnik and Chervonenkis 1971; Vapnik 1995). This theory provides a way of estimating the future misclassification rate of a predictor based on its empirical performance and some measure of the complexity of the predictor function e.g. the Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis 1971) or the fat-shattering dimension (Alon et al. 1997). Further work in this direction (Marchand and Shawe-Taylor 2002; Sokolova et al. 2002) has been based on the prior assumption that the decision boundary can be constructed as a logical combination of a small set of data derived features. The analysis of these algorithms considers class-conditional error bounds that can be used for unequal loss functions. Class-conditional error bounds also inspire the approach described in the next paragraph.

In this paper we view the problem of generalisation from a different perspective by building upon the minimax probability machine (MPM) framework introduced in Lanckriet et al. (2003). In this setting, rather than trying to trade off the error over the training sample with the complexity of the function, we directly minimise an upper bound on the future misclassification rate. This minimisation takes place in a worst-case setting by considering all possible class-conditional distributions that have a particular mean and covariance matrix. These class-conditional means and covariance matrices play a key role in determining the optimal predictor and deriving upper bounds on its future misclassification rate. However when it comes to implementing these algorithms in practice, the true moments are not known and their empirical counterparts have to be used instead.

In this paper we seek to address the problems caused by the uncertainty of empirical moments by presenting the high-probability minimax probability machine (HP-MPM). The HP-MPM incorporates high probability upper bounds on the deviation of true moments from their empirical counterparts into the minimax problem to ensure that the future misclassification rate guarantees hold true with high-probability. The incorporation of moment uncertainty introduces a natural regularisation component into the optimisation scheme. We see that a smaller number of observations for a particular class results in greater uncertainty regarding its distribution, thus warranting additional regularisation. This is an often overlooked component of binary classifiers, where the class-conditional distributions are traditionally jointly regularised, ignoring the relative amount of information that is available for each class.

This paper follows with an introduction to the original MPM in Sect. 2, providing much of the technical details that will be required for the formulation of the HP-MPM. In Sect. 3 we present high-probability bounds on the deviations of true moments from their empirical counterparts, and show how they give rise to the HP-MPM optimisation scheme. In Sect. 4 we discuss the alternating optimisation that was designed to solve the problem, and deal with the kernelisation of the algorithm in Sect. 5. We present the results of our experiments in Sect. 6, and conclude in Sect. 7 with some final remarks regarding how best to use the newly proposed algorithm and where future research should focus.

2 Minimax probability machines

We consider the problem of constructing a binary classifier (predictor) by using some labelled training set consisting of inputs \(\{ \mathbf {x}_1, \dots , \mathbf {x}_m\}\in \mathcal {X}\subseteq \mathbb {R}^d\), and their corresponding class labellings \(\{y_1, \dots , y_m\} \in \{0,1\}\). For each class \(j=0,1\), we assume that the input observations belonging to this class are generated according to some underlying distribution \(\mathcal {D}_j\) where the mean \(\bar{\mathbf {x}}_j \in \mathbb {R}^d\) and covariance matrix \(\varSigma _j \in \mathbb {R}^{d \times d}\) of the distribution are known, but is otherwise arbitrary. The goal of the MPM is to find the linear decision boundary (hyperplane) that minimises the probability that future observations from these distributions will lie on the wrong side of this boundary.

Central to the derivation of the minimax program used in the MPM is the following theorem, a multivariate extension of the Chebyshev Inequality (Marshall and Olkin 1960), which was popularised for convex optimisation in Bertsimas and Popescu (2005):

Theorem 1

(Marshall and Olkin 1960; Bertsimas and Popescu 2005)

$$\begin{aligned} \underset{\mathbf {x}\sim \mathcal {D}}{\sup } \mathbf {Pr}\{ \mathbf {x}\in \mathcal {S}\} = \frac{1}{1+d^2}, \,\,\,\,\, \text { with } \,\,\,\, d^2 = \underset{\mathbf {x}\in \mathcal {S}}{\inf }\, (\mathbf {x}- \bar{\mathbf {x}})^T \varSigma ^{-1} (\mathbf {x}- \bar{\mathbf {x}}), \end{aligned}$$

where \(\mathbf {x}\) is a random vector, \(\mathcal {S}\) is a given convex set, and where the supremum is taken over all distributions \(\mathcal {D}\) for \(\mathbf {x}\) that have mean \(\bar{\mathbf {x}}\) and covariance matrix \(\varSigma \).

This theorem relates the maximum probability of a random vector \(\mathbf {x}\sim \mathcal {D}\) belonging to a convex set \(\mathcal {S}\) to the minimum Mahalanobis distance \(d^2\) from the centre of the distribution \(\bar{\mathbf {x}}\) to that set. Motivated by finding a linear decision boundary, Lanckriet et al. (2003) showed that when \(\mathcal {S}\) is the upper half-space defined the separating hyperplane \(\mathcal {H}(\mathbf {w},b):=\{ \mathbf {x}\,\,| \,\, \mathbf {w}^T \mathbf {x}= b \}\), the distance \(d^2\) admits a closed form expression given by

$$\begin{aligned} d^2= \inf _{\mathbf {w}^T \mathbf {x}\ge b}(\mathbf {x}-\bar{\mathbf {x}})^T\varSigma ^{-1} (\mathbf {x}-\bar{\mathbf {x}}) = {\left\{ \begin{array}{ll}\frac{ (b-\mathbf {w}^T\bar{\mathbf {x}})^2}{\mathbf {w}^T \varSigma \mathbf {w}} &{}\text { if } \mathbf {w}^T \bar{\mathbf {x}} < b\\ 0 &{} \text { if } \mathbf {w}^T \bar{\mathbf {x}} \ge b \end{array}\right. }. \end{aligned}$$
(1)

This expression enables us to upper bound the probability that an observation drawn from a class-conditional distribution will lie on the wrong side of the separating hyperplane, alternatively it provides a lower bound on the probability that the observation will lie on the correct side of the hyperplane. This results in the following optimisation problem:

$$\begin{aligned} \max _{\mathbf {w},b, \alpha } \,\,\alpha \text {s.t.} &\underset{\mathbf {x}_1 \sim \mathcal {D}_1 }{\inf } P(\mathbf {w}^T\mathbf {x}_1 \ge b ) \ge \alpha \\&\underset{\mathbf {x}_0 \sim \mathcal {D}_0 }{\inf } P(\mathbf {w}^T\mathbf {x}_0 \le b ) \ge \alpha , \end{aligned}$$

where \(\alpha \in [0,1]\) is the minimum probability that examples are labelled correctly in the future. To see this, let our classifier predict that \(\mathbf {x}\) belongs to class 1 if \(\mathbf {w}^T \mathbf {x}\ge b\). The maximum probability that a point drawn from \(\mathcal {D}_1\) resides on the wrong side of this hyperplane \(\mathcal {H}(\mathbf {w},b)\) is given by

$$\begin{aligned} \underset{\mathbf {x}_1 \sim \mathcal {D}_1}{\sup }P(\mathbf {w}^T\mathbf {x}_1 <b)= \frac{1}{1+d^2}=1-\alpha . \end{aligned}$$

Therefore the minimum probability that a random vector \(\mathbf {x}_1\) resides on the correct side of the hyperplane is greater than \(\alpha \). Using the closed form expression for the Mahalanobis distance given in (1), and assuming that \(\mathbf {w}^T\bar{\mathbf {x}}_1 > b\), we derive the following key equivalence statement

$$\begin{aligned} \underset{\mathbf {x}_1 \sim \mathcal {D}_1}{\inf }P(\mathbf {w}^T \mathbf {x}_1 \ge b) \ge \alpha \iff -b+ \mathbf {w}^T\bar{\mathbf {x}}_1 \ge \kappa (\alpha ) \sqrt{\mathbf {w}^T \varSigma _1 \mathbf {w}}, \end{aligned}$$
(2)

where \(\kappa (\alpha )= \sqrt{\alpha /(1-\alpha )}\). A similar but opposite formulation for class 0 allows the optimisation problem to be written as

$$\begin{aligned} \max _{\mathbf {w},b, \alpha } \,\,\alpha \text {s.t.} -b + \mathbf {w}^T \bar{\mathbf {x}}_1&\ge \kappa (\alpha ) \sqrt{ \mathbf {w}^T \varSigma _1 \mathbf {w}}\\ b - \mathbf {w}^T\bar{\mathbf {x}}_0&\ge \kappa (\alpha ) \sqrt{ \mathbf {w}^T \varSigma _0 \mathbf {w}}. \end{aligned}$$

Further reductions led to the second-order cone program given by the following theorem:

Theorem 2

(Lanckriet et al. 2003) If \(\bar{\mathbf {x}}_1 = \bar{\mathbf {x}}_0\) then the minimax probability decision problem does not have a meaningful solution and the worst case misclassification probability is given by \(1- \alpha _*=1\). Otherwise an optimal hyperplane \(\mathcal {H}(\mathbf {w}_*,b_*)\) exists and can be determined by solving the convex optimisation problem

$$\begin{aligned} \kappa _*^{-1}: = \min _{\mathbf {w}} \sqrt{ \mathbf {w}^T \varSigma _1 \mathbf {w}} + \sqrt{ \mathbf {w}^T \varSigma _{0} \mathbf {w}} \text { s.t. } \mathbf {w}^T(\bar{\mathbf {x}}_1 -\bar{\mathbf {x}}_0) =1, \end{aligned}$$
(3)

and setting b to the value

$$\begin{aligned} b_* = \mathbf {w}_*^T \bar{\mathbf {x}}_1 - \kappa _* \sqrt{\mathbf {w}_*^T \varSigma _{1} \mathbf {w}_*}, \end{aligned}$$
(4)

where \(\mathbf {w}_*\) is the optimal solution to (3). The optimal worst-case misclassification probability is given by

$$\begin{aligned} 1 - \alpha _* = \frac{1}{1+\kappa _*^2} = \frac{ \left( \sqrt{\mathbf {w}_*^T \varSigma _{1} \mathbf {w}_*} + \sqrt{\mathbf {w}_*^T \varSigma _{0} \mathbf {w}_*} \right) ^2}{1 + \left( \sqrt{\mathbf {w}_*^T \varSigma _{1} \mathbf {w}_*} + \sqrt{\mathbf {w}_*^T \varSigma _{0} \mathbf {w}_*} \right) ^2}. \end{aligned}$$
(5)

If either \(\varSigma _1\) or \(\varSigma _0\) is positive definite, the optimal hyperplane is unique.

Lanckriet et al. (2003) showed that it was possible to solve the optimisation (3) using an iterative least-squares scheme, which has a worst-case complexity \(\mathcal {O}(d^3)\). Furthermore, their empirical results show that the MPM approach to classification is competitive with the state-of-the-art support vector machine (SVM) (Boser et al. 1992; Cortes and Vapnik 1995), thus providing evidence in support of the MPM as an efficient and effective approach for binary classification. These encouraging results resulted in the MPM approach being applied to both novelty-detection (Ghaoui et al. 2002) and regression (Strohmann and Grudic 2002), with some degree of success. In Huang et al. (2004) the authors identified an oversight in the original MPM formulation: that is, it implicitly assumes that the prior probability of each class is the same. The authors showed that if the prior probabilities of the classes differed, then it was no longer optimal to minimise a single worst-case future misclassification rate but rather one should minimise a weighted combination of class specific worst-case misclassification rates. The weights correspond to their prior class probability and this formulation became known as the minimum error MPM (ME-MPM). An alternative approach for dealing with imbalanced data was presented in Osadchy et al. (2015), where the authors optimised an objective that used the SVM hinge loss for the less frequent class and the minimax loss formulation for the abundant class. A transductive minimax probability machine was proposed in Huang et al. (2014), here unlabelled test points were assigned classes based upon their ability to minimise the worst case error bound and it was shown to be especially competitive with the transductive SVM on semi-supervised learning tasks. Similar to the transductive setting, in Huang et al. (2015) the authors focus on the problem of clustering by assigning unlabelled data to clusters in an attempt to optimise criterion defined by the MPM framework.

3 High-probability MPMs

The bounds on the future misclassification rate presented in (5) are valid when the true moments of the class-conditional distributions are known in advance. However this is not often the case, and in practice the true moments have to be substituted for empirical ones during the algorithm’s implementation. This can result in the derivation of sub-optimal predictors and lead to bounds on the future misclassification rate that are invalid. In this section we make use of high-probability bounds on the deviation of true moments from their empirical counterparts, and use this to derive an optimisation scheme that directly takes into consideration the uncertainty of the empirical moments when constructing the predictor.

We begin by reviewing the robustness results presented in the original MPM (Lanckriet et al. 2003), where the authors sought to address the use of empirical moment estimates using a specific uncertainty sets, \(\mathcal {U}_0\) and \(\mathcal {U}_1\), for the value of the true moments. More specifically, for each \(j=0,1\), the optimisation scheme considered all values of the true moments, \(\bar{\mathbf {x}}_j\) and \(\varSigma _j\), that resided within the uncertainty set

$$\begin{aligned} \mathcal {U}_j = \left\{ (\bar{\mathbf {x}}_j, \hat{\varSigma }_j) \, : \, (\bar{\mathbf {x}}_j - \hat{\mathbf {x}}_j)^T \varSigma _j^{-1} (\bar{\mathbf {x}}_j- \hat{\mathbf {x}}_j) \le \nu ^2, \,\,\, || \varSigma _j - \hat{\varSigma }_j||_F \le \rho \right\} , \end{aligned}$$

where \(\hat{\mathbf {x}}_j\) and \(\hat{\varSigma }_j\), are the empirical estimates of the mean and covariance matrix of class j, derived from the training sample i.e.

$$\begin{aligned} \hat{\mathbf {x}}_j = \frac{1}{m_j} \sum _{k=1}^{m_j} \mathbf {x}_j \text {and} \hat{\varSigma }_j = \frac{1}{m_j} \sum _{k=1}^{m_j} (\mathbf {x}_k - \hat{\mathbf {x}}_j )(\mathbf {x}_k - \hat{\mathbf {x}}_j)^T \, \end{aligned}$$

where \(m_j\) is the number of observations belonging to class j. The value of \(\nu \ge 0\) and \(\rho \ge 0 \) control the size of the uncertainty set, and have to be set at the practitioner’s discretion. It can be argued that this specific uncertainty set chosen more for its numerical tractability rather than its statistical accuracy, and in what follows of this section, we derive a statistically motivated approach for incorporating the uncertainty of the moments into the optimisation scheme.

Shawe-Taylor and Cristianini (2003) present high-probability upper bounds on the deviation of true moments \((\bar{\mathbf {x}}, \varSigma )\) from their empirical counterparts \((\hat{\mathbf {x}}, \hat{\varSigma })\): high-probability in the sense that the probability that the true value of the moment deviates from the empirical one by more than \(\epsilon \in \mathbb {R}\) is less than \(\delta \ge 0\). They showed that the following holds true with probability at least \(1-\delta \):

$$\begin{aligned} || \bar{\mathbf {x}} - \hat{\mathbf {x}} ||_2 \le \frac{R}{\sqrt{m}} \left( 2 + \sqrt{2 \log \frac{1}{\delta } } \right) \,\,\, \text { and } \,\, \, || \varSigma - \hat{\varSigma }||_F \le \frac{2R^2}{\sqrt{m}}\left( 2 + \sqrt{2 \log \frac{2}{\delta }} \right) , \end{aligned}$$
(6)

where \(|| \cdot ||\) and \(|| \cdot ||_F\) denote the \(L_2\) and Frobenius norms, respectively, \(R >0\) is the radius of the smallest sphere containing the support of \(\mathcal {X}\) i.e. for all \(\mathbf {x}\in \mathcal {X},\) \(||\mathbf {x}|| \le R\), and m is the number of observations that were used to construct the empirical moment. The authors examined the implications of the these deviation bounds on the MPM guarantees, showing that the high-probability worst-case estimate for the future misclassification errors can differ significantly from that found using (5). Their focus was on finding what the high-probability worst-case future misclassification rate was, given that the predictor was constructed using the original MPM formulation (3). Whereas our focus is on designing an optimisation scheme that directly minimises the high-probability bound on future misclassification by taking into consideration the uncertainty in the empirical moments.

To do this we begin by reviewing how moment uncertainty affects the key equivalence relationship given in (2). To simplify the analysis, we assume that the weight vector lies within the unit-ball defined by the \(L_2\)-norm i.e. \(|| \mathbf {w}|| \le 1\). We want to find a high-probability bound on the deviation of the values in the inequality (2) when using empirical and true moments in the expression. We do this by using the following adaptation of the proposition presented in Shawe-Taylor and Cristianini (2003).

Proposition 1

Let \(\hat{\mathbf {x}}\) and \(\hat{\varSigma }\) be the empirical mean and covariance matrix of a sample of m points drawn independently according some probability distribution \(\mathcal {D}\) with mean \(\bar{\mathbf {x}}\) and covariance matrix \( \varSigma \). The weight vector \(||\mathbf {w}|| \le 1\) where \(\mathbf {w}\ne \mathbf {0}\), and \(b \in \mathbb {R}\) are given such that \(\mathbf {w}^T \hat{\mathbf {x}} \le b\). Then if

$$\begin{aligned} b-\mathbf {w}^T \hat{\mathbf {x}} \ge \sqrt{ \kappa (\alpha )^2 \mathbf {w}^T \hat{\varSigma }\mathbf {w}+ T} \end{aligned}$$
(7)

where

$$\begin{aligned} T=\frac{4 R^2}{\sqrt{m}} \left( 2 + \sqrt{2 \ln \frac{2}{\delta } } \right) + \kappa (\alpha )^2 \frac{2 R^2}{\sqrt{m}} \left( 2 + \sqrt{ 2 \ln \frac{2}{\delta } } \right) \end{aligned}$$

then with probability at least \(1-\delta \) over the draw of the random sample

$$\begin{aligned} b-\mathbf {w}^T \bar{\mathbf {x}} \ge \kappa (\alpha ) \sqrt{ \mathbf {w}^T \varSigma \mathbf {w}} \text { and } \underset{\mathbf {x}\sim \mathcal {D}}{\inf } P (\mathbf {w}^T \mathbf {x}\ge b ) \ge \alpha . \end{aligned}$$

Proof

To prove this we show that if

$$\begin{aligned}&(b - \mathbf {w}^T \hat{\mathbf {x}} )^2 - \kappa (\alpha )^2 \mathbf {w}^T \hat{\varSigma }\mathbf {w}\ge T \end{aligned}$$

then with probability at least \(1-\delta \)

$$\begin{aligned} (b - \mathbf {w}^T \bar{\mathbf {x}} )^2 - \kappa (\alpha )^2 \mathbf {w}^T \varSigma \mathbf {w}\ge 0. \end{aligned}$$

We do this by bounding the high-probability differences in the value of the expressions on the left hand side of the inequalities

$$\begin{aligned}&\left| (b - \mathbf {w}^T \hat{\mathbf {x}} )^2 - \kappa (\alpha )^2 \mathbf {w}^T \hat{\varSigma }\mathbf {w}- (b - \mathbf {w}^T \bar{\mathbf {x}} )^2 +\kappa (\alpha )^2 \mathbf {w}^T \varSigma \mathbf {w}\right| \\&\quad \le || \hat{\mathbf {x}} - \bar{\mathbf {x}} || \left( 2b + || \hat{\mathbf {x}} + \bar{\mathbf {x}} || \right) + \kappa (\alpha )^2 \left| \mathbf {w}^T \hat{\varSigma }\mathbf {w}- \mathbf {w}\varSigma \mathbf {w}\right| \\&\quad \le || \hat{\mathbf {x}} - \bar{\mathbf {x}} ||4 R + \kappa (\alpha )^2 || \hat{\varSigma }- \varSigma ||_F. \end{aligned}$$

The proof is completed by using the bounds on the empirical moments presented (6) with \(\delta \) replaced with \(\delta /2\),

$$\begin{aligned}&\left| (b - \mathbf {w}^T \hat{\mathbf {x}} )^2 - \kappa (\alpha )^2 \mathbf {w}^T \hat{\varSigma }\mathbf {w}- (b - \mathbf {w}^T \bar{\mathbf {x}} )^2 +\kappa (\alpha )^2 \mathbf {w}^T \varSigma \mathbf {w}\right| \\&\,\,\, \le \frac{4 R^2}{\sqrt{m}} \left( 2 + \sqrt{2 \ln \frac{2}{\delta } } \right) + \kappa (\alpha )^2 \frac{2 R^2}{\sqrt{m}} \left( 2 + \sqrt{ 2 \ln \frac{2}{\delta } } \right) . \end{aligned}$$

Note that the bound comes into play when we consider

$$\begin{aligned} (b - \mathbf {w}^T \hat{\mathbf {x}} )^2 - \kappa (\alpha )^2 \mathbf {w}^T \hat{\varSigma }\mathbf {w}\ge (b - \mathbf {w}^T \bar{\mathbf {x}} )^2 - \kappa (\alpha )^2 \mathbf {w}^T \varSigma \mathbf {w}, \end{aligned}$$

and it holds true regardless of the bound if the inequality is reversed. \(\square \)

To formulate the HP-MPM optimisation scheme we use each classes corresponding inequality (7), where the class specific uncertainty is captured by the term \(T_j\) for \(j=0,1\) with

$$\begin{aligned} T_j&=\frac{4 R^2}{\sqrt{m_j}} \left( 2 + \sqrt{2 \ln \frac{2}{\delta } } \right) + \kappa (\alpha )^2 \frac{2 R^2}{\sqrt{m_j}} \left( 2 + \sqrt{ 2 \ln \frac{2}{\delta } } \right) \\&= 2 A_j + \kappa (\alpha )^2 A_j, \end{aligned}$$

where

$$\begin{aligned} A_j = \frac{2 R^2}{\sqrt{m_j}}\left( 2 + \sqrt{2 \ln \frac{2}{\delta } } \right) . \end{aligned}$$
(8)

We can drop the dependence of the optimisation on \(\alpha \) by noting the monotonic relationship it has with \(\kappa (\alpha )\), and by introducing the constraint that \(||\mathbf {w}|| \le 1\) and using the inequalities (7), the minimax program becomes

$$\begin{aligned} \underset{ \mathbf {w}, b, \kappa }{\max } \,\,\kappa \text {s.t.} ||\mathbf {w}|| \le 1\\ -b + \mathbf {w}^T \hat{\mathbf {x}}_1&\ge \sqrt{ 2A_1+ \kappa ^2 \left( \mathbf {w}^T \hat{\varSigma }_1 \mathbf {w}+ A_1 \right) }\\ b - \mathbf {w}^T\hat{\mathbf {x}}_0&\ge \sqrt{ 2A_0+ \kappa ^2 \left( \mathbf {w}^T \hat{\varSigma }_0 \mathbf {w}+ A_0 \right) } . \end{aligned}$$

Corollary 1

For \(j=0,1\), let \(\hat{\mathbf {x}}_j\) and \(\hat{\varSigma }_j\) be the empirical mean and covariance matrix of \(m_j\) points drawn independently from distributions \(D_j\) with true mean \(\bar{\mathbf {x}}_j\) and covariance matrix \(\varSigma _j\), and let \(A_j\) be defined according to (8). If \(|| \hat{\mathbf {x}}_1 - \hat{\mathbf {x}}_0|| \le \sqrt{2 A_1} + \sqrt{2 A_0}\) then the high probability MPM decision problem does not have a meaningful solution and the worst-case misclassification probability is given by \(1-\alpha _* =1\). Otherwise an optimal hyperplane \(\mathcal {H}(\mathbf {w}_*,b_*)\) exists and can be determined by solving the optimisation problem given by

$$\begin{aligned} \underset{ \mathbf {w}, \kappa }{\max } \,\,\kappa &\text {s.t.} ||\mathbf {w}|| \le 1 \nonumber \\&\mathbf {w}^T (\hat{\mathbf {x}}_1- \hat{\mathbf {x}}_0) = \sqrt{ 2A_1+ \kappa ^2 \left( \mathbf {w}^T \hat{\varSigma }_1 \mathbf {w}+ A_1 \right) } + \sqrt{ 2A_0+ \kappa ^2 \left( \mathbf {w}^T \hat{\varSigma }_0 \mathbf {w}+ A_0 \right) }, \end{aligned}$$
(9)

and setting b to the value

$$\begin{aligned} b_*=\mathbf {w}_*^T \hat{\mathbf {x}}_1- \sqrt{ 2A_1 + \kappa ^2_*\left( \mathbf {w}_*^T \hat{\varSigma }_1 \mathbf {w}_* + A_1 \right) }=\mathbf {w}_*^T \hat{\mathbf {x}}_0+ \sqrt{ 2A_0 + \kappa ^2_*\left( \mathbf {w}_*^T \hat{\varSigma }_0 \mathbf {w}_* + A_0 \right) }, \end{aligned}$$

where \(\mathbf {w}_*\) and \(\kappa _*\) are the optimal solutions to (9). Then with probability at least \(1-\delta \) over the draws of the random sample, the optimal worst-case misclassification probability is given by

$$\begin{aligned} 1-\alpha _* = \frac{1}{1+\kappa _*^2}. \end{aligned}$$

When presented with a new input observation \(\mathbf {x}^\prime \), we make our prediction according to what side of the optimal hyperplane the point resides i.e. we predict that \(y^\prime =1\) if \(\mathbf {w}_*^T\mathbf {x}^\prime - b_* \ge 0\), and that \(y^\prime =0\) otherwise.

4 Optimisation scheme

The optimisation problem given in (9) can not be solved using the same approach taken in Lanckriet et al. (2003) because of the unit \(L_2\)-norm restriction on \(\mathbf {w}\), and the presence of the uncertainty terms \(A_j\) under the square root. To solve this problem we propose the use of an auxiliary function \(h(\mathbf {w},\kappa )\) in conjunction with an alternating update scheme over \(\mathbf {w}\) and \(\kappa \). The auxiliary function is given by

$$\begin{aligned} h(\mathbf {w},\kappa ) = \mathbf {w}^T(\hat{\mathbf {x}}_1 - \hat{\mathbf {x}}_0) -&\sqrt{ 2A_1 + \kappa ^2\left( \mathbf {w}^T \hat{\varSigma }_1 \mathbf {w}+ A_1 \right) } - \sqrt{ 2A_0+ \kappa ^2 \left( \mathbf {w}^T \hat{\varSigma }_0 \mathbf {w}+ A_0 \right) } \end{aligned}$$
(10)

Note that the class uncertainty terms require the computation of span of the data i.e. find R such that \(|| \mathbf {x}|| \le R \) for all \(\mathbf {x}\in \mathcal {X}\). During implementation this will have to be estimated from the training sample or can be enforced by some normalisation scheme that is independent of the learning algorithm.

Initialisation: To initialise the optimisation, we begin with \(\kappa =0\), and find the value of \(\mathbf {w}\) that maximises \(h(\mathbf {w},\kappa )\) subject to the constraints that \(||\mathbf {w}||\le 1\). This has a closed form solution \((\hat{\mathbf {x}}_1 - \hat{\mathbf {x}}_0)/|| \hat{\mathbf {x}}_1 - \hat{\mathbf {x}}_0 || = \arg \max _{||\mathbf {w}|| \le 1 } h(\mathbf {w},0)\), and provides the conditions necessary for a meaningful solution to the high probability MPM decision problem i.e. we require that \(\max _{||\mathbf {w}||\le 1 } h(\mathbf {w}, 0) >0\) in order to be able to find a positive value of \(\kappa \) in the next step of the optimisation scheme.

\(\mathbf {w}\) -step: For non-initialisation \(\mathbf {w}\)-steps, the goal is maximise the value of the auxiliary function by performing gradient ascent subject to our constraint \(||\mathbf {w}||\le 1\). It is straightforward to show that \(h(\mathbf {w},\kappa )\) is a concave in \(\mathbf {w}\) and therefore every local optimum will be a global optimum. Therefore we can use standard constrained optimisation tools to solve this intermediate problem. Note that we do not need to run these constrained optimisations to convergence, we simply need the value of the auxiliary function to increase, in order to allow for a larger value of \(\kappa \) in the next step. We view this as a constrained maximisation subject to some implicit degree of regularisation imposed by the value of \(\kappa \).

\(\kappa \) -step: In order to continue the optimisation, we require that the \(\mathbf {w}\)-step results in a strictly positive value for the auxiliary function, \(h(\mathbf {w},\kappa )>0\). If this is not the case then the optimisation has converged, and we have reached the optimal solution. If \(h(\mathbf {w},\kappa )>0\), we can increase the value of \(\kappa \) to \(\kappa ^\prime \) such that the value of the auxiliary function is zero i.e. \(h(\mathbf {w}, \kappa ^\prime )=0\). This can be performed using a simple line-search procedure, or by finding the roots of a quadratic expression involving \(\kappa \). Note that in order for the optimisation to progress we must find \(\kappa ^\prime \) such that \(\kappa ^\prime > \kappa \). To simplify the range of the line-search we observe an upper bound on the value of \(\kappa ^\prime \), namely \(\kappa ^\prime \le \kappa _u = ||\hat{\mathbf {x}}_1 - \hat{\mathbf {x}}_0||/\left( \sqrt{ \mathbf {w}^T \hat{\varSigma }_1 \mathbf {w}} + \sqrt{ \mathbf {w}^T \hat{\varSigma }_0 \mathbf {w}} \right) \).

figure a

Optimal solution: We prove that the optimal solution for the weight vector \(\mathbf {w}_*\) will have a unit \(L_2\)-norm i.e. \(||\mathbf {w}_*||=1\). To do this suppose that \(||\mathbf {w}_*||<1\), we know that at optimality \(h(\mathbf {w}_*, \kappa _*)=0\) and that \(\mathbf {w}^\prime = \mathbf {w}_*/||\mathbf {w}_*||\) is also a feasible solution. We show that \(h(\mathbf {w}^\prime , \kappa _*)>0\) and that \(\mathbf {w}_*\), where \(||\mathbf {w}_*||<1\), can not be the optimal solution. To see this observe that

$$\begin{aligned} \sqrt{ 2 A_j + \kappa _*^2\left( \mathbf {w}^{\prime \, T} \hat{\varSigma }_j \mathbf {w}^\prime + A_j\right) }&= \sqrt{ 2 A_j + \kappa _*^2\left( \frac{1}{||\mathbf {w}_*||^2} \mathbf {w}_*^T \hat{\varSigma }_j \mathbf {w}_* + A_j\right) }\\&< \frac{1}{||\mathbf {w}_*||}\sqrt{ 2 A_j + \kappa _*^2 \left( \mathbf {w}_*^T \hat{\varSigma }_j \mathbf {w}_* + A_j \right) }. \end{aligned}$$

Using this inequality in the auxiliary function \(h(\mathbf {w}^\prime ,\kappa _*)\) we see that

$$\begin{aligned} h(\mathbf {w}^\prime , \kappa _*) > \frac{1}{||\mathbf {w}_*||} h(\mathbf {w}_*, \kappa _*), \end{aligned}$$

where we know by the monotonicity of \(h(\mathbf {w},\kappa )\) with respect to \(\kappa \), that there exists \(\kappa ^\prime > \kappa _*\), satisfying the constraints in (9). Therefore \((\mathbf {w}_*,\kappa _*)\) can not be the optimal solution to this problem.

Geometric interpretation: The original MPM can be viewed as looking for the point of intersection between two ellipsoids centered at the class means, where the shape of the ellipsoids are determined by the covariance matrices and their size is controlled by the value of \(\kappa \) i.e. for \(j=0,1\)

$$\begin{aligned} \mathcal {E}_j(\kappa ) = \left\{ \mathbf {x}= \bar{\mathbf {x}}_j + \varSigma _j^{1/2} \mathbf {u}\,\, : \,\, || \mathbf {u} || \le \kappa \right\} \end{aligned}$$

Clearly as the size of \(\kappa \) increases these ellipsoids will eventually overlap. However, the optimal hyperplane is given by the common tangent to the ellipsoids at the first point of their tangency. During our optimisation scheme, we alternate between allowing these ellipsoids, albeit a penalised verson of them, to intersect i.e. \(h(\mathbf {w},\kappa )=0\), and rotating \(\mathbf {w}\) to provide additional space for the ellipsoids to expand into at the next stage of the optimisation. We can view the moment uncertainty as introducing a regularisation component to the covariance matrices, along with a penalty regarding the location of the means. The regularised ellipsoids we consider in the high-probability setting are given by

$$\begin{aligned} \hat{\mathcal {E}}_j(\kappa ) = \left\{ \mathbf {x}= \hat{\mathbf {x}}_j + \tilde{\varSigma }(\kappa )_j^{1/2} \mathbf {u}: \,\, ||\mathbf {u}||\le \kappa , \,\, \tilde{\varSigma }(\kappa )_j= \hat{\varSigma }_j + I_d\left( A_j + \frac{2 A_j}{\kappa ^2} \right) \right\} . \end{aligned}$$
(11)

Using the geometrical interpretation, we see that as the value of \(\kappa \) grows, the effective regularisation on the covariance matrix decreases. This results from a relative reduction in the role played by the mean uncertainty in the square root term. Intuitively, as we move away from the means, with increasing values of \(\kappa \), the point of origin becomes less important and we focus more on the underlying shape of the ellipsoid. In Fig. 1 we show how the ellipsoids change as we increase \(\kappa \) up until their point of tangency. The intermediate solutions where the auxiliary function \(h(\mathbf {w},\kappa )=0\), represent hyerplanes that are tangential to the ellipsoids but where the ellipsoids are not tangential to one another.

Fig. 1
figure 1

Geometric interpretation of the high-probability MPM and the intermediate solutions produced during the optimisation scheme. We can see that in the beginning, for small values of \(\kappa \), the penalised (regularised) covariance matrices are almost spherical. As the value of \(\kappa \) increases, and we move away from the class means, the ellipsoids begin to take on a shape increasingly determined by the sampled covariance matrix, however there still remains the regularisation caused by the uncertainty in the value of the covariance matrix. We see that the intermediate solutions \(h(\mathbf {w},\kappa )=0\) result in hyperplanes that are tangential to the each classes ellipsoid, however these ellipsoids are only tangential to one another at the optimal solution. In the samples used to generate this solution, \(m_1=20\) and \(m_0=200\), explaining the larger size of the ellipsoid for class 1

5 Kernelisation

So far we have explored the notion of finding an optimal linear decision boundary. Geometrically we saw that the worst-case bound on the future misclassification rate depends on both the distance between the class means, and the shape of the ellipsoids that their covariance matrices determine. However, it is often the case that by mapping the inputs into some higher-dimensional feature space there is a greater degree of separation between the two classes, and thus we should be able to reduce the worst-case future misclassification rate. Kernel methods (Vapnik 1998; Shawe-Taylor and Cristianini 2004) are able to take advantage of these higher-dimensional feature spaces without having to explicitly compute them, and have proven a useful tool for many classification algorithms. In this section we show that our optimisation problem can be re-written in terms of the kernel function, which allows us to efficiently use higher-dimensional feature spaces to represent the input space. To do this we closely follow the approach taken in Lanckriet et al. (2003).

We begin by introducing the feature mapping \(\phi : \mathcal {X}\rightarrow \mathcal {F}\) where the linear decision boundary in this space is given by hyperplane \(\mathcal {H}(\mathbf {w},b)= \{ \phi (\mathbf {x}) \in \mathcal {F} \,\, :\,\, \mathbf {w}^T \phi (\mathbf {x}) = b \}\). Note that the linear decision boundary in feature space corresponds to a non-linear decision boundary in the original space \(\mathcal {X}\). The data is mapped according to

$$\begin{aligned}&\mathbf {x}_1 \rightarrow \phi (\mathbf {x}_1) \sim \mathcal {D}_1^\phi \\&\mathbf {x}_0 \rightarrow \phi (\mathbf {x}_0) \sim \mathcal {D}_0^\phi , \end{aligned}$$

where distribution \(\mathcal {D}_j^\phi \), has mean \(\bar{\phi }_j\) and covariance matrix \( \varSigma _j^\phi \) defined in the feature space \(\mathcal {F}\). To find the optimal hyperplane in \(\mathcal {F}\) we follow the same optimisation problem given in (9), where we substitute the original empirical moments with their feature space counterparts \(\hat{\phi }_j\) and \(\hat{\varSigma }_j^\phi \) for each class \(j=0,1\). In order to efficiently use the feature mappings and make use of the kernel-trick, we have to show that the feature mappings enter the optimisation scheme only in terms of their inner-product \(\langle \phi (\mathbf {x}), \phi (\mathbf {x}^\prime ) \rangle = K(\mathbf {x},\mathbf {x}^\prime )\), where \(K: \mathcal {F} \times \mathcal {F}\) is the kernel function corresponding to the feature mapping \(\phi \). This allows us to use high-dimensional feature spaces without having to explicitly compute them, thus making them tractable to work with.

To do this, first we have to show that any optimal solution to (9) must lie in the space spanned by the input data. To prove this, suppose the optimal solution is given by \(\mathbf {w}_*=\mathbf {w}_s + \mathbf {w}_o\), where \(\mathbf {w}_s\) is the projection of \(\mathbf {w}\) onto the span of the input data and \(\mathbf {w}_o\) is orthogonal to the space spanned by the input data. We show that the value of \(\mathbf {w}_o\) plays no role in our ability to satisfy the first constraint in (9), however it does play a part in the unit \(L_2\)-norm restriction \(||\mathbf {w}||\le 1\). Therefore if we removed this orthogonal component and scaled our \(\mathbf {w}_s\) so that it resided on the unit-ball, we have already showed that this will increase the value of the auxiliary function, thus permitting an increase in the value of \(\kappa \) in the next round of the optimisation scheme. Therefore a solution containing an orthogonal component can never be optimal.

The empirical means and covariances are linear combinations of the input data, and it is straightforward to show that

$$\begin{aligned}&\mathbf {w}^T(\hat{\mathbf {x}}_1 - \hat{\mathbf {x}}_0) = \mathbf {w}_s^T(\hat{\mathbf {x}}_1 - \hat{\mathbf {x}}_0)\\&\mathbf {w}^T \hat{\varSigma }_j \mathbf {w}= \mathbf {w}_s^T \hat{\varSigma }_j \mathbf {w}_s \text {for } j=0,1. \end{aligned}$$

Therefore the value of the auxiliary function evaluated at \(\mathbf {w}\) and \(\mathbf {w}_s\) are the same i.e. \(h(\mathbf {w},\kappa ) = h(\mathbf {w}_s,\kappa )\). We know that if we replace \(\mathbf {w}= \mathbf {w}_s + \mathbf {w}_o\) with \(\mathbf {w}_s/||\mathbf {w}_s||\), where \(||\mathbf {w}_s||< 1\),then the value of our auxiliary function increases, and allows for a larger value of \(\kappa \) at optimality. Therefore a solution containing a component orthogonal to the span of the input data can not be optimal, and the optimal solution must be given by a linear combination of the input data

$$\begin{aligned} \mathbf {w}_* = \sum _{i=1}^m \gamma _i \mathbf {x}_i, \end{aligned}$$

where \(\gamma _i \in \mathbb {R}\) for all \(i=1,\dots , m\). To take full advantage of the kernel-trick, and avoid having to explicitly evaluate the feature mappings, we now have to show that the feature mappings only appear in the optimisation problem as inner-products.

Let us denote the kernel matrix \(\mathbf {K}\) where \(\mathbf {K}_{ij}=K(\mathbf {x}_i,\mathbf {x}_j)\) for all \(i,j=1,\dots , m\). The first \(m_1\) rows and last \(m_0\) rows of \(\mathbf {K}\) are denoted \(\mathbf {K}_1\) and \(\mathbf {K}_0\), respectively:

$$\begin{aligned} \mathbf {K}= \begin{pmatrix}\mathbf {K}_1 \\ \mathbf {K}_0 \end{pmatrix}, \end{aligned}$$

where \(y_i=1\) for \(i=1,\dots m_1\), and \(y_i=0\) for \(i=m_1+1, \dots , m_1 + m_0\). The class row averages, \(\mathbf {l}_1^T\) and \(\mathbf {l}_0^T\), are m-dimensional vectors given by

$$\begin{aligned} \left( \mathbf {l}^T_1 \right) _i = \frac{1}{m_1} \sum _{j=1}^{m_1} K(\mathbf {x}_j, \mathbf {x}_i) \text {and} \left( \mathbf {l}^T_0 \right) _i = \frac{1}{m_0} \sum _{j=m_1+1}^{m} K(\mathbf {x}_j, \mathbf {x}_i). \end{aligned}$$

We create the block-row-averaged kernel matrix \(\mathbf {L}\) by setting the row average of \(\mathbf {K}_1\) and \(\mathbf {K}_0\) equal to zero by:

$$\begin{aligned} \mathbf {L}=\begin{pmatrix} \mathbf {K}_1 - \mathbf {1}_{m_1} \mathbf {l}_1^T \\ \mathbf {K}_0 - \mathbf {1}_{m_0} \mathbf {l}_0^T \end{pmatrix}= \begin{pmatrix} \sqrt{m_1}\,\, \mathbf {L}_1 \\ \sqrt{m_0} \,\,\mathbf {L}_0 \end{pmatrix}, \end{aligned}$$

where \(\mathbf {1}_m\) is a column vector of ones of dimension m. The empirical moment estimates in the feature space are given by

$$\begin{aligned} \hat{\phi }_1 = \frac{1}{m_1} \sum _{i=1}^{m_1} \phi (\mathbf {x}_i) &\text {and} \hat{\varSigma }_1^\phi = \frac{1}{m_1}\sum _{i=1}^{m_1} \left( \phi (\mathbf {x}_i) - \hat{\bar{\mathbf {x}}}_1^\phi \right) \left( \phi (\mathbf {x}_i) - \hat{\bar{\mathbf {x}}}_1^\phi \right) ^T \\ \hat{\phi }_0 = \frac{1}{m_0} \sum _{i=m_1+1}^{m} \phi (\mathbf {x}_i) &\text {and} \hat{\varSigma }_0^\phi = \frac{1}{m_0}\sum _{i=m_1+1}^{m} \left( \phi (\mathbf {x}_i) - \hat{\bar{\mathbf {x}}}_0^\phi \right) \left( \phi (\mathbf {x}_i) - \hat{\bar{\mathbf {x}}}_0^\phi \right) ^T \end{aligned}$$

We saw earlier that the solution is given by \(\mathbf {w}= \sum _{i=1}^m \gamma _i \phi (\mathbf {x}_i)\), and therefore the components of the optimisation become

$$\begin{aligned} \mathbf {w}^T (\hat{\phi }_1 - \hat{\phi }_0) = \varvec{\gamma }^T\left( \mathbf {l}_1 - \mathbf {l}_0\right) , \mathbf {w}^T \hat{\varSigma }_1^\phi \mathbf {w}= \varvec{\gamma }^T \mathbf {L}_1^T \mathbf {L}_1 \varvec{\gamma } \text { and} \mathbf {w}^T \hat{\varSigma }_0^\phi \mathbf {w}= \varvec{\gamma }^T \mathbf {L}_0^T \mathbf {L}_0 \varvec{\gamma }. \end{aligned}$$

This allows us to write the kernelised version of the HP-MPM as

$$\begin{aligned} \underset{\varvec{\gamma }, \kappa }{\max }&\,\,\kappa \text {s.t.} ||\mathbf {w}||^2 = \varvec{\gamma }^T \mathbf {K}\varvec{\gamma }\le 1\\&\varvec{\gamma }^T(\mathbf {l}_1 - \mathbf {l}_0) = \sqrt{ 2 A_1 + \kappa ^2 \left( \varvec{\gamma }^T \mathbf {L}_1^T \mathbf {L}_1 \varvec{\gamma }+ A_1 \right) } + \sqrt{ 2 A_0 + \kappa ^2 \left( \varvec{\gamma }^T \mathbf {L}_0^T \mathbf {L}_0 \varvec{\gamma }+ A_1 \right) } . \end{aligned}$$

The same alternating optimisation procedure can be used to find the optimal values \(\kappa _*\) and \(\varvec{\gamma }_*\), and the optimal value of the bias term is given by

$$\begin{aligned} b_* = \varvec{\gamma }_*^T\mathbf {l}_1 - \sqrt{2A_1 + \kappa _*^2 \left( \varvec{\gamma }_*^T \mathbf {L}_1^T \mathbf {L}_1 \varvec{\gamma }_* + A_1 \right) } = \varvec{\gamma }_*^T \mathbf {l}_0 + \sqrt{2 A_0 + \kappa _*^2 \left( \varvec{\gamma }_*^T \mathbf {L}_0^T \mathbf {L}_0 \varvec{\gamma }_* + A_0 \right) } \end{aligned}$$

As with the linear case, when presented with a new input observation \(\mathbf {x}^\prime \), we predict that \(y^\prime =1\) if \( \varvec{\gamma }_*^T \mathbf {k}_{\mathbf {x}^\prime } - b_* \ge 0\), where \((\mathbf {k}_{\mathbf {x}^\prime })_i = k(\mathbf {x}_i,\mathbf {x}^\prime )\), and \(y^\prime =0\) otherwise. We should point out that we have no reason to expect that the solution \(\varvec{\gamma }_*\) will be sparse i.e. many \((\gamma _*)_i=0\), and therefore the computational cost at prediction will be linear in the size of the training sample.

6 Experiments

In this section we examine the performance of the proposed HP-MPM and compare it to the original MPM, and two other popular binary classification algorithms, Fisher’s discriminant (FDA) (Fisher 1936), and the support vector machine (SVM). In Table 1 we provide a summary of the datasets taken from the UCI repository, http://archive.ics.uci.edu/ml/, and the toy dataset used in Lanckriet et al. (2003), that we have used in our experiments. We have included details regarding the number of observations, the dimension of the input space and the relative class frequencies to help support our argument regarding the importance of including information regarding the moment uncertainty into the derivation of the predictor. All of the datasets were normalised so that each feature had zero mean and unit variance. To handle missing values, as in the vote dataset, we simply computed the means and standard deviations of each feature using the available data, performed standard normalisation on them and then set the values of the missing data to zero post-normalisation. Each dataset was randomly partitioned 50 times into training, validation and test samples, and we report the average performance over all test samples. During the experiments we varied the size of the training sample between 10 and 70%, in increments of 10%, of the full dataset to investigate how the algorithms performed with various amounts of information. The size of the validation set was fixed at 20% and the remaining data was used for testing. The goal of these experiments was to evaluate the benefits of considering moment uncertainty in the construction of the predictor, and to understand the relative gains that its inclusion have as we change the number of training points.

Table 1 Overview of the UCI datasets used during the experiments
Table 2 Linear experiments: we show how the performance of the classification algorithms on the datasets vary as the amount of data used during training changes

One of the main motivations of the formulation of the HP-MPM was to correct for overly confident estimates on the worst-case future misclassification rate. This was done through the introduction of high-probability bounds on the deviation of the true moments from the empirical counterparts. However, we noticed that during the experiments that these high-probability bounds appeared to be too restrictive in many settings and we were unable to generate meaningful solutions i.e. \(\alpha _*=0\). To overcome this deficiency we propose to use the moment uncertainty terms as a form of regularisation, and during the experiments we use a validation procedure to choose what fraction of the true moment uncertainty we should use. More precisely, rather than using \(A_j\) we used some fractional amount \(\hat{A}_j = \nu A_j\) of the full uncertainty, where \(\nu \in \{0.05, 0.1, 0.15, 0.2, 0.3, 0.5, 1\}\). For the parameter selection process, in each training and test sample we had a distinct validation set that was used to evaluate the performance of the predictor generated for the particular regularisation parameter. For each of these training, validation and test sets we evaluated the performance of the predictor (parameter) with the best validation set accuracy on the test sample. The same method to choose the regularisation parameter for the SVM and FDA, where SVM’s capacity parameter was selected from \(C \in \{ 10^{-3}, \dots , 10^3 \}\), and the FDA’s regularisation term chosen from \(\lambda \in \{10^{-3}, \dots , 10^3 \}\).

In Table 2 we examine the performance of the linear based classification algorithms and show how their performance varies as we change the size of the training sample used to construct the predictor. As one would expect, in general the performance on the test samples improves as more training examples are presented to the algorithm during training. However in the sonar dataset we see a drop in the performance of the MPM and FDA predictors as we increase the fraction of the dataset used in training from 0.1 to 0.3. This can be explained by the relatively small number of observations that were used to construct the empirical moments, which determine each algorithms decision boundary. On the other hand we see that the HP-MPM and the SVM are relatively robust to the use of small training samples, and we observe the benefits of the regularisation scheme implemented by the HP-MPM, and note the benefits of constructing the decision boundary using peripheral points, as advocated by the SVM, rather than poorly estimated empirical moments.

Table 3 Kernel experiments: we show how the performance of the classification algorithms on the datasets vary as the amount of data used during training changes

From Table 2 we can observe that when using minimal amounts of training data i.e. 10% of the full dataset, the HP-MPM method is nearly always the top performing algorithm. As the size of the training sample increases, the advantage of the HP-MPM begins to erode and its performance comes in line with the original MPM. This is to be expected in the case of large amounts of available data since we know that the HP-MPM will eventually converge towards the original MPM solution as moment uncertainty decreases to zero.

In Table 3 we present the performance of the kernelised version of the algorithms. Here we used the popular Gaussian kernel \(k(\mathbf {x}, \mathbf {x}^\prime ) = \exp (-||\mathbf {x}- \mathbf {x}^\prime ||^2/\sigma )\), where the width of the kernel \(\sigma \in \{10^{-3}, \dots , 10^3\}\) was chosen using the same validation scheme outlined earlier. We see that in general each algorithm’s performance is similar to its performance in the linear setting, however there are noticeable improvements on the ionosphere, ringnorm and sonar datasets when using the Gaussian kernel. This suggests that these input spaces are better separated with a non-linear decision boundary, whereas for the others a simple linear decision boundary will suffice. In the kernelised form we see that the MPM approach to classification , MPM or HP-MPM, is extremely competitive with the SVM, being the top performing algorithm for a large proportion of the dataset/training set size combinations.

It would appear as though the validation procedure used to determine the parameters for the kernelised form of FDA failed to ensure that increased training data resulted in an improvement in the performance. This could be due to inappropriate values of \(\lambda \) used during regularisation, however there is very little guidance in the literature on a suitable degree of regularisation, whereas the HP-MPM has a simple range \(\nu \in [0,1]\) from which to choose. Furthermore, it is straightforward to work out what maximum value of \(\nu \) will result in \(\kappa >0\) i.e. the conditions for non-zero \(\kappa \) require \(\nu \in [0,1]\) to satisfy \(||\hat{\mathbf {x}}_1- \hat{\mathbf {x}}_0|| \ge \sqrt{2 \nu A_1} + \sqrt{ 2 \nu A_0}\).

The MPM schemes are generally competitive with the other approaches, however they seem to perform comparatively poorly, some 3% worse than the SVM, on the german dataset. This weakness of the MPM was previously identified in Huang et al. (2004), and is due to the MPMs prior assumption that the prior probability of each class is the same. We know from Table 1 that this is not the case for the german dataset, and that the probability of belonging to class 1 is much higher than class 0. One could foresee the HP-MPM making this situation potentially even worse given the nature in which in constructs its solution, and its natural bias towards placing the decision boundary closer to the mean of the class where moment uncertainty is lower i.e. the one with more observations. This is illustrated in Fig. 1, where we see the hyperplane is positioned nearer to the mean of class 0 because the training sample consists of many more observations from this class. Fortunately this is not the case and we see that the HP-MPM’s performance is similar to that of the MPM. This is largely a result of the low levels of confidence in future performance i.e. small \(\kappa _*\), which results in large levels of implicit regularisation for both classes as seen in the expression for the ellipsoids (11). This results in a decision boundary that is not overly biased towards predicting that new observations belong to the minority class. A simpler explanation of its similar performance can be given by the validation procedure that was used to determine what degree \(\nu \) of regularisation to choose i.e. more likely that a smaller value of \(\nu \) was used since it would place the decision boundary less close to the mean of the more common class, and therefore not be overly biased towards predicting that a new observation belongs to the less probable class.

To improve the performance of the HP-MPM on unbalanced training samples, we propose a simple solution that adjusts the bias term used in the construction of the decision boundary. During the training step we use the optimal weight vector \(\mathbf {w}_*\) found using the standard HP-MPM algorithm, and then select the bias term to be the one that maximises the accuracy on the training set. These weight vectors and biases are then evaluated on the validation sample. Alternatively one could use the validation set to set the bias term, however this could be thought of as given this a glimpse of additional training samples and therefore an unfair advantage. Geometrically, this adjustment corresponds to a translational movement of the hyperplane where its direction \(\mathbf {w}\) remains the same. In the german dataset, this corresponds to shifting the decision boundary in the direction of the mean of class 0, since we want to increase the probability that a new observation is predicted to belong to class 1. In Table 4 we show the results obtained on the german dataset by selecting the value of the bias term, when using the Gaussian kernel. We see that this simple approach to selecting the value of the bias term b, represented by column bHP-MPM in Table 4, improves the performance of the HP-MPM, correcting its implicit assumption that classes are equally likely, and brings its performance in line with the SVM. This would suggest that the direction \(\mathbf {w}\) found by the HP-MPM is a useful method for discriminating between classes, and the bias term can be selected to take into consideration the relative class probabilities. However in doing so, the worst-case error rates that are found using the HP-MPM are no longer valid as we have repositioned the location of the separating hyperplane.

We evaluate the performance of the proposed method on the relatively large adult dataset with the results presented in Table 5. This table reports the performance using the linear version of all proposed methods. We see that the bHP-MPM approach is the top performing approach up until 5,000 training examples are provided to the learning algorithm, after which the SVM becomes the top performing predictor. This supports our argument that in the case of limited data availability the incorporation of moment uncertainty can improve the performance of predictors. As the number of data points increases and our information of the class-conditional distribution improves, the worst-case assumptions and the regularisation imposed by the HP-MPM, may hinder the construction of predictions, whereas the SVM is able to take advantage of better knowledge of the true periphery of the class-conditional distributions.

Table 4 German dataset: we evaluate the performance (classification accuracy) of selecting the bias term for the HP-MPM according to its performance on the validation set
Table 5 Adult dataset lines experiment: we evaluate the performance (classification accuracy) of the proposed algorithm on a large scale dataset
Fig. 2
figure 2

Currency experiments: profit on trading decisions advised by the different algorithms. We see that the HP-MPM performs consistently well across the majority of settings (training window and regularisation). However, all of the algorithm seem to struggle with the AUD/USD currency pair

Fig. 3
figure 3

Currency experiments: improvement of accuracy on random guessing i.e. improvement over 50% correct. The HP-MPM appears to be the most consistently performing algorithm and only does worse than random guessing on two particular parameterisations. The other algorithms appear to perform quite considerably worse in terms of accuracy, with the SVM only consistently better than random for the EUR/USD currency pair

Currency movement prediction We conclude our experiments by testing the performance of the different classification algorithms on predicting the daily price movement of four common currency pairs. The daily foreign exchange (FX) data was freely downloaded from http://www.dukascopy.com, and ranges from October 2008 to October 2014. The currency pairs that we investigated were; EUR-GBP, EUR-USD, EUR-GBP and AUD-USD. We now describe the classification setting used in the experiments. Let the opening price of the currency at day t be given by \(p_t\), we represent the input space using a range of n-past log returns. For example, if \(n=3\), then the input representation \(\mathbf {x}_t\) at time t is given by

$$\begin{aligned} \mathbf {x}_t=\left[ \log \left( {p_t}/{p_{t-1}}\right) , \log \left( {p_{t-1}}/{p_{t-2}}\right) , \log \left( {p_{t-2}}/{p_{t-3}}\right) \right] . \end{aligned}$$

Given \(\mathbf {x}_t\), our goal is to make a prediction whether we believe the price at the next time step will be higher than the current i.e. \(y_t=1\) if \(p_{t+1}\ge p_t\), and \(y_t=0\) otherwise. In evaluating the performance of the algorithms we recorded not only the accuracy of the predictions, but also the hypothetical profit that would be made had we made a decision according to the advice of the predictor i.e. if we predicted the price to increase from t to \(t+1\), then our return \(r_t\) would be the change in price over this time \(r_t =(p_{t+1}-p_t)/p_t\). Similarly if we predicted the price would fall over this time horizon \(r_t=(p_t-p_{t+1})/p_t\).

To train the model we implement a simple sliding window procedure that uses a fixed size number of examples (training window) to construct the predictor, which is then refreshed after a given number of observations (test window). By updating the predictor over time it is hoped that the predictor will be able to account for fact that the data is most likely not identically and independently distributed. Unfortunately we are unable to use the same validation technique that we used on the previous experiments, as it is likely that the most recent observations are the most important to the derivation of the predictor and we cannot make predictions based on observations in the future. Therefore in the results presented in Figs. 2 and 3 we have shown the performance of all of regularisation parameters for each classification algorithm. Given the multitude of different settings for these experiments and the limited space, we present only the results obtained when predictor is refreshed every 10 days, the input space is described using the last 5 log returns and we allow the training window to vary between 50, 100 and 200 days.

In Fig. 2 we present the hypothetical profits that would have been generated having traded on the prediction of the algorithms. We see that the HP-MPM approach performs consistently well across varying degrees of regularisation i.e. \(\nu \). It only fails to make profits on the AUD/USD currency pair, however its returns are often considerably better than those generated by the FDA or SVM. Similarly the accuracy of the HP-MPM is consistently on par with, if not exceeding that, of the other algorithms. On these datasets it would appear that the moment based algorithms, FDA and HP-MPM, perform better in terms of accuracy the SVM. We believe that this is largely due to the nature in which the solutions are constructed. The SVM will construct its solutions using points that it believes to lie on the boundary of the class-conditional distributions, whereas the moment based solutions are defined by the mean and covariances i.e. the majority of the data, rather than the outliers. Therefore when it comes to finding predictors in high-noise environments, the SVM will be constructing its solutions based on these outlying points rather than constructing it using the points that define the mass of the distribution.

7 Conclusions

In this paper we addressed an oversight of the original minimax probability machine (Lanckriet et al. 2003): that is, the worst-case future misclassification rates depend on prior knowledge of each classes mean and covariance matrix. In practice, these true moment values have to be substituted with their empirical counterparts, which are finite sample estimates of their true values. Making use of the high-probability bounds on the deviation of these estimates from their true values (Shawe-Taylor and Cristianini 2003), we derived a new optimisation scheme that takes into account the moment uncertainty and directly minimises the worst-case future misclassification rate that holds true with high-probability. We observed that in many experiments the moment uncertainty was so large that it was unable to produce meaningful results i.e. \(\kappa _*=0\). Therefore, at the expense of statistical correctness, we proposed to use fractional quantities of the true moment uncertainty as a form of regularisation. This form of regularisation, unlike most traditional schemes, implicitly takes into consideration the relative uncertainties regarding each class i.e. through different values of \(A_1\) and \(A_0\). During the experiments we noted that its performance was competitive with the popular SVM and FDA approaches, however its advantage was most apparent when minimal amounts of training data were used to construct the decision boundary thus providing support for this new approach to regularisation.

Earlier we briefly mentioned other learning algorithms that use the minimax formulation popularised by Lanckriet et al. (2003). Future work should investigate how best to include notion of moment uncertainty into these approaches. The minimum error MPM (MEMPM) (Huang et al. 2004) can be thought of as a more principled approach to our proposed bHP-MPM, taking into consideration the relative class probabilities in the construction of the decision hyperplane. Given its similarities to the original MPM approach, it should be straightforward to introduce the moment uncertainty into the MEMPM with an additional high-probability estimate on the prior class-probabilities. This approach could be used in place of our simple bias selection process, as a more principled approach to handling unbalanced training samples. Given that the minimax principle is used for the abundant class in Osadchy et al. (2015) it would seem unlikely that introducing moment uncertainty would be particularly beneficial. For the transductive (Huang et al. 2014) and clustering based (Huang et al. 2015) minimax approaches, the main difficulty of including moment uncertainty exist stems from the assignment of unlabelled data to classes. This would allow us to have control over the uncertainty surrounding each class and could inadvertently induce a bias that encourages equal numbers of observations for both classes. Despite these potential difficulties, the inclusion of moment uncertainty with existing minimax approaches remains an interesting area of research.

To improve the correctness of this approach, future work should focus on obtaining tighter bounds on the deviation of empirical moments from their true values. This would lead to statistically correct worst-case guarantees, whilst also circumventing the problem of having to use a validation set in order to choose the regularisation terms for the HP-MPM. We mentioned briefly that we have no reason to expect a sparse kernel based solution, making it difficult to handle large datasets. Future work should focus on developing specialised optimisation procedures for updating the weights of the dual variables \(\varvec{\gamma }\). If we are able to implement update schemes similar to those used in SVMs then we should be able to efficiently scale the HP-MPM approach to much larger datasets.