1 Introduction

Gaussian processes (GPs) are key components of many statistical and machine learning models. In a Bayesian setting, they provide a probabilistic approach to model unknown functions, which can subsequently be used to quantify uncertainty in predictions. An introduction and overview of GPs is given in Williams and Rasmussen (2006). In regression tasks, the GP is a popular prior for the unknown regression function, \(f:x \rightarrow y\), due to its nonparametric nature and tractability. It assumes that the function evaluated at any finite set of inputs \((x_1, \ldots , x_N)\) is Gaussian distributed with mean vector \((\mu (x_1), \ldots , \mu (x_N))\) and covariance matrix with elements \(K(x_i, x_j)\), where the mean function \(\mu (\cdot )\) and the positive semi-definite covariance (or kernel) function \(K(\cdot , \cdot )\) represent the parameters of the GP.

While GPs are flexible and have been successfully applied to various problems, limitations exist. First, GP models suffer from a high computational burden, due to the need to invert and store large, dense covariance matrices. Second, typically parametric forms are specified for \(\mu (\cdot )\) and \(K(\cdot , \cdot )\), which crucially determine properties of the regression function, such as spatial correlation, smoothness, and periodicity. This limits the model’s ability to recover changing behavior of the function, e.g. different smoothness levels, across the input space.

To address the computational burden, one approach is to approximate based on multiple GP experts. In this case, the data is partitioned into groups, and within each group, a GP expert specifies the conditional model for the output y given the input x. Scalability is enhanced as each expert only requires inversion of smaller matrices based on subsets of the data. The experts’ predictions can then be combined through a product operation, known as product of experts (PoEs, Tresp, 2000), or a sum operation, known as mixture of experts (MoEs, Jacobs et al., 1991). The PoE approach includes the Bayesian Committee Machine (BCM, Tresp, 2000), generalized PoE (gPoE, Cao & Fleet, 2014), and robust BCM (rBCM, Deisenroth & Ng, 2015). PoEs are motivated by the fact that the product operation maintains Gaussianity and are specifically designed for faster inference of stationary GP models. However, a thorough theoretical analysis (Szabó & Van Zanten, 2019) shows that although some PoE approaches can achieve good posterior contraction rates and asymptotic coverage, this is only in the unrealistic non-adaptive setting. Instead, MoEs work well even in adaptive settings and correspond to sound statistical models that take a weighted average of the experts’ predictions. The weights are defined by the gating network, which probabilistically maps the experts to regions of the input space. The recent work of Trapp et al. (2020) combines both operations to build a sum-product network of GPs.

Beyond approximation, the crucial advantage of MoEs is that model flexibility is greatly enhanced. Specifically, any simplifying assumptions of the experts need only hold for each subset of the data. This allows the model to infer different behaviors, such as smoothness and variability, within different regions, and combine the multiple experts to capture non-stationary, heterogeneity, discontinuities, and multi-modality.

In this paper, our contribution is threefold. First, we construct a novel MoE model that combines the expressive power of deep neural networks (DNNs) and the probabilistic nature of GPs. While powerful, DNNs lack the probabilistic framework and sound uncertainty quantification of GPs, and there has been increased interest in recent years in combining DNNs and GPs to benefit from the advantages and overcome the limitations of each method, see e.g. (Huang et al., 2015; Wilson et al., 2016; Iwata & Ghahramani, 2017; Daskalakis et al., 2020) to name a few. Specifically, we use GP experts for smooth, probabilistic reconstructions of the unknown regression function within each region, while employing DNNs for the gating network to flexibly determine the regions. To further enhance scalability, we combine the distributed approximation of GPs through the MoE architecture with low-rank approximations using an inducing point strategy (Snelson & Ghahramani, 2006). This combination leads to a robust and efficient model which is able to outperform competing models.

Second, we provide a fast and accurate approximation of the proposed deep mixture of sparse GP experts, using a recently introduced method called Cluster–Classify–Regress (CCR, Bernholdt et al., 2019). Finally, a novel connection is made between CCR and optimization algorithms commonly used for MAP estimation of MoEs, which lends credibility to its success as an approximation algorithm in this context. More generally, the connection both provides a framework to potentially further refine the CCR solution through additional iterations of the optimization algorithm and also sheds lights on the MoE model underlying the CCR algorithm, which provides a fast approximation for other MoE architectures as well.

2 Methodology

For i.i.d. data \((y_1, \ldots y_N)\), mixture models are an extremely useful tool for flexible density estimation due to their ability to approximate a large class of densities and their attractive balance between smoothness and flexibility. When additional covariate information is present and the data consists of input–output pairs, \(\{(x_i,y_i)\}_{i=1}^N\), MoEs extend mixtures to achieve flexible conditional density estimation (also known as density regression), where the whole density of the output changes with the inputs. This is achieved by modelling the mixture parameters as functions of the inputs, that is, by defining the gating network, which probabilistically partitions the input space into regions, and by specifying the experts, which characterize the relationship between x and y within each region. This results in flexible framework which has been employed in numerous applications; for a recent overview, see (Gormley & Frühwirth-Schnatter, 2019).

Specifically, the MoE model assumes that outputs are independently generated, for \(i = 1,\ldots , N\), from a mixture:

$$\begin{aligned} \begin{aligned} y_i \mid x_i&\sim \sum _{l=1}^L {w_l}(x_i; {\psi }) \text {N}\left( y_i \mid {f}(x_i; {\theta }_l),\sigma ^{2 }_l\right) ,\\ \end{aligned} \end{aligned}$$
(1)

where \({w_l}(\cdot ; {\psi })\) is the gating network with parameters \({\psi }\); \({f}(\cdot ; {\theta }_l)\) is the regression function for the \(l\text{th}\) expert with parameters \({\theta }_l\); and L is the number of experts. For simplicity, we focus on the case when \(y \in \mathbb {R}\) and employ a Gaussian likelihood for the experts, although this may be generalized for other data types. The gating network \(({w_1}(\cdot ; {\psi }), \ldots , {w_L}(\cdot ; {\psi }))\), which maps the input space \(\mathcal {X} \subseteq \mathbb {R}^d\) to the \(L-1\) dimensional simplex (i.e. \(w_l(x; \psi )\ge 0\) and \(\sum _{l=1}^L w_l(x; \psi ) =1)\), reflects the relevance of each expert at any location \(x \in \mathcal {X}\).

The experts form the building blocks of the mixture model, which are combined to recover general shapes for the conditional density of y given x, that are too complicated to be captured by a single expert. To provide intuition, MoEs can be used when the population consists of (input-dependent) sub-populations, such that within each, a single expert provides a good description of the relationship between y and x. However, in general, the groups may not represent actual sub-populations but rather they are combined to achieve flexibility and approximate a wide class of conditional densities. MoEs can be augmented with a set of allocation variables \({\textbf{z}}=(z_1, \ldots , z_N)\) indicating group or cluster membership, where \(z_i = l\) if the \(i\text{th}\) data point is generated from the \(l\text{th}\) expert, for \(l = 1, \ldots , L\). Specifically, if

$$\begin{aligned} {p(z_i = l \mid x_i) = w_l (x_i, \psi ) \quad \text {and} \quad p(y_i \mid x_i, z_i) = \text {N}\left( y_i \mid f(x_i; \theta _{z_i}),\sigma ^{2 }_{z_i}\right) ,} \end{aligned}$$

then Eq. (1) is recovered after marginalization of \(z_i\):

$$\begin{aligned} {p(y_i \mid x_i)}&{= \sum _{l=1}^L p(y_i, z_i=l \mid x_i) = \sum _{l=1}^L p(z_i=l \mid x_i) p(y_i \mid x_i, z_i=l)} \\&= {\sum _{l=1}^L w_l (x_i, \psi ) \text {N}\left( y_i \mid f(x_i; \theta _{l}),\sigma ^{2 }_{l}\right) .} \end{aligned}$$

Thus, letting \({\textbf{y}}=(y_1,\ldots , y_N)\) and \({\textbf{x}}= (x_1, \ldots , x_N)\), we can define the augmented model as

$$\begin{aligned} p({\textbf{y}}, {\textbf{z}}\mid {\textbf{x}})&= \prod _{i=1}^N p(y_i \mid x_i, z_i) p(z_i \mid x_i) = \prod _{i=1}^N {w_{z_i}} \text {N}(y_i \mid {f}(x_i; {\theta }_{z_i}),\sigma ^{2}_{z_i}) \\&= \prod _{l=1}^L \prod _{i:z_i=l} {w_l} (x_i ; {\psi }) \text {N}(y_i \mid {f}(x_i; {\theta }_{l}),\sigma ^{2}_l). \end{aligned}$$

This augmented version of the model is widely used in inference algorithms (such as expectation-maximization), as direct optimization of the likelihood in Eq. (1) is challenging, due to multi-modality and well-known identifiability issues associated with mixtures. Moreover, the introduction of the allocation variables \({\textbf{z}}\) permits distributed inference across experts.

Various formulations have been proposed in literature for both the experts and gating networks, ranging from simple linear models to flexible nonlinear approaches. Examples include (generalized) linear or semi-linear models (Jordan & Jacobs, 1994; Xu et al., 1995), splines (Rigon & Durante, 2017), neural networks (Bishop, 1994; Ambrogioni et al., 2017), GPs (Tresp, 2001; Rasmussen & Ghahramani, 2002), tree-based classifiers (Gramacy & Lee, 2008), and others. In this work, we combine sparse GP experts with DNN gating networks to flexibly determine regions and provide a probabilistic nonparametric model of the unknown regression function. Figure 1 highlights the advantages of this construction: 1) flexible regions (over quadratic classifiers, e.g (Meeds & Osindero, 2006; Nguyen & Bonilla, 2014; Zhang & Williamson, 2019)) and 2) well-calibrated uncertainty (over DNN experts, e.g (Bishop, 1994; Ambrogioni et al., 2017)), through tighter credible intervals that maintain the desired coverage.

Fig. 1
figure 1

Motivating toy example. Comparison of the true and estimated allocations (first row) and predictions (second row) for the proposed DNN gating network with GP experts (second column), logistic regression (LR) gating network with GP experts (third column), and DNN gating network and experts (fourth column). The DNN gating network recovers the true allocations, and combined with GP experts leads to improved accuracy and uncertainty (details in Sect. 4.1), especially for outlying test points

2.1 Sparse Gaussian process experts

Mixtures of GP experts have proven to be very successful (Tresp, 2001; Rasmussen & Ghahramani, 2002; Meeds & Osindero, 2006; Yuan & Neubauer, 2009; Nguyen & Bonilla, 2014; Gadd et al., 2020). In particular, they overcome limitations of stationary GPs by reducing computational complexity through distributed approximations and allowing different properties of the function for each GP expert to handle challenges, such as discontinuities, non-stationarity, non-Gaussianity and multi-modality. They allow flexible conditional density estimation, going beyond the iid Gaussian error assumption of standard GP regression models.

In this case, one assumes a GP prior on the regression function for each expert with expert-specific hyperparameters \({\theta _l =( \mu _l,\phi _l)}\):

$$\begin{aligned} {f(\cdot ; \theta _l) \sim \text {GP}(\mu _l, K_{\phi _l}),} \end{aligned}$$

where \(\mu _l\) is the mean function of the expert (for simplicity, it assumed to be constant) and \(\phi _l\) are the parameters of the covariance function \(K_{\phi _l}\), whose chosen form and hyperparameters encapsulate properties of the function for each expert. While it is common to use zero mean functions in standard GP models, which is made appropriate by first subtracting the overall mean from the output, in mixtures, we must include a constant mean, as the clustering structure is unknown and the data cannot be centred for each expert. Letting \(f_{l,i} ={f}(x_i; {\theta }_{l})\) denote the \(l\text{th}\) function evaluated at \(x_i\) with \({\textbf{f}}_l = \lbrace f_{l,i} \rbrace _{z_i = l}\), the GP prior implies a multivariate normal prior on the function \({\textbf{f}}_l\) evaluated at the inputs within the \(l\text{th}\) cluster:

$$\begin{aligned} {{\textbf{f}}_l \sim \text {N}( {\varvec{\mu }}_l, \textbf{K}_{l,N_l}),} \end{aligned}$$

where \({\varvec{\mu }}_l\) is a vector with entries \(\mu _l\); \(\textbf{K}_{l,N_l}\) represents the \(N_l \times N_l\) matrix obtained by evaluating the covariance function \(K_{\phi _l}\) at each pair of inputs in the \(l\text{th}\) cluster; and \(N_l\) is number data points in the \(l\text{th}\) cluster. With the Gaussian likelihood, the unknown function can be marginalized, and the likelihood of the data within each cluster is:

$$\begin{aligned} {p({\textbf{y}}_l \mid {\textbf{x}}_l)}&{= \int \prod _{i:z_i=l} \text {N}(y_i \mid f_{l,i} ,\sigma ^{2}_l) \text {N}({\textbf{f}}_l \mid {\varvec{\mu }}_l, \textbf{K}_{l,N_l}) d{\textbf{f}}_l}\\&{= \text {N}({\textbf{y}}_l \mid {\varvec{\mu }}_l, \textbf{K}_{l,N_l}+\sigma ^{2}_l \textbf{I}_{N_l} ),} \end{aligned}$$

where \({\textbf{y}}_l\) and \({\textbf{x}}_l\) contain the outputs and inputs of the \(l\text{th}\) cluster, i.e. \({\textbf{y}}_l = \lbrace y_i \rbrace _{z_i=l}\) and \({\textbf{x}}_l = \lbrace x_i \rbrace _{z_i=l}\).

GP experts are appealing due to their flexibility, intrepretability and probabilistic nature, but they come with an increased computational cost. Specifically, given the allocation variables \({\textbf{z}}\), the GP hyperparameters, which crucially determine the behavior of the unknown function, can be estimated by optimizing the log marginal likelihood:

$$\begin{aligned} {\log (p({\textbf{y}}\mid {\textbf{x}}, {\textbf{z}})) = \sum _{l=1}^L \log \left( \text {N}({\textbf{y}}_l \mid {\varvec{\mu }}_l, \textbf{K}_{l,N_l}+\sigma ^{2}_l \textbf{I}_{N_l} )\right) .} \end{aligned}$$

This requires inversion of \(N_l \times N_l\) matrices. Compared with standard GP models, the cost is reduced from \(\mathcal {O}(N^3)\) to \(\mathcal {O}(\sum _{l=1}^L N_l^3)\), however this can still be expensive, depending on the size of clusters.

To further improve scalability, one can resort to approximate methods for GPs, including sparse GPs (Snelson & Ghahramani, 2006; Titsias, 2009; Bui et al., 2017), predictive processes (Banerjee et al., 2008), basis function approximations (Cressie & Johannesson, 2008), or sparse formulations of the precision matrix (Lindgren et al., 2011; Grigorievskiy et al., 2017; Durrande et al., 2019), among others (see reviews in Chp. 8 of Williams and Rasmussen (2006) and (Heaton et al., 2019)). In the present work, we employ sparse GPs, which augment the model with a set of \(M_l<N_l\) pseudo-inputs \({\tilde{{\textbf{x}}}}_l = ({\tilde{x}}_{l,1}, \ldots , {\tilde{x}}_{l,M_l})\) and pseudo-targets \({\tilde{{\textbf{f}}}}_l= ({\tilde{f}}_{l,1}, \ldots , {\tilde{f}}_{l,M_l})\) representing the \(l\text{th}\) function evaluated at the pseudo-inputs, i.e. \({\tilde{f}}_{l,m} = f({\tilde{x}}_{l,m}; \theta _l)\) for \(m=1, \ldots , M_l\). Then, the key assumption for scalability is that the prior on the regression function within each cluster factorizes given the pseudo-targets:

$$\begin{aligned} {p({\textbf{f}}_l, {\tilde{{\textbf{f}}}}_l \mid {\textbf{x}}_l, {\tilde{{\textbf{x}}}}_l )}&{= p({\textbf{f}}_l \mid {\tilde{{\textbf{f}}}}_l , {\textbf{x}}_l, {\tilde{{\textbf{x}}}}_l ) p({\tilde{{\textbf{f}}}}_l \mid {\tilde{{\textbf{x}}}}_l )}\\&{\approx \prod _{i:z_i=l} p(f_{l,i} \mid {\tilde{{\textbf{f}}}}_l , {\textbf{x}}_l, {\tilde{{\textbf{x}}}}_l) p({\tilde{{\textbf{f}}}}_l \mid {\tilde{{\textbf{x}}}}_l )}. \end{aligned}$$

From properties of GPs, the prior of the pseudo-targets is \({\tilde{{\textbf{f}}}}_l \sim \text {N}({\varvec{\mu }}_l, \textbf{K}_{l,M_l})\), and the augmented model for \({\textbf{f}}_l\) given the pseudo-targets is:

$$\begin{aligned} { p({\textbf{f}}_l \mid {\textbf{x}}_l, {\tilde{{\textbf{x}}}}_l , {\tilde{{\textbf{f}}}}_l) = \prod _{i:z_i=l} \text {N}(f_{l,i} \mid {\widehat{\mu }}_{l,i}, \lambda _{l,i}), } \end{aligned}$$

where

$$\begin{aligned} {\widehat{\mu }}_{l,i}&= \mu _l + (\textbf{k}_{l,M_l,i})^T(\textbf{K}_{l,M_l})^{-1} ({\tilde{{\textbf{f}}}}_l- {\varvec{\mu }}_l), \end{aligned}$$
(2)
$$\begin{aligned} \lambda _{l,i}&= K_{\phi _l}(x_i,x_i)- (\textbf{k}_{l,M_l,i})^T(\textbf{K}_{l,M_l})^{-1}\textbf{k}_{l,M_l,i}, \end{aligned}$$
(3)

with \(\textbf{K}_{l,M_l}\) denoting the \(M_l \times M_l\) matrix with elements \(K_{\phi _l}({\tilde{x}}_{l,j},{\tilde{x}}_{l,h})\) and \(\textbf{k}_{l,M_l,i}\) denoting the vector of length \(M_l\) with elements \(K_{\phi _l}({\tilde{x}}_{l,j}, x_i)\). This corresponds to the fully independent training conditional (FITC) approximation (Quiñonero-Candela & Rasmussen, 2005) (see (Bauer et al., 2016) for a further discussion of FITC). For the Gaussian likelihood, \({\textbf{f}}_l\) can be marginalized and the likelihood of the data points within each cluster also factorizes as:

$$\begin{aligned} p({\textbf{y}}_l \mid {\textbf{x}}_l, {\tilde{{\textbf{x}}}}_l , {\tilde{{\textbf{f}}}}_l) = \prod _{i:z_i=l} \text {N}(y_i\mid {\widehat{\mu }}_{l,i}, \sigma ^{2 }_l +\lambda _{l,i} ). \end{aligned}$$
(4)

After marginalization of the pseudo-targets, the pseudo-inputs \({\tilde{{\textbf{x}}}}_l\) and hyperparameters \((\mu _l, \phi _l)\) can be estimated by optimizing the marginal likelihood:

$$\begin{aligned} \begin{aligned}&\log (p({\textbf{y}}\mid {\textbf{x}}, {\textbf{z}}, {\tilde{{\textbf{x}}}})) = \sum _{l=1}^L \log \left( \text {N}({\textbf{y}}_l \mid {\varvec{\mu }}_l, {\varvec{\Sigma }}_l ) \right) \,, \end{aligned} \end{aligned}$$
(5)

where \({\varvec{\Sigma }}_l = (\textbf{K}_{l,M_l N_l})^T(\textbf{K}_{l,M_l})^{-1} \textbf{K}_{l,M_l N_l}+ {\varvec{\Lambda }}_l +\sigma ^{2}_l \textbf{I}_{N_l}\); \(\textbf{K}_{l,M_l N_l}\) is the \(M_l \times N_l\) matrix with columns \(\textbf{k}_{l,M_l,i}\); and \({\varvec{\Lambda }}_l\) is the diagonal matrix with diagonal entries \(\lambda _{l,i}\). This strategy allows us to reduce the complexity to \(\mathcal {O}(\sum _{l=1}^L N_l M_l^2)\).

2.2 Deep neural gating networks

The choice of gating network is crucial to flexibly determine the regions and accurately approximate the conditional densities. Indeed, various proposals exist to combine GP experts with different gating networks, and approaches to specify the gating network can be divided into generative or discriminative classifiers. Generative models specify a joint mixture model for y and x and typically assume a Gaussian distribution for the inputs within each class, resulting in linear or quadratic discriminant analysis, e.g. (Meeds & Osindero, 2006; Chen et al., 2014; Zhang & Williamson, 2019). To allow more flexibility in the regions, Yuan and Neubauer (2009); Gadd et al. (2020) extend this by considering a mixture of Gaussians within each class. Discriminative models, on the other hand, avoid modelling the inputs and focus on the conditional density of interest by defining the gating network directly. Discriminative models include linear classifiers (Nguyen & Bonilla, 2014) and tree-based classifiers (Gramacy & Lee, 2008), which may result in rigid assumptions on the partition structure of the input space. The latter, for example, assumes axis-aligned rectangular regions, and while more flexible partitioning approaches exist, such as Voronoi tessellations (Pope et al., 2019), they come at an increased cost. Other flexible proposals include GP classifiers (Tresp, 2001) and kernel-based methods (Rasmussen & Ghahramani, 2002), but these similarly suffer from the trade-off between flexibility and cost.

In this work, we focus on DNNs, which are known to be universal for classification (Szymanski & McCane, 2012). DNNs flexibly determine the regions, removing any rigid assumptions, without increasing the computational cost. Specifically, we define the gating network by a feedforward DNN with a softmax output:

$$\begin{aligned} {w_l}(x; {\psi }) = \frac{\exp (h_l(x; {\psi }))}{\sum _{j=1}^L \exp (h_j(x; {\psi }))} \,, \end{aligned}$$

where \(h_l\) is the \(l{\textrm{th}}\) component of \(h: \mathbb {R}^d \rightarrow \mathbb {R}^{L}\), defined by

$$\begin{aligned} h(~\cdot ~; {\psi }) = \eta _{J}( \eta _{J-1}( \cdots \eta _{1}( ~\cdot ~; {\psi }_{1}) \cdots ; {\psi }_{J-1}); {\psi }_{J}) \,, \end{aligned}$$

with \(\eta _{j}: \mathbb {R}^{d_{j-1}} \rightarrow \mathbb {R}^{d_j}\) (\(d_0=d\), \(d_J=L\)) the \(j\textrm{th}\) layer of a neural network

$$\begin{aligned} \eta _{j}(~\cdot ~; {\psi }_{j}): x \mapsto \eta _{j}(x; {\psi }_{j}) = \textrm{ReLU}(A_{j}x + b_{j})\,, \end{aligned}$$

where \(\textrm{ReLU}(x) = \textrm{max} \{0, x\}\) is the element-wise rectifier; \({\psi }_{j} = \{A_{j}, b_{j}\}\) comprises the weights \(A_{j} \in \mathbb {R}^{d_{j} \times d_{j-1}}\) and biases \(b_j \in \mathbb {R}^{d_j}\) for level \(j = 1, \dots , J\); and \(\psi = (\psi _1,\ldots , \psi _J)\) collects all the parameters of the DNN.

Deep neural gating networks have been used in literature but are typically combined with DNN experts (Bishop, 1994; Ambrogioni et al., 2017). The mixture density network (MDN, Bishop, 1994) uses this gating network but parametrizes both the regression function and variance of the Gaussian model in Eq. (1) by DNNs. This offers considerable flexibility beyond standard DNN regression, but significant valuable information can be gained with GP experts. Specifically, as the number of data points in each cluster is data-driven, DNN experts may overfit due to small cluster sizes. In addition, DNNs are susceptible to adversarial attacks and tend to be overconfident even when predictions are incorrect (Szegedy et al., 2013). Moreover, DNNs may underestimate variability, especially for test data which are quite different from the observed data (Nguyen et al., 2015) (see Fig. 1). Instead, GP experts probabilistically model the local regression function, avoiding overfitting and providing well-calibrated uncertainty quantification.

3 Inference

While Markov chain Monte Carlo algorithms provide full Bayesian inference for MoEs, they rely on expensive Gibbs samplers (Rasmussen & Ghahramani, 2002; Meeds & Osindero, 2006; Gadd et al., 2020), which alternately sample the allocation variables \({\textbf{z}}\) given the model parameters (that is, the gating and experts parameters) and the model parameters given \({\textbf{z}}\). To alleviate the cost, Zhang and Williamson (2019) use importance sampling and parallelization. However, accurate estimation requires the number of important samples to be exponential in the Kullback–Leibler divergence between the proposal and posterior (Chatterjee & Diaconis, 2018); this can be prohibitive for MoEs due to the massive dimension of the partition space. A popular alternative strategy is approximate inference, where the Gibbs sampling steps for allocation variables and the model parameters are replaced, respectively, with either an expectation or maximization step (Kurihara & Welling, 2009). An expectation-expectation strategy corresponds to mean-field variational Bayes for MoEs (Yuan & Neubauer, 2009). While popular expectation-maximization (EM) algorithms can be used for maximum a posteriori (MAP) estimation as in Tresp (2001), we instead focus on the faster maximization-maximization (MM) strategy. For generative mixtures of GP experts, an MM algorithm was developed in Chen et al. (2014).

Our MM algorithm provides MAP estimation for the augmented model by optimising the augmented posterior:

$$\begin{aligned}&{\pi (\psi , \theta , \sigma ^2, {\textbf{z}}, {\tilde{{\textbf{f}}}}\mid {\textbf{y}}, {\textbf{x}}, {\tilde{{\textbf{x}}}})} {\propto p({\textbf{y}},{\textbf{z}}, {\tilde{{\textbf{f}}}} \mid {\textbf{x}}, {\tilde{{\textbf{x}}}},\psi , \theta , \sigma ^2 ) \pi (\psi , \theta , \sigma ^2)}\\&\quad {= \prod _{l=1}^L \left[ \prod _{i:z_i=l} w_l(x_i;\psi ) \text {N}(y_i\mid {\widehat{\mu }}_{l,i}, \sigma ^{2 }_l +\lambda _{l,i} ) \right] \text {N}({\tilde{{\textbf{f}}}}_l \mid {\varvec{\mu }}_l, \textbf{K}_{l,M_l}) \pi (\psi , \theta , \sigma ^2), } \end{aligned}$$

where \(\theta =(\theta _1,\ldots , \theta _L)\), \(\sigma ^2 =(\sigma ^2_1,\ldots , \sigma ^2_L)\), and \({\tilde{{\textbf{f}}}} =({\tilde{{\textbf{f}}}}_1,\ldots , {\tilde{{\textbf{f}}}}_L)\) collect the expert-specific GP hyperparameters, noise variance, and pseudo-targets, respectively. In the following, we consider vague priors \(\pi ({\psi }, {\theta }, \sigma ^2) \propto 1\), and thus equivalently optimise the augmented log posterior:

$$\begin{aligned} \begin{aligned}&\log (\pi ({\psi }, {\theta }, \sigma ^2, {\textbf{z}}, {\tilde{{\textbf{f}}}}\mid {\textbf{y}}, {\textbf{x}}, {\tilde{{\textbf{x}}}}) ) =\text {const.} + \sum _{l=1}^L \log ( \text {N}({\tilde{{\textbf{f}}}}_l \mid {\varvec{\mu }}_l, \textbf{K}_{l,M_l}))\\&\quad + \sum _{l=1}^L \sum _{i:z_i=l} \log (w_{l}(x_i; {\psi })) + \log ( \text {N}(y_i \mid {\widehat{\mu }}_{l,i},\sigma ^{2 }_l +\lambda _{l,i})).\\ \end{aligned} \end{aligned}$$

For GP experts, this provides maximum marginal likelihood estimation or type II MLE of the GP hyperparameters and noise variance \(({\theta }_l, \sigma ^{2}_l)\), as the GP regression functions are marginalized. For sparse GP experts, we have the additional parameters \(({\tilde{{\textbf{x}}}}_l, {\tilde{{\textbf{f}}}}_l)\). The pseudo-inputs \({\tilde{{\textbf{x}}}}_l\) are treated as hyperparameters to be optimized, and while pseudo-targets \({\tilde{{\textbf{f}}}}_l\) can be analytically integrated, we also estimate \({\tilde{{\textbf{f}}}}_l\) with its posterior mean for faster inference (see Sect. 3.1 for further details). The MM algorithm is an iterative conditional modes algorithm (Kittler & Föglein, 1984) (can also be viewed as a coordinate ascent method) that alternates between optimizing the allocation variables \({\textbf{z}}\) and the model parameters. It is guaranteed to never decrease the log posterior of the augmented model and therefore will converge to a fixed point (Kurihara & Welling, 2009). However, it is susceptible to local maxima, which can be alleviated with multiple restarts of random initializations.

3.1 Maximization–maximization

Our MM algorithm is defined by iterating over the following two steps: optimize the (i) latent cluster allocation variables and (ii) gating and expert parameters

$$\begin{aligned} \begin{aligned}&\text {(i)}~~ \textbf{z} =\mathop {\textrm{argmax}}\limits _{{\textbf{z}} \in \lbrace 1, \ldots , L \rbrace ^N } \log \pi ({\textbf{z}} \mid {\psi , \theta , \sigma ^2, {\tilde{{\textbf{f}}}}, {\tilde{{\textbf{x}}}}}, {\textbf{x}}, {\textbf{y}}) \,,\\&\text {(ii)} ~~ {(\psi , \theta , \sigma ^2, {\tilde{{\textbf{f}}}}, {\tilde{{\textbf{x}}}})} = \mathop {\textrm{argmax}}\limits _{{(\psi , \theta , \sigma ^2, \tilde{{\textbf{f}}}, {\tilde{{\textbf{x}}}})}} \log \pi ({\psi , \theta , \sigma ^2, \tilde{{\textbf{f}}} \mid {\tilde{{\textbf{x}}}}}, \textbf{z}, {\textbf{x}}, {\textbf{y}}). \end{aligned} \end{aligned}$$
(6)

We note that while the pseudo-targets \({\tilde{{\textbf{f}}}}\) can be marginalized, the full conditional for \({\textbf{z}}\) in the first step of Eq. (6) would be:

$$\begin{aligned} { \pi (\textbf{z} \mid \psi , \theta , \sigma ^2, {\tilde{{\textbf{x}}}} , {\textbf{x}}, {\textbf{y}}) }&={ \int \pi (\textbf{z}, {\tilde{{\textbf{f}}}} \mid \psi , \theta , \sigma ^2, {\tilde{{\textbf{x}}}} , {\textbf{x}}, {\textbf{y}}) d{\tilde{{\textbf{f}}}}}\\&{\propto p(\textbf{z} \mid \psi , {\textbf{x}}) p({\textbf{y}}\mid {\textbf{x}}, {\textbf{z}}, {\tilde{{\textbf{x}}}}, \theta , \sigma ^2),} \end{aligned}$$

with optimal allocation:

$$\begin{aligned} \textbf{z}&= {\mathop {\textrm{argmax}}\limits _{\textbf{z} \in \lbrace 1, \ldots , L \rbrace ^N } \log (\pi (\textbf{z} \mid \psi , \theta , \sigma ^2, {\tilde{{\textbf{x}}}} , {\textbf{x}}, {\textbf{y}}))} \nonumber \\&=\mathop {\textrm{argmax}}\limits _{{\textbf{z}} \in \lbrace 1, \ldots , L \rbrace ^N } \sum _{i=1}^N \log ({w}_{z_i}(x_i; {\psi })) + \sum _{l=1}^L \log \left( \text {N}({\textbf{y}}_l \mid {\varvec{\mu }}_l, {\varvec{\Sigma }}_l )\right) \, . \end{aligned}$$
(7)

Recall from Eq. (5) that \({\varvec{\Sigma }}_l\) is a full covariance matrix; thus, the second term in Eq. (7) does not factorize across \(i=1,\ldots , N\) and direct optimization over the space \(\lbrace 1, \ldots , L \rbrace ^N\) is infeasible. An iterative conditional modes algorithm could be employed which cycles over each data point \(i=1,\ldots , N\), optimizing \(z_i\) based on the conditional Gaussian likelihood given the data points currently allocated to each cluster. For each data point, the conditional Gaussian likelihood \(p(y_i \mid z_i = l, {\textbf{z}}_{-i}, {\textbf{y}}_{-i})\) must be computed (where \({\textbf{z}}_{-i}\) and \({\textbf{y}}_{-i}\) denote \({\textbf{z}}\) and \({\textbf{y}}\) with the \(i\text{th}\) element removed), requiring rank one updates to the inverse covariance matrices for every \(l=1,\ldots , L\).

However, we can significantly reduce the computational cost by also estimating \({\tilde{{\textbf{f}}}}\). In this case, the full conditional for \({\textbf{z}}\) in (i) of Eq. (6) factorizes across \(i=1,\ldots , N\):

$$\begin{aligned} { \pi (\textbf{z} \mid \psi , \theta , \sigma ^2, {\tilde{{\textbf{f}}}}, {\tilde{{\textbf{x}}}} , {\textbf{x}}, {\textbf{y}}) }&{\propto p(\textbf{z} \mid \psi , {\textbf{x}}) p({\textbf{y}}\mid {\textbf{x}}, {\textbf{z}}, {\tilde{{\textbf{x}}}}, {\tilde{{\textbf{f}}}}, \theta , \sigma ^2)}\\&={ \prod _{i=1}^N w_{z_i}(x_i; \psi ) \text {N}(y_i \mid {\widehat{\mu }}_{z_i,i},\sigma ^{2 }_{z_i} +\lambda _{z_i,i}).} \end{aligned}$$

Thus, the allocation can be done in parallel across the N data points:

$$\begin{aligned} \begin{aligned} z_i = \mathop {\textrm{argmax}}\limits _{{z_i} \in \lbrace 1, \ldots , L \rbrace } \log (w_{z_i}(x_i; {\psi })) + \log (\text {N}(y_i\mid {{\widehat{\mu }}_{z_i,i}}, \sigma ^{2 }_{z_i} +\lambda _{z_i,i})). \end{aligned} \end{aligned}$$
(8)

In the second step of the MM algorithm in Eq. (6), we perform two sub-steps, first optimizing the gating network and expert parameters with \({\tilde{{\textbf{f}}}}\) marginalized, and then, given those optimal values, estimating \({\tilde{{\textbf{f}}}}\):

$$\begin{aligned} \begin{aligned}&\text {(a)} ~~ {(\psi , \theta , \sigma ^2, {\tilde{{\textbf{x}}}}) = \mathop {\textrm{argmax}}\limits _{(\psi , \theta , \sigma ^2, {\tilde{{\textbf{x}}}})} \log \pi ({\psi , \theta , \sigma ^2 \mid {\tilde{{\textbf{x}}}}}, \textbf{z}, {\textbf{x}}, {\textbf{y}}),}\\&\text {(b)}~~ { {\tilde{{\textbf{f}}}} = \mathop {\textrm{argmax}}\limits _{{\tilde{{\textbf{f}}}}} \log \pi ({\tilde{{\textbf{f}}}} \mid {\tilde{{\textbf{x}}}}, \textbf{z}, {\textbf{x}}, {\textbf{y}}, \psi , \theta , \sigma ^2)}. \end{aligned} \end{aligned}$$
(9)

In (a) of Eq. (9), the full conditional of \((\psi , \theta , \sigma ^2)\) is:

$$\begin{aligned} \pi (\psi , \theta , \sigma ^2 \mid {\tilde{{\textbf{x}}}}, \textbf{z}, {\textbf{x}}, {\textbf{y}})&\propto p({\textbf{z}}\mid {\textbf{x}}, \psi ) p({\textbf{y}}\mid {\textbf{x}}, {\tilde{{\textbf{x}}}}, {\textbf{z}}, {\tilde{{\textbf{f}}}}, \theta , \sigma ^2 ) \\&= \prod _{l=1}^L \left[ \prod _{i:z_i=l} w_l(x_i;\psi ) \right] \text {N}({\textbf{y}}_l\mid {\varvec{\mu }}_l, {\varvec{\Sigma }}_l). \end{aligned}$$

Thus, optimization of the gating network and expert parameters can be done in parallel, both between each other as well as across \(l=1,\ldots , L\) for the experts. Specifically, the optimal gating network are:

$$\begin{aligned} \begin{aligned} {\psi } = \mathop {\textrm{argmax}}\limits _{{\psi \in \Psi }}\sum _{l=1}^L\sum _{i: z_i=l} h_{l}(x_i; {\psi }) - \sum _{i=1}^N \log \left( \sum _{l=1}^L \exp (h_l(x_i; {\psi })) \right) , \end{aligned} \end{aligned}$$
(10)

and expert parameters (GP hyperparameters, noise variance, and pseudo-inputs) are estimated by optimizing the log marginal likelihood in Eq. (5):

$$\begin{aligned} \begin{aligned} ({\theta }_l, \sigma ^{2}_l, {\tilde{{\textbf{x}}}}_l) = \mathop {\textrm{argmax}}\limits _{\theta \in {\Theta }, \sigma ^2 \in \mathbb {R}_+, {\tilde{{\textbf{x}}}} \in \mathbb {R}^{M_l \times d} } \log \left( \text {N}({\textbf{y}}_l \mid {\varvec{\mu }}_l, {\varvec{\Sigma }}_l )\right) \,. \end{aligned} \end{aligned}$$
(11)

Then, in (b) of Eq. (9), the full conditional for \({\tilde{{\textbf{f}}}}_l\) is:

$$\begin{aligned} \pi ({\tilde{{\textbf{f}}}}_l \mid {\tilde{{\textbf{x}}}}_l, {\textbf{x}}, {\textbf{y}}, \theta , \sigma ^2)&\propto p({\tilde{{\textbf{f}}}}_l \mid {\tilde{{\textbf{x}}}}_l, \theta , \sigma ^2) p({\textbf{y}}\mid {\tilde{{\textbf{f}}}}_l , {\tilde{{\textbf{x}}}}_l, {\textbf{x}}, {\textbf{y}}, \theta , \sigma ^2)\\&= \text {N}({\tilde{{\textbf{f}}}}_l \mid {\varvec{\mu }}_l, \textbf{K}_{l,M_l}) \prod _{i:z_i=l} \text {N}(y_i\mid {{\widehat{\mu }}_{l,i}}, \sigma ^{2 }_{l} +\lambda _{l,i}), \end{aligned}$$

which, following standard derivations, is shown to be Gaussian with covariance matrix \(\textbf{K}_{l,M_l} (\textbf{Q}_{l,M_l} )^{-1} \textbf{K}_{l,M_l}\) and mean:

$$\begin{aligned}&\text {E}[ {\tilde{{\textbf{f}}}}_l \mid {\theta }_l, \sigma ^{2}_l, {\tilde{{\textbf{x}}}}_l, {\textbf{y}}_l, {\textbf{x}}_l ] \nonumber \\&\quad ={\varvec{\mu }}_l + \textbf{K}_{l,M_l} (\textbf{Q}_{l,M_l} )^{-1} \left( \textbf{K}_{l, M_l N_l} ( {\varvec{\Lambda }}_l +\sigma ^{2}_l \textbf{I}_{N_l})^{-1}({\textbf{y}}_l- {\varvec{\mu }}_l)\right) , \end{aligned}$$
(12)

where \(\textbf{Q}_{l,M_l} = \textbf{K}_{l,M_l} + \textbf{K}_{l,M_l N_l} ( {\varvec{\Lambda }}_l +\sigma ^{2}_l \textbf{I}_{N_l})^{-1} (\textbf{K}_{l,M_l N_l} )^T\). Thus, the optimal \({\tilde{{\textbf{f}}}}_l\) in (b) of Eq. (8) is the posterior mean in Eq. (12).

3.2 A fast approximation: CCR

The MM algorithm iterates between clustering in Eq. (8) and in parallel classification in Eq. (10) and regression in Eqs. (11) and (12). This closely resembles the CCR algorithm recently introduced in Bernholdt et al. (2019). The important differences are that (1) CCR is a one pass algorithm that does not iterate between the steps and (2) CCR approximates the clustering in the first step of the MM algorithm by (a) careful re-scaling the data to emphasize the output y in relation to x and (b) subsequently applying a fast clustering algorithm, e.g. K-means (Hartigan & Wong, 1979), Gaussian mixture model (GMM, McLachlan & Basford, 1988), or DB-scan (Ester et al., 1996), to jointly cluster the rescaled (yx). We also note that the original formulation of CCR performs an additional clustering step so that the allocation variables used by the regression correspond to the prediction of the classifier; this is equivalent to the clustering step of the MM algorithm in Eq. (8) including only the term associated to the gating network.

This novel connection sheds light on the MoE model underlying the CCR algorithm and allows us to view CCR as a fast, one-pass approximation to the MM algorithm for MoEs. Therefore, we can use CCR to construct a fast approximation of the proposed deep mixture of sparse GP experts. Moreover, this connection can be used to obtain a fast approximation to other discriminative MoE architectures. As shown in the following section, CCR provides a good, fast approximation for many numerical examples. If extra computational resources are available, the CCR solution can be improved through additional MM iterations (i.e. it provides a good initialization for the MM algorithm). However, in our examples, we notice that the potential for further improvement is limited. We find that MM with random initialization can also produce a reasonable estimate in many examples after two iterations; we refer to this algorithm as MM2r. It is fast, but we will see that it takes approximately 2–3 times longer than CCR.

3.3 Complexity considerations

Suppose we parameterize the DNN with \(p_c\) parameters, and each of the sparse GP experts is approximated with \(M_l\) pseudo-inputs. The MM algorithm described in Sect. 3.1 incurs a cost per iteration of clustering the N points given the current set of parameters (Eq. 6). This cost is \(\mathcal {O}(N\sum _{l=1}^L M_l^2)\), and it is parallel in N. The algorithm also incurs a cost per iteration of classification with N points and L regressions using \(N_1, \dots , N_L\) points, where \(N= \sum _{l=1}^LN_l\) in Eq. (6). These operations can also be done in parallel. The cost for the classification is \(\mathcal {O}(Np_c)\) assuming that the number of epochs for training is \(\mathcal {O}(1)\). The cost for the regressions is \(\mathcal {O}(\sum _{l=1}^L M_l^2 N_l)\). Hence the total cost for Eq. (6) is \(\mathcal {O}(N p_c + \sum _{l=1}^L M_l^2 N_l)\), which can be roughly bounded by \(\mathcal {O}(N P_\textrm{max})\), where \(P_\textrm{max} = \textrm{max}\{p_c, M_1^2,\dots , M_L^2\}\). Randomly initialized MM cannot be expected to provide reasonable results after one pass, however with sparse GP models the first iteration provides a significant improvement which is also sometimes reasonable. Ignoring parallel considerations, the total cost for 2-pass MM (MM2r) is \(\mathcal {O}(2 N p_c + \sum _{l=1}^L M_l^2 (2N_l+N))\).

For CCR, the cost is the same for the second step Eq. (6), while the first step is replaced with a chosen clustering algorithm, e.g. GMM, which incurs a cost of \(\mathcal {O}(NL)\). The latter iterates between steps which can be parallelized in different ways. The total cost of CCR is hence \(\mathcal {O}(N (p_c+L) + \sum _{l=1}^L M_l^2 N_l) = \mathcal {O}(N P_\textrm{max})\). This is a one pass algorithm, which often provides acceptable results. The overhead for MM2r vs. CCR for our model is then roughly \(\mathcal {O}((P_\textrm{max}-L)N)\). More precisely it is \(\mathcal {O}(N (p_c-L) + \sum _{l=1}^L M_l^2 (N+N_l) )\).

3.4 Prediction

There are two approaches that can be employed to predict \(y^*\) at a test value \(x^*\). Hard allocation based prediction is based on the single best regression/expert:

$$\begin{aligned} z^* =\mathop {\textrm{argmax}}\limits _{z^* \in \lbrace 1, \ldots , L \rbrace } \log (w_{z^*}(x^*; {\psi })) \, , \,\,\, y^* = {\widehat{\mu }}_{z^*}(x^*) \, , \end{aligned}$$
(13)

where (similiar to Eq. (2)), the GP expert’s prediction is:

$$\begin{aligned} {\widehat{\mu }}_{z^*}(x^*) = \mu _l + (\textbf{k}_{l,M_l,x^*})^T(\textbf{K}_{l,M_l})^{-1} ({\tilde{{\textbf{f}}}}_l- {\varvec{\mu }}_l), \end{aligned}$$

with \(\textbf{k}_{l,M_l,x^*}\) denoting the vector of length \(M_l\) with elements \(K_{\phi _l}({\tilde{x}}_{l,j}, x^*)\). Whereas soft allocation based prediction is given by a weighted average

$$\begin{aligned} y^* = \sum _{l=1}^L w_{l}(x^*; {\psi }) {\widehat{\mu }}_{l}(x^*) \,. \end{aligned}$$
(14)

Soft-allocation may be preferred in cases when there is not a clear jump in the unknown function, thus allowing us to smooth the predictions in regions where the classifier is unsure. Similarly, the variance or density of the output or regression function can also be computed at any test location to obtain measures of uncertainty in our predictions.

Hard allocation based density estimation delivers a Gaussian approximation for a test value \(x^*\),

$$\begin{aligned} {\widehat{p}}(y\mid x^*, z^*=l) = \text {N}\left( y\mid {\widehat{\mu }}_{l}(x^*),{\widehat{\sigma }}^{2}_l(x^*) \right) \,, \end{aligned}$$

with allocation \(z^*\) as in Eq. (13), and where \({\widehat{\sigma }}^{2}_l(x^*) = \lambda _{l, x^*} + \sigma ^{2}_l + (\textbf{k}_{l,M_l,x^*})^T (\textbf{Q}_{l,M_l})^{-1} \textbf{k}_{l,M_l,x^*}\) and \(\lambda _{l,x^*}\) is computed as in Eq. (3) at \(x^*\). Soft allocation based density estimation delivers a mixture of Gaussians for a test value \(x^*\), similar to Eq. (14):

$$\begin{aligned} {\widehat{p}}(y\mid x^*) = \sum _{l=1}^L w_{l}(x^*; {\psi }) \text {N}(y\mid {\widehat{\mu }}_l(x^*),{\widehat{\sigma }}^{2}_l(x^*) ). \end{aligned}$$

In cases when the density of the output may be multi-modal, looking at point predictions alone is not useful. In this setting, the soft density estimates are preferred, allowing one to capture and visualise the multi-modality.

4 Numerical experiments

In this section, we perform a range of experiments to highlight the flexibility of our model and compare the accuracy and speed of the MM and CCR algorithms. For the sparse GP experts, we use the isotropic squared exponential covariance function with a variable number of inducing points \(M_l\) based on the cluster sizes (Burt et al., 2020; Nieman et al., 2021). The pseudo-inputs \({\tilde{{\textbf{x}}}}^l\) are initialized via K-means and the GP hyperparameters are initialized based on the scale of the data. The number of experts L is determined apriori using the Bayesian information criterion (BIC) to compare the GMM clustering solutions of the rescaled (yx) across different values of L. The choice of GMM is motivated by allowing elliptically-shaped clusters in the rescaled space, while keeping cost low (e.g. if compared to the K-means). In the “Appendix 1”, we instead consider using cross-validation and examine the log-likelihood on the held-out data to select the number of components, but we found that this significantly increases run time compared to BIC, with similar performance metrics. A table reporting the number of experts can also be found in “Appendix 1”. The architecture of the DNN gating network is chosen with the use of Auto-Keras (Jin et al., 2019) and a quadratic regularization is used with a value of 0.0001 for the penalization parameter. The adaptive stochastic gradient descent solver Adam (Kingma & Ba, 2014) is used to optimize the model weights and biases, with a validation fraction of 0.1 and a maximum of 500 epochs. The GPy package (GPy, 2012) and Keras (Chollet, 2015) were used to train the experts and gating network respectively.

For each experiment we performed 5-fold cross-validation with random shuffling of the data. The experiments are executed on Intel Core i7 (I7-7820HQ) with 4 cores and with 16GB of RAM.

4.1 Simulations

As a motivating toy example, we generate inputs within each cluster \(l = 1, 2\) from a noisy 2D spiral and outputs \(y = (2\,l -1) (x_1^2 + x_2^2) + \epsilon\). Figure 1 demonstrates that flexible DNN gating networks, over quadratic classifiers (e.g. logistic regression (LR)), are required to recover the true allocations. When combined with GP experts, this leads to improved accuracy (\(R^2=98.74\%, \, 86.98\%, \, 97.04\%\) for DNN+GP, LR+GP, MDN, respectively) and tighter credible intervals (CI) that maintain the desired coverage (average CI length = \(0.25,\, 1.36,\, 0.41\) and empirical coverage (EC) = \(95\%,\, 100\%,\, 98.7\%\) for DNN+GP, LR+GP, MDN, respectively). In particular, for outlying test points (red squares in Fig. 1), MDN provides highly overconfident inaccurate predictions, whereas predictions and CIs are more accurate and realistic for the proposed DNN+GP.

4.2 Datasets

Our experiments range from small to large datasets of varying dimensions and complexity and mainly aim to highlight the model’s capability to flexibly estimate the conditional density and capture issues such as discontinuity, non-stationarity, heteroscedasticity, and multi-modality that challenge standard GP models. First, the Motorcycle dataset (Silverman, 1985) consists of \(N=133\) measurements with \(d=1\). Second, the NASA dataset (Gramacy & Lee, 2008) comes from a computer simulator of a NASA rocket booster vehicle with \(N=3167\); we focus on modelling the lift force as a function of the speed (mach), the angle of attack (alpha), and the slide-slip angle (beta), i.e. \(d=3\). Our third experiment is the Higdon function (Higdon, 2002; Gramacy & Lee, 2008); \(\mathcal {X}=[0,20]\) and \(N=1000\). Next, \(N=10,000\) points are generated from the Bernholdt function (Bernholdt et al., 2019); in this case, \(\mathcal {X}=[-4,10]^2\) and \(f(x) = g(x_1)g(x_2)\), where g(x) is the piece-wise smooth function studied in Monterrubio-Gómez et al. (2020). The kin40k dataset (Seeger et al., 2003) is a popular benchmark for GP regression methods and consists of \(N = 40,000\) points that describe the location of a robotic arm as a highly nonlinear function of control input with \(d = 8\).

4.2.1 \(\chi\) dataset

A tokamak is a device which uses magnetic fields to confine hot plasma in the shape of a torus. It is the leading candidate for production of controlled thermonuclear power, for use in a prospective future fusion reactor (Greenwald, 2016).

One dimensional radial transport modeling (Park et al., 2017) using theory-based models such as GLF23 (Waltz et al., 1997), MMM95 (Bateman et al., 1998), and TGLF (Staebler et al., 2007) plays an essential role in interpreting experimental data and guiding new experiments for magnetically confined plasmas in tokamaks. Turbulent transport resulting from micro-instabilities have a strong nonlinear dependency on the temperature and density gradients. One of the key characteristics is a sharp increase of turbulent flux as the gradient of temperature increases beyond a certain critical value. This leads to a highly nonlinear and discontinuous function of the inputs.

Here we consider an analytical stiff transport model that describes turbulent ion energy transport in tokamak plasmas (Bernholdt et al., 2019; Janeschitz et al., 2002):

$$\begin{aligned} \chi = S (R T'/T - (R T'/T)_\textrm{crit})^\alpha H\left( \left| \frac{(R T'/T)}{(R T'/T)_\textrm{crit}} \right| - 1\right) \,, \end{aligned}$$
(15)

where \(\chi\) is ion thermal diffusivity, \(H(\cdot )\) is the Heaviside function, R is the major radius, and \(T'\) is the radial derivative of ion temperature. The normalized critical gradient \((R T'/T)_\textrm{crit}\) of ion temperature is calculated using IFS/PPPL model (Kotschenreuther et al., 1995), which is a nonlinear function of electron density (\(n_e\)), electron and ion temperatures (\(T_e,T\)), safety factor (q), magnetic shear (\({\hat{s}}\)), effective charge (\(Z_\textrm{eff}\)) and the normalized gradient of ion temperature \((RT'/T)\) and density \((Rn'/n)\). It is assumed that \(S=1\) and \(\alpha =1\). Considering \((T, T')\) and \((n, n')\) as two input parameters each, this gives a total of 10 inputs \(x \in \mathbb {R}^{10}\). The output (15) is \(y \in \mathbb {R}_+\). Basic primitive model inputs are chosen uniformly at random from a hypercube \(\omega \in [0,1]^{17}\), which then give rise to realistic inputs \(x \in \mathbb {R}^{10}\), which concentrate on a manifold in the ambient space. See (Kotschenreuther et al., 1995) for details of the model, and Fig. 2 for visualization of the input distribution histogram.

The data consists of \(N=150,000\) simulations from (15). This model presents a challenge for the competing MoE methods, due to input dimensionality and data size, and for competing PoE methods, due to the sharp nonlinearities and discontinuities. It therefore serves to highlight the value of our method. Results appear in the bottom row of the tables.

Fig. 2
figure 2

Input distribution marginals of a \(\chi\) dataset

4.3 Results

For our model, we compare the MM algorithm with the fast CCR and MM2r approximations. The MM algorithm is initialized at the CCR solution and iterates until the reduction in the \(R^2\) is below a threshold of 0.0001 or a maximum number of iterations is reached. Competing models are two product of GP experts models, gPoE (Cao & Fleet, 2014), RBCM (Deisenroth & Ng, 2015), a mixture of GP experts, FastGP (Nguyen & Bonilla, 2014), and MDN (Bishop, 1994), as well as a Bayesian treed-based model (BART Chipman et al., 2010). Treed GP models (Gramacy & Lee, 2008) were also considered, although the cost was so much larger that the method is not competitive (see Table 4). We also report results for other methods, including ORTHNAT (Salimbeni et al., 2018), PPGPR (Jankowiak et al., 2020b), DSPP (Jankowiak et al., 2020a), and Deep GPs (Damianou & Lawrence, 2013).

First, we observe that CCR has similar or improved test accuracy compared to MM2r (Table 1), and CCR reduces wall-clock time (Table 2) by a factor of 2–3 for all experiments. The CCR solution can be further refined through MM iterations (Table 1); however this comes at a higher computational cost. Moreover, in practice we observe little to no improvement in accuracy with further MM iterations, across all datasets.

Compared with state-of-the-art GP, neural network, and tree-based benchmarks, our model has the highest test accuracy in almost all the experiments considered; the one exception is RBCM for Motorcycle, where the accuracy is slightly higher, but well within the standard deviation. For the smaller datasets, CCR is slightly slower than the PoE models (gPoE and RBCM) and FastGP, however this is offset by the improved accuracy. CCR is substantially faster than all competitors for the \(\chi\) model, with \(d=10\) and (relatively) big data \(N=150,000\).

Table 1 \(R^2\) test accuracy (%) on the five datasets for our model with CCR, CCR-MM, and MM2r algorithms and MDN, gPoE, RBCM, and FastGP benchmarks
Table 2 Mean wall-clock time (in seconds) on the five datasets for our model with CCR, CCR-MM, and MM2r algorithms and MDN, gPoE, RBCM, and FastGP benchmarks
Table 3 Empirical coverage (\(\text {EC}_{95}\)) and average length of 95% CIs (\(\bar{\text {CI}}_{95}\)), averaged over 5-folds
Fig. 3
figure 3

a Predictions (based on soft and hard allocations) with our model for the Motorcycle dataset, with two standard deviations and soft allocation based density estimates; b a slice representing the density estimate given \(x^* = -\,0.478\)

Lastly, we highlight that our model provides uncertainty quantification (as well as density estimation which can capture multi-modality in the predictions) for both the unknown function and predictions. Figure 3 displays the soft and hard allocation predictions together with the respective 2-\(\sigma\) CIs for the Motorcycle dataset. The model is able to recover the apparent non-stationary and heteroscedasticity in the data, while the other models (similar plots provided in the Fig. 5) tend to produce intervals that are too wide (especially on the left for \(x \le -0.6\) for PoEs) or too tight. For our model, the data (in red) is contained within the region bounded with dashed lines, suggesting that the model also provides good empirical coverage. Table 3 gives more insights into the empirical coverage (i.e. fraction of times the test points are contained within the 95% CIs) against the average length of the CIs. It can be seen that the proposed model achieves tight intervals at the desired coverage, whereas most of the other models produce overconfident predictions or conservative intervals that are unnecessarily wide.

Table 4 Treed GP accuracy (\(R^2\)) on the test data, Wall-clock time, Empirical coverage (\(\text {EC}_{95}\)), and average length of 95% CIs (\(\bar{\text {CI}}_{95}\))

Table 4 shows the metrics obtained for the treed GP model on all datasets but kin40k and \(\chi _{\text {150k}}\) because of the computational infeasibility (the treed GP package employs MCMC inference), and hence it was separated in the standalone table. Although the treed GP model has the advantage in terms of \(R^2\) and empirical coverage/interval length over CCR in several instances, it is offset by prohibitively longer execution times.

Figure 4 compares heat maps of the conditional density for Motorcycle dataset. It can be seen that, unlike the other models, the proposed method successfully recovers non-stationary and heteroscedasticity in the data and provides UQ, which is closer to the one provided by the treed GP model and at a fraction of the cost.

Fig. 4
figure 4

Heat map of the conditional density for Motorcycle data

The results from Tables 1 to Tables 3 are presented in a more digestible format in Fig. 5. Figure 5 (as well as Tables 1 to 3) shows that proposed method provides the best balance between low computational cost and high accuracy, and good uncertainty quantification (nominal coverage and tight CIs) in the examples considered.

Fig. 5
figure 5

Left: Accuracy versus Time (on log-scale—note purple/circled data). CCR delivers comparable/higher accuracy, with comparable/smaller cost. Right: Empirical coverage vs Average length of 95% CIs. CCR provides judicious UQ

5 Conclusion

We have proposed a novel MoE, which combines powerful DNNs to flexibly determine the local regions and sparse GPs to probabilistically model the local regression functions. Through various experiments, we have demonstrated that this combination provides a flexible, robust model that is able to recover challenging behaviors such as discontinuities, non-stationarity, and non-normality and well calibrated uncertainty. In addition, we have established a novel connection between the maximization-maximization algorithm and the recently introduced CCR algorithm. This allows us to obtain a fast approximation that significantly outperforms competing methods. Moreover, in some cases, the solution can be further refined through additional MM iterations. While we focus on the proposed deep mixture of sparse GP experts, this connection can be generally applied to other MoE architectures for fast approximation. Future research will explore extensions to infinite mixtures of experts, that allow for a data-driven number of clusters which can grow unboundedly with the data. In addition, the recent work of Rossi et al. (2021) found that combining the FITC approximation of GPs with Bayesian treatment of the inducing points and hyperparameters can improve performance significantly, in particular, compared with variationally sparse GPs; in this direction, future work will also explore suitable priors for inducing variables and GP hyperparameters in MAP estimation.