Abstract
Mixtures of experts have become an indispensable tool for flexible modelling in a supervised learning context, allowing not only the mean function but the entire density of the output to change with the inputs. Sparse Gaussian processes (GP) have shown promise as a leading candidate for the experts in such models, and in this article, we propose to design the gating network for selecting the experts from such mixtures of sparse GPs using a deep neural network (DNN). Furthermore, a fast one-pass algorithm called Cluster–Classify–Regress (CCR) is leveraged to approximate the maximum a posteriori (MAP) estimator extremely quickly. This powerful combination of model and algorithm together delivers a novel method which is flexible, robust, and extremely efficient. In particular, the method is able to outperform competing methods in terms of accuracy and uncertainty quantification. The cost is competitive on low-dimensional and small data sets, but is significantly lower for higher-dimensional and big data sets. Iteratively maximizing the distribution of experts given allocations and allocations given experts does not provide significant improvement, which indicates that the algorithm achieves a good approximation to the local MAP estimator very fast. This insight can also be useful in the context of other mixture of experts models.
1 Introduction
Gaussian processes (GPs) are key components of many statistical and machine learning models. In a Bayesian setting, they provide a probabilistic approach to model unknown functions, which can subsequently be used to quantify uncertainty in predictions. An introduction and overview of GPs is given in Williams and Rasmussen (2006). In regression tasks, the GP is a popular prior for the unknown regression function, \(f:x \rightarrow y\), due to its nonparametric nature and tractability. It assumes that the function evaluated at any finite set of inputs \((x_1, \ldots , x_N)\) is Gaussian distributed with mean vector \((\mu (x_1), \ldots , \mu (x_N))\) and covariance matrix with elements \(K(x_i, x_j)\), where the mean function \(\mu (\cdot )\) and the positive semi-definite covariance (or kernel) function \(K(\cdot , \cdot )\) represent the parameters of the GP.
While GPs are flexible and have been successfully applied to various problems, limitations exist. First, GP models suffer from a high computational burden, due to the need to invert and store large, dense covariance matrices. Second, typically parametric forms are specified for \(\mu (\cdot )\) and \(K(\cdot , \cdot )\), which crucially determine properties of the regression function, such as spatial correlation, smoothness, and periodicity. This limits the model’s ability to recover changing behavior of the function, e.g. different smoothness levels, across the input space.
To address the computational burden, one approach is to approximate based on multiple GP experts. In this case, the data is partitioned into groups, and within each group, a GP expert specifies the conditional model for the output y given the input x. Scalability is enhanced as each expert only requires inversion of smaller matrices based on subsets of the data. The experts’ predictions can then be combined through a product operation, known as product of experts (PoEs, Tresp, 2000), or a sum operation, known as mixture of experts (MoEs, Jacobs et al., 1991). The PoE approach includes the Bayesian Committee Machine (BCM, Tresp, 2000), generalized PoE (gPoE, Cao & Fleet, 2014), and robust BCM (rBCM, Deisenroth & Ng, 2015). PoEs are motivated by the fact that the product operation maintains Gaussianity and are specifically designed for faster inference of stationary GP models. However, a thorough theoretical analysis (Szabó & Van Zanten, 2019) shows that although some PoE approaches can achieve good posterior contraction rates and asymptotic coverage, this is only in the unrealistic non-adaptive setting. Instead, MoEs work well even in adaptive settings and correspond to sound statistical models that take a weighted average of the experts’ predictions. The weights are defined by the gating network, which probabilistically maps the experts to regions of the input space. The recent work of Trapp et al. (2020) combines both operations to build a sum-product network of GPs.
Beyond approximation, the crucial advantage of MoEs is that model flexibility is greatly enhanced. Specifically, any simplifying assumptions of the experts need only hold for each subset of the data. This allows the model to infer different behaviors, such as smoothness and variability, within different regions, and combine the multiple experts to capture non-stationarity, heterogeneity, discontinuities, and multi-modality.
In this paper, our contribution is threefold. First, we construct a novel MoE model that combines the expressive power of deep neural networks (DNNs) and the probabilistic nature of GPs. While powerful, DNNs lack the probabilistic framework and sound uncertainty quantification of GPs, and there has been increased interest in recent years in combining DNNs and GPs to benefit from the advantages and overcome the limitations of each method, see e.g. (Huang et al., 2015; Wilson et al., 2016; Iwata & Ghahramani, 2017; Daskalakis et al., 2020) to name a few. Specifically, we use GP experts for smooth, probabilistic reconstructions of the unknown regression function within each region, while employing DNNs for the gating network to flexibly determine the regions. To further enhance scalability, we combine the distributed approximation of GPs through the MoE architecture with low-rank approximations using an inducing point strategy (Snelson & Ghahramani, 2006). This combination leads to a robust and efficient model which is able to outperform competing models.
Second, we provide a fast and accurate approximation of the proposed deep mixture of sparse GP experts, using a recently introduced method called Cluster–Classify–Regress (CCR, Bernholdt et al., 2019). Finally, a novel connection is made between CCR and optimization algorithms commonly used for MAP estimation of MoEs, which lends credibility to its success as an approximation algorithm in this context. More generally, the connection both provides a framework to potentially further refine the CCR solution through additional iterations of the optimization algorithm and also sheds light on the MoE model underlying the CCR algorithm, which provides a fast approximation for other MoE architectures as well.
2 Methodology
For i.i.d. data \((y_1, \ldots y_N)\), mixture models are an extremely useful tool for flexible density estimation due to their ability to approximate a large class of densities and their attractive balance between smoothness and flexibility. When additional covariate information is present and the data consists of input–output pairs, \(\{(x_i,y_i)\}_{i=1}^N\), MoEs extend mixtures to achieve flexible conditional density estimation (also known as density regression), where the whole density of the output changes with the inputs. This is achieved by modelling the mixture parameters as functions of the inputs, that is, by defining the gating network, which probabilistically partitions the input space into regions, and by specifying the experts, which characterize the relationship between x and y within each region. This results in a flexible framework which has been employed in numerous applications; for a recent overview, see (Gormley & Frühwirth-Schnatter, 2019).
Specifically, the MoE model assumes that outputs are independently generated, for \(i = 1,\ldots , N\), from a mixture:
$$ y_i \mid x_i \;\sim\; \sum _{l=1}^{L} w_l(x_i; \psi )\, \text {N}\big(f(x_i; \theta _l), \sigma ^2_l\big), \qquad \mathrm{(1)} $$
where \({w_l}(\cdot ; {\psi })\) is the gating network with parameters \({\psi }\); \({f}(\cdot ; {\theta }_l)\) is the regression function for the \(l\text{th}\) expert with parameters \({\theta }_l\); and L is the number of experts. For simplicity, we focus on the case when \(y \in \mathbb {R}\) and employ a Gaussian likelihood for the experts, although this may be generalized for other data types. The gating network \(({w_1}(\cdot ; {\psi }), \ldots , {w_L}(\cdot ; {\psi }))\), which maps the input space \(\mathcal {X} \subseteq \mathbb {R}^d\) to the \(L-1\) dimensional simplex (i.e. \(w_l(x; \psi )\ge 0\) and \(\sum _{l=1}^L w_l(x; \psi ) =1)\), reflects the relevance of each expert at any location \(x \in \mathcal {X}\).
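To make the notation concrete, the following minimal sketch evaluates the MoE conditional density \(p(y \mid x) = \sum _l w_l(x)\, \text {N}(y \mid f_l(x), \sigma ^2)\) for toy, hand-picked gating and expert functions; the specific forms of `gating` and `experts_mean` (and the common noise scale) are illustrative placeholders, not the model developed below.

```python
# Sketch: evaluating a mixture-of-experts conditional density with toy components.
import numpy as np
from scipy.stats import norm

def gating(x, L=3):
    # Hypothetical softmax gating over simple linear scores, standing in for w_l(x; psi).
    scores = np.stack([(l + 1) * x for l in range(L)], axis=-1)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def experts_mean(x, L=3):
    # Hypothetical expert regression functions f(x; theta_l).
    return np.stack([np.sin(x + l) for l in range(L)], axis=-1)

def moe_density(y, x, sigma=0.1, L=3):
    w = gating(x, L)                       # (..., L) weights on the simplex
    mu = experts_mean(x, L)                # (..., L) expert means
    return np.sum(w * norm.pdf(y[..., None], loc=mu, scale=sigma), axis=-1)

print(moe_density(np.array([0.2]), np.array([1.0])))
```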
The experts form the building blocks of the mixture model, which are combined to recover general shapes for the conditional density of y given x, that are too complicated to be captured by a single expert. To provide intuition, MoEs can be used when the population consists of (input-dependent) sub-populations, such that within each, a single expert provides a good description of the relationship between y and x. However, in general, the groups may not represent actual sub-populations but rather they are combined to achieve flexibility and approximate a wide class of conditional densities. MoEs can be augmented with a set of allocation variables \({\textbf{z}}=(z_1, \ldots , z_N)\) indicating group or cluster membership, where \(z_i = l\) if the \(i\text{th}\) data point is generated from the \(l\text{th}\) expert, for \(l = 1, \ldots , L\). Specifically, if
$$ y_i \mid z_i = l, x_i \;\sim\; \text {N}\big(f(x_i; \theta _l), \sigma ^2_l\big), \qquad p(z_i = l \mid x_i) = w_l(x_i; \psi ), $$
then Eq. (1) is recovered after marginalization of \(z_i\):
$$ p(y_i \mid x_i) = \sum _{l=1}^{L} p(z_i = l \mid x_i)\, p(y_i \mid z_i = l, x_i) = \sum _{l=1}^{L} w_l(x_i; \psi )\, \text {N}\big(y_i \mid f(x_i; \theta _l), \sigma ^2_l\big). $$
Thus, letting \({\textbf{y}}=(y_1,\ldots , y_N)\) and \({\textbf{x}}= (x_1, \ldots , x_N)\), we can define the augmented model as
$$ p({\textbf{y}}, {\textbf{z}} \mid {\textbf{x}}, \psi , \theta , \sigma ^2) = \prod _{i=1}^{N} w_{z_i}(x_i; \psi )\, \text {N}\big(y_i \mid f(x_i; \theta _{z_i}), \sigma ^2_{z_i}\big). $$
This augmented version of the model is widely used in inference algorithms (such as expectation-maximization), as direct optimization of the likelihood in Eq. (1) is challenging, due to multi-modality and well-known identifiability issues associated with mixtures. Moreover, the introduction of the allocation variables \({\textbf{z}}\) permits distributed inference across experts.
Various formulations have been proposed in the literature for both the experts and gating networks, ranging from simple linear models to flexible nonlinear approaches. Examples include (generalized) linear or semi-linear models (Jordan & Jacobs, 1994; Xu et al., 1995), splines (Rigon & Durante, 2017), neural networks (Bishop, 1994; Ambrogioni et al., 2017), GPs (Tresp, 2001; Rasmussen & Ghahramani, 2002), tree-based classifiers (Gramacy & Lee, 2008), and others. In this work, we combine sparse GP experts with DNN gating networks to flexibly determine regions and provide a probabilistic nonparametric model of the unknown regression function. Figure 1 highlights the advantages of this construction: 1) flexible regions (over quadratic classifiers, e.g. (Meeds & Osindero, 2006; Nguyen & Bonilla, 2014; Zhang & Williamson, 2019)) and 2) well-calibrated uncertainty (over DNN experts, e.g. (Bishop, 1994; Ambrogioni et al., 2017)), through tighter credible intervals that maintain the desired coverage.
2.1 Sparse Gaussian process experts
Mixtures of GP experts have proven to be very successful (Tresp, 2001; Rasmussen & Ghahramani, 2002; Meeds & Osindero, 2006; Yuan & Neubauer, 2009; Nguyen & Bonilla, 2014; Gadd et al., 2020). In particular, they overcome limitations of stationary GPs by reducing computational complexity through distributed approximations and allowing different properties of the function for each GP expert to handle challenges, such as discontinuities, non-stationarity, non-Gaussianity and multi-modality. They allow flexible conditional density estimation, going beyond the iid Gaussian error assumption of standard GP regression models.
In this case, one assumes a GP prior on the regression function for each expert with expert-specific hyperparameters \({\theta _l =( \mu _l,\phi _l)}\):
$$ f(\cdot \,; \theta _l) \sim \mathcal {GP}\big(\mu _l, K_{\phi _l}\big), $$
where \(\mu _l\) is the mean function of the expert (for simplicity, it is assumed to be constant) and \(\phi _l\) are the parameters of the covariance function \(K_{\phi _l}\), whose chosen form and hyperparameters encapsulate properties of the function for each expert. While it is common to use zero mean functions in standard GP models, which is made appropriate by first subtracting the overall mean from the output, in mixtures, we must include a constant mean, as the clustering structure is unknown and the data cannot be centred for each expert. Letting \(f_{l,i} ={f}(x_i; {\theta }_{l})\) denote the \(l\text{th}\) function evaluated at \(x_i\) with \({\textbf{f}}_l = \lbrace f_{l,i} \rbrace _{z_i = l}\), the GP prior implies a multivariate normal prior on the function \({\textbf{f}}_l\) evaluated at the inputs within the \(l\text{th}\) cluster:
$$ {\textbf{f}}_l \mid {\textbf{x}}_l, \theta _l \sim \text {N}\big({\varvec{\mu }}_l, \textbf{K}_{l,N_l}\big), $$
where \({\varvec{\mu }}_l\) is a vector with entries \(\mu _l\); \(\textbf{K}_{l,N_l}\) represents the \(N_l \times N_l\) matrix obtained by evaluating the covariance function \(K_{\phi _l}\) at each pair of inputs in the \(l\text{th}\) cluster; and \(N_l\) is the number of data points in the \(l\text{th}\) cluster. With the Gaussian likelihood, the unknown function can be marginalized, and the likelihood of the data within each cluster is:
$$ {\textbf{y}}_l \mid {\textbf{x}}_l, \theta _l, \sigma ^2_l \sim \text {N}\big({\varvec{\mu }}_l, \textbf{K}_{l,N_l} + \sigma ^2_l \textbf{I}_{N_l}\big), $$
where \({\textbf{y}}_l\) and \({\textbf{x}}_l\) contain the outputs and inputs of the \(l\text{th}\) cluster, i.e. \({\textbf{y}}_l = \lbrace y_i \rbrace _{z_i=l}\) and \({\textbf{x}}_l = \lbrace x_i \rbrace _{z_i=l}\).
GP experts are appealing due to their flexibility, interpretability and probabilistic nature, but they come with an increased computational cost. Specifically, given the allocation variables \({\textbf{z}}\), the GP hyperparameters, which crucially determine the behavior of the unknown function, can be estimated by optimizing the log marginal likelihood:
$$ \log p({\textbf{y}}_l \mid {\textbf{x}}_l, \theta _l, \sigma ^2_l) = \log \text {N}\big({\textbf{y}}_l \mid {\varvec{\mu }}_l, \textbf{K}_{l,N_l} + \sigma ^2_l \textbf{I}_{N_l}\big). $$
This requires inversion of \(N_l \times N_l\) matrices. Compared with standard GP models, the cost is reduced from \(\mathcal {O}(N^3)\) to \(\mathcal {O}(\sum _{l=1}^L N_l^3)\); however, this can still be expensive, depending on the size of the clusters.
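As an illustration of this cost, the following minimal sketch evaluates the per-cluster log marginal likelihood via a Cholesky factorization, whose \(\mathcal {O}(N_l^3)\) factorization is the bottleneck discussed above; the squared exponential kernel, hyperparameter values, and toy data are illustrative choices rather than those used in the paper.

```python
# Sketch: per-cluster GP log marginal likelihood log N(y_l | mu_l, K_l + sigma_l^2 I).
import numpy as np

def sq_exp_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared exponential covariance between two sets of inputs.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_log_marginal(y_l, X_l, mu_l=0.0, lengthscale=1.0, variance=1.0, noise=0.1):
    N_l = len(y_l)
    K = sq_exp_kernel(X_l, X_l, lengthscale, variance) + noise * np.eye(N_l)
    chol = np.linalg.cholesky(K)                    # O(N_l^3) step
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_l - mu_l))
    return (-0.5 * (y_l - mu_l) @ alpha
            - np.log(np.diag(chol)).sum()           # -0.5 * log|K|
            - 0.5 * N_l * np.log(2 * np.pi))

X_l = np.random.rand(50, 1)
y_l = np.sin(3 * X_l[:, 0]) + 0.1 * np.random.randn(50)
print(gp_log_marginal(y_l, X_l))
```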
To further improve scalability, one can resort to approximate methods for GPs, including sparse GPs (Snelson & Ghahramani, 2006; Titsias, 2009; Bui et al., 2017), predictive processes (Banerjee et al., 2008), basis function approximations (Cressie & Johannesson, 2008), or sparse formulations of the precision matrix (Lindgren et al., 2011; Grigorievskiy et al., 2017; Durrande et al., 2019), among others (see reviews in Chp. 8 of Williams and Rasmussen (2006) and (Heaton et al., 2019)). In the present work, we employ sparse GPs, which augment the model with a set of \(M_l<N_l\) pseudo-inputs \({\tilde{{\textbf{x}}}}_l = ({\tilde{x}}_{l,1}, \ldots , {\tilde{x}}_{l,M_l})\) and pseudo-targets \({\tilde{{\textbf{f}}}}_l= ({\tilde{f}}_{l,1}, \ldots , {\tilde{f}}_{l,M_l})\) representing the \(l\text{th}\) function evaluated at the pseudo-inputs, i.e. \({\tilde{f}}_{l,m} = f({\tilde{x}}_{l,m}; \theta _l)\) for \(m=1, \ldots , M_l\). Then, the key assumption for scalability is that the prior on the regression function within each cluster factorizes given the pseudo-targets:
$$ p({\textbf{f}}_l \mid {\tilde{{\textbf{f}}}}_l, {\textbf{x}}_l, {\tilde{{\textbf{x}}}}_l, \theta _l) = \prod _{i:\, z_i = l} p(f_{l,i} \mid {\tilde{{\textbf{f}}}}_l, x_i, {\tilde{{\textbf{x}}}}_l, \theta _l). $$
From properties of GPs, the prior of the pseudo-targets is \({\tilde{{\textbf{f}}}}_l \sim \text {N}({\varvec{\mu }}_l, \textbf{K}_{l,M_l})\), and the augmented model for \({\textbf{f}}_l\) given the pseudo-targets is:
$$ f_{l,i} \mid {\tilde{{\textbf{f}}}}_l, x_i, {\tilde{{\textbf{x}}}}_l, \theta _l \sim \text {N}\Big(\mu _l + (\textbf{k}_{l,M_l,i})^T (\textbf{K}_{l,M_l})^{-1} ({\tilde{{\textbf{f}}}}_l - {\varvec{\mu }}_l),\; \lambda _{l,i}\Big), $$
where
$$ \lambda _{l,i} = K_{\phi _l}(x_i, x_i) - (\textbf{k}_{l,M_l,i})^T (\textbf{K}_{l,M_l})^{-1} \textbf{k}_{l,M_l,i}, $$
with \(\textbf{K}_{l,M_l}\) denoting the \(M_l \times M_l\) matrix with elements \(K_{\phi _l}({\tilde{x}}_{l,j},{\tilde{x}}_{l,h})\) and \(\textbf{k}_{l,M_l,i}\) denoting the vector of length \(M_l\) with elements \(K_{\phi _l}({\tilde{x}}_{l,j}, x_i)\). This corresponds to the fully independent training conditional (FITC) approximation (Quiñonero-Candela & Rasmussen, 2005) (see (Bauer et al., 2016) for a further discussion of FITC). For the Gaussian likelihood, \({\textbf{f}}_l\) can be marginalized and the likelihood of the data points within each cluster also factorizes as:
$$ p({\textbf{y}}_l \mid {\tilde{{\textbf{f}}}}_l, {\textbf{x}}_l, {\tilde{{\textbf{x}}}}_l, \theta _l, \sigma ^2_l) = \prod _{i:\, z_i = l} \text {N}\Big(y_i \,\Big|\, \mu _l + (\textbf{k}_{l,M_l,i})^T (\textbf{K}_{l,M_l})^{-1} ({\tilde{{\textbf{f}}}}_l - {\varvec{\mu }}_l),\; \lambda _{l,i} + \sigma ^2_l\Big). $$
After marginalization of the pseudo-targets, the pseudo-inputs \({\tilde{{\textbf{x}}}}_l\) and hyperparameters \((\mu _l, \phi _l)\) can be estimated by optimizing the marginal likelihood:
$$ p({\textbf{y}}_l \mid {\textbf{x}}_l, {\tilde{{\textbf{x}}}}_l, \theta _l, \sigma ^2_l) = \text {N}\big({\textbf{y}}_l \mid {\varvec{\mu }}_l, {\varvec{\Sigma }}_l\big), $$
where \({\varvec{\Sigma }}_l = (\textbf{K}_{l,M_l N_l})^T(\textbf{K}_{l,M_l})^{-1} \textbf{K}_{l,M_l N_l}+ {\varvec{\Lambda }}_l +\sigma ^{2}_l \textbf{I}_{N_l}\); \(\textbf{K}_{l,M_l N_l}\) is the \(M_l \times N_l\) matrix with columns \(\textbf{k}_{l,M_l,i}\); and \({\varvec{\Lambda }}_l\) is the diagonal matrix with diagonal entries \(\lambda _{l,i}\). This strategy allows us to reduce the complexity to \(\mathcal {O}(\sum _{l=1}^L N_l M_l^2)\).
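The structure of \({\varvec{\Sigma }}_l\) can be assembled as in the following sketch; the kernel and jitter value here are illustrative, and a practical implementation would exploit the low-rank-plus-diagonal structure (e.g. via the Woodbury identity) to attain the \(\mathcal {O}(N_l M_l^2)\) cost rather than forming the full \(N_l \times N_l\) matrix as done here for clarity.

```python
# Sketch: building Sigma_l = K_MN^T K_M^{-1} K_MN + Lambda_l + sigma_l^2 I for one cluster.
import numpy as np

def sq_exp(A, B, ell=1.0, var=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ell**2)

def fitc_sigma(X_l, Z_l, noise=0.1):
    K_M = sq_exp(Z_l, Z_l)                             # K_{l,M_l}, M_l x M_l
    K_MN = sq_exp(Z_l, X_l)                            # K_{l,M_l N_l}, M_l x N_l
    V = np.linalg.solve(K_M + 1e-8 * np.eye(len(Z_l)), K_MN)
    Q_NN = K_MN.T @ V                                  # low-rank (Nystrom) part
    lam = np.diag(sq_exp(X_l, X_l)) - np.diag(Q_NN)    # FITC diagonal correction lambda_{l,i}
    return Q_NN + np.diag(lam) + noise * np.eye(len(X_l))

X_l = np.random.rand(40, 2)
Z_l = X_l[np.random.choice(40, 8, replace=False)]      # pseudo-inputs for this cluster
Sigma_l = fitc_sigma(X_l, Z_l)
```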
2.2 Deep neural gating networks
The choice of gating network is crucial to flexibly determine the regions and accurately approximate the conditional densities. Indeed, various proposals exist to combine GP experts with different gating networks, and approaches to specify the gating network can be divided into generative or discriminative classifiers. Generative models specify a joint mixture model for y and x and typically assume a Gaussian distribution for the inputs within each class, resulting in linear or quadratic discriminant analysis, e.g. (Meeds & Osindero, 2006; Chen et al., 2014; Zhang & Williamson, 2019). To allow more flexibility in the regions, Yuan and Neubauer (2009) and Gadd et al. (2020) extend this by considering a mixture of Gaussians within each class. Discriminative models, on the other hand, avoid modelling the inputs and focus on the conditional density of interest by defining the gating network directly. Discriminative models include linear classifiers (Nguyen & Bonilla, 2014) and tree-based classifiers (Gramacy & Lee, 2008), which may result in rigid assumptions on the partition structure of the input space. The latter, for example, assumes axis-aligned rectangular regions, and while more flexible partitioning approaches exist, such as Voronoi tessellations (Pope et al., 2019), they come at an increased cost. Other flexible proposals include GP classifiers (Tresp, 2001) and kernel-based methods (Rasmussen & Ghahramani, 2002), but these similarly suffer from the trade-off between flexibility and cost.
In this work, we focus on DNNs, which are known to be universal for classification (Szymanski & McCane, 2012). DNNs flexibly determine the regions, removing any rigid assumptions, without increasing the computational cost. Specifically, we define the gating network by a feedforward DNN with a softmax output:
$$ w_l(x; \psi ) = \frac{\exp \{h_l(x; \psi )\}}{\sum _{k=1}^{L} \exp \{h_k(x; \psi )\}}, $$
where \(h_l\) is the \(l{\textrm{th}}\) component of \(h: \mathbb {R}^d \rightarrow \mathbb {R}^{L}\), defined by
$$ h(x; \psi ) = \eta _J \circ \eta _{J-1} \circ \cdots \circ \eta _1(x), $$
with \(\eta _{j}: \mathbb {R}^{d_{j-1}} \rightarrow \mathbb {R}^{d_j}\) (\(d_0=d\), \(d_J=L\)) the \(j\textrm{th}\) layer of a neural network
$$ \eta _j(v) = \textrm{ReLU}(A_j v + b_j), \qquad v \in \mathbb {R}^{d_{j-1}}, $$
where \(\textrm{ReLU}(x) = \textrm{max} \{0, x\}\) is the element-wise rectifier; \({\psi }_{j} = \{A_{j}, b_{j}\}\) comprises the weights \(A_{j} \in \mathbb {R}^{d_{j} \times d_{j-1}}\) and biases \(b_j \in \mathbb {R}^{d_j}\) for level \(j = 1, \dots , J\); and \(\psi = (\psi _1,\ldots , \psi _J)\) collects all the parameters of the DNN.
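A minimal gating network of this form can be written, for example, in Keras (which the paper uses for training the gating network); the hidden-layer widths below are illustrative stand-ins, whereas the paper selects the architecture with Auto-Keras, and the regularization strength mirrors the quadratic penalty value reported in Sect. 4.

```python
# Sketch: ReLU hidden layers with a softmax output over the L experts.
from tensorflow import keras

def build_gating_network(L, hidden=(64, 64), l2=1e-4):
    # Input dimension is inferred on the first call to fit().
    layers = [keras.layers.Dense(w, activation="relu",
                                 kernel_regularizer=keras.regularizers.l2(l2))
              for w in hidden]
    layers.append(keras.layers.Dense(L, activation="softmax"))
    model = keras.Sequential(layers)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# Usage: fit to the current hard allocations z in {0, ..., L-1} (settings as in Sect. 4).
# gating = build_gating_network(L=4)
# gating.fit(x_train, z_train, epochs=500, validation_split=0.1)
```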
Deep neural gating networks have been used in the literature but are typically combined with DNN experts (Bishop, 1994; Ambrogioni et al., 2017). The mixture density network (MDN, Bishop, 1994) uses this gating network but parametrizes both the regression function and variance of the Gaussian model in Eq. (1) by DNNs. This offers considerable flexibility beyond standard DNN regression, but significant value can be gained by instead using GP experts. Specifically, as the number of data points in each cluster is data-driven, DNN experts may overfit due to small cluster sizes. In addition, DNNs are susceptible to adversarial attacks and tend to be overconfident even when predictions are incorrect (Szegedy et al., 2013). Moreover, DNNs may underestimate variability, especially for test data which are quite different from the observed data (Nguyen et al., 2015) (see Fig. 1). Instead, GP experts probabilistically model the local regression function, avoiding overfitting and providing well-calibrated uncertainty quantification.
3 Inference
While Markov chain Monte Carlo algorithms provide full Bayesian inference for MoEs, they rely on expensive Gibbs samplers (Rasmussen & Ghahramani, 2002; Meeds & Osindero, 2006; Gadd et al., 2020), which alternately sample the allocation variables \({\textbf{z}}\) given the model parameters (that is, the gating and expert parameters) and the model parameters given \({\textbf{z}}\). To alleviate the cost, Zhang and Williamson (2019) use importance sampling and parallelization. However, accurate estimation requires the number of importance samples to be exponential in the Kullback–Leibler divergence between the proposal and posterior (Chatterjee & Diaconis, 2018); this can be prohibitive for MoEs due to the massive dimension of the partition space. A popular alternative strategy is approximate inference, where the Gibbs sampling steps for allocation variables and the model parameters are replaced, respectively, with either an expectation or maximization step (Kurihara & Welling, 2009). An expectation-expectation strategy corresponds to mean-field variational Bayes for MoEs (Yuan & Neubauer, 2009). While popular expectation-maximization (EM) algorithms can be used for maximum a posteriori (MAP) estimation as in Tresp (2001), we instead focus on the faster maximization-maximization (MM) strategy. For generative mixtures of GP experts, an MM algorithm was developed in Chen et al. (2014).
Our MM algorithm provides MAP estimation for the augmented model by optimizing the augmented posterior:
where \(\theta =(\theta _1,\ldots , \theta _L)\), \(\sigma ^2 =(\sigma ^2_1,\ldots , \sigma ^2_L)\), and \({\tilde{{\textbf{f}}}} =({\tilde{{\textbf{f}}}}_1,\ldots , {\tilde{{\textbf{f}}}}_L)\) collect the expert-specific GP hyperparameters, noise variance, and pseudo-targets, respectively. In the following, we consider vague priors \(\pi ({\psi }, {\theta }, \sigma ^2) \propto 1\), and thus equivalently optimize the augmented log posterior:
For GP experts, this provides maximum marginal likelihood estimation or type II MLE of the GP hyperparameters and noise variance \(({\theta }_l, \sigma ^{2}_l)\), as the GP regression functions are marginalized. For sparse GP experts, we have the additional parameters \(({\tilde{{\textbf{x}}}}_l, {\tilde{{\textbf{f}}}}_l)\). The pseudo-inputs \({\tilde{{\textbf{x}}}}_l\) are treated as hyperparameters to be optimized, and while the pseudo-targets \({\tilde{{\textbf{f}}}}_l\) can be analytically integrated, we also estimate \({\tilde{{\textbf{f}}}}_l\) with its posterior mean for faster inference (see Sect. 3.1 for further details). The MM algorithm is an iterative conditional modes algorithm (Kittler & Föglein, 1984), which can also be viewed as a coordinate ascent method, that alternates between optimizing the allocation variables \({\textbf{z}}\) and the model parameters. It is guaranteed to never decrease the log posterior of the augmented model and therefore will converge to a fixed point (Kurihara & Welling, 2009). However, it is susceptible to local maxima, which can be alleviated with multiple restarts with random initializations.
3.1 Maximization–maximization
Our MM algorithm is defined by iterating over the following two steps: optimize the (i) latent cluster allocation variables and (ii) gating and expert parameters:
$$ \text {(i)}\quad {\hat{{\textbf{z}}}} = \mathop {\mathrm {argmax}}\limits _{{\textbf{z}}}\; p({\textbf{z}} \mid {\textbf{y}}, {\textbf{x}}, {\hat{\psi }}, {\hat{\theta }}, {\hat{\sigma }}^2, \hat{{\tilde{{\textbf{f}}}}}), \qquad \text {(ii)}\quad ({\hat{\psi }}, {\hat{\theta }}, {\hat{\sigma }}^2, \hat{{\tilde{{\textbf{f}}}}}) = \mathop {\mathrm {argmax}}\limits _{\psi , \theta , \sigma ^2, {\tilde{{\textbf{f}}}}}\; p(\psi , \theta , \sigma ^2, {\tilde{{\textbf{f}}}} \mid {\textbf{y}}, {\textbf{x}}, {\hat{{\textbf{z}}}}). $$
We note that while the pseudo-targets \({\tilde{{\textbf{f}}}}\) can be marginalized, the full conditional for \({\textbf{z}}\) in the first step of Eq. (6) would be:
$$ p({\textbf{z}} \mid {\textbf{y}}, {\textbf{x}}, \psi , \theta , \sigma ^2) \propto \prod _{i=1}^{N} w_{z_i}(x_i; \psi ) \, \prod _{l=1}^{L} \text {N}\big({\textbf{y}}_l \mid {\varvec{\mu }}_l, {\varvec{\Sigma }}_l\big), $$
with optimal allocation:
$$ {\hat{{\textbf{z}}}} = \mathop {\mathrm {argmax}}\limits _{{\textbf{z}} \in \lbrace 1, \ldots , L \rbrace ^N}\; \left\{ \sum _{i=1}^{N} \log w_{z_i}(x_i; \psi ) + \sum _{l=1}^{L} \log \text {N}\big({\textbf{y}}_l \mid {\varvec{\mu }}_l, {\varvec{\Sigma }}_l\big) \right\} . $$
Recall from Eq. (5) that \({\varvec{\Sigma }}_l\) is a full covariance matrix; thus, the second term in Eq. (7) does not factorize across \(i=1,\ldots , N\) and direct optimization over the space \(\lbrace 1, \ldots , L \rbrace ^N\) is infeasible. An iterative conditional modes algorithm could be employed which cycles over each data point \(i=1,\ldots , N\), optimizing \(z_i\) based on the conditional Gaussian likelihood given the data points currently allocated to each cluster. For each data point, the conditional Gaussian likelihood \(p(y_i \mid z_i = l, {\textbf{z}}_{-i}, {\textbf{y}}_{-i})\) must be computed (where \({\textbf{z}}_{-i}\) and \({\textbf{y}}_{-i}\) denote \({\textbf{z}}\) and \({\textbf{y}}\) with the \(i\text{th}\) element removed), requiring rank one updates to the inverse covariance matrices for every \(l=1,\ldots , L\).
However, we can significantly reduce the computational cost by also estimating \({\tilde{{\textbf{f}}}}\). In this case, the full conditional for \({\textbf{z}}\) in (i) of Eq. (6) factorizes across \(i=1,\ldots , N\):
Thus, the allocation can be done in parallel across the N data points:
$$ {\hat{z}}_i = \mathop {\mathrm {argmax}}\limits _{l \in \lbrace 1,\ldots ,L \rbrace }\; w_l(x_i; \psi )\, \text {N}\Big(y_i \,\Big|\, \mu _l + (\textbf{k}_{l,M_l,i})^T (\textbf{K}_{l,M_l})^{-1} ({\tilde{{\textbf{f}}}}_l - {\varvec{\mu }}_l),\; \lambda _{l,i} + \sigma ^2_l\Big). $$
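A vectorized sketch of this allocation step is given below; it assumes the per-expert conditional means and variances (\(\lambda _{l,i} + \sigma ^2_l\)) at the observed inputs have already been computed, and the array contents are illustrative.

```python
# Sketch: each point is assigned to the expert maximizing log gating weight + log-likelihood.
import numpy as np

def allocate(y, log_w, expert_mean, expert_var):
    """y: (N,); log_w: (N, L) log gating probabilities; expert_mean, expert_var: (N, L)."""
    log_lik = -0.5 * (np.log(2 * np.pi * expert_var)
                      + (y[:, None] - expert_mean) ** 2 / expert_var)
    return np.argmax(log_w + log_lik, axis=1)   # independent argmax for each data point

z_hat = allocate(np.array([0.1, 2.0]),
                 np.log(np.array([[0.6, 0.4], [0.3, 0.7]])),
                 expert_mean=np.array([[0.0, 2.0], [0.0, 2.0]]),
                 expert_var=np.array([[0.1, 0.1], [0.1, 0.1]]))
```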
In the second step of the MM algorithm in Eq. (6), we perform two sub-steps, first optimizing the gating network and expert parameters with \({\tilde{{\textbf{f}}}}\) marginalized, and then, given those optimal values, estimating \({\tilde{{\textbf{f}}}}\):
$$ \text {(a)}\quad ({\hat{\psi }}, {\hat{\theta }}, {\hat{\sigma }}^2) = \mathop {\mathrm {argmax}}\limits _{\psi , \theta , \sigma ^2}\; p(\psi , \theta , \sigma ^2 \mid {\textbf{y}}, {\textbf{x}}, {\hat{{\textbf{z}}}}), \qquad \text {(b)}\quad \hat{{\tilde{{\textbf{f}}}}} = \mathop {\mathrm {argmax}}\limits _{{\tilde{{\textbf{f}}}}}\; p({\tilde{{\textbf{f}}}} \mid {\textbf{y}}, {\textbf{x}}, {\hat{{\textbf{z}}}}, {\hat{\psi }}, {\hat{\theta }}, {\hat{\sigma }}^2). $$
In (a) of Eq. (9), the full conditional of \((\psi , \theta , \sigma ^2)\) is:
$$ p(\psi , \theta , \sigma ^2 \mid {\textbf{y}}, {\textbf{x}}, {\hat{{\textbf{z}}}}) \propto \prod _{i=1}^{N} w_{{\hat{z}}_i}(x_i; \psi ) \, \prod _{l=1}^{L} \text {N}\big({\textbf{y}}_l \mid {\varvec{\mu }}_l, {\varvec{\Sigma }}_l\big). $$
Thus, optimization of the gating network and expert parameters can be done in parallel, both between each other as well as across \(l=1,\ldots , L\) for the experts. Specifically, the optimal gating network parameters are:
$$ {\hat{\psi }} = \mathop {\mathrm {argmax}}\limits _{\psi }\; \sum _{i=1}^{N} \log w_{{\hat{z}}_i}(x_i; \psi ), $$
i.e. the DNN is trained as a classifier on the current allocations,
and expert parameters (GP hyperparameters, noise variance, and pseudo-inputs) are estimated by optimizing the log marginal likelihood in Eq. (5):
$$ ({\hat{\theta }}_l, {\hat{\sigma }}^2_l, \hat{{\tilde{{\textbf{x}}}}}_l) = \mathop {\mathrm {argmax}}\limits _{\theta _l, \sigma ^2_l, {\tilde{{\textbf{x}}}}_l}\; \log \text {N}\big({\textbf{y}}_l \mid {\varvec{\mu }}_l, {\varvec{\Sigma }}_l\big). $$
Then, in (b) of Eq. (9), the full conditional for \({\tilde{{\textbf{f}}}}_l\) is:
$$ p({\tilde{{\textbf{f}}}}_l \mid {\textbf{y}}_l, {\textbf{x}}_l, {\tilde{{\textbf{x}}}}_l, {\hat{\theta }}_l, {\hat{\sigma }}^2_l) \propto \text {N}\big({\tilde{{\textbf{f}}}}_l \mid {\varvec{\mu }}_l, \textbf{K}_{l,M_l}\big) \prod _{i:\, z_i = l} \text {N}\Big(y_i \,\Big|\, \mu _l + (\textbf{k}_{l,M_l,i})^T (\textbf{K}_{l,M_l})^{-1} ({\tilde{{\textbf{f}}}}_l - {\varvec{\mu }}_l),\; \lambda _{l,i} + \sigma ^2_l\Big), $$
which, following standard derivations, is shown to be Gaussian with covariance matrix \(\textbf{K}_{l,M_l} (\textbf{Q}_{l,M_l} )^{-1} \textbf{K}_{l,M_l}\) and mean:
$$ {\varvec{\mu }}_l + \textbf{K}_{l,M_l} (\textbf{Q}_{l,M_l})^{-1} \textbf{K}_{l,M_l N_l} \big({\varvec{\Lambda }}_l + \sigma ^{2}_l \textbf{I}_{N_l}\big)^{-1} ({\textbf{y}}_l - {\varvec{\mu }}_l), $$
where \(\textbf{Q}_{l,M_l} = \textbf{K}_{l,M_l} + \textbf{K}_{l,M_l N_l} ( {\varvec{\Lambda }}_l +\sigma ^{2}_l \textbf{I}_{N_l})^{-1} (\textbf{K}_{l,M_l N_l} )^T\). Thus, the optimal \({\tilde{{\textbf{f}}}}_l\) in (b) of Eq. (9) is the posterior mean in Eq. (12).
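The following sketch computes \(\textbf{Q}_{l,M_l}\) and the posterior mean of the pseudo-targets from the per-cluster matrices defined above, exploiting the diagonal structure of \({\varvec{\Lambda }}_l + \sigma ^2_l \textbf{I}_{N_l}\); the matrices K_M (\(M_l \times M_l\)), K_MN (\(M_l \times N_l\)), the FITC diagonal lam, and the scalar prior mean mu_l are assumed to be precomputed.

```python
# Sketch of the pseudo-target update used as the point estimate of f~_l.
import numpy as np

def pseudo_target_mean(K_M, K_MN, lam, noise, y_l, mu_l):
    D_inv = 1.0 / (lam + noise)                      # diagonal of (Lambda_l + sigma_l^2 I)^{-1}
    Q_M = K_M + (K_MN * D_inv) @ K_MN.T              # Q_{l,M_l}, an M_l x M_l matrix
    rhs = (K_MN * D_inv) @ (y_l - mu_l)
    return mu_l + K_M @ np.linalg.solve(Q_M, rhs)    # posterior mean of the pseudo-targets
```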
3.2 A fast approximation: CCR
The MM algorithm iterates between clustering in Eq. (8) and in parallel classification in Eq. (10) and regression in Eqs. (11) and (12). This closely resembles the CCR algorithm recently introduced in Bernholdt et al. (2019). The important differences are that (1) CCR is a one-pass algorithm that does not iterate between the steps and (2) CCR approximates the clustering in the first step of the MM algorithm by (a) carefully re-scaling the data to emphasize the output y in relation to x and (b) subsequently applying a fast clustering algorithm, e.g. K-means (Hartigan & Wong, 1979), Gaussian mixture model (GMM, McLachlan & Basford, 1988), or DBSCAN (Ester et al., 1996), to jointly cluster the rescaled (y, x). We also note that the original formulation of CCR performs an additional clustering step so that the allocation variables used by the regression correspond to the prediction of the classifier; this is equivalent to the clustering step of the MM algorithm in Eq. (8) including only the term associated to the gating network.
This novel connection sheds light on the MoE model underlying the CCR algorithm and allows us to view CCR as a fast, one-pass approximation to the MM algorithm for MoEs. Therefore, we can use CCR to construct a fast approximation of the proposed deep mixture of sparse GP experts. Moreover, this connection can be used to obtain a fast approximation to other discriminative MoE architectures. As shown in the following section, CCR provides a good, fast approximation for many numerical examples. If extra computational resources are available, the CCR solution can be improved through additional MM iterations (i.e. it provides a good initialization for the MM algorithm). However, in our examples, we notice that the potential for further improvement is limited. We find that MM with random initialization can also produce a reasonable estimate in many examples after two iterations; we refer to this algorithm as MM2r. It is fast, but we will see that it takes approximately 2–3 times longer than CCR.
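The one-pass structure of CCR can be summarized by the following sketch, which uses simple stand-ins: a GMM for the clustering step, an MLP classifier in place of the Keras gating network, and full (rather than sparse) GP experts from scikit-learn; the rescaling factor `beta` and the toy data are illustrative choices.

```python
# One-pass CCR sketch: cluster rescaled (y, x), classify, then regress per cluster.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessRegressor

def ccr(x, y, L, beta=5.0):
    # (1) Cluster: emphasize the output by rescaling before joint clustering.
    yx = np.column_stack([beta * y, x])
    z = GaussianMixture(n_components=L, random_state=0).fit_predict(yx)
    # (2) Classify: learn a gating network mapping x to cluster labels.
    gate = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500).fit(x, z)
    z_hat = gate.predict(x)          # use the classifier's labels for the regression step
    # (3) Regress: fit one expert per predicted cluster.
    experts = {l: GaussianProcessRegressor().fit(x[z_hat == l], y[z_hat == l])
               for l in np.unique(z_hat)}
    return gate, experts

x = np.random.rand(300, 1)
y = np.where(x[:, 0] > 0.5, 2.0, -2.0) + 0.1 * np.random.randn(300)
gate, experts = ccr(x, y, L=2)
```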
3.3 Complexity considerations
Suppose we parameterize the DNN with \(p_c\) parameters, and each of the sparse GP experts is approximated with \(M_l\) pseudo-inputs. The MM algorithm described in Sect. 3.1 incurs a cost per iteration of clustering the N points given the current set of parameters (Eq. 6). This cost is \(\mathcal {O}(N\sum _{l=1}^L M_l^2)\), and it is parallel in N. The algorithm also incurs a cost per iteration of classification with N points and L regressions using \(N_1, \dots , N_L\) points, where \(N= \sum _{l=1}^LN_l\) in Eq. (6). These operations can also be done in parallel. The cost for the classification is \(\mathcal {O}(Np_c)\), assuming that the number of epochs for training is \(\mathcal {O}(1)\). The cost for the regressions is \(\mathcal {O}(\sum _{l=1}^L M_l^2 N_l)\). Hence the total cost for Eq. (6) is \(\mathcal {O}(N p_c + \sum _{l=1}^L M_l^2 N_l)\), which can be roughly bounded by \(\mathcal {O}(N P_\textrm{max})\), where \(P_\textrm{max} = \textrm{max}\{p_c, M_1^2,\dots , M_L^2\}\). Randomly initialized MM cannot be expected to provide reasonable results after one pass; however, with sparse GP models, the first iteration provides a significant improvement and is sometimes reasonable on its own. Ignoring parallel considerations, the total cost for 2-pass MM (MM2r) is \(\mathcal {O}(2 N p_c + \sum _{l=1}^L M_l^2 (2N_l+N))\).
For CCR, the cost of the second step of Eq. (6) is the same, while the first step is replaced with a chosen clustering algorithm, e.g. GMM, which incurs a cost of \(\mathcal {O}(NL)\). The latter iterates between steps which can be parallelized in different ways. The total cost of CCR is hence \(\mathcal {O}(N (p_c+L) + \sum _{l=1}^L M_l^2 N_l) = \mathcal {O}(N P_\textrm{max})\). This is a one-pass algorithm, which often provides acceptable results. The overhead for MM2r vs. CCR for our model is then roughly \(\mathcal {O}((P_\textrm{max}-L)N)\). More precisely it is \(\mathcal {O}(N (p_c-L) + \sum _{l=1}^L M_l^2 (N+N_l) )\).
3.4 Prediction
There are two approaches that can be employed to predict \(y^*\) at a test value \(x^*\). Hard allocation based prediction uses the single best regression/expert:
$$ {\hat{y}}(x^*) = {\hat{f}}(x^*; \theta _{z^*}), \qquad z^* = \mathop {\mathrm {argmax}}\limits _{l}\; w_l(x^*; {\hat{\psi }}), $$
where (similar to Eq. (2)), the GP expert's prediction is:
$$ {\hat{f}}(x^*; \theta _l) = \mu _l + (\textbf{k}_{l,M_l,x^*})^T (\textbf{K}_{l,M_l})^{-1} ({\tilde{{\textbf{f}}}}_l - {\varvec{\mu }}_l), $$
with \(\textbf{k}_{l,M_l,x^*}\) denoting the vector of length \(M_l\) with elements \(K_{\phi _l}({\tilde{x}}_{l,j}, x^*)\). Soft allocation based prediction, instead, is given by a weighted average:
$$ {\hat{y}}(x^*) = \sum _{l=1}^{L} w_l(x^*; {\hat{\psi }})\, {\hat{f}}(x^*; \theta _l). $$
Soft-allocation may be preferred in cases when there is not a clear jump in the unknown function, thus allowing us to smooth the predictions in regions where the classifier is unsure. Similarly, the variance or density of the output or regression function can also be computed at any test location to obtain measures of uncertainty in our predictions.
Hard allocation based density estimation delivers a Gaussian approximation for a test value \(x^*\),
$$ y^* \mid x^* \sim \text {N}\big({\hat{f}}(x^*; \theta _{z^*}),\; {\widehat{\sigma }}^{2}_{z^*}(x^*)\big), $$
with allocation \(z^*\) as in Eq. (13), and where \({\widehat{\sigma }}^{2}_l(x^*) = \lambda _{l, x^*} + \sigma ^{2}_l + (\textbf{k}_{l,M_l,x^*})^T (\textbf{Q}_{l,M_l})^{-1} \textbf{k}_{l,M_l,x^*}\) and \(\lambda _{l,x^*}\) is computed as in Eq. (3) at \(x^*\). Soft allocation based density estimation delivers a mixture of Gaussians for a test value \(x^*\), similar to Eq. (14):
$$ y^* \mid x^* \sim \sum _{l=1}^{L} w_l(x^*; {\hat{\psi }})\, \text {N}\big({\hat{f}}(x^*; \theta _l),\; {\widehat{\sigma }}^{2}_l(x^*)\big). $$
In cases when the density of the output may be multi-modal, looking at point predictions alone is not useful. In this setting, the soft density estimates are preferred, allowing one to capture and visualise the multi-modality.
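The following sketch computes the hard and soft point predictions and the soft (mixture of Gaussians) density on a grid, given the gating probabilities and per-expert predictive means and variances at a test input; all inputs here are illustrative values.

```python
# Sketch: hard- and soft-allocation prediction from precomputed per-expert quantities.
import numpy as np
from scipy.stats import norm

def predict(w_star, mean_star, var_star, y_grid=None):
    """w_star, mean_star, var_star: (L,) gating probs and expert means/variances at x*."""
    z_star = np.argmax(w_star)
    hard = mean_star[z_star]                         # single best expert
    soft = np.sum(w_star * mean_star)                # weighted average of experts
    density = None
    if y_grid is not None:                           # mixture-of-Gaussians density on a grid
        density = np.sum(w_star * norm.pdf(y_grid[:, None], mean_star, np.sqrt(var_star)),
                         axis=1)
    return hard, soft, density

hard, soft, dens = predict(np.array([0.7, 0.3]), np.array([1.0, -1.0]),
                           np.array([0.05, 0.05]), y_grid=np.linspace(-3, 3, 121))
```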
4 Numerical experiments
In this section, we perform a range of experiments to highlight the flexibility of our model and compare the accuracy and speed of the MM and CCR algorithms. For the sparse GP experts, we use the isotropic squared exponential covariance function with a variable number of inducing points \(M_l\) based on the cluster sizes (Burt et al., 2020; Nieman et al., 2021). The pseudo-inputs \({\tilde{{\textbf{x}}}}_l\) are initialized via K-means and the GP hyperparameters are initialized based on the scale of the data. The number of experts L is determined a priori using the Bayesian information criterion (BIC) to compare the GMM clustering solutions of the rescaled (y, x) across different values of L. The choice of GMM is motivated by allowing elliptically-shaped clusters in the rescaled space, while keeping cost low (e.g. compared to K-means). In “Appendix 1”, we instead consider using cross-validation and examine the log-likelihood on the held-out data to select the number of components, but we found that this significantly increases run time compared to BIC, with similar performance metrics. A table reporting the number of experts can also be found in “Appendix 1”. The architecture of the DNN gating network is chosen with Auto-Keras (Jin et al., 2019), and a quadratic regularization is used with a value of 0.0001 for the penalization parameter. The adaptive stochastic gradient descent solver Adam (Kingma & Ba, 2014) is used to optimize the model weights and biases, with a validation fraction of 0.1 and a maximum of 500 epochs. The GPy package (GPy, 2012) and Keras (Chollet, 2015) were used to train the experts and gating network, respectively.
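The BIC-based selection of L can be sketched as follows using scikit-learn's GMM on the rescaled (y, x); the candidate range, rescaling factor, and toy data are illustrative choices rather than the settings used in the experiments.

```python
# Sketch: choose the number of experts L by minimizing the BIC of GMM fits.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_num_experts(x, y, candidates=range(1, 9), beta=5.0, seed=0):
    yx = np.column_stack([beta * y, x])              # rescale to emphasize the output
    bics = {L: GaussianMixture(n_components=L, random_state=seed).fit(yx).bic(yx)
            for L in candidates}
    return min(bics, key=bics.get), bics             # L with the lowest BIC

x = np.random.rand(500, 2)
y = np.sin(6 * x[:, 0]) + 0.1 * np.random.randn(500)
L_best, bics = select_num_experts(x, y)
```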
For each experiment we performed 5-fold cross-validation with random shuffling of the data. The experiments are executed on Intel Core i7 (I7-7820HQ) with 4 cores and with 16GB of RAM.
4.1 Simulations
As a motivating toy example, we generate inputs within each cluster \(l = 1, 2\) from a noisy 2D spiral and outputs \(y = (2\,l -1) (x_1^2 + x_2^2) + \epsilon\). Figure 1 demonstrates that flexible DNN gating networks, over quadratic classifiers (e.g. logistic regression (LR)), are required to recover the true allocations. When combined with GP experts, this leads to improved accuracy (\(R^2=98.74\%, \, 86.98\%, \, 97.04\%\) for DNN+GP, LR+GP, MDN, respectively) and tighter credible intervals (CI) that maintain the desired coverage (average CI length = \(0.25,\, 1.36,\, 0.41\) and empirical coverage (EC) = \(95\%,\, 100\%,\, 98.7\%\) for DNN+GP, LR+GP, MDN, respectively). In particular, for outlying test points (red squares in Fig. 1), MDN provides highly overconfident inaccurate predictions, whereas predictions and CIs are more accurate and realistic for the proposed DNN+GP.
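A sketch of this toy example is given below; the spiral parameterization and noise levels are guesses for illustration and are not the exact settings used to produce Fig. 1.

```python
# Sketch: two-cluster spiral inputs with outputs y = (2l - 1)(x1^2 + x2^2) + eps.
import numpy as np

def make_spiral_data(n_per_cluster=250, noise_x=0.05, noise_y=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X, y, z = [], [], []
    for l in (1, 2):
        t = rng.uniform(0.5, 3 * np.pi, n_per_cluster)
        sign = 1 if l == 1 else -1                   # opposite winding for each cluster
        x1 = sign * t * np.cos(t) / (3 * np.pi) + noise_x * rng.standard_normal(n_per_cluster)
        x2 = sign * t * np.sin(t) / (3 * np.pi) + noise_x * rng.standard_normal(n_per_cluster)
        X.append(np.column_stack([x1, x2]))
        y.append((2 * l - 1) * (x1**2 + x2**2) + noise_y * rng.standard_normal(n_per_cluster))
        z.append(np.full(n_per_cluster, l))
    return np.vstack(X), np.concatenate(y), np.concatenate(z)

X, y, z = make_spiral_data()
```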
4.2 Datasets
Our experiments range from small to large datasets of varying dimensions and complexity and mainly aim to highlight the model’s capability to flexibly estimate the conditional density and capture issues such as discontinuity, non-stationarity, heteroscedasticity, and multi-modality that challenge standard GP models. First, the Motorcycle dataset (Silverman, 1985) consists of \(N=133\) measurements with \(d=1\). Second, the NASA dataset (Gramacy & Lee, 2008) comes from a computer simulator of a NASA rocket booster vehicle with \(N=3167\); we focus on modelling the lift force as a function of the speed (mach), the angle of attack (alpha), and the side-slip angle (beta), i.e. \(d=3\). Our third experiment is the Higdon function (Higdon, 2002; Gramacy & Lee, 2008); \(\mathcal {X}=[0,20]\) and \(N=1000\). Next, \(N=10,000\) points are generated from the Bernholdt function (Bernholdt et al., 2019); in this case, \(\mathcal {X}=[-4,10]^2\) and \(f(x) = g(x_1)g(x_2)\), where g(x) is the piece-wise smooth function studied in Monterrubio-Gómez et al. (2020). The kin40k dataset (Seeger et al., 2003) is a popular benchmark for GP regression methods and consists of \(N = 40,000\) points that describe the location of a robotic arm as a highly nonlinear function of control input with \(d = 8\).
4.2.1 \(\chi\) dataset
A tokamak is a device which uses magnetic fields to confine hot plasma in the shape of a torus. It is the leading candidate for production of controlled thermonuclear power, for use in a prospective future fusion reactor (Greenwald, 2016).
One-dimensional radial transport modeling (Park et al., 2017) using theory-based models such as GLF23 (Waltz et al., 1997), MMM95 (Bateman et al., 1998), and TGLF (Staebler et al., 2007) plays an essential role in interpreting experimental data and guiding new experiments for magnetically confined plasmas in tokamaks. Turbulent transport resulting from micro-instabilities has a strong nonlinear dependency on the temperature and density gradients. One of the key characteristics is a sharp increase of turbulent flux as the gradient of temperature increases beyond a certain critical value. This leads to a highly nonlinear and discontinuous function of the inputs.
Here we consider an analytical stiff transport model that describes turbulent ion energy transport in tokamak plasmas (Bernholdt et al., 2019; Janeschitz et al., 2002):
where \(\chi\) is ion thermal diffusivity, \(H(\cdot )\) is the Heaviside function, R is the major radius, and \(T'\) is the radial derivative of ion temperature. The normalized critical gradient \((R T'/T)_\textrm{crit}\) of ion temperature is calculated using the IFS/PPPL model (Kotschenreuther et al., 1995), which is a nonlinear function of electron density (\(n_e\)), electron and ion temperatures (\(T_e,T\)), safety factor (q), magnetic shear (\({\hat{s}}\)), effective charge (\(Z_\textrm{eff}\)) and the normalized gradient of ion temperature \((RT'/T)\) and density \((Rn'/n)\). It is assumed that \(S=1\) and \(\alpha =1\). Considering \((T, T')\) and \((n, n')\) as two input parameters each, this gives a total of 10 inputs \(x \in \mathbb {R}^{10}\). The output (15) is \(y \in \mathbb {R}_+\). Basic primitive model inputs are chosen uniformly at random from a hypercube \(\omega \in [0,1]^{17}\), which then give rise to realistic inputs \(x \in \mathbb {R}^{10}\) that concentrate on a manifold in the ambient space. See (Kotschenreuther et al., 1995) for details of the model, and Fig. 2 for visualization of the input distribution histogram.
The data consists of \(N=150,000\) simulations from (15). This model presents a challenge for the competing MoE methods, due to input dimensionality and data size, and for competing PoE methods, due to the sharp nonlinearities and discontinuities. It therefore serves to highlight the value of our method. Results appear in the bottom row of the tables.
4.3 Results
For our model, we compare the MM algorithm with the fast CCR and MM2r approximations. The MM algorithm is initialized at the CCR solution and iterates until the change in the \(R^2\) is below a threshold of 0.0001 or a maximum number of iterations is reached. Competing models are two product of GP experts models, gPoE (Cao & Fleet, 2014) and RBCM (Deisenroth & Ng, 2015), a mixture of GP experts, FastGP (Nguyen & Bonilla, 2014), and MDN (Bishop, 1994), as well as a Bayesian tree-based model (BART, Chipman et al., 2010). Treed GP models (Gramacy & Lee, 2008) were also considered, although the cost was so much larger that the method is not competitive (see Table 4). We also report results for other methods, including ORTHNAT (Salimbeni et al., 2018), PPGPR (Jankowiak et al., 2020b), DSPP (Jankowiak et al., 2020a), and Deep GPs (Damianou & Lawrence, 2013).
First, we observe that CCR has similar or improved test accuracy compared to MM2r (Table 1), and CCR reduces wall-clock time (Table 2) by a factor of 2–3 for all experiments. The CCR solution can be further refined through MM iterations (Table 1); however this comes at a higher computational cost. Moreover, in practice we observe little to no improvement in accuracy with further MM iterations, across all datasets.
Compared with state-of-the-art GP, neural network, and tree-based benchmarks, our model has the highest test accuracy in almost all the experiments considered; the one exception is RBCM for Motorcycle, where the accuracy is slightly higher, but well within the standard deviation. For the smaller datasets, CCR is slightly slower than the PoE models (gPoE and RBCM) and FastGP, however this is offset by the improved accuracy. CCR is substantially faster than all competitors for the \(\chi\) model, with \(d=10\) and (relatively) big data \(N=150,000\).
Lastly, we highlight that our model provides uncertainty quantification (as well as density estimation which can capture multi-modality in the predictions) for both the unknown function and predictions. Figure 3 displays the soft and hard allocation predictions together with the respective 2-\(\sigma\) CIs for the Motorcycle dataset. The model is able to recover the apparent non-stationarity and heteroscedasticity in the data, while the other models (similar plots are provided in Fig. 5) tend to produce intervals that are too wide (especially on the left for \(x \le -0.6\) for PoEs) or too tight. For our model, the data (in red) is contained within the region bounded with dashed lines, suggesting that the model also provides good empirical coverage. Table 3 gives more insight into the empirical coverage (i.e. fraction of times the test points are contained within the 95% CIs) against the average length of the CIs. It can be seen that the proposed model achieves tight intervals at the desired coverage, whereas most of the other models produce overconfident predictions or conservative intervals that are unnecessarily wide.
Table 4 shows the metrics obtained for the treed GP model on all datasets but kin40k and \(\chi _{\text {150k}}\), which were computationally infeasible (the treed GP package employs MCMC inference); hence, it is reported in a standalone table. Although the treed GP model has an advantage over CCR in terms of \(R^2\) and empirical coverage/interval length in several instances, this is offset by prohibitively long execution times.
Figure 4 compares heat maps of the conditional density for the Motorcycle dataset. It can be seen that, unlike the other models, the proposed method successfully recovers the non-stationarity and heteroscedasticity in the data and provides uncertainty quantification which is closest to that of the treed GP model, at a fraction of the cost.
The results from Tables 1, 2 and 3 are presented in a more digestible format in Fig. 5. Figure 5 (as well as Tables 1, 2 and 3) shows that the proposed method provides the best balance between low computational cost, high accuracy, and good uncertainty quantification (nominal coverage and tight CIs) in the examples considered.
5 Conclusion
We have proposed a novel MoE, which combines powerful DNNs to flexibly determine the local regions and sparse GPs to probabilistically model the local regression functions. Through various experiments, we have demonstrated that this combination provides a flexible, robust model that is able to recover challenging behaviors such as discontinuities, non-stationarity, and non-normality, while providing well-calibrated uncertainty. In addition, we have established a novel connection between the maximization-maximization algorithm and the recently introduced CCR algorithm. This allows us to obtain a fast approximation that significantly outperforms competing methods. Moreover, in some cases, the solution can be further refined through additional MM iterations. While we focus on the proposed deep mixture of sparse GP experts, this connection can be generally applied to other MoE architectures for fast approximation. Future research will explore extensions to infinite mixtures of experts, which allow for a data-driven number of clusters that can grow unboundedly with the data. In addition, the recent work of Rossi et al. (2021) found that combining the FITC approximation of GPs with a Bayesian treatment of the inducing points and hyperparameters can improve performance significantly, in particular compared with variationally sparse GPs; in this direction, future work will also explore suitable priors for inducing variables and GP hyperparameters in MAP estimation.
Data availibility
All datasets used in the paper are publicly available.
Code availability
Code is available at https://github.com/zankin/DMoGPE.
References
Ambrogioni, L., Güçlü, U., van Gerven, M.A., & Maris, E. (2017). The kernel mixture network: A nonparametric method for conditional density estimation of continuous random variables. arXiv preprint arXiv:1705.07111 .
Banerjee, S., Gelfand, A. E., Finley, A. O., & Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4), 825–848.
Bateman, G., Kritz, A. H., Kinsey, J. E., Redd, A. J., & Weiland, J. (1998). Predicting temperature and density profiles in tokamaks. Physics of Plasmas, 5(5), 1793–1799.
Bauer, M., van der Wilk, M., & Rasmussen, C. E. (2016). Understanding probabilistic sparse Gaussian process approximations. Advances in Neural Information Processing Systems, 29, 1533–1541.
Bernholdt, D. E., Cianciosa, M. R., Green, D. L., Park, J. M., Law, K. J., & Etienam, C. (2019). Cluster, classify, regress: A general method for learning discontinuous functions. Foundations of Data Science, 1, 491.
Bishop, C. M. (1994). Mixture density networks. Technical Report, Aston University.
Bui, T. D., Yan, J., & Turner, R. E. (2017). A unifying framework for sparse Gaussian process approximation using power expectation propagation. Journal of Machine Learning Research, 18, 1–72.
Burt, D. R., Rasmussen, C. E., & van der Wilk, M. (2020). Convergence of sparse variational inference in Gaussian processes regression. Journal of Machine Learning Research, 21, 1–63.
Cao, Y., & Fleet, D. J. (2014). Generalized product of experts for automatic and principled fusion of Gaussian process predictions. CoRR.
Chatterjee, S., & Diaconis, P. (2018). The sample size required in importance sampling. The Annals of Applied Probability, 28(2), 1099–1135.
Chen, Z., Ma, J., & Zhou, Y. (2014). A precise hard-cut EM algorithm for mixtures of Gaussian processes. In International conference on intelligent computing (pp. 68–75).
Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266–298.
Chollet, F., et al. (2015). Keras. https://keras.io
Cressie, N., & Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 209–226.
Damianou, A., & Lawrence, N. D. (2013). Deep Gaussian processes. In Artificial intelligence and statistics (pp. 207–215).
Daskalakis, C., Dellaportas, P., & Panos, A. (2020). Faster Gaussian processes via deep embeddings.
Deisenroth, M., & Ng, J.W. (2015). Distributed Gaussian processes. In International conference on machine learning (pp. 1481–1490).
Durrande, N., Adam, V., Bordeaux, L., Eleftheriadis, S., & Hensman, J. (2019). Banded matrix operators for Gaussian Markov models in the automatic differentiation era. In 22nd International conference on artificial intelligence and statistics (Vol. 89, pp. 2780–2789).
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Knowledge discovery in databases (Vol. 96, pp. 226–231).
Gadd, C. W., Wade, S., & Boukouvalas, A. (2020). Enriched mixtures of Gaussian process experts. In 23rd International conference on artificial intelligence and statistics.
Gormley, I. C., & Frühwirth-Schnatter, S. (2019). Mixtures of experts models. In S. Frühwirth-Schnatter, G. Celeux, & C. P. Robert (Eds.), Handbook of mixture analysis. Chapman and Hall/CRC.
GPy (since 2012). GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy.
Gramacy, R. B., & Lee, H. K. H. (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483), 1119–1130.
Greenwald, J. (2016). Major next steps for fusion energy based on the spherical tokamak design.
Grigorievskiy, A., Lawrence, N., & Särkkä, S. (2017). Parallelizable sparse inverse formulation Gaussian processes (SpInGP). In: 2017 IEEE 27th international workshop on machine learning for signal processing (MLSP) (pp. 1–6).
Hartigan, J. A., & Wong, M. A. (1979). Algorithm as 136: A k-means clustering algorithm. Journal of the Royal Statistical Society Series C (Applied Statistics), 28(1), 100–108.
Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., & Zammit-Mangion, A. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24(3), 398–425. https://doi.org/10.1007/s13253-018-00348-w
Higdon, D. (2002). Space and space-time modeling using process convolutions. In Quantitative methods for current environmental issues (pp. 37–56). Springer.
Huang, W., Zhao, D., Sun, F., Liu, H., & Chang, E. (2015). Scalable Gaussian process regression using deep neural networks. In Twenty-fourth international joint conference on artificial intelligence.
Iwata, T., & Ghahramani, Z. (2017). Improving output uncertainty estimation and generalization in deep learning via neural network Gaussian processes. arXiv preprint arXiv:1707.05922
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
Janeschitz, G., Pacher, G., Zolotukhin, O., Pereverzev, G., Pacher, H., Igitkhanov, Y., & Sugihara, M. (2002). A 1-d predictive model for energy and particle transport in h-mode. Plasma Physics and Controlled Fusion, 44(5A), A459.
Jankowiak, M., Pleiss, G., & Gardner, J. (2020a). Deep sigma point processes. In Conference on uncertainty in artificial intelligence (pp. 789–798).
Jankowiak, M., Pleiss, G., & Gardner, J. (2020b). Parametric Gaussian process regressors. In International conference on machine learning (pp. 4702–4712).
Jin, H., Song, Q., & Hu, X. (2019). Auto-keras: An efficient neural architecture search system. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 1946–1956).
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.
Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Kittler, J., & Föglein, J. (1984). Contextual classification of multispectral pixel data. Image and Vision Computing, 2(1), 13–29.
Kotschenreuther, M., Dorland, W., Beer, M., & Hammett, G. (1995). Quantitative predictions of tokamak energy confinement from first-principles simulations with kinetic effects. Physics of Plasmas, 2(6), 2381–2389.
Kurihara, K., & Welling, M. (2009). Bayesian k-means as a “maximization-expectation’’ algorithm. Neural Computation, 21(4), 1145–1172.
Lindgren, F., Rue, H., & Lindström, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4), 423–498.
McLachlan, G. J., & Basford, K. E. (1988). Mixture models: Inference and applications to clustering (Vol. 38). M. Dekker.
Meeds, E., & Osindero, S. (2006). An alternative infinite mixture of Gaussian process experts. In Advances in neural information processing systems (pp. 883–890).
Monterrubio-Gómez, K., Roininen, L., Wade, S., Damoulas, T., & Girolami, M. (2020). Posterior inference for sparse hierarchical non-stationary models. Computational Statistics & Data Analysis.
Nguyen, A., Yosinski, J., & Clune, J. (2015). Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 427–436).
Nguyen, T., & Bonilla, E. (2014). Fast allocation of Gaussian process experts. International conference on machine learning (pp. 145–153).
Nieman, D., Szabo, B., & van Zanten, H. (2021). Contraction rates for sparse variational approximations in Gaussian process regression. arXiv preprint arXiv:2109.10755
Park, J. M., Murakami, M., John, H. S., Lao, L. L., Chu, M., & Prater, R. (2017). An efficient transport solver for tokamak plasmas. Computer Physics Communications, 214, 1–5.
Pope, C. A., Gosling, J. P., Barber, S., Johnson, J. S., Yamaguchi, T., Feingold, G., & Blackwell, P. G. (2019). Gaussian process modeling of heterogeneity and discontinuities using Voronoi tessellations. Technometrics, 1–20.
Quiñonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6, 1939–1959.
Rasmussen, C. E., & Ghahramani, Z. (2002). Infinite mixtures of Gaussian process experts. Advances in neural information processing systems (pp. 881–888).
Rigon, T., & Durante, D. (2017). Tractable Bayesian density regression via logit stick-breaking priors. arXiv preprint arXiv:1701.02969
Rossi, S., Heinonen, M., Bonilla, E., Shen, Z., & Filippone, M. (2021). Sparse gaussian processes revisited: Bayesian approaches to inducing-variable approximations. In Banerjee, A., & Fukumizu, K. (Eds.), Proceedings of the 24th international conference on artificial intelligence and statistics (Vol. 130, pp. 1837–1845). PMLR.
Salimbeni, H., Cheng, C.-A., Boots, B., & Deisenroth, M. (2018). Orthogonally decoupled variational Gaussian processes. In Advances in neural information processing systems (Vol. 31).
Seeger, M. W., Williams, C. K. I., & Lawrence, N. D. (2003). Fast forward selection to speed up sparse Gaussian process regression. In Bishop, C. M. & Frey, B. J. (Eds.), Proceedings of the ninth international workshop on artificial intelligence and statistics (Vol. R4, pp. 254–261). PMLR. Retrieved from https://proceedings.mlr.press/r4/seeger03a.html (Reissued by PMLR on 01 April 2021).
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society: Series B (Methodological), 47(1), 1–21.
Snelson, E., & Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Advances in neural information processing systems (pp. 1257–1264).
Staebler, G., Kinsey, J., & Waltz, R. (2007). A theory-based transport model with comprehensive physics. Physics of Plasmas, 14(5), 055909.
Szabó, B., & Van Zanten, H. (2019). An asymptotic analysis of distributed nonparametric methods. Journal of Machine Learning Research, 20(87), 1–30.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199
Szymanski, L., & McCane, B. (2012). Deep, super-narrow neural network is a universal classifier. In The 2012 international joint conference on neural networks (IJCNN) (pp. 1–8).
Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Artificial intelligence and statistics (pp. 567–574).
Trapp, M., Peharz, R., Pernkopf, F., & Rasmussen, C. E. (2020). Deep structured mixtures of Gaussian processes. In International conference on artificial intelligence and statistics (pp. 2251–2261).
Tresp, V. (2000). A Bayesian committee machine. Neural Computation, 12(11), 2719–2741.
Tresp, V. (2001). Mixtures of Gaussian processes. In Advances in neural information processing systems (pp. 654–660).
Waltz, R., Staebler, G., Dorland, W., Hammett, G., Kotschenreuther, M., & Konings, J. (1997). A gyro-Landau-fluid transport model. Physics of Plasmas, 4(7), 2482–2496.
Williams, C. K., & Rasmussen, C. E. (2006). Gaussian processes for machine learning. MIT Press.
Wilson, A. G., Hu, Z., Salakhutdinov, R., & Xing, E. P. (2016). Deep kernel learning. In Artificial intelligence and statistics (pp. 370–378).
Xu, L., Jordan, M. I., & Hinton, G. E. (1995). An alternative model for mixtures of experts. In Advances in neural information processing systems (pp. 633– 640).
Yuan, C., & Neubauer, C. (2009). Variational mixture of Gaussian process experts. In Advances in neural information processing systems (pp. 1897–1904).
Zhang, M. M., & Williamson, S. A. (2019). Embarrassingly parallel inference for Gaussian processes. Journal of Machine Learning Research, 20, 1–26.
Funding
CE, KL, and VZ were supported by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). VZ and KL were supported by The Alan Turing Institute.
Author information
Contributions
All authors, CE, KJHL, SW, and VZ, substantially contributed to the conception and design of this work.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical standard
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Derek Greene.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix 1: Number of experts
Table 5 contains the number of experts L used for inference on each dataset. This quantity was chosen using BIC.
The number of experts can also be estimated using cross-validation. In Table 6 and Fig. 6 we present results for the Motorcycle dataset when L is determined using this approach. Table 6 reports the wall-clock times taken to estimate the number of experts using both the BIC (\(L_{BIC}\)) and cross-validation (\(L_{CV}\)) approaches.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Etienam, C., Law, K.J.H., Wade, S. et al. Fast deep mixtures of Gaussian process experts. Mach Learn 113, 1483–1508 (2024). https://doi.org/10.1007/s10994-023-06491-x