Graph Structure Learning with Interpretable
Bayesian Neural Networks

Max Wasserman [email protected]
Department of Computer Science
University of Rochester
Gonzalo Mateos [email protected]
Department of Electrical and Computer Engineering
University of Rochester
Abstract

Graphs serve as generic tools to encode the underlying relational structure of data. Often this graph is not given, and so the task of inferring it from nodal observations becomes important. Traditional approaches formulate a convex inverse problem with a smoothness promoting objective and rely on iterative methods to obtain a solution. In supervised settings where graph labels are available, one can unroll and truncate these iterations into a deep network that is trained end-to-end. Such a network is parameter efficient and inherits inductive bias from the optimization formulation, an appealing aspect for data constrained settings in, e.g., medicine, finance, and the natural sciences. But typically such settings care equally about uncertainty over edge predictions, not just point estimates. Here we introduce novel iterations with independently interpretable parameters, i.e., parameters whose values - independent of other parameters’ settings - proportionally influence characteristics of the estimated graph, such as edge sparsity. After unrolling these iterations, prior knowledge over such graph characteristics shape prior distributions over these independently interpretable network parameters to yield a Bayesian neural network (BNN) capable of graph structure learning (GSL) from smooth signal observations. Fast execution and parameter efficiency allow for high-fidelity posterior approximation via Markov Chain Monte Carlo (MCMC) and thus uncertainty quantification on edge predictions. Informative priors unlock modeling tools from Bayesian statistics like prior predictive checks. Synthetic and real data experiments corroborate this model’s ability to provide well-calibrated estimates of uncertainty, in test cases that include unveiling economic sector modular structure from S&\&&P500500500500 data and recovering pairwise digit similarities from MNIST images. Overall, this framework enables GSL in modest-scale applications where uncertainty on the data structure is paramount.

1 Introduction

Graphs serve as a foundational paradigm in machine learning, data science, and network science for modeling complex systems, capturing intricate relationships in data that range from social networks to gene interactions; see e.g., (Kolaczyk, 2009; Hamilton, 2020). In computational biology, accurate graph structures can offer insights into gene regulatory pathways, enhancing our ability to treat diseases at the genetic level (Cai et al., 2013). In finance, the ability to recover precise financial networks can be useful for risk assessment and market stability (Marti et al., 2021). In geometric deep learning (Bronstein et al., 2017), the lack of observed graph structure underlying data often limits the use of efficient learning models such as graph neural networks (GNNs); see e.g., (Cosmo et al., 2020). Yet, despite their importance, existing methodologies for graph structure learning (GSL) from multivariate nodal data face significant limitations.

This work addresses the GSL problem where nodal observations are used to predict the completely unobserved graph structure. Traditional approaches summarize the nodal observations with a pairwise vertex (dis)similarity matrix, e.g., correlation or Euclidean distance matrices, and formulate GSL as a regularized convex inverse problem (Mateos et al., 2019; Dong et al., 2019). The primary objective encourages data fidelity, according to a chosen model linking the nodal observations and the sought latent graph, while the regularization objectives capture prior structural knowledge, such as edge sparsity or desired connectivity patterns. These model-based methods solve the inverse problem using optimization algorithms, which often come with convergence rate guarantees (Saboksayr & Mateos, 2021; Wang et al., 2023). However, as noted in (Pu et al., 2021) their expressiveness is constrained to graph characteristics that can be modeled via convex criteria and they may suffer complexity and scalability issues. Specifically, these model-based approaches typically necessitate an outer loop for grid-based optimization of regularization parameters (and often other algorithm parameters, e.g., a step-size), and an inner loop that, given this fully defined optimization problem, may demand thousands of iterations to converge on a solution. When nodal observations come with corresponding graph labels, recent supervised GSL approaches partially overcome these obstacles using ‘algorithm unrolling’ (Shrivastava et al., 2020; Pu et al., 2021; Wasserman et al., 2023). Unrollings truncate these inner-loop iterations to yield a deep network architecture that is trained end-to-end to approximate the solution to the inverse problem (Monga et al., 2021). The choice of a custom loss function, which need not be convex, plus optional architectural refinements of an already well-motivated initial network tend to improve performance on the task of interest. Truncation depth provides explicit control on the complexity of prediction, now a forward pass in the network. And the use of backpropogated gradients makes learning the regularization (and step-size) parameters, now network parameters, more scalable. Additionally, the unrolled deep networks tend to inherit appealing aspects from the original model-based formulation, namely inductive bias and low dimensionality, as their outputs are approximations to solutions of the original inverse problem.

1.1 Towards interpretability and uncertainty quantification: Desiderata and contributions

In composite inverse problems with multiple regularizers, understanding how each regularization parameter values affect desirable solution traits (e.g., image sharpness or number of graph edges) may be complex and often beyond the modeler’s knowledge. This obscurity necessitates a naive—and consequently costly—multi-dimensioanl grid search for suitable parameter values; see e.g., the numerical study in (Dong et al., 2016, Section V-B). When the relationship between the regularization parameters and a solution characteristic is clear, we say the parameters are interpretable with respect to (w.r.t.) the output characteristic. When an output characteristic is influenced by a single parameter, independent of all others, we call this parameter independently interpretable w.r.t. the output characteristic. Such instances make interpretability actionable, namely they allow prior knowledge on the solution characteristic to be incorporated onto the value of the independently interpretable parameter. A key contribution of this work is: i) in recognizing optimization problems with independently interpretable parameters as ripe for the adoption of unrollings, which inherit this interpretability because they approximate inverse problem solutions; and ii) in leveraging the Bayesian framework to seamlessly incorporate such prior knowledge into the parameters of the unrolled network.

An issue unrollings often face (within GSL and beyond) is their tendency to produce layers with pre-activation outputs that are nonlinear functions of the parameters; e.g., parameter products and parameters in denominators (Monga et al., 2021). This prevents the deep network from being a true neural network (NN), i.e., a function composed of layers, where each layer consists of affine transformations of data or intermediate activations followed by non-linear functions applied pointwise (Bishop, 1995). To address this issue, network designers may opt for reparameterization, but at the expense of parameter interpretability and often degraded empirical performance (Monga et al., 2021; Shrivastava et al., 2020). Unrollings can be impeded by nuisance parameters - parameters not of direct interest, but which must be accounted for - like a step-size. Nuisance parameters are typically a vestige of the chosen optimization algorithm rather than being intrinsic to the problem formulation. Nuissance parameters can also undermine training stability (see Appendix A.1.1) and since they lack a clear connection to specific solution characteristics they hinder informative prior modeling. The preceding discussion motivates the need for innovative GSL techniques to produce true NN unrollings amenable to incorporation of prior knowledge on their parameters.

Providing estimates of uncertainty on an inferred graph structure is important for downstream applications, e.g. in biology, finance, and machine learning with GNNs. In general Bayesian statistics, low-dimensional models with interpretable parameters are constructed using background information on the specific problem. Interpretability allows informative prior modeling and thus access to enticing Bayesian modeling tools, e.g., prior predictive checks (Gabry et al., 2017), while the low dimensionality allows tractable high-quality posterior inference, typically via Markov Chain Monte Carlo (MCMC) sampling. For example, traditional Bayesian approaches to GSL tend to address the transductive (tied to a single graph) setting by constructing a joint model over parameters, observed data, and latent graph structure, use MCMC to draw samples, and marginalize out all but the graph samples (Butts, 2003; Crawford, 2015; Gray et al., 2020). Stepping back from the specific case of GSL, the inability to specify a sufficiently expressive model often motivates the use of Bayesian neural networks (BNNs), an alternative paradigm which typically bypasses domain-specific modeling opting instead to feed the output of a performant NN directly into the likelihood function; see e.g., (Jospin et al., 2022). Both traditional Bayesian and BNN approaches derive uncertainty estimates over predictions by marginalizing out the parameter posterior. However, BNNs present notable challenges, especially in the incorporation of prior information and posterior approximation. Due to the non-interpretability of parameters, priors are often selected for computational convenience, such as a zero-mean isotropic Gaussian, which can inadvertently bias predictions against prior beliefs, a phenomenon known as ‘unintentionally informative priors’ (Wenzel et al., 2020), thus negating a key benefit of Bayesian methods. Additionally, the parameters’ high dimensionality and complex posterior geometry make high-quality posterior approximation a formidable task, often requiring significant approximations that undermine the interpretability of the results (Gal & Ghahramani, 2016; Wenzel et al., 2020; Monga et al., 2021). This highlights the need for an inductive GSL method that not only provides reliable uncertainty estimates over edge predictions but also combines the expressive power of BNNs with the traditional Bayesian approach’s strengths in integrating prior knowledge, employing robust modeling tools like predictive checks, and ensuring high-fidelity posterior approximation.

Summary of contributions. In this paper, we introduce the first BNN for supervised GSL from smooth signal observations. This BNN produces a distribution over unseen test graphs allowing estimation of uncertainty over edge predictions. It leverages the independent intepretability of the paramters in the GSL formulation to allow informative prior modeling over the weights of the NN, itself a result of unrolling novel iterations for a model-based formulation with well-documented merits. We make the following contributions:

  • In Section 3, we develop a novel optimization algorithm for GSL from smooth signals (Algorithm 1), which is step-size free and parameterized to yield independent interpretability w.r.t. edge sparsity.

  • In Section 4, we unroll Algorithm 1 to produce the first strict unrolling for GSL from smooth signals, which results in a true NN. This is to be contrasted with existing supervised-learning approaches to GSL, where GLAD (Shrivastava et al., 2020) and Unrolled PDS (Pu et al., 2021) are not true NNs, and GDN (Wasserman et al., 2023) is not a strict unrolling because it resorts to gradient truncation and reparameterization. The proposed unrolled NN is used to define our BNN, dubbed ‘DPG’ since Algorithm 1 is a dual-based proximal gradient method (Beck & Teboulle, 2014).

  • In Section 5, we introduce a methodology to integrate prior knowledge into BNN prior distributions, specifically for networks derived from unrolled optimization algorithms for inverse problems with independent interpretability. We show this approach unlocks classical Bayesian modeling tools like predictive checking, which we fruitfully apply to DPG. High-fidelity parameter posterior inference via Hamiltonian Monte Carlo (HMC) sampling enables the first instance of a model for GSL from smooth signals, capable of producing estimates of uncertainty over edge predictions.

  • In Section 6, we validate DPG’s ability to produce high-quality and well-calibrated uncertainty estimates from synthetic data, stock price time series (S&\&&P500500500500), as well as graphs learnt from images of MNIST digits. The reliability of these estimates is underscored by notable Pearson correlations between predictive uncertainty and error: 0.700.700.700.70 for the stock data and 0.620.620.620.62 for the MNIST digits. Additional experimental evidence is provided in the appendices.

2 Related Work

A recent body of work addresses the GSL task via algorithm unrollings under varying data assumptions. Noteworthy contributions include GLAD (Shrivastava et al., 2020), which unrolls alternating-minimization iterations for Gaussian graphical model selection, Unrolled PDS (Pu et al., 2021) which unrolls primal-dual splitting (PDS) iterations for the GSL problem from smooth signals, and GDN (Wasserman et al., 2023) which unrolls linearized proximal-gradient iterations for a network deconvolution task that posits a polynomial relationship between the observed pairwise vertex distance matrix and the latent graph. Here we deal with the GSL problem from observations of smooth signals as in (Pu et al., 2021), but the optimization algorithm we unroll is different (cf. DPG in Algorithm 1 versus PDS), more compact and devoid of step-sizes. Additionally, none of the previous GSL unrolling result in true NNs, provide estimates of uncertainty on their adjacency matrix predictions, nor leverage the interpretability of network parameters in the modeling process.

Building a probabilistic model to connect observed network data to latent graph structure has a long history in the computer and social network analysis communities; see e.g., (Coates et al., 2002; Butts, 2003; Kolaczyk, 2009). Gibbs sampling was used mostly as a means to efficiently explore the resulting network posterior rather than quantify the uncertainty the model placed on particular edges. Since then, this probabilistic modeling approach has been used with alternate network models, types of observed data (e.g., information cascades and protein-protein interactions), and posterior approximation approaches; see (Shaghaghian & Coates, 2016; Gray et al., 2020; Jiang & Kolaczyk, 2011; Williamson, 2016; Pal & Coates, 2019). Such approaches are typically transductive - tying themselves to a single training graph - and require expensive joint inference of the model and the latent graph structure. Our approach only requires inference of the BNN model parameters and thus is naturally inductive, i.e., able to generalize to new nodes, or entirely new graphs. Some recent works build on such graph distribution modeling approaches in an approximate Bayesian manner, to incorporate the uncertainty in an observed graph for downstream tasks with GNNs (Zhang et al., 2019; Pal et al., 2020). Others forgo modeling the distribution of the observed graph but take approximate Bayesian approaches to modeling with GNNs. For instance, (Opolka & Lió, 2022) uses a deep graph convolutional Gaussian process with variational posterior approximation for link prediction, and (Sevilla & Segarra, 2023) pre-trains a score-matching GNN for use in annealed Langevin diffusion to draw approximate samples from the network posterior. All such Bayesian GNN approaches require (partial) observation of graph structure, and rely on approximate inference methods due to large dimensionality. Our DPG approach uses no observed graph structure (except for graph labels during training) and allows for high-fidelity posterior approximation. To the best of our knowledge, BNNs have so far not been used for GSL with uncertainty quantification.

More broadly, unrolling-inspired Bayesian deep networks have recently found success in uncertainty quantification for computational imaging (Barbano et al., 2020; Zhang et al., 2021; Ekmekci & Cetin, 2022). The inductive bias provided by the original iterations lead to gains in data efficiency, but still have limited parameter interpretability and high dimensionality leading to naive priors and coarse posterior approximation. This exciting line of work inspired some crucial ideas in this paper, cross-pollinating benefits to GSL.

3 Model-based Formulation and Optimization Preliminaries

Let 𝒢(𝒱,,𝑨)𝒢𝒱𝑨\mathcal{G}(\mathcal{V},\mathcal{E},{\bm{A}})caligraphic_G ( caligraphic_V , caligraphic_E , bold_italic_A ) be an undirected graph, where 𝒱={1,N}𝒱1𝑁\mathcal{V}=\{1,\ldots N\}caligraphic_V = { 1 , … italic_N } are the vertices (or nodes), 𝒱×𝒱𝒱𝒱\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}caligraphic_E ⊆ caligraphic_V × caligraphic_V are the edges, and 𝑨+N×N𝑨superscriptsubscript𝑁𝑁{\bm{A}}\in\mathbb{R}_{+}^{N\times N}bold_italic_A ∈ blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N × italic_N end_POSTSUPERSCRIPT is the symmetric adjacency matrix collecting the non-negative edge weights. For (i,j)𝑖𝑗(i,j)\notin\mathcal{E}( italic_i , italic_j ) ∉ caligraphic_E we have Aij=0subscript𝐴𝑖𝑗0A_{ij}=0italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 0. We exclude the possibility of self loops, so that 𝑨𝑨{\bm{A}}bold_italic_A is hollow meaning Aii=0subscript𝐴𝑖𝑖0A_{ii}=0italic_A start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT = 0, for all i𝒱𝑖𝒱i\in\mathcal{V}italic_i ∈ caligraphic_V. For the special case of unweighted graphs that will be prominent in our models, then 𝑨{0,1}N×N𝑨superscript01𝑁𝑁{\bm{A}}\in\{0,1\}^{N\times N}bold_italic_A ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_N × italic_N end_POSTSUPERSCRIPT.

In this paper, we consider that 𝑨𝑨{\bm{A}}bold_italic_A is unknown and we want to estimate the latent graph structure from nodal measurements only111This is different to the link prediction task, where one is given measurements of edge status for a training subset of node pairs (plus, optionally, node attributes), and the transductive goal is to predict test links from the same graph(Kolaczyk, 2009). To this end, we acquire graph signal observations 𝒙=[x1,,xN]N𝒙superscriptsubscript𝑥1subscript𝑥𝑁topsuperscript𝑁{\bm{x}}=[x_{1},\dots,x_{N}]^{\top}\in\mathbb{R}^{N}bold_italic_x = [ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT, where xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the signal value (i.e., a nodal attribute or feature) at vertex i𝒱𝑖𝒱i\in\mathcal{V}italic_i ∈ caligraphic_V. When P𝑃Pitalic_P such signals are available we construct matrix 𝑿=[𝒙1,,𝒙P]N×P𝑿subscript𝒙1subscript𝒙𝑃superscript𝑁𝑃{\bm{X}}=[{\bm{x}}_{1},\dots,{\bm{x}}_{P}]\in\mathbb{R}^{N\times P}bold_italic_X = [ bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_x start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_P end_POSTSUPERSCRIPT, where each row 𝒙¯iPsubscriptsuperscript¯𝒙top𝑖superscript𝑃\bar{{\bm{x}}}^{\top}_{i}\in\mathbb{R}^{P}over¯ start_ARG bold_italic_x end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_P end_POSTSUPERSCRIPT, i=1,,N𝑖1𝑁i=1,\ldots,Nitalic_i = 1 , … , italic_N, of 𝑿𝑿{\bm{X}}bold_italic_X represents a vector of features or nodal attributes at vertex i𝑖iitalic_i. We can summarize this dataset using a pairwise vertex dissimilarity matrix, here the Euclidean distance matrix 𝑬+N×N𝑬superscriptsubscript𝑁𝑁{\bm{E}}\in\mathbb{R}_{+}^{N\times N}bold_italic_E ∈ blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N × italic_N end_POSTSUPERSCRIPT, where Eij=𝒙¯i𝒙¯j22subscript𝐸𝑖𝑗superscriptsubscriptnormsubscript¯𝒙𝑖subscript¯𝒙𝑗22E_{ij}=\|\bar{{\bm{x}}}_{i}-\bar{{\bm{x}}}_{j}\|_{2}^{2}italic_E start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = ∥ over¯ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Assuming our data lie on a smooth manifold, we interpret 𝒢𝒢\mathcal{G}caligraphic_G as a discrete representation of this manifold. When nodes ij𝒱𝑖𝑗𝒱i\neq j\in\mathcal{V}italic_i ≠ italic_j ∈ caligraphic_V have large edge weight Aijsubscript𝐴𝑖𝑗A_{ij}italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT, reflecting close points on the manifold, Eijsubscript𝐸𝑖𝑗E_{ij}italic_E start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT will be small. Accordingly, smooth (w.r.t. 𝒢𝒢\mathcal{G}caligraphic_G) vectors in 𝑿𝑿{\bm{X}}bold_italic_X have small total variation or Dirichlet energy (Belkin & Niyogi, 2001), namely

TV𝒢(𝑿)=12i,jAij𝒙¯i𝒙¯j22=𝑨𝑬1,1,𝑇subscript𝑉𝒢𝑿12subscript𝑖𝑗subscript𝐴𝑖𝑗superscriptsubscriptnormsubscript¯𝒙𝑖subscript¯𝒙𝑗22subscriptnorm𝑨𝑬11TV_{\mathcal{G}}({\bm{X}})=\frac{1}{2}\sum_{i,j}A_{ij}\|\bar{{\bm{x}}}_{i}-% \bar{{\bm{x}}}_{j}\|_{2}^{2}=\|{\bm{A}}\circ{\bm{E}}\|_{1,1},italic_T italic_V start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT ( bold_italic_X ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ∥ over¯ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∥ bold_italic_A ∘ bold_italic_E ∥ start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT , (1)

where 𝒁1,1subscriptnorm𝒁11\|{\bm{Z}}\|_{1,1}∥ bold_italic_Z ∥ start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT is the 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-norm of 𝒁𝒁{\bm{Z}}bold_italic_Z and \circ is the Hadamard (entrywise) product. The prevalence of smooth network data, for instance sensor measurements (Chepuri et al., 2017), protein function annotations (Kolaczyk, 2009), and product ratings (Huang et al., 2018), justifies using a smoothness criterion for the GSL task.

3.1 Graph structure learning from smooth signals

Given 𝑿𝑿{\bm{X}}bold_italic_X assumed to be smooth on 𝒢𝒢\mathcal{G}caligraphic_G, a popular model-based GSL approach is to minimize the Dirichlet energy in (1) w.r.t. 𝑨𝑨{\bm{A}}bold_italic_A; see e.g., (Hu et al., 2013; Dong et al., 2016; Kalofolias, 2016; Kalofolias & Perraudin, 2017). The inverse problems posed in these works can be unified under the general composite formulation

𝑨=argmin𝑨𝒜{𝑨𝑬1,1+h(𝑨)},superscript𝑨subscriptargmin𝑨𝒜subscriptnorm𝑨𝑬11𝑨{\bm{A}}^{*}=\operatorname*{arg\,min}_{{\bm{A}}\in\mathcal{A}}\left\{\|{\bm{A}% }\circ{\bm{E}}\|_{1,1}+h({\bm{A}})\right\},bold_italic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT bold_italic_A ∈ caligraphic_A end_POSTSUBSCRIPT { ∥ bold_italic_A ∘ bold_italic_E ∥ start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT + italic_h ( bold_italic_A ) } , (2)

where the feasible set is 𝒜:={𝑨N×N: diag(𝑨)=𝟎,Aij=Aji0,i,j𝒱}assign𝒜conditional-set𝑨superscript𝑁𝑁formulae-sequenceformulae-sequence diag𝑨0subscript𝐴𝑖𝑗subscript𝐴𝑗𝑖0for-all𝑖𝑗𝒱\mathcal{A}:=\{{\bm{A}}\in\mathbb{R}^{N\times N}:\text{ diag}({\bm{A}})=% \mathbf{0},A_{ij}=A_{ji}\geq 0,\forall i,j\in\mathcal{V}\}caligraphic_A := { bold_italic_A ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_N end_POSTSUPERSCRIPT : diag ( bold_italic_A ) = bold_0 , italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = italic_A start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT ≥ 0 , ∀ italic_i , italic_j ∈ caligraphic_V }, i.e., hollow, symmetric, non-negative matrices. The regularization term h(𝑨)𝑨h({\bm{A}})italic_h ( bold_italic_A ) typically promotes desired structure on the estimated edge set (e.g., sparsity, no isolated nodes) and can be used to avoid the trivial solution 𝑨=𝟎superscript𝑨0{\bm{A}}^{*}=\mathbf{0}bold_italic_A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = bold_0. We henceforth use h(𝑨)=α𝟏log(𝑨𝟏)+β2𝑨F2𝑨𝛼superscript1toplog𝑨1𝛽2superscriptsubscriptnorm𝑨𝐹2h({\bm{A}})=-\alpha\mathbf{1}^{\top}\text{log}({\bm{A}}\mathbf{1})+\frac{\beta% }{2}\|{\bm{A}}\|_{F}^{2}italic_h ( bold_italic_A ) = - italic_α bold_1 start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT log ( bold_italic_A bold_1 ) + divide start_ARG italic_β end_ARG start_ARG 2 end_ARG ∥ bold_italic_A ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (α,β0𝛼𝛽0\alpha,\beta\geq 0italic_α , italic_β ≥ 0 are regularization parameters), which excludes the possibility of isolated nodes and has achieved state-of-the-art results (Kalofolias, 2016).

It is convenient to reformulate (2) in an unconstrained, yet equivalent form. We start by compactly representing variable 𝑨𝑨{\bm{A}}bold_italic_A and data matrix 𝑬𝑬{\bm{E}}bold_italic_E with their vectorized upper triangular parts 𝒂,𝒆+N(N1)/2𝒂𝒆superscriptsubscript𝑁𝑁12{\bm{a}},{\bm{e}}\in\mathbb{R}_{+}^{N(N-1)/2}bold_italic_a , bold_italic_e ∈ blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N ( italic_N - 1 ) / 2 end_POSTSUPERSCRIPT, implicitly enforcing symmetry and hollowness, while also halving the problem dimension. To enforce non-negativity the indicator function 𝕀{𝒂0}={0\mathbb{I}\{{\bm{a}}\geq 0\}=\{0blackboard_I { bold_italic_a ≥ 0 } = { 0 if 𝒂0𝒂0{\bm{a}}\geq 0bold_italic_a ≥ 0 else }\infty\}∞ } is included in the objective. Finally, we substitute the nodal degrees 𝒅=𝑨𝟏𝒅𝑨1{\bm{d}}={\bm{A}}\mathbf{1}bold_italic_d = bold_italic_A bold_1 with the vectorized equivalent 𝒅=𝑺𝒂𝒅𝑺𝒂{\bm{d}}={\bm{S}}{\bm{a}}bold_italic_d = bold_italic_S bold_italic_a, where 𝑺{0,1}N×N(N1)/2𝑺superscript01𝑁𝑁𝑁12{\bm{S}}\in\{0,1\}^{N\times N(N-1)/2}bold_italic_S ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_N × italic_N ( italic_N - 1 ) / 2 end_POSTSUPERSCRIPT is a fixed binary matrix that maps vectorized edge weights to degrees. The resulting optimization problem is given by

𝒂(𝒆,α,β)=argmin𝒂N(N1)/2{2𝒂𝒆α𝟏log(𝑺𝒂)+β2𝒂22+𝕀{𝒂0}},superscript𝒂𝒆𝛼𝛽subscriptargmin𝒂superscript𝑁𝑁122superscript𝒂top𝒆𝛼superscript1toplog𝑺𝒂𝛽2subscriptsuperscriptnorm𝒂22𝕀𝒂0{\bm{a}}^{*}({\bm{e}},\alpha,\beta)=\operatorname*{arg\,min}_{{\bm{a}}\in% \mathbb{R}^{N(N-1)/2}}\hskip 3.0pt\left\{2{\bm{a}}^{\top}{\bm{e}}-\alpha% \mathbf{1}^{\top}\text{log}({\bm{S}}{\bm{a}})+\frac{\beta}{2}\|{\bm{a}}\|^{2}_% {2}+\mathbb{I}\{{\bm{a}}\geq 0\}\right\},bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_italic_e , italic_α , italic_β ) = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT bold_italic_a ∈ blackboard_R start_POSTSUPERSCRIPT italic_N ( italic_N - 1 ) / 2 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT { 2 bold_italic_a start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_e - italic_α bold_1 start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT log ( bold_italic_S bold_italic_a ) + divide start_ARG italic_β end_ARG start_ARG 2 end_ARG ∥ bold_italic_a ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + blackboard_I { bold_italic_a ≥ 0 } } , (3)

which is convex and admits a unique optimal solution; see e.g., (Saboksayr & Mateos, 2021). Next, we comment on the role of the regularization parameters α,β𝛼𝛽\alpha,\betaitalic_α , italic_β and their interpretability properties. We then offer a brief discussion on optimization algorithms to tackle problem (3). These ingredients will be essential to build a BNN model for supervised GSL in Sections 4 and 5.

Independent interpretability of regularization parameters. The weights α𝛼\alphaitalic_α and β𝛽\betaitalic_β are not independently interpretable w.r.t. to relevant graph characteristics, frustrating straightforward interpretation of their effect on the solution 𝒂(𝒆,α,β)superscript𝒂𝒆𝛼𝛽{\bm{a}}^{*}({\bm{e}},\alpha,\beta)bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_italic_e , italic_α , italic_β ). Specifically, for fixed α𝛼\alphaitalic_α, increasing β𝛽\betaitalic_β leads to denser edge patterns, as we have (quadratically) increased the relative cost of large edge weights. Indeed, the sparsest graph is obtained for β=0𝛽0\beta=0italic_β = 0. But in general, many interesting graph characteristics, e.g., sparsity, connectivity, diameter, and edge weight magnitude, are non-trivial functions of both α𝛼\alphaitalic_α and β𝛽\betaitalic_β; see also (Dong et al., 2016) for a similar issue.

To facilitate independent control over the sparsity pattern and scale of the edge weights of recovered graphs, (Kalofolias, 2016, Prop. 2) introduced an equivalent (θ,δ𝜃𝛿\theta,\deltaitalic_θ , italic_δ)-parameterization of (3), namely

𝒂(𝒆,α,β)=αβ𝒂(1αβ𝒆,1,1)=δ𝒂(θ𝒆,1,1).superscript𝒂𝒆𝛼𝛽𝛼𝛽superscript𝒂1𝛼𝛽𝒆11𝛿superscript𝒂𝜃𝒆11{\bm{a}}^{*}({\bm{e}},\alpha,\beta)=\sqrt{\frac{\alpha}{\beta}}{\bm{a}}^{*}% \left(\frac{1}{\sqrt{\alpha\beta}}{\bm{e}},1,1\right)=\delta{\bm{a}}^{*}(% \theta{\bm{e}},1,1).bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_italic_e , italic_α , italic_β ) = square-root start_ARG divide start_ARG italic_α end_ARG start_ARG italic_β end_ARG end_ARG bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( divide start_ARG 1 end_ARG start_ARG square-root start_ARG italic_α italic_β end_ARG end_ARG bold_italic_e , 1 , 1 ) = italic_δ bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_θ bold_italic_e , 1 , 1 ) . (4)

We can map from the former parameterization to the latter by first scaling 𝒆𝒆{\bm{e}}bold_italic_e by θ=1/αβ𝜃1𝛼𝛽\theta=1/\sqrt{\alpha\beta}italic_θ = 1 / square-root start_ARG italic_α italic_β end_ARG, solving (3) with θ𝒆𝜃𝒆\theta{\bm{e}}italic_θ bold_italic_e using α=β=1𝛼𝛽1\alpha=\beta=1italic_α = italic_β = 1, and finally scaling the recovered edges by the constant δ=α/β𝛿𝛼𝛽\delta=\sqrt{\alpha/\beta}italic_δ = square-root start_ARG italic_α / italic_β end_ARG; we refer the reader to (Kalofolias, 2016; Kalofolias & Perraudin, 2017) for a proof of the equivalence claim. Due to the separable structure of the right-hand-side of (4), any GSL algorithm would require a single input parameter θ𝜃\thetaitalic_θ, and the obtained solution 𝒂(θ𝒆,1,1)superscript𝒂𝜃𝒆11{\bm{a}}^{*}(\theta{\bm{e}},1,1)bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_θ bold_italic_e , 1 , 1 ) can then be scaled by δ𝛿\deltaitalic_δ. All in all, the sparsity level of 𝒂superscript𝒂{\bm{a}}^{*}bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is determined solely by θ𝜃\thetaitalic_θ, making θ𝜃\thetaitalic_θ independently interpretable w.r.t. sparsity. Moreover, δ𝛿\deltaitalic_δ is interpretable w.r.t. edge weight magnitude, but not independently so, as larger θ𝜃\thetaitalic_θ produces smaller weights [see Figure 3 (bottom-left)].

3.2 Optimization algorithms

Problem (3) has a favorable structure that has been well documented, and several efficient optimization algorithms were proposed to obtain a solution 𝒂(𝒆,α,β)superscript𝒂𝒆𝛼𝛽{\bm{a}}^{*}({\bm{e}},\alpha,\beta)bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_italic_e , italic_α , italic_β ) with 𝒪(N2)𝒪superscript𝑁2\mathcal{O}(N^{2})caligraphic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) complexity per iteration. Specifically, a forward-backward-forward PDS algorithm was first proposed in (Kalofolias, 2016). PDS introduces a step-size parameter which must be tuned to yield satisfactory empirical convergence properties, thus increasing the overall computational burden. We find that effective step-size values tend to lie on a narrow interval beyond which PDS exhibits divergent behavior, further frustrating tuning; see Appendix A.1.1 for a supporting discussion. GSL algorithms based on the alternating-directions method of multipliers (Wang et al., 2021) or majorization-minimization (Fatima et al., 2022) have been developed as well. Recently, (Saboksayr & Mateos, 2021) introduced a fast dual proximal gradient (FDPG) algorithm to solve (3), which is devoid of step-size parameters and – different from all three previous approaches – it comes with global convergence rate guarantees. For this problem, the strongest convergence results to date are in (Wang et al., 2023).

Our starting point in this work is the FDPG optimization framework, but different from (Saboksayr & Mateos, 2021) we: (i) develop a solver for the (θ,δ𝜃𝛿\theta,\deltaitalic_θ , italic_δ)-parameterization of (3); and (ii) turn-off the Nesterov-type acceleration from the proximal-gradient iterations used to solve the dual problem of (3). This yields a dual proximal gradient (DPG) method, tabulated under Algorithm 1. In a nutshell, during iterations k=1,2,𝑘12k=1,2,\ldotsitalic_k = 1 , 2 , … Algorithm 1 updates the vectorized adjacency matrix estimate 𝒂k+N(N1)/2subscript𝒂𝑘superscriptsubscript𝑁𝑁12{\bm{a}}_{k}\in\mathbb{R}_{+}^{N(N-1)/2}bold_italic_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N ( italic_N - 1 ) / 2 end_POSTSUPERSCRIPT, an auxiliary vector of nodal degrees 𝒅k+Nsubscript𝒅𝑘superscriptsubscript𝑁{\bm{d}}_{k}\in\mathbb{R}_{+}^{N}bold_italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT, as well as dual variables 𝝀kNsubscript𝝀𝑘superscript𝑁\bm{\lambda}_{k}\in\mathbb{R}^{N}bold_italic_λ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT used to enforce the variable splitting constraint 𝒅=𝑺𝒂𝒅𝑺𝒂{\bm{d}}={\bm{S}}{\bm{a}}bold_italic_d = bold_italic_S bold_italic_a. A naive DPG implementation incurs 𝒪(N2)𝒪superscript𝑁2\mathcal{O}(N^{2})caligraphic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) computational and memory complexities, and we note all nonlinear operations involved (i.e., ReLU()=max(0,)ReLU0\textrm{ReLU}(\cdot)=\max(0,\cdot)ReLU ( ⋅ ) = roman_max ( 0 , ⋅ ), ()2superscript2(\cdot)^{2}( ⋅ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, and ()\sqrt{(\cdot)}square-root start_ARG ( ⋅ ) end_ARG) are pointwise on their vector arguments. As a result of the design choices (i)-(ii), the DPG algorithm requires the fewest operations per iteration and the fewest number of parameters among existing solvers of (3), and is devoid of any uninterpretable nuisance parameters, e.g., step-sizes. FDPG was only considered on the original (α,β)𝛼𝛽(\alpha,\beta)( italic_α , italic_β )-parameterization of (3); by instead opting for DPG iterations to solve the (θ,δ)𝜃𝛿(\theta,\delta)( italic_θ , italic_δ )-parameterization of (3), we reveal independent interpretability of θ𝜃\thetaitalic_θ w.r.t. sparsity of the optimal graphs.

Next, we will unroll Algorithm 1 to produce a GSL NN which inherits its advantages - namely simple, efficient, minimally parameterized layers, with independent interpretability - forming the backbone of our BNN.

Algorithm 1 Dual Proximal Gradient Descent
Inputs: Fixed parameters θ,δ𝜃𝛿\theta,\delta\in\mathbb{R}italic_θ , italic_δ ∈ blackboard_R and data 𝒆𝒆{\bm{e}}bold_italic_e
Initialize: 𝒂0subscript𝒂0{\bm{a}}_{0}bold_italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and 𝝀0subscript𝝀0\bm{\lambda}_{0}bold_italic_λ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT at random.
for k=1,2,𝑘12k=1,2,\dotsitalic_k = 1 , 2 , … do
   𝒅k=𝑺𝒂k1(N1)𝝀k1subscript𝒅𝑘𝑺subscript𝒂𝑘1𝑁1subscript𝝀𝑘1{\color[rgb]{1,.5,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,.5,0}{\bm{d}}_{% k}}={\bm{S}}{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}% {\bm{a}}_{k-1}}-(N-1){\color[rgb]{1,0,1}\definecolor[named]{pgfstrokecolor}{% rgb}{1,0,1}\pgfsys@color@cmyk@stroke{0}{1}{0}{0}\pgfsys@color@cmyk@fill{0}{1}{% 0}{0}\bm{\lambda}_{k-1}}bold_italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = bold_italic_S bold_italic_a start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT - ( italic_N - 1 ) bold_italic_λ start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT
   𝝀k=12(N1)(𝒅k𝒅k2+4(N1)𝟏{\color[rgb]{1,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,1}% \pgfsys@color@cmyk@stroke{0}{1}{0}{0}\pgfsys@color@cmyk@fill{0}{1}{0}{0}\bm{% \lambda}_{k}}=\frac{-1}{2(N-1)}\Big{(}{\color[rgb]{1,.5,0}\definecolor[named]{% pgfstrokecolor}{rgb}{1,.5,0}{\bm{d}}_{k}}-\sqrt{{\color[rgb]{1,.5,0}% \definecolor[named]{pgfstrokecolor}{rgb}{1,.5,0}{\bm{d}}_{k}}^{2}+4(N-1)\bm{1}}bold_italic_λ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = divide start_ARG - 1 end_ARG start_ARG 2 ( italic_N - 1 ) end_ARG ( bold_italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - square-root start_ARG bold_italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 4 ( italic_N - 1 ) bold_1 end_ARG )
   𝒂k=max(𝟎,12𝑺𝝀kθ𝒆)subscript𝒂𝑘012superscript𝑺topsubscript𝝀𝑘𝜃𝒆{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}{\bm{a}}_{k}% }=\max\Big{(}\bm{0},\frac{1}{2}{\bm{S}}^{\top}{\color[rgb]{1,0,1}\definecolor[% named]{pgfstrokecolor}{rgb}{1,0,1}\pgfsys@color@cmyk@stroke{0}{1}{0}{0}% \pgfsys@color@cmyk@fill{0}{1}{0}{0}\bm{\lambda}_{k}}-\theta{\bm{e}}\Big{)}bold_italic_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = roman_max ( bold_0 , divide start_ARG 1 end_ARG start_ARG 2 end_ARG bold_italic_S start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_λ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - italic_θ bold_italic_e )
end for
Return: δ𝒂k𝛿subscript𝒂𝑘\delta{\bm{a}}_{k}italic_δ bold_italic_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT
Refer to caption
Figure 1: Left: The dual proximal gradient (DPG) algorithm to solve the GSL problem (3). Right: Unrolling and truncating Algorithm 1 after D𝐷Ditalic_D iterations produces Unrolled DPG.

4 Graph Structure Learning from Smooth Signals with Bayesian Neural Networks

So far we have described a model-based approach to (point) estimation of graphs from smooth signals. In this work, we assume a labeled training dataset is available. We aim to construct a BNN model to produce uncertainty estimates on graph predictions for unseen test data.

Our BNN approach for GSL in a nutshell. Here, we restrict ourselves to binary graphs 𝒂{0,1}N(N1)/2𝒂superscript01𝑁𝑁12{\bm{a}}\in\{0,1\}^{N(N-1)/2}bold_italic_a ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_N ( italic_N - 1 ) / 2 end_POSTSUPERSCRIPT; weighted graphs only require a change to the ensuing likelihood function. We denote all training data as 𝒯={𝒯e,𝒯a}={𝒆(t),𝒂(t)}t=1T𝒯subscript𝒯𝑒subscript𝒯𝑎superscriptsubscriptsuperscript𝒆𝑡superscript𝒂𝑡𝑡1𝑇\mathcal{T}=\{\mathcal{T}_{e},\mathcal{T}_{a}\}=\{{\bm{e}}^{(t)},{\bm{a}}^{(t)% }\}_{t=1}^{T}caligraphic_T = { caligraphic_T start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT , caligraphic_T start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT } = { bold_italic_e start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT , bold_italic_a start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, an unseen test sample as (𝒆~,𝒂~)~𝒆~𝒂(\tilde{{\bm{e}}},\tilde{{\bm{a}}})( over~ start_ARG bold_italic_e end_ARG , over~ start_ARG bold_italic_a end_ARG ), and the collection of all parameters of our yet to be defined BNN model as 𝚯𝚯\bm{\Theta}bold_Θ. Following the principles of BNNs, our goal is to construct a posterior predictive distribution p(𝒂~𝒆~,𝒯)𝑝conditional~𝒂~𝒆𝒯p(\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},\mathcal{T})italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ) by marginalizing out model parameters 𝚯𝚯\bm{\Theta}bold_Θ, i.e.,

p(𝒂~𝒆~,𝒯)=p(𝒂~𝒆~,𝚯)p(𝚯𝒯)d𝚯1Mm=1Mp(𝒂~𝒆~,𝚯(m)),𝑝conditional~𝒂~𝒆𝒯𝑝conditional~𝒂~𝒆𝚯𝑝conditional𝚯𝒯d𝚯1𝑀superscriptsubscript𝑚1𝑀𝑝conditional~𝒂~𝒆superscript𝚯𝑚p(\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},\mathcal{T})=\int p(\tilde{{\bm{a}}}% \mid\tilde{{\bm{e}}},\bm{\Theta})\cdot p(\bm{\Theta}\mid\mathcal{T})\,\textrm{% d}\bm{\Theta}\approx\frac{1}{M}\sum_{m=1}^{M}p(\tilde{{\bm{a}}}\mid\tilde{{\bm% {e}}},\bm{\Theta}^{(m)}),italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ) = ∫ italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , bold_Θ ) ⋅ italic_p ( bold_Θ ∣ caligraphic_T ) d bold_Θ ≈ divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , bold_Θ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) , (5)

where we use M𝑀Mitalic_M Monte Carlo draws from the posterior 𝚯(m)p(𝚯𝒯)similar-tosuperscript𝚯𝑚𝑝conditional𝚯𝒯\bm{\Theta}^{(m)}\sim p(\bm{\Theta}\mid\mathcal{T})bold_Θ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ∼ italic_p ( bold_Θ ∣ caligraphic_T ) to approximate the intractable integral. We can then designate our edge-wise point and uncertainty estimates as the first two moments of the posterior predictive marginals p(a~i𝒆~,𝒯)𝑝conditionalsubscript~𝑎𝑖~𝒆𝒯p(\tilde{a}_{i}\mid\tilde{{\bm{e}}},\mathcal{T})italic_p ( over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ), respectively, for each candidate edge indexed by i𝒱×𝒱𝑖𝒱𝒱i\in\mathcal{V}\times\mathcal{V}italic_i ∈ caligraphic_V × caligraphic_V.

Roadmap. In Section 4.1, we first discuss the development of a GSL NN which takes in the vectorized Euclidean distance matrix 𝒆𝒆{\bm{e}}bold_italic_e of nodal observations 𝑿𝑿{\bm{X}}bold_italic_X and assigns a probability to all possible edges. We do so by unrolling Algorithm 1, producing an GSL NN with an independently interpretable parameter w.r.t. sparsity of graph outputs. Treating the model parameters of this unrolled NN as stochastic, choosing a likelihood function (Section 4.2), and setting appropriate priors (Section 5), we produce a BNN. For inference we use HMC (Section 4.3), an unusual choice in BNNs made viable by the low dimensionality and fast execution of our NN, stemming from its origins as an unrolling. Pushing the test inputs through each such GSL NN and averaging the resulting predictive distributions provides provides the approximate (5) posterior predictive, from which we derive edge-wise point and uncertainty estimates as elaborated in Section 4.4.

4.1 Algorithm unrolling: Iterative optimization as a neural network blueprint

Unrollings (also known as deep unfoldings) use iterative algorithms as templates of neural architectures that are trained to approximate solutions of optimization problems. Iterations are truncated and mapped to NN layers, while optimization (e.g., regularization and step-size) parameters are turned into learnable weights. Originating from convergent iterative procedures, unrolled NNs inherit many desirable properties; namely, a low number of parameters, fast layer-wise execution, and favorable generalization in low-data environments (Monga et al., 2021). Typical architectural design choices include whether to replace intermediate operators with more expressive (possibly pre-trained) learnable modules - making them no longer ‘strict’ unrollings - and whether a common set of parameters should be used in all layers, or, to decouple these parameters across layers. Both decisions are context dependent and often driven by the amount of data available as well as complexity considerations. Deviating from strict unrollings with shared parameters can naturally lead to gains in expressive power, but often at the cost of inductive bias and training stability.

Unrolling DPG iterations. Pioneered by (Gregor & LeCun, 2010) to efficiently learn sparse codes from data, algorithm unrolling ideas are recently gaining traction for GSL as well. Starting from the formulation in Section 3.1, (Pu et al., 2021) unrolls a PDS solver of the (α,β)𝛼𝛽(\alpha,\beta)( italic_α , italic_β )-parameterization (3). Unrolled PDS has no independently interpretable parameters, is not a true NN (pre-activation outputs are nonlinear functions of the parameters), and includes a nuisance step-size parameter – arguably a shortcoming as gradient-based learning can easily produce large-enough weight values for divergent behavior, and thus NaN’s. Attempting to unroll PDS iterates for the (θ,δ)𝜃𝛿(\theta,\delta)( italic_θ , italic_δ )-parameterization of (3) would not fix these practical problems.

We instead advocate unrolling the DPG iterations (Algorithm 1) developed to solve the (θ,δ)𝜃𝛿(\theta,\delta)( italic_θ , italic_δ )-parameterization of (3). This way, we obtain a true NN without nuisance step-size parameters - avoiding the aforementioned issues and reducing the parameter count by a third - while inheriting independently interpretable parameter θ𝜃\thetaitalic_θ (w.r.t. sparsity of graph outputs). Incidentally, layers in Unrolled DPG [depicted in Figure 1 (right)] are markedly simpler and require fewer operations than Unrolled PDS. We denote the output of a D𝐷Ditalic_D-layer unrolling of Algorithm 1 as δΓθD𝛿superscriptsubscriptΓ𝜃𝐷\delta\Gamma_{\theta}^{D}italic_δ roman_Γ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT. As we would like probabilities over candidate edges, we subtract a learnable mean shift b𝑏bitalic_b and drive the output through a sigmoid σ()𝜎\sigma(\cdot)italic_σ ( ⋅ ), producing our desired GSL NN output

𝒑^=σ(δΓθD(𝒆)b𝟏)(0,1)N(N1)/2,^𝒑𝜎𝛿superscriptsubscriptΓ𝜃𝐷𝒆𝑏1superscript01𝑁𝑁12\hat{{\bm{p}}}=\sigma(\delta\Gamma_{\theta}^{D}({\bm{e}})-b\mathbf{1})\in(0,1)% ^{{}^{N(N-1)/2}},over^ start_ARG bold_italic_p end_ARG = italic_σ ( italic_δ roman_Γ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT ( bold_italic_e ) - italic_b bold_1 ) ∈ ( 0 , 1 ) start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT italic_N ( italic_N - 1 ) / 2 end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT , (6)

with parameters 𝚯={θ,δ,b}𝚯𝜃𝛿𝑏\bm{\Theta}=\{\theta,\delta,b\}bold_Θ = { italic_θ , italic_δ , italic_b }. To see that Unrolled DPG is a true NN, note that θ𝜃\thetaitalic_θ is only involved in a linear function θ𝒆𝜃𝒆\theta{\bm{e}}italic_θ bold_italic_e of the input data. Likewise, δ𝛿\deltaitalic_δ and b𝑏bitalic_b are only involved in an affine mapping of the activations ΓθD(𝒆)superscriptsubscriptΓ𝜃𝐷𝒆\Gamma_{\theta}^{D}({\bm{e}})roman_Γ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT ( bold_italic_e ). All non-linear operations (specifically squaring, square root, and max\maxroman_max) are pointwise functions of intermediate activations. Going back to the design considerations mentioned at the beginning of this section, here we keep the unrolling strict and share parameters across layers to retain independent interpretability of θ𝜃\thetaitalic_θ, minimize parameter count, and simplify upcoming Bayesian inference. Tradeoffs arising with model expansion using multiple input and output channels per layer are discussed in Section 6.1.

All in all, unrolled DPG is the first true NN for GSL from smooth signals, and the first strict unrolling for GSL which produces a true NN. For the various reasons laid out in the preceding discussion, Unrolled DPG is of independent interest as a new model for point estimation of graph structure in a supervised setting. As we show next, it will be an integral component of the stochastic model used to construct a BNN to facilitate uncertainty quantification for adjacency matrix predictions.

4.2 Stochastic model

Here we specify a stochastic model for the random variables of interest, namely the binary adjacency matrix 𝒂𝒂{\bm{a}}bold_italic_a, nodal data entering via the Euclidean distance matrix 𝒆𝒆{\bm{e}}bold_italic_e, and the BNN weights 𝚯𝚯\bm{\Theta}bold_Θ. The Bayesian posterior from which we wish to sample satisfies p(𝚯𝒯)p(𝒯a𝒯e,𝚯)p(𝚯)proportional-to𝑝conditional𝚯𝒯𝑝conditionalsubscript𝒯𝑎subscript𝒯𝑒𝚯𝑝𝚯p(\bm{\Theta}\mid\mathcal{T})\propto p(\mathcal{T}_{a}\mid\mathcal{T}_{e},\bm{% \Theta})p(\bm{\Theta})italic_p ( bold_Θ ∣ caligraphic_T ) ∝ italic_p ( caligraphic_T start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ∣ caligraphic_T start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT , bold_Θ ) italic_p ( bold_Θ ), and a first step in BNN design is to specify the likelihood p(𝒂𝒆,𝚯)𝑝conditional𝒂𝒆𝚯p({\bm{a}}\mid{\bm{e}},\bm{\Theta})italic_p ( bold_italic_a ∣ bold_italic_e , bold_Θ ) and the prior p(𝚯)𝑝𝚯p(\bm{\Theta})italic_p ( bold_Θ ). An i.i.d. assumption on the training data 𝒯𝒯\mathcal{T}caligraphic_T allows the likelihood to factorize over samples p(𝒯a𝒯e,𝚯)=t=1Tp(𝒂(t)𝒆(t),𝚯)𝑝conditionalsubscript𝒯𝑎subscript𝒯𝑒𝚯superscriptsubscriptproduct𝑡1𝑇𝑝conditionalsuperscript𝒂𝑡superscript𝒆𝑡𝚯p(\mathcal{T}_{a}\mid\mathcal{T}_{e},\bm{\Theta})=\prod_{t=1}^{T}p({\bm{a}}^{(% t)}\mid{\bm{e}}^{(t)},\bm{\Theta})italic_p ( caligraphic_T start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ∣ caligraphic_T start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT , bold_Θ ) = ∏ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_p ( bold_italic_a start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ∣ bold_italic_e start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT , bold_Θ ). Moreover, assuming edges within a graph sample 𝒂(t)superscript𝒂𝑡{\bm{a}}^{(t)}bold_italic_a start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT are mutually conditionally independent given parameters 𝚯𝚯\bm{\Theta}bold_Θ leads to further likelihood factorization as p(𝒯a𝒯e,𝚯)=t=1Ti=1N(N1)/2p(ai(t)𝒆(t),𝚯)𝑝conditionalsubscript𝒯𝑎subscript𝒯𝑒𝚯superscriptsubscriptproduct𝑡1𝑇superscriptsubscriptproduct𝑖1𝑁𝑁12𝑝conditionalsubscriptsuperscript𝑎𝑡𝑖superscript𝒆𝑡𝚯p(\mathcal{T}_{a}\mid\mathcal{T}_{e},\bm{\Theta})=\prod_{t=1}^{T}\prod_{i=1}^{% N(N-1)/2}p(a^{(t)}_{i}\mid{\bm{e}}^{(t)},\bm{\Theta})italic_p ( caligraphic_T start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ∣ caligraphic_T start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT , bold_Θ ) = ∏ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N ( italic_N - 1 ) / 2 end_POSTSUPERSCRIPT italic_p ( italic_a start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ bold_italic_e start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT , bold_Θ ). We model ai𝒆,𝚯Bernoulli(p^i)similar-toconditionalsubscript𝑎𝑖𝒆𝚯Bernoullisubscript^𝑝𝑖a_{i}\mid{\bm{e}},\bm{\Theta}\sim\textrm{Bernoulli}(\hat{p}_{i})italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ bold_italic_e , bold_Θ ∼ Bernoulli ( over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), where the success (or edge i𝒱×𝒱𝑖𝒱𝒱i\in\mathcal{V}\times\mathcal{V}italic_i ∈ caligraphic_V × caligraphic_V presence) probability is given by the output of the Unrolled DPG network, i.e., p^i=σ(δ[ΓθD(𝒆)]ib)subscript^𝑝𝑖𝜎𝛿subscriptdelimited-[]superscriptsubscriptΓ𝜃𝐷𝒆𝑖𝑏\hat{p}_{i}=\sigma\Big{(}\delta[\Gamma_{\theta}^{D}({\bm{e}})]_{i}-b\Big{)}over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_σ ( italic_δ [ roman_Γ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT ( bold_italic_e ) ] start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_b ) as in (6). Putting all the pieces together, the final expression for the likelihood is

p(𝒯a𝒯e,𝚯)=t=1Ti=1N(N1)/2(p^i(t))ai(t)(1p^i(t))1ai(t).𝑝conditionalsubscript𝒯𝑎subscript𝒯𝑒𝚯superscriptsubscriptproduct𝑡1𝑇superscriptsubscriptproduct𝑖1𝑁𝑁12superscriptsubscriptsuperscript^𝑝𝑡𝑖subscriptsuperscript𝑎𝑡𝑖superscript1subscriptsuperscript^𝑝𝑡𝑖1subscriptsuperscript𝑎𝑡𝑖p(\mathcal{T}_{a}\mid\mathcal{T}_{e},\bm{\Theta})=\prod_{t=1}^{T}\prod_{i=1}^{% N(N-1)/2}(\hat{p}^{(t)}_{i})^{a^{(t)}_{i}}(1-\hat{p}^{(t)}_{i})^{1-a^{(t)}_{i}}.italic_p ( caligraphic_T start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ∣ caligraphic_T start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT , bold_Θ ) = ∏ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N ( italic_N - 1 ) / 2 end_POSTSUPERSCRIPT ( over^ start_ARG italic_p end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_a start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 - over^ start_ARG italic_p end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 1 - italic_a start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT . (7)

We reiterate that, crucially, the Unrolled DPG outputs enter the stochastic model via the likelihood, as the means of the edge distributions. The specification of the prior p(𝚯)𝑝𝚯p(\bm{\Theta})italic_p ( bold_Θ ) will be addressed in Section 5.

4.3 Inference

We aim to generate M𝑀Mitalic_M Monte Carlo draws from the posterior 𝚯(m)p(𝚯𝒯)p(𝒯a𝒯e,𝚯)p(𝚯)similar-tosuperscript𝚯𝑚𝑝conditional𝚯𝒯proportional-to𝑝conditionalsubscript𝒯𝑎subscript𝒯𝑒𝚯𝑝𝚯\bm{\Theta}^{(m)}\sim p(\bm{\Theta}\mid\mathcal{T})\propto p(\mathcal{T}_{a}% \mid\mathcal{T}_{e},\bm{\Theta})p(\bm{\Theta})bold_Θ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ∼ italic_p ( bold_Θ ∣ caligraphic_T ) ∝ italic_p ( caligraphic_T start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ∣ caligraphic_T start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT , bold_Θ ) italic_p ( bold_Θ ). In modern BNNs based on overparameterized deep networks, high dimensionality and parameter unidentifiability are prevalent. This translates to highly multi-modal posteriors which make sampling challenging (Izmailov et al., 2021). Even when posterior geometries are more benign, the computational and memory requirements for evaluating the network and its gradients render the generation of Monte Carlo posterior samples impractical. As a typical workaround one can resort to posterior approximation using inexpensive mini-batch methods such as mean-field variational inference or stochastic-gradient MCMC (Izmailov et al., 2021). In contrast, we now argue that generation of high-fidelity Monte Carlo posterior samples using HMC is uniquely compatible with our proposed BNN; for implementation details see Section 6.1.

For starters, Unrolled DPG’s repetitive layers simplify model construction and allow straightforward compilation in automatic differentiation software, significantly boosting runtime efficiency. The low parameter count of Unrolled DPG facilitates the use of forward-mode auto-differentiation; this eliminates the need to store intermediate activations and perform a backward pass, thus it ensures memory usage does not increase with model depth D𝐷Ditalic_D. Moreover, as Unrolled DPG can effectively approximate inverse problem solutions with relatively shallow depths (see Section 6), selecting a smaller D𝐷Ditalic_D significantly reduces runtime complexity. For small- to moderately-sized graphs and datasets, our computational and memory requirements for network and gradient evaluation are low; see Section 6 for GSL problem instances where inference is attainable in <2absent2<2< 2 minutes on a M2222 MacBook laptop. The limited parameter count also aids in sampling efficiency as we avoid the curse of dimensionality that complicates sampling in higher-dimensional models. Finally, the ability to place informative priors over independently interpretable parameters (as we do in Section 5) is known to smoothen the posterior geometry, especially in data scarce regimes where the likelihood and prior are of comparable magnitude (Gelman et al., 1995).

4.4 Prediction

By conditioning on data 𝒯𝒯\mathcal{T}caligraphic_T and integrating out model parameters 𝚯𝚯\bm{\Theta}bold_Θ, we obtain a predictive distribution on unseen graph adjacency matrices 𝒂~~𝒂\tilde{{\bm{a}}}over~ start_ARG bold_italic_a end_ARG given nodal signals 𝒆~~𝒆\tilde{{\bm{e}}}over~ start_ARG bold_italic_e end_ARG. As the integral is intractable, we use the M𝑀Mitalic_M posterior samples obtained via HMC in the inference stage to approximate the predictive distribution p(𝒂~𝒆~,𝒯)𝑝conditional~𝒂~𝒆𝒯p(\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},\mathcal{T})italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ). This approximation process is summarized in (5). .

Importantly, we can generate samples from the posterior predictive by randomly drawing from the sampling distribution with each parameter sample plugged in, i.e., 𝒂~(m)p(𝒂~𝒆~,𝚯(m))similar-tosuperscript~𝒂𝑚𝑝conditional~𝒂~𝒆superscript𝚯𝑚\tilde{{\bm{a}}}^{(m)}\sim p(\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},\bm{\Theta}^% {(m)})over~ start_ARG bold_italic_a end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ∼ italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , bold_Θ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ). Given the form of our BNN’s stochastic model as introduced in Section 4.2, such a draw reduces to sampling a Bernoulli distribution for each possible edge. Randomly drawing from the sampling distribution is critical as it accounts for both forms of uncertainty in posterior predictive quantities, namely, sampling uncertainty and estimation uncertainty  (Gelman et al., 1995). Using these posterior predictive samples, we can approximate the mean (‘pred. mean’) and standard deviation (‘pred. stdv.’) of the edge-wise marginals of the posterior predictive as

𝔼[a~i𝒆~,𝒯]1Mm=1Ma~i(m)andVar[a~i𝒆~,𝒯]12[1M1m=1M(a~i(m)𝔼[a~i𝒆~,𝒯])2]12.formulae-sequence𝔼delimited-[]conditionalsubscript~𝑎𝑖~𝒆𝒯1𝑀superscriptsubscript𝑚1𝑀subscriptsuperscript~𝑎𝑚𝑖andVarsuperscriptdelimited-[]conditionalsubscript~𝑎𝑖~𝒆𝒯12superscriptdelimited-[]1𝑀1superscriptsubscript𝑚1𝑀superscriptsuperscriptsubscript~𝑎𝑖𝑚𝔼delimited-[]conditionalsubscript~𝑎𝑖~𝒆𝒯212\mathbb{E}[\tilde{a}_{i}\mid\tilde{{\bm{e}}},\mathcal{T}]\approx\frac{1}{M}% \sum_{m=1}^{M}\tilde{a}^{(m)}_{i}\quad\text{and}\quad\mathrm{Var}[\tilde{a}_{i% }\mid\tilde{{\bm{e}}},\mathcal{T}]^{\frac{1}{2}}\approx\left[\frac{1}{M-1}\sum% _{m=1}^{M}(\tilde{a}_{i}^{(m)}-\mathbb{E}[\tilde{a}_{i}\mid\tilde{{\bm{e}}},% \mathcal{T}])^{2}\right]^{\frac{1}{2}}.blackboard_E [ over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ] ≈ divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT over~ start_ARG italic_a end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and roman_Var [ over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ] start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT ≈ [ divide start_ARG 1 end_ARG start_ARG italic_M - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ( over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT - blackboard_E [ over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ] ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT . (8)

Naturally, 𝔼[a~i𝒆~,𝒯]𝔼delimited-[]conditionalsubscript~𝑎𝑖~𝒆𝒯\mathbb{E}[\tilde{a}_{i}\mid\tilde{{\bm{e}}},\mathcal{T}]blackboard_E [ over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ] is a Bayesian point estimate for a~isubscript~𝑎𝑖\tilde{a}_{i}over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, while Var[a~i𝒆~,𝒯]12Varsuperscriptdelimited-[]conditionalsubscript~𝑎𝑖~𝒆𝒯12\mathrm{Var}[\tilde{a}_{i}\mid\tilde{{\bm{e}}},\mathcal{T}]^{\frac{1}{2}}roman_Var [ over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ] start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT offers a measure of uncertainty in such prediction of edge i𝑖iitalic_i.

5 Bayesian Modeling of Unrolling-Based BNNs with Independent Interpretability

Refer to caption
Figure 2: Bayesian workflow with independent interpretability. Inputs: A labeled data set 𝒯𝒯\mathcal{T}caligraphic_T, an inverse problem with independently interpretable parameter θ𝜃\thetaitalic_θ w.r.t. some characteristic of the solution, prior beliefs over this solution characteristic, and an unrolled NN which approximate solutions to the inverse problem. Bayesian Modeling: We use independent interpretability of θ𝜃\thetaitalic_θ to convert prior beliefs on solution characteristics to a prior distribution on the independently interpretable parameter. We use prior predictive checks to ensure priors generate data sets which encompass all plausible values of the solution characteristic, while still preferentially generating data sets which we believe are more likely apriori. If not, we can leverage independent interpretability to refine the prior. We then sample from the posterior, and use posterior predictive checks to provide a subjective validation of model fit.

In this section, we present a method for Bayesian modeling for BNNs which use a network produced by unrolling an optimization algorithm to solve an inverse problem with independent interpretability. We instantiate these ideas on the GSL problem (3) - where solutions are undirected graphs - and use prior knowledge over the sparsity of such graphs to shape prior distributions over the corresponding independently interpretable parameter θ𝜃\thetaitalic_θ in Unrolled DPG introduced in Section 4.1.

5.1 Prior modeling

For Bayesian models, priors over unobserved quantities play two main roles: encoding information germane to the problem being analyzed and aiding the ensuing Bayesian inferences. Despite the name, a prior can in general only be interpreted in the context of the likelihood with which it will be paired. There are many types of priors, and we here we list the common ones in order of degree in which it is intended to affect the information in the likelihood: non-informative, reference, structural, regularizing, weakly informative, and strongly informative (Gelman et al., 2017). The more domain specific information present, the further to the end of this list we may be able to move, which often results in better behaved posterior geometries, and thus more stable Bayesian inference. Exactly how this domain specific information is encoded in the prior depends on both the problem and the specific chosen model.

Informative priors with independent interpretability. In the context of the GSL problem dealt with here, independent interpretability makes setting a strongly informative prior over DPG’s independently interpretable parameter θ𝜃\thetaitalic_θ feasible. To this end, one first discretizes θ𝜃\thetaitalic_θ over a few orders of magnitude. For each θ𝜃\thetaitalic_θ value in the grid, we run Algorithm 1 on a subset of observed inputs 𝒯𝒆subscript𝒯𝒆\mathcal{T}_{{\bm{e}}}caligraphic_T start_POSTSUBSCRIPT bold_italic_e end_POSTSUBSCRIPT and record the relevant data characteristic (edge sparsity) of each recovered solution. We then plot the histogram of such data characteristic and observe their agreement with prior beliefs. Finally, one chooses an appropriate parametric distribution p(θ)𝑝𝜃p(\theta)italic_p ( italic_θ ) which approximates this relative level of agreement across the range of θ𝜃\thetaitalic_θ values. We can use the recovered solutions at a priori probabale θ𝜃\thetaitalic_θ values to set weakly informative priors over the remaining model parameters. In our setting, that involves the scale of the edge weights of the recovered solutions. In Section 5.2 we discuss evaluation of this initial p(𝚯)𝑝𝚯p(\bm{\Theta})italic_p ( bold_Θ ) in terms of the predictions it makes. For a visualization of this prior modeling workflow, see Figure 2. A numerical example is now presented to ground this methodology.

Example (Informative prior modeling of DPG parameters). In this example that continues in Section  5.2, we use a data set 𝒯𝒯\mathcal{T}caligraphic_T of T=50𝑇50T=50italic_T = 50 random geometric graphs (N=20𝑁20N=20italic_N = 20 nodes, connectivity of 1313\frac{1}{3}divide start_ARG 1 end_ARG start_ARG 3 end_ARG), denoted RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT, and their corresponding analytic Euclidean distance matrices as inputs defined in Section 6. See Appendix A.4 for similar analysis on other random graph distributions. Prior modeling and upcoming prior predictive checks require only the distance matrices 𝒆𝒆{\bm{e}}bold_italic_e, and we use only 5555 such distance matrices from 𝒯𝒯\mathcal{T}caligraphic_T.

Suppose we have prior beliefs on the sparsity related characteristics of recovered graphs, namely sparsity itself (=1absent1=1= 1 -- edge density) or the number of connected components. For example, suppose we believe recovered graphs should have edge densities around [.05,.5].05.5[.05,.5][ .05 , .5 ] and 5absent5\leq 5≤ 5 connected components. Since Unrolled DPG approximates solutions to (3) - and θ𝜃\thetaitalic_θ determines such sparsity related characteristics independently of other optimization parameters - we can simply run Algorithm 1 to convergence using the 5555 inputs on a set of discretized θ𝜃\thetaitalic_θ values across several orders of magnitude, as demonstrated in Figure 3 (top-left). We observe θ[101,101]𝜃superscript101superscript101\theta\in[10^{-1},10^{1}]italic_θ ∈ [ 10 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT , 10 start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ] produces solutions with sparsity characteristics consistent with our prior beliefs. Choosing θsimilar-to𝜃absent\theta\simitalic_θ ∼ Lognormal(0,4040,40 , 4) concentrates approximately 75%percent7575\%75 % probability mass on interval [101,101]superscript101superscript101[10^{-1},10^{1}][ 10 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT , 10 start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ] while still covering a broader range of values to accommodate uncertainty stemming from sampling variance (small data set) and approximation error (truncated iterations). Priors are placed over the non-independently interpretable parameters δ𝛿\deltaitalic_δ and b𝑏bitalic_b with the purpose of guarding against implausible values and stabilizing posterior sampling; such priors are sometimes called ‘weakly informative’ (Gelman et al., 1995). Because δΓθD(𝒆)b𝛿subscriptsuperscriptΓ𝐷𝜃𝒆𝑏\delta\Gamma^{D}_{\theta}({\bm{e}})-bitalic_δ roman_Γ start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_italic_e ) - italic_b will be driven through a sigmoid, we aim for both terms to be of comparable magnitude. Since our graphs are binary, δΓθD(𝒆)𝛿subscriptsuperscriptΓ𝐷𝜃𝒆\delta\Gamma^{D}_{\theta}({\bm{e}})italic_δ roman_Γ start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_italic_e ) can reasonably be assumed to be 1absent1\approx 1≈ 1. Utilizing Algorithm 1 solutions obtained over the θ𝜃\thetaitalic_θ grid, we examine the edge weight distribution (prior to scaling by δ𝛿\deltaitalic_δ) in Figure 3 (bottom-left) to inform p(δ)𝑝𝛿p(\delta)italic_p ( italic_δ ). Setting δsimilar-to𝛿absent\delta\simitalic_δ ∼ Lognormal(2222, 2222) brings scaled edge weights close to 1111 for high (prior) probability values of θ𝜃\thetaitalic_θ. Similarly, setting bsimilar-to𝑏absentb\simitalic_b ∼ Lognormal(1111, 2222) ensures b𝑏bitalic_b is also approximately 1111. Both distributions assign probability mass across several orders of magnitude, both above and below 1111, reflecting the breadth of our uncertainty concerning their specific value ranges. Next, we employ prior predictive checks to validate that - with the established priors - our model generates graphs that align with our prior beliefs.

Refer to captionRefer to caption
Refer to caption
Figure 3: Left: Prior modeling with independently interpretable parameter θ𝜃\thetaitalic_θ. We run Algorithm 1 to convergence over discretized θ𝜃\thetaitalic_θ. Top: Larger θ𝜃\thetaitalic_θ produces sparser graphs. Bottom: Larger θ𝜃\thetaitalic_θ produces smaller edge weight magnitudes. Right: Predictive Checks. The original prior generates very few data sets with densities of .9absent.9\approx.9≈ .9, a value we feel is plausible. We can use the independent interpretability of θ𝜃\thetaitalic_θ w.r.t. edge density to alter its prior accordingly. A prior predictive check with this altered prior now encompasses these plausible data sets. The posterior predictive check ensures the replicated data sets - now sampled after conditioning on the training data - have similar edge densities to the observed training labels. Indeed, these edge densities, denoted as ‘posterior’, are tightly distributed around the average edge density of the labels.

5.2 Predictive checking

In Bayesian statistics, predictive checks evaluate a model’s efficacy in capturing data characteristics like means, standard deviations, and quantiles (Gelman et al., 1995). Predictive checks leverage the existence of a model of the joint distribution p(𝒯,𝚯)𝑝𝒯𝚯p(\mathcal{T},\bm{\Theta})italic_p ( caligraphic_T , bold_Θ ) between data and parameters. This model allows us to draw samples from the data marginal - called replicated data - by drawing samples from the joint and simply dropping the parameter samples. Prior predictive checks draw replicated data from the joint before conditioning on data 𝒯𝒯\mathcal{T}caligraphic_T, while posterior predictive checks do so after. In both cases, we apply the chosen statistic capturing our data characteristic of interest to the replicated data, producing a histogram. The goal in prior predictive checking is to shape priors that encompass all plausible data sets, while still guiding the model towards data sets we deem more likely apriori. Thus prior predictive checking often consists of visual inspection of the histogram to ensure this holds, and adjusting the location and/or scale of prior distributions until it does. The goal in posterior predictive checking is to subjectively validate model fit; here we compare the statistic on the replicated data (a histogram) to the same statistic applied to the actual data 𝒯𝒯\mathcal{T}caligraphic_T (a single scalar). A well-fit model should have a histogram tightly concentrated around the real data statistic.

In typical BNNs, the lack of parameter interpretability prevents effective use of predictive checking. (Wenzel et al., 2020) highlights this issue by demonstrating that an isotropic Gaussian prior on NN weights (ResNet-20202020) fails to generate data consistent with prior expectations in image classification tasks. But the lack of parameter interpretability prevents the use of these findings to modify the prior for improved alignment. Below, we instantiate predictive checking for DPG, with average edge density as our test statistic.

Example (Prior predictive check over DPG parameters). To perform a prior predictive check we use the 5555 inputs from 𝒯𝒆subscript𝒯𝒆\mathcal{T}_{{\bm{e}}}caligraphic_T start_POSTSUBSCRIPT bold_italic_e end_POSTSUBSCRIPT and draw 104superscript10410^{4}10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT replicated data sets 𝒂repsuperscript𝒂rep{\bm{a}}^{\text{rep}}bold_italic_a start_POSTSUPERSCRIPT rep end_POSTSUPERSCRIPT. Figure 3 (right) shows the histogram of average edge densities of replicated data sets under the prior settings laid out in Section 5.1, which we call the ‘original prior’. We observe that such a prior succeeds in producing data sets that coincide with those we deem more likely apriori, but falls short of encompassing all plausible data sets, namely those with edge densities of 0.9absent0.9\approx 0.9≈ 0.9. To address this, we lean on independent interpretability of θ𝜃\thetaitalic_θ w.r.t. the sparsity of graph solutions: decreasing the location of p(θ)𝑝𝜃p(\theta)italic_p ( italic_θ ) will increase edge densities (lower sparsity) of graph solutions, and so we lower p(θ)𝑝𝜃p(\theta)italic_p ( italic_θ )’s location parameter from 00 to 1212-\frac{1}{2}- divide start_ARG 1 end_ARG start_ARG 2 end_ARG, with all other parameters of the prior distribution remaining the same. We call this new prior the ‘altered prior’. Repeating the prior predictive check with this ‘altered prior’ we now observe 12%absentpercent12\approx 12\%≈ 12 % of replicated data sets with edge densities in [.75,1].751\left[.75,1\right][ .75 , 1 ], as opposed to 4%percent44\%4 % before, and we confirm through visual inspection that data sets with all possible edge densities are indeed produced. The ‘altered prior’ thus encompasses all plausible data sets while still preferentially generating data sets we feel are more likely apriori.

Example (Posterior predictive check over DPG parameters). Now, we repeat this procedure, but draw replicated data sets from the joint after conditioning on 𝒯𝒯\mathcal{T}caligraphic_T. We perform inference drawing M=104𝑀superscript104M=10^{4}italic_M = 10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT posterior samples; for each posterior sample we draw a single replicated data set. In Figure 3 (right) we plot the histogram of the average edge densities of these replicated data sets, denoted ‘posterior’, and compare against the average edge density of the graph labels in 𝒯𝒯\mathcal{T}caligraphic_T, denoted ‘labels’. Because all outcomes are tightly distributed around the mean edge density of the real data, we can have confidence the model parameters have fit appropriately.

6 Experiments

We have introduced DPG, the first BNN for GSL from smooth signals, which is capable of providing estimates of uncertainty of its edge predictions. We showed DPG can effectively incorporate prior information into the prior distribution over its parameters. In this section, we evaluate DPG across synthetic and real datasets, and introduce other baseline models for comparison, including a more expressive variant of DPG.

6.1 Models, metrics, and experimental details

Strict unrollings: DPG and PDS. In the following experiments, the prior used for the DPG’s parameters {θ,δ,b}𝜃𝛿𝑏\{\theta,\delta,b\}{ italic_θ , italic_δ , italic_b } is the ‘altered prior’ as developed in Section 5: θsimilar-to𝜃absent\theta\simitalic_θ ∼ Lognormal(1/2,4124-1/2,4- 1 / 2 , 4), δsimilar-to𝛿absent\delta\simitalic_δ ∼ Lognormal(2222, 2222), and bsimilar-to𝑏absentb\simitalic_b ∼ Lognormal(1111, 2222). We similarly refer to the BNN with Unrolled PDS as its NN model - and no change to the stochastic model - as PDS. As none of PDS’s parameters {α,β,γ,b}𝛼𝛽𝛾𝑏\{\alpha,\beta,\gamma,b\}{ italic_α , italic_β , italic_γ , italic_b } have independent interpretability, we cannot use the prior modeling techniques outlined in Section 5. Attempting HMC on PDS produces significant number of divergent simulated Hamiltonian trajectories, a phenomenon known to be caused by high posterior curvature (Betancourt, 2016). This makes HMC run slowly, produce poorly mixed chains, and ultimately non-performant PDS models. Indeed, we find that values of step-size γ𝛾\gammaitalic_γ which produce divergent behavior (very low likelihood) are close those which are most performant (very high likelihood), indicating high-curvature in the likelihood function, and thus the posterior; see Figure 8 (left). Because γ𝛾\gammaitalic_γ is not interpretable, it is unclear how to shape p(γ)𝑝𝛾p(\gamma)italic_p ( italic_γ ) to reduce posterior curvature without first running a discrete search. Indeed, to find a performant PDS model amenable to efficient HMC, we first run such a discrete search, fix γ𝛾\gammaitalic_γ to a found performant value of 0.10.10.10.1, and then set priors α𝛼\alphaitalic_α, β𝛽\betaitalic_β similar-to\sim LogNormal(0,10)0,10)0 , 10 ) and b𝒩(0,103)similar-to𝑏𝒩0superscript103b\sim\mathcal{N}(0,10^{3})italic_b ∼ caligraphic_N ( 0 , 10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ).

Model expansion via MIMO and partial stochasticity: DPG-MIMO and DPG-MIMO-E. Here, we expand the Unrolled DPG NN for more expressive power by making each layer multi-input multi-output (MIMO). Each MIMO layer maps inputs {𝝀k1N×C\{\bm{\lambda}_{k-1}\in\mathbb{R}^{N\times C}{ bold_italic_λ start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_C end_POSTSUPERSCRIPT, 𝐚k1N(N1)/2×C}\mathbf{a}_{k-1}\in\mathbb{R}^{N(N-1)/2\times C}\}bold_a start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N ( italic_N - 1 ) / 2 × italic_C end_POSTSUPERSCRIPT } to outputs {𝝀kN×C\{\bm{\lambda}_{k}\in\mathbb{R}^{N\times C}{ bold_italic_λ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_C end_POSTSUPERSCRIPT, 𝐚kN(N1)/2×C}\mathbf{a}_{k}\in\mathbb{R}^{N(N-1)/2\times C}\}bold_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N ( italic_N - 1 ) / 2 × italic_C end_POSTSUPERSCRIPT }, where C𝐶Citalic_C is the fixed number of input and output channels across layers. As before, layers share parameters, now 𝜽C×C𝜽superscript𝐶𝐶\bm{\theta}\in\mathbb{R}^{C\times C}bold_italic_θ ∈ blackboard_R start_POSTSUPERSCRIPT italic_C × italic_C end_POSTSUPERSCRIPT. In the k𝑘kitalic_k-th layer, parameter θjisubscript𝜃𝑗𝑖\theta_{ji}\in\mathbb{R}italic_θ start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT ∈ blackboard_R is used to process 𝝀k1[:,i]subscript𝝀𝑘1:𝑖\bm{\lambda}_{k-1}[:,i]bold_italic_λ start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT [ : , italic_i ] and 𝐚k1[:,i]subscript𝐚𝑘1:𝑖\mathbf{a}_{k-1}[:,i]bold_a start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT [ : , italic_i ] as in a regular DPG iteration. Doing so for i{1,,C}𝑖1𝐶i\in\{1,\dots,C\}italic_i ∈ { 1 , … , italic_C } and averaging the refined 𝝀𝝀\bm{\lambda}bold_italic_λ’s and 𝐚𝐚\mathbf{a}bold_a’s produces 𝝀k[:,j]subscript𝝀𝑘:𝑗\bm{\lambda}_{k}[:,j]bold_italic_λ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT [ : , italic_j ] and 𝐚k[:,j]subscript𝐚𝑘:𝑗\mathbf{a}_{k}[:,j]bold_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT [ : , italic_j ], respectively. We use parameters 𝜹C𝜹superscript𝐶\bm{\delta}\in\mathbb{R}^{C}bold_italic_δ ∈ blackboard_R start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT and b𝑏b\in\mathbb{R}italic_b ∈ blackboard_R to produce edge probabilities σ(1C𝒂D𝜹b𝟏)𝜎1𝐶subscript𝒂𝐷𝜹𝑏1\sigma(\frac{1}{C}{\bm{a}}_{D}\bm{\delta}-b\mathbf{1})italic_σ ( divide start_ARG 1 end_ARG start_ARG italic_C end_ARG bold_italic_a start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT bold_italic_δ - italic_b bold_1 ). The priors are unchanged from the ‘altered prior’ developed in the non-MIMO setting from Section 5. The increased parameterization and complexity of operations make the posterior geometry significantly more complex, such that we could not find a setup which produced well-mixed posterior sampling via HMC. Instead, we resort to Maximum a Posteriori (MAP) estimation, equivalent to maximizing the posterior with fixed observed data, using full gradient descent. We call this non-stochastic setup DPG-MIMO.

A common approach to overcome the intractable posterior inference in BNNs is partial stochasticity, where we learn point estimates of a subset of the parameters and distributions over the rest. Recent work has shown partial stochasticity of a BNN can produce similarly useful posterior predictive distributions, even outperforming fully stochastic networks in prediction in some setting (Sharma et al., 2022). Here, we ‘decapitate‘ the MAP trained DPG-MIMO by discarding 𝜹MAPsubscript𝜹MAP\bm{\delta}_{\text{MAP}}bold_italic_δ start_POSTSUBSCRIPT MAP end_POSTSUBSCRIPT and bMAPsubscript𝑏MAPb_{\text{MAP}}italic_b start_POSTSUBSCRIPT MAP end_POSTSUBSCRIPT; instead we feed each of the C𝐶Citalic_C output channels 𝒂D[:,j]subscript𝒂𝐷:𝑗{\bm{a}}_{D}[:,j]bold_italic_a start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT [ : , italic_j ] of the base MAP DPG-MIMO (only parameterized by 𝜽MAPsubscript𝜽MAP\bm{\theta}_{\text{MAP}}bold_italic_θ start_POSTSUBSCRIPT MAP end_POSTSUBSCRIPT) into a new depth 20202020 stochastic single channel DPG δjΓθj20(𝒂D[:,j])subscript𝛿𝑗superscriptsubscriptΓsubscript𝜃𝑗20subscript𝒂𝐷:𝑗\delta_{j}\Gamma_{\theta_{j}}^{20}({\bm{a}}_{D}[:,j])italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 20 end_POSTSUPERSCRIPT ( bold_italic_a start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT [ : , italic_j ] ). We average their output, shift by new stochastic b𝑏bitalic_b, and drive through a sigmoid: σ(1Cj=1CδjΓθj20(𝒂D[:,j]))b𝟏)\sigma\big{(}\frac{1}{C}\sum_{j=1}^{C}\delta_{j}\Gamma_{\theta_{j}}^{20}({\bm{% a}}_{D}[:,j]))-b\mathbf{1}\big{)}italic_σ ( divide start_ARG 1 end_ARG start_ARG italic_C end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 20 end_POSTSUPERSCRIPT ( bold_italic_a start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT [ : , italic_j ] ) ) - italic_b bold_1 ). We can now run inference on these stochastic head parameters {θ1,δ1,,θC,δC,b}subscript𝜃1subscript𝛿1subscript𝜃𝐶subscript𝛿𝐶𝑏\{\theta_{1},\delta_{1},\dots,\theta_{C},\delta_{C},b\}{ italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_δ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_θ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT , italic_δ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT , italic_b } keeping 𝜽MAPsubscript𝜽MAP\bm{\theta}_{\text{MAP}}bold_italic_θ start_POSTSUBSCRIPT MAP end_POSTSUBSCRIPT fixed. We denote this model as DPG-MIMO-E. All MIMO models use C=4𝐶4C=4italic_C = 4: C=8𝐶8C=8italic_C = 8 offered negligible performance gains and posed (partially stochastic) inference challenges, while C=2𝐶2C=2italic_C = 2 was less performant than C=4𝐶4C=4italic_C = 4.

Metrics. To provide summaries of our models’ predictive accuracy and quality of uncertainty we use two proper scoring rules: the Negative Log-Likelihood (NLL) p(𝒂~𝒆~,𝒯)𝑝conditional~𝒂~𝒆𝒯p(\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},\mathcal{T})italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ) and the Brier Score 1||𝔼𝚯𝒯[𝒑^(𝒂~𝒆~,𝚯)𝒂~]2)\frac{1}{|\mathcal{E}|}\mathbb{E}_{\bm{\Theta}\mid\mathcal{T}}[\|\hat{{\bm{p}}% }(\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},\bm{\Theta})-\tilde{{\bm{a}}}]\|^{2})divide start_ARG 1 end_ARG start_ARG | caligraphic_E | end_ARG blackboard_E start_POSTSUBSCRIPT bold_Θ ∣ caligraphic_T end_POSTSUBSCRIPT [ ∥ over^ start_ARG bold_italic_p end_ARG ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , bold_Θ ) - over~ start_ARG bold_italic_a end_ARG ] ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). Beyond proper scoring rules, we use Expected Calibration Error (ECE) (Guo et al., 2017), which measures the correspondence between predicted probabilities and empirical accuracy, and Error, defined as the percentage disagreement between a thresholded pred. mean 𝔼𝚯𝒯[𝒂~𝒆~,𝒯]>0.5subscript𝔼conditional𝚯𝒯delimited-[]conditional~𝒂~𝒆𝒯0.5\mathbb{E}_{\bm{\Theta}\mid\mathcal{T}}[\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},% \mathcal{T}]>0.5blackboard_E start_POSTSUBSCRIPT bold_Θ ∣ caligraphic_T end_POSTSUBSCRIPT [ over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ] > 0.5 and the actual label 𝒂~~𝒂\tilde{{\bm{a}}}over~ start_ARG bold_italic_a end_ARG. See Appendix A.2 for full details.

Hyperparameters and inference details. Utilizing NumPyro’s NUTS implementation of HMC (Phan et al., 2019), our experiments run 4444 chains in parallel, each chain taking 500500500500 warm-up steps before generating 1000100010001000 samples, accumulating M=4000𝑀4000M=4000italic_M = 4000 total samples. We use depth D=200𝐷200D=200italic_D = 200 for all models unless otherwise specified, as we did not observe significant improvements in predictive performance beyond this depth. We choose 𝒂0=12𝟏subscript𝒂0121{\bm{a}}_{0}=\frac{1}{2}\cdot\mathbf{1}bold_italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ⋅ bold_1 (reflecting prior uncertainty of existent edges) and 𝝀0=17𝟏subscript𝝀0171\bm{\lambda}_{0}=17\cdot\mathbf{1}bold_italic_λ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 17 ⋅ bold_1 (approximate average value of limiting 𝝀𝝀\bm{\lambda}bold_italic_λ when running Algorithm 1 to convergence on RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT analytical distance matrices for θ=1𝜃1\theta=1italic_θ = 1) for all experiments. We did not find performance to be sensitive to such choices. Further details can be found in Appendix A.2.

6.2 Synthetic data evaluation

Refer to caption
Figure 4: Effective i.i.d. generalization. Both DPG and PDS are performant and well calibrated BNNs, although PDS requires 3×\approx 3\times≈ 3 × more time for inference. Further performance gains are found with the expanded, partially stochastic DPG-MIMO-E model. The rightmost plot (reliability diagram) indicates high calibration across confidence levels; the left-most bin is least calibrated but contains <0.4%absentpercent0.4<0.4\%< 0.4 % of edges across all models. This plot shows experiments on N=20𝑁20N=20italic_N = 20 RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT graphs. Error bars are scaled by .05.05.05.05 for compact visual effect.

Generative smooth signal model. We build on (Dong et al., 2016)’s work on probabilistic smooth graph signal generation to validate our methods using synthetic data. Let 𝑳=𝑼𝚲𝑼𝑳𝑼𝚲superscript𝑼top{\bm{L}}={\bm{U}}\bm{\Lambda}{\bm{U}}^{\top}bold_italic_L = bold_italic_U bold_Λ bold_italic_U start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT be the eigendecomposition of the graph Laplacian 𝑳=diag(𝑨𝟏)𝑨𝑳diag𝑨1𝑨{\bm{L}}=\textrm{diag}({\bm{A}}\mathbf{1})-{\bm{A}}bold_italic_L = diag ( bold_italic_A bold_1 ) - bold_italic_A, with diagonal eigenvalue matrix 𝚲𝚲\bm{\Lambda}bold_Λ having associated eigenvalues λisubscript𝜆𝑖\lambda_{i}italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT sorted in increasing order, 𝒖isubscript𝒖𝑖{\bm{u}}_{i}bold_italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (i𝑖iitalic_i-th column of 𝑼𝑼{\bm{U}}bold_italic_U) is the eigenvector of 𝑳𝑳{\bm{L}}bold_italic_L with eigenvalue λisubscript𝜆𝑖\lambda_{i}italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. (Dong et al., 2016) proposed that smooth signals can be generated from a colored Gaussian distribution as 𝒙=𝝁+ix^i𝒖i𝒙𝝁subscript𝑖subscript^𝑥𝑖subscript𝒖𝑖{\bm{x}}=\bm{\mu}+\sum_{i}\hat{x}_{i}{\bm{u}}_{i}bold_italic_x = bold_italic_μ + ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, where x^i𝒩(0,λi)similar-tosubscript^𝑥𝑖𝒩0subscriptsuperscript𝜆𝑖\hat{x}_{i}\sim\mathcal{N}(0,\lambda^{\dagger}_{i})over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_λ start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) and 𝝁N𝝁superscript𝑁\bm{\mu}\in\mathbb{R}^{N}bold_italic_μ ∈ blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT denotes an arbitrary mean vector. Therefore 𝒙𝒩(𝝁,𝑳)similar-to𝒙𝒩𝝁superscript𝑳{\bm{x}}\sim\mathcal{N}(\bm{\mu},{\bm{L}}^{\dagger})bold_italic_x ∼ caligraphic_N ( bold_italic_μ , bold_italic_L start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT ). To sample such a distribution, it suffices to draw an initial non-smooth signal 𝒙0𝒩(𝟎,𝑰)similar-tosubscript𝒙0𝒩0𝑰{\bm{x}}_{0}\sim\mathcal{N}(\mathbf{0},{\bm{I}})bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , bold_italic_I ) and then compute 𝒙𝝁+𝑳𝒙0𝒙𝝁superscript𝑳subscript𝒙0{\bm{x}}\leftarrow\bm{\mu}+\sqrt{{\bm{L}}^{\dagger}}{\bm{x}}_{0}bold_italic_x ← bold_italic_μ + square-root start_ARG bold_italic_L start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT end_ARG bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. When 𝝁=𝟎𝝁0\bm{\mu}=\mathbf{0}bold_italic_μ = bold_0, this procedure produces signals such that the power on the i𝑖iitalic_i-th frequency component is x^i2Γ(k=12,θ=2λi)similar-tosuperscriptsubscript^𝑥𝑖2Γformulae-sequence𝑘12𝜃2superscriptsubscript𝜆𝑖\hat{x}_{i}^{2}\sim\Gamma{(k=\frac{1}{2},\theta=2\lambda_{i}^{\dagger})}over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∼ roman_Γ ( italic_k = divide start_ARG 1 end_ARG start_ARG 2 end_ARG , italic_θ = 2 italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT ). Thus the expected power 𝔼[x^i2]=λi𝔼delimited-[]superscriptsubscript^𝑥𝑖2superscriptsubscript𝜆𝑖\mathbb{E}[\hat{x}_{i}^{2}]=\lambda_{i}^{\dagger}blackboard_E [ over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] = italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT is inversely proportional to the associated eigenvalue, concentrating energy on the low frequencies of the Laplacian spectrum. We construct the data matrix 𝑿=[𝒙1,,𝒙P]N×P𝑿subscript𝒙1subscript𝒙𝑃superscript𝑁𝑃{\bm{X}}=[{\bm{x}}_{1},\dots,{\bm{x}}_{P}]\in\mathbb{R}^{N\times P}bold_italic_X = [ bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_x start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_P end_POSTSUPERSCRIPT, where 𝒙psubscript𝒙𝑝{\bm{x}}_{p}bold_italic_x start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT are drawn i.i.d. from 𝒩(𝟎,𝑳)𝒩0superscript𝑳\mathcal{N}(\mathbf{0},{\bm{L}}^{\dagger})caligraphic_N ( bold_0 , bold_italic_L start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT ) as described above. Recalling the definition of the Euclidean distance matrix 𝑬𝑬{\bm{E}}bold_italic_E in Section 3 and using the fact that 𝒙¯i𝒙¯iΓ(k=P2,θ=2Lii)similar-tosuperscriptsubscript¯𝒙𝑖topsubscript¯𝒙𝑖Γformulae-sequence𝑘𝑃2𝜃2subscriptsuperscript𝐿𝑖𝑖\bar{{\bm{x}}}_{i}^{\top}\bar{{\bm{x}}}_{i}\sim\Gamma(k=\frac{P}{2},\theta=2L^% {\dagger}_{ii})over¯ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT over¯ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ roman_Γ ( italic_k = divide start_ARG italic_P end_ARG start_ARG 2 end_ARG , italic_θ = 2 italic_L start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT ) and 𝔼[𝒙¯i𝒙¯j]=p=1P𝔼[xipxjp]=p=1PCov(xip,xjp)=Lij,(i,j)formulae-sequence𝔼delimited-[]superscriptsubscript¯𝒙𝑖topsubscript¯𝒙𝑗superscriptsubscript𝑝1𝑃𝔼delimited-[]subscript𝑥𝑖𝑝subscript𝑥𝑗𝑝superscriptsubscript𝑝1𝑃Covsubscript𝑥𝑖𝑝subscript𝑥𝑗𝑝subscriptsuperscript𝐿𝑖𝑗for-all𝑖𝑗\mathbb{E}[\bar{{\bm{x}}}_{i}^{\top}\bar{{\bm{x}}}_{j}]=\sum_{p=1}^{P}\mathbb{% E}[x_{ip}x_{jp}]=\sum_{p=1}^{P}\text{Cov}(x_{ip},x_{jp})=L^{\dagger}_{ij},% \forall(i,j)blackboard_E [ over¯ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT over¯ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ] = ∑ start_POSTSUBSCRIPT italic_p = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P end_POSTSUPERSCRIPT blackboard_E [ italic_x start_POSTSUBSCRIPT italic_i italic_p end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_j italic_p end_POSTSUBSCRIPT ] = ∑ start_POSTSUBSCRIPT italic_p = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P end_POSTSUPERSCRIPT Cov ( italic_x start_POSTSUBSCRIPT italic_i italic_p end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_j italic_p end_POSTSUBSCRIPT ) = italic_L start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT , ∀ ( italic_i , italic_j ), we can show 𝔼[𝑬]=𝟏diag(𝑳)+diag(𝑳)𝟏2𝑳𝔼delimited-[]𝑬1diagsuperscriptsuperscript𝑳topdiagsuperscript𝑳superscript1top2superscript𝑳\mathbb{E}[{\bm{E}}]=\mathbf{1}\text{diag}({\bm{L}}^{\dagger})^{\top}+\text{% diag}({\bm{L}}^{\dagger})\mathbf{1}^{\top}-2{\bm{L}}^{\dagger}blackboard_E [ bold_italic_E ] = bold_1 diag ( bold_italic_L start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT + diag ( bold_italic_L start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT ) bold_1 start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT - 2 bold_italic_L start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT which we call the ‘analytic distance matrix’. We henceforth use 𝔼[𝑬]𝔼delimited-[]𝑬\mathbb{E}[{\bm{E}}]blackboard_E [ bold_italic_E ] and its finite sample approximations from P𝑃Pitalic_P signals for evaluation.

Refer to caption
Figure 5: Qualitative i.i.d. generalization. Left: For a random test sample we show the label 𝒂~~𝒂\tilde{{\bm{a}}}over~ start_ARG bold_italic_a end_ARG, and estimated mean (pred. mean) and standard deviation (pred. stdv) of the edge-wise marginal posterior predictive p(ai~𝒆~,𝒯)𝑝conditional~subscript𝑎𝑖~𝒆𝒯p(\tilde{a_{i}}\mid\tilde{{\bm{e}}},\mathcal{T})italic_p ( over~ start_ARG italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ). Comparing pred. mean to the label 𝒂~~𝒂\tilde{{\bm{a}}}over~ start_ARG bold_italic_a end_ARG adds qualitative evidence that the model is well fit to the data. Right: The edge-wise uncertainty estimate (pred. stdv.) and error |a~i𝔼𝚯|𝒯[a~i|𝒆~,𝒯]||\tilde{a}_{i}-\mathbb{E}_{\bm{\Theta}|\mathcal{T}}[\tilde{a}_{i}|\tilde{{\bm{% e}}},\mathcal{T}]|| over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - blackboard_E start_POSTSUBSCRIPT bold_Θ | caligraphic_T end_POSTSUBSCRIPT [ over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | over~ start_ARG bold_italic_e end_ARG , caligraphic_T ] | have a strong positive Pearson correlation ρ=0.91𝜌0.91\rho=0.91italic_ρ = 0.91.

Evaluation of i.i.d. generalization. Here, we train and test DPG, PDS, and DPG-MIMO-E on N=20𝑁20N=20italic_N = 20 RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT graphs and their corresponding analytic distance matrices. We use T=50𝑇50T=50italic_T = 50 graphs for training and 100100100100 for testing. We present the results in Figure 4. DPG and PDS perform comparably; although DPG has marginally better accuracy and calibration. Significant improvement is found in the more expressive DPG-MIMO-E across all metrics, showing our simple 3-parameter model can be effectively expanded. These models are similarly performant across graph distributions (see the experiments reported in Table 1 and also in Appendix A.2); we show a single graph ensemble here for brevity.

Figure 5 depicts estimates of the first two moments of the posterior predictive distributions produced by DPG on a random test sample. We observe well-calibrated confidence and uncertainty, plus a strong correlation between error magnitude and uncertainty; indeed a useful uncertainty estimate should be a proxy for error.

Predictive uncertainty under distribution shift. We evaluate two forms of distribution shift: corruptions to the noiseless analytic distance matrix 𝔼[𝑬]𝔼delimited-[]𝑬\mathbb{E}[{\bm{E}}]blackboard_E [ bold_italic_E ] and label mismatch. To investigate corruptions, we train using T=50𝑇50T=50italic_T = 50 RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT (N=20𝑁20N=20italic_N = 20) labels and their analytic distance matrix, and test on 100100100100 RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT samples which instead use P𝑃Pitalic_P signals to compute an Euclidean distance matrix 𝑬𝑬{\bm{E}}bold_italic_E. Fewer samples P𝑃Pitalic_P tend to produce larger corruption magnitudes, i.e., deviations from 𝔼[𝑬]𝔼delimited-[]𝑬\mathbb{E}[{\bm{E}}]blackboard_E [ bold_italic_E ]. Figure 6 shows that indeed, all models display lower predictive accuracy and higher uncertainty on the increasingly shifted input data, with the simpler models outperforming in the presence of larger corruption. To investigate label mismatch, models are fit to the same training data as above - T=50𝑇50T=50italic_T = 50 RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT (N=20𝑁20N=20italic_N = 20) labels with analytic distance matrix input - but now tested on 100100100100 test samples from the following different N=20𝑁20N=20italic_N = 20 random graph distributions: (i) random geometric graphs with connectivity radius 1212\frac{1}{2}divide start_ARG 1 end_ARG start_ARG 2 end_ARG (RG1212{}_{\frac{1}{2}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_FLOATSUBSCRIPT), Erdős-Rényi graphs with edge probability p=0.5𝑝0.5p=0.5italic_p = 0.5 (ER1212{}_{\frac{1}{2}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_FLOATSUBSCRIPT), and Barabási-Albert graphs with m=1𝑚1m=1italic_m = 1 link per node (BA1). Results are displayed in Table 1. All models show lower accuracy (\uparrow Brier Score) and commensurately higher uncertainty (\uparrow NLL) on these mismatched data. DPG methods tend to maintain better calibration than PDS.

Refer to caption
Figure 6: Detection of covariate shift across models. Using RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT graphs with N=20𝑁20N=20italic_N = 20 nodes, we fit models on analytic Euclidean distance matrices and evaluate on corrupted distance matrices in a test set. Fewer signals P𝑃Pitalic_P used tends to increase corruption magnitude. Log-scaling unintentionally makes error bars appear to grow with P𝑃Pitalic_P.
Table 1: Detection of label mismatch. Models are fit to RG13subscriptRG13\text{RG}_{\frac{1}{3}}RG start_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_POSTSUBSCRIPT and tested on 100100100100 samples from the same RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT as well as three different graph distributions. For each metric, the mean over the test set is reported.
NLL BS (×102absentsuperscript102\times\mathrm{1}\mathrm{0}^{-2}× 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT) ECE (×103absentsuperscript103\times\mathrm{1}\mathrm{0}^{-3}× 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT)
RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT RG12subscriptRG12\text{RG}_{\frac{1}{2}}RG start_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUBSCRIPT ER12subscriptER12\text{ER}_{\frac{1}{2}}ER start_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUBSCRIPT BA1subscriptBA1\text{BA}_{1}BA start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT RG12subscriptRG12\text{RG}_{\frac{1}{2}}RG start_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUBSCRIPT ER12subscriptER12\text{ER}_{\frac{1}{2}}ER start_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUBSCRIPT BA1subscriptBA1\text{BA}_{1}BA start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT RG12subscriptRG12\text{RG}_{\frac{1}{2}}RG start_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUBSCRIPT ER12subscriptER12\text{ER}_{\frac{1}{2}}ER start_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUBSCRIPT BA1subscriptBA1\text{BA}_{1}BA start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT
DPG 10.33 17.53 33.01 20.91 1.29 2.29 4.66 2.10 3.42 52.06 171.38 16.60
PDS 10.32 17.70 33.43 20.80 1.29 2.33 4.80 2.10 3.75 52.87 173.53 16.60
DPG-MIMO 9.19 17.13 40.49 13.37 1.19 1.74 4.62 2.06 3.50 54.61 161.71 11.24
DPG-MIMO-E 8.35 12.66 37.77 8.35 1.07 1.52 7.82 1.32 2.40 29.17 29.07 2.84

Scaling to larger graphs. Direct inference in large graph settings is challenged by substantial memory demands. Specifically, providing gradients for HMC via naive backpropagation necessitates storing all intermediate outputs, while full Bayesian inference mandates holding the full data set in memory, cumulatively leading to a memory footprint of 𝒪(DN2T)𝒪𝐷superscript𝑁2𝑇\mathcal{O}(DN^{2}T)caligraphic_O ( italic_D italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_T ). Adopting forward-mode auto-differentiation partially mitigates this issue by removing depth dependency, effectively reducing memory complexity to 𝒪(N2T)𝒪superscript𝑁2𝑇\mathcal{O}(N^{2}T)caligraphic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_T ). The time complexity for MCMC also presents scalability challenges: each forward pass has time complexity of 𝒪(DN2T)𝒪𝐷superscript𝑁2𝑇\mathcal{O}(DN^{2}T)caligraphic_O ( italic_D italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_T ), and each posterior sample requires multiple forward passes. Attempting to instead perform transfer learning – where inference occurs with smaller graphs to then test on larger graphs – still faces an obstacle. The objective function (3) contains the terms 2𝒂𝒆=2i=1N(N1)/2aiei2superscript𝒂top𝒆2superscriptsubscript𝑖1𝑁𝑁12subscript𝑎𝑖subscript𝑒𝑖2{\bm{a}}^{\top}{\bm{e}}=2\sum_{i=1}^{N(N-1)/2}a_{i}e_{i}2 bold_italic_a start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_e = 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N ( italic_N - 1 ) / 2 end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, β2𝒂22=β2i=1N(N1)/2ai2𝛽2subscriptsuperscriptnorm𝒂22𝛽2superscriptsubscript𝑖1𝑁𝑁12superscriptsubscript𝑎𝑖2\frac{\beta}{2}\|{\bm{a}}\|^{2}_{2}=\frac{\beta}{2}\sum_{i=1}^{N(N-1)/2}a_{i}^% {2}divide start_ARG italic_β end_ARG start_ARG 2 end_ARG ∥ bold_italic_a ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = divide start_ARG italic_β end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N ( italic_N - 1 ) / 2 end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, and α𝟏log(𝑺𝒂)=αi=1Nlog(𝑺𝒂)i𝛼superscript1toplog𝑺𝒂𝛼superscriptsubscript𝑖1𝑁logsubscript𝑺𝒂𝑖-\alpha\mathbf{1}^{\top}\text{log}({\bm{S}}{\bm{a}})=-\alpha\sum_{i=1}^{N}% \text{log}({\bm{S}}{\bm{a}})_{i}- italic_α bold_1 start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT log ( bold_italic_S bold_italic_a ) = - italic_α ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT log ( bold_italic_S bold_italic_a ) start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The former two terms involve N(N1)/2𝑁𝑁12N(N-1)/2italic_N ( italic_N - 1 ) / 2 summands, while the latter is a sum over N𝑁Nitalic_N elements. This results in disparate growth rates for these terms as N𝑁Nitalic_N increases, which poses a challenge for parameter optimization across different graph sizes, i.e., we find parameters optimized for a graph with Nisubscript𝑁𝑖N_{i}italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT nodes are not effective for N>>Nimuch-greater-than𝑁subscript𝑁𝑖N>>N_{i}italic_N > > italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Moreover, the differing growth rates cause standard cardinality-based normalization schemes (dividing each term by its number summands) to fail, and determining analytically how the parameters should change as a function of size is non-trivial; see Appendix A.3 for a detailed discussion.

In light of this, we perform an adjusted transfer learning experiment by fitting the empirical growth trends of MAP DPG parameter estimates on N={20,50,100,200}𝑁2050100200N=\{20,50,100,200\}italic_N = { 20 , 50 , 100 , 200 } ER1414{}_{\frac{1}{4}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_FLOATSUBSCRIPT graphs and their corresponding analytic distance matrices. By using a moderate depth D=200𝐷200D=200italic_D = 200 and T=10𝑇10T=10italic_T = 10 training samples, MAP inference on larger graphs with N=200𝑁200N=200italic_N = 200 nodes takes 1similar-toabsent1\sim 1∼ 1 hour on a M2222 MacBook laptop. Using this empirical parameter fit we can extrapolate parameter values and perform transfer learning on 100100100100 ER1414{}_{\frac{1}{4}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_FLOATSUBSCRIPT test graphs of size up to N=1000𝑁1000N=1000italic_N = 1000; again done locally without any GPU. The transfer learning experiment reveals an expected but graceful decay in performance in NLL as N𝑁Nitalic_N increases. Further information and plots on these scaling experiments are available in Appendix A.3.

6.3 Ablation studies

Prior modeling. To inspect the influence of informative prior modeling on DPG, we define model ‘DPG-U’ which replaces DPG’s informative priors with the following uninformative priors: log θU[106,106]similar-to𝜃𝑈superscript106superscript106\theta\sim U\left[10^{-6},10^{6}\right]italic_θ ∼ italic_U [ 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT , 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT ], δ,b𝒩(𝟎,103𝐈)similar-to𝛿𝑏𝒩0superscript103𝐈\delta,b\sim\mathcal{N}(\mathbf{0},10^{3}\mathbf{I})italic_δ , italic_b ∼ caligraphic_N ( bold_0 , 10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT bold_I ). What we find is that in data-rich regimes an informative prior mostly acts to improve efficiency of posterior inference, e.g., when fitting to T=50𝑇50T=50italic_T = 50 RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT graphs with N=20𝑁20N=20italic_N = 20 nodes (using analytic distance matrices) and testing on 100100100100 i.i.d. samples, it reduces the time needed to generate M=4000𝑀4000M=4000italic_M = 4000 samples by a factor of 7.8×7.8\times7.8 × and increases the effective sample size - a measure of the number of independent samples with the same estimation power as the observed correlated samples (Gelman et al., 1995) - for θ𝜃\thetaitalic_θ (ESSθ) by a factor of 5×5\times5 ×. In data poor regimes it also tends to help slow down performance degradation, e.g., when instead using T=2𝑇2T=2italic_T = 2 with the same graph data, NLL and ECE are 37%percent3737\%37 % and 13.7%percent13.713.7\%13.7 % better with the prior than without. See Table 3 in Appendix A.4 for the full numerical results.

Partial stochasticity. We conducted an ablation study investigating the benefits of adding partial stochasticity to the DPG-MIMO model (C=4,D=200formulae-sequence𝐶4𝐷200C=4,D=200italic_C = 4 , italic_D = 200) using MAP point estimates; the partially stochastic setup is described in Section 6.1. The training and test sets are identical to the i.i.d. generalization experiment above. Our experiments showed that introducing partial stochasticity in DPG-MIMO-E markedly improved NLL, BS, error, and ECE by 9.1%percent9.19.1\%9.1 %, 10.3%percent10.310.3\%10.3 %, 9.0%percent9.09.0\%9.0 %, and 31.4%percent31.431.4\%31.4 %, respectively.

6.4 Real data evaluation

Refer to caption
Refer to caption
Figure 7: Quantifying Uncertainty with Financial Data and Images of Digits. Left: S&P500500500500. We randomly split stocks from 3333 sectors of the S&P500500500500 and compute their input 𝟏𝟏|\mathbf{1}\mathbf{1}^{\top}-|bold_11 start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT - |ΣΣ\Sigmaroman_Σ||||, where ΣΣ\Sigmaroman_Σ is the matrix of Pearson correlation coefficients, using log daily returns over an extended period. The graph label connects stocks of the same sector indicated by red block diagonal outline. There is only a single train and test sample. Here, we display the test sample input and estimates of the mean (pred. mean) and standard deviation (pred. stdv.) of the posterior predictive. In pred. mean, white (black) inside (outside) the block diagonal indicates correct prediction for both experiments. Pred. mean which don’t match their label tend to have larger uncertainty. Error and pred. stdv. have a Pearson correlation of 0.700.700.700.70 over the test set. Right: MNIST digits. We construct graphs of MNIST digits "1111" and "2222", connecting nodes of the same digit, similarly outlined in red ("1111"s are the first block). The input is the log pairwise Euclidean distance of their vectorized image. Notice "1111"s tend to be much closer to each other than "2222"s. DPG again shows an ability to place higher uncertainty on edges with higher error. Error and pred. stdv. have a Pearson correlation of 0.620.620.620.62 over the test set. Pred. stdv. in both plots are maximum normalized for visual clarity.

Quantifying uncertainty in networks learnt from S&P500500500500 stock prices. To verify that DPG provides useful measures of uncertainty on edge predictions based on real data, we first use the financial dataset presented in (de Miranda Cardoso et al., 2021). The time series consist of S&P500500500500 daily stock prices from three sectors (Communication Services, Utilities, and Real Estate), comprising a total of N=82𝑁82N=82italic_N = 82 stocks. The data spans the period from Jan. 3333rd 2014201420142014 to Dec. 29292929th 2017201720172017, yielding P=1006𝑃1006P=1006italic_P = 1006 daily observations. We divided these stocks equally into training and testing sets. For both sets, we created a log-returns data matrix, denoted as 𝑿N/2×P𝑿superscript𝑁2𝑃{\bm{X}}\in\mathbb{R}^{N/2\times P}bold_italic_X ∈ blackboard_R start_POSTSUPERSCRIPT italic_N / 2 × italic_P end_POSTSUPERSCRIPT. Specifically, Xi,j=log SPi,jlog SPi,j1subscript𝑋𝑖𝑗log 𝑆subscript𝑃𝑖𝑗log 𝑆subscript𝑃𝑖𝑗1X_{i,j}=\text{log }SP_{i,j}-\text{log }SP_{i,j-1}italic_X start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = log italic_S italic_P start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT - log italic_S italic_P start_POSTSUBSCRIPT italic_i , italic_j - 1 end_POSTSUBSCRIPT, where SPi,j𝑆subscript𝑃𝑖𝑗SP_{i,j}italic_S italic_P start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT signifies the closing price of the i𝑖iitalic_i-th stock on the j𝑗jitalic_j-th day. From each matrix (train and test), we derived a matrix of Pearson correlation coefficients ΣΣ\Sigmaroman_Σ. We take the input to be 𝑬𝟏𝟏|{\bm{E}}\leftarrow\mathbf{1}\mathbf{1}^{\top}-|bold_italic_E ← bold_11 start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT - |ΣΣ\Sigmaroman_Σ||||, thus yielding measure of dissimilarity. We also constructed a binary label graph, assigning an edge weight value of 1111 for stocks within the same sector and 00 for pairs in different sectors. Thus, our training and testing datasets each contain a single sample. Nodes from the same sector are numbered contiguously creating a block diagonal adjacency matrix structure, displayed with the red outline in Figure 7 (left).

We use DPG with depth D=200𝐷200D=200italic_D = 200 and identify a high-performing parameter value range for θ𝜃\thetaitalic_θ via predictive checking as in Section 5.2. Such values aligned closely with those found in the synthetic experiments, and so we keep the priors unchanged. We run Bayesian parameter inference on a M2222 MacBook laptop which takes about 2222 minutes; we visualize the input, empirical mean, and standard deviation of the posterior predictive on the test sample in Figure 7 (left). Notably, while our mean recovery is robust, there are areas where it deviates from the label. These areas tend to exhibit increased variation in the posterior predictive. The Pearson correlation coefficient between the error and predictive standard deviation on all edges, true positive edges, and true negative edges is 0.700.700.700.70, 0.700.700.700.70, and 0.790.790.790.79, respectively, suggests that our model’s uncertainty can be a useful proxy for error in the absence of labels in this financial test case.

Learning the graph of MNIST digits 1 and 2. To further demonstrate DPG’s capability to quantify uncertainty on graphs estimated from images of digits, we emulate the experiment from (Kalofolias, 2016) which learns a graph to classify handwritten digits “1111” and “2222” from the MNIST database (LeCun et al., 1998). Each digit is an image of 28×28282828\times 2828 × 28 pixels, each pixel taking integer value in 0,,25502550,\dots,2550 , … , 255. As (Kalofolias, 2016) reports, this problem is particular because digits "1111" are much close to each other than "2222" are (average square distance of 45454545 and 102102102102, respectively). A single sample in our dataset is constructed as follows. We randomly sample 25252525 images of digits “1111” and “2222”. Each image represents a node in this graph sample, hence N=50𝑁50N=50italic_N = 50. We connect nodes of the same digit with a binary edge. The image corresponding to each node is vectorized into 𝒙¯i282subscript¯𝒙𝑖superscriptsuperscript282\bar{{\bm{x}}}_{i}\in\mathbb{R}^{28^{2}}over¯ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 28 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT, i=1,N𝑖1𝑁i=1,\ldots Nitalic_i = 1 , … italic_N, becoming the nodal feature vector. For ease of label visualization, we order the nodes such that all “1111” digits come before “2222” digits, which creates a block diagonal adjacency matrix, indicated by the red outline in Figure 7 (right). The nodal dissimilarity matrix 𝑬𝑬{\bm{E}}bold_italic_E is computed as the log pairwise Euclidean distance of the node features (vectorized digit images). We construct a training and test set of size T=5𝑇5T=5italic_T = 5 and 50505050, respectively.

We use the same modeling and inference setup as in the previous real data experiment; inference takes 10absent10\approx 10≈ 10 minutes. Figure 7 (right) displays the input 𝑬𝑬{\bm{E}}bold_italic_E, pred. mean, and pred. stdv. on a random test sample. Visual inspection reveals performant mean prediction, and that edges with high errors tend to have high variation. Indeed, across the entire test set the two have a strong positive Pearson correlation coefficient over all edges, true positive edges, and true negative edges of 0.620.620.620.62, 0.720.720.720.72, and 0.510.510.510.51, respectively, suggesting DPG effectively quantifies predictive uncertainty of edges in this image analysis setting.

7 Concluding summary, limitations, and the road ahead

In this work we identify independent interpretability of (regularization) parameters in inverse problems as a key property, which allows one to incorporate prior information on solution characteristics into prior distributions over the NN parameters of an unrolled optimization algorithm. We investigate an inverse problem with this independent interpretability property in the context of GSL from smooth signals, and introduce an optimization algorithm that is simpler than existing methods, without sacrificing estimation performance. We unroll this algorithm producing the first true NN for GSL from smooth signals. We leverage independent interpretability to incorporate prior information about sparsity characteristics of sought graphs into prior distributions over unrolled network parameters, producing the first BNN for GSL from smooth signal observations. We lean into the advantages of unrollings - low parameter dimensionality, fast compiler and auto-grad friendly layers, as well as the empirical need for shallow networks - to perform posterior approximation using MCMC sampling. Doing so yields high-quality and well-calibrated uncertainty estimates over edge predictions, as demonstrated via comprehensive experiments with synthetic and real datasets.

The primary limitation of our approach lies in its scalability to moderate- and large-sized graphs and datasets. The computational and memory requirements to achieve asymptotically exact posterior inference through MCMC sampling are substantial, owing to the necessity for numerous forward network passes per posterior sample and the requirement to store the entire data set in memory. Promising directions of future work include exploration of non-asymptotically exact posterior inference methods, namely variational inference, which typically requires much less computation and allows for mini-batch training and thus use of larger datasets (Jospin et al., 2022). As we produce distributions over output graphs, future work can explore Bayesian decision analysis with the introduction of a utility function, or even using DPG as a module within a larger system, where propagation of uncertainty over the inferred graph is important.

References

  • Barbano et al. (2020) Riccardo Barbano, Zeljko Kereta, Chen Zhang, Andreas Hauptmann, Simon Arridge, and Bangti Jin. Quantifying sources of uncertainty in deep learning-based image reconstruction. In NeurIPS Work. Deep Learn. Inverse Probl., 2020.
  • Beck & Teboulle (2014) Amir Beck and Marc Teboulle. A fast dual proximal gradient algorithm for convex minimization and applications. Oper. Res. Lett., 42(1):1–6, 2014.
  • Belkin & Niyogi (2001) Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. NeurIPS, 14, 2001.
  • Betancourt (2016) Michael Betancourt. Diagnosing suboptimal cotangent disintegrations in hamiltonian monte carlo. arXiv preprint arXiv:1604.00695, 2016.
  • Bishop (1995) Christopher M Bishop. Neural networks for pattern recognition. Oxford university press, 1995.
  • Bronstein et al. (2017) Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Process. Mag., 34(4):18–42, 2017.
  • Butts (2003) Carter T Butts. Network inference, error, and informant (in) accuracy: a Bayesian approach. Soc. Netw., 2003.
  • Cai et al. (2013) Xiaodong Cai, Juan Andrés Bazerque, and Georgios B Giannakis. Inference of gene regulatory networks with sparse structural equation models exploiting genetic perturbations. PLoS Comput. Biol., 2013.
  • Chepuri et al. (2017) Sundeep Prabhakar Chepuri, Sijia Liu, Geert Leus, and Alfred O Hero. Learning sparse graphs under smoothness prior. In Proc. Int. Conf. Acoustics, Speech, Signal Process., 2017.
  • Coates et al. (2002) Mark Coates, Rui Castro, Robert Nowak, Manik Gadhiok, Ryan King, and Yolanda Tsang. Maximum likelihood network topology identification from edge-based unicast measurements. ACM SIGMETRICS Perform. Eval. Rev., 2002.
  • Cosmo et al. (2020) Luca Cosmo, Anees Kazi, Seyed-Ahmad Ahmadi, Nassir Navab, and Michael Bronstein. Latent-graph learning for disease prediction. In Med. Image Comput. Comput. Assist. Intervent., 2020.
  • Crawford (2015) Forrest W Crawford. Hidden network reconstruction from information diffusion. In Int. Conf. Inform. Fusion., 2015.
  • de Miranda Cardoso et al. (2021) Jose Vinicius de Miranda Cardoso, Jiaxi Ying, and Daniel Palomar. Graphical models in heavy-tailed markets. NeurIPS, 2021.
  • Dong et al. (2016) Xiaowen Dong, Dorina Thanou, Pascal Frossard, and Pierre Vandergheynst. Learning Laplacian matrix in smooth graph signal representations. IEEE Trans. Signal Process., 64(23):6160–6173, 2016.
  • Dong et al. (2019) Xiaowen Dong, Dorina Thanou, Michael Rabbat, and Pascal Frossard. Learning graphs from data: A signal representation perspective. IEEE Signal Process. Mag., 36(3):44–63, 2019.
  • Ekmekci & Cetin (2022) Canberk Ekmekci and Mujdat Cetin. Uncertainty quantification for deep unrolling-based computational imaging. IEEE Trans. Comput. Imaging, 8, 2022.
  • Fatima et al. (2022) Ghania Fatima, Aakash Arora, Prabhu Babu, and Petre Stoica. Learning sparse graphs via majorization-minimization for smooth node signals. IEEE Signal Process. Lett., 29:1022–1026, 2022.
  • Gabry et al. (2017) Jonah Gabry, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. Visualization in bayesian workflow. arXiv preprint arXiv:1709.01449, 2017.
  • Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
  • Gelman et al. (1995) Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 3rd edition, 1995.
  • Gelman et al. (2017) Andrew Gelman, Daniel Simpson, and Michael Betancourt. The prior can often only be understood in the context of the likelihood. Entropy, 2017.
  • Gray et al. (2020) Caitlin Gray, Lewis Mitchell, and Matthew Roughan. Bayesian inference of network structure from information cascades. IEEE Trans. Signal Inf. Process. Netw., 2020.
  • Gregor & LeCun (2010) Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In ICML, pp.  399–406, 2010.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In ICML, 2017.
  • Hamilton (2020) William L. Hamilton. Graph representation learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 14(3):1–159, 2020.
  • Hu et al. (2013) Chenhui Hu, Lin Cheng, Jorge Sepulcre, Georges El Fakhri, Yue M Lu, and Quanzheng Li. A graph theoretical regression model for brain connectivity learning of alzheimer’s disease. In IEEE 10th Int. Symp. Biomed. Imaging, 2013.
  • Huang et al. (2018) Weiyu Huang, Antonio G Marques, and Alejandro R Ribeiro. Rating prediction via graph signal processing. IEEE Trans. Signal Process., 2018.
  • Izmailov et al. (2021) Pavel Izmailov, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Gordon Wilson. What are bayesian neural network posteriors really like? In ICML, 2021.
  • Jiang & Kolaczyk (2011) Xiaoyu Jiang and Eric D. Kolaczyk. A latent eigenprobit model with link uncertainty for prediction of protein–protein interactions. Stat. Biosci., 2011.
  • Jospin et al. (2022) L. V. Jospin, H. Laga, F. Boussaid, W. Buntine, and M. Bennamoun. Hands-on Bayesian neural networks—A tutorial for deep learning users. IEEE Comput. Intell. Mag., 17, 2022.
  • Kalofolias (2016) Vassilis Kalofolias. How to learn a graph from smooth signals. In Int. Conf. Artif. Intell. Stat., 2016.
  • Kalofolias & Perraudin (2017) Vassilis Kalofolias and Nathanaël Perraudin. Large scale graph learning from smooth signals. In ICLR, 2017.
  • Kolaczyk (2009) Eric D. Kolaczyk. Statistical Analysis of Network Data: Methods and Models. Springer, New York, NY, 2009.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 1998.
  • Marti et al. (2021) Gautier Marti, Frank Nielsen, Mikołaj Bińkowski, and Philippe Donnat. A review of two decades of correlations, hierarchies, networks and clustering in financial markets. Prog. Inf. Geom. Theory Appl., 2021.
  • Mateos et al. (2019) Gonzalo Mateos, Santiago Segarra, Antonio G. Marques, and Alejandro Ribeiro. Connecting the dots: Identifying network structure via graph signal processing. IEEE Signal Process. Mag., 36(3):16–43, 2019.
  • Monga et al. (2021) Vishal Monga, Yuelong Li, and Yonina C Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Process. Mag., 38, 2021.
  • Murphy (2023) Kevin P. Murphy. Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023.
  • Opolka & Lió (2022) Felix Opolka and Pietro Lió. Bayesian link prediction with deep graph convolutional gaussian processes. In PMLR, 2022.
  • Pal & Coates (2019) Soumyasundar Pal and Mark Coates. Scalable MCMC in degree corrected stochastic block model. In Proc. Int. Conf. Acoustics, Speech, Signal Process., 2019.
  • Pal et al. (2020) Soumyasundar Pal, Saber Malekmohammadi, Florence Regol, Yingxue Zhang, Yishi Xu, and Mark Coates. Non-parametric graph learning for bayesian graph neural networks. In Proc. Conf. Uncertain. Artif. Intell., PMLR, 2020.
  • Phan et al. (2019) Du Phan, Neeraj Pradhan, and Martin Jankowiak. Composable effects for flexible and accelerated probabilistic programming in numpyro. In NeurIPS Workshop Program Transform. ML, 2019.
  • Pu et al. (2021) Xingyue Pu, Tianyue Cao, Xiaoyun Zhang, Xiaowen Dong, and Siheng Chen. Learning to learn graph topologies. In NeurIPS, 2021.
  • Saboksayr & Mateos (2021) Seyed Saman Saboksayr and Gonzalo Mateos. Accelerated graph learning from smooth signals. IEEE Signal Process. Lett., 28, 2021.
  • Sevilla & Segarra (2023) Martín Sevilla and Santiago Segarra. Bayesian topology inference on partially known networks from input-output pairs. arXiv preprint arXiv:2309.06532, 2023.
  • Shaghaghian & Coates (2016) Shohreh Shaghaghian and Mark Coates. Bayesian inference of diffusion networks with unknown infection times. In Proc. IEEE Workshop on Statistical Signal Process., 2016.
  • Sharma et al. (2022) Mrinank Sharma, Sebastian Farquhar, Eric Nalisnick, and Tom Rainforth. Do bayesian neural networks need to be fully stochastic? arXiv preprint arXiv:2211.06291, 2022.
  • Shrivastava et al. (2020) Harsh Shrivastava, Xinshi Chen, Binghong Chen, Guanghui Lan, Srinivas Aluru, Han Liu, and Le Song. GLAD: Learning sparse graph recovery. In ICLR, 2020.
  • Wang et al. (2021) Xiaolu Wang, Chaorui Yao, Haoyu Lei, and Anthony Man-Cho So. An efficient alternating direction method for graph learning from smooth signals. In Proc. Int. Conf. Acoustics, Speech, Signal Process., 2021.
  • Wang et al. (2023) Xiaolu Wang, Chaorui Yao, and Anthony Man-Cho So. A linearly convergent optimization framework for learning graphs from smooth signals. IEEE Trans. Signal Inf. Process. Netw., 9, 2023.
  • Wasserman et al. (2023) Max Wasserman, Saurabh Sihag, Gonzalo Mateos, and Alejandro Ribeiro. Learning graph structure from convolutional mixtures. TMLR, 2023.
  • Wenzel et al. (2020) Florian Wenzel, Kevin Roth, Bastiaan Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? In ICML, 2020.
  • Williamson (2016) Sinead A. Williamson. Nonparametric network models for link prediction. JMLR, 2016.
  • Zhang et al. (2021) Chen Zhang, Riccardo Barbano, and Bangti Jin. Conditional variational autoencoder for learned image reconstruction. Computation, 2021.
  • Zhang et al. (2019) Yingxue Zhang, Soumyasundar Pal, Mark Coates, and Deniz Ustebay. Bayesian graph convolutional neural networks for semi-supervised classification. In AAAI, 2019.

Appendix A Appendix

A.1 Optimization algorithms for model-based GSL methods

A.1.1 Primal-dual splitting (PDS) algorithm

Algorithm 2 was introduced in (Kalofolias, 2016) to solve the (α,β𝛼𝛽\alpha,\betaitalic_α , italic_β)-parameterization of (3), and was subsequently unrolled by (Pu et al., 2021) on the same objective. Note the increased number of operations, intermediate variables, and parameters as compared to Algorithm 1.

Algorithm 2 Proximal Dual Splitting (PDS)
Inputs: Fixed parameters α,β,γ𝛼𝛽𝛾\alpha,\beta,\gamma\in\mathbb{R}italic_α , italic_β , italic_γ ∈ blackboard_R, and data 𝒆𝒆{\bm{e}}bold_italic_e.
Initialize: 𝒂0subscript𝒂0{\bm{a}}_{0}bold_italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and 𝒗0subscript𝒗0{\bm{v}}_{0}bold_italic_v start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT at random.
for k=1,2,𝑘12k=1,2,\dotsitalic_k = 1 , 2 , … do
   𝒓1,k=𝒂kγ(2β𝒂k+2𝒆+𝑺𝒗k)subscript𝒓1𝑘subscript𝒂𝑘𝛾2𝛽subscript𝒂𝑘2𝒆superscript𝑺topsubscript𝒗𝑘{\bm{r}}_{1,k}={\bm{a}}_{k}-\gamma(2\beta{\bm{a}}_{k}+2{\bm{e}}+{\bm{S}}^{\top% }{\bm{v}}_{k})bold_italic_r start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT = bold_italic_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - italic_γ ( 2 italic_β bold_italic_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + 2 bold_italic_e + bold_italic_S start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ).
   𝒓2,k=𝒗k+γ𝑺𝒂ksubscript𝒓2𝑘subscript𝒗𝑘𝛾𝑺subscript𝒂𝑘{\bm{r}}_{2,k}={\bm{v}}_{k}+\gamma{\bm{S}}{\bm{a}}_{k}bold_italic_r start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT = bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_γ bold_italic_S bold_italic_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT.
   𝒑1,k=proxγ,Ω1(𝒓1,k)subscript𝒑1𝑘subscriptprox𝛾subscriptΩ1subscript𝒓1𝑘{\bm{p}}_{1,k}=\text{prox}_{\gamma,\Omega_{1}}({\bm{r}}_{1,k})bold_italic_p start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT = prox start_POSTSUBSCRIPT italic_γ , roman_Ω start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_r start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT ), where proxγ,Ω1(𝒓1,k)=max{0,𝒓1,k}subscriptprox𝛾subscriptΩ1subscript𝒓1𝑘0subscript𝒓1𝑘\text{prox}_{\gamma,\Omega_{1}}({\bm{r}}_{1,k})=\max\{0,{\bm{r}}_{1,k}\}prox start_POSTSUBSCRIPT italic_γ , roman_Ω start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_r start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT ) = roman_max { 0 , bold_italic_r start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT }.
   𝒑2,k=proxγ,Ω2(𝒓2,k)subscript𝒑2𝑘subscriptprox𝛾subscriptΩ2subscript𝒓2𝑘{\bm{p}}_{2,k}=\text{prox}_{\gamma,\Omega_{2}}({\bm{r}}_{2,k})bold_italic_p start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT = prox start_POSTSUBSCRIPT italic_γ , roman_Ω start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_r start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT ), where (proxγ,Ω2(𝒓2,k))i=(r2ir2i2+4αγ)/2subscriptsubscriptprox𝛾subscriptΩ2subscript𝒓2𝑘𝑖subscript𝑟subscript2𝑖superscriptsubscript𝑟subscript2𝑖24𝛼𝛾2\Bigl{(}\text{prox}_{\gamma,\Omega_{2}}({\bm{r}}_{2,k})\Bigr{)}_{i}=(r_{2_{i}}% -\sqrt{r_{2_{i}}^{2}+4\alpha\gamma})/2( prox start_POSTSUBSCRIPT italic_γ , roman_Ω start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_r start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT ) ) start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( italic_r start_POSTSUBSCRIPT 2 start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT - square-root start_ARG italic_r start_POSTSUBSCRIPT 2 start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 4 italic_α italic_γ end_ARG ) / 2.
   𝒒1,k=𝒑1,kγ(2β𝒑1,k+2𝒆+𝑺𝒑2,k)subscript𝒒1𝑘subscript𝒑1𝑘𝛾2𝛽subscript𝒑1𝑘2𝒆superscript𝑺topsubscript𝒑2𝑘{\bm{q}}_{1,k}={\bm{p}}_{1,k}-\gamma(2\beta{\bm{p}}_{1,k}+2{\bm{e}}+{\bm{S}}^{% \top}{\bm{p}}_{2,k})bold_italic_q start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT = bold_italic_p start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT - italic_γ ( 2 italic_β bold_italic_p start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT + 2 bold_italic_e + bold_italic_S start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_p start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT ).
   𝒒2,k=𝒑2,kγ𝑺𝒑1,ksubscript𝒒2𝑘subscript𝒑2𝑘𝛾𝑺subscript𝒑1𝑘{\bm{q}}_{2,k}={\bm{p}}_{2,k}-\gamma{\bm{S}}{\bm{p}}_{1,k}bold_italic_q start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT = bold_italic_p start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT - italic_γ bold_italic_S bold_italic_p start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT.
   𝒂k+1=𝒂k𝒓1,k+𝒒1,ksubscript𝒂𝑘1subscript𝒂𝑘subscript𝒓1𝑘subscript𝒒1𝑘{\bm{a}}_{k+1}={\bm{a}}_{k}-{\bm{r}}_{1,k}+{\bm{q}}_{1,k}bold_italic_a start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT = bold_italic_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - bold_italic_r start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT + bold_italic_q start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT.
   𝒗k+1=𝒗k𝒓2,k+𝒒2,ksubscript𝒗𝑘1subscript𝒗𝑘subscript𝒓2𝑘subscript𝒒2𝑘{\bm{v}}_{k+1}={\bm{v}}_{k}-{\bm{r}}_{2,k}+{\bm{q}}_{2,k}bold_italic_v start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT = bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - bold_italic_r start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT + bold_italic_q start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT.
end for
Return: 𝒂k+1subscript𝒂𝑘1{\bm{a}}_{k+1}bold_italic_a start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT

Parameter tuning in the convergent setting. The optimization parameters α,β𝛼𝛽\alpha,\betaitalic_α , italic_β and γ𝛾\gammaitalic_γ of Algorithm 2 are not interpretable, and so performing parameter tuning incurs a 𝒪(K3)𝒪superscript𝐾3\mathcal{O}(K^{3})caligraphic_O ( italic_K start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ), where K𝐾Kitalic_K is the number of discretization values used for each parameter. This is in contrast to Algorithm 1 which has only two parameters θ𝜃\thetaitalic_θ and δ𝛿\deltaitalic_δ. Since θ𝜃\thetaitalic_θ is independently interpretable w.r.t sparsity, we can first tune θ𝜃\thetaitalic_θ to produce outputs with desired sparsity level, then tune δ𝛿\deltaitalic_δ for appropriate edge weight scale, reducing reducing tuning costs to 𝒪(K)𝒪𝐾\mathcal{O}(K)caligraphic_O ( italic_K ).

Step size γ𝛾\gammaitalic_γ frustrates Bayesian PDS. Algorithm 2 has a nuisance step size parameter γ𝛾\gammaitalic_γ which must be properly tuned for convergent and performant iterations; problematically such γ𝛾\gammaitalic_γ values are a functions of α𝛼\alphaitalic_α and β𝛽\betaitalic_β values; see (Kalofolias, 2016). The step size γ𝛾\gammaitalic_γ introduces several issues for use as a learned parameter in a BNN. First, we empirically find that the γ𝛾\gammaitalic_γ values which produce convergent iterations of Algorithm 2 (within 5×1045superscript1045\times 10^{4}5 × 10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT iterations) are very close to γ𝛾\gammaitalic_γ values which produce divergent iterations, as shown in Figure 8 (left). This presents the practical problem of producing NaNs during Bayesian inference, as the sampler will be drawn toward such unstable values of γ𝛾\gammaitalic_γ, causing the unrolling to diverge. Second, the values of γ𝛾\gammaitalic_γ which produce divergent unrollings are themselves a function of α𝛼\alphaitalic_α, β𝛽\betaitalic_β, and the unrolling depth, frustrating simple solutions to prevent divergence, e.g., setting p(γ)𝑝𝛾p(\gamma)italic_p ( italic_γ ) to be some fixed closed interval. Third, the products γβ𝛾𝛽\gamma\betaitalic_γ italic_β and γα𝛾𝛼\gamma\alphaitalic_γ italic_α ensure the Unrolled PDS is not a neural network; this causes problems in practice as products of parameters - each of which can vary over many orders of magnitude - can cause problematic gradients. These observations help explain why we could not find any prior configuration over the three parameters {α,β,γ}𝛼𝛽𝛾\{\alpha,\beta,\gamma\}{ italic_α , italic_β , italic_γ } in PDS which produced convergent posterior sampling over multiple depths. Fixing γ𝛾\gammaitalic_γ to a scalar value ameliorates these three issues, indeed this was the only configuration which produced a performant PDS model. We found γ=0.1𝛾0.1\gamma=0.1italic_γ = 0.1 worked best. Even so, care has to be taken in setting the priors for α𝛼\alphaitalic_α and β𝛽\betaitalic_β: log α𝛼\alphaitalic_α and log β𝛽\betaitalic_β U[106,106]similar-toabsent𝑈superscript106superscript106\sim U\left[10^{-6},10^{6}\right]∼ italic_U [ 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT , 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT ] produced all divergent paths - recall this naive setting for log θ𝜃\thetaitalic_θ in DPG worked with no issue in the Ablation Studies of Section 6.3. A successful avenue we followed to produce a performant PDS model was to use α𝛼\alphaitalic_α, β𝛽\betaitalic_β similar-to\sim LogNormal(0,10)0,10)0 , 10 ) and b𝒩(0,103)similar-to𝑏𝒩0superscript103b\sim\mathcal{N}(0,10^{3})italic_b ∼ caligraphic_N ( 0 , 10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ).

106superscript10610^{-6}10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT104superscript10410^{-4}10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT102superscript10210^{-2}10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT101superscript10110^{-1}10 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT102superscript10210^{-2}10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT100superscript10010^{0}10 start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT102superscript10210^{2}10 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT104superscript10410^{4}10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPTγ𝛾\gammaitalic_γ𝒂PDS(γ)𝒂DPG2𝒂DPG2subscriptnormsubscriptsuperscript𝒂PDS𝛾subscriptsuperscript𝒂DPG2subscriptnormsubscriptsuperscript𝒂DPG2\frac{\|{\bm{a}}^{*}_{\text{PDS}}(\gamma)-{\bm{a}}^{*}_{\text{DPG}}\|_{2}}{\|{% \bm{a}}^{*}_{\text{DPG}}\|_{2}}divide start_ARG ∥ bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT PDS end_POSTSUBSCRIPT ( italic_γ ) - bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT DPG end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG ∥ bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT DPG end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG
626 s304 sPDSDPG
Figure 8: Left: PDS Step Size γ𝛾\gammaitalic_γ. Running both DPG and PDS iterations with α=β=1𝛼𝛽1\alpha=\beta=1italic_α = italic_β = 1 and θ=1/αβ=1𝜃1𝛼𝛽1\theta=1/\sqrt{\alpha\beta}=1italic_θ = 1 / square-root start_ARG italic_α italic_β end_ARG = 1, δ=α/β=1𝛿𝛼𝛽1\delta=\sqrt{\alpha/\beta}=1italic_δ = square-root start_ARG italic_α / italic_β end_ARG = 1 for 5×1045superscript1045\times 10^{4}5 × 10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT iterations, we plot the mean and standard deviation of the normalized 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT distance between the output of PDS vs DPG. The value of γ𝛾\gammaitalic_γ resulting in the most faithful PDS solution is close to values which yield divergent iterations, problematic for its use as our NN function. Right: PDS vs DPG inference time using D=200𝐷200D=200italic_D = 200 on T=50𝑇50T=50italic_T = 50 RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT graphs running NUTS with 4444 chains (in parallel), each taking M=1500𝑀1500M=1500italic_M = 1500 samples.

Comparing Bayesian inference time of PDS vs DPG. Above we show that finding a range of reasonable values of the optimisation parameters can be cubic in PDS while linear in DPG, owing to independent interpretability. Further, PDS takes more than twice as long in performing inference as the DPG (and PDS with stochastic γ𝛾\gammaitalic_γ took an order of magnitude longer). This may be due to the increased complexity of PDS iterates and the fact that both α𝛼\alphaitalic_α and β𝛽\betaitalic_β occur inside PDS iterations, compared to only θ𝜃\thetaitalic_θ for DPG iterates, increasing the complexity of the computational graph required to compute parameter gradients. See Figure 8 (right) showing the difference in inference runtime for PDS and DPG, both having depth D=200𝐷200D=200italic_D = 200 using T=50𝑇50T=50italic_T = 50 RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT graphs.

A.2 Experimental Details

A.2.1 Scoring rules and evaluation metrics

Fitting the model to training data 𝒯𝒯\mathcal{T}caligraphic_T produces samples {𝚯(m)}m=1Msuperscriptsubscriptsuperscript𝚯𝑚𝑚1𝑀\{\bm{\Theta}^{(m)}\}_{m=1}^{M}{ bold_Θ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT from the posterior p(𝚯𝒯)𝑝conditional𝚯𝒯p(\bm{\Theta}\mid\mathcal{T})italic_p ( bold_Θ ∣ caligraphic_T ). Given test sample {𝒆~,𝒂~}~𝒆~𝒂\{\tilde{{\bm{e}}},\tilde{{\bm{a}}}\}{ over~ start_ARG bold_italic_e end_ARG , over~ start_ARG bold_italic_a end_ARG } we can evaluate the quality of fit using proper scoring rules, e.g., Log Predictive Density and Brier Score, and discrete metrics, e.g., the Error.

Log predictive density, i.e., Log Likelihood. The Log Likelihood measures the quality of the probabilistic predictions of the model. It is defined as the predicted probability of the true outcome under the model.

log p(𝒂~𝒆~,𝒯)log 𝑝conditional~𝒂~𝒆𝒯\displaystyle\text{log }p(\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},\mathcal{T})log italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ) =log p(𝒂~𝒆~,𝚯)p(𝚯𝒯)𝑑𝚯absentlog 𝑝conditional~𝒂~𝒆𝚯𝑝conditional𝚯𝒯differential-d𝚯\displaystyle=\text{log }\int p(\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},\bm{% \Theta})p(\bm{\Theta}\mid\mathcal{T})d\bm{\Theta}= log ∫ italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , bold_Θ ) italic_p ( bold_Θ ∣ caligraphic_T ) italic_d bold_Θ
log (1Mm=1Mp(𝒂~𝒆~,𝚯(m)))absentlog 1𝑀superscriptsubscript𝑚1𝑀𝑝conditional~𝒂~𝒆superscript𝚯𝑚\displaystyle\approx\text{log }\big{(}\frac{1}{M}\sum_{m=1}^{M}p(\tilde{{\bm{a% }}}\mid\tilde{{\bm{e}}},\bm{\Theta}^{(m)})\big{)}≈ log ( divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , bold_Θ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) )
=log m=1Mp(𝒂~𝒆~,𝚯(m))log Sabsentlog superscriptsubscript𝑚1𝑀𝑝conditional~𝒂~𝒆superscript𝚯𝑚log 𝑆\displaystyle=\text{log }\sum_{m=1}^{M}p(\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},% \bm{\Theta}^{(m)})-\text{log }S= log ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , bold_Θ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) - log italic_S
=log-sum-exp{log p(𝒂~𝒆~,𝚯(1)),,log p(𝒂~𝒆~,𝚯(M))}log Sabsentlog-sum-explog 𝑝conditional~𝒂~𝒆superscript𝚯1log 𝑝conditional~𝒂~𝒆superscript𝚯𝑀log 𝑆\displaystyle=\text{log-sum-exp}\Bigl{\{}\text{log }p(\tilde{{\bm{a}}}\mid% \tilde{{\bm{e}}},\bm{\Theta}^{(1)}),\dots,\text{log }p(\tilde{{\bm{a}}}\mid% \tilde{{\bm{e}}},\bm{\Theta}^{(M)})\Bigr{\}}-\text{log }S= log-sum-exp { log italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , bold_Θ start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) , … , log italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , bold_Θ start_POSTSUPERSCRIPT ( italic_M ) end_POSTSUPERSCRIPT ) } - log italic_S (9)

The NLL is simply 1log p(𝒂~𝒆~,𝒯)1log 𝑝conditional~𝒂~𝒆𝒯-1\cdot\text{log }p(\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},\mathcal{T})- 1 ⋅ log italic_p ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ). A lower NLL indicates better predictive performance, as it suggests that the model assigns higher probabilities to the observed outcomes.

Brier Score. The Brier Score is used to assess the accuracy of probabilistic predictions. It measures the mean squared difference between the predicted probability and the actual outcome. A lower Brier Score indicates better calibration and accuracy of the probabilistic predictions. Recall the edge-wise probabilities output by the model for parameters 𝚯(m)superscript𝚯𝑚\bm{\Theta}^{(m)}bold_Θ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT are denoted 𝒑^(m)=σ(δ(m)Γθ(m)D(𝒆)b(m))superscript^𝒑𝑚𝜎superscript𝛿𝑚superscriptsubscriptΓsuperscript𝜃𝑚𝐷𝒆superscript𝑏𝑚\hat{{\bm{p}}}^{(m)}=\sigma(\delta^{(m)}\Gamma_{\theta^{(m)}}^{D}({\bm{e}})-b^% {(m)})over^ start_ARG bold_italic_p end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT = italic_σ ( italic_δ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT roman_Γ start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT ( bold_italic_e ) - italic_b start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ). Then the Brier Score is

BS(𝒂~𝒆~,𝒯)BSconditional~𝒂~𝒆𝒯\displaystyle\text{BS}(\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},\mathcal{T})BS ( over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ) =1||𝒑^𝒂~22p(𝚯|𝒯)𝑑𝚯absent1superscriptsubscriptnorm^𝒑~𝒂22𝑝conditional𝚯𝒯differential-d𝚯\displaystyle=\frac{1}{|\mathcal{E}|}\int\|\hat{{\bm{p}}}-\tilde{{\bm{a}}}\|_{% 2}^{2}\cdot p(\bm{\Theta}|\mathcal{T})d\bm{\Theta}= divide start_ARG 1 end_ARG start_ARG | caligraphic_E | end_ARG ∫ ∥ over^ start_ARG bold_italic_p end_ARG - over~ start_ARG bold_italic_a end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ italic_p ( bold_Θ | caligraphic_T ) italic_d bold_Θ
1M||m=1M𝒑^(m)𝒂~22absent1𝑀superscriptsubscript𝑚1𝑀superscriptsubscriptnormsuperscript^𝒑𝑚~𝒂22\displaystyle\approx\frac{1}{M|\mathcal{E}|}\sum_{m=1}^{M}\|\hat{{\bm{p}}}^{(m% )}-\tilde{{\bm{a}}}\|_{2}^{2}≈ divide start_ARG 1 end_ARG start_ARG italic_M | caligraphic_E | end_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ∥ over^ start_ARG bold_italic_p end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT - over~ start_ARG bold_italic_a end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (10)

Complementary metrics. The NLL is particularly useful for evaluating the overall fit of a probabilistic model, emphasizing the accuracy of the assigned probabilities, especially for less likely events. In contrast, the Brier Score provides a more intuitive measure of predictive accuracy and calibration, making it easier to interpret the model’s probabilistic predictions in practical scenarios.
Numerical Concerns. Evaluating log p(𝒂|𝒆,𝚯(m))log 𝑝conditional𝒂𝒆superscript𝚯𝑚\text{log }p({\bm{a}}|{\bm{e}},\bm{\Theta}^{(m)})log italic_p ( bold_italic_a | bold_italic_e , bold_Θ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) can cause numerical issues which stem from underflow in computing log a^(m)superscript^𝑎𝑚\hat{a}^{(m)}over^ start_ARG italic_a end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT or log (1a^(m))1superscript^𝑎𝑚(1-\hat{a}^{(m)})( 1 - over^ start_ARG italic_a end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ). for some edges. This becomes an issue in the covariate shift experiments in Section 6, where the data becomes very noisy and the model can confidently predict the wrong label. When this underflow occurs, the log evaluates to -\infty- ∞. To solve this we use the softplus parametrization of the log likelihood: softplus(y¯(δ(m)Γθ(m)D(𝒆)b(m)))softplus¯𝑦superscript𝛿𝑚superscriptsubscriptΓsuperscript𝜃𝑚𝐷𝒆superscript𝑏𝑚-\text{softplus}(-\bar{y}\cdot(\delta^{(m)}\Gamma_{\theta^{(m)}}^{D}({\bm{e}})% -b^{(m)}))- softplus ( - over¯ start_ARG italic_y end_ARG ⋅ ( italic_δ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT roman_Γ start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT ( bold_italic_e ) - italic_b start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) ), where y¯=2y1¯𝑦2𝑦1\bar{y}=2y-1over¯ start_ARG italic_y end_ARG = 2 italic_y - 1, see Equation (10.13) in (Murphy, 2023).

Error. We take the error to be the percentage of incorrectly predicted edges between our thresholded pred. mean 𝔼𝚯𝒯[𝒂~𝒆~,𝒯]>0.5subscript𝔼conditional𝚯𝒯delimited-[]conditional~𝒂~𝒆𝒯0.5\mathbb{E}_{\bm{\Theta}\mid\mathcal{T}}[\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},% \mathcal{T}]>0.5blackboard_E start_POSTSUBSCRIPT bold_Θ ∣ caligraphic_T end_POSTSUBSCRIPT [ over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ] > 0.5 and the true label 𝒂~~𝒂\tilde{{\bm{a}}}over~ start_ARG bold_italic_a end_ARG.

Calibration for GSL. We follow the calibration procedure laid out in (Guo et al., 2017). To fit into this framework, we take our prediction to be the pred. mean thresholded at 0.50.50.50.5, i.e., 𝔼𝚯𝒯[𝒂~𝒆~,𝒯]>0.5subscript𝔼conditional𝚯𝒯delimited-[]conditional~𝒂~𝒆𝒯0.5\mathbb{E}_{\bm{\Theta}\mid\mathcal{T}}[\tilde{{\bm{a}}}\mid\tilde{{\bm{e}}},% \mathcal{T}]>0.5blackboard_E start_POSTSUBSCRIPT bold_Θ ∣ caligraphic_T end_POSTSUBSCRIPT [ over~ start_ARG bold_italic_a end_ARG ∣ over~ start_ARG bold_italic_e end_ARG , caligraphic_T ] > 0.5. Thus the confidence, i.e., the probabilities associated with the predicted label, will thus always be in [0.5,1]absent0.51\in\left[0.5,1\right]∈ [ 0.5 , 1 ]. We will thus construct M𝑀Mitalic_M uniform bins Im=(.5+.5(m1)M,.5+.5mM]subscript𝐼𝑚.5.5𝑚1𝑀.5.5𝑚𝑀I_{m}=(.5+\frac{.5*(m-1)}{M},.5+\frac{.5*m}{M}]italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = ( .5 + divide start_ARG .5 ∗ ( italic_m - 1 ) end_ARG start_ARG italic_M end_ARG , .5 + divide start_ARG .5 ∗ italic_m end_ARG start_ARG italic_M end_ARG ], m={1,,M}𝑚1𝑀m=\{1,\dots,M\}italic_m = { 1 , … , italic_M }. If an edge confidence is in Imsubscript𝐼𝑚I_{m}italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, we assign it to bin Bmsubscript𝐵𝑚B_{m}italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. We then evaluate the accuracy acc(Bm):=1|Bm|iBm𝟏(a^i=ai)assignsubscript𝐵𝑚1subscript𝐵𝑚subscript𝑖subscript𝐵𝑚1subscript^𝑎𝑖subscript𝑎𝑖(B_{m}):=\frac{1}{|B_{m}|}\sum_{i\in B_{m}}\mathbf{1}(\hat{a}_{i}=a_{i})( italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) := divide start_ARG 1 end_ARG start_ARG | italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_1 ( over^ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) and average confidence conf(Bm):=1|Bm|iBma^iassignsubscript𝐵𝑚1subscript𝐵𝑚subscript𝑖subscript𝐵𝑚subscript^𝑎𝑖(B_{m}):=\frac{1}{|B_{m}|}\sum_{i\in B_{m}}\hat{a}_{i}( italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) := divide start_ARG 1 end_ARG start_ARG | italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT over^ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for each bin. The Expected Calibration Error (ECE) is defined as the weighted average the bins’ accuracy/confidence difference:

ECE=m=1M|Bm|B|acc(Bm)conf(Bm)|,ECEsuperscriptsubscript𝑚1𝑀subscript𝐵𝑚𝐵accsubscript𝐵𝑚confsubscript𝐵𝑚\text{ECE}=\sum_{m=1}^{M}\frac{|B_{m}|}{B}|\text{acc}(B_{m})-\text{conf}(B_{m}% )|,ECE = ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT divide start_ARG | italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | end_ARG start_ARG italic_B end_ARG | acc ( italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) - conf ( italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) | ,

where B:=m=1M|Bm|assign𝐵superscriptsubscript𝑚1𝑀subscript𝐵𝑚B:=\sum_{m=1}^{M}|B_{m}|italic_B := ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT | italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | are the total number of edges predicted. Thus the ECE provides a measure of calibration over all edges in the evaluation set.

A.2.2 Inference Details

Details on HMC.

Refer to caption
Figure 9: Trace plots and bivariate kernel density estimates of DPGs parameters on RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT graphs.

To ensure convergence has been achieved in our posterior sampling we simulate multiple chains with different starting points in the parameter space, discard the first 500500500500 iterations of each simulation, and monitor relevant diagnostic criteria - namely the Potential Scale Reduction Factor (PSRF) r^^𝑟\hat{r}over^ start_ARG italic_r end_ARG and Effective Sample Size (ESS). We do not perform thinning on the resulting samples. Unless otherwise stated all inference was achieved without issue, namely with that r^1^𝑟1\hat{r}\approx 1over^ start_ARG italic_r end_ARG ≈ 1 and n^eff1000subscript^𝑛𝑒𝑓𝑓1000\hat{n}_{eff}\geq 1000over^ start_ARG italic_n end_ARG start_POSTSUBSCRIPT italic_e italic_f italic_f end_POSTSUBSCRIPT ≥ 1000 for each parameter. For a visualization of successful HMC inference with DPG, see Figure 9 which displays the trace plots and bivariate kernel density estimates of DPG with D=30𝐷30D=30italic_D = 30 on RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT graphs using analytic distance matrices.

MAP vs MLE. Due to nonconvexity in the loss, initialization is important in finding MAP and MLE of parameters. We did not alter the default initialization provided in NumPyro with MAP, as it worked without issue. We found MLE to be more dependant on initialization; we initialized the parameters of MLE models to the median value of the priors set in Section 4, i.e. θ,δ,b0.32,100,1formulae-sequence𝜃𝛿𝑏0.321001\theta,\delta,b\approx 0.32,100,1italic_θ , italic_δ , italic_b ≈ 0.32 , 100 , 1.

Log-normal parameterization. Let Z𝑍Zitalic_Z be a standard normal variable and let μ𝜇\muitalic_μ and σ>0𝜎0\sigma>0italic_σ > 0 be two real numbers, then the distribution of the random variable X=eμ+σZ𝑋superscript𝑒𝜇𝜎𝑍X=e^{\mu+\sigma Z}italic_X = italic_e start_POSTSUPERSCRIPT italic_μ + italic_σ italic_Z end_POSTSUPERSCRIPT is called a log-normal distribution with parameters μ𝜇\muitalic_μ and σ𝜎\sigmaitalic_σ, i.e. XLognormal(μ,σ2)similar-to𝑋Lognormal𝜇superscript𝜎2X\sim\text{Lognormal}(\mu,\sigma^{2})italic_X ∼ Lognormal ( italic_μ , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). Parameters μ𝜇\muitalic_μ and σ𝜎\sigmaitalic_σ represent the mean and standard deviation of logX𝑋\log Xroman_log italic_X, not X𝑋Xitalic_X itself. While the normality of log X𝑋Xitalic_X is true regardless of base, we will be performing modeling using powers of 10101010. Thus in the main body when we specify a distribution Lognormal(μ,σ2)Lognormal𝜇superscript𝜎2\text{Lognormal}(\mu,\sigma^{2})Lognormal ( italic_μ , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ), it implies the location parameter used is ln 10μln superscript10𝜇\text{ln }10^{\mu}ln 10 start_POSTSUPERSCRIPT italic_μ end_POSTSUPERSCRIPT, while the scale parameter is unchanged. For example, Lognormal(2,4)Lognormal24\text{Lognormal}(2,4)Lognormal ( 2 , 4 ) would imply location μ=ln 102𝜇ln superscript102\mu=\text{ln }10^{2}italic_μ = ln 10 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and scale σ=2𝜎2\sigma=2italic_σ = 2, and thus would be invoked as a call to the probabilistic programming language distribution, here NumPyro, as numpyro.distributions.LogNormal(loc =ln 102absentln superscript102=\text{ln }10^{2}= ln 10 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, scale =2absent2=2= 2).

A.3 Scaling to Large Graphs

Refer to caption
Figure 10: Transferring to Larger Graphs. Using T=20𝑇20T=20italic_T = 20 ER1414{}_{\frac{1}{4}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_FLOATSUBSCRIPT graphs of varying sizes, we run a grid search to find optimal θ𝜃\thetaitalic_θ for the structure recovery task using Algorithm 1 for each graph size, and show the F1111 score for each θ𝜃\thetaitalic_θ. We also plot the line corresponding to Equation 13 which extrapolates performant θ𝜃\thetaitalic_θ values from the Ni=20subscript𝑁𝑖20N_{i}=20italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 20 setting. This line coincides with empirically observed highly performant θ𝜃\thetaitalic_θ values in problem setting for both smaller and significantly larger graphs.

Scaling via transfer learning in the convergent setting. In (3), terms 2𝒂𝒆,β2𝒂222superscript𝒂top𝒆𝛽2subscriptsuperscriptnorm𝒂222{\bm{a}}^{\top}{\bm{e}},\frac{\beta}{2}\|{\bm{a}}\|^{2}_{2}2 bold_italic_a start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_e , divide start_ARG italic_β end_ARG start_ARG 2 end_ARG ∥ bold_italic_a ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT have N(N1)/2𝑁𝑁12N(N-1)/2italic_N ( italic_N - 1 ) / 2 components while term α𝟏log(𝑺𝒂)𝛼superscript1toplog𝑺𝒂-\alpha\mathbf{1}^{\top}\text{log}({\bm{S}}{\bm{a}})- italic_α bold_1 start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT log ( bold_italic_S bold_italic_a ) has N𝑁Nitalic_N components, thus the former terms will tend to dominate the latter as N𝑁Nitalic_N grows (using fixed α𝛼\alphaitalic_α, β𝛽\betaitalic_β). We thus create a new problem where each term is normalized by it’s number of components; we use the notation α¯,β¯¯𝛼¯𝛽\bar{\alpha},\bar{\beta}over¯ start_ARG italic_α end_ARG , over¯ start_ARG italic_β end_ARG to indicate the scale invariant parameter values

argmin𝒂subscriptargmin𝒂\displaystyle\operatorname*{arg\,min}_{{\bm{a}}}start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT bold_italic_a end_POSTSUBSCRIPT {|N|12𝒂𝒆N1α¯𝟏log(𝑺𝒂)+|N|1β¯2𝒂22+𝕀{𝒂0}}superscriptsubscript𝑁12superscript𝒂top𝒆superscript𝑁1¯𝛼superscript1toplog𝑺𝒂superscriptsubscript𝑁1¯𝛽2subscriptsuperscriptnorm𝒂22𝕀𝒂0\displaystyle\hskip 3.0pt\left\{|\mathcal{E}_{N}|^{-1}2{\bm{a}}^{\top}{\bm{e}}% -N^{-1}\bar{\alpha}\mathbf{1}^{\top}\text{log}({\bm{S}}{\bm{a}})+|\mathcal{E}_% {N}|^{-1}\frac{\bar{\beta}}{2}\|{\bm{a}}\|^{2}_{2}+\mathbb{I}\{{\bm{a}}\geq 0% \}\right\}{ | caligraphic_E start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT 2 bold_italic_a start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_e - italic_N start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over¯ start_ARG italic_α end_ARG bold_1 start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT log ( bold_italic_S bold_italic_a ) + | caligraphic_E start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT divide start_ARG over¯ start_ARG italic_β end_ARG end_ARG start_ARG 2 end_ARG ∥ bold_italic_a ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + blackboard_I { bold_italic_a ≥ 0 } } (11)
=𝒂(|N|1𝒆,N1α¯,|N|1β¯)1-1 correspondence between objective terms and params/inputsabsentsuperscript𝒂superscriptsubscript𝑁1𝒆superscript𝑁1¯𝛼superscriptsubscript𝑁1¯𝛽1-1 correspondence between objective terms and params/inputs\displaystyle={\bm{a}}^{*}(|\mathcal{E}_{N}|^{-1}{\bm{e}},N^{-1}\bar{\alpha},|% \mathcal{E}_{N}|^{-1}\bar{\beta})\hskip 5.0pt\text{, $1$-$1$ correspondence % between objective terms and params/inputs}= bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( | caligraphic_E start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_italic_e , italic_N start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over¯ start_ARG italic_α end_ARG , | caligraphic_E start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over¯ start_ARG italic_β end_ARG ) , 1 - 1 correspondence between objective terms and params/inputs
=𝒂(𝒆,|N|N1α¯,β¯), argmin invariant to positive scaling, multiply by |N|absentsuperscript𝒂𝒆subscript𝑁superscript𝑁1¯𝛼¯𝛽, argmin invariant to positive scaling, multiply by |N|\displaystyle={\bm{a}}^{*}({\bm{e}},|\mathcal{E}_{N}|N^{-1}\bar{\alpha},\bar{% \beta})\hskip 20.0pt\text{, argmin invariant to positive scaling, multiply by % $|\mathcal{E}_{N}|$}= bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_italic_e , | caligraphic_E start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | italic_N start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over¯ start_ARG italic_α end_ARG , over¯ start_ARG italic_β end_ARG ) , argmin invariant to positive scaling, multiply by | caligraphic_E start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT |
=𝒂(𝒆,.5(N1)α¯,β¯).absentsuperscript𝒂𝒆.5𝑁1¯𝛼¯𝛽\displaystyle={\bm{a}}^{*}({\bm{e}},.5(N-1)\bar{\alpha},\bar{\beta}).= bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_italic_e , .5 ( italic_N - 1 ) over¯ start_ARG italic_α end_ARG , over¯ start_ARG italic_β end_ARG ) . (12)

Intuitively, (12) tells us to scale α𝛼\alphaitalic_α by 𝒪(N)𝒪𝑁\mathcal{O}(N)caligraphic_O ( italic_N ). This makes sense, as α𝛼\alphaitalic_α itself scales the objective term which grows at a factor of 𝒪(N)𝒪𝑁\mathcal{O}(N)caligraphic_O ( italic_N ) slower than the other (finite) terms. This should help correct the differing component sizes as the problem dimension N𝑁Nitalic_N grows. Eq. (12) formulates the solution to the size normalized problem with respect to the solutions of original problem (3). This shows that we can use the original solution procedures, now with scaled α𝛼\alphaitalic_α, to produce solutions. To do so, suppose we find optimal parameter values αisubscript𝛼𝑖\alpha_{i}italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, βisubscript𝛽𝑖\beta_{i}italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to the original problem (3) with problem size Nisubscript𝑁𝑖N_{i}italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. To find the proper parameter values for a different problem size N𝑁Nitalic_N, we can solve for the scale invariant parameter values, then re-scale to the appropriate size as

α(N)𝛼𝑁\displaystyle\alpha(N)italic_α ( italic_N ) =αi(.5(Ni1))1α¯.5(N1)absentsuperscriptsubscript𝛼𝑖superscript.5subscript𝑁𝑖11¯𝛼.5𝑁1\displaystyle=\overbrace{\alpha_{i}\big{(}.5(N_{i}-1)\big{)}^{-1}}^{\bar{% \alpha}}.5(N-1)= over⏞ start_ARG italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( .5 ( italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 ) ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT end_ARG start_POSTSUPERSCRIPT over¯ start_ARG italic_α end_ARG end_POSTSUPERSCRIPT .5 ( italic_N - 1 )
β(N)𝛽𝑁\displaystyle\beta(N)italic_β ( italic_N ) =βiβ¯.absentsubscriptsubscript𝛽𝑖¯𝛽\displaystyle=\underbrace{\beta_{i}}_{\bar{\beta}}.= under⏟ start_ARG italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_POSTSUBSCRIPT over¯ start_ARG italic_β end_ARG end_POSTSUBSCRIPT .

Note as a sanity check, when N=Ni𝑁subscript𝑁𝑖N=N_{i}italic_N = italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT we have α(N)=αi𝛼𝑁subscript𝛼𝑖\alpha(N)=\alpha_{i}italic_α ( italic_N ) = italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and β(N)=βi𝛽𝑁subscript𝛽𝑖\beta(N)=\beta_{i}italic_β ( italic_N ) = italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. It is trivial to recover similar functions for θ(N)𝜃𝑁\theta(N)italic_θ ( italic_N ) and δ(N)𝛿𝑁\delta(N)italic_δ ( italic_N ) by substituting into the definitions of θ=1αβ𝜃1𝛼𝛽\theta=\frac{1}{\sqrt{\alpha\beta}}italic_θ = divide start_ARG 1 end_ARG start_ARG square-root start_ARG italic_α italic_β end_ARG end_ARG and δ=αβ𝛿𝛼𝛽\delta=\sqrt{\frac{\alpha}{\beta}}italic_δ = square-root start_ARG divide start_ARG italic_α end_ARG start_ARG italic_β end_ARG end_ARG:

θ(N)𝜃𝑁\displaystyle\theta(N)italic_θ ( italic_N ) =1α(N)β(N)absent1𝛼𝑁𝛽𝑁\displaystyle=\frac{1}{\sqrt{\alpha(N)\beta(N)}}= divide start_ARG 1 end_ARG start_ARG square-root start_ARG italic_α ( italic_N ) italic_β ( italic_N ) end_ARG end_ARG =θi(.5(Ni1))12θ¯absentsuperscriptsubscript𝜃𝑖superscript.5subscript𝑁𝑖112¯𝜃\displaystyle=\overbrace{\theta_{i}\big{(}.5(N_{i}-1)\big{)}^{\frac{1}{2}}}^{% \bar{\theta}}= over⏞ start_ARG italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( .5 ( italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 ) ) start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT end_ARG start_POSTSUPERSCRIPT over¯ start_ARG italic_θ end_ARG end_POSTSUPERSCRIPT (.5(N1))12superscript.5𝑁112\displaystyle\big{(}.5(N-1)\big{)}^{-\frac{1}{2}}( .5 ( italic_N - 1 ) ) start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT =θiNi1N1absentsubscript𝜃𝑖subscript𝑁𝑖1𝑁1\displaystyle=\theta_{i}\sqrt{\frac{N_{i}-1}{N-1}}= italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT square-root start_ARG divide start_ARG italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 end_ARG start_ARG italic_N - 1 end_ARG end_ARG (13)
δ(N)𝛿𝑁\displaystyle\delta(N)italic_δ ( italic_N ) =α(N)β(N)1absent𝛼𝑁𝛽superscript𝑁1\displaystyle=\sqrt{\alpha(N)\beta(N)^{-1}}= square-root start_ARG italic_α ( italic_N ) italic_β ( italic_N ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT end_ARG =δi(.5(Ni1))12δ¯absentsubscriptsubscript𝛿𝑖superscript.5subscript𝑁𝑖112¯𝛿\displaystyle=\underbrace{\delta_{i}\big{(}.5(N_{i}-1)\big{)}^{-\frac{1}{2}}}_% {\bar{\delta}}= under⏟ start_ARG italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( .5 ( italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 ) ) start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT over¯ start_ARG italic_δ end_ARG end_POSTSUBSCRIPT (.5(N1))12superscript.5𝑁112\displaystyle\big{(}.5(N-1)\big{)}^{\frac{1}{2}}( .5 ( italic_N - 1 ) ) start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT =δiN1Ni1.absentsubscript𝛿𝑖𝑁1subscript𝑁𝑖1\displaystyle=\delta_{i}\sqrt{\frac{N-1}{N_{i}-1}}.= italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT square-root start_ARG divide start_ARG italic_N - 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 end_ARG end_ARG . (14)

See Figure 10 which empirically validates this derivation on ER1414{}_{\frac{1}{4}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_FLOATSUBSCRIPT graphs across an order of magnitude difference in problem size N𝑁Nitalic_N.

Scaling in the unrolled setting. See Figure 12 (left) for the empirical growth trends of MAP estimates of the Unrolled DPG model (D=200𝐷200D=200italic_D = 200) parameters on ER1414{}_{\frac{1}{4}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_FLOATSUBSCRIPT graphs. Parameter θ𝜃\thetaitalic_θ seems to follow the trend predicted by our analysis, but δ𝛿\deltaitalic_δ does not. Indeed it seems to follow a strongly linear trend, and b𝑏bitalic_b seems to follow a power law. We can extrapolate from these trends to parameterize Unrolled DPG networks on significantly larger graphs with excellent performance; see Figure 12 (left).

Refer to caption
Figure 11: Learning MAP estimates 𝚯^^𝚯\hat{\bm{\Theta}}over^ start_ARG bold_Θ end_ARG of the DPG parameters on ER1414{}_{\frac{1}{4}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_FLOATSUBSCRIPT graphs of increasing size N𝑁Nitalic_N shows that θ^(N)^𝜃𝑁\hat{\theta}(N)over^ start_ARG italic_θ end_ARG ( italic_N ) resemble logarithmic growth, δ^(N)^𝛿𝑁\hat{\delta}(N)over^ start_ARG italic_δ end_ARG ( italic_N ) are linear, while b^(N)^𝑏𝑁\hat{b}(N)over^ start_ARG italic_b end_ARG ( italic_N ) show a sort of power law decay.
Refer to caption
Refer to caption
Figure 12: Left: Computing MAP estimates of DPG at increasing graph sizes shows improving recovery performance (solid lines). Extrapolating from the trends using the approach outlined in Appendix A.3, we can perform transfer learning larger graph sizes (dotted lines) with gradual decay in performance. Right: The analytic distance matrix E get smoother for increasingly sized ER1414{}_{\frac{1}{4}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_FLOATSUBSCRIPT graphs, providing an explanation for why graph recovery improves with larger graph sizes.

Larger ER graphs are smoother. We also notice an interesting trend when constructing smooth graph as in Section 3. We find that the analytical distance matrices produced by ER1414{}_{\frac{1}{4}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_FLOATSUBSCRIPT graphs get smoother as their size increases. For instance, see Figure 12 (right) where the mean and maximum entries of the analytic distance matrix get smaller as N𝑁Nitalic_N grows larger. This explains why the graph recovery problem gets easier and thus our performance gets better, as shown in Figure 12 (left).

A.4 Further Experiments

Prior modeling. The prior modeling techniques introduced in Section 5 work across multiple random graph ensembles. Here, we replicate the prior modeling experiments done on RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT in Section 5 for ER1414{}_{\frac{1}{4}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_FLOATSUBSCRIPT, and BA1. See Figure 13 and Figure 14 and compare to Figure 3 (left).

Refer to caption
Refer to caption
Refer to caption
Figure 13: Prior modeling of θ𝜃\thetaitalic_θ for synthetic ensembles RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT, ER1414{}_{\frac{1}{4}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_FLOATSUBSCRIPT, and BA1 from left to right. The edge density and number of connected components do not require the presence of labels, while the error does. A threshold of 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT is used to decide the existence of an edge in order to remove the influence of small numerical effects.
Refer to caption
Refer to caption
Refer to caption
Figure 14: Prior modeling of b𝑏bitalic_b and δ𝛿\deltaitalic_δ for synthetic ensembles RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT, ER1414{}_{\frac{1}{4}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_FLOATSUBSCRIPT, and BA1 from left to right. We inspect the edge weight distributions recovered running the held out data through Algorithm 1 for discretized values of θ𝜃\thetaitalic_θ. The recovered edge weights are non-negative, and so the relevant scaling of δ𝛿\deltaitalic_δ and b𝑏bitalic_b is a function of the median and maximum values.

I.i.d. generalization across multiple random graph distributions. Our method is performant across the random graph distributions we’ve tested. To supplement the results shown in Section 6, we show i.i.d. generalization results across more random graph distributions, as shown in Table 2.

Table 2: I.I.D. Generalization of DPG across Random Graph Distributions
Graph Distribution NLL BS (×102absentsuperscript102\times\mathrm{1}\mathrm{0}^{-2}× 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT) Error (%) ECE (×103absentsuperscript103\times\mathrm{1}\mathrm{0}^{-3}× 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT)
RG1212{}_{\frac{1}{2}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_FLOATSUBSCRIPT 8.9938.9938.9938.993 p m 9.516 0.9880.9880.9880.988 p m 0.945 1.1111.1111.1111.111 p m 1.267 4.374.374.374.37
BA1 14.72414.72414.72414.724 p m 3.704 2.6142.6142.6142.614 p m 0.751 2.9162.9162.9162.916 p m 1.003 30.6330.6330.6330.63
ER.15 7.6777.6777.6777.677 p m 5.021 0.9970.9970.9970.997 p m 0.569 1.2371.2371.2371.237 p m 0.879 3.043.043.043.04
ER1212{}_{\frac{1}{2}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_FLOATSUBSCRIPT 0.9620.9620.9620.962 p m 3.029 0.0900.0900.0900.090 p m 0.236 0.0840.0840.0840.084 p m 0.265 0.940.940.940.94

Data efficiency. To further investigate how robust DPG is to sparse data settings, in Figure 15 we display i.i.d. generalization performance as a function of training set size T𝑇Titalic_T on RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT graphs and their corresponding analytic distance matrices. We additionally visualize the parameter posterior marginals. We find that past T50𝑇50T\approx 50italic_T ≈ 50, i.i.d. generalization did not significantly improve, while past T102𝑇superscript102T\approx 10^{2}italic_T ≈ 10 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, the parameter posterior marginals converge.

Refer to caption
Figure 15: Impact of training set size on i.i.d. generalization performance. We use DPG with D=200𝐷200D=200italic_D = 200 on RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT graphs with analytic distance matrices to estimate generalization. The leftmost plot shows the generalization performance converge after 50absent50\approx 50≈ 50 samples on these same test graphs (error bars indicate .05.05*.05 ∗stdv for compact visualization). The rightmost plots depict posterior marginals for the 3333 parameters - θ𝜃\thetaitalic_θ, δ𝛿\deltaitalic_δ, and b𝑏bitalic_b - for increasing sizes of training data. The observed reduction in variance of the posterior is a reflection of lower epistemic uncertainty. The posterior shows little change after 102absentsuperscript102\approx 10^{2}≈ 10 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT samples.

Prior ablation study. In sparse data regimes DPG is more performant and has more efficient inference than DPG with un-informative priors (DPG-U). Indeed when the lack of data weakens the likelihood relative to the prior, a useful prior should maintain efficient Bayesian inference (Gelman et al., 2017). Table 3 illustrates this using RG1313{}_{\frac{1}{3}}start_FLOATSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 3 end_ARG end_FLOATSUBSCRIPT graphs and their corresponding analytic distance matrices on data sets of varying sizes T𝑇Titalic_T; see Section 3 for further discussion.

Table 3: Ablation Study: Removing Informative Priors with Limited Training Data.
Model Time (s) NLL BS (×102absentsuperscript102\times\mathrm{1}\mathrm{0}^{-2}× 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT) Error (%) ECE (×103absentsuperscript103\times\mathrm{1}\mathrm{0}^{-3}× 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT) ESSθ
T=50𝑇50T=50italic_T = 50 DPG-U 343.97343.97343.97343.97 11.18511.18511.18511.185 1.3941.3941.3941.394 1.7211.7211.7211.721 3.413.413.413.41 3.93.93.93.9
DPG 44.1544.1544.1544.15 11.18311.18311.18311.183 1.3921.3921.3921.392 1.7211.7211.7211.721 3.323.323.323.32 19.719.719.719.7
T=5𝑇5T=5italic_T = 5 DPG-U 15.1515.1515.1515.15 11.47611.47611.47611.476 1.5081.5081.5081.508 1.8471.8471.8471.847 5.625.625.625.62 15.515.515.515.5
DPG 10.6310.6310.6310.63 11.39811.39811.39811.398 1.5061.5061.5061.506 1.8421.8421.8421.842 4.874.874.874.87 15.715.715.715.7
T=2𝑇2T=2italic_T = 2 DPG-U 17.0617.0617.0617.06 12.12512.12512.12512.125 1.5401.5401.5401.540 1.8211.8211.8211.821 8.758.758.758.75 10.510.510.510.5
DPG 10.6610.6610.6610.66 11.66111.66111.66111.661 1.5221.5221.5221.522 1.8321.8321.8321.832 7.557.557.557.55 11.311.311.311.3
T=1𝑇1T=1italic_T = 1 DPG-U 11.1111.1111.1111.11 11.94211.94211.94211.942 1.6731.6731.6731.673 1.7891.7891.7891.789 7.987.987.987.98 7.17.17.17.1
DPG 6.596.596.596.59 10.97610.97610.97610.976 1.4981.4981.4981.498 1.6421.6421.6421.642 4.234.234.234.23 7.87.87.87.8