1 Introduction

With the ever-increasing complication of statistics and machine learning models, data privacy has become a primary concern as it becomes increasingly difficult to safeguard the potential disclosure of private information during model training. Differential privacy (Dwork et al. , 2006; Dwork and Roth , 2014) provides a framework for quantifying the privacy loss as well as for designing algorithms with privacy-preserving guarantees.

Many problems in machine learning fall under the paradigm of empirical risk minimization (ERM), where the loss is expressed as \(\frac{1}{n}\sum _{i=1}^n f(w; z_i)\) with independent and identically distributed (i.i.d) samples \(z_1,\ldots , z_n\) drawn from a data distribution \({\mathcal {D}}\). Differentially private ERM, originally studied in Chaudhuri et al. (2011), aims to safeguard the privacy disclosure of the samples in the solution \(w^*\). There exist many approaches to achieve this goal. The first class of methods is to perturb the output of a non-differentially private algorithm by adding a Laplace or Gaussian noise (Chaudhuri and Monteleoni , 2008; Chaudhuri et al. , 2011; Zhang et al. , 2017). Another approach considers adding a linear random perturbation term to the objective and is known as objective perturbation (Chaudhuri and Monteleoni , 2008; Chaudhuri et al. , 2011; Kifer et al. , 2012; Iyengar et al. , 2019; Bassily et al. , 2021). The third type of approach is to inject noise to gradient based algorithms at each iteration (Bassily et al. , 2014; Wang et al. , 2017; Wang et al. , 2019a; Abadi et al. , 2016; Bassily et al. , 2019). In addition, there also exist various specialized methods for problems such as linear regression and statistics estimation (Dwork and Lei , 2009; Nissim et al. , 2007; Amin et al. , 2019; Kamath et al. , 2019; Biswas et al. , 2020).

Among all the aforementioned approaches, gradient perturbation receives the most attention due to its generality for arbitrary loss functions and scalability to large datasets. Furthermore, it only requires to bound the sensitivity of the gradients computed at each iteration, rather than the entire process. There is an ongoing line of work that aims to improve the utility of the gradient perturbed algorithms while maintaining the same amount of privacy budget. Such improvements have been seen under (strongly) convex losses (Wang et al. , 2017; Iyengar et al. , 2019; Yu et al. , 2021; Bassily et al. , 2019; Kuru et al. , 2022; Asi et al. , 2021), nonconvex losses (Wang et al. , 2017; Zhang et al. , 2017; Wang et al. , 2019a; Wang et al. , 2019b), and also structured losses such as satisfying Polyak-Łojasiewicz condition (Wang et al. , 2017).

In this paper, we consider the following ERM problem in a differentially private setting where the parameter is constrained to lie on a Riemannian manifold, i.e.,

$$\begin{aligned} \min _{w \in {\mathcal {M}}} \left\{ F(w) = \frac{1}{n} \sum _{i=1}^n f_i(w) = \frac{1}{n} \sum _{i=1}^n f( w ; z_i ) \right\} , \end{aligned}$$
(1)

where \({\mathcal {M}}\) is a d-dimensional Riemannian manifold and \(f: {\mathcal {M}}\times {\mathcal {D}}\xrightarrow {} {\mathbb {R}}\) is a loss function over samples. Riemannian manifolds commonly occur in statistics and machine learning where the constraint space naturally possess additional nonlinear structure, such as orthogonality (Absil et al. , 2009), positive definiteness (Bhatia , 2009), unit norm, and hyperbolic (Boumal , 2020), among others. Popular applications involving manifold structures include matrix completion (Boumal and Absil , 2011b; Han and Gao , 2021a), metric learning, covariance estimation (Han et al. , 2021b), principal component analysis (Absil et al. , 2007), online learning (Amari , 1998), and taxonomy embedding (Nickel and Kiela , 2018), to name a few.

While a few recent works address specific Riemannian optimization problems under differential privacy, such as the private Fréchet mean computation (Reimherr et al. , 2021), there exists no systematic study of general purpose strategies to guarantee differential privacy for (1) on Riemannian manifolds. The main challenges are in terms of privatizing operations on manifolds intrinsically and bounding the sensitivity within a unified framework. On the other hand, differentially private (non-Riemannian) approaches have been studied for ERM problems with specific constraints (Chaudhuri et al. , 2013; Bassily et al. , 2014; Jiang et al. , 2016; Abadi et al. , 2016; Bassily et al. , 2019; Maunu et al. , 2022). Such approaches typically employ the projected gradient algorithm, i.e., taking the gradient step and adding noise in the Euclidean space, and then projecting onto the constraint set. Extending such a strategy to Riemannian manifolds may result in looser sensitivity and utility bounds, scaling poorly with the dimension of the ambient space, which can be much larger than the intrinsic manifold dimension (Reimherr et al. , 2021).

Table 1 Summary of utility guarantees of proposed \((\epsilon , \delta )\)-differentially private Algorithm 1 under different function classes. \(L_0, L_1\) are geodesic Lipschitz and smoothness constants of f and \(\beta , \tau\) are constants of geodesic strong convexity and Riemannian PL condition of F, respectively. The size of the dataset is n and the intrinsic dimension of the manifold is d. The curvature constant of the domain (as defined in Lemma 1) is \(\varsigma\). All utility bounds are measured in the expected empirical excess risk \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*)\) where \(w^*\) is a global minimizer except for the general nonconvex case where the bound is measured in \({\mathbb {E}}\Vert F(w^\textrm{priv}) \Vert ^2_{w^\textrm{priv}}\). The bounds hide dependence on \(c_l\) (lower bound on the metric tensor) and \(D_{\mathcal {W}}\) (diameter bound of the domain)

Contributions. In this work, we propose a general framework via Riemannian optimization to achieve differential privacy for (1) by Riemannian gradient perturbation adhering to the Riemannian geometry. To the best of our knowledge, this is the first work that provides systematic gradient perturbation techniques for privatizing on Riemannian manifolds. We generalize the Gaussian mechanism to the tangent space of Riemannian manifolds and also adapt the moments accountant technique to trace the privacy loss. We study the privacy guarantees of the differentially private Riemannian (stochastic) gradient descent method and show its utility guarantees for various function classes of interest on Riemannian manifolds, including geodesic (strongly) convex, general nonconvex functions, and functions satisfying the Riemannian Polyak-Łojasiewicz (PL) conditions. A summary of utility bounds is in Table 1. This allows tasks on manifolds to be privatized and analyzed in a unified framework and often leads to tighter utility guarantees compared to directly privatizing the output (Reimherr et al. , 2021) where more conservative noise is often added (as discussed in Sect. 5). We show that the projected gradient methods for (1) ignore the intrinsic geometry when taking the update, thereby hampering the utility under structured loss functions on manifolds (as discussed in Sects. 4 and 5). Finally, we also explore various interesting applications on Riemannian manifolds where privacy is concerned, including the principal eigenvector computation, Fréchet mean, Wasserstein barycenter, hyperbolic structured prediction, and low-rank matrix completion problems.

2 Preliminaries

Riemannian geometry. A Riemannian manifold \({\mathcal {M}}\) of dimension d is a smooth manifold with an inner product structure \(\langle \cdot , \cdot \rangle _w\) (i.e., a Riemannian metric) on every tangent space \(T_w{\mathcal {M}}\). Given an orthonormal basis \((\partial _1, \ldots , \partial _d)\) for \(T_w{\mathcal {M}}\), the metric can be expressed with a symmetric positive definite matrix \(G_w\) and the inner product can be written as \(\langle \xi , \zeta \rangle _w = \vec {\xi }^\top G_w \vec {\zeta }\) where \(\vec {\xi }, \vec {\zeta } \in {\mathbb {R}}^d\) are the vectorization of tangent vectors \(\xi , \zeta \in T_w{\mathcal {M}}\) in the normal coordinate system. An induced norm is defined as \(\Vert \xi \Vert _w = \sqrt{\langle \xi , \xi \rangle _w}\) for any \(\xi \in T_w{\mathcal {M}}\). A geodesic \(c: [0,1] \xrightarrow {} {\mathcal {M}}\) is a locally distance minimizing curve on the manifold with zero acceleration (intrinsic derivative of the velocity). For any \(\xi \in T_w{\mathcal {M}}\), the exponential map is defined as \(\textrm{Exp}_w(\xi ) = c(1)\), where \(c(0) = w\) and \(c'(0) = \xi\). If, between two points \(w, w'\in {\mathcal {M}}\), there exists a unique geodesic connecting them, the exponential map has a smooth inverse and the Riemannian distance is given by \(\textrm{dist}(w, w') = \Vert \textrm{Exp}_w^{-1}(w') \Vert _w = \Vert \textrm{Exp}_{w'}^{-1}(w) \Vert _{w'}\). We call a neighbourhood \({\mathcal {W}}\) totally normal if for any two points, the exponential map is invertible. The Riemannian gradient of a real-valued function, denoted as \(\textrm{grad}F(w)\), is a tangent vector that satisfies, for any \(\xi \in T_w{\mathcal {M}}\), it holds that \(\langle \textrm{grad}F(w), \xi \rangle _w = \textrm{D}_\xi F(w) = \langle \nabla F(w), \xi \rangle _2\) where \(\textrm{D}_\xi F(w)\) is the directional derivative of F(w) along \(\xi\) and \(\nabla F(w)\) is the Euclidean gradient.

Riemannian optimization. Under non-private settings, Riemannian optimization (Absil et al. , 2009; Boumal , 2020) provides a class of methods to efficiently solve problem (1). The Riemannian steepest descent method (Udriste , 2013; Zhang and Sra , 2016) takes a gradient update via the exponential map so that the iterates stay on the manifold, i.e., \(\textrm{Exp}_{w}(- \eta \, \textrm{grad}F(w))\) for some stepsize \(\eta\). Other more advanced solvers include the Riemannian conjugate gradient (Ring and Wirth , 2012), Riemannian trust region methods (Absil et al. , 2007; Absil et al. , 2009), as well as many recent stochastic optimizers (Bonnabel , 2013; Zhang et al. , 2016; Sato et al. , 2019; Kasai et al. , 2018; Han and Gao , 2021b; Han and Gao , 2021a).

Function classes on Riemannian manifolds. The notion of Lipschitz continuity has been generalized to Riemannian manifolds (Boumal , 2020; Zhang and Sra , 2016). A differentiable function \(F:{\mathcal {M}}\xrightarrow {} {\mathbb {R}}\) is geodesic \(L_0\)-Lipschitz if for any \(w \in {\mathcal {M}}\), \(\Vert \textrm{grad}F(w)\Vert _w \le L_0\). The function F is called geodesic \(L_1\)-smooth if for any \(w \in {\mathcal {M}}\) and \(w' = \textrm{Exp}_w(\xi )\), we have \(|F(w') - F(w) - \langle \textrm{grad}F(w), \xi \rangle _w | \le \frac{L_1}{2}\Vert \xi \Vert _w^2\) and equivalently \(\Vert \textrm{Hess}F(w) [u] \Vert _w \le L_1 \Vert u \Vert _w\) for any \(u \in T_w{\mathcal {M}}\) where \(\textrm{Hess}F(w): T_w{\mathcal {M}}\xrightarrow {} T_w{\mathcal {M}}\) is the Riemannian Hessian.

Geodesic convexity (Zhang and Sra , 2016) is an extension of convexity in the Euclidean space. A set \({\mathcal {W}}\subseteq {\mathcal {M}}\) is geodesic convex if for any two points in the set, there exists a geodesic in the set joining them. A function \(F: {\mathcal {W}}\xrightarrow {} {\mathbb {R}}\) is called geodesic convex if for any \(w, w' \in {\mathcal {W}}\), it satisfies \(F(c(t)) \le (1-t) F(w) + t F(w')\) for all \(t \in [0,1]\), where c is the geodesic such that \(c(0) = w, c(1) = w'\). In addition, a function F is called geodesic \(\beta\)-strongly convex if for any \(w, w' = \textrm{Exp}_w(\xi ) \in {\mathcal {W}}\), it satisfies \(F(w') \ge F(w) + \langle \textrm{grad}F(w), \xi \rangle _w + \frac{\beta }{2}d^2(w, w')\) for some \(\beta > 0\). We also introduce Riemannian Polyak-Łojasiewicz (PL) condition (Zhang et al. , 2016), which is weaker than the geodesic strong convexity. A function \(F:{\mathcal {M}}\xrightarrow {} {\mathbb {R}}\) is said to satisfy the Riemannian PL condition if for any \(w \in {\mathcal {M}}\), there exists \(\tau > 0\) such that \(F(w) - F(w^*) \le \tau \Vert \textrm{grad}F(w) \Vert ^2_w\) where \(w^*\) is a global minimizer of F on \({\mathcal {M}}\).

Finally, we recall a trigonometric distance bound for Riemannian manifolds with lower bounded sectional curvature, which is crucial in convergence analysis for geodesic (strongly) convex optimization.

Lemma 1

(Trigonometric distance bound (Bonnabel , 2013; Zhang and Sra , 2016; Zhang et al. , 2016; Han et al. , 2021b)) Let \(w_0, w_1, w_2 \in {\mathcal {W}}\subseteq {\mathcal {M}}\) lie in a totally normal neighbourhood of a Riemannian manifold with curvature lower bounded by \(\kappa _\textrm{min}\), and \(l_0 = \textrm{dist}(w_0, w_1), l_1 = \textrm{dist}(w_1, w_2)\) and \(l_2 = \textrm{dist}(w_0, w_2)\). Denote \(\theta\) as the angle on \(T_{w_0}{\mathcal {M}}\) such that \(\cos (\theta ) = \frac{1}{l_0 l_2}\langle \textrm{Exp}_{w_0}^{-1}(w_1), \textrm{Exp}_{w_0}^{-1}(w_2) \rangle _{w_0}\). Let \(D_{\mathcal {W}}\) be the diameter of \({\mathcal {W}}\), i.e., \(D_{\mathcal {W}}:= \max _{w, w' \in {\mathcal {W}}} \textrm{dist}(w, w')\). Define the curvature constant \(\varsigma = \frac{\sqrt{|\kappa _\textrm{min}|}D_{\mathcal {W}}}{\tanh (|\sqrt{|\kappa _\textrm{min}|} D_{\mathcal {W}})}\) if \(\kappa _\textrm{min} < 0\) and \(\varsigma = 1\) if \(\kappa _\textrm{min} \ge 0\). Then, we have \(l_1^2 \le \varsigma l_0^2 + l_2^2 - 2 l_0 l_2 \cos (\theta )\).

Differential privacy. Let \(D = \{z_1, \ldots , z_n\} \subset {\mathcal {D}}^n\) be a dataset. A neighbouring dataset of D, denoted as \(D'\) is a dataset that differs in only one sample from D. The neighbouring relation is denoted as \(D \sim D'\). We first recall the definition of \((\epsilon , \delta )\)-differential privacy (DP) (Dwork and Roth , 2014), which is defined on arbitrary measurable space \({\mathcal {M}}\) (not necessarily a Riemannian manifold).

Definition 1

(\((\epsilon , \delta )\)-differential privacy) A randomized mechanism \({\mathcal {R}}: {\mathcal {D}}^n \xrightarrow {} {\mathcal {M}}\) is called \((\epsilon , \delta )\)-differentially private on \({\mathcal {M}}\) if for any neighbouring datasets \(D, D' \subset {\mathcal {D}}^n\) and any measurable space \({\mathcal {A}}\subseteq {\mathcal {M}}\), we have \({\mathbb {P}}({\mathcal {R}}(D) \in {\mathcal {A}}) \le \exp (\epsilon ) \, {\mathbb {P}}({\mathcal {R}}(D') \in {\mathcal {A}}) + \delta\).

We also make use of Rényi differential privacy (RDP) (Mironov , 2017), which enjoys a tighter privacy bound under composition and subsampling (Wang et al. , 2019c). Given \(D \sim D'\), we define the cumulant generating function of a mechanism \({\mathcal {R}}\) as \(K_{{\mathcal {R}}, (D, D')}(\lambda ):= \log {\mathbb {E}}_{o \sim {\mathcal {R}}(D)} [\exp (\lambda \, {\mathcal {L}}_{{\mathcal {R}}(D) \Vert {\mathcal {R}}(D')}) ] = \log {\mathbb {E}}_{o \sim {\mathcal {R}}(D)} \left[ \left( \frac{p({\mathcal {R}}(D) = o)}{p({\mathcal {R}}(D') = o)} \right) ^{\lambda } \right]\), where \({\mathcal {L}}_{{\mathcal {R}}(D) \Vert {\mathcal {R}}(D')} = \log \left( \frac{p({\mathcal {R}}(D) = o)}{p({\mathcal {R}}(D') = o)} \right)\) is known as the privacy loss random variable at \(o \sim {\mathcal {R}}(D)\) (Dwork and Roth , 2014). When maximized over all the neighbouring datasets, \(K_{{\mathcal {R}}}(\lambda ):= \sup _{D \sim D'} K_{{\mathcal {R}},(D, D')}(\lambda )\) is called the \(\lambda\)-th moment of the mechanism (Abadi et al. , 2016).

Definition 2

(\((\alpha ,\rho )\)-Rényi differential privacy (Mironov , 2017)) For \(\alpha \ge 1\) and \(\rho > 0\), a randomized mechanism \({\mathcal {R}}: {\mathcal {D}}^n \xrightarrow {} {\mathcal {M}}\) is called \((\alpha , \rho )\)-Rényi differentially private if \(\frac{1}{\alpha - 1} K_{{\mathcal {R}}}(\alpha -1) \le \rho\).

Proposition 1

(Relationship between RDP and \((\epsilon , \delta )\)-DP (Mironov , 2017)) If a mechanism \({\mathcal {R}}\) satisfies \((\alpha , \rho )\)-Rényi differential privacy, then it satisfies \((\rho + \log (1/\delta )/(\alpha - 1), \delta )\)-differential privacy.

The notions of differential privacy introduced above are well-defined on Riemannian manifolds, which is a measurable space under the Borel sigma algebra (Pennec , 2006). However, a systematic approach for preserving differential privacy when the parameters of interest are on Riemannian manifolds has not been studied. A recent work (Reimherr et al. , 2021) proposes differentially private Fréchet mean computation over general Riemannian manifolds by output perturbation. The subsequent work (Soto et al. , 2022) also studies private Fréchet mean by generalizing the K-norm gradient mechanism to manifolds. Nevertheless, computing the Fréchet mean is a special problem instance of (1), which may also be solved via the proposed framework under differential privacy. Please refer to Sect. 5 for more comparative discussions.

3 Differential privacy on manifolds

This section develops various tools for preserving and analyzing differential privacy on Riemannian manifolds. Proofs for the results in this section are deferred to the appendix.

First, we generalize the Gaussian mechanism (Dwork and Roth , 2014) from the Euclidean space to Riemannian manifolds. One approach is to directly add noise on the manifold following an intrinsic Gaussian distribution (Pennec , 2006). However, this strategy faces two challenges. First, it is required to bound the sensitivity in terms of the Riemannian distance, which could be difficult particularly for negatively curved manifolds, such as the symmetric positive definite manifold. Second, the generalization suffers from metric distortion by curvature and it requires a nontrivial adaptation of the proof strategy in the Euclidean space (Dwork and Roth , 2014). Nevertheless, it is worth mentioning that the Laplace mechanism (Dwork et al. , 2006) can be generalized with the triangle inequality of the Riemannian distance as has been done recently in Reimherr et al. (2021).

Instead, we consider directly adding noise to the tangent space of the manifold following an isotropic Gaussian distribution with respect to the Riemannian metric. In this case, we can measure the sensitivity on the tangent space. We highlight that although the tangent space can be identified as a Euclidean space, the proposed strategy differs from the classic Gaussian mechanism, which adds isotropic noise to each coordinate in the Euclidean space (where the metric tensor is identity). To this end, we define the tangent space Gaussian distribution as follows.

Definition 3

(Tangent space Gaussian distribution) For any \(w \in {\mathcal {M}}\), a tangent vector \(\xi \in T_w{\mathcal {M}}\) follows a tangent space Gaussian distribution at w, denoted as \(\xi \sim {\mathcal {N}}_{w}(\mu , \sigma ^2)\) with mean \(\mu \in T_w{\mathcal {M}}\) and standard deviation \(\sigma > 0\) if its density is given by \(p_w(\xi ) = C^{-1}_{w, \sigma } \exp (- \frac{\Vert \xi - \mu \Vert ^2_w}{2\sigma ^2})\), where \(C_{w, \sigma }\) is the normalizing constant.

Remark 1

In a normal coordinate system of the tangent space, we remark that \({\mathcal {N}}_w(\mu , \sigma ^2)\) is equivalent to a standard multivariate Gaussian with covariance as a function of the metric tensor of the tangent space. Denote \(\vec {\xi } \in {\mathbb {R}}^{d}\) as the vectorization of the tangent vector \(\xi \in T_w{\mathcal {M}}\simeq {\mathbb {R}}^d\) in the normal coordinates. The density can then be written as \(p_w(\xi ) = C^{-1}_{w,\sigma } \exp (- \frac{1}{{2\sigma ^2}} (\vec {\xi } - \vec {\mu })^\top G_w ( \vec {\xi } - \vec {\mu }))\), where \(G_w\) is the (symmetric positive definite) metric tensor at w. This is a standard Gaussian distribution with mean \(\vec {\mu }\) and covariance \(\sigma ^2 G_w^{-1}\), i.e., \(\vec {\xi } \sim {\mathcal {N}}(\vec {\mu }, \sigma ^2 G_w^{-1})\). The normalizing constant is given by \(C_{w, \sigma } = \sqrt{(2\pi )^d \sigma ^2 |G_w^{-1}|}\).

We next introduce a generalization of the Gaussian mechanism on the tangent space. We stress that the following Gaussian mechanism depends on the sensitivity measured in the Riemannian metric.

Proposition 2

(Tangent space Gaussian mechanism) Given a query function \(H: {\mathcal {D}}^n \xrightarrow {} T_w{\mathcal {M}}\), let \(\varDelta _H:= \sup _{D, D' \subset {\mathcal {D}}^n: D \sim D'} \Vert H(D) - H(D') \Vert _w\) be the global sensitivity of H with respect to the Riemannian metric. Define \({\mathcal {R}}(D) = H(D) + \xi\) where \(\xi \sim {\mathcal {N}}_w(0, \sigma ^2)\) with \(\sigma ^2 \ge {2\log (1.25/\delta )} \varDelta _H^2/\epsilon ^2\). Then, \({\mathcal {R}}\) is \((\epsilon , \delta )\)-differentially private.

To show privacy guarantees in the subsequent sections, we adapt the moments accountant technique in the Euclidean space (Abadi et al. , 2016) to Riemannian manifolds, which results in a tighter bound compared to the advanced composition (Dwork and Roth , 2014). To achieve this, we first provide results that bound the moments of a tangent space Gaussian mechanism (Proposition 2). The next two lemmas show upper bounds on \(K_{\mathcal {R}}(\lambda )\) under full datasets (Lemma 2) as well as under subsampling (Lemma 3).

Lemma 2

(Moments bound) Consider a query function \(H: {\mathcal {D}}^n \xrightarrow {} T_w{\mathcal {M}}\) for some \(w \in {\mathcal {M}}\). Given a dataset \(D = \{z_1, \ldots , z_n\} \subset {\mathcal {D}}^n\) and suppose \(H(D) = \frac{1}{n} \sum _{i=1}^n h(z_i)\) and h is geodesic \(L_0\)-Lipschitz. Let \({\mathcal {R}}(D) = H(D) + \xi\), where \(\xi \sim {\mathcal {N}}_w(0, \sigma ^2)\). Then, the \(\lambda\)-th moment of \({\mathcal {R}}\) satisfies \(K_{\mathcal {R}}(\lambda ) \le \frac{2 \lambda (\lambda + 1) L_0^2}{n^2 \sigma ^2}\).

Lemma 3

(Moments bound under subsampling) Under the same settings as in Lemma 2, consider \(D_\textrm{sub}\) to be a subset of size b where samples are selected from D without replacement. Let \({\mathcal {R}}(D) = \frac{1}{b} \sum _{z \in D_\textrm{sub}} h(z) + \xi\), where \(\xi \sim {\mathcal {N}}_w(0, \sigma ^2)\). Suppose \(\sigma ^2 \ge \frac{4L_0^2}{b^2}, b < n, \lambda \le 2 \sigma ^2 \log (n/(b(\lambda + 1)(1 + \frac{b^2 \sigma ^2}{4 L_0^2})))/3\). Then, \(\lambda\)-th moment of \({\mathcal {R}}\) satisfies \(K_{\mathcal {R}}(\lambda ) \le \frac{15 (\lambda +1) L_0^2}{n^2 \sigma ^2}\).

4 Differentially private Riemannian (stochastic) gradient descent

In this section, we introduce differentially private Riemannian (stochastic) gradient descent method (Algorithm 1), where we add noise following the tangent space Gaussian distribution \({\mathcal {N}}_w(0, \sigma ^2)\). We show under proper choice of parameters, the algorithm preserves both the privacy guarantee as well as utility guarantees under various function classes on Riemannian manifolds. It should be noted that when the Riemannian manifold is the Euclidean space, i.e., \({\mathcal {M}}\equiv {\mathbb {R}}^d\), Algorithm 1 reduces to the standard DP-GD and DP-SGD (Wang et al. , 2017; Abadi et al. , 2016; Bassily et al. , 2019).

In Algorithm 1, the samples are selected without replacement following Wang et al. (2019c), and thus, when \(b = n\), we recover the full gradient descent. The noise variance is chosen as \(\sigma ^2 = c \frac{T\log (1/\delta ) L_0^2}{n^2\epsilon ^2}\) to ensure \((\epsilon , \delta )\)-differential privacy (Theorem 1) for some constant c. We remark that the choice of \(\sigma\) matches the standard results in the Euclidean space for gradient descent (Wang et al. , 2017) and for stochastic gradient descent (Abadi et al. , 2016; Bassily et al. , 2019; Wang et al. , 2019b) up to some constants that may depend on the manifold of interest. The output of the the algorithm depends on the function class of the objective, as discussed below.

Algorithm 1
figure a

Differentially private Riemannian (stochastic) gradient descent algorithm

4.1 Privacy and utility guarantees

Theorem 1

Algorithm 1 is \((\epsilon , \delta )\)-differentially private.

The proof relies on Lemma 2 and Lemma 3 to bound the moment of the randomized mapping.

For utility analysis, we start by making an assumption that all the iterates \(w_0, \ldots , w_T\) stay bounded within a compact support \({\mathcal {W}}\subseteq {\mathcal {M}}\) that contains a stationary point \(w^*\) (i.e., \(\textrm{grad}F(w^*) = 0\)). Let \(c_l = \inf _{w \in {\mathcal {W}}} \lambda _{\min }(G_w)\) where \(\lambda _{\min }(G_w)\) is the minimum eigenvalue of the metric tensor at w. We first show that the expected norm of the noise injected gradient can be bounded as follows.

Lemma 4

Suppose f(wz) is geodesic \(L_0\)-Lipschitz for any z. Consider a batch \({\mathcal {B}}\) of size b and \(w \in {\mathcal {W}}\). Let \(\zeta = \frac{1}{b} \sum _{z \in {\mathcal {B}}} \textrm{grad}f(w; z) + \epsilon\) where \(\epsilon \sim {\mathcal {N}}_w(0, \sigma ^2)\). Then, we have \({\mathbb {E}}[\zeta ] = \textrm{grad}F(w)\) and \({\mathbb {E}}\Vert \zeta \Vert _w^2 \le L_0^2 + d\sigma ^2 c_l^{-1}\), where the expectation is over randomness in both \({\mathcal {B}}\) and \(\epsilon\).

Next we show utility guarantees under various function classes on Riemannian manifolds, including geodesic (strongly) convex, general nonconvex and functions that satisfy Riemannian PL condition. It has been shown that many nonconvex problems in the Euclidean space are in fact geodesic (strongly) convex or satisfy the Riemannian PL condition on the manifold. This allows tighter utility bounds compared to differentially private projected gradient methods. Some examples of such problems are given in Sect. 5.

Geodesic convex optimization. When f(wz) is geodesic convex over \({\mathcal {W}}\) for any \(z \in {\mathcal {D}}\), \(w^*\) is a global minimum of F(w). The utility of Algorithm 1 is measured as the expected empirical excess risk \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*)\).

Theorem 2

Suppose f(wz) is geodesic convex, geodesic \(L_0\)-Lipschitz over \({\mathcal {W}}\). Assume \({\mathcal {W}}\) to be a totally normal neighbourhood with diameter \(D_{\mathcal {W}}\). Let \(\varsigma\) be the curvature constant of \({\mathcal {W}}\) defined in Lemma 1. Consider Algorithm 1 with output 3 where \(w^\textrm{priv} = \bar{w}_{T-1}\) is computed by geodesic averaging as follows: set \(\bar{w}_1 = w_1\) and \(\bar{x}_{t+1} = \textrm{Exp}_{\bar{w}_t}\left( \frac{1}{t+1} \textrm{Exp}^{-1}_{\bar{x}_t}(x_{t+1}) \right)\). Set \(\eta _t = \frac{D_{\mathcal {W}}}{\sqrt{(L_0^2 + d \sigma ^2 c_l^{-1}) \varsigma T}}\). Then, \(w^\textrm{priv}\) satisfies \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*) \le D_{\mathcal {W}}\sqrt{\frac{(L_0^2 + d \sigma ^2 c_l^{-1}) \varsigma }{T}} = O\left( \frac{\sqrt{d \log (1/\delta ) c_l^{-1} \varsigma } L_0 D_{\mathcal {W}}}{n\epsilon } \right)\), for the choice of \(\sigma ^2\) in Algorithm 1 and \(T = n^2\).

Geodesic strongly convex optimization. Under geodesic strong convexity, the global optimizer \(w^*\) is unique and we use the same measure for the utility.

Theorem 3

Suppose f(wz) is geodesic \(\beta\)-strongly convex, geodesic \(L_0\)-Lipschitz over \({\mathcal {W}}\). Assume \({\mathcal {W}}\) to be a totally normal neighbourhood with the curvature constant \(\varsigma\). Consider Algorithm 1 with output 3 where \(w^\textrm{priv} = \bar{w}_{T-1}\) is the geodesic averaging by setting \(\bar{w}_1 = w_1\), \(\bar{w}_{t+1} = \textrm{Exp}_{\bar{w}_t}\left( \frac{2}{t+1} \textrm{Exp}^{-1}_{\bar{w}_t} (w_{t+1}) \right)\). Set \(\eta _t = \frac{1}{\beta (t+1)}\). Then, \(w^\textrm{priv}\) satisfies \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*) \le \frac{2 \varsigma (L_0^2 + d \sigma ^2 c_l^{-1})}{\beta T} = O \left( \frac{\beta ^{-1} d \log (1/\delta ) c_l^{-1} \varsigma L_0^2}{ n^2 \epsilon ^2} \right)\), with the choice of \(\sigma ^2\) in Algorithm 1 and \(T = n^2\).

Optimization under Riemannian PL condition. We now consider the case when the objective F satisfies the Riemannian Polyak-Łojasiewicz (PL) condition. It is known that Polyak-Łojasiewicz is a sufficient condition to establish linear convergence to global minimum (Karimi et al. , 2016). In addition, the Riemannian PL condition includes the geodesic strong convexity as a special case (Zhang et al. , 2016).

Theorem 4

Suppose f(wz) is geodesic \(L_0\)-Lipschitz and \(L_1\)-smooth over \({\mathcal {W}}\) and F(w) satisfies the Riemannian PL condition with parameter \(\tau\), i.e., \(F(w) - F(w^*) \le \tau \Vert \textrm{grad}F(w) \Vert ^2_w\). Consider Algorithm 1 with output 1. Set \(\eta _t = \eta < \min \{1/L_1, 1/\tau \}\) and \(T = \log (\frac{n^2\epsilon ^2 c_l}{dL_0^2 \log (1/\delta )})\). Then, \(w^\textrm{priv}\) satisfies \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*) \le O \left( \frac{\tau ^{-1} d \log (n) \log (1/\delta ) c_l^{-1} L_0^2 \varDelta _0}{n^2\epsilon ^2} \right)\), where we denote \(\varDelta _0:= F(w_0) - F(w^*)\).

General nonconvex optimization. Under general nonconvex objectives, we show utility bound with respect to the expected gradient norm squared, i.e., \({\mathbb {E}}\Vert \textrm{grad}F(w^\textrm{priv}) \Vert ^2_{w^\textrm{priv}}\), which has also been considered in the Euclidean space (Zhang et al. , 2017; Wang et al. , 2017).

Theorem 5

Suppose f(xw) is geodesic \(L_0\)-Lipschitz and \(L_1\)-smooth. Consider Algorithm 1 with output 2. Set \(\eta _t = \eta = 1/L_1\) and \(T = \frac{\sqrt{L_1} n \epsilon }{\sqrt{d \log (1/\delta ) c_l^{-1}}L_0}\). Then, \(w^\textrm{priv}\) satisfies \({\mathbb {E}}\Vert \textrm{grad}F(w^\textrm{priv}) \Vert ^2_{w^\textrm{priv}} \le O \left( \frac{\sqrt{d L_1 \log (1/\delta ) c_l^{-1}} L_0}{n\epsilon } \right)\).

Remark 2

Given the Euclidean space is a special case of Riemannian manifold, we observe that the utility bounds obtained in Theorem 2 to 5 match the standard results in the Euclidean space (Bassily et al. , 2014; Wang et al. , 2017) up to some constants that depend on the curvature and diameter bound of the manifold considered. In addition, we highlight that our bounds depend on the intrinsic dimension of the manifold, rather than the ambient space. This suggests potentially great improvement on the utility when preserving privacy under the proposed Riemannian framework as opposed to the ambient Euclidean space.

Extension to general retraction. So far we have focused on the use of exponential map in Algorithm 1. However, for some manifolds, exponential map can be computationally costly to evaluate, e.g., on the symmetric positive definite (SPD) manifold. In such cases, we can replace the exponential map for the gradient update with its first-order approximation, known as the retraction (Absil et al. , 2009). Given that the proposed privatization in Algorithm 1 is on the Riemannian gradient, the privacy guarantees remain unchanged with the use of retraction based on the composition theorem. Similarly, the utility guarantees are also preserved under retraction, primarily because the retraction is a first-order approximation to the exponential map. In particular, the same utility bounds hold up to some constants that upper bound the deviation between the exponential map and retraction. See Han and Gao (2021a), Sato et al. (2019), Kasai et al. (2018) for more detailed discussions on convergence analysis with retraction.

5 Applications

We explore various kinds of applications to show the versatility and efficacy of the proposed framework of differentially private Riemannian optimization.

5.1 Private principal eigenvector computation over sphere manifold

We consider the problem of computing the principal eigenvector of a sample covariance matrix (Shamir , 2015; Zhang et al. , 2016) by solving

$$\begin{aligned} \min _{w \in {\mathcal {S}}^{d}} \left\{ F(w) = \frac{1}{n} \sum _{i =1}^n f(w; z_i) = - \frac{1}{n} \sum _{i=1}^n w^\top \left( z_i z_i^\top \right) w \right\} , \end{aligned}$$
(2)

where \({\mathcal {S}}^{d} = \{ w \in {\mathbb {R}}^{d+1}: \Vert w \Vert _2 = w^\top w = 1 \}\) is the sphere manifold of intrinsic dimension d and \(z_1, \ldots , z_n \in {\mathbb {R}}^{d+1}\) are zero-centered samples. Sphere manifold is an embedded submanifold of \({\mathbb {R}}^{d+1}\) with the tangent space \(T_w{\mathcal {S}}^{d} = \{ \xi \in {\mathbb {R}}^{d+1}: w^\top \xi = 0 \}\). A common Riemannian metric used is the Euclidean metric, i.e., \(\langle \xi , \zeta \rangle _w = \xi ^\top \zeta\) for any \(\xi , \zeta \in T_w{\mathcal {S}}^{d}\). The Riemannian gradient on \(T_w{\mathcal {S}}^{d}\) is \(\textrm{grad}F(w) = (I_{d+1} - ww^\top ) \nabla F(w)\), where \(I_{d+1}\) is the identity matrix of size \((d+1)\times (d+1)\) and \(\nabla F(w)\) is the Euclidean gradient.

Utility. It can be shown from Theorem 6 (in the appendix) that problem (2), although being nonconvex in the ambient Euclidean space, satisfies the Riemannian PL condition on \({\mathcal {S}}^{d}\) locally with \(\tau = O(d/q^2 \nu )\) with probability \(1- q\) where \(\nu\) is the difference between the largest two eigenvalues of \(\frac{1}{n} \sum _{i=1}^n z_iz_i^\top\). Also, we see that the metric tensor \(G_w = I_{d}\), and hence, \(c_l = 1\). This is because given an orthonormal basis on the tangent space \(T_w{\mathcal {S}}^{d}\), denoted as \(B \in {\mathbb {R}}^{(d+1) \times d}\), we have \(\langle \xi , \zeta \rangle _w = \langle {\xi }, \zeta \rangle _2 = \langle B \vec {\xi }, B \vec {\zeta } \rangle _2 = \langle \vec {\xi }, \vec {\zeta } \rangle _2\), where \(\vec {\xi }, \vec {\zeta } \in {\mathbb {R}}^{d}\) is the vectorization under the coordinate transformation given by B. In addition, the geodesic Lipschitz constant \(L_0\) is bounded as \(\Vert \textrm{grad}f(w; z_i) \Vert _w = \Vert 2(I_{d+1} - ww^\top ) (z_i z_i^\top ) w \Vert _2 \le 2\Vert (z_i z_i^\top ) w \Vert _2 \le \vartheta := 2 \max _{i} z_i^\top z_i\). Applying Theorem 4 for utility under the Riemannian PL condition, we can show if properly initialized, the expected empirical excess risk of Algorithm 1 is bounded as \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*) \le \widetilde{O} \left( \frac{ d q^2 \nu \vartheta ^2 \varDelta _0}{n^2\epsilon ^2} \right)\), where we have \(c_l = 1, L_0 = \vartheta\) and we use \(\widetilde{O}\) to hide the logarithmic factors.

Remark 3

(Comparison with perturbed projected gradient descent) Considering the alternative approach of using projected gradient descent with gradient perturbation in the ambient space for solving the problem (2), one can only guarantee a utility in the gradient norm, i.e., \({\mathbb {E}}[\Vert \nabla F(w^\textrm{priv})] \Vert _2^2] \le \widetilde{O} \left( \frac{ \vartheta }{n\epsilon } \right)\) due to the nonconvexity of the problem (Wang et al. , 2017). This results in a looser bound compared to our obtained utility guarantee above, which is measured in terms of the excess risk and scales with \(n^{-2}\) rather than \(n^{-1}\).

5.2 Private Fréchet mean computation over SPD manifold

The second application we consider is the problem of Fréchet mean computation (Bhatia , 2009; Jeuris et al. , 2012) over the manifold of \(r \times r\) symmetric positive definite (SPD) matrices, denoted as \({\mathbb {S}}_{++}^r\). Given a set of SPD matrices \(X_1, \ldots , X_n \in {\mathbb {S}}_{++}^r\), the goal is to find a center \(W \in {\mathbb {S}}_{++}^r\) by solving

$$\begin{aligned} \min _{W \in {\mathbb {S}}_{++}^r} \left\{ F(W) = \frac{1}{n} \sum _{i=1}^n f(W; X_i) = \frac{1}{n} \sum _{i=1}^n \Vert \textrm{logm}( W^{-1/2} X_i W^{-1/2}) \Vert _\textrm{F}^2 \right\} , \end{aligned}$$
(3)

where \(\textrm{logm}(\cdot )\) represents the principal matrix logarithm. The tangent space of the set of SPD matrices is given as \(T_W{\mathbb {S}}_{++}^r = \{ U \in {\mathbb {R}}^{r \times r}: U^\top = U \}\), i.e., the set of symmetric matrices. It can be shown that the function \(f(W; X_i)\) in (3) is the Riemannian distance squared \(\textrm{dist}^2(W, X_i)\) associated with the affine-invariant Riemannian metric (Skovgaard , 1984; Bhatia , 2009), defined as \(\langle U, V \rangle _W = \textrm{tr}(U W^{-1} V W^{-1})\).

Utility. Problem (3) is known to be nonconvex in the Euclidean space while being 2-geodesic strongly convex on the SPD manifold with the affine-invariant metric (Zhang et al. , 2016; Zhang and Sra , 2016). In addition, the Lipschitz constant can be bounded as follows. First, the Riemannian gradient of problem (3) is derived as \(\textrm{grad}f(W; X_i) = W \textrm{logm}( W^{-1} X_i )\) and \(\Vert \textrm{grad}f(W; X_i) \Vert _W = \Vert \textrm{logm}( W^{-1} X_i ) \Vert _F = \textrm{dist}(W, X_i) \le D_{\mathcal {W}}= L_0\). Also, it is known that the SPD manifold with the affine-invariant metric is negatively curved with the curvature constant \(\zeta \le 2 \sqrt{|\kappa _{\min }|} D_{\mathcal {W}}+1\) according to Lemma 1. Applying Theorem 3, we see that the utility of Algorithm 1 is \({\mathbb {E}}[F(W^\textrm{priv})] - F(W^*) \le \widetilde{O} \left( \frac{d c_l^{-1} \sqrt{|\kappa _\textrm{min}|} D_{\mathcal {W}}^3}{ n^2 \epsilon ^2} \right)\), where the intrinsic dimension of the SPD manifold \({\mathbb {S}}_{++}^r\) is \(d = r(r+1)/2\).

Remark 4

(Comparison to (Reimherr et al. , 2021) and (Soto et al. , 2022)) Recent works (Reimherr et al. , 2021; Soto et al. , 2022) have considered \(\epsilon\)-pure differential privacy (with \(\delta = 0\)) for privatizing the Frechet mean. Reimherr et al. (2021) perturbs the output on the manifold with Laplace noise and the utility given by \(\textrm{dist}^2(W^\textrm{priv}, W^*) \le O\left( \frac{d^2 D_{\mathcal {W}}^2}{n^2 \epsilon ^2} \right)\). In Soto et al. (2022), the Fréchet mean is privatized by sampling from the density \(g(W; \{ X_i\}_{i=1}^n) \propto \exp \left( - \frac{1}{\sigma n} \Vert \sum _{i=1}^n \textrm{grad}f(W; X_i) \Vert _W \right)\), known as the K-norm gradient mechanism (Reimherr and Awan , 2019). Their utility is given by \(O(\frac{d^2}{n^2\epsilon ^2})\). In contrast, our Algorithm 1 preserves \((\epsilon , \delta )\)-differential privacy with utility of \(\textrm{dist}^2(W^\textrm{priv}, W^*) \le \widetilde{O}\left( \frac{d D_{\mathcal {W}}^3}{n^2\epsilon ^2} \right)\) by applying the triangle inequality to the expected empirical excess risk and ignoring other factors. We see our obtained utility matches the state-of-the-art results in Reimherr et al. (2021), Soto et al. (2022).

In terms of the noise level, (Reimherr et al. , 2021) requires the noise rate \(\sigma = \frac{2\varDelta _\textrm{FM}}{\epsilon } = \frac{4D_{\mathcal {W}}}{n \epsilon }\) as the sensitivity \(\varDelta _\textrm{FM} = \frac{2D_{\mathcal {W}}}{n}\) for non-positively curved manifold (Reimherr et al. 2021, Theorem 2). The noise level required in Soto et al. (2022) is \(\sigma = \frac{4D_{\mathcal {W}}}{n\epsilon }\), which is the same as Reimherr et al. (2021). The proposed Algorithm 1 requires \(\sigma = O( \frac{\sqrt{T\log (1/\delta )} D_{\mathcal {W}}}{n\epsilon })\). Nevertheless, it should be emphasized that the noise distributions in Reimherr et al. (2021) and Soto et al. (2022) are different from ours which is the tangent space Gaussian distribution (Definition 3).

5.3 Private Wasserstein barycenter over SPD manifold

When endowing the SPD matrices with Bures-Wasserstein metric (Malagò et al. , 2018; Bhatia et al. , 2019), i.e., \(\langle U, V \rangle _W = \frac{1}{2} \textrm{tr}( {\mathcal {L}}_W[U] V)\) where \({\mathcal {L}}_W[U]\) is the solution to the equation \(W {\mathcal {L}}_W[U] + {\mathcal {L}}_W[U] W = U\), one can show that the induced Riemannian distance between \(\varSigma _1, \varSigma _2 \in {\mathbb {S}}_{++}^d\) recovers the 2-Wasserstein distance between zero-centered Gaussian distributions with covariance \(\varSigma _1, \varSigma _2\), i.e., \(\textrm{dist}_\textrm{bw}^2(\varSigma _1, \varSigma _2) = \textrm{tr}(\varSigma _1) + \textrm{tr}(\varSigma _2) -2 \textrm{tr}( \varSigma _1^{1/2} \varSigma _2 \varSigma _1^{1/2})^{1/2}\) where \(\textrm{tr}(X)^{1/2}\) represents the trace of matrix square root. Here we consider solving the Fréchet mean problem (3) with the Bures-Wasserstein metric, i.e., \(f(W; X_i) = \textrm{dist}_\textrm{bw}^2(W, X_i)\), which amounts to finding the Wasserstein barycenter of the zero-centered Gaussian distributions with covariances \(X_i\), \(i = 1,...,n\).

Utility. From Han et al. (2021a), we see that the curvature \(\kappa\) of the SPD manifold with Bures-Wasserstein metric is non-negative (as opposed to non-positive with the affine-invariant metric in Sect. 5.2). The curvature is bounded by \(\kappa _{\min } = 0\) and \(\kappa _{\max } = \sup _{W \in {\mathcal {W}}} \frac{3}{\lambda _r(W) + \lambda _{r-1}(W)}\) where \(\lambda _i(W)\) is the i-th largest eigenvalue of \(W \in {\mathbb {S}}_{++}^r\). Applying the results in Alimisis et al. (2020), problem (3) is geodesic strongly convex when the diameter of the domain \(D_{\mathcal {W}}\) satisfies \(D_{\mathcal {W}}< \frac{\pi }{2 \sqrt{\kappa _{\max }}}\). Its Riemannian gradient can be derived as \(\textrm{grad}f(W; X_i) = - (WX_i)^{1/2} - (X_i W)^{1/2} + 2 W\), and it can be verified that \(\Vert \textrm{grad}f(W; X_i) \Vert _W = \textrm{dist}_\textrm{bw}(W, X_i) \le D_{\mathcal {W}}< \frac{\pi }{2\sqrt{\kappa _\textrm{max}}}\). The curvature constant in Lemma 1 is given by \(\varsigma = 1\) and according to Theorem 3, the utility is \({\mathbb {E}}[F(W^\textrm{priv})] - F(W^*) \le \widetilde{O} \left( \frac{d c_l^{-1} h_{\mathcal {W}}^{-1} D_{\mathcal {W}}^2 }{ n^2 \epsilon ^2} \right)\) where \(h_{\mathcal {W}}= \sqrt{\kappa _{\max }} D_{\mathcal {W}}\cot (\sqrt{\kappa _{\max }} D_{\mathcal {W}})\).

Remark 5

(Comparison to Reimherr et al. (2021) and Soto et al. (2022) for the Bures-Wasserstein metric) In Reimherr et al. (2021), the utility is again \(O\left( \frac{d^2 D_{\mathcal {W}}^2}{n^2 \epsilon ^2} \right)\) with a required noise level \(\sigma = \frac{\varDelta _\textrm{FM}}{\epsilon } = \frac{2 D_{\mathcal {W}}(2- h_{\mathcal {W}}) h_{\mathcal {W}}^{-1}}{n \epsilon }\), when the manifold is positively curved, as per (Reimherr et al. 2021, Theorem 2). In addition, Soto et al. (2022) requires \(\sigma = \frac{2 D_{\mathcal {W}}(2-h_{\mathcal {W}})}{n}\), which is strictly smaller compared to Reimherr et al. (2021) given \(h_{\mathcal {W}}< 1\). The utility however, remains the same as \(O(\frac{d^2}{n^2\epsilon ^2})\). In contrast, Algorithm 1, presents an utility of \(\widetilde{O} \left( \frac{d h_{\mathcal {W}}^{-1} D_{\mathcal {W}}^2 }{ n^2 \epsilon ^2} \right)\) with a noise level \(\sigma = O( \frac{\sqrt{T\log (1/\delta )} D_{\mathcal {W}}}{n\epsilon })\).

5.4 Private hyperbolic structured prediction

The hyperbolic manifold has shown great promise in representation learning of complex structured data, including word semantics (Nickel and Kiela , 2017; Tifrea et al. , 2018), graphs (Chami et al. , 2019), images (Khrulkov et al. , 2020), among others. The hyperbolic manifold can be equivalently represented under multiple models. The Lorentz hyperboloid model is given by \({\mathcal {H}}^d:= \{ w \in {\mathbb {R}}^{d+1}: \langle w, w \rangle _{{\mathcal {L}}} = -1 \}\) where \(\langle x, y \rangle _{{\mathcal {L}}} = - x_0 y_0 + \sum _{i=1}^d x_i y_i\). The tangent space is given by \(T_w{\mathcal {H}}^d = \{ u \in {\mathbb {R}}^{d+1}: \langle w, u\rangle _{\mathcal {L}}= 0 \}\) with the Riemannian metric as \(\langle \cdot , \cdot \rangle _w = \langle \cdot , \cdot \rangle _{\mathcal {L}}\). The Riemannian distance is \(\textrm{dist}_{\mathcal {L}}(x, y) = \textrm{arccosh}(- \langle x, y\rangle _{\mathcal {L}})\) for \(x, y \in {\mathcal {H}}^d\).

In Marconi et al. (2020), hyperbolic structured prediction (also known as hyperbolic regression) has been considered to infer the taxonomy embeddings for unseen words. Consider a set of training pairs \(\{ x_i, y_i \}_{i=1}^n\) where \(x_i \in {\mathbb {R}}^r\) are the features and \(y_i \in {\mathcal {H}}^d\) are the hyperbolic embeddings of the class of \(x_i\), which are learned following (Nickel and Kiela , 2017). For a test sample x, the goal is to predict its hyperbolic embeddings by solving

$$\begin{aligned} \hat{f}(x) = \mathop {\textrm{arg}\,\textrm{min}}\limits _{w \in {\mathcal {H}}^d} \left\{ F(w) = \frac{1}{n}\sum _{i=1}^n f_i(w) = \frac{1}{n} \sum _{i = 1}^n \alpha _i(x) \textrm{dist}_{{\mathcal {L}}}^2(w, y_i)\right\} , \end{aligned}$$
(4)

where \(\alpha (x) = (\alpha _i(x),..., \alpha _n(x))^\top \in {\mathbb {R}}^n\) are learned via a variant of kernel ridge regression (Rudi et al. , 2018), i.e., \(\alpha (x) = (K + \gamma I)^{-1} K_x\) where \(\gamma > 0\) is the regularization parameter and \(K \in {\mathbb {R}}^{n \times n}, K_x \in {\mathbb {R}}^{n}\) are computed with \(K_{ij} = k(x_i, x_j)\) and \((K_x)_i = k(x_i, x)\) for some kernel function \(k(\cdot , \cdot )\). Problem (4) can be solved via Riemannian (stochastic) gradient descent. Here, we wish to privatize the prediction of hyperbolic embedding against the training data pairs \(\{ x_i, y_i\}\), which may contain sensitive information.

Utility. It is worth mentioning that problem (4) can be seen as computing the weighted Fréchet mean on the hyperbolic manifold. However, it is not guaranteed that the weights \(\alpha _i(x)\) are all positive. When \(\alpha _i(x) \ge 0\) for all i, problem (4) can be easily verified to be \(\left( 2/n \sum _{i = 1}^n \alpha _i(x) \right)\)-geodesic strongly convex given that \(\textrm{dist}_{{\mathcal {L}}}^2(w, y_i)\) is 2-geodesic strongly convex for negatively curved space (Alimisis et al. , 2020). In addition, following similar derivation as before, we can show \(\Vert \textrm{grad}f_i(w)\Vert _w = \alpha _i(x) \Vert \textrm{Exp}_{w}^{-1}(y_i) \Vert _w \le \alpha _i(x) \textrm{dist}_{\mathcal {L}}(w, y_i) \le {\alpha }_{i}(x) D_{\mathcal {W}}\). From Lemma 1, we have \(\varsigma \le 2 D_{\mathcal {W}}+ 1\) given \(\kappa _\textrm{min} = \kappa = -1\) for the hyperbolic space. Applying Theorem 3, the utility of Algorithm 1 on problem (4) is \({\mathbb {E}}[F(w^\textrm{priv})] - F(w^*) \le \widetilde{O} \left( \frac{dD_{\mathcal {W}}^3}{ n_\alpha n\epsilon ^2} \right)\) where we let \(n_\alpha := \max _i \frac{\alpha _i(x)}{\sum _{i=1}^n \alpha _i(x)}\) and we use the fact that \(c_l = 1\).

On the other hand, when \(\alpha _i(x) < 0\) for some i, the problem is in general nonconvex, where we apply Theorem 5 for the utility guarantee. Based on (Alimisis et al. 2020, Lemma 2), one can show the Riemannian Hessian of the Riemannian distance satisfies \(\Vert \textrm{Hess}\, \textrm{dist}_{\mathcal {L}}^2(w, y_i) [u] \Vert _w \le L_1 \Vert u \Vert _w\) for any \(u \in T_w{\mathcal {M}}\) where \(L_1 = \frac{\sqrt{|\kappa _\textrm{min}|} D_{\mathcal {W}}}{\tanh (\sqrt{|\kappa _\textrm{min}|} D_{\mathcal {W}})} =\frac{D_{\mathcal {W}}}{\tanh (D_{\mathcal {W}})}\). The Lipschitz constant \(L_0 = \max _{i} |\alpha _i(x)| D_{\mathcal {W}}\). Hence from Theorem 5, the utility bound is given by \(O\left( \frac{\sqrt{d D_{\mathcal {W}}^3\log (1/\delta )} \max _i |\alpha _i(x)|}{ \sqrt{\tanh (D_{\mathcal {W}})} n \epsilon } \right)\).

5.5 Private low rank matrix completion over Grassmann manifold

Low rank matrix completion (LRMC) (Candès and Recht , 2009; Jain et al. , 2013) aims to recover the matrix by assuming a low rank structure, and has been studied via Riemannian optimization (Mishra et al. , 2012; Vandereycken , 2013; Boumal and Absil , 2011a; Boumal and Absil , 2015). In particular, we follow Boumal and Absil (2011a, 2015) to cast the problem as optimization over the Grassmann manifold \({\mathcal {G}}(d, r)\), i.e., the set of r-dimensional subspaces in \({\mathbb {R}}^d\) \((d \ge r)\). Please refer to the appendix for details on the Grassmann manifold. The LRMC problem is defined as

$$\begin{aligned} \min _{W \in {\mathcal {G}}(d,r), V \in {\mathbb {R}}^{r \times n}} \left\{ \frac{1}{n} \Vert \mathbbm {1}_{\varOmega } \odot (X - W V) \Vert ^2_\textrm{F} = \frac{1}{n}\sum _{i =1}^n \Vert \mathbbm {1}_{\varOmega _i} \odot (x_i - W v_i) \Vert _2^2 \right\} , \end{aligned}$$

where \(\varOmega = \{(i,j): X_{ij} \; \textrm{observed} \}\) and \(\mathbbm {1}_{\varOmega } = A \in {\mathbb {R}}^{d \times n}\) where \(A_{ij} = 1\) if \((i,j) \in \varOmega\) and \(A_{ij} = 0\) otherwise. \(\odot\) represents elementwise product. \(x_i, v_i, \mathbbm {1}_{\varOmega _i}\) are the i-th column of matrix \(X, V, \mathbbm {1}_{\varOmega }\) respectively. We highlight that given \(W \in {\mathcal {G}}(d,r)\), V has a closed-form solution by least squares. The problem is nonconvex both in the Euclidean space and on the manifold, with utility given by Theorem 5, which matches between the Euclidean and Riemannian solvers up to some constants.

6 Experiments

This section presents the experiment results on applications introduced in Sect. 5. The experiments are implemented using the Manopt toolbox (Boumal et al. , 2014) on Matlab except for the hyperbolic application where we use Pytorch. The codes are available at https://github.com/andyjm3/DP-RiemOpt.

6.1 Results on private Fréchet mean on SPD manifold

We follow the steps in Reimherr et al. (2021) to generate synthetic samples on the SPD manifold \({\mathbb {S}}_{++}^r\) following the Wishart distribution \(W(I_{r}/r, r)\) with a diameter bound \(D_{\mathcal {W}}\). The optimal solution \(W^*\) is obtained by running the Riemannian gradient descent (RGD) method on problem (3) until the gradient norm falls below \(10^{-14}\). For the example, we choose \(r = 2\) and set \(T = n\), which we find empirically better than \(n^2\) (as suggested in Theorem 3). We choose \(\sigma ^2 = T \log (1/\delta ) D_{\mathcal {W}}^2/(n^2\epsilon ^2)\) where \(\epsilon = 0.1, \delta = 10^{-3}, D_{\mathcal {W}}= 1\), \(\eta = 0.01\). To sample noise from the tangent Gaussian distribution, i.e., \(\epsilon \sim {\mathcal {N}}_w(0, \sigma ^2) \propto \exp ( - \frac{1}{2\sigma ^2} \Vert \epsilon \Vert ^2_W )\), we implement the standard random walk Metropolis-Hastings algorithm (Robert et al. , 1999) with a burn-in period of \(10^4\) steps.

We compare the proposed algorithm DP-RGD with the differentially private Fréchet mean (FM-LAP) algorithm by Laplace mechanism in Reimherr et al. (2021) and K-norm gradient mechanism (FM-KNG) in Soto et al. (2022). For FM-LAP, we follow Reimherr et al. (2021) to first obtain a non-privatized Fréchet mean \(\widehat{W}\) by running RGD. Then, we sample from an intrinsic Laplace distribution on \({\mathbb {S}}_{++}^r\) (following the steps in Hajri et al. (2016)) with footprint \(\widehat{W}\) and \(\sigma = \frac{4D_{\mathcal {W}}}{n\epsilon } = \frac{4}{n\epsilon }\) is the sensitivity of the Fréchet mean on SPD manifold (Reimherr et al. 2021, Theorem 2). For FM-KNG, we use Metropolis-Hastings as suggested in Soto et al. (2022) with the proposal generated by \(\textrm{Exp}_W(t \sigma V)\) for the same \(\sigma\) where \(V \in T_W{\mathcal {M}}\) is sampled uniformly and \(t \in (0,1]\) controls the step. The complete procedures of sampling are included in the appendix, where we set \(t = 0.1\) to avoid ill-conditioning. We consider a burn-in period of 5000. It is worth highlighting that the sampling for FM-KNG is computationally costly given each step in Metropolis-Hastings requires the evaluation of full gradient. We plot in Fig. 1a the expected empirical excess risk against the sample size, averaged over 20 runs. We observe better utility of our proposed DP-RGD particularly when the sample size is small. Under large-sample regimes, our mechanism performs competitively against the baselines.

Fig. 1
figure 1

Fréchet mean results averaged over 20 runs. a Empirical excess risk against sample size on synthetic data, and bc against \(\epsilon\) (privacy parameter) on real datasets: YaleB and Dynt. We observe an improved utility guarantees of our proposed DP-RGD for all the cases

We also consider the real datasets: Extended Yale-B dataset (YaleB) (Wright et al. , 2008) and Kylberg dynamic texture dataset (Dynt) (Kylberg , 2011). We crop the images to \(32 \times 32\) and construct the Gabor-based region covariance descriptors (Pang et al. , 2008), which is an SPD matrix of size \(8\times 8\). We then use the geometry-aware PCA proposed in Han et al. (2021a) to reduce the dimensionality of each SPD covariance to a size of \(2 \times 2\). For each dataset, we extract randomly 50 samples and compute the Fréchet mean. We again follow the same parameter choices. Given the diameter bound \(D_{\mathcal {W}}\) is unknown in this case, we set \(D_{\mathcal {W}}= 1\). The stepsize is set to be 0.1 for DP-RGD. We show in Figs. 1b and 1c the expected empirical excess risk with varying privacy parameter \(\epsilon\) where we see DP-RGD enjoys an improved utility, particularly with smaller \(\epsilon\). Nevertheless, we notice the performance gap reduces when \(\epsilon\) increases.

6.2 Results on private Bures-Wasserstein barycenter

For the barycenter problem, we follow the same steps as in Sect. 6.1 to generate synthetic SPD matrices, while controlling the spectrum to ensure the condition \(D_{\mathcal {W}}< \frac{\pi }{2\sqrt{\kappa _\textrm{max}}}\) is satisfied. We choose \(D_{\mathcal {W}}= 0.3, \epsilon = 0.1, \delta = 10^{-3}, \eta = 0.01\) and the same setting of \(\sigma ^2 = T\log (1/\delta ) D^2_{\mathcal {W}}/(n^2\epsilon ^2)\) and \(T = n\) for the proposed DP-RGD. For FM-LAP (Reimherr et al. , 2021), we recall the sensitivity is given by \(\varDelta _\textrm{FM} = \frac{2 D_{\mathcal {W}}(2 - h_{\mathcal {W}})}{n h_{\mathcal {W}}}\) where \(h_{\mathcal {W}}= 2 D_{\mathcal {W}}\sqrt{\kappa _\textrm{max}} \cot (2 D_{\mathcal {W}}\sqrt{\kappa _\textrm{max}})\) based on (Reimherr et al. 2021, Theorem 2) and then we set \(\sigma = 2\varDelta _\textrm{FM}/\epsilon\). For FM-KNG, we set the same \(\sigma\) while \(\varDelta _\textrm{FM} = \frac{2 D_{\mathcal {W}}(2-h_{\mathcal {W}})}{n}\) as discussed in Sect. 6.1. The sampling is performed the same way as in Sect. 6.1. Figure 2a shows the empirical excess risk averaged over 20 runs. We similarly observe that the proposed DP-RGD obtains an improved utility compared to the baselines under small-sample regime while performing competitively when sample size increases. The better performance of FM-KNG over FM-LAP is expected given its sensitivity is strictly smaller when the curvature is positive.

Fig. 2
figure 2

a Private Wasserstein barycenter between zero-centered Gaussians. and bc Private principal eigenvector computation on ijcnn1 and satimage respectively. All results are averaged for 20 runs. We observe the proposed DP-RGD shows better utility in terms of excess risk across all the problems

6.3 Results on private principal eigenvector computation

For the problem of principal eigenvector computation in Sect. 5.1, we consider two real datasets, ijcnn1 and satimage from the LibSVM library (Chang and Lin , 2011), with \((n, d) = (49\,990, 14)\) and \((6\,430, 36)\) respectively. We process the datasets by taking the z-score for each feature and dividing by the spectral norm of the data matrix.

To sample from the tangent Gaussian distribution on sphere, we first sample from \({\mathcal {N}}(0, \sigma ^2 I_{d})\) and then transforming using a basis matrix. This is because the metric tensor is \(G_w = I_d\). We then set the parameters \(T =\log (n^2\epsilon ^2/((d+1)L_0^2\log (1/\delta )))\) and \(\sigma ^2 = T \log (1/\delta )L_0^2/(n^2\epsilon ^2)\) according to Theorem 4 under Riemannian PL condition. We set \(\delta = 10^{-3}\) and vary the privacy parameter \(\epsilon\). We estimate \(L_0\) from the samples.

We compare Algorithm 1 with full gradient \((b = n)\), denoted as DP-RGD, against the projected gradient descent (with noise added in the ambient space), denoted as DP-PGD. In fact, the projected gradient descent on the sphere approximates the Riemannian gradient descent because the former updates by \(w_{t+1} = \frac{w_t + \zeta _t}{\Vert w_t + \zeta _t \Vert _2}\), where \(\zeta _t = \nabla F(w_t) + \epsilon _t\). This approximates the exponential map to the first order and is known as the retraction (Absil and Malick , 2012; Boumal , 2020). For this reason and the purpose of comparability, we set the same noise variance and the max iterations for both DP-RGD and DP-PGD. The stepsizes are tuned and fixed as 0.9 for DP-RGD and 10 for DP-PGD. We also include another baseline, which is input perturbation via Gaussian mechanism (GMIP) on the covariance matrix \(\frac{1}{n} \sum _{i=1}^n z_i z_i^\top\) (Dwork et al. , 2014b). It can be derived that the \(L_2\) sensitivity of \(g(\{ z_i \}_{i=1}^n) = \frac{1}{n} \sum _{i=1}^n z_i z_i^\top\) is \(\varDelta g \le L_0/n = 2/n \max _{i} z_i^\top z_i\) as shown before. By Gaussian mechanism, we add symmetric Gaussian noise to the covariance with \(\sigma = \varDelta g \sqrt{2\log (1.25/\delta )} /\epsilon\). Once the covariance matrix is perturbed, we use SVD to obtain the private output. From Figs. 2b and 2c, we see DP-RGD maintains the superior utility across different levels of privacy guarantees for both the datasets.

Fig. 3
figure 3

a Hyperbolic structured prediction (HSP) learning results comparison on distance to optimal (pretrained embeddings). b Low rank matrix completion (LRMC) on Movielens. The proposed DP-RGD/DP-RSGD achieves comparable utility as non-privatized RGD/RSGD

6.4 Results on private hyperbolic structured prediction

For the hyperbolic structured prediction discussed in Sect. 5.4, we consider the task of inferring hyperbolic embedding of the mammals subtree of WordNet dataset (Miller , 1998). We use the pretrained hyperbolic embeddings on \({\mathcal {H}}^2\) following Nickel and Kiela (2017), with the transitive closure consisting of \(n = 1\,181\) nodes (words) and \(6\,541\) edges (hierarchies). The features we use are obtained from the Laplacian eigenmap (Belkin and Niyogi , 2003) to dimension \(r = 3\) of the (symmetric) adjacency matrix formed by the edges. Hence we obtain \((x_i, y_i) \in {\mathbb {R}}^3 \times {\mathcal {H}}^2\), for \(i =1,...,n\). We choose ‘horse’ as the test point with the rest being the training samples. We use the RBF kernel, i.e., \(k(x, x') = \exp (- \Vert x - x' \Vert _2^2/(2 \upsilon ^2))\) with \(\upsilon = 0.2\). The regularization parameter \(\gamma\) is set as \(\gamma = 10^{-5}\).

Here, we compare only Riemannian gradient descent (RGD) and private Riemannian gradient descent in Algorithm 1 (DP-RGD) given there exists no available baseline under the private setting. We follow the nonconvex setting in Theorem 5 and derivation in Sect. 5.4 to set \(T = \frac{\sqrt{L_1} n \epsilon }{\sqrt{d\log (1/\delta )} L_0}\) with \(L_0 = \max _i |\alpha _i(x)| D_{\mathcal {W}}\) with \(D_{\mathcal {W}}= 10\) (estimated from the samples) and \(L_1 = D_{\mathcal {W}}/\tanh (D_{\mathcal {W}})\). We set \(\sigma ^2 = T \log (1/\delta ) L_0^2/(n^2\epsilon ^2)\). We tune and set the stepsize for RGD as 0.003 and for DP-RGD as 0.001. For RGD, we set max epochs to be 1500.

We show in Fig. 3a, the Riemannian distance between the prediction to the true embedding on \({\mathcal {H}}^2\) averaged over 20 runs. We observe DP-RGD converges reasonably well to the pretrained embeddings while ensuring the guaranteed \((\epsilon , \delta )\) privacy in around \(T= 1\,000\) iterations.

To further compare the utility between the two, we plot in Fig. 4 the hyperbolic embeddings on the Poincaré disk (an isometric representation for the hyperbolic space). We sample three predicted embeddings from DP-RGD due to the randomness in the noise. We see in (4b), the predictions from DP-RGD are close to the ground-truth while preserving the hierarchical structures.

Fig. 4
figure 4

Hyperbolic structured prediction (HSP) on ‘horse’ where we compare the learning results from RGD and DP-RGD. We plot 1 prediction from RGD and 3 predictions from DP-RGD (due to the randomness) along with the true embedding. a Full embeddings , b Zoomed-in embeddings for ‘horse’. We observe the prediction from DP-RGD is reasonably well where the hierarchical structures are preserved

6.5 Results on private low rank matrix completion

We finally consider the recommender system on Movielens-1 M dataset (Maxwell Harper and Konstan , 2015) for the application of low rank matrix completion. We follow the same procedures in Han and Gao (2021a) to generate train and test data. We compare the proposed DP Riemannian stochastic gradient descent (DP-RSGD) with non-privatized RSGD for matrix completion (Balzano et al. , 2010). To sample noise from the tangent Gaussian distribution, we follow the same steps as for the sphere manifold via basis construction given by Huang et al. (2017). We choose a batch size \(b = 10\) and a fixed tuned stepsize \(\eta = 10^{-5}\) for RSGD and \(5\times 10^{-6}\) for DP-RSGD. We choose \(T = 10^3\), \(\sigma ^2 = c \frac{T \log (1/\delta )}{n^2 \epsilon ^2}\) with \(c = 10^4\). This ensures \(\sigma\) is sufficiently large and satisfies the condition under subsampling (Lemma 3). The results are averaged over 5 random initializations. From Fig. 3b, we see proposed DP-RSGD achieves comparable utility as RSGD particularly for training set while preserving privacy.

7 Concluding remarks

We propose a general framework to ensure differential privacy for ERM problems on Riemannian manifolds. Since the tangent space of a manifold is a vector space, we propose a Gaussian mechanism that adheres to the metric structure in the tangent space. We also prove privacy as well as utility guarantees for our differentially private Riemannian (stochastic) gradient descent method. Our work also provides the first unified framework to perform differentially private optimization on Riemannian manifolds. We have explored various interesting applications on manifolds where privacy is of concern. We expect the present work to motivate the development of other classes of differentially private algorithms on manifolds, which is an increasingly important topic.

We lastly comment on the complexity of the proposed framework. The extra computational effort compared to the baselines mostly comes from the need for sampling and would depend on the manifolds considered. For the sphere, hyperbolic, and Grassmann manifolds, sampling can be performed efficiently by directly sampling a standard Gaussian noise and transforming via a basis matrix. For the SPD manifold, we have implemented a sampling procedure based on the Metropolis-Hastings algorithm, which may be inefficient for SPD matrices with large size. In our follow-up work (Utpala et al. , 2023), we have addressed the computational limitation of the proposed framework and propose faster sampling procedures for all manifolds considered.