Article

A Novel Approach to Canonical Divergences within Information Geometry

1 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig 04103, Germany
2 Faculty of Mathematics and Computer Science, University of Leipzig, PF 100920, Leipzig 04009, Germany
3 Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
4 Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako-shi Hirosawa 2-1, Saitama 351-0198, Japan
*
Author to whom correspondence should be addressed.
Submission received: 12 October 2015 / Revised: 21 November 2015 / Accepted: 25 November 2015 / Published: 9 December 2015

Abstract

A divergence function on a manifold $M$ defines a Riemannian metric $g$ and dually coupled affine connections $\nabla$ and $\nabla^*$ on $M$. When $M$ is dually flat, that is, flat with respect to $\nabla$ and $\nabla^*$, a canonical divergence is known, which is uniquely determined from $(M, g, \nabla, \nabla^*)$. We propose a natural definition of a canonical divergence for a general, not necessarily flat, $M$ by using the geodesic integration of the inverse exponential map. The new definition of a canonical divergence reduces to the known canonical divergence in the case of dual flatness. Finally, we show that the integrability of the inverse exponential map implies the geodesic projection property.

1. Introduction: Divergence and Dual Geometry

A divergence function $D(p \| q)$ is a differentiable real-valued function of two points $p$ and $q$ in a manifold $M$. It satisfies the non-negativity condition
$$D(p \| q) \geq 0 \tag{1}$$
with equality if and only if $p = q$. Thus, it is a distance-like function, but it does not necessarily share all properties of a distance. For instance, it can be asymmetric in $p$ and $q$. When a coordinate system $\xi : p \mapsto \xi_p = (\xi_p^1, \dots, \xi_p^n) \in \mathbb{R}^n$ is given in $M$, we impose one condition: for two nearby points $\xi_p$ and $\xi_q = \xi_p + \Delta\xi$, $D$ is expanded as
$$D(p \| q) = \frac{1}{2}\, g^{D}_{ij}(p)\, \Delta\xi^i \Delta\xi^j + O(\|\Delta\xi\|^3) \tag{2}$$
and $(g^{D}_{ij}(p))_{ij}$ is a positive definite matrix. Here, the Einstein summation convention is used, which means that summation is taken with respect to any index that appears twice in a term, as a lower as well as an upper index. Throughout the paper, we apply this convention or explicitly use the summation sign. The coefficients $g^{D}_{ij}$ in Equation (2) define a Riemannian metric $g^D$. Furthermore, the divergence function $D$ also allows us to define a pair of dual affine connections [1]. To be more explicit, we consider coordinates $\xi_p = (\xi_p^1, \dots, \xi_p^n)$ of $p$ and coordinates $\xi_q = (\xi_q^1, \dots, \xi_q^n)$ of $q$ and introduce the following simplified notation for differentiation:
$$\partial_i := \frac{\partial}{\partial \xi_p^i}, \qquad \partial'_i := \frac{\partial}{\partial \xi_q^i} \tag{3}$$
With $D(\xi_p \| \xi_q) = D(p \| q)$, the coefficients of the Riemannian metric can be written as
$$g^{D}_{ij}(p) = -\partial_i \partial'_j D(\xi_p \| \xi_q)\big|_{q=p} = \partial'_i \partial'_j D(\xi_p \| \xi_q)\big|_{q=p} \tag{4}$$
Furthermore, the coefficients
$$\Gamma^{D}_{ijk}(p) = -\partial_i \partial_j \partial'_k D(\xi_p \| \xi_q)\big|_{q=p} \tag{5}$$
$$\Gamma^{*D}_{ijk}(p) = -\partial'_i \partial'_j \partial_k D(\xi_p \| \xi_q)\big|_{q=p} \tag{6}$$
define a pair of dual affine connections $\nabla^D$ and $\nabla^{*D}$ [1]. The duality of the connections holds with respect to the Riemannian metric $g^D$ in terms of the following condition:
$$X \langle Y, Z \rangle = \langle \nabla^{D}_X Y, Z \rangle + \langle Y, \nabla^{*D}_X Z \rangle \tag{7}$$
for all vector fields $X$, $Y$ and $Z$, where the brackets $\langle \cdot, \cdot \rangle$ denote the inner product with respect to $g^D$ [2].
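As an illustrative numerical check of Equation (4), not part of the original derivation, the following Python sketch estimates $g^{D}_{ij}(p) = -\partial_i \partial'_j D$ at $q = p$ by central differences. It assumes the relative entropy on positive measures (introduced in Definition 1 of Section 4.1) as the divergence and the components $p_i$ as coordinates; the function names `kl` and `metric_from_divergence` and the step size `h` are hypothetical choices made for this example.

```python
import math

def kl(p, q):
    """Relative entropy on positive measures: sum(q_i - p_i + p_i * ln(p_i / q_i))."""
    return sum(qi - pi + pi * math.log(pi / qi) for pi, qi in zip(p, q))

def metric_from_divergence(D, p, h=1e-4):
    """Estimate g_ij(p) = -d/dp_i d/dq_j D(p || q) at q = p by central differences."""
    n = len(p)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def D_shift(si, sj):
                # Shift the i-th coordinate of the first argument and
                # the j-th coordinate of the second argument.
                first, second = list(p), list(p)
                first[i] += si * h
                second[j] += sj * h
                return D(first, second)
            mixed = (D_shift(1, 1) - D_shift(1, -1)
                     - D_shift(-1, 1) + D_shift(-1, -1)) / (4 * h * h)
            g[i][j] = -mixed
    return g

p = [0.5, 1.0, 2.0]
g = metric_from_divergence(kl, p)
```

Under these assumptions the estimate should be close to the diagonal matrix with entries $1/p_i$, which is the Fisher metric of Section 3.1.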
The inverse problem is to find a divergence $D$ which generates a given geometrical structure $(M, g, \nabla, \nabla^*)$. Matumoto [3] showed that a divergence exists for any such manifold. However, it is not unique, and there are infinitely many divergences that give the same geometrical structure. When a manifold is dually flat, a canonical divergence was introduced by Amari and Nagaoka [2], which is a Bregman divergence. Extensions of the canonical divergence within conformal geometry have been studied by Kurose [4] and Matsuzoe [5]. The canonical divergence has nice properties such as the generalized Pythagorean theorem and the geodesic projection theorem. It is an important problem to define a canonical divergence in the general case. The present paper gives an answer to this problem by using the inverse exponential map. We already used the inverse exponential map in our previous work [6], where we studied a different divergence function. We could show that it recovers the metric $g$ in the sense of Equation (4) and has some consistency with the dual connections $\nabla$ and $\nabla^*$. However, it turns out that it does not reduce to the well-established canonical divergence in the dually flat case. The divergence introduced in the present article not only recovers the original geometry directly in terms of Equations (4)–(6), it also coincides with the original canonical divergence in the dually flat case.

2. A New Approach to the General Inverse Problem

We begin with a motivation in terms of a simple example where the manifold is $\mathbb{R}^n$ equipped with the standard Euclidean metric and connection (here, the Levi-Civita connection): Let $p$ be a fixed point in $\mathbb{R}^n$, and consider the vector field pointing to $p$, that is
$$\mathbb{R}^n \to \mathbb{R}^n, \qquad q \mapsto p - q \tag{8}$$
The vector field Equation (8) can be seen as the negative gradient of the squared distance
$$D_p : \mathbb{R}^n \to \mathbb{R}, \qquad q \mapsto D_p(q) := D(p \| q) := \frac{1}{2} \|p - q\|^2 = \frac{1}{2} \sum_{i=1}^n (p_i - q_i)^2$$
as potential function, that is
$$p - q = -\operatorname{grad}_q D_p \tag{9}$$
Here, the gradient $\operatorname{grad}_q$ is taken with respect to the canonical inner product on $\mathbb{R}^n$.
We shall now generalize the relation Equation (9) between the squared distance $D_p$ and the difference of two points $p$ and $q$ to the more general setting of a differentiable manifold $M$. Given a fixed point $p \in M$, we want to define a vector field $q \mapsto X(q,p)$, at least in a neighbourhood of $p$, that corresponds to the difference vector field Equation (8). The problem is that the difference $p - q$ is not naturally defined for a general manifold $M$. We need an affine connection $\nabla$ in order to have a notion of a difference. Given such a connection $\nabla$, for each point $q \in M$ and each direction $X \in T_q M$ we consider the geodesic $\gamma_{q,X}(t)$ with initial point $q$ and initial velocity $X$, that is, $\gamma_{q,X}(0) = q$ and $\dot\gamma_{q,X}(0) = X$. If $\gamma_{q,X}(t)$ is defined for all $0 \leq t \leq 1$, the endpoint $p = \gamma_{q,X}(1)$ is interpreted as the result of a translation of the point $q$ along a straight line in the direction of the vector $X$. This straightness is expressed in terms of the local coordinates $\xi(t) := (\xi^1(t), \dots, \xi^n(t)) := \xi(\gamma_{q,X}(t))$ of the geodesic $\gamma_{q,X}$ by the following set of differential equations:
$$\ddot\xi^i(t) + \Gamma^i_{jk}(\xi(t))\, \dot\xi^j(t)\, \dot\xi^k(t) = 0, \qquad i = 1, \dots, n \tag{10}$$
The translation of points along geodesics defines a map, the so-called exponential map:
$$\exp_q : U_q \to M, \qquad X \mapsto \gamma_{q,X}(1) \tag{11}$$
where $U_q \subseteq T_q M$ denotes the set of tangent vectors $X$ for which the domain of $\gamma_{q,X}$ contains the unit interval $[0,1]$.
Given two points $p$ and $q$, one can interpret any $X$ with $\exp_q(X) = p$ as a difference vector $X$ that translates $q$ to $p$. Throughout this paper we assume the existence and uniqueness of such a difference vector, denoted by $X(q,p)$ (see Figure 1).
Figure 1. Illustration of (A) the difference vector $p - q$ in $\mathbb{R}^n$ pointing from $q$ to $p$; and (B) the difference vector $X(q,p) = \dot\gamma_{q,p}(0)$ as the inverse of the exponential map in $q$.
This is a strong assumption, which is, however, always locally satisfied. On the one hand, we are mainly interested in local properties. On the other hand, although quite restrictive in general, this property will be satisfied in our information-geometric context, where $g$ is given by the Fisher metric and $\nabla$ is given by the m- and e-connections and their convex combinations, the α-connections.
If we attach to each point $q \in M$ the difference vector $X(q,p)$, we obtain a vector field that corresponds to the vector field Equation (8) in $\mathbb{R}^n$. In order to interpret this vector field as a negative gradient field of a (squared) distance function, and thereby generalize Equation (9), we need a Riemannian metric $g$ on $M$. Given such a metric, we assume integrability of $X$ and $\nabla$, respectively, in the sense that for all $p$ there exists a function $D_p$ satisfying
$$X(q,p) = -\operatorname{grad}_q D_p \tag{12}$$
Here, the Riemannian gradient is taken with respect to $g$, which is defined by the property that the total differential $d_q D_p$ can be expressed as an inner product:
$$\langle \operatorname{grad}_q D_p, Y \rangle = d_q D_p(Y), \qquad Y \in T_q M$$
If there are functions $D_p$ satisfying the condition of Equation (12), then they are unique up to a constant that can vary with $p$, and we can therefore assume $D_p(p) = 0$. Throughout the paper we will also use the standard notation $D(p \| q) = D_p(q)$ of a divergence as a function $D$ of two arguments. In order to recover $D$ from Equation (12), we consider any curve $\gamma : [0,1] \to M$ that connects $q$ with $p$, that is, $\gamma(0) = q$ and $\gamma(1) = p$. We take the inner product of the curve velocity $\dot\gamma(t)$ with the inverse of the exponential map $X(\gamma(t), p)$ in $\gamma(t)$ and integrate this along the curve:
$$\begin{aligned}
\int_0^1 \langle X(\gamma(t), p), \dot\gamma(t) \rangle \, dt
&= -\int_0^1 \langle \operatorname{grad}_{\gamma(t)} D_p, \dot\gamma(t) \rangle \, dt \\
&= -\int_0^1 (d_{\gamma(t)} D_p)(\dot\gamma(t)) \, dt \\
&= -\int_0^1 \frac{d (D_p \circ \gamma)}{dt}(t) \, dt \\
&= D_p(\gamma(0)) - D_p(\gamma(1)) \\
&= D_p(q) - D_p(p) = D_p(q) = D(p \| q)
\end{aligned} \tag{13}$$
In particular, we can apply this derivation to the geodesic connecting $q$ and $p$ even when the integrability of $X$ is not guaranteed and obtain the definition of a general canonical divergence, discussed in more detail in Section 5. Before we treat the general definition of a canonical divergence, however, we discuss important special cases of divergences within the cone of positive measures and the simplex of probability measures included in it. In particular, we verify that the well-known relative entropy (KL-divergence) and the α-entropy (α-divergence) can be derived in terms of Equation (13).
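The path independence expressed by Equation (13) can be illustrated numerically in the Euclidean example of Equations (8) and (9). The following Python sketch is only an illustration under stated assumptions: the points `p` and `q`, the bump direction `w`, the two test curves, and the midpoint quadrature rule are hypothetical choices, not taken from the article.

```python
import math

def path_integral(gamma, gamma_dot, p, steps=2000):
    """Midpoint-rule estimate of int_0^1 <p - gamma(t), gamma_dot(t)> dt
    with the Euclidean inner product, where X(x, p) = p - x as in Equation (8)."""
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        x, v = gamma(t), gamma_dot(t)
        total += sum((pi - xi) * vi for pi, xi, vi in zip(p, x, v))
    return total / steps

p, q = [2.0, 1.0], [-1.0, 3.0]
w = [0.7, -0.4]  # arbitrary direction used to bend the second curve

# Straight line from q to p.
def straight(t):
    return [qi + t * (pi - qi) for pi, qi in zip(p, q)]

def straight_dot(t):
    return [pi - qi for pi, qi in zip(p, q)]

# A bent curve with the same endpoints (sin(pi * t) vanishes at t = 0 and t = 1).
def bent(t):
    return [qi + t * (pi - qi) + math.sin(math.pi * t) * wi
            for pi, qi, wi in zip(p, q, w)]

def bent_dot(t):
    return [pi - qi + math.pi * math.cos(math.pi * t) * wi
            for pi, qi, wi in zip(p, q, w)]

half_squared_distance = 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))
along_straight = path_integral(straight, straight_dot, p)
along_bent = path_integral(bent, bent_dot, p)
```

Both integrals should agree with $\frac{1}{2}\|p - q\|^2$ up to quadrature error, reflecting that the difference field is a gradient field.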

3. Natural Connections for Positive and Probability Measures

3.1. The Fisher Metric and Its Gradients

We represent measures on the set $\{1, \dots, n\}$ as elements of $\mathbb{R}^n$. In this representation, the Dirac measures $\delta^i$, $i = 1, \dots, n$, form the canonical basis of $\mathbb{R}^n$. We consider the $n$-dimensional cone of positive measures on the set $\{1, \dots, n\}$, defined by
$$M_n := \mathbb{R}^n_+ = \left\{ p = \sum_{i=1}^n p_i \, \delta^i \in \mathbb{R}^n : p_i > 0 \text{ for all } i \right\}$$
and the corresponding $(n-1)$-dimensional simplex of normalized measures (probability measures) $S_{n-1} \subseteq M_n$:
$$S_{n-1} := \left\{ p = \sum_{i=1}^n p_i \, \delta^i \in \mathbb{R}^n : p_i > 0 \text{ for all } i, \text{ and } \sum_{i=1}^n p_i = 1 \right\}$$
There is a natural Riemannian metric on $M_n$, called the Fisher metric:
$$g_p(X, Y) := \sum_{i=1}^n \frac{1}{p_i} X_i Y_i, \qquad X, Y \in T_p M_n$$
In theoretical biology, the Fisher metric is also known as the Shahshahani metric (see [7], Equation (7.48)). Given a point $p \in S_{n-1}$ and a vector $X \in T_p M_n$, its projection onto $T_p S_{n-1}$ with respect to $g_p$ is given by
$$\Pi_p X = \sum_{i=1}^n \left( X_i - p_i \sum_{j=1}^n X_j \right) \delta^i \tag{14}$$
and the corresponding projection onto the orthogonal complement of $T_p S_{n-1}$ is given by
$$\Pi_p^{\perp} X = \sum_{i=1}^n p_i \left( \sum_{j=1}^n X_j \right) \delta^i \tag{15}$$
For a function $V : M_n \to \mathbb{R}$, this metric implies the Riemannian gradient
$$\operatorname{grad}_p V = \sum_{i=1}^n p_i \, \frac{\partial V}{\partial p_i}(p) \, \delta^i \tag{16}$$
A vector field
$$X_p = \sum_{i=1}^n p_i \, f_i(p) \, \delta^i, \qquad p \in M_n \tag{17}$$
is the gradient of a function $V$ if and only if it satisfies for all $i, j$
$$\frac{\partial f_i}{\partial p_j} = \frac{\partial f_j}{\partial p_i} \tag{18}$$
If we consider a function that is defined on $S_{n-1}$, for instance the restriction of $V : M_n \to \mathbb{R}$ to $S_{n-1}$, then the vector Equation (16), evaluated in $p \in S_{n-1}$, will not necessarily be an element of $T_p S_{n-1}$. Therefore, in order to evaluate the gradient on $S_{n-1}$, we have to project the vector Equation (16) onto $T_p S_{n-1}$ with respect to the metric $g$ by using Equation (14). This leads to the following gradient formula for functions on $S_{n-1}$:
$$\operatorname{grad}_p V = \sum_{i=1}^n p_i \left( \frac{\partial V}{\partial p_i}(p) - \sum_{j=1}^n p_j \frac{\partial V}{\partial p_j}(p) \right) \delta^i, \qquad p \in S_{n-1} \tag{19}$$
This motivates considering general vector fields of the form
$$X_p = \sum_{i=1}^n p_i \left( f_i(p) - \sum_{j=1}^n p_j f_j(p) \right) \delta^i, \qquad p \in S_{n-1} \tag{20}$$
Such a vector field is integrable, in the sense that it is the gradient Equation (19) of a potential function $V$, if and only if the following condition holds for all $i, j, k$ (see [7], Equation (19.23)):
$$\frac{\partial f_i}{\partial p_j} + \frac{\partial f_j}{\partial p_k} + \frac{\partial f_k}{\partial p_i} = \frac{\partial f_i}{\partial p_k} + \frac{\partial f_k}{\partial p_j} + \frac{\partial f_j}{\partial p_i} \tag{21}$$
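The projections of Equations (14) and (15) and the simplex gradient of Equation (19) can be sketched in a few lines of Python. This is a minimal illustration, assuming vectors are stored as plain lists of components; the function names and the sample values for `p`, `X`, and `partials` are hypothetical.

```python
def fisher(p, X, Y):
    """Fisher inner product g_p(X, Y) = sum_i X_i * Y_i / p_i."""
    return sum(x * y / pi for pi, x, y in zip(p, X, Y))

def project_tangent(p, X):
    """Equation (14): projection onto the tangent space of the simplex at p."""
    s = sum(X)
    return [x - pi * s for pi, x in zip(p, X)]

def project_normal(p, X):
    """Equation (15): projection onto the Fisher-orthogonal complement."""
    s = sum(X)
    return [pi * s for pi in p]

def simplex_gradient(p, partials):
    """Equation (19): gradient on the simplex from the partial derivatives dV/dp_i."""
    mean = sum(pi * d for pi, d in zip(p, partials))
    return [pi * (d - mean) for pi, d in zip(p, partials)]

p = [0.2, 0.3, 0.5]            # a point of the simplex
X = [1.0, -2.0, 0.5]           # an arbitrary vector in T_p M_n
tangent = project_tangent(p, X)
normal = project_normal(p, X)

partials = [0.4, -1.0, 2.0]    # hypothetical values of dV/dp_i at p
grad_simplex = simplex_gradient(p, partials)
# Equation (19) is the projection (14) applied to the cone gradient (16).
grad_projected = project_tangent(p, [pi * d for pi, d in zip(p, partials)])
```

The tangent part sums to zero, the two parts add up to `X`, and they are orthogonal with respect to the Fisher metric.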

3.2. The Mixture and the Exponential Connections

After having introduced the Fisher metric and corresponding gradient fields, we now define natural notions of straight lines on $M_n$ and $S_{n-1}$, respectively, induced by corresponding affine connections. Let us first introduce the straight lines of the so-called mixture connection $\nabla^{(m)}$ on $M_n$. Given a point $p \in M_n$ and a direction $X \in T_p M_n$, the most natural way to define a straight line that starts in $p$ and has velocity $X$ is given by the so-called m-geodesic
$$\gamma(t) = p + t X \tag{22}$$
We obtain the exponential map for $t = 1$, which is, in this simple example, the translation:
$$\exp^{(m)}_p(X) = p + X$$
The inverse, therefore, maps a point $q$ to the difference vector that translates $p$ into $q$:
$$X^{(m)}(p, q) := \left( \exp^{(m)}_p \right)^{-1}(q) = q - p$$
With this difference as $X$ in Equation (22), we obtain the geodesic that connects $p$ with $q$:
$$\gamma(t) = p + t (q - p) \tag{23}$$
If we choose a point $p \in S_{n-1}$ and $X \in T_p S_{n-1}$, or two points $p, q \in S_{n-1}$, respectively, then the corresponding geodesics Equation (22) and Equation (23) will stay in $S_{n-1}$. Therefore, the restriction of the exponential map to $T_p S_{n-1}$ and its inverse are trivial:
$$\overline{\exp}^{(m)}_p(X) = p + X, \qquad \overline{X}^{(m)}(p, q) := \left( \overline{\exp}^{(m)}_p \right)^{-1}(q) = q - p$$
where we use a bar over symbols in order to denote the restriction of corresponding objects to $S_{n-1}$.
Now let us come to the notion of an e-geodesic and the exponential map of the so-called e-connection $\nabla^{(e)}$. Given a point $p \in M_n$ and a direction $X \in T_p M_n$, we consider the geodesic
$$\gamma(t) = \sum_{i=1}^n p_i \exp\!\left( t \, \frac{X_i}{p_i} \right) \delta^i \tag{24}$$
(The “exp” on the right-hand side of Equation (24) denotes the standard real-valued natural exponential function.) The exponential map of the e-connection is given for $t = 1$:
$$\exp^{(e)}_p(X) = \sum_{i=1}^n p_i \exp\!\left( \frac{X_i}{p_i} \right) \delta^i$$
with the inverse
$$X^{(e)}(p, q) := \left( \exp^{(e)}_p \right)^{-1}(q) = \sum_{i=1}^n p_i \ln \frac{q_i}{p_i} \, \delta^i$$
This implies that the e-geodesic connecting $p$ with $q$ is given by
$$\gamma(t) = \sum_{i=1}^n p_i \left( \frac{q_i}{p_i} \right)^{t} \delta^i \tag{25}$$
Clearly, if we start in a point $p \in S_{n-1}$ and go along the e-geodesic Equation (24) in a direction $X$ that is tangential to $S_{n-1}$, we will not stay in $S_{n-1}$. Analogously, if we connect a point $p \in S_{n-1}$ with a point $q \in S_{n-1}$ in terms of the e-geodesic Equation (25), then the intermediate points will in general not be in the set $S_{n-1}$. It turns out that, in order to obtain the right exponential map of the e-connection defined on $S_{n-1}$, we have to normalize the geodesic, which leads to:
$$\overline{\exp}^{(e)}_p(X) = \sum_{i=1}^n \frac{p_i \exp\!\left( \frac{X_i}{p_i} \right)}{\sum_{j=1}^n p_j \exp\!\left( \frac{X_j}{p_j} \right)} \, \delta^i$$
$$\overline{X}^{(e)}(p, q) := \left( \overline{\exp}^{(e)}_p \right)^{-1}(q) = \sum_{i=1}^n p_i \left( \ln \frac{q_i}{p_i} - \sum_{j=1}^n p_j \ln \frac{q_j}{p_j} \right) \delta^i$$
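The exponential maps of this subsection and their inverses are simple enough to be checked by a round trip. The Python sketch below is an illustration only; the function names and the sample points `p` and `q` are hypothetical, and measures are represented as lists of positive components.

```python
import math

def exp_e(p, X):
    """Exponential map of the e-connection on the cone M_n."""
    return [pi * math.exp(xi / pi) for pi, xi in zip(p, X)]

def log_e(p, q):
    """Inverse exponential map X^(e)(p, q) on the cone M_n."""
    return [pi * math.log(qi / pi) for pi, qi in zip(p, q)]

def e_geodesic(p, q, t):
    """Unnormalized e-geodesic of Equation (25)."""
    return [pi * (qi / pi) ** t for pi, qi in zip(p, q)]

def exp_e_simplex(p, X):
    """Normalized exponential map of the e-connection on the simplex."""
    raw = exp_e(p, X)
    total = sum(raw)
    return [r / total for r in raw]

def log_e_simplex(p, q):
    """Inverse of the normalized exponential map on the simplex."""
    mean = sum(pi * math.log(qi / pi) for pi, qi in zip(p, q))
    return [pi * (math.log(qi / pi) - mean) for pi, qi in zip(p, q)]

p = [0.2, 0.3, 0.5]
q = [0.6, 0.3, 0.1]

cone_round_trip = exp_e(p, log_e(p, q))
simplex_tangent = log_e_simplex(p, q)
simplex_round_trip = exp_e_simplex(p, simplex_tangent)
midpoint_mass = sum(e_geodesic(p, q, 0.5))  # total mass of the unnormalized midpoint
```

For these sample points the unnormalized midpoint has total mass below one, which illustrates why the e-geodesic has to be normalized on $S_{n-1}$, while the restricted inverse map returns a vector whose components sum to zero.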

3.3. The α-Connections

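Given $\alpha \in [-1, 1]$, we define the following convex combination of the mixture connection $\nabla^{(m)}$ and the exponential connection $\nabla^{(e)}$ on $M_n$:
$$\nabla^{(\alpha)} := \frac{1 - \alpha}{2} \nabla^{(m)} + \frac{1 + \alpha}{2} \nabla^{(e)} = \nabla^{(m)} + \frac{1 + \alpha}{2} \left( \nabla^{(e)} - \nabla^{(m)} \right) \tag{26}$$
The differential equation for the α-geodesic with initial point $p \in M_n$ and initial velocity $X \in T_p M_n$ is given by
$$\ddot\gamma_i - \frac{1 + \alpha}{2} \, \frac{\dot\gamma_i^2}{\gamma_i} = 0, \qquad \gamma(0) = p, \quad \dot\gamma(0) = X \tag{27}$$
One can easily verify that Equation (27) is solved by the following curve:
$$\gamma(t) = \sum_{i=1}^n p_i \left( 1 + t \, \frac{1 - \alpha}{2} \, \frac{X_i}{p_i} \right)^{\frac{2}{1 - \alpha}} \delta^i \tag{28}$$
By setting $t = 1$, we can define the corresponding α-exponential map:
$$\exp^{(\alpha)}_p(X) = \sum_{i=1}^n p_i \left( 1 + \frac{1 - \alpha}{2} \, \frac{X_i}{p_i} \right)^{\frac{2}{1 - \alpha}} \delta^i \tag{29}$$
with the inverse
$$X^{(\alpha)}(p, q) := \left( \exp^{(\alpha)}_p \right)^{-1}(q) = \frac{2}{1 - \alpha} \sum_{i=1}^n p_i \left( \left( \frac{q_i}{p_i} \right)^{\frac{1 - \alpha}{2}} - 1 \right) \delta^i \tag{30}$$
Finally, the α-geodesic with initial point $p$ and endpoint $q$ is given by
$$\gamma(t) = \sum_{i=1}^n \left( p_i^{\frac{1 - \alpha}{2}} + t \left( q_i^{\frac{1 - \alpha}{2}} - p_i^{\frac{1 - \alpha}{2}} \right) \right)^{\frac{2}{1 - \alpha}} \delta^i \tag{31}$$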
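The α-exponential map of Equation (29), its inverse Equation (30), and the connecting geodesic Equation (31) can be checked against each other numerically. The following Python sketch is a hedged illustration for $\alpha \neq 1$ (the formulas divide by $1 - \alpha$); the function names and sample points are hypothetical.

```python
import math

def exp_alpha(p, X, alpha):
    """Equation (29): alpha-exponential map on the cone M_n (requires alpha != 1)."""
    c = (1.0 - alpha) / 2.0
    return [pi * (1.0 + c * xi / pi) ** (1.0 / c) for pi, xi in zip(p, X)]

def log_alpha(p, q, alpha):
    """Equation (30): inverse alpha-exponential map X^(alpha)(p, q)."""
    c = (1.0 - alpha) / 2.0
    return [pi / c * ((qi / pi) ** c - 1.0) for pi, qi in zip(p, q)]

def geodesic_alpha(p, q, alpha, t):
    """Equation (31): alpha-geodesic from p (t = 0) to q (t = 1)."""
    c = (1.0 - alpha) / 2.0
    return [(pi ** c + t * (qi ** c - pi ** c)) ** (1.0 / c) for pi, qi in zip(p, q)]

p = [0.5, 1.5, 2.0]
q = [1.0, 0.8, 2.5]
alpha = 0.3

X = log_alpha(p, q, alpha)
round_trip = exp_alpha(p, X, alpha)

# Equation (28) with X = X^(alpha)(p, q) should coincide with Equation (31).
t = 0.4
via_exp = exp_alpha(p, [t * xi for xi in X], alpha)
via_geodesic = geodesic_alpha(p, q, alpha, t)

# alpha = -1 gives the mixture case; alpha close to 1 approaches the exponential case.
mixture_case = log_alpha(p, q, -1.0)
near_exponential_case = log_alpha(p, q, 1.0 - 1e-6)
```

For $\alpha = -1$ the inverse map reduces to $q - p$, and for $\alpha$ close to $1$ it approaches $\sum_i p_i \ln(q_i / p_i)\, \delta^i$, in line with Equation (26).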
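The α-connection $\overline{\nabla}^{(\alpha)}$ on $S_{n-1}$ is defined as the projection of $\nabla^{(\alpha)}$ with respect to the Fisher metric $g$. The corresponding geodesic equation is a modification of Equation (27):
$$\ddot\gamma_i - \frac{1 + \alpha}{2} \left( \frac{\dot\gamma_i^2}{\gamma_i} - \gamma_i \sum_{j=1}^n \frac{\dot\gamma_j^2}{\gamma_j} \right) = 0, \qquad \gamma(0) = p, \quad \dot\gamma(0) = X \tag{32}$$
It is reasonable to make a solution ansatz by normalization of the unconstrained geodesics Equation (28) and Equation (31). However, it turns out that, in order to solve the geodesic Equation (32), both normalized curves have to be reparametrized. More precisely, it has been shown in [8] (Theorems 14.1. and 15.1.) that, with appropriate reparametrizations $\tau_{p,X}$ and $\tau_{p,q}$, we have the following form of the α-geodesic in the simplex $S_{n-1}$:
$$\gamma_{p,X}(t) = \sum_{i=1}^n \frac{p_i \left( 1 + \tau_{p,X}(t) \, \frac{1 - \alpha}{2} \, \frac{X_i}{p_i} \right)^{\frac{2}{1 - \alpha}}}{\sum_{j=1}^n p_j \left( 1 + \tau_{p,X}(t) \, \frac{1 - \alpha}{2} \, \frac{X_j}{p_j} \right)^{\frac{2}{1 - \alpha}}} \, \delta^i \tag{33}$$
and
$$\gamma_{p,q}(t) = \sum_{i=1}^n \frac{\left( p_i^{\frac{1 - \alpha}{2}} + \tau_{p,q}(t) \left( q_i^{\frac{1 - \alpha}{2}} - p_i^{\frac{1 - \alpha}{2}} \right) \right)^{\frac{2}{1 - \alpha}}}{\sum_{j=1}^n \left( p_j^{\frac{1 - \alpha}{2}} + \tau_{p,q}(t) \left( q_j^{\frac{1 - \alpha}{2}} - p_j^{\frac{1 - \alpha}{2}} \right) \right)^{\frac{2}{1 - \alpha}}} \, \delta^i \tag{34}$$
Here, the conditions
$$\gamma_{p,X}(0) = p, \quad \dot\gamma_{p,X}(0) = \dot\tau_{p,X}(0) \, X = X, \quad \text{and} \quad \gamma_{p,q}(0) = p, \quad \gamma_{p,q}(1) = q$$
imply
$$\tau_{p,X}(0) = 0, \quad \dot\tau_{p,X}(0) = 1, \quad \text{and} \quad \tau_{p,q}(0) = 0, \quad \tau_{p,q}(1) = 1$$
Now let us couple $X$ and $q$ by assuming $\gamma_{p,X}(1) = q$. Together with the condition $\sum_{i=1}^n X_i = 0$, this implies
$$X = \frac{1}{\tau_{p,X}(1)} \, \frac{2}{1 - \alpha} \sum_{i=1}^n p_i \left( \frac{\left( \frac{q_i}{p_i} \right)^{\frac{1 - \alpha}{2}}}{\sum_{j=1}^n p_j \left( \frac{q_j}{p_j} \right)^{\frac{1 - \alpha}{2}}} - 1 \right) \delta^i \tag{35}$$
Furthermore, if the initial points and endpoints of the two curves are identical, then $\gamma_{p,X}(t) = \gamma_{p,q}(t)$ for all $t$. In particular,
$$X = \dot\gamma_{p,X}(0) = \dot\gamma_{p,q}(0) = \dot\tau_{p,q}(0) \, \frac{2}{1 - \alpha} \sum_{i=1}^n p_i \left( \left( \frac{q_i}{p_i} \right)^{\frac{1 - \alpha}{2}} - \sum_{j=1}^n p_j \left( \frac{q_j}{p_j} \right)^{\frac{1 - \alpha}{2}} \right) \delta^i \tag{36}$$
A comparison of Equation (35) and Equation (36) yields
$$\dot\tau_{p,q}(0) \sum_{j=1}^n p_j \left( \frac{q_j}{p_j} \right)^{\frac{1 - \alpha}{2}} = \frac{1}{\tau_{p,X}(1)}$$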

4. Canonical Divergences for Positive and Probability Measures

4.1. The Relative Entropy (KL-Divergence)

Now we apply the ansatz of Equation (12) in order to define divergence functions for the m- and e-connections on the cone $M_n$ of positive measures. The inverse maps of the corresponding exponential maps are given by
$$X^{(m)}(q, p) = \sum_{i=1}^n (p_i - q_i) \, \delta^i, \qquad X^{(e)}(q, p) = \sum_{i=1}^n q_i \ln \frac{p_i}{q_i} \, \delta^i \tag{37}$$
We can easily verify that the corresponding vector fields
$$q \mapsto X^{(m)}(q, p), \qquad q \mapsto X^{(e)}(q, p)$$
are gradient fields. Written in the form of Equation (17), that is $X^{(m)}(q,p) = \sum_i q_i f_i(q) \, \delta^i$ and $X^{(e)}(q,p) = \sum_i q_i g_i(q) \, \delta^i$, the functions
$$f_i(q) := \frac{p_i}{q_i} - 1, \quad \text{and} \quad g_i(q) := \ln \frac{p_i}{q_i}$$
trivially satisfy the integrability condition $\frac{\partial f_i}{\partial q_j} = \frac{\partial f_j}{\partial q_i}$ and $\frac{\partial g_i}{\partial q_j} = \frac{\partial g_j}{\partial q_i}$ for all $i, j$, because $f_i$ and $g_i$ depend only on $q_i$. Therefore, for both connections, there are canonical divergence functions which solve the corresponding Equation (12).
We derive the canonical divergence of the m-connection first, which we denote by $D^{(m)}$. We consider two positive measures $p$ and $q$ and a curve $\gamma : [0,1] \to M_n$ connecting $q$ with $p$, that is, $\gamma(0) = q$ and $\gamma(1) = p$. This implies
$$\langle X^{(m)}(\gamma(t), p), \dot\gamma(t) \rangle = \sum_{i=1}^n \frac{1}{\gamma_i(t)} \left( p_i - \gamma_i(t) \right) \dot\gamma_i(t)$$
and
$$\begin{aligned}
D^{(m)}(p \| q) &= \int_0^1 \langle X^{(m)}(\gamma(t), p), \dot\gamma(t) \rangle \, dt \\
&= \sum_{i=1}^n \int_0^1 \frac{1}{\gamma_i(t)} \left( p_i - \gamma_i(t) \right) \dot\gamma_i(t) \, dt \\
&= \sum_{i=1}^n \Big[ p_i \ln \gamma_i(t) - \gamma_i(t) \Big]_0^1 \\
&= \sum_{i=1}^n \left( p_i \ln p_i - p_i - p_i \ln q_i + q_i \right) \\
&= \sum_{i=1}^n \left( q_i - p_i + p_i \ln \frac{p_i}{q_i} \right)
\end{aligned}$$
With the same calculation for the e-connection, we obtain the corresponding canonical divergence, which we denote by $D^{(e)}$. Again, we consider a curve $\gamma$ connecting $q$ with $p$. This implies
$$\langle X^{(e)}(\gamma(t), p), \dot\gamma(t) \rangle = \sum_{i=1}^n \dot\gamma_i(t) \ln \frac{p_i}{\gamma_i(t)}$$
and
$$\begin{aligned}
D^{(e)}(p \| q) &= \int_0^1 \langle X^{(e)}(\gamma(t), p), \dot\gamma(t) \rangle \, dt \\
&= \sum_{i=1}^n \int_0^1 \dot\gamma_i(t) \ln \frac{p_i}{\gamma_i(t)} \, dt \\
&= \sum_{i=1}^n \left[ \gamma_i(t) \left( 1 + \ln \frac{p_i}{\gamma_i(t)} \right) \right]_0^1 \\
&= \sum_{i=1}^n \left( p_i - q_i \left( 1 + \ln \frac{p_i}{q_i} \right) \right) \\
&= \sum_{i=1}^n \left( p_i - q_i + q_i \ln \frac{q_i}{p_i} \right) = D^{(m)}(q \| p)
\end{aligned}$$
These calculations give rise to the following definition:
Definition 1.
The function $D : M_n \times M_n \to \mathbb{R}$ defined by
$$D(p \| q) := \sum_{i=1}^n q_i - \sum_{i=1}^n p_i + \sum_{i=1}^n p_i \ln \frac{p_i}{q_i} \tag{41}$$
is called the relative entropy or Kullback–Leibler divergence. Its restriction to the set of probability distributions is given by
$$D(p \| q) := \sum_{i=1}^n p_i \ln \frac{p_i}{q_i} \tag{42}$$
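The two path integrals above can be reproduced numerically. The Python sketch below integrates $\langle X(\gamma(t), p), \dot\gamma(t) \rangle$ with respect to the Fisher metric along two different curves from $q$ to $p$ and compares the result with Equation (41). It is an illustration under stated assumptions: the sample points, the choice of test curves (a straight line and the curve $\gamma_i(t) = q_i (p_i/q_i)^t$), the midpoint rule, and all function names are hypothetical.

```python
import math

def kl(p, q):
    """Equation (41): relative entropy on positive measures."""
    return sum(qi - pi + pi * math.log(pi / qi) for pi, qi in zip(p, q))

def fisher(x, U, V):
    """Fisher inner product at the point x."""
    return sum(u * v / xi for xi, u, v in zip(x, U, V))

def X_m(x, p):
    """Inverse exponential map of the m-connection, Equation (37)."""
    return [pi - xi for pi, xi in zip(p, x)]

def X_e(x, p):
    """Inverse exponential map of the e-connection, Equation (37)."""
    return [xi * math.log(pi / xi) for pi, xi in zip(p, x)]

def integrate(field, curve, curve_dot, p, steps=4000):
    """Midpoint-rule estimate of int_0^1 <field(gamma(t), p), gamma_dot(t)> dt."""
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        x = curve(t)
        total += fisher(x, field(x, p), curve_dot(t))
    return total / steps

p = [0.5, 1.5, 2.0]
q = [1.0, 0.8, 2.5]

def line(t):       # straight line from q (t = 0) to p (t = 1)
    return [qi + t * (pi - qi) for pi, qi in zip(p, q)]

def line_dot(t):
    return [pi - qi for pi, qi in zip(p, q)]

def power(t):      # gamma_i(t) = q_i * (p_i / q_i)**t, also from q to p
    return [qi * (pi / qi) ** t for pi, qi in zip(p, q)]

def power_dot(t):
    return [qi * (pi / qi) ** t * math.log(pi / qi) for pi, qi in zip(p, q)]

d_m_line = integrate(X_m, line, line_dot, p)
d_m_power = integrate(X_m, power, power_dot, p)
d_e_line = integrate(X_e, line, line_dot, p)
```

Up to quadrature error, `d_m_line` and `d_m_power` should both match `kl(p, q)`, and `d_e_line` should match `kl(q, p)`, in agreement with $D^{(e)}(p \| q) = D^{(m)}(q \| p)$.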
Proposition 1.
The following holds:
$$X^{(m)}(q, p) = -\operatorname{grad}_q D(p \| \cdot), \qquad X^{(e)}(q, p) = -\operatorname{grad}_q D(\cdot \| p) \tag{43}$$
Furthermore, $D$ is the only function on $M_n \times M_n$ that satisfies the conditions Equation (43) and $D(p \| p) = 0$ for all $p$.
Proof.
We first compute the partial derivatives
$$\frac{\partial D(p \| \cdot)}{\partial q_i}(q) = -\frac{p_i}{q_i} + 1, \qquad \frac{\partial D(\cdot \| p)}{\partial q_i}(q) = -\ln \frac{p_i}{q_i}$$
With Formula (16), we obtain
$$\left( \operatorname{grad}_q D(p \| \cdot) \right)_i = q_i \left( -\frac{p_i}{q_i} + 1 \right) = -p_i + q_i, \qquad \left( \operatorname{grad}_q D(\cdot \| p) \right)_i = -q_i \ln \frac{p_i}{q_i}$$
A comparison with Equation (37) verifies Equation (43), which uniquely characterizes $D(p \| \cdot)$ as well as $D(\cdot \| p)$, up to a constant depending on $p$. With the additional assumption $D(p \| p) = 0$ for all $p$, this constant is fixed. ☐
One can now ask whether the restriction Equation (42) of the Kullback–Leibler divergence to the manifold $S_{n-1}$ is the right divergence function in the sense that Equation (43) also holds for the exponential maps of the restricted m- and e-connections. It is easy to verify that this is indeed the case. Let us elaborate on the geometric reason for this. We consider a general Riemannian manifold $M$ and a submanifold $N$ in it. Given an affine connection $\nabla$ on $M$, we can define its restriction $\overline{\nabla}$ to $N$. More precisely, denoting the projection of a vector $Z$ in $T_p M$ onto $T_p N$ by $\Pi_p(Z)$, we define $\overline{\nabla}_X Y \big|_p := \Pi_p\!\left( \nabla_X Y \big|_p \right)$, where $X$ and $Y$ are vector fields on $N$. Furthermore, we denote the exponential map of $\overline{\nabla}$ by $\overline{\exp}_p$ and its inverse by $\overline{X}(p, q)$.
Now, given $p \in N$, we consider a function $D_p$ on $M$ which satisfies Equation (12). With the restriction $\overline{D}_p$ of $D_p$ to the submanifold $N$, this directly implies
$$\Pi_q X(q, p) = -\operatorname{grad}_q \overline{D}_p$$
However, in order to have $\overline{X}(q, p) = -\operatorname{grad}_q \overline{D}_p$, which corresponds to Equation (12) on the submanifold $N$, the following equality is required:
$$\overline{X}(q, p) = \Pi_q X(q, p) \tag{44}$$
This condition is satisfied for the m- and e-connections on $M_n$ and its submanifold $S_{n-1}$, which implies the following proposition.
Proposition 2.
The following holds:
$$\overline{X}^{(m)}(q, p) = -\operatorname{grad}_q D(p \| \cdot), \qquad \overline{X}^{(e)}(q, p) = -\operatorname{grad}_q D(\cdot \| p) \tag{45}$$
where $D$ is given by Equation (42) in Definition 1. Furthermore, $D$ is the only function on $S_{n-1} \times S_{n-1}$ that satisfies the conditions (45) and $D(p \| p) = 0$ for all $p$.
The objects and derivations of this section represent a special case of a general dually flat manifold $M$, which will be studied in Section 5.

4.2. The α-Divergence

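We now extend the method of Section 4.1 to the α-connections, leading to a generalization of the relative entropy, the so-called α-divergence. From the definition of the α-exponential map on the manifold $M_n$ of positive measures, given in Equation (29), we obtain the inverse
$$X^{(\alpha)}(q, p) := \left( \exp^{(\alpha)}_q \right)^{-1}(p) = \frac{2}{1 - \alpha} \sum_{i=1}^n q_i \left( \left( \frac{p_i}{q_i} \right)^{\frac{1 - \alpha}{2}} - 1 \right) \delta^i \tag{46}$$
In order to derive the canonical divergence $D^{(\alpha)}$ of the α-connection, which is integrable, we consider two points $p$ and $q$ and a curve $\gamma : [0,1] \to M_n$ connecting $q$ with $p$. We obtain
$$\langle X^{(\alpha)}(\gamma(t), p), \dot\gamma(t) \rangle = \frac{2}{1 - \alpha} \sum_{i=1}^n \dot\gamma_i(t) \left( \left( \frac{p_i}{\gamma_i(t)} \right)^{\frac{1 - \alpha}{2}} - 1 \right)$$
and
$$\begin{aligned}
D^{(\alpha)}(p \| q) &= \int_0^1 \langle X^{(\alpha)}(\gamma(t), p), \dot\gamma(t) \rangle \, dt \\
&= \sum_{i=1}^n \int_0^1 \frac{2}{1 - \alpha} \, \dot\gamma_i(t) \left( \left( \frac{p_i}{\gamma_i(t)} \right)^{\frac{1 - \alpha}{2}} - 1 \right) dt \\
&= \sum_{i=1}^n \left[ \frac{4}{1 - \alpha^2} \, \gamma_i(t)^{\frac{1 + \alpha}{2}} p_i^{\frac{1 - \alpha}{2}} - \frac{2}{1 - \alpha} \, \gamma_i(t) \right]_0^1 \\
&= \sum_{i=1}^n \left( \frac{2}{1 + \alpha} \, p_i - \left( \frac{4}{1 - \alpha^2} \, q_i^{\frac{1 + \alpha}{2}} p_i^{\frac{1 - \alpha}{2}} - \frac{2}{1 - \alpha} \, q_i \right) \right) \\
&= \sum_{i=1}^n \left( \frac{2}{1 - \alpha} \, q_i + \frac{2}{1 + \alpha} \, p_i - \frac{4}{1 - \alpha^2} \, q_i^{\frac{1 + \alpha}{2}} p_i^{\frac{1 - \alpha}{2}} \right)
\end{aligned}$$
Obviously, we have
$$D^{(\alpha)}(p \| q) = D^{(-\alpha)}(q \| p) \tag{48}$$
These calculations give rise to the following definition:
Definition 2.
The function $D^{(\alpha)} : M_n \times M_n \to \mathbb{R}$ defined by
$$D^{(\alpha)}(p \| q) := \frac{2}{1 - \alpha} \sum_{i=1}^n q_i + \frac{2}{1 + \alpha} \sum_{i=1}^n p_i - \frac{4}{1 - \alpha^2} \sum_{i=1}^n q_i^{\frac{1 + \alpha}{2}} p_i^{\frac{1 - \alpha}{2}}$$
is called the α-divergence. Its restriction to probability measures is given as
$$D^{(\alpha)}(p \| q) = \frac{4}{1 - \alpha^2} \left( 1 - \sum_{i=1}^n q_i^{\frac{1 + \alpha}{2}} p_i^{\frac{1 - \alpha}{2}} \right)$$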
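Proposition 3.
The following holds:
$$X^{(\alpha)}(q, p) = -\operatorname{grad}_q D^{(\alpha)}(p \| \cdot) \tag{50}$$
Furthermore, $D^{(\alpha)}$ is the only function on $M_n \times M_n$ that satisfies the condition (50) and $D^{(\alpha)}(p \| p) = 0$ for all $p$.
Proof.
We compute the partial derivative
$$\frac{\partial D^{(\alpha)}(p \| \cdot)}{\partial q_i}(q) = \frac{2}{1 - \alpha} \left( 1 - q_i^{\frac{1 + \alpha}{2} - 1} p_i^{\frac{1 - \alpha}{2}} \right)$$
With Formula (16), we obtain
$$\left( \operatorname{grad}_q D^{(\alpha)}(p \| \cdot) \right)_i = q_i \cdot \frac{2}{1 - \alpha} \left( 1 - q_i^{\frac{1 + \alpha}{2} - 1} p_i^{\frac{1 - \alpha}{2}} \right) = \frac{2}{1 - \alpha} \left( q_i - q_i^{\frac{1 + \alpha}{2}} p_i^{\frac{1 - \alpha}{2}} \right)$$
A comparison with Equation (46) verifies Equation (50), which uniquely characterizes $D^{(\alpha)}(p \| \cdot)$, up to a constant depending on $p$. With the additional assumption $D^{(\alpha)}(p \| p) = 0$ for all $p$, this constant is fixed. ☐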
In what follows, we use the notation $D^{(\alpha)}$ also for $\alpha \in \{-1, 1\}$ by setting $D^{(-1)}(p \| q) := D(p \| q)$ and $D^{(1)}(p \| q) := D(q \| p)$, where $D$ is the relative entropy defined by Equation (41). This is consistent with the definition of the α-connections, given by Equation (26), where we have the m-connection for $\alpha = -1$ and the e-connection for $\alpha = 1$. Note that $D^{(0)}$ is closely related to the Hellinger distance
$$d_H(p, q) := \left( \sum_{i=1}^n \left( p_i^{\frac{1}{2}} - q_i^{\frac{1}{2}} \right)^2 \right)^{\frac{1}{2}}$$
More precisely, we have
$$D^{(0)}(p \| q) = 2 \, d_H(p, q)^2$$
In fact, the derivation of $D^{(\alpha)}$ was based on the idea to associate a distance-like function to the α-connections through the general Equation (12). However, it turns out that, although naturally motivated, the functions $D^{(\alpha)}$ do not share all properties of the square of a distance, except for $\alpha = 0$. In particular, symmetry is not satisfied for $\alpha \neq 0$. On the other hand, we have $D^{(\alpha)}(p \| q) \geq 0$, and $D^{(\alpha)}(p \| q) = 0$ if and only if $p = q$.
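The properties of the α-divergence stated above lend themselves to a quick numerical check. The Python sketch below evaluates Definition 2 on the cone of positive measures and compares it with the duality relation Equation (48), the Hellinger relation for $\alpha = 0$, and the relative entropy as $\alpha$ approaches $-1$. The function names, the sample points, and the value $\alpha = -1 + 10^{-6}$ used to approximate the limit are assumptions made for illustration.

```python
import math

def kl(p, q):
    """Equation (41): relative entropy on positive measures."""
    return sum(qi - pi + pi * math.log(pi / qi) for pi, qi in zip(p, q))

def alpha_divergence(p, q, alpha):
    """Definition 2 on the cone M_n, valid for -1 < alpha < 1."""
    a = (1.0 - alpha) / 2.0   # 2 / (1 - alpha) = 1 / a
    b = (1.0 + alpha) / 2.0   # 2 / (1 + alpha) = 1 / b, and 4 / (1 - alpha^2) = 1 / (a * b)
    return sum(qi / a + pi / b - qi ** b * pi ** a / (a * b)
               for pi, qi in zip(p, q))

def hellinger(p, q):
    """Hellinger distance d_H(p, q)."""
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                         for pi, qi in zip(p, q)))

p = [0.5, 1.5, 2.0]
q = [1.0, 0.8, 2.5]

d_plus = alpha_divergence(p, q, 0.3)
d_minus_swapped = alpha_divergence(q, p, -0.3)        # Equation (48)
d_zero = alpha_divergence(p, q, 0.0)
d_near_mixture = alpha_divergence(p, q, -1.0 + 1e-6)  # approximates alpha -> -1
```

Here `d_plus` and `d_minus_swapped` should coincide, `d_zero` should equal `2 * hellinger(p, q) ** 2`, and `d_near_mixture` should be close to `kl(p, q)`, consistent with setting $D^{(-1)}(p \| q) := D(p \| q)$.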
We now ask whether the restriction of $D^{(\alpha)}$, which is defined for positive measures, to the simplex $S_{n-1}$ of probability distributions is the canonical divergence for the α-connections on $S_{n-1}$. We have seen that this is the case for the m- and e-connections, that is, for $\alpha \in \{-1, +1\}$. However, for general $\alpha$, the situation is more complicated. From Equation (36) we obtain
$$\overline{X}^{(\alpha)}(q, p) = \dot\tau_{q,p}(0) \, \Pi_q X^{(\alpha)}(q, p)$$
This equality deviates from the condition of Equation (44) by the factor $\dot\tau_{q,p}(0)$, which proves that the restriction of the α-divergence to $S_{n-1}$ does not coincide with the canonical α-divergence on the simplex. As an example, we consider the case $\alpha = 0$, where the α-connection is the Levi-Civita connection of the Fisher metric. As we will see in the next section, the canonical divergence in that case equals $\overline{D}^{(0)}(p \| q) = \frac{1}{2} d_F(p, q)^2$, where $d_F$ denotes the distance with respect to the Fisher metric (see Equation (62)). This divergence is different from the divergence $D^{(0)}$, given by Equation (51), which is based on the distance in the ambient space $M_n$, the Hellinger distance.

5. General Canonical Divergence and Its Consistency

5.1. Canonical Divergence

We have derived a canonical divergence when the vector field $X$ of the inverse exponential map, that is, $\exp_q(X(q, p)) = p$ for all $p$ and $q$, is integrable. We now define a canonical divergence in a general $n$-dimensional dual manifold $(M, g, \nabla, \nabla^*)$. Consider a $\nabla$-geodesic $\gamma_{q,p} : [0,1] \to M$ connecting $q$ and $p$. We define a tangent vector field $X_t(q, p)$ along this geodesic:
$$X_t(q, p) := X\!\left( \gamma_{q,p}(t), p \right)$$
Obviously,
$$X_0(q, p) = X(q, p)$$
$$X_1(q, p) = 0$$
Definition 3.
A canonical divergence from $p$ to $q$ is defined by the path integral
$$D(p \| q) = \int_0^1 \langle X_t(q, p), \dot\gamma_{q,p}(t) \rangle \, dt \tag{55}$$
Replacing the $\nabla$-geodesic $\gamma_{q,p}$ from $q$ to $p$ by the reversed $\nabla$-geodesic $\gamma_{p,q}$ from $p$ to $q$ and the vector field $X_t(q, p)$ by the vector field $X^*_t(p, q) := X^*\!\left( \gamma_{p,q}(t), q \right)$ of the dual connection $\nabla^*$ leads to the following related definition of a canonical divergence:
$$\widetilde{D}(p \| q) := \int_0^1 \langle X^*_t(p, q), \dot\gamma_{p,q}(t) \rangle \, dt \tag{56}$$
$$= -\int_0^1 \langle X^*(\gamma_{q,p}(t), q), \dot\gamma_{q,p}(t) \rangle \, dt \tag{57}$$
Although motivated and derived in different terms, the divergence of the article [9] turns out to coincide with $\widetilde{D}$. The authors apply Hooke’s law to a “$\nabla^*$-spring” and define their divergence, in terms of an expression related to Equation (57), as the work that is necessary to move a point of unit mass from $q$ to $p$ along the $\nabla$-geodesic $\gamma_{q,p}$ against the force field $X^*(\gamma_{q,p}(t), q)$. We became aware of this article after submission of our present article. The divergence $\widetilde{D}$ shares many nice properties of our canonical divergence. However, in the integrability case, it is not generally true that $X(q, p) = -\operatorname{grad}_q \widetilde{D}(p \| \cdot)$, a property that serves as the main motivation of our article and which is satisfied by our canonical divergence of Equation (55).
Before stating the main result that the canonical divergence defined by Equation (55) induces the same Riemannian metric $g$ and the same pair of affine connections $\nabla$ and $\nabla^*$, we show some of its properties. Since the geodesic connecting $\gamma_{q,p}(t)$ and $p$ is a part of the geodesic connecting $q$ and $p$, corresponding to the interval $[t, 1]$, the inverse exponential map at $\gamma_{q,p}(t)$ satisfies
$$X_t(q, p) = (1 - t) \, \dot\gamma_{q,p}(t) \tag{58}$$
Hence, we have
$$D(p \| q) = \int_0^1 (1 - t) \, \| \dot\gamma_{q,p}(t) \|^2 \, dt \tag{59}$$
where
$$\| \dot\gamma_{q,p}(t) \|^2 = \langle \dot\gamma_{q,p}(t), \dot\gamma_{q,p}(t) \rangle \tag{60}$$
This already proves $D(p \| q) \geq 0$, and $D(p \| q) = 0$ if and only if $p = q$. If we replace the parameter $t$ by $1 - t$ and use $\gamma_{q,p}(t) = \gamma_{p,q}(1 - t)$, we directly obtain the following representation of the canonical divergence:
Proposition 4.
The divergence of Definition 3 is given by
$$D(p \| q) = \int_0^1 t \, \| \dot\gamma_{p,q}(t) \|^2 \, dt \tag{61}$$
where $\gamma_{p,q}$ denotes the geodesic from $p$ to $q$.
Remark 1.
In the special case where $M$ is self-dual, $\nabla = \nabla^*$ is the Levi-Civita connection with respect to $g$. In that case, the velocity field $\dot\gamma_{p,q}$ is parallel along the geodesic $\gamma_{p,q}$, and therefore
$$\| \dot\gamma_{p,q}(t) \|_{\gamma(t)} = \| \dot\gamma_{p,q}(0) \|_p = \| X(p, q) \|_p = d(p, q)$$
where $d(p, q)$ denotes the Riemannian distance between $p$ and $q$. This implies that the canonical divergence corresponds to the energy of the geodesic $\gamma_{p,q}$, that is
$$D(p \| q) = \frac{1}{2} d^2(p, q) \tag{62}$$
In the general case, where $\nabla$ is not necessarily the Levi-Civita connection, we obtain the energy of the geodesic $\gamma_{p,q}$ as the symmetrized version of the canonical divergence:
$$\frac{1}{2} \left( D(p \| q) + D(q \| p) \right) = \frac{1}{2} \int_0^1 \| \dot\gamma_{p,q}(t) \|^2 \, dt \tag{63}$$
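Proposition 4 and Equation (63) can be illustrated for the mixture connection on the cone $M_n$, where the geodesic from $p$ to $q$ is the straight line of Equation (23) and the canonical divergence is the relative entropy of Section 4.1. The Python sketch below is a numerical illustration under these assumptions; the midpoint rule, the sample points, and the function names are hypothetical.

```python
import math

def kl(p, q):
    """Equation (41): relative entropy on positive measures."""
    return sum(qi - pi + pi * math.log(pi / qi) for pi, qi in zip(p, q))

def squared_speed(p, q, t):
    """Fisher norm squared of the velocity q - p at the point p + t * (q - p)."""
    return sum((qi - pi) ** 2 / (pi + t * (qi - pi)) for pi, qi in zip(p, q))

def canonical_divergence_m(p, q, steps=4000):
    """Equation (61) for the m-geodesic: int_0^1 t * ||gamma'(t)||^2 dt."""
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        total += t * squared_speed(p, q, t)
    return total / steps

def energy_m(p, q, steps=4000):
    """Right-hand side of Equation (63): (1/2) * int_0^1 ||gamma'(t)||^2 dt."""
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        total += squared_speed(p, q, t)
    return 0.5 * total / steps

p = [0.5, 1.5, 2.0]
q = [1.0, 0.8, 2.5]

d_pq = canonical_divergence_m(p, q)
d_qp = canonical_divergence_m(q, p)
energy = energy_m(p, q)
```

Up to quadrature error, `d_pq` should reproduce `kl(p, q)`, and the mean of `d_pq` and `d_qp` should equal `energy`, although neither divergence is symmetric on its own.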
Remark 2.
Let us compare the canonical divergence $D$ of the affine connection $\nabla$ with the canonical divergence $D^*$ of its dual connection $\nabla^*$, both defined by Equation (55) or equivalently by Equation (61). In the special case of the α-connection $\nabla = \nabla^{(\alpha)}$, we have $D^*(p \| q) = D(q \| p)$ (see Equation (48)). In Section 5.3, we will prove that this kind of symmetry holds in the general case of a dually flat manifold. However, our canonical divergence does not necessarily have this property when the space is not dually flat. This is contrary to most other approaches, where the symmetry is considered to be a natural property of any divergence. In order to have that property also in our setting, we can consider the mean canonical divergence
$$D_{mcd}(p \| q) := \frac{1}{2} \left( D(p \| q) + D^*(q \| p) \right) \tag{64}$$
which obviously satisfies
$$D^{(*)}_{mcd}(p \| q) = D_{mcd}(q \| p) \tag{65}$$
As we will prove in the next section, the canonical divergence $D$ induces the metric $g$ and the connections $\nabla$ and $\nabla^*$. The same holds for the mean canonical divergence $D_{mcd}$. However, if $\nabla$ is integrable, then it is not generally true that $X(q, p) = -\operatorname{grad}_q D_{mcd}(p \| \cdot)$, which is inconsistent with the main motivation of our canonical divergence (see Equation (12)).

5.2. Main Consistency Result

Let $g^D$, $\nabla^D$, and $\nabla^{*D}$ be the geometrical objects derived from the canonical divergence $D$ as defined in Equation (55). We recall the corresponding definitions from Section 1 in terms of a local coordinate system $\xi = (\xi^1, \dots, \xi^n)$:
$$g^D_{ij}(p) = \partial_i \partial_j D(\xi_p \,\|\, \xi_q)\big|_{q=p} \tag{66}$$
$$\Gamma^D_{ijk}(p) = -\,\partial_i \partial_j \partial'_k D(\xi_p \,\|\, \xi_q)\big|_{q=p} \tag{67}$$
$$\Gamma^{*D}_{ijk}(p) = -\,\partial'_i \partial'_j \partial_k D(\xi_p \,\|\, \xi_q)\big|_{q=p} \tag{68}$$
Here $\partial_i = \partial/\partial \xi_p^i$ and $\partial'_i = \partial/\partial \xi_q^i$. We have defined our canonical divergence $D$ based on a metric $g$ and an affine connection $\nabla$. It is natural to require that this divergence is consistent in the sense that the objects $g^D$, $\nabla^D$, and $\nabla^{*D}$ coincide with the original objects $g$, $\nabla$, and $\nabla^*$ of $M$, where $\nabla^*$ is the dual affine connection of $\nabla$ with respect to $g$. Since the geometry is determined by the derivatives of $D(\xi_p \,\|\, \xi_q)$ at $p = q$, we consider the case where $p$ and $q$ are close to each other, that is
$$z^i = \xi_q^i - \xi_p^i \tag{69}$$
is small for all $i$. We evaluate the divergence by Taylor expansion up to $O(\|z\|^3)$. Note that $X(p,q)$ is of order $\|z\|$.
Proposition 5. 
When $z = \xi_q - \xi_p$ is small, the canonical divergence is expanded as
$$D(p \,\|\, q) = \frac{1}{2}\, g_{ij}(p)\, z^i z^j + \frac{1}{6}\, \Lambda_{ijk}(p)\, z^i z^j z^k + O(\|z\|^4) \tag{70}$$
where
$$\Lambda_{ijk} = 2\,\partial_i g_{jk} - \Gamma_{ijk} \tag{71}$$
Proof. 
We obtain the local coordinates $\xi(t)$ of the geodesic $\gamma_{p,q}(t)$ in Taylor series as
$$\xi^i(t) = \xi_p^i + t X^i - \frac{t^2}{2}\, \Gamma^i_{jk} X^j X^k + O(\|tX\|^3) \tag{72}$$
where $X^i = X^i(p,q)$. When $z$ is small, $X$ is of order $O(\|z\|)$. Hence, we regard Equation (72) as a Taylor expansion with respect to $X$, with $t \in [0,1]$, when $z$ is small. When $t = 1$, we have
$$z^i = X^i - \frac{1}{2}\, \Gamma^i_{jk} X^j X^k \tag{73}$$
where the higher-order terms are neglected. This in turn gives
$$X^i = z^i + \frac{1}{2}\, \Gamma^i_{jk} z^j z^k \tag{74}$$
We calculate $D(p \,\|\, q)$ by using Equation (61). The velocity at $t$ is given as
$$\dot{\xi}^i(t) = X^i - t\, \Gamma^i_{jk} X^j X^k \tag{75}$$
$$= z^i + \frac{1}{2}(1 - 2t)\, \Gamma^i_{jk} z^j z^k \tag{76}$$
We also use
$$g_{ij}(\xi(t)) = g_{ij}(\xi_p) + t\, \partial_k g_{ij}\, z^k \tag{77}$$
Collecting these terms, we have
$$t\, g_{ij}(\xi(t))\, \dot{\xi}^i(t)\, \dot{\xi}^j(t) = t\, g_{ij} z^i z^j + \left\{ t^2\, \partial_i g_{jk} + \left(-2t^2 + t\right) \Gamma_{ijk} \right\} z^i z^j z^k \tag{78}$$
By integration, we have
$$D(p \,\|\, q) = \int_0^1 t\, g_{ij}(\xi(t))\, \dot{\xi}^i(t)\, \dot{\xi}^j(t)\, dt \tag{79}$$
$$= \frac{1}{2}\, g_{ij} z^i z^j + \frac{1}{6}\, \Lambda_{ijk} z^i z^j z^k \tag{80}$$
where the indices of $\Lambda_{ijk}$ are symmetrized because of the multiplication by $z^i z^j z^k$. This gives Equation (70). ☐
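The expansion of Proposition 5 can be checked numerically in a one-dimensional toy example. The following is a minimal sketch under assumptions that are not part of the paper: the metric $g(x) = 1 + x^2$ and the constant connection coefficient $\Gamma^1_{11} = 1$ are chosen only for illustration. For this connection $u = e^x$ is an affine coordinate, so the geodesic from $p$ to $q$ is available in closed form. All function names are hypothetical helpers.

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

def g(x):
    """Illustrative metric g_11(x) (an assumption of this sketch)."""
    return 1.0 + x * x

def dg(x):
    """Derivative of the illustrative metric."""
    return 2.0 * x

GAMMA_UP = 1.0  # assumed constant coefficient Gamma^1_11

def canonical_divergence(p, q):
    """D(p||q) = int_0^1 t * g(xi(t)) * xi_dot(t)^2 dt along the geodesic.

    For Gamma^1_11 = 1 the geodesic is xi(t) = p + log(1 + t*a) with
    a = exp(q - p) - 1, which solves xi'' + xi'^2 = 0, xi(0)=p, xi(1)=q.
    """
    a = math.exp(q - p) - 1.0

    def integrand(t):
        xi = p + math.log1p(t * a)
        xi_dot = a / (1.0 + t * a)
        return t * g(xi) * xi_dot ** 2

    return simpson(integrand, 0.0, 1.0)

def cubic_expansion(p, q):
    """Right-hand side of the expansion without the O(|z|^4) remainder."""
    z = q - p
    gamma_lower = g(p) * GAMMA_UP          # Gamma_111 = g_11 * Gamma^1_11
    lam = 2.0 * dg(p) - gamma_lower        # Lambda_111 = 2 dg - Gamma_111
    return 0.5 * g(p) * z ** 2 + lam * z ** 3 / 6.0

def remainder(p, z):
    """Absolute difference between the divergence and its cubic expansion."""
    return abs(canonical_divergence(p, p + z) - cubic_expansion(p, p + z))

if __name__ == "__main__":
    r1 = remainder(0.3, 0.1)
    r2 = remainder(0.3, 0.05)
    print("remainders:", r1, r2, "ratio:", r1 / r2)
```

If the remainder is indeed of fourth order, halving $z$ should reduce it by a factor close to $2^4 = 16$, which is what the printed ratio is meant to show.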
Theorem 1. 
(Consistency theorem) The geometric quantities $g^D$, $\nabla^D$, and $\nabla^{*D}$ derived from the canonical divergence $D(p \,\|\, q)$ of Definition 3 coincide with the original quantities $g$, $\nabla$, and $\nabla^*$.
Proof. 
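By differentiating Equation (70) with respect to $\xi_p$, we obtain
$$\partial_i D = \frac{1}{2}\, \partial_i g_{jk}\, z^j z^k - g_{ij} z^j - \frac{1}{2}\, \Lambda_{ijk} z^j z^k \tag{81}$$
$$\partial_i \partial_j D = \frac{1}{2}\, \partial_i \partial_j g_{kl}\, z^k z^l - 2\, \partial_i g_{jk}\, z^k + g_{ij} + \Lambda_{ijk} z^k \tag{82}$$
where the indexed quantities on the right-hand side need to be symmetrized with respect to $i, j$. By evaluating $\partial_i \partial_j D$ at $\xi_p = \xi_q$, i.e., $z = 0$, we have
$$g^D_{ij} = g_{ij} \tag{83}$$
proving that the Riemannian metric derived from $D$ is the same as the original one. We further differentiate Equation (82) with respect to $\xi_q$ (denoted by $\partial'_k = \partial/\partial \xi_q^k$) and evaluate it at $\xi_p = \xi_q$. This yields
$$\Gamma^D_{ijk} = -\,\partial_i \partial_j \partial'_k D = 2\, \partial_i g_{jk} - \Lambda_{ijk} \tag{84}$$
$$= \Gamma_{ijk} \tag{85}$$
Hence, the affine connection $\nabla^D$ derived from $D$ is exactly the same as the original affine connection $\nabla$. ☐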
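The consistency statement can also be probed by finite differences. The sketch below is an illustration only and reuses a hypothetical one-dimensional setting that is not from the paper: metric $g(x) = 1 + x^2$ and constant coefficient $\Gamma^1_{11} = 1$, so that $\Gamma_{111} = g\,\Gamma^1_{11}$. It approximates $\partial_1 \partial_1 D$ and $-\partial_1 \partial_1 \partial'_1 D$ on the diagonal and compares them with $g$ and $\Gamma_{111}$.

```python
import math

def simpson(f, a, b, n=400):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

def g(x):
    """Illustrative metric (an assumption of this sketch)."""
    return 1.0 + x * x

GAMMA_UP = 1.0  # assumed constant coefficient Gamma^1_11

def divergence(x, y):
    """Canonical divergence D(x||y) along the closed-form geodesic
    xi(t) = x + log(1 + t*a), a = exp(y - x) - 1 (valid for Gamma^1_11 = 1)."""
    a = math.exp(y - x) - 1.0

    def integrand(t):
        xi = x + math.log1p(t * a)
        xi_dot = a / (1.0 + t * a)
        return t * g(xi) * xi_dot ** 2

    return simpson(integrand, 0.0, 1.0)

def metric_from_divergence(p, h=1e-2):
    """Second derivative of D in its first argument, evaluated at x = y = p."""
    return (divergence(p + h, p) - 2.0 * divergence(p, p)
            + divergence(p - h, p)) / (h * h)

def connection_from_divergence(p, h=1e-2):
    """-(d/dx)^2 (d/dy) D at x = y = p, by nested central differences."""
    def second_x(y):
        return (divergence(p + h, y) - 2.0 * divergence(p, y)
                + divergence(p - h, y)) / (h * h)
    return -(second_x(p + h) - second_x(p - h)) / (2.0 * h)

if __name__ == "__main__":
    p = 0.3
    print("g^D     :", metric_from_divergence(p), "vs g       :", g(p))
    print("Gamma^D :", connection_from_divergence(p),
          "vs Gamma_111:", g(p) * GAMMA_UP)
```

The step size `h` is a trade-off between truncation and rounding error; the agreement is only expected up to terms of order $h^2$.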
Remark 3. 
In the special case $\nabla = \nabla^*$, the canonical divergence is given by half of the squared norm of the inverse exponential map (see Equation (62)):
$$D(p \,\|\, q) = \frac{1}{2}\, \| X(p,q) \|_p^2 \tag{86}$$
The right-hand side of Equation (86) defines a divergence for a general connection, which coincides with the canonical divergence in the self-dual case. We studied this divergence in our previous work [6], where we showed that it recovers $g$ in terms of Equation (66). However, it fails to recover $\nabla$ and $\nabla^*$ in terms of Equations (67) and (68) directly. In order to overcome this shortcoming, we considered the α-connection $\nabla^{(\alpha)} = \frac{1-\alpha}{2}\, \nabla + \frac{1+\alpha}{2}\, \nabla^*$ and the corresponding inverse exponential map $X^{(\alpha)}$, which imply the following version of Equation (86):
$$D^{(\alpha)}(p \,\|\, q) := \frac{1}{2}\, \| X^{(\alpha)}(p,q) \|_p^2 \tag{87}$$
($D^{(\alpha)}$ does not denote the α-divergence here.) We have shown in [6] that for $\alpha = \frac{1}{3}$ the divergence $D^{(\alpha)}$, referred to as the standard divergence, induces the original quantities $g$, $\nabla$, and $\nabla^*$. It turns out, however, that this first attempt to define a canonical divergence has serious limitations. For instance, it does not reduce to the known canonical divergence in the dually flat case. This important property is satisfied by the canonical divergence of Definition 3, which we are going to prove in the next section.

5.3. Canonical Divergence in a Dually Flat Manifold

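When a manifold $M$ is dually flat, it has an affine coordinate system $\theta = (\theta^1, \dots, \theta^n)$ and a potential function $\psi(\theta)$, where the dual affine coordinates $\eta = (\eta_1, \dots, \eta_n)$ are given by
$$\eta_i = \frac{\partial \psi(\theta)}{\partial \theta^i}, \quad i = 1, \dots, n \tag{88}$$
The dual potential is then defined as
$$\varphi(\eta) = -\psi(\theta) + \theta \cdot \eta \tag{89}$$
where $\theta \cdot \eta = \theta^i \eta_i$ and $\theta$ is a function of $\eta$ by Equation (88). The geodesic connecting $p$ and $q$, a generalisation of the e-geodesic of Section 3.2, has the form
$$\theta(t) = \theta_p + t \left( \theta_q - \theta_p \right) \tag{90}$$
Hence, the velocity is constant:
$$\dot{\theta}(t) = z = \theta_q - \theta_p \tag{91}$$
The canonical divergence from $\theta_p$ to $\theta_q$ is defined by
$$D(\theta_p \,\|\, \theta_q) = \int_0^1 t\, g_{ij}(\theta(t))\, z^i z^j\, dt \tag{92}$$
Since $g_{ij} = \partial_i \partial_j \psi$, we have
$$D(\theta_p \,\|\, \theta_q) = \int_0^1 t\, \partial_i \partial_j \psi(\theta_p + t z)\, z^i z^j\, dt \tag{93}$$
$$= \int_0^1 t\, \ddot{\psi}(\theta(t))\, dt \tag{94}$$
$$= -\int_0^1 \dot{\psi}(\theta(t))\, dt + \Big[\, t\, \dot{\psi}(\theta(t)) \Big]_0^1 \tag{95}$$
$$= \psi(\theta_p) + \varphi(\eta_q) - \theta_p \cdot \eta_q \tag{96}$$
This shows that our canonical divergence is the same as the canonical divergence defined in terms of the Bregman divergence of $M$.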
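As a concrete illustration of this identity, the following sketch compares the geodesic integral with the closed form $\psi(\theta_p) + \varphi(\eta_q) - \theta_p \cdot \eta_q$ for a one-dimensional example. The choice of potential, $\psi(\theta) = \log(1 + e^{\theta})$ (the Bernoulli family), is an assumption made here for illustration and does not appear in the paper; the function names are hypothetical.

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

def psi(theta):
    """Assumed potential: psi(theta) = log(1 + exp(theta))."""
    return math.log1p(math.exp(theta))

def eta_of(theta):
    """Dual coordinate eta = d psi / d theta (logistic function)."""
    return 1.0 / (1.0 + math.exp(-theta))

def phi(eta):
    """Dual potential phi(eta) = theta * eta - psi(theta)."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)

def metric(theta):
    """g_11(theta) = second derivative of psi."""
    e = eta_of(theta)
    return e * (1.0 - e)

def divergence_by_integration(theta_p, theta_q):
    """int_0^1 t * g(theta(t)) * z^2 dt along theta(t) = theta_p + t*z."""
    z = theta_q - theta_p
    return simpson(lambda t: t * metric(theta_p + t * z) * z * z, 0.0, 1.0)

def divergence_closed_form(theta_p, theta_q):
    """psi(theta_p) + phi(eta_q) - theta_p * eta_q."""
    eta_q = eta_of(theta_q)
    return psi(theta_p) + phi(eta_q) - theta_p * eta_q

if __name__ == "__main__":
    tp, tq = -0.4, 1.1
    print(divergence_by_integration(tp, tq), divergence_closed_form(tp, tq))
```

The two values should agree up to quadrature error, since the integration by parts above is exact for any smooth convex potential.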
Now we come back to the symmetry property that we already addressed in Remark 2. We derived $D(p \,\|\, q)$ by using the primal affine connection $\nabla$ and the related inverse exponential map. We can construct its dual $D^*(p \,\|\, q)$ by using the dual affine connection $\nabla^*$ and the dual inverse exponential map. The dual affine coordinates are $\eta$, and the m-geodesic connecting $p$ and $q$ is given by
$$\eta(t) = \eta_p + t \left( \eta_q - \eta_p \right) \tag{97}$$
Hence, the velocity is constant:
$$\dot{\eta}(t) = z^* = \eta_q - \eta_p \tag{98}$$
The dual canonical divergence $D^*$ is defined by
$$D^*(p \,\|\, q) = \int_0^1 t\, g^{ij}(\eta(t))\, z^*_i z^*_j\, dt \tag{99}$$
Here,
$$g^{ij}(\eta) = \partial^i \partial^j \varphi(\eta) \tag{100}$$
where
$$\partial^i = \frac{\partial}{\partial \eta_i} \tag{101}$$
So we have
$$D^*(p \,\|\, q) = \int_0^1 t\, \partial^i \partial^j \varphi(\eta_p + t z^*)\, z^*_i z^*_j\, dt \tag{102}$$
By similar calculations, we have
$$D^*(p \,\|\, q) = D(q \,\|\, p) \tag{103}$$
This proves that $\nabla$ and $\nabla^*$ give the same canonical divergence, except that $p$ and $q$ are interchanged because of the duality. Such a nice property holds when $M$ is dually flat.
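The relation $D^*(p \,\|\, q) = D(q \,\|\, p)$ can be illustrated in the same spirit. The sketch below assumes, purely for illustration, the one-dimensional potential $\psi(\theta) = \log(1 + e^{\theta})$ with dual potential $\varphi(\eta) = \eta \log \eta + (1 - \eta) \log(1 - \eta)$; neither this example nor the helper names come from the paper. It integrates along the straight line in the $\eta$-coordinate and compares the result with the primal closed form with swapped arguments.

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

def logit(eta):
    """theta as a function of eta for the assumed potential."""
    return math.log(eta / (1.0 - eta))

def psi(theta):
    return math.log1p(math.exp(theta))

def phi(eta):
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)

def dual_metric(eta):
    """g^{11}(eta) = second derivative of phi."""
    return 1.0 / (eta * (1.0 - eta))

def dual_divergence(eta_p, eta_q):
    """D*(p||q): integral of t * g^{11}(eta(t)) * (z*)^2 along the m-geodesic."""
    z = eta_q - eta_p
    return simpson(lambda t: t * dual_metric(eta_p + t * z) * z * z, 0.0, 1.0)

def primal_divergence(eta_p, eta_q):
    """D(p||q) = psi(theta_p) + phi(eta_q) - theta_p * eta_q."""
    theta_p = logit(eta_p)
    return psi(theta_p) + phi(eta_q) - theta_p * eta_q

if __name__ == "__main__":
    ep, eq = 0.2, 0.7
    print("D*(p||q):", dual_divergence(ep, eq))
    print("D(q||p) :", primal_divergence(eq, ep))
    print("D(p||q) :", primal_divergence(ep, eq))
```

In this example $D(p \,\|\, q)$ and $D(q \,\|\, p)$ differ, so the agreement of the first two printed values is a genuine check of the interchange of $p$ and $q$.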

6. Geodesic Projections and Integrability

Given a divergence $D$ on $M$ and a point $p \in M$, we consider the set of points $q$ that satisfy
$$D(p \,\|\, q) = \mathrm{const} \tag{104}$$
where $p$ is fixed. This set is the surface of the equi-divergence ball centered at $p$. When a smooth submanifold $S$ is given, we search for a point $\hat{p} \in S$ that minimizes $D(p \,\|\, q)$, $q \in S$. Intuitively, we obtain such a minimizer by considering a ball centered at $p$. We increase its radius, starting from 0, until the ball touches $S$ for the first time. Any touch point $\hat{p}$ is then a minimizer of $D(p \,\|\, q)$, $q \in S$. When the geodesic connecting $\hat{p}$ and $p$ is orthogonal to $S$ at $\hat{p}$, we call $\hat{p}$ a geodesic projection of $p$ onto $S$.
Definition 4. 
We say that the geodesic projection property holds if every minimizer $\hat{p}$ of the divergence $D$ is given by the geodesic projection of $p$ onto $S$.
We know that the geodesic projection property holds when M is dually flat, but it does not hold in general. The following condition guarantees the geodesic projection property:
Proposition 6. 
The geodesic projection property holds when the inverse exponential map $X(q,p)$ is proportional to the gradient of $D(p \,\|\, q)$ with respect to $q$,
$$X(q,p) = c \cdot \operatorname{grad}_q D(p \,\|\, \cdot) \tag{105}$$
where $c$ is a scalar factor that may depend on $q$ and $p$.
Proof. 
Consider the geodesic connecting $q = \hat{p}$ and $p$. Then, the tangent vector at $q$ is $X(q,p)$. Assume that $X(q,p)$ is proportional to the gradient $\operatorname{grad}_q D(p \,\|\, \cdot)$, that is, to the vector orthogonal to the surface of the ball touching $S$. Then $X(q,p)$ is also orthogonal to the tangent space of $S$ at $\hat{p}$, as the tangent space of the ball contains the tangent space of $S$ at this point. This means that $\hat{p}$ is a geodesic projection. ☐
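The geodesic projection property in the dually flat case can be illustrated with a small numerical experiment. The sketch below makes several assumptions that are not in the paper: the potential $\psi(\theta) = \log(1 + e^{\theta^1} + e^{\theta^2})$, a submanifold $S$ that is a straight segment in the $\eta$-coordinates, and a golden-section search as the minimizer. It minimizes $D(p \,\|\, q) = \psi(\theta_p) + \varphi(\eta_q) - \theta_p \cdot \eta_q$ over $S$ and then evaluates the pairing of the geodesic direction $\theta_p - \theta_{\hat{p}}$ (in $\theta$-components) with the tangent of $S$ (in $\eta$-components), which should vanish at the minimizer.

```python
import math

def psi(theta):
    """Assumed potential psi(theta) = log(1 + exp(theta1) + exp(theta2))."""
    return math.log(1.0 + math.exp(theta[0]) + math.exp(theta[1]))

def phi(eta):
    """Dual potential (Legendre transform of psi)."""
    rest = 1.0 - eta[0] - eta[1]
    return (eta[0] * math.log(eta[0]) + eta[1] * math.log(eta[1])
            + rest * math.log(rest))

def theta_of(eta):
    """Affine coordinates theta as a function of the dual coordinates eta."""
    rest = 1.0 - eta[0] - eta[1]
    return (math.log(eta[0] / rest), math.log(eta[1] / rest))

def divergence(theta_p, eta_q):
    """D(p||q) = psi(theta_p) + phi(eta_q) - theta_p . eta_q."""
    return (psi(theta_p) + phi(eta_q)
            - theta_p[0] * eta_q[0] - theta_p[1] * eta_q[1])

# Hypothetical submanifold S: a straight segment in eta-coordinates.
BASE = (0.2, 0.5)
DIRECTION = (0.3, -0.3)

def point_on_S(s):
    return (BASE[0] + s * DIRECTION[0], BASE[1] + s * DIRECTION[1])

def golden_section_min(f, a, b, tol=1e-10):
    """Golden-section search for a minimizer of a unimodal f on [a, b]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

def project(eta_p):
    """Parameter s of the divergence minimizer on S for the point p."""
    theta_p = theta_of(eta_p)
    return golden_section_min(
        lambda s: divergence(theta_p, point_on_S(s)), 0.0, 1.0)

def orthogonality_residual(eta_p, s_hat):
    """Pairing of the geodesic direction at p_hat with the tangent of S."""
    theta_p = theta_of(eta_p)
    theta_hat = theta_of(point_on_S(s_hat))
    x = (theta_p[0] - theta_hat[0], theta_p[1] - theta_hat[1])
    return x[0] * DIRECTION[0] + x[1] * DIRECTION[1]

if __name__ == "__main__":
    eta_p = (0.4, 0.2)
    s_hat = project(eta_p)
    print("s_hat:", s_hat, "residual:", orthogonality_residual(eta_p, s_hat))
```

Because $S$ is linear in $\eta$, the objective is convex along $S$, which is why a simple one-dimensional search suffices here; a curved $S$ would need a more careful minimizer.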
Obviously, when the vector field of the inverse exponential map is integrable, the geodesic projection property directly follows from Equation (12). We have shown that this integrability condition is satisfied for general dually flat manifolds. In particular, the integrability is satisfied for the α-connection $\nabla^{(\alpha)}$ defined on the cone $\mathcal{M}_n$ of positive measures, which leads to the α-divergence as canonical divergence. The restriction of the α-connection to the simplex $\mathcal{S}_{n-1}$ of probability distributions, denoted by $\bar{\nabla}^{(\alpha)}$, is still integrable, even though $\mathcal{S}_{n-1}$ is not (dually) flat with respect to $\bar{\nabla}^{(\alpha)}$ if $\alpha \notin \{-1, +1\}$. As we have seen, the canonical divergence associated with $\bar{\nabla}^{(\alpha)}$ does not coincide with the restriction of the α-divergence to $\mathcal{S}_{n-1}$. However, this restriction is still useful in the context of applications that require projections onto submanifolds $S$. The reason is that the geodesic projection property holds for $\bar{\nabla}^{(\alpha)}$. To be more precise, consider the restriction of the α-divergence to the simplex $\mathcal{S}_{n-1}$:
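$$D^{(\alpha)}(p \,\|\, q) = \frac{4}{1 - \alpha^2} \left( 1 - \sum_{i=1}^n q_i^{\frac{1+\alpha}{2}}\, p_i^{\frac{1-\alpha}{2}} \right) \tag{106}$$
The gradient is given as
$$\operatorname{grad}_q D^{(\alpha)}(p \,\|\, \cdot) = -\frac{2}{1 - \alpha} \sum_i q_i \left( \left( \frac{p_i}{q_i} \right)^{\frac{1-\alpha}{2}} - \sum_j q_j \left( \frac{p_j}{q_j} \right)^{\frac{1-\alpha}{2}} \right) \delta^i \tag{107}$$
Comparing this with Equation (36) we see that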
$$X(q,p) = -\,\dot{\tau}_{q,p}(0)\, \operatorname{grad}_q D^{(\alpha)}(p \,\|\, \cdot) \tag{108}$$
With the condition (105), this implies that the geodesic projection property holds for $D^{(\alpha)}$, even though it is not the canonical α-divergence on the simplex.
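The gradient formula for the restricted α-divergence can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: tangent vectors of the simplex are represented by their components with respect to the $\delta^i$ (components summing to zero), and the Fisher metric $\langle u, v \rangle_q = \sum_i u_i v_i / q_i$ is used to define the gradient. The function names and the sample points are hypothetical. The check compares $\langle \operatorname{grad}_q D^{(\alpha)}(p \,\|\, \cdot), v \rangle_q$ with a central-difference directional derivative along a tangent vector $v$.

```python
import math

def alpha_divergence(p, q, alpha):
    """D^(alpha)(p||q) = 4/(1-alpha^2) * (1 - sum q_i^((1+a)/2) p_i^((1-a)/2))."""
    s = sum(qi ** ((1.0 + alpha) / 2.0) * pi ** ((1.0 - alpha) / 2.0)
            for pi, qi in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

def gradient(p, q, alpha):
    """Components of grad_q D^(alpha)(p||.) with respect to the delta^i."""
    r = [(pi / qi) ** ((1.0 - alpha) / 2.0) for pi, qi in zip(p, q)]
    r_mean = sum(qi * ri for qi, ri in zip(q, r))
    return [-2.0 / (1.0 - alpha) * qi * (ri - r_mean)
            for qi, ri in zip(q, r)]

def fisher_inner(q, u, v):
    """Fisher metric at q applied to two tangent vectors u, v."""
    return sum(ui * vi / qi for ui, vi, qi in zip(u, v, q))

def directional_derivative(p, q, v, alpha, h=1e-6):
    """Central difference of q -> D^(alpha)(p||q) in the tangent direction v."""
    q_plus = [qi + h * vi for qi, vi in zip(q, v)]
    q_minus = [qi - h * vi for qi, vi in zip(q, v)]
    return (alpha_divergence(p, q_plus, alpha)
            - alpha_divergence(p, q_minus, alpha)) / (2.0 * h)

if __name__ == "__main__":
    p = [0.2, 0.3, 0.5]
    q = [0.4, 0.4, 0.2]
    v = [0.1, -0.3, 0.2]   # tangent to the simplex: components sum to zero
    alpha = 0.5
    grad = gradient(p, q, alpha)
    print("metric pairing      :", fisher_inner(q, grad, v))
    print("finite difference   :", directional_derivative(p, q, v, alpha))
    print("sum of grad entries :", sum(grad))
```

The gradient components sum to zero, so the gradient is itself tangent to the simplex, and its pairing with any tangent vector reproduces the directional derivative of the divergence.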

Author Contributions

The research was designed and carried out by both authors. They both wrote the paper, with main contribution by Nihat Ay. Both authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 1983, 11, 793–803.
  2. Amari, S.-I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA; Oxford University Press: Oxford, UK, 2000.
  3. Matumoto, T. Any statistical manifold has a contrast function—On the C3-functions taking the minimum at the diagonal of the product manifold. Hiroshima Math. J. 1993, 23, 327–332.
  4. Kurose, T. On the divergence of 1-conformally flat statistical manifolds. Tohoku Math. J. 1994, 46, 427–433.
  5. Matsuzoe, H. On realization of conformally-projectively flat statistical manifolds and the divergences. Hokkaido Math. J. 1998, 27, 409–421.
  6. Amari, S.-I.; Ay, N. Standard Divergence in Manifold of Dual Affine Connections. In Geometric Science of Information, Proceedings of the 2nd International Conference on Geometric Science of Information, Palaiseau, France, 28–30 October 2015.
  7. Hofbauer, J.; Sigmund, K. Evolutionary Games and Population Dynamics; Cambridge University Press: Cambridge, UK, 2002.
  8. Morozova, E.A.; Chentsov, N.N. Markov invariant geometry on manifolds of states. J. Sov. Math. 1991, 56, 2648–2669.
  9. Henmi, M.; Kobayashi, R. Hooke’s law in statistical manifolds and divergences. Nagoya Math. J. 2000, 159, 1–24.

Share and Cite

MDPI and ACS Style

Ay, N.; Amari, S.-i. A Novel Approach to Canonical Divergences within Information Geometry. Entropy 2015, 17, 8111-8129. https://doi.org/10.3390/e17127866
