1. Introduction
Fundamental activities for analyzing data include both an ability to characterize data complexity and an ability to make comparisons between distributions. Widely used measures for these activities include entropy (for assessing uncertainty) and statistical divergences or distances (to compare distributions) [1]. Such analysis can be performed at either a global scale across the entire data distribution or at a local scale, in the vicinity of a given location in the distribution.
An important measure of global complexity is intrinsic dimensionality, which captures the effective number of degrees of freedom needed to describe the entire dataset. On the other hand, local intrinsic dimensionality (LID) [2] is capable of characterizing the complexity of the data distribution around a specified query location, thus capturing the number of degrees of freedom present at a local scale. LID is a unitless quantity that can also be interpreted as a relative growth rate of probability measure within an expanding neighborhood around the specified query location, or as the intrinsic dimension of the space immediately around the query point.
Our focus in this paper is to characterize entropy and statistical divergences at a highly local scale, for an asymptotically small vicinity around a specified location. We show that it is possible to leverage properties that arise from LID-based characterizations of lower tail distributions [3] to develop analytical expressions for a wide selection of entropy variants and statistical divergences, in both univariate and multivariate settings. This yields expressions for tail entropies and tail divergences.
Analytical characterizations for tail divergences and tail entropies are appealing from a number of perspectives, as follows:
For univariate scenarios, working with the tail of a single-variable distribution, we can conduct:
- Temporal analysis: when a distribution models some property varying over time (e.g., survival analysis), we can analyze the entropy of a univariate distribution within an asymptotically short window of time, or the divergence between two univariate distributions within an asymptotically short window of time.
- Distance-based analysis: when a distribution models distances from a query location to its nearest neighbors, with the distances induced by a global data distribution. Here, our results can be used to analyze the tail entropy of, or the divergence between, distributions within an asymptotically small distance interval. The latter can provide insight into multivariate properties, since under minimal assumptions the divergences between univariate distance distributions provide lower bounds for distances between multivariate distributions [4,5]. This is applicable for models such as generative adversarial networks (GANs), where it is important to test correspondence between synthetic and true distributions at a local level [6].
For multivariate scenarios, where we analyze distributions with multiple variables:
- If an assumption of locally spherical symmetry of the distribution holds, then we can directly compute the tail entropy of a distribution, or the divergence between two tail distributions, in the vicinity of a single point. Such an assumption is suitable for analyzing data distributions for many types of physical systems, such as fluids, glasses, metals and polymers, where local isotropy holds.
A key challenge in developing analytical characterizations for tail entropies and tail divergences is how to avoid or minimize assumptions about the form of the local distribution in the vicinity of the query (for example, assumptions such as a local normal distribution or a local uniform distribution). As we will see, analytical results are in fact possible—as the neighborhood radius asymptotically tends to zero, the tail distribution (a truncated distribution induced from the global distribution) is guaranteed to converge to a generalized Pareto distribution (GPD), with the GPD parameter determined by the LID value of the tail distribution. The technical challenge is to rigorously delineate under what circumstances it is possible to leverage this relationship to achieve a dramatic simplification of the integrals that are required to compute varieties of tail entropy or distribution divergences. Our results in this paper show that such simplifications are in fact possible, for a wide range of tail entropies and divergences. This allows us to characterize and analyze fundamental properties of local neighborhood geometry, with results holding asymptotically for essentially all smooth data distributions.
In summary, our key contributions are the development of substantial new theory that asymptotically relates tail entropy, divergences and LID. It builds on and extends earlier work by Bailey et al. [3], which focused solely on univariate entropies, without reference to divergences or multivariate settings. Specifically, in this paper we:
- Formulate technical lemmas which delineate when it is possible to substitute certain types of tail distributions by simple formulations that depend only on their associated LID values.
- Use these lemmas to compute univariate tail formulations of entropy, cross entropy, cumulative entropy, entropy power and generalized q-entropies, all in terms of the LID values of the original tail distributions.
- Use these lemmas to compute tail formulations of univariate statistical divergences and distances (Kullback–Leibler divergence, Jensen–Shannon divergence, Hellinger distance, $\chi^2$-divergence, $\alpha$-divergence, Wasserstein distance and $L_2$ distance).
- Extend the univariate results to a multivariate context, when local spherical symmetry of the distribution holds.
2. Related Work
The core of our study involves intrinsic dimensionality (ID) and we begin by reviewing previous work on this topic.
There is a long history of work on ID, and it can be assessed either globally (for every data point) or locally (with respect to a chosen query point). Surveys of the field provide a good overview [7,8,9]. In the global case, a range of previous works have focused on topological models and appropriate estimation methods [10,11,12]. Examples encompass techniques such as PCA and its variants [13], graph-based methods [14] and fractal models [7,15]. Other approaches, such as IDEA [16,17], DANCo [18] and 2-NN [19], estimate the (global) intrinsic dimension based on the concentration of norms and angles, or on the distances to the two nearest neighbors.
Local intrinsic dimensionality focuses on the intrinsic dimension at a particular query point and has been used in a range of applications. These include modeling deformation in granular materials [20,21], climate science [22,23], dimension reduction via local PCA [24], similarity search [25], clustering [26], outlier detection [27], statistical manifold learning [28], adversarial example detection [29], adversarial nearest neighbor characterization [30,31] and the understanding of deep learning [32,33]. In deep learning, it has been shown that adversarial examples are associated with high LID estimates, a characteristic that can be leveraged to build accurate adversarial example detectors [29]. It has also been found that the LID of deep representations [33] learned by deep neural networks (DNNs), or of the raw input data [34,35], is correlated with the generalization performance of DNNs. A 'dimensionality expansion' phenomenon has been observed when DNNs overfit to noisy class labels [32], and this can be leveraged to develop improved loss functions. The use of a 'cross-LID' measure to evaluate the quality of synthetic examples generated by GANs has been proposed in [36]. Connections between local intrinsic dimensionality and global intrinsic dimensionality were explored by Romano et al. in [37]. In the area of climate science and dynamical systems, a formulation similar to local intrinsic dimensionality, referred to as local dimension or instantaneous dimension, has been developed using links to extreme-value-theoretic methods [22,23,38]. It has proved useful as a measure for characterizing the predictability of states and explaining system dynamics.
For local intrinsic dimensionality, a popular estimator is the maximum likelihood estimator, studied in the Euclidean setting by Levina and Bickel [39] and later formulated under the more general assumptions of extreme value theory by Houle [2] and Amsaleg et al. [40], who showed it to be equivalent to the classic Hill estimator [41]. Other local estimators include expected simplex skewness [42], the tight locality estimator [43], the MiND framework [17], manifold-adaptive dimension [44], statistical distance [45] and angle-based approaches [46]. Smoothing approaches for estimation have also been used with success [47,48].
Local intrinsic dimensionality is closely related to (univariate) distance distributions. Fundamental relations for interpoint distances, connecting multivariate distributions and univariate distributions, have been explored in both [4,5]. The former showed that two multivariate distributions are equal whenever the interpoint distances both within and between samples have the same univariate distribution, while the latter showed that two multivariate distributions F and G are different if their univariate distance distributions from some randomly chosen point z are different. This can form the basis of a two-sample test for comparing F and G. These studies have implications for our work in this paper, since they characterize the role that comparisons between univariate distributions can play as a necessary condition for establishing equality of multivariate distributions.
Our work in this paper formulates results for different varieties of entropy and different types of divergences. Entropy is a fundamental notion used across many scientific disciplines; a good overview of its role in information theory is presented in [49]. Entropy power (the exponential of entropy) is commonly used in signal processing and information theory, and is a building block for the well-known Shannon entropy power inequality, which can be used to analyze the convolution of two independent random variables [50]. Entropy power goes under the name of perplexity in the field of natural language processing [51] and true diversity in the field of ecology [52]. It also corresponds to the volume of the smallest set that contains most of the probability measure [49], and it can be interpreted as a measure of statistical dispersion [53]. It is also related to Fisher information via Stam's inequality [54].
Cumulative entropy was formulated in [55] as a modification of cumulative residual entropy [56]. It is popular in reliability theory, where it is used to characterize uncertainty over time intervals. Beyond reliability analysis, it has been used in data mining tasks such as dependency analysis [57] and subspace cluster analysis [58], where it has proved effective due to its good estimation properties. These data mining investigations have used cumulative entropy at a global level (over the entire data domain), rather than at the local (tail) level as in our study. Generalized variants based on Tsallis q-statistics have been developed for both entropy [59] and cumulative entropy [60]. Inclusion of the extra q parameter can provide higher robustness to anomalies and better fitting to the characteristics of data distributions. Tail entropy has been used in financial applications for measuring the expected shortfall [61] in the upper tail using quantization. This differs from our context, where our exclusive focus is on lower tails and we develop exact results for an asymptotic regime where the lower tail size approaches zero.
Divergences between probability distributions are a fundamental building block in statistics, used to assess the degree to which one probability distribution differs from another. They have a wide range of formulations [1] and applications, ranging from use as objective functions in supervised and unsupervised machine learning [62], to hypothesis, two-sample and goodness-of-fit testing in statistics [63], as well as generative modeling in deep learning, particularly using the Wasserstein distance [64]. Asymptotic forms of KL divergence have been investigated by Contreras-Reyes [65] for the comparison of multivariate asymmetric heavy-tailed distributions.
Finally, we note that this work considerably expands a recent study by Bailey et al. [3], which established relationships between tail entropies and LID. The current paper extends and generalizes that work in several directions: (i) we establish general lemmas that provide sufficient conditions for when it is possible to substitute a tail distribution with components such as a power law inside an integral, whereas the techniques of [3] were specially crafted for specific integrals; (ii) we provide results for statistical divergences and distances (the work of [3] only considers entropy); and (iii) we show how to formulate results for the multivariate context ([3] only considers univariate scenarios).
3. Local Intrinsic Dimensionality
In this section, we summarize the LID model using the presentation of [2]. LID can be regarded as a continuous extension of the expansion dimension [66,67]. Like earlier expansion-based models of intrinsic dimension, its motivation comes from the relationship between volume and radius in an expanding ball, where (as originally stated in [68]) the volume of the ball is taken to be the probability measure associated with the region it encloses. The probability as a function of radius—denoted by $F(r)$—has the form of a univariate cumulative distribution function (CDF). The model formulation (as stated in [2]) generalizes this notion to real-valued functions F, under appropriate assumptions of smoothness.
Definition 1 ([2]). Let F be a real-valued function that is non-zero over some open interval containing $r \in \mathbb{R}$, $r \neq 0$. The intrinsic dimensionality of F at r is defined as follows, whenever the limit exists:
$$\mathrm{ID}_F(r) \;\triangleq\; \lim_{\epsilon \to 0^+} \frac{\ln\!\big( F((1+\epsilon)\,r) \,/\, F(r) \big)}{\ln(1+\epsilon)}\,.$$
When F satisfies certain smoothness conditions in the vicinity of r, its intrinsic dimensionality has a convenient known form:
Theorem 1 ([2]). Let F be a real-valued function that is non-zero over some open interval containing $r \in \mathbb{R}$, $r \neq 0$. If F is continuously differentiable at r, and using $F'$ to denote the derivative of F, then
$$\mathrm{ID}_F(r) \;=\; \frac{r \cdot F'(r)}{F(r)}\,.$$
Let $\mathbf{x}$ be a location of interest within a data domain for which a distance measure $d$ has been defined. To any generated sample $\mathbf{y}$ we associate the distance $d(\mathbf{x}, \mathbf{y})$; in this way, a global distribution that produces the sample $\mathbf{y}$ can be said to induce the random value $d(\mathbf{x}, \mathbf{y})$ from a local distribution of distances taken with respect to $\mathbf{x}$. The CDF $F(r)$ of the local distance distribution is simply the probability of the sample distance lying within a threshold r—that is, $F(r) = \Pr[\, d(\mathbf{x}, \mathbf{y}) \le r \,]$. In characterizing the local intrinsic dimensionality in the vicinity of location $\mathbf{x}$, we are interested in the limit of $\mathrm{ID}_F(r)$ as the distance r tends to 0, which we denote by
$$\mathrm{ID}^*_F \;\triangleq\; \lim_{r \to 0^+} \mathrm{ID}_F(r)\,.$$
Henceforth, when we refer to the local intrinsic dimensionality (LID) of a function F, or of a point $\mathbf{x}$ whose induced distance distribution has F as its CDF, we will take 'LID' to mean the quantity $\mathrm{ID}^*_F$. In general, $\mathrm{ID}^*_F$ is not necessarily an integer. In practice, estimation of the LID at $\mathbf{x}$ would give an indication of the dimension of the submanifold containing $\mathbf{x}$ that best fits the distribution.
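To make the estimation step concrete, the following minimal sketch (ours, in Python with NumPy) implements the maximum likelihood (Hill-type) LID estimator of [39,40,41] from a query point's nearest-neighbor distances; the function name and the sanity check are our own illustration, not code from the cited works.

```python
import numpy as np

def lid_mle(distances):
    """Maximum likelihood (Hill) estimate of LID at a query point,
    computed from its k nearest-neighbor distances."""
    r = np.sort(np.asarray(distances, dtype=float))
    r = r[r > 0]                              # guard against zero distances
    # ID estimate: -1 / mean of log(r_i / r_k), with r_k the largest distance
    return -1.0 / np.mean(np.log(r / r[-1]))

# Sanity check: radii with CDF F(r) = r^d on [0, 1] have LID = d.
rng = np.random.default_rng(0)
d, k = 7, 1000
print(lid_mle(rng.random(k) ** (1.0 / d)))    # close to 7 for large k
```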
The function $\mathrm{ID}_F$ can be seen to fully characterize its associated function F. This result is analogous to a foundational result from the statistical theory of extreme values (EVT), in that it corresponds, under an inversion transformation, to the Karamata representation theorem [69] for the upper tails of regularly varying functions. For more information on EVT and how the LID model relates to the extreme-value-theoretic generalized Pareto distribution, we refer the reader to [2,70,71].
Theorem 2 (LID Representation Theorem [2]). Let F be a real-valued function, and assume that $\mathrm{ID}^*_F$ exists. Let x and w be values for which $x/w$ and $F(x)/F(w)$ are both positive. If F is non-zero and continuously differentiable everywhere in the interval $[\min\{x, w\}, \max\{x, w\}]$, then
$$\frac{F(x)}{F(w)} \;=\; \left(\frac{x}{w}\right)^{\mathrm{ID}^*_F} \cdot G_F(x, w), \quad \text{where} \quad G_F(x, w) \;\triangleq\; \exp\left( \int_x^w \frac{\mathrm{ID}^*_F - \mathrm{ID}_F(t)}{t}\, \mathrm{d}t \right),$$
whenever the integral exists. In [2], conditions on x and w are provided under which the factor $G_F(x, w)$ can be seen to tend to 1 as $x, w \to 0$. The convergence characteristics of F to its asymptotic form are expressed by the factor $G_F$, which is related to the slowly varying component of functions as studied in EVT [70]. As we will show in the next section, we make use of the LID Representation Theorem in our analysis of the limits of tail entropy variants under a form of normalization.
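As a numeric illustration of the theorem's asymptotic content, the following sketch (ours, assuming SciPy's chi distribution) checks that $F(x)/F(w)$ approaches the power law $(x/w)^{\mathrm{ID}^*_F}$ as the tail shrinks; for distances from the center of a d-dimensional standard Gaussian, F is a chi CDF with $\mathrm{ID}^*_F = d$.

```python
import numpy as np
from scipy import stats

# Distances from the center of a d-dimensional standard Gaussian follow a
# chi distribution, whose CDF F satisfies ID*_F = d.
d = 3
F = stats.chi(df=d).cdf

for w in [1.0, 0.1, 0.01]:
    x = w / 2.0
    print(w, F(x) / F(w), (x / w) ** d)   # ratio -> (x/w)^d = 0.125 as w -> 0
```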
4. Definitions of Tail Entropies and Tail Dissimilarity Measures
In this section, we present the formulations of entropy, divergences and distances that will be studied in later sections, in light of the model of local intrinsic dimensionality outlined in Section 3. These entropies and dissimilarity measures will all be conditioned on the lower tails of smooth functions on domains bounded from below at zero. In each case, the formulations involve one or more non-negative real-valued functions whose restriction to the lower tail $(0, w]$ satisfies certain smooth growth properties:
Definition 2. Let F be a non-negative real-valued function that is positive except at zero. We say that F is a smooth growth function if:
- there exists a value $w > 0$ such that F is monotonically increasing over $(0, w]$;
- F is continuous over $(0, w]$;
- F is differentiable over $(0, w]$; and
- the local intrinsic dimensionality $\mathrm{ID}^*_F$ exists and is positive.
Given a smooth growth function F and a value $w > 0$, we define $F_w(t) \triangleq F(t)/F(w)$. If F is the CDF of some random variable X, then $F_w(t) = \Pr[X \le t \mid X \le w]$ for $t \le w$, which can in turn be interpreted as the CDF of the distribution of X conditioned to the lower tail $[0, w]$. It is easy to see that, for a sufficiently small choice of w, $F_w$ must also be a smooth growth function. Its derivative $f_w \triangleq F'_w$ exists since $F'$ exists, and thus $f_w$ can be regarded as the probability density function (PDF) of the restriction of F to $[0, w]$. In addition, it can easily be shown (using Theorem 1) that the LID of $F_w$ is identical to that of F.
If the monotonicity of the function F is strict over the domain of interest $[0, w]$, its inverse function $F^{-1}$ exists and satisfies the smooth growth conditions within some neighborhood of the origin. Moreover, $F_w^{-1}$ is also a smooth growth function over $[0, 1]$, with $F_w^{-1}(u) = F^{-1}(u \cdot F(w))$ and $\mathrm{ID}^*_{F_w^{-1}} = 1 / \mathrm{ID}^*_F$.
The following tail entropy, tail divergence and tail distance formulations all apply to any functions F and G satisfying the conditions stated above; in particular, they involve one or more of $F_w$, $f_w$, $G_w$, $g_w$ and (if the monotonicity of the functions is strict) $F_w^{-1}$ and $G_w^{-1}$. In their definitions, the only difference between the tail variants and the original versions is that the distributions are conditioned on the lower tail $[0, w]$. In the tail measures involving one or more of $F_w$, $f_w$, $G_w$ and $g_w$, integration is performed over the lower tail $[0, w]$ and not the entire distributional range; for the variants involving $F_w^{-1}$ and $G_w^{-1}$, integration is performed over $[0, 1]$ for values of w constrained to the lower tail.
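For concreteness, a minimal sketch of the tail-conditioning construction (our own illustration, using SciPy; the exponential base distribution is an arbitrary choice of smooth growth CDF):

```python
from scipy import stats

base = stats.expon()   # any CDF F with F(0) = 0 and a smooth lower tail
w = 0.05               # tail length

def F_w(t):
    return base.cdf(t) / base.cdf(w)   # tail-conditioned CDF F_w(t) = F(t)/F(w)

def f_w(t):
    return base.pdf(t) / base.cdf(w)   # tail-conditioned PDF f_w(t) = f(t)/F(w)
```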
We begin with (differential) tail entropy. Entropy is perhaps the most fundamental and widely used model of data complexity and can be regarded as a measure of the uncertainty of a distribution. Differential entropy assesses the expected surprisal of a random variable and can take negative values.
Definition 3 (Tail Entropy). The entropy of F conditioned on $[0, w]$ is
$$H(F, w) \;\triangleq\; -\int_0^w f_w(t)\, \ln f_w(t)\, \mathrm{d}t\,.$$
The tail entropy is equal to $\mathbb{E}[-\ln f_w(X)]$, the expected value of the negative (tail) log-likelihood. It is also possible to define the variance of the (tail) log-likelihood; this is known as the varentropy. To understand this further, note that one may define the information content of a random variable X with density function f to be $-\ln f(X)$. The entropy (uncertainty) then corresponds to the expected value of the information content of X, and the varentropy corresponds to the variance of the information content of X. The varentropy was introduced by Song [72] as an intrinsic measure of the shape of a distribution and has been explored in a range of studies [73,74,75].
Definition 4 (Tail Varentropy). The varentropy of F conditioned on $[0, w]$ is
$$V(F, w) \;\triangleq\; \int_0^w f_w(t)\, \big( \ln f_w(t) \big)^2\, \mathrm{d}t \;-\; H(F, w)^2\,.$$
The cumulative entropy is a variant of entropy proposed in [55,56] due to its attractive theoretical properties. Tail conditioning of the cumulative entropy has the same general form as that of the tail entropy. Cumulative entropy [55,56] is an information-theoretic measure popular in reliability theory, where it is used to model uncertainty over time intervals. It corresponds to the expected value of the mean inactivity time. Compared to ordinary Shannon differential entropy, cumulative entropy has certain attractive properties, such as non-negativity and ease of estimation.
Definition 5 (Cumulative Tail Entropy). The cumulative entropy of F conditioned on $[0, w]$ is
$$\mathcal{CE}(F, w) \;\triangleq\; -\int_0^w F_w(t)\, \ln F_w(t)\, \mathrm{d}t\,.$$
The entropy power is the exponential of the entropy, and is also known as perplexity in the natural language processing community. It corresponds to the volume of the smallest set that contains most of the probability measure [49], and can be interpreted as a measure of statistical dispersion [53]. There are several standard definitions of entropy power in the research literature. For our purposes, we adopt the simplest—the exponential of Shannon entropy—for our definition conditioned to the tail.
Definition 6 (Tail Entropy Power). The entropy power of F conditioned on $[0, w]$ is defined to be
$$\mathrm{EP}(F, w) \;\triangleq\; \exp\!\big( H(F, w) \big)\,.$$
In the introduction, we briefly mentioned some motivation for the entropy power. We can add to this as follows:
- It can be interpreted as a diversity. Observe that when F is a (univariate) uniform distance distribution ranging over the interval $[0, w]$, we have $H(F, w) = \ln w$ and $\mathrm{EP}(F, w) = w$. In other words, the entropy power is equal to the 'effective diversity' of the distribution (the number of neighbor distance possibilities).
- Given two different queries, each with its own neighborhood, one query with tail entropy power equal to 2 and the other with tail entropy power equal to 4, we can say that the distance distribution of the second query is twice as diverse as that of the first query.
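A small numeric sketch (ours) evaluates the tail entropy integral and its exponential for an exact power-law tail $F_w(t) = (t/w)^d$; for this idealized form, straightforward calculus gives $H(F, w) = \ln(w/d) + (d-1)/d$, which the quadrature reproduces.

```python
import numpy as np
from scipy import integrate

d, w = 2.5, 0.1                                 # power-law tail with LID = d
f_w = lambda t: d * t ** (d - 1) / w ** d       # tail PDF of F_w(t) = (t/w)^d

H, _ = integrate.quad(lambda t: -f_w(t) * np.log(f_w(t)), 0, w)
print(H, np.log(w / d) + (d - 1) / d)           # the two values agree
print(np.exp(H))                                # tail entropy power EP(F, w)
```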
For each of the tail entropy variants introduced above, we also propose analogous variants based on the q-entropy formulation due to Tsallis [59]. Generalized Tsallis entropies [59,60] are a family of entropies characterized via an exponent parameter q applied to the probabilities, in which the traditional (Shannon) entropy variants are obtained as the special case where q tends to 1. The use of such a parameter q can often facilitate more accurate fitting of data characteristics and robustness to outliers.
Definition 7 (Tail q-Entropy). For any $q \neq 1$, the q-entropy of F conditioned on $[0, w]$ is defined to be
$$H_q(F, w) \;\triangleq\; \frac{1}{q - 1} \left( 1 - \int_0^w f_w(t)^q\, \mathrm{d}t \right).$$
Definition 8 (Cumulative Tail q-Entropy). For any $q \neq 1$, the cumulative q-entropy of F conditioned on $[0, w]$ is defined to be
$$\mathcal{CE}_q(F, w) \;\triangleq\; \frac{1}{q - 1} \int_0^w \big( F_w(t) - F_w(t)^q \big)\, \mathrm{d}t\,.$$
We define the tail q-entropy power using the q-exponential function from Tsallis statistics [59], $\exp_q(x) \triangleq \big( 1 + (1 - q)\,x \big)^{1/(1-q)}$. Note that L'Hôpital's rule can be used to show that $\exp_q(x) \to \exp(x)$ as $q \to 1$.
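A two-line check (ours) of this limiting behavior of the q-exponential:

```python
import numpy as np

# q-exponential: exp_q(x) = (1 + (1-q) x)^(1/(1-q)), defined here for q != 1.
exp_q = lambda x, q: (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

for q in [0.5, 0.9, 0.99, 0.999]:
    print(q, exp_q(1.0, q), np.exp(1.0))   # exp_q(1) -> e as q -> 1
```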
Definition 9 (Tail q-Entropy Power). For any $q \neq 1$, the q-entropy power of F conditioned on $[0, w]$ is defined to be
$$\mathrm{EP}_q(F, w) \;\triangleq\; \exp_q\!\big( H_q(F, w) \big)\,.$$
We next define the tail cross entropy. Cross entropy can be used to compare two probability distributions and is often employed as a loss function in machine learning, comparing a true distribution and a learned distribution. From an information-theoretic perspective, cross entropy corresponds to the expected coding length when a wrong distribution G is assumed while the data actually follow a distribution F.
Definition 10 (Tail Cross Entropy). The cross entropy from F to G, conditioned on $[0, w]$, is defined to be
$$H^{\times}(F, G, w) \;\triangleq\; -\int_0^w f_w(t)\, \ln g_w(t)\, \mathrm{d}t\,.$$
Similar to entropy power, we can also define the cross entropy power, which is the exponential of the cross entropy.
Definition 11 (Tail Cross Entropy Power). The cross entropy power from F to G, conditioned on $[0, w]$, is defined to be
$$\mathrm{EP}^{\times}(F, G, w) \;\triangleq\; \exp\!\big( H^{\times}(F, G, w) \big)\,.$$
A classic and fundamental method for comparing two probability distributions is the Kullback–Leibler divergence (KL divergence) [76]. $\mathrm{KL}(F \,\Vert\, G)$ measures the degree to which a probability distribution G differs from a reference probability distribution F. It is a member of both the family of f-divergences and the family of Bregman divergences. It is widely used in statistics, machine learning and information theory.
Definition 12 (Tail KL Divergence). The Kullback–Leibler divergence from F to G, conditioned on $[0, w]$, is defined to be
$$\mathrm{KL}(F \,\Vert\, G, w) \;\triangleq\; \int_0^w f_w(t)\, \ln \frac{f_w(t)}{g_w(t)}\, \mathrm{d}t\,.$$
The tail KL divergence can be connected to the tail entropy and the tail cross entropy according to the relationship $\mathrm{KL}(F \,\Vert\, G, w) = H^{\times}(F, G, w) - H(F, w)$.
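A numeric sketch (ours) for the idealized case of two exact power-law tails, the form toward which tail distributions converge (Section 5): here direct integration shows that the tail KL divergence collapses to an expression in the two exponents alone, independent of w.

```python
import numpy as np
from scipy import integrate

dF, dG, w = 3.0, 5.0, 0.01                      # tail CDFs (t/w)^dF, (t/w)^dG
f_w = lambda t: dF * t ** (dF - 1) / w ** dF
g_w = lambda t: dG * t ** (dG - 1) / w ** dG

kl, _ = integrate.quad(lambda t: f_w(t) * np.log(f_w(t) / g_w(t)), 0, w)
print(kl, np.log(dF / dG) + dG / dF - 1)        # the two values agree
```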
The Jensen–Shannon divergence (JS divergence) [77] is another popular measure of distance between probability distributions. It is based on the KL divergence but, unlike the KL divergence, the square root of the JS divergence is a true metric.
Definition 13 (Tail JS Divergence). The Jensen–Shannon divergence between F and G, conditioned on $[0, w]$, is defined to be
$$\mathrm{JS}(F, G, w) \;\triangleq\; \frac{1}{2} \int_0^w f_w(t)\, \ln \frac{f_w(t)}{m_w(t)}\, \mathrm{d}t \;+\; \frac{1}{2} \int_0^w g_w(t)\, \ln \frac{g_w(t)}{m_w(t)}\, \mathrm{d}t\,, \qquad m_w \triangleq \frac{f_w + g_w}{2}\,.$$
The tail JS divergence can also be written in terms of the tail entropies: it equals the entropy of the equally weighted mixture of the two tail distributions, minus the average of the two individual tail entropies.
The L2 distance is the squared Euclidean distance when comparing two probability distributions. It is part of the family of $L_p$ distances when setting $p = 2$ [78].
Definition 14 (Tail L2 Distance). The L2 distance between F and G, conditioned on $[0, w]$, is defined to be
$$L_2(F, G, w) \;\triangleq\; \int_0^w \big( f_w(t) - g_w(t) \big)^2\, \mathrm{d}t\,.$$
The Hellinger distance [79] is a true metric for comparing two probability distributions. The squared Hellinger distance is a member of the family of f-divergences and is part of the family of $\alpha$-divergences when setting $\alpha = 1/2$ [80].
Definition 15 (Tail Hellinger Distance). The Hellinger distance between F and G, conditioned on $[0, w]$, is defined to be
$$\mathrm{He}(F, G, w) \;\triangleq\; \left( \frac{1}{2} \int_0^w \Big( \sqrt{f_w(t)} - \sqrt{g_w(t)} \Big)^2\, \mathrm{d}t \right)^{1/2}.$$
The $\chi^2$ divergence between two probability distributions [81] is a member of the family of f-divergences and is part of the family of $\alpha$-divergences when setting $\alpha = 2$ [80].
Definition 16 (Tail $\chi^2$-Divergence). The $\chi^2$ divergence from F to G, conditioned on $[0, w]$, is defined to be
$$\chi^2(F \,\Vert\, G, w) \;\triangleq\; \int_0^w \frac{\big( f_w(t) - g_w(t) \big)^2}{g_w(t)}\, \mathrm{d}t\,.$$
The asymmetric $\alpha$-divergence [80] is another member of the family of f-divergences. When $\alpha = 2$ it is proportional to the $\chi^2$ divergence. When $\alpha = 1/2$ it is proportional to the squared Hellinger distance. As $\alpha \to 1$ it corresponds to the KL divergence.
Definition 17 (Tail $\alpha$-Divergence). The $\alpha$-divergence from F to G, conditioned on $[0, w]$, is defined to be
$$D_\alpha(F \,\Vert\, G, w) \;\triangleq\; \frac{1}{\alpha(\alpha - 1)} \left( \int_0^w f_w(t)^{\alpha}\, g_w(t)^{1 - \alpha}\, \mathrm{d}t \;-\; 1 \right).$$
The Wasserstein distance between two probability distributions is also known as the Kantorovich–Rubinstein metric [82] or the earth mover's distance. It has become very popular as part of the loss function used in generative adversarial networks [83]. In the univariate case, it can be expressed in a simple analytic form.
Definition 18 (Tail Wasserstein Distance). The p-th Wasserstein distance between F and G, conditioned on $[0, w]$, is defined to be
$$W_p(F, G, w) \;\triangleq\; \left( \int_0^1 \left| F_w^{-1}(u) - G_w^{-1}(u) \right|^p\, \mathrm{d}u \right)^{1/p}.$$
For some of the aforementioned tail measures, we will also consider a normalization of the entropy, divergence or distance (as the case may be) with respect to w, the length of the tail. In Section 5 and Section 6, we will show that as w tends to zero, the limits of these (possibly normalized) tail entropies and tail divergences can be expressed in terms of the local intrinsic dimensionalities of F and G. The notation for these variants, and our results for their limits in terms of $\mathrm{ID}^*_F$ and $\mathrm{ID}^*_G$, are summarized in Table 1.
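To close the section, a sketch (ours) of the univariate tail Wasserstein distance for two exact power-law tails, whose inverse tail CDFs are $F_w^{-1}(u) = w\,u^{1/d_F}$ and $G_w^{-1}(u) = w\,u^{1/d_G}$; for $p = 1$, the distance normalized by w reduces to $|d_F/(d_F+1) - d_G/(d_G+1)|$.

```python
import numpy as np
from scipy import integrate

dF, dG, w = 3.0, 5.0, 0.01
Finv = lambda u: w * u ** (1.0 / dF)    # inverse tail CDF of (t/w)^dF
Ginv = lambda u: w * u ** (1.0 / dG)    # inverse tail CDF of (t/w)^dG

w1, _ = integrate.quad(lambda u: abs(Finv(u) - Ginv(u)), 0, 1)
print(w1 / w, abs(dF / (dF + 1) - dG / (dG + 1)))   # both equal 1/12
```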
5. Simplification of Tail Measures
Next, we present the main theoretical contributions of the paper: three technical lemmas that will later be used to establish relationships between local intrinsic dimensionality and a variety of tail measures based on entropy, divergences or distances. The results presented in this section all apply asymptotically, as the tail boundary tends toward zero.
Each of the three lemmas allows, under certain conditions, the simplification of limits of integrals involving smooth growth functions of the form $F_w$ (as defined in Section 4), or the associated first derivative $f_w$ or inverse function $F_w^{-1}$. The limit integral simplifications allow the substitution of the function (or derivative, or inverse) by expressions that involve one or more of the following: the LID value of the function, the variable of integration, or the tail boundary w. Moreover, the lemmas require that the integrand be monotone with respect to small variations in the targeted function.
The first lemma allows terms of the form $F_w(z)$ (resembling the CDF of a tail-conditioned distribution) to be converted into the power-law term $(z/w)^{\mathrm{ID}^*_F}$, which depends only on the variable of integration, the tail length w, and the local intrinsic dimension $\mathrm{ID}^*_F$.
Lemma 1. Let F be a smooth growth function over the interval $(0, w]$. Consider a function admitting a representation of the form
$$\Phi(w) \;=\; \int_0^w \psi\big( F_w(z),\, z,\, w \big)\, \mathrm{d}z\,,$$
where:
- $\psi(t, z, w)$ is a real-valued function of $t \in (0, 1]$, $z \in (0, w]$ and $w > 0$;
- $\psi$ is monotone with respect to its first argument t; and
- for all fixed choices of t and w satisfying $0 < t \le 1$ and $w > 0$, $\psi(t, z, w)$ is monotone and continuously partially differentiable with respect to z over the interval $(0, w]$.
Then
$$\lim_{w \to 0^+} \int_0^w \psi\big( F_w(z), z, w \big)\, \mathrm{d}z \;=\; \lim_{w \to 0^+} \int_0^w \psi\Big( \big(\tfrac{z}{w}\big)^{\mathrm{ID}^*_F},\, z,\, w \Big)\, \mathrm{d}z\,,$$
whenever the latter limit exists or diverges to $+\infty$ or $-\infty$.
Proof. Since F is assumed to be a smooth growth function, the limit $\mathrm{ID}^*_F = \lim_{r \to 0^+} \mathrm{ID}_F(r)$ exists and is positive. We present an 'epsilon-delta' argument based on this limit. For any real value $\epsilon$ satisfying $0 < \epsilon < \mathrm{ID}^*_F$, there must exist a value $\delta > 0$ such that $0 < r \le \delta$ implies that $|\mathrm{ID}_F(r) - \mathrm{ID}^*_F| \le \epsilon$. Therefore, when $0 < z \le w \le \delta$,
$$-\epsilon \ln(w/z) \;\le\; \int_z^w \frac{\mathrm{ID}^*_F - \mathrm{ID}_F(t)}{t}\, \mathrm{d}t \;\le\; \epsilon \ln(w/z)\,.$$
Exponentiating, we obtain the bounds
$$\left(\frac{z}{w}\right)^{\epsilon} \;\le\; G_F(z, w) \;\le\; \left(\frac{z}{w}\right)^{-\epsilon}.$$
Applying this bound together with Theorem 2, the ratio $F_w(z) / (z/w)^{\mathrm{ID}^*_F}$ can be seen to satisfy
$$\left(\frac{z}{w}\right)^{\epsilon} \;\le\; \frac{F_w(z)}{(z/w)^{\mathrm{ID}^*_F}} \;\le\; \left(\frac{z}{w}\right)^{-\epsilon}. \qquad (1)$$
Over the domain of interest $0 < z \le w$, the assumption that $\epsilon < \mathrm{ID}^*_F$ ensures that $F_w(z)$ and its power-law bounds $(z/w)^{\mathrm{ID}^*_F \pm \epsilon}$ derived from Inequality (1) all lie in the interval $(0, 1]$. Since $\psi$ has been assumed to be monotone with respect to its first argument, the maximum and minimum attained by $\psi$ over first-argument values restricted to any (closed) subinterval of $(0, 1]$ must occur at opposite endpoints of the subinterval. With this in mind, for any choice of $w \le \delta$, Inequality (1) implies that
$$\int_0^w \psi_{-}(z)\, \mathrm{d}z \;\le\; \int_0^w \psi\big( F_w(z), z, w \big)\, \mathrm{d}z \;\le\; \int_0^w \psi_{+}(z)\, \mathrm{d}z\,,$$
where $\psi_{-}(z)$ and $\psi_{+}(z)$ denote the smaller and larger, respectively, of $\psi\big( (z/w)^{\mathrm{ID}^*_F - \epsilon}, z, w \big)$ and $\psi\big( (z/w)^{\mathrm{ID}^*_F + \epsilon}, z, w \big)$. Since $\psi_{+}$ and $\psi_{-}$ are also continuously partially differentiable with respect to z over $(0, w]$, their integrals converge to $\int_0^w \psi\big( (z/w)^{\mathrm{ID}^*_F}, z, w \big)\, \mathrm{d}z$ as $\epsilon \to 0$. It therefore follows from the squeeze theorem for integrals that
$$\lim_{w \to 0^+} \int_0^w \psi\big( F_w(z), z, w \big)\, \mathrm{d}z \;=\; \lim_{w \to 0^+} \int_0^w \psi\Big( \big(\tfrac{z}{w}\big)^{\mathrm{ID}^*_F},\, z,\, w \Big)\, \mathrm{d}z\,,$$
whenever the right-hand limit exists or diverges. □
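A numeric illustration (ours, assuming SciPy's chi distribution) of what Lemma 1 licenses: for F a chi CDF (so $\mathrm{ID}^*_F = d$) and the cumulative-entropy integrand $\psi(t) = -t \ln t$, the tail integral of $\psi(F_w(z))$ approaches that of $\psi((z/w)^d)$ as $w \to 0$; both integrals are normalized by w so the common limiting value is visible.

```python
import numpy as np
from scipy import stats, integrate

d = 3
F = stats.chi(df=d).cdf            # smooth growth function with ID*_F = d
psi = lambda t: -t * np.log(t)     # cumulative-entropy integrand

for w in [1.0, 0.1, 0.01]:
    lhs, _ = integrate.quad(lambda z: psi(F(z) / F(w)), 0, w)
    rhs, _ = integrate.quad(lambda z: psi((z / w) ** d), 0, w)
    print(w, lhs / w, rhs / w)     # both columns converge to d/(d+1)^2
```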
In a manner similar to that of the preceding lemma, the following result allows terms of the form $F_w^{-1}(u)$ (the inverse of $F_w$) to be converted into a term that depends only on the variable of integration, the tail length w and the local intrinsic dimension $\mathrm{ID}^*_F$. Here, in order to ensure the existence of the inverse function, F (and by extension $F_w$ and $F_w^{-1}$) must be strictly monotonically increasing over the tail.
Lemma 2. Let F be a smooth growth function over the interval $(0, w]$. Let us also assume that, over this interval, the monotonicity of F is strict. Consider a function admitting a representation of the form
$$\Phi(w) \;=\; \int_0^1 \psi\big( F_w^{-1}(u),\, u,\, w \big)\, \mathrm{d}u\,,$$
where:
- $\psi(z, u, w)$ is a real-valued function of $z \in (0, w]$, $u \in (0, 1]$ and $w > 0$;
- $F_w^{-1}(u) = F^{-1}(u \cdot F(w))$ for all $u \in (0, 1]$, where $F^{-1}$ is restricted to values of t in $(0, w]$; and
- for all fixed choices of u and w satisfying $0 < u \le 1$ and $w > 0$, $\psi(z, u, w)$ is monotone and continuously partially differentiable with respect to z over the interval $(0, w]$.
Then
$$\lim_{w \to 0^+} \int_0^1 \psi\big( F_w^{-1}(u), u, w \big)\, \mathrm{d}u \;=\; \lim_{w \to 0^+} \int_0^1 \psi\big( w\, u^{1/\mathrm{ID}^*_F}, u, w \big)\, \mathrm{d}u\,,$$
whenever the latter limit exists or diverges to $+\infty$ or $-\infty$.
Proof. First, we note that the strict monotonicity of F implies that for all $u \in (0, 1]$ and $w > 0$, the function $F_w^{-1}(u)$ is uniquely defined when $F^{-1}$ is restricted to $(0, w]$.
As in the proof of Lemma 1, an 'epsilon-delta' argument based on the existence of the limit $\mathrm{ID}^*_F = \lim_{r \to 0^+} \mathrm{ID}_F(r)$ yields the following: for any real value $\epsilon$ satisfying $0 < \epsilon < \mathrm{ID}^*_F$, there exists a value $\delta > 0$ such that
$$\left(\frac{t}{w}\right)^{\mathrm{ID}^*_F + \epsilon} \;\le\; F_w(t) \;\le\; \left(\frac{t}{w}\right)^{\mathrm{ID}^*_F - \epsilon}$$
holds for all $0 < t \le w \le \delta$. Solving for t through exponentiation of the bounds, and then setting $u = F_w(t)$, we obtain
$$w\, u^{1/(\mathrm{ID}^*_F - \epsilon)} \;\le\; F_w^{-1}(u) \;\le\; w\, u^{1/(\mathrm{ID}^*_F + \epsilon)}.$$
The remainder of the proof follows essentially the same path as that of Lemma 1. Over the domain of interest $(0, w]$, the assumption that $\epsilon < \mathrm{ID}^*_F$ ensures that the exponents $1/(\mathrm{ID}^*_F \pm \epsilon)$ are positive, and that u lies in the interval $(0, 1]$. Since $\psi$ has been assumed to be monotone with respect to z, the maximum and minimum attained by $\psi$ over choices of z restricted to any (closed) subinterval of $(0, w]$ must occur at opposite endpoints. Therefore, for any choice of $w \le \delta$,
$$\int_0^1 \psi_{-}(u)\, \mathrm{d}u \;\le\; \int_0^1 \psi\big( F_w^{-1}(u), u, w \big)\, \mathrm{d}u \;\le\; \int_0^1 \psi_{+}(u)\, \mathrm{d}u\,,$$
where $\psi_{-}(u)$ and $\psi_{+}(u)$ denote the smaller and larger, respectively, of $\psi\big( w\, u^{1/(\mathrm{ID}^*_F - \epsilon)}, u, w \big)$ and $\psi\big( w\, u^{1/(\mathrm{ID}^*_F + \epsilon)}, u, w \big)$. Since $\psi$ is also continuously partially differentiable with respect to z, the integrals of $\psi_{-}$ and $\psi_{+}$ converge to $\int_0^1 \psi\big( w\, u^{1/\mathrm{ID}^*_F}, u, w \big)\, \mathrm{d}u$ as $\epsilon \to 0$. It therefore follows from the squeeze theorem for integrals that
$$\lim_{w \to 0^+} \int_0^1 \psi\big( F_w^{-1}(u), u, w \big)\, \mathrm{d}u \;=\; \lim_{w \to 0^+} \int_0^1 \psi\big( w\, u^{1/\mathrm{ID}^*_F}, u, w \big)\, \mathrm{d}u\,,$$
whenever the right-hand limit exists or diverges. □
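A companion numeric sketch (ours, again assuming SciPy's chi distribution): with F the chi CDF (so $\mathrm{ID}^*_F = d$) and the identity integrand $\psi(z) = z$ (as appears in the tail Wasserstein distance), the integral of $F_w^{-1}(u)$ over $[0, 1]$ approaches that of $w\,u^{1/d}$ as $w \to 0$.

```python
import numpy as np
from scipy import stats, integrate

d = 3
dist = stats.chi(df=d)
Finv_w = lambda u, w: dist.ppf(u * dist.cdf(w))   # F_w^{-1}(u) = F^{-1}(u F(w))

for w in [1.0, 0.1, 0.01]:
    lhs, _ = integrate.quad(lambda u: Finv_w(u, w), 0, 1)
    rhs, _ = integrate.quad(lambda u: w * u ** (1.0 / d), 0, 1)
    print(w, lhs / w, rhs / w)    # normalized by w; both converge to d/(d+1)
```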
The third lemma facilitates the conversion of a term of the form $f_w(z)$ to $F_w(z)$, together with a factor that depends only on the variable of integration and $\mathrm{ID}^*_F$. Since F is assumed to be a smooth growth function, $F_w$ must be smooth as well, and therefore $F_w$ satisfies the conditions of Theorem 1 over $(0, w]$. Hence, $f_w(z)$ can be substituted by an expression involving $\mathrm{ID}_F(z)$:
$$f_w(z) \;=\; F_w'(z) \;=\; \frac{\mathrm{ID}_F(z) \cdot F_w(z)}{z}\,.$$
The substitution comes at the cost of introducing a non-constant factor $\mathrm{ID}_F(z)$. The following lemma shows that $\mathrm{ID}_F(z)$ can in turn be substituted by the constant $\mathrm{ID}^*_F$, provided that certain monotonicity assumptions are satisfied.
Lemma 3. Let F be a smooth growth function over the interval $(0, w]$. Consider a function admitting a representation of the form
$$\Phi(w) \;=\; \int_0^w \psi\big( \mathrm{ID}_F(z),\, z,\, w \big)\, \mathrm{d}z\,,$$
where:
- $\psi(t, z, w)$ is a real-valued function of $t > 0$, $z \in (0, w]$ and $w > 0$;
- $\psi$ is monotone and continuous with respect to its first argument t; and
- there exists a value $c > 0$ such that for all fixed choices of t satisfying $|t - \mathrm{ID}^*_F| \le c$, $\psi(t, z, w)$ is monotone with respect to z over the interval $(0, w]$.
Then
$$\lim_{w \to 0^+} \int_0^w \psi\big( \mathrm{ID}_F(z), z, w \big)\, \mathrm{d}z \;=\; \lim_{w \to 0^+} \int_0^w \psi\big( \mathrm{ID}^*_F, z, w \big)\, \mathrm{d}z\,,$$
whenever the latter limit exists or diverges to $+\infty$ or $-\infty$.
Proof. Since F is assumed to be a smooth growth function, the limit $\mathrm{ID}^*_F = \lim_{r \to 0^+} \mathrm{ID}_F(r)$ exists and is positive. We present an 'epsilon-delta' argument based on this limit. For any real value $\epsilon$ satisfying $0 < \epsilon < \min\{c, \mathrm{ID}^*_F\}$, there must exist a value $\delta > 0$ such that $0 < z \le \delta$ implies that $|\mathrm{ID}_F(z) - \mathrm{ID}^*_F| \le \epsilon$.
Since $\psi(t, z, w)$ has been assumed to be monotone with respect to z over the interval $(0, w]$ for every fixed t with $|t - \mathrm{ID}^*_F| \le c$, the restriction $w \le \delta$ ensures that the values $\mathrm{ID}_F(z)$ remain within this range over the entire domain of interest $(0, w]$. Therefore, the maximum and minimum attained by $\psi$ over choices of z restricted to any (closed) subinterval of $(0, w]$ must occur at opposite endpoints of the subinterval. As in the proof of Lemma 1,
$$\int_0^w \psi_{-}(z)\, \mathrm{d}z \;\le\; \int_0^w \psi\big( \mathrm{ID}_F(z), z, w \big)\, \mathrm{d}z \;\le\; \int_0^w \psi_{+}(z)\, \mathrm{d}z\,,$$
where $\psi_{-}(z)$ and $\psi_{+}(z)$ denote the smaller and larger, respectively, of $\psi\big( \mathrm{ID}^*_F - \epsilon, z, w \big)$ and $\psi\big( \mathrm{ID}^*_F + \epsilon, z, w \big)$. Since $\psi$ is also continuously partially differentiable with respect to z over the range $(0, w]$, the integrals of $\psi_{-}$ and $\psi_{+}$ converge to $\int_0^w \psi\big( \mathrm{ID}^*_F, z, w \big)\, \mathrm{d}z$ as $\epsilon \to 0$. It therefore follows from the squeeze theorem for integrals that
$$\lim_{w \to 0^+} \int_0^w \psi\big( \mathrm{ID}_F(z), z, w \big)\, \mathrm{d}z \;=\; \lim_{w \to 0^+} \int_0^w \psi\big( \mathrm{ID}^*_F, z, w \big)\, \mathrm{d}z\,,$$
whenever the right-hand limit exists or diverges. □
8. Conclusions
In this theoretical investigation, we have established asymptotic relationships between tail entropy variants, tail divergences and the theory of local intrinsic dimensionality. Our results are derived under the assumption that the distribution(s) under consideration are analyzed in a highly local context: within the distribution tail(s), an asymptotically small neighborhood whose radius approaches zero. These results show that tail entropies and tail divergences depend in a fundamental way on local intrinsic dimensionality, and they help form a theoretical foundation for cross-fertilization between intrinsic dimensionality research and entropy research. As future work, we plan to investigate the potential of these new characterizations in a range of application settings: for example, as a basis in machine learning for characterizing and improving representations and representation learning, and for understanding the behavior of physical systems such as fluids, helping to characterize their critical transitions in time and space.
Our results, from both the univariate and multivariate cases, show that the tail entropies and divergences considered in this paper depend only on (i) the embedding (representation) dimension in which the distribution is situated, and (ii) the local intrinsic dimension(s) of the distribution(s). Furthermore, in many cases the dependence involves the ratio between the intrinsic dimension and the embedding dimension.
Consider the context of distance-based analysis, where a distribution models distances from a central query location to its nearest neighbors, and the distances are induced by global data. In this situation, our characterization of entropy might be termed 'personalized', in that entropy expresses the uncertainty (or complexity) from the perspective of the query, in regard to the distances to samples within an asymptotically small neighborhood. Phrased another way, these local entropies are 'observer-dependent', since they are tied to the choice of query (the observer). This can be contrasted with the more common notion of entropy, where one analyzes a global distribution, and there is no requirement of a query point or its local neighborhood.
As alluded to in the introduction, divergences between tail distributions could be used for comparison of real and synthetic distributions, as is commonly required for generative adversarial networks (GANs). Given a particular query location, we may either: (i) compute the divergence between the univariate tail distance distributions of synthetic and real examples, as measured from the query point; or (ii) compute the divergence between the multivariate tail distributions of synthetic and real examples, again as measured from the query, under an assumption of local isotropy. Our results show that under the assumption of local spherical symmetry, the use of divergences (such as KL) between tail distance distributions is asymptotically equivalent to the standard multivariate formulations of the same divergences, when restricted to the neighborhoods around locations of interest. For future work, it will be interesting to consider whether it is possible to further extend our multivariate results to elliptically symmetric distributions or skew-elliptical distributions, such as those studied by Contreras-Reyes [65].
Lastly, our results in Table 1 and Table 6 show theoretical relationships for entropies and divergences, but in practice one must estimate these measures from data samples. A natural approach here is to first estimate the local intrinsic dimensionality values $\mathrm{ID}^*_F$ and $\mathrm{ID}^*_G$ using any desired estimator (such as the maximum likelihood estimator [39,40,41]), and then plug the estimated LID value into the desired tail entropy or tail divergence formula. For example, an estimator of the (univariate) Normalized Cumulative Entropy could be obtained by computing $\widehat{\mathrm{ID}}^*_F / (\widehat{\mathrm{ID}}^*_F + 1)^2$, where $\widehat{\mathrm{ID}}^*_F$ is the estimated LID of the distance distribution F.
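A minimal end-to-end sketch (ours) of this plug-in strategy, combining the maximum likelihood LID estimator with the normalized cumulative tail entropy expression above; the function names are illustrative.

```python
import numpy as np

def lid_mle(distances):
    """Maximum likelihood (Hill) LID estimate from a query's k-NN distances."""
    r = np.sort(np.asarray(distances, dtype=float))
    r = r[r > 0]
    return -1.0 / np.mean(np.log(r / r[-1]))

def normalized_cumulative_tail_entropy(distances):
    """Plug-in estimate: substitute the estimated LID into the asymptotic
    expression ID* / (ID* + 1)^2 for the normalized cumulative tail entropy."""
    d_hat = lid_mle(distances)
    return d_hat / (d_hat + 1.0) ** 2

# Example: neighbor distances with CDF F(r) = r^4 on [0, 1], so LID = 4.
rng = np.random.default_rng(1)
dists = rng.random(500) ** (1.0 / 4)
print(normalized_cumulative_tail_entropy(dists))   # close to 4/25 = 0.16
```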