1. Introduction
In Bayesian analysis, a simple prior on inference parameters can induce a nontrivial prior on critical physical parameters of interest. This arises, for example, when estimating the masses of neutrinos from cosmological observations. Here, three parameters are inferred, corresponding to the mass of each of the three neutrino species, $m_1$, $m_2$, $m_3$. Cosmological observations, however, are mainly sensitive to their sum, $\sum m_\nu = m_1 + m_2 + m_3$. Simple priors, for example, log-uniform priors on the individual masses, can induce undesired informative priors on their sum [1].
Another example arises in nonparametric reconstructions. Here, one infers an underlying physical function from the data, where the data are a reprocessing of the target function by some physical or instrumental transfer function. Typical approaches involve decomposing the target function into bins, principal component eigenmodes, or generally into any other set of basis functions. Simple priors on the amplitudes of the basis functions can lead to undesired priors on physical quantities derived from the target function. Consideration of these effects is particularly important, for example, when reconstructing the history of cosmic reionization [2].
A natural remedy is to importance-weight the original prior such that the nontrivial distribution on the parameter of interest is transformed to a more desirable one. In this paper, we show that this natural approach is the maximum-entropy prior distribution [3]. Often, the more desirable prior is a uniform distribution, but our proof also holds for any desired target distribution. Our observation provides a powerful justification for the natural solution, as it is the distribution that assumes the least information, and is therefore particularly appropriate for choosing priors [4].
In Section 2, we demonstrate the key ideas with a toy example before providing a rigorous proof in Section 3. We then apply these ideas to a more complicated example, appropriate for constructing priors on neutrino masses, in Section 4.
2. Motivating Example
We begin with a simplified example. Consider a system with two parameters $(a, b)$, with a uniform distribution $q(a, b) = 1$ on the unit square $0 < a, b < 1$. Analogous to the sum of neutrino masses mentioned earlier, suppose that a derived parameter, $f = a + b$, is of physical interest. The induced distribution $q_f(f)$ is not uniform, but instead symmetric and triangular between $f = 0$ and $f = 2$, as graphically illustrated in the left-hand side of Figure 1. If one wished to construct a distribution $p(a, b)$ that was uniform in $f = a + b$, one could do so by dividing out the triangular distribution $q_f(f) = 1 - |1 - f|$:

$$p(a, b) = \frac{q(a, b)}{2\,(1 - |1 - a - b|)}. \tag{1}$$
The resulting transformed distribution is illustrated in the right-hand side of Figure 1. More weight is given to low and high values of $a$ and $b$, so that the tails of the triangular distribution $q_f$ are counterbalanced. This comes at the price of altering the marginal distributions of $a$ and $b$, which become $p(a) = -\frac{1}{2}\log[a(1-a)]$ (similarly for $b$), but which now give a uniform prior, $p_f(f) = \frac{1}{2}$. The transformation can be viewed as an importance weighting of the original distribution, and is intuitively the simplest way to force $f$ to be uniform.
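This reweighting is easy to check numerically. The following is a minimal sketch (our own illustration, not code from the paper), assuming only numpy: samples from $q$ are importance-weighted by $r(f)/q_f(f) = \tfrac{1}{2}/(1 - |1 - a - b|)$, and the weighted histogram of $f = a + b$ comes out flat.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.uniform(size=(2, 1_000_000))  # draws from q(a, b) = 1 on the unit square
f = a + b

# Importance weights from (1): r(f) / q_f(f), with r = 1/2 on (0, 2)
# and q_f(f) = 1 - |1 - f| the triangular density induced by q.
w = 0.5 / (1.0 - np.abs(1.0 - f))

# The weighted histogram of f should be flat at 1/2 on (0, 2)
# (the outermost bins are noisier, since the weights diverge there).
hist, _ = np.histogram(f, bins=20, range=(0.0, 2.0), weights=w, density=True)
print(np.round(hist, 2))
```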
The aim of this paper is to show that the above intuition is well-founded, as (1) is in fact the maximum-entropy solution. The entropy of a distribution $p$ with respect to an underlying measure $q$ is:

$$H[p] = -\int p(x) \log\frac{p(x)}{q(x)}\, d^D x. \tag{2}$$

The maximum-entropy approach [6,7] finds the distribution $p$ that maximises $H$, subject to user-specified constraints. As it maximises the entropy, the solution $p$ is generally interpreted as the distribution that assumes the least information given the constraints.
In the next section, we show that (1) is the maximum-entropy solution, subject to the constraint that $f = a + b$ is uniform. We further generalise to a derived parameter that can be an arbitrary function of the original parameters, for which the desired distribution is in general nonuniform.
In a more usual maximum-entropy setting, user-applied constraints typically take the form of either a domain restriction, such as $x > 0$ or $a < x < b$, or linear functionals of the distribution $p$, such as a specified mean $\langle x \rangle = \mu$ or variance $\langle (x - \mu)^2 \rangle = \sigma^2$. Our constraints contrast with this traditional approach: by demanding that a derived parameter has a distribution of a specified functional form, we impose a continuum of constraints rather than a discrete set. In other words, instead of a discrete set of Lagrange multipliers, one must introduce a continuous Lagrange multiplier function.
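Schematically (our own summary of the contrast; the precise functional used in this work appears in (7) below):

$$F_{\text{traditional}}[p] = H[p] + \sum_i \lambda_i \left( \int g_i(x)\, p(x)\, d^D x - c_i \right), \qquad F_{\text{this work}}[p] = H[p] + \int \lambda(f) \left( p_f(f) - r(f) \right) df,$$

with one multiplier $\lambda_i$ per constraint in the first case, and a multiplier function $\lambda(f)$, one value per point $f$, in the second.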
3. Mathematical Proof
Theorem 1. If one has a $D$-dimensional distribution on parameters $x$ with probability density function $q(x)$, along with a derived parameter $f$ defined by a function $f: \mathbb{R}^D \to \mathbb{R}$, then the maximum-entropy distribution relative to $q$ satisfying the constraint that $f$ is distributed with probability density function $r$ is:

$$p(x) = \frac{q(x)\, r(f(x))}{q_f(f(x))}, \tag{3}$$

where $q_f$ is the probability density for the distribution induced by $q$ on $f$.

Proof. If we have some function $f: \mathbb{R}^D \to \mathbb{R}$ defining a derived parameter $f = f(x)$, then the cumulative density function $P_f(f)$ of $f$ induced by $p$ can be expressed as a $D$-dimensional integral over the region $f(x) < f$ with $D$-dimensional volume element $d^D x$:

$$P_f(f) = \int_{f(x) < f} p(x)\, d^D x. \tag{4}$$
Differentiating (4) with respect to $f$ yields the probability density function of $f$ induced by $p$, which via the Leibniz integral rule can be expressed as a $(D-1)$-dimensional integral over the boundary surface $f(x) = f$, with the induced $(D-1)$-dimensional volume element $d^{D-1}S$:

$$p_f(f) = \frac{dP_f(f)}{df} = \int_{f(x) = f} \frac{p(x)}{|\nabla f(x)|}\, d^{D-1}S. \tag{5}$$
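As a concrete check of (5) (our own worked instance), take the toy example of Section 2, where $f(a, b) = a + b$ and $|\nabla f| = \sqrt{2}$:

$$q_f(f) = \int_{a+b=f} \frac{q(a, b)}{|\nabla f|}\, dS = \frac{\sqrt{2}\,\min(f,\, 2 - f)}{\sqrt{2}} = 1 - |1 - f|, \qquad 0 < f < 2,$$

since the level set $a + b = f$ intersects the unit square in a segment of length $\sqrt{2}\,\min(f, 2 - f)$. This recovers the triangular density illustrated in Figure 1.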
We aim to find the distribution $p$ that maximises the entropy $H[p]$ from (2), subject to the constraint that $f$ takes a given form with probability density $r(f)$ and cumulative density $R(f)$:

$$P_f(f) = R(f) \quad\Leftrightarrow\quad p_f(f) = r(f). \tag{6}$$
The solution can be obtained via the method of Lagrange multipliers, wherein we maximise the functional $F$:

$$F[p] = H[p] + \lambda \left( \int p(x)\, d^D x - 1 \right) + \int \lambda(f) \left( p_f(f) - r(f) \right) df, \tag{7}$$

subject to the normalisation and distribution constraints:

$$\int p(x)\, d^D x = 1, \tag{8}$$

$$P_f(f) = R(f) \quad\Leftrightarrow\quad p_f(f) = r(f). \tag{9}$$

Here, we introduced a Lagrange multiplier $\lambda$ for the normalisation Constraint (8), and a continuous set of Lagrange multipliers $\lambda(f)$ for the distribution Constraints (9).
Functionally differentiating (7) yields:

$$\frac{\delta F}{\delta p(x)} = -\log\frac{p(x)}{q(x)} - 1 + \lambda + \lambda(f(x)) = 0 \tag{10}$$

$$\Rightarrow\quad p(x) = q(x)\, M(f(x)), \tag{11}$$

where in (10) we have used the fact that:

$$\int \lambda(f)\, p_f(f)\, df = \int \lambda(f(x))\, p(x)\, d^D x,$$

and, in (11), defined the new function:

$$M(f) \equiv e^{\lambda - 1 + \lambda(f)}.$$
All that remains to be done is to determine $M$ from Constraints (8) and (9). Taking the right-hand form of distribution Constraint (9), and substituting in $p(x)$ from (11), we find:

$$r(f) = \int_{f(x) = f} \frac{q(x)\, M(f(x))}{|\nabla f(x)|}\, d^{D-1}S = M(f)\, q_f(f) \quad\Rightarrow\quad M(f) = \frac{r(f)}{q_f(f)},$$

where we have used the fact that $M(f(x))$ is constant over the surface $f(x) = f$, and Definition (5) for an induced probability density function. We now have the form of $M$ to substitute into (11), yielding Solution (3). □
Result (3) is precisely what one would expect. The distribution that converts $q$ to one which instead has $f$ distributed according to $r$ is found by first dividing out the distribution on $f$ induced by $q$, and then modulating by the desired distribution $r$.
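In practice, (3) can be applied to samples by importance weighting. The sketch below is our own illustration (not code from the paper), assuming numpy and scipy, with a Gaussian kernel density estimate standing in for the induced density $q_f$; the function name maxent_weights is hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

def maxent_weights(samples, f, r_pdf):
    """Importance weights implementing (3): reweight draws from q so that
    the derived parameter f is distributed with density r.

    samples : (D, N) array of draws from q
    f       : function mapping a (D, N) array to N derived values
    r_pdf   : desired probability density for the derived parameter
    """
    f_vals = f(samples)
    q_f = gaussian_kde(f_vals)          # estimate of the density induced by q on f
    return r_pdf(f_vals) / q_f(f_vals)  # r(f(x)) / q_f(f(x))
```

The resulting weights can be used directly in weighted histograms or posterior averages, or converted to an equal-weight sample by multinomial resampling.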
Provided that $r$ is correctly normalised, Expression (3) automatically satisfies normalisation Constraint (8):

$$\int p(x)\, d^D x = \int df \int_{f(x) = f} \frac{q(x)\, r(f(x))}{q_f(f(x))\, |\nabla f(x)|}\, d^{D-1}S = \int \frac{r(f)}{q_f(f)}\, q_f(f)\, df = \int r(f)\, df = 1.$$

In the above, we first split the volume integral into a set of nested surface integrals, drew out the functions that were constant over the surfaces, applied the definition of the induced probability density $q_f$, and then used the normalisation of $r$. A similar manipulation may be used to confirm that functional Form (3) satisfies distribution Constraint (9).
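Explicitly (our own one-line version of that manipulation), substituting (3) into Definition (5):

$$p_f(f) = \int_{f(x) = f} \frac{q(x)\, r(f(x))}{q_f(f(x))\, |\nabla f(x)|}\, d^{D-1}S = \frac{r(f)}{q_f(f)} \int_{f(x) = f} \frac{q(x)}{|\nabla f(x)|}\, d^{D-1}S = r(f),$$

as required by (9).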
The proof may be generalised to multiple derived parameters without modification, simply by taking $f: \mathbb{R}^D \to \mathbb{R}^n$ to represent a vector relationship, and the cumulative distribution functions to be their multiparameter equivalents.
4. Example: Neutrino Masses
In the past year, there has been interest in the cosmological and particle-physics community regarding the correct prior to put on neutrino masses. Simpson et al. [8] controversially claimed that, with current cosmological parameter constraints [9,10], the normal hierarchy of masses was strongly preferred over an inverted hierarchy, in contrast with the results of Vagnozzi et al. [11]. Later, Schwetz et al. [1] showed that the controversial claim was mostly due to a nontrivial prior that had been put on the neutrino masses. Since then, other choices of prior have been proposed by Caldwell et al. [12], Long et al. [13], Gariazzo et al. [14], and Heavens and Sellentin [15], which reduce the strength of the claim.
Using our methodology, a possible alternative prior to put on the masses can be constructed. Typically, one chooses a broad independent logarithmic prior on each of the masses $m_1$, $m_2$, $m_3$ of the three neutrinos. However, cosmological probes of the neutrino masses typically place a constraint on the sum of the masses $\sum m_\nu = m_1 + m_2 + m_3$. Simple logarithmic priors on the masses place a nontrivial prior on their sum. Using our approach, we can transform the initial distribution into one that has a more reasonable distribution on the sum of the masses. Such considerations can be particularly important when determining the strength of cosmological probes.
A concrete example is illustrated in Figure 2. As the original distribution, we take an independent Gaussian prior on the logarithm of each of the masses. This induces a nontrivial distribution on the sum of the masses, approximately log-normal, but with a shifted centre. If one demands that the sum of the masses is instead centred on zero, then the maximum-entropy approach creates a distribution with tails toward low masses, in order to compensate for the upward shift in the distribution of the sum of the masses. This tail enters a region of parameter space that would be completely excluded by the original prior; thus, choosing the transformed prior could influence the strength of a given inference on the nature of the neutrino hierarchy. It should be noted that we are not advocating this as the most suitable prior to put on neutrino masses; we merely show that our procedure may be used to straightforwardly transform a distribution, should one wish to put a flat prior on the sum of the masses. A more physical cosmological example in the context of reionization reconstruction can be found in Millea and Bouchet [2].
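A construction along the lines of Figure 2 can be sketched as follows. This is our own illustration, not the paper's code; the prior centres and widths below are illustrative placeholders rather than the values used in the figure.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(1)

# Original prior q: independent Gaussians on log10 of each neutrino mass
# (centre and width here are illustrative, not the paper's values).
log_m = rng.normal(loc=-1.0, scale=0.5, size=(3, 200_000))
log_sum = np.log10((10.0 ** log_m).sum(axis=0))  # log10 of the summed mass

# Density induced by q on the log of the summed mass, estimated by KDE;
# it is approximately Gaussian but with an upward-shifted centre.
q_f = gaussian_kde(log_sum)

# Desired distribution r: the same shape, recentred on zero.
r = norm(loc=0.0, scale=np.std(log_sum))

# Maximum-entropy importance weights from (3).
w = r.pdf(log_sum) / q_f(log_sum)
```

Histogramming log_sum with weights w then reproduces the recentred distribution on the sum, while the weighted mass samples acquire the low-mass tails described above.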