van der Linde, Tutz:
On association in regression: the coefficient of
determination revisited
Sonderforschungsbereich 386, Paper 391 (2004)
Online unter: http://epub.ub.uni-muenchen.de/
Projektpartner
On association in regression:
the coefficient of determination revisited
A. van der Linde* & G. Tutz**
10.6.2004
Abstract
Universal coefficients of determination are investigated which quantify the
strength of the relation between a vector of dependent variables Y and a vector of
independent covariates X. They are defined as measures of dependence between
Y and X through θ(x), with θ(x) parameterizing the conditional distribution of Y
given X = x. If θ(x) involves unknown coefficients γ the definition is conditional
on γ, and in practice γ, respectively the coefficient of determination has to be
estimated. The estimates of quantities we propose generalize R2 in classical linear
regression and are also related to other definitions previously suggested. Our
definitions apply to generalized regression models with arbitrary link functions
as well as multivariate and nonparametric regression. The definition and use
of the proposed coefficients of determination is illustrated for several regression
problems with simulated and real data sets.
Keywords: association, mutual information, regression, coefficient of determination, R2 , correlation curve.
——————————————————————————————————–
* University of Bremen, Germany
FB3: Institute of Statistics, PO Box 330 440, 28334 Bremen, email:
[email protected] (corresponding author)
** Ludwig-Maximilian-University Munich, Germany
Institute of Statistics, Ludwigstr. 33, 80539 Munich, email:
[email protected]
1
1
Introduction
As part of a regression analysis statisticians are interested in describing the
strength of the relation between the dependent variable Y and the independent
variables X. The measure commonly used to this end is the coefficient of determination, usually denoted by R2 . For jointly Gaussian variables X and Y, R2 is
related to the decomposition of the total variance of Y into the ‘between variance’
varX E(Y |X) (explained by regression) and the ‘within variance’ EX var(Y |X)
(error) and describes the proportion of the total variance explained by regression.
The decomposition of variance holds without any distributional assumption and
might therefore be used as a starting point for generalizations. In particular local
measures R2 (x) summarized in a ‘correlation curve’ were suggested ([9]: Doksum
et al., 1994) referring to this idea. For one-parameter exponential families where
the conditional distribution of Y given X = x is specified by generalized (linear)
regression models different definitions of R2 were proposed ( [26]: Magee, 1990;
[28]: Nagelkerke, 1991). These are based on a full likelihood ratio rather than
just the variance thus taking into account that the separation of the conditional
mean E(Y |x) and the conditional variance var(Y |x) is a special feature of the
Gaussian distribution.
Although the attempts to define more generally a coefficient of determination
indicate that there is a ‘natural appeal’ ([26]: Magee, 1990) of such a measure
its definition and use is not really clear. Often a coefficient of determination is
specified descriptively rather than theoretically as a quantity of interest. Hence
features of the parameter of interest and properties of estimates continue being
mixed up.
The ambiguity of continuing attempts to provide a coefficient of determination while its use and interpretation even in classical examples remains a subject
of debate indicates a lack of clarity about the underlying concept ‘strength of
relation between Y and X’ that is to be captured and quantified. In this paper we describe simple information theoretic ideas that help to disentangle the
various notions associated with a coefficient of determination, and we introduce
and review universal definitions of such global as well as local (conditional on
x) coefficients as measures of dependence. Although measures of association and
dependence between variables X and Y have been extensively discussed, especially in psychometrics (e.g. [7]: Cramer and Nicewander, 1979; [33]: Särndal,
1974) they have hardly been evaluated taking into account a regression model
([20]: Kent, 1983). Based on measures of association the determination of Y
by X(= x), can be interpreted in two ways: determination of the conditional
density p(y|x), capturing the discriminatory power of X(= x) or determination
of the (range of) values of Y given X(= x), capturing the explanatory power
of X(= x). Of course, these ideas are related as different conditional densities
correspond to different ranges of values of Y . The first notion will be the focus
of the paper, but the second notion, usually formalized referring to entropies of
2
distributions of Y will be discussed as well.
We specify a parameter θ that determines the distribution of a response Y as
a function of the independent variables X and coefficients γ, and hence in our approach a regression model induces a family of parameterized densities p(y|θ(x, γ))
for Y which vary depending on x. We claim that a coefficient of determination
should quantify the strength of the relation between X and Y in a regression
model (given the regression coefficients) measuring the variation in the family
of distributions induced by X. Different amounts of variation may be obtained
depending on which covariates are included in the model ; indeed a major field
of application of coefficients of determination has been the problem of variable
selection (e.g. [11]: Draper and Smith, 1981, ch.6.1; [25]: McKay, 1977). Also,
the choice of the link function in a generalized regression model or the choice of a
smoothing parameter in nonparametric regression has an impact on the strength
of the relation between covariates X and Y . Thus intuitively our approach is
based on the idea that the flexibility of a (regression) model is reflected by the
range of conditional densities. This range is quantified as the average distance to
a reference density, e.g. the marginal density pθ (y) which is estimable only under
joint sampling. Alternatively the reference density is often chosen corresponding
to a regression model with regression coefficients equal to zero. Under conditional
sampling the average w.r.t. X is to be taken according to the experimental design.
A coefficient of determination is not only a descriptive measure of a given regression model, but mainly of interest in model comparison. The coefficient used
to this end depends on the set of models under consideration. Two comparisons
are usually distinguished: (1) given a joint distribution of Y and all covariates the
comparison of sub-models e.g. based on subsets of covariates, (2) the comparison
of models specified by the same type of conditional density p(y|x) with different
parameters θ(x, γ), e.g. corresponding to different subsets or different functions
of the covariates. The reference density in the coefficient of determination has to
be chosen accordingly, and hence several coefficients of determination have been
proposed in the literature.
In order to express emphasis on x and for notational convenience we drop γ
writing simply θ(x) := θ(x, γ) unless we explicitly refer to γ. In practice γ needs
to be estimated from a given data set.
A local measure of association given X = x may be defined based on the
rate of change of densities p(y|θ(x)) if discrimination of conditional densities is in
focus and X is continuous. Alternatively the reduction of uncertainty by x yields
a local coefficient of determination in terms of explanation.
We introduce and investigate universal measures of dependence for regression models yielding coefficients of determination for all regression models, especially for univariate as well as multivariate response variables, for non- and
semi-parametric regression models, for generalized regression models with arbitrary link function or for regression models with heterogeneous variances.
The paper is organized as follows. In section 2 we start out with a formal
3
definition of a coefficient of determination based on a joint distribution of X
and Y. We derive representations in exponential families, and we also illustrate
for classical examples. In section 3 we refine the proposed definition considering
approximations as well as alternatives, particularly alternatives based on different
reference densities. We further discuss determination curves as generalization of
correlation curves quantifying the local strength of relation between X and Y,
and we introduce alternatives. In section 4 we summarize some properties of
coefficients of determination, especially monotonicity in the number of covariates,
and we discuss the use of coefficients of determination in model comparison.
Section 5 briefly addresses estimation of a coefficients of determination, and in
section 6 we exemplify our approach for several regression models applied to
simulated and real data. Section 7 concludes with a discussion.
Readers who are interested in a conceptual outline only are referred to sections
2.1-2.2, the discussion in section 7 and perhaps section 3. Examples with technical
details are collected in sections 2.3 and section 6. Sections 4 and 5 are both
sketchy, yet addressing rather subtle results and may be skipped by casual readers.
2
2.1
Mutual information in regression
Motivation and definitions
Assume a random vector Y is observed with its distribution given by the density
p(y|θ). In a regression problem we try to explain Y by covariates X modelling
θ as a function of X and parameters γ, part of which are regression coefficients.
Typically γ = (ω, β), where β denotes the regression coefficients and ω comprises
an intercept α and possibly a scale factor φ. Thus, more precisely, θ = θ(x, γ).
In order to determine the strength of the relation between Y and X we propose
to use a measure of dependence between Y and X, keeping γ fixed. We may
either assume that observations (y, x) are realizations of (Y, X) following a joint
distribution of (Y, X) or that we observe Y conditional on x, and
the values
xthat
1 ...xq
of X are specified according to an experimental design d = n(x1 )...n(xq ) which
can be interpreted as a discrete
probability measure such that P (X = xi ) =
p(xi ) = n(xi )/n where n =
i n(xi ). We set pθ (x, y) = p(y|θ(x))p(x) and
pθ (y) = pθ (x, y)dx.
In the sequel we use the following notation for (sequential) expectations: random variables w.r.t. which expectations are taken are written in capital letters,
and for the first (and possibly only) expectation no subscript is appended. For
sequential expectations a subscript indicates which variable it refers to. For example, in covY (E(ζ(X)|Y ), t(Y )) the first expectation refers to X conditional on
Y , and the covariance is a covariance of functions of Y.
A well-known measure of dependence between two random vectors is the distance between their joint density and the product of their marginal densities
4
representing independence. Using the Kullback–Leibler (KL-) discrepancy or directed divergence which for two densities p(z), q(z) is given by
p(z)
IKL (p(z), q(z)) = p(z) log
dz
q(z)
one obtains
I(θ(X), Y ) := IKL (pθ (x, y), p(x)pθ (y)).
(1)
The directed divergence is also called the mutual information between Y and
X. The KL-discrepancy between two densities has been extensively studied by
Kullback ([22]: 1968) as a key quantity in establishing an information-theoretic
approach to statistics. I(θ(X), Y ) is symmetric in X and Y, but not a symmetric distance between the densities pθ (x, y) and p(x)pθ (y). Using the symmetric
discrepancy or divergence
JKL (p(z), q(z)) = IKL (p(z), q(z)) + IKL (q(z), p(z))
(2)
one obtains
J(θ(X), Y ) :=
JKL (pθ (x, y), p(x)pθ (y))
pθ (x, y)
dx dy.
=
(pθ (x, y) − p(x)pθ (y)) log
p(x)pθ (y)
(3)
Note that J(θ(X), Y ) = J(X, Y ). We use the more complex notation though,
in order to emphasize the model in use, and representations of J(θ(X), Y ) may
depend on the parameterization. For our purposes the representation
p(y|θ(X))
dy
J(θ(X), Y ) = E (p(y|θ(X)) − pθ (y)) log
pθ (y)
(4)
= E JKL (p(y|θ(X)), pθ (y))
is intuitively most appealing. J(θ(X), Y ) describes the range of densities for Y
induced by regression models θ(x) w.r.t. to the marginal reference density pθ (y)
which does not depend on x. J(θ(X), Y ) measures how strongly the densities
induced by θ(X) deviate from the marginal density. Thus it captures the discriminatory power of x within the regression model. J(θ(X), Y ) takes values in
R+ . A value of 0 indicates independence of Y and X, while a high value indicates
variability of p(y|θ(x)) as a function of x. We suggest to refer to a scaled version
of J(θ(X), Y ) as a coefficient of determination.
5
Definition 1:
Define
RJ2 =
J(θ(X), Y )
∈ [0, 1]
1 + J(θ(X), Y )
(5)
to be the coefficient of determination of Y by X through θ based on the divergence.
⊳
One can immediately read off the definition that if the conditional density of
Y given x actually does not depend on x, e.g. θ(x) ≡ c0 , J(θ(X), Y ) equals zero
and RJ2 = 0.
2.2
Representations of the divergence
2.2.1
The integrated log–odds ratio function
Recently, Osius ([29]: 2000) demonstrated that under very mild regularity conditions the joint distribution of two random elements is characterized by their
marginal distributions and the odds ratio function. Hence the association between
the two random elements is captured in the odds ratio function, and a measure
of dependence might be regarded as a functional of a representative of the odds
ratio function specified by reference values. In our context focusing on random
vectors Y and X, a representative of the log–odds ratio function with reference
values x0 , y0 is given by
Ψ0 (θ(x), y) = log
p(y|θ(x))
p(y|θ(x0 ))
− log
.
p(y0 |θ(x))
p(y0 |θ(x0 ))
(6)
Then the following result holds (([24]: van der Linde, 2004, appendix).
Result 1:
J(θ(X), Y ) = EXY Ψ0 (θ(X), Y ) − EX EY Ψ0 (θ(X), Y )
for all reference values x0 , y0 .
(7)
Silvey’s ([34]: 1964) measure of association is derived from a similar intuition,
but it does not operate on the log-scale.
6
2.2.2
Exponential families
Prevailing in regression problems is the assumption that conditional on x respectively on θ(x) the density of Y belongs to a k–parameter exponential family. We
write
p(y|ζ(x)) = exp{ζ(x)T t(y) − M (ζ(x))},
where ζ(x) ∈ Rk is the canonical or natural parameter, t(y) is a sufficient statistic
for ζ(x) and M is a normalizing function. (We assume that functions of y only are
incorporated in the dominating measure.) The natural parameter space is defined
by the condition of integrability of exp{ζ(x)T t(Y )} (given x). It is well known
that the derivatives of M w.r.t. ζ(x) provide first and second order moments of
t(Y ),
T
∂
∂
∇M (ζ(x)) =
= E(t(Y )|ζ(x)) =: τ (x),
M (ζ(x)) . . .
M (ζ(x))
∂ζ1 (x)
∂ζk (x)
(8)
∂2
HM (ζ(x)) := ((
M (ζ(x))))i,j=1...k = cov(t(Y )|ζ(x)).
(9)
∂ζi (x) ∂ζj (x)
We can parameterize an exponential family by the natural parameter ζ(x), the
mean τ (x) or a predictor η(x). Depending on the parameterization special representations of the measures of dependence are obtained. We present such results
keeping the general notation θ(x) on the left hand side of an equation and inserting the parameter actually chosen on the right hand size. Thus in exponential
families θ(x) ∈ {ζ(x), τ (x), η(x)}, and each parameter identifies the conditional
density of Y given x. Since for exponential families the log-odds ratio function is
bi–affine in ζ(x) and t(y), i.e.
Ψ0 (ζ(x), y) = (ζ(x) − ζ(x0 ))T (t(y) − t(y0 )),
(10)
we obtain the following result.
Result 2:
J(θ(X), Y ) = tr{covY (E(ζ(X)|Y ), t(Y ))}
= tr{covX (ζ(X), E(t(Y )|ζ(X)))}.
(11)
(12)
Thus J(θ(X), Y ) measures the covariance of one (transformed) variable and the
projection of the other one onto the measurable space that the former spans.
Since J(θ(X), Y ) does not depend on the reference values x0 and y0 , it is often
convenient to choose ζ(x0 ) = ζ := Eζ(X), provided ζ belongs to the natural
parameter space. Then one has
J(θ(X), Y ) = EY [(E(ζ(X)|Y ) − ζ)T t(Y )]
= EX [(ζ(X) − ζ)T E(t(Y )|ζ(X))].
7
(13)
(14)
Moreover, it can be shown that in (4) the marginal reference density can be replaced by p(y|ζ), that is by a reference density which is also a member of the
exponential family. It is mainly due to this property that we emphasize the divergence in comparison to the directed divergence. In particular, if the reference
density belongs to the same family of parameterized densities {p(y|θ(x))}, expansions in the parameter can be used to obtain approximations of J(θ(X), Y ) and
RJ2 . Hence the following representation is a key result providing a link to other
definitions of a coefficient of determination in the literature.
Result 3:
J(θ(X), Y ) = EJKL (p(y|ζ(X)), p(y|ζ)).
(15)
For example, in generalized linear regression, assuming that p(y|θ(x)) belongs
to a one-parameter exponential family (k = 1), a linear predictor
η(x) = α + a(x)T β
is defined and the regression model is specified by a one–to–one link function g
relating τ (x) = E(t(Y )|ζ(x)) to the predictor by
g(τ (x)) = η(x).
In this case ζ(x) = ζ(x, γ) with γ = (α, β), β ∈ Rp . As a special case the
canonical link g0 maps the expected value to the natural parameter
g0 (τ (x)) = ζ(x) = α + a(x)T β.
If the vectors a(x)T are centered such that E a(X)T = 0, ζ = α. In this case the
reference density represents a regression model where the conditional distribution
of Y does not depend on x.
2.3
2.3.1
Examples
Gaussian response
All definitions of a coefficient of determination originate from the case of Normal
distributions with homogeneous covariance. Assume
Y |θ(x) ∼ N (µ(x), Σ),
8
Y ∈ Rr .
If Σ is known, p(y|θ(x)) can be regarded as an exponential family with canonical
parameter ζ(x) = Σ−1 µ(x) and t(y) = y. One obtains
J(θ(X), Y ) = tr{Σ−1 cov µ(X)}.
The resulting RJ2 is related to the decomposition of covariance
cov t(Y ) = EX cov(t(Y )|ζ(X)) + covX E(t(Y )|ζ(X))
(16)
which we abbreviate as
T =W +B
(17)
alluding to total, within and between covariance.
For the Normal distribution one has W = Σ, B = cov µ(X) yielding
J(θ(X), Y ) = tr{W −1 B}
(18)
and
tr{W −1 B}
.
(19)
1 + tr{W −1 B}
A special case of a conditional Normal response Y with homogeneous covariance is given if (X, Y ) are jointly Gaussian,
ΣX ΣXY
µX
X
).
,
∼ N(
ΣY X ΣY
Y
µY
RJ2 =
Then Y |x ∼ N (µ(x), Σ) where
µ(x) = E(Y |x) = µY + ΣY X Σ−1
X (x − µX ),
−1
Σ = ΣY − ΣY X ΣX ΣXY .
The corresponding decomposition (17) is given by
ΣY = Σ + ΣY X Σ−1
X ΣXY
T
where ΣY X Σ−1
X ΣXY = B = E((µ(X) − µY )(µ(X) − µY ) ). In this case Y ∼
N (µY , ΣY ) and
r
ρ2i
−1
J(θ(X), Y ) = tr(Σ B) =
(20)
1 − ρ2i
i=1
(see [22]: Kullback, 1968, p.203), where ρ2i denotes the i–the squared canonical
correlation coefficient. If Y is univariate (r = 1), this reduces to
J(θ(X), Y ) =
ρ2
,
1 − ρ2
(21)
where ρ2 is the multiple correlation coefficient of Y with X. Thus
RJ2 = ρ2 ,
(22)
with ρ2 being further reduced to corr(X, Y ) if also X is univariate. Hence RJ2
does generalize the conventional quantities of interest in the case where X and
Y are jointly Gaussian. Actually (21) motivates the transformation (5) of J to
RJ2 introduced in definition 1.
9
2.3.2
Multinomial response
Assume Y takes values in {0, . . . , k} with probabilities πi . Y may be represented
by a k-dimensional vector of binary (dummy) variables Yi with πi = P (Yi = 1).
Thus identify Y = (Y1 , . . . , Yk )T . Modelling πi as πi (x) depending on values of
covariates X and summarizing the results of n(x) trials in Y = (Y1 . . . Yk ), Y
conditionally follows a multinomial distribution
Y |π(x) ∼ M (n(x), π(x))
where π(x) = (π1 (x), . . . , πk (x))T . The multinomial distributions form a k–
parameter exponential family with
ζ(x) = (ζ1 (x), . . . , ζk (x)),
πi (x)
ζi (x) = log
,
π0 (x)
where π0 (x) = 1 −
k
j=1
πi (x). Furthermore one has
t(y) = y,
E(Yi |ζ(x)) = n(x)πi (x) = τi (x), var(Yi |ζ(x)) = n(x)πi (x)(1 − πi (x)),
cov(Yi , Yj |ζ(x)) = −n(x)πi (x)πj (x) f or i = j.
¿From (12) with log π(x) = (log π1 (x), . . . , log πk (x))T and 1k denoting the k–
dimensional vector of ones,
J(θ(X), Y ) = tr{cov X (ζ(X), E(t(Y )|ζ(X)))}
= tr{cov(log π(X) − log π0 (X)1k , n(X)π(X))}.
(23)
For binomial distributions with all n(x) = m, i.e. Y |π(x) ∼ B(m, π(x)), the
expression simplifies to
J(θ(X), Y ) = m cov(logit π(X), π(X))
(24)
where logit π(x) = log[π(x)/(1 − π(x))].
Model 1 (linear logistic regression, canonical link)
If ζ(x) = η(x) = α + β T a(x) where now α ∈ Rk ,
E(a(X)) = 0 as before we obtain from (14)
β = (β1 . . . βk ) ∈ Rp×k and
J(θ(X), Y ) = E((ζ(X) − ζ)T τ (X))
= E(a(X)T βτ (X))
with
τi (x) = n(x)
exp{αi + βiT a(x)}
.
k
1 + j=1 exp{αj + βjT a(x)}
10
For univariate Y and X with n(x) = 1, a(x) = x, this reduces to
J(θ(X), Y ) = cov(α + βX, π(X))
= β cov(X, π(X)).
(25)
2
For small σX
= varX one may further use the approximation π(x) ≈ π(x0 ) +
2
π (x0 )(x − x0 ) at x0 = 0 which yields J(θ(X), Y ) ≈ π(x0 )(1 − π(x0 ))β 2 σX
, and
therefore
2
β 2 σX
.
(26)
RJ2 ≈
2
[π(x0 )(1 − π(x0 ))]−1 + β 2 σX
′
For σX → 0 one obtains RJ2 → 0 for all β. On the other hand, if σX is fixed,
2
, RJ2
β → ∞ yields RJ2 → 1 and β → 0 yields RJ2 → 0. Also for large variances σX
may take the extreme values 0 and 1. Thus for the logistic model RJ2 naturally
depends only on the coefficients of the regression function and the variability of
X.
In contrast, for Normally distributed responses (cp. (19))
RJ2 =
2
2
β 2 σX
β 2 σX
/σ 2
=
2
2
σ 2 + β 2 σX
1 + β 2 σX
/σ 2
(27)
where σ 2 = var(Y |x). Thus RJ2 depends on β, the variability of X and σ 2 . RJ2
2
2
becomes 1 if β → ∞ or the ratio σX
/σ 2 increases. For fixed σX
the coefficient of
2
2
determination depends on β /σ .
Model 2 (linear logistic regression, probit link)
For notational convenience consider the Binomial case (k = 1) with
Y |π(x) ∼ B(n(x), π(x))
and assume
π(x) = Φ(η(x)),
where Φ denotes the cumulative distribution function of a standard univariate
Gaussian distribution. Then
ζ(x) = logit π(x) = logit (Φ(η(x))),
ζ = Eζ(X) = logit Φ(Eη(X)) = logit Φ(α)
However, there exists a predictor η0 such that JKL (p(y|ζ(x)), p(y|ζ)) equals
JKL (p(y|ζ(x)), p(y|logit(Φ(η0 )))).
More complex models induced by sophisticated link functions are specified by
Fahrmeir and Tutz ([12]: 1994, ch.3).
11
2.3.3
Binary response Y with Gaussian covariate X
A simple distributional model for (Y, X) illustrates how J(θ(X), X) or RJ2 capture
the association between a dichotomous variable Y ∈ {0, 1} and a continuous
variable X. Let
X|Y = 1 ∼ N (µ1 , σ 2 ),
X|Y = 0 ∼ N (µ0 , σ 2 )
and the marginal distribution be fixed by p(1) = P (Y = 1), p(0) = 1 − p(1)
with p(1) ∈ (0, 1). For simplicity E(X) = 0 is assumed which is equivalent to
µ0 = −µ1 p(1)/p(0).
The regression of X on Y has the form E(X|y) = β0y + βy y with β0y =
−µ1 p(1)/p(0), βy = µ1 /p(0). Treating the Normal distributions as one-parameter
family with fixed σ 2 , the reference density p(x|ζ) is normal with N (0, σ 2 ).
The logit regression model which reversely regresses Y on X is known to be
linear. For
logit π(x) = β0x + βx x
one obtains β0x = (µ20 − µ21 )/(2σ 2 ) + log(p(1)/p(0)) and βx = µ1 /(p(0)σ 2 ).
Application of (11) shows that
J(θ(X), Y ) =
µ21 p(1) 2
µ,
σ 2 p(0) 1
RJ2 =
p(1)µ21
.
p(0)σ 2 + p(1)µ21
The extreme case RJ2 = 0 (J(θ(X), Y ) = 0) is obtained for µ21 = 0 meaning
independence of X and Y . The case RJ2 = 1 (J(θ(X), Y ) → ∞) is obtained for
σ 2 = 0 or µ1 → ∞. If µ1 → ∞ the regression parameters βx as well as βy tend
to infinity meaning that the conditional distributions p(y|x) as well as p(x|y) are
maximally separated. For σ 2 = 0 the regression parameter of the logit model
βx tends to infinity whereas the separation of the distributions p(x|1) and p(x|0)
does not show up in the regression parameters.
J(θ(X), Y ) and RJ2 are symmetric measures of dependence between X and
Y . In the light of regression models they capture how strongly a conditional
distribution p(y|x) varies with x. Here, in terms of βx the coefficient RJ2 is given
by
p(1)p(0)βx2
.
RJ2 =
1/σ 2 + p(1)p(0)βx2
Reversely, when regressing X on Y the distance between p(x|1) and p(x|0) is
reflected what is seen from
RJ2 =
p(1)p(0)βy2
.
σ 2 + p(1)p(0)βy2
12
RJ2 may also be interpreted as the proportion of explained variance of X:
βy = µ1 /p(0) immediately yields
RJ2
µ21 p(1)/p(0)
= 2
,
σ + µ21 p(1)/p(0)
and varX = σ 2 + µ21 p(1)/p(0). While this interpretation holds for the regression
of X or Y it does not apply to the decomposition of varY when regressing Y on
X. Yet the divergence quantifies the explanatory value of one of the variables for
the other variable because of the symmetry in X and Y .
3
3.1
Measures of determination
I(θ(X), Y ) as alternative to J(θ(X), Y )
Similar concepts might be suggested measuring the distance of p(y|θ(x)) to a
reference density differently. Especially the mutual information I(θ(X), Y ) (given
in (1)) as an average of the directed divergences
IKL (p(y|θ(x)), pθ (y)) = E(log
p(Y |θ(x))
|θ(x))
pθ (y)
may be an appealing alternative to J(θ(X), Y ) ([36]: Soofi et al., 2000; [19]: Joe,
1989; [20]: Kent, 1983). Analogously to the decomposition of variance I(θ(X), Y )
describes the decomposition of entropy:
I(θ(X), Y ) = H(Y ) − H(Y |θ(X)),
(28)
where H(Y ) = −E(log pθ (y)) and H(Y |θ(X)) = EX [−E{log p(Y |θ(X))|θ(X)}]
(see e.g. [6]: Cover and Thomas, 1991, p.19).
In case of a bivariate Gaussian distribution of (X, Y ) with correlation coefficient ρ
1
I(θ(X), Y ) = − log(1 − ρ2 ),
(29)
2
([22]: Kullback, 1968, p.203). Then RI2 = ρ2 may be derived from the directed
divergence by
RI2 = 1 − exp(−2I(θ(X), Y )).
(30)
The relation also holds for the multiple correlation coefficient if X is multidimensional. The transformation (30) has in fact been suggested in terms of estimates to
define a general coefficient of determination ([20]: Kent, 1983; [26]: Magee, 1990;
[28]: Nagelkerke, 1991), and RI2 can be interpreted as a scaling transformation of
I(θ(X), Y ). See also Theil ([40]: 1987).
The main advantage of the symmetrized mutual information J(θ(X), Y ) over
the mutual information I(θ(X), Y ) appears to be result 1 yielding result 3,
the representation within an exponential family. EIKL (p(y|ζ(X)), p(y|ζ)) and
I(θ(X), Y ) usually differ.
13
3.2
Reference densities
The various reference densities we introduced partly explain the variety of suggestions generalizing the coefficient of determination. J(θ(X), Y ) is defined for
a joint density of X and Y but only a partial feature is explored in a regression
model. In particular, the (multivariate) distribution of X is often left unspecified. Hence (the type of) the marginal density pθ (y) is not determined unless Y is
categorical. Also, even if the distribution of X is known, the marginal densities
pθ (y) = Ep(y|θ(X)) are different for different models. In order to allow for a
comparison across models with the same type of sampling distribution therefore
a reference density p(y|θ0 ) is used where p(y|θ0 ) is assumed to belong to the same
family of densities as p(y|θ(x)) and θ0 is the specific parameter. θ0 = θ(x, γ0 )
with γ0 = (ω0 , β0 ) and β0 = 0 corresponds to a regression model with constant
regression function.
Definition 2:
Define
2
=
RJ,θ
0
EJKL (p(y|θ(X)), p(y|θ0 ))
1 + EJKL (p(y|θ(X)), p(y|θ0 ))
(31)
to be the coefficient of determination of Y by X through θ based on the symmetric discrepancy to the density specified by a specific parameter θ0 .
⊳
Similarly the KL-discrepancy IKL (p(y|θ(X)), p(y|θ0 )) might be used to define
analogously to (30). The average discrepancy might intuitively have been
aimed at by those authors who suggest to empirically define
2
RI,θ
0
2
2
= 1 − exp(− LR)
RLR
n
(32)
where LR denotes the log–likelihood ratio test statistic to compare the model of
interest to a ‘null model’ which is understood to be that with regression coefficients set to zero (see [28]: Nagelkerke, 1991). A second order Taylor expansion
of log p(y|ζ0 ) in ζ(X) yields ([22]: Kullback, 1968, pp.26)
EJKL (p(y|ζ(X)), p(y|ζ0 )) ≈ 2EIKL (p(y|ζ(X)), p(y|ζ0 )),
(33)
2
2
and to this extent the coefficients of determination RJ,θ
and RI,θ
work similarly.
0
0
Based on I(θ(X), Y ) = H(Y )−H(Y |θ(X)) an alternative to using a reference
other than pθ (y) is to consider the difference of entropies H(Y |θ0 ) − H(Y |θ(X).
Model comparison then refers to H(Y |θ(X)) and might be extended to models
which are not related being members of a common family of sampling distributions.
14
3.3
Approximations in exponential families
A second order Taylor expansion of log p(y|ζ(x)) in ζ links the symmetrized mutual information J(θ(X), Y ) = EJKL (p(y|ζ(X)), p(y|ζ)) in exponential families
to the matrix HM (ζ). Kullback ([22]: 1968, pp.26f ) shows the following result.
Result 4:
J(θ(X), Y ) ≈ tr{HM (ζ)cov(ζ(X))}.
(34)
A similar result is obtained using a reference density p(y|ζ0 ) and expanding
in ζ0 . This result relates our definition of RJ2 to the one based on Wald’s test for
testing H0 : β = 0 against H1 : β = 0 or H0 : ζ(x) = ζ0 against H1 : ζ(x) = ζ0
for all x. For a fixed x the component w(x) of a test statistic with ζ estimating
ζ is
w(x) = (ζ(x) − ζ0 )T HM (ζ0 )(ζ(x) − ζ0 ).
Then averaging over x yields
w = E[tr{HM (ζ0 )(ζ(X) − ζ0 )(ζ(X) − ζ0 )T }].
An empirical coefficient of determination of the type
Rw2 =
w
1+w
(35)
was suggested and investigated for several regression models, listed by Magee
([26]: 1990).
A link to the interpretation of the coefficient of determination as the proportion of the total variance explained by regression may be based on the approximation of J(θ(X), Y ) using the mean parameterization.
Result 5:
J(θ(X), Y ) ≈ tr{(cov(t(Y )|τ ))−1 cov X E(t(Y )|τ (X))}
where τ̄ := Eτ (X).
(36)
The right hand side of (36) is estimated by the test statistic of the score
test which Magee ([26]: 1990) referred to in order to generalize the (estimated)
coefficient of determination.
15
3.4
Decomposition of covariance
The decomposition of covariance (17) yields (18),
J(θ(X), Y ) = tr{W −1 B},
in the special case of Gaussian distributions of Y |θ(x) with homogeneous variance,
and (18) also holds approximately according to result 5 if W ≈ cov(t(Y )|τ ).
Hence
J(θ(X), Y ) ≈ tr[(T − B)−1 T T −1 B]
= tr[(I − T −1 B)−1 T −1 B].
(37)
This equation may be read as an analogue to the defining transformation (5),
J(θ(X), Y ) =
RJ2
.
1 − RJ2
Thus RJ2 is associated to the matrix T −1 B though not directly defined by a functional of it like tr(T −1 B) or det(T −1 B) as previously suggested in the literature
(see [27]: Mardia et al., 1995, ch.6.5.4, pp170f).
For example, Hooper ([17]: 1959) refers to the decomposition I = T −1 W +
T −1 B to obtain
1
2
RH
= tr(T −1 B)
(38)
k
Glahn ([13]: 1969) uses the decomposition applying the determinant to suggest
2
RG
= |B|/|T |.
(39)
This measure coincides with γ1 in Cramer’s and Nicewanders review ([7]: 1979) of
multivariate measures of association under the assumption of a joint multivariate
Gaussian distribution.
Based on the decomposition of covariance various measures have been derived which focus on ratios of (re-scaled) between variances to (re-scaled) total
variances. For instance the re-scaled variance ratio (called γ3 by Cramer and
Nicewander ([7]: 1979)
1
J(θ(X), Y )
tr{W −1 B}
tr{W −1 B}
k
=
=
tr{W −1 T }
k + tr{W −1 B}
1 + k1 J(θ(X), Y )
(40)
is built similarly to our RJ2 . For a univariate response it coincides with RJ2 .
An issue we finally address is the classification of measures of multivariate dependence into (symmetric) measures of association and (asymmetric) measures
of redundancy. The measures discussed by Cramer and Nicewander ([7]: 1979)
under the assumption of joint Normality are all - like ours - measures of dependence between X and Y . They can be rewritten in information theoretic terms
16
as transformations of I(X, Y ), J(X, Y ) or based on differences of the entropies
H(Y ), H(Y |X), H(E(Y |X)). They are all functions of the canonical correlation
coefficients, and hence X and Y may be interchanged. In contrast the ratio of
between to total variance of the standardized response vector Y
RI =
−1
CXY }
tr{CY X CX
,
tr{CY }
(41)
where C denotes a correlation matrix, is known as redundancy index in the
literature ([39]: Stewart and Love, 1968; [14]: Gleason, 1976). The redundancy
index is not symmetric in X and Y . In case of joint Gaussianity of X and Y it
is equal to the average squared multiple correlation coefficient of components of
Y given X thus quantifying linear predictability of Y (componentwise, not truly
multivariate) given X. It has been felt by many authors that determination in
regression should be measured in such an asymmetric way.
Relating to that discussion we justify our study of a symmetric measure of
dependence arguing that essentially a measure of redundancy is a scaled measure
of dependence. In information theory the redundancy of a variable Z is defined
as
H(Z)
,
(42)
red(Z) = 1 −
Hmax
where Hmax denotes a context dependent maximum achievable entropy. Lower
entropy of Z indicates structure, and the more structure is present the higher
the redundancy of Z. (Cp. e.g. the discussion in ([18]: Jessop, 1995, pp.51f). In
regression we are interested in reducing the uncertainty about Y by X, measured
by H(Y |θ(X)), that is in maximizing the redundancy of Y |X. This amounts to
being interested in
red(Y |θ(X)) = 1 −
I(θ(X), Y )
H(Y |θ(X))
=
,
H(Y )
H(Y )
(43)
the mutual information between X and Y scaled w.r.t. the entropy of Y .
For a bivariate Gaussian distribution of X and Y we obtain with e = exp(1)
red(Y |θ(X)) = −
log(1 − ρ2 )
log(2πeσY2 )
(44)
which tends to 0 if ρ2 → 0, and tends to ∞ if ρ2 → 1. More generally for a joint
Normal distribution
1
|T |
I(θ(X), Y ) = log
.
2
|W |
If Y is standardized yielding Ts = CY and Ws , say,
red(Y |θ(X)) =
log |CY | − log |Ws |
,
log(2πe)k |CY |
17
(45)
(where k = dim Y ). Thus, although red(Y |θ(X)) and RI do not exactly coincide, the motivating idea is the same. red(Y |θ(X)) is a truly multivariate (w.r.t.
Y ) index and generalizes to other but Gaussian distributions, particularly discrete distributions. In conclusion we suggest to consider measures of association
and dependence as primary, and measures of redundancy as secondary, derived
measures, and in this paper we discuss the decomposition of variance as a special Gaussian representation of association. Furthermore, although measures of
dependence like J(θ(X), Y ) are symmetric, the specification of the conditional
density p(y|θ(x)) is based on a regression model expressing ideas about a directed
influence of X on Y .
3.5
Local coefficients of determination
Sometimes it is felt that the strength of the relation between Y and X varies
with x, and a local rather than a global coefficient of determination is wanted.
For example, Doksum et al. ([9]: 1994) resume an example given by Härdle ([15]:
1990) where Y = expenditure for food depending on X= net income is analyzed.
The higher the income the weaker the relation is, and this structure is captured
by Doksum et al. ([9]: 1994) in a ‘correlation curve’. We discuss several concepts
of local coefficients of determination.
3.5.1
Correlation curves
The definition of a (squared) correlation curve aims at a local ‘decomposition of
variance’. Doksum et al. ([4]: 1993; [9]: 1994) suggested for univariate X and
t(Y ) with a conditional distribution in an exponential family
2
τ ′ (x)2 σX
,
ρ (x) =
2
var(t(Y )|θ(x)) + τ ′ (x)2 σX
2
(46)
2
where σX
= varX and τ ′ (x) denotes the derivative of τ (x). Actually Doksum et
al. ([9]: 1994) motivate their definition based on a linear model by
RJ2 =
2
β 2 σX
var(α + βX)
,
=
2
var(α + βX + ǫ)
varǫ + β 2 σX
where β is replaced by τ ′ (x) and varǫ by the heterogeneous variances σ 2 (x) =
var(t(Y )|θ(x)) depending on x. If τ ′ (x) and σ 2 (x) are constant, the correlation
coefficient ρ2 is re-obtained in bivariate Gaussian regression. For multivariate
X ∈ Rs (46) is generalized to
ρ2 (x) =
(∇τ (x))T ΣX ∇τ (x)
σ 2 (x) + (∇τ (x))T ΣX ∇τ (x)
18
(47)
3.5.2
Determination curves
A local measure of the strength of relation between X and Y may be based on
the rate of change. Following Blyth ([3]: 1994) consider local discrepancies
J(x, ∆x) = JKL (p(y|θ(x)), p(y|θ(x + ∆x))
(48)
for continuous X. For the limiting local discrepancy one has based on a second
order Taylor expansion
J0 (x) = lim
δ→0
=
J(x, ∆x)
δ2
Ii,j (x) = 1Ts I(x)1s ,
(49)
(50)
i,j
where I(x) is the s × s-dimensional Fisher information matrix defined by
Ii,j (x) = E[
∂
∂
log p(Y |θ(x))
log p(Y |θ(x))|θ(x)].
∂xi
∂xj
If p(y|θ(x)) belongs to a k−parameter exponential family,
I(x) = ∇(k) ζ(x)(cov(t(Y )|ζ(x))(∇(k) ζ(x))T ,
(51)
κ
where ∇(k) ζ(x) is the s × k matrix (( ∂ζ
))
. The Fisher information is
∂xi κ=1...k,i=1...s
well known to be an indicator of the sensitivity of (the conditional distribution
of) Y to changes in the parameter θ(x). See the discussion by Rao ([30]:1973,
pp.331f). More explicitly, for example, if X is univariate (s = 1),
I(x) = E[(
d
log p(Y |θ(x)))2 |θ(x)],
dx
and hence under regularity conditions
I(x) = var[
d
log p(Y |θ(x))|θ(x)].
dx
(52)
I(x) describes the rate of change (on the log scale) of the density induced by a
change of the parameter (depending on x) and thus refers to the interpretation
of association in terms of discrimination.
For illustration consider again the example of logistic regression discussed in
section 2.3.2 (model 1 and n(x) = 1). Then with π(x) = τ (x)
τ (x) =
τ ′ (x) =
exp(α + βx)
,
1 + exp(α + βx)
β exp(α + βx)
(1 + exp(α + βx))2
19
and var(Y |ζ(x)) = τ (x)(1 − τ (x)),
J0 (x) =
τ ′ (x)2
= βτ ′ (x)
τ (x)(1 − τ (x))
(cp. (26)). Hence J0 (x) = I(x) converges to 0 for x tending to ±∞ and thus
reflects the asymptotes of the sigmoid logistic curve τ (x).
As Blyth ([3]: 1994) pointed out, in a one-parameter exponential family with
univariate X the correlation curve appears to be a one-to-one transformation of
J0 (x). If p(y|θ(x)) is from a one-parameter exponential family (where X may be
again multivariate), J0 (x) takes the form of a signal-to-noise ratio
J0 (x) =
(∇τ (x))T ∇τ (x))
.
var(t(Y )|θ(x))
(53)
If additionally s = 1 the transformation of J0 (x),
Rσ2 X J0 (x) =
2
J0 (x)
σX
2
σX
J0 (x) + 1
(54)
2
yields the correlation curve ρ2 (x). The scaling by σX
in (54) is chosen in order to
2
′
recover ρ for bivariate Gaussian (X, Y ), where τ (x) and σ 2 (x) and hence J0 (x)
are constant functions.
In terms of I(x) the correlation curve ρ2 (x) for s = k = 1 is also given by
2
σX
I(x)
ρ (x) = 2
σX I(x) + 1
2
(55)
which can be generalized to s ≧ 1, k ≧ 1.
Definition 3:
Define
tr{Iθ (x)ΣX }
(56)
tr{Iθ (x)ΣX } + 1
to be the value of the determination curve for Y in x through θ, where Iθ (x)
denotes the Fischer information matrix derived from p(y|θ(x)).
⊳
RJ2 (x) =
According to (51) in a k−parameter exponential family
tr{I(x)ΣX } = tr{[cov(t(Y )|τ (x))]−1 (∇τ (x))T ΣX ∇τ (x)}
= tr{[W (x)]−1 B(x)}
(say).
Hence (47) is obtained as a special case where s ≧ 1 = k, and the family of
N (τ (x), σ 2 (x))-distributions is treated as a one-parameter exponential family.
The curves for a Gaussian response, where the density is assumed to belong to a
two-parameter exponential family, will be discussed for a numerical example in
section 6.3.
20
3.5.3
Certainty curves
If X is not continuous but e.g. nominal, concepts of local variation and differentiation do not apply. Yet the influence of X at x can be assessed as the reduction of uncertainty about Y measured in terms of entropies. Set H(Y |θ(x)) =
−EY |θ(x) (log p(y|θ(x))).
Definition 4:
Define
RI2 (x) = 1 − exp{−2[H(Y ) − H(Y |θ(x))]}
(57)
to be the coefficient of certainty about Y given x through θ. The resulting curve
as a function of x is called the certainty curve.
⊳
Definition 4 may also be applied to continuous X with differential entropies
if these exist, i.e. if the integrals are finite. For example, if (X, Y ) are bivariate
Gaussian, H(Y ) = (1/2) log(2πeσY2 ), H(Y |θ(x)) = (1/2) log(2πeσY2 |θ(x) ), where
σY2 |θ(x) = σ 2 does not depend on x. Hence the reduction of uncertainty about Y
is the same for all x, and the certainty curve is constant taking the value of the
correlation coefficient.
With the asymmetric scaling yielding the redundancy index one may similarly
define a redundancy curve by
red(Y |θ(x)) = 1 −
H(Y |θ(x))
.
H(Y )
(58)
taking values in R+ . Then red(Y |θ(X)) is the average redundancy.
Locally the difference between the notions of discriminatory power and explanatory power of X related to dependence measures based on J and I respectively,
becomes more distinct while globally the idea of variability within the family of
densities {p(y|θ(x)} covers both aspects.
4
Properties and use of coefficients of determination
The main idea we elaborate in this paper is the application of association measures under the assumption of a regression model. To this aim we focus on
J(θ(X), Y ) and consider to some extent I(θ(X), Y ). A list of postulates, of desirable properties of a measure of association, introduced by Renyi ([32]: 1959)
and modified later on e.g. by Bell ([1]: 1962) has been agreed on in the literature. These requirements comprise (i) generality of the definition, (ii) symmetry
in X and Y, (iii) normalization and (iv) invariance under 1:1-transformations of
21
X and Y . The crucial issue yielding refinements and modifications of association
measures turned out to be the normalization in two respects:
- Renyi ([32]: 1959) claimed that the measure should coincide with (the absolute value of) the correlation coefficient in case X and Y are bivariate Gaussian.
Bell ([1]: 1962) weakened this postulate suggesting that a measure of association
should be a monotone function of the (squared) correlation coefficient in that
case. In regression analysis Renyi’s claim is still virulent, and hence we emphasize standardizing transformations yielding the squared correlation coefficient as
a special case.
- However, an additional requirement, namely that a value of a measure of
association equal to 1 should indicate functional dependence between X and Y
may conflict with the orientation towards correlation. Particularly for nominal
random variables therefore different ways of scaling were proposed, for example
by Joe ([19]: 1997).
Apart from the properties of association measures additional features of coefficients of determination like additivity are of interest. See e.g. Soofi et al. ([36]:
2000). For correlation curves properties were proven by Bjerve and Doksum ([4]:
1993). Here we only address a few issues.
4.1
Invariance to re-parameterizations
JKL (p(y|θ(x)), pθ (y)) is invariant to one-to-one transformations of θ, but result
3, providing a representation within an exponential family with reference density
p(y|ζ), holds for the natural parameter only. Further, there is a lack of interchangeability of parameterization and expectation w.r.t. X, e.g. τ̄ = τ (ζ) and
different reference densities may not be equally useful.
4.2
Monotonicity in the number of covariates
Interpreting (directed) divergences as measures of variability of densities p(y|θ(x))
one might expect this variability to increase if more informative covariates are
included in the regression model. Indeed, within an encompassing model including all covariates, i.e. with θ being derived from the joint distribution of X and
Y, I(θ(X), Y ) is increasing with the number of covariates because of the chain
rule for mutual information. Darbellay ([8]: 1998) elaborated for RI2 how the
coefficient increases if new covariates are added. To our knowledge similarly general results are not available for J(θ(X), Y ).(See also the related discussion for
estimates Rw2 by Magee ([26]: 1990).) Usually, however, (even nested) regression
models are not assumed to be related by a joint distribution, and monotonicity
due to additivity of information is important. Simon ([35]: 1973) investigated
conditions for additivity of discrepancies in exponential families if models are
nested.
22
Assuming θ0 ≺ θ1 ≺ θ2 , where ≺ indicates nesting, in exponential families
IKL (p(y|θ0 ), p(y|θ2 )) = IKL (p(y|θ0 ), p(y|θ1 )) + IKL (p(y|θ1 ), p(y|θ2 ))
(59)
holds under the condition of orthogonality
(θ1 − θ2 )T (τ (θ0 ) − τ (θ1 )) = 0.
(60)
For example, if Y |θ(x) ∼ N (µ(x), Σ), θ(x) = µ(x), for x = (x1 , x2 ), θ0 = α,
θ1 (x) = θ0 + a1 (x1 )T β1 , θ2 (x) = θ1 (x) + a2 (x2 )T β2 the condition (60) amounts to
the requirement
β2T a2 (x2 )T a1 (x1 )β1 = 0
which is met if a1 (x1 ) is orthogonal to a2 (x2 ).
Soofi and Retzer ([37]: 2002, p.16) summarize results on additivity of information indices for exponential families.
4.3
Use of a coefficient of determination in model comparison
Coefficients of determination as measures of dependence between X and Y are
used to quantify a feature of a regression model under consideration. In the
comparison of models typically two situations are distinguished:
(1) An encompassing model is given by the joint density of Y and all covariates X. Submodels are derived within the encompassing model, for example
p(y|θ(s) (x(s) )) for subsets X(s) of the covariates. Often then the minimally comprehensive model is found that is still close enough to the most comprehensive
model (in terms of its ‘modelling potential’). An assessment of the potential of
a regression model is especially of interest in variable selection where a parsimonious model is looked for. Indeed, variable selection has been a main field of
application of coefficients of determination. RI2 is appropriate in this case, and
monotonicity in the number of covariates applies.
(2) Models corresponding to p(y|θ(x)) are not assumed to be related within
an encompassing model, but the conditional densities are assumed to belong to
a specified family of distributions. In this case pθ (y) varies with the regression
model, and p(y|θ0 ) is chosen as a reference density instead. Hence the comparison
2
2
or RJ,θ
, and monotonicity may hold for nested models.
of models is based on RI,θ
0
0
2
2
or RJ,θ
are also natural quantities to
The discrepancies corresponding to RI,θ
0
0
test associated hypotheses H0 : θ(x) = θ0 against H1 : θ(x) = θ0 about regression
parameters. These tests aim at the significance of a difference in the explanatory
power of models specified by θ(x) or θ0 .
Because of the monotonicity properties coefficients of determination are measures of the explanatory potential of a model rather than adequate criteria for
model choice aiming at prediction of future observations which typically compromise between data fit and model complexity. Model comparisons based on
23
coefficients of determination apply to models related by specifications according
to either (1) or (2). They are useful to find a parsimonious relative explanatorily
powerful model within a set of related models. Monotonicity of a coefficient of
determination supports strategies of forward selection or backward elimination
in variable selection.
Often an internal scaling of coefficients of determination is desired in model
2
is alcomparison w.r.t. the highest value attained. Although for instance RJ,θ
0
ready a scaling transformation of EJKL (p(y|θ(X)), p(y|θ0 )) taking values between
2
= 1 may not be attainable within the set of models to
0 and 1, across models RJ,θ
0
be compared. In practice therefore often estimates RJ2 are internally scaled w.r.t.
e.g. the most comprehensive model ( [11]: Draper and Smith, 1981, p.42; [28]:
Nagelkerke, 1991) and used as relative measures for the assessment of models in
an informal way.
5
Estimation of a coefficient of determination
We already occasionally referred to estimates of coefficients of determination but
we address the topic of estimation in a more systematic and explicit way in this
section. As often coefficients of determination (and measures of association) are
defined descriptively, in fact estimates of quantities of interest are suggested, and
frequently ‘corrections’ or ‘modifications of the definition’ of such coefficients are
meant to improve properties of estimators. For example, Särndal ([33]: 1974)
in his review discusses at length ‘corrections’ that yield unbiased estimators of
association measures. We investigate estimates in special examples in section 6.
5.1
Maximum Likelihood-Estimation
An immediate approach in order to obtain estimates is to substitute parameters
θ by their maximum likelihood (ml) estimates θ. For example, for I(θ(X), Y ) =
H(Y ) − H(Y |θ(X)) ranking of submodels according to RI2 reduces to ranking
of the quantities H(Y |θ(X)) estimated by H(Y |θ(X)) = H(Y |θ(X)). Within
an encompassing model these estimates are monotone for submodels with an
increasing number of covariates.
A crucial result is obtained for the ml-estimates of EIKL (p(y|θ(X)), p(y|θ0 )).
Under conditions frequently met in generalized regression (cp. [35]: Simon, 1973)
IKL (p(y|θ̂(xi )), p(y|θ̂0 )) coincides with the log-likelihood ratio log p(yij |ζ(xi )) −
log p(yij |ζ0 ). (59) carries over to likelihood ratios (e.g. [31]: Rao and Toutenburg, 1995, p.48). Model comparison with estimated quantities then turns out to
be based on the log-likelihood with estimated parameters. Thus ml-estimation
renders an empirical coefficient of determination which is simply a difference of
measures of goodness of fit. This is the reason why estimated coefficients of determination are used for comparing models w.r.t. their explanatory power for a
24
given data set but have been abandoned as criteria for model choice aimimg at
prediction.
2
can be derived from
In exponential families R̂J,θ
0
JKL (p(y|θ̂(x)), p(y|θ̂0 )) = (ζ(x) − ζ0 )T (τ (ζ(x)) − τ (ζ0 )).
(61)
2
Similarly RJ,θ
may be estimated inserting ml-estimates in the test statistic of
0
the Wald- or score test (see results 4 and 5).
The ml-estimates of the discrepancies and log-likelihood ratios in exponential
families are close but do not always coincide: equality requires additivity in nested
model which does not always hold. Intuitively, particularly if the conditional
densities p(y|θ(X)) do not belong to an exponential family, the data are used
twice in the log-likelihood ratio as an estimate of a KL-discrepancy: for estimating
θ and for the evaluation of the integral.
5.2
Non–parametric estimation
Estimates can also be obtained if no parametric model or no function space is
explicitly specified for θ(x). For example τ̂ (x) may be obtained from kernel
estimation yielding an estimate θ̂(x). Doksum and Samarov ([10]: 1995) and
Lavergne and Vuong ([23]: 1998) investigated nonparametric estimates to obtain
a coefficient based on a decomposition of variance.
5.3
Bayesian estimation
Within the Bayesian approach part of the model specification is a prior distribution for γ = (ω, β). Instead of plugging in a point estimate θ or γ the parameter θ
or γ may be integrated (averaged) w.r.t. a posterior distribution on the parameter space and Bayesian estimates like the posterior mean or the posterior mode
may be used. Then the approach may also share features due to the additivity
of information in nested models. Also, given a (joint) distribution of (X, Y, Θ)
in a Bayesian approach it may be more natural to focus on the dependence of
Y and X with θ integrated out than to investigate the posterior distribution of
a measure of dependence given θ like EJKL (p(y|θ(X)), p(y|θ0 )) or −H(Y |θ(X)).
For a Bayesian approach to hypothesis testing see ([2]: Bernardo and Rueda,
2002).
6
6.1
Case studies
Standard regression models with Gaussian response
We evaluate discrepancies to a reference density as quantities of interest for a
Gaussian response Y including regression models with heterogeneous variances.
25
Various estimates for the quantities of interest are discussed that were previously
introduced as modifications of the definition of a coefficient of determination.
Finally we investigate model comparisons for different regression functions including nested models.
Consider an outcome Y which is observed at only q = 3 values xi of a covariate
X and assume
i = 1, 2, 3.
Y (xi ) ∼ N (µ(xi ), σ 2 ),
x3
x 1 x2
,
For illustration we simulated data according to designs
n1 n2 n − n1 − n 2
such that pi = ni /n empirically determines a distribution of X. Thus we observe
Yj (xi ) for j = 1...ni , i = 1, 2, 3. We chose x1 = 0.2, x2 = 0.5, x3 = 0.9. First we
discuss the simple linear regression model
µ1 (xi ) = α1 + β1 (xi − x),
where x =
6.1.1
(62)
pi xi , for α1 = 1, β1 = 1.2 and σ 2 = 0.16.
Models with non-random coefficients and homogeneous variances
Quantities of interest Considering a univariate response variable Y we write
2
σW
(σB2 ) for the within (between) variance instead of the covariance matrices
W (B) introduced in section 2.3. Assuming homogeneous variances we have
2
= σ 2 . Having specified a set-up with conditional sampling the marginal
σW
density pθ (y) is not estimable and cannot be used as reference density. Instead,
we choose a reference density p(y|θ0 ) which is Normal with mean µY and variance
σY2 . Thus it is not claimed that pθ (y) is Gaussian but that its first two moments
might be estimated and are to be used as parameters of a Gaussian distribution.
(i) Using the reference density p(y|ζ0 ) where ζ0T = (µY /σY2 , −σY−2 /2), we obtain
2
EJKL (p(y|ζ(X)), p(y|ζ0 )) = σB2 /σW
,
(63)
2
2
= 1 − σW
/σY2 .
RJ,ζ
0
(64)
and thus
(ii)
EIKL (p(y|ζ(X)), p(y|ζ0 )) =
1
1
2
) = − log(1 − σB2 /σY2 )
log(σY2 /σW
2
2
and
2
2
= σB2 /σY2 = RJ,ζ
.
RI,ζ
0
0
2
We observe that given σY2 all discrepancies are decreasing functions of σW
and
so are the coefficients of determination as monotone transformations.
26
2
There has been some discussion in the literature about the fact that RJ,ζ
<
0
2
1 because σW > 0. In order to reduce this effect some authors (e.g. Healy
([16]: 1984)
ni suggested to refer in model2 comparison to the dependence between
Yi = ( j=1 Yj (xi ))/ni , Yi ∼ N (µ(xi ), σ /ni ) and X, where X is assumed to be
uniformly distributed with pi = 1/q. Thus the response variable depends on the
design. In model comparisons based on the same data set this may be an option
although an internal scaling would also do.
2
Estimation In order to estimate RJ,ζ
estimates of variance may be inserted.
0
The data yield an estimate σY2 while the underlying distribution of Y is unknown.
From a set of proposed regression models then the one providing a decomposition
of σY2 such that the dependence between X and Y is maximum, is chosen. That
2
, interpretable as maximizing goodness of fit. A crucial
amounts to minimizing σW
2
issue when using variance estimates is whether the estimates satisfy σY2 = σB2 +σW
.
of
the
sum
For a given data set {yj (xi) |i = 1, ..., q; j = 1, ..., ni } the decomposition
of squares based on least squares estimates µi and y = ( i,j yj (xi ))/n,
(yj (xi ) − y)2 =
(yj (xi ) − µi )2 +
i,j
i,j
ni (µi − y)2
i
ensures the decomposition for the ml-estimates
σY2 =
1
n
(yj (xi ) − y)2 ,
i,j
2
σW
=
1
n
(yj (xi ) − µi )2 ,
σB2 =
i,j
1
n
ni (µi − y)2
i
yielding the conventional
R2 = σB2 /σY2
(65)
2
= 1 − σW
/σY2 .
(66)
In contrast the adjusted estimates (for two unknown parameters in the regression
functions of our example)
2
σ
Y,adj =
2
σ
B,adj =
1
n−1
1
q−1
(yj (xi ) − y)2 ,
2
σ
W,adj =
i,j
q
1
n−2
(yj (xi ) − µi )2 ,
i,j
(µ(xi ) − y)2
i
yield two different estimates. The ‘adjusted R2 ’,
2
2 n−1
2
2
Radj,1
=1−σ
) ≤ R2
W,adj /σY,adj = 1 − (1 − R )(
n−2
27
Fig.1
1.4
Rsq
Rsqadj1
Rsqadj2
true
1.2
1
Rsq
0.8
0.6
0.4
0.2
0
0
2
4
6
8
10
n0
12
14
16
18
20
2
2
2
(‘true’) of RJ,ζ
as functions of the number
Figure 1: Estimates R2 , Radj
, RJ,ζ
0
0
n0 of replicates at each xi
coincides with R2 for large n. For n = qn0
2
2
2
2
Radj,2
=σ
B,adj /σY,adj = R (
n−1
) ≥ R2 .
n − n0
q−1
q
2
limn0 →∞ nn00q−n
= q−1
and Radj,2
≥ 1 may occur. Discussions in the literature
0
about the appropriateness of estimates are presented as discussions about the
appropriate definition of a coefficient of determination.
In fig.1 we visualize the performance of the estimators for simulated data
generated according to the regression model N (µ1 (x), σ 2 ) specified in the previous
section. We consider designs with equal replicates ni = n0 , such that pi =
2
is given which was
1/3 = n0 /n for n0 = 1, . . . , 20. For comparison the value RJ,ζ
0
calculated using the empirical distribution of X specified by the experimental
design to evaluate the integral w.r.t. X in σB2 = varµ1 (X) but otherwise using
the true values of the parameters. Thus E JKL (p(y|θ(X)), p(y|θ0 )) = 0.7400,
2
= 0.4253 are obtained.
RJ,θ
0
6.1.2
Mixed model
We consider the extended regression function
µ2 (x) = α1 + β1 (x − x) + β2 (x2 − x2 )
28
and assume that β2 is a random coefficient: β2 ∼ N (0, τ 2 ). Marginally with µ1 (x)
given in (62)
Y |θ(x) ∼ N (µ1 (x), σ 2 (x))
with heterogeneous variances σ 2 (x) = (x2 − x2 )2 τ 2 + σ 2 .
Quantities of interest Now for ζ0T = (µY /σY2 , −σY−2 /2)
1
(µ(X) − µY )2
1
EJKL (p(y|ζ(X)), p(y|ζ0 )) = − [1 − σY2 E( 2
) − E(
)].
2
σ (X)
σ 2 (X)
(67)
2
2
Only if σ 2 (x) ≡ σW
, the right hand side of (67) reduces to σB2 /σW
. Substituting
2
2
σ (x) by some average value σ W , say, one might approximate
1 1 σ2 + σ2
EJKL (p(y|ζ(X)), p(y|ζ0 )) ≈ − + ( Y 2 B ) =: J0 .
2 2
σW
(68)
The decomposition of variance then holds only approximately, σY2 ≈ σ 2W + σB2 ,
and
EJKL (p(y|ζ(X)), p(y|ζ0 )) ≈ σB2 /σ 2W =: J1
≈ (σY2 /σ 2W ) − 1 =: J2 .
(69)
(70)
Correspondingly
J0
=: RJ20
J0 + 1
σB2
2
≈
2 =: RJ1
2
σB + σ W
σ2
=: RJ22 .
≈ 1− W
σY2
2
RJ,ζ
≈
0
(71)
(72)
(73)
Using the distribution of X induced by the uniform design to evaluate integrals
w.r.t. X, the same numerical values α1 = 1, β1 = 1.2 as in 6.1.1 and σ 2 = 0.09,
2
= 0.4717. Hence
τ 2 = 0.4, we have E JKL (p(y|ζ(X)), p(y|ζ0 )) = 0.8929 and RJ,ζ
0
in the mixed model there is a slightly stronger relation between X and Y than
in the first model with non-random effects.
Estimation Willett and Singer ([42]: 1988) use weighted least squares estimates of µ(x) in order to quantify the ‘variance explained by regression’ in the
original metric of Y in contrast to the decomposition of sum of squares of trans2
refers to the metric
formed variables proposed by Kvålseth ([21]: 1985). RJ,ζ
0
of Y , and estimation should aim directly at EJKL (p(y|ζ(X)), p(y|ζ0 )). Otherwise
invariance of (the estimate of) EJKL (p(y|ζ(X)), p(y|ζ0 )) w.r.t. the chosen transformation of Y should be checked.
29
For illustration we discuss an example with univariate continuous X and Y introduced by Draper and Smith ([11]: 1981, pp.112f) and resumed by Willett and
Singer ([42]: 1988). In the example the variance function
σ 2 (x) = 1.5329 − 0.7334x + 0.0883x2
derived by Draper and Smith ([11]: 1981, p.115) is used, and the estimated
regression function based on weighted least squares is
µW LS (x) = −0.889 + 1.142x.
There are n = 35 data points.
Wefirst estimate EJKL (p(y|ζ(X)), p(y|ζ0 )) as in (67), using µY = y, σY2 =
1
2
ν (yν −y) and empirical means instead of expectations having evaluated µ(x)
35
and σ 2 (x) for each xν , ν = 1, . . . , 35. Thus we obtain E JKL (p(y|ζ(X)), p(y|ζ0 )) =
35.83, RJ,ζ0 = 0.9728. Next we set
1
2
σ
B,W LS =
35
35
(µW LS (xν ) − y)2 .
ν
In order to obtain approximations we try three values of σ 2W :
σ 2W,1
1
= 1/(
35
σ 2W,2
σ 2W,3 =
1
35
35
1
σ 2 (x
ν
ν)
) = 2.53,
35
1
=
35
σ 2 (xν ) = 1.63,
ν
35
(yν − µW LS (xν ))2 = 2.02.
ν
All of them do not satisfy σY2 = σ 2W + σB2 . Insertion of these estimates in (68)
results in J0,i (RJ2 ), say, for i = 1, 2, 3. Substituting in (69) yields J1,i (RJ2 ),
1,i
0,i
and evaluating (70) gives J2,i (RJ2 ) respectively. The values of the coefficients
2,i
of determination are given in table 1.
Table 1
RJ2
σ 2W,1
σ 2W,2
σ 2W,3
0,i
0.9670
0.8709
0.8425
RJ2
1,i
0.9655
0.8718
0.8460
30
RJ2
2,i
0.9684
0.8699
0.8389
The results indicate that the choice of the variance estimator is more important than the approximation of EJKL (p(y|ζ(X)), p(y|ζ0 )). Relative to RJ,ζ0 =
0.9728 the worst performance is obtained with σ 2W,3 , an estimate that does not
refer to the functional specification of σ 2 (x). The estimate RJ2 evaluated with
2,i
σ 2W,3
(i.e. 0.8389) was suggested by Willett and Singer ([42]: 1988). The best estimator is the one based on the mean of inverse variances, 1/σ 2 (x), which according
to (67) is in fact more appropriate than forming the inverse of the mean.
6.2
Example: Hald data
The Hald data set is a famous data set which has often been analyzed in order
to illustrate techniques for variable selection. The data and an extensive classical
analysis are given by Draper and Smith ([11]: 1981, pp.297, 591, 629). We
resume this example in order to illustrate how the classical strategies of ‘backward
elimination’ and ‘forward insertion’ amount to selecting a model with maximum
strength of relation between X and Y .
The response variable Y is ‘heat evolved in cement’, and the covariates X1 , ...,
X4 , measure amounts of ingredients (in %) in clinkers used to produce the cement.
We set X = (X1 , . . . , X4 ). The covariates are nearly linearly dependent summing
up to almost 100%. There are only n = 13 data points (yν , xν ), where xν =
(x1ν , . . . , x4ν ). We assume Y |x ∼ N (µ(x), σ 2 ), where µ(x) = α + β T x, without
referring to joint Normality of X and Y. We use centered covariates.
6.2.1
Quantities of interest
We again have conditional Gaussian densities with homogeneous variances. Hence
the theoretical derivations in section 6.1.1 apply, and the main quantity of interest
2
2
= σB2 /σW
.
is again RJ,ζ
0
6.2.2
Estimation
2
Rao and Toutenburg ([31]: 1995, p.48) show that R2 = R
J,ζ0 increases if variables
are added within the linear regression model thus preserving monotonicity of the
2
). For the Hald data the values of R2 are (with {j, k, l}
estimated quantity (in σW
31
indicating the sequence of variables Xi included in the model)
{1}
{1,2}
{1,2,4}
{1,2,4,3}
{2}
{2,1}
{2,1,4}
{2,1,4,3}
0.532
0.979
0.9823
0.9824
0.666
0.979
0.9823
0.9824
{3}
{3,4}
{3,4,1}
{3,4,1,2}
{4}
{4,1}
{4,1,2}
{4,1,2,3}
0.286
0.935
0.981
0.9824
.
0.674
0.972
0.9823
0.9824
n
2
−1
2
2
, now with σ
Radj
W,adj = (n − k − 1)
ν=1 (yν − µ(xν )) , when k covariates Xi
2
are included in the model, and with σ
as in section 6.1.1 is not monotone
Y,adj
anymore. For the Hald data the sequence
{4}
{4,1}
{4,1,2}
{4,1,2,3}
0.645
0.967
0.9645
0.974
is obtained. Adjusted variances do penalize ‘fit’ (in the sum of squares) by ‘model
complexity’ in the degrees of freedom. Yet adjusted variances exemplify, that
estimates better than ml-estimates can be expected ‘not to penalize sufficiently’
for model complexity in model choice aiming at prediction because a coefficient
of determination is an inappropriate criterion for predictive model choice.
6.2.3
Variable selection
Two principal methods are applied in variable selection: ‘backward elimination’
and ‘forward insertion’.
Eliminating backwards one starts with the complete vector of covariates in the
regression model and deletes in turn that component which least reduces the
measure of dependence between X and Y until a significant drop occurs. In each
step the difference of measures is assessed.
Under Normality of Y |θ(x) this reduces to the comparison of within variances,
2
2
s or RJ,ζ
s in this case is a standardized (by
and in fact, the difference of RI,ζ
0
0
2
σY ) difference of within variances. The test statistic of the partial F-test used to
establish significance of the difference under consideration is again a (differently)
standardized difference of estimated within variances.
For the Hald data backward elimination based directly on R2 suggests the
path
set
{1,2,3,4} → {1,2,4} → {1,2} → {2}
2
98.24
98.23
97.2
66.6
R
32
with an intuitive cut-off at {1,2}. This is confirmed using partial F-tests as
reported by Draper and Smith ([11]: 1981, p.306). Reversely, adding variables
to the model such that the dependence between X and Y is maximally increased
yields for the Hald data the steps
set
R2
{4} →
67.4
{4,1}
97.2
→
{4,1,2}
98.23
→
{4,1,2,3}
.
98.24
The cut-off point is again set at 97.2. Thus variable selection based on R2 amounts
to choosing the model with maximum dependence between X and Y taking into
account the variability of the estimators of the measure of dependence.
The incremental association measured in terms of I(θ(X), Y ) can be transformed to a difference of RI2 s using (31). If joint Normality of X and Y holds
this becomes a difference of multiple correlation coefficients. Standardization of
this difference yields the partial correlation coefficient used to assess the contribution of the added variable to the overall association. Compare the discussion
by Darbellay ([8]: 1998). As least squares estimates mimic a jointly Gaussian
distribution of (X, Y ) empirical multiple and partial correlation coefficients are
defined analogously to the theoretical quantities even if the distributional assumption does not hold. In this spirit Draper and Smith ([11]: 1981, pp.308f)
refer to empirical partial correlation coefficients for the Hald data.
6.3
Example: Doksum et al. (1994)
We analyze simulated data from a distribution introduced by Doksum et al.
([9]: 1994) in order to illustrate quantification of the local strength of association between X and Y . We also illustrate model comparison for nonparametric
regression looking at the divergence as a function of the smoothing parameter.
We have univariate X and Y with
Y |θ(x) ∼ N (µ(x), σ 2 (x)),
where
x
x
) exp(5 − ),
10
2
x 2
1
2
σ (x) =
(1 + ) ,
9
2
µ(x) = (
2
2
and X ∼ N (µX , σX
), µX = 1.2, σX
= 1/9.
6.3.1
Quantities of interest
Global measure of dependence Considering the conditional Gaussian distributions as a two-parameter exponential family we have canonical parameters
33
ζ1 (x) = µ(x)/σ 2 (x), ζ2 (x) = −1/(2σ 2 (x)) corresponding to the canonical statistics t1 (Y ) = Y, t2 (Y ) = Y 2 . Computation of J(θ(X), Y ) according to (67)
yields
J(θ(X), Y ) = 6.75, RJ2 = 0.87.
All expectations w.r.t. X were obtained by numerical integration over the range
(0.0,2.4) covering 0.999 of the probability mass of X. The general definition of a
coefficient of determination that we suggest thus naturally yields a coefficient for
a nonparametric regression model with heterogeneous variances.
Local strength of dependence The limiting local discrepancy, quantifying
the change of the distribution due to a change in x, is given by J0 (x) = I(x) as x
is univariate. Regarding the Normal distributions as a one-parameter exponential
family where only changes in µ(x) - though weighted by σ 2 (x) - are reflected,
J0 (x) =
µ′ (x)2
=: J01 (x).
σ 2 (x)
Considering the Normal distributions as a two-parameter exponential family
J0 (x) = I(x)
=
(ζ1′ (x), ζ2′ (x))cov
Y
Y2
ζ1′ (x)
ζ2′ (x)
= σ 2 (x)(ζ1′ (x))2 + 4µ(x)σ 2 (x)ζ1′ (x)ζ2′ (x)
+4µ2 (x)σ 2 (x) + 2σ 4 (x)(ζ2′ (x))2
= : J02 (x).
2
2
The corresponding determination curves RJ20i (x) = σX
J0i (x)/(1 + σX
J0i (x)) are
2
displayed in fig.2. Note that according to (54) RJ01 (x) is the correlation curve.
It coincides with the curve displayed in fig.4 in ([9]: Doksum et al., 1994) apart
from the range of x-values. At x = 2 the mean function µ attains a maximum,
and hence the derivative is zero.
Fig.2 shows that the rates of change J01 (x) and J02 (x) behave similarly in the
center of the distribution of X. Where they differ slightly (x > 1.6) it is indicated that the local strength of relation between X and Y evaluated in the
two-parameter family is weaker than in the one-parameter family
According to (57) the certainty curve is RI2 (x) = 1 − c. log(2πeσ 2 (x)), where
c. is a constant determined by H(Y ). Thus this curve pointwise results from an
antitone transformation of σ 2 (x). As σ 2′ (x) > 0, σ 2 (x) is an increasing function
of x, and the reduction of uncertainty about Y is maximum for x = 0. Hence
the certainty curve like the correlation curve decreases in (0,2.0).
34
Fig.2
1
0.9
0.8
0.7
RJ0sq
0.6
0.5
0.4
0.3
0.2
0.1
RJ01sq
RJ02sq
0
0
0.5
1.2
x
2
2.4
Figure 2: Determination curves corresponding to J0 (x) evaluated for the Normal family as one-parameter (RJ01sq) or as two-parameter (RJ02sq) exponential
family
2
Estimation Using 2000 simulated data points (xν , yν ) we estimate σX
, µY , σY2
by the empirical mean and variances, and following Doksum et al.([9]: 1994)
we then estimate µ(x) and σ 2 (x) by averages over neighbourhoods, where each
neighbourhood of an x-value contains K points. The derivative of σ 2 (x) is calculated like µ′ (x) as a ratio of differences using half the neighbourhoods. For
further details see the paper by Doksum et al. ([9]: 1994). Inserting the estimated curves and evaluating expected values w.r.t. X as empirical means we
obtain J(θ(X), Y ). It is displayed as a function of K (K = 30, 60, . . . , 300) in
fig.3 below. The highest (estimated) strength of relation is attained for K = 30.
Doksum et al.([9]: 1994) recommend K = 60 as optimal size of the neighbourhood, but we found that for higher K the estimates of the derivatives improve.
It is to be expected that for over-smoothed µ the association between X and
Y decreases, and this is confirmed in fig.3 showing J(θ(X), Y ) as a decreasing
function of K.
The estimated limiting local divergences J01 (x) and J02 (x) for the smoothing
parameter K = 300, transformed into determination curves RJ2 , RJ2 are shown
01
02
in fig.4.
The estimates perform rather poorly, and better estimates are required to reliably
evaluate the determination curves.
35
Fig.3
7
6.5
6
J
5.5
5
4.5
4
estimated
true
3.5
0
50
100
150
K
200
250
300
Figure 3: Estimated values of J(θ(X), Y ) for smoothing parameters K with
constant true value J(θ(X), Y ) = 6.75
6.4
Binomial response: example birth study
We illustrate our approach for a binomial response variable and emphasize comparison of models with different link functions. We also evaluate approximations
of EJKL (p(y|ζ(X)), p(y|ζ0 )) according to results 4 and 5.
In the example the dependent variable Y indicates occurrence of an infection following birth by Caesarian section. The risk of infection is modelled depending on three binary covariates: X1 indicates whether the Caesarian section
was planned or not, X2 indicates the presence of risk factors, and X3 indicates
prophylaxis with antibiotics. The data and analysis are given in the book by
Fahrmeir and Tutz ([12]:1994, pp.29f). We use the notation introduced in section 2.3.2. Thus π(x) denotes the risk of infection given x = (x1 , x2 , x3 ), and
Y |π(x) ∼ B(1, π(x)). The covariate vector X takes eight values, denoted by
triples of zeros and ones, and their probabilities are given as p(x) in general.
Let the
number of infections occurring under condition x be k(x). There are
n = x n(x) = 251 observations.
36
Fig.4a
1
R01sq
0.8
0.6
0.4
0.2
true
estimated
0
0.8
0.9
1
1.1
1.2
1.3
x
1.4
1.5
1.6
1.7
1.8
1.4
1.5
1.6
1.7
1.8
Fig.4b
1
R02sq
0.8
0.6
0.4
0.2
true
estimated
0
0.8
0.9
1
1.1
1.2
1.3
x
Figure 4: a) true and estimated correlation curve ρ2 (x) = determination curve
RJ201 (x) with estimate based on K = 300 b) determination curve RJ202 (x) with
estimate based on K = 300
6.4.1
Quantities of interest
(i) For a Bernoulli reference distribution B(1, π0 ) we have
JKL (p(y|π(x)), p(y|π0 )) =
((p(y|π(x)) − p(y|π0 )) log(
y
p(y|π(x))
),
p(y|π0 )
(74)
where p(y|π) = π y (1 − π)1−y , π ∈ {π(x), π0 }. Hence
EJKL (p(y|π(X)), p(y|π0 ))
=
p(x)(π(x) − π0 )(log
x
(75)
π(x)
1 − π0
+ log
).
1 − π(x)
π0
(ii) Similarly
EIKL (p(y|π(X)), p(y|π0 ))
p(y|π(x))
.
=
p(x)
p(y|π(x)) log
p(y|π0 )
x
y
(76)
(iii) We also investigate approximations to EIKL (p(y|π(X)), p(y|π0 )) and
EJKL (p(y|π(X)), p(y|π0 )). First we consider the log-likelihood ratio as an ap37
proximation to EIKL (p(y|π(X)), p(y|π0 )).
π(x)k(x) (1 − π(x))n(x)−k(x)
1
log x k(x)
n
(1 − π0 )n(x)−k(x)
x π0
n(x) 1
π(x)k(x) (1 − π(x))n(x)−k(x)
=
log
k(x)
n n(x)
π0 (1 − π0 )n(x)−k(x)
x
=
x
n(x) k(x)
π(x) n(x) − k(x)
1 − π(x)
[
log
log
+
]
n n(x)
π0
n(x)
1 − π0
= : ILR .
Hence in the log-likelihood ratio p(x) is replaced by the relative frequency n(x)/n
which would also be used in conditional sampling, and the model dependent
probabilities p(y|π(x)) are replaced by relative frequencies k(x)/n(x), n(x) −
k(x)/n(x) respectively. According to (33) the corresponding approximation for
EJKL (p(y|π(X)), p(y|π0 )) is EJKL (p(y|π(X)), p(y|π0 )) ≈ 2ILR .
Result 4 suggests the ‘Wald’ approximation (with ζ0 =logit π0 )
EJKL (p(y|π(X)), p(y|π0 )) ≈ E(tr[HM (ζ0 )(ζ(X) − ζ0 )(ζ(X) − ζ0 )T ])
yielding here
p(x)[var(Y |π0 )(logit π(x) − logit π0 )2
EJKL (p(y|π(X)), p(y|π0 )) ≈
x
≈
x
n(x)
π0 (1 − π0 )(logit π(x) − logit π0 )2(. 77)
n
Result 5 gives the ‘score’ approximation
p(x)(π(x) − π0 )2 /var(Y |π0 )
EJKL (p(y|π(X)), p(y|π0 )) ≈
x
≈
x
n(x) (π(x) − π0 )2
,
n π0 (1 − π0 )
(78)
which is the Binomial version of the ratio of ‘between’ to ‘within’ variance.
(iv) Analogously to definition 1 we define coefficients of determination of Y by
Xi conditional on X−i = x−i , where X−i denotes the vector of covariates without
Xi ,
J(θ(Xi ; x−i ), Y )
2
=
RJ|x
∈ [0, 1].
(79)
−i
1 + J(θ(Xi , x−i ), Y )
Here we examine the partial association between Y (infection) and X3 (antibiotics) fixing (x1 , x2 ) = x−3 . For example, for x−3 = (0, 1) corresponds to the
condition that a Caesarian section was not planned but risk factors were present.
38
We compare three models
- logit link without interaction between X1 and X2 ,
ζ(x) = log
π(x)
= α + β1 x1 + β2 x2 + β3 x3 =: η1 (x)
1 − π(x)
- logit link with interaction between X1 and X2 ,
ζ(x) = η1 (x) + β4 x1 x2 =: η2 (x)
- probit link without interaction,
π(x) = Φ(η1 (x)).
6.4.2
Estimation
We evaluate the expressions given above for the estimate π0 =
k(x)/n = 0.283.
x
For the three models we insert ml-estimates (given by Fahrmeir and Tutz
([12]:1994) in the formulae for the discrepancies and their approximations.
Under the logit link the estimated coefficients of η1 (x) are
α = −1.89,
β1 = 1.07,
β2 = 2.03,
β3 = −3.25,
and for η2 (x)
α = −1.39,
β1 = −11.64,
β2 = 1.36,
β3 = −3.83,
β4 = 13.49.
Under the probit link η1 (x) is estimated with coefficients
α = −1.09,
β1 = 0.69,
β2 = 1.2,
β3 = −1.9.
We also evaluate the discrepancies for the saturated model using the cell frequencies as estimates. We dealt with zero probabilities as arguments of the logarithm
using the convention 0 log 0 = 0. The values are displayed in fig.5.
2
= 0.2828, indicating
Evaluating EIKL (p(y|π(X)), p(y|π0 )) we obtain RI,π
0
that the saturated model does outperform all other models, although the model
with logit link and interactions attains nearly the same coefficient of determination. The relation between X and Y in terms of the divergence turns out to
be strongest in the model with the logit link and interactions. For this particular data set it even outperforms the saturated model. This is due to the case
where a Caesarian section was not planned and no risk factor was present but
antibiotics were given, which occurred twice without any infection observed. The
probability of infection is estimated by 0.0054 in the model with logit link and
interactions and hence does add a term to the divergence whereas using the relative frequencies and the convention 0 log 0 = 0 no term reflects this case in the
divergence.
39
Fig.5a
0.29
insert
lik
RIsq
0.28
0.27
0.26
0.25
0.24
sat
can2
can1
probit
Fig.5b
0.32
insert
lik
Wald
score
RJsq
0.3
0.28
0.26
0.24
0.22
sat
can2
can1
probit
2
Figure 5: a) Estimates and estimates of approximations of RI,π
b) Estimates
0
2
and estimates of approximations of RJ,π
0
We illustrate the coefficient of determination of Y by Xi conditional on x−i
given in (79). We evaluate the partial coefficients of determination between Y
(infection) and X3 (antibiotics) for the saturated model and for the model with
logit link and interactions. The values are given in the following table.
Table 2
2
Estimated coefficients of determination RJ,π
0 |x−3
saturated logit, interaction
x−3 = (x1 , x2 )
(0, 1)
(1, 1)
0.1758
0.3445
0.2470
0.3206
That is, the relation between infection and antibiotics turns out to be stronger
if a Caesarian section was not planned, than in the case where a section was
planned beforehand.
40
7
7.1
Discussion
Quantities of interest and estimates
We elaborated the idea to define coefficients of determination as measures of
dependence, particularly based on the directed divergence stressing the explanatory power of X as reduction of uncertainty about Y or, based on the divergence,
stressing the discriminatory power of X in telling apart conditional distributions
of Y. In order to apply coefficients of determination in model comparison essentially two versions have been focussed on:
- for the comparison of models defined within an encompassing joint distribution
of X and Y : J(θ(X), Y ) and I(θ(X), Y )
- for the comparison of models with (conditional) sampling distributions belonging to the same (often exponential) family of distributions:
EJKL (p(y|θ(X)), p(y|θ0 )) and EIKL (p(y|θ(X)), p(y|θ0 )).
It is hard to decide whether to use the directed divergence or the divergence, and
we do not recommend any of them exclusively. The most appealing feature of the
divergence is the representation (result 1) as a functional of the log-odds ratio
function Ψ0 characterizing the association between X and Y (in the sense that
their joint distribution is determined by the marginal distributions and Ψ). This
feature yields a simple representation of the divergence in exponential families
(result 3) where Ψ is bi-affine. The main advantage of the directed divergence
is its decomposition in terms of entropies (28) and the monotonicity properties
(discussed in section 4).
Model comparison using a reference density is essentially based on
−EX E(log p(y|θ(X))|θ0 ). For the comparison of models without any common
reference model the conditional entropy, that is the model specific expected value
of the log density, H(Y |θ(X)) = −EX E(log p(y|θ(X))|θ(X)) is used instead.
We did not consider covariates defined on an ordinal scale although measures
of association have been applied in this case as well (e.g. [33]: Särndal, 1974).
The adaptation of the coefficients of determination we propose to this set-up
requires further study.
The clear distinction between theoretical and estimated quantities may stimulate further research on properties of estimates. The interpretation and use of
coefficients of determination though is determined by the concept underlying their
definition rather than properties of their estimates. We emphasize that in our
view estimates of theoretically defined coefficients of determination empirically
describe a feature of a regression model, namely dependence between a response
Y and covariates X, and are not devised as measures of goodness of fit for a
data set at hand. As ml-estimates of coefficients of determination often happen
to be measures of goodness of fit this interpretation has been wide spread and
implemented in sampling definitions of coefficients of determination (for example
Cameron and Windmeijer ([5]:1997)).
41
7.2
Model choice
Model comparison w.r.t. the explanatory power of a model or the richness of a
family of distributions as described in the paper is different from model choice
if the target is prediction of future observations. In model choice it is often the
predictive potential of a model that is of interest, and hence predictive densities are compared. The approximation of a predictive density by a likelihood
needs to be corrected, and such a correction often yields a criterion for model
choice of the form ‘fit+complexity’, for example AIC or its Bayesian extension
DIC ([38]: Spiegelhalter et al., 2002). The target criterion for model choice
−E log p(Y |θ(x)), where Y denotes a future observation and θ a parameter estimate, provides the link between these approaches and coefficients of determination, particularly H(Y |θ(X)).
Appendix
Proof of result 3:
JKL (p(y|ζ(x)), p(y|ζ))
= (p(y|ζ(x)) − p(y|ζ)) log(p(y|ζ(x))/p(y|ζ)) dy
= E[(ζ(x) − ζ)T t(Y ) − M (ζ(x)) + M (ζ)|ζ(x)]
−E[(ζ(x) − ζ)T t(Y ) − M (ζ(x)) + M (ζ)|ζ]
= (ζ(x) − ζ)T [E(t(Y )|ζ(x)) − E( t(Y )|ζ)].
Taking expectations w.r.t. x yields
J(θ(X), Y ) = EX [(ζ(X) − ζ)T E(t(Y )|ζ(X))]
= EX [(ζ(X) − ζ)T (E(t(Y )|ζ(X)) − E t(Y ))]
= tr{cov X (ζ(X), E(t(Y )|ζ(X)))}
which is (12).
Proof of result 5:
Set τ (x) = f (ζ(x)) or ζ(x) = f −1 (τ (x)). (14) can be elaborated as
J(θ(X), Y ) =
=
=
=
E[(ζ(X) − ζ)T (τ (X) − τ̄ )]
E[ζ(X)T (τ (X) − τ̄ )]
E[(τ (X) − τ̄ )T f −1 (τ (X))]
E[u(τ (X))]
(say).
A second order Taylor expansion of u(τ (x)) in τ̄ then yields the result.
42
References
[1] Bell, C.B. (1962). Mutual Information and Maximal Correlation as Measures
of Dependence. Ann.Math.Statist. 33, 587-595.
[2] Bernardo, J.M. and Rueda, R. (2002). Bayesian Hypothesis Testing: a Reference Approach. Int.Statist.Rev. 70, 351-372.
[3] Blyth, S. (1994). Local Divergence and Association. Biometrika 81, 579-584.
[4] Bjerve, S. and Doksum, K. (1993). Correlation Curves: Measures of Association as Functions of Covariate Values. Ann.Statist. 21, 890-902.
[5] Cameron, A.C. and Windmeijer, A.G. (1997). An R-squared Measure of
Goodness of Fit for Some Common Nonlinear Regression Models. J. of
Econometrics 77, 329-342.
[6] Cover, T.M. and Thomas, J.A. (1991). Information Theory. Wiley: New
York.
[7] Cramer, E.M. and Nicewander, W.A. (1979). Some Symmetric, Invariant
Measures of Multivariate Association, Psychometrika 44, 43-54.
[8] Darbellay, G.A. (1998). Predictability: an Information-Theoretic Perspective. In: A. Procházka et al. (eds.). Signal Analysis and Prediction.
Birkhäuser: Boston 249-262.
[9] Doksum, K., Blyth, S., Bradlow, E., Meng, X.-L. and Zhao, H. (1994).
Correlation Curves as Local Measures of Variance Explained by Regression.
J.Amer.Statist.Ass. 89, 571-582.
[10] Doksum, K. and Samarov, A. (1995). Nonparametric Estimation of Global
Functionals and a Measure of the Explanatory Power of Covariates in Regression. Ann.Statist. 23, 1443-1473.
[11] Draper, N.R. and Smith, H. (1981). Applied Regression Analysis. Wiley:
New York.
[12] Fahrmeir, L. and Tutz, G. (1994). Multivariate Statistical Modelling Based
on Generalized Linear Models. Springer: New York.
[13] Glahn, H. (1969). Some Relationships Derived from Canonical Correlation
Theory. Econometrica 37, 252-256.
[14] Gleason, T.C. (1976). On Redundancy in Canonical Analysis. Psycho.Bull.
83, 1004-1006.
43
[15] Härdle, W. (1990). Applied Nonparametric Regression. Cambridge Univ.
Press: Cambridge.
[16] Healy, M.J.R. (1984). The Use of R2 as a Measure of Goodness of Fit.
J.R.Statist.Soc. A147, 608-609.
[17] Hooper, J. (1959). Simultaneous Equations and Canonical Correlation Theory. Econometrica 27, 245-256.
[18] Jessop, A. (1995). Informed Assessments. Ellis Horwood: New York.
[19] Joe, H. (1989). Relative Entropy Measures of Multivariate Dependence.
J.Amer.Statist.Ass. 84, 157-164.
[20] Kent, J.T. (1983). Information Gain and a General Measure of Correlation.
Biometrika 70, 163-173.
[21] Kvålseth, T.O. (1985). Cautionary Note about R2 . The Amer.Statist. 39,
279-285.
[22] Kullback, S. (1968). Information Theory and Statistics. Dover Publ.: Mineola, New York (2nd ed.).
[23] Lavergne, P. and Vuong. Q.H. (1998). An Integral Estimator of Residual
Variance and a Measure of Explanatory Power of Covariates in Nonparametric Regression. Nonpar. Statist. 9, 363-380.
[24] van der Linde, A. (2004). On the Association between a Random Parameter
and an Observable. Test 13, 85-111.
[25] McKay, R.J. (1977). Variable Selection in Multivariate Regression: an Application of Simultaneous Test Procedures. J.R.Statist.Soc. B39, 371-380.
[26] Magee, L. (1990). R2 Measures Based on Wald and Likelihood Ratio Joint
Significance Tests. J.Amer.Statist.Ass. 44, 250-253.
[27] Mardia, K.V., Kent, J.T. and Bibby, J.M. (1995). Multivariate Analysis.
Academic Press: New York (10th ed.).
[28] Nagelkerke, N.J.D. (1991). A Note on a General Definition of the Coefficient
of Determination. Biometrika 78, 691-692.
[29] Osius, G. (2000). The Association between Two Random Elements: a Complete Characterization in terms of Odds Ratios. Metrika, to appear.
[30] Rao, C.R. (1973). Linear Statistical Inference and Its Applications. Wiley:
New York.
44
[31] Rao, C.R. and Toutenburg, H. (1995). Linear Models. Spinger: New York.
[32] Renyi, A. (1959). On measures of dependence. Acta.Math.Acad.Sci. Hungar.10, 441-451.
[33] Särndal, C.E. (1974). A Comparative Study of Association Measures. Psychometrika 39, 165-187.
[34] Silvey, S.D. (1964). On a Measure of Association. Ann.Math.Statist. 35,
1157-1166.
[35] Simon, G. (1973). Additivity of Information in Exponential Family Probability Laws. J.Amer.Statist.Ass. 68, 478-482.
[36] Soofi, E.S., Retzer, J.J. and Yasai-Ardekani, M. (2000). A Framework for
Measuring the Importance of Variables with Applications to Management
Research and Decision Models. Decision Sciences 31, 595-625.
[37] Soofi, E.S.and Retzer, J.J. (2002). Information Indices: Unifications and
Applications. J. of Econometrics 107, 17-40.
[38] Spiegelhalter, D.J , Best, N.G., Carlin, B.P. and van der Linde, A.
(2002). Bayesian Measures of Model Complexity and Fit (with discussion).
J.R.Statist.Soc B64, 583-639.
[39] Stewart, D. and Love, W. (1968). A General Canonical Index. Psycho. Bull.
70, 160-163.
[40] Theil, H. (1987). How Many Bits of Information Does an Independent Variable Yield in a Multiple Regression ? Statist.&Prob.Lett. 6, 107-108.
[41] Theil, H. and Chung, C. (1988). Information-Theoretic Measures of Fit for
Univariate and Multivariate Regression. The Amer. Statistician 42, 249-252.
[42] Willett, J.B. and Singer, J.D. (1988). Another Cautionary Note about R2 :
its Use in Weighted Least-Squares Regression Analysis. J.Amer.Statist.Ass.
42, 236-238.
45