
On association in regression: the coefficient of determination revisited

2008, Statistics

van der Linde, Tutz: On association in regression: the coefficient of determination revisited. Sonderforschungsbereich 386, Paper 391 (2004). Online at: http://epub.ub.uni-muenchen.de/

A. van der Linde* & G. Tutz**

10.6.2004

Abstract

Universal coefficients of determination are investigated which quantify the strength of the relation between a vector of dependent variables Y and a vector of independent covariates X. They are defined as measures of dependence between Y and X through θ(x), with θ(x) parameterizing the conditional distribution of Y given X = x. If θ(x) involves unknown coefficients γ, the definition is conditional on γ, and in practice γ, respectively the coefficient of determination, has to be estimated. Estimates of the quantities we propose generalize R² in classical linear regression and are also related to other definitions previously suggested. Our definitions apply to generalized regression models with arbitrary link functions as well as to multivariate and nonparametric regression. The definition and use of the proposed coefficients of determination is illustrated for several regression problems with simulated and real data sets.

Keywords: association, mutual information, regression, coefficient of determination, R², correlation curve.

* University of Bremen, Germany, FB3: Institute of Statistics, PO Box 330 440, 28334 Bremen, email: [email protected] (corresponding author)
** Ludwig-Maximilian-University Munich, Germany, Institute of Statistics, Ludwigstr. 33, 80539 Munich, email: [email protected]

1 Introduction

As part of a regression analysis, statisticians are interested in describing the strength of the relation between the dependent variable Y and the independent variables X. The measure commonly used to this end is the coefficient of determination, usually denoted by R². For jointly Gaussian variables X and Y, R² is related to the decomposition of the total variance of Y into the 'between variance' var_X E(Y|X) (explained by regression) and the 'within variance' E_X var(Y|X) (error), and describes the proportion of the total variance explained by regression. The decomposition of variance holds without any distributional assumption and might therefore be used as a starting point for generalizations. In particular, local measures R²(x), summarized in a 'correlation curve', were suggested with reference to this idea ([9]: Doksum et al., 1994). For one-parameter exponential families, where the conditional distribution of Y given X = x is specified by generalized (linear) regression models, different definitions of R² were proposed ([26]: Magee, 1990; [28]: Nagelkerke, 1991). These are based on a full likelihood ratio rather than just the variance, thus taking into account that the separation of the conditional mean E(Y|x) and the conditional variance var(Y|x) is a special feature of the Gaussian distribution. Although the attempts to define a coefficient of determination more generally indicate that there is a 'natural appeal' ([26]: Magee, 1990) to such a measure, its definition and use are not really clear. Often a coefficient of determination is specified descriptively rather than theoretically as a quantity of interest. Hence features of the parameter of interest and properties of estimates continue to be mixed up.
The ambiguity of continuing attempts to provide a coefficient of determination, while its use and interpretation even in classical examples remains a subject of debate, indicates a lack of clarity about the underlying concept 'strength of relation between Y and X' that is to be captured and quantified. In this paper we describe simple information theoretic ideas that help to disentangle the various notions associated with a coefficient of determination, and we introduce and review universal definitions of such coefficients, global as well as local (conditional on x), as measures of dependence. Although measures of association and dependence between variables X and Y have been discussed extensively, especially in psychometrics (e.g. [7]: Cramer and Nicewander, 1979; [33]: Särndal, 1974), they have hardly been evaluated taking into account a regression model ([20]: Kent, 1983). Based on measures of association, the determination of Y by X(= x) can be interpreted in two ways: determination of the conditional density p(y|x), capturing the discriminatory power of X(= x), or determination of the (range of) values of Y given X(= x), capturing the explanatory power of X(= x). Of course, these ideas are related, as different conditional densities correspond to different ranges of values of Y. The first notion will be the focus of the paper, but the second notion, usually formalized referring to entropies of distributions of Y, will be discussed as well.

We specify a parameter θ that determines the distribution of a response Y as a function of the independent variables X and coefficients γ; hence in our approach a regression model induces a family of parameterized densities p(y|θ(x, γ)) for Y which vary depending on x. We claim that a coefficient of determination should quantify the strength of the relation between X and Y in a regression model (given the regression coefficients), measuring the variation in the family of distributions induced by X. Different amounts of variation may be obtained depending on which covariates are included in the model; indeed, a major field of application of coefficients of determination has been the problem of variable selection (e.g. [11]: Draper and Smith, 1981, ch.6.1; [25]: McKay, 1977). Also, the choice of the link function in a generalized regression model or the choice of a smoothing parameter in nonparametric regression has an impact on the strength of the relation between covariates X and Y. Thus, intuitively, our approach is based on the idea that the flexibility of a (regression) model is reflected by the range of conditional densities. This range is quantified as the average distance to a reference density, e.g. the marginal density p_θ(y), which is estimable only under joint sampling. Alternatively, the reference density is often chosen corresponding to a regression model with regression coefficients equal to zero. Under conditional sampling the average w.r.t. X is to be taken according to the experimental design.

A coefficient of determination is not only a descriptive measure of a given regression model, but is mainly of interest in model comparison. The coefficient used to this end depends on the set of models under consideration. Two comparisons are usually distinguished: (1) given a joint distribution of Y and all covariates, the comparison of sub-models, e.g. based on subsets of covariates; (2) the comparison of models specified by the same type of conditional density p(y|x) with different parameters θ(x, γ), e.g. corresponding to different subsets or different functions of the covariates.
The reference density in the coefficient of determination has to be chosen accordingly, and hence several coefficients of determination have been proposed in the literature. In order to express emphasis on x and for notational convenience we drop γ, writing simply θ(x) := θ(x, γ) unless we explicitly refer to γ. In practice γ needs to be estimated from a given data set.

A local measure of association given X = x may be defined based on the rate of change of the densities p(y|θ(x)) if discrimination of conditional densities is in focus and X is continuous. Alternatively, the reduction of uncertainty by x yields a local coefficient of determination in terms of explanation. We introduce and investigate universal measures of dependence for regression models, yielding coefficients of determination for all regression models: for univariate as well as multivariate response variables, for non- and semi-parametric regression models, for generalized regression models with arbitrary link function, and for regression models with heterogeneous variances.

The paper is organized as follows. In section 2 we start out with a formal definition of a coefficient of determination based on a joint distribution of X and Y. We derive representations in exponential families and illustrate the definition for classical examples. In section 3 we refine the proposed definition, considering approximations as well as alternatives, particularly alternatives based on different reference densities. We further discuss determination curves as generalizations of correlation curves quantifying the local strength of the relation between X and Y, and we introduce alternatives. In section 4 we summarize some properties of coefficients of determination, especially monotonicity in the number of covariates, and we discuss the use of coefficients of determination in model comparison. Section 5 briefly addresses estimation of a coefficient of determination, and in section 6 we exemplify our approach for several regression models applied to simulated and real data. Section 7 concludes with a discussion. Readers who are interested in a conceptual outline only are referred to sections 2.1-2.2, the discussion in section 7 and perhaps section 3. Examples with technical details are collected in sections 2.3 and 6. Sections 4 and 5 are both sketchy, yet address rather subtle results, and may be skipped by casual readers.

2 Mutual information in regression

2.1 Motivation and definitions

Assume a random vector Y is observed with its distribution given by the density p(y|θ). In a regression problem we try to explain Y by covariates X, modelling θ as a function of X and parameters γ, part of which are regression coefficients. Typically γ = (ω, β), where β denotes the regression coefficients and ω comprises an intercept α and possibly a scale factor φ. Thus, more precisely, θ = θ(x, γ). In order to determine the strength of the relation between Y and X we propose to use a measure of dependence between Y and X, keeping γ fixed. We may either assume that observations (y, x) are realizations of (Y, X) following a joint distribution of (Y, X), or that we observe Y conditional on x, with the values x1, ..., xq of X specified according to an experimental design

d = ((x1, ..., xq), (n(x1), ..., n(xq))),

which can be interpreted as a discrete probability measure such that P(X = xi) = p(xi) = n(xi)/n, where n = Σ_i n(xi). We set p_θ(x, y) = p(y|θ(x))p(x) and p_θ(y) = ∫ p_θ(x, y) dx.
In the sequel we use the following notation for (sequential) expectations: random variables w.r.t. which expectations are taken are written in capital letters, and for the first (and possibly only) expectation no subscript is appended. For sequential expectations a subscript indicates which variable the expectation refers to. For example, in cov_Y(E(ζ(X)|Y), t(Y)) the first expectation refers to X conditional on Y, and the covariance is a covariance of functions of Y.

A well-known measure of dependence between two random vectors is the distance between their joint density and the product of their marginal densities, the latter representing independence. Using the Kullback-Leibler (KL-) discrepancy or directed divergence, which for two densities p(z), q(z) is given by

I_KL(p(z), q(z)) = ∫ p(z) log[p(z)/q(z)] dz,

one obtains

I(θ(X), Y) := I_KL(p_θ(x, y), p(x)p_θ(y)). (1)

This directed divergence is also called the mutual information between Y and X. The KL-discrepancy between two densities has been extensively studied by Kullback ([22]: 1968) as a key quantity in establishing an information-theoretic approach to statistics. I(θ(X), Y) is symmetric in X and Y, but it is not a symmetric distance between the densities p_θ(x, y) and p(x)p_θ(y). Using the symmetric discrepancy or divergence

J_KL(p(z), q(z)) = I_KL(p(z), q(z)) + I_KL(q(z), p(z)), (2)

one obtains

J(θ(X), Y) := J_KL(p_θ(x, y), p(x)p_θ(y)) = ∫∫ (p_θ(x, y) − p(x)p_θ(y)) log[p_θ(x, y)/(p(x)p_θ(y))] dx dy. (3)

Note that J(θ(X), Y) = J(X, Y). We use the more complex notation, though, in order to emphasize the model in use, and representations of J(θ(X), Y) may depend on the parameterization. For our purposes the representation

J(θ(X), Y) = E ∫ (p(y|θ(X)) − p_θ(y)) log[p(y|θ(X))/p_θ(y)] dy = E J_KL(p(y|θ(X)), p_θ(y)) (4)

is intuitively most appealing. J(θ(X), Y) describes the range of densities for Y induced by regression models θ(x) w.r.t. the marginal reference density p_θ(y), which does not depend on x. J(θ(X), Y) measures how strongly the densities induced by θ(X) deviate from the marginal density. Thus it captures the discriminatory power of x within the regression model. J(θ(X), Y) takes values in R+. A value of 0 indicates independence of Y and X, while a high value indicates variability of p(y|θ(x)) as a function of x. We suggest referring to a scaled version of J(θ(X), Y) as a coefficient of determination.

Definition 1: Define

R_J² = J(θ(X), Y)/(1 + J(θ(X), Y)) ∈ [0, 1] (5)

to be the coefficient of determination of Y by X through θ based on the divergence. ⊳

One can immediately read off the definition that if the conditional density of Y given x actually does not depend on x, e.g. θ(x) ≡ c0, then J(θ(X), Y) equals zero and R_J² = 0.
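For intuition, the following minimal numerical sketch (in Python) evaluates J(θ(X), Y) and R_J² via (4) for a toy conditional-sampling model with three design points; the parameter values anticipate the example of section 6.1, and the grid-based integration is an expedient of the sketch, not part of the paper's method.

    import numpy as np
    from scipy.stats import norm

    xs = np.array([0.2, 0.5, 0.9])            # design points
    px = np.full(3, 1.0 / 3.0)                # P(X = x_i), uniform design
    mu = 1.0 + 1.2 * (xs - xs.mean())         # mu(x_i) = alpha + beta (x_i - xbar)
    sigma = 0.4                               # sigma^2 = 0.16

    y = np.linspace(-2.0, 4.0, 6001)          # grid covering the mixture support
    dy = y[1] - y[0]
    cond = np.array([norm.pdf(y, m, sigma) for m in mu])   # p(y | theta(x_i))
    marg = (px[:, None] * cond).sum(axis=0)                # p_theta(y)

    # (4): J = E_X int (p(y|theta(X)) - p_theta(y)) log[p(y|theta(X))/p_theta(y)] dy
    J = sum(p * ((c - marg) * np.log(c / marg)).sum() * dy
            for p, c in zip(px, cond))
    print("J =", round(J, 4), " R_J^2 =", round(J / (1 + J), 4))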
2.2 Representations of the divergence

2.2.1 The integrated log-odds ratio function

Osius ([29]: 2000) recently demonstrated that under very mild regularity conditions the joint distribution of two random elements is characterized by their marginal distributions and the odds ratio function. Hence the association between the two random elements is captured in the odds ratio function, and a measure of dependence might be regarded as a functional of a representative of the odds ratio function specified by reference values. In our context, focusing on random vectors Y and X, a representative of the log-odds ratio function with reference values x0, y0 is given by

Ψ0(θ(x), y) = log[p(y|θ(x))/p(y0|θ(x))] − log[p(y|θ(x0))/p(y0|θ(x0))]. (6)

Then the following result holds ([24]: van der Linde, 2004, appendix).

Result 1: J(θ(X), Y) = E_XY Ψ0(θ(X), Y) − E_X E_Y Ψ0(θ(X), Y) for all reference values x0, y0. (7)

Silvey's ([34]: 1964) measure of association is derived from a similar intuition, but it does not operate on the log-scale.

2.2.2 Exponential families

Prevailing in regression problems is the assumption that, conditional on x respectively on θ(x), the density of Y belongs to a k-parameter exponential family. We write

p(y|ζ(x)) = exp{ζ(x)ᵀt(y) − M(ζ(x))},

where ζ(x) ∈ R^k is the canonical or natural parameter, t(y) is a sufficient statistic for ζ(x) and M is a normalizing function. (We assume that functions of y only are incorporated in the dominating measure.) The natural parameter space is defined by the condition of integrability of exp{ζ(x)ᵀt(Y)} (given x). It is well known that the derivatives of M w.r.t. ζ(x) provide first and second order moments of t(Y):

∇M(ζ(x)) = (∂M(ζ(x))/∂ζ1(x), ..., ∂M(ζ(x))/∂ζk(x))ᵀ = E(t(Y)|ζ(x)) =: τ(x), (8)

H_M(ζ(x)) := ((∂²M(ζ(x))/∂ζi(x)∂ζj(x)))_{i,j=1...k} = cov(t(Y)|ζ(x)). (9)

We can parameterize an exponential family by the natural parameter ζ(x), the mean τ(x) or a predictor η(x). Depending on the parameterization, special representations of the measures of dependence are obtained. We present such results keeping the general notation θ(x) on the left hand side of an equation and inserting the parameter actually chosen on the right hand side. Thus in exponential families θ(x) ∈ {ζ(x), τ(x), η(x)}, and each parameter identifies the conditional density of Y given x. Since for exponential families the log-odds ratio function is bi-affine in ζ(x) and t(y), i.e.

Ψ0(ζ(x), y) = (ζ(x) − ζ(x0))ᵀ(t(y) − t(y0)), (10)

we obtain the following result.

Result 2:

J(θ(X), Y) = tr{cov_Y(E(ζ(X)|Y), t(Y))} (11)
           = tr{cov_X(ζ(X), E(t(Y)|ζ(X)))}. (12)

Thus J(θ(X), Y) measures the covariance of one (transformed) variable and the projection of the other one onto the measurable space that the former spans. Since J(θ(X), Y) does not depend on the reference values x0 and y0, it is often convenient to choose ζ(x0) = ζ̄ := Eζ(X), provided ζ̄ belongs to the natural parameter space. Then one has

J(θ(X), Y) = E_Y[(E(ζ(X)|Y) − ζ̄)ᵀt(Y)] (13)
           = E_X[(ζ(X) − ζ̄)ᵀE(t(Y)|ζ(X))]. (14)

Moreover, it can be shown that in (4) the marginal reference density can be replaced by p(y|ζ̄), that is, by a reference density which is also a member of the exponential family. It is mainly due to this property that we emphasize the divergence in comparison to the directed divergence. In particular, if the reference density belongs to the same family of parameterized densities {p(y|θ(x))}, expansions in the parameter can be used to obtain approximations of J(θ(X), Y) and R_J². Hence the following representation is a key result providing a link to other definitions of a coefficient of determination in the literature.

Result 3: J(θ(X), Y) = E J_KL(p(y|ζ(X)), p(y|ζ̄)). (15)

For example, in generalized linear regression, assuming that p(y|θ(x)) belongs to a one-parameter exponential family (k = 1), a linear predictor η(x) = α + a(x)ᵀβ is defined, and the regression model is specified by a one-to-one link function g relating τ(x) = E(t(Y)|ζ(x)) to the predictor by g(τ(x)) = η(x). In this case ζ(x) = ζ(x, γ) with γ = (α, β), β ∈ R^p. As a special case the canonical link g0 maps the expected value to the natural parameter, g0(τ(x)) = ζ(x) = α + a(x)ᵀβ. If the vectors a(x) are centered such that E a(X) = 0, then ζ̄ = α.
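A small sketch checking Result 2 against Result 3 for a Poisson family with hypothetical rates: with ζ(x) = log λ(x) and t(y) = y, both (12) and (15) reduce to the covariance of ζ(X) and λ(X), using the closed form J_KL(Pois(a), Pois(b)) = (a − b) log(a/b).

    import numpy as np

    lam = np.array([0.5, 1.0, 2.0, 4.0])   # hypothetical rates lambda(x_i)
    px = np.full(4, 0.25)                  # uniform design for X
    zeta = np.log(lam)                     # natural parameter; tau(x) = lambda(x)

    # (12): J = tr cov_X(zeta(X), E(t(Y)|zeta(X))) -- here a scalar covariance
    zbar, lbar = (px * zeta).sum(), (px * lam).sum()
    J_cov = (px * (zeta - zbar) * (lam - lbar)).sum()

    # Result 3/(15): E J_KL(p(y|zeta(X)), p(y|zeta_bar)) with zeta_bar = E zeta(X)
    lam0 = np.exp(zbar)
    J_ref = (px * (lam - lam0) * (zeta - zbar)).sum()

    print(J_cov, J_ref)                    # the two representations agree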
In this case the reference density represents a regression model in which the conditional distribution of Y does not depend on x.

2.3 Examples

2.3.1 Gaussian response

All definitions of a coefficient of determination originate from the case of Normal distributions with homogeneous covariance. Assume Y|θ(x) ∼ N(µ(x), Σ), Y ∈ R^r. If Σ is known, p(y|θ(x)) can be regarded as an exponential family with canonical parameter ζ(x) = Σ⁻¹µ(x) and t(y) = y. One obtains J(θ(X), Y) = tr{Σ⁻¹ cov µ(X)}. The resulting R_J² is related to the decomposition of covariance

cov t(Y) = E_X cov(t(Y)|ζ(X)) + cov_X E(t(Y)|ζ(X)), (16)

which we abbreviate as

T = W + B, (17)

alluding to total, within and between covariance. For the Normal distribution one has W = Σ and B = cov µ(X), yielding

J(θ(X), Y) = tr{W⁻¹B} (18)

and

R_J² = tr{W⁻¹B}/(1 + tr{W⁻¹B}). (19)

A special case of a conditional Normal response Y with homogeneous covariance is given if (X, Y) are jointly Gaussian with mean vector (µ_Xᵀ, µ_Yᵀ)ᵀ and covariance blocks Σ_X, Σ_XY = Σ_YXᵀ and Σ_Y. Then Y|x ∼ N(µ(x), Σ), where

µ(x) = E(Y|x) = µ_Y + Σ_YX Σ_X⁻¹(x − µ_X), Σ = Σ_Y − Σ_YX Σ_X⁻¹ Σ_XY.

The corresponding decomposition (17) is given by Σ_Y = Σ + Σ_YX Σ_X⁻¹ Σ_XY, where Σ_YX Σ_X⁻¹ Σ_XY = B = E((µ(X) − µ_Y)(µ(X) − µ_Y)ᵀ). In this case Y ∼ N(µ_Y, Σ_Y) and

J(θ(X), Y) = tr(Σ⁻¹B) = Σ_{i=1}^r ρ_i²/(1 − ρ_i²) (20)

(see [22]: Kullback, 1968, p.203), where ρ_i² denotes the i-th squared canonical correlation coefficient. If Y is univariate (r = 1), this reduces to

J(θ(X), Y) = ρ²/(1 − ρ²), (21)

where ρ² is the squared multiple correlation coefficient of Y with X. Thus

R_J² = ρ², (22)

with ρ² further reducing to the squared correlation corr(X, Y)² if X is univariate as well. Hence R_J² does generalize the conventional quantities of interest in the case where X and Y are jointly Gaussian. Actually (21) motivates the transformation (5) of J to R_J² introduced in Definition 1.
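As a numerical cross-check of (20), the following sketch (an illustration with a randomly generated joint covariance, not any particular data) compares tr{W⁻¹B} with the sum of transformed squared canonical correlations.

    import numpy as np

    rng = np.random.default_rng(0)
    p, r = 3, 2                                  # dim X, dim Y
    A = rng.standard_normal((p + r, p + r))
    S = A @ A.T                                  # joint covariance of (X, Y)
    SX, SY = S[:p, :p], S[p:, p:]
    SYX = S[p:, :p]

    B = SYX @ np.linalg.solve(SX, SYX.T)         # between covariance
    W = SY - B                                   # within covariance Sigma
    lhs = np.trace(np.linalg.solve(W, B))        # tr{W^{-1} B}

    rho2 = np.linalg.eigvals(np.linalg.solve(SY, B)).real  # squared canonical corr.
    rhs = (rho2 / (1 - rho2)).sum()
    print(lhs, rhs)                              # both sides coincide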
2.3.2 Multinomial response

Assume Y takes values in {0, ..., k} with probabilities π_i. Y may be represented by a k-dimensional vector of binary (dummy) variables Y_i with π_i = P(Y_i = 1); thus identify Y = (Y1, ..., Yk)ᵀ. Modelling π_i as π_i(x), depending on values of covariates X, and summarizing the results of n(x) trials in Y = (Y1 ... Yk), Y conditionally follows a multinomial distribution, Y|π(x) ∼ M(n(x), π(x)), where π(x) = (π1(x), ..., πk(x))ᵀ. The multinomial distributions form a k-parameter exponential family with ζ(x) = (ζ1(x), ..., ζk(x)),

ζ_i(x) = log[π_i(x)/π_0(x)],

where π_0(x) = 1 − Σ_{j=1}^k π_j(x). Furthermore, t(y) = y, E(Y_i|ζ(x)) = n(x)π_i(x) = τ_i(x), var(Y_i|ζ(x)) = n(x)π_i(x)(1 − π_i(x)) and cov(Y_i, Y_j|ζ(x)) = −n(x)π_i(x)π_j(x) for i ≠ j. From (12), with log π(x) = (log π1(x), ..., log πk(x))ᵀ and 1_k denoting the k-dimensional vector of ones,

J(θ(X), Y) = tr{cov_X(ζ(X), E(t(Y)|ζ(X)))} = tr{cov(log π(X) − log π_0(X)1_k, n(X)π(X))}. (23)

For binomial distributions with all n(x) = m, i.e. Y|π(x) ∼ B(m, π(x)), the expression simplifies to

J(θ(X), Y) = m cov(logit π(X), π(X)), (24)

where logit π(x) = log[π(x)/(1 − π(x))].

Model 1 (linear logistic regression, canonical link): If ζ(x) = η(x) = α + βᵀa(x), where now α ∈ R^k, β = (β1 ... βk) ∈ R^{p×k} and E(a(X)) = 0 as before, we obtain from (14)

J(θ(X), Y) = E((ζ(X) − ζ̄)ᵀτ(X)) = E(a(X)ᵀβ τ(X))

with

τ_i(x) = n(x) exp{α_i + β_iᵀa(x)} / (1 + Σ_{j=1}^k exp{α_j + β_jᵀa(x)}).

For univariate Y and X with n(x) = 1, a(x) = x, this reduces to

J(θ(X), Y) = cov(α + βX, π(X)) = β cov(X, π(X)). (25)

For small σ_X² = var X one may further use the approximation π(x) ≈ π(x0) + π′(x0)(x − x0) at x0 = 0, which yields J(θ(X), Y) ≈ π(x0)(1 − π(x0))β²σ_X², and therefore

R_J² ≈ β²σ_X² / ([π(x0)(1 − π(x0))]⁻¹ + β²σ_X²). (26)

For σ_X² → 0 one obtains R_J² → 0 for all β. On the other hand, if σ_X² is fixed, β → ∞ yields R_J² → 1 and β → 0 yields R_J² → 0. Also for large variances σ_X², R_J² may take the extreme values 0 and 1. Thus for the logistic model R_J² naturally depends only on the coefficients of the regression function and the variability of X. In contrast, for Normally distributed responses (cp. (19))

R_J² = β²σ_X²/(σ² + β²σ_X²) = (β²σ_X²/σ²)/(1 + β²σ_X²/σ²), (27)

where σ² = var(Y|x). Thus R_J² depends on β, the variability of X and σ². R_J² approaches 1 if β → ∞ or if the ratio σ_X²/σ² increases. For fixed σ_X² the coefficient of determination depends on β²/σ².

Model 2 (linear logistic regression, probit link): For notational convenience consider the Binomial case (k = 1) with Y|π(x) ∼ B(n(x), π(x)) and assume π(x) = Φ(η(x)), where Φ denotes the cumulative distribution function of a standard univariate Gaussian distribution. Then ζ(x) = logit π(x) = logit(Φ(η(x))), and in general ζ̄ = Eζ(X) differs from logit(Φ(Eη(X))) = logit(Φ(α)). However, there exists a predictor η0 such that J_KL(p(y|ζ(x)), p(y|ζ̄)) equals J_KL(p(y|ζ(x)), p(y|logit(Φ(η0)))). More complex models induced by sophisticated link functions are specified by Fahrmeir and Tutz ([12]: 1994, ch.3).

2.3.3 Binary response Y with Gaussian covariate X

A simple distributional model for (Y, X) illustrates how J(θ(X), Y) and R_J² capture the association between a dichotomous variable Y ∈ {0, 1} and a continuous variable X. Let X|Y = 1 ∼ N(µ1, σ²), X|Y = 0 ∼ N(µ0, σ²), and fix the marginal distribution by p(1) = P(Y = 1), p(0) = 1 − p(1), with p(1) ∈ (0, 1). For simplicity E(X) = 0 is assumed, which is equivalent to µ0 = −µ1 p(1)/p(0). The regression of X on Y has the form E(X|y) = β_{0y} + β_y y with β_{0y} = −µ1 p(1)/p(0), β_y = µ1/p(0). Treating the Normal distributions as a one-parameter family with fixed σ², the reference density p(x|ζ̄) is N(0, σ²). The logit regression model which reversely regresses Y on X is known to be linear: for logit π(x) = β_{0x} + β_x x one obtains β_{0x} = (µ0² − µ1²)/(2σ²) + log(p(1)/p(0)) and β_x = µ1/(p(0)σ²). Application of (11) shows that

J(θ(X), Y) = (p(1)/p(0)) · µ1²/σ², R_J² = p(1)µ1²/(p(0)σ² + p(1)µ1²).

The extreme case R_J² = 0 (J(θ(X), Y) = 0) is obtained for µ1² = 0, meaning independence of X and Y. The case R_J² = 1 (J(θ(X), Y) → ∞) is obtained for σ² → 0 or µ1 → ∞. If µ1 → ∞, the regression parameters β_x as well as β_y tend to infinity, meaning that the conditional distributions p(y|x) as well as p(x|y) are maximally separated. For σ² → 0 the regression parameter β_x of the logit model tends to infinity, whereas the separation of the distributions p(x|1) and p(x|0) does not show up in the regression parameters. J(θ(X), Y) and R_J² are symmetric measures of dependence between X and Y. In the light of regression models they capture how strongly a conditional distribution p(y|x) varies with x. Here, in terms of β_x the coefficient R_J² is given by

R_J² = p(1)p(0)β_x²/(1/σ² + p(1)p(0)β_x²).

Reversely, when regressing X on Y, the distance between p(x|1) and p(x|0) is reflected, as is seen from

R_J² = p(1)p(0)β_y²/(σ² + p(1)p(0)β_y²).

R_J² may also be interpreted as the proportion of explained variance of X: β_y = µ1/p(0) immediately yields

R_J² = (µ1²p(1)/p(0)) / (σ² + µ1²p(1)/p(0)),

and var X = σ² + µ1²p(1)/p(0). While this interpretation holds for the regression of X on Y, it does not apply to the decomposition of var Y when regressing Y on X. Yet the divergence quantifies the explanatory value of one of the variables for the other because of the symmetry in X and Y.
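The three representations of R_J² in this example can be checked numerically; the following sketch uses hypothetical values of p(1), µ1 and σ².

    import numpy as np

    p1, mu1, s2 = 0.3, 1.0, 0.5                 # hypothetical p(1), mu_1, sigma^2
    p0 = 1 - p1
    J = (p1 / p0) * mu1**2 / s2
    R2 = J / (1 + J)

    # the same value from the logit regression parameter beta_x = mu1/(p0 sigma^2)
    bx = mu1 / (p0 * s2)
    R2_bx = p1 * p0 * bx**2 / (1 / s2 + p1 * p0 * bx**2)

    # and from the regression of X on Y, beta_y = mu1/p0
    by = mu1 / p0
    R2_by = p1 * p0 * by**2 / (s2 + p1 * p0 * by**2)

    print(R2, R2_bx, R2_by)                     # all three coincide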
3 Measures of determination

3.1 I(θ(X), Y) as alternative to J(θ(X), Y)

Similar concepts might be suggested measuring the distance of p(y|θ(x)) to a reference density differently. Especially the mutual information I(θ(X), Y) (given in (1)), as an average of the directed divergences

I_KL(p(y|θ(x)), p_θ(y)) = E(log[p(Y|θ(x))/p_θ(Y)] | θ(x)),

may be an appealing alternative to J(θ(X), Y) ([36]: Soofi et al., 2000; [19]: Joe, 1989; [20]: Kent, 1983). Analogously to the decomposition of variance, I(θ(X), Y) describes the decomposition of entropy:

I(θ(X), Y) = H(Y) − H(Y|θ(X)), (28)

where H(Y) = −E(log p_θ(Y)) and H(Y|θ(X)) = E_X[−E{log p(Y|θ(X))|θ(X)}] (see e.g. [6]: Cover and Thomas, 1991, p.19). In case of a bivariate Gaussian distribution of (X, Y) with correlation coefficient ρ,

I(θ(X), Y) = −(1/2) log(1 − ρ²) (29)

([22]: Kullback, 1968, p.203). Then R_I² = ρ² may be derived from the directed divergence by

R_I² = 1 − exp(−2I(θ(X), Y)). (30)

The relation also holds for the multiple correlation coefficient if X is multidimensional. The transformation (30) has in fact been suggested in terms of estimates to define a general coefficient of determination ([20]: Kent, 1983; [26]: Magee, 1990; [28]: Nagelkerke, 1991), and R_I² can be interpreted as a scaling transformation of I(θ(X), Y). See also Theil ([40]: 1987). The main advantage of the symmetrized mutual information J(θ(X), Y) over the mutual information I(θ(X), Y) appears to be result 1, yielding result 3, the representation within an exponential family. E I_KL(p(y|ζ(X)), p(y|ζ̄)) and I(θ(X), Y) usually differ.
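A small sketch of (28) and (30), reusing the toy three-point design model of the earlier snippet (all parameter values hypothetical): the mutual information is computed as an entropy difference on a grid, and R_I² = 1 − exp(−2I) may be compared with the R_J² obtained before.

    import numpy as np
    from scipy.stats import norm

    xs = np.array([0.2, 0.5, 0.9]); px = np.full(3, 1 / 3)
    mu = 1.0 + 1.2 * (xs - xs.mean()); sigma = 0.4

    y = np.linspace(-2.0, 4.0, 6001); dy = y[1] - y[0]
    cond = np.array([norm.pdf(y, m, sigma) for m in mu])
    marg = (px[:, None] * cond).sum(axis=0)

    H_Y = -(marg * np.log(marg)).sum() * dy                 # differential entropy of Y
    H_Y_given = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # Gaussian conditional entropy
    I = H_Y - H_Y_given                                     # (28)
    print("R_I^2 =", round(1 - np.exp(-2 * I), 4))          # (30)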
3.2 Reference densities

The various reference densities we introduced partly explain the variety of suggestions generalizing the coefficient of determination. J(θ(X), Y) is defined for a joint density of X and Y, but only a partial feature is explored in a regression model. In particular, the (multivariate) distribution of X is often left unspecified. Hence (the type of) the marginal density p_θ(y) is not determined unless Y is categorical. Also, even if the distribution of X is known, the marginal densities p_θ(y) = E p(y|θ(X)) differ across models. In order to allow for a comparison across models with the same type of sampling distribution, a reference density p(y|θ0) is therefore used, where p(y|θ0) is assumed to belong to the same family of densities as p(y|θ(x)) and θ0 is a specific parameter. θ0 = θ(x, γ0) with γ0 = (ω0, β0) and β0 = 0 corresponds to a regression model with constant regression function.

Definition 2: Define

R²_{J,θ0} = E J_KL(p(y|θ(X)), p(y|θ0)) / (1 + E J_KL(p(y|θ(X)), p(y|θ0))) (31)

to be the coefficient of determination of Y by X through θ based on the symmetric discrepancy to the density specified by a specific parameter θ0. ⊳

Similarly, the KL-discrepancy I_KL(p(y|θ(X)), p(y|θ0)) might be used to define R²_{I,θ0} analogously to (30). The average discrepancy might intuitively have been aimed at by those authors who suggested to empirically define

R²_LR = 1 − exp(−(2/n)LR), (32)

where LR denotes the log-likelihood ratio test statistic comparing the model of interest to a 'null model', understood to be the one with regression coefficients set to zero (see [28]: Nagelkerke, 1991). A second order Taylor expansion of log p(y|ζ0) in ζ(X) yields ([22]: Kullback, 1968, pp.26)

E J_KL(p(y|ζ(X)), p(y|ζ0)) ≈ 2 E I_KL(p(y|ζ(X)), p(y|ζ0)), (33)

and to this extent the coefficients of determination R²_{J,θ0} and R²_{I,θ0} work similarly. Based on I(θ(X), Y) = H(Y) − H(Y|θ(X)), an alternative to using a reference density other than p_θ(y) is to consider the difference of entropies H(Y|θ0) − H(Y|θ(X)). Model comparison then refers to H(Y|θ(X)) and might be extended to models which are not related as members of a common family of sampling distributions.

3.3 Approximations in exponential families

A second order Taylor expansion of log p(y|ζ(x)) in ζ̄ links the symmetrized mutual information J(θ(X), Y) = E J_KL(p(y|ζ(X)), p(y|ζ̄)) in exponential families to the matrix H_M(ζ̄). Kullback ([22]: 1968, pp.26f) shows the following result.

Result 4: J(θ(X), Y) ≈ tr{H_M(ζ̄) cov(ζ(X))}. (34)

A similar result is obtained using a reference density p(y|ζ0) and expanding in ζ0. This result relates our definition of R_J² to the one based on Wald's test for testing H0: β = 0 against H1: β ≠ 0, or H0: ζ(x) = ζ0 against H1: ζ(x) ≠ ζ0 for all x. For a fixed x the component w(x) of a test statistic with ζ̂ estimating ζ is

w(x) = (ζ̂(x) − ζ0)ᵀ H_M(ζ0)(ζ̂(x) − ζ0).

Averaging over x then yields

w = E[tr{H_M(ζ0)(ζ̂(X) − ζ0)(ζ̂(X) − ζ0)ᵀ}].

An empirical coefficient of determination of the type

R_w² = w/(1 + w) (35)

was suggested and investigated for several regression models, listed by Magee ([26]: 1990). A link to the interpretation of the coefficient of determination as the proportion of the total variance explained by regression may be based on the approximation of J(θ(X), Y) using the mean parameterization.

Result 5: J(θ(X), Y) ≈ tr{(cov(t(Y)|τ̄))⁻¹ cov_X E(t(Y)|τ(X))}, where τ̄ := Eτ(X). (36)

The right hand side of (36) is estimated by the test statistic of the score test, which Magee ([26]: 1990) referred to in order to generalize the (estimated) coefficient of determination.
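The expansion (33) and Result 4 can be checked numerically; the following sketch does so for a Poisson family with hypothetical rates close to the reference, where H_M(ζ̄) = exp(ζ̄) and both divergences have closed forms.

    import numpy as np

    lam = np.array([0.9, 1.0, 1.1, 1.2])   # hypothetical rates near the reference
    px = np.full(4, 0.25)
    zeta = np.log(lam)
    zbar = (px * zeta).sum(); lam0 = np.exp(zbar)

    J = (px * (lam - lam0) * (zeta - zbar)).sum()          # E J_KL to p(y|zeta_bar)
    I = (px * (lam * (zeta - zbar) - lam + lam0)).sum()    # E I_KL to p(y|zeta_bar)
    approx = lam0 * (px * (zeta - zbar)**2).sum()          # Result 4: H_M(zbar) = e^zbar

    print(J, 2 * I, approx)                                # all three nearly coincide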
3.4 Decomposition of covariance

The decomposition of covariance (17) yields (18), J(θ(X), Y) = tr{W⁻¹B}, in the special case of Gaussian distributions of Y|θ(x) with homogeneous variance, and (18) also holds approximately according to result 5 if W ≈ cov(t(Y)|τ̄). Hence

J(θ(X), Y) ≈ tr[(T − B)⁻¹ T T⁻¹ B] = tr[(I − T⁻¹B)⁻¹ T⁻¹B]. (37)

This equation may be read as an analogue to the defining transformation (5), J(θ(X), Y) = R_J²/(1 − R_J²). Thus R_J² is associated with the matrix T⁻¹B, though not directly defined by a functional of it like tr(T⁻¹B) or det(T⁻¹B) as previously suggested in the literature (see [27]: Mardia et al., 1995, ch.6.5.4, pp.170f). For example, Hooper ([17]: 1959) refers to the decomposition I = T⁻¹W + T⁻¹B to obtain

R_H² = (1/k) tr(T⁻¹B). (38)

Glahn ([13]: 1969) applies the determinant to the decomposition to suggest

R_G² = |B|/|T|. (39)

This measure coincides with γ1 in Cramer and Nicewander's review ([7]: 1979) of multivariate measures of association under the assumption of a joint multivariate Gaussian distribution.

Based on the decomposition of covariance, various measures have been derived which focus on ratios of (re-scaled) between variances to (re-scaled) total variances. For instance the re-scaled variance ratio (called γ3 by Cramer and Nicewander ([7]: 1979))

tr{W⁻¹B}/tr{W⁻¹T} = tr{W⁻¹B}/(k + tr{W⁻¹B}) = (1/k)J(θ(X), Y) / (1 + (1/k)J(θ(X), Y)) (40)

is built similarly to our R_J². For a univariate response it coincides with R_J².

An issue we finally address is the classification of measures of multivariate dependence into (symmetric) measures of association and (asymmetric) measures of redundancy. The measures discussed by Cramer and Nicewander ([7]: 1979) under the assumption of joint Normality are all, like ours, measures of dependence between X and Y. They can be rewritten in information theoretic terms as transformations of I(X, Y), J(X, Y), or based on differences of the entropies H(Y), H(Y|X), H(E(Y|X)). They are all functions of the canonical correlation coefficients, and hence X and Y may be interchanged. In contrast, the ratio of between to total variance of the standardized response vector Y,

R_I = tr{C_YX C_X⁻¹ C_XY}/tr{C_Y}, (41)

where C denotes a correlation matrix, is known as the redundancy index in the literature ([39]: Stewart and Love, 1968; [14]: Gleason, 1976). The redundancy index is not symmetric in X and Y. In case of joint Gaussianity of X and Y it equals the average squared multiple correlation coefficient of components of Y given X, thus quantifying linear predictability of Y (componentwise, not truly multivariate) given X. It has been felt by many authors that determination in regression should be measured in such an asymmetric way. Relating to that discussion, we justify our study of a symmetric measure of dependence arguing that essentially a measure of redundancy is a scaled measure of dependence. In information theory the redundancy of a variable Z is defined as

red(Z) = 1 − H(Z)/H_max, (42)

where H_max denotes a context dependent maximum achievable entropy. Lower entropy of Z indicates structure, and the more structure is present, the higher the redundancy of Z (cp. e.g. the discussion in [18]: Jessop, 1995, pp.51f). In regression we are interested in reducing the uncertainty about Y by X, measured by H(Y|θ(X)), that is, in maximizing the redundancy of Y|X. This amounts to being interested in

red(Y|θ(X)) = 1 − H(Y|θ(X))/H(Y) = I(θ(X), Y)/H(Y), (43)

the mutual information between X and Y scaled w.r.t. the entropy of Y. For a bivariate Gaussian distribution of X and Y we obtain, with e = exp(1),

red(Y|θ(X)) = −log(1 − ρ²)/log(2πeσ_Y²), (44)

which tends to 0 if ρ² → 0 and tends to ∞ if ρ² → 1. More generally, for a joint Normal distribution

I(θ(X), Y) = (1/2) log(|T|/|W|).

If Y is standardized, yielding T_s = C_Y and W_s, say,

red(Y|θ(X)) = (log|C_Y| − log|W_s|) / log((2πe)^k |C_Y|) (45)

(where k = dim Y). Thus, although red(Y|θ(X)) and R_I do not exactly coincide, the motivating idea is the same. red(Y|θ(X)) is a truly multivariate (w.r.t. Y) index and generalizes to distributions other than the Gaussian, particularly discrete distributions. In conclusion, we suggest considering measures of association and dependence as primary, and measures of redundancy as secondary, derived measures; in this paper we discuss the decomposition of variance as a special Gaussian representation of association. Furthermore, although measures of dependence like J(θ(X), Y) are symmetric, the specification of the conditional density p(y|θ(x)) is based on a regression model expressing ideas about a directed influence of X on Y.
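A short sketch of the redundancy index (41), using a randomly generated joint correlation matrix as a stand-in for data: under joint Gaussianity R_I equals the average squared multiple correlation of the components of the standardized Y given X.

    import numpy as np

    rng = np.random.default_rng(1)
    p, r = 3, 2
    A = rng.standard_normal((p + r, p + r))
    S = A @ A.T
    d = np.sqrt(np.diag(S))
    C = S / np.outer(d, d)                       # joint correlation matrix of (X, Y)
    CX, CY = C[:p, :p], C[p:, p:]
    CYX = C[p:, :p]

    M = CYX @ np.linalg.solve(CX, CYX.T)
    RI = np.trace(M) / np.trace(CY)              # redundancy index (41)
    print(RI, np.diag(M).mean())                 # average squared multiple correlation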
3.5 Local coefficients of determination

Sometimes it is felt that the strength of the relation between Y and X varies with x, and a local rather than a global coefficient of determination is wanted. For example, Doksum et al. ([9]: 1994) resume an example given by Härdle ([15]: 1990) in which Y = expenditure for food is analyzed in dependence on X = net income. The higher the income, the weaker the relation, and this structure is captured by Doksum et al. ([9]: 1994) in a 'correlation curve'. We discuss several concepts of local coefficients of determination.

3.5.1 Correlation curves

The definition of a (squared) correlation curve aims at a local 'decomposition of variance'. For univariate X and t(Y) with a conditional distribution in an exponential family, Doksum et al. ([4]: 1993; [9]: 1994) suggested

ρ²(x) = τ′(x)²σ_X² / (var(t(Y)|θ(x)) + τ′(x)²σ_X²), (46)

where σ_X² = var X and τ′(x) denotes the derivative of τ(x). Doksum et al. ([9]: 1994) actually motivate their definition based on a linear model by

R_J² = var(α + βX)/var(α + βX + ε) = β²σ_X²/(var ε + β²σ_X²),

where β is replaced by τ′(x) and var ε by the heterogeneous variances σ²(x) = var(t(Y)|θ(x)) depending on x. If τ′(x) and σ²(x) are constant, the squared correlation coefficient ρ² is re-obtained in bivariate Gaussian regression. For multivariate X ∈ R^s, (46) is generalized to

ρ²(x) = (∇τ(x))ᵀ Σ_X ∇τ(x) / (σ²(x) + (∇τ(x))ᵀ Σ_X ∇τ(x)). (47)
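For illustration, a minimal sketch evaluating the correlation curve (46) for a heteroscedastic Gaussian model with hypothetical mean and variance functions (τ(x) = sin x, σ²(x) = 0.1(1 + 0.5x²), σ_X² = 1): the local association decays where the regression function flattens or the noise grows.

    import numpy as np

    def rho2(x, sx2=1.0):
        dtau = np.cos(x)                 # tau'(x) for tau(x) = sin(x)
        s2 = 0.1 * (1 + 0.5 * x**2)      # sigma^2(x) = var(t(Y)|theta(x))
        return dtau**2 * sx2 / (s2 + dtau**2 * sx2)

    for x in (0.0, 0.5, 1.0, 1.5):
        print(x, round(rho2(x), 3))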
3.5.2 Determination curves

A local measure of the strength of the relation between X and Y may be based on the rate of change. Following Blyth ([3]: 1994), consider local discrepancies

J(x, ∆x) = J_KL(p(y|θ(x)), p(y|θ(x + ∆x))) (48)

for continuous X. Based on a second order Taylor expansion, the limiting local discrepancy is

J0(x) = lim_{δ→0} J(x, δ1_s)/δ² (49)
      = Σ_{i,j} I_{i,j}(x) = 1_sᵀ I(x) 1_s, (50)

where I(x) is the s × s-dimensional Fisher information matrix defined by

I_{i,j}(x) = E[(∂/∂x_i) log p(Y|θ(x)) · (∂/∂x_j) log p(Y|θ(x)) | θ(x)].

If p(y|θ(x)) belongs to a k-parameter exponential family,

I(x) = ∇^(k)ζ(x) cov(t(Y)|ζ(x)) (∇^(k)ζ(x))ᵀ, (51)

where ∇^(k)ζ(x) is the s × k matrix ((∂ζ_κ/∂x_i))_{κ=1...k, i=1...s}. The Fisher information is well known to be an indicator of the sensitivity of (the conditional distribution of) Y to changes in the parameter θ(x); see the discussion by Rao ([30]: 1973, pp.331f). More explicitly, for example, if X is univariate (s = 1),

I(x) = E[(d log p(Y|θ(x))/dx)² | θ(x)],

and hence under regularity conditions

I(x) = var[d log p(Y|θ(x))/dx | θ(x)]. (52)

I(x) describes the rate of change (on the log scale) of the density induced by a change of the parameter (depending on x) and thus refers to the interpretation of association in terms of discrimination. For illustration consider again the example of logistic regression discussed in section 2.3.2 (model 1 and n(x) = 1). Then, with π(x) = τ(x),

τ(x) = exp(α + βx)/(1 + exp(α + βx)),
τ′(x) = β exp(α + βx)/(1 + exp(α + βx))²

and var(Y|ζ(x)) = τ(x)(1 − τ(x)),

J0(x) = τ′(x)²/(τ(x)(1 − τ(x))) = βτ′(x)

(cp. (26)). Hence J0(x) = I(x) converges to 0 for x tending to ±∞ and thus reflects the asymptotes of the sigmoid logistic curve τ(x). As Blyth ([3]: 1994) pointed out, in a one-parameter exponential family with univariate X the correlation curve appears to be a one-to-one transformation of J0(x). If p(y|θ(x)) is from a one-parameter exponential family (where X may again be multivariate), J0(x) takes the form of a signal-to-noise ratio

J0(x) = (∇τ(x))ᵀ∇τ(x) / var(t(Y)|θ(x)). (53)

If additionally s = 1, the transformation of J0(x),

R²_{σ_X²J0}(x) = σ_X²J0(x) / (σ_X²J0(x) + 1), (54)

yields the correlation curve ρ²(x). The scaling by σ_X² in (54) is chosen in order to recover ρ² for bivariate Gaussian (X, Y), where τ′(x) and σ²(x), and hence J0(x), are constant functions. In terms of I(x) the correlation curve ρ²(x) for s = k = 1 is also given by

ρ²(x) = σ_X²I(x) / (σ_X²I(x) + 1), (55)

which can be generalized to s ≥ 1, k ≥ 1.

Definition 3: Define

R_J²(x) = tr{I_θ(x)Σ_X} / (tr{I_θ(x)Σ_X} + 1) (56)

to be the value of the determination curve for Y in x through θ, where I_θ(x) denotes the Fisher information matrix derived from p(y|θ(x)). ⊳

According to (51), in a k-parameter exponential family

tr{I(x)Σ_X} = tr{[cov(t(Y)|τ(x))]⁻¹ (∇τ(x))ᵀ Σ_X ∇τ(x)} = tr{[W(x)]⁻¹ B(x)} (say).

Hence (47) is obtained as a special case where s ≥ 1 = k, and the family of N(τ(x), σ²(x))-distributions is treated as a one-parameter exponential family. The curves for a Gaussian response, where the density is assumed to belong to a two-parameter exponential family, will be discussed for a numerical example in section 6.3.

3.5.3 Certainty curves

If X is not continuous but e.g. nominal, concepts of local variation and differentiation do not apply. Yet the influence of X at x can be assessed as the reduction of uncertainty about Y, measured in terms of entropies. Set H(Y|θ(x)) = −E_{Y|θ(x)}(log p(Y|θ(x))).

Definition 4: Define

R_I²(x) = 1 − exp{−2[H(Y) − H(Y|θ(x))]} (57)

to be the coefficient of certainty about Y given x through θ. The resulting curve as a function of x is called the certainty curve. ⊳

Definition 4 may also be applied to continuous X with differential entropies if these exist, i.e. if the integrals are finite. For example, if (X, Y) are bivariate Gaussian, H(Y) = (1/2) log(2πeσ_Y²) and H(Y|θ(x)) = (1/2) log(2πeσ²_{Y|θ(x)}), where σ²_{Y|θ(x)} = σ² does not depend on x. Hence the reduction of uncertainty about Y is the same for all x, and the certainty curve is constant, taking the value of the squared correlation coefficient ρ². With the asymmetric scaling yielding the redundancy index one may similarly define a redundancy curve by

red(Y|θ(x)) = 1 − H(Y|θ(x))/H(Y), (58)

taking values in R+. Then red(Y|θ(X)) is the average redundancy. Locally, the difference between the notions of discriminatory power and explanatory power of X, related to dependence measures based on J and I respectively, becomes more distinct, while globally the idea of variability within the family of densities {p(y|θ(x))} covers both aspects.
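A sketch of the determination curve of Definition 3 for the logistic model of section 2.3.2 (hypothetical α, β, σ_X²), using J0(x) = I(x) = βτ′(x) derived above: the curve vanishes at the asymptotes of the sigmoid.

    import numpy as np
    from scipy.special import expit

    alpha, beta, sx2 = 0.0, 2.0, 1.0

    def R2J_curve(x):
        tau = expit(alpha + beta * x)
        dtau = beta * tau * (1 - tau)            # tau'(x)
        J0 = dtau**2 / (tau * (1 - tau))         # Fisher information I(x) = beta tau'(x)
        return sx2 * J0 / (sx2 * J0 + 1)         # (56) with s = k = 1

    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(x, round(R2J_curve(x), 3))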
4 Properties and use of coefficients of determination

The main idea we elaborate in this paper is the application of association measures under the assumption of a regression model. To this aim we focus on J(θ(X), Y) and consider I(θ(X), Y) to some extent. A list of postulates, of desirable properties of a measure of association, introduced by Renyi ([32]: 1959) and modified later on, e.g. by Bell ([1]: 1962), has been agreed on in the literature. These requirements comprise (i) generality of the definition, (ii) symmetry in X and Y, (iii) normalization and (iv) invariance under 1:1-transformations of X and Y. The crucial issue yielding refinements and modifications of association measures turned out to be the normalization, in two respects:

- Renyi ([32]: 1959) claimed that the measure should coincide with (the absolute value of) the correlation coefficient in case X and Y are bivariate Gaussian. Bell ([1]: 1962) weakened this postulate, suggesting that a measure of association should be a monotone function of the (squared) correlation coefficient in that case. In regression analysis Renyi's claim is still virulent, and hence we emphasize standardizing transformations yielding the squared correlation coefficient as a special case.

- However, an additional requirement, namely that a value of a measure of association equal to 1 should indicate functional dependence between X and Y, may conflict with the orientation towards correlation. Particularly for nominal random variables, therefore, different ways of scaling were proposed, for example by Joe ([19]: 1997).

Apart from the properties of association measures, additional features of coefficients of determination like additivity are of interest; see e.g. Soofi et al. ([36]: 2000). For correlation curves, properties were proven by Bjerve and Doksum ([4]: 1993). Here we only address a few issues.

4.1 Invariance to re-parameterizations

J_KL(p(y|θ(x)), p_θ(y)) is invariant to one-to-one transformations of θ, but result 3, providing a representation within an exponential family with reference density p(y|ζ̄), holds for the natural parameter only. Further, there is a lack of interchangeability of parameterization and expectation w.r.t. X, e.g. τ̄ ≠ τ(ζ̄), and hence different reference densities may not be equally useful.

4.2 Monotonicity in the number of covariates

Interpreting (directed) divergences as measures of variability of the densities p(y|θ(x)), one might expect this variability to increase if more informative covariates are included in the regression model. Indeed, within an encompassing model including all covariates, i.e. with θ being derived from the joint distribution of X and Y, I(θ(X), Y) is increasing with the number of covariates because of the chain rule for mutual information. Darbellay ([8]: 1998) elaborated for R_I² how the coefficient increases if new covariates are added. To our knowledge, similarly general results are not available for J(θ(X), Y). (See also the related discussion of the estimates R_w² by Magee ([26]: 1990).) Usually, however, (even nested) regression models are not assumed to be related by a joint distribution, and monotonicity due to additivity of information becomes important. Simon ([35]: 1973) investigated conditions for additivity of discrepancies in exponential families if models are nested. Assuming θ0 ≺ θ1 ≺ θ2, where ≺ indicates nesting, in exponential families

I_KL(p(y|θ0), p(y|θ2)) = I_KL(p(y|θ0), p(y|θ1)) + I_KL(p(y|θ1), p(y|θ2)) (59)

holds under the condition of orthogonality

(θ1 − θ2)ᵀ(τ(θ0) − τ(θ1)) = 0. (60)

For example, if Y|θ(x) ∼ N(µ(x), Σ), θ(x) = µ(x), for x = (x1, x2), θ0 = α, θ1(x) = θ0 + a1(x1)ᵀβ1, θ2(x) = θ1(x) + a2(x2)ᵀβ2, the condition (60) amounts to the requirement β2ᵀa2(x2)ᵀa1(x1)β1 = 0, which is met if a1(x1) is orthogonal to a2(x2). Soofi and Retzer ([37]: 2002, p.16) summarize results on additivity of information indices for exponential families.
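The additivity (59) under the orthogonality condition (60) can be verified in the simplest Gaussian case Y ∼ N(µ, I), where I_KL(N(a, I), N(b, I)) = |a − b|²/2; the nested means below are hypothetical, chosen so that µ1 − µ2 is orthogonal to µ0 − µ1.

    import numpy as np

    mu0 = np.array([0.0, 0.0])
    mu1 = np.array([1.0, 0.0])                   # mu0 - mu1 along the first axis
    mu2 = np.array([1.0, 2.0])                   # mu1 - mu2 along the second axis

    ikl = lambda a, b: 0.5 * np.sum((a - b)**2)  # I_KL for N(a, I) vs N(b, I)
    print(ikl(mu0, mu2), ikl(mu0, mu1) + ikl(mu1, mu2))   # 2.5 = 0.5 + 2.0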
4.3 Use of a coefficient of determination in model comparison

Coefficients of determination as measures of dependence between X and Y are used to quantify a feature of a regression model under consideration. In the comparison of models, typically two situations are distinguished. (1) An encompassing model is given by the joint density of Y and all covariates X. Submodels are derived within the encompassing model, for example p(y|θ^(s)(x^(s))) for subsets X^(s) of the covariates. Often one then looks for the minimally comprehensive model that is still close enough to the most comprehensive model (in terms of its 'modelling potential'). An assessment of the potential of a regression model is especially of interest in variable selection, where a parsimonious model is sought. Indeed, variable selection has been a main field of application of coefficients of determination. R_I² is appropriate in this case, and monotonicity in the number of covariates applies. (2) Models corresponding to p(y|θ(x)) are not assumed to be related within an encompassing model, but the conditional densities are assumed to belong to a specified family of distributions. In this case p_θ(y) varies with the regression model, and p(y|θ0) is chosen as a reference density instead. Hence the comparison of models is based on R²_{I,θ0} or R²_{J,θ0}, and monotonicity may hold for nested models. The discrepancies corresponding to R²_{I,θ0} or R²_{J,θ0} are also natural quantities for testing associated hypotheses H0: θ(x) = θ0 against H1: θ(x) ≠ θ0 about regression parameters. These tests aim at the significance of a difference in the explanatory power of models specified by θ(x) or θ0.

Because of the monotonicity properties, coefficients of determination are measures of the explanatory potential of a model rather than adequate criteria for model choice aiming at prediction of future observations, which typically compromises between data fit and model complexity. Model comparisons based on coefficients of determination apply to models related by specifications according to either (1) or (2). They are useful to find a parsimonious yet relatively powerful explanatory model within a set of related models. Monotonicity of a coefficient of determination supports strategies of forward selection or backward elimination in variable selection. Often an internal scaling of coefficients of determination w.r.t. the highest value attained is desired in model comparison. Although for instance R²_{J,θ0} is already a scaling transformation of E J_KL(p(y|θ(X)), p(y|θ0)) taking values between 0 and 1, across models R²_{J,θ0} = 1 may not be attainable within the set of models to be compared. In practice, therefore, estimates R̂_J² are often internally scaled w.r.t. e.g. the most comprehensive model ([11]: Draper and Smith, 1981, p.42; [28]: Nagelkerke, 1991) and used as relative measures for the assessment of models in an informal way.

5 Estimation of a coefficient of determination

We have already occasionally referred to estimates of coefficients of determination, but in this section we address the topic of estimation more systematically and explicitly. As coefficients of determination (and measures of association) are often defined descriptively, in fact estimates of quantities of interest are suggested, and frequently 'corrections' or 'modifications of the definition' of such coefficients are meant to improve properties of estimators. For example, Särndal ([33]: 1974) in his review discusses at length 'corrections' that yield unbiased estimators of association measures. We investigate estimates in special examples in section 6.
5.1 Maximum likelihood estimation

An immediate approach to obtaining estimates is to substitute parameters θ by their maximum likelihood (ML) estimates θ̂. For example, for I(θ(X), Y) = H(Y) − H(Y|θ(X)), ranking of submodels according to R_I² reduces to ranking of the quantities H(Y|θ(X)), estimated by H(Y|θ̂(X)). Within an encompassing model these estimates are monotone for submodels with an increasing number of covariates. A crucial result is obtained for the ML estimates of E I_KL(p(y|θ(X)), p(y|θ0)). Under conditions frequently met in generalized regression (cp. [35]: Simon, 1973), the (summed) discrepancies I_KL(p(y|θ̂(x_i)), p(y|θ̂0)) coincide with the log-likelihood ratio Σ_{i,j}[log p(y_ij|ζ̂(x_i)) − log p(y_ij|ζ̂0)]. (59) carries over to likelihood ratios (e.g. [31]: Rao and Toutenburg, 1995, p.48). Model comparison with estimated quantities then turns out to be based on the log-likelihood with estimated parameters. Thus ML estimation renders an empirical coefficient of determination which is simply a difference of measures of goodness of fit. This is the reason why estimated coefficients of determination are used for comparing models w.r.t. their explanatory power for a given data set but have been abandoned as criteria for model choice aiming at prediction.

In exponential families, R̂²_{J,θ0} can be derived from

J_KL(p(y|θ̂(x)), p(y|θ̂0)) = (ζ̂(x) − ζ̂0)ᵀ(τ(ζ̂(x)) − τ(ζ̂0)). (61)

Similarly, R²_{J,θ0} may be estimated by inserting ML estimates in the test statistic of the Wald or score test (see results 4 and 5). The ML estimates of the discrepancies and the log-likelihood ratios in exponential families are close but do not always coincide: equality requires additivity in nested models, which does not always hold. Intuitively, particularly if the conditional densities p(y|θ(x)) do not belong to an exponential family, the data are used twice in the log-likelihood ratio as an estimate of a KL-discrepancy: for estimating θ and for the evaluation of the integral.

5.2 Non-parametric estimation

Estimates can also be obtained if no parametric model or no function space is explicitly specified for θ(x). For example, τ̂(x) may be obtained from kernel estimation, yielding an estimate θ̂(x). Doksum and Samarov ([10]: 1995) and Lavergne and Vuong ([23]: 1998) investigated nonparametric estimates to obtain a coefficient based on a decomposition of variance.

5.3 Bayesian estimation

Within the Bayesian approach, part of the model specification is a prior distribution for γ = (ω, β). Instead of plugging in a point estimate θ̂ or γ̂, the parameter θ or γ may be integrated (averaged) w.r.t. a posterior distribution on the parameter space, and Bayesian estimates like the posterior mean or the posterior mode may be used. Then the approach may also share features due to the additivity of information in nested models. Also, given a (joint) distribution of (X, Y, Θ), in a Bayesian approach it may be more natural to focus on the dependence of Y and X with θ integrated out than to investigate the posterior distribution of a measure of dependence given θ, like E J_KL(p(y|θ(X)), p(y|θ0)) or −H(Y|θ(X)). For a Bayesian approach to hypothesis testing see ([2]: Bernardo and Rueda, 2002).
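As an illustration of the plug-in estimate based on (61), the following sketch uses synthetic Poisson counts at three design points (the rates and the seed are arbitrary choices of the sketch); the group means are the ML estimates of λ(x_i), and the overall mean estimates the null rate.

    import numpy as np

    rng = np.random.default_rng(2)
    lam_true = np.array([1.0, 2.0, 4.0])
    y = [rng.poisson(l, size=50) for l in lam_true]   # 50 replicates per x_i

    lam_hat = np.array([yi.mean() for yi in y])       # ML estimates within groups
    lam0_hat = np.concatenate(y).mean()               # ML estimate under the null model
    px = np.full(3, 1 / 3)

    # (61): J_KL = (zeta_hat - zeta0_hat)(tau_hat - tau0_hat) per design point
    J_hat = (px * np.log(lam_hat / lam0_hat) * (lam_hat - lam0_hat)).sum()
    print(J_hat, J_hat / (1 + J_hat))                 # estimate of R^2_{J,theta0}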
6 Case studies

6.1 Standard regression models with Gaussian response

We evaluate discrepancies to a reference density as quantities of interest for a Gaussian response Y, including regression models with heterogeneous variances. Various estimates of the quantities of interest are discussed that were previously introduced as modifications of the definition of a coefficient of determination. Finally we investigate model comparisons for different regression functions including nested models.

Consider an outcome Y which is observed at only q = 3 values x_i of a covariate X:

Y(x_i) ∼ N(µ(x_i), σ²), i = 1, 2, 3.

For illustration we simulated data according to designs

d = ((x1, x2, x3), (n1, n2, n − n1 − n2)),

such that p_i = n_i/n empirically determines a distribution of X. Thus we observe Y_j(x_i) for j = 1, ..., n_i, i = 1, 2, 3. We chose x1 = 0.2, x2 = 0.5, x3 = 0.9. First we discuss the simple linear regression model

µ1(x_i) = α1 + β1(x_i − x̄), (62)

where x̄ = Σ_i p_i x_i, for α1 = 1, β1 = 1.2 and σ² = 0.16.

6.1.1 Models with non-random coefficients and homogeneous variances

Quantities of interest. Considering a univariate response variable Y, we write σ_W² (σ_B²) for the within (between) variance instead of the covariance matrices W (B) introduced in section 2.3. Assuming homogeneous variances we have σ_W² = σ². Having specified a set-up with conditional sampling, the marginal density p_θ(y) is not estimable and cannot be used as reference density. Instead, we choose a reference density p(y|θ0) which is Normal with mean µ_Y and variance σ_Y². Thus it is not claimed that p_θ(y) is Gaussian, but that its first two moments might be estimated and used as parameters of a Gaussian distribution.

(i) Using the reference density p(y|ζ0), where ζ0ᵀ = (µ_Y/σ_Y², −σ_Y⁻²/2), we obtain

E J_KL(p(y|ζ(X)), p(y|ζ0)) = σ_B²/σ_W², (63)

and thus

R²_{J,ζ0} = σ_B²/σ_Y² = 1 − σ_W²/σ_Y². (64)

(ii) E I_KL(p(y|ζ(X)), p(y|ζ0)) = (1/2) log(σ_Y²/σ_W²) = −(1/2) log(1 − σ_B²/σ_Y²), and

R²_{I,ζ0} = σ_B²/σ_Y² = R²_{J,ζ0}.

We observe that, given σ_Y², all discrepancies are decreasing functions of σ_W², and so are the coefficients of determination as monotone transformations.

There has been some discussion in the literature about the fact that R²_{J,ζ0} < 1 because σ_W² > 0. In order to reduce this effect, some authors (e.g. Healy ([16]: 1984)) suggested referring in model comparison to the dependence between Ȳ_i = (Σ_{j=1}^{n_i} Y_j(x_i))/n_i, Ȳ_i ∼ N(µ(x_i), σ²/n_i), and X, where X is assumed to be uniformly distributed with p_i = 1/q. Thus the response variable depends on the design. In model comparisons based on the same data set this may be an option, although an internal scaling would also do.

Estimation. In order to estimate R²_{J,ζ0}, estimates of variance may be inserted. The data yield an estimate σ̂_Y², while the underlying distribution of Y is unknown. From a set of proposed regression models, the one providing a decomposition of σ̂_Y² such that the dependence between X and Y is maximal is chosen. That amounts to minimizing σ̂_W², interpretable as maximizing goodness of fit. A crucial issue when using variance estimates is whether the estimates satisfy σ̂_Y² = σ̂_B² + σ̂_W². For a given data set {y_j(x_i) | i = 1, ..., q; j = 1, ..., n_i}, the decomposition of the sum of squares based on least squares estimates µ̂_i and ȳ = (Σ_{i,j} y_j(x_i))/n,

Σ_{i,j}(y_j(x_i) − ȳ)² = Σ_{i,j}(y_j(x_i) − µ̂_i)² + Σ_i n_i(µ̂_i − ȳ)²,

ensures the decomposition for the ML estimates

σ̂_Y² = (1/n)Σ_{i,j}(y_j(x_i) − ȳ)², σ̂_W² = (1/n)Σ_{i,j}(y_j(x_i) − µ̂_i)², σ̂_B² = (1/n)Σ_i n_i(µ̂_i − ȳ)²,

yielding the conventional

R² = σ̂_B²/σ̂_Y² (65)
   = 1 − σ̂_W²/σ̂_Y². (66)
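A simulation sketch of the conventional estimate (65)-(66) in this set-up (the seed and the number of replicates per design point are arbitrary choices of the sketch):

    import numpy as np

    rng = np.random.default_rng(3)
    xs = np.array([0.2, 0.5, 0.9]); q, n0 = 3, 10; n = q * n0
    mu = 1.0 + 1.2 * (xs - xs.mean())
    y = mu[:, None] + 0.4 * rng.standard_normal((q, n0))   # sigma^2 = 0.16

    xv = np.repeat(xs, n0); yv = y.ravel()
    b, a = np.polyfit(xv, yv, 1)                 # least squares line
    mu_hat = a + b * xs                          # fitted group means
    ybar = yv.mean()

    sY2 = ((yv - ybar)**2).mean()
    sW2 = ((y - mu_hat[:, None])**2).mean()
    sB2 = (n0 / n) * ((mu_hat - ybar)**2).sum()  # = sY2 - sW2 by least squares
    print(sB2 / sY2, 1 - sW2 / sY2)              # (65) and (66) coincide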
In contrast, the adjusted estimates (for two unknown parameters in the regression functions of our example)

σ̂²_{Y,adj} = (1/(n−1))Σ_{i,j}(y_j(x_i) − ȳ)², σ̂²_{W,adj} = (1/(n−2))Σ_{i,j}(y_j(x_i) − µ̂_i)², σ̂²_{B,adj} = (1/(q−1))Σ_i(µ̂(x_i) − ȳ)²

yield two different estimates. The 'adjusted R²',

R²_{adj,1} = 1 − σ̂²_{W,adj}/σ̂²_{Y,adj} = 1 − (1 − R²)(n−1)/(n−2) ≤ R²,

coincides with R² for large n. For n = qn0,

R²_{adj,2} = σ̂²_{B,adj}/σ̂²_{Y,adj} = R²(n−1)/(n−n0) ≥ R²;

since lim_{n0→∞}(n−1)/(n−n0) = q/(q−1), values R²_{adj,2} ≥ 1 may occur. Discussions in the literature about the appropriateness of estimates are presented as discussions about the appropriate definition of a coefficient of determination.

In fig.1 we visualize the performance of the estimators for simulated data generated according to the regression model N(µ1(x), σ²) specified in the previous section. We consider designs with equal replicates n_i = n0, such that p_i = 1/3 = n0/n, for n0 = 1, ..., 20. For comparison the value R²_{J,ζ0} is given, which was calculated using the empirical distribution of X specified by the experimental design to evaluate the integral w.r.t. X in σ_B² = var µ1(X), but otherwise using the true values of the parameters. Thus E J_KL(p(y|θ(X)), p(y|θ0)) = 0.7400 and R²_{J,θ0} = 0.4253 are obtained.

[Figure 1: Estimates R², R²_{adj,1}, R²_{adj,2} and the true value of R²_{J,ζ0} as functions of the number n0 of replicates at each x_i.]

6.1.2 Mixed model

We consider the extended regression function

µ2(x) = α1 + β1(x − x̄) + β2(x² − x̄²), where x̄² := Σ_i p_i x_i²,

and assume that β2 is a random coefficient, β2 ∼ N(0, τ²). Marginally, with µ1(x) given in (62),

Y|θ(x) ∼ N(µ1(x), σ²(x)) with heterogeneous variances σ²(x) = (x² − x̄²)²τ² + σ².

Quantities of interest. Now, for ζ0ᵀ = (µ_Y/σ_Y², −σ_Y⁻²/2),

E J_KL(p(y|ζ(X)), p(y|ζ0)) = −(1/2)[1 − σ_Y² E(1/σ²(X)) − E((µ(X) − µ_Y)²/σ²(X))]. (67)

Only if σ²(x) ≡ σ_W² does the right hand side of (67) reduce to σ_B²/σ_W². Substituting σ²(x) by some average value σ̄_W², say, one might approximate

E J_KL(p(y|ζ(X)), p(y|ζ0)) ≈ −(1/2) + (1/2)(σ_Y² + σ_B²)/σ̄_W² =: J0. (68)

The decomposition of variance then holds only approximately, σ_Y² ≈ σ̄_W² + σ_B², and

E J_KL(p(y|ζ(X)), p(y|ζ0)) ≈ σ_B²/σ̄_W² =: J1 (69)
                            ≈ (σ_Y²/σ̄_W²) − 1 =: J2. (70)

Correspondingly,

R²_{J,ζ0} ≈ J0/(J0 + 1) =: R²_{J0} (71)
          ≈ σ_B²/(σ_B² + σ̄_W²) =: R²_{J1} (72)
          ≈ 1 − σ̄_W²/σ_Y² =: R²_{J2}. (73)

Using the distribution of X induced by the uniform design to evaluate integrals w.r.t. X, and the same numerical values α1 = 1, β1 = 1.2 as in 6.1.1, with σ² = 0.09 and τ² = 0.4, we have E J_KL(p(y|ζ(X)), p(y|ζ0)) = 0.8929 and R²_{J,ζ0} = 0.4717. Hence in the mixed model there is a slightly stronger relation between X and Y than in the first model with non-random effects.

Estimation. Willett and Singer ([42]: 1988) use weighted least squares estimates of µ(x) in order to quantify the 'variance explained by regression' in the original metric of Y, in contrast to the decomposition of sums of squares of transformed variables proposed by Kvålseth ([21]: 1985). R²_{J,ζ0} refers to the metric of Y, and estimation should aim directly at E J_KL(p(y|ζ(X)), p(y|ζ0)). Otherwise, invariance of (the estimate of) E J_KL(p(y|ζ(X)), p(y|ζ0)) w.r.t. the chosen transformation of Y should be checked. For illustration we discuss an example with univariate continuous X and Y introduced by Draper and Smith ([11]: 1981, pp.112f) and resumed by Willett and Singer ([42]: 1988).
For illustration we discuss an example with univariate continuous X and Y introduced by Draper and Smith ([11]: 1981, pp.112f) and resumed by Willett and Singer ([42]: 1988). In the example the estimated variance function

σ̂²(x) = 1.5329 − 0.7334x + 0.0883x²,

derived by Draper and Smith ([11]: 1981, p.115), is used, and the estimated regression function based on weighted least squares is

µ̂_WLS(x) = −0.889 + 1.142x.

There are n = 35 data points. We first estimate E_JKL(p(y|ζ(X)), p(y|ζ_0)) as in (67), using µ̂_Y = ȳ and σ̂²_Y = (1/35) Σ_ν (y_ν − ȳ)², and empirical means instead of expectations, having evaluated µ̂(x) and σ̂²(x) for each x_ν, ν = 1, ..., 35. Thus we obtain Ê_JKL(p(y|ζ(X)), p(y|ζ_0)) = 35.83 and R̂²_{J,ζ_0} = 0.9728. Next we set

σ̂²_{B,WLS} = (1/35) Σ_{ν=1}^{35} (µ̂_WLS(x_ν) − ȳ)².

In order to obtain approximations we try three values of σ̄²_W:

σ̄²_{W,1} = [(1/35) Σ_ν 1/σ̂²(x_ν)]^{−1} = 1/2.53 ≈ 0.40,
σ̄²_{W,2} = (1/35) Σ_ν σ̂²(x_ν) = 1.63,
σ̄²_{W,3} = (1/35) Σ_ν (y_ν − µ̂_WLS(x_ν))² = 2.02.

None of them satisfies σ̂²_Y = σ̄²_W + σ̂²_B. Insertion of these estimates in (68) results in Ĵ_{0,i} (R̂²_{J_0,i}), say, for i = 1, 2, 3. Substituting in (69) yields Ĵ_{1,i} (R̂²_{J_1,i}), and evaluating (70) gives Ĵ_{2,i} (R̂²_{J_2,i}), respectively. The values of the coefficients of determination are given in table 1.

Table 1

             σ̄²_{W,1}   σ̄²_{W,2}   σ̄²_{W,3}
R̂²_{J_0,i}   0.9670      0.8709      0.8425
R̂²_{J_1,i}   0.9655      0.8718      0.8460
R̂²_{J_2,i}   0.9684      0.8699      0.8389

The results indicate that the choice of the variance estimator is more important than the choice among the approximations of E_JKL(p(y|ζ(X)), p(y|ζ_0)). Relative to R̂²_{J,ζ_0} = 0.9728 the worst performance is obtained with σ̄²_{W,3}, an estimate that does not refer to the functional specification of σ²(x). The estimate R̂²_{J_2} evaluated with σ̄²_{W,3} (i.e. 0.8389) was suggested by Willett and Singer ([42]: 1988). The best estimator is the one based on the mean of the inverse variances 1/σ̂²(x), which according to (67) is in fact more appropriate than forming the inverse of the mean.

6.2 Example: Hald data

The Hald data set is a famous data set which has often been analyzed in order to illustrate techniques for variable selection. The data and an extensive classical analysis are given by Draper and Smith ([11]: 1981, pp.297, 591, 629). We resume this example in order to illustrate how the classical strategies of 'backward elimination' and 'forward insertion' amount to selecting a model with maximum strength of relation between X and Y. The response variable Y is 'heat evolved in cement', and the covariates X_1, ..., X_4 measure amounts of ingredients (in %) in clinkers used to produce the cement. We set X = (X_1, ..., X_4). The covariates are nearly linearly dependent, summing up to almost 100%. There are only n = 13 data points (y_ν, x_ν), where x_ν = (x_{1ν}, ..., x_{4ν}). We assume Y|x ∼ N(µ(x), σ²), where µ(x) = α + β^T x, without referring to joint Normality of X and Y. We use centered covariates.

6.2.1 Quantities of interest

We again have conditional Gaussian densities with homogeneous variances. Hence the theoretical derivations in section 6.1.1 apply, and the main quantity of interest is again R²_{J,ζ_0}, with E_JKL(p(y|ζ(X)), p(y|ζ_0)) = σ²_B/σ²_W.

6.2.2 Estimation

Rao and Toutenburg ([31]: 1995, p.48) show that R̂² = R̂²_{J,ζ_0} increases if variables are added within the linear regression model, thus preserving the monotonicity (in σ²_W) of the estimated quantity. For the Hald data the values of R̂² are (with {j, k, l} indicating the sequence of variables X_i included in the model):

{1}  0.532    {1,2}  0.979    {1,2,4}  0.9823    {1,2,4,3}  0.9824
{2}  0.666    {2,1}  0.979    {2,1,4}  0.9823    {2,1,4,3}  0.9824
{3}  0.286    {3,4}  0.935    {3,4,1}  0.981     {3,4,1,2}  0.9824
{4}  0.674    {4,1}  0.972    {4,1,2}  0.9823    {4,1,2,3}  0.9824
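A sketch of how such a subset table can be produced (our code; the helper name r2_subset is ours, and the synthetic data below are only a runnable stand-in, not the Hald measurements, which would have to be taken from Draper and Smith (1981)):

```python
import numpy as np
from itertools import combinations

def r2_subset(X, y, subset):
    # Conventional R^2 = 1 - s2_W/s2_Y for the OLS fit on the chosen columns
    Z = np.column_stack([np.ones(len(y)), X[:, list(subset)]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1 - np.mean(resid**2) / np.mean((y - y.mean())**2)

# Synthetic stand-in for the 13 x 4 Hald design (substitute the real data
# to reproduce the table above)
rng = np.random.default_rng(2)
X = rng.normal(size=(13, 4))
y = X @ np.array([1.5, 0.7, 0.1, -0.4]) + rng.normal(scale=2.0, size=13)
X = X - X.mean(axis=0)                    # centered covariates, as in the text

for k in range(1, 5):
    for subset in combinations(range(4), k):
        print(subset, round(r2_subset(X, y, subset), 4))
```

The printout illustrates the monotonicity: R̂² never decreases when a covariate is added to a given subset.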
R̂²_adj, now with σ̂²_{W,adj} = (n − k − 1)^{−1} Σ_{ν=1}^{n} (y_ν − µ̂(x_ν))² when k covariates X_i are included in the model, and with σ̂²_{Y,adj} as in section 6.1.1, is not monotone anymore. For the Hald data the sequence

{4}  0.645    {4,1}  0.967    {4,1,2}  0.9645    {4,1,2,3}  0.974

is obtained. Adjusted variances do penalize 'fit' (in the sums of squares) by 'model complexity' (in the degrees of freedom). Yet adjusted variances exemplify that even estimates better than ml-estimates can be expected 'not to penalize sufficiently' for model complexity in model choice aiming at prediction, because a coefficient of determination is an inappropriate criterion for predictive model choice.

6.2.3 Variable selection

Two principal methods are applied in variable selection: 'backward elimination' and 'forward insertion'. Eliminating backwards, one starts with the complete vector of covariates in the regression model and deletes in turn the component which least reduces the measure of dependence between X and Y, until a significant drop occurs. In each step the difference of measures is assessed. Under Normality of Y|θ(x) this reduces to the comparison of within variances, and in fact the difference of R²_{I,ζ_0}'s or R²_{J,ζ_0}'s in this case is a (by σ²_Y) standardized difference of within variances. The test statistic of the partial F-test used to establish significance of the difference under consideration is again a (differently) standardized difference of estimated within variances. For the Hald data, backward elimination based directly on R̂² suggests the path

set        {1,2,3,4} → {1,2,4} → {1,2} → {2}
R̂² (in %)   98.24       98.23     97.9    66.6

with an intuitive cut-off at {1,2}. This is confirmed using partial F-tests, as reported by Draper and Smith ([11]: 1981, p.306). Conversely, adding variables to the model such that the dependence between X and Y is maximally increased yields for the Hald data the steps

set        {4}  → {4,1} → {4,1,2} → {4,1,2,3}
R̂² (in %)  67.4   97.2    98.23     98.24

The cut-off point is again set at 97.2. Thus variable selection based on R̂² amounts to choosing the model with maximum dependence between X and Y, taking into account the variability of the estimators of the measure of dependence.

The incremental association measured in terms of I(θ(X), Y) can be transformed to a difference of R²_I's using (31). If joint Normality of X and Y holds, this becomes a difference of multiple correlation coefficients. Standardization of this difference yields the partial correlation coefficient used to assess the contribution of the added variable to the overall association; compare the discussion by Darbellay ([8]: 1998). As least squares estimates mimic a jointly Gaussian distribution of (X, Y), empirical multiple and partial correlation coefficients are defined analogously to the theoretical quantities even if the distributional assumption does not hold. In this spirit Draper and Smith ([11]: 1981, pp.308f) refer to empirical partial correlation coefficients for the Hald data.

6.3 Example: Doksum et al. (1994)

We analyze simulated data from a distribution introduced by Doksum et al. ([9]: 1994) in order to illustrate the quantification of the local strength of association between X and Y. We also illustrate model comparison for nonparametric regression, looking at the divergence as a function of the smoothing parameter. We have univariate X and Y with Y|θ(x) ∼ N(µ(x), σ²(x)), where

µ(x) = (x/10) exp(5 − x/2),
σ²(x) = (1/9)(1 + x/2)²,

and X ∼ N(µ_X, σ²_X), µ_X = 1.2, σ²_X = 1/9.
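A compact sketch of this data-generating process (our code, numpy only), which we reuse below when evaluating global and local measures:

```python
import numpy as np

def mu(x):
    return (x / 10.0) * np.exp(5.0 - x / 2.0)   # mean function, maximum at x = 2

def sig2(x):
    return (1.0 + x / 2.0) ** 2 / 9.0           # increasing variance function

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(1.2, np.sqrt(1 / 9), size=n)     # X ~ N(1.2, 1/9)
y = mu(x) + np.sqrt(sig2(x)) * rng.standard_normal(n)
```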
6.3.1 Quantities of interest

Global measure of dependence. Considering the conditional Gaussian distributions as a two-parameter exponential family, we have canonical parameters

ζ_1(x) = µ(x)/σ²(x),  ζ_2(x) = −1/(2σ²(x)),

corresponding to the canonical statistics t_1(Y) = Y, t_2(Y) = Y². Computation of J(θ(X), Y) according to (67) yields

J(θ(X), Y) = 6.75,  R²_J = 0.87.

All expectations w.r.t. X were obtained by numerical integration over the range (0.0, 2.4), covering 0.999 of the probability mass of X. The general definition of a coefficient of determination that we suggest thus naturally yields a coefficient for a nonparametric regression model with heterogeneous variances.

Local strength of dependence. The limiting local discrepancy, quantifying the change of the distribution due to a change in x, is given by J_0(x) = I(x), as x is univariate. Regarding the Normal distributions as a one-parameter exponential family, where only changes in µ(x), weighted by σ²(x), are reflected,

J_0(x) = µ′(x)²/σ²(x) =: J_01(x).

Considering the Normal distributions as a two-parameter exponential family,

J_0(x) = I(x) = (ζ_1′(x), ζ_2′(x)) cov((Y, Y²)^T) (ζ_1′(x), ζ_2′(x))^T
       = σ²(x) ζ_1′(x)² + 4µ(x)σ²(x) ζ_1′(x) ζ_2′(x) + (4µ²(x)σ²(x) + 2σ⁴(x)) ζ_2′(x)²
       =: J_02(x).

The corresponding determination curves R²_{J_0i}(x) = σ²_X J_0i(x)/(1 + σ²_X J_0i(x)) are displayed in fig.2. Note that according to (54) R²_{J_01}(x) is the correlation curve. It coincides with the curve displayed in fig.4 of Doksum et al. ([9]: 1994), apart from the range of x-values. At x = 2 the mean function µ attains a maximum, and hence its derivative is zero there. Fig.2 shows that the rates of change J_01(x) and J_02(x) behave similarly in the center of the distribution of X. Where they differ slightly (x > 1.6), the local strength of the relation between X and Y evaluated in the two-parameter family is weaker than in the one-parameter family.

According to (57) the certainty curve is R²_I(x) = 1 − c log(2πeσ²(x)), where c is a constant determined by H(Y). Thus this curve results pointwise from an antitone transformation of σ²(x). As σ²′(x) > 0, σ²(x) is an increasing function of x, and the reduction of uncertainty about Y is maximal for x = 0. Hence the certainty curve, like the correlation curve, decreases on (0, 2.0).

[Figure 2 about here]
Figure 2: Determination curves corresponding to J_0(x), evaluated for the Normal family as a one-parameter (R²_{J_01}) or as a two-parameter (R²_{J_02}) exponential family.
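The local quantities J_01(x), J_02(x) and the corresponding determination curves are straightforward to evaluate; a sketch (our code, with derivatives taken numerically) under the model of section 6.3:

```python
import numpy as np

def mu(x):   return (x / 10.0) * np.exp(5.0 - x / 2.0)
def sig2(x): return (1.0 + x / 2.0) ** 2 / 9.0

def ddx(f, x, h=1e-5):                       # central difference derivative
    return (f(x + h) - f(x - h)) / (2 * h)

x = np.linspace(0.01, 2.4, 200)
m, s2 = mu(x), sig2(x)

# One-parameter family: J01(x) = mu'(x)^2 / sigma^2(x)
J01 = ddx(mu, x) ** 2 / s2

# Two-parameter family with zeta1 = mu/sigma^2, zeta2 = -1/(2 sigma^2)
z1p = ddx(lambda t: mu(t) / sig2(t), x)
z2p = ddx(lambda t: -0.5 / sig2(t), x)
J02 = s2 * z1p**2 + 4 * m * s2 * z1p * z2p + (4 * m**2 * s2 + 2 * s2**2) * z2p**2

s2X = 1 / 9                                  # var(X)
R2_J01 = s2X * J01 / (1 + s2X * J01)         # correlation curve, cf. (54)
R2_J02 = s2X * J02 / (1 + s2X * J02)
```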
Estimation. Using 2000 simulated data points (x_ν, y_ν), we estimate σ²_X, µ_Y, σ²_Y by the empirical mean and variances, and following Doksum et al. ([9]: 1994) we then estimate µ(x) and σ²(x) by averages over neighbourhoods, where each neighbourhood of an x-value contains K points. The derivative of σ²(x) is calculated like µ′(x) as a ratio of differences using half the neighbourhoods. For further details see the paper by Doksum et al. ([9]: 1994). Inserting the estimated curves and evaluating expected values w.r.t. X as empirical means, we obtain Ĵ(θ(X), Y). It is displayed as a function of K (K = 30, 60, ..., 300) in fig.3 below. The highest (estimated) strength of relation is attained for K = 30. Doksum et al. ([9]: 1994) recommend K = 60 as the optimal size of the neighbourhood, but we found that for higher K the estimates of the derivatives improve. It is to be expected that for over-smoothed µ̂ the association between X and Y decreases, and this is confirmed in fig.3, which shows Ĵ(θ(X), Y) as a decreasing function of K.

[Figure 3 about here]
Figure 3: Estimated values of J(θ(X), Y) for smoothing parameters K, with constant true value J(θ(X), Y) = 6.75.

The estimated limiting local divergences Ĵ_01(x) and Ĵ_02(x) for the smoothing parameter K = 300, transformed into determination curves R̂²_{J_01}, R̂²_{J_02}, are shown in fig.4. The estimates perform rather poorly, and better estimates are required to evaluate the determination curves reliably.

[Figure 4 about here]
Figure 4: a) True and estimated correlation curve ρ²(x) = determination curve R²_{J_01}(x), estimate based on K = 300. b) True and estimated determination curve R²_{J_02}(x), estimate based on K = 300.

6.4 Binomial response: example birth study

We illustrate our approach for a binomial response variable and emphasize the comparison of models with different link functions. We also evaluate approximations of E_JKL(p(y|ζ(X)), p(y|ζ_0)) according to results 4 and 5. In the example the dependent variable Y indicates the occurrence of an infection following birth by Caesarean section. The risk of infection is modelled depending on three binary covariates: X_1 indicates whether the Caesarean section was planned or not, X_2 indicates the presence of risk factors, and X_3 indicates prophylaxis with antibiotics. The data and analysis are given in the book by Fahrmeir and Tutz ([12]: 1994, pp.29f). We use the notation introduced in section 2.3.2. Thus π(x) denotes the risk of infection given x = (x_1, x_2, x_3), and Y|π(x) ∼ B(1, π(x)). The covariate vector X takes eight values, denoted by triples of zeros and ones, and their probabilities are denoted by p(x) in general. Let the number of infections occurring under condition x be k(x), and let n(x) denote the number of observations with covariate value x. There are n = Σ_x n(x) = 251 observations.

6.4.1 Quantities of interest

(i) For a Bernoulli reference distribution B(1, π_0) we have

JKL(p(y|π(x)), p(y|π_0)) = Σ_y (p(y|π(x)) − p(y|π_0)) log(p(y|π(x))/p(y|π_0)), (74)

where p(y|π) = π^y (1 − π)^{1−y}, π ∈ {π(x), π_0}. Hence

E_JKL(p(y|π(X)), p(y|π_0)) = Σ_x p(x)(π(x) − π_0)[log(π(x)/(1 − π(x))) + log((1 − π_0)/π_0)]. (75)

(ii) Similarly

E_IKL(p(y|π(X)), p(y|π_0)) = Σ_x p(x) Σ_y p(y|π(x)) log(p(y|π(x))/p(y|π_0)). (76)

(iii) We also investigate approximations to E_IKL(p(y|π(X)), p(y|π_0)) and E_JKL(p(y|π(X)), p(y|π_0)). First we consider the log-likelihood ratio as an approximation to E_IKL(p(y|π(X)), p(y|π_0)):

(1/n) log Π_x [π(x)^{k(x)} (1 − π(x))^{n(x)−k(x)} / (π_0^{k(x)} (1 − π_0)^{n(x)−k(x)})]
= (1/n) Σ_x log [π(x)^{k(x)} (1 − π(x))^{n(x)−k(x)} / (π_0^{k(x)} (1 − π_0)^{n(x)−k(x)})]
= Σ_x (n(x)/n) [(k(x)/n(x)) log(π(x)/π_0) + ((n(x) − k(x))/n(x)) log((1 − π(x))/(1 − π_0))]
=: I_LR.

Hence in the log-likelihood ratio p(x) is replaced by the relative frequency n(x)/n, which would also be used in conditional sampling, and the model-dependent probabilities p(y|π(x)) are replaced by the relative frequencies k(x)/n(x) and (n(x) − k(x))/n(x), respectively. According to (33) the corresponding approximation for E_JKL(p(y|π(X)), p(y|π_0)) is

E_JKL(p(y|π(X)), p(y|π_0)) ≈ 2 I_LR.
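The quantities (75), (76) and I_LR, and the approximation E_JKL ≈ 2 I_LR, can be inspected numerically with a few lines of code; the following sketch uses our notation (pix holds the model probabilities π(x), nx and kx the cell counts), and the counts below are placeholders, not the birth-study data:

```python
import numpy as np

def e_jkl(px, pix, pi0):
    # Expected symmetric divergence, eq. (75)
    return np.sum(px * (pix - pi0)
                  * (np.log(pix / (1 - pix)) + np.log((1 - pi0) / pi0)))

def e_ikl(px, pix, pi0):
    # Expected directed divergence, eq. (76)
    return np.sum(px * (pix * np.log(pix / pi0)
                        + (1 - pix) * np.log((1 - pix) / (1 - pi0))))

def i_lr(nx, kx, pix, pi0):
    # Log-likelihood ratio approximation to E_IKL
    n = nx.sum()
    return np.sum((nx / n) * ((kx / nx) * np.log(pix / pi0)
                              + ((nx - kx) / nx) * np.log((1 - pix) / (1 - pi0))))

# Placeholder cell data for the eight covariate patterns (not the real counts)
nx = np.array([40, 30, 35, 25, 30, 35, 28, 28])
kx = np.array([4, 10, 6, 12, 3, 15, 5, 16])
pix = kx / nx                         # saturated-model estimates
pi0 = kx.sum() / nx.sum()
px = nx / nx.sum()

print(e_ikl(px, pix, pi0), i_lr(nx, kx, pix, pi0))   # identical for the saturated fit
print(e_jkl(px, pix, pi0), 2 * i_lr(nx, kx, pix, pi0))
```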
Result 4 suggests the 'Wald' approximation (with ζ_0 = logit π_0)

E_JKL(p(y|π(X)), p(y|π_0)) ≈ E(tr[H_M(ζ_0)(ζ(X) − ζ_0)(ζ(X) − ζ_0)^T]),

yielding here

E_JKL(p(y|π(X)), p(y|π_0)) ≈ Σ_x p(x) var(Y|π_0)(logit π(x) − logit π_0)²
 ≈ Σ_x (n(x)/n) π_0(1 − π_0)(logit π(x) − logit π_0)². (77)

Result 5 gives the 'score' approximation

E_JKL(p(y|π(X)), p(y|π_0)) ≈ Σ_x p(x)(π(x) − π_0)²/var(Y|π_0)
 ≈ Σ_x (n(x)/n)(π(x) − π_0)²/(π_0(1 − π_0)), (78)

which is the Binomial version of the ratio of 'between' to 'within' variance.

(iv) Analogously to definition 1, we define coefficients of determination of Y by X_i conditional on X_{−i} = x_{−i}, where X_{−i} denotes the vector of covariates without X_i:

R²_{J|x_{−i}} = J(θ(X_i; x_{−i}), Y)/(1 + J(θ(X_i; x_{−i}), Y)) ∈ [0, 1]. (79)

Here we examine the partial association between Y (infection) and X_3 (antibiotics), fixing (x_1, x_2) = x_{−3}. For example, x_{−3} = (0, 1) corresponds to the condition that a Caesarean section was not planned but risk factors were present.

We compare three models:

- logit link without interaction between X_1 and X_2,

  ζ(x) = log[π(x)/(1 − π(x))] = α + β_1 x_1 + β_2 x_2 + β_3 x_3 =: η_1(x),

- logit link with interaction between X_1 and X_2,

  ζ(x) = η_1(x) + β_4 x_1 x_2 =: η_2(x),

- probit link without interaction,

  π(x) = Φ(η_1(x)).

6.4.2 Estimation

We evaluate the expressions given above for the estimate π̂_0 = Σ_x k(x)/n = 0.283. For the three models we insert ml-estimates (given by Fahrmeir and Tutz ([12]: 1994)) in the formulae for the discrepancies and their approximations. Under the logit link the estimated coefficients of η_1(x) are

α̂ = −1.89, β̂_1 = 1.07, β̂_2 = 2.03, β̂_3 = −3.25,

and for η_2(x)

α̂ = −1.39, β̂_1 = −11.64, β̂_2 = 1.36, β̂_3 = −3.83, β̂_4 = 13.49.

Under the probit link η_1(x) is estimated with coefficients

α̂ = −1.09, β̂_1 = 0.69, β̂_2 = 1.2, β̂_3 = −1.9.

We also evaluate the discrepancies for the saturated model, using the cell frequencies as estimates. We dealt with zero probabilities as arguments of the logarithm using the convention 0 log 0 = 0. The values are displayed in fig.5.

Evaluating E_IKL(p(y|π(X)), p(y|π_0)) we obtain R²_{I,π_0} = 0.2828 for the saturated model, indicating that it outperforms all other models, although the model with logit link and interaction attains nearly the same coefficient of determination. The relation between X and Y in terms of the divergence turns out to be strongest in the model with the logit link and interaction. For this particular data set it even outperforms the saturated model. This is due to the case where a Caesarean section was not planned and no risk factor was present but antibiotics were given, which occurred twice without any infection observed. The probability of infection in this cell is estimated as 0.0054 in the model with logit link and interaction and hence adds a term to the divergence, whereas with the relative frequencies and the convention 0 log 0 = 0 no term reflects this case in the divergence.

[Figure 5 about here]
Figure 5: a) Estimates and estimates of approximations of R²_{I,π_0}; b) estimates and estimates of approximations of R²_{J,π_0}, for the saturated model, the logit model with interaction (can2), the logit model without interaction (can1) and the probit model.
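The approximations (77) and (78) need only the fitted linear predictor. A sketch (our code) using the reported coefficients of the logit model without interaction, with placeholder cell counts summing to 251 (the real n(x) are in Fahrmeir and Tutz (1994)):

```python
import numpy as np

def logit(p): return np.log(p / (1 - p))
def expit(z): return 1 / (1 + np.exp(-z))

# Eight covariate patterns (x1, x2, x3) and reported ml-coefficients of eta_1
X = np.array([(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)], float)
alpha, beta = -1.89, np.array([1.07, 2.03, -3.25])
pix = expit(alpha + X @ beta)             # fitted infection probabilities pi(x)

nx = np.array([30, 32, 31, 32, 31, 32, 31, 32])   # placeholder counts, sum = 251
px = nx / nx.sum()
pi0 = 0.283
v0 = pi0 * (1 - pi0)                      # var(Y | pi_0)

wald = np.sum(px * v0 * (logit(pix) - logit(pi0)) ** 2)   # (77)
score = np.sum(px * (pix - pi0) ** 2 / v0)                # (78)
print(wald / (1 + wald), score / (1 + score))             # R^2 versions
```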
We illustrate the coefficient of determination of Y by X_i conditional on x_{−i} given in (79) by evaluating the partial coefficients of determination between Y (infection) and X_3 (antibiotics) for the saturated model and for the model with logit link and interaction. The values are given in table 2.

Table 2: Estimated coefficients of determination R̂²_{J,π_0|x_{−3}}

                      x_{−3} = (x_1, x_2)
                      (0, 1)      (1, 1)
saturated             0.1758      0.3445
logit, interaction    0.2470      0.3206

That is, the relation between infection and antibiotics turns out to be stronger if a Caesarean section was not planned than in the case where a section was planned beforehand.

7 Discussion

7.1 Quantities of interest and estimates

We elaborated the idea of defining coefficients of determination as measures of dependence, based either on the directed divergence, stressing the explanatory power of X as a reduction of uncertainty about Y, or on the divergence, stressing the discriminatory power of X in telling apart conditional distributions of Y. In order to apply coefficients of determination in model comparison, essentially two versions have been focussed on:

- for the comparison of models defined within an encompassing joint distribution of X and Y: J(θ(X), Y) and I(θ(X), Y);

- for the comparison of models with (conditional) sampling distributions belonging to the same (often exponential) family of distributions: E_JKL(p(y|θ(X)), p(y|θ_0)) and E_IKL(p(y|θ(X)), p(y|θ_0)).

It is hard to decide whether to use the directed divergence or the divergence, and we do not recommend either of them exclusively. The most appealing feature of the divergence is its representation (result 1) as a functional of the log-odds ratio function Ψ characterizing the association between X and Y (in the sense that their joint distribution is determined by the marginal distributions and Ψ). This feature yields a simple representation of the divergence in exponential families (result 3), where Ψ is bi-affine. The main advantage of the directed divergence is its decomposition in terms of entropies (28) and its monotonicity properties (discussed in section 4). Model comparison using a reference density is essentially based on −E_X E(log p(y|θ(X))|θ_0). For the comparison of models without any common reference model, the conditional entropy, that is, the model specific expected value of the log-density,

H(Y|θ(X)) = −E_X E(log p(y|θ(X))|θ(X)),

is used instead.

We did not consider covariates defined on an ordinal scale, although measures of association have been applied in this case as well (e.g. Särndal ([33]: 1974)). The adaptation of the coefficients of determination we propose to this set-up requires further study.

The clear distinction between theoretical and estimated quantities may stimulate further research on properties of estimates. The interpretation and use of coefficients of determination, though, is determined by the concept underlying their definition rather than by properties of their estimates. We emphasize that in our view estimates of theoretically defined coefficients of determination empirically describe a feature of a regression model, namely the dependence between a response Y and covariates X, and are not devised as measures of goodness of fit for the data set at hand. As ml-estimates of coefficients of determination often happen to be measures of goodness of fit, this interpretation has been widespread and implemented in sampling definitions of coefficients of determination (for example by Cameron and Windmeijer ([5]: 1997)).

7.2 Model choice

Model comparison w.r.t. the explanatory power of a model or the richness of a family of distributions, as described in this paper, is different from model choice if the target is the prediction of future observations.
In model choice it is often the predictive potential of a model that is of interest, and hence predictive densities are compared. The approximation of a predictive density by a likelihood needs to be corrected, and such a correction often yields a criterion for model choice of the form 'fit + complexity', for example AIC or its Bayesian extension DIC ([38]: Spiegelhalter et al., 2002). The target criterion for model choice, −E log p(Y|θ̂(x)), where Y denotes a future observation and θ̂ a parameter estimate, provides the link between these approaches and coefficients of determination, particularly H(Y|θ(X)).

Appendix

Proof of result 3:

JKL(p(y|ζ(x)), p(y|ζ)) = ∫ (p(y|ζ(x)) − p(y|ζ)) log(p(y|ζ(x))/p(y|ζ)) dy
 = E[(ζ(x) − ζ)^T t(Y) − M(ζ(x)) + M(ζ) | ζ(x)] − E[(ζ(x) − ζ)^T t(Y) − M(ζ(x)) + M(ζ) | ζ]
 = (ζ(x) − ζ)^T [E(t(Y)|ζ(x)) − E(t(Y)|ζ)].

Taking expectations w.r.t. x yields

J(θ(X), Y) = E_X[(ζ(X) − ζ)^T E(t(Y)|ζ(X))]
 = E_X[(ζ(X) − ζ)^T (E(t(Y)|ζ(X)) − E t(Y))]
 = tr{cov_X(ζ(X), E(t(Y)|ζ(X)))},

which is (12). □

Proof of result 5: Set τ(x) = f(ζ(x)), or ζ(x) = f^{−1}(τ(x)). (14) can be elaborated as

J(θ(X), Y) = E[(ζ(X) − ζ)^T (τ(X) − τ̄)]
 = E[ζ(X)^T (τ(X) − τ̄)]
 = E[(τ(X) − τ̄)^T f^{−1}(τ(X))]
 = E[u(τ(X))], say.

A second order Taylor expansion of u(τ(x)) in τ̄ then yields the result. □

References

[1] Bell, C.B. (1962). Mutual Information and Maximal Correlation as Measures of Dependence. Ann. Math. Statist. 33, 587-595.

[2] Bernardo, J.M. and Rueda, R. (2002). Bayesian Hypothesis Testing: a Reference Approach. Int. Statist. Rev. 70, 351-372.

[3] Blyth, S. (1994). Local Divergence and Association. Biometrika 81, 579-584.

[4] Bjerve, S. and Doksum, K. (1993). Correlation Curves: Measures of Association as Functions of Covariate Values. Ann. Statist. 21, 890-902.

[5] Cameron, A.C. and Windmeijer, A.G. (1997). An R-squared Measure of Goodness of Fit for Some Common Nonlinear Regression Models. J. of Econometrics 77, 329-342.

[6] Cover, T.M. and Thomas, J.A. (1991). Elements of Information Theory. Wiley: New York.

[7] Cramer, E.M. and Nicewander, W.A. (1979). Some Symmetric, Invariant Measures of Multivariate Association. Psychometrika 44, 43-54.

[8] Darbellay, G.A. (1998). Predictability: an Information-Theoretic Perspective. In: A. Procházka et al. (eds.), Signal Analysis and Prediction. Birkhäuser: Boston, 249-262.

[9] Doksum, K., Blyth, S., Bradlow, E., Meng, X.-L. and Zhao, H. (1994). Correlation Curves as Local Measures of Variance Explained by Regression. J. Amer. Statist. Ass. 89, 571-582.

[10] Doksum, K. and Samarov, A. (1995). Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression. Ann. Statist. 23, 1443-1473.

[11] Draper, N.R. and Smith, H. (1981). Applied Regression Analysis. Wiley: New York.

[12] Fahrmeir, L. and Tutz, G. (1994). Multivariate Statistical Modelling Based on Generalized Linear Models. Springer: New York.

[13] Glahn, H. (1969). Some Relationships Derived from Canonical Correlation Theory. Econometrica 37, 252-256.

[14] Gleason, T.C. (1976). On Redundancy in Canonical Analysis. Psychol. Bull. 83, 1004-1006.

[15] Härdle, W. (1990). Applied Nonparametric Regression. Cambridge Univ. Press: Cambridge.

[16] Healy, M.J.R. (1984). The Use of R² as a Measure of Goodness of Fit. J. R. Statist. Soc. A 147, 608-609.

[17] Hooper, J. (1959). Simultaneous Equations and Canonical Correlation Theory. Econometrica 27, 245-256.

[18] Jessop, A. (1995). Informed Assessments. Ellis Horwood: New York.

[19] Joe, H. (1989).
Relative Entropy Measures of Multivariate Dependence. J. Amer. Statist. Ass. 84, 157-164.

[20] Kent, J.T. (1983). Information Gain and a General Measure of Correlation. Biometrika 70, 163-173.

[21] Kvålseth, T.O. (1985). Cautionary Note about R². The Amer. Statistician 39, 279-285.

[22] Kullback, S. (1968). Information Theory and Statistics. Dover Publ.: Mineola, New York (2nd ed.).

[23] Lavergne, P. and Vuong, Q.H. (1998). An Integral Estimator of Residual Variance and a Measure of Explanatory Power of Covariates in Nonparametric Regression. Nonpar. Statist. 9, 363-380.

[24] van der Linde, A. (2004). On the Association between a Random Parameter and an Observable. Test 13, 85-111.

[25] McKay, R.J. (1977). Variable Selection in Multivariate Regression: an Application of Simultaneous Test Procedures. J. R. Statist. Soc. B 39, 371-380.

[26] Magee, L. (1990). R² Measures Based on Wald and Likelihood Ratio Joint Significance Tests. The Amer. Statistician 44, 250-253.

[27] Mardia, K.V., Kent, J.T. and Bibby, J.M. (1995). Multivariate Analysis. Academic Press: New York (10th ed.).

[28] Nagelkerke, N.J.D. (1991). A Note on a General Definition of the Coefficient of Determination. Biometrika 78, 691-692.

[29] Osius, G. (2000). The Association between Two Random Elements: a Complete Characterization in Terms of Odds Ratios. Metrika, to appear.

[30] Rao, C.R. (1973). Linear Statistical Inference and Its Applications. Wiley: New York.

[31] Rao, C.R. and Toutenburg, H. (1995). Linear Models. Springer: New York.

[32] Rényi, A. (1959). On Measures of Dependence. Acta Math. Acad. Sci. Hungar. 10, 441-451.

[33] Särndal, C.E. (1974). A Comparative Study of Association Measures. Psychometrika 39, 165-187.

[34] Silvey, S.D. (1964). On a Measure of Association. Ann. Math. Statist. 35, 1157-1166.

[35] Simon, G. (1973). Additivity of Information in Exponential Family Probability Laws. J. Amer. Statist. Ass. 68, 478-482.

[36] Soofi, E.S., Retzer, J.J. and Yasai-Ardekani, M. (2000). A Framework for Measuring the Importance of Variables with Applications to Management Research and Decision Models. Decision Sciences 31, 595-625.

[37] Soofi, E.S. and Retzer, J.J. (2002). Information Indices: Unifications and Applications. J. of Econometrics 107, 17-40.

[38] Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002). Bayesian Measures of Model Complexity and Fit (with discussion). J. R. Statist. Soc. B 64, 583-639.

[39] Stewart, D. and Love, W. (1968). A General Canonical Index. Psychol. Bull. 70, 160-163.

[40] Theil, H. (1987). How Many Bits of Information Does an Independent Variable Yield in a Multiple Regression? Statist. & Prob. Lett. 6, 107-108.

[41] Theil, H. and Chung, C. (1988). Information-Theoretic Measures of Fit for Univariate and Multivariate Regression. The Amer. Statistician 42, 249-252.

[42] Willett, J.B. and Singer, J.D. (1988). Another Cautionary Note about R²: its Use in Weighted Least-Squares Regression Analysis. The Amer. Statistician 42, 236-238.