Academia.eduAcademia.edu

The effect of explanatory variables on income: A tool that allows a closer look at the differences in income

2020, Econometrics and Statistics

Accepted Manuscript The Effect of Explanatory Variables on Income: A Tool that Allows a Closer Look at the Differences in Income Gerhard Tutz, Moritz Berger PII: DOI: Reference: S2452-3062(18)30103-5 https://doi.org/10.1016/j.ecosta.2018.12.001 ECOSTA 125 To appear in: Econometrics and Statistics Received date: Revised date: Accepted date: 8 April 2018 4 December 2018 5 December 2018 Please cite this article as: Gerhard Tutz, Moritz Berger, The Effect of Explanatory Variables on Income: A Tool that Allows a Closer Look at the Differences in Income, Econometrics and Statistics (2018), doi: https://doi.org/10.1016/j.ecosta.2018.12.001 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. ACCEPTED MANUSCRIPT The Effect of Explanatory Variables on Income: A Tool that Allows a Closer Look at the Differences in Income IP T Gerhard Tutza,∗, Moritz Bergerb a Ludwig-Maximilians-Universität München, Ludwigstraße 33, D-80539 München, Germany für Medizinische Biometrie, Informatik und Epidemiologie, Universitätsklinikum Bonn, Sigmund-Freud-Straße 25, 53127 Bonn, Germany US CR b Institut Abstract Investigation of the effect of covariates on income typically relies on regression models with a transformed income. An underlying assumption is that the exact AN income is available. However, in surveys reported income is often available in income brackets only. For such grouped data one can use ordered regression models, which in their simplest form with a linear predictor work in a similar M way as regression models for exact income. They yield an overall measure of the effect of covariates but fail to detect the specific structure of the effects of single covariates. A model is proposed that allows a closer look at the effect PT ED of single covariates, showing in more detail how the income is determined by explanatory variables. The model exploits the potential of sequential regression models, which are extended to allow for varying coefficients. The model is not harder to use than classical regression models but is much more informative. The method is illustrated by using data from the German Socio-Economic Panel CE Study and the United States Census Bureau. Keywords: Income, Sequential model, Discrete hazard model, Varying AC coefficients, Ordinal models. ∗ Corresponding author Preprint submitted to Econometrics and Statistics December 19, 2018 ACCEPTED MANUSCRIPT 1. Introduction Income is determined by many variables, including among others educational level, gender, age, obesity, and height. In applied research the traditional way 5 IP T to investigate the effect of explanatory variables is to use regression models, see, for example, Steckel (1983), Telles and Murguia (1990), Hinze (2000), Berg and Lien (2002), Deaton and Arora (2009). When using regression models one US CR can estimate the effect of covariates to obtain results concerning, for example, the gender gap in income. For example, Bobbitt-Zeher (2007) conclude that college-educated men in their mid-20s already earn, on average, about $ 7, 000 10 more per year than college-educated women. For similar investigations of the gender gap see Marini (1989). AN If the income is available exactly and on a metric scale one can use classical linear regression models. Since income itself is hardly normally distributed one typically uses log-transformed income, see, for example, Muller (2002), Ermini and Hendry (2008). Alternatively, and often more appropriately, one can use M 15 distribution models as the Pareto distribution (Singh and Maddala, 2008) or a generalized linear model (McCullagh and Nelder, 1989). A candidate from this PT ED family of models is the gamma distribution model. It is able to account for the skewness of income distributions and the form of the distribution automatically 20 adapts to larger means. In all these regression models one assumes a strict form of the income distribution and models the effect of explanatory variables on the mean of income or log-transformed income. CE The methods considered here assume that income is given in intervals only. The main reason is that in many surveys income is available in income brackets AC 25 as, for example, in the Current Population Survey (CPS), which is considered later. Moreover, for self reported income intervals the non-response problem tends to be weaker when respondents do not know or do not want to tell their exact income, but are willing to report in which bracket they are. With income given in intervals one has an ordered categorical response and generalized linear 30 regression models for ordinal responses can be used. As classical regression 2 ACCEPTED MANUSCRIPT models for exact income, the generalized linear models for ordinal responses are simple to use and the effect of a variable typically is determined by one parameter. Thus, if one models, for example, the effect of gender on income one 35 IP T obtains one parameter, which reflects the difference in income between men and women. One obtains a crude overall picture of the difference. What is not seen, is how the effect was generated by the data. For example, there might be a US CR strong difference between gender groups in higher incomes, in lower incomes, or the same differences could be found in all income groups. The method proposed here allows a closer look at the effect of variables on income, which may vary 40 across the income intervals. Rather than obtaining a single parameter for each variable the effect of each variable is represented as a curve, which shows the conditional probabilities of reaching a higher income level given the previous AN level has been reached. In Section 2 the basic concept is introduced. In Section 3 estimation concepts 45 for simple models and the proposed model with varying coefficients are given. M The usefulness of the method is illustrated in Section 4 by considering two real PT ED data sets. The paper concludes with some remarks. 2. Hazard Models for Income Categories Let, in general, the income I be given in categories with I ∈ {1, . . . , k}, 50 with the categories being build from grouping into income intervals [a0 , a1 ), . . . , [ak−1 , ak ), where a0 = 0 and ak = ∞ . CE For illustration of the method we will use data from the Socio-Economic Panel Study (SoeP), a representative longitudinal study of private households AC in Germany. The data set investigated here consists of n = 4023 full-time 55 employees who are between 20 and 60 years of age and living in Germany. The grouping of yearly income is given in Table 1. What one observes is a discrete response, more precisely, an ordered categorical response and one wants to model I|x, where x is a vector of explanatory variables. Instead of modeling the mean of the response variable given x, which represents a rather crude measure of 3 ACCEPTED MANUSCRIPT 1 < 5,000 2 .. . 5,000 - 10,000 .. . 20 95,000 - 100,000 21 100,000 - 110,000 22 110,000 - 130,000 US CR Income (Euro) 23 130,000 - 160,000 24 > 160,000 the location, one wants a model that accounts for the form of the distribution. AN 60 Bracket IP T Table 1: Grouping of yearly income in 24 brackets for SoeP data. Candidates are ordinal regression models as proposed by McCullagh (1980). More recent overviews on ordinal regression are found in Agresti (2010) and Tutz (2012). The most widely used ordinal response model is the cumulative model, 65 M however for our purpose another model from the family of ordinal response models is more appropriate, namely the sequential logit model. PT ED The basic form of the model is given by P (I = r|I ≥ r, x) = F (ηr ) = F (γ0 (r) + xT γ ), r = 1, . . . , k − 1 , (1) where F (.) is a distribution function and γ0 (.) contains the parameters in the form of a function with the income category as the argument. We will use CE the logistic response function F (ηr ) = exp(ηr )/(1 + exp(ηr )), which yields the AC logistic sequential model P (I = r|I ≥ r, x) = exp(γ0 (r) + xT γ) . 1 + exp(γ0 (r) + xT γ) (2) It is also known as the continuation ratio model and has the alternative representation log  P (I = r|x) P (I > r|x)  = γ0 (r) + xT γ . The interesting feature of the model is the term on the left hand side of (1). It 4 ACCEPTED MANUSCRIPT is also known as a (discrete) hazard rate λ(r|x) = P (I = r|I ≥ r, x), r = 1, . . . , k − 1 , IP T which is the conditional probability of a response in income category r, i.e. in interval [ar−1 , ar ), given the income category r has been reached. The counterpart is the conditional transition probability r = 1, . . . , k − 1 , US CR tr(r|x) = 1 − λ(r|x), which is the conditional probability of a response in a higher income category than r given the income category r has been reached. The discrete functions λ(.|x) and tr(.|x) provide a specific view on the distribution of I|x. The former represents the probability of failing to reach the AN next income level, the latter is the probability of reaching the next level given the previous level has been reached. Each transition can be seen as a binary response, which denotes successful transition or failing. The modeling implicitly M assumes that people aim at obtaining higher income. The model (1) can be seen as directly modeling the binary response, transition or failing. It models the local hazard as a binary response PT ED λ(r|x) = F (γ0 (r) + xT γ ), r = 1, . . . , k − 1 . (3) We will refer to model (3) as the (logistic) discrete hazard model rather than using the term sequential model. 70 Sequential and discrete hazard models were considered by McCullagh (1980), CE Laara and Matthews (1985), Armstrong and Sloan (1989), Tutz (1991), Ananth and Kleinbaum (1997). Peyhardi et al. (2015) gave a careful investigation of AC the relationship among ordinal models and derived invariance properties for the 75 sequential model. The link to discrete survival modeling was considered, for example, by Fahrmeir and Tutz (2001), where the sequential model is treated within the framework of multivariate generalized linear models. An extensive treatment of discrete survival models is found in Tutz and Schmid (2016). However, it cannot be used in its common form. The essential tool is the extension to allow for varying coefficients, which is considered in the following. 5 ACCEPTED MANUSCRIPT 80 A Hazard Model for the Whole Population Let us briefly consider the simplest case without covariates. Then the discrete hazard function has the form λ(r|x) = F (γ0 (r)) and the number of param- IP T eters of the model equals the number of income categories. Therefore, without further constraints the model fits the observed distribution perfectly. The func85 tion λ(.|x) is just a different representation of the distribution of the income. US CR For illustration let us consider the income distribution found in the SoeP data. Figure 1 shows the income distribution on the left hand side and the (smoothed) corresponding function γ0 (.), which determines the hazard function, on the right hand side. It is seen that the hazard rate, that is, the probability of failing to 90 reach a higher income category is very small at the beginning but increases quickly to reach a constant level beginning at category 6, [55,000 - 60,000). Be- AN tween category 6 and 12 the probability of the lower income category remains on a high level, which even increases for higher levels of income indicating that the transition to very high income categories gets harder. The form of the hazard rate is typical for income data and was found in various sub samples, for M 95 example when stratifying for gender (see the applications in the following). One PT ED might even assume that hazard rates are monotonically increasing and might use estimation methods that allow to include monotonicity, see, for example Tutz and Leitenstorfer (2007), Hazelton and Turlach (2011). Since monotonic100 ity assumptions are only sensible in the case without covariates we abstain from making this additional assumption. It should be noted that we always present CE smoothed estimates, which are explained later, because with unsmoothed estimates the danger of fitting noise is high and typically the basic form of the AC function gets lost. 105 Hazard Rate Modelling with Covariates The objective of regression modelling is to quantify the impact of explanatory variables on the response. Within the framework of discrete hazard models the common form is the representation (3), in which γ represents the weights which quantify the impact of explanatory variables on the income distribution. 6 0 IP T γ0 −2 0.15 −4 0.10 1 3 5 7 9 11 15 19 US CR 0.00 −6 0.05 relative frequency 0.20 2 0.25 ACCEPTED MANUSCRIPT 23 5 income bracket 10 15 20 income bracket Figure 1: Income distribution (left) and corresponding baseline function (right) for SoeP data. Important aspects of the model are the following: Since the functional form AN 110 of γ0 (.) is not fixed the (discrete) income distribution is not specified to follow a specific distribution. In contrast, the use of the linear term xT γ = M (x1 , . . . , xp )(γ1 , . . . , γp )T is rather restrictive. It means that the effect of covariates is the same for each income category, more concrete, log(λ(r|x)/(1 − 115 λ(r|x))) changes by γj if xj increases by one unit, and the change does not de- PT ED pend on r. Thus the effect of a covariate is determined by just one parameter. If one compares, for example, the income of gender groups, one has just one gender coefficient, which is assumed to be the same for all the transitions. This is a rather strict assumption that is hardly realistic in applications. A model that allows a closer look into the shift of income due to explanatory CE variables is a model that lets the weight parameters γ depend on the income cat- AC egories. The corresponding model has the form λ(r|x) = F (ηr ) with predictor 120 ηr = γ0 (r) + xT γ (r) , (4) where γ(r) = (γ1 (r), . . . , γp (r)) is a vector-valued function of income categories with γj (1), . . . , γj (k − 1) representing the category-varying coefficients of the j-th covariate. The function γj (.) allows to model the impact of an explanatory variable in a very flexible way. Formally, the model is a varying-coefficient 7 ACCEPTED MANUSCRIPT model, a class of models that has been introduced by Hastie and Tibshirani 125 (1993). In general, in varying-coefficient models the effect of one or more covariates are modified by other covariates, the so-called effect modifiers. Here IP T the effect modifying variable is income measured in intervals. It modifies the effect of explanatory variables as, for example, gender. It means that the effect of an explanatory variable on the transition to higher income categories depends on the category. For example, it can be strong for small categories but weak US CR 130 for high categories. The data determine the variation of the effect over categories. The function γ0 (.) represents the basic form of the income distribution and, borrowing terminology from the hazard modeling literature, will be called baseline function. 135 For illustration let us consider the effect of single variables when using AN model (4). First of all, Figure 2 shows the distributions and the functions γ0 (.) fitted separately for male and female employees. It is immediately seen that the distributions are quite different with a definitely higher income for men. 140 M Consequently the fitted functions γ0 (.) differ across gender groups. However, it is not easy to compare the two functions, because as far as their difference is PT ED concerned they are not very informative. When fitting the regression model (4) one can investigate the effect of gender across categories. In Figure 3 the function γ0 (.) and the effect modifying function γGender (.) are plotted. For the interpretation of the effects it is useful to have a look at the proposed logistic discrete hazard model, which for the CE binary covariate gender can be given by AC P (I ≤ r|x) = exp(γ0 (r)) exp(xGender γGender (r)), P (I > r|x) 145 r = 1, . . . , k − 1 . The first term on the right hand side contains the baseline function. It does not depend on the value of the covariate and is therefore always present in the model. The second term on the right hand side represents the varying effect of the covariate. If x = 0 only the baseline function is present in the model and one obtains the function for the reference population. Since Gender = 1 denotes females and Gender = 0 denotes males, men are the reference popula8 0 IP T γ0 −2 0.15 −4 0.10 1 3 5 7 9 11 15 19 23 5 10 15 20 income bracket γ0 1 3 5 7 9 11 15 −6 M 0.00 0.05 −4 0.10 −2 AN 0.15 0 0.20 2 0.25 income bracket relative frequency US CR 0.00 −6 0.05 relative frequency 0.20 2 0.25 ACCEPTED MANUSCRIPT 19 23 5 10 15 20 income bracket PT ED income bracket Figure 2: Income distribution and baseline function for men (above) and women (below) for SoeP data. tion. Therefore, the baseline function represents the baseline function for the CE male sub population, which is the same as in Figure 2. The function γGender (.) 150 represents the deviation from the male reference population. It shows where the differences between gender groups are located. One can see that females AC have higher probabilities of remaining in lower income categories right from 155 the beginning because the function γGender (.) is positive, which means a higher probability of remaining in lower categories. It is positive up to about category 17. For categories larger than 17 the estimated effect is sometimes even negative but the confidence intervals are so wide that one cannot conclude that the effect is significantly different from zero. Overall, one sees a strong effect for 9 IP T 0 γgender −1 −2 5 10 15 20 US CR −3 −6 −2 −4 γ0 1 0 2 2 3 ACCEPTED MANUSCRIPT 5 income bracket 10 15 20 income bracket Figure 3: Varying coefficients for the discrete hazard model with gender as covariate, showing baseline function (left) and the varying gender effect (right) for SoeP data. AN low categories and a tendency to decrease across categories, which means that the difference in transition rates decreases across categories. Instead of 0-1 coding one could also use the symmetric so-called effect coding, M with xGender = 1 for females and xGender = −1 for males. Then, the estimated baseline function is no longer the function for men but for the mixture of the PT ED two gender groups. For the cumulative odds one obtains P (I ≤ r|female) = exp(γ0 (r)) exp(γGender (r)) , P (I > r|female) P (I ≤ r| male) = exp(γ0 (r)) exp(−γGender (r)) , P (I > r| male) with the function γGender (.) for females and −γGender (.) for males. A direct AC CE comparison of the two gender groups is captured by the cumulative odds ratio 160 P (I ≤ r|female)/P (I > r|female) = exp(2γGender (r)) . P (I ≤ r| male)/P (I > r| male) The effect function γGender (.) has a similar form as for 0-1 coding of gender, but the values are smaller, see Figure 4. A second example is the impact of age on income. Figure 5 shows the fitted functions γ0 (.) and γage (.). It is seen that the function γage (.) is negative, indicating that older people tend to have higher income. The effect is distinct in 10 IP T 0 γgender −1 −2 5 10 15 20 US CR −3 −6 −2 −4 γ0 1 0 2 2 3 ACCEPTED MANUSCRIPT 5 income bracket 10 15 20 income bracket Figure 4: Varying coefficients for the discrete hazard model with gender as covariate in effect coding, showing baseline function (left) and the varying gender effect (right) for SoeP data. low to medium income categories and seems to be not significant for high income AN 165 categories. In the case of a metric variable it seems useful to visualize the impact of the covariate in a three-dimensional plot. One can plot the hazard function M λ(r|x) as a function of r and x (see Figure 6). It is seen that for young people the hazard has an early peak around bracket 7, that means, the probability of failing to reach higher categories is rather large around bracket 7 but tends to PT ED 170 be lower for higher categories. For older people the peak is much less distinct, and the probability of failing to reach higher categories is much lower. For all people, young and old, the probability of failing to reach higher categories increases strongly beyond category 18. However, one should have in mind that the number of observations is small in this range and standard errors are large. CE 175 As last example we consider the impact of height on income, an effect that AC has been described, for example, by Steckel (1983). The study showed that the 180 income of taller men and women tends to be higher. However, the functions are quite different for men and women. As seen from Figure 7, for women the effect is very stable across income categories, whereas, for men, it is strong for low income categories but tends to become weaker for increasing categories. It should be noted that the fitting of models strongly relies on smoothing 11 5 10 15 20 US CR −0.10 −6 −0.05 IP T 0.00 γage −2 −4 γ0 0 0.05 2 0.10 ACCEPTED MANUSCRIPT 5 income bracket 10 15 20 income bracket Figure 5: Varying coefficients for the discrete hazard model with age as covariate, showing baseline function (left) and the varying age effect (right) for SoeP data. AN Hazard function 0.5 0.4 0.3 M 0.6 PT ED 0.2 0.1 5 inc 10 om e b 15 rac ket 20 40 35 30 age 25 50 45 CE 20 Figure 6: Hazard as a function of income bracket and age for SoeP data. AC techniques. Without smoothing one mostly fits noise, even with only single co- 185 variates, and more distinctly if no covariates are included. Thus, the underlying structure is hardly seen. For several covariates without smoothing the fitting becomes very unstable or infeasible because of the large number of parameters in the model. Appropriate estimation methods are given in the next section. 12 0.10 0.05 IP T 0.00 γheight 5 10 15 20 US CR −0.10 −0.05 0.00 −0.10 −0.05 γheight 0.05 0.10 ACCEPTED MANUSCRIPT 5 income bracket 10 15 20 income bracket Figure 7: Varying coefficients for the discrete hazard model with height as covariate, showing the varying height effect for men (left) and the varying height effect for women (right) for AN SoeP data. 3. Estimation M Let (ri , xi ), ri ∈ {1, . . . , k}, i = 1, . . . , n denote the observations. We consider the fitting of the model with predictor PT ED ηir = γ0 (r) + xTi γ (r) . (5) Maximum Likelihood Estimation by Utilizing Binary Models As is well known, discrete hazard models might be fitted by using tools that fit binary response models. One can exploit that the probability of an CE observation in category ri can be written as P (Ii = ri |xi ) = λ(ri |xi ) rY i −1 (1 − λ(j|xi )) . j=1 AC By defining the sequence (yi1 , . . . , yiri ) = (0, . . . , 0, 1) of binary observations one obtains for the likelihood of the i-th observation Li = ri Y λ(s|xi )yis (1 − λ(s|xi ))1−yis , s=1 which is the likelihood of a binary response model with values yis ∈ {0, 1}. However, for each i one has ri binary observations. The values yis code the binary 13 ACCEPTED MANUSCRIPT indicators for the transition to the next category, respectively. Specifically, yis codes the transition from s to category s + 1. The total log-likelihood of the l∝ ri n X X yis log λ(s|xi ) + (1 − yis ) log(1 − λ(s|xi )). i=1 s=1 190 IP T model is given by (6) Early references for the representation of hazard models as binary Bernoulli Tutz (2001), Tutz and Schmid (2016). Smoothed Maximum Likelihood Estimation US CR trials are Brown (1975) and Laird and Olivier (1981), see also Fahrmeir and Log-likelihood estimation works well for simple models without time-varying 195 coefficients (3). However, for the models (4) considered here simple log-likelihood AN estimation cannot be recommended because of the large number of parameters. The reason for the large number of parameters is that for each variable and each category one parameter has to be fit. However, the parameters are given 200 M in the form of functions γj (.), j = 0, . . . , p and it is quite natural to assume that the effects for each covariate vary smoothly over the income categories and to PT ED use regularized estimates. Regularized estimates are common tools in today’s statistical analyses. By assuming smooth functions the number of effective parameters is strongly reduced. The only drawback is that some tuning parameter, which determines the degree of smoothing, has to be selected. 205 There are several ways to model smoothly varying coefficients. One concept CE that has been used widely in regression modelling is localization, see Fan and Gijbels (1996) and Loader (1999). It has been used to estimate varying coeffi- AC cients in hazard models by Kauermann et al. (2005). However, by construction 210 the approach uses the same degree of smoothing for all variables, which is a severe restriction. A superior approach, which allows that variables are characterized by varying coefficient functions (with the smoothness varying across variables) uses basis functions in the tradition of Eilers and Marx (1996). Within the basis functions 14 2 3 4 5 6 7 8 9 10 US CR 1 IP T 0.4 0.0 0.2 φ(r) 0.6 ACCEPTED MANUSCRIPT r Figure 8: Six B-splines φ1 (r), . . . , φ6 (r), r = 1, . . . , 10, of degrees 3 with knots that are equally spaced. γj (r) = m X s=1 AN approach one assumes that the functions can be represented as βjs φs (r), j = 0, . . . , p , (7) M where φ1 (r), . . . , φm (r) are m fixed basis functions. Eilers and Marx (1996) propose to use B-splines, which have a simple representation as truncated power functions, see Dierckx (1993), Eilers and Marx (1996), or Tutz (2012). For PT ED illustration Figure 8 shows B-splines of degree 3. The expansion of the function in basis functions yields the predictor ηir = p X xij γj (r) = j=0 p X m X xij φs (r)βjs , j=0 s=1 CE which again is a linear predictor, AC 215 ηir = (xi0 φ1 (r), xi0 φ2 (r), . . . , xip φm (r))β , where β collects all the parameters to be estimated. The number of parameters in the predictor is determined by the number of basis functions. If one uses a small number of basis functions, say M = 3, the parameterization is sparse since only three parameters per covariate have to be estimated. However, if one uses only three basis functions the functional 220 form of γj (.) is severely constrained. In order to allow for flexible functions, 15 ACCEPTED MANUSCRIPT the number of basis functions should be moderate to large, say between 20 and 30 (Marx and Eilers, 1998). The downside is that many parameters have to be estimated, which might result in unstable estimates and overfitting. These 225 IP T effects can be prevented by using the penalized estimation approach considered in the following. Penalized Log-Likelihood Estimation US CR When using a moderate or large number of basis functions one does not maximize the simple log-likelihood (6) but a penalized log-likelihood. Then the usual log-likelihood is replaced by lp (β) = l(β) − λ J(β) , 2 where l(β) is the log-likelihood function (6), λ is a tuning parameter, and J(β) AN is a penalty that puts restrictions on β. Usually, these restrictions prevent the elements in β from varying too strongly over categories, thus controlling the 230 roughness of the smooth function represented by the basis functions. M For B-splines with equally spaced knots Eilers and Marx (1996) proposed a penalty that is based on differences of coefficients. In the varying-coefficient PT ED model considered here the smooth functions are γ0 (r), . . . , γp (r). For equally spaced B-splines an appropriate penalty is given by Jd = p m X X (∆d βjs )2 , (8) j=0 s=d+1 where ∆ is the difference operator, operating on adjacent B-spline coefficients, CE that is, ∆βjs = βjs − βj,s−1 , ∆2 βjs = ∆(βjs − βj,s−1 ) = βjs − 2βj,s−1 + βj,s−2 . AC Thus, if one uses differences of first order d = 1 one obtains the simple form Jd = p m X X (βjs − βj,s−1 )2 . j=0 s=d+1 It is obvious that for large λ, the parameters βj1 , . . . , βjm are strongly restricted and the number of effective degrees of freedom is strongly reduced. The method is referred to as P-splines (for penalized splines). Alternative penalty terms, which include variable selection, were considered by Möst et al. (2016), Tutz 235 and Schmid (2016). 16 ACCEPTED MANUSCRIPT Using Program Packages A main advantage of the proposed discrete hazard model is that it can be fitted by using standard software for binary response models. A hands- 240 IP T on tutorial on the use of this class of models with an application to discrete time-to-event data was given by Berger and Schmid (2017). Before fitting models with predictor (5), one has to generate the required binary observations 245 US CR (yi1 , . . . , yiri ) = (0, . . . , 0, 1). This is done by the generation of an augmented data matrix, which is composed of a set of smaller (augmented) data matrices Pn for each individual. The resulting matrix has i=1 ri rows. In R the augmented data matrix can be generated by applying the function dataLong() from the R package discSurv (Welchowski and Schmid, 2017), which was originally designed for discrete survival models. Estimates can be obtained by use of the AN function gam() from the R package mgcv (Wood, 2017). The penalized loglikelihood maximization problem is solved by Penalized Iteratively Reweighted 250 Least Squares (P-IRLS) see, for example, Wood (2000). If not otherwise speci- M fied the starting values for λ(r|x) are internally set to (yir + 0.5)/2. For further details on the estimation procedures see also Wood (2011). The modeling of PT ED smoothly varying coefficients is done by use of the function s() with the variable that varies across income categories specified in the by-argument. Throughout 255 this article we used P -splines with cubic B-splines as basis functions (argument bs="ps") and a second order difference penalty (argument m=2), see equation (8). For all models we chose a moderate number of five basis functions (argu- CE ment k=6). The optimal smoothing parameter λ was computed by generalized cross-validation (see Wood, 2006). AC 260 4. Applications Typically one wants to evaluate the effect of explanatory variables in the presence of other covariates. Therefore, in the following we first consider a model for the SoeP data that contains more than just one covariate. Then we consider an alternative data set, namely the Current Population Survey (CPS) 17 ACCEPTED MANUSCRIPT 265 from the U.S. Census Bureau. 4.1. SoeP Data IP T In the following the effects of covariates on income are investigated by using the SoeP data with several explanatory variables in the predictor. The explana- tory variables are gender (1: women, 0: men), age in years, bluecol (1: blue colour worker, 0: none), working in the public sector (1: yes, 0: no), German cit- US CR 270 izen (1: yes, 0: no), married (1: yes, 0: no), and the interaction married:gender. We started with a model in which all covariate effects were allowed to vary across categories. Then we refitted the model by including all covariates that did not have varying coefficients as linear effects. Table 2 shows the parameter 275 estimates for fixed effects and Table 3 shows the results of the smooth estimates. AN It is seen that all variables are highly significant. Figure 9 shows the baseline function and the estimated effects of those covariates that were found to be varying across income categories. The variables that were not varying across 280 M categories were married, body size, age, bluecol and public. It is seen that married persons have a tendency to higher response categories, blue collar workers PT ED and people working in the public sector tend to have lower incomes, whereas age and body size considerably increase the income. The varying effect of gender is linear, it is positive at the beginning but becomes negative for large income categories with definitely larger confidence bands in the range of higher income. 285 The effect of German citizenship is quadratic and is unequal zero only for low income categories. Also the interaction effect married:gender is non-zero only CE for low income categories. It is also seen that the basic distribution represented by the function γ0 (.) is similar to the distribution found in the total sample. AC The example also shows that when the impact of several covariates is modeled 290 simultaneously not all covariates are found to have a varying effect. 4.2. Current Population Survey Data The Current Population Survey (CPS) is one of the oldest and largest surveys in the United States. It provides information on many of the things that define 18 ACCEPTED MANUSCRIPT Estimate Std. Error z value Pr(>|z|) married -0.40 0.05 -7.59 0.00 size -0.03 0.00 -8.68 0.00 age -0.04 0.00 -13.68 0.00 bluecol 1.49 0.05 29.69 0.00 public 0.24 0.06 3.93 0.00 US CR Variable IP T Table 2: Non-varying coefficients for SoeP data. Table 3: Smoothly varying coefficients for SoeP data. Ref.df Chi.sq p-value 4.92 4.99 497.92 0.00 s(IncIntervalNum):gender 2.01 2.02 111.77 0.00 s(IncIntervalNum):married:gender 2.91 3.28 37.71 0.00 22.74 0.00 s(IncIntervalNum):german AN edf s(IncIntervalNum) 3.11 3.49 295 M individuals and a society. We consider the Current Population Survey Tables for Personal Income, as provided in PINC-01. For our analysis we use the PT ED incomes reported by individuals with work experience (Worked ) in 2016. The table (pinc01 3 1 1.xls) contains personal income in brackets of length 2,500. The data sets are very large (n= 86,906 for men and n= 77,797 for women). We use a grouping into brackets of length 5,000 up to 100,000. The last group 300 is formed by personal income above 100,000, which was given in the tables available from CPS. The data can not be used for building models with more CE than one covariate since only tables for specific sub populations are available. For our purpose, which is illustration of the method, we focus on gender. AC Figure 10 shows the income distribution on the left hand side and the 305 (smoothed) corresponding function γ0 (.), which determines the hazard function, on the right hand side, when fitting model (3) for the whole data set including men and women. The hazard function shows distinct similarities to the hazard function of the SoeP data. It first increases strongly and then remains on a high level. However, one does not see a further increase in very high categories 19 IP T γgender γ0 2 5 10 15 20 5 10 15 20 4 2 0 −2 5 10 15 −4 M −4 −2 0 AN 2 γmarried:gender 4 6 income bracket 6 income bracket γgerman US CR −4 −4 −2 −2 0 0 4 2 6 4 8 6 10 ACCEPTED MANUSCRIPT 20 5 10 15 20 income bracket PT ED income bracket Figure 9: Varying coefficients for the discrete hazard model with several covariates, showing baseline function (upper left), the varying gender effect (upper right), the varying effect of german (lower left) and the varying effect of the interaction married:gender (lower right) for SoeP data. as in the SoeP data. This may be due to the large bracket (≥ 100, 000), which CE 310 contains many observations. However, for higher income no finer resolution is AC available. 315 Figure 11 shows the coefficient functions fitted separately for men and women. Again one sees differences but it is hard to get an overall picture of the differences between men and women. The varying gender coefficient, obtained when fitting model (4), given in Figure 12 shows that the effect is positive across the whole range of income categories. It is slightly stronger for small income 20 −1.5 −3.5 1 3 5 7 9 11 13 15 17 19 IP T US CR −3.0 γ0 −2.5 −2.0 0.15 0.10 0.05 0.00 relative frequency 0.20 ACCEPTED MANUSCRIPT 5 income bracket 10 15 20 income bracket Figure 10: Income distribution (left) and corresponding baseline function (right) for CPS data. AN categories and has a tendency to become weaker across categories. The same decrease has been seen in the German SoeP data. However, for the U.S. data 320 the effect is distinctly positive over the whole range of values, in contrast to M the German data for which confidence intervals are very wide for higher income categories. Due to the huge number of observations confidence intervals for the PT ED U.S. data are very small, they are sometimes hardly seen. 5. Concluding Remarks 325 We presented a tool that can be used to investigate the effect of single or more covariates on income. The basic ingredients of the tool are the logistic CE discrete hazard model with varying coefficients instead of linear effects, accompanied by appropriate estimation methods. While simple models with a linear AC predictor yield parameters that describe the overall effect of explanatory vari- 330 ables on income the varying coefficients model allows that effects of covariates can vary over income brackets. Thus the effect strength of explanatory variables in specific income brackets is seen. The effects that are found to vary across categories can be visualized in a curve, which gives a much more concise picture of the effects of covariates than a simple, frequently inappropriate linear effect. 21 −1.5 −3.5 1 3 5 7 9 11 13 15 17 19 10 15 20 15 20 −1.5 0.20 income bracket γ0 −3.0 −2.5 AN −2.0 0.15 0.10 1 3 5 7 9 −3.5 M 0.05 relative frequency IP T 5 income bracket 0.00 US CR −3.0 γ0 −2.5 −2.0 0.15 0.10 0.05 0.00 relative frequency 0.20 ACCEPTED MANUSCRIPT 11 13 15 17 19 5 10 income bracket PT ED income bracket Figure 11: Income distribution and baseline function for men (above) and women (below) for CPS data. 335 The estimation method is based on P-splines, which use B-splines as basis CE functions for the representation of the unknown functions together with a difference penalty that penalizes adjacent parameters in the expansion. P-splines have the advantage that they are also stable if one has only few observations AC in some of the brackets. The use of P-splines determines the trade-off between 340 bias and variance and reduces the effective number of parameters considerably, which typically yields rather smooth curves. Estimates for the proposed discrete hazard model are easily obtained by available program packages, which in all our applications had no problems of convergence. The model is not harder to use than classical regression models 22 IP T γgender 5 10 15 20 5 income bracket Figure 12: US CR −0.5 −3.5 −3.0 0.0 −2.5 γ0 0.5 −2.0 1.0 −1.5 ACCEPTED MANUSCRIPT 10 15 20 income bracket Varying coefficients for the discrete hazard model with gender as covariate, showing baseline function (left) and the varying gender effect (right) for CPS data. but the resulting curves are much more informative than single parameters. AN 345 We explicitly restricted our considerations to data where the income is reported in intervals. Although the exact income is more informative, in many M surveys income is collected in intervals, only, in order to reduce the non-response problem. If exact income is available, alternative methods that allow a closer look at the effect of variables on the income distribution are available, for ex- PT ED 350 ample, quantile regression (Fahrmeir et al., 2011) can be used, or generalized additive models for location, scale and shape (GAMLSS), a framework developed by Rigby and Stasinopoulos (2005). CE Acknowledgments 355 The authors thank the reviewers for their thorough reading and the helpful AC suggestions. References References Agresti, A., 2010. Analysis of Ordinal Categorical Data, 2nd Edition. Wiley, 360 New York. 23 ACCEPTED MANUSCRIPT Ananth, C.V., Kleinbaum, D.G., 1997. Regression models for ordinal responses: A review of methods and applications. International Journal of Epidemiology 26, 1323–1333. 365 data. American Journal of Epidemiology 129, 191–204. IP T Armstrong, B., Sloan, M., 1989. Ordinal regression models for epidemiologic US CR Berg, N., Lien, D., 2002. Measuring the effect of sexual orientation on income: Evidence of discrimination? Contemporary Economic Policy 20, 394–414. Berger, M., Schmid, M., 2017. Semiparametric regression for discrete time-toevent data. Statistical Modelling 18, 322–345. Bobbitt-Zeher, D., 2007. The gender income gap and the role of education. Sociology of Education 80, 1–22. AN 370 Brown, C., 1975. On the use of indicator variables for studying the time- M dependence of parameters in a response-time model. Biometrics 31, 863–872. Deaton, A., Arora, R., 2009. Life at the top: the benefits of height. Economics & Human Biology 7, 133–136. PT ED 375 Dierckx, P., 1993. Curve and Surface Fitting with Splines. Oxford Science Publications, Oxford. Eilers, P.H.C., Marx, B.D., 1996. Flexible smoothing with B-splines and Penal- CE ties. Statistical Science 11, 89–121. 380 Ermini, L., Hendry, D.F., 2008. Log income vs. linear income: An application AC of the encompassing principle. Oxford Bulletin of Economics and Statistics 385 70, 807–827. Fahrmeir, L., Kneib, T., Lang, S., Marx, B., 2011. Regression. Models, Methods and Applications. Springer Verlag, Berlin. Fahrmeir, L., Tutz, G., 2001. Multivariate Statistical Modelling based on Generalized Linear Models. Springer, New York. 24 ACCEPTED MANUSCRIPT Fan, J., Gijbels, I., 1996. Local Polynomial Modelling and Its Applications. Chapman & Hall, London. Hastie, T., Tibshirani, R., 1993. Varying-coefficient models. Journal of the Royal Statistical Society B 55, 757–796. IP T 390 Hazelton, M.L., Turlach, B.A., 2011. Semiparametric regression with shape- US CR constrained penalized splines. Computational Statistics & Data Analysis 55, 2871–2879. Hinze, S.W., 2000. Inside medical marriages: The effect of gender on income. 395 Work and occupations 27, 464–499. Kauermann, G., Tutz, G., Brüderl, J., 2005. The survival of newly founded AN firms: A case-study into varying-coefficient models. Journal of the Royal Statistical Society A 168, 145–158. Laara, E., Matthews, J.N., 1985. The equivalence of two models for ordinal data. Biometrika 72, 206–207. M 400 PT ED Laird, N., Olivier, D., 1981. Covariance analysis of censored survival data using log-linear analysis techniques. Journal of the American Statistical Association 76, 231–240. Loader, C., 1999. Local Regression and Likelihood. Springer, New York. 405 Marini, M.M., 1989. Sex differences in earnings in the united states. Annual Re- CE view of Sociology 15, 343–380. doi:10.1146/annurev.so.15.080189.002015. AC Marx, B.D., Eilers, P.H.C., 1998. Direct generalized additive modelling with 410 penalized likelihood. Computational Statistics & Data Analysis 28, 193–209. McCullagh, P., 1980. Regression model for ordinal data (with discussion). Journal of the Royal Statistical Society B 42, 109–127. McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models (Second Edition). Chapman & Hall, New York. 25 ACCEPTED MANUSCRIPT Möst, S., Pössnecker, W., Tutz, G., 2016. Variable selection for discrete competing risks models. Quality and Quantity 50, 1589 – 1610. Muller, A., 2002. Education, income inequality, and mortality: A multiple regression analysis. Biometrical Journal 324, 23. IP T 415 Peyhardi, J., Trottier, C., Guédon, Y., 2015. A new specification of generalized US CR linear models for categorical data. Biometrika , 889–906. Rigby, R.A., Stasinopoulos, D.M., 2005. Generalized additive models for loca420 tion, scale and shape. Journal of the Royal Statistical Society C 54, 507–554. Singh, S., Maddala, G.S., 2008. A function for size distribution of incomes, in: AN Modeling income distributions and Lorenz curves. Springer, pp. 27–35. Steckel, R.H., 1983. Height and per capita income. Historical Methods: A Journal of Quantitative and Interdisciplinary History 16, 1–7. Telles, E.E., Murguia, E., 1990. Phenotypic discrimination and income differ- M 425 ences among mexican americans. Social Science Quarterly 71, 682. PT ED Tutz, G., 1991. Sequential models in ordinal regression. Computational Statistics & Data Analysis 11, 275–295. Tutz, G., 2012. Regression for Categorical Data. Cambridge University Press. 430 Tutz, G., Leitenstorfer, F., 2007. Generalized smooth monotonic regression in CE additive modeling. Journal of Computational and Graphical Statistics 16, 165–188. AC Tutz, G., Schmid, M., 2016. Modeling Discrete Time-to-Event Data. Springer, 435 New York. Welchowski, T., Schmid, M., 2017. discSurv: Discrete time survival analysis. URL: http://CRAN.R-project.org/package=discSurv. R package version 1.1.7. 26 ACCEPTED MANUSCRIPT Wood, S.N., 2000. Modelling and smoothing parameter estimation with multiple quadratic penalties. Journal of the Royal Statistical Society B 62, 413–428. 440 Wood, S.N., 2006. Generalized Additive Models: An Introduction with R. IP T Chapman & Hall/CRC, London. Wood, S.N., 2011. Fast stable restricted maximum likelihood and marginal the Royal Statistical Society B 73, 3–36. 445 Wood, S.N., 2017. mgcv: US CR likelihood estimation of semiparametric generalized linear models. Journal of Mixed GAM computation vehicle with GCV/AIC/REML smoothness estimation. URL: https://CRAN.R-project. AC CE PT ED M AN org/package=mgcv. R package version 1.8-15. 27