Accepted Manuscript
The Effect of Explanatory Variables on Income: A Tool that Allows a
Closer Look at the Differences in Income
Gerhard Tutz, Moritz Berger
PII:
DOI:
Reference:
S2452-3062(18)30103-5
https://doi.org/10.1016/j.ecosta.2018.12.001
ECOSTA 125
To appear in:
Econometrics and Statistics
Received date:
Revised date:
Accepted date:
8 April 2018
4 December 2018
5 December 2018
Please cite this article as: Gerhard Tutz, Moritz Berger, The Effect of Explanatory Variables on Income:
A Tool that Allows a Closer Look at the Differences in Income, Econometrics and Statistics (2018), doi:
https://doi.org/10.1016/j.ecosta.2018.12.001
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
The Effect of Explanatory Variables on Income: A Tool
that Allows a Closer Look at the Differences in Income
IP
T
Gerhard Tutza,∗, Moritz Bergerb
a Ludwig-Maximilians-Universität
München, Ludwigstraße 33, D-80539 München, Germany
für Medizinische Biometrie, Informatik und Epidemiologie, Universitätsklinikum
Bonn, Sigmund-Freud-Straße 25, 53127 Bonn, Germany
US
CR
b Institut
Abstract
Investigation of the effect of covariates on income typically relies on regression
models with a transformed income. An underlying assumption is that the exact
AN
income is available. However, in surveys reported income is often available in
income brackets only. For such grouped data one can use ordered regression
models, which in their simplest form with a linear predictor work in a similar
M
way as regression models for exact income. They yield an overall measure of
the effect of covariates but fail to detect the specific structure of the effects of
single covariates. A model is proposed that allows a closer look at the effect
PT
ED
of single covariates, showing in more detail how the income is determined by
explanatory variables. The model exploits the potential of sequential regression
models, which are extended to allow for varying coefficients. The model is not
harder to use than classical regression models but is much more informative.
The method is illustrated by using data from the German Socio-Economic Panel
CE
Study and the United States Census Bureau.
Keywords: Income, Sequential model, Discrete hazard model, Varying
AC
coefficients, Ordinal models.
∗ Corresponding
author
Preprint submitted to Econometrics and Statistics
December 19, 2018
ACCEPTED MANUSCRIPT
1. Introduction
Income is determined by many variables, including among others educational
level, gender, age, obesity, and height. In applied research the traditional way
5
IP
T
to investigate the effect of explanatory variables is to use regression models,
see, for example, Steckel (1983), Telles and Murguia (1990), Hinze (2000), Berg
and Lien (2002), Deaton and Arora (2009). When using regression models one
US
CR
can estimate the effect of covariates to obtain results concerning, for example,
the gender gap in income. For example, Bobbitt-Zeher (2007) conclude that
college-educated men in their mid-20s already earn, on average, about $ 7, 000
10
more per year than college-educated women. For similar investigations of the
gender gap see Marini (1989).
AN
If the income is available exactly and on a metric scale one can use classical
linear regression models. Since income itself is hardly normally distributed one
typically uses log-transformed income, see, for example, Muller (2002), Ermini
and Hendry (2008). Alternatively, and often more appropriately, one can use
M
15
distribution models as the Pareto distribution (Singh and Maddala, 2008) or a
generalized linear model (McCullagh and Nelder, 1989). A candidate from this
PT
ED
family of models is the gamma distribution model. It is able to account for the
skewness of income distributions and the form of the distribution automatically
20
adapts to larger means. In all these regression models one assumes a strict form
of the income distribution and models the effect of explanatory variables on the
mean of income or log-transformed income.
CE
The methods considered here assume that income is given in intervals only.
The main reason is that in many surveys income is available in income brackets
AC
25
as, for example, in the Current Population Survey (CPS), which is considered
later. Moreover, for self reported income intervals the non-response problem
tends to be weaker when respondents do not know or do not want to tell their
exact income, but are willing to report in which bracket they are. With income
given in intervals one has an ordered categorical response and generalized linear
30
regression models for ordinal responses can be used. As classical regression
2
ACCEPTED MANUSCRIPT
models for exact income, the generalized linear models for ordinal responses
are simple to use and the effect of a variable typically is determined by one
parameter. Thus, if one models, for example, the effect of gender on income one
35
IP
T
obtains one parameter, which reflects the difference in income between men and
women. One obtains a crude overall picture of the difference. What is not seen,
is how the effect was generated by the data. For example, there might be a
US
CR
strong difference between gender groups in higher incomes, in lower incomes, or
the same differences could be found in all income groups. The method proposed
here allows a closer look at the effect of variables on income, which may vary
40
across the income intervals. Rather than obtaining a single parameter for each
variable the effect of each variable is represented as a curve, which shows the
conditional probabilities of reaching a higher income level given the previous
AN
level has been reached.
In Section 2 the basic concept is introduced. In Section 3 estimation concepts
45
for simple models and the proposed model with varying coefficients are given.
M
The usefulness of the method is illustrated in Section 4 by considering two real
PT
ED
data sets. The paper concludes with some remarks.
2. Hazard Models for Income Categories
Let, in general, the income I be given in categories with I ∈ {1, . . . , k},
50
with the categories being build from grouping into income intervals [a0 , a1 ),
. . . , [ak−1 , ak ), where a0 = 0 and ak = ∞ .
CE
For illustration of the method we will use data from the Socio-Economic
Panel Study (SoeP), a representative longitudinal study of private households
AC
in Germany. The data set investigated here consists of n = 4023 full-time
55
employees who are between 20 and 60 years of age and living in Germany. The
grouping of yearly income is given in Table 1. What one observes is a discrete
response, more precisely, an ordered categorical response and one wants to model
I|x, where x is a vector of explanatory variables. Instead of modeling the mean
of the response variable given x, which represents a rather crude measure of
3
ACCEPTED MANUSCRIPT
1
< 5,000
2
..
.
5,000 - 10,000
..
.
20
95,000 - 100,000
21
100,000 - 110,000
22
110,000 - 130,000
US
CR
Income (Euro)
23
130,000 - 160,000
24
> 160,000
the location, one wants a model that accounts for the form of the distribution.
AN
60
Bracket
IP
T
Table 1: Grouping of yearly income in 24 brackets for SoeP data.
Candidates are ordinal regression models as proposed by McCullagh (1980).
More recent overviews on ordinal regression are found in Agresti (2010) and Tutz
(2012). The most widely used ordinal response model is the cumulative model,
65
M
however for our purpose another model from the family of ordinal response
models is more appropriate, namely the sequential logit model.
PT
ED
The basic form of the model is given by
P (I = r|I ≥ r, x) = F (ηr ) = F (γ0 (r) + xT γ ),
r = 1, . . . , k − 1 ,
(1)
where F (.) is a distribution function and γ0 (.) contains the parameters in the
form of a function with the income category as the argument. We will use
CE
the logistic response function F (ηr ) = exp(ηr )/(1 + exp(ηr )), which yields the
AC
logistic sequential model
P (I = r|I ≥ r, x) =
exp(γ0 (r) + xT γ)
.
1 + exp(γ0 (r) + xT γ)
(2)
It is also known as the continuation ratio model and has the alternative representation
log
P (I = r|x)
P (I > r|x)
= γ0 (r) + xT γ .
The interesting feature of the model is the term on the left hand side of (1). It
4
ACCEPTED MANUSCRIPT
is also known as a (discrete) hazard rate
λ(r|x) = P (I = r|I ≥ r, x),
r = 1, . . . , k − 1 ,
IP
T
which is the conditional probability of a response in income category r, i.e. in
interval [ar−1 , ar ), given the income category r has been reached. The counterpart is the conditional transition probability
r = 1, . . . , k − 1 ,
US
CR
tr(r|x) = 1 − λ(r|x),
which is the conditional probability of a response in a higher income category
than r given the income category r has been reached.
The discrete functions λ(.|x) and tr(.|x) provide a specific view on the distribution of I|x. The former represents the probability of failing to reach the
AN
next income level, the latter is the probability of reaching the next level given
the previous level has been reached. Each transition can be seen as a binary
response, which denotes successful transition or failing. The modeling implicitly
M
assumes that people aim at obtaining higher income. The model (1) can be seen
as directly modeling the binary response, transition or failing. It models the
local hazard as a binary response
PT
ED
λ(r|x) = F (γ0 (r) + xT γ ),
r = 1, . . . , k − 1 .
(3)
We will refer to model (3) as the (logistic) discrete hazard model rather than
using the term sequential model.
70
Sequential and discrete hazard models were considered by McCullagh (1980),
CE
Laara and Matthews (1985), Armstrong and Sloan (1989), Tutz (1991), Ananth
and Kleinbaum (1997). Peyhardi et al. (2015) gave a careful investigation of
AC
the relationship among ordinal models and derived invariance properties for the
75
sequential model. The link to discrete survival modeling was considered, for
example, by Fahrmeir and Tutz (2001), where the sequential model is treated
within the framework of multivariate generalized linear models. An extensive
treatment of discrete survival models is found in Tutz and Schmid (2016). However, it cannot be used in its common form. The essential tool is the extension
to allow for varying coefficients, which is considered in the following.
5
ACCEPTED MANUSCRIPT
80
A Hazard Model for the Whole Population
Let us briefly consider the simplest case without covariates. Then the discrete hazard function has the form λ(r|x) = F (γ0 (r)) and the number of param-
IP
T
eters of the model equals the number of income categories. Therefore, without
further constraints the model fits the observed distribution perfectly. The func85
tion λ(.|x) is just a different representation of the distribution of the income.
US
CR
For illustration let us consider the income distribution found in the SoeP data.
Figure 1 shows the income distribution on the left hand side and the (smoothed)
corresponding function γ0 (.), which determines the hazard function, on the right
hand side. It is seen that the hazard rate, that is, the probability of failing to
90
reach a higher income category is very small at the beginning but increases
quickly to reach a constant level beginning at category 6, [55,000 - 60,000). Be-
AN
tween category 6 and 12 the probability of the lower income category remains
on a high level, which even increases for higher levels of income indicating that
the transition to very high income categories gets harder. The form of the hazard rate is typical for income data and was found in various sub samples, for
M
95
example when stratifying for gender (see the applications in the following). One
PT
ED
might even assume that hazard rates are monotonically increasing and might
use estimation methods that allow to include monotonicity, see, for example
Tutz and Leitenstorfer (2007), Hazelton and Turlach (2011). Since monotonic100
ity assumptions are only sensible in the case without covariates we abstain from
making this additional assumption. It should be noted that we always present
CE
smoothed estimates, which are explained later, because with unsmoothed estimates the danger of fitting noise is high and typically the basic form of the
AC
function gets lost.
105
Hazard Rate Modelling with Covariates
The objective of regression modelling is to quantify the impact of explanatory
variables on the response. Within the framework of discrete hazard models the
common form is the representation (3), in which γ represents the weights which
quantify the impact of explanatory variables on the income distribution.
6
0
IP
T
γ0
−2
0.15
−4
0.10
1
3
5
7
9 11
15
19
US
CR
0.00
−6
0.05
relative frequency
0.20
2
0.25
ACCEPTED MANUSCRIPT
23
5
income bracket
10
15
20
income bracket
Figure 1: Income distribution (left) and corresponding baseline function (right) for SoeP data.
Important aspects of the model are the following: Since the functional form
AN
110
of γ0 (.) is not fixed the (discrete) income distribution is not specified to follow a specific distribution. In contrast, the use of the linear term xT γ =
M
(x1 , . . . , xp )(γ1 , . . . , γp )T is rather restrictive. It means that the effect of covariates is the same for each income category, more concrete, log(λ(r|x)/(1 −
115
λ(r|x))) changes by γj if xj increases by one unit, and the change does not de-
PT
ED
pend on r. Thus the effect of a covariate is determined by just one parameter.
If one compares, for example, the income of gender groups, one has just one
gender coefficient, which is assumed to be the same for all the transitions. This
is a rather strict assumption that is hardly realistic in applications.
A model that allows a closer look into the shift of income due to explanatory
CE
variables is a model that lets the weight parameters γ depend on the income cat-
AC
egories. The corresponding model has the form λ(r|x) = F (ηr ) with predictor
120
ηr = γ0 (r) + xT γ (r) ,
(4)
where γ(r) = (γ1 (r), . . . , γp (r)) is a vector-valued function of income categories
with γj (1), . . . , γj (k − 1) representing the category-varying coefficients of the
j-th covariate. The function γj (.) allows to model the impact of an explanatory variable in a very flexible way. Formally, the model is a varying-coefficient
7
ACCEPTED MANUSCRIPT
model, a class of models that has been introduced by Hastie and Tibshirani
125
(1993). In general, in varying-coefficient models the effect of one or more covariates are modified by other covariates, the so-called effect modifiers. Here
IP
T
the effect modifying variable is income measured in intervals. It modifies the
effect of explanatory variables as, for example, gender. It means that the effect
of an explanatory variable on the transition to higher income categories depends
on the category. For example, it can be strong for small categories but weak
US
CR
130
for high categories. The data determine the variation of the effect over categories. The function γ0 (.) represents the basic form of the income distribution
and, borrowing terminology from the hazard modeling literature, will be called
baseline function.
135
For illustration let us consider the effect of single variables when using
AN
model (4). First of all, Figure 2 shows the distributions and the functions
γ0 (.) fitted separately for male and female employees. It is immediately seen
that the distributions are quite different with a definitely higher income for men.
140
M
Consequently the fitted functions γ0 (.) differ across gender groups. However, it
is not easy to compare the two functions, because as far as their difference is
PT
ED
concerned they are not very informative.
When fitting the regression model (4) one can investigate the effect of gender across categories. In Figure 3 the function γ0 (.) and the effect modifying
function γGender (.) are plotted. For the interpretation of the effects it is useful
to have a look at the proposed logistic discrete hazard model, which for the
CE
binary covariate gender can be given by
AC
P (I ≤ r|x)
= exp(γ0 (r)) exp(xGender γGender (r)),
P (I > r|x)
145
r = 1, . . . , k − 1 .
The first term on the right hand side contains the baseline function. It does
not depend on the value of the covariate and is therefore always present in the
model. The second term on the right hand side represents the varying effect
of the covariate. If x = 0 only the baseline function is present in the model
and one obtains the function for the reference population. Since Gender = 1
denotes females and Gender = 0 denotes males, men are the reference popula8
0
IP
T
γ0
−2
0.15
−4
0.10
1
3
5
7
9 11
15
19
23
5
10
15
20
income bracket
γ0
1
3
5
7
9 11
15
−6
M
0.00
0.05
−4
0.10
−2
AN
0.15
0
0.20
2
0.25
income bracket
relative frequency
US
CR
0.00
−6
0.05
relative frequency
0.20
2
0.25
ACCEPTED MANUSCRIPT
19
23
5
10
15
20
income bracket
PT
ED
income bracket
Figure 2: Income distribution and baseline function for men (above) and women (below) for
SoeP data.
tion. Therefore, the baseline function represents the baseline function for the
CE
male sub population, which is the same as in Figure 2. The function γGender (.)
150
represents the deviation from the male reference population. It shows where
the differences between gender groups are located. One can see that females
AC
have higher probabilities of remaining in lower income categories right from
155
the beginning because the function γGender (.) is positive, which means a higher
probability of remaining in lower categories. It is positive up to about category 17. For categories larger than 17 the estimated effect is sometimes even
negative but the confidence intervals are so wide that one cannot conclude that
the effect is significantly different from zero. Overall, one sees a strong effect for
9
IP
T
0
γgender
−1
−2
5
10
15
20
US
CR
−3
−6
−2
−4
γ0
1
0
2
2
3
ACCEPTED MANUSCRIPT
5
income bracket
10
15
20
income bracket
Figure 3: Varying coefficients for the discrete hazard model with gender as covariate, showing
baseline function (left) and the varying gender effect (right) for SoeP data.
AN
low categories and a tendency to decrease across categories, which means that
the difference in transition rates decreases across categories.
Instead of 0-1 coding one could also use the symmetric so-called effect coding,
M
with xGender = 1 for females and xGender = −1 for males. Then, the estimated
baseline function is no longer the function for men but for the mixture of the
PT
ED
two gender groups. For the cumulative odds one obtains
P (I ≤ r|female)
= exp(γ0 (r)) exp(γGender (r)) ,
P (I > r|female)
P (I ≤ r| male)
= exp(γ0 (r)) exp(−γGender (r)) ,
P (I > r| male)
with the function γGender (.) for females and −γGender (.) for males. A direct
AC
CE
comparison of the two gender groups is captured by the cumulative odds ratio
160
P (I ≤ r|female)/P (I > r|female)
= exp(2γGender (r)) .
P (I ≤ r| male)/P (I > r| male)
The effect function γGender (.) has a similar form as for 0-1 coding of gender, but
the values are smaller, see Figure 4.
A second example is the impact of age on income. Figure 5 shows the
fitted functions γ0 (.) and γage (.). It is seen that the function γage (.) is negative,
indicating that older people tend to have higher income. The effect is distinct in
10
IP
T
0
γgender
−1
−2
5
10
15
20
US
CR
−3
−6
−2
−4
γ0
1
0
2
2
3
ACCEPTED MANUSCRIPT
5
income bracket
10
15
20
income bracket
Figure 4: Varying coefficients for the discrete hazard model with gender as covariate in effect
coding, showing baseline function (left) and the varying gender effect (right) for SoeP data.
low to medium income categories and seems to be not significant for high income
AN
165
categories. In the case of a metric variable it seems useful to visualize the impact
of the covariate in a three-dimensional plot. One can plot the hazard function
M
λ(r|x) as a function of r and x (see Figure 6). It is seen that for young people
the hazard has an early peak around bracket 7, that means, the probability of
failing to reach higher categories is rather large around bracket 7 but tends to
PT
ED
170
be lower for higher categories. For older people the peak is much less distinct,
and the probability of failing to reach higher categories is much lower. For
all people, young and old, the probability of failing to reach higher categories
increases strongly beyond category 18. However, one should have in mind that
the number of observations is small in this range and standard errors are large.
CE
175
As last example we consider the impact of height on income, an effect that
AC
has been described, for example, by Steckel (1983). The study showed that the
180
income of taller men and women tends to be higher. However, the functions
are quite different for men and women. As seen from Figure 7, for women the
effect is very stable across income categories, whereas, for men, it is strong for
low income categories but tends to become weaker for increasing categories.
It should be noted that the fitting of models strongly relies on smoothing
11
5
10
15
20
US
CR
−0.10
−6
−0.05
IP
T
0.00
γage
−2
−4
γ0
0
0.05
2
0.10
ACCEPTED MANUSCRIPT
5
income bracket
10
15
20
income bracket
Figure 5: Varying coefficients for the discrete hazard model with age as covariate, showing
baseline function (left) and the varying age effect (right) for SoeP data.
AN
Hazard function
0.5
0.4
0.3
M
0.6
PT
ED
0.2
0.1
5
inc 10
om
e b 15
rac
ket 20
40
35
30 age
25
50
45
CE
20
Figure 6: Hazard as a function of income bracket and age for SoeP data.
AC
techniques. Without smoothing one mostly fits noise, even with only single co-
185
variates, and more distinctly if no covariates are included. Thus, the underlying
structure is hardly seen. For several covariates without smoothing the fitting
becomes very unstable or infeasible because of the large number of parameters
in the model. Appropriate estimation methods are given in the next section.
12
0.10
0.05
IP
T
0.00
γheight
5
10
15
20
US
CR
−0.10
−0.05
0.00
−0.10
−0.05
γheight
0.05
0.10
ACCEPTED MANUSCRIPT
5
income bracket
10
15
20
income bracket
Figure 7: Varying coefficients for the discrete hazard model with height as covariate, showing
the varying height effect for men (left) and the varying height effect for women (right) for
AN
SoeP data.
3. Estimation
M
Let (ri , xi ), ri ∈ {1, . . . , k}, i = 1, . . . , n denote the observations. We consider the fitting of the model with predictor
PT
ED
ηir = γ0 (r) + xTi γ (r) .
(5)
Maximum Likelihood Estimation by Utilizing Binary Models
As is well known, discrete hazard models might be fitted by using tools
that fit binary response models. One can exploit that the probability of an
CE
observation in category ri can be written as
P (Ii = ri |xi ) = λ(ri |xi )
rY
i −1
(1 − λ(j|xi )) .
j=1
AC
By defining the sequence (yi1 , . . . , yiri ) = (0, . . . , 0, 1) of binary observations one
obtains for the likelihood of the i-th observation
Li =
ri
Y
λ(s|xi )yis (1 − λ(s|xi ))1−yis ,
s=1
which is the likelihood of a binary response model with values yis ∈ {0, 1}. However, for each i one has ri binary observations. The values yis code the binary
13
ACCEPTED MANUSCRIPT
indicators for the transition to the next category, respectively. Specifically, yis
codes the transition from s to category s + 1. The total log-likelihood of the
l∝
ri
n X
X
yis log λ(s|xi ) + (1 − yis ) log(1 − λ(s|xi )).
i=1 s=1
190
IP
T
model is given by
(6)
Early references for the representation of hazard models as binary Bernoulli
Tutz (2001), Tutz and Schmid (2016).
Smoothed Maximum Likelihood Estimation
US
CR
trials are Brown (1975) and Laird and Olivier (1981), see also Fahrmeir and
Log-likelihood estimation works well for simple models without time-varying
195
coefficients (3). However, for the models (4) considered here simple log-likelihood
AN
estimation cannot be recommended because of the large number of parameters.
The reason for the large number of parameters is that for each variable and
each category one parameter has to be fit. However, the parameters are given
200
M
in the form of functions γj (.), j = 0, . . . , p and it is quite natural to assume that
the effects for each covariate vary smoothly over the income categories and to
PT
ED
use regularized estimates. Regularized estimates are common tools in today’s
statistical analyses. By assuming smooth functions the number of effective parameters is strongly reduced. The only drawback is that some tuning parameter,
which determines the degree of smoothing, has to be selected.
205
There are several ways to model smoothly varying coefficients. One concept
CE
that has been used widely in regression modelling is localization, see Fan and
Gijbels (1996) and Loader (1999). It has been used to estimate varying coeffi-
AC
cients in hazard models by Kauermann et al. (2005). However, by construction
210
the approach uses the same degree of smoothing for all variables, which is a
severe restriction.
A superior approach, which allows that variables are characterized by varying
coefficient functions (with the smoothness varying across variables) uses basis
functions in the tradition of Eilers and Marx (1996). Within the basis functions
14
2
3
4
5
6
7
8
9
10
US
CR
1
IP
T
0.4
0.0
0.2
φ(r)
0.6
ACCEPTED MANUSCRIPT
r
Figure 8: Six B-splines φ1 (r), . . . , φ6 (r), r = 1, . . . , 10, of degrees 3 with knots that are equally
spaced.
γj (r) =
m
X
s=1
AN
approach one assumes that the functions can be represented as
βjs φs (r),
j = 0, . . . , p ,
(7)
M
where φ1 (r), . . . , φm (r) are m fixed basis functions. Eilers and Marx (1996)
propose to use B-splines, which have a simple representation as truncated power
functions, see Dierckx (1993), Eilers and Marx (1996), or Tutz (2012). For
PT
ED
illustration Figure 8 shows B-splines of degree 3.
The expansion of the function in basis functions yields the predictor
ηir =
p
X
xij γj (r) =
j=0
p X
m
X
xij φs (r)βjs ,
j=0 s=1
CE
which again is a linear predictor,
AC
215
ηir = (xi0 φ1 (r), xi0 φ2 (r), . . . , xip φm (r))β ,
where β collects all the parameters to be estimated.
The number of parameters in the predictor is determined by the number
of basis functions. If one uses a small number of basis functions, say M = 3,
the parameterization is sparse since only three parameters per covariate have
to be estimated. However, if one uses only three basis functions the functional
220
form of γj (.) is severely constrained. In order to allow for flexible functions,
15
ACCEPTED MANUSCRIPT
the number of basis functions should be moderate to large, say between 20 and
30 (Marx and Eilers, 1998). The downside is that many parameters have to
be estimated, which might result in unstable estimates and overfitting. These
225
IP
T
effects can be prevented by using the penalized estimation approach considered
in the following.
Penalized Log-Likelihood Estimation
US
CR
When using a moderate or large number of basis functions one does not
maximize the simple log-likelihood (6) but a penalized log-likelihood. Then the
usual log-likelihood is replaced by
lp (β) = l(β) −
λ
J(β) ,
2
where l(β) is the log-likelihood function (6), λ is a tuning parameter, and J(β)
AN
is a penalty that puts restrictions on β. Usually, these restrictions prevent the
elements in β from varying too strongly over categories, thus controlling the
230
roughness of the smooth function represented by the basis functions.
M
For B-splines with equally spaced knots Eilers and Marx (1996) proposed
a penalty that is based on differences of coefficients. In the varying-coefficient
PT
ED
model considered here the smooth functions are γ0 (r), . . . , γp (r). For equally
spaced B-splines an appropriate penalty is given by
Jd =
p
m
X
X
(∆d βjs )2 ,
(8)
j=0 s=d+1
where ∆ is the difference operator, operating on adjacent B-spline coefficients,
CE
that is, ∆βjs = βjs − βj,s−1 , ∆2 βjs = ∆(βjs − βj,s−1 ) = βjs − 2βj,s−1 + βj,s−2 .
AC
Thus, if one uses differences of first order d = 1 one obtains the simple form
Jd =
p
m
X
X
(βjs − βj,s−1 )2 .
j=0 s=d+1
It is obvious that for large λ, the parameters βj1 , . . . , βjm are strongly restricted
and the number of effective degrees of freedom is strongly reduced. The method
is referred to as P-splines (for penalized splines). Alternative penalty terms,
which include variable selection, were considered by Möst et al. (2016), Tutz
235
and Schmid (2016).
16
ACCEPTED MANUSCRIPT
Using Program Packages
A main advantage of the proposed discrete hazard model is that it can
be fitted by using standard software for binary response models. A hands-
240
IP
T
on tutorial on the use of this class of models with an application to discrete
time-to-event data was given by Berger and Schmid (2017). Before fitting models with predictor (5), one has to generate the required binary observations
245
US
CR
(yi1 , . . . , yiri ) = (0, . . . , 0, 1). This is done by the generation of an augmented
data matrix, which is composed of a set of smaller (augmented) data matrices
Pn
for each individual. The resulting matrix has i=1 ri rows. In R the augmented
data matrix can be generated by applying the function dataLong() from the
R package discSurv (Welchowski and Schmid, 2017), which was originally designed for discrete survival models. Estimates can be obtained by use of the
AN
function gam() from the R package mgcv (Wood, 2017). The penalized loglikelihood maximization problem is solved by Penalized Iteratively Reweighted
250
Least Squares (P-IRLS) see, for example, Wood (2000). If not otherwise speci-
M
fied the starting values for λ(r|x) are internally set to (yir + 0.5)/2. For further
details on the estimation procedures see also Wood (2011). The modeling of
PT
ED
smoothly varying coefficients is done by use of the function s() with the variable
that varies across income categories specified in the by-argument. Throughout
255
this article we used P -splines with cubic B-splines as basis functions (argument
bs="ps") and a second order difference penalty (argument m=2), see equation
(8). For all models we chose a moderate number of five basis functions (argu-
CE
ment k=6). The optimal smoothing parameter λ was computed by generalized
cross-validation (see Wood, 2006).
AC
260
4. Applications
Typically one wants to evaluate the effect of explanatory variables in the
presence of other covariates. Therefore, in the following we first consider a
model for the SoeP data that contains more than just one covariate. Then we
consider an alternative data set, namely the Current Population Survey (CPS)
17
ACCEPTED MANUSCRIPT
265
from the U.S. Census Bureau.
4.1. SoeP Data
IP
T
In the following the effects of covariates on income are investigated by using
the SoeP data with several explanatory variables in the predictor. The explana-
tory variables are gender (1: women, 0: men), age in years, bluecol (1: blue
colour worker, 0: none), working in the public sector (1: yes, 0: no), German cit-
US
CR
270
izen (1: yes, 0: no), married (1: yes, 0: no), and the interaction married:gender.
We started with a model in which all covariate effects were allowed to vary
across categories. Then we refitted the model by including all covariates that
did not have varying coefficients as linear effects. Table 2 shows the parameter
275
estimates for fixed effects and Table 3 shows the results of the smooth estimates.
AN
It is seen that all variables are highly significant. Figure 9 shows the baseline
function and the estimated effects of those covariates that were found to be
varying across income categories. The variables that were not varying across
280
M
categories were married, body size, age, bluecol and public. It is seen that married persons have a tendency to higher response categories, blue collar workers
PT
ED
and people working in the public sector tend to have lower incomes, whereas age
and body size considerably increase the income. The varying effect of gender
is linear, it is positive at the beginning but becomes negative for large income
categories with definitely larger confidence bands in the range of higher income.
285
The effect of German citizenship is quadratic and is unequal zero only for low
income categories. Also the interaction effect married:gender is non-zero only
CE
for low income categories. It is also seen that the basic distribution represented
by the function γ0 (.) is similar to the distribution found in the total sample.
AC
The example also shows that when the impact of several covariates is modeled
290
simultaneously not all covariates are found to have a varying effect.
4.2. Current Population Survey Data
The Current Population Survey (CPS) is one of the oldest and largest surveys
in the United States. It provides information on many of the things that define
18
ACCEPTED MANUSCRIPT
Estimate
Std. Error
z value
Pr(>|z|)
married
-0.40
0.05
-7.59
0.00
size
-0.03
0.00
-8.68
0.00
age
-0.04
0.00
-13.68
0.00
bluecol
1.49
0.05
29.69
0.00
public
0.24
0.06
3.93
0.00
US
CR
Variable
IP
T
Table 2: Non-varying coefficients for SoeP data.
Table 3: Smoothly varying coefficients for SoeP data.
Ref.df
Chi.sq
p-value
4.92
4.99
497.92
0.00
s(IncIntervalNum):gender
2.01
2.02
111.77
0.00
s(IncIntervalNum):married:gender
2.91
3.28
37.71
0.00
22.74
0.00
s(IncIntervalNum):german
AN
edf
s(IncIntervalNum)
3.11
3.49
295
M
individuals and a society. We consider the Current Population Survey Tables
for Personal Income, as provided in PINC-01. For our analysis we use the
PT
ED
incomes reported by individuals with work experience (Worked ) in 2016. The
table (pinc01 3 1 1.xls) contains personal income in brackets of length 2,500.
The data sets are very large (n= 86,906 for men and n= 77,797 for women).
We use a grouping into brackets of length 5,000 up to 100,000. The last group
300
is formed by personal income above 100,000, which was given in the tables
available from CPS. The data can not be used for building models with more
CE
than one covariate since only tables for specific sub populations are available.
For our purpose, which is illustration of the method, we focus on gender.
AC
Figure 10 shows the income distribution on the left hand side and the
305
(smoothed) corresponding function γ0 (.), which determines the hazard function,
on the right hand side, when fitting model (3) for the whole data set including
men and women. The hazard function shows distinct similarities to the hazard
function of the SoeP data. It first increases strongly and then remains on a
high level. However, one does not see a further increase in very high categories
19
IP
T
γgender
γ0
2
5
10
15
20
5
10
15
20
4
2
0
−2
5
10
15
−4
M
−4
−2
0
AN
2
γmarried:gender
4
6
income bracket
6
income bracket
γgerman
US
CR
−4
−4
−2
−2
0
0
4
2
6
4
8
6
10
ACCEPTED MANUSCRIPT
20
5
10
15
20
income bracket
PT
ED
income bracket
Figure 9: Varying coefficients for the discrete hazard model with several covariates, showing
baseline function (upper left), the varying gender effect (upper right), the varying effect of
german (lower left) and the varying effect of the interaction married:gender (lower right) for
SoeP data.
as in the SoeP data. This may be due to the large bracket (≥ 100, 000), which
CE
310
contains many observations. However, for higher income no finer resolution is
AC
available.
315
Figure 11 shows the coefficient functions fitted separately for men and women.
Again one sees differences but it is hard to get an overall picture of the differences between men and women. The varying gender coefficient, obtained when
fitting model (4), given in Figure 12 shows that the effect is positive across
the whole range of income categories. It is slightly stronger for small income
20
−1.5
−3.5
1
3
5
7
9
11 13 15 17 19
IP
T
US
CR
−3.0
γ0
−2.5
−2.0
0.15
0.10
0.05
0.00
relative frequency
0.20
ACCEPTED MANUSCRIPT
5
income bracket
10
15
20
income bracket
Figure 10: Income distribution (left) and corresponding baseline function (right) for CPS
data.
AN
categories and has a tendency to become weaker across categories. The same
decrease has been seen in the German SoeP data. However, for the U.S. data
320
the effect is distinctly positive over the whole range of values, in contrast to
M
the German data for which confidence intervals are very wide for higher income
categories. Due to the huge number of observations confidence intervals for the
PT
ED
U.S. data are very small, they are sometimes hardly seen.
5. Concluding Remarks
325
We presented a tool that can be used to investigate the effect of single or
more covariates on income. The basic ingredients of the tool are the logistic
CE
discrete hazard model with varying coefficients instead of linear effects, accompanied by appropriate estimation methods. While simple models with a linear
AC
predictor yield parameters that describe the overall effect of explanatory vari-
330
ables on income the varying coefficients model allows that effects of covariates
can vary over income brackets. Thus the effect strength of explanatory variables
in specific income brackets is seen. The effects that are found to vary across
categories can be visualized in a curve, which gives a much more concise picture
of the effects of covariates than a simple, frequently inappropriate linear effect.
21
−1.5
−3.5
1
3
5
7
9
11 13 15 17 19
10
15
20
15
20
−1.5
0.20
income bracket
γ0
−3.0
−2.5
AN
−2.0
0.15
0.10
1
3
5
7
9
−3.5
M
0.05
relative frequency
IP
T
5
income bracket
0.00
US
CR
−3.0
γ0
−2.5
−2.0
0.15
0.10
0.05
0.00
relative frequency
0.20
ACCEPTED MANUSCRIPT
11 13 15 17 19
5
10
income bracket
PT
ED
income bracket
Figure 11:
Income distribution and baseline function for men (above) and women (below)
for CPS data.
335
The estimation method is based on P-splines, which use B-splines as basis
CE
functions for the representation of the unknown functions together with a difference penalty that penalizes adjacent parameters in the expansion. P-splines
have the advantage that they are also stable if one has only few observations
AC
in some of the brackets. The use of P-splines determines the trade-off between
340
bias and variance and reduces the effective number of parameters considerably,
which typically yields rather smooth curves.
Estimates for the proposed discrete hazard model are easily obtained by
available program packages, which in all our applications had no problems of
convergence. The model is not harder to use than classical regression models
22
IP
T
γgender
5
10
15
20
5
income bracket
Figure 12:
US
CR
−0.5
−3.5
−3.0
0.0
−2.5
γ0
0.5
−2.0
1.0
−1.5
ACCEPTED MANUSCRIPT
10
15
20
income bracket
Varying coefficients for the discrete hazard model with gender as covariate,
showing baseline function (left) and the varying gender effect (right) for CPS data.
but the resulting curves are much more informative than single parameters.
AN
345
We explicitly restricted our considerations to data where the income is reported in intervals. Although the exact income is more informative, in many
M
surveys income is collected in intervals, only, in order to reduce the non-response
problem. If exact income is available, alternative methods that allow a closer
look at the effect of variables on the income distribution are available, for ex-
PT
ED
350
ample, quantile regression (Fahrmeir et al., 2011) can be used, or generalized
additive models for location, scale and shape (GAMLSS), a framework developed by Rigby and Stasinopoulos (2005).
CE
Acknowledgments
355
The authors thank the reviewers for their thorough reading and the helpful
AC
suggestions.
References
References
Agresti, A., 2010. Analysis of Ordinal Categorical Data, 2nd Edition. Wiley,
360
New York.
23
ACCEPTED MANUSCRIPT
Ananth, C.V., Kleinbaum, D.G., 1997. Regression models for ordinal responses:
A review of methods and applications. International Journal of Epidemiology
26, 1323–1333.
365
data. American Journal of Epidemiology 129, 191–204.
IP
T
Armstrong, B., Sloan, M., 1989. Ordinal regression models for epidemiologic
US
CR
Berg, N., Lien, D., 2002. Measuring the effect of sexual orientation on income:
Evidence of discrimination? Contemporary Economic Policy 20, 394–414.
Berger, M., Schmid, M., 2017. Semiparametric regression for discrete time-toevent data. Statistical Modelling 18, 322–345.
Bobbitt-Zeher, D., 2007. The gender income gap and the role of education.
Sociology of Education 80, 1–22.
AN
370
Brown, C., 1975. On the use of indicator variables for studying the time-
M
dependence of parameters in a response-time model. Biometrics 31, 863–872.
Deaton, A., Arora, R., 2009. Life at the top: the benefits of height. Economics
& Human Biology 7, 133–136.
PT
ED
375
Dierckx, P., 1993. Curve and Surface Fitting with Splines. Oxford Science
Publications, Oxford.
Eilers, P.H.C., Marx, B.D., 1996. Flexible smoothing with B-splines and Penal-
CE
ties. Statistical Science 11, 89–121.
380
Ermini, L., Hendry, D.F., 2008. Log income vs. linear income: An application
AC
of the encompassing principle. Oxford Bulletin of Economics and Statistics
385
70, 807–827.
Fahrmeir, L., Kneib, T., Lang, S., Marx, B., 2011. Regression. Models, Methods
and Applications. Springer Verlag, Berlin.
Fahrmeir, L., Tutz, G., 2001. Multivariate Statistical Modelling based on Generalized Linear Models. Springer, New York.
24
ACCEPTED MANUSCRIPT
Fan, J., Gijbels, I., 1996. Local Polynomial Modelling and Its Applications.
Chapman & Hall, London.
Hastie, T., Tibshirani, R., 1993. Varying-coefficient models. Journal of the
Royal Statistical Society B 55, 757–796.
IP
T
390
Hazelton, M.L., Turlach, B.A., 2011. Semiparametric regression with shape-
US
CR
constrained penalized splines. Computational Statistics & Data Analysis 55,
2871–2879.
Hinze, S.W., 2000. Inside medical marriages: The effect of gender on income.
395
Work and occupations 27, 464–499.
Kauermann, G., Tutz, G., Brüderl, J., 2005. The survival of newly founded
AN
firms: A case-study into varying-coefficient models. Journal of the Royal
Statistical Society A 168, 145–158.
Laara, E., Matthews, J.N., 1985. The equivalence of two models for ordinal
data. Biometrika 72, 206–207.
M
400
PT
ED
Laird, N., Olivier, D., 1981. Covariance analysis of censored survival data using
log-linear analysis techniques. Journal of the American Statistical Association
76, 231–240.
Loader, C., 1999. Local Regression and Likelihood. Springer, New York.
405
Marini, M.M., 1989. Sex differences in earnings in the united states. Annual Re-
CE
view of Sociology 15, 343–380. doi:10.1146/annurev.so.15.080189.002015.
AC
Marx, B.D., Eilers, P.H.C., 1998. Direct generalized additive modelling with
410
penalized likelihood. Computational Statistics & Data Analysis 28, 193–209.
McCullagh, P., 1980. Regression model for ordinal data (with discussion). Journal of the Royal Statistical Society B 42, 109–127.
McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models (Second Edition).
Chapman & Hall, New York.
25
ACCEPTED MANUSCRIPT
Möst, S., Pössnecker, W., Tutz, G., 2016. Variable selection for discrete competing risks models. Quality and Quantity 50, 1589 – 1610.
Muller, A., 2002. Education, income inequality, and mortality: A multiple
regression analysis. Biometrical Journal 324, 23.
IP
T
415
Peyhardi, J., Trottier, C., Guédon, Y., 2015. A new specification of generalized
US
CR
linear models for categorical data. Biometrika , 889–906.
Rigby, R.A., Stasinopoulos, D.M., 2005. Generalized additive models for loca420
tion, scale and shape. Journal of the Royal Statistical Society C 54, 507–554.
Singh, S., Maddala, G.S., 2008. A function for size distribution of incomes, in:
AN
Modeling income distributions and Lorenz curves. Springer, pp. 27–35.
Steckel, R.H., 1983. Height and per capita income. Historical Methods: A
Journal of Quantitative and Interdisciplinary History 16, 1–7.
Telles, E.E., Murguia, E., 1990. Phenotypic discrimination and income differ-
M
425
ences among mexican americans. Social Science Quarterly 71, 682.
PT
ED
Tutz, G., 1991. Sequential models in ordinal regression. Computational Statistics & Data Analysis 11, 275–295.
Tutz, G., 2012. Regression for Categorical Data. Cambridge University Press.
430
Tutz, G., Leitenstorfer, F., 2007. Generalized smooth monotonic regression in
CE
additive modeling. Journal of Computational and Graphical Statistics 16,
165–188.
AC
Tutz, G., Schmid, M., 2016. Modeling Discrete Time-to-Event Data. Springer,
435
New York.
Welchowski, T., Schmid, M., 2017. discSurv: Discrete time survival analysis.
URL: http://CRAN.R-project.org/package=discSurv. R package version
1.1.7.
26
ACCEPTED MANUSCRIPT
Wood, S.N., 2000. Modelling and smoothing parameter estimation with multiple
quadratic penalties. Journal of the Royal Statistical Society B 62, 413–428.
440
Wood, S.N., 2006. Generalized Additive Models: An Introduction with R.
IP
T
Chapman & Hall/CRC, London.
Wood, S.N., 2011. Fast stable restricted maximum likelihood and marginal
the Royal Statistical Society B 73, 3–36.
445
Wood,
S.N.,
2017.
mgcv:
US
CR
likelihood estimation of semiparametric generalized linear models. Journal of
Mixed GAM computation vehicle with
GCV/AIC/REML smoothness estimation. URL: https://CRAN.R-project.
AC
CE
PT
ED
M
AN
org/package=mgcv. R package version 1.8-15.
27