MedLDA: Maximum Margin Supervised Topic Models for
Regression and Classification
Jun Zhu†∗
[email protected]
Amr Ahmed†
[email protected]
Eric P. Xing†
[email protected]
∗
Dept. of Comp. Sci & Tech, TNList Lab, Tsinghua University, Beijing 100084 China
†
School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA
Abstract

Supervised topic models utilize documents' side information for discovering predictive low-dimensional representations of documents, and existing models apply likelihood-based estimation. In this paper, we present a max-margin supervised topic model for both continuous and categorical response variables. Our approach, the maximum entropy discrimination latent Dirichlet allocation (MedLDA), utilizes the max-margin principle to train supervised topic models and estimate predictive topic representations that are arguably more suitable for prediction. We develop efficient variational methods for posterior inference and demonstrate qualitatively and quantitatively the advantages of MedLDA over likelihood-based topic models on movie review and 20 Newsgroups data sets.
1. Introduction

Statistical topic models have recently gained much popularity in managing large collections of documents by discovering low-dimensional representations that capture the latent semantics of the collection. This low-dimensional representation can then be used for tasks like classification and clustering, or merely as a tool to structurally browse an otherwise unstructured collection. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is an example of such models for textual documents. LDA posits that each document is an admixture of latent topics, where the topics are unigram distributions over a given vocabulary. The admixture proportion is document-specific and is distributed as a latent Dirichlet random variable.
Appearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).
When LDA is used for classification tasks, the document-specific mixing proportions are usually fed to a downstream classifier like an SVM. This two-step procedure is rather suboptimal: the side information of the documents, such as the category of a document or the numerical rating of a movie review, is not used in discovering the low-dimensional representation of the documents, and the resulting representation can thus be suboptimal for prediction. Developing a low-dimensional representation that retains as much information as possible about the response variable has been studied in text modeling (McCallum et al., 2006) and image analysis (Blei & Jordan, 2003). Recently, supervised variants of LDA have been proposed, including the supervised LDA (sLDA) (Blei & McAuliffe, 2007) and the discriminative LDA (DiscLDA) for classification (Lacoste-Jullien et al., 2008). While sLDA and DiscLDA share the same goal (uncovering the latent structure in a document collection while retaining predictive power for supervised tasks), they differ in their training procedures. sLDA is trained by maximizing the joint likelihood of data and response variables, while DiscLDA is trained to maximize the conditional likelihood of response variables.
In this paper, we propose a max-margin discriminative variant of supervised topic models for both regression and classification. In contrast to the above two-stage procedure of using topic models for prediction tasks, the proposed maximum entropy discrimination latent Dirichlet allocation (MedLDA) is an integration of max-margin learning and hierarchical Bayesian topic models, obtained by optimizing a single objective function with a set of expected margin constraints. MedLDA is a special instance of PoMEN (i.e., partially observed maximum entropy discrimination Markov network) (Zhu et al., 2008b), which was proposed to combine max-margin learning and structured hidden variables in Markov networks, applied here to discovering latent topic representations of documents. In MedLDA, the parameters for the regression or classification model are learned in
a max-margin sense, and the discovery of latent topics is coupled with the max-margin estimation of the model parameters. This interplay yields latent topic representations that are more suitable for supervised prediction tasks. We develop an efficient and easy-to-implement variational method for MedLDA; in fact, its running time is comparable to that of an unsupervised LDA for classification. This property stems from the fact that the MedLDA classification model directly optimizes the margin and does not suffer from a normalization factor, which generally makes learning hard in fully generative models such as sLDA.

The paper is structured as follows. Sec. 2 presents MedLDA for both regression and classification, with efficient variational EM algorithms. Sec. 3 generalizes MedLDA to other latent variable topic models. Sec. 4 presents an empirical comparison between MedLDA and likelihood-based topic models. Finally, Sec. 5 concludes this paper with future research directions.
2. Max-Entropy Discrimination LDA

In this section, we present the MedLDA model for both regression and classification. We first review supervised topic models.

2.1. (Un)Supervised Topic Models

The unsupervised LDA (latent Dirichlet allocation) (Blei et al., 2003) is a hierarchical Bayesian model, where topic proportions for a document are drawn from a Dirichlet distribution and words in the document are repeatedly sampled from a topic which itself is drawn from those topic proportions. Supervised topic models (sLDA) (Blei & McAuliffe, 2007) introduce a response variable to LDA for each document, as illustrated in Figure 1.

Figure 1. Supervised topic model (Blei & McAuliffe, 2007).

Let K be the number of topics and M be the number of terms in a vocabulary. β denotes a K × M matrix and each βk is a distribution over the M terms. For the regression problem, where the response variable y ∈ R, the generative process of sLDA is as follows:

1. Draw topic proportions θ|α ∼ Dir(α).
2. For each word:
   (a) Draw a topic assignment zn|θ ∼ Mult(θ).
   (b) Draw a word wn|zn, β ∼ Mult(βzn).
3. Draw a response variable y|z1:N, η, δ² ∼ N(η⊤z̄, δ²), where z̄ = (1/N) ∑_{n=1}^N zn.

To estimate the unknown constants (α, β, η, δ²), sLDA maximizes the joint likelihood p(y, W|α, β, η, δ²), where y is the vector of response variables in a corpus D and W are all the words. Given a new document, the expected response value is the prediction:

E[Y|w1:N, α, β, η, δ²] = η⊤ E[Z̄|w1:N, α, β, δ²],   (1)

where E[X] is an expectation w.r.t. the posterior distribution of the r.v. X or its variational approximation.

DiscLDA (Lacoste-Jullien et al., 2008) is a discriminative variant of supervised topic models for classification, where the unknown parameters (i.e., a linear transformation matrix) are learned by maximizing the conditional likelihood of the response variables.

Below, we present a max-margin variant of supervised topic models, which can discover predictive topic representations that are more suitable for supervised prediction tasks, e.g., regression and classification.

2.2. Learning MedLDA for Regression

Instead of learning a point estimate of η as in sLDA, we take a Bayesian-style approach and learn a distribution q(η) in a max-margin manner. For prediction, we take the average over all the possible models:

E[Y|w1:N, α, β, δ²] = E[η⊤Z̄|w1:N, α, β, δ²].   (2)

Now, the question underlying the averaging prediction rule (2) is how we can devise an appropriate loss function and constraints to integrate the max-margin concepts into latent topic discovery. In the sequel, we present the maximum entropy discrimination latent Dirichlet allocation (MedLDA) based on the PoMEN (i.e., partially observed maximum entropy discrimination Markov networks) (Zhu et al., 2008b) framework. PoMEN is an elegant combination of max-margin learning with structured hidden variables in Markov networks. MedLDA is a special case of PoMEN that learns latent topic models to discover latent semantic structures of document collections.

For regression, MedLDA is defined as an integration of a Bayesian sLDA, where the parameter η is sampled from a prior p0(η), and the ε-insensitive support vector regression (SVR) (Smola & Schölkopf, 2003). Thus, MedLDA defines a joint distribution:

p(θ, z, η, y, W|α, β, δ²) = p0(η) p(θ, z, y, W|α, β, η, δ²),

where the second term is the same as in the sLDA, that is, p(θ, z, y, W|α, β, η, δ²) = ∏_{d=1}^D p(θd|α) (∏_{n=1}^N p(zdn|θd) p(wdn|zdn, β)) p(yd|η⊤z̄d, δ²). The marginal likelihood on D is p(y, W|α, β, δ²). Since directly optimizing the log marginal likelihood is intractable, as in sLDA, we optimize an upper bound L(q), where q(θ, z, η|γ, φ) is a variational distribution to approximate the posterior p(θ, z, η|α, β, δ², y, W).
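As a concrete illustration, the sLDA generative process of Section 2.1 can be sketched in a few lines of numpy; the corpus dimensions and hyperparameter values below are made up for illustration:

```python
import numpy as np

def sample_slda_doc(alpha, beta, eta, delta2, N, rng):
    """Sample one document (words, response) from the sLDA generative process."""
    K, M = beta.shape
    theta = rng.dirichlet(alpha)                # 1. topic proportions theta ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)          # 2a. topic assignment per word
    w = np.array([rng.choice(M, p=beta[zn]) for zn in z])  # 2b. word from topic zn
    zbar = np.bincount(z, minlength=K) / N      # empirical topic frequencies
    y = rng.normal(eta @ zbar, np.sqrt(delta2)) # 3. response ~ N(eta^T zbar, delta^2)
    return w, y

rng = np.random.default_rng(0)
K, M, N = 5, 20, 50                             # toy sizes, chosen arbitrarily
beta = rng.dirichlet(np.ones(M), size=K)        # K topics, each a distribution over M terms
w, y = sample_slda_doc(np.ones(K), beta, rng.standard_normal(K), 0.1, N, rng)
```

Here `beta`, `eta`, and the Dirichlet parameter are random placeholders rather than fitted values; the sketch only mirrors the three sampling steps above.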
Thus, the integrated learning problem is defined as:

P1 (MedLDAr):  min_{q,α,β,δ²,ξ,ξ⋆}  L(q) + C ∑_{d=1}^D (ξd + ξd⋆)

s.t. ∀d:  yd − E[η⊤Z̄d] ≤ ε + ξd,    (μd)
          −yd + E[η⊤Z̄d] ≤ ε + ξd⋆,  (μd⋆)
          ξd ≥ 0,                    (vd)
          ξd⋆ ≥ 0,                   (vd⋆)

where μ, μ⋆, v, v⋆ are the corresponding Lagrange multipliers; L(q) = −E[log p(θ, z, η, y, W|α, β, δ²)] − H(q(z, θ, η)); H(q) is the entropy of q; ξ, ξ⋆ are slack variables absorbing errors in training data; and ε is the precision.
The rationale underlying MedLDAr is as follows: letting the current model be p(θ, z, η, y, W|α, β, δ²), we want to find a latent topic representation and a model distribution (as represented by the distribution q) which on one hand tend to predict correctly on the data with a sufficiently large margin, and on the other hand tend to explain the data well (i.e., minimize a variational upper bound of the negative log-likelihood). This interplay will yield a topic representation that is more suitable for max-margin learning, as explained below.
2.2.1. Variational EM-Algorithm
The constrained problem P1 is generally intractable. Thus, we make additional independence assumptions about q. As in standard topic models, we assume that q(θ, z, η|γ, φ) = q(η) ∏_{d=1}^D q(θd|γd) ∏_{n=1}^N q(zdn|φdn), where γd is a K-dimensional vector of Dirichlet parameters and each φdn is a categorical distribution over K topics. Then, E[Zdn] = φdn and E[η⊤Z̄d] = E[η]⊤ (1/N) ∑_{n=1}^N φdn. We can develop an EM algorithm which iteratively solves the following two steps:

E-step: infer the posterior distribution of the hidden variables (i.e., θ, z, and η).

M-step: estimate the unknown parameters (i.e., α, β, and δ²).

The essential difference between MedLDA and sLDA lies in the E-step, which infers the posterior distribution of z and η under the margin constraints in P1. As we shall see in Eq. (4), these constraints will bias the expected topic proportions towards ones that are more suitable for max-margin learning. Since the constraints in P1 are not on the unknown parameters (α, β, and δ²), the M-step is similar to that of sLDA.
We outline the algorithm in Alg. 1 and explain it in detail below. Specifically, we formulate a Lagrangian¹ L for P1 and iteratively solve the following steps:

¹ L = L(q) + C ∑_{d=1}^D (ξd + ξd⋆) − ∑_{d=1}^D μd(ε + ξd − yd + E[η⊤Z̄d]) − ∑_{d=1}^D (μd⋆(ε + ξd⋆ + yd − E[η⊤Z̄d]) + vd ξd + vd⋆ ξd⋆) − ∑_{d=1}^D ∑_{i=1}^N cdi(∑_{j=1}^K φdij − 1), where the last term is due to the normalization condition ∑_{j=1}^K φdij = 1, ∀i, d.
Algorithm 1 Variational MedLDAr
Input: corpus D = {(y, W)}, constants C and ε, and topic number K.
Output: Dirichlet parameters γ, posterior distribution q(η), parameters α, β and δ².
repeat
  /**** E-Step ****/
  for d = 1 to D do
    Update γd as in Eq. (3).
    for i = 1 to N do
      Update φdi as in Eq. (4).
    end for
  end for
  Solve the dual problem D1 to get q(η), μ and μ⋆.
  /**** M-Step ****/
  Update β using Eq. (5), and update δ² using Eq. (6).
  α is fixed as 1/K times the ones vector.
until convergence
Optimize L over γ: Since the constraints in P1 are not on γ, we get the same update formula as in sLDA for each document d separately:

γd ← α + ∑_{n=1}^N φdn.   (3)
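Because Eq. (3) decouples across documents, the update is a single vectorized operation; a minimal numpy sketch (the shapes and values are hypothetical):

```python
import numpy as np

def update_gamma(alpha, phi_d):
    """Eq. (3): gamma_d <- alpha + sum_n phi_dn, for one document."""
    # phi_d: (N, K) array; row n is the categorical distribution of word n over K topics
    return alpha + phi_d.sum(axis=0)

K, N = 4, 10
phi_d = np.full((N, K), 1.0 / K)        # uniform initialization of phi
gamma_d = update_gamma(np.ones(K), phi_d)
```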
Optimize L over φ: For each document d and each word i, setting ∂L/∂φdi = 0, we have:

φdi ∝ exp( E[log θ|γ] + E[log p(wdi|β)] + (yd / (Nδ²)) E[η] − (2E[η⊤φd,−i η] + E[η ◦ η]) / (2N²δ²) + (E[η]/N)(μd − μd⋆) ),   (4)

where φd,−i = ∑_{n≠i} φdn, and the result of exponentiating a vector is a vector of the exponentials of its corresponding components. The first two terms in the exponential are the same as those in unsupervised LDA.
The essential differences of MedLDAr from sLDA lie in the last three terms in the exponential of φdi. First, the third and fourth terms are similar to those of sLDA, but in an expected version, since we are learning the distribution q(η). The second-order expectations E[η⊤φd,−i η] and E[η ◦ η] mean that the covariances of η affect the distribution over topics. This makes our approach significantly different from a point estimation method, like sLDA, where no expectations or covariances are involved in updating φdi. Second, the last term comes from the max-margin regression formulation. For a document d that lies around the decision boundary, i.e., a support vector, either μd or μd⋆ is non-zero, and the last term biases φdi towards a distribution that favors a more accurate prediction on the document. Moreover, the last term is fixed for the words in a document and thus directly affects the latent representation of the document, i.e., γd. Therefore, the latent representation discovered by MedLDAr is more suitable for max-margin learning.
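A sketch of the update in Eq. (4), assuming a normal posterior q(η) = N(μη, Σ) so that E[η ◦ η] = diag(Σ) + μη ◦ μη and E[(η⊤φd,−i)η] = (Σ + μημη⊤)φd,−i; the log-prior/log-likelihood terms and multiplier values passed in below are placeholders:

```python
import numpy as np

def update_phi_di(log_prior, log_lik, mu_eta, Sigma, phi_d_minus_i,
                  y_d, N, delta2, mu_d, mu_d_star):
    """Eq. (4): margin-regularized update of one word's topic distribution."""
    E_eta_eta = Sigma + np.outer(mu_eta, mu_eta)        # E[eta eta^T]
    second_order = 2 * E_eta_eta @ phi_d_minus_i + np.diag(E_eta_eta)
    s = (log_prior + log_lik
         + y_d / (N * delta2) * mu_eta                  # supervised pull toward y_d
         - second_order / (2 * N**2 * delta2)           # covariance correction
         + mu_eta / N * (mu_d - mu_d_star))             # max-margin term
    s -= s.max()                                        # numerical stability
    phi = np.exp(s)
    return phi / phi.sum()                              # normalize over K topics

K = 4
phi = update_phi_di(np.zeros(K), np.zeros(K), np.zeros(K), np.eye(K),
                    np.full(K, 2.0), y_d=1.0, N=10, delta2=1.0,
                    mu_d=0.0, mu_d_star=0.0)
```

With zero-mean inputs as above all terms are symmetric across topics, so the update returns a uniform distribution; in practice the log-likelihood and margin terms break this symmetry.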
Optimize L over q(η): Let A be the D × K matrix whose rows are the vectors Z̄d⊤. Setting the partial derivative ∂L/∂q(η) = 0, we get:

q(η) = (p0(η)/Z) exp( η⊤ ∑_{d=1}^D (μd − μd⋆ + yd/δ²) E[Z̄d] − η⊤ (E[A⊤A]/(2δ²)) η ),

where E[A⊤A] = ∑_{d=1}^D E[Z̄d Z̄d⊤], and E[Z̄d Z̄d⊤] = (1/N²)(∑_{n=1}^N ∑_{m≠n} φdn φdm⊤ + ∑_{n=1}^N diag{φdn}). Plugging q(η) into L, we get the dual problem of P1:

D1:  max_{μ,μ⋆}  −log Z − ε ∑_{d=1}^D (μd + μd⋆) + ∑_{d=1}^D yd(μd − μd⋆)
     s.t. ∀d: μd, μd⋆ ∈ [0, C].

In MedLDAr, we can choose different priors to introduce different regularization effects. For the standard normal prior p0(η) = N(0, I), the posterior is also a normal: q(η) = N(μη, Σ), where μη = Σ ∑_{d=1}^D (μd − μd⋆ + yd/δ²) E[Z̄d] is the mean and Σ = (I + (1/δ²)E[A⊤A])⁻¹ is a K × K covariance matrix. Computation of Σ can be achieved robustly through a Cholesky decomposition of δ²I + E[A⊤A], an O(K³) procedure. Another example is the Laplace prior, which can lead to a shrinkage effect (Zhu et al., 2008a) that is useful in sparse problems. In this paper, we focus on the normal prior.

For the standard normal prior, the dual problem D1 is a quadratic program:

max_{μ,μ⋆}  −(1/2) a⊤Σa − ε ∑_{d=1}^D (μd + μd⋆) + ∑_{d=1}^D yd(μd − μd⋆)
s.t. ∀d: μd, μd⋆ ∈ [0, C],

where a = ∑_{d=1}^D (μd − μd⋆ + yd/δ²) E[Z̄d]. This problem can be solved with any standard QP solver, although such solvers may not be particularly efficient here. To leverage recent developments in support vector regression, we note that the following primal form of D1 can be reformulated as a standard SVR problem and solved with existing algorithms like SVM-light (Joachims, 1999) to get μη and the dual parameters μ and μ⋆:

min_{μη,ξ,ξ⋆}  (1/2) μη⊤Σ⁻¹μη − μη⊤( ∑_{d=1}^D (yd/δ²) E[Z̄d] ) + C ∑_{d=1}^D (ξd + ξd⋆)

s.t. ∀d:  yd − μη⊤E[Z̄d] ≤ ε + ξd,    (μd)
          −yd + μη⊤E[Z̄d] ≤ ε + ξd⋆,  (μd⋆)
          ξd ≥ 0,                     (vd)
          ξd⋆ ≥ 0.                    (vd⋆)

Now, we estimate the unknown parameters α, β, and δ². Here, we assume α is fixed.

Optimize L over β: The update equations are the same as for sLDA:

βk,w ∝ ∑_{d=1}^D ∑_{n=1}^N 1(wdn = w) φdnk.   (5)

Optimize L over δ²: This step is similar to that of sLDA but in an expected version. The update rule is:

δ² ← (1/D)( y⊤y − 2y⊤E[A]E[η] + E[η⊤E[A⊤A]η] ),   (6)

where E[η⊤E[A⊤A]η] = tr(E[A⊤A] E[ηη⊤]).

2.3. Learning MedLDA for Classification

For classification, the response variables y are discrete. For brevity, we only consider the multi-class classification, where y ∈ {1, · · · , M}. The binary case can be easily defined based on a binary SVM, and the optimization problem can be solved similarly.
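The covariance Σ = (I + E[A⊤A]/δ²)⁻¹ of Section 2.2.1, computed through a Cholesky factorization of δ²I + E[A⊤A] as suggested there, might look as follows on toy variational parameters (the corpus below is random placeholder data):

```python
import numpy as np

def expected_AtA(phis):
    """E[A^T A] = sum_d E[zbar_d zbar_d^T] from per-document phi arrays of shape (N_d, K)."""
    K = phis[0].shape[1]
    S = np.zeros((K, K))
    for phi in phis:
        N = phi.shape[0]
        s = phi.sum(axis=0)
        # sum_{n != m} phi_dn phi_dm^T = (sum_n phi_dn)(sum_n phi_dn)^T - sum_n phi_dn phi_dn^T
        off_diag = np.outer(s, s) - phi.T @ phi
        S += (off_diag + np.diag(s)) / N**2
    return S

def posterior_covariance(EAtA, delta2):
    """Sigma = (I + E[A^T A]/delta2)^{-1}, via Cholesky of the SPD matrix delta2*I + E[A^T A]."""
    K = EAtA.shape[0]
    L = np.linalg.cholesky(delta2 * np.eye(K) + EAtA)
    inv = np.linalg.inv(L.T) @ np.linalg.inv(L)   # (delta2*I + EAtA)^{-1} from the factor
    return delta2 * inv                           # equals (I + EAtA/delta2)^{-1}

rng = np.random.default_rng(1)
phis = [rng.dirichlet(np.ones(3), size=8) for _ in range(5)]  # 5 toy documents, K=3, N=8
Sigma = posterior_covariance(expected_AtA(phis), delta2=0.5)
```

The identity Σ = δ²(δ²I + E[A⊤A])⁻¹ used in the last line follows from factoring δ² out of the inverse.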
As we have stated, fully generative topic models such as sLDA have a normalization factor, which can make learning generally intractable except in some special cases, like the normal distribution used in the regression case. In (Blei & McAuliffe, 2007), variational methods or higher-order Taylor expansions are applied to approximate the normalization factor of a GLM. In our max-margin formulation, since our target is to directly minimize a hinge loss, we do not need a fully generative model. Instead, we define a partially generative model on (θ, z, W) only, as in the unsupervised LDA, and for the classification model (i.e., from Zd to Yd) we apply the max-margin principle, which does not require a normalized distribution. Thus, in this case, the likelihood of the corpus D is p(W|α, β).
Specifically, for classification, we assume the discriminant function F is linear, that is, F(y, z1:N, η) = ηy⊤z̄, where z̄ = (1/N) ∑_n zn as in the regression model, ηy is a class-specific K-dimensional parameter vector associated with class y, and η is the MK-dimensional vector obtained by stacking the ηy. Equivalently, F can be written as F(y, z1:N, η) = η⊤f(y, z̄), where f(y, z̄) is a feature vector whose components from (y − 1)K + 1 to yK are those of the vector z̄ and all the others are 0. From each single F, a prediction rule can be derived as in SVM. Here, we consider the general case of learning a distribution q(η); for prediction, we take the average over all the possible models and the latent topics:

y⋆ = arg max_y E[η⊤f(y, Z̄)|α, β].   (7)
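The stacked feature map f(y, z̄) and the averaged prediction rule (7) can be sketched as below; the class count and topic dimension are illustrative:

```python
import numpy as np

def feature_map(y, zbar, M):
    """f(y, zbar): an M*K vector whose y-th K-block is zbar, all other entries zero."""
    K = zbar.shape[0]
    f = np.zeros(M * K)
    f[(y - 1) * K:y * K] = zbar          # classes are 1-based, as in the paper
    return f

def predict(E_eta, zbar, M):
    """Eq. (7): y* = argmax_y E[eta]^T f(y, E[zbar])."""
    scores = [E_eta @ feature_map(y, zbar, M) for y in range(1, M + 1)]
    return int(np.argmax(scores)) + 1

K, M = 3, 4                              # 3 topics, 4 classes (toy sizes)
# hypothetical E[eta]: each class block is a constant vector, class 2 weighted highest
E_eta = np.concatenate([np.full(K, c) for c in (0.1, 0.5, 0.2, 0.3)])
zbar = np.array([0.2, 0.3, 0.5])
y_star = predict(E_eta, zbar, M)
```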
Similar to the regression model, we define the integrated latent topic discovery and multi-class classification model as follows:

P2 (MedLDAc):  min_{q,q(η),α,β,ξ}  L(q) + KL(q(η)||p0(η)) + C ∑_{d=1}^D ξd

s.t. ∀d, y ≠ yd:  E[η⊤∆fd(y)] ≥ 1 − ξd;  ξd ≥ 0,

where q(θ, z|γ, φ) is a variational distribution; L(q) = −E[log p(θ, z, W|α, β)] − H(q(θ, z)) is a variational upper bound of −log p(W|α, β); ∆fd(y) = f(yd, Z̄d) − f(y, Z̄d); and ξ are slack variables. E[η⊤∆fd(y)] is the "expected margin" by which the true label yd is favored over a prediction y.
The rationale underlying MedLDAc is similar to that of MedLDAr: we want to find a latent topic representation q(θ, z|γ, φ) and a parameter distribution q(η) which on one hand tend to predict as accurately as possible on the training data, while on the other hand tend to explain the data well. The KL-term in P2 is a regularizer of the distribution q(η).
2.3.1. Variational EM-Algorithm

As in MedLDAr, we can develop a similar variational EM algorithm. Specifically, we assume that q is fully factorized, as in the standard unsupervised LDA. Then, E[η⊤f(y, Z̄d)] = E[η]⊤f(y, (1/N) ∑_{n=1}^N φdn). We formulate the Lagrangian² L of P2 and iteratively optimize L w.r.t. γ, φ, q(η) and β. Since the constraints in P2 are not on γ or β, their update rules are the same as in MedLDAr and we omit the details here. We explain the optimization of P2 over φ and q(η) and show the insights of the max-margin topic model:

Optimize L over φ: Again, since q is fully factorized, we can perform the optimization on each document separately. Setting ∂L/∂φdi = 0, we have:

φdi ∝ exp( E[log θ|γ] + E[log p(wdi|β)] + (1/N) ∑_{y≠yd} μd(y) E[ηyd − ηy] ).   (8)

The first two terms in Eq. (8) are the same as in the unsupervised LDA, and the last term is due to the max-margin formulation of P2; it reflects our intuition that the discovered latent topic representation is influenced by the max-margin estimation. For those examples that are around the decision boundary, i.e., support vectors, some of the Lagrange multipliers are non-zero, and thus the last term acts as a regularizer that biases the model towards discovering a latent representation that tends to make more accurate predictions on these difficult examples. Moreover, this term is fixed for the words in a document and thus directly affects the latent representation of the document (i.e., γd), yielding a discriminative latent representation which, as we shall see in Section 4, is more suitable for the classification task.

Optimize L over q(η): As in the regression model, we get the dual problem of P2:

D2:  max_μ  −log Z + ∑_{d=1}^D ∑_{y≠yd} μd(y)
     s.t. ∀d: ∑_{y≠yd} μd(y) ∈ [0, C],

and the posterior q(η) = (1/Z) p0(η) exp(η⊤μη), where μη = ∑_{d=1}^D ∑_{y≠yd} μd(y) E[∆fd(y)].

Again, we can choose different priors in MedLDAc for different regularization effects. We consider the normal prior in this paper. For the standard normal prior p0(η) = N(0, I), we get that q(η) is a normal with a shifted mean, i.e., q(η) = N(μη, I), and the dual problem D2 is the same as the dual problem of a standard multi-class SVM, which can be solved using existing SVM methods (Crammer & Singer, 2001):

max_μ  −(1/2) ‖ ∑_{d=1}^D ∑_{y≠yd} μd(y) E[∆fd(y)] ‖₂² + ∑_{d=1}^D ∑_{y≠yd} μd(y)
s.t. ∀d: ∑_{y≠yd} μd(y) ∈ [0, C].

² L = L(q) + KL(q(η)||p0(η)) + C ∑_{d=1}^D ξd − ∑_{d=1}^D vd ξd − ∑_{d=1}^D ∑_{y≠yd} μd(y)(E[η⊤∆fd(y)] + ξd − 1) − ∑_{d=1}^D ∑_{i=1}^N cdi(∑_{j=1}^K φdij − 1), where the last term is from the normalization condition ∑_{j=1}^K φdij = 1, ∀i, d.

3. MedTM: a general framework

We have presented MedLDA, which integrates the max-margin principle with an underlying LDA model, which can be supervised or unsupervised, for discovering predictive latent topic representations of documents. The same principle can be applied to other generative topic models, such as the correlated topic models (CTM) (Blei & Lafferty, 2005), as well as undirected random fields, such as the exponential family harmoniums (EFH) (Welling et al., 2004).

Formally, the max-entropy discrimination topic models (MedTM) can be generally defined as:

P (MedTM):  min_{q(H),q(Υ),Ψ,ξ}  L(q(H)) + KL(q(Υ)||p0(Υ)) + U(ξ)
            s.t. expected margin constraints,

where H are hidden variables (e.g., (θ, z) in LDA); Υ are the parameters of the model pertaining to the prediction task (e.g., η in sLDA); Ψ are the parameters of the underlying topic model (e.g., the Dirichlet parameter α); and L is a variational upper bound of the negative log-likelihood associated with the underlying topic model. U is a convex function over the slack variables. For the general MedTM model, we can develop a similar variational EM-algorithm as for MedLDA. Note that Υ can be a part of H. For example, the underlying topic model of MedLDAr is a Bayesian sLDA; in this case, H = (θ, z, η), Υ = ∅, and the term KL(q(η)||p0(η)) is contained in its L.

4. Experiments

In this section, we provide qualitative as well as quantitative evaluation of MedLDA on text modeling, classification and regression.

4.1. Text Modeling

We study text modeling of MedLDA on the 20 Newsgroups data set with a standard list of stop words³ removed. The data set contains postings in 20 related categories. We compare with the standard unsupervised LDA. We fit the dataset to a 110-topic MedLDAc model, which explores the supervised category information, and a 110-topic unsupervised LDA.

³ http://mallet.cs.umass.edu/
Figure 2. t-SNE 2D embedding of the topic representation
by: MedLDAc (above) and the unsupervised LDA (below).
Figure 2 shows the 2D embedding of the expected topic proportions of MedLDAc and LDA, obtained using the t-SNE stochastic neighborhood embedding (van der Maaten & Hinton, 2008), where each dot represents a document and color-shape pairs represent class labels. The max-margin based MedLDAc clearly produces a better grouping and separation of the documents in different categories. In contrast, the unsupervised LDA does not produce a well-separated embedding, and documents in different categories tend to mix together. A similar embedding was presented in (Lacoste-Jullien et al., 2008), where the transformation matrix in their model is pre-designed; the results of MedLDAc in Figure 2 are automatically learned.
It is also interesting to examine the discovered topics and their association with class labels. In Figure 3 we show the top topics in four classes as discovered by both MedLDA and LDA. Moreover, we depict the per-class distribution over topics for each model. This distribution is computed by averaging the expected latent representation of the documents in each class. We can see that MedLDA yields sharper, sparser and faster-decaying per-class distributions over topics, which have better discrimination power.

Figure 3. Top topics under each class as discovered by the MedLDA and LDA models.

This behavior is in fact due to the regularization effect enforced over φ, as shown in Eq. (8). On the other hand,
LDA seems to discover topics that model the fine details of documents with no regard to their discrimination power (i.e., it discovers different variations of the same topic, which results in a flat per-class distribution over topics). For instance, in the class comp.graphics, MedLDA mainly models documents in this class using two salient, discriminative topics (T69 and T11), whereas LDA results in a much flatter distribution. Moreover, in the cases where LDA and MedLDA discover comparably the same set of topics in a given class (like politics.mideast and misc.forsale), MedLDA results in a sharper low-dimensional representation.
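The per-class distributions over topics discussed above are simple averages of the expected topic proportions within each class; a minimal sketch with made-up proportions:

```python
import numpy as np

def per_class_avg_theta(theta, labels, num_classes):
    """Average the expected topic proportions E[theta_d] over the documents in each class."""
    # theta: (D, K) expected topic proportions; labels: (D,) class ids in [0, num_classes)
    K = theta.shape[1]
    avg = np.zeros((num_classes, K))
    for c in range(num_classes):
        avg[c] = theta[labels == c].mean(axis=0)
    return avg

theta = np.array([[0.8, 0.2], [0.6, 0.4],   # two documents of class 0
                  [0.1, 0.9], [0.3, 0.7]])  # two documents of class 1
labels = np.array([0, 0, 1, 1])
avg = per_class_avg_theta(theta, labels, 2)
```

A sharply peaked row of `avg` corresponds to a class that concentrates on a few salient topics, as observed for MedLDA in Figure 3.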
4.2. Prediction Accuracy

In this subsection, we provide a quantitative evaluation of MedLDA on prediction performance.

4.2.1. Classification

We perform binary and multi-class classification on the 20 Newsgroups data set. To obtain a baseline, we first fit all the data to an LDA model, and then use the latent representation of the training⁴ documents as features to build a binary/multi-class SVM classifier. We denote this baseline by LDA+SVM. For a model M, we evaluate its performance using the relative improvement ratio, i.e., (precision(M) − precision(LDA+SVM)) / precision(LDA+SVM).

⁴ We use the training/testing split in: http://people.csail.mit.edu/jrennie/20Newsgroups/

Figure 4. Relative improvement ratio against LDA+SVM for: (a) binary and (b) multi-class classification.
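The evaluation metric is straightforward to compute; as a sketch:

```python
def relative_improvement(precision_model, precision_baseline):
    """Relative improvement ratio of a model over the LDA+SVM baseline."""
    return (precision_model - precision_baseline) / precision_baseline

# hypothetical precision values, for illustration only
ratio = relative_improvement(0.88, 0.80)
```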
Binary Classification: As in (Lacoste-Jullien et al., 2008), the binary classification task is to distinguish postings of the newsgroup alt.atheism from postings of the group talk.religion.misc. We compare MedLDAc with sLDA, DiscLDA and LDA+SVM. For sLDA, to the best of our knowledge, the classification model has not been evaluated; therefore, we fit an sLDA regression model using the binary representation (0/1) of the class, and use a threshold of 0.5 to make predictions. For MedLDAc, to see whether a second-stage max-margin classifier can improve the performance, we also build a method MedLDA+SVM, similar to LDA+SVM. All the above methods that utilize the class label information are fit ONLY on the training data. We use SVM-light (Joachims, 1999) to build SVM classifiers and to estimate q(η) in MedLDAc. The parameter C is chosen via 5-fold cross-validation during training from {k² : k = 1, · · · , 8}. For each model, we run the experiments 5 times and take the average as the final result. The relative improvement ratios of different models w.r.t. topic numbers are shown in Figure 4(a). For the recently proposed DiscLDA (Lacoste-Jullien et al., 2008), since the implementation is not available, the results are taken from the original paper for both DiscLDA and LDA+SVM.

We can see that the max-margin based MedLDAc works better than sLDA, DiscLDA and the two-step method of LDA+SVM. Since MedLDAc integrates the max-margin principle in its training, the combination of MedLDA and SVM does not yield additional benefits on this task. We believe that the slight differences between MedLDA and MedLDA+SVM are due to the tuning of the regularization parameters. For efficiency, we do not change the regularization constant C during the training of MedLDAc; the performance could be improved by selecting a good C in different iterations, since the data representation is changing.
Figure 5. Predictive R² (left) and per-word likelihood (right) of different models on the movie review dataset.

Multi-class Classification: We perform multi-class classification on 20 Newsgroups with all the categories. We compare MedLDAc with MedLDA+SVM, LDA+SVM, and DiscLDA. We use the SVMstruct package⁵ with a 0/1 loss to solve the sub-step of learning q(η) and to build the SVM classifiers for LDA+SVM and MedLDA+SVM. The results are shown in Figure 4(b), where the results of DiscLDA are again taken
from (Lacoste-Jullien et al., 2008). We can see that
all the supervised topic models discover more predictive topics for classification, and the max-margin based
MedLDAc can achieve significant improvements with
an appropriate number (e.g., ≥ 80) of topics. Again,
we believe that the slight difference between MedLDAc
and MedLDA+SVM is due to parameter tuning.
4.2.2. Regression
We evaluate the MedLDAr model on the movie review
data set. As in (Blei & McAuliffe, 2007), we take logs
of the response values to make them approximately
normal. We compare MedLDAr with the unsupervised
LDA and sLDA. As we have stated, the underlying
topic model in MedLDAr can be an LDA or an sLDA. We
have implemented both, as denoted by MedLDA (partial) and MedLDA (full), respectively. For LDA, we
use its low dimensional representation of documents as
input features to a linear SVR and denote this method
by LDA+SVR. The evaluation criterion is predictive R² (pR²) as defined in (Blei & McAuliffe, 2007).
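The evaluation metric and the log transform can be made concrete with a short sketch. Predictive R² is one minus the ratio of residual to total sum of squares on the held-out responses; the ratings and predictions below are hypothetical placeholders, not data from the experiments.

```python
# Predictive R^2 as defined in Blei & McAuliffe (2007):
# pR2 = 1 - sum((y - yhat)^2) / sum((y - mean(y))^2),
# computed on log-transformed responses as in Sec. 4.2.2.
import numpy as np

def predictive_r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

ratings = np.array([2.5, 3.0, 4.0, 1.5, 3.5])      # hypothetical review scores
y = np.log(ratings)                                 # log transform for approximate normality
yhat = y + 0.1 * np.array([1, -1, 0.5, -0.5, 0])    # hypothetical model predictions
print(round(predictive_r2(y, yhat), 3))
```

A perfect predictor attains pR² = 1, while predicting the mean response for every document gives pR² = 0.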
Figure 5 shows the results together with the per-word
likelihood. We can see that the supervised MedLDA
and sLDA can get much better results than the unsupervised LDA, which ignores supervised responses.
By using max-margin learning, MedLDA (full) can get
slightly better results than the likelihood-based sLDA,
especially when the number of topics is small (e.g.,
≤ 15). Indeed, when the number of topics is small, the latent representation of sLDA alone does not yield a highly separable problem; integrating max-margin training therefore helps discover a more discriminative latent representation with the same number of topics. In fact, the number of support vectors
(i.e., documents that have at least one non-zero Lagrange multiplier) decreases dramatically at T = 15 and stays nearly the same for T > 15, which, with reference to Eq. (4), explains why the relative improvement over sLDA decreases as T increases. This behavior suggests that MedLDA can discover more predictive latent structures for difficult, non-separable problems.

5. http://svmlight.joachims.org/svm_multiclass.html
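The support-vector count used in the argument above can be read off any dual SVM/SVR solution. Since MedLDA's own solver is not publicly packaged, a standard ε-SVR on toy topic features illustrates the idea (an assumption for illustration only):

```python
# Counting support vectors: documents whose dual Lagrange multipliers are
# non-zero. Points inside the epsilon-tube get zero multipliers and are
# not support vectors; the rest are.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(15), size=100)              # toy topic proportions
y = theta @ rng.normal(size=15) + 0.05 * rng.normal(size=100)  # toy responses

svr = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(theta, y)
print(len(svr.support_), "support vectors out of", len(y))
```

Fewer support vectors means fewer documents exert margin pressure on the latent representation, which matches the observation that the max-margin terms matter most when the problem is hard to fit.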
For the two variants of MedLDAr, we observe an obvious improvement of MedLDA (full) over MedLDA (partial). This is because for MedLDA (partial), the update rule of φ does not have the third and fourth terms of Eq. (4); those terms couple the max-margin estimation and the latent topic discovery more tightly. Finally, a linear SVR on the empirical word frequencies achieves a pR² of 0.458, worse than those of sLDA and MedLDA.
4.2.3. Time Efficiency
For binary classification, MedLDAc is much more efficient than sLDA, and is comparable with LDA+SVM, as shown in Figure 6. The slowness of sLDA may be due to the mismatch between its normality assumption and the non-Gaussian binary response variables, which prolongs the E-step. For multi-class classification, the training time of MedLDAc depends mainly on solving a multi-class SVM problem, and is thus comparable to that of LDA. For regression, the training time of MedLDA (full) is comparable to that of sLDA, while MedLDA (partial) is more efficient.

Figure 6. Training time (in CPU-seconds, log scale) of MedLDA, sLDA, and LDA+SVM versus the number of topics.
5. Conclusions and Discussions
We have presented the maximum entropy discrimination LDA (MedLDA), which uses the max-margin principle to train supervised topic models. MedLDA integrates the max-margin principle into the latent topic discovery process by optimizing a single objective function with a set of expected margin constraints. This integration yields a predictive topic representation that is more suitable for regression or classification. We develop efficient variational methods for MedLDA. The empirical results on the movie review and 20 Newsgroups data sets show the promise of MedLDA for text modeling and prediction accuracy.
MedLDA represents a first step towards integrating the max-margin principle into supervised topic models, and under the general MedTM framework presented in Section 3, several improvements and extensions are on the horizon. Specifically, due to the nature of MedTM's joint optimization formulation, advances in either max-margin training or better variational bounds for inference can be easily incorporated. For instance, the mean-field variational upper bound in MedLDA can be improved by using the tighter collapsed variational bound (Teh et al., 2006), which achieves results comparable to collapsed Gibbs sampling (Griffiths & Steyvers, 2004). Moreover, as the experimental results suggest, incorporating a more expressive underlying topic model enhances the overall performance. We therefore plan to integrate and utilize other underlying topic models, such as the fully generative sLDA model, in the classification case.
Acknowledgements
This work was done while J.Z. was visiting CMU with support from NSF DBI-0546594 and DBI-0640543 awarded to E.X.; J.Z. is also supported by Chinese NSF Grants 60621062 and 60605003; National Key Foundation R&D Projects 2003CB317007, 2004CB318108 and 2007CB311003; and the Basic Research Foundation of Tsinghua National TNList Lab.
References
Blei, D., & Jordan, M. (2003). Modeling annotated
data. Inter. Conf. on Info. Retrieval, 127–134.
Blei, D., & Lafferty, J. (2005). Correlated topic
models. Neur. Info. Proc. Sys., 147–154.
Blei, D., & McAuliffe, J. D. (2007). Supervised topic
models. Neur. Info. Proc. Sys., 121–128.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. J. of Mach. Learn. Res., 993–1022.
Crammer, K., & Singer, Y. (2001). On the algorithmic
implementation of multiclass kernel-based vector
machines. J. of Mach. Learn. Res., 265–292.
Joachims, T. (1999). Making large-scale SVM learning practical. Advances in kernel methods–support
vector learning, MIT-Press, 169–184.
Lacoste-Jullien, S., Sha, F., & Jordan, M. I. (2008).
DiscLDA: Discriminative learning for dimensionality reduction and classification. NIPS, 897–904.
McCallum, A., Pal, C., Druck, G., & Wang, X. (2006). Multi-conditional learning: generative/discriminative training for clustering and classification. AAAI, 433–439.
Smola, A., & Schölkopf, B. (2003). A tutorial on support vector regression. Statistics and Computing, 199–222.
Griffiths, T., & Steyvers, M. (2004). Finding scientific topics. Proc. of the National Academy of Sci., 5228–5235.
Teh, Y. W., Newman, D., & Welling, M. (2006). A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. NIPS, 1353–1360.
van der Maaten, L., & Hinton, G. (2008). Visualizing
data using t-SNE. JMLR, 2579–2605.
Welling, M., Rosen-Zvi, M., & Hinton, G. (2004). Exponential family harmoniums with an application to information retrieval. NIPS, 1481–1488.
Zhu, J., Xing, E., & Zhang, B. (2008a). Laplace maximum margin Markov networks. ICML, 1256–1263.
Zhu, J., Xing, E., & Zhang, B. (2008b). Partially
observed maximum entropy discrimination Markov
networks. Neur. Info. Proc. Sys., 1977–1984.