IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 7, JULY 2012
Visual Workflow Recognition Using a Variational
Bayesian Treatment of Multistream
Fused Hidden Markov Models
Sotirios P. Chatzis and Dimitrios Kosmopoulos
Abstract—In this paper, we provide a variational Bayesian
(VB) treatment of multistream fused hidden Markov models
(MFHMMs), and apply it in the context of active learning-based visual workflow recognition (WR). Contrary to training
methods yielding point estimates, such as maximum likelihood
or maximum a posteriori training, the VB approach provides
an estimate of the posterior distribution over the MFHMM
parameters. As a result, our approach provides an elegant
solution toward the amelioration of the overfitting issues of point
estimate-based methods. Additionally, it provides a measure of
confidence in the accuracy of the learned model, thus allowing
for the easy and cost-effective utilization of active learning in the
context of MFHMMs. Two alternative active learning algorithms
are considered in this paper: query by committee, which selects
unlabeled data that minimize the classification variance, and
a maximum information gain method that aims to maximize
the alteration in model variance by proper data labeling. We
demonstrate the efficacy of the proposed treatment of MFHMMs
by examining two challenging WR scenarios, and show that the
application of active learning, which is facilitated by our VB
approach, allows for a significant reduction of the MFHMM
training costs.
Index Terms—Active learning, hidden Markov models,
multistream fusion, workflow recognition.
I. Introduction
Human behavior understanding in video sequences is a
research field rapidly gaining momentum over the last few
years. This is mainly due to its fundamental applications in
automated video indexing, virtual reality, human–computer
interaction, and smart monitoring. Especially, throughout the
last few years we have seen an increasing need for assisting
and extending the capabilities of human operators in remotely
monitored large and complex spaces, such as public areas,
airports, railway stations, parking lots, bridges, tunnels, and so
on. The last generation of surveillance systems was designed
to utilize multiple video streams from heterogeneous sensors
to automatically assess the ongoing activities in large monitored environments, flagging and presenting to the operator
suspicious events as they happen in order to prevent dangerous
situations [1], [2].
Manuscript received June 7, 2011; revised October 31, 2011; accepted
December 14, 2011. Date of publication March 5, 2012; date of current
version June 28, 2012. This paper was recommended by Associate Editor
C. N. Taylor.
S. P. Chatzis is with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. (e-mail:
[email protected]).
D. Kosmopoulos is with the University of Texas, Arlington, TX 76019 USA
(e-mail: kosmopo@edu).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2012.2189795
In this paper, we focus on visual workflow recognition
(WR); workflows are comparatively structured processes, in
contrast to monitoring stations or airports, and it is more realistic to believe that workflows can be modeled using computer
vision and machine learning. Identified deviations from a
predefined workflow possibly indicate security- and safety-related events and will be automatically highlighted. Distributed
smart workflow monitoring is applicable to mass production
or large-scale enterprises like industrial plants which have a
clear need for automated supervision services to guarantee
safety, security, and quality by enforcing adherence to predefined procedures for production or services. Such supervision
services are frequently of vital importance for the enterprise
because, apart from cost reduction, timely detection of safety
and security concerns may prevent injuries and even fatalities.
The complexity of detection and tracking of moving objects under occlusions in a typical structured environment
requires more than a single camera and features that will not
result from an error-prone tracker. Multiple cameras provide
a wider coverage of the scene and redundant data that help
solve occlusions and improve accuracy. Furthermore, the high
diversity and complexity of the behaviors that need to be
monitored requires new learning methods that will be able
to fuse information from multiple streams. Finally, the limited
availability in model training data, due to the prohibitively
high costs of capturing and annotating behavioral data from
real (e.g., industrial) installations, necessitates utilization of
an active learning framework, allowing for the exploitation of
unlabeled data to improve the classification performance of
the trained models.
1051-8215/$31.00 © 2012 IEEE
Fig. 1. Various fusion schemes using the HMM framework for two streams. The s, o stand for the states and the observations, respectively. The first index
marks the stream and the second the time. (a) Standard HMM. (b) HMM with feature fusion (FHMM). (c) State synchronous HMM (SHMM). (d) Parallel
HMM (PHMM). (e) Coupled HMM (CHMM). (f) Multistream fused HMM (MFHMM).
Hidden Markov models [Fig. 1(a)] are an extremely popular means of modeling a stream of sequential data, and
are widely adopted in behavioral analysis applications [2].
Using information from multiple streams of data pertaining
to the same sequence of events has been shown to allow
for a significant performance enhancement of hidden Markov
model (HMM)-based event analysis and detection models
[3]–[6]. Modern multimedia capturing and processing technologies have rendered insignificant the main hurdle of the
additional computational requirements imposed by multisensor
systems [2]. However, the reliability of the sensors is never
explicitly considered. Hence, in a video surveillance system
that employs multiple sensors, the problem of selecting the
most appropriate sensor or set of sensors to perform a certain
task often arises. Consequently, the first and most straightforward solution of early integration [7], which consists in merging all the observations related to all the streams into one large
stream (frame by frame), and modeling it using a single HMM,
is less than satisfactory [Fig. 1(b)]. To resolve this problem, an
adaptive multicue multicamera information fusion framework
based on democratic integration [8] is presented in [9]. Fusion
is performed by taking into account sensor reliability, yet there
is no direct sensor quality assessment. Instead, the reliability of
a source is estimated by measuring the distance between each
source estimate and the fused estimate, which is determined
by the sources' estimates. This is based on the assumption that
the majority of sensors are producing reliable estimates, which
cannot always be taken for granted.
A different probabilistic framework for multistream data
fusion is the multistream HMM approach [10], under which
each stream is modeled separately using its own HMM. Then,
analysis of the observed data can be conducted by creating
a special HMM, recombining all the single stream HMM
likelihoods at various specific temporal points. Obviously,
depending on the specific selection of these recombination
points, different solutions arise. For instance, in coupled hidden Markov models [5] [Fig. 1(e)], two component HMMs are
linked by the dependence of their hidden states. However, in
many applications where the component HMMs do not consist
of many states, such as in cases of audio-visual data, this
dependence assumption is not strong enough to capture the
statistical correlations between the multiple streams. In the
state-synchronous multistream HMM [Fig. 1(c)] the streams
are assumed to be synchronized. Each stream is modeled using
an individual HMM; the postulated streamwise HMMs share
the same state dynamics. As a result, this approach provides
only limited sequential data modeling flexibility.
In [11], a parallel hidden Markov model has been proposed
[Fig. 1(d)], which factorizes the state space into multiple
independent temporal processes without causal connections in between. Nevertheless, the assumption of the different temporal processes being independent of each other is clearly
invalid in most cases, especially when dealing with group or
interactive activities.
Multistream fused HMMs (MFHMMs) are another promising method for multistream data modeling [12] [Fig. 1(f)].
Like coupled HMMs and mixed-memory HMMs, an MFHMM
consists of multiple HMMs. However, unlike the previous
methods, the connections between the component HMMs are
chosen based on a probabilistic fusion model, which is optimal
according to the maximum entropy principle and a maximum mutual information criterion for selecting dimensionality
reduction transforms. As a consequence, the MFHMM has
several desirable features: 1) it has simpler and faster training
and inference algorithms than the previous models; 2) if one
of the component HMMs fails due to noise or a probable
malfunction of the sensor capturing the related observations
stream, the rest of the constituent HMMs can still work
properly; and 3) it still retains the crucial information about the
interdependencies between the multiple data streams, which
coupled HMMs tend to neglect.
In this paper, motivated by the aforementioned advantages of MFHMMs, we consider their application to the
addressed problem of visual behavioral analysis and monitoring from multiple visual inputs in structured environments
(WR). In the existing literature, the MFHMM is treated under a
maximum-likelihood (ML) framework, using the expectation-maximization (EM) algorithm (see [12]). Even though ML is
a common, and, in general, reliable approach for estimation of
probabilistic generative models, it suffers from the undesirable
property of being ill-posed since the likelihood function is
unbounded from above [13]–[15]. This fact might result in several very significant deficiencies, especially in cases of limited
training data availability; this is quite the case regarding the
applications we focus on in this paper, as training data from
real installations are difficult to collect, and very expensive to
process and annotate. As a result, using ML to train a set of
generative models for behavioral analysis and detection in such
a context might result in an unstable training procedure, yielding poor model estimates that are highly prone to overfitting; it
could even yield infinities in the likelihood function,
associated with the collapsing of the bell-shaped component
distributions onto individual data points, and, hence, resulting
in singular or near-singular covariance matrices [15].
To address these issues, in this paper we introduce a
Bayesian treatment of MFHMMs, overcoming the problems
of ML approaches elegantly, by marginalizing over the model
parameters with respect to appropriate priors, and maximizing
the resulting marginal likelihood of the model to obtain the
optimal model size. Our approach is based on variational
approximation methods [16], which have recently emerged
as a deterministic alternative to Markov chain Monte-Carlo
(MCMC) algorithms for doing Bayesian inference for probabilistic generative models [17], [18], with better scalability
in terms of computational cost [19]. Variational Bayesian
(VB) inference has been previously applied to a number
of probabilistic inference models, including relevance vector
machines [20], autoregressive models [21], [22], mixtures of
Gaussians and Student’s-t distributions [23], [24], mixtures of
factor analyzers [25], [26], and HMMs [27], [28], thereby
ameliorating the singularity and overfitting problems of ML
approaches in an elegant and computationally efficient manner.
Since variational Bayes provides a full posterior distribution
over the treated model parameters, the proposed approach
allows for the extraction of a reliable measure of confidence
in the obtained estimates of a trained MFHMM. This is yet
another significant advantage of the proposed VB treatment of
MFHMMs, as it allows for the easy and computationally efficient introduction of the MFHMM in the context of an elegant
active learning framework. Indeed, as we shall discuss in the
following sections of this paper, under the proposed variational
Bayesian treatment, well-known active learning criteria can be
easily implemented for MFHMMs, while previously they
were either computationally inefficient or intractable (when
considering point-estimated MFHMMs) [29], [30]. Therefore,
the introduction of the VB machinery does also allow for the
exploitation of effective active learning methodologies so as
to significantly reduce the training costs of MFHMMs, by
efficiently utilizing pools of cheap to acquire unlabeled data.
The remainder of this paper is organized as follows. In
Section II, the proposed variational Bayesian treatment of
MFHMMs is introduced, and the related model inference
and prediction algorithms are derived. In Section III, the
proposed approach is examined in the context of the active
learning framework. As we show, the proposed VB treatment
of MFHMMs allows for the efficient utilization of effective
active learning algorithms, which would be either impossible
or computationally burdensome to apply when considering
point-estimated MFHMMs. In Section IV, we examine the
efficacy of the proposed approach considering two challenging
visual WR scenarios using publicly available datasets. Finally,
in the concluding section of this paper, we summarize and
discuss our results.
II. Variational Bayesian Approach
Toward MFHMMs
A. Multistream Fused Hidden Markov Models
Consider M tightly interdependent time series (streams),
$X = \{X^m\}_{m=1}^{M}$, with $X^m = \{x_t^m\}_{t=1}^{T}$. Assume that the constituent
streams $X^m$, $m \in \{1, \dots, M\}$, of $X$ can be modeled by M
independent (streamwise) HMMs, with their corresponding
hidden state sequences denoted as $S^m = \{s_t^m\}_{t=1}^{T}$. Then, we have
$$p(X^m) = \sum_{S^m} p(X^m, S^m) \tag{1}$$
where
$$p(X^m, S^m) = p(s_1^m)\, p(x_1^m \mid s_1^m) \prod_{t=2}^{T} p(s_t^m \mid s_{t-1}^m)\, p(x_t^m \mid s_t^m) \quad \forall m \tag{2}$$
and $p(x_t^m \mid s_t^m)$ are the state-conditional likelihoods of the models, usually selected to be mixtures of Gaussians, or mixtures
of Student's-t densities [31]. In the following, we shall be
denoting as $\pi^m = (\pi_n^m)_n$ the initial state probabilities vector of
the mth postulated streamwise HMM, with
$$\pi_n^m \triangleq p(s_1^m = n)$$
and as $A^m = (a_{ij}^m)_{i,j}$ the corresponding state transition probabilities matrix, with
$$a_{ij}^m \triangleq p(s_t^m = j \mid s_{t-1}^m = i) \quad \forall t.$$
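The sum over state sequences in (1) has $N^T$ terms, but the factorization in (2) allows it to be evaluated with the standard forward recursion in $O(TN^2)$ operations. The following is a minimal sketch of that computation, assuming discrete observation symbols for brevity (the models in this paper use Gaussian or Student's-t mixtures instead); the names `forward_loglik`, `brute_force_loglik`, `pi`, `A`, and `B` are illustrative and do not come from the paper.

```python
import itertools
import numpy as np

def forward_loglik(pi, A, B, obs):
    """log p(X) of one streamwise HMM via the scaled forward recursion.

    pi[n]   = p(s_1 = n)
    A[i, j] = p(s_t = j | s_{t-1} = i)
    B[n, k] = p(x_t = k | s_t = n)   (discrete emissions: an assumption)
    """
    alpha = pi * B[:, obs[0]]
    loglik = 0.0
    for x in obs[1:]:
        scale = alpha.sum()                      # rescale to avoid underflow
        loglik += np.log(scale)
        alpha = ((alpha / scale) @ A) * B[:, x]  # propagate, then emit
    return loglik + np.log(alpha.sum())

def brute_force_loglik(pi, A, B, obs):
    """Direct evaluation of (1): sum (2) over all N**T state sequences."""
    N, T = len(pi), len(obs)
    total = 0.0
    for s in itertools.product(range(N), repeat=T):
        p = pi[s[0]] * B[s[0], obs[0]]
        for t in range(1, T):
            p *= A[s[t - 1], s[t]] * B[s[t], obs[t]]
        total += p
    return np.log(total)

# Toy two-state model with made-up parameters.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
obs = [0, 1, 1, 0]
```

Both functions return the same value on short sequences; only the forward recursion remains practical as $T$ grows.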
The problem addressed by MFHMMs is how to construct
a new structure linking the postulated streamwise HMMs
together, which gives an optimal approximation
of the joint probability of the stream data, $p(X) = p(\{X^m\}_{m=1}^{M})$ [12].
For this purpose, MFHMMs take advantage of the fact that
the streams $\{X^m\}_{m=1}^{M}$ can be separately modeled by individual
HMMs. Then, to capture the statistical dependence between
these streams, a set of transforms $w^m \triangleq g(X^m)$ is introduced,
such that the joint probability $p(\{w^m\}_{m=1}^{M})$ can be more easily
calculated compared to $p(\{X^m\}_{m=1}^{M})$. On this basis,
the MFHMM obtains an optimal approximation of
$p(X)$ according to the maximum entropy principle, given by
[32]
$$p(X) \approx \tilde{p}(X) \tag{3}$$
where
$$\tilde{p}(X) = \tilde{p}(\{X^m\}_{m=1}^{M}) = \frac{p(\{w^m\}_{m=1}^{M})}{\prod_{m=1}^{M} p(w^m)} \prod_{m=1}^{M} p(X^m). \tag{4}$$
Selection of a proper expression for the transforms $w^m$ is
conducted on the basis of the maximum mutual information
(MMI) criterion [33], a criterion that has also been used
for discriminative training of HMMs with considerable success
[34]. The MMI criterion essentially comprises minimization of the
Kullback–Leibler divergence $\mathrm{KL}(p \| \tilde{p})$ between the exact distribution $p(X)$ and the approximate distribution $\tilde{p}(X)$, where
$$\mathrm{KL}(p \| \tilde{p}) = -\int dX^1 \cdots dX^M \, p(\{X^m\}_{m=1}^{M}) \log \frac{\tilde{p}(\{X^m\}_{m=1}^{M})}{p(\{X^m\}_{m=1}^{M})}. \tag{5}$$
It can be shown [12] that by application of the MMI criterion,
and considering that all the fused data streams of the MFHMM
are (a priori) of equal reliability, (4) yields
$$\tilde{p}(X) = \tilde{p}(\{X^m\}_{m=1}^{M}) = \frac{1}{M} \sum_{m=1}^{M} p(X^m) \prod_{r \neq m} p(X^r \mid \hat{S}^m). \tag{6}$$
In (6), $\hat{S}^m$ are the state sequence estimates of the available
stream data, obtained by application of the Viterbi algorithm
[15] on the individual streamwise HMMs comprising the postulated MFHMM. Regarding the coupling densities $p(X^r \mid \hat{S}^m)$,
from the conditional independence property of the Markov
chain, we obtain
$$p(X^r \mid \hat{S}^m) = \prod_{t=1}^{T} p(x_t^r \mid \hat{s}_t^m). \tag{7}$$
The probabilities $p(x_t^r \mid \hat{s}_t^m)$ of the MFHMM can be modeled by
means of mixtures of Gaussian or Student's-t densities, similar
to the state-conditional likelihoods of the streamwise HMMs.
Note that for each possible value, say i, of $\hat{s}_t^m$, a different
coupling density model $p(x_t^r \mid \hat{s}_t^m = i)$ is to be postulated. Hence,
if we consider N-state streamwise HMMs, there is a total of N
different finite mixture models that must be trained to model
the coupling densities $p(x_t^r \mid \hat{s}_t^m)$, $\forall r, m$.
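In practice, the products in (6) and (7) underflow for realistic sequence lengths, so the fused likelihood is more safely evaluated in the log domain. The sketch below illustrates one way of doing so, assuming that the streamwise log-likelihoods and the coupling log-likelihoods have already been computed; the function name `fused_loglik` and the array layout are our own illustrative choices, not part of [12].

```python
import numpy as np

def fused_loglik(stream_ll, coupling_ll):
    """log of (6): (1/M) * sum_m p(X^m) * prod_{r != m} p(X^r | S_hat^m).

    stream_ll[m]      = log p(X^m)
    coupling_ll[m, r] = log p(X^r | S_hat^m), i.e., (7) accumulated over t
                        in the log domain; diagonal entries are ignored.
    """
    stream_ll = np.asarray(stream_ll, dtype=float)
    coupling_ll = np.asarray(coupling_ll, dtype=float)
    M = stream_ll.shape[0]
    off_diag = ~np.eye(M, dtype=bool)
    # Term m: log p(X^m) + sum_{r != m} log p(X^r | S_hat^m)
    terms = stream_ll + np.where(off_diag, coupling_ll, 0.0).sum(axis=1)
    # Numerically stable log-mean-exp over the M terms.
    peak = terms.max()
    return peak + np.log(np.exp(terms - peak).sum()) - np.log(M)

# Two streams with made-up probabilities.
example = fused_loglik(np.log([0.2, 0.5]),
                       np.log([[1.0, 0.3],
                               [0.4, 1.0]]))
```

For this toy input the fused likelihood equals $\frac{1}{2}(0.2 \cdot 0.3 + 0.5 \cdot 0.4) = 0.13$.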
B. Variational Bayesian Inference for the MFHMM
Bayesian treatments of probabilistic generative models comprise introduction of a set of prior distributions over the
model parameters and further maximization of the model’s
log marginal likelihood (log evidence). For convenience, usually conjugate priors are preferred, as this selection greatly
simplifies inference and interpretability [16]. However, due to
the complexity of the MFHMM, exact Bayesian inference for
our model is intractable. Nevertheless, the choice of conjugate
exponential prior distributions for the model parameters allows
for the derivation of an elegant variational framework.
Let us consider a model $p(X \mid \Theta)$ treated under the variational Bayesian paradigm. Let $p(\Theta)$ be the conjugate prior
imposed on the model, and $X$ be the used set of training data.
Variational Bayesian inference is conducted by introducing an
approximate (variational) posterior over the model parameters
$q(\Theta)$, and considering the well-known equality for the log
evidence, $\log p(X)$ [19]
$$\log p(X) = F(q) + \mathrm{KL}(q \| p) \tag{8}$$
where
$$F(q) = \int d\Theta \, q(\Theta) \log \frac{p(X, \Theta)}{q(\Theta)}. \tag{9}$$
Since the KL divergence term in (8) is a nonnegative quantity,
$F(q)$ is a lower bound of the log evidence, that
is
$$\log p(X) \geq F(q) \tag{10}$$
and would become exact if $q(\Theta) = p(\Theta \mid X)$. Hence, by maximizing the lower bound of the log evidence (variational
free energy), $F(q)$, so that it becomes as tight as possible,
i.e., minimizing the KL divergence between the true and the
variational posterior, a good variational inference scheme is
obtained. In other words, variational Bayes can be summarized
under the maximization scheme
$$q(\Theta) = \arg\max_{q} F(q). \tag{11}$$
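The decomposition in (8) and the bound in (10) can be checked numerically on a toy model. The sketch below assumes a parameter that takes only three values, so that the integral in (9) becomes a finite sum; all numbers and names (`prior`, `lik`, `free_energy`, `kl`) are hypothetical and serve only to illustrate the identity.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])     # p(Theta) on three hypothetical values
lik = np.array([0.10, 0.40, 0.25])    # p(X | Theta) for the observed X
joint = prior * lik                   # p(X, Theta)
evidence = joint.sum()                # p(X)
posterior = joint / evidence          # p(Theta | X)

def free_energy(q):
    """F(q) in (9), with the integral replaced by a sum."""
    return float(np.sum(q * np.log(joint / q)))

def kl(q, p):
    """KL(q || p) for two discrete distributions with full support."""
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.2, 0.5, 0.3])         # an arbitrary variational posterior

gap = np.log(evidence) - free_energy(q)   # equals KL(q || posterior) by (8)
```

Setting `q` equal to `posterior` closes the gap, which is the sense in which the bound becomes exact.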
It is worthwhile noting that the variational posteriors obtained by optimization of the variational free energy F (q)
are only an approximation of the actual posterior densities
$p(\Theta \mid X)$. However, the variational Bayesian approach allows
for considerably better scalability in terms of computational
cost [19] compared to exact Bayesian inference using MCMC
algorithms, which becomes of practical importance in applications requiring fast processing of high-dimensional large-scale
datasets.
As the MFHMM consists of two fundamental “building
blocks,” the streamwise HMMs, and the coupling models, variational Bayesian inference for this model can be decomposed
into two separate procedures: 1) variational Bayes for the
postulated (streamwise) HMMs; and 2) variational Bayes for
the postulated finite mixture models (coupling models). Below,
we provide an outline of the proposed variational Bayesian
treatment of the MFHMM.
C. Variational Posteriors
Consider an MFHMM modeling M tightly interdependent
time series, $\{X^m\}_{m=1}^{M}$. For simplicity, and without any loss of
generality, we assume that all the observed time series have the
same length, $T$, i.e., $X^m = \{x_t^m\}_{t=1}^{T}$. The variational posteriors
of the postulated MFHMM can be derived as follows.
1) Streamwise HMM Training: Initially, M individual
HMMs are trained independently (one for each stream), by
means of the VB algorithm, as described, e.g., in [27]. These
are the constituent streamwise HMMs of our model, with obtained variational posteriors $q(\Theta^m)$, where $\Theta^m$ are the parameters of the mth constituent streamwise HMM, $m = 1, \dots, M$.
For simplicity, and without any loss of generality, we consider
N-state streamwise HMMs.
Specifically, VB inference for the streamwise HMMs of
our model is conducted by imposing Dirichlet priors over the
initial state and state transition probabilities of the models
$$p(\pi^m) = \mathcal{D}(\pi_1^m, \dots, \pi_N^m \mid \phi_1^m, \dots, \phi_N^m) \tag{12}$$
$$p(A^m) = \prod_{i=1}^{N} \mathcal{D}(a_{i1}^m, \dots, a_{iN}^m \mid \upsilon_{i1}^m, \dots, \upsilon_{iN}^m). \tag{13}$$
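Because the Dirichlet priors in (12) and (13) are conjugate to the categorical initial-state and transition distributions, the corresponding posterior is again Dirichlet, with concentration parameters equal to the prior ones plus (expected) counts. A minimal sketch of that update for one transition matrix is given below; it assumes that the expected transition counts have already been produced by a forward-backward pass, and the names `dirichlet_posterior`, `dirichlet_mean`, and `expected_counts` are hypothetical. The complete update equations are those of [27] and are not reproduced here.

```python
import numpy as np

def dirichlet_posterior(prior_conc, expected_counts):
    """Row-wise Dirichlet update: posterior concentration = prior + counts."""
    return np.asarray(prior_conc, dtype=float) + np.asarray(expected_counts, dtype=float)

def dirichlet_mean(conc):
    """Posterior mean of each row, i.e., the expected transition probabilities."""
    conc = np.asarray(conc, dtype=float)
    return conc / conc.sum(axis=-1, keepdims=True)

upsilon = np.ones((2, 2))                  # flat prior over each row of A^m
expected_counts = np.array([[3.0, 1.0],    # made-up expected i -> j counts
                            [0.0, 4.0]])
post = dirichlet_posterior(upsilon, expected_counts)
mean_A = dirichlet_mean(post)
```

Note that the zero count in the second row still leaves a nonzero posterior mean, which is one way the prior guards against degenerate estimates when data are scarce.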
The observation emission probabilities of the hidden states
of the models are taken as finite mixtures of Gaussian
or Student's-t distributions. Considering, for simplicity,
K-component mixture models, we impose a Dirichlet prior
over their mixture component weights of the form
$$p(\{\delta_{ik}^m\}_{i,k=1}^{N,K}) = \prod_{i=1}^{N} \mathcal{D}(\delta_{i1}^m, \dots, \delta_{iK}^m \mid \epsilon_{i1}^m, \dots, \epsilon_{iK}^m) \tag{14}$$
and a joint Normal–Wishart prior over the means and precision
matrices of the (Gaussian or Student's-t) mixture component
densities
$$p(\{\mu_{ij}^m, R_{ij}^m\}_{i,j=1}^{N,K}) = \prod_{i=1}^{N} \prod_{j=1}^{K} \mathcal{NW}(\mu_{ij}^m, R_{ij}^m \mid \lambda_{ij}^m, \gamma_{ij}^m, \eta_{ij}^m, Q_{ij}^m). \tag{15}$$
As a result of choosing to impose conjugate priors over the
parameters of our model, the resulting variational posteriors
of the model parameters take the same functional form as
their corresponding priors [19]. Complete derivations of these
posteriors have been provided in one of our previous works
[27], and hence we refrain from repeating them here for
brevity.
2) Sequence Decoding: The best hidden state sequences
$\hat{S}^m$ of the streamwise HMMs, corresponding to the used
training data $X^m$, are found using the VB Viterbi algorithm [27]. The VB Viterbi algorithm comprises maximization of the approximate (variational) posterior expectation of
$\log p(X^m, S^m \mid \Theta^m)$
$$\hat{S}^m = \arg\max_{S^m} \int d\Theta^m \, q(\Theta^m) \log p(X^m, S^m \mid \Theta^m) \tag{16}$$
where $\log p(X^m, S^m \mid \Theta^m)$ is defined in (2) (for the details refer
to [27]).
3) Coupling Models Training: Finally, the coupling models
are obtained. This problem is equivalent to postulating one finite mixture model (with Gaussian or Student's-t densities) for
each of the distributions $p(x_t^r \mid \hat{s}_t^m = i)$, $\forall i \in \{1, \dots, N\}$, $r, m \in \{1, \dots, M\}$, $r \neq m$, and subsequently employing variational
Bayes to obtain the variational posteriors $q(\Theta_i^{r,m})$ over their
parameter sets $\Theta_i^{r,m}$. The complete derivations of the VB
training algorithm for finite mixtures of Gaussian densities can
be found in [16], while for the case of Student's-t densities
they are provided in [24].
D. Hidden State Sequence Estimation Algorithm
Essentially, this is the problem of maximizing
$$\{\hat{S}^m\}_{m=1}^{M} = \arg\max_{\{S^m\}_{m=1}^{M}} \int d\Theta \, q(\Theta) \log p(\{X^m, S^m\}_{m=1}^{M} \mid \Theta) \tag{17}$$
where $\Theta \triangleq \{\{\Theta_i^{r,m}\}_{i=1}^{N}, \Theta^m\}_{m,r=1,\, m \neq r}^{M}$. Then, following the
related results of [12], and assuming that all the postulated
streamwise HMMs are of the same reliability, using (3) and
(6) we have that (17) eventually reads
$$\hat{S}^m = \arg\max_{S^m} \int d\Theta^m \, q(\Theta^m) \log \left[ p(X^m, S^m \mid \Theta^m) \prod_{r \neq m} p(X^r \mid S^m; \Theta_{S^m}^{r,m}) \right]. \tag{18}$$
Comparing the result (18) with (16), we directly observe that
estimation of the optimal state sequences $\hat{S}^m$ for the MFHMM
effectively boils down to merely an application of the VB
Viterbi algorithm, with the probabilities
$$p(X^m \mid S^m) = \prod_{t=1}^{T} p(x_t^m \mid s_t^m) \quad \forall m \tag{19}$$
of the single-HMM Viterbi algorithm being now replaced with
the products
$$\prod_{\forall r} p(X^r \mid S^m) = \prod_{\forall r} \prod_{t=1}^{T} p(x_t^r \mid s_t^m) \quad \forall m \tag{20}$$
in which expression the quantities $p(X^m \mid S^m)$ are given by the
postulated streamwise HMMs, and the quantities $p(X^r \mid S^m)$,
$r \neq m$, are given by the coupling models.
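The substitution of (19) by (20) amounts to adding, frame by frame, the log coupling scores of the other streams to the log emission scores of the stream being decoded, and then running an otherwise unchanged Viterbi pass. The sketch below illustrates this with plain point-estimate log probabilities; it is not the VB Viterbi algorithm of [27], which works with posterior expectations instead, and the names `viterbi` and `fused_log_emit` are our own.

```python
import numpy as np

def viterbi(log_pi, log_A, log_emit):
    """Most likely state path; log_emit[t, n] is the log score of state n at t."""
    T, N = log_emit.shape
    delta = log_pi + log_emit[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_A      # cand[i, j]: arrive in j from i
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):          # follow the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def fused_log_emit(own_log_emit, coupling_log_emits):
    """Per-frame log score of (20): own-stream term plus every coupling term."""
    return own_log_emit + sum(coupling_log_emits)

# Toy example: the decoded stream is uninformative, the coupled stream is not.
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
own = np.log(np.full((3, 2), 0.5))
coupled = np.log([[0.95, 0.05], [0.95, 0.05], [0.05, 0.95]])
path = viterbi(log_pi, log_A, fused_log_emit(own, [coupled]))
```

Here the evidence carried by the coupled stream is what moves the decoded path into the second state at the last frame.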
E. Predictive Probability
The ultimate goal of Bayesian learning is, given a set of
test data, to perform density estimation with respect to the
learned model. Let us suppose the test data $Y = \{Y^m\}_{m=1}^{M}$, with
$Y^m = \{y_t^m\}_{t=1}^{T}$, and an MFHMM trained using the training data
$X$, with obtained variational posterior $q(\Theta)$. The variational
(approximate) predictive density of the given test data with
respect to the considered MFHMM is given by
$$p(Y \mid X) = \int d\Theta \, q(\Theta)\, p(Y \mid \Theta) \tag{21}$$
yielding
$$p(Y \mid X) = p(\{Y^m\}_{m=1}^{M} \mid X) \approx \frac{1}{M} \sum_{m=1}^{M} q(Y^m) \prod_{r \neq m} q(Y^r \mid \hat{S}^m) \tag{22}$$
where
$$q(Y^r \mid \hat{S}^m) = \prod_{t=1}^{T} q(y_t^r \mid \hat{s}_t^m) \tag{23}$$
while $q(Y^m)$ and $q(y_t^r \mid \hat{s}_t^m)$ are the predictive densities of the
streamwise HMMs and the coupling models, respectively,
comprising the trained MFHMM, which can be obtained based
on the VB treatments of these models (see the descriptions in
[27] regarding the streamwise HMMs, and the discussions in
[16] regarding the coupling models).
III. How Does VB Facilitate MFHMM-Based
Active Learning?
As we have already discussed, due to the prohibitively
high costs of capturing and annotating behavioral data from
real installations, measures have to be taken to avoid severe MFHMM training algorithm instabilities (e.g., yielding
singular covariance estimates). Variational Bayes serves us
well toward the achievement of this goal. However, another
significant repercussion of the shortage in training data regards
the high chances of the trained model manifesting a notably
poor generalization performance [35].
To remedy this issue, we employ in this paper the concept
of active learning. Active learning is based on the notion that
the performance of the learners (here MFHMMs) might be
considerably improved if the learners could actively participate
in the learning process [36]. That is, contrary to conventional
supervised learning, where the learner “passively” receives
the labeled data and generates a learned model, we would
like to introduce a framework for identifying a subset of a
pool of unlabeled examples that would be most informative
if the associated labels were available and incorporate them
in the learning procedure. Hence, the proposed active learning
methodology comprises two basic procedures: first, selection
of the most informative samples from a pool of unlabeled data;
and, second, labeling of these samples and introduction into
the model training procedure of the MFHMM.
Under the proposed Bayesian treatment of the MFHMM,
the informativeness of a new data point can be assessed
analytically by viewing unlabeled sample selection as an
information extraction process: we select the data that give
us maximum information about the pool of unlabeled samples;
in other words, we apply an information gain criterion. Since
variational Bayes yields a posterior over the model parameters
$\Theta$, the information gain after adding an unlabeled sample to
the training set can be expressed in the context of information
theory: “How much information about $\Theta$ can be obtained if
we add an unlabeled sample $X^*$ to the training set?”
Indeed, let us consider C modeled behavioral classes, each
one represented by a postulated MFHMM. Following [37], we
measure the information gain obtained by adding an unlabeled
sample $X^*$ to the training set by means of the KL divergence
between the posterior density of the MFHMM parameters
obtained after adding the unlabeled sample $X^*$ to the
training set and the one obtained before the augmentation, defined as
follows:
$$G(X^*) \triangleq \sum_{c^*=1}^{C} \mathrm{KL}\big(p(\Theta^{c^*} \mid X^*, X) \,\|\, p(\Theta^{c^*} \mid X)\big)\, p(c^* \mid X^*; X). \tag{24}$$
In (24), $p(\Theta^{c^*} \mid X^*, X)$ is the variational posterior of the
$c^*$th postulated MFHMM (modeling the $c^*$th behavioral class),
obtained after adding the unlabeled sample $X^*$ to the
training data of the class; $p(\Theta^{c^*} \mid X)$ is the variational posterior
of the $c^*$th postulated MFHMM obtained before adding
any unlabeled data; and, finally, $p(c^* \mid X^*; X)$ is the a posteriori
probability of the $c^*$th class regarding the unlabeled sample
$X^*$, which, considering all the classes of equal a priori
probability, is given by
$$p(c^* \mid X^*; X) = \frac{p_{c^*}(X^* \mid X)}{\sum_{k=1}^{C} p_k(X^* \mid X)} \tag{25}$$
where $p_k(X^* \mid X)$ is the (variational) predictive probability of
the data $X^*$ with respect to the kth class MFHMM, defined in
(22).
In essence, $G(X^*)$ seeks labels that can most shrink or
expand (i.e., change) the model variance; thus, the information
gain obtained by this measure is defined in terms of the
possible change in the model variance, which has been shown
to be more appropriate than other related information gain
metrics [38], as well as other candidate unlabeled data selection strategies, e.g., the query by committee (QBC) approach
[39], for comparably low computational costs. Finally, in
regards to the labeling decision for the unlabeled samples
selected to be incorporated in the model training procedure,
this can be simply effected by maximization of the a posteriori
probabilities (25) of the selected data points over the class
labels $c^*$.
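Once the per-class KL divergences and predictive probabilities are available, evaluating (24) and (25) for a pool of candidates reduces to a few array operations. The sketch below assumes that both quantities have been precomputed for every pool sample (the costly part, which requires retraining each class model with the candidate added, is not shown); the function names and the toy numbers are hypothetical.

```python
import numpy as np

def class_posterior(log_pred):
    """(25): normalized per-class predictive probabilities, equal class priors."""
    log_pred = np.asarray(log_pred, dtype=float)
    w = np.exp(log_pred - log_pred.max())   # shift for numerical stability
    return w / w.sum()

def information_gain(kl_per_class, log_pred):
    """(24): per-class KL divergences weighted by the class posterior."""
    return float(np.dot(kl_per_class, class_posterior(log_pred)))

def select_and_label(pool_kl, pool_log_pred):
    """Pick the pool sample with the largest gain and assign its MAP label."""
    gains = [information_gain(k, l) for k, l in zip(pool_kl, pool_log_pred)]
    best = int(np.argmax(gains))
    label = int(np.argmax(class_posterior(pool_log_pred[best])))
    return best, label, gains[best]

# Two candidate samples, two classes, made-up values.
pool_kl = [np.array([0.1, 0.1]), np.array([2.0, 4.0])]
pool_log_pred = [np.log([1.0, 1.0]), np.log([1.0, 3.0])]
best, label, gain = select_and_label(pool_kl, pool_log_pred)
```

In this toy pool the second sample is selected, since it both changes the class posteriors more and is assigned to a class with higher confidence.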
In our experimental investigations, apart from the information gain criterion (24), we shall also consider the QBC
approach as another alternative for the conduction of active
learning in the context of the variational Bayesian MFHMM.
In the framework of QBC [39], [40], the informativeness of an
example is measured by computing the classification variance
with respect to the entire space of possible models consistent
with the training data thus far. Since this computation is
practically infeasible, the QBC algorithm approximates the
entire space by randomly sampling the posterior distribution
of the model parameters obtained from model training. These
randomly selected models serve as a “committee” of classifiers
to classify each unlabeled example. Then, the classification
variance is measured by computing the disagreement over
the classifications obtained by the classifiers comprising the
committee. The data samples with the strongest disagreement
among the committee are selected for labeling.
In this paper, this degree of disagreement shall be measured
via the KL divergence, measuring the average distance of the
class posterior density resulting from each committee member
to their mean value. Specifically, let us denote as $\{\hat{\Theta}_\xi^c\}_{\xi=1}^{\Xi}$ a set
of instances of the trained MFHMM model of the cth class,
with variational posterior $q(\Theta^c)$, obtained by sampling $q(\Theta^c)$
$\Xi$ consecutive times. Then, the score of an unlabeled data $X^*$
given by the sampled committee of experts is given by
$$\mathrm{score}(X^*) = \frac{1}{\Xi} \sum_{\xi=1}^{\Xi} \mathrm{KL}\big(p(c^* \mid X^*, \hat{\Theta}_\xi^{c^*}) \,\|\, p_{\mathrm{avg}}(c^* \mid X^*)\big) \tag{26}$$
where
$$p_{\mathrm{avg}}(c^* \mid X^*) = \frac{1}{\Xi} \sum_{\xi=1}^{\Xi} p(c^* \mid X^*, \hat{\Theta}_\xi^{c^*}) \tag{27}$$
and, considering all the classes of equal a priori probability,
we have
$$p(c \mid X^*, \hat{\Theta}_\xi^c) = \frac{p_c(X^* \mid \hat{\Theta}_\xi^c)}{\sum_{k=1}^{C} p_k(X^* \mid \hat{\Theta}_\xi^k)} \tag{28}$$
where $p_c(X^* \mid \hat{\Theta}_\xi^c)$ is the predictive probability of the MFHMM
of the cth class with respect to $X^*$, and $c^*$ in (26) and (27) is
the class that maximizes (28) for the given committee member
$\xi$ and predictive point $X^*$.
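A committee-based score of this kind can be sketched as follows, given the log predictive probabilities of an unlabeled sample under every sampled committee member and class. The sketch takes the KL divergence over the full class distribution of each member (the usual KL-to-the-mean disagreement measure), which is only one possible reading of (26) and should be treated as an assumption; the names `member_posteriors` and `qbc_score` and the smoothing constant `eps` are likewise our own.

```python
import numpy as np

def member_posteriors(log_lik):
    """(28) per member: log_lik[xi, c] = log p_c(X* | sampled model xi of class c)."""
    log_lik = np.asarray(log_lik, dtype=float)
    w = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

def qbc_score(log_lik, eps=1e-12):
    """Average KL divergence of each member's class posterior to the mean."""
    post = member_posteriors(log_lik)        # shape (committee size, C)
    avg = post.mean(axis=0)                  # (27)
    kl = np.sum(post * (np.log(post + eps) - np.log(avg + eps)), axis=1)
    return float(kl.mean())                  # (26)

# Two committee members, two classes, made-up likelihoods.
agree = qbc_score(np.log([[0.9, 0.1], [0.9, 0.1]]))
disagree = qbc_score(np.log([[0.9, 0.1], [0.1, 0.9]]))
```

A sample on which the members agree scores zero, whereas conflicting members produce a strictly positive score, so ranking the pool by this score surfaces the samples with the strongest disagreement.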
IV. Experimental Results
To experimentally verify the proposed approach, we
have used two public benchmark datasets involving action
recognition of humans, namely the CMU-MMAC and WR
databases.
A. Meal Preparation
The first set of experiments was based on a part of the CMU-MMAC
database [41]. The CMU-MMAC database contains
multimodal measures of human activity of subjects performing
tasks involved in cooking and food preparation. Six synchronized
cameras have been used to capture scenarios, such as
preparation of salad, pizza, eggs, and sandwich. Many types
of tasks have been annotated within these scenarios. In our
experiments, we considered the brownie preparation scenario.
We have used 12 videos containing the full scenario, and
sought to recognize the 29 tasks described in Table I; the
ground-truth annotations were taken from the dataset providers. Views
from two cameras (7151020 and 7151062) were employed for
that purpose (see Fig. 2).
Fig. 2. Different camera views in the CMU Multi-Modal Activity Database
(from [41]). We used the cameras 7151020 (first in second row) and 7151062
(second in first row).
To extract the spatiotemporal variations, we used pixel
change history images to capture the motion history
(see [42]), and computed the complex Zernike moments
$A_{00}, A_{11}, A_{20}, A_{22}, A_{31}, A_{33}, A_{40}, A_{42}, A_{44}, A_{51}, A_{53}, A_{55}, A_{60},
A_{62}, A_{64}, A_{66}$, for each of which we computed the norm and
the angle. Additionally, the center of gravity and the area
of the detected blobs were used, making a total of 31
parameters, thus providing an acceptable scene reconstruction
without a computationally prohibitive dimension. Zernike
moments were calculated in rectangular regions of interest
of approximately 15 000 pixels in each image to limit the
processing and allow real-time feature extraction (performed
at a rate of approximately 50–60 frames/s).
The employed HMMs comprised three states, each one having
a single mixture component distribution, which facilitated
fast algorithm execution with acceptable results. The streams
were coupled using a Gaussian mixture of two components.
We randomly selected two full workflows for initial training
(each containing 62 samples of all possible tasks), drew samples
for the active learning algorithm from two different workflows
(68 task samples in total), and used the remaining eight
workflows for testing (258 task samples in total). A graphical
representation of the obtained success rates as new samples
were included is given in Fig. 3.
B. Industrial Part Assembly
We used the WR dataset, and specifically the first two
workflows pertaining to car assembly (see [43] for more
details). The tasks to recognize in each of the workflows are
the following:
1) worker 1 picks up part 1 from rack 1 (upper) and places
it on the welding cell; mean duration is 8–10 s;
2) worker 1 and worker 2 pick part 2a from rack 2 and
place it on the welding cell;
3) worker 1 and worker 2 pick part 2b from rack 3 and
place it on the welding cell;
4) worker 2 picks up spare parts 3a, 3b from rack 4, and
places them on the welding cell;
5) worker 2 picks up spare part 4 from rack 1 and places
it on the welding cell;
6) worker 1 and worker 2 pick up part 5 from rack 5 and
place it on the welding cell.
TABLE I
Meal Preparation Tasks From the CMU-MMAC Database,
Including Their Code and the Total Amount of Samples in the
12 Brownie Preparation Scenarios

| Task Code | Task | Total Samples |
|---|---|---|
| 03 | Close fridge | 11 |
| 06 | Open brownie bag | 9 |
| 07 | Open brownie box | 12 |
| 12 | Open fridge | 11 |
| 14 | Pour brownie bag into big bowl | 12 |
| 15 | Pour oil into big bowl | 12 |
| 16 | Pour oil into measuring cup small | 12 |
| 17 | Pour water into big bowl | 12 |
| 18 | Pour water into measuring cup big | 11 |
| 19 | Put baking pan into oven | 12 |
| 24 | Put pam into cupboard bottom right | 9 |
| 22 | Put oil into cupboard bottom right | 10 |
| 27 | Spray pam | 10 |
| 28 | Stir big bowl | 12 |
| 30 | Switch on | 12 |
| 31 | Take baking pan | 12 |
| 32 | Take big bowl | 12 |
| 33 | Take brownie box | 12 |
| 34 | Take egg | 11 |
| 35 | Take fork | 12 |
| 37 | Take measuring cup big | 12 |
| 38 | Take measuring cup small | 12 |
| 39 | Take oil | 10 |
| 40 | Take pam | 9 |
| 42 | Twist off cap | 11 |
| 43 | Twist on cap | 12 |
| 44 | Walk to counter | 11 |
| 45 | Walk to fridge | 11 |
| 50 | Crack egg on big bowl | 9 |
Each of the six assembly tasks listed above is a class that has to be
recognized. The partial or total occlusions due to the racks make the
tasks very difficult to recognize with a single camera, and
therefore two views have been used (see Fig. 4), hence the
need for a methodology allowing for the successful fusion of
the information contained in tightly coupled time series.
In our experiments, we have used two different workflows,
each one comprising 20 sequences representing full assembly
cycles and containing at least one of the considered behaviors.
The total number of frames in each case was approximately
80 000. Annotation of these frames has been performed manually. The second workflow is considered more difficult because
the tasks may be executed in parallel, whereas in the first
workflow the tasks were always executed sequentially. The
same type of features was used as in the previous subsection.
HMM configuration was similar to the previous experiment.
We randomly selected three full workflows for initial training
(each containing all possible tasks), drew samples for the active
learning algorithm from seven workflows (42 task samples in total),
and left the remaining ten workflows for testing (60 task samples
in total). The results
for the first and second workflows are given in Figs. 5 and 6,
respectively.
C. Comparison to Baseline Classification Methods
To verify the merit of the variational Bayesian approach
toward observation fusion methods, we have included experimental
comparisons of the variational Bayesian approach
against the standard HMM and MFHMM models obtained
using EM-based training. In all cases, three HMM states
with a single component observation model were used for
both the VB and EM methods. The results are displayed in
Table II for models with Gaussian observation densities, and in
Table III for models with Student's-t observation densities. As
we observe, VB gives results that in most cases are much better
than those of the EM algorithm for both the streamwise and the fused
models. The higher accuracy comes, of course, at a higher
computational cost. In all our experiments, classification using
the VB models required between four and five times more time
compared to the EM approach. Nevertheless, although higher,
the computational time needed still remains of the same order
of magnitude.
Fig. 3. Success rates for the active learning methods compared to the random
case, using the first workflow of the CMU-MMAC dataset. The x-axis is
the number of selected samples for training, the y-axis is the respective
accuracy on the test set. (a) Accuracy of camera 7151020 streamwise model.
(b) Accuracy of camera 7151062 streamwise model. (c) Accuracy fusing both
cameras.
Fig. 4. Schematic and camera views in the car assembly environment.
Fig. 5. Success rates for the active learning methods compared to the random
case, using the first workflow of the WR-dataset. The x-axis is the number
of selected samples for training, the y-axis is the respective accuracy on the
test set. (a) Accuracy of camera 1 streamwise model. (b) Accuracy of camera
2 streamwise model. (c) Accuracy fusing both cameras.
Fig. 6. Success rates for the active learning methods compared to the random
case, using the second workflow of the WR-dataset. The x-axis is the number
of selected samples for training, the y-axis is the respective accuracy on the
test set. (a) Accuracy of camera 1 streamwise model. (b) Accuracy of camera
2 streamwise model. (c) Accuracy fusing both cameras.
Furthermore, we have also compared to MCMC-based
methods; for this purpose, we have considered the HMM
model proposed in [44]. In our experiments, we used a
truncation level of ten states for this model, and imposed priors
similar to our VB-based inference algorithm. Theoretically,
higher accuracy is expected as the number of sampling iterations increases. Indeed, we observed this behavior in our
experiments; nevertheless, to achieve similar or higher performance compared to the corresponding VB-based models,
a very large number of iterations was needed, requiring too
many computational resources, although the dimensionality
of the problem was not too large. In Table II, we provide
the results of the MCMC-based method of [44] for 10 000
sampling iterations, which is a number of sampling iterations
incurring reasonable computational costs (12 h on an Intel
Xeon 2.53 GHz PC).
Finally, regarding the comparative computational costs
of sequence classification using the proposed VB-based
MFHMM model and simple streamwise models, we would like
to mention that the costs of the proposed model are roughly
equal to the sum of the costs of the corresponding streamwise
models. Hence, in cases where two streams are used, our
approach roughly imposes double the costs of a single streamwise model. This result was theoretically expected, considering
that prediction in our model is conducted using (22).
D. Discussion
In our experimental investigations, we evaluated the performance of the proposed information fusion scheme. Clearly,
our fusion approach yielded improved results over methods
using single-stream information. We also observed that the
VB methods outperformed the respective EM-based ones for
both the streamwise and the fused models. This result was
theoretically expected since the latter models yield point estimates, which are more vulnerable to overfitting [27].
We also investigated the effectiveness of the proposed
framework in an active learning setting. Two different active
learning criteria were examined, namely information gain and
query by committee. Using these methods, we were able to
select the most appropriate samples to incorporate in model
training. This process of sample selection was repeated until
the maximum number of new samples was reached.
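The repeated sample selection described above follows the usual pool-based active learning pattern, which can be sketched as follows. This is a generic illustration under stated assumptions rather than the procedure used in the experiments: `fit`, `score`, and `oracle` are hypothetical callables standing in for model training, the selection criterion (for instance information gain or query by committee), and the human annotator, and the toy one-dimensional threshold classifier exists only to make the loop runnable.

```python
def active_learning_loop(train, pool, fit, score, oracle, max_new):
    """Greedy pool-based active learning.

    Each round labels the pool sample with the highest criterion score,
    adds it to the training set, and retrains the model.
    """
    train = list(train)
    pool = list(pool)
    model = fit(train)
    for _ in range(min(max_new, len(pool))):
        best = max(range(len(pool)), key=lambda i: score(model, pool[i]))
        x = pool.pop(best)
        train.append((x, oracle(x)))  # query the annotator for the label
        model = fit(train)            # retrain with the newly labeled sample
    return model, train

# Toy stand-ins (assumptions for illustration only).
def fit(train):
    """Threshold halfway between the two class means."""
    neg = [x for x, y in train if y == 0]
    pos = [x for x, y in train if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2.0

def score(threshold, x):
    """Samples closer to the decision threshold are treated as more informative."""
    return -abs(x - threshold)

def oracle(x):
    """Hypothetical annotator: the true class boundary lies at 0.5."""
    return int(x > 0.5)

model, train = active_learning_loop(
    [(0.0, 0), (1.0, 1)], [0.1, 0.45, 0.9, 0.55], fit, score, oracle, max_new=2
)
```

In this toy run the two samples nearest the decision boundary are queried, while the samples far from it are left unlabeled in the pool.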
TABLE II
Comparison to Standard EM Approaches Using the Gaussian Observation Model

| Dataset | EM-HMM1 | EM-HMM2 | EM-MFHMM | VB-HMM1 | VB-HMM2 | VB-MFHMM | MCMC-1 | MCMC-2 |
|---|---|---|---|---|---|---|---|---|
| CMU-MMAC | 39.49 | 35.90 | 41.03 | 43.08 | 37.95 | 44.62 | 42.13 | 29.23 |
| WR 1 | 90.00 | 70.00 | 90.00 | 95.00 | 86.67 | 96.67 | 78.00 | 71.00 |
| WR 2 | 55.71 | 37.14 | 63.33 | 63.33 | 56.67 | 68.33 | 35.00 | 45.00 |

Columns EM-HMM1 and EM-HMM2 provide the accuracy (%) of the EM-trained streamwise HMMs, and EM-MFHMM provides the
accuracy of the EM-trained multistream fused HMM. The corresponding results for models trained using the variational Bayesian
approach are provided in columns VB-HMM1, VB-HMM2, and VB-MFHMM, respectively. Accuracy for the MCMC-trained streamwise
models is provided in MCMC-1 and MCMC-2.
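To make the comparison in Table II easier to read, the short sketch below computes, for each dataset, the accuracy gain of the VB-trained fused model over the EM-trained fused model, and over the better of the two VB-trained streamwise models. The accuracies are copied from Table II; the dictionary layout and the function name are illustrative choices only.

```python
# Accuracies (%) copied from Table II (Gaussian observation model).
TABLE_II = {
    "CMU-MMAC": {"EM-MFHMM": 41.03, "VB-HMM1": 43.08, "VB-HMM2": 37.95, "VB-MFHMM": 44.62},
    "WR 1": {"EM-MFHMM": 90.00, "VB-HMM1": 95.00, "VB-HMM2": 86.67, "VB-MFHMM": 96.67},
    "WR 2": {"EM-MFHMM": 63.33, "VB-HMM1": 63.33, "VB-HMM2": 56.67, "VB-MFHMM": 68.33},
}

def accuracy_gains(table):
    """Per-dataset gains of VB-MFHMM, in percentage points."""
    gains = {}
    for dataset, row in table.items():
        best_stream = max(row["VB-HMM1"], row["VB-HMM2"])
        gains[dataset] = {
            "vb_over_em_fused": round(row["VB-MFHMM"] - row["EM-MFHMM"], 2),
            "fusion_over_best_stream": round(row["VB-MFHMM"] - best_stream, 2),
        }
    return gains

gains = accuracy_gains(TABLE_II)
for dataset, g in gains.items():
    print(dataset, g)
```

All six differences are positive, in line with the observation that the VB-trained fused model is the most accurate configuration on each dataset in this table.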
TABLE III
Comparison to Standard EM Approaches Using the Student's-t Observation Model

| Dataset | EM-HMM1 | EM-HMM2 | EM-MFHMM | VB-HMM1 | VB-HMM2 | VB-MFHMM |
|---|---|---|---|---|---|---|
| CMU-MMAC | 41.03 | 33.85 | 43.07 | 43.59 | 42.56 | 45.64 |
| WR 1 | 90.00 | 72.86 | 91.42 | 93.33 | 91.67 | 98.33 |
| WR 2 | 60.00 | 38.33 | 65.71 | 61.67 | 56.67 | 68.33 |

Columns EM-HMM1 and EM-HMM2 provide the accuracy of the EM-trained streamwise HMMs, and EM-MFHMM provides
the accuracy of the EM-trained multistream fused HMM.
The corresponding results for models trained using the variational Bayesian approach are provided in
columns VB-HMM1, VB-HMM2, and VB-MFHMM, respectively.
It has to be mentioned that QBC entails sampling of the
model parameters, which may require a large number of
experts. In our setting, we used 30 experts, obtained by drawing the same
number of samples; increasing the number of experts would
give more representative results, but the computational
burden would increase proportionally. In our setting, the
required execution time was almost the same for both methods
for the number of experts used by the QBC method.
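A committee of this kind can be drawn by sampling parameter instances from the posterior. The sketch below illustrates the idea for the transition matrix only, under the assumption (not stated in this section) that the posterior over each transition row is a Dirichlet distribution, as is common in conjugate variational treatments of HMMs. The function names and the hyperparameter values are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_committee(alpha_rows, n_experts=30, seed=0):
    """Draw n_experts transition matrices, one Dirichlet draw per row.

    The cost grows linearly with n_experts, since every expert requires
    its own draw (and, downstream, its own likelihood evaluation).
    """
    rng = random.Random(seed)
    return [
        [sample_dirichlet(row, rng) for row in alpha_rows]
        for _ in range(n_experts)
    ]

# Hypothetical posterior hyperparameters for a three-state model.
alpha_rows = [[5.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 5.0]]
committee = sample_committee(alpha_rows, n_experts=30)
```

Each committee member is a valid row-stochastic matrix, and doubling `n_experts` doubles the number of draws, which mirrors the proportional growth in computational burden noted above.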
Clearly, active learning outperformed random sample selection.
To achieve the same performance, active learning
methods require much less labeled data than random selection. The
differences in accuracy are larger when only a few
samples are added. We have observed that both the information gain and QBC criteria
are able to select samples that are closer to optimal in the
sense of acquired information. As expected, we also observed
that as more samples are labeled and added to the training set,
the gap in performance compared to random selection tends
to shrink. Furthermore, we noted that in most cases neither
of the proposed active learning methods could significantly
outperform the other.
V. Conclusion
In this paper, we presented a novel variational Bayesian
treatment of multistream fused hidden Markov models, with
application to visual WR using multicamera networks. MFHMMs have been very successful in fusion of information from
tightly interdependent data streams, with low computational
requirements. In this paper, we employed an elegant variational
Bayesian treatment, which does not need large amounts of
training data to guarantee dependable model estimation, since
variational Bayes is much less prone to overfitting. Hence,
despite the fact that the annotation of training data can be a
major bottleneck, our VB-based method did not require a large
amount of annotated data.
A major advantage of the proposed variational Bayesian
treatment of MFHMMs over conventional approaches consists
in the provision of a measure of confidence in the obtained
model estimates. As we showed, utilization of this information
allowed for the computationally efficient integration of the
MFHMM into an active learning framework, by application
of popular active learning criteria that would be either computationally cumbersome or even intractable were it not for
the proposed variational Bayesian treatment.
References
[1] G. L. Foresti, C. Micheloni, L. Snidaro, P. Remagnino, and T. Ellis,
“Active video-based surveillance systems,” IEEE Signal Proc. Mag.,
vol. 22, no. 2, pp. 25–37, Mar. 2005.
[2] G. L. Foresti, C. S. Regazzoni, and P. K. Varshney, Multisensor
Surveillance Systems: The Fusion Perspective. Norwell, MA: Kluwer,
2003.
[3] N. Oliver, B. Rosario, and A. Pentland, “A Bayesian computer vision
system for modeling human interactions,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 22, no. 8, pp. 831–843, Aug. 2000.
[4] S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous
speech recognition,” IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141–151,
Sep. 2000.
[5] M. Brand, N. Oliver, and A. Pentland, “Coupled hidden Markov models
for complex action recognition,” in Proc. IEEE CVPR, Jun. 1997, pp.
994–999.
[6] N. Oliver, E. Horvitz, and A. Garg, “Layered representations for learning
and inferring office activity from multiple sensory channels,” in Proc.
Int. Conf. Multimodal Interfaces, 2002, pp. 163–180.
[7] D. G. Stork and M. E. Hennecke, “Speech reading by humans and
machines,” in NATO ASI Series F, vol. 150. Berlin, Germany: Springer-Verlag, 1996.
[8] J. Triesch and C. von der Malsburg, “Democratic integration: Self-organized integration of adaptive cues,” Neural Comput., vol. 13, no. 9,
pp. 2049–2074, 2001.
[9] O. Kahler, J. Denzler, and J. Triesch, “Hierarchical sensor data fusion by
probabilistic cue integration for robust 3-D object tracking,” in Proc. 6th
IEEE Southwest Symp. Image Anal. Interpret., May 2004, pp. 216–220.
[10] A. Morris, A. Hagen, H. Glotin, and H. Bourlard, “Multi-stream adaptive
evidence combination for noise robust ASR,” Speech Comm., vol. 34,
nos. 1–2, pp. 25–40, Apr. 2001.
[11] C. Vogler and D. Metaxas, “A framework for recognizing the simultaneous aspects of American sign language,” Comput. Vis. Image
Understanding, vol. 81, pp. 358–384, Mar. 2001.
[12] Z. Zeng, J. Tu, B. M. Pianfetti, Jr., and T. S. Huang, “Audio-visual
affective expression recognition through multistream fused HMM,” IEEE
Trans. Multimedia, vol. 10, no. 4, pp. 570–577, Jun. 2008.
[13] K. Yamazaki and S. Watanabe, “Singularities in mixture models and
upper bounds of stochastic complexity,” Neural Netw., vol. 16, no. 7,
pp. 1029–1038, 2003.
[14] C. Archambeau, J. Lee, and M. Verleysen, “On the convergence problems of the EM algorithm for finite Gaussian mixtures,” in Proc. 11th
Eur. Symp. Artif. Neural Netw., 2003, pp. 99–106.
[15] G. McLachlan and D. Peel, Finite Mixture Models (Wiley Series in
Probability and Statistics). New York: Wiley, 2000.
[16] C. M. Bishop, Pattern Recognition and Machine Learning. New York:
Springer, 2006.
[17] J. Diebolt and C. Robert, “Estimation of finite mixture distributions
through Bayesian sampling,” J. Roy. Statist. Soc. B, vol. 56, no. 2, pp.
363–375, 1994.
[18] S. Richardson and P. Green, “On Bayesian analysis of mixtures with
unknown number of components,” J. Roy. Statist. Soc. B, vol. 59, no.
4, pp. 731–792, Apr. 1997.
[19] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, “An introduction
to variational methods for graphical models,” in Learning in Graphical
Models, M. Jordan, Ed. Dordrecht, The Netherlands: Kluwer, 1998, pp.
105–162.
[20] C. Bishop and M. Tipping, “Variational relevance vector machines,” in
Proc. 16th Conf. Uncertainty Artif. Intell., 2000, pp. 46–53.
[21] S. Roberts and W. Penny, “Variational Bayes for generalized autoregressive models,” IEEE Trans. Signal Process., vol. 50, no. 9, pp.
2245–2257, Sep. 2002.
[22] V. Smidl and A. Quinn, “Mixture-based extension of the AR model
and its recursive Bayesian identification,” IEEE Trans. Signal Process.,
vol. 53, no. 9, pp. 3530–3542, Sep. 2005.
[23] C. Archambeau and M. Verleysen, “Robust Bayesian clustering,” Neural
Netw., vol. 20, no. 1, pp. 129–138, Jan. 2007.
[24] M. Svensén and C. M. Bishop, “Robust Bayesian mixture modelling,”
Neurocomputing, vol. 64, no. 1, pp. 235–252, Jan. 2005.
[25] Z. Ghahramani and M. Beal, “Variational inference for Bayesian mixtures of factor analysers,” in Proc. 12th Adv. NIPS, vol. 12. 1999, pp.
449–455.
[26] S. Chatzis, D. Kosmopoulos, and T. Varvarigou, “Signal modeling and
classification using a robust latent space model based on t distributions,”
IEEE Trans. Signal Process., vol. 56, no. 3, pp. 949–963, Mar. 2008.
[27] S. Chatzis and D. Kosmopoulos, “A variational Bayesian methodology for hidden Markov models utilizing Student’s-t mixtures,” Pattern
Recognit., vol. 44, no. 2, pp. 295–306, 2011.
[28] I. Rezek and S. J. Roberts, “Ensemble hidden Markov models with
extended observation densities for biosignal analysis,” in Probabilistic
Modeling in Biomedicine and Medical Bioinformatics, E. D. Husmeier,
R. Dybowski, and S. Roberts, Eds. Berlin, Germany: Springer-Verlag,
2005.
[29] D. MacKay, “Information-based objective functions for active data
selection,” Neural Computation, vol. 4, no. 4, pp. 589–603, 1992.
[30] D. Cohn, Z. Ghahramani, and M. Jordan, “Active learning with statistical
models,” J. Artif. Intell. Res., vol. 4, no. 3, pp. 129–145, Mar. 1996.
[31] S. Chatzis, D. Kosmopoulos, and T. Varvarigou, “Robust sequential data
modeling using an outlier tolerant hidden Markov model,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 31, no. 9, pp. 1657–1669, Sep. 2009.
[32] S. P. Luttrell, “The use of Bayesian and entropic methods in neural
network theory,” in Maximum Entropy and Bayesian Methods. Boston,
MA: Kluwer, 1989, pp. 363–370.
[33] H. Pan, Z.-P. Liang, and T. S. Huang, “Estimation of the joint probability
of multisensory signals,” Pattern Recognit. Lett., vol. 22, no. 13, pp.
1431–1437, Nov. 2001.
[34] D. Povey and P. Woodland, “Minimum phone error and i-smoothing for
improved discriminative training,” in Proc. ICASSP, 2002, pp. 105–108.
[35] S. Raudys and A. Jain, “Small sample size effects in statistical pattern
recognition: Recommendations for practitioners,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 13, no. 3, pp. 252–264, Mar. 1991.
[36] M. A. Osborne, R. Garnett, and S. J. Roberts, “Active data selection for
sensor networks with faults and changepoints,” in Proc. IEEE 24th Int.
Conf. AINA, Apr. 2010, pp. 533–540.
[37] S. Ji, B. Krishnapuram, and L. Carin, “Variational Bayes for continuous
hidden Markov models and its application to active learning,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 522–532, Apr.
2006.
[38] D. MacKay, “Information-based objective functions for active data
selection,” Neural Comput., vol. 4, no. 4, pp. 589–603, Apr. 1992.
[39] Y. Freund, H. Seung, E. Shamir, and N. Tishby, “Selective sampling
using the query by committee algorithm,” Mach. Learning, vol. 28, nos.
2–3, pp. 133–168, Mar. 1997.
[40] H. Seung, M. Opper, and H. Sompolinsky, “Query by committee,” in
Proc. 5th Ann. ACM Workshop Comput. Learning Theory, 1992, pp.
287–294.
[41] F. D. La Torre, J. Hodgins, J. Montano, S. Valcarcel, R. Forcada, and
J. Macey, “Guide to the Carnegie Mellon University multimodal activity
(CMU-MMAC) database,” Carnegie Mellon Univ., Pittsburgh, PA, Tech.
Rep. CMU-RI-TR-08-22, Jul. 2009.
[42] D. Kosmopoulos and S. Chatzis, “Robust visual behavior recognition,”
IEEE Signal Process. Mag., vol. 27, no. 5, pp. 34–45, Sep. 2010.
[43] A. Voulodimos, D. Kosmopoulos, G. Vasileiou, E. Sardis, A. Doulamis,
V. Anagnostopoulos, C. Lalos, and T. Varvarigou, “A dataset for workflow recognition in industrial scenes,” in Proc. IEEE Int. Conf. Image
Process., Sep. 2011, pp. 3310–3313.
[44] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky, “An HDP-HMM for systems with state persistence,” in Proc. Int. Conf. Mach.
Learning, Jul. 2008, pp. 312–319.
Sotirios P. Chatzis received the M.Sc. (five-year
Diploma) degree in electrical and computer engineering from the National Technical University of
Athens, Athens, Greece, in 2005, and the Ph.D.
degree in machine learning from the National Technical University of Athens in 2008.
From January 2009 to June 2010, he was a Post-Doctoral Researcher with the University of Miami,
Coral Gables, FL. He is currently a Post-Doctoral
Researcher with the Department of Electrical and
Electronic Engineering, Imperial College London,
London, U.K. His current research interests include machine learning theory
and methodologies with a special focus on hierarchical Bayesian models,
reservoir computing, robot learning by demonstration, copulas, quantum
statistics, Bayesian nonparametrics, and artificial creativity.
Dr. Chatzis first authored 28 journal papers in the most prestigious journals
of his research field by the age of 28 years. His Ph.D. research was supported
by the Bodossaki Foundation, Greece, and the Greek Ministry for Economic
Development, while he received the Dean’s Scholarship for Ph.D. Studies,
being the Best Performing Ph.D. Student of his class.
Dimitrios Kosmopoulos received the Ph.D. degree
in electrical and computer engineering from the
National Technical University of Athens, Athens,
Greece, in 2001.
Since then, he has collaborated with the National Center for Scientific Research “Demokritos,”
Athens, the National Technical University of Athens,
the University of Central Greece, Lamia, Greece,
and the Technical Educational Institute of Athens,
Athens. He is currently a Visiting Assistant Professor
with the University of Texas, Arlington. His current
research interests include computer vision, robotics, and machine learning.
He has published more than 50 papers in these fields and has participated in
several industrial and scientific projects as a developer, consultant, or technical
coordinator.