1076 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 7, JULY 2012

Visual Workflow Recognition Using a Variational Bayesian Treatment of Multistream Fused Hidden Markov Models

Sotirios P. Chatzis and Dimitrios Kosmopoulos

Abstract: In this paper, we provide a variational Bayesian (VB) treatment of multistream fused hidden Markov models (MFHMMs), and apply it in the context of active learning-based visual workflow recognition (WR). Contrary to training methods yielding point estimates, such as maximum likelihood or maximum a posteriori training, the VB approach provides an estimate of the posterior distribution over the MFHMM parameters. As a result, our approach provides an elegant solution toward the amelioration of the overfitting issues of point estimate-based methods. Additionally, it provides a measure of confidence in the accuracy of the learned model, thus allowing for the easy and cost-effective utilization of active learning in the context of MFHMMs. Two alternative active learning algorithms are considered in this paper: query by committee, which selects unlabeled data that minimize the classification variance, and a maximum information gain method that aims to maximize the alteration in model variance by proper data labeling. We demonstrate the efficacy of the proposed treatment of MFHMMs by examining two challenging WR scenarios, and show that the application of active learning, which is facilitated by our VB approach, allows for a significant reduction of the MFHMM training costs.

Index Terms: Active learning, hidden Markov models, multistream fusion, workflow recognition.

I. Introduction

Human behavior understanding in video sequences is a research field that has been rapidly gaining momentum over the last few years. This is mainly due to its fundamental applications in automated video indexing, virtual reality, human–computer interaction, and smart monitoring.
Especially, throughout the last few years we have seen an increasing need for assisting and extending the capabilities of human operators in remotely monitored large and complex spaces, such as public areas, airports, railway stations, parking lots, bridges, tunnels, and so on. The last generation of surveillance systems was designed to utilize multiple video streams from heterogeneous sensors to automatically assess the ongoing activities in large monitored environments, flagging and presenting to the operator suspicious events as they happen in order to prevent dangerous situations [1], [2].

Manuscript received June 7, 2011; revised October 31, 2011; accepted December 14, 2011. Date of publication March 5, 2012; date of current version June 28, 2012. This paper was recommended by Associate Editor C. N. Taylor.
S. P. Chatzis is with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. (e-mail: [email protected]).
D. Kosmopoulos is with the University of Texas, Arlington, TX 76019 USA (e-mail: kosmopo@edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2012.2189795

In this paper, we focus on visual workflow recognition (WR); workflows are comparatively structured processes, in contrast to monitoring stations or airports, and it is more realistic to believe that workflows can be modeled using computer vision and machine learning. The identified deviations from a predefined workflow possibly indicate security and safety related events and will be automatically highlighted. Distributed smart workflow monitoring is applicable to mass production or large-scale enterprises like industrial plants which have a clear need for automated supervision services to guarantee safety, security, and quality by enforcing adherence to predefined procedures for production or services.
Such supervision services are frequently of vital importance for the enterprise because, apart from cost reduction, timely detection of safety and security concerns may prevent injuries and even fatalities.

The complexity of detection and tracking of moving objects under occlusions in a typical structured environment requires more than a single camera, as well as features that do not depend on an error-prone tracker. Multiple cameras provide a wider coverage of the scene and redundant data that help resolve occlusions and improve accuracy. Furthermore, the high diversity and complexity of the behaviors that need to be monitored require new learning methods that are able to fuse information from multiple streams. Finally, the limited availability of model training data, due to the prohibitively high costs of capturing and annotating behavioral data from real (e.g., industrial) installations, necessitates utilization of an active learning framework, allowing for the exploitation of unlabeled data to improve the classification performance of the trained models.

Hidden Markov models [Fig. 1(a)] are an extremely popular means of modeling a stream of sequential data, and are widely adopted in behavioral analysis applications [2]. Using information from multiple streams of data pertaining to the same sequence of events has been shown to allow for a significant performance enhancement of hidden Markov model (HMM)-based event analysis and detection models [3]–[6]. Modern multimedia capturing and processing technologies have rendered insignificant the main hurdle of the additional computational requirements imposed by multisensor systems [2]. However, the reliability of the sensors is never explicitly considered. Hence, in a video surveillance system that employs multiple sensors, the problem of selecting the most appropriate sensor or set of sensors to perform a certain task often arises. Consequently, the first and most straightforward solution of early integration [7], which consists in merging all the observations related to all the streams into one large stream (frame by frame), and modeling it using a single HMM, is less than satisfactory [Fig. 1(b)].

1051-8215/$31.00 © 2012 IEEE

Fig. 1. Various fusion schemes using the HMM framework for two streams. The s, o stand for the states and the observations, respectively. The first index marks the stream and the second the time. (a) Standard HMM. (b) HMM with feature fusion (FHMM). (c) State synchronous HMM (SHMM). (d) Parallel HMM (PHMM). (e) Coupled HMM (CHMM). (f) Multistream fused HMM (MFHMM).

To resolve this problem, an adaptive multicue multicamera information fusion framework based on democratic integration [8] is presented in [9]. Fusion is performed by taking into account sensor reliability, yet there is no direct sensor quality assessment. Instead, the reliability of a source is estimated by measuring the distance between each source estimate and the fused estimate, which is itself determined by the sources' estimates. This is based on the assumption that the majority of sensors are producing reliable estimates, which cannot always be taken for granted.

A different probabilistic framework for multistream data fusion is the multistream HMM approach [10], under which each stream is modeled separately using its own HMM. Then, analysis of the observed data can be conducted by creating a special HMM, recombining all the single stream HMM likelihoods at various specific temporal points. Obviously, depending on the specific selection of these recombination points, different solutions arise. For instance, in coupled hidden Markov models [5] [Fig. 1(e)], two component HMMs are linked by the dependence of their hidden states.
However, in many applications where the component HMMs do not consist of many states, such as in cases of audio-visual data, this dependence assumption is not strong enough to capture the statistical correlations between the multiple streams. In the state-synchronous multistream HMM [Fig. 1(c)] the streams are assumed to be synchronized. Each stream is modeled using an individual HMM; the postulated streamwise HMMs share the same state dynamics. As a result, this approach provides only limited sequential data modeling flexibility. In [11], a parallel hidden Markov model has been proposed [Fig. 1(d)], which factorizes the state space into multiple independent temporal processes without causal connections in between. Nevertheless, the assumption of the different temporal processes being independent of each other is clearly invalid in most cases, especially when dealing with group or interactive activities.

Multistream fused HMMs (MFHMMs) are another promising method for multistream data modeling [12] [Fig. 1(f)]. Like coupled HMMs and mixed-memory HMMs, an MFHMM consists of multiple HMMs. However, unlike the previous methods, the connections between the component HMMs are chosen based on a probabilistic fusion model, which is optimal according to the maximum entropy principle and a maximum mutual information criterion for selecting dimensionality reduction transforms. As a consequence, the MFHMM has several desirable features: 1) it has simpler and faster training and inference algorithms than the previous models; 2) if one of the component HMMs fails due to noise or a probable malfunction of the sensor capturing the related observations stream, the rest of the constituent HMMs can still work properly; and 3) it still retains the crucial information about the interdependencies between the multiple data streams, which coupled HMMs tend to neglect.
In this paper, motivated by the aforementioned advantages of MFHMMs, we consider their application to the addressed problem of visual behavioral analysis and monitoring from multiple visual inputs in structured environments (WR). In the existing literature, the MFHMM is treated under a maximum-likelihood (ML) framework, using the expectation-maximization (EM) algorithm (see [12]). Even though ML is a common and, in general, reliable approach for estimation of probabilistic generative models, it suffers from the undesirable property of being ill-posed, since the likelihood function is unbounded from above [13]–[15]. This fact might result in several very significant deficiencies, especially in cases of limited training data availability; this is precisely the case in the applications we focus on in this paper, as training data from real installations are difficult to collect, and very expensive to process and annotate. As a result, using ML to train a set of generative models for behavioral analysis and detection in such a context might result in an unstable training procedure, yielding poor model estimates that are prone to overfitting; it could even yield infinities in the likelihood function, associated with the collapsing of the bell-shaped component distributions onto individual data points, and, hence, result in singular or near-singular covariance matrices [15].

To address these issues, in this paper we introduce a Bayesian treatment of MFHMMs, overcoming the problems of ML approaches elegantly, by marginalizing over the model parameters with respect to appropriate priors, and maximizing the resulting marginal likelihood of the model to obtain the optimal model size.
Our approach is based on variational approximation methods [16], which have recently emerged as a deterministic alternative to Markov chain Monte-Carlo (MCMC) algorithms for doing Bayesian inference for probabilistic generative models [17], [18], with better scalability in terms of computational cost [19]. Variational Bayesian (VB) inference has been previously applied to a number of probabilistic inference models, including relevance vector machines [20], autoregressive models [21], [22], mixtures of Gaussians and Student’s-t distributions [23], [24], mixtures of factor analyzers [25], [26], and HMMs [27], [28], thereby ameliorating the singularity and overfitting problems of ML approaches in an elegant and computationally efficient manner. Since variational Bayes provides a full posterior distribution over the treated model parameters, the proposed approach allows for the extraction of a reliable measure of confidence in the obtained estimates of a trained MFHMM. This is yet another significant advantage of the proposed VB treatment of MFHMMs, as it allows for the easy and computationally efficient introduction of the MFHMM in the context of an elegant active learning framework. Indeed, as we shall discuss in the following sections of this paper, under the proposed variational Bayesian regard, well-known active learning criteria can be easily implemented for MFHMMs, while previously they were either computationally inefficient or intractable (when considering point-estimated MFHMMs) [29], [30]. Therefore, the introduction of the VB machinery does also allow for the exploitation of effective active learning methodologies so as to significantly reduce the training costs of MFHMMs, by efficiently utilizing pools of cheap to acquire unlabeled data. The remainder of this paper is organized as follows. In Section II, the proposed variational Bayesian treatment of MFHMMs is introduced, and the related model inference and prediction algorithms are derived. 
In Section III, the proposed approach is examined in the context of the active learning framework. As we show, the proposed VB treatment of MFHMMs allows for the efficient utilization of effective active learning algorithms, which would be either impossible or computationally burdensome to apply when considering point-estimated MFHMMs. In Section IV, we examine the efficacy of the proposed approach considering two challenging visual WR scenarios using publicly available datasets. Finally, in the concluding section of this paper, we summarize and discuss our results.

II. Variational Bayesian Approach Toward MFHMMs

A. Multistream Fused Hidden Markov Models

Consider $M$ tightly interdependent time series (streams), $X = \{X^m\}_{m=1}^M$, with $X^m = \{x_t^m\}_{t=1}^T$. Assume that the constituent streams $X^m$, $m \in \{1, \dots, M\}$ of $X$, can be modeled by $M$ independent (streamwise) HMMs, with their corresponding hidden state sequences denoted as $S^m = \{s_t^m\}_{t=1}^T$. Then, we have

$$p(X^m) = \sum_{S^m} p(X^m, S^m) \tag{1}$$

where

$$p(X^m, S^m) = p(s_1^m)\, p(x_1^m | s_1^m) \prod_{t=2}^{T} p(s_t^m | s_{t-1}^m)\, p(x_t^m | s_t^m) \quad \forall m \tag{2}$$

and $p(x_t^m | s_t^m)$ are the state-conditional likelihoods of the models, usually selected to be mixtures of Gaussians, or mixtures of Student's-t densities [31]. In the following, we shall be denoting as $\pi^m = (\pi_n^m)_n$ the initial state probabilities vector of the $m$th postulated streamwise HMM, with $\pi_n^m \triangleq p(s_1^m = n)$, and as $A^m = (a_{ij}^m)_{i,j}$ the corresponding state transition probabilities matrix, with $a_{ij}^m \triangleq p(s_t^m = j | s_{t-1}^m = i)\ \forall t$.

The problem addressed by MFHMMs is how to construct a new structure linking the postulated streamwise HMMs together, which will be giving an optimal approximation of the joint probability of the stream data, $p(X) = p(\{X^m\}_{m=1}^M)$ [12]. For this purpose, MFHMMs take advantage of the fact that the streams $\{X^m\}_{m=1}^M$ can be separately modeled by individual HMMs.
Then, to capture the statistical dependence between these streams, a set of transforms $w^m \triangleq g(X^m)$ is introduced, such that the joint probability $p(\{w^m\}_{m=1}^M)$ can be more easily calculated compared to $p(\{X^m\}_{m=1}^M)$. On this basis, the MFHMM obtains an optimal approximation of $p(X)$ according to the maximum entropy principle, given by [32]

$$p(X) \approx \tilde{p}(X) \tag{3}$$

where

$$\tilde{p}(X) = \tilde{p}\left(\{X^m\}_{m=1}^M\right) = \frac{p\left(\{w^m\}_{m=1}^M\right)}{\prod_{m=1}^{M} p(w^m)} \prod_{m=1}^{M} p(X^m). \tag{4}$$

Selection of a proper expression for the transforms $w^m$ is conducted on the basis of the maximum mutual information (MMI) criterion [33], a criterion that has also been used with considerable success for discriminative training of HMMs [34]. The MMI criterion essentially comprises minimization of the Kullback–Leibler divergence $\mathrm{KL}(p\|\tilde{p})$ between the exact distribution $p(X)$ and the approximate distribution $\tilde{p}(X)$, where

$$\mathrm{KL}(p\|\tilde{p}) = -\int dX^1 \cdots dX^M\; p\left(\{X^m\}_{m=1}^M\right) \log \frac{\tilde{p}\left(\{X^m\}_{m=1}^M\right)}{p\left(\{X^m\}_{m=1}^M\right)}. \tag{5}$$

It can be shown [12] that by application of the MMI criterion, and considering that all the fused data streams of the MFHMM are (a priori) of equal reliability, (4) yields

$$\tilde{p}(X) = \tilde{p}\left(\{X^m\}_{m=1}^M\right) = \frac{1}{M} \sum_{m=1}^{M} p(X^m) \prod_{r \neq m} p(X^r | \hat{S}^m). \tag{6}$$

In (6), $\hat{S}^m$ are the state sequence estimates of the available stream data, obtained by application of the Viterbi algorithm [15] on the individual streamwise HMMs comprising the postulated MFHMM. Regarding the coupling densities $p(X^r | \hat{S}^m)$, from the conditional independence property of the Markovian chain, we obtain

$$p(X^r | \hat{S}^m) = \prod_{t=1}^{T} p(x_t^r | \hat{s}_t^m). \tag{7}$$

The probabilities $p(x_t^r | \hat{s}_t^m)$ of the MFHMM can be modeled by means of mixtures of Gaussian or Student's-t densities, similar to the state-conditional likelihoods of the streamwise HMMs. Note that for each possible value, say $i$, of $\hat{s}_t^m$, a different coupling density model $p(x_t^r | \hat{s}_t^m = i)$ is to be postulated.
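As an illustration of how the fused approximation in (6) and (7) can be evaluated in practice, the following minimal Python sketch combines per-stream log-likelihoods with coupling log-likelihoods in the log domain. It is not the authors' implementation: the function names, the list-based inputs, and the assumption that the streamwise and coupling log-likelihoods have already been computed elsewhere are all hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def fused_log_likelihood(stream_loglik, coupling_loglik):
    """Log of the fused approximation in (6), under the equal-reliability assumption.

    stream_loglik[m]      : log p(X^m), assumed to come from the m-th streamwise HMM.
    coupling_loglik[m][r] : log p(X^r | S_hat^m), i.e. the sum over t of the
                            per-frame coupling log densities as in (7).
                            Diagonal entries (r == m) are ignored.
    Both inputs are hypothetical precomputed quantities.
    """
    num_streams = len(stream_loglik)
    terms = []
    for m in range(num_streams):
        # Product over r != m in (6) becomes a sum of log terms.
        coupled = sum(coupling_loglik[m][r] for r in range(num_streams) if r != m)
        terms.append(stream_loglik[m] + coupled)
    # The 1/M average in (6) becomes a subtraction of log M.
    return logsumexp(terms) - math.log(num_streams)


# Toy two-stream example with made-up probabilities.
example = fused_log_likelihood(
    [math.log(0.2), math.log(0.4)],
    [[0.0, math.log(0.5)], [math.log(0.25), 0.0]],
)
```

Working in the log domain is a design choice made here because products of $T$ per-frame densities underflow quickly for realistic sequence lengths.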
Hence, if we consider $N$-state streamwise HMMs, there is a total of $N$ different finite mixture models that must be trained to model the coupling densities $p(x_t^r | \hat{s}_t^m)$, $\forall r, m$.

B. Variational Bayesian Inference for the MFHMM

Bayesian treatments of probabilistic generative models comprise introduction of a set of prior distributions over the model parameters and further maximization of the model's log marginal likelihood (log evidence). For convenience, conjugate priors are usually preferred, as this selection greatly simplifies inference and interpretability [16]. However, due to the complexity of the MFHMM, exact Bayesian inference for our model is intractable. Nevertheless, the choice of conjugate exponential prior distributions for the model parameters allows for the derivation of an elegant variational framework.

Let us consider a model $p(X|\Theta)$ treated under the variational Bayesian paradigm. Let $p(\Theta)$ be the conjugate prior imposed on the model, and $X$ be the used set of training data. Variational Bayesian inference is conducted by introducing an approximate (variational) posterior over the model parameters $q(\Theta)$, and considering the well-known equality for the log evidence, $\log p(X)$ [19]

$$\log p(X) = F(q) + \mathrm{KL}(q\|p) \tag{8}$$

where

$$F(q) = \int d\Theta\, q(\Theta) \log \frac{p(X, \Theta)}{q(\Theta)}. \tag{9}$$

Since the KL divergence term in (8) is a nonnegative quantity, $F(q)$ constitutes a lower bound of the log evidence, that is

$$\log p(X) \geq F(q) \tag{10}$$

and would become exact if $q(\Theta) = p(\Theta|X)$. Hence, by maximizing the lower bound of the log evidence (variational free energy), $F(q)$, so that it becomes as tight as possible, i.e., by minimizing the KL divergence between the true and the variational posterior, a good variational inference scheme is obtained. In other words, variational Bayes can be summarized under the maximization scheme

$$q(\Theta) = \operatorname{argmax}_q F(q). \tag{11}$$
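The decomposition in (8) and the bound in (10) can be checked numerically on a toy model. The sketch below is only an illustration under strong assumptions: the parameter is taken to be discrete with three possible values (so the integral in (9) becomes a sum), and the prior, likelihood, and variational distribution are made-up numbers that do not correspond to any MFHMM.

```python
import math


def free_energy(q, joint):
    """F(q) = sum_theta q(theta) * log(p(X, theta) / q(theta)), the discrete analog of (9)."""
    return sum(qi * math.log(ji / qi) for qi, ji in zip(q, joint) if qi > 0)


def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions on the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Hypothetical toy model: three candidate parameter values.
prior = [0.5, 0.3, 0.2]          # p(theta)
likelihood = [0.10, 0.40, 0.70]  # p(X | theta) for the observed data X
joint = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(joint)            # p(X)
posterior = [j / evidence for j in joint]

# An arbitrary variational distribution q(theta).
q = [0.2, 0.5, 0.3]

# Equality (8): log p(X) = F(q) + KL(q || p(theta | X)).
gap = math.log(evidence) - (free_energy(q, joint) + kl_divergence(q, posterior))
```

Setting `q` equal to `posterior` makes the KL term vanish, so that `free_energy(posterior, joint)` coincides with `math.log(evidence)`, which is the sense in which the bound becomes exact.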
It is worthwhile noting that the variational posteriors obtained by optimization of the variational free energy $F(q)$ are only an approximation of the actual posterior densities $p(\Theta|X)$. However, the variational Bayesian approach allows for considerably better scalability in terms of computational cost [19] compared to exact Bayesian inference using MCMC algorithms, which becomes of practical importance in applications requiring fast processing of high-dimensional large-scale datasets.

As the MFHMM consists of two fundamental "building blocks," the streamwise HMMs and the coupling models, variational Bayesian inference for this model can be decomposed into two separate procedures: 1) variational Bayes for the postulated (streamwise) HMMs; and 2) variational Bayes for the postulated finite mixture models (coupling models). Below, we provide an outline of the proposed variational Bayesian treatment of the MFHMM.

C. Variational Posteriors

Consider an MFHMM modeling $M$ tightly interdependent time series, $\{X^m\}_{m=1}^M$. For simplicity, and without any loss of generality, we assume that all the observed time series have the same length, $T$, i.e., $X^m = \{x_t^m\}_{t=1}^T$. The variational posteriors of the postulated MFHMM can be derived as follows.

1) Streamwise HMM Training: Initially, $M$ individual HMMs are trained independently (one for each stream), by means of the VB algorithm, as described, e.g., in [27]. These are the constituent streamwise HMMs of our model, with obtained variational posteriors $q(\Theta^m)$, where $\Theta^m$ are the parameters of the $m$th constituent streamwise HMM, $m = 1, \dots, M$. For simplicity, and without any loss of generality, we consider $N$-state streamwise HMMs. Specifically, VB inference for the streamwise HMMs of our model is conducted by imposing Dirichlet priors over the initial state and state transition probabilities of the models

$$p(\pi^m) = \mathcal{D}(\pi_1^m, \dots, \pi_N^m | \phi_1^m, \dots, \phi_N^m) \tag{12}$$

$$p(A^m) = \prod_{i=1}^{N} \mathcal{D}(a_{i1}^m, \dots, a_{iN}^m | \upsilon_{i1}^m, \dots, \upsilon_{iN}^m). \tag{13}$$

The observation emission probabilities of the hidden states of the models are taken as finite mixtures of Gaussian or Student's-t distributions. Considering, for simplicity, $K$-component mixture models, we impose a Dirichlet prior over their mixture component weights of the form

$$p\left(\{\delta_{ik}^m\}_{i,k=1}^{N,K}\right) = \prod_{i=1}^{N} \mathcal{D}(\delta_{i1}^m, \dots, \delta_{iK}^m | \epsilon_{i1}^m, \dots, \epsilon_{iK}^m) \tag{14}$$

and a joint Normal–Wishart prior over the means and precision matrices of the (Gaussian or Student's-t) mixture component densities

$$p\left(\{\mu_{ij}^m, R_{ij}^m\}_{i,j=1}^{N,K}\right) = \prod_{i=1}^{N} \prod_{j=1}^{K} \mathcal{NW}(\mu_{ij}^m, R_{ij}^m | \lambda_{ij}^m, \gamma_{ij}^m, \eta_{ij}^m, Q_{ij}^m). \tag{15}$$

As a result of choosing to impose conjugate priors over the parameters of our model, the resulting variational posteriors of the model parameters take the same functional form as their corresponding priors [19]. Complete derivations of these posteriors have been provided in one of our previous works [27], and hence we refrain from repeating them here for brevity.

2) Sequence Decoding: The best hidden state sequences $\hat{S}^m$ of the streamwise HMMs, corresponding to the used training data $X^m$, are found using the VB Viterbi algorithm [27]. The VB Viterbi algorithm comprises maximization of the approximate (variational) posterior expectation of $\log p(X^m, S^m | \Theta^m)$

$$\hat{S}^m = \operatorname{argmax}_{S^m} \int d\Theta^m\, q(\Theta^m) \log p(X^m, S^m | \Theta^m) \tag{16}$$

where $\log p(X^m, S^m | \Theta^m)$ is defined in (2) (for the details refer to [27]).

3) Coupling Models Training: Finally, the coupling models are obtained. This problem is equivalent to postulating one finite mixture model (with Gaussian or Student's-t densities) for each of the distributions $p(x_t^r | \hat{s}_t^m = i)$, $\forall i \in \{1, \dots, N\}$, $r, m \in \{1, \dots, M\}$, $r \neq m$, and subsequently employing variational Bayes to obtain the variational posteriors $q(\Theta_i^{r,m})$ over their parameter sets $\Theta_i^{r,m}$. The complete derivations of the VB training algorithm for finite mixtures of Gaussian densities can be found in [16], while for the case of Student's-t densities they are provided in [24].

D. Hidden State Sequence Estimation Algorithm

Essentially, this is the problem of maximizing

$$\{\hat{S}^m\}_{m=1}^M = \operatorname{argmax}_{\{S^m\}_{m=1}^M} \int d\Theta\, q(\Theta) \log p\left(\{X^m, S^m\}_{m=1}^M | \Theta\right) \tag{17}$$

where $\Theta \triangleq \left\{\{\Theta_i^{r,m}\}_{i=1}^N, \Theta^m\right\}_{m,r=1,\, m \neq r}^{M}$. Then, following the related results of [12], and assuming that all the postulated streamwise HMMs are of the same reliability, using (3) and (6) we have that (17) eventually reads

$$\hat{S}^m = \operatorname{argmax}_{S^m} \int d\Theta^m\, q(\Theta^m) \log \left[ p(X^m, S^m | \Theta^m) \prod_{r \neq m} p(X^r | S^m; \Theta^{r,m}) \right]. \tag{18}$$

Comparing the result (18) with (16), we directly observe that estimation of the optimal state sequences $\hat{S}^m$ for the MFHMM effectively boils down to an application of the VB Viterbi algorithm, with the probabilities

$$p(X^m | S^m) = \prod_{t=1}^{T} p(x_t^m | s_t^m) \quad \forall m \tag{19}$$

of the single-HMM Viterbi algorithm being now replaced with the products

$$\prod_{\forall r} p(X^r | S^m) = \prod_{\forall r} \prod_{t=1}^{T} p(x_t^r | s_t^m) \quad \forall m \tag{20}$$

in which expression the quantities $p(X^m | S^m)$ are given by the postulated streamwise HMMs, and the quantities $p(X^r | S^m)$, $r \neq m$, are given by the coupling models.

E. Predictive Probability

The ultimate goal of Bayesian learning is, given a set of test data, to perform density estimation with respect to the learned model. Let us suppose the test data $Y = \{Y^m\}_{m=1}^M$, with $Y^m = \{y_t^m\}_{t=1}^T$, and an MFHMM trained using the training data $X$, with obtained variational posterior $q(\Theta)$.
The variational (approximate) predictive density of the given test data with respect to the considered MFHMM is given by

$$p(Y|X) = \int d\Theta\, q(\Theta)\, p(Y|\Theta) \tag{21}$$

yielding

$$p(Y|X) = p\left(\{Y^m\}_{m=1}^M | X\right) \approx \frac{1}{M} \sum_{m=1}^{M} q(Y^m) \prod_{r \neq m} q(Y^r | \hat{S}^m) \tag{22}$$

where

$$q(Y^r | \hat{S}^m) = \prod_{t=1}^{T} q(y_t^r | \hat{s}_t^m) \tag{23}$$

while $q(Y^m)$ and $q(y_t^r | \hat{s}_t^m)$ are the predictive densities of the streamwise HMMs and the coupling models, respectively, comprising the trained MFHMM, which can be obtained based on the VB treatments of these models (see the descriptions in [27] regarding the streamwise HMMs, and the discussions in [16] regarding the coupling models).

III. How Does VB Facilitate MFHMM-Based Active Learning?

As we have already discussed, due to the prohibitively high costs of capturing and annotating behavioral data from real installations, measures have to be taken to avoid severe MFHMM training algorithm instabilities (e.g., yielding singular covariance estimates). Variational Bayes serves us well toward the achievement of this goal. However, another significant repercussion of the shortage in training data is the high chance of the trained model manifesting a notably poor generalization performance [35]. To remedy this issue, we employ in this paper the concept of active learning.

Active learning is based on the notion that the performance of the learners (here MFHMMs) might be considerably improved if the learners could actively participate in the learning process [36]. That is, contrary to conventional supervised learning, where the learner "passively" receives the labeled data and generates a learned model, we would like to introduce a framework for identifying a subset of a pool of unlabeled examples that would be most informative if the associated labels were available, and incorporate them in the learning procedure. Hence, the proposed active learning methodology comprises two basic procedures: first, selection of the most informative samples from a pool of unlabeled data; and, second, labeling of these samples and introduction into the model training procedure of the MFHMM.

Under the proposed Bayesian treatment of the MFHMM, the informativeness of a new data point can be assessed analytically by viewing unlabeled sample selection as an information extraction process: we select the data that give us maximum information about the pool of unlabeled samples; in other words, we apply an information gain criterion. Since variational Bayes yields a posterior over the model parameters $\Theta$, the information gain after adding an unlabeled sample to the training set can be expressed in the context of information theory: "How much information about $\Theta$ can be obtained if we add an unlabeled sample $X^*$ to the training set?" Indeed, let us consider $C$ modeled behavioral classes, each one represented by a postulated MFHMM. Following [37], we measure the information gain obtained by adding an unlabeled sample $X^*$ to the training set by means of the KL divergence between the posterior density of the MFHMM parameters $\Theta$ obtained after adding the unlabeled sample $X^*$ to the training set and the one obtained before the addition [37], defined as follows:

$$G(X^*) \triangleq \sum_{c^*=1}^{C} \mathrm{KL}\left(p(\Theta^{c^*} | X^*, X)\, \|\, p(\Theta^{c^*} | X)\right) p(c^* | X^*; X). \tag{24}$$

In (24), $p(\Theta^{c^*} | X^*, X)$ is the variational posterior of the $c^*$th postulated MFHMM (modeling the $c^*$th behavioral class), obtained after adding the unlabeled sample $X^*$ to the training data of the class; $p(\Theta^{c^*} | X)$ is the variational posterior of the $c^*$th postulated MFHMM obtained before adding any unlabeled data; and, finally, $p(c^* | X^*; X)$ is the a posteriori probability of the $c^*$th class regarding the unlabeled sample $X^*$, which, considering all the classes of equal a priori probability, is given by

$$p(c^* | X^*; X) = \frac{p_{c^*}(X^* | X)}{\sum_{k=1}^{C} p_k(X^* | X)} \tag{25}$$

where $p_k(X^* | X)$ is the (variational) predictive probability of the data $X^*$ with respect to the $k$th class MFHMM, defined in (22).
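To make the selection rule concrete, the following Python sketch computes the class posteriors of (25) from per-class log predictive probabilities and weights per-class KL terms as in (24). It is a simplified illustration rather than the authors' procedure: the KL divergences between the variational posteriors after and before augmentation are model-specific and are assumed here to be supplied as precomputed numbers, and all function and argument names are hypothetical.

```python
import math


def class_posteriors(log_predictive):
    """Class posteriors as in (25), assuming equal class priors.

    log_predictive[k] is assumed to hold log p_k(X* | X) for class k.
    """
    peak = max(log_predictive)
    weights = [math.exp(v - peak) for v in log_predictive]
    total = sum(weights)
    return [w / total for w in weights]


def information_gain(posterior_kl, log_predictive):
    """Information gain as in (24).

    posterior_kl[c] is assumed to hold the precomputed KL divergence between
    the class-c variational posterior after and before adding the sample.
    """
    probs = class_posteriors(log_predictive)
    return sum(kl * p for kl, p in zip(posterior_kl, probs))


def select_and_label(pool_kl, pool_log_predictive):
    """Pick the pool sample with the largest gain and label it by maximizing (25)."""
    gains = [information_gain(kl, lp) for kl, lp in zip(pool_kl, pool_log_predictive)]
    best = max(range(len(gains)), key=gains.__getitem__)
    probs = class_posteriors(pool_log_predictive[best])
    label = max(range(len(probs)), key=probs.__getitem__)
    return best, label, gains[best]


# Toy pool of two unlabeled samples and two classes (made-up numbers).
choice = select_and_label(
    [[2.0, 4.0], [1.0, 1.0]],
    [[math.log(1.0), math.log(3.0)], [0.0, 0.0]],
)
```

In this toy pool the first sample has class posteriors of roughly 0.25 and 0.75, so its gain is a weighted average of its two KL terms, dominated by the more probable class.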
In essence, $G(X^*)$ seeks labels that can most shrink or expand (i.e., change) the model variance; thus, the information gain obtained by this measure is defined in terms of the possible change in the model variance, which has been shown to be more appropriate than other related information gain metrics [38], as well as other candidate unlabeled data selection strategies, e.g., the query by committee (QBC) approach [39], for comparably low computational costs. Finally, with regard to the labeling decision for the unlabeled samples selected to be incorporated in the model training procedure, this can be simply effected by maximization of the a posteriori probabilities (25) of the selected data points over the class labels $c^*$.

In our experimental investigations, apart from the information gain criterion (24), we shall also consider the QBC approach as another alternative for conducting active learning in the context of the variational Bayesian MFHMM. In the framework of QBC [39], [40], the informativeness of an example is measured by computing the classification variance with respect to the entire space of possible models consistent with the training data thus far. Since this computation is practically infeasible, the QBC algorithm approximates the entire space by randomly sampling the posterior distribution of the model parameters obtained from model training. These randomly selected models serve as a "committee" of classifiers to classify each unlabeled example. Then, the classification variance is measured by computing the disagreement over the classifications obtained by the classifiers comprising the committee. The data samples with the strongest disagreement among the committee are selected for labeling.
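The committee procedure just described can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: each committee member is assumed to be represented only by its per-class log predictive probabilities for the candidate sample (how these are obtained by sampling a variational posterior is outside the sketch), and disagreement is scored here as the average KL divergence between each member's full class posterior and the committee mean, which is one possible reading of a KL-based disagreement measure. All names are hypothetical.

```python
import math


def class_posteriors(log_predictive):
    """Normalize per-class log predictive probabilities, assuming equal class priors."""
    peak = max(log_predictive)
    weights = [math.exp(v - peak) for v in log_predictive]
    total = sum(weights)
    return [w / total for w in weights]


def qbc_score(committee_log_predictive):
    """Average KL divergence of each member's class posterior from the committee mean.

    committee_log_predictive[xi][c] is assumed to hold the log predictive
    probability of the candidate sample under class c for committee member xi.
    """
    member_posteriors = [class_posteriors(row) for row in committee_log_predictive]
    num_members = len(member_posteriors)
    num_classes = len(member_posteriors[0])
    average = [
        sum(p[c] for p in member_posteriors) / num_members for c in range(num_classes)
    ]
    total = 0.0
    for posterior in member_posteriors:
        total += sum(p * math.log(p / a) for p, a in zip(posterior, average) if p > 0)
    return total / num_members


def most_disputed(pool):
    """Index of the pool sample on which the committee disagrees most."""
    scores = [qbc_score(sample) for sample in pool]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy pool (made-up numbers): the committee agrees on sample 0 and disagrees on sample 1.
agree = [[math.log(0.9), math.log(0.1)], [math.log(0.9), math.log(0.1)]]
disagree = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
picked = most_disputed([agree, disagree])
```

A committee whose members all produce the same class posterior scores zero, so such samples are never preferred over ones on which the sampled models disagree.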
In this paper, this degree of disagreement shall be measured via the KL divergence, measuring the average distance of the class posterior density resulting from each committee member to their mean value. Specifically, let us denote as $\{\hat{\Theta}_\xi^c\}_{\xi=1}^{\Xi}$ a set of instances of the trained MFHMM model of the $c$th class, with variational posterior $q(\Theta^c)$, obtained by sampling $q(\Theta^c)$ $\Xi$ consecutive times. Then, the score of an unlabeled sample $X^*$ given by the sampled committee of experts is given by

$$\mathrm{score}(X^*) = \frac{1}{\Xi} \sum_{\xi=1}^{\Xi} \mathrm{KL}\left(p\left(c^* | X^*, \hat{\Theta}_\xi^{c^*}\right) \,\|\, p_{\mathrm{avg}}\left(c^* | X^*\right)\right) \tag{26}$$

where

$$p_{\mathrm{avg}}\left(c^* | X^*\right) = \frac{1}{\Xi} \sum_{\xi=1}^{\Xi} p\left(c^* | X^*, \hat{\Theta}_\xi^{c^*}\right) \tag{27}$$

and, considering all the classes of equal a priori probability, we have

$$p\left(c | X^*, \hat{\Theta}_\xi^c\right) = \frac{p_c(X^* | \hat{\Theta}_\xi^c)}{\sum_{k=1}^{C} p_k(X^* | \hat{\Theta}_\xi^k)} \tag{28}$$

where $p_c(X^* | \hat{\Theta}_\xi^c)$ is the predictive probability of the MFHMM of the $c$th class with respect to $X^*$, and $c^*$ in (26) and (27) is the class that maximizes (28) for the given committee member $\xi$ and predictive point $X^*$.

IV. Experimental Results

To experimentally verify the proposed approach, we have used some public benchmark datasets involving action recognition of humans, namely the CMU-MMAC and WR databases.

A. Meal Preparation

The first set of experiments was based on a part of the CMU-MMAC database [41]. The CMU-MMAC database contains multimodal measures of human activity of subjects performing tasks involved in cooking and food preparation. Six synchronized cameras have been used to capture scenarios, such as preparation of salad, pizza, eggs, and sandwich. Many types of tasks have been annotated within these scenarios. In our experiments, we considered the brownie preparation scenario. We have used 12 videos containing the full scenario, and sought to recognize 29 tasks described in Table I; the ground-truth annotations were taken from the dataset providers. Views from two cameras (7151020 and 7151062) were employed for that purpose (see Fig. 2).

Fig. 2. Different camera views in the CMU Multi-Modal Activity Database (from [41]). We used the cameras 7151020 (first in second row) and 7151062 (second in first row).

To extract the spatiotemporal variations, we used pixel change history images to capture the motion history (see [42]), and computed the complex Zernike moments $A_{00}$, $A_{11}$, $A_{20}$, $A_{22}$, $A_{31}$, $A_{33}$, $A_{40}$, $A_{42}$, $A_{44}$, $A_{51}$, $A_{53}$, $A_{55}$, $A_{60}$, $A_{62}$, $A_{64}$, $A_{66}$, for each of which we computed the norm and the angle. Additionally, the center of gravity and the area of the found blobs were also used, making a total of 31 parameters, thus providing an acceptable scene reconstruction without a computationally prohibitive dimension. Zernike moments were calculated in rectangular regions of interest of approximately 15 000 pixels in each image to limit the processing and allow real-time feature extraction (performed at a rate of approximately 50–60 f/s).

The employed HMMs comprised three states, each one having a single mixture component distribution, which facilitated fast algorithm execution with acceptable results. The streams were coupled using a Gaussian mixture of two components. We randomly selected two full workflows for initial training (each containing 62 samples of all possible tasks), we used two different workflows to draw samples from (68 task samples in total) for the purposes of the active learning algorithm, and used the remaining eight available workflows for testing (258 task samples in total). A graphical representation of the obtained success rates as new samples were included is given in Fig. 3.

B. Industrial Part Assembly

We used the WR dataset, and specifically the first two workflows pertaining to car assembly (see [43] for more details).
The tasks to recognize in each of the workflows are the following:
1) worker 1 picks up part 1 from rack 1 (upper) and places it on the welding cell; mean duration is 8–10 s;
2) worker 1 and worker 2 pick part 2a from rack 2 and place it on the welding cell;
3) worker 1 and worker 2 pick part 2b from rack 3 and place it on the welding cell;
4) worker 2 picks up spare parts 3a, 3b from rack 4, and places them on the welding cell;
5) worker 2 picks up spare part 4 from rack 1 and places it on the welding cell;
6) worker 1 and worker 2 pick up part 5 from rack 5 and place it on the welding cell.

Each of the above tasks is a class that has to be recognized. The partial or total occlusions due to the racks make the task very difficult to complete with a single camera, and therefore two views have been used (see Fig. 4); hence the need for a methodology allowing for the successful fusion of the information contained in tightly coupled time series.

TABLE I
Meal Preparation Tasks From the CMU-MMAC Database, Including Their Code and the Total Number of Samples in the 12 Brownie Preparation Scenarios

| Task Code | Task | Total Samples |
|---|---|---|
| 03 | Close fridge | 11 |
| 06 | Open brownie bag | 9 |
| 07 | Open brownie box | 12 |
| 12 | Open fridge | 11 |
| 14 | Pour brownie bag into big bowl | 12 |
| 15 | Pour oil into big bowl | 12 |
| 16 | Pour oil into measuring cup small | 12 |
| 17 | Pour water into big bowl | 12 |
| 18 | Pour water into measuring cup big | 11 |
| 19 | Put baking pan into oven | 12 |
| 24 | Put pam into cupboard bottom right | 9 |
| 22 | Put oil into cupboard bottom right | 10 |
| 27 | Spray pam | 10 |
| 28 | Stir big bowl | 12 |
| 30 | Switch on | 12 |
| 31 | Take baking pan | 12 |
| 32 | Take big bowl | 12 |
| 33 | Take brownie box | 12 |
| 34 | Take egg | 11 |
| 35 | Take fork | 12 |
| 37 | Take measuring cup big | 12 |
| 38 | Take measuring cup small | 12 |
| 39 | Take oil | 10 |
| 40 | Take pam | 9 |
| 42 | Twist off cap | 11 |
| 43 | Twist on cap | 12 |
| 44 | Walk to counter | 11 |
| 45 | Walk to fridge | 11 |
| 50 | Crack egg on big bowl | 9 |
In our experiments, we used two different workflows, each comprising 20 sequences representing full assembly cycles and containing at least one of the considered behaviors. The total number of frames in each case was approximately 80 000. These frames were annotated manually. The second workflow is considered more difficult because its tasks may be executed in parallel, whereas in the first workflow the tasks were always executed sequentially. The same type of features was used as in the previous subsection, and the HMM configuration was similar to that of the previous experiment. We randomly selected three full workflows for initial training (each containing all possible tasks), used seven workflows to draw samples from (42 task samples in total) for the purposes of the active learning algorithm, and left the remaining ten workflows for testing (60 task samples in total). The results for the first and second workflows are given in Figs. 5 and 6, respectively.

Fig. 3. Success rates for the active learning methods compared to the random case, using the first workflow of the CMU-MMAC dataset. The x-axis is the number of selected samples for training, the y-axis is the respective accuracy on the test set. (a) Accuracy of camera 7151020 streamwise model. (b) Accuracy of camera 7151062 streamwise model. (c) Accuracy fusing both cameras.

Fig. 4. Schematic and camera views in the car assembly environment.

C. Comparison to Baseline Classification Methods

To verify the merit of the variational Bayesian approach toward observation fusion methods, we have included experimental comparisons of the variational Bayesian approach against the standard HMM and MFHMM models obtained using EM-based training. In all cases, three HMM states with a single-component observation model were used for both the VB and EM methods. The results are displayed in Table II for models with Gaussian observation densities, and in Table III for models with Student’s-t observation densities.
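To make the comparison in Table II easier to read, the following short sketch computes the difference in accuracy between each VB-trained model and its EM-trained counterpart. The accuracies are copied from Table II (Gaussian observation model); the variable and function names are illustrative only.

```python
# Accuracies (%) copied from Table II (Gaussian observation model).
# Each tuple is ordered as (stream 1 HMM, stream 2 HMM, fused MFHMM).
TABLE_II = {
    "CMU-MMAC": {"EM": (39.49, 35.90, 41.03), "VB": (43.08, 37.95, 44.62)},
    "WR 1": {"EM": (90.00, 70.00, 90.00), "VB": (95.00, 86.67, 96.67)},
    "WR 2": {"EM": (55.71, 37.14, 63.33), "VB": (63.33, 56.67, 68.33)},
}
MODELS = ("HMM1", "HMM2", "MFHMM")


def vb_gains(table):
    """Return VB-minus-EM accuracy differences, in percentage points."""
    gains = {}
    for dataset, rows in table.items():
        gains[dataset] = {
            model: round(vb - em, 2)
            for model, em, vb in zip(MODELS, rows["EM"], rows["VB"])
        }
    return gains


GAINS = vb_gains(TABLE_II)
for dataset, row in GAINS.items():
    print(dataset, row)
```

With the Table II values, every difference is positive, ranging from 2.05 percentage points (CMU-MMAC, second stream) to 19.53 percentage points (WR 2, second stream).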
As we observe, VB gives results that are in most cases much better than those of the EM algorithm, for both the streamwise and the fused models. The higher accuracy comes, of course, at a higher computational cost: in all our experiments, classification using the VB models required between four and five times more time than the EM approach. Nevertheless, although higher, the computational time needed still remains of the same order of magnitude.

Fig. 5. Success rates for the active learning methods compared to the random case, using the first workflow of the WR dataset. The x-axis is the number of selected samples for training, the y-axis is the respective accuracy on the test set. (a) Accuracy of camera 1 streamwise model. (b) Accuracy of camera 2 streamwise model. (c) Accuracy fusing both cameras.

Fig. 6. Success rates for the active learning methods compared to the random case, using the second workflow of the WR dataset. The x-axis is the number of selected samples for training, the y-axis is the respective accuracy on the test set. (a) Accuracy of camera 1 streamwise model. (b) Accuracy of camera 2 streamwise model. (c) Accuracy fusing both cameras.

Furthermore, we have also compared to MCMC-based methods; for this purpose, we have considered the HMM model proposed in [44]. In our experiments, we used a truncation level of ten states for this model, and imposed priors similar to those of our VB-based inference algorithm. Theoretically, higher accuracy is expected as the number of sampling iterations increases. Indeed, we observed this behavior in our experiments; nevertheless, to achieve performance similar to or higher than that of the corresponding VB-based models, a very large number of iterations was needed, requiring excessive computational resources even though the dimensionality of the problem was not particularly large.
In Table II, we provide the results of the MCMC-based method of [44] for 10 000 sampling iterations, a number of iterations that incurs reasonable computational costs (12 h on an Intel Xeon 2.53-GHz PC).

Finally, regarding the comparative computational costs of sequence classification using the proposed VB-based MFHMM model and simple streamwise models, we would like to mention that the costs of the proposed model are roughly equal to the sum of the costs of the corresponding streamwise models. Hence, in cases where two streams are used, our approach imposes roughly double the cost of a single streamwise model. This result was theoretically expected, considering that prediction in our model is conducted using (22).

D. Discussion

In our experimental investigations, we evaluated the performance of the proposed information fusion scheme. Clearly, our fusion approach yielded improved results over methods using single-stream information. We also observed that the VB methods outperformed the respective EM-based ones for both the streamwise and the fused models. This result was theoretically expected, since the latter models make point estimates, which are more vulnerable to overfitting [27].

We also investigated the effectiveness of the proposed framework in an active learning setting. Two different active learning criteria were examined, namely information gain and query by committee. Using these methods, we were able to select the most appropriate samples to incorporate in model training. This process of sample selection was repeated until the maximum number of new samples was reached.
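The repeated selection procedure described above can be sketched as a generic pool-based loop. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the callbacks `train`, `score`, and `label_fn` are hypothetical placeholders for model training, for a selection criterion such as information gain or QBC, and for the labeling step. The one-dimensional threshold classifier with a margin-based score is a toy stand-in for the VB-trained MFHMM.

```python
def active_learning_loop(labeled, pool, budget, train, score, label_fn):
    """Greedy pool-based active learning.

    labeled  : list of (sample, label) pairs used for initial training
    pool     : list of unlabeled samples to choose from
    budget   : maximum number of new samples to add
    train    : callable, labeled -> model
    score    : callable, (model, sample) -> informativeness (higher is better)
    label_fn : callable, sample -> label for the selected sample
    """
    labeled, pool = list(labeled), list(pool)  # do not mutate the inputs
    model = train(labeled)
    queried = []
    for _ in range(min(budget, len(pool))):
        best = max(pool, key=lambda x: score(model, x))
        pool.remove(best)
        labeled.append((best, label_fn(best)))
        queried.append(best)
        model = train(labeled)  # retrain with the newly added sample
    return model, queried


# Toy stand-ins (assumptions for illustration only).
def train_threshold(labeled):
    """Threshold halfway between the largest class-0 and smallest class-1 sample."""
    lo = max(x for x, y in labeled if y == 0)
    hi = min(x for x, y in labeled if y == 1)
    return (lo + hi) / 2.0


def margin_score(threshold, x):
    """Samples closest to the current decision boundary score highest."""
    return -abs(x - threshold)


def toy_label(x):
    """Hypothetical labeling rule with the true boundary at 0.5."""
    return int(x >= 0.5)


MODEL, QUERIED = active_learning_loop(
    labeled=[(0.0, 0), (1.0, 1)],
    pool=[0.1, 0.3, 0.48, 0.6, 0.9],
    budget=2,
    train=train_threshold,
    score=margin_score,
    label_fn=toy_label,
)
print(QUERIED, MODEL)
```

Keeping the criterion behind a `score` callback means the same loop can be driven by either selection strategy, or by random selection as a baseline, without changing the loop itself.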
TABLE II
Comparison to Standard EM Approaches Using the Gaussian Observation Model

| Dataset | EM-HMM1 | EM-HMM2 | EM-MFHMM | VB-HMM1 | VB-HMM2 | VB-MFHMM | MCMC-1 | MCMC-2 |
|---|---|---|---|---|---|---|---|---|
| CMU-MMAC | 39.49 | 35.90 | 41.03 | 43.08 | 37.95 | 44.62 | 42.13 | 29.23 |
| WR 1 | 90.00 | 70.00 | 90.00 | 95.00 | 86.67 | 96.67 | 78.00 | 71.00 |
| WR 2 | 55.71 | 37.14 | 63.33 | 63.33 | 56.67 | 68.33 | 35.00 | 45.00 |

Columns EM-HMM1 and EM-HMM2 provide the accuracy (%) of the EM-trained streamwise HMMs, and EM-MFHMM provides the accuracy of the EM-trained multistream fused HMM. The corresponding results for models trained using the variational Bayesian approach are provided in columns VB-HMM1, VB-HMM2, and VB-MFHMM, respectively. Accuracies for the MCMC-trained streamwise models are provided in MCMC-1 and MCMC-2.

TABLE III
Comparison to Standard EM Approaches Using the Student’s-t Observation Model

| Dataset | EM-HMM1 | EM-HMM2 | EM-MFHMM | VB-HMM1 | VB-HMM2 | VB-MFHMM |
|---|---|---|---|---|---|---|
| CMU-MMAC | 41.03 | 33.85 | 43.07 | 43.59 | 42.56 | 45.64 |
| WR 1 | 90.00 | 72.86 | 91.42 | 93.33 | 91.67 | 98.33 |
| WR 2 | 60.00 | 38.33 | 65.71 | 61.67 | 56.67 | 68.33 |

Columns EM-HMM1 and EM-HMM2 provide the accuracy (%) of the EM-trained streamwise HMMs, and EM-MFHMM provides the accuracy of the EM-trained multistream fused HMM. The corresponding results for models trained using the variational Bayesian approach are provided in columns VB-HMM1, VB-HMM2, and VB-MFHMM, respectively.

It has to be mentioned that QBC entails sampling of the model parameters, which may require a large number of experts. In our setting, we used 30 experts, obtained by drawing the same number of samples; increasing the number of experts would give more representative results, but the computational burden would increase proportionally. In our setting, the required execution time was almost the same for both methods for the number of experts used by the QBC method. Clearly, active learning outperformed random sample selection.
To achieve the same performance, active learning methods require much less data than random selection. The differences in accuracy are larger when only a few samples are added. We have observed that both the information gain and QBC criteria are able to select samples that are closer to optimal in the sense of acquired information. As expected, we also observed that, as more samples are labeled and added to the training set, the gap in performance compared to random selection tends to shrink. Furthermore, we noted that in most cases neither of the two active learning methods could significantly outperform the other.

V. Conclusion

In this paper, we presented a novel variational Bayesian treatment of multistream fused hidden Markov models, with application to visual WR using multicamera networks. MFHMMs have been very successful in the fusion of information from tightly interdependent data streams, with low computational requirements. In this paper, we employed an elegant variational Bayesian treatment, which does not need large amounts of training data to guarantee dependable model estimation, since variational Bayes is much less prone to overfitting. Hence, despite the fact that the annotation of training data can be a major bottleneck, our VB-based method did not require a large amount of such data.

A major advantage of the proposed variational Bayesian treatment of MFHMMs over conventional approaches is that it provides a measure of confidence in the obtained model estimates. As we showed, utilization of this information allowed for the computationally efficient integration of the MFHMM into an active learning framework, by application of popular active learning criteria that would be either computationally cumbersome or even intractable were it not for the proposed variational Bayesian treatment.

References

[1] G. L. Foresti, C. Micheloni, L. Snidaro, P. Remagnino, and T. Ellis, “Active video-based surveillance systems,” IEEE Signal Proc. Mag., vol. 22, no. 2, pp. 25–37, Mar. 2005.
[2] G. L. Foresti, C. S. Regazzoni, and P. K. Varshney, Multisensor Surveillance Systems: The Fusion Perspective. Norwell, MA: Kluwer, 2003.
[3] N. Oliver, B. Rosario, and A. Pentland, “A Bayesian computer vision system for modeling human interactions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 831–843, Aug. 2000.
[4] S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141–151, Sep. 2000.
[5] M. Brand, N. Oliver, and A. Pentland, “Coupled hidden Markov models for complex action recognition,” in Proc. IEEE CVPR, Jun. 1997, pp. 994–999.
[6] N. Oliver, E. Horvitz, and A. Garg, “Layered representations for learning and inferring office activity from multiple sensory channels,” in Proc. Int. Conf. Multimodal Interfaces, 2002, pp. 163–180.
[7] D. G. Stork and M. E. Hennecke, “Speech reading by humans and machines,” in NATO ASI Series F, vol. 150. Berlin, Germany: Springer-Verlag, 1996.
[8] J. Triesch and C. von der Malsburg, “Democratic integration: Self-organized integration of adaptive cues,” Neural Comput., vol. 13, no. 9, pp. 2049–2074, 2001.
[9] O. Kahler, J. Denzler, and J. Triesch, “Hierarchical sensor data fusion by probabilistic cue integration for robust 3-D object tracking,” in Proc. 6th IEEE Southwest Symp. Image Anal. Interpret., May 2004, pp. 216–220.
[10] A. Morris, A. Hagen, H. Glotin, and H. Bourlard, “Multi-stream adaptive evidence combination for noise robust ASR,” Speech Comm., vol. 34, nos. 1–2, pp. 25–40, Apr. 2001.
[11] C. Vogler and D. Metaxas, “A framework for recognizing the simultaneous aspects of American sign language,” Comput. Vis. Image Understanding, vol. 81, pp. 358–384, Mar. 2001.
[12] Z. Zeng, J. Tu, B. M. Pianfetti, Jr., and T. S. Huang, “Audio-visual affective expression recognition through multistream fused HMM,” IEEE Trans. Multimedia, vol. 10, no. 4, pp. 570–577, Jun. 2008.
[13] K. Yamazaki and S. Watanabe, “Singularities in mixture models and upper bounds of stochastic complexity,” Neural Netw., vol. 16, no. 7, pp. 1029–1038, 2003.
[14] C. Archambeau, J. Lee, and M. Verleysen, “On the convergence problems of the EM algorithm for finite Gaussian mixtures,” in Proc. 11th Eur. Symp. Artif. Neural Netw., 2003, pp. 99–106.
[15] G. McLachlan and D. Peel, Finite Mixture Models (Wiley Series in Probability and Statistics). New York: Wiley, 2000.
[16] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
[17] J. Diebolt and C. Robert, “Estimation of finite mixture distributions through Bayesian sampling,” J. Roy. Statist. Soc. B, vol. 56, no. 2, pp. 363–375, 1994.
[18] S. Richardson and P. Green, “On Bayesian analysis of mixtures with unknown number of components,” J. Roy. Statist. Soc. B, vol. 59, no. 4, pp. 731–792, Apr. 1997.
[19] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, “An introduction to variational methods for graphical models,” in Learning in Graphical Models, M. Jordan, Ed. Dordrecht, The Netherlands: Kluwer, 1998, pp. 105–162.
[20] C. Bishop and M. Tipping, “Variational relevance vector machines,” in Proc. 16th Conf. Uncertainty Artif. Intell., 2000, pp. 46–53.
[21] S. Roberts and W. Penny, “Variational Bayes for generalized autoregressive models,” IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2245–2257, Sep. 2002.
[22] V. Smidl and A. Quinn, “Mixture-based extension of the AR model and its recursive Bayesian identification,” IEEE Trans. Signal Process., vol. 53, no. 9, pp. 3530–3542, Sep. 2005.
[23] C. Archambeau and M. Verleysen, “Robust Bayesian clustering,” Neural Netw., vol. 20, no. 1, pp. 129–138, Jan. 2007.
[24] M. Svensén and C. M. Bishop, “Robust Bayesian mixture modelling,” Neurocomputing, vol. 64, no. 1, pp. 235–252, Jan. 2005.
[25] Z. Ghahramani and M. Beal, “Variational inference for Bayesian mixtures of factor analysers,” in Proc. 12th Adv. NIPS, vol. 12. 1999, pp. 449–455.
[26] S. Chatzis, D. Kosmopoulos, and T. Varvarigou, “Signal modeling and classification using a robust latent space model based on t distributions,” IEEE Trans. Signal Process., vol. 56, no. 3, pp. 949–963, Mar. 2008.
[27] S. Chatzis and D. Kosmopoulos, “A variational Bayesian methodology for hidden Markov models utilizing Student’s-t mixtures,” Pattern Recognit., vol. 44, no. 2, pp. 295–306, 2011.
[28] I. Rezek and S. J. Roberts, “Ensemble hidden Markov models with extended observation densities for biosignal analysis,” in Probabilistic Modeling in Biomedicine and Medical Bioinformatics, E. D. Husmeier, R. Dybowski, and S. Roberts, Eds. Berlin, Germany: Springer-Verlag, 2005.
[29] D. MacKay, “Information-based objective functions for active data selection,” Neural Comput., vol. 4, no. 4, pp. 589–603, 1992.
[30] D. Cohn, Z. Ghahramani, and M. Jordan, “Active learning with statistical models,” J. Artif. Intell. Res., vol. 4, no. 3, pp. 129–145, Mar. 1996.
[31] S. Chatzis, D. Kosmopoulos, and T. Varvarigou, “Robust sequential data modeling using an outlier tolerant hidden Markov model,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 9, pp. 1657–1669, Sep. 2009.
[32] S. P. Luttrell, “The use of Bayesian and entropic methods in neural network theory,” in Maximum Entropy and Bayesian Methods. Boston, MA: Kluwer, 1989, pp. 363–370.
[33] H. Pan, Z.-P. Liang, and T. S. Huang, “Estimation of the joint probability of multisensory signals,” Pattern Recognit. Lett., vol. 22, no. 13, pp. 1431–1437, Nov. 2001.
[34] D. Povey and P. Woodland, “Minimum phone error and i-smoothing for improved discriminative training,” in Proc. ICASSP, 2002, pp. 105–108.
[35] S. Raudys and A. Jain, “Small sample size effects in statistical pattern recognition: Recommendations for practitioners,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 3, pp. 252–264, Mar. 1991.
[36] M. A. Osborne, R. Garnett, and S. J. Roberts, “Active data selection for sensor networks with faults and changepoints,” in Proc. IEEE 24th Int. Conf. AINA, Apr. 2010, pp. 533–540.
[37] S. Ji, B. Krishnapuram, and L. Carin, “Variational Bayes for continuous hidden Markov models and its application to active learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 522–532, Apr. 2006.
[38] D. MacKay, “Information-based objective functions for active data selection,” Neural Comput., vol. 4, no. 4, pp. 589–603, Apr. 1992.
[39] Y. Freund, H. Seung, E. Shamir, and N. Tishby, “Selective sampling using the query by committee algorithm,” Mach. Learning, vol. 28, nos. 2–3, pp. 133–168, Mar. 1997.
[40] H. Seung, M. Opper, and H. Sompolinsky, “Query by committee,” in Proc. 5th Ann. ACM Workshop Comput. Learning Theory, 1992, pp. 287–294.
[41] F. De la Torre, J. Hodgins, J. Montano, S. Valcarcel, R. Forcada, and J. Macey, “Guide to the Carnegie Mellon University multimodal activity (CMU-MMAC) database,” Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-RI-TR-08-22, Jul. 2009.
[42] D. Kosmopoulos and S. Chatzis, “Robust visual behavior recognition,” IEEE Signal Process. Mag., vol. 27, no. 5, pp. 34–45, Sep. 2010.
[43] A. Voulodimos, D. Kosmopoulos, G. Vasileiou, E. Sardis, A. Doulamis, V. Anagnostopoulos, C. Lalos, and T. Varvarigou, “A dataset for workflow recognition in industrial scenes,” in Proc. IEEE Int. Conf. Image Process., Sep. 2011, pp. 3310–3313.
[44] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky, “An HDP-HMM for systems with state persistence,” in Proc. Int. Conf. Mach. Learning, Jul. 2008, pp. 312–319.

Sotirios P. Chatzis received the M.Sc. (five-year Diploma) degree in electrical and computer engineering from the National Technical University of Athens, Athens, Greece, in 2005, and the Ph.D. degree in machine learning from the National Technical University of Athens in 2008.
From January 2009 to June 2010, he was a Post-Doctoral Researcher with the University of Miami, Coral Gables, FL. He is currently a Post-Doctoral Researcher with the Department of Electrical and Electronic Engineering, Imperial College London, London, U.K. His current research interests include machine learning theory and methodologies with a special focus on hierarchical Bayesian models, reservoir computing, robot learning by demonstration, copulas, quantum statistics, Bayesian nonparametrics, and artificial creativity.
Dr. Chatzis first authored 28 journal papers in the most prestigious journals of his research field by the age of 28 years. His Ph.D. research was supported by the Bodossaki Foundation, Greece, and the Greek Ministry for Economic Development, while he received the Dean’s Scholarship for Ph.D. Studies, being the Best Performing Ph.D. Student of his class.

Dimitrios Kosmopoulos received the Ph.D. degree in electrical and computer engineering from the National Technical University of Athens, Athens, Greece, in 2001.
Since then, he has collaborated with the National Center for Scientific Research “Demokritos,” Athens, the National Technical University of Athens, the University of Central Greece, Lamia, Greece, and the Technical Educational Institute of Athens, Athens. He is currently a Visiting Assistant Professor with the University of Texas, Arlington. His current research interests include computer vision, robotics, and machine learning. He has published more than 50 papers in these fields and has participated in several industrial and scientific projects as a developer, consultant, or technical coordinator.