Realtime Segmentation and Recognition of
Gestures using Hierarchical Markov Models
Jules FRANÇOISE
Master’s thesis
ATIAM 2010-2011
Université Pierre et Marie Curie
Ircam
Télécom ParisTech
IRCAM - Interactions Musicales Temps Réel (IMTR)
Supervisors:
Baptiste Caramiaux
[email protected]
Frédéric Bevilacqua
[email protected]
Abstract
In this work, we present a realtime system for continuous gesture segmentation and recognition. The model is an extension of the system called Gesture Follower developed at Ircam, a hybrid between Dynamic Time Warping and Hidden Markov Models. This previous model allows for a realtime temporal alignment between a template and an input gesture.

Our model extends it with a higher-level structure which models the switching between templates. Taking advantage of a representation as a Dynamic Bayesian Network, the time complexity of the inference algorithms is reduced from cubic to linear in the length of the observation sequence. We propose various segmentation methods, both offline and realtime.

A quantitative evaluation of the proposed model on accelerometer sensor data provides a comparison with the Segmental Hidden Markov Model, and we discuss several sub-optimal methods for realtime segmentation. Our model proves able to handle signal distortions due to speed variations in the execution of gestures. Finally, a musical application is outlined in a case study on the segmentation of violin bow strokes.
Résumé
We present a realtime system for the recognition and segmentation of continuous gestures. The model extends an existing system called Gesture Follower. Developed at Ircam, this model is a hybrid between Dynamic Time Warping and Hidden Markov Models which performs the temporal alignment between an input gesture and a reference gesture.

Our model extends it by adding a higher-level structure which models the transitions between reference gestures. Taking advantage of a representation as a Dynamic Bayesian Network, the time complexity of the inference algorithms is reduced from cubic to linear in the length of the input sequence. We propose several segmentation methods, both offline and realtime.

A quantitative evaluation of the model is carried out on a database of accelerometer signals, allowing a comparison with the Segmental Hidden Markov Model. In addition, different sub-optimal methods for the realtime segmentation of complex gestures are discussed. Our model proves robust to speed variations in the execution of gestures, which induce distortions of the gesture signals. Finally, a musical application is presented through a case study aiming at identifying and segmenting the bowing techniques of a violinist.
Acknowledgments
First of all, I would like to thank my supervisors, Baptiste Caramiaux and Frédéric Bevilacqua, for accompanying me throughout this study, for the fruitful discussions and advice, and for strengthening my interest in music and gesture. Above all, I thank them for their strong support of my PhD application, and I am eager to start my thesis in the team.

I thank the whole IMTR team for the good atmosphere and the constructive discussions.

Finally, I would like to thank the many people who helped me during this internship – for one reason or another – including Tommaso and Fivos, Hélène, Brett, Thomas, Pierre, Aymeric, Benjamin, Jérome, Louis, Jose, Julien, and many more.
Contents

1 Introduction

2 Background
  2.1 Gesture
    2.1.1 Musical interfaces
    2.1.2 Musical gestures
    2.1.3 Chunking
  2.2 Gesture modeling using HMMs
    2.2.1 Gesture recognition
    2.2.2 Hidden Markov Models
  2.3 Multilevel models
    2.3.1 The Segmental Hidden Markov Model
    2.3.2 The Hierarchical Hidden Markov Model
    2.3.3 A unified framework: Dynamic Bayesian Networks
  2.4 Modeling musical gestures: the Gesture Follower

3 Proposed model
  3.1 Description of the model
    3.1.1 Introduction
    3.1.2 Modeling scheme
  3.2 Efficient implementation using Dynamic Bayesian Networks
    3.2.1 Representation and formal description
    3.2.2 Optimal decoding: the Viterbi Algorithm
    3.2.3 Towards a realtime system: algorithms for approximate segmentation

4 Evaluation
  4.1 Evaluation method
    4.1.1 Data collection
    4.1.2 Method
    4.1.3 Evaluation function
  4.2 Offline segmentation: a comparison with the segmental model
    4.2.1 Same subject, same tempo
    4.2.2 Inter-subject, same tempo
    4.2.3 Same subject, inter-tempo segmentation
    4.2.4 Discussion
  4.3 Towards a realtime system
    4.3.1 Forward algorithm
    4.3.2 Fixed-Lag Smoothing
    4.3.3 Discussion
  4.4 A musical application: segmentation of violin bowing techniques

5 Conclusion and future directions

Bibliography

A Model description: representation and algorithms
  A.1 Representation
    A.1.1 Notations
    A.1.2 Conditional Probability Distributions
  A.2 Forward Algorithm
    A.2.1 Forward pass: formalization using the frontier algorithm
    A.2.2 Reduction
    A.2.3 Algorithm
  A.3 Fixed-Lag Smoothing
  A.4 Viterbi Algorithm
    A.4.1 Algorithm: Viterbi Decoding
List of Figures

1.1 Two projects of the IMTR team
2.1 Examples of gestural interfaces for musical expression
2.2 Temporal representation of a HMM
2.3 Graphical representation of a SHMM
2.4 An example of a simple HHMM
2.5 A simple Bayesian network representing the probabilistic dependences between three random variables
2.6 DBN representation of an input-output HMM
2.7 DBN representation of a HHMM with 2 levels of hierarchy
2.8 The learning procedure of the Gesture Follower
3.1 Segmentation of a complex signal as a sequence of primitive shapes
3.2 The topology of the proposed model
3.3 The particular learning procedure of the model
3.4 DBN representation of the proposed model
3.5 Pseudo-code for the Fixed-Lag Smoothing algorithm
4.1 Gesture vocabulary of the uWave database
4.2 Our gesture database
4.3 The two reference segmentations considered in this study
4.4 The different types of errors identified by the evaluation function
4.5 Acceleration signals of gesture 2 performed at 2 different speeds
4.6 Acceleration signals of gesture 2 respectively performed by participants 3 and 7 at 60 bpm
4.7 Coarticulation effects
4.8 Histogram of the length of the errors (substitutions + insertions) for the Forward algorithm
4.9 A typical example of segmentation computed with the forward algorithm presenting an insertion
4.10 Fixed-Lag Smoothing: influence of the lag and comparison with the Forward algorithm
4.11 The musical score of the violin phrase, correlated with the audio track and an acceleration signal
4.12 Segmentation results on the violin phrase compared for three algorithms
A.1 The 2-level HHMM of our study, represented as a DBN
1 Introduction
At Ircam, the Realtime Musical Interaction team [1] focuses its activities on interactive musical systems. In particular, current research involves both modeling gesture and sound and developing gesture capture systems and interfaces. Specific applications of this research are defined in the context of performing arts and interaction with digital media.

Two complementary approaches are pursued simultaneously. First, experimental studies aim at analyzing gesture in various contexts, from instrumental playing [Rasamimanana et al., 2009] and dance performance [Bevilacqua and Flety, 2004] to the gestural embodiment of environmental sounds [Caramiaux et al., 2011a]. Second, interactive systems for realtime performance are developed, from new musical interfaces (such as the MO shown on figure 1.1(b) [Rasamimanana et al., 2011]) to gesture analysis software. One example of this complementarity is the Augmented Violin project, which combined theoretical studies of bowing techniques with the creation of a system for mixed music allowing for the realtime recognition of bowing modes such as Martelé and Détaché. As capturing instrumentalists' movements had to be as unintrusive as possible, a specific device was developed, shown on figure 1.1(a).
Figure 1.1: Two projects of the IMTR team. (a) The augmented violin: example of an instrumented bow. (b) Modular Musical Objects (MO).
[1] IMTR team: Realtime Musical Interactions. http://imtr.ircam.fr
Machine learning techniques have been successfully applied to gesture modeling. In particular, Markov models were found suitable for modeling gesture signals as sequential data.

First, a gesture recognition system was developed for the specific context of performing arts, namely the Gesture Follower [Bevilacqua et al., 2010]. In order to extract information about the temporal execution of a gesture, the system continuously updates characteristics of the movement.

Second, a recent study investigated a computational model for offline gesture segmentation and parsing. This work, tested on ancillary gestures [Wanderley et al., 2005], aimed at a quantitative analysis of musical gestures using segment models to highlight their inherent multilevel information content [Caramiaux et al., 2011b].
Our study lies at the intersection of these two approaches. Specifically, the goal of our work is to implement and evaluate a model which extends the Gesture Follower by providing a method for gesture analysis over multiple time scales. Based on the Hierarchical Hidden Markov Model (HHMM), the proposed model allows for segmenting complex gestures and gesture sequences based on time profile recognition. Using the formalism of Dynamic Bayesian Networks (DBNs), an efficient implementation is presented and different segmentation methods are proposed, both offline and realtime.
Chapter 2 gives an overview of the background in gesture modeling. After a general introduction to musical gestures, we review Hidden Markov Models for gesture recognition and study two extensions of the model: the Segmental Hidden Markov Model and the Hierarchical Hidden Markov Model. The models are unified under the general framework of Dynamic Bayesian Networks. The proposed model is detailed in chapter 3, where we present various segmentation methods, from optimal decoding to sub-optimal methods for realtime segmentation. Finally, we describe a quantitative evaluation of the model in chapter 4. In a first section, the model is compared with the segmental model on offline segmentation. Then, two methods for realtime segmentation are evaluated and discussed to identify a compromise between the accuracy of the segmentation and the latency with respect to realtime.
2 Background

2.1 Gesture
Gesture has been of growing interest in many research fields, bringing an abundance of new definitions. First, it is important to make a distinction between a movement – a physical action – and a gesture, which is usually considered as carrying information [Kendon, 2004]. A common definition qualifies gestures as body movements, involving for example the head or a hand, which convey meaningful information [Godøy and Leman, 2009]. The computer vision community has come to distinguish actions, considered as simple motion patterns, from activities, which are complex and involve coordinated actions between humans [Turaga et al., 2008].

Another important role of gesture has been highlighted in the Human-Computer Interaction (HCI) community, namely interacting with the environment [Mitra and Acharya, 2007]. Gestures are then considered as an input modality aimed at controlling and interacting with a computer.
2.1.1 Musical interfaces
Gestures for control, as considered in HCI, may be divided into manipulative and empty-handed gestures [Godøy and Leman, 2009]. Important developments have thus been made, from new physical interfaces to vision-based gesture recognition systems aiming at interpreting human gestures. In computer music, the development of new interfaces for musical expression has dramatically increased over the last decade – even giving its name to an international conference [1].

From the early experiments of Max Mathews and his Radio Baton (figure 2.1(a)) [Mathews, 1989] to electronic instruments such as the Meta-Instrument [de Laubier and Goudard, 2006], the T-stick [Malloch and Wanderley, 2007] or the Karlax [2] (figure 2.1(b)), a great number of musical interfaces have been designed, focusing on gesture input for musical expression. This interest in new interfaces for musical expression drives specific research on musical gesture, raising the need for formalization [Cadoz and Wanderley, 2000] and for studies of gestural data and gesture-to-sound mapping strategies [Wanderley and Battier, 2000].
[1] NIME: International Conference on New Interfaces for Musical Expression. http://www.nime.org
[2] Karlax: innovative MIDI controller developed by Dafact. http://www.dafact.com/
Figure 2.1: Examples of gestural interfaces for musical expression. (a) Max Mathews and the Radio Baton. (b) The Karlax, developed by Dafact.
2.1.2 Musical gestures
Music-related gestures are found in a large number of musical activities. Indeed, musical gestures encompass more than controlling musical instruments or coordinating musicians: for instance, people tend to make gestures while listening to music – dancing, mimicking instruments, beating rhythms or performing abstract movements.
Recent studies in the field of embodied cognition tend to emphasize the role of gesture in
music perception, from listening to performance [Leman, 2006]. An embodied perspective
in music suggests that the whole body is involved in our experience of music, and that we
are constantly simulating sound-related actions when perceiving or even imagining sound
and music. This embodiment process is strongly connected with the multimodal aspect of
perception. Relationships between gesture and sound are anchored in our ecological experience of the mechanical coupling between an action and the sound produced [Jensenius,
2007].
A typology of music-related actions has emerged [Cadoz and Wanderley, 2000]. From a functional point of view, we may distinguish instrumental or effective gestures, involved in the production of sound, from more symbolic gestures such as accompanying and ancillary gestures, which are not involved in the production of sound but convey expressivity to the audience. A phenomenological analysis of musical gestures would rather focus on an analytical description of gestures in terms of kinematic, spatial and frequency characteristics. This last approach is quite attractive for gesture modeling as it allows for representing gestures in terms of measurable physical parameters. In recent research, quantitative analyses of musical gestures have taken advantage of a representation of gestures as time series, and particularly as time profiles. A major benefit of this assumption is the possibility of quantitatively comparing gesture profiles with sound descriptors to characterize gesture-sound relationships. For example, experiments on sound-tracing [Godøy et al., 2006] reveal an interesting analogy between gesture morphology and Pierre Schaeffer's musical objects [Schaeffer, 1966], extending this concept to gestural-sonorous objects [Godøy, 2006]. Recently at Ircam, an experiment was conducted to better understand the link between mental and gestural representations of environmental sounds [Caramiaux et al., 2011a]. Participants were asked to move freely to environmental sounds, revealing two types of behavior: sounds whose cause is clearly identified often lead participants to mimic the action, whereas participants tend to follow salient sound feature profiles when the source cannot be identified.
2.1.3 Chunking
In this framework, even if basic categories of gesture-sound objects can be defined, for example following excitation types (impulsive, sustained, iterative), music performance cannot be reduced to a sequential process. In [Widmer et al., 2003], the authors investigate artificial intelligence methods to analyze musical expressivity, highlighting the need for multilevel models: "Music performance is a multilevel phenomenon, with musical structures and performance patterns at various levels embedded in each other." While this work focused on analyzing audio and MIDI data of piano performance, it underlines the importance of considering multiple time scales when studying gestures in the context of musical performance. Recent findings about pianists' finger tapping emphasize two factors constraining musical gestures: biomechanical coupling and chunking [Loehr and Palmer, 2007]. Introduced by Miller in the fifties [Miller, 1956], chunking suggests that "perceived action and sound are broken down into a series of chunks in people's mind when they perceive or imagine music" [Godoy et al., 2010]. More than just segmenting a stream into small entities, chunking refers to their transformation and construction into larger and more significant units. In terms of action execution, it involves a hierarchical planning of movements [Rosenbaum et al., 1983]. Thus, studying gesture in an action-sound interaction context should take into account different levels of resolution, organized in a hierarchical structure. This theoretical background opens new perspectives for both the analysis and the gestural control of music. In the rest of this study, we focus on designing a method able to model gesture according to multiple time scales.
2.2 Gesture modeling using HMMs

2.2.1 Gesture recognition
In the Human-Computer Interaction community, gesture recognition has attracted growing interest over the last few years, with a wide range of applications such as sign language recognition, navigation in virtual environments, interaction with computers and control of digital media. In HCI, gesture is used as an input modality for control, often involving only a few iconic movements to be recognized once completed. Dynamic Time Warping (DTW) has thus been widely used in this domain for its ability to model time variations in the execution of a gesture [Corradini, 2001]. However, DTW requires storing and comparing the whole time series, which proves expensive in memory.
Models with hidden states have been introduced to address this limitation, considering their suitable property of compressing information. The basic assumption of these models is to consider hidden states that generate observations. Evolving in time, and possibly given input data, these states aim at modeling an underlying phenomenon which produces observable outputs, instead of focusing on the signal itself. The most common examples are Hidden Markov Models (HMMs) and Kalman Filter Models (KFMs). In the following, we focus on HMMs and use this acronym to designate the standard model defined in [Rabiner, 1989].
First introduced for speech recognition, HMMs proved efficient for gesture recognition [Yamato et al., 1992]. While many gesture recognition techniques are inherited from speech and handwriting recognition, gesture differs from these modalities in two respects: data acquisition involves a wide variety of sensing systems, and, more importantly, "gestures are ambiguous and incompletely specified" [Mitra and Acharya, 2007]. These problems call for domain-specific techniques in order to achieve efficient recognition.

Some shortcomings of HMMs can limit the efficiency of gesture recognition. Indeed, the model is unable to reject unrecognized gestures or to deal with geometric invariants – scale, offset, orientation – and the training procedure must be completed before recognition. In [Lee and Kim, 1999], the authors propose a threshold model based on HMMs which allows for the rejection of unrecognized gestures, and Eickeler et al. introduce filler models in a HMM-based system in [Eickeler et al., 1998]. In order to extract significant gestures from a continuous stream of information, the model proposed in [Bhuyan et al., 2006] focuses on detecting gesture boundaries by identifying "pauses" at the beginning and end of gestures. In [Wilson and Bobick, 1999a], Wilson and Bobick define a model for parameterized gesture recognition, and they introduce online adaptive learning of gesture models in [Wilson and Bobick, 1999b]. Often, difficulties arise in the interpretation of hidden states; the Semantic Network Model, presented in [Rajko et al., 2007], introduces semantic states carrying a symbolic meaning.
2.2.2 Hidden Markov Models
Introduced by [Baum and Petrie, 1966], Hidden Markov Models (HMMs) are reviewed in the classical tutorial [Rabiner, 1989]. We give here a short description of the model; for comprehensive studies, refer to [Rabiner, 1989] and [Bilmes, 2006].
Representation
First-order Markov chains model sequential data by considering probabilistic dependences
between successive observations. Hidden Markov Models extend this principle by encoding
information in hidden states that control the dependence of the current observation on
the history of observations.
At each time step, a HMM is characterized by a discrete hidden variable Qt taking values
in a set of N hidden states, and an observation variable Yt which can be discrete or
continuous. Then, three probability distributions are defined:
⊲ A prior probability distribution Π = {π_i}, where π_i = P(Q_1 = i) is the probability that the system initializes in state i.

⊲ A state transition probability matrix A = {a_ij}, where a_ij = P(Q_t = j | Q_{t−1} = i) is the probability of making a transition from state i to state j, respecting a first-order Markov property.

⊲ An observation probability distribution B = {b_j(y)}, where b_j(y) = P(Y_t = y | Q_t = j) evaluates the probability of observing y given that the hidden variable is in state j.
The behavior of a HMM can be depicted by the temporal representation shown on figure 2.2. The model presents a repeating structure: each time slice contains one hidden node and one observable node. Throughout this report, we keep the convention of representing hidden states as white circles and observations as shaded nodes. The system first initializes in state Q_1 according to Π and emits an observation symbol Y_1. Then, at each time step, a transition is made according to A and the system produces an observation symbol according to B. On the right side of the figure, the parameters are explicitly represented.
Figure 2.2: The temporal representation of a HMM. The model is unrolled for 3 time slices, but the structure can be repeated for each time step. On the right side, the model parameters A, B and π are explicitly represented.
Inference
In the classical tutorial [Rabiner, 1989], L. Rabiner points out three problems of interest
for HMMs:
⊲ Compute the probability of the sequence ȳ = y1 · · · yT given the model.
⊲ Find the sequence of hidden states q̄ = q1 · · · qT which best explains the observation
sequence.
⊲ Adjust the model parameters λ = (A, B, π) to maximize the probability of the sequence.
The Forward-Backward algorithm has been introduced to solve the first problem. The
basic principle is to define the forward variable αt (i) = P (y1 · · · yt , Qt = i|λ). Initialized
to α1 (i) = πi bi (y1 ), the variable is then updated at each new observation by summing over
the possible transitions:
    α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) · a_{ij} ] · b_j(y_{t+1})        (2.1)
The most probable state at time t can be estimated as the index maximizing α_t. Finding the optimal sequence of hidden states given the observation sequence is solved using the Viterbi algorithm. First, a variable δ is updated at each time step with a formula similar to equation 2.1, changing the sums to maximizations and keeping track of the arguments maximizing δ. Then, the optimal state sequence is computed by a backtracking operation.
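To make the recursion concrete, the following sketch implements the forward algorithm for a discrete-observation HMM. It is a minimal illustration of equation 2.1 under the notations above, not code from any of the systems discussed in this report.

    import numpy as np

    def forward(A, B, pi, y):
        """Forward algorithm for a discrete-observation HMM.
        A:  (N, N) transition matrix, A[i, j] = P(Q_{t+1} = j | Q_t = i)
        B:  (N, V) observation matrix, B[j, v] = P(Y_t = v | Q_t = j)
        pi: (N,)   prior distribution over the N hidden states
        y:  observation sequence, as a list of symbol indices
        Returns alpha[t, i] = P(y_1..y_t, Q_t = i) and the sequence probability."""
        N, T = A.shape[0], len(y)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, y[0]]                 # alpha_1(i) = pi_i * b_i(y_1)
        for t in range(T - 1):
            # Equation (2.1): sum over incoming transitions, weight by the emission
            alpha[t + 1] = (alpha[t] @ A) * B[:, y[t + 1]]
        return alpha, alpha[-1].sum()

The most probable state at time t is then alpha[t].argmax(); in practice each alpha[t] is normalized at every step, or the computation is carried out in the log domain, to avoid numerical underflow.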
Limitations

While HMMs are extensively used in recognition systems, some limitations of the model can prove problematic in various contexts [Ostendorf et al., 1996]. First, the probability of staying a duration τ in the same state derives from the auto-transition probability of a state: p(τ) = a_ii^(τ−1) · (1 − a_ii). As a consequence, state duration modeling suffers from an implicit definition and decreases exponentially: with a_ii = 0.9, for instance, durations follow a geometric distribution of mean 1/(1 − a_ii) = 10 frames, and longer stays are penalized even when they are physically plausible. Second, feature extraction is conditioned by the production of observations at the frame level. As a consequence, the standard model is too weak for a segmentation task, which requires higher-level modeling.
2.3 Multilevel models

2.3.1 The Segmental Hidden Markov Model
In speech recognition, segment models were proposed to overcome the restriction on feature extraction imposed by frame-based observations. In the Segmental Hidden Markov Model (SHMM), each hidden state emits a sub-sequence of observations rather than a single one, given a geometric shape and a duration distribution. Successfully applied to speech recognition [Ostendorf et al., 1996], handwritten shape recognition [Artieres et al., 2007] and, at Ircam, time profile recognition of pitch and loudness [Bloit et al., 2010], the Segmental HMM was used in a recent study for gesture modeling [Caramiaux et al., 2011b], which aimed at segmenting the gestures of a clarinetist. The model was used to represent a continuous stream of gestures as a sequence of geometric shapes extracted from a given dictionary. Tested on ancillary gestures, the model provided a quantitative analysis of the performance of a musician, highlighting recurrent patterns of ancillary gestures correlated with the musical performance.

In this section we give a brief overview of the Segmental HMM; an extensive study of the general model is formalized in [Ostendorf et al., 1996].
Representation
The SHMM extends the standard HMM by defining the observation probability distribution at the segment level. Instead of emitting a single symbol, a hidden state produces a variable-length sequence of observations. Let y_{t1:t2} = [y_{t1} · · · y_{t2}] denote a subsequence of observations of duration l = t2 − t1 + 1. In a standard HMM, the observation probability is defined at the sample level:

    b_j(y) = P(Y_t = y | Q_t = j)

In a SHMM, a hidden state emits a sequence of observations of length l given:

    P(y_{t1:t2}, L_t = l | Q_t = j) = P(y_{t1:t2} | Q_t = j, L_t = l) · P(L_t = l | Q_t = j)
                                    = b_{j,l}(y_{t1:t2}) · P(L_t = l | Q_t = j)

where b_{j,l}(y_{t1:t2}) is the probability of the subsequence of observations given the base shape and the duration l, and P(L_t = l | Q_t = j) is the probability of having a duration l given that the system is in state j. As a consequence, a new distribution is introduced to define the possible durations of the segments. A graphical representation of a SHMM is shown on figure 2.3.
Figure 2.3: Graphical representation of a SHMM
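As an illustration, the sketch below scores one candidate segment under a segmental emission model. It is a hypothetical minimal example, assuming Gaussian residuals around a uniformly resampled base shape and a Poisson duration distribution; these choices are illustrative and are not those of the studies cited above.

    import numpy as np
    from scipy.stats import norm, poisson

    def segment_log_likelihood(y_seg, shape, mean_dur, sigma=0.1):
        """log P(y_seg, L = l | Q = j) for one segment hypothesis.
        y_seg:    observed subsequence (length l)
        shape:    base shape of state j (any length, uniformly time-stretched)
        mean_dur: mean of the duration distribution of state j"""
        l = len(y_seg)
        # Uniformly resample the base shape to the candidate duration l
        stretched = np.interp(np.linspace(0, 1, l),
                              np.linspace(0, 1, len(shape)), shape)
        # log b_{j,l}(y): independent Gaussian residuals around the stretched shape
        log_b = norm.logpdf(y_seg, loc=stretched, scale=sigma).sum()
        # log P(L = l | Q = j): explicit duration model
        log_dur = poisson.logpmf(l, mean_dur)
        return log_b + log_dur

Note how the uniform resampling makes explicit the limitation discussed below: the only time deformation a segment can absorb is a global stretch.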
Inference
When using the SHMM, the first step is to define a dictionary of primitive shapes. These can be defined given prior knowledge – for example considering simple geometric shapes – or can be learned during a training phase. A simple technique is to manually annotate a reference signal to extract short segments and feed the model with a single template.

Similarly to HMMs, three problems can be defined for segment models. Here, each internal state generates a duration and a subsequence of observations, so finding the optimal sequence of hidden states and durations is equivalent to finding the best representation of the signal as a sequence of segments characterized by a label – the geometric shape – and a duration. This task is achieved by a dynamic programming algorithm analogous to the Viterbi algorithm of HMMs. As the duration distribution appears as a new dimension, the complexity of the decoding increases, and the optimal decoding algorithm is called the 3D Viterbi algorithm.
Limitations
If some of the limitations of HMMs have been solved by introducing the Segmental HMM, the model is only able to handle one level of hierarchy, governing the transitions between segments, thus limiting the analysis to this unique time scale. Moreover, gestures – and particularly musical gestures – are subject to timing variations. Different executions of the same gesture reveal local speed variations, involving a non-uniform time stretching of the primitive shapes composing a gesture. In the SHMM, the decoding process amounts to fitting the geometric shapes to an input signal by applying a uniform scale transformation to the primitive shapes. This implies that the only transformation of the signal allowed is a uniform time-stretch over a whole segment, which is not suitable for our application. The Hierarchical HMM defined in the next section overcomes these restrictions, allowing an arbitrary number of hierarchy levels and modeling segments by Markov processes instead of geometric shapes.
2.3.2 The Hierarchical Hidden Markov Model
The Hierarchical Hidden Markov Model (HHMM) [Fine et al., 1998] extends the standard HMM by making each state an autonomous model, allowing a deep hierarchy. The model has been applied to speech recognition to model nested time scales, from phonemes to words and sentences.
Representation
The Hierarchical Hidden Markov Model (HHMM) extends the standard HMM by making each hidden state an "autonomous model". Hidden states of the model fall into two classes:

⊲ production states: states which emit observations, similarly to the hidden states of a HMM;

⊲ internal states: instead of emitting observations, an internal state generates an autonomous submodel.

Thus, each internal state produces a sequence of observations by recursively activating its substates until a production state is reached. When the system has finished at the current level, it reaches an exit state which allows it to go back to the parent node and make a higher-level transition.
Figure 2.4: An example of a simple HHMM
In order to better understand the behavior of such a model, consider the simple HHMM of figure 2.4. The model has two levels of hierarchy and generates symbols associated with letters. Starting from the root at time t = 0, the model makes a vertical transition at t = 1, according to a prior probability distribution, to enter a state of the first level, for example S1. The model then emits the symbol "a" and makes a transition, either to the same state or to S2. As S2 is an internal state, the system makes a vertical transition and enters, for example, Q1. Once the symbol "b" is produced, the system makes a transition to Q2 and emits "c". After looping between Q1 and Q2, the system reaches an exit state – namely Qend – and goes back to the parent state S2 to make a transition at the first level. Here the only possibility is to reach Send and go back to the root. Overall, the system generates the regular expression a^x (bc)^y. More importantly, we can notice that the internal state S2 alone generates the subsequence (bc)^y: this structure thus inherently supports signal analysis on multiple time scales.
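The generative behavior of this toy model can be simulated in a few lines; the transition probabilities below are arbitrary values chosen for illustration.

    import random

    def sample_toy_hhmm(p_stay_s1=0.6, p_loop=0.5):
        """Sample one sequence a^x (bc)^y from the simple HHMM of figure 2.4."""
        out = ["a"]                          # enter S1 from the root, emit "a"
        while random.random() < p_stay_s1:   # self-transition on S1
            out.append("a")
        while True:                          # S1 -> S2: enter the sub-HMM at Q1
            out += ["b", "c"]                # Q1 emits "b", then Q2 emits "c"
            if random.random() >= p_loop:    # Q2 -> Qend: exit the sub-HMM,
                break                        # back to S2, then Send, then root
        return "".join(out)

    print(sample_toy_hhmm())                 # e.g. "aaabcbc"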
Inference
Introduced as an extension of the standard HMM, this model has been studied in a similar way, by defining the three classical problems. Considering the recursive structure of the model – each internal state calling a submodel – Fine et al. proposed recursive algorithms for the Hierarchical HMM [Fine et al., 1998], inspired by the Inside-Outside algorithm for Stochastic Context-Free Grammars. Notably, an equivalent of the Viterbi algorithm was designed to compute the optimal sequence of hidden states at each level. Unfortunately,
the time complexity of the algorithm is cubic in the length of the observation sequence, which quickly becomes intractable for time series analysis. In the next section we introduce Dynamic Bayesian Networks, which provide powerful inference algorithms for Markov models.
2.3.3 A unified framework: Dynamic Bayesian Networks
The inability of HMMs to deal with multiple time scales has led researchers in activity recognition to develop more complex models with a hierarchical structure. Often based on computer vision techniques, activity recognition has attracted growing interest in the last few years, with applications to surveillance, behavioral biometrics and interactive environments [Turaga et al., 2008]. Activities, defined as complex sequences of primitive actions, require multilevel models, from feature extraction and action recognition to high-level schemes for activity recognition. Recently, Dynamic Bayesian Networks (DBNs) have raised attention for their ability to encode complex conditional dependences in dynamic systems, thus permitting hierarchical modeling of human activities [Subramanya et al., 2007]. As shown by K. Murphy [Murphy, 2002], this framework allows the expression of every Markov model, offering a simple representation and powerful inference algorithms.
Bringing together graph theory and probability theory, graphical models provide a powerful formalism for statistical modeling [Wainwright and Jordan, 2007]. In a graphical model, a set of nodes representing random variables is connected by directed or undirected arcs defining probabilistic dependences between these variables. Bayesian networks are the family of graphical models with directed arcs. In such models, a directed arc between two nodes defines a conditional probability distribution between two random variables – a parent node and its child – offering both a simple representation and a strong stochastic modeling of causality.
A simple example is given on figure 2.5, where three random variables are represented. The edges indicate a causal relationship between events: for example, A → B means that we can evaluate the probability of observing B given knowledge about A. The scheme also shows that A and C are not independent given B, meaning that the inference we can make about A given B is conditioned on the state of C.

Figure 2.5: A simple Bayesian network representing the probabilistic dependences between three random variables.
Specifically developed to model dynamic systems, Dynamic Bayesian Networks (DBNs) are a special case of Bayesian networks which contain edges pointing in the direction of time. The graphical representation of a DBN thus shows a recurrent scheme in which a unique structure is repeated at each time slice. This structure contains an arbitrary number of connected nodes, a part of which have children in the following time slice. Dynamic Bayesian Networks can be seen as an extension of Hidden Markov Models with an arbitrary number of hidden and observable states per time slice, representing complex dependences. The simplest DBN is in fact the HMM, which contains only one hidden state and one observation per time slice: representing a HMM as a DBN simply amounts to unrolling its temporal representation, as presented on figure 2.2 of section 2.2.2.
Representation
We now give a short formal description of DBNs. For an extensive study, see [Murphy, 2002].

A DBN is a directed acyclic graph where a set of nodes defines random variables Z_t, which can be partitioned into hidden, observable and input variables: Z_t = (X_t, Y_t, U_t). A DBN is characterized by a prior P(Z_1) and a two-slice Temporal Bayes Net (2TBN) which defines P(Z_t | Z_{t−1}) as follows:

    P(Z_t | Z_{t−1}) = Π_{i=1}^{N} P(Z_t^i | Pa(Z_t^i))

where Z_t^i is the i-th node at time t, which can be a component of X_t, Y_t or U_t, and Pa(Z_t^i) are the parents of Z_t^i in the graph, which can belong either to the same or to the previous time slice.
A simple example is given on figure 2.6, where an Input-Output HMM is represented as a DBN. The unit structure contains 3 nodes, one of each possible type. Only hidden nodes have children in the following time slice. The model is characterized by a prior on the first time slice and by the 2TBN represented by time slices 2 and 3, which is sufficient to encode the whole temporal structure of the model.
Figure 2.6: An input-output HMM represented as a DBN. The unit structure of the model contains
one input node, one hidden node and one observation. Red arrows represent temporal relationships.
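For this input-output HMM, the general 2TBN factorization above instantiates as follows – assuming, as described in the figure, the arcs U_t → X_t, X_{t−1} → X_t and X_t → Y_t:

    P(Z_t | Z_{t−1}) = P(U_t) · P(X_t | X_{t−1}, U_t) · P(Y_t | X_t)

so that the prior P(Z_1) plus these three conditional distributions fully specify the model.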
In addition to offering a simple temporal representation, DBNs allow for deriving powerful inference algorithms, from exact inference with the Frontier algorithm [Zweig, 1996] to approximate methods such as particle filtering [Doucet et al., 2000].
Expressing the HHMM as a DBN
The Hierarchical HMM detailed in the previous section can be expressed as a DBN. On figure 2.7, a HHMM with two levels of hierarchy is represented as a DBN. In each time slice, an internal state S_t and a production state Q_t are connected together and have children in the next time slice. Two binary nodes are introduced to handle the exit states: for each level of the hierarchy, these nodes activate if the sub-HMM has entered its exit state, enabling the system to make a transition at the parent level. This representation as a DBN allows for deriving powerful inference algorithms. In particular, we focus on the Frontier algorithm, a generalization of the Forward-Backward algorithm, whose time complexity is linear in the length of the observation sequence. As the proposed model detailed in the next chapter is a special case of the HHMM, we do not give a comprehensive description of the general model here; representation and algorithms are detailed for our specific model in order to lighten the notation.
Figure 2.7: DBN representation of a HHMM with 2 levels of hierarchy. Qt is the production state
at time t, St is the internal state at time t; Ft = 1 if the sub-HMM has finished (entered its exit
state), otherwise Ft = 0. Shaded nodes are observed, the remaining nodes are hidden.
2.4 Modeling musical gestures: the Gesture Follower
Online gesture recognition systems often define gestures as unbreakable units which can only be recognized once completed. In music and the performing arts, however, the way a gesture is performed is often more important than identifying the gesture itself. Considering the specific constraints of this context, a realtime system for gesture-based interaction was developed in the IMTR team. A novel approach was proposed: instead of recognizing discrete gesture units, the system continuously updates a set of parameters characterizing the execution of a gesture. In particular, the system allows for a realtime estimation of the likelihood of a gesture and of the time progression, answering the question: "Where are we within the gesture?".

The basic assumption is that gestures can be represented as multidimensional temporal curves. In order to reach a precise temporal analysis, the authors focus on modeling time profiles at the sample level using HMMs with a particular topology. We can consider the model as a hybrid scheme between HMMs and Dynamic Time Warping (DTW). We now briefly review the modeling scheme of the system; for a complete review, see [Bevilacqua et al., 2010]. In the rest of this report, we refer to this model as the "Gesture Follower".
Representation
To fit the constraints imposed by the context of performing arts, for which the set of examples cannot be extensive, a particular learning procedure was adopted, illustrated on figure 2.8. A single example is used as a template to build a left-right HMM in which each state is associated with a sample of the reference gesture. The observation probability is defined as a Gaussian distribution centered around the corresponding sample of the reference. The standard deviation of this normal distribution corresponds to the variability between different performances of the same gesture, and needs to be set a priori or from preliminary experiments.

Figure 2.8: The learning procedure of the Gesture Follower. A left-right HMM is built from one example curve.
Considering its left-right topology, the system models precisely the temporal structure of the gesture and is able to handle time variations in its execution. In addition, assumptions are made about the transition probabilities. On figure 2.8, transitions a0, a1 and a2 stand for the self, next and skip probabilities, which can be set according to the expected variability of the speed of execution. For example, defining a0 = a1 = a2 = 1/3 allows the gesture to be performed up to twice as fast or as slow as the original example, with equal probabilities of speeding up or slowing down.
Decoding
Given the realtime constraint, the Viterbi algorithm typically used to decode the optimal sequence of hidden states is excluded. Instead, the standard forward procedure for HMMs is preferred (see section 2.2.2). At each time step, an update of the forward variable α estimates the likelihood of the gesture given the partial observation sequence. Moreover, the time progression can be derived from the index of the state maximizing the forward variable at each time step:

    time_progression_index(t) = argmax_i α_t(i)
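The sketch below illustrates this scheme: it builds a left-right HMM from a single template curve and updates the forward variable sample by sample. It is a simplified reading of [Bevilacqua et al., 2010], not the actual Max/MSP implementation; σ and the transition weights are assumptions to be set from preliminary experiments.

    import numpy as np

    def follow(template, stream, sigma=0.1, trans=(1/3, 1/3, 1/3)):
        """Gesture Follower sketch: a left-right HMM with one state per template
        sample. Yields (incremental likelihood, time progression index)."""
        N = len(template)
        a0, a1, a2 = trans                     # self / next / skip probabilities
        alpha = np.zeros(N)
        alpha[0] = 1.0                         # prior: start on the first sample
        for y in stream:
            # Gaussian observation probability centered on each template sample
            b = np.exp(-((y - template) ** 2) / (2 * sigma ** 2))
            alpha *= b
            lik = alpha.sum()                  # incremental likelihood of the sample
            alpha /= lik                       # normalize to avoid underflow
            yield lik, int(alpha.argmax())     # time progression index
            # Left-right transition: stay, advance one sample, or skip one sample
            nxt = a0 * alpha
            nxt[1:] += a1 * alpha[:-1]
            nxt[2:] += a2 * alpha[:-2]
            alpha = nxt

    # Toy usage: follow a slower execution of the same shape
    template = np.sin(np.linspace(0, np.pi, 50))
    for lik, idx in follow(template, np.sin(np.linspace(0, np.pi, 80))):
        pass
    print(idx)   # close to 49: the slower gesture is aligned to the template end

Normalizing α at each step implements the causal estimation described above; the yielded index answers "where are we within the gesture?" as the gesture unfolds.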
Applications
The system, implemented as an external object for Max/MSP [3], was called the Gesture Follower. Among its applications, a time synchronization paradigm has emerged which allows for controlling the playback of recorded sounds, for example to conduct a virtual orchestra.

[3] Max/MSP/Jitter: visual programming language for music and multimedia. http://cycling74.com/
3 Proposed model
As a reminder, we are interested here in designing a model for analyzing gestures over multiple time scales, extending the hybrid model called Gesture Follower. Using the formalism of Hierarchical HMMs, the proposed model provides a multilevel representation of gestures whose goal is the segmentation of complex gestures and gesture sequences.

The chapter begins with a formal description of the model. Then, using the formalism of Dynamic Bayesian Networks, an efficient implementation is detailed and different segmentation algorithms are presented, from exact decoding to sub-optimal methods for realtime segmentation.
3.1 Description of the model

3.1.1 Introduction
The proposed model represents a gesture as a sequence of primitive shapes modeled by Markov chains. An illustration of our goal is shown on figure 3.1. Imagine capturing a gesture using a single-axis accelerometer, thus representing the gesture as a unidimensional temporal curve. Once three primitive gestures are defined (a) – deceleration, acceleration and constant speed – the model must be able to segment a target signal (b) as a sequence of these primitive shapes (c). For the example of figure 3.1, the segmentation would output the series of symbols 12321, together with their temporal alignment with the reference.

As illustrated by this simple example, two challenging issues have to be solved:

1. find the correct sequence of primitive shapes, as a series of symbols;

2. detect the time boundaries of these shapes to infer a correct temporal alignment of the segments.
A desirable property of the model is the ability to handle time variations within segments: the same template must be able to fit a signal with important time variations. For example, the first and last segments of the curve of figure 3.1(c) are derived from the same template but present a different temporal unfolding. Moreover, as for the Gesture Follower, the specific context of performing arts requires modeling fine temporal variations, with precision at the sample level.
%$&'(")*+(
3.1 Description of the model
,/.
!"#$
"1)2*
!
($0$($1'$
,-.
!"#$%
%$&'(")*+(
"
#
!"#$
+2*)2*
,'.
"
#
"
!
%$&'(")*+(
!
!"#$
Figure 3.1: Segmentation of a complex signal as a sequence of primitive shapes. A gesture is
represented by a time profile of a descriptor, for example acceleration. 3 primitive gestures are
learned in the model (a). Segmenting the input signal (b) corresponds to identifying the correct
sequence of primitive shapes and their temporal alignment (c).
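As a concrete rendering of this example, the following snippet synthesizes such a composite signal by concatenating time-stretched primitive shapes; the durations are arbitrary illustrative values.

    import numpy as np

    # Three primitive shapes of figure 3.1(a): deceleration, acceleration, constant
    shapes = {1: lambda n: np.linspace(1.0, 0.0, n),
              2: lambda n: np.linspace(0.0, 1.0, n),
              3: lambda n: np.full(n, 1.0)}

    # Target signal (b): symbol sequence 1 2 3 2 1 with non-uniform durations,
    # so the first and last segments are two time-stretchings of the same template
    segments = [(1, 40), (2, 25), (3, 60), (2, 30), (1, 70)]
    signal = np.concatenate([shapes[s](n) for s, n in segments])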
3.1.2 Modeling scheme
We propose a model, a special case of the Hierarchical HMM defined in [Fine et al., 1998], with two levels of hierarchy. A graphical representation of the topology of the model can be found on figure 3.2.

The first level of the hierarchy contains a small number of internal states s_i which are respectively associated with primitive gestures. On the figure, only two internal states s1 and s2 are illustrated, each standing for a primitive shape. As this level models the transitions between gestural segments, it will be called the symbolic level in the following. As a first approximation, no assumption is made about gesture sequences and this high-level structure is considered ergodic. However, high-level transitions can be set given a priori knowledge, or could be learned from an extensive set of gesture sequences. Each of these symbolic states generates a submodel which encodes the temporal evolution of the corresponding primitive gesture. This submodel inherits the particular topology of the hybrid model of section 2.4, the Gesture Follower, associating each sample of a reference gesture with a state in a left-right Markov chain. Hence, observations are only emitted at this second level of the hierarchy – the production level. As it focuses on a time scale at the sample level, it will be named the signal level.
Figure 3.2: The topology of the proposed model. Only 2 symbolic states are represented. Exit states can be reached when the gesture is close to complete, in order to go back to the parent node and make a transition at the symbolic level.

As for the Gesture Follower, a simplified learning procedure is adopted, illustrated on figure 3.3. A single example is needed during the learning procedure. This template gesture is cut, given a priori knowledge, to constitute a dictionary of primitive shapes. Each sample of a primitive shape is then associated with a state of the signal level to form a submodel, which has a left-right topology. The transition probabilities of this Markov chain are fixed a priori given the expected variability in the timing of execution. As each state is a sample, the transition probabilities define the allowed time-stretching: for example, auto-transitions permit slowing down, and skipping states amounts to speeding up within the time profile. The prior probability for submodels is equal to 1 for the first state and zero elsewhere, ensuring that a segment can only be entered from its first sample. The probabilities of reaching an exit state are equal to zero except for a few samples at the end of the gesture, ensuring that a transition between two segments is only possible if the current primitive shape is about to finish.
As we associated each symbolic state with a primitive shape, we are interested in finding the correct sequence of symbolic states and the instants of the transitions in order to obtain the segmentation.
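As an illustration of this learning procedure, the sketch below builds the submodel parameters from a manually segmented template. It is a hypothetical minimal construction consistent with the description above; the transition weights, σ, and the number of exit samples are assumptions.

    import numpy as np

    def build_primitive(samples, trans=(1/3, 1/3, 1/3), n_exit=3):
        """Build one signal-level submodel from the samples of a primitive shape.
        Returns (mu, A, pi, a_end): one state per sample, left-right topology.
        (Rows of A are left unnormalized at the shape boundaries for brevity.)"""
        K = len(samples)
        a0, a1, a2 = trans                      # self / next / skip probabilities
        A = np.zeros((K, K))
        for j in range(K):
            A[j, j] = a0
            if j + 1 < K: A[j, j + 1] = a1
            if j + 2 < K: A[j, j + 2] = a2
        pi = np.zeros(K); pi[0] = 1.0           # a segment starts on its first sample
        a_end = np.zeros(K)                     # exit only near the end of the shape
        a_end[-n_exit:] = a1 + a2               # mass otherwise spent on advancing
        return np.asarray(samples, float), A, pi, a_end

    # Cut one template given a priori boundaries to constitute the dictionary
    template = np.sin(np.linspace(0, 3 * np.pi, 150))
    boundaries = [0, 50, 100, 150]
    primitives = [build_primitive(template[a:b])
                  for a, b in zip(boundaries[:-1], boundaries[1:])]
    M = len(primitives)
    H = np.full(M, 1.0 / M)                     # symbolic prior
    G = np.full((M, M), 1.0 / M)                # ergodic symbolic transitions

The symbolic level is left ergodic (uniform G), as assumed in section 3.1.2; informative transition priors could be substituted when a priori knowledge about gesture sequences is available.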
3.2 Efficient implementation using Dynamic Bayesian Networks
In [Fine et al., 1998], the authors propose a recursive algorithm which generalizes the Viterbi algorithm of HMMs [Rabiner, 1989]. If this algorithm computes an optimal decoding of the sequence of internal states, offering a precise segmentation technique, its computational complexity strongly limits its use with our model. Precisely, the time complexity of the Viterbi algorithm is cubic in the length of the observation sequence. Considering the particular topology of our model, where each state of the signal level is associated with a sample of the reference time profile, this algorithm proves intractable even for short sequences.
Taking advantage of the representation of the HHMM as a Dynamic Bayesian Network, powerful inference algorithms can be derived, reducing the complexity from cubic to linear in the length of the observation sequence. In this section we introduce the representation of our model as a DBN and detail its formal description. Then, different segmentation algorithms are proposed: the Viterbi algorithm, which performs optimal decoding, and sub-optimal methods for realtime segmentation.

Figure 3.3: The particular learning procedure of the model. Each sample of a reference primitive shape is associated with a state of a sub-HMM, itself generated by a symbolic state associated with a segment.
3.2.1 Representation and formal description
Dynamic Bayesian Networks (DBNs) were briefly introduced in section 2.3.3. DBNs are a generalization of HMMs allowing an arbitrary number of nodes per time slice to model complex dependences. Notably, Hierarchical HMMs can be represented as DBNs. Figure 3.4 depicts the graphical representation of the DBN corresponding to the two-level HHMM of our study.
The model is characterized by a set of M symbolic states associated with primitive gestures. Each internal state i generates a sub-HMM of length M^(i) – the length in samples of the reference segment. This can be represented at each time step by considering two random variables respectively associated with the symbolic state and the production state at time t ∈ [[1; T]]. In each time slice, 4 hidden variables are defined: a symbolic state S_t, a production state Q_t, and two binary variables F_t and U_t. These binary nodes are introduced to handle the exit states characteristic of the Hierarchical HMM: F_t turns to 1 if the segment has finished, i.e. is about to enter its exit state, and U_t turns to 1 if the entire gesture has finished.
Figure 3.4: DBN representation of the proposed model, a HHMM with 2 levels of hierarchy. Q_t is the production state at time t, S_t is the internal state at time t; F_t = 1 if the sub-HMM has finished (entered its exit state), otherwise F_t = 0. Shaded nodes are observed, the remaining nodes are hidden.

Five probability distributions are necessary to complete the model:

⊲ H = {h_i}: prior probability distribution for the symbolic level

    h_i = P(S_1 = i),    i ∈ [[1; M]]

⊲ G = (g_il): state transition probability matrix for the symbolic level

    g_il = P(S_{t+1} = l | S_t = i),    i, l ∈ [[1; M]]

⊲ Π^(i) = {π_j^(i)}: prior probability distribution for primitive i (vertical transition probability)

    π_j^(i) = P(Q_t = j | S_t = i),    i ∈ [[1; M]]; j ∈ [[1; M^(i)]]

⊲ A^(i) = (a_jk^(i)): state transition probabilities within primitive i

    a_jk^(i) = P(Q_{t+1} = k | Q_t = j, S_t = i),    i ∈ [[1; M]]; j, k ∈ [[1; M^(i)]]

⊲ B^(i)(y_t) = {b_j^(i)(y_t)}: emission probability distribution

    b_j^(i)(y_t) = P(Y_t | Q_t = j, S_t = i),    i ∈ [[1; M]]; j ∈ [[1; M^(i)]]

As for the Gesture Follower, the observation probability is defined as a Gaussian distribution:

    b_j^(i)(y_t) = 1/(σ√(2π)) · exp( −‖y_t − µ_j^(i)‖² / (2σ²) )

where y_t is the observation at time t and µ_j^(i) is the j-th sample of the primitive gesture i. The parameter σ, the standard deviation of the normal distribution, is considered constant over a whole gesture and has to be set given prior knowledge of the variability between various executions of the same gesture.

The model is finally parametrized by the set of parameters:

    λ = ( H, G, {Π^(i)}_{i∈[[1;M]]}, {A^(i)}_{i∈[[1;M]]}, {B^(i)}_{i∈[[1;M]]} )
The complete description of the conditional probability distributions of the model is given in appendix A.1. Recall that exit states must be reached in order to return to the parent level and make a transition between two symbolic states, i.e. between two segments. To clarify the role of the binary variables introduced to handle exit states, we define the probabilities of these nodes to activate as:

    P(F_t = 1 | Q_t = j, S_t = i) = a_j^(i)end

    P(U_t = 1 | S_t = i, F_t = f) = { 0          if f = 0
                                    { g_i^end    if f = 1

where a_j^(i)end is the probability of reaching an exit state from state j of primitive i, and g_i^end is the probability of reaching a high-level exit state from primitive i. The binary variable U_t, which characterizes the possibility of terminating a gesture, is conditioned on the termination of the current primitive shape, hence the dependence on the value of F_t. To force all segmentations to be consistent with the length of the sequence, we must ensure that all sub-HMMs have reached their final state, by imposing U_T = 1 and F_T = 1.
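For concreteness, the full parameter set λ can be grouped into a single structure. The layout below is a hypothetical sketch, not the actual C++ implementation mentioned at the end of this chapter; the emission distributions B^(i) are fully determined by the template samples and the shared standard deviation σ.

#include <vector>

// Sketch: parameters of one primitive gesture i, a left-right sub-HMM with
// one state per sample of the reference template (length M^(i)).
struct Primitive {
    std::vector<std::vector<double>> mu;   // template samples mu_j^(i)
    std::vector<double> pi;                // vertical entry distribution Pi^(i)
    std::vector<std::vector<double>> A;    // within-primitive transitions a_jk^(i)
    std::vector<double> aEnd;              // exit probabilities a_j^(i)end
};

// Sketch: complete model lambda = {H, G, {Pi^(i)}, {A^(i)}, {B^(i)}}.
struct HierarchicalHMM {
    std::vector<double> H;                 // symbolic priors h_i
    std::vector<std::vector<double>> G;    // symbolic transitions g_il
    std::vector<double> gEnd;              // high-level exit probabilities g_i^end
    std::vector<Primitive> primitives;     // one entry per primitive gesture
    double sigma;                          // shared emission standard deviation
};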
3.2.2 Optimal decoding: the Viterbi Algorithm

This temporal representation of Hierarchical HMMs allows for deriving powerful inference algorithms. Here, we are particularly interested in exact inference, but many approximate techniques have been developed for the general case by K. Murphy. A comprehensive tutorial on Dynamic Bayesian Networks can be found in [Murphy, 2002].
Principle and definitions

For HMMs, the forward-backward algorithm is based on an update of the probability distributions between two time steps, given that the hidden node X_t separates the past from the future. Here, four hidden nodes are defined in each time slice. The Frontier algorithm proposed in [Zweig, 1996] is a way of updating the joint probability over a set of nodes without needing to create a macro-node X_t = {S_t, Q_t, F_t, U_t}, which would require a large transition matrix. The basic idea is to create a frontier containing, at time t, all the nodes of the time slice. The frontier is then extended by progressively including children nodes from the next time slice and marginalizing over parent nodes to exclude them from the frontier. This operation is repeated until the frontier only contains the nodes of time slice t + 1, meaning that the joint probability over the set of nodes has been updated. A backward pass can be defined in a similar manner.

This procedure permits the derivation of an efficient generalization of the Viterbi algorithm. Define the δ variable:

    δ_t(j, i, f, u) = max_{X_{1:t-1}} P(Q_t = j, Q_{1:t-1}, S_t = i, S_{1:t-1}, F_t = f, F_{1:t-1}, U_t = u, U_{1:t-1}, y_{1:t})

where the maximization is over all hidden sequences X_{1:t-1} = {Q_{1:t-1}, S_{1:t-1}, F_{1:t-1}, U_{1:t-1}}.
The update process is explicated in appendix A.4, and leads to the following recurrence relation:

    δ_t(Q_t, S_t, F_t, U_t) = P(y_t | Q_t, S_t) · P(U_t | S_t, F_t) · P(F_t | Q_t, S_t)
        · max_{Q_{t-1}, F_{t-1}} { P(Q_t | Q_{t-1}, F_{t-1}, S_t)
            · max_{S_{t-1}, U_{t-1}} { P(S_t | S_{t-1}, U_{t-1}, F_{t-1}) · δ_{t-1}(Q_{t-1}, S_{t-1}, F_{t-1}, U_{t-1}) } }
This expression may seem quite complex, but important simplifications can be made by considering that F_t and U_t are binary variables. In fact, the pair {F_t; U_t} can only take 3 values, because the whole gesture can finish only if the current primitive shape has terminated. Introducing E_t = F_t + U_t (E_t ∈ {0, 1, 2}), we propose a new definition of the dynamic programming variable:

    δ_t^e(k, l) = max_{X_{1:t-1}} log P(Q_t = k, S_t = l, E_t = e, Q_{1:t-1}, S_{1:t-1}, E_{1:t-1}, y_{1:t})

A simplified algorithm can then be derived by separating the 3 cases. As for HMMs, the indices maximizing δ are kept in order to retrieve the sequence of hidden states in the backtracking operation. Finally, the optimal state sequence is obtained, giving at each time step the symbolic state, i.e. the most probable primitive gesture, and the production state, which provides the time progression with respect to the original template.
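For illustration, the recurrence relation above can be transcribed as a naive update over all state combinations. The sketch below keeps the four-variable form for readability: the E_t simplification, the log-domain computation and the backpointer matrix Ψ needed for backtracking are omitted, all names are hypothetical, and the conditional distributions are passed in as callables.

#include <array>
#include <functional>
#include <vector>

// delta[s][q][f][u]: dynamic programming variable for symbolic state s,
// production state q, and binary flags f (segment finished) and u (gesture
// finished).
using Delta = std::vector<std::vector<std::array<std::array<double, 2>, 2>>>;

// One Viterbi update from delta_{t-1} to delta_t, following the recurrence:
// emission = P(y_t|q,s), pF = P(F_t=f|q,s), pU = P(U_t=u|s,f),
// pQ = P(Q_t=q|q',f',s), pS = P(S_t=s|s',u',f').
Delta viterbiStep(const Delta& prev,
                  const std::function<double(int, int)>& emission,
                  const std::function<double(int, int, int)>& pF,
                  const std::function<double(int, int, int)>& pU,
                  const std::function<double(int, int, int, int)>& pQ,
                  const std::function<double(int, int, int, int)>& pS)
{
    Delta cur = prev;   // same shape; every cell is overwritten below
    const int M = static_cast<int>(prev.size());
    for (int s = 0; s < M; ++s)
        for (int q = 0; q < static_cast<int>(prev[s].size()); ++q)
            for (int f = 0; f < 2; ++f)
                for (int u = 0; u < 2; ++u) {
                    // Joint maximization over (q', s', f', u'), equivalent to
                    // the nested maxima of the recurrence since all factors
                    // are non-negative.
                    double best = 0.0;
                    for (int sp = 0; sp < M; ++sp)
                        for (int qp = 0; qp < static_cast<int>(prev[sp].size()); ++qp)
                            for (int fp = 0; fp < 2; ++fp)
                                for (int up = 0; up < 2; ++up) {
                                    const double v = pQ(q, qp, fp, s)
                                                   * pS(s, sp, up, fp)
                                                   * prev[sp][qp][fp][up];
                                    if (v > best) best = v;
                                }
                    cur[s][q][f][u] = emission(q, s) * pU(u, s, f)
                                    * pF(f, q, s) * best;
                }
    return cur;
}

In practice, exploiting the three admissible values of E_t and the sparsity of the left-right transition matrices removes most of these combinations, which is what yields the complexity discussed below.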
Complexity

Due to its recursive nature, the time complexity of the original Viterbi algorithm is O(KMT³), where M is the number of primitive gestures of mean length K and T is the length of the observation sequence [Fine et al., 1998]. Moreover, the memory complexity is O(KMT²).

Here, a short analysis of the algorithm shows that its time complexity is O((KM)²T). The time complexity is thus linear in the length of the observation, whereas it was cubic with the recursive algorithm. However, this gain comes at the expense of the space complexity, which is now quadratic in the total number of states of the model. Another important gain concerns memory: since the algorithm only involves an update, the most consuming operation is storing the indices in the matrix Ψ. This requires O(KMT), an important reduction compared with the original algorithm, which is quadratic in the length of the observation sequence.
While this algorithm provides an optimal segmentation of a gesture given a dictionary of primitive shapes, it is inappropriate for designing a realtime system because the whole observation sequence is required. In the next section, different methods based on the Frontier algorithm are proposed to offer approximate solutions to the segmentation problem, in order to move towards a realtime system.
3.2.3 Towards a realtime system: algorithms for approximate segmentation

In the hybrid model called Gesture Follower, a simple forward pass is used to estimate at each time step the most probable state, giving access both to the likelihood of the gesture and to the time progression within this gesture. In the following, two procedures are presented: a causal algorithm, which consists of a forward pass similar to the method used in the Gesture Follower, and Fixed-Lag Smoothing, which adds a backward pass over a few samples, improving the accuracy of the simple forward algorithm but delaying the results by a duration equal to the lag.
Forward algorithm

Filtering corresponds to computing P(X_t | y_{1:t}), which allows one to perform a realtime segmentation in the sense that the most probable symbolic and production states are updated at each time step. The algorithm is simply the forward procedure of the frontier algorithm. We first define the forward variable as the probability of being at time t in state k of primitive l, given the partial observation sequence up to time t:

    α_t^e(k, l) = P(Q_t = k, S_t = l, E_t = e, y_{1:t})

As it derives from the same general procedure, the algorithm is similar to the forward pass of the Viterbi algorithm, replacing maximizations by sums. For a complete derivation of the algorithm, the reader can refer to appendix A.2. The most probable symbolic state at each time step can be computed by:

    {Q_t*, S_t*, E_t*} = argmax_{k,l,e} α_t^e(k, l)
Hence, an approximate segmentation can be obtained. A major difference with the Viterbi algorithm should be noted: here the most probable state is computed at each time step, whereas the Viterbi algorithm computes the most probable sequence. As a consequence, the resulting sequence of hidden states may not be consistent, since the local maximization does not enforce valid state transitions. As the algorithm is quite similar to the Viterbi algorithm, the computational complexity of the forward algorithm is linear in the length of the observation sequence: O((KM)²T).
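The forward update itself has the same structure as the Viterbi step sketched in the previous section, with the maximization replaced by a sum; the realtime decoding is then a simple argmax over the updated forward variable. The following hypothetical sketch assumes the same array layout as before.

#include <array>
#include <cstddef>
#include <utility>
#include <vector>

using Alpha = std::vector<std::vector<std::array<std::array<double, 2>, 2>>>;

// Sketch: given the forward variable alpha[s][q][f][u] at time t, return the
// most probable primitive (symbolic state) and production state. The first
// index identifies the recognized gesture; the second gives the time
// progression within its template.
std::pair<int, int> decodeFiltering(const Alpha& alpha)
{
    int bestS = 0, bestQ = 0;
    double best = -1.0;
    for (std::size_t s = 0; s < alpha.size(); ++s)
        for (std::size_t q = 0; q < alpha[s].size(); ++q)
            for (int f = 0; f < 2; ++f)
                for (int u = 0; u < 2; ++u)
                    if (alpha[s][q][f][u] > best) {
                        best = alpha[s][q][f][u];
                        bestS = static_cast<int>(s);
                        bestQ = static_cast<int>(q);
                    }
    return {bestS, bestQ};
}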
Fixed-Lag Smoothing

Fixed-Lag Smoothing (FLS) is the process of estimating a state of the past given the partial observation sequence up to the current time, i.e. computing P(X_{t−L} | y_{1:t}). This algorithm again derives from the frontier algorithm. In addition to the forward update introduced before, a backward pass has to be computed. As the backward update is built similarly to the forward update, it is not detailed here (see appendix A.3).

The algorithm, written in pseudo-code in figure 3.5, outputs at each time step t the most probable state at time t − L given the evidence up to time t, where L is a constant called the lag. At each time step, a forward update is computed. Then, after initializing the backward variable to 1, a backward update is iterated from time t to t − L. Finally, the most probable state at time t − L is defined as the index which maximizes the elementwise product γ_{t−L} = α_{t−L} .* β_{t−L}.
It is evident that, compared to the filtering algorithm, the number of operations is increased by the addition of a backward pass. In fact, the time complexity of the algorithm is O(K²M(M + L)T). However, as the estimation of the state at time t depends on future events, the quality of the segmentation is expected to be better than with the simple filtering algorithm.
alpha[1] = init_forw(y[1])
FOR t = 2:∞
    alpha[t] = forward(y[t], alpha[t-1])
    IF t >= L
        beta = 1
        FOR k = 0:L DO
            beta = backward(y[t-k], beta)
        END FOR
        gamma[t-L] = alpha[t-L] .* beta
        state[t-L] = argmax(gamma[t-L])
    END IF
END FOR

Figure 3.5: Pseudo-code for the Fixed-Lag Smoothing algorithm
It is important to notice that the algorithm estimates the most probable state at each time step, and not the most probable sequence as in the Viterbi algorithm.
Summary

Algorithm              Complexity        Segmentation   Delay to realtime (samples)
Viterbi                O((KM)²T)         optimal        offline
Filtering              O((KM)²T)         sub-optimal    0
Fixed-Lag Smoothing    O(K²M(M+L)T)      sub-optimal    L

Table 3.1: A summary of the algorithms used, compared in terms of time delay, complexity and accuracy.
For the efficiency of the evaluation procedure detailed in the next section, all the algorithms introduced in this chapter were implemented in C++.
4 Evaluation

We have introduced a specific model for complex gesture segmentation. The model is expected to improve on the segmentation method proposed in previous research based on the Segmental HMM [Caramiaux et al., 2011b]. In particular, we expect the model to be suited to situations where important speed variations arise within gestures. In order to quantify these differences and estimate the respective performance of each algorithm, an evaluation method was designed.
After introducing the evaluation method, we report a quantitative comparison of the proposed model with the Segmental HMM. Different situations are investigated which aim at defining the suitability of the algorithms in particular contexts. As the Segmental HMM is limited to offline inference, we focus on optimal decoding using the Viterbi algorithm. Then, the issue of a realtime implementation is investigated. Different inference algorithms are compared and discussed in order to reach a compromise between the quality of the segmentation and the delay a user could allow. Finally, a case study on segmenting the gestures of a violinist aims at identifying bowing techniques such as Spiccato and Pizzicato, offering an interesting insight into the musical applications of this study.
4.1 Evaluation method

In this section, we first introduce the material used to evaluate the model, which required the creation of a specific gesture database. Based on a function which identifies different types of errors, a general evaluation method is presented to provide quantitative indicators of the quality of the segmentation computed by each algorithm.
4.1.1 Data collection

The first step in designing an evaluation procedure is the choice of an appropriate database. As highlighted before, we focus in this study on gestures represented as time series. As our system is not dedicated to vision-based recognition, we must exclude the databases that emerged in the computer vision community. Considering the specific context of music and performing arts, we also exclude handwriting, fingerspelling and sign language, to focus on more abstract gestures. As a consequence, it seems more appropriate for our study to look for a database from the HCI community, specifically one focusing on mobile interfaces.

In [Liu et al., 2009], the authors propose a gesture recognition system for mobile devices based on Dynamic Time Warping (DTW). The model is evaluated on a collection of 8 gestures, captured using accelerometers, which were identified in a previous study conducted by Nokia Research [Kela et al., 2006] as constituting a small vocabulary of iconic gestures preferred by users. The gesture vocabulary of the database is illustrated in figure 4.1.
Figure 4.1: Gesture vocabulary of the uWave database. The dot denotes the start of the gesture and the arrow its end.
The gestures of the database are independent and isolated, forbidding an evaluation on a segmentation task or for continuous gesture recognition. But the collection presents an interesting gesture: a square icon – gesture 2 of the database – which can be considered as the concatenation of 4 segments: up, right, down and left – respectively gestures 5, 3, 6 and 4 of the database. We defined a segmentation task as representing the square gesture as four time-aligned primitive shapes.

However, given that the square gesture is a single unit, no reference segmentation is available, and segmenting the signals manually proved very difficult. As a consequence, a new gesture database was created, inspired by this square-shaped icon. We chose the vocabulary shown in figure 4.2(b). The database consists of four square-shaped gestures, each starting at a different corner. Gestures were recorded using a Wii Remote, the primary controller of the Wii video game console by Nintendo® (figure 4.2(a)), which includes a three-axis accelerometer. Acceleration signals were resampled at a sampling frequency of 100 Hz. 8 participants were asked to repeat each gesture 10 times at three different speeds. In order to obtain a precise reference segmentation, participants had to synchronize their movements with an external tempo, performing one edge of the square between two successive clicks. Three tempos were chosen: 60, 120 and 180 beats per minute (bpm). Recording the click track provides a precise reference segmentation of the gestures. Because gestures are performed in a vertical plane, we keep only two axes of the accelerometer in the following tests.
4.1.2 Method

A major interest of creating a fairly large database, containing 960 gestures, is to allow a statistical evaluation of the model on a segmentation task. Given the particular learning procedure, which reduces the training phase to learning from a single example, we adopted the following general evaluation procedure. First, for each tempo and each gesture, a participant is chosen and one trial is taken as reference. Second, this trial gesture is segmented according to a reference segmentation, to constitute a training set of primitive gestures. In this study we consider two different reference segmentations, which are discussed in the next paragraph. Finally, another trial is taken as input signal and segmented by the model. Different situations are investigated:
⊲ same subject, same tempo: reference and input signals are extracted from gestures performed by the same participant at the same speed.

⊲ inter-subject, same tempo: reference and input signals are performed at the same speed by two different participants.

⊲ same subject, inter-tempo: reference and input signals are performed by the same participant at different speeds.

Figure 4.2: Our gesture database. (a) The Wii Remote controller from Nintendo®. (b) The gesture vocabulary of our database.
This general procedure provides a statistical estimation of the performance of the algorithms. Indeed, for the same subject and the same tempo, if each trial is successively taken as reference and the nine others are segmented, 90 segmentations are computed per participant, tempo and gesture. As a result, for each tempo a total of 2880 gestures are segmented, giving access to a large evaluation base. We can then estimate the efficiency of the various algorithms by comparing the computed segmentations.
Parameters

In order to achieve a consistent comparison between the models, parameters have to be set so as to maximize the segmentation results. Parameters are optimized on the Hierarchical HMM by manually finding the local maximum of the recognition rate over a list of possible parameter values. Notably, the two models have common parameters carrying an identical meaning. These parameters are optimized on the Hierarchical HMM and are then set identically on the two models for the testing procedure. For each model, the high-level structure, governing the transition probabilities between primitive gestures, is considered ergodic. This means that the initial probabilities of each primitive gesture are equal, as are the transition probabilities between segments. Another important parameter to be set is the standard deviation of the normal distribution which defines the observation probability.
For the proposed model, two other parameters are of major interest. First, state transition probabilities of the signal level are determined by the vector:

    LRprob = [self, next, skip, skip2]

where self sets the auto-transition probability, next defines the probability of making a transition to the next state, and skip, skip2 respectively define the probability of skipping one or two states in the left-right topology. As our model considers one hidden state per sample of the reference gesture, these probabilities respectively correspond to looping on a sample – i.e. slowing down – moving to the next sample, and skipping samples – i.e. speeding up. With the vector limited to four values, the system can adapt to an increase of the execution speed by up to a factor of 3 (a sketch of the corresponding transition row is given below). Besides, setting the probabilities of reaching an exit state is a crucial issue in the model: this parameter defines the probability of finishing a primitive gesture, thus conditioning the possibility of making a transition to another primitive. In our tests, setting the probabilities of reaching an exit state on the last two samples of the primitive shape to 0.1 and 0.75 respectively was found optimal.
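As an illustration, one row of the within-primitive transition matrix can be built directly from this vector. This is only a sketch under the assumptions above, with hypothetical names; in the actual model, the probability mass that would fall past the last sample is routed through the exit states rather than dropped.

#include <vector>

// Sketch: build row j of the left-right transition matrix A^(i) from
// LRprob = [self, next, skip, skip2]. Probability mass is placed on states
// j, j+1, j+2 and j+3 of a chain of length Mi.
std::vector<double> leftRightRow(const std::vector<double>& LRprob,
                                 int j, int Mi)
{
    std::vector<double> row(Mi, 0.0);
    for (int k = 0; k < static_cast<int>(LRprob.size()); ++k)
        if (j + k < Mi)
            row[j + k] = LRprob[k];   // near the end of the chain, the
                                      // remaining mass goes to exit states
    return row;
}

For instance, the vector LRprob = [0.2, 0.6, 0.1, 0.1] used in section 4.2.1 yields rows that favor moving to the next sample, i.e. an execution at the reference speed.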
The Segmental HMM introduces a duration distribution constraining the possible time variations of segments. Here, the durations are uniformly distributed over a range centered on the length of the reference segment, plus or minus 10 samples.
Reference segmentations

In the proposed model, the learning process consists in defining templates of the primitive shapes. In this study we chose to extract the primitives from a reference signal by manually annotating the time boundaries of each segment. In the following sections we refer to this definition of the primitive gestures as the "reference segmentation" and we investigate two situations:

⊲ Tempo-based segmentation: the click track recorded synchronously with the gestures of each participant is used as a reference segmentation. We define 4 segments as the acceleration signal between two successive clicks. This segmentation is directly linked to the position and corresponds to the visual representation of the gesture, as represented at the top of figure 4.3.

⊲ Signal-based segmentation: as highlighted by the middle plot of figure 4.3, the acceleration signals show important activity in three areas corresponding to the corners of the square. Focusing on the signal itself instead of the visual representation of the gesture leads to identifying 3 primitive shapes defined by the acceleration patterns. This segmentation is defined manually by identifying the separations between two successive patterns. The bottom curve of figure 4.3 depicts the variance of the acceleration signals computed over the 10 trials. Interestingly, the minima of the variance – or the crossings between the variances on each axis – are a powerful indicator of this segmentation.
4.1.3 Evaluation function

In order to quantify the quality of a segmentation, an evaluation function has been defined, inspired by the evaluation method proposed for the MIREX score-following contest [Cont et al., 2007]. Given a musical score, recorded as a MIDI file, a score-following system aims at aligning the performance of musicians to the score by recognizing musical events such as notes. Thus, the system outputs a series of symbols with a given timing and duration. Here, the problem is fairly similar, as segmenting a gesture into primitive shapes amounts to computing the optimal series of segments with a given temporal alignment. The proposed evaluation function identifies different types of errors to quantify the quality of the segmentation, a correct segmentation requiring the recognition of the adequate sequence of primitive shapes with a precise time alignment.
Figure 4.3: The two reference segmentations considered in this study. Mean of the acceleration signals of gesture 4 at tempo 120 bpm for subject 1. Dashed curves represent the standard deviation of the 10 trials around the mean. The tempo-based segmentation is shown on the top curve, and the bottom plot describes the signal-based segmentation.
Figure 4.4: The different types of errors identified by the evaluation function: substitution (1),
good segment (2), insertion (3) and misaligned segment (4).
A hypothetical segmentation result is represented in figure 4.4, which illustrates the various errors we need to recognize. The figure depicts two stair plots: the reference segmentation is represented as a dashed blue line and the computed segmentation as a solid red line. Each horizontal portion is a recognized segment whose length defines its temporal extent. This representation provides a visual analysis of the segmentation in comparison with the reference, both on segment identification and on time alignment.
We define four types of errors. First, a substitution (1) is found if a wrong symbol is identified over the whole of a segment. Second, a correct segment (2) is identified if the symbol is correct over a whole segment and aligned with the reference segmentation within a threshold fixed a priori. In the figure, we observe a substitution of short duration at the beginning of the third segment (3); as it does not make the whole segment wrong, this error is called an insertion. Finally, we define as misaligned segments (4) those whose symbol is correctly recognized but which are delayed beyond a given threshold compared to the reference segmentation.

In order to extract a single consistent indicator, we denote as good segmentations those for which all segments are recognized and aligned within a given threshold, i.e. those which contain only correct segments and no insertions. In the tests of the following sections, the threshold was fixed at 200 ms. This value is chosen to allow for variations in the alignment between various performances of the same gesture.
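Stated compactly, a trial counts as a good segmentation when the computed sequence of symbols matches the reference exactly and every boundary is aligned within the threshold. The following C++ sketch assumes a hypothetical representation of segments as symbol/onset pairs.

#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: one recognized segment, identified by its symbol (primitive index)
// and its onset time in seconds.
struct Segment { int symbol; double onset; };

// A "good segmentation" contains the same sequence of symbols as the
// reference (so no substitutions and no insertions) with every onset aligned
// within 'threshold' seconds (0.2 s in the following tests).
bool isGoodSegmentation(const std::vector<Segment>& reference,
                        const std::vector<Segment>& computed,
                        double threshold = 0.2)
{
    if (computed.size() != reference.size())
        return false;                 // insertions change the segment count
    for (std::size_t n = 0; n < reference.size(); ++n) {
        if (computed[n].symbol != reference[n].symbol) return false;
        if (std::abs(computed[n].onset - reference[n].onset) > threshold)
            return false;
    }
    return true;
}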
4.2 Offline segmentation: a comparison with the segmental model

As illustrated by B. Caramiaux in a recent study, offline segmentation of gesture signals provides interesting prospects for musical performance analysis [Caramiaux et al., 2011b]. Intrinsically linked to time, musical gestures can show important variations in their temporal unfolding. Thus, a model designed for gesture analysis must be able to handle speed variations within gestures.
In this section, we carry out a quantitative comparison between the proposed model and the segmental HMM. Our goal is to identify the suitability of each model in various situations. In particular, we want to estimate the efficiency of the models when particular distortions appear between the reference and test signals. To that end, we begin with a comparison on the segmentation of gestures performed by the same participant at the same speed. Then, we investigate the robustness of the algorithms to two types of distortion, introduced either by inter-subject or inter-tempo performances of the same gesture. In the first case, the different execution strategies of the participants distort the shapes of the acceleration patterns; in the second case, the distortion is due to a different speed of execution. For each situation, the two reference segmentations introduced in the previous section are evaluated.
4.2.1 Same subject, same tempo

Protocol and parameters

For each gesture, the participant recorded 10 trials – one after the other – synchronized to a given tempo. As a result, weak speed variations are expected given the repeatability between successive gestures. This consideration leads to setting the transition probabilities of the production level to LRprob = [0.2, 0.6, 0.1, 0.1], reinforcing the transitions to the next sample and thus favoring an execution at the same speed. The standard deviation of the Gaussian observation probability was optimized by finding the maximum recognition rate of the proposed model over a range of possible values. For the three tempos, 180, 120 and 60 bpm, this parameter was found optimal at 0.06, 0.05 and 0.04 respectively.
Tempo-based segmentation

The results are summarized in table 4.1, which compares the accuracy of our model with that of the Segmental HMM in terms of good segmentation rate. The score is an average of the results over 4 gestures performed by 8 participants, which provides 2880 test segmentations.

As a first observation, we may notice a difference between segmentation results at the different tempos: for the proposed model, while the results are comparable for intermediate and high speeds – 98.9% at 120 and 97.4% at 180 bpm – they are lower for gestures performed at 60 bpm, with a recognition rate of 89.2%. This can be explained by analyzing the gesture signals shown in figure 4.5. A comparison between gestures executed at 120 and 60 bpm reveals a lower amplitude and an important variability between trials for slow gestures, making acceleration patterns more difficult to identify. If we focus on the results obtained with the segmental HMM, the same decrease is observed at 60 bpm. The good segmentation rate is an average of the results of 90 tests for each gesture and each participant. In order to compare the results of the two models, we performed a t-test between the lists of 32 scores obtained for each tempo. For each t-test, the rejection of the null hypothesis means that the two distributions do not have equal means. For the three tempos, the null hypothesis cannot be rejected at the 5% significance level, showing that the results obtained with the two models are equivalent. This conclusion emphasizes a good repeatability between gestures performed one after the other by the same participant.
Segmentation results (% good segmentation)

                   tempo-based                          signal-based
Tempo      Hierarchical HMM   Segmental HMM    Hierarchical HMM   Segmental HMM
180 bpm          97.4              96.9              99.2              97.7
120 bpm          98.9              99.2              100               99.8
60 bpm           89.2              93.0              96.5              95.3

Table 4.1: Results of the segmentation for same subject and same tempo. The table reports the percentage of good segmentations, averaged over 8 participants and 4 gestures. Two reference segmentations are shown: tempo-based and signal-based.
Signal-based segmentation

The results of both models applied to the signal-based segmentation are summarized in the right part of table 4.1. For this second segmentation, results are globally higher than those obtained with the tempo-based segmentation. However, it is important to warn the reader about such a comparison. First, the signal-based segmentation includes only three primitive gestures whereas the tempo-based one involves four primitives; statistically, the number of possible errors is reduced for the signal-based segmentation. Second, the beginning and the end of the signal are cut for the signal-based segmentation, so the input signals considered in each reference segmentation are different. As a result, we cannot conclude that the models are more efficient with the signal-based reference segmentation, but several observations derive from these results. The better results obtained with the signal-based segmentation highlight that the selected region concentrates the most salient acceleration patterns. This reflects a greater variability at the beginning and the end of the gestures. When performing the three corners of the square, the gestures of the participants are constrained by the goal of reaching the following corner. On the contrary, participants start from a static pose, and the end of the movement anticipates a relaxation, introducing variable behaviors.

Figure 4.5: Acceleration signals of gesture 2 performed at two different speeds by participant 9. The 10 trials of the participant are superimposed, and two axes are represented.
Nevertheless, the comparison between the two models leads to the same conclusions as the results on the tempo-based segmentation. At the 5% significance level, the null hypothesis of the t-test is not rejected, confirming the equivalence between the proposed model and the SHMM for gestures showing weak speed variations.
Conclusion

We have investigated same-subject, same-tempo segmentation. Results show that the two models are equally suitable for this situation, where weak distortions of the signals are observed thanks to the repeatability of the participants between trials of the same gesture. It also appears that gestures performed at 60 bpm are more difficult to segment than faster gestures, because of their small amplitude and variability. In the two following sections, we evaluate the efficiency of both algorithms in situations involving important distortion.
4.2.2 Inter-subject, same tempo

Protocol and parameters

In this section, the input signal and the primitive shapes are extracted from the gestures of two distinct participants. Distortion is expected here because different participants perform the gesture in different manners, resulting in distinct shapes of the acceleration signals. The parameters are identical to those used in the previous section.
Tempo-based segmentation

The results obtained with each algorithm are reported on the left side of table 4.2, averaged over 3200 tests. We observe that the results are lower than for same-subject segmentation, which confirms the hypothesis of signal distortion. Moreover, the gap between intermediate and slow tempos is even wider: at 120 bpm about 86.4% of the gestures are well segmented, against only 50.2% at 60 bpm. Different strategies can be chosen for executing the movement, as shown in figure 4.6, where the mean acceleration signals of slow performances of gesture 2 are plotted for two participants. In the first case, the participant chose to move very regularly, changing direction when hearing the click. The gestures of the second participant show more salient acceleration patterns, because this participant chose to hit the corners on the clicks rather than keep a constant speed. The decrease of the results at slow speed is an indicator of varied strategies in the execution of the gestures when the tempo is low. As before, we performed a t-test between the results of each model to compare their respective efficiency. Here, the null hypothesis is rejected at 180 bpm at the 5% significance level, but cannot be rejected at 120 and 60 bpm. This means that our model is more efficient at 180 bpm, but its superiority at 120 bpm is not verified. However, the null hypothesis is rejected at 120 bpm at the 7% significance level. Thus, it appears that the proposed model is better able to handle distortion due to varied execution strategies if the gesture is performed fast.
Segmentation results (% good segmentation)

                   tempo-based                          signal-based
Tempo      Hierarchical HMM   Segmental HMM    Hierarchical HMM   Segmental HMM
180 bpm          78.9              72.9              92.3              87
120 bpm          86.4              81.5              94.2              89.4
60 bpm           50.2              54.8              77.1              72.1

Table 4.2: Results of the segmentation for inter-subject and same tempo. The table reports the percentage of good segmentations, averaged over 8 participants and 4 gestures. Two reference segmentations are shown: tempo-based and signal-based.
Figure 4.6: Acceleration signals of gesture 2 performed by participants 3 and 7, respectively, at 60 bpm. Solid lines represent the mean acceleration over the 10 trials, and dashed lines represent the standard deviation around the mean.
Signal-based segmentation

As previously, the signal-based segmentation shows better results for both models compared to the tempo-based segmentation. Two important observations must be made here. First, it appears that the gap between the results at high and low tempos is reduced for the HHMM as well as for the SHMM. As participants start from a static pose, the beginning of the gestures is often close to zero. With the tempo-based segmentation, for an input gesture of small amplitude, fitting any segment to the first primitive becomes highly probable, because the input can be considered as a noisy signal centered around zero. Second, a t-test between the results of each model proves the superiority of the proposed model at 120 and 180 bpm at the 5% significance level. Moreover, even if not confirmed by a t-test, the results of our model are superior to those of the segmental HMM at 60 bpm, which was not the case for the tempo-based segmentation. These two observations highlight that the SHMM has a better ability to handle noisy inputs of small amplitude, but that our model is able to fit the distortions of the acceleration patterns that characterize gestures executed by different participants.
Conclusion

Finally, the results show that the segmental HMM is more robust to noise, because it performs a regression over a whole segment, whereas the proposed model works at the sample level. However, as shown by the results on the signal-based segmentation, our model has a better ability to handle time variations implying a distortion of the primitive shapes. In order to confirm this conclusion, we investigate inter-tempo segmentation, which implies even larger distortions between the reference and the input signal.
4.2.3 Same subject, inter-tempo segmentation

Protocol and parameters

In this section we investigate inter-tempo segmentation, meaning that the reference and input signals are gestures performed at different speeds. Contrary to the same-tempo situation, we expect important variations in the speed of execution between reference and test signals. Thus, transition probabilities were adapted to allow positive or negative speed variations: LRprob = [0.25, 0.25, 0.25, 0.25]. The standard deviation of the observation probability distribution was optimized by the procedure explained in section 4.1.2 and fixed to 0.01.
Tempo-based segmentation

The good segmentation rates computed for each pair of reference and test tempos are reported in table 4.3, averaged over 3200 tests.

Let us first consider the situations for which the input gesture is performed at a higher tempo than the reference. For the proposed model, the best score is obtained between 120 and 180 bpm, namely for an increase of the speed by a factor of 1.5. With 90% good segmentation, our model outperforms the segmental HMM, which shows a score of 73%. When doubling the speed, the difference between the models is smaller, with 70.2% for our model against 60.2%. For that case, a t-test does not reject the null hypothesis at the 5% significance level, and we cannot conclude to a superiority of the proposed model. Between 60 and 180 bpm, the Hierarchical HMM achieves 0% good segmentation. This very low score is due to the transition probabilities of the signal level, which only authorize skipping two states, limiting the possible speed-up to a factor of 3: even if this factor is respected over the whole gesture, local speed variations can exceed this threshold. Setting the transition vector to LRprob = [0.2, 0.2, 0.2, 0.2, 0.2] allows for local speed variations of a factor of 4, increasing the results from 0 to 21.2%. Situations involving an increase of the speed between the reference and the test gesture tend to highlight a better precision of our model, but this is not systematically confirmed by a t-test.
Hierarchical HMM (% good segmentation)

Reference \ Test    180 bpm    120 bpm    60 bpm
180 bpm                –         45.8       3.5
120 bpm               90           –        0
60 bpm                 0         70.2       –

Segmental HMM (% good segmentation)

Reference \ Test    180 bpm    120 bpm    60 bpm
180 bpm                –         24.2       0
120 bpm               73           –        3.5
60 bpm                15         60.2       –

Table 4.3: Inter-tempo segmentation: evaluation on the tempo-based segmentation.
Considering the top right part of the table, it is evident that the results of a segmentation between a fast reference and a slow test gesture are less precise than for speeding up. Indeed, from 180 to 120 bpm the respective scores of the two models are 45.8% and 24.2%, and they fall to 3.5% and 0% for test gestures executed at 60 bpm. Here, only slowing down by a factor of 1.5 reveals a more precise segmentation using the Hierarchical HMM. In the previous section, we drew attention to the difficulty of segmenting slow gestures; again, gestures performed at 60 bpm prove very difficult to identify.

Finally, the proposed model outperforms the segmental HMM for variations of the execution speed by a factor of 1.5 or 2. In extreme cases and for segmenting very slow gestures, the model presents very low segmentation rates. Notably, tripling the speed of execution induces too much distortion to achieve an efficient segmentation. As before, segmenting slow gestures proves a difficult task because of the noise and the small amplitude of gestures performed at 60 bpm.
Signal-based segmentation

The good segmentation rates of each model on the signal-based segmentation are reported in table 4.4. As for the tempo-based reference segmentation, the best results are obtained when speeding up by a factor of 1.5 or 2, with respective scores of 98.5% and 93.2%. For these cases, the segmentation rate of the SHMM is lower than for the first reference segmentation and our model clearly outperforms the Segmental HMM. As for the tempo-based segmentation, setting the transition probabilities of the production level to authorize skipping three states improves the results between 60 and 180 bpm from 21.2% to 67.2%, whereas the segmental HMM reaches 21.2%. For this reference segmentation, the proposed model shows far better precision when segmenting gestures performed faster than the reference.

For same-tempo segmentation, the signal-based reference segmentation always shows better results than the tempo-based one. Here, the Segmental HMM even shows lower scores for the signal-based segmentation when speeding up, and no real improvement is observed for slowing down. For the proposed model, on the contrary, between 180 and 120 bpm the results of the signal-based segmentation reach 87%, against 45.8% for the tempo-based segmentation. Thus, we can conclude that this definition of the primitive shapes – avoiding the beginning and the end of the gestures – is very well suited to the proposed model and confirms its weakness in the presence of noise. The global tendency for the SHMM, on the contrary, is a decrease of the results with the signal-based segmentation, which testifies to a lack of adaptation to the deformation of the acceleration patterns.
Hierarchical HMM (% good segmentation)

Reference \ Test    180 bpm    120 bpm    60 bpm
180 bpm                –          87        3.4
120 bpm              98.5          –       21.3
60 bpm               21.2        93.2       –

Segmental HMM (% good segmentation)

Reference \ Test    180 bpm    120 bpm    60 bpm
180 bpm                –         25.3       4.5
120 bpm              47.5          –       10.2
60 bpm               21.2         28        –

Table 4.4: Inter-tempo segmentation: evaluation on the signal-based segmentation.
Figure 4.7: Coarticulation effects. Top: gesture 1 performed at 180 bpm; bottom: the same gesture performed at 60 bpm.
As highlighted when studying synchronized gestures, a wide gap exists between rapid and slow gestures. For inter-tempo segmentation, this difference is even more significant: between 120 and 60 bpm the accuracy is about 21.3%, and falls to 3.4% between 180 and 60 bpm. In order to explain these results, two signals are plotted in figure 4.7: the top curve represents gesture 1 performed at 180 bpm by participant 9, and the same gesture executed at 60 bpm is represented in the bottom plot. The slow gesture shows a clear separation of the acceleration patterns, a short lapse of silence being inserted between each motif. A quite different behavior is observed at high speed: because the participant must trace the shape precisely in a short time interval, segments are sequenced quickly. This constraint is reflected in the acceleration signals by an overlapping of the acceleration patterns on the top curve. This observation means that sequenced primitive gestures influence each other, introducing coarticulation. This phenomenon explains the bad results obtained for an input signal at 60 bpm and primitive shapes learned at a higher tempo.
4.2.4 Discussion

We have detailed an extensive evaluation procedure aimed at comparing the proposed model with the Segmental HMM on a segmentation task. The models are equivalent when weak distortions of the signal are observed between the reference shapes and the input gestures. Segmenting slow gestures is often very challenging due to the small amplitude of the acceleration signals. Two types of distortion of the acceleration patterns have been studied, introduced by inter-subject or inter-tempo segmentation. In both cases, our model shows a better accuracy, especially at high speed, which confirms its ability to handle distortion of the gesture signals due to nonlinear time variations. Thanks to the two reference segmentations introduced, we showed that the segmental HMM lacks a dynamic fitting of the speed variations, but has an ability to smooth noisy input. A major benefit of the proposed model is the access to the time progression, which enables us to study the time alignment between various executions of the same gesture.
4.3 Towards a realtime system

In the previous section, a comprehensive comparison of our model with the SHMM has been conducted, proving the ability of the proposed model to handle important speed variations. However, the method based on the Viterbi algorithm requires the entire observation sequence, forbidding its use in realtime. While this approach offers an interesting tool for studying musical performance through gesture analysis, a realtime system would open new perspectives for interactive musical systems.

In this section, we investigate several techniques to develop a realtime segmentation method. Here, the interest of expressing Hierarchical HMMs as Dynamic Bayesian Networks is even more evident. The time complexity of the algorithms derived from this framework is linear in the length of the observation sequence, ensuring a constant number of operations at each time step. Moreover, different variations of the Frontier algorithm enable us to develop several methods for realtime or quasi-realtime gesture segmentation.

We begin by introducing a causal inference technique based on the forward algorithm. Then, an implementation of a Fixed-Lag Smoothing algorithm aims at improving the accuracy of the segmentation. A drawback of this second method is that smoothing involves a backward pass over a fixed lag, delaying the results by a few samples.

As we focus on developing a realtime system, only the tempo-based reference segmentation is considered in this section. Indeed, this reference segmentation is intuitive because it is related to the visual representation of the gesture, and is therefore more suitable to a performance situation, where the signal-based segmentation would be harder to put into practice.
4.3.1 Forward algorithm

The frontier algorithm is a generalization to DBNs of the Forward-Backward algorithm. The forward algorithm is analogous to that of HMMs and consists in updating the forward variable at each time step. We propose here a causal method for realtime segmentation. At each time step, the forward variable is updated, evaluating the probability of being in the macro-state X_t given the observation sequence up to the current time: P(X_t | y_{1:t}). Thus, a segmentation can be derived by maximizing this quantity over all primitive shapes. The index of the most probable production state within the reference primitive allows for estimating the time progression within the gesture.
Parameters

The standard deviation was set to 0.1, 0.075 and 0.025 for tempos 180, 120 and 60 bpm respectively. As before, the probabilities of reaching an exit state were set to zero except on the last two samples. We focus on same-tempo segmentation, so the left-right transition probabilities favor a transition to the next sample and the vector is set to LRprob = [0.2, 0.6, 0.1, 0.1].
Results

In table 4.5, the results obtained with the forward algorithm are compared with the optimal decoding, for the tempo-based reference segmentation. As we are performing causal inference, the segmentation method is sub-optimal and its accuracy is inferior to that of the Viterbi algorithm. The relationship between the segmentation rate and the speed of the gesture is verified here, and is even more pronounced for the forward algorithm, whose score falls from 64.5% to 31.4% between 120 and 60 bpm. The sensitivity of our model to noise derives from the fact that it works at the sample level, which is all the more critical for the forward algorithm, which provides a causal estimation of the state sequence.
Segmentation results (% good segmentation)

Tempo      Forward algorithm   Viterbi algorithm
180 bpm          75.8                97.4
120 bpm          64.5                98.9
60 bpm           31.4                89.2

Table 4.5: Comparison of the Forward algorithm with the Viterbi algorithm for same-subject and same-tempo segmentation. The results report the segmentation rate obtained on the tempo-based segmentation, averaged over 2880 tests.
As the results are lower than those of the Viterbi algorithm, it is important to detail the errors introduced by the forward algorithm. The histogram of the lengths of the errors – insertions and substitutions – identified by the evaluation function is depicted in figure 4.8. As shown in the figure, the number of errors grows considerably as the tempo decreases. A more significant observation derives from the shape of the duration histogram, which follows a decreasing distribution. With the Viterbi algorithm, most errors are substitutions over a whole segment. On the contrary, the forward algorithm introduces a great number of insertions of short duration which do not make the whole segment wrong. In fact, the causal algorithm estimates the most probable primitive at each time step, whereas the Viterbi algorithm computes the optimal sequence of primitives. Accordingly, the state sequence estimated by the forward algorithm can include forbidden transitions, since it updates the most likely hidden state without taking into account the path up to that moment. Notably, if a portion of a primitive shape locally maximizes the probability, it will be inserted within a segment without reaching an exit state and making a high-level transition.
Figure 4.8: Histogram of the length of the errors (substitutions + insertions) for the Forward
algorithm.
A typical example of this phenomenon is presented in figure 4.9. The gesture performed by participant 2 at 60 bpm has a small amplitude and presents low-frequency oscillations. Locally, some patterns of the input signal exactly fit portions of the reference gesture, inserting errors of short duration. The Viterbi algorithm performs an optimal decoding in the sense that it maximizes the probability of the sequence, forbidding these insertions. However, in the context of musical performance, it is evident that insertions are preferable to substitutions. The ability of the model to realign with the input signal in realtime is then an advantage in a real situation.
Figure 4.9: A typical example of a segmentation computed with the forward algorithm presenting insertions. Executed at 60 bpm, this gesture of participant 2 has a very small amplitude and low-frequency noise. As local portions of the signal perfectly fit portions of the reference gesture, insertions are detected along the movement.
Our model aims at adding a hierarchy to the Gesture Follower developed in previous research. The lack of high-level modeling makes that earlier model very poor for precise segmentation, with an accuracy near zero. The exit states introduced in the Hierarchical HMM are particularly interesting, as they increase the probability of the first samples of the primitive shapes when a segment is about to finish, providing a kind of automatic "reinitialization" mechanism.
4.3.2 Fixed-Lag Smoothing

As shown in the previous section, many insertions arise when segmenting gestures using the simple forward algorithm. As shown by the histogram of figure 4.8, the shorter the insertions, the more numerous they are. In some situations, one could be interested in having a precise segmentation, even if that means conceding a delay with respect to realtime. The goal would then be to define a system able to smooth the forward segmentation in order to avoid insertions.

Again, DBNs allow for deriving such algorithms. We are particularly interested here in the Fixed-Lag Smoothing algorithm detailed in section 3.2.3, which estimates the most probable state at time t − L given the observation sequence up to time t. The system continuously updates the same parameters – primitive and time progression – at each time step, but with a delay equal to the lag. We detail here some segmentation results obtained with this method, in order to find a compromise between the quality of the estimated segmentation and the delay a user could allow.
Figure 4.10 depicts the results obtained for segmenting gestures given the tempo-based
reference segmentation. The good segmentation rate is evaluated as a function of the
lag and compared with the results of the forward algorithm. To achieve a consistent
comparison, the same parameters are set for each algorithm, the standard deviation being
optimized on the forward algorithm.
Figure 4.10: Fixed-Lag Smoothing: Influence of the lag and comparison with the Forward algorithm
As a first observation, we may notice that the accuracy increases with the lag, which confirms the expected smoothing effect: the larger the lag, the more the backward pass is able to smooth the segmentation and avoid insertions. However, for short lags the algorithm performs worse than the simple forward pass. We observe that the forward algorithm is very efficient for short gestures, performed at 180 bpm; for this tempo, a lag of almost 100 ms is needed for the smoothing algorithm to show better accuracy. On the contrary, the forward algorithm is very sensitive to insertions at 60 bpm, and the smoothing algorithm overtakes the forward pass from a lag of 30 ms, namely 3 samples.
4.3.3 Discussion

Globally, the realtime algorithms presented in this section are less precise than the optimal decoding using the Viterbi algorithm. Notably, a great number of insertions arise when using the Forward algorithm, the length of the errors following a decreasing distribution, and the number of insertions increases as the speed decreases. Fixed-Lag Smoothing has been studied as a compromise between a smoothing effect which avoids insertions and a delay with respect to realtime. The accuracy of the algorithm increases with the lag, confirming the smoothing effect. For each tempo, Fixed-Lag Smoothing outperforms the Forward algorithm beyond a given lag, and this threshold grows as the speed increases.

Finally, it would be interesting to implement an adaptive system taking advantage of the performance of the forward algorithm at high speed and of the smoothing for slow gestures. Defining the lag dynamically as a function of the estimated speed would optimize the results of realtime segmentation. Moreover, this perspective is coherent in the sense that rapid gestures require short lags to remain reactive, whereas slow gestures can accept a longer delay.
4.4 A musical application: segmentation of violin bowing techniques

While the previous method was designed to evaluate the suitability of our model for gesture analysis, the database used may seem far from what we could consider "musical" or "expressive" gestures. We propose in this section an application to segmenting the movements of a violinist. In the IMTR team, two types of research have been conducted: analyzing the gestures of violinists, and designing a system which involved developing both specific sensors and realtime gesture analysis software.

Notably, a collaboration with the French composer Florence Baschet raised interest in studying bowing techniques. Captured using miniature sensors fixed on the bow, the acceleration was analyzed in order to correlate the gestures of the instrumentalist with bowing techniques such as Spiccato and Pizzicato. Embedded in a realtime system, the gesture analysis technique is central to Florence Baschet's quartet StreicherKreis¹, created at Ircam in 2007.
In the case study presented here, our model is used for segmenting violin bowing techniques. We consider two recordings of the same musical phrase, which include an audio track and the signals from a three-axis accelerometer. In figure 4.11, the musical score is correlated with the audio track and the acceleration signal in the direction of the bow. Different bow strokes appear in the score, such as Spiccato, Col Legno and Pizzicato. Two trials of the same phrase are presented in the figure, performed by the same player on the same day. As reference segmentation, we consider the intrinsic divisions of the musical score, represented by dashed lines which emphasize the temporal alignment between the score and the two performances. The segmentation task was set as follows: the first performance is segmented manually, given the audio track and the musical score. The twelve primitive shapes obtained are learned, and the acceleration signals of the second performance are segmented by the model.

¹ http://www.florencebaschet.com/site/fiche9-StreicherKreis.html
"
- - "* "* "* "* "*
%)*+
"#$"% $&'((#"%
$&'((#"% # *' &',,+
# *" * " * # *&',,+ " &',,+
#
*- " *- " * # *
*
*
*
)
'
# * " *- - *- *
!
- - !
#1 2
'
#*
( ! ) " ) )- " " )* # # )* ! ! )* *. #
&)* / / /
% & *) '' *) ! * " * " *
!
+ +
#
"
"
0 "
!
,! #
!
3
3
3 3
3
3'
*#$&$./%#
!"#$$%&'
()**+,-./.012-
!
"
!
#$
"
(%-.-/01%.2#""+
(%1.
3%),#
!
!
!"##$
$
!"##$
!"#$"%&'
()"**+,#
&0+,,1
)"--"*+,#
()"**+,#
!"#$%&'!()*%+,
5+$!.&6
!--)./&0&1!/"/2
3$,)&142
!"#$%&'!()*%+,
5+$!.&7
!--)./&0&&1!/"/2
3$,)&142
Figure 4.11: The musical score of the violin phrase, correlated with the audio track and an acceleration signal. The phrase includes different bowing techniques: Spiccato, Col Legno and Pizzicato.
The time alignment between the score, the audio track and the acceleration signal is represented
by vertical dashed lines. Two trials are presented, performed by the same player on the same day.
The resulting segmentations are plotted in Figure 4.12. First, the Viterbi algorithm provides an almost perfect segmentation of the gesture, as shown on the second curve of Figure 4.12: only two substitutions are introduced. Segment 9 is detected in place of segment 2, but both correspond to the same bow stroke, namely Spiccato. Similarly, the sixth segment is substituted for the seventh, confusing two Pizzicato impulses. Thus, the errors introduced are not aberrant, as they correspond to the same bowing technique.
The same type of error is introduced by the forward algorithm between segments 6 and 7, as shown on the third curve. However, the causal segmentation presents many insertions, the majority of which are situated at the transitions between segments. These insertions last no more than a few samples, so that in a concert situation, for example, errors would only be introduced over durations shorter than 30 ms at the transitions between segments.
In order to improve the precision of the realtime segmentation, we applied Fixed-Lag smoothing to the same segmentation task. The segmentation computed with a lag of 25 samples is plotted on the bottom curve of the figure. Only a few insertions remain and
the segmentation is close to the results of the Viterbi algorithm. However, these results require a lag of 25 samples, inducing a delay of 110 ms. Depending on the needs of a user, one could choose either to accept insertions of 30 ms or to concede a delay that smooths the segmentation results.
[Figure 4.12 comprises four stacked panels: "Input signal and segmentation points" (amplitude vs. time, with the bow strokes Spiccato, Col Legno, Pizzicato and Spiccato annotated); "Viterbi algorithm: optimal decoding"; "Forward algorithm: causal inference"; and "Fixed-Lag Smoothing, Lag = 25 samples (0.11 s)". The last three plot the segment number (1–12) against time (s).]
Figure 4.12: Segmentation results on the violin phrase compared for three algorithms. On each figure, the top curve reports the audio waveform and the reference segmentation points. On the second curve, the segmentation computed with the Viterbi algorithm (red) is compared with the reference segmentation (dashed blue line). The third curve reports the results obtained with the forward algorithm, and the bottom plot reports the results of Fixed-Lag Smoothing with a lag of 25 samples, corresponding to a delay of 110 ms.
5 Conclusion and future directions
We have proposed a model for the realtime segmentation of complex gestures. Based on the Hierarchical Hidden Markov Model, the system represents gestures as sequences of primitive shapes. A particular topology provides time precision at the sample level and enables learning templates from a single example. Efficient algorithms derive from the formalization as a Dynamic Bayesian Network: the Viterbi algorithm performs offline decoding, and two realtime methods provide approximate solutions.
The evaluation of the offline algorithm on an accelerometer database provided a quantitative comparison between our model and the segmental HMM on a segmentation task. Two types of distortion of the acceleration patterns have been studied, arising in inter-subject and inter-tempo segmentation tasks. In both cases, our model shows a better accuracy, especially at high speed, which confirms its ability to handle distortions of the gesture signals due to nonlinear time variations. On the contrary, the rigid temporal modeling of the segmental HMM limits its ability to fit such distortions, although that model is able to smooth noisy input.
An evaluation on the same database compared two methods for realtime segmentation. The forward algorithm is an efficient causal segmentation technique but inserts many false detections of short duration when segmenting slow gestures. Fixed-Lag smoothing is a compromise between a smoothing effect, which avoids insertions, and a delay with respect to realtime. The lag threshold beyond which the algorithm outperforms the forward procedure increases with the speed. As a result, it would be interesting to define a system in which the lag is adapted dynamically as a function of the estimated speed of the gestures.
Finally, the example of realtime segmentation of bowing techniques illustrates a musical application of our model. Both the forward algorithm and Fixed-Lag smoothing show interesting results and provide two different strategies for a musical performance context.
A major prospect would be to apply the system to two types of studies: performance analysis, for example by segmenting the gestures of an instrumentalist or a dancer; and realtime interaction. During this study, we developed a prototype of an external object for Max/MSP. Due to a lack of time, the object could not be optimized to reduce its computation time and memory usage. For gestural control of sound synthesis, such a system is interesting because the definition of a vocabulary of primitive gestures permits a realtime segmentation of complex or continuous gestures, allowing multi-level mapping strategies.
Here, we consider the high-level structure as ergodic. Implementing a learning procedure for the transition probabilities between primitive shapes would improve the segmentation results, in particular by avoiding insertions in the realtime algorithms. We can imagine a learning process based on an extensive database of gesture sequences, but a more interesting process would define the transition probabilities dynamically, given a musical score, to achieve efficient gesture "following".
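To illustrate this last idea, the following sketch biases the symbolic transition matrix G towards the successor prescribed by the score; the 0.9 mass, the uniform leak and the last-occurrence-wins rule are arbitrary illustrative choices:

import numpy as np

# Concentrate G on the successor prescribed by the score, keeping a
# small uniform leak for robustness to performance errors. If a
# primitive recurs in the score, its last occurrence wins in this
# deliberately simple sketch.
def score_transitions(score, n_primitives, follow=0.9):
    leak = (1.0 - follow) / n_primitives
    G = np.full((n_primitives, n_primitives), leak)
    for current, nxt in zip(score[:-1], score[1:]):
        G[current] = leak
        G[current, nxt] += follow
    return G / G.sum(axis=1, keepdims=True)  # renormalize rows

# Example: a phrase chaining primitives 0 -> 1 -> 2 -> 1.
G = score_transitions([0, 1, 2, 1], n_primitives=3)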
The PhD thesis I will begin next year aims at defining the coupling between gesture and sound in various contexts through machine learning techniques. Notably, using active learning would allow for an adaptive learning of both the primitive gestures and their relationship to music.
Finally, the evaluation has been limited to a simple database. While continuous gesture recognition and spotting is an active issue in the computer vision community, few solutions exist for mobile devices; notably, we did not find a collection including continuous gesture sequences. A short-term perspective is therefore the creation of a new database of continuous gestures related to musical expression. Inspired by conducting gestures, the specifications of this new database are currently under investigation, defining complex gestures composed of a pre-gesture, or attack, followed by a series of unit segments.
Bibliography
[Artieres et al., 2007] Artieres, T., Marukatat, S., and Gallinari, P. (2007). Online handwritten shape recognition using segmental hidden Markov models. IEEE Trans. Pattern Anal. Mach. Intell., 29:205–217.
[Baum and Petrie, 1966] Baum, L. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563.
[Bevilacqua and Flety, 2004] Bevilacqua, F. and Flety, E. (2004). Captation et analyse
du mouvement pour l’interaction entre danse et musique. In Rencontres Musicales
Pluridisciplinaires - le corps & la musique, Lyon.
[Bevilacqua et al., 2010] Bevilacqua, F., Zamborlin, B., Sypniewski, A., Schnell, N.,
Guédy, F., and Rasamimanana, N. (2010). Continuous realtime gesture following and
recognition. In Kopp, S. and Wachsmuth, I., editors, Gesture in Embodied Communication and Human-Computer Interaction, volume 5934 of Lecture Notes in Computer
Science, pages 73–84. Springer Berlin / Heidelberg.
[Bhuyan et al., 2006] Bhuyan, M., Ghosh, D., and Bora, P. (2006). Continuous hand
gesture segmentation and co-articulation detection. Computer Vision, Graphics and
Image Processing, pages 564–575.
[Bilmes, 2006] Bilmes, J. A. (2006). What HMMs can do. IEICE - Trans. Inf. Syst., E89-D:869–891.
[Bloit et al., 2010] Bloit, J., Rasamimanana, N., and Bevilacqua, F. (2010). Modeling and
segmentation of audio descriptor profiles with segmental models. Pattern Recognition
Letters, 31(12).
[Cadoz and Wanderley, 2000] Cadoz, C. and Wanderley, M. (2000). Gesture-music. In
Trends in Gestural control of music. Ircam - Centre Pompidou.
[Caramiaux et al., 2011a] Caramiaux, B., Susini, P., Bianco, T., Bevilacqua, F., Houix,
O., Schnell, N., and Misdariis, N. (2011a). Gestural embodiment of environmental
sounds: an experimental study. In Jensenius, A. R., Tveit, A., Godøy, R. I., and
Overholt, D., editors, Proceedings of the International Conference on New Interfaces
for Musical Expression, pages 144–148, Oslo, Norway.
[Caramiaux et al., 2011b] Caramiaux, B., Wanderley, M., and Bevilacqua, F. (2011b).
Segmenting and parsing instrumentalist’s gestures. Journal of New Music Research
(submitted).
[Cont et al., 2007] Cont, A., Schwarz, D., Schnell, N., and Raphael, C. (2007). Evaluation of real-time audio-to-score alignment. International Symposium on Music Information Retrieval (ISMIR).
[Corradini, 2001] Corradini, A. (2001). Dynamic Time Warping for off-line recognition of a small gesture vocabulary. Recognition, Analysis and Tracking of Faces and Gestures in Real-Time Systems, IEEE ICCV Workshop on.
[de Laubier and Goudard, 2006] de Laubier, S. and Goudard, V. (2006). Meta-instrument 3: a look over 17 years of practice. In Proceedings of the 2006 conference on New interfaces for musical expression, NIME '06, pages 288–291, Paris, France. IRCAM - Centre Pompidou.
[Doucet et al., 2000] Doucet, A., Freitas, N. d., Murphy, K. P., and Russell, S. J. (2000). Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, UAI '00, pages 176–183, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[Eickeler et al., 1998] Eickeler, S., Kosmala, A., and Rigoll, G. (1998). Hidden Markov Model based continuous online gesture recognition. International Conference on Pattern Recognition (ICPR), 2:1206–1208.
[Fine et al., 1998] Fine, S., Singer, Y., and Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32:41–62.
[Godøy, 2006] Godøy, R. I. (2006). Gestural-sonorous objects: embodied extensions of Schaeffer's conceptual apparatus. Org. Sound, 11:149–157.
[Godøy et al., 2006] Godøy, R. I., Haga, E., and Jensenius, A. R. (2006). Exploring music-related gestures by sound tracing: a preliminary study. In Proceedings of the COST287-ConGAS 2nd International Symposium on Gesture Interfaces for Multimedia Systems.
[Godoy et al., 2010] Godoy, R. I., Jensenius, A. R., and Nymoen, K. (2010). Chunking in music by coarticulation. Acta Acustica united with Acustica, 96:690–700.
[Godøy and Leman, 2009] Godøy, R. I. and Leman, M. (2009). Musical gestures: sound,
movement, and meaning. Routledge.
[Jensenius, 2007] Jensenius, A. R. (2007). Action-sound: Developing methods and tools to
study music-related body movement. PhD thesis, Department of Musicology, University
of Oslo.
[Kela et al., 2006] Kela, J., Korpipaa, P., Mantyjarvi, J., Kallio, S., Savino, G., Jozzo, L.,
and Marca, S. (2006). Accelerometer-based gesture control for a design environment.
Personal and Ubiquitous Computing, 10(5):285–299.
[Kendon, 2004] Kendon, A. (2004). Gesture: Visible action as utterance. Cambridge Univ
Press.
[Lee and Kim, 1999] Lee, H.-K. and Kim, J. H. (1999). An HMM-based threshold model approach for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[Leman, 2006] Leman, M. (2006). Embodied Music Cognition and Mediation Technology.
MIT Press.
[Liu et al., 2009] Liu, J., Wang, Z., Zhong, L., Wickramasuriya, J., and Vasudevan, V. (2009). uWave: Accelerometer-based personalized gesture recognition and its applications. Pervasive Computing and Communications, IEEE International Conference on, pages 1–9.
[Loehr and Palmer, 2007] Loehr, J. D. and Palmer, C. (2007). Cognitive and biomechanical influences in pianists’ finger tapping. Experimental Brain Research, 178(4):518–528.
[Malloch and Wanderley, 2007] Malloch, J. and Wanderley, M. M. (2007). The t-stick:
from musical interface to musical instrument. In Proceedings of the 7th international
conference on New interfaces for musical expression, NIME ’07, pages 66–70, New York,
NY, USA. ACM.
[Mathews, 1989] Mathews, M. V. (1989). The conductor program and mechanical baton,
pages 263–281. MIT Press, Cambridge, MA, USA.
[Miller, 1956] Miller, G. A. (1956). The magical number seven, plus or minus two: some
limits on our capacity for processing information. Psychological Review, 101(2):343–52.
[Mitra and Acharya, 2007] Mitra, S. and Acharya, T. (2007). Gesture recognition: A survey. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 37(3):311–324.
[Murphy, 2002] Murphy, K. P. (2002). Dynamic Bayesian networks: representation, inference and learning. PhD thesis, University of California, Berkeley.
[Ostendorf et al., 1996] Ostendorf, M., Digalakis, V., and Kimball, O. A. (1996). From HMMs to segment models: a unified view of stochastic modeling for speech recognition. In Speech and Audio Processing, IEEE Transactions on, volume 4.
[Rabiner, 1989] Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
[Rajko et al., 2007] Rajko, S., Qian, G., Ingalls, T., and James, J. (2007). Real-time
gesture recognition with minimal training requirements and on-line learning. Computer
Vision and Pattern Recognition, IEEE Computer Society Conference on, 0:1–8.
[Rasamimanana et al., 2011] Rasamimanana, N., Bevilacqua, F., Schnell, N., Guedy, F.,
Flety, E., Maestracci, C., Zamborlin, B., Frechin, J.-L., and Petrevski, U. (2011). Modular musical objects towards embodied control of digital music. In Proceedings of the
fifth international conference on Tangible, embedded, and embodied interaction, TEI ’11,
pages 9–12, New York, NY, USA. ACM.
[Rasamimanana et al., 2009] Rasamimanana, N., Kaiser, F., and Bevilacqua, F. (2009).
Perspectives on gesture-sound relationships informed from acoustic instrument studies.
Organised Sound, 14(2).
[Rosenbaum et al., 1983] Rosenbaum, D. A., Kenny, S. B., and Derr, M. A. (1983). Hierarchical control of rapid movement sequences. Journal of Experimental Psychology: Human Perception and Performance, 9(1):86–102.
[Schaeffer, 1966] Schaeffer, P. (1966). Traité des Objets Musicaux. Editions du Seuil.
[Subramanya et al., 2007] Subramanya, A., Raj, A., and Bilmes, J. A. (2007). Hierarchical models for activity recognition. IEEE Conference in Multimedia Processing, pages 233–237.
[Turaga et al., 2008] Turaga, P., Chellapa, R., Subrahmanian, V., and Udrea, O. (2008).
Machine recognition of human activities: A survey. In Circuits and Systems for Video
Technology, IEEE Transactions on, volume 18.
[Wainwright and Jordan, 2007] Wainwright, M. and Jordan, M. (2007). Graphical models,
exponential families, and variational inference. Foundations and Trends in Machine
Learning, 1(1–2):1–305.
[Wanderley and Battier, 2000] Wanderley, M. and Battier, M. (2000). Trends in Gestural
control of music. Ircam - Centre Pompidou.
[Wanderley et al., 2005] Wanderley, M., Vines, B. W., Middleton, N., McKay, C., and Hatch, W. (2005). The musical significance of clarinetists' ancillary gestures: An exploration of the field. Journal of New Music Research, 34(1):97–113.
[Widmer et al., 2003] Widmer, G., Dixon, S., Goebl, W., Pampalk, E., and Tobudic, A.
(2003). Horowitz factor. AI Magazine, pages 111–130.
[Wilson and Bobick, 1999a] Wilson, A. D. and Bobick, A. F. (1999a). Parametric hidden Markov models for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9):884–900.
[Wilson and Bobick, 1999b] Wilson, A. D. and Bobick, A. F. (1999b). Real-time online
adaptive gesture recognition. In Proceedings of the International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, RATFG-RTS
’99, pages 111–, Washington, DC, USA. IEEE Computer Society.
[Yamato et al., 1992] Yamato, J., Ohya, J., and Ishii, K. (1992). Recognizing human action in time-sequential images using hidden Markov model. In Computer Vision and Pattern Recognition, 1992. Proceedings CVPR'92., 1992 IEEE Computer Society Conference on, pages 379–385. IEEE.
[Zweig, 1996] Zweig, G. (1996). A forward-backward algorithm for inference in Bayesian networks and an empirical comparison with HMMs. Master's thesis, UC Berkeley CS Dept.
A Model description: representation and algorithms
A.1 Representation
Figure A.1: The 2-level HHMM of our study, represented as a DBN. Qt is the production state
at time t, St is the symbolic state at time t; Ft = 1 if the sub-HMM has finished (entered its exit
state), otherwise Ft = 0. Shaded nodes are observed, the remaining nodes are hidden.
A.1.1 Notations
⊲ Y = y_1, y_2 · · · y_T : observation sequence of length T
⊲ Q_t : production state at time t
⊲ S_t : symbolic state at time t
⊲ F_t : binary indicator that is "on" if the sub-HMM has just "finished" (is about to enter its end state)
⊲ U_t : binary indicator that is "on" if the symbolic state sequence has just finished
⊲ M : number of primitive gestures = number of symbolic states
⊲ M^{(i)} : number of states of the production level called by symbolic state i (length of primitive i)
⊲ H = {h_i} : prior probabilities for the symbolic level: h_i = P(S_1 = i)
⊲ G = (g_{il}) : state transition probability matrix for the symbolic level: g_{il} = P(S_{t+1} = l | S_t = i)
⊲ Π^{(i)} = {π_j^{(i)}} : prior probability distribution for primitive i (vertical transition probability): π_j^{(i)} = P(Q_t = j | S_t = i)
⊲ A^{(i)} = (a_{jk}^{(i)}) : state transition probability matrix for primitive i: a_{jk}^{(i)} = P(Q_{t+1} = k | Q_t = j, S_t = i)
⊲ B^{(i)} = {b_j^{(i)}(y_t)} : emission probabilities: b_j^{(i)}(y_t) = P(Y_t = y_t | Q_t = j, S_t = i)
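For concreteness, these parameters can be gathered in a single container. The following sketch assumes Gaussian emissions and stores, for each primitive i, the exit probabilities a_{j,end}^{(i)} in the last column of the transition matrix; this layout is an implementation choice of the sketch, not a prescription of the model:

import numpy as np
from dataclasses import dataclass

# Parameter container mirroring the notations above (sketch).
# prior_symbolic[i]    = h_i
# trans_symbolic[i, l] = g_il
# prior_prod[i][j]     = pi_j^(i)
# trans_prod[i][j, k]  = a_jk^(i), with the last column = a_{j,end}^(i)
# means[i][j], variances[i][j] = parameters of the Gaussian b_j^(i)
@dataclass
class HHMMParams:
    prior_symbolic: np.ndarray   # shape (M,)
    trans_symbolic: np.ndarray   # shape (M, M)
    prior_prod: list             # M arrays of shape (M_i,)
    trans_prod: list             # M arrays of shape (M_i, M_i + 1)
    means: list                  # M arrays of shape (M_i, d)
    variances: list              # M arrays of shape (M_i, d)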
A.1.2 Conditional Probability Distributions
Initial probabilities: t = 1

P(S_1 = i) = h_i
P(Q_1 = j | S_1 = i) = π_j^{(i)}
U_1 = 0
F_1 = 0
Production Level: t = 2 · · · T − 1

P(Q_t = k | Q_{t−1} = j, F_{t−1} = f, S_t = i) = ã_{jk}^{(i)} if f = 0, and π_k^{(i)} if f = 1,

which can be written:

P(Q_t = k | Q_{t−1} = j, F_{t−1} = f, S_t = i) = (1 − f) · ã_{jk}^{(i)} + f · π_k^{(i)}

where we assume j, k ≠ end, and where Ã^{(i)} is a scaled version of A^{(i)} with:

ã_{jk}^{(i)} = a_{jk}^{(i)} / (1 − a_{j,end}^{(i)})

The equation for the binary indicator F_t is:

P(F_t = 1 | Q_t = j, S_t = i) = a_{j,end}^{(i)}
P(F_t = f | Q_t = j, S_t = i) = f · a_{j,end}^{(i)} + (1 − f) · (1 − a_{j,end}^{(i)})
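The scaled matrix Ã^{(i)} can be precomputed once per primitive. A minimal sketch, assuming (as in the container above) that the last column of A^{(i)} holds the exit probabilities:

import numpy as np

# Compute the scaled transition matrix from A^(i), whose last column
# is assumed to hold the exit probabilities a_{j,end}^(i).
def scaled_transitions(A):
    a_end = A[:, -1]
    A_tilde = A[:, :-1] / (1.0 - a_end)[:, None]
    return A_tilde, a_end  # each row of A_tilde sums to 1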
Symbolic Level: t = 2 · · · T − 1

P(S_t = l | S_{t−1} = i, F_{t−1} = f, U_{t−1} = u) = δ(i, l) if f = 0; g̃_{il} if f = 1 and u = 0; h_l if f = 1 and u = 1,

where G̃ is the scaled version of G, defined analogously to Ã^{(i)}. This can be written:

P(S_t = l | S_{t−1} = i, F_{t−1} = f, U_{t−1} = u) = f · [u · h_l + (1 − u) · g̃_{il}] + (1 − f) · δ(i, l)

The equation for the binary indicator U_t is:

P(U_t = 1 | S_t = i, F_t = f) = 0 if f = 0, and g_{i,end} if f = 1, that is:

P(U_t = u | S_t = i, F_t = f) = u f · g_{i,end} + (1 − u) · (1 − f · g_{i,end})
Final Slice: t = T

To force all segmentations to be consistent with the length of the sequence, we must ensure that all sub-HMMs have reached their final state, imposing U_T = 1 and F_T = 1.
A.2 Forward Algorithm

A.2.1 Forward pass: formalization using the frontier algorithm
In this section, we apply the general frontier algorithm detailed in [Murphy, 2002]. The notations are adapted to our case and respect the conventions of the previous section. The algorithm is based on the definition of a frontier containing nodes. The frontier is initialized at time step t to contain all the nodes of the time slice. To update the frontier, nodes of the following time slice are added and nodes of time slice t are removed, until the frontier only contains the nodes of time slice t + 1. A node can be added if all its parents are in the frontier, and a node can be removed if all its children are already in the frontier.
First, we need to define a variable α as the probability of being in the macro state X_t = {S_t, Q_t, F_t, U_t}:

α_t(j, i, f, u) = P(Q_t = j, S_t = i, F_t = f, U_t = u | y_{1:t})

Let F_{t,0} = α_{t−1}(j, i, f, u), and consider the frontier containing all the nodes in slice t − 1. Since all of its parents are already in the frontier, we can add the node S_t to the frontier:

F_{t,1}(j, l, i, f, u) = P(S_t = l, S_{t−1} = i, Q_{t−1} = j, F_{t−1} = f, U_{t−1} = u | y_{1:t−1})
                      = P(S_t = l | S_{t−1} = i, F_{t−1} = f, U_{t−1} = u) · F_{t,0}(j, i, f, u)
As all the nodes that depend on S_{t−1} and U_{t−1} are in the frontier, we can marginalize over these variables:

F_{t,2}(j, l, f) = Σ_i Σ_u F_{t,1}(j, l, i, f, u)
We can now add Q_t:

F_{t,3}(k, j, l, f) = P(Q_t = k, S_t = l, Q_{t−1} = j, F_{t−1} = f | y_{1:t−1})
                   = P(Q_t = k | Q_{t−1} = j, F_{t−1} = f, S_t = l) · F_{t,2}(j, l, f)

then remove Q_{t−1} and F_{t−1}:

F_{t,4}(k, l) = P(Q_t = k, S_t = l | y_{1:t−1}) = Σ_j Σ_f F_{t,3}(k, j, l, f)

In the same way, we can add the nodes F_t and U_t:

F_{t,5}(k, l, f) = P(F_t = f | Q_t = k, S_t = l) · F_{t,4}(k, l)
F_{t,6}(k, l, f, u) = P(U_t = u | S_t = l, F_t = f) · F_{t,5}(k, l, f)
Hence we have computed:

F_{t,6}(k, l, f, u) = P(Q_t = k, S_t = l, F_t = f, U_t = u | y_{1:t−1})

We can finally update the forward variable:

α_t(k, l, f, u) = c_t · P(y_t | Q_t = k, S_t = l) · F_{t,6}(k, l, f, u)

with c_t a scaling coefficient, chosen so that α_t sums to one:

c_t = 1 / Σ_{k,l,f,u} P(y_t | Q_t = k, S_t = l) · F_{t,6}(k, l, f, u)
The complete recurrence relation can be deduced, writing f_0 and u_0 for the values of F_{t−1} and U_{t−1}:

α_t(k, l, f, u) = c_t · b_k^{(l)}(y_t) · [u f · g_{l,end} + (1 − u) · (1 − f · g_{l,end})] · [f · a_{k,end}^{(l)} + (1 − f) · (1 − a_{k,end}^{(l)})]
  · Σ_j Σ_{f_0} [(1 − f_0) · ã_{jk}^{(l)} + f_0 · π_k^{(l)}] · Σ_i Σ_{u_0} [f_0 · (u_0 · h_l + (1 − u_0) · g̃_{il}) + (1 − f_0) · δ(i, l)] · α_{t−1}(j, i, f_0, u_0)

A.2.2 Reduction
Since two nodes of the model are binary, simplifications can be achieved. In particular, the value of U_t is conditioned on F_t: the symbolic level can finish – enter its exit state – only if the production level has already finished. As a consequence, only three values are possible for the pair {F_t, U_t}. Defining E_t = F_t + U_t ∈ {0, 1, 2}, that is:

E_t = 0 if F_t = 0; E_t = 1 if F_t = 1 and U_t = 0; E_t = 2 if F_t = 1 and U_t = 1,

we can propose a new definition of the forward variable:

α_t^e(j, i) = P(Q_t = j, S_t = i, E_t = e | y_{1:t})

Separating the three possible cases yields important simplifications, which are included in the algorithm of the following section.
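In code, this reduction is a one-line encoding with a validity check (a sketch):

# The pair (F_t, U_t) collapses to E_t = F_t + U_t: the combination
# (F, U) = (0, 1) never occurs, since U_t = 1 requires F_t = 1.
def encode_exit(F, U):
    assert not (U == 1 and F == 0), "U = 1 requires F = 1"
    return F + U  # 0: no exit, 1: primitive exit, 2: sequence exit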
A.2.3 Algorithm
Initialization: t = 1

α_1^2(k, l) = 0
α_1^1(k, l) = 0
α_1^0(k, l) = h_l · π_k^{(l)} · b_k^{(l)}(y_1)

Propagation: t = 2 ... T-1

Compute an intermediate quantity:

V_f(k, l) = π_k^{(l)} · [ Σ_i g̃_{il} · Σ_j α_{t−1}^1(j, i) + h_l · Σ_j Σ_i α_{t−1}^2(j, i) ] + Σ_j ã_{jk}^{(l)} · α_{t−1}^0(j, l)

Then update the forward variable:

α_t^2(k, l) = b_k^{(l)}(y_t) · g_{l,end} · a_{k,end}^{(l)} · V_f(k, l)
α_t^1(k, l) = b_k^{(l)}(y_t) · (1 − g_{l,end}) · a_{k,end}^{(l)} · V_f(k, l)
α_t^0(k, l) = b_k^{(l)}(y_t) · (1 − a_{k,end}^{(l)}) · V_f(k, l)

Scale the forward variable:

c_t = 1 / Σ_{k,l,e} α_t^e(k, l)
α_t^e(k, l) ← c_t · α_t^e(k, l)

Termination: t = T

α_T^2(k, l) = c_T · b_k^{(l)}(y_T) · g_{l,end} · a_{k,end}^{(l)} · V_f(k, l)
α_T^1(k, l) = 0
α_T^0(k, l) = 0

Segmentation Technique:

The most probable symbolic state at each time step can be computed by:

{Q_t^*, S_t^*, E_t^*} = argmax_{k,l,e} α_t^e(k, l)
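As an illustration, the propagation step can be transcribed directly in Python. This is a sketch under simplifying assumptions: all primitives are padded to the same number N of production states so that the α variables fit in dense (N, M) arrays, the parameter layout follows the container sketched in section A.1.1, and b denotes the matrix of emission likelihoods b_k^{(l)}(y_t):

import numpy as np

# One propagation step of the reduced forward algorithm (sketch).
# Shapes: alpha0, alpha1, alpha2, b, pi, a_end : (N, M);
# A_tilde : (M, N, N); h, g_end : (M,); G_tilde : (M, M).
def forward_step(alpha0, alpha1, alpha2, b, pi, A_tilde, a_end,
                 h, G_tilde, g_end):
    # V_f(k, l): the three ways of reaching production state k of symbol l.
    from_sym = alpha1.sum(axis=0) @ G_tilde  # sum_i g~_il sum_j alpha1(j, i)
    from_top = h * alpha2.sum()              # h_l sum_{j,i} alpha2(j, i)
    within = np.stack([A_tilde[l].T @ alpha0[:, l]  # sum_j a~_jk alpha0(j, l)
                       for l in range(alpha0.shape[1])], axis=1)
    V = pi * (from_sym + from_top)[None, :] + within
    # Split the mass according to the exit indicators.
    new2 = b * g_end[None, :] * a_end * V
    new1 = b * (1.0 - g_end)[None, :] * a_end * V
    new0 = b * (1.0 - a_end) * V
    c = 1.0 / (new0.sum() + new1.sum() + new2.sum())  # scaling coefficient
    return c * new0, c * new1, c * new2, c

Each call touches every pair (k, l) once, so decoding a sequence of length T costs T such steps.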
A.3 Fixed-Lag Smoothing
Fixed-Lag Smoothing is the process of estimating a state of the past given the evidence up to the current time. Defining X_t = {Q_t, S_t, F_t, U_t} the set of nodes at time t, the smoothing operation amounts to estimating P(X_{t−L} | y_{1:t}), where L is a constant called the lag. The basic idea is to add a backward pass to the simple forward update of the previous section. At each time step, the forward variable is updated, and a backward operation is repeated from time t to t − L. We must then introduce a backward variable defined by:

β_{t−τ}^e(j, i) = P(Q_{t−τ} = j, S_{t−τ} = i, E_{t−τ} = e | y_{t−τ+1:t})
The update of the backward variable derives from the backward pass of the frontier algorithm. As the process is analogous to the forward update of the previous section, we do not give the details and directly introduce the simplified backward pass:

Initialization: τ = −1

β_{t+1}^e(j, i) = 1

Propagation: τ = 0 · · · L

Define an intermediate variable:

F_B(k, l) = b_k^{(l)}(y_{t−τ}) · { (1 − a_{k,end}^{(l)}) · β_{t−τ+1}^0(k, l) + a_{k,end}^{(l)} · [ (1 − g_{l,end}) · β_{t−τ+1}^1(k, l) + g_{l,end} · β_{t−τ+1}^2(k, l) ] }

Update the backward variable:

β_{t−τ}^0(j, i) = Σ_l δ(i, l) · Σ_k ã_{jk}^{(l)} · F_B(k, l) = Σ_k ã_{jk}^{(i)} · F_B(k, i)
β_{t−τ}^1(j, i) = Σ_l g̃_{il} · Σ_k π_k^{(l)} · F_B(k, l)
β_{t−τ}^2(j, i) = Σ_l h_l · Σ_k π_k^{(l)} · F_B(k, l)
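Mirroring the forward sketch of the previous section, one backward step might read as follows (same simplifying dense layout; here b holds the emission likelihoods b_k^{(l)}(y_{t−τ})):

import numpy as np

# One step of the simplified backward pass (sketch; production states
# in rows, symbols in columns, as in the forward sketch).
def backward_step(beta0, beta1, beta2, b, pi, A_tilde, a_end,
                  h, G_tilde, g_end):
    # F_B(k, l): mass flowing back through state k of symbol l.
    FB = b * ((1.0 - a_end) * beta0
              + a_end * ((1.0 - g_end)[None, :] * beta1
                         + g_end[None, :] * beta2))
    piFB = (pi * FB).sum(axis=0)            # sum_k pi_k^(l) F_B(k, l)
    new0 = np.stack([A_tilde[i] @ FB[:, i]  # sum_k a~_jk^(i) F_B(k, i)
                     for i in range(FB.shape[1])], axis=1)
    new1 = np.tile(G_tilde @ piFB, (FB.shape[0], 1))  # sum_l g~_il (pi FB)_l
    new2 = np.full(FB.shape, (h * piFB).sum())        # sum_l h_l (pi FB)_l
    return new0, new1, new2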
Segmentation technique

At each time step, a forward update is computed, followed by a backward pass. This allows deriving:

γ_{t−L}^e(j, i) = α_{t−L}^e(j, i) · β_{t−L}^e(j, i) = P(Q_{t−L} = j, S_{t−L} = i, E_{t−L} = e | y_{1:t})
Finally, the Fixed-Lag smoothing algorithm can be summarized in pseudo-code as:

t = 1:
alpha[1] = init_forw(y[1])
FOR t = 2:∞
    alpha[t] = forward(y[t], alpha[t-1])
    IF t >= L
        beta = 1
        FOR k = 0:L DO
            beta = backward(y[t-k], beta)
        END FOR
        gamma[t-L] = alpha[t-L] .* beta
        state[t-L] = argmax(gamma[t-L])
    END IF
END FOR
A.4 Viterbi Algorithm
The algorithm for the forward Viterbi procedure is similar to the forward algorithm, replacing sums by maximizations. Let X_{1:t} = {Q_{1:t}, S_{1:t}, F_{1:t}, U_{1:t}}, and define the dynamic programming variable:

δ_t(j, i, f, u) = max_{X_{1:t−1}} P(Q_t = j, Q_{1:t−1}, S_t = i, S_{1:t−1}, F_t = f, F_{1:t−1}, U_t = u, U_{1:t−1}, y_{1:t})
The variable is updated at each time step by:

δ_t(k, l, f, u) = b_k^{(l)}(y_t) · [u f · g_{l,end} + (1 − u) · (1 − f · g_{l,end})] · [f · a_{k,end}^{(l)} + (1 − f) · (1 − a_{k,end}^{(l)})]
  · max_{j,f_0} { [(1 − f_0) · ã_{jk}^{(l)} + f_0 · π_k^{(l)}] · max_{i,u_0} { [f_0 · (u_0 · h_l + (1 − u_0) · g̃_{il}) + (1 − f_0) · δ(i, l)] · δ_{t−1}(j, i, f_0, u_0) } }
A.4.1 Algorithm: Viterbi Decoding
Initialization: t = 1

δ_1^2(k, l) = 0
δ_1^1(k, l) = 0
δ_1^0(k, l) = h_l · π_k^{(l)} · b_k^{(l)}(y_1)
Ψ_1(k, l) = {0, 0, 0}
Propagation: t = 2 ... T-1

For all k, l, j, i, e, compute V_0^e:

V_0^2(k, l, j, i) = π_k^{(l)} · h_l · δ_{t−1}^2(j, i)
V_0^1(k, l, j, i) = π_k^{(l)} · g̃_{il} · δ_{t−1}^1(j, i)
V_0^0(k, l, j, i) = ã_{jk}^{(l)} · δ(i, l) · δ_{t−1}^0(j, i)

Maximize over the variables at the previous time step:

V_1(k, l) = max_{j,i,e} V_0^e(k, l, j, i)
Ψ_t(k, l) = argmax_{j,i,e} V_0^e(k, l, j, i)

Then update the dynamic programming variable, for all k, l:

δ_t^2(k, l) = b_k^{(l)}(y_t) · g_{l,end} · a_{k,end}^{(l)} · V_1(k, l)
δ_t^1(k, l) = b_k^{(l)}(y_t) · (1 − g_{l,end}) · a_{k,end}^{(l)} · V_1(k, l)
δ_t^0(k, l) = b_k^{(l)}(y_t) · (1 − a_{k,end}^{(l)}) · V_1(k, l)
Termination: t = T

δ_T^2(k, l) = b_k^{(l)}(y_T) · g_{l,end} · a_{k,end}^{(l)} · V_1(k, l)
δ_T^1(k, l) = 0
δ_T^0(k, l) = 0
Backtracking

Define the state sequence X_t^* = {Q_t^*, S_t^*, F_t^*, U_t^*}:

P^* = max_{k,l} δ_T^2(k, l)
X_T^* = argmax_{k,l} δ_T^2(k, l)

Trace back to find the optimal state sequence, for t = T − 1 · · · 1:

X_t^* = Ψ_{t+1}(X_{t+1}^*)
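A literal transcription of this backtracking, assuming the Ψ tables are stored as dictionaries mapping a state triple at time t to its best predecessor at time t − 1 (a sketch):

import numpy as np

# delta_T2 is the (N, M) table of delta_T^2(k, l); psi[t][(k, l, e)]
# is the argmax triple recorded during propagation.
def backtrack(delta_T2, psi, T):
    k, l = np.unravel_index(np.argmax(delta_T2), delta_T2.shape)
    path = [(int(k), int(l), 2)]       # the termination forces E_T = 2
    for t in range(T, 1, -1):          # t = T ... 2
        path.append(psi[t][path[-1]])  # best predecessor at time t - 1
    return list(reversed(path))        # (production, symbolic, exit) per step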
Finally, the optimal state sequence is obtained, giving at each time step the symbolic state – the most probable primitive gesture – and the production state, which permits an analysis of the timing of the gesture's execution in comparison with the original template.