Available online at www.sciencedirect.com
Cognitive Systems Research 9 (2008) 237–251
www.elsevier.com/locate/cogsys
Using hidden Markov model to uncover processing states
from eye movements in information search tasks
Action editor: Rajiv Khosla
Jaana Simola a,*, Jarkko Salojärvi b, Ilpo Kojo c
a Humanities Laboratory, Centre for Language and Literature, Lund University, S-22100 Lund, Sweden
b Adaptive Informatics Research Centre, Department of Information and Computer Science, Helsinki University of Technology, P.O. Box 5400, FI-02015 TKK, Finland
c Center for Knowledge and Innovation Research, Helsinki School of Economics, P.O. Box 1210, FI-00101 Helsinki, Finland
Received 2 September 2007; accepted 19 January 2008
Available online 14 April 2008
Abstract
We study how processing states alternate during information search tasks. Inference is carried out with a discriminative hidden Markov model (dHMM) learned from eye movement data, measured in an experiment consisting of three task types: (i) simple word search, (ii) finding a sentence that answers a question and (iii) choosing the subjectively most interesting title from a list of ten titles. The results show that eye movements contain the information necessary for determining the task type. After training, the dHMM predicted the task for test data with 60.2% accuracy (pure chance 33.3%). Word search and subjective interest conditions were easier to predict than the question–answer condition. The dHMM that best fitted our data segmented each task type into three hidden states. The three processing states were identified by comparing the parameters of the dHMM states to the literature on eye movement research. A scanning type of eye behavior was observed in the beginning of the tasks. Next, participants tended to shift to states reflecting reading-type eye movements, and finally they ended the tasks in states which we termed the decision states.
© 2008 Elsevier B.V. All rights reserved.
Keywords: Eye movements; Computational models; Hidden Markov model; Information search; Scanning; Reading; Decision process
1. Introduction
Eye movements are commonly used as indicators of
online reading processes because of their sensitivity to
word characteristics. Empirical evidence supports this
eye–mind link assumption: longer eye fixations have been
observed together with misspelled words, less common
words, or words that are unpredictable from their context
(Rayner, 1998; Rayner & Pollatsek, 1989). However, reading studies typically concentrate on microprocesses of reading, such as studying how word features determine when
and where the eyes move. Moreover, their analysis of eye movement data is often based on linear models that fail to consider eye movements as time series data and therefore do not account for variations within a task.

* Corresponding author.
E-mail addresses: jaana.simola@helsinki.fi (J. Simola), jarkko.salojarvi@tkk.fi (J. Salojärvi), kojo@hse.fi (I. Kojo).
1389-0417/$ - see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.cogsys.2008.01.002
Our contribution is to analyze the whole sequence of fixations and saccadic eye movements to gain an insight into
how processing alternates during the reading task. In other
words, we assume the reverse inference approach, and try
to infer the hidden cognitive states from an observable
eye movement behavior (see Poldrack (2006) for a discussion on the possible benefits and pitfalls of the approach
within neuroimaging research). The relationship between
eye movements and cognitive states is modeled with a discriminative hidden Markov model (dHMM). In our application, we use the dHMM to map the changes in statistical
patterns of eye movements to changes of the hidden states
of the model as participants proceed in information search
tasks. A hypothesis on the cognitive states corresponding
to the hidden states can then be made by comparing the
parameters of the hidden states (for example fixation durations and saccade lengths) to literature on eye movement
research where the cognitive state is known.
The states discovered by our model suggest that processing alternates over the course of the tasks, even when
the abstractness of the searched topics varies. The results
can be used in practical applications. Earlier, Hyrskykari,
Majaranta, Aaltonen, and Räihä (2000, 2003) have used
the fact that fixations are longer during processing difficulties in order to develop an interactive dictionary that gives
translation aid when it detects reading difficulties. However, detecting changes in processing states makes it possible to develop more advanced applications. For example, a
proactive information retrieval application can search for
more documents on a specific topic after detecting eye
movements that indicate careful processing when a person
is reading about that topic (see Puolamäki, Salojärvi, Savia, Simola, & Kaski (2005) for a feasibility study). The goal
of the present article is to show that prerequisites for implementing such techniques exist.
Previously, Carver (1990) has argued that readers use
different processes in order to better accomplish their goals.
They change their ongoing process either by instructions or
by the difficulty of the text. Carver distinguishes five basic
processes based on variations in reading rates, that is, the number of words covered per unit of reading time (i.e. words per minute, wpm). The suggested processes are called scanning,
skimming, ‘rauding’, learning and memorizing. Scanning is
performed at 600 wpm and is used while the reader is
searching for a particular word in a text. Another rapid
and selective process is skimming (450 wpm), which is used
in situations where the reader tries to get an overview of the
content without reading through the entire text. ‘Rauding’
(300 wpm) corresponds to normal reading in which the
reader is looking at each consecutive word of a text to comprehend the content. Learning is slow (200 wpm) and is
used for knowledge acquisition. Memorizing is the slowest
process (138 wpm) and involves continuous checks to
determine whether the ideas encountered might be remembered later.
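As an illustration, Carver's five processes and their nominal rates can be encoded in a few lines. The nearest-rate assignment rule below is our own simplifying assumption for illustration, not Carver's criterion:

```python
# Carver's (1990) five basic reading processes and their nominal rates
# in words per minute, as summarized in the text above.
CARVER_PROCESSES = {
    "scanning": 600,
    "skimming": 450,
    "rauding": 300,
    "learning": 200,
    "memorizing": 138,
}

def classify_reading_rate(wpm: float) -> str:
    """Return the Carver process whose nominal rate is closest to wpm
    (illustrative nearest-rate rule, not Carver's own decision criterion)."""
    return min(CARVER_PROCESSES, key=lambda p: abs(CARVER_PROCESSES[p] - wpm))
```

For example, an observed rate of about 310 wpm would fall nearest to the 'rauding' process under this rule.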
According to Carver, the processes represent different
cognitive processes and he suggests that readers shift
between them, in a manner similar to drivers shifting gears.
He also suggests that skilled readers vary their reading processes more than poor readers. The eye movement results
indicate that when participants switched up, for example,
from the ‘rauding’ to the skimming process, the mean fixation durations decreased together with the mean number of
fixations and regressions (i.e. fixations back to previously
read text). The length of forward saccades also increased.
On the other hand, switching down resulted in more regressions, longer fixation durations, and shorter saccade
lengths.
Carver suggests that the primary factor influencing reading rate is the selected reading process. Minor within-process variations result from the difficulty of the text and
individual differences, such as age, practice or cognitive
speed. Previous research indicates also between-individual
differences in reading strategies (Hyönä, Lorch, & Kaakinen, 2002).
1.1. Models of eye movement control during reading
Computational models on eye movement control during
reading have been successful in explaining how various perceptual, cognitive and motor processes determine when and
where saccades are initiated during reading. The current
controversy is whether attention in reading is allocated
serially to one word at a time, as suggested by the E-Z
Reader model (Pollatsek, Reichle, & Rayner, 2006;
Reichle, Pollatsek, & Rayner, 2006), or whether attention
is spatially distributed so that several words are processed
at the same time. This parallel hypothesis is supported
for example by the SWIFT (Richter, Engbert, & Kliegl,
2006), the Glenmore (Reilly & Radach, 2006) and the
Competition/Interaction (Yang, 2006) models. (For a
review of the computational models of reading, see: Cognitive Systems Research, 2006, 7, pp. 1–96.) However, these
models are limited in their ability to consider variations
in higher level reading processes.
The models mentioned above construct very specific
hypotheses on the reading process and thus use tailored
parameter values developed in accordance with what is previously known about human vision, such as the size of the
visual span and variability in saccade and fixation metrics,
as well as word recognition processes like the time for lexical access. Instead of fixing model parameters manually,
the model parameters can also be learned from the data.
The general idea is that information required for constructing a model is learned from the empirical data, for example
the best model structure or the best parameter values. To
avoid overfitting, the data is split into two subsets: training
and testing data sets (see e.g. Hastie, Tibshirani, & Friedman (2001)). The best model and its parameters are
selected using the training data, and then its generalization
capability (i.e. how well the model fits new data) is tested
using the test data. Feng (2006) has applied a similar approach to modeling age-related differences in reading eye movements.
1.2. Purpose of the study
Our goal is to investigate how processing changes as the
participants proceed in three types of information search
tasks: simple word search, a question–answer task and finding the subjectively most interesting topic. For this purpose, we
combine experimentation with data-driven modeling using
a discriminative hidden Markov model (dHMM). As a
time series model it is well suited for our purposes because
it provides a more comprehensive description of the eye
movement pattern than the basic summary statistics such
as average fixation duration. To capture the relationship
between language processing and eye movements, we
model the observed time series of fixations and saccades
by assuming latent states that are supposed to be indicators
of the cognitive system that switches between different
states of processing. We assume that in each processing
state the statistical properties of the eye movement patterns
are different. The best model topology, that is, the number
of hidden states, is found by comparing several possible
model topologies with cross-validation, and choosing the
one that best explains unseen data. We also compare the
parameter values of the model to what is previously known
about reading and performance in other cognitive tasks.
This information is used to make inference about processing during the tasks.
Our approach is not committed to any particular processing theory. Therefore many of the theoretical issues discussed in eye movement models of reading (Pollatsek et al.,
2006; Reichle et al., 2006; Reilly & Radach, 2006; Richter
et al., 2006), such as the parafoveal preview and parafoveal-on-foveal effects, do not concern our model. Instead,
the dHMM applied here describes how eye movement
behavior varies during a single trial, and the states uncovered by the dHMM can be seen as hypotheses about the
ongoing processes which are based on the statistical regularities of the eye movement data.
2. Data collection
2.1. Participants
Eye movement data were collected from ten volunteers (6 female). The age range was 23–29 years, mean age 25.7 years (SD = 1.9). They had normal or corrected-to-normal vision and all of them were native speakers of Finnish. Participants signed a written consent form before the experiment.
2.2. Procedure
Our tasks represented single online information search
episodes where the user is inspecting listings returned by
a search engine in order to find a topic of her interest.
The task types were selected to fit the possible practical
implementation, a proactive information retrieval application. The task of the participants was to find a target from
a list of ten titles. The level of complexity of the searched
topics was varied by having three different task types:
1. Word search (W): The task is to find a word from the
list.
2. Question–answer (A): A question is presented and the
task is to find an answer to the question from the list.
3. True interest (I): The participants are instructed to
search for the most interesting title in the list.
The trial structure was similar across the tasks (Fig. 1).
First, the assignment was presented: The participants saw a
sentence instructing them to find either a word (W), an
answer to a question (A), or the most interesting sentence
(I), according to the condition. After the assignment, a list
of sentences was presented, and the participants were
instructed to view the list until they had found the relevant
line. Eye movements were recorded during this period.
After finding the relevant line, they pressed ‘enter’, and
were shown the same sentences with line numbers. They
then typed the number corresponding to the line they had
chosen. Before the experiment, participants read the
instructions and practiced each of the tasks.
Each participant conducted a total of 150 assignments.
The experiment was divided into 10 blocks, with 15 assignments in each block. Each task type was presented five
times within a block. The presentation order of the blocks
and the assignments within them was randomized.
2.3. Stimulus material
The text material consisted of 500 online newspaper
titles, revised to grammatical sentences. The maximum
length of the sentences was 80 characters. On average, there
were 5.8 words per sentence, and the mean word length was
9.9 characters. The sentences were divided into 50 lists of 10
sentences. To control for the effects of previous topic
knowledge, the sentences were selected to represent three
general topics: Finnish homeland news (20 trials), foreign
news (20 trials) and business and finance news (10 trials).
The texts were written in Finnish, and a 30-point Arial font
was used. The average character height was 0.9 degrees and
the average character width was 0.5 degrees from the viewing distance of about 60 cm.
For the word search condition, fifty words were chosen as
target words. The positions of the targets in sentences were
balanced, i.e., the words appeared equally often as the first,
second, third or fourth word of the sentences. For the question–answer condition, we prepared 50 questions, which
were validated with a pilot test including five participants.
We modified the questions and sentences, and tested them
again with three new participants. Their answers agreed in
74% of the trials. The actual experiments were conducted
with the modified questions and sentences. In word search
and question–answer conditions, the locations of the correct
lines were balanced so that the answers appeared equally
often in all 10 sentence-lines. For the true interest condition,
no additional stimulus preparations were needed.
To emphasize the differences between tasks and to minimize stimulus-driven factors on processing, the same stimuli were presented in all three task types. To control for the possible effects of repetition, a set of analyses was carried out with repeated measures ANOVAs. We found no significant effect of presenting the same stimulus three times during the experiment on the number of fixations (F(2,18) = 2.86, n.s.), average fixation durations (F(2,18) = .18, n.s.) or saccade lengths (F(2,18) = 1.00, n.s.) in an assignment. Therefore we did not have to consider the effect of stimulus repetition in our modeling work.
What is the importance of religion in Pakistan?

1. A lutheran church was built to Petroskoi with collected funds
2. The resignation of an important politician elicited controversial views
3. The oldest person in the world, Kamato Hongo, died at the age of 116
4. The Pakistani security troops attacked Al-Qaida at the border
5. In Pakistan, Islam affects all walks of life
6. Pakistan informed about a successful missile test
7. An attack to a refugee camp left 1500 refugees homeless
8. Pakistan reported another missile test
9. The death of a priest elicited fear in Pakistan
10. The fire fighting in California will take at least a week

Fig. 1. An example stimulus presenting a question–answer task. The sentences are translated. The solid time line represents the time slot when the participants were instructed to find the relevant line and their eye movements were recorded. Participants proceeded in a self-paced manner, and the next trial began immediately after they typed in the line number corresponding to the selected line.
2.4. Apparatus
The stimuli were presented on a 17 in. TFT display with
a screen resolution of 1280 × 1024 pixels. The display was
located on a table at the eye level of the participants, at
the distance of approximately 60 cm. In order to maintain
the life-likeness of our setup, no chin or forehead rests were
used for stabilizing the heads of the participants.
Eye movements were recorded by a Tobii 1750 remote
eye tracking system with a spatial accuracy of 0.5°. The
screen coordinates of both eyes were collected from each
participant at 50 Hz sampling rate. The eye tracking system
was calibrated between the experimental blocks using a set
of 16 calibration points shown one at a time.
2.5. Preprocessing

Fixations were computed from the data using a window-based algorithm by Tobii. Visualizations of measured gaze coordinates were used to choose fixation window parameters for further analysis. Based on the visual inspections, we selected three candidate parameter setups: (i) a 20 pixel window with a minimum fixation duration of 40 ms, (ii) a 40 pixel window with an 80 ms fixation duration, and (iii) a 20 pixel window with a 100 ms fixation duration. Blinks were left out from the raw data by the Tobii software; otherwise no editing of the eye movement data was carried out.

The best fixation window parameters were determined using the logistic regression model (see Sections 3.1 and 3.4.1) and a 40-fold cross-validation (see Section 3.5) of the data. The procedure produced 40 perplexity values for left-out data with each of the fixation window parameter combinations. For the Tobii 1750 eye tracker, the fixation window parameters that resulted in the best classification accuracy (p < .05, Wilcoxon signed rank test) of the left-out data sets were the 40 pixel window (corresponding to approx. 3.2 letter spaces) with the minimum fixation duration of 80 ms.

3. Modeling

The total data consisted of 1456 eye movement trajectories, that is, fixation–saccade sequences measured from each assignment. Forty-four trials were missing because no eye movements were obtained, for example due to double key pressings of the participants. The total data were randomly split into a training set of 971 trajectories and a test set of 485 trajectories.

Throughout the analysis, we used a data-driven approach: the data were used for making decisions on different modeling questions. The best model topology was selected by using cross-validation with the training data. Parameters of the best model were then learned using the full training data, and the generalization capability, i.e., how well the model fits unseen data, was tested with the test set. The reason for using test data is that with increasing model complexity, that is, with an increasing number of parameters, the model will more accurately fit the training data. At some point this turns into overfitting, where increasing the model complexity will decrease the model performance on unseen data while the performance on the training data set continues to increase.
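A random split of this kind can be sketched as follows; the function name, the fixed seed and the 2/3 training fraction (which reproduces the 971/485 split of the 1456 trajectories) are illustrative assumptions:

```python
import random

def train_test_split(trajectories, train_fraction=2 / 3, seed=0):
    """Randomly split a list of eye movement trajectories into a training
    set and a test set (sketch; fraction approximates the paper's split)."""
    rng = random.Random(seed)
    idx = list(range(len(trajectories)))
    rng.shuffle(idx)                                  # random permutation
    n_train = round(len(trajectories) * train_fraction)
    train = [trajectories[i] for i in idx[:n_train]]  # fit parameters here
    test = [trajectories[i] for i in idx[n_train:]]   # held out for evaluation
    return train, test
```

With 1456 trajectories this yields 971 training and 485 test sequences, matching the counts above.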
3.1. Logistic regression
In our experiment the ground truth for a given eye
movement trajectory, that is, the information about the
task type, was always available. Suitable models for such
data belong to the general category of supervised or discriminative models. The simplest discriminative model is
logistic regression (see Hastie et al., 2001), which predicts
the probability of class (task type), conditional on covariates (the associated data) and parameters. The covariates
are assumed to be given, that is, no uncertainty is associated with their values. The model is optimized by maximizing the conditional likelihood. However, logistic regression
cannot model time series data. A common approach is to
compute some form of statistics from the time series and
then use these as covariates.
We used logistic regression as a simple classifier to
obtain baseline results for the HMM, and for selecting
the best fixation window parameters.
3.2. Hidden Markov models

To analyze the fixation–saccade sequence as a time series we used a hidden Markov model (HMM), which is commonly used for analyzing sequential data such as speech (see e.g. Rabiner (1989) for an introduction to HMMs). HMMs belong to the general category of generative joint density models, which attempt to describe the full process of how the data is created; that is, they do not use covariates. Whereas fully discriminative models concentrate only on separating different classes, and thus provide no physical interpretation of the parameter values, the parameters of a joint density model can be associated with the data, giving insight into the underlying process, assuming that the model describes the data accurately enough. HMMs are optimized by maximizing the log-likelihood, $\log p(C, X \mid \Theta)$, of the data $C \cup X$, given the model and its parameters $\Theta$. Here $X$ is the observation sequence (the eye movement trajectory) associated with class $C$, the task type.

HMMs are applied in cases where the statistical properties of the signal change over time. The model explains these changes by a switch of a hidden (unobservable, latent) state $s$ within the model. The total number $S$ of hidden states can be learned from the data, for example by cross-validation. Each of the states has an associated observation distribution $p(x \mid \theta_s)$, from which the data is generated. The parameters $\theta_s$ can be different for each state (e.g. Gaussian distributions having different means and standard deviations). The changes in the distributions of the observations are thus associated with transitions between hidden states. The transitions are probabilistic, and defined by a transition matrix $B$. We assume a first-order Markov property for the transitions, that is, we assume probabilities of the form $p(s(t+1) \mid s(t))$; the transition to the next state $s(t+1)$ depends only on the current state $s(t)$. Pieters, Rosbergen, and Wedel (1999) showed that eye movements follow this property. Additionally, this restricts the number of parameters in the model, making modeling computationally more efficient.

A full definition of HMMs requires one more set of parameters, $p(s), s = 1, \ldots, S$, which is the probability of initiating the time sequence at state $s$. An example topology of an HMM is illustrated in Fig. 2.

For a time series $x_{1,\ldots,T}$ of observations the full likelihood of the HMM is then

$$p(x_{1,\ldots,T} \mid \Theta) = \sum_{\mathcal{S}} p(s(1))\, p(x(1) \mid s(1)) \prod_{t=2}^{T} p(x(t) \mid s(t))\, p(s(t) \mid s(t-1)), \qquad (1)$$

where $\mathcal{S}$ denotes all "paths" through the model, that is, all $S^T$ combinations of hidden states for a sequence of length $T$, and $x(t)$ is the measured observation vector at time $t$.

[Fig. 2 diagram: a state-transition graph with three hidden states per task; the individual transition probabilities are not reproducible in text.]

Fig. 2. The transition probabilities and topology of the discriminative hidden Markov model. Hidden states are denoted by circles, transitions among hidden states by arrows, along with their probabilities. The beginning of the sequence is denoted by $\pi$. The capital letters on the right denote the sections of the HMM that were assigned to each of the tasks (W = word search, A = question–answer, I = true interest); small letters within the hidden states denote the names of the hidden states (s = scanning, r = reading, d = decision).
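Although Eq. (1) sums over all $S^T$ paths, the likelihood can be evaluated in $O(TS^2)$ time with the standard forward recursion. A minimal sketch (our own illustration, not the authors' implementation):

```python
import numpy as np

def hmm_likelihood(pi, B, emission_probs):
    """Forward algorithm: evaluate the likelihood of Eq. (1) in O(T * S^2)
    time instead of enumerating all S^T hidden-state paths.

    pi             : (S,) initial state probabilities p(s(1))
    B              : (S, S) transition matrix, B[i, j] = p(s(t+1)=j | s(t)=i)
    emission_probs : (T, S) precomputed p(x(t) | s) for every observation
    """
    alpha = pi * emission_probs[0]               # joint p(x(1), s(1))
    for t in range(1, len(emission_probs)):
        alpha = (alpha @ B) * emission_probs[t]  # propagate, then absorb x(t)
    return float(alpha.sum())                    # marginalize the last state
```

For long sequences a production implementation would rescale `alpha` (or work in the log domain) to avoid numerical underflow.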
Maximum likelihood parameter values of the HMMs are obtained with the Baum–Welch (BW) algorithm, a special case of the Expectation–Maximization (EM) algorithm, which can be proven to converge to a local optimum. Fast
computation of the most probable path (hidden state
sequence) through the model, given a new data sequence,
is obtained using the Viterbi algorithm.
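A minimal log-domain sketch of the Viterbi recursion, assuming precomputed emission probabilities (this is the textbook algorithm, not the authors' code):

```python
import numpy as np

def viterbi(pi, B, emission_probs):
    """Most probable hidden-state path given the observations.

    pi, B, emission_probs are as in the forward algorithm; returns a list of
    T state indices. Working with logs avoids underflow on long sequences.
    """
    log_pi, log_B = np.log(pi), np.log(B)
    log_em = np.log(emission_probs)
    T, S = log_em.shape
    delta = log_pi + log_em[0]              # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers for the trace-back
    for t in range(1, T):
        scores = delta[:, None] + log_B     # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_em[t]
    path = [int(delta.argmax())]            # best final state
    for t in range(T - 1, 0, -1):           # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```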
Previously, Liechty, Pieters, and Wedel (2003) applied
hidden Markov models to study two states of covert attention, local and global attention. They showed that viewers
were switching between the attention states while they were
exploring print advertisements in magazines. The local
visual attention state was characterized by short saccades,
whereas in the global attention state, longer saccades were
common. In another line of research, Salojärvi, Puolamäki,
and Kaski (2005b) showed that perceived relevance of a
text could be predicted from eye movements in an information search task.
3.3. Discriminative hidden Markov models
A generative model can be converted to a discriminative model by optimizing the conditional likelihood of the
model log pðCjX ; HÞ, obtained from a generative model
via Bayes formula. Compared to a fully discriminative
model (such as logistic regression), the converted model
still has the benefits of a generative model, such as easier
interpretation of model parameters (see Salojärvi, Puolamäki, & Kaski (2005c) for a description of the
differences).
Discriminative training of HMMs is carried out by assigning a set of "correct" hidden states S_c in the model to always correspond to a certain class c, and then maximizing the likelihood of the state sequences that go through the "correct" states for the training data, versus all the other possible state sequences S in the model (Povey, Woodland, & Gales, 2003; Schlüter & Macherey, 1998).
The parameters of a discriminative HMM (dHMM) are
optimized with a discriminative EM (DEM) algorithm,
which is a modification of the original BW algorithm (the
derivation of the algorithm is in Salojärvi, Puolamäki, &
Kaski (2005a)).
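The conditional likelihood that discriminative training maximizes follows from Bayes' formula, $p(c \mid x) = p(x, c) / \sum_{c'} p(x, c')$. A small numerically stable sketch of this quantity (the function name is hypothetical; the DEM algorithm itself is not reproduced here):

```python
import numpy as np

def conditional_log_likelihood(log_joint_per_class, class_idx):
    """log p(c | x) from per-class joint scores log p(x, c') via Bayes' rule.

    log_joint_per_class : array of log p(x, c') for each candidate class c'
    class_idx           : index of the correct class c
    Discriminative (dHMM) training maximizes this quantity over the training
    set, instead of the joint log-likelihood used by Baum-Welch.
    """
    log_joint = np.asarray(log_joint_per_class, dtype=float)
    # log-sum-exp trick: subtract the max before exponentiating
    m = log_joint.max()
    log_evidence = m + np.log(np.exp(log_joint - m).sum())
    return log_joint[class_idx] - log_evidence
```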
3.4. Feature extraction
3.4.1. Features for logistic regression model
The logistic regression was used as a baseline for the HMM. It uses averaged features that can be derived from the fixation–saccade time sequence, that is, it obtains the same information as the HMM. The features were:

(1) Length of the sequence (number of fixations).
(2) Mean fixation duration (in ms).
(3) Standard deviation of fixation duration.
(4) Mean saccade length (in pixels).
(5) Standard deviation of saccade length.
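These five covariates can be computed directly from one fixation–saccade sequence; a sketch with a hypothetical function name:

```python
import statistics

def summary_features(fixation_durations_ms, saccade_lengths_px):
    """Compute the five covariates used by the baseline logistic regression
    from a single trial's fixation-saccade sequence (sketch)."""
    return [
        len(fixation_durations_ms),              # (1) number of fixations
        statistics.mean(fixation_durations_ms),  # (2) mean fixation duration
        statistics.stdev(fixation_durations_ms), # (3) SD of fixation duration
        statistics.mean(saccade_lengths_px),     # (4) mean saccade length
        statistics.stdev(saccade_lengths_px),    # (5) SD of saccade length
    ]
```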
3.4.2. Features for hidden Markov model
For the time series model, four features of each fixation
were computed from the eye movement trajectory, that is,
from the raw fixation–saccade data from each assignment.
The features are listed below with the corresponding modeling distribution (the distributions denoted by $p(x \mid s)$ in Eq. (1)) reported in parentheses. See e.g. Gelman, Carlin,
Stern, and Rubin (2003) for the parametric form of the
distributions.
(1) Logarithm of fixation duration in milliseconds (one-dimensional Gaussian).
(2) Logarithm of outgoing saccade length in pixels (one-dimensional Gaussian).
(3) Outgoing saccade direction (quantized to four different directions) plus a fifth value indicating that the trial had ended (multinomial).
(4) Indicator variable of whether there have been previous fixations on the word which is currently fixated (binomial).
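A per-fixation observation vector along these lines might be assembled as follows (a sketch; the authors' exact encoding may differ in detail):

```python
import math

def fixation_features(duration_ms, saccade_len_px, saccade_dir, refixation):
    """One observation vector for the HMM, following the four features above:
    log duration and log saccade length (modeled as Gaussians), the quantized
    outgoing saccade direction (multinomial), and a refixation indicator
    (binomial). Argument names are hypothetical."""
    return (
        math.log(duration_ms),     # (1) log fixation duration
        math.log(saccade_len_px),  # (2) log outgoing saccade length
        saccade_dir,               # (3) direction code 1-5 (5 = trial ended)
        int(refixation),           # (4) 1 if the fixated word was seen before
    )
```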
In the literature (e.g. Reichle et al. (2006)), a gamma distribution has often been used for modeling fixation durations, because its positively skewed shape resembles the data. There are two alternatives to implement this. In the
first version, the data sequence is indexed by time, and thus
the hidden state sequences are directly mapped into fixation
durations (Liechty et al., 2003), and therefore the probability of staying in state s must follow a gamma distribution.
However, in ordinary HMMs this probability follows an
exponential rather than gamma distribution, and therefore
a semi-hidden Markov model needs to be implemented,
where the transition probabilities depend on the time spent
in the current hidden state. We here applied the second alternative: we constructed an HMM that emitted the fixation durations, changing the time scale of the HMM into fixation counts. Instead of having an HMM that is in state $s$ for the time $t, \ldots, t + \tau$, we now have an HMM that is in state $s$ for fixation $i$, which has the duration $\tau$. We then make a simplifying assumption by modeling the logarithm of fixation durations with a Gaussian. Further work could include
extending this model to a mixture of two log-normal distributions, since this has been found to work well for reading
fixations (Carpenter & McDonald, 2007).
The saccade lengths were measured in pixels and were not converted to more conventional units, such as characters or degrees, during computations, because conversions would have added noise to the data (since the Tobii 1750 allows free head movement). Saccade lengths were computed from the raw 50 Hz gaze data as the distance between the gaze location at the end of the previous fixation and the beginning of the current fixation. The spatial accuracy of the eye tracker was 0.5°, corresponding to approximately 12 pixels.
For saccade quantization, each fixation was first
mapped to the closest word in the preprocessing stage.
The outgoing saccade direction was then encoded with an
indicator variable that can obtain five different values: 1 –
saccade forward on the current line of text, 2 – saccade
upwards from the current line, 3 – saccade backwards on
the current line, 4 – saccade downwards from the current
line and 5 – ending the assignment.
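The five-valued encoding can be sketched as follows, assuming each fixation has already been mapped to a word with a line index and a horizontal position (argument names are hypothetical):

```python
def saccade_direction(prev_line, prev_x, cur_line, cur_x, trial_ended=False):
    """Encode the outgoing saccade with the five-valued indicator described
    above: 1 = forward on the current line, 2 = upwards from the current line,
    3 = backwards on the current line, 4 = downwards, 5 = end of assignment."""
    if trial_ended:
        return 5            # ending the assignment
    if cur_line < prev_line:
        return 2            # saccade upwards from the current line
    if cur_line > prev_line:
        return 4            # saccade downwards from the current line
    return 1 if cur_x > prev_x else 3  # forward vs. backward on the same line
```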
3.5. Model selection
When choosing fixation window parameters or the number of hidden states of the HMM, an n-fold cross-validation with the training data was carried out. In this
procedure, the training set is divided into n non-overlapping subsets, and each of the subsets is in turn left out as
a validation data set. The training is carried out using the
other n − 1 subsets, and then the generalization capability
of the model is tested with the validation set. The procedure is carried out for all alternative modeling configurations. The method produces n paired measures of
goodness of model fit, calculated from validation data,
allowing us to test the out-of-sample performance of the
model configurations. The reason for using cross-validation is to avoid overfitting, i.e., choosing a too complex
model. Alternative methods for model selection include a
computationally much heavier bootstrap method (Efron
& Tibshirani, 1993), or using information theoretic criteria
(Akaike, 1974; Schwartz, 1978). The latter however are not
theoretically justified in case of HMMs, see e.g. Robertson,
Kirshner, and Smyth (2004), and the references therein.
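The n-fold procedure described above can be sketched as follows (an illustrative helper, not the authors' code):

```python
def n_fold_splits(items, n):
    """Split `items` into n non-overlapping folds and yield
    (train, validation) pairs, each fold serving once as the
    validation set."""
    folds = [items[i::n] for i in range(n)]
    for i in range(n):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation
```

Each model configuration is trained on `train` and scored on `validation`, producing the n paired goodness-of-fit measures used below.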
Goodness of the model was measured in two ways: in terms of classification accuracy and perplexity. Classification accuracy is the number of correctly predicted task types divided by the total number of tasks. However, for
relatively small data sets, the classification accuracy is a
noisy measure, since each sample can be assigned to only
one class. A better measure is therefore the perplexity of
the test data set, which measures the confidence in the predictions of the classifier. It is defined as a function of the
average of the log-likelihoods L_i of the N_s test data sequences, denoted formally by

perp = e^{-\frac{1}{N_s}\sum_{i=1}^{N_s} L_i}, \quad L_i = \log p(c_i \mid x^i_{1,\dots,T_i}, \theta), \qquad (2)

where x^i_{1,\dots,T_i} denotes the ith sequence of observations of length T_i, and c_i is the task type of sequence i. N_s is the number of sequences, and θ the model parameters. The best possible perplexity is 1, where the correct task type is predicted with a probability of 1. On the other hand, a perplexity of 3 corresponds to random guessing with a probability of 1/3 for each
of the task types. In our data analysis, the class distribution
was not equal within the training and test sets. This was
mainly due to random split of the data, and in part due to
missing eye movement trajectories. If these are taken into
account, the random perplexity for the test set is 3.01. If perplexity is greater than this the model is doing worse than
random guessing. In the worst case where the classifier gives
a (close to) zero probability for the correct class, the perplexity is restricted to a maximum value of 10^22.
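Eq. (2), together with the 10^22 cap mentioned above, can be computed directly from the per-sequence log-likelihoods. A minimal sketch (our function names, not the authors' code):

```python
import math

MAX_PERPLEXITY = 1e22  # cap used when the correct class gets ~zero probability

def perplexity(class_log_likelihoods):
    """Perplexity of Eq. (2): exp of the negative mean of the per-sequence
    log-likelihoods L_i = log p(c_i | x_i, theta)."""
    n = len(class_log_likelihoods)
    avg = sum(class_log_likelihoods) / n
    if -avg >= math.log(MAX_PERPLEXITY):   # avoid overflow; apply the cap
        return MAX_PERPLEXITY
    return math.exp(-avg)
```

A perfect classifier (all L_i = 0) gives perplexity 1; uniform guessing over three classes (L_i = log 1/3) gives 3.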
4. Results
4.1. Logistic regression
The results of the logistic regression are reported in
Table 1. The perplexity of the test set was 2.42 with a classification accuracy of 59.8%.
4.2. Discriminative hidden Markov model
All modeling with HMMs was carried out in a data-driven fashion. The topology of the HMM was fully connected,
that is, transitions between all states were possible. All
parameter values were learned from data by maximizing
the conditional likelihood. The number of hidden states
in the dHMM was determined with a 6-fold cross-validation. The different hidden state configurations that were
tried out were S ∈ {2-2-2, 2-2-3, 2-3-3, 3-3-3, 3-3-4, 3-4-4, 4-4-4}, corresponding to the number of hidden states used
for modeling word search, question–answer and true interest conditions, respectively. The scheme for increasing the
number of hidden states in the HMM was decided after
observing that the eye movement trajectories were usually
longest in the true interest condition and then in the question–answer condition.
The number of hidden states was decided as in Robertson et al. (2004) by comparing the mean of perplexities of
validation sets. The decrease of out-of-sample perplexities
started to level off when the number of hidden states was
nine, suggesting that this is the optimal number of hidden
states. Since the variance of conditional maximum likelihood estimates is larger than that of maximum likelihood estimates (Nádas, 1983), we additionally compared the
paired perplexity values for eight, nine, and ten hidden
state configurations with a Wilcoxon signed rank test.
The difference between the 8-state and 9-state models was
statistically significant (p < .05), whereas the difference
between 9-state and 10-state models was not. Since the data
does not support the preference of a 10-state model over a
9-state model, the less complex model should be preferred.
The model with nine hidden states is obtained also when
Table 1
Confusion matrix from the test data, showing the number of assignments classified by the logistic regression into the three task types (columns) versus their true task type (rows)

                   Prediction
                   W (66.2%)    A (45.3%)    I (60.0%)
W (77.2%)          139          23           18
A (28.3%)          55           43           54
I (70.6%)          16           29           108

The diagonal contains the number of correctly predicted assignments. The percentages (in parentheses) denote row- and column-wise classification accuracies. The row-wise accuracy shows the percentage of correctly predicted assignments for the given task type; the column-wise accuracy shows the percentage of correctly predicted task types, given the prediction.
using a majority vote-based model selection scheme (Miloslavsky & van der Laan, 2002).
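The paired comparison of validation perplexities can be sketched with a simplified Wilcoxon signed-rank test (normal approximation with mid-ranks for ties; in practice a statistics library would be used, and this is not the authors' code):

```python
import math

def wilcoxon_signed_rank(a, b):
    """Paired Wilcoxon signed-rank test, simplified sketch.

    Returns the W+ statistic and a two-sided p-value from the
    normal approximation (zero differences are dropped)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    ranked = sorted(diffs, key=abs)
    # assign mid-ranks to tied absolute differences
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and abs(ranked[j]) == abs(ranked[i]):
            j += 1
        mid = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = mid
        i = j
    w_plus = sum(ranks[k] for k in range(n) if ranked[k] > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p_two_sided = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return w_plus, p_two_sided
```

Applied to the paired 8-state vs. 9-state validation perplexities, a small p-value supports preferring the 9-state model.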
The 9-state HMM achieved a perplexity of 2.32 and a classification accuracy of 60.2% for the test data. The confusion matrix of the dHMM is reported in Table 2. Both
logistic regression and dHMM could separate the two
extremes, word search and true interest, but predicting
the question–answer-tasks is difficult. One possible reason
is that some of the question–answer assignments were easier than others. The search behavior in easy assignments
may have resembled the fixation patterns in the word search task (in cases where the question can be answered with
one word), whereas difficult question–answer assignments
were confused with the task of indicating subjective
interest.
4.2.1. Comparing the classification accuracies and
perplexities
If the time series of the eye movement data contains
information about the task type, the dHMM should perform better than the logistic regression model using averaged features. The perplexity of the test set for dHMMs
was 2.32, whereas logistic regression achieved a perplexity of 2.42. The dHMM was significantly better than logistic regression (p < .01, comparison of perplexities with a
Wilcoxon signed rank test). The time series of the eye
movements therefore contained relevant information for
determining the task type.
4.3. Interpreting HMM parameters
Proper interpretation of the parameters of a discriminatively trained joint density model (e.g., a dHMM) is still
somewhat of an open question. Based on asymptotic analysis (with infinite data), the following can be said.
Ordinary maximum likelihood training of a joint density model minimizes the Kullback–Leibler divergence (Cover & Thomas, 1991) between the data distribution and the model distribution. This can be seen by considering the data to be generated from a "true", however unknown, model with parameters θ̃. In practice the model is always an approximation of the "truth", and therefore the model will not fit perfectly to the data (if it were perfect, it would predict all unseen data perfectly). This incorrectness causes a bias in the obtained model parameters θ.

Table 2
Confusion matrix showing the number of assignments classified by the discriminative HMM into the three task types (columns) vs. their true task type (rows)

                   Prediction
                   W (70.0%)    A (50.0%)    I (57.5%)
W (78.9%)          142          22           16
A (35.5%)          43           54           55
I (62.8%)          18           39           96

The percentages (in parentheses) denote row- and column-wise classification accuracies.
Discriminative training, on the other hand, maximizes
conditional likelihood which minimizes the Kullback–Leibler divergence between a subset of variables in the data
and the model parameters. As a result, this subset (here
the task types) is modeled as well as possible. A tradeoff
is that other variables of the data are modeled more inaccurately. However, in an asymptotic case with infinite
amount of data, and where the ‘‘true” model is within
our model family, the parameters are the same as those
obtained from maximum likelihood. In case of an incorrect
model, by inspecting the gradient of the conditional likelihood (proof omitted), it can be shown that the conditional
maximum likelihood and the maximum likelihood estimates are close to each other (and asymptotically the same)
when (i) the model is close to the true model or (ii) the class
predictions of the model are accurate, but the particular
parameters do not help in discriminating between the classes. In these cases, the parameters can be interpreted as in
an ordinary joint likelihood model.
From this point of view, a straightforward way of interpreting parameter values is therefore to report and compare the parameter values from conditional and ordinary
maximum likelihood. If the values are the same, the data does
not contain additional information that can be used for
more accurate prediction of the task type. On the other
hand, if the two parameter estimates differ, it implies that
the variables that they model help in predicting the task
type, and their modeling assumptions are incorrect. This
fact can be used for checking and revising the model. The
revised model has to be checked afterwards with new data.
In our experiment the parameters of the discriminative
and joint density HMMs (Table 3) are roughly the same,
suggesting that our model uses the information that eye
movements contain on task types fairly well. The greatest
discrepancy between the parameter values follows from
the log-Gaussian approximation of the fixation distributions, which was to be expected (as discussed in Section
3.4.2). The difference between the two parameter estimates
also shows that the fixation durations are important in predicting the task type.
We next discuss the modeling results for each set of parameters of the HMM. The analysis is carried out with conditional
maximum likelihood parameters; maximum likelihood
parameters can be analysed in a similar manner, with
approximately similar results.
4.3.1. Observation distributions and hidden states
The discriminative hidden Markov model that best fitted
our data segmented each task type into three states (Fig. 2).
The parameter values of the dHMM (Table 3) exhibited
relatively similar eye behavior in the three hidden states
for each of the task types. Next, we compared the parameter values to literature on reading and other cognitive
tasks, and designated the states to describe the processing
features that were reflected in the eye movement behavior.
Table 3
Discriminative HMM parameter values for the scanning, reading and decision states for each task type (corresponding maximum likelihood estimates in parentheses)

                                    Scanning        Reading         Decision
Probability of beginning the task
  Word search                       32% (17)        16% (15)        0%
  Question–answer                   20% (21)        10% (12)        0%
  True interest                     15% (17)        7% (8)          0%

Word search observations
  Fixation duration (ms)            134 (125)       199 (187)       171 (219)
                                    [100–180]       [140–284]       [92–320]
  Saccade length (pix)              166 (155)       132 (159)       132 (120)
                                    [68–409]        [67–259]        [54–319]
  Saccade direction
    Forward                         31% (34)        61% (53)        39% (22)
    Upward                          22% (21)        6% (9)          6% (3)
    Backward                        19% (16)        15% (15)        20% (36)
    Downward                        28% (28)        18% (23)        17% (2)
    End assignment                  1% (0)          0%              18% (37)
  Previous fixations = true         23% (25)        24% (15)        78% (64)

Question–answer observations
  Fixation duration (ms)            134 (129)       205 (204)       177 (173)
                                    [99–182]        [141–299]       [96–323]
  Saccade length (pix)              160 (156)       133 (141)       137 (133)
                                    [60–422]        [74–239]        [48–391]
  Saccade direction
    Forward                         37% (39)        63% (63)        33% (35)
    Upward                          21% (20)        5% (5)          14% (15)
    Backward                        16% (14)        12% (12)        27% (26)
    Downward                        27% (26)        20% (20)        10% (12)
    End assignment                  0%              0%              16% (11)
  Previous fixations = true         28% (25)        26% (21)        86% (83)

True interest observations
  Fixation duration (ms)            134 (125)       200 (196)       176 (169)
                                    [97–184]        [138–291]       [95–326]
  Saccade length (pix)              160 (165)       128 (131)       133 (135)
                                    [57–452]        [73–226]        [48–365]
  Saccade direction
    Forward                         41% (43)        61% (61)        37% (38)
    Upward                          21% (19)        7% (7)          15% (16)
    Backward                        13% (11)        13% (14)        30% (25)
    Downward                        26% (26)        19% (18)        11% (14)
    End assignment                  0%              0%              7% (8)
  Previous fixations = true         27% (28)        24% (26)        86% (88)

In saccade lengths, 160 pixels approximates to 13 letters. Standard deviation σ is reported with respect to the mean μ by the interval μ−σ … μ+σ (shown in brackets), where applicable; 67% of the probability mass is within this interval.
With a combined probability of 67% (Table 3 and
Fig. 2), participants began the assignments from states
which we termed as scanning, because the parameters suggested rather long saccades, with no clear preference on
direction (i.e., almost random), and fewer saccades towards
previously fixated areas. The fixation durations were relatively short (approximately 135 ms), which is in accordance
with previous results indicating shorter fixations in association with easier tasks (Rayner, 1998). On average, participants spent 2.8 s scanning (Table 4).
The second set of states were labeled as reading, because
they were characterized by frequent forward saccades (over
60% probability) with an average fixation duration of
about 200 ms, also typical for reading. The percentage of
backward saccades was 12–15%, corresponding to the previous findings suggesting that in normal reading about 10–
15% of saccades are regressions (Rayner, 1998). The average saccade length was 10.3–10.7 letters (128–133 pixels),
which corresponds to the average length of a word (9.9
characters), plus a space between words.
Frequent forward and backward saccades were typical
for the third and final states (Table 3). The percentage of backward saccades (20–30%) was twice the amount usually
observed in reading. Saccade lengths were approximately
10.7 letters (133 pixels), corresponding to the length of a
word, and occurred within the same line (with 75% probability). The fixations landed on previously fixated words
with 78–86% probability. On average, the fixation durations (175 ms) were shorter than in reading states. This is
possibly due to the fact that participants were mostly fixating on words which they had recently seen, and therefore
the lexical access took less time. We termed the third states
Table 4
Expected dwell times and standard deviations in the scanning, reading and decision states, plus times before and after reaching the decision state, along with the mean percentages of prevalence of the states

                     W                        A                        I
                     Mean        Stdev.       Mean        Stdev.       Mean        Stdev.
Total T              4.1 ± 0.4   3.1 ± 0.5    8.5 ± 1.2   7.1 ± 1.8    11.6 ± 1.1  6.7 ± 0.9
T in scanning        2.2 ± 0.3   1.8 ± 0.4    2.8 ± 0.4   2.3 ± 0.3    3.4 ± 0.4   2.4 ± 0.2
T in reading         4.3 ± 0.7   3.2 ± 0.7    6.1 ± 1.0   5.1 ± 0.8    6.2 ± 0.8   5.2 ± 0.6
T in decision        0.7 ± 0.1   0.8 ± 0.4    1.4 ± 0.4   2.9 ± 1.8    1.8 ± 0.3   2.0 ± 0.4
T to decision        3.4 ± 0.4   2.8 ± 0.5    6.1 ± 0.7   4.5 ± 0.6    8.0 ± 0.7   4.6 ± 0.7
T after decision     0.8 ± 0.2   1.0 ± 0.4    2.5 ± 0.7   4.8 ± 1.6    3.6 ± 0.8   5.1 ± 1.0
% in scanning        51 ± 6      40 ± 2       47 ± 6      40 ± 2       47 ± 6      40 ± 2
% in reading         33 ± 6      41 ± 2       38 ± 6      41 ± 2       38 ± 6      41 ± 2
% in decision        16 ± 2      15 ± 2       15 ± 2      15 ± 2       15 ± 2      15 ± 2

Values are computed from the observation trajectory, which was segmented using the Viterbi algorithm on the dHMM. Capital letters denote the tasks (W = word search, A = question–answer, I = true interest), and units are in seconds. Error estimates (±) are 95% confidence intervals, obtained with a bootstrap method with 400 replicate data sets.
as decision states, because the features indicated a lot of rereading of the previously seen lines. Almost without exception, participants ended the assignments while they were in
the third states. This pattern is visible in Fig. 4. Shimojo,
Simion, Shimojo, and Scheier (2003) have reported similar
results in the context of preference decisions made for
faces. They also showed that participants tended to look
more often at the target they chose just before they made
their decisions.
One potential concern regarding the comparisons of
parameters with previous reading studies, for example
those reviewed by Rayner (1998), is that the participants
may have varied their processing states also in the reviewed
tasks. However, as brought out by Hyönä et al. (2002), in
many reading studies, factors such as global reading strategies have been treated as a nuisance, and their influence is
minimized by studying reading under simplified conditions
(i.e. using brief and simple texts for very simple purposes).
Therefore it is likely that previous results mostly reflect
rather ‘pure’ types of processes.
4.3.2. Transition probabilities
The transition probabilities of the dHMM are shown in
Fig. 2. Within state transitions indicate that participants
continued in the same processing state for several steps
(i.e., fixations), indicating that the associated cognitive processes operate on time scales longer than one fixation. Similarly, previous research suggests that the ongoing processes
are not reset after every saccade, but their influence survives
across saccades (Yang & McConkie, 2005). An estimate of
these time scales was next obtained with the dHMM.
4.3.2.1. Method. The most probable state sequence for each eye movement trajectory was computed by applying the Viterbi algorithm to the learned HMM. The means and standard deviations of the process durations (Table 4) were
computed from the data using the state segmentation
obtained from the dHMM. The mean is the average time
spent in a state, and standard deviation describes how
the time varies in individual cases. The error of the two estimates, i.e., how accurate the estimates are given our
(finite) data sample, is obtained with a bootstrap method
(Efron & Tibshirani, 1993). We generate 400 replicate
(bootstrap) data sets by sampling from the original data
with replacement. For each of the replicate data sets a
bootstrap estimate was computed (e.g. the mean). The
error is now the standard deviation of the 400 bootstrap
estimates computed with respect to the original estimate.
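The bootstrap error estimate described above can be sketched as follows (our function name; the statistic of interest, e.g. the mean, is passed in):

```python
import random

def bootstrap_error(samples, statistic, n_replicates=400, seed=0):
    """Bootstrap error of `statistic`: resample with replacement,
    recompute the statistic on each replicate data set, and return the
    standard deviation of the replicate estimates computed with respect
    to the original estimate (as described in the text)."""
    rng = random.Random(seed)
    n = len(samples)
    original = statistic(samples)
    estimates = []
    for _ in range(n_replicates):
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    var = sum((e - original) ** 2 for e in estimates) / n_replicates
    return var ** 0.5
```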
4.3.2.2. Results. Table 4 shows that the times spent in each
of the states did not differ considerably across the task conditions. On average, participants spent more time in scanning and reading than in decision states. The decision
times were two times longer for the question–answer and
for the subjective interest conditions than for the word
search, where the assignment was ended approximately
1 s after reaching decision state. This corresponds to the
duration of making the decision, because the participants
did not go back to scanning or reading states, unlike in
other conditions. Also, the time to reach the decision state
increased with the task complexity.
4.3.3. Transitions between states
Fig. 2 shows that in the word search condition, transitions from the decision state are rare, with only 1% probability, whereas in the question–answer condition these
transitions occur with 5% probability and in the subjective
interest condition with 14% probability. In the word search
and question–answer conditions, participants switched
more often from scanning to decision (with 80% probability) than to reading (20% probability). This can be seen
from Fig. 2 by comparing the associated transition probabilities (8% vs. 2%). From reading, they shifted to the decision state. In word search, this probability was 92% (11%
vs. 1%), and in the question–answer condition 55% (6%
vs. 5%). In the true interest condition, there was a strong
tendency to switch from decision to reading with 86% probability (12% vs. 2%).
4.3.4. Eye movement trajectories
When combining the most probable (Viterbi) path
through the hidden Markov model with the interpretations
of the hidden states, it is possible to make hypotheses on
the switches of the cognitive states during an assignment.
An interesting further study would be to map these
switches to text contents. Fig. 3 shows example trajectories
for the task types, plotted on the screen coordinates (stimulus words are not plotted for clarity).

Fig. 3. Examples of eye movement trajectories in the experiment. The HMM states along the most probable paths are denoted by different plot markers: state 1 (scanning), state 2 (reading) and state 3 (decision). See text for interpretations of the states. The beginning of a trajectory is marked with a circle; the ending with two concentric circles. W: word search. A: question–answer. I: true interest.

It appears that when
the participant closes in on the relevant line, the decision state is adopted. In the word search condition, the trajectories indicate mostly scanning, whereas in the question–answer condition the lines are read word by word, but the state of processing varies, depending on whether the line is relevant for the task or not.
4.3.5. Average behavior
Drawing summaries from the plots shown in Fig. 3 is
difficult. Instead, it is easier to find common patterns by
inspecting the mean behavior of the conditions.
4.3.5.1. Method. Computing average behavior from our time series data is not straightforward, because the time sequences have different lengths and the observations are probabilities. We first computed the a posteriori probabilities of being in state s at time t, given the observations x_{1,…,T} and model parameters θ, that is, γ_t(s) = p(s_t | x_{1,…,T}, θ). The probabilities can be computed with a forward–backward algorithm. The probabilities were then converted to their natural parameters (by h_{γ_t(s)} = log γ_t(s), thus mapping the probabilities to real values). Next, the sequences were normalized to the same length by resampling them to the same length as the longest sequence (Gallinari, 1998). After that, the values were mapped back to probabilities using the inverse mapping γ_t(s) = exp{h_{γ_t(s)}} / Σ_i exp{h_{γ_t(i)}}. A simple assumption is that for each time instance t, the probabilities are emitted from a Dirichlet distribution with parameters α^{(t)}. The parameters can be estimated using the maximum likelihood criterion (see Minka (2000) for update formulas), after which the mean and standard deviation of the Dirichlet distribution can be computed (see e.g. Gelman et al. (2003)).
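The length normalization step can be sketched as follows (our function name; probabilities are assumed strictly positive so the logarithm is defined, and linear interpolation stands in for the resampling):

```python
import math

def normalize_sequence_length(gamma, target_len):
    """Resample a sequence of state-probability vectors gamma[t][s] to
    `target_len` steps: map to log space (natural parameters), linearly
    interpolate, and map back through a softmax."""
    T = len(gamma)
    S = len(gamma[0])
    logs = [[math.log(p) for p in row] for row in gamma]
    out = []
    for i in range(target_len):
        pos = i * (T - 1) / (target_len - 1) if target_len > 1 else 0.0
        lo, w = int(pos), pos - int(pos)
        hi = min(lo + 1, T - 1)
        # interpolate in the log domain, then renormalize via softmax
        h = [(1 - w) * logs[lo][s] + w * logs[hi][s] for s in range(S)]
        z = [math.exp(v) for v in h]
        total = sum(z)
        out.append([v / total for v in z])
    return out
```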
4.3.5.2. Results. The mean behavior along with its standard
deviation is plotted in Fig. 4. In the word search condition,
participants began the assignment from the scanning state
with a probability of 70%. There was a slight tendency for being in the reading state before switching to the final decision state. For the question–answer and subjective interest
conditions the strategies were similar, although they were
less emphasized. Participants began the tasks almost
equally often from the scanning and reading states. In the
middle of the task performance, the reading state was
slightly more common and towards the end, the decision
state was very common. In general, the results suggested
that before shifting to the decision states participants
adopted different strategies. This was also visible in the
standard deviations, which were larger in the beginning
and in the middle of the tasks than in the end.
5. Discussion
In this paper, we applied a reverse inference approach
with the aim of making hypotheses on hidden cognitive
states in an experiment resembling everyday information
Fig. 4. Average probability (y-axis) of being in state s. Horizontal axis is the normalized sequence length. Top row: word search (W). Center: question–
answer (A). Bottom: true interest (I). The plots show the mean probability (and one standard deviation; 66% confidence interval) of being in a given
HMM state as a function of time. Left column: scanning state, middle column: reading state, right column: decision state.
search tasks. Our setup differs from traditional research
methods in psychology where controlled experiments are
designed to find out what happens in eye movements when
cognitive processes are manipulated. Instead, we designed a
less controlled experiment, and then applied advanced statistical modeling, a hidden Markov model, to make inferences about cognitive processing during the tasks (see
Feng (2003) for a discussion on benefits of the data-driven
approach).
Our model suggests that participants shifted their eye
movement behavior while they proceeded in the tasks.
They typically began the assignments from a set of states
reflecting a scanning type of behavior (see Fig. 4 and Table
3). The scan paths indicated long saccades with no preference on direction, accompanied by rather short fixations.
Additionally, the fixations tended to land on previously
unfixated areas on the text.
The second set of states were labeled as reading because
they contained frequent forward saccades, and the distance
covered by saccades mostly corresponded to an average
word length. Also the mean fixation durations (200 ms)
and the amount of regressions (about 13%) were in accordance with the previous research findings of reading (Rayner, 1998).
The characteristics of the third set of states suggested a
more careful analysis of sentences, possibly of deciding
whether the sentence is the correct answer to a given task.
This was indicated by the fact that the participants ended
the assignments while they were in the decision states.
The saccades landed almost always on the previously seen
lines and were directed either forward or backward. The
distance covered by saccades was about the length of an
average word.
Our results support and complement the modeling work
by Liechty et al. (2003), who used eye movement data to
identify two states of visual attention in an advertisement
viewing task. As an extension to their approach our model
includes experimental manipulations of the search tasks.
Although we used literal tasks, our processing states shared
similarities with their findings. The scanning state shared features with their global processing state; both were characterized by long saccades and rather short fixations. Short saccades and long fixations were typical of
their attentive processing state. In our study, the empirical
data supported segmenting the attentive state into two processes, i.e. the reading and the decision processes, suggesting
a finer structure.
Besides their behavioral relevance, the labels given to the
hidden states are suggestive, and can be used as hypotheses
about the underlying processes. The hypotheses can be
tested by collecting additional data with known processing
states, for example by selecting tasks that emphasize pure
visual scanning or naturalistic reading, to empirically validate the parameters of suspected processes. With the setup
presented here, it is also possible to make more specific
hypotheses by constraining the dHMM structure. For
example, some of the overlapping processes across the
three tasks could have been linked in the HMM training.
For mutually exclusive processes the probability for
being in one state at a certain time would be either one
or zero. However, the probabilities suggested by our model
were somewhere between one and zero (see Fig. 2), indicating that the states are not mutually exclusive but rather
reflect mixtures of ongoing processes that are optimal for
the performance. This is in accordance with experimental and theoretical evidence suggesting that reading eye
movements are generated through multiple competing processes rather than one homogenous mechanism (Findlay &
Walker, 1999). In addition, a considerable proportion of
variation in eye movements can be attributed to random
fluctuations in the oculomotor system (Feng, 2006). Also,
McConkie and Yang (2003) and Yang and McConkie (2005) have shown that a considerable proportion (even 50%) of saccades during reading are executed by a basic mechanism that repetitively produces saccades without direct cognitive control.
Our model was able to predict the task types with an
accuracy of 60.2%, which is 27 percentage points above pure chance (33.3% for three classes). We did not expect much better accuracy. First, we used all the data in modeling, including participants with noisier eye movement signals. Second, the tasks were not tightly controlled. Instead, the
instructions allowed participants to freely choose their
own search strategies. Third, the 50 Hz sampling rate of
the Tobii 1750 eye tracker quantized the fixation durations
to 20 ms intervals. With a higher temporal resolution the
model may have been able to predict the tasks more accurately, since more information would have been available.
The classification accuracy could also be improved by giving word level features, such as word frequencies and word
lengths, as an input to the model. This feature can be implemented for example by using an IOHMM model (Bengio,
1996; Bengio & Frasconi, 1999). Currently, the only additional information (besides eye movement data) given to
our model was the task type of the learning data. Despite
the moderate classification accuracy, the model parameters
appeared behaviorally relevant when compared to the previous results about reading.
5.1. Relation to other models
The model applied here, dHMM, makes it possible to
study cognitive control across fixations, since the eye movements are inspected as a time series instead of summary
measures, such as average fixation duration. Since the
HMM is designed for reverse inference tasks, it differs from
traditional computational models in psychology that are
models of forward inference; they attempt to describe
how perceptual and cognitive processes drive eye movements, whereas our model tries to make conclusions about
cognition given the eye movements.
According to the visuo-oculomotor research tradition,
non-cognitive factors, such as the landing position of the
eyes on a word, mainly determine when and where the eyes
move. Furthermore, Vitu, O’Regan, Inhoff, and Topolski
(1995) showed that eye movements varied little from normal reading when participants were pretending to read z-strings (however, see Rayner & Fischer (1996)). Similar results were also shown by McConkie and Yang (2003),
Yang and McConkie (2005). A strategy-tactics model
(O’Regan, 1990, 1992) suggests that, based on their expectations about the difficulty of the forthcoming task, readers
can adopt either careful or risky global strategies that coarsely influence fixation times and saccade lengths. O'Regan claims
that predetermined oculomotor strategies are important in
defining global characteristics of eye movement behavior in
reading. In our tasks, the question presented prior to the
sentence lists most probably primes expectations and
adjusts certain strategies for the forthcoming performance.
Also, the states discovered by dHMM showed similar features across the task types. Therefore, it is possible that an
oculomotor strategy optimized for the given tasks could
explain the variations in processing states.
Other theories have emphasized the role of cognitive
control on eye movements. For example, Just and Carpenter (1980) have proposed that eye movements act as direct
pointers indicating which word is being processed and for
how long. Also, computational models on reading eye
movements, such as the E-Z Reader (Reichle et al., 2006;
Pollatsek et al., 2006), are based on the assumption that fixation durations, word skipping or regressing are determined by lexical processes. However, the current
discussions on the cognitive control theory focus on the
decisions of when and where the next saccade is initiated
within a single fixation. In contrast, the strategic control
across fixations has until recently been treated only marginally.
In our tasks, the participants could have adjusted their
processing states on a moment-to-moment basis according
to the current task demands, as proposed in Carver
(1990). The finding that the task types differed in the transition sequences between the processing states could support the cognitive control theory. For example, in the
question–answer and the subjective interest conditions,
participants switched more often from the decision state
back to the reading state, whereas in the word search condition the sequence was more straightforward, starting
from the scanning state and ending in decision state.
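The kind of state-sequence inference behind these transition analyses can be illustrated with standard Viterbi decoding over a small hidden Markov model. All names and numbers below are hypothetical placeholders chosen for illustration; they are not the parameters of the dHMM reported here, which was learned discriminatively from the full eye-movement feature set:

```python
import math

# Three hypothetical processing states and a coarse observation
# alphabet of fixation-duration categories (illustrative only).
states = ["scanning", "reading", "decision"]
log = math.log

start = {"scanning": 0.8, "reading": 0.15, "decision": 0.05}

# Transition probabilities (each row sums to 1): a rough
# scanning -> reading -> decision progression, with some chance of
# returning from the decision state back to reading.
trans = {
    "scanning": {"scanning": 0.6,  "reading": 0.35, "decision": 0.05},
    "reading":  {"scanning": 0.05, "reading": 0.7,  "decision": 0.25},
    "decision": {"scanning": 0.05, "reading": 0.25, "decision": 0.7},
}

# Emission probabilities: scanning favors short fixations,
# reading medium ones, decision long ones.
emit = {
    "scanning": {"short": 0.7, "medium": 0.2, "long": 0.1},
    "reading":  {"short": 0.2, "medium": 0.6, "long": 0.2},
    "decision": {"short": 0.1, "medium": 0.3, "long": 0.6},
}

def viterbi(obs):
    """Return the most probable state sequence for a fixation sequence."""
    # delta[s] = best log-probability of any path ending in state s.
    delta = {s: log(start[s]) + log(emit[s][obs[0]]) for s in states}
    backptr = []
    for o in obs[1:]:
        new_delta, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: delta[p] + log(trans[p][s]))
            ptr[s] = best_prev
            new_delta[s] = (delta[best_prev] + log(trans[best_prev][s])
                            + log(emit[s][o]))
        delta = new_delta
        backptr.append(ptr)
    # Trace the best path back from the most probable final state.
    last = max(states, key=delta.get)
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["short", "short", "medium", "medium", "long", "long"]))
```

With these toy parameters, a sequence of progressively longer fixations decodes into the scanning, reading and decision states in turn; transition counts between decoded states can then be compared across task conditions.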
5.2. Future directions
As discussed above, both cognitive and oculomotor theories can explain our results. Further studies, for example combining fMRI with eye tracking, could therefore provide valuable information about the brain activity that correlates with the processing states reflected in eye movement patterns. For instance, stronger simultaneous activation in language areas would support the cognitive control theory, whereas stronger correlations with motor activity would indicate that the strategies are determined by oculomotor factors.
Despite the conflicting views on the basis of the processes driving eye movements, our results are useful in practical applications. The finding that eye movement patterns differ when different processing demands are
encountered can be used for developing an interactive
information search application that learns and adapts to
users’ goals and intentions. For example, by examining
which parts of a search engine's results are read in different states, such as the reading or decision state, it is possible to infer the intentions and interests of the user. On the basis of this information, the system could provide more material of potential interest to the user. However, further studies are needed to make this kind of proactive behavior on the system's side genuinely beneficial to the users.
For future research, more detailed experiments need to be designed that allow a deeper examination of the findings
presented here. For example, it would be of interest to
study to what extent the processing states generalize to
other cognitive tasks and how individuals differ in switching between processing states.
Acknowledgements
This work was supported by the Academy of Finland,
decisions no. 202211 and 202209, Helsingin Sanomain
100-vuotissäätiö, NordForsk and Jenny ja Antti Wihurin
Rahasto. Parts of this paper were completed while the first
author was employed by the Low Temperature Laboratory
at Helsinki University of Technology. Therefore, this work
was also supported by the Sigrid Jusélius Foundation and
the Academy of Finland National Programme of Excellence 2006–2011. The authors would like to thank Jarkko
Venna, Kai Puolamäki, Jukka Hyönä, Samuel Kaski, Kenneth Holmqvist and Erik D. Reichle together with anonymous reviewers for valuable comments and discussions on
the manuscript.
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE
Transactions on Automatic Control, 19(6), 716–723.
Bengio, Y. (1996). Input/output HMMs for sequence processing. IEEE
Transactions on Neural Networks, 7(5), 1231–1249.
Bengio, Y., & Frasconi, P. (1999). Markovian models for sequential data.
Neural Computing Surveys, 2, 129–162.
Carpenter, R. H. S., & McDonald, S. A. (2007). LATER predicts saccade
latency distributions in reading. Experimental Brain Research, 177(2),
176–183.
Carver, R. (1990). Reading rate: A review of research and theory. San
Diego, CA: Academic Press Inc.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory.
New York: Wiley.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New
York: Chapman and Hall.
Feng, G. (2003). From eye movement to cognition: Toward a general framework of inference. Comment on Liechty et al., 2003. Psychometrika, 68, 551–556.
Feng, G. (2006). Eye movements as time-series random variables: A
stochastic model of eye movement control in reading. Cognitive
Systems Research, 7, 70–95.
Findlay, J. M., & Walker, R. (1999). A model of saccade generation based
on parallel processing and competitive inhibition. Behavioral and Brain
Sciences, 22, 661–721.
Gallinari, P. (1998). Predictive models for sequence modelling, application
to speech and character recognition. In C. L. Giles & M. Gori (Eds.),
Adaptive processing of sequences and data structures: International
summer school on neural networks. Lecture notes in computer science
(Vol. 1387, pp. 418–434). Berlin, Germany: Springer-Verlag.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian
data analysis (2nd ed.). Boca Raton, FL: Chapman and Hall/CRC.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of
statistical learning. New York: Springer.
Hyönä, J., Lorch, R., & Kaakinen, J. (2002). Individual differences in
reading to summarize expository text: Evidence from eye fixation
patterns. Journal of Educational Psychology, 94, 44–55.
Hyrskykari, A., Majaranta, P., Aaltonen, A., & Räihä, K.-J. (2000).
Design issues of iDict: A gaze-assisted translation aid. In Proceedings of
eye tracking research and applications (ETRA2000) (pp. 9–14). ACM
Press.
Hyrskykari, A., Majaranta, P., & Räihä, K.-J. (2003). Proactive response
to eye movements. In G. W. M. Rauterberg, M. Menozzi, & J. Wesson
(Eds.), INTERACT’03. IOS Press.
Just, M., & Carpenter, P. (1980). A theory of reading: From eye fixations
to comprehension. Psychological Review, 87(4), 329–354.
Liechty, J., Pieters, R., & Wedel, M. (2003). Global and local covert visual
attention: Evidence from a Bayesian hidden Markov model. Psychometrika, 68, 519–541.
McConkie, G. W., & Yang, S.-N. (2003). How cognition affects eye
movements during reading. In J. Hyönä & R. H. D. Radach (Eds.),
The mind’s eye: Cognitive and applied aspects of eye movement research
(pp. 413–427). Amsterdam, The Netherlands: Elsevier.
Miloslavsky, M., & van der Laan, M. J. (2002). Fitting of mixtures with
unspecified number of components using cross validation distance
estimate. Computational Statistics and Data analysis, 41, 413–428.
Minka, T. (2000). Estimating a Dirichlet distribution. Unpublished manuscript, available online.
Nádas, A. (1983). A decision theoretic formulation of a training problem
in speech recognition and a comparison of training by unconditional
versus conditional maximum likelihood. IEEE Transactions on Acoustics, Speech, and Signal Processing, 31(4), 814–817.
O’Regan, J. K. (1990). Eye movements and reading. In E. Kowler (Ed.),
Eye movements and their role in visual and cognitive processes
(pp. 395–453). Amsterdam, The Netherlands: Elsevier.
O’Regan, J. K. (1992). Optimal viewing position in words and the
strategy-tactics theory of eye movements in reading. In K. Rayner
(Ed.), Eye movements and visual cognition: Scene perception and reading
(pp. 333–354). New York: Springer-Verlag.
Pieters, R., Rosbergen, E., & Wedel, M. (1999). Visual attention to
repeated print advertising: A test of scanpath theory. Journal of
Marketing Research, 36, 424–438.
Poldrack, R. A. (2006). Can cognitive processes be inferred from
neuroimaging data? Trends in Cognitive Sciences, 10, 59–63.
Pollatsek, A., Reichle, E. D., & Rayner, K. (2006). Tests of the E-Z reader
model: Exploring the interface between cognition and eye-movement
control. Cognitive Psychology, 52, 1–56.
Povey, D., Woodland, P., & Gales, M. (2003). Discriminative MAP for acoustic model adaptation. In IEEE international conference on acoustics, speech, and signal processing, 2003. Proceedings (ICASSP'03) (Vol. 1, pp. 312–315).
Puolamäki, K., Salojärvi, J., Savia, E., Simola, J., & Kaski, S. (2005).
Combining eye movements and collaborative filtering for proactive
information retrieval. In G. Marchionini, A. Moffat, J. Tait, R. Baeza-Yates, & N. Ziviani (Eds.), SIGIR'05: Proceedings of the 28th annual
international ACM SIGIR conference on research and development in
information retrieval (pp. 146–153). New York, NY, USA: ACM Press.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected
applications in speech recognition. Proceedings of the IEEE, 77(2),
257–286.
Rayner, K. (1998). Eye movements in reading and information processing:
20 years of research. Psychological Bulletin, 124(3), 372–422.
Rayner, K., & Fischer, M. H. (1996). Mindless reading revisited: Eye
movements during reading and scanning are different. Perception and
Psychophysics, 58, 734–747.
Rayner, K., & Pollatsek, A. (1989). The psychology of reading. New Jersey,
USA: Prentice-Hall Inc.
Reichle, E. D., Pollatsek, A., & Rayner, K. (2006). E-Z reader: A cognitive
control, serial-attention model of eye-movement behavior during
reading. Cognitive Systems Research, 7, 4–22.
Reilly, R. G., & Radach, R. (2006). Some empirical tests of an interactive
activation model of eye movement control in reading. Cognitive
Systems Research, 7, 34–55.
Richter, E., Engbert, R., & Kliegl, R. (2006). Current advances in SWIFT.
Cognitive Systems Research, 7, 23–33.
Robertson, A. W., Kirshner, S., & Smyth, P. (2004). Downscaling of daily
rainfall occurrence over northeast Brazil using a hidden Markov model.
Journal of Climate, 17(22), 4407–4424.
Salojärvi, J., Puolamäki, K., & Kaski, S. (2005a). Expectation maximization algorithms for conditional likelihoods. In L. De Raedt & S. Wrobel
(Eds.), Proceedings of the 22nd international conference on machine
learning (ICML-2005) (pp. 753–760). New York, USA: ACM Press.
Salojärvi, J., Puolamäki, K., & Kaski, S. (2005b). Implicit relevance
feedback from eye movements. In W. Duch, J. Kacprzyk, E. Oja, & S.
Zadrozny (Eds.), Artificial neural networks: Biological inspirations –
ICANN 2005. Lecture notes in computer science (Vol. 3696,
pp. 513–518). Berlin, Germany: Springer-Verlag.
Salojärvi, J., Puolamäki, K., & Kaski, S. (2005c). On discriminative joint
density modeling. In J. Gama, R. Camacho, P. Brazdil, A. Jorge, & L.
Torgo (Eds.), Machine learning: ECML 2005. Lecture notes in artificial
intelligence (Vol. 3720, pp. 341–352). Berlin, Germany: Springer-Verlag.
Schlüter, R., & Macherey, W. (1998). Comparison of discriminative
training criteria. In Proceedings of the ICASSP’98 (pp. 493–496).
Schwarz, G. (1978). Estimating the dimension of a model. Annals of
Statistics, 6(2), 461–464.
Shimojo, S., Simion, C., Shimojo, E., & Scheier, C. (2003). Gaze bias both
reflects and influences preference. Nature Neuroscience, 6(12),
1317–1322.
Vitu, F., O’Regan, K., Inhoff, A. W., & Topolski, R. (1995). Mindless
reading: Eye-movement characteristics are similar in scanning letter
strings and reading texts. Perception and Psychophysics, 57, 352–364.
Yang, S.-N. (2006). An oculomotor-based model of eye movements in
reading: The competition/interaction model. Cognitive Systems
Research, 7, 56–69.
Yang, S.-N., & McConkie, G. W. (2005). New directions in theories of
eye-movement control during reading. In G. Underwood (Ed.),
Cognitive processes in eye guidance (pp. 105–130). Great Britain:
Oxford University Press.