1 Introduction
Autism is a neurodevelopmental disorder associated with social communication deficits and repetitive behaviors [12]. Assessments performed by trained psychologists and clinicians, together with caregiver reports, are the main sources of clinical and research assessment of ASD. In addition to being subjective, these assessments provide minimal information about the underlying mechanism of the disorder. Adding the observed heterogeneity in autism to these limitations, the need for a robust approach that relies less on expert clinical assessment and can also address the heterogeneity of the disorder is clear and of immediate interest to the autism research community.
A deficit in social attention is a known symptom of autism and one of the focus areas in autism biomarker discovery [13]. This atypicality in social attention in ASD has been observed across a variety of research studies and experimental modalities [14, 15, 16].
Eye-tracking (ET) has been shown to be able to assess social attention to scenes presented as video clips, because ET provides a moment-to-moment, frame-by-frame evaluation of how and when the social (e.g., faces) and non-social (e.g., toys, photo frames on the wall, and background) components of a scene are looked at. Over the last decades, an extensive body of research has focused on the intersection between autism and social attention deficits, revealing differences in looking at faces and other social information [17, 18]. Being non-invasive, safe, and tolerated within a wide age range of participants, from infancy to adulthood, makes ET a suitable approach for studying adaptive and cognitive functioning [19, 20, 34].
Automatic detection of autism diagnosis based on eye-gaze information has attracted attention, with several studies utilizing various representations of eye-gaze data combined with machine learning methods to predict patients' diagnosis [41, 42]. Deep neural networks are considered state-of-the-art in machine learning and have been successfully incorporated into computer vision studies; deep visual attention networks in particular have been found effective for predicting gaze location [2, 3, 4, 5, 6].
Carette et al. [1] used saccade information and a long short-term memory (LSTM) network to predict autism diagnosis with 83% accuracy using data from 32 children aged between 8 and 10. Li et al. [42] introduced the Oculomotor Behavior Framework (OBF) model, which is capable of learning oculomotor behaviors (OBs) from unsupervised and semi-supervised tasks. Using a dataset of 49 children (38 ASD), the authors achieved 80% classification accuracy in stratifying ASD from TD. Elbattah et al. [8] studied the utility of a deep autoencoder for identifying clusters of ASD and non-ASD using the scan-path patterns of 59 children (mean age of 7.88 years). Considering two- and three-cluster solutions, Elbattah et al. showed evidence of ASD heterogeneity, with ASD participants present in all clusters and contributing between 28% and 94% of each cluster's membership. Pusiol et al. [9] studied the feasibility of vision-based, gender-specific stratification of Fragile X Syndrome (FXS), a case of autism with a genetic cause, from developmental disorder (DD). In their study, the eye-gaze data of 70 participants were used in combination with a modified LSTM, showing that it is feasible to stratify male-FXS and female-FXS gaze patterns from DD with 86% and 91% classification accuracy, respectively. The results reported in that study indicate clear differences in the eye-gaze patterns of DD and of male and female FXS individuals. However, stratification analyses of male vs. female FXS, FXS vs. DD, and FXS vs. TD (typically developing) are not reported, making it difficult to attest to the ability of the proposed approach to address autism heterogeneity. Tao and Shyu [10], in their 2019 Saliency4ASD grand challenge submission, proposed SP-ASDNet, a hybrid Convolutional Neural Network (CNN) and LSTM network that utilizes eye-gaze scan-path images to diagnose ASD. Tao and Shyu used eye-gaze data from 28 children (5–12 years old) looking at 300 images and achieved 74.22% accuracy on the validation set, although their performance on the testing set dropped to 55.66% due to over-fitting. Wu et al. [11] proposed image-based and synthetic-saccade methods that use scan-path images for the automatic classification of ASD. The authors used two deep networks, with 8 and 10 layers, to train the two proposed models. The 2019 Saliency4ASD grand challenge dataset containing the scan-paths of 28 children was used, achieving 65.41% and 55.13% classification accuracy on the validation and testing sets, respectively. Li et al. [41] introduced Sparsely Grouped Input Variables for Neural Network (SGIN), a mechanism for the automated selection of ET experimental stimuli for which high between-group discrimination (ASD vs. non-ASD) is observed and regression with clinical variables is achieved.
In this article, four well-studied ET tasks are used for data collection. These tasks were selected considering (a) strong construct performance, (b) between-group discrimination, and (c) relation to ASD symptoms in prior research studies of school-aged children. The tasks are used to record eye-gaze patterns during the observation of (1) videos of two adults playing with toys (Activity Monitoring (AM)); (2) videos of three adults having a conversation (Dyadic Bid (DB)); (3) videos of an adult secretly changing the location of an object placed by another adult (Theory of Mind (ToM)); and (4) videos of an adult performing a stressful action (Social Referencing (SR)). The objective of this study is to investigate the feasibility of using scan-paths, a spatial representation of eye-gaze information, to predict autism diagnosis. The contributions of this work are as follows.
This study introduces the novel ideas of (a) incorporating temporal information into scan-path images developed from eye-gaze information and (b) infusing eye-gaze velocity into spatio-temporal scan-path samples, aiming to improve the informativeness of this gaze-data representation and to increase its collective ability to stratify children with Autism Spectrum Disorder from their Typically Developing counterparts.
4 Computational Modeling of Eye-gaze Scan-Paths
In this study, the eye-gaze scan-paths of participants watching video clips representing AM, DB Sensitivity, SR, and ToM are used. The main objective of this study is to develop flexible machine learning approaches for parsing heterogeneity within ASD and segregating individuals with ASD from those without. The study utilizes a CNN for this purpose. The structural layout of the CNN used in this study is presented in Figure 5. Several evaluations are performed, focusing on general factors such as:
— Feasibility of spatial representations of eye-gaze (scan-paths) as features for stratifying ASD and TD.
— Factoring ASD heterogeneity (e.g., age, gender, and their mixed effects) into the stratification of ASD through scan-paths.
— Spatio-temporal analysis of eye-gaze data.
— Fusion of gaze velocity into spatio-temporal scan-paths and its impact on ASD stratification.
4.1 Spatial Analysis I: Basic Feasibility Evaluation of Scan-paths
4.1.1 Stratification of ASD and TD.
The method used in this study is inspired by [22, 23, 24]. In this method, an ET scan-path is generated on a black background for each video clip viewed by each participant. The size of the background is set to 1,680 \(\times\) 1,050 pixels; the images are later resized to 100 \(\times\) 100 pixels and converted to grayscale. This process resulted in 1,012 scan-path images in the ASD and 380 in the TD diagnostic categories.
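As a hedged sketch of this rendering step (the function and array names are illustrative, and drawing parameters such as line width are assumptions not specified above):

```python
import numpy as np
from PIL import Image, ImageDraw

def render_scanpath(gaze_xy, canvas_size=(1680, 1050), out_size=(100, 100)):
    """Draw a gaze trajectory on a black canvas and return a small
    grayscale image, mirroring the preprocessing described above.

    gaze_xy: (N, 2) array of screen coordinates, one row per gaze sample.
    """
    img = Image.new("RGB", canvas_size, color="black")  # black background
    draw = ImageDraw.Draw(img)
    # Connect consecutive gaze points with line segments.
    points = [tuple(p) for p in np.asarray(gaze_xy, dtype=float)]
    if len(points) > 1:
        draw.line(points, fill="white", width=2)
    # Resize to 100x100 and convert to grayscale ("L" mode).
    return img.resize(out_size).convert("L")

# Example with synthetic gaze data:
rng = np.random.default_rng(0)
fake_gaze = rng.uniform([0, 0], [1680, 1050], size=(120, 2))
scanpath = render_scanpath(fake_gaze)  # 100x100 grayscale PIL image
```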
Dense Neural Network (DNN) and CNN models are utilized. The ratio of training to testing images is set to 7:3.
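As an illustration of the modeling setup, the two model families could be built as follows in Keras; the layer counts and widths below are assumptions (Figure 5 gives the actual CNN layout), as is the use of scikit-learn's `train_test_split` for the 7:3 split:

```python
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(input_shape=(100, 100, 1), dropout=0.5):
    """Dense baseline over flattened 100x100 grayscale scan-path images."""
    return keras.Sequential([
        layers.Flatten(input_shape=input_shape),
        layers.Dense(256, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # P(ASD)
    ])

def build_cnn(input_shape=(100, 100, 1)):
    """Small convolutional alternative over the same inputs."""
    return keras.Sequential([
        layers.Conv2D(16, 3, activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

# 7:3 split over scan-path images X of shape (n, 100, 100, 1) and labels y:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = build_cnn()
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", keras.metrics.AUC(name="auc")])
```

Including AUC as a compile-time metric lets the accuracy/AUC gap discussed next be tracked directly during training.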
A noticeable gap exists between training and testing performance, in both accuracy and AUC. Compared with the CNN model, the DNN model achieved higher accuracy but performed worse in AUC.
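For context, both metrics can be computed from the same predicted probabilities; a minimal sketch using scikit-learn (variable names are placeholders):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def accuracy_and_auc(y_true, y_prob, threshold=0.5):
    """Accuracy thresholds the probabilities; AUC ranks them directly."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return accuracy_score(y_true, y_pred), roc_auc_score(y_true, y_prob)
```

With imbalanced classes, a model biased toward the majority class can keep accuracy high while AUC degrades, which is consistent with the gap observed here.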
In order to clarify the identified issue with the AUC results, a further analysis is performed using the DNN and the dataset provided in [22, 23, 24]. This dataset contains 219 images for ASD and 328 images for non-ASD. The results are shown in Table 4. It is noteworthy that the authors indicated the AUC can be improved to 0.8120 after augmenting the number of images [22, 23, 24].
Comparing the results presented in Tables 3 and 4, an impact on the AUC estimates is observed: on this dataset, the differences between testing accuracy and AUC using the DNN are 6% and 7% (dropout 0.5 and 0.2, respectively; see Table 3). With the dataset provided in [22, 23, 24], the observed differences between testing accuracy and AUC are 3% and 8% with 0.5 and 0.2 dropout, respectively. The lower AUC results can be explained by the smaller number of TD training samples compared to ASD, which hinders the training of the network. One possible resolution is to use data augmentation to increase the number of training samples.
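A minimal sketch of such rotation-based augmentation (the specific angles are an assumption; Section 5 states only that a rotation method is used to balance the classes):

```python
import numpy as np
from scipy.ndimage import rotate

def augment_by_rotation(images, angles=(90, 180, 270)):
    """Expand a stack of (N, H, W) scan-path images with rotated copies.

    Rotation preserves the shape of a gaze trajectory while multiplying
    the number of minority-class training samples.
    """
    arr = np.asarray(images, dtype=float)
    augmented = [arr]
    for angle in angles:
        # Rotate every image in the stack within its H-W plane.
        augmented.append(rotate(arr, angle, axes=(1, 2), reshape=False))
    return np.concatenate(augmented, axis=0)
```

Using progressively larger angle sets would give different degrees of augmentation, the idea referred to later in the gender analysis.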
4.1.2 Age Classification (2–4 vs. 10–17).
Heterogeneity in autism is known to play a significant role in the difficulty of identifying reliable biomarkers for this disorder. The main factors contributing to autism heterogeneity include genetic variability, comorbidity, and gender [25].
ASD prevalence is found to have a gender bias, with roughly four males diagnosed for every female [26, 27].
ASD heterogeneity and the proven difficulty of identifying generalizable markers of ASD have led to the presumption of multiple etiologies rather than a single disorder [28]. The quest to develop personalized medicine for ASD is impacted by this heterogeneity [25].
In this experiment, ASD data is used to evaluate the impact of age on the stratification of ASD. Two age groups, 2 to 4 years (2–4) and 10 to 17 years (10–17), are considered in this analysis. The 2–4 group contains 11 ASD participants with 199 scan-path sample images; the 10–17 group contains 13 participants with 255 scan-path images. To increase the number of samples, these images are augmented, resulting in 871 and 1,196 images, respectively. The results are presented in Table 5.
The results indicate high classification accuracy in predicting the age group of ASD participants based on their eye-gaze scan-path images, which in turn indicates that the eye-gaze trajectories of children with autism spectrum disorder are strongly affected by age.
4.2 Spatial Analysis II: Assessing the Performance Variations Impacted by the Video Clips Observed
This experiment investigates possible effects on scan-path patterns caused by variations in the video clips watched by participants. First, the total number of times each video clip was watched by participants is counted (see Table 6); then, to better understand the impact of each video's scan-paths on the overall ability to stratify ASD and TD, those scan-paths are eliminated from the dataset and the process of training and evaluating the DNN and CNN models is repeated.
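Procedurally, this is a leave-one-video-out ablation; a schematic sketch in which the sample dictionary fields and the `train_and_evaluate` helper are hypothetical:

```python
def video_ablation(samples, train_and_evaluate):
    """For each video clip, drop its scan-paths and re-train/evaluate.

    samples: list of dicts like {"video": clip_id, "image": ..., "label": ...}
    train_and_evaluate: callable returning a validation accuracy.
    """
    results = {}
    clips = {s["video"] for s in samples}
    for clip in sorted(clips):
        kept = [s for s in samples if s["video"] != clip]
        # Lower accuracy after removal => the clip contributed more.
        results[clip] = train_and_evaluate(kept)
    return results
```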
As presented in Table 6, AM_A4_S6_B4_GM_D1_F1 and sr04 are the two video clips with the highest overall number of views, both having more than 90; they are therefore used to verify the per-clip impact. In this analysis, the scan-path images of these videos are assessed for their contribution to stratifying the ASD and TD categories: the scan-paths generated from each of these two video clips are removed from the dataset and the stratification capability of the DNN and CNN models is reevaluated. The results presented in Table 7 indicate that, in the absence of scan-path samples from the AM_A4_S6_B4_GM_D1_F1 video clip, validation accuracy is slightly improved compared to the scenario where the scan-path images generated from the sr04 video clip are removed. This indicates that the sr04 video clip is relatively more powerful than AM_A4_S6_B4_GM_D1_F1 in generating scan-paths that can distinguish ASD from TD.
To better assess the impact of the various video clips and their contributions to the observed stratification power of scan-paths, the procedure discussed earlier is extended to the second most-viewed set of videos (i.e., videos that were viewed 60 to 62 times). The results, reported in Table 8, indicate that scan-path images generated from the eye-gaze of participants viewing video clips tom0301A32 and db0101B12 have the highest impact on the observed stratification power of the DNN. The results also indicate that scan-path images generated from participants viewing the sr01 video clip contribute the least to the DNN's performance, as the highest classification performance is achieved when the scan-path images generated from sr01 are removed from the dataset.
The findings of this experiment indicate that the individual video clips differ in their contributions to the overall stratification of ASD and TD.
4.2.1 Gender Classification for ASD Subjects at the Trial Level.
To understand the impact of sex and age in children with autism on eye-gaze scan-paths, and the feasibility of using these scan-path images to stratify ASD participants by gender, two sets of experiments are performed:
(1.1) Gender classification in all ASD subjects: The results reported in Table 9 indicate a validation accuracy of 78.04%, reflecting the feasibility of using scan-paths for stratifying male and female children with autism. This in turn attests to considerable differences in the scan-paths of female and male children with ASD enrolled in the study.
(1.2) Gender classification in 5 to 9 years old ASD subjects: Considering the low number of female ASD participants in the study, the group with the largest number of female ASD participants, 5 to 9 years old, is considered in this analysis. The low number of female ASD participants results in a much lower number of available training scan-path samples, which in turn negatively impacts the DNN's training. To circumvent this problem, three degrees of sample augmentation are applied to increase the number of scan-path samples. A high classification accuracy between male and female ASD participants would indicate a substantial difference between these two subcategories of ASD in response to an ET stimulus. The results reported in Table 10 indicate 98% female-male stratification accuracy among the 5 to 9 years old ASD population when the highest level of sample augmentation is used and the largest number of samples is generated. This performance is reduced when lower levels of sample augmentation are used, but the results still indicate a substantial difference in response between female and male ASD participants.
Because there are too few samples in the other age groups, no further experiments on gender classification of ASD participants across age groups are performed.
4.3 Spatio-Temporal Analysis I: The Importance of Temporal Information
The results presented in previous experiments attest to the following aspects:
(1) Patterns of spatial information of eye-gaze data captured in scan-path images are able to stratify ASD and TD participants (Table 3: DNN = 74.4%).
(2) Scan-path patterns in children with ASD are influenced by age (Table 5: DNN = 79.8%, CNN = 80.5%, 2–4 vs. 10–17 years old children with ASD).
(3) Scan-path patterns in children with ASD are influenced by gender (Table 9: DNN = 78.0%, male ASD vs. female ASD).
(4) Age and gender together have a combined impact on scan-path patterns in children with ASD (Table 10: DNN = 98.8%, male vs. female classification in 5–9 years old children with ASD).
The results are encouraging and indicate some degree of success in mining the underlying spatial eye-gaze pattern differences between ASD and TD. The results with spatial eye-gaze scan-paths also speak to factors contributing to ASD heterogeneity, e.g., age and gender. Spatial scan-paths capture eye-gaze patterns by removing the temporal dimension of the data. These scan-path images, while providing an overall view of the points visited on the screen, eliminate any pattern differences between ASD and TD in the temporal dimension of the data. In order to better understand the importance of the temporal dimension in eye-gaze scan-paths, a new set of analyses is performed.
4.3.1 Stratification of ASD and TD Using Spatio-Temporal Eye-gaze Scan-paths.
To understand ASD and TD pattern differences in the temporal dimension of eye-gaze scan-paths, non-overlapping 3 s temporal windows are considered and their associated scan-paths are assessed. The results are reported in Table 11. The evaluation is repeated three times and the outcomes are averaged to represent the final performance.
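A sketch of the windowing step, assuming each recording provides per-sample timestamps in seconds (names are hypothetical):

```python
import numpy as np

def split_into_windows(timestamps, gaze_xy, window_s=3.0):
    """Split a gaze recording into non-overlapping fixed-length windows.

    timestamps: (N,) array of sample times in seconds from clip onset.
    gaze_xy:    (N, 2) array of screen coordinates.
    Returns a list of (N_i, 2) arrays, one per window (0-3 s, 3-6 s, ...).
    """
    timestamps = np.asarray(timestamps, dtype=float)
    windows = []
    start = 0.0
    while start < timestamps.max():
        mask = (timestamps >= start) & (timestamps < start + window_s)
        if mask.any():
            windows.append(np.asarray(gaze_xy)[mask])
        start += window_s
    return windows

# Each window's gaze points can then be rendered into its own scan-path
# image, e.g., with the render_scanpath sketch from Section 4.1.1.
```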
In order to understand the effect of each temporal period more intuitively, the average values for each period in Table 11 are plotted (see Figure 6). Since the training accuracies are always 100%, no separate curve is drawn for them.
The results indicate that the first 3 s of the video clips (the 0–3 s temporal window) encapsulate the scan-path trajectories with the highest ASD and TD stratification capability, while the remaining periods perform almost consistently. This indicates that a degree of difference between ASD and TD eye-gaze patterns is hidden in the temporal dimension of scan-path patterns.
4.3.2 Assessing the Performance Variations Impacted by the Video Clips Watched By Participants.
Before taking a closer look at the contribution of temporal information to the stratifiability of scan-path patterns, it is necessary to consider possible performance variations across video clips and their contributions to the observed performance.
Given the variation in the number of times video clips were viewed by participants enrolled in this study, and considering the low number of views reported in Table 6 for some of the clips, only the four most-viewed video clips are assessed in this analysis (see Table 12). Similar to previous experiments, in each evaluation all scan-path samples generated from one of these four video clips are removed from the dataset, and the remaining samples are used for training and evaluation. The classification accuracy attests to the contribution of each video to the stratification of ASD and TD: high classification accuracy after removing the samples of a given video clip indicates a low or negative contribution of those samples, while low accuracy indicates a high or positive contribution.
Inspired by the findings of the previous experiments, only scan-paths from the 0–3 s time window are considered in this experiment. The evaluation for each video is repeated three times. The results are presented in Table 13.
In order to understand the effect of each video while only using scan-paths of the 0–3 s temporal window, the average performances presented in Table 13 are used to draw the patterns presented in Figure 7.
The results reported in Table 13 and Figure 7 indicate that removing scan-path samples from the first 3 s temporal window of the sr01 video clip increases the overall stratification capability of the classifier. This attests to the negative impact of scan-path samples generated from sr01 eye-gaze data.
The results also indicate that the omission of samples from sr04, AM_A4_S6_B4_GM_D1_F1, and db0101b12 causes the highest loss in the ASD and TD stratification capability of the network, attesting to the importance of these video clips in the observed overall performance.
It is noteworthy that, in the preparation of the results presented in Tables 12 and 13, no augmentation is performed, since the main purpose of the evaluation is to verify the impact of each video clip on overall performance.
4.4 Spatio-Temporal Analysis II: Digging Deeper into the Presentation of the Temporal Dimension
Looking at scan-path images, multiple lines between close-by points are a common phenomenon; see Figure 8 for a zoomed example.
Aiming to increase the depth of information presented in the scan-path representation of eye-gaze data, factors such as the average velocity between two points are considered using the following equation:
\[
\bar{v} = \frac{\sqrt{(x_{P_2} - x_{P_1})^2 + (y_{P_2} - y_{P_1})^2}}{t_{P_2} - t_{P_1}},
\]
where the distance between the two points is obtained as the Euclidean distance, \(P_1\) and \(P_2\) are two consecutive points with \(x\) and \(y\) coordinates \((x_{P_1}, y_{P_1})\) and \((x_{P_2}, y_{P_2})\), and \(t_{P_1}\) and \(t_{P_2}\) are their timestamps. To incorporate both temporal and velocity information into scan-paths, the velocity values generated from predefined temporal windows are used as the color values of the scan-paths in the given time window. Table 14 presents the sets of velocity color mappings considered in this experiment. Another variation of this colormap, with different velocity threshold values, is also considered; no considerable difference in performance is observed.
An example scan-path sample using these color-coding schemes is illustrated in Figures 9–11.
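A sketch of how such velocity-based color coding might be implemented; the thresholds and RGB values below are placeholders, not the actual mappings of Table 14:

```python
import numpy as np
from PIL import Image, ImageDraw

# Placeholder mapping: (upper velocity bound in pixels/s, RGB color).
# The actual thresholds and colors are defined in Table 14.
VELOCITY_COLORS = [
    (200.0, (0, 0, 255)),         # slow segments drawn in blue
    (600.0, (0, 255, 0)),         # medium segments drawn in green
    (float("inf"), (255, 0, 0)),  # fast segments drawn in red
]

def velocity_color(v):
    """Map an average segment velocity to its color bucket."""
    for bound, color in VELOCITY_COLORS:
        if v < bound:
            return color

def render_velocity_scanpath(timestamps, gaze_xy, canvas_size=(1680, 1050)):
    """Draw each consecutive gaze segment colored by its average velocity."""
    img = Image.new("RGB", canvas_size, "black")
    draw = ImageDraw.Draw(img)
    t = np.asarray(timestamps, dtype=float)
    pts = np.asarray(gaze_xy, dtype=float)
    for i in range(len(pts) - 1):
        dt = t[i + 1] - t[i]
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        # Euclidean distance between consecutive points divided by elapsed time.
        v = np.hypot(*(pts[i + 1] - pts[i])) / dt
        draw.line([tuple(pts[i]), tuple(pts[i + 1])],
                  fill=velocity_color(v), width=2)
    return img
```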
5 Experiments and Results
ASD and TD Classification for Each Second in the 3–6 s Period:
Aiming to deal with the unbalanced nature of samples across ASD and TD, the TD samples are augmented using the rotation method to close the sample-size gap between the two groups. To obtain relatively accurate results, the test is repeated three times in each interval. In order to compare the test results with the previous findings presented in Table 11, the test results of the 3 s to 6 s time window are added to the table. The results indicate a considerable increase in the overall performance for the 3–6 s time window, from 71.74% (see Table 11) to 80.7% (see Table 15). Table 16 provides a comparison between the use of velocity-based color-coded scan-paths and the original scan-paths. The 3–6 s time window results acquired with the velocity-based color-coded scan-paths also perform considerably better than the 1 s time window intervals between 3 s and 6 s (see Table 15). A possible explanation for the observed increase in classification accuracy in the 3–6 s time window compared to the 1 s intervals is the difference in the amount of information and patterns that can be captured by a 3 s window compared to a 1 s window. Figures 12 and 13 provide an example of such scan-path differences between these two time windows.
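A compact sketch of this balancing-and-repetition protocol (the `augment` and `train_and_evaluate` helpers are placeholders, e.g., the rotation-based augmentation sketched in Section 4.1.1):

```python
import numpy as np

def balanced_repeated_eval(asd_images, td_images, augment, train_and_evaluate,
                           repeats=3):
    """Close the ASD/TD sample-size gap by augmenting the minority TD class,
    then average validation accuracy over several repeated runs."""
    td_balanced = augment(td_images)  # e.g., rotation-based augmentation
    scores = [train_and_evaluate(asd_images, td_balanced)
              for _ in range(repeats)]
    return float(np.mean(scores))
```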