1 Introduction
Recent years have seen an increase in the development of systems for the generation of humanlike communicative behaviour. This is driven by the need for socially interactive virtual and robotic agents in various domains. For instance, artificial agents may range from household service robots to museum guide avatars and social robots in education and medicine, whose primary function is not only to assist people but also to connect with people through effectively producing social signals [
13].
Research has long established a rule-based approach as an advantageous one in human behaviour generation [
12,
109,
141]. However, in light of state-of-the-art developments, major issues in the rule-based approach have been identified. While it is efficient in producing human behaviours for a single or a limited number of modalities, it is hampered by the need to explicitly formulate rules, resulting in a practical limit on the number of rules, which in turn curbs the expressiveness of behaviour [
62]. Additionally, rule-based systems typically fall short of producing multimodal behaviours, as the number of rules increases rapidly when new modalities are added [
170]. Recent evidence suggests that rule-based models fail to produce natural variations of human behaviour, often because they do not cover the entire range of behaviour or because their naturalness is found to be lacking [
125].
In contrast, models that are trained by learning from available corpora of speech, text, audio, and multimodal data allow for a more robust human–agent interaction, as they can learn correlated behaviour that is difficult or labour-intensive to capture in rules. For example, it is believed that computational models based on data hold promise in uncovering the complex relationships between verbal and non-verbal human behaviours [
124,
218]. Advances in deep learning and machine learning models, together with the availability of large datasets, have led to a growing interest in data-driven systems for behaviour generation [
85,
111,
228], dialogue systems [
173], and speech synthesis systems [
197,
211]. The data-driven approach to interaction design is deemed to improve on the labour-intensive rule-based approach. Human behaviours are generally produced through various modes that make communication multimodal [
7]. Those are primarily speech and different types of bodily gestures such as facial gestures, movements of the head, and manual (hand, arm, shoulder) gestures [
7]. These all play an integral role in conveying social signals and information [
147]. Moreover, the affective states of an interlocutor are consciously or unconsciously communicated by means of these verbal and non-verbal communicative channels [
7]. Data from several studies suggest that robots and virtual agents that are able to evoke affect in human users are perceived as more vivid and humanlike [
54,
160].
Compared to other recent reviews [
127,
226], this survey intends to take stock of the dynamically expanding field of co-speech gesture and behaviour generation for anthropomorphic agents and of the methodological approaches used for the evaluation of such models. We review existing research on data-driven approaches in verbal and non-verbal human behaviour generation and cover progress in data-driven communicative behaviour generation from the last five to six years. Furthermore, this work attempts to identify open challenges and directions and, in doing so, sets a road-map for future research in this field.
Section
2 explains the methodology for the review. Sections
3,
4,
5, and
6 are dedicated to reviewing data-driven models that generate various communicative behaviours that occur in human–human interactions and that are designed for human–agent and human–robot interaction scenarios. Section
7 concludes the review by focusing on speech synthesis, the communicative behaviour in which the most resources have been invested for arguably the longest period of time and which therefore holds essential lessons for data-driven behaviour generation. Section
8 provides an outlook for the field and concludes the article.
3 Head Gestures
Head gestures constitute an important part of human body language during communication and co-occur with speech. Speech-driven head gesture synthesis through data-driven approaches has attracted attention since the early 2010s. Unlike rule-based models for gesture synthesis, data-driven models can learn dependencies between data so as to map a sequence of speech features to meaningful head animations. The related literature shows different frameworks employing Deep Neural Networks (DNNs) [
184], Bi-directional Long Short-Term Memory (BLSTM) networks [
172], and deep generative models [
72,
179], which are capable of learning the temporal and cross-modal dependencies of continuous signals.
Ding et al. [
45] discussed a DNN for synthesizing head motion from speech features. To this end, they pre-trained a Deep Belief Network (DBN) [
89], using stacked Restricted Boltzmann Machines [
178] with a target layer for fine-tuning the DBN model parameters, creating a DNN model. The objective evaluation relied on three measures: Canonical Correlation Analysis (CCA) [
83], Average Correlation Coefficient (ACC) [
159], and Mean Square Error (MSE) [
6], all computed between the predicted and ground-truth head movements. The results show that the generatively pre-trained DNN model outperformed a randomly initialized network trained through back-propagation. Furthermore, Ding et al. [
47] showed that this DNN model outperformed a traditional Hidden Markov Model (HMM) approach for head motion synthesis from speech [
91] in the CCA analysis.
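To make this setup concrete, the following minimal PyTorch sketch shows a feed-forward regressor from a window of stacked acoustic features to head-rotation parameters. All layer sizes and feature dimensions are illustrative assumptions rather than those of the original work, and the DBN pre-training step is omitted:

```python
import torch
import torch.nn as nn

# Minimal sketch of a feed-forward speech-to-head-motion regressor.
# Dimensions are illustrative: a window of stacked acoustic frames in,
# head-rotation parameters (e.g., pitch/yaw/roll angles) out.
class SpeechToHeadMotion(nn.Module):
    def __init__(self, in_dim=13 * 11, hidden=256, out_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),   # predicted rotation parameters
        )

    def forward(self, x):
        return self.net(x)

model = SpeechToHeadMotion()
speech_window = torch.randn(32, 13 * 11)              # batch of stacked MFCC frames
predicted = model(speech_window)                      # (32, 3) head rotations
loss = nn.MSELoss()(predicted, torch.zeros(32, 3))    # vs. ground-truth motion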
Ding et al. [
46] compared two types of neural network models, BLSTM and feed-forward networks, to learn the correspondences between speech and head motion. The results show that the BLSTM model significantly reduced the root mean squared error (RMSE)—of predicted movements with respect to ground-truth movements—compared to the feed-forward model, which does not converge when the number of hidden layers is greater than two. Furthermore, the BLSTM model, with different numbers of hidden layers, achieves better performance than the feed-forward model in the CCA [83]. Moreover, a hybrid network composed of two BLSTM layers with one feed-forward layer in between shows higher performance in objective evaluations and in a subjective evaluation—measuring the naturalness of head motion—than a standalone BLSTM model and the other stacked network architectures.
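A minimal sketch of a BLSTM speech-to-head-motion regressor of the kind compared in these studies is given below; all hyperparameters and feature dimensions are assumptions for illustration only:

```python
import torch
import torch.nn as nn

# Sketch of a BLSTM speech-to-head-motion model (hyperparameters assumed).
class BLSTMHeadMotion(nn.Module):
    def __init__(self, feat_dim=13, hidden=128, out_dim=3):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)   # both directions concatenated

    def forward(self, speech_seq):                   # (batch, time, feat_dim)
        h, _ = self.blstm(speech_seq)                # (batch, time, 2 * hidden)
        return self.proj(h)                          # (batch, time, out_dim)

model = BLSTMHeadMotion()
motion = model(torch.randn(8, 200, 13))              # 200 acoustic frames per clip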
Haag and Shimodaira [
82] presented a bottleneck DNN architecture, where bottleneck features—resulting from a DNN model containing a hidden bottleneck layer and trained on the features of speech and head motion—are used with speech features as input to another DNN model with a BLSTM layer in a forward pass to synthesize head motion. These bottleneck features can capture the dependencies between the features of speech and head motion curves, which allows for improving the accuracy of generating head movements. They report that bottleneck features enhanced the performance of the DNN-BLSTM architecture and achieved better scores in the CCA [
83] than when they were not present in the architecture.
Greenwood et al. [
77] introduced a BLSTM model to predict head motion from speech and further extended the model through conditioning by a prior motion input to limit the possible head motion predictions for speech. Moreover, they proposed a generative Conditional Variational Autoencoder (CVAE) [
179] using BLSTM models as encoder and decoder to map speech to head motion. This last model allows for predicting a variety of output head motion curves for the same speech input by sampling from the Gaussian space and conditioning on speech features.
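The following sketch illustrates the conditional VAE idea at the level of individual frames—encode motion conditioned on speech, sample a latent variable, and decode—with purely illustrative dimensions; it is not the authors' sequence-level BLSTM implementation:

```python
import torch
import torch.nn as nn

# Frame-level conditional VAE sketch: encode motion conditioned on speech,
# sample a latent z, decode back to motion. Dimensions are illustrative.
class CVAE(nn.Module):
    def __init__(self, motion_dim=3, speech_dim=13, latent_dim=8, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(motion_dim + speech_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + speech_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, motion, speech):
        h = self.enc(torch.cat([motion, speech], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterise
        recon = self.dec(torch.cat([z, speech], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl

cvae = CVAE()
recon, kl = cvae(torch.randn(16, 3), torch.randn(16, 13))
# At synthesis time, sampling different z ~ N(0, I) and decoding conditioned on
# the same speech features yields different plausible head-motion outputs.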
Sadoughi and Busso [
165] presented a conditional Generative Adversarial Network (GAN) [
72] with BLSTM cells for generating head movements for speech segments. It learns, during training, the conditional distributions of head motion curves and prosodic features of speech. The performance of the proposed model was compared with a DBN [
132] and a BLSTM model [
46]. The results show that the proposed conditional GAN model outperformed the baseline DBN and BLSTM models in terms of the log-likelihood measures as well as in subjective evaluation.
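A single training step of a speech-conditioned GAN of this general kind can be sketched as follows; the multilayer perceptrons stand in for the BLSTM-based generator and discriminator of the original work, and all sizes are assumptions:

```python
import torch
import torch.nn as nn

# One training step of a speech-conditioned GAN (sizes assumed; MLPs stand in
# for the BLSTM-based generator G and discriminator D of the original work).
speech_dim, motion_dim, noise_dim = 13, 3, 16
G = nn.Sequential(nn.Linear(noise_dim + speech_dim, 128), nn.ReLU(),
                  nn.Linear(128, motion_dim))
D = nn.Sequential(nn.Linear(motion_dim + speech_dim, 128), nn.ReLU(),
                  nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

speech = torch.randn(32, speech_dim)        # prosodic features (stand-in)
real_motion = torch.randn(32, motion_dim)   # ground-truth head motion (stand-in)

# Discriminator step: score real pairs against generated pairs.
fake_motion = G(torch.cat([torch.randn(32, noise_dim), speech], dim=-1))
d_loss = bce(D(torch.cat([real_motion, speech], dim=-1)), torch.ones(32, 1)) + \
         bce(D(torch.cat([fake_motion.detach(), speech], dim=-1)), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the conditioned samples look real to D.
g_loss = bce(D(torch.cat([fake_motion, speech], dim=-1)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()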
Table
1 summarizes the corpora and evaluation approaches used in the studies covered in this survey. While most of these studies used objective measures to evaluate the proposed models, some also included subjective evaluations. It is noteworthy that the sizes of the corpora and the scale of the evaluations are often small; therefore, measuring how appropriate the generated head gestures are is not always possible, and new metrics supplementing the existing objective metrics might be needed.
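For concreteness, the sketch below shows how two of the objective measures recurring in this section (MSE and an ACC-style average correlation) can be computed on predicted versus ground-truth motion trajectories; the arrays are random stand-ins for real data:

```python
import numpy as np

# Two objective measures recurring in this section, computed on predicted vs.
# ground-truth motion trajectories (random arrays stand in for real data).
def mse(pred, gt):
    return float(np.mean((pred - gt) ** 2))

def average_correlation(pred, gt):
    # Mean Pearson correlation over motion dimensions (an ACC-style measure).
    corrs = [np.corrcoef(pred[:, d], gt[:, d])[0, 1] for d in range(pred.shape[1])]
    return float(np.mean(corrs))

pred = np.random.randn(500, 3)                  # 500 frames of predicted rotations
gt = pred + 0.1 * np.random.randn(500, 3)
print(mse(pred, gt), average_correlation(pred, gt))
# CCA between the two trajectory sets can be computed with, e.g.,
# sklearn.cross_decomposition.CCA, by correlating the fitted canonical variates.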
5 Hand Gestures
As a natural mode of interaction, hand gestures carry important functions in human–human communication, such as maintaining an image of a concrete or abstract object and idea (iconic and metaphoric gestures), pointing and giving directions (deictic gestures), or emphasizing some parts of the speech (beat gestures) [
134]. Hand gestures, including fingers and arms, also act as an independent modality or part of modalities designed for various virtual agents and robots, adding expressivity to their motions. This versatility of hand gestures served as an incentive for their application in such domains as human–computer interaction [
207] and its related fields HRI [
128] and human–agent interaction (HAI). In HRI, hand gestures are applied to socially assistive robots because of the expressivity they add to robots’ verbal and non-verbal communication with humans [
170]. Besides, hand gestures are believed to ease the interaction between humans and robotic agents [
142].
A considerable amount of research has been conducted on a data-driven generation of hand gestures, utilizing various databases and displaying a range of architectural choices [
113,
194,
228]. For example, the earliest work by Chiu and Marsella [
29] in 2011 made use of Hierarchical Factored Conditional Restricted Boltzmann machines (HFCRBMs) [
30], whereas the most recent works resorted to models such as Long Short-Term Memory networks [
85,
186] and a Variational Autoencoder (VAE) [
111], to mention a few. Despite their purely communicative nature, sign language gestures are not covered in this survey, as they rely largely on the visual modality alone. In the paragraphs that follow, we therefore cover hand gestures that are characteristic of co-speech communication of information.
Chiu and Marsella [
29] relied on HFCRBMs [
30]—an extension of Deep Belief Network [
89]—to generate hand gestures that are tied to prosodic information. In particular, the gesture generator learns the relationship between previous motion frames and audio features (inputs) and the current motion frame (output) to generate hand gesture animations. The model was trained on motion capture and audio data from human conversation. Specifically, the motion capture data contained joint rotation vectors with 21 degrees of freedom, whereas the audio features encoded prosodic information such as pitch and intensity values. In the subjective evaluation, three animation types—Original, Generated, and Unmatched—were compared against each other in a user study. The results demonstrated the naturalness of the generated gesture animations and the consistency of their motion dynamics with the utterances.
Bozkurt et al. [
17] presented a speaker-independent framework for joint analysis of hand gestures with continuous affect attributes, such as activation, valence, dominance, and speech prosody using Hidden semi-Markov models (HSMMs) [
230]. During synthesis, prosody feature extraction and the continuous affect attributes are followed by HSMM-Viterbi decoding. Gestures in the motion capture data were represented by joint angles of the arms and forearms. The animation is then generated via unit selection applied to a gesture pool with respect to a multi-objective cost function. Their system was trained on the multimodal USC CreativeIT database [135]. Phrase-level gesture sequences for (1) affect and prosody feature fusion, (2) prosody-only, and (3) affect-only configurations were evaluated based on CCA scores [83] and symmetric Kullback–Leibler (KL) divergence. Their findings suggest that affect and prosody fusion provides the best correlation with the original gesture trajectories and models gestures and gesture durations best, whereas the affect-only configuration has the smallest kinetic energy difference from the original sequences. Subjective evaluations were planned as future work.
Takeuchi et al. [
186] used deep neural networks with BLSTM [
232] to study the production of metaphoric hand gestures from speech features of audio. During data pre-processing, the hand gestures were represented as rotations of bone joints. The network is composed of three non-recurrent layers, a BLSTM layer, and a final output layer. The first non-recurrent layer takes MFCC features of the audio as input, while the other non-recurrent layers take independent data. The final output layer takes the backward and forward recurrence units from the BLSTM layer as input, and the model output—the vector of predictions—is represented in the BioVision Hierarchy (BVH) format. The objective evaluation, conducted by comparing the final loss of the proposed model with that of a simple RNN implementation, showed significantly better performance for the proposed model. The subjective evaluation of the original, mismatched, and generated gestures demonstrated significantly lower ratings for the generated gestures than for the former two (original and mismatched) in terms of naturalness, matching in timing, and matching in context. This result, as the authors explain, might be due to the frequent movement in the generated gesture motion.
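As an illustration of the acoustic front end assumed by many of these models, the sketch below extracts per-frame MFCC features from a waveform using librosa (our choice for illustration; frame and hop sizes are assumptions, and a random signal stands in for real speech):

```python
import numpy as np
import librosa   # used here purely for illustration

# Extracting per-frame MFCC features of the kind used as network input in
# these studies; frame/hop sizes are assumptions, and a random signal stands
# in for a real speech recording.
sr = 16000
waveform = np.random.randn(sr).astype(np.float32)         # 1 s of "speech"
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13,
                            n_fft=1024, hop_length=160)    # (13, frames)
features = mfcc.T                                          # (frames, 13)
# Each row (optionally with neighbouring frames stacked) is then fed to the
# network that predicts the joint rotations for that frame.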
Hasegawa et al. [
85] presented a BLSTM model, integrating it with a Bi-directional RNN [75], to generate co-speech 3D metaphoric hand gestures from speech audio. Specifically, the speech audio was converted to MFCC features, and the joint positions of the whole body were used to represent the gestures. The network learns the relationship between speech and gesture motion with backward and forward consistencies. Similarly to the model proposed by Takeuchi et al. [
186], the architecture consists of five layers shown in Figure
3. The objective evaluation was performed using the Average Position Error (APE) [117], which displayed insignificant errors in the left and right wrists in terms of accuracy. Moreover, the user study revealed that, among the three gesture conditions (original, mismatched, and generated), the generated gestures were perceived as significantly more natural but significantly less temporally and semantically consistent than the original gestures.
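An APE-style measure can be sketched as follows; the arrays are random stand-ins for generated and ground-truth joint positions:

```python
import numpy as np

# APE-style measure: mean Euclidean distance between generated and
# ground-truth joint positions, reported per joint.
def average_position_error(pred, gt):
    # pred, gt: (frames, joints, 3) arrays of 3D joint positions.
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=0)

pred = np.random.randn(300, 15, 3)                  # 300 frames, 15 joints
gt = pred + 0.05 * np.random.randn(300, 15, 3)
print(average_position_error(pred, gt))             # e.g., inspect the wrist joints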
Kucherenko et al. [
112] presented a novel speech-input and gesture-output DNN framework consisting of two steps. First, the network learns the lower-dimensional representation of human motion with a denoising autoencoder neural network. Then, an encoder network
SpeechE learns a mapping between speech and a corresponding motion representation. Kucherenko et al. [
112] applied representation learning on top of the DNN model to make learning from speech and speech-to-motion mapping easier. The objective evaluation compared the proposed network with the baseline BLSTM model presented in Hasegawa et al. [
85] using APE [117] and Motion Statistics as metrics for the average distance between the generated and original motion as well as the average values and distributions of acceleration and jerk, respectively. The proposed model achieved better results compared to the baseline and demonstrated the plausibility of the generated gestures. A further validation of the results through a user study confirmed the model's performance in terms of producing natural gestures.
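The two-step idea—first learning a compact motion representation, then mapping speech onto it—can be sketched as follows; the layer sizes and the simple MLP encoders are illustrative assumptions, not the original architecture:

```python
import torch
import torch.nn as nn

# Two-step sketch: (1) learn a compact motion representation with a denoising
# autoencoder, (2) train a speech encoder to predict that representation.
motion_dim, repr_dim, speech_dim = 45, 16, 26

motion_enc = nn.Sequential(nn.Linear(motion_dim, 128), nn.ReLU(), nn.Linear(128, repr_dim))
motion_dec = nn.Sequential(nn.Linear(repr_dim, 128), nn.ReLU(), nn.Linear(128, motion_dim))
speech_enc = nn.Sequential(nn.Linear(speech_dim, 128), nn.ReLU(), nn.Linear(128, repr_dim))

motion = torch.randn(64, motion_dim)
speech = torch.randn(64, speech_dim)

# Step 1: denoising reconstruction objective for the motion autoencoder.
noisy = motion + 0.1 * torch.randn_like(motion)
recon_loss = nn.MSELoss()(motion_dec(motion_enc(noisy)), motion)

# Step 2: map speech to the (frozen) motion representation.
with torch.no_grad():
    target_repr = motion_enc(motion)
mapping_loss = nn.MSELoss()(speech_enc(speech), target_repr)

# At synthesis time, motion_dec(speech_enc(new_speech)) produces motion.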
Ginosar et al. [
70] presented a model based on a Convolutional Neural Network with a Generative Adversarial Network (CNN-GAN) and log-mel spectrogram input, which can predict and generate hand gestures from a large dataset of speech audio [
70]. For gesture representation, the authors used skeletal keypoints corresponding to the neck, shoulders, elbows, wrists, and hands, which were obtained through OpenPose [
24]. The network learns to map speech to gesture using L1 regression, while the adversarial discriminator
D ensures that the produced motion is plausible. Using the L1 Regression Loss and percent of correct keypoints (PCK) [
225] as objective evaluation metrics, it was discovered that the proposed model outperformed an RNN-based baseline [
176] in gesture generation. Besides, the extent to which the produced gestures were convincing was measured through a perceptual study applying the percentage of the generated sequences, labelled as real, as a metric. The result of the comparison between fake (produced by an algorithm) and real pose sequences did not display any statistical significance.
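A PCK-style metric can be sketched as follows; the threshold and keypoint layout are illustrative assumptions:

```python
import numpy as np

# PCK-style metric: fraction of predicted keypoints that fall within a
# distance threshold of the ground-truth keypoints.
def pck(pred, gt, threshold):
    # pred, gt: (frames, keypoints, 2) image-space coordinates.
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float((dists < threshold).mean())

pred = np.random.rand(100, 10, 2) * 256              # 100 frames, 10 keypoints
gt = pred + np.random.randn(100, 10, 2) * 5
print(pck(pred, gt, threshold=10.0))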
Yoon et al. [
228] deployed a Bi-directional RNN model consisting of an encoder and decoder for co-speech gesture generation from speech text input. More specifically, the encoder takes the input text, while the decoder RNN with pre- and post-linear layers generates gestures. The model was trained on the TED Gesture Dataset [
228] to produce four common types of gestures—iconic, metaphoric, deictic, and beat gestures—from both trained and untrained speech texts. A gesture is represented as a sequence of human poses, namely joint configurations of the upper body. As for the speech text, it is represented as a sequence of words, and each word is encoded as a one-hot vector that indicates the word index in a dictionary. The results indicated that anthropomorphism and speech-gesture correlation were the most crucial factors for participants' perception of the generated gestures, as demonstrated in the subjective evaluation. The results also showed significant improvements over the three baseline methods, measured with BLEU [149]. Because the study used only speech text, the coupling of gestures with audio was weak, which could be improved by adding audio input.
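The encoder-decoder text-to-gesture setup can be sketched as follows; vocabulary size, pose dimensionality, and the simple GRU architecture are assumptions for illustration and do not reproduce the original network:

```python
import torch
import torch.nn as nn

# Encoder-decoder text-to-gesture sketch: word indices in, a sequence of
# upper-body poses out. Vocabulary and pose sizes are assumed.
vocab_size, emb_dim, hidden, pose_dim = 5000, 64, 128, 10 * 3

embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
decoder = nn.GRU(pose_dim, 2 * hidden, batch_first=True)
to_pose = nn.Linear(2 * hidden, pose_dim)

words = torch.randint(0, vocab_size, (4, 12))        # 4 sentences, 12 words each
enc_out, _ = encoder(embed(words))
context = enc_out[:, -1, :]                          # simple sentence summary

# Autoregressive decoding: each step consumes the previously predicted pose.
poses, prev = [], torch.zeros(4, 1, pose_dim)
h = context.unsqueeze(0).contiguous()                # initial decoder state
for _ in range(30):                                  # 30 output pose frames
    out, h = decoder(prev, h)
    prev = to_pose(out)
    poses.append(prev)
gesture = torch.cat(poses, dim=1)                    # (4, 30, pose_dim)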
Ferstl et al. [
63] attempted to map speech to 3D gestures through training networks with multiple adversaries to generate co-speech gestures. The authors extracted MFCC and pitch emphasis (F0) from the recorded speech and used upper-body joint positions to represent the gestures. The model architecture consists of a two-layer recurrent network composed of Long Short-Term Memory [
90] cells and a feed-forward layer for input processing. Moreover, a GRU [
32] propagates the input to speed up training when producing the joint positions. The novelty of the model lies in training the recurrent network with multiple generative adversaries instead of a standard regression loss. Drawing on the objective evaluation, measured by the accuracy of the binary cross-entropy objective for each discriminator, the authors report that each discriminator effectively solves a distinct sub-problem of the gesture generation task.
Tuyen et al. [
194] employed a conditional extension of the Generative Adversarial Network [
72] with an additional input condition. The GAN network includes convolutional Generator (G) and Discriminator (D) networks. Altogether, the model generates communicative gestures by synthesizing the verbal content of speech. Here the gestures were represented as human joint configurations. The objective evaluation was carried out through covariance with temporal hierarchical construction [
95]. Overall, the results illustrated the successful training of the model to imitate hand gestures that corresponded to the meaning of an utterance, which matched the iconic gestures by definition [
134].
Lee et al. [
118] introduced a temporal neural network, trained with Inverse Kinematics (IK) loss to generate finger motions and hand gestures taking upper-body joint angles and audio as input from a multimodal
16.2-million-frame (16.2M) dataset [
118], created alongside the model. The audio features included frequency (e.g., pitch, jitter), energy, amplitude (e.g., shimmer, loudness), and spectral features. The IK was applied to LSTM [
90], Variational Recurrent Neural Network [
35], and Temporal Convolutional Network (TCN) [
198] to incorporate kinematic structural knowledge. The ablation study results demonstrated the advantages of the IK loss function over a joint-angle loss, whereas the subjective evaluation yielded positive results with respect to the proposed model and its capability to generate natural, humanlike finger gestures.
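The intuition behind a position-based (IK-style) loss, as opposed to a plain joint-angle loss, can be sketched with a toy planar two-link arm; the forward-kinematics chain here is an illustrative stand-in for a real skeleton:

```python
import torch

# Toy illustration of a position-based loss: run predicted joint angles
# through a differentiable forward-kinematics chain (a planar two-link arm
# here) and penalise end-effector position error instead of raw angle error.
def forward_kinematics(angles, l1=0.3, l2=0.25):
    shoulder, elbow = angles[:, 0], angles[:, 1]     # (batch,) each, in radians
    x = l1 * torch.cos(shoulder) + l2 * torch.cos(shoulder + elbow)
    y = l1 * torch.sin(shoulder) + l2 * torch.sin(shoulder + elbow)
    return torch.stack([x, y], dim=-1)               # "wrist" positions

pred_angles = torch.randn(16, 2, requires_grad=True)
gt_angles = torch.randn(16, 2)

angle_loss = torch.mean((pred_angles - gt_angles) ** 2)
position_loss = torch.mean((forward_kinematics(pred_angles)
                            - forward_kinematics(gt_angles)) ** 2)
position_loss.backward()
# The position loss incorporates the skeleton's structure: errors in joints
# higher up the chain are weighted by their effect on end-effector positions.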
Table
3 presents the summary of the corpora and evaluation metrics employed in the studies above. The majority of studies relied on both objective and subjective evaluation criteria, while a few studies either used objective [
194] or subjective evaluation criteria [
96,
228]. To sum up, the works reviewed here demonstrate the prevalence of speech input data among data modalities used for hand gesture generation. Modelwise, recent research [
63,
85] shows a comprehensive exploration of recurrent networks to capture the dynamics of human motion, which excel at solving gesture generation tasks. That being said, an omnipresent limitation of such models lies in the dearth of gesture-rich datasets required to enable a robot to produce a wide range of hand gestures as opposed to certain predefined gestures produced with sparse datasets [
29]. Interestingly, the training and test sets used in Reference [
29] seem questionable when compared with the training and test set sizes used in other works. The following section therefore reviews the existing state of the art on models that consider other body parts along with the hands and hence output multimodal behaviours.
6 Multimodal Gestures
In this survey, we use the term multimodal gestures to refer to the multimodality of the output. In particular, we follow the interpretation of multimodal output by Rojc et al. [
160], who emphasized the importance of synchronisation of generated non-verbal gesture types (facial expressions, head, hands, and body) with verbal (speech audio or video) in an attempt to make the interaction more natural and fluent. Therefore, the generation of such multimodal outputs as
head and facial movements synchronized with speech [
26,
48,
58,
132] or body behaviours involving
shoulder and torso along with
facial movements [
31,
49,
113] accompanied by speech will be discussed in this section.
An audiovisual model by Mariooryad and Busso [
132] relied on three joint Dynamic Bayesian Networks (jDBNs) to generate facial gestures, involving head and eyebrow movements by mapping the acoustic speech data from the IEMOCAP database [
20] to Facial Animation Parameters [
145]. The model was trained by adapting the algorithms used for HMM and Factorial Hidden Markov Model (FHMM) [
68]. Using the CCA [
44,
83], the joint DBN model was compared to similar models used to synthesize head and eyebrow motions separately. Overall, the objective evaluation results revealed that the jDBN models can cope with speaker variability, while the subjective results showed an increase in the quality of jointly modeled eyebrow and head gestures as well as their naturalness.
Ding et al. [
48] proposed an animation model of a virtual agent, based on a fully parameterized HMM, which produces head and eyebrow movements in synchronisation with speech. As an extension of the contextual HMM, in Fully Parametrized Hidden Markov Model (FPHMM) [
216], contextual variables control and parametrize the means, covariance matrices, transition probabilities as well as initial state distribution. The model was evaluated objectively and subjectively on the Biwi 3D AudioVisual Corpus of Affective Communication database [
60], considering facial motion and speech features. An objective evaluation, compared with the baseline proposed in Reference [
132] using the MSE [
6], demonstrated the best performance by the HMM-based joint model. Overall, the proposed model demonstrated an ability to capture the link between speech prosody and head and eyebrow motions. Subjectively, however, the perceptual questionnaire struggled to validate the objective evaluation, as the results were only marginally significant and showed nearly identical performance in terms of expressiveness.
Ding et al. [
49] presented a multimodal behaviour generation model based on the contextual Gaussian model and a Proportional-Derivative controller. They leveraged the AVLaughter database [
196] for producing multiple outputs (lip, jaw, head, eyebrow, torso, and shoulder motions) synchronized with laughter audio. Using the pseudo-phonemes and speech features as input, motion synthesis was carried out in three steps: first, the lip and jaw motions were synthesized by a contextual Gaussian module; second, speech features were extracted for predicting head and eyebrow movements, and, consequently, torso and shoulder motions were synthesized from the previous step of synthesis by concatenation. The sophisticated subjective evaluation of the generated laughter and bodily behaviours, using a questionnaire adapted from Reference [
143] and Likert-scale ratings, showed that users preferred an agent producing synchronized speech and laughter animations.
Chiu and Marsella [
31] introduced a combined model to learn a twofold mapping: from speech to a gestural annotation using Conditional Random Fields and from gestural annotation to gesture motion by applying Gaussian Process Latent Variable Models [
208]. The model was subjectively evaluated against the approach in Reference [
29], which used direct mapping. The subjective evaluation was followed up by an objective assessment to establish the performance of the model against support vector machines [
42]. As a result, the proposed method performed significantly better at generating gestures and coupling them with speech, despite the limitation that the inference model requires temporal information.
Fan et al. [
58] discussed the use of deep Bi-directional Long Short-Term Memory (DBLSTM) [
232] to model the temporal and long-range dependencies of audio/visual stereo data for a photo-real talking head animation from audio, video, and text input. To train the network, the study used the back-propagation-through-time algorithm [
214,
215]. The study demonstrated the advantage of stacking two BLSTM layers on top of one feed-forward layer on the tested datasets. As a result of objective (RMSE [
73,
162,
209] and CORR [
215]) and subjective evaluation (A/B preference test [
108]), the proposed deep BLSTM model showed higher performance compared with the previous HMM-based approach.
Li et al. [
123] adopted a deep DBLSTM [
232] recurrent neural network as a regression method to generate audiovisual animation of an expressive talking face. This method was devised to overcome the shortcomings of the previous state-of-the-art models in incorporating lip movements with emotional facial expressions. Thus, Li et al. [
123] proposed five methods based on DBLSTM trained using a large corpus of neutral data and a smaller scale corpus of emotional data. Specifically, in method (a), the DBLSTM network is trained with emotional corpus only; methods (b) and (c) capture neutral and emotional information simultaneously by training a single DBLSTM network; while methods (d) and (e) capture neutral information by a separate DBLSTM network in addition to emotional DBLSTM. To evaluate the proposed approaches, the authors adopted RMSE between the predicted Facial Animation Parameters and ground truth. This revealed how different regression models worked for different emotions. Notably, information from the neutral dataset was found more valuable for peaceful expressions (e.g., sadness) than exaggerated expressions (e.g., surprise and disgust). A further framewise comparison of RMSE values displayed the effectiveness of the proposed methods in modelling the interaction between emotional states, facial expressions, and lip movements. Finally, the subjective evaluation results confirmed the effectiveness of using the neutral dataset as it can improve the performance of an expressive talking avatar.
Suwajanakorn et al. [
183] used recurrent neural networks to learn the mapping from raw audio input (MFCC audio features) to lip landmarks (PCA), synthesizing lip textures and then merging them into the 3D face to output a realistic talking head with clear lip motions synced with the input audio. The network consisted of LSTM nodes and was trained using backpropagation through time with 100 timesteps. When compared against AAM approach [
41] and Face2Face algorithm [
191] in an objective evaluation, the proposed method synthesized cleaner and more convincing lip movements.
Chung et al. [
37] proposed an encoder–decoder CNN-based Speech2Vid model, taking still images and audio speech segments to output a video of the face, including lips synchronized with the audio. The architecture comprises three modules—an audio encoder, an identity encoder, and an image decoder—which were trained together. Learning the joint embedding of the target face and speech segments is central to this approach in generating a talking face. Evaluations, conducted to qualitatively assess the output using alignment and Poisson editing [150] techniques, confirmed the ability of Speech2Vid to generate videos of talking faces with given identities.
Chen et al. [
26] developed a method that takes speech audio and one lip image of a target identity as input and generates an output of multiple lip images with the accompanying speech audio. The model is designed by combining correlation networks with an audio encoder and an optical flow encoder, implemented on 3D RNN to mitigate delayed correlation problems. The generated lip movements were evaluated quantitatively and qualitatively on the GRID [
40] corpus, LRW [
36], and LDC [
157] dataset, not used previously for training purposes, as well as with different metrics—Landmark Distance Error (LMD), CPBD [
140], SSIM, and PSNR [
213]. The proposed model generated realistic lip movements and proved robust to view angles, lip shapes, and facial characteristics. However, the main limitation stems from learning from a single image, which resulted in difficulties in capturing lip deformations.
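The image-level metrics mentioned here can be computed, for example, with scikit-image; in the sketch below random arrays stand in for generated and reference video frames:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# PSNR and SSIM computed between a generated and a reference frame
# (random arrays stand in for real video frames).
generated = np.random.rand(128, 128)
reference = np.clip(generated + 0.05 * np.random.randn(128, 128), 0, 1)

psnr = peak_signal_noise_ratio(reference, generated, data_range=1.0)
ssim = structural_similarity(reference, generated, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")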
Plappert et al. [
153] introduced a model based on deep RNNs, and sequence-to-sequence learning [
182], which learns a bi-directional mapping between whole-body motion and natural language. One model is fed the encoded motion sequences obtained from motion capture recordings during training, and the other is trained on natural language descriptions to generate whole-body motions. Based on the quantitative comparison with the baseline model, the language-to-motion model demonstrated the capability of generating proper human motion, achieving higher performance rates. The performance of the model was also measured by BLEU scores [
149], which suggested minimal overfitting and an ability to generalise to previously unseen motions. Overall, the model showed a capability to generate whole-body motions given proper descriptions in natural language.
Alexanderson et al. [
5] adapted a deep learning-based MoGlow [
87] for a probabilistic speech-driven model to output full-body gestures synced with speech. In particular, normalising flows are used, in the same way as in GANs, to generate output through a nonlinear transformation of latent noise variables. Four models were trained in a speech-only condition, while another four were additionally conditioned on style control. The model was compared against three baselines taking the same speech representation as input: a unidirectional LSTM [
90], CVAE [
77], and the audio-to-representation system [
112]. While the subjective evaluation of the style-control experiment yielded significant results in favour of the MoGlow-based model for the human-likeness of the gesticulation, the model trained on speech only achieved better results than the second baseline.
Dahmani et al. [
43] used a conditional generative model based on a VAE framework for expressive text-to-audiovisual speech synthesis. The proposed model learns from textual input, which provides the VAE with embedded representation to further capture emotion characteristics (Figure
4). Although the experimental results showed a high recognition rate for almost all emotions in the audiovisual animations, sadness and fear turned out to be the hardest for participants to recognize. According to the authors, this is explained by the role of the upper part of the face and constitutes a potential limitation of the study. Overall, as illustrated by the subjective evaluation results, the model performed well in terms of producing nuances of emotions as well as generating emotions beyond those present in the database.
Kucherenko et al. [
113] presented a deep learning-based model that takes audio and text transcriptions as input data to generate arbitrary (metaphoric, iconic, and deictic) and semantically linked upper-body gestures together with speech for virtual agents. The model was evaluated on The Trinity Speech-Gesture Dataset [
62] using the RMSE, acceleration and jerk, and acceleration histograms as objective metrics. A binomial test was used for the analysis of data obtained from the perceptual questionnaire and attention check. Altogether, the evaluations demonstrated a preference for the proposed model (no PCA) over the CNN-GAN model introduced by Ginosar et al. [
70] in terms of human-likeness and speaker reflection. The evaluation results also highlighted the efficacy of the multiple modalities used to train the model.
Yoon et al. [
227] discussed an end-to-end model that takes speech text, audio, and speaker identity to generate upper-body gestures, co-occurring with speech and its rhythm. The proposed method is based on Bi-directional GRU [
32] along with recurrent neural networks used for encoding three different input modalities. The ablation study demonstrated that all three modalities had a positive effect on the generation of gestures. Overall, the proposed model performed well as identified by a novel objective evaluation metric called Fréchet Gesture Distance (FGD) [
88], a subjective user study, and a comparison to other state-of-the-art models. Despite the superiority of the proposed model over the baselines, its main disadvantage remains the demand for a large dataset, as the generated motion quality and upper-body gestures were limited by the dataset used in the study. Additionally, the gesture generation process lacks controllability. A further limitation concerns the FGD, which mixes motion quality and diversity into a single measurement, making them difficult to analyse separately.
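The Fréchet-distance computation underlying FGD-style metrics can be sketched as follows; random vectors stand in for the embeddings that a pretrained gesture feature extractor would produce:

```python
import numpy as np
from scipy.linalg import sqrtm

# Fréchet distance between two sets of gesture embeddings (random vectors
# stand in for features from a pretrained gesture feature extractor).
def frechet_distance(feat_real, feat_gen):
    mu_r, mu_g = feat_real.mean(0), feat_gen.mean(0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):       # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2.0 * covmean))

real_features = np.random.randn(500, 32)
generated_features = np.random.randn(500, 32) + 0.2
print(frechet_distance(real_features, generated_features))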
Ahuja et al. [
3] presented a Mixture-Model guided Style and Audio for Gesture Generation (Mix-StAGE) model that trains a single model for multiple speakers while learning unique style embeddings for each speaker’s gestures in an end-to-end manner. A novelty of Mix-StAGE is to learn a mixture of generative models that allows for conditioning on the unique gesture style of each speaker. The model used a TCN module for both content and style encoders. It is trained on a custom-made dataset PoseAudio-Transcript-Style designed specifically for this work. In the experimental study, the Mix-StAGE model was compared against existing baselines capable of generating similar co-speech gestures (i.e., single speaker models Speech2Gesture [
70] and CMix-GAN, and multi-speaker models MUNIT [
92], and StAGE). The results of the objective evaluation revealed that the Mix-StAGE model significantly outperformed the state-of-the-art approaches for gesture generation and provided a path toward performing gesture style transfer across multiple speakers. Perceptual studies also showed that the generated animations by the proposed model were more natural whilst being able to retain or transfer style.
Wang et al. [
210] introduced an integrated speech and gesture synthesis (ISG) model based on a deep learning architecture that synthesizes the two modalities within a single model, compatible with both social robots and ECAs. The proposed model is adapted from Tacotron 2 [
174] and Glow-TTS [
102], with Tacotron 2 being auto-regressive and non-probabilistic and Glow-TTS being parallel and probabilistic, and takes text as input to generate speech and gesture. Subjective tests performed separately for each modality demonstrated that one of the proposed ISG models (ST-Tacotron2-ISG) performs comparably to the current state-of-the-art pipeline system while being faster and having much fewer parameters.
Huang et al. [
93] proposed a fine-grained Audio-to-Video-to-Words framework, called AVWnet, which is designed to produce videos of a talking face in a coarse-to-fine manner and maintain audio-lip motion consistency. The framework consisted of a treelike structure and a GAN-based [
72] neural architecture for synthesizing realistic talking face frames directly from audio clips and an input image. The GAN framework is conditioned on image features to enable further fusion of facial features and audio information in generating the face video. Compared with the state-of-the-art approaches [
27,
37], AVWnet excelled on all three adopted metrics (SSIM, PSNR, and LMD) and datasets in the objective evaluation. A comparison of the proposed model with the model by Chen et al. [
27] through perceptual user study revealed the former to be as good as the existing model.
Zhou et al. [
236] presented a model that learns from disentangled audio-video representations to generate a talking face corresponding to speech. Both talking video and audio were used to train the Disentangled Audio-Visual System (DAVS). The DAVS network demonstrated several advantages over the previous baseline [
36], which encompass improved lip-reading performance, the unification of audio-visual speech recognition and synchronisation in an end-to-end framework, and high-quality, temporally accurate talking face generation, as shown by both a subjective user study and verification with PSNR and SSIM [
213].
Sadoughi and Busso [
166] demonstrated a Constrained Dynamic Bayesian Network (CDBN) [
132], to overcome the individual limitations of rule-based and data-driven approaches in gesture generation. The authors aimed to build a generative model to produce believable hand gestures along with head gestures with bimodal audio-speech and video data synchronisation. The model was evaluated by two objective metrics: CCA [
21,
83] and log-likelihood rate (LLR) [
136]. Based on the results of the subjective evaluation, the CDBN model was perceived to generate more appropriate and natural gestures than the baseline models. Overall, the hand gestures generated by the constrained model showed 85% accuracy for certain types of gestures.
Vougioukas et al. [
206] discussed the GAN-based talking face generator, consisting of a temporal generator and multiple discriminators, which takes a single image and raw audio signals as input. The quality of the generated video output was evaluated on the GRID [
40] corpus, TCD TIMIT [
84] corpus, CREMA-D [
23], and LRW [
36] datasets by applying reconstruction (Peak Signal-to-Noise Ratio and Structural Similarity [
213]), sharpness (cumulative probability blur detection (CPBD) measure [
139]), content (ACD [
193] and word error rate (WER)), and audio-visual synchrony metrics. When assessed subjectively, the results of a Turing test showed the naturalness of the generated faces. Moreover, compared to baselines [
37,
183], the model demonstrated an ability to not only capture and maintain identity but also generate facial expressions matching the speaker’s tone and speech.
Sinha et al. [
177] approached the generation of identity-preserving and audio-visually synchronized 2D facial animation through a GAN, utilizing DeepSpeech features, given an audio input of speech, and facial landmarks from benchmark corpora such as GRID [
40] and TCD-TIMIT [
84]. The same objective evaluation metrics as in Reference [
26] were used in the study. Moreover, a qualitative evaluation compared the model with the state-of-the-art baselines of Reference [
26], Reference [
206], and Reference [
236]. These evaluations yielded overall positive results regarding identity preservation, superior image quality and texture clarity, and smooth audio-visual synchronisation.
Tables
4 and
5 summarize the state of the art in multimodal gesture generation, concerning the corpora and evaluation metrics used. Even though studies emphasize objective evaluation as a challenging task, the existing literature shows effective and nuanced exploitation of objective metrics along with subjective ones. Note that objective metrics are often the same as the cost functions used to optimise the generative models, with authors assuming that optimising the cost functions equates with improving the model’s performance. However, for now subjective measures remain the gold standard for assessing the quality of the generated behaviour and this is recognised across the field.
8 Outlook
It is clear that data-driven methods relying on connectionist architectures are an important and perhaps definitive answer to the question of how to generate humanlike communicative behaviour. Never before have models produced such rich and varied behaviour without the need for explicit programming. However, there are a number of challenges that still face the relatively young field of data-driven behaviour generation.
Multimodal behaviour generation. Most models take a single signal and map it onto a modality: text to speech, emotion to facial expression, and speech to gesture. However, in human-to-human communication all modalities are intertwined: emotion colours speech and gestures, gestures have an impact on speech, context influences eye gaze, and so on. The fact that communication is a highly interdependent process is glossed over in current data-driven generation methods, for obvious reasons. Still, in future systems we would expect more modalities to be taken into consideration. In the speech generation community, for example, emotion has long been the subject of study, and research systems are able to generate speech modulated by emotion. The flipside, however, is that a data-driven approach will then need more data. Already the amount of data required to train systems is expensive to collect for two connected modalities; adding other modalities is likely to increase the size of the required training data exponentially. How this will be overcome is as yet unclear.
Dyadic and multiparty communication. The large majority of data-driven models do not take the receiver into account. Instead they are trained to produce communicative behaviour as if it concerned a monologue in which the receiver of the message does not respond. In human-to-human communication, most interactions are multiparty interactions and our communicative behaviour is finely tuned to the reactions and responses of others. We watch for signals showing understanding or misunderstanding, monitor for affective responses, and are sensitive to bids for turn-taking. All these elements are largely missing from current data-driven methods, as they are exclusively trained on data that does not take into account the interactive nature of communication. Again, it seems likely that more data could resolve this problem, but at the same time collecting this data comes at a great cost and might be beyond the means of most R&D labs.
Measuring quality of generated behaviour. Assessing the quality of generated behaviour relies on objective and subjective measures. Objective measures are the workhorse of data-driven methods, as they form the cost function against which the models are optimised. Unfortunately, these objective measures only weakly correlate with subjective measures (see for example Reference [
114]). Subjective measures, during which people (or simulated subjective raters) judge the quality of the generated behaviour, remain the gold standard in evaluation. However, using human raters is expensive and time-consuming, and as such subjective measures cannot be used during training, when many millions of evaluations are needed to drive the model ever closer to generating behaviour that is humanlike. Recent work on gesture generation showed that subjective measures are still better for measuring the quality of models, and that objective measures often fall short, as they only optimise a quantitative metric that is often a poor representation of qualitative assessment [
217,
219]. Simulated subjective raters might be a way forward, as in GAN models in which one part of the model is trained to discriminate between artificial and humanlike output, pushing the generated behaviour ever closer to being indistinguishable from human behaviour. Another challenge is the lack of common standards to evaluate models. Sometimes this is informed by the need to evaluate very specific elements of the generated behaviour, or because the accepted standard has outlived its usefulness. Benchmarks often form the focus of intense research investment and are often reached in just a few years, at which point they become useless as a target to aim for. Challenges, where different models are pitted against each other, have proven useful in this context—co-speech gestures for example have benefited from a series of challenges pushing the field, but also pushing the way in which models are evaluated [
114,
229].
Common datasets and evaluation methods. From the survey it appears that there are few common datasets on which models are trained and evaluated. Researchers and engineers prefer taking a pragmatic approach when choosing data to train and evaluate against. Factors such as availability, ease of use, feature availability, cost, and appropriateness for the task at hand are deemed important and are often used as a reason not to use datasets that have been used by others. One corollary is that the field would benefit from agreed datasets and evaluation standards, something that already happens for some modalities (such as speech synthesis) and is slowly being adopted for other modalities (such as gesture generation [
114]).
Semantics of multimodal communication. Communication serves to change the mind of others. As such, any communicative act carries semantics. However, this is usually glossed over in data-driven models. In some cases, this is not too much of a problem. Speech generation, for example, generates speech from text. Text has a well-agreed notation and speech generation maps this orthography to sound. However, speech generation is largely context-free, and the production of humanlike speech is possible without requiring much access to the semantics of the text and without access to the internal affective state of the agent. For the exceptions to this, the context of the neighbouring text is sufficient to disambiguate the required speech sounds. For example, disambiguating "bass" as a fish (/bæs/) or a musical instrument (/beɪs/) can often be done by relying on other words nearby. Other modalities are different in that what they convey is tightly linked with affect, emotion, and the semantics of the message. Current data-driven methods do not have access to these, and while the models can, with sufficient data, pick up semantic correlations, the training cost at which this comes is prohibitive.
Fine tuning models. One promising benefit of data-driven neural models is the potential for fine-tuning (also known as transfer learning) of a pre-trained model. In this approach, a model is first trained on a large amount of data, and training then continues, often on a smaller dataset, so that the pre-trained model becomes more relevant for a specific task. While few behaviour generation models have been made available for fine-tuning, the practice is already well established in other fields, such as Large Language Models, where models can be relatively easily fine-tuned for other language-based generative tasks (e.g., Reference [
233]).
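The fine-tuning pattern described here can be sketched as follows; the model, checkpoint name, and choice of frozen layers are hypothetical stand-ins rather than any specific published behaviour-generation model:

```python
import torch
import torch.nn as nn

# Fine-tuning pattern: load a pre-trained model, freeze its early layers, and
# continue training the rest on a smaller task-specific dataset. The model and
# checkpoint below are hypothetical stand-ins.
model = nn.Sequential(
    nn.Linear(26, 256), nn.ReLU(),          # "early" layers: generic encoding
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 45),                     # "late" layers: task-specific output
)
# model.load_state_dict(torch.load("pretrained.pt"))   # hypothetical checkpoint

for param in model[0].parameters():         # freeze the first layer
    param.requires_grad = False

optimiser = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)   # small LR

speech, motion = torch.randn(32, 26), torch.randn(32, 45)   # small new dataset
loss = nn.MSELoss()(model(speech), motion)
optimiser.zero_grad(); loss.backward(); optimiser.step()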
Hardware does not match the dynamics of software-generated behaviour. Most social robots rely on actuation technology, such as electric motors and planetary gears, which does not offer the velocity, acceleration, and jerk typically seen in the human body. This leads to multimodal social behaviour that appears unnaturally slow. Some solutions exist: some robots, such as Keepon, rely on simpler, smaller, and lighter bodies that allow low-cost actuators to generate high-velocity dynamics. Others, such as EngineeredArts' Ameca or RoboThespian animatronic robots, rely on alternative actuation technology, often using pneumatics, to produce high-velocity animations matching human dynamics. However, humanlike dynamics are for the moment still out of reach for most commercial and research social robots.
Despite these challenges, data-driven methods for the time being look to be the way forward. But to achieve near-human multimodal behaviour, a number of important obstacles will need to be overcome. One striking observation is that a developing child does not have access to thousands or perhaps millions of hours of training opportunities. Instead, children learn to interact multimodally through a combination of observation, online learning, and innate biases and constraints. This combination allows them to become skilled multimodal communicators in just a few short years. Perhaps future data-driven models should, instead of taking a tabula rasa approach, also start with biases and constraints to make the training process more efficient.