1. Introduction
In recent years, Human–Machine Interfaces (HMI) have been introduced into numerous applications, including prosthesis control, robotic arm control, exoskeletons, smart wheelchair control, smart environment control, and exergaming [1]. The goal of an HMI is to transcribe user movements or movement intention into the desired action, thus allowing for an effortless and natural interaction between the user and the machine. Therefore, several types of sensing modalities, both of invasive and non-invasive natures, have been developed for monitoring specific user activities, such as hand, eye, limb, and joint movements. These sensing modalities measure different kinds of physiological signals that can be classified into three main categories: bio-potentials, muscle mechanical motion, and body-motion signals [2]. In order to accomplish their goal, HMIs leverage a wide variety of algorithms, ranging from simple thresholding to complex machine learning algorithms, for processing these different sensing modalities.
In the medical field, HMIs are introduced in rehabilitation and assistive technologies. Rehabilitative technologies aim to restore impaired motor function in individuals with motor disabilities (within the limits of each individual's disability) in order to gradually enable re-participation in activities of daily living, whereas assistive technologies attempt to allow an individual with motor disabilities to perform motor functions that are beyond their motor capabilities. In medical HMIs, bio-potentials such as the electroencephalogram (EEG) [3], electromyogram (EMG) [4,5], and electrooculogram (EOG) [6]—generated by electrical sources within the human body and thus reflecting the function of organs by means of electrical activity—have been extensively utilized. Furthermore, there is an ongoing interest in signals that monitor gross muscle motion, such as force myography (FMG) [7] and mechanomyography (MMG) [8], and muscle–tendon movement, such as electrical impedance tomography (EIT) [9] and medical ultrasound [10]. Hybrid HMIs have also been developed [11,12,13], which exploit the complementary information of different sensing modalities and thus allow for improved control, but come at the expense of increased HMI complexity.
Among all biological signals, the most widely employed sensing modality is surface electromyography (sEMG), a bio-potential that directly measures the electrical activity generated during voluntary contraction; sEMG signals can be easily acquired non-invasively and provide an intuitive control strategy for reproducing the function of a biological limb [2]. On the other hand, ultrasound-based (US-based) HMIs remain vastly underexplored compared to their sEMG counterparts, despite ultrasound sensing techniques providing a non-invasive framework for monitoring deep tissues inside the human body with high temporal resolution and sub-millimeter precision. In the context of US-based HMIs, two sensing modalities are commonly employed, namely, B-mode ultrasound and A-mode ultrasound [10]. Both modalities utilize a transducer made from piezoelectric crystals that is capable of both transmitting and receiving US waves [14]. In B-mode ultrasound, phased-array transducers are utilized to synthesize a 2D image of the human tissues, either through the combination of simultaneous emission of acoustic beams and software beam-forming or by sweep-time control of the acoustic beam; in A-mode ultrasound, the simplest US sensing technique, a single transducer element scans a line through the human body and the received echoes are plotted as a function of depth. Currently, research interest has shifted towards A-mode ultrasound sensing, as it has been demonstrated that using a set of sparsely selected scanlines instead of the full imaging array does not hinder the HMI's performance [15,16]. The benefits of utilizing a reduced number of scanlines, such as reduced computational complexity and power consumption as well as miniaturization of the instrumentation, have motivated the proposal of novel US acquisition systems [17] and recent advancements in the development of flexible, fully printed transducers targeting medical applications [18].
The aforementioned characteristics of ultrasound as a sensing modality can intuitively explain the superiority of US-based HMIs for simultaneous proportional control compared to their sEMG counterparts, as has been shown in numerous works [19,20]. More recent works include a semi-supervised framework, featuring a sparse Gaussian process model and principal component analysis, for operating a prosthetic device with two degrees of freedom (hand grasp and wrist rotation) [21], and a novel portable 24-channel US armband system with a multiple-receiver approach, which allowed for simultaneous and proportional control of 3 DoF and 4 DoF in an online and offline setting, respectively [22]. Apart from simultaneous and proportional control, a promising application of US-based HMIs is hand gesture recognition. In [23], the authors used both a novel multi-task deep learning framework and a multi-output Gaussian process for the simultaneous estimation of wrist rotation angle and recognition of finger gestures. In a more recent work, Liu et al. [24] proposed an algorithm based on a Long Short-Term Memory framework for the recognition of handwritten digits (dynamic gesture recognition) based on A-mode ultrasound sensing.
However, several drawbacks hinder the use of the aforementioned HMIs in practical applications:
Collecting large datasets, a strict requirement for conventional ML/DL algorithms known for their data-hungry nature [25], may be feasible in laboratory conditions but remains impractical in real-life scenarios.
It is both cost- and time-inefficient to collect representative datasets and, thus, most datasets are well-suited only within a pre-defined context. For example, the performance of a classifier drops significantly when tested on new arm positions [26,27].
It is assumed that dataset samples are drawn i.i.d. from the same distribution, but in real-life scenarios, the test-time distribution quickly diverges from the distribution on which the model was initially trained. Indicative causes are muscle fatigue [28] and sensor donning and doffing [29].
The musculoskeletal differences between individuals hinder the interoperability of the HMI, leading to only subject-specific solutions [30].
In an effort to enhance the robustness of a US-based HMI, the authors of [31] examined the potential of adaptive learning using A-mode ultrasound sensing for mitigating the performance deterioration induced by muscle fatigue. They compared the performance of conventional adaptive ML algorithms with that of an adaptive convolutional neural network in which, in contrast to the former, both the feature extractor and the classifier part of the network had adaptation capabilities. To compare their performance, they instructed subjects to perform 15 gestures (their selection was inspired by the prominent NinaPro database [32], extensively used as a benchmark by the sEMG-based HMI research community) once each, which corresponds to one repetition, for a total of 16 repetitions. All 16 repetitions were performed without any rest in order to induce muscle fatigue. The first three repetitions were used for training, the fourth repetition was used for initial testing, and each of the remaining repetitions served as a separate testing phase. During each testing phase, the predictions as well as the embeddings of the test samples were retained in order to obtain both pseudo-labels and mean class embeddings, which were used to update the feature extractor and the classifier of the network separately. By updating their network, they were able to achieve a significant improvement in accuracy during the late stage of muscle fatigue. In their work, the classifier is updated in order to adapt to the test-time distribution of the data, which differs from the data distribution on which the classifier was initially trained.
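To make the adaptation mechanism concrete, the following sketch illustrates the general idea of pseudo-label-based re-calibration; the confidence threshold, the exponential-moving-average update, and all names are our own illustrative choices and do not reproduce the exact procedure of [31].

```python
import torch
import torch.nn.functional as F

def recalibrate_step(feature_extractor, classifier, test_batch,
                     class_means, threshold=0.9, momentum=0.1):
    """One hypothetical adaptation step: pseudo-label confident test
    samples and refresh per-class mean embeddings (a sketch, not the
    exact procedure of the cited work)."""
    with torch.no_grad():
        z = feature_extractor(test_batch)          # embeddings
        probs = F.softmax(classifier(z), dim=1)    # class posteriors
        conf, pseudo = probs.max(dim=1)            # pseudo-labels
        keep = conf > threshold                    # keep confident samples only

    # Update the running mean embedding of each class seen in this batch.
    for c in pseudo[keep].unique():
        batch_mean = z[keep][pseudo[keep] == c].mean(dim=0)
        class_means[c] = (1 - momentum) * class_means[c] + momentum * batch_mean

    # The confident (sample, pseudo-label) pairs can then serve as a
    # self-training signal for updating the feature extractor/classifier.
    return z[keep], pseudo[keep], class_means
```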
This gap between training and test-time distributions is encountered in various domains and is often referred to as concept drift in the literature [33,34]. Regarding US-based HMIs, one can easily identify cases where the concept drift is expected to be substantially larger than usual, e.g., multiple days of the HMI being unused, subjects with large anatomical differences, and multiple donnings and doffings of the sensors. All the aforementioned factors constitute a challenging environment for the development of robust US-based HMIs. Compared to the US-based HMI research community, the sEMG-based HMI research community has made significant efforts to mitigate this gap. For example, Côté-Allard et al. proposed a transfer learning (TL) framework—inspired by progressive neural networks and multi-stream adaptive batch normalization—that could take advantage of multiple small datasets, allowing models to generalize well to new subjects by utilizing a single repetition of each gesture [35]. Unfortunately, TL techniques require the user to manually annotate the data acquired for re-calibrating the HMI. A more appealing scenario is to be able to re-calibrate an HMI using newly acquired unlabeled data. This possibility is offered by Unsupervised Domain Adaptation (UDA) algorithms, which take advantage of an initial labeled dataset in order to adapt a classifier to new unlabeled data sampled from a similar but different distribution.
UDA algorithms have achieved remarkable results in computer vision tasks, and notable performance enhancements have also been demonstrated in the inter-session re-calibration of sEMG-based HMIs through Adaptive Batch Normalization (AdaBN) [36] and through the incorporation of domain-adversarial training in a self-re-calibrating neural network [37]. Their success and unique advantages make them a perfect candidate for exploration in different sensing modalities. In this work, we investigate the effectiveness of a wide variety of UDA algorithms in the re-calibration process of a US-based HMI. The application of UDA in a particular domain is not a straightforward process and requires experimentation to optimize the hyperparameters of the UDA algorithms, the design of the discriminator networks used in domain-adversarial training algorithms, and even the network architectures themselves in order to fully leverage their capabilities. For our purpose, we used the Ultra-Pro dataset [23], which, to the best of our knowledge, is the only publicly available dataset for this task. Compared to the induced muscle fatigue study [31], the Ultra-Pro dataset provides a challenging benchmark, since all gestures are performed with concurrent wrist rotation and there are no intermediate data to bridge the gap between different sessions of the same subject. To assess the performance of the UDA algorithms, we introduce an adaptation scheme, resembling the adaptation schemes introduced in the re-calibration of sEMG-based HMIs [37], in which the newly acquired data arrive at two different time frames, a shorter and a longer one relative to when the original labeled data were acquired, with the latter representing a more challenging scenario. Finally, we discovered that, with appropriate initialization, several UDA algorithms are capable of enhancing the performance of a US-based HMI compared to its non-re-calibrated counterpart. However, in a realistic scenario, where the input modality, the data acquisition protocols, the re-calibration period, or the pattern recognition algorithm may differ, it is more likely that the performance enhancements would be rather small or not achievable at all.
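As a concrete illustration of the domain-adversarial family explored in this work, the sketch below shows a minimal gradient reversal layer and a single DANN-style training step; the module names and the trade-off coefficient lam are placeholder assumptions, not the exact implementation evaluated here.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradients are multiplied by -lambda
    on the way back, pushing the feature extractor towards
    domain-invariant features."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def dann_step(feat, clf, disc, x_src, y_src, x_tgt, lam=0.1):
    """One hypothetical DANN update: task loss on labeled source data
    plus a domain-confusion loss on source and target features."""
    z_src, z_tgt = feat(x_src), feat(x_tgt)
    task_loss = F.cross_entropy(clf(z_src), y_src)

    # Domain labels: 0 for source, 1 for target.
    z_all = torch.cat([z_src, z_tgt])
    d_true = torch.cat([torch.zeros(len(x_src)), torch.ones(len(x_tgt))]).long()
    d_pred = disc(GradReverse.apply(z_all, lam))
    domain_loss = F.cross_entropy(d_pred, d_true)

    return task_loss + domain_loss
```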
The main contributions of the paper are the following:
A thorough examination of unsupervised domain adaptation algorithms for the re-calibration of US-based HMIs, with extensive guidelines for optimizing their performance in the field of US-based hand gesture recognition. We examine domain-adversarial algorithms such as Domain-Adversarial Training of Neural Networks (DANN) and its variant, Virtual Adversarial Domain Adaptation (VADA), which incorporates a penalty for cluster assumption violation into its optimization objective, as well as non-conservative UDA algorithms, in which source-domain performance is ignored, such as Decision-boundary Iterative Refinement Training with a Teacher (DIRT-T), which uses VADA as initialization, and SHOT, which is also a source-data-agnostic algorithm. To the best of our knowledge, this is the first time UDA algorithms have been applied to this field.
A new CNN-based architecture, featuring significantly fewer parameters than the state-of-the-art model for US-based simultaneous estimation of wrist rotation angle and prediction of finger gestures, suitable for a UDA setting without any performance degradation on the primary task.
A benchmark for performance comparison of different architectures on the Ultra-Pro dataset, simulating a realistic scenario where the HMI will need to be operable shortly after user data are collected.
Insights about the performance of each UDA algorithm, with DANN (a domain-adversarial training algorithm) offering an average performance enhancement and systematically improving the HMI's performance after re-calibration for all subjects and sessions. Unfortunately, we discovered that all the UDA algorithms examined are unable to fully restore the HMI's operability, even though the newly acquired re-calibration data were obtained from different within-day sessions, rendering them inappropriate for re-calibration purposes.
The rest of the paper is organized as follows. In Section 2, we introduce our methods, which include the different domain adaptation algorithms examined, as well as our performance comparison benchmark on the Ultra-Pro dataset and the proposed adaptation scheme for the re-calibration of US-based HMIs tailored to UDA algorithms. In Section 3, we introduce our proposed CNN-based architecture and explain why modifications were deemed necessary for its suitability in a UDA task. In Section 4, we describe our experimental setup, which includes a brief description of the Ultra-Pro dataset and training details for both single-task and multi-task settings. In Section 5, we provide our results regarding the re-calibration performance of the UDA algorithms, as well as the performance comparison of the proposed architecture with the state-of-the-art in both single- and multi-task settings. Finally, in Section 6, we conclude our work.
3. Proposed CNN Architecture
In this work, a new CNN architecture is proposed for the task of simultaneous wrist rotation angle estimation and finger gesture prediction, as well as for each task individually. The CNN architecture for the classification task is illustrated in Figure 1. The architecture consists of four distinct blocks, a fully connected layer (acting as a bottleneck), and a Softmax layer. Furthermore, each block B consists of a convolutional layer, followed by a pre-activation batch normalization layer [43], a max pooling layer, and a dropout layer with a fixed forget rate [44]. It is important to mention that pre-activation batch normalization is also included in the fully connected layer of the network. Finally, we used Leaky ReLU with a fixed negative slope as the activation function [45].
Following the work of [23], we also apply 1-D kernels, operating along the width dimension, with a stride equal to the size of the kernel in the width dimension and a stride of 1 in the height dimension. The sizes of the kernels for the convolutional layers of blocks B1, B2, B3, and B4 are 51, 23, 8, and 4, respectively. As in all other DNN architectures, the trainable parameters of our network are learned during training by updating them using gradient-based methods and the backpropagation of error algorithm. The error is associated with a loss function, which is estimated multiple times during training from a reduced subset of the initial dataset, referred to as a batch, instead of the whole dataset, in order to accelerate the training process. Finally, the addition of batch normalization layers, whose transformations rely on batch statistics, allows faster convergence of our network to desirable weights for a wider range of learning rates.
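For illustration, a block of this kind could be assembled in PyTorch as follows; the channel counts, pooling size, dropout rate, negative slope, and input dimensions are placeholder assumptions, since the exact hyperparameters are given in Figure 1 rather than in the text.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, k):
    """One block of the proposed architecture (a sketch): a 1-D kernel of
    width k sliding along the width dimension with a stride equal to its
    size, pre-activation batch normalization, Leaky ReLU, max pooling,
    and dropout. All hyperparameter values here are placeholders."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(1, k), stride=(1, k)),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(negative_slope=0.01),
        nn.MaxPool2d(kernel_size=(1, 2)),
        nn.Dropout(p=0.5),
    )

# Example: block B1 with the kernel size reported in the text (51);
# blocks B2-B4 would be built analogously with kernel sizes 23, 8, and 4.
b1 = conv_block(in_ch=1, out_ch=16, k=51)
x = torch.randn(8, 1, 8, 1020)   # hypothetical batch: (N, C, H, W)
print(b1(x).shape)               # torch.Size([8, 16, 8, 10])
```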
The motivation behind the proposal of a new architecture is twofold. First and foremost, we believe that by adopting design trends from models utilized in UDA settings, we encourage the formation of class-discriminative clusters and thus facilitate source- and target-distribution alignment [42]. The need for class-discriminative features is highlighted by the fact that the discrepancy between the two domains can be minimized by simply mixing samples from different classes, which inevitably leads to degradation of the classification performance [46]. The trends that we adopted in our network were the use of Leaky ReLU as well as dropout layers in the feature extractor part of the network, both of which are commonly used in network architectures for UDA in various fields [37,40]. The proposed architecture's ability to learn class-discriminative features is illustrated in Figure 2. Lastly, we believe that by enhancing the model's discriminative ability on the main task, i.e., the classification of finger gestures, we will also be able to achieve better adaptation results.
3.1. Improving the Original CNN Model
Our experimentation with the original CNN model [23] allowed us to identify two main drawbacks in its design. First and foremost, the original network is characterized by a small receptive field, which influences the quality of the extracted spatial features; secondly, the connections from the extracted spatial features to the first fully connected layer contribute most of the network's learnable parameters. The dimensionality of the original feature space is equal to 8192 (32 filters × 8 height × 32 width), and the first fully connected layer—in which neurons are trained to distinguish patterns in the extracted spatial features—features 48 neurons. Each neuron in the fully connected layer is connected to each individual feature, and the importance of each feature is expressed through a corresponding weight. Additionally, each neuron in the first fully connected layer also includes a bias term. This results in a total of 393,264 learnable parameters (8192 features × 48 neurons plus the corresponding 48 bias terms).
Based on our observations, we decided to experiment with the depth of the network. This design choice was based on the well-known fact that deeper convolutional neural network architectures are capable of capturing richer and more complex features of the data [47]. Increasing the depth of the network was also deemed beneficial due to the high resolution of our input data in the width dimension. Furthermore, by increasing the depth of the network, we also managed to address the second drawback we identified, as the total number of learnable parameters in the first fully connected layer is significantly reduced, since the proposed model's feature space is 256-dimensional compared to the 8192-dimensional feature space of the original model. By increasing the depth of the network and experimenting with different sizes of kernels and max pooling operators, we determined the appropriate depth and layer parameters for our network by training all the different models on the sessions of the Ultra-Pro dataset and evaluating their performance on the corresponding validation sets. Finally, the derivation of the proposed model was finalized by incorporating the aforementioned design techniques commonly employed in UDA settings.
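As a quick sanity check of such parameter counts, the learnable parameters of any candidate layer can be tallied directly; the sketch below uses the standard PyTorch idiom and assumes, purely for illustration, that the proposed bottleneck keeps the same 48 output neurons.

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Total number of learnable (trainable) parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# First fully connected layer of the original model: 8192 features -> 48 neurons.
fc_original = nn.Linear(8192, 48)
print(count_params(fc_original))   # 393264 = 8192 * 48 + 48

# The proposed bottleneck operates on a 256-dimensional feature space instead.
fc_proposed = nn.Linear(256, 48)
print(count_params(fc_proposed))   # 12336 = 256 * 48 + 48
```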
3.2. Computational and Time Complexity Analysis
The growing interest in DNN-based solutions is mainly due to two reasons: (a) advancements made in hardware, which allow us to train and deploy such models, and (b) the availability of data. However, recent research efforts have focused on deriving optimal architectures under user-defined constraints and, furthermore, on deploying them in resource-constrained devices such as mobile phones, drones, and robots [48,49,50]. Thus, we provide results for commonly used metrics, such as the number of floating point operations, multiply–accumulate operations, and direct memory accesses, to evaluate the computational efficiency of our proposed architecture, as well as its memory requirements for deployment (see Table 1). Furthermore, we also provide the same results for the original model.
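For reference, metrics of this kind can be reproduced with off-the-shelf profilers; a minimal sketch using the open-source thop package (one of several FLOPs/MACs counters, not necessarily the tool used in this work) might look as follows, with a placeholder input shape rather than the actual Ultra-Pro input dimensions.

```python
import torch
from thop import profile  # pip install thop; one of several MACs counters

model = b1                          # e.g., the sketch block defined earlier
dummy = torch.randn(1, 1, 8, 1020)  # placeholder input shape, not the real one

# thop reports multiply-accumulate operations (MACs) and parameter count;
# FLOPs are conventionally approximated as 2 * MACs.
macs, params = profile(model, inputs=(dummy,))
print(f"{macs / 1e6:.1f} MMACs, ~{2 * macs / 1e6:.1f} MFLOPs, {params:,.0f} params")
```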
According to [48], our proposed CNN architecture is suitable for deployment in resource-constrained devices, since the number of floating point operations required for a forward pass lies within the range of 10–150 MFLOPs. Furthermore, the measured inference times on both our GPU and our CPU are suitable for the targeted applications, since the suggested latency for prosthetic control lies within the range of 100 to 250 ms [51]. With our modifications, we are able to significantly reduce the model's size, though at the cost of increasing the number of floating-point operations, multiply–accumulate operations, and direct memory accesses.
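A simple way to check an implementation against the suggested 100–250 ms latency budget is to time repeated forward passes, as in the generic measurement loop below; this is a sketch, not the benchmarking protocol used for Table 1.

```python
import time
import torch

@torch.no_grad()
def mean_inference_ms(model, x, warmup=10, iters=100):
    """Average single-batch forward-pass latency in milliseconds."""
    model.eval()
    for _ in range(warmup):            # warm-up passes (caches, lazy init)
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()       # GPU kernels launch asynchronously
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / iters
```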
3.3. Real-World Applications of the Proposed CNN Architecture
Based on our computational complexity and memory resources analysis, our proposed architecture is rendered appropriate for applications targeting resource-constrained devices, offering the possibility of integration into a prosthetic socket or a wearable armband. The proposed CNN architecture could be integrated into an assistive technology, assisting amputees by transcribing their intended motions into commands and control signals for operating a prosthetic arm, and into a rehabilitative technology, enabling people with motor disabilities (e.g., neuromuscular disorders) to gradually restore motor functions within the limits of their disabilities through interactive means such as exergaming.
6. Conclusions and Future Work
Inspired by sEMG-based HMI research, we investigated the possibility of re-calibrating US-based HMIs using unlabeled data from different within-day sessions by employing state-of-the-art UDA algorithms. This is a more challenging task than making a classifier adapt to muscle fatigue [31], since no continuous stream of data is available for continual adaptation and the dataset is more demanding, featuring concurrent wrist rotation. For our experiments, we introduced an adaptation scheme on the publicly available Ultra-Pro dataset, which allows us to investigate the performance enhancements of UDA algorithms through two adaptation scenarios based on the time interval between the collection of the initial labeled dataset and the acquisition of the unlabeled data used for re-calibration. Our experimentation also led to the proposal of a new architecture, featuring significantly fewer parameters than the current state-of-the-art. For an effective performance comparison of the two architectures, we introduced a benchmark that simulates a realistic scenario, in which no validation set is available and the HMI needs to be operable shortly after user data are collected, while also avoiding any bias. Furthermore, we provide extensive guidelines for the re-calibration of US-based HMIs using UDA algorithms and draw important conclusions about their drawbacks and performance. According to our findings, our proposed CNN-based architecture achieves similar performance to the current state-of-the-art [23] while featuring only 50,582 trainable parameters instead of 401,382. Also, by using DANN, a domain-adversarial algorithm, we achieved an average performance enhancement and systematically improved the classification performance of the US-based HMI compared to its non-re-calibrated counterpart for all subjects and sessions. However, our results indicate that in cases where the data acquisition process, the re-calibration period, or the network architecture may differ and proper initialization of the UDA algorithms may not be feasible, the observed enhancements would be rather small or even not noticeable.
We believe that the findings of our work raise several research questions that need to be addressed. To begin with, UDA algorithms require careful initialization in order to fully leverage their capabilities and cannot fully restore the operability of a US-based HMI. These results indicate the need for an online learning algorithm capable of adapting in dynamic environments where test-time data continuously change in unforeseen ways. Furthermore, it is important to ensure that important information is not lost during the dynamic adaptation of the model, a common issue that is often referred to as catastrophic forgetting [58]. Moreover, most of the aforementioned techniques rely on the backpropagation of error algorithm to update the network's parameters, a non-local and computationally intensive learning rule. Based on our observations, our main research focus will be the development of an online continual learning algorithm, capable of adapting to dynamic environments using local learning rules. Finally, we plan to collect data in order to construct a representative dataset, featuring hand movements performed in activities of daily living and multiple sessions, for a better evaluation of the re-calibration performance of different algorithms.