1 Introduction
The current gold standard for smart devices is touch, and a wide variety of touch-enabled wearable devices and wrist-worn devices are becoming more ubiquitous, such as smartwatches,
1 smart jackets
2 and even smart rings.
3 These wearable and wrist-worn devices allow instantaneous command input due to their location on the wrist. Through continuous contact with the skin, embedded sensors (like IMUs, microphones, and pulse sensors) record users’ daily activities and monitor their health [
57]. However, touch screens and touchpads in wearable devices lack haptic feedback from distinct textures and surfaces, making them challenging to operate eyes-free.
The growing requirements in more ubiquitous scenarios continuously motivate researchers to design easy-to-use hand gestures and develop advanced sensing technologies [
67]. Among the many options for command input, hand gestures, such as unimanual gestures [
12,
40,
58,
75,
85], mid-air bimanual gestures [
23,
46,
54,
61,
71], and on-skin gestures [
9,
24,
75], require no tangible input devices and impose minimal cognitive load.
One method to address the limitation of distinct surfaces on a device is to allow a touch on the user’s skin surface [
7]. However, previous works on turning the skin into an interactive surface generally focused on the relatively flat areas of the body that are in proximity to hands, such as the palm or forearm [
28,
35]. In contrast, the back of the hand comprises distinct topographic features and surface textures, making it an ideal environment for interaction. This choice builds on the human ability to perceive the locations of body parts without looking at them, referred to as
proprioception [
55]. Previous works showed that utilising skin surfaces that are rich in unique landmarks (such as joints, bumps, wrinkles, tattoos, and veins) can improve input performance with virtual items [
8,
24,
35,
65]. One such topographically distinct region is the back of the hand.
Prior research [
28,
89] often requires an additional sensor on the dominant or pointing hand or relies on a camera-based system that is prone to occlusion, poor illumination, and privacy concerns. In other studies [
13], researchers have used an off-the-shelf smartwatch to detect hand gestures through an integrated
IMU (Inertial Measurement Unit) and a microphone. However, microphones are often susceptible to background noise and privacy concerns, whereas IMUs are vulnerable to motion noise and lack the spatial sensing capability to detect the variety of hand gestures needed for on-skin interaction.
Previous research has not extensively explored the notion of sensing gestures at the back of the hand using the conventional gesture paradigm for smartwatches, including gestures such as tapping, horizontal swiping, vertical swiping, and rotation of knobs [
16]. Earlier research that utilizes IMU shows promising results for tap gestures across the back of the hand [
13]. However, gestures other than tapping would be difficult to detect because IMU-based sensing relies on distinct vibrations to recognise gestures. In contrast, radar has shown impressive results in sensing a wide variety of gestures, offers spatial resolution, and can achieve a footprint small enough to fit in a mobile device [
4,
29,
38].
In this article, we present RadarHand, a wrist-worn miniature radar based on Google Soli that senses proprioceptive on-skin touch input gestures, which were designed based on our study of hand topography [
38]. The topographic features of the back of the hand are distinct in their position, length, texture, and angle, and taking advantage of them enables unique input modalities.
Additionally, Soli has multiple attractive properties for the deployment of gestural interactions in wearable computing. The radar radio frequency (RF) signal penetrates plastic, glass, and other non-metallic materials, preserving its functionality inside device enclosures. It is also unaffected by ambient light and acoustic noise. Soli is less privacy-invasive than image sensors since the radar does not produce distinguishable representations of a target’s spatial structure. Finally, Soli’s sensitivity to sub-millimetre displacements, regardless of distance, allows motion recognition in both near and far fields with the same hardware.
We have conducted three studies to evaluate our system. Prior to developing a prototype system, we evaluated the appropriate gesture design for RadarHand based on the previously established design considerations. To do this, we conducted an experiment to objectively measure how well users can reach landmarks on the back of the hand, which reflects the proprioceptive and exteroceptive error in reaching those landmarks [
53]. This objective approach provides data to inform the design of the gesture set and is more rigorous than the current standard of using subjective scores (e.g., [
9,
41]). We performed an evaluation to assess hand proprioception and tactile perception using an extended Fitts’ Law predictive model [
21]. We also ran a post-survey to determine user perception of gesture usability, memorability, social acceptance, and disambiguity. From this study, we identified the landmarks on the back of the hand with the least proprioceptive error for tapping gestures, which are suitable for gestural interaction in eyes-free and high-cognitive-load scenarios.
Leveraging the unique advantages of radar-based sensing and the touch-based proprioceptive gesture insights from Study 1, we developed a novel machine-learning model and collected a dataset to classify various gesture sets. First, we collected data for all the potential hand gestures for our prototype, along with background gestures. Second, we developed and evaluated a novel machine-learning model based on previous work [
29] on radar gesture sensing. Finally, we introduce two types of gestures based on the locations of the back of the hand: generic gestures and discrete gestures [
7]. Discrete gestures are confined to specific areas on the back of the hand, whereas generic gestures can be performed anywhere on the back of the hand. We formed gesture groups within the generic and discrete categories, selecting those with the best accuracy and suitability for various context-driven applications.
In our final study, we evaluated our trained machine learning model in real-time conditions and reported its false positives and shortcomings. We evaluated our RadarHand prototype and best-performing gesture groups from generic and discrete categories under two interaction modes: Active interaction and Reactive interaction. In Active interaction, the user initiates interaction with a system to achieve the desired output, such as tapping a button to access an app intentionally. In Reactive interaction, the system initiates interaction and requires the user to react, such as replying to a notification prompt.
Figure
1 depicts an overview of the three studies with the key information from each step.
Finally, we summarize all findings into design guidelines regarding the use of wrist-worn radars for gesture detection and interaction design.
This work introduces the following contributions:
(1)
We introduce and demonstrate the use of radar as a compact wrist-worn device to detect on-skin touch-based proprioceptive gestures, which maintains privacy and is not affected by occlusion.
(2)
We established the least proprioceptive error points on the back of the left hand for eyes-free and intuitive gesture interactions.
(3)
We group and classify the gestures using a deep learning model (accuracy of 92% for a generic 7 gesture set and 93% for the best discrete 8 gesture set) and propose contextual applications for each gesture group.
(4)
We conducted a real-time evaluation on the model to report its performance (accuracy of 87% for a generic 7 gesture set, 74% for the discrete 8 gesture set in active interaction conditions, and accuracy of 91.3% for a generic 7 gesture set, 81.7% for the discrete 8 gesture set in reactive interaction conditions) and shortcomings.
(5)
We propose a set of design guidelines regarding the use of wrist-worn radar for wearable gesture recognition, covering what can be applied now and what can be improved in further iterations.
3 RadarHand Benefits and Design Considerations
Gestural interaction in wearable computing contexts involves a different set of technological and interaction requirements. Hayashi et al. [
29] pointed out that users will interact with such devices at the periphery of, or even outside, their visual attention. These requirements highlight the benefits of radar over other conventional sensing methods:
(1)
Always On. Gesture recognition techniques should run continuously and be ready anytime whenever the user wants to initiate the interaction. The core value of gestural interactions in wearable computers is in their immediacy: the user can quickly interact on a small device for simple tasks with minimal cognitive effort and without complex hand-eye coordination. Any friction, such as the need to wake up the device, will impact the usefulness of gestural interactions in these contexts. We refer to this as active interactions, which we elaborate further in Study 3.
(2)
Reliable and Robust. Gesture recognition techniques for wearable computing applications should work in various contexts. They should be robust against environmental changes such as sensor coverings, atmospheric temperature, and lighting conditions.
(3)
Private. The diffusion of products with advanced sensing capabilities in personal spaces, such as our bedrooms, living rooms, or workplaces, makes privacy a key factor for their wide adoption.
(4)
Potentially Small and Invisible. Gesture recognition techniques should possess a sufficiently small footprint to be embedded in various objects without compromising their form factor or aesthetic appeal. The interaction space should also be compact enough to accommodate gestures in close proximity or involve micro-movements. These techniques should be integrated seamlessly behind surfaces without necessitating openings or other modifications to the product’s physical design. They should demonstrate the capability to penetrate various materials, including those commonly found in everyday objects, such as glass or cloth. This aspect is explored in detail in Study 2.
(5)
Low-Computation Cost. Gesture prediction should be computationally efficient and lightweight enough to be executed on wearable devices. We explore an interaction mode called reactive interactions that is better for battery life, which we elaborate on further in Study 3.
To design the RadarHand gestures, which take advantage of the proprioceptive and tactile perception sense, we considered several factors:
(1)
Dominant Hand as a Pointer, Non-Dominant Hand as a Surface. We aim to have RadarHand’s interaction depend on a finger of the dominant hand (assigned as the right index finger in the prototype system) as a pointer and the surface of the non-dominant hand (assigned as the left hand) as an interactive surface. In general, the dominant hand is more accurate and intuitive when it comes to performing specific gestures [
42].
(2)
Landmark-Driven Gestures. As one of the primary goals of our gesture design is to be proprioceptive, we focus on the gestures being based on hand topography [
65]. The section of the hand with the most distinct landmarks becomes the targeted surface of interaction.
(3)
Open Hand State. The state of the non-dominant hand is kept open with each of the fingers stretched for the gestures. This is to accommodate more of the aforementioned touch-based proprioceptive gestures, such as long slides along a finger, gestures on a nail, and so on.
(4)
Explicit Gestures. The selected gestures are explicit, which means they need to be purposefully performed, as opposed to implicit gestures, which are potentially “normal” everyday actions such as waving, clapping, and so on [
6]. This is to reduce overall false positives and negatives.
(5)
Gestures Derived from Smartwatches. We base our gesture design on common smartwatch interactions.
6 Smartwatches typically support three basic interactions: tapping, vertical slide, and horizontal slide [
16]. Additionally, there is a rotational gesture corresponding to rotary input on certain smartwatches, as mentioned in the Android Wear developer guide.
7 From these considerations, we target the back of the non-dominant hand as the interactive surface due to its many distinct landmarks, including nails, fingers, knuckles, and so on. We support four distinct gestures, which are (1) tapping, (2) vertical slide, (3) horizontal slide, and (4) knob rotation [
16,
42]. We focus on distinctive points like nails or
distal phalanges (DP), middle knuckle or
proximal interphalanges (PIP), knuckles or
metacarpophalangeals (MCP), the back of the palm or
metacarpals (ML), and wrist or
carpus (CS) (Figure
2(a)). Gestures such as tapping the ML or CS were removed since there is no distinct landmark to tap. Based on our gestures and the locations on the back of the hand, we established 55 gesture possibilities (Figure
2(b)). For tapping, 15 gestures were selected, ranging from tapping on the pinky MCP to the thumb DP. For sliding, we looked at sliding along each finger in both directions, for a total of 10 gestures. Another 10 sliding gestures were selected on the ML for each finger. For knob rotation, we only perform this on the MCP of each finger, both clockwise and counter-clockwise, for a total of another 10 gestures. Lastly, a horizontal slide is performed along the CS, ML, MCP, PIP, and DP in both directions for a total of 10 gestures (
\(15 + 10 + 10 + 10 + 10 = 55\)).
4 Study 1: Proprioceptive Gestures
Before proceeding with the development of the RadarHand prototype, it is crucial to evaluate the appropriate design of gestures based on established design considerations. In this context, identifying the landmarks on the back of the hand with the least proprioceptive error becomes essential in order to design effective gestures specifically for these landmarks. Proprioception plays a vital role in perceiving the position and movement of our body parts, including our fingers. Exteroceptors aid in verifying the accuracy of touch locations when tapping on the back of the hand. Therefore, we objectively assess the performance of reaching different landmarks on the back of the hand, as this directly reflects the associated proprioceptive error [
53].
In order to adopt a more rigorous approach compared to the subjective scoring methods employed in previous related studies (e.g., [
9] and [
41]), we conducted experiments to objectively measure the performance of reaching landmarks on the back of the hand. This objective approach aims to provide empirical data that can inform the design of the gesture set. By employing this approach, we aim to obtain precise and reliable insights into identifying the landmarks on the back of the hand that exhibit the least proprioceptive error, making them most suitable for gestural interactions. These identified landmarks, with the least proprioceptive error, will serve as the optimal points on which users can perform gestures under conditions of high cognitive load and eyes-free scenarios. This will enable enhanced smartwatch interaction for everyday activities.
4.1 Apparatus
Figure
3 shows the study setup. Two computer systems were running simultaneously to conduct the session. The first system was for precise hand tracking. We set up a motion tracking system with four OptiTrack Flex 13
8 cameras to track retroreflective markers applied to the index fingertip of the right hand. The motion capture cameras were placed around the main table and were connected to a Windows desktop computer, which ran OptiTrack Motive
9 software for real-time data collection. The second system was for the N-back task and gesture prompting. A 15-inch MacBook Pro laptop computer with a 65-inch generic colour monitor was used to display both the N-back task and a gesture prompt with an illustration of the left hand annotated with a red dot. The red dot on the hand illustration told the participant the exact position at which to perform a tap gesture. Since both systems were linked through a local network connection, live motion tracking data of the right index fingertip was streamed from the first system to the second in real time. The second system ran C++-based custom software built with openFrameworks,
10 which prompted gestures with graphics, ran the N-back task, and recorded live motion data of the right index fingertip and the N-back task results.
We also introduced physical constraints to keep the setup consistent across all participants. To eliminate any visual aid, we attached a large piece of cardboard to the edge of the table as a divider to obstruct the participants’ vision of their own hands.
For reliable hand motion tracking, we used 3D-printed hand placeholders for the left hand. The placeholders were attached to a piece of paper with a hand outline printed on it, which was attached to the table. For the right hand, we sculpted a foam block that was attached to the table so that the participant could place their hand in it with the index finger pointed outward. These solutions also ensured that the hands were returned to the original position after each task without the need for looking. All apparatus was sanitized before and after each participant.
4.2 Study Design
The design of our study is based on the methodology proposed by Gunasekaran et al. [
21] that adapted Fitts’ law and N-back task for assessment of proprioception. In the first stage of the study, we focused on finding the best locations that have the least proprioceptive error to perform hand topology-based gestures while balancing a high cognitive load. This was done by getting the participants to perform tapping gestures with their right index finger on different areas of the back of their left hand and then analysing the precision of the gesture performance on the target spot using Fitts’ Law as the means of evaluation.
4.3 Participants
A total of 26 participants (13 females, mean age (years): 25, SD: 3.2) were recruited for this study. Almost a quarter (24%) of the participants used a laptop on a daily basis, and 27% used smartwatches. All the participants were right-hand dominant with no prior accidents that could affect motor skills in the hand.
4.4 Procedure
The participant first read and signed the consent form, followed by the information sheet about the N-back task while the experimenter explained the tasks. The width of each finger on their left hand was measured with a vernier calliper. Then, fifteen spherical retroreflective markers of 10-mm diameter were attached to the MCP, PIP, and DP of each finger on their left hand. One spherical marker was attached to the index DP of their right hand. The location of the markers was then captured using the OptiTrack system, where the system retrieved the three-dimensional position of the points of interest. The markers were removed from the left hand after the positions of the markers had been captured so that participants could actually touch the MCP, PIP, and DP points on their fingers.
Participants initially completed a short practice round to familiarize themselves with the system before moving on to the actual study. As shown in Figure
3, the N-back task was running on the bottom half of the main screen. For each step of the N-back task, an alphabetical character was presented for 500 milliseconds. Then, the participant had 4,500 milliseconds to answer verbally whether that character was the same as the character shown one or two positions before. On the top half of the screen, a gesture prompt with the left-hand silhouette and a red dot, shown on one of the MCPs, PIPs, or DPs of the left hand, was displayed. The participant would tap the corresponding position on their left hand with their right index fingertip, then return the right hand to the resting position indicated by the foam hand rest. They were asked to answer the N-back test and perform the tapping gesture simultaneously. They tapped each dot position three times during the study, for 45 tapping gestures overall (15 positions × 3 iterations). The entire gesture performance section took 5 minutes or less. Each finger movement tracked with the OptiTrack system was recorded, along with the response and answering time of the N-back test.
Later, participants were required to complete a questionnaire that examined four different gestures (tapping, vertical slide, horizontal slide, and knob rotation) performed in the open hand state. Participants were shown short videos of the gestures and encouraged to try them out. The questions covered the usability, memorability, and social acceptance of each gesture [
42]. Participants were also asked to rank their preferred finger and gestural design area. The whole experiment took around an hour to complete. At the end, participants were compensated with a $10 shopping voucher.
4.5 Computing the Fitts’ Formula
The movement data consisted of the position coordinate, timestamp, velocity, and the marker’s angular velocity. The N-back response data consisted of the participant’s answer and the timestamp. We used the Fitts’ law extension of Murata and Iwase [
45] and Cha and Myung [
11] to compute the
Fitts’ Index of Difficulty (ID). To achieve this, we wrote a Python script to compute the distance parameter (A), width parameter (W), and movement time (MT). Based on the measured finger width (W), we created a bounding box around the target point, shown in Figure
4 on the right. At every sampled data frame, we measured the traversal distance between the right index and the center of the bounding box. From there, we could calculate the minimum distance (A) for when the right index entered the bounding box before returning to the origin. MT was the time taken during the right-index trajectory movement toward the target.
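To make this concrete, below is a minimal sketch of how A, W, and MT could be computed for one tapping trial from the logged fingertip trajectory; the array layout, the cubic bounding box, and the function name are our illustrative assumptions rather than the exact script used.

```python
import numpy as np

def fitts_parameters(positions, timestamps, target_center, finger_width):
    """Compute Fitts' parameters for one tapping trial.

    positions:     (N, 3) right index fingertip coordinates per frame (mm)
    timestamps:    (N,) sample times (ms), starting at the resting position
    target_center: (3,) centre of the bounding box around the target point
    finger_width:  measured finger width W (mm), used as the box size
    """
    # Traversal distance between the fingertip and the box centre at every frame.
    dists = np.linalg.norm(positions - target_center, axis=1)

    # Frames in which the fingertip is inside the cubic bounding box of side W.
    inside = np.all(np.abs(positions - target_center) <= finger_width / 2.0, axis=1)
    if not inside.any():
        return None  # the target was never reached in this trial

    # A: minimum fingertip-to-centre distance reached while inside the box.
    A = dists[inside].min()

    # MT: time from the start of the movement until the box is first entered.
    first_hit = int(np.argmax(inside))
    MT = timestamps[first_hit] - timestamps[0]

    return A, finger_width, MT
```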
According to the revised Fitts’ law of Murata and Iwase [
45], the inclination angle (
\(\theta _1\)) is formulated using trigonometric calculations of the minimum distance point (
t), origin point (
o), and ground plane (
n) as these points form a triangle, as shown in Figure
4. Using these three points, we calculated the inclination angle using the formula below:
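One plausible formulation computes the angle at the origin \(o\) of the triangle formed by \(t\), \(o\), and \(n\) via the law of cosines:
\[
\theta_1 = \arccos\left(\frac{\lVert \vec{ot} \rVert^{2} + \lVert \vec{on} \rVert^{2} - \lVert \vec{tn} \rVert^{2}}{2\,\lVert \vec{ot} \rVert\,\lVert \vec{on} \rVert}\right).
\]
Under Murata and Iwase’s revision, the inclination angle then enters the index of difficulty as \(ID = \log_2(A/W + 1) + c\,\sin\theta_1\), with \(c\) an empirically fitted constant.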
For the revised Fitts’ law of Cha and Myung [
11], we used a trigonometric evaluation of three points to compute the azimuth angle (
\(\theta _2\)) value: the origin point (
o), the ground plane (
n), and the displacement of the minimum distance point of the original plane (
\(t^{\prime }\)). These points formed a triangle, shown in Figure
4 from which we calculated the azimuth angle using the formula below:
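Analogously, one plausible formulation computes the azimuth angle at the origin \(o\) of the triangle formed by \(o\), \(n\), and \(t^{\prime}\) via the law of cosines:
\[
\theta_2 = \arccos\left(\frac{\lVert \vec{on} \rVert^{2} + \lVert \vec{ot^{\prime}} \rVert^{2} - \lVert \vec{nt^{\prime}} \rVert^{2}}{2\,\lVert \vec{on} \rVert\,\lVert \vec{ot^{\prime}} \rVert}\right).
\]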
4.6 Results
4.6.1 Analysis of Fitts’ ID vs MT.
Figure
5(a) and Figure
5(b) show the comparison between the two Fitts’ IDs of Murata and Iwase [
45] and Cha and Myung [
11]. Cha and Myung’s ID shows a stronger linear relationship with MT than Murata and Iwase’s, as its regression value is closer to 1, though neither model fit our data well. The Fitts’ IDs were mainly affected by the distance and target-size conditions. The effect of distance is that MT gradually increases as the target finger moves from the thumb to the pinky. The variance in target size due to participants’ different finger widths was another reason for the non-linearity between MT and ID in Figure
5. The data was modeled using the revised Fitts’ ID formula for pointing tasks in 3D space [
45]. As a result, we got ID levels for each of the 15 different conditions based on the equation. The model regression coefficients were
\(R^2 = 0.0604\) for Murata and Iwase’s method and
\(R^2 = 0.109\) for Cha and Myung’s method (computed based on the Fitts’ formula detailed in Section
4.5). From the ID calculated using Murata and Iwase’s extended Fitts’ law, we created a heat map that depicts the distribution of Fitts’ ID from thumb to pinky, as shown in Figure
1 (right).
4.6.2 Analysis of Movement Time vs. Answer Time.
Since the participants were required to perform the tapping and N-back tasks simultaneously, there could be a delay in one of the tasks. Analysing this delay allows us to understand user behavior as well as performance accuracy under high cognitive load.
Movement time (MT) is the time taken by participants to move from the origin to the target.
Answer Time (AT) is the time taken by the participant to give an answer for the N-Back task. The rightmost plot in Figure
5 depicts the box plot comparing MT and AT, with the y-axis in milliseconds. There is a general trend across participants of first touching the target and then answering the N-back task. A paired t-test gives
\(t(478)=6.07, p\lt 0.001\), which shows a significant difference between MT (M = 2128.4, SD = 452.4) and AT (M = 2360.5, SD = 380.8).
4.6.3 Questionnaire Results.
For the post-experiment questionnaire results, the average scores for all the hand gestures are shown in Figure
6. We first tested our data for normality using the Shapiro-Wilk test and found that the distribution was not normal. As our data did not meet parametric assumptions, we ran the Friedman test and found statistical significance across usability (
\(X^2(3) = 57.758, p \lt 0.001\)), memorability (
\(X^2(3) = 36.482, p \lt 0.001\)), social acceptance (
\(X^2(3) = 34.119, p \lt 0.001\)), and disambiguity (
\(X^2(3) = 28.702, p \lt 0.001\)). Looking at usability first, we conducted a Nemenyi post-hoc test and found significant differences between the tapping gesture and knob rotation, and between the tapping gesture and the vertical slide (
\(p \lt 0.01\)). The horizontal slide and knob rotation (
\(p \lt 0.01\)) also differed significantly in usability. For memorability, the tapping gesture differed significantly from knob rotation and from the vertical slide (
\(p \lt 0.001\)). Next, under social acceptance, the tapping gesture and knob rotation differed significantly (
\(p \lt 0.001\)). Finally, knob rotation and tapping differed significantly under disambiguity (
\(p \lt 0.001\)). All other combinations were not significant (
\(p \gt 0.05\)). The user preference rankings of fingers, hand positions, and gestures are summarized in Table
2.
4.7 Discussion
Our findings indicate that both Fitts’ models exhibit a non-linear relationship. When the index of difficulty (ID) is high, reaching the tap point becomes more challenging, resulting in decreased gesture accuracy. In terms of the fingers of the left hand, the ID gradually increases from the thumb to the pinky, which aligns with the sensory somatotopic mapping of the human hand. This observation may potentially explain the variation in the Fitts’ ID between fingers. Notably, the index and thumb fingers have shown a larger activation volume in the somatosensory cortex during an fMRI study [
43], with the index finger exhibiting the most prominent activation cluster.
Regarding gesture locations on the fingers, excluding the thumb, the ID is lowest at the metacarpophalangeal (MCP) joint, followed by the proximal interphalangeal (PIP) joint, and finally, the distal phalangeal (DP) joint. Although the fingertip has the highest density of nerve endings, it exhibits a higher ID compared to the other joints. This can be explained by two factors. First, according to Long et al. [
41], individuals perceive their bodies, especially their fingers, to be smaller when there are no visual cues. Second, the palms have a higher concentration of muscle mass compared to the fingers, resulting in a greater number of mechanoreceptors.
Our study involved participants concurrently performing two tasks: touching the target point and answering the N-back task, without relying on visual cues. From the analysis of the answer time (AT) for the N-back task and the movement time (MT), we observed a trend where participants prioritized touching the target before engaging with the N-back task. This observation can be attributed to the fact that proprioceptive-based tasks generally impose a lower mental load than the N-back task [
2]. Consequently, this suggests that users may still be able to accurately perform tapping gestures without visual aid using RadarHand in real-world scenarios where they can briefly focus on interacting with the wearable, such as when driving or carrying objects that obstruct their vision.
The survey results, depicted in Figure
6, and the analysis of the questionnaire revealed significant differences in perception among the different gestures. The tapping gesture was perceived as significantly more usable than knob rotation and vertical slide gestures. Furthermore, the horizontal slide gesture received significantly higher usability ratings compared to knob rotation. In terms of memorability, participants perceived the tapping gesture as more memorable overall when compared to the knob rotation and vertical slide gestures. Additionally, the tapping gesture received a significantly higher score in terms of social acceptance compared to the knob rotation gesture. Finally, participants indicated that the knob rotation gesture was significantly more ambiguous than the tapping gesture.
Analysis of our results, as depicted in Figure
5, aligns with the perceived finger and joint ranking. Consequently, based on the survey and Fitts’ study, we have determined that the thumb, index, and middle fingers are the least prone to proprioceptive errors and are therefore preferred for gestural interactions. Specifically, the metacarpophalangeal (MCP) and distal phalangeal (DP) joints are identified as the least prone to proprioceptive errors and are preferred locations for performing gestures.
In our current empirical study design, participants were instructed to tap on the back of the hand under conditions of high cognitive load and without visual cues. This tap gesture process incorporates both proprioceptive and exteroceptive elements, necessitating consideration of the different components of feedback during the interaction [
7]. Proprioception refers to the internal sense of body position and movement, enabling individuals to perceive the location and orientation of their body parts [
66,
91]. In the context of gestural interactions, proprioceptive feedback plays a crucial role in guiding users to accurately reach the desired gesture locations on the back of the hand. On the other hand, exteroception involves external cues and sensory information derived from the environment [
91]. In our study, exteroceptive feedback aids in verifying the accuracy of touch locations on the back of the hand. By integrating both proprioceptive and exteroceptive elements, the RadarHand gestures can provide users with comprehensive feedback, thereby enhancing the overall usability and effectiveness of gestural interactions, particularly in the landmarks with the least proprioceptive error. Hence, RadarHand gestures performed around these landmarks exhibit high accuracy during everyday smartwatch interactions under conditions of high cognitive load and without visual cues.
5 Study 2: Deep Learning and Gesture Grouping
We established the best gestures based on proprioception and survey results from Study 1. In this section, we aim to design and train a deep neural network model to recognize and classify these touch-based proprioceptive gestures accurately. As we plan to establish a design guideline for RadarHand, we will collect an adequate amount of gesture performance data and train different models for a wide variety of gesture groupings and combinations, as well as suggest which gesture group is contextually more suited to a proposed RadarHand application.
5.1 Apparatus
5.1.1 Hardware.
The entire sensing paradigm for RadarHand depends on the Soli sensor mounted on the wrist of the user, and the signal processing pipeline runs as part of the application software. For this trial, the sensor was tethered to a laptop computer for power and data transmission. The sensor circuit board was mounted on the wrist using a custom 3D-printed mount with a bangle made of a double-sided Velcro strip so that it was easily wearable. The mount had a channel that allowed a material sample two millimetres thick to be mounted in front of the sensor. This was to simulate different kinds of hardware casing materials or a sensor surface obscured by clothing. The device is shown in Figure
7. The Soli sensor we deployed was a custom design based on Infineon’s BGT60TR24B chip [
59].
Soli is a miniature FMCW (Frequency-Modulated Continuous Wave) radar system. Generally, FMCW radar emits a sinusoidal electromagnetic wave and reads the reflected signal after bouncing off an object (Figure
7). The electromagnetic wave sweeps its oscillation frequency continuously from lower to higher within a specific frequency band available in the system. This allows Soli to measure the distance and velocity of a target object, making it possible to extract spatiotemporal information about it.
The data we obtained from the sensor was a time-series complex range-Doppler (CRD) map [
29]. The range-Doppler map is a two-dimensional representation of the reflected signals captured with a receiver. The range dimension and the Doppler dimension correspond to the distance of an object from the sensor surface and its velocity, respectively. In other words, the range-Doppler map exhibits spatial energy intensity. The CRD map is represented in complex numbers, which allows us to reconstruct an absolute range-Doppler map illustrating the amount of reflected energy in the sensing area, as well as a map of the arrival angle of the reflected signals, which, by combining multiple channel inputs, reveals the angular motion of an object. In this article, we call these two maps
the magnitude map and
the phase map from the CRD map, respectively.
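As a concrete illustration of this decomposition, the short NumPy sketch below splits one complex range-Doppler frame into its magnitude and phase maps; the 32 × 32 × 3 shape matches the configuration described later, while the random data stands in for actual sensor output.

```python
import numpy as np

# One CRD frame: 32 range bins x 32 Doppler bins x 3 receiver channels,
# stored as complex values (random data in place of real sensor output).
crd_frame = np.random.randn(32, 32, 3) + 1j * np.random.randn(32, 32, 3)

# Magnitude map: distribution of reflected energy over range and velocity.
magnitude_map = np.abs(crd_frame)

# Phase map: per-channel phase; relative phases across the receivers encode
# the arrival angle (azimuth/elevation) of the reflected signal.
phase_map = np.angle(crd_frame)

print(magnitude_map.shape, phase_map.shape)  # (32, 32, 3) (32, 32, 3)
```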
The sensor device we used is equipped with two signal transmitter antennas and four signal receiver antennas. Since the receiver antennas are arranged in a square, we can take two pairs of receivers forming an L shape to extract the angular motion of objects in Soli’s sensing field in the azimuth and elevation directions (Figure
7). Thus, we can elicit three key characteristics to read micro gestures with CRD maps: distance, velocity, and angular motion of a finger on the back of another hand [
29].
We used the C++-based Soli software development kit (SDK), which provides digital signal processing features to extract adequate features from the raw sensor signal, to communicate with the sensor and extract CRD map data. As before, all apparatus was sanitized before and after each participant’s session.
5.1.2 Software.
We employed a 15-inch MacBook Pro laptop (Apple Inc.) for data collection, with a Soli sensor device tethered by a USB cable. Software for data acquisition, processing, and recording from the sensor was developed with the openFrameworks software library. In the RadarHand prototype, the sensor was configured to sample signals from three signal receivers at 1,000 Hz. The captured signal was then down-sampled with a sliding window of 32 consecutive sample frames. This operation effectively applies low-pass filtering to the sample frames to eliminate noise and suppress signals unrelated to the hand performing a gesture. After the downsampling, the software streams the data at 32 frames per second. From each sample, we extracted a CRD map composed of three magnitude and phase maps from the receivers. Figure
7 shows an example of consecutive CRD maps captured with Soli while one of the authors was performing a gesture.
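A minimal sketch of this downsampling step is shown below; it averages blocks of 32 consecutive raw frames with NumPy, which is one simple way to obtain the low-pass behaviour described above. The block-averaging scheme, array shapes, and function name are our assumptions; the SDK’s actual filter may differ.

```python
import numpy as np

def downsample_frames(raw_frames, window=32):
    """Average blocks of `window` consecutive raw frames (1,000 Hz in)
    to produce a ~31 fps stream, acting as a simple low-pass filter."""
    n_blocks = len(raw_frames) // window
    trimmed = raw_frames[: n_blocks * window]
    # Reshape to (n_blocks, window, ...) and average over the window axis.
    return trimmed.reshape(n_blocks, window, *raw_frames.shape[1:]).mean(axis=1)

# Example: one second of raw frames from three receivers (shapes illustrative).
raw = np.random.randn(1000, 3, 64)
print(downsample_frames(raw).shape)  # (31, 3, 64)
```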
5.2 Procedure
The study participant was seated in a shared space in a laboratory building and given an information sheet and a consent form to read and sign. A Soli sensor was put on their left wrist, and the experimenter explained the tasks. Once the data collection started, the participant was shown the gesture to perform with a color image of the left hand with gesture instructions shown in simple graphics on the laptop screen. They had 5 seconds to confirm the gesture to perform from the image. Then, they were prompted to start performing the gesture ten times. A beep sound was played at the start of each gesture iteration to notify the participant when to perform the gesture, along with a progress bar showing each gesture iteration on the laptop screen. Participants had 2 seconds to complete each gesture.
In total, there were 55 gestures plus 1 background noise class consisting of all background data as shown in Figure
9. A combined class of background data is intended to reduce false positives in the prediction and increase the robustness of the model in real-time. The background data was selected based on everyday activities such as clapping, waving your hands, or typing on a keyboard. The complete gesture set was repeated four times with different sensor covering situations with different materials: no material, glass, plastic, and fabric. Each covering material was placed in front of the sensor. Performing the entire gesture set took under half an hour, and the whole session took two hours. The gestures the participant was asked to perform were tapping, vertical slide from the CS to MCP and MCP to DP, horizontal slide from one end of the back of the left hand to another end, and knob rotation.
5.3 Data Collection
We recruited 46 participants (24 females, mean age: 24 years, SD: 2.53). All the participants were right-handed, meaning they wore a Soli on their left forearm and used their right hand for gesture trials. Prior to the main session, we explained to the participants the details of the study and gave them 5 minutes to try out each gesture and software used for collecting data. The participant then watched the laptop display and performed the gesture displayed on the screen. Figure
8 shows the RadarHand gesture samples and Figure
9 shows the study setup from a participant’s point of view.
To increase the robustness of the model, we prepared three different material samples to add some variation to the sensor output (glass panel, cloth, and plastic panel, all of them of two-millimetre thickness). These materials were installed to the channel of the sensor mount to simulate a sensor under different conditions.
In this study, we had four sensor-covering conditions (no material plus the three covering materials), so we collected 2,440 trials from each participant (10 trials × (55 gestures + 6 background noise variations) × 4 conditions), and each data collection process per participant took 2 hours. At the end of the study, we compensated participants with a $40 shopping voucher. Overall, we collected a total of 112,240 samples.
We randomly split the participants’ data into training, validation, and test sets. Data from 36 participants went to training (1,000 samples per gesture), data from 6 participants went to validation (200 samples per gesture), and data from 4 participants went to the test set (120 samples per gesture). We made sure that no participant’s gestures were split across the training, validation, and test sets.
5.4 Algorithm
For gesture recognition, we used the Deep RadarNet architecture, which is a modified version of the RadarNet software [
29]. As Soli samples at 1,000 Hz, we used a sliding window of 32 frames to generate range-Doppler maps, effectively down-sampling the stream to 32 frames per second. This resulted in 64 frames of data over the two seconds of recording, which we used as the final time window per gesture. We specifically fine-tuned the model to perform better for close-range gestures like those we proposed in Study 1. The model consists of (1) radar signal preprocessing, (2) Deep RadarNet, and (3) a gesture debouncer.
5.4.1 Data Preprocessing.
After we split the data into three sets of trials, we refined the timings of the labelled segments attached to the positive recordings. During the gesture recordings, we observed a timing difference between when gesture prompts were given to participants and when the participants actually performed the gestures. To design a deep learning model that is ready for real-time gesture inference, we fed the continuous data of each gesture (a recording of 10 consecutive gesture trials) into our pre-processing algorithm. The pre-processing algorithm computed a time series of the maximum energy value of the magnitude component of the range-Doppler map for each frame in the continuous recording, found the peaks using a peak detection algorithm, and extracted 32 frames before each peak and 31 frames after it, as shown in Figure
10. Once the data was extracted, it was saved in the NumPy
11 data format.
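A minimal sketch of this segmentation step is given below, assuming the per-frame maximum magnitude has already been computed as a 1D NumPy array; the use of SciPy’s find_peaks, the peak-height heuristic, and the variable names are our illustrative choices rather than the exact pre-processing algorithm.

```python
import numpy as np
from scipy.signal import find_peaks

def segment_gestures(max_energy, crd_frames, n_before=32, n_after=31):
    """Cut 64-frame gesture segments around energy peaks.

    max_energy: (T,) per-frame maximum of the range-Doppler magnitude
    crd_frames: (T, ...) time series of pre-processed CRD frames
    """
    # Peaks in reflected energy mark the moments at which gestures occur.
    peaks, _ = find_peaks(max_energy, distance=n_before + n_after + 1,
                          height=np.median(max_energy))
    segments = [crd_frames[p - n_before: p + n_after + 1]
                for p in peaks
                if p - n_before >= 0 and p + n_after + 1 <= len(crd_frames)]
    return np.stack(segments) if segments else None

# Each stacked segment can then be saved in the NumPy format via np.save.
```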
We then generated negative samples from the background recordings by extracting 64 frames and splitting them into the training, development, and test sets. With the Soli SDK, we extracted CRD maps for each receiver channel. Each CRD map was composed of 2D data with 32 range bins and 32 Doppler bins. All of the values were stored as complex floating-point values and can be decomposed into (1) a magnitude part, which holds information on the energy distribution in the sensing area, and (2) a phase part, which contains information on the angular velocity of the object in the sensing area. With the configuration we adopted in this study, we captured CRD maps from three receivers in the sensor, and thus obtained three CRD maps per frame (a 32 × 32-pixel map for both the magnitude and phase parts for three channels).
Deep RadarNet uses complex range-Doppler maps as input, similar to RadarNet. Deep RadarNet uses the first 16 range bins as input to detect gestures within a 40-cm radius of the sensor surface. Each range bin size can be calculated by the equation:
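A standard form of this relation for FMCW radar, consistent with the bandwidth and the 2.5 cm bin size quoted below, is
\[
\Delta R = \frac{c}{2B},
\]
where \(\Delta R\) is the range bin size, \(c\) is the speed of light, and \(B\) is the radar signal bandwidth.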
In this study, the radar signal bandwidth was \(63\,\mathrm{GHz} - 57\,\mathrm{GHz} = 6\,\mathrm{GHz}\), so each range bin size is 2.5 cm from the equation. We limited the interaction space to 40 cm from the sensor, so we cropped the range bins at the 16th bin, corresponding to a 40 cm distance, so that users’ index fingers were always within the interaction space on the CRD maps when their hand performed the gestures. The input was reshaped into a tensor of size (16, 32, 6) and passed to the frame model of Deep RadarNet.
5.4.2 Deep RadarNet.
Deep RadarNet takes a CRD map consisting of complex values from three channels as input. In these data representations, the relative phases among the channels correspond to the angles of scattering surfaces around a Soli sensor. The absolute phases of the complex values are affected by many factors, including surface positions below the range bin resolution, phase noise, and errors in sampling timings, and can be regarded as uniformly distributed random values. While the absolute phases may be distributed uniformly in a large dataset, there could be biases in a smaller dataset. To address these potential biases, we augmented the data by rotating the complex values with random phase values chosen from a uniform distribution between
\(-\pi\) and
\(\pi\). Furthermore, the magnitudes of the complex values are also affected to some extent by many factors, including differences in antenna properties among Soli sensors, the signal reflectivity of scattering surfaces, and the orientations of the surfaces. Thus, we augmented the data by scaling the magnitude with a scaling factor chosen from a normal distribution. These data augmentations can be written as:
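One consistent way to write this augmentation (with \(\mathrm{CRD}^{\prime}\) denoting the augmented map) is
\[
\mathrm{CRD}^{\prime}(r, d, c) = s \cdot e^{j\theta} \cdot \mathrm{CRD}(r, d, c),
\]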
where: CRD is a complex range-Doppler map; r, d, and c are a range bin index, a Doppler bin index, and a channel index in the complex range-Doppler map, respectively;
s is a scaling factor chosen from a normal distribution with a mean of 1;
\(\theta\) is a random rotation phase chosen from a uniform distribution between
\(-\pi\) and
\(\pi\). While we used the proposed data augmentation technique only for our gesture detection system, we believe that it can be applied to other deep neural network systems that use radar signals as inputs.
The Deep RadarNet consists of a frame model and a temporal model as shown in Figure
11. At the beginning of the frame model, an average pooling is applied to the input tensor to make the tensor size smaller and to reduce the computational cost. After this layer, a series of separable 2D residual blocks and max pooling are applied. We opted to use separable convolution layers instead of standard convolution layers in the residual blocks to optimize computations at the cost of minor degradation in the gesture recognition performance. Finally, one separable convolution layer is applied to compress each channel into one value, shrinking the tensor to 36 values as an output from the frame model.
The temporal model takes the last 64 frames of the frame model outputs as input. All of these frame model outputs, except the latest one, are taken from a cache. The temporal model concatenates and processes them with a series of one-dimensional residual blocks and one-dimensional max pooling layers. At the end of the temporal model, a dense layer outputs one value per class, and a softmax layer computes the class probabilities. We found that using an LSTM layer instead of the series of one-dimensional residual blocks and one-dimensional max pooling gave performance improvements; however, we opted for the current structure to reduce computational cost.
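To make the architecture description concrete, below is a minimal Keras sketch of the frame and temporal models. The layer counts, filter numbers, and pooling sizes are our illustrative assumptions; the paper specifies the overall structure (average pooling, separable 2D residual blocks with max pooling, a channel-compressing separable convolution, then 1D residual blocks and a softmax output) but we do not reproduce its exact hyperparameters.

```python
from tensorflow.keras import layers, models

def sep_residual_block_2d(x, filters):
    """Separable-convolution 2D residual block (illustrative sizes)."""
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    y = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.SeparableConv2D(filters, 3, padding="same")(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def residual_block_1d(x, filters):
    """Plain 1D residual block for the temporal model."""
    shortcut = layers.Conv1D(filters, 1, padding="same")(x)
    y = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv1D(filters, 3, padding="same")(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def build_frame_model(frame_features=36):
    """Per-frame model: CRD input (16 range bins, 32 Doppler bins, 6 channels)."""
    inp = layers.Input(shape=(16, 32, 6))
    x = layers.AveragePooling2D(2)(inp)            # shrink the input tensor
    x = sep_residual_block_2d(x, 32)
    x = layers.MaxPooling2D(2)(x)
    x = sep_residual_block_2d(x, 64)
    x = layers.MaxPooling2D(2)(x)
    x = layers.SeparableConv2D(1, 3, padding="same")(x)  # compress channels
    x = layers.Flatten()(x)
    # The paper states the frame model outputs 36 values per frame;
    # here we simply project to that size as an approximation.
    x = layers.Dense(frame_features)(x)
    return models.Model(inp, x, name="frame_model")

def build_temporal_model(n_classes, frame_features=36, window=64):
    """Temporal model: the last 64 frame-feature vectors -> class probabilities."""
    inp = layers.Input(shape=(window, frame_features))
    x = residual_block_1d(inp, 64)
    x = layers.MaxPooling1D(2)(x)
    x = residual_block_1d(x, 64)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Flatten()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out, name="temporal_model")
```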
5.4.3 Gesture Debouncer.
A window size of 64 frames was used for the segmented classification task. However, the algorithm needs to correctly classify gestures from an unsegmented data stream. This is significantly more challenging since we do not know where the gesture is situated within the time-series data. To overcome this, we added the following heuristics: (1) the predicted probability of a gesture must be higher than an experimentally determined threshold of 0.3 for the last three consecutive frames, and (2) after a gesture is detected, the probability of any subsequent gesture must fall below 0.3 before gesture detection is initiated again. The thresholds were determined experimentally to achieve the desired balance between recall and false positives.
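A minimal sketch of this debouncing logic is shown below, operating on the stream of per-frame class probabilities from the temporal model. The 0.3 threshold and the three-frame requirement follow the description above; requiring the same winning class across those frames, and the background-class index, are our own interpretation.

```python
import numpy as np

class GestureDebouncer:
    """Turns per-frame class probabilities into discrete gesture events."""

    def __init__(self, threshold=0.3, n_consecutive=3, background_class=0):
        self.threshold = threshold
        self.n_consecutive = n_consecutive
        self.background_class = background_class  # index of the background/noise class
        self.history = []     # recent (winning class, probability) pairs
        self.armed = True     # detection re-arms once probabilities drop again

    def update(self, probs):
        """probs: 1D array of class probabilities for the latest frame.
        Returns a gesture class index when a gesture fires, else None."""
        gesture_probs = np.asarray(probs, dtype=float).copy()
        gesture_probs[self.background_class] = 0.0   # ignore the background class
        cls = int(np.argmax(gesture_probs))
        prob = float(gesture_probs[cls])

        # Rule (2): after a detection, wait until all gesture probabilities
        # fall below the threshold before detecting again.
        if not self.armed:
            if gesture_probs.max() < self.threshold:
                self.armed = True
            return None

        # Rule (1): the same gesture must exceed the threshold for the
        # last three consecutive frames.
        self.history.append((cls, prob))
        self.history = self.history[-self.n_consecutive:]
        if (len(self.history) == self.n_consecutive
                and all(c == cls and p > self.threshold for c, p in self.history)):
            self.armed = False
            self.history.clear()
            return cls
        return None
```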
5.5 Evaluation
All training was done for 1,000 steps with a batch size of 32. Each training took 5 hours with an NVIDIA V-100 with 32 GB graphic card memory. We evaluated the performance of our architecture in two tasks. The first is a segmented classification task. In this task, recordings in the test set were segmented into multiple samples using the same algorithm we used to generate the training samples. As a result, the samples were easier to classify. For instance, complete gesture motions are always detected at similar positions in the segments. The second task is an unsegmented recognition task, where a model combined with a gesture debouncer had to detect gestures with continuous time-series data without knowing the positions and the numbers of gestures in the data. This made the unsegmented recognition task more challenging. However, algorithms have to process continuous data in practice; evaluations with this task gave us more ecologically valid performance estimates.
According to the Study 1 results, the fingers are ranked thumb, index, middle, ring, and pinky in order of least proprioceptive error for tapping. Additionally, we obtained a ranking of gestures from participants, as listed in Table
3, from our survey. Combining the results of Study 1 and the survey, we examined our Deep RadarNet algorithm and proposed 29 gesture groups. We report each gesture group’s test accuracy and its contextual applications.
5.5.1 Computational Efficiency.
We evaluated the computational efficiency of the Deep RadarNet using its model size and inference time. The model size affects how much memory is required to run the model. Inference time affects how much computational power a processor needs to run the model, as well as the power consumption and thermal effects of the computation.
We compared Deep RadarNet with RadarNet [
29], modifying its input to match RadarHand’s gestural input size. We evaluated inference time using TensorFlow Lite [
1] and its performance profiler distributed with TensorFlow v.2.4.1. We converted all models into TFLite models and measured the inference time on Pixel 4 XL with the performance profiler by taking the average inference times over 1,000 inference trials.
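For illustration, the sketch below shows the kind of conversion-and-timing loop involved: it converts a Keras model to TFLite and times repeated invocations on the host machine. This is a desktop approximation of the on-device profiler workflow described above, and the benchmarking helper and its defaults are our own.

```python
import time
import numpy as np
import tensorflow as tf

def tflite_benchmark(keras_model, input_shape, n_trials=1000):
    """Convert a Keras model to TFLite and report model size and mean inference time."""
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    tflite_bytes = converter.convert()
    print(f"Model size: {len(tflite_bytes) / 1024:.1f} KiB")

    interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    sample = np.random.randn(1, *input_shape).astype(np.float32)
    times = []
    for _ in range(n_trials):
        interpreter.set_tensor(inp["index"], sample)
        start = time.perf_counter()
        interpreter.invoke()
        times.append(time.perf_counter() - start)
        _ = interpreter.get_tensor(out["index"])
    print(f"Mean inference time: {1000 * np.mean(times):.3f} ms")
```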
As shown in Table
3, Deep RadarNet has a 3.5 times larger model size and a 1.25 times longer inference time than RadarNet. This is because RadarNet covers a smaller interaction area than Deep RadarNet. Compared with the modified RadarNet with the same interaction area as Deep RadarNet, the model size and inference time were comparable, indicating that their computational efficiencies are similar. Furthermore, Deep RadarNet achieved higher classification accuracy than RadarNet when detecting the 8-gesture set. In addition, RadarNet [
29] has proved to be efficient compared to the baseline models of previous work [
31,
73]. These comparisons demonstrate that Deep RadarNet is as computationally efficient as RadarNet and can be executed on mobile devices with limited computing resources.
5.5.2 Segmented Classification Task.
We generated a segmented test set for each gesture trial using the same sample generation algorithm that generated the training samples from the training set. Each sample had 64 frames of CRD maps from three receivers and was assigned one of the classes as a ground-truth label. We then evaluated the classification of all gesture trials in the segmented test set using Deep RadarNet. The label with the highest probability among the gesture classes was used as the prediction, without applying the gesture debouncer, because the test recordings were pre-segmented into samples in the segmented classification task. Figure
13 shows the results, which will be explained in the Gesture Grouping and Accuracy Results section, demonstrating that Deep RadarNet provided robust classification performance in the segmented classification task.
5.5.3 Trial on Applying Different Frame Size.
As each gesture could occur at any time within the two-second time window of 64 frames, we tried to find the best-performing frame size by testing several frame-size conditions. To examine this, we took the 8 unique gestures, based on the Study 1 results, that gave us the highest accuracy in the Deep RadarNet inference trial and fed them as input to the Deep RadarNet algorithm. A strong motivation for this trial is that, as mentioned in the previous section, we observed drift in gesture start timing due to each participant’s response to the beep sound and variability in gesture speed. Hence, it is worthwhile to know the best-performing frame size as an input to the model, which is directly proportional to the parameter size and computational power required for inference. We compared all the conditions we examined using the results on the development set. Figure
12 shows the validation accuracies for different input frame sizes. We also tried a sliding window of 30 consecutive frames across the 64 frames of data, under the same labelling, as input to the deep learning model. Even though a window size of 56 yielded the highest accuracy, we opted for 64 instead, which has an accuracy lower by 1 point yet better accounts for different gesture timings across participants.
5.6 Gesture Grouping and Accuracy Results
In this section, we design groupings for our gesture combination and report their accuracy. We group the gestures for 2 reasons. First, it is unrealistic and redundant for us to report inference accuracy for every possible gesture combination. Second, we wish to design gesture sets that cater to specific contexts of applications. To group the gestures, we consider several aspects listed below.
Each group was trained with an additional class for background noise, hence three gestures actually mean four classes, and so on:
(1)
Gestures were selected based on the results of Study 1. The thumb, index, and middle finger have the least proprioceptive error and are the preferred fingers for gestural interaction. The MCP and DP were also found to have the least proprioceptive error and to be the preferred spots. Additionally, one limitation was the fixed hand orientation; if the orientation changes, the results will also change. Thus, we include all fingers in our gesture combination trials.
(2)
We referenced
smartwatch user interfaces (SMUI) when narrowing and grouping the gesture selections [
16].
(3)
We selected three gestures, including tapping, sliding, and rotation gestures.
(4)
As the group size increased, we narrowed the gesture selection to the sets whose models had previously performed well.
(5)
We generally avoided placing similar gestures adjacent to each other and at the same location since even a slight change in sensor placement on the wrist can have a significant effect on its performance.
(6)
Since we do not calibrate for hand size, we limited the gestures per region to only three to minimize false positives. All of the gesture set results are illustrated in Figure
13.
The grouping of various gestures enabled gesture interaction to take on a new dimension by distinguishing between gestures for essential interactions, such as answering or dismissing a phone call, and gestures that require a great deal of attention, such as increasing the brightness of a smartwatch. We created gesture groups consisting of 3, 4, 6, 8, and 10 gestures. Additionally, we created a generic gesture combination trial that combines all the basic gestures regardless of the position of the user’s fingers. We hypothesize that the generic gesture combination will yield a robust model that can be used in any situation. The groupings below provide the rationale behind each grouping as well as its potential application in SMUIs. The confusion matrices for all the trials are attached in the Appendix for reference.
3 Gestures (Sets 1 to 6). Referring to Table
3, we focused on the three fingers with the least proprioceptive error, which are the thumb, index, and middle finger, in that order. For the points on each finger, we focused on the DP as the landmark with the least proprioceptive error, followed by the MCP from user preference. As the DP of the thumb is the landmark with the least proprioceptive error, we envision its use for something crucial, such as an emergency button, and maintain this throughout all the groups. We disregarded direction at this stage, focusing only on the vertical up-slide and clockwise rotation. A total of 6 gesture sets were derived from this, where sets 1 to 4 were gestures across the three fingers, and sets 5 and 6 focused on only the thumb and index fingers. For sets 1 to 4, the thumb and index were reserved mainly for tap and slide as they were overall more preferred, as shown in Table
3. Therefore, rotation was left mainly on the middle MCP. This overall grouping is suitable for the most basic SMUI like the Fitbit Alta,
12 with a single tap for selection, single slide for menu interface, and single rotation to imitate a one-directional knob.
4 Gestures (Sets 7 to 12). For this group, we added the horizontal right slide gesture to the previous group, making a total of 4 gestures. We chose the right slide as it mimics gestures like “slide-to-unlock”. From Table
3, we chose to perform this gesture only on the CS. It can be seen that sets 5 and 6 from the last group, as well as sets 11 and 12, have a dip in accuracy, which led us to focus on spreading the gestures across different fingers in future groups. This grouping is also suitable for basic SMUIs, with an additional function to perform a slide-to-delete or slide-to-unlock.
6 Gestures (Sets 13 to 18). We skipped the group of 5 gestures because we decided to directly expand the slide gesture to cover both directions (up and down, as well as left and right). Additionally, we selected the 2 best results from the previous group (sets 7 and 9) and expanded on them. Set 10 was removed because, even though it performed well previously, its gestures have higher proprioceptive error compared to set 7 (the slide was performed on the thumb and the tap on the index) and are less accurate compared to set 9. Sets 13 and 14 were based on sets 7 and 9, respectively. Set 15 is a variation of 13, whereas 16 is a variation of 14, shifting the down vertical slide to the ML of the respective finger. Set 17 is a variation of 15, and 18 is a variation of 16, where the vertical slides were swapped between the index and thumb, since we found from the previous group that variation among fingers provides better results. This grouping is suitable for a fully functional SMUI with a knob, which is close to a watch crown-inspired interface, such as the Digital Crown on the Apple Watch.
7 Gestures (Sets 19 to 21). For this group, we again chose the best option from the previous group, namely set 17 (set 18 was deemed less proprioceptively reliable since the tap gesture was on the index). Additionally, we added the counter-clockwise rotation gesture for a more sophisticated virtual knob control. Set 19 places both rotations on the middle MCP, whereas sets 20 and 21 split them between the thumb, middle, and index. This grouping is suitable for SMUIs with a larger-scale knob interface, such as the Samsung Galaxy Watch 4 with a rotational bezel as a key SMUI navigation component.
8 Gestures (Sets 22 to 25). With all the basic gestures and directions added, the next step was simply to add multiple interaction points of the same gesture type. Since tap was preferred overall, we added one more tap gesture based on the previous best-performing set, which is set 20. Sets 22 and 23 replicate set 20 with an added tap on the index DP and the middle DP, respectively. Set 24 is a variation of 21 that shifts the downward vertical slide from the thumb ML to the finger, and set 25 is a variation of 22 that does the same.
10 Gestures (Sets 26 and 27). Here, we further push the limits of the tapping gesture by doubling it to 4 taps, hence skipping the 9-gesture group. Sets 26 and 27 are variations of the previous best-performing sets, 24 and 25. Here, we experimented with including the high-proprioceptive-error landmarks on the pinky and ring fingers. This was for two reasons: (1) we wished to split the tap points by actual distance, to resemble how the buttons on a watch typically look, such as 2 buttons on each side, or 3 on one side and 1 on the other, and (2) from our initial testing, the models generally perform better when the tap locations are physically distinct. The thumb MCP knob rotation was also moved to the index MCP to reduce the number of gestures per finger.
Generic Gestures (G1 and G2). For this group, we disregard the location of the gestures and simply focus on generalizing all 7 gestures. For example, tapping on the thumb DP is treated the same as tapping on the index DP. This group does not take proprioception errors into account and tolerates lower accuracy, making it more suitable for multitasking, such as interaction while driving or cycling. These gesture sets are also independent of hand state, so participants can perform the gestures while their hands are busy. A similar interface is available in the navigation mode of smartphone music applications, which use larger UI components for ease of operation while driving. Additionally, it can be used in everyday smartwatch interaction scenarios and act as a modeless interface.
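To make the distinction concrete, the sketch below encodes set 25 (as reconstructed from the results reported in Section 6.5) and the generic set G1 as plain Python data; the landmark abbreviations and the exact split of the knob directions in G1 are illustrative assumptions rather than our internal definitions.

```python
# Illustrative encoding of two gesture sets as (gesture, landmark) pairs.
# Landmark names are abbreviations (DP, MCP, ML, carpals, metacarpals);
# the exact composition may differ slightly from our internal definitions.

DISCRETE_SET_25 = [
    ("tap", "thumb_DP"), ("tap", "middle_DP"),
    ("slide_up", "thumb_MCP_to_DP"), ("slide_down", "index_DP_to_MCP"),
    ("slide_left", "metacarpals"), ("slide_right", "carpals"),
    ("knob_cw", "middle_MCP"), ("knob_ccw", "thumb_MCP"),
]

# The generic set G1 drops the location entirely: only the gesture type counts.
GENERIC_G1 = ["tap", "slide_up", "slide_down", "slide_left", "slide_right",
              "knob_cw", "knob_ccw"]
```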
6 Study 3: Real-Time Evaluation
In this section, we aim to evaluate our model's performance in real-time using new test datasets with real-world noise and to determine false positives in a real-world scenario. For the real-time evaluation, we chose the discrete gesture set 25, which includes all eight gesture possibilities, and the generic gesture set G1, which gave the highest segmented classification accuracy. The generic gesture set G1 is not location-specific. Hence, we hypothesize that, in real-time, its gesture inference will produce more false positives than the discrete gesture set 25, because free-form hand movements that resemble the generic gestures are more likely to occur under real-time sensing in a real-world scenario.
6.1 Study Design
Besides selecting the discrete and generic gesture sets, we further break both models down into variants with and without background classification.
We do this because we believe smartwatches conventionally function in two modes: (1) active interaction, where the user initiates the input to achieve a desired output, such as intentionally tapping a button to access an app, and (2) reactive interaction, where the device initiates the interaction and requires the user to react to it, such as replying to a notification prompt. Active interaction requires background classification, since the system must be continuously aware of user input, though this consumes more computational resources and power. Reactive interaction, on the other hand, does not require background classification, since the system only listens after the prompt. This makes it less intuitive, but it consumes less power overall and allows for better battery life.
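The following minimal sketch contrasts the two modes at inference time; the model objects, the handling of a "background" label, and the function names are illustrative assumptions rather than our exact implementation (see Section 6.4 for the actual pipeline).

```python
# Minimal sketch of the two interaction modes; model objects and the
# "background" label are placeholders, not the exact implementation.
from typing import Optional

import numpy as np


def active_inference(model_with_bg, frame_history: np.ndarray) -> Optional[str]:
    """Always-on classification: a model trained with a background class
    runs continuously, and 'background' predictions are discarded."""
    label = model_with_bg.predict(frame_history)
    return None if label == "background" else label


def reactive_inference(model_no_bg, frame_history: np.ndarray,
                       prompted: bool) -> Optional[str]:
    """Prompt-gated classification: a model trained without a background
    class is consulted only while the device is waiting for a response."""
    if not prompted:
        return None
    return model_no_bg.predict(frame_history)
```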
6.2 Apparatus
To conduct this study, we developed three systems: a backpack, an on-wrist wearable, and a remote monitoring system. The backpack is a self-contained gesture inference system using Soli. It carries a laptop computer (Apple 13-inch MacBook Air), a portable battery, and a dummy external display adapter. The laptop ran live gesture inference with the tethered Soli sensor, while the external battery powered the laptop to keep the current prototype running stably. The dummy display adapter keeps the inference system running while the laptop is folded inside the backpack. The tethered Soli sensor sits on the left wrist as part of the on-wrist wearable system.
Alongside the Soli sensor, we adapted an Android smartphone (Google Pixel 4 XL) as the participant's communication terminal during the study session. A custom Android application acted as a console for the participant to log their gestures and report their current activity or any issues with the system. The wearable (smartphone and Soli sensor) was worn on the participant's left wrist.
The remote monitoring setup gathers all the log data related to the participant's activity or the study status transmitted from these two systems. Responses made by the participant on the on-wrist wearable, as well as gesture inference status from the backpack, were sent to C++-based monitoring software running on a 15-inch MacBook Pro laptop over the same Wi-Fi network shared by all three devices. This application also allowed the experimenter to manage the study and the data logs. The devices used the WebSocket protocol to send data back and forth, and a simple cloud server running on Amazon Web Services was used to assist the communication. We illustrate the full setup in Figure 14(a). All the apparatus was properly sanitized before and after each participant.
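As an illustration of this message flow only (the endpoint, schema, and field names below are assumptions, not the exact protocol we used), a console event could be relayed to the monitoring software roughly as follows.

```python
# Hedged sketch of a console-side log message sent over WebSocket.
import asyncio
import json
import time

import websockets  # third-party package: `pip install websockets`


async def send_log(uri: str, action: str) -> None:
    # Field names are hypothetical; see Section 6.4 for the fields we logged.
    payload = {
        "device": "wearable_console",
        "sent_at": time.time(),   # device-local timestamp
        "owner": "participant",
        "action": action,         # e.g., "gesture_started", "gesture_finished"
    }
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps(payload))


# Example: asyncio.run(send_log("ws://monitor.local:8765", "gesture_started"))
```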
6.3 Study Procedure
At the start of the study session, the experimenter introduced the tasks the participant would perform and the apparatus they would use. After the briefing, the participant was asked to wear the equipped backpack and the wearable hardware prior to the session. They wore the Soli sensor in the same manner as in Study 2, together with the smartphone on an arm-mounting belt. Figure 14(b) shows an example of a participant performing a gesture with the system setup. The experimenter monitored the system status and the participant's actions via the activity logs throughout the session. Each session took approximately one hour.
The participant was asked to perform the gestures while carrying out daily activities in an office space, including working with a computer at a desk, walking, standing, chatting, or brewing tea or coffee in a small kitchen.
The monitoring application automatically prompted the participant to perform the next gesture through the console on their left arm for each gesture trial. The timing was signalled by a vibration of the console and an image shown on the smartphone screen. The participant tapped the image to start a gesture performance and tapped it again to finish. After each trial, the participant was asked to report, from the console, the action they were doing at the time, either by selecting a pre-defined action or by describing it with the on-screen keyboard. Each gesture was repeated three times throughout the study. The order of the gestures was randomized, and the interval between gesture trials was randomized between 30 seconds and 4 minutes for each participant.
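As a rough illustration of this prompting logic (a minimal sketch; the function names and console API are hypothetical), the per-trial schedule can be expressed as follows.

```python
# Sketch of the per-trial prompt scheduler: each gesture is prompted three
# times, in random order, with a random 30 s to 4 min gap between trials.
import random
import time


def run_prompt_schedule(gestures, prompt_fn, repetitions=3):
    trials = [g for g in gestures for _ in range(repetitions)]
    random.shuffle(trials)
    for gesture in trials:
        time.sleep(random.uniform(30, 240))  # 30 seconds to 4 minutes
        prompt_fn(gesture)  # e.g., vibrate the console and show the gesture image
```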
The participant was asked to report, in the monitoring app, any incorrect gestures they performed during or prior to the gesture prompts. In addition, when they encountered any trouble or issues during the study, they reported it to the experimenter through the console. The experimenter could also respond to the participant through the monitoring software if the participant needed assistance. Each condition took one hour to complete. At the end of the study, we compensated the participant with a $10 shopping voucher.
6.4 Data Collection and Processing
We recruited 12 participants (6 female, mean age: 24.4 years, SD: 4.0). The recruitment criteria were the same as in the studies from the previous sections. We asked participants to remove any rings, wristwatches, and other hand-worn objects. None of the participants had experienced the RadarHand gestures before.
As mentioned in the previous section, gesture sensing with Soli operates at a sampling rate of 1000 Hz. To reduce the data size for real-time gesture inference, we applied a sliding window of 32 frames, similar to what was done in Study 2. As a result, we acquired 30 to 32 frames of data per second from the three signal receiver channels. A C++-based software program captured the sensor data, and each frame was piped to a Python-based inference server program running simultaneously. The Python script infers the gesture in real-time from the incoming frame-by-frame data using the models selected from the Study 2 results. The program continuously updates a frame history of the last 64 consecutive frames, including the one most recently loaded, and runs an inference over this history for every two incoming frames, based on the gesture debouncer algorithm discussed in Study 2. We ran inference every two frames to ensure the system operated in real-time without any delay in delivering inference results. The two programs were linked by the ZeroMQ protocol for frame transport and inference result reporting.
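A simplified reconstruction of the Python inference server is sketched below; the model, debouncer, frame layout, and transport settings are placeholders, but the rolling 64-frame history and the inference-every-two-frames rule follow the description above.

```python
# Simplified sketch of the Python inference server: frames arrive over ZeroMQ,
# a rolling 64-frame history is kept, and inference runs every 2 frames.
from collections import deque

import numpy as np
import zmq

HISTORY_LEN = 64   # frames required per inference
INFER_EVERY = 2    # run inference for every two incoming frames


def inference_loop(model, debounce, endpoint="tcp://127.0.0.1:5555"):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PULL)
    sock.connect(endpoint)

    history = deque(maxlen=HISTORY_LEN)
    frame_count = 0
    while True:
        # One Soli frame, serialized by the C++ capture program (assumed float32).
        frame = np.frombuffer(sock.recv(), dtype=np.float32)
        history.append(frame)
        frame_count += 1
        if len(history) == HISTORY_LEN and frame_count % INFER_EVERY == 0:
            window = np.stack(history)                 # (64, features)
            label = debounce(model.predict(window[np.newaxis]))
            if label is not None:
                print("gesture:", label)
```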
All the log data was recorded in CSV format, where each line records one piece of data sent from a device at a particular time. We recorded the following: (1) the timestamp of the recording device, (2) the device name (monitor, inference system, or wearable console), (3) the timestamp of the device sending the data, (4) the remaining study duration, (5) the data owner (system or participant), and (6) any action performed by the participant. The action log contains detailed information on each gesture trial the participant performed, action reports from the participant, the status of the study, and the network connection status. The Wi-Fi connection between the devices introduced some time lag for each data transaction. To overcome this, we collected timestamp information from all the devices, together with a "ping" signal continuously sent from the main monitor computer, to determine how much time drift had to be compensated for in order to process the log data accurately. Each data packet exchanged between the devices carried unique information related to a particular device or action. We combined all the data collected from these devices into one dataset for post-analysis. We also developed a simple Python data processing script to analyze and plot the data for each gesture trial.
For the data processing, we checked the system log file from the monitor system (observer system), the inference system (the backpack), and the real-time inference history simultaneously. First, we looked at the gesture section in the system log and found the corresponding Soli frame numbers for the beginning and end of the trial. Next, we checked the last 10 ping response times between the monitor system and the wearable console to estimate the average message-sending delay. Depending on this delay, we added an extra 16 frames of data before the reported trial section if the delay was less than half a second, and an extra 32 frames if it was more than half a second. An extra 32 frames after the trial section was also included, since each gesture inference requires 64 frames of data. After that, we extracted the subsection of the inference history within the trial section to elicit the gesture performed. We applied the gesture debouncer from Study 2 to obtain the gesture performance history from the last 64 frames.
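The delay-compensation rule above can be summarized in a short helper (our own sketch; the function and variable names are hypothetical).

```python
# Sketch of the trial-window padding used during post-processing: pad the
# reported trial boundaries according to the measured console delay.
def padded_trial_window(start_frame: int, end_frame: int,
                        ping_times_s: list) -> tuple:
    recent = ping_times_s[-10:]                     # last 10 ping responses
    avg_delay = sum(recent) / len(recent)
    lead = 16 if avg_delay < 0.5 else 32            # extra frames before the trial
    trail = 32                                      # each inference needs 64 frames
    return start_frame - lead, end_frame + trail
```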
For active interaction, we applied the model trained on both gesture and background data; its inference ran on the sensor input for the entire duration of the study, including when the participant was prompted to perform a gesture. For reactive interaction, we used the model trained without the background class; its inference ran only while participants were prompted to perform a gesture. The inferred gesture and its confidence were documented as the result for each gesture section.
6.5 Results and Discussion
As shown in Figure 15, the reactive interaction condition showed clearer classifications across all gestures compared to the active interaction condition. In the following sections, we discuss the results and findings from each study condition in more detail.
6.5.1 Active Interaction.
The system achieved \(87\%\) accuracy, as shown in Figure 15(a), with the generic gesture set (G1 in Figure 13) and \(74\%\), as shown in Figure 15(b), with the discrete gesture set (set 25 in Figure 13).
Since the generic gesture set was trained to be more tolerant of gesture location, it produced far fewer false positives and negatives overall. This may also make the gestures more understandable and easier to master. Among the gestures, the generic knob (\(100.0\%\)), generic slide up (\(94.4\%\)), and generic tap (\(88.9\%\)) performed well.
The main issue with the generic gesture set is that all gestures have overlapping interaction regions. For example, the tap and knob gestures can both be performed on the index MCP. This led to several misclassifications, such as the generic tap, slide left, and slide right being occasionally recognized as the generic knob. Among all the gestures, the generic slide left (\(77.8\%\)) and slide right (\(55.6\%\)) performed the poorest. This is possibly due to the different hand states present when the participant was performing other tasks. Ideally, they would need to stop their task before performing a gesture, but we did not enforce this, in order to observe the behavioural impact on model performance. The generic slide right performed the worst overall, possibly because the initial approach of the right hand towards the left side of the left hand was misrecognized as a tap or slide-up gesture in any of the regions around the pinky.
For the discrete gesture set, we found that the majority of misclassifications were due to the model recognizing the gesture as background. This is caused by various factors, such as the speed of the gesture performance, the finger arrival and departure angles, finger length, and hand state, which we did not control in this study. The worst performer was tap on the middle DP (\(53\%\)), followed by slide down on the index DP to MCP (\(61\%\)) and slide right on the CS (\(61\%\)); all of them tended to be misclassified as background. Slide up on the thumb MCP to DP (\(83\%\)), tap on the thumb DP (\(72\%\)), and knob counter-clockwise on the thumb MCP (\(72\%\)) performed best, possibly because the thumb has the least proprioceptive error and is physically the furthest finger away. We also noticed that the knob clockwise gesture on the middle MCP tended to be misclassified as a counter-clockwise rotation on the thumb; this was in fact a mistake made by three participants, who performed a counter-clockwise rotation instead.
6.5.2 Reactive Interaction.
The system achieved \(91.3\%\) accuracy, as shown in Figure 15(c), with the generic gesture set (G1 in Figure 13) and \(81.7\%\), as shown in Figure 15(d), with the discrete gesture set (set 25 in Figure 13). Overall, the reactive interaction models performed better in real-time than the active interaction models.
The reactive interaction generic tap gesture performed more accurately (\(94.6\%\)) than the active interaction generic tap (\(88.9\%\)). The generic slide up (\(88.9\%\)) and slide down (\(94.4\%\)) involve similar participant movements, which caused occasional misrecognition between them. Similarly, the generic slide right (\(83.3\%\)) and slide left (\(83.3\%\)) were misclassified due to similar hand movements while performing the gesture. The generic knob (\(94.4\%\)) was occasionally misclassified as a generic tap due to the region overlap between the tap and knob gestures.
For the discrete gesture set, we found that the majority of misclassifications were between similar gestures, irrespective of their position. The worst performers were slide left on the metacarpals (\(66.7\%\)) and slide right on the carpals (\(66.7\%\)). Additionally, slide up on the thumb MCP to DP (\(76.5\%\)) was classified as tap on the thumb DP (\(93.8\%\)) due to the region overlap and the similar movement before performing the gesture.
8 Limitations and Future Work
There are a number of limitations that could be addressed in future work. In Study 1, we only tested gestures on a specific body part at a specific orientation. These two factors can significantly affect the Fitts' law index of difficulty (ID), since the user's motion and proprioceptive sense change drastically for different body parts and orientations. We plan to explore other body parts and different hand orientations in the future. In addition, we used Fitts' law to identify and score the best finger tap positions for performing gestures in that area. This design may not apply to all hand gestures. We plan to address this research gap in the future by proposing an extended Fitts' law to assess proprioception for hand-based gestures.
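For reference, the index of difficulty referred to here is the Shannon formulation of Fitts' law, \(ID = \log_2(D/W + 1)\), where \(D\) is the movement distance to the target and \(W\) is the target width; how best to extend this formulation to proprioceptive, eyes-free pointing on arbitrary body parts is precisely the open question noted above.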
For Study 2, we collected our data with the participant's hands placed on the table each time. We did not consider different hand sizes, shapes, or movements of the dominant hand, which may lead to poorer results. Hence, we plan to expand this with various hand states and movements. Additionally, we intend to collect more data combining both background and gesture data from a wider range of participants with a variety of hand sizes, in order to improve our model's generalizability and robustness.
For Study 3, the current dataset does not consider the influence of different background conditions, hand accessories, or displacement of the arm wearing the radar sensor on gesture performance. To improve on this study, our next step is to perform data collection closer to real-world use, such as while performing daily tasks like holding a bag, walking, and so on. Furthermore, our study was designed and conducted with only right-handed participants. In the future, we would like to extend gesture recognition to all participants using few-shot or zero-shot learning methods [84]. Finally, we designed our gestures for bimanual interaction with a limited set of gestures. In the future, we plan to investigate single-handed interaction and explore different hand gestures and hand-to-object interactions with the aid of proprioception.
9 Conclusion
We present RadarHand, a wrist-worn radar for detecting proprioceptive touch-based gestures on human skin landmarks. We first established the most appropriate proprioceptive points on the back of the left hand for gestures. We then grouped the gestures and classified them with a deep learning model (achieving \(92\%\) accuracy for the generic gesture set and \(93\%\) for the best discrete gesture set), and proposed contextual applications for each gesture group. Furthermore, we conducted an additional real-time evaluation of the model to understand its performance and shortcomings under active and reactive interactions, obtaining accuracies of \(87\%\) and \(74\%\) for active generic and discrete gestures, respectively, and \(91.3\%\) and \(81.7\%\) for reactive generic and discrete gestures, respectively. Finally, we summarized the findings as a set of design guidelines for radar as a wrist-worn wearable. Our results indicate that RadarHand has strong potential for future on-skin interfaces.
The data that support the findings of this study are available from the corresponding authors, Ryo Hajika and Tamil Selvan Gunasekaran, upon reasonable request. The data comprise the full RadarHand dataset, deep learning model weights, software, and results from all the studies. Researchers interested in accessing the data can contact both authors to request the data and discuss any necessary terms or restrictions.