1 Introduction
Gaze-based interactions have emerged as a powerful and intuitive method for users to interact with Extended Reality (XR) environments. This surge in interest can be attributed not only to the growing number of Head-Mounted Displays (HMDs) incorporating eye-tracking systems [93], but also to the efficiency, speed, and reduced physical effort that gaze interactions offer [105, 110]. Essentially, these interactions eliminate the need for the pointing step in point-and-select interactions. Numerous studies have investigated how to harness natural eye movements for direct input, such as object selection [2, 14, 71, 74, 121], with one of the most common methods being the Dwell technique [25, 30, 43, 76]. This method requires users to fixate on a target for a predetermined duration to confirm selection, distinguishing passive viewing from active engagement. However, the Dwell technique is not without its drawbacks, as it can lead to the Midas touch issue [43], where unintentional selections are mistakenly registered by the system.
To mitigate the Midas touch, HCI researchers have proposed the use of an external trigger as a form of manual input [128]. This solution has been implemented in a variety of ways, from voluntary blinks [71] to simply pressing a button on a physical controller [29] (see Section 2.1). These techniques improve selection accuracy; however, they require physical exertion and hence can introduce fatigue over extended use [45]. Furthermore, these actions diminish the instinctive nature of gaze interactions and can break the immersive experience, as users are forced to shift their attention to voluntarily execute a motor task. The pursuit of a more effortless and passive approach to addressing the Midas touch in gaze-based interactions continues to be a significant challenge, especially for XR experiences.
Brain-Computer Interfaces (BCIs) have the potential to overcome these limitations by offering a natural and accessible interaction technique for these spatial environments [39]. However, prevalent BCI paradigms, including the P300 Event-Related Potential (ERP), Steady-State Visually Evoked Potential (SSVEP), and Motor Imagery (MI)-based BCIs, are fraught with issues such as high user workload, fatigue, and slow interaction times that limit their widespread adoption [1], especially for healthy users. Typically, these BCIs are categorized as active, where users need to consciously imagine muscle movement (MI), or reactive, where responses such as phase-locked spectral activity are evoked by stimuli such as SSVEP’s flashing lights [3]. In contrast, passive BCIs have emerged as a compelling avenue: users do not deliberately control or manipulate their thoughts, nor do the stimuli need to be altered to evoke a specific brainwave response, facilitating smoother interactions and reducing user cognitive load [126]. Passive BCI paradigms tapping into slow negative waves, specifically the Stimulus-Preceding Negativity (SPN) and the Contingent Negative Variation (CNV), have gained attention only in recent literature [41, 94, 104, 123, 129], though they were proposed as far back as 1966 for ‘the direct cerebral control of machines’ [118].
The SPN is a purely nonmotor negative potential that arises when a user expects a forthcoming stimulus, often as a consequence of a prior action [8]. It results from the summation of numerous postsynaptic potentials in the brain regions that will be engaged in processing the upcoming stimulus. Its functional importance is believed to be related to preparatory activity, or the suppression of neural activity, aimed at accelerating neural processing upon receiving the stimulus [8]. This negative potential has the capability to serve as a passive confirmation mechanism (to effectively address the Midas touch), and we investigate that capability in this work. Moreover, the SPN allows for an unobtrusive and implicit integration of user intent detection into the interaction process, thus making it seamless.
To understand how an SPN can act as a confirmation trigger, consider a user navigating an XR music interface. As the user gazes at the play button, they anticipate stimulus feedback (an audio confirmation or a visual cue such as a change in the UI) that will occur upon selection; it is this anticipation that elicits the SPN. The system, sensing the change in brainwaves via Electroencephalogram (EEG), can automatically activate the play function. Such an interaction, where a user’s anticipation of a response becomes the trigger, enables a more fluid interaction. The user no longer has to make a deliberate action; instead, their inherent neural signatures, based on experience and intent, become the primary mode of interaction. Thus, these negative slow-wave potentials serve as a passive mental switch and have the potential to address inadvertent gaze selections, as users do not expect feedback when not selecting a target. More importantly, this passive mechanism sidesteps the need for users to consciously engage in the selection process, eliminating the selection step as well in point-and-select interactions.
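To make this loop concrete, the following is a minimal sketch, not our implementation: all names are hypothetical placeholders for the eye-tracker query, the EEG classifier output, and the UI action.

```python
# A minimal sketch (not our implementation) of an SPN-gated selection loop.
# gaze_target, spn_probability, and activate are hypothetical placeholders
# for the eye-tracker query, the EEG classifier output, and the UI action.

def interaction_step(gaze_target, spn_probability, activate, threshold=0.8):
    """Activate the fixated element when anticipatory negativity is detected."""
    if gaze_target is not None and spn_probability >= threshold:
        activate(gaze_target)  # the user's anticipation itself confirms selection
        return True
    return False

# Example: gaze rests on 'play' and the classifier reports a likely SPN.
interaction_step("play", 0.93, lambda t: print(f"selected {t}"))  # selects
interaction_step("play", 0.40, lambda t: print(f"selected {t}"))  # passive viewing
```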
In this study, we delve deeper into the dynamics of the SPN in the context of XR by examining its presence within an immersive three-dimensional VR environment (Figure 1) and exploring its viability as a solution to the challenges presented by the Midas touch issue in gaze-based dwell interactions. We also address certain potential confounds left unresolved by prior research, thereby establishing the SPN’s link to user anticipation, and examine the effect of UI familiarity on its manifestation.
Terminology. Previous neuroscience research [104, 129] that explored the SPN in addressing the Midas touch differentiated active engagement from passive viewing in Dwell-based control using terms such as Intentional vs. Non-intentional fixations and Intentional vs. Spontaneous selections. However, these distinctions may be misleading within the HCI field, as all gaze activities are fundamentally intentional. Therefore, in our study, we adopt the term intent-to-select interactions for the explicit intent to interact with or select a target, and intent-to-observe interactions for implicit or passive engagement, such as watching or viewing a target without intending to interact with it. This terminology more precisely captures the distinction in the intent behind the user’s gaze activity.
3 Novelty, Hypotheses, and Contributions
Prior work [94, 104, 129] has demonstrated the SPN’s ability to distinguish between an intent-to-select and an intent-to-observe interaction, indicating its potential for resolving the Midas touch problem in gaze-based dwell interactions via a more seamless and unobtrusive approach. However, having used two-dimensional stimuli on computer monitors, such studies are constrained to a small visual field, utilizing screen-based eye trackers with restricted head movements [104, 129]. They therefore offer limited ecological validity for understanding human behavior in more naturalistic 3D environments.
In distinguishing our research’s use of XR environments, it is imperative to consider their unique aspects. XR’s immersive nature, coupled with its depth and spatial dynamics, offers a complex interaction space, as illustrated in Figure 1. This 3D environment influences cognitive processes and neural responses, potentially altering how users anticipate and react to stimuli. Prior work found that 3D immersive environments, in comparison to 2D, decrease alpha band power and amplify cortical network activity in the parietal and occipital lobes [51, 70, 124], owing to enhanced visual attention and perception. Li et al. [67] add that VR intensifies visuospatial sensory inputs, leading to increased corticothalamic connectivity [103]. This enhanced connectivity, in turn, leads to more rapid cortical spike timing, as evidenced by the quicker onset of the P300 component [10], which, like the SPN, is linked to visual stimuli.
Moreover, much of our current knowledge, derived from studies of 2D stimuli on monitors, does not effectively apply to XR environments. For instance, research indicates that VR environments can reduce reliance on working memory in object search and copying tasks compared to traditional laboratory settings [23]. The same work found that as locomotive demands increase, such as when gaze moves to more peripheral locations, the dependence on working memory increases. Lastly, the integration of head and eye movements in XR enhances realism and user engagement, and provides a richer context for studying anticipatory neural responses like the SPN.
We also introduce an additional control condition where feedback is presented to the user even when they are not actively trying to select the target, in contrast to prior work [104, 129], which did not control for this between conditions. This condition seeks to confirm that the SPN is intrinsically tied to the user’s implicit anticipation of selection feedback, regardless of feedback presence. To the best of our knowledge, we are the first to address this confound.
Additionally, we contrasted the ERPs from intent-to-select interactions between instances when participants were familiar with target locations and when they were not, to determine whether an SPN manifests more rapidly when targets are familiar. Prior work has shown that memory-guided attention can trigger temporal predictions, leading to enhanced performance and neural activity at the expected location [12, 86]. We investigated this because, in a BCI used for everyday tasks, users are typically well acquainted with the UI they engage with. Understanding how familiarity with the UI influences anticipatory responses is vital for optimizing BCI systems for everyday use. Building on the foundation laid by prior research, we set forth specific hypotheses:
• H1: We hypothesize that the SPN will manifest in intent-to-select interactions but not in intent-to-observe interactions in an immersive environment.
• H2: We hypothesize that the SPN is driven by the user’s intention, or anticipation of the stimulus, and not by the stimulus presence itself.
• H3: As users acclimatize to the target locations, we hypothesize that the SPN will manifest more quickly.
4 User Study
Our study aims to ascertain whether the SPN wave, a type of slow negative potential, appears during an intent-to-select interaction in an XR environment, a key preliminary investigation in addressing the Midas touch issue. To explore this, we devised a VR task involving three-dimensional object selection using the Dwell technique.
4.1 Participants
We recruited 22 participants at the University of Colorado through online bulletins; 5 were excluded due to technical issues, difficulties with the counting task, or the eye-tracker’s failure to accurately track their gaze. The remaining 17 participants (9 male and 8 female), averaging 28.8 years of age (SD = 10.7), are analyzed in this work. One participant had moderate experience with virtual and augmented reality, 10 had minimal experience, and 6 had none; 5 participants had experience with gaze-based control. Participants who required corrective eyewear were excluded due to gaze-tracking concerns, although clear contact lenses were allowed and 4 participants wore them; 9 participants had naturally normal vision and 8 had corrected-to-normal vision. All participants reported being free of neurological disorders. They were paid 15 USD per hour for the two-hour study.
4.2 Apparatus
The experimental setup, as seen in Figure 2, comprised (i) an HP Reverb G2 Omnicept Edition VR headset with an integrated Tobii eye-tracker, and (ii) a 64-channel BioSemi ActiveTwo EEG system with four additional EOG electrodes and two reference electrodes recorded on the same amplifier. The 64 EEG electrodes were placed in a 10-10 montage across the scalp [87]. The reference electrodes were placed on the left and right mastoids, and the EOG electrodes on the outer canthus and below the orbital fossa of each eye. We made four slits in the HMD’s face cushion to reduce recording interference and pressure points on the EOG electrodes. All electrode impedances were kept below 50 kΩ and referenced to the Common Mode Sense active electrode during the recording. EEG and EOG data were recorded at 512 Hz. The Tobii eye-tracker, sampling at 120 Hz, identified the user’s gaze point in real time, while also logging pupillometry at the same rate.
The VR headset was tethered to a computer with an Intel i7-9700K @ 3.6 GHz CPU, 16 GB of RAM, and an NVIDIA GeForce RTX 2080 Super GPU. Our tasks, developed and run in Unity, maintained a 90 fps frame rate, fully leveraging the HMD’s display capacity. The HMD used inside-out tracking for six-degrees-of-freedom tracking, had a resolution of 2160 x 2160 per eye, and a 114° field of view. A separate computer was connected to the BioSemi amplifier, on which all data was logged. The EEG, EOG, eye-tracker, and an experimental marker stream (which highlighted interaction events and other key moments in our task) were synchronized and logged using the Lab Streaming Layer, facilitated by wired LAN connections between the systems.
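As an illustration of this synchronization step, the sketch below shows how an experimental marker stream can be published with LSL’s Python bindings (pylsl); the stream name and marker strings are illustrative, not the exact ones used in our task.

```python
# Sketch: publishing an experimental marker stream over Lab Streaming Layer.
# Requires the pylsl package; stream name and marker strings are illustrative.
from pylsl import StreamInfo, StreamOutlet

info = StreamInfo(name="ExperimentMarkers", type="Markers",
                  channel_count=1, nominal_srate=0,      # irregular rate
                  channel_format="string", source_id="xr-spn-study")
outlet = StreamOutlet(info)

# Each push is timestamped by LSL, so markers can later be aligned with the
# EEG, EOG, and eye-tracker streams recorded on the other machine.
outlet.push_sample(["gaze_on_target"])
outlet.push_sample(["selection_feedback"])
```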
4.3 Task
Participants engaged in VR-based three-dimensional object selection tasks modeled after Zhao et al. [129], though the objects remained static in our study. The objects comprised 15 targets, i.e., numbered blue spheres spread throughout a three-dimensional space, as depicted in the left image of Figure 3. The nearest target occupied a visual angle of 19.6°, and the farthest target a visual angle of 7.8°. The targets were distributed across an area spanning a horizontal visual angle of 60.4° and a vertical visual angle of 47°.
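For reference, the visual angle $\theta$ subtended by a target of diameter $s$ at viewing distance $d$ follows the standard relation (the example dimensions below are illustrative, not our scene’s actual values):

$$\theta = 2\arctan\left(\frac{s}{2d}\right)$$

For instance, a sphere 0.5 m in diameter centered 1.45 m from the viewer would subtend $2\arctan(0.25/1.45) \approx 19.6°$, the visual angle of our nearest target.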
Dwell-based selection mechanism. The HMD’s eye-tracker provided real-time coordinates of the gaze origin and the normalized gaze direction. Using Unity’s Raycast function, we projected a virtual ray from the gaze origin in the direction of this normalized vector. The point at which this ray intersected an object in the environment indicated the user’s current visual focus. When the user’s gaze landed on a target, the ray intersected the target’s collider, changing its color from blue to red, as seen in the right image of Figure 3.
Prior research found the presence of the SPN at dwell thresholds of both 1000ms and 500ms [104]. We chose a 750ms threshold for our study as a balanced approach, considering the novel challenge of detecting an SPN in immersive environments. The dwell countdown started when the user’s gaze first landed on the target (i.e., when it turned red). If the user’s gaze remained fixed on that target for the full 750ms duration, the target illuminated in bright green (as seen in the middle image of Figure 3), accompanied by an audio cue signaling a successful selection. It is this bright green visual stimulus that we expect the user to anticipate in intent-to-select interactions. The target turned back to blue when the user’s gaze shifted away from it.
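Our implementation is a Unity/C# script driven by the gaze raycast; the following Python sketch captures the same dwell logic under simplified assumptions (per-frame updates with a time step dt; set_color and play_audio_cue are hypothetical stand-ins for engine calls).

```python
# Framework-agnostic sketch of the dwell mechanism described above.
DWELL_THRESHOLD = 0.75  # seconds (our chosen 750 ms threshold)

def set_color(target_id, color):
    print(f"target {target_id} -> {color}")  # stub: replace with engine call

def play_audio_cue():
    print("audio cue")                       # stub: replace with audio playback

class DwellTarget:
    def __init__(self, target_id):
        self.target_id = target_id
        self.gaze_time = 0.0   # continuous fixation time on this target
        self.selected = False

    def update(self, gazed, dt):
        """Call once per frame with whether the gaze ray hits this target."""
        if not gazed:
            # Gaze left the target: reset the countdown, return to idle blue.
            self.gaze_time, self.selected = 0.0, False
            set_color(self.target_id, "blue")
            return
        self.gaze_time += dt
        if self.gaze_time < DWELL_THRESHOLD:
            set_color(self.target_id, "red")    # countdown in progress
        elif not self.selected:
            self.selected = True
            set_color(self.target_id, "green")  # the anticipated feedback
            play_audio_cue()

# Simulate a 1 s fixation in 50 ms frames: red frames, then one green event.
target = DwellTarget(7)
for _ in range(20):
    target.update(gazed=True, dt=0.05)
```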
The tasks involved interactions across three conditions: (i) Intent-to-select, (ii) Intent-to-observe without feedback, and (iii) Intent-to-observe with feedback.
4.3.1 Intent-to-select Interactions.
In this condition, participants were required to select the 15 targets based on the Dwell mechanism, with targets turning green and accompanied by a short audio cue to indicate a successful selection. The Intent-to-select condition was split into two separate tasks: selecting targets in ascending (Intent-to-select ascending) and in descending order (Intent-to-select descending) — to ensure a thorough mix with the Intent-to-observe conditions.
4.3.2 Intent-to-observe Interactions.
For intent-to-observe interactions, which served as the control, participants engaged in a counting task without intending to select the targets. These conditions depict instances where inadvertent dwell selections can happen while users are passively observing a target. Each sphere was surrounded by a varying number of dots, and participants were instructed to identify five spheres with a specific number of dots. Upon identifying them, they added up the central numbers displayed on those spheres and entered the resulting sum into a virtual keypad in VR using a controller. The intent-to-observe interactions were split into two conditions:
Intent-to-observe without feedback. In this condition, the Dwell selection mechanism operates identically in the background. However, unlike the Intent-to-select condition, the target does not turn green after the Dwell threshold because the user is not "selecting" the target. Nevertheless, an intent-to-observe interaction trigger is logged once the timer ends.
Intent-to-observe with feedback. If the stimulus is not provided during intent-to-observe interactions, is the missing SPN attributable to the lack of stimulus or to the lack of the user’s intention to select? To address this ambiguity, we introduced an extra control condition in which the target turns green and is accompanied by the audio cue despite the user not actively selecting the target. We hypothesize (H2) that the SPN will be absent in this condition.
To summarize, the study consisted of three conditions (Intent-to-select, Intent-to-observe without feedback, and Intent-to-observe with feedback), presented across four tasks: (i) Intent-to-select ascending, (ii) Intent-to-select descending, (iii) Intent-to-observe without feedback, and (iv) Intent-to-observe with feedback.
4.4 Experimental Procedure
We employed a within-subjects design, with participants completing three blocks, each containing five phases, as shown in Figure 4. In each phase, the four tasks (depicted by the blue boxes) were randomized. Participants performed a total of 450 intent-to-select interactions across all three blocks. Each block consisted of 150 interactions, broken down into 30 interactions per phase, with each of the two intent-to-select tasks comprising 15 interactions. For intent-to-observe interactions, there was no fixed number of interactions, as the task was designed to be exploratory. Nonetheless, each task required participants to identify at least 5 targets marked with the specified number of dots, resulting in a minimum of 10 interactions per phase, i.e., at least 50 interactions per block and a minimum of 150 interactions across all three blocks. The actual number of interactions varied from task to task for every participant.
At the start of every intent-to-observe interaction task, the number of dots to find, the number of dots present on each target, and the position of the dots on each target were randomized, with the number of dots to find falling within a range of 5 to 8. To aid familiarity with target locations, the target numbers remained in the same spatial locations within each block (phases 1-5). This was done to specifically probe H3. However, these numbers were randomized at the start of each block.
Upon arriving at the lab, participants signed consent forms per the University’s IRB (IRB# 23-0156) and completed a demographics questionnaire. Next, the EOG, reference, and EEG electrodes were attached to the participants, with electrode gel applied to ensure optimal signal conductance. Participants received task instructions, with specific guidelines such as making slow head movements. They were advised to approach the tasks at a comfortable pace and to focus exclusively on the tasks. They were also cautioned against speaking during the study and were reminded to maintain relaxed facial muscles to prevent interference with EEG signals. Before starting the tasks, participants wore the HMD and underwent an 8-point native eye-calibration.
Before the main task, participants completed a tutorial encompassing one phase of all four tasks. They then proceeded to complete the three blocks, each lasting approximately 20 minutes and each followed by a 5-minute break where they could remove the HMD. The eye-tracker was recalibrated at the beginning of each block and any time the participant put the HMD back on after removal.
4.5 Eye/EEG Data Processing
Data processing and analysis were conducted offline through custom MATLAB scripts leveraging the EEGLAB toolbox [16], along with ERPLAB [69] and the EYE-EEG plugins [19]. The eye-tracking data was used for the detection of eye events. First, blinks were identified by recognizing typical gaps in gaze behavior ranging from 50 to 700ms [38]. Any gaps shorter than 50ms were interpolated, and those larger than 700ms were deemed data dropouts. Subsequently, using a velocity-based algorithm [26, 27] adapted from the EYE-EEG plugin [19], we identified saccade and fixation events. Specifically, a velocity factor of six (i.e., six times the median velocity per subject) combined with a minimum saccade duration of 12ms was applied. Fixation events were recognized as eye data with velocities below this velocity factor. To remove biologically implausible eye movements and outliers from the preceding velocity-factor classification, we kept fixations that lasted between 100 and 2000ms and saccades meeting the following criteria: (i) a 12 to 100ms duration, (ii) a 1 to 30° amplitude, and (iii) a velocity ranging from 30 to 1000°/sec. We then merged the eye-tracker data with the EEG data (upsampling the eye data to match the EEG), including the identified blink, saccade, and fixation events.
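A simplified sketch of this velocity-factor classification, reduced to a single pair of gaze coordinates and omitting the plugin’s refinements, is shown below; the parameter names are illustrative.

```python
# Simplified sketch of the velocity-factor saccade classification (adapted in
# spirit from the velocity-based algorithm used by EYE-EEG [26, 27]);
# x, y are gaze coordinates in degrees sampled at srate Hz.
import numpy as np

def detect_saccades(x, y, srate, velocity_factor=6, min_dur_ms=12):
    """Return a boolean mask marking samples that belong to saccades."""
    vx = np.gradient(x) * srate          # horizontal velocity (deg/s)
    vy = np.gradient(y) * srate          # vertical velocity (deg/s)
    speed = np.hypot(vx, vy)
    threshold = velocity_factor * np.median(speed)   # 6x median velocity
    above = speed > threshold
    min_samples = max(1, int(np.ceil(min_dur_ms / 1000 * srate)))
    mask = np.zeros_like(above)
    start = None
    for i, flag in enumerate(np.append(above, False)):  # False flushes last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_samples:                # keep runs >= 12 ms
                mask[start:i] = True
            start = None
    return mask  # sub-threshold samples outside saccades are fixation candidates
```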
4.5.1 Correcting ocular artifacts.
In complex visual search tasks, ocular artifacts introduced by movements of the eyeballs, eyelids, and extraocular muscles can contaminate the neural activity [92]. To isolate the EEG components associated with these artifacts, we applied ICA to the EEG and EOG data using the extended Infomax algorithm [66], in line with the OPTICAT methodology [18]. The OPTICAT method is particularly effective at removing eye muscle-generated saccadic spike potentials [18]. It employs more liberal filter settings to ensure retention of higher-frequency activity in the training data, overemphasizes saccade onset events, and automates the process of identifying and discarding inappropriate ICs.
For every participant, we initially re-sampled the EEG data from 512 Hz to 256 Hz across all 70 EEG, EOG, and mastoid channels and then re-referenced the data to the average of the left and right mastoids, in line with previous studies [17, 33, 50, 85, 89]. To detect bad channels, we applied a voltage threshold of four standard deviations using the pop_rejchan function in EEGLAB. These bad channels were then spherically interpolated using EEGLAB’s pop_interp function. Finally, we applied a second-order Butterworth bandpass filter from 2 to 100 Hz, optimal for our task [18].
Before applying ICA to the EEG data, we overweighted saccade events by reappending the EEG intervals from 20ms before to 10ms after saccade onsets, resulting in twice the original event data. After running ICA, we used an automatic IC flagging/rejection process to classify activity related to ocular artifacts, based on a procedure from Plöchl et al. [92] implemented in the EYE-EEG toolbox (function: pop_eyetrackerica). We computed variance ratios between saccade epochs and fixation epochs (i.e., variance_saccade / variance_fixation), using a saccade window starting 11.7ms (or 3 samples) before saccade onset until saccade offset, and fixation windows spanning from fixation onset to 11.7ms before the next saccade. If the variance ratio exceeded 1.1, the IC was flagged as eye-artifact related [92]. On average, 4.2 ICs (SD = 1.26) were marked for rejection across all participants. The benefit of this approach is that it uses an objective criterion to identify which ICs covary with saccade- and fixation-related activity measured by an eye-tracker.
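In essence, the criterion reduces to a few lines; the sketch below assumes the IC activations and the saccade/fixation sample indices from the detected eye events are already available.

```python
# Sketch of the saccade/fixation variance-ratio criterion from Plöchl et al. [92]
# as applied by pop_eyetrackerica; ic_acts is an (n_ICs, n_samples) array of IC
# activations, and the two index arrays select samples inside the saccade and
# fixation windows defined in the text (both assumed precomputed here).
import numpy as np

def flag_ocular_ics(ic_acts, saccade_idx, fixation_idx, ratio_threshold=1.1):
    """Return indices of ICs whose saccade/fixation variance ratio exceeds 1.1."""
    var_saccade = np.var(ic_acts[:, saccade_idx], axis=1)
    var_fixation = np.var(ic_acts[:, fixation_idx], axis=1)
    ratios = var_saccade / var_fixation
    return np.flatnonzero(ratios > ratio_threshold)  # ICs to reject
```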
4.6 Extracting ERPs
The preprocessing for the ERPs followed an approach akin to the ICA preprocessing. We resampled the EEG signals from 512 Hz to 256 Hz, eliminated 60 Hz line noise using the Zapline-plus extension for EEGLAB [15, 58], and re-referenced the data to the average of the left and right mastoids. Bad channels were interpolated as mentioned earlier, followed by application of a 4th-order 0.1 to 40 Hz Butterworth bandpass filter. ICs marked for rejection during the ICA decomposition were removed, and the remaining ICs were back-projected onto this newly preprocessed data. In this way, ocular artifacts were substantially reduced in our EEG signal. We used ERPLAB to epoch the data, extracting ERPs time-locked to the selection/interaction event within a -2000ms to 1500ms window. For intent-to-select interactions, only the events pertaining to correct selections (sequentially triggered) were used, and inadvertent selections were discarded. Moreover, intent-to-select interactions were reduced to the mean of the ascending and descending tasks for further analysis. For intent-to-observe interactions, all trials that led to the interaction trigger (750ms) were used, given that the accuracy of the counting task is not pertinent to our analysis.
The ERPs were baseline corrected from -1000 to -850ms, as we were primarily interested in the -750 to 0ms window where an SPN is to be expected. We avoided the -850 to -750ms range due to the fixation-related Lambda response resulting from the first fixation on the target stimulus. ERP epochs with artifacts were detected and excluded using a 100 μV sample-to-sample voltage threshold, resulting in a roughly 2.6% rejection rate per condition across all participants. The average number of trials used for the grand average ERP for each task is shown in Table 1. We focused our analysis on the occipitoparietal ROI (the O1, Oz, O2, Iz, PO7, PO3, POz, PO4, PO8, P5, P6, P7, P8, P9, and P10 electrodes, as depicted in Figure 5), in line with prior findings highlighting the SPN’s prominence in these regions [94, 104, 129].
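For readers working in Python rather than MATLAB, the epoching stage translates roughly as follows in MNE-Python; the file name and event ids are hypothetical, and note that MNE’s reject argument is peak-to-peak, a looser analogue of our sample-to-sample criterion.

```python
# Rough MNE-Python analogue of the ERPLAB epoching stage (our actual pipeline
# is MATLAB-based); file name and event ids are hypothetical placeholders.
import mne

raw = mne.io.read_raw_fif("sub01_preprocessed_raw.fif", preload=True)
events, _ = mne.events_from_annotations(raw)   # interaction markers

epochs = mne.Epochs(
    raw, events, event_id={"intent_to_select": 1},
    tmin=-2.0, tmax=1.5,            # -2000 to 1500 ms around the event
    baseline=(-1.0, -0.85),         # skips the -850 to -750 ms lambda window
    reject=dict(eeg=100e-6),        # drop epochs exceeding 100 uV peak-to-peak
    preload=True,
)
evoked = epochs.average()           # per-condition average ERP
evoked.plot(picks=["PO7", "POz", "PO8"])  # occipitoparietal channels
```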
6 Contributions, Benefits, and Limitations
The Midas touch dilemma has long plagued gaze-based interactions. Researchers have addressed the issue in myriad ways (see Section 2.1), mostly involving an extra selection step requiring conscious physical effort. A natural and passive approach to complement the intuitiveness of gaze-based pointing is lacking. Our study delved into the potential of leveraging the SPN as a passive indicator of user intent. We observe higher negativity levels during intent-to-select interactions, peaking at -4 μV at 0ms, i.e., when the target was selected. In contrast, the control conditions exhibited minimal negativity, with levels approaching -1 μV, aligning with findings by Shishkin et al. [104]. Interestingly, the intent-to-select ERPs (Figure 7) bear close resemblance to those observed in Zhao et al.’s work [129].
Furthermore, prior work [104, 129] did not investigate the lateral parietal ROI (the P5, P6, P7, P8, P9, and P10 electrodes). Notably, in our research, the P10 electrode yielded particularly significant results (Figure 8). Pronounced negativity for intent-to-select interactions is also evident at the frontal ROI (FpZ in Figure 6). While not investigated here, we attribute this to anticipation of the audio cue on successful selection, given that frontal areas display greater negativity for auditory stimuli and occipital areas for visual ones [7, 60].
Our results also show robust statistical significance via the mass univariate (MU) approach, which not only corrects for multiple comparisons but also harnesses the full spatial and temporal precision of EEG, unlike prior work [104, 129] that employed a mean-amplitude ANOVA. Additionally, the MU approach highlights the lower bound of the temporal onset, or the earliest moment the EEG signal responds to a forthcoming stimulus [88]. These insights are crucial for BCI development, indicating how soon the brain can distinguish between conditions. This, in turn, suggests the potential rapidity of a BCI system in differentiating intent-to-select from intent-to-observe interactions. In our study, Intent-to-select versus Intent-to-observe without feedback interactions differed as early as 12ms post initial gaze at P10. For Intent-to-select versus Intent-to-observe with feedback, this distinction was evident at both PO7 and O1 from -750ms itself (i.e., from the moment the user looked at the target).
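As a rough analogue of this analysis, a mass-univariate comparison with cluster-based permutation correction can be run in MNE-Python; the sketch below uses synthetic data, and our exact MU procedure follows [88] and may differ in its correction method.

```python
# Sketch of a mass-univariate comparison with cluster-based permutation
# correction in MNE-Python, run here on synthetic single-electrode ERPs;
# the data shapes and effect sizes below are illustrative only.
import numpy as np
from mne.stats import permutation_cluster_1samp_test

rng = np.random.default_rng(0)
n_subjects, n_times = 17, 256
select = rng.normal(-2.0, 1.0, (n_subjects, n_times))   # intent-to-select ERPs
observe = rng.normal(0.0, 1.0, (n_subjects, n_times))   # intent-to-observe ERPs

# Paired design: test the per-subject condition difference against zero at
# every time point, correcting across time via cluster permutations.
t_obs, clusters, cluster_pv, _ = permutation_cluster_1samp_test(
    select - observe, n_permutations=10000, tail=0, seed=0)

for cluster, p in zip(clusters, cluster_pv):
    if p < 0.05:
        print("significant cluster:", cluster, "p =", p)
```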
We demonstrated that an SPN can differentiate between intent-to-select and intent-to-observe interactions within an immersive VR context. This is particularly noteworthy given the intricacies of immersive environments, characterized by frequent head movements, diverse depths of field, and expansive fields of view. Although we used a VR environment specifically, we believe our results generalize to the broader XR umbrella, as many of the concepts related to the three-dimensional aspect remain similar.
Additionally, our research elucidates that the SPN is intrinsically tied to a user’s intention to select the target and is not driven by feedback alone, a vital confound overlooked in earlier studies [104, 129]. This finding bolsters the rationale for the SPN’s elicitation and reinforces its practicality for a BCI system. It is also consistent with prior literature on the SPN [11], which states that an SPN is present only for feedback that conveys relevant information. In our research, this information could simply be notifying the user of a successful selection, provided they were actively aiming to select the target. Moreover, these insights pave the way for BCI designs that dynamically adjust feedback in response to real-time neural activity, creating a biofeedback mechanism. For instance, as a user focuses on a UI element, the system could assess negativity levels and responsively adjust the stimulus to heighten the anticipatory response, possibly facilitating faster selection, similar to Crispin et al. [13]. If the user did not intend to select the target, this adaptive stimulus shift would not impact the negativity.
In regard to H3, we were unable to show a statistically significant difference with a high-pass filter (HPF) applied. This might stem from participants not being entirely accustomed to the target placements even by the fifth phase. We did not objectively measure participants’ level of familiarity, but instead inferred it from their task completion times. Moreover, we did not instruct participants to familiarize themselves with the target locations. We do, however, see a strongly significant difference at O2 without a HPF, strengthening the direction of our hypothesis. Given the importance of a HPF in a BCI, we believe additional research and a targeted approach are necessary to validate our third hypothesis. Interestingly, in the phase 5 ERP without a HPF (right plot of Figure 9), we also observe a heightened negativity exceeding -6 μV at 0ms. This suggests that practice and the entrenched memory of the feedback can foster stronger negativity, aligning with previous findings that practice enhances intentional regulation of slow cortical potentials [83].
The approach we adopted in this work emphasizes the implicit and fluent nature of these interactions and decouples BCI paradigms from the constraints of the information transfer rate [101], by which we not only streamline the system’s functionality but also broaden its applicability. This ensures that the system is not just an assistive tool for disabled users but becomes an inclusive interface that healthy users can adopt as well. An added advantage is that the system could be further streamlined using only a few electrodes in the occipitoparietal ROI. For instance, an HMD might incorporate 3 to 5 electrodes on its backstrap, resulting in a more user-friendly form factor. The integration of eye-tracking with implicit neural activity thus not only emerges as a pivotal direction for fostering wider acceptance and integration of BCIs in everyday scenarios, but also presents a more unobtrusive approach to addressing the Midas touch dilemma.
6.1 Limitations and Future Work
We have identified limitations that future research can address. First, the Dwell technique can lead to user fatigue [37, 42, 98]. While prior work has tried to identify an optimal threshold duration, a definitive one remains unestablished [75, 98, 116]. Second, we did not develop a classification model, and hence cannot report how accurately an intent-to-select interaction can be distinguished from an intent-to-observe one for single-trial ERPs. Furthermore, as this study is an initial exploration of the SPN’s potential in tackling the Midas touch problem, it is not yet feasible to compare our approach with an established technique that resolves this issue. Yet, our robust statistical findings hint at the potential to design a hybrid BCI (hBCI) leveraging the SPN, one that could be a faster, more effortless, and more fluent interaction paradigm than SSVEP- or P300-based BCIs.
Third, further research on real-time preprocessing, ocular artifact correction, and the handling of strong motion artifacts is vital to effectively mainstream this anticipatory interaction paradigm for real-time use. Gherman and Zander [35] found that different postures did not impact a passive BCI’s classifier performance, strengthening the case for a real-time system. Fourth, our high-resolution EEG data originated from a controlled lab environment using research-grade EEG equipment; the efficacy of this approach in real-world scenarios with dry electrodes warrants further investigation. Fifth, intent-to-select and intent-to-observe interactions were restricted to their own conditions; further research is necessary to examine these neural signatures when the conditions are intermixed, as they would be in a real-world setting.
Finally, our study did not include a condition decoupling feedback from intent-to-select interactions to ascertain the feedback’s role in SPN elicitation. However, prior work has demonstrated the SPN’s absence without feedback [11] and the necessity of temporal predictability, or an instilled memory of the target providing feedback, for SPN emergence [8, 21, 22]. This aligns with our findings in Figure 9’s right plot: as stated earlier, the phase 5 ERP shows greater negativity than the phase 1 ERP, due to a more instilled memory of the feedback by phase 5. Furthermore, the SPN’s prominence in the occipital region, known for processing visual stimuli, confirms that it is linked to the visual feedback [8]. Future research can examine the SPN’s behavior upon the sudden cessation of feedback during intent-to-select interactions.
6.1.1 Future Work.
Our next efforts are to build a model to establish the approach’s accuracy in offline classification, before advancing to real-time classification. Correcting ocular artifacts for real-time online classification presents a challenge, as ICA-based removal is a time-consuming process, though prior work [44] has effectively implemented it with negligible time delays. We plan to explore additional corrective methods like SGEYESUB [59], which removes ocular artifacts in real time after an initial calibration phase that records specific eye artifacts to build a correction model. Moreover, examining the behavior of the SPN wave in the absence of ocular artifact corrections will provide insights regarding the necessity of these corrections in an online framework. Other preprocessing steps, including resampling, re-referencing, line noise elimination, filtering, and baseline correction (applied after detection of a gaze-on-target event), do not pose a time-constraint issue for real-time implementation.
We hypothesize that an online implementation could be improved by fine-tuning the model to a user’s specific neural signature, requiring a prior calibration step involving Dwell-based selections. It is imperative that the model attain high accuracy to prevent user frustration and guarantee smooth interactions; we plan to evaluate this using subjective and qualitative metrics. To counteract false negatives, an online system should incorporate a fail-safe confirmation mechanism, such as a pinch gesture or a voice interface. False positives, on the other hand, can be addressed by implicitly detecting user states such as ErrPs (discussed in Section 2.3). We also plan to integrate natural gaze dynamics, such as differentiating between ambient and focal fixations [61], which could enhance classification accuracy and optimize computational resources.
In light of evolving interaction paradigms, future work may also explore more sophisticated techniques, such as object manipulation or adjusting sliders, moving beyond mere selection tasks to enrich user experience. A promising direction would be linking the SPN to a variable dynamic value, as opposed to the binary classification employed in selection tasks. Lastly, while our method holds tremendous potential, it’s crucial to investigate privacy-preserving techniques, especially because these systems can decode user states without their explicit awareness, potentially leading to unintended retrievals of sensitive cognitive information.