1. Introduction
The landscape of Artificial Intelligence (AI) has undergone significant transformation in recent years, with its influence extending across various scientific and technological domains. A notable illustration of this impact can be observed in the autonomous vehicle sector, which heavily leverages AI and neural networks. Self-driving automobiles integrate data from many internal systems and external sensors, propelling a revolution in transportation. The concept of driverless vehicles, equipped with swift computational reflexes and optimized urban navigation capabilities, exemplifies the potential of this technology. However, despite these advancements, the widespread implementation of autonomous vehicles faces obstacles, including regulatory challenges and limitations in current data processing methodologies [1,2]. Furthermore, the proliferation of onboard systems, such as electronic control units (ECUs) and additional sensors, raises concerns regarding energy efficiency.
Waymo One, a pioneer in the commercial deployment of autonomous vehicles, utilizes a combination of LiDAR (Light Detection and Ranging), radars, and long-range RGB cameras [3]. Although this self-driving vehicle operates without human intervention, its functionality is limited to controlled and predictable environments. The constraints of RGB cameras in diverse lighting and weather scenarios necessitate investigating alternative technological solutions.
Dynamic Vision Sensors (DVS) are a promising solution to these challenges [4,5]. In contrast to conventional RGB cameras, DVS cameras demonstrate superior performance in low-light environments and offer notable advantages in contrast detection. Their distinctive asynchronous operational mode enables them to concentrate on dynamic environmental changes, resulting in reduced latency and minimized data redundancy. These characteristics enhance environmental perception capabilities and contribute to improved energy efficiency [5].
The integration of DVS with Spiking Neural Networks (SNNs) introduces an innovative methodology for processing visual data. SNNs, which emulate the brain’s impulse-based communication system, excel at managing the data stream from DVS cameras with minimal latency and energy expenditure. This synergy represents a substantial advancement in the development of advanced vision systems for autonomous vehicles, potentially catalyzing a transformation in the domains of robotics and AI [5].
This study aims to investigate the efficacy of SNNs in the tasks of pedestrian street-crossing detection and intention prediction. Given the scarcity of readily available data, we employ a simulation environment to generate the dataset for our subsequent experiments. Within this simulated setting, we replicate pedestrian street-crossing scenarios across various urban landscapes and weather conditions, capturing these scenes using both RGB and DVS cameras. We then assess the SNN’s performance in detecting and predicting pedestrian street-crossing behavior, utilizing input sets of sequential frames. The SNN’s performance is benchmarked against established deep learning models, evaluating their respective capabilities when processing data acquired in both favorable and adverse weather conditions.
We further conduct comparative experiments using the JAAD dataset [6], a collection of videos of pedestrians walking and crossing streets in weather ranging from sunny days to snowstorms, to test whether our model handles real-world conditions as well as it does in the CARLA simulations. Our findings demonstrate that SNNs can achieve performance comparable to or surpassing that of Conventional Neural Networks (CNNs) in specific scenarios, particularly when processing DVS data under challenging weather conditions.
Lastly, we analyze the energy usage of the tested models to investigate their applicability within autonomous vehicle (AV) systems. We openly share the dataset and the code used to conduct the experiments.
Although some DVS datasets exist for pedestrian detection under adverse weather conditions [7], the current study emphasizes simulated data of pedestrian street-crossing detection scenarios to thoroughly explore performance under diverse and extreme weather conditions using SNNs. The use of simulations allows for precise control over environmental variables, which is not always possible with real-world datasets.
3. Theoretical Introduction
This section provides details on the mathematical foundations of the tested SNN models, the CARLA [42] simulation environment used for data creation, and the principles governing DVS.
3.1. Spiking Neural Networks and Neuron Models
Given the significant success of artificial neural networks (ANNs), adopting proven architectures for training SNNs seems a logical step. In fact, such approaches have demonstrated high effectiveness, enabling performance levels that are competitive with ANNs [43]. However, directly transferring these architectures is not straightforward, as the principles underlying SNNs require certain adjustments to the models [44]. We will first outline the general principles of spiking neurons, followed by a detailed description of the specific neuron model used in this study. Next, we will discuss the learning method and the modifications of the ResNet architecture [45] used to enable the processing of spike trains.
3.1.1. Basic Principles of Spiking Neurons
In SNNs, neurons are modeled as units that transmit information through spike trains, which are represented as vectors of binary values. Converting ANNs into SNNs, therefore, involves replacing the nonlinear activation functions with spiking neuron models. The charge of the neuron $H$ in a given timestep $t$ is given by
$$H[t] = f(V[t-1], X[t]),$$
where $X$ is the input vector, $V$ is the discharge function, and $f$ is the neuron function, which depends on the type of chosen model and will be described later. The discharge function describes the neuron’s behavior after emitting a spike. It can incorporate hard or soft resets, in both cases resulting in an instant decrease in the membrane potential. In this paper, we use the hard reset approach, thus we derive only the following equation:
$$V[t] = H[t]\,(1 - S[t]) + V_{reset}\,S[t],$$
where $S$ is the neuronal firing function and $V_{reset}$ is the reset voltage value, to which the membrane potential returns after emitting the spike. The equation describing the firing can be denoted as
$$S[t] = \Theta(H[t] - V_{th}),$$
where $V_{th}$ is the threshold voltage value and $\Theta$ is the Heaviside function, denoted as
$$\Theta(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0. \end{cases}$$
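The charge, fire, and reset cycle described above can be illustrated with a minimal sketch. It assumes the common discrete-time formulation (charge, then threshold comparison, then hard reset); the function names, the pure-integration charge function, and the values `v_th=1.0` and `v_reset=0.0` are illustrative assumptions, not the settings used in the experiments.

```python
from typing import Callable, List, Tuple


def heaviside(x: float) -> float:
    """Heaviside step: 1 when x >= 0, otherwise 0."""
    return 1.0 if x >= 0.0 else 0.0


def integrate(v_prev: float, x: float) -> float:
    """Simplest possible charge function f: pure integration (hypothetical choice)."""
    return v_prev + x


def neuron_step(v_prev: float, x: float,
                charge: Callable[[float, float], float],
                v_th: float = 1.0, v_reset: float = 0.0) -> Tuple[float, float]:
    """One timestep: charge, fire, then hard reset."""
    h = charge(v_prev, x)             # membrane potential after charging
    s = heaviside(h - v_th)           # spike if the threshold is reached
    v = h * (1.0 - s) + v_reset * s   # hard reset: jump to v_reset after a spike
    return s, v


def run(inputs: List[float], v_th: float = 1.0, v_reset: float = 0.0) -> List[float]:
    """Feed a sequence of inputs to one neuron and collect its spike train."""
    v, spikes = v_reset, []
    for x in inputs:
        s, v = neuron_step(v, x, integrate, v_th, v_reset)
        spikes.append(s)
    return spikes
```

With a constant input of 0.6, the neuron needs two timesteps to reach the threshold, so it fires on every second step.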
3.1.2. Neuron Model
Various biologically plausible neuron models exist [46], differing primarily in their level of biological realism and computational complexity. In this study, we employed the Parametric Leaky Integrate and Fire (PLIF) model [47], which extends the widely used Leaky Integrate and Fire (LIF) model. Unlike LIF, PLIF introduces the ability to learn the $\tau$ parameter, a membrane time constant that governs the rate at which the membrane potential decays over time. The PLIF neuron model is described as
$$H[t] = V[t-1] + \frac{1}{\tau}\left(-(V[t-1] - V_{reset}) + X[t]\right),$$
where $\frac{1}{\tau} = \frac{1}{1+\exp(-a)}$. Here, $a$ is a learnable parameter shared across all neurons in a given layer. The sigmoid function is introduced to ensure that $\tau > 1$.
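As an illustration of how the learnable parameter controls the time constant, the sketch below assumes the parameterization of the original PLIF work [47], in which the inverse time constant equals the sigmoid of the learnable parameter. The function names are hypothetical and the initial value of the time constant used in the test is only an example.

```python
import math


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def tau_from_a(a: float) -> float:
    """Membrane time constant implied by the learnable parameter a (assumes 1/tau = sigmoid(a))."""
    return 1.0 / sigmoid(a)


def a_from_tau(tau: float) -> float:
    """Inverse mapping, useful for initializing a from a desired tau > 1."""
    return -math.log(tau - 1.0)


def plif_charge(v_prev: float, x: float, a: float, v_reset: float = 0.0) -> float:
    """PLIF charging step: leaky integration with a learnable leak rate."""
    k = sigmoid(a)  # k = 1/tau, always strictly between 0 and 1
    return v_prev + k * (-(v_prev - v_reset) + x)
```

Because the sigmoid output lies strictly between 0 and 1, any real value of the parameter yields a time constant greater than 1, so the optimizer never has to handle a constraint explicitly.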
3.1.3. Surrogate Gradient Training of Spiking Neural Networks
The binary, event-driven nature of SNNs, while offering several advantages, also introduces challenges when training these networks. Because the spike-generating function is discontinuous, its derivative is zero almost everywhere and undefined at the threshold, so the gradient-based optimization techniques widely used for training ANNs cannot be directly applied to SNNs. The surrogate gradient method addresses this by approximating the discontinuous functions in SNNs with continuous ones, thereby enabling the use of backpropagation and gradient optimization [29].
In the forward pass, the neuron’s response remains as previously described, represented by a Heaviside function. The derivative of this function corresponds to Dirac’s delta:
$$\Theta'(x) = \delta(x) = \begin{cases} +\infty, & x = 0 \\ 0, & x \neq 0, \end{cases}$$
which makes the direct application of backpropagation impossible. To address this, during the backward pass, the Heaviside function is approximated by a selected continuous function. In this study, we employed a sigmoid function approximation, defined as
$$\sigma(x) = \frac{1}{1+\exp(-\alpha x)},$$
with $\alpha$ being the hyperparameter controlling the smoothness. Its derivative can now be expressed as
$$\sigma'(x) = \alpha\,\sigma(x)\,(1 - \sigma(x)),$$
which is continuous and differentiable. Thus, the firing function during the backward pass becomes
$$S[t] \approx \sigma(H[t] - V_{th}),$$
allowing for the computation of the gradient and error backpropagation. In this work, the PLIF neuron was adopted with the following hyperparameters: initial
,
, and the smoothing factor for a surrogate function
.
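The forward/backward asymmetry of the surrogate gradient method can be sketched as follows: the step function is used to produce spikes, while the derivative of a scaled sigmoid stands in for the gradient. The smoothing value `alpha=4.0` in the usage is an assumed example, not the value used in this work.

```python
import math


def heaviside(x: float) -> float:
    """Forward pass: exact, discontinuous spike function."""
    return 1.0 if x >= 0.0 else 0.0


def surrogate(x: float, alpha: float) -> float:
    """Scaled sigmoid, written in a numerically stable form."""
    z = alpha * x
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def surrogate_grad(x: float, alpha: float) -> float:
    """Backward pass: derivative of the scaled sigmoid, used in place of the Dirac delta."""
    s = surrogate(x, alpha)
    return alpha * s * (1.0 - s)
```

The gradient peaks at the threshold (where the input to the function is zero) and decays on both sides; a larger smoothing factor makes the approximation sharper but concentrates the gradient in a narrower band around the threshold.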
3.1.4. ResNet Architecture Adaptations
In this study, we opted to adapt the ResNet architecture into its spiking variant, motivated by the model’s simplicity and its considerable success in various ANN tasks. Fang et al. [
48] noted that converting ResNet into a spiking form encounters challenges such as vanishing gradients and difficulties in maintaining proper identity mapping within the residual block for most neuron models. To address these issues, we adopted the solution proposed in the cited work by replacing the addition operation between the residual block’s output,
, and its input,
, with one of the suggested operands,
G:
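For illustration, the element-wise functions proposed in the cited spike-element-wise (SEW) work [48] are commonly named ADD, AND, and IAND. The sketch below applies them to binary spike vectors; the function name `sew_connect` is hypothetical, and which function was selected in this study is not assumed here.

```python
from typing import List


def sew_connect(a: List[int], s: List[int], mode: str = "ADD") -> List[int]:
    """Combine residual-branch spikes `a` with block-input spikes `s` element-wise."""
    if len(a) != len(s):
        raise ValueError("spike vectors must have the same length")
    if mode == "ADD":
        return [ai + si for ai, si in zip(a, s)]        # sum of spikes
    if mode == "AND":
        return [ai * si for ai, si in zip(a, s)]        # logical AND on binary spikes
    if mode == "IAND":
        return [(1 - ai) * si for ai, si in zip(a, s)]  # (NOT a) AND s
    raise ValueError(f"unknown mode: {mode}")
```

With ADD and IAND, a silent residual branch (all zeros) passes the block input through unchanged, which is how identity mapping is preserved; with AND, identity requires the residual branch to fire everywhere.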
As another modification, we use a Temporally Effective Batch Normalization (TEBN) layer in place of the standard batch normalization (BN) layer. This step was guided by the findings of [49], where the authors demonstrate the superiority of TEBN over other normalization techniques [50,51] in SNNs due to its ability to capture the richer properties of the spike trains. The normalized output $Y[t]$ from the TEBN layer is defined as
$$Y[t] = p[t]\left(\gamma\,\frac{X[t] - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta\right).$$
Here, in each TEBN layer, $\gamma$ and $\beta$ are time-invariant BN parameters and $p$ is a set of learnable weight parameters. The mean $\mu$ and variance $\sigma^{2}$ are calculated from samples across all timesteps, and $\epsilon$ is a small constant that ensures numerical stability.
For the evaluation of these modifications’ effectiveness, we refer the reader to Appendix C, where we perform a comparison study, as well as to the literature originally introducing them [48,49].
3.1.5. Network Readout
As the spiking network produces $T$ output spikes given $T$ input ones, we averaged the output spike train to produce the classification logit. Therefore, the network’s output $\hat{y}$ was equal to
$$\hat{y} = \frac{1}{T}\sum_{t=1}^{T} O[t],$$
where $O[t]$ is the output spike train value at timestep $t$.
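A minimal sketch of this rate-based readout is shown below; the function name is hypothetical.

```python
from typing import Sequence


def readout(output_spikes: Sequence[float]) -> float:
    """Average the output spike train over its T timesteps (a firing rate in [0, 1])."""
    if not output_spikes:
        raise ValueError("spike train must contain at least one timestep")
    return sum(output_spikes) / len(output_spikes)
```

A neuron that fires on 2 of 4 timesteps therefore yields an output of 0.5.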
3.2. CARLA Simulator and Perception System
CARLA is an open-source simulator for autonomous driving research, developed by the Computer Vision Centre and the Embodied AI Foundation [
42]. Built on Unreal Engine 4, it offers a highly realistic urban environment with a variety of scenarios and weather conditions, enabling the safe and controlled testing of algorithms. Its extensive customization options, including adjustable environments, vehicle models, and sensor configurations, make it a powerful tool for advancing autonomous driving research. CARLA supports the integration of custom algorithms and large-scale experiments, making it widely adopted by research institutions and companies worldwide.
One of CARLA’s key features is its detailed customization of weather conditions, offering a wide range of independent parameters that allow for the creation of specific environmental scenarios. Key configurable settings include cloudiness, precipitation, wind intensity, sun position, fog density, and road wetness. These parameters can be adjusted to simulate conditions from clear skies to severe storms, strong winds, or dense fog, providing a flexible platform for testing autonomous vehicle systems in varied and challenging environments. For instance, cloudiness and precipitation can be tuned to mimic everything from a sunny day to heavy rain, while the sun’s position can be controlled through its azimuth and altitude angles. Fog settings influence its density and range, adding realism to low-visibility situations. Moreover, CARLA enables the creation of puddles and wet road surfaces to replicate post-rain conditions. We leverage these features to create diverse driving scenarios, allowing for a thorough evaluation of autonomous driving algorithms under different and challenging conditions.
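As a hedged illustration, a weather preset can be expressed through the keyword arguments of `carla.WeatherParameters` in the CARLA Python API. The numeric values below are invented for illustration and are not the presets used to build the dataset; the helper `apply_weather` is hypothetical and needs a running CARLA server.

```python
# Hypothetical heavy-rain preset; keys follow carla.WeatherParameters keyword names.
HEAVY_RAIN = {
    "cloudiness": 90.0,              # 0 (clear sky) to 100 (fully overcast)
    "precipitation": 80.0,           # rain intensity, 0 to 100
    "precipitation_deposits": 60.0,  # puddles on the road, 0 to 100
    "wind_intensity": 50.0,          # 0 to 100
    "fog_density": 20.0,             # 0 to 100
    "wetness": 80.0,                 # road wetness, 0 to 100
    "sun_azimuth_angle": 45.0,       # degrees, 0 to 360
    "sun_altitude_angle": 15.0,      # degrees, -90 (midnight) to 90 (noon)
}


def apply_weather(params, host="localhost", port=2000):
    """Send a weather preset to a running CARLA server (requires the carla package)."""
    import carla  # imported lazily so the preset can be inspected without CARLA installed

    client = carla.Client(host, port)
    client.set_timeout(10.0)
    world = client.get_world()
    world.set_weather(carla.WeatherParameters(**params))
    return world
```

Keeping presets as plain dictionaries makes it easy to sweep one parameter at a time, for example increasing fog density while holding the sun position fixed.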
ScenarioRunner is another tool offered by the CARLA simulator for defining and executing traffic scenarios. It provides the ability to create and validate complex traffic scenarios that can be used to evaluate and benchmark autonomous driving agents. ScenarioRunner allows for the selection of maps, weather, sensors, and textures and manages them in a controlled way.
3.3. Dynamic Vision Sensor (DVS)
The DVS, commonly referred to as an event camera (Figure 1), functions differently from traditional cameras by capturing changes in intensity asynchronously as a continuous stream of events. Each event represents a change in brightness and encodes information about its pixel location, timestamp, and polarity. Event cameras provide several advantages over conventional cameras, such as a high dynamic range, the absence of motion blur, and high temporal resolution in the microsecond range [5].
To trigger an event, the change in logarithmic intensity must surpass a specified threshold, resulting in a polarity that can be either positive or negative.
CARLA allows access to the DVS camera during simulations, operating in a uniform sampling manner between two consecutive synchronous frames. This requires a high sampling frequency to replicate the high temporal resolution characteristic of a real event camera.
It is important to note that if there is no difference in pixel values between two consecutive synchronous frames, the camera will not output an image. This can happen either in the first frame or in situations where there is no movement between frames. While the DVS camera shares several features with traditional cameras, it also has unique properties that arise from the principles governing event cameras.
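The event-generation principle can be sketched by comparing the log intensity of two consecutive frames pixel by pixel. This is a simplification: it emits at most one event per pixel per frame pair, whereas a real sensor or a finer simulation may emit several. The threshold of 0.3 and the small offset `eps` are assumed values.

```python
import math
from typing import List, Tuple

Event = Tuple[int, int, float, int]  # (x, y, timestamp, polarity)


def dvs_events(prev: List[List[float]], curr: List[List[float]],
               timestamp: float, threshold: float = 0.3,
               eps: float = 1e-3) -> List[Event]:
    """Emit an event wherever the log-intensity change reaches the threshold."""
    events = []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            delta = math.log(c + eps) - math.log(p + eps)  # eps avoids log(0)
            if abs(delta) >= threshold:
                polarity = 1 if delta > 0 else -1  # brighter: +1, darker: -1
                events.append((x, y, timestamp, polarity))
    return events
```

Two identical frames produce an empty event list, which corresponds to the case in which the simulated camera outputs nothing.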
4. Materials and Methods
This section outlines the methodology used to generate the dataset and train the models for pedestrian street-crossing detection in autonomous vehicle (AV) systems. The workflow is depicted in Figure 2.
4.1. Dataset Generation from CARLA Simulator
Our primary goal was to investigate the detection and prediction of pedestrian crossing behavior under challenging weather conditions using neural networks with data from various sensors. Initially, we searched for a dataset that captured pedestrians crossing streets in diverse weather scenarios, including DVS and RGB images. The dataset utilized in our experiments is available at the following link: https://zenodo.org/records/11409259 (accessed on 15 October 2024). Although numerous road-themed datasets exist online, such as N-CARS [32], JAAD [6], and DSEC [52], along with various studies analyzing pedestrian crossing intentions [10,21], none fully met our requirements. These datasets either lack DVS data, offer limited instances of pedestrian crossings or insufficient labeling of such events, or do not provide enough video footage featuring adverse weather conditions. As a result, we opted to generate a custom dataset using a simulation environment.
The simulation was developed using CARLA software version 0.9.13 and the scenario repository from the ARCANE project, which focuses on adversarial scenarios for autonomous vehicles [
53].
In our simulation environment, the primary aim was to establish a predictable and controlled testing setup rather than to replicate a highly complex or fully randomized traffic scenario. To ensure consistency and reliable evaluation of the models, we chose a simplified configuration consisting of one vehicle and one pedestrian. The pedestrian’s crossing behavior occurs at an arbitrary moment during the simulation, introducing sufficient variability for testing while maintaining control over the conditions. This design facilitates the random occurrence of pedestrian crossings while concentrating on key challenges pertinent to our research, such as adverse weather and low-light conditions. By managing the number, speed, and location of the pedestrian in this manner, we ensure our models are tested under a range of diverse yet systematically manageable conditions, avoiding excessive randomness that could complicate the analysis.
4.1.1. Video Simulation
The simulation began with the creation of urban scenarios where pedestrians would cross streets under various conditions. The scenarios were designed to include different lighting conditions and weather effects, which are crucial for testing the robustness of the detection models.
4.1.2. RGB and DVS Data Collection
Two data types were collected during the simulation: RGB images and DVS events. RGB images provide high-resolution color information, while DVS data offers high temporal resolution, capturing changes in the scene with minimal latency. This dual-modality data is essential for training models that can operate effectively in dynamic and challenging environments.
4.1.3. Manual Data Curation
The raw data collected from the simulation underwent preprocessing to ensure quality and consistency. This step involved data cleaning, where corrupted or irrelevant data was manually removed, and manual labeling, where each frame or event was annotated to indicate the presence or absence of a pedestrian on the street.
4.2. Task Formulation
For the experiments, we formulate two training tasks: detection and prediction. Note that our definitions of these tasks differ from those commonly used in the domain terminology; therefore, we explain them in detail.
4.2.1. Detection
At first, we explore the problem of pedestrian street-crossing detection, where we aim to identify if the pedestrian is crossing the street in any frame of a given clip. We randomly extract clips of a given length from the dataset, assigning their labels based on the frame labels included in every clip. Intuitively, we label a clip as positive if any of the frames in it has a positive label. Otherwise, the clip is considered a negative example. Formally, this approach can be defined as
$$y_{clip} = \max_{i \in \{1, \dots, N\}} y(x_i),$$
where $x_i$ is the $i$-th frame from an extracted clip of length $N$ and $y(x_i) \in \{0, 1\}$ is its frame label. Examples of label assignments based on clip frame labels can be seen in Figure 3. Such clips with corresponding labels are used as inputs to the network.
Networks were tested under two clip-length scenarios: 9-frame clips with an overlap of 8 frames between consecutive clips, and 30-frame clips with an overlap of 29 frames. The overlap was introduced only in the training set.
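The clip construction and labeling rule can be sketched with a sliding window over per-frame labels. The function name and the toy label sequence are illustrative; the sketch enumerates windows deterministically, whereas the experiments extract clips at random.

```python
from typing import List, Tuple


def make_clips(frame_labels: List[int], clip_len: int,
               overlap: int) -> List[Tuple[int, int]]:
    """Return (start_index, clip_label) pairs; a clip is positive if any frame is positive."""
    stride = clip_len - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than clip_len")
    clips = []
    for start in range(0, len(frame_labels) - clip_len + 1, stride):
        window = frame_labels[start:start + clip_len]
        clips.append((start, int(any(window))))
    return clips
```

With 9-frame clips and an 8-frame overlap the stride is one frame, so a 12-frame video yields four training clips; with no overlap (as in the validation and test sets) the same video yields one.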
With this approach, we observed a class imbalance, with a predominance of negative examples. Therefore, the cross-entropy loss function was weighted for positive samples proportionally to the imbalance observed in a given setting.
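One common way to realize such weighting, assumed here for illustration, is to scale the positive term of the binary cross-entropy by the ratio of negative to positive samples. The exact weighting formula used in the experiments is not stated, so the ratio below is an assumption.

```python
import math
from typing import List


def positive_weight(labels: List[int]) -> float:
    """Ratio of negative to positive samples (assumed weighting rule)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples to weight")
    return n_neg / n_pos


def weighted_bce(logit: float, label: int, pos_weight: float) -> float:
    """Binary cross-entropy on a logit, with the positive term scaled by pos_weight."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1.0 - p))
```

With three negatives per positive, a misclassified positive clip costs three times as much as a misclassified negative one, which counteracts the tendency to predict the majority class.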
4.2.2. Prediction
Building on the detection task, we also evaluated the predictive capabilities of our networks through proxy clip classification. In this case, we specify the prediction horizon of length H, which defines the number of consecutive frames that directly precede the first frame in which the pedestrian starts crossing the street in a given video. We label those frames as positive. To obtain negative ones, we extracted H consecutive frames from the videos in which the pedestrian did not cross the street, starting from a randomly chosen frame. The number of negative frames to extract was configured to match the number of positive samples, preventing a class imbalance. There was no limit to the number of negative samples that could be extracted from a non-event video. We ensured no overlap between negative sets of frames extracted from a given non-event video.
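The frame selection for the prediction task can be sketched as follows. The horizon is expressed in frames here (converting a horizon in seconds requires the frame rate, which is not assumed), and the negative segments are drawn from a grid of aligned start positions as a simple way to guarantee no overlap; the experiments may sample start frames differently.

```python
import random
from typing import List


def positive_horizon(frame_labels: List[int], horizon: int) -> List[int]:
    """Indices of the `horizon` frames directly preceding the first crossing frame."""
    if 1 not in frame_labels:
        return []
    first = frame_labels.index(1)
    return list(range(max(0, first - horizon), first))


def sample_negative_segments(n_frames: int, horizon: int,
                             n_segments: int, seed: int = 0) -> List[int]:
    """Start indices of non-overlapping negative segments in a non-event video."""
    rng = random.Random(seed)
    starts = list(range(0, n_frames - horizon + 1, horizon))  # aligned grid: no overlap
    rng.shuffle(starts)
    return sorted(starts[:n_segments])
```

For a video whose first crossing frame is at index 5 and a horizon of 3 frames, the positive frames are those at indices 2, 3, and 4.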
Due to the limited number of positive frames, particularly for shorter prediction horizons (
) we chose to construct only clips of length 9 with 8 frames overlapping. In this case, positive clips were created from consecutive positive frames. The same condition was applied to negative label assignments, this time including only negative frames. An example of this clip-labeling strategy can be found in
Figure 4.
The rest of the experiment organization was configured in the same way as for the previously described detection task.
4.3. Benchmarking the Solution Against the JAAD Dataset
To contextualize our findings within existing research, we adapted the JAAD dataset. Although it does not provide the DVS data, available annotations matched the ones found in the proposed simulation dataset. Namely, we were able to determine if in the given frame the pedestrian is crossing the street and categorize the videos into good and bad-weather subsets. Videos were classified as being recorded in bad weather if the conditions contained snow or rainfall. We performed experiments using only RGB data.
Given the limited number and duration of clips, especially under bad weather conditions, we were not able to create positive samples in the same way as in the prediction experiments. Due to this fact, we evaluate only the detection task.
We also observed some labeling problems in the annotations in comparison with our simulation data, which we describe in Appendix B.
4.4. Model Training
Once the data were prepared, they were used to train two types of neural networks.
Neural Network Training
Both SNNs and CNNs were trained using DVS and RGB data. No augmentations were applied, with resizing being the only preprocessing step.
4.5. Model Evaluation
After training, the models were evaluated on a separate test set, including clips not seen during training. The performance of the SNNs and CNNs was compared using metrics such as accuracy, F1 Score, and Area Under the ROC Curve (AUROC).
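For reference, the two threshold-sensitive and threshold-free metrics can be computed as in the sketch below, which uses only the standard library. AUROC is computed through its rank interpretation: the probability that a random positive sample scores higher than a random negative one, with ties counted as one half.

```python
from typing import List


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auroc(y_true: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model that assigns the same score to every sample obtains an AUROC of 0.5, and one that predicts only the negative class obtains an F1 score of 0, which is the signature of a network that failed to converge.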
4.6. Measuring Energy Usage
As energy-efficient processing is one of the prominent benefits of SNNs, we compare the energy usage of each trained network. We follow the methodology in the research by Chen et al. [54], measuring the number of multiply-and-accumulate (MAC) and accumulation (AC) operations in a given network, and translating that to the energy usage of such operations in 45 nm technology [54,55].
In the standard feedforward ANNs used in this research, the number of operations is constant in every forward pass. Therefore, for this type of network, we compute the number of operations during a single forward pass with a dummy input tensor of a shape identical to one of the processed clips in experimental tasks.
For SNNs, the number of emitted spikes varies depending on the input sample. Due to this fact, for SNNs, we estimated the consumption by performing the inference on every sample of the test dataset with the corresponding trained model and averaged the number of operations.
We chose bad weather DVS data in the detection task as the evaluation for SNNs. Corresponding evaluations for ANNs were also performed using single-channel dummy data to mimic the DVS data format.
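The conversion from operation counts to energy can be sketched as below. The per-operation costs (4.6 pJ per MAC and 0.9 pJ per AC) are the 45 nm estimates commonly quoted in the SNN literature and are an assumption here; the values actually used should be taken from [54,55]. The operation counts in the usage are invented.

```python
from typing import Sequence

E_MAC_PJ = 4.6  # assumed energy of one multiply-and-accumulate, in picojoules
E_AC_PJ = 0.9   # assumed energy of one accumulate, in picojoules
PJ_TO_MJ = 1e-9  # 1 pJ = 1e-12 J = 1e-9 mJ


def ann_energy_mj(n_mac: float) -> float:
    """Energy of one ANN forward pass; the MAC count is input-independent."""
    return n_mac * E_MAC_PJ * PJ_TO_MJ


def snn_energy_mj(ac_counts: Sequence[float],
                  mac_counts: Sequence[float] = ()) -> float:
    """Average SNN energy over test samples; AC counts depend on emitted spikes."""
    mean_ac = sum(ac_counts) / len(ac_counts)
    mean_mac = sum(mac_counts) / len(mac_counts) if mac_counts else 0.0
    return (mean_ac * E_AC_PJ + mean_mac * E_MAC_PJ) * PJ_TO_MJ
```

Under these assumed costs, a network performing one billion MACs per forward pass uses 4.6 mJ, while an SNN averaging two billion ACs uses 1.8 mJ.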
4.7. Experimental Setup and Data Preprocessing
In this section, we detail the experimental setup using the generated datasets to tackle various tasks. We evaluated four networks: ResNet18; its spiking adaptation, Spiking SEW ResNet18 with TEBN (SPS R18T); and two models designed for video classification, SlowFast R50 [56] and MViTv2 [57].
For the ANNs, pre-trained weights were employed: ResNet18 was trained on the ImageNet1k dataset [58], while SlowFast R50 used the Kinetics400 dataset [59]. Due to architectural modifications, SPS R18T and MViTv2 were trained from scratch. We also note that MViTv2 was modified to accommodate the needs of the experiments. For a full description of those modifications, we refer the reader to Appendix D.
Due to the temporal nature of the tasks, the ANN version of ResNet was trained using a “pseudotemporal” scheme, where each frame in the input clip is classified separately. The output logit for each frame contributes to a final prediction, calculated as the average of the per-frame logits, as described by Equation (13). We refer to this model as PT ResNet18.
The videos were divided into training, validation, and testing subsets, with 15% of the total videos set aside for testing. Of the remaining videos, 15% were allocated for validation, and the rest were used for training. The frames were processed according to the specific tasks (see the task descriptions in Section 4.2).
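A video-level split following these proportions might look like the sketch below. Splitting by video rather than by clip keeps overlapping clips from the same recording out of different subsets. The seed, the rounding rule, and the function name are assumptions.

```python
import random
from typing import Dict, List, Sequence


def split_videos(video_ids: Sequence[str], test_frac: float = 0.15,
                 val_frac: float = 0.15, seed: int = 0) -> Dict[str, List[str]]:
    """Hold out test videos first, then take the validation share of the remainder."""
    ids = sorted(video_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_frac)
    test, rest = ids[:n_test], ids[n_test:]
    n_val = round(len(rest) * val_frac)
    return {"test": test, "val": rest[:n_val], "train": rest[n_val:]}
```

For 100 videos this gives 15 test videos, 13 validation videos (15% of the remaining 85, rounded), and 72 training videos.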
All networks were optimized using the AdamW [
60] optimizer to minimize weighted binary cross-entropy loss. The initial learning rate was established at
, accompanied by a weight decay factor of
. Batch sizes varied between 4 and 64, depending on hardware capabilities. The training was conducted for a maximum of 100 epochs, utilizing an early stopping protocol if the validation loss did not improve for 8 consecutive epochs. The best-performing model, as determined by validation loss, was selected for testing. Performance metrics included AUROC and F-score, which are well suited for addressing imbalanced classification challenges.
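The early stopping rule can be sketched as a small stand-alone helper. This is an illustrative re-implementation, not the callback used in the experimental code; only the patience of 8 epochs comes from the text.

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 8):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss   # new best model: this checkpoint would be kept
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The loss stored in `best` corresponds to the checkpoint selected for testing, since model selection is also based on validation loss.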
We resized the images from the generated dataset to 450 × 256 to reduce the computational demands of the experiments. This operation introduced some artifacts into the images, although they were not significant enough to prevent model training. For details, we refer the reader to Appendix A.
For all experiments, the code was developed in Python 3.10 with PyTorch 2.2.0 [61] and the Lightning 2.1.3 framework, along with SpikingJelly 0.0.0.15 [62], a library implementing abstractions related to SNNs. All the parameters are given in Table 1. The experimental code is available at https://github.com/szmazurek/snn_dvs (accessed on 15 October 2024).
5. Results
5.1. Street-Crossing Detection
The results of the evaluation of the detection task are summarized in
Table 2. For shorter time windows of 9 frames, PT ResNet18 exhibited superior performance across most cases. Specifically, in the bad-weather subset, PT ResNet18 achieved outstanding results, with an AUROC of 0.9882 and an F-score of 93.05 for DVS data, and an AUROC of 0.9194 with an F-score of 58.02 for RGB data. The SPS R18T model showed competitive performance in the DVS modality, with an AUROC of 0.9504 and an F-score of 57.94, though it lagged significantly in RGB performance.
In normal weather conditions, all networks demonstrated robust detection capabilities, particularly in the DVS modality, where the AUROC scores exceeded 0.94 for each network. The PT ResNet18 continued to perform well with RGB data, obtaining an AUROC of 0.9516 and an F-score of 76.95. Conversely, the SPS R18T experienced a noticeable drop in performance in RGB data, registering an AUROC of 0.7995 and an F-score of 49.26.
Surprisingly, MViTv2 failed to converge in any of the setups, producing random-level predictions with an AUROC of 0.5. An F-score of 0 further indicates that the model failed to identify any positive samples, classifying all samples as negative. When extending the clip length to 30 frames, the observed trends were consistent with those for shorter clips. PT ResNet18 continued to exhibit the best performance overall, although it was slightly surpassed by SlowFast R50 in the good-weather subset. For instance, in bad weather conditions with RGB data, SlowFast R50 markedly improved, increasing its AUROC from 0.5126 in shorter clips to 0.9546 in longer clips, and similarly for DVS data in good weather conditions, achieving an AUROC of 0.9594 and an F-score of 81.68. The SPS R18T model demonstrated comparable performance across various conditions and modalities, regardless of clip length. The performance of MViTv2 remained at the same level as previously.
5.2. Street-Crossing Prediction
The results for prediction task experiments are shown in
Table 3. For the 1 s predictive horizon, we can see the remarkable performance of the SPS R18T in the bad-weather subset for both modalities, where it outperforms both SlowFast R50 and PT ResNet18. The performance advantage is most visible with the DVS modality. Notably, for the same data modality, SlowFast R50 failed to converge, reaching an AUROC of only 0.5827 and a 0% F-score.
This changed in the good-weather subset, where classic ANNs performed better than the SNN. The best results were achieved on the DVS modality, with the top one reached by PT ResNet18 with an AUROC of 0.9455 and an F-score of 93.17%. In this scenario, MViTv2 shows a performance above randomness levels only for normal-weather RGB subsets, hinting at the high dependency of the model convergence on the specific data distributions.
For the 5 s predictive horizon, again the best class separability in the bad-weather subset was reached by the SPS R18T. Contrary to the shorter lookback observations, this time, RGB data turned out to be more informative for the SNN, where it reached an AUROC of 0.882 and an F-score of 79.39. MViTv2 once again failed to provide any meaningful predictions.
In the normal-weather subset, the networks maintain robust performance on the DVS modality, except for the SPS R18T, which shows a large drop in performance compared to the shorter predictive horizon version. This is similar to what can be seen in the previous experiments with single-frame prediction for this network. A large performance decrease can be seen for the RGB modality, where none of the networks reached results better than random guesses, with AUROC scores below 0.5. For normal RGB data, MViTv2 once again shows improved performance compared with other modalities in good and bad weather, emerging as the best-performing model for that subset and modality.
5.3. Street-Crossing Detection on JAAD Dataset
The results of the street-crossing detection in the JAAD dataset can be seen in Table 4. Overall, the performances of nearly all of the models are lower than the ones observed on the simulation dataset. This could, however, be expected, as this dataset contained fewer clips, making it harder for the models to converge.
SPS R18T was better than the PT ResNet18 and SlowFast R50 counterparts for the shorter clips of 9 frames in bad weather, with an AUROC of 0.5966 and an F-score of 74.03%. It was, however, surpassed by MViTv2, which shows the best performance in this subset. On longer clips in the same subset, the performance gap is reduced, but the spiking network shows the best class separability with an AUROC of 0.642. This is interesting, considering that in the simulation dataset, RGB data posed a significant challenge for SPS R18T.
These trends change in the good-weather subset. The performance of the SNN remains similar to the one on the bad-weather data, while PT ResNet18 and SlowFast R50 show significant performance improvements on both short and long samples. The performance of MViTv2 deteriorated to the random level once again.
5.4. Energy Usage Measurement
The results of the energy usage evaluations for the networks used in the previous experiments are shown in Table 5. SPS R18T shows the lowest energy consumption among all networks. For input clips of 9 frames, it consumed 50.45 mJ on average, whereas SlowFast R50 used nearly 10 times more and PT ResNet18 more than 3 times more.
Interestingly, SPS R18T showed decreased energy usage when the number of frames in the clip increased, using only 30.37 mJ. The energy usage of ANNs rose proportionally to the number of frames in the input clip.
A possible explanation is that a better-trained SNN processes data more efficiently, generating fewer spikes during inference. Such a reduction in spikes would directly lower energy consumption, as SNNs are designed to be energy-efficient by firing spikes only when necessary. Thus, as the model becomes more refined, it may rely on fewer spikes, reducing the overall energy usage, in contrast to traditional ANNs, where energy usage scales linearly with the number of frames.
Based on the energy usage measurements, a performance-energy efficiency tradeoff can be established. In
Figure 5 two Pareto plots are shown. From the plots, we can see that in the evaluated task and data subset, SNNs provided the best tradeoff between performance and energy usage. It is also tempting to view MViTv2 as the second-best model in that comparison, yet it has to be noted that its performance is at a random level, making the model unusable.
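The notion of a Pareto-optimal tradeoff can be made concrete with a short sketch: a model is on the front if no other model is at least as good in both energy and score and strictly better in one. The model names and numbers in the usage below are invented for illustration and are not the measured results.

```python
from typing import Dict, List, Tuple


def pareto_front(models: Dict[str, Tuple[float, float]]) -> List[str]:
    """models maps name -> (energy, score); lower energy and higher score are better."""
    front = []
    for name, (energy, score) in models.items():
        dominated = any(
            e2 <= energy and s2 >= score and (e2 < energy or s2 > score)
            for other, (e2, s2) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)
```

Note that a very cheap model with chance-level performance still lies on the front by construction, which is why membership in the front alone does not make a model usable.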
6. Discussion
Our experimental results provide nuanced insights into the performance of traditional and spiking neural networks for detecting pedestrian street-crossing behavior under various weather conditions and modalities.
In the task of crossing behavior detection, SPS R18T exhibited strong performance when leveraging DVS data in adverse weather scenarios. It achieved an AUROC of 0.9542 and an F-score of 75.5% in the 30-frame detection task, demonstrating its effectiveness in identifying dynamic events such as pedestrian movement in challenging visual conditions, including low light and precipitation (see Table 2).
In contrast, traditional ANNs like PT ResNet18 and SlowFast R50 performed well across all conditions, particularly excelling in normal weather environments. This underscores their effectiveness in processing high-resolution, color-rich data (see Table 2). Such capability is crucial for the accurate detection of pedestrian street crossings, where visual details must remain clear and unobstructed by environmental factors.
The performance of MViTv2 raises interesting questions about the model’s adaptability to data distributions. In detection tasks, it consistently failed to converge across all tested cases. For the prediction task, the situation was nearly identical, except for a surprising improvement on good-weather RGB data. The situation was similar when testing the model on the JAAD dataset, where it reached meaningful predictive capabilities in one of the scenarios.
A possible reason for this variance in MViTv2 performance is the model’s sensitivity to data distribution: for datasets and tasks where the differences between training samples are minute, the network may struggle to extract meaningful feature embeddings.
While SPS R18T performs competitively on bad-weather DVS data, it exhibits a noticeable performance drop when using RGB data. Its performance with RGB data dropped significantly regardless of weather conditions, as reflected in its lower AUROC and F-score results compared to other networks. This suggests that while SNNs are particularly adept at processing dynamic visual changes enabled by DVS technology, they may not yet exploit the static, detailed visual information provided by RGB imagery as effectively as traditional ANNs.
In the proposed task of behavior prediction, we observe that the patterns present in the detection task results persist. With shorter predictive horizons in bad weather, SPS R18T substantially outperforms ANNs, achieving better results even with RGB data. In normal weather, ANNs come to the forefront, especially on RGB data. With a longer predictive horizon, ANNs perform better, but the SNN still prevails in bad weather. These observations support previous claims that SNNs are valuable when dealing with DVS data, especially in bad weather conditions (see Table 2).
The results of the detection task evaluation on the JAAD dataset align with those previously seen in simulation data, with ANNs performing robustly in good-weather scenarios. However, for the bad-weather clips, the SNN performed significantly better than classic ANNs on shorter clips and comparably on longer ones, despite working only with RGB data. This requires further exploration, as it shows that in certain problems SNNs can perform comparably to ANNs on data from traditional sensors. We leave the exploration of this problem for future work.
The observed performance patterns indicate a strategic approach to selecting network types based on sensory data and environmental conditions. SNNs utilizing DVS data prove particularly effective in adverse weather conditions due to their capability to robustly process dynamic visual changes. In contrast, ANNs excel with RGB data in favorable weather, taking advantage of the rich color and texture information that is readily accessible. These distinctions emphasize the importance of choosing the appropriate neural network architecture tailored to specific sensor data and task requirements.
Lastly, SNNs have shown great energy efficiency compared to ANNs. In certain instances, the difference reached several orders of magnitude. Moreover, their energy usage does not appear to be strongly affected by the length of the analyzed sample.
The Pareto plots in Figure 5 show that, for the evaluated case, SNNs provided the best performance–energy usage ratio. However, it should be noted that in the cases where SNN performance was significantly lower than that of ANNs, these relationships would change in favor of the latter. Nevertheless, the number of operations performed by ANNs when making predictions is constant; therefore, reducing their energy usage would be harder. SNNs, on the other hand, show considerable room for performance improvement in the cases where they achieve subpar results, making future research in that direction promising.
These findings are especially interesting from the perspective of AVs, as these systems must operate in energy-constrained environments.
6.1. Challenges and Potential of SNNs with RGB Data in Pedestrian Street-Crossing Detection
In this study, we observed a performance gap between SNNs and CNNs when processing RGB data for pedestrian street-crossing detection, particularly in adverse weather conditions. This gap can be attributed to the inherent differences in how these networks handle data. CNNs are highly specialized for processing dense, continuous visual data, such as RGB images, by leveraging convolutional layers to extract spatial features across multiple resolutions, including edges, textures, and color gradients. RGB data is rich in spatial information that is essential for accurate pedestrian street-crossing detection, especially in complex scenes where color, lighting, and texture variations help distinguish objects and individuals.
SNNs, in contrast, are optimized for processing sparse, event-based data, such as that produced by DVSs, which capture temporal changes in pixel intensities asynchronously. SNNs rely on spike-based information transmission, where the representation of data is encoded in the timing of spikes, which inherently limits their ability to fully capture the detailed spatial and chromatic information present in RGB images. When applied to RGB data, SNNs may lose finer details crucial for tasks like pedestrian street-crossing detection, particularly when subtle color differences or gradients are critical for distinguishing between pedestrians and background objects.
This limitation presents several opportunities for advancing the application of SNNs in RGB-based tasks. One possible approach involves spiking convolutional networks that integrate convolutional operations within the spiking domain, allowing for more effective extraction of spatial features while maintaining the temporal efficiency of SNNs. In particular, the development of spiking convolutional layers that mimic the hierarchical feature extraction of CNNs could significantly improve SNNs’ ability to process RGB data, potentially bridging the gap in performance.
Additionally, exploring more advanced spike encoding schemes for RGB data could enhance SNNs’ capacity to process detailed visual information. Traditional rate-based encoding, which translates pixel intensities into spike rates, may not fully capture the richness of RGB images. Temporal coding strategies, such as latency coding or phase-based coding, could improve how spatial and color information is represented in spike trains, thereby enhancing SNNs’ ability to leverage RGB data.
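The difference between rate coding and latency coding can be sketched for a single normalized pixel intensity. This is a simplified illustration under assumptions made here (Bernoulli sampling for rate coding, a linear intensity-to-time mapping for latency coding), not the encoding used in the experiments; for an RGB image, each channel value would be encoded separately in the same way.

```python
import random
from typing import List


def rate_encode(intensity: float, num_steps: int,
                rng: random.Random) -> List[int]:
    """Rate coding: at each step, spike with probability equal to intensity."""
    return [1 if rng.random() < intensity else 0 for _ in range(num_steps)]


def latency_encode(intensity: float, num_steps: int) -> List[int]:
    """Latency coding: one spike at most, emitted earlier for brighter pixels."""
    train = [0] * num_steps
    if intensity > 0.0:
        # Linear mapping: intensity 1.0 spikes at step 0, near 0.0 at the end.
        spike_step = round((1.0 - intensity) * (num_steps - 1))
        train[spike_step] = 1
    return train


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for value in (0.25, 1.0):
    rate_train = rate_encode(value, num_steps=10, rng=rng)
    latency_train = latency_encode(value, num_steps=10)
    print(f"intensity={value}: rate={rate_train} latency={latency_train}")
```

Rate coding needs many steps (and many spikes) to convey an intensity accurately, while latency coding conveys it with a single spike whose timing carries the information, which is one reason temporal codes are attractive for energy-constrained settings.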
Furthermore, hybrid models that combine CNN-SNN architectures could offer a promising solution. By utilizing CNNs to preprocess and extract spatial features from RGB images, and then passing the processed information to SNNs for efficient temporal analysis, such models could exploit the strengths of both network types. This hybrid approach could be particularly beneficial in applications where both spatial resolution and temporal precision are necessary, such as pedestrian street-crossing detection under dynamic, real-world conditions.
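The data flow of such a hybrid can be sketched in a few lines: a convolutional stage turns each frame into a spatial feature, and a spiking stage integrates that feature over time. This is a toy illustration, not a proposed architecture: a single fixed edge filter stands in for a trained CNN, one leaky integrate-and-fire (LIF) neuron stands in for the SNN, and all parameter values are hypothetical.

```python
from typing import List

Image = List[List[float]]


def conv2d_valid(image: Image, kernel: Image) -> Image:
    """Sliding-window filter without padding (cross-correlation, as in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def frame_feature(image: Image, kernel: Image) -> float:
    """Stand-in for a CNN front end: mean rectified filter response."""
    response = conv2d_valid(image, kernel)
    values = [max(v, 0.0) for row in response for v in row]
    return sum(values) / len(values)


def lif_spikes(inputs: List[float], leak: float = 0.9,
               threshold: float = 1.0) -> List[int]:
    """Leaky integrate-and-fire neuron with a hard reset after each spike."""
    membrane = 0.0
    spikes = []
    for x in inputs:
        membrane = leak * membrane + x
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0
        else:
            spikes.append(0)
    return spikes


edge_kernel = [[-1.0, 1.0]]               # responds to dark-to-bright edges
blank = [[0.0] * 4 for _ in range(3)]
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]

clip = [blank, edge, edge, edge, blank]   # an edge appears, persists, vanishes
features = [frame_feature(frame, edge_kernel) for frame in clip]
print(lif_spikes(features, leak=0.9, threshold=0.8))
```

The spiking stage fires only after the edge feature has persisted for several frames, which illustrates how temporal integration in the SNN complements per-frame spatial features from the convolutional stage.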
Finally, the fusion of multi-modal sensory data, such as integrating RGB and DVS inputs, offers another path forward. By allowing CNNs to handle the rich spatial details of RGB data and SNNs to process the high temporal resolution of DVS data, we can develop systems that leverage complementary sensor modalities to improve robustness, particularly in challenging environments such as low-light or adverse weather conditions.
Despite these challenges, SNNs’ advantages in low energy consumption and event-based processing make them a compelling choice for autonomous vehicle systems, where energy efficiency and real-time processing are critical. Future work will focus on improving the integration of SNNs with RGB data through architectural and encoding innovations, potentially enabling their wider application in complex visual processing tasks.
6.2. Limitations in Using Simulation Data
Although simulation data are more easily accessible and controllable than real data, we acknowledge the need for extensive evaluations of the proposed algorithms in real-world scenarios. We show initial experiment results on the JAAD dataset with RGB frames, noting that DVS recordings were unavailable for this dataset. Thus, in the future, we aim to expand our evaluation to existing datasets for both modalities, using RGB-to-event conversion methods to obtain the missing DVS data. We have already developed such pipelines, the results of which can be seen in Figure 6. While promising, this approach requires thorough investigation and evaluation. It is therefore beyond the scope of this article and will be addressed in our future work.
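The general idea behind frame-to-event conversion can be sketched as follows. This is a simplified, generic scheme and not the pipeline behind Figure 6: it assumes grayscale frames with values in [0, 1], emits at most one event per pixel per frame, and uses a hypothetical contrast threshold on the log-intensity change. Dedicated simulators additionally interpolate timestamps between frames and model sensor noise.

```python
import math
from typing import List, Tuple

Frame = List[List[float]]
Event = Tuple[int, int, int, int]  # (frame index, row, column, polarity)


def frames_to_events(frames: List[Frame], threshold: float = 0.2,
                     eps: float = 1e-3) -> List[Event]:
    """Emit an event when a pixel's log intensity changes by >= threshold."""
    # Per-pixel reference level, updated only when that pixel fires.
    reference = [[math.log(v + eps) for v in row] for row in frames[0]]
    events = []
    for t, frame in enumerate(frames[1:], start=1):
        for y, row in enumerate(frame):
            for x, value in enumerate(row):
                log_value = math.log(value + eps)
                diff = log_value - reference[y][x]
                if abs(diff) >= threshold:
                    polarity = 1 if diff > 0 else -1  # brighter or darker
                    events.append((t, y, x, polarity))
                    reference[y][x] = log_value
    return events


# Toy clip: the right pixel brightens in frame 1 and darkens again in frame 2.
frames = [
    [[0.1, 0.1]],
    [[0.1, 0.9]],
    [[0.1, 0.1]],
]
print(frames_to_events(frames))  # [(1, 0, 1, 1), (2, 0, 1, -1)]
```

Static pixels produce no events at all, which reflects the sparsity that makes event data a natural input for SNNs.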
We also plan to incorporate additional datasets, such as those presented in [7], as part of our future work. Integrating these real-world datasets will require modifications to our existing pipeline. At this stage, the use of simulation data has provided a strong foundation for subsequent experiments with real-world datasets. Moreover, we aim to evaluate the model’s performance on mixed datasets, combining both real and simulated data, to assess its robustness across varied scenarios.