1. Introduction
In today’s digital era, the amount of data generated and stored is growing at an exponential rate, with unstructured data standing out among it. These data types, including text, images, audio, and video, offer new sources of knowledge. Applying a multimodal analysis approach, as in [1,2,3], involves combining different data formats and allows the richness of information inherent in each format to be leveraged, providing a more comprehensive and accurate understanding of the phenomena under study. Due to the high complexity of unstructured data, exploring the application of anomaly detection techniques becomes essential. These techniques identify unusual patterns or atypical behaviors within datasets that can be extremely heterogeneous and difficult to interpret with conventional methods.
Given the growing reliance on such complex data, managing crowd flow efficiently and securely has become a critical priority in settings like tourism destinations, urban areas, and large public events. Public safety concerns and the need for optimized crowd management make anomaly detection essential for timely, informed decision-making. However, traditional approaches to crowd monitoring often pose privacy challenges, particularly when using video data.
To address these challenges, the application of statistical techniques for anomaly detection in time series derived from video data is proposed, using a multimodal approach. These series represent two key metrics: the number of people detected in a given time interval and the percentage of heatmap saturation, which indicates image occupancy. Anomalies are identified in intervals where the detected counts or saturation percentages deviate significantly from typical patterns, providing insights that can support decision-making and improve processes in sectors such as tourism and video surveillance. By foregoing tracking algorithms and applying measures that preserve anonymity during the analysis, the confidentiality of individuals is maintained at all times.
This work presents an innovative approach focused on analyzing anonymized metrics rather than employing tracking or individual identification methods. This addresses a gap in the literature, where privacy-centered, multimodal crowd analysis methods based on time series are rare.
Section 2 provides a review of the relevant literature in this area of study. Section 3 describes the data and methodology employed in this research. The results obtained from applying this methodology are presented in Section 4, and Section 5 offers an analysis of these results. Finally, Section 6 presents the main conclusions derived from this study.
2. Related Work
This section presents a synthesis of recent research on anomaly detection, object detection in public safety, and crowd analysis. It encompasses a range of methodologies, including advanced machine learning techniques for time series analysis, video surveillance, and real-time occupancy monitoring. Prominent approaches, such as Generative Adversarial Networks (GANs), Convolutional Neural Networks (CNNs), and YOLO (You Only Look Once) models, are examined for their effectiveness in anomaly detection, object tracking, and crowd behavior analysis. The studies also address key challenges related to enhancing detection accuracy, managing spatio-temporal data, and optimizing efficiency in complex real-world scenarios. A summary of the principal contributions in these areas follows.
In [4], a literature review is conducted and a taxonomy for anomaly detection in time series is proposed. The taxonomy distinguishes the type of anomaly detected (point, contextual, or collective); the characteristics of the series, i.e., whether it is univariate or multivariate; the context of the data, i.e., whether the spatio-temporal characteristics of the series are considered in the analysis; and the methodology used, either machine learning or statistical methods. The review also covers challenges that arise when using these techniques, such as defining the normality of the data, the under-representation of anomalies in datasets, and the need to differentiate anomalies from noise present in the series.
In [5], a system based on the Spectral Residual (SR) algorithm and CNNs is proposed, allowing online anomaly detection in time series of business metrics. In [6], a system for real-time anomaly detection is proposed, capable of detecting different types of anomalies thanks to the integration of several anomaly detection and prediction models. In [7], a model for unsupervised anomaly detection based on GANs is proposed, with both the actors and critics of these networks built on LSTM recurrent networks, where series reconstruction enables anomaly detection. In [8], a Transformer-based model for unsupervised anomaly detection is proposed; a Transformer variant with a modified attention mechanism (Anomaly-Attention) is implemented, aiming to overcome the limitations of standard attention in the anomaly detection task.
In [9], a methodology for real-time anomaly detection in video is introduced. This approach uses feature maps extracted from the energy of each frame, enabling the speed of object movement to be estimated; frames in which the energy changes abruptly are identified as anomalous. In [10], a methodology for detecting suspicious behaviors in video is presented. Object detection and tracking algorithms are used to extract the spatio-temporal characteristics of individuals in the video, and the series obtained from these characteristics are then classified according to whether or not they exhibit anomalous behavior.
In [11], a novel approach for abandoned object detection is introduced. This approach uses a dual background model that adapts to scene changes, an enhanced Pixel-based Finite State Machine (PFSM) with occlusion handling, and the SAO-FPN network for improved small-object feature extraction. Integrating the SODHead decoupling head and self-attention further emphasizes occluded features. Tests on datasets such as ABODA and VisDrone2019 show significant improvements in mean detection accuracy, outperforming other advanced methods in recovery rate and accuracy.
In [12], a super-resolution method for enhancing public safety through the analysis of surveillance videos is presented, integrating object detection, key frame selection, and super-resolution reconstruction. The system employs a real-time object detection algorithm, a key frame selection algorithm to identify significant scene changes, and a super-resolution algorithm to improve object resolution and visual quality. The proposed super-resolution network combines pixel and feature spaces, using an asymmetric, recursive deep back-projection approach to efficiently reconstruct high-resolution images. Experiments on videos from various scenarios show significant improvements in object detection accuracy, key frame selection, and super-resolution quality, contributing to more efficient video analysis tools for public safety.
In [13], the use of YOLO models for real-time person detection to monitor occupancy in indoor spaces, a key measure during the COVID-19 pandemic, is explored. It proposes an algorithm that estimates the area of a region in square meters using YOLO-generated bounding boxes, assuming each person occupies 0.66 square meters. The maximum occupancy is calculated based on a density of one person per square meter. The performance of various YOLO models (v3, v4, v5s, v3-Tiny, v4-Tiny) was evaluated in terms of accuracy, FPS, and processing time. YOLO v3 showed the highest accuracy, while YOLO v5s had the highest FPS. The study highlights the algorithm’s ability to adapt to different camera resolutions, but notes potential inaccuracies in low-resolution videos. The findings offer insights into developing intelligent surveillance systems for occupancy monitoring and social distancing compliance.
3. Materials and Methods
This section outlines the approach taken in this study to collect, process, and analyze the data. Section 3.1 describes the origin and characteristics of the data used in the analysis. Section 3.2 details the procedures employed to gather these data. Section 3.3 explains the techniques applied to prepare the data for further analysis. Section 3.4 describes the steps used to construct both time series, an essential part of the study. Finally, Section 3.5 provides an in-depth explanation of the analytical approach and techniques used to achieve the study’s objectives.
3.1. Source of the Data
The Valencian Tourism Agency has deployed over 50 webcams in the Comunitat Valenciana since 2001 as part of a tourism-sponsored project. These cameras are accessible on the tourist portal Webcams de la Comunitat Valenciana (https://www.comunitatvalenciana.com/es/webcams (accessed on 24 November 2024)) and broadcast live images of various destinations throughout the day, offering users the opportunity to view the status of these places in real time. Strategically located in collaboration with municipalities and local tourism businesses, they allow users to follow events, festivities, and sports competitions.
Specifically, the data used were obtained from the town of Morella (https://en.wikipedia.org/wiki/Morella,_Spain (accessed on 24 November 2024)), Spain. The geographical location of Morella has been strategically important throughout the centuries, which is why the town is currently considered a destination of high tourist value. The webcam located in this town is a static camera that broadcasts at a resolution of 1920 × 1080 pixels, at a rate of 30 frames per second.
The aforementioned camera has since been upgraded and now has different characteristics from the previous one. Because of this, the amount of data available for the study is limited (from 20 September 2023 to 15 October 2023). As a solution, we propose extending the series backwards using probability distributions parameterized from the statistics of the original series. This extension allows the anomaly detection models to be initialized.
3.2. Data Acquisition
Regarding data acquisition, video segments equivalent to 15 min are obtained. Due to the large amount of storage space required, the resolution is reduced to 1280 × 720 pixels and the frame rate to 1 FPS.
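A minimal sketch of this downscaling step, assuming OpenCV is used to re-encode each segment; file names and the fallback frame rate are illustrative, not taken from the paper.

```python
import cv2

# Downscale a 15 min segment to 1280x720 at 1 FPS before storage (paths are illustrative).
cap = cv2.VideoCapture("raw_segment.mp4")
src_fps = cap.get(cv2.CAP_PROP_FPS) or 30
writer = cv2.VideoWriter("segment_720p_1fps.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), 1.0, (1280, 720))

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % int(round(src_fps)) == 0:   # keep one frame per second
        writer.write(cv2.resize(frame, (1280, 720)))
    frame_idx += 1

cap.release()
writer.release()
```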
3.3. Data Processing
Video processing begins with the generation of a background model based on Gaussian mixture models [14,15], aimed at removing static objects from the image that could generate false positives. For each frame of the video, the background model is updated and a pre-trained YOLOv8 model [16] is applied for the object segmentation task. YOLO is an object detection model based on convolutional neural networks that processes entire images in a single stage to identify and locate objects. Specifically, the employed model provides information corresponding to both segmentation and object detection. We use the ’small’ version of this model with a fixed confidence threshold. While more recent versions of YOLO are available, preliminary testing indicated that YOLOv8 yields the best performance for this particular application; an analysis of video processing efficiency compared the average frame rates achieved by YOLOv8 and YOLOv11.
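A minimal sketch of this processing loop, assuming OpenCV’s MOG2 background subtractor and the ultralytics YOLOv8 segmentation model; the file name and confidence threshold are hypothetical, not the paper’s exact values.

```python
import cv2
from ultralytics import YOLO

# Gaussian-mixture background model (MOG2) keeps an image of the static scene.
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)

# Pre-trained YOLOv8 'small' segmentation model.
model = YOLO("yolov8s-seg.pt")
CONF_THRESHOLD = 0.25  # hypothetical value; the paper's exact threshold is not reproduced here

cap = cv2.VideoCapture("segment_720p_1fps.mp4")  # hypothetical file name
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Update the background model with the current frame.
    bg_subtractor.apply(frame)
    background = bg_subtractor.getBackgroundImage()

    # Run segmentation + detection on the full frame in a single pass.
    results = model(frame, conf=CONF_THRESHOLD, verbose=False)[0]
    for i, box in enumerate(results.boxes):
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        # results.masks[i] (when present) holds the segmentation mask of this detection;
        # the frame/background cutouts at (x1, y1, x2, y2) are compared later (CLAHE + SSIM).
cap.release()
```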
To extract detections of dynamic objects, we compare the areas corresponding to the bounding boxes of all detections in the frame with the same areas in the background model. These cutouts are converted to grayscale and undergo Contrast Limited Adaptive Histogram Equalization (CLAHE) [17,18,19]. An example can be seen in Figure 1.
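A minimal sketch of the cutout preparation, assuming OpenCV; the CLAHE clip limit and tile size are illustrative defaults rather than values from the paper.

```python
import cv2

def prepare_cutout(image, box):
    """Crop a bounding box, convert it to grayscale, and apply CLAHE."""
    x1, y1, x2, y2 = box
    cutout = image[y1:y2, x1:x2]
    gray = cv2.cvtColor(cutout, cv2.COLOR_BGR2GRAY)
    # Contrast Limited Adaptive Histogram Equalization (parameters are illustrative).
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray)

# The same box is cropped from both the current frame and the background model,
# so the two equalized cutouts can be compared with SSIM.
```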
To compare the cutouts, we use the Structural Similarity Index (SSIM) [20,21], which takes values between −1 and 1, with 1 indicating perfect similarity between images and −1 indicating completely dissimilar images. In our case, we set a heuristic threshold and consider detections valid when their SSIM value is below this threshold. Figure 2 illustrates several comparison outcomes, highlighting the occurrence of both false negatives and false positives. False negatives are defined as objects that, while not part of the background, are misclassified as such due to prolonged inactivity. Conversely, false positives refer to background objects that are mistakenly identified as belonging to distinct object classes.
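A minimal sketch of the dynamic-object filter, assuming scikit-image’s SSIM implementation; the threshold value is a placeholder, since the paper’s exact heuristic threshold is not reproduced here.

```python
from skimage.metrics import structural_similarity as ssim

SSIM_THRESHOLD = 0.5  # hypothetical value; tune against validated cutouts

def is_dynamic(frame_cutout, background_cutout, threshold=SSIM_THRESHOLD):
    """Keep a detection only if its cutout differs enough from the background model."""
    score = ssim(frame_cutout, background_cutout)
    # Low similarity to the background means the object is not static.
    return score < threshold
```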
The result of this process is stored in a ‘.csv’ file for each video segment. This file contains the detection timestamp, the predicted class, the prediction confidence, and the bounding box position and segmentation mask of each detection. No other video information is saved, avoiding any privacy concerns with the data.
3.4. Generating the Time Series
From the results of the video processing, we generate two time series corresponding to the number of detections and the percentage of heatmap saturation, which represents the image occupancy within a given interval.
3.4.1. Detection Series
To generate the series corresponding to the number of detections, for each ‘.csv’ file resulting from the video processing, detections are grouped by time interval and the maximum value within each group is taken as the value of the series for that interval. Choosing the maximum over other statistics, such as the mean or median, helps correct potential false negatives: in certain intervals, some people may not be detected due to the precision of the YOLO model or the SSIM threshold, even though detections occur throughout the interval.
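A minimal sketch of this aggregation, assuming the per-detection ‘.csv’ files are loaded with pandas; the column names and the per-frame counting step are assumptions about the file layout.

```python
import pandas as pd

# Hypothetical column names: 'timestamp' (detection time) and 'class' (predicted label).
detections = pd.read_csv("detections_segment.csv", parse_dates=["timestamp"])

# Count detections per frame, then keep the maximum count observed in each 15 min interval.
per_frame = detections.groupby("timestamp").size()
detection_series = per_frame.resample("15min").max().fillna(0)
```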
We used 50% of the available data to extract statistics with which to extend the series. We group this partition by day of the week, hour, and minute (w, h, m), obtaining both the median (Figure 3) and the interquartile range (IQR) (Figure 4) for each group.
For modeling the artificial series, an extreme value distribution [22] (Equation (1)) was employed; this type of distribution models the distribution of maximum or minimum values.
We generate the values corresponding to 8 weeks at a frequency of 15 min, where the location parameter is the median of the corresponding (w, h, m) group and the scale parameter is the quartile deviation of the corresponding (w, h, m) group (see Figure 5).
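A minimal sketch of how such an extension could be generated, assuming a Gumbel-type extreme value distribution (the exact distribution and parameterization of Equation (1) are not reproduced here); `medians` and `qdevs` are hypothetical dictionaries holding the per-(w, h, m) statistics, and the end date is illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 8 weeks of 15 min timestamps ending where the real series starts (date is illustrative).
index = pd.date_range(end="2023-09-20 00:00", periods=8 * 7 * 96, freq="15min")

def simulate_point(ts, medians, qdevs):
    """Draw one synthetic value for timestamp ts from its (weekday, hour, minute) group."""
    key = (ts.dayofweek, ts.hour, ts.minute)
    loc, scale = medians[key], max(qdevs[key], 1e-6)
    return max(rng.gumbel(loc=loc, scale=scale), 0.0)  # counts cannot be negative

# synthetic = pd.Series([simulate_point(ts, medians, qdevs) for ts in index], index=index)
```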
To evaluate the consistency of the extended series with the actual observational data, the Kolmogorov–Smirnov test [23] was employed. The Kolmogorov–Smirnov test is a nonparametric method designed to assess the equality of continuous, one-dimensional probability distributions, making it possible to verify whether a sample derives from a specific reference distribution or whether two samples share a common distribution. Given the p-value obtained, at the chosen confidence level the results suggest that the null hypothesis—that the data originate from the same distribution—cannot be rejected.
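A minimal sketch of this check, assuming SciPy’s two-sample Kolmogorov–Smirnov test; `synthetic` and `observed` are hypothetical names for the extended and real series, and the significance level is illustrative.

```python
from scipy.stats import ks_2samp

# Two-sample KS test between the synthetic extension and the observed series.
statistic, p_value = ks_2samp(synthetic.values, observed.values)

ALPHA = 0.05  # illustrative significance level
if p_value > ALPHA:
    print("Cannot reject that both samples come from the same distribution.")
```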
3.4.2. Heatmap Saturation Percentage Series
To generate the series corresponding to the heatmap, for each ‘.csv’ file resulting from the video processing, we use the segmentation masks of each detection.
To obtain the grayscale heatmap, given a ‘.csv’ file, we accumulate the segmentation masks of the different detections into a matrix initialized with zeros, of the same size as the original image. Once accumulated, we normalize the data so that, for a point in the map to saturate to white, it must have been occupied throughout the entire interval.
Next, from these heatmaps, we generate the heatmap series. Each heatmap contributes one point to the series, with this value being the sum of all values in the image. The theoretical maximum value of a heatmap corresponds to complete saturation, i.e., when every pixel has been occupied throughout the entire interval.
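A minimal sketch of the heatmap construction and of the per-interval value, assuming binary segmentation masks of the same size as the frame; the function and argument names are illustrative.

```python
import numpy as np

def heatmap_value(masks, n_frames, frame_shape):
    """Accumulate binary segmentation masks and return the heatmap and its summed value."""
    heatmap = np.zeros(frame_shape, dtype=np.float64)
    for mask in masks:              # one binary mask (0/1) per detection in the interval
        heatmap += mask
    heatmap /= max(n_frames, 1)     # a pixel saturates (value 1) only if occupied in every frame
    return heatmap, heatmap.sum()   # the sum is the raw point of the heatmap series
```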
We used 50% of the available data to extract statistics with which to extend the series. We group this partition by day of the week, hour, and minute, obtaining both the median (Figure 6) and the standard deviation (Figure 7) for each group.
For modeling the artificial series, we use the Laplace distribution [24] (Equation (2)). The choice of this distribution over others, such as the normal distribution, is due to its characteristics: its sharper peak and faster decay allow us to control the amount of noise applied to the generated series.
Using the Laplace distribution, we generate the values corresponding to 8 weeks at a frequency of 15 min, where the location parameter is the median of the corresponding (w, h, m) and the scale parameter is the standard deviation of the corresponding (w, h, m).
Due to the high values of the series, we normalize it using the previously mentioned theoretical maximum value, thus obtaining the percentage of heatmap saturation (see Figure 8).
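A minimal sketch of the heatmap series extension, assuming NumPy’s Laplace sampler; `medians`, `stds`, and `max_value` (the theoretical maximum) are hypothetical names for the per-(w, h, m) statistics and normalization constant.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
index = pd.date_range(end="2023-09-20 00:00", periods=8 * 7 * 96, freq="15min")

def simulate_saturation(ts, medians, stds, max_value):
    """Draw one synthetic saturation percentage for timestamp ts."""
    key = (ts.dayofweek, ts.hour, ts.minute)
    value = rng.laplace(loc=medians[key], scale=max(stds[key], 1e-6))
    return np.clip(value, 0.0, max_value) / max_value * 100.0  # percentage of saturation
```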
To evaluate the consistency of the extended series with the actual observational data, the Kolmogorov–Smirnov test is again employed. Given the p-value obtained, at the chosen confidence level the results suggest that the null hypothesis—that the data originate from the same distribution—cannot be rejected.
3.5. Methodology
Figure 9 shows the methodology employed throughout the project. The part of this diagram corresponding to data preprocessing has already been discussed in Section 3.3 and Section 3.4, while the remaining part is explained below. The central idea of this latter part is to apply anomaly detection techniques to the preprocessed data. Before applying these techniques, each series is decomposed into its trend, seasonal, and residual components.
3.5.1. STL Decomposition
STL (Seasonal-Trend decomposition using LOESS) [25] is a statistical method used to decompose a time series into three components: seasonal, trend, and residual. This technique is robust, capable of handling various forms of seasonality, including nonlinear patterns, and can be adjusted to account for outliers. Applying STL decomposes the time series into its trend, seasonal, and residual components. The algorithm allows parameters such as the periodicity and seasonality of the series to be adjusted; for this study, a periodicity of 1 day and a seasonality of 1 week were specified.
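A minimal sketch of this step, assuming the statsmodels STL implementation applied to the 15 min detection series from the earlier sketch (the same applies to the saturation series); with that sampling, 1 day corresponds to 96 observations and 1 week to roughly 672, but the exact parameter mapping used in the paper is an assumption here.

```python
from statsmodels.tsa.seasonal import STL

# 15 min sampling: 96 observations per day, ~672 per week (parameter mapping is an assumption).
stl = STL(detection_series, period=96, seasonal=673, robust=True)
result = stl.fit()

trend, seasonal, residual = result.trend, result.seasonal, result.resid
```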
3.5.2. Collective Anomalies—Trend Threshold
One of the types of anomalies we seek to identify is collective anomalies, i.e., values that individually do not represent an anomaly but do so when considered as a sequence. To achieve this, a threshold is set on the trend component obtained from the series decomposition. For both series, this threshold is defined from the median of the original series and a multiple of its standard deviation. We consider a value anomalous only if it exceeds the defined upper threshold.
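A minimal sketch of the trend threshold, reusing the series and trend component from the earlier sketches; the multiplier `K` is a hypothetical value, since the paper’s exact formula is not reproduced here.

```python
import numpy as np

K = 2.0  # hypothetical multiplier on the standard deviation
threshold = np.median(detection_series) + K * np.std(detection_series)

# Intervals whose trend exceeds the threshold are flagged as part of a collective anomaly.
collective_mask = trend > threshold
```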
3.5.3. Point Anomalies—SESD (Seasonal ESD)
Another type of anomaly we seek to identify is point anomalies, i.e., points that show a significant deviation from the rest of the data. For this, we employ the Seasonal Extreme Studentized Deviate (SESD) algorithm [26], which applies the ESD (Extreme Studentized Deviate) test [27] to the result obtained after performing the STL decomposition of the series; in our case, it is based on the previously calculated decomposition. The ESD algorithm is a statistical test used to detect multiple outliers in a univariate dataset. Unlike other tests, such as Grubbs’ test [28], which can only detect a single outlier at a time, ESD can identify multiple outliers simultaneously.
When selecting point anomalies, we discard those already identified as collective anomalies in the previous step. Point anomalies are ranked in descending order of residual value, the most relevant being those with the highest residuals.
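A minimal sketch of the generalized ESD test applied to the STL residual from the earlier sketch; the maximum number of outliers and the significance level are illustrative choices.

```python
import numpy as np
from scipy.stats import t

def generalized_esd(values, max_outliers=20, alpha=0.05):
    """Return indices of outliers detected by the generalized ESD test."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    active = np.arange(n)          # indices still in the sample
    candidates, confirmed = [], 0
    for i in range(1, max_outliers + 1):
        sub = x[active]
        dev = np.abs(sub - sub.mean())
        j = int(dev.argmax())
        r_stat = dev[j] / sub.std(ddof=1)
        # Critical value lambda_i for the i-th test statistic (NIST formulation).
        p = 1 - alpha / (2 * (n - i + 1))
        t_val = t.ppf(p, n - i - 1)
        lam = (n - i) * t_val / np.sqrt((n - i - 1 + t_val**2) * (n - i + 1))
        candidates.append(active[j])
        if r_stat > lam:
            confirmed = i          # largest i whose statistic exceeds its critical value
        active = np.delete(active, j)
    return candidates[:confirmed]

# Point anomalies come from applying the test to the STL residual:
# point_anomalies = generalized_esd(residual.values)
```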
5. Discussion
The results obtained show that time series decomposition through STL has been effective in identifying seasonal patterns and flexible trends. Specifically, a notable shift in trend is observed around 9 October and 12 October, coinciding with regional and national festivities, respectively. This observation reinforces the usefulness of the decomposition in capturing seasonal events and abrupt changes in the data. Additionally, analyzing the residual of the decomposition in Figure 10 and Figure 11 reveals points that are distant from the rest, indicative of the presence of anomalous values. The detection of collective anomalies through the application of a threshold to the trend, as shown in Figure 12 and Figure 13, confirms the presence of anomalous periods in both series, corresponding to the festive periods mentioned previously. On the other hand, when the SESD algorithm is used to detect point anomalies (Figure 14 and Figure 16), the validity of the collective anomaly detection method is confirmed, as the intervals previously identified as anomalous also stand out. This analysis further suggests that other points detected as anomalies indicate significant changes across different time intervals, as shown in Figure 15 and Figure 17. These findings reinforce the effectiveness of the approach used for anomaly detection in the studied time series and enable the detection of anomalies in the images, such as an unexpected increase in the number of individuals during periods that would typically be expected to remain nearly empty.