Lightweight feature skeleton node extraction is a core processing step supporting the L-MSFCF model; it greatly reduces the number of model parameters and the computation time compared with full-skeleton processing. However, as shown in the experiments in Section 7.2 (testing of the Section 4 model), when only the optimized skeleton nodes are considered to recognize video behaviors, the accuracy is not ideal. In order to improve the recognition accuracy and further reduce the computation time, our proposed L-MSFCF model enhances the lightweight features through a multi-stream feature cross-fusion process to obtain more behavior feature information.
5.1. L-MSFCF Model Abnormal Behavior Recognition Process
The L-MSFCF model differs from traditional multi-stream feature fusion action recognition methods: it handles occluded skeleton nodes and also adopts a feature cross-fusion extraction method. Firstly, the skeleton nodes are lightweighted. Secondly, the occluded skeleton node information is predicted by utilizing the skeleton node data of past frames. Finally, action features are obtained through the skeleton vector stream, the skeleton joint stream, and the feature cross-fusion stream. In this way, the L-MSFCF model strengthens the recognition of abnormal behaviors.
The L-MSFCF model abnormal behavior recognition process mainly has two steps: the first step is occluded skeleton node prediction and lightweight processing; the second step is lightweight skeleton data feature extraction through the dual-stream networks, after which feature fusion is performed on all the features to obtain the final classification result.
Figure 5 shows the flowchart of the L-MSFCF model. The steps are as follows:
Step 1: Preprocess the skeleton data. Create a skeleton joint dataset and a skeleton vector dataset. Because each skeleton vector is composed of two skeleton nodes and the whole skeleton graph is not a ring structure, the number of skeleton vectors generated is always one less than the number of skeleton nodes. We therefore add an empty skeleton vector with the value 0 so that the numbers of skeleton vectors and skeleton nodes are equal.
Step 2: Lightweight the skeletons. Lightweighting the skeleton data is based on the lightweight characteristic skeleton nodes of each action. Skeleton node data and skeleton vector data are processed in the same way; taking skeleton node data as an example, the process is as follows (a small illustrative sketch of this masking is given after Step 5):
According to Table 2, retain the corresponding characteristic skeleton node information and set the other skeleton node information to 0. Taking fighting as an example, its original skeleton data for a certain frame are expressed as Equation (11).
—the original skeleton dataset of a frame;
g—a skeleton node in the current frame.
The lightweight skeleton nodes for fighting in Table 2 are [3, 4, 6, 7, 9, 10, 12, 13], and the result of lightweight processing is shown in Equation (12).
—skeleton dataset after lightweight processing;
g—a skeleton node in the current frame.
Step 3: Determine whether each lightweight skeleton node is occluded. If the spatial coordinates of a skeleton node are all 0, the node is judged to be occluded, and its data are then predicted.
Step 4: Process the skeleton node data and skeleton vector data separately by convolution to obtain features that can represent each action.
Step 5: Combine the skeleton node features and skeleton vector features into the overall action features by feature fusion.
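As an illustration of Steps 2 and 3, the following is a minimal sketch (not the authors' implementation) of lightweight masking and occlusion detection, assuming 25 skeleton nodes with 3D coordinates and the fighting node list [3, 4, 6, 7, 9, 10, 12, 13] from Table 2; the node indices are assumed to be 1-based as in the paper.

```python
import numpy as np

# Characteristic (lightweight) skeleton nodes for fighting, 1-based indices (Table 2).
FIGHTING_NODES = [3, 4, 6, 7, 9, 10, 12, 13]

def lightweight_frame(frame, keep_nodes):
    """Step 2: keep only the characteristic skeleton nodes, zero out the rest.

    frame: (N, 3) array of N skeleton nodes with (x, y, z) coordinates.
    keep_nodes: 1-based indices of the characteristic nodes to retain.
    """
    mask = np.zeros(frame.shape[0], dtype=bool)
    mask[np.asarray(keep_nodes) - 1] = True          # convert to 0-based indices
    return np.where(mask[:, None], frame, 0.0)       # non-characteristic nodes -> 0

def occluded_nodes(frame, keep_nodes):
    """Step 3: a lightweight node is treated as occluded if all its coordinates are 0."""
    idx = np.asarray(keep_nodes) - 1
    return [n for n, j in zip(keep_nodes, idx) if np.all(frame[j] == 0.0)]

# Example: a random 25-node frame in which node 6 is occluded (all-zero coordinates).
frame = np.random.rand(25, 3)
frame[5] = 0.0
light = lightweight_frame(frame, FIGHTING_NODES)
print(occluded_nodes(light, FIGHTING_NODES))          # -> [6]
```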
5.2. Occluded Skeleton Node Prediction
Occluded skeleton nodes introduce noise into abnormal behavior recognition and reduce its accuracy. To solve this problem, we propose a generative-network-based method for occluded skeleton node prediction, which utilizes the skeleton node data of past frames to predict the skeleton node information of the next frame.
The advantages over existing methods are as follows: the units at the lowest level can learn the motion information of the smallest unit, a single frame, without interference from higher levels, while the higher levels can capture different features of motion over frame segments of specific lengths; moreover, the latest outputs of the different levels are used as inputs at every prediction time step, which makes the motion information more adequate and the features of the next frame more comprehensive.
In Figure 6, the skeleton data of the previous, the current, and the future frames are represented by vectors, together with the predicted skeleton data at moments t and t + 1. The skeleton data of every time step are used as a series of input units at the first layer. K distinct unit sequences are defined at the second level, each of which only accepts inputs from the first-level units whose time steps share the same residue. If K = 2, for instance, the second layer contains two unit sequences: the first derived from the odd-indexed time frames and the second from the even-indexed time frames. Units at the same hierarchical level share weights, which enriches the characteristics of the skeleton data and improves long-term dependency learning. There are a total of K^2 sequences on the third layer, because for each of the K sequences on the second layer there are K different sequences corresponding to it in the third layer, and each third-layer sequence takes, from its parent sequence, the inputs whose time steps share the same residue modulo K. This process of creating new, higher-level unit sequences continues up to level M, where K^(M-1) sequences exist. Finally, a two-layer fully connected network is introduced to produce skeleton vector predictions from the associated hidden units of all hierarchy levels. These predicted skeleton vectors then serve as inputs to the skeleton vector prediction of the upcoming frames.
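To make the hierarchy concrete, the following is a minimal sketch, not the authors' implementation, of how frame indices would be routed into the K sequences of the second level and the K^2 sequences of the third level; the recurrent unit type and the exact wiring in Figure 6 are assumptions.

```python
from collections import defaultdict

def route_frames(num_frames, K, num_levels):
    """Group frame indices by time-step residue, level by level.

    Level 1 holds every frame; level m holds K**(m-1) sequences, each fed by the
    frames of one parent sequence that share the same residue modulo K.
    """
    levels = {1: {(): list(range(num_frames))}}          # key = residue path, value = frame indices
    for m in range(2, num_levels + 1):
        groups = defaultdict(list)
        for path, frames in levels[m - 1].items():
            for position, frame in enumerate(frames):    # split each parent by residue mod K
                groups[path + (position % K,)].append(frame)
        levels[m] = dict(groups)
    return levels

# Example with K = 2 and 8 frames: level 2 has 2 sequences, level 3 has 4.
levels = route_frames(num_frames=8, K=2, num_levels=3)
print(levels[2])   # {(0,): [0, 2, 4, 6], (1,): [1, 3, 5, 7]}
print(levels[3])   # {(0, 0): [0, 4], (0, 1): [2, 6], (1, 0): [1, 5], (1, 1): [3, 7]}
```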
5.3. Lightweight Multi-Stream Feature Cross-Fusion Process
Behavior recognition networks with multi-stream feature fusion, such as dual-stream networks [33], typically utilize single-stream networks to extract features independently before fusing them. Such methods perform weighted fusion only at the end, and the preceding average pooling layer weakens the fusion step, so the network cannot fully exploit the features of each stream. To solve this problem, this subsection proposes the L-MSFCF model, which performs feature cross-fusion during pooling to fully utilize the features of each tributary. The model is introduced in two parts: the network architecture and the basic convolution module.
The whole L-MSFCF network consists of three sub-stream networks: the skeleton vector stream network, the skeleton joint stream network, and the feature cross-fusion stream network. Each sub-stream network utilizes a graph convolution network as the backbone, and either joints or skeleton vectors can be used as input data. Formally, the skeleton sequence data form a C × T × S tensor, where C, T, and S denote the channel, time, and space dimensions, respectively. Spatial characteristics are extracted from the input data by the spatial stream network. The features of shallow sub-networks contain a lot of inaccurate, localized information; conversely, features located in the network's deeper levels contain less false information and more global information. Many conventional networks are bottom-up, end-to-end systems that only employ a subset of the top-layer features, and therefore lack the local information that facilitates action recognition. For this reason, the network proposed in this paper selects features from multiple layers; the features extracted from different levels have different receptive fields and contain different combinations of local and global information.
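As an illustration of this multi-level feature selection, the following is a minimal PyTorch-style sketch of collecting features from L = 3 intermediate levels of one stream; the placeholder 1 × 1 convolution blocks merely stand in for the graph convolution layers, and all layer sizes are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class StreamBackbone(nn.Module):
    """One sub-stream backbone that exposes features from several depths (here L = 3)."""

    def __init__(self, channels=(3, 64, 128, 256)):
        super().__init__()
        # Placeholder blocks standing in for the graph convolution layers of one stream.
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=1), nn.ReLU())
            for c_in, c_out in zip(channels[:-1], channels[1:])
        ])

    def forward(self, x):
        # x: (batch, C, T, S) skeleton sequence tensor.
        features = []
        for block in self.blocks:
            x = block(x)
            features.append(x)            # keep the feature map of every level
        return features                   # [shallow/local, ..., deep/global]

stream = StreamBackbone()
feats = stream(torch.randn(2, 3, 64, 25))     # batch of 2, C = 3, T = 64, S = 25 joints
print([f.shape for f in feats])               # three levels of features
```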
The whole process of feature fusion is as follows:
Step 1: Collect the skeleton vector features extracted by the skeleton vector stream network. The skeleton joint stream network is almost identical to the skeleton vector stream network, and its extracted features are denoted in the same way. L is the maximum number of feature layers; in the experiments, we set L to 3.
Step 2: Calculate the weights of the skeleton vector stream network and the skeleton joint stream network. The skeleton vector stream network and the skeleton joint stream network are represented as shown in Equations (13) and (14).
—skeleton vector stream network;
—skeleton joint stream network;
—skeleton vector stream network weight;
—skeleton joint stream network weight.
Step 3: The fusion stream network takes as input the features collected from the basic dual-stream networks, and the weights of the fusion stream network are calculated. As an example, for the case where L is 3, the fusion stream network is represented as shown in Equation (15).
—fusion stream network;
—fusion stream network weight.
Step 4: Use the weighted average fusion function to compute the prediction weight of the whole network.
—fixed weight parameters of the weighted average fusion function.
Step 5: The feature data of the three tributaries are fused in the fusion layer by weighted average fusion and then passed through the fully connected layer, which fuses all the information and finally outputs a feature that represents the whole action.
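A minimal sketch of Steps 2–5 follows, assuming that each stream contributes L = 3 pooled feature vectors of the same size, that the per-stream weights are fixed scalars, and that the classifier is a fully connected layer; the actual forms of Equations (13)–(15) are not reproduced here, so the weighting below is only illustrative.

```python
import torch
import torch.nn as nn

def weighted_average_fusion(vector_feats, joint_feats, fusion_feats, weights=(0.4, 0.4, 0.2)):
    """Fuse the three tributaries with fixed weights (illustrative stand-in for Eqs. (13)-(15)).

    Each argument is a list of L pooled feature vectors of shape (batch, channels).
    """
    wv, wj, wf = weights
    per_level = [wv * v + wj * j + wf * f
                 for v, j, f in zip(vector_feats, joint_feats, fusion_feats)]
    return torch.stack(per_level, dim=0).mean(dim=0)   # average over the L levels

# Example with L = 3 levels of 256-dimensional pooled features and 10 action classes.
L, batch, channels, num_classes = 3, 2, 256, 10
vec = [torch.randn(batch, channels) for _ in range(L)]
joi = [torch.randn(batch, channels) for _ in range(L)]
fus = [torch.randn(batch, channels) for _ in range(L)]

fused = weighted_average_fusion(vec, joi, fus)
logits = nn.Linear(channels, num_classes)(fused)       # fully connected classification layer
print(logits.shape)                                     # torch.Size([2, 10])
```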
The convolution module's goal is to extract deep features. This paper utilizes an adaptive graph convolutional network; its advantage is that the whole process forms a bottleneck structure, which helps to first reduce noise and then extract highly effective information. Its specific structure is shown in Figure 7.
The entire convolutional block can be represented as:
—input features;
—output features;
—kernel size in space dimensions;
—1 × 1 convolution operation;
—N × N adjacency matrix, its elements indicate whether a vertex is in a subset of another vertex;
—weighting parameter;
—data-driven matrix.
Throughout the computation, we set the kernel size of the space dimension to 3. The adjacency matrix is an N × N matrix whose elements indicate whether the weak-feature skeleton nodes belong to the subset of the lightweight feature skeleton nodes. The normalized diagonal matrix is computed from the adjacency matrix, with a small constant set to 0.001 added to avoid empty rows. The 1 × 1 convolution operation carries the learnable weights of the corresponding shape, and the data-driven matrix has the shape N × N. The non-local block goes through the computation in Figure 8 once before participating in a second computation, and its weighting parameter directly determines its impact on the second convolution. In the experiments, this parameter is set to 0.3 to obtain high-level valid information, and parameters and matrix elements that were not initialized were set to 0.01.
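The following is a minimal sketch of an adaptive graph convolution block of the kind described above, not the paper's exact module: a fixed normalized adjacency matrix, a learnable data-driven matrix initialized to 0.01, and an embedded-Gaussian non-local term scaled by a coefficient assumed here to play the role of the 0.3 weighting; all layer sizes and the exact normalization are placeholders.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Sketch of one adaptive graph convolution: x -> conv1x1( x aggregated over (A_norm + B + beta*C) )."""

    def __init__(self, in_channels, out_channels, A, beta=0.3, embed_channels=16):
        super().__init__()
        N = A.shape[0]
        # Degree normalization of the fixed adjacency, with alpha = 0.001 so no row is empty.
        degree = A.sum(dim=1) + 0.001
        self.register_buffer("A_norm", A / degree[:, None])
        self.B = nn.Parameter(torch.full((N, N), 0.01))      # data-driven matrix, init 0.01
        self.beta = beta                                      # weight of the non-local term
        self.theta = nn.Conv2d(in_channels, embed_channels, 1)
        self.phi = nn.Conv2d(in_channels, embed_channels, 1)
        self.conv = nn.Conv2d(in_channels, out_channels, 1)   # 1 x 1 convolution

    def forward(self, x):
        # x: (batch, C, T, N) skeleton feature tensor.
        b, c, t, n = x.shape
        # Non-local (data-dependent) graph C from embedded features.
        th = self.theta(x).permute(0, 3, 1, 2).reshape(b, n, -1)   # (b, N, embed*T)
        ph = self.phi(x).reshape(b, -1, n)                         # (b, embed*T, N)
        C = torch.softmax(torch.bmm(th, ph), dim=-1)               # (b, N, N)
        graph = self.A_norm + self.B + self.beta * C               # combined adjacency
        y = torch.einsum("bctn,bnm->bctm", x, graph)               # aggregate over joints
        return self.conv(y)

# Example: 25 joints, identity adjacency as a stand-in, 3 -> 64 channels.
A = torch.eye(25)
block = AdaptiveGraphConv(3, 64, A)
print(block(torch.randn(2, 3, 64, 25)).shape)   # torch.Size([2, 64, 64, 25])
```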