1. Introduction
The brain-computer interface (BCI) converts brain signals into control commands for external devices, establishing a new channel through which humans can directly interact with the external environment [1]. This technique is particularly useful for patients with motor disabilities and upper-body paralysis [2]. BCI can also serve healthy users, for example in gaming or robot control [3]. Among the various brain signals, the scalp electroencephalogram (EEG) is easy to acquire; with its low cost and high temporal resolution, EEG is widely used in BCI [4]. Motor imagery is a spontaneously generated EEG signal that requires no external stimulation, making it particularly suitable for patient rehabilitation training and motor control. However, the EEG signal is very weak, with a low signal-to-noise ratio and spatial blurring [5], so extracting stable and discriminative features is difficult. Feature extraction has therefore always been a hotspot in the study of motor-imagery-based BCI. In addition, feature selection can reduce the feature dimension and noise interference, so the selected features are more stable and discriminative; research on feature selection is therefore also very important.
Commonly used feature extraction methods include the autoregressive model [
6], wavelet features [
7], band power [
8], and common spatial pattern (CSP) [
9]. CSP can effectively extract the features of event-related synchronization (ERS) and event-related desynchronization (ERD) in the motor imagery signals, so it has been widely used in BCI [
10]. However, the performance of CSP depends to a large extent on the selection of the filtering frequency band, and the optimal frequency band is typically subject-specific, which is difficult to select manually [
11]. There has been much research on frequency band selection, which falls mainly into four categories. The first category combines CSP with time-frequency analysis methods. Li et al. [12] proposed a novel feature extraction method based on orthogonal empirical mode decomposition (OEMD), an FIR filter, and the CSP algorithm. Lin et al. [13] used the wavelet-CSP algorithm to recognize driving actions. Robinson et al. [14] used the wavelet-CSP algorithm to classify fast and slow hand movements. Feng et al. [15] proposed a feature extraction algorithm based on CSP and wavelet packets for motor imagery EEG signals, and Yang et al. [16] proposed subject-based feature extraction using the Fisher WPD-CSP method. In the second category, the spatial and spectral filters are optimized simultaneously. For example, the common spatio-spectral pattern (CSSP) algorithm was proposed by Lemm et al. [17], the common sparse spectral spatial pattern (CSSSP) algorithm by Dornhege et al. [18], and the discriminant filter bank common spatial patterns (DFBCSP) algorithm by Higashi et al. [19]. In the third category, the original EEG signals are filtered into multiple frequency bands, CSP features are extracted in each band, and the features of the optimal frequency bands are selected for classification. There are many research works in this area, such as SBCSP [20], FBCSP [11], DFBCSP [21], SWDCSP [22], SFBCSP [23], and SBLFB [24]. In the fourth category, an intelligent optimization method is used to select the optimal frequency band. The multiple fixed frequency bands used in the third category are determined by subjective human experience, so the obtained bands may not be optimal, whereas an intelligent optimization algorithm can select a frequency band of any length. Wei et al. [
25] used binary particle swarm optimization for frequency band selection in motor imagery-based brain-computer interfaces. Kumar et al. [
26] proposed three methods to optimize the temporal filter parameters, including particle swarm optimization (PSO), genetic algorithm (GA), and artificial bee colony (ABC). Rivero et al. [
27] used genetic algorithms and k-nearest neighbors for automatic frequency band selection. Each category has drawbacks. The first obtains frequency information through time-frequency analysis and must decompose the EEG signal of every channel, which requires a large amount of computation and is time-consuming, especially for wavelet packet decomposition. The second yields optimization problems that are difficult to solve and prone to local solutions. The third filters the EEG signals into multiple sub-bands, which is computationally intensive. The fourth requires a long time for model training. Recently, the application of deep learning to motor imagery classification has become more and more widespread [
28]. Tang et al. [
29] used conditional empirical mode decomposition (CEMD) and one-dimensional multi-scale convolutional neural network (1DMSCNN) to recognize motor imagery EEG signals. Cheng et al. [
30] classified EEG emotions by deep forest. However, features extracted by deep learning are abstract and difficult to understand [
31]. In addition, deep learning shows no obvious advantage over traditional machine learning methods [
32].
The existing feature selection methods are mainly divided into three categories: filter, wrapper, and embedded [
33]. The filter method uses independent evaluation criteria, so the feature selection process is independent of the subsequent classifier. Koprinska et al. [
34] proposed five feature selection methods for the brain computer interface, including information gain ranking, correlation-based feature selection, relief, consistency-based feature selection, and 1R ranking. Experimental results show that the top three feature selectors in terms of classification accuracy were correlation-based feature selection, information gain and 1R ranking. Mutual information and its one-versus rest multi-class extension were used to select optimal spatial-temporal features in [
35]. Li et al. [
36] combined the Fisher score with a classifier-dependent structure to implement feature selection. After sorting all Fisher score values in descending order, wrapper models with a support vector machine (SVM) and a graph regularized extreme learning machine (GELM) were applied, and a 10-fold cross-validation scheme was used to select the generalized features based on the training set. Mehmood et al. [
37] selected the optimal EEG features using a balanced one-way ANOVA after calculating the Hjorth parameters for different frequency ranges. Features selected by this statistical method outperformed univariate and multivariate features. The optimal features were further processed for emotion classification using SVM, k-nearest neighbors (k-NN), linear discriminant analysis (LDA), naive Bayes, random forest, deep learning, and four ensemble methods (bagging, boosting, stacking, and voting). The maximum average distance between events and non-events was used to select optimal EEG features in [
38]. The filter method has certain advantages, such as low computational cost, but it does not consider the correlation between features and is independent of the classifier, so the classification accuracy is not high. The wrapper method uses the performance of classifier as the evaluation criterion of feature selection. An efficient feature selection method was proposed in [
39]. The least angle regression (LARS) was used for properly ranking each feature, and then an efficient leave-one-out (LOO) estimation based on the PRESS statistic was used to choose the most relevant features. In [
40], a genetic algorithm was used to select EEG signal features; its fitness function was the EEG classification error calculated with an LDA classifier. Rakshit et al. [41] employed the ABC clustering algorithm to reduce the features of motor imagery EEG data. Baig et al. [
42] proposed a new hybrid method to select features: a differential evolution (DE) optimization algorithm searched the feature space to generate the optimal feature subset, with performance evaluated by an SVM classifier. Liu et al. [
43] proposed a method combining the firefly algorithm and learning automata (LA) to optimize feature selection for motor imagery EEG, where the learning automata served as a parameter-optimization tool to avoid local optima. The wrapper method must train and test the classifier when evaluating each candidate feature subset, which is computationally expensive and tends to overfit. The embedded method integrates feature selection into the training process of the classifier, performing feature selection and classification simultaneously; it has therefore been widely used in recent years. Miao et al. [
44] used LASSO to select the important space-frequency-time feature components of motor imagery. The minimum-redundancy maximum-relevance (mRMR) criterion and LASSO were used for feature selection in [
45]. In both feature selection methods, the first three features were selected; the features common to mRMR and LASSO regularization were then used to train the classification model. Zhang et al. [
46] proposed a novel algorithm, namely the temporally constrained sparse group spatial pattern (TSGSP), which was modeled by combining the sparse group LASSO and fused LASSO penalties. The features with different filter bands and time window combinations were optimized and selected. Wang et al. [
47] used the sparse group LASSO to simultaneously perform feature selection and channel selection on the motor imagery signal. Jiao et al. [
48] proposed a sparse group LASSO representation model for transfer learning, in which the group LASSO selected subjects and LASSO selected sample data. The above sparse optimization methods are all convex models. Although they have achieved good results, many applications have shown that non-convex sparse optimization methods can obtain better performance [49]. For example, LASSO has a bias problem, which results in significantly biased estimates, and it cannot achieve reliable recovery from the fewest observations [
50].
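The bias issue can be seen directly from the thresholding rules induced by the two penalties. The following Python sketch is a generic illustration, not the exact LOG model of this paper; the parameters `lam` and `eps` are illustrative. It contrasts ℓ1 soft-thresholding, which shrinks every surviving coefficient by the full regularization strength, with one reweighting step of a log penalty, whose effective ℓ1 weight decays for large coefficients:

```python
import numpy as np

def prox_l1(x, lam):
    """Soft-thresholding: the proximal operator of the l1 (LASSO) penalty.
    Every surviving coefficient is shrunk toward zero by the full amount
    lam, which is exactly the bias problem of LASSO."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def log_penalty_weights(x, lam, eps=0.01):
    """One step of iterative reweighting for a non-convex log penalty
    lam * sum(log(|x_i| + eps)): the effective per-coefficient l1 weight
    lam / (|x_i| + eps) is small for large coefficients, so strong
    features are shrunk far less than under plain LASSO."""
    return lam / (np.abs(x) + eps)

w = np.array([5.0, 0.2])
print(prox_l1(w, 1.0))              # both entries lose the full lam
print(log_penalty_weights(w, 1.0))  # large entry gets a much smaller weight
```

Under the log penalty, a strong feature weight is barely penalized while a weak one still is, which is the intuition behind using a non-convex surrogate to reduce bias.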
To address the heavy computation and long running time of the Wavelet-CSP [13,14], WPD-CSP [15,16], and FBCSP [11] methods, we propose three new feature extraction methods, namely CSP-Wavelet, CSP-WPD, and CSP-FB. Firstly, the original EEG signals are pre-processed, including time window selection and band-pass filtering. Then, the CSP transform is performed. For CSP-Wavelet, the discrete wavelet transform (DWT) is used to decompose the spatially filtered signals, and the energy and standard deviation of the wavelet coefficients are extracted as features. For CSP-WPD, wavelet packet decomposition (WPD) is used to decompose the spatially filtered signals; as in CSP-Wavelet, the energy and standard deviation of the wavelet coefficients are extracted as features. For CSP-FB, the spatially filtered signals are filtered into multiple frequency bands by a filter bank (FB), and the logarithm of the variance of each band is extracted as a feature. To solve the bias problem of LASSO, a new feature selection method is proposed in which a non-convex function sparsely constrains the feature weights. Since the non-convex function is a log function, we call this method LOG. In addition, to further optimize feature selection and enhance the robustness of the classification model, an ensemble learning method is proposed for secondary feature selection and the construction of multiple classification models. Fisher linear discriminant analysis (FLDA) is used for classification. Combining the feature extraction and feature selection methods, we obtain three EEG decoding methods, namely CSP-Wavelet+LOG, CSP-WPD+LOG, and CSP-FB+LOG. Experimental results show that the classification performance of the three newly proposed methods is better than that of the CSP, Wavelet-CSP, WPD-CSP, SFBCSP, and SBLFB methods, and that their feature extraction time is much shorter than that of the Wavelet-CSP, WPD-CSP, SFBCSP, and SBLFB methods.
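To make the CSP-FB feature step concrete, here is a minimal Python sketch. It is illustrative only: the function name, the Butterworth filter order, and the exact band layout are our assumptions, and the CSP spatial filtering step itself is omitted (the input is assumed to already be the spatially filtered signal). Each channel is passed through a bank of overlapping 4 Hz band-pass filters, and the log-variance of each band is taken as a feature:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def csp_fb_features(z, fs=250.0, bands=None):
    """CSP-FB feature sketch: band-pass each CSP-filtered channel with a
    filter bank, then take the log-variance of each band as a feature.
    `z` is (n_csp_channels, n_samples), the signal AFTER CSP spatial
    filtering (the CSP step itself is not shown here)."""
    if bands is None:
        # Overlapping 4 Hz bands covering 8-30 Hz: 8-12, 10-14, ..., 26-30.
        bands = [(lo, lo + 4) for lo in range(8, 27, 2)]
    feats = []
    for ch in z:                      # channel-major ordering: features 1..10
        for lo, hi in bands:          # for channel 1, 11..20 for channel 2, ...
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            y = filtfilt(b, a, ch)    # zero-phase band-pass filtering
            feats.append(np.log(np.var(y)))
    return np.asarray(feats)
```

With six retained CSP channels and ten bands, this yields a 60-dimensional feature vector per trial, which the sparse feature selection step then prunes.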
The main contributions of this paper include three aspects. Firstly, we propose three new feature extraction methods based on CSP, which effectively improve the classification performance of CSP while reducing the feature extraction time. Secondly, we propose a new feature selection method, a non-convex sparse optimization method that effectively solves the bias problem of LASSO and selects more discriminative features. Thirdly, we use ensemble learning for secondary feature selection and classification model construction, which makes the EEG decoding method more robust and stable.
The content of this paper is organized as follows. Section 2 introduces the experimental data, the traditional CSP feature extraction method, the three new feature extraction methods, the new feature selection method, and the secondary feature selection and classification model construction using ensemble learning. The experimental results are shown in Section 3. Section 4 further discusses and analyzes the experimental results. The conclusion is provided in Section 5.
4. Discussion
When the CSP-Wavelet and CSP-WPD methods are used for feature extraction, the number of wavelet decomposition levels has a considerable impact on classification accuracy. The choice of the number of levels involves two factors, namely the frequency resolution and the decomposition time. For a dataset with a sampling rate of 100 Hz, when the number of decomposition levels is less than or equal to 2, the frequency resolution is too low to correctly distinguish the frequency bands related to motor imagery, which is not conducive to extracting discriminative information. When the number of decomposition levels is greater than or equal to 5, the frequency resolution is too high, the extracted features are easily affected by noise, and the decomposition time also increases significantly. Therefore, for the dataset with a sampling rate of 100 Hz (dataset 1), only 3 and 4 decomposition levels are considered in this paper. Similarly, for the datasets with sampling rates of 250 Hz or 256 Hz (datasets 2–4), only 4 and 5 decomposition levels are considered.
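The frequency-resolution argument can be made concrete. Assuming ideal half-band filters (real wavelet filters overlap somewhat), an L-level DWT splits the 0 to fs/2 range into one approximation band and L detail bands. This small helper is our illustration, not code from the paper:

```python
def dwt_band_edges(fs, levels):
    """Nominal frequency bands (Hz) of a `levels`-level DWT under the
    ideal half-band assumption: the approximation band A_L, followed by
    the detail bands D_L (coarsest) ... D_1 (finest)."""
    ny = fs / 2.0
    bands = [(0.0, ny / 2 ** levels)]
    for l in range(levels, 0, -1):
        bands.append((ny / 2 ** l, ny / 2 ** (l - 1)))
    return bands

# 100 Hz sampling: 2 levels give only 0-12.5 / 12.5-25 / 25-50 Hz,
# too coarse to separate the mu (8-12 Hz) band; 3 levels add the
# 6.25-12.5 Hz detail band that roughly isolates mu.
print(dwt_band_edges(100, 2))
print(dwt_band_edges(100, 3))
```

For a 100 Hz sampling rate this shows why 2 levels are too coarse and 5 levels slice the spectrum more finely than the motor imagery rhythms require.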
Table 13 and Table 14 show the classification results of the CSP-Wavelet and CSP-WPD methods with different numbers of decomposition levels, respectively. We first discuss the CSP-Wavelet method. In Table 13, when the sampling rate of the dataset is 100 Hz, L1 = 3 and L2 = 4; when the sampling rate is 250 Hz or 256 Hz, L1 = 4 and L2 = 5. Table 13 shows that the classification accuracy with the smaller number of decomposition levels is usually greater than that with the larger number. Even in the cases where the larger number of decomposition levels gives slightly better accuracy, we still choose the smaller number in view of the decomposition time. Table 14 shows similar results for the CSP-WPD method. Therefore, 3 decomposition levels are selected in this paper when the sampling rate of the dataset is 100 Hz, and 4 levels when the sampling rate is 250 Hz or 256 Hz.
In Table 13 and Table 14, we studied not only the influence of the number of decomposition levels on the classification results, but also the influence of sub-band selection. As can be seen from Table 13 and Table 14, in most cases sub-band selection helps to improve the classification accuracy. Manually excluding sub-bands that are obviously unrelated to the motor imagery tasks removes redundant information and reduces noise interference, while also reducing the feature dimension and model complexity; therefore, sub-band selection can improve classification accuracy. It is worth pointing out that, when sub-bands are selected for CSP-Wavelet, the number of decomposition levels has no effect on the classification accuracy. The reason is that when the number of decomposition levels is greater than or equal to three (100 Hz sampling rate) or four (250 Hz or 256 Hz sampling rate), the selected sub-bands are the same.
LASSO has been widely used in EEG feature selection. However, LASSO regularizes the feature weights with the ℓ1-norm, a biased approximation of the ℓ0-norm. The feature weights obtained by LASSO therefore deviate from the true values and are too sparse. Non-convex regularization can alleviate the bias problem of the ℓ1-norm [
50]. Therefore, the LOG method proposed in this paper can improve the classification accuracy. To illustrate this more intuitively, Figure 8 shows the feature weights obtained by LASSO and LOG for subject A01 on the left hand vs. right hand motor imagery task, where the features are extracted by CSP-FB. The lower part of Figure 8 shows the weights obtained by performing secondary feature selection on the feature weights given by LASSO and LOG. A total of six channels are retained after CSP filtering. Feature indexes 1–10 in Figure 8 correspond to the features of the first channel signal after filtering by the 8–12 Hz, 10–14 Hz, ..., 26–30 Hz band-pass filters; feature indexes 11–20 correspond to the features of the second channel signal filtered by the same bands, and so on. As can be seen from Figure 8, the features selected by the LOG method include features of the first, second, fifth, and sixth channel signals, while LASSO selects only features of the second and sixth channel signals. The features selected by the LOG method therefore contain more information and are more discriminative (according to the CSP principle, the first and last m channel signals are the most discriminative). In summary, LASSO selects too few features, and the selected features are not discriminative enough.
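The index layout described above can be written as a small helper (our illustration of Figure 8's indexing, assuming six CSP channels and the ten overlapping bands listed in the text, not code from the paper):

```python
def feature_to_channel_band(idx, n_bands=10):
    """Map a 1-based feature index from the Figure 8 layout to its
    (CSP channel, band) pair; the bands are 8-12, 10-14, ..., 26-30 Hz."""
    ch, band = divmod(idx - 1, n_bands)
    lo = 8 + 2 * band
    return ch + 1, (lo, lo + 4)

print(feature_to_channel_band(1))   # -> (1, (8, 12)): first channel, 8-12 Hz
print(feature_to_channel_band(11))  # -> (2, (8, 12)): second channel, 8-12 Hz
print(feature_to_channel_band(60))  # -> (6, (26, 30)): last channel, 26-30 Hz
```

Under this layout, indexes 1–10 and 51–60 correspond to the first and last CSP channels, which the CSP principle marks as the most discriminative.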
Among the classification results on all datasets, those of the SFBCSP and SBLFB methods are relatively poor. We used the ensemble learning method proposed in this paper to optimize these two methods, and the results are shown in Figure 9. Although the classification accuracy of the SFBCSP and SBLFB methods is effectively improved, it is still not good compared with the other methods. There may be two reasons why SFBCSP and SBLFB do not achieve good classification results in this paper. On the one hand, the datasets used here differ from those in the original studies, and the SFBCSP and SBLFB methods may not transfer well to new datasets. On the other hand, although we have tried to reimplement the SFBCSP and SBLFB methods as described by their authors, some data processing steps and details may not have been handled properly. It is worth noting that the performance of the algorithms reimplemented in this paper is similar to that reported in the literature [26]; specifically, the SFBCSP and SBLFB methods do not perform well on dataset 1.
Table 15 shows the feature extraction time on the training set for each method. Three subjects were used for the experiment, namely 1a, A01, and S01; they come from three different datasets with different sampling rates and numbers of channels. The feature extraction processes of SFBCSP and SBLFB are the same, so their feature extraction times are equal. Comparing the three proposed methods with the existing methods (CSP-Wavelet vs. Wavelet-CSP, CSP-WPD vs. WPD-CSP, and CSP-FB vs. SFBCSP), the feature extraction time is significantly reduced. Among the three newly proposed methods, CSP-FB requires the least time; although it takes longer than CSP, it can still be used in real-time BCI.
The three methods proposed in this paper do not consider the selection of the time window during feature extraction. Correct selection of the time window can effectively improve the classification accuracy, as verified in many existing works, such as [44,46]. Therefore, in future work we will consider integrating time window selection into the proposed methods to further improve classification performance. In addition, the feature selection method proposed in this paper uses cross-validation to obtain the model parameters, which makes model training cumbersome and time-consuming; moreover, the parameters obtained by cross-validation are not necessarily optimal, especially for small samples [85]. Implementing LASSO and LOG under a Bayesian framework [86] to avoid tedious cross-validation may further improve the performance of the proposed methods.