Article

Feature Selection and Regression Models for Multisource Data-Based Soil Salinity Prediction: A Case Study of Minqin Oasis in Arid China

College of Geography and Environment Science, Northwest Normal University, Lanzhou 730070, China
*
Author to whom correspondence should be addressed.
Submission received: 16 April 2024 / Revised: 6 June 2024 / Accepted: 12 June 2024 / Published: 18 June 2024

Abstract

(1) Monitoring salinized soil in saline–alkali land is essential, requiring regional-scale soil salinity inversion. This study aims to identify sensitive variables for predicting electrical conductivity (EC) in soil, focusing on effective feature selection methods. (2) The study systematically selects a feature subset from Sentinel-1 C SAR, Sentinel-2 MSI, and SRTM DEM data. Various feature selection methods (correlation analysis, LASSO, RFE, and GRA) are employed on 79 variables. Regression models using random forest regression (RF) and partial least squares regression (PLSR) algorithms are constructed and compared. (3) The results highlight the effectiveness of the RFE algorithm in reducing model complexity. The model incorporates significant environmental factors like soil moisture, topography, and soil texture, which play an important role in modeling. Combining the method with RF improved soil salinity prediction (R2 = 0.71, RMSE = 1.47, RPD = 1.84). Overall, salinization in Minqin oasis soils was evident, especially in the unutilized land at the edge of the oasis. (4) Integrating data from different sources to construct characterization variables overcomes the limitations of a single data source. Variable selection is an effective means to address the redundancy of variable information, providing insights into feature engineering and variable selection for soil salinity estimation in arid and semi-arid regions.

1. Introduction

Widespread soil salinization, a form of desertification, poses a threat to food security and sustainable development, especially in arid and semi-arid regions [1]. Arid regions of Northwest China are severely affected by salinization due to arid climate, scarce precipitation, intense evaporation, and low soil water content [2,3], which hinders modern agricultural development and ecological security [4].
To address this global challenge, timely and cost-effective monitoring of soil salinization is crucial [5]. Traditional methods involving on-site sample collection and laboratory analysis are both time-consuming and expensive due to intensive sampling requirements [6,7,8]. Remote sensing technology, particularly with the Sentinel-2 satellite’s Multi-Spectral Instrument (MSI) launched in 2015, offers a valuable alternative [9,10,11,12,13]. Although it provides new bands (three red-edge bands and two SWIR bands), most scholars still prefer to use the traditional visible and near-infrared bands, and the sensitivity of the new red-edge bands to soil salinity information needs to be explored. However, despite its benefits, relying solely on optical remote sensing has limitations due to weather, vegetation, and soil salt crusts. Microwave remote sensing, sensitive to soil salinity through the soil’s imaginary dielectric constant, provides a solution by overcoming these constraints and enabling soil salinization inversion from radar data [14,15,16]. While there’s potential in combining optical and microwave remote sensing, such studies remain relatively scarce.
The formation and development of salinized soils are closely related to the climate, biology, topography, physical and chemical properties of soils, and other associated factors of the environment in which they are located. The introduction of environmental variables to participate in the establishment of inversion models has become a meaningful way to improve the quality of modeling [17]. Among them, the common vegetation indices, soil texture, soil moisture, slope, slope orientation, elevation, surface roughness, texture, etc., have also been critical environmental variables [1,18,19]. Since most of the environmental covariates are obtained by band math, the information has a certain degree of redundancy. Therefore, it is necessary to select the regionally optimal characterization variables through variable screening [20]. Pearson correlation analysis is the most common method for modeling the optimal screening of environmental variables [17,21]. However, the Pearson method ignores the nonlinear relationship between predictor and target variables. Wrapper methods integrate the feature selection process with the training process, and the predictive ability of the model is used as a measure of feature selection. Commonly, there are gray relational analysis (GRA) [19], Genetic Algorithms (GA) [22], Penalty Term Based Feature Selection Methods [23] (e.g., LASSO, Ridge regression, Elastic Net algorithm, etc.), random forest algorithm (RF), and Best Subset Selection (BSS) [24]. Feature engineering is an essential part of the modeling process. However, there are no established guidelines for the optimal feature set for arid zones. More attention needs to be paid to this area. Few scholars have taken visible spectral features, microwave spectral features, texture features, and terrain features into account simultaneously.
Machine learning and data mining techniques, such as random forest (RF), decision tree (Classification and Regression Trees), support vector machine (SVM), and artificial neural network (ANN), surpass traditional statistical methods in assessing soil salinity. These methods effectively address the nonlinear relationship between soil and environmental factors, offering improved precision [23,25,26,27,28].
Google Earth Engine (GEE) is a cloud-based platform that combines massive data catalogs (Landsat 4–8, Sentinel 1–3, etc.) with thousands of computers for planetary-scale data analysis. Its workbench environment supports interactive algorithm development through a JavaScript-, Python-, or R-based application programming interface (GEE API). A growing number of researchers now use the platform for large-scale land cover and land use change mapping and monitoring [29,30]. Ivushkin et al. [31], for example, updated global soil salinity maps using soil property maps and thermal infrared images on the GEE cloud platform, further improving spatial accuracy. However, few studies have explored the potential of cloud computing platforms (e.g., GEE) for extracting feature variables and constructing soil salinity prediction models.
Consequently, this study employed optical remote sensing data, microwave remote sensing data, and digital terrain data in Minqin Oasis. A comprehensive extraction of 79 environmental variables, including sensitive wavebands, vegetation indices, salinity indices, terrain factors, and texture information, was conducted using the GEE platform. The model was also used to assess and map the spatial distribution of soil salinity in the lower part of the Shiyang River Basin in the arid zone of northwest China. The main objectives are (1) to explore the effectiveness of four feature selection methods to remove redundant information from variables, (2) to determine the optimal set of environmental variables applicable to the study area and explore the importance of the variables, and (3) to develop an optimal prediction model for soil salinity in the study area and map the spatial distribution. This study will provide valuable guidance for predicting soil salinity in arid zones.

2. Materials and Methods

In this study, the overall workflow includes data pre-processing, extraction and selection of feature variables, model construction, and evaluation (Figure 1). The specific research content is as follows.

2.1. Study Area and Field Sampling

The Shiyang River Basin is located east of the Hexi Corridor in Gansu Province, China. It is one of the three major inland river basins of the Hexi Corridor, a traditional agricultural area in China. Minqin Oasis is located in the lower reaches of the Shiyang River (38°03′ N to 39°28′ N, 101°49′ E to 104°12′ E), which is a typical arid inland river basin area [32]. The climate type of this region is arid desertification, with an average annual temperature of 7.8 °C, an average annual precipitation of about 116.5 mm, and an average annual evaporation of 2308 mm [33]. Minqin Oasis is bordered by the Tengger Desert in the east and the Badain Jaran Desert in the west (Figure 2), with severe soil desertification, accounting for 94.5% of the total area [34]. The Minqin oasis relies on the Hongyashan reservoir and irrigation canal system for water supply. Due to the reduction of surface water inflow into the oasis from the upper and middle reaches of the Shiyang River, groundwater has become the main source of irrigation water in order to meet water demand, leading to over-exploitation [35]. This triggered a rise in groundwater levels and an increase in dissolved mineral concentrations. The combination of saline irrigation and inadequate drainage measures led to rapid soil salinization [32]. Soil texture in the study area is mainly of five types: clay loam, loam, sandy loam, loamy sand, and sandy loam. Loamy sand is dominant, and cropland is predominantly loam (Figure 2b).
During the field surveys conducted on 3–7 June and 26–30 July 2023, a total of 215 sample points were selected from different land use categories with obvious salinization in the study area. At each sample point, soil conductivity (μS∙cm−1), soil temperature, soil moisture, soil pH, elevation, and land use type were recorded for the soil surface layer (0–10 cm) using a soil environment detector (FK-WSYP, Fangke, China), an RTK device (Huace, China), and an NDVI measuring instrument. The spatial distribution of the sample points is shown in Figure 2c. The conductivity was converted from μS∙cm−1 to the standard unit dS∙m−1.
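The unit conversion mentioned above is a fixed scaling, since 1 dS∙m−1 equals 1000 μS∙cm−1. A minimal sketch follows; the function name and the sample readings are illustrative and are not taken from the field data.

```python
# 1 dS/m = 1 mS/cm = 1000 uS/cm
US_CM_PER_DS_M = 1000.0


def us_cm_to_ds_m(ec_us_cm: float) -> float:
    """Convert electrical conductivity from uS/cm to dS/m."""
    return ec_us_cm / US_CM_PER_DS_M


# Illustrative detector readings in uS/cm (hypothetical values).
readings_us_cm = [850.0, 4560.0, 22190.0]
readings_ds_m = [us_cm_to_ds_m(v) for v in readings_us_cm]
```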

2.2. Data

2.2.1. Remote Sensing Data and Pre-Processing

In this study, we mainly used Sentinel-1 and Sentinel-2 imaging data, and SRTM DEM data provided by the GEE platform. GEE has been providing atmospherically and geometrically corrected level 2A data since 28 March 2017. In this paper, we selected data with cloudiness within 10% during the two sampling periods (3–7 June and 26–30 July). In order to avoid the influence of clouds and shadows on the model as much as possible, the median synthesis method was utilized to reconstruct the Sentinel-2 minimum cloudiness composite image after removing clouds using s2cloudless. The four bands B2, B3, B4, and B8 with a resolution of 10 m and the three red-edge bands B5, B6, and B7 with a resolution of 20 m were selected for the calculation of the relevant feature variables.
In view of the fact that Sentinel-2 data are affected by cloudy weather, GEE’s Sentinel-1A GRD level 1 data, which has VV and VH polarization modes, is selected for this study. These data were pre-processed by orbit file application, noise removal, radiometric calibration, and terrain correction. The study area image was selected in synchronization with the Sentinel-2 data, and the backscattering coefficients for the VV and VH polarization modes were derived using the median() function and the Refined Lee filter to remove coherent speckle.
Digital elevation model (DEM) is a discrete mathematical representation of surface topography that reflects topographic changes. In this paper, the SRTM DEM 30 m resolution data are used to extract the topographic factors to provide an analytical basis for the study.

2.2.2. Environmental Covariates

1. Variables based on microwave data;
Although the study area lies in the arid zone, much of the oasis surface was covered by vegetation during the sampling period. This vegetation noticeably affects the soil backscattering coefficient, so the interference of the vegetation layer with the scattering signal needed to be eliminated. The water cloud model (WCM) simplifies the microwave scattering process by treating the vegetation layer as a horizontally homogeneous cloud and ignoring multiple scattering between the vegetation layer and the underlying soil [36,37]. The expression is as follows:
$$\sigma_{pp}^{0} = \sigma_{veg}^{0} + \gamma^{2}\,\sigma_{soil}^{0}$$
$$\sigma_{veg}^{0} = A \times m_{veg} \times \cos\theta \times \left(1 - \gamma^{2}\right)$$
$$\gamma^{2} = e^{-2B \times m_{veg} \times \sec\theta}$$
where $\sigma_{pp}^{0}$ is the total backscattering coefficient; $\sigma_{veg}^{0}$ is the backscattering coefficient of the vegetation; $\sigma_{soil}^{0}$ is the backscattering coefficient of the soil surface; $\gamma^{2}$ is the attenuation factor for the penetration into the vegetation layer; $m_{veg}$ is the moisture content of vegetation; $\theta$ is the angle of incidence of radar waves; and $A$ and $B$ are the vegetation parameters.
The results of the backscattering coefficients processed by the WCM model are shown in Figure 3, and it can be seen that the backscattering coefficients of both polarization modes are reduced. In this study, the backscattering coefficients of the two polarization modes (VV, VH) provided by Sentinel-1A were extracted by the WCM model after removing the influence of vegetation as the base data. Six polarization combinations were constructed as the eigenvariables: WCMVV, WCMVH, (WCMVV − WCMVH)/(WCMVV + WCMVH)(Com1), WCMVV + WCMVH(Com2), WCMVV − WCMVH(Com3), and WCMVV/WCMVH(Com4) [37,38].
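The WCM relations can be inverted for the soil term, and the six polarization features then follow by simple arithmetic. The sketch below assumes backscatter values in linear power units (not dB); the function names and any values passed for A, B, and the vegetation moisture are hypothetical and are not the parameters used in this study.

```python
import math


def wcm_soil_backscatter(sigma_total, m_veg, theta_deg, a, b):
    """Invert the water cloud model for the soil backscattering coefficient.

    sigma_total: total backscatter (linear units, assumed)
    m_veg: vegetation moisture content
    theta_deg: radar incidence angle in degrees
    a, b: vegetation parameters A and B (hypothetical values in any example)
    """
    theta = math.radians(theta_deg)
    # Two-way attenuation: gamma^2 = exp(-2 B m_veg sec(theta))
    gamma2 = math.exp(-2.0 * b * m_veg / math.cos(theta))
    # Vegetation contribution: A m_veg cos(theta) (1 - gamma^2)
    sigma_veg = a * m_veg * math.cos(theta) * (1.0 - gamma2)
    # Total = veg + gamma^2 * soil, solved for the soil term
    return (sigma_total - sigma_veg) / gamma2


def polarization_features(vv, vh):
    """Build the six polarization variables from corrected VV and VH."""
    return {
        "WCMVV": vv,
        "WCMVH": vh,
        "Com1": (vv - vh) / (vv + vh),
        "Com2": vv + vh,
        "Com3": vv - vh,
        "Com4": vv / vh,
    }
```

With no vegetation moisture the attenuation factor is 1 and the vegetation term vanishes, so the function returns the total backscatter unchanged, which is a quick sanity check on the inversion.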
2. Variables based on optical remote sensing data;
The correlation between soil physicochemical properties and spectral indices is significant. Therefore, multiple indexes are constructed through band operation, and indexes better related to soil salinity are selected to evaluate the degree of soil salinization. This study set 33 spectral indices and transformed bands based on existing research results in arid regions, as shown in Table 1. In this study, principal component analysis (PCA) and tasseled cap transformation (TC) were utilized to reduce the dimensionality of the dataset and extract relevant features. PCA is a widely used statistical technique that transforms the original variables into a new set of uncorrelated variables called principal components, which capture the maximum variance in the data. This helps in simplifying the dataset while retaining essential information. Similarly, the TC transformation is specifically designed for multispectral remote sensing data and generates new indices (Brightness, Greenness, Wetness) that summarize the information content related to soil and vegetation properties. Similar to the use of salinity indices, vegetation indices, and microwave backscattering coefficients, PCA and TC transformations provide crucial insights into the soil and environmental conditions [21,24].
3. Texture feature extraction;
Texture is the spatial variation and repetition of gray levels in a remote sensing image, and texture parameters describe the spatial distribution of these gray-level patterns. The gray-level co-occurrence matrix (GLCM) method is the most mature texture extraction method; it describes the spatial distribution and structural characteristics of the gray levels of image pixels and is widely used to extract texture features from remote sensing images [23]. Although GEE provides 18 GLCM outputs, only five indicators with high correlation with soil salinity were selected as texture features to avoid the curse of dimensionality: Correlation (COR), Contrast (CON), Dissimilarity (DIS), Entropy (ENT), and Angular Second Moment (ASM). The selection was based on preliminary experiments and existing research [53,54]. The B2, B8, and gray value bands of Sentinel-2 were used as the computational bands. Table 2 shows the formulas and descriptions of the texture features selected in this paper. GEE provides the texture function glcmTexture(), which was used with its default 3 × 3 window size to calculate the GLCM.
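To make the GLCM statistics concrete, the sketch below builds a symmetric, normalized co-occurrence matrix for one 3 × 3 window and a single horizontal offset, then computes four of the five selected features using the standard Haralick-style definitions. It is an illustration only and does not reproduce GEE's glcmTexture() implementation; the window values are made up.

```python
import math
from collections import defaultdict


def glcm(image, dx=1, dy=0):
    """Symmetric, normalized GLCM as {(gray_i, gray_j): probability}."""
    counts = defaultdict(float)
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[(i, j)] += 1.0  # pair in the forward direction
                counts[(j, i)] += 1.0  # mirrored pair makes the matrix symmetric
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def texture_features(p):
    """ASM, contrast, dissimilarity, and entropy of a normalized GLCM."""
    return {
        "ASM": sum(v * v for v in p.values()),
        "CON": sum(v * (i - j) ** 2 for (i, j), v in p.items()),
        "DIS": sum(v * abs(i - j) for (i, j), v in p.items()),
        "ENT": -sum(v * math.log(v) for v in p.values() if v > 0),
    }


# Hypothetical 3 x 3 window of quantized gray levels.
window = [[0, 0, 1],
          [0, 1, 1],
          [2, 2, 2]]
features = texture_features(glcm(window))
```

A perfectly uniform window gives ASM = 1 and zero contrast, dissimilarity, and entropy, which matches the intuition that these features measure local heterogeneity.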
4. Topographic factor extraction.
DEM-derived topographic variables directly or indirectly determine the direction of soil water movement, altering the pattern and location of soil salt accumulation [55]. Studies have confirmed the relationship between topography and soil salinity [43]. In this paper, four feature variables (elevation, slope, aspect, and surface roughness) were extracted from the DEM and resampled to the same resolution as the Sentinel data.

2.3. Feature Selection Algorithms

Variable screening reduces prediction error by removing potentially irrelevant predictor variables. In this study, Pearson, Spearman, and Kendall correlation analyses were used for initial variable selection. All three coefficients measure the association between two variables and take values between −1 and 1. Pearson’s correlation coefficient measures linear association and is suited to continuous, normally distributed variables. Spearman’s coefficient is computed in the same way after both variables are converted to ranks, which makes it robust to outliers and suitable for monotonic nonlinear relationships. Kendall’s correlation coefficient measures the strength of the monotonic relationship between two ordinal variables. Eliminating highly correlated and redundant variables at an early stage can prevent the subsequent feature selection methods from being biased by multicollinearity [56,57,58].
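The difference between the three coefficients can be seen on a small monotonic but nonlinear example. The sketch below is a plain-Python illustration (Kendall is computed as tau-b, which is an assumption about the variant used); the toy data are not from the study.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b from concordant and discordant pair counts."""
    conc = disc = tie_x = tie_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                tie_x += 1
            if dy == 0:
                tie_y += 1
            if dx * dy > 0:
                conc += 1
            elif dx * dy < 0:
                disc += 1
    n0 = n * (n - 1) // 2
    return (conc - disc) / math.sqrt((n0 - tie_x) * (n0 - tie_y))


# Monotonic but nonlinear toy relationship (y = x squared).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
```

On this example the rank-based coefficients equal 1 while Pearson's r falls below 1, which is why a variable can show a weak linear correlation with EC and still carry a monotonic signal.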
The Elastic Net algorithm, the GRA algorithm, and the RFE algorithm were used to minimize the correlation between variables further. To reduce the risk of overfitting and multicollinearity and to provide interpretable core models, shrinkage methods that regularize the estimated coefficients by reducing variance and residuals have been widely used [59]. Elastic Net combines Lasso and Ridge’s advantages and controls the penalty term’s size by two parameters, λ and ρ. It retains the nature of the feature selection of Lasso while considering the stability of Ridge regression.
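As an illustration of how λ and ρ shape the penalty, the sketch below assumes the common parameterization λ(ρ‖β‖₁ + ½(1 − ρ)‖β‖₂²); the text does not state the exact form used, so this is an assumption. The soft-thresholding helper shows why the L1 part can set coefficients exactly to zero, which is the feature selection property inherited from Lasso.

```python
def elastic_net_penalty(coefs, lam, rho):
    """Penalty lam * (rho * L1 + 0.5 * (1 - rho) * L2^2); assumed parameterization."""
    l1 = sum(abs(c) for c in coefs)
    l2_squared = sum(c * c for c in coefs)
    return lam * (rho * l1 + 0.5 * (1.0 - rho) * l2_squared)


def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0
```

Setting rho to 1 recovers the pure Lasso penalty and rho to 0 the pure Ridge penalty, so intermediate values trade sparsity against coefficient stability.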
The Recursive Feature Elimination (RFE) algorithm works by adding or removing specific feature variables to obtain the combination of variables that maximizes model performance [60]. In this study, RFE is implemented using the caret package in R (version 4.3.0). Four models, namely random forests (rfFuncs), linear regression (lmFuncs), support vector machines (SvmFuncs), and bagged decision trees (treebagFuncs), are used for variable ranking, and the optimal method is chosen based on RMSE.
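The elimination loop itself can be sketched in a few lines. The example below is not the caret implementation: it uses ordinary least squares as a stand-in ranking model (importance = absolute coefficient on roughly standardized features), stops at a fixed subset size rather than choosing the size by resampled RMSE, and runs on synthetic data generated purely for illustration.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols_coefs(rows, y, cols):
    """Least-squares coefficients (no intercept) for the chosen columns."""
    xtx = [[sum(r[i] * r[j] for r in rows) for j in cols] for i in cols]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in cols]
    return solve(xtx, xty)


def rfe(rows, y, n_keep):
    """Refit, drop the weakest feature, and repeat until n_keep remain."""
    cols = list(range(len(rows[0])))
    while len(cols) > n_keep:
        coefs = ols_coefs(rows, y, cols)
        weakest = min(range(len(cols)), key=lambda k: abs(coefs[k]))
        cols.pop(weakest)
    return cols


# Synthetic data: only features 0 and 2 drive the response.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [3 * r[0] - 2 * r[2] + rng.gauss(0, 0.1) for r in rows]
selected = rfe(rows, y, n_keep=2)
```

The key point is that the ranking is recomputed after every removal, so a variable's importance is always judged in the context of the variables still in the model.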
Gray relational analysis (GRA) is a multifactorial statistical analysis method. Its basic idea is to judge how closely two sequences are related from the similarity of the geometric shapes of their curves [61]. The closer the curves are, the greater the correlation between the corresponding sequences, and vice versa [62]. The formula is as follows:
$$r_{j} = \frac{1}{n}\sum_{k=1}^{n} \xi_{j}(k), \qquad \xi_{j}(k) = z_{k,j} = \frac{a + \rho b}{\left| x_{kj} - y_{j} \right| + \rho b}$$
where $\rho$ is the resolution coefficient, generally taken as 0.5, and $a$ and $b$ are the bipolar minimum and maximum differences, respectively.
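A small numerical sketch of the gray relational grade is given below. It assumes all sequences have already been scaled to comparable ranges and takes the minimum and maximum differences over all features and samples; the feature names and values are toy data for illustration.

```python
def grey_relational_grades(features, target, rho=0.5):
    """Gray relational grade of each feature sequence against the target.

    features: {name: sequence}; target: reference sequence of equal length.
    All sequences are assumed to be pre-scaled to comparable ranges.
    """
    diffs = {
        name: [abs(x - y) for x, y in zip(seq, target)]
        for name, seq in features.items()
    }
    all_diffs = [d for seq in diffs.values() for d in seq]
    a, b = min(all_diffs), max(all_diffs)  # global minimum and maximum differences
    if b == 0:
        # Every feature coincides with the target everywhere.
        return {name: 1.0 for name in features}
    grades = {}
    for name, seq in diffs.items():
        coeffs = [(a + rho * b) / (d + rho * b) for d in seq]
        grades[name] = sum(coeffs) / len(coeffs)
    return grades


# Toy sequences: f1 follows the target exactly, f2 runs opposite to it.
target = [0.0, 0.5, 1.0]
features = {"f1": [0.0, 0.5, 1.0], "f2": [1.0, 0.5, 0.0]}
grades = grey_relational_grades(features, target)
selected = [name for name, g in grades.items() if g > 0.7]
```

Here f1 reaches the maximum grade of 1 and f2 falls to 5/9, so a 0.7 cut-off keeps only f1.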

2.4. Modeling Strategy

2.4.1. Modeling Algorithm

Partial least squares regression (PLSR) is a classical multivariate statistical method that integrates the advantages of multiple linear regression (MLR) and principal component regression (PCR). It extracts composite variables that capture the variability in the data while improving their explanatory power for the response, and it has been widely used in salinity inversion in arid zones [11,63,64].
The random forest (RF) model is an ensemble learning method based on decision trees: multiple random samples are drawn by bootstrap sampling, and a decision tree is built from each sample to form the forest [65]. RF-based regression has achieved comparatively high accuracy in soil salinity inversion and has been the most widely used algorithm in recent years [66].

2.4.2. Model Evaluation

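Hyperparameters for each algorithm were determined using a grid search with 3-fold cross-validation. Model accuracy was assessed by calculating the coefficient of determination (R2), root mean square error (RMSE), and ratio of performance to deviation (RPD) [67]. Reliable models are usually characterized by high R2 and RPD values and low RMSE values. The specific formulas are as follows:
$$R^{2} = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} \left( \hat{y}_{i} - \bar{y} \right)^{2}}{\sum_{i=1}^{n} \left( y_{i} - \bar{y} \right)^{2}}$$
$$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m} \left( P_{i} - M_{i} \right)^{2}}$$
$$RPD = \frac{S_{y}}{RMSE}$$
where $\hat{y}_{i}$ and $y_{i}$ are the predicted and measured values, $\bar{y}$ is the mean of the measured values, $P_{i}$ and $M_{i}$ are the predicted and measured values of the $i$-th of $m$ samples, and $S_{y}$ is the standard deviation of the measured values.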
The overall error of each algorithm was first evaluated, and then the predictions were further validated using salinity classes. Salinity classes were categorized into five classes based on Richards et al. [68]: non-saline (0–2 dS∙m−1), slightly saline (2–4 dS∙m−1), moderately saline (4–8 dS∙m−1), strongly saline (8–16 dS∙m−1), and extremely saline soils (>16 dS∙m−1).
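The three accuracy metrics and the five salinity classes can be sketched as follows. Two points are assumptions rather than statements from the text: the standard deviation in RPD is taken as the sample standard deviation, and each class is treated as including its lower bound (for example, exactly 2 dS∙m−1 counts as slightly saline). R2 follows the SSR/SST form used in this section.

```python
import math
import statistics


def r2_ssr_sst(measured, predicted):
    """R2 as explained sum of squares over total sum of squares."""
    mean_m = statistics.fmean(measured)
    ssr = sum((p - mean_m) ** 2 for p in predicted)
    sst = sum((m - mean_m) ** 2 for m in measured)
    return ssr / sst


def rmse(measured, predicted):
    """Root mean square error between predictions and measurements."""
    n = len(measured)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)


def rpd(measured, predicted):
    """Standard deviation of measurements divided by RMSE (sample SD assumed)."""
    return statistics.stdev(measured) / rmse(measured, predicted)


def salinity_class(ec_ds_m):
    """Map EC in dS/m to one of the five salinity classes (lower bound inclusive)."""
    bounds = [
        (2.0, "non-saline"),
        (4.0, "slightly saline"),
        (8.0, "moderately saline"),
        (16.0, "strongly saline"),
    ]
    for upper, label in bounds:
        if ec_ds_m < upper:
            return label
    return "extremely saline"


# Toy values for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
```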

3. Results and Discussion

3.1. Sample Descriptive Statistics

We collected 215 samples with conductivity values ranging from 0 to 22.19 dS∙m−1, a mean of 2.1026 dS∙m−1, and a standard deviation of 3.83538 dS∙m−1 (coefficient of variation 1.824). The samples were divided 8:2 into training and validation sets. Figure 4a shows that the datasets have high coefficients of variation, indicating significant soil salinity variability. Figure 4b shows box plots of the distribution of sample points under different land use types. Overall, the sample points with EC values above 8 dS∙m−1 are mainly distributed on unutilized land and correspond to strongly and extremely saline soils. Cultivated land has the second largest number of sample points, most of which are non-saline or slightly saline. However, a certain number of cultivated sample points have high EC values, indicating secondary salinization. Woodland, mainly pike and poplar forests, reached 4.56 dS∙m−1, suggesting surface salt accumulation due to evapotranspiration, low precipitation, and high groundwater mineralization. Grasslands in the study area are mainly low-cover grasslands that lack moisture and are severely desertified, and their points with high EC values lie near the salt ponds. These patterns align with earlier findings in the Minqin oasis, where soil salinization is widespread and saline wastelands at the oasis edges have the highest salinity [32].

3.2. Feature Selection

3.2.1. Correlation Analysis

Figure 5a presents the three correlation coefficients between each variable and EC, with significance at p < 0.01 and p < 0.05. Figure 5b shows a heat map of Pearson’s correlation coefficients after excluding the variables with no significant correlation. Coefficients ranged from −0.39 to 0.35, indicating significant correlations (p < 0.05) between most environmental covariates and soil conductivity, except for salinity indices S1, S2, S3, S4, S5, SI, SI1, and SI3. Notably, salinity indices using red-edge bands exhibited superior correlation, likely due to enhanced spectral information and a higher signal-to-noise ratio. Among all 79 variables, pc3, the third principal component, had the strongest Pearson correlation coefficient (r = 0.39); RE3SI1 had the strongest Spearman correlation coefficient (r = −0.26); and B8_asm had the strongest Kendall correlation coefficient (r = −0.18). RE3SI1 and B8_asm showed no significant linear correlation with EC, so their relationship with EC may be nonlinear. Based on the results of the three correlation analyses, 34 significant and relatively highly correlated variables were initially selected from the 79 environmental covariates: B2, B2_contrast, B8, BI, Brightness, CRSI, Com3, Com4, DVI, EVI, GDVI, Greenness, MSAVI, NDSI, NDVI, RE1S6, RE2S1, RE2S2, RE3S3, RE3S5, RE3SI3, RVI, SAVI, SI2, SI_T, WCMVH, Wetness, elevation, gray_contrast, gray_diss, pc2, pc3, RE3SI1, and RE2SI. Figure 5b revealed a high degree of correlation (r > 0.9) among the selected variables, necessitating further screening to mitigate information redundancy and multicollinearity.

3.2.2. Feature Variable Combination Selection

The 34 variables underwent GRA, Elastic Net, and RFE screening. Prior to algorithm application, all variable values underwent feature scaling for magnitude and value range standardization. For RFE, four models were tested, and Figure 6a,b display the RMSE and R2 for each model under various feature subsets. In both RMSE and R2, the RF algorithm outperformed, reaching its best at 11 features. Consequently, the RFE algorithm of the RF model was selected for feature selection. The Elastic Net algorithm employed ρ = 0.5 for L1 and L2 regularization, with grid search determining the optimal λ value of 0.07. In the GRA algorithm, rho = 0.5 regulated correlation coefficient differences. The GRA algorithm ranked variable importance; features with importance over 0.7 were selected.
Figure 6c–e shows the selected features and their importance ordering for the Elastic Net, GRA, and RFE-RF algorithms, respectively. Both Elastic Net and RFE-RF distinguish the importance of variables well, whereas under the GRA algorithm the importance of most variables differs little. Figure 6f shows a schematic diagram of the variables selected by the three algorithms. Notably, Com4, SI2, WCMVH, and Wetness appear in all three results, with Wetness, SI2, and Com4 exhibiting considerable importance. These three variables represent three different kinds of features: the Wetness component of the tasseled cap transformation, a salinity index, and a polarization combination of the microwave backscattering coefficients, respectively. Thus, in addition to the traditional optical remote sensing variables, environmental covariates and microwave data are also highly sensitive to soil salinity and should be considered when modeling soil salinity.

3.3. Modeling Assessment

As the GRA algorithm lacks the capability to provide the optimal subset of feature variables with known importance ordering, this paper employs a circular, iterative approach. Initially, all variables enter PLSR or RF modeling, undergoing accuracy tests (R2, RMSE). Subsequently, the last variable in the importance ordering is eliminated, and the remaining variables re-enter the modeling operation. This process continues iteratively, with each cycle involving the removal of the last variable until only two variables remain in the operation. R2 and RMSE values are generated for each step, allowing the determination of the optimal number of variables. For optimal model results, a grid search with three-fold cross-validation is employed to select parameters. PLSR focuses on determining the principal components (n_components), while RF involves parameters such as n_estimators, max_depth, min_samples_leaf, min_samples_split, and max_features. Default values are set for other parameters. Table 3 presents the optimal parameters and the number of feature variables for each algorithm.
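The iterative procedure described above can be written as a short loop over a fixed importance ranking. In the sketch below, the evaluate callback stands in for fitting PLSR or RF with grid-searched parameters and returning (R2, RMSE); choosing the subset with the lowest RMSE is an assumption, and toy_evaluate is a hypothetical scoring function used only so the example runs.

```python
def best_subset_by_rank(ranked_features, evaluate, min_features=2):
    """Drop the least important variable each round and track accuracy.

    ranked_features: names ordered from most to least important.
    evaluate: callable(subset) -> (r2, rmse); hypothetical stand-in for
              model fitting and validation.
    """
    history = []
    subset = list(ranked_features)
    while len(subset) >= min_features:
        r2, rmse = evaluate(subset)
        history.append((list(subset), r2, rmse))
        subset = subset[:-1]  # remove the last variable in the ordering
    best = min(history, key=lambda h: h[2])  # lowest RMSE (assumed criterion)
    return best, history


def toy_evaluate(subset):
    """Hypothetical score whose error is smallest with three variables."""
    rmse = abs(len(subset) - 3) + 1.0
    return 1.0 / rmse, rmse


ranked = ["Wetness", "SI2", "Com4", "WCMVH", "elevation"]
best, history = best_subset_by_rank(ranked, toy_evaluate)
```

Unlike RFE, the ranking here is computed once and never updated, so the result depends entirely on the quality of the initial GRA ordering.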
Model accuracy was measured by the R2, RMSE, and RPD of the training and test sets (Table 4, Figure 7). The R2 ranges between 0.16 and 0.72, the RMSE between 1.47 and 2.49, and the RPD between 1.09 and 1.84. Notably, the PLSR algorithm exhibits lower accuracy than RF, with all three PLSR models showing RPDs below 1.4, casting doubt on their reliability. In addition, all three PLSR models produced predicted values less than 0. Lower accuracy in models using the GRA algorithm for variable selection may be linked to the GCD threshold setting. Repeated experiments may help identify an optimal GCD threshold [69]. Notably, the GRA-RF model exhibits overfitting, as its training set R2 surpasses its validation set R2. The RFE-RF model, with the best performance (R2 = 0.71, RMSE = 1.47, RPD = 1.84), demonstrates robustness. In Figure 7, RFE-RF excels in predicting non-saline soils but generally underestimates salinized soils, which is attributed to the limited number of high EC points. The Elastic-RF model ranks second in accuracy and outperforms RFE-RF in predicting moderately to highly salinized soils.
Although the R2 of the RFE-RF model was moderately good on the training and validation sets, its RPD and RMSE values still raise questions about the accuracy and credibility of the model. Possible factors include the distribution of the measured data, plastic film mulch on cultivated land, and image resolution. Compared with empirical/semi-empirical models, for which the collection of measured data requires consideration of the spatial distribution of sampling points, machine learning models are more sensitive to the statistical distribution of data values. The RF model smooths the few ultra-high EC points in the original dataset by effectively treating them as outliers, which results in a model that underestimates medium and high values and overestimates low values. Therefore, data cleaning and normalization are necessary before model construction. In this study, a significant portion of the samples were located on cropland covered with plastic mulch (white or black), which is used to curb evapotranspiration and reduce irrigation; the mulch may alter soil and vegetation spectral reflectance and thus affect vegetation index accuracy [70,71].

3.4. Importance of Variables

To validate the RFE algorithm’s efficacy in reducing information redundancy, the RFE-RF model was compared to a model directly constructed using the RF algorithm (Figure 8). The RFE-RF model demonstrated significantly higher accuracy than the general RF model, with a 0.18 improvement in R2, a 0.39 reduction in RMSE, and a 0.39 improvement in RPD. Features in the RFE-RF model fell into five categories: salinity index (SI2, RE3S3, RE1S6), vegetation index (RVI), microwave backscattering coefficient (WCMVH, VV-VH), image spectral enhancement (Wetness, pc3), and topography factor (gray_diss, B2_contrast, elevation). This aligns with this article’s assumption that, apart from the classical salinity index, the vegetation index, spectral enhancement, microwave polarization index, and terrain factor are crucial modeling features. Terrain factors include elevation from the DEM and texture features from remotely sensed imagery, and they are reflected in all three feature selection methods.
Sentinel-1 variables rank prominently in the results (Figure 6c–e), with the RFE algorithm identifying Com4 and WCMVH as pivotal (third and fifth, respectively). Excluding these features and re-modeling with RF resulted in reduced R2 but improved RMSE. An RPD < 1.4 indicated unreliable predictions (Figure 8c), highlighting the lowest accuracy and poorest fit when microwave data were omitted. Samples with soil salinity of 0–2 dS∙m−1 were mainly in cultivated land, where the impact of vegetation was evident in June and July. Microwave data, processed with the water cloud model, effectively compensated for optical remote sensing signal saturation [44]. This underscores the significant role of microwave data in enhancing soil salinity prediction accuracy.
Satellite imagery’s spatial resolution significantly impacts soil salinity mapping [17]. In the medium-to-high salinized soil range, the vegetation type transitions from a mixture of man-made shelterbelts and natural vegetation to various salt- and drought-tolerant vegetation. At a resolution of 10 m, there may be mixed pixels, which adds some noise to models using vegetation indices. Figure 5 reveals that all variable selection algorithms excluded most vegetation indices. RFE-RF picked only one index, RVI, despite its modest Pearson’s correlation of 0.18 with soil EC values. Thus, when utilizing low-resolution satellite data for arid zone salinity prediction, careful selection of vegetation indices is essential.
Soil salinity and soil surface water content interact [72] because salt is carried by water and moves in the same direction. In addition, varying levels of soil moisture content result in reduced reflectance in the visible (VIS) and near-infrared (NIR) ranges [73], which further affects the response of various spectral indices to soil salinity [74]. The Wetness component in the RFE-RF model constructed in this article mainly reflects the humidity of soil and vegetation and has a high importance (Figure 6e). Therefore, the water content of the surface soil is an important factor in soil salinity estimation studies, and characteristic variables that respond to soil water content should be established when building predictive models.
There are certain limitations in the selection of environmental covariates in this study. Salinity indices, vegetation indices, texture features, humidity features, and principal component factors are all extracted via Sentinel-2, making it challenging to avoid multicollinearity among variables. This is also a significant reason for the substantial reduction in the number of remaining variables after feature selection. According to Ivushkin et al. [31], thermal infrared imagery also performs well in distinguishing between different levels of soil salinity. Additionally, data such as meteorological data, soil texture, depth to groundwater, crop type, etc., should also be taken into account [55,57,75,76].
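The multicollinearity issue can be illustrated with two salinity indices built from the same bands. The sketch below assumes the square-root forms of SI2 and SI3 that are commonly attributed to Douaoui et al. [41] (listed in Table 1) and uses invented reflectance values; it is not a reproduction of the study's correlation analysis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical surface reflectances for five pixels (not data from the study).
green = [0.10, 0.15, 0.20, 0.25, 0.30]
red = [0.12, 0.18, 0.24, 0.30, 0.36]
nir = [0.30, 0.32, 0.31, 0.33, 0.30]

# Assumed index forms: SI2 = sqrt(G^2 + R^2 + NIR^2), SI3 = sqrt(G^2 + R^2).
si2 = [sqrt(g ** 2 + r ** 2 + n ** 2) for g, r, n in zip(green, red, nir)]
si3 = [sqrt(g ** 2 + r ** 2) for g, r in zip(green, red)]

r_si2_si3 = pearson(si2, si3)
print(f"Pearson r between SI2 and SI3: {r_si2_si3:.3f}")
```

With these made-up inputs the two indices are almost perfectly correlated, so keeping both would add little information to a model, which is the kind of redundancy that feature selection is meant to remove.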

3.5. Characterization of the Spatial Distribution of Soil Salinity

The RFE-RF model was used to map the spatial distribution of soil salinity in Minqin oasis (Figure 9a), and the result was categorized into salinity classes (Figure 9c). More than 40% of the entire study area is threatened by salinization, which is similar to the findings of Ngabire et al. [77]. The soils in the study area that are seriously affected by salt are mainly located in the desert zone at the edge of the oasis, at the edge of the Badain Jaran Desert, and in the Tengger Desert. The more seriously affected areas in the center are around Hongyashan Reservoir and Qingtu Lake (Figure 9d), one of which is the starting point and the other the end point of ecological water transfer from the oasis, and both of which have a high groundwater table [78]. Non-saline and mildly saline soils are mainly distributed in the central part of the oasis, which is dominated by residential areas and cultivated land. The degree of salinization gradually increases from southwest to northeast, which is consistent with the flow direction of the Shiyang River, and reaches its maximum at Qingtu Lake, the terminal lake of the Shiyang River. Overall, salt-affected arable land reached 47.5% (Figure 9e), with salinized arable land in the central region, mainly around Qingtu Lake, and a large amount of salinized arable land occurring in the western fringe. The areas with the most severe salinization are the intermountain basins and salt ponds. The salt brought by the rising groundwater here and the accumulation of soil from the surrounding surface and runoff will form solonchak on the surface during the dry season. Coupled with the action of wind, the surface salt will accumulate [77]. Comparison of the resultant maps with the soil profiles of the study area recorded in the Harmonized World Soil Database (HWSD v2.0) (Figure 9b) shows that the spatial distributions of salinized soils in the two are roughly the same.
However, there are extremely high values in the HWSD data (EC = 32 dS∙m−1), and some of the high-value areas do not correspond to the actual situation, which may be related to the spatial and temporal resolution of the soil profile data in the HWSD. For example, the high-value area in central Shoucheng Township (rectangular area 1) and the southern high-value area in Nanhu Township (rectangular area 2) are mainly cultivated land and do not actually have soils with such high conductivity values; the low spatial resolution of the HWSD (1 km) and the age of the underlying data cause these two locations to appear extremely heavily salinized. In contrast, the results obtained by the RFE-RF model were more convincing.
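The classification step behind the class map can be sketched as a simple threshold lookup. The 2 dS∙m−1 limit for salt-affected soil and the 16 dS∙m−1 limit for extremely saline soil come from the text and the Figure 9 caption; the intermediate limits of 4 and 8 dS∙m−1 and the label "moderately saline" are assumptions taken from the widely used five-class scheme and may differ from the article's own class table.

```python
from bisect import bisect_left

# Upper class limits in dS/m. 2 and 16 appear in the article; 4 and 8 are assumed.
EC_LIMITS = [2.0, 4.0, 8.0, 16.0]
EC_CLASSES = [
    "non-saline",
    "mildly saline",
    "moderately saline",  # assumed label
    "strongly saline",
    "extremely saline",
]


def salinity_class(ec):
    """Map an EC value (dS/m) to a class; a value equal to a limit stays in the lower class."""
    if ec < 0:
        raise ValueError("EC cannot be negative")
    return EC_CLASSES[bisect_left(EC_LIMITS, ec)]


def salt_affected_share(ec_values):
    """Fraction of values above 2 dS/m, the salt-affected criterion used in Figure 9."""
    return sum(1 for ec in ec_values if ec > 2.0) / len(ec_values)


# Hypothetical predicted EC values for six pixels.
pixels = [0.8, 1.9, 2.0, 3.5, 7.2, 12.0]
print([salinity_class(ec) for ec in pixels])
print(f"salt-affected share: {salt_affected_share(pixels):.0%}")
```

Applied to every pixel of a predicted EC raster, the same two functions would give a class map and an area share comparable in kind to the figures reported above.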

4. Conclusions

The problem of soil salinization in the Minqin oasis in the arid region of northwest China is evident, with more than 40% of the area salinized. Therefore, the use of advanced remote sensing techniques to map soil salinity distribution is essential for food security and rural development in the region.
This study employs Sentinel-1, Sentinel-2, SRTM images, and field-measured data from 215 locations to create 79 salinity indicators using vegetation indices, salinity indices, terrain factors, and microwave backscattering coefficient polarization combinations. Feature selection is performed through the GRA, RFE, and Elastic Net algorithms, and salt prediction models are constructed using RF and PLSR. By comparing the six models, the best model was determined to be the RFE-RF model. Key findings include the following:
  • Variable selection reduces model complexity, preventing information redundancy and feature multicollinearity. Relying solely on correlation calculations may not adequately reduce dimensionality due to potential nonlinear correlations. The most effective method for feature selection is the RFE algorithm, which significantly streamlines model construction, reducing the feature variables from 34 to 11. These include six spectral features (SI2, RVI, RE3S3, RE3S6, pc3, and Wetness), two microwave backscattering coefficient features (WCMVH and Com4), two texture features (gray_diss and B2_contrast), and a topographic feature (elevation). In arid areas with low vegetation cover, featuring unutilized land (Gobi Desert) and irrigated cropland, model feature selection should consider not only vegetation and salinity indices but also other soil properties and environmental factors.
  • PLSR models lack accuracy for salinity predictions; RF models, adept at capturing nonlinear relationships, are preferable for such predictions. The RFE-RF and Elastic-RF inversion models outperform basic RF models. Among them, RFE-RF stands out as the superior salinity inversion model. It excels in low and medium saline soils but tends to underestimate strongly saline soils and above. Model-based feature combination offers effective soil salinity monitoring in the study area’s irrigation zone, aiding decision-making in soil management. Limiting factors include sample point distribution, cropland mulching effects on reflectance, and model parameter optimization.
  • Introducing the red-edge band, the water cloud model, and the tasseled cap wetness component produced satisfactory estimates. The red-edge spectral indices and the backscattering coefficients processed by the water cloud model are significantly more sensitive to soil salinity, and these effects will be more pronounced in areas with higher vegetation cover. Therefore, when Sentinel data are chosen as the data source for model construction, the red-edge band of Sentinel-2 and water cloud model pre-processing of Sentinel-1 data should be considered.
  • The effects of other environmental variables on soil salinity levels need to be studied in the future. The GEE platform allows the whole modeling process of this study to be carried out online, which facilitates teamwork and provides a convenient method for monitoring soil salinity over a large area. In the future, we will rely on the GEE platform to estimate soil salinity in a larger area and try to further improve the method’s reliability.
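The elimination loop at the core of RFE can be sketched in a few lines. In the study, the ranking comes from a fitted model (random forest in the selected configuration); here a univariate absolute Pearson correlation is used as a stand-in importance so that the example needs no third-party library. The feature names `x1` to `x4` and all values are hypothetical.

```python
from math import sqrt


def abs_pearson(x, y):
    """Absolute Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return abs(sxy / sqrt(sxx * syy))


def univariate_importance(columns, target):
    """Stand-in importance (assumption): |r| of each remaining feature with the target."""
    return {name: abs_pearson(values, target) for name, values in columns.items()}


def recursive_feature_elimination(columns, target, n_keep, importance_fn):
    """Drop the least important feature, re-score the rest, repeat until n_keep remain."""
    remaining = dict(columns)
    eliminated = []
    while len(remaining) > n_keep:
        scores = importance_fn(remaining, target)
        weakest = min(scores, key=scores.get)
        eliminated.append(weakest)
        del remaining[weakest]
    return list(remaining), eliminated


# Hypothetical target (EC) and candidate features.
ec = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
candidates = {
    "x1": [6.1, 5.0, 4.2, 2.9, 2.1, 1.0],    # strongly negative relation
    "x2": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],  # mostly noise
    "x3": [1.0, 2.2, 2.9, 4.1, 5.0, 6.1],    # strongly positive relation
    "x4": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],    # moderate relation
}
kept, eliminated = recursive_feature_elimination(candidates, ec, 2, univariate_importance)
print("kept:", kept, "eliminated in order:", eliminated)
```

With a model-based importance, the scores of the remaining features change after each removal because correlated features share importance, which is why the ranking is recomputed in every round rather than once at the start.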

Author Contributions

Conceptualization, S.Z. and J.Z.; methodology and software, S.Z. and J.Y.; investigation and data processing, J.X. and Z.S.; writing—original draft preparation, S.Z.; writing—review and editing, J.Z.; visualization, S.Z. and J.X.; supervision, J.Z.; project administration, S.Z.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

National Natural Science Foundation of China, grant number 42161072; Northwest Normal University Graduate Student Research Grant Foundation, grant number 2022KYZZ-S190.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Acknowledgments

The authors wish to thank Feipeng Hu, Rui Tuo, and Jian Liu for their help in conducting the fieldwork.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hassani, A.; Azapagic, A.; Shokri, N. Predicting long-term dynamics of soil salinity and sodicity on a global scale. Proc. Natl. Acad. Sci. USA 2020, 117, 33017–33027.
  2. Li, J.G.; Pu, L.J.; Han, M.F.; Zhu, M.; Zhang, R.S.; Xiang, Y.Z. Soil salinization research in China: Advances and prospects. J. Geogr. Sci. 2014, 24, 943–960.
  3. Stavi, I.; Thevs, N.; Priori, S. Soil Salinity and Sodicity in Drylands: A Review of Causes, Effects, Monitoring, and Restoration Measures. Front. Environ. Sci. 2021, 9, 712831.
  4. Singh, A. Soil salinization management for sustainable development: A review. J. Environ. Manag. 2021, 277, 15.
  5. Chen, S.; Arrouays, D.; Mulder, V.L.; Poggio, L.; Minasny, B.; Roudier, P.; Libohova, Z.; Lagacherie, P.; Shi, Z.; Hannam, J.; et al. Digital mapping of GlobalSoilMap soil properties at a broad scale: A review. Geoderma 2022, 409, 115567.
  6. Brunner, P.; Li, H.T.; Kinzelbach, W.; Li, W.P. Generating soil electrical conductivity maps at regional level by integrating measurements on the ground and remote sensing data. Int. J. Remote Sens. 2007, 28, 3341–3361.
  7. Dehaan, R.L.; Taylor, G.R. Field-derived spectra of salinized soils and vegetation as indicators of irrigation-induced soil salinization. Remote Sens. Environ. 2002, 80, 406–417.
  8. Nanni, M.R.; Demattê, J.A.M. Spectral reflectance methodology in comparison to traditional soil analysis. Soil Sci. Soc. Am. J. 2006, 70, 393–407.
  9. Mougenot, B.; Pouget, M.; Epema, G.F. Remote sensing of salt affected soils. Remote Sens. Rev. 1993, 7, 241–259.
  10. Garajeh, M.K.; Blaschke, T.; Haghi, V.H.; Weng, Q.H.; Kamran, K.V.; Li, Z.L. A Comparison between Sentinel-2 and Landsat 8 OLI Satellite Images for Soil Salinity Distribution Mapping Using a Deep Learning Convolutional Neural Network. Can. J. Remote Sens. 2022, 48, 452–468.
  11. Sahbeni, G. A PLSR model to predict soil salinity using Sentinel-2 MSI data. Open Geosci. 2021, 13, 977–987.
  12. Wang, F.; Yang, S.T.; Yang, W.; Yang, X.D.; Ding, J.L. Comparison of machine learning algorithms for soil salinity predictions in three dryland oases located in Xinjiang Uyghur Autonomous Region (XJUAR) of China. Eur. J. Remote Sens. 2019, 52, 256–276.
  13. Yahiaoui, I.; Bradai, A.; Douaoui, A.; Abdennour, M.A. Performance of random forest and buffer analysis of Sentinel-2 data for modelling soil salinity in the Lower-Cheliff plain (Algeria). Int. J. Remote Sens. 2021, 42, 128–151.
  14. Gharechelou, S.; Tateishi, R.; Sumantyo, J.T.S. Interrelationship analysis of L-band backscattering intensity and soil dielectric constant for soil moisture retrieval using PALSAR data. Adv. Remote Sens. 2015, 4, 15.
  15. Metternicht, G.I.; Zinck, J.A. Remote sensing of soil salinity: Potentials and constraints. Remote Sens. Environ. 2003, 85, 1–20.
  16. Sreenivas, K.; Venkataratnam, L.; Rao, P.N. Dielectric properties of salt-affected soils. Int. J. Remote Sens. 1995, 16, 641–649.
  17. Allbed, A.; Kumar, L.; Aldakheel, Y.Y. Assessing soil salinity using soil salinity and vegetation indices derived from IKONOS high-spatial resolution imageries: Applications in a date palm dominated region. Geoderma 2014, 230, 1–8.
  18. Sidike, A.; Zhao, S.; Wen, Y. Estimating soil salinity in Pingluo County of China using QuickBird data and soil reflectance spectra. Int. J. Appl. Earth Obs. Geoinf. 2014, 26, 156–175.
  19. Wang, J.; Ding, J.; Yu, D.; Ma, X.; Zhang, Z.; Ge, X.; Teng, D.; Li, X.; Liang, J.; Lizaga, I.; et al. Capability of Sentinel-2 MSI data for monitoring and mapping of soil salinity in dry and wet seasons in the Ebinur Lake region, Xinjiang, China. Geoderma 2019, 353, 172–187.
  20. Zhou, T.; Lu, H.L.; Wang, W.W.; Yong, X. GA-SVM based feature selection and parameter optimization in hospitalization expense modeling. Appl. Soft Comput. 2019, 75, 323–332.
  21. Peng, J.; Biswas, A.; Jiang, Q.S.; Zhao, R.Y.; Hu, J.; Hu, B.F.; Shi, Z. Estimating soil salinity from remote sensing and terrain data in southern Xinjiang Province, China. Geoderma 2019, 337, 1309–1319.
  22. Pang, G.J.; Wang, T.; Liao, J.; Li, S. Quantitative Model Based on Field-Derived Spectral Characteristics to Estimate Soil Salinity in Minqin County, China. Soil Sci. Soc. Am. J. 2014, 78, 546–555.
  23. Yang, H.; Wang, Z.H.; Cao, J.F.; Wu, Q.Y.; Zhang, B.L. Estimating soil salinity using Gaofen-2 imagery: A novel application of combined spectral and textural features. Environ. Res. 2023, 217, 114870.
  24. Zhang, J.; Zhang, Z.; Chen, J.; Chen, H.; Jin, J.; Han, J.; Wang, X.; Song, Z.; Wei, G. Estimating soil salinity with different fractional vegetation cover using remote sensing. Land Degrad. Dev. 2021, 32, 597–612.
  25. Jia, P.; Zhang, J.; He, W.; Hu, Y.; Zeng, R.; Zamanian, K.; Jia, K.; Zhao, X. Combination of Hyperspectral and Machine Learning to Invert Soil Electrical Conductivity. Remote Sens. 2022, 14, 20.
  26. Zhang, Q.Q.; Li, L.; Sun, R.Z.; Zhu, D.H.; Zhang, C.; Chen, Q.Q. Retrieval of the Soil Salinity From Sentinel-1 Dual-Polarized SAR Data Based on Deep Neural Network Regression. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5.
  27. Zhao, W.J.; Zhou, C.; Zhou, C.Q.; Ma, H.; Wang, Z.J. Soil Salinity Inversion Model of Oasis in Arid Area Based on UAV Multispectral Remote Sensing. Remote Sens. 2022, 14, 13.
  28. Ziane, A.; Douaoui, A.; Yahiaoui, I.; Pulido, M.; Larid, M.; Gulakhmadov, A.; Chen, X. Upgrading the Salinity Index Estimation and Mapping Quality of Soil Salinity Using Artificial Neural Networks in the Lower-Cheliff Plain of Algeria in North Africa Amelioration de l’estimation de l’indice de salinite et de la qualite de la cartographie de la salinite des sols en utilisant les reseaux de neurones artificiels dans la plaine du Bas Cheliff au Nord de l’Algerie. Can. J. Remote Sens. 2022, 48, 182–196.
  29. Calderón-Loor, M.; Hadjikakou, M.; Bryan, B.A. High-resolution wall-to-wall land-cover mapping and land change assessment for Australia from 1985 to 2015. Remote Sens. Environ. 2021, 252, 112148.
  30. Li, A.; Song, K.; Chen, S.; Mu, Y.; Xu, Z.; Zeng, Q. Mapping African wetlands for 2020 using multiple spectral, geo-ecological features and Google Earth Engine. ISPRS J. Photogramm. Remote Sens. 2022, 193, 252–268.
  31. Ivushkin, K.; Bartholomeus, H.; Bregt, A.K.; Pulatov, A.; Kempen, B.; de Sousa, L. Global mapping of soil salinity change. Remote Sens. Environ. 2019, 231, 12.
  32. Qian, T.; Tsunekawa, A.; Masunaga, T.; Wang, T. Analysis of the Spatial Variation of Soil Salinity and Its Causal Factors in China’s Minqin Oasis. Math. Probl. Eng. 2017, 2017, 9745264.
  33. Wang, X.; Chen, X.; Ding, Q.; Zhao, X.; Wang, X.; Ma, Z.; Lian, J. Vegetation and soil environmental factor characteristics, and their relationship at different desertification stages: A case study in the Minqin desert-oasis ecotone. Acta Ecol. Sin. 2018, 38, 1569–1580.
  34. Zhao, J.; Yang, J.; Zhu, G. Effect of ecological water conveyance on vegetation coverage in surrounding area of the qingtu lake. Arid Zone Res. 2018, 35, 1251–1261.
  35. Bondes, M.; Buainain, A.M.; Cheng, F.-T.; Eremina, N.; Gregoryev, L.M.; Janik, L.L.; McGuire, R. Climate Change, Sustainable Development, and Human Security: A Comparative Analysis; Lexington Books: Lanham, MD, USA, 2013.
  36. Attema, E.; Ulaby, F.T. Vegetation modeled as a water cloud. Radio Sci. 1978, 13, 357–364.
  37. Wei, Q.Y.; Nurmemet, I.; Gao, M.H.; Xie, B.Q. Inversion of Soil Salinity Using Multisource Remote Sensing Data and Particle Swarm Machine Learning Models in Keriya Oasis, Northwestern China. Remote Sens. 2022, 14, 21.
  38. Ma, C. Quantitative retrieval of soil salt content based on Sentinel-1 dual polarization radar image. Trans. Chin. Soc. Agric. Eng. 2018, 34, 153–158.
  39. Tripathi, N.; Rai, B.K.; Dwivedi, P. Spatial modeling of soil alkalinity in GIS environment using IRS data. In Proceedings of the 18th Asian Conference in Remote Sensing, ACRS, Kuala Lumpur, Malaysia, 20–24 October 1997.
  40. Khan, N.M.; Rastoskuev, V.V.; Sato, Y.; Shiozawa, S. Assessment of hydrosaline land degradation by using a simple approach of remote sensing indicators. Agric. Water Manag. 2005, 77, 96–109.
  41. Douaoui, A.E.K.; Nicolas, H.; Walter, C. Detecting salinity hazards within a semiarid context by means of combining soil and remote-sensing data. Geoderma 2006, 134, 217–230.
  42. Abbas, A.; Khan, S. Using remote sensing techniques for appraisal of irrigated soil salinity. In Proceedings of the International Congress on Modelling and Simulation (MODSIM), Christchurch, New Zealand, 10–13 December 2007.
  43. Taghizadeh-Mehrjardi, R.; Minasny, B.; Sarmadian, F.; Malone, B.P. Digital mapping of soil salinity in Ardakan region, central Iran. Geoderma 2014, 213, 15–28.
  44. Scudiero, E.; Skaggs, T.H.; Corwin, D.L. Regional scale soil salinity evaluation using Landsat 7, western San Joaquin Valley, California, USA. Geoderma Reg. 2014, 2–3, 82–90.
  45. Liu, H.Q.; Huete, A. A feedback based modification of the NDVI to minimize canopy background and atmospheric noise. IEEE Trans. Geosci. Remote Sens. 1995, 33, 457–465.
  46. Huete, A.R. A soil-adjusted vegetation index (SAVI). Remote Sens. Environ. 1988, 25, 295–309.
  47. Qi, J.; Chehbouni, A.; Huete, A.R.; Kerr, Y.H.; Sorooshian, S. A modified soil adjusted vegetation index. Remote Sens. Environ. 1994, 48, 119–126.
  48. Alhammadi, M.; Glenn, E. Detecting date palm trees health and vegetation greenness change on the eastern coast of the United Arab Emirates using SAVI. Int. J. Remote Sens. 2008, 29, 1745–1765.
  49. Major, D.; Baret, F.; Guyot, G. A ratio vegetation index adjusted for soil brightness. Int. J. Remote Sens. 1990, 11, 727–740.
  50. Wu, W.; Mhaimeed, A.S.; Al-Shafie, W.M.; Ziadat, F.; Dhehibi, B.; Nangia, V.; De Pauw, E. Mapping soil salinity changes using remote sensing in Central Iraq. Geoderma Reg. 2014, 2–3, 21–31.
  51. Kwarteng, P.; Chavez, A. Extracting spectral contrast in Landsat Thematic Mapper image data using selective principal component analysis. Photogramm. Eng. Remote Sens. 1989, 55, 339–348.
  52. Crist, E.P.; Cicone, R.C. A physically-based transformation of Thematic Mapper data—The TM Tasseled Cap. IEEE Trans. Geosci. Remote Sens. 1984, GE-22, 256–263.
  53. Ren, J.; Xie, R.; Zhu, H.; Zhao, Y.; Zhang, Z. Comparative study on the abilities of different crack parameters to estimate the salinity of soda saline-alkali soil in Songnen Plain, China. Catena 2022, 213, 106221.
  54. Zhao, Y.; Zhang, Z.; Zhu, H.; Ren, J. Quantitative Response of Gray-Level Co-Occurrence Matrix Texture Features to the Salinity of Cracked Soda Saline–Alkali Soil. Int. J. Environ. Res. Public Health 2022, 19, 6556.
  55. Vermeulen, D.; Van Niekerk, A. Machine learning performance for predicting soil salinity using different combinations of geomorphometric covariates. Geoderma 2017, 299, 1–12.
  56. Naimi, S.; Ayoubi, S.; Zeraatpisheh, M.; Dematte, J.A.M. Ground observations and environmental covariates integration for mapping of soil salinity: A machine learning-based approach. Remote Sens. 2021, 13, 4825.
  57. Wang, F.; Shi, Z.; Biswas, A.; Yang, S.T.; Ding, J.L. Multi-algorithm comparison for predicting soil salinity. Geoderma 2020, 365, 18.
  58. Xu, H.; Chen, C.; Zheng, H.; Luo, G.; Yang, L.; Wang, W.; Wu, S.; Ding, J. AGA-SVR-based selection of feature subsets and optimization of parameter in regional soil salinization monitoring. Int. J. Remote Sens. 2020, 41, 4470–4495.
  59. Tutmez, B. Identifying electrical conductivity in topsoil by interpretable machine learning. Model. Earth Syst. Environ. 2023, 10, 1869–1881.
  60. Mohamed, S.A.; Metwaly, M.M.; Metwalli, M.R.; AbdelRahman, M.A.E.; Badreldin, N. Integrating Active and Passive Remote Sensing Data for Mapping Soil Salinity Using Machine Learning and Feature Selection Approaches in Arid Regions. Remote Sens. 2023, 15, 1751.
  61. Jin, X.; Xu, X.; Song, X.; Li, Z.; Wang, J.; Guo, W. Estimation of leaf water content in winter wheat using grey relational analysis–partial least squares modeling with hyperspectral data. Agron. J. 2013, 105, 1385–1392.
  62. Kuo, Y.; Yang, T.; Huang, G.-W. The use of grey relational analysis in solving multiple attribute decision-making problems. Comput. Ind. Eng. 2008, 55, 80–93.
  63. Das, B.; Manohara, K.K.; Mahajan, G.R.; Sahoo, R.N. Spectroscopy based novel spectral indices, PCA- and PLSR-coupled machine learning models for salinity stress phenotyping of rice. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2020, 229, 13.
  64. Yu, H.; Liu, M.Y.; Du, B.J.; Wang, Z.M.; Hu, L.J.; Zhang, B. Mapping Soil Salinity/Sodicity by using Landsat OLI Imagery and PLSR Algorithm over Semiarid West Jilin Province, China. Sensors 2018, 18, 17.
  65. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  66. Zhu, C.M.; Ding, J.L.; Zhang, Z.P.; Wang, Z. Exploring the potential of UAV hyperspectral image for estimating soil salinity: Effects of optimal band combination algorithm and random forest. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2022, 279, 8.
  67. Chang, C.-W.; Laird, D.A.; Mausbach, M.J.; Hurburgh, C.R. Near-infrared reflectance spectroscopy–principal components regression analyses of soil properties. Soil Sci. Soc. Am. J. 2001, 65, 480–490.
  68. Richards, L.A. Diagnosis and Improvement of Saline and Alkali Soils; US Government Printing Office: Washington, DC, USA, 1954.
  69. Wei, G.; Li, Y.; Zhang, Z.; Chen, Y.; Chen, J.; Yao, Z.; Lao, C.; Chen, H. Estimation of soil salt content by combining UAV-borne multispectral sensor and machine learning algorithms. Peerj 2020, 8, 24.
  70. Liu, Y.; Zhong, Y.; Hu, C.; Xiao, M.; Ding, F.; Yu, Y.; Yao, H.; Zhu, Z.; Chen, J.; Ge, T.; et al. Distribution of microplastics in soil aggregates after film mulching. Soil Ecol. Lett. 2023, 5, 230171.
  71. Xu, Z.; Hu, C.; Wang, X.; Wang, L.; Xing, J.; He, X.; Wang, Z.; Zhao, P. Distribution characteristics of plastic film residue in long-term mulched farmland soil. Soil Ecol. Lett. 2022, 5, 220144.
  72. Han, L.; Liu, D.; Cheng, G.; Zhang, G.; Wang, L. Spatial distribution and genesis of salt on the saline playa at Qehan Lake, Inner Mongolia, China. Catena 2019, 177, 22–30.
  73. Lobell, D.B.; Asner, G.P. Moisture effects on soil reflectance. Soil Sci. Soc. Am. J. 2002, 66, 722–727.
  74. Fang, S.; Yu, W.; Qi, Y. Spectra and vegetation index variations in moss soil crust in different seasons, and in wet and dry conditions. Int. J. Appl. Earth Obs. Geoinf. 2015, 38, 261–266.
  75. Abou Samra, R.M.; Ali, R. The development of an overlay model to predict soil salinity risks by using remote sensing and GIS techniques: A case study in soils around Idku Lake, Egypt. Environ. Monit. Assess. 2018, 190, 706.
  76. Scudiero, E.; Skaggs, T.H.; Corwin, D.L. Regional-scale soil salinity assessment using Landsat ETM+ canopy reflectance. Remote Sens. Environ. 2015, 169, 335–343.
  77. Ngabire, M.; Wang, T.; Xue, X.; Liao, J.; Sahbeni, G.; Huang, C.; Duan, H.; Song, X. Soil salinization mapping across different sandy land-cover types in the Shiyang River Basin: A remote sensing and multiple linear regression approach. Remote Sens. Appl. Soc. Environ. 2022, 28, 100847.
  78. Yang, J.; Zhao, J.; Zhu, G.; Wang, Y.; Ma, X.; Wang, J.; Guo, H.; Zhang, Y. Soil salinization in the oasis areas of downstream inland rivers —Case Study: Minqin oasis. Quat. Int. 2020, 537, 69–78.
Figure 1. Flow chart of the research.
Figure 2. Study area and spatial distribution of soil samples. (a) Location of the study area in China. (b) Distribution of soil texture types in the study area, obtained from the HWSD database. (c) Distribution of different land use types and sampling points, obtained from the CNLUCC database released by the Chinese Academy of Sciences.
Figure 3. Comparison of backscattering coefficients before and after water cloud model (WCM) processing. VH and VV are the original backscattering coefficients for the two polarization modes, and WCMVH and WCMVV are the corresponding coefficients after WCM processing. (a,b) compare the VH and VV backscattering coefficients, respectively, with the WCM-processed results.
Figure 4. Sample distribution. (a) The total distribution of samples and the distribution of calibration and validation sets, where N denotes the total number of samples, SD is the standard deviation, and CV is the coefficient of variation. (b) Distribution of all samples under different land uses, where N represents the number of sample points in each group.
Figure 5. Variable correlation analysis. (a) Result of Kendall, Spearman, and Pearson correlation analysis of 79 variables. (b) Result of Pearson correlation analysis of 34 variables after screening.
Figure 6. RFE algorithm model selection and results of three feature selection algorithms. (a,b) are accuracy comparison charts for selecting linear models, RF models, SVM models, and Treebag models for RFE variable selection, respectively. (c–e) are the characteristic variables and their importance maps obtained after variable selection using Elastic Net, GRA, and RFE algorithms, respectively. (f) is a comparison chart of the results obtained by the three variable selection algorithms.
Figure 7. Independent validation results for six models. The gray dashed line represents the 1:1 line, and the purple (violet) solid line is the fitted line between predicted and observed values; the 1:1 line provides a reference for the deviation of predicted from observed values. (a–f) Scatter plots representing six models: RFE-PLSR, RFE-RF, GRA-PLSR, GRA-RF, Elastic-PLSR, and Elastic-RF.
Figure 8. Independent validation results of the RFE-RF model versus the model without variable selection and the model with polarization combination features removed. (a) Scatterplot of validation results for the RFE-RF model. (b) Scatterplot of the model constructed using all variables without variable selection. (c) Scatterplot of the model constructed after removing the microwave data variables. (d) Predictions of the three models in (a–c) versus actual soil conductivity values. In (a–c), the gray dashed line represents the 1:1 line, and the red line represents the fitted line between the predicted and observed values.
Figure 9. Comparison of the spatial distribution of soil salinity in Minqin oasis predicted by the RFE-RF model with that in the HWSD database. (a) Spatial distribution of soil electrical conductivity obtained using the RFE-RF model. (b) Spatial distribution of soil electrical conductivity in the study area provided by the HWSD database. (c) Soil conductivity divided into 4 categories; because no predicted EC value exceeds 16 dS·m⁻¹, the fifth category (extremely saline soils) does not occur. (d) Cultivated land affected by salinization (EC > 2 dS·m⁻¹). (e) Soil condition of Qingtu Lake and its surroundings.
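The caption of Figure 9 gives only two of the class limits used in panel (c): 2 dS·m⁻¹ (salinization-affected land) and 16 dS·m⁻¹ (extremely saline soils). The following is a minimal sketch of such a classification, assuming the conventional intermediate breaks at 4 and 8 dS·m⁻¹; those two breaks, the class labels, and the function name `classify_ec` are assumptions for illustration rather than values taken from the article.

```python
from bisect import bisect_left

# Upper class limits in dS/m. Only 2 and 16 appear in the Figure 9 caption;
# the breaks at 4 and 8 are assumed conventional values.
BREAKS = [2.0, 4.0, 8.0, 16.0]
LABELS = [
    "non-saline",
    "slightly saline",
    "moderately saline",
    "strongly saline",
    "extremely saline",
]


def classify_ec(ec):
    """Return the salinity class for one EC value (dS/m).

    bisect_left keeps a value equal to a break in the lower class,
    so EC must be strictly greater than 2 to count as salinized.
    """
    return LABELS[bisect_left(BREAKS, ec)]


# Hypothetical predicted EC values for a few pixels.
classes = [classify_ec(ec) for ec in (1.5, 2.5, 6.0, 12.0)]
```

With these breaks, a prediction map whose maximum EC stays at or below 16 dS·m⁻¹ never produces the fifth label, which is the situation the caption describes.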
Table 1. Spectral indices and transformed band lists.
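| Category | Index | Formula | Reference |
|---|---|---|---|
| Salinity Index | Salinity Index (SI-T) | $(R/\mathrm{NIR}) \times 100$ | [39] |
| | Brightness Index (BI) | $\sqrt{R^2 + \mathrm{NIR}^2}$ | [39] |
| | Salinity Index (SI) | $\sqrt{B \times R}$ | [40] |
| | Salinity Index 1 (SI1) | $\sqrt{R \times G}$ | [40] |
| | Salinity Index 2 (SI2) | $\sqrt{G^2 + R^2 + \mathrm{NIR}^2}$ | [41] |
| | Salinity Index 3 (SI3) | $\sqrt{G^2 + R^2}$ | [41] |
| | Salinity Index (S1) | $B/R$ | [41] |
| | Salinity Index (S2) | $(B - R)/(B + R)$ | [42] |
| | Salinity Index (S3) | $(G \times R)/B$ | [42] |
| | Salinity Index (S5) | $(B \times R)/G$ | [42] |
| | Salinity Index (S6) | $(R \times \mathrm{NIR})/G$ | [42] |
| | Normalized Difference Salinity Index (NDSI) | $(R - \mathrm{NIR})/(R + \mathrm{NIR})$ | [40] |
| Vegetation Index | Normalized Difference Vegetation Index (NDVI) | $(\mathrm{NIR} - R)/(\mathrm{NIR} + R)$ | [43] |
| | Canopy Response Salinity Index (CRSI) | $\sqrt{\dfrac{\mathrm{NIR} \times R - G \times R}{\mathrm{NIR} \times R + G \times R}}$ | [44] |
| | Enhanced Vegetation Index (EVI) | $2.5 \times \dfrac{\mathrm{NIR} - R}{\mathrm{NIR} + 6R - 7.5B + 1}$ | [45] |
| | Difference Vegetation Index (DVI) | $\mathrm{NIR} - R$ | [43] |
| | Soil Adjusted Vegetation Index (SAVI) | $(\mathrm{NIR} - R)/(\mathrm{NIR} + R + L)$ | [46] |
| | Modified Soil Adjusted Vegetation Index (MSAVI) | $\left(2\,\mathrm{NIR} + 1 - \sqrt{(2\,\mathrm{NIR} + 1)^2 - 8(\mathrm{NIR} - R)}\right)/2$ | [47] |
| | Salinity Remote Sensing Index (SRSI) | $\sqrt{(\mathrm{NDVI} - 1)^2 + \mathrm{SI}^2}$ | [48] |
| | Ratio Vegetation Index (RVI) | $\mathrm{NIR}/R$ | [49] |
| | Generalized Difference Vegetation Index (GDVI) | $(\mathrm{NIR}^2 - R^2)/(\mathrm{NIR}^2 + R^2)$ | [50] |
| Spectral Transformation | Principal Component Analysis (PCA) | PC1, PC2, PC3 | [51] |
| | Tasseled Cap (TC) | TC1, TC2, TC3 | [52] |
| | Sentinel-2 MSI Bands | B4, B3, B2, B8 | |
| Red-edge Spectral Index | Red-edge Salinity Index S1 (RES1) | $B/\mathrm{RedEdge}$ | [19] |
| | Red-edge Salinity Index S2 (RES2) | $(B - \mathrm{RedEdge})/(B + \mathrm{RedEdge})$ | [19] |
| | Red-edge Salinity Index S3 (RES3) | $(G \times \mathrm{RedEdge})/B$ | [19] |
| | Red-edge Salinity Index S5 (RES5) | $(B \times \mathrm{RedEdge})/G$ | [19] |
| | Red-edge Salinity Index S6 (RES6) | $(\mathrm{RedEdge} \times \mathrm{NIR})/G$ | [19] |
| | Red-edge Salinity Index SI (RESI) | $\sqrt{B \times \mathrm{RedEdge}}$ | [19] |
| | Red-edge Salinity Index SI1 (RESI1) | $\sqrt{G \times \mathrm{RedEdge}}$ | [19] |
| | Red-edge Salinity Index SI2 (RESI2) | $\sqrt{G^2 + \mathrm{RedEdge}^2 + \mathrm{NIR}^2}$ | [19] |
| | Red-edge Salinity Index SI3 (RESI3) | $\sqrt{G^2 + \mathrm{RedEdge}^2}$ | [19] |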
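To illustrate how the ratio-type indices in Table 1 are evaluated, the sketch below computes a subset of them for a single pixel. It assumes that B, G, R, and NIR are surface reflectance values between 0 and 1 for Sentinel-2 bands B2, B3, B4, and B8; the function name `spectral_indices` and the example reflectances are hypothetical and are not taken from the study.

```python
def spectral_indices(b, g, r, nir):
    """Compute several Table 1 indices from band reflectances (0-1)."""
    return {
        "SI-T": (r / nir) * 100,            # Salinity Index (SI-T)
        "S1": b / r,
        "S2": (b - r) / (b + r),
        "S3": (g * r) / b,
        "S5": (b * r) / g,
        "S6": (r * nir) / g,
        "NDSI": (r - nir) / (r + nir),
        "NDVI": (nir - r) / (nir + r),
        "RVI": nir / r,
        "GDVI": (nir**2 - r**2) / (nir**2 + r**2),
    }


# Hypothetical reflectances for one sparsely vegetated pixel.
pixel = spectral_indices(b=0.10, g=0.15, r=0.20, nir=0.30)
```

Note that NDSI is the negative of NDVI by construction, so the two carry the same information with opposite sign; redundancy of this kind is one reason a feature selection step is applied before modeling.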
Table 2. Texture characteristics chosen for the article and the formula for calculating them.
| Feature Name | Formula | Clarification |
|---|---|---|
| Contrast (CON) | $\sum_{i,j=0}^{n-1} \dfrac{P_{i,j}}{1 + (i - j)^2}$ | Describes the clarity of the remotely sensed image and the degree of local variation within a small area. |
| Dissimilarity (DIS) | $\sum_{i,j=0}^{n-1} P_{i,j} \lvert i - j \rvert$ | Describes the localized contrast of a remotely sensed image. |
| Entropy (ENT) | $\sum_{i,j=0}^{n-1} P_{i,j} (-\ln P_{i,j})$ | Indicates the degree of disorder of texture features in remotely sensed images. |
| Angular Second Moment (ASM) | $\sum_{i,j=0}^{n-1} P_{i,j}^2$ | Describes the homogeneity and consistency of the grayscale distribution of remotely sensed images. |
| Correlation (COR) | $\sum_{i,j=0}^{n-1} P_{i,j} \dfrac{(i - \mu_i)(j - \mu_j)}{\sqrt{\sigma_i^2 \sigma_j^2}}$ | Describes the directionality of linear targets in remotely sensed images. |
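The texture measures in Table 2 are all sums over a normalized gray-level co-occurrence matrix P. As a minimal sketch, the code below builds a symmetric co-occurrence matrix for one pixel offset and evaluates DIS, ENT, ASM, and COR in their standard forms. The helper names `glcm` and `texture_features`, the 4 × 4 example window, the four gray levels, and the horizontal offset are illustrative assumptions, not the settings used in the study.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized gray-level co-occurrence matrix for one offset."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                i, j = image[y][x], image[ny][nx]
                counts[i][j] += 1
                counts[j][i] += 1  # count the pair in both directions
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


def texture_features(p):
    """Evaluate DIS, ENT, ASM, and COR from a normalized matrix p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    var_i = sum((i - mu_i) ** 2 * v for i, _, v in cells)
    var_j = sum((j - mu_j) ** 2 * v for _, j, v in cells)
    cov = sum(v * (i - mu_i) * (j - mu_j) for i, j, v in cells)
    return {
        "DIS": sum(v * abs(i - j) for i, j, v in cells),
        "ENT": sum(-v * math.log(v) for _, _, v in cells if v > 0),
        "ASM": sum(v * v for _, _, v in cells),
        # Undefined for a perfectly uniform window (zero variance).
        "COR": cov / math.sqrt(var_i * var_j),
    }


# Hypothetical 4 x 4 window quantized to four gray levels (0-3).
window = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
matrix = glcm(window, levels=4)
features = texture_features(matrix)
```

In practice the matrix is computed in a moving window over each band, and the resulting feature value is assigned to the window's center pixel.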
Table 3. The optimal parameters for each algorithm and the number of feature variables used in the model.
| Algorithm | Parameters | Number of Features |
|---|---|---|
| GRA-PLSR | n_components = 3 | 10 |
| Elastic-PLSR | n_components = 8 | 10 |
| RFE-PLSR | n_components = 2 | 11 |
| GRA-RF | n_estimators = 27, max_depth = 4, min_samples_split = 10 | 10 |
| Elastic-RF | min_samples_leaf = 1, max_features = 3 | 10 |
| RFE-RF | n_estimators = 10, max_depth = 6, min_samples_split = 3 | 11 |
Table 4. R-squared and RMSE of cross-validation for six models using the calibration set, and R-squared, RMSE, and RPD results of independent validation using the validation set.
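| Model | Calibration Set R² | Calibration Set RMSE (dS·m⁻¹) | Validation Set R² | Validation Set RMSE (dS·m⁻¹) | Validation Set RPD |
|---|---|---|---|---|---|
| GRA-PLSR | 0.29 | 3.41 | 0.16 | 2.49 | 1.09 |
| Elastic-PLSR | 0.30 | 3.38 | 0.31 | 2.24 | 1.21 |
| RFE-PLSR | 0.28 | 3.43 | 0.29 | 2.28 | 1.19 |
| GRA-RF | 0.72 | 2.12 | 0.42 | 2.06 | 1.31 |
| Elastic-RF | 0.83 | 1.66 | 0.62 | 1.66 | 1.63 |
| RFE-RF | 0.62 | 2.49 | 0.72 | 1.47 | 1.84 |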
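For readers who want to reproduce the three accuracy measures reported in Table 4, the following is a minimal sketch. It assumes that R² is the coefficient of determination (1 minus the ratio of residual to total sum of squares) and that RPD is the sample standard deviation of the observed values divided by the RMSE; the article's exact definitions may differ, and the function name `validation_metrics` and the sample values are hypothetical.

```python
import math


def validation_metrics(observed, predicted):
    """Return R2, RMSE, and RPD for paired observed/predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    sd_obs = math.sqrt(ss_tot / (n - 1))  # sample standard deviation
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": rmse,
        "RPD": sd_obs / rmse,
    }


# Hypothetical EC values in dS/m, not data from the study.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 1.5, 3.0, 4.5, 4.5]
metrics = validation_metrics(observed, predicted)
```

Under this definition, RMSE multiplied by RPD recovers the standard deviation of the observations. The validation-set products in Table 4 all fall near 2.7 dS·m⁻¹ (for example, 1.47 × 1.84 ≈ 2.70 and 2.49 × 1.09 ≈ 2.71), as would be expected when all six models are scored on the same validation samples.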

Zhang, S.; Zhao, J.; Yang, J.; Xie, J.; Sun, Z. Feature Selection and Regression Models for Multisource Data-Based Soil Salinity Prediction: A Case Study of Minqin Oasis in Arid China. Land 2024, 13, 877. https://doi.org/10.3390/land13060877
