1. Introduction
Clean water is indispensable for maintaining public health, ensuring environmental sustainability, and fostering economic prosperity, underscored by its inclusion in the UN Sustainable Development Goals [
1]. In 2020, an estimated 3.6 billion people, or approximately 46% of the global population at that time, were reported to lack access to managed drinking water services [
2], a figure projected to rise to 6 billion by 2050 [
3].
The contamination of water resources is often attributed to insufficient waste management in industrial activities [
4] and the runoff from agricultural practices [
5], posing significant risks to water quality. Moreover, climate change intensifies these water-related challenges by altering precipitation patterns [
6] and triggering extreme weather events like droughts and floods [
7], compromising water resources’ availability and quality. The scarcity and degradation of clean water resources have a profound economic impact [
8] and also highlight the urgent need to develop monitoring techniques that are cost-effective, rapid, and robust to ensure sustainable water management and access.
Water quality monitoring at various points within a water body is costly. It requires mobilizing material and human resources, ensuring accessibility to each sampling point, and transporting the samples to a laboratory. Performing this task regularly is economically unfeasible for most water bodies. Additionally, scientific studies are still lacking on how machine learning (ML) algorithms can be effectively utilized and how the latest remote sensing post-processing algorithms improve existing methods for assessing water clarity in aquatic environments. A promising solution lies in leveraging satellite monitoring capabilities for remote sensing combined with machine learning (ML) techniques. These data-driven approaches offer a pathway to achieving faster, more robust, and economically viable water clarity assessments. Water clarity indicates the presence or absence of contaminants in the water, such as suspended solids, dissolved solids, algae, and other pollutants. These factors influence how light is reflected and transmitted in the water. Consequently, measuring water turbidity using the Secchi disk depth (SDD) remains a widely used parameter for assessing water quality.
Using satellites for environmental monitoring encompasses a comprehensive scope that enables the collection of global observations with detailed local specificity. As reported by the United Nations Office for Space Affairs [
9], by February 2024, the total number of satellites in orbit reached 12,568, with a significant 61.02% of these launched in the short span from 2020 to 2023. These satellites have various instruments, including optical cameras, radar systems, and LiDAR sensors, facilitating remote observations across a broad spectrum of variables [
10]. Such capabilities allow for data collection on Earth’s surface coverage and detailed ocean metrics like temperature, height, salinity, and color [
11]. Beyond terrestrial and marine environments, satellites offer insights into atmospheric conditions, including humidity and precipitation [
12]. They are instrumental in monitoring natural disasters and their impacts, such as volcanic eruptions, fires, hurricanes, earthquakes, and floods [
13]. Furthermore, they are important in tracking long-term environmental changes, including sea-level rise, ice mass fluctuations, and global temperature trends [
14]. Among the most relevant applications is the monitoring of ice sheets, glaciers, and various bodies of water, such as rivers, wetlands, lakes, and dams, which is vital for assessing changes in water clarity and overall environmental health [
15]. Some available satellites for water clarity monitoring include the Sentinel and Landsat constellations [
16] and geostationary satellites [
17].
This study combines satellite imagery with ML techniques for water clarity assessment by predicting SDD. The methodology leverages the observable changes in water’s spectral signature caused by pollutant loads, which affect the reflectance across various electromagnetic spectrum bands. Such alterations facilitate the development of predictive models linking these spectral changes to SDD, a measure significantly influenced by the concentration of suspended matter in water [
18]. The study employs multifaceted modeling to operationalize this approach, creating an array of regressors that include decision trees, kernel-based methods, and neural networks. The optimization of hyperparameters for these models ensures that the predictive accuracy is maximized. By repeatedly training these models on various data splits, the study validates their robustness and facilitates the selection of the most effective model configurations. The culmination of this process is the formation of an ensemble regressor, combining the strengths of individual models to provide superior inferences of SDD. This ensemble model is then applied to new satellite images to generate inferences indicating water clarity, offering a scalable and dynamic environmental monitoring and management tool.
This document outlines utilizing ML-based regression techniques for estimating water clarity, specifically by estimating SDD. Its contributions include the following:
A thorough evaluation of machine learning algorithms to estimate the SDD using the L1TP (Level-1 Terrain Precision) and L2SP (Level-2 Science Product) Landsat image collections, aimed at assessing the atmospheric processing results of the latter collection.
Spatiotemporal evaluation of the enriched dataset within the context of the Valle de Bravo Dam’s waterbody. This dam is an integral part of the Cutzamala system, which is crucial for supplying water to large areas of Mexico’s population.
The public release of the dataset and code encourages external validation of the findings and facilitates future advancements in water clarity research through remote sensing.
The subsequent sections of the article are dedicated to elaborating on the related literature, the dataset preparation process, the methodologies adopted, the outcomes achieved, and their broader implications for the study of water clarity using remote sensing technologies. We conclude by summarizing the findings and outlining prospective avenues for further investigation in this area of environmental research.
2. Literature Review
Secchi’s pioneering experiments aboard the
Immacolata Concezione in 1864 laid the groundwork for understanding light and color behavior in marine environments by measuring sea transparency using various disks [
19]. This historical method has evolved with technology, as exemplified by Yu et al. [
20], who mapped MODIS aqua satellite observations to SDD values in the Yellow and East China seas using a polynomial model, achieving a significant correlation. Similarly, advancements in theoretical approaches and empirical algorithms have been developed to improve SDD estimation accuracy. Lee et al. [
21] revised the Law of Contrast Reduction, introducing a new theoretical model based on the diffuse attenuation coefficient. This shift towards more accurate models is echoed in studies across various regions, including the Korean Peninsula by Kim et al. [
22], which highlighted spatial and temporal variations in MODIS-derived SDD, and the Arabian Gulf by Kaabi et al. [
23], utilizing a regionally calibrated algorithm to assess water clarity through empirical correlations.
Further research efforts have extended the utility and accuracy of SDD estimations in diverse aquatic environments. Alikas and Kratzer [
24] developed empirical and semi-analytical algorithms for lakes and coastal waters with high concentrations of organic matter. At the same time, Rodrigues et al. [
25] applied the mechanistic model by Lee et al. [
21] to Brazilian waters, leading to the QAAR17 model’s development for improved SDD mapping. Jin et al. [
26] mapped the spatial extent of surface water resources and evaluated their water quality over time using remote sensing and models of SDD, chlorophyll-a (Chl-a), and suspended solids (SS) concentration. These advancements signify a continuous effort to refine SDD estimation methods, addressing challenges such as high variability in water qualities and the need for model adaptations based on specific water types, as highlighted by recent studies like those by Qing et al. [
27], Guo et al. [
28], and Zhang et al. [
29], which strive for increased accuracy in SDD estimation through semi-analytical models and enhanced data analysis techniques.
Recent advancements in remote sensing and ML have significantly improved the estimation and monitoring of water clarity, particularly in measuring SDD across various water bodies. Studies by Alsahli and Nazeer [
30], Zhou et al. [
31], and Zhang et al. [
32] have utilized different atmospheric correction methods, classification-based approaches, and ML techniques such as generalized regression neural network (GRNN), sparse spectrum Gaussian process regression (SSGPR), extreme gradient boosting (XGBoost), and random forest (RF) to enhance SDD estimations. These methods, applied to data from satellites like Sentinel-2 and Landsat, demonstrate a promising potential for remote sensing in water clarity monitoring.
Efforts to refine the estimation of water clarity from remote observations include Golubknov and Golubkov [
33], who employ Principal Component Analysis to select features and variance analysis for clustering. Shamloo and Sima [
34] model the SDD and Landsat multispectral response with Artificial Neural Networks. Wang et al. [
35] harmonize different remote-sensing satellite platforms, such as SPP, MODIS, and MERIS, using the minimization of a physics-based cost function, which relates previous field observations with satellite images. Zheng et al. [
36] use the correlation coefficient between specific satellite wavebands and SDD field measurements to develop an inversion model that accurately estimates water transparency in Poyang Lake, achieving a high correlation for bands such as blue and red. Feng et al. [
37] emphasize the importance of satellite remote sensing, interdisciplinary research, and data-sharing frameworks to address the problem of harmful algal blooms (HABs) in inland waters. Youssef et al. [
38] employ Landsat and Grace satellite imagery and Geographic Information System (GIS) tools to study the impact of climate change and human activities, such as urbanization and agriculture, on land cover and water resources in the Eastern Nile Delta.
Furthermore, the integration of neural networks (NNs) and Mixture Density Networks (MDNs) in studies by Sun [
39], Gan et al. [
40], and Maciel et al. [
41], along with the employment of a convolutional neural network (CNN) regressor by Schatz et al. [
42], indicates a shift towards more sophisticated analytical models. These models have successfully mapped satellite data to SDD observations with high accuracy, reflected in
values up to 0.93, and addressing challenges such as sensor-specific errors and data harmonization, particularly for scenarios with open waters.
3. Data Resources
This section describes the datasets and resources used to analyze water clarity, focusing on multispectral satellite observations and derived indices. We detail the AquaSat database, including its composition, preprocessing levels, and the selected Landsat 8 records used in this study. Additionally, we present the multispectral indices employed to enrich the feature set and the geographical context of the study site within the Cutzamala System.
3.1. Landsat Image Collections
Our analysis begins with AquaSat [
43], a database for water clarity measurement. This database comprises 603,432 records, including intensity values for various multispectral bands from the Landsat 5, 7, and 8 platforms and depths associated with the SDD. The data in AquaSat were obtained from the L1TP (Level-1 Terrain Precision) collection, which corrects for radiometric and geometric issues, including sensor irregularities and distortions due to the Earth’s rotation. In 2017, the United States Geological Survey (USGS) introduced the L2SP (Level-2 Science Products) processing, which accounts for atmospheric effects, such as absorption and scattering phenomena. Consequently, we extract the L2SP collection values for the geographic locations in the AquaSat database through queries to Google Earth Engine (GEE) collections. We also deemed it interesting to explore the result of the regressors applied to the
neighborhood centered at the AquaSat sampling geolocation. Thus, we also extracted the multispectral intensity values from the GEE collections corresponding to L1TP and L2SP. Due to the presence of stripes in some Landsat 7 images, the aim to maximize the probability of obtaining high-quality pixels, and the period covered by the AquaSat water body sampling, our focus is exclusively on the Landsat 8 platform, yielding 33,261 observations. From there, we gather information on pixel quality and specific bands: coastal aerosol (0.43–0.45
m), blue (0.450–0.51
m), green (0.53–0.59
m), red (0.64–0.67
m), near-infrared (NIR) (0.85–0.88
m), short-wave infrared 1 (SWIR1) (1.65–2.07
m), and SWIR 2 (SWIR2) (2.11–2.29
m), all with a resolution of 30 m per pixel.
3.2. Multispectral Indices
The characteristics provided by the satellite multispectral bands are enriched with an additional set of descriptors derived from studies on water quality, turbidity, suspended particle studies, and SDD. Partially inspired by the work of Avdan et al. [
44], we identified the following descriptors:
Normalized Difference Water Index (NDWI) [
45]: The NDWI is designed to enhance the presence of water bodies by contrasting the reflectance in the green and near-infrared (NIR) bands. Water strongly absorbs NIR wavelengths while reflecting green light. Therefore, a high NDWI value corresponds to the presence of water. The NDWI can be obtained using the following expression:
where
is a very small number used to avoid indetermination. A higher NDWI generally indicates clearer water with fewer suspended particles, which often corresponds to greater SDD.
Normalized Difference Suspended Sediment Index (NDSSI) [
46]: NDSSI is tailored to detect suspended sediments. Blue light is reflected more by suspended particles such as sand, clay, and organic particles, while clear water absorbs it. NIR, on the other hand, is absorbed by water and reflected by sediments. NDSSI is computed using:
A high NDSSI indicates higher concentrations of suspended sediments, corresponding to a lower SDD.
Normalized Multi-band Drought Index (NMDI) [
47]: The NMDI is used to detect water content in vegetation and soil. SWIR bands capture the absorption due to water content, while NIR reflects the overall moisture presence. The NMDI can be obtained using the expression:
This index helps detect turbidity and suspended matter in water bodies since clearer water has different reflectance in SWIR bands from turbid water.
Normalized Difference Turbidity Index (NDTI) [
48]: The NDTI quantifies water turbidity by contrasting green and red reflectance. Turbid waters with high sediment levels reflect more red light and absorb green light. The expression to compute NDTI is given by:
Higher turbidity corresponds to lower SDD values. Therefore, NDTI is directly related to water clarity estimation.
These indices capture the interactions of light with water, sediment, and other particulates across different wavelengths. While NDWI and NDSSI focus on water presence and suspended solids, NMDI emphasizes moisture and particulate matter. Meanwhile, NDTI relates to optical water properties.
In this methodology, the geolocation provided by AquaSat is utilized to identify observations corresponding to the Landsat 8 satellite. The image identifier is then used to formulate a query in Google Earth Engine, retrieving the image associated with that geographic location and extracting the values of its bands. This process results in a database of 47,405 samples. From this database, we select those samples whose pixel quality indicates cloudless locations [
49]. Records that do not contain the value of the SDD, which serves as the response variable, are discarded, leaving us with a final database of 33,621 records.
3.3. Experimental Field of Interest
Given its significance as a water source affecting the livelihood of millions, we choose the Cutzamala System as the site for applying the inferences made by the ML regressors. The Cutzamala system, inaugurated in 1978, is crucial for supplying water to Mexico City (CDMX) and the State of Mexico, contributing 12
of water in 2023. This system, along with the Lerma System, which provides an additional 5.724
and groundwater extraction, totaling 58.322
, meets various needs: urban (67.746%), agricultural (4.068%), and industrial (23.804%) [
50]. The Valle de Bravo Dam, a key component within the Cutzamala System, spans 1700 ha with a maximum depth of 35 m. Initially designed for hydroelectric generation, it has evolved into a multipurpose facility, contributing 6 m
3/s to the Cutzamala System’s flow, serving as a flood regulation basin, and becoming a prime tourist destination as the country’s most significant recreational dam [
51]. Fed by seven streams, including the Amanalco and Molino Rivers, the Valle de Bravo Reservoir plays a crucial role in the region’s hydrology and environmental sustainability [
52]. Analyzing water clarity within the Valle de Bravo Dam is a starting point for identifying pollution sources, understanding pollutant behavior and distribution, and potentially optimizing the operations of potabilization plants that supply drinking water to CDMX.
Specifically, the Valle de Bravo Dam is located in the hydrological basin of the Balsas River, located at geographic coordinates
North and
West [
51] (see
Figure 1). It was completed in 1954 and has a maximum height of 35 m. Along with the El Bosque and Villa Victoria dams, it is an essential component of the Cutzamala System [
50], collectively supplying
and
of potable water to CDMX and the State of Mexico, respectively [
51].
4. Methods
We aim to infer the SDD over water bodies from Earth observations over time. To that end, the inference problem contains two stages. The first focuses on inferring the SDD on static images, which is solved by constructing an ML-based ensemble regressor. The second stage pursues the assessment of the SDD over time. Sometimes, the pixels are not usable in satellite images because clouds cover the view. Thus, we take a state estimation approach. This approach will serve us well later in cases where the water body is wholly or partially covered with clouds.
Figure 2 illustrates a schematic representation of the proposed methodology.
4.1. Setting Up the Regressors
Theoretical and empirical research shows that a good ensemble selection should include individual, accurate regressors that make mistakes in different regions of the input data distribution [
54]. Among the many regressors, we select representatives of decision trees, kernel-based methods, and graphical models. Decision trees establish their discrimination boundaries parallel to the main axis (although some variations lift this restriction [
55]), and kernel-based methods excel in establishing similarity between observations non-linearly projected into multidimensional space. In contrast, neural networks project data into non-linear space in stages, generating increasingly abstract semantic representations. By broadening the spectrum of approaches, we aim to reveal the interpretability of different regions of the dataset. We deliberately dismissed techniques requiring large spatial support, such as convolutional neural networks (CNNs) or Transformers [
56]. Given the 30 m/pixel resolution of Landsat observations, applying these techniques to such data might not yield meaningful spatial feature extraction, as even small image patches represent relatively large areas (e.g., 10 ×10 pixels cover 300×300 m). This spatial scale may be too coarse for the localized nature of SDD measurements, making them less suitable for SDD assessments.
In the present approach, a set of regressors is evaluated, including support vector regression (SVR), neural networks (NNs), and extreme gradient boosting (XGB) [
57]. Subsequently, an ensemble of regressors is constructed through an NN.
XGB [
58]. We optimized the hyperparameters related to the learning rate
, the percentage of columns sampled in each tree
, the maximum depth
, the percentage of data sampled to build a tree
, and the regularization term
. The hyperparameter search is performed randomly using a uniform distribution with 1000 samples over the specified interval. The evaluation is carried out by cross-validation. During hyperparameter selection, 50 decision trees are generated. With the chosen parameters, a model with 100 trees is trained.
SVR [
59]. The goal is to regress the objective curve using a kernel on the input data that projects them into a non-linear space. We aim to approximate them in this space by a straight line within a tolerance margin
. For SVR, we optimize the hyperparameters related to the constant
C in the range [1, 1000], allowing some flexibility in crossing the defined margin;
in the range [0.01, 1], determining the flexibility of the decision boundary; and
in the range [0.01, 0.1], defining the margin within which errors are not penalized.
NN [
60]. Using gridded hyperparameter search, the selected neural network model is chosen from architectures that include between 32 and 512 units per layer, with the number of layers ranging from one to five. A ReLU activation function and an
regularizer on the layer weights are employed, searching for the optimal regularization constant per layer. The optimizer is Adam, and the loss function is Mean Square Error (MSE). The regressor is trained during up to 500 epochs with
early stopping based on the loss value in the validation partition, having patience during ten epochs. The model that achieves the best loss value during training is retained.
Ensemble. An NN was designed as the underlying structure of the ensemble. Using gridded search to look for the best hyperparameters for the number of layers (between one and five) and the number of units per layer (between three and one hundred in steps of five). The Adam optimizer and an MSE-based loss function were used for parameter optimization. The training process was carried out for up to 500 epochs, using early stopping with the patience of ten epochs. The model that showed the best performance on the validation set was retrained. For training, the predictors are scaled according to the inference models related to NN, XGB, and SVR before being introduced into the ensemble. The predictions are then normalized using the training partition before parameter optimization.
Following Afendras and Markatou [
61], who offer theoretical foundations, the data were divided into 50% for training, 20% for validation, and 30% for testing. Using the training partition, the predictors were normalized while retaining the mean value and standard deviation across all bands. These parameters were later applied to normalize the validation and testing splits. Performance is measured using the coefficient of determination
[
62]. The procedure was repeated 20 times, conducting the learning process for each of the 20 random data partitions. This process resulted in a mean value of
and a standard deviation of the performance for each regressor.
4.2. Estimating the SDD over Time
The permanence of remote sensing platforms makes it possible to explore the temporal assessment of SDD over water bodies. For instance, the first Landsat platform was launched on 23 July 1972, making it the longest-serving satellite platform for Earth observation analysis. Although it was only on 1 October 2008 that the full images platform was publicly available, the length of its records made it possible to revisit historical observations for analysis. NASA, the USGS, and the Landsat program agencies, have tried to maintain band compatibility over time. In particular, they have added the Quality Assessment Band (BQA) to check for the presence of clouds, cloud shadows, snow, and ice, all of which prevent the correct assessment of SDD by distorting observations intended to be made of the surface reflectance.
Additionally, Landsat images undergo several transformations from when an image is captured until the data are delivered to users. These transformations include radiometric, geometric, and atmospheric corrections, notwithstanding the corresponding data compression. In order to obtain a robust estimate of the SDD over time, we have devised the use of a Kalman filter. The state estimation of the Kalman filter will marginalize the complex effects of the different data processing stages. However, it will serve as an imputation strategy for missing values in the presence of bad BQA values.
Each evaluation of the SDD in the water body estimates its depth for each image pixel. Added value can be obtained by considering the observations over time
and the corresponding measurement noise covariance
. We model the estimation of the state of the system
and the corresponding transition noise covariance
using a Kalman filter. The prediction equations for the state
and corresponding state covariance
at time
k given the measurements up to time
can be expressed by [
63]:
where
is the state transition matrix, and
is the system noise covariance matrix. On the other hand, the equations for update are given by:
where the observation model
is represented as
,
stands for the Kalman filter gain at time
k, and
is the identity matrix. In this formulation,
contains the best estimate for the SDD at time
k. In contrast,
contains the best prediction for the measurement at time
k, particularly useful when clouds occlude the surface covered by a specific pixel or the BQA value discards its employment.
5. Results
In this section, we analyze the performance of regressors in the spatiotemporal inference of the SDD. The regressors are programmed in Python. For the neural networks, we used TensorFlow 2.12.1. For XGBoost, we used XGBoost 2.0.0; for SVR, we used scikit-learn 1.5.1.
5.1. Predictors
Upon verifying the input data, we found that while the predictors can range between 0 and
, they predominantly fall within the range of 5000 to 13,000. The response variable is generally between 0 and 10 m but can reach values higher than 60 m. The normalized mutual information (NMI) between variables associated with ultra-blue and blue colors registers a value of 0.350. Similarly, the NMI between SWR1 and SWR2 is 0.484. However, the NMI concerning the response variable is generally low, oscillating between 0.008 for SWR2 and 0.046 for the red band.
Figure 3 illustrates the relationship between the multispectral bands and the SDD response variable. The linear correlations between the coastal aerosol, blue, red, green, NIR, SWIR1, and SWIR2 bands and the SDD variable are −0.06, −0.09, −0.18, −0.18, −0.11, −0.11, and −0.10, respectively. Please note that we multiplied the integer intensity values in the Landsat images by 0.0000275, as recommended by the USGS, to compute the band reflectance.
Derived from the multispectral bands, we investigated the incorporation of multispectral indices, including the NDWI, NMDI, NDTI, and NDSSI (see
Section 3.2). These indices allow for the establishment of non-linear relationships between bands to enrich the feature set. The multispectral indices were constrained to values between −1 and 1 for subsequent use.
5.2. Regressors Training
For the XGB and SVR regressors, the best hyperparameters for models with the lowest loss on the validation partition were identified through a random search in the parameter space. The XGB regressors included values for the learning rate (
), column sample by tree (
cs), maximum depth (
md), subsample ratio (
ss), regularization term (
), and number of estimators. For SVR regressors, the optimal parameters comprised the penalty term (
C), epsilon (
), and kernel coefficient (
). The best parameters for the NN and ensemble regressors were obtained through a grid search across the architectural space. For the NN regressors, the optimal architectures varied in terms of the number of layers and neurons per layer, with layer counts ranging from two to four and neurons per layer ranging from 32 to 512. The last layer in each architecture contained a single neuron, which is not shown in the table. For the ensemble regressors, the architectures consisted of two to four layers, with neuron counts tailored for each method to achieve the best performance. The specific hyperparameters for all regressors are summarized in
Table 1. The table provides details about each regressor type, including whether the L1TP (T) or L2SP (S) collection was used, whether only bands (b) or both bands and spectral indices (s) were included, and whether the central pixel (c) or the
neighborhood (n) was utilized. The results were obtained using the best regressor for each method and repeating the data partition 20 times.
The results obtained are described and illustrated in
Table 2 and
Figure 4, showing that XGB and SVR perform similarly across all options. Interestingly, the L1TP collection appears to offer marginally better results. Superior performance is achieved with the neural network, improving the determination coefficient by approximately 0.09 consistently, except for the L2SP collection when using multispectral indices. However, the ensemble appears to outperform the different options, offering
values around 0.75 for almost all cases, except when using the central pixel in the L2SP collection, where its performance drops to around 0.66–0.69.
5.3. Verification Through Fieldwork
Landsat images are downloaded through the USGS platform during operation using their Python API (note that we employed GEE-downloaded images to construct the regressors). The polygon defining the dam area is set to obtain the multispectral values that feed the ensemble regressor. Typical examples are shown in
Figure 5. Despite the observation resolution of Landsat being 30 m/pixel, the presence of clustered zones with similar depths along the dam is noticeable. The pixels in the images were filtered, considering only those without clouds or cloud shadows.
The availability of satellite data from Landsat 8 facilitates temporal analysis based on the sequence of observations. For this purpose, the mean value of the SDD predictions for the entire dam area and its standard deviation are used to monitor their evolution over time. To achieve a robust estimation, the Kalman filter is applied. Experimentally, we define a process
variance with a value of 0.2. From 5 December 2013 to 6 July 2023, satellite observations were collected, obtaining 110 images.
Figure 6 illustrates the temporal variations in the SDD, with notable peaks at the end or beginning of each year, followed by a decline a few months later. Additionally, there is a slight but perceptible reduction in the SDD over time, with an average decrease of 3.42 cm per year.
We analyzed the contribution of neighboring pixels by incorporating a 3 × 3 pixel neighborhood (90 m × 90 m) into the model. As shown in
Table 2, this approach led to improved performance. However, the area covered by 3 × 3 pixels in a Landsat image is substantial, and special care must be taken during sampling to account for the effects of shallow waters and the presence of soil. These factors can compromise remote sensing observations, particularly in inland water bodies, as highlighted by the field sampling. For our application in the Valle de Bravo Dam, we ultimately employed the 1 × 1 pixel Tbc regressor, which provided more localized and reliable results.
5.4. Field Data Validation
A field visit to the Valle de Bravo Reservoir was conducted to obtain field-measured values. Direct sampling is a fundamental tool for constructing models using satellite images, as it provides ground truth. The reference values, in turn, enable the construction and refinement of predictions made by these models.
The field visit necessitates careful planning, determination of sampling sites, and accurate collection and processing of quality parameters during sampling and laboratory work. These steps are important for obtaining and calibrating models with reliable data. Typically, selecting sites for representative reservoir sampling is based on assumptions regarding its discharges. Even in water clarity studies utilizing remote sensing, the determination of these sites can be somewhat arbitrary.
For this study, we followed a stratified sampling approach to maximize the multispectral diversity of the study area. Specifically, the intensity values were first normalized by evenly dividing the range of each band into 12 levels. This means that each group was represented as a vector of values between 0 and 11, with as many positions as spectral bands in the satellite images. Subsequently, all possible groups were identified, and the percentage of pixels belonging to each was calculated. We retained groups containing more than 5% of the pixels, while the remaining groups were assigned to the most similar one using the smallest Euclidean distance. This approach ensures that the samples are distributed across areas with the highest potential diversity of results.
On 7 October 2022, Valle de Bravo Dam measurements coincided with a Landsat 8 satellite overflight. During our fieldwork, observations were taken from forty points, with geolocation and measurement of the SDD
at each site. Subsequently, we downloaded the image set corresponding to this observation and performed inference calculations using the described procedure.
Figure 7a illustrates the portion of the dam that was not obscured by clouds or over which a cloud was not projected (
Figure 7b), highlighting the location of the points and the difference between the reference value and the prediction colored. To obtain the prediction
, an interpolation function was created to evaluate the sampling geo-coordinates.
Figure 7c displays the difference between the field measurements and the predictions. The coefficient of determination was found to be
.
The present approach consists of regressors for XGB, SVR, NN, and an ensemble, which require fine-tuning of their hyperparameters. In addition, XGB, SVR, and NN require preparing the regressors before the ensemble. In this case, we repeated the training stage 20 times with different data splits for training, validation, and testing. Note that SVR requires the most time, while XGB is the fastest. Also, the time taken to fine-tune the hyperparameters for a neighborhood does not increase linearly with the number of predictors.
6. Discussion
With the increasing availability of the data corpus [
43], research efforts to determine water clarity have shifted from semi-analytical approaches [
21] to data-driven methods. Likewise, new insights into surface reflectance scattering and absorbing effects [
16], due to temporal, spatial, and spectral variations in the presence of gases, aerosols, and water vapor, underscore the importance of evaluating available datasets. Unlike past efforts, we updated the Aquasat dataset predictors, obtained with L1TP processing, with intensity values corresponding to the newly developed L2SP image post-processing algorithms [
64]. While the L1TP processing includes geophysical corrections related to ground control point (GCP) and digital elevated map (DEM) corrections, L2SP adds additional corrections for atmospheric effects related to light absorption and scattering. Since L2SP attempts to remove atmospheric interference, the observations estimate the best surface reflectance, aiming to make the inferences more trustworthy [
65]. Note that the cloud-corrected L2C2 interpolates over clouds, whereas L2SP flags the presence of clouds. We have preferred the latter option to reduce misinterpretations of the surface reflectance values.
To assess the capacity of this new dataset to infer manually obtained SSD observations, we constructed a baseline machine learning scheme with kernel, decision trees, and neural network-based ML schemes. The resulting ML ensemble, trained with 33,261 measurements originally part of Aquasat, has been updated and tested in the field on a water body crucial for millions of people [
50]. Note that the aggregated analysis has allowed obtaining a general trend for water clarity at the Valle de Bravo Dam, as illustrated in
Figure 7, which highlights potential applications of the present approach.
The techniques employed to solve the SDD problem have evolved significantly since Secchi’s pioneering work [
19]. These methods include polynomial models [
20], empirical and semi-analytical algorithms [
24], mechanistic models [
25,
27,
28,
29], and ML techniques such as generalized regression neural networks (GRNNs), sparse spectrum Gaussian-process regression (SSGPR), XGBoost, and random forest (RF). The present approach contributes by exploring various ML techniques to benchmark an updated dataset and apply it to a novel water body.
The selected regressors represent the most commonly used non-probabilistic approaches discussed in the literature [
54], purposely chosen to offer a wide spectrum of criteria and ensembled to cover different dataset regions. Particularly for this application with inland water bodies, where the coarse resolution of satellite images and the punctual nature of SDD sampling pose challenges, the chosen approach proved appropriate. In our case, employing other techniques, such as CNNs [
42], which are more suitable for open-water scenarios, is less feasible.
Our results show that for the case of calculating SDD, the L1TP collection offers acceptable results compared to those obtained from the L2SP collection. This result is consistent with Li et al. [
66], who showed that the former provided better performance when calculating land surface temperature and emissivity from L1TP and L2SP. Other results, such as those by Sun et al. [
67], show that L2SP outperforms L1TP in reflectance consistency, particularly in applications such as NDVI calculation. However, when atmospheric conditions are minimal, such as where the effects of scattering, absorption, and refraction are reduced or negligible, L1TP is sufficiently good. Some of the reasons for these results may be related to overcorrection in the L2SP collection, particularly concerning water turbidity, interpolation, or smoothing, especially along water–land transitions. In this sense, the L1TP collection may more faithfully preserve the original radiance values, particularly useful in complex environments such as water bodies. These ideas will be of interest to future research.
The results suggest that atmospheric correction in the L2SP does not improve the results obtained with L1TP. A possible way to gain intuition about this situation is by implementing atmospheric correction mechanisms on the L1TP collection, such as ACOLITE [
68], 6S (Second Simulation of a Satellite Signal in the Solar Spectrum) [
69], and iCOR [
70], followed by a comparison with the results of the regressors built on top of the L2SP. This comparison will be the subject of future research.
This manuscript studies the construction of machine learning regression schemes from satellite images. Nonetheless, the number of available remote sensing platforms useful for satellite-based observations has steadily increased over recent years [
9], which could potentially enhance spatiotemporal analysis, particularly in dealing with seasonal or episodic events. It is important to note that the in situ observations and the satellite overflight dates must align for the construction of the regressors. In our case, AquaSat [
43] was constructed using Landsat satellite passes. Nonetheless, several strategies have been proposed to harmonize remote sensing observations across different platforms [
71,
72], including Sentinel, PlanetScope, MODIS, and others. These approaches offer an exciting opportunity to enhance the spatiotemporal analysis of water bodies. However, this requires obtaining reference values from the newer platform to assess performance properly.
7. Conclusions
As pressure on water resources increases, continuous monitoring of water bodies becomes more important. While this is especially true for water bodies intended for human consumption and agriculture, it is essential to remember that water bodies are an integral and essential part of ecosystems on which the life and well-being of plants and animals depend. This article evaluates the feasibility of using machine learning algorithms and remote sensing from Landsat satellite images to estimate water clarity by determining SDD. In particular, we compare decision tree-based schemes, kernel-based approaches, neural networks, and regressor ensembles. Additionally, we compare the effectiveness of Collection 2 at its processing levels: Level-1 Terrain Precision (L1TP) and Level-2 Surface Product (L2SP). Moreover, we examine the performance obtained when using the central pixel, which indicates the geographical position of the reference sample and the use of pixels surrounding that position. The results indicate that machine learning-based approaches can effectively estimate SDD.
To demonstrate the proposed scheme, we reviewed a key water body that supplies water resources to CDMX. We conducted a historical analysis of water clarity, highlighting its temporal trends. During field visits, we collected samples, which were subsequently verified using the developed system, yielding satisfactory results. This study provides insights into the spatiotemporal distribution of water quality in strategic reservoirs, such as the Valle de Bravo Dam, which supplies Mexico’s most populated city. The model was incorporated into a Geographical Information System used by CDMX authorities to monitor the water clarity of the Valle de Bravo Dam [
73]. Consistent monitoring will support informed decision-making regarding water management and treatment and effective communication with citizens.
This research may be enriched with data from precipitation measurements, water runoff flows, demographic density analysis, and research into water currents within the dam, which would improve the understanding of water movement and augment the current system’s efficacy. Additionally, it will be interesting to explore why the L1TP collection continues to provide competitive performance compared to the post-processed L2SP collection. Perhaps this can be supported by the implementation of alternative atmospheric correction methods. Another interesting direction for future research may involve harmonizing across multiple remote-sensing platforms to increase monitoring resilience in the face of seasonal or episodic events.