Optica Publishing Group

Long-term monitoring of chlorophyll-a concentration using HJ-1 A/B imagery and machine learning algorithms in typical lakes of a cold semi-arid region

Open Access

Abstract

Chlorophyll a (Chl-a) in lakes serves as an effective marker for assessing algal biomass and the trophic state of lakes, and it can be observed through remote sensing methods. The HJ-1 (Huanjing-1) satellite, deployed in 2008, carries a CCD camera with 30 m resolution and a revisit interval of 2 days, making it an excellent primary or supplemental sensor for monitoring the trophic state of lakes. For effective long-term, regional-scale mapping, both the imagery and an evaluation of machine learning algorithms are essential. Seven typical machine learning algorithms, i.e., Support Vector Regression (SVR), Gradient Boosting Decision Trees (GBDT), XGBoost (XGB), Random Forest (RF), K-Nearest Neighbor (KNN), Kernel Ridge Regression (KRR), and Multi-Layer Perceptron Network (MLP), were developed using our in-situ measured Chl-a. A cross-validated grid search was used to identify the most effective hyperparameter combination for each algorithm, and the optimal combinations were selected. In Chl-a mapping of three typical lakes, the R2 of GBDT, XGB, RF, and KRR all reached 0.90, while the XGB algorithm also exhibited stable performance with the smallest error (RMSE = 3.11 μg/L). The Chl-a spatial-temporal patterns mapped from HJ1-A/B CCD images using the XGB algorithm were consistent with past data, demonstrating its stability. Our results highlight the considerable effectiveness and utility of HJ-1 A/B CCD imagery for evaluating and monitoring the trophic state of lakes in a cold semi-arid region, providing application cases that contribute to ongoing efforts to monitor water quality.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

While inland lake ecosystems occupy a relatively minor portion of the Earth’s surface [1], their biodiversity is substantial, and they play crucial roles in cycling materials within regional biospheres, as well as in regulating the local environment and climate [2]. As natural ecosystems deteriorate and human activities become more intense, inland lake eutrophication is becoming a more serious concern [3,4]. For instance, phytoplankton bloom intensity is increasing, and some algae also release toxins and hazardous substances that seriously threaten public health and drinking water safety [5]. Typically, phytoplankton are seen as sensitive indicators of aquatic environments, and their growth cycles and ecological parameters act as key regulators of eutrophication [6]. Meanwhile, chlorophyll-a (Chl-a), the main photosynthetic pigment of phytoplankton, is another important parameter for evaluating aquatic environments and is directly used to assess algal biomass as well as nutrient status [7]. In light of this, to enhance our understanding of primary productivity, biogeochemical processes, and the management of inland water quality in freshwater ecosystems, it becomes particularly crucial to monitor the real-time concentration changes of Chl-a in inland waters.

Studies of global warming have demonstrated that rising temperatures and low-pressure systems tend to enhance the intensity and duration of algal blooms in lakes [4]. Recent studies show that algal bloom events have increased globally in frequency and intensity each year, especially at high latitudes in cold and arid regions [8,9]. Likewise, increased terrestrial inputs from anthropogenic and intensive agricultural activities, specifically from industrial and domestic wastewater and the misuse of pesticides and fertilizers in watersheds, result in serious eutrophication pollution [10-13]. Although the trophic state of inland lakes can also be affected by innate lake characteristics, such as lake depth [14], this increasing algal bloom trend has accelerated due to anthropogenic processes. Ultimately, this has caused serious problems in areas also affected by drought that act as major food-production areas in Asia.

In cold and arid areas of Northeast China, regional water shortage and eutrophication problems are becoming increasingly serious [15,16], especially in areas such as Chagan Lake and Xingkai Lake. This is mainly due to long cropland reclamation periods in the agri-food producing regions, coupled with long-term fertilizer use that accumulates terrestrial inputs in water and sediment. Additionally, agricultural water use causes extensive water loss in input rivers and lakes [17], while rainfall and excessive irrigation create runoff that entrains nutrients to downstream terminal lakes [18]. Furthermore, spring snowmelt adds more terrestrial inputs containing organic matter and nutrients that contribute to runoff and soil leaching into these lakes [19,20]. Thus, these cold-arid, agri-food producing regions must balance the development of agriculture with aquatic health for sustainable lake ecosystems. One way to monitor aquatic health is with continuous, large-scale measurements of Chl-a. However, its concentrations need to be linked to specific trophic states for large lakes in order to quantify rapid environmental and hydrological variations. Altogether, this technology needs improvement, but will ultimately enhance our grasp of primary production, biogeochemical cycles, and the broader scope of studies related to inland water quality [21,22].

In recent decades, advancements in satellite remote sensing and artificial intelligence have revolutionized the monitoring of lake Chl-a. This progress has significantly enhanced our understanding of water quality dynamics through large-scale and continuous observations [23-25]. However, available satellite sensors and high-precision algorithms to retrieve water quality data are still being explored. In previous studies [26-30], various satellite sensors have been used for remote monitoring of Chl-a in inland water, including the Medium Resolution Imaging Spectrometer (MERIS), the Moderate Resolution Imaging Spectroradiometer (MODIS), and other medium- to high-resolution land resource satellites. Among other sensors, the United States’ Landsat Operational Land Imager (OLI) has been utilized in this context, offering images with a spatial resolution of 30 meters and a revisit interval of 16 days. In contrast, the Sentinel Multispectral Imager (MSI) from the European Space Agency provides images with resolutions between 10 and 60 meters and a revisit interval of 2-5 days. In comparison, the Chinese HJ1-A/B satellite carries a wide-coverage multispectral charge-coupled device (CCD) camera [31], which provides the same spatial resolution as Landsat OLI, but HJ1-A and HJ1-B are networked to give a revisit interval of only two days. Feasibly, this satellite could fill spatio-temporal gaps left by previous studies. Therefore, finding a suitable Chl-a algorithm and using the HJ-1A/B CCD satellite to observe inland lakes over certain time periods and regional scales is a promising application to monitor aquatic health in troubled lake regions.

Over several decades, a large number of algorithms have been developed to estimate Chl-a [26,27,32-34]. Among these, the most commonly employed are empirical algorithms that establish correlations between reflectance and in situ measurements. In particular, Gurlin et al. (2011) [27] and Gitelson et al. (2011) [35] developed empirical two-band and three-band algorithms using red and near-infrared bands, respectively, which successfully predicted Chl-a in turbid waters. However, empirical algorithms are usually less generalizable due to geographical limitations that cause spatial heterogeneity among different lakes, as well as gradients between different sampling sites [36]. In comparison, semi-analytical algorithms are based on radiative transfer theory (e.g., the quasi-analytical algorithm (QAA) [37]) and inversion from inherent optical properties (IOPs) [15,38], and have performed well in estimating Chl-a in regional lakes, such as Taihu Lake [39] and a tropical reservoir [40]. Still, key parameters of the semi-analytical algorithms, including the specific absorption coefficients of algal versus non-algal particles, need to be re-parameterized for different lakes, since they rely heavily on atmospheric correction algorithms and water quality background values [41]. Ultimately, this can introduce more uncertainty into both the application and feasibility of semi-analytical algorithms, especially when geographic spatial heterogeneity is considered [18,36]. It is particularly important to highlight the significance of employing universal and generalizable algorithms for large-scale observations. Such algorithms play a crucial role in reducing the adverse impacts associated with quantifying lake Chl-a concentrations across trophic states.

Since the start of the 21st century, artificial intelligence (AI) has driven great scientific progress in remotely quantifying inland water quality [24,42,43]. Specifically, machine learning algorithms can automatically discover patterns and rules by analyzing and processing input feature variables [44], adjusting parameters and weights to improve performance and achieve accurate predictions. Importantly, they can also decrease atmospheric correction errors and provide generalized, yet optimized, Chl-a models that consider complex non-linear relationships in multiple dimensions (e.g., geospatial heterogeneity) [29,32]. At present, many studies have successfully predicted Chl-a using Landsat and Sentinel series satellite images combined with machine learning algorithms, including the Light Gradient Boosting Machine (LGBM) [45], Support Vector Machine (SVM) [29], Gradient Boosting Decision Trees (GBDT) [32], and mixture density networks (MDN) [43], within certain geographical regions. Nevertheless, there are very few studies combining the HJ1-A/B satellite with machine learning algorithms to derive Chl-a in lakes in Northeast China.

In this study, we address existing knowledge gaps in using HJ1-A/B imagery and machine learning algorithms for Chl-a retrieval, and aim to optimize and enhance model performance. Here, we developed machine learning-based Chl-a algorithms and assessed their effectiveness with in situ measured Chl-a data from typical lakes in Northeastern China. Our research objectives include: (1) investigate the correlation between HJ1-A/B CCD in situ measured Chl-a and reflectance; (2) calibrate and validate these algorithms for Chl-a estimation; and (3) map Chl-a concentrations in typical lakes to analyze their distribution patterns and trends. Ultimately, our study improves upon machine learning technology for remote Chl-a measurements with new, more stable algorithms that will aid in on-going efforts to monitor and mitigate lake eutrophication in China.

2. Materials and methods

2.1 Overview of the research area

Northeast China is the country's largest commodity grain base, where agricultural output accounts for about a quarter of the national total. The area is known for its semi-humid monsoon climate with four distinct seasons. Annual temperatures average between 2-6°C, while average yearly precipitation varies from 350-700 mm [46]. Lakes in the Northeast are crucial for water supply and agricultural irrigation, contributing significantly to ecological stability and the production of fishery products. Three typical large lakes were selected to derive Chl-a (Fig. 1). Among them is Xingkai Lake (XKH, 131°58′-133°07′E, 45°01′-45°34′N), situated in the east of Northeast China at the border between China and Russia. Chagan Lake (CGH, 124°03′-124°34′E, 45°05′-45°30′N) is situated in the central region of Northeast China in the Nengjiang River watershed. Hulun Lake (HLH, 117°00′-117°41′E, 48°30′-49°20′N) is situated in the north of Northeast China on the Hulun-Buir Plateau and borders Mongolia to the north.


Fig. 1. Location of lakes and their collected samples in Northeast China: (a) Hulun lake (HLH), (b) Xingkai lake (XKH) and (c) Chagan lake (CGH).


HLH covers 2037.3 km² of aquatic surface and has a mean depth of 5.7 meters. It holds a storage capacity of roughly 13.85 × 10⁸ m³ and experiences a freezing period in winter that lasts between 170-180 days. HLH’s extensive water surface contributes to the spatial variability of its water environment index [47]. Likewise, XKH’s capacity is about 20.8 × 10⁸ m³, and it has an ice period of about 150 days. It has 20 main inflow rivers and 3 outflow rivers, and a water retention time of 8.8 years [34]. In addition, CGH covers 350 km² of aquatic surface and has a mean depth of 2.52 meters with a maximum storage capacity of 5.98 × 10⁸ m³. CGH's shoreline extends to 128 km, and the lake is frozen from October to May. The average yearly runoff is 451 mm and the average annual evaporation is 1206 mm [48].

2.2 Measured data and laboratory analysis

We performed 16 field campaigns in these lakes during 2012-2021, collecting a total of 329 samples to analyze spatial characteristics as well as bio-optical and radiometric changes. Considering the unique climatic conditions of the Northeast, the samples were collected during the non-ice-covered period (April to October). Sample collection was performed at the surface, around 0.5 meters deep. Samples were preserved in 1-liter amber HDPE (high-density polyethylene) bottles, which were acid-washed and pre-rinsed with field samples before use. Once collected, the bottled samples were promptly placed in a portable 4°C refrigerator powered by the field vehicle. Samples were filtered within 1 day and sent to the laboratory within 2 days for further analysis. In the laboratory, the material was treated with 90% acetone and kept at 4°C for 24 hours in darkness. Chl-a (μg/L) was measured using a UV-2660 PC spectrophotometer (Shimadzu, Kyoto, Japan) at four specific wavelengths, following the SCOR-UNESCO methodology [29].

2.3 Satellite data acquisition and processing

Images from the HJ1-A/B CCD were obtained via the China Land Observing Satellite Service Center [49]. HJ1-A/B is a quasi-sun-synchronous orbit satellite with a spatial resolution of 30 m for CCD images, and HJ1-A and HJ1-B are networked to give a revisit interval of only 2 days (Table S1). A total of 97 cloud-free images covering the 3 lakes were selected and matched to field sampling dates within a time window of ±10 days. Of these, 18 images served to calibrate and validate our model, and 79 images were used to map Chl-a. It is worth noting that at least two images were selected each year, taking into account the actual timing of image capture and cloud conditions, particularly during ice-free periods from April to November. Atmospheric corrections are critical for estimating water quality parameters from optical satellite imagery. For the CCD imagery, we performed radiometric calibration in ENVI 5.3 (Harris Geospatial Solutions, USA) and applied the FLAASH atmospheric correction processor [50], which enables better retrieval of reflectance from the images. To ensure the uniformity of the sampling distribution, the averaged reflectance of 3 × 3 pixels around each sampling point was extracted [51]. In addition, lake masks were calculated using a normalized difference water index (NDWI) [52]. Figure 2 illustrates the main process of estimating Chl-a concentration using machine learning algorithms in this study.
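The NDWI masking and 3 × 3 pixel averaging described above can be sketched as follows. This is a minimal illustration on a toy reflectance array, not the authors' processing chain; the band values and window location are invented for demonstration.

```python
import numpy as np

# Toy reflectance "image" (rows, cols) for the green and NIR bands;
# real values would come from FLAASH-corrected HJ1-A/B CCD imagery.
green = np.array([[0.10, 0.11, 0.12],
                  [0.10, 0.11, 0.12],
                  [0.10, 0.11, 0.12]])
nir   = np.array([[0.04, 0.04, 0.05],
                  [0.04, 0.04, 0.05],
                  [0.04, 0.04, 0.05]])

# NDWI water mask: water pixels have NDWI > 0
ndwi = (green - nir) / (green + nir)
water_mask = ndwi > 0

def window_mean(band, row, col, half=1):
    """Averaged reflectance of the (2*half+1)^2 window centred on a point."""
    patch = band[row - half:row + half + 1, col - half:col + half + 1]
    return float(patch.mean())

# Averaged green-band reflectance of the 3x3 window around a sampling point
rrs_green = window_mean(green, 1, 1)
```

In practice the windowed mean would be computed at each matched sampling coordinate and only over pixels inside the water mask.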


Fig. 2. Schematic block diagram illustrates the main process of machine learning algorithms to estimate Chl-a concentration.


2.4 Machine learning algorithms

This research involved the selection of 7 notable machine learning algorithms, namely XGBoost (XGB), Random Forest (RF), Support Vector Regression (SVR), Gradient Boosting Decision Trees (GBDT), Kernel Ridge Regression (KRR), K-Nearest Neighbor (KNN), and Multi-Layer Perceptron Network (MLP). To enhance the models' accuracy and generalizability, the 329 field-measured samples were randomly split into 2 sets using Anaconda (Python 3.7): 1) a calibration dataset (N = 220) for training each model's parameters and weights, and 2) a validation dataset (N = 109) to assess each model's performance on new, unseen data. The calibration dataset is used throughout the model building process, while the validation dataset remains independent of it. The purpose is not only to capture whether the model is properly trained for generalization, but also to monitor for overfitting. All operations, including prediction, hyper-parameter optimization, and performance evaluation, were implemented in Anaconda (Python 3.7).
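The 220/109 random split can be reproduced with scikit-learn's `train_test_split`. The feature matrix below is synthetic stand-in data (the paper's actual inputs are CCD bands and band combinations); the random seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix: 329 samples x 9 input variables
# (e.g. 4 CCD bands + 5 band combinations), with Chl-a as the target.
rng = np.random.default_rng(0)
X = rng.random((329, 9))
y = rng.random(329) * 50  # Chl-a in ug/L (synthetic)

# Random split into calibration (N = 220) and validation (N = 109) sets
X_cal, X_val, y_cal, y_val = train_test_split(
    X, y, train_size=220, test_size=109, random_state=42)
```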

2.4.1 Support vector regression

Support vector machine (SVM) aims to solve prediction problems by projecting datasets into a high-dimensional feature space through non-linear mapping. Support vector regression (SVR) is an extended form of this approach that solves problems with finite, high-dimensional, and non-linear data. It also has strong generalization capability and can mitigate the curse of dimensionality [53]. Details of the SVR can be found in Supplement 1 (Eq. S1-S4).
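A minimal RBF-kernel SVR sketch, assuming synthetic data in place of the measured band features; the `C` and `gamma` values echo the magnitudes reported later in Section 3.3.1 but are otherwise illustrative.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic non-linear relationship standing in for band ratios vs Chl-a
rng = np.random.default_rng(1)
X = rng.random((200, 3))
y = 10 * X[:, 0] ** 2 + 5 * X[:, 1] + rng.normal(0, 0.1, 200)

# RBF-kernel SVR; standardizing features first is standard practice for SVR
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100, gamma=1.5))
model.fit(X, y)
r2 = model.score(X, y)  # coefficient of determination on the training data
```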

2.4.2 Kernel ridge regression

Kernel ridge regression (KRR) enhances classical linear ridge regression (Eq. S5) by incorporating a kernel function. This function enables a linear regression to be fitted in a high-dimensional space, corresponding to a non-linear regression in the original input space. Because the regression function is constructed in the feature space, non-linear problems would ordinarily involve a large number of parameters, since the computation is performed in the high-dimensional space; the introduction of the kernel function allows these problems to be solved efficiently. This kernel approach yields a series of excellent regression models [54,55]. Predictions of KRR on the sample data are described in Eq. S6.
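A short KRR sketch with scikit-learn, on synthetic data; `alpha` (the ridge penalty) and `gamma` (the RBF kernel width) are illustrative values, not the study's tuned ones.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Synthetic non-linear target standing in for Chl-a
rng = np.random.default_rng(2)
X = rng.random((150, 3))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1]

# The RBF kernel turns linear ridge regression into a non-linear regressor
krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.5)
krr.fit(X, y)
pred = krr.predict(X)
```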

2.4.3 Multi-layer perception network

The multilayer perceptron (MLP) is one of the most popular network architectures, with applications in both classification and regression, and is a typical representative of feedforward artificial neural networks [56]. The architecture of an MLP network includes 3 key components: 1) an input layer, 2) one or more hidden layers, and 3) an output layer. Each neuron in the hidden and output layers utilizes a non-linear activation function, which distinguishes MLP networks from linear regression models and enables their application to non-linear data. More details about the MLP can be found in Supplement 1 (Eq. S7-S8).
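The single-hidden-layer structure can be sketched with scikit-learn's `MLPRegressor`. The hidden layer size here is illustrative (the study's tuned network used 1500 neurons); the data are synthetic.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic linear-ish target standing in for Chl-a
rng = np.random.default_rng(3)
X = rng.random((200, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5])

# One hidden layer with non-linear (ReLU) activations and the Adam optimizer,
# echoing the configuration used in this study (sizes are illustrative)
mlp = MLPRegressor(hidden_layer_sizes=(50,), activation="relu",
                   solver="adam", batch_size=32, max_iter=2000,
                   random_state=0)
mlp.fit(X, y)
```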

2.4.4 K-nearest neighbor

K-nearest neighbors (KNN) is a simple but very effective non-parametric algorithm for classification and regression tasks. Its decision surface is non-linear with an integer parameter K, and the expected quality of the prediction automatically increases with the amount of training data [57]. In the KNN algorithm, the prediction for each test sample is derived from the weighted average of the response variable of the nearest K samples in the training set. This average then serves as the predicted outcome for the input variables. Common distance metrics employed in KNN include the Euclidean, Manhattan, and Minkowski distances; together with the choice of K, they significantly influence the algorithm's performance and efficiency.
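The averaging step can be verified by hand on a tiny example. With K = 2, Manhattan distance, and uniform weights, the prediction for a query is simply the mean of the two nearest training responses (the data below are invented for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Five training samples with a single feature and known responses
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# K = 2 with Manhattan distance (p=1), uniform weights: the prediction
# is the plain average of the two nearest neighbours' responses
knn = KNeighborsRegressor(n_neighbors=2, p=1, weights="uniform")
knn.fit(X_train, y_train)
pred = knn.predict([[2.4]])  # nearest neighbours are 2.0 and 3.0
```

Here the two nearest training points have responses 20 and 30, so the prediction is their average, 25.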

2.4.5 Tree-based regressions

Random Forest (RF) is an Ensemble Learning (EL) method based on decision trees that uses a Bagging strategy and introduces randomness into the construction of the trees. Multiple equal-sized sub-training sets are randomly drawn with replacement from the original training set, and each sub-training set is used to construct a decision tree independently. For a regression problem, each decision tree in the RF makes a prediction for the sample, and these predictions are averaged to obtain the final result. This averaging operation helps reduce the variance of the predictions and improves the stability and generalization of the model [58]. Assuming that the RF contains K decision trees, the mean value is taken to represent the final prediction:

$$H(x) = \frac{1}{K}\sum\nolimits_{k = 1}^K {{h_k}(x)}$$

Unlike RF, the gradient boosting decision tree (GBDT) uses a Boosting ensemble learning method, an iterative approach that serially trains a series of weak learners (decision trees). Each tree then tries to correct the errors of the previous one in order to gradually improve the overall model performance [59]. GBDT integrates gradients, a boosting algorithm, and regression trees: it initializes the estimate for all samples as a constant value that minimizes the loss function over the original dataset, after which each iteration brings the predicted values closer to the true values [60]. XGBoost (XGB) can be considered an extension or improvement of GBDT. XGB builds on the original GBDT algorithm by adding a regularization term to reduce complexity and prevent overfitting, and a second-order Taylor expansion of the objective function to make model optimization more accurate. In addition, XGB has been implemented with a number of optimizations, including support for parallel computing and memory optimization, which improve training speed and model performance [61]. XGB has also been shown to be superior in many areas of research, both in competitions and in previous studies [62]. Assuming that there are K decision trees, the final prediction for a sample can be defined as:

$${\hat{y}_i} = \sum\nolimits_{k = 1}^K {{f_k}({x_i})} ,\quad {f_k} \in F$$
where $\hat{y}_i$ is the predicted value of the XGBoost algorithm, K is the number of decision trees, F is the set of all regression decision trees, and $f_k(x_i)$ is the predicted value of the k-th decision tree.
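The RF averaging described above can be checked directly: a fitted forest's prediction equals the mean of its individual trees' predictions. This sketch uses synthetic data and illustrative hyper-parameters, not the study's tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem standing in for Chl-a retrieval
rng = np.random.default_rng(4)
X = rng.random((100, 3))
y = 3 * X[:, 0] + rng.normal(0, 0.05, 100)

rf = RandomForestRegressor(n_estimators=50, max_depth=8, random_state=0)
rf.fit(X, y)

# H(x) = (1/K) * sum_k h_k(x): forest prediction is the average over trees
x0 = X[:1]
forest_pred = rf.predict(x0)[0]
tree_mean = np.mean([tree.predict(x0)[0] for tree in rf.estimators_])
```

In boosting (GBDT/XGB) the trees are instead summed sequentially, each fitted to the residual errors of the ensemble so far.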

2.4.6 Ensemble learning

Ensemble learning (EL) is a machine learning strategy that applies multiple learners to a dataset, solving the same problem by integrating multiple predictions into a composite prediction [63]. Techniques like bagging, boosting, voting, and stacking are popular ensemble learning methods. Among them, voting offers simple model integration, determining the final prediction through voting or weighted voting over several base models [64]. Stacking, on the other hand, cascades predictions from different base models and combines them via a meta-model, such as the ridge regression used in our study [55,65,66].
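A stacking sketch with a ridge meta-model, mirroring the ridge-based stacking mentioned above. The base models, their settings, and the data are illustrative, not the study's configuration.

```python
import numpy as np
from sklearn.ensemble import (StackingRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.linear_model import Ridge

# Synthetic non-linear target standing in for Chl-a
rng = np.random.default_rng(5)
X = rng.random((150, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] ** 2

# Stacking: cross-validated base-model predictions become the inputs to
# a ridge meta-model, which learns how to combine them
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
                ("gbdt", GradientBoostingRegressor(random_state=0))],
    final_estimator=Ridge(alpha=1.0))
stack.fit(X, y)
r2 = stack.score(X, y)
```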

2.5 Hyper-parameters optimization

Before training a model in machine learning, it is typically necessary to set parameters that the machine cannot learn directly from the training data, called hyper-parameters. Unlike the model weights (parameters), hyper-parameters are determined empirically, experimentally, and through tuning. They control the model's performance and learning process, and can influence the complexity, capacity, and optimization of training. Common hyper-parameters include the learning rate, regularization factor, maximum depth of a decision tree, hidden layer size of a neural network, etc. Selecting appropriate hyper-parameters is vital for a model's performance and its ability to generalize. In experiments where the hyper-parameters of the model are tuned using the calibration dataset, different combinations of hyper-parameters produce different results. Various optimization methods exist for hyper-parameter tuning in machine learning, such as Bayesian optimization, random search, and grid search, each with its unique features [67]. In this research, we employed the grid search cross-validation technique. This method sets up a multi-dimensional hyper-parameter grid, where each axis represents a hyper-parameter and each point a combination thereof [68]. We iteratively identified the optimal combination based on the cross-validated R2 score, following a search strategy that started broad and then narrowed [46]. This process led us to determine the best hyper-parameter set for each machine learning model. The hyper-parameters for the MLP, RF, KNN, SVR, GBDT, and KRR algorithms were sourced from the Scikit-Learn package in Anaconda's Python 3.7, while the XGB parameters were taken from the XGBoost package.
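The grid search with cross-validation can be sketched with scikit-learn's `GridSearchCV`. The grid below is a deliberately coarse, illustrative version of the three tree-ensemble hyper-parameters tuned in this study; the data are synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression problem standing in for Chl-a retrieval
rng = np.random.default_rng(6)
X = rng.random((200, 4))
y = 4 * X[:, 0] + X[:, 1]

# Coarse grid over the number of trees, learning rate, and maximum depth;
# 5-fold cross-validated R2 selects the best combination
grid = {"n_estimators": [100, 200],
        "learning_rate": [0.02, 0.1],
        "max_depth": [2, 4]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      grid, cv=5, scoring="r2")
search.fit(X, y)
best = search.best_params_
```

In practice the search would start with a broad grid and then be refined around the best-scoring region, as described above.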

For the SVR algorithm, the penalty coefficient (C) controls the degree of tolerance to error, while gamma controls the range of influence of the kernel function; these two important hyperparameters have a great influence on the accuracy of the SVR. The KRR uses the same kernel function method as the SVR, and therefore its hyperparameter tuning is somewhat similar. The multilayer perceptron consists of 3 parts: 1) the input, 2) hidden, and 3) output layers, each with its own number and size of neurons. Here, the batch size determines the number of samples used to calculate the gradient each time the weights are updated, and the optimizer determines how the model's weights are recalculated to minimize the loss function [69]. The K-value and the distance metric weights also serve as hyperparameters in KNN, and different distance metrics effectively change the model accuracy for different classification or regression problems [57]. The three tree-based ensemble models, RF, GBDT, and XGB, have similar hyperparameters; for example, the number of base decision trees and the maximum depth are shared, while the learning rate affects both the GBDT and XGB algorithms [61].

2.6 Machine learning model performance evaluation

2.6.1 Cross-validation method for the developed Chl-a algorithm

Cross-validation is a key statistical technique for assessing machine learning model performance. It involves dividing the calibration dataset into several training and testing subsets and repeatedly training and testing the model on these partitions to gauge its performance across different data segments [70]. In K-fold cross-validation, the calibration dataset is split into K equal segments [71]. Each fold is used in turn as the test set, with the other K-1 folds combined for training. The model’s performance on the test set is evaluated using indicators like R2, MAE, and RMSE (definitions provided in the supporting material). This process rotates through each fold as the test set to ensure each one is used for validation exactly once. The average of the results from all K folds provides the overall performance metrics of the model.

As seen above, the calibration subset trains the model, while the validation subset confirms its accuracy. This approach diminishes reliance on a single division by segmenting the dataset into multiple folds for repeated validations. The randomness and inhomogeneity of the selected subsets for cross-validation avoid the subjective overtones associated with artificially divided datasets and provide a more comprehensive assessment of the models' generalization ability. This study uses K-fold cross-validation with K = 5. The developed regression models are discussed in the following sections.
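The 5-fold scheme can be sketched as follows; the sample count matches the calibration set size used here, but the features and model settings are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Synthetic calibration set of the same size as used in this study (N = 220)
rng = np.random.default_rng(7)
X = rng.random((220, 4))
y = 3 * X[:, 0] + 2 * X[:, 1]

# 5-fold CV: each fold serves as the test set exactly once; the averaged
# R2 across folds summarises model performance
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X, y, cv=cv, scoring="r2")
mean_r2 = scores.mean()
```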

2.6.2 Shapley additive explanations method for input variables of Chl-a algorithm

Shapley additive explanations (SHAP) analysis quantifies the impact of individual features on model predictions through the computation of their Shapley values. Derived from game theory, these values evaluate the impact of each collaborator in a cooperative setting on the game's result [72]. In this study, the bands and their band combination features can be considered as input bands, and the performances of developed models for Chl-a can be considered as outputs. The SHAP analysis was implemented in Anaconda Python 3.7 using the SHAP package.
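To make the game-theoretic idea concrete, the sketch below computes exact Shapley values for a toy linear "model" by enumerating feature coalitions, replacing absent features with their dataset means. This hand-rolled calculation only illustrates the principle behind the SHAP package used in this study; the weights and data are invented, and the SHAP library itself uses far more efficient estimators.

```python
import itertools
import math
import numpy as np

# Toy linear model standing in for a trained Chl-a regressor
w = np.array([2.0, -1.0, 0.5])
model = lambda x: float(x @ w)

X_bg = np.array([[0.0, 0.0, 0.0], [2.0, 4.0, 6.0]])  # background data
means = X_bg.mean(axis=0)                            # per-feature baseline
x = np.array([3.0, 1.0, 5.0])                        # point to explain

def value(coalition):
    """Model output when only features in `coalition` take their real values."""
    z = means.copy()
    for i in coalition:
        z[i] = x[i]
    return model(z)

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for r in range(n):
        for S in itertools.combinations(others, r):
            # Shapley weight for a coalition of size r
            weight = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
            phi[i] += weight * (value(S + (i,)) - value(S))
```

For a linear model the Shapley value of feature i reduces to w_i (x_i − mean_i), and the values sum to f(x) − f(baseline), which the test below confirms.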

2.6.3 Z-score

The Z-score, a prevalent method for detecting outliers, assesses how many standard deviations a particular data point is from the feature's mean. This calculation helps determine whether the point deviates significantly from the average, thereby identifying outliers [73]. When a data point exceeds a certain threshold [74,75], it can be identified as an outlier; in a normal distribution, about 95% of data points have a Z-score within ±2 and about 99.7% within ±3 [74]. In this study, a Z-score threshold of 2 was chosen, and 98% of the model's predicted values conformed to a normal distribution [76]. Typically, a smaller standard deviation bandwidth and fewer outliers signify better robustness of the model's predictions [76].
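Z-score outlier flagging at the threshold of 2 used here reduces to a few lines of NumPy; the prediction values below are invented, with one obvious outlier planted in the series.

```python
import numpy as np

# Hypothetical model predictions (ug/L) with one planted outlier (30.0)
pred = np.array([10.2, 11.0, 9.8, 10.5, 30.0, 10.1, 9.9])

# Z-score: distance from the mean in units of the standard deviation
z = (pred - pred.mean()) / pred.std()

# Threshold = 2, as in this study
outliers = np.abs(z) > 2
```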

3. Results

3.1 In situ measured lake Chl-a concentration

The Chl-a of the collected lake samples varied from 0.06 to 49.66 μg/L, with a mean value of 11.39 μg/L and a standard deviation of 10.20 μg/L (Table 1). Among the lakes, XKH (5.75 μg/L) and HLH (5.05 μg/L) presented lower averaged values than CGH (20.50 μg/L). For these in situ Chl-a measurements, we found the highest levels in September-October rather than June-August for all three lakes. This is in general agreement with previous studies showing that the highest Chl-a levels occur in autumn, particularly in the cold Northeast [20,34]. The observed phenomenon could result from the prolonged influence of nutrients originating from the watershed: nutrient accumulation in agricultural fields persists, mobilizing gradually over extended periods through soil leaching and runoff contributions. In lakes, transient phytoplankton blooms and elevated Chl-a levels in the water column often precede winter [77].


Table 1. Descriptive statistics of chlorophyll a (Chl-a) concentration (μg/L) in the three lakes

3.2 Imagery reflectance characteristics and relationships with chlorophyll-a

The surface reflectance in the HJ1-A/B CCD images was converted to remote sensing reflectance using the band math tool in ENVI. Figure 3 shows the reflectance values at the corresponding sampling points for the three lakes. Due to the high variation of water quality composition within the lakes, the Rrs(λ) shows great variation. Usually, the 400-500 nm range shows strong absorption, especially at high chlorophyll concentrations. Meanwhile, the reflectance peaks in the 550-570 nm spectral band are due to weak absorption by increased pigments, such as chlorophyll and carotenoids, as well as the scattering effect of the cells [78].


Fig. 3. Spectral reflectance of lake samples collected in HJ1-A/B CCD imagery, (a) XKH, (b) CGH and (c) HLH.


In turbid water, the reflection peaks migrate toward longer wavelengths as the TSM concentration increases in the visible and near-infrared range. The TSM concentrations of the three lakes are shown in Table S2, with XKH having the highest. Additionally, the spectral shape, with the reflectance peak migrating toward the vicinity of 660 nm, is consistent with that of Xu et al. (2022) [34]. Further, we used the non-parametric Spearman correlation to identify relationships between band combinations and in situ measured Chl-a (Table 2).
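The Spearman screening step can be sketched as below. The "band ratio" and Chl-a values are synthetic stand-ins generated to be monotonically related, purely to show the mechanics of `scipy.stats.spearmanr`.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic in situ Chl-a and a hypothetical red/NIR-style band ratio
# constructed to co-vary with it (illustrative only)
rng = np.random.default_rng(8)
chla = rng.random(50) * 40
band_ratio = 0.02 * chla + rng.normal(0, 0.05, 50)

# Non-parametric (rank-based) Spearman correlation and its p-value
rho, p = spearmanr(band_ratio, chla)
```

In the study, combinations with |rho| > 0.7 against the measured Chl-a were retained as model input variables (Section 3.3.2).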


Table 2. Spearman correlation between Chl-a and band combinations (spectral reflectance variables)a

3.3 Chl-a machine learning algorithm modeling process

3.3.1 Hyper-parameter optimization for Chl-a algorithms

Table 3 presents the results of our hyper-parameter optimization for the machine learning algorithms, achieved through a cross-validated grid search. XGB and GBDT are both tree-based boosting ensemble learning algorithms, and XGB can be considered a modification of GBDT to some extent [61]; accordingly, the same search ranges were used for the three shared hyper-parameters of both models. The best hyper-parameter combination for GBDT was 200 base decision trees (ne), a learning rate (Lr) of 0.02, and a maximum depth (MD) of 4 (R2 = 0.90, RMSE = 3.21 µg/L). Likewise, the best combination for XGB (R2 = 0.90, RMSE = 3.11 µg/L) was obtained with ne = 150, but with Lr raised to 0.03. For RF, the best combination (ne = 200, MD = 8, Min_samples_leaf = 1, Max_features = 6) also achieved high accuracy (R2 = 0.90, RMSE = 3.39 µg/L).
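A cross-validated grid search over the three shared hyper-parameters can be sketched with scikit-learn's `GridSearchCV`; here `GradientBoostingRegressor` stands in for GBDT/XGB, and the grid values and synthetic data are illustrative, not the paper's exact search ranges:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(0, 0.2, (120, 4))                       # 4 reflectance bands
y = 40 * X[:, 2] / (X[:, 2] + X[:, 3]) + rng.normal(0, 1, 120)

# Grid over the three hyper-parameters discussed in the text
param_grid = {
    "n_estimators": [100, 150, 200],   # ne
    "learning_rate": [0.02, 0.03, 0.05],  # Lr
    "max_depth": [3, 4, 5],            # MD
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

`best_params_` plays the role of the "optimal hyper-parameter combination" reported in Table 3.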


Table 3. Hyper-parametric optimization results using cross-validated grid search

In addition, for both KRR and SVR, the radial basis function (RBF) was chosen as the kernel, with kernel coefficients (gamma), which control the boundary range of the kernel function, of 1.5 and 2, respectively. The best KRR combination achieved R2 = 0.90 and RMSE = 3.16 µg/L, while adding a regularization coefficient of C = 100 produced the best combination for SVR (R2 = 0.88, RMSE = 3.44 µg/L). The KNN algorithm performed best with two nearest neighbors and the Manhattan distance metric (R2 = 0.80, RMSE = 4.64 µg/L). The MLP used one hidden layer with 1500 neurons, the adaptive moment estimation (Adam) optimizer, and a batch size of 32, giving its optimal hyper-parameter combination (R2 = 0.85, RMSE = 3.92 µg/L).

3.3.2 Chl-a algorithms calibration and validation

To ensure that the patterns and regularities of the input variables could be adequately learned, five band combinations with correlation coefficients above 0.7 with the in situ measured Chl-a were chosen as input variables. Moreover, to explore whether correlation should be the only basis for selecting input variables for machine learning algorithms, the four original bands were also used as input feature variables. The performance evaluation of our developed Chl-a algorithms is shown in Table 4 and Fig. 4. For the calibration dataset, all Chl-a algorithms performed well, with R2 values above 0.9. Among the tree models, GBDT performed best, with an R2 of 0.98 and MAE and RMSE of 1.34 µg/L and 1.67 µg/L, respectively. In comparison, KNN was relatively inferior to the other Chl-a algorithms. For the validation dataset, the three tree models outperformed the other models, and the XGB model achieved the best accuracy, with R2, MAE, and RMSE of 0.90, 2.38 µg/L, and 3.11 µg/L, respectively. SVR (MAE = 2.85 µg/L, RMSE = 3.44 µg/L) and MLP (MAE = 2.85 µg/L, RMSE = 3.82 µg/L) also maintained R2 above 0.85. In contrast, KNN had the least satisfactory validation accuracy (R2 = 0.80, MAE = 3.13 µg/L, RMSE = 4.64 µg/L).
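The three evaluation metrics used throughout (R2, MAE, RMSE) can be computed with scikit-learn; the measured/estimated Chl-a values below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Toy estimated vs. measured Chl-a (ug/L); values are illustrative only
measured  = np.array([2.1, 5.3, 9.8, 14.2, 20.5, 33.0, 48.7])
estimated = np.array([2.5, 4.9, 10.6, 13.1, 22.0, 31.4, 46.9])

r2   = r2_score(measured, estimated)
mae  = mean_absolute_error(measured, estimated)
rmse = float(np.sqrt(mean_squared_error(measured, estimated)))
print(f"R2={r2:.2f} MAE={mae:.2f} ug/L RMSE={rmse:.2f} ug/L")
```

Note that RMSE is always at least as large as MAE, which is why the paper's RMSE values consistently exceed the corresponding MAE values.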


Fig. 4. Relationships between estimated and measured Chl-a values using different machine learning algorithms: (a-b) GBDT, (c-d) XGB, (e-f) RF, (g-h) KRR, (i-j) SVR, (k-l) KNN, and (m-n) MLP, respectively.



Table 4. Performance evaluation for the developed chlorophyll a algorithm

3.4 Performance comparison of Chl-a algorithms

3.4.1 Importance of input variables

We also used SHAP to assess the importance of different features for model prediction, in order to better explain the input feature variables selected for the machine learning algorithms and to understand the reasons for each model's prediction performance [79]. For the GBDT model (Fig. 5(a)), the top three features in importance are (B4-B2)×(B1+B1)+(B3×B3), B3, and B4, with 2×B3+B4-B2 making the least contribution; most of its eigenvalue points have no impact on the model's prediction. In contrast, in the RF model (Fig. 5(c)), 2×B3+B4-B2 is a relatively important feature. In the XGB model (Fig. 5(b)), (B4-B2)×(B1+B1)+(B3×B3), B3, and B4 are again the three most valuable features, but notably, almost all points of (B1×B1)×(B3-B2) are unhelpful for prediction, a major distinction from the other models, where this feature plays a visible role. Additionally, the KRR and SVR models (Fig. 5(d),(e)) rank feature importance nearly identically, with slight variation for 2×B3+B4-B2, and B3 is the least important. For the KNN model (Fig. 5(f)), B4 is the most important feature and B3 the least. Uniquely, in the MLP model (Fig. 5(g)), the (B3+B3)×(B2-B3)-(B4×B2) feature is the most significant, whereas B2, which is useful in many models, shows no response.
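SHAP itself requires the `shap` package; a model-agnostic stand-in for ranking feature importance, permutation importance, can be sketched with scikit-learn alone. This is not the SHAP procedure used in the paper, just a related diagnostic on synthetic data where band B3 is constructed to dominate:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.uniform(0, 0.2, (150, 4))                       # bands B1-B4
y = 60 * X[:, 2] - 30 * X[:, 1] + rng.normal(0, 0.5, 150)  # B3 dominates

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
# Importance = drop in score when a feature's values are shuffled
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, imp in zip(["B1", "B2", "B3", "B4"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

SHAP additionally attributes each individual prediction to the features (the per-point Shapley values plotted in Fig. 5), which permutation importance does not provide.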


Fig. 5. SHAP results with feature importance for different machine learning models. Each row represents a feature, and the horizontal coordinate denotes the Shapley values. Features are sorted from top to bottom by the average absolute magnitude of their Shapley values. Colored points correspond to eigenvalue points of the input (reflectance of a band or band combination), where red indicates a larger value and blue a smaller one. Points to the right of the 0 line signal a positive contribution to the model's prediction, those to the left a negative contribution, and those on the 0 line no effect.


3.4.2 Model response to sample size

Figure 6 depicts the responses of the machine learning models to different sample sizes. The training models achieved better R2 scores with larger samples and performed consistently well as sample sizes increased. GBDT, XGB, and SVR fitted well, unlike KNN. Further, cross-validation was employed to assess model performance on unseen datasets (Fig. 6). Most models (XGB, GBDT, RF, KNN, SVR, MLP) struggled to achieve good validation accuracy (R2 < 0) with fewer samples (N < 75), indicating severe overfitting. The exception was the KRR model, which controlled overfitting well even with limited training sample sizes. As the training sample size increased (N > 100), the overfitting issues declined, and the training and validation curves for KRR and XGB converged swiftly, while the MLP still performed weakly. Furthermore, the curves of all the models converged and stabilized as sample sizes continued to increase, signifying proper generalization.
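Curves of this kind can be generated with scikit-learn's `learning_curve`, which cross-validates the model at several training-set sizes. KRR is used here as one example model; the data and gamma value are illustrative only:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(3)
X = rng.uniform(0, 0.2, (200, 4))
y = 50 * X[:, 2] / (X[:, 2] + X[:, 3]) + rng.normal(0, 1, 200)

# Train/validation R2 at 5 training-set sizes, each with 5-fold CV
sizes, train_scores, val_scores = learning_curve(
    KernelRidge(kernel="rbf", gamma=1.5), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="r2")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"N={n:3d} train R2={tr:.2f} val R2={va:.2f}")
```

A widening gap between the training and validation curves at small N is the overfitting signature described above; convergence of the two curves signals proper generalization.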


Fig. 6. Cross-validation learning curves for model response to training examples. The horizontal and vertical coordinates represent the number of training examples and the R2 scores. Lines of the same color represent the same model: circular points mark the training curve and square points the cross-validation curve. The shaded area indicates the standard deviation interval.


3.4.3 Model sensitivity to noise

Although we pretreated all the collected samples, real-world measurement of Chl-a is complex, and lab processing can introduce random errors. In this study, we therefore infused 5% noise into the input feature variables of the machine learning models and conducted cross-validation to assess model stability against noise and to emulate a real environment (Table 5). Performance metrics were averaged over five cross-validation folds. Encouragingly, the models optimized via hyper-parameter tuning maintained solid performance (R2 > 0.7) under 5-fold cross-validation. The variance in R2, however, highlights disparities across the five cross-validation subsets. The three tree models outperformed the others, notably the XGB model (R2 = 0.81, variance = 0.0005, RMSE = 4.51 µg/L), followed by SVR (R2 = 0.76, variance = 0.002, RMSE = 5.07 µg/L). In comparison, KRR, KNN, and MLP showed marginally reduced performance.
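The noise experiment can be sketched as below. The paper does not specify the noise model, so 5% multiplicative Gaussian noise is an assumption here, and `GradientBoostingRegressor` on synthetic data stands in for the paper's tuned models:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.uniform(0, 0.2, (150, 4))
y = 45 * X[:, 2] / (X[:, 2] + X[:, 3]) + rng.normal(0, 0.5, 150)

# Inject 5% multiplicative Gaussian noise into the input features
# (one possible reading of "5% noise"; the paper's exact scheme may differ)
X_noisy = X * (1 + 0.05 * rng.standard_normal(X.shape))

model = GradientBoostingRegressor(random_state=0)
clean = cross_val_score(model, X, y, cv=5, scoring="r2")
noisy = cross_val_score(model, X_noisy, y, cv=5, scoring="r2")
print(f"clean R2={clean.mean():.2f} var={clean.var():.4f}")
print(f"noisy R2={noisy.mean():.2f} var={noisy.var():.4f}")
```

The mean and variance of the fold scores correspond to the averaged metrics and the R2 variance reported in Table 5.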


Table 5. Cross-validation (CV) performance of the original models vs. the models after adding five percent noise

3.4.4 Model stability: prediction performances

The ability of each developed model to predict Chl-a was also assessed using the residuals of the prediction results for the validation dataset, as presented in Fig. 7. The three tree models (GBDT, XGB, RF), along with the SVR and KRR models, exhibited exceptional performance: at most one data point surpassed the Z-score threshold for these models, and RF and SVR produced no outliers at all. Moreover, these models had relatively narrow standard deviation bandwidths. In contrast, the KNN and MLP models lagged in performance, characterized by wider bandwidths and more outlier points.


Fig. 7. Residuals of model predictions for estimating Chl-a concentration with the standard deviation intervals shaded in yellow and the outlier points in red.


Examining individual performances, KRR emerged as the standout, with the narrowest bandwidth (standard deviation range = 6.16), indicating superior prediction robustness for Chl-a concentrations. Conversely, the KNN model was the least robust, showing the broadest standard deviation bandwidth (standard deviation range = 9.21) and a significant outlier. Furthermore, the MLP model, depicted in Fig. 7(g), demonstrated a slightly diminished capacity to manage outliers, with three flagged outliers and a less favorable standard deviation bandwidth (standard deviation range = 7.84) compared to the five models mentioned above.
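The Z-score-based outlier flagging on residuals can be sketched as follows; the residual values are made up, with one deliberately large residual to trigger the flag:

```python
import numpy as np

# Residuals (estimated minus measured Chl-a, ug/L); illustrative values,
# with one deliberately large residual at index 5
residuals = np.array([-1.2, 0.4, 2.1, -0.8, 0.3, 7.9, -1.5, 0.9, -0.2, 1.1])

# Standardize residuals and flag points beyond the Z-score threshold of 2
z = (residuals - residuals.mean()) / residuals.std()
outliers = np.flatnonzero(np.abs(z) > 2)
band_half_width = 2 * residuals.std()   # half-width of the +/-2 SD band
print("outlier indices:", outliers, "band half-width:", round(band_half_width, 2))
```

A narrower standard deviation band and fewer flagged points correspond to the more robust models in Fig. 7.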

3.4.5 Ensemble learning

Based on these results, five strong machine learning models (GBDT, XGB, RF, KRR, SVR) were chosen to construct ensemble learning models using stacking and voting, to explore their performance and prediction accuracy. Figure 8 illustrates the results of integrating these models through the voting and stacking methods, showing significant accuracy improvement on the validation set. The voting model performed best (R2 = 0.92, RMSE = 2.89 µg/L), while the stacking model also achieved acceptable accuracy (R2 = 0.91, RMSE = 3.02 µg/L), with a slope closer to 1, indicating predictions close to the true values.
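Voting and stacking ensembles of this kind can be sketched with scikit-learn's `VotingRegressor` and `StackingRegressor`. The base learners below use default settings on synthetic data (XGB is omitted since it lives outside scikit-learn), so this illustrates the construction, not the paper's tuned ensemble:

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor, StackingRegressor)
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(5)
X = rng.uniform(0, 0.2, (200, 4))
y = 50 * X[:, 2] / (X[:, 2] + X[:, 3]) + rng.normal(0, 1, 200)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

base = [("gbdt", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(random_state=0)),
        ("krr", KernelRidge(kernel="rbf")),
        ("svr", SVR(kernel="rbf"))]

# Voting averages the base predictions; stacking trains a meta-learner
# on out-of-fold base predictions (default final estimator: RidgeCV)
voter = VotingRegressor(base).fit(X_tr, y_tr)
stacker = StackingRegressor(base, cv=5).fit(X_tr, y_tr)
print("voting R2:", round(r2_score(y_va, voter.predict(X_va)), 2))
print("stacking R2:", round(r2_score(y_va, stacker.predict(X_va)), 2))
```

Averaging (voting) tends to damp individual model errors, which is consistent with the accuracy gain the paper reports on the validation set.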


Fig. 8. Performance results for the ensemble learning model: (a) voting validation and (b) stacking validation


These integrated models were further evaluated for noise handling (Table 6). They demonstrated impressive accuracy in both validation and cross-validation (R2 > 0.80, RMSE < 4.5 µg/L), with substantial improvement under noise (R2 > 0.75, RMSE < 5 µg/L), validating their strong generalization capability. Ultimately, the integration of machine learning models proves highly effective, especially where early-stage noise management is challenging. As illustrated in Fig. 9, both the voting and stacking models are robust, with standard deviation bandwidths under 6 and no outliers at a Z-score threshold of 2. Figure 10 also shows how sample size affects overfitting in the integrated models. The voting model shows marked improvement, particularly in handling overfitting caused by limited training samples, although stacking is less efficient in this regard. Overall, the capabilities of the integrated models are enhanced across all aspects compared to the individual machine learning approaches.


Fig. 9. Residuals of the integrated model predictions for Chl-a concentration.



Fig. 10. Cross-validation learning curves of the integrated model response to training examples.



Table 6. Integrated model accuracy evaluationa

3.5 HJ-A/B observation of lake Chl-a patterns

With our developed XGB algorithm, the annual average Chl-a concentrations in the three lakes were estimated for 2012-2021 (Fig. 11 and Table S3). There were significant differences in Chl-a concentrations due to watershed terrestrial inputs, water quality management, and climatic changes. The annual average Chl-a concentrations in CGH and HLH exhibited a significant upward trend between 2012 and 2014, whereas XKH showed the opposite. In 2015, the Chl-a levels of the three lakes dropped sharply (CGH 9.24 µg/L, XKH 6.24 µg/L, HLH 5.12 µg/L), with an average change rate of -80%. From 2016, the Chl-a concentrations in CGH, XKH, and HLH increased again, reaching 19.70 µg/L, 14.67 µg/L, and 11.71 µg/L in 2017. During 2019-2021, the annual average Chl-a concentrations in CGH and HLH also increased, to 20.00 µg/L (CGH) and 14.12 µg/L (HLH) in 2021. In contrast, the annual average Chl-a concentration in XKH decreased after 2017, reaching its lowest value in ten years (3.98 µg/L) in 2020. In other years, the interannual changes of the three typical lakes were relatively stable, consistent with some previous studies [80,81].


Fig. 11. The averaged Chl-a concentrations and their annual rates of change in the three lakes. Blue, green, and orange bars represent Chl-a levels in the corresponding years for CGH, XKH, and HLH; The blue, green, and orange lines represent interannual changes in CGH, XKH, and HLH. The red dashed lines represent the average annual rate of change of the three lakes.


Figure 12 shows the estimated spatial distribution of Chl-a concentrations for the three lakes using our developed XGB algorithm. For CGH, relatively high Chl-a concentrations are distributed in the northwest and southeast, with lower values in the central regions. In comparison, the Chl-a distribution in the central and northern parts of HLH is often higher than in other regions. This could be related to the three main agricultural irrigation districts (e.g., Qianguo, Da'an, and Songyuan), which are located in the southeast. It is also important to note that the northern part of XKH, which belongs to China, showed higher Chl-a concentrations than the southern part in Russia.


Fig. 12. Chl-a spatial distribution for the three lakes based on HJ1-A/B CCD image data inversion from 2012-2021. Lighter colors represent lower Chl-a levels and darker colors higher levels.


4. Discussion

4.1 HJ-A/B application

Since the launch of the HJ1-A/B series satellites in September 2008, both the CCD cameras and the hyperspectral imagers have been used for remote sensing and monitoring of inland lake water quality [31,50,78]. However, with the rapid development of modern science and technology, remote sensing inversion algorithms for water quality parameters are constantly updated, and machine learning algorithms continue to emerge in this field. In previous studies, satellite sensors such as Landsat and Sentinel offered high spectral and spatial resolution, which enabled a large number of studies to create and apply machine learning algorithms for deriving Chl-a from their image data [24,29,46,54]. Our study shows that lower spectral resolution does not mean failure: the HJ1-A/B CCD satellites have good spatial resolution (30 m), and because HJ1-A and HJ1-B are networked, a revisit interval of only 2 days, remarkably shorter than the Landsat OLI sensor's 16-day temporal resolution. More importantly, our machine learning algorithms used only bands 1-4 and their combinations from the HJ1-A/B CCD images, were calibrated and validated (Table 2), and performed well in deriving Chl-a concentrations. Together, these results demonstrate that HJ1-A/B CCD images can inform future algorithm analysis and selection studies. The sensor is therefore also expected to form a constellation with other satellite sensors to provide new data sources for synchronized observation of inland lakes over large areas.

4.2 Machine learning algorithms

Based on our results, we demonstrated that many statistical tools and machine learning methods can be combined with HJ1-A/B CCD images to estimate water Chl-a concentrations. Different algorithms show different advantages, so it is necessary to explore their performance from a comparative perspective. Here, we systematically explored comprehensive evaluation technologies for Chl-a in water based on remote sensing data. First of all, it was important to understand why a model can predict Chl-a, through preparing, selecting, and processing the data. In our results, parametric (Pearson) or non-parametric (Spearman) correlations were not the only determinants for selecting model input features; some bands or combinations with low correlations were in fact more important (Fig. 5). Likewise, the selection of features need not be limited to bands and their combinations, and the correlation between input features and predicted values alone may be insufficient. Indeed, recognizing inter-feature interactions and incorporating relevant optical quantities, such as the absorption or backscattering coefficient of Chl-a at each wavelength, and other environmental factors (e.g., water temperature and depth), could further bolster a model's generalization ability [82]. Furthermore, in terms of data preparation, we found that preprocessing data to eliminate noise can significantly enhance model accuracy; for example, an effective strategy is to scale features between 0 and 1, as delineated in section 3.4.3 [83]. Subsequently, it is particularly important to choose the best combination of hyper-parameters for model training (as discussed in section 2.6) (Table 3). Superior hyper-parameter combinations could be obtained by employing random searches alongside practical considerations, or by using optimal tuning methods such as Bayesian optimization [68]. Moreover, model evaluation typically requires cross-validation, as the initial dataset partitioning into calibration and validation sets can be subjectively biased, which potentially influences the learning direction and limits generalizability.
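The 0-1 feature scaling mentioned above can be sketched with scikit-learn's `MinMaxScaler`; the reflectance values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: 3 samples x 3 reflectance features (illustrative)
X = np.array([[0.02, 0.15, 0.08],
              [0.10, 0.05, 0.12],
              [0.06, 0.10, 0.04]])

# Rescale each feature (column) independently to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X)
X_scaled = scaler.transform(X)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

In practice the scaler should be fitted on the calibration set only and then applied to the validation set, so no validation information leaks into training.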

In machine learning, overfitting occurs when a learning algorithm adapts too closely to the noise in the training data, which hampers performance on unknown datasets and is more pronounced for smaller datasets [84]. The KRR algorithm, an evolution of ridge regression, employs L2 regularization to constrain parameter weights, thereby minimizing the overfitting risk [70]. At the same time, if the sample size is large enough, more complex machine learning methods find it easier to learn and obtain superior results, as we saw with the tree ensemble models, such as XGB, RF, and GBDT, and the neural network model (MLP) (Fig. 6). At the same level of complexity, the XGB machine learning model also adds L1 and L2 regularization to the objective function to prevent overfitting [61], which is one of the reasons we chose the XGB model as the best learning model. In addition, noise effects permeate various domains [85], and inland waters have more complex optical characteristics than marine waters [86]. Interestingly, our results showed that certain machine learning algorithms showed minimal sensitivity to noise, retaining consistent performance before and after noise addition (Table 5). However, the three tree models (i.e., XGB, GBDT, RF) and SVR showed pronounced noise sensitivity. This is consistent with a previous study [87], in which the algorithms that excelled in RMSE exhibited heightened noise vulnerability. Ultimately, decision tree-based boosting models, like XGB, are preferable choices once noise effects are mitigated prior to data processing [85].

4.3 Uncertainty, challenges, and future applications

Although measurements are being conducted in lakes worldwide and new algorithms are being developed, the creation and assessment of universally applicable Chl-a models remains a significant challenge. Atmospheric correction is pivotal for the estimation of water quality parameters derived from optical satellite imagery; the FLAASH processor has previously been demonstrated to perform well in retrieving reflectance from HJ1-A/B CCD images. Here, we developed an XGB Chl-a algorithm using in situ measured Chl-a concentrations collected from three typical lakes, in which the maximal Chl-a concentration was 49.66 µg/L. This could limit the feasibility and application of the XGB Chl-a algorithm in eutrophic or algal-bloom lakes, which have higher Chl-a concentrations (>50 µg/L). In addition, the algorithm may have limitations in inland lakes containing large amounts of suspended sediments and dissolved minerals from agricultural activities. As shown in previous studies based on radiative transfer [88], turbid waters cannot be represented by Chl-a alone, as the attenuation of irradiance (and hence reflectance) depends on the backscattering and absorption coefficients of non-algal particles (sediments) and dissolved organic matter. Another study [89] on the effect of mineral sediments on Chl-a retrieval shows the possibility of false-positive algal bloom reports for sediment concentrations >20 mg/m3. Therefore, a single indicator such as Chl-a may not be sufficient to accurately reflect the true condition of a water body. Nevertheless, our XGB model demonstrated high accuracy and effectiveness for lakes in the cold, semi-arid regions of Northeast China, and establishes a practical and theoretical groundwork for future extensive geographic research using HJ1-A/B CCD imagery.

The global surveillance of inland lakes is becoming more widespread, and algorithms for the remote sensing inversion of Chl-a are constantly advancing and being refined. Technological advances have opened the door for artificial intelligence (AI) algorithms to revolutionize the integration of remote sensing satellite data, providing deeper insights into intricate biogeochemical processes. Nonetheless, there is a pressing need to further develop machine learning models with robustness, adaptability, and exceptional generalization ability. Moreover, both Landsat and Sentinel-2, which have been widely used in past research, have their own shortcomings in spatio-temporal resolution. HJ1-A/B CCD satellite images can make up for these deficiencies and can also form a constellation to observe and track the water quality of inland lakes.

Chinese satellites have rarely entered the public eye. In the future, applying HJ1-CCD imagery to larger geographic areas could demonstrate its capability in water quality monitoring. Additionally, beyond coordinating observations with Landsat satellites, which have similar spatial resolution, combining images from newly launched HJ satellites, such as HJ2-CCD, allows joint observation of inland water bodies dominated by different optical components. Lastly, leveraging artificial intelligence, HJ satellites could work in conjunction with other Chinese satellites, such as the Gaofen (GF) and Ziyuan (ZY) series, to form an online platform for real-time data acquisition. The goal is to enable lake water quality managers to better grasp real-time water quality dynamics, facilitating the resolution of increasingly severe ecological threats caused by factors such as global warming.

5. Conclusions

Our study developed Chl-a algorithms using machine learning and HJ1-A/B CCD images for three typical lakes in Northeast China, an example of a cold, semi-arid region prone to the changes in Chl-a that drive lake eutrophication. The findings were as follows: (1) hyper-parameter optimization and grid-search cross-validation greatly improve the accuracy of the algorithms on the calibration and validation datasets; (2) among the machine learning-based chlorophyll-a algorithms (GBDT, XGB, RF, KRR, SVR, KNN, and MLP), the XGB algorithm outperformed the others in validation accuracy (R2 = 0.90, RMSE = 3.11 µg/L); (3) for noise-susceptible settings, KRR and MLP were preferred, while tree models such as GBDT, XGB, and RF achieved the highest accuracy with sufficient noise reduction; and (4) the developed Chl-a algorithm can be used to map the spatial and temporal variations in lakes such as Chagan Lake (CGH), Xinkai Lake (XKH), and Hulun Lake (HLH).

Ultimately, our study offers insights into lake eutrophication and showcases the utility of HJ1-A/B CCD imagery and machine learning algorithms. Together, these machine learning algorithms can serve as valuable tools for future research on machine learning-based water quality assessment and can potentially be extended for long-term and large-scale monitoring.

Funding

Fundamental Research Funds for the Central Universities (2022-KYYWF-0156); Municipal Academy of Science and Technology Innovation Cooperation Project of the Changchun Surface Water Quality Sky-Ground Integrated Remote Sensing Monitoring Technology Research and Development (21SH10); Jilin Province and Chinese Academy of Sciences Science and Technology Cooperation High-tech Industrialization Special Fund Project (2021SYHZ0002); Common Application Support Platform for National Civil Space Infrastructure Land Observation Satellites (2017-000052-73-01-001735); National Natural Science Foundation of China (42171374, 42201414, U2243230, U2342008).

Acknowledgments

We thank the Institute of Air and Space Information Innovation of the Chinese Academy of Sciences for providing data and image-processing resources. We also would like to thank Dr. Savannah Grace at the University of Florida for her assistance with English language and grammatical editing of the manuscript.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. C. Verpoorter, T. Kutser, D. A. Seekell, et al., “A global inventory of lakes based on high-resolution satellite imagery,” Geophys. Res. Lett. 41(18), 6396–6402 (2014).

2. J. Heino, J. Alahuhta, L. M. Bini, et al., “Lakes in the era of global change: moving beyond single-lake thinking in maintaining biodiversity and ecosystem services,” Biological Reviews 96(1), 89–106 (2021).

3. N. S. Diffenbaugh, D. L. Swain, and D. Touma, “Anthropogenic warming has increased drought risk in California,” Proc. Natl. Acad. Sci. 112(13), 3931–3936 (2015).

4. G. Free, M. Bresciani, M. Pinardi, et al., “Investigating lake chlorophyll-a responses to the 2019 European double heatwave using satellite remote sensing,” Ecol. Indic. 142, 109217 (2022).

5. J. C. Ho, A. M. Michalak, and N. Pahlevan, “Widespread global increase in intense lake phytoplankton blooms since the 1980s,” Nature 574(7780), 667–670 (2019).

6. Z. Wu, H. He, Y. Cai, et al., “Spatial distribution of chlorophyll a and its relationship with the environment during summer in Lake Poyang: a Yangtze-connected lake,” Hydrobiologia 732(1), 61–70 (2014).

7. J. N. Boyer, C. R. Kelble, P. B. Ortner, et al., “Phytoplankton bloom status: Chlorophyll a biomass as an indicator of water quality condition in the southern estuaries of Florida, USA,” Ecol. Indic. 9(6), S56–S67 (2009).

8. X. Hou, L. Feng, Y. Dai, et al., “Global mapping reveals increase in lacustrine algal blooms over the past decade,” Nat. Geosci. 15(2), 130–134 (2022).

9. H. W. Paerl, W. S. Gardner, K. E. Havens, et al., “Mitigating cyanobacterial harmful algal blooms in aquatic ecosystems impacted by climate change and anthropogenic nutrients,” Harmful Algae 54, 213–222 (2016).

10. M. E. Bechmann, D. Berge, H. O. Eggestad, et al., “Phosphorus transfer from agricultural areas and its impact on the eutrophication of lakes—two long-term integrated studies from Norway,” J. Hydrol. 304(1-4), 238–250 (2005).

11. J. Grbic, P. Helm, S. Athey, et al., “Microplastics entering northwestern Lake Ontario are diverse and linked to urban sources,” Water Res. 174, 115623 (2020).

12. P. J. Kleinman, A. N. Sharpley, P. J. Withers, et al., “Implementing agricultural phosphorus science and management to combat eutrophication,” Ambio 44(S2), 297–310 (2015).

13. J. P. Smol, “Under the radar: long-term perspectives on ecological changes in lakes,” Proc. R. Soc. B. 286, 20190834 (2019).

14. B. Qin, J. Deng, K. Shi, et al., “Extreme climate anomalies enhancing cyanobacterial blooms in eutrophic Lake Taihu, China,” Water Resour. Res. 57(7), e2020WR029371 (2021).

15. Z. Li, W. Yang, B. Matsushita, et al., “Remote estimation of phytoplankton primary production in clear to turbid waters by integrating a semi-analytical model with a machine learning algorithm,” Remote Sens. Environ. 275, 113027 (2022).

16. X. Xu, J. Kang, J. Shen, et al., “EEM–PARAFAC characterization of dissolved organic matter and its relationship with disinfection by-products formation potential in drinking water sources of northeastern China,” Sci. Total Environ. 774, 145297 (2021).

17. Z. Zhang, Y. Zheng, F. Han, et al., “Recovery of an endorheic lake after a decade of conservation efforts: Mediating the water conflict between agriculture and ecosystems,” Agricultural Water Management 256, 107107 (2021).

18. V. Sagan, K. T. Peterson, M. Maimaitijiang, et al., “Monitoring inland water quality using remote sensing: potential and limitations of spectral indices, bio-optical simulations, machine learning, and cloud computing,” Earth-Sci. Rev. 205, 103187 (2020).

19. X. Chuai, X. Chen, L. Yang, et al., “Effects of climatic changes and anthropogenic activities on lake eutrophication in different ecoregions,” Int. J. Environ. Sci. Technol. 9(3), 503–514 (2012).

20. L. Lyu, K. Song, Z. Wen, et al., “Estimation of the lake trophic state index (TSI) using hyperspectral remote sensing in Northeast China,” Opt. Express 30(7), 10329–10345 (2022).

21. R. Ladwig, A. P. Appling, A. Delany, et al., “Long-term change in metabolism phenology in north temperate lakes,” Limnol. Oceanogr. 67(7), 1502–1521 (2022).

22. M. E. Smith, L. Robertson Lain, and S. Bernard, “An optimized Chlorophyll a switching algorithm for MERIS and OLCI in phytoplankton-dominated waters,” Remote Sens. Environ. 215, 217–227 (2018).

23. M. Hu, R. Ma, J. Xiong, et al., “Eutrophication state in the Eastern China based on Landsat 35-year observations,” Remote Sens. Environ. 277, 113057 (2022).

24. Y. Li, S. Li, K. Song, et al., “Sentinel-3 OLCI observations of Chinese lake turbidity using machine learning algorithms,” J. Hydrol. 622, 129668 (2023).

25. M. Werther, D. Odermatt, S. G. H. Simis, et al., “A Bayesian approach for remote sensing of chlorophyll-a and associated retrieval uncertainty in oligotrophic and mesotrophic lakes,” Remote Sens. Environ. 283, 113295 (2022).

26. A. A. Gitelson, G. Dall’Olmo, W. Moses, et al., “A simple semi-analytical model for remote estimation of chlorophyll-a in turbid waters: Validation,” Remote Sens. Environ. 112(9), 3582–3593 (2008).

27. D. Gurlin, A. A. Gitelson, and W. J. Moses, “Remote estimation of chl-a concentration in turbid productive waters — Return to a simple two-band NIR-red model?” Remote Sens. Environ. 115(12), 3479–3490 (2011).

28. T. Hirawake, K. Shinmyo, A. Fujiwara, et al., “Satellite remote sensing of primary productivity in the Bering and Chukchi Seas using an absorption-based approach,” ICES J. Mar. Sci. 69(7), 1194–1204 (2012).

29. S. Li, K. Song, S. Wang, et al., “Quantification of chlorophyll-a in typical lakes across China using Sentinel-2 MSI imagery with machine learning algorithm,” Sci. Total Environ. 778, 146271 (2021).

30. G. Liu, L. Li, K. Song, et al., “An OLCI-based algorithm for semi-empirically partitioning absorption coefficient and estimating chlorophyll a concentration in various turbid case-2 waters,” Remote Sens. Environ. 239, 111648 (2020).

31. Q. Wang, C. Wu, Q. Li, et al., “Chinese HJ-1A/B satellites and data characteristics,” Sci. China Earth Sci. 53(S1), 51–57 (2010).

32. Z. Cao, R. Ma, H. Duan, et al., “A machine learning approach to estimate chlorophyll-a from Landsat-8 measurements in inland lakes,” Remote Sens. Environ. 248, 1 (2020).

33. K. Dörnhöfer and N. Oppelt, “Remote sensing for lake research and monitoring – Recent advances,” Ecol. Indic. 64, 105–122 (2016).

34. S. Xu, S. Li, Z. Tao, et al., “Remote Sensing of Chlorophyll-a in Xinkai Lake Using Machine Learning and GF-6 WFV Images,” Remote Sensing 14(20), 5136 (2022).

35. A. A. Gitelson, B.-C. Gao, R.-R. Li, et al., “Estimation of chlorophyll-a concentration in productive turbid waters using a Hyperspectral Imager for the Coastal Ocean—the Azov Sea case study,” Environ. Res. Lett. 6, 024023 (2011).

36. H. Yang, J. Kong, H. Hu, et al., “A review of remote sensing for water quality retrieval: progress and challenges,” Remote Sens. 14(8), 1770 (2022).

37. Z. Lee, K. L. Carder, and R. A. Arnone, “Deriving inherent optical properties from water color: a multiband quasi-analytical algorithm for optically deep waters,” Appl. Opt. 41(27), 5755–5772 (2002).

38. K. Xue, R. Ma, H. Duan, et al., “Inversion of inherent optical properties in optically complex waters using sentinel-3A/OLCI images: A case study using China’s three largest freshwater lakes,” Remote Sens. Environ. 225, 328–346 (2019).

39. C. Le, Y. Li, Y. Zha, et al., “A four-band semi-analytical model for estimating chlorophyll a in highly turbid lakes: The case of Taihu Lake, China,” Remote Sens. Environ. 113(6), 1175–1182 (2009). [CrossRef]  

40. L. Rotta, E. Alcântara, E. Park, et al., “A single semi-analytical algorithm to retrieve chlorophyll-a concentration in oligo-to-hypereutrophic waters of a tropical reservoir cascade,” Ecol. Indic. 120, 106913 (2021). [CrossRef]  

41. C. B. Mouw, S. Greb, D. Aurin, et al., “Aquatic color radiometry remote sensing of coastal and inland waters: Challenges and recommendations for future satellite missions,” Remote Sens. Environ. 160, 15–30 (2015). [CrossRef]  

42. L. Ma, Y. Liu, X. Zhang, et al., “Deep learning in remote sensing applications: A meta-analysis and review,” ISPRS Journal of Photogrammetry and Remote Sensing 152, 166–177 (2019). [CrossRef]  

43. N. Pahlevan, B. Smith, J. Schalles, et al., “Seamless retrievals of chlorophyll-a from Sentinel-2 (MSI) and Sentinel-3 (OLCI) in inland and coastal waters: A machine-learning approach,” Remote Sens. Environ. 240, 1 (2020). [CrossRef]  

44. K. P. Murphy, Machine Learning: A Probabilistic Perspective (MIT press, 2012).

45. Y. W. Kim, T. Kim, J. Shin, et al., “Validity evaluation of a machine-learning model for chlorophyll a retrieval using Sentinel-2 from inland and coastal waters,” Ecol. Indic. 137, 108737 (2022). [CrossRef]  

46. Y. Ma, K. Song, Z. Wen, et al., “Remote sensing of turbidity for lakes in Northeast China using Sentinel-2 images with machine learning algorithms,” IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing 14, 9132–9146 (2021). [CrossRef]  

47. R. Wu, S. Zhang, Y. Liu, et al., “Spatiotemporal variation in water quality and identification and quantification of areas sensitive to water quality in Hulun lake, China,” Ecol. Indic. 149, 110176 (2023). [CrossRef]  

48. K. Song, Z. Wang, J. Blackwell, et al., “Water quality monitoring using Landsat Thematic Mapper data with empirical algorithms in Chagan Lake, China,” J. Appl. Remote Sens. 5(1), 053506 (2011). [CrossRef]  

49. Author name, “Images from HJ1-A/B CCD,” China Land Observing Satellite Service Center (2024). https://www.cresda.com

50. H. Ma, S. Guo, X. Hong, et al., “Atmospheric correction of HJ1-A/B images and the effects on remote sensing monitoring of cyanobacteria bloom,” Proc. Int. Assoc. Hydrol. Sci. 368, 69–74 (2015). [CrossRef]  

51. K. Toming, T. Kutser, A. Laas, et al., “First experiences in mapping lake water quality parameters with Sentinel-2 MSI imagery,” Remote Sens. 8(8), 640 (2016). [CrossRef]  

52. S. K. McFeeters, “The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features,” International Journal of Remote Sensing 17(7), 1425–1432 (1996). [CrossRef]  

53. K. Yang, Z. Yu, Y. Luo, et al., “Spatial and temporal variations in the relationship between lake water surface temperatures and water quality - A case study of Dianchi Lake,” Sci. Total Environ. 624, 859–871 (2018). [CrossRef]  

54. A. B. Ruescas, M. Hieronymi, S. Koponen, et al., “Retrieval of coloured dissolved organic matter with machine learning methods,” in 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) (IEEE, 2017), pp. 2187–2190.

55. C. Saunders, A. Gammerman, and V. Vovk, “Ridge regression learning algorithm in dual variables,” in Proceedings of the 15th International Conference on Machine Learning (ICML) (1998).

56. E. A. Zanaty, “Support vector machines (SVMs) versus multilayer perception (MLP) in data classification,” Egyptian Informatics Journal 13(3), 177–183 (2012). [CrossRef]  

57. J. Goldberger, G. E. Hinton, S. Roweis, et al., “Neighbourhood components analysis,” Advances in Neural Information Processing Systems 17, 1 (2004).

58. L. Breiman, “Random forests,” Machine Learning 45(1), 5–32 (2001). [CrossRef]  

59. L. Prokhorenkova, G. Gusev, A. Vorobev, et al., “CatBoost: unbiased boosting with categorical features,” Advances in Neural Information Processing Systems 31, 6511–6520 (2018). [CrossRef]  

60. J. Huan, H. Li, M. Li, et al., “Prediction of dissolved oxygen in aquaculture based on gradient boosting decision tree and long short-term memory network: A study of Chang Zhou fishery demonstration base, China,” Computers and Electronics in Agriculture 175, 105530 (2020). [CrossRef]  

61. T. Chen and C. Guestrin, “XGBoost,” presented at Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016). [CrossRef]  

62. A. Gupta, K. Gusain, and B. Popli, “Verifying the value and veracity of extreme gradient boosted decision trees on a variety of datasets,” (IEEE, 2016), pp. 457–462.

63. O. Sagi and L. Rokach, “Ensemble learning: A survey,” WIREs Data Min Knowl. 8(4), e1249 (2018). [CrossRef]  

64. T. G. Dietterich, Ensemble methods in machine learning, (Springer, 2000), pp. 1–15.

65. L. Breiman, “Bagging predictors,” Mach. Learn. 24(2), 123–140 (1996). [CrossRef]  

66. B. Pavlyshenko, “Using stacking approaches for machine learning models,” (IEEE, 2018), pp. 255–258.

67. J. Bergstra, R. Bardenet, Y. Bengio, et al., “Algorithms for hyper-parameter optimization,” Advances in Neural Information Processing Systems 24, 2564 (2011). [CrossRef]  

68. J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research 13(10), 281–305 (2012).

69. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, arXiv:1412.6980 (2014). [CrossRef]  

70. S. An, W. Liu, and S. Venkatesh, “Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression,” Pattern Recognition 40(8), 2154–2162 (2007). [CrossRef]  

71. D. Anguita, L. Ghelardoni, A. Ghio, et al., “The ‘K’ in K-fold Cross Validation,” in Proceedings of the 20th European Symposium on Artificial Neural Networks (ESANN) (2012), pp. 441–446.

72. S. M. Lundberg, G. G. Erion, and S.-I. Lee, “Consistent individualized feature attribution for tree ensembles,” arXiv, arXiv:1802.03888 (2018). [CrossRef]  

73. H. Abdi, “Z-scores,” Encyclopedia of Measurement and Statistics 3, 1055–1058 (2007).

74. D. G. Altman and J. M. Bland, “Standard deviations and standard errors,” BMJ 331(7521), 903 (2005). [CrossRef]  

75. C. Cheadle, M. P. Vawter, W. J. Freed, et al., “Analysis of microarray data using Z score transformation,” J. Mol. Diagn. 5(2), 73–81 (2003). [CrossRef]  

76. A. Ghasemi and S. Zahediasl, “Normality tests for statistical analysis: a guide for non-statisticians,” Int J Endocrinol Metab 10(2), 486 (2012). [CrossRef]  

77. S. M. Powers, T. W. Bruulsema, T. P. Burt, et al., “Long-term accumulation and transport of anthropogenic phosphorus in three river basins,” Nat. Geosci. 9(5), 353–356 (2016). [CrossRef]  

78. L. Zhou, D. A. Roberts, W. Ma, et al., “Estimation of higher chlorophyll-a concentrations using field spectral measurement and HJ-1A hyperspectral satellite data in Dianshan Lake, China,” ISPRS Journal of Photogrammetry and Remote Sensing 88, 41–47 (2014). [CrossRef]  

79. S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in Neural Information Processing Systems 30, 1 (2017). [CrossRef]  

80. M. Shen, J. Luo, Z. Cao, et al., “Random forest: An optimal chlorophyll-a algorithm for optically complex inland water suffering atmospheric correction uncertainties,” J. Hydrol. 615, 128685 (2022). [CrossRef]  

81. Z. Cao, M. Wang, R. Ma, et al., “A decade-long chlorophyll-a data record in lakes across China from VIIRS observations,” Remote Sens. Environ. 301, 113953 (2024). [CrossRef]  

82. Y. Zhang, F. Shen, X. Sun, et al., “Marine big data-driven ensemble learning for estimating global phytoplankton group composition over two decades (1997–2020),” Remote Sens. Environ. 294, 113596 (2023). [CrossRef]  

83. Z. Cui and G. Gong, “The effect of machine learning regression algorithms and sample size on individualized behavioral prediction with functional connectivity features,” NeuroImage 178, 622–637 (2018). [CrossRef]  

84. E. M. Dos Santos, R. Sabourin, and P. Maupin, “Overfitting cautious selection of classifier ensembles with genetic algorithms,” Information Fusion 10(2), 150–162 (2009). [CrossRef]  

85. A. Atla, R. Tada, V. Sheng, et al., “Sensitivity of different machine learning algorithms to noise,” Journal of Computing Sciences in Colleges 26(5), 96–103 (2011). [CrossRef]  

86. J. J. Cole, Y. T. Prairie, N. F. Caraco, et al., “Plumbing the global carbon cycle: integrating inland waters into the terrestrial carbon Budget,” Ecosystems 10(1), 172–185 (2007). [CrossRef]  

87. E. Kalapanidas, N. Avouris, M. Craciun, et al., “Machine learning algorithms: a study on noise sensitivity,” (2003), pp. 356–365.

88. K. Aryal, P.-W. Zhai, M. Gao, et al., “Instantaneous photosynthetically available radiation models for ocean waters using neural networks,” Appl. Opt. 61(33), 9985–9995 (2022). [CrossRef]  

89. C. Zeng and C. Binding, “The effect of mineral sediments on satellite chlorophyll-a retrievals from line-height algorithms using red and near-infrared bands,” Remote Sens. 11(19), 2306 (2019). [CrossRef]  

Supplementary Material (1)

Supplement 1: Supplemental Document

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.


Figures (12)

Fig. 1.
Fig. 1. Location of the lakes and their collected samples in Northeast China: (a) Hulun Lake (HLH), (b) Xingkai Lake (XKH), and (c) Chagan Lake (CGH).
Fig. 2.
Fig. 2. Schematic block diagram illustrating the main workflow of the machine learning algorithms used to estimate Chl-a concentration.
Fig. 3.
Fig. 3. Spectral reflectance of lake samples extracted from HJ1-A/B CCD imagery: (a) XKH, (b) CGH, and (c) HLH.
Fig. 4.
Fig. 4. Relationships between estimated and measured Chl-a values using different machine learning algorithms: (a-b) GBDT, (c-d) XGB, (e-f) RF, (g-h) KRR, (i-j) SVR, (k-l) KNN, and (m-n) MLP.
Fig. 5.
Fig. 5. SHAP feature-importance results for the different machine learning models. Each row represents a feature, and the horizontal axis denotes the Shapley values; features are sorted top to bottom by mean absolute Shapley value. Colored points correspond to input feature values (reflectance of a band or band combination), where red indicates a larger value and blue a smaller one. Points to the right of zero make a positive contribution to the model output, points to the left a negative contribution, and points at zero have no effect.
Fig. 6.
Fig. 6. Cross-validation learning curves showing model response to the number of training examples. The horizontal and vertical axes show the number of training examples and the R2 score, respectively. Lines of the same color belong to the same model: circular markers trace the model's training curve, and square markers trace its cross-validation curve. Shaded areas indicate standard-deviation intervals.
Fig. 7.
Fig. 7. Residuals of model predictions for estimating Chl-a concentration with the standard deviation intervals shaded in yellow and the outlier points in red.
Fig. 8.
Fig. 8. Performance results for the ensemble learning model: (a) voting validation and (b) stacking validation.
Fig. 9.
Fig. 9. Residuals of the integrated model predictions for Chl-a concentration.
Fig. 10.
Fig. 10. Cross-validation learning curve of the integrated model response to training examples.
Fig. 11.
Fig. 11. The averaged Chl-a concentrations and their annual rates of change in the three lakes. Blue, green, and orange bars represent Chl-a levels in the corresponding years for CGH, XKH, and HLH; the blue, green, and orange lines represent the corresponding interannual changes. The red dashed lines represent the average annual rate of change of the three lakes.
Fig. 12.
Fig. 12. Chl-a spatial distribution for the three lakes, retrieved from HJ1-A/B CCD imagery for 2012–2021. Lighter colors represent lower Chl-a levels and darker colors higher levels.

Tables (6)

Table 1. Descriptive statistics of chlorophyll a (Chl-a) concentration (μg/L) in the three lakes

Table 2. Spearman correlation between Chl-a and band combinations (spectral reflectance variables)a

Table 3. Hyper-parametric optimization results using cross-validated grid search

Table 4. Performance evaluation for the developed chlorophyll a algorithm

Table 5. Cross-validation (CV) performance of the original model vs. the model after adding five percent noise

Table 6. Integrated model accuracy evaluationa

Equations (2)


$$H(x) = \frac{1}{K}\sum_{k=1}^{K} h_k(x)$$
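The averaging rule above (the ensemble prediction H(x) as the mean of K base learners h_k) can be sketched with scikit-learn's VotingRegressor. This is an illustrative sketch only: the data are synthetic stand-ins for the reflectance features and in-situ Chl-a values, and the three base learners chosen here are placeholders, not the paper's trained models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic stand-in for band-reflectance features and Chl-a targets
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)

# K = 3 base learners h_k; H(x) is the unweighted mean of their predictions
base = [("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("svr", SVR(kernel="rbf"))]
ensemble = VotingRegressor(estimators=base).fit(X, y)

# The voting prediction equals the mean of the individual base predictions
manual = np.mean([est.fit(X, y).predict(X) for _, est in base], axis=0)
assert np.allclose(ensemble.predict(X), manual)
```

With unit weights, VotingRegressor implements exactly this arithmetic mean, which is why the manual average reproduces its output.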
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
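This additive form underlies GBDT and XGBoost: each of the K trees f_k contributes a correction to the running prediction. A minimal sketch with scikit-learn's GradientBoostingRegressor on synthetic data; the tree-by-tree reconstruction assumes the default squared-error loss, where the final prediction is the initial estimate plus the learning-rate-scaled sum of the individual tree outputs.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for reflectance features and Chl-a targets
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)

K = 50  # number of additive trees f_k
gbdt = GradientBoostingRegressor(n_estimators=K, learning_rate=0.1,
                                 random_state=0).fit(X, y)

# Rebuild y_hat_i = init + sum_k nu * f_k(x_i) tree by tree
pred = gbdt.init_.predict(X).ravel().astype(float)
for tree in gbdt.estimators_.ravel():
    pred = pred + gbdt.learning_rate * tree.predict(X)

assert np.allclose(pred, gbdt.predict(X))
```

The `init_` and `estimators_` attributes expose the initial estimator and the fitted trees, so the summation in the equation can be verified directly against `predict`.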