
Application of KNN algorithm incorporating Gaussian functions in green and high-quality development of cities empowered by circular economy

Abstract

A growing number of industries have begun to adapt to the circular economy since the concept's introduction. To accurately evaluate the development level of the circular economy, a circular economy prediction model based on a support vector machine and a Gaussian K-nearest neighbour algorithm (SVM-GKNN) is proposed. The model first uses an improved K-nearest neighbour algorithm based on the Gaussian function to classify the indicator data at each level, and then uses a support vector machine to make predictions from the relevant data. According to the experimental findings, the model's average prediction accuracy for the indicators of industrial development, resource consumption, ecological protection, and resource recycling and reuse was approximately 98.1%, 98.8%, 94.9%, and 95.9%, respectively, higher than that of the multi-vector autoregressive model and the grey prediction model. In predicting the overall development level of the circular economy, the average prediction accuracies of the multi-vector autoregressive model, the grey prediction model, and the SVM-GKNN model were about 94.3%, 96.2%, and 99.3%, respectively, with average recalls of about 86.6%, 87.7%, and 89.1%, and average F1-measures of about 0.88, 0.89, and 0.92. Moreover, the average relative error of the SVM-GKNN model was only approximately 0.6%, lower than the 3.7% and 2.8% of the multi-vector autoregressive model and the grey prediction model, respectively. Meanwhile, compared with an existing time-series analysis technique, the proposed SVM-GKNN model achieved a goodness of fit of up to 0.95, demonstrating good prediction performance. These results indicate that the SVM-GKNN model has the highest accuracy in predicting the development level of the circular economy.

Introduction

The exploitation of natural resources has increased progressively to satisfy the demands of economic development, but the resources that have been exploited have not been used effectively, and the reserves of natural resources are rapidly depleting. As a result, current economic development faces a series of problems, chief among them the gradual depletion of natural resources. The peaceful coexistence of man and nature is seriously threatened by unsustainable industrial practice. Therefore, it is essential to build the economy on the basis of conserving the environment if a balance between short-term gains and long-term interests is to be realized over the long run (Kjaer et al. 2019; Agarwal et al. 2022). The concept of rational exploitation, efficient utilization, clean emission, and low resource destruction underlies the circular economy (CE), an economic model different from the conventional linear economy. The CE's distinctive economic model features a reversible cycle, multi-directional transformation, multi-level utilization, and no waste emission, with the aim of achieving zero pollutant emission (Agrawal et al. 2022). This development model has therefore been recognized by people from all walks of life. With the practice of the CE concept, accurately evaluating the level of CE development has become a major challenge. Although polynomial regression can predict the development level and is easy to understand, it struggles to model nonlinear data with correlated features and cannot accurately express highly complex data. Both the support vector machine (SVM) and the k-nearest neighbour (KNN) algorithm can process nonlinear data well, and SVM also handles high-dimensional data well. Therefore, to close this gap and achieve accurate prediction of the CE development level, the study proposes a KNN approach based on the Gaussian function (GF) combined with a CE evaluation model built on SVM.

This study is divided into four parts. The first part is a literature review, which briefly summarizes related studies of the KNN algorithm and SVM. The second part is the study of the CE evaluation model based on the GF-improved KNN algorithm and SVM. The third part presents the experimental results, which evaluate the proposed model and compare it with the multi-vector autoregressive model (MVAR) and the grey prediction model (GPM). The fourth part is the conclusion, which summarizes the whole research process.

Related works

In a traditional economy, the value created is directly proportional to the consumption of resources: the more resources are consumed, the more value is generated and, correspondingly, the more waste is produced. CE, on the other hand, generates as much value as possible with as little resource consumption and environmental pollution as possible, harmonizing the economic system with the material cycle of the natural system. Agarwal et al. (2022) proposed a new layered framework for improving the CE supply chain through Industry 4.0 technologies. The framework covered 13 CE challenges and 8 Industry 4.0 technology aspects and was evaluated through hierarchical analysis and portfolio distance assessment to prioritize the CE challenges. The experimental results showed that "information disruption among supply chain members due to multiple channels" and "inability of manpower to handle toxic materials" were the key factors hindering the practice of CE in the supply chain. Eriksen et al. (2020) explored the CE challenges of PET, PE and PP in the supply chain and assessed their potential circularity through dynamic logistics analysis. Experimental results showed that 50 years after the baseline, the recycling rate was only 13–20%, while virgin plastics still accounted for 85–90% of the annual plastics demand, failing to meet existing recycling targets. Scalia et al. (2021) proposed an evaluation method based on multi-criteria analysis to assess the value of waste coffee grounds in mortar production, addressing the issue of their benefits in the CE. The results showed that adding waste coffee grounds could effectively improve the technical and sustainable performance of the new mortar, which can be used in different applications in the construction field: the presence of waste coffee grounds increased water absorption, improved insulation performance, and reduced environmental impact. Haleem et al. (2021) proposed a framework for evaluating suppliers in response to the problem of supplier selection in CE. The framework determined the total weights of the evaluation criteria through the fuzzy CRITIC method and used the TOPSIS method to rank the suppliers. According to expert opinion and the related literature, environmental criteria were the most favourable, with a subjective weight of 0.23. Kouhizadeh et al. (2023) studied the CE benefits of blockchain technology, collecting empirical survey data from 32 CE and blockchain experts. The results showed that the transparency and traceability, reliability and security, smart contracts, incentives, and tokenization of blockchain offer different potential supports for CE performance evaluation.

The KNN algorithm is widely used in various fields owing to its mature theory, simple idea, and insensitivity to anomalies, and it can be used for nonlinear classification. Rehman et al. (2022) proposed a classification algorithm based on KNN and SVM for detecting and classifying microscopic retinal blood vessels. The algorithm first extracted image features by filtering and comparing histograms, and then used KNN and SVM to classify the extracted features. Experimental results showed that the method achieved a 92% correct classification rate, better than the other algorithms. Abuzaraida et al. (2021) proposed a recognition method based on KNN and discrete cosine transform coefficients for Arabic handwritten word recognition. The method extracted structural features through discrete cosine transform coefficients and classified the segmented characters using KNN; testing showed that it achieved 99.1% accuracy in recognizing handwritten Arabic words. A KNN-based classification approach was put forward by Muliady et al. (2021) to categorize the nitrogen nutritional status of rice plants; the experimental findings demonstrated a classification accuracy of 96.4%. Behera et al. (2021) proposed a recommendation model based on KNN and restricted Boltzmann machines to accurately recommend movies of interest to users; experimental findings showed that the model significantly increased the accuracy of movie recommendation. Hanif et al. (2021) proposed a sorting method based on KNN, hierarchical trees and multiple queries for person re-identification and sorting. The method decreased the aggregated distance for correctly matched images in the initial sorting and increased it for wrongly matched images, with the final sorting distance being the weighted sum of the aggregated and actual distances.

In summary, research results for CE and KNN are numerous and wide-ranging. However, there are fewer studies on how to assess the level of CE development through AI. Moreover, KNN suffers from low prediction accuracy when the sample data are unbalanced and cannot handle high-dimensional data well, which makes it difficult to process CE-related data directly. To address these problems, the study proposes a CE development level prediction model based on a GF-improved KNN algorithm and SVM, in order to realize accurate prediction of the CE development level. The related literature is summarized in Table 1.

Table 1 Summary of related literature

KNN-based algorithm for prediction of CE development levels

The study proposes an evaluation method based on the improved KNN algorithm to accurately evaluate the development level of CE. The first section improves the KNN algorithm in order to optimize its classification efficacy. The second section combines the improved algorithm (GKNN) with SVM to build the CE development level prediction model.

Improved KNN algorithm based on GF

The KNN method is a straightforward pattern recognition algorithm that is widely used in information retrieval, natural language processing, and other fields. It has the advantages of high accuracy, making no assumptions about the data, and insensitivity to anomalies. When the KNN algorithm categorizes a sample, it first looks for the neighbouring sample points with the highest degree of similarity to that sample and counts the category labels of those neighbours. Finally, it assigns the category label with the highest probability of occurrence as the category of the target sample. Figure 1 depicts the KNN algorithm's flow.

Fig. 1
figure 1

Flow of KNN algorithm

In Fig. 1, the KNN algorithm must first normalize the data, divide the training set and the test set, calculate the distance between the samples in the test set and the samples in the training set, choose the \(k\) sample points with the smallest distance, and obtain the prediction result from these \(k\) sample points (Guo et al. 2022; Zheng et al. 2020; Hamed et al. 2020). The normalization equation is shown in Eq. (1).

$$ Z = \frac{x - \mu }{\sigma } $$
(1)

In Eq. (1), \(Z\) denotes the standard score. \(x\) denotes the data to be normalized. \(\mu\) denotes the mean of the data. \(\sigma\) denotes the standard deviation of the data. The equation for calculating the distance between samples is shown in Eq. (2).
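As an illustration, the z-score normalization of Eq. (1) can be sketched in a few lines of Python; the sample values below are illustrative assumptions, not data from the study.

```python
import numpy as np

def z_score(x):
    """Standardize data as in Eq. (1): Z = (x - mu) / sigma."""
    mu = np.mean(x)       # mean of the data
    sigma = np.std(x)     # standard deviation of the data
    return (x - mu) / sigma

data = np.array([2.0, 4.0, 6.0, 8.0])
z = z_score(data)
# After standardization the data has zero mean and unit standard deviation.
```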

$$ D\left( {X,x_{i} } \right) = \left( {\sum\limits_{l = 1}^{t} {\left| {X^{l} - x_{i}^{l} } \right|}^{2} } \right)^{\frac{1}{2}} $$
(2)

In Eq. (2), \(D\left( {X,x_{i} } \right)\) denotes the Euclidean distance between the test data and the training data. \(X\) denotes the test data. \(x_{i}\) denotes the training data. \(x_{i}^{l}\) denotes the \(l\)th feature of the \(i\)th training data. \(t\) denotes the number of features in each training data. The classification decision function is shown in Eq. (3).
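The Euclidean distance of Eq. (2) can likewise be sketched as follows; the two sample vectors are illustrative assumptions.

```python
import numpy as np

def euclidean_distance(X, x_i):
    """Eq. (2): Euclidean distance over the t features of two samples."""
    return np.sqrt(np.sum(np.abs(X - x_i) ** 2))

test_sample = np.array([1.0, 2.0, 3.0])
train_sample = np.array([4.0, 6.0, 3.0])
d = euclidean_distance(test_sample, train_sample)  # sqrt(9 + 16 + 0) = 5.0
```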

$$ \begin{array}{*{20}c} {y = \arg \mathop {\max }\limits_{{c_{j} }} \sum\limits_{{x_{i} \in N_{k} \left( x \right)}} {I\left( {c_{j} = y_{i} } \right)} } & {j = 1,2, \cdots K;} & i = 1,2, \cdots ,N \end{array} $$
(3)

In Eq. (3), \(y\) denotes the category. \(I\left( {y_{i} = c_{j} } \right)\) denotes the indicator function: \(I = 1\) when \(y_{i} = c_{j}\), and \(I = 0\) otherwise. \(N_{k} \left( x \right)\) denotes the neighbourhood consisting of the \(k\) nearest sample points of the data \(X\) to be tested. However, the traditional KNN algorithm does not consider the effect of the distance between the test point and its neighbours, which leads to low prediction accuracy on unbalanced datasets. The prediction accuracy of KNN on unbalanced datasets can be improved by weighting the distances using the GF (Gonçalves et al. 2019; Al-Dabagh et al. 2019). The GF is shown in Eq. (4).

$$ f\left( z \right) = a \times e^{{ - \frac{{\left( {z - b} \right)^{2} }}{{2c^{2} }}}} $$
(4)

In Eq. (4), \(a\) denotes the height of the peak of the curve. \(b\) denotes the coordinates of the peak. \(c\) denotes the standard deviation. The parameters \(a,b,c\) are all arbitrary real numbers. When \(a = 1,b = 0\), the equation for calculating the distance weights between test data and training data is shown in Eq. (5).

$$ \omega_{j}^{r} = \exp \left\{ { - \frac{{D^{2} \left( {X_{j} ,X_{r} } \right)}}{2}} \right\} $$
(5)

In Eq. (5), \(\omega_{j}^{r}\) denotes the weight of the distance between samples, where \(X_{r} \in N_{k} \left( x \right)\). The equation for calculating the probability distribution of the number of nearest neighbours at this point is given in Eq. (6).

$$ \left\{ {\begin{array}{*{20}l} {P_{j}^{r} \left( {k_{j} = r} \right) = \frac{{\sum\nolimits_{{s = 1,Y\left( {X_{s} } \right) = Y\left( {X_{j} } \right)}} {\omega_{j}^{s} } }}{{\sum\nolimits_{s = 1}^{r} {\omega_{j}^{s} } }}} \hfill \\ {k_{j} = \arg \mathop {\max }\limits_{r} P_{j}^{r} } \hfill \\ \end{array} } \right. $$
(6)

In Eq. (6), \(P_{j}^{r} \left( {k_{j} = r} \right)\) denotes the distribution probability of the number of nearest neighbors. \(Y\left( X \right)\) denotes the set of category labels. \(k_{j}\) denotes the optimal number of nearest neighbors. The GF-based classification decision model is shown in Eq. (7).

$$ Y = \arg \mathop {\max }\limits_{{c_{j} }} \frac{{\sum\limits_{i = 1}^{k} {\omega_{i} \left( {y_{i} = c_{j} } \right)} }}{{\sum\limits_{i = 1}^{k} {\omega_{i} } }} = \arg \mathop {\max }\limits_{{c_{j} }} \frac{{\sum\limits_{i = 1}^{k} {e^{{ - \frac{{\left( {\sum\limits_{l = 1}^{t} {\left| {X^{l} - x_{i}^{l} } \right|^{2} } } \right)}}{{2c^{2} }}}} \left( {y_{i} = c_{j} } \right)} }}{{\sum\limits_{i = 1}^{k} {e^{{ - \frac{{\left( {\sum\limits_{l = 1}^{t} {\left| {X^{l} - x_{i}^{l} } \right|^{2} } } \right)}}{{2c^{2} }}}} } }} $$
(7)

In Eq. (7), \(Y\) denotes the output result. \(\sum\limits_{i = 1}^{k} {\omega_{i} \left( {y_{i} = c_{j} } \right)}\) denotes the sum of the weights of the nearest neighbours with category \(c_{j}\). The standard deviation of the GF and the number of nearest neighbours both affect the performance of the algorithm. Therefore, an initial value of the parameter \(k_{j}\) is substituted into Eq. (8) to determine the optimal standard deviation, and the optimal \(k_{j}\) is then identified as the value that maximizes the accuracy.
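A minimal sketch of the GF-weighted voting of Eqs. (5) and (7), assuming \(a = 1\), \(b = 0\) and \(c = 1.0\) as in the text; the toy training set and the helper name `gaussian_weighted_knn` are illustrative, not the authors' implementation.

```python
import numpy as np

def gaussian_weighted_knn(X, train_X, train_y, k=3, c=1.0):
    """Classify X by the GF-weighted vote of Eq. (7).

    Each of the k nearest neighbours votes with weight
    exp(-D^2 / (2 c^2)), so closer neighbours count more,
    which mitigates class imbalance among the neighbours.
    """
    d2 = np.sum((train_X - X) ** 2, axis=1)          # squared distances D^2
    nearest = np.argsort(d2)[:k]                     # indices of k nearest points
    weights = np.exp(-d2[nearest] / (2.0 * c ** 2))  # Gaussian weights (Eq. 5)
    labels = train_y[nearest]
    scores = {}
    for w, y in zip(weights, labels):
        scores[y] = scores.get(y, 0.0) + w           # weighted vote per class
    return max(scores, key=scores.get)               # arg max over classes (Eq. 7)

train_X = np.array([[0.0, 0.0], [0.1, 0.1], [3.0, 3.0], [3.1, 2.9]])
train_y = np.array([0, 0, 1, 1])
label = gaussian_weighted_knn(np.array([0.2, 0.0]), train_X, train_y, k=3)
```

The distant third neighbour receives a vanishingly small weight, so the nearby class dominates the vote even though it is not a unanimous neighbourhood.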

$$ Accuracy = \frac{{\left( {\sum\limits_{m = 1}^{T} I \left( {\arg \mathop {\max }\limits_{{c_{j} }} \frac{{\sum\limits_{i = 1}^{k} {e^{{ - \frac{{\left( {\sum\limits_{l = 1}^{t} {\left| {X^{l} - x_{i}^{l} } \right|^{2} } } \right)^{\frac{1}{2}} }}{{2c^{2} }}}} \left( {y_{i} = c_{j} } \right)} }}{{\sum\limits_{i = 1}^{k} {e^{{ - \frac{{\left( {\sum\limits_{l = 1}^{t} {\left| {X^{l} - x_{i}^{l} } \right|^{2} } } \right)^{\frac{1}{2}} }}{{2c^{2} }}}} } }} = B_{m} = 1} \right)} \right)}}{{\sum\limits_{m = 1}^{T} I }} + \frac{{\left( {\sum\limits_{m = 1}^{T} I \left( {\arg \mathop {\max }\limits_{{c_{j} }} \frac{{\sum\limits_{i = 1}^{k} {e^{{ - \frac{{\left( {\sum\limits_{l = 1}^{t} {\left| {X^{l} - x_{i}^{l} } \right|^{2} } } \right)^{\frac{1}{2}} }}{{2c^{2} }}}} \left( {y_{i} = c_{j} } \right)} }}{{\sum\limits_{i = 1}^{k} {e^{{ - \frac{{\left( {\sum\limits_{l = 1}^{t} {\left| {X^{l} - x_{i}^{l} } \right|^{2} } } \right)^{\frac{1}{2}} }}{{2c^{2} }}}} } }} = B_{m} = 0} \right)} \right)}}{{\sum\limits_{m = 1}^{T} I }} $$
(8)

In Eq. (8), \(I\) denotes the indicator of the prediction category. \(m\) represents the number of predicted points. \(B_{m}\) indicates the test set category. Figure 2 depicts the flow of the G-KNN algorithm, i.e., the KNN algorithm after GF weight assignment.

Fig. 2
figure 2

Flow of G-KNN algorithm

As shown in Fig. 2, the G-KNN algorithm determines the distance weights through the GF after computing the distances between the samples in the test set and those in the training set. It then calculates the optimal weight parameter and the optimal number of nearest neighbours through accuracy iteration, and selects the samples closest to the sample to be tested according to the distances and the optimal number of nearest neighbours. Next, it calculates the weights of the selected distances. Finally, it computes the classification result according to the classification decision model.

SVM-GKNN-based model for predicting the level of CE development

Although the previous section improved KNN through the GF, the improvement only addresses data imbalance, so the processing power for high-dimensional data remains suboptimal. SVM, however, can use the kernel function directly to address high-dimensional problems in the context of linear division. Therefore, the study combines SVM with G-KNN to achieve accurate prediction of the development level of CE. The classification principle of SVM is shown in Fig. 3.

Fig. 3
figure 3

Classification principle of SVM

In Fig. 3, SVM separates the samples into two classes along a partition line that is not unique. The larger the distance between the division interface and the nearest samples on either side (the margin), the smaller the confidence range of the generalization bound. The optimal classification surface, i.e., the optimal hyperplane, is obtained when this distance reaches its maximum (Dhamija and Dubey 2021; Wang and Ma 2021). The hyperplane function is shown in Eq. (9).

$$ f\left( {x_{i} } \right) = w^{T} \varphi \left( {x_{i} } \right) + b $$
(9)

In Eq. (9), \(b\) denotes the displacement vector. \(w\) denotes the normal vector of the hyperplane. \(x_{i}\) denotes the \(i\)th sample. If \(f\left( {x_{i} } \right) < 0\), the sample belongs to class \(C_{1}\), otherwise it belongs to class \(C_{2}\). The equation of the normal vector of hyperplane and displacement vector are shown in Eq. (10).

$$ \left\{ {\begin{array}{*{20}l} {w = \sum\limits_{i = 1}^{n} {\left( {\hat{\alpha }_{i} - \alpha_{i} } \right)} \varphi \left( {x_{i} } \right)} \hfill \\ {b = y_{j} - \sum\limits_{i = 1}^{n} {\left( {\hat{\alpha }_{i} - \alpha_{i} } \right)} k\left( {x_{i} ,x_{j} } \right)} \hfill \\ \end{array} } \right. $$
(10)

In Eq. (10), \(k\left( {x_{i} ,x_{j} } \right)\) denotes the kernel function. \(\alpha\) denotes the scale factor. Equation (11) calculates the distance from the sample point to the hyperplane.

$$ r = \frac{{w^{T} x + b}}{\left\| w \right\|} $$
(11)

In Eq. (11), \(r\) denotes the distance from any sample to the hyperplane. To minimize the structural risk of the linear function, the conditions set forth in Eq. (12) must be satisfied.
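The point-to-hyperplane distance of Eq. (11) can be sketched as follows; the normal vector and displacement values are illustrative assumptions.

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    """Eq. (11): r = (w^T x + b) / ||w||, the signed distance of
    sample x from the hyperplane w^T x + b = 0."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # normal vector of the hyperplane (illustrative)
b = -5.0                   # displacement (illustrative)
r = distance_to_hyperplane(w, b, np.array([1.0, 3.0]))  # (3 + 12 - 5) / 5 = 2.0
```

The sign of `r` indicates on which side of the hyperplane the sample lies, which is exactly the class decision of Eq. (9).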

$$ \left\{ {\begin{array}{*{20}l} {\min \left[ {\frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{l} {\left( {\eta_{i}^{*} + \eta_{i} } \right)} } \right]} \hfill \\ {s.t.\left\{ {\begin{array}{*{20}l} {w\varphi \left( {x_{i} } \right) - y_{i} - b \le \varepsilon + \eta_{i}^{*} } \hfill \\ {y_{i} - w\varphi \left( {x_{i} } \right) - b \le \varepsilon + \eta_{i} } \hfill \\ {\eta_{i} \ge 0,\eta_{i}^{*} \ge 0} \hfill \\ \end{array} } \right.} \hfill \\ \end{array} } \right. $$
(12)

In Eq. (12), \(C\) denotes the penalty coefficient, whose value is proportional to the training accuracy. \(\varepsilon\) denotes the random error, whose value is inversely proportional to the training accuracy. \(\eta_{i}\) denotes the slack variable. The above problem is transformed into a dual optimization problem using the Lagrangian function, as given in Eq. (13).

$$ \left\{ {\begin{array}{*{20}l} {\min \frac{1}{2}\sum\limits_{i,j = 1}^{n} {\left( {a_{i} - a_{i}^{*} } \right)\left( {a_{j} - a_{j}^{*} } \right)\left( {\varphi \left( {x_{i} } \right)\varphi \left( {x_{j} } \right)} \right)} + \sum\limits_{i = 1}^{n} {\varepsilon \left( {a_{i} + a_{i}^{*} } \right)} - \sum\limits_{i = 1}^{n} {y\left( {a_{i} - a_{i}^{*} } \right)} } \hfill \\ {s.t.\left\{ {\begin{array}{*{20}l} {\sum\limits_{i = 1}^{n} {\left( {a_{i} - a_{i}^{*} } \right)} = 0} \hfill \\ {a_{i} \ge 0,a_{i}^{*} \le c} \hfill \\ \end{array} } \right.} \hfill \\ \end{array} } \right. $$
(13)

In Eq. (13), \(a_{i}^{ * } - a_{i} = m_{i}\), which denotes the Lagrange multiplier. The sample points corresponding to non-zero Lagrange multipliers are the support vectors of the model. The flow of SVM is shown in Fig. 4.

Fig. 4
figure 4

Process of SVM algorithm

Figure 4 illustrates how the SVM algorithm builds the optimization function after preprocessing the data during training. After the optimization function is constructed, the parameters are solved by the SMO algorithm to obtain the hyperplane parameters, and training is finished. In the formal classification process, the SVM algorithm calculates the classification value from the known hyperplane parameters, judges the class to which the sample belongs, and outputs the classification result. However, in practical applications part of the problem is nonlinear, so the data must be mapped into a high-dimensional space, transformed into a high-dimensional linear problem, and then solved. This transformation process is depicted in Fig. 5.

Fig. 5
figure 5

Conversion from linearly indivisible to linearly separable

In Fig. 5, when the nonlinear SVM performs classification, the original low-dimensional nonlinearity is first converted to high-dimensional linearity. Then the optimal hyperplane is obtained in the high-dimensional space. At this time, if the structural risk of the linear function is to be minimized, the conditions that need to be satisfied are shown in Eq. (14).

$$ \left\{ {\begin{array}{*{20}l} {\min \left[ {\frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{l} {\left( {\eta_{i} + \eta_{i}^{*} } \right)} } \right]} \hfill \\ {s.t.\left\{ {\begin{array}{*{20}l} {y_{i} - w\varphi \left( {x_{i} } \right) - b \le \varepsilon + \eta_{i} } \hfill \\ { - y_{i} + w\varphi \left( {x_{i} } \right) - b \le \varepsilon + \eta_{i}^{*} } \hfill \\ {\eta_{i} \ge 0,\eta_{i}^{*} \ge 0} \hfill \\ \end{array} } \right.} \hfill \\ \end{array} } \right. $$
(14)

The above problem is transformed into a dual optimization problem by means of the Lagrangian function to obtain Eq. (15).

$$ \left\{ {\begin{array}{*{20}l} {\min \frac{1}{2}\sum\limits_{i,j = 1}^{n} {a_{i} a_{j} y_{i} y_{j} \left( {x_{i} \cdot x_{j} } \right)} - \sum\limits_{i = 1}^{n} {a_{i} } } \hfill \\ {s.t.\left\{ {\begin{array}{*{20}l} {\sum\limits_{i = 1}^{n} {a_{i} y_{i} } = 0} \hfill \\ {0 \le a_{i} \le c} \hfill \\ \end{array} } \right.} \hfill \\ \end{array} } \right. $$
(15)

The kernel function is the key element of SVM. For nonlinear problems, SVM maps the data into a high-dimensional space through a kernel function and then classifies the high-dimensional data with a linear classifier. The polynomial kernel function is shown in Eq. (16).

$$ K\left( {x,x_{i} } \right) = \left[ {\left( {x \cdot x_{i} } \right) + 1} \right]^{d} $$
(16)

The radial basis kernel function is shown in Eq. (17).

$$ K\left( {x,x_{i} } \right) = \exp \left\{ { - \frac{{\left| {x - x_{i} } \right|^{2} }}{{\sigma^{2} }}} \right\} $$
(17)

In Eq. (17), \(\sigma\) denotes the kernel parameter, i.e., the mean square deviation of the function. The sigmoid kernel function is shown in Eq. (18).

$$ K\left( {x,x_{i} } \right) = \tanh \left( {v\left( {x \cdot x_{i} } \right) + c} \right) $$
(18)

In Eq. (18), both \(v\) and \(c\) denote kernel parameters. For simplicity, the kernel function selected for the study is the radial basis function. For the evaluation of the CE development level, the selection of evaluation indicators is very important and needs to follow the principles of comprehensiveness, scientificity, independence, feasibility and dynamics. Meanwhile, the core idea of CE, namely reducing pollution, reducing resource consumption and prolonging product use time, should also be considered. Taking the coal industry as an example, its CE evaluation index system is shown in Table 2.
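As a minimal sketch, the radial basis kernel of Eq. (17) selected by the study can be written as follows, assuming \(\sigma = 1.0\); in a full pipeline this kernel (or an equivalent such as scikit-learn's `SVC(kernel='rbf')`) would be plugged into the SVM trainer.

```python
import numpy as np

def rbf_kernel(x, x_i, sigma=1.0):
    """Eq. (17): K(x, x_i) = exp(-|x - x_i|^2 / sigma^2)."""
    return np.exp(-np.sum((x - x_i) ** 2) / sigma ** 2)

a = np.array([0.0, 0.0])
b = np.array([1.0, 0.0])
k_same = rbf_kernel(a, a)  # identical points give K = 1
k_far = rbf_kernel(a, b)   # similarity decays with squared distance
```

The kernel value acts as a similarity measure: it is 1 for identical points and decays toward 0 as the points move apart, which is what lets a linear classifier in the mapped space separate nonlinear data.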

Table 2 Evaluation index system of circular economy in the coal industry

In Table 2, the study comprehensively evaluates the CE development level of the coal industry from four aspects: industrial development level, resource consumption rate, ecological and environmental protection indicators and resource recycling and reuse rate.

Results and analysis of CE development level projections

MVAR and GPM are two commonly used analytical models. Among them, MVAR lacks clear dynamic relationships, exhibits weak explanatory power for instrumental variables, and is susceptible to spurious regression. To ascertain the efficacy of the SVM-GKNN model and confirm that it overcomes these shortcomings, this study subjects it to a series of tests and comparisons with MVAR and GPM. The test data cover the CE development level of the coal industry in Shaanxi Province over the 8 years from 2008 to 2015. The indicator data from 2008 to 2012 are used as training samples, and the data from 2013 to 2015 as testing samples. All data are taken from the Shaanxi Statistical Yearbook. Before training, the relevant data are normalized to ensure that the prediction results have a small deviation. The normalization formula is shown in Eq. (19).

$$ r^{\prime}_{ij} = \frac{{r_{ij} - r_{i}^{\min } }}{{r_{i}^{\max } - r_{i}^{\min } }} $$
(19)

In Eq. (19), \(r^{\prime}_{ij}\) represents the normalized data, \(r_{ij}\) the original data, \(r_{i}^{\min }\) the minimum value of the indicator, and \(r_{i}^{\max }\) the maximum value of the indicator. The normalized data set is shown in Table 3.
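The min-max normalization of Eq. (19) can be sketched as follows; the indicator series is an illustrative assumption, not data from the Shaanxi Statistical Yearbook.

```python
import numpy as np

def min_max_normalize(r):
    """Eq. (19): scale an indicator series into [0, 1]."""
    r_min, r_max = np.min(r), np.max(r)
    return (r - r_min) / (r_max - r_min)

indicator = np.array([10.0, 15.0, 20.0, 30.0])
normalized = min_max_normalize(indicator)  # [0.0, 0.25, 0.5, 1.0]
```

Unlike the z-score of Eq. (1), this scaling maps every indicator onto the same [0, 1] range, which keeps indicators with large absolute magnitudes from dominating the prediction.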

Table 3 Normalized datasets

In Table 3, among the eight years from 2008 to 2015, the average value of the CE-related indicators is highest in 2015, and the indicator values in the remaining years do not exceed 0.9. At the same time, the relevant data on the resource consumption level are much fewer than those on the other indicators, so the dataset has a significant imbalance problem. Since the data on the industrial development level, resource consumption level, ecological and environmental protection level, and resource recovery and reuse level all have a significant impact on the CE level of the coal industry, the study does not conduct a data sensitivity analysis. The prediction results and absolute errors of the SVM-GKNN model, MVAR, and GPM are shown in Fig. 6.

Fig. 6
figure 6

Prediction results and absolute errors of three models

Figure 6a shows that the MVAR estimates the level of CE development as 0.653, 0.659, 0.668, 0.674, 0.672, 0.679, 0.743, and 0.766, respectively, for the years 2008–2015. The GPM predicts the levels of CE for the years 2008–2015 as 0.667, 0.671, 0.672, 0.689, 0.687, 0.692, 0.750 and 0.789. Moreover, SVM-GKNN's predictions of CE for the 8-year period are 0.681, 0.684, 0.694, 0.702, 0.705, 0.713, 0.768, and 0.767, respectively. The maximum absolute error, minimum absolute error, and average absolute error of SVM-GKNN are 0.014, 0 and 0.005, respectively. In Fig. 6b, the MVAR and the GPM have the greatest absolute errors of 0.048 and 0.035, the lowest absolute errors of 0.001 and 0.004, and the average absolute errors of 0.026 and 0.019. It is evident that there is a greater correlation between the actual values and the predictions made by the SVM-GKNN model. This is because the research predicts the development level of CE by using the GKNN model to classify various indicators to ensure the correctness of the input data. The relative errors and accuracies of the three models are shown in Fig. 7.

Fig. 7
figure 7

Relative error and accuracy of three models

Figure 7a illustrates that the greatest relative errors of the MVAR and the GPM are approximately 6.6% and 4.8%, respectively. The minimum relative errors are approximately 0.1% and 0.5%, and the average relative errors are approximately 3.7% and 2.8%, respectively. Moreover, the maximum, minimum, and average relative errors of the SVM-GKNN model are about 1.9%, 0%, and 0.6%, respectively. According to Fig. 7b, the MVAR, GPM, and SVM-GKNN models have maximum accuracies of about 98.8%, 99.1%, and 99.9%, minimum accuracies of approximately 92.1%, 94.7%, and 98.2%, and average accuracies of approximately 94.3%, 96.2%, and 99.3%, respectively. This indicates that the SVM-GKNN model predicts the outcomes with the highest accuracy. Using the GF in the GKNN algorithm to weight the distances between the test point and its neighbours, so that different distances receive different weights, significantly enhances the precision of data classification. Figure 8 displays the three models' recall and F1-measure.

Fig. 8
figure 8

Recall rates and F1 measure of three models

From Fig. 8a, the highest recalls of MVAR and GPM are about 87.5% and 88.2%, the lowest about 85.8% and 86.9%, and the averages about 86.6% and 87.7%, respectively. The highest, lowest, and average recalls of the SVM-GKNN model are about 89.8%, 88.5%, and 89.1%, respectively. From Fig. 8b, the highest F1-measures of the MVAR, GPM and SVM-GKNN models are about 0.89, 0.91 and 0.93, the lowest about 0.86, 0.87 and 0.90, and the averages about 0.88, 0.89 and 0.92, respectively. These findings show that the SVM-GKNN model has better prediction accuracy for the CE development level. Figure 9 displays the three models' predictions for the levels of industrial development, resource consumption, ecological preservation, and resource recycling.

Fig. 9
figure 9

Predictions of industrial development levels, resource consumption levels, ecological protection levels and resource recycling levels by three models

In Fig. 9a, the maximum values predicted by the SVM-GKNN model for the industrial development, resource consumption, ecological protection, and resource recycling levels during 2008–2015 are 1.001, 0.838, 0.737, and 1.002, with average values of about 0.604, 0.713, 0.641, and 0.682, respectively. In Fig. 9b, the maximum predictions of the GPM for the four indicators are 0.984, 0.801, 0.725, and 0.991, with average predictions of around 0.590, 0.694, 0.625, and 0.665, respectively. In Fig. 9c, the maximum predictions of the MVAR for the four indicators are 0.972, 0.786, 0.703, and 0.982, with average predictions of about 0.573, 0.679, 0.605, and 0.648, respectively. The prediction accuracy of the three models for the different indicators is shown in Fig. 10.

Fig. 10. Prediction accuracy of the three models for different indicators

In Fig. 10a, the MVAR, GPM, and SVM-GKNN models achieve their highest prediction accuracies for the level of industrial development at about 96.6%, 96.7%, and 98.7%, with average accuracies of about 93.4%, 95.5%, and 98.1%, respectively. In Fig. 10b, the highest accuracies of the three models in predicting the level of resource consumption are about 95.8%, 96.6%, and 99.4%, with average accuracies of about 93.8%, 96.0%, and 98.8%, respectively. In Fig. 10c, the highest accuracies in predicting the level of ecological protection are about 92.2%, 93.6%, and 95.4%, with average accuracies of about 91.3%, 92.7%, and 94.9%, respectively. From Fig. 10d, the highest prediction accuracies of the MVAR, GPM, and SVM-GKNN models for the level of resource recycling are about 94.5%, 95.2%, and 96.4%, the lowest about 93.1%, 94.2%, and 95.3%, and the averages about 93.8%, 94.8%, and 95.9%, respectively. For every layer of indicators, the SVM-GKNN model achieves the highest prediction accuracy. To further verify the performance of the proposed model, the study compared it with a fuzzy time series-based prediction model (FTS-CDTS); the results are shown in Fig. 11.

Fig. 11. Prediction results of the different models

According to Fig. 11a, the fit between the predictions of the SVM-GKNN model and the actual results, about 0.95, is higher than that of the FTS-CDTS model. Figure 11b shows that the maximum error of the SVM-GKNN model is only 0.014 with an average error of 0.006, whereas those of the FTS-CDTS model are 0.021 and about 0.011, respectively. These results show that the SVM-GKNN model outperforms the FTS-CDTS model. To verify the influence of the GF on prediction performance, the study performed ablation experiments; the results are shown in Table 4.

Table 4 Results of the ablation experiments

In Table 4, the prediction accuracies of the SVM-KNN and SVM algorithms without GF weighting are 84.6% and 88.3%, respectively, whereas the prediction accuracy of the SVM-GKNN algorithm with GF weighting reaches 97.5%. These results indicate that introducing the GF to allocate weights effectively improves prediction accuracy.
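The effect isolated by the ablation can be demonstrated with a toy sketch in which the same k-NN vote is run with and without Gaussian weights. This is an illustrative construction, not the paper's experiment; the data and function name are hypothetical.

```python
import math

def knn_predict(train, query, k=3, sigma=None):
    """k-NN majority vote; Gaussian distance weights if `sigma` is given,
    uniform (unweighted) votes otherwise."""
    near = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    votes = {}
    for x, y in near:
        d = math.dist(x, query)
        w = math.exp(-d * d / (2 * sigma * sigma)) if sigma is not None else 1.0
        votes[y] = votes.get(y, 0.0) + w
    return max(votes, key=votes.get)

# One very close class-0 sample vs. two distant class-1 samples:
train = [((0.0, 0.0), 0), ((3.0, 0.0), 1), ((3.2, 0.0), 1)]
query = (0.5, 0.0)
```

With uniform votes the two distant class-1 neighbours win, while Gaussian weighting lets the single nearby class-0 neighbour dominate, mirroring why the GF-weighted variant classifies more accurately in Table 4.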

Conclusion

Humans are becoming increasingly concerned about protecting the environment as civilization and the economy continue to develop rapidly. The concept of CE was put forward to reconcile the conflict between development and the environment, but gauging the progress of CE development has proven very difficult. To address this, the study proposes a CE evaluation model based on SVM-GKNN and compares it with the GPM and the MVAR. The experimental results showed that the SVM-GKNN model achieved a maximum absolute error of 0.014 and a mean absolute error of 0.006, and a maximum relative error and mean relative error of about 1.9% and 0.6%, respectively, all lower than those of the MVAR and the GPM. The average prediction accuracies of the MVAR, GPM, and SVM-GKNN models for the overall development level of CE were about 94.3%, 96.2%, and 99.3%, respectively, with average recalls of about 86.6%, 87.7%, and 89.1% and average F1-measures of about 0.88, 0.89, and 0.92. In predicting the indicators at each level, the average prediction accuracies of the three models were about 93.4%, 95.5%, and 98.1% for the level of industrial development, 91.3%, 92.7%, and 94.9% for the level of ecological protection, 93.8%, 96.0%, and 98.8% for the level of resource consumption, and 93.8%, 94.8%, and 95.9% for the level of resource recycling, respectively.
These results indicate that the prediction accuracy of the SVM-GKNN model is better than that of the MVAR and the GPM, both for the overall development level of CE and for the indicators at each level. However, because time complexity was not considered when the algorithm was constructed, its time complexity is relatively high. In addition, the limited data size leaves the model's fitting performance open to question. Future work will aim to reduce the algorithm's time complexity, build a lightweight model for predicting the level of CE development, and expand the dataset.

Availability of data and materials

No datasets were generated or analysed during the current study.


Funding

The research is supported by: 2023 Provincial Natural Science Foundation Project of Fujian, "Effect Analysis and Mechanism Research on the Digital Economy Empowering Urban Green High-Quality Development" (No. 2023J05219).

Author information

Authors and Affiliations

Authors

Contributions

Zhezhou Li: Writing—original draft; Hexiang Huang: Writing—review and editing; all authors reviewed the manuscript.

Corresponding author

Correspondence to Hexiang Huang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Li, Z., Huang, H. Application of KNN algorithm incorporating Gaussian functions in green and high-quality development of cities empowered by circular economy. Energy Inform 7, 65 (2024). https://doi.org/10.1186/s42162-024-00372-w
