1. Introduction
Today, data streams are prevalent in almost every real-world application. A data stream is defined as voluminous data arriving continuously and most likely evolving over time with unknown dynamics [1]. Examples of applications related to streaming data include fraud detection, weather monitoring, the Internet of Things, and website and network monitoring [5]. In such complex real-world problems, uncertainty is likely to emerge due to inadequate, incomplete, untrustworthy, vague, and inconsistent data [6]. In general, these different kinds of information deficiencies may bring about different types of uncertainty, and handling uncertainty inappropriately leads to wrong conclusions [6]. Thus, data streams are characterized by three key attributes: an unbounded sequence of data records, huge volumes of data, and uncertainty associated with the data. Data streams therefore create a new and challenging environment for data mining activities. Clustering is an important data mining activity. It is the process of allocating data points to clusters such that data points within a cluster are more similar to one another than to data points in other clusters [7]. Clustering algorithms are grouped into two categories: hierarchical and partitioning algorithms. For effective clustering of large datasets, partitioning algorithms are preferred over hierarchical algorithms. Among the partitioning algorithms, fuzzy-based clustering techniques are widely used. Fuzzy C-Means (FCM) can handle data uncertainty and can be used as a fuzzy clustering method [9]. However, FCM is very sensitive to noise and to large data uncertainties. Therefore, a more effective method for reducing data uncertainty, named Interval Type-2 Fuzzy C-Means (IT2FCM), has been proposed [10]. Hence, IT2FCM is a suitable clustering technique for handling data streams, where uncertainty is likely to emerge in different forms.
IT2FCM is a clustering method based on an objective function (OF) and uses the alternating optimization (AO) method to minimize the OF. AO determines appropriate cluster centers starting from an arbitrarily initialized membership matrix. This random initialization often causes IT2FCM to get stuck in local optima and thereby fail to generate suitable cluster centers [15]. This issue was addressed in our previous work by optimizing IT2FCM using ant colony optimization (ACO) [15]. However, IT2FCM-ACO generates clusters over a complete dataset, which is not desirable for clustering data streams. Since data streams are likely to evolve with time, the generated clusters also vary with time. For example, a large amount of healthcare data may require the user to observe clusters generated over time. Moreover, the volume and the time span of incoming data make it infeasible to store the data in memory or on disk for further analysis. Therefore, a clustering algorithm is required that is capable of partitioning incoming data continuously while respecting memory and time restrictions. In such situations, an incremental clustering approach is more suitable: the algorithm learns from new incoming data without the need to start from scratch, with limited memory requirements, while achieving good efficiency and accuracy.
Lately, numerous algorithms have been proposed for clustering large and streaming datasets. However, the focus has been primarily on FCM clustering. Commonly, there are two main techniques: distributed clustering and the progressive or random sampling approach [16]. Distributed clustering is based on different incremental styles: the data is divided into chunks and the methods cluster one chunk of data at a time. In the progressive and random sampling-based approach, the data is clustered using a small sample obtained by either progressive or random sampling, and the sample size is gradually increased until the cluster quality can no longer improve. Examples of these methods include random sampling plus extension FCM (rseFCM) [16], multi-round sampling (MRS) [18], single-pass FCM (spFCM) [19], bit-reduced FCM (brFCM) [20], kernel FCM (kFCM) [23], approximate kernel FCM (akFCM) [24], online FCM (oFCM) [25], and okFCM, rsekFCM, and spkFCM [26]. Among these algorithms, rseFCM, rsekFCM, akFCM, and MRS are suitable for voluminous data but cannot perform clustering incrementally. On the other hand, spFCM, oFCM, brFCM, kFCM, and their variants process data chunk by chunk. Among the incremental methods, oFCM is the least efficient and accurate; brFCM performance suffers for multi-dimensional data, i.e., when the size of the data increases in terms of the number of variables; kFCM is computationally expensive for large datasets; and spFCM performance degrades at a low sampling rate. However, compared to the other algorithms, spFCM achieves accuracy similar to or higher than that of FCM and scales efficiently for high-dimensional data. Although spFCM has been applied successfully for speeding up clustering, it is limited to the fuzzy clustering method.
Therefore, for scalable and incremental learning of IT2FCM-ACO, this paper proposes a modified IT2FCM-ACO algorithm. IT2FCM-ACO is improved by implementing a single-pass approach that generates clusters in a single pass over the data with restricted memory. The proposed algorithm not only improves upon the computational efficiency of IT2FCM-ACO for large datasets but also achieves higher accuracy. Moreover, it performs clustering by processing data incrementally. Therefore, the algorithm handles the challenges faced by IT2FCM-ACO for uncertain, unbounded, and voluminous data streams. The paper is divided into the following sections: Section Two describes related work in detail; Section Three gives an overview of the background of IT2FCM-ACO and the single-pass method; Section Four describes the proposed methodology in detail; Section Five analyses the results obtained by comparing the proposed method, spIT2FCM-ACO, with IT2FCM-ACO, IT2FCM-AO, and GAIT2FCM by means of certain fuzzy evaluation metrics, and evaluates the computational efficiency of the algorithm with regard to run time and speedup, along with a statistical analysis of the proposed algorithm; Section Six presents the conclusion of this research study.
2. Related Work
Extensible Fast FCM (eFFCM) [27] is a popular method for clustering large data. It generates clusters on samples obtained through a progressive sampling technique, in which the size of each subsample is constant. The process continues until the sample satisfies a statistical significance test. The cluster centers obtained for the sample are then extended to the entire dataset. eFFCM does not generate fuzzy partitions for the entire dataset; rather, it generates partitions for a sample that captures the overall nature of the data. Although it is a simple and fast technique for clustering large datasets, its sampling technique is not efficient and, in certain cases, the data reduction is not sufficient for large data. This issue was handled by the Random Sampling Plus Extension FCM (rseFCM) method [16]. It is also a simple, non-iterative clustering technique in which samples are generated randomly without replacement. However, it is not incremental and thus not suitable for data streams, and the results generated are often less consistent.
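The sample-then-extend idea behind rseFCM can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation from [16]: the function names (`memberships`, `update_centers`, `fcm`, `rse_fcm`), the fuzzifier m = 2, the stopping tolerance, and the random initialization from sample points are all choices made here for illustration.

```python
import random

def memberships(points, centers, m=2.0):
    """Return U with U[j][i] = FCM membership of point j in cluster i."""
    U = []
    for x in points:
        # Squared Euclidean distances, floored to avoid division by zero.
        d2 = [max(sum((a - b) ** 2 for a, b in zip(x, v)), 1e-12) for v in centers]
        inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
        total = sum(inv)
        U.append([t / total for t in inv])
    return U

def update_centers(points, U, m=2.0):
    """Standard FCM center update: mean of the points weighted by u^m."""
    dim = len(points[0])
    centers = []
    for i in range(len(U[0])):
        coef = [row[i] ** m for row in U]
        total = sum(coef)
        centers.append([sum(k * x[l] for k, x in zip(coef, points)) / total
                        for l in range(dim)])
    return centers

def fcm(points, c, rng, m=2.0, iters=200, tol=1e-6):
    """Plain FCM by alternating optimization, started from c random points."""
    centers = [list(p) for p in rng.sample(points, c)]
    for _ in range(iters):
        new = update_centers(points, memberships(points, centers, m), m)
        shift = max(abs(a - b) for v, w in zip(centers, new) for a, b in zip(v, w))
        centers = new
        if shift < tol:
            break
    return centers

def rse_fcm(points, c, sample_size, m=2.0, seed=0):
    """Cluster a random sample, then extend the result to all points."""
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)          # random sample, no replacement
    centers = fcm(sample, c, rng, m)                  # cluster the sample only
    return centers, memberships(points, centers, m)   # extension to the full dataset
```

Only the sample is ever clustered iteratively; the full dataset is touched once, in the final membership computation. This is why the approach is fast but, as noted above, neither incremental nor independent of the sample drawn.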
Multi-Round Sampling FCM (MRSFCM) is another random-sampling, iterative approach for clustering large data [18]. In this method, samples of fixed size are generated randomly and clustered using FCM. In each iteration, a new sample is generated and combined with the previous sample, and the obtained cluster centers are compared with the prior result. If the results are similar, the algorithm stops and the result is extended to the complete dataset. Both rseFCM and MRSFCM are very quick algorithms since they cluster samples of data rather than the complete dataset. However, the sample size affects their performance: for small sample sizes, performance degrades as the error in the cluster locations increases. In rseFCM, no statistical method is used to estimate the optimal sample size. To overcome this issue, a variant of rseFCM called Minimum Sample Estimate Random FCM (MSERFCM) was introduced, which uses a statistical method to determine the correct sample size [
Single Pass FCM (spFCM) [19], Online FCM (oFCM) [25], and Bit-Reduced FCM (brFCM) [20] are clustering algorithms based on weighted FCM (wFCM). In FCM, each data object is given equal weight, but in wFCM, weights are introduced to describe the relative significance of each data object in the clustering solution. A data object associated with a higher weight is more significant in determining the cluster centers. It is proven to be an effective clustering algorithm for both large and multi-dimensional datasets. In addition, compared to FCM, it can reach optimal solutions in considerably fewer iterations [20]. Its drawback is that its performance degrades at small data sizes. Additionally, no statistical method is proposed for generating samples. The oFCM algorithm is similar to spFCM: data samples are generated in a similar manner and are clustered using wFCM. The only difference is that the weighted cluster centers generated at the end of each iteration are not added as new data points in the next iteration. Although it is similar to spFCM, its performance is considerably inferior in terms of speedup and quality [27]. brFCM was developed to handle the challenge of clustering large images. In this algorithm, bin centers are generated for the input data and are clustered using wFCM, with the number of data objects in each bin serving as the weights. Bins can be generated in different ways. One of the easiest methods is s-bin histograms, where the weights are the number of pixels in each bin and the data are the bin centers. brFCM also does not produce partitions for the complete dataset; however, the results obtained can easily be extended to the entire dataset. For a 1D MRI image dataset, brFCM was shown to achieve the best accuracy compared to rseFCM, spFCM, and oFCM [16]. However, if the bins are not generated efficiently and accurately, the performance of brFCM suffers. Moreover, generating bins for multi-dimensional datasets creates challenges for the algorithm [
Kernel FCM (kFCM) [29] is essentially a fuzzy partitioning of large datasets in kernel-induced spaces. The basic idea is to map the input data into a high-dimensional space using a kernel map so that clustering becomes a linear problem. In kFCM, the input data is not transformed explicitly; instead, dot products in the transformed space are represented by a kernel function. Commonly used kernel functions are the polynomial and radial basis functions. Thus, for a given set of data, a corresponding kernel matrix is generated, which represents all pairwise dot products of the data in the transformed high-dimensional space. In FCM, the objective function is computed from the distance between the data points and the cluster centers, while in kFCM it is computed from the distance between the data points and the cluster centers in the kernel space. The major challenge in kFCM is the selection of an appropriate kernel function and the optimal values of the associated parameters. Additionally, storing and computing the kernel matrix for the entire dataset makes it computationally very expensive. To overcome the issue of computational complexity, the kernel distance function was approximated by constraining the space in which the cluster centers exist [21]. The difference-of-convex algorithm (DCA) [30] is another approach that has been considered to solve this issue; it has been applied to kFCM to further improve its computational efficiency and reduce its memory requirement. Several variations of kFCM are found in the literature, such as Random Sampling and Extension kFCM (rsekFCM), Single Pass kFCM (spkFCM), Online kFCM (okFCM), and Weighted kFCM (wkFCM) [26]. These follow the same basic steps as rseFCM, spFCM, oFCM, and wFCM, respectively, but with the objective function replaced by that of weighted kFCM. However, the results obtained by these methods are less accurate than kFCM [
4. Proposed Methodology
The single-pass approach to IT2FCM-ACO (spIT2FCM-ACO) is proposed for incremental clustering of large or very large datasets, as explained in Algorithm 1. Rather than loading the complete dataset, the algorithm loads some percentage of the data based on the available memory. In each Partial Data Access (PDA), the input data are distributed into c clusters using the IT2FCM-ACO approach. As in spFCM, the OF and the cluster centroids in IT2FCM-ACO are modified to incorporate the effects of weights. The OF of IT2FCM-ACO in Equation (4) is modified and defined as a constrained optimization problem for m1 and m2, given by Equation (12). This weighted variant of IT2FCM-ACO will be referred to as weighted IT2FCM-ACO (wIT2FCM-ACO). In Equation (12), the weight term is a set of weights that describes the significance of each attribute.
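For orientation, a weighted objective of the following general form would be consistent with the description above. It is only a sketch modelled on the weighted FCM objective used by spFCM. The symbols are assumptions introduced here: w_j is the weight of data point x_j, u_ij its membership in cluster i, d_ij the distance between x_j and centroid v_i, and n_s the number of points in the current PDA. The exact form and constraints of Equation (12) in the paper may differ.

```latex
J_{m}(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n_s} w_j \, u_{ij}^{m} \, d_{ij}^{2},
\qquad m \in \{m_1, m_2\},
\qquad \text{subject to } \sum_{i=1}^{c} u_{ij} = 1 \ \text{for all } j .
```

With all w_j = 1 this reduces to the unweighted FCM objective evaluated at each of the two fuzzifiers.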
Algorithm 1: Single-pass IT2FCM-ACO (spIT2FCM-ACO) |
Input: X, c, m1, m2, min_impro, , , , ns, Output: U, V, Jm and 1000, min_impro= 10−5 Initialize fuzzifier m1 = 1.7, m2 = 2.6, ACO parameters: 0.005, 0.01, 1.0 Initialize |
00 | Load X as ns sized samples, |
01 | Initialize |
02 | For first PDA since there are no previous c weighted points for , c=0. |
03 | for do |
04 | for z=1 to do |
05 | Repeat |
06 | for j=1 to |
07 | for i=1, …., c |
08 | set |
09 | if end if |
10 | if end if |
11 | randomly set |
12 | if end if |
13 | if end if |
14 | end for |
15 | for i=1, …., c |
16 |
17 |
18 | end for |
19 | end for |
20 | until |
21 | compute type-1 fuzzy set of cluster centroids where recomputed using procedure defined in algorithm 2. |
22 | update |
23 | obtain the crisp value of cluster centroids |
24 | compute crisp value of fuzzy partition matrix using |
25 | type reduce weights |
26 | calculate objective function using “Equation (12)” |
27 | Type reduce objective function |
28 | if then end if |
29 | for i=1, …., c |
30 | for j = 1, …, |
31 | update pheromone matrices |
32 |
33 |
34 | end for |
35 | end for |
36 | if t > 1, if min_impro break; end if, end if |
37 | end for |
38 | compute lower and upper values of weight of c condensed data points |
39 | , , |
40 | set |
41 | , , |
42 | set |
43 | End |
In each subsequent PDA, the data are grouped into c clusters along with the previously condensed data points. That is, the new data points are clustered together with the previous c weighted data points, condensed again into c new weighted data points, and then grouped with the new incoming data in the next PDA. This is referred to as incremental clustering. In spFCM, a single weight matrix defining the weight of each individual data point is used for evaluating the cluster centroids. In the proposed spIT2FCM-ACO methodology, however, lower and upper weight matrices are defined. Since the weights are important in determining the location of the centers, a data object may have a different influence on the left center and the right center of a cluster.
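The incremental scheme with a single weight per point, as in spFCM, can be sketched as below. This is a type-1 illustration under stated assumptions, not the proposed interval type-2 method: the function names, the fuzzifier m = 2, the initialization from the first c stacked points, and the condensation rule (new weight of centroid i = sum over points of membership times point weight) are assumptions made here for illustration.

```python
import random

def weighted_fcm(points, weights, c, m=2.0, iters=200, tol=1e-6):
    """Weighted FCM. Returns (centers, U), U[j][i] = membership of point j in cluster i."""
    centers = [list(v) for v in points[:c]]      # first c points seed the centers
    U = []
    for _ in range(iters):
        U = []
        for x in points:
            d2 = [max(sum((a - b) ** 2 for a, b in zip(x, v)), 1e-12) for v in centers]
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            total = sum(inv)
            U.append([t / total for t in inv])
        new = []
        for i in range(c):
            # Each point contributes w_j * u_ij^m to center i.
            coef = [w * row[i] ** m for w, row in zip(weights, U)]
            total = sum(coef)
            new.append([sum(k * x[l] for k, x in zip(coef, points)) / total
                        for l in range(len(points[0]))])
        shift = max(abs(a - b) for v, n in zip(centers, new) for a, b in zip(v, n))
        centers = new
        if shift < tol:
            break
    return centers, U

def sp_fcm(data, c, chunk_size, m=2.0, seed=0):
    """Single pass over the data: cluster each chunk with the condensed centroids."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)                            # random chunks without replacement
    centers, cweights = [], []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        points = centers + chunk                 # previous centroids come first
        weights = cweights + [1.0] * len(chunk)  # new points get weight 1
        centers, U = weighted_fcm(points, weights, c, m)
        # Condense: each centroid inherits the membership-weighted mass of the points.
        cweights = [sum(w * row[i] for w, row in zip(weights, U)) for i in range(c)]
    return centers, cweights
```

Because every membership row sums to 1, the condensed weights always sum to the number of points seen so far, so the c weighted centroids act as a compact summary of all earlier chunks. Only one chunk plus c centroids is held in memory at a time.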
Suppose X is a dataset containing n data examples. First, a PDA is generated randomly without replacement by taking a certain percentage of the dataset. Suppose s PDAs are obtained from the dataset in this way, each containing ns data examples. Two cases are used to explain the proposed methodology. In Case 1, the first PDA is taken and clustered using IT2FCM-ACO. In Case 2, the subsequent PDAs are generated and clustered using IT2FCM-ACO.
Case 1: s = 1. The first PDA is loaded in the memory and there are no previous weighted data points. The lower and upper weight matrices are initialized to 1. Then, the lower and upper membership matrices are initialized with the lower and upper values of the pheromone matrix, given by Equations (5) and (6), respectively. Once the membership matrices are obtained, the cluster centroids are updated using (13). In this equation, the left and right weights assigned to the data points are estimated for the left and right cluster centroids as explained in Algorithm 2.
Algorithm 2: Type-reducing weighted cluster centers (proposed algorithm) |
00 | For arbitrary fuzzifier m |
01 | compute centers using |
02 | where and |
03 | sort the n data patterns in each of d attributes (l=1, …, d) in ascending order |
04 | compute |
05 | for all data patterns |
06 | find interval index k such that |
07 | for all data patterns |
08 | if then |
09 | set primary membership and set weight else |
10 | set primary membership and set weight |
11 | end if |
12 | end for |
13 | and |
14 | end for |
15 | is calculated using the same procedure as above replacing the “if statement” as follows |
16 | for all data patterns |
17 | if then |
18 | set primary membership and set weight else |
19 | set primary membership and set weight |
20 | end if |
21 | end for |
22 | and |
23 | compute maximum value of and using equation (13) |
Once these values are determined using Algorithm 2, they are further defuzzified to obtain their crisp values, as shown in Algorithm 1, lines 23–25. The defuzzified values are used to determine the OF. Subsequently, the lower and upper pheromone matrices are updated as shown at lines 32 and 33, respectively. The pheromone update equations are modified to incorporate the influence of the weights, as shown in (14) and (15), which involve the lower and upper pheromone matrices and the type-reduced weight. If the termination condition is satisfied, the data points are condensed into c weighted data points. The lower membership matrix with the type-reduced weight is used to update the lower weight, while the upper membership matrix updates the upper weight, as given by (16) and (17), respectively; these give the weights of the new c data points. Once the data are clustered into c weighted points, the data are released from the memory. These clustered data points are added to the next PDA.
Case 2: s > 1. In this case, each subsequent PDA is loaded in the memory and partitioned into clusters together with the previous c condensed data points. Thus, in each PDA there are ns + c data points held in memory for generating clusters. The new incoming ns data points are loaded in the memory and assigned weight 1. The c condensed data points are assigned the lower and upper weights computed from the previous clustering, given by (16) and (17), respectively. The combined data are clustered using wIT2FCM-ACO and are again reduced to c new weighted data points. The new data points are represented by the c cluster centroids, and their weights are computed as shown in (18) and (19) over the ns new data points and the c condensed data points obtained from the previous clustering. Once the data are clustered into c weighted data points, the data are released from memory. These clustered data points are again added to the next PDA.
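The bookkeeping of lower and upper weights across the two cases can be sketched as follows. This is an illustration under loud assumptions, since Equations (16) to (19) are not reproduced here: lower and upper memberships are taken as the minimum and maximum of the type-1 memberships at the two fuzzifiers, the type-reduced weight is taken as the mean of the lower and upper weights, and each condensed weight is the sum of membership times type-reduced weight. All function names are hypothetical.

```python
def fcm_membership(d2, m):
    """Type-1 FCM memberships of one point from its squared distances to the centers."""
    inv = [max(d, 1e-12) ** (-1.0 / (m - 1.0)) for d in d2]
    total = sum(inv)
    return [t / total for t in inv]

def interval_memberships(points, centers, m1=1.7, m2=2.6):
    """Lower/upper memberships as min/max over the two fuzzifiers (assumption)."""
    U_lo, U_hi = [], []
    for x in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(x, v)) for v in centers]
        u1, u2 = fcm_membership(d2, m1), fcm_membership(d2, m2)
        U_lo.append([min(a, b) for a, b in zip(u1, u2)])
        U_hi.append([max(a, b) for a, b in zip(u1, u2)])
    return U_lo, U_hi

def condense_weights(U_lo, U_hi, w_lo, w_hi):
    """Assumed condensation: lower/upper weight of centroid i = sum_j u_ij * w_j,
    with w_j the type-reduced (mean) weight of point j."""
    w = [(a + b) / 2.0 for a, b in zip(w_lo, w_hi)]
    c = len(U_lo[0])
    new_lo = [sum(wj * row[i] for wj, row in zip(w, U_lo)) for i in range(c)]
    new_hi = [sum(wj * row[i] for wj, row in zip(w, U_hi)) for i in range(c)]
    return new_lo, new_hi

def stack_pda(centroids, w_lo_prev, w_hi_prev, chunk):
    """Case 1: pass empty lists for the previous state. Case 2: previous c centroids
    keep their lower/upper weights, and the ns new points get weight 1."""
    points = [list(v) for v in centroids] + [list(x) for x in chunk]
    w_lo = list(w_lo_prev) + [1.0] * len(chunk)
    w_hi = list(w_hi_prev) + [1.0] * len(chunk)
    return points, w_lo, w_hi
```

In Case 1 the stacked weights are all 1. In Case 2 the stack holds ns + c entries, and the first c of them carry the interval weights from the previous PDA.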
Algorithm 2 explains the methodology for obtaining the type-reduced cluster centers along with the type-reduced weights. For any arbitrary fuzzifier m, the weighted cluster centers are computed as shown at line 01, where the membership is evaluated as the mean of the lower and upper membership matrices. Similarly, the weight is determined as the mean of the lower and upper weights. Then the data patterns are sorted in ascending order for each attribute. To compute the right cluster center, the values of the right membership matrix and the right weight are calculated as follows. The index k is determined such that the cluster center obtained from line 01 lies between two successive data patterns. For all data patterns whose index is less than k, the lower membership and lower weight are used; otherwise, the upper membership and upper weight are taken. Once all the data are exhausted, the right membership matrix and right weight are determined by taking the union of all the memberships and weights, as shown at line 13. Similarly, the left cluster center is determined by obtaining the left membership matrix and left weight, as in lines 15–22. At line 23, the left and right centers are determined using Equation (13).
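The switching procedure described above can be sketched for a single attribute and a single cluster as follows. This is a hedged illustration in the style of Karnik-Mendel type reduction, not a transcription of Algorithm 2: the function name, the combination of weight and membership as w * u^m, and the repeat-until-stable loop are assumptions. Comparing each value with the current center is equivalent to sorting the data and locating the switch index k.

```python
def type_reduce_center(xs, u_lo, u_hi, w_lo, w_hi, m=2.0, side="right", max_iter=100):
    """Left or right end point of a weighted interval type-2 cluster center (1-D)."""
    n = len(xs)
    lo = [w_lo[j] * u_lo[j] ** m for j in range(n)]   # smallest contribution of point j
    hi = [w_hi[j] * u_hi[j] ** m for j in range(n)]   # largest contribution of point j
    # Starting center: mean membership and mean weight, as at line 01.
    mid = [((u_lo[j] + u_hi[j]) / 2.0) ** m * (w_lo[j] + w_hi[j]) / 2.0 for j in range(n)]
    v = sum(t * x for t, x in zip(mid, xs)) / sum(mid)
    for _ in range(max_iter):
        if side == "right":
            # Points at or below the center get lower values, points above get upper.
            theta = [lo[j] if xs[j] <= v else hi[j] for j in range(n)]
        else:
            # Mirror image for the left center.
            theta = [hi[j] if xs[j] <= v else lo[j] for j in range(n)]
        new_v = sum(t * x for t, x in zip(theta, xs)) / sum(theta)
        converged = abs(new_v - v) < 1e-12
        v = new_v
        if converged:
            break
    return v
```

For the right center, giving the larger contribution to points above the current center can only move the center upward, and the opposite holds for the left center, so the two results bracket the starting center. When the lower and upper memberships and weights coincide, both end points collapse to the ordinary weighted center.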
Figure 1 shows the single-pass clustering technique applied to IT2FCM-ACO. As shown in the figure, n chunks of data are generated randomly from the dataset. Chunk 1 is clustered using IT2FCM-ACO and the obtained cluster centroids are added as new data points to the next data chunk. Chunk 1 is then released from the memory, which saves time and memory. This process continues until all chunks are scanned; the final cluster centers are taken as the cluster centers for the entire dataset.