Data Mining Mehmed Kantardzic (good english books to read .txt) 📖
- Author: Mehmed Kantardzic
Book online «Data Mining Mehmed Kantardzic (good english books to read .txt) 📖». Author Mehmed Kantardzic
9.5 INCREMENTAL CLUSTERING
There are more and more applications where it is necessary to cluster a large collection of data. The definition of “large” has varied with changes in technology. In the 1960s, “large” meant several-thousand samples for clustering. Now, there are applications where millions of samples of high dimensionality have to be clustered. The algorithms discussed above work on large data sets, where it is possible to accommodate the entire data set in the main memory. However, there are applications where the entire data set cannot be stored in the main memory because of its size. There are currently three possible approaches to solve this problem:
1. The data set can be stored in a secondary memory and subsets of this data are clustered independently, followed by a merging step to yield a clustering of the entire set. We call this approach the divide-and-conquer approach.
2. An incremental-clustering algorithm can be employed. Here, data are stored in the secondary memory and data items are transferred to the main memory one at a time for clustering. Only the cluster representations are stored permanently in the main memory to alleviate space limitations.
3. A parallel implementation of a clustering algorithm may be used where the advantages of parallel computers increase the efficiency of the divide-and-conquer approach.
An incremental-clustering approach is most popular, and we will explain its basic principles. The following are the global steps of the incremental-clustering algorithm.
1. Assign the first data item to the first cluster.
2. Consider the next data item. Either assign this item to one of the existing clusters or assign it to a new cluster. This assignment is done based on some criterion, for example, the distance between the new item and the existing cluster centroids. In that case, after every addition of a new item to an existing cluster, recompute a new value for the centroid.
3. Repeat step 2 until all the data samples are clustered.
The space requirements of the incremental algorithm are very small, necessary only for the centroids of the clusters. Typically, these algorithms are non-iterative and therefore their time requirements are also small. But, even if we introduce iterations into the incremental-clustering algorithm, computational complexity and corresponding time requirements do not increase significantly. On the other hand, there is one obvious weakness of incremental algorithms that we have to be aware of. Most incremental algorithms do not satisfy one of the most important characteristics of a clustering process: order-independence. An algorithm is order-independent if it generates the same partition for any order in which the data set is presented. Incremental algorithms are very sensitive to the order of samples, and for different orders they generate totally different partitions.
Let us analyze the incremental-clustering algorithm with the sample set given in Figure 9.6. Suppose that the order of samples is x1, x2, x3, x4, x5 and the threshold level of similarity between clusters is δ = 3.
1. The first sample x1 will become the first cluster C1 = {x1}. The coordinates of x1 will be the coordinates of the centroid M1 = {0, 2}.
2. Start analysis of the other samples.
(a) Second sample x2 is compared with M1, and the distance d is determined
Therefore, x2 belongs to the cluster C1. The new centroid will be
(b) The third sample x3 is compared with the centroid M1 (still the only centroid):
(c) The fourth sample x4 is compared with the centroid M1:
Because the distance of the sample from the given centroid M1 is larger than the threshold value δ, this sample will create its own cluster C2 = {x4} with the corresponding centroid M2 = {5, 0}.
(d) The fifth sample x5 is compared with both cluster centroids:
The sample is closer to the centroid M2, and its distance is less than the threshold value δ. Therefore, sample x5 is added to the second cluster C2:
3. All samples are analyzed and a final clustering solution of two clusters is obtained:
The reader may check that the result of the incremental-clustering process will not be the same if the order of the samples is different. Usually, this algorithm is not iterative (although it could be) and the clusters generated after all the samples have been analyzed in one iteration are the final clusters. If the iterative approach is used, the centroids of the clusters computed in the previous iteration are used as a basis for the partitioning of samples in the next iteration.
For most partitional-clustering algorithms, including the iterative approach, a summarized representation of the cluster is given through its clustering feature (CF) vector. This vector of parameters is given for every cluster as a triple, consisting of the number of points (samples) of the cluster, the centroid of the cluster, and the radius of the cluster. The cluster’s radius is defined as the square-root of the average mean-squared distance from the centroid to the points in the cluster (averaged within-cluster variation). When
Comments (0)