If samples contain categorical data, we cannot compute centroids as representatives of the clusters. In that case, an additional algorithm called K-nearest neighbor may be used to estimate distances (or similarities) between samples and existing clusters. The basic steps of the algorithm are
1. to compute the distances between the new sample and all previous samples, already classified into clusters;
2. to sort the distances in increasing order and select K samples with the smallest distance values; and
3. to apply the voting principle. A new sample is added (classified) to the cluster that contains the majority of the K selected samples.
For example, suppose six 6-dimensional categorical samples X1, X2, …, X6 are gathered into two clusters: C1 = {X1, X2, X3} and C2 = {X4, X5, X6}. How does one classify the new sample Y = {A, C, A, B, C, A}?
To apply the K-nearest neighbor algorithm, the first step is to find all distances between the new sample and the samples already clustered. Using the SMC measure, we can work with similarities instead of distances between samples:

Similarities with elements in C1      Similarities with elements in C2
SMC(Y, X1) = 4/6 = 0.66               SMC(Y, X4) = 4/6 = 0.66
SMC(Y, X2) = 3/6 = 0.50               SMC(Y, X5) = 2/6 = 0.33
SMC(Y, X3) = 2/6 = 0.33               SMC(Y, X6) = 2/6 = 0.33
Using the 1-nearest neighbor rule (K = 1), the new sample cannot be classified because two samples (X1 and X4) have the same, highest similarity (smallest distance), and one of them is in class C1 while the other is in class C2. On the other hand, using the 3-nearest neighbor rule (K = 3) and selecting the three largest similarities in the set, we can see that two samples (X1 and X2) belong to class C1, and only one (X4) belongs to class C2. Therefore, using a simple voting scheme we classify the new sample Y into class C1.
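The worked example can be reproduced with a few lines of Python. This is only an illustrative sketch: the attribute values assigned to X1, …, X6 below are hypothetical placeholders (the original table of samples is not reproduced in this excerpt), chosen only so that their SMC similarities to Y match the values listed above.

```python
# Sketch of the K-nearest neighbor step for categorical samples, using the
# simple matching coefficient (SMC) as the similarity measure.
from collections import Counter

def smc(a, b):
    """Simple matching coefficient: fraction of attributes with equal values."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical categorical samples (placeholders consistent with the SMC values above).
samples = {
    "X1": ["A", "C", "A", "B", "A", "B"],   # cluster C1, SMC(Y, X1) = 4/6
    "X2": ["A", "C", "A", "A", "A", "B"],   # cluster C1, SMC(Y, X2) = 3/6
    "X3": ["B", "C", "A", "A", "A", "B"],   # cluster C1, SMC(Y, X3) = 2/6
    "X4": ["A", "C", "A", "B", "B", "B"],   # cluster C2, SMC(Y, X4) = 4/6
    "X5": ["A", "C", "B", "A", "A", "B"],   # cluster C2, SMC(Y, X5) = 2/6
    "X6": ["A", "B", "A", "A", "A", "C"],   # cluster C2, SMC(Y, X6) = 2/6
}
cluster_of = {"X1": "C1", "X2": "C1", "X3": "C1",
              "X4": "C2", "X5": "C2", "X6": "C2"}

Y = ["A", "C", "A", "B", "C", "A"]          # new sample to classify

def knn_classify(y, k):
    # Rank clustered samples by similarity to y (largest SMC first),
    # then let the K most similar samples vote for their clusters.
    ranked = sorted(samples, key=lambda name: smc(y, samples[name]), reverse=True)
    votes = Counter(cluster_of[name] for name in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify(Y, k=3))   # -> 'C1', as in the worked example
```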
9.6 DBSCAN ALGORITHM
The density-based approach to clustering regards clusters as dense regions of objects in the data space that are separated by regions of low object density (noise). These regions may have arbitrary shape. The crucial concepts of this approach are density and connectivity, both measured in terms of the local distribution of nearest neighbors. The DBSCAN algorithm, targeting low-dimensional data, is the major representative of this category of density-based clustering algorithms. The main reason DBSCAN recognizes clusters is that within each cluster the typical density of points is considerably higher than outside the cluster, while the density within the areas of noise is lower than the density in any of the clusters.
DBSCAN is based on two main concepts: density reachability and density connectivity. Both concepts depend on two input parameters of DBSCAN clustering: the size of the epsilon neighborhood (ε) and the minimum number of points in a cluster (m). The key idea of the DBSCAN algorithm is that, for each point of a cluster, the neighborhood of a given radius ε has to contain at least a minimum number of points m; that is, the density in the neighborhood has to exceed a predefined threshold. For example, in Figure 9.9 point p has only two points in its ε-neighborhood, while point q has eight. Obviously, the density around q is higher than around p.
Figure 9.9. Neighborhood (ε) for points p and q.
Density reachability defines whether two close points belong to the same cluster. Point p1 is density-reachable from point p2 if two conditions are satisfied: (1) the points are close enough to each other, distance(p1, p2) < ε; and (2) there are enough points in the ε-neighborhood of p2, that is, the number of database points r with distance(r, p2) < ε is at least m. In the example in Figure 9.9, point p is density-reachable from point q. Density connectivity is the next building block of DBSCAN. Points p0 and pn are density-connected if there is a sequence of points p0, p1, p2, … , pn such that each pi+1 is density-reachable from pi. These ideas are combined in the definition of a DBSCAN cluster as a set of all density-connected points.
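The following Python fragment is a minimal sketch of the ε-neighborhood query and the direct density-reachability test just described. The helper names, the toy 2-D point set, and the parameter values are illustrative assumptions, not part of the book's example.

```python
# Minimal sketch of the eps-neighborhood query and direct density-reachability.
import math

def eps_neighborhood(points, p, eps):
    """All points whose Euclidean distance to p is less than eps (p included)."""
    return [q for q in points if math.dist(p, q) < eps]

def directly_density_reachable(points, p1, p2, eps, m):
    """p1 is directly density-reachable from p2 if p1 lies in the
    eps-neighborhood of p2 and p2 is a core point (|N_eps(p2)| >= m)."""
    neighborhood = eps_neighborhood(points, p2, eps)
    return (p1 in neighborhood) and (len(neighborhood) >= m)

# Tiny illustrative 2-D data set with eps = 1.5 and m = 4
# (the value recommended later for low-dimensional data).
data = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5), (5.5, 5)]
print(directly_density_reachable(data, (1, 1), (0, 0), eps=1.5, m=4))    # True
print(directly_density_reachable(data, (5, 5), (5.5, 5), eps=1.5, m=4))  # False
```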
The clustering process is based on the classification of the points in the dataset as core points, border points, and noise points (examples are given in Fig. 9.10):
A point is a core point if it has at least a specified number of points (m) within its neighborhood ε. These are points in the interior of a cluster.
A border point has fewer than m points within its neighborhood ε, but it lies within the ε-neighborhood of a core point.
A noise point is any point that is not a core point or a border point.
Figure 9.10. Examples of core, border, and noise points. (a) ε and m determine the type of the point; (b) core points build dense regions.
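This three-way classification of points can be sketched as follows; again, the data set and the parameter choice are assumptions made only for illustration.

```python
# Sketch of labeling every point as 'core', 'border', or 'noise'
# with respect to eps and m, following the definitions above.
import math

def eps_neighborhood(points, p, eps):
    return [q for q in points if math.dist(p, q) < eps]

def classify_points(points, eps, m):
    labels = {}
    # Core points: at least m points (including the point itself) within eps.
    core = {p for p in points if len(eps_neighborhood(points, p, eps)) >= m}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(math.dist(p, c) < eps for c in core):
            labels[p] = "border"   # not dense itself, but near a core point
        else:
            labels[p] = "noise"
    return labels

data = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (8, 8)]
print(classify_points(data, eps=1.5, m=4))
# (2, 2) comes out as a border point and (8, 8) as noise for these parameters.
```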
Ideally, we would know the appropriate parameters ε and m for each cluster, but there is no easy way to obtain this information in advance for all clusters of the database. Therefore, DBSCAN uses global values of ε and m, that is, the same values for all clusters. Also, numerous experiments indicate that DBSCAN clusters for m > 4 do not significantly differ from the case m = 4, while the algorithm needs considerably more computation. Therefore, in practice we may eliminate the parameter m by setting it to 4 for low-dimensional databases. The main steps of the DBSCAN algorithm are as follows:
Arbitrarily select a point p.
Retrieve all points density-reachable from p with respect to ε and m.
If p is a core point, a new cluster is formed or an existing cluster is extended.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all points have been processed.
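A compact, non-optimized sketch of these steps, assuming a brute-force ε-neighborhood query and an illustrative 2-D data set, might look as follows.

```python
# Sketch of the DBSCAN main loop described above (brute-force neighborhood query).
import math

def eps_neighborhood(points, p, eps):
    return [q for q in points if math.dist(p, q) < eps]

def dbscan(points, eps, m):
    cluster_id = {}                      # point -> cluster label; -1 means noise
    current = 0
    for p in points:
        if p in cluster_id:              # already processed
            continue
        neighbors = eps_neighborhood(points, p, eps)
        if len(neighbors) < m:
            # p is not a core point: mark it as noise for now (it may later be
            # relabeled as a border point) and visit the next point of the database.
            cluster_id[p] = -1
            continue
        current += 1                     # p is a core point: form a new cluster
        cluster_id[p] = current
        seeds = [q for q in neighbors if q != p]
        while seeds:                     # extend the cluster with density-reachable points
            q = seeds.pop()
            if cluster_id.get(q, -1) == -1:        # unvisited or previously marked noise
                cluster_id[q] = current
                q_neighbors = eps_neighborhood(points, q, eps)
                if len(q_neighbors) >= m:          # q is also a core point: keep expanding
                    seeds.extend(r for r in q_neighbors
                                 if cluster_id.get(r, -1) == -1)
    return cluster_id

data = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2),
        (8, 8), (8, 9), (9, 8), (9, 9), (8.5, 8.5)]
print(dbscan(data, eps=1.5, m=4))   # two clusters for this toy data set
```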