… reading all points from disk again.

Figure 9.13. CF tree structure.

BIRCH performs faster than most existing algorithms on large data sets. The algorithm can typically find a good clustering with a single scan of the data and can improve the quality further with a few additional scans (phases 3 and 4). The basic algorithm condenses metric data in the first pass using spherical summaries, and this part can be implemented incrementally. Additional passes cluster the CFs to detect nonspherical clusters, with the algorithm approximating the density function. Several extensions of the algorithm try to include nonmetric data as well, which makes the approach applicable much more widely.
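The spherical summaries mentioned above are the clustering features (CFs) stored in the tree of Figure 9.13. The following is a minimal sketch of the idea: a CF is kept as the triple (N, LS, SS), and the additivity of these triples is what makes the single-scan, incremental pass possible. Storing SS as one sum of squared norms (rather than a per-dimension vector) is an assumption of this sketch; some formulations keep it per dimension.

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) summarizing a set of d-dimensional points."""

    def __init__(self, dim):
        self.n = 0                  # N: number of points summarized
        self.ls = np.zeros(dim)     # LS: linear sum of the points
        self.ss = 0.0               # SS: sum of squared norms of the points

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        # CF additivity: two subclusters combine by simply adding their CFs,
        # which is what lets BIRCH build and rebuild the tree incrementally.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of the summarized points from the
        # centroid, recoverable from (N, LS, SS) alone: the "spherical summary".
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - float(c @ c), 0.0)))

cf1, cf2 = CF(2), CF(2)
for p in [(1, 1), (2, 2)]:
    cf1.add_point(p)
for p in [(8, 8), (9, 9)]:
    cf2.add_point(p)
cf1.merge(cf2)
print(cf1.n, cf1.centroid(), cf1.radius())   # 4 [5. 5.] 5.0
```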

9.8 CLUSTERING VALIDATION

How is the output of a clustering algorithm evaluated? What characterizes a “good” clustering result and a “poor” one? All clustering algorithms will, when presented with data, produce clusters, regardless of whether the data contain clusters or not. Therefore, the first step in evaluation is actually an assessment of the data domain rather than of the clustering algorithm itself: data that we do not expect to form clusters should not be processed by any clustering algorithm. If the data do contain clusters, some clustering algorithms may obtain a “better” solution than others. Cluster validity is the second step, once we expect the data to contain clusters. A clustering structure is valid if it cannot reasonably have occurred by chance or as an artifact of a clustering algorithm. Applying one of the available validation methodologies, we assess the outputs. This analysis uses a specific criterion of optimality that usually incorporates knowledge about the application domain and is therefore subjective. There are three types of validation studies for clustering algorithms. An external assessment of validity compares the discovered structure with an a priori structure. An internal examination of validity tries to determine whether the discovered structure is intrinsically appropriate for the data. Both assessments are subjective and domain-dependent. A relative test, as a third approach, compares two structures obtained either from different clustering methodologies or from the same methodology with different clustering parameters, such as the order of input samples. This test measures their relative merit, but we still need to resolve the question of selecting the structures for comparison.
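As a small illustration of the external assessment just described, the following sketch computes the Rand index, one common way to compare a discovered partition with an a priori labeling; the two labelings in the example are hypothetical.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which two partitions agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)

# Hypothetical a priori labels vs. a discovered clustering of six samples:
print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))   # 10/15 ~ 0.667
```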

Theory and practical applications both show that all approaches in the validation of clustering results have a subjective component. Hence, little in the way of “gold standards” exists in clustering evaluation. Recent studies in cluster analysis suggest that a user of a clustering algorithm should always keep the following issues in mind:

1. Every clustering algorithm will find clusters in a given data set whether they exist or not; the data should, therefore, be subjected to tests for clustering tendency before applying a clustering algorithm, followed by a validation of the clusters generated by the algorithm.

2. There is no best clustering algorithm; therefore, a user is advised to try several algorithms on a given data set.

It is important to remember that cluster analysis is an exploratory tool; the outputs of clustering algorithms only suggest or sometimes confirm hypotheses, but never prove any hypothesis about natural organization of data.

9.9 REVIEW QUESTIONS AND PROBLEMS

1. Why is the validation of a clustering process highly subjective?

2. What increases the complexity of clustering algorithms?

3.

(a) Using MND distance, distribute the input samples given as 2-D points A(2, 2), B(4, 4), and C(7, 7) into two clusters.

(b) What will be the distribution of samples in the clusters if samples D(1, 1), E(2, 0), and F(0, 0) are added?

4. Given 5-D numeric samples A = (1, 0, 2, 5, 3) and B = (2, 1, 0, 3, −1), find

(a) the Euclidean distance between the points;

(b) the city-block distance;

(c) the Minkowski distance for p = 3; and

(d) the cosine-correlation distance. (A checking sketch in Python follows this problem.)
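A minimal sketch for checking the results of Problem 4; it takes the cosine-correlation distance as 1 minus the cosine similarity, which is one common convention.

```python
import numpy as np

A = np.array([1, 0, 2, 5, 3], dtype=float)
B = np.array([2, 1, 0, 3, -1], dtype=float)
d = A - B

euclidean = np.sqrt(np.sum(d ** 2))              # (a) sqrt(26) ~ 5.10
city_block = np.sum(np.abs(d))                   # (b) 10
minkowski3 = np.sum(np.abs(d) ** 3) ** (1 / 3)   # (c) 82**(1/3) ~ 4.34
cos_sim = (A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
cosine_dist = 1 - cos_sim                        # (d) ~ 0.42

print(euclidean, city_block, minkowski3, cosine_dist)
```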

5. Given 6-D categorical samples C = (A, B, A, B, A, A) and D = (B, B, A, B, B, A), find

(a) an SMC of the similarity between samples;

(b) Jaccard’s coefficient; and

(c) Rao’s coefficient. (A checking sketch follows this problem.)
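A short checking sketch for Problem 5, assuming the binary encoding A → 1 and B → 0. With that encoding, SMC = (a + d)/n, Jaccard = a/(a + b + c) = a/(n − d), and Rao = a/n, where a counts 1-1 matches and d counts 0-0 matches.

```python
# Encoding assumption: A -> 1, B -> 0.
C = [1, 0, 1, 0, 1, 1]   # (A, B, A, B, A, A)
D = [0, 0, 1, 0, 0, 1]   # (B, B, A, B, B, A)

n = len(C)
a = sum(x == 1 and y == 1 for x, y in zip(C, D))   # 1-1 matches: 2
d = sum(x == 0 and y == 0 for x, y in zip(C, D))   # 0-0 matches: 2

smc = (a + d) / n       # (a) simple matching coefficient: 4/6
jaccard = a / (n - d)   # (b) Jaccard's coefficient: 2/4
rao = a / n             # (c) Rao's coefficient: 2/6
print(smc, jaccard, rao)
```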

6. Given a set of 5-D categorical samples

A = (1, 0, 1, 1, 0)

B = (1, 1, 0, 1, 0)

C = (0, 0, 1, 1, 0)

D = (0, 1, 0, 1, 0)

E = (1, 0, 1, 0, 1)

F = (0, 1, 1, 0, 0)

(a) Apply agglomerative hierarchical clustering using

(i) single-link similarity measure based on Rao’s coefficient; and

(ii) complete-link similarity measure based on SMC.

(b) Plot the dendrograms for the solutions to parts (i) and (ii) of (a). (A sketch using SciPy follows this problem.)
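One possible sketch for Problem 6, shown here for part (a)(ii) using SciPy’s hierarchical-clustering utilities: similarities are converted to distances as 1 − SMC before linkage. Part (a)(i) is analogous, replacing the metric with a 1 − Rao dissimilarity and using method='single'.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

samples = np.array([
    [1, 0, 1, 1, 0],   # A
    [1, 1, 0, 1, 0],   # B
    [0, 0, 1, 1, 0],   # C
    [0, 1, 0, 1, 0],   # D
    [1, 0, 1, 0, 1],   # E
    [0, 1, 1, 0, 0],   # F
])

def smc_distance(x, y):
    # Dissimilarity as 1 - SMC = fraction of mismatched attributes.
    return 1.0 - np.mean(x == y)

dist = pdist(samples, metric=smc_distance)   # condensed distance matrix
Z = linkage(dist, method='complete')         # complete-link merging
dendrogram(Z, labels=list('ABCDEF'))         # part (b): the dendrogram
plt.show()
```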

7. Given the samples X1 = {1, 0}, X2 = {0, 1}, X3 = {2, 1}, and X4 = {3, 3}, suppose that the samples are randomly clustered into two clusters C1 = {X1, X3} and C2 = {X2, X4}.

(a) Apply one iteration of the K-means partitional-clustering algorithm, and find a new distribution of samples in clusters. What are the new centroids? How can you prove that the new distribution of samples is better than the initial one?

(b) What is the change in the total square-error?

(c) Apply the second iteration of the K-means algorithm and discuss the changes in clusters. (A sketch of both iterations follows this problem.)
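A sketch of the two K-means iterations in Problem 7, starting from the given random distribution. The total square-error printed after each reassignment is computed against the centroids that produced the assignment, so a decrease from one iteration to the next is the kind of evidence part (a) asks for.

```python
import numpy as np

X = np.array([[1, 0], [0, 1], [2, 1], [3, 3]], dtype=float)   # X1..X4
assign = np.array([0, 1, 0, 1])   # initial clusters C1 = {X1, X3}, C2 = {X2, X4}

for it in (1, 2):
    # Recompute centroids of the current clusters.
    centroids = np.array([X[assign == k].mean(axis=0) for k in (0, 1)])
    # Reassign every sample to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # Total square-error of the new assignment w.r.t. these centroids.
    sse = np.sum((X - centroids[assign]) ** 2)
    print(f"iteration {it}: assignment={assign}, SSE={sse:.3f}")
```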

8. For the samples in Problem 7, apply iterative clustering with the threshold value for cluster radius T = 2. What is the number of clusters and samples distribution after the first iteration?
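One possible reading of Problem 8, as an incremental (leader-style) pass in which a sample joins the nearest existing cluster if it lies within T = 2 of that cluster’s centroid and otherwise starts a new cluster; whether the threshold is tested against the centroid distance or against a radius measure is a convention to check against the text.

```python
import numpy as np

T = 2.0
samples = [np.array(p, dtype=float) for p in [(1, 0), (0, 1), (2, 1), (3, 3)]]
clusters = []   # each cluster is a list of its member samples

for x in samples:
    centroids = [np.mean(c, axis=0) for c in clusters]
    dists = [np.linalg.norm(x - m) for m in centroids]
    if dists and min(dists) <= T:
        clusters[int(np.argmin(dists))].append(x)   # join the nearest cluster
    else:
        clusters.append([x])                        # start a new cluster

for i, c in enumerate(clusters, start=1):
    print(f"cluster {i}: {[p.tolist() for p in c]}")
# With this reading: cluster 1 = {X1, X2, X3}, cluster 2 = {X4}.
```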

9. Suppose that the samples in Problem 6 are distributed into two clusters:

C1 = {A, B, E} and C2 = {C, D, F}.

Using the K-nearest neighbor algorithm, find the classification for the following samples (a checking sketch follows this problem):

(a) Y = {1, 1, 0, 1, 1} using K = 1

(b) Y = {1, 1, 0, 1, 1} using K = 3

(c) Z = {0, 1, 0, 0, 0} using K = 1

(d) Z = {0, 1, 0, 0, 0} using K = 5.
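A checking sketch for Problem 9, using Hamming distance on the binary attributes (for binary vectors this ranks neighbors exactly as Euclidean distance does); distance ties are broken arbitrarily here, by insertion order.

```python
from collections import Counter

data = {
    'A': [1, 0, 1, 1, 0], 'B': [1, 1, 0, 1, 0], 'E': [1, 0, 1, 0, 1],   # C1
    'C': [0, 0, 1, 1, 0], 'D': [0, 1, 0, 1, 0], 'F': [0, 1, 1, 0, 0],   # C2
}
label = {'A': 'C1', 'B': 'C1', 'E': 'C1', 'C': 'C2', 'D': 'C2', 'F': 'C2'}

def knn(query, k):
    # Rank samples by Hamming distance to the query; vote among the top k.
    ranked = sorted(data, key=lambda s: sum(u != v for u, v in zip(data[s], query)))
    votes = Counter(label[s] for s in ranked[:k])
    return votes.most_common(1)[0][0]

Y, Z = [1, 1, 0, 1, 1], [0, 1, 0, 0, 0]
for name, q, ks in (('Y', Y, (1, 3)), ('Z', Z, (1, 5))):
    for k in ks:
        print(f"{name}, K={k}: {knn(q, k)}")   # Y -> C1 both times; Z -> C2 both times
```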

10. Implement the hierarchical agglomerative algorithm for samples with categorical values using the SMC measure of similarity.

11. Implement the partitional K-means clustering algorithm. Input samples are given in the form of a flat file.

12. Implement the incremental-clustering algorithm with iterations. Input samples are given in the form of a flat file.

13. Given the similarity matrix between five samples:

(a) Use the similarity matrix in the table to …
