For most clustering techniques, we say that a similarity measure is normalized:

$$0 \le s(x, x') \le 1, \quad \forall x, x' \in X$$
Very often a measure of dissimilarity is used instead of a similarity measure. A dissimilarity measure is denoted by d(x, x′), ∀x, x′ ∈ X. Dissimilarity is frequently called a distance. A distance d(x, x′) is small when x and x′ are similar; if x and x′ are not similar, d(x, x′) is large. We assume without loss of generality that

$$d(x, x') \ge 0 \quad \text{and} \quad d(x, x) = 0, \quad \forall x \in X$$
A distance measure is also symmetric:

$$d(x, x') = d(x', x), \quad \forall x, x' \in X$$
and if it is accepted as a metric distance measure, then the triangle inequality is required:

$$d(x, x'') \le d(x, x') + d(x', x''), \quad \forall x, x', x'' \in X$$
The best-known metric distance measure is the Euclidean distance in an m-dimensional feature space:

$$d_2(x_i, x_j) = \left( \sum_{k=1}^{m} (x_{ik} - x_{jk})^2 \right)^{1/2}$$
Another metric that is frequently used is called the L1 metric or city block distance:

$$d_1(x_i, x_j) = \sum_{k=1}^{m} |x_{ik} - x_{jk}|$$
and finally, the Minkowski metric includes the Euclidean distance and the city block distance as special cases:

$$d_p(x_i, x_j) = \left( \sum_{k=1}^{m} |x_{ik} - x_{jk}|^p \right)^{1/p}$$
It is obvious that when p = 1, dp coincides with the L1 distance, and when p = 2, it is identical to the Euclidean metric. For example, for the 4-D vectors x1 = {1, 0, 1, 0} and x2 = {2, 1, −3, −1}, these distance measures are d1 = 1 + 1 + 4 + 1 = 7, d2 = (1 + 1 + 16 + 1)^{1/2} = 4.36, and d3 = (1 + 1 + 64 + 1)^{1/3} = 4.06.
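These computations can be verified with a minimal Python sketch; the function name minkowski_distance and the use of plain lists are illustrative choices, not notation from the text.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance d_p between two equal-length numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x1 = [1, 0, 1, 0]
x2 = [2, 1, -3, -1]

print(minkowski_distance(x1, x2, 1))  # city block (L1) distance: 7.0
print(minkowski_distance(x1, x2, 2))  # Euclidean distance: ~4.36
print(minkowski_distance(x1, x2, 3))  # ~4.06
```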
The Euclidean m-dimensional space model offers not only the Euclidean distance but also other measures of similarity. One of them is called the cosine-correlation:

$$s_{\cos}(x_i, x_j) = \frac{\sum_{k=1}^{m} x_{ik} \, x_{jk}}{\left( \sum_{k=1}^{m} x_{ik}^2 \right)^{1/2} \left( \sum_{k=1}^{m} x_{jk}^2 \right)^{1/2}}$$
It is easy to see that

$$s_{\cos}(x_i, x_j) = 1 \Leftrightarrow x_i = \lambda x_j \ (\lambda > 0) \quad \text{and} \quad s_{\cos}(x_i, x_j) = -1 \Leftrightarrow x_i = -\lambda x_j \ (\lambda > 0)$$
For the previously given vectors x1 and x2, the corresponding cosine measure of similarity is scos(x1, x2) = (2 + 0 − 3 + 0)/(2^{1/2} · 15^{1/2}) = −0.18.
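A similarly minimal sketch of the cosine-correlation, again with illustrative names:

```python
import math

def cosine_similarity(x, y):
    """Cosine-correlation between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

print(cosine_similarity([1, 0, 1, 0], [2, 1, -3, -1]))  # ~ -0.18
```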
Computing distances or measures of similarity between samples is problematic when some or all of their features are noncontinuous, since the different types of features are not comparable and no single standard measure is applicable. In practice, different distance measures are used for different features of heterogeneous samples. Let us explain one possible distance measure for binary data. Assume that each sample is represented by an n-dimensional vector xi whose components have binary values (xik ∈ {0, 1}). A conventional method for obtaining a distance measure between two samples xi and xj represented with binary features is to use the 2 × 2 contingency table for samples xi and xj, as shown in Table 9.2.
TABLE 9.2. The 2 × 2 Contingency Table

            xj = 1    xj = 0
  xi = 1       a         b
  xi = 0       c         d
The meaning of the table parameters a, b, c, and d, which are given in Table 9.2, is as follows:
1. a is the number of binary attributes of samples xi and xj such that xik = xjk = 1.
2. b is the number of binary attributes of samples xi and xj such that xik = 1 and xjk = 0.
3. c is the number of binary attributes of samples xi and xj such that xik = 0 and xjk = 1.
4. d is the number of binary attributes of samples xi and xj such that xik = xjk = 0.
For example, if xi and xj are 8-D vectors with binary feature values

xi = {0, 0, 1, 1, 0, 1, 0, 1}
xj = {0, 1, 1, 0, 0, 1, 0, 0}

then the values of the parameters introduced are

a = 2, b = 2, c = 1, d = 3
Several similarity measures for samples with binary features have been proposed based on the values in the 2 × 2 contingency table. Some of them are:
1. simple matching coefficient (SMC):

$$s_{smc}(x_i, x_j) = \frac{a + d}{a + b + c + d}$$

2. Jaccard coefficient:

$$s_{jc}(x_i, x_j) = \frac{a}{a + b + c}$$

3. Rao's coefficient:

$$s_{rc}(x_i, x_j) = \frac{a}{a + b + c + d}$$
For the previously given 8-D samples xi and xj, these measures of similarity are ssmc(xi, xj) = 5/8, sjc(xi, xj) = 2/5, and src(xi, xj) = 2/8.
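The contingency counts and all three coefficients can be computed with a short Python sketch (binary_similarities is an illustrative helper, not a standard library function):

```python
def binary_similarities(xi, xj):
    """Build the 2 x 2 contingency counts a, b, c, d for two binary
    vectors and return the three similarity coefficients."""
    pairs = list(zip(xi, xj))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    n = a + b + c + d
    return {"smc": (a + d) / n, "jaccard": a / (a + b + c), "rao": a / n}

xi = [0, 0, 1, 1, 0, 1, 0, 1]
xj = [0, 1, 1, 0, 0, 1, 0, 0]
print(binary_similarities(xi, xj))  # {'smc': 0.625, 'jaccard': 0.4, 'rao': 0.25}
```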
How do we measure distances between values when categorical data are not binary? The simplest way to find the similarity between two categorical attributes is to assign a similarity of 1 if the values are identical and a similarity of 0 if they are not. For two multivariate categorical data points, the similarity between them is then directly proportional to the number of attributes in which they match. This simple measure is known in the literature as the overlap measure. One obvious drawback of the overlap measure is that it does not distinguish between the different values taken by an attribute: all matches, as well as all mismatches, are treated as equal.
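A minimal sketch of the overlap measure in Python (the sample attribute values are made up for illustration):

```python
def overlap_similarity(x, y):
    """Overlap measure: fraction of attributes on which two samples match."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

# Two samples described by three categorical attributes:
print(overlap_similarity(["red", "small", "round"],
                         ["red", "large", "round"]))  # 0.666...
```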
This observation has motivated researchers to come up with data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of different attribute values in a given data set to define the similarity between two categorical attribute values. Intuitively, the use of this additional information should lead to better performance. There are two main characteristics of categorical data that are included in the new measures of similarity (distance):
1. number of values taken by each attribute, nk (one attribute might take several hundred possible values, while another attribute might take very few values); and
2. the distribution fk(x), that is, the frequency with which each value of an attribute occurs in the given data set.
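Both characteristics are straightforward to extract from a data set; the following sketch uses a hypothetical helper attribute_statistics and a toy data set:

```python
from collections import Counter

def attribute_statistics(data, k):
    """For attribute k of a data set (a list of samples), return n_k, the
    number of distinct values, and f_k, the frequency of each value."""
    f_k = Counter(sample[k] for sample in data)
    return len(f_k), f_k

data = [["red", "small"], ["red", "medium"], ["blue", "small"]]
print(attribute_statistics(data, 0))  # (2, Counter({'red': 2, 'blue': 1}))
```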
Almost all similarity measures assign a similarity value between two d-dimensional samples X and Y belonging to the data set D as follows:

$$S(X, Y) = \sum_{k=1}^{d} w_k \, S_k(X_k, Y_k)$$
where Sk(Xk, Yk) is the per-attribute similarity between two values for the categorical attribute Ak. The quantity wk denotes the weight assigned to the attribute Ak. To understand how different measures calculate the per-attribute similarity Sk(Xk, Yk), consider a categorical attribute A that takes one of the values {a, b, c, d}. The per-attribute similarity computation is equivalent to constructing the (symmetric) matrix shown in Table 9.3.
TABLE 9.3. Similarity Matrix for a Single Categorical Attribute

          a          b          c          d
  a    S(a, a)    S(a, b)    S(a, c)    S(a, d)
  b               S(b, b)    S(b, c)    S(b, d)
  c                          S(c, c)    S(c, d)
  d                                     S(d, d)
Essentially, in determining the similarity between two values, any categorical measure fills the entries of this matrix. For example, the overlap measure sets the diagonal entries to 1 and the off-diagonal entries to 0.
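The general weighted form and the overlap instantiation can be sketched together in Python; the names overlap_sk and total_similarity are illustrative, and the equal weights 1/d are an assumption that reproduces the plain overlap measure:

```python
def overlap_sk(u, v):
    """Per-attribute overlap similarity: 1 on the diagonal of the
    similarity matrix, 0 everywhere else."""
    return 1.0 if u == v else 0.0

def total_similarity(x, y, weights, sk=overlap_sk):
    """S(X, Y) = sum over k of w_k * S_k(X_k, Y_k)."""
    return sum(w * sk(a, b) for w, a, b in zip(weights, x, y))

# With equal weights 1/d, the aggregate reduces to the plain overlap measure:
x = ["a", "b", "c"]
y = ["a", "d", "c"]
print(total_similarity(x, y, [1 / len(x)] * len(x)))  # 0.666...
```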