Data Mining Mehmed Kantardzic (good english books to read .txt) 📖
- Author: Mehmed Kantardzic
Book online «Data Mining Mehmed Kantardzic (good english books to read .txt) 📖». Author Mehmed Kantardzic
1. Discard all samples in a database with missing data.
2. Define a new algorithm or modify an existing algorithm that will work with missing data.
The first solution is simple but unacceptable when large amounts of missing values exist in a set of samples. To address the second alternative, several questions must be answered:
1. How does one compare two samples with different numbers of unknown values?
2. Training samples with unknown values cannot be associated with a particular value of the test, and so they cannot be assigned to any subsets of cases. How should these samples be treated in the partitioning?
3. In a testing phase of classification, how does one treat a missing value if the test is on the attribute with the missing value?
All these and many other questions arise with any attempt to find a solution for missing data. Several classification algorithms that work with missing data are usually based on filling in a missing value with the most probable value, or on looking at the probability distribution of all values for the given attribute. None of these approaches is uniformly superior.
In C4.5, it is an accepted principle that samples with unknown values are distributed probabilistically according to the relative frequency of known values. Let Info(T) and Infox(T) be calculated as before, except that only samples with known values of attributes are taken into account. Then the gain parameter can reasonably be corrected with a factor F, which represents the probability that a given attribute is known (F = number of samples in the database with a known value for a given attribute/total number of samples in a data set). The new gain criterion will have the form
Similarly, Split-info(x) can be altered by regarding the samples with unknown values as an additional group in splitting. If the test x has n outcomes, its Split-info(x) is computed as if the test divided the data set into n + 1 subsets. This modification has a direct influence on the final value of the modified criterion Gain-ratio(x).
Let us explain the modifications of the C4.5 decision-tree methodology applied on one example. The database is similar to the previous one (Table 6.1), only there is now one value missing for Attribute1 denoted by “?” as presented in Table 6.2.
TABLE 6.2. A Simple Flat Database of Examples with One Missing Value
The computation of the gain parameter for Attribute1 is similar to as before, only the missing value corrects some of the previous steps. Eight out of the 13 cases with values for Attribute1 belong to CLASS1 and five cases to CLASS2, so the entropy before splitting is
After using Attribute1 to divide T into three subsets (test x1 represents the selection one of three values A, B, or C), the resulting information is given by
The information gained by this test is now corrected with the factor F (F = 13/14 for our example):
The gain for this test is slightly lower than the previous value of 0.216 bits. The split information, however, is still determined from the entire training set and is larger, since there is an extra category for unknown values.
Additionally, the concept of partitioning must be generalized. With every sample a new parameter, probability, is associated. When a case with known value is assigned from T to subset Ti, the probability of it belonging to Ti is 1, and in all other subsets is 0. When a value is not known, only a weaker probabilistic statement can be made. C4.5 therefore associates with each sample (having a missing value) in each subset Ti a weight w, representing the probability that the case belongs to each subset. To make the solution more general, it is necessary to take into account that the probabilities of samples before splitting are not always equal to one (in subsequent iterations of the decision-tree construction). Therefore, new parameter wnew for missing values after splitting is equal to the old parameter wold before splitting multiplied by the probability that the sample belongs to each subset P(Ti), or more formally:
For example, the record with the missing value, given in the database in Figure 6.7, will be represented in all three subsets after the splitting set T into subsets Ti using test x1 on Attribute1. New weights wi will be equal to probabilities 5/13, 3/13, and 5/13, because the initial (old) value for w is equal to one. The new subsets are given in Figure 6.7. |Ti| can now be reinterpreted in C4.5 not as a number of elements in a set Ti, but as a sum of all weights w for the given set Ti. From Figure 6.7, the new values are computed as: |T1| = 5 + 5/13, |T2| = 3 + 3/13, and |T3| = 5 + 5/13.
Figure 6.7. Results of test x1 are subsets Ti (initial set T is with missing value).
If these subsets are partitioned further by the tests on Attribute2 and Attribute3, the final decision tree for a data set with missing values has the form shown in Figure 6.8
Figure 6.8. Decision tree for the database T with missing values.
The decision tree in Figure 6.8 has much the same structure as before (Fig. 6.6), but because of the ambiguity in final classification, every decision is attached with two parameters in a form (|Ti|/E). |Ti| is the sum of the fractional samples that reach the leaf and E is the number of samples that belong to classes other than the nominated class.
For example, (3.4/0.4) means that 3.4 (or 3 + 5/13) fractional training samples reached the leaf, of which 0.4 (or 5/13) did not belong to the class assigned to the leaf. It is
Comments (0)