Data Mining Mehmed Kantardzic (good english books to read .txt) 📖
- Author: Mehmed Kantardzic
Book online «Data Mining Mehmed Kantardzic (good english books to read .txt) 📖». Author Mehmed Kantardzic
2. T contains no samples. The decision tree is again a leaf but the class to be associated with the leaf must be determined from information other than T, such as the overall majority class in T. The C4.5 algorithm uses as a criterion the most frequent class at the parent of the given node.
3. T contains samples that belong to a mixture of classes. In this situation, the idea is to refine T into subsets of samples that are heading toward a single-class collection of samples. Based on single attribute, an appropriate test that has one or more mutually exclusive outcomes {O1, O2, … , On} is chosen. T is partitioned into subsets T1, T2, … , Tn, where Ti contains all the samples in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test and one branch for each possible outcome (examples of this type of nodes are nodes A, B, and C in the decision tree in Fig. 6.3a).
The same tree-building procedure is applied recursively to each subset of training samples, so that the ith branch leads to the decision tree constructed from the subset Ti of the training samples. The successive division of the set of training samples proceeds until all the subsets consist of samples belonging to a single class.
The tree-building process is not uniquely defined. For different tests, even for a different order of their application, different trees will be generated. Ideally, we would like to choose a test at each stage of sample-set splitting so that the final tree is small. Since we are looking for a compact decision tree that is consistent with the training set, why not explore all possible trees and select the simplest? Unfortunately, the problem of finding the smallest decision tree consistent with a training data set is NP-complete. Enumeration and analysis of all possible trees will cause a combinatorial explosion for any real-world problem. For example, for a small database with five attributes and only 20 training examples, the possible number of decision trees is greater than 106, depending on the number of different values for every attribute. Therefore, most decision tree-construction methods are non-backtracking, greedy algorithms. Once a test has been selected using some heuristics to maximize the measure of progress and the current set of training cases has been partitioned, the consequences of alternative choices are not explored. The measure of progress is a local measure, and the gain criterion for a test selection is based on the information available for a given step of data splitting.
Suppose we have the task of selecting a possible test with n outcomes (n values for a given feature) that partitions the set T of training samples into subsets T1, T2, … , Tn. The only information available for guidance is the distribution of classes in T and its subsets Ti. If S is any set of samples, let freq(Ci, S) stand for the number of samples in S that belong to class Ci (out of k possible classes), and let |S| denote the number of samples in the set S.
The original ID3 algorithm used a criterion called gain to select the attribute to be tested that is based on the information theory concept: entropy. The following relation gives the computation of the entropy of the set T (bits are units):
Now consider a similar measurement after T has been partitioned in accordance with n outcomes of one attribute test X. The expected information requirement can be found as the weighted sum of entropies over the subsets:
The quantity
measures the information that is gained by partitioning T in accordance with the test X. The gain criterion selects a test X to maximize Gain(X), that is, this criterion will select an attribute with the highest information gain.
Let us analyze the application of these measures and the creation of a decision tree for one simple example. Suppose that the database T is given in a flat form in which each of 14 examples (cases) is described by three input attributes and belongs to one of two given classes: CLASS1 or CLASS2. The database is given in tabular form in Table 6.1.
TABLE 6.1. A Simple Flat Database of Examples for Training
Nine samples belong to CLASS1 and five samples to CLASS2, so the entropy before splitting is
After using Attribute1 to divide the initial set of samples T into three subsets (test x1 represents the selection one of three values A, B, or C), the resulting information is given by:
The information gained by this test x1 is
If the test and splitting is based on Attribute3 (test x2 represents the selection one of two values True or False), a similar computation will give new results:
and corresponding gain is
Based on the gain criterion, the decision-tree algorithm will select test x1 as an initial test for splitting the database T because this gain is higher. To find the optimal test it will be necessary to analyze a test on Attribute2, which is a numeric feature with continuous values. In general, C4.5 contains mechanisms for proposing three types of tests:
1. The “standard” test on a discrete attribute, with one outcome and one branch for each possible value of that attribute (in our example these are both tests x1 for Attribute1 and x2 for Attribute3).
2. If attribute Y has continuous numeric values, a binary test with outcomes Y ≤ Z and Y > Z could be defined by comparing its value against a threshold value Z.
3. A more complex test is also based on a discrete attribute, in which the possible values are allocated to a variable number of groups with one outcome and branch for each group.
While we have already explained standard test for categorical attributes, additional explanations are necessary about a procedure for establishing tests
Comments (0)