Data Mining
- Author: Mehmed Kantardzic
TABLE 3.5. A Contingency Table for 2 × 2 Categorical Data
We will analyze the ChiMerge algorithm using one relatively simple example, where the database consists of 12 two-dimensional samples with only one continuous feature (F) and an output classification feature (K). The two values, 1 and 2, for the feature K represent the two classes to which the samples belong. The initial data set, already sorted with respect to the continuous numeric feature F, is given in Table 3.6.
TABLE 3.6. Data on the Sorted Continuous Feature F with Corresponding Classes K
Sample    F     K
1         1     1
2         3     2
3         7     1
4         8     1
5         9     1
6         11    2
7         23    2
8         37    1
9         39    2
10        45    1
11        46    1
12        59    1
We can start the discretization algorithm by selecting the smallest χ2 value for adjacent intervals on our sorted scale for F. We define the midpoint between two consecutive values in the given data as the splitting point between intervals. For our example, the first interval points for feature F are 0, 2, 5, 7.5, 8.5, and 10. Based on this distribution of intervals, we analyze all adjacent intervals, trying to find a minimum for the χ2 test. In our example, χ2 was the minimum for intervals [7.5, 8.5] and [8.5, 10]. Each interval contains only one sample, and both samples belong to class K = 1. The initial contingency table is given in Table 3.7.
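The splitting points can be reproduced with a short Python sketch. It assumes that each point is the midpoint between two consecutive sorted values of F; the outer bounds 0 and 60 used in the text are not computed here, and the function name `split_points` is only illustrative.

```python
# F values from Table 3.6, already sorted in ascending order.
f_values = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]


def split_points(sorted_values):
    """Return the midpoint between each pair of consecutive values."""
    return [(a + b) / 2 for a, b in zip(sorted_values, sorted_values[1:])]


points = split_points(f_values)
print(points)
```

The first five values returned are 2, 5, 7.5, 8.5, and 10, the points listed above; the remaining ones (17, 30, 38, 42, 45.5, and 52.5) separate the later samples.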
TABLE 3.7. Contingency Table for Intervals [7.5, 8.5] and [8.5, 10]
                       K = 1    K = 2    Sum
Interval [7.5, 8.5]      1        0       1
Interval [8.5, 10]       1        0       1
Sum                      2        0       2
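Based on the values given in the table, we can calculate the expected values (an expected value of 0 is replaced by 0.1 to avoid division by zero)
E11 = 2/2 = 1, E12 = 0/2 ≈ 0.1, E21 = 2/2 = 1, and E22 = 0/2 ≈ 0.1
and the corresponding χ2 test
χ2 = (1 − 1)^2/1 + (0 − 0.1)^2/0.1 + (1 − 1)^2/1 + (0 − 0.1)^2/0.1 = 0.2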
For the degree of freedom d = 1, χ2 = 0.2 < 2.706 (the threshold value from the tables for the chi-squared distribution for α = 0.1), so we can conclude that there are no significant differences in relative class frequencies and that the selected intervals can be merged. In each iteration, the merging process is applied only to the two adjacent intervals that have the minimum χ2 and, at the same time, χ2 < threshold value. The iterative process then continues with the next two adjacent intervals that have the minimum χ2. We will show just one additional step, somewhere in the middle of the merging process, where the intervals [0, 7.5] and [7.5, 10] are analyzed. The contingency table is given in Table 3.8, and the expected values are
E11 = 12/5 = 2.4, E12 = 3/5 = 0.6, E21 = 8/5 = 1.6, and E22 = 2/5 = 0.4
while the χ2 test is
χ2 = (2 − 2.4)^2/2.4 + (1 − 0.6)^2/0.6 + (2 − 1.6)^2/1.6 + (0 − 0.4)^2/0.4 ≈ 0.067 + 0.267 + 0.100 + 0.400 = 0.834
TABLE 3.8. Contingency Table for Intervals [0, 7.5] and [7.5, 10]
                       K = 1    K = 2    Sum
Interval [0, 7.5]        2        1       3
Interval [7.5, 10]       2        0       2
Sum                      4        1       5
The selected intervals should be merged into one because, for the degree of freedom d = 1, χ2 = 0.834 < 2.706 (for α = 0.1). In our example, with the given threshold value for χ2, the algorithm will define a final discretization with three intervals: [0, 10], [10, 42], and [42, 60], where 60 is assumed to be the maximum value for the feature F. We can assign these intervals the coded values 1, 2, and 3 or the descriptive linguistic values low, medium, and high.
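Both merge decisions made so far can be checked numerically. The sketch below is one possible way to do it, not the book's own code: the function name `chi_square` and its `zero_expected` parameter are illustrative, the class counts are read off Table 3.6, and the replacement of a zero expected value by 0.1 is the assumption that reproduces χ2 = 0.2 for the first pair of intervals.

```python
def chi_square(table, zero_expected=0.1):
    """Chi-squared statistic for a contingency table.

    `table` holds one row per interval and one column per class.
    An expected value whose row or column total is 0 is replaced by
    `zero_expected` (assumed convention, avoids division by zero).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if row_totals[i] == 0 or col_totals[j] == 0:
                expected = zero_expected
            else:
                expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


THRESHOLD = 2.706  # chi-squared critical value for d = 1 and alpha = 0.1 (from the text)

# Columns are classes K = 1 and K = 2; counts derived from Table 3.6.
table_3_7 = [[1, 0], [1, 0]]  # intervals [7.5, 8.5] and [8.5, 10]
table_3_8 = [[2, 1], [2, 0]]  # intervals [0, 7.5] and [7.5, 10]

for name, table in [("3.7", table_3_7), ("3.8", table_3_8)]:
    value = chi_square(table)
    decision = "merge" if value < THRESHOLD else "keep separate"
    print(name, round(value, 3), decision)
```

For the second pair of intervals the unrounded statistic is 5/6 ≈ 0.833; the value 0.834 quoted in the text presumably results from rounding the individual terms before adding them. Either way it is well below the threshold.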
Additional merging is not possible because the χ2 test will show significant differences between intervals. For example, if we attempt to merge the intervals [0, 10] and [10, 42] (the contingency table is given in Table 3.9), the test results are E11 = 2.78, E12 = 2.22, E21 = 2.22, E22 = 1.78, and χ2 = 2.72 > 2.706. The conclusion is that significant differences between the two intervals exist, and merging is not recommended.
TABLE 3.9. Contingency Table for Intervals [0, 10] and [10, 42]
                       K = 1    K = 2    Sum
Interval [0, 10]         4        1       5
Interval [10, 42]        1        3       4
Sum                      5        4       9
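The whole merging loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `chi_square` and `chimerge` are hypothetical, every sample starts as its own interval (the F values here are all distinct), a zero expected value is replaced by 0.1, and ties for the minimum χ2 are broken by taking the leftmost pair. Because of the tie-breaking rule, the intermediate merges need not follow the same order as the steps described above.

```python
def chi_square(a, b, zero_expected=0.1):
    """Chi-squared statistic for two adjacent intervals given as class-count lists."""
    table = [a, b]
    row_totals = [sum(row) for row in table]
    col_totals = [x + y for x, y in zip(a, b)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                expected = zero_expected  # assumed substitution for zero expected counts
            total += (observed - expected) ** 2 / expected
    return total


def chimerge(values, labels, classes, threshold):
    """Merge adjacent intervals until the smallest chi-squared reaches the threshold.

    Returns the groups of values that share an interval and the cut points
    (midpoints) between neighboring groups.
    """
    # One interval per sample: (values in the interval, counts per class).
    intervals = [([v], [int(k == c) for c in classes]) for v, k in zip(values, labels)]
    while len(intervals) > 1:
        scores = [chi_square(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        best = min(range(len(scores)), key=scores.__getitem__)  # leftmost minimum
        if scores[best] >= threshold:
            break  # every remaining pair differs significantly
        (v1, c1), (v2, c2) = intervals[best], intervals[best + 1]
        intervals[best:best + 2] = [(v1 + v2, [x + y for x, y in zip(c1, c2)])]
    groups = [vals for vals, _ in intervals]
    cuts = [(g1[-1] + g2[0]) / 2 for g1, g2 in zip(groups, groups[1:])]
    return groups, cuts


# Data from Table 3.6.
f = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
k = [1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1]

groups, cuts = chimerge(f, k, classes=[1, 2], threshold=2.706)
print(groups)
print(cuts)
```

With the threshold 2.706 the loop stops with three groups of samples separated at F = 10 and F = 42, which agrees with the intervals [0, 10], [10, 42], and [42, 60] given in the text. The smallest remaining statistic is the one between the first two groups, about 2.72.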
3.8 CASE REDUCTION
Data mining can be characterized as a secondary data analysis in the sense that data miners are not involved directly in the data-collection process. That fact may sometimes explain the poor quality of raw data. Seeking the unexpected or the unforeseen, the data-mining process is not concerned with optimal ways to collect data and to select the initial set of samples; they are already given, usually in large numbers, with a high or low quality, and with or without prior knowledge of the problem at hand.
The largest and the most critical dimension in the initial data set is the number of cases or samples or, in other words, the number of rows in the tabular representation of data. Case reduction is the most complex task in data reduction. Already, in the preprocessing phase, we have elements of case reduction through the elimination of outliers and, sometimes, samples with missing values. But the main reduction process is still ahead. If the number of samples in the prepared data set can be managed by the selected data-mining techniques, then there is no technical or theoretical reason for case reduction. In real-world data-mining applications, however, with millions of samples available, that is not the case.
Let us specify two ways in which the sampling process arises in data analysis. First, sometimes the data set itself is merely a sample from a larger, unknown population, and sampling is a part of the data-collection process. Data mining is not interested in this type of sampling. Second (another characteristic of data mining), the initial data set represents an extremely large population and the analysis of the data is based only on a subset of samples. After the subset of data is obtained, it is used to provide some information about the entire data set. This information is often called an estimator, and its quality depends on the elements in the selected subset. A sampling process always causes a sampling error. Sampling error is inherent and unavoidable for every approach and every strategy. This error, in general, will decrease as the size of the subset increases, and it will theoretically become nonexistent in the case of a complete data set. Compared with data mining of an entire data set, practical sampling possesses one or more of the following advantages: reduced cost, greater speed, greater scope, and sometimes even higher accuracy. As yet there is no known method of sampling that ensures that the estimates of the subset will be equal to