All frequent itemsets can be divided into subsets (six for our example) without overlap: (1) frequent itemsets containing item p (the end of list L); (2) itemsets containing item m but not p; (3) frequent itemsets containing b but neither m nor p; … ; and (6) frequent itemsets containing only f. This classification is valid for our example, but the same principles can be applied to other databases and other L lists.

Based on the node-link connections, we collect all the transactions in which p participates by starting from the header-table entry for p and following p’s node links. In our example, two paths are selected in the FP tree: {(f,4), (c,3), (a,3), (m,2), (p,2)} and {(c,1), (b,1), (p,1)}, so the samples containing the frequent item p are {(f,2), (c,2), (a,2), (m,2), (p,2)} and {(c,1), (b,1), (p,1)}, where each prefix item is counted with p’s count on that path. Only the frequent itemset {(c,3), (p,3)}, or simplified {c, p}, satisfies the given threshold value (3); all other itemsets with p fall below the threshold.

The next subset of frequent itemsets comprises those with m and without p. The FP tree yields the paths {(f,4), (c,3), (a,3), (m,2)} and {(f,4), (c,3), (a,3), (b,1), (m,1)}, with the corresponding accumulated samples {(f,2), (c,2), (a,2), (m,2)} and {(f,1), (c,1), (a,1), (b,1), (m,1)}. Analyzing these samples, we discover the frequent itemset {(f,3), (c,3), (a,3), (m,3)} or, simplified, {f, c, a, m}.
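The accumulation step for both subsets can be expressed compactly. The following sketch is a minimal Python illustration of this one step, not a full FP growth implementation; the function name and data encoding are our own. It takes a conditional pattern base (prefix paths with the counts they inherit from p or from m) and keeps the items that reach the support threshold:

from collections import Counter

def frequent_in_base(pattern_base, min_support):
    """Accumulate item counts over the prefix paths of a conditional
    pattern base and keep the items that reach min_support."""
    counts = Counter()
    for path, count in pattern_base:
        for item in path:
            counts[item] += count
    return {item: c for item, c in counts.items() if c >= min_support}

# Conditional pattern base for p: prefix paths carry p's counts (2 and 1).
p_base = [(("f", "c", "a", "m"), 2), (("c", "b"), 1)]
print(frequent_in_base(p_base, 3))   # {'c': 3}  ->  frequent itemset {c, p}

# Conditional pattern base for m (paths without p), counts 2 and 1.
m_base = [(("f", "c", "a"), 2), (("f", "c", "a", "b"), 1)]
print(frequent_in_base(m_base, 3))   # {'f': 3, 'c': 3, 'a': 3}  ->  {f, c, a, m}

In the full algorithm this step recurses: a conditional FP tree is built from the surviving items and mined in the same way.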

Repeating the same process for subsets (3) to (6) in our example, additional frequent itemsets can be mined. These are the itemsets {f, c, a} and {f, c}, but they are already subsets of the frequent itemset {f, c, a, m}. Therefore, the final solution of the FP growth method is, for our example, the set of frequent itemsets {{c, p}, {f, c, a, m}}.

Experiments have shown that the FP growth algorithm is faster than the Apriori algorithm by about one order of magnitude. Several optimization techniques have been added to the FP growth algorithm, and versions of it exist for mining sequences and for mining patterns under constraints.

10.6 ASSOCIATIVE-CLASSIFICATION METHOD

CMAR is a classification method adapted from the FP growth method for the generation of frequent itemsets. The main reason we include the CMAR methodology in this chapter is its FP growth roots, but it also offers the possibility of comparing CMAR’s accuracy and efficiency with those of the C4.5 methodology.

Suppose data samples are given with n attributes (A1, A2, … , An). Attributes can be categorical or continuous. For a continuous attribute, we assume that its values are discretized into intervals in the preprocessing phase. A training data set T is a set of samples such that for each sample there exists a class label associated with it. Let C = {c1, c2, … , cm} be a finite set of class labels.

In general, a pattern P = {a1, a2, … , ak} is a set of attribute values for different attributes (1 ≤ k ≤ n). A sample is said to match pattern P if it has all the attribute values given in the pattern. For a rule R: P → c, the number of data samples matching pattern P and having class label c is called the support of rule R, denoted sup(R). The ratio of the number of samples matching pattern P and having class label c to the total number of samples matching pattern P is called the confidence of R, denoted conf(R). The associative-classification method (CMAR) consists of two phases:

1. rule generation or training, and

2. classification or testing.

In the first, rule-generation phase, CMAR computes the complete set of rules of the form R: P → c such that sup(R) and conf(R) pass the given thresholds; that is, for a given support threshold and confidence threshold, the associative-classification method finds the complete set of class-association rules (CARs) passing the thresholds. In the testing phase, when a new (unclassified) sample arrives, the classifier, represented by a set of association rules, selects the rule that matches the sample and has the highest confidence, and uses it to predict the class of the new sample.
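As a concrete reading of these definitions, the following minimal Python sketch (our own illustration, with hypothetical names and a training set encoded as (attribute-dictionary, class-label) pairs) computes sup(R) and conf(R) for a rule, and classifies a new sample by the highest-confidence matching rule, as described above:

def sup_conf(samples, pattern, label):
    """Support and confidence of rule R: pattern -> label.
    samples is a list of (attrs, class_label) pairs, where attrs and
    pattern are dicts mapping attribute name -> attribute value."""
    matches = [c for attrs, c in samples if pattern.items() <= attrs.items()]
    sup = sum(1 for c in matches if c == label)
    conf = sup / len(matches) if matches else 0.0
    return sup, conf

def classify(rules, attrs):
    """Predict a class with the highest-confidence matching rule.
    rules is a list of (pattern, label, confidence) triples."""
    matching = [(conf, label) for pattern, label, conf in rules
                if pattern.items() <= attrs.items()]
    return max(matching)[1] if matching else None

Note that the published CMAR classifier additionally prunes rules and can combine groups of matched rules; the single-rule selection above follows the simplified description given here.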

We will illustrate the basic steps of the algorithm through one simple example. Suppose that for a given training data set T, as shown in Table 10.3, the support threshold is 2 and the confidence threshold is 70%.

TABLE 10.3. Training Database T for the CMAR Algorithm

First, CMAR scans the training data set and finds the set of attribute values occurring with at least the threshold support (at least twice in our database). One simple approach is to sort the values of each attribute and find all frequent ones. For our database T this set is F = {a1, b2, c1, d3}, called the frequent item set; all other attribute values fail the support threshold. Then CMAR sorts the attribute values in F in support-descending order: F-list = (a1, b2, c1, d3).
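This first scan reduces to a single counting pass. The following minimal Python sketch (our own illustration; Table 10.3 itself is not reproduced here) counts attribute values across all samples, filters by the support threshold, and sorts in support-descending order:

from collections import Counter

def build_f_list(samples, min_support):
    """One scan over the training set: count every attribute value,
    keep those meeting min_support, sort by descending support.
    samples is a list of (attrs-dict, class_label) pairs."""
    counts = Counter(v for attrs, _ in samples for v in attrs.values())
    frequent = [(v, c) for v, c in counts.items() if c >= min_support]
    frequent.sort(key=lambda vc: vc[1], reverse=True)
    return [v for v, _ in frequent]

# For the database T of Table 10.3 with min_support = 2, this scan
# yields F-list = (a1, b2, c1, d3), as stated in the text.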

Now CMAR scans the training data set again to construct an FP tree, a prefix tree with respect to the F-list. For each sample in the training data set, the attribute values appearing in the F-list are extracted and sorted according to the order of the F-list. For example, for the first sample in database T, (a1, c1) is extracted and inserted into the tree as its left-most branch. The class label of the sample and the corresponding counter are attached to the last node in the path.

Samples in the training data set share prefixes. For example, the second sample carries the attribute values (a1, b2, c1) from the F-list and shares the common prefix a1 with the first sample. An additional branch from node a1 is therefore inserted into the tree, with new nodes b2 and c1. The new class label B, with a count of 1, is attached at the end of the new path. The final FP tree for database T is given in Figure 10.7a.
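The insertion step just described can be sketched as follows. This is a minimal Python illustration of the prefix-tree construction; the class label A for the first sample is an assumption for the example, since Table 10.3 is not reproduced, and node links and the header table are omitted:

class FPNode:
    def __init__(self, value=None):
        self.value = value
        self.children = {}      # attribute value -> child FPNode
        self.class_counts = {}  # class label -> count at the path end

def insert_sample(root, f_list, values, label):
    """Keep only attribute values in the F-list, order them by the
    F-list, then walk/extend the prefix tree along that path."""
    order = {v: i for i, v in enumerate(f_list)}
    path = sorted((v for v in values if v in order), key=order.get)
    node = root
    for v in path:
        node = node.children.setdefault(v, FPNode(v))
    node.class_counts[label] = node.class_counts.get(label, 0) + 1

root = FPNode()
f_list = ["a1", "b2", "c1", "d3"]
insert_sample(root, f_list, {"a1", "c1"}, "A")        # first sample (label assumed)
insert_sample(root, f_list, {"a1", "b2", "c1"}, "B")  # second sample, shares prefix a1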

Figure 10.7. FP tree for the database in Table 10.3. (a) non-merged FP tree; (b) FP tree after merging the d3 nodes.

After analyzing all the samples and
