Data Mining Mehmed Kantardzic (good english books to read .txt) 📖
- Author: Mehmed Kantardzic
Book online «Data Mining Mehmed Kantardzic (good english books to read .txt) 📖». Author Mehmed Kantardzic
Because it is costly to find frequent itemsets in large databases, incremental updating techniques should be developed to maintain the discovered frequent itemsets (and corresponding association rules) so as to avoid mining the whole updated database again. Updates on the database may not only invalidate some existing frequent itemsets but also turn some new itemsets into frequent ones. Therefore, the problem of maintaining previously discovered frequent itemsets in large and dynamic databases is nontrivial. The idea is to reuse the information of the old frequent itemsets and to integrate the support information of the new frequent itemsets in order to substantially reduce the pool of candidates to be reexamined.
In many applications, interesting associations among data items often occur at a relatively high concept level. For example, one possible hierarchy of food components is presented in Figure 10.5, where M (milk) and B (bread), as concepts in the hierarchy, may have several elementary sub-concepts. The lowest level elements in the hierarchy (M1, M2, … , B1, B2, …) are types of milk and bread defined with their bar-codes in the store. The purchase patterns in a transaction database may not show any substantial regularities at the elementary data level, such as at the bar-code level (M1, M2, M3, B1, B2, … ), but may show some interesting regularities at some high concept level(s), such as milk M and bread B.
Figure 10.5. An example of concept hierarchy for mining multiple-level frequent itemsets.
Consider the class hierarchy in Figure 10.5. It could be difficult to find high support for purchase patterns at the primitive-concept level, such as chocolate milk and wheat bread. However, it would be easy to find in many databases that more than 80% of customers who purchase milk may also purchase bread. Therefore, it is important to mine frequent itemsets at a generalized abstraction level or at multiple-concept levels; these requirements are supported by the Apriori generalized-data structure.
One extension of the Apriori algorithm considers an is-a hierarchy on database items, where information about multiple abstraction levels already exists in the database organization. An is-a hierarchy defines which items are a specialization or generalization of other items. The extended problem is to compute frequent itemsets that include items from different hierarchy levels. The presence of a hierarchy modifies the notation of when an item is contained in a transaction. In addition to the items listed explicitly, the transaction contains their ancestors in the taxonomy. This allows the detection of relationships involving higher hierarchy levels, since an itemset’s support can increase if an item is replaced by one of its ancestors.
10.5 FP GROWTH METHOD
Let us define one of the most important problems with scalability of the Apriori algorithm. To generate one FP of length 100, such as {a1, a2, … , a100}, the number of candidates that has to be generated will be at least
and it will require hundreds of database scans. The complexity of the computation increases exponentially. That is only one of the many factors that influence the development of several new algorithms for association-rule mining.
FP growth method is an efficient way of mining frequent itemsets in large databases. The algorithm mines frequent itemsets without the time-consuming candidate-generation process that is essential for Apriori. When the database is large, FP growth first performs a database projection of the frequent items; it then switches to mining the main memory by constructing a compact data structure called the FP tree. For an explanation of the algorithm, we will use the transactional database in Table 10.2 and the minimum support threshold of 3.
TABLE 10.2. The Transactional Database TTIDItemset01f, a, c, d, g, i, m, p02a, b, c, f, l, m, o03b, f, h, j, o04b, c, k, s, p05a, f, c, e, l, p, m, n
First, a scan of the database T derives a list L of frequent items occurring three or more than three times in the database. These are the items (with their supports):
The items are listed in descending order of frequency. This ordering is important since each path of the FP tree will follow this order.
Second, the root of the tree, labeled ROOT, is created. The database T is scanned a second time. The scan of the first transaction leads to the construction of the first branch of the FP tree: {(f,1), (c,1), (a,1), (m,1), (p,1)}. Only those items that are in the list of frequent items L are selected. The indices for nodes in the branch (all are 1) represent the cumulative number of samples at this node in the tree, and of course, after the first sample, all are 1. The order of the nodes is not as in the sample but as in the list of frequent items L. For the second transaction, because it shares items f, c, and a it shares the prefix {f, c, a} with the previous branch and extends to the new branch {(f, 2), (c, 2), (a, 2), (m, 1), (p, 1)}, increasing the indices for the common prefix by one. The new intermediate version of the FP tree, after two samples from the database, is given in Figure 10.6a. The remaining transactions can be inserted similarly, and the final FP tree is given in Figure 10.6b.
Figure 10.6. FP tree for the database T in Table 10.2. (a) FP tree after two samples; (b) final FT tree.
To facilitate tree traversal, an item header table is built, in which each item in list L connects nodes in the FP tree with its values through node links. All f nodes are connected in one list, all c nodes in the other, and so on. For simplicity of representation only the list for b nodes is given in Figure 10.6b. Using the compact-tree structure, the FP growth algorithm mines the complete set of frequent itemsets.
According to the list L of frequent items, the complete set of frequent itemsets can
Comments (0)