Data Mining
Author: Mehmed Kantardzic
Sequence mining can be explained as follows: Given a collection of sequences ordered in time, where each sequence contains a set of Web pages, the goal is to discover sequences of maximal length that appear more frequently than a given percentage threshold over the whole collection. A frequent sequence is maximal if all sequences containing it have a lower frequency. This definition of the sequence-mining problem implies that the items constituting a frequent sequence need not necessarily occur adjacent to each other. They just appear in the same order. This property is desirable when we study the behavior of Web users because we want to record their intents, not their errors and disorientations.
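As a small illustration of this in-order-but-not-necessarily-adjacent definition, the following sketch counts how often a candidate page sequence occurs as a subsequence of a collection of user sessions. The session data and the helper names are invented for illustration:

```python
# Hypothetical sketch: measure the support of a candidate page sequence in a
# collection of user sessions. Pages must appear in the session in the same
# order, but need not be adjacent, matching the definition in the text.

def occurs_in(candidate, session):
    """True if candidate appears in session in order (not necessarily adjacent)."""
    it = iter(session)
    # 'page in it' consumes the iterator, so order is enforced
    return all(page in it for page in candidate)

def support(candidate, sessions):
    """Fraction of sessions containing the candidate as an in-order subsequence."""
    return sum(occurs_in(candidate, s) for s in sessions) / len(sessions)

sessions = [
    ["A", "B", "X", "C"],   # contains A -> B -> C with a detour through X
    ["A", "C", "B"],        # wrong order: does not contain A -> B -> C
    ["A", "B", "C", "D"],
]
print(support(["A", "B", "C"], sessions))  # 2 of 3 sessions -> 0.666...
```

A candidate would then be called frequent if its support meets the chosen percentage threshold, and maximal if no longer frequent sequence contains it.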
Many of these sequences, even those with the highest frequencies, could be of a trivial nature. In general, only the designer of the site can say what is trivial and what is not. The designer has to read all patterns discovered by the mining process and discard unimportant ones. It would be much more efficient to automatically test data-mining results against the expectations of the designer. However, we can hardly expect a site designer to write down all combinations of Web pages that are considered typical; expectations are formed in the human mind in much more abstract terms. Extraction of informative and useful maximal sequences continues to be a challenge for researchers.
Although there are several techniques proposed in the literature, we will explain one of the proposed solutions for mining traversal patterns that consists of two steps:
(a) In the first step, an algorithm converts the original sequence of log data into a set of traversal subsequences. Each traversal subsequence represents a maximum forward reference from the starting point of a user access. Note that this conversion filters out the effect of backward references, which are made mainly for ease of navigation. The reduced set of forward paths lets us concentrate on mining meaningful user-access sequences.
(b) The second step consists of a separate algorithm for determining the frequent traversal patterns, termed large reference sequences. A large reference sequence is a sequence that appears a sufficient number of times in the log database. In the final phase, the algorithm forms the maximal references from the large reference sequences. A maximal reference sequence is a large reference sequence that is not contained in any other large reference sequence.
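Step (a) can be sketched as follows. This is an illustrative reconstruction of the maximum-forward-reference conversion, using an invented click path; each time the user revisits a page already on the current forward path, the path just completed is emitted and the scan backs up to the revisited page:

```python
# A sketch of the backward-reference filtering described in step (a).
# The click path below is invented; letters stand for Web pages.

def maximum_forward_references(path):
    mfrs = []
    forward = []          # current forward path
    extending = False     # did the last move extend the path forward?
    for page in path:
        if page in forward:
            if extending:                 # a forward run just ended: emit it
                mfrs.append(list(forward))
            # backward reference: truncate back to the revisited page
            forward = forward[:forward.index(page) + 1]
            extending = False
        else:
            forward.append(page)
            extending = True
    if extending:                         # emit the final forward run
        mfrs.append(forward)
    return mfrs

print(maximum_forward_references(
    ["A", "B", "C", "D", "C", "B", "E", "G", "H", "G", "W"]))
# -> [['A', 'B', 'C', 'D'], ['A', 'B', 'E', 'G', 'H'], ['A', 'B', 'E', 'G', 'W']]
```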
For example, suppose the traversal log of a given user contains the following path (to keep it simple, Web pages are represented by letters):
The path is transformed into the tree structure shown in Figure 11.3. The set of maximum forward references (MFR) found in step (a) after elimination of backward references is
Figure 11.3. An example of traversal patterns.
When MFR have been obtained for all users, the problem of finding frequent traversal patterns is mapped into one of finding frequently occurring consecutive subsequences among all MFR. In our example, if the threshold value is 0.4 (or 40%), large reference sequences (LRS) with lengths 2, 3, and 4 are
Finally, with LRS determined, maximal reference sequences (MRS) can be obtained through the process of selection. The resulting set for our example is
In general, these sequences, obtained from large log files, correspond to a frequently accessed pattern in an information-providing service.
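The LRS and MRS computation can be sketched as follows. The MFR set, the 0.4 threshold, and the support convention (the fraction of MFRs containing the candidate as an adjacent run) are illustrative assumptions, since the book's concrete example values are not reproduced here:

```python
# A sketch of finding large reference sequences (LRS) and maximal reference
# sequences (MRS) among maximum forward references (MFR). The input MFRs and
# the support convention below are assumptions made for illustration.

def consecutive_subseqs(seq):
    """All consecutive (adjacent) subsequences of length >= 2."""
    return {tuple(seq[i:j]) for i in range(len(seq))
            for j in range(i + 2, len(seq) + 1)}

def contains_run(seq, run):
    """True if `run` occurs in `seq` as an adjacent block."""
    return any(tuple(seq[i:i + len(run)]) == run
               for i in range(len(seq) - len(run) + 1))

def large_reference_sequences(mfrs, threshold):
    """Consecutive subsequences occurring in at least `threshold` of all MFRs."""
    candidates = set().union(*(consecutive_subseqs(m) for m in mfrs))
    return {c for c in candidates
            if sum(contains_run(m, c) for m in mfrs) / len(mfrs) >= threshold}

def maximal_reference_sequences(lrs):
    """LRS not contained (as an adjacent block) in any longer LRS."""
    return {a for a in lrs
            if not any(len(a) < len(b) and contains_run(b, a) for b in lrs)}

mfrs = [["A", "B", "C", "D"], ["A", "B", "E", "G", "H"], ["A", "B", "E", "G", "W"]]
lrs = large_reference_sequences(mfrs, 0.4)
print(maximal_reference_sequences(lrs))  # {('A', 'B', 'E', 'G')}
```

With the 40% threshold, any run appearing in at least two of the three MFRs is large, and only the longest such run survives the maximality selection.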
The problem of finding LRS is very similar to that of finding frequent itemsets (those occurring in a sufficient number of transactions) in association-rule mining. They differ, however, in that a reference sequence in mining traversal patterns must consist of references in a given order, whereas a large itemset in mining association rules is just a combination of items in a transaction. The corresponding algorithms are different because they operate on different data structures: lists in the first case, sets in the second. As the popularity of Internet applications explodes, it is expected that one of the most important data-mining issues for years to come will be the problem of effectively discovering knowledge on the Web.
11.5 PAGERANK ALGORITHM
PageRank was originally published by Sergey Brin and Larry Page, the co-creators of Google, and it likely contributed to Google's early success. PageRank provides a global ranking of nodes in a graph; for search engines, it provides a query-independent authority ranking of all Web pages. Like the HITS algorithm, PageRank aims to identify authoritative Web pages. The main assumption behind the PageRank algorithm is that every link from page a to page b is a vote by page a for page b. Not all votes are equal: votes are weighted by the PageRank score of the originating node.
PageRank is based on the random surfer model. If a surfer were to randomly select a starting Web page, and at each time step the surfer were to randomly select a link on the current Web page, then PageRank could be seen as the probability that this random surfer is on any given page. Some Web pages do not contain any hyperlinks. In this model it is assumed that the random surfer selects a random Web page when exiting pages with no hyperlinks. Additionally, there is some chance that the random surfer will stop following links and restart the process.
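The random surfer model described above can be sketched as a short simulation. The three-page graph, the damping value of 0.85, and the step count below are invented for illustration; the fraction of steps spent on each page approximates its PageRank:

```python
import random

# A Monte Carlo sketch of the random surfer model. With probability d the
# surfer follows a random outgoing link; otherwise (including on pages with
# no links) the surfer teleports to a page chosen uniformly at random.

def random_surfer(out_links, steps=100_000, d=0.85, seed=0):
    rng = random.Random(seed)
    pages = list(out_links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        links = out_links[page]
        if links and rng.random() < d:
            page = rng.choice(links)      # follow a random link
        else:
            page = rng.choice(pages)      # dangling page or restart: teleport
    return {p: v / steps for p, v in visits.items()}

# Invented graph: A links to B and C, B links to C, C links back to A.
print(random_surfer({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```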
Computationally, the PageRank (Pr) of page u can be computed as follows:

Pr(u) = (1 - d)/N + d * Σ_{v ∈ In(u)} Pr(v)/|Out(v)|
Here d is a dampening factor, usually set to 0.85. N is the total number of nodes in the graph. The function In(u) returns the set of nodes with edges pointing into node u, and |Out(v)| is the number of edges pointing out of node v. For example, if the Web-page connections are those given in Figure 11.4, and the current node under consideration were node B, then the following values would hold through all iterations: N = 3, In(B)
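A minimal iterative sketch of this computation, assuming the standard PageRank update with d = 0.85. The three-node graph below is an invented stand-in for Figure 11.4, which is not reproduced here:

```python
# A sketch of iterating the PageRank formula above to a fixed point.
# The graph is hypothetical: A links to B and C, B links to C, C links to A.

def pagerank(out_links, d=0.85, iterations=50):
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1 / n for p in pages}            # start from a uniform ranking
    for _ in range(iterations):
        new = {}
        for u in pages:
            # sum Pr(v)/|Out(v)| over all v in In(u)
            in_sum = sum(pr[v] / len(out_links[v])
                         for v in pages if u in out_links[v])
            new[u] = (1 - d) / n + d * in_sum
        pr = new
    return pr

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
# C receives the highest score; with no dangling nodes the scores sum to 1
print({p: round(r, 3) for p, r in ranks.items()})
```

Each iteration redistributes every page's current score evenly over its outgoing links, plus the uniform (1 - d)/N teleport share, so the scores converge to the stationary distribution of the random surfer.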