Data Mining Mehmed Kantardzic (good english books to read .txt) 📖
- Author: Mehmed Kantardzic
Book online «Data Mining Mehmed Kantardzic (good english books to read .txt) 📖». Author Mehmed Kantardzic
Figure 11.9. Rank 2 approximation of the singular value decomposition.
The rank k approximation to VT gives us our dimensionally reduced data set where each document is now described with only two dimensions. VT can be thought of as a transformation of the original document matrix and can be used for any of a number of text-mining tasks, such as classification and clustering. Improved results may be obtained using the newly derived dimensions, comparing against the data-mining tasks using the original word counts.
In most text-mining cases, all training documents (in our case 4) would have been included in matrix A and transformed together. Document five (d5) was intentionally left out of these calculations to demonstrate a technique called “folding-in,” which allows for a small number of documents to be transformed into the new reduced space and compared with a training document database. Matrices from our previous SVD calculations are used for this transformation. This is done using the following modified formula: . This equation is a rearrangement of terms from the initial SVD equation replacing the original data set, matrix A, with the term counts for document five (d5) as matrix A’. The resulting multiplication can be seen in Figure 11.10. The result is the document d5 represented in the reduced 2-D space, d5 = [−0.172, 0.025].
Figure 11.10. SVD calculations to produce a 2-D approximation for d5.
To visualize the transformed data, we have plotted our example documents, d1..5, in Figure 11.11. We now perform the nearest neighbor classification algorithm. If the task is to disambiguate the term “bank,” and the possible classes are “financial institution” and “basketball shot,” then a review of the original text reveals that documents d1, d2, and d5 are of the class “financial institution” and documents d3 and d4 are of the class “basketball shot.” To classify test document d5, we compare d5 with each other document. The closest neighbor will determine the class of d5. Ideally, d5 should be nearest to d1 and d2, and farthest from d3 and d4, as pointed earlier. Table 11.4 shows these calculations using the Euclidean distance metric. The assessment for which document is closest is made based on both the original 10 dimensions and the reduced set of two dimensions. Table 11.5 shows the same comparison using cosine similarity to compare documents. Cosine similarity is often used in text-mining tasks for document comparisons.
TABLE 11.4. Use of Euclidean Distance to Find Nearest Neighbor to d5 in Both 2-d and 10-d (the Smallest Distance Ranks First)
TABLE 11.5. Use of Cosine Similarity to Find Most Similar Document to d5 in Both 2-d and 10-d (the Largest Similarity Ranks First)
Figure 11.11. 2-D plot of documents and query using LSA.
Euclidean distance as shown in Table 11.4 when using 10 dimensions ranks d3 and d4 above d1 and d2. By the formulation of Euclidean distance, when two documents both share a term or both do not share a term, the result is a distance of 0 for that dimension. After applying the LSA transformation, the Euclidean distance ranks d1 above d3 and d4. However, document d2 only is ranked above d3 and not d4.
Cosine similarity calculates the cosine of the angle between two vectors representing points in n-dimensional space. The result is a similarity score between 0 and 1, where a 1 shows highest similarity, and 0 shows no similarity. For document vector comparisons, no additional strength of similarity is added by two vectors containing a 0 for a specific term. This is a very beneficial characteristic for textual applications. When we consider Table 11.5 we see that without the transformation, in 10-D space, d2 is ranked above d3 and d4 for containing both terms in d5. However, d1 is ranked below d3 and d4 for having more terms than these other sentences. After the LSA transformation, d1 and d2 are ranked above d3 and d4, providing the expected ranking. In this simple example the best results occurred when we first transformed the data using LSA and then used cosine similarity to measure the similarity between initial training documents in a database and a new document for comparison. Using the nearest neighbor classifier, d5 would be classified correctly as a “financial institution” document using cosine similarity for both the 10-d and 2-d cases or using Euclidean distance for the 2-d case. Euclidean distance in the 10-d case would have classified d5 incorrectly. If we used k-nearest neighbor with k = 3, then Euclidean distance in the 10-d case would have also incorrectly classified d5. Clearly, the LSA transformation affects the results of document comparisons, even in this very simple example. Results are better because LSA enable better representation of a document’s semantics.
11.8 REVIEW QUESTIONS AND PROBLEMS
1. Give specific examples of where Web-content mining, Web-structure mining, and Web-usage mining would be valuable. Discuss the benefits.
2. Given a table of linked Web pages:PageLinked to pageAB, D, E, FBC, D, ECB, E, FDA, F, EEB, C, FFA, B
(a) Find authorities using two iterations of the HITS algorithm.
(b) Find hubs using two iterations of the HITS algorithm.
(c) Find the PageRank scores for each page after one iteration using 0.1 as the dampening factor.
(d) Explain the HITS and PageRank authority rankings obtained in (a) and (c).
3. For the traversal log: {X, Y, Z, W, Y, A, B, C, D, Y, C, D, E, F, D, E, X, Y, A, B, M, N},
(a) find MFR;
(b) find LRS if the threshold value is 0.3 (or 30%);
(c) Find MRS.
4. Given the following text documents and assumed decomposition:DocumentTextAWeb-content miningBWeb-structure miningCWeb-usage miningDText mining
(a)
Comments (0)