Data Mining Mehmed Kantardzic (good english books to read .txt) 📖
- Author: Mehmed Kantardzic
(b) obtain rank 1, 2, and 3 approximations to the document representations;
(c) calculate the variability preserved by rank 1, 2, and 3 approximations;
(d) manually cluster documents A, B, C, and D into two clusters.
5. Given a table of linked Web pages and a dampening factor of 0.15:
Page    Linked to page
A       F
B       F
C       F
D       F
E       A, F
F       E
(a) find the PageRank scores for each page after one iteration;
(b) find the PageRank scores after 100 iterations, recording the absolute difference between scores per iteration (be sure to use some programming or scripting language to obtain these scores);
(c) explain the scores and rankings computed previously in parts (a) and (b). How quickly would you say that the scores converged? Explain.
6. Why is the text-refining task very important in a text-mining process? What are the results of text refining?
7. Implement the HITS algorithm and discover authorities and hubs if the input is the table of linked pages.
8. Implement the PageRank algorithm and discover central nodes in a table of linked pages.
9. Develop a software tool for discovering maximal reference sequences in a Web-log file.
10. Search the Web to find the basic characteristics of publicly available or commercial software tools for association-rule discovery. Document the results of your search.
11. Apply LSA to 20 Web pages of your choosing and compare the clusters obtained using the original term counts as attributes against the attributes derived using LSA. Comment on the successes and shortcomings of this approach.
12. What are the two main steps in mining traversal patterns using log data?
13. The XYZ Corporation maintains a set of five Web pages: {A, B, C, D, and E}. The following sessions (listed in timestamp order) have been created:
Suppose that the support threshold is 30%. Find all large sequences (after building the tree).
14. Suppose a Web graph is undirected, that is, page i points to page j if and only if page j points to page i. Are the following statements true or false? Justify your answers briefly.
(a) The hubbiness and authority vectors are identical, that is, for each page, its hubbiness is equal to its authority.
(b) The matrix M that we use to compute PageRank is symmetric, that is, M[i, j] = M[j, i] for all i and j.
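As a starting point for Problems 5 and 8, the following is a minimal sketch of PageRank by power iteration, not a reference solution. It rests on two assumptions that should be checked against the chapter: the link table of Problem 5 is read as A→F, B→F, C→F, D→F, E→A and F, F→E, and the factor 0.15 is treated as the probability of a random jump (so a link is followed with probability 0.85). The names LINKS, pagerank, and teleport are illustrative only. If the chapter's formula uses the dampening factor in a different role, the two weights in the code must be swapped accordingly.

```python
# Link table as read from Problem 5 (assumed parse of the page/linked-to table).
LINKS = {
    "A": ["F"],
    "B": ["F"],
    "C": ["F"],
    "D": ["F"],
    "E": ["A", "F"],
    "F": ["E"],
}


def pagerank(links, teleport=0.15, iterations=100):
    """Run power iteration for a fixed number of steps.

    Assumption: `teleport` is the random-jump probability, so a link is
    followed with probability 1 - teleport.
    Returns (scores, deltas), where deltas[k] is the sum of absolute score
    changes produced by iteration k + 1.
    """
    pages = sorted(links)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}  # uniform starting scores
    deltas = []
    for _ in range(iterations):
        # Every page first receives an equal share of the random-jump mass.
        new_scores = {page: teleport / n for page in pages}
        for source, targets in links.items():
            if targets:
                # Split the followed-link mass evenly over the out-links.
                share = (1.0 - teleport) * scores[source] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page without out-links spreads its mass over all pages.
                share = (1.0 - teleport) * scores[source] / n
                for page in pages:
                    new_scores[page] += share
        deltas.append(sum(abs(new_scores[p] - scores[p]) for p in pages))
        scores = new_scores
    return scores, deltas


if __name__ == "__main__":
    first, _ = pagerank(LINKS, iterations=1)
    final, deltas = pagerank(LINKS, iterations=100)
    for page in sorted(final):
        print(page, round(first[page], 4), round(final[page], 4))
    print("change in last iteration:", deltas[-1])
```

The list of per-iteration changes returned by the function is the quantity part (b) asks to record; plotting or printing it shows how fast the scores settle. Under the stated assumptions the scores always sum to 1, which is a quick sanity check on any alternative implementation.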
11.9 REFERENCES FOR FURTHER STUDY
Akerkar, R., P. Lingras, Building an Intelligent Web: Theory and Practice, Jones and Bartlett Publishers, Sudbury, MA, 2008.
This book provides a number of techniques used in Web mining. Code is provided along with illustrative examples showing how to perform Web-content mining, Web-structure mining, and Web-usage mining.
Chang, G., M. J. Healey, J. A. M. McHugh, J. T. L. Wang, Mining the World Wide Web: An Information Search Approach, Kluwer Academic Publishers, Boston, MA, 2001.
This book is an effort to bridge the gap between information search and data mining on the Web. The first part of the book focuses on information retrieval (IR) on the Web. The ability to find relevant documents on the Web is essential to the process of Web mining. The cleaner the set of Web documents and data is, the better the knowledge that can be extracted from it. In the second part of the book, basic concepts and techniques of text mining, Web mining, and Web crawling are introduced. A case study, in the last part of the book, focuses on a search engine prototype called EnviroDaemon.
Garcia, E., SVD and LSI Tutorial 4: Latent Semantic Indexing (LSI) How-to Calculations, Mi Islita, 2006, http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html.
This detailed Web tutorial gives students a greater understanding of latent semantic indexing. All calculations are pictured, giving the student an opportunity to walk through the entire process.
Han, J., M. Kamber, Data Mining: Concepts and Techniques, 2nd edition, San Francisco, Morgan Kaufmann, 2006.
This book gives a sound understanding of data-mining principles. The primary orientation of the book is for database practitioners and professionals with emphasis on OLAP and data warehousing. In-depth analysis of association rules and clustering algorithms is the additional strength of the book. All algorithms are presented in easily understood pseudo-code, and they are suitable for use in real-world, large-scale data-mining projects, including advanced applications such as Web mining and text mining.
Mulvenna, M. D., et al., ed., Personalization on the Net Using Web Mining, CACM, Vol. 43, No. 8, 2000.
This is a collection of articles that explains state-of-the-art Web-mining techniques for developing personalization systems on the Internet. New methods are described for analyses of Web-log data in a user-centric manner, influencing Web-page content, Web-page design, and overall Web-site design.
Zhang, Q., R. S. Segall, Review of Data, Text and Web Mining Software, Kybernetes, Vol. 39, No. 4, 2010, pp. 625–655.
The paper reviews and compares selected software for data mining, text mining, and Web mining that is not available as free open-source software. The data-mining packages are SAS® Enterprise Miner™, Megaputer PolyAnalyst® 5.0, NeuralWare Predict®, and BioDiscovery GeneSight®. The text-mining packages are CompareSuite, SAS® Text Miner, TextAnalyst, VisualText, Megaputer PolyAnalyst® 5.0, and WordStat. The Web-mining packages are Megaputer PolyAnalyst®, SPSS Clementine®, ClickTracks, and QL2. The paper discusses and compares the existing features, characteristics, and algorithms of the selected software in each of the three categories.
12 ADVANCES IN DATA MINING
Chapter Objectives
Analyze the characteristics of graph-mining algorithms and introduce some illustrative examples.
Identify the required changes in data-mining algorithms when temporal and spatial components are introduced.
Introduce the basic characteristics of distributed data-mining algorithms and specific modifications for distributed Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering.
Describe the differences between causality and correlation.
Introduce the basic principles in Bayesian networks modeling.
Know when and how to include privacy protection in a data-mining process.
Summarize social and legal aspects of data-mining applications.
Current technological progress permits the storage and access of large amounts of data at virtually no cost. These developments have created unprecedented opportunities for large-scale data-driven discoveries, as well as the potential for fundamental gains in scientific and business understanding. The popularity of the Internet and the Web makes it imperative that the data-mining framework be extended to include distributed, time- and space-dependent