Data Mining
- Author: Mehmed Kantardzic
Adult Data Set.
http://archive.ics.uci.edu/ml/datasets/Adult
The Adult Data Set contains 48,842 samples extracted from the U.S. Census. The task is to classify individuals as having an income that does or does not exceed $50,000/year based on factors such as age, education, race, sex, and native country.
Breast Cancer Wisconsin (Diagnostic) Data Set.
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
This data set consists of a number of measurements taken over a “digitized image of a fine needle aspirate (FNA) of a breast mass.” There are 569 samples. The task is to classify each data point as benign or malignant.
A.4.2 Clustering
Bag of Words Data Set.
http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
Word counts have been extracted from five document sources: Enron Emails, NIPS full papers, KOS blog entries, NYTimes news articles and Pubmed abstracts. The task is to cluster the documents used in this data set based on the word counts found. One may compare the output clusters with the sources from which each document came.
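One simple way to compare output clusters with the document sources is cluster purity: the fraction of documents that belong to the majority source of their assigned cluster. The following is a minimal sketch under stated assumptions, not a prescribed evaluation method. The function name cluster_purity and the toy assignments are hypothetical and are not taken from the data set.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, source_labels):
    """Return the fraction of documents that match their cluster's majority source."""
    if len(cluster_ids) != len(source_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")

    # Count how many documents of each source fall into each cluster.
    by_cluster = defaultdict(Counter)
    for cluster, source in zip(cluster_ids, source_labels):
        by_cluster[cluster][source] += 1

    # For every cluster, keep only the size of its largest source group.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in by_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Toy, made-up assignments for six documents (illustration only).
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_sources = ["enron", "enron", "nips", "nips", "nips", "kos"]
purity = cluster_purity(toy_clusters, toy_sources)
```

A purity of 1.0 means every cluster contains documents from a single source. Purity tends to rise as the number of clusters grows, so it is best compared across runs that use the same number of clusters.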
US Census Data (1990) Data Set.
http://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29
This data set is a one percent sample from the 1990 Public Use Microdata Samples (PUMS). It contains 2,458,285 records and 68 attributes.
A.4.3 Regression
Auto MPG Data Set.
http://archive.ics.uci.edu/ml/datasets/Auto+MPG
This data set provides a number of attributes of cars that can be used to attempt to predict the “city-cycle fuel consumption in miles per gallon.” There are 398 data points and eight attributes.
Computer Hardware Data Set.
http://archive.ics.uci.edu/ml/datasets/Computer+Hardware
This data set provides a number of CPU attributes that can be used to predict relative CPU performance. It contains 209 data points and 10 attributes.
A.4.4 Web Mining
Anonymous Microsoft Web Data.
http://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data
This data set contains page visits for a number of anonymous users who visited www.microsoft.com. The task is to predict future categories of pages a user will visit based on the Web pages previously visited.
KDD Cup 2000.
http://www.sigkdd.org
This Web site contains five tasks used in KDD Cup, a data-mining competition that is run yearly. KDD Cup 2000 uses clickstream and purchase data obtained from Gazelle.com. Gazelle.com sold legwear and legcare products and closed its online store that same year. The Web site provides links to papers and posters by the winners of the various tasks and outlines their effective methods. Additionally, the description of the tasks provides great insight into original approaches to using data mining with clickstream data.
A.4.5 Text Mining
Reuters-21578 Text Categorization Collection.
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
This is a collection of news articles that appeared on Reuters newswire in 1987. All of the news articles have been categorized. The categorization provides opportunities to test text classification or clustering methodologies.
20 Newsgroups.
http://people.csail.mit.edu/jrennie/20Newsgroups/
The 20 Newsgroups data set contains 20,000 newsgroup documents. These documents are divided nearly evenly among 20 different newsgroups. Similar to the Reuters collection, this data set provides opportunities for text classification and clustering.
A.4.6 Time Series
Dodgers Loop Sensor Data Set.
http://archive.ics.uci.edu/ml/datasets/Dodgers+Loop+Sensor
This data set provides the number of cars counted by a sensor every 5 minutes over 25 weeks. The sensor was located at the Glendale on-ramp for the 101 North Freeway in Los Angeles. The goal of this data set was to “predict the presence of a baseball game at Dodgers stadium.”
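Counts recorded every 5 minutes are often easier to inspect after aggregation to a coarser interval. The sketch below sums such readings into hourly totals. It assumes each reading is a (timestamp, count) pair; the function name hourly_totals, the date, and the toy values are hypothetical and do not reflect the actual file layout or contents of the data set.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hourly_totals(readings):
    """Sum (timestamp, count) readings into totals keyed by the start of each hour."""
    totals = defaultdict(int)
    for timestamp, count in readings:
        # Truncate the timestamp to the beginning of its hour.
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[hour_start] += count
    return dict(totals)


# Toy, made-up readings: two hours of 5-minute counts (illustration only).
start = datetime(2000, 1, 1, 18, 0)
toy_readings = [
    (start + timedelta(minutes=5 * i), 10 if i < 12 else 20)
    for i in range(24)
]
totals = hourly_totals(toy_readings)
```

Here the first hour holds twelve readings of 10 cars and the second holds twelve readings of 20 cars, so the hourly totals are 120 and 240.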
Australia Gun Deaths.
http://robjhyndman.com/TSDL/crime.html
These data give the yearly death rates in Australia for gun-related and non-gun-related homicides and suicides for the years 1915–2004.
A.4.7 Data for Association Rule Mining
BMS-POS.
http://www.sigkdd.org/kddcup
This data set gives the category for each product purchased from a large electronics retailer. It covers several years' worth of point-of-sale data. The data set contains 515,597 transactions and 1,657 distinct items.
BMS-WebView1.
http://www.sigkdd.org/kddcup
This data set contains several months of clickstream sessions for Gazelle.com. A transaction is defined in this data set as the detail pages viewed per session. This data set contains 59,602 transactions and 497 distinct items.
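The transaction definition above (the detail pages viewed per session) can be sketched as a grouping step followed by a support count, which is the usual starting point for association rule mining. This is a minimal illustration under stated assumptions: the input is a list of (session_id, item) pairs, repeated views of the same page within a session are counted once, and all names and toy values are hypothetical rather than taken from the data set.

```python
from collections import Counter, defaultdict


def sessions_to_transactions(page_views):
    """Group (session_id, item) page views into one item set per session."""
    transactions = defaultdict(set)
    for session_id, item in page_views:
        # Assumption: a page viewed twice in a session counts as one item.
        transactions[session_id].add(item)
    return dict(transactions)


def item_support(transactions):
    """Return, for each item, the fraction of transactions that contain it."""
    counts = Counter(
        item for items in transactions.values() for item in items
    )
    total = len(transactions)
    return {item: count / total for item, count in counts.items()}


# Toy, made-up clickstream (illustration only).
toy_views = [
    ("s1", "A"), ("s1", "B"), ("s1", "A"),
    ("s2", "A"),
    ("s3", "C"), ("s3", "B"),
]
transactions = sessions_to_transactions(toy_views)
support = item_support(transactions)
```

In this toy example, items A and B each appear in two of the three sessions and item C in one, giving supports of 2/3, 2/3, and 1/3.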
A.5 COMMERCIALLY AND PUBLICLY AVAILABLE TOOLS
This summary of some publicly available commercial data-mining products is provided to help readers better understand what software tools can be found on the market and what their features are. It is not intended to endorse or critique any specific product. Potential users will need to decide for themselves the suitability of each product for their specific applications and data-mining environments. The summary is primarily intended as a starting point from which users can obtain more information. New products appear on the market constantly, and hence this list is by no means comprehensive. Because these changes are very frequent, the author suggests two Web sites for information about the latest tools and their performance: http://www.kdnuggets.com and http://www.knowledgestorm.com.
A.5.1 Free Software
DataLab
Publisher: Epina Software Labs (www.lohninger.com/datalab/en_home.html)
DataLab is a complete and powerful data-mining tool with a unique data exploration process and a focus on marketing and interoperability with SAS. There is a public version for students.
DBMiner
Publisher: Simon Fraser University (http://ddm.cs.sfu.ca)
DBMiner is a publicly available tool for data mining. It is a multiple-strategy tool and it supports methodologies such as clustering, association rules, summarization, and visualization. DBMiner uses Microsoft SQL Server 7.0 Plato and runs on different Windows platforms.
GenIQ Model
Publisher: DM STAT-1 Consulting (www.geniqmodel.com)
GenIQ Model uses machine learning for regression tasks. It automatically performs variable selection and new variable construction, and then specifies the model equation to “optimize the decile table.”
NETMAP
Publisher: http://sourceforge.net/projects/netmap
NETMAP is a general-purpose, information-visualization tool. It is most effective for large, qualitative, text-based data sets. It runs on Unix workstations.
RapidMiner
Publisher: Rapid-I (http://rapid-i.com)
Rapid-I provides software, solutions, and services in the fields of predictive analytics, data mining, and text mining. The company concentrates on automatic intelligent analyses on a large scale, that is, for large amounts of structured data, such as database systems, and unstructured data, such as texts. As an open-source data-mining specialist, Rapid-I enables other companies to use leading-edge technologies for data mining and business intelligence. The discovery and leverage of unused business intelligence from existing data enables better informed decisions and allows for process optimization.
SIPINA
Publisher: http://eric.univ-lyon2.fr/~ricco/sipina.html
Sipina-W is publicly available software that includes different traditional data-mining techniques such as CART, Elisee, ID3, C4.5, and some new methods for generating decision trees.
SNNS
Publisher: University of Stuttgart (http://www.nada.kth.se/~orre/snns-manual/)
SNNS is publicly available software. It is a simulation environment for research on and application