Data Mining Mehmed Kantardzic (good english books to read .txt) 📖
- Author: Mehmed Kantardzic
Book online «Data Mining Mehmed Kantardzic (good english books to read .txt) 📖». Author Mehmed Kantardzic
To obtain a graphical interpretation of the learning tasks, we start with the formalization and representation of data samples that are the “infrastructure” of the learning process. Every sample used in data mining represents one entity described with several attribute–value pairs. That is, one row in a tabular representation of a training data set, and it can be visualized as a point in an n-dimensional space, where n is the number of attributes (dimensions) for a given sample. This graphical interpretation of samples is illustrated in Figure 4.9, where a student with the name John represents a point in a 4-D space that has four additional attributes.
Figure 4.9. Data samples = points in an n-dimensional space.
When we have a basic idea of the representation of each sample, the training data set can be interpreted as a set of points in the n-dimensional space. Visualization of data and a learning process is difficult for large number of dimensions. Therefore, we will explain and illustrate the common learning tasks in a 2-D space, supposing that the basic principles are the same for a higher number of dimensions. Of course, this approach is an important simplification that we have to take care of, especially keeping in mind all the characteristics of large, multidimensional data sets, explained earlier under the topic “the curse of dimensionality.”
Let us start with the first and most common task in inductive learning: classification. This is a learning function that classifies a data item into one of several predefined classes. The initial training data set is given in Figure 4.10a. Samples belong to different classes and therefore we use different graphical symbols to visualize each class. The final result of classification in a 2-D space is the curve shown in Figure 4.10b, which best separates samples into two classes. Using this function, every new sample, even without a known output (the class to which it belongs), may be classified correctly. Similarly, when the problem is specified with more than two classes, more complex functions are a result of a classification process. For an n-dimensional space of samples the complexity of the solution increases exponentially, and the classification function is represented in the form of hypersurfaces in the given space.
Figure 4.10. Graphical interpretation of classification. (a) Training data set; (b) classification function.
The second learning task is regression .The result of the learning process in this case is a learning function, which maps a data item to a real-value prediction variable. The initial training data set is given in Figure 4.11a. The regression function in Figure 4.11b was generated based on some predefined criteria built inside a data-mining technique. Based on this function, it is possible to estimate the value of a prediction variable for each new sample. If the regression process is performed in the time domain, specific subtypes of data and inductive-learning techniques can be defined.
Figure 4.11. Graphical interpretation of regression. (a) Training data set; (b) regression function.
Clustering is the most common unsupervised learning task. It is a descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data. Figure 4.12a shows the initial data, and they are grouped into clusters, as shown in Figure 4.12b, using one of the standard distance measures for samples as points in an n-dimensional space. All clusters are described with some general characteristics, and the final solutions differ for different clustering techniques. Based on results of the clustering process, each new sample may be assigned to one of the previously found clusters, using its similarity with the cluster characteristics of the sample as a criterion.
Figure 4.12. Graphical interpretation of clustering. (a) Training data set; (b) description of clusters.
Summarization is also a typical descriptive task, where the inductive-learning process is without a teacher. It involves methods for finding a compact description for a set (or subset) of data. If a description is formalized, as given in Figure 4.13b, that information may simplify and therefore improve the decision-making process in a given domain.
Figure 4.13. Graphical interpretation of summarization. (a) Training data set; (b) formalized description.
Dependency modeling is a learning task that discovers local models based on a training data set. The task consists of finding a model that describes significant dependency between features or between values in a data set covering not the entire data set, but only some specific subsets. An illustrative example is given in Figure 4.14b, where the ellipsoidal relation is found for one subset and a linear relation for the other subset of the training data. These types of modeling are especially useful in large data sets that describe very complex systems. Discovering general models based on the entire data set is, in many cases, almost impossible, because of the computational complexity of the problem at hand.
Figure 4.14. Graphical interpretation of dependency-modeling task. (a) Training data set; (b) discovered local dependencies.
Change and deviation detection is a learning task, and we have been introduced already to some of its techniques in Chapter 2. These are the algorithms that detect outliers. In general, this task focuses on discovering the most significant changes in a large data set. Graphical illustrations of the task are given in Figure 4.15. In Figure 4.15a the task is to discover outliers in a given data set with discrete values of features. The task
Comments (0)