Data Mining Mehmed Kantardzic (good english books to read .txt) 📖
- Author: Mehmed Kantardzic
Book online «Data Mining Mehmed Kantardzic (good english books to read .txt) 📖». Author Mehmed Kantardzic
and it is a built-in function (like all other descriptive statistical measures) in most modern, statistical software tools. For each numeric feature in the n-dimensional set of samples, it is possible to calculate the mean value as a central tendency characteristic for this feature. Sometimes, each value xi in a set may be associated with a weight wi, which reflects the frequency of occurrence, significance, or importance attached to the value. In this case, the weighted arithmetic mean or the weighted average value is
Although the mean is the most useful quantity that we use to describe a set of data, it is not the only one. For skewed data sets, a better measure of the center of data is the median. It is the middle value of the ordered set of feature values if the set consists of an odd number of elements and it is the average of the middle two values if the number of elements in the set is even. If x1, x2, … , xn represents a data set of size n, arranged in increasing order of magnitude, then the median is defined by
Another measure of the central tendency of a data set is the mode. The mode for the set of data is the value that occurs most frequently in the set. While mean and median are characteristics of primarily numeric data sets, the mode may be applied also to categorical data, but it has to be interpreted carefully because the data are not ordered. It is possible for the greatest frequency to correspond to several different values in a data set. That results in more than one mode for a given data set. Therefore, we classify data sets as unimodal (with only one mode) and multimodal (with two or more modes). Multimodal data sets may be precisely defined as bimodal, trimodal, and so on. For unimodal frequency curves that are moderately asymmetrical, we have the following useful empirical relation for numeric data sets
that may be used for an analysis of data set distribution and the estimation of one central-tendency measure based on the other two.
As an example, let us analyze these three measures on the simple data set T that has the following numeric values:
After a sorting process the same data set is given as
The corresponding descriptive statistical measures for central tendency are
The degree to which numeric data tend to spread is called dispersion of the data, and the most common measures of dispersion are the standard deviation σ and the variance σ2. The variance of n numeric values x1, x2, … , xn is
The standard deviation σ is the square root of the variance σ2. The basic properties of the standard deviation σ as a measure of spread are
1. σ measures spread about the mean and should be used only when the mean is chosen as a measure of the center.
2. σ = 0 only when there is no spread in the data, that is, when all measurements have the same value. Otherwise σ > 0.
For the data set given in our example, variance σ2 and standard deviation σ are
In many statistical software tools, a popularly used visualization tool of descriptive statistical measures for central tendency and dispersion is a boxplot that is typically determined by the mean value, variance, and sometimes max and min values of the data set. In our example, the minimal and maximal values in the T set are minT = 0, and maxT = 9. Graphical representation of statistical descriptors for the data set T has a form of a boxplot, given in Figure 5.1.
Figure 5.1. A boxplot representation of the data set T based on mean value, variance, and min and max values.
Analyzing large data sets requires proper understanding of the data in advance. This would help domain experts to influence the data-mining process and to properly evaluate the results of a data-mining application. Central tendency measures for a data set are valuable only for some specific distributions of data values. Therefore, it is important to know the characteristics of a distribution for a data set we are analyzing. The distribution of values in a data set is described according to the spread of its values. Usually, this is best done using a histogram representation; an example is given in Figure 5.2. In addition to quantifying the distribution of values for each feature, it is also important to know the global character of the distributions and all specifics. Knowing that data set has a classic bell curve empowers researchers to use a broad range of traditional statistical techniques for data assessment. But in many practical cases, the distributions are skewed or multimodal, and traditional interpretations of concepts such as mean value or standard deviation do not make sense.
Figure 5.2. Displaying a single-feature distribution.
Part of the assessment process is determining relations between features in a data set. Simple visualization through the scatter plots gives initial estimation of these relations. Figure 5.3 shows part of the integrated scatter plot where each pair of features is compared. This visualization technique is available in most integrated data-mining tools. Quantification of these relations is given through the correlation factor.
Figure 5.3. Scatter plots showing the correlation between features from −1 to 1.
These visualization processes are part of the data-understanding phase, and they are essential in preparing better for data mining. This human interpretation helps to obtain a general view of the data. It is also possible to identify abnormal or interesting characteristics, such as anomalies.
5.3 BAYESIAN INFERENCE
It is not hard to imagine situations in which the data are not the only available source of information about the population or about the system to be modeled. The Bayesian method provides a principled way to incorporate this
Comments (0)