Figure 2.6. Outliers for univariate data based on mean value and standard deviation.
Figure 2.7. Two-dimensional data set with one outlying sample.
Statistically based outlier-detection methods can be divided into univariate methods, proposed in earlier works in this field, and multivariate methods, which usually form most of the current body of research. Statistical methods either assume a known underlying distribution of the observations or, at least, are based on statistical estimates of unknown distribution parameters. These methods flag as outliers those observations that deviate from the model assumptions. The approach is often unsuitable for high-dimensional data sets and for arbitrary data sets without prior knowledge of the underlying data distribution.
Most of the earliest univariate methods for outlier detection rely on the assumption of an underlying known distribution of the data, which is assumed to be independent and identically distributed. Moreover, many discordance tests for detecting univariate outliers further assume that the distribution parameters and the type of expected outliers are also known. Although traditionally the normal distribution has been used as the target distribution, this definition can easily be extended to any unimodal symmetric distribution with a positive density function. Traditionally, the sample mean and the sample variance give good estimates of data location and data shape if the data are not contaminated by outliers. When the database is contaminated, those parameters may deviate significantly and degrade outlier-detection performance. Needless to say, in real-world data-mining applications, these assumptions are often violated.
The simplest approach to outlier detection for 1-D samples is based on traditional unimodal statistics. Assuming that the distribution of values is given, it is necessary to find basic statistical parameters such as the mean value and the variance. Based on these values and the expected (or predicted) number of outliers, it is possible to establish a threshold value as a function of the variance. All samples falling outside the threshold are candidates for outliers, as presented in Figure 2.6. The main problem with this simple methodology is the a priori assumption about the data distribution. In most real-world examples, the data distribution may not be known.
For example, if the given data set represents the feature age with 20 different values:
then, the corresponding statistical parameters are
If we select the threshold value for normal distribution of data as
then, all data that are out of the range [−54.1, 131.2] will be potential outliers. Additional knowledge of the characteristics of the feature (age is always greater than 0) may further reduce the range to [0, 131.2]. In our example there are three values that are outliers based on the given criteria: 156, 139, and −67. With high probability we can conclude that all three of them are typing errors (data entered with additional digits or an additional "−" sign).
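To make the thresholding concrete, the following Python sketch applies the same logic to an illustrative age sample; the particular values and the choice of two standard deviations as the cutoff are assumptions for demonstration rather than the book's exact figures, but the three suspicious entries 156, 139, and −67 fall outside the computed range in the same way.

```python
import numpy as np

# Illustrative "age" values; the extreme entries mimic typing errors
age = np.array([3, 56, 23, 39, 156, 52, 41, 22, 9, 28,
                139, 31, 55, 20, -67, 37, 11, 55, 45, 37])

mean = age.mean()
std = age.std(ddof=1)                     # sample standard deviation

# Threshold: two standard deviations around the mean (an assumed cutoff)
lower, upper = mean - 2 * std, mean + 2 * std
lower = max(lower, 0)                     # domain knowledge: age > 0

outliers = age[(age < lower) | (age > upper)]
print(f"range = [{lower:.1f}, {upper:.1f}], outlier candidates: {outliers}")
```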
An additional single-dimensional method is Grubbs' method (Extreme Studentized Deviate), which calculates a Z value as the absolute difference between the analyzed value and the mean value for the attribute, divided by the standard deviation for the attribute. The Z value is compared with the critical value at a 1% or 5% significance level; the sample is flagged as an outlier if Z is above that threshold.
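A possible sketch of Grubbs' test is given below; it uses SciPy's t-distribution to derive the critical value, and the function name and sample data are illustrative rather than taken from the book's text.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Test whether the most extreme value is an outlier (Grubbs' ESD test)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean, std = values.mean(), values.std(ddof=1)

    # Z statistic: largest absolute deviation from the mean, in standard deviations
    z = np.abs(values - mean) / std
    g = z.max()
    suspect = values[z.argmax()]

    # Critical value at significance level alpha, from the t-distribution
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))

    return suspect, g > g_crit

# Example: the extreme entry 156 is tested at the 5% significance level
print(grubbs_test([3, 56, 23, 39, 156, 52, 41, 22, 9, 28]))
```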
In many cases multivariable observations cannot be detected as outliers when each variable is considered independently. Outlier detection is possible only when multivariate analysis is performed, and the interactions among different variables are compared within the class of data. An illustrative example is given in Figure 2.7 where separate analysis of each dimension will not give any outlier, but analysis of 2-D samples (x,y) gives one outlier detectable even through visual inspection.
Statistical methods for multivariate outlier detection often indicate those samples that are located relatively far from the center of the data distribution. Several distance measures can be implemented for such a task. The Mahalanobis distance measure includes the inter-attribute dependencies so the system can compare attribute combinations. It is a well-known approach that depends on estimated parameters of the multivariate distribution. Given n observations x_i from a p-dimensional data set (often n ≫ p), denote the sample mean vector by \bar{x}_n and the sample covariance matrix by V_n, where

V_n = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x}_n)(x_i - \bar{x}_n)^T
The Mahalanobis distance for each multivariate data point i (i = 1, … , n) is denoted by M_i and given by

M_i = \left[ (x_i - \bar{x}_n)^T \, V_n^{-1} \, (x_i - \bar{x}_n) \right]^{1/2}
Accordingly, those p-dimensional samples with a large Mahalanobis distance are indicated as outliers. Many statistical methods require data-specific parameters representing a priori data knowledge. Such information is often not available or is expensive to compute. Also, most real-world data sets simply do not follow one specific distribution model.
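The following NumPy sketch computes these distances; the two-dimensional points are invented to mimic the situation in Figure 2.7, where the outlying sample is unremarkable in each coordinate separately but breaks the joint pattern.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X from the sample mean."""
    mean = X.mean(axis=0)                 # sample mean vector
    cov = np.cov(X, rowvar=False)         # sample covariance matrix Vn
    cov_inv = np.linalg.inv(cov)
    diff = X - mean
    # M_i = sqrt( (x_i - mean)^T Vn^{-1} (x_i - mean) )
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

# Illustrative data: the last point breaks the x-y correlation
X = np.array([[1.0, 1.1], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2],
              [5.0, 5.1], [6.0, 5.9], [3.5, 6.5]])
d = mahalanobis_distances(X)
print(d.round(2))
print("most distant sample:", X[d.argmax()])
```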
Distance-based techniques are simple to implement and make no prior assumptions about the data distribution model. However, their computational cost grows rapidly because they are founded on the calculation of the distances between all pairs of samples. The computational complexity depends on both the dimensionality of the data set m and the number of samples n, and is usually expressed as O(n²m). Hence, it is not an adequate approach for very large data sets. Moreover, this definition can lead to problems when the data set has both dense and sparse regions. For example, as the dimensionality increases, the data points are spread through a larger volume and become less dense. This makes the convex hull harder to discern, and is known as the "curse of dimensionality."
The distance-based outlier-detection method presented in this section eliminates some of the limitations imposed by the statistical approach. The most important difference is that this method is applicable to multidimensional samples, while most statistical descriptors analyze only a single dimension, or several dimensions but separately. The basic computational task of this method is the evaluation of distance measures between all samples in an n-dimensional data set. Then, a sample si in a data set S is an outlier if at least a fraction p of the samples in S lies at a distance greater than d from si. In other words, distance-based outliers are those samples that do not have enough neighbors, where neighbors are defined through the multidimensional distance between samples.
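A minimal NumPy sketch of this definition follows; the function name and the particular thresholds p and d are illustrative choices, and the explicit pairwise-distance computation makes the O(n²m) cost visible.

```python
import numpy as np

def distance_based_outliers(X, p, d):
    """Flag samples for which at least a fraction p of the other
    samples lie farther away than distance d."""
    n = len(X)
    # Pairwise Euclidean distances, O(n^2 * m)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # For each sample, the fraction of other samples beyond distance d
    far_fraction = (dist > d).sum(axis=1) / (n - 1)
    return far_fraction >= p

# Example: with p = 0.9 and d = 3, only the isolated point is flagged
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2], [9.0, 9.0]])
print(distance_based_outliers(X, p=0.9, d=3.0))
```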