Data Mining, by Mehmed Kantardzic
The Bayes theorem provides the theoretical background for a statistical approach to inductive-inference classification problems. We first explain the basic concepts behind the Bayes theorem, and then use the theorem to explain the Naïve Bayesian classification process, also known as the simple Bayesian classifier.
Let X be a data sample whose class label is unknown. Let H be some hypothesis, such as that the data sample X belongs to a specific class C. We want to determine P(H/X), the probability that the hypothesis H holds given the observed data sample X. P(H/X) is the posterior probability representing our confidence in the hypothesis after X is given. In contrast, P(H) is the prior probability of H for any sample, regardless of how the data in the sample look. The posterior probability P(H/X) is based on more information than the prior probability P(H). The Bayes theorem provides a way of calculating the posterior probability P(H/X) using the probabilities P(H), P(X), and P(X/H). The basic relation is

P(H/X) = [P(X/H) · P(H)] / P(X)
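As a minimal numeric sketch of this relation, the posterior can be computed directly from the three probabilities on the right-hand side; the values used below are made-up illustrative numbers, not taken from the book:

```python
def posterior(p_x_given_h: float, p_h: float, p_x: float) -> float:
    """Compute the posterior P(H/X) via the Bayes theorem:
    P(H/X) = P(X/H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# Illustrative values: prior P(H) = 0.3, likelihood P(X/H) = 0.5,
# evidence P(X) = 0.4.
p = posterior(0.5, 0.3, 0.4)
print(p)  # 0.375
```

Note how the posterior (0.375) exceeds the prior (0.3): observing X increased our confidence in H because X is more likely under H than on average.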
Suppose now that there is a set of m samples S = {S1, S2, … , Sm} (the training data set), where every sample Si is represented as an n-dimensional vector {x1, x2, … , xn}. Values xi correspond to attributes A1, A2, … , An, respectively. Also, there are k classes C1, C2, … , Ck, and every sample belongs to one of these classes. Given an additional data sample X (whose class is unknown), it is possible to predict the class for X using the highest conditional probability P(Ci/X), where i = 1, … , k. That is the basic idea of the Naïve Bayesian classifier. These probabilities are computed using the Bayes theorem:

P(Ci/X) = [P(X/Ci) · P(Ci)] / P(X)
As P(X) is constant for all classes, only the product P(X/Ci) · P(Ci) needs to be maximized. We compute the prior probability of each class as

P(Ci) = si / m

where si is the number of training samples of class Ci, and m is the total number of training samples.
Because the computation of P(X/Ci) is extremely complex, especially for large data sets, the naïve assumption of conditional independence between attributes is made. Using this assumption, we can express P(X/Ci) as a product:

P(X/Ci) = P(x1/Ci) · P(x2/Ci) · … · P(xn/Ci)

where xt are the values of the attributes in the sample X. The probabilities P(xt/Ci) can be estimated from the training data set.
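A minimal sketch of these two estimation steps, assuming the simple relative-frequency estimates described above (no smoothing); the small two-attribute data set is an illustrative assumption, not the book's Table 5.1:

```python
def cond_prob(samples, labels, c, t, value):
    """Estimate P(x_t / C_i) as a relative frequency among the
    training samples of class c."""
    in_class = [s for s, lab in zip(samples, labels) if lab == c]
    return sum(1 for s in in_class if s[t] == value) / len(in_class)

def likelihood(samples, labels, c, x):
    """P(X / C_i) under the naive independence assumption:
    the product of per-attribute conditional probabilities."""
    p = 1.0
    for t, value in enumerate(x):
        p *= cond_prob(samples, labels, c, t, value)
    return p

# Illustrative two-attribute training set.
samples = [(1, 0), (1, 1), (0, 1), (0, 0)]
labels = [1, 1, 2, 2]

# For class 1 and X = (1, 1): P(x0=1/C1) = 2/2, P(x1=1/C1) = 1/2.
print(likelihood(samples, labels, 1, (1, 1)))  # 0.5
```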
A simple example will show that Naïve Bayesian classification is a computationally simple process even for large training data sets. Given a training data set of seven four-dimensional samples (Table 5.1), it is necessary to predict the classification of the new sample X = {1, 2, 2, class = ?}. For each sample, A1, A2, and A3 are input dimensions and C is the output classification.
TABLE 5.1. Training Data Set for a Classification Using Naïve Bayesian Classifier
In our example, we need to maximize the product P(X/Ci) · P(Ci) for i = 1, 2 because there are only two classes. First, we compute the prior probabilities P(Ci) of the classes:
Second, we compute the conditional probabilities P(xt/Ci) for every attribute value given in the new sample X = {1, 2, 2, C = ?} (or, more precisely, X = {A1 = 1, A2 = 2, A3 = 2, C = ?}) using the training data set:
Under the assumption of conditional independence of attributes, the conditional probabilities P(X/Ci) will be
Finally, multiplying these conditional probabilities by the corresponding prior probabilities, we obtain values proportional to P(Ci/X) and find their maximum:
Based on the previous two values, which are the final results of the Naïve Bayesian classifier, we can predict that the new sample X belongs to the class C = 2. The product of probabilities for this class, P(X/C = 2) · P(C = 2), is higher, and therefore P(C = 2/X) is higher, because it is directly proportional to the computed probability product.
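The whole procedure above can be sketched end to end. Since the contents of Table 5.1 are not reproduced here, the seven three-attribute training samples below are illustrative assumptions; the code follows the text's recipe of maximizing P(X/Ci) · P(Ci):

```python
# Hypothetical 7-sample training set (NOT the book's Table 5.1):
# each sample has attributes (A1, A2, A3) and a class label in {1, 2}.
samples = [(1, 2, 1), (0, 0, 1), (2, 1, 2), (1, 2, 2),
           (0, 1, 2), (2, 2, 2), (1, 0, 1)]
labels = [1, 1, 2, 2, 1, 2, 1]

def prior(c):
    """P(Ci) = (number of samples of class c) / m."""
    return labels.count(c) / len(labels)

def likelihood(x, c):
    """P(X/Ci) as a product of per-attribute frequency estimates."""
    in_class = [s for s, lab in zip(samples, labels) if lab == c]
    p = 1.0
    for t, value in enumerate(x):
        p *= sum(1 for s in in_class if s[t] == value) / len(in_class)
    return p

def classify(x):
    # Maximize P(X/Ci) * P(Ci); P(X) is constant and can be ignored.
    return max(set(labels), key=lambda c: likelihood(x, c) * prior(c))

print(classify((1, 2, 2)))  # 2
```

With these assumed samples the classifier also picks class 2 for X = {1, 2, 2}: the likelihood product for class 2 (1/3 · 2/3 · 1 = 2/9) times its prior (3/7) exceeds the corresponding value for class 1.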
In theory, the Bayesian classifier has the minimum error rate of all classifiers developed in data mining. In practice, however, this is not always the case, because the assumed class-conditional independence between attributes rarely holds exactly.
5.4 PREDICTIVE REGRESSION
The prediction of continuous values can be modeled by a statistical technique called regression. The objective of regression analysis is to determine the best model that relates the output variable to various input variables. More formally, regression analysis is the process of determining how a variable Y is related to one or more other variables x1, x2, … , xn. Y is usually called the response, output, or dependent variable, and the xi are called inputs, regressors, explanatory variables, or independent variables. Common reasons for performing regression analysis include
1. the output is expensive to measure but the inputs are not, and so a cheap prediction of the output is sought;
2. the values of the inputs are known before the output is known, and a working prediction of the output is required;
3. by controlling the input values, we can predict the behavior of the corresponding outputs; and
4. there might be a causal link between some of the inputs and the output, and we want to identify the links.
Before explaining the regression technique in detail, let us explain the main difference between two concepts: interpolation and regression. In both cases a training data set of pairs (xt, rt) is given, where the xt are input features and the output values rt ∈ R.
If there is no noise in the data set, the task is interpolation. We would like to find a function f(x) that passes through all the training points, so that rt = f(xt) for every t. In polynomial interpolation, given N points, a polynomial of degree (N − 1) can be used to predict the exact output r for any input x.
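A pure-Python sketch of this idea using the Lagrange form of the interpolating polynomial; the three sample points below are illustrative (they lie on r = x², so the degree-2 interpolant is exactly that parabola):

```python
def lagrange_interpolate(points, x):
    """Evaluate at x the unique degree-(N - 1) polynomial passing
    through the N given (x_t, r_t) points (x_t must be distinct)."""
    total = 0.0
    for i, (xi, ri) in enumerate(points):
        # Basis polynomial L_i: equals 1 at xi and 0 at every other x_j.
        term = ri
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Three points taken from r = x**2; the interpolant reproduces them
# exactly and extrapolates as the same polynomial.
pts = [(0, 0), (1, 1), (2, 4)]
print(lagrange_interpolate(pts, 1))  # 1.0 (passes through the point)
print(lagrange_interpolate(pts, 3))  # 9.0 (the polynomial is x**2)
```

With noisy data, forcing the curve through every point like this overfits; that is exactly where regression, covered next, replaces interpolation.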
In