Data Mining
- Author: Mehmed Kantardzic
Suppose that the data set has three input variables, x1, x2, and x3, and one output Y. In preparation for the use of the linear regression method, it is necessary to estimate the simplest model, in terms of the number of required inputs. Suppose that after applying the ANOVA methodology the results given in Table 5.4 are obtained.
TABLE 5.4. ANOVA for a Data Set with Three Inputs, x1, x2, and x3
The results of ANOVA show that the input attribute x3 does not have an influence on the output estimation because the F-ratio value is close to 1:
In all other cases, the reduced subsets of inputs increase the F-ratio significantly, and therefore the number of input dimensions cannot be reduced further without degrading the quality of the model. The final linear regression model for this example will use only x1 and x2:
$$Y = \alpha + \beta_1 x_1 + \beta_2 x_2$$
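The input-selection idea above can be sketched in Python with NumPy. This is a minimal illustration, not the computation behind Table 5.4: it assumes the F-ratio is the residual variance estimate of a reduced model divided by that of the full model, and it uses synthetic data (an assumption) in which the output depends on x1 and x2 only. The helper name `residual_variance` is hypothetical.

```python
import numpy as np

def residual_variance(X, y):
    """Least-squares fit with an intercept; return RSS / (m - number of coefficients)."""
    m = X.shape[0]
    design = np.column_stack([np.ones(m), X])      # prepend the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))  # residual sum of squares
    return rss / (m - design.shape[1])

# Synthetic data (illustrative assumption): x3 has no influence on y.
rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(m, 3))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=m)

s2_full = residual_variance(X, y)

# Ratio of reduced-model variance to full-model variance when one input is dropped.
f_ratio = {}
for dropped in range(3):
    kept = [k for k in range(3) if k != dropped]
    f_ratio[f"x{dropped + 1}"] = residual_variance(X[:, kept], y) / s2_full

for name, value in f_ratio.items():
    print(f"dropping {name}: F-ratio = {value:.2f}")
```

Under these assumptions, dropping x3 should leave the ratio close to 1, while dropping x1 or x2 should raise it well above 1, which mirrors the reasoning used to discard x3.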
Multivariate ANOVA (MANOVA) is a generalization of the previously explained ANOVA, and it concerns data-analysis problems in which the output is a vector rather than a single value. One way to analyze this sort of data would be to model each element of the output separately, but this ignores the possible relationship between different outputs. In other words, the analysis would be based on the assumption that the outputs are not related. MANOVA is a form of analysis that does allow for correlation between outputs. Given the set of input and output variables, we might be able to analyze the available data set using a multivariate linear model:
$$Y_j = \alpha + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_n x_{nj} + \varepsilon_j, \quad j = 1, 2, \ldots, m$$
where n is the number of input dimensions, m is the number of samples, Yj is a vector with dimensions c × 1, and c is the number of outputs. This multivariate model can be fitted in exactly the same way as a univariate linear model, using least-squares estimation. One way to do this fitting is to fit a linear model to each of the c dimensions of the output, one at a time. The corresponding residuals for each dimension will be (yj − y’j), where yj is the exact value for a given dimension and y’j is the estimated value.
The analog of the residual sum of squares for the univariate linear model is the matrix of the residual sums of squares for the multivariate linear model. This matrix R is defined as
$$R = \sum_{j=1}^{m} (y_j - y'_j)(y_j - y'_j)^{T}$$
The matrix R has the residual sum of squares for each of the c dimensions stored on its leading diagonal. The off-diagonal elements are the residual sums of cross-products for pairs of dimensions. If we wish to compare two nested linear models to determine whether certain β’s are equal to 0, then we can construct an extra sum of squares matrix and apply a method similar to ANOVA, namely MANOVA. While we had an F-statistic in the ANOVA methodology, MANOVA is based on matrix R with four commonly used test statistics: Roy’s greatest root, the Lawley-Hotelling trace, the Pillai trace, and Wilks’ lambda. Computational details of these tests are not explained in the book, but most textbooks on statistics cover them; also, most standard statistical packages that support MANOVA support all four tests and explain which one to use under what circumstances.
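The residual matrix R described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the data are synthetic, the function name `residual_sscp` is hypothetical, and the fit is an ordinary least-squares fit with an intercept applied to all c outputs at once.

```python
import numpy as np

def residual_sscp(X, Y):
    """Fit Y ~ [1, X] B by least squares and return (B, R).

    X has shape (m, n) and Y has shape (m, c). R is the c x c matrix whose
    diagonal holds the residual sums of squares and whose off-diagonal
    entries hold the residual sums of cross-products.
    """
    m = X.shape[0]
    design = np.column_stack([np.ones(m), X])        # prepend the intercept column
    B, *_ = np.linalg.lstsq(design, Y, rcond=None)   # one coefficient column per output
    residuals = Y - design @ B                       # shape (m, c)
    R = residuals.T @ residuals                      # sum over samples of e_j e_j^T
    return B, R

# Synthetic data (illustrative assumption): m = 50 samples, n = 3 inputs, c = 2 outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_B = np.array([[1.0, -2.0],
                   [0.5,  0.0],
                   [0.0,  1.5],
                   [0.0,  0.0]])
Y = np.column_stack([np.ones(50), X]) @ true_B + 0.1 * rng.normal(size=(50, 2))

B, R = residual_sscp(X, Y)
print(R)
```

Fitting all outputs in one call gives the same coefficients as fitting each output separately; the only extra information in R is the off-diagonal cross-product terms, which the MANOVA test statistics rely on.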
Classical multivariate analysis also includes the method of principal component analysis, where the set of vector samples is transformed into a new set with a reduced number of dimensions. This method was explained in Chapter 3, in the discussion of data reduction and data transformation as preprocessing phases for data mining.
5.6 LOGISTIC REGRESSION
Linear regression is used to model continuous-value functions. Generalized regression models represent the theoretical foundation on which the linear regression approach can be applied to model categorical response variables. A common type of generalized linear model is logistic regression. Logistic regression models the probability of some event occurring as a linear function of a set of predictor variables.
Rather than predicting the value of the dependent variable, the logistic regression method tries to estimate the probability that the dependent variable will have a given value. For example, in place of predicting whether a customer has a good or bad credit rating, the logistic regression approach tries to estimate the probability of a good credit rating. The actual state of the dependent variable is determined by looking at the estimated probability. If the estimated probability is greater than 0.50 then the prediction is closer to YES (a good credit rating), otherwise the output is closer to NO (a bad credit rating is more probable). Therefore, in logistic regression, the probability p is called the success probability.
We use logistic regression only when the output variable of the model is defined as a binary categorical variable. On the other hand, there is no special reason why any of the inputs should not also be quantitative, and, therefore, logistic regression supports a more general input data set. Suppose that output Y has two possible categorical values coded as 0 and 1. Based on the available data we can compute the probabilities for both values for the given input sample: P(yj = 0) = 1 − pj and P(yj = 1) = pj. The model with which we will fit these probabilities is an adapted form of linear regression:
$$\log\left(\frac{p_j}{1 - p_j}\right) = \alpha + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_n x_{nj}$$
This equation is known as the linear logistic model. The function log(pj/[1 − pj]) is often written as logit(p). The main reason for using the logit form of the output is to prevent the predicted probabilities from falling outside the required range [0, 1]. Suppose that the estimated model, based on a training data set and using the linear regression procedure, is given by a linear equation
and also suppose that the new sample for classification has input values {x1, x2, x3} = {1, 0, 1}. Using the linear logistic model, it is possible to estimate the probability of the output value 1, p(Y = 1), for this sample. First, calculate the corresponding logit(p):
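and then the probability of the output value 1 for the given inputs, obtained by inverting the logit:
$$p(Y = 1) = \frac{e^{\mathrm{logit}(p)}}{1 + e^{\mathrm{logit}(p)}}$$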
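The two-step calculation described in this section (first the logit, then the probability) can be sketched in Python. The intercept and coefficients below are hypothetical placeholders chosen only for illustration; they are not the estimated model from the text. The function names are likewise assumptions.

```python
import math

def logit(p):
    """Log-odds of a probability p in the open interval (0, 1)."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Map a logit value back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefficients, x):
    """Evaluate the linear logistic model and return P(Y = 1) for sample x."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return inverse_logit(z)

# Hypothetical model parameters (NOT taken from the text).
intercept = 0.5
coefficients = [0.8, -1.2, 0.3]

# The sample from the text: {x1, x2, x3} = {1, 0, 1}.
sample = [1, 0, 1]

p = predict_probability(intercept, coefficients, sample)
label = 1 if p > 0.5 else 0   # the 0.50 threshold described earlier in the section
print(f"logit = {logit(p):.2f}, p(Y=1) = {p:.2f}, predicted class = {label}")
```

With these placeholder values the logit is 1.6, so the probability is above 0.5 and the sample is assigned to class 1. Because the inverse logit always returns a value strictly between 0 and 1, the prediction can never fall outside the valid probability range, which is the motivation given above for the logit form.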