Based on the final value for probability p, we may conclude that the output value Y = 1 is more probable than the other categorical value Y = 0. Even this simple example shows that logistic regression is a simple yet powerful classification tool in data-mining applications. With one set of data (the training set) it is possible to establish the logistic regression model, and with another set (the testing set) we may analyze the quality of the model in predicting categorical values. The results of logistic regression may be compared with other data-mining methodologies for classification tasks, such as decision rules, neural networks, and the Bayesian classifier.
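As a rough illustration of this train-and-test workflow (not code from the book), the sketch below uses scikit-learn's LogisticRegression on small hypothetical arrays; the variable names and data are assumptions made only for the example.

```python
# A minimal sketch of the train/test workflow described above,
# assuming scikit-learn is available; the toy data are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: two numeric inputs, binary output Y.
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_train = np.array([0, 0, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

# Probability p that Y = 1 for a new (testing) sample.
X_test = np.array([[3.5, 3.0]])
p = model.predict_proba(X_test)[0, 1]
print(f"P(Y=1) = {p:.3f}; predicted class = {int(p > 0.5)}")
```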
5.7 LOG-LINEAR MODELS
Log-linear modeling is a way of analyzing the relationships among categorical (or quantitative) variables. The log-linear model approximates discrete, multidimensional probability distributions. It is a type of generalized linear model in which the output Y_i is assumed to have a Poisson distribution with expected value μ_i. The natural logarithm of μ_i is assumed to be a linear function of the inputs:

\ln \mu_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}
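As a hedged sketch of fitting such a model, the following assumes the statsmodels package and uses hypothetical count data; it is one possible implementation, not the book's.

```python
# A minimal sketch of fitting a Poisson log-linear model,
# assuming statsmodels is available; the count data are hypothetical.
import numpy as np
import statsmodels.api as sm

# Hypothetical cell counts and two categorical inputs coded as 0/1.
counts = np.array([25, 15, 10, 30])
x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.GLM(counts, X, family=sm.families.Poisson())
result = model.fit()

# Estimated coefficients on the log scale: ln(mu) = X @ beta.
print(result.params)
```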
Since all the variables of interest are categorical variables, we use a table to represent them: a frequency table that represents the global distribution of data. The aim in log-linear modeling is to identify associations between categorical variables. Association corresponds to the interaction terms in the model, so our problem becomes one of finding out which of the β’s are 0 in the model. A similar problem can be stated in ANOVA. If there is an interaction between the variables in a log-linear model, it implies that the variables involved in the interaction are not independent but related, and the corresponding β is not equal to 0. There is no need for one of the categorical variables to be considered as an output in this analysis. If an output is specified, then instead of the log-linear models we can use logistic regression for the analysis. Therefore, we will next explain log-linear analysis when a data set is defined without output variables. All given variables are categorical, and we want to analyze the possible associations between them. That is the task for correspondence analysis.
Correspondence analysis represents the set of categorical data for analysis within incidence matrices, also called contingency tables. The result of an analysis of the contingency table answers the question: Is there a relationship between the analyzed attributes or not? An example of a 2 × 2 contingency table, with cumulative totals, is shown in Table 5.5. The table is a result of a survey examining the relative attitudes of males and females about abortion. The total set of samples is 1100, and each sample consists of two categorical attributes with corresponding values. For the attribute sex, the possible values are male and female, and for the attribute support the values are yes and no. Cumulative results for all the samples are represented in the four elements of the contingency table.
TABLE 5.5. A 2 × 2 Contingency Table for 1100 Samples Surveying Attitudes about Abortion
Are there any differences in the extent of support for abortion between the male and the female populations? This question may be translated to: What is the level of dependency (if any) between the two given attributes: sex and support? If an association exists, then there are significant differences in opinion between the male and the female populations; otherwise both populations have a similar opinion.
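Before testing for such dependency, the survey responses must be cross-tabulated. As a hedged illustration, the following sketch assumes pandas and uses a handful of hypothetical records (not the actual 1100 survey samples):

```python
# A minimal sketch of building a 2 x 2 contingency table from raw
# categorical samples, assuming pandas; the records are hypothetical.
import pandas as pd

# Hypothetical raw survey data: one row per respondent.
data = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "support": ["yes", "no", "yes", "no", "yes", "yes"],
})

# Cross-tabulation with cumulative (margin) totals, as in Table 5.5.
table = pd.crosstab(data["sex"], data["support"], margins=True)
print(table)
```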
Having seen that log-linear modeling is concerned with the association of categorical variables, we might attempt to derive some quantitative measure of association from this model using the data in the contingency table. But we do not do this. Instead, we define an algorithm for feature association based on a comparison of two contingency tables:
1. The first step in the analysis is to transform a given contingency table into a similar table with expected values. These expected values are calculated under the assumption that the variables are independent.
2. In the second step, we compare these two matrices using the squared distance measure and the chi-square test as criteria of association for two categorical variables.
The computational process for these two steps is very simple for a 2 × 2 contingency table. The process also applies to contingency tables of higher dimensions (analysis of categorical variables with more than two values, such as 3 × 4 or 6 × 9 tables). A sketch of the whole procedure is given below.
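As a compact sketch of both steps, the following assumes SciPy's chi2_contingency; the observed counts are hypothetical, not the survey data from Table 5.5.

```python
# A minimal sketch of the two-step procedure above, assuming SciPy;
# the observed 2 x 2 counts are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 20],
                     [10, 40]])

# Step 1: expected counts under independence; Step 2: chi-square test.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print("expected counts:\n", expected)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
```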
Let us introduce the notation. Denote the contingency table as X_{m×n}. The row totals for the table are

X_{j+} = \sum_{i=1}^{n} X_{ji}

and they are valid for every row (j = 1, … , m). Similarly, we can define the column totals as

X_{+i} = \sum_{j=1}^{m} X_{ji}, \quad i = 1, \ldots, n

The grand total is defined as a sum of the row totals:

X_{++} = \sum_{j=1}^{m} X_{j+}

or, equivalently, as a sum of the column totals:

X_{++} = \sum_{i=1}^{n} X_{+i}
Using these totals we can calculate the contingency table of expected values under the assumption that there is no association between the row variable and the column variable. The expected values are

E_{ji} = \frac{X_{j+} \cdot X_{+i}}{X_{++}}

and they are computed for every position in the contingency table. The final result of this first step is a new table of the same dimensions as the original, consisting only of expected values.
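As a hedged sketch, the whole expected-value table can be computed in one line with NumPy's outer product; the observed counts below are hypothetical.

```python
# A minimal sketch of computing the expected-value table E from the
# observed table X, assuming NumPy; the counts are hypothetical.
import numpy as np

X = np.array([[30, 20],
              [10, 40]])

row_totals = X.sum(axis=1)      # X_{j+}
col_totals = X.sum(axis=0)      # X_{+i}
grand_total = X.sum()           # X_{++}

# E_{ji} = X_{j+} * X_{+i} / X_{++} for every cell.
E = np.outer(row_totals, col_totals) / grand_total
print(E)
```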
For our example in Table 5.5, all sums (columns, rows, and grand total) are already represented in the contingency table. Based on these values we can construct the contingency table of expected values. The expected value at the intersection of the first row and the first column is

E_{11} = \frac{X_{1+} \cdot X_{+1}}{1100}

Similarly, we can compute the other expected values, and the final contingency table with expected values is given in Table 5.6.
TABLE 5.6. A 2 × 2 Contingency Table of Expected Values for the Data Given in Table 5.5
The next step in the analysis of categorical-attribute dependency is the application of the chi-square test of association. The initial hypothesis H0 is the assumption that the two attributes are unrelated, and it is tested by Pearson’s chi-square formula:

\chi^2 = \sum_{j=1}^{m} \sum_{i=1}^{n} \frac{(X_{ji} - E_{ji})^2}{E_{ji}}
The greater the value of χ2, the greater the evidence against the hypothesis H0. For our example, comparing the observed counts in Table 5.5 with the expected counts in Table 5.6 cell by cell yields the χ2 statistic, which is then compared with the critical value of the χ2 distribution for the appropriate number of degrees of freedom (one degree of freedom for a 2 × 2 table); if the computed statistic exceeds that threshold, H0 is rejected and we conclude that the attributes sex and support are associated.
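A minimal sketch of this decision step, assuming SciPy; it recomputes χ2 for the hypothetical table used earlier and compares it with the 5% critical value.

```python
# A minimal sketch of the decision step, assuming SciPy; the chi2
# statistic comes from the hypothetical table used earlier.
import numpy as np
from scipy.stats import chi2

observed = np.array([[30, 20],
                     [10, 40]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Reject H0 at the 5% significance level if chi2_stat exceeds the
# critical value of the chi-square distribution with dof degrees of freedom.
critical = chi2.ppf(0.95, dof)
print(f"chi2 = {chi2_stat:.3f}, critical(0.05, dof={dof}) = {critical:.3f}")
print("H0 rejected (attributes associated)" if chi2_stat > critical
      else "H0 not rejected (attributes independent)")
```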