
Book online «Data Mining Mehmed Kantardzic (good english books to read .txt) 📖». Author Mehmed Kantardzic



to anything of statistical interest, whether it is a group of people, objects, or events. The number of observations in the population is defined as the size of the population. In general, populations may be finite or infinite, but some finite populations are so large that, in theory, we assume them to be infinite.

In the field of statistical inference, we are interested in arriving at conclusions concerning a population when it is impossible or impractical to observe the entire set of observations that make up the population. For example, in attempting to determine the average lifetime of a certain brand of light bulb, it would be practically impossible to test all such bulbs. Therefore, we must depend on a subset of observations from the population for most statistical-analysis applications. In statistics, a subset of a population is called a sample, and it describes a finite data set of n-dimensional vectors. Throughout this book, we will simply call this subset of the population a data set, to eliminate confusion between the two definitions of sample: one (explained earlier) denoting the description of a single entity in the population, and the other (given here) referring to a subset of the population.

From a given data set, we build a statistical model of the population that will help us to make inferences concerning that same population. If our inferences from the data set are to be valid, we must obtain samples that are representative of the population. Very often, we are tempted to choose a data set by selecting the most convenient members of the population, but such an approach may lead to erroneous inferences concerning the population. Any sampling procedure that produces inferences that consistently overestimate or underestimate some characteristic of the population is said to be biased. To eliminate any possibility of bias in the sampling procedure, it is desirable to choose a random data set, in the sense that the observations are made independently and at random. The main purpose of selecting random samples is to elicit information about unknown population parameters.
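The effect of a biased sampling procedure can be illustrated with a small simulation. The sketch below is not from the book: the population of light-bulb lifetimes, its normal distribution, and the sample size of 100 are all assumptions chosen for illustration. It compares a convenience sample (the bulbs that fail first, which are the easiest to observe) with a random sample of the same size.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: lifetimes (in hours) of 10,000 light bulbs.
population = [random.gauss(1000, 100) for _ in range(10_000)]
true_mean = statistics.fmean(population)

# Convenience sample: the 100 bulbs that burn out first.
convenience = sorted(population)[:100]

# Random sample: 100 bulbs drawn without regard to their lifetime.
random_sample = random.sample(population, 100)

# Absolute error of each sample mean as an estimate of the population mean.
convenience_error = abs(statistics.fmean(convenience) - true_mean)
random_error = abs(statistics.fmean(random_sample) - true_mean)

print(f"population mean:         {true_mean:.1f}")
print(f"convenience sample mean: {statistics.fmean(convenience):.1f}")
print(f"random sample mean:      {statistics.fmean(random_sample):.1f}")
```

Under these assumptions the convenience sample consistently underestimates the mean lifetime, because it contains only the shortest-lived bulbs, while the random sample's error comes from chance alone.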

The relation between data sets and the system they describe may be used for inductive reasoning: from observed data to knowledge of a (partially) unknown system. Statistical inference is the main form of reasoning relevant to data analysis. The theory of statistical inference consists of those methods by which one makes inferences or generalizations about a population. These methods may be categorized into two major areas: estimation and tests of hypotheses.

In estimation, one wants to come up with a plausible value or a range of plausible values for the unknown parameters of the system. The goal is to gain information from a data set T in order to estimate one or more parameters w belonging to the model of the real-world system f(X, w). A data set T of m samples is described by ordered n-tuples of values for the variables X = {X1, X2, … , Xn} (the attributes of entities in the population):

T = {(x11, … , x1n), (x21, … , x2n), … , (xm1, … , xmn)}

The data set can be organized in tabular form as a set of samples with their corresponding feature values. Once the parameters of the model are estimated, we can use them to make predictions about a random variable Y from the initial set of attributes, Y ∈ X, based on the other variables or sets of variables X* = X − Y. If Y is numeric, we speak about regression, and if it takes its values from a discrete, unordered set, we speak about classification.

Once we have obtained estimates for the model parameters w from some data set T, we may use the resulting model (analytically given as a function f(X*, w)) to make predictions about Y when we know the corresponding value of the vector X*. The difference between the prediction f(X*, w) and the real value Y is called the prediction error. It should preferably take values close to 0. A natural quality measure of a model f(X*, w), as a predictor of Y, is the expected mean-squared error for the entire data set T:

MSE = E_T[(Y − f(X*, w))^2]
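As a concrete illustration of estimating w and measuring prediction error, the following sketch fits a straight line f(x, w) = w0 + w1·x by ordinary least squares and then averages the squared prediction errors over the data set. The linear model, the five (x, y) pairs, and the function names are assumptions made for this example; the text does not prescribe a particular model or estimation method.

```python
import statistics

# Hypothetical data set T: one input attribute x and one numeric target y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Estimate w = (w0, w1) for f(x, w) = w0 + w1 * x by least squares."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def f(x, w):
    """Model prediction for input x under parameters w."""
    return w[0] + w[1] * x


def mean_squared_error(xs, ys, w):
    """Average squared prediction error of f(x, w) over the data set."""
    return statistics.fmean((y - f(x, w)) ** 2 for x, y in zip(xs, ys))


w = fit_line(xs, ys)
mse = mean_squared_error(xs, ys, w)
print(f"w0 = {w[0]:.3f}, w1 = {w[1]:.3f}, MSE = {mse:.4f}")
```

Note that this computes the average squared error on the same data used for fitting, which is only an empirical stand-in for the expected error and tends to be optimistic.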

In statistical testing, on the other hand, one has to decide whether a hypothesis concerning the value of the population characteristic should be accepted or rejected in light of an analysis of the data set. A statistical hypothesis is an assertion or conjecture concerning one or more populations. The truth or falsity of a statistical hypothesis can never be known with absolute certainty, unless we examine the entire population. This, of course, would be impractical in most situations, sometimes even impossible. Instead, we test a hypothesis on a randomly selected data set. Evidence from the data set that is inconsistent with the stated hypothesis leads to a rejection of the hypothesis, whereas evidence supporting the hypothesis leads to its acceptance, or more precisely, it implies that the data do not contain sufficient evidence to refute it. The structure of hypothesis testing is formulated with the use of the term null hypothesis. This refers to any hypothesis that we wish to test and is denoted by H0. H0 is only rejected if the given data set, on the basis of the applied statistical tests, contains strong evidence that the hypothesis is not true. The rejection of H0 leads to the acceptance of an alternative hypothesis about the population.
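The logic of testing a null hypothesis can be sketched with a two-sided one-sample z-test, which is only one of many possible tests and is not singled out by the text. In the sketch below, the lifetimes, the hypothesized mean of 1000 hours, the assumed known standard deviation of 40 hours, and the 0.05 significance level are all hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_test_two_sided(data, mu0, sigma):
    """Return (z, p_value) for H0: population mean == mu0, with known sigma."""
    n = len(data)
    z = (fmean(data) - mu0) / (sigma / sqrt(n))
    # Probability, under H0, of a statistic at least as extreme as |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical random data set: lifetimes (hours) of ten tested bulbs.
lifetimes = [980, 1010, 955, 990, 1005, 970, 985, 960, 1000, 975]

alpha = 0.05  # assumed significance level
z, p_value = z_test_two_sided(lifetimes, mu0=1000, sigma=40)
reject_h0 = p_value < alpha

print(f"z = {z:.3f}, p-value = {p_value:.3f}, reject H0: {reject_h0}")
```

With these numbers the p-value exceeds the significance level, so H0 is not rejected. In line with the paragraph above, this means only that the data do not contain sufficient evidence to refute H0, not that H0 has been shown to be true.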

In this chapter, some statistical estimation and hypothesis-testing methods are described in great detail. These methods have been selected primarily based on the applicability of the technique in a data-mining process on a large data set.

5.2 ASSESSING DIFFERENCES IN DATA SETS

For many data-mining tasks, it is useful to learn the general characteristics of the given data set, regarding both central tendency and data dispersion. These simple parameters are obvious descriptors for assessing differences between data sets. Typical measures of central tendency include the mean, median, and mode, while measures of data dispersion include the variance and standard deviation.
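A minimal sketch of these descriptors, using Python's standard statistics module, is given here. The two small data sets and the helper name describe are hypothetical; they are chosen so that both sets share the same center but differ in dispersion.

```python
import statistics


def describe(data):
    """Summarize central tendency and dispersion of a numeric data set."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "variance": statistics.variance(data),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(data),      # square root of the sample variance
    }


# Two hypothetical data sets with the same center but different spread.
a = [4, 5, 5, 6]
b = [0, 5, 5, 10]

summary_a = describe(a)
summary_b = describe(b)

for name, summary in (("a", summary_a), ("b", summary_b)):
    print(name, summary)
```

Here the measures of central tendency cannot distinguish the two sets, while the variance and standard deviation clearly can, which is why both kinds of descriptors are needed when assessing differences between data sets.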

The most common and effective numeric measure of the center of the data set is the mean value (also called the arithmetic mean).
