Figure 4.34. Computing points on an ROC curve. (a) Threshold = 0.5; (b) threshold = 0.8.
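The computation behind each point of Figure 4.34 can be illustrated with a short sketch. The labels, probability scores, and thresholds below are hypothetical values chosen for illustration, not data from the figure.

```python
# A minimal sketch of computing a single ROC point for a chosen threshold.
# The labels, scores, and thresholds below are hypothetical examples.

def roc_point(y_true, y_score, threshold):
    """Return (false-positive rate, true-positive rate) at the given threshold."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                     # true class labels
y_score = [0.9, 0.7, 0.6, 0.55, 0.4, 0.3, 0.85, 0.75]  # predicted probabilities
print(roc_point(y_true, y_score, 0.5))  # one ROC point at threshold 0.5
print(roc_point(y_true, y_score, 0.8))  # another ROC point at threshold 0.8
```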
When we compare two classification algorithms, we may compare measures such as accuracy or the F-measure and conclude that one model gives better results than the other. We may also compare lift charts, ROI charts, or ROC curves, and if one curve lies above the other, we may conclude that the corresponding model is more appropriate. But in both cases we may not conclude that there are significant differences between the models or, more importantly, that one model performs better than the other with statistical significance. There are some simple tests that can verify these differences. The first one is McNemar’s test. After testing both classifiers, we create a contingency table based on their classification results on the test data. The components of the contingency table are explained in Table 4.5.
TABLE 4.5. Contingency Table for McNemar’s Test
e00: number of samples misclassified by both classifiers
e01: number of samples misclassified by classifier 1, but not by classifier 2
e10: number of samples misclassified by classifier 2, but not by classifier 1
e11: number of samples correctly classified by both classifiers
After computing the components of the contingency table, we may apply the χ² statistic with one degree of freedom for the following expression:

$$\chi^2 = \frac{\left(|e_{01} - e_{10}| - 1\right)^2}{e_{01} + e_{10}}$$
McNemar’s test rejects the hypothesis that the two algorithms have the same error at the significance level α if this value is greater than $\chi^2_{\alpha,1}$. For example, for α = 0.05, $\chi^2_{0.05,1} = 3.84$.
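As an illustration, the following sketch computes the McNemar statistic from the counts defined in Table 4.5; the counts e01 and e10 used in the example are hypothetical, not taken from the text.

```python
# A minimal sketch of McNemar's test for comparing two classifiers.
# The counts e01 and e10 below are hypothetical; in practice they come
# from classifying the same test set with both models (Table 4.5).

def mcnemar_statistic(e01, e10):
    """Continuity-corrected chi-square statistic with one degree of freedom."""
    return (abs(e01 - e10) - 1) ** 2 / (e01 + e10)

chi2 = mcnemar_statistic(e01=25, e10=10)  # classifier 1 alone wrong 25 times, classifier 2 alone wrong 10 times
print(chi2)          # 5.6
print(chi2 > 3.84)   # True: reject "same error rate" at alpha = 0.05
```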
The other test is applied if we compare two classification models that are tested with the K-fold cross-validation process. The test starts with the results of K-fold cross-validation obtained from K training/validation set pairs. We compare the error percentages of the two classification algorithms based on the errors in the K validation sets, recorded for the two models as $p_i^1$ and $p_i^2$, i = 1, … , K.
The difference in error rates on fold i is $p_i = p_i^1 - p_i^2$. Then, we can compute the mean and the variance of these differences:

$$m = \frac{1}{K}\sum_{i=1}^{K} p_i, \qquad S^2 = \frac{1}{K-1}\sum_{i=1}^{K}\left(p_i - m\right)^2$$

We have a statistic

$$t = \frac{\sqrt{K}\, m}{S}$$

that is t-distributed with K − 1 degrees of freedom, and the following test:
Thus, the K-fold cross-validation paired t-test rejects the hypothesis that the two algorithms have the same error rate at significance level α if this value falls outside the interval $(-t_{\alpha/2,\,K-1},\; t_{\alpha/2,\,K-1})$. For example, the threshold values for α = 0.05 and K = 10 or 30 are $t_{0.025,9} = 2.26$ and $t_{0.025,29} = 2.05$.
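A short sketch of the same computation is given below, assuming the per-fold error rates of the two models are already available; the error-rate values are made up for illustration.

```python
import math

# A minimal sketch of the K-fold cross-validation paired t-test.
# The per-fold error rates below are hypothetical values for two models.

def cv_paired_t(p1, p2):
    """t statistic with K-1 degrees of freedom for paired per-fold error rates."""
    K = len(p1)
    d = [a - b for a, b in zip(p1, p2)]           # per-fold differences p_i
    m = sum(d) / K                                # mean difference
    s2 = sum((x - m) ** 2 for x in d) / (K - 1)   # sample variance
    return math.sqrt(K) * m / math.sqrt(s2)

errors_1 = [0.12, 0.10, 0.15, 0.11, 0.13, 0.09, 0.14, 0.12, 0.10, 0.13]
errors_2 = [0.10, 0.09, 0.13, 0.10, 0.11, 0.08, 0.12, 0.11, 0.09, 0.11]
t = cv_paired_t(errors_1, errors_2)
print(abs(t) > 2.26)   # True: reject "same error rate" at alpha = 0.05, K = 10
```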
Over time, all systems evolve. Thus, from time to time the model will have to be retested, retrained, and possibly completely rebuilt. Charts of the residual differences between forecasted and observed values are an excellent way to monitor model results.
4.9 90% ACCURACY: NOW WHAT?
Often forgotten in texts on data mining is a discussion of the deployment process. Any data-mining student may produce a model with relatively high accuracy over some small data set using the tools available. However, an experienced data miner sees beyond the creation of a model during the planning stages. A plan must be created to evaluate how useful a data-mining model is to a business and how the model will be rolled out. In a business setting the value of a data-mining model is not simply its accuracy, but how that model can impact the bottom line of a company. For example, in fraud detection, algorithm A may achieve an accuracy of 90% while algorithm B achieves 85% on training data. However, an evaluation of the business impact of each may reveal that algorithm A would likely underperform algorithm B because of a larger number of very expensive false-negative cases. Additional financial evaluation may recommend algorithm B for the final deployment because this solution saves the company more money. A careful analysis of the business impact of data-mining decisions gives much greater insight into the value of a data-mining model.
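The kind of business-impact comparison described above can be sketched as a simple cost calculation; the confusion-matrix counts and per-error costs below are purely hypothetical assumptions, not figures from the text.

```python
# A minimal sketch of cost-based model comparison. All counts and costs
# below are hypothetical assumptions, not values from the original text.

COST_FN = 1000.0   # assumed loss per undetected fraudulent claim (false negative)
COST_FP = 50.0     # assumed cost of investigating a legitimate claim (false positive)

def expected_cost(fp, fn):
    return fp * COST_FP + fn * COST_FN

cost_a = expected_cost(fp=40, fn=60)    # algorithm A: higher accuracy, more missed fraud
cost_b = expected_cost(fp=120, fn=30)   # algorithm B: lower accuracy, fewer missed fraud cases
print(cost_a, cost_b)                   # 62000.0 36000.0 -> B is cheaper despite lower accuracy
```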
In this section two case studies are summarized. The first case study details the deployment of a data-mining model that improved the efficiency of employees in finding fraudulent claims at an insurance company in Chile. The second case study involves a system deployed in hospitals to aid in measuring compliance with industry standards in caring for individuals with cardiovascular disease (CVD).
4.9.1 Insurance Fraud Detection
In 2005, the insurance company Banmedica S.A. of Chile received 800 digital medical claims per day. The process of identifying fraud was entirely manual. Those responsible for identifying fraud had to look one-by-one at medical claims to find fraudulent cases. Instead it was hoped that data-mining techniques would aid in a more efficient discovery of fraudulent claims.
The first step in the data-mining process required that the data-mining experts gain a better understanding of the processing of medical claims. After several meetings with medical experts, the data-mining experts were able to better understand the business process as it related to fraud detection. They were able to determine the current criteria used in manually discriminating between claims that were approved, rejected, and modified. A number of known fraud cases were discussed, along with the behavioral patterns that revealed these documented cases.
Next, two data sets were supplied. The first data set contained 169 documented cases of fraud. Each fraudulent case took place over an extended period of time, showing that time was an important factor in these decisions as cases developed. The second data set contained 500,000 medical claims with labels supplied by the business of “approved,” “rejected,” or “reduced.”
Both data sets were analyzed in detail. The smaller data set of known fraud cases revealed that these fraudulent cases all involved a small number of medical professionals, affiliates, and employers. From the original paper, “19 employers and 6 doctors were implicated with 152 medical claims.” The labels of the larger data set were revealed to be not sufficiently accurate for data mining. Contradictory data points were found. A lack of standards in