Data Mining Algorithms for Classification
BSc Thesis Artificial Intelligence
Author: Patrick Ozer
Radboud University Nijmegen
January 2008

Supervisor: Dr. I.G. Sprinkhuizen-Kuyper, Radboud University Nijmegen

Abstract

Data Mining is a technique used in various domains to give meaning to the available data. In classification tree modeling the data is classified to make predictions about new data. Using old data to predict new data has the danger of the model being fitted too closely to the old data. This problem can be solved by pruning methods, which generalize the modeled tree. This paper describes the use of classification trees and shows two methods of pruning them. An experiment has been set up that uses different kinds of classification tree algorithms with different pruning methods to test the performance of the algorithms and pruning methods. This paper also analyzes data set properties to find relations between them and the classification algorithms and pruning methods.

1 Introduction

In the last few years Data Mining has become more and more popular. Together with the information age, the digital revolution has made it necessary to use heuristics to be able to analyze the large amount of data that has become available. Data Mining has become especially popular in the fields of forensic science, fraud analysis and healthcare, since it reduces costs in time and money. One definition of Data Mining is: “Data Mining is a process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data” [4]. Another, more informal definition: “The induction of understandable models and patterns from databases” [6]. In other words, we initially have a large (possibly infinite) collection of possible models (patterns) and a (finite) amount of data. Data Mining should result in those models that describe the data best, the models that fit (part of) the data. Classification trees are used for the kind of Data Mining problems that are concerned with prediction: using examples of cases it is possible to construct a model that is able to predict the class of new examples using the attributes of those examples. An experiment has been set up to test the performance of two pruning methods that are used in classification tree modeling. The pruning methods will be applied on different types of data sets. The remainder of this thesis is organized as follows. Section 2 explains classification trees and different pruning methods. In Section 3 our experiments are described. The results are given in Section 4 and we end with a discussion in Section 5.

2 Classification Trees

In Data Mining one of the most common tasks is to build models for the prediction of the class of an object on the basis of its attributes [8]. Here the object can be a customer, patient, transaction, e-mail message or even a single character. Attributes of, for example, the patient object can be heart rate, blood pressure, weight and gender, whereas the class of the patient object would most commonly be positive/negative for a certain disease. This section considers the problem of learning a classification tree model using the data we have about such objects. In Section 2.1 a simple example will be presented. In Section 2.2 the steps of constructing such a model will be discussed in more detail.

2.1 Example

An example will be walked through to give a global impression of classification trees. The example concerns the classification of a credit scoring data set and is obtained from [5]. To be able to walk through a whole classification tree we only consider a part of the data. Figure 1 shows the tree that is constructed from the data in Table 1. The objects in this data set are loan applicants. The class values good and bad correspond to accepted and denied, respectively, for a loan applicant with the same attribute values as someone who already applied for a loan in the past. The data set contains five attributes: age, ownhouse?, married?, income and gender. With this information we can make a classification model of the applicants who were accepted or denied a loan. The model derived from that classification can then predict whether a new applicant will get a loan according to his/her attributes. As can be seen in Table 1, the instances each have a value for all the attributes and the class label.

Record   age   ownhouse?   married?   income   gender   class
1         22   no          no         28,000   male     bad
2         46   no          yes        32,000   female   bad
3         24   yes         yes        24,000   male     bad
4         25   no          no         27,000   male     bad
5         29   yes         yes        32,000   female   bad
6         45   yes         yes        30,000   female   good
7         63   yes         yes        58,000   male     good
8         36   yes         no         52,000   male     good
9         23   no          yes        40,000   female   good
10        50   yes         yes        28,000   female   good

Table 1: Credit scoring data [5]

When we construct the tree structure, we will try to split the data in such a manner that it results in the best split. The splitting is done according to the attributes. The quality of a split is computed from the number of correctly classified cases in each of the resulting nodes of that split. Finding the best split continues until all nodes are leaf nodes or no more splits are possible (a leaf node is a node that contains cases of a single class only). In computer science, tree structures can have binary or n-ary branches. This is the same for classification trees. However, most of the time a classification tree will have binary branches, because splitting in two will result in a better separation of the data. When we present the impurity measures for the splitting criteria we will consider the possibility of n-ary splits; in the examples given later on, binary splits will be used because those are easier to explain.

Now we will focus on classifying new applicants according to a model generated from the data in Table 1. When a new applicant arrives he/she is “dropped down” the tree until we arrive in a leaf node, where we assign the associated class to the applicant. Suppose an applicant arrives and fills in the following information on the application form:


age: 42, married?: no, ownhouse?: yes, income: 30,000, gender: male

Figure 1: Tree built for loan applicants [5]

Then we can say, according to the model shown in Figure 1, that this applicant will be rejected for the loan. Starting from the root of the tree, we first perform the associated test on income. Since the value of income for this applicant is below 36,000 we go to the left, and we arrive at a new test, on age. Age is above 37, so we go to the right. Here we arrive at the final test, on marital status. Since the applicant is not married he is sent to the left, and we arrive at a leaf node with class label bad. We predict that the applicant will not qualify for a loan and is therefore rejected. Besides this example, there are numerous domains in which these kinds of predictions can be made. Classification trees are favoured as a Data Mining tool because they are easy to interpret, they select attributes automatically and they are able to handle numeric as well as categorical values.
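The classification procedure just described is easy to express in code. Below is a minimal sketch that encodes the decision path of Figure 1 as nested tests; the tests on income, age and marital status and the bad leaf are taken from the walk-through above, while the remaining leaf labels are illustrative assumptions, since the full tree is only visible in the figure.

def classify(applicant):
    if applicant["income"] < 36000:           # root test, as in the walk-through
        if applicant["age"] <= 37:            # left branch: assumed leaf
            return "bad"
        elif applicant["married?"] == "yes":  # assumed leaf
            return "good"
        else:                                 # leaf reached in the walk-through
            return "bad"
    return "good"                             # assumed leaf

applicant = {"age": 42, "married?": "no", "ownhouse?": "yes",
             "income": 30000, "gender": "male"}
print(classify(applicant))                    # prints 'bad': the applicant is rejected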

2.2 Building Classification Trees

In the previous section we saw that the construction of a classification tree starts with performing good splits on the data. In this section we define what such a good split is and how we can find one. Three impurity measures for splitting data, the resubstitution error, the Gini index and the entropy, will be discussed in Section 2.2.1. The actual splitting and tree construction according to these splits will be explained in Sections 2.2.2 and 2.2.3. Finally, in Section 2.2.4 the need for pruning will be explained, accompanied by two well-known pruning methods: cost-complexity pruning and reduced-error pruning.

2.2.1 Impurity Measures

As shown in the example in the previous section, we should strive towards a split that separates the data as much as possible in accordance with the class labels. So the objective is to obtain nodes that contain cases of a single class only, as mentioned before. We define impurity as a function of the relative frequencies of the classes in that node:

i(t) = φ(p_1, p_2, ..., p_J)    (1)

with p_j (j = 1, ..., J) the relative frequencies of the J different classes in that node [1]. To compare all the possible splits of the data, the quality of a split must be defined as the reduction of impurity that the split achieves. In the example later on the following equation for the impurity reduction of split s in node t will be used:

∆i(s, t) = i(t) − Σ_j π(j) i(j)    (2)

where π(j) is the proportion of cases sent to branch j by s, and i(j) is the impurity of the node of branch j. Because different algorithms of tree construction use different impurity measures, we will discuss three of them and give a general example later on.

Resubstitution error  This is a measure for the impurity defined by the fraction of the cases in a node that is classified incorrectly if we assign every case to the majority class in that node:

i(t) = 1 − max_j p(j|t)    (3)

where p(j|t) is the relative frequency of class j in node t. The resubstitution error scores a split according to the number of incorrectly classified cases in a node: a split is recognized as better if it leads to fewer errors. But the resubstitution error has one major disadvantage: it does not recognize a split as better if one of its resulting nodes is pure. So it does not prefer Split 2 over Split 1 in Figure 2. In such a case we want a split with a pure node to be preferred.

Figure 2: According to the resubstitution error these splits are equally good; the Gini index prefers the second split.

Gini index  The Gini index does recognize such a split. Its impurity measure is defined as follows:

i(t) = Σ_j p(j|t)(1 − p(j|t))    (4)

The Gini index prefers Split 2 over Split 1 in Figure 2, since that split gives the highest impurity reduction.

Entropy  Finally we have the entropy measure, which is used in well-known classification tree algorithms like ID3 and C4.5. The advantage of the entropy measure over the Gini index is that it reaches its minimum faster if more instances of the child nodes belong to the same class. The entropy measure is defined as:


i(t) = − Σ_j p(j|t) log p(j|t)    (5)

For all three measures it holds that they reach a maximum when all classes are evenly distributed in the node, and that they are at a minimum when all instances in the node belong to one class. Figure 3 shows the differences in how fast the measures reach their minimum.

Figure 3: Graphs of entropy, Gini index and resubstitution error for a two-class problem [5]
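The three measures of Equations 3-5 can be computed directly from the relative class frequencies in a node. The following is a minimal sketch, using base-2 logarithms for the entropy (the text leaves the base unspecified):

import math

def resubstitution_error(p):
    # Equation 3: fraction misclassified when assigning the majority class
    return 1 - max(p)

def gini_index(p):
    # Equation 4
    return sum(pj * (1 - pj) for pj in p)

def entropy(p):
    # Equation 5 (terms with pj = 0 contribute nothing)
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

for measure in (resubstitution_error, gini_index, entropy):
    print(measure.__name__, measure([0.5, 0.5]), measure([1.0, 0.0]))
# Each measure is maximal for the even distribution [0.5, 0.5]
# and zero for the pure node [1.0, 0.0].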

2.2.2 Splits to consider

We have looked at different impurity criteria for computing the quality of a split. In this section we look at which splits are considered and how we select the best split (for binary splits only). The attributes can have numerical or categorical values. In the case of numerical values, all values of the attribute occurring in the training set are considered, and the possible splits are made between two consecutive numerical values occurring in the training set. If the attribute is categorical with N categories, then 2^(N−1) − 1 splits are considered. There are 2^N − 2 non-empty proper subsets of a set of N elements (the empty set and the complete set do not count), and a split of the N categories into S and S^c, the complement of S, is the same split as the split into S^c and S. So we can count

(1/2)(2^N − 2) = 2^(N−1) − 1

different splits.

Example  To compute the best split on income, a numerical attribute, the impurity reduction is computed for all possible values of income. In Table 2 the data are ordered on the value of income. Using the Gini index for the split after the first value of income, we get the following computation by filling Equation 4 into Equation 2:

∆i(s, t) = 1/2 − (1/10) · 0 − (9/10) · 2 · (4/9)(5/9) = 1/18    (6)

First the impurity of the parent node is computed. Then the impurity of each branch node is multiplied by the proportion of cases sent in the corresponding direction, and this is subtracted from the impurity of the parent node. For the remaining computations, see Table 2.


income   class   impurity reduction of split after this income value
24,000   B       1/2 − 1/10 · 0 − 9/10 · 2 · (4/9)(5/9) = 0.056
27,000   B       1/2 − 2/10 · 0 − 8/10 · 2 · (3/8)(5/8) = 0.125
28,000   B,G     1/2 − 4/10 · 2 · (1/4)(3/4) − 6/10 · 2 · (4/6)(2/6) = 0.083
30,000   G       1/2 − 5/10 · 2 · (2/5)(3/5) − 5/10 · 2 · (3/5)(2/5) = 0.02
32,000   B,B     1/2 − 7/10 · 2 · (2/7)(5/7) − 3/10 · 2 · 0 = 0.214
40,000   G       1/2 − 8/10 · 2 · (3/8)(5/8) − 2/10 · 0 = 0.125
52,000   G       1/2 − 9/10 · 2 · (4/9)(5/9) − 1/10 · 0 = 0.056
58,000   G

Table 2: Computation of the impurity reduction for the Gini index for the splits on income [5]
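The computations of Table 2 are mechanical enough to sketch in a few lines of code. The sketch below rebuilds the table from the ten records of Table 1, using the two-class Gini index of Equation 4 inside the impurity reduction of Equation 2:

# The ten (income, class) pairs of Table 1, ordered on income.
data = [(24000, "B"), (27000, "B"), (28000, "B"), (28000, "G"), (30000, "G"),
        (32000, "B"), (32000, "B"), (40000, "G"), (52000, "G"), (58000, "G")]

def gini(cases):
    if not cases:
        return 0.0
    p_bad = sum(1 for _, c in cases if c == "B") / len(cases)
    return 2 * p_bad * (1 - p_bad)            # Equation 4 for two classes

values = sorted({income for income, _ in data})
for v in values[:-1]:                         # splits between consecutive values
    left = [x for x in data if x[0] <= v]
    right = [x for x in data if x[0] > v]
    reduction = (gini(data) - len(left) / len(data) * gini(left)
                            - len(right) / len(data) * gini(right))
    print(f"split after {v:,}: impurity reduction {reduction:.3f}")
# The best split is after income 32,000, with reduction 0.214, as in Table 2.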

2.2.3 Tree construction

Building a classification tree starts at the top of the tree with all the data. For each attribute the best split of the data is computed; then the best splits of the attributes are compared and the attribute with the best split wins. The split is executed on that attribute (again we consider binary trees). The data is then separated over the corresponding branches, and from there the computation continues on the resulting nodes in the same manner. Tree construction finishes when there is no more data to separate or no more attributes to separate it by.
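The recursive construction just described can be sketched as follows. This is a minimal runnable illustration that handles numeric attributes only and uses the Gini index as impurity measure; the algorithms used in our experiments also handle categorical splits and other measures. A case is assumed to be an (attribute-dictionary, class-label) pair.

def gini(cases):
    counts = {}
    for _, label in cases:
        counts[label] = counts.get(label, 0) + 1
    n = len(cases)
    return sum((c / n) * (1 - c / n) for c in counts.values())

def best_split(cases, attributes):
    # Return the (attribute, value) pair with the largest impurity reduction.
    best_drop, best = 0.0, (None, None)
    parent = gini(cases)
    for a in attributes:
        for v in sorted({attrs[a] for attrs, _ in cases})[:-1]:
            left = [c for c in cases if c[0][a] <= v]
            right = [c for c in cases if c[0][a] > v]
            drop = (parent - len(left) / len(cases) * gini(left)
                           - len(right) / len(cases) * gini(right))
            if drop > best_drop:
                best_drop, best = drop, (a, v)
    return best

def build_tree(cases, attributes):
    labels = [label for _, label in cases]
    if len(set(labels)) == 1:                 # pure node: leaf
        return labels[0]
    attr, value = best_split(cases, attributes)
    if attr is None:                          # no split reduces impurity: leaf
        return max(set(labels), key=labels.count)
    left = [c for c in cases if c[0][attr] <= value]
    right = [c for c in cases if c[0][attr] > value]
    return {"test": (attr, value),
            "left": build_tree(left, attributes),
            "right": build_tree(right, attributes)}

# e.g. build_tree([({"age": 22, "income": 28000}, "bad"), ...], ["age", "income"])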

2.2.4 Overfitting and Pruning

If possible we continue splitting until all leaf nodes of the tree contain examples of a single class. But unless the problem is deterministic, this will not result in a good tree for prediction: the tree will be focused too much on the training data. We call this overfitting. To prevent overfitting we can use stopping rules: stop expanding nodes if the impurity reduction of the best split is below some threshold. A major disadvantage of stopping rules is that sometimes a weak split is needed first to be able to follow up with a good split, as can be seen when building a tree for the XOR problem. Another solution is pruning: first grow a maximum-size tree on the training sample and then prune this large tree. The objective is to select the pruned subtree that has the lowest true error rate. The problem is how to find this pruned subtree. There are two pruning methods we will use in the tests: cost-complexity pruning [1], [5] and reduced-error pruning [3]. In the next two paragraphs we will explain how the two pruning methods work and finish with a concrete example of the pruning process.

Cost-complexity pruning  The basic idea of cost-complexity pruning is not to consider all pruned subtrees, but only those that are the “best of their kind” in a sense to be defined below. Let R(T) denote the fraction of cases in the training sample that is misclassified by the tree T (R(T) is the weighted summed error of the leaves of tree T). The total cost Cα(T) of tree T is defined as:

Cα(T) = R(T) + α|T̃|    (7)

The total cost of tree T thus consists of two components: the summed error of the leaves, R(T), and a penalty for the complexity of the tree, α|T̃|. In this expression T̃ stands for the set of leaf nodes of T, |T̃| for the number of leaf nodes, and α is the parameter that determines the complexity penalty: when the number of leaf nodes increases by one (one additional split in a binary tree), the total cost (if R remains equal) increases by α [5]. Depending on the value of α, a complex tree without errors can have a higher total cost than a small tree making a number of errors. For every value of α there is a smallest minimizing subtree. We denote the complete tree by Tmax. For a fixed value of α there is a unique smallest minimizing subtree T(α) of Tmax that fulfills the following conditions, proven in [1]:

1. Cα(T(α)) = min_{T ⊆ Tmax} Cα(T)

2. If Cα(T) = Cα(T(α)) then T(α) ⊆ T

These two conditions say the following. The first says that there is no subtree of Tmax with lower cost than T(α) at that value of α. The second says that if more than one tree achieves the same minimum, we select the smallest one. Since Tmax is finite, there is a finite number of different subtrees T(α) of Tmax. An increasing sequence of α values therefore corresponds to a decreasing sequence of subtrees T1 ⊃ T2 ⊃ T3 ⊃ ... ⊃ {t1}, with t1 the root node of T and Tn the smallest minimizing subtree for α ∈ [αn, αn+1). This means that the next tree can be computed by pruning the current one, which will be shown in the example later on.

Reduced-error pruning  At each node in a tree it is possible to establish the number of instances that are misclassified on a training set by propagating errors upwards from the leaf nodes. This can be compared to the error rate if the node were replaced by the most common class resulting from that node. If the difference is a reduction in error, then the subtree below the node can be considered for pruning. This calculation is performed for all nodes in the tree, and the node with the highest error reduction is pruned. The procedure is then iterated on the freshly pruned tree until there is no possible reduction in error rate at any node. The error is computed on a pruning set, a part of the data held back from tree growing. This has the disadvantage of needing larger amounts of data, but the advantage of resulting in a more accurate classification tree [3].
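The idea behind reduced-error pruning can be sketched against the dictionary-based trees of the earlier sketch. For brevity this version prunes bottom-up in a single pass and takes the majority class from the pruning cases reaching a node, whereas the procedure described above iteratively prunes the node with the highest error reduction and labels leaves from the training distribution.

def predict(tree, attrs):
    while isinstance(tree, dict):             # descend until a leaf label remains
        attr, value = tree["test"]
        tree = tree["left"] if attrs[attr] <= value else tree["right"]
    return tree

def errors(tree, cases):
    return sum(1 for attrs, label in cases if predict(tree, attrs) != label)

def rep_prune(tree, prune_set):
    if not isinstance(tree, dict) or not prune_set:
        return tree
    attr, value = tree["test"]
    tree["left"] = rep_prune(tree["left"],
                             [c for c in prune_set if c[0][attr] <= value])
    tree["right"] = rep_prune(tree["right"],
                              [c for c in prune_set if c[0][attr] > value])
    labels = [label for _, label in prune_set]
    majority = max(set(labels), key=labels.count)
    if errors(majority, prune_set) <= errors(tree, prune_set):
        return majority                       # replacing the subtree does not hurt
    return tree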

Example cost-complexity pruning  We will give an example of how to find the pruning sequence for the tree T1 given in Figure 4. We are looking for the value of α at which T1 − Tt becomes better than T1.

Figure 4: Complete tree: T1

In [1] it is shown that the sequence of α values can be computed using

α = (R(t) − R(Tt)) / (|T̃t| − 1)    (8)

In Equation 8, Tt stands for the branch of T with root node t. By calculating the α value for each node in tree T1 it can be determined which node should be pruned; this process is continued until there are no more nodes to prune. As stated, we first calculate the corresponding α values for each node in the complete tree T1:

α(T1(t1)) = (1/20 − 0) / (5 − 1) = 1/80    (9)

α(T1(t2)) = (1/20 − 0) / (3 − 1) = 1/40    (10)

α(T1(t3)) = (60/200 − 0) / (2 − 1) = 3/10    (11)

α(T1(t4)) = (2/200 − 0) / (2 − 1) = 1/100    (12)

The node t4 has the lowest α value, so we prune the tree below that node. This results in the tree T2: T1(t4), shown in Figure 5. We use T1(ti) to denote that T1 is pruned below ti.

Figure 5: The tree T2, obtained by pruning T1 below node t4

In tree T2 the node t2 has the lowest α value, so now we prune the tree below that node. This results in the tree T3: T1(t2), see Figure 6.

Figure 6: The tree T3, obtained by pruning T1 below node t2

Now the node t1 has the lowest α value, so we prune the tree below that node. This results in the tree T4: T1(t1), see Figure 7.


Figure 7: The tree T4 , obtained by pruning T1 below node t1
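The α computations of Equations 9-12 follow directly from Equation 8, as the following small sketch shows; the error fractions R(t), R(Tt) and the leaf counts are those of the example tree above.

def alpha(r_node, r_subtree, n_leaves):
    # Equation 8: alpha = (R(t) - R(T_t)) / (|leaves(T_t)| - 1)
    return (r_node - r_subtree) / (n_leaves - 1)

nodes = {"t1": (1 / 20, 0, 5), "t2": (1 / 20, 0, 3),
         "t3": (60 / 200, 0, 2), "t4": (2 / 200, 0, 2)}
for name, (r, r_sub, leaves) in nodes.items():
    print(name, alpha(r, r_sub, leaves))
# t4 yields the smallest value (1/100), so it is the first node to be pruned.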

3 Experiment

This section describes our experiments, which investigate how the different pruning methods and the properties of the data sets influence the results. To this end we selected some characteristic data sets from the UCI Machine Learning Repository [2] to test the algorithms on. Besides checking the performance on the data sets, we are also interested in whether there are properties of those data sets that influence pruning method and algorithm performance. To run the experiments, the Data Mining tool WEKA [9] was used. In Section 3.1 a global description of WEKA is given. The data sets are discussed in Section 3.2. The section ends with the experiment description and an overview of the classification tree algorithms used.

3.1 WEKA

WEKA is a collection of machine learning algorithms for Data Mining tasks. It contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization [9]. For our purpose the classification tools were used; there was no preprocessing of the data. WEKA has four different modes to work in.

• Simple CLI: provides a simple command-line interface that allows direct execution of WEKA commands.

• Explorer: an environment for exploring data with WEKA.

• Experimenter: an environment for performing experiments and conducting statistical tests between learning schemes.

• Knowledge Flow: presents a “data-flow” inspired interface to WEKA. The user can select WEKA components from a tool bar, place them on a layout canvas and connect them together in order to form a “knowledge flow” for processing and analyzing data.

For most of the tests, which will be explained in more detail later, the Explorer mode of WEKA was used. But because of the size of some data sets, there was not enough memory to run all the tests this way. Therefore the tests for the larger data sets were executed in the Simple CLI mode to save working memory.

3.2 Data sets

The data sets used for the tests come from the UCI Machine Learning Repository [2]. Since we are dealing with classification tasks, we selected sets whose class values are nominal [5], [9]. Selection of the sets further depended on their size: larger data sets generally mean higher confidence. We chose different kinds of data sets because we also wanted to test whether the performance of an algorithm depends on the kind of set that is used. For instance, is the performance of an algorithm better on sets in different categories, like medical or recognition databases? Is performance influenced by the attribute type, numerical or nominal? Or can we find other properties of data sets that influence performance? For the WEKA tool the data sets need to be in the ARFF format. An example ARFF file for the credit scoring data is given in Figure 8. Lines beginning with a % sign are comments. Following the comments at the beginning of the file are the name of the relation (credit scoring) and a block defining the attributes (age, ownhouse?, married?, income, gender), all preceded by an “@”. Nominal attributes are followed by the set of values they can take, enclosed in curly braces. Numeric attributes are followed by the keyword numeric.

Figure 8: ARFF data file for credit scoring data set.
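Since Figure 8 itself is not reproduced here, the following is a reconstruction of such an ARFF file, assembled from the description above and the records of Table 1 (the exact header spelling of the original file is an assumption):

% ARFF file for the credit scoring data (reconstructed sketch)
@relation 'credit scoring'

@attribute age numeric
@attribute 'ownhouse?' {no, yes}
@attribute 'married?' {no, yes}
@attribute income numeric
@attribute gender {male, female}
@attribute class {good, bad}

@data
22,no,no,28000,male,bad
46,no,yes,32000,female,bad
24,yes,yes,24000,male,bad
25,no,no,27000,male,bad
29,yes,yes,32000,female,bad
45,yes,yes,30000,female,good
63,yes,yes,58000,male,good
36,yes,no,52000,male,good
23,no,yes,40000,female,good
50,yes,yes,28000,female,good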


3.3 Experiment description

For the tests we selected fourteen data sets: Arrhythmia, Cylinder-band, Hypothyroid, Kr-vs-Kp, Credit-g, Letter, Mushroom, Nursery, Optdigits, Pageblocks, Segment, Sick, Spambase and Waveform5000. All of these data sets have their own properties, like the domain of the data set, the kind of attributes it contains, and the tree size after training. These properties are recorded in Tables 3 and 4.

Data set          Domain        Inst.   Attrib.   Num.   Cat.   Classes
Arrhythmia        Medical         452       279    206     73        16
Cylinder-band                     512        40     20     20         2
Thyroid disease   Medical        3772        30      7     23         4
Chess game        Game           3196        36      0     36         2
German Credit     Finance        1000        20      7     13         2
Letter            Recognition   20000        16     16      0        26
Mushroom          Nature         8124        22      0     22         2
Nursery                         12960         8      0      8         5
Optical Digits    Recognition    5620        64     64      0        10
Page Blocks       Recognition    5473        10     10      0         5
Segment           Recognition    2310        19     19      0         7
Sick              Medical        3772        30      7     23         2
Spam                             4601        58     58      0         2
Waveform                         5000        40     40      0         3

Table 3: Dataset properties 1

We tested each data set with four different classification tree algorithms: J48, REPTree, RandomTree and Logistic Model Trees (LMT). For each algorithm both the test options percentage split and cross-validation were used. With percentage split, the data set is divided into a training part and a test part: 66% of the instances in the data set is used for the training set and the remaining part for the test set. Cross-validation is especially used when the amount of data is limited.

Data set          J48 tree size   REPTree size   RandomTree size   LMT tree size
Arrhythmia                   41             29               475               1
Cylinder-band                30            430              7239          no mem
Thyroid disease              21             17               517              11
Chess game                   43             67              2091              15
German Credit                64             96              1279               1
Letter                     1423           1257            no mem          no mem
Mushroom                     30             38               284               1
Nursery                     369            518              2401             155
Optical Digits              247            207              2809
Page Blocks                  71             45               529               3
Segment                      59             43               543               5
Sick                         45             41               427              24
Spam                        161             95              1681              39
Waveform                    187            181              3005               1

Table 4: Dataset properties 2

Instead of reserving a part for testing, cross-validation repeats the training and testing process several times with different random samples. The standard for this is 10-fold cross-validation. The data is divided randomly into 10 parts in which the classes are represented in approximately the same proportions as in the full dataset (stratification). Each part is held out in turn and the algorithm is trained on the nine remaining parts; then its error rate is calculated on the holdout set. Finally, the 10 error estimates are averaged to yield an overall error estimate. Further reading about this topic and an explanation of the preference for 10-fold cross-validation can be found in [9].
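A sketch of this procedure, with the stratification done by dealing the cases of each class round-robin over the folds; learn and predict stand for any classifier-building and classifying functions (for instance the build_tree and predict sketches of Section 2):

import random
from collections import defaultdict

def stratified_folds(cases, k=10, seed=0):
    by_class = defaultdict(list)
    for case in cases:                        # case = (attributes, label)
        by_class[case[1]].append(case)
    folds = [[] for _ in range(k)]
    rng = random.Random(seed)
    for group in by_class.values():
        rng.shuffle(group)
        for i, case in enumerate(group):
            folds[i % k].append(case)         # keeps class proportions per fold
    return folds

def cross_validate(learn, predict, cases, k=10):
    folds = stratified_folds(cases, k)
    error_rates = []
    for i in range(k):
        test = folds[i]                       # each part is held out once
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        model = learn(train)
        wrong = sum(1 for attrs, label in test if predict(model, attrs) != label)
        error_rates.append(wrong / len(test))
    return sum(error_rates) / k               # averaged overall error estimate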

To end this section we discuss the four classification tree algorithms used in our experiments. The J48 algorithm is WEKA's implementation of the C4.5 decision tree learner. The algorithm uses a greedy technique to induce decision trees for classification and uses reduced-error pruning [8]. REPTree is a fast decision tree learner: it builds a decision/regression tree using entropy as impurity measure and prunes it using reduced-error pruning, and it sorts values for numeric attributes only once [9]. RandomTree is an algorithm for constructing a tree that considers K random features at each node; it performs no pruning [9]. LMT is a combination of induction trees and logistic regression: a combination of learners that relies on simple regression models if only little and/or noisy data is available, and adds a more complex tree structure if there is enough data to warrant such a structure. LMT uses cost-complexity pruning. This algorithm is significantly slower than the other algorithms [7]. For J48, REPTree and RandomTree all tests were run with ten different random seeds. For the LMT algorithm we restricted the experiments to five runs, because of the time it costs to run that algorithm: for the first three algorithms one run took less than a minute, but for the LMT algorithm one run could take several hours. Choosing different random seeds is done to average out statistical variation. Then for all computed percentages their mean and variance are computed.
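Computing the reported statistics then amounts to repeating the evaluation once per seed, as in this small sketch (the evaluate callable is a stand-in for one full train-and-test run):

import statistics

def summarize_runs(evaluate, seeds=range(10)):
    # evaluate(seed) is assumed to return the accuracy percentage of one run.
    accuracies = [evaluate(seed) for seed in seeds]
    return statistics.mean(accuracies), statistics.variance(accuracies)

mean, variance = summarize_runs(lambda seed: 90.0 + seed % 3)  # stand-in evaluation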

All data gathered by these tests are available on CD. Our results are described in the next section.

4 Results

First we discuss what kind of results we can expect before running the tests. The test option 10-fold cross-validation should result in better performance than percentage split, especially for smaller data sets, since more data is used for training; for larger data sets this difference will decrease. Using a pruning method in a classification tree algorithm should result in higher performance than mining the data without pruning. We could also expect a larger data set to give higher performance on testing. But we have to keep in mind that a large data set, especially one with many attributes, will produce a large decision tree, which results in longer computation time.


                J48                          REPTree                      RandomTree                    LMT
Data set        10-fold       66% split      10-fold       66% split      10-fold        66% split      10-fold         66% split
Arrhythmia      69.6 (1.4)    66.08 (6.77)   67.1 (3.32)   65.89 (13.1)   45.79 (9.6)    43.61 (19.03)  75*** (6.08)    70.26 (11.05)
Cylinder-band   57.53 (5.34)  56.11 (14.61)  58.7 (6.46)   56.31 (16.5)   65.13* (10.47) 63.46 (22.93)  no mem          no mem
Hypothyroid     99.46 (0.01)  99.22 (0.05)   99.51 (0.04)  99.26 (0.05)   95.48 (1.38)   94.99 (1.42)   99.54** (0)     99.38 (0.03)
Kr-vs-Kp        99.08 (0.01)  98.78 (0.26)   99.16 (0.12)  98.66 (0.13)   87.38 (0.79)   86.19 (8.27)   99.65 (0.01)    99.67*** (0.03)
Credit-g        72.35 (1.38)  71.45 (2.54)   72.2 (0.9)    72 (3.57)      65.8 (11.03)   64.39 (4.86)   75.48*** (0.74) 74.76 (5.25)
Letter          84.41 (0.89)  82.88 (0.8)    84.01 (0.54)  82.53 (0.67)   no mem         no mem         no mem          no mem
Mushroom        100 (0)       100 (0)        99.96 (0)     99.96 (0)      99.95 (0)      99.93 (0.01)   99.99 (0)       100 (0)
Nursery         96.13 (0.12)  95.31 (0.15)   95.89 (0.04)  95.07 (0.3)    75.54 (9.35)   76.32 (25.49)  98.81** (0.07)  98.23 (0.15)
Optdigits       88.48 (0.05)  87.32 (0.54)   88.8 (0.55)   87.32 (1.27)   74.05 (1.96)   73.15 (3.19)   97.18*** (0.02) 96.93 (0.14)
Pageblocks      96.83 (0.02)  96.72 (0.21)   96.79 (0.02)  96.67 (0.05)   95.76 (0.14)   95.19 (0.29)   96.9*** (0.01)  96.81 (0.03)
Segment         95.42 (0.24)  96.19 (0.53)   95.31 (0.16)  94.28 (1.9)    89.54 (1.54)   88.75 (4.71)   95.96 (0.21)    96.34 (1.42)
Sick            98.52 (0.03)  98.31 (0.17)   98.62 (0.04)  98.49 (0.2)    96.08 (0.59)   95.54 (2.29)   98.92 (0.01)    98.96*** (0.09)
Spambase        91.9 (0.06)   91.67 (0.99)   92.28 (0.07)  91.73 (0.79)   88.77 (0.53)   88.4 (0.91)    93.43*** (0.05) 93.11 (0.23)
Waveform5000    76.09 (1.11)  75.94 (1.17)   76.46 (0.34)  76.15 (1.05)   62.35 (0.58)   61.32 (3.88)   86.68 (0.46)    86.89*** (0.4)

Table 5: Overall Results, as mean percentage of correctly classified instances with variance in parentheses (p
