Data Mining Term Paper

FINAL REPORT

DATA MINING

Reported by: Nguyen Bao An – M9839920

Date: 99/06/16

Outline

In this report I present my study in the Data mining course. It includes my two proposed approaches in the field of clustering, my learn lessons in class and my comment on this class. The report’s outline is as following:

Part I: Proposed approaches

1. Introduction and backgrounds

2. Related works and motivation

3. Proposed approaches

4. Evaluation method

5. Conclusion

Part II: Lessons learned

1. Data preprocessing

2. Frequent pattern and association rule

3. Classification and prediction

4. Clustering

Part III: My own comments on this class.

I. Proposed approach

• An incremental subspace-based K-means clustering method for high dimensional data

• Subspace based document clustering and its application in data preprocessing in Web mining

1. Introduction and background

High dimensional data clustering has many applications in real world, especially in bioinformatics. Many well-known clustering algorithms often use a whole-space distance score to measure the similarity or distance between two objects, such as Euclidean distance, Cosine function... However, in fact, when the dimensionality of space or the number of objects is large, such whole-space-based pairwise similarity scores are no longer meaningful, due to the distance of each pair of object nearly the same [5].

Pattern-space clustering, a kind of subspace clustering, can overcome above problem by discovering such patterns that existing in subspaces of the high dimensional data. An example of subspace pattern can be seen in Figure 1: Given a dataset consists of 5 objects having 5 attribute {a, b, c, d, e} as in the table1. A subspace pattern with 4 attributes {a, b, d, e} exists in 3 object {2, 3, 5}.

[pic]

Figure 1. An example of pattern in subspace

Based on the pattern discovered from the data, we can easily assign objects that have the same subspace pattern into a cluster. Hence, the most essential task of pattern based clustering is determining the patterns hidden in the high-dimensional data. Following background is useful for this task.

As described in [2], given two objects u, v ( D and δ > 0, we say that there exists a coherent pattern between u and v in subspace S, if

[pic]

(1) and (2) can be normalized by using the attribute k, we have

[pic]

A coherent pattern can be seen in Figure 2

[pic]

Figure 2. Coherent Pattern

In [2], (3) and (4) are used for computing the coherent patterns between objects by using a centroid p and based attribute k. The coherent between an object u and the centroid p is illustrated by the appearance of attribute in the transaction database which obtained by transforming the draw original dataset using the following formula:

[pic] (5)

if [pic], the attribute i will be add to corresponding transaction.

An example of this transforming can be seen on Figure 3: Selecting R3 to be the centroid and d2 to be the based attribute, and set the threshold δ = 1. Consider the record R1:

dR11=|(4-2) – (3-3)| = 2 > δ,

dR12=|(2-2) – (3-3)| = 0 < δ, d2 is add to the corresponding transaction

dR13=|(5-2) – (4-3)| = 2 > δ,

dR14=|(1-2) – (2-3)| = 0 < δ, d4 is add to the corresponding transaction

As a result, the transaction corresponding to R1 will contain {d2, d4}.

[pic]

Figure 3. An example of transforming raw dataset in transaction database

After obtaining the transaction database, we can use any mining algorithm for getting the frequent patterns in the dataset. Now, the remaining task of clustering is assigning such objects that correspond to the same pattern into a cluster.

In this report I use above background from [2] to develop a new k-means clustering algorithms in which patterns in subspace is used to measure the distance/similarity among object. A preparation step is employed in this algorithm in order to select the appropriate centroid, due to the fact that the centroid selection strongly affects to clustering quality [1]. [1] states that such object that has more neighbors is more appropriate to be a good centroid. This preparation step employs the binning histogram analysis to estimate the number of local neighbors each object in each single dimension (attribute), an overall score will be computed to decide which object will be the centroid. By avoiding randomly centroid selection, this step overcome the drawbacks of existing methods which require to repeat the algorithms a number of times to smoothing the randomicity of clustering results.

Using the idea of subspace-based clustering, I also propose a method for document clustering based on pattern in subspace, in which the similarities between two documents are computed by the coherent between their specific terms. Based on the assumption that similar documents or such documents that belong to the same category often share some common specific terms, I suggest that we should discovery the sets of common terms and use them as patterns for clustering. This new method can overcome the drawbacks of existing methods, which use the pairwise distance of documents such as cosine function, by avoiding the significant computation cost and memory consumption in the task of constructing the similarity matrix.

This document clustering method can contribute as a preprocessing step in any scheme that needs to compute the pairwise similarity of document and construct the similarity matrix [For example, 莊見文 ‘s method presented in group meeting 99/06/09]. Sometimes, such scheme must compute similarities of every pair of documents in the data set. Its time complexity for construct similarity matrix might be O(n2). When n is very large, the running time will rapidly increasing. However, when the number of documents is large and the number of specific terms in each document is much smaller than the total number of terms in the whole set of document (such as result of web mining), maybe there are a lot of documents pairs that do not share any common terms (or just share very few). As a result, their similarity score are very low or equal to 0. Thus, we must avoid computing similarity between these pairs to save running time and memory. Divide and conquer strategy can be applied in this case, by clustering such documents that share a number of common terms into the same clusters. Now the main task is computing similarity score and constructing the similarity between documents intra clusters, not in the whole space. The more number of clusters determined, the more dangling computation cost and memory are saved.

The next sections of this part are represented as following: section 2 is related works and motivation; section 3 is the proposed approaches; expected results and evaluation are presented in section 4 and 5, respectively; the last section is the conclusion.

2. Motivation and Related works

Subspace clustering partitions objects into clusters on the subsets of attributes or dimensions. Almost subspace clustering methods are listed and evaluate briefly in [3]. Approaches in this report are proposed mostly based on [1] and [2].

[2] uses the pattern in subspace as criteria for clustering. At first, a centroid p and a based attribute k are selected randomly. Next, the raw dataset will be transformed into transaction database. Based on value of the centroid and the based attribute, coherent patterns between each record u toward the centroid p is computed by using the formula [pic], if this value is less than the threshold δ, the attribute i will be added into the corresponding transaction. [2] uses the data structure Pattern tree (P-tree) for mining the transaction database. This tree requires scanning the database only once for construction. Mining the P-tree, frequent patterns is determined. The pattern which has largest number of attributes and objects belong to will be chosen for clustering. Such objects correspond to the clustering pattern will be assigned to the cluster and removed from the original database. This process is repeated with the remaining objects until not new cluster obtained.

[2] has a light weight computation cost as well as only require scan the database once per cluster. However, as showed in [1], the results of clustering strongly depend on the selecting of the centroids. Randomly selected centroid in [2] may bring low quality clusters when the centroids just have coherent patterns toward a small number of objects.

The experimental results of [1] show that the selection of such centroids which have high number of neighbors can increase the quality of k-means algorithm on document clustering. Based on this assumption, I add a preparation step into [2], in which the centroid for each iteration is selected based on the statistic information of values of objects on each attributes. To reduce the computation cost in this step, I use binning histogram for counting the local neighbors of every object on each dimension. The total “neighbor score” of every objects is computed and compared for selecting the centroid, which is the object has largest “neighbor score” value.

Existing text clustering methods use different similarity measurement. The most commonly used is cosine function [1]. The cosine similarity between two document di and dj can be computed by

[pic]

Those such clustering method that employ the similarity scores often construct a similarity matrix in which each value sij in the matrix is the pairwise similarity score between two document di and dj, such as cos(di, dj). Sometimes, when the number of documents is too large, the construction task of similarity matrix might take a very long running time. In addition, in case of the data set is sparse; computing similarity of every pairs of documents is wasting time and not useful, which a large proportion of similarity values are nearly equal to zero. This problem can be overcome by using the benefit of pattern-based clustering, in which such document sharing common pattern on specific terms will be considered to be assigned into the same cluster.

Notations
|Notations |Meanings |
|nr |Minimum number of attribute of a subspace pattern |
|nc |Minimum number of attribute in a cluster |
|δ |Distance threshold |
|α |Number of object in a cluster |
|β |Dimesionality of attributes in subspace |
|T |The transaction database |
|Ti |The ith transaction in T |
|R |The original database |
|Ri |The ith object in R |
|FI |The set of frequent itemsets |
|CI |The set of candidate itemsets |
|S |The subspace chosen for clustering |
|ti |The ith term in text space |
|tfui |The term frequency of term i in document u |

3. Proposed approach

A. An incremental subspace-based K-means clustering method for high dimensional data

The proposed approach can be presented briefly as following: To overcome the randomicity caused by randomly centroid selected in [2], I put a preparation step to select the centroid. Based on this centroid, the raw dataset will be transformed into transaction database. Mining this transaction database, candidate frequent itemsets are obtained. Such candidate itemsets which do not satisfy the minimum support or minimum dimensionality will be pruned. The best itemset which has highest score both on support and dimensionality will be selected to be the cluster pattern. Based on this pattern the assignment task is executed in which a cluster will be obtained. The overall steps are repeated until k clusters determined.

The whole process can be divided in four following steps:

- Selecting the centroid based on “neighbor score” and a based attribute

- Transforming the raw dataset into transaction database

- Mining the transaction database to obtain the candidate itemsets, pruning none-satisfying itemsets, selecting the best itemset for clustering pattern.

- Assign objects correspond to the clustering pattern into the cluster.

▪ Step 1: Selecting the best centroid based on “neighbor score” and the based attribute

The idea of this step mainly bases on the assumption get from [1]: the centroid having more neighbors is more likely giving the better clustering quality. In fact, in such clustering algorithm which based on distance or density, the cluster is formed in the neighborhood of the centroid. Hence, the more neighbors the centroid has, the more compact cluster obtained.

However, in high dimensional data, it’s not easy to count the neighbors number of all object due to computation cost problem (O(n) = n2). Hence, I use the benefit of equal-width histogram to consider local neighborhood of each object on each individual dimension, in which the width of each bucket is equal to δ. The range of value of each dimension is divided into a number of buckets having width is δ. The numbers of objects belong to each bucket are obtained after analyzing the histogram of the dimension. A neighbor matrix M is formed, in which the value Mij of an object i at an attribute j is the number of objects of the bucket which that object belongs to.

Example:

Assume that when considering histogram of distribution of value on attribute a and b of an dataset, set the width of each bucket is 10, we have following histograms

[pic] [pic]

A part of dataset is illustrated in Table 1 and its corresponding neighbor matrix is presented in Table 2. Consider R1 ={23, 22}, based on the binning histograms, its value in neighbor matrix is (10, 20)

|Value |a |B |

▪ Selection the based attribute k:

The based attributed k is used for normalizing (1) and (2), this task we just select the attribute on which the centroid has most local neighbors. As in the example, the attribute b is selected.

▪ Step 2: Transforming the raw dataset into transaction database [2]

Base on the centroid p and the based attribute k selected in step 1, the raw dataset is transformed into transaction database using (5).

[pic]

if [pic], the attribute i will be add to corresponding transaction.

▪ Step 3: Mining the transaction database [2]

After obtaining the transaction database, FP-growth or Pattern-tree is used for determining the frequent itemset FI. This step also prune out such itemsets FIi do not meet the criteria on minimum dimensionality and minimum support (number of objects belong to). After pruning, we have the candidate itemset CI.

A measurement of clustering quality is used to select in CI the best itemset (pattern on subspace) for clustering

[pic] (7)

In which, α is the support of CIi, β is dimensionality of CIi, and k is the tradeoff parameter for user adjust the importance between the two criteria, dimensionality versus support.

▪ Step 4: Clustering assignment [2]

Based on the subspace pattern found in step 3, the assignment step is executed. In which, every object that have contain the clustering pattern is assigned into the clusters and removed from the original database.

An overall step of this approach can be illustrated as following:

Algorithm k_subspace_clustering Input: The database, number of cluster k; δ, nr, nc; Output: k clusters { Repeat until k clusters obtained { //select the best centroid Binning value ranges of all attributes into buckets whose width equal to δ; Analyzing the histogram of each attribute based on each bucket; Construct the neighbor matrix; Prune out none-appropriate objects; For each row i Compute δ, β and fi(δ, β); Select the centroid p having max fi; Select based attribute k having most local neighbors in p;

Transform raw dataset into transaction database; Construct the FP-tree;

//mining the FP-tree Mining the FP-tree ( FI; Pruning FI ( CI; Computed π(α,β) for each CIi; Select subspace S is the CIi having largest π(α,β) value;

//assign objects into cluster For each object Ti in transaction database if Ti ( S then { add Ri into cluster; remove Ri from the original database; } } }

B. An subspace based document clustering algorithm

In almost document mining system, a document is represented as a vector space model. Each document d is consider as a vector in the term-space and represented by the term frequency (TF) vector

[pic]

Where tfi is the frequency of term in the document, and D is the total number of unique terms in the text database. Assume that the preprocessing phase is good enough for remove stop words and low-discrimination-power-words; we consider that all terms in the vector model have the same weight. [The idea that “terms appearing frequently in many documents have limited discrimination-power, so they need to be deemphasized” should be consider carefully in the preprocessing phase]

Considering a document d in the text database, a term i can be considered as an attribute, and the frequency tfi of that term can be considered as the value on the corresponding attribute. In the whole view, the text database can be considered as a data matrix. |t1 |t2 |t3 |t4 |t5 |t6 |t7 |t8 |t9 | |d1 |0 |4 |1 |5 |7 |7 |3 |0 |5 | |d2 |3 |6 |4 |0 |5 |0 |0 |6 |9 | |d3 |1 |1 |9 |1 |9 |5 |1 |9 |10 | |d4 |0 |0 |7 |8 |8 |2 |10 |5 |4 | |d5 |5 |4 |4 |0 |0 |3 |0 |1 |1 | |d6 |5 |0 |1 |5 |5 |7 |10 |10 |5 | |d7 |7 |3 |1 |8 |10 |4 |1 |2 |6 | |d8 |9 |2 |3 |7 |4 |0 |6 |1 |9 | |Table 5. An example of text data matrix

However, with a large set of documents, the text database will be sparse, so a large proportion of values in the matrix have zero value. In addition, the frequency of each unique term in a document is limited, so the ranges of values of the data matrix are also small.

In the text data, there are the correlations existing among terms. It is not appropriate for using the distance measurement as first section. So following definitions is needed for consider the coherent pattern in text data.

Definition 1: Given two documents d1 and d2 represented in term frequency vectors dtf1 and dtf2 and (>0. We say that there is a coherent pattern between d1and d2 in subspace S, if:

(i,j ( S, [pic] (7)

(i,j (S, [pic] (8)

Example: Given dtf1 = [3, 7, 9], dtf2 = [6, 12, 10] and (=0.2

[pic] < (

[pic]> (

We say that there is an coherent pattern between d1 and d2 in subspace S=(t1, t2)

This definition takes into account the ratios of appearance frequency of unique terms in a document. The distance dij can be considered as the variance between two frequency ratios of two terms ti and tj. The smaller dij value, the more likely there is a coherent pattern on subspace S =(ti, tj).

(7) and (8) also can be normalized by using the attribute k (S, similarly to the first section.

(i,j ( S, [pic] (9)

(i,j ( S, [pic] (10)

Based on (9) and (10) and the knowledge in [2], the overall steps of this algorithm are similar to [2]. It consists of following step:

- Selecting the centroid based on “neighbor score” and a based attribute

- Transforming the text database into transaction database

- Mining the transaction database to obtain the candidate itemsets, pruning none-satisfying itemsets, selecting the best itemset for clustering pattern.

- Assign documents correspond to the clustering pattern into the cluster.

The steps 1, 3 and 4 are similar to corresponding steps in section A. I just represent step 2 in this section.

▪ Step 2: Transforming the text database into transaction database

Base on the centroid p and the based attribute k selected in step 1, the text database is transformed into transaction database using (11).

[pic] (11)

if [pic], the term ti will be add to corresponding transaction.

❖ Applying subspace based document clustering for data preprocessing in web mining

In web mining, the numbers of documents obtained are often very large. The need that considering the similarities among documents is practical. For example, users when reading news often do not want reading two piece of news having the same content. This task requires such mining schemes that are able to give the similarity scores among documents [as in莊見文 ‘s method]. Such schemes often consider the pairwise similarity score as the main feature, and they construct the similarity matrix which consist these scores.

However, due to large amount of documents, computing all pairwise similarity score of all document pairs in the text database seems to be a very heavy-loading work. In addition, the possibility that existing many document pairs that do not have any similarity is quite high; we should avoid doing the comparison tasks on such document pairs. This can be realized by using the processing step, in which such document do not have enough common terms is separated in different partitions. By clustering the set of documents into clusters, in which such documents sharing the same patterns on a set of common terms are assign to the same cluster. Now we just execute the comparison tasks on documents intra clusters, hence, the computation cost will be decreased significantly.

Consider the set of N documents, the time complexity of the original scheme is O(n) = N2. If this set is divided in to k clusters, in which clusters ith contains ni document. We have: [pic], the time complexity in this case is [pic]. Time complexity can be reduced is [pic].

An example of using the advance of clustering is illustrated in Figure 4 in which the set of document contains 20 documents. They are divided into three clusters containing 5, 7 and 8 documents, respectively. Total time complexity for computing the similarity scores is: 52 + 72 + 82, much less than 202 |d1 |d2 |d3 |d4 |d5 |d6 |d7 |d8 |d9 |d10 |d11 |d12 |d13 |d14 |d15 |d16 |d17 |d18 |d19 |d20 | |d1 | | | | | | | | | | | | | | | | | | | | | |d2 | | | | | | | | | | | | | | | | | | | | | |d3 | | | | | | | | | | | | | | | | | | | | | |d4 | | | | | | | | | | | | | | | | | | | | | |d5 | | | | | | | | | | | | | | | | | | | | | |d6 | | | | | | | | | | | | | | | | | | | | | |d7 | | | | | | | | | | | | | | | | | | | | | |d8 | | | | | | | | | | | | | | | | | | | | | |d9 | | | | | | | | | | | | | | | | | | | | | |d10 | | | | | | | | | | | | | | | | | | | | | |d11 | | | | | | | | | | | | | | | | | | | | | |d12 | | | | | | | | | | | | | | | | | | | | | |d13 | | | | | | | | | | | | | | | | | | | | | |d14 | | | | | | | | | | | | | | | | | | | | | |d15 | | | | | | | | | | | | | | | | | | | | | |d16 | | | | | | | | | | | | | | | | | | | | | |d17 | | | | | | | | | | | | | | | | | | | | | |d18 | | | | | | | | | | | | | | | | | | | | | |d19 | | | | | | | | | | | | | | | | | | | | | |d20 | | | | | | | | | | | | | | | | | | | | | |Figure 4. An example of time complexity in using clustering

In case of using clustering preprocessing, not only time complexity are reduced, much memory and storage space are also saved due to the reducing on dimensionality similarity matrix.

4. Evaluation method

For evaluating the performance of methods, I run the experiments for evaluate on two aspects: accuracy and running time.

With the first algorithms, I use two types of testing data sets: synthetic data and real data. The real data can be get from a bioinformatics database that contains sequences of gene strips.

With the second one, sets of data sets which mainly based on web data is used for testing. It is also used as a preprocessing step for other text mining algorithms. In that, each algorithm will be run within and without preprocessing phase using my clustering method, the contribution of proposed of my method will be seen when comparing the results of testing.

In order to compare new algorithms with existing ones, I will use the same data sets with them. For first approach, I will compare it with pCluster, SeqClus, and CPT [2]. As this approach mainly bases on [2], the accuracy should be considered carefully in order to make a conclusion that whether my heuristic on selecting the centroid is useful or not. For the second approach, it will be compare with existing cosine-function-based algorithms for text clustering, including K-means, Bi-secting K-means…

5. Conclusion

My two methods are based mainly on the process of [2]. In the first method, based on the assumption that a good centroid can improve clustering quality, and such object having many local neighbors might be a good centroid. I putted a preparation step into [2]. My expected contribution is the first method’s results can improve the accuracy and reduce the number of iterations of [2].

In the second method, I proposed a new definition of coherent patterns on text data based on the ratios of appearance frequency of terms in the documents. If this idea can be developed, I hope a more light weight clustering method for text documents will be innovated.

II. Lesson learned

1. Data preprocessing

a. Why?

- Data is not ready or incomplete for mining: missing value, error, noise, outliers…

- Data is not simply for mining: from many data sources, having many hierarchical levels of abstraction…

b. Data cleaning

- Missing value:

o Ignore the tuple

o Filling missing value by a common value.

o Filling missing value by a value computed from the existing values.

o Using mean of attribute for filling missing value

- Noisy data

o Binning: data is sorted and distributed into a number of buckets, or bin. Data will be smoothened by the neighborhood.

o Regression: Smoothing the data by regression, in which the value of an attribute can be predicted by one or some other attributes. o Clustering: Partitioning similar values into groups, as a result, the outliers can be removed.

c. Data integration and data transformation

- Data integration: combining data from many sources into a coherent data store, as in data warehousing.

- Data transformation:

o Smoothing: removing noise from the data. Such technique including regression, binning and clustering.

o Aggregation: using summary or aggregation operations on the data. For example, constructing a data cube for analysis of the data on multiple granularities.

o Generalization: replacing low-level (raw) data by higher-level one (in a hierarchical concept).

o Normalization: Data is scaled for fitting into a small specific range, for example 0.0 to 1.0.

d. Data reduction: for dealing with a huge dataset, sometimes, data reduction should be used reducing amount of tuples.

- Data cube aggregation

- Attribute subset selection: detecting and removing weakly relevant or redundant attribute.

- Dimensionality reduction

- Numerously reduction: the data is replaced or estimated by alternative, smaller representations.

- Discretization and concept hierarchy generation: raw data values are replaced by ranges of higher conceptual levels.

o Binning

o Histogram analysis: partitioning the values in an attribute into disjoin ranges called buckets.

o Entropy-based discretization (involve with classification).

o Cluster analysis:

2. Frequent pattern – Association rule

Mining frequent patterns in can be helpful in discovery association, correlations or hidden rules behind the data. Almost frequent pattern mining algorithms run on transaction database, where each transaction represents the appearances of items contained in it.

Some common algorithms and data structures for mining frequent pattern

• Apriori

Apriori is the most common algorithms. At the kth iteration, it form frequent k-itemsets based on frequent (k-1)-itemsets. It using downward closure property to prune out infrequent itemsets at each iteration: “All nonempty subset of a frequent itemset must also be frequent”

The main drawback of Apriori algorithm is multiple times of scanning database and a vast number of generated candidate itemsets.

• FP-growth – Frequent pattern tree

FP growth is an efficient algorithm for transforming the transaction database into a compressed tree structure which needs to scan database twice.

- Scan DB once to find frequent 1-itemsets

- Sort frequent item in descending order of frequencies.

- Scan DB second time for construct the FP-tree.

Mining the FP tree gives the results are the frequent itemsets. It uses a pattern fragment growth method to avoid the costly process of candidate generation and testing used by Apriori.

• FUFP tree – Fast Update FP Tree

In incremental transaction databases, there are a lot of changes in data: inserted, modified, deleted… FUFP tree is an effective structure based on the idea of FP tree which can be suitable with changes.

- What FUFP tree is: an advance algorithm lets FP tree to be updated easier and reduce the reconstruct time when new transactions are inserted.

- Differences between FP tree and FUFP tree: while FP tree’s nodes have single directional links, FUFP tree’s nodes have bi-directional links that allow more easily delete or insert nodes.

• P-tree – Pattern tree

Rather than scan database twice in FP-growth, Pattern tree just needs scan database only once for constructing, using the utility of Leaf Node Table (LNT).

• CHARM – Mining closed frequent itemsets with vertical transaction database

It enumerates closed sets using a dual itemset-tidset search tree, using an efficient hybrid search that skips many levels. A technique called diffsets is used to reduce the memory footprint of intermediate computations. Finally, it uses a fast hash-based approach to remove any “non-closed ” sets found during computation

• CLOSET+ : An efficient method for closed itemset mining.

Using divide and conquer and depth first search strategies and FP-tree as a compression structure. Local frequent itemsets are computed along with a hybrid-tree-projection, item merging, sub-itemset pruning methods and item skipping technique for further prune search space and speed up mining.

3. Classification & Prediction

• Classification

Partitioning the data into labeled classes, a model of classifier is constructed to predict categorical labels. These categories can be represented by discrete values.

• Prediction

Based on the existing data, a predictor model is constructed to predict a continuous-value-function. The predicted values can be numeric (continuous) values

• How classification and Prediction work?

Based on existing data (training data), a model is constructed, this model often contains a set of classification or prediction rule that can be learn from the training data. Running this model with the input is the data that need to be predicted or classified (testing data), we can obtain the results.

The model here can be a specific type of a machine learning methods such as Decision tree, Neural network… or a statistical analysis models such as Regression analysis, Naïve Bayesian Classification…

• Decision tree

A decision tree is a flowchart-like tree structure, where a nonleaf node denotes a test of an attribute, each branch represents an outcome of the test, and each leaf node hold a class label.

[pic]

The construction of decision tree classifiers does not require any domain knowledge or parameter setting; therefore, it is appropriate for knowledge discovery.

• Attribute selection measures

Attribute selection measures is a heuristic for selecting the spitting criterion that best for portioning the given tuples into individual class.

- Information gain: the attribute having highest gain value should be chosen for splitting.

[pic]

where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. Info(D) is also known as entropy of D

[pic]

[pic]

- Gain ratio: an extension of Information gain which attempts to overcome the bias. It applies a normalization to information gain using a “split information”

[pic]

[pic]

The attribute that has the highest GainRatio value should be selected for splitting attribute

• Naïve Bayesian Classification

Predicting a tuple X belongs to a class with the highest posterior probability, conditioned on X. That is, X belongs to class Ci if and only if

[pic]

P(Ci|X) should be maximized. The class Ci which P(Ci|X) is maximized is called the maximum posteriori hypothesis.

[pic]

• Support Vector Machine (SVM)

SVM can be used for both linear and nonlinear data. It uses a nonlinear mapping to transform the original data into a higher dimension. Within the new dimension, it searches for the linear optimal separating hyperplane. Data from two classes can always be separated by a hyperplane, in case of we have an appropriate mapping. The SVM find the hyperplane using support vectors.

• k-Nearest-Neighbors Classifiers

This method is a kind of Lazy Learner (Learning from the Neighbors), based on learning by analogy. It compare a given test tuple with training tuples that are similar to it. All the training tuples are stored in an n-dimensional pattern spaces. Given an unknown-tuple, the k-nearest-neighbors classifier searches the pattern spaces for getting the k-training tuples that are closest to the unknown tuple, called its k-nearest neighbors. Then the unknown tuple is assigned to the most common class among its k nearest neigbors.

Euclidean distance is often used as the similarity measurement between two tuples:

[pic]

• Bagging and Boosting

To improve the accuracy of prediction or classification, bagging and boosting are used as ensemble methods. They are combinations of model. Each combines a series of k learned model (M1, M2, M3…Mk), with the aim of creating an improved composite model, M*.

- Bagging:

At each ith iteration, a set of Di is sampled from the training dataset. A model Mi is learned from Di. Given a tuple X, model Mi return its class prediction, which count as one vote. The bagged classifier M* count the votes and assign the class with the most votes to X.

- Boosting

Weights are assigned into each training tuple. A series of k classfiers is iteratively learned. The weights are updated after each iteration. The final boosted classifier M* combines the votes of each individual classifier, where the weight of each classifier’s votes is a function of its accuracy.

4. Clustering

Partitioning items into unlabeled class, in which maximizing similarity intra-cluster and minimize similarity inter-cluster. Almost clustering algorithms employ two measurement functions:

- Similarity measurement between two objects: Euclidean distance, Manhattan distance, Cosine function…

- Cluster quality measurement: measure the compactness of each cluster to decide whether a cluster should be spitted (or refined) or not.

In order to compute the similarity measurement among objects, items are often represented by a multidimensional vector. Thus, a set of objects will form a data matrix M.

[pic]

Applying one of similarity function on the data matrix, we can obtain similarity/dissimilarity matrix D, where each value dij of D is a pairwise similarity/difference between to object i and j.

[pic]

• K-means algorithm: an centroid-based technique

- First, randomly select k-centroids.

- Assign each object into the cluster that is most similar, based on the distance between of that object with cluster mean.

- After assign all objects into k clusters, update the cluster means and select new k centroids.

- Repeat the assignment and refinement step until no change

• Bisecting k-means:

- Instead of selecting k centroids at the initial phase, Bisecting k-means only select two centroids, the assignment task partitions the set of objects into two clusters based on the similarity value of each object with the centroids.

- The compactness function of each cluster will be compute to determine which cluster should be split. Such cluster that has the smallest compactness measurement will be select to be split into two new clusters.

- At each iteration, the compactness measurement will be recomputed and there is a cluster split. The splitting task will be repeated until k clusters obtained.

III. My own comments on this class

I completely agree with the class schedule, class assignments and lessons of this class. As a suggestion, some “hot application fields” of data mining should be introduced in more details in class, such as web mining, multimedia mining… as well as some available tools for them. From my view point of view, the most challenge tasks in data mining is the preprocessing phase, so, if possible, a work through example of this phase should be introduced for student, in order to give more visual understanding.

. REFERENCES

[1]. Congnan Luo – Yanjun Li – Soon M. Chung, Text Document Clustering Based on Neighbors, in: Data & Knowledge Engineering – Vol. 68 – Issue 11, November 2009 – pp 1271-1288.

[2]. Jihong Guan – Yanlan Gan – Hao Wang, Discovery pattern-based subspace cluster by pattern tree, in: Knowledge Based System – Vol 22 Issuse 8 – Dec 2009

[3]. Hans Peter Kriegel – Peer Kroger – Arthur Zimek, Clustering High-Dimensional Data: A survey on Subspace clustering, Pattern-based Clustering and Correlation Clustering, in: ACM Transaction on Knowledge Discovery from Data, Vol. 3, No. 1, Article 1, March 2009

[4]. Liping Jing - Michael K. Ng – Joshua Zhexue Huang, An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data, in: IEEE Transaction on Knowledge and Data Engineering, Vol. 19, No. 8, August 2007.

[5] L.Parsons - E.Haque - H.Liu, Subspace clustering for high dimensional data: a review, SIGKDD Explorations 6 (1) (2004) pp 90–105

[pic]

Data Mining Term Paper

Similar Documents

Importance Of Data Mining In Learning

Cis 500 Complete Class Assignments and Term Paper

Social Media White Paper

Vidoe Mining

Kdd Review

Data Minig

Data Minig

An Evolution of Computer Science Research

Carbon Tax Mining

Plant Location

It and Its Scope

Text Mining Research Paper

Taking Advantage of Data Mining

Honesty and Economics

Business Intelligence

Popular Essays