Free Essay

Data Mining Term Paper


Submitted By baoanth
Words 5913
Pages 24


Reported by: Nguyen Bao An – M9839920

Date: 99/06/16


In this report I present my study in the Data mining course. It includes my two proposed approaches in the field of clustering, my learn lessons in class and my comment on this class. The report’s outline is as following:

Part I: Proposed approaches

1. Introduction and backgrounds

2. Related works and motivation

3. Proposed approaches

4. Evaluation method

5. Conclusion

Part II: Lessons learned

1. Data preprocessing

2. Frequent pattern and association rule

3. Classification and prediction

4. Clustering

Part III: My own comments on this class.

I. Proposed approach

• An incremental subspace-based K-means clustering method for high dimensional data

• Subspace based document clustering and its application in data preprocessing in Web mining

1. Introduction and background

High dimensional data clustering has many applications in real world, especially in bioinformatics. Many well-known clustering algorithms often use a whole-space distance score to measure the similarity or distance between two objects, such as Euclidean distance, Cosine function... However, in fact, when the dimensionality of space or the number of objects is large, such whole-space-based pairwise similarity scores are no longer meaningful, due to the distance of each pair of object nearly the same [5].

Pattern-space clustering, a kind of subspace clustering, can overcome above problem by discovering such patterns that existing in subspaces of the high dimensional data. An example of subspace pattern can be seen in Figure 1: Given a dataset consists of 5 objects having 5 attribute {a, b, c, d, e} as in the table1. A subspace pattern with 4 attributes {a, b, d, e} exists in 3 object {2, 3, 5}.


Figure 1. An example of pattern in subspace

Based on the pattern discovered from the data, we can easily assign objects that have the same subspace pattern into a cluster. Hence, the most essential task of pattern based clustering is determining the patterns hidden in the high-dimensional data. Following background is useful for this task.

As described in [2], given two objects u, v ( D and δ > 0, we say that there exists a coherent pattern between u and v in subspace S, if


(1) and (2) can be normalized by using the attribute k, we have


A coherent pattern can be seen in Figure 2


Figure 2. Coherent Pattern

In [2], (3) and (4) are used for computing the coherent patterns between objects by using a centroid p and based attribute k. The coherent between an object u and the centroid p is illustrated by the appearance of attribute in the transaction database which obtained by transforming the draw original dataset using the following formula:

[pic] (5)

if [pic], the attribute i will be add to corresponding transaction.

An example of this transforming can be seen on Figure 3: Selecting R3 to be the centroid and d2 to be the based attribute, and set the threshold δ = 1. Consider the record R1:

dR11=|(4-2) – (3-3)| = 2 > δ,

dR12=|(2-2) – (3-3)| = 0 < δ, d2 is add to the corresponding transaction

dR13=|(5-2) – (4-3)| = 2 > δ,

dR14=|(1-2) – (2-3)| = 0 < δ, d4 is add to the corresponding transaction

As a result, the transaction corresponding to R1 will contain {d2, d4}.


Figure 3. An example of transforming raw dataset in transaction database

After obtaining the transaction database, we can use any mining algorithm for getting the frequent patterns in the dataset. Now, the remaining task of clustering is assigning such objects that correspond to the same pattern into a cluster.

In this report I use above background from [2] to develop a new k-means clustering algorithms in which patterns in subspace is used to measure the distance/similarity among object. A preparation step is employed in this algorithm in order to select the appropriate centroid, due to the fact that the centroid selection strongly affects to clustering quality [1]. [1] states that such object that has more neighbors is more appropriate to be a good centroid. This preparation step employs the binning histogram analysis to estimate the number of local neighbors each object in each single dimension (attribute), an overall score will be computed to decide which object will be the centroid. By avoiding randomly centroid selection, this step overcome the drawbacks of existing methods which require to repeat the algorithms a number of times to smoothing the randomicity of clustering results.

Using the idea of subspace-based clustering, I also propose a method for document clustering based on pattern in subspace, in which the similarities between two documents are computed by the coherent between their specific terms. Based on the assumption that similar documents or such documents that belong to the same category often share some common specific terms, I suggest that we should discovery the sets of common terms and use them as patterns for clustering. This new method can overcome the drawbacks of existing methods, which use the pairwise distance of documents such as cosine function, by avoiding the significant computation cost and memory consumption in the task of constructing the similarity matrix.

This document clustering method can contribute as a preprocessing step in any scheme that needs to compute the pairwise similarity of document and construct the similarity matrix [For example, 莊見文 ‘s method presented in group meeting 99/06/09]. Sometimes, such scheme must compute similarities of every pair of documents in the data set. Its time complexity for construct similarity matrix might be O(n2). When n is very large, the running time will rapidly increasing. However, when the number of documents is large and the number of specific terms in each document is much smaller than the total number of terms in the whole set of document (such as result of web mining), maybe there are a lot of documents pairs that do not share any common terms (or just share very few). As a result, their similarity score are very low or equal to 0. Thus, we must avoid computing similarity between these pairs to save running time and memory. Divide and conquer strategy can be applied in this case, by clustering such documents that share a number of common terms into the same clusters. Now the main task is computing similarity score and constructing the similarity between documents intra clusters, not in the whole space. The more number of clusters determined, the more dangling computation cost and memory are saved.

The next sections of this part are represented as following: section 2 is related works and motivation; section 3 is the proposed approaches; expected results and evaluation are presented in section 4 and 5, respectively; the last section is the conclusion.

2. Motivation and Related works

Subspace clustering partitions objects into clusters on the subsets of attributes or dimensions. Almost subspace clustering methods are listed and evaluate briefly in [3]. Approaches in this report are proposed mostly based on [1] and [2].

[2] uses the pattern in subspace as criteria for clustering. At first, a centroid p and a based attribute k are selected randomly. Next, the raw dataset will be transformed into transaction database. Based on value of the centroid and the based attribute, coherent patterns between each record u toward the centroid p is computed by using the formula [pic], if this value is less than the threshold δ, the attribute i will be added into the corresponding transaction. [2] uses the data structure Pattern tree (P-tree) for mining the transaction database. This tree requires scanning the database only once for construction. Mining the P-tree, frequent patterns is determined. The pattern which has largest number of attributes and objects belong to will be chosen for clustering. Such objects correspond to the clustering pattern will be assigned to the cluster and removed from the original database. This process is repeated with the remaining objects until not new cluster obtained.

[2] has a light weight computation cost as well as only require scan the database once per cluster. However, as showed in [1], the results of clustering strongly depend on the selecting of the centroids. Randomly selected centroid in [2] may bring low quality clusters when the centroids just have coherent patterns toward a small number of objects.

The experimental results of [1] show that the selection of such centroids which have high number of neighbors can increase the quality of k-means algorithm on document clustering. Based on this assumption, I add a preparation step into [2], in which the centroid for each iteration is selected based on the statistic information of values of objects on each attributes. To reduce the computation cost in this step, I use binning histogram for counting the local neighbors of every object on each dimension. The total “neighbor score” of every objects is computed and compared for selecting the centroid, which is the object has largest “neighbor score” value.

Existing text clustering methods use different similarity measurement. The most commonly used is cosine function [1]. The cosine similarity between two document di and dj can be computed by


Those such clustering method that employ the similarity scores often construct a similarity matrix in which each value sij in the matrix is the pairwise similarity score between two document di and dj, such as cos(di, dj). Sometimes, when the number of documents is too large, the construction task of similarity matrix might take a very long running time. In addition, in case of the data set is sparse; computing similarity of every pairs of documents is wasting time and not useful, which a large proportion of similarity values are nearly equal to zero. This problem can be overcome by using the benefit of pattern-based clustering, in which such document sharing common pattern on specific terms will be considered to be assigned into the same cluster.

|Notations |Meanings |
|nr |Minimum number of attribute of a subspace pattern |
|nc |Minimum number of attribute in a cluster |
|δ |Distance threshold |
|α |Number of object in a cluster |
|β |Dimesionality of attributes in subspace |
|T |The transaction database |
|Ti |The ith transaction in T |
|R |The original database |
|Ri |The ith object in R |
|FI |The set of frequent itemsets |
|CI |The set of candidate itemsets |
|S |The subspace chosen for clustering |
|ti |The ith term in text space |
|tfui |The term frequency of term i in document u |

3. Proposed approach

A. An incremental subspace-based K-means clustering method for high dimensional data

The proposed approach can be presented briefly as following: To overcome the randomicity caused by randomly centroid selected in [2], I put a preparation step to select the centroid. Based on this centroid, the raw dataset will be transformed into transaction database. Mining this transaction database, candidate frequent itemsets are obtained. Such candidate itemsets which do not satisfy the minimum support or minimum dimensionality will be pruned. The best itemset which has highest score both on support and dimensionality will be selected to be the cluster pattern. Based on this pattern the assignment task is executed in which a cluster will be obtained. The overall steps are repeated until k clusters determined.

The whole process can be divided in four following steps:

- Selecting the centroid based on “neighbor score” and a based attribute

- Transforming the raw dataset into transaction database

- Mining the transaction database to obtain the candidate itemsets, pruning none-satisfying itemsets, selecting the best itemset for clustering pattern.

- Assign objects correspond to the clustering pattern into the cluster.

▪ Step 1: Selecting the best centroid based on “neighbor score” and the based attribute

The idea of this step mainly bases on the assumption get from [1]: the centroid having more neighbors is more likely giving the better clustering quality. In fact, in such clustering algorithm which based on distance or density, the cluster is formed in the neighborhood of the centroid. Hence, the more neighbors the centroid has, the more compact cluster obtained.

However, in high dimensional data, it’s not easy to count the neighbors number of all object due to computation cost problem (O(n) = n2). Hence, I use the benefit of equal-width histogram to consider local neighborhood of each object on each individual dimension, in which the width of each bucket is equal to δ. The range of value of each dimension is divided into a number of buckets having width is δ. The numbers of objects belong to each bucket are obtained after analyzing the histogram of the dimension. A neighbor matrix M is formed, in which the value Mij of an object i at an attribute j is the number of objects of the bucket which that object belongs to.


Assume that when considering histogram of distribution of value on attribute a and b of an dataset, set the width of each bucket is 10, we have following histograms

[pic] [pic]

A part of dataset is illustrated in Table 1 and its corresponding neighbor matrix is presented in Table 2. Consider R1 ={23, 22}, based on the binning histograms, its value in neighbor matrix is (10, 20)

|Value |a |B |

▪ Selection the based attribute k:

The based attributed k is used for normalizing (1) and (2), this task we just select the attribute on which the centroid has most local neighbors. As in the example, the attribute b is selected.

▪ Step 2: Transforming the raw dataset into transaction database [2]

Base on the centroid p and the based attribute k selected in step 1, the raw dataset is transformed into transaction database using (5).


if [pic], the attribute i will be add to corresponding transaction.

▪ Step 3: Mining the transaction database [2]

After obtaining the transaction database, FP-growth or Pattern-tree is used for determining the frequent itemset FI. This step also prune out such itemsets FIi do not meet the criteria on minimum dimensionality and minimum support (number of objects belong to). After pruning, we have the candidate itemset CI.

A measurement of clustering quality is used to select in CI the best itemset (pattern on subspace) for clustering

[pic] (7)

In which, α is the support of CIi, β is dimensionality of CIi, and k is the tradeoff parameter for user adjust the importance between the two criteria, dimensionality versus support.

▪ Step 4: Clustering assignment [2]

Based on the subspace pattern found in step 3, the assignment step is executed. In which, every object that have contain the clustering pattern is assigned into the clusters and removed from the original database.

An overall step of this approach can be illustrated as following:

Algorithm k_subspace_clustering Input: The database, number of cluster k; δ, nr, nc; Output: k clusters { Repeat until k clusters obtained { //select the best centroid Binning value ranges of all attributes into buckets whose width equal to δ; Analyzing the histogram of each attribute based on each bucket; Construct the neighbor matrix; Prune out none-appropriate objects; For each row i Compute δ, β and fi(δ, β); Select the centroid p having max fi; Select based attribute k having most local neighbors in p;

Transform raw dataset into transaction database; Construct the FP-tree;

//mining the FP-tree Mining the FP-tree ( FI; Pruning FI ( CI; Computed π(α,β) for each CIi; Select subspace S is the CIi having largest π(α,β) value;

//assign objects into cluster For each object Ti in transaction database if Ti ( S then { add Ri into cluster; remove Ri from the original database; } } }

B. An subspace based document clustering algorithm

In almost document mining system, a document is represented as a vector space model. Each document d is consider as a vector in the term-space and represented by the term frequency (TF) vector


Where tfi is the frequency of term in the document, and D is the total number of unique terms in the text database. Assume that the preprocessing phase is good enough for remove stop words and low-discrimination-power-words; we consider that all terms in the vector model have the same weight. [The idea that “terms appearing frequently in many documents have limited discrimination-power, so they need to be deemphasized” should be consider carefully in the preprocessing phase]

Considering a document d in the text database, a term i can be considered as an attribute, and the frequency tfi of that term can be considered as the value on the corresponding attribute. In the whole view, the text database can be considered as a data matrix. |t1 |t2 |t3 |t4 |t5 |t6 |t7 |t8 |t9 | |d1 |0 |4 |1 |5 |7 |7 |3 |0 |5 | |d2 |3 |6 |4 |0 |5 |0 |0 |6 |9 | |d3 |1 |1 |9 |1 |9 |5 |1 |9 |10 | |d4 |0 |0 |7 |8 |8 |2 |10 |5 |4 | |d5 |5 |4 |4 |0 |0 |3 |0 |1 |1 | |d6 |5 |0 |1 |5 |5 |7 |10 |10 |5 | |d7 |7 |3 |1 |8 |10 |4 |1 |2 |6 | |d8 |9 |2 |3 |7 |4 |0 |6 |1 |9 | |Table 5. An example of text data matrix

However, with a large set of documents, the text database will be sparse, so a large proportion of values in the matrix have zero value. In addition, the frequency of each unique term in a document is limited, so the ranges of values of the data matrix are also small.

In the text data, there are the correlations existing among terms. It is not appropriate for using the distance measurement as first section. So following definitions is needed for consider the coherent pattern in text data.

Definition 1: Given two documents d1 and d2 represented in term frequency vectors dtf1 and dtf2 and (>0. We say that there is a coherent pattern between d1and d2 in subspace S, if:

(i,j ( S, [pic] (7)

(i,j (S, [pic] (8)

Example: Given dtf1 = [3, 7, 9], dtf2 = [6, 12, 10] and (=0.2

[pic] < (

[pic]> (

We say that there is an coherent pattern between d1 and d2 in subspace S=(t1, t2)

This definition takes into account the ratios of appearance frequency of unique terms in a document. The distance dij can be considered as the variance between two frequency ratios of two terms ti and tj. The smaller dij value, the more likely there is a coherent pattern on subspace S =(ti, tj).

(7) and (8) also can be normalized by using the attribute k (S, similarly to the first section.

(i,j ( S, [pic] (9)

(i,j ( S, [pic] (10)

Based on (9) and (10) and the knowledge in [2], the overall steps of this algorithm are similar to [2]. It consists of following step:

- Selecting the centroid based on “neighbor score” and a based attribute

- Transforming the text database into transaction database

- Mining the transaction database to obtain the candidate itemsets, pruning none-satisfying itemsets, selecting the best itemset for clustering pattern.

- Assign documents correspond to the clustering pattern into the cluster.

The steps 1, 3 and 4 are similar to corresponding steps in section A. I just represent step 2 in this section.

▪ Step 2: Transforming the text database into transaction database

Base on the centroid p and the based attribute k selected in step 1, the text database is transformed into transaction database using (11).

[pic] (11)

if [pic], the term ti will be add to corresponding transaction.

❖ Applying subspace based document clustering for data preprocessing in web mining

In web mining, the numbers of documents obtained are often very large. The need that considering the similarities among documents is practical. For example, users when reading news often do not want reading two piece of news having the same content. This task requires such mining schemes that are able to give the similarity scores among documents [as in莊見文 ‘s method]. Such schemes often consider the pairwise similarity score as the main feature, and they construct the similarity matrix which consist these scores.

However, due to large amount of documents, computing all pairwise similarity score of all document pairs in the text database seems to be a very heavy-loading work. In addition, the possibility that existing many document pairs that do not have any similarity is quite high; we should avoid doing the comparison tasks on such document pairs. This can be realized by using the processing step, in which such document do not have enough common terms is separated in different partitions. By clustering the set of documents into clusters, in which such documents sharing the same patterns on a set of common terms are assign to the same cluster. Now we just execute the comparison tasks on documents intra clusters, hence, the computation cost will be decreased significantly.

Consider the set of N documents, the time complexity of the original scheme is O(n) = N2. If this set is divided in to k clusters, in which clusters ith contains ni document. We have: [pic], the time complexity in this case is [pic]. Time complexity can be reduced is [pic].

An example of using the advance of clustering is illustrated in Figure 4 in which the set of document contains 20 documents. They are divided into three clusters containing 5, 7 and 8 documents, respectively. Total time complexity for computing the similarity scores is: 52 + 72 + 82, much less than 202 |d1 |d2 |d3 |d4 |d5 |d6 |d7 |d8 |d9 |d10 |d11 |d12 |d13 |d14 |d15 |d16 |d17 |d18 |d19 |d20 | |d1 | | | | | | | | | | | | | | | | | | | | | |d2 | | | | | | | | | | | | | | | | | | | | | |d3 | | | | | | | | | | | | | | | | | | | | | |d4 | | | | | | | | | | | | | | | | | | | | | |d5 | | | | | | | | | | | | | | | | | | | | | |d6 | | | | | | | | | | | | | | | | | | | | | |d7 | | | | | | | | | | | | | | | | | | | | | |d8 | | | | | | | | | | | | | | | | | | | | | |d9 | | | | | | | | | | | | | | | | | | | | | |d10 | | | | | | | | | | | | | | | | | | | | | |d11 | | | | | | | | | | | | | | | | | | | | | |d12 | | | | | | | | | | | | | | | | | | | | | |d13 | | | | | | | | | | | | | | | | | | | | | |d14 | | | | | | | | | | | | | | | | | | | | | |d15 | | | | | | | | | | | | | | | | | | | | | |d16 | | | | | | | | | | | | | | | | | | | | | |d17 | | | | | | | | | | | | | | | | | | | | | |d18 | | | | | | | | | | | | | | | | | | | | | |d19 | | | | | | | | | | | | | | | | | | | | | |d20 | | | | | | | | | | | | | | | | | | | | | |Figure 4. An example of time complexity in using clustering

In case of using clustering preprocessing, not only time complexity are reduced, much memory and storage space are also saved due to the reducing on dimensionality similarity matrix.

4. Evaluation method

For evaluating the performance of methods, I run the experiments for evaluate on two aspects: accuracy and running time.

With the first algorithms, I use two types of testing data sets: synthetic data and real data. The real data can be get from a bioinformatics database that contains sequences of gene strips.

With the second one, sets of data sets which mainly based on web data is used for testing. It is also used as a preprocessing step for other text mining algorithms. In that, each algorithm will be run within and without preprocessing phase using my clustering method, the contribution of proposed of my method will be seen when comparing the results of testing.

In order to compare new algorithms with existing ones, I will use the same data sets with them. For first approach, I will compare it with pCluster, SeqClus, and CPT [2]. As this approach mainly bases on [2], the accuracy should be considered carefully in order to make a conclusion that whether my heuristic on selecting the centroid is useful or not. For the second approach, it will be compare with existing cosine-function-based algorithms for text clustering, including K-means, Bi-secting K-means…

5. Conclusion

My two methods are based mainly on the process of [2]. In the first method, based on the assumption that a good centroid can improve clustering quality, and such object having many local neighbors might be a good centroid. I putted a preparation step into [2]. My expected contribution is the first method’s results can improve the accuracy and reduce the number of iterations of [2].

In the second method, I proposed a new definition of coherent patterns on text data based on the ratios of appearance frequency of terms in the documents. If this idea can be developed, I hope a more light weight clustering method for text documents will be innovated.

II. Lesson learned

1. Data preprocessing

a. Why?

- Data is not ready or incomplete for mining: missing value, error, noise, outliers…

- Data is not simply for mining: from many data sources, having many hierarchical levels of abstraction…

b. Data cleaning

- Missing value:

o Ignore the tuple

o Filling missing value by a common value.

o Filling missing value by a value computed from the existing values.

o Using mean of attribute for filling missing value

- Noisy data

o Binning: data is sorted and distributed into a number of buckets, or bin. Data will be smoothened by the neighborhood.

o Regression: Smoothing the data by regression, in which the value of an attribute can be predicted by one or some other attributes. o Clustering: Partitioning similar values into groups, as a result, the outliers can be removed.

c. Data integration and data transformation

- Data integration: combining data from many sources into a coherent data store, as in data warehousing.

- Data transformation:

o Smoothing: removing noise from the data. Such technique including regression, binning and clustering.

o Aggregation: using summary or aggregation operations on the data. For example, constructing a data cube for analysis of the data on multiple granularities.

o Generalization: replacing low-level (raw) data by higher-level one (in a hierarchical concept).

o Normalization: Data is scaled for fitting into a small specific range, for example 0.0 to 1.0.

d. Data reduction: for dealing with a huge dataset, sometimes, data reduction should be used reducing amount of tuples.

- Data cube aggregation

- Attribute subset selection: detecting and removing weakly relevant or redundant attribute.

- Dimensionality reduction

- Numerously reduction: the data is replaced or estimated by alternative, smaller representations.

- Discretization and concept hierarchy generation: raw data values are replaced by ranges of higher conceptual levels.

o Binning

o Histogram analysis: partitioning the values in an attribute into disjoin ranges called buckets.

o Entropy-based discretization (involve with classification).

o Cluster analysis:

2. Frequent pattern – Association rule

Mining frequent patterns in can be helpful in discovery association, correlations or hidden rules behind the data. Almost frequent pattern mining algorithms run on transaction database, where each transaction represents the appearances of items contained in it.

Some common algorithms and data structures for mining frequent pattern

• Apriori

Apriori is the most common algorithms. At the kth iteration, it form frequent k-itemsets based on frequent (k-1)-itemsets. It using downward closure property to prune out infrequent itemsets at each iteration: “All nonempty subset of a frequent itemset must also be frequent”

The main drawback of Apriori algorithm is multiple times of scanning database and a vast number of generated candidate itemsets.

• FP-growth – Frequent pattern tree

FP growth is an efficient algorithm for transforming the transaction database into a compressed tree structure which needs to scan database twice.

- Scan DB once to find frequent 1-itemsets

- Sort frequent item in descending order of frequencies.

- Scan DB second time for construct the FP-tree.

Mining the FP tree gives the results are the frequent itemsets. It uses a pattern fragment growth method to avoid the costly process of candidate generation and testing used by Apriori.

• FUFP tree – Fast Update FP Tree

In incremental transaction databases, there are a lot of changes in data: inserted, modified, deleted… FUFP tree is an effective structure based on the idea of FP tree which can be suitable with changes.

- What FUFP tree is: an advance algorithm lets FP tree to be updated easier and reduce the reconstruct time when new transactions are inserted.

- Differences between FP tree and FUFP tree: while FP tree’s nodes have single directional links, FUFP tree’s nodes have bi-directional links that allow more easily delete or insert nodes.

• P-tree – Pattern tree

Rather than scan database twice in FP-growth, Pattern tree just needs scan database only once for constructing, using the utility of Leaf Node Table (LNT).

• CHARM – Mining closed frequent itemsets with vertical transaction database

It enumerates closed sets using a dual itemset-tidset search tree, using an efficient hybrid search that skips many levels. A technique called diffsets is used to reduce the memory footprint of intermediate computations. Finally, it uses a fast hash-based approach to remove any “non-closed ” sets found during computation

• CLOSET+ : An efficient method for closed itemset mining.

Using divide and conquer and depth first search strategies and FP-tree as a compression structure. Local frequent itemsets are computed along with a hybrid-tree-projection, item merging, sub-itemset pruning methods and item skipping technique for further prune search space and speed up mining.

3. Classification & Prediction

• Classification

Partitioning the data into labeled classes, a model of classifier is constructed to predict categorical labels. These categories can be represented by discrete values.

• Prediction

Based on the existing data, a predictor model is constructed to predict a continuous-value-function. The predicted values can be numeric (continuous) values

• How classification and Prediction work?

Based on existing data (training data), a model is constructed, this model often contains a set of classification or prediction rule that can be learn from the training data. Running this model with the input is the data that need to be predicted or classified (testing data), we can obtain the results.

The model here can be a specific type of a machine learning methods such as Decision tree, Neural network… or a statistical analysis models such as Regression analysis, Naïve Bayesian Classification…

• Decision tree

A decision tree is a flowchart-like tree structure, where a nonleaf node denotes a test of an attribute, each branch represents an outcome of the test, and each leaf node hold a class label.


The construction of decision tree classifiers does not require any domain knowledge or parameter setting; therefore, it is appropriate for knowledge discovery.

• Attribute selection measures

Attribute selection measures is a heuristic for selecting the spitting criterion that best for portioning the given tuples into individual class.

- Information gain: the attribute having highest gain value should be chosen for splitting.


where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. Info(D) is also known as entropy of D



- Gain ratio: an extension of Information gain which attempts to overcome the bias. It applies a normalization to information gain using a “split information”



The attribute that has the highest GainRatio value should be selected for splitting attribute

• Naïve Bayesian Classification

Predicting a tuple X belongs to a class with the highest posterior probability, conditioned on X. That is, X belongs to class Ci if and only if


P(Ci|X) should be maximized. The class Ci which P(Ci|X) is maximized is called the maximum posteriori hypothesis.


• Support Vector Machine (SVM)

SVM can be used for both linear and nonlinear data. It uses a nonlinear mapping to transform the original data into a higher dimension. Within the new dimension, it searches for the linear optimal separating hyperplane. Data from two classes can always be separated by a hyperplane, in case of we have an appropriate mapping. The SVM find the hyperplane using support vectors.

• k-Nearest-Neighbors Classifiers

This method is a kind of Lazy Learner (Learning from the Neighbors), based on learning by analogy. It compare a given test tuple with training tuples that are similar to it. All the training tuples are stored in an n-dimensional pattern spaces. Given an unknown-tuple, the k-nearest-neighbors classifier searches the pattern spaces for getting the k-training tuples that are closest to the unknown tuple, called its k-nearest neighbors. Then the unknown tuple is assigned to the most common class among its k nearest neigbors.

Euclidean distance is often used as the similarity measurement between two tuples:


• Bagging and Boosting

To improve the accuracy of prediction or classification, bagging and boosting are used as ensemble methods. They are combinations of model. Each combines a series of k learned model (M1, M2, M3…Mk), with the aim of creating an improved composite model, M*.

- Bagging:

At each ith iteration, a set of Di is sampled from the training dataset. A model Mi is learned from Di. Given a tuple X, model Mi return its class prediction, which count as one vote. The bagged classifier M* count the votes and assign the class with the most votes to X.

- Boosting

Weights are assigned into each training tuple. A series of k classfiers is iteratively learned. The weights are updated after each iteration. The final boosted classifier M* combines the votes of each individual classifier, where the weight of each classifier’s votes is a function of its accuracy.

4. Clustering

Partitioning items into unlabeled class, in which maximizing similarity intra-cluster and minimize similarity inter-cluster. Almost clustering algorithms employ two measurement functions:

- Similarity measurement between two objects: Euclidean distance, Manhattan distance, Cosine function…

- Cluster quality measurement: measure the compactness of each cluster to decide whether a cluster should be spitted (or refined) or not.

In order to compute the similarity measurement among objects, items are often represented by a multidimensional vector. Thus, a set of objects will form a data matrix M.


Applying one of similarity function on the data matrix, we can obtain similarity/dissimilarity matrix D, where each value dij of D is a pairwise similarity/difference between to object i and j.


• K-means algorithm: an centroid-based technique

- First, randomly select k-centroids.

- Assign each object into the cluster that is most similar, based on the distance between of that object with cluster mean.

- After assign all objects into k clusters, update the cluster means and select new k centroids.

- Repeat the assignment and refinement step until no change

• Bisecting k-means:

- Instead of selecting k centroids at the initial phase, Bisecting k-means only select two centroids, the assignment task partitions the set of objects into two clusters based on the similarity value of each object with the centroids.

- The compactness function of each cluster will be compute to determine which cluster should be split. Such cluster that has the smallest compactness measurement will be select to be split into two new clusters.

- At each iteration, the compactness measurement will be recomputed and there is a cluster split. The splitting task will be repeated until k clusters obtained.

III. My own comments on this class

I completely agree with the class schedule, class assignments and lessons of this class. As a suggestion, some “hot application fields” of data mining should be introduced in more details in class, such as web mining, multimedia mining… as well as some available tools for them. From my view point of view, the most challenge tasks in data mining is the preprocessing phase, so, if possible, a work through example of this phase should be introduced for student, in order to give more visual understanding.


[1]. Congnan Luo – Yanjun Li – Soon M. Chung, Text Document Clustering Based on Neighbors, in: Data & Knowledge Engineering – Vol. 68 – Issue 11, November 2009 – pp 1271-1288.

[2]. Jihong Guan – Yanlan Gan – Hao Wang, Discovery pattern-based subspace cluster by pattern tree, in: Knowledge Based System – Vol 22 Issuse 8 – Dec 2009

[3]. Hans Peter Kriegel – Peer Kroger – Arthur Zimek, Clustering High-Dimensional Data: A survey on Subspace clustering, Pattern-based Clustering and Correlation Clustering, in: ACM Transaction on Knowledge Discovery from Data, Vol. 3, No. 1, Article 1, March 2009

[4]. Liping Jing - Michael K. Ng – Joshua Zhexue Huang, An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data, in: IEEE Transaction on Knowledge and Data Engineering, Vol. 19, No. 8, August 2007.

[5] L.Parsons - E.Haque - H.Liu, Subspace clustering for high dimensional data: a review, SIGKDD Explorations 6 (1) (2004) pp 90–105


Similar Documents

Premium Essay

Importance Of Data Mining In Learning

...Earlier, people suffered a lot for collecting data. It was very difficult to get data, so people suggested various method, for collecting data, storing data and managing data, such as database management system (DBMS). But in today’s scenario, data is overwhelmingly enormous, DBMS is not compatible to handle such size of data. So we have situation where the data rich but information is poor. “If you don’t measure it, you can’t improve it”. This is a quote by Peter Drucker, a renowned management guru. In this new era, data collection is not as hard as earlier day. Data can be collected easily by various ways, such as, censor, bar code scanner, web browser, online survey, and many more. Business entity sitting on large amount of data wondering,...

Words: 1509 - Pages: 7

Premium Essay

Cis 500 Complete Class Assignments and Term Paper

...CIS 500 Complete ClasCIS 500 Complete Class Assignments and Term Paper Click link Below To Download Entire Class: CIS 500 Complete Class Assignments and Term Paper CIS 500 Assignment 1 Predictive Policing CIS 500 Assignment 2: 4G Wireless Networks CIS 500 Assignment 3 Mobile Computing and Social Networking CIS 500 Assignment 4 Data Mining CIS 500 Term Paper Mobile Computing and Social Networks CIS 500 Assignment 1 Predictive Policing Click link Below To Download: In 1994, the New York City Police Department adopted a law enforcement crime fighting strategy known as COMPSTAT (COMPuter STATistics). COMPSTAT uses Geographic Information Systems (GIS) to map the locations of where crimes occur, identify “hotspots”, and map problem areas. COMPSTAT has amassed a wealth of historical crime data. Mathematicians have designed and developed algorithms that run against the historical data to predict future crimes for police departments. This is known as predictive policing. Predictive policing has led to a drop in burglaries, automobile thefts, and other crimes in some cities. Write a four to five (45) page paper in which you Compare and contrast the application of information technology (IT) to optimize police departments’ performance to reduce crime versus random patrols of the streets...

Words: 2044 - Pages: 9

Free Essay

Social Media White Paper

...Social Media Data: Network Analytics meets Text Mining Killian Thiel Tobias Kötter Dr. Michael Berthold Dr. Rosaria Silipo Phil Winters Copyright © 2012 by AG all rights reserved Revision: 120403F page 1 Table of Contents Creating Usable Customer Intelligence from Social Media Data: Network Analytics meets Text Mining............................................................................................................................................ 1 Summary: “Water water everywhere and not a drop to drink” ............................................................ 3 Social Media Channel-Reporting Tools. .................................................................................................. 3 Social Media Scorecards .......................................................................................................................... 4 Predictive Analytic Techniques ............................................................................................................... 4 The Case Study: A Major European Telco. ............................................................................................. 5 Public Social Media Data: Slashdot ......................................................................................................... 6 Text Mining the Slashdot Data ..........

Words: 5930 - Pages: 24

Premium Essay

Vidoe Mining

...1 Video Data Mining JungHwan Oh University of Texas at Arlington, USA JeongKyu Lee University of Texas at Arlington, USA Sae Hwang University of Texas at Arlington, USA 8 INTRODUCTION Data mining, which is defined as the process of extracting previously unknown knowledge and detecting interesting patterns from a massive set of data, has been an active research area. As a result, several commercial products and research prototypes are available nowadays. However, most of these studies have focused on corporate data — typically in an alpha-numeric database, and relatively less work has been pursued for the mining of multimedia data (Zaïane, Han, & Zhu, 2000). Digital multimedia differs from previous forms of combined media in that the bits representing texts, images, audios, and videos can be treated as data by computer programs (Simoff, Djeraba, & Zaïane, 2002). One facet of these diverse data in terms of underlying models and formats is that they are synchronized and integrated hence, can be treated as integrated data records. The collection of such integral data records constitutes a multimedia data set. The challenge of extracting meaningful patterns from such data sets has lead to research and development in the area of multimedia data mining. This is a challenging field due to the non-structured nature of multimedia data. Such ubiquitous data is required in many applications such as financial, medical, advertising and Command, Control, Communications and Intelligence...

Words: 3477 - Pages: 14

Premium Essay

Kdd Review

...Similarity based Analysis of Networks of Ultra Low Resolution Sensors Relevance: Pervasive computing, temporal analysis to discover behaviour Method: MDS, Co-occurrence, HMMs, Agglomerative Clustering, Similarity Analysis Organization: MERL Published: July 2006, Pattern Recognition 39(10) Special Issue on Similarity Based Pattern Recognition Summary: Unsupervised discovery of structure from activations of very low resolution ambient sensors. Methods for discovering location geometry from movement patterns and behavior in an elevator scheduling scenario The context of this work is ambient sensing with a large number of simple sensors (1 bit per second giving on-off info). Two tasks are addressed. Discovering location geometry from patterns of sensor activations. And clustering activation sequences. For the former, a similarity metric is devised that measures the expected time of activation of one sensor after another has been activated, on the assumption that the two activations are resulting from movement. The time is used as a measure of distance between the sensors, and MDS is used to arrive at a geometric distribution. In the second part, the observation sequences are clustered by training HMMs for each sequence, and using agglomerative clustering. Having selected an appropriate number of clusters (chosen by the domain expert) the clusters can be used to train new HMM models. The straightforward mapping of the cluster HMMs is to a composite HMM, where each branch of...

Words: 2170 - Pages: 9

Premium Essay

Data Minig

...512 Use of Data Mining in the field of Library and Information Science : An Overview Roopesh K Dwivedi Abstract Data Mining refers to the extraction or “Mining” knowledge from large amount of data or Data Warehouse. To do this extraction data mining combines artificial intelligence, statistical analysis and database management systems to attempt to pull knowledge form stored data. This paper gives an overview of this new emerging technology which provides a road map to the next generation of library. And at the end it is explored that how data mining can be effectively and efficiently used in the field of library and information science and its direct and indirect impact on library administration and services. R P Bajpai Keywords : Data Mining, Data Warehouse, OLAP, KDD, e-Library 0. Introduction An area of research that has seen a recent surge in commercial development is data mining, or knowledge discovery in databases (KDD). Knowledge discovery has been defined as “the non-trivial extraction of implicit, previously unknown, and potentially useful information from data” [1]. To do this extraction data mining combines many different technologies. In addition to artificial intelligence, statistics, and database management system, technologies include data warehousing and on-line analytical processing (OLAP), human computer interaction and data visualization; machine learning (especially inductive learning techniques), knowledge representation, pattern recognition...

Words: 3471 - Pages: 14

Premium Essay

Data Minig Data mining techniques for customer relationship management Chris Rygielski a, Jyun-Cheng Wang b, David C. Yen a,∗ a Department of DSC & MIS, Miami University, Oxford, OH, USA b Department of Information Management, National Chung-Cheng University, Taiwan, ROC Abstract Advancements in technology have made relationship marketing a reality in recent years. Technologies such as data warehousing, data mining, and campaign management software have made customer relationship management a new area where firms can gain a competitive advantage. Particularly through data mining—the extraction of hidden predictive information from large databases—organizations can identify valuable customers, predict future behaviors, and enable firms to make proactive, knowledge-driven decisions. The automated, future-ori- ented analyses made possible by data mining move beyond the analyses of past events typically provided by history-oriented tools such as decision support systems. Data mining tools answer business questions that in the past were too time-consuming to pursue. Yet, it is the answers to these questions make customer relationship management possible. Various techniques exist among data mining software, each with their own advantages and challenges for different types of applications. A particular dichotomy exists between neural networks and chi-square automated interaction detection (CHAID). While differing approaches abound in the realm of data mining, the...

Words: 8031 - Pages: 33

Premium Essay

An Evolution of Computer Science Research

...studies that examine trends and emerging topics in CS research or the impact of papers on the field. In contrast, in this article, we take a closer look at the entire CS research in the past two decades by analyzing the data on publications in the ACM Digital Library and IEEE Xplore, and the grants awarded by the National Science Foundation (NSF). We identify trends, bursty topics, and interesting inter-relationships between NSF awards and CS publications, finding, for example, that if an uncommonly high frequency of a specific topic is observed in publications, the funding for this topic is usually increased. We also analyze CS researchers and communities, finding that only a small fraction of authors attribute their work to the same research area for a long period of time, reflecting for instance the emphasis on novelty (use of new keywords) and typical academic research teams (with core faculty and more rapid turnover of students and postdocs). Finally, our work highlights the dynamic research landscape in CS, with its focus constantly moving to new challenges arising from new technological developments. Computer science is atypical science in that its universe evolves quickly, with a speed that is unprecedented even for engineers. Naturally, researchers follow the evolution of their artifacts by adjusting their research interests. We want to capture this vibrant co-evolution in this paper. 1 Introduction...

Words: 15250 - Pages: 61

Premium Essay

Carbon Tax Mining

...Faculty of Science and Engineering Department of Mining and Engineering and Mine Surveying Western Australia School of Mines 12585 - Mine Planning 532 Research Paper 1 – Mine Planning Process and the Carbon Tax Due Date : Friday 19-8-2011 Word Count: 2470 Abstract On 15 December 2008, the Federal Government launched its 2020 target for greenhouse gas emissions and its White Paper on the Carbon Pollution Reduction Scheme (CPRS) as the start of the policy and legislation process. The mining sector in Australia has been cited as being a major contributor to greenhouse gases. The introduction of the CPRS means carbon emissions of a mining project should be considered from the initial stages of mine planning. The traditional approach to mine planning involves consideration of technical and economic data as inputs to the process. This paper considers the effect of the CPRS on various technical and economic factors related to the mine planning process. The results of this paper imply that the introduction of the CPRS makes it is imperative for mining companies to assess the impact of carbon emissions on a mining project during mine planning. Introduction Climate change has become an increasingly topical issue in recent times. Mounting scientific evidence suggests that human activities are causing a buildup of greenhouse gases and that this in turn is causing changes to the world’s climate (Gregorczu, 1999). Further complicating the issue, there are economic costs, scientific...

Words: 2496 - Pages: 10

Premium Essay

Plant Location

...Data Mining Data Mining THE BUSINESS SCHOOL, KASHMIR UNIVERSITY 5/18/2014 THE BUSINESS SCHOOL, KASHMIR UNIVERSITY 5/18/2014 Umer Rashid Roll No 55 Umer Rashid Roll No 55 Abstract: Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. CRM: In today’s competitive scenario in corporate world, “Customer Retention” strategy in Customer Relationship Management (CRM) is an increasingly pressed issue. Data mining techniques play a vital role in better CRM. This paper attempts to bring a new perspective by focusing the issue of data mining Applications, opportunities and challenges in CRM. It covers the topic such as customer retention, customer services, risk assessment, fraud detection and some of the data mining tools which are widely used in CRM. Supply Chain Management (SCM) environments are often dynamic markets providing a plethora of Information, either complete or incomplete. It is, therefore, evident that such environments demand...

Words: 2588 - Pages: 11

Premium Essay

It and Its Scope

...UNIVERSITY OF MUMBAI Bachelor of Engineering Information Technology (Third Year – Sem. V & VI) Revised course (REV- 2012) from Academic Year 2014 -15 Under FACULTY OF TECHNOLOGY (As per Semester Based Credit and Grading System) University of Mumbai, Information Technology (semester V and VI) (Rev-2012) Page 1 Preamble To meet the challenge of ensuring excellence in engineering education, the issue of quality needs to be addressed, debated and taken forward in a systematic manner. Accreditation is the principal means of quality assurance in higher education. The major emphasis of accreditation process is to measure the outcomes of the program that is being accredited. In line with this Faculty of Technology of University of Mumbai has taken a lead in incorporating philosophy of outcome based education in the process of curriculum development. Faculty of Technology, University of Mumbai, in one of its meeting unanimously resolved that, each Board of Studies shall prepare some Program Educational Objectives (PEO‟s) and give freedom to affiliated Institutes to add few (PEO‟s) and course objectives and course outcomes to be clearly defined for each course, so that all faculty members in affiliated institutes understand the depth and approach of course to be taught, which will enhance learner‟s learning process. It was also resolved that, maximum senior faculty from colleges and experts from industry to be involved while revising the curriculum. I am happy to state...

Words: 10444 - Pages: 42

Premium Essay

Text Mining Research Paper

...Text mining is the process of extracting interesting and non-trivial knowledge or information from unstructured text data. Text mining is the multidisciplinary field which draws on data mining, machine learning, information retrieval, computational linguistics and statistics. This research paper discussed about one of the text mining preprocessing techniques. The initial process of text mining systems is preprocessing steps. Pre-processing reduces the size of the input text documents significantly. It involves the actions like sentence boundary determination, natural language specific stop-word elimination, tokenization and stemming. This research paper established the comparative analysis of document tokenization tools. I. Introduction Tokenization...

Words: 1209 - Pages: 5

Premium Essay

Taking Advantage of Data Mining

...Mon 6pm Data Mining As smartphones became more advanced and second nature in our everyday lives the opportunity to be apart of this new technology began to open doors for many people such as software developers, manufacturing of parts and accessories, and jobs to market and sell smartphones. The one I considered the most to me was repairing them. If you have ever shattered the screen of your smartphone, the experience of having it repaired in a quick allotted time can be painstaking. Not so often but once in a while, repair shops may not have a particular model because either they are not aware of it being sold as much in a particular region or more often the repairer does not want to take the chance in ordering overstock. These particular circumstances led me to ask can data mining be used to collect a census of how many people request a certain model smartphone to be repaired in a particular area? The purpose of collecting data would be to determine the amount of material that should be purchased each month to produce a faster turn around time for the customer and further results to less overspending. Data mining is the analysis of  (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. In a more understandable term data mining can be used to observe the relationship between two items to see their correlation between each other. For instance data can be collected...

Words: 1068 - Pages: 5

Free Essay

Honesty and Economics

...someone else’s work, seems rare; I can recall reading about only three confirmed cases of it in the almost sixty years I have spent as an economist. Both the risk of exposure and feelings of conscience provide plausible explanations for this scarcity. Soft plagiarism in the sense of making unacknowledged THOMAS MAYER is professor emeritus of economics at University of California–Davis. A more detailed working-paper version of this article is available at Challenge, vol. 52, no. 4, July/August 2009, pp. 16–24. © 2009 M.E. Sharpe, Inc. All rights reserved. ISSN 0577–5132 / 2009 $9.50 + 0.00. DOI: 10.2753/0577–5132520402 16 Challenge/July–August 2009 Honesty and Integrity in Academic Economics use of someone else’s ideas is probably much more common, both because it is considerably less likely to be detected (and the punishment, if the act is indeed detected, is apt to be much less severe) and because it is less heinous, so one’s conscience will protest less. Another temptation is to produce a publishable paper through outright cheating by making up data out of thin air or by lying about the results of analyzing legitimate...

Words: 3023 - Pages: 13

Premium Essay

Business Intelligence

...AND ANALYTICS: FROM BIG DATA TO BIG IMPACT Hsinchun Chen Eller College of Management, University of Arizona, Tucson, AZ 85721 U.S.A. {} Roger H. L. Chiang Carl H. Lindner College of Business, University of Cincinnati, Cincinnati, OH 45221-0211 U.S.A. {} Veda C. Storey J. Mack Robinson College of Business, Georgia State University, Atlanta, GA 30302-4015 U.S.A. {} Business intelligence and analytics (BI&A) has emerged as an important area of study for both practitioners and researchers, reflecting the magnitude and impact of data-related problems to be solved in contemporary business organizations. This introduction to the MIS Quarterly Special Issue on Business Intelligence Research first provides a framework that identifies the evolution, applications, and emerging research areas of BI&A. BI&A 1.0, BI&A 2.0, and BI&A 3.0 are defined and described in terms of their key characteristics and capabilities. Current research in BI&A is analyzed and challenges and opportunities associated with BI&A research and education are identified. We also report a bibliometric study of critical BI&A publications, researchers, and research topics based on more than a decade of related academic and industry publications. Finally, the six articles that comprise this special issue are introduced and characterized in terms of the proposed BI&A research framework. Keywords: Business intelligence and analytics, big data analytics, Web 2.0 Introduction...

Words: 16335 - Pages: 66