Correlation Based Dynamic Clustering and Hash Based Retrieval for Large Datasets

ABSTRACT

Automated information retrieval systems are used to reduce the overload of document retrieval, so there is a need for an efficient method of storage and retrieval. This project proposes a dynamic clustering mechanism for organizing and storing the dataset according to concept-based clustering. A hashing technique is also used to retrieve data from the dataset based on association rules. Related documents are grouped into the same cluster by the k-means clustering algorithm. From each cluster, important sentences are extracted by concept matching and by sentence feature score. Experiments are carried out to compare the performance of the proposed work with existing techniques, using scientific articles and news tracks as the data set. From the analysis it is inferred that the proposed technique gives better results for documents containing scientific terms.

Keywords

Document clustering, concept extraction, k-means algorithm, hash-based indexing, performance evaluation

1. INTRODUCTION

Nowadays, online submission of documents has increased widely, which means that a large number of documents accumulate dynamically for a particular domain. Information retrieval [1] is the process of searching for information within documents. An information retrieval process begins when a user enters a query; queries are formal statements of information needs, for example search strings in a web search engine. In the process of information retrieval, several objects may be retrieved for a single query with different degrees of relevancy. Hence the user has to visit each and every page for the required information, which is a time-consuming process. To address this issue, document clustering is introduced.

Most of the existing concept-based works are oriented towards synonyms and hypernyms using the WordNet lexical database [3], which reduces performance because the dataset contains only a minimum number of hypernyms. Hence the proposed technique considers terms and related terms as concepts. This paper focuses on the correlation of concepts: the traditional vector space model is replaced by feature vectors that comprise concepts. The top terms calculated with TF-IDF [4] are added to the vector along with their related terms, as is done for synonyms and hypernyms extraction. The documents are clustered based on the correlated concepts of the documents. The performance of the proposed technique is compared with the existing term-based and synonym-based summarization techniques, considering Precision, Recall and F-measure as metrics and taking scientific literature and newsgroups as the data set.
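To make the TF-IDF step concrete, a minimal sketch is given below. The function name and the top-k selection are illustrative, and the exact weighting variant used in [4] may differ.

```python
import math
from collections import Counter

def tfidf_top_terms(documents, k=5):
    """Score each term by TF-IDF and return the top-k terms per document.

    documents: list of token lists (already pre-processed).
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    top_terms = []
    for doc in documents:
        tf = Counter(doc)
        # TF-IDF: relative term frequency times inverse document frequency.
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        ranked = sorted(scores, key=scores.get, reverse=True)
        top_terms.append(ranked[:k])
    return top_terms
```

A term that is frequent in one document but rare across the collection scores highest, which is why such terms are good seed candidates for the concept vector.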

The following section discusses the related work, and Section 3 gives a detailed description of the proposed work for concept extraction and summarization. Section 4 gives the experimental results. Section 5 concludes the paper and discusses future enhancements.

2. RELATED WORK

Aditi Sharan, et al. [5] proposed semantic-based document clustering using the WordNet ontology. The main aim is to replace words with a possible concept. This technique takes the nouns from all the documents, forming a master noun list. The depth of each word is calculated by weighing the words. Then all possible combinations of words are created, and the pairs below the threshold are deleted from the pair list. A semantic similarity measure is used to find the maximum similarity so that the term can be replaced with the concept, and the documents are clustered based on the extracted concepts. However, the experimental results show that it does not consider all possible conditions.

Anna Huang, et al. [6] proposed a document clustering technique based on concept extraction using semantic relations. This work computes a similarity measure between the terms instead of considering the overlap between the terms as in the previous work. The process is achieved in three steps: identifying candidate phrases in the document and mapping them to anchor text in Wikipedia; disambiguating anchors that relate to multiple concepts; and pruning the list of concepts to filter out those that do not relate to the document's central thread.

Shady Shehata, et al. [7] proposed a concept-based mining model which comprises two steps. First, concept-based term analysis captures the sentence concept by analyzing the semantic structure of the sentence; this concept is analyzed at the sentence level and also at the document level. Second, the importance of the sentence is measured by considering the semantics and the topic of the sentence. This model produces a notable improvement.

Hilda Hardy, et al. [9] proposed a new concept-based clustering and summarization system. This system clusters the documents to subdivide them into relevant topics and themes, and also provides the user with two types of summary: (1) one covering the complexity of the document set in detail and (2) one with fewer details and limited length. The system clusters passages or sequences of text instead of entire documents. The similarity is based on n-grams rather than mere term overlap, which increases the clustering efficiency.

3. PROPOSED SYSTEM

The proposed model is based on a correlated-concept-oriented model; considering terms and related terms as correlated concepts for clustering improves the efficiency.

The proposal of considering terms and related terms as concepts [11] based on semantic similarity has been carried out for extracting topics from documents. These concepts are analyzed at the sentence and document levels and used in document clustering and topic discovery. Cosine similarity is the similarity measure used for both the term-based bisecting k-means algorithm and the concept-based bisecting k-means algorithm. The results obtained were promising compared with term-only and synonym-and-hypernym-based clustering. The proposed technique takes this initiative of considering terms and related terms as concepts for clustering the documents, and also computes a modified concept-based feature score for sentence extraction similar to the existing term-based features [15].

Figure 1. Proposed System Architecture for Clustering based on Correlated Concepts

Figure 1 shows the overall architecture of the proposed clustering process. Each module is described in detail below:

3.1 Pre-Processing

During the first phase, documents from various sources are collected and stored in a database. The documents are then extracted from the database and preprocessed by removing stop words and applying a stemming algorithm. The steps involved in preprocessing are:

* Sentence decomposition
* Given a set of documents SD, where i = 1, ..., N and N is the total number of documents
* Select any ith document from SD
* Split the selected ith document into sentences

Remove stop words

* Get the input file containing the English stop words
* Match the decomposed sentences and remove the stop words

Perform stemming

* Construct the root word using the Porter stemmer algorithm [18]
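The three preprocessing steps can be sketched as follows. The stop-word list here is abbreviated (a full English list would be read from file), and the suffix-stripping function is a crude stand-in for the Porter stemmer [18].

```python
import re

# Abbreviated stop-word list; a full English list would be loaded from file.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}

def simple_stem(word):
    """Crude suffix stripping as a stand-in for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Sentence decomposition -> stop-word removal -> stemming."""
    sentences = re.split(r"[.!?]+", document)
    result = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        tokens = [simple_stem(t) for t in tokens if t not in STOP_WORDS]
        if tokens:
            result.append(tokens)
    return result
```

The output is one token list per sentence, which is the form the concept extraction step below works on.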

3.2 Concept Extraction Algorithm

As discussed, extracting synonyms or hypernyms as concepts does not give efficient results in the case of the scientific literature and newsgroup datasets because of the scientific terms involved. Concept extraction is based on our previous work [11], where correlated concepts are simply the terms and their related terms. Taking "share market" as a term in the news documents, the related terms are share, shareholder, money and market. The documents containing these words are grouped together under "share market", which forms the cluster.

The following steps show the process of concept vector construction:

* For each document doci in SD:

* Create two empty lists, one for terms and one for related terms

* Calculate the term frequency for each term in the document

* Compute the conceptual term frequency for the term

* Calculate weight = term frequency + concept frequency

* Add the weight to the term list

* Sort the weights in descending order and add the maximum-weight terms to the terms list

* Extract the related terms for the terms in the terms list based on the proposed concept extraction algorithm

* Add the terms and related terms to the concept list

* Calculate the concept weight.
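The steps above can be sketched as follows. Two points are assumptions made for illustration: conceptual term frequency is read here as the number of sentences containing the term, and related terms receive half the weight of their parent term. The related-terms mapping is supplied as a plain dictionary standing in for the concept extraction algorithm.

```python
from collections import Counter

def build_concept_vector(sentences, related_terms, top_n=3):
    """Sketch of the concept-vector construction steps.

    sentences: list of token lists for one document.
    related_terms: mapping term -> list of related terms (assumed to come
    from the concept extraction algorithm; a plain dict here).
    """
    # Term frequency over the whole document.
    tf = Counter(t for sent in sentences for t in sent)
    # Conceptual term frequency: number of sentences the term appears in
    # (one simple sentence-level reading of "concept frequency").
    ctf = Counter(t for sent in sentences for t in set(sent))
    # weight = term frequency + concept frequency, as in the steps above.
    weight = {t: tf[t] + ctf[t] for t in tf}
    terms = sorted(weight, key=weight.get, reverse=True)[:top_n]
    # Concept list = top terms plus their related terms.
    concepts = {}
    for term in terms:
        concepts[term] = weight[term]
        for rel in related_terms.get(term, []):
            # Assumed discount: related terms get half the parent weight.
            concepts[rel] = concepts.get(rel, 0) + weight[term] / 2
    return concepts
```

The returned dictionary is the concept vector used by the similarity measure and the clustering step.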

3.3 Similarity Measure

Most of the existing techniques report that a semantic-based similarity measure [10] does not suit a centroid-based clustering algorithm. Since the proposed technique uses the bisecting k-means algorithm, which is a centroid-based clustering algorithm, the similarity measure used here is cosine similarity. The formula for calculating cosine similarity [12] is given below.
cos(d1, d2) = (d1 . d2) / (|d1| |d2|)    (1)

where dot (.) represents the vector dot product and |d| represents the length of vector d. The centroid vector c for a given set S is defined as

c = (1 / |S|) * Σ(d ∈ S) d    (2)

which gives the average of all term weights in the set S. The similarity between each document vector and the centroid vector is measured using cosine similarity:

cos(d, c) = (d . c) / (|d| |c|)    (3)
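The cosine similarity and centroid computations above can be sketched over sparse term-weight vectors (dictionaries mapping term to weight):

```python
import math

def cosine_similarity(d1, d2):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def centroid(vectors):
    """Centroid of a set of sparse vectors: average weight per term."""
    c = {}
    for v in vectors:
        for t, w in v.items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in c.items()}
```

Documents with no shared terms score 0, identical directions score 1, which is why cosine similarity is a natural fit for centroid-based assignment.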

3.4 Clustering Algorithm

The extracted concepts are clustered by the induced bisecting k-means algorithm [13]. The basic bisecting k-means algorithm [14] starts by selecting the two elements with the largest distance as seed clusters, and the other items are assigned to the closest seed. The centers of these two seeds are then recalculated as the weighted sum of all items assigned to them, and these centers are used to find the new seeds. This process is repeated until the two seeds meet the predefined precision. If the cluster size is larger than the predefined threshold, the entire process is repeated on that cluster, which forms a binary tree.

The clustering process is stated as:

* Given a high dimensional concept vector

* Generate concepts for clustering (terms and related terms)

* Construct the initial clusters based on concepts (terms and related terms)

* Make the cluster disjoint in order to identify the best initial cluster and keep the document only in that cluster by calculating the goodness score

* Build the cluster

* Apply the child pruning and sibling merging algorithm to merge the similar cluster

* Apply the resolutions technique to make the summary more efficient
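A minimal sketch of the basic bisecting k-means procedure described above, seeding each split with the two farthest-apart elements. The convergence threshold, goodness score, and pruning/merging details of the induced variant [13] are omitted; this uses plain Euclidean distance on dense vectors for brevity.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(xs) / len(xs) for xs in zip(*points))

def two_means(points, iters=10):
    """Split points into two groups, seeding with the farthest-apart pair."""
    seeds = max(
        ((a, b) for i, a in enumerate(points) for b in points[i + 1:]),
        key=lambda pair: dist(*pair),
    )
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            groups[0 if dist(p, seeds[0]) <= dist(p, seeds[1]) else 1].append(p)
        new_seeds = tuple(mean(g) if g else s for g, s in zip(groups, seeds))
        if new_seeds == seeds:  # seeds converged
            break
        seeds = new_seeds
    return [g for g in groups if g]

def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters are formed."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        clusters.remove(largest)
        clusters.extend(two_means(largest))
    return clusters
```

Each bisection step splits only the largest remaining cluster, which is what produces the binary tree of clusters mentioned above.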

3.5 Retrieval by Hashing

Hash-based indexing is a promising new technology for text-based information retrieval; it provides an efficient and reliable means to tackle different retrieval tasks. We identified three major classes of tasks in which hash-based indexes are applicable, namely grouping, similarity search, and classification. Two quite different construction principles for hash-based indexes were introduced, originating from fuzzy-fingerprinting and locality-sensitive hashing respectively. An analysis of both hashing approaches was conducted to demonstrate their applicability for the near-duplicate detection task and the similarity search task, and to compare them in terms of precision and recall. The results of our experiments reveal that fuzzy-fingerprinting outperforms locality-sensitive hashing in the task of near-duplicate detection regarding both precision and recall. Within the similarity search task, fuzzy-fingerprinting achieves a clearly higher precision compared to locality-sensitive hashing, while only a slight advantage in terms of recall was observed.

Despite our restriction to the domain of text-based information retrieval, we emphasize that the presented ideas and algorithms are applicable to retrieval problems in a variety of other domains.

Indeed, locality-sensitive hashing was designed to handle various kinds of high-dimensional vector-based object representations. The principles of fuzzy-fingerprinting can also be applied to other domains of interest, provided that the objects of the domain can be characterized with a small set of discriminative features.
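The construction details of the hash indexes are not given in this section; as an illustration, below is a standard random-hyperplane locality-sensitive hashing sketch for cosine similarity, with illustrative parameter choices (number of planes, dimension, seed).

```python
import random

def hyperplane_signature(vector, planes):
    """Bit signature: which side of each random hyperplane the vector lies on."""
    return tuple(
        1 if sum(v * p for v, p in zip(vector, plane)) >= 0 else 0
        for plane in planes
    )

def build_lsh_index(vectors, n_planes=8, dim=3, seed=42):
    """Group vectors into hash buckets by their hyperplane signature.

    Vectors with a similar direction (high cosine similarity) tend to land
    in the same bucket, so a query only compares against one bucket
    instead of scanning the whole collection.
    """
    rng = random.Random(seed)
    planes = [
        tuple(rng.gauss(0, 1) for _ in range(dim)) for _ in range(n_planes)
    ]
    index = {}
    for i, v in enumerate(vectors):
        index.setdefault(hyperplane_signature(v, planes), []).append(i)
    return index, planes
```

At query time, the query vector is hashed with the same planes and only its bucket's members are ranked, which is what makes hash-based retrieval efficient on large datasets.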

Our current research aims at the theoretical analysis of similarity hash functions and the utilization of the gained insights in practical applications. We want to quantify the relation between the determinants of fuzzy-fingerprinting and the achieved retrieval performance in order to construct optimized hash indexes for special-purpose retrieval tasks. Finally, we apply fuzzy-fingerprinting as a key technology in our tools for text-based plagiarism analysis.

4. SUMMARY
Our system takes less computational time, since it computes the centroid only for the extracted concepts and considers similarity only between concepts, whereas other clustering algorithms compute the centroid for the whole data set, which is time consuming. With respect to feature calculation, our proposed algorithm computes scores only for sentences with more concept words rather than for all the sentences in the document, which reduces the time taken for sentence feature computation.

6. REFERENCES

[1] Guy Aston and Lou Burnard. The BNC Handbook. http://www.natcorp.ox.ac.uk, 1998.

[2] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.

[3] Mayank Bawa, Tyson Condie, and Prasanna Ganesan. LSH Forest: Self-Tuning Indexes for Similarity Search. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 651–660, New York, NY, USA, 2005. ACM Press.

[4] Andrei Z. Broder. Identifying and filtering near-duplicate documents. In COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, pages 1–10, London, UK, 2000. Springer-Verlag.

Figure 2. Precision and recall versus similarity for fuzzy-fingerprinting (FF) and locality-sensitive hashing (LSH).

5. CONCLUSION

Multi-document clustering and retrieval are in high demand in today's world because of the voluminous information available. Information is available in various formats from various sources. Gathering all the information in a short period is a tiresome task, and the user also wants the information to be precise and quickly readable. We have proposed a summarization and redundancy elimination technique based on correlated concepts. Our new approach improves the quality of the summary by incorporating concept-based clustering, summarization and redundancy elimination techniques. Concepts are based on terms and related terms. The summary is created based on concept and sentence-based features. The proposed technique gives a quality summary since the redundancy elimination is based on correlated concepts. The system could be enhanced to create

