Correlation Based Dynamic Clustering and Hash Based Retrieval for Large Datasets

ABSTRACT

Automated information retrieval systems are used to reduce the overload of document retrieval, so there is a need for an efficient method of storage and retrieval. This project proposes a dynamic clustering mechanism for organizing and storing the dataset according to concept-based clustering. A hashing technique is also used to retrieve data from the dataset based on association rules. Related documents are grouped into the same cluster by the k-means clustering algorithm. From each cluster, important sentences are extracted by concept matching and by sentence feature score. Experiments are carried out to compare the performance of the proposed work with existing techniques, using scientific articles and news tracks as the data set. From the analysis it is inferred that the proposed technique gives better results for documents containing scientific terms.

Keywords

Document clustering, concept extraction, k-means algorithm, hash-based indexing, performance evaluation

1. INTRODUCTION

Nowadays, online submission of documents has increased widely, which means that a large number of documents accumulate dynamically for a particular domain. Information retrieval [1] is the process of searching for information within documents. An information retrieval process begins when a user enters a query; queries are formal statements of information needs, for example search strings in a web search engine. In the process of information retrieval, several objects may be retrieved for a single query with different degrees of relevancy. Hence the user has to visit each and every page for the required information, which is a time-consuming process. To address this issue, document clustering is introduced.

Most of the existing concept-based works are oriented towards synonyms and hypernyms using the WordNet lexical database [3], which reduces performance because the dataset contains only a minimum number of hypernyms. Hence the proposed technique considers terms and related terms as concepts. This paper focuses on the correlation of concepts: the traditional vector space model is replaced by feature vectors that comprise concepts. The top terms calculated with TF-IDF [4] are added to the vector along with their related terms, as is done for synonyms and hypernyms extraction. The documents are clustered based on the correlated concepts of the documents. The performance of the proposed technique is compared with the existing term-based and synonym-based summarization techniques, considering Precision, Recall and F-measure as metrics and taking scientific literature and newsgroups as the data set.
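To make the TF-IDF step concrete, a minimal sketch is given below. The function name and the top-k selection are illustrative, and the exact weighting variant used in [4] may differ.

```python
import math
from collections import Counter

def tfidf_top_terms(documents, k=5):
    """Score each term by TF-IDF and return the top-k terms per document.

    documents: list of token lists (already pre-processed).
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    top_terms = []
    for doc in documents:
        tf = Counter(doc)
        # TF-IDF: relative term frequency times inverse document frequency.
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        ranked = sorted(scores, key=scores.get, reverse=True)
        top_terms.append(ranked[:k])
    return top_terms
```

A term that is frequent in one document but rare across the collection scores highest, which is why such terms are good seed candidates for the concept vector.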

The following section discusses the related work, and Section 3 gives a detailed description of the proposed work for concept extraction and summarization. Section 4 gives the experimental results. Section 5 concludes the paper and discusses future enhancements.

2. RELATED WORK

Aditi Sharan, et al. [5] proposed semantic-based document clustering using the WordNet ontology. The main aim is to replace words with a possible concept. This technique takes the nouns from all the documents, forming a master noun list. The depth of each word is calculated by weighing the words. Then all possible combinations of words are created, and the pairs below the threshold are deleted from the pair list. A semantic similarity measure is used to find the maximum similarity so that the term can be replaced with the concept, and the documents are clustered based on the extracted concepts. However, the experimental results show that it does not consider all possible conditions.

Anna Huang, et al. [6] proposed a document clustering technique based on concept extraction using semantic relations. This work computes a similarity measure between the terms instead of considering the overlap between the terms as in the previous work. The process is achieved in three steps: identifying candidate phrases in the document and mapping them to anchor text in Wikipedia; disambiguating anchors that relate to multiple concepts; and pruning the list of concepts to filter out those that do not relate to the document's central thread.

Shady Shehata, et al. [7] proposed a concept-based mining model which comprises two steps. First, concept-based term analysis captures the sentence concept by analyzing the semantic structure of the sentence; this concept is analyzed at the sentence level and also at the document level. Second, the importance of the sentence is measured by considering the semantics and the topic of the sentence. This model produces a notable improvement.

Hilda Hardy, et al. [9] proposed a new concept-based clustering and summarization system. This system clusters the documents to subdivide them into relevant topics and themes, and also provides the user with two types of summary: (1) one covering the complexity of the document set in detail and (2) one with fewer details and limited length. The system clusters passages or sequences of text instead of entire documents. The similarity is based on n-grams rather than mere term overlap, which increases the clustering efficiency.

3. PROPOSED SYSTEM

The proposed model is based on a correlated-concept-oriented model; considering terms and related terms as correlated concepts for clustering improves the efficiency.

The proposal of considering terms and related terms as concepts [11] based on semantic similarity has been carried out for extracting topics from documents. These concepts are analyzed at the sentence and document levels and used in document clustering and topic discovery. Cosine similarity is the similarity measure used for both the term-based bisecting k-means algorithm and the concept-based bisecting k-means algorithm. The results obtained were promising compared with term-only and synonym-and-hypernym-based clustering. The proposed technique takes this initiative of considering terms and related terms as concepts for clustering the documents, and also computes a modified concept-based feature score for sentence extraction similar to the existing term-based features [15].

Figure 1. Proposed System Architecture for Clustering based on Correlated Concepts

Figure 1 shows the overall architecture of the proposed clustering process. Each module is described in detail below:

3.1 Pre-Processing

During the first phase, documents from various sources are collected and stored in a database. The documents are then extracted from the database and preprocessed by removing stop words and applying a stemming algorithm. The steps involved in preprocessing are:

* Sentence decomposition
* Given a set of documents SD, where i = 1, ..., N and N is the total number of documents
* Select any ith document from SD
* Split the selected ith document into sentences

Remove stop words

* Get the input file containing the English stop words
* Match the decomposed sentences and remove the stop words

Perform stemming

* Construct the root word using the Porter stemmer algorithm [18]
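The three preprocessing steps can be sketched as follows. The stop-word list here is abbreviated (a full English list would be read from file), and the suffix-stripping function is a crude stand-in for the Porter stemmer [18].

```python
import re

# Abbreviated stop-word list; a full English list would be loaded from file.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}

def simple_stem(word):
    """Crude suffix stripping as a stand-in for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Sentence decomposition -> stop-word removal -> stemming."""
    sentences = re.split(r"[.!?]+", document)
    result = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        tokens = [simple_stem(t) for t in tokens if t not in STOP_WORDS]
        if tokens:
            result.append(tokens)
    return result
```

The output is one token list per sentence, which is the form the concept extraction step below works on.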

3.2 Concept Extraction Algorithm

As discussed, extracting synonyms or hypernyms as concepts does not give efficient results in the case of the scientific literature and newsgroup datasets because of the scientific terms involved. Concept extraction is based on our previous work [11], where correlated concepts are simply the terms and their related terms. Taking "share market" as a term in the news documents, the related terms are share, shareholder, money and market. The documents containing these words are grouped together under "share market", which forms the cluster.

The following steps show the process of concept vector construction:

* For each document doci in SD:

* Create two empty lists, one for terms and one for related terms

* Calculate the term frequency for each term in the document

* Compute the conceptual term frequency for the term

* Calculate weight = term frequency + concept frequency

* Add the weight to the term list

* Sort the weights in descending order and add the maximum-weight terms to the terms list

* Extract the related terms for the terms in the terms list based on the proposed concept extraction algorithm

* Add the terms and related terms to the concept list

* Calculate the concept weight.
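The steps above can be sketched as follows. Two points are assumptions made for illustration: conceptual term frequency is read here as the number of sentences containing the term, and related terms receive half the weight of their parent term. The related-terms mapping is supplied as a plain dictionary standing in for the concept extraction algorithm.

```python
from collections import Counter

def build_concept_vector(sentences, related_terms, top_n=3):
    """Sketch of the concept-vector construction steps.

    sentences: list of token lists for one document.
    related_terms: mapping term -> list of related terms (assumed to come
    from the concept extraction algorithm; a plain dict here).
    """
    # Term frequency over the whole document.
    tf = Counter(t for sent in sentences for t in sent)
    # Conceptual term frequency: number of sentences the term appears in
    # (one simple sentence-level reading of "concept frequency").
    ctf = Counter(t for sent in sentences for t in set(sent))
    # weight = term frequency + concept frequency, as in the steps above.
    weight = {t: tf[t] + ctf[t] for t in tf}
    terms = sorted(weight, key=weight.get, reverse=True)[:top_n]
    # Concept list = top terms plus their related terms.
    concepts = {}
    for term in terms:
        concepts[term] = weight[term]
        for rel in related_terms.get(term, []):
            # Assumed discount: related terms get half the parent weight.
            concepts[rel] = concepts.get(rel, 0) + weight[term] / 2
    return concepts
```

The returned dictionary is the concept vector used by the similarity measure and the clustering step.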

3.3 Similarity Measure

Most of the existing techniques report that a semantic-based similarity measure [10] does not suit a centroid-based clustering algorithm. Since the proposed technique uses the bisecting k-means algorithm, which is a centroid-based clustering algorithm, the similarity measure used here is cosine similarity. The formula for calculating cosine similarity [12] is given below.
cos(d1, d2) = (d1 . d2) / (|d1| |d2|)    (1)

where dot (.) represents the vector dot product and |d| represents the length of vector d. The centroid vector c for a given set S is defined as

c = (1 / |S|) * Σ(d ∈ S) d    (2)

which gives the average of all term weights in the set S. The similarity between each document vector and the centroid vector is measured using cosine similarity:

cos(d, c) = (d . c) / (|d| |c|)    (3)
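The cosine similarity and centroid computations above can be sketched over sparse term-weight vectors (dictionaries mapping term to weight):

```python
import math

def cosine_similarity(d1, d2):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def centroid(vectors):
    """Centroid of a set of sparse vectors: average weight per term."""
    c = {}
    for v in vectors:
        for t, w in v.items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in c.items()}
```

Documents with no shared terms score 0, identical directions score 1, which is why cosine similarity is a natural fit for centroid-based assignment.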

3.4 Clustering Algorithm

The extracted concepts are clustered by the induced bisecting k-means algorithm [13]. The basic bisecting k-means algorithm [14] starts by selecting the two elements with the largest distance as seed clusters, and the other items are assigned to the closest seed. The centers of these two seeds are then recalculated as the weighted sum of all items assigned to them, and these centers are used to find the new seeds. This process is repeated until the two seeds meet the predefined precision. If the cluster size is larger than the predefined threshold, the entire process is repeated on that cluster, which forms a binary tree.

The clustering process is stated as:

* Given a high dimensional concept vector

* Generate concepts for clustering (terms and related terms)

* Construct the initial clusters based on concepts (terms and related terms)

* Make the cluster disjoint in order to identify the best initial cluster and keep the document only in that cluster by calculating the goodness score

* Build the cluster

* Apply the child pruning and sibling merging algorithm to merge the similar cluster

* Apply the resolutions technique to make the summary more efficient
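A minimal sketch of the basic bisecting k-means procedure described above, seeding each split with the two farthest-apart elements. The convergence threshold, goodness score, and pruning/merging details of the induced variant [13] are omitted; this uses plain Euclidean distance on dense vectors for brevity.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(xs) / len(xs) for xs in zip(*points))

def two_means(points, iters=10):
    """Split points into two groups, seeding with the farthest-apart pair."""
    seeds = max(
        ((a, b) for i, a in enumerate(points) for b in points[i + 1:]),
        key=lambda pair: dist(*pair),
    )
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            groups[0 if dist(p, seeds[0]) <= dist(p, seeds[1]) else 1].append(p)
        new_seeds = tuple(mean(g) if g else s for g, s in zip(groups, seeds))
        if new_seeds == seeds:  # seeds converged
            break
        seeds = new_seeds
    return [g for g in groups if g]

def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters are formed."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        clusters.remove(largest)
        clusters.extend(two_means(largest))
    return clusters
```

Each bisection step splits only the largest remaining cluster, which is what produces the binary tree of clusters mentioned above.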

3.5 Retrieval by Hashing

Hash-based indexing is a promising new technology for text-based information retrieval; it provides an efficient and reliable means to tackle different retrieval tasks. We identified three major classes of tasks in which hash-based indexes are applicable, namely grouping, similarity search, and classification. Two quite different construction principles for hash-based indexes were introduced, originating from fuzzy-fingerprinting and locality-sensitive hashing respectively. An analysis of both hashing approaches was conducted to demonstrate their applicability for the near-duplicate detection task and the similarity search task, and to compare them in terms of precision and recall. The results of our experiments reveal that fuzzy-fingerprinting outperforms locality-sensitive hashing in the task of near-duplicate detection regarding both precision and recall. Within the similarity search task, fuzzy-fingerprinting achieves a clearly higher precision compared to locality-sensitive hashing, while only a slight advantage in terms of recall was observed.

Despite our restriction to the domain of text-based information retrieval, we emphasize that the presented ideas and algorithms are applicable to retrieval problems in a variety of other domains.

Indeed, locality-sensitive hashing was designed to handle various kinds of high-dimensional vector-based object representations. The principles of fuzzy-fingerprinting can also be applied to other domains of interest, provided that the objects of the domain can be characterized with a small set of discriminative features.
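The construction details of the hash indexes are not given in this section; as an illustration, below is a standard random-hyperplane locality-sensitive hashing sketch for cosine similarity, with illustrative parameter choices (number of planes, dimension, seed).

```python
import random

def hyperplane_signature(vector, planes):
    """Bit signature: which side of each random hyperplane the vector lies on."""
    return tuple(
        1 if sum(v * p for v, p in zip(vector, plane)) >= 0 else 0
        for plane in planes
    )

def build_lsh_index(vectors, n_planes=8, dim=3, seed=42):
    """Group vectors into hash buckets by their hyperplane signature.

    Vectors with a similar direction (high cosine similarity) tend to land
    in the same bucket, so a query only compares against one bucket
    instead of scanning the whole collection.
    """
    rng = random.Random(seed)
    planes = [
        tuple(rng.gauss(0, 1) for _ in range(dim)) for _ in range(n_planes)
    ]
    index = {}
    for i, v in enumerate(vectors):
        index.setdefault(hyperplane_signature(v, planes), []).append(i)
    return index, planes
```

At query time, the query vector is hashed with the same planes and only its bucket's members are ranked, which is what makes hash-based retrieval efficient on large datasets.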

Our current research aims at the theoretical analysis of similarity hash functions and the utilization of the gained insights in practical applications. We want to quantify the relation between the determinants of fuzzy-fingerprinting and the achieved retrieval performance in order to construct optimized hash indexes for special-purpose retrieval tasks. Finally, we apply fuzzy-fingerprinting as a key technology in our tools for text-based plagiarism analysis.

4. SUMMARY
Our system takes less computational time, since it computes the centroid only for the extracted concepts and considers similarity only between concepts, whereas other clustering algorithms compute the centroid for the whole data set, which is time consuming. With respect to feature calculation, our proposed algorithm computes scores only for sentences with more concept words rather than for all the sentences in the document, which reduces the time taken for sentence feature computation.

6. REFERENCES

[1] Guy Aston and Lou Burnard. The BNC Handbook. http://www.natcorp.ox.ac.uk, 1998.

[2] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.

[3] Mayank Bawa, Tyson Condie, and Prasanna Ganesan. LSH Forest: Self-Tuning Indexes for Similarity Search. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 651–660, New York, NY, USA, 2005. ACM Press.

[4] Andrei Z. Broder. Identifying and filtering near-duplicate documents. In COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, pages 1–10, London, UK, 2000. Springer-Verlag.

Figure 2. Precision and recall versus similarity for fuzzy-fingerprinting (FF) and locality-sensitive hashing (LSH).

5. CONCLUSION

Multi-document clustering and retrieval are in high demand in today's world because of the voluminous information available. Information is available in various formats from various sources. Gathering all the information in a short period is a tiresome task, and the user also wants the information to be precise and quickly readable. We have proposed a summarization and redundancy elimination technique based on correlated concepts. Our new approach improves the quality of the summary by incorporating concept-based clustering, summarization and redundancy elimination techniques. Concepts are based on terms and related terms. The summary is created based on concept and sentence-based features. The proposed technique gives a quality summary since the redundancy elimination is based on correlated concepts. The system could be enhanced to create

