Review of Outlier Detection Methods

INTRODUCTION
Outliers, or anomalies, can exist in all types of collected data. The presence of outliers may indicate something sinister, such as unauthorised system access or fraudulent activity, or may signal a new and previously unidentified occurrence. Whatever the cause, it is important that outliers are detected so that appropriate action can be taken: to minimise their harm if they are malicious, or to exploit a newly discovered opportunity.
Chandola, Banerjee and Kumar (2007) conducted a comprehensive survey of outlier detection techniques, which highlighted the importance of detection across a wide variety of domains. Their survey described the categories of outlier detection, applications of detection and detection techniques.
Chandola et al. identified three main categories of outlier detection - supervised, semi-supervised and unsupervised detection. Each category utilises different detection techniques such as classification, clustering, nearest neighbour and statistical. Each category and technique has several strengths and weaknesses compared with other outlier detection methods. This review provides initial information on data labelling and classification before examining some of the existing outlier detection techniques within each of the three categories. It then looks at the use of combining detection techniques before comparing and discussing the advantages and disadvantages of each method. Finally, a new classification technique is proposed using a new outlier detection algorithm, Isolation Forest.

DATA LABELLING
Datasets normally consist of many data instances, with each instance usually containing one or more attributes. Each attribute may take a value of a predetermined type, such as numerical, nominal or binary, or the value may be missing. Data labelling (assigning a class to each instance in a dataset) is a time-consuming, manual task, often performed by humans to ensure labelling accuracy. With large unlabelled datasets, labelling is prohibitively resource-intensive and often not attempted. Research into semi-supervised data labelling has been conducted (see Simon, Kumar & Zhang, 2008) but labelling remains largely a human task.
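As a rough illustration only (none of the following appears in the reviewed papers), the Python snippet below sketches how a partially labelled dataset might be represented, with the degree of labelling determining which detection mode is available. The attribute values and the label encoding are hypothetical.

```python
import numpy as np

# Hypothetical dataset: six instances, each with three numerical attributes.
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.8, 2.7, 4.1],
    [7.9, 3.8, 6.4],
    [9.5, 0.2, 8.8],   # an unusual instance that may be an outlier
])

# Labels: 0 = normal, 1 = outlier, -1 = unlabelled.
# A fully labelled y enables supervised detection; labelling only the normal
# class enables semi-supervised detection; an entirely unlabelled y (all -1)
# leaves only unsupervised detection.
y = np.array([0, 0, 0, -1, -1, -1])
```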
Labelled data instances may or may not be required for the detection of outliers, depending on the chosen algorithm, but it is the degree of labelling that determines the mode in which the detection technique operates: supervised, semi-supervised or unsupervised.

DATA CLASSIFICATION
Data classification techniques are commonly used to detect outliers in datasets. A classification algorithm attempts to assign newly observed data to an existing labelled class (Kotsiantis, 2007). For example, in a two-class problem, the chosen classification technique identifies a majority class and a minority class, with instances in the minority class being deemed outliers. This approach to outlier detection is applicable to all three detection categories mentioned above, as labelling is not required but can be applied.
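The following Python sketch illustrates the two-class idea described above: a classifier is trained on labelled majority (normal) and minority (outlier) classes, and new observations assigned to the minority class are deemed outliers. The data, class sizes and the choice of a decision tree are illustrative assumptions, not drawn from any of the reviewed papers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical two-class problem: a large normal (majority) class and a
# small outlier (minority) class, each instance described by two attributes.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.normal(loc=5.0, scale=1.0, size=(10, 2))
X = np.vstack([normal, outliers])
y = np.array([0] * 200 + [1] * 10)   # 0 = majority/normal, 1 = minority/outlier

# Train on the labelled data, then assign new observations to an existing
# class; instances assigned to the minority class are deemed outliers.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
new_points = np.array([[0.2, -0.1], [5.3, 4.8]])
print(clf.predict(new_points))       # expected: [0 1]
```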

SUPERVISED DETECTION
In supervised detection, a fully labelled dataset (both normal and outlier occurrences labelled) is used to train the selected detection algorithm. Abe, Zadrozny and Langford (2006) attempted to detect outliers using reduction to classification and active learning. Active learning is a form of supervised detection in which only user-selected labelled instances of a dataset are used. Outlier detection was reduced to a classification problem: an initial labelled dataset was injected with artificial data to simulate outliers and then classified. Active learning, based on the ‘Query by Bagging’ technique, was applied to the classification problem. Accuracy was evaluated over three experiments, comparing the Abe et al. technique (ActiveOutlier) against other reduction-to-classification methods (Bagging and Boosting), Local Outlier Factor (LOF), Feature Bagging, and results reported in the literature from the network intrusion detection competition. Bagging generates a different, individual model for each artificially generated data subset and combines the resulting decisions into a single prediction. Boosting is similar to Bagging, except that each model influences the next model created (Witten & Frank, 2005). Results showed that ActiveOutlier had the highest outlier detection accuracy in three of the five datasets used and was only 1% behind Bagging in the fourth. An area of concern in the Abe et al. report is the use of artificial outliers in the dataset. While this was a requirement for supervised learning, any model trained on artificial outliers may fail to detect real-world outliers in unseen datasets.
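The sketch below illustrates only the reduction-to-classification idea: unlabelled data are treated as normal, artificial outliers are injected by uniform sampling over the attribute ranges, and a bagging ensemble is trained on the resulting two-class problem. It omits the active learning (‘Query by Bagging’) component of ActiveOutlier, and the data and sampling scheme are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(1)

# Unlabelled real data, all treated as the "normal" class (label 0).
X_real = rng.normal(loc=0.0, scale=1.0, size=(300, 3))

# Inject artificial outliers drawn uniformly over the attribute ranges
# (label 1), turning outlier detection into a two-class classification problem.
lo, hi = X_real.min(axis=0), X_real.max(axis=0)
X_art = rng.uniform(lo, hi, size=(300, 3))

X = np.vstack([X_real, X_art])
y = np.array([0] * len(X_real) + [1] * len(X_art))

# A bagging ensemble of decision trees (scikit-learn's default base learner)
# stands in for the reduction-to-classification step; the active learning
# component of ActiveOutlier is not reproduced here.
model = BaggingClassifier(n_estimators=25, random_state=1).fit(X, y)

# Score unseen points by their predicted probability of being "artificial".
scores = model.predict_proba(rng.normal(size=(5, 3)))[:, 1]
print(scores)
```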

SEMI-SUPERVISED DETECTION
Semi-supervised detection utilises a mix of unlabelled and labelled data for the training process. Commonly, the normal class is labelled while the outliers are unlabelled. Researchers such as Gao, Cheng and Tan (2006) believe that labelling some data can improve the accuracy of outlier detection.
Gao, Cheng and Tan (2006) investigated outlier detection using a semi-supervised method. Rather than using the conventional clustering measure of minimising the sum-squared error, an iterative algorithm based on the K-means clustering method was created to optimise the objective function. This algorithm was able to incorporate the presence of outliers and of labelled normal data into the clustering. Gao et al. claim to have produced a computationally inexpensive and efficient algorithm that can detect outliers that other unsupervised methods cannot. However, no empirical evaluation of the algorithm was performed, nor were any comparisons with unsupervised or supervised techniques presented.
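Gao et al.'s exact objective function is not reproduced here, so the following Python snippet is only a rough sketch of the underlying idea: an initial K-means clustering, followed by flagging points that lie far from their centroid as outliers (instances labelled as normal are never flagged) and re-fitting the clusters on the remaining points. The data, the distance threshold and the use of scikit-learn's KMeans are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Hypothetical data: two normal clusters plus two distant points, with the
# first ten instances labelled as normal (the rest are unlabelled).
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(8, 1, (50, 2)),
               [[20.0, 20.0], [-15.0, 18.0]]])
labelled_normal = np.arange(10)

# Step 1: an initial K-means clustering of all points.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centroid = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Step 2: flag points far from their centroid as outliers, but never flag
# instances labelled as normal; then re-fit the clusters on the remainder.
# (An illustrative simplification, not Gao et al.'s actual objective function.)
outliers = dist_to_centroid > 4.0
outliers[labelled_normal] = False
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[~outliers])

print(np.where(outliers)[0])        # the two distant points (indices 100, 101) should be flagged
print(km.cluster_centers_.round(1)) # cluster centres fitted without the flagged outliers
```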
Although less common than semi-supervised detection using labelled normal data, research has also been conducted into using negative selection algorithms for outlier detection. Negative selection uses datasets in which the outliers, rather than the normal instances, are labelled. Stibor, Mohr, Timmis and Eckert (2005), Ji and Dasgupta (2006) and others have investigated single-class classification using a negative selection algorithm, V-detector. Both Stibor et al. and Ji and Dasgupta compared V-detector results with a single-class Support Vector Machine (SVM).
Both papers suggest that negative selection is not practical for datasets with small self-sample sizes, and that it also suffers from the ‘curse of dimensionality’ (each dimension added to a space increases its volume exponentially).
The two papers reach different conclusions: Ji and Dasgupta suggest that negative selection using V-detector is a viable technique for outlier detection, while Stibor et al. remain generally unconvinced of its future, given that a single-class SVM performs better in high-dimensional space.
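As a point of reference for the comparison above, the sketch below shows a single-class SVM trained only on ‘self’ (normal) samples, which is the baseline both papers compare V-detector against; V-detector itself is not shown. The data and SVM parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)

# Train a single-class SVM on "self" (normal) samples only.
X_self = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_self)

# Predict on unseen data: +1 = accepted as self, -1 = flagged as non-self.
X_new = np.vstack([rng.normal(0.0, 1.0, (3, 5)),   # normal-looking points
                   rng.normal(6.0, 1.0, (2, 5))])  # far from the self region
print(ocsvm.predict(X_new))   # expected: mostly [ 1  1  1 -1 -1]
```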

UNSUPERVISED DETECTION
Unsupervised outlier detection uses unlabelled datasets for training purposes but makes the assumption that the majority of instances in the dataset consist of normal data with the minority of instances being outliers.
Angiulli, Basta and Pizzuti (2005) presented a distance-based, unsupervised outlier detection and prediction method. They hypothesised that, after training, new, unseen data could be classified as outlier or normal using only a randomly selected subset of the dataset, while executing in sub-quadratic time. As their initial algorithm, SolvingSet, was not robust, a second algorithm, RobustSolvingSet, which builds on SolvingSet, was implemented. SolvingSet and RobustSolvingSet were executed on two real-world datasets (ColorHistogram – 68,040 data points and Landsat – 275,465 data points) and a synthetic dataset (Gaussian – 1,000,000 data points). Results indicated that both algorithms produced low false positive rates using small sub-sample sizes of no more than 7% of the dataset, with RobustSolvingSet reporting fewer false positives than SolvingSet.
Although Angiulli et al. state that a comparison of algorithm execution times is not meaningful because the algorithm’s goal was to build a model, presentation of these times would allow researchers to determine how the execution time of SolvingSet ranks compared to other unsupervised techniques.
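The following sketch illustrates only the underlying distance-based notion of outlierness computed against a small random subsample; it is not the SolvingSet or RobustSolvingSet algorithm. Each point is scored by its distance to its k-th nearest neighbour within the subsample, and the highest-scoring points are treated as outliers. The subsample fraction (7%, echoing the sub-sample sizes mentioned above), the value of k and the data are assumptions.

```python
import numpy as np

def knn_distance_scores(X, subsample_frac=0.07, k=5, seed=0):
    """Distance-based outlier scores computed against a random subsample.

    Each point is scored by its distance to its k-th nearest neighbour
    within a small random subset of the data. A rough sketch of the
    distance-based notion of outlierness, not the SolvingSet algorithm.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=max(k + 1, int(subsample_frac * len(X))),
                     replace=False)
    sample = X[idx]
    dists = np.linalg.norm(X[:, None, :] - sample[None, :, :], axis=2)
    dists.sort(axis=1)
    return dists[:, k]   # distance to the k-th nearest sampled point

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (500, 3)), [[12.0, 12.0, 12.0]]])
scores = knn_distance_scores(X)
print(np.argsort(scores)[-3:])   # the injected point (index 500) should rank highest
```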
Yamanishi and Takeuchi (2001) combined supervised and unsupervised algorithms to detect outliers. Their approach used an unsupervised algorithm, SmartSifter, to generate outlier scores for unlabelled data using a Gaussian model for statistical representation (see Yamanishi, Takeuchi, Williams & Milne, 2000). While SmartSifter was able to detect outliers with high accuracy, it was unable to explain them. A rules-based supervised classification algorithm, DL-ESC, was selected to explain the existence of the outliers by creating a filtering rule based on the scores that SmartSifter had previously calculated. Higher scores were classified as positive (potential outliers) and lower scores as negative. After DL-ESC classification, this filtering rule was used in pre-processing for the next iteration of SmartSifter. The combination of SmartSifter and DL-ESC showed significant improvement in detecting outliers when the sample size was low (1–2%), but at 3% the difference was negligible. The results of combining supervised and unsupervised algorithms seem promising based on Yamanishi and Takeuchi's results; however, this research was limited to detecting network intrusion, which requires real-time or near-real-time detection. A combination of different algorithms and testing of results in multiple domains would provide a better indication of how combining supervised and unsupervised algorithms can improve outlier detection.
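Because SmartSifter and DL-ESC are not reproduced here, the sketch below substitutes stand-ins to illustrate the two-stage combination: an unsupervised density score (a single fitted Gaussian in place of SmartSifter) followed by a supervised learner (a decision tree in place of the rules-based DL-ESC) trained on the thresholded scores to produce an explanatory filtering rule. The data, the score threshold and the model choices are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(7, 1, (15, 2))])

# Stage 1 (unsupervised scoring): a fitted Gaussian stands in for
# SmartSifter; a lower log-likelihood means a higher outlier score.
mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
scores = -multivariate_normal(mean=mu, cov=cov).logpdf(X)

# Stage 2 (supervised filtering rule): label the highest-scoring points as
# positive and learn an explanatory rule from the raw attributes; a shallow
# decision tree stands in for the rules-based DL-ESC classifier.
labels = (scores > np.quantile(scores, 0.97)).astype(int)
rule = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)

# The learned rule can be applied as a pre-processing filter for the next
# scoring iteration, or inspected to explain the detected outliers.
print(rule.predict([[0.0, 0.0], [7.5, 6.8]]))   # expected: [0 1]
```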

ENSEMBLE DETECTION
As described above, Yamanishi and Takeuchi (2001) used an ensemble consisting of supervised and unsupervised algorithms to detect outliers. Research has also been undertaken into detecting outliers using Bagging and Feature Bagging.
Lazarevic and Kumar (2005) hypothesised that combining the outputs of several unsupervised outlier detection algorithms would improve the detection of outliers in high-dimensional, noisy and large datasets compared with detection using a single algorithm. A diverse range of outlier detection algorithms was selected, with each algorithm able to detect different outliers in the dataset and assign different scores using the density-based measure LOF. Two methods of combining the algorithms were chosen: first, Breadth First, which ranked subsets based on the probability of being an outlier; and second, Cumulative Sum, which summed the scores produced by each of the detection algorithms.
During computation, in each of t rounds, a unique, randomly selected subset of features was input to the detection algorithms, allowing each algorithm to output a different score.
Ten detection algorithms were combined for the two synthetic datasets, and 50 algorithms were combined for all but one of the 66 real-life datasets. Experimental results on both types of dataset indicated that Cumulative Sum outperformed a single LOF detector in all datasets where noise was present, and performed better than the Breadth First approach in the majority of cases. It should be noted that the ten detection algorithms used by Lazarevic and Kumar were not presented in the paper. Finally, there was no discussion as to whether the presented results would be similar on a high-dimensionality dataset.
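A minimal sketch of feature bagging with the Cumulative Sum combination follows: in each round, LOF is run on a randomly selected feature subset and the per-instance scores are summed across rounds. It uses scikit-learn's LocalOutlierFactor and illustrative data; the number of rounds and the subset sizes are assumptions rather than the settings used by Lazarevic and Kumar.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (300, 8)), rng.normal(5, 1, (5, 8))])

# Feature bagging with the Cumulative Sum combination (rough sketch):
# in each of T rounds, LOF is run on a randomly selected feature subset
# and the resulting outlier scores are summed across rounds.
T, n_features = 10, X.shape[1]
combined = np.zeros(len(X))
for t in range(T):
    size = rng.integers(n_features // 2, n_features + 1)   # random subset size
    feats = rng.choice(n_features, size=size, replace=False)
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(X[:, feats])
    combined += -lof.negative_outlier_factor_   # larger value = more outlying

print(np.argsort(combined)[-5:])   # indices 300-304 should rank highest
```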

COMPARISON OF METHODS
A distinct advantage of supervised detection is high detection accuracy, provided the model has been trained with accurate labels and the instances used for training reflect ‘real world’ normal and outlier behaviour. A major disadvantage of supervised detection is that, as mentioned above, manually labelling each instance is a costly exercise, and it is quite difficult to label every possible outlier.
Using semi-supervised detection avoids some of the labelling issues associated with supervised detection. As only normal instances are labelled, this provides the benefit of not having to know how any outliers behave. However, any new, acceptable behaviour will be classified as an outlier if it is not present in the normal profile.
Unsupervised detection does not require any labelling, as it assumes that a dataset will contain far more normal instances than outliers. This removes the time and cost factors associated with labelling, and knowledge of all the values an outlier may take is not required. A major drawback of unsupervised detection is its reliance on the dataset being heavily skewed towards normal instances; a dataset that contains more outliers than normal instances will not be classified correctly.

CONCLUSION
This paper has discussed different methods for detecting anomalies. After examining Chandola, Banerjee and Kumar’s (2007) survey on outlier detection techniques, background information on data labelling and classification was provided. Following this, the supervised, semi-supervised and unsupervised detection methods were described using various examples. Next, the idea of taking an ensemble approach to detection was presented. Each method was then analysed, with its advantages and disadvantages explained. Finally, the implementation of a new classification technique using the Isolation Forest (iForest) detection algorithm was proposed.

REFERENCES
Abe, N., Zadrozny, B., & Langford, J. (2006). Outlier detection by active learning. In 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, USA, August 20-23, 2006 (pp. 504-509). New York, USA: ACM Press.

Angiulli, F., Basta, S., & Pizzuti, C. (2005). Detection and prediction of distance-based outliers. In Proceedings of the 20th Annual ACM Symposium on Applied Computing, Santa Fe, USA, March 13-17, 2005 (pp. 531-542). New York, USA: ACM Press.

Chandola, V., Banerjee, A., & Kumar, V. (2007). Outlier detection – A survey (Report No. 07-017). Minneapolis, USA: Department of Computer Science and Engineering, University of Minnesota.

Gao, J., Cheng, H., & Tan, P. (2006). Semi-supervised outlier detection. In 21st Annual ACM Symposium on Applied Computing, Dijon, France, April 23-27, 2006 (pp. 635-636). New York, USA: ACM Press.

Ji, Z., & Dasgupta, D. (2006). Applicability issues of the real-valued negative selection algorithms. In Genetic & Evolutionary Computation Conference, Seattle, USA, July 8-12, 2006 (pp. 111-118). New York, USA: ACM Press.

Kotsiantis, S.B. (2007). Supervised machine learning: A review of classification techniques. Informatica, 31, 249-268.

Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, USA, August 21-24, 2005 (pp. 157-166). New York, USA: ACM Press.

Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation forest. In Eighth IEEE International Conference on Data Mining, Pisa, Italy, December 15-19, 2008 (pp. 413-422). Washington, USA: IEEE Computer Society Press.

Nakamura, T., Kamidoi, Y., Wakabayashi, S., & Yoshida, N. (2006). A decision method of attribute importance for classification by outlier detection. In Proceedings of the 22nd International Conference on Data Engineering Workshops, Georgia, USA, April 3-7, 2006 (pp. 45-50). Washington, USA: IEEE Computer Society Press.

Stibor, T., Mohr, P., Timmis, J., & Eckert, C. (2005). Is negative selection appropriate for anomaly detection? In Genetic and Evolutionary Computation Conference, Washington, USA, June 25-29, 2005 (pp. 321-328). New York, USA: ACM Press.

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco, USA: Morgan Kaufmann.

Yamanishi, K., Takeuchi, J., Williams, G., & Milne, P. (2000). Online unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, USA, August 20-23, 2000 (pp. 320-324). New York, USA: ACM Press.

Yamanishi, K., & Takeuchi, J. (2001). Discovering outlier filtering rules from unlabeled data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA, August 26-29, 2001 (pp. 389-394). New York, USA: ACM Press.
