Review of Outlier Detection Methods

INTRODUCTION
Outliers, or anomalies, can exist in all types of collected data. The presence of outliers may indicate something sinister, such as unauthorised system access or fraudulent activity, or may signal a new and previously unidentified occurrence. Whatever the cause, it is important that outliers are detected so that appropriate action can be taken: to minimise their harm if they are malicious, or to exploit a newly discovered opportunity.
Chandola, Banerjee and Kumar (2007) conducted a comprehensive survey of outlier detection techniques, which highlighted the importance of detection across a wide variety of domains. Their survey described the categories of outlier detection, applications of detection and detection techniques.
Chandola et al. identified three main categories of outlier detection - supervised, semi-supervised and unsupervised detection. Each category utilises different detection techniques such as classification, clustering, nearest neighbour and statistical. Each category and technique has several strengths and weaknesses compared with other outlier detection methods. This review provides initial information on data labelling and classification before examining some of the existing outlier detection techniques within each of the three categories. It then looks at the use of combining detection techniques before comparing and discussing the advantages and disadvantages of each method. Finally, a new classification technique is proposed using a new outlier detection algorithm, Isolation Forest.

DATA LABELLING
Datasets normally consist of many data instances, with each instance usually containing one or more attributes. Each attribute may take a value of a predetermined type, such as numerical, nominal or binary, or the value may be missing. Data labelling (assigning a class to each instance in a dataset) is a time-consuming, manual task, often performed by humans to ensure labelling accuracy. With large unlabelled datasets, labelling is prohibitively resource-intensive and often not attempted. Research into semi-supervised data labelling has been conducted (see Simon, Kumar & Zhang, 2008) but labelling remains largely a human task.
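As a rough illustration only (none of the following appears in the reviewed papers), the Python snippet below sketches how a partially labelled dataset might be represented, with the degree of labelling determining which detection mode is available. The attribute values and the label encoding are hypothetical.

```python
import numpy as np

# Hypothetical dataset: six instances, each with three numerical attributes.
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.8, 2.7, 4.1],
    [7.9, 3.8, 6.4],
    [9.5, 0.2, 8.8],   # an unusual instance that may be an outlier
])

# Labels: 0 = normal, 1 = outlier, -1 = unlabelled.
# A fully labelled y enables supervised detection; labelling only the normal
# class enables semi-supervised detection; an entirely unlabelled y (all -1)
# leaves only unsupervised detection.
y = np.array([0, 0, 0, -1, -1, -1])
```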
Labelled data instances may or may not be required for the detection of outliers, depending on the chosen algorithm, but it is the degree of labelling that determines the mode in which the detection technique operates: supervised, semi-supervised or unsupervised.

DATA CLASSIFICATION
Data classification techniques are commonly used to detect outliers in datasets. A classification algorithm attempts to assign newly observed data to an existing labelled class (Kotsiantis, 2007). For example, in a two-class problem, the chosen classification technique identifies a majority class and a minority class, with instances in the minority class being deemed outliers. This approach to outlier detection is applicable to all three detection categories mentioned above, as labelling is not required but can be applied.
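The following Python sketch illustrates the two-class idea described above: a classifier is trained on labelled majority (normal) and minority (outlier) classes, and new observations assigned to the minority class are deemed outliers. The data, class sizes and the choice of a decision tree are illustrative assumptions, not drawn from any of the reviewed papers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical two-class problem: a large normal (majority) class and a
# small outlier (minority) class, each instance described by two attributes.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.normal(loc=5.0, scale=1.0, size=(10, 2))
X = np.vstack([normal, outliers])
y = np.array([0] * 200 + [1] * 10)   # 0 = majority/normal, 1 = minority/outlier

# Train on the labelled data, then assign new observations to an existing
# class; instances assigned to the minority class are deemed outliers.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
new_points = np.array([[0.2, -0.1], [5.3, 4.8]])
print(clf.predict(new_points))       # expected: [0 1]
```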

SUPERVISED DETECTION
In supervised detection, a fully labelled dataset (both normal and outlier occurrences labelled) is used to train the selected detection algorithm. Abe, Zadrozny and Langford (2006) attempted to detect outliers using reduction to classification and active learning. Active learning is a form of supervised detection in which only user-selected labelled instances of a dataset are used. Outlier detection was reduced to a classification problem: an initial labelled dataset was injected with artificial data to simulate outliers and then classified. Active learning, based on the ‘Query by Bagging’ technique, was applied to the classification problem. Accuracy was evaluated over three experiments, comparing the Abe et al. technique (ActiveOutlier) against other reduction-to-classification methods (Bagging and Boosting), Local Outlier Factor (LOF), Feature Bagging, and results reported in the literature from the network intrusion detection competition. Bagging generates a different, individual model for each artificially generated data subset and combines the resulting decisions into a single prediction. Boosting is similar to Bagging, except that each model influences the next model created (Witten & Frank, 2005). Results showed that ActiveOutlier had the highest outlier detection accuracy in three of the five datasets used and was only 1% behind Bagging in the fourth. An area of concern in the Abe et al. report is the use of artificial outliers in the dataset. While this was a requirement for supervised learning, any model trained on artificial outliers may fail to detect real-world outliers in unseen datasets.
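The sketch below illustrates only the reduction-to-classification idea: unlabelled data are treated as normal, artificial outliers are injected by uniform sampling over the attribute ranges, and a bagging ensemble is trained on the resulting two-class problem. It omits the active learning (‘Query by Bagging’) component of ActiveOutlier, and the data and sampling scheme are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(1)

# Unlabelled real data, all treated as the "normal" class (label 0).
X_real = rng.normal(loc=0.0, scale=1.0, size=(300, 3))

# Inject artificial outliers drawn uniformly over the attribute ranges
# (label 1), turning outlier detection into a two-class classification problem.
lo, hi = X_real.min(axis=0), X_real.max(axis=0)
X_art = rng.uniform(lo, hi, size=(300, 3))

X = np.vstack([X_real, X_art])
y = np.array([0] * len(X_real) + [1] * len(X_art))

# A bagging ensemble of decision trees (scikit-learn's default base learner)
# stands in for the reduction-to-classification step; the active learning
# component of ActiveOutlier is not reproduced here.
model = BaggingClassifier(n_estimators=25, random_state=1).fit(X, y)

# Score unseen points by their predicted probability of being "artificial".
scores = model.predict_proba(rng.normal(size=(5, 3)))[:, 1]
print(scores)
```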

SEMI-SUPERVISED DETECTION
Semi-supervised detection utilises a mix of unlabelled and labelled data for the training process. Commonly, the normal class is labelled while the outliers are unlabelled. Researchers such as Gao, Cheng and Tan (2006) believe that labelling some data can improve the accuracy of outlier detection.
Gao, Cheng and Tan (2006) investigated outlier detection using a semi-supervised method. Rather than using the conventional clustering measure of minimising the sum-squared error, an iterative algorithm based on the K-means clustering method was created to optimise the objective function. This algorithm was able to incorporate the presence of outliers and of labelled normal data into the clustering. Gao et al. claim to have produced a computationally inexpensive and efficient algorithm that can detect outliers that other unsupervised methods cannot. However, no empirical evaluation of the algorithm was performed, nor were any comparisons with unsupervised or supervised techniques presented.
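Gao et al.'s exact objective function is not reproduced here, so the following Python snippet is only a rough sketch of the underlying idea: an initial K-means clustering, followed by flagging points that lie far from their centroid as outliers (instances labelled as normal are never flagged) and re-fitting the clusters on the remaining points. The data, the distance threshold and the use of scikit-learn's KMeans are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Hypothetical data: two normal clusters plus two distant points, with the
# first ten instances labelled as normal (the rest are unlabelled).
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(8, 1, (50, 2)),
               [[20.0, 20.0], [-15.0, 18.0]]])
labelled_normal = np.arange(10)

# Step 1: an initial K-means clustering of all points.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centroid = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Step 2: flag points far from their centroid as outliers, but never flag
# instances labelled as normal; then re-fit the clusters on the remainder.
# (An illustrative simplification, not Gao et al.'s actual objective function.)
outliers = dist_to_centroid > 4.0
outliers[labelled_normal] = False
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[~outliers])

print(np.where(outliers)[0])        # the two distant points (indices 100, 101) should be flagged
print(km.cluster_centers_.round(1)) # cluster centres fitted without the flagged outliers
```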
Although less common than semi-supervised detection using labelled normal data, research has also been conducted into using negative selection algorithms for outlier detection. Negative selection uses datasets in which the outliers, rather than the normal instances, are labelled. Stibor, Mohr, Timmis and Eckert (2005), Ji and Dasgupta (2006) and others have investigated single-class classification using a negative selection algorithm, V-detector. Both Stibor et al. and Ji and Dasgupta compared V-detector results with a single-class Support Vector Machine (SVM).
Both papers suggest that negative selection is not practical for datasets with small self-sample sizes, and that it also suffers from the ‘curse of dimensionality’ (each dimension added to a space increases its volume exponentially).
The two papers reach different conclusions: Ji and Dasgupta suggest that negative selection using V-detector is a viable technique for outlier detection, while Stibor et al. remain generally unconvinced of its future, given that a single-class SVM performs better in high-dimensional space.
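As a point of reference for the comparison above, the sketch below shows a single-class SVM trained only on ‘self’ (normal) samples, which is the baseline both papers compare V-detector against; V-detector itself is not shown. The data and SVM parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)

# Train a single-class SVM on "self" (normal) samples only.
X_self = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_self)

# Predict on unseen data: +1 = accepted as self, -1 = flagged as non-self.
X_new = np.vstack([rng.normal(0.0, 1.0, (3, 5)),   # normal-looking points
                   rng.normal(6.0, 1.0, (2, 5))])  # far from the self region
print(ocsvm.predict(X_new))   # expected: mostly [ 1  1  1 -1 -1]
```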

UNSUPERVISED DETECTION
Unsupervised outlier detection uses unlabelled datasets for training purposes but makes the assumption that the majority of instances in the dataset consist of normal data with the minority of instances being outliers.
Angiulli, Basta and Pizzuti (2005) presented a distance-based, unsupervised outlier detection and prediction method. They hypothesised that, after training, new, unseen data could be classified as outlier or normal using only a randomly selected subset of the dataset, while executing in sub-quadratic time. As their initial algorithm, SolvingSet, was not robust, a second algorithm, RobustSolvingSet, which builds on SolvingSet, was implemented. SolvingSet and RobustSolvingSet were executed on two real-world datasets (ColorHistogram – 68,040 data points and Landsat – 275,465 data points) and a synthetic dataset (Gaussian – 1,000,000 data points). Results indicated that both algorithms produced low false positive rates using small sub-sample sizes of no more than 7% of the dataset, with RobustSolvingSet reporting fewer false positives than SolvingSet.
Although Angiulli et al. state that a comparison of algorithm execution times is not meaningful because the algorithm’s goal was to build a model, presentation of these times would allow researchers to determine how the execution time of SolvingSet ranks compared to other unsupervised techniques.
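The following sketch illustrates only the underlying distance-based notion of outlierness computed against a small random subsample; it is not the SolvingSet or RobustSolvingSet algorithm. Each point is scored by its distance to its k-th nearest neighbour within the subsample, and the highest-scoring points are treated as outliers. The subsample fraction (7%, echoing the sub-sample sizes mentioned above), the value of k and the data are assumptions.

```python
import numpy as np

def knn_distance_scores(X, subsample_frac=0.07, k=5, seed=0):
    """Distance-based outlier scores computed against a random subsample.

    Each point is scored by its distance to its k-th nearest neighbour
    within a small random subset of the data. A rough sketch of the
    distance-based notion of outlierness, not the SolvingSet algorithm.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=max(k + 1, int(subsample_frac * len(X))),
                     replace=False)
    sample = X[idx]
    dists = np.linalg.norm(X[:, None, :] - sample[None, :, :], axis=2)
    dists.sort(axis=1)
    return dists[:, k]   # distance to the k-th nearest sampled point

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (500, 3)), [[12.0, 12.0, 12.0]]])
scores = knn_distance_scores(X)
print(np.argsort(scores)[-3:])   # the injected point (index 500) should rank highest
```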
Yamanishi and Takeuchi (2001) combined supervised and unsupervised algorithms to detect outliers. Their approach used an unsupervised algorithm, SmartSifter, to generate outlier scores for unlabelled data using a Gaussian model for statistical representation (see Yamanishi, Takeuchi, Williams & Milne, 2000). While SmartSifter was able to detect outliers with high accuracy, it was unable to explain them. A rules-based supervised classification algorithm, DL-ESC, was selected to explain the existence of the outliers by creating a filtering rule based on the scores that SmartSifter had previously calculated. Higher scores were classified as positive (potential outliers) and lower scores as negative. After DL-ESC classification, this filtering rule was used in pre-processing for the next iteration of SmartSifter. The combination of SmartSifter and DL-ESC showed significant improvement in detecting outliers when the sample size was low (1–2%), but at 3% the difference was negligible. The results of combining supervised and unsupervised algorithms seem promising based on Yamanishi and Takeuchi's results; however, this research was limited to detecting network intrusion, which requires real-time or near-real-time detection. A combination of different algorithms and testing of results in multiple domains would provide a better indication of how combining supervised and unsupervised algorithms can improve outlier detection.
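Because SmartSifter and DL-ESC are not reproduced here, the sketch below substitutes stand-ins to illustrate the two-stage combination: an unsupervised density score (a single fitted Gaussian in place of SmartSifter) followed by a supervised learner (a decision tree in place of the rules-based DL-ESC) trained on the thresholded scores to produce an explanatory filtering rule. The data, the score threshold and the model choices are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(7, 1, (15, 2))])

# Stage 1 (unsupervised scoring): a fitted Gaussian stands in for
# SmartSifter; a lower log-likelihood means a higher outlier score.
mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
scores = -multivariate_normal(mean=mu, cov=cov).logpdf(X)

# Stage 2 (supervised filtering rule): label the highest-scoring points as
# positive and learn an explanatory rule from the raw attributes; a shallow
# decision tree stands in for the rules-based DL-ESC classifier.
labels = (scores > np.quantile(scores, 0.97)).astype(int)
rule = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)

# The learned rule can be applied as a pre-processing filter for the next
# scoring iteration, or inspected to explain the detected outliers.
print(rule.predict([[0.0, 0.0], [7.5, 6.8]]))   # expected: [0 1]
```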

ENSEMBLE DETECTION
As described above, Yamanishi and Takeuchi (2001) used an ensemble consisting of supervised and unsupervised algorithms to detect outliers. Research has also been undertaken into detecting outliers using Bagging and Feature Bagging.
Lazarevic and Kumar (2005) hypothesised that combining the outputs of several unsupervised outlier detection algorithms would improve the detection of outliers in high-dimensional, noisy and large datasets compared with detection using a single algorithm. A diverse range of outlier detection algorithms was selected, with each algorithm able to detect different outliers in the dataset and assign different scores using the density-based measure LOF. Two methods of combining the algorithms were chosen: first, Breadth First, which ranked subsets based on the probability of being an outlier; and second, Cumulative Sum, which summed the scores produced by each of the detection algorithms.
During computation, in each of t rounds, a unique, randomly selected subset of features was input to the detection algorithms, allowing each algorithm to output a different score.
Ten detection algorithms were combined for the two synthetic datasets, and 50 algorithms were combined for all but one of the 66 real-life datasets. Experimental results on both types of dataset indicated that Cumulative Sum outperformed a single LOF detector in all datasets where noise was present, and performed better than the Breadth First approach in the majority of cases. It should be noted that the ten detection algorithms used by Lazarevic and Kumar were not presented in the paper. Finally, there was no discussion as to whether the presented results would be similar on a high-dimensionality dataset.
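A minimal sketch of feature bagging with the Cumulative Sum combination follows: in each round, LOF is run on a randomly selected feature subset and the per-instance scores are summed across rounds. It uses scikit-learn's LocalOutlierFactor and illustrative data; the number of rounds and the subset sizes are assumptions rather than the settings used by Lazarevic and Kumar.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (300, 8)), rng.normal(5, 1, (5, 8))])

# Feature bagging with the Cumulative Sum combination (rough sketch):
# in each of T rounds, LOF is run on a randomly selected feature subset
# and the resulting outlier scores are summed across rounds.
T, n_features = 10, X.shape[1]
combined = np.zeros(len(X))
for t in range(T):
    size = rng.integers(n_features // 2, n_features + 1)   # random subset size
    feats = rng.choice(n_features, size=size, replace=False)
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(X[:, feats])
    combined += -lof.negative_outlier_factor_   # larger value = more outlying

print(np.argsort(combined)[-5:])   # indices 300-304 should rank highest
```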

COMPARISON OF METHODS
A distinct advantage of supervised detection is high detection accuracy, provided the model has been trained with accurate labels and the instances used for training reflect ‘real world’ normal and outlier behaviour. A major disadvantage of supervised detection is that, as mentioned above, manually labelling each instance is a costly exercise, and it is quite difficult to label every possible outlier.
Using semi-supervised detection avoids some of the labelling issues associated with supervised detection. As only normal instances are labelled, this provides the benefit of not having to know how any outliers behave. However, any new, acceptable behaviour will be classified as an outlier if it is not present in the normal profile.
Unsupervised detection does not require any labelling, as it assumes that a dataset will contain far more normal instances than outliers. This removes the time and cost factors associated with labelling, and knowledge of all the values an outlier may take is not required. A major drawback of unsupervised detection is its reliance on the dataset being heavily skewed towards normal instances; a dataset that contains more outliers than normal instances will not be classified correctly.

CONCLUSION
This paper has discussed different methods for detecting anomalies. After examining Chandola, Banerjee and Kumar’s (2007) survey on outlier detection techniques, background information on data labelling and classification was provided. Following this, the supervised, semi-supervised and unsupervised detection methods were described using various examples. Next, the idea of taking an ensemble approach to detection was presented. Each method was then analysed, with its advantages and disadvantages explained. Finally, the implementation of a new classification technique using the Isolation Forest (iForest) detection algorithm was proposed.

REFERENCES
Abe, N., Zadrozny, B., & Langford, J. (2006). Outlier detection by active learning. In 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, USA, August 20-23, 2006 (pp. 504-509). New York, USA: ACM Press.

Angiulli, F., Basta, S., & Pizzuti, C. (2005). Detection and prediction of distance-based outliers. In Proceedings of the 20th Annual ACM Symposium on Applied Computing, Santa Fe, USA, March 13-17, 2005 (pp. 531-542). New York, USA: ACM Press.

Chandola, V., Banerjee, A., & Kumar, V. (2007). Outlier detection – A survey (Report No. 07-017). Minneapolis, USA: Department of Computer Science and Engineering, University of Minnesota.

Gao, J., Cheng, H., & Tan, P. (2006). Semi-supervised outlier detection. In 21st Annual ACM Symposium on Applied Computing, Dijon, France, April 23-27, 2006 (pp. 635-636). New York, USA: ACM Press.

Ji, Z., & Dasgupta, D. (2006). Applicability issues of the real-valued negative selection algorithms. In Genetic & Evolutionary Computation Conference, Seattle, USA, July 8-12, 2006 (pp. 111-118). New York, USA: ACM Press.

Kotsiantis, S.B. (2007). Supervised machine learning: A review of classification techniques. Informatica, 31, 249-268.

Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, USA, August 21-24, 2005 (pp. 157-166). New York, USA: ACM Press.

Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation forest. In Eighth IEEE International Conference on Data Mining, Pisa, Italy, December 15-19, 2008 (pp. 413-422). Washington, USA: IEEE Computer Society Press.

Nakamura, T., Kamidoi, Y., Wakabayashi, S., & Yoshida, N. (2006). A decision method of attribute importance for classification by outlier detection. In Proceedings of the 22nd International Conference on Data Engineering Workshops, Georgia, USA, April 3-7, 2006 (pp. 45-50). Washington, USA: IEEE Computer Society Press.

Stibor, T., Mohr, P., Timmis, J., & Eckert, C. (2005). Is negative selection appropriate for anomaly detection? In Genetic and Evolutionary Computation Conference, Washington, USA, June 25-29, 2005 (pp. 321-328). New York, USA: ACM Press.

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco, USA: Morgan Kaufmann.

Yamanishi, K., Takeuchi, J., Williams, G., & Milne, P. (2000). Online unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, USA, August 20-23, 2000 (pp. 320-324). New York, USA: ACM Press.

Yamanishi, K., & Takeuchi, J. (2001). Discovering outlier filtering rules from unlabeled data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA, August 26-29, 2001 (pp. 389-394). New York, USA: ACM Press.
