Distance-based outlier detection

1 September 2010

journal article
research article
Published by Association for Computing Machinery (ACM) in Proceedings of the VLDB Endowment

Vol. 3 (1-2), 1469-1480
https://doi.org/10.14778/1920841.1921021

Abstract

Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches. In this paper we assess several distance-based outlier detection approaches and evaluate them. We begin by surveying and examining the design landscape of extant approaches, while identifying key design decisions of such approaches. We then implement an outlier detection framework and conduct a factorial design experiment to understand the pros and cons of various optimizations proposed by us as well as those proposed in the literature, both independently and in conjunction with one another, on a diverse set of real-life datasets. To the best of our knowledge, this is the first such study in the literature. The outcome of this study is a family of state-of-the-art distance-based outlier detection algorithms. Our detailed empirical study supports the following observations. The combination of optimization strategies enables significant efficiency gains. Our factorial design study highlights the important fact that no single optimization or combination of optimizations (factors) always dominates on all types of data. Our study also allows us to characterize when a certain combination of optimizations is likely to prevail and helps provide interesting and useful insights for moving forward in this domain.

