Distance-based outlier detection
- 1 September 2010
- journal article
- research article
- Published by Association for Computing Machinery (ACM) in Proceedings of the VLDB Endowment
- Vol. 3 (1-2), 1469-1480
- https://doi.org/10.14778/1920841.1921021
Abstract
Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches. In this paper we assess several distance-based outlier detection approaches and evaluate them. We begin by surveying and examining the design landscape of extant approaches, while identifying key design decisions of such approaches. We then implement an outlier detection framework and conduct a factorial design experiment to understand the pros and cons of various optimizations proposed by us as well as those proposed in the literature, both independently and in conjunction with one another, on a diverse set of real-life datasets. To the best of our knowledge this is the first such study in the literature. The outcome of this study is a family of state of the art distance-based outlier detection algorithms. Our detailed empirical study supports the following observations. The combination of optimization strategies enables significant efficiency gains. Our factorial design study highlights the important fact that no single optimization or combination of optimizations (factors) always dominates on all types of data. Our study also allows us to characterize when a certain combination of optimizations is likely to prevail and helps provide interesting and useful insights for moving forward in this domain.
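For readers unfamiliar with the family of methods the abstract refers to, the following is a minimal sketch of one common distance-based formulation: score each point by the distance to its k-th nearest neighbor and report the n highest-scoring points. It also includes a simple cutoff-based pruning step as an example of the kind of optimization such studies compare. This is an illustration under stated assumptions, not the framework evaluated in the paper; the function name `top_n_outliers` and the parameters `k`, `n`, and `seed` are hypothetical.

```python
import heapq
import math
import random


def top_n_outliers(points, k=2, n=1, seed=0):
    """Return up to n (score, index) pairs, highest score first.

    The score of a point is the Euclidean distance to its k-th nearest
    neighbor. This is a generic sketch, not the paper's implementation.
    """
    # Assumption: scanning neighbors in random order tends to trigger pruning sooner.
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)

    top = []      # min-heap of (score, index): the current top-n outliers
    cutoff = 0.0  # score of the weakest outlier currently held in `top`

    for i in range(len(points)):
        knn = []  # negated distances: a max-heap of the k smallest seen so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # -knn[0] is an upper bound on the final k-th neighbor distance.
            # Once it drops below the cutoff, point i cannot enter the top-n.
            if len(knn) == k and len(top) == n and -knn[0] < cutoff:
                pruned = True
                break
        if pruned or len(knn) < k:
            continue
        score = -knn[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]

    return sorted(top, reverse=True)
```

Called on a small set such as `[(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]` with `k=2` and `n=1`, the function reports the isolated point at index 4. The pruning step does not change the result; it only skips distance computations for points that can no longer qualify, which is why its benefit depends on the data and the scan order.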