Multiple Desirable Methods in Outlier Detection of Univariate Data With R Source Codes

Open Access

17 January 2022

journal article
editorial
Published by Frontiers Media SA in Frontiers in Psychology

Vol. 12, 819854
https://doi.org/10.3389/fpsyg.2021.819854

Abstract

Outliers have long been a methodological obstacle across research literatures (Grubbs, 1969; Tian et al., 2018; Erdogan et al., 2019), and researchers often have to deal with outliers in univariate data. If inappropriate methods are used, the conclusions can be biased or wrong (Aguinis et al., 2013; Fife, 2020). Hence, how to detect outliers is one of the most actively studied topics among researchers in many fields (Tian et al., 2018; Dutta and Banerjee, 2019; Saneja and Rani, 2019), including psychology (Gladwell, 2008; Blouvshtein and Cohen-Or, 2018; Leys et al., 2019).

Although outlier detection methods deserve careful consideration in psychology, many researchers have used inappropriate methods without any theoretical basis (Simmons et al., 2011; Leys et al., 2013; Obikee and Okoli, 2021). Leys et al. (2013) investigated the outlier detection methods used in 127 articles published in the Journal of Personality and Social Psychology (JPSP) and Psychological Science (PSS) from 2010 to 2012. They found that 56 of the 127 papers (about 44%) detected outliers using the mean and standard deviation (Leys et al., 2013). I call this method “the conventional method” in this article. In this method, outliers are the values that fall outside the mean ± x standard deviations (x = 2 or 2.5 is common; Leys et al., 2013; Yang et al., 2019). Because of its simplicity, this method has been used in a great many psychological studies (Simmons et al., 2011; Leys et al., 2013).

However, the conventional method has three major theoretical problems (Chiang et al., 2003; Simmons et al., 2011). First, it assumes that the data, including the outliers, follow a normal distribution (Miller, 1991; Yang et al., 2019). Second, the mean and standard deviation are themselves strongly influenced by outliers, which increases the likelihood of Type I and Type II errors (Cousineau and Chartier, 2010; Leys et al., 2013).
Third, it is difficult to detect outliers in data with a small sample size (Cousineau and Chartier, 2010). As shown above, the conventional method has several theoretical problems, yet it has been used in many studies without sufficient consideration (Simmons et al., 2011; Leys et al., 2013; Obikee and Okoli, 2021). There are two possible reasons for this situation. First, few more appropriate alternatives to the conventional method are widely known. Second, researchers do not fully understand how to carry out those more desirable methods. Each researcher should choose the method that is appropriate for their data.

The purpose of this opinion paper is to review more desirable methods for detecting outliers in univariate data (specifically, square root transformation, median absolute deviation, Grubbs' test, and Ueda's method) and to present source code and sample data that allow readers to conduct each detection method. These detection methods have advantages over the conventional method and are relatively easy to implement. In addition, the results of applying each outlier detection method to a real data set are shown. The methods presented in this article can be run in R (R Core Team, 2021), a free statistical software environment. Summarizing various outlier detection methods and providing analysis source code should offer useful knowledge for psychological research.

The square root transformation method can be used for skewed data for which a normal distribution cannot be assumed, but it cannot be used for data that are too asymmetric (Cousineau and Chartier, 2010). For extremely asymmetric data, please refer to Carling (2000). First, the data x are transformed according to equation (1). In equation (1), x is each data point, Xmin is the minimum value of the data, and Xmax is the maximum value of the data. The transformed value y falls between 0 and 1.
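
Equation (1) itself is not reproduced in this text. The following is a minimal sketch in R, assuming that equation (1) is the min-max rescaling followed by a square root, y = sqrt((x - Xmin) / (Xmax - Xmin)), which is consistent with the statement that y falls between 0 and 1. The function name `sqrt_transform` and the sample values are hypothetical and are not taken from the article's own source code.

```r
# Hypothetical helper (not from the article's code).
# ASSUMPTION: equation (1) is y = sqrt((x - Xmin) / (Xmax - Xmin)).
sqrt_transform <- function(x) {
  x_min <- min(x)
  x_max <- max(x)
  if (x_max == x_min) {
    stop("all values are identical; the transformation is undefined")
  }
  sqrt((x - x_min) / (x_max - x_min))
}

# Made-up sample: 20 evenly spaced values plus one extreme value
x <- c(seq(300, 490, by = 10), 1200)
y <- sqrt_transform(x)

range(y)  # the minimum of x maps to 0 and the maximum maps to 1
```

The square root compresses the upper tail of the rescaled data, which is why the method suits moderately skewed data.
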
In the square root transformation method, the z-score is then calculated for the transformed data y by equation (2), where Ym is the mean of y and Sy is the standard deviation of y. A robust z-score transformation has higher power in detecting outliers. Outliers are then determined using a Bonferroni-corrected criterion (Armstrong, 2014). The Bonferroni correction is performed to avoid Type II errors that may occur in response to a larger standard deviation (Cousineau and Chartier, 2010). The z-values before and after Bonferroni correction for representative sample sizes N are shown in the Open Science Framework repository (OSF; https://osf.io/szt5n/?view_only=5cd1c734b392442d9633d3b7414c0914).

The median absolute deviation (MAD) method was proposed by Hampel (1974) and can be used for skewed data for which a normal distribution cannot be assumed, but it is not yet common in psychological research (Leys et al., 2013). MAD is built on the median, which has the very desirable property of being stable against the influence of outliers (Leys et al., 2013; Yang et al., 2019). MAD is obtained by equations (3) and (4). Med(x) denotes the median of the data x. Q(0.75) refers to the 75th percentile (third quartile) of the assumed distribution; for z-scores from a standard normal distribution it is about 0.6745. Thus, when a normal distribution can be assumed, b = 1/Q(0.75) = 1.4826 is often used (Huber, 1981; Leys et al., 2013; Kannan et al., 2015). The median ± k times the MAD is then taken as the boundary for outliers. For example, Miller (1991) recommends using 2, 2.5, or 3 as the value of k, depending on the purpose of outlier detection, while Leys et al. (2013) recommend k = 2.5. By adjusting the coefficient b, this method can also be used when a normal distribution is not assumed (e.g., for distributions with high kurtosis), but robust detection cannot be achieved for extremely asymmetric data (Rousseeuw and Croux, 1993; Yang et al.,...
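
The two procedures described above can be sketched in R as follows. This is an illustration under stated assumptions, not the article's own source code. The z-score follows the description of equation (2), and MAD is taken as b times the median of the absolute deviations from the median, the usual definition in Leys et al. (2013). Three things are assumptions: the form of equation (1), the Bonferroni-corrected cutoff (a two-sided alpha of .05 divided across the N observations), and the made-up sample data.

```r
# Made-up sample: 20 evenly spaced values plus one extreme value
x <- c(seq(300, 490, by = 10), 1200)
n <- length(x)

# --- Square root transformation with z-scores ---------------------------
# ASSUMPTION: equation (1) is min-max rescaling followed by a square root
y <- sqrt((x - min(x)) / (max(x) - min(x)))

# Equation (2) as described in the text: z = (y - Ym) / Sy
z <- (y - mean(y)) / sd(y)

# ASSUMPTION: two-sided alpha = .05, Bonferroni-corrected over n observations
alpha  <- 0.05
z_crit <- qnorm(1 - alpha / (2 * n))
out_sqrt <- abs(z) > z_crit

# --- Median absolute deviation (MAD) -------------------------------------
b     <- 1 / qnorm(0.75)                    # about 1.4826 under normality
mad_x <- b * median(abs(x - median(x)))     # close to stats::mad(x), whose
                                            # default constant is 1.4826
k       <- 2.5                              # criterion from Leys et al. (2013)
lower   <- median(x) - k * mad_x
upper   <- median(x) + k * mad_x
out_mad <- x < lower | x > upper

x[out_sqrt]  # values flagged by the square root transformation sketch
x[out_mad]   # values flagged by the MAD criterion
```

Because the median and MAD barely move when an extreme value is added, the MAD boundaries stay close to the bulk of the data, whereas the mean and standard deviation used for z are both pulled toward the extreme value.
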

This publication has 35 references indexed in Scilit:

False-Positive Psychology
Psychological Science, 2011
A recursive version of Grubbs' test for detecting multiple outliers in environmental and chemical data
Clinical Biochemistry, 2010
An SPSS implementation of the nonrecursive outlier deletion procedure with shifting z score criterion (Van Selst & Jolicoeur, 1994)
Behavior Research Methods, 2006
Exploring process data with the use of robust outlier detection algorithms
Journal of Process Control, 2003
Identification of Outliers in Multivariate Data
Journal of the American Statistical Association, 1996
Alternatives to the Median Absolute Deviation
Journal of the American Statistical Association, 1993
Short Report: Reaction Time Analysis with Outlier Exclusion: Bias Varies with Sample Size
The Quarterly Journal of Experimental Psychology Section A, 1991
The Influence Curve and its Role in Robust Estimation
Journal of the American Statistical Association, 1974
Procedures for Detecting Outlying Observations in Samples
Technometrics, 1969
Sample Criteria for Testing Outlying Observations
The Annals of Mathematical Statistics, 1950

Cited by 4 articles