Design sensitivity and statistical power in acceptability judgment experiments

Abstract
Previous investigations into the validity of acceptability judgment data have focused almost exclusively on type I errors (false positives) because of the consequences of such errors for syntactic theories (Sprouse & Almeida 2012; Sprouse et al. 2013). The current study complements that work by systematically investigating the type II error rate (false negatives), or equivalently, the statistical power, of a wide cross-section of possible acceptability judgment experiments. Though type II errors have historically been assumed to be less costly than type I errors, the dynamics of scientific publishing mean that high type II error rates (i.e., studies with low statistical power) can lead to increases in type I error rates in a given field of study: because significant results are more likely to be published, a literature built on underpowered studies will contain a higher proportion of false positives among its published effects. We present a set of experiments and resampling simulations to estimate statistical power for four tasks (forced-choice, Likert scale, magnitude estimation, and yes-no), 50 effect sizes instantiated by real phenomena, sample sizes from 5 to 100 participants, and two approaches to statistical analysis (null hypothesis significance testing and Bayesian). Our goals are twofold: (i) to provide a fuller picture of the status of acceptability judgment data in syntax, and (ii) to provide detailed information that syntacticians can use to design and evaluate the sensitivity of acceptability judgment experiments in their own research.
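
To make the resampling logic concrete, the following is a minimal sketch of power estimation by resampling, written under assumptions of our own rather than the study's actual materials: the participant pool, effect size, rating distributions, and paired t-test below are illustrative stand-ins. The idea is to treat a large set of judgment data as the population, repeatedly draw samples of n participants, and record how often the statistical test detects the effect.

    # Illustrative resampling-based power estimation (not the study's code).
    # We fabricate a pool of per-participant mean Likert ratings for two
    # conditions, then repeatedly resample n participants and count how
    # often a paired t-test reaches significance at the chosen alpha.
    import numpy as np
    from scipy.stats import ttest_rel

    rng = np.random.default_rng(seed=0)

    # Hypothetical participant pool (7-point Likert scale); EFFECT is the
    # assumed true mean difference between the two conditions.
    POOL_SIZE = 1000
    EFFECT = 0.5
    pool_gram = rng.normal(loc=5.0, scale=1.0, size=POOL_SIZE)
    pool_ungram = pool_gram - rng.normal(loc=EFFECT, scale=0.5, size=POOL_SIZE)

    def estimated_power(n_participants, n_sims=5000, alpha=0.05):
        """Proportion of simulated experiments with n participants that
        detect the effect with a paired t-test at the given alpha."""
        hits = 0
        for _ in range(n_sims):
            idx = rng.choice(POOL_SIZE, size=n_participants, replace=False)
            _, p = ttest_rel(pool_gram[idx], pool_ungram[idx])
            if p < alpha:
                hits += 1
        return hits / n_sims

    for n in (5, 10, 25, 50, 100):
        print(f"n = {n:3d}: estimated power = {estimated_power(n):.2f}")

Sampling without replacement from a fixed pool mirrors the logic of treating an existing large dataset as the population from which smaller experiments are drawn; the estimated power for each sample size is simply the proportion of simulated experiments that reach significance.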