Modeling the judgment of vowel quality differences

Abstract
The hypothesis of this study is that the auditory cues relevant to listeners' judgments of vowel quality take the form of a spectral representation of loudness density versus pitch. A model is described that generates such patterns for steady-state vowels. In addition to the nonlinear transformations underlying the loudness density and pitch scales, it incorporates experimentally established characteristics of frequency resolution and masking, such as the critical band concept. The model is combined with a measure of auditory perceptual distance that operates on pairs of vowels and treats each stimulus representation as a single spectral shape. To test the distance metric and the model, experimental data were gathered from listeners' numerical estimates of quality differences between stimulus pairs that compared four-formant and two-formant vowels. The correlation between experimental and theoretical results was 0.89. We interpret this value to indicate that the present definitions of auditory cue and auditory distance account for the experimental behavior of our listeners only in a rather gross fashion. On the other hand, the theory was developed on the basis of rather conservative assumptions about the nature of auditory cues: the model ignores the possibility of temporal coding and certain nonlinear effects, and it does not pay special attention to spectral peaks. Seen in that light, the agreement between observed and predicted auditory distance is remarkably good.
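
The abstract gives no formulas, so the following Python sketch only illustrates the kind of computation it describes; none of it should be read as the study's own method. It makes three labeled assumptions: the pitch axis is approximated with the Zwicker and Terhardt critical-band-rate (Bark) formula, the distance between two whole spectral shapes is taken as a Euclidean distance over equally spaced samples on that axis, and model predictions are compared with listener estimates using a Pearson correlation. The function names (hz_to_bark, spectral_distance, pearson_r) are hypothetical.

```python
import math


def hz_to_bark(f_hz):
    """Approximate critical-band rate in Bark for a frequency in Hz.

    Assumption: Zwicker and Terhardt's analytic approximation stands in
    for whatever pitch scale the model actually uses.
    """
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def spectral_distance(pattern_a, pattern_b, dz=1.0):
    """Distance between two loudness-density patterns treated as whole shapes.

    Both patterns are lists sampled at the same equally spaced points
    (step dz, in Bark) along the pitch axis.
    Assumption: a Euclidean metric; the abstract does not name the metric.
    """
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must be sampled at the same points")
    squared = sum((a - b) ** 2 for a, b in zip(pattern_a, pattern_b))
    return math.sqrt(squared * dz)


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted distances."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)
```

Under these assumptions, each vowel pair yields one predicted distance from spectral_distance, and pearson_r over all pairs gives a figure comparable in kind to the reported correlation between experimental and theoretical results.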