Dimensionality and Generalizability of Domain-Independent Performance Assessments

Abstract
Empirical guidance for the design of comparable performance assessments is sorely lacking. A study was conducted to assess the degree to which domain specifications control topic and rater variability, focusing on task generalizability, rater reliability, and scoring rubric dimensionality. Two classes of history students were administered three on-demand, multistep performance tasks, one week apart. For each topic, all students completed a Prior Knowledge Test, read primary source materials, and wrote an explanatory essay. Four trained raters scored all essays using a theory-based scoring rubric. Inter- and intrarater reliabilities and generalizability (G-study) results are reported and show the relative efficiency of the assessment approach. The dimensionality analysis supported two factors, Deep Understanding and Surface Understanding, across the three topics. Prior Knowledge scores and GPA in history courses correlated with the Deep Understanding elements of the scoring rubric. Implications for assessment design and testing purposes are discussed.