WinoGrande

Open Access

24 August 2021

journal article
research article
Published by Association for Computing Machinery (ACM) in Communications of the ACM

Vol. 64 (9), 99-106
https://doi.org/10.1145/3474381

Abstract

Commonsense reasoning remains a major challenge in AI, yet recent progress on benchmarks may seem to suggest otherwise. In particular, recent neural language models have reported above 90% accuracy on the Winograd Schema Challenge (WSC), a commonsense benchmark originally designed to be unsolvable for statistical models that rely simply on word associations. This raises an important question: have these models truly acquired robust commonsense capabilities, or do they rely on spurious biases in the dataset that lead to an overestimation of the true capabilities of machine commonsense? To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC but adjusted to improve both the scale and the hardness of the dataset. The dataset construction consists of two key steps: (1) large-scale crowdsourcing, followed by (2) systematic bias reduction using a novel AFLITE algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. Our experiments demonstrate that state-of-the-art models achieve considerably lower accuracy (59.4%-79.1%) on WinoGrande than humans do (94%), confirming that the high performance on the original WSC was inflated by spurious biases in the dataset. Furthermore, we report new state-of-the-art results on five related benchmarks, with emphasis on their dual implications. On the one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, the high performance on all these benchmarks suggests the extent to which spurious biases are prevalent in all such datasets, which motivates further research on algorithmic bias reduction.
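
The abstract does not spell out how the bias-reduction step operates. As a rough illustration only, the sketch below shows one way an embedding-based filtering loop of this kind could be organized: repeatedly train small linear classifiers on random subsets of precomputed embeddings, score each held-out instance by how often it is predicted correctly, and discard the most predictable instances. Every name and parameter here (`train_linear`, `predict`, `filter_predictable`, `train_size`, `n_models`, `cutoff_k`, `tau`) is hypothetical and chosen for this example, as is the use of logistic regression; it should not be read as the authors' implementation.

```python
import math
import random


def train_linear(X, y, epochs=20, lr=0.1):
    """Fit a small logistic-regression model with plain SGD (labels are 0 or 1)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
            grad = 1.0 / (1.0 + math.exp(-z)) - yi
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    """Return the predicted label (0 or 1) for one embedding vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


def filter_predictable(X, y, train_size, n_models=10, cutoff_k=5, tau=0.75, seed=0):
    """Return indices of instances that survive the filtering loop.

    Assumed (hypothetical) parameters: train_size is the number of instances
    each classifier is trained on, n_models the ensemble size per round,
    cutoff_k the maximum number removed per round, tau the score threshold.
    """
    rng = random.Random(seed)
    keep = list(range(len(X)))
    while len(keep) > train_size:
        correct = {i: 0 for i in keep}
        seen = {i: 0 for i in keep}
        for _ in range(n_models):
            shuffled = keep[:]
            rng.shuffle(shuffled)
            train, held_out = shuffled[:train_size], shuffled[train_size:]
            w, b = train_linear([X[i] for i in train], [y[i] for i in train])
            for i in held_out:
                seen[i] += 1
                correct[i] += int(predict(w, b, X[i]) == y[i])
        # Predictability score: fraction of held-out predictions that were correct.
        scores = {i: correct[i] / seen[i] for i in keep if seen[i] > 0}
        flagged = sorted((i for i in scores if scores[i] >= tau),
                         key=lambda i: -scores[i])[:cutoff_k]
        removed = set(flagged)
        keep = [i for i in keep if i not in removed]
        if len(flagged) < cutoff_k:  # too few predictable instances left; stop
            break
    return keep
```

In this sketch, `X` would hold one precomputed embedding per problem and `y` the index of the correct option; instances whose label a linear model can recover from the embedding alone are treated as carrying a spurious cue and are dropped first.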

Funding Information

DARPA, MCS program (N66001-19-2-4031)
NSF (IIS-1524371, IIS-1714566)
DARPA, CwC program (W911NF-15-1-0543)

