Holistic Evaluation of Language Models

25 May 2023

journal article
research article
Published by Wiley in Annals of the New York Academy of Sciences

Vol. 1525 (1), 140-146
https://doi.org/10.1111/nyas.15007

Abstract

Language models (LMs) like GPT-3, PaLM, and ChatGPT are the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of LMs. LMs can serve many purposes and their behavior should satisfy many desiderata. To navigate the vast space of potential scenarios and metrics, we taxonomize the space and select representative subsets. We evaluate models on 16 core scenarios and 7 metrics, exposing important trade-offs. We supplement our core evaluation with seven targeted evaluations to deeply analyze specific aspects (including world knowledge, reasoning, regurgitation of copyrighted content, and generation of disinformation). We benchmark 30 LMs, from OpenAI, Microsoft, Google, Meta, Cohere, AI21 Labs, and others. Prior to HELM, models were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: all 30 models are now benchmarked under the same standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly. HELM is a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

Cited by 19 articles