Holistic Evaluation of Language Models
- 25 May 2023
- journal article
- research article
- Published by Wiley in Annals of the New York Academy of Sciences
- Vol. 1525 (1), 140-146
- https://doi.org/10.1111/nyas.15007
Abstract
Language models (LMs) like GPT-3, PaLM, and ChatGPT are the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of LMs. LMs can serve many purposes and their behavior should satisfy many desiderata. To navigate the vast space of potential scenarios and metrics, we taxonomize the space and select representative subsets. We evaluate models on 16 core scenarios and 7 metrics, exposing important trade-offs. We supplement our core evaluation with seven targeted evaluations to deeply analyze specific aspects (including world knowledge, reasoning, regurgitation of copyrighted content, and generation of disinformation). We benchmark 30 LMs, from OpenAI, Microsoft, Google, Meta, Cohere, AI21 Labs, and others. Prior to HELM, models were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: all 30 models are now benchmarked under the same standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly. HELM is a living benchmark for the community, continuously updated with new scenarios, metrics, and models.