The London–Lund Corpus 2: design, challenges and innovations

Open Access

8 September 2021

journal article
research article
Published by Cambridge University Press (CUP) in English Language and Linguistics

Vol. 25 (3), 459-483
https://doi.org/10.1017/s1360674321000186

Abstract

This article describes and critically examines the challenging task of compiling The London–Lund Corpus 2 (LLC–2) from start to end, accounting for the methodological decisions made in each stage and highlighting the innovations. LLC–2 is a half-a-million-word corpus of contemporary spoken British English with recordings from 2014 to 2019. Its size and design are the same as those of the world's first machine-readable spoken corpus, The London–Lund Corpus of Spoken English, with data from the 1950s to 1980s. In this way, LLC–2 allows not only for synchronic investigations of contemporary speech but also for principled diachronic research of spoken language across time. Each stage of the compilation of LLC–2 posed its own challenges, ranging from the design of the corpus, the recruitment of the speakers, transcription, markup and annotation procedures, to the release of the corpus to the international research community. The decisions and solutions represent state-of-the-art practices of spoken corpus compilation with important innovations that enhance the value of LLC–2 for spoken corpus research, such as the availability of both the transcriptions and the corresponding time-aligned audio files in a standard-compliant format.

Cited by 11 articles