Register variation across text lengths
- 23 August 2022
- journal article
- research article
- Published by John Benjamins Publishing Company in Corpus Studies of Language Through Time
- Vol. 28 (2), 202-231
- https://doi.org/10.1075/ijcl.20177.lii
Abstract
This paper explores variation in lexico-grammatical register features across text lengths in a large-scale sample of Reddit comments. Very short texts are known to be problematic for many statistical methods, so understanding their nature is important for the corpus-linguistic study of social media, where most contributions are short. I show that the frequencies of linguistic features change with comment length, even between longer comments, although longer texts are often considered similar in statistical terms. Moreover, I classify the variation found between short comments of different lengths into two main patterns, although other patterns can also be found, and there is variation even within these patterns. Furthermore, I interpret the observed differences in terms of register variation. For example, shorter comments appear to be more casual and less edited in terms of their feature makeup, whereas narrative and informational registers seem to favor longer comments.