Word Error Analysis in Aphasia: Introducing the Greek Aphasia Error Corpus (GRAEC)

Abstract
Since the pioneering work of Paul Broca and Carl Wernicke, it has become clear that the interaction of aphasia research and theoretical linguistics can be beneficial for both disciplines: (1) in order to understand the nature of aphasia as a language disorder, it is crucial to understand the nature of language, that is, its internal rules and principles; (2) linguistic analysis of aphasic speech can provide evidence on the relation between brain and language; (3) neurolinguistic data can be used to distinguish between competing linguistic theories; and (4) linguistic analysis of aphasic speech often leads to the design of linguistically specific treatment programs for aphasia (for more details, see Avrutin, 2001). One of the most exciting recent developments in linguistics has been the widespread use of electronic corpora, both as a methodology and as a theoretical viewpoint on language (see, e.g., McEnery and Hardie, 2012, for an overview). In parallel, in aphasia research, large-scale data collection and group studies allow generalizations about the population from which the participants have been drawn, leading to useful findings (see Grodzinsky et al., 1999). Such findings can complement single case studies, which allow for a detailed description of aphasic speech patterns and for inferences about the language system in non-brain-damaged individuals (see, amongst others, Badecker and Caramazza, 1985; Caramazza, 1986; Caramazza and Badecker, 1991). However, recruiting patients with aphasia on a large scale is difficult. Even when permission for collecting and using data from patients with aphasia has been obtained, considerable resources are required to move patients through the steps of consenting, screening, and testing.
A solution to this problem could be data sharing, as is increasingly recognized in the recent literature, which has seen a surge in corpora of language data from speakers with various disorders, including aphasia, in several languages such as Dutch (Westerhout and Monachesi, 2007), Cantonese (Kong and Law, 2019), Russian (Khudyakova et al., 2016), Croatian (Kuvač Kraljević et al., 2017), and, of course, English (Mirman et al., 2010; Williams et al., 2010; MacWhinney et al., 2011; Laures-Gore et al., 2016). Despite such attempts to develop corpora that are widely available to researchers, the need for additional open data banks in different languages remains. For instance, for Greek, a recent study presented a detailed methodology for the transcription and annotation of aphasic speech samples (Varlokosta et al., 2016); although the authors describe an elaborate pipeline, no data have been made available yet. Apart from the importance of data sharing discussed above, there is a methodological issue related to aphasic discourse analysis that is worth mentioning, namely, the method of eliciting a speech sample, which is then used to evaluate a patient's linguistic competence on the basis of several indices, such as type and frequency of errors, semantic content, speech rate, and mean length of utterance. Given the large number of genres used in studies assessing aphasic narration ability (for an overview, see Müller et al., 2008), one must acknowledge the possible effects of the chosen elicitation task on the qualitative and quantitative characteristics of speech output (Armstrong, 2000) and, consequently, the importance of evaluating verbal production across such genres (Armstrong et al., 2011). Moreover, there is a well-established tradition of comparing data from speakers with aphasia with general corpus data, used as controls for a variety of purposes (e.g., Schwartz et al., 1994; Gahl, 2002; Fraser et al., 2015).
As reference corpora become widely available for many languages, including Greek (Goutsos, 2010), there is an increasing need for developing resources with specialized data from speakers with disorders. To that end, we have developed the Greek Aphasia Error Corpus (GRAEC), which is a large, searchable, web-based corpus of patients' performance on two different elicitation tasks, i.e., picture description and free narration, also including background language testing and clinical/demographic information. The corpus is available at http://aphasia.phil.uoa.gr/, while a pilot sample of the data has been included in AphasiaBank (http://talkbank.org/AphasiaBank/). To our knowledge, this is the first publicly available corpus with data from Greek patients with aphasia. We present the first data from 50 right-handed monolingual Greek patients with aphasia induced by left-hemisphere stroke, assessed at the Neuropsychology and Language Disorders Unit of the 1st Neurology Department of the National and Kapodistrian University of Athens, at Eginition Hospital. The participants (16 women) were 30–86 years old, with 4–20 years of formal schooling. Background language testing included the Boston Diagnostic Aphasia Examination–Short Form (BDAE-SF) adapted for Greek (Goodglass and Kaplan, 1983; Tsapkini et al., 2009), and the Boston Naming Test (Kaplan et al., 1983), standardized in Greek (Simos et al., 2011). CT and/or MRI scans were obtained for each patient, and two independent neuroradiologists identified lesion sites, which were then coded according to previously reported methodology (Kasselimis et al., 2017). These reports are part of the publicly available database. At this point, the structural MRIs of the patients are not included in GRAEC. Demographic and speech sample information is shown in Table 1.
Informed consent for participation in the study and publication of the data (ensuring anonymity) was obtained from all participants according to the Ethics Committee of Eginition Hospital. Apart from time post onset, brain lesion loci, test performance, and basic demographic information (including sex, age, and years of formal schooling), no individually identifying information about the patients is contained in the corpus, and individual...