Debbie Orpin
Corpus Studies of Language Through Time, Volume 10, pp 37-61;

Critical Discourse Analysis (CDA) has often proved fruitful in providing insights into the relationship between language and ideology. However, CDA is not without its critics. Constructive criticism has been offered by Stubbs, who suggests bolstering CDA by using a large corpus as the basis on which to make reliable generalisations about language use. Taking up that suggestion, this paper reports on a study of a group of words semantically related to corruption. In the study, corpus methodology is used to manipulate the data: concordances and collocational tools are used to provide semantic profiles of the words and highlight connotational differences, and to identify the geographical locations that the words refer to. It is argued that words with a noticeably negative connotation tend to be used when referring to activities that take place outside of Britain, while less negative words are used when referring to similar activities in British contexts. CDA theory is drawn on to interpret the ideological significance of the findings.
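The concordance and collocation steps mentioned in this abstract can be sketched in a few lines. This is a minimal illustration only: the function names, the window size and the toy text are assumptions made for the example, not the tools or data used in the study.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, span=4):
    """Return key-word-in-context lines as (left context, node, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines

def collocates(tokens, node, span=4):
    """Count every word occurring within `span` tokens of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented toy text, for illustration only (not corpus data).
sample = ("Officials denied the bribery claims. "
          "The minister faced bribery charges abroad. "
          "Reports of sleaze dominated the domestic press.")
tokens = tokenize(sample)
kwic = concordance(tokens, "bribery", span=2)
colls = collocates(tokens, "bribery", span=2)
```

On a real corpus, the collocate counts for each word in the set would be compared to build the semantic profiles the abstract describes; raw window counts like these are only a starting point.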
Geoffrey Sampson
Corpus studies of language through time, Volume 10, pp 15-36;

In recent decades there has been a trend towards greater use of empirical data, for instance corpus data, within linguistics. I analyse a sample of linguistics articles from the past half-century in order to establish a detailed profile for this trend. Based on consistent criteria for classifying papers as evidence-based, intuition-based, or neutral, the resulting profile shows that the trend (i) is real, but (ii) is strikingly weaker in general linguistics than in the special subfield of computational linguistics, and (iii) appears to have begun to go into reverse.
Andreea S. Calude
Corpus studies of language through time, Volume 22, pp 429-455;

This paper brings together the study of sociolinguistic variation and the area of grammatical analysis by investigating demonstrative cleft constructions in spoken British English such as That’s what I wanted to talk about and This is where I saw him. Using the Spoken BNC2014S, I ask whether speaker characteristics, including gender, age, education and occupation, might be correlated with the use of demonstrative clefts and with various aspects of their structure (preference for the distal or proximal demonstrative pronoun, use of negative polarity, and use of stance adverbs). Findings suggest that in British English, demonstrative cleft use is more likely to be present in the speech of male compared to female speakers, working adults in higher-skilled occupations compared to semi-skilled adults, and in adults of middle age compared to younger adults. This work shows that even highly abstract grammatical constructions can be sensitive to speaker preferences and linguistic communicative style.
Tanja Hessner, Ira Gawlitzek
Corpus studies of language through time, Volume 22, pp 403-428;

Since the late twentieth century, the usage of intensifiers, such as absolutely, very or slightly, has been investigated in relation to gender. Whereas previous literature has widely agreed that women and men do indeed use intensifiers differently, there has been some disagreement as to which gender uses them more frequently. Taking up this discussion, this study explores authentic data provided in the Spoken BNC2014S by investigating 39 intensifiers. After establishing the ten most frequently used intensifiers by women and men, these are subcategorized and investigated for gender and age effects. The results generally support earlier findings; however, they also illustrate that there is much interindividual variation and variation regarding individual intensifiers and that there are fascinating interactions with other variables such as age.
Corpus Studies of Language Through Time, Volume 22, pp 319-344;

This paper introduces the Spoken British National Corpus 2014, an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK, recorded in the years 2012–2016. After showing that a survey of the recent history of corpora of spoken British English justifies the compilation of this new corpus, we describe the main stages of the Spoken BNC2014’s creation: design, data and metadata collection, transcription, XML encoding, and annotation. In doing so we aim to (i) encourage users of the corpus to approach the data with sensitivity to the many methodological issues we identified and attempted to overcome while compiling the Spoken BNC2014, and (ii) inform (future) compilers of spoken corpora of the innovations we implemented to attempt to make the construction of corpora representing spontaneous speech in informal contexts more tractable, both logistically and practically, than in the past.
Corpus studies of language through time, Volume 22, pp 345-374;

This study investigates how age, gender, social class and dialect influence how frequently speakers of British English use intensifiers (e.g. very) in private conversations and whether this has changed over the last two decades. With data drawn from over 600 speakers and 4M words included in the Spoken British National Corpus (1994 and 2014 Sample), it is the most comprehensive study of intensifier usage to date, taking into account 111 intensifier variants. Results show that, in most age groups and social classes, men use intensifiers less frequently than women, and gender differences have diminished to a very limited extent, notably for the middle class. Moreover, intensification rate has increased across the board over time. This could be due to a shift towards a stereotypically more feminine communicative style as the perception of gender roles has changed, a process by which the middle class might have been particularly affected.
Jacqueline Laws, Chris Ryder,
Corpus studies of language through time, Volume 22, pp 375-402;

The aim of this paper is to ascertain the degree to which lexical diversity, density and creativity in everyday spoken British English have changed over a 20-year period, as a function of age and gender. Usage patterns of four verb-forming suffixes, -ate, -en, -ify and -ize, were compared in contemporary speech from the Spoken British National Corpus 2014 Sample (Spoken BNC2014S) with its 20-year-old counterpart, the BNC1994’s demographically-sampled component (the Spoken BNC1994DS). Frequency comparisons revealed that verb suffixation is denser in the Spoken BNC2014S than in the Spoken BNC1994DS, with the exception of the -en suffix, the use of which has decreased, particularly among female and younger speakers in general. Male speakers and speakers in the 35–59 age range showed the greatest type diversity; there is evidence that this peak is occurring earlier in the more recent corpus. Contrary to expectations, female rather than male speakers produced the largest number of neologisms and rare forms.
Mariya Koleva, Melissa Farasyn, Bart Desmet, , Véronique Hoste
Corpus studies of language through time, Volume 22, pp 107-140;

Syntactically annotated corpora are highly important for enabling large-scale diachronic and diatopic language research. Such corpora have recently been developed for a variety of historical languages, or are still under development. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which is aimed at facilitating research into the highly under-researched diachronic syntax of Low German. The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG). Having been transmitted in several non-standardised written varieties, MLG poses a challenge to standard POS taggers, which usually rely on normalized spelling. We outline the major issues faced in the creation of the tagger and present our solutions to them.
Feng (Kevin) Jiang
Corpus studies of language through time, Volume 22, pp 85-106;

Stance and voice are two crucial elements of social interactions in academic writing. However, their conceptual constructs are elusive and their linguistic realisation is not fully explored. A relatively overlooked feature is the “noun + that” structure, where a stance head noun takes a nominal complement clause (as advantage that in Flow cytometry offers the advantage that long term is available). This construction allows a writer to express authorial stance towards complement content and attribute a voice to that stance through pre-modification. This paper examines this construction in a corpus of 60 journal articles across six disciplines extracted from the BNC corpus. Developing an expressive classification of stance nouns and the possible voice categorisation, this study shows that the structure is not only widely used to project stance and voice, but that it displays considerable variation in the way that it is used to build knowledge across different disciplines.
Tove Larsson
Corpus studies of language through time, Volume 22, pp 57-84;

The ability to successfully position oneself in relation to one’s claims through the use of stance markers is of central importance for academic writers. This study, which uses data from one expert corpus (LOCRA) and three learner corpora (ALEC, VESPA and BATMAT), investigates the use of morphologically related stance markers that occur in different syntactic constructions (such as possibly, the possibility of and it is possible that). In doing so, it examines to what extent lexis, level of expertise in academic writing and L1 transfer influence the distribution of the different realizations of stance under investigation. The results show that all three variables are important predictors. In addition, differences pertaining to information structure are found to influence the distribution of two largely synonymous constructions (disjuncts and the introductory it pattern). The findings suggest that there are principled explanations for why one construction is used instead of another functionally similar construction.
Kasper I. Kok
Corpus studies of language through time, Volume 22, pp 1-26;

Based on the Bielefeld Speech and Gesture Alignment Corpus (Lücking et al. 2013), this paper presents a systematic comparison of the linguistic characteristics of unimodal (speech only) and multimodal (gesture-accompanied) forms of language use. The results suggest that each of these two modes of expression is characterized by statistical preferences for certain types of words and grammatical categories. The words that are most frequently accompanied by a manual gesture, when controlled for their total frequency, include unspecific spatial lexemes, various deictic words, and particles that express difficulty in word retrieval or formulation. Other linguistic items, including pronouns and verbs of cognition, show a strong dispreference for being gesture-accompanied. The second part of the paper shows that gestures do not occur within a fixed time window relative to the word(s) they relate to, but the preferred temporal distance varies with the type of functional relation that exists between the verbal and gestural channel.
Rolf Kreyer
Corpus studies of language through time, Volume 8, pp 169-207;

On the basis of 698 instances of Saxon genitive and of-construction, the present paper explores the use of these modifiers from a corpus-linguistic perspective. In particular, the influence of the lexical class of the modifier, the semantic relationship expressed by the constructions, and weight and syntactic complexity is analysed. It will be argued that the variation of genitive and of-construction can be explained with regard to two major underlying factors, namely ‘processability’ and ‘degree of human involvement’.
Liesbeth Mortier,
Corpus studies of language through time, Volume 14, pp 338-366;

This paper deals with the semantics of two discourse markers, viz. French en fait (“in fact”) and Dutch eigenlijk (“actually”), commonly associated with the expression of “opposition” and “reformulation”. A special focus lies on methodological issues in the description of such markers, since their non-propositional meanings seem to require what is called a ‘combined corpus approach’, including written and spoken comparable data as well as translation corpora. It is argued that eigenlijk and en fait are best described as adversatives, at the intersection of “opposition” and “reformulation” which constitute their basic meanings, and from which other meanings such as “causality”, “counterexpectation”, “enhancement” and “attenuation” can be inferred. Evidence from all sets of corpora moreover suggests that it is the semantic underspecification of en fait and (especially) eigenlijk which ultimately accounts for their high level of polysemy.
, Ylva Berglund
Corpus studies of language through time, Volume 4, pp 343-346;

Gerhard Leitner
Corpus studies of language through time, Volume 4, pp 336-341;

, Ylva Berglund, Jonathan Hope
Corpus studies of language through time, Volume 4, pp 189-196;

Michael Barlow
Corpus studies of language through time, Volume 4, pp 173-184;

The review describes the design and features associated with two Windows text analysis programs: a concordancer MonoConc and a parallel concordance program ParaConc. The general operation and potential uses of the two programs are briefly explored.
Hong Liang Qiao
Corpus studies of language through time, Volume 4, pp 113-135;

The paper discusses the design of a new computational model based on corpora—the Structural Boundary Model (SBM), particularly for the purpose of NLP. The Structural Boundary Model is constructed on the basis of parsed corpora. It consists of two main bodies, namely structural boundary data and CFG rules. The grammar supports parsing in a unique way by assigning structural boundary labels retrieved from a parsed corpus as a training corpus for the parser. Parsing experiments have demonstrated that the Structural Boundary Model is an appropriate novel computational model for parsing.
Inge De Monnink
Corpus studies of language through time, Volume 4, pp 77-111;

In this article I argue that, from a methodological point of view, descriptive studies improve considerably if they use a multi-method approach to the data, more specifically, if they use a combination of corpus data and experimental data. In the modern conception of corpus linguistics, intuitive data play an important role. The linguist formulates research hypotheses based on his or her intuitive knowledge. These hypotheses are then tested on the corpus data. I argue that a sound descriptive study should not end with simply stating the results from the corpus study. Instead, the corpus data have to be supplemented. An appropriate way to supplement corpus data is through the use of elicitation techniques. I illustrate the multi-method approach on a case study of floating postmodification in the English noun phrase.
Corpus studies of language through time, Volume 4, pp 53-75;

Corpus linguistics is being used for a wide range of research tasks. A database on English lexicology was compiled on the basis of the semantic, syntactic, and pragmatic information found in the stylistically marked lexical items of the COBUILD 1987. It offers linguistic information on 7981 units annotated as "formal" or "informal" in the COBUILD 1987. This database may be used for quantitative and qualitative analysis and for a critical evaluation of the classifications used in the COBUILD dictionary. The aim of this paper is basically descriptive: (i) some information is given on the linguistic information stored; (ii) some tentative conclusions are drawn. For example, if the COBUILD formality labels were assigned rigorously, it might be concluded (i) that "formality" is a skew system; (ii) that the "formal/informal" scale has an equidistribution; (iii) that emphasising informal adverbs are so numerous because they illustrate a tendency to exaggeration observed in informal, relaxed situations, in which interactants usually struggle to control the message.
Hilde Hasselgård, Jonathan Hope, Susan Pintzuk
Corpus studies of language through time, Volume 3, pp 349-352;

Juan C. Sager
Corpus studies of language through time, Volume 3, pp 335-338;

Luca Dini, Vittorio Di Tomaso
Corpus studies of language through time, Volume 3, pp 305-318;

Corpus linguistics and the development of commercial NLP applications are two tightly linked activities. It is hard to conceive fast development of high quality applications without proper tools for inspecting the corpora pertaining to the application domain. At the same time, it is hard to conceive reliable corpus analysis tools that do not satisfy the standards of software engineering. In the present paper, we will prove the validity of such a concept by showing how application development at CELI benefited from corpus-oriented tools and how these corpus-oriented tools have been produced as a by-product of the technology developed for real applications.
T. Hennoste, Mare Koit, T. Roosmaa, M. Saluveer
Corpus studies of language through time, Volume 3, pp 279-304;

This paper provides an overview of the first computer corpus of the Estonian language compiled at the University of Tartu. It was based on the design principles of the LOB and Brown corpora. The main part of the corpus was assembled between 1991 and 1995 and contains about 1 million textual words. It was compiled by an interdepartmental computational linguistics research group of the university. This paper gives a survey of the text groups in the corpus and of the problems the compilers had to solve together with the proposed solutions and outlines the main differences from the model corpora and the underlying reasons for them. These are followed by a review of the available computer routines for processing the corpus.
Salvador Valera, Alfonso Rizo Rodriguez
Corpus studies of language through time, Volume 3, pp 251-278;

One of the various forms that the expression of attribution may take in English is through a supplementive clause, a reduced structure realized by an adjective phrase hypotactically connected with a superordinate clause. The construction under study exhibits an attributive character in that the adjective predicates about the NP subject, but also possesses an adverbial import in so far as it expresses diverse circumstances relating to the main clause. This kind of structure is, however, not entirely free of constraints; in fact, not every adjective may combine with a matrix verb, and certain semantic patterns can be observed to occur recurrently in these constructions. This paper surveys a substantial number of adjectives from the LOB corpus for the identification of the semantic profile proper to supplementive adjectives.
Corpus studies of language through time, Volume 3, pp 229-249;

This paper describes a computer program which performs a particular type of grammatical/syntactic analysis—the assigning of structural boundaries between orthographic words in written English text. The Boundary Marker has been designed, in principle, as an analyser of unrestricted text and has been developed by using, as far as possible, authentic text as data for analysis. This paper first presents a brief overview of boundary marking as a method of syntactic analysis. It then describes how the program processes text and reports on the analysis of 10 000 words of text from the media. The paper concludes with a discussion of the advantages of a tightly focused analytic tool such as the Boundary Marker.
Corpus studies of language through time, Volume 3, pp 211-228;

The paper highlights and discusses some practical issues related to the drawbacks and pitfalls of computerised texts in regard to both databases themselves and the software employed to codify and search them. In the first place, some corpora and databases are compiled in such a way as to be searched and analysed by means of tools which allow only specific kinds of search to be made. This often prevents scholars from carrying out their own free study of the data, thus hindering an effective, targeted analysis. Moreover, in some cases, the need for comprehensiveness leads to the codification and classification of subjective aspects like the text difficulty and the participants' social level. This subjectivity of interpretation might mislead the researchers in a socially-orientated analysis. Finally, despite being highly sophisticated, the techniques employed for automated grammatical and part-of-speech tagging as well as for semantic and prosodic parsing appear not to be totally reliable, since mistakes in the codification of simple items are likely to occur. Each of the above thorny issues, together with some other minor matters, is illustrated with instances drawn from the author's personal linguistic research on a variety of synchronic and diachronic corpora and databases.
Jan Aarts, Hans van Halteren, Nelleke Oostdijk
Corpus studies of language through time, Volume 3, pp 189-210;

The article discusses the role of linguistic annotation in corpus linguistics as opposed to annotation in natural language processing. In corpus linguistics, annotation is an integral part of the process of linguistic interpretation and description of the data. Tagging and parsing are discussed as the automatic counterparts of, respectively, the paradigmatic and the syntagmatic description of corpus data. The requirements for a corpus linguistic annotation system are considered. An account is given of the TOSCA analysis system as representative of such an annotation system. Performance results of the system are given, and an evaluation is made.
Hilde Hasselgård, Juhani Klemola, Susan Pintzuk, Jonathan Hope
Corpus studies of language through time, Volume 3, pp 181-187;

Maria Angeles Gomes Gonsalez
Corpus studies of language through time, Volume 3, pp 81-113;

This corpus-based study reformulates Halliday's (1994: 55) notion of Multiple Theme, i.e., textual and/or interpersonal items occurring before a simple Topical Theme (or clause initial transitivity/mood element) (e.g., Well, but then, Ann, surely, wouldn't the best idea be to join the group?) (cf. Berry 1982, 1995; Lautamatti 1978; Young 1980; Vasconcellos 1992). Firstly, the label Extended Multiple Theme is here proposed as a cover-term for Topical Themes co-occurring with pre-topical and/or post-topical textual and/or interpersonal elements. And secondly, Extended Multiple Themes are suggested to: (i) allow for recursiveness within the three functional slots; (ii) tend to abide by Dik's (1989: 342) Principle of Centripetal Organisation; and (iii) substantiate the layering hypothesis posited for example in Dik's Functional Grammar or in Role and Reference Grammar (cf. Hengeveld 1989; Van Valin Jr. 1993). These claims were deduced from the application of three multivariate statistical tests, namely, the Logistic Regression Technique, Fisher's Exact Test, and the χ² Test, to the tokens of Extended Multiple Themes found in real Present-day English texts, that is to say, in the Lancaster Spoken English Corpus.
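Two of the tests named in this abstract, Fisher's Exact Test and the chi-square test, can be illustrated on a single 2x2 contingency table. The sketch below uses only the Python standard library; the function names and the cell counts are invented for the example and are not taken from the study, and logistic regression is left out.

```python
from math import comb

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lowest = max(0, row1 + col1 - n)
    highest = min(row1, col1)
    # The small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows = two text types, columns = feature present / absent.
chi2 = chi_square_2x2(8, 2, 1, 5)
p_value = fisher_exact_2x2(8, 2, 1, 5)
```

With small expected cell counts, as in this toy table, the exact test is usually preferred over the chi-square approximation.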
Murat Bayraktar, Bilge Say,
Corpus studies of language through time, Volume 3, pp 33-57;

Punctuation has usually been ignored by researchers in computational linguistics over the years. Recently, it has been realized that a true understanding of written language will be impossible if punctuation marks are not taken into account. This paper contains the details of a computer-aided exercise to investigate English punctuation practice for the special case of the comma (the most significant punctuation mark) in a parsed corpus. The study classifies the various "structural" uses of the comma according to the syntactic patterns in which a comma occurs. The corpus (Penn Treebank) consists of syntactically annotated sentences with no part-of-speech tag information about the individual words.
Qiang Zhou, Shiwen Yu
Corpus studies of language through time, Volume 2, pp 239-258;

In recent years, great progress has been made in Chinese corpus processing. A fifty-million-word Chinese National Corpus project has been put into effect, and many automatic corpus processing programs have also been developed. In this paper, we will briefly introduce our work on constructing a large scale annotated corpus for Chinese grammatical research and developing a Chinese Corpus Multilevel Processing system (CCMP). First, we present our annotation scheme. Second, we discuss some basic methodologies for Chinese corpus analysis and propose a man-machine mutually dependent corpus processing model. Finally, we give an overview of our CCMP. We hope our work will give impetus to further research in Chinese corpus linguistics.
František Čermák
Corpus studies of language through time, Volume 2, pp 181-197;

Against the background of some of the major linguistic problems which demand our attention and which should point to some badly-needed criteria, the brief history and structure of the Czech National Corpus is outlined. The points seen as open include differences between various languages in their degree of explicitness, form-function relation, ellipsis, etc. It is argued that a more general and language-independent approach is necessary to handle, among other things, the multi-word units of the text; a general corpus maintenance and query system available to the increasing number of would-be users is required, too. The particular Czech solution, still being worked out and gradually implemented, is described in some detail.
Maria D. Lopez Maestre
Corpus studies of language through time, Volume 4, pp 299-330;

In this paper, we present and discuss a computer programme designed for the linguistic annotation and processing of corpora of Block Language (headlines, proverbs, graffiti, advertising headlines, cinema titles, etc.) in English. LINDA BL 1.0 (LINGUISTIC DIGITAL ASSISTANT FOR THE ANALYSIS OF BLOCK LANGUAGE version 1.0) was designed at the University of Murcia (Spain) to enable the user to study linguistic variation in the sentence structure of Block Language texts from a stylistic point of view and with reference to the social-semiotic environment of the context of situation of these varieties of language.
Ruslan Mitkov
Corpus studies of language through time, Volume 4, pp 261-280;

The paper proposes a methodology for the semi-automatic annotation of pronoun-antecedent pairs in corpora. The proposal is based on robust, knowledge-poor pronoun resolution followed by post-editing. The paper is structured as follows. The introduction comments on the fact that automatic identification of referential links in corpora has lagged behind in comparison with similar lexical, syntactical, and even semantic tasks. The second section of the paper outlines the author's robust, knowledge-poor approach to pronoun resolution which will subsequently be put forward as the core of a larger architecture proposed for the automatic tagging of referential links. Section 3 briefly presents other related knowledge-poor approaches, while Section 4 discusses the limitations and advantages of the knowledge-poor approach outlined in Section 2. The main argument of the paper is to be found in Section 5, which presents the idea of developing a semi-automatic environment for annotating anaphoric links and outlines the components of such a program. Finally, the conclusion looks at the anticipated success rate of the approach.
Corpus studies of language through time, Volume 17, pp 259-286;

Recent studies have sought to understand individuals’ motivations for terrorism through terrorist material content. To date, these studies have not capitalised on automated language analysis techniques, particularly those of corpus linguistics. In this paper, we demonstrate how applying three corpus-linguistic techniques to extremist statements can provide insights into their ideology. Our data consisted of 250 statements (approximately 500,000 words) promoting terrorist violence. Using the online software tool WMatrix, we submitted these data to frequency count, key word and key concept, and concordance analyses. Results showed that authors centre their rhetoric on themes of morality, social proof, inspiration and appeals to religion, and refer to the world via contrasting concepts, suggesting a polarised way of thinking compared with general-population usage. Additionally, we show how collocation can aid the establishment of networks between people and places. We discuss how such analyses might support the formulation of evidence-based counter-terrorism strategies.
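Key word analysis of the kind mentioned in this abstract typically compares a word's frequency in a study corpus with its frequency in a reference corpus. A minimal sketch of one common keyness measure, the log-likelihood statistic, follows. Whether it matches the settings used in the paper is an assumption, and the function name and all frequencies are invented for illustration.

```python
from math import log

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) keyness of a word in a study vs. a reference corpus."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * log(observed / expected)
    return 2 * ll

# Hypothetical counts: a word occurring 50 times in a 1,000-word study corpus
# and 100 times in a 10,000-word reference corpus.
overused = log_likelihood(50, 1000, 100, 10000)
# Same relative frequency in both corpora, so no keyness is expected.
neutral = log_likelihood(10, 1000, 100, 10000)
```

The statistic is non-negative whichever corpus overuses the word, so the direction of the difference has to be read from the relative frequencies themselves.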
