The effect of author set size and data size in authorship attribution

13 August 2010

journal article
Published by Oxford University Press (OUP) in Literary and Linguistic Computing

Vol. 26 (1), 35-55
https://doi.org/10.1093/llc/fqq013

Abstract

Applications of authorship attribution 'in the wild' [Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation. Advanced Access published January 12, 2010, doi:10.1007/s10579-009-9111-2], for instance in social networks, will likely involve large sets of candidate authors and only limited data per author. In this article, we present the results of a systematic study of two important parameters in supervised machine learning that significantly affect performance in computational authorship attribution: (1) the number of candidate authors (i.e. the number of classes to be learned), and (2) the amount of training data available per candidate author (i.e. the size of the training data). We also investigate the robustness of different types of lexical and linguistic features to the effects of author set size and data size. The approach we take is an operationalization of the standard text categorization model, using memory-based learning to discriminate between the candidate authors. We performed authorship attribution experiments on a set of three benchmark corpora in which the influence of topic could be controlled. The short text fragments of e-mail length present the approach with a true challenge. Results show that, as expected, authorship attribution accuracy deteriorates as the number of candidate authors increases and the size of the training data decreases, although the machine learning approach continues to perform significantly above chance. Some feature types (most notably character n-grams) are robust to changes in author set size and data size, but no robust individual features emerge.
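The setup described in the abstract can be illustrated with a minimal sketch. It assumes character trigram counts as features and a plain 1-nearest-neighbour lookup with cosine similarity as a simple stand-in for memory-based learning; it is not the authors' actual system, features, or corpora. All function names and the toy texts are hypothetical. The `chance_accuracy` helper shows why a larger candidate set lowers the baseline that "above chance" is measured against.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a text fragment."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(fragment, training, n=3):
    """Return the author of the most similar stored training fragment.

    `training` is a list of (author, text) pairs kept in memory as-is,
    which is the basic idea of memory-based (nearest-neighbour) learning.
    """
    query = char_ngrams(fragment, n)
    best_author, best_sim = None, -1.0
    for author, text in training:
        sim = cosine(query, char_ngrams(text, n))
        if sim > best_sim:
            best_author, best_sim = author, sim
    return best_author


def chance_accuracy(num_authors):
    """Expected accuracy of uniform random guessing among candidates."""
    return 1.0 / num_authors


# Toy, made-up training fragments (one per hypothetical author).
training = [
    ("author_a", "I reckon the weather shall hold until the harvest."),
    ("author_b", "lol gonna grab pizza l8r, u in??"),
]

predicted = attribute("u gonna come l8r??", training)
```

With two candidate authors, random guessing is right half the time; with twenty, `chance_accuracy(20)` drops to 0.05. Any accuracy figure therefore has to be read against the author set size used.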

This publication has 18 references indexed in Scilit.

Cited by 75 articles