AUTHORSHIP ATTRIBUTION BASED ON FEATURE SET SUBSPACING ENSEMBLES

1 October 2006

journal article
Published by World Scientific Pub Co Pte Ltd in International Journal on Artificial Intelligence Tools

Vol. 15 (5), 823-838
https://doi.org/10.1142/s0218213006002965

Abstract

Authorship attribution can assist criminal investigations as well as cybercrime analysis. The task can be viewed as a single-label, multi-class text categorization problem. Given that the style of a text can be represented simply by the frequencies of words selected by a language-independent method, machine learning techniques that can handle high-dimensional feature spaces and sparse data can be applied directly to this problem. This paper focuses on classifier ensembles based on feature set subspacing. It is shown that an effective ensemble can be constructed using exhaustive disjoint subspacing, a simple method that produces many poor but diverse base classifiers. The simple model can be enhanced by a variation of the technique of cross-validated committees applied to the feature set. Experiments on two benchmark text corpora demonstrate the effectiveness of the presented method, which improves on previously reported results, and compare it to support vector machines, an alternative machine learning approach well suited to authorship attribution.
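
The abstract does not specify the base learner, the subset size, or the rule for combining base classifiers, so the following is only a minimal sketch of the general idea of exhaustive disjoint subspacing under stated assumptions: the word-frequency features are randomly partitioned into disjoint subsets that together cover the whole feature set, one weak classifier is trained per subset, and their class scores are averaged. The nearest-centroid base learner, the mean combination rule, and all function names here are illustrative stand-ins, not the method as implemented in the paper.

```python
import math
import random
from collections import defaultdict

def disjoint_subspaces(n_features, subset_size, seed=0):
    """Randomly partition feature indices into disjoint subsets covering every feature."""
    idx = list(range(n_features))
    random.Random(seed).shuffle(idx)
    return [idx[i:i + subset_size] for i in range(0, n_features, subset_size)]

def fit_centroids(X, y, features):
    """Stand-in base learner (an assumption): per-class mean vector over `features` only."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
        counts[label] += 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def class_scores(centroids, row, features):
    """Softmax over negative squared distances, giving one normalized score per class."""
    neg = {c: -sum((row[f] - m) ** 2 for f, m in zip(features, mu))
           for c, mu in centroids.items()}
    top = max(neg.values())
    exp = {c: math.exp(v - top) for c, v in neg.items()}
    total = sum(exp.values())
    return {c: e / total for c, e in exp.items()}

def fit_ensemble(X, y, subset_size, seed=0):
    """Train one base classifier per disjoint feature subset."""
    subsets = disjoint_subspaces(len(X[0]), subset_size, seed)
    return [(feats, fit_centroids(X, y, feats)) for feats in subsets]

def predict(ensemble, row):
    """Combine base classifiers by summing their scores (equivalent to the mean rule)."""
    totals = defaultdict(float)
    for feats, centroids in ensemble:
        for c, s in class_scores(centroids, row, feats).items():
            totals[c] += s
    return max(totals, key=totals.get)

# Toy word-frequency vectors for two hypothetical authors, "A" and "B".
X = [[5, 0, 4, 1], [4, 1, 5, 0], [0, 5, 1, 4], [1, 4, 0, 5]]
y = ["A", "A", "B", "B"]
ensemble = fit_ensemble(X, y, subset_size=2)
```

With four features and a subset size of two, the sketch builds two base classifiers, each seeing half of the features; calling `predict(ensemble, text_vector)` returns the author label with the highest combined score. Each base classifier is deliberately weak because it sees only a small slice of the feature set, and the ensemble relies on their diversity.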


Cited by 32 articles