An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages

1 July 2000

conference paper
Published by Association for Computing Machinery (ACM)

p. 160-167
https://doi.org/10.1145/345508.345569

Abstract

The growing problem of unsolicited bulk e-mail, also known as “spam”, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in “encrypted” form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures and investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a keyword-pattern filter that is part of a widely used e-mail reader.
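
To make the two kinds of filter mentioned in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes binary word-presence attributes, Laplace smoothing, and a probability threshold as a simple stand-in for cost sensitivity; none of these details come from the abstract. The function names (`train`, `spam_probability`, `is_spam`, `keyword_filter`) and the toy messages are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; no lemmatizer or stop list)."""
    return text.lower().split()


def train(messages):
    """Count, per class, how many messages contain each word.

    messages: list of (text, label) pairs, label being "spam" or "legit".
    """
    doc_counts = Counter()
    word_counts = {"spam": Counter(), "legit": Counter()}
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(set(tokenize(text)))  # presence, not frequency
    vocab = set(word_counts["spam"]) | set(word_counts["legit"])
    return doc_counts, word_counts, vocab


def spam_probability(model, text):
    """Posterior P(spam | message) under a Bernoulli Naive Bayes model."""
    doc_counts, word_counts, vocab = model
    total = sum(doc_counts.values())
    present = set(tokenize(text))
    log_post = {}
    for label in ("spam", "legit"):
        lp = math.log(doc_counts[label] / total)  # class prior
        for word in vocab:
            # Laplace-smoothed P(word present | class)
            p = (word_counts[label][word] + 1) / (doc_counts[label] + 2)
            lp += math.log(p if word in present else 1.0 - p)
        log_post[label] = lp
    # Normalize in log space to avoid underflow.
    top = max(log_post.values())
    odds = {k: math.exp(v - top) for k, v in log_post.items()}
    return odds["spam"] / (odds["spam"] + odds["legit"])


def is_spam(model, text, threshold=0.5):
    """Block only when the posterior exceeds the threshold.

    A higher threshold makes blocking a legitimate message less likely;
    this is an assumed, simplified form of cost sensitivity.
    """
    return spam_probability(model, text) > threshold


def keyword_filter(text, patterns):
    """Baseline: flag a message if any hand-written keyword occurs in it."""
    lowered = text.lower()
    return any(pattern in lowered for pattern in patterns)


# Toy training data, invented for illustration only.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for tomorrow", "legit"),
    ("lunch tomorrow at noon", "legit"),
]
MODEL = train(TRAINING)
```

With this toy model, `is_spam(MODEL, "free money")` flags the message at the default threshold, while a very strict threshold such as `0.999` lets it through, which illustrates the trade-off a cost-sensitive evaluation has to weigh.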

Cited by 209 articles