An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages
- 1 July 2000
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 160-167
- https://doi.org/10.1145/345508.345569
Abstract
The growing problem of unsolicited bulk e-mail, also known as “spam”, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in “encrypted” form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns and is part of a widely used e-mail reader.