Partitioned logistic regression for spam filtering

24 August 2008

conference paper
Published by Association for Computing Machinery (ACM)

p. 97-105
https://doi.org/10.1145/1401890.1401907

Abstract

Naive Bayes and logistic regression perform well in different regimes. The former is a very simple generative model that is efficient to train and performs well empirically in many applications, while the latter is a discriminative model that often achieves better accuracy and can be shown to outperform naive Bayes asymptotically. In this paper, we propose a novel hybrid model, partitioned logistic regression, which has several advantages over both naive Bayes and logistic regression. This model separates the original feature space into several disjoint feature groups. Individual models on these groups of features are learned using logistic regression, and their predictions are combined using the naive Bayes principle to produce a robust final estimate. We show that our model is better both theoretically and empirically. In addition, when applied to a practical application, email spam filtering, it improves the normalized AUC score at a 10% false-positive rate by 28.8% and 23.6% compared to naive Bayes and logistic regression, respectively, when using the exact same training examples.
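
The combination step described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper; it assumes the feature groups are conditionally independent given the class, in which case the combined log-odds is the sum of the per-group log-odds minus (K - 1) times the prior log-odds for K groups. The function names (`fit_logistic`, `fit_partitioned`, `predict_proba`), the gradient-descent trainer, and the hyperparameters are all hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for extreme log-odds.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_logistic(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit one logistic regression by batch gradient descent (illustrative trainer)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / n + l2 * w)  # log-loss gradient plus L2 penalty
        b -= lr * np.mean(p - y)
    return w, b

def fit_partitioned(X, y, groups):
    """Train one logistic regression per disjoint feature group.

    groups: list of lists of column indices (assumed disjoint).
    Returns the per-group (w, b) pairs and the prior log-odds of class 1.
    """
    prior = np.clip(y.mean(), 1e-6, 1.0 - 1e-6)
    prior_logit = np.log(prior / (1.0 - prior))
    models = [fit_logistic(X[:, g], y) for g in groups]
    return models, prior_logit

def predict_proba(X, groups, models, prior_logit):
    """Combine per-group log-odds under the naive Bayes independence assumption."""
    # Each group's posterior already includes the prior once,
    # so subtract the (K - 1) extra copies before summing.
    logit = -(len(groups) - 1) * prior_logit * np.ones(X.shape[0])
    for g, (w, b) in zip(groups, models):
        logit += X[:, g] @ w + b
    return sigmoid(logit)
```

With a single group containing every feature, the sketch reduces to plain logistic regression; with one feature per group, it behaves like a naive Bayes combination of one-dimensional discriminative models. How to choose the groups for spam filtering is not stated in the abstract and is left open here.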


Cited by 21 articles