Feature selection for text categorization on imbalanced data

1 June 2004

journal article
Published by Association for Computing Machinery (ACM) in ACM SIGKDD Explorations Newsletter

Vol. 6 (1), 80-89
https://doi.org/10.1145/1007730.1007741

Abstract

A number of feature selection metrics have been explored in text categorization, among which information gain (IG), chi-square (CHI), correlation coefficient (CC) and odds ratio (OR) are considered the most effective. CC and OR are one-sided metrics, while IG and CHI are two-sided. Feature selection using one-sided metrics selects only the features most indicative of membership, while feature selection using two-sided metrics implicitly combines the features most indicative of membership (i.e. positive features) and of non-membership (i.e. negative features) by ignoring the signs of the feature scores. The former never consider the negative features, which are quite valuable, while the latter cannot ensure the optimal combination of the two kinds of features, especially on imbalanced data. In this work, we investigate the usefulness of explicitly controlling that combination within a proposed feature selection framework. Using multinomial naïve Bayes and regularized logistic regression as classifiers, our experiments show both the great potential and the actual merits of explicitly combining positive and negative features in a nearly optimal fashion tailored to the imbalanced data.
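The distinction between one-sided and two-sided metrics can be illustrated with a minimal sketch. It uses the standard contingency-table definitions of CC and CHI (CHI is the square of CC, so it discards the sign) and is not the framework proposed in the article, whose details the abstract does not give. The function names, the `positive_fraction` parameter, and the toy counts are assumptions made for illustration only.

```python
import math

def correlation_coefficient(a, b, c, d):
    """Signed CC score of a term for a category.

    a: in-category docs containing the term
    b: out-of-category docs containing the term
    c: in-category docs without the term
    d: out-of-category docs without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate; no evidence either way
    return math.sqrt(n) * (a * d - b * c) / math.sqrt(denom)

def chi_square(a, b, c, d):
    """Two-sided CHI score: the square of CC, so the sign is lost."""
    return correlation_coefficient(a, b, c, d) ** 2

def select_features(counts, size, positive_fraction):
    """Pick `size` terms with an explicit positive/negative split.

    counts maps each term to its (a, b, c, d) tuple.
    positive_fraction is a hypothetical knob: 1.0 mimics a one-sided
    metric, smaller values admit negative features explicitly.
    """
    scores = {t: correlation_coefficient(*abcd) for t, abcd in counts.items()}
    n_pos = int(round(size * positive_fraction))
    n_neg = size - n_pos
    positives = sorted((t for t in scores if scores[t] > 0),
                       key=lambda t: -scores[t])[:n_pos]
    negatives = sorted((t for t in scores if scores[t] < 0),
                       key=lambda t: scores[t])[:n_neg]
    return positives, negatives

# Toy imbalanced collection: 10 in-category docs, 90 out-of-category docs.
counts = {
    "goal":  (8, 2, 2, 88),    # mostly in-category: positive feature
    "match": (6, 5, 4, 85),    # weaker positive feature
    "stock": (0, 45, 10, 45),  # never in-category: negative feature
    "the":   (10, 90, 0, 0),   # in every doc: uninformative
}
positives, negatives = select_features(counts, size=2, positive_fraction=0.5)
```

With `positive_fraction=1.0` the selection reduces to a one-sided ranking by CC; ranking by `chi_square` instead would mix positive and negative terms in whatever proportion their squared scores happen to produce, which is the implicit combination the abstract describes.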

Cited by 392 articles