OCFS
- 15 August 2005
- conference paper
- Published by Association for Computing Machinery (ACM) in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '05
- p. 122-129
- https://doi.org/10.1145/1076034.1076058
Abstract
Text categorization is an important research area in many Information Retrieval (IR) applications. To save storage space and computation time in text categorization, efficient and effective algorithms for reducing the data before analysis are highly desirable. Traditional techniques for this purpose can generally be classified into feature extraction and feature selection. Because it is more efficient, the latter is more suitable for text data such as web documents. However, many popular feature selection techniques, such as Information Gain (IG) and the χ²-test (CHI), are greedy in nature and thus may not be optimal according to any given criterion. Moreover, the performance of these greedy methods may deteriorate when the retained data dimension is extremely low. In this paper, we propose an efficient optimal feature selection algorithm, called Orthogonal Centroid Feature Selection (OCFS), which optimizes the objective function of the Orthogonal Centroid (OC) subspace learning algorithm in a discrete solution space. Experiments on 20 Newsgroups (20NG), Reuters Corpus Volume 1 (RCV1) and Open Directory Project (ODP) data show that OCFS is consistently better than IG and CHI and requires less computation time, especially when the reduced dimension is extremely small.
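The abstract does not spell out the selection rule, so the following is only a minimal sketch under a stated assumption: that optimizing the OC objective over a discrete solution space amounts to scoring each feature by the class-size-weighted squared difference between each class centroid and the global centroid, then keeping the k highest-scoring features. The function names `ocfs_scores` and `ocfs_select` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def ocfs_scores(X, y):
    """Score each feature (assumed OCFS-style criterion).

    X is a list of equal-length numeric rows, y a list of class labels.
    score[i] = sum over classes j of (n_j / n) * (centroid_j[i] - centroid[i]) ** 2
    """
    n = len(X)
    d = len(X[0])
    # Centroid of the whole data set.
    global_centroid = [sum(row[i] for row in X) / n for i in range(d)]

    # Group the rows by class label.
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)

    scores = [0.0] * d
    for rows in groups.values():
        n_j = len(rows)
        class_centroid = [sum(r[i] for r in rows) / n_j for i in range(d)]
        for i in range(d):
            scores[i] += (n_j / n) * (class_centroid[i] - global_centroid[i]) ** 2
    return scores


def ocfs_select(X, y, k):
    """Return the indices of the k highest-scoring features, in ascending order."""
    scores = ocfs_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy term-document data: features 0 and 1 separate the classes, feature 2 does not.
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = ["a", "a", "b", "b"]
selected = ocfs_select(X, y, 2)
```

Under this assumption each feature is scored independently in a single pass over the data, which would be consistent with the abstract's claim of low computation time; the exact objective and any normalization should be checked against the paper itself.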