Distributional clustering of words for text classification

Abstract
This paper describes the application of Distributional Clustering (20) to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionality-reduction techniques, such as Latent Semantic Indexing, we are able to compress the feature space much more aggressively, while still maintaining high document classification accuracy. Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy, significantly better than Latent Semantic Indexing (6), class-based clustering (1), feature selection by mutual information (23), or Markov-blanket-based feature selection (13). We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering.
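
As a rough illustration of clustering words by the distribution of class labels associated with each word, the following minimal Python sketch may help. The abstract does not say which similarity measure or merging procedure is used, so the weighted Jensen-Shannon divergence and the greedy pairwise merging below are assumptions made for illustration only. The function names (`class_distribution`, `weighted_js`, `cluster_words`) and the toy counts are hypothetical, and the sketch assumes every word occurs at least once in the labeled data.

```python
import math


def class_distribution(counts):
    """Normalize per-class occurrence counts for one word into P(class | word)."""
    total = sum(counts)
    return [c / total for c in counts]


def kl(p, q):
    """KL divergence D(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def weighted_js(counts_a, counts_b):
    """Weighted Jensen-Shannon divergence between two class-count vectors.

    Assumption: this is one plausible way to compare class-label
    distributions; the abstract does not name a specific measure.
    """
    na, nb = sum(counts_a), sum(counts_b)
    wa, wb = na / (na + nb), nb / (na + nb)
    pa, pb = class_distribution(counts_a), class_distribution(counts_b)
    mean = [wa * x + wb * y for x, y in zip(pa, pb)]
    return wa * kl(pa, mean) + wb * kl(pb, mean)


def cluster_words(word_class_counts, num_clusters):
    """Greedily merge the two most similar clusters until num_clusters remain.

    word_class_counts maps each word to a list of per-class counts.
    Assumption: simple O(n^2) agglomerative merging, chosen for clarity.
    """
    clusters = [([word], list(counts)) for word, counts in word_class_counts.items()]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = weighted_js(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged_words = clusters[i][0] + clusters[j][0]
        merged_counts = [a + b for a, b in zip(clusters[i][1], clusters[j][1])]
        clusters[i] = (merged_words, merged_counts)
        del clusters[j]
    return [sorted(words) for words, _ in clusters]


# Toy data (hypothetical): counts of each word in two classes, [sports, politics].
toy_counts = {
    "goal": [9, 1],
    "match": [8, 2],
    "vote": [1, 9],
    "senate": [0, 10],
}
word_clusters = cluster_words(toy_counts, 2)
```

In this toy setting the four words collapse into two features, one per cluster, and a classifier would then count cluster occurrences instead of individual word occurrences. This is how the feature space shrinks while class-relevant information is retained.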
