Distributional clustering of words for text classification

1 August 1998

conference paper
Published by Association for Computing Machinery (ACM)

p. 96-103
https://doi.org/10.1145/290941.290970

Abstract

This paper describes the application of Distributional Clustering (20) to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionality-reduction techniques, such as Latent Semantic Indexing, we are able to compress the feature space much more aggressively, while still maintaining high document classification accuracy. Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy, significantly better than Latent Semantic Indexing (6), class-based clustering (1), feature selection by mutual information (23), or Markov-blanket-based feature selection (13). We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering.
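
The core idea in the abstract, grouping words whose class-label distributions look alike, can be illustrated with a small sketch. The abstract does not spell out the similarity measure or the merging schedule, so the choices here are assumptions made for illustration only: a count-weighted Jensen-Shannon divergence between the per-word distributions P(class | word), and a greedy agglomerative loop that repeatedly merges the closest pair. The function names (`class_distributions`, `js_divergence`, `cluster_words`) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def class_distributions(docs):
    """docs: list of (tokens, label) pairs.

    Returns (dists, totals), where dists[word] is a list holding
    P(label | word) for each label in sorted order, and totals[word]
    is the number of occurrences of the word.
    """
    counts = defaultdict(Counter)
    for tokens, label in docs:
        for tok in tokens:
            counts[tok][label] += 1
    labels = sorted({label for _, label in docs})
    dists, totals = {}, {}
    for word, c in counts.items():
        total = sum(c.values())
        totals[word] = total
        dists[word] = [c[lab] / total for lab in labels]
    return dists, totals


def kl(p, q):
    """Kullback-Leibler divergence, skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q, wp, wq):
    """Jensen-Shannon divergence with weights proportional to wp and wq
    (an assumed similarity measure for this sketch)."""
    a = wp / (wp + wq)
    b = 1.0 - a
    mean = [a * pi + b * qi for pi, qi in zip(p, q)]
    return a * kl(p, mean) + b * kl(q, mean)


def cluster_words(docs, n_clusters):
    """Greedily merge the two closest clusters until n_clusters remain."""
    dists, totals = class_distributions(docs)
    # Each cluster: (member words, class distribution, occurrence count).
    clusters = [([w], dists[w], totals[w]) for w in sorted(dists)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = js_divergence(clusters[i][1], clusters[j][1],
                                  clusters[i][2], clusters[j][2])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        words_i, p_i, n_i = clusters[i]
        words_j, p_j, n_j = clusters[j]
        # The merged distribution is the count-weighted average.
        merged = [(n_i * a + n_j * b) / (n_i + n_j) for a, b in zip(p_i, p_j)]
        clusters[i] = (words_i + words_j, merged, n_i + n_j)
        del clusters[j]
    return [sorted(c[0]) for c in clusters]


if __name__ == "__main__":
    toy_docs = [
        (["goal", "match", "goal"], "sport"),
        (["match", "team"], "sport"),
        (["vote", "election"], "politics"),
        (["election", "vote", "party"], "politics"),
    ]
    print(cluster_words(toy_docs, 2))
```

Once words are clustered, each document can be represented by counts over clusters instead of counts over individual words, which is where the reduction in feature dimensionality comes from. The quadratic pair search above is only suitable for toy vocabularies.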

This publication has 9 references indexed in Scilit:

Elements of Information Theory
Published by Wiley, 2001
Threading electronic mail: A preliminary study
Information Processing & Management, 1997
On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality
Data Mining and Knowledge Discovery, 1997
On the Optimality of the Simple Bayesian Classifier under Zero-One Loss
Machine Learning, 1997
Noise reduction in a statistical approach to text categorization
Published by Association for Computing Machinery (ACM), 1995
Similarity-based estimation of word cooccurrence probabilities
Published by Association for Computational Linguistics (ACL), 1994
Distributional clustering of English words
Published by Association for Computational Linguistics (ACL), 1993
Indexing by latent semantic analysis
Journal of the American Society for Information Science, 1990
Nearest neighbor pattern classification
IEEE Transactions on Information Theory, 1967

Cited by 310 articles