A Hierarchical Model for Clustering and Categorising Documents

14 March 2002

book chapter
conference paper
Published by Springer Science and Business Media LLC in Lecture Notes in Computer Science

pp. 229-247
https://doi.org/10.1007/3-540-45886-7_16

Abstract

We propose a new hierarchical generative model for textual data, where words may be generated by topic-specific distributions at any level in the hierarchy. This model is naturally well-suited to clustering documents in preset or automatically generated hierarchies, as well as categorising new documents in an existing hierarchy. Training algorithms are derived for both cases, and illustrated on real data by clustering news stories and categorising newsgroup messages. Finally, the generative model may be used to derive a Fisher kernel expressing similarity between documents.
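
To make the idea of words being generated at any level of the hierarchy concrete, the following is a minimal toy sketch, not the chapter's actual model or training algorithm. It assumes a document is assigned to a leaf class and that each word is drawn from the word distribution of some node on the path from that leaf up to the root. The hierarchy, the vocabulary, all probabilities, and the names PARENT, WORD_DIST, CLASS_PRIOR and generate_document are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy hierarchy: each node maps to its parent (the root has None).
PARENT = {"root": None, "sports": "root", "politics": "root"}

# Hypothetical topic-specific word distributions, one per node.
WORD_DIST = {
    "root": {"the": 0.5, "of": 0.5},
    "sports": {"goal": 0.6, "team": 0.4},
    "politics": {"vote": 0.7, "party": 0.3},
}

# Hypothetical prior over the leaf classes that documents belong to.
CLASS_PRIOR = {"sports": 0.5, "politics": 0.5}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def sample(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    weights = [dist[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate_document(n_words, rng):
    """Sample a leaf class, then draw each word from a node on its root path.

    Assumption: the level is picked uniformly along the path; a fitted
    model would learn these mixing weights instead.
    """
    leaf = sample(CLASS_PRIOR, rng)
    levels = path_to_root(leaf)
    words = []
    for _ in range(n_words):
        node = rng.choice(levels)  # which level of the hierarchy emits this word
        words.append(sample(WORD_DIST[node], rng))
    return leaf, words


if __name__ == "__main__":
    rng = random.Random(0)
    leaf, words = generate_document(10, rng)
    print(leaf, words)
```

In this sketch, general words come from the root and class-specific words from the leaf, which is the intuition behind letting upper levels of the hierarchy absorb vocabulary shared by several classes.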

Cited by 20 articles