A Hierarchical Model for Clustering and Categorising Documents
- 14 March 2002
- book chapter
- conference paper
- Published by Springer Science and Business Media LLC in Lecture Notes in Computer Science
- p. 229-247
- https://doi.org/10.1007/3-540-45886-7_16
Abstract
We propose a new hierarchical generative model for textual data, where words may be generated by topic-specific distributions at any level in the hierarchy. This model is naturally well-suited to clustering documents in preset or automatically generated hierarchies, as well as categorising new documents in an existing hierarchy. Training algorithms are derived for both cases, and illustrated on real data by clustering news stories and categorising newsgroup messages. Finally, the generative model may be used to derive a Fisher kernel expressing similarity between documents.
This publication has 13 references indexed in Scilit:
- What Is Quantum Information Retrieval? Lecture Notes in Computer Science, 2011
- Text classification in a hierarchical mixture model for small training sets. Association for Computing Machinery (ACM), 2001
- Probabilistic latent semantic indexing. Association for Computing Machinery (ACM), 1999
- Text categorization with Support Vector Machines: Learning with many relevant features. Lecture Notes in Computer Science, 1998
- Distributional clustering of English words. Association for Computational Linguistics (ACL), 1993
- A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 1992
- Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990
- Statistical mechanics and phase transitions in clustering. Physical Review Letters, 1990
- Recent trends in hierarchic document clustering: A critical review. Information Processing & Management, 1988
- Distributional Structure. WORD, 1954