Enhanced hypertext categorization using hyperlinks
- 1 June 1998
- journal article
- Published by Association for Computing Machinery (ACM) in ACM SIGMOD Record
- Vol. 27 (2), 307-318
- https://doi.org/10.1145/276305.276332
Abstract
A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.
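The abstract names relaxation labeling over a small link neighborhood but gives no details. The following Python is only a generic illustration of that idea, not the authors' model: the function name `relaxation_labeling`, the class compatibility table `compat`, the toy documents `d1` to `d3`, and all probabilities are hypothetical. Each unlabeled document starts from a text-only class estimate, which is then repeatedly reweighted by the current class beliefs of its linked neighbors; documents with known topics stay fixed.

```python
def relaxation_labeling(text_probs, links, known, compat, iterations=10):
    """Generic relaxation-labeling sketch (assumed form, not the paper's model).

    text_probs: {doc: {cls: P(cls | text of doc)}} from a text-only classifier
    links:      {doc: [neighbor docs]} link neighborhood of each document
    known:      {doc: cls} documents whose topic is already known
    compat:     {cls: {neighbor_cls: weight}} hypothetical link compatibility
    """
    classes = list(compat)
    nodes = set(text_probs) | set(known)

    # Start from text-only beliefs; known documents get a fixed one-hot belief.
    beliefs = {}
    for node in nodes:
        if node in known:
            beliefs[node] = {c: 1.0 if c == known[node] else 0.0 for c in classes}
        else:
            beliefs[node] = dict(text_probs[node])

    for _ in range(iterations):
        updated = {}
        for node in nodes:
            if node in known:
                updated[node] = beliefs[node]
                continue
            scores = {}
            for c in classes:
                score = text_probs[node][c]
                # Reweight by how well class c agrees with each neighbor's belief.
                for nb in links.get(node, ()):
                    score *= sum(beliefs[nb][c2] * compat[c][c2] for c2 in classes)
                scores[c] = score
            total = sum(scores.values())
            # Keep the previous belief if every class score collapsed to zero.
            updated[node] = (
                {c: s / total for c, s in scores.items()} if total > 0 else beliefs[node]
            )
        beliefs = updated
    return beliefs


# Toy data (made up): d3's text slightly favors "B", but both neighbors are "A".
compat = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
text_probs = {"d3": {"A": 0.4, "B": 0.6}}
links = {"d3": ["d1", "d2"]}
known = {"d1": "A", "d2": "A"}

beliefs = relaxation_labeling(text_probs, links, known, compat)
best_label = max(beliefs["d3"], key=beliefs["d3"].get)
print(best_label, beliefs["d3"])
```

In this toy setup the neighbors' known topics outweigh the weak text evidence for `d3`. With fewer known neighbors the update leans more on the text-only estimate, which loosely mirrors the abstract's remark that the technique adapts to the fraction of neighboring documents with known topics.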