Dictionary-Based Bilingual Web Page Classification

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE) in 2008 4th International Conference on Wireless Communications, Networking and Mobile Computing

p. 1-4
https://doi.org/10.1109/wicom.2008.2684

Abstract

Web page classification poses new research challenges because web pages are noisy. For bilingual Chinese-English web pages, an additional problem is how to extract the terms of each language accurately. This paper proposes a new dictionary-based multilingual text categorization approach that aims to classify Chinese-English web pages in a specific domain into a hierarchical topic structure more accurately. The approach recognizes and integrates web page encodings by means of an automatic encoding detection and integration method, which makes feature extraction more precise for multilingual pages. It also intensifies the domain concepts in the web pages by using a domain dictionary. Experimental results show that the proposed approach performs better than the traditional classification method when classifying bilingual web pages.
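The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names: normalizing page encodings before feature extraction, and giving extra weight to terms found in a domain dictionary. Everything in it is an assumption made for illustration, not the paper's method: the candidate encoding list, the try-in-order detection, the substring matching of Chinese dictionary entries, the boost factor, and the names `decode_page`, `count_terms`, and `boost_domain_terms`.

```python
import re
from collections import Counter

# Assumed candidate encodings for Chinese-English pages (illustrative only).
CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "big5")


def decode_page(raw: bytes) -> str:
    """Decode raw page bytes to Unicode by trying each candidate in order.

    This is a crude stand-in for real encoding detection: byte sequences
    that are valid in several encodings are decoded by the first match.
    """
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be read and mark the rest.
    return raw.decode("utf-8", errors="replace")


def count_terms(text: str, dictionary: set[str]) -> Counter:
    """Count English words and dictionary-listed Chinese terms in the text."""
    # English terms: runs of ASCII letters, lowercased.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Chinese terms: count occurrences of each non-ASCII dictionary entry.
    for entry in dictionary:
        if not entry.isascii():
            occurrences = text.count(entry)
            if occurrences:
                counts[entry] += occurrences
    return counts


def boost_domain_terms(counts: Counter, dictionary: set[str],
                       factor: float = 2.0) -> dict[str, float]:
    """Multiply the count of every dictionary term by an assumed factor."""
    return {
        term: count * (factor if term in dictionary else 1.0)
        for term, count in counts.items()
    }


if __name__ == "__main__":
    domain_dictionary = {"网络", "network"}  # hypothetical domain dictionary
    raw_page = "网络 network protocol 网络".encode("gb18030")
    text = decode_page(raw_page)
    weights = boost_domain_terms(count_terms(text, domain_dictionary),
                                 domain_dictionary)
    print(weights)
```

The resulting weights could serve as input features for any standard text classifier. A production system would more likely rely on a dedicated charset detector and a proper Chinese word segmenter than on the fallbacks shown here.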
