A technical study and analysis of text classification techniques in N - Lingual documents

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE) in 2016 International Conference on Computer Communication and Informatics (ICCCI)

Abstract

There is currently high demand for accurate text identification and categorization methods for N-lingual documents, both non-scanned and scanned machine-printed, where N denotes mono-, bi-, tri- or multilingual mode. This paper presents a technical study and analysis of N-lingual document classification for normal text, printed and handwritten documents. Text classification for normal text documents is simple, whereas for scanned machine-printed documents it inherently begins with the correct recognition of the text, i.e., its characters and words. The steps involved in the latter case are script identification, page layout determination, separation of text and non-text data, line segmentation, word detection and, finally, character recognition. After these processing steps, the text or script is identified and separated. Three statistically analyzed charts are also shown, based on content type classification, language-mode pair and the most-to-least preferred languages of existing algorithms.
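The abstract does not say which algorithms are used for each step. As an illustration only, the line segmentation and word detection steps can be sketched with projection profiles, a common technique for binarized page images. Everything in the sketch is an assumption made for the example and is not taken from the paper: the binary image representation (1 = ink, 0 = background), the function names `runs_of_ink`, `segment_lines` and `segment_words`, and the `min_gap` threshold.

```python
from typing import List, Tuple

# Assumed representation: a binarized page as rows of 0/1 pixels (1 = ink).
Image = List[List[int]]


def runs_of_ink(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of stretches where profile > 0.

    Stretches separated by fewer than `min_gap` empty positions are merged,
    so that narrow gaps between characters do not split a word.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a run of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # the run ended just before i
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Line segmentation: rows with no ink separate consecutive text lines."""
    row_profile = [sum(row) for row in image]
    return runs_of_ink(row_profile)


def segment_words(image: Image, line: Tuple[int, int],
                  min_gap: int = 2) -> List[Tuple[int, int]]:
    """Word detection inside one line: column gaps of at least `min_gap`
    empty columns are treated as word boundaries (hypothetical threshold)."""
    top, bottom = line
    width = len(image[0]) if image else 0
    col_profile = [sum(image[r][c] for r in range(top, bottom))
                   for c in range(width)]
    return runs_of_ink(col_profile, min_gap)


# Toy 5x8 page with two text lines.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
]
lines = segment_lines(page)              # [(1, 3), (4, 5)]
words = segment_words(page, lines[0])    # [(0, 4), (6, 8)]
```

In the toy page, the one-column gap inside the first line is treated as inter-character spacing and merged, while the two-column gap is kept as a word boundary. A real system would first need binarization, skew correction and the separation of text from non-text regions, and the gap threshold would depend on script and font size.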


Cited by 11 articles