A technical study and analysis of text classification techniques in N - Lingual documents
- 1 January 2016
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in 2016 International Conference on Computer Communication and Informatics (ICCCI)
Abstract
There is currently high demand for accurate text identification and categorization methods for N-lingual documents, both non-scanned and scanned machine-printed, where N denotes the mono-, bi-, tri- or multilingual mode. This paper presents a technical study and analysis of N-lingual document classification for normal text, printed and handwritten documents. Text classification for normal text documents is simple, whereas for scanned machine-printed documents it inherently begins with the correct recognition of text, i.e., characters and words. The steps involved in the latter case are script identification, page layout determination, separation of text and non-text data, line segmentation, word detection and finally character recognition. After these processing steps, the text or script is identified and separated. Three statistical charts are also presented, covering content type classification, language-mode pairs and the most-to-least preferred languages of existing algorithms.
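Two of the steps named in the abstract, line segmentation and word detection, can be illustrated with a minimal sketch. The abstract does not say which method the surveyed systems use; the sketch below assumes projection profiles, a common textbook approach, applied to an already binarized page (1 = ink, 0 = background). The names `runs`, `segment_lines` and `segment_words`, and the `min_gap` parameter, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # binarized page: 1 = ink pixel, 0 = background


def runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans of non-empty stretches in a profile.

    Stretches separated by fewer than `min_gap` empty positions are merged.
    """
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    gap = 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the span just after the last position that held ink.
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, len(profile) - gap))
    return spans


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Split a page into text lines using the horizontal projection profile."""
    row_profile = [sum(row) for row in image]
    return runs(row_profile, min_gap=1)


def segment_words(image: Image, line: Tuple[int, int],
                  min_gap: int = 2) -> List[Tuple[int, int]]:
    """Split one text line into words using the vertical projection profile.

    `min_gap` (an assumed tuning parameter) is the smallest run of blank
    columns treated as a word boundary; narrower gaps are taken to be
    spacing between characters of the same word.
    """
    top, bottom = line
    width = len(image[0]) if image else 0
    col_profile = [sum(image[r][c] for r in range(top, bottom))
                   for c in range(width)]
    return runs(col_profile, min_gap=min_gap)


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 1, 1, 0],
        [1, 1, 0, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 1, 0, 0, 0, 0],
    ]
    for line in segment_lines(page):
        print(line, segment_words(page, line))
```

Real scanned pages would first need binarization, skew correction and the text/non-text separation mentioned in the abstract; scripts with connecting headlines or stacked diacritics may also need script-specific handling that this sketch does not attempt.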