Abstract
Demand is high for accurate text identification and categorization methods for N-lingual documents, both non-scanned and scanned machine-printed, where N denotes the mono-, bi-, tri-, or multilingual mode. This paper presents a technical study and analysis of N-lingual document classification for normal text, printed, and handwritten documents. Classifying normal text documents is simple, whereas for scanned machine-printed documents classification inherently begins with the correct recognition of the text, i.e., its characters and words. The steps involved in the latter case are script identification, page layout determination, separation of text and non-text data, line segmentation, word detection, and finally character recognition. After these processing steps, the text or script is identified and separated. Three statistical charts are also presented, covering content type classification, language-mode pairs, and the most-to-least preferred languages of existing algorithms.
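
As an illustration of the non-scanned case only, the following minimal Python sketch estimates the N-lingual mode of a plain-text string. It is not a method from the paper: it uses the Unicode script of each letter as a rough proxy for language, and the function names and the min_share threshold are hypothetical choices made for this example. Scanned documents would first need the recognition steps listed above.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Return a coarse script label for a letter, or None for non-letters.

    Assumption: the first word of the Unicode character name
    (e.g. "LATIN", "CYRILLIC", "DEVANAGARI") is a usable script label.
    """
    if not ch.isalpha():
        return None
    try:
        name = unicodedata.name(ch)
    except ValueError:  # character has no Unicode name
        return None
    return name.split()[0]


def script_counts(text):
    """Count the letters in text per script label."""
    return Counter(s for s in map(script_of, text) if s is not None)


def lingual_mode(text, min_share=0.1):
    """Classify text as mono, bi, tri, or multi by the scripts it contains.

    min_share is a hypothetical threshold: a script is counted only if it
    accounts for at least this fraction of all letters, so that a stray
    foreign character does not change the mode.
    """
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return "unknown", []
    scripts = sorted(s for s, n in counts.items() if n / total >= min_share)
    labels = {1: "mono", 2: "bi", 3: "tri"}
    return labels.get(len(scripts), "multi"), scripts
```

Calling lingual_mode on a string returns the mode label together with the sorted list of detected scripts. Because several languages share one script (English and French both use Latin letters), this proxy undercounts languages; distinguishing them would need a language-level classifier.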
