Script based text identification
- 17 September 2011
- conference paper
- Published by Association for Computing Machinery (ACM) in Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data - MOCR_AND '11
Abstract
Script identification in a multi-lingual document environment has numerous applications in the field of document image analysis, such as indexing and retrieval, or as an initial step towards optical character recognition. In this paper, we propose a novel hierarchical framework for script identification in bi-lingual documents. The framework takes a top-down approach, performing page, block/paragraph and word level script identification in multiple stages. We utilize texture and shape based information embedded in the documents at different levels for feature extraction. The prediction task at each level of the hierarchy is performed by a Support Vector Machine (SVM) and a rejection based classifier defined using AdaBoost. Experimental evaluation of the proposed concept on document collections of Hindi/English and Bangla/English scripts has shown promising results.
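The top-down flow described in the abstract can be illustrated with a minimal sketch. It assumes that a "rejection" at one level means the region is handed down to the next finer level (page, then block, then word); the abstract does not spell out this rule. The names `Region`, `toy_classifier`, `identify` and the 0.8 threshold are hypothetical, and a toy scoring function stands in for the SVM and AdaBoost classifiers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: features -> (script label, confidence in [0, 1]).
Classifier = Callable[[List[float]], Tuple[str, float]]


@dataclass
class Region:
    level: str                     # "page", "block", or "word"
    features: List[float]          # stand-in for texture/shape features
    children: List["Region"] = field(default_factory=list)


def toy_classifier(features: List[float]) -> Tuple[str, float]:
    """Toy stand-in for a trained classifier (not the paper's model)."""
    score = sum(features) / len(features)   # 1.0 = clearly Hindi, 0.0 = clearly English
    label = "Hindi" if score >= 0.5 else "English"
    confidence = abs(score - 0.5) * 2.0     # 0.0 at the decision boundary
    return label, confidence


def identify(
    region: Region,
    classifiers: Dict[str, Classifier],
    threshold: float = 0.8,                 # assumed rejection threshold
) -> List[Tuple[str, str]]:
    """Return (level, label) decisions, descending only where a level rejects."""
    label, confidence = classifiers[region.level](region.features)
    # Accept when confident, or when there is no finer level to fall back on.
    if confidence >= threshold or not region.children:
        return [(region.level, label)]
    # Rejected: treat the region as mixed-script and decide on its parts.
    decisions: List[Tuple[str, str]] = []
    for child in region.children:
        decisions.extend(identify(child, classifiers, threshold))
    return decisions
```

In this sketch a single-script page is labelled once at page level, while a mixed page is resolved block by block and, where a block is itself ambiguous, word by word. This keeps the finer and more expensive stages for the regions that need them.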
This publication has 13 references indexed in Scilit:
- Word level multi-script identification. Pattern Recognition Letters, 2008
- Bangla/English Script Identification Based on Analysis of Connected Component Profiles. Lecture Notes in Computer Science, 2006
- Script Identification Based on Morphological Reconstruction in Document Images. Published by Institute of Electrical and Electronics Engineers (IEEE), 2006
- Texture for script identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005
- Robust Real-Time Face Detection. International Journal of Computer Vision, 2004
- Script and language identification from document images. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Neural network based system for script identification in Indian documents. Sādhanā, 2002
- Adaptive, quadratic preprocessing of document images for binarization. IEEE Transactions on Image Processing, 1998
- Rotation invariant texture features and their use in automatic script identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998
- Determination of the script and language content of document images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997