End-to-End Trainable Thai OCR System Using Hidden Markov Models
- 1 September 2008
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 607-614
- https://doi.org/10.1109/das.2008.76
Abstract
In this paper we present an end-to-end trainable optical character recognition (OCR) system for recognizing machine-printed text in Thai documents. The end-to-end OCR system is based on a script-independent methodology using hidden Markov models. Our system provides an integrated workflow, from annotation and transcription of training images to performing OCR on new images with models trained on the transcribed training images. The efficacy of our end-to-end OCR system is demonstrated by rapidly configuring our OCR engine for the Thai script. We present experimental results on Thai documents to highlight the specific challenges posed by the Thai script and analyze the recognition performance as a function of the amount of training data.
This publication has 7 references indexed in Scilit:
- Robust Page Segmentation Based on Smearing and Error Correction Unifying Top-down and Bottom-up Approaches. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), 2007
- Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 2005
- The BBN Byblos Hindi OCR system. Published by SPIE-Intl Soc Optical Eng, 2005
- Thai OCR: a neural network application. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Script-independent, HMM-based text line finding for OCR. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Multilingual Machine Printed OCR. International Journal of Pattern Recognition and Artificial Intelligence, 2001
- The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1993