Automatic Line Segmentation and Ground-Truth Alignment of Handwritten Documents
- 1 September 2014
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 21676445, p. 667-672
- https://doi.org/10.1109/icfhr.2014.117
Abstract
In this paper, we present a method for the automatic segmentation and transcript alignment of documents for which the transcript is available only at the document level. We consider several line segmentation hypotheses, and several recognition hypotheses for each segmented line. The recognition is highly constrained by the document transcript. We formalize the problem in a weighted finite-state transducer framework and evaluate how the constraints help achieve a reasonable result. In particular, we assess the performance of the system in terms of both segmentation quality and transcript mapping. The main contribution of this paper is that we jointly find the best segmentation and transcript mapping, which together align the image with the whole ground-truth text. The evaluation is carried out on fully annotated public databases. Furthermore, we used this system to retrieve training material for the Maurdor evaluation, where the data was annotated only at the paragraph level. With the automatically segmented and annotated lines, we record a relative improvement in Word Error Rate of 35.6%.
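The joint search described in the abstract can be illustrated with a small sketch. This is not the paper's weighted finite-state transducer formulation: it substitutes a plain dynamic program over word-level edit distance, and it assumes each segmentation hypothesis comes with a single recognized string per line. The names `edit_distance`, `align`, and `best_segmentation` are hypothetical and chosen only for this illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # drop a word of a
                           cur[j - 1] + 1,           # add a word of b
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def align(words, line_outputs):
    """Split `words` into len(line_outputs) consecutive chunks so that the
    summed edit distance to the recognized lines is minimal.
    Returns (total_cost, chunks)."""
    n, m = len(words), len(line_outputs)
    inf = float("inf")
    # best[k][i]: minimal cost of mapping the first i words to the first k lines
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0
    for k in range(1, m + 1):
        recognized = line_outputs[k - 1].split()
        for i in range(n + 1):
            for j in range(i + 1):
                if best[k - 1][j] == inf:
                    continue
                cost = best[k - 1][j] + edit_distance(words[j:i], recognized)
                if cost < best[k][i]:
                    best[k][i] = cost
                    back[k][i] = j
    # Follow the back-pointers to recover the transcript chunk of each line.
    chunks, i = [], n
    for k in range(m, 0, -1):
        j = back[k][i]
        chunks.append(" ".join(words[j:i]))
        i = j
    chunks.reverse()
    return best[m][n], chunks


def best_segmentation(transcript, segmentations):
    """Choose the segmentation hypothesis whose recognized lines align to the
    document transcript at the lowest total cost.
    Returns (cost, index_of_hypothesis, transcript_chunk_per_line)."""
    words = transcript.split()
    scored = []
    for idx, lines in enumerate(segmentations):
        cost, chunks = align(words, lines)
        scored.append((cost, idx, chunks))
    return min(scored, key=lambda item: item[0])
```

Given a document transcript and a list of segmentation hypotheses, where each hypothesis is a list of recognized line strings in reading order, `best_segmentation` returns the cheapest hypothesis together with the transcript text assigned to each of its lines. A transducer-based system can additionally weigh several recognition hypotheses per line, which this sketch leaves out.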