Automatic Line Segmentation and Ground-Truth Alignment of Handwritten Documents

1 September 2014

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

No. 21676445, p. 667-672
https://doi.org/10.1109/icfhr.2014.117

Abstract

In this paper, we present a method for the automatic line segmentation and transcript alignment of documents for which the transcript is available only at the document level. We consider several line segmentation hypotheses, and several recognition hypotheses for each segmented line. The recognition is highly constrained by the document transcript. We formalize the problem in a weighted finite-state transducer framework. We evaluate how the constraints help achieve a reasonable result. In particular, we assess the performance of the system in terms of both segmentation quality and transcript mapping. The main contribution of this paper is that we jointly find the best segmentation and transcript mapping, which together align the image with the whole ground-truth text. The evaluation is carried out on fully annotated public databases. Furthermore, we used this system to retrieve training material for the Maurdor evaluation, where the data was annotated only at the paragraph level. With the automatically segmented and annotated lines, we record a relative improvement in Word Error Rate of 35.6%.
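
The transcript-mapping idea in the abstract can be illustrated with a much simpler stand-in than the paper's weighted finite-state transducer formulation. The sketch below is not the authors' method: it assumes a single, fixed, ordered line segmentation and one recognition hypothesis per line, and uses plain dynamic programming to split the document-level transcript into contiguous chunks, one per line, that minimize the total word-level edit distance to the recognized text. The names `edit_distance` and `align_transcript` are hypothetical and chosen only for this illustration.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # drop a word of a
                           cur[j - 1] + 1,            # drop a word of b
                           prev[j - 1] + (x != y)))   # match or substitute
        prev = cur
    return prev[-1]


def align_transcript(transcript, line_hyps):
    """Map a document transcript onto ordered line hypotheses.

    Assumption (not from the paper): the segmentation is fixed and each
    line has exactly one recognition hypothesis, given as a token list.
    Returns (total_cost, chunks), where chunks[k] is the slice of the
    transcript assigned to line k.
    """
    n, m = len(transcript), len(line_hyps)
    inf = float("inf")
    # best[k][i]: minimal cost of mapping the first i words to the first k lines
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0
    for k in range(1, m + 1):
        for i in range(n + 1):
            for j in range(i + 1):
                if best[k - 1][j] == inf:
                    continue
                cost = best[k - 1][j] + edit_distance(transcript[j:i],
                                                      line_hyps[k - 1])
                if cost < best[k][i]:
                    best[k][i] = cost
                    back[k][i] = j
    # Trace back the chunk boundaries from the last line to the first.
    chunks, i = [], n
    for k in range(m, 0, -1):
        j = back[k][i]
        chunks.append(transcript[j:i])
        i = j
    chunks.reverse()
    return best[m][n], chunks


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    hyps = [h.split() for h in ("the quick brwn", "fox jumps over", "teh lazy dog")]
    cost, chunks = align_transcript(words, hyps)
    for chunk in chunks:
        print(" ".join(chunk))
```

In this toy setting the ground-truth words end up attached to the lines despite the recognition errors. The joint search over multiple segmentation and recognition hypotheses described in the abstract goes beyond what this sketch covers.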

Cited by 16 articles