An Efficient Word Segmentation Technique for Historical and Degraded Machine-Printed Documents

1 September 2007

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE) in Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)

Vol. 1 (15205363), 178-182
https://doi.org/10.1109/icdar.2007.4378699

Abstract

Word segmentation is a crucial step for segmentation-free document analysis systems and is used for creating an index based on word matching. In this paper, we propose a novel methodology for word segmentation in historical and degraded machine-printed documents. The proposed technique addresses problems such as text of different sizes, text and non-text areas lying very close together, and non-straight or warped text lines. It is based on: (i) a dynamic run length smoothing algorithm that helps group homogeneous text regions together, (ii) removal of noise and punctuation marks, together with obstacle detection, to facilitate the segmentation process, and (iii) a draft text line estimation procedure that guides the final word segmentation result. Tests on numerous historical and degraded machine-printed documents show that our methodology performs better than current state-of-the-art word segmentation techniques for such documents.
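
To illustrate the run length smoothing idea named in the abstract, the following is a minimal sketch of the classic fixed-threshold horizontal variant, not the paper's dynamic algorithm. The abstract does not say how the dynamic version chooses its threshold, so a single `max_gap` parameter is assumed here. The function names `rlsa_horizontal` and `word_spans` are hypothetical and chosen only for this example. Pixels are 1 for foreground (ink) and 0 for background.

```python
from typing import List, Tuple


def rlsa_horizontal(row: List[int], max_gap: int) -> List[int]:
    """Fill background runs of length <= max_gap that lie between two
    foreground pixels. Leading and trailing background is left untouched."""
    out = list(row)
    last_fg = -1  # index of the most recent foreground pixel, -1 if none yet
    for i, px in enumerate(row):
        if px == 1:
            gap = i - last_fg - 1
            if last_fg >= 0 and 0 < gap <= max_gap:
                # Short gap between two foreground pixels: merge them.
                for j in range(last_fg + 1, i):
                    out[j] = 1
            last_fg = i
    return out


def word_spans(row: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each foreground run."""
    spans = []
    start = None
    for i, px in enumerate(row):
        if px == 1 and start is None:
            start = i
        elif px == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(row)))
    return spans


# One pixel row: single-pixel gaps separate characters, the wide gap
# separates words (an assumption made for this toy example).
row = [1, 0, 1, 0, 0, 0, 1, 1, 0, 1]
smoothed = rlsa_horizontal(row, max_gap=1)
print(smoothed)
print(word_spans(smoothed))
```

With `max_gap=1`, the single-pixel gaps inside each group are filled while the three-pixel gap survives, so the row splits into two candidate word spans. A fixed threshold like this fails when text size varies across the page, which is presumably the motivation for a dynamic threshold.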

