An Efficient Word Segmentation Technique for Historical and Degraded Machine-Printed Documents
- 1 September 2007
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)
- Vol. 1 (ISSN 1520-5363), pp. 178-182
- https://doi.org/10.1109/icdar.2007.4378699
Abstract
Word segmentation is a crucial step for segmentation-free document analysis systems and is used for creating an index based on word matching. In this paper, we propose a novel methodology for word segmentation in historical and degraded machine-printed documents. The proposed technique addresses problems such as text of different sizes, text and non-text areas lying very close to each other, and non-straight or warped text lines. It is based on: (i) a dynamic run length smoothing algorithm that helps group together homogeneous text regions, (ii) noise and punctuation mark removal, together with obstacle detection, in order to facilitate the segmentation process, and (iii) a draft text line estimation procedure that guides the final word segmentation result. Tests on numerous historical and degraded machine-printed documents show that our methodology performs better than current state-of-the-art word segmentation techniques for such documents.
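The abstract does not specify how the dynamic run length smoothing algorithm works, so the following is only a minimal sketch of the classical, fixed-threshold horizontal run length smoothing that such methods build on. The function names `rlsa_horizontal` and `foreground_runs`, the list-of-lists image representation, and the single global `threshold` are assumptions made for illustration; they are not taken from the paper, whose variant adapts the smoothing dynamically.

```python
from typing import List, Tuple

def rlsa_horizontal(image: List[List[int]], threshold: int) -> List[List[int]]:
    """Classical horizontal run length smoothing (illustrative, fixed threshold).

    image: binary rows, 1 = foreground (ink), 0 = background.
    A background run is filled only if it lies between two foreground
    pixels on the same row and its length is <= threshold.
    """
    out = [row[:] for row in image]  # copy so the input is left untouched
    for row in out:
        last_fg = None  # column of the most recent foreground pixel
        for x, value in enumerate(row):
            if value == 1:
                if last_fg is not None:
                    gap = x - last_fg - 1
                    if 0 < gap <= threshold:
                        for k in range(last_fg + 1, x):
                            row[k] = 1
                last_fg = x
    return out

def foreground_runs(row: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) column pairs, inclusive, of foreground runs in a row."""
    runs = []
    start = None
    for x, value in enumerate(row):
        if value == 1 and start is None:
            start = x
        elif value == 0 and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs

# Hypothetical one-row "text line": small gaps inside a word, a wide gap between words.
line = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]]
smoothed = rlsa_horizontal(line, threshold=3)
words = foreground_runs(smoothed[0])
```

With `threshold=3` the gaps of width 1 and 3 are filled while the gap of width 5 survives, so `words` holds two candidate word regions. A fixed threshold like this is exactly what breaks down when text size varies across the page, which is the situation the abstract says the dynamic variant is meant to handle.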
This publication has 9 references indexed in Scilit:
- A scale space approach for automatically segmenting words from historical handwritten documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005
- ICDAR2005 page segmentation competition. Published by Institute of Electrical and Electronics Engineers (IEEE), 2005
- Semantics-based content extraction in typewritten historical documents. Published by Institute of Electrical and Electronics Engineers (IEEE), 2005
- A segmentation-free approach for keyword search in historical typewritten documents. Published by Institute of Electrical and Electronics Engineers (IEEE), 2005
- Word shape recognition for image-based document retrieval. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Two Geometric Algorithms for Layout Analysis. Lecture Notes in Computer Science, 2002
- Use of adaptive segmentation in handwritten phrase recognition. Pattern Recognition, 2002
- A word extraction algorithm for machine-printed documents using a 3D neighborhood graph model. International Journal on Document Analysis and Recognition (IJDAR), 2001
- Block segmentation and text extraction in mixed text/image documents. Computer Graphics and Image Processing, 1982