Text segmentation of machine-printed Gurmukhi script

Abstract
This paper describes a scheme for text segmentation of machine-printed Gurmukhi script documents. There has been extensive research on text segmentation of machine-printed Roman script documents. In contrast, there has been very little reported research on text segmentation of Indian language scripts in general, and of Gurmukhi script in particular. Text segmentation of Gurmukhi script faces major problems arising from the unique characteristics of the script: connectivity of characters along the headline; two or more characters in a word whose minimum bounding rectangles intersect along the horizontal direction; multi-component characters; touching characters, which are present even in clean documents; and horizontally overlapping text segments. The proposed method uses the horizontal projection profile to successively divide the text area into smaller sub-areas, or horizontal strips, each of which contains (1) a set of text lines, (2) a single text line, or (3) sub-parts of text lines. Using the vertical projection profile, the horizontal strips are then split into smaller units such as words, characters, or sub-characters, depending on the type of strip. Finally, each of these units is segmented into a set of connected components. A classifier is trained to recognize these connected components, which are later merged to form one or more characters.
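The three stages named in the abstract (horizontal projection, vertical projection, connected components) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a binarized page given as a list of rows with 1 for ink and 0 for background, splits only at completely empty rows and columns, and uses 8-connectivity. It does not classify strip types, and it does not handle the headline or touching characters. All function names (`horizontal_profile`, `vertical_profile`, `ink_runs`, `count_components`, `segment`) are hypothetical.

```python
from collections import deque


def horizontal_profile(img, rows):
    """Ink pixels per row, for rows in the half-open range `rows`."""
    return [sum(img[r]) for r in range(rows[0], rows[1])]


def vertical_profile(img, rows):
    """Ink pixels per column, counted only inside the given row range."""
    width = len(img[0])
    return [sum(img[r][c] for r in range(rows[0], rows[1])) for c in range(width)]


def ink_runs(profile):
    """Half-open (start, end) intervals where the profile is non-zero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def count_components(img, rows, cols):
    """Number of 8-connected ink components inside a rectangular unit."""
    (r0, r1), (c0, c1) = rows, cols
    seen, count = set(), 0
    for r in range(r0, r1):
        for c in range(c0, c1):
            if not img[r][c] or (r, c) in seen:
                continue
            count += 1
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = r0 <= ny < r1 and c0 <= nx < c1
                        if inside and img[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
    return count


def segment(img):
    """Return (row_range, col_range, n_components) for every unit found."""
    units = []
    full_height = (0, len(img))
    for strip in ink_runs(horizontal_profile(img, full_height)):
        for cols in ink_runs(vertical_profile(img, strip)):
            units.append((strip, cols, count_components(img, strip, cols)))
    return units


# Toy page: the top-left unit has two separate components, loosely
# mimicking a multi-component character.
page = [
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
]
for unit in segment(page):
    print(unit)
```

On real Gurmukhi text the headline joins the characters of a word, so an empty-column test like the one above would typically isolate words rather than characters. That is presumably why the described method chooses the split according to the type of strip.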