Ensemble Classification System for Scientific Chart Recognition from PDF Files
- 1 October 2012
- journal article
- Published by IGI Global in International Journal of Computer Vision and Image Processing
- Vol. 2 (4), 1-10
- https://doi.org/10.4018/ijcvip.2012100101
Abstract
Portable Document Format (PDF) is the most frequently used universal document format on the Internet and in e-publishing. The wide usage of PDF files has increased the need for tools that convert PDF file content to text or HTML formats. A PDF converter's work can be categorized into two domains, namely text recognition and graphics recognition. This paper focuses on graphics recognition, especially chart type identification, which is concerned with developing algorithms that can determine the type of a given chart image from a PDF file. In the proposed system, an enhanced connected-component and statistical-feature-based method is first used to separate the chart region from other regions. The chart region is then analyzed and classed as either a 2-dimensional or a 3-dimensional chart. After the graphic component is separated from the text components, feature extraction is performed. The features can be grouped as object features, texture features and shape features. The combined feature vector is then classified using an ensemble classification system. Experimental results show that the chart separation, feature extraction and ensemble classification models significantly improve the quality of chart identification.
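
The abstract does not say how the ensemble combines its members, so the following is only a minimal sketch of one common scheme, majority voting, and not the paper's method. The three-element feature vector (counts of rectangle-like, arc-like and polyline-like components), the three rule-based base classifiers and the labels "bar", "pie" and "line" are all hypothetical and chosen purely for illustration.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical combined feature vector (not from the paper):
# (rectangle-like components, arc-like components, polyline-like components)
FeatureVector = Sequence[float]
Classifier = Callable[[FeatureVector], str]


def rule_rectangles(f: FeatureVector) -> str:
    # Hypothetical base classifier: many rectangles suggest a bar chart.
    return "bar" if f[0] >= 3 else "line"


def rule_arcs(f: FeatureVector) -> str:
    # Hypothetical base classifier: any arc suggests a pie chart.
    return "pie" if f[1] >= 1 else "bar"


def rule_polylines(f: FeatureVector) -> str:
    # Hypothetical base classifier: polylines without rectangles suggest a line chart.
    return "line" if f[2] >= 1 and f[0] == 0 else "bar"


def majority_vote(classifiers: Sequence[Classifier], features: FeatureVector) -> str:
    """Return the label predicted by the largest number of base classifiers.

    Ties are resolved in favour of the label that was voted for first,
    which is how Counter.most_common orders equal counts.
    """
    votes = Counter(clf(features) for clf in classifiers)
    label, _count = votes.most_common(1)[0]
    return label


CLASSIFIERS = [rule_rectangles, rule_arcs, rule_polylines]

if __name__ == "__main__":
    print(majority_vote(CLASSIFIERS, (5, 0, 0)))
    print(majority_vote(CLASSIFIERS, (0, 0, 2)))
```

Majority voting is used here only because it is the simplest combiner to show; weighted voting or stacking would fit the same interface by replacing `majority_vote`.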