Ensemble Classification System for Scientific Chart Recognition from PDF Files

Abstract
Portable Document Format (PDF) is the most frequently used universal document format on the Internet and in e-publishing. The wide use of PDF files has increased the need for tools that convert PDF content to text or HTML. PDF conversion can be divided into two domains: text recognition and graphics recognition. This paper focuses on graphics recognition, specifically chart type identification, which is concerned with developing algorithms that can determine the type of a given chart image from a PDF file. In the proposed system, an enhanced connected-component and statistical-feature-based method is first used to separate the chart region from other regions. The chart region is then analyzed and classified as either a 2-dimensional or a 3-dimensional chart. After the graphic components are separated from the text components, feature extraction is performed. The features fall into three groups: object features, texture features and shape features. The combined feature vector is then classified using an ensemble classification system. Experimental results show that the chart separation, feature extraction and ensemble classification models significantly improve the quality of chart identification.
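
The last two steps of the pipeline, combining the feature groups into one vector and classifying it with an ensemble, can be illustrated with a minimal sketch. The abstract does not say which base classifiers or which combination rule the paper uses, so simple majority voting is assumed here, and the names combine_features, majority_vote and constant_classifier are hypothetical, introduced only for illustration.

```python
from collections import Counter
from typing import Callable, Sequence

# A base classifier maps a feature vector to a chart-type label.
Classifier = Callable[[Sequence[float]], str]


def combine_features(
    object_features: Sequence[float],
    texture_features: Sequence[float],
    shape_features: Sequence[float],
) -> list:
    """Concatenate the three feature groups into one combined vector."""
    return [*object_features, *texture_features, *shape_features]


def majority_vote(classifiers: Sequence[Classifier], features: Sequence[float]) -> str:
    """Return the label predicted by the most base classifiers.

    Assumption: hard (majority) voting; the paper's actual rule may differ.
    Ties go to the label that was voted for first.
    """
    if not classifiers:
        raise ValueError("at least one base classifier is required")
    votes = Counter(clf(features) for clf in classifiers)
    return votes.most_common(1)[0][0]


def constant_classifier(label: str) -> Classifier:
    """Hypothetical stand-in for a trained model: always predicts `label`."""
    return lambda features: label
```

In practice each base classifier would be a trained model; the stand-ins here only make the voting step concrete. For example, majority_vote([constant_classifier("bar"), constant_classifier("pie"), constant_classifier("bar")], vector) selects "bar" because two of the three base classifiers agree.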
