Segmentation of mathematical equations from document images is already a major research area for improving the performance of OCR systems. Although chemical equations share spatial properties with non-chemical equations (for example, mathematical equations), their segmentation remains largely unexplored. This paper presents a novel method for segmenting chemical and other equations from heterogeneous document images that may contain graphics, tables, and text, and then classifying them into two categories: chemical and non-chemical equations.
To the best of our knowledge, this study is the first of its kind. It not only improves OCR performance but also leads to the creation of chemical databases and the formation of bond electron matrices from chemical equations or formulae. In the proposed method, equations are extracted using morphological operators and histogram analysis, and the extracted equations are classified using an open-source OCR engine. The effectiveness of the proposed method is demonstrated by testing it on 152 document images. Test results show an accuracy of 97.4% for segmentation and 97.45% for classification.
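As an illustration of the two extraction ingredients named above, the following minimal sketch shows a horizontal binary dilation (a simple morphological operator) and a row projection histogram used to locate horizontal bands of ink. It is not the procedure used in this paper, whose structuring elements and thresholds are not given here; the function names, the `radius` and `min_count` parameters, and the toy image are all assumptions made for illustration.

```python
def dilate_horizontal(image, radius=1):
    """Binary dilation with a 1 x (2*radius + 1) structuring element.

    Illustrative only: merges nearby symbols on the same row into one blob.
    """
    out = []
    for row in image:
        n = len(row)
        out.append([
            1 if any(row[max(0, x - radius):min(n, x + radius + 1)]) else 0
            for x in range(n)
        ])
    return out


def row_profile(image):
    """Row projection histogram: number of foreground pixels per row."""
    return [sum(row) for row in image]


def find_bands(profile, min_count=1):
    """Return (start, end) row ranges, end exclusive, with profile >= min_count."""
    bands = []
    start = None
    for i, count in enumerate(profile):
        if count >= min_count and start is None:
            start = i                      # a band of ink begins
        elif count < min_count and start is not None:
            bands.append((start, i))       # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the bottom edge
        bands.append((start, len(profile)))
    return bands


# Toy binary "page" (1 = ink): two bands of content separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

bands = find_bands(row_profile(page))
dilated = dilate_horizontal(page, radius=1)
print(bands)
print(dilated[4])
```

In a sketch like this, each detected band would be a candidate line region; deciding whether a band is an equation, and whether it is chemical or non-chemical, is a separate step that this example does not attempt.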