OCR Text Extraction

  • Alan Jiju, Student, Department of IT, Fr. Conceicao Rodrigues Institute of Technology, Vashi, Navi Mumbai, India
  • Shaun Tuscano, Student, Department of IT, Fr. Conceicao Rodrigues Institute of Technology, Vashi, Navi Mumbai, India
  • Chetana Badgujar, Assistant Professor, Department of IT, Fr. Conceicao Rodrigues Institute of Technology, Vashi, Navi Mumbai, India
Keywords: OpenCV, Optical Character Reader (OCR), Tesseract, Document Detection

Abstract

This research seeks a methodology for extracting data from everyday printed bills and invoices. The extracted data can later be put to extensive use, for example in machine learning or statistical analysis. The work focuses on extracting the final bill amount, itinerary, date and similar fields, since bills and invoices encapsulate ample information about a user's purchases, likes and dislikes. Optical Character Recognition (OCR) is a technology that recognizes printed or handwritten alphanumeric characters in images. First, OpenCV is used to detect the bill or invoice in the image and to filter out unnecessary noise. The intermediate image is then passed to the Tesseract OCR engine for further processing. Tesseract applies text segmentation in order to extract written text in various fonts and languages. Our methodology proved highly accurate when tested on a variety of input images of bills and invoices.
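
As an illustration of the final step described above, the following is a minimal sketch of how the bill amount and date might be pulled out of text that an OCR engine such as Tesseract has already produced. It is not the authors' implementation: the regular expressions, the function names `extract_total` and `extract_date`, the sample receipt, and the DD/MM/YYYY date convention are all assumptions made for illustration, and real bills would need many more patterns.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical patterns (assumptions, not from the paper).
# A labelled amount such as "Total: Rs. 1,234.50" or "Amount due 99".
AMOUNT_RE = re.compile(
    r"\b(?:grand\s+total|total|amount\s+due|net\s+amount)\b"
    r"[^\d\n]*(\d[\d,]*(?:\.\d{1,2})?)",
    re.IGNORECASE,
)
# A numeric date, assumed to be in DD/MM/YYYY or DD-MM-YYYY order.
DATE_RE = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b")


def extract_total(text: str) -> Optional[float]:
    """Return the last labelled total in the text, or None if none is found."""
    matches = AMOUNT_RE.findall(text)
    if not matches:
        return None
    # Intermediate totals may appear first; treat the last match as final.
    return float(matches[-1].replace(",", ""))


def extract_date(text: str) -> Optional[date]:
    """Return the first valid DD/MM/YYYY date in the text, or None."""
    for day, month, year in DATE_RE.findall(text):
        try:
            return date(int(year), int(month), int(day))
        except ValueError:
            # Skip digit groups that do not form a real calendar date.
            continue
    return None


# Made-up OCR output standing in for the text returned by the OCR step.
sample = """SUPER MART
Date: 14/03/2021
Rice 5kg      450.00
Milk 1L        52.00
Subtotal      502.00
GST 5%         25.10
Total: Rs. 527.10
"""

print(extract_total(sample))
print(extract_date(sample))
```

Taking the last labelled match is a simple heuristic for preferring the final amount over intermediate totals; it would misfire on receipts that print further "total" lines (for example, "Total items") after the amount, so a production system would likely need layout cues as well.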

Published
2021-04-30
How to Cite
Alan Jiju, Shaun Tuscano, & Chetana Badgujar. (2021). OCR Text Extraction. International Journal of Engineering and Management Research, 11(2), 83-86. https://doi.org/10.31033/ijemr.11.2.11