OpenCV OCR and text recognition with Tesseract

Author: zengde Category: Technical Published: 2019-12-30 09:52

In this tutorial, you will learn how to apply OpenCV OCR (Optical Character Recognition). We will perform both (1) text detection and (2) text recognition using OpenCV, Python, and Tesseract.

A few weeks ago I showed you how to perform text detection using OpenCV’s EAST deep learning model. Using this model we were able to detect and localize the bounding box coordinates of text contained in an image.

The next step is to take each of these areas containing text and actually recognize and OCR the text using OpenCV and Tesseract.

To learn how to build your own OpenCV OCR and text recognition system, just keep reading!

OpenCV OCR and text recognition with Tesseract

To perform OpenCV OCR text recognition, we’ll first need to install Tesseract v4, which includes a highly accurate deep learning-based model for text recognition.

From there, I’ll show you how to write a Python script that:

  1. Performs text detection using OpenCV’s EAST text detector, a highly accurate deep learning text detector used to detect text in natural scene images.
  2. Extracts each of the detected text ROIs and passes them into Tesseract, enabling us to build an entire OpenCV OCR pipeline!
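To make the second step more concrete, here is a minimal sketch of the "extract each ROI and hand it to Tesseract" part. It is not the full script from this tutorial: it assumes the text detector has already produced boxes as (startX, startY, endX, endY) tuples and that the pytesseract package is installed. The helper names pad_box and ocr_regions, the padding parameter, and the default config string are hypothetical and introduced only for illustration.

```python
def pad_box(box, image_w, image_h, padding=0.0):
    """Grow a (startX, startY, endX, endY) box by a fraction of its own
    width/height on each side, clamped to the image bounds."""
    start_x, start_y, end_x, end_y = box
    dx = int((end_x - start_x) * padding)
    dy = int((end_y - start_y) * padding)
    return (
        max(0, start_x - dx),
        max(0, start_y - dy),
        min(image_w, end_x + dx),
        min(image_h, end_y + dy),
    )


def ocr_regions(image, boxes, padding=0.0, config="-l eng --oem 1 --psm 7"):
    """Crop each box out of `image` (a NumPy array, e.g. from cv2.imread)
    and run Tesseract on the crop. Returns a list of (box, text) pairs."""
    # Imported lazily so pad_box can be used without pytesseract installed.
    import pytesseract

    image_h, image_w = image.shape[:2]
    results = []
    for box in boxes:
        start_x, start_y, end_x, end_y = pad_box(box, image_w, image_h, padding)
        # NumPy slicing is rows (y) first, then columns (x).
        # OpenCV loads images as BGR; converting to RGB first is a common precaution.
        roi = image[start_y:end_y, start_x:end_x]
        text = pytesseract.image_to_string(roi, config=config)
        results.append(((start_x, start_y, end_x, end_y), text))
    return results
```

A little padding around each box can help when the detector clips the first or last character, which is why the sketch exposes it as a parameter.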

Finally, I’ll wrap up today’s tutorial by showing you some sample results of applying text recognition with OpenCV, as well as discussing some of the limitations and drawbacks of the method.

Let’s go ahead and get started with OpenCV OCR!

How to install Tesseract 4

Tesseract, a highly popular OCR engine, was originally developed by Hewlett Packard in the 1980s and was then open-sourced in 2005. Google adopted the project in 2006 and has been sponsoring it ever since.

If you’ve read my previous post on Using Tesseract OCR with Python, you know that Tesseract can work very well under controlled conditions…

…but will perform quite poorly if there is a significant amount of noise or your image is not properly preprocessed and cleaned before applying Tesseract.

Just as deep learning has impacted nearly every facet of computer vision, the same is true for character recognition and handwriting recognition.

Deep learning-based models have managed to obtain unprecedented text recognition accuracy, far beyond traditional feature extraction and machine learning approaches.

It was only a matter of time until Tesseract incorporated a deep learning model to further boost OCR accuracy — and in fact, that time has come.

The latest release of Tesseract (v4) supports deep learning-based OCR that is significantly more accurate than previous releases.

The underlying OCR engine itself utilizes a Long Short-Term Memory (LSTM) network, a kind of Recurrent Neural Network (RNN).
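When calling Tesseract from Python, the engine is chosen with command-line style options. The sketch below builds such an option string: --oem 1 selects the LSTM engine only, and --psm 7 tells Tesseract to treat the image as a single line of text. The helper name build_tesseract_config and its defaults are hypothetical, and whether these values suit your images is an assumption you should check.

```python
def build_tesseract_config(lang="eng", oem=1, psm=7):
    """Build a Tesseract option string.

    oem: OCR Engine Mode (0 = legacy, 1 = LSTM only, 2 = legacy + LSTM, 3 = default)
    psm: Page Segmentation Mode (0-13; 7 = treat the image as a single text line)
    """
    if oem not in range(4):
        raise ValueError("oem must be between 0 and 3")
    if psm not in range(14):
        raise ValueError("psm must be between 0 and 13")
    return f"-l {lang} --oem {oem} --psm {psm}"


# Typical use with pytesseract (assumed to be installed):
#   text = pytesseract.image_to_string(roi, config=build_tesseract_config())
```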

In the remainder of this section, you will learn how to install Tesseract v4 on your machine.

Later in this blog post, you’ll learn how to combine OpenCV’s EAST text detection algorithm with Tesseract v4 in a single Python script to automatically perform OpenCV OCR.

Let’s get started configuring your machine!

Install OpenCV

To run today’s script you’ll need OpenCV installed. Version 3.4.2 or later is required.
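If you are not sure which OpenCV version you have, a quick check along these lines may help. This is only a sketch: the helper names parse_version and opencv_is_recent_enough are hypothetical, and the parsing assumes version strings that start with major.minor (optionally followed by a patch number), which is how cv2.__version__ normally looks.

```python
import re


def parse_version(version):
    """Turn a string such as '3.4.2' or '4.5.0-dev' into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch number is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def opencv_is_recent_enough(version, minimum="3.4.2"):
    """Return True if `version` is at least `minimum`."""
    return parse_version(version) >= parse_version(minimum)


# Typical use (requires OpenCV to be installed):
#   import cv2
#   print(cv2.__version__, opencv_is_recent_enough(cv2.__version__))
```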

To install OpenCV on your system, just follow one of my OpenCV installation guides, ensuring that you download the correct/desired version of OpenCV and OpenCV-contrib in the process.

Install Tesseract 4 on Windows

Download the Tesseract installer, run it, and add the installation directory to your PATH.
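One way to confirm that the PATH change took effect is to ask Python where it finds the executable. The sketch below uses only the standard library; the helper name find_tesseract is hypothetical. If it returns None, the install directory is probably not on your PATH yet (a new terminal session is often needed after changing it).

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract executable if it is on PATH,
    otherwise None."""
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH")
else:
    print("tesseract found at:", path)

# If you prefer not to modify PATH, pytesseract can be pointed at the
# executable directly (the actual location depends on your installation):
#   import pytesseract
#   pytesseract.pytesseract.tesseract_cmd = path_to_tesseract_executable
```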

Verify your Tesseract version

Figure 2: Screenshot of my system terminal where I have entered the tesseract -v command to query for the version. I have verified that I have Tesseract 4 installed.

Once you have Tesseract installed on your machine you should execute the following command to verify your Tesseract version: