Tesseract: Character Recognition Without Training a Model
Using Tesseract to recognize characters without training our own models
--
Background
Today, thanks to vast improvements in machine learning, extracting and recognizing characters from images is much simpler than it used to be. Well-developed deep learning architectures such as CNNs and LSTMs do most of the work. Before these algorithms were available, one had to use template matching, comparing every character image against a set of predefined templates. Template matching requires a cleanly cropped image of each character, and cropping images to meet that requirement was difficult.
Thus, finding a good algorithm for cropping characters and preprocessing images to meet those requirements was time-consuming.
Deep learning is one of the most powerful tools for image recognition, and many libraries of trained deep learning models are available. For instance, YOLO is popular for object detection. But if we want to use YOLO to draw bounding boxes around characters for character recognition, we have to create and train our own model, or fine-tune an existing one. In such cases, the most time-consuming parts are collecting the dataset and training the model itself.
On the other hand, there is Tesseract, an open-source OCR (Optical Character Recognition) engine that was originally developed at Hewlett-Packard and whose development was later sponsored by Google. It has already been trained on more than 400,000 lines of text, spanning about 4,500 fonts, for Latin-based languages. It also supports non-Latin scripts such as Japanese and Chinese. Given these advantages and its robust training, it is preferable to use Tesseract directly for character recognition, without training or creating any new models. Below, I will discuss how to get better results with Tesseract.
Installing Tesseract
Installation instructions are available in the official Tesseract documentation.
Ubuntu users can install it from the terminal with the following commands:
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt update
sudo apt install tesseract-ocr libtesseract-dev
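After installing, it can be useful to confirm that the binary is on the PATH and to see which language models came with it. The snippet below is a minimal sketch of such a check, not part of the official installation steps. The function name check_tesseract and the file name sample.png are placeholders chosen for illustration; substitute your own image.

```shell
#!/bin/sh
# Print the Tesseract version and the installed language models,
# or report that the binary could not be found.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        tesseract --version
        tesseract --list-langs
    else
        echo "tesseract not found on PATH" >&2
        return 1
    fi
}

# If Tesseract is installed and the (hypothetical) image sample.png exists,
# recognize its text with the English model and print the result.
if check_tesseract && [ -f sample.png ]; then
    tesseract sample.png stdout -l eng
fi
```

Passing stdout as the output base makes Tesseract print the recognized text to the terminal instead of writing a .txt file, which is convenient for a quick smoke test.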