Image processing to improve Tesseract OCR accuracy

I’ve been using Tesseract to convert documents into text. The quality of the documents varies widely, and I’m looking for tips on what sort of image processing might improve the results. I’ve noticed that highly pixelated text, such as that produced by fax machines, is especially difficult for Tesseract to process; presumably the jagged character edges confound the shape-recognition algorithms.

What sort of image processing techniques would improve the accuracy? I’ve been using a Gaussian blur to smooth out the pixelated images and have seen some small improvement, but I’m hoping there is a more specific technique that would yield better results: say, a filter tuned to black-and-white images that smooths out irregular edges, followed by a filter that increases the contrast to make the characters more distinct.

Any general tips for someone who is a novice at image processing?

Answers

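I think the contrast-increasing filter you describe is usually called thresholding or binarization; adaptive thresholding is the variant that chooses the threshold locally rather than once for the whole image.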

ImageMagick or Leptonica may do what you’re looking for.
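To make "thresholding" concrete, here is a minimal pure-Python sketch of a global threshold chosen with Otsu's method. Otsu's method is my choice for illustration, not something this answer names, and the function names `otsu_threshold` and `binarize` are hypothetical. It assumes the image is already grayscale and stored as a list of rows of integers in 0..255.

```python
def otsu_threshold(image):
    """Pick the gray level that best separates dark from light pixels.

    Pixels <= t form one class and pixels > t the other; Otsu's method
    chooses the t that maximises the between-class variance.
    """
    hist = [0] * 256
    for row in image:
        for p in row:
            hist[p] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(image, t):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[255 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    page = [[200, 198, 12], [10, 205, 11]]
    t = otsu_threshold(page)
    print(t, binarize(page, t))
```

A single global threshold like this only works when the lighting is reasonably even across the page.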

  1. Fix the DPI if needed; 300 DPI is the minimum.
  2. Fix the text size (e.g. 12 pt should be OK).
  3. Try to straighten the text lines (deskew and dewarp the text).
  4. Try to even out the illumination of the image (e.g. no dark parts of the image).
  5. Binarize and de-noise the image.
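As an illustration of the de-noise part of the last step, here is a small sketch of a 3×3 median filter in pure Python. The median filter is my own pick of a common despeckling technique, not something the list prescribes, and `median_filter_3x3` is a hypothetical name. It assumes a grayscale or already binarized image stored as a list of rows of integers.

```python
from statistics import median_high


def median_filter_3x3(image):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    Isolated specks (salt-and-pepper noise) disappear, while strokes
    that are a few pixels wide mostly survive. At the borders the
    window is simply clipped to the image.
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            # median_high always returns a value that occurs in the window,
            # so a binary image stays binary.
            row.append(median_high(window))
        out.append(row)
    return out


if __name__ == "__main__":
    speckled = [[255] * 5 for _ in range(5)]
    speckled[2][2] = 0  # a single black speck on a white page
    print(median_filter_3x3(speckled)[2][2])
```

Very thin strokes (one pixel wide) can be eroded by a median filter, so it is worth upscaling the image first if the text is small.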

There is no universal command line that fits all cases (sometimes you need to blur and sharpen the image), but you can give TEXTCLEANER from Fred’s ImageMagick Scripts a try.

If you are not a fan of the command line, you can try the open-source scantailor.sourceforge.net or the commercial bookrestorer.

I am by no means an OCR expert, but this week I needed to extract text from a JPG.

I started with a color RGB JPG of 445×747 pixels. I immediately tried Tesseract on it, and the program converted almost nothing. I then opened the image in GIMP and did the following:

  1. Image > Mode > Grayscale
  2. Image > Scale Image > 1191×2000 pixels
  3. Filters > Enhance > Unsharp Mask, with radius = 6.8, amount = 2.69, threshold = 0

I then saved the result as a new JPG at 100% quality.

Tesseract was then able to extract all the text into a .txt file.

GIMP is your friend.
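For anyone wondering what an unsharp mask does, here is a rough pure-Python sketch of the idea: blur the image, then push each pixel away from its blurred value. This is not GIMP's implementation. GIMP uses a Gaussian blur, whereas this sketch swaps in a simple box blur to stay short, and the function names and the integer `radius` are assumptions for illustration. The image is assumed to be grayscale, stored as a list of rows of integers in 0..255.

```python
def box_blur(image, radius=1):
    """Average each pixel with its neighbours (window clipped at the borders)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out


def unsharp_mask(image, radius=1, amount=1.0, threshold=0):
    """Sharpen: result = original + amount * (original - blurred).

    Differences no larger than `threshold` are left alone so that
    flat areas are not amplified into noise.
    """
    blurred = box_blur(image, radius)
    out = []
    for y, row in enumerate(image):
        new_row = []
        for x, p in enumerate(row):
            diff = p - blurred[y][x]
            value = p + amount * diff if abs(diff) > threshold else p
            # Clamp back into the valid 0..255 range.
            new_row.append(max(0, min(255, int(round(value)))))
        out.append(new_row)
    return out


if __name__ == "__main__":
    edge = [[100, 100, 100, 200, 200, 200] for _ in range(3)]
    print(unsharp_mask(edge)[1])
```

The dark side of an edge gets darker and the light side lighter, which is what makes character outlines look crisper after upscaling.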

This question is from a while ago, but this might still be useful.

My experience shows that resizing the image in-memory before passing it to tesseract sometimes helps.

Try different modes of interpolation. The post http://stackoverflow.com/a/4756906/146003 helped me a lot.

What was EXTREMELY HELPFUL to me along the way was the source code of the Capture2Text project: http://sourceforge.net/projects/capture2text/files/Capture2Text/

BTW: Kudos to its author for sharing such a painstaking algorithm.

Pay special attention to the file Capture2Text/SourceCode/leptonica_util/leptonica_util.c – that’s the essence of the image preprocessing for this utility.

If you run the binaries, you can check the image before and after the transformation in the Capture2Text/Output/ folder.

P.S. The mentioned solution uses Tesseract for OCR and Leptonica for preprocessing.

Three steps to improve the readability of the image:

  1. Resize the image, trying different scales (multiply the image height and width by 0.5, 1, and 2).
  2. Convert the image to grayscale.
  3. Remove the noise pixels to make the text clearer (filter the image).

See the code below:

// Resize: scale the bitmap with bilinear interpolation, then convert it to
// grayscale and clean it up. Requires: using System; using System.Drawing;
public Bitmap Resize(Bitmap bmp, int newWidth, int newHeight)
{
    Bitmap bmap = new Bitmap(newWidth, newHeight, bmp.PixelFormat);

    // Number of source pixels per destination pixel along each axis.
    double nWidthFactor = (double)bmp.Width / newWidth;
    double nHeightFactor = (double)bmp.Height / newHeight;

    for (int x = 0; x < bmap.Width; ++x)
    {
        for (int y = 0; y < bmap.Height; ++y)
        {
            // Top-left source pixel and its right/lower neighbours,
            // clamped to the image edge.
            int fr_x = (int)Math.Floor(x * nWidthFactor);
            int fr_y = (int)Math.Floor(y * nHeightFactor);
            int cx = fr_x + 1;
            if (cx >= bmp.Width) cx = fr_x;
            int cy = fr_y + 1;
            if (cy >= bmp.Height) cy = fr_y;

            // Fractional offsets and their complements are the blend weights.
            double fx = x * nWidthFactor - fr_x;
            double fy = y * nHeightFactor - fr_y;
            double nx = 1.0 - fx;
            double ny = 1.0 - fy;

            Color color1 = bmp.GetPixel(fr_x, fr_y);
            Color color2 = bmp.GetPixel(cx, fr_y);
            Color color3 = bmp.GetPixel(fr_x, cy);
            Color color4 = bmp.GetPixel(cx, cy);

            // For each channel: blend along the top row (bp1) and the
            // bottom row (bp2), then blend the two results vertically.
            byte bp1, bp2;

            // Blue
            bp1 = (byte)(nx * color1.B + fx * color2.B);
            bp2 = (byte)(nx * color3.B + fx * color4.B);
            byte nBlue = (byte)(ny * bp1 + fy * bp2);

            // Green
            bp1 = (byte)(nx * color1.G + fx * color2.G);
            bp2 = (byte)(nx * color3.G + fx * color4.G);
            byte nGreen = (byte)(ny * bp1 + fy * bp2);

            // Red
            bp1 = (byte)(nx * color1.R + fx * color2.R);
            bp2 = (byte)(nx * color3.R + fx * color4.R);
            byte nRed = (byte)(ny * bp1 + fy * bp2);

            bmap.SetPixel(x, y, Color.FromArgb(255, nRed, nGreen, nBlue));
        }
    }

    bmap = SetGrayscale(bmap);
    bmap = RemoveNoise(bmap);

    return bmap;
}


// SetGrayscale: return a grayscale copy of the image using the usual
// luminance weights (0.299 R + 0.587 G + 0.114 B). The input is not modified.
public Bitmap SetGrayscale(Bitmap img)
{
    Bitmap bmap = (Bitmap)img.Clone();
    for (int i = 0; i < bmap.Width; i++)
    {
        for (int j = 0; j < bmap.Height; j++)
        {
            Color c = bmap.GetPixel(i, j);
            byte gray = (byte)(.299 * c.R + .587 * c.G + .114 * c.B);
            bmap.SetPixel(i, j, Color.FromArgb(gray, gray, gray));
        }
    }
    return bmap;
}

// RemoveNoise: despite the name, this is a fixed threshold at 162.
// Pixels whose R, G, and B are all below 162 become black, and pixels whose
// channels are all above 162 become white. Anything else is left unchanged.
// The bitmap is modified in place and returned.
public Bitmap RemoveNoise(Bitmap bmap)
{
    for (var x = 0; x < bmap.Width; x++)
    {
        for (var y = 0; y < bmap.Height; y++)
        {
            var pixel = bmap.GetPixel(x, y);
            if (pixel.R < 162 && pixel.G < 162 && pixel.B < 162)
                bmap.SetPixel(x, y, Color.Black);
            else if (pixel.R > 162 && pixel.G > 162 && pixel.B > 162)
                bmap.SetPixel(x, y, Color.White);
        }
    }

    return bmap;
}

[Images: input image and output image]

Adaptive thresholding is important if the lighting is uneven across the image. My preprocessing using GraphicsMagick is described in this post: https://groups.google.com/forum/#!topic/tesseract-ocr/jONGSChLRv4

GraphicsMagick also has the -lat option for local adaptive thresholding, which I will try soon.

Another method of thresholding using OpenCV is described here: http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html
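To show why adaptive thresholding copes with uneven lighting, here is a minimal pure-Python sketch of a mean-based local threshold. It is only similar in spirit to the tools mentioned above and is not a reimplementation of either; the function name and the `block` and `offset` parameters are assumptions for illustration. The image is assumed to be grayscale, stored as a list of rows of integers in 0..255.

```python
def adaptive_threshold(image, block=15, offset=10):
    """Binarize using a per-pixel threshold: local mean minus `offset`.

    A pixel becomes black (0) only if it is clearly darker than the
    average of the block x block window around it, so a gradual change
    in lighting across the page does not matter.
    """
    h, w = len(image), len(image[0])

    # Integral image: integ[y][x] is the sum of all pixels above and to
    # the left of (x, y), which makes any window sum a 4-lookup operation.
    integ = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += image[y][x]
            integ[y + 1][x + 1] = integ[y][x + 1] + row_sum

    r = block // 2
    out = []
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        row = []
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            area = (y1 - y0) * (x1 - x0)
            total = integ[y1][x1] - integ[y0][x1] - integ[y1][x0] + integ[y0][x0]
            local_mean = total / area
            row.append(0 if image[y][x] < local_mean - offset else 255)
        out.append(row)
    return out


if __name__ == "__main__":
    # Background fades from 200 on the left to 105 on the right.
    page = [[200 - 5 * x for x in range(20)] for _ in range(9)]
    page[4][3] -= 60    # a "text" pixel in the bright area
    page[4][15] -= 60   # a "text" pixel in the dim area
    for row in adaptive_threshold(page, block=5, offset=10):
        print("".join("#" if p == 0 else "." for p in row))
```

In this example a single global threshold of 128 would turn the whole dim right-hand side black, while the local threshold keeps the background white and marks only the two text pixels.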