Tesseract: an Open-Source Optical Character Recognition Engine
I wanted to run some experiments to get a sense of how capable Tesseract is. I also wanted to compare its results against another open-source OCR system, ocrad.
I started off by running some tests to see how well Tesseract would do. My initial test took a 200dpi screen capture of text that included bold and italic fonts. Obviously, the screen capture was completely free from any kind of noise or error introduced by a physical scanner.
Tesseract performed flawlessly, recognizing 100% of the characters. It even got the spacing right. Unfortunately, ocrad did not fare as well. It missed several spaces (causing words to join erroneously), and it missed several letters. The overall recognition rate for ocrad on a perfect input was 95%.
Next, I decided to try some torture tests to see how well Tesseract would do under more adverse conditions. I have used Adobe Acrobat to do OCR on scanned documents, and it requires 150dpi. It manages to fix things like varying lighting (as we did in GIMP earlier) and linear distortion (for example, due to book bindings pulling the edge of the paper away from the scanner). It also handles skewed pages, where the page was not aligned well on the scanner bed.
So, I found a 72dpi scanned image that contained most of these glitches. Note that 72dpi is roughly half the resolution that Acrobat will even try. The left margin was dark gray and bled into the letters, and the left edges of the lines were bent. The original image was not skewed.
I tried the unaltered image and the results were poor. I then used GIMP thresholding to remove the lighting variance and saved it as described above. I did nothing to correct the bent lines, nor did I increase the dpi in any way.
To my surprise, Tesseract managed a 97% recognition rate! Many of the errors mistook e for c (the two were difficult for me to distinguish in the original image), and many others clustered around the areas where the worst linear distortion occurred.
Next, I used The GIMP to rotate the image as far as I could without clipping the text. This corresponds to someone slapping pages on a scanner with little regard for alignment. Surprisingly, Tesseract still managed a 96% recognition rate. In fact, the rotation inadvertently helped with the linear distortion, and the recognition errors were less clustered than before.
Now I was curious as to how ocrad would fare. It did not fare well. In fact, it failed miserably. ocrad did worse on the best-quality input than Tesseract did on the worst. The results and comparison are shown in Table 1.
The tests above indicate that the recommended inputs I have seen for Acrobat are quite sane. I recommend scanning your documents at 150dpi or higher. You also might try putting your scanner in black-and-white mode; the threshold routines in your scanner actually may give better results than the manual thresholding described in this article.
Imperfect alignment does not seem to hurt recognition rates drastically, but distortion due to book bindings did seem to cause some minor problems. Many professional scanning companies remove the pages from the binding if possible.
The GIMP gives you very fine control over image editing, but if you have a consistent scanning environment and a lot of pages, you really will want to automate the image cleanup as much as possible.
I recommend using Netpbm for this purpose, preferably version 10.34 or later, as those versions come with a more powerful threshold filter. Unfortunately, this is not considered a super-stable version, so many systems will have an older version.
If you are using an older version, you might get acceptable results with a pipeline of commands like this:
$ tifftopnm < scanned_image.tif | \
    pamditherbw -threshold -value 0.8 | \
    pamtopnm | pnmtotiff > result.tif
This chain of four commands reduces the color palette to black and white and saves the result as an uncompressed TIFF image. The -value parameter of pamditherbw ranges from 0 to 1 and defaults to 0.5; it corresponds to the slider used earlier in The GIMP. In this case, higher numbers make the image darker.
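To see why a higher value darkens the image, it helps to look at what a simple global threshold does to each pixel. The following is a minimal sketch in awk that works on a plain-text (P2) PGM file. It is an illustration only, not the pamditherbw implementation; the function name threshold_pgm is hypothetical, and the exact handling of pixels that sit right on the cutoff is an assumption.

```shell
#!/bin/sh
# Hypothetical illustration of a global threshold; not the pamditherbw source.
# Reads a plain-text (P2) PGM on stdin, writes a plain-text (P1) PBM on stdout.
# Usage: threshold_pgm 0.8 < page.pgm > page.pbm
threshold_pgm() {
    awk -v value="$1" '
    { sub(/#.*/, "") }                              # drop PGM comments
    { for (i = 1; i <= NF; i++) tok[++n] = $i }     # collect header and pixels
    END {
        if (tok[1] != "P2") {
            print "expected plain PGM (P2)" > "/dev/stderr"
            exit 1
        }
        width = tok[2] + 0; height = tok[3] + 0; maxval = tok[4] + 0
        cutoff = value * maxval
        print "P1"
        print width, height
        for (y = 0; y < height; y++) {
            line = ""
            for (x = 0; x < width; x++) {
                gray = tok[5 + y * width + x] + 0
                # In PBM, 1 is black and 0 is white. Anything darker than
                # the cutoff becomes black, so a higher cutoff means more black.
                bit = (gray < cutoff) ? 1 : 0
                line = line (x ? " " : "") bit
            }
            print line
        }
    }'
}
```

With a three-pixel row of gray levels 2, 6 and 9 (out of 10), a value of 0.5 turns only the first pixel black, while 0.8 turns the first two black. The sketch writes one output line per image row, so it is meant for small test images rather than full scans.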
Netpbm 10.34 and higher includes a more-advanced threshold utility, pamthreshold, which can do a better job on images where the lighting varies over the page. In this case, the command chain would be:
$ tifftopnm < scanned_image.tif | \
    pamthreshold -local=20x20 | \
    pamtopnm | pnmtotiff > result.tif
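The idea behind thresholding against a neighborhood, rather than against one cutoff for the whole page, can be sketched in a few lines of awk. The version below compares each pixel with the mean of a small window around it. This is a simplified stand-in chosen for illustration, not the algorithm pamthreshold actually uses, and the name local_threshold_pgm is hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration of local mean thresholding; not the pamthreshold
# algorithm. Reads a plain-text (P2) PGM on stdin, writes a plain (P1) PBM.
# Usage: local_threshold_pgm WINDOW_WIDTH WINDOW_HEIGHT < page.pgm > page.pbm
local_threshold_pgm() {
    awk -v w="$1" -v h="$2" '
    { sub(/#.*/, "") }                              # drop PGM comments
    { for (i = 1; i <= NF; i++) tok[++n] = $i }     # collect header and pixels
    END {
        if (tok[1] != "P2") {
            print "expected plain PGM (P2)" > "/dev/stderr"
            exit 1
        }
        width = tok[2] + 0; height = tok[3] + 0
        rx = int(w / 2); ry = int(h / 2)            # window half-sizes
        print "P1"
        print width, height
        for (y = 0; y < height; y++) {
            line = ""
            for (x = 0; x < width; x++) {
                # Mean gray level of the window, clipped at the image edges.
                sum = 0; count = 0
                for (dy = -ry; dy <= ry; dy++) {
                    yy = y + dy
                    if (yy < 0 || yy >= height) continue
                    for (dx = -rx; dx <= rx; dx++) {
                        xx = x + dx
                        if (xx < 0 || xx >= width) continue
                        sum += tok[5 + yy * width + xx]
                        count++
                    }
                }
                gray = tok[5 + y * width + x] + 0
                # Black (1) only if the pixel is darker than its surroundings.
                bit = (gray < sum / count) ? 1 : 0
                line = line (x ? " " : "") bit
            }
            print line
        }
    }'
}
```

Consider a row whose background brightens from 60 to 200 with two darker "ink" pixels: 60 30 100 120 140 160 130 200. No single cutoff separates the ink at 130 from the background at 60, but a 3x1 window picks out exactly the two ink pixels. A pure local mean like this will speckle in flat, noisy regions, which is one reason real tools are more elaborate.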
pamthreshold offers several alternative options. The -local option lets you specify a rectangular area around each pixel that is used to gauge the local lighting, so the threshold can adapt as lighting changes across the image. You also may want to try:
$ tifftopnm < scanned_image.tif | \
    pamthreshold -threshold=0.8 | \
    pamtopnm | pnmtotiff > result.tif
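For a stack of pages, the single-image pipeline can be wrapped in a loop. Here is a rough sketch, assuming your scans sit in one directory as .tif files. The function names, the directory layout and the DRY_RUN variable are all made up for this example, and the pamthreshold options are simply the ones used above, which you will want to tune for your own scanner.

```shell
#!/bin/sh
# Hypothetical batch wrapper around the pamthreshold pipeline.
# Set DRY_RUN=1 to print the pipeline for each page instead of running it.

# Map scans/page01.tif to cleaned/page01.tif (directory names are assumptions).
cleaned_name() {
    printf '%s/%s\n' "$2" "$(basename "$1")"
}

# Usage: clean_pages SOURCE_DIR OUTPUT_DIR
clean_pages() {
    src_dir=$1
    out_dir=$2
    for page in "$src_dir"/*.tif; do
        [ -e "$page" ] || continue      # no .tif files: the glob stays literal
        out=$(cleaned_name "$page" "$out_dir")
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "tifftopnm < $page | pamthreshold -local=20x20 | pamtopnm | pnmtotiff > $out"
        else
            mkdir -p "$out_dir"
            tifftopnm < "$page" | pamthreshold -local=20x20 |
                pamtopnm | pnmtotiff > "$out" || echo "failed: $page" >&2
        fi
    done
}
```

Running DRY_RUN=1 clean_pages scans cleaned first lets you check which files would be touched before committing to a long run. The originals are never overwritten, so you can rerun with different threshold settings if the first pass looks wrong.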