Tesseract: an Open-Source Optical Character Recognition Engine
I wanted to run some experiments that would give me an idea of the power of Tesseract. I also wanted to compare those results with another open-source OCR system: ocrad.
I started off by running some tests to see how well Tesseract would do. My initial test took a 200dpi screen capture of text that included bold and italic fonts. Obviously, the screen capture was completely free from any kind of noise or error introduced by a physical scanner.
Tesseract performed flawlessly, recognizing 100% of the characters. It even got the spacing right. Unfortunately, ocrad did not fare as well. It missed several spaces (causing words to join erroneously), and it missed several letters. The overall recognition rate for ocrad on a perfect input was 95%.
Next, I decided to try some torture tests to see how well Tesseract would do under more adverse conditions. I have used Adobe Acrobat to do OCR on scanned documents, and it requires at least 150dpi. It manages to fix things like varying lighting (as we did in GIMP earlier) and linear distortion (for example, due to book bindings pulling the edge of the paper away from the scanner). It also handles skewed pages, where the page was not aligned well on the scanner bed.
So, I found a 72dpi scanned image that contained most of these glitches. Note that 72dpi is less than half the minimum resolution that Acrobat will even try. The left margin was dark gray and bled into the letters, and the left edges of the lines were bent. The original image was not skewed.
I tried the unaltered image and the results were poor. I then used GIMP thresholding to remove the lighting variance and saved it as described above. I did nothing to correct the bent lines, nor did I increase the dpi in any way.
To my surprise, Tesseract managed a 97% recognition rate! Many of the errors came from mistaking e for c (the two were difficult for me to distinguish in the original image), and many others were clustered around the areas where the worst linear distortion occurred.
Next, I used The GIMP to rotate the image as far as I could without clipping the text. This corresponds to someone slapping pages on a scanner with little regard for alignment. Surprisingly, Tesseract still managed a 96% recognition rate. In fact, the rotation inadvertently helped with the linear distortion, and the recognition errors were less clustered than before.
Now I was curious as to how ocrad would fare. It did not fare well. In fact, it failed miserably. ocrad did more poorly on the best quality input than Tesseract did on the worst. The results and comparison are shown in Table 1.
The tests above indicate that the recommended inputs I have seen for Acrobat are quite sane. I recommend scanning your documents at 150dpi or higher. You also might try putting your scanner in black-and-white mode; the threshold routines in your scanner actually may give better results than the manual thresholding described in this article.
Perfect alignment does not seem to affect recognition rates drastically, but distortion due to book bindings did seem to cause some minor problems. Many professional scanning companies remove the pages from the binding if possible.
The GIMP gives you very fine control over image editing, but if you have a consistent scanning environment and a lot of pages, you really will want to automate the image cleanup as much as possible.
I recommend using Netpbm for this purpose, preferably version 10.34 or later, as those versions come with a more powerful threshold filter. Unfortunately, this is not considered a super-stable version, so many systems will have an older version.
If you are using an older version, you might get acceptable results with a pipeline of commands like this:
$ tifftopnm < scanned_image.tif | \
  pamditherbw -threshold -value 0.8 | \
  pamtopnm | pnmtotiff > result.tif
This chain of four commands reduces the color palette to black and white and saves the result as an uncompressed TIFF image. The number passed to the -value parameter of pamditherbw can range from 0 to 1 and defaults to 0.5; it corresponds to the slider used earlier in The GIMP. In this case, higher numbers make the image darker.
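Because the best -value setting depends on the scan, it can help to produce several candidates and compare them by eye. The following is a minimal sketch, not part of the original workflow: the function name sweep_threshold and the output naming scheme are hypothetical, and it assumes a grayscale scan whose filename ends in .tif. It reuses the same four-command pipeline shown above.

```shell
#!/bin/sh
# Hypothetical helper: write one thresholded copy of a scan for each
# -value setting given, so the results can be compared side by side.
# Usage: sweep_threshold scanned_image.tif 0.4 0.5 0.6 0.7 0.8
sweep_threshold() {
    input=$1
    shift
    base=${input%.tif}            # strip the .tif suffix for output names
    for value in "$@"; do
        # Same pipeline as in the article, with the threshold varied
        tifftopnm < "$input" |
            pamditherbw -threshold -value "$value" |
            pamtopnm | pnmtotiff > "${base}_t${value}.tif"
    done
}
```

Calling sweep_threshold scanned_image.tif 0.5 0.8 would leave scanned_image_t0.5.tif and scanned_image_t0.8.tif next to the original, ready to be fed to Tesseract one at a time.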
Netpbm 10.34 and higher includes a more-advanced threshold utility, pamthreshold, which can do a better job on images where the lighting varies over the page. In this case, the command chain would be:
$ tifftopnm < scanned_image.tif | \
  pamthreshold -local=20x20 | \
  pamtopnm | pnmtotiff > result.tif
pamthreshold offers several alternative options. The -local option lets you specify the size of a rectangular area around each pixel that is used to determine local lighting conditions, so the threshold can adapt to lighting that changes across the image. You also may want to try:
$ tifftopnm < scanned_image.tif | \
  pamthreshold -threshold=0.8 | pamtopnm | pnmtotiff > result.tif
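For a large batch of pages, the pipelines above can be wrapped in a small script. This is only a sketch under stated assumptions: the function names threshold_filter and clean_scans and the directory layout are hypothetical, the scans are assumed to be grayscale .tif files, and the 20x20 window and 0.8 threshold are simply the values used in the examples above. It uses pamthreshold when that program is installed and falls back to pamditherbw otherwise.

```shell
#!/bin/sh
# Sketch: clean up every TIFF in a directory with whichever threshold
# tool is installed. Function and directory names are hypothetical.

# Threshold one PNM stream from stdin to stdout.
threshold_filter() {
    if command -v pamthreshold >/dev/null 2>&1; then
        # Newer Netpbm: adaptive threshold over a 20x20 neighborhood
        pamthreshold -local=20x20
    else
        # Older Netpbm: fixed global threshold
        pamditherbw -threshold -value 0.8
    fi
}

# Convert each *.tif in $1 and write the result to $2 under the same name.
clean_scans() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for scan in "$src_dir"/*.tif; do
        [ -e "$scan" ] || continue    # no matches: the glob stays literal
        tifftopnm < "$scan" | threshold_filter | pamtopnm |
            pnmtotiff > "$out_dir/$(basename "$scan")"
    done
}

# Example: clean_scans ./scans ./clean
```

Writing the cleaned pages to a separate directory keeps the original scans untouched, so the threshold settings can be changed and the batch rerun if the recognition results are poor.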
Comments
The online application Free
The online application Free OCR converts the contents of an image file into a text output format, though Microsoft Word is not currently supported.
OCR software is everywhere
OCR software is everywhere nowadays. I prefer online tools, since they don't need installation and most of them are free, like this one: Free OCR.
Lengthy install?
How to install? Takes about 3 seconds...
apt-get install tesseract-ocr
Tesseract works really well!
I've been using Tesseract
I've been using Tesseract OCR with a C# program I have made to batch OCR hundreds of documents on a server we run. Considering it costs nothing, I am very impressed with the accuracy. It is far superior to GOCR, which needs the image to have the grey scale adjusted before anything can be done.
______________________
Submitted by: Bajar Libros
Missing links to the images used in the test
I can't find the links to the images used in the test. It seems very strange to me that Ocrad was unable to recognize even a single character on some of them.
What version of Ocrad did you use? Were the characters at least 20 pixels high, as the Ocrad manual requires? If they were smaller, did you use the "--scale" option of Ocrad? Did you even RTFM?
If you want a good review of free OCR software, you had better see this one, for example.
gscan2pdf
Why not try gscan2pdf, which has support for tesseract?
http://gscan2pdf.sourceforge.net/
The OCR data is embedded into the pdf as an annotation. It can be indexed with beagle, for example, and viewed with Adobe's pdf reader. Support for annotations is coming to the free pdf readers as well, I believe.
gscan2pdf also supports unpaper, and I find it an excellent all-around tool for my scanner.
isight?
It would be great if someone could code a program that takes an image directly from my iSight or any USB cam, sends it to the above-mentioned program and converts it into text format, so all I would have to do is hold my textbook up against my webcam and get it on my computer. Cheers!
International characters?
I've been using tesseract for a while and it works great, but it has a major flaw that I haven't been able to overcome: I can't make it recognize international characters (i.e. á,é,í,ó,ú,ñ,Ñ). For example, it changes ó to 6.
Is there a way to make it support other than standard ASCII characters?
:)
open source rules, seriously.
Great work. It seems like a lengthy installation, however.
Lengthy Installation
You can OCR documents for free using Tesseract at A Billion Billion - Free OCR for Everyone by just uploading your TIFF files and clicking OCR. No installation this way.
Dead Link
The site at abillionbillion.com no longer exists. There is another site, free-ocr.com, that allows one to upload scanned images in a variety of graphics file formats, but it is limited to a maximum of 10 images per hour and there's an upper size limit on the images. The site is supported by ads and donations.
I'm impressed!
Wow - it did a really good job, first try out of the box. I built it without turning off TIFF support, and used tesseract-1.04b. I put it in /usr/local/bin. At first I got this:
Error: Unable to open unicharset!
This was fixed by doing this:
$ sudo ln -s /usr/local/bin/tessdata /usr/local/share/tessdata
Then it complained about not recognizing the file format (which was TIFF with no compression). I renamed the file from "text2.tiff" to "text2.tif" and then it was happy. That's just silly, if you ask me.
This is all on Ubuntu Feisty. My original scan was 300 dpi, and I ran it through the GIMP the same as in the article.
Very nice!
-Pete