Tutorial: Translating Scanned Docs
Recently, I had to access some information from a German document, but it was only available as a poor-quality scan. This is an overview of how I extracted and translated that information. The tools used were pdfimages, GIMP, gImageReader and Google Translate.
There are some OCR (optical character recognition) tools that can directly handle PDF files as an input file format. Unfortunately, in this case, the scanned pages were badly skewed and needed to be tidied up by hand before processing. It would be possible, although tedious, to screen capture each and every page, but the screen resolution and the resolution of the original scanned images wouldn't match, which would result in a loss of quality.
Extract the images (pdfimages)
The solution is to extract the images from the PDF file. I used a tool called pdfimages for this.
pdfimages inputfile.pdf outputfile
will produce a series of graphic files which are numbered according to the order in which they occurred in the PDF document. By default, they are in PBM format. This is a less common format, and pdfimages can be coaxed into outputting JPEG files instead. However, I would advise against that as JPEG is a lossy format and we need to preserve as much quality as possible for documents that are going to be OCRed.
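For a long document it can help to extract only the pages you need. The following is a minimal sketch, not part of my original workflow: the wrapper name extract_pages and its argument order are made up for illustration, while -f and -l are pdfimages' own first-page and last-page options.

```shell
# Hypothetical helper: extract the images found on pages FIRST..LAST of a PDF.
# Usage: extract_pages inputfile.pdf outputfile 2 4
extract_pages() {
    pdf=$1      # PDF to read
    prefix=$2   # output files are named PREFIX-000, PREFIX-001, ... plus extension
    first=$3    # first page to process
    last=$4     # last page to process
    pdfimages -f "$first" -l "$last" "$pdf" "$prefix"
}
```

Calling extract_pages inputfile.pdf outputfile 2 4 should leave files such as outputfile-000.pbm in the current directory, numbered in the order the images occur.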
Clean up images (GIMP)
I used GIMP to clean up the images, and fortunately, it can work with PBM as an input format. The scans themselves had a number of problems. The first thing I did was to use the rotate tool (Layer>Transform>Arbitrary Rotate...) to straighten up the image. To make this easier, I zoomed in so that I could use the top of the window as a ruler against a line of text. In this case, I found that a +1.4 deg rotation made the lines straight again.

The original image that I had to work with.

The cleaned up version.
The scans were also skewed to an extent. This meant that although the lines of text were now horizontally straight, the left margin was not vertically aligned. I used the GIMP skew tool to correct this, again working with a zoomed image.
The image was also crushed, so I scaled it to add 50% to its height (Layer>Scale layer...). Through experimentation, I discovered that this, along with making the image mono (Image>Mode>Indexed...), greatly improved the accuracy of the OCR software.
Finally, I cropped the image.
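If every page needs the same correction, part of this clean-up could be scripted rather than done by hand. The following is only a sketch under several assumptions: it swaps GIMP for ImageMagick's convert command (called magick in ImageMagick 7), the function name clean_scan is made up, and the skew correction and cropping are left out because they depend on the individual scan. The rotation sign convention may not match GIMP's, so check the result on a single page first.

```shell
# Hypothetical helper: rotate, stretch and reduce a scan to black and white.
# Usage: clean_scan outputfile-000.pbm page-000.tif 1.4
clean_scan() {
    src=$1     # image extracted by pdfimages
    dst=$2     # cleaned-up image to write
    angle=$3   # rotation in degrees, e.g. 1.4
    # -background white fills the corners exposed by the rotation,
    # -resize 100%x150% keeps the width and adds 50% to the height,
    # -monochrome reduces the result to a two-colour image.
    convert "$src" -background white -rotate "$angle" \
        -resize '100%x150%' -monochrome "$dst"
}
```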
OCR (gImageReader)
The images were now ready to put through gImageReader, a GTK front end to the Tesseract OCR tool. By default, although it had the resources to perform OCR on German documents, it didn't have the German dictionary it needed to spell check the output. I rectified this by adding the German MySpell dictionary using the package manager. By the way, gImageReader can handle PDF documents as an input format, if the page images are of a suitably good quality.

After the image has been loaded in and processed, the window is split between the input document and the output text. The output text pane has a real-time spell check and a few rudimentary text editing facilities. As you load in the pages of a multi-page document, you can keep adding the output to the text pane. Obviously, as the source document was so poor to begin with, the output contained a few errors. I made some corrections by hand, such as manually removing hyphens. The real-time spell checker, which lets you choose corrections from a context menu, and the visual references back to the original document were both helpful here.
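Removing end-of-line hyphens is the kind of correction that could be partly automated once the text has been saved to a file. The following is a minimal sketch, assuming the OCR output is plain text with one printed line per line of text; the function name dehyphenate is made up. It joins any line ending in a hyphen to the line that follows, so it will also merge genuinely hyphenated compounds that happen to break at a line end, and the result still needs proofreading.

```shell
# Hypothetical helper: rejoin words that were hyphenated across line breaks.
# Reads text on standard input and writes the result to standard output.
dehyphenate() {
    awk '
        # Line ends in a hyphen: drop the hyphen and print without a newline,
        # so that the next line continues the same word.
        /-$/ { sub(/-$/, ""); printf "%s", $0; next }
        # Any other line is printed unchanged.
        { print }
    '
}
```

For example, dehyphenate < ocr-output.txt > joined.txt, where both file names are placeholders.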
Translate into English (Google Translate)
The final stage was to copy and paste the text into Google Translate.
The end result was good enough for me to extract the information that I needed. Here's an example of its output:
The use of free software also has a political dimension. The freedom of the software was on the 3rd UN World Summit on lnfonnationsgesellschaft (WSIS) recognized as worthy of protection. It belongs to the elementary demands of civil society with the "digital divide" is to be overcome. The application and further development of free software is free of barriers such as Soitware patents, restrictive licensing conditions and high cost. This reflects free software free decision-making powers again and wins an additional strategic importance for research, innovation and growth.
UK based freelance writer Michael Reed writes about technology, retro computing, geek culture and gender politics.
Comments
gImageReader
I have used tesseract for a long time but did not know about gImageReader. Thank you for bringing it to my attention. It certainly helps move tesseract into the modern world!
unpaper also
Nice article. Also, the advantage of PBM is that you could go directly down the "unpaper" route (see http://unpaper.berlios.de/unpaper.html) if you'd like. Or use ImageMagick's convert/mogrify to do the dirty work when you have hundreds of more or less identical operations.
Overall, I really enjoyed the GIMP restoration instructions. I use some of them myself (the Mode>Grayscale, Mode>Indexed iteration), sometimes together with, believe it or not, a Motion Blur of 2-3 pixels followed by sharpening, or a Colors>Curves enhancement of the extreme blacks and whites.
Google Translate does not provide a real translation
With all that effort, all you get is a partial translation. Google Translate (GT) makes all kinds of mistakes. It cannot be relied on for professional output for real business work. If you want a professional translation, you need a professional translator. If you just want the gist, then GT is okay.
Those who are translators might consider using OmegaT+ from my project (http://omegatplus.sourceforge.net). It is a FOSS, cross-platform (Java) computer-assisted translation program that sends text to GT and retrieves the results for further editing in order to generate professional translations.
Missing elements
The first observation is that the Tesseract OCR engine is available in most Linux repos. The second is that Tesseract accepts TIFF as its input format. The third is that GIMP 2.6 is fully capable of opening a PDF file and exporting to TIFF.
Good so far. The missing piece is translation. Happily, there is a Python script over at SourceForge that uses the Google Translate API: py-translate.
The point? Your approach is sound but involves too much mouse rowing. Why not use GIMP at the front of the process? Do your clean-up in GIMP, then export to TIFF directly. Place your TIFF page output in a given directory. Then use a Bash script to iterate through the files in that directory, using tesseract and py-translate to do the rest.
You would be done in half the time.
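The OCR half of the loop described here might look like the following sketch. It assumes the cleaned-up pages are saved with a .tif extension and that the German language data for Tesseract is installed; the function name ocr_pages and its arguments are made up for illustration. The translation step is not shown, so the sketch stops once the combined text file has been written.

```shell
# Hypothetical helper: OCR every .tif page in a directory and collect the text.
# Usage: ocr_pages ./pages ./pages/all-pages.txt [language]
ocr_pages() {
    dir=$1            # directory holding the cleaned-up TIFF pages
    out=$2            # combined text file to write
    lang=${3:-deu}    # Tesseract language code, German by default
    : > "$out"        # start with an empty output file
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue      # directory contains no .tif files
        base=${img%.tif}
        # tesseract writes its result to BASE.txt; append it if OCR succeeded.
        tesseract "$img" "$base" -l "$lang" && cat "$base.txt" >> "$out"
    done
}
```

Pages are processed in the shell's sorted glob order, so zero-padded names such as page-000.tif keep the text in sequence.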
Not so fast ...
Scripting may be faster overall, but do not forget that you may want to check the OCR accuracy, especially for not-so-perfect input documents. Here the GUI still has the advantage: you can correct the OCR's "artistic" interpretations in real time. Just my 2 cents.
maybe...
Up to a point I might agree. But your observation falls back to the situation where the problem originated at the source: GIMP. If I wished to check the validity of what the OCR does, then a simple one-page run will tell me that almost instantly. The OCR engine is very consistent, even in the mistakes it makes.
Installing gimagereader
Executing the following commands should allow you to
install gimagereader on Ubuntu or other Debian-like Linux distributions.
wget http://sourceforge.net/projects/gimagereader/files/0.8.1/gimagereader_0....
sudo apt-get install tesseract-ocr python-gtkspell python-enchant python-poppler
sudo dpkg -i gimagereader_0.8.1-1_all.deb
Only GIMP
I have used Google Translate, but the only program I have listed is GIMP. Where would the other programs come from?
Where from
pdfimages is contained in poppler-utils, which on a Debian-based system is installable via apt. gimagereader can be downloaded from the link provided in the article: http://sourceforge.net/projects/gimagereader/
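To see which of the command-line tools are still missing before starting, a small check like the following could be used. It is only a sketch: the function name missing_tools is made up, and the command names passed to it are assumptions about how the programs are installed on your system.

```shell
# Hypothetical helper: print the name of each command that is not on the PATH.
# Usage: missing_tools pdfimages gimp tesseract
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

If pdfimages is reported as missing on a Debian-based system, sudo apt-get install poppler-utils should provide it.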