Tutorial: Translating Scanned Docs

Recently, I had to access some information from a German document, but the problem was that it was only available as a poor quality scan. This is an overview of how I extracted and translated the information. The tools used were pdfimages, GIMP, gImageReader and Google Translate.

Some OCR (optical character recognition) tools can take PDF files directly as input. Unfortunately, in this case, the scanned pages were badly skewed and needed to be tidied up by hand before processing. It would be possible, although tedious, to screen capture each and every page, but the screen resolution wouldn't match the resolution of the original scanned images, which would result in a loss of quality.

Extract the images (pdfimages)


The solution is to extract the images from the PDF file. I used a tool called pdfimages for this.

pdfimages inputfile.pdf outputfile


will produce a series of graphic files which are numbered according to the order in which they occurred in the PDF document. By default, they are in PBM format. This is a less common format, and pdfimages can be coaxed into outputting JPEG files instead. However, I would advise against that as JPEG is a lossy format and we need to preserve as much quality as possible for documents that are going to be OCRed.
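
As a rough sketch of how this step might be wrapped for repeated use, the function below calls pdfimages with an optional page range. The -f and -l options select the first and last page to extract. The function name extract_images and the file names in the usage comment are made up for illustration.

```shell
# extract_images PDF PREFIX [FIRST [LAST]]
# Extract the embedded images from PDF, optionally limited to a page range.
# "extract_images" is a hypothetical helper name, not part of pdfimages.
extract_images() {
    pdf=$1
    prefix=$2
    first=${3:-}
    last=${4:-}

    if [ -n "$first" ] && [ -n "$last" ]; then
        # Both ends of the range given
        pdfimages -f "$first" -l "$last" "$pdf" "$prefix"
    elif [ -n "$first" ]; then
        # Only a starting page given: extract from there to the end
        pdfimages -f "$first" "$pdf" "$prefix"
    else
        # No range: extract every image in the document
        pdfimages "$pdf" "$prefix"
    fi
}

# Example (file names are placeholders):
#   extract_images inputfile.pdf page 2 5
```

The output files are numbered after the prefix, so a prefix of page should give names along the lines of page-000.pbm.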

Clean up images (GIMP)


I used GIMP to clean up the images, and fortunately, it can work with PBM as an input format. The scans themselves had a number of problems. The first thing I did was to use the rotate tool (Layer>Transform>Arbitrary Rotate...) to straighten up the image. To make this easier, I zoomed in so that I could use the top of the window as a ruler against a line of text. In this case I found that a +1.4 deg rotation made the lines straight again.

The original image that I had to work with.
 

The cleaned up version.

The scans were also skewed to an extent. This meant that although the lines of text were now horizontally straight, the left margin was not vertically aligned. I used the GIMP skew tool to correct this, again working with a zoomed image.

The image was also vertically squashed, so I scaled it to add 50% to its height (Layer>Scale layer...). Through experimentation, I discovered that this, along with making the image mono (Image>Mode>Indexed...), greatly improved the accuracy of the OCR software.

Finally, I cropped the image.
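
I did all of this by hand in GIMP, but for a larger batch the rotate, stretch and mono steps could probably be scripted. The following is only a sketch, assuming ImageMagick's convert command is installed (newer ImageMagick releases call it magick). The helper name clean_scan is hypothetical, and the 1.4 degree default is simply the value that suited my scans. The skew correction and the crop are left out because they depend on the individual page.

```shell
# clean_scan IN OUT [DEGREES]
# Rotate a scan, stretch its height to 150% and reduce it to black and white.
# "clean_scan" is a hypothetical helper; the default angle is an assumption
# taken from the scans described above and will differ for other documents.
clean_scan() {
    in=$1
    out=$2
    deg=${3:-1.4}

    # -rotate     : rotate by the given number of degrees
    # -resize     : keep the width at 100%, scale the height to 150%
    # -monochrome : convert to a two-colour (black and white) image
    convert "$in" -rotate "$deg" -resize '100%x150%' -monochrome "$out"
}

# Example (file names are placeholders):
#   clean_scan outputfile-000.pbm cleaned-000.pbm
```

It is worth checking the first result by eye before running the same settings over every page.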

OCR (gImageReader)

The images were now ready to put through gImageReader, a GTK front end to the Tesseract OCR tool. By default, although it had the resources to perform OCR on German documents, it didn't have the German dictionary it needed to spell check the output. I rectified this by adding the German MySpell dictionary using the package manager. By the way, gImageReader can handle PDF documents as an input format, if the page images are of a suitably good quality.



After the image has been loaded in and processed, the window is split between the input document and the output text. The output text pane has a real-time spell check and a few rudimentary text editing facilities. As you load in the pages of a multi-page document you can keep adding the output to the text pane. Obviously, as the source document was so poor to begin with, the output contained a few errors. I made some corrections by hand, such as manually removing hyphens. The real-time spell checker, which lets you choose corrections from a context menu, and the visual references back to the original document were helpful here.
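
If the OCR output is saved to a text file, the hyphen clean-up could be partly automated. The sketch below is one possible approach using awk rather than anything built into gImageReader: it joins any line that ends in a hyphen to the line that follows it. The function name dehyphenate is made up, and the result still needs a read-through, because German also has legitimate hyphens at the end of a line.

```shell
# dehyphenate: read text on stdin, join words split by a hyphen at a line end.
# "dehyphenate" is a hypothetical helper name.
dehyphenate() {
    awk '{
        # sub() returns 1 when a trailing hyphen was found and removed
        if (sub(/-$/, ""))
            printf "%s", $0    # no newline, so the next line continues the word
        else
            print
    }'
}

# Example (file names are placeholders):
#   dehyphenate < ocr-output.txt > ocr-joined.txt
```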

Translate into English (Google Translate)

The final stage was to cut and paste the text into Google Translate.

The end result was good enough for me to extract the information that I needed. Here's an example of its output:

The use of free software also has a political dimension. The freedom of the software was on the 3rd UN World Summit on lnfonnationsgesellschaft (WSIS) recognized as worthy of protection. It belongs to the elementary demands of civil society with the "digital divide" is to be overcome. The application and further development of free software is free of barriers such as Soitware patents, restrictive licensing conditions and high cost. This reflects free software free decision-making powers again and wins an additional strategic importance for research, innovation and growth.

______________________

UK based freelance writer Michael Reed writes about technology, retro computing, geek culture and gender politics.

Comments


Pretty helpful article. The

Anonymous's picture

Pretty helpful article. The tool seems to be great. I have to use it the next time. I would like to recommend Staatliche Versorgung Analyse. Actually a Staatliche Versorgung Analyse considers all important facts and helps you with your decision.

I have used Google Translate

french20's picture

I have used Google Translate too, but the only program I have listed is GIMP. Where would the other programs come from?

gImageReader

celem's picture

I have used tesseract for a long time but did not know about gImageReader. Thank you for bringing it to my attention. It certainly helps move tesseract into the modern world!

unpaper also

cowardly Larry's picture

Nice article. Also, the advantage of PBM is you could directly go the "unpaper" route- see http://unpaper.berlios.de/unpaper.html, if you'd like. Or use imagemagick and convert/mogrify to do the dirty work, for hundreds of operations more or less the same.

Overall, really enjoyed the GIMP restore instructions - use some of them myself (the Mode-Grayscale-Mode-Indexed iteration) together with, sometimes, believe it or not, Motion Blur with 2-3 pixels followed by sharpening or a Colors/Curves enhancement of the extreme blacks and whites.

Google Translate does not provide a real translation

Raymond Martin's picture

With all that effort all you get is a partial translation. Google Translate (GT) makes all kinds of mistakes. It cannot be relied on for professional output for real business work. If you want professional translation you need a professional translator. If you just want the gist, then GT is okay.

Those who are translators might consider using OmegaT+ from my project (http://omegatplus.sourceforge.net). It is a FOSS cross-platform (Java) Computer Assisted Translation program that actually sends text to GT and retrieves the results for further editing in order to generate professional translations.

Missing elements

JohnMc's picture

First observation is that the Tesseract OCR engine is available in most Linux repos. Second is that Tesseract accepts TIFF as its input stream. The third is that GIMP 2.6 is fully capable of parsing a .PDF file and exporting to .TIFF.

Good so far. Missing piece is translation. Happily there is a python script over at Sourceforge that uses the Google Translate API -- py-translate.

The point? Your approach is sound but involves too much mouse rowing. Why not use GIMP at the front of the process? Do your clean up in GIMP then export to .TIFF directly. Place your .TIFF page output in a given directory. Then utilize a BASH script to iterate thru the files in that directory using tesseract and py-translate to do the rest?

You would be done in half the time.
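
A minimal sketch of the OCR half of such a script, assuming the cleaned pages were saved with a .tif extension and that the German language data for tesseract is installed. The function name ocr_dir and the directory in the usage comment are made up for illustration. The translation step is left out, since the py-translate command line isn't described here.

```shell
# ocr_dir DIR [LANG]
# Run tesseract over every .tif file in DIR and print the recognised text,
# page by page, on stdout. "ocr_dir" is a hypothetical helper name.
ocr_dir() {
    dir=$1
    lang=${2:-deu}    # deu is tesseract's code for German

    for img in "$dir"/*.tif; do
        # Skip the unexpanded pattern when the directory has no .tif files
        [ -e "$img" ] || continue

        base=${img%.tif}
        # tesseract writes its result to <outputbase>.txt
        if tesseract "$img" "$base" -l "$lang" >/dev/null 2>&1; then
            cat "$base.txt"
        else
            echo "OCR failed: $img" >&2
        fi
    done
}

# Example (directory and file names are placeholders):
#   ocr_dir ./pages > document.txt
```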

Not so fast ...

cowardly Larry's picture

scripting may be faster overall, but do not forget you may want to check the OCR accuracy, especially for not so perfect input documents. Here the GUI still has the advantage, just my 2 cents- you may correct in realtime your OCR "artistic" interpretations.

maybe...

JohnMc's picture

Up to a point I might agree. But your observation falls back to the situation where the problem originated at the source -- GIMP. If I wished to check the validity of what the OCR does then a simple one page run will tell me that almost instantly. The OCR engine is very consistent, even in the mistakes it makes.

Please support enhance this

Bruno49's picture

Please help enhance this post by adding citations to reliable sources. Unsourced material may well be challenged.
http://www.top20-songs.org/

Installing gimagereader

Daniel2's picture

Executing the following commands should allow you to
install gimagereader on Ubuntu or other Debian-like Linux distributions.

wget http://sourceforge.net/projects/gimagereader/files/0.8.1/gimagereader_0....

sudo apt-get install tesseract-ocr python-gtkspell python-enchant python-poppler

sudo dpkg -i gimagereader_0.8.1-1_all.deb

Only GIMP

obx_ruckle's picture

I have used Google Translate, but the only program I have listed is GIMP. Where would the other programs come from?

Where from

jargon's picture

pdfimages is contained in poppler-utils, which on a Debian-based system is installable via apt. gimagereader can be downloaded from the link provided in the article: http://sourceforge.net/projects/gimagereader/
