Tutorial: Translating Scanned Docs
Recently, I had to access some information from a German document, but it was only available as a poor-quality scan. This is an overview of how I extracted and translated the information. The tools used were pdfimages, GIMP, gImageReader and Google Translate.
There are some OCR (optical character recognition) tools that can directly handle PDF files as an input format. Unfortunately, in this case, the scanned pages were badly skewed and needed to be tidied up by hand before processing. It would be possible, although tedious, to screen capture each and every page, but the screen resolution and the resolution of the original scanned images wouldn't match, which would result in a loss of quality.
Extract the images (pdfimages)
The solution is to extract the images from the PDF file. I used a tool called pdfimages for this.
pdfimages inputfile.pdf outputfile
This will produce a series of image files, numbered according to the order in which they occur in the PDF document. By default, they are in PBM format. This is a less common format, and pdfimages can be coaxed into outputting JPEG files instead. However, I would advise against that, as JPEG is a lossy format and we need to preserve as much quality as possible for documents that are going to be OCRed.
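If there are many documents to process, the extraction step could be scripted. The following is only a sketch in Python: the function names are made up for illustration, it assumes pdfimages is installed and on the PATH, and it relies on pdfimages naming its output files with the prefix followed by a zero-padded image number (for example outputfile-000.pbm).

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, output_prefix):
    """Return the argument list for the plain pdfimages call shown above."""
    return ["pdfimages", str(pdf_path), str(output_prefix)]


def extract_images(pdf_path, output_prefix):
    """Run pdfimages and return the files it wrote, sorted by image number."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages was not found on the PATH")
    subprocess.run(build_pdfimages_command(pdf_path, output_prefix), check=True)
    prefix = Path(output_prefix)
    # Assumption: output files are named <prefix>-NNN.<ext>, so a plain
    # sort returns them in the order they occurred in the PDF.
    return sorted(prefix.parent.glob(prefix.name + "-*"))
```

Calling extract_images("inputfile.pdf", "outputfile") would then return the list of extracted page images, ready to be opened one by one for cleaning up.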
Clean up images (GIMP)
I used GIMP to clean up the images, and fortunately, it can work with PBM as an input format. The scans themselves had a number of problems. The first thing I did was to use the rotate tool (Layer>Transform>Arbitrary Rotate...) to straighten up the image. To make this easier, I zoomed in so that I could use the top of the window as a ruler against a line of text. In this case, I found that a +1.4 degree rotation made the lines straight again.
The original image that I had to work with.
The cleaned up version.
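Rather than finding the rotation by trial and error, the angle can be estimated from two points on the same line of text, read off from the pointer coordinates in the image editor. This is a small worked example in Python; the function name and the pixel coordinates are invented for illustration and are not measurements from my scan.

```python
import math


def line_angle_degrees(x1, y1, x2, y2):
    """Angle, in degrees, of the line through two points on one text baseline.

    Coordinates are image pixels, with y increasing downwards, so a positive
    result means the line of text drops from left to right.
    """
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


# Hypothetical measurement: the baseline drops 49 pixels over 2000 pixels.
angle = line_angle_degrees(100, 500, 2100, 549)
print(round(angle, 1))  # about 1.4
```

Whether the correction should be entered as a positive or a negative value depends on the rotation tool's sign convention, so it is worth checking the result on one page first.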
The scans were also skewed to an extent. This meant that although the lines of text were now horizontally straight, the left margin was not vertically aligned. I used the GIMP skew tool to correct this, again working with a zoomed image.
The image was also vertically crushed, so I scaled it to add 50% to its height (Layer>Scale layer...). Through experimentation, I discovered that this, along with making the image mono (Image>Mode>Indexed...), greatly improved the accuracy of the OCR software.
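For a large batch of pages, the stretch and mono steps could be scripted instead of being done by hand. The sketch below uses the third-party Pillow library (an assumption, it is not one of the tools I used) and is not the GIMP procedure described above. The function names and the threshold of 128 are illustrative choices that would need tuning for a given scan.

```python
def stretched_size(width, height, extra_height=0.5):
    """Return (width, new_height) after adding a fraction to the height."""
    return width, round(height * (1 + extra_height))


def stretch_and_make_mono(src, dst, extra_height=0.5, threshold=128):
    """Stretch an image vertically and save it as a 1-bit (mono) image."""
    # Pillow is imported here so that the size helper works without it.
    from PIL import Image

    with Image.open(src) as img:
        gray = img.convert("L")
        gray = gray.resize(stretched_size(*gray.size, extra_height), Image.LANCZOS)
        # Force every pixel to pure black or pure white, then store as 1-bit.
        mono = gray.point(lambda p: 255 if p >= threshold else 0).convert("1")
        mono.save(dst)
```

With the default of 0.5, a page that is 1000 by 800 pixels would come out at 1000 by 1200 pixels.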
Finally, I cropped the image.
The images were now ready to put through gImageReader, a GTK front end to the Tesseract OCR tool. By default, although it had the resources to perform OCR on German documents, it didn't have the German dictionary it needed to spell check the output. I rectified this by adding the German MySpell dictionary using the package manager. By the way, gImageReader can handle PDF documents as an input format, if the page images are of a suitably good quality.
After the image has been loaded and processed, the window is split between the input document and the output text. The output text pane has a real-time spell check and a few rudimentary text editing facilities. As you load in the pages of a multi-page document, you can keep adding the output to the text pane. Obviously, as the source document was so poor to begin with, the output contained a few errors. I made some corrections by hand, such as manually removing hyphens. The real-time spell checker, which lets you choose corrections from a context menu, and the visual references back to the original document were both helpful here.
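Removing the hyphens left over from line breaks is the sort of correction that could be partly automated once the text has been saved out of the OCR tool. The following Python sketch uses a simple heuristic that I am assuming suits German text: if the part after the line break starts with a lower-case letter, the word was probably split and is rejoined; otherwise the hyphen is kept and only the line break is removed. The function name and the sample words are for illustration only, and the result still needs to be read through by eye.

```python
import re

# A word character, a hyphen at the end of a line, then the next word character.
_LINE_END_HYPHEN = re.compile(r"(\w)-[ \t]*\n[ \t]*(\w)")


def _rejoin(match):
    head, tail = match.group(1), match.group(2)
    if tail.islower():
        # Lower-case continuation: treat as one word split across lines.
        return head + tail
    # Otherwise assume a genuine hyphenated compound and keep the hyphen.
    return head + "-" + tail


def join_hyphenated(text):
    """Rejoin words that were hyphenated across line breaks."""
    return _LINE_END_HYPHEN.sub(_rejoin, text)


print(join_hyphenated("Informations-\ngesellschaft"))  # Informationsgesellschaft
print(join_hyphenated("UN-\nWeltgipfel"))              # UN-Weltgipfel
```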
Translate into English (Google Translate)
The final stage was to cut and paste the text into Google Translate.
The end result was good enough for me to extract the information that I needed. Here's an example of its output:
The use of free software also has a political dimension. The freedom of the software was on the 3rd UN World Summit on lnfonnationsgesellschaft (WSIS) recognized as worthy of protection. It belongs to the elementary demands of civil society with the "digital divide" is to be overcome. The application and further development of free software is free of barriers such as Soitware patents, restrictive licensing conditions and high cost. This reflects free software free decision-making powers again and wins an additional strategic importance for research, innovation and growth.
UK based freelance writer Michael Reed writes about technology, retro computing, geek culture and gender politics.