Perennially Sane

I'm in your PDF, OCRin' your texts

With a small shell script, I can now use ImageMagick and Tesseract OCR to extract plain text from a PDF file right here on this web server.

The sample PDF file I used is a supplementary manual (or "cookbook") for the Blades of Avernum scenario designer, which can be reached on Erik Westra's page.

The text file is attached, as is the shell script (be warned that I'm a bash noob).
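
For the curious, the general shape of such a pipeline might look like the sketch below. This is not the attached script, just a minimal illustration under a few assumptions: ImageMagick 6 command names (identify, convert) with Ghostscript available for PDF input, tesseract on the PATH, and a made-up function name ocr_pdf. The 300 dpi grayscale rendering is a guess at sensible OCR input, not a setting taken from the original.

```bash
#!/usr/bin/env bash
# Sketch only, NOT the attached script.
# Assumes ImageMagick 6 (identify, convert) with Ghostscript, plus tesseract, on the PATH.

# ocr_pdf INPUT.pdf OUTPUT.txt   (function name is made up for this example)
ocr_pdf() {
    local pdf="$1" out="$2" workdir pages page

    workdir="$(mktemp -d)" || return 1

    # %n is the number of pages; identify prints it once per page,
    # so keep only the first line.
    pages="$(identify -format '%n\n' "$pdf" | head -n 1)"
    case "$pages" in
        ''|*[!0-9]*)
            echo "could not read page count of $pdf" >&2
            rm -rf "$workdir"
            return 1 ;;
    esac

    : > "$out"    # start with an empty output file
    page=0
    while [ "$page" -lt "$pages" ]; do
        # Render one page (index is zero-based) as a 300 dpi grayscale TIFF,
        # then OCR it; tesseract writes its result to <output base>.txt
        if convert -density 300 "${pdf}[${page}]" -colorspace Gray "$workdir/page.tif" &&
           tesseract "$workdir/page.tif" "$workdir/page" >/dev/null 2>&1; then
            cat "$workdir/page.txt" >> "$out"
        fi
        page=$((page + 1))
    done

    rm -rf "$workdir"
}
```

Calling it would look like ocr_pdf cookbook.pdf cookbook.txt (both file names are placeholders). Pages that fail to render or OCR are skipped rather than aborting the whole run.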

The script took 442 seconds to run through the 88 pages, which works out to a fairly acceptable 5 seconds per page. A full 15 seconds were taken to (inefficiently) find the number of pages.
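
The page count does not have to go through ImageMagick at all. One possible shortcut, assuming the pdfinfo tool from Poppler (or Xpdf) is installed, which is not something the original script is known to use, is to read the count from the document information instead of touching every page. The helper name count_pages is made up for this sketch.

```bash
# count_pages INPUT.pdf   (hypothetical helper; assumes pdfinfo is on the PATH)
count_pages() {
    # pdfinfo prints a line such as "Pages:          88"; print its second field.
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}
```

Since pdfinfo only reads the file's structure and renders nothing, this should normally return almost immediately, even for long documents.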

Note that Tesseract OCR has no bells and whistles. It recognizes only single-column text in two or three different typefaces. It does not work well with color, and it has a good chance of throwing up ASCII soup when it gets to illustrations. But the process certainly works.

I'm still thinking of the potential uses for this. Automatic digitizing of uploaded content is one of them - as is indexing for search.

Who knows, perhaps Google will eventually supplement its rather simple PDF-text-extraction feature with OCR?
