I'm in your PDF, OCRin' your texts
With the help of a nice shell script, I am now able to use ImageMagick and Tesseract OCR to extract plain text from a PDF file right here on this web server.
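For the curious, here is a minimal sketch of how such a pipeline can look. It is not the attached script, just an illustration under a few assumptions: ImageMagick's `convert` (with Ghostscript available for reading PDFs) and `tesseract` are on the PATH, and the function name `ocr_pdf`, the 300 dpi density and the temporary file names are my own picks for the example.

```shell
#!/usr/bin/env bash
# Sketch only: OCR a PDF page by page with ImageMagick and Tesseract.
# Assumes `convert` (ImageMagick + Ghostscript) and `tesseract` are installed.
# ocr_pdf, the 300 dpi setting and the page-NNN.tif names are illustrative choices.

ocr_pdf() {
    local pdf out tmp page
    pdf=$1
    out=$2
    tmp=$(mktemp -d) || return 1

    # Rasterize all pages in one pass as grayscale images
    # (page-000.tif, page-001.tif, ...), since color tends to hurt recognition.
    if ! convert -density 300 "$pdf" -colorspace Gray -alpha off "$tmp/page-%03d.tif"; then
        rm -rf "$tmp"
        return 1
    fi

    # Start with an empty output file, then append each page's text in order.
    : > "$out"
    for page in "$tmp"/page-*.tif; do
        [ -e "$page" ] || continue
        # tesseract writes its result to <output base>.txt
        tesseract "$page" "${page%.tif}" >/dev/null 2>&1 || continue
        cat "${page%.tif}.txt" >> "$out"
    done

    rm -rf "$tmp"
}
```

Calling `ocr_pdf manual.pdf manual.txt` would then leave the recognized text of all pages in `manual.txt`. Pages that Tesseract fails on are simply skipped in this sketch.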
The sample PDF file I used is a supplementary manual (or "cookbook") for the Blades of Avernum scenario designer, which can be reached on Erik Westra's page.
The text file is attached, as is the shell script (be warned that I'm a bash noob).
The script took 442 seconds to run through the 88 pages, which works out to a fairly acceptable 5 seconds per page. A full 15 seconds were taken to (inefficiently) find the number of pages.
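One way the page count could probably be obtained faster is to read it from the PDF metadata rather than rendering anything. The sketch below assumes that `pdfinfo` from poppler-utils is installed, which I have not checked on this server; the function names `pdf_page_count` and `seconds_per_page` are made up for the example.

```shell
# Sketch only: read the page count from the PDF metadata.
# Assumes pdfinfo (poppler-utils) is available; both function names are hypothetical.

pdf_page_count() {
    # pdfinfo prints a line such as "Pages:          88"
    pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ { print $2 }'
}

# Average processing time per page, rounded to one decimal place.
seconds_per_page() {
    LC_ALL=C awk -v s="$1" -v p="$2" 'BEGIN { if (p > 0) printf "%.1f\n", s / p }'
}
```

With the numbers from this run, `seconds_per_page 442 88` prints `5.0`.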
Note that Tesseract OCR has no bells and whistles. It recognizes only single-column text in 2-3 different typefaces. It does not work well with color, and it has a good chance of throwing up ASCII soup when it gets to illustrations. But the process certainly works.
I'm still thinking of the potential uses for this. Automatic digitizing of uploaded content is one of them - as is indexing for search.
Who knows, perhaps Google will eventually supplement its rather simple PDF-text-extraction feature with OCR?
- Arancaytar's blog