Hi All,
I just wanted to let y'all know that tesseract ocr has
significantly improved and is much easier to use than it used to
be. I have been using it with my workflow for a bit and it's crazy
how much better it is than it was back when I tried it last
(admittedly 5-6 years ago). For those of you doing your own scans,
or those of you finding sad little pdfs without ocr, the process
is fairly simple.
Let's say you find "The Master Manual of Fortran.pdf" out there in
the wild (or scan it). Here's how to turn it into a glorious ocr'd
version:
Export your pdf as a multi-image tiff - it'll be ginormous, but
you can delete it later (on Mac, this is just export from Preview
and select tiff, but gs will do it too, if I remember correctly)
and then:
tesseract The\ Master\ Manual\ of\ Fortran.tiff out -l eng pdf
et voila, a nice, if large, pdf called out.pdf will appear with
ocr text that actually matches your scan (it seems to have caught
up to adobe's ocr, or is quite close in my view, ymmv).
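For the gs route mentioned above, here's a rough sketch of the whole
pdf -> tiff -> ocr'd pdf trip. The function names are made up for
illustration, and the tiffgray device and 300 dpi are assumptions
rather than settings anyone tested on this scan (tiff24nc should
give you color if you need it):

```shell
# Sketch only: assumes gs (Ghostscript) and tesseract are on your PATH.

pdf_to_tiff() {
    # $1 = input pdf, $2 = output multi-page tiff
    # tiffgray and -r300 are assumed defaults; adjust to taste.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 \
       -sOutputFile="$2" "$1"
}

ocr_to_pdf() {
    # $1 = input tiff, $2 = output base name (tesseract adds .pdf)
    tesseract "$1" "$2" -l eng pdf
}

# Usage:
#   pdf_to_tiff "The Master Manual of Fortran.pdf" manual.tiff
#   ocr_to_pdf manual.tiff out        # writes out.pdf
#   rm manual.tiff                    # the tiff is big, toss it
```

Quoting the file name instead of backslash-escaping every space is
just a matter of taste; both work.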
I speak English, so I installed tesseract and tesseract-eng, but
it supports a bunch of other languages if you need them.
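If you do need other languages, tesseract can list what's installed
and takes several codes joined with "+". A small sketch, where
join_langs is a made-up helper and deu/fra (German/French) are just
example codes that assume you've installed those language packs:

```shell
# Sketch only: join_langs is a hypothetical helper, not part of tesseract.

join_langs() (
    # Join all arguments with "+", the separator -l expects.
    # Runs in a subshell so the IFS change doesn't leak out.
    IFS=+
    printf '%s\n' "$*"
)

# Usage:
#   tesseract --list-langs                    # what's installed
#   tesseract scan.tiff out -l "$(join_langs eng deu fra)" pdf
```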
Apparently google's been supporting and developing it for a while
now and if my results are any indicator, it's paying off (boy do I
remember all the gobbledegook it used to produce).
tesseract will import from different image types, multiple images,
etc. I just like the simplicity of tiff->pdf.
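For the multiple-images case, one way that should work is to hand
tesseract a text file listing one image path per line instead of a
single image. A sketch, assuming your pages are png files with
zero-padded names (page-01.png, page-02.png, ...) so they sort in
page order; make_page_list is a made-up name:

```shell
# Sketch only: assumes zero-padded png page names in one directory.

make_page_list() {
    # $1 = directory of page images, $2 = list file to write
    for f in "$1"/*.png; do
        # Skip the literal pattern if the directory has no png files.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done > "$2"
}

# Usage:
#   make_page_list pages pages.txt
#   tesseract pages.txt out -l eng pdf        # writes out.pdf
```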
Anyhow, thought y'all might like to know as many of you live off
the scans :).
Will