From: Dennis Boone <drb(a)msu.edu>
* Don't use JBIG2 and similar compression algorithms that try to
re-use blocks of pixels from elsewhere in the document -- too many
errors, and they're errors of the sort that can be critical. Even if
the replacements use the correct code point, they're distracting as
hell in a different font, size, etc.
I wondered why certain images were the way they were; this probably
explains a lot.
* OCR-under is good. I use `ocrmypdf`, which uses the
Tesseract engine.
Thanks for the tips.
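
For anyone who wants to script the OCR step, here is a minimal sketch that wraps the `ocrmypdf` command line from Python. The helper names (`build_ocrmypdf_command`, `add_text_layer`) are made up for illustration, and the flags used (`--language`, `--deskew`, `--title`) are the ones I believe ocrmypdf accepts; check `ocrmypdf --help` on your install before relying on them.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, language="eng", deskew=False, title=None):
    """Return an argv list for ocrmypdf; this does not run anything."""
    cmd = ["ocrmypdf", "--language", language]
    if deskew:
        cmd.append("--deskew")  # ask ocrmypdf to straighten crooked pages
    if title:
        cmd += ["--title", title]  # stored in the output PDF's metadata
    cmd += [str(Path(src)), str(Path(dst))]
    return cmd


def add_text_layer(src, dst, **options):
    """Run ocrmypdf (assumed to be on PATH); raises if it exits non-zero."""
    subprocess.run(build_ocrmypdf_command(src, dst, **options), check=True)
```

Something like `add_text_layer("scan.pdf", "scan-ocr.pdf", deskew=True)` would then produce a copy with the text layer under the page images.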
* Bookmarks for pages / table of contents entries / etc are mandatory.
Very few things make a scanned-doc PDF less useful than not being able
to skip directly to a document-indicated page.
I wish. This is a tough one. I generally end up sacrificing the
bookmarks to make a better PDF. I need to look into extracting the
bookmarks and whether they can be re-added without getting all wonky.
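
One possible route, sketched below, is to save the bookmarks as text before rebuilding the PDF and feed them back afterwards. This assumes the text layout that pdftk's `dump_data` prints (`BookmarkBegin`, `BookmarkTitle:`, `BookmarkLevel:`, `BookmarkPageNumber:`); the function names are hypothetical, and other tools use other formats.

```python
def parse_bookmarks(dump_text):
    """Extract (level, page, title) tuples from pdftk dump_data style text.

    Assumes every bookmark record carries a title, level and page number;
    an incomplete record raises KeyError.
    """
    records = []
    current = None
    for line in dump_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "BookmarkBegin":
            current = {}
            records.append(current)
        elif key.startswith("Bookmark") and current is not None:
            current[key] = value
    return [
        (int(r["BookmarkLevel"]), int(r["BookmarkPageNumber"]), r["BookmarkTitle"])
        for r in records
    ]


def format_bookmarks(bookmarks):
    """Write (level, page, title) tuples back out in the same text layout."""
    lines = []
    for level, page, title in bookmarks:
        lines += [
            "BookmarkBegin",
            f"BookmarkTitle: {title}",
            f"BookmarkLevel: {level}",
            f"BookmarkPageNumber: {page}",
        ]
    return "\n".join(lines) + "\n"
```

As far as I know, the round trip would be `pdftk old.pdf dump_data output marks.txt` before the rebuild and `pdftk new.pdf update_info marks.txt output final.pdf` after it. Page numbers would need adjusting if pages were added or dropped along the way.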
* I like to see at least 300 dpi.
Yes, me too, but I've found that this often results in files that are
too big when fixing existing PDFs; if I'm creating them, they're fine.
* Don't scan in color mode if the source material isn't color. Grey
scale or even "line art" works fine in most cases. Using one bit per
pixel means you can use G4 compression for colorless pages.
Amen :).
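
To put rough numbers on the resolution and color-mode points, here is a back-of-the-envelope sketch of uncompressed page size. The US Letter page size is just an assumption for the example, and the figures are before any compression (G4 or otherwise), ignoring row padding.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page (no row padding)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed example: an 8.5 x 11 inch page scanned at 300 dpi.
line_art = raw_page_bytes(8.5, 11, 300, 1)     # 1,051,875 bytes
greyscale = raw_page_bytes(8.5, 11, 300, 8)    # 8,415,000 bytes
full_color = raw_page_bytes(8.5, 11, 300, 24)  # 25,245,000 bytes
```

So before compression a color scan is 24 times the size of the line-art version of the same page, and going from 300 to 600 dpi multiplies any of these by four.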
* Do reduce the color depth of pages that do contain color if you can.
The resulting PDF can contain a mix of image types. I've worked with
documents that did use color where four or eight colors were enough,
and the whole document could be mapped to them. With care, you _can_
force the scans down to two or three bits per pixel.
* Do insert sensible metadata.
* Do try to square up the inevitably crooked scans, clean up major
floobydust and whatever crud around the edges isn't part of the paper,
etc. Besides making the result more readable, it'll help the OCR. I
never have any luck with automated page orientation tooling for some
reason, so end up just doing this with Gimp.
Great points. Thanks.
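
On the reduced color depth idea, a toy sketch of why four or eight colors land at two or three bits per pixel: each pixel is replaced by the index of its nearest palette entry. The palette and function names here are invented for illustration; a real workflow would use an image tool's indexed-color mode rather than this loop.

```python
def bits_per_pixel(num_colors):
    """Smallest whole number of bits that can index num_colors entries."""
    return max(1, (num_colors - 1).bit_length())


def nearest_index(pixel, palette):
    """Index of the palette color closest to pixel (squared RGB distance)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, palette[i])),
    )


def quantize(pixels, palette):
    """Map a sequence of (r, g, b) pixels to palette indices."""
    return [nearest_index(p, palette) for p in pixels]


# Hypothetical four-color palette: black, white, red, blue.
PALETTE = [(0, 0, 0), (255, 255, 255), (255, 0, 0), (0, 0, 255)]
indices = quantize([(20, 20, 30), (240, 240, 235), (250, 10, 10)], PALETTE)
```

With this palette the three sample pixels map to black, white and red, and each index fits in `bits_per_pixel(len(PALETTE))` bits.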
-will