I read a tremendous number of documents from the web, or at least
read parts of them - to the tune of maybe 50 or so a week. It is
appalling to me in this era that we can't get better at scanning. Be
that as it may, the needle doesn't seem to have moved appreciably in
the last decade or so, and it's a little sad. Sure, if folks print to
PDF, it's great. But if they scan a doc, not so great, even today.
I see a fair number of frustrating scanned-doc PDFs too. My thoughts on
what constitutes a decent scan:
* Assume people will print at least a few pages occasionally. It's
often easier to print that one table or diagram and take it to the
bench than to try to use a tablet or run back and forth to a PC. That
affects how you think about creating the PDF.
* Don't use lossy JBIG2 and similar compression algorithms that try to
re-use blocks of pixels from elsewhere in the document -- too many
errors, and they're errors of the sort that can be critical, such as
one digit silently swapped for another. Even if the replacements use
the correct code point, they're distracting as hell in a different
font, size, etc.
* OCR-under is good. I use `ocrmypdf`, which uses the Tesseract engine.
* I do get angry when I see people trying to reconstruct the document
via OCR and omitting the actual scan -- too many errors.
* Bookmarks for pages / table of contents entries / etc. are mandatory.
Very few things make a scanned-doc PDF less useful than not being able
to skip directly to a page by the number printed in the document.
* I like to see at least 300 dpi.
* Don't scan in color mode if the source material isn't color.
Greyscale or even "line art" works fine in most cases. Using one bit
per pixel means you can use G4 compression for colorless pages.
* Do reduce the color depth of pages that do contain color if you can.
The resulting PDF can contain a mix of image types. I've worked with
documents that did use color where four or eight colors were enough,
and the whole document could be mapped to them. With care, you _can_
force the scans down to two or three bits per pixel.
* Do insert sensible metadata.
* Do try to square up the inevitably crooked scans, clean up major
floobydust and whatever crud around the edges isn't part of the paper,
etc. Besides making the result more readable, it'll help the OCR. I
never have any luck with automated page orientation tooling for some
reason, so I end up just doing this by hand in GIMP.
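
To put rough numbers on the color-depth points above, here is a minimal
sketch of the raw (uncompressed) size of one page at 300 dpi. The US
Letter page size, the function name `raw_page_bytes`, and the
byte-padded rows are assumptions for illustration only; real PDFs
compress these images, so read the figures as relative costs, not
actual file sizes.

```python
# Rough uncompressed size of one scanned page at a given bit depth.
# Assumptions (not from the post): US Letter paper, rows padded to whole bytes.

def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Pad each row up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


if __name__ == "__main__":
    depths = [
        ("24-bit color", 24),
        ("8-bit greyscale", 8),
        ("3 bits (8 colors)", 3),
        ("2 bits (4 colors)", 2),
        ("1 bit (line art)", 1),
    ]
    for label, bpp in depths:
        size = raw_page_bytes(8.5, 11, 300, bpp)
        print(f"{label:<20} {size / 1e6:6.2f} MB")
```

Under those assumptions a full-color page is roughly 25 MB raw against
roughly 1 MB at one bit per pixel, and that is before G4 or any other
compression is applied.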
Tuppence.
De