[COFF] Re: converting lousy scans of pdfs into something more useable

3 Feb 2023

...
 I read a tremendous number of documents from the web, or at least
 read parts of them - to the tune of maybe 50 or so a week. It is
 appalling to me in this era that we can't get better at scanning. Be
 that as it may, the needle doesn't seem to have moved appreciably in
 the last decade or so and it's a little sad. Sure, if folks print to
 pdf, it's great. But, if they scan a doc, not so great, even today.

I see a fair number of frustrating scanned-doc PDFs too.  My thoughts on
what constitutes a decent scan:
* Assume people will print at least a few pages occasionally.  It's
  often easier to print that one table or diagram and take it to the
  bench than to try to use a tablet or run back and forth to a PC.  That
  affects how you think about creating the PDF.
* Don't use JBIG2 and similar compression algorithms that try to
  re-use blocks of pixels from elsewhere in the document -- too many
  errors, and they're errors of the sort that can be critical.  Even if
  the replacements use the correct code point, they're distracting as
  hell in a different font, size, etc.
* OCR-under is good.  I use `ocrmypdf`, which uses the Tesseract engine.
* I do get angry when I see people trying to reconstruct the document
  via OCR and omitting the actual scan -- too many errors.
* Bookmarks for pages / table of contents entries / etc. are mandatory.
  Very few things make a scanned-doc PDF less useful than not being able
  to skip directly to a document-indicated page.
* I like to see at least 300 dpi.
* Don't scan in color mode if the source material isn't color.  Grey
  scale or even "line art" works fine in most cases.  Using one bit per
  pixel means you can use G4 compression for colorless pages.
* Do reduce the color depth of pages that do contain color if you can.
  The resulting PDF can contain a mix of image types.  I've worked with
  documents that did use color where four or eight colors were enough,
  and the whole document could be mapped to them.  With care, you _can_
  force the scans down to two or three bits per pixel.
* Do insert sensible metadata.
* Do try to square up the inevitably crooked scans, clean up major
  floobydust and whatever crud around the edges isn't part of the paper,
  etc.  Besides making the result more readable, it'll help the OCR.  I
  never have any luck with automated page orientation tooling for some
  reason, so end up just doing this with Gimp.
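
To put rough numbers on the bit-depth points above, here is a small sketch
that computes the uncompressed raster size of one page at 300 dpi.  The
page size (US Letter, 8.5 x 11 inches) and the function name are my own
assumptions for illustration, and real files will be much smaller once G4
or another codec is applied; this only shows the starting point each codec
has to work from.

```python
# Uncompressed raster size of one scanned page at various bit depths.
# Assumption (not from the post): US Letter paper, 8.5 x 11 inches.

def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Bytes needed for one page before any compression is applied."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Pad each row up to a whole number of bytes, as most raster formats do.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


if __name__ == "__main__":
    depths = [
        ("24-bit color", 24),
        ("8-bit grey", 8),
        ("2-bit palette", 2),
        ("1-bit line art", 1),
    ]
    for label, bpp in depths:
        size = raw_page_bytes(300, bpp)
        print(f"{label:>15}: {size / 1_000_000:6.2f} MB")
```

Under those assumptions a 24-bit page starts at roughly 25 MB and a 1-bit
page at roughly 1 MB, before the compressor has done anything at all.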
Tuppence.
De
