All,
I thought I would post something here that wasn't DOA over on tuhs
and see if it would fly here instead. I have been treating coff as
the destination for the place where off-topic tuhs posts go to
die, but after the latest thread bemoaning the lack of a place to
go for topics tangential to unix, I thought I'd actually start a coff
thread! Here goes...
I read a tremendous number of documents from the web, or at least
read parts of them - to the tune of maybe 50 or so a week. It is
appalling to me in this era that we can't get better at scanning.
Be that as it may, the needle doesn't seem to have moved
appreciably in the last decade or so and it's a little sad. Sure,
if folks print to pdf, it's great. But, if they scan a doc, not so
great, even today.
Rather than worry about the scanning aspects, I am more interested
in what to do with those scans. Can they be handled in such a way
as to give them new life? Unlike the scanning side of things, I
have found quite a bit of movement in the area of being able to
work with the pdfs and I'd really like to get way better at it. If
I get a badly scanned pdf and can make it legible on screen,
legible in print, and searchable, I'm golden. Sadly, that's way
harder than it sounds, or, in my opinion, than it should be.
I recently put together a workflow that is tenable, if
time-consuming. If you're interested in the details, I've shared them:
https://decuser.github.io/pdfs/2023/02/01/pdf-cleanup-workflow.html
In the note, I leverage a lot of great tools that have
significantly improved over the years to the point where they do a
great job at what they do. But, there's lots of room for
improvement. Particularly in the area of image tweaking around
color and highlights and such.
The note is mac-centric in that I use a mac; otherwise, all of the
tools work on modern *nix and, with a little abstract thought,
windows too.
In my world, here's what happens:
* find a really interesting topic and along the way, collect pdfs
to read
* open the pdf and find it salient, but not so readable, with sad
printability, and no or broken OCR
* I begin the process of making the pdf better with the
aforementioned goals aforethought
The process in a nutshell:
1. Extract the images to individual tiffs (so many tools can't
   work with multi-image tiffs)
   * pdfimages from poppler works great for this
2. Adjust the color (it seems impossible to do this without a
   batch capable gui app)
   * I use Photoscape X for this - just click batch and make
     adjustments to all of the images using the same settings
3. Resize the images - most pdfs have super wonky sizes
   * I use convert from imagemagick for this and I compress the
     tiffs while I'm converting them
4. Recombine the images into a multi-tiff image
   * I use tiffcp from libtiff for this
5. OCR the reworked image set
   * I use tesseract for this - It's gotten so much better it's
     ridiculous
This process results in a pdf that meets the objectives.
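
Leaving the manual step 2 aside, the command line steps can be strung
together roughly like this. It's a sketch only: the 2550x3300 target
(US Letter at 300 dpi), the LZW compression, the English OCR language,
the work/ directory layout, and the DRY_RUN switch are all placeholders
I picked for illustration, so adjust to taste.

```shell
#!/usr/bin/env bash
# Sketch of steps 1, 3, 4 and 5. Step 2 (color adjustment) stays manual.
# Assumed, not prescribed: target size, compression, language, layout.

run() {
    # With DRY_RUN=1, print the command instead of running it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

cleanup_pdf() {
    local pdf="$1" out="$2" work="${3:-work}" f
    run mkdir -p "$work/pages" "$work/resized"

    # 1. Extract every embedded image as its own tiff (poppler).
    run pdfimages -tiff "$pdf" "$work/pages/page"

    # 2. (manual) batch-adjust color on $work/pages/*.tif in a gui app.

    # 3. Fit each page inside the target box and compress (imagemagick).
    for f in "$work"/pages/*.tif; do
        [ -e "$f" ] || continue    # nothing extracted yet, e.g. dry run
        run convert "$f" -resize 2550x3300 \
            -units PixelsPerInch -density 300 -compress lzw \
            "$work/resized/$(basename "$f")"
    done

    # 4. Recombine the pages into one multi-page tiff (libtiff).
    run tiffcp "$work"/resized/*.tif "$work/combined.tif"

    # 5. OCR and write a searchable pdf named "$out.pdf" (tesseract).
    run tesseract "$work/combined.tif" "$out" -l eng pdf
}
```

Something like `DRY_RUN=1 cleanup_pdf scan.pdf scan-clean` just prints
the commands so you can eyeball them before letting it loose on a real
pdf.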
It's not horribly difficult to do and it's not horribly time
consuming. It represents many, many attempts to figure out this
thorny problem.
I'd really like to get away from needing Photoscape X, though.
Then I could entirely automate the workflow in bash...
The problem is that the image adjustments are the most critical
step, and the only one still stuck in a gui - image extraction,
resize, compression, recombining images, ocr (I still can't believe
it), and outputting a pdf are now taken care of by command line
tools that work well.
I wouldn't mind using a gui to figure out some color setting
(Grayscale, Black and White, or Color) and increase/decrease
values for shadows and highlights if those could then be mapped to
command line arguments of a tool that could apply them, though.
Cuz then the workflow could be: extract a good representative
page as an image, open it, figure out the color settings, and then
use those settings with toolY as part of the scripted workflow.
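
For the sake of argument, here is what that could look like if toolY
turned out to be imagemagick's convert. That's purely an assumption on
my part, as are the function names, the three mode names, and the idea
of expressing shadows and highlights as black and white points for
-level. I haven't verified that it matches what a gui app does with
its sliders.

```shell
#!/usr/bin/env bash
# Turn settings worked out on one representative page into convert
# options. Hypothetical mapping: mode is gray, bw, or color; black and
# white are the -level black/white points in percent.

adjust_args() {
    local mode="$1" black="$2" white="$3"
    case "$mode" in
        gray)  printf '%s\n' "-colorspace Gray -level ${black}%,${white}%" ;;
        bw)    printf '%s\n' "-colorspace Gray -level ${black}%,${white}% -threshold 50%" ;;
        color) printf '%s\n' "-level ${black}%,${white}%" ;;
        *)     echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

# Apply the same settings to every tiff in src, writing results to dst.
adjust_all() {
    local src="$1" dst="$2" args f
    args=$(adjust_args "$3" "$4" "$5") || return 1
    mkdir -p "$dst"
    for f in "$src"/*.tif; do
        [ -e "$f" ] || continue
        # $args is left unquoted on purpose so it splits into options.
        convert "$f" $args "$dst/$(basename "$f")"
    done
}
```

So after trying values on one page, `adjust_all pages adjusted gray 15 85`
would push everything below 15% to black and everything above 85% to
white on all of the pages, with no gui in the loop.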
Here are the objectives for easy reference:
1. The PDF needs to be readable on a decent monitor (zooming in
doesn't distort the readability, pixelation that is systematic is
ok, but not preferred). Yes, I know it's got a degree of
subjectivity, but blobby, bleeding text is out of scope!
2. The PDF needs to print with a minimum of artifacts (weird
shadows, bleeding, and blobs are out). It needs to be easy to read.
3. The PDF needs to be searchable with good accuracy (generally,
bad scans have no ocr, or ocr that doesn't work).
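
A quick way to tell whether a pdf has any text layer at all is to dump
its text with pdftotext (also from poppler) and count the words. The
helper name below is made up, and a nonzero count only says that some
text exists, not that the ocr is any good.

```shell
#!/usr/bin/env bash
# Report on a pdf's text layer from extracted text read on stdin.
# Intended use: pdftotext scan.pdf - | text_layer_status
text_layer_status() {
    local words
    # wc pads its output on some systems, so strip the whitespace.
    words=$(wc -w | tr -d '[:space:]')
    if [ "$words" -eq 0 ]; then
        echo "no text layer"
    else
        echo "text layer with $words words"
    fi
}
```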
Size is a consideration, but depends greatly on the value of the
work. My own calculus goes like this - if it's modern work, it
should be way under 30 MB. If it's print to pdf, it should be way
under 10 MB (remember when you thought you'd never use 10 MB of
space... for all of your files and the os). If it is significant
and rare, less than 150 MB can work. Obviously, this is totally
subjective; your calculus is probably quite different.
The reason this isn't posted over in pdf.scans.discussion is that
even if there were such a place, it'd be filled with super
technical gibberish about color depth and the perils of gamma
radiation or somesuch. We, as folks interested in preserving the
past, have a more pragmatic need for a workable solution that is
attainable by mortals.
So, with that as a bit of background, let me ask what I asked
previously in a different way on tuhs, here in coff - what's your
experience with using sad pdfs? Do you just live with them as they
are, or do you try to fix them and how, or do you use a workflow and
get good results?
Later,
Will