[TUHS] tesseract has gotten so much better

28 Jan 2023

Hi All,
I just wanted to let y'all know that tesseract ocr has significantly
improved and is much easier to use than it used to be. I have been using
it with my workflow for a bit and it's crazy how much better it is than
it was back when I tried it last (admittedly 5-6 years ago). For those
of you doing your own scans, or those of you finding sad little pdfs
without ocr, the process is fairly simple.
Let's say you find "The Master Manual of Fortran.pdf" out there in the
wild (or scan it). Here's how to turn it into a glorious ocr'd version:
Export your pdf as a multi-image tiff - it'll be ginormous, but you can
delete it later (on Mac, this is just export from preview and select
tiff, but gs will do it too, if I remember correctly) and then:

    tesseract The\ Master\ Manual\ of\ Fortran.tiff out -l eng pdf

et voila, a nice, if large, pdf called out.pdf will appear
with ocr text that actually matches your scan (it seems to have caught
up to adobe's ocr, or is quite close in my view, ymmv).
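For anyone not on a Mac, here is a rough sketch of the same tiff->pdf round trip using gs for the export step. Treat it as an illustration rather than a tested recipe: ocr_pdf and run are made-up helper names, and the 300 dpi resolution and tiffgray device are my assumptions, not something from the workflow above. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: OCR a scanned PDF via an intermediate multi-page TIFF.
# ocr_pdf and run are hypothetical helper names; 300 dpi and the
# tiffgray device are assumed defaults, adjust to taste.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

ocr_pdf() {
    in_pdf=$1
    out_base=${2:-out}
    tiff="${out_base}.tiff"

    # Render every page of the PDF into one multi-page grayscale TIFF.
    run gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 \
        -sOutputFile="$tiff" "$in_pdf" || return 1

    # OCR the TIFF; the trailing "pdf" config writes ${out_base}.pdf.
    run tesseract "$tiff" "$out_base" -l eng pdf || return 1

    # The intermediate TIFF is large, so remove it once the PDF exists.
    run rm -f "$tiff"
}
```

Usage would be something like: ocr_pdf "The Master Manual of Fortran.pdf" out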
I speak English, so I installed tesseract and tesseract-eng, but it
supports a bunch of other languages if you need them. Apparently
google's been supporting and developing it for a while now and if my
results are any indicator, it's paying off (boy do I remember all the
gobbledegook it used to produce).
tesseract will import from different image types, multiple images, etc.
I just like the simplicity of tiff->pdf.
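If you would rather skip the big tiff and feed tesseract one image per page, it also accepts a plain text file listing the input images, one per line. A small sketch of that, where write_page_list and the page file names are made up for illustration:

```shell
#!/bin/sh
# Sketch only: build a page list that tesseract can read in place of
# a single image. write_page_list is a hypothetical helper name.

write_page_list() {
    list=$1
    shift
    # One image path per line, in the order the pages should appear.
    printf '%s\n' "$@" > "$list"
}

# Example (file names are placeholders):
#   write_page_list pages.txt page-001.png page-002.png page-003.png
#   tesseract pages.txt out -l eng pdf
```

The resulting out.pdf should contain the pages in the order they appear in the list.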
Anyhow, thought y'all might like to know as many of you live off the
scans :).
Will
