Fabio's notes: Scan and OCR

Wednesday, September 14, 2022

To generate images (for scantailor) from a PDF; this writes the embedded images (and fonts) into the current directory

mutool extract scan.pdf
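
Since mutool extract writes into the current directory, it can help to keep the extracted files in their own folder. A minimal sketch, assuming a helper named extract_images (hypothetical, not part of mutool) and that realpath from coreutils is available:

```shell
# Hypothetical helper: extract the embedded images of PDF $1 into directory $2.
extract_images() {
    mkdir -p "$2"
    # Resolve the PDF path first, because the subshell changes directory
    # and mutool extract writes its output into the current directory.
    (pdf=$(realpath "$1") && cd "$2" && mutool extract "$pdf")
}
```

For example, extract_images scan.pdf images leaves the working directory clean and gives scantailor a single input folder.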

Run scantailor on the extracted images

https://github.com/4lex4/scantailor-advanced

To OCR, use tesseract. On Ubuntu, version 5 is available from this PPA

sudo add-apt-repository ppa:alex-p/tesseract-ocr5

sudo apt-get update

sudo apt-get install tesseract-ocr

To run tesseract on multiple images (the output of scantailor)

ls -1 *.tif | sort > filelist

tesseract filelist outfilename pdf
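
The two commands above can be wrapped in small functions. This is only a sketch: list_pages and ocr_pages are hypothetical names, and it assumes GNU sort, whose -V (version sort) option keeps page2.tif ahead of page10.tif when the page numbers are not zero-padded.

```shell
# Hypothetical helper: print the .tif files in directory $1, one per line,
# in page order (version sort, so page2.tif comes before page10.tif).
list_pages() {
    for f in "$1"/*.tif; do
        # Skip the unexpanded pattern when the directory has no .tif files.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done | sort -V
}

# Hypothetical helper: OCR the images named in list file $1 into $2.pdf.
# tesseract reads a text file argument as a list of image paths
# and appends the .pdf extension to the output base name itself.
ocr_pages() {
    tesseract "$1" "$2" pdf
}
```

Usage, assuming scantailor wrote its output to a folder called out: list_pages out > filelist, then ocr_pages filelist outfilename.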

To OCR an existing PDF directly (add a searchable text layer), use ocrmypdf

https://pypi.org/project/ocrmypdf/
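
A minimal invocation sketch for ocrmypdf, which takes an input PDF and an output PDF. The wrapper name ocr_pdf, the default language eng, and the choice to enable --deskew are assumptions made for illustration:

```shell
# Hypothetical helper: add a searchable text layer to PDF $1, writing $2.
# $3 is the Tesseract language code (assumed default: eng).
ocr_pdf() {
    # -l selects the OCR language; --deskew straightens tilted pages first.
    ocrmypdf -l "${3:-eng}" --deskew "$1" "$2"
}
```

For example, ocr_pdf scan.pdf scan-ocr.pdf leaves the original file untouched and writes the OCRed copy next to it.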