Wednesday, September 14, 2022

Scan and OCR

To extract the page images from a PDF (as input for scantailor)

 mutool extract scan.pdf
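mutool extract writes everything it finds (images and embedded fonts) into the current directory, so it can help to run it inside a separate folder. A minimal sketch, assuming a helper called extract_images and a target folder name of your choosing (both are made up for illustration):

```shell
# Hypothetical helper: extract the embedded images of a PDF into their own directory.
# mutool extract always writes into the current directory, so we cd there first.
extract_images() {
    pdf=$(realpath "$1") || return 1   # absolute path, still valid after the cd
    outdir="$2"
    mkdir -p "$outdir" || return 1
    # The subshell keeps the cd from affecting the calling shell.
    (cd "$outdir" && mutool extract "$pdf")
}

# Usage (the directory name "pages" is an assumption):
#   extract_images scan.pdf pages
```

The folder can then be opened in scantailor as the input directory.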

Then run scantailor on the extracted images to clean up the pages

https://github.com/4lex4/scantailor-advanced

To OCR, use tesseract. To install version 5 on Ubuntu from a PPA

sudo add-apt-repository ppa:alex-p/tesseract-ocr5

sudo apt-get update

sudo apt-get install tesseract-ocr

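To run tesseract on multiple images (the output of scantailor), write their names to a list file and pass that file as the input; the result is a single searchable outfilename.pdf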

ls -1 *.tif | sort  > filelist

tesseract  filelist outfilename pdf
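One thing to watch: plain sort orders names as text, so page_10.tif would come before page_2.tif if the page numbers are not zero-padded. A small sketch that builds the list with version sort instead, assuming GNU sort (as on Ubuntu); the helper name make_filelist and the folder name out are assumptions for illustration:

```shell
# Hypothetical helper: write the .tif files of a directory to a list file, in page order.
# sort -V (GNU coreutils) compares embedded numbers numerically,
# so page_2.tif is listed before page_10.tif.
make_filelist() {
    dir="$1"
    out="$2"
    find "$dir" -maxdepth 1 -type f -name '*.tif' | sort -V > "$out"
}

# Usage (assumes the scantailor output is in ./out):
#   make_filelist out filelist
#   tesseract filelist outfilename pdf
```

If the file names are already zero-padded (page_0001.tif, page_0002.tif, ...), the plain ls and sort version gives the same order.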

To add an OCR text layer directly to an existing PDF, use ocrmypdf

https://pypi.org/project/ocrmypdf/
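A basic ocrmypdf call takes an input PDF and an output PDF. A rough sketch of a wrapper, assuming English text and slightly crooked scans; the function name ocr_pdf and the file names are made up, and the language and deskew options are just one reasonable choice:

```shell
# Hypothetical wrapper around ocrmypdf: input PDF in, searchable PDF out.
#   -l eng     OCR language (assumption: the scan is in English)
#   --deskew   straighten crooked pages before OCR
ocr_pdf() {
    ocrmypdf -l eng --deskew "$1" "$2"
}

# Usage (file names are placeholders):
#   ocr_pdf scan.pdf scan_ocr.pdf
```

ocrmypdf uses tesseract for the recognition step, so the tesseract install above is still needed.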
