To generate images (for ScanTailor) from a PDF:
mutool extract scan.pdf
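mutool extract writes its output (the embedded images, plus any embedded fonts) into the current directory, so it can help to run it in a folder of its own. A minimal sketch, assuming GNU realpath is available; the function name extract_images and the directory name pages are made up for illustration:

```shell
# Hypothetical helper: extract the embedded images of a PDF into a separate directory.
extract_images() {
    pdf=$(realpath "$1")      # absolute path, so it stays valid after the cd below
    outdir=$2
    mkdir -p "$outdir" || return 1
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$outdir" && mutool extract "$pdf" )
}
```

Called as: extract_images scan.pdf pages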
Run ScanTailor Advanced on the extracted images:
https://github.com/4lex4/scantailor-advanced
To OCR, use Tesseract. On Ubuntu, Tesseract 5 can be installed from this PPA:
sudo add-apt-repository ppa:alex-p/tesseract-ocr5
sudo apt-get update
sudo apt-get install tesseract-ocr
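To run Tesseract on multiple images (the output of ScanTailor), list them in a text file and pass that file as the input; the result is written to outfilename.pdf: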
ls -1 *.tif | sort > filelist
tesseract filelist outfilename pdf
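The two commands above can be wrapped in one function. This is only a sketch: the name ocr_pages is made up, and it assumes GNU sort, whose -V option orders page2.tif before page10.tif when the page numbers are not zero-padded (plain sort would not).

```shell
# Hypothetical helper: OCR every .tif in a directory into one searchable PDF.
ocr_pages() {
    dir=$1
    outbase=$2                      # tesseract appends .pdf itself
    list=$(mktemp) || return 1
    # One image path per line, in natural (version) order.
    find "$dir" -maxdepth 1 -name '*.tif' | sort -V > "$list"
    tesseract "$list" "$outbase" pdf
    status=$?
    rm -f "$list"                   # the list is only needed during the run
    return $status
}
```

Called as: ocr_pages out outfilename (here out stands for whatever folder holds the ScanTailor output).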
To add an OCR text layer to a scanned PDF directly, use OCRmyPDF:
https://pypi.org/project/ocrmypdf/
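A typical ocrmypdf call takes an input PDF and an output PDF. The wrapper below is a sketch with a made-up name (ocr_pdf) and a made-up naming scheme for the output file; --deskew and -l are existing ocrmypdf options, and the default language eng is an assumption.

```shell
# Hypothetical helper: add a text layer to scan.pdf, writing scan-ocr.pdf next to it.
ocr_pdf() {
    src=$1
    dst="${src%.pdf}-ocr.pdf"       # strip .pdf, append -ocr.pdf
    # --deskew straightens tilted pages; -l selects the Tesseract language.
    ocrmypdf --deskew -l "${2:-eng}" "$src" "$dst"
}
```

Called as: ocr_pdf scan.pdf, or ocr_pdf scan.pdf deu for a German scan (the matching Tesseract language data has to be installed).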