Turn scanned PDFs into searchable PDFs. Works with any PDF that contains images or scanned pages.
- Python 3.6+
- Tesseract OCR:
  - Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
  - macOS: brew install tesseract
  - Linux: sudo apt-get install tesseract-ocr
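After installing, it can help to confirm that Tesseract is reachable from Python before running the tool. The following is a minimal sketch, assuming the executable is named `tesseract` and is expected on PATH; `find_tesseract` is a hypothetical helper and not part of this project.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; install it using the steps above.")
    else:
        # `tesseract --version` prints the installed version banner.
        result = subprocess.run(
            [path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        print(result.stdout)
```

If the check reports that Tesseract is missing right after installation, opening a new terminal (so PATH is reloaded) is usually worth trying first.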
Install the Python dependencies:
pip install -r requirements.txt
Basic:
python ocr_pdf.py your_file.pdf
Creates your_file_searchable.pdf
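The default naming rule (your_file.pdf becomes your_file_searchable.pdf) can be mirrored in your own scripts, for example to check whether a file has already been processed. This is only a sketch: `default_output_path` is a hypothetical helper, and it assumes the output is written next to the input, which the text above does not state.

```python
from pathlib import Path


def default_output_path(input_pdf, suffix="_searchable"):
    """Derive an output path such as your_file_searchable.pdf from your_file.pdf."""
    src = Path(input_pdf)
    # with_name keeps the directory and swaps only the file name.
    return src.with_name(src.stem + suffix + src.suffix)


print(default_output_path("your_file.pdf"))  # your_file_searchable.pdf
```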
With GUI:
python ocr_gui.py
Multiple files:
python ocr_pdf.py file1.pdf file2.pdf file3.pdf
Different language (German example):
python ocr_pdf.py document.pdf -l deu
Custom output location:
python ocr_pdf.py scan.pdf -o /path/to/output.pdf
Common language codes:
- eng: English
- deu: German
- fra: French
- spa: Spanish
- ita: Italian
- por: Portuguese
- chi_sim: Chinese (Simplified)
- jpn: Japanese
- kor: Korean
For multiple languages, join the codes with +, for example: eng+deu+fra
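When the language list comes from a config file or user input, the combined value for -l can be built programmatically. A small sketch follows; `join_languages` is a hypothetical helper, and it does not check whether the corresponding Tesseract language data is actually installed.

```python
def join_languages(codes):
    """Join Tesseract language codes into one value for -l, e.g. eng+deu+fra."""
    seen = set()
    unique = []
    for code in codes:
        code = code.strip()
        # Skip blanks and duplicates while keeping first-seen order.
        if code and code not in seen:
            seen.add(code)
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)


print(join_languages(["eng", "deu", "fra"]))  # eng+deu+fra
```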
The tool automatically optimizes output size.
To disable optimization:
python ocr_pdf.py file.pdf --no-optimize
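To drive the tool from another Python script, the documented options can be assembled into a command list and passed to `subprocess`. The sketch below uses only the flags shown above (-l, -o, --no-optimize); `build_command` is a hypothetical helper, and refusing -o with several inputs is a cautious assumption of this sketch, since the documented -o example uses a single file.

```python
import sys


def build_command(pdf_files, lang=None, output=None, optimize=True, script="ocr_pdf.py"):
    """Assemble an argv list for ocr_pdf.py from the documented options."""
    if not pdf_files:
        raise ValueError("at least one PDF is required")
    if output is not None and len(pdf_files) > 1:
        # Assumption: -o is only paired with a single input here.
        raise ValueError("-o is only used with a single input in this sketch")
    cmd = [sys.executable, script] + [str(p) for p in pdf_files]
    if lang:
        cmd += ["-l", lang]
    if output is not None:
        cmd += ["-o", str(output)]
    if not optimize:
        cmd.append("--no-optimize")
    return cmd


cmd = build_command(["file.pdf"], lang="eng+deu", optimize=False)
print(" ".join(cmd[1:]))  # ocr_pdf.py file.pdf -l eng+deu --no-optimize
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.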