This Python script automates the extraction of text from images using Tesseract OCR. It processes all images in the test_images/ folder and saves the extracted text as .txt files in the extracted_texts/ directory, maintaining the original image filenames.
OCR-Text-Extractor/
├── OCR.py
├── test_images/
│   ├── image1.jpg
│   └── image2.png
├── extracted_texts/
│   ├── image1.txt
│   └── image2.txt
└── README.md
- Batch processes .jpg, .jpeg, and .png images.
- Supports multiple languages (default: English and Hindi).
- Automatically creates the extracted_texts/ folder if it doesn't exist.
- Provides informative logging for each processed file.
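The features above can be sketched as a single batch loop. This is a minimal illustration and not the actual contents of OCR.py: the names `process_folder`, `ocr_image`, and `SUPPORTED_EXTENSIONS` are assumptions made for the example, and only `extract_text_and_save`'s default languages are taken from this README. The OCR step is passed in as a parameter so the loop can be tried without Tesseract installed.

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Extensions listed in the feature list; the constant name is hypothetical.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def ocr_image(image_path, langs=("eng", "hin")):
    """Run Tesseract on one image and return the recognized text."""
    # Imported lazily so the rest of the sketch works without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        # Tesseract takes several languages as one string joined by "+".
        return pytesseract.image_to_string(img, lang="+".join(langs))


def process_folder(input_dir="test_images", output_dir="extracted_texts", ocr=ocr_image):
    """OCR every supported image in input_dir and write one .txt file per image."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

    written = []
    for image_path in sorted(Path(input_dir).iterdir()):
        if not image_path.is_file():
            continue
        if image_path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            continue
        text = ocr(image_path)
        target = out / (image_path.stem + ".txt")  # keep the image's base name
        target.write_text(text, encoding="utf-8")
        logging.info("Processed %s -> %s", image_path.name, target)
        written.append(target)
    return written
```

Calling `process_folder()` with no arguments would read from test_images/ and write to extracted_texts/, which matches the layout shown earlier.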
git clone https://github.com/Mrigank005/OCR
cd OCR
Ensure you have Python 3 installed. Then, install the required Python libraries:
pip install pillow pytesseract
pytesseract is only a wrapper, so the Tesseract OCR engine itself must also be installed:
- Windows: Download and install from Tesseract OCR Windows Installer.
- macOS: Use Homebrew:
  brew install tesseract
- Linux (Debian/Ubuntu):
  sudo apt-get install tesseract-ocr
Ensure Tesseract is added to your system's PATH.
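One way to confirm the PATH setup before running the script is to look the executable up from Python. The sketch below uses only the standard library; the helper name `find_tesseract` is an assumption for illustration and is not part of OCR.py.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; install it or set tesseract_cmd explicitly.")
else:
    print(f"tesseract found at {path}")
```

If the lookup returns None even though Tesseract is installed, the install directory is probably missing from PATH, and the Tesseract Path setting described in the configuration notes is the fallback.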
Place the images you want to process into the test_images/ directory.
python OCR.py
The extracted text files will be saved in the extracted_texts/ directory.
- Language Support: The script defaults to English and Hindi. To modify the languages, edit the langs parameter in the extract_text_and_save function within OCR.py:
  def extract_text_and_save(image_path, langs=["eng", "hin"]):
  Refer to Tesseract OCR Language Data for available language codes.
- Tesseract Path: If Tesseract isn't in your system's PATH, specify its location in OCR.py:
  import pytesseract
  pytesseract.pytesseract.tesseract_cmd = r'/path/to/tesseract'
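Tesseract expects multiple languages as a single string joined with "+", for example "eng+hin". The sketch below shows one way a list of codes could be turned into that string, with an optional check against the languages actually installed. The helper `build_lang_argument` is hypothetical; how OCR.py itself converts its langs list is not shown in this README.

```python
def build_lang_argument(langs, installed=None):
    """Join language codes into the "a+b" form that Tesseract's lang option expects.

    If `installed` is given, raise ValueError for codes that are not in it.
    """
    codes = [code.strip() for code in langs if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    if installed is not None:
        missing = [code for code in codes if code not in installed]
        if missing:
            raise ValueError(f"language data not installed: {', '.join(missing)}")
    return "+".join(codes)


print(build_lang_argument(["eng", "hin"]))  # prints eng+hin
```

The resulting string can be passed as the `lang` argument of `pytesseract.image_to_string`. Recent pytesseract releases also provide `pytesseract.get_languages()`, which could supply the `installed` list; treat that as an optional extra, since it depends on the installed version.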
For an image named page1.jpg in test_images/, the script will generate page1.txt in extracted_texts/ containing the recognized text.
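The name mapping in this example has one consequence worth noting: two images that differ only by extension, such as page1.jpg and page1.png, would map to the same page1.txt. The sketch below illustrates the mapping and a simple collision check. Both helpers, `output_path_for` and `find_collisions`, are assumptions for illustration; whether OCR.py detects or handles such collisions is not stated here.

```python
from collections import defaultdict
from pathlib import Path


def output_path_for(image_path, output_dir="extracted_texts"):
    """Map an image path to its .txt path, keeping the base name."""
    return Path(output_dir) / (Path(image_path).stem + ".txt")


def find_collisions(image_paths, output_dir="extracted_texts"):
    """Return {target: [sources]} for targets that more than one image maps to."""
    by_target = defaultdict(list)
    for image_path in image_paths:
        by_target[output_path_for(image_path, output_dir)].append(Path(image_path))
    return {target: sources for target, sources in by_target.items() if len(sources) > 1}


print(output_path_for("test_images/page1.jpg").as_posix())  # prints extracted_texts/page1.txt
```

Running `find_collisions` over the contents of test_images/ before a batch run would show which output files risk being overwritten.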