fix: add 'el' and 'gr' as Greek language code aliases for Tesseract OCR #4270

Open
s0wa48 wants to merge 1 commit into Unstructured-IO:main from s0wa48:fix/issue-2939-text-extraction-issue-greek-l

s0wa48 commented Feb 27, 2026

Summary

  • Users who passed languages=["gr"] for Greek-language PDFs got incorrect OCR output because "gr" (the ISO 3166-1 alpha-2 country code for Greece) was not mapped to the Tesseract language code "ell".
  • "el" (the ISO 639-1 language code for Modern Greek) was not mapped either.
  • This fix adds both "gr" and "el" as aliases in TESSERACT_LANGUAGES_AND_CODES, each mapping to "ell", the Tesseract code for Modern Greek.
  • With this change, passing either languages=["gr"] or languages=["el"] selects Greek OCR processing.
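
The alias idea in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the dictionary here is a small hypothetical stand-in for the relevant slice of TESSERACT_LANGUAGES_AND_CODES, and `to_tesseract_langs` is a helper name invented for this example.

```python
# Hypothetical stand-in for a slice of TESSERACT_LANGUAGES_AND_CODES.
# The real mapping in the library is larger and may be shaped differently.
TESSERACT_LANGUAGES_AND_CODES = {
    "greek": "ell",
    "ell": "ell",  # Tesseract's own code for Modern Greek
    "el": "ell",   # ISO 639-1 language code (alias added by this PR)
    "gr": "ell",   # ISO 3166-1 alpha-2 country code, accepted as an alias
}


def to_tesseract_langs(languages):
    """Hypothetical helper: map user-supplied codes to a Tesseract lang string."""
    codes = []
    for lang in languages:
        key = lang.strip().lower()
        code = TESSERACT_LANGUAGES_AND_CODES.get(key)
        if code is None:
            raise ValueError(f"unsupported language code: {lang!r}")
        # Avoid duplicates when several aliases resolve to the same code.
        if code not in codes:
            codes.append(code)
    # Tesseract accepts multiple languages joined with "+".
    return "+".join(codes)
```

Under these assumptions, `to_tesseract_langs(["gr"])` and `to_tesseract_langs(["el"])` both resolve to `"ell"`, so either spelling would select the Greek model.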

Fixes #2939


This PR was auto-generated by Gittensor bot using Claude AI to fix a reported issue.

s0wa48 force-pushed the fix/issue-2939-text-extraction-issue-greek-l branch from d24b157 to 348e8f4 on February 27, 2026 at 18:46

Development

Successfully merging this pull request may close these issues.

Text Extraction Issue: Greek Language PDFs Rendered with Incorrect Alphabet
