pyhocr

pyhocr is a Python package for parsing and navigating hOCR documents.

Installation

To install the module, run:

pip install pyhocr

Usage

pyhocr parses the following elements from hOCR:

ocr pages: represented by <ocr_page>
ocr content areas: represented by <ocr_carea>
ocr paragraphs: represented by <ocr_par>
ocr lines: represented by <ocr_line>
ocr words: represented by <ocrx_word>

and returns them as Page, Blocks, Paragraphs, Lines, and Words objects respectively.
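For orientation, hOCR is ordinary HTML in which each element's type is carried in its class attribute. The sketch below does not use pyhocr at all; it relies only on the standard library's html.parser to count the element types listed above in a small hand-written sample. The sample markup and the HocrClassCounter name are illustrative assumptions, not part of pyhocr.

```python
from collections import Counter
from html.parser import HTMLParser

# hOCR class names for the five element types listed above.
HOCR_CLASSES = ("ocr_page", "ocr_carea", "ocr_par", "ocr_line", "ocrx_word")

# Hand-written sample for illustration only; real OCR output carries more attributes.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 200 100'>
  <div class='ocr_carea'>
    <p class='ocr_par'>
      <span class='ocr_line'>
        <span class='ocrx_word'>Hello</span>
        <span class='ocrx_word'>world</span>
      </span>
    </p>
  </div>
</div>
"""


class HocrClassCounter(HTMLParser):
    """Count start tags whose class attribute names an hOCR element type."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several space-separated names.
        class_value = dict(attrs).get("class") or ""
        for name in class_value.split():
            if name in HOCR_CLASSES:
                self.counts[name] += 1


counter = HocrClassCounter()
counter.feed(SAMPLE_HOCR)
for name in HOCR_CLASSES:
    print(name, counter.counts[name])
```

The nesting in the sample (page, content area, paragraph, line, word) mirrors the parent and child relationships that pyhocr exposes.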

You can navigate the hOCR structure by asking any element for its child elements or its parent elements. To navigate down the structure:

import pyhocr

with open('example.hocr') as f:
    hocr_string = f.read()

hocr_document = pyhocr.parse(hocr_string)

# get the first page
page = hocr_document.pages[0]

# pulling all lines out:
lines = page.lines

# getting text of last line
last_line_text = lines[-1].text

# getting all words of page
words = page.words
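As a small follow-on, the text of a whole page can be assembled from its lines. This is a sketch that assumes only the `lines` and `text` attributes shown above; the `page_text` helper is a hypothetical name, and the stand-in objects exist only so the example runs without an hOCR file.

```python
from types import SimpleNamespace


def page_text(page):
    """Join the text of every line on a page, one output line per OCR line."""
    return "\n".join(line.text for line in page.lines)


# Stand-in for a parsed page (assumption: real pages expose .lines, lines expose .text).
demo_page = SimpleNamespace(
    lines=[SimpleNamespace(text="first line"), SimpleNamespace(text="second line")]
)
print(page_text(demo_page))
```

With a real document, the same helper would be called on `hocr_document.pages[0]`.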

Or navigate up the data structure from a child element:

# pick a word to start from
word = page.words[0]

# get parent page
page = word.page

# get parent line
line = word.line

# get parent block
block = word.block

# get parent page of the block
page = block.page
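Upward navigation is handy for regrouping a flat list of words. The sketch below groups words by their parent line using the `word.line` attribute shown above. It assumes that words also expose a `text` attribute, which is shown above only for lines; `words_by_line` is a hypothetical helper, and the stand-in objects replace real parsed elements.

```python
from types import SimpleNamespace


def words_by_line(words):
    """Group word texts by parent line, preserving first-seen line order."""
    groups = {}
    for word in words:
        # Key on object identity so line objects need not be hashable.
        groups.setdefault(id(word.line), []).append(word.text)
    return list(groups.values())


# Stand-ins for parsed elements (assumption: words expose .line and .text).
line_a = SimpleNamespace(text="Hello world")
line_b = SimpleNamespace(text="again")
demo_words = [
    SimpleNamespace(text="Hello", line=line_a),
    SimpleNamespace(text="world", line=line_a),
    SimpleNamespace(text="again", line=line_b),
]
print(words_by_line(demo_words))
```

Keying on identity works here only if each word's `line` attribute returns the same object for words on the same line; if a library builds a fresh parent object on every access, a stable identifier would be needed instead.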

Contributions

Please feel free to post pull requests or report issues.
