pyonb

Warning

This repo is under construction.

pyonb is a Python library and suite of APIs that wrap open-source Optical Character Recognition (OCR) tools. It is designed for local deployment and converts PDFs to structured text using several OCR tools, including Marker, PaddleOCR, Docling and Kreuzberg.

Getting Started

Prerequisites

pyonb requires Docker and Docker Compose.

Installation & Usage

Clone pyonb

git clone git@github.com:SAFEHR-data/pyonb.git
cd pyonb

Rename .env.sample to .env.

mv .env.sample .env

Edit .env with the correct DATA_FOLDER location, e.g.:

DATA_FOLDER="path/to/documents/folder"

where the path is relative to the docker-compose.yml file in the top-level pyonb directory.
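As a rough illustration of how a relative DATA_FOLDER is interpreted, the sketch below resolves it against the directory that holds docker-compose.yml. The function name `resolve_data_folder` and the example paths are hypothetical and not part of pyonb.

```python
from pathlib import Path


def resolve_data_folder(compose_file: str, data_folder: str) -> Path:
    """Resolve DATA_FOLDER against the directory holding docker-compose.yml.

    Hypothetical helper for illustration only; not part of pyonb.
    """
    base_dir = Path(compose_file).resolve().parent
    candidate = Path(data_folder)
    # An absolute DATA_FOLDER is returned unchanged; a relative one is
    # joined onto the compose file's directory.
    if candidate.is_absolute():
        return candidate
    return (base_dir / candidate).resolve()
```

For example, `resolve_data_folder("docker-compose.yml", "path/to/documents/folder")` run from the top-level pyonb directory gives the absolute folder that should contain your PDFs.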

Set OCR service ports, e.g.:

OCR_FORWARDING_API_PORT=8110
MARKER_API_PORT=8112
PADDLEOCR_API_PORT=8114
DOCLING_API_PORT=8115
KREUZBERG_API_PORT=8116
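To double-check the port settings, a minimal sketch like the one below reads the KEY=VALUE lines of a .env file and lists a local base URL per `*_API_PORT` variable. The helper names are hypothetical, and it assumes each service is published on the host at its configured port; only the forwarding API on port 8110 is confirmed by this README.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, e.g. DATA_FOLDER="docs".
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def service_urls(env: dict[str, str], host: str = "127.0.0.1") -> dict[str, str]:
    """Map each *_API_PORT variable to a base URL (assumes host-published ports)."""
    suffix = "_API_PORT"
    return {
        key[: -len(suffix)].lower(): f"http://{host}:{port}"
        for key, port in env.items()
        if key.endswith(suffix) and port.isdigit()
    }
```

Calling `service_urls(parse_env(open(".env").read()))` from the repository root prints the mapping for your own configuration.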

Important

For GAE usage, set OCR service ports and UCLH proxy details:

http_proxy=
https_proxy=
HTTPS_PROXY=
HTTP_PROXY=

Start the OCR API Server (e.g. using marker and docling):

docker compose --profile marker --profile docling up -d

Open FastAPI Swagger at http://127.0.0.1:8110/docs to view and execute endpoints.
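The same endpoint list can be read programmatically. FastAPI serves an OpenAPI document at /openapi.json by default; the sketch below assumes pyonb has not changed that default, and the helper names are hypothetical.

```python
import json
import urllib.request


def post_paths(spec: dict) -> list[str]:
    """Return the sorted paths in an OpenAPI document that accept POST."""
    return sorted(
        path
        for path, operations in spec.get("paths", {}).items()
        if "post" in operations
    )


def fetch_spec(base_url: str = "http://127.0.0.1:8110") -> dict:
    """Download the OpenAPI document; requires the API server to be running."""
    with urllib.request.urlopen(f"{base_url}/openapi.json", timeout=10) as resp:
        return json.load(resp)
```

With the server running, `post_paths(fetch_spec())` lists the POST endpoints shown in Swagger.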

Use the following POST endpoints to run the chosen OCR tool on a PDF:

marker - POST /marker/inference_single
docling - POST /docling/inference_single
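These endpoints can also be called from a script. The following is a minimal sketch using only the Python standard library. It assumes the endpoint accepts a multipart file upload in a form field named `file`; that field name is a guess, so check the request schema in Swagger before relying on it. The helper names are hypothetical.

```python
import json
import os
import urllib.request
import uuid


def build_upload_request(pdf_bytes, filename, tool="marker",
                         base_url="http://127.0.0.1:8110", field="file"):
    """Build a multipart/form-data POST for /<tool>/inference_single.

    The form field name ("file") is an assumption, not confirmed by pyonb.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{base_url}/{tool}/inference_single",
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def post_pdf(path, **kwargs):
    """Send one PDF and return the decoded JSON response (server must be up)."""
    with open(path, "rb") as fh:
        request = build_upload_request(fh.read(), os.path.basename(path), **kwargs)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

For example, `post_pdf("report.pdf", tool="docling")` would send a local file to the docling endpoint through the forwarding API on port 8110.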

View the JSON response in the Swagger UI.

Developer Tips

As an alternative to Swagger, use Postman to construct, save and send your API requests.

Tests

Clone the repo:

git clone https://github.com/SAFEHR-data/pyonb.git

Create a virtual environment (we suggest using uv) and install dependencies:

uv venv --python 3.12
source .venv/bin/activate
uv sync

Copy the test .env file from tests/ to the repository root for use with tox:

cp tests/.env.tests .env

Start the Docker services:

docker compose --profile marker --profile docling up -d

Run tests using tox:

tox -e py312

NB: the inference tests may take a few minutes to run. Some tests may fail depending on which OCR tools you start. For example, with --profile marker --profile docling the PaddleOCR and Kreuzberg APIs are not started, so their tests will fail.
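To see in advance which services are up, and therefore which tests are likely to fail, a quick port check can help. This sketch assumes the example ports from the .env snippet above and that each service is published on the host at its configured port; the names `EXAMPLE_PORTS`, `is_port_open` and `reachable_services` are hypothetical.

```python
import socket

# Example ports taken from the .env snippet in this README; adjust to your .env.
EXAMPLE_PORTS = {"marker": 8112, "paddleocr": 8114, "docling": 8115, "kreuzberg": 8116}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reachable_services(ports: dict[str, int], host: str = "127.0.0.1") -> list[str]:
    """Return the sorted names of the services whose port accepts connections."""
    return sorted(name for name, port in ports.items() if is_port_open(host, port))
```

Running `reachable_services(EXAMPLE_PORTS)` after `docker compose ... up -d` shows which tools answered; an open port only means something is listening, not that the OCR model has finished loading.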

To run unit tests individually, adapt the following:

tox -e py312 -- tests/api/test_routers.py::test_inference_single_file_upload_marker

About

Project Team

Arman Eshaghi
Tom Roberts ([email protected])
Kawsar Noor
Lawrence Lai
Stefan Piatek
Richard Dobson
Steve Harris
Sarah Keating

Acknowledgements

This work was funded by the National Institute for Health and Care Research (NIHR, award code NIHR302495).

This project is developed in collaboration with the Centre for Advanced Research Computing, University College London.
