Warning
This repo is under construction.
pyonb is a Python library and suite of APIs that wrap open-source Optical Character Recognition (OCR) tools. It is designed for local deployment and can convert PDFs to structured text using several
OCR tools, including marker, PaddleOCR, Docling and Kreuzberg.
pyonb requires Docker and Docker Compose.
- Clone pyonb:
git clone git@github.com:SAFEHR-data/pyonb.git
cd pyonb
- Rename .env.sample to .env:
mv .env.sample .env
- Edit .env with the correct DATA_FOLDER location, e.g.:
DATA_FOLDER="path/to/documents/folder"
where the path is relative to the docker-compose.yml file in the top-level pyonb directory.
- Set OCR service ports, e.g.:
OCR_FORWARDING_API_PORT=8110
MARKER_API_PORT=8112
PADDLEOCR_API_PORT=8114
DOCLING_API_PORT=8115
KREUZBERG_API_PORT=8116
Important
For GAE usage, set OCR service ports and UCLH proxy details:
http_proxy=
https_proxy=
HTTPS_PROXY=
HTTP_PROXY=
- Start the OCR API server (e.g. using marker and docling):
docker compose --profile marker --profile docling up -d
- Open FastAPI Swagger at http://127.0.0.1:8110/docs to view and execute endpoints.
Use the following POST endpoints to run the chosen OCR tool on a PDF:
- marker: POST /marker/inference_single
- docling: POST /docling/inference_single
- View the JSON response.
- As an alternative to Swagger, use Postman to construct, save and send your API requests.
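The POST endpoints listed above can also be called from a script. The following is a minimal sketch using only the Python standard library, not a documented pyonb client. It assumes the forwarding API is reachable on the example port 8110 and that the endpoint accepts a multipart upload in a form field named `file`; that field name is a guess, so check the Swagger page for the actual request schema. The helper names (`endpoint_url`, `build_multipart`, `run_ocr`) are hypothetical.

```python
import json
import uuid
from pathlib import Path
from typing import Any
from urllib import request

# Assumption: the forwarding API listens on the example port from .env.
BASE_URL = "http://127.0.0.1:8110"


def endpoint_url(tool: str, base_url: str = BASE_URL) -> str:
    """Build the single-file inference URL for a tool such as 'marker' or 'docling'."""
    return f"{base_url.rstrip('/')}/{tool}/inference_single"


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def run_ocr(pdf_path: str, tool: str = "marker", field: str = "file") -> Any:
    """POST a PDF to the chosen tool and return the decoded JSON response."""
    pdf = Path(pdf_path)
    # Assumption: the upload field is called "file"; adjust to match Swagger.
    body, content_type = build_multipart(field, pdf.name, pdf.read_bytes())
    req = request.Request(
        endpoint_url(tool),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the services running, a call such as `run_ocr("path/to/document.pdf", tool="docling")` would return the parsed JSON response. Inference can be slow, so a long-running request is not necessarily a failure.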
- Clone the repo:
git clone https://github.com/SAFEHR-data/pyonb.git
- Create a virtual environment (we suggest using uv) and install dependencies:
uv venv --python 3.12
source .venv/bin/activate
uv sync
- Copy the tests/.env.tests file to the root directory for use with tox:
cp tests/.env.tests .env
- Start the Docker services:
docker compose --profile marker --profile docling up -d
- Run tests using tox:
tox -e py312
NB: the inference tests may take a few minutes to run. Some tests may fail depending on which OCR tools you choose to start.
For example, with --profile marker --profile docling the Paddle and Kreuzberg APIs will not be started,
so the associated tests will fail.
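Since tests for services that are not running are expected to fail, it can help to check which ports are accepting connections before running tox. The sketch below is one possible way to do that and is not part of pyonb. It assumes each service is published on 127.0.0.1 at the port number given in .env, which the source only confirms for the forwarding API; the helper names (`parse_env`, `is_listening`, `report`) are hypothetical.

```python
import socket

# Variable names taken from the example .env settings.
PORT_VARS = [
    "OCR_FORWARDING_API_PORT",
    "MARKER_API_PORT",
    "PADDLEOCR_API_PORT",
    "DOCLING_API_PORT",
    "KREUZBERG_API_PORT",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, e.g. DATA_FOLDER="path/to/documents/folder".
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(env_text: str) -> dict[str, bool]:
    """Map each configured port variable to whether something is listening on it."""
    env = parse_env(env_text)
    # Assumption: host ports match the values in .env.
    return {name: is_listening(int(env[name])) for name in PORT_VARS if env.get(name)}
```

Calling `report(open(".env").read())` from the top-level pyonb directory returns a dictionary of port variable names to booleans. An open port only shows that something is listening there, not that the OCR model has finished loading.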
To run unit tests individually, adapt the following:
tox -e py312 -- tests/api/test_routers.py::test_inference_single_file_upload_marker
- Arman Eshaghi
- Tom Roberts ([email protected])
- Kawsar Noor
- Lawrence Lai
- Stefan Piatek
- Richard Dobson
- Steve Harris
- Sarah Keating
This work was funded by the National Institute for Health and Care Research (NIHR, award code NIHR302495).
This project is developed in collaboration with the Centre for Advanced Research Computing, University College London.