Python modules implementing OCR-D specs and related tools
This repository contains the Python packages that form the base for tools within the OCR-D ecosphere.
All packages are also published to PyPI.
NOTE Unless you want to contribute to OCR-D/core, we recommend installation as part of ocrd_all which installs a complete stack of OCR-D-related software.
The easiest way to install is via pip:
pip install ocrd
# or just the functionality you need, e.g.
pip install ocrd_modelfactory
All Python software released by OCR-D requires Python 3.7 or higher.
NOTE Some OCR-D tools (or even test cases) might behave unexpectedly if your environment has specific modifications, such as:
- using a custom build of ImageMagick, whose format delegates are different from what OCR-D supposes
- custom Python logging configurations in your personal account
NOTE: All OCR-D CLI tools support a --help flag which shows usage and
supported flags, options and arguments.
A minimal OCR-D processor that copies from -I/--input-file-grp to -O/--output-file-grp
Almost all behaviour of the OCR-D/core software is configured via CLI options and flags, which can be listed with the --help flag that all CLIs support.
Some parts of the software are configured via environment variables:
- OCRD_METS_CACHING: If set to true, access to the METS file is cached, speeding up in-memory search and modification.
- OCRD_PROFILE: Configures the built-in CPU and memory profiling. If empty, no profiling is done. Otherwise, it is expected to contain any of the following tokens:
  - CPU: Enable CPU profiling of processor runs
  - RSS: Enable RSS memory profiling
  - PSS: Enable proportionate memory profiling
- OCRD_PROFILE_FILE: If set, the CPU profile is written to this file for later perusal with analysis tools like snakeviz.
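As a rough illustration, these variables could be exported in the shell before running a processor. This is only a sketch: the space used to separate the OCRD_PROFILE tokens and the profile file path are assumptions, and the processor call is left as a comment because the executable name and file groups are placeholders.

```shell
# Cache METS access in memory for the OCR-D calls that follow.
export OCRD_METS_CACHING=true
# Request CPU and RSS profiling (space as token separator is an assumption).
export OCRD_PROFILE="CPU RSS"
# Write the CPU profile to a file (hypothetical path) for later inspection.
export OCRD_PROFILE_FILE="${TMPDIR:-/tmp}/ocrd-cpu.prof"

# Placeholder invocation; substitute a real processor and file groups:
# ocrd-foo-bar -I INPUT-GRP -O OUTPUT-GRP

# Show which OCRD_* settings a child process would inherit.
env | grep '^OCRD_' | sort
```

Exporting the variables (rather than only assigning them) matters here, because the settings must reach the processor as a child process.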
Contains utilities and constants, e.g. for logging, path normalization, coordinate calculation etc.
See README for ocrd_utils for further information.
Contains file format wrappers for PAGE-XML, METS, EXIF metadata etc.
See README for ocrd_models for further information.
Code to instantiate models from existing data.
See README for ocrd_modelfactory for further information.
Schemas and routines for validating BagIt, ocrd-tool.json, workspaces, METS, page, CLI parameters etc.
See README for ocrd_validators for further information.
Components related to OCR-D Web API
See README for ocrd_network for further information.
Depends on all of the above, also contains decorators and classes for creating OCR-D processors and CLIs.
Also contains the command line tool ocrd.
See README for ocrd for further information.
Builds a bash script that can be sourced by other bash scripts to create OCRD-compliant CLI.
For example:
source `ocrd bashlib filename`
declare -A NAMESPACES MIMETYPES
eval NAMESPACES=( `ocrd bashlib constants NAMESPACES` )
echo ${NAMESPACES[page]}
eval MIMETYPE_PAGE=( `ocrd bashlib constants MIMETYPE_PAGE` )
echo $MIMETYPE_PAGE
eval MIMETYPES=( `ocrd bashlib constants EXT_TO_MIME` )
echo ${MIMETYPES[.jpg]}
See CLI usage
Raise an error and exit.
Delegate logging to ocrd log
Ensure minimum version
Output ocrd-tool.json content verbatim.
Requires $OCRD_TOOL_JSON and $OCRD_TOOL_NAME to be set:
export OCRD_TOOL_JSON=/path/to/ocrd-tool.json
export OCRD_TOOL_NAME=ocrd-foo-bar
(Which you automatically get from ocrd__wrap.)
Output given resource file's content.
Output all resource files' names.
Print help on CLI usage.
Parses arguments according to OCR-D CLI. In doing so, depending on the values passed to it, may delegate to …
- ocrd__raise and exit (if something went wrong)
- ocrd__usage and exit
- ocrd__dumpjson and exit
- ocrd__show_resource and exit
- ocrd__list_resources and exit
- ocrd validate tasks and return
Expects an associative array ("hash"/"dict") ocrd__argv to be predefined:
declare -A ocrd__argv=()
This will be filled by the parser along the following keys:
- overwrite: whether --overwrite is enabled
- profile: whether --profile is enabled
- profile_file: the argument of --profile-file
- log_level: the argument of --log-level
- mets_file: absolute path of the --mets argument
- working_dir: absolute path of the --working-dir argument or the parent of mets_file
- page_id: the argument of --page-id
- input_file_grp: the argument of --input-file-grp
- output_file_grp: the argument of --output-file-grp
Moreover, there will be an associative array params
with the fully expanded runtime values of the ocrd-tool.json parameters.
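Reading the parsed values might look roughly like the following sketch. Both arrays are filled by hand here as a stand-in for what the parser would populate; all values are made up, and the parameter name threshold is hypothetical.

```shell
# Stand-in for the array the parser fills in; the values are made up.
declare -A ocrd__argv=(
    [overwrite]=false
    [log_level]=INFO
    [mets_file]=/data/ws/mets.xml
    [working_dir]=/data/ws
    [input_file_grp]=OCR-D-IMG
    [output_file_grp]=OCR-D-BIN
)
# Stand-in for the expanded ocrd-tool.json parameters ("threshold" is hypothetical).
declare -A params=( [threshold]=0.5 )

# Build a one-line summary from the parsed values.
summary="${ocrd__argv[input_file_grp]} -> ${ocrd__argv[output_file_grp]} in ${ocrd__argv[working_dir]}"
echo "$summary"

# Flags are compared as plain strings.
if [[ "${ocrd__argv[overwrite]}" == true ]]; then
    echo "existing output may be replaced"
else
    echo "existing output is kept"
fi
echo "threshold=${params[threshold]}"
```

Associative arrays require bash 4 or later, so a script relying on them should not be run with a plain POSIX sh.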
Parses an ocrd-tool.json for a specific tool (i.e. processor executable).
Delegates to …
- ocrd__parse_argv, creating the ocrd__argv associative array
- ocrd bashlib input-files, creating the data structures used by ocrd__input_file
Usage: ocrd__wrap PATH/TO/OCRD-TOOL.JSON EXECUTABLE ARGS
For example:
ocrd__wrap $SHAREDIR/ocrd-tool.json ocrd-olena-binarize "$@"
...
(Requires ocrd__wrap to have been run first.)
Access information on the input files according to the parsed CLI arguments:
- their file url (or local file path)
- their file ID
- their mimetype
- their pageId
- their proposed corresponding outputFileId (generated from ${ocrd__argv[output_file_grp]} and input file ID)
Usage: ocrd__input_file NR KEY
For example:
pageId=`ocrd__input_file 3 pageId`
To be used in a loop over all selected pages:
for ((n=0; n<${#ocrd__files[*]}; n++)); do
    local in_fpath=($(ocrd__input_file $n url))
    local in_id=($(ocrd__input_file $n ID))
    local in_mimetype=($(ocrd__input_file $n mimetype))
    local in_pageId=($(ocrd__input_file $n pageId))
    local out_id=$(ocrd__input_file $n outputFileId)
    local out_fpath="${ocrd__argv[output_file_grp]}/${out_id}.xml"
    # process $in_fpath to $out_fpath ...
    declare -a options
    if [ -n "$in_pageId" ]; then
        options=( -g $in_pageId )
    else
        options=()
    fi
    if [[ "${ocrd__argv[overwrite]}" == true ]]; then
        options+=( --force )
    fi
    options+=( -G ${ocrd__argv[output_file_grp]}
               -m $MIMETYPE_PAGE -i "$out_id"
               "$out_fpath" )
    ocrd -l ${ocrd__argv[log_level]} workspace -d ${ocrd__argv[working_dir]} add "${options[@]}"
done
Note: If the --input-file-grp is multi-valued (N fileGrps separated by commas), then usage is similar:
- The function ocrd__input_file can be used, but its results will be lists (delimited by whitespace and surrounded by single quotes), e.g. [url]='file1.xml file2.xml' [ID]='id_file1 id_file2' [mimetype]='application/vnd.prima.page+xml image/tiff' ....
- Therefore its results should be encapsulated in a (non-associative) array variable and without extra quotes, e.g. in_file=($(ocrd__input_file 3 url)), or as shown above.
- This will yield the first fileGrp's results on index 0, which in bash will always be the same as if you referenced the array without index (so code does not need to be changed much), e.g. test -f $in_file which equals test -f ${in_file[0]}.
- Additional fileGrps will have to be fetched from higher indexes, e.g. test -f ${in_file[1]}.
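The array behaviour described in this note can be tried out in isolation. The sketch below uses a hypothetical stand-in function, fake_input_file, in place of ocrd__input_file; the file names and the assumption of two input fileGrps are made up for illustration.

```shell
# Hypothetical stand-in for ocrd__input_file with two input fileGrps:
# it prints one whitespace-delimited value per fileGrp.
fake_input_file() {
    case "$2" in
        url) echo "OCR-D-SEG/page1.xml OCR-D-IMG/page1.tif" ;;
        mimetype) echo "application/vnd.prima.page+xml image/tiff" ;;
    esac
}

# Unquoted command substitution is split on whitespace into array elements.
in_file=($(fake_input_file 3 url))
in_mimetype=($(fake_input_file 3 mimetype))

echo "number of fileGrps: ${#in_file[@]}"
# Referencing the array without an index yields element 0.
echo "first fileGrp:  $in_file (same as ${in_file[0]})"
echo "second fileGrp: ${in_file[1]} (${in_mimetype[1]})"
```

Because the splitting relies on whitespace, this pattern would not hold up for file paths that themselves contain spaces.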
- Download assets: make assets
- Test with local files: make test
- Test with remote assets:
make test OCRD_BASEURL='https://github.com/OCR-D/assets/raw/master/data/'
- OCR-D Specifications (Repo)
- OCR-D core API documentation (built here via make docs)
- OCR-D Website (Repo)