OCR-D framework
OCR-D/core
Python modules implementing OCR-D specs and related tools
Introduction
This repository contains the Python packages that form the base for tools within the OCR-D ecosystem.
All packages are also published to PyPI.
Installation
NOTE Unless you want to contribute to OCR-D/core, we recommend installation as part of ocrd_all which installs a complete stack of OCR-D-related software.
The easiest way to install is via pip:
pip install ocrd
All Python software released by OCR-D requires Python 3.8 or higher.
NOTE Some OCR-D tools (or even test cases) might behave unexpectedly if your environment has specific modifications, such as:
- using a custom build of ImageMagick, whose format delegates are different from what OCR-D expects
- custom Python logging configurations in your personal account
Command line tools
NOTE: All OCR-D CLI tools support a --help flag which shows usage and
supported flags, options and arguments.
ocrd CLI
ocrd-dummy CLI
A minimal OCR-D processor that copies from -I/--input-file-grp to -O/--output-file-grp.
ocrd-filter CLI
A simple OCR-D processor that removes segments in PAGE-XML files from -I/--input-file-grp to -O/--output-file-grp, with arbitrary selection based on XPath 2.0 expressions.
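To illustrate the idea of selecting segments by a path expression and removing them, here is a minimal sketch using only Python's standard library. It is not how ocrd-filter is implemented: xml.etree.ElementTree supports only a small XPath 1.0 subset (not XPath 2.0), the toy document omits the PAGE namespace, and the region ids, types and the helper name remove_matching are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Toy, namespace-free stand-in for a PAGE-XML file (hypothetical content).
TOY_PAGE = """
<PcGts>
  <Page>
    <TextRegion id="r1" type="paragraph"/>
    <TextRegion id="r2" type="footnote"/>
    <ImageRegion id="r3"/>
  </Page>
</PcGts>
"""

def remove_matching(root, path):
    """Remove all elements matched by an ElementTree path expression.

    Returns the ids of the removed elements.
    """
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    removed = []
    for elem in root.findall(path):
        parents[elem].remove(elem)
        removed.append(elem.get("id"))
    return removed

root = ET.fromstring(TOY_PAGE)
removed_ids = remove_matching(root, ".//TextRegion[@type='footnote']")
remaining_ids = [e.get("id") for e in root.iter() if e.get("id")]
print(removed_ids, remaining_ids)
```

For real PAGE-XML files the element names are namespaced, so the path expression would need the appropriate namespace mapping.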
ocrd-command CLI
A simple OCR-D processor that runs arbitrary shell commands to transform PAGE-XML files from -I/--input-file-grp to -O/--output-file-grp (in effect "wrapping" them for OCR-D).
ocrd-merge CLI
A simple OCR-D processor that (for every page) joins PAGE-XML files from multiple -I/--input-file-grp into a single -O/--output-file-grp, ensuring that
- Border polygons are joined,
- all regions are concatenated, while
  - ensuring segment identifiers do not clash,
  - and the reading order simply gets concatenated.
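The concatenation rules above can be sketched on plain Python dictionaries. This is only an illustration under stated assumptions: the page layout (a "regions" list and a "reading_order" list), the function name merge_pages and the suffix-based renaming scheme are all hypothetical and not taken from ocrd-merge itself; joining of Border polygons is left out.

```python
def merge_pages(pages):
    """Concatenate regions and reading orders of several page dicts.

    Each page is a dict with "regions" (list of {"id": ...}) and
    "reading_order" (list of region ids). This layout is hypothetical.
    """
    merged = {"regions": [], "reading_order": []}
    seen = set()
    for index, page in enumerate(pages):
        renamed = {}
        for region in page["regions"]:
            new_id = region["id"]
            # On a clash, append a per-input suffix until the id is unique
            # (assumed renaming scheme, chosen only for illustration).
            while new_id in seen:
                new_id = f"{new_id}_{index}"
            seen.add(new_id)
            renamed[region["id"]] = new_id
            merged["regions"].append({**region, "id": new_id})
        # Reading orders are simply appended, following the renamed ids.
        merged["reading_order"] += [renamed[rid] for rid in page["reading_order"]]
    return merged

page_a = {"regions": [{"id": "r1"}, {"id": "r2"}], "reading_order": ["r1", "r2"]}
page_b = {"regions": [{"id": "r1"}, {"id": "r3"}], "reading_order": ["r3", "r1"]}
result = merge_pages([page_a, page_b])
print([r["id"] for r in result["regions"]])
print(result["reading_order"])
```

Here the second input's r1 clashes with the first input's r1 and is renamed, while its reading order is appended after the first one with the renamed id.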
Configuration
Almost all behaviour of the OCR-D/core software is configured via CLI options and flags, which can be listed with the --help flag that all CLIs support.
Some parts of the software are configured via environment variables:
- OCRD_PROFILE: This variable configures the built-in CPU and memory profiling. If empty, no profiling is done. Otherwise expected to contain any of the following tokens:
  - CPU: Enable CPU profiling of processor runs
  - RSS: Enable RSS memory profiling
  - PSS: Enable proportionate memory profiling
- OCRD_PROFILE_FILE: If set, then the CPU profile is written to this file for later perusal with analysis tools like snakeviz.
- PATH: Search path for processor executables (affects ocrd process and ocrd resmgr).
- HOME: Directory to look for ocrd_logging.conf, fallback for unset XDG variables (see below).
- XDG_CONFIG_HOME: Directory to look for ./ocrd/resources.yml (i.e. ocrd resmgr user database) – defaults to $HOME/.config.
- XDG_DATA_HOME: Directory to look for ./ocrd-resources/* (i.e. ocrd resmgr data location) – defaults to $HOME/.local/share.
- OCRD_DOWNLOAD_RETRIES: Number of times to retry failed attempts for downloads of resources or workspace files.
- OCRD_DOWNLOAD_TIMEOUT: Timeout in seconds for connecting or reading (comma-separated) when downloading.
- OCRD_MISSING_INPUT: How to deal with missing input files (for some fileGrp/pageId) during processing:
  - SKIP: ignore and proceed with next page's input
  - ABORT: throw MissingInputFile exception
- OCRD_MISSING_OUTPUT: How to deal with missing output files (for some fileGrp/pageId) during processing:
  - SKIP: ignore and proceed processing next page
  - COPY: fall back to copying input PAGE to output fileGrp for page
  - ABORT: re-throw whatever caused processing to fail
- OCRD_MAX_MISSING_OUTPUTS: Maximal rate of skipped/fallback pages among all processed pages before aborting (decimal fraction, ignored if negative).
- OCRD_EXISTING_OUTPUT: How to deal with already existing output files (for some fileGrp/pageId) during processing:
  - SKIP: ignore and proceed processing next page
  - OVERWRITE: force writing result to output fileGrp for page
  - ABORT: re-throw FileExistsError exception
- OCRD_METS_CACHING: Whether to enable in-memory storage of OcrdMets data structures for speedup during processing or workspace operations.
- OCRD_MAX_PROCESSOR_CACHE: Maximum number of processor instances (for each set of parameters) to be kept in memory (including loaded models) for processing workers or processor servers.
- OCRD_MAX_PARALLEL_PAGES: Maximum number of processor threads for page-parallel processing (within each Processor's selected page range, independent of the number of Processing Workers or Processor Servers). If set >1, then a METS Server must be used for METS synchronisation.
- OCRD_PROCESSING_PAGE_TIMEOUT: Timeout in seconds for processing a single page. If set >0 and exceeded, the same handling as for OCRD_MISSING_OUTPUT applies.
- OCRD_NETWORK_SERVER_ADDR_PROCESSING: Default address of Processing Server to connect to (for ocrd network client processing).
- OCRD_NETWORK_SERVER_ADDR_WORKFLOW: Default address of Workflow Server to connect to (for ocrd network client workflow).
- OCRD_NETWORK_SERVER_ADDR_WORKSPACE: Default address of Workspace Server to connect to (for ocrd network client workspace).
- OCRD_NETWORK_RABBITMQ_CLIENT_CONNECT_ATTEMPTS: Number of attempts for a worker to create its queue. Helpful if the rabbitmq-server needs time to be fully started.
- OCRD_NETWORK_CLIENT_POLLING_SLEEP: How many seconds to sleep before trying ocrd network client again.
- OCRD_NETWORK_CLIENT_POLLING_TIMEOUT: Timeout for a blocking ocrd network client (in seconds).
- OCRD_NETWORK_SOCKETS_ROOT_DIR: The root directory where all METS Server related socket files are created.
- OCRD_NETWORK_LOGS_ROOT_DIR: The root directory where all ocrd_network related file logs are stored.
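As a rough illustration of how some of these variables could be interpreted, the following sketch resolves the XDG locations with their documented fallbacks, parses a comma-separated download timeout, and applies the maximal missing-output rate. The function names are hypothetical and this is not the code used by OCR-D/core; the assumptions that a single timeout value applies to both connecting and reading, and that aborting happens only when the rate strictly exceeds the limit, are mine.

```python
import os
from pathlib import PurePosixPath

def xdg_dirs(env=None):
    """Resolve resources.yml and the resource directory with the documented fallbacks."""
    env = os.environ if env is None else env
    home = PurePosixPath(env.get("HOME") or os.path.expanduser("~"))
    config_home = PurePosixPath(env.get("XDG_CONFIG_HOME") or home / ".config")
    data_home = PurePosixPath(env.get("XDG_DATA_HOME") or home / ".local" / "share")
    return config_home / "ocrd" / "resources.yml", data_home / "ocrd-resources"

def parse_timeout(value):
    """Parse an OCRD_DOWNLOAD_TIMEOUT value into a (connect, read) tuple of seconds.

    Assumption: a single number applies to both connecting and reading.
    """
    parts = [float(p) for p in value.split(",")]
    if len(parts) == 1:
        return parts[0], parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"expected one or two numbers, got {value!r}")

def too_many_missing(n_missing, n_processed, max_rate):
    """Apply an OCRD_MAX_MISSING_OUTPUTS limit: negative rates disable the check."""
    if max_rate < 0 or n_processed == 0:
        return False
    # Assumption: abort only when the rate strictly exceeds the limit.
    return n_missing / n_processed > max_rate

# Example environment (hypothetical paths); XDG_CONFIG_HOME is left unset.
env = {"HOME": "/home/user", "XDG_DATA_HOME": "/srv/data"}
resources_yml, resources_dir = xdg_dirs(env)
print(resources_yml, resources_dir)
print(parse_timeout("10,20"), parse_timeout("5"))
print(too_many_missing(3, 10, 0.25), too_many_missing(3, 10, -1))
```

Passing the environment as a plain dictionary keeps the helpers easy to try out without changing the actual process environment.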
Packages
ocrd_utils
Contains utilities and constants, e.g. for logging, path normalization, coordinate calculation etc.
See README for ocrd_utils for further information.
ocrd_models
Contains file format wrappers for PAGE-XML, METS, EXIF metadata etc.
See README for ocrd_models for further information.
ocrd_modelfactory
Code to instantiate models from existing data.
See README for ocrd_modelfactory for further information.
ocrd_validators
Schemas and routines for validating BagIt, ocrd-tool.json, workspaces, METS, page, CLI parameters etc.
See README for ocrd_validators for further information.
ocrd_network
Components related to the OCR-D Web API.
See README for ocrd_network for further information.
ocrd
Depends on all of the above, also contains decorators and classes for creating OCR-D processors and CLIs.
Also contains the command line tool ocrd.
See README for ocrd for further information.
Testing
- Download assets: make assets
- Test with local files: make test
- Test with remote assets: make test OCRD_BASEURL='https://github.com/OCR-D/assets/raw/master/data/'
See Also
- OCR-D Specifications (Repo)
- OCR-D core API documentation (built here via make docs)
- OCR-D Website (Repo)