Project description
ocrd-page-to-alto
Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
Introduction
This software converts PAGE XML files to the ALTO XML OCR result format. It lets software that generates PAGE XML be used in contexts where ALTO is needed to display the results, e.g. in libraries.
Installation
In a Python virtualenv:
make install # or pip install .
# or to install from PyPI
pip install ocrd_page_to_alto
Usage
To convert the PAGE XML document example.xml to ALTO:
page-to-alto example.xml > example.alto.xml
You can get an exhaustive list of page-to-alto's many options with --help:
CLI
Usage: page-to-alto [OPTIONS] FILENAME

  Convert PAGE to ALTO

Options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in
                                  PAGE-XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a <HYP/> element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if nth textequiv isn't available.
                                  'raise' will lead to a runtime error,
                                  'first' will use the first TextEquiv, 'last'
                                  will use the last TextEquiv on the element
  -O, --output-file FILE          Output filename (or "-" for standard output,
                                  the default)
  -h, --help                      Show this message and exit.
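The interplay of --textequiv-index and --textequiv-fallback-strategy can be easier to follow as code. The following is only an illustrative sketch of the behaviour described in the help text, not the tool's actual implementation: the function name select_textequiv, the (index, text) tuple representation, and returning None for an element without any TextEquiv are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a PAGE TextEquiv: (value of @index or None, Unicode text).
TextEquiv = Tuple[Optional[int], str]


def select_textequiv(
    textequivs: List[TextEquiv], index: int, fallback: str
) -> Optional[str]:
    """Pick the TextEquiv whose @index equals `index`, else apply `fallback`.

    `fallback` is one of "raise", "first" or "last", mirroring the choices
    listed for --textequiv-fallback-strategy.
    """
    if fallback not in ("raise", "first", "last"):
        raise ValueError(f"unknown fallback strategy: {fallback!r}")
    if not textequivs:
        # Assumption for this sketch: nothing to choose from, so no text.
        return None
    for idx, text in textequivs:
        if idx == index:
            return text
    # The requested @index is not present on this element.
    if fallback == "raise":
        raise RuntimeError(f"no TextEquiv with @index={index}")
    if fallback == "first":
        return textequivs[0][1]
    return textequivs[-1][1]  # fallback == "last"
```

With two candidates such as [(1, "Hel1o"), (2, "Hello")], asking for index 2 returns "Hello"; asking for a missing index 5 returns "Hel1o" under "first", "Hello" under "last", and raises under "raise".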
To process an OCR-D workspace, use ocrd_fileformat, which uses page-to-alto by default:
ocrd-fileformat-transform -I OCRD-OCR-OUTPUT-PAGE -O OCRD-OCR-OUTPUT-ALTO \
-P script-args "--dummy-word --no-check-words --no-check-border"
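Outside an OCR-D workspace, the standalone command can also be driven from a script. The sketch below only assembles and runs the documented command line; the helper names build_page_to_alto_cmd and convert are hypothetical and not part of this package, and the code assumes the page-to-alto executable is on PATH.

```python
import subprocess
from typing import List, Optional, Sequence

# ALTO versions accepted by --alto-version according to the help text.
ALTO_VERSIONS = ("4.2", "4.1", "4.0", "3.1", "3.0", "2.1", "2.0")


def build_page_to_alto_cmd(
    page_file: str,
    output_file: Optional[str] = None,
    alto_version: Optional[str] = None,
    extra_flags: Sequence[str] = (),
) -> List[str]:
    """Assemble the argument list for one page-to-alto call (hypothetical helper)."""
    cmd = ["page-to-alto"]
    if alto_version is not None:
        if alto_version not in ALTO_VERSIONS:
            raise ValueError(f"unsupported ALTO version: {alto_version!r}")
        cmd += ["--alto-version", alto_version]
    # Pass-through for boolean switches such as --dummy-word or --no-check-words.
    cmd += list(extra_flags)
    if output_file is not None:
        cmd += ["--output-file", output_file]
    cmd.append(page_file)
    return cmd


def convert(page_file: str, output_file: str, **options) -> None:
    """Run the conversion; raises CalledProcessError if the tool exits non-zero."""
    subprocess.run(build_page_to_alto_cmd(page_file, output_file, **options), check=True)
```

For example, convert("example.xml", "example.alto.xml", alto_version="4.2", extra_flags=["--dummy-word"]) would run the tool with those options. Passing the arguments as a list avoids shell quoting issues with unusual filenames.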
TODO
- AlternativeImage
- unmappable regions
- handle Border
- TextStyle
- ParagraphStyle
- table regions
- recursive regions for *Region
- Set PAGECLASS from pc:Page/@type #4
- Layers / z-level via StructureTag? #4
- <SP/>
- <HYP/>
- rotation
- reading order
- input PAGE-XML not having words #5
- multiple pc:TextEquivs
- language
- script: no equivalent in ALTO :(
- kerning: no equivalent in ALTO :(
- underlineStyle: no equivalent in ALTO :(
- bgColour: no equivalent in ALTO :(
- bgColourRgb: no equivalent in ALTO :(
- reverseVideo: no equivalent in ALTO :(
- xHeight: no equivalent in ALTO :(
- letterSpaced: no equivalent in ALTO :(
- ProcessingStep
- differentiate/select ALTO versions
Download files
Source Distribution
- ocrd-page-to-alto-1.1.0.tar.gz (12.0 kB)
Built Distribution
- ocrd_page_to_alto-1.1.0-py3-none-any.whl (13.1 kB)
File details
Details for the file ocrd-page-to-alto-1.1.0.tar.gz.
File metadata
- Download URL: ocrd-page-to-alto-1.1.0.tar.gz
- Upload date:
- Size: 12.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.6.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | a9997cc05e46eee85be5d1672d804eac9a94a9ceed91b80798d8b20736a49175
MD5 | 4e5f3451bd839f4dcccca91d02fffff0
BLAKE2b-256 | f10cf4d13e12f2336bf66f5b11cdfa93e54e1133b0163b02847076650adffbca
File details
Details for the file ocrd_page_to_alto-1.1.0-py3-none-any.whl.
File metadata
- Download URL: ocrd_page_to_alto-1.1.0-py3-none-any.whl
- Upload date:
- Size: 13.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.6.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7427454bb53c5d2ae995820a40091c866f39ec8c774de96935afede71483221a
MD5 | 205918cd64cb6d2b810b09cccb4d5d0b
BLAKE2b-256 | 787dde386556d8e4b4a8491af62065e3003a8267d9736c86c1613e22687eabc6
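A manually downloaded file can be checked against the SHA256 digests listed above. This is a minimal sketch using only the Python standard library; the helper names sha256_of and matches_published_hash are made up for this example.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling matches_published_hash with the path to a downloaded ocrd-page-to-alto-1.1.0.tar.gz and the SHA256 value from its table should return True for an intact download.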