Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
ocrd-page-to-alto
Introduction
This software converts PAGE XML files to the ALTO XML OCR result format. It lets software that produces PAGE XML be used in contexts where ALTO is needed to display the results, e.g. in libraries.
Installation
In a Python virtualenv:
make install # or pip install .
# or to install from PyPI
pip install ocrd_page_to_alto
Usage
To convert the PAGE XML document example.xml to ALTO:
page-to-alto example.xml > example.alto.xml
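The single-file call above can be extended to a whole directory. The following is a minimal sketch, not part of the tool itself: the helper names alto_name and convert_dir are hypothetical, and the foo.xml to foo.alto.xml naming simply follows the example above.

```shell
#!/bin/sh
# Derive the ALTO output name for a PAGE file: foo.xml -> foo.alto.xml
# (hypothetical helper; the naming follows the example above).
alto_name() {
    printf '%s\n' "${1%.xml}.alto.xml"
}

# Convert every PAGE file in a directory (hypothetical helper).
convert_dir() {
    for page in "$1"/*.xml; do
        [ -e "$page" ] || continue      # glob matched nothing
        case $page in
            *.alto.xml) continue ;;     # skip output of an earlier run
        esac
        page-to-alto "$page" > "$(alto_name "$page")"
    done
}
```

After sourcing these definitions, convert_dir path/to/page-files would write one .alto.xml file next to each PAGE file.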
You can get an exhaustive list of page-to-alto's many options with --help:
CLI
Usage: page-to-alto [OPTIONS] FILENAME

  Convert PAGE to ALTO

Options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in
                                  PAGE-XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a <HYP/> element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if nth textequiv isn't available.
                                  'raise' will lead to a runtime error,
                                  'first' will use the first TextEquiv, 'last'
                                  will use the last TextEquiv on the element
  -O, --output-file FILE          Output filename (or "-" for standard output,
                                  the default)
  -h, --help                      Show this message and exit.
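Several of these options are typically combined. As an illustrative sketch only, the hypothetical wrapper below targets a chosen ALTO version, tolerates PAGE input without Words or Border, and writes to a file with -O instead of redirecting standard output. The function name convert_with_version and the PAGE_TO_ALTO override variable are assumptions made for this example, and the fallback to 4.2 is simply the newest version in the list above.

```shell
#!/bin/sh
# Hypothetical wrapper around page-to-alto; every flag comes from the help text.
# PAGE_TO_ALTO is an assumed override (e.g. for a dry run with "echo").
convert_with_version() {
    page=$1
    version=${2:-4.2}   # this sketch falls back to the newest listed version
    "${PAGE_TO_ALTO:-page-to-alto}" \
        --alto-version "$version" \
        --dummy-word --no-check-words --no-check-border \
        -O "${page%.xml}.alto.xml" \
        "$page"
}
```

For example, convert_with_version example.xml 3.1 would request ALTO 3.1 and write example.alto.xml. As the help text notes, older versions may not preserve all features.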
To process an OCR-D workspace, use ocrd_fileformat, which uses page-to-alto by default:
ocrd-fileformat-transform -I OCRD-OCR-OUTPUT-PAGE -O OCRD-OCR-OUTPUT-ALTO \
-P script-args "--dummy-word --no-check-words --no-check-border"
TODO
- AlternativeImage
- unmappable regions
- handle Border
- TextStyle
- ParagraphStyle
- table regions
- recursive regions for *Region
- Set PAGECLASS from pc:Page/@type #4
- Layers / z-level via StructureTag? #4
- <SP/>
- <HYP/>
- rotation
- reading order
- input PAGE-XML not having words #5
- multiple pc:TextEquivs
- language
- script: no equivalent in ALTO :(
- kerning: no equivalent in ALTO :(
- underlineStyle: no equivalent in ALTO :(
- bgColour: no equivalent in ALTO :(
- bgColourRgb: no equivalent in ALTO :(
- reverseVideo: no equivalent in ALTO :(
- xHeight: no equivalent in ALTO :(
- letterSpaced: no equivalent in ALTO :(
- ProcessingStep
- differentiate/select ALTO versions