Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
ocrd-page-to-alto
Introduction
This software converts PAGE XML files to the ALTO XML OCR result format. It lets software that produces PAGE XML be used in contexts where ALTO is needed to display the results, e.g. in libraries.
Installation
In a Python virtualenv:
make install # or pip install .
# or to install from PyPI
pip install ocrd_page_to_alto
Usage
To convert the PAGE XML document example.xml to ALTO:
page-to-alto example.xml > example.alto.xml
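For converting many files, the same call can be driven from Python. The following is a minimal sketch, not part of this project: the helper names `alto_path_for`, `build_command` and `convert_directory` are hypothetical, and it assumes `page-to-alto` is on the `PATH`. It only relies on the `--output-file` option documented in the `--help` output.

```python
import subprocess
from pathlib import Path
from typing import List


def alto_path_for(page_path: Path) -> Path:
    """Mirror the naming used above: example.xml -> example.alto.xml."""
    return page_path.with_suffix(".alto.xml")


def build_command(page_path: Path, alto_path: Path) -> List[str]:
    """Build the argv list for one conversion (hypothetical helper)."""
    # --output-file is taken from the tool's --help output.
    return ["page-to-alto", "--output-file", str(alto_path), str(page_path)]


def convert_directory(directory: Path) -> List[Path]:
    """Convert every PAGE XML file in `directory`, returning the ALTO paths."""
    written = []
    for page_path in sorted(directory.glob("*.xml")):
        if page_path.name.endswith(".alto.xml"):
            continue  # skip results of an earlier run
        alto_path = alto_path_for(page_path)
        # check=True raises CalledProcessError if the conversion fails.
        subprocess.run(build_command(page_path, alto_path), check=True)
        written.append(alto_path)
    return written
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.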
You can get an exhaustive list of page-to-alto's many options with --help:
CLI
Usage: page-to-alto [OPTIONS] FILENAME

  Convert PAGE to ALTO

Options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in PAGE-
                                  XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a <HYP/> element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if nth textequiv isn't available.
                                  'raise' will lead to a runtime error,
                                  'first' will use the first TextEquiv, 'last'
                                  will use the last TextEquiv on the element
  -O, --output-file FILE          Output filename (or "-" for standard output,
                                  the default)
  -h, --help                      Show this message and exit.
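The interplay of `--textequiv-index` and `--textequiv-fallback-strategy` can be illustrated with a small sketch. This is not the converter's actual code: the `TextEquiv` class and `select_textequiv` function are hypothetical stand-ins that only model the behaviour described in the help text, and the handling of an element without any TextEquiv is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextEquiv:
    """Hypothetical stand-in for a PAGE TextEquiv: its @index and its text."""
    index: Optional[int]
    unicode: str


def select_textequiv(textequivs: List[TextEquiv], index: int,
                     fallback: str = "first") -> TextEquiv:
    """Pick the TextEquiv whose @index matches, else apply the fallback."""
    if not textequivs:
        # Assumption: an element without any TextEquiv is always an error here.
        raise ValueError("element has no TextEquiv")
    for candidate in textequivs:
        if candidate.index == index:
            return candidate
    if fallback == "first":
        return textequivs[0]   # first TextEquiv on the element
    if fallback == "last":
        return textequivs[-1]  # last TextEquiv on the element
    if fallback == "raise":
        raise LookupError(f"no TextEquiv with @index={index}")
    raise ValueError(f"unknown fallback strategy: {fallback!r}")


candidates = [TextEquiv(1, "Haus"), TextEquiv(2, "Hans"), TextEquiv(3, "Hams")]
print(select_textequiv(candidates, 2).unicode)           # matching @index
print(select_textequiv(candidates, 7, "last").unicode)   # fallback applies
```

With `raise`, a missing index stops the conversion, which is useful when the choice of text variant matters; `first` and `last` trade that strictness for always producing output.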
To process an OCR-D workspace, use ocrd_fileformat, which uses page-to-alto by default:
ocrd-fileformat-transform -I OCRD-OCR-OUTPUT-PAGE -O OCRD-OCR-OUTPUT-ALTO \
-P script-args "--dummy-word --no-check-words --no-check-border"
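When this call is generated from a script, the `script-args` value has to stay a single argument even though it contains spaces. A minimal sketch of assembling the command in Python follows; the helper name `build_transform_command` is hypothetical, and the file group names and options are simply those from the example above.

```python
import shlex
from typing import List


def build_transform_command(input_group: str, output_group: str,
                            script_args: List[str]) -> List[str]:
    """Assemble the ocrd-fileformat-transform argv list (hypothetical helper)."""
    return [
        "ocrd-fileformat-transform",
        "-I", input_group,
        "-O", output_group,
        # All page-to-alto options travel as ONE value of the script-args parameter.
        "-P", "script-args", " ".join(script_args),
    ]


command = build_transform_command(
    "OCRD-OCR-OUTPUT-PAGE",
    "OCRD-OCR-OUTPUT-ALTO",
    ["--dummy-word", "--no-check-words", "--no-check-border"],
)
# shlex.join quotes the script-args value so it can be pasted into a shell.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from inside an OCR-D workspace directory.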
TODO
- AlternativeImage
- unmappable regions
- handle Border
- TextStyle
- ParagraphStyle
- table regions
- recursive regions for *Region
- Set PAGECLASS from pc:Page/@type #4
- Layers / z-level via StructureTag? #4
- <SP/>
- <HYP/>
- rotation
- reading order
- input PAGE-XML not having words #5
- multiple pc:TextEquivs
- language
- script: no equivalent in ALTO :(
- kerning: no equivalent in ALTO :(
- underlineStyle: no equivalent in ALTO :(
- bgColour: no equivalent in ALTO :(
- bgColourRgb: no equivalent in ALTO :(
- reverseVideo: no equivalent in ALTO :(
- xHeight: no equivalent in ALTO :(
- letterSpaced: no equivalent in ALTO :(
- ProcessingStep
- differentiate/select ALTO versions