Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)

ocrd-page-to-alto

Introduction

This software converts PAGE XML files to the ALTO XML OCR result format. It lets you use software that generates PAGE XML in contexts where ALTO is needed to display the results, e.g. in libraries.

Installation

In a Python virtualenv:

make install     # or pip install .
# or to install from PyPI
pip install ocrd_page_to_alto

Usage

To convert the PAGE XML document example.xml to ALTO:

page-to-alto example.xml > example.alto.xml
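
For converting many files, the command can be driven from Python. The following is a minimal sketch, not part of this package: the helper names build_command and convert_directory and the *.alto.xml naming scheme are assumptions made for illustration. It relies only on the --alto-version and -O options listed in the help output under CLI.

```python
import subprocess
from pathlib import Path


def build_command(page_file, alto_file, alto_version="4.2"):
    """Build the argument list for a single page-to-alto call.

    Hypothetical helper: only the flags documented in --help are used.
    """
    return [
        "page-to-alto",
        "--alto-version", alto_version,
        "-O", str(alto_file),
        str(page_file),
    ]


def convert_directory(src_dir, dst_dir, alto_version="4.2"):
    """Convert every *.xml file in src_dir and write *.alto.xml into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for page_file in sorted(Path(src_dir).glob("*.xml")):
        alto_file = dst / (page_file.stem + ".alto.xml")
        # check=True raises CalledProcessError if page-to-alto exits non-zero
        subprocess.run(build_command(page_file, alto_file, alto_version), check=True)
```

Calling convert_directory("page", "alto") would then run page-to-alto once per file, assuming the page-to-alto executable is on the PATH.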

You can get an exhaustive list of page-to-alto's many options with --help:

CLI

Usage: page-to-alto [OPTIONS] FILENAME
  Convert PAGE to ALTO
Options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in PAGE-
                                  XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a <HYP/> element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if nth textequiv isn't available.
                                  'raise' will lead to a runtime error,
                                  'first' will use the first TextEquiv, 'last'
                                  will use the last TextEquiv on the element
  -O, --output-file FILE          Output filename (or "-" for standard output,
                                  the default)
  -h, --help                      Show this message and exit.
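
The interplay of --textequiv-index and --textequiv-fallback-strategy can be pictured with a short sketch. This is an illustration of the behaviour described in the help text, not the tool's actual code: the function select_textequiv is hypothetical, TextEquivs are modelled as (index, text) tuples, "first" and "last" are assumed to refer to document order, and the exception type is an assumption.

```python
def select_textequiv(textequivs, index, fallback):
    """Pick the text of the TextEquiv whose @index equals `index`.

    textequivs: list of (index, text) tuples in document order (a simplified
    stand-in for the pc:TextEquiv elements of one PAGE element).
    fallback: one of "raise", "first", "last".
    """
    for idx, text in textequivs:
        if idx == index:
            return text
    # The requested index is not available: apply the fallback strategy.
    if textequivs and fallback == "first":
        return textequivs[0][1]
    if textequivs and fallback == "last":
        return textequivs[-1][1]
    # Assumption: LookupError stands in for the runtime error the tool reports.
    raise LookupError(f"no TextEquiv with index {index}")
```

For example, with two TextEquivs indexed 0 and 1, requesting index 5 with the "last" strategy returns the text of the second one, while "raise" aborts.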

To process an OCR-D workspace, use ocrd_fileformat, which uses page-to-alto by default:

ocrd-fileformat-transform -I OCRD-OCR-OUTPUT-PAGE -O OCRD-OCR-OUTPUT-ALTO \
  -P script-args "--dummy-word --no-check-words --no-check-border"
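
Before choosing flags such as --no-check-words or --no-check-border, it can help to see what a PAGE file actually contains. The sketch below is a hypothetical pre-flight check, not part of page-to-alto: the function inspect_page and its result keys are made up for illustration, and elements are matched by local name so that the PAGE namespace version does not matter.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def inspect_page(xml_data):
    """Report whether a PAGE document has Word and Border/PrintSpace elements.

    xml_data: the PAGE XML as str or bytes.
    """
    root = ET.fromstring(xml_data)
    names = {local_name(el.tag) for el in root.iter()}
    return {
        "has_words": "Word" in names,
        "has_border": "Border" in names or "PrintSpace" in names,
    }
```

A file for which has_words is False is a candidate for --dummy-word together with --no-check-words; reading the file with Path("example.xml").read_bytes() and passing the result to inspect_page is one way to run the check.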

TODO

  • AlternativeImage
  • unmappable regions
  • handle Border
  • TextStyle
  • ParagraphStyle
  • table regions
  • recursive regions for *Region
  • Set PAGECLASS from pc:Page/@type #4
  • Layers / z-level via StructureTag? #4
  • <SP/>
  • <HYP/>
  • rotation
  • reading order
  • input PAGE-XML not having words #5
  • multiple pc:TextEquivs
  • language
  • script: no equivalent in ALTO :(
  • kerning: no equivalent in ALTO :(
  • underlineStyle: no equivalent in ALTO :(
  • bgColour: no equivalent in ALTO :(
  • bgColourRgb: no equivalent in ALTO :(
  • reverseVideo: no equivalent in ALTO :(
  • xHeight: no equivalent in ALTO :(
  • letterSpaced: no equivalent in ALTO :(
  • ProcessingStep
  • differentiate/select ALTO versions
