Skip to main content

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)

Project description

ocrd-page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)

CircleCI

Introduction

This software converts PAGE XML files to the ALTO XML OCR result format. It enables using PAGE XML generating software in a context where ALTO is needed to display the results, i.e. in libraries.

Installation

In a Python virtualenv:

make install     # or pip install .
# or to install from PyPI
pip install ocrd_page_to_alto

Usage

To convert the PAGE XML document example.xml to ALTO:

page-to-alto example.xml > example.alto.xml

You can get an exhaustive list of page-to-alto's many options with --help:

CLI

Usage: page-to-alto [OPTIONS] FILENAME
  Convert PAGE to ALTO
Options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in PAGE-
                                  XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a  element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if nth textequiv isn't available.
                                  'raise' will lead to a runtime error,
                                  'first' will use the first TextEquiv, 'last'
                                  will use the last TextEquiv on the element
  -O, --output-file FILE          Output filename (or "-" for standard output,
                                  the default)
  -h, --help                      Show this message and exit.

To process an OCR-D workspace, use ocrd_fileformat, which uses page-to-alto by default:

ocrd-fileformat-transform -I OCRD-OCR-OUTPUT-PAGE -O OCRD-OCR-OUTPUT-ALTO \
  -P script-args "--dummy-word --no-check-words --no-check-border"

TODO

  • AlternativeImage
  • unmappable regions
  • handle Border
  • TextStyle
  • ParagraphStyle
  • table regions
  • recursive regions for *Region
  • Set PAGECLASS from pc:Page/@type #4
  • Layers / z-level via StructureTag? #4
  • <SP/>
  • <HYP/>
  • rotation
  • reading order
  • input PAGE-XML not having words #5
  • multiple pc:TextEquivs
  • language
  • script no equivalent in ALTO :(
  • kerning no equivalent in ALTO :(
  • underlineStyle no equivalent in ALTO :(
  • bgColour no equivalent in ALTO :(
  • bgColourRgb no equivalent in ALTO :(
  • reverseVideo no equivalent in ALTO :(
  • xHeight no equivalent in ALTO :(
  • letterSpaced no equivalent in ALTO :(
  • ProcessingStep
  • differentiate/select ALTO versions

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ocrd_page_to_alto-2.2.11.tar.gz (20.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ocrd_page_to_alto-2.2.11-py3-none-any.whl (19.6 kB view details)

Uploaded Python 3

File details

Details for the file ocrd_page_to_alto-2.2.11.tar.gz.

File metadata

  • Download URL: ocrd_page_to_alto-2.2.11.tar.gz
  • Upload date:
  • Size: 20.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ocrd_page_to_alto-2.2.11.tar.gz
Algorithm Hash digest
SHA256 586c46834b1cbd97c2f04ef433d1d65a675afa3d4c86c758afd8004789a2c36c
MD5 1b8c17b86ef9796d3fe08a42efd60e91
BLAKE2b-256 8b3c86e0b3e28fc0687d03723b9a444ea9c09c131d4da38f521175f321305081

See more details on using hashes here.

Provenance

The following attestation bundles were made for ocrd_page_to_alto-2.2.11.tar.gz:

Publisher: cd-pypi.yml on OCR-D/page-to-alto

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ocrd_page_to_alto-2.2.11-py3-none-any.whl.

File metadata

File hashes

Hashes for ocrd_page_to_alto-2.2.11-py3-none-any.whl
Algorithm Hash digest
SHA256 e54786763d3cf6e4209affff935fde0a702d824a376874be0b957fa727bbe3c5
MD5 ae7c9bb01837675688d052fb34da15af
BLAKE2b-256 f5ea8798f74d18a7265d1c4494ac6b15231858093ba79f0d5f94af5b907fe0e2

See more details on using hashes here.

Provenance

The following attestation bundles were made for ocrd_page_to_alto-2.2.11-py3-none-any.whl:

Publisher: cd-pypi.yml on OCR-D/page-to-alto

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page