Skip to main content

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)

Project description

ocrd-page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)

CircleCI

Introduction

This software converts PAGE XML files to the ALTO XML OCR result format. It enables using PAGE XML generating software in a context where ALTO is needed to display the results, i.e. in libraries.

Installation

In a Python virtualenv:

make install     # or pip install .
# or to install from PyPI
pip install ocrd_page_to_alto

Usage

To convert the PAGE XML document example.xml to ALTO:

page-to-alto example.xml > example.alto.xml

You can get an exhaustive list of page-to-alto's many options with --help:

CLI

Usage: page-to-alto [OPTIONS] FILENAME
  Convert PAGE to ALTO
Options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in PAGE-
                                  XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a  element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if nth textequiv isn't available.
                                  'raise' will lead to a runtime error,
                                  'first' will use the first TextEquiv, 'last'
                                  will use the last TextEquiv on the element
  -O, --output-file FILE          Output filename (or "-" for standard output,
                                  the default)
  -h, --help                      Show this message and exit.

To process an OCR-D workspace, use ocrd_fileformat, which uses page-to-alto by default:

ocrd-fileformat-transform -I OCRD-OCR-OUTPUT-PAGE -O OCRD-OCR-OUTPUT-ALTO \
  -P script-args "--dummy-word --no-check-words --no-check-border"

TODO

  • AlternativeImage
  • unmappable regions
  • handle Border
  • TextStyle
  • ParagraphStyle
  • table regions
  • recursive regions for *Region
  • Set PAGECLASS from pc:Page/@type #4
  • Layers / z-level via StructureTag? #4
  • <SP/>
  • <HYP/>
  • rotation
  • reading order
  • input PAGE-XML not having words #5
  • multiple pc:TextEquivs
  • language
  • script no equivalent in ALTO :(
  • kerning no equivalent in ALTO :(
  • underlineStyle no equivalent in ALTO :(
  • bgColour no equivalent in ALTO :(
  • bgColourRgb no equivalent in ALTO :(
  • reverseVideo no equivalent in ALTO :(
  • xHeight no equivalent in ALTO :(
  • letterSpaced no equivalent in ALTO :(
  • ProcessingStep
  • differentiate/select ALTO versions

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ocrd_page_to_alto-2.1.0.tar.gz (20.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ocrd_page_to_alto-2.1.0-py3-none-any.whl (19.6 kB view details)

Uploaded Python 3

File details

Details for the file ocrd_page_to_alto-2.1.0.tar.gz.

File metadata

  • Download URL: ocrd_page_to_alto-2.1.0.tar.gz
  • Upload date:
  • Size: 20.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for ocrd_page_to_alto-2.1.0.tar.gz
Algorithm Hash digest
SHA256 c7b3e82dc2e42a187224d67269ec78db42c606252fb1cc822f58f3d7f63caeae
MD5 cbfd3452f7f34ff71751aef7e9346a78
BLAKE2b-256 ececc4a6668cc47770e63e0d1ae2c8dd1ab1c3de8ff02480cfca7b1dcd91f2b2

See more details on using hashes here.

File details

Details for the file ocrd_page_to_alto-2.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for ocrd_page_to_alto-2.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 91b5c007dee1548228da442b54baf58730540efe24928d356731f5b832c495b7
MD5 dd1a14b9d030235a9fcd0c8e2ccf9444
BLAKE2b-256 3cbc8048512f096fb0fe3f2c543c97939d0368e77d3591c0485f60eb7f674f8b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page