ocrd-page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)

Introduction

This software converts PAGE XML files to the ALTO XML OCR result format. It lets you use software that generates PAGE XML in contexts where ALTO is needed to display the results, e.g. in libraries.

Installation

In a Python virtualenv:

make install     # or pip install .
# or to install from PyPI
pip install ocrd_page_to_alto

Usage

To convert the PAGE XML document example.xml to ALTO:

page-to-alto example.xml > example.alto.xml
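For more than one file, the same call can be wrapped in a small script. The following is a minimal sketch, not part of this package: the function names `build_command` and `convert_directory`, the directory layout and the `.alto.xml` suffix are assumptions made for illustration. It relies only on the `--alto-version` and `-O` options documented in the `--help` output, and it assumes `page-to-alto` is on the `PATH`.

```python
import subprocess
from pathlib import Path


def build_command(page_file, output_file, alto_version="4.2"):
    """Return the page-to-alto argument list for one PAGE XML file."""
    return [
        "page-to-alto",
        "--alto-version", alto_version,
        "-O", str(output_file),
        str(page_file),
    ]


def convert_directory(page_dir, alto_dir, alto_version="4.2"):
    """Convert every *.xml file in page_dir, writing results to alto_dir."""
    alto_dir = Path(alto_dir)
    alto_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for page_file in sorted(Path(page_dir).glob("*.xml")):
        output_file = alto_dir / (page_file.stem + ".alto.xml")
        # check=True raises CalledProcessError if page-to-alto exits non-zero
        subprocess.run(
            build_command(page_file, output_file, alto_version), check=True
        )
        written.append(output_file)
    return written
```

Calling `convert_directory("page", "alto", alto_version="4.1")` would then convert each file in a hypothetical `page/` directory into `alto/`, stopping at the first file that fails.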

You can get an exhaustive list of page-to-alto's many options with --help:

CLI

Usage: page-to-alto [OPTIONS] FILENAME
  Convert PAGE to ALTO
Options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in PAGE-
                                  XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a <HYP/> element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if nth textequiv isn't available.
                                  'raise' will lead to a runtime error,
                                  'first' will use the first TextEquiv, 'last'
                                  will use the last TextEquiv on the element
  -O, --output-file FILE          Output filename (or "-" for standard output,
                                  the default)
  -h, --help                      Show this message and exit.
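The interplay of --textequiv-index and --textequiv-fallback-strategy may be easier to follow as code. The sketch below only illustrates the behaviour described in the help text and is not the tool's actual implementation: the function name `select_textequiv`, the representation of TextEquivs as `(index, text)` pairs in document order, and the handling of an empty list are all assumptions.

```python
def select_textequiv(textequivs, index, fallback_strategy="first"):
    """Pick one text from a list of (index, text) pairs.

    textequivs stands in for the TextEquiv children of one PAGE element,
    in document order (a hypothetical, simplified representation).
    """
    if not textequivs:
        # Assumption: nothing to select from, so return no text at all
        return None
    for equiv_index, text in textequivs:
        if equiv_index == index:
            return text
    # The requested index is not available: apply the fallback strategy
    if fallback_strategy == "raise":
        raise ValueError(f"No TextEquiv with index {index}")
    if fallback_strategy == "first":
        return textequivs[0][1]
    if fallback_strategy == "last":
        return textequivs[-1][1]
    raise ValueError(f"Unknown fallback strategy: {fallback_strategy}")
```

With `[(0, "raw"), (1, "corrected")]` as input, requesting index 1 yields the second text, while requesting a missing index yields the first or last text, or an error, depending on the strategy.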

To process an OCR-D workspace, use ocrd_fileformat, which uses page-to-alto by default:

ocrd-fileformat-transform -I OCRD-OCR-OUTPUT-PAGE -O OCRD-OCR-OUTPUT-ALTO \
  -P script-args "--dummy-word --no-check-words --no-check-border"
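When the call is assembled from a script, the main pitfall is that all page-to-alto options must travel as a single value of the script-args parameter. A minimal sketch, assuming only the -I, -O and -P options shown above (the helper name `build_transform_command` is hypothetical):

```python
import shlex


def build_transform_command(input_grp, output_grp, script_args):
    """Return the ocrd-fileformat-transform argument list.

    script_args is a list of page-to-alto options; they are joined into
    one string because script-args takes a single value.
    """
    return [
        "ocrd-fileformat-transform",
        "-I", input_grp,
        "-O", output_grp,
        "-P", "script-args", " ".join(script_args),
    ]


command = build_transform_command(
    "OCRD-OCR-OUTPUT-PAGE",
    "OCRD-OCR-OUTPUT-ALTO",
    ["--dummy-word", "--no-check-words", "--no-check-border"],
)
# Print a shell-quoted version of the command for inspection
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from inside an OCR-D workspace.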

TODO

  • AlternativeImage
  • unmappable regions
  • handle Border
  • TextStyle
  • ParagraphStyle
  • table regions
  • recursive regions for *Region
  • Set PAGECLASS from pc:Page/@type #4
  • Layers / z-level via StructureTag? #4
  • <SP/>
  • <HYP/>
  • rotation
  • reading order
  • input PAGE-XML not having words #5
  • multiple pc:TextEquivs
  • language
  • script: no equivalent in ALTO :(
  • kerning: no equivalent in ALTO :(
  • underlineStyle: no equivalent in ALTO :(
  • bgColour: no equivalent in ALTO :(
  • bgColourRgb: no equivalent in ALTO :(
  • reverseVideo: no equivalent in ALTO :(
  • xHeight: no equivalent in ALTO :(
  • letterSpaced: no equivalent in ALTO :(
  • ProcessingStep
  • differentiate/select ALTO versions
