Convert AWS Textract JSON to PRImA PAGE XML

textract2page

Introduction

This software converts OCR results from AWS Textract response files (JSON) to PAGE XML files.

Installation

In a Python virtualenv:

pip install textract2page

Usage

The package provides a file-based conversion function, available both as a CLI and as a Python API. The function takes the Textract JSON file and the original image file that was used as input for the OCR. (The image is necessary because Textract stores coordinates as floating-point ratios of the image dimensions, whereas PAGE uses integer pixel coordinates.)
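To illustrate why the image is needed, here is a minimal sketch of the ratio-to-pixel mapping. It is not the package's actual code: the helper names `to_pixel` and `polygon_to_page_points` are hypothetical, and rounding to the nearest pixel and clamping to the image bounds are assumptions. The input follows Textract's `Geometry.Polygon` layout (a list of `{"X": ..., "Y": ...}` ratios), and the output follows the `x1,y1 x2,y2 ...` format of a PAGE `Coords/@points` attribute.

```python
def to_pixel(ratio, size):
    """Map a ratio in [0, 1] to an integer pixel index in [0, size - 1].

    Rounding and clamping are assumptions made for this sketch.
    """
    return min(max(round(ratio * size), 0), size - 1)


def polygon_to_page_points(polygon, image_width, image_height):
    """Convert a Textract-style polygon to a PAGE-style points string.

    polygon: list of dicts with float keys "X" and "Y" (ratios of the
    image width and height, as found in Textract's Geometry.Polygon).
    """
    points = [
        (to_pixel(p["X"], image_width), to_pixel(p["Y"], image_height))
        for p in polygon
    ]
    return " ".join(f"{x},{y}" for x, y in points)


# Example polygon for a 1000x500 pixel image.
polygon = [
    {"X": 0.1, "Y": 0.2},
    {"X": 0.5, "Y": 0.2},
    {"X": 0.5, "Y": 0.4},
    {"X": 0.1, "Y": 0.4},
]
print(polygon_to_page_points(polygon, 1000, 500))
```

The image width and height have to come from the image file itself, since the Textract response does not record them.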

Python API

To convert a Textract file example.json for an image file example.jpg to a PAGE example.xml:

from textract2page import convert_file

convert_file("example.json", "example.jpg", "example.xml")
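For a whole directory of results, a small wrapper around `convert_file` might look like the sketch below. The helper `convert_directory` and its parameters are hypothetical and not part of the package. The sketch assumes that each JSON file has an image with the same base name, and that `convert_file` takes three path strings as in the call above.

```python
from pathlib import Path


def convert_directory(json_dir, image_dir, out_dir, convert=None, image_suffix=".jpg"):
    """Convert every Textract JSON file in json_dir that has a matching image.

    convert defaults to textract2page.convert_file; passing another callable
    with the same three-argument shape is useful for dry runs.
    """
    if convert is None:
        from textract2page import convert_file as convert
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for json_path in sorted(Path(json_dir).glob("*.json")):
        image_path = Path(image_dir) / (json_path.stem + image_suffix)
        if not image_path.exists():
            # No image, no pixel coordinates: skip this result.
            continue
        xml_path = out_dir / (json_path.stem + ".xml")
        convert(str(json_path), str(image_path), str(xml_path))
        written.append(xml_path)
    return written
```

Calling `convert_directory("ocr", "images", "page")` would then write one PAGE file per matched pair into `page/`; the directory names here are placeholders.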

CLI

Analogously, on the command line (either form writes the PAGE XML to example.xml):

textract2page example.json example.jpg > example.xml
textract2page -O example.xml example.json example.jpg

You can get a list of options with --help or -h.

Testing

Testing requires the package to be installed and a local copy of the repository.

To run the regression tests with pytest:

make deps-test
make test-api

To run the regression tests via the command line:

# optionally:
sudo apt-get install xmlstarlet
make test-cli

(If xmlstarlet is available, the CLI test also validates the result tree. Otherwise, it only checks that the command completes without error.)
