
textract2page

Convert AWS Textract JSON to PRImA PAGE XML

Introduction

This software converts OCR results from Amazon Textract response files (JSON) to PRImA PAGE XML files.

Installation

In a Python virtualenv:

pip install textract2page

Usage

The package provides a file-based conversion function, available through both a CLI and a Python API. The function takes the Textract JSON file and the original image file that was used as input for the OCR. (The image is needed because Textract stores coordinates as floating-point ratios of the page dimensions, whereas PAGE uses integer pixel positions.)
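To illustrate why the image dimensions matter, here is a minimal sketch of the ratio-to-pixel scaling. It is not the package's own implementation: the helper names ratio_polygon_to_pixels and to_page_points are hypothetical, and the rounding and clamping policy is an assumption. The input follows the Textract Geometry.Polygon layout (a list of points with X and Y keys), and the output follows the PAGE points syntax ("x1,y1 x2,y2 ...").

```python
from typing import Dict, List, Tuple


def ratio_polygon_to_pixels(
    polygon: List[Dict[str, float]], width: int, height: int
) -> List[Tuple[int, int]]:
    """Scale Textract ratio points (0.0 to 1.0) to integer pixel positions.

    Hypothetical helper for illustration; rounding and clamping are assumptions.
    """
    points = []
    for point in polygon:
        # Clamp so that rounding never lands outside the image.
        x = min(max(round(point["X"] * width), 0), width - 1)
        y = min(max(round(point["Y"] * height), 0), height - 1)
        points.append((x, y))
    return points


def to_page_points(points: List[Tuple[int, int]]) -> str:
    """Format pixel points as a PAGE points string: 'x1,y1 x2,y2 ...'."""
    return " ".join(f"{x},{y}" for x, y in points)


# A polygon as it might appear under Geometry.Polygon in a Textract block.
polygon = [
    {"X": 0.25, "Y": 0.5},
    {"X": 0.75, "Y": 0.5},
    {"X": 0.75, "Y": 0.75},
    {"X": 0.25, "Y": 0.75},
]
pixels = ratio_polygon_to_pixels(polygon, width=2000, height=3000)
print(to_page_points(pixels))
```

The same polygon maps to different pixel positions for a different image size, which is why the converter needs either the image itself or its width and height.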

Python API

To convert a Textract file example.json for an image file example.jpg to a PAGE file example.xml:

from textract2page import convert_file

convert_file("example.json", "example.jpg", "example.xml")

Alternatively, if you do not have access to the image file, but do know its pixel resolution, use:

from textract2page import convert_file_without_image

convert_file_without_image("example.json",
    # just give it a name (will not be read):
    "example.jpg",
    # set image width so PAGE coordinates will be correct:
    2135,
    # set image height so PAGE coordinates will be correct:
    3240,
    "example.xml")

CLI

Analogously, on the command line interface:

# with image file
textract2page example.json example.jpg > example.xml
textract2page -O example.xml example.json example.jpg
# without image file (just its path name)
textract2page --image-width 2135 --image-height 3240 example.json example.jpg > example.xml
textract2page --image-width 2135 --image-height 3240 -O example.xml example.json example.jpg

You can get a list of options with --help or -h.
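When the CLI is driven from a script, the argument list can be assembled programmatically. The following is a sketch only: build_command is a hypothetical helper, and it uses just the options shown above (-O, --image-width, --image-height). Running the resulting command assumes textract2page is installed and on the PATH.

```python
import shlex
from typing import List, Optional


def build_command(
    json_path: str,
    image_path: str,
    output_path: str,
    image_width: Optional[int] = None,
    image_height: Optional[int] = None,
) -> List[str]:
    """Assemble a textract2page argument list (hypothetical helper)."""
    cmd = ["textract2page"]
    # Pass the dimensions only when both are known; otherwise the
    # image file itself must be readable.
    if image_width is not None and image_height is not None:
        cmd += ["--image-width", str(image_width)]
        cmd += ["--image-height", str(image_height)]
    cmd += ["-O", output_path, json_path, image_path]
    return cmd


cmd = build_command("example.json", "example.jpg", "example.xml", 2135, 3240)
print(shlex.join(cmd))
# To actually run the conversion (requires textract2page to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```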

Testing

Testing requires an installed package and a local copy of the repository.

To run the regression tests with pytest, run:

make deps-test
make test-api

To run the regression tests via the command line, run:

# optionally:
sudo apt-get install xmlstarlet
make test-cli

(If xmlstarlet is available, the CLI test also validates the result tree. Otherwise, it only checks that the command completes without error.)
