Creates ready-to-use Label Studio pre-populated JSON files from popular OCR formats.

ls-converter

LabelStudioConverter (or ls_converter for short) is a simple library that converts OCR output into pre-annotated data for import into Label Studio.

Currently, it can convert directly from PyTesseract, ABBYY FineReader, and Transkribus output. All it needs is an image (which can be a path, a public URL, or an Image object) and some input_data (which can be a path to a JSON file).

It even comes with a quick utility tool to provide the PyTesseract data if you don't have it available:

from ls_converter import LabelStudioConverter, Input
from ls_converter.utils import url_to_tesseract_data

URL = "http://<URL-TO-PUBLICLY-AVAILABLE-IMAGE>"

converter = LabelStudioConverter(input_format=Input.TESSERACT)

converted_data = converter.convert(
    image=URL,
    input_data=url_to_tesseract_data(URL),
)

Installing

Install LabelStudioConverter with pip:

$ pip install ls_converter
...

OCR a public image URL with PyTesseract into Label Studio

In this example, we have a publicly available historical newspaper directory from the University of Leicester’s Special Collections. We take the direct URL to the image and, using the built-in url_to_tesseract_data function, pass both the image and the OCR data straight into the .convert method. The next step is to save the resulting dictionary as a JSON file, for which the package provides a save_json helper. In effectively three lines of code, we have run OCR on the image and created a file that we can import into Label Studio.

from ls_converter import LabelStudioConverter, Input
from ls_converter.utils import url_to_tesseract_data, save_json

URL = "http://specialcollections.le.ac.uk/iiif/2/p16445coll4:8897/full/730,/0/default.jpg?page=27"

converter = LabelStudioConverter(input_format=Input.TESSERACT)
converted_data = converter.convert(image=URL, input_data=url_to_tesseract_data(URL))

save_json(converted_data, "import-me-into-label-studio.json")

If you select “Optical Character Recognition” in the Labeling Setup of the project where you import the resulting JSON file, you should end up with something like this:

Label Studio interface after importing PyTesseract’s resulting JSON
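To give a feel for what such an import file contains, here is a minimal sketch of how PyTesseract-style word boxes could be mapped to Label Studio pre-annotations. It is an illustration only, not ls_converter's actual implementation. The function names are hypothetical, the sample data is made up, and the control names "bbox", "transcription" and "image" assume Label Studio's default "Optical Character Recognition" template. Label Studio expects box geometry as percentages of the image size, which is the main step shown here.

```python
import uuid


def tesseract_to_ls_results(data, image_width, image_height):
    """Turn a pytesseract-style column dict into Label Studio result items.

    Hypothetical helper for illustration; not part of ls_converter.
    """
    results = []
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # rows without text (page, block, line levels) are skipped
        # Label Studio stores box geometry as percentages of the image size.
        value = {
            "x": 100 * data["left"][i] / image_width,
            "y": 100 * data["top"][i] / image_height,
            "width": 100 * data["width"][i] / image_width,
            "height": 100 * data["height"][i] / image_height,
            "rotation": 0,
        }
        shared = {
            "id": uuid.uuid4().hex[:10],  # same id links the box and its text
            "to_name": "image",           # assumed name from the OCR template
            "original_width": image_width,
            "original_height": image_height,
            "image_rotation": 0,
        }
        results.append(
            {**shared, "from_name": "bbox", "type": "rectangle", "value": value}
        )
        results.append(
            {
                **shared,
                "from_name": "transcription",
                "type": "textarea",
                "value": {**value, "text": [text]},
            }
        )
    return results


def build_task(image_url, data, image_width, image_height):
    """Wrap the results in a task whose image is referenced by the "ocr" key."""
    results = tesseract_to_ls_results(data, image_width, image_height)
    return {"data": {"ocr": image_url}, "predictions": [{"result": results}]}


# Made-up sample in the column layout that pytesseract's image_to_data
# produces with dictionary output (only the columns used above are shown).
sample = {
    "text": ["", "Leicester", "Directory"],
    "left": [0, 73, 219],
    "top": [0, 50, 50],
    "width": [730, 146, 146],
    "height": [1000, 25, 25],
}
task = build_task("http://example.com/page.jpg", sample, 730, 1000)
```

Each recognised word yields two result items that share an id: one rectangle and one text area holding the transcription.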

Using a local image and ABBYY FineReader

If you instead have an image and the resulting JSON file from running it through ABBYY FineReader, you only have to adjust the import of the data thus:

from ls_converter import LabelStudioConverter, Input
from ls_converter.utils import load_json, save_json

LOCAL_IMAGE = "abbyy-output/0212_BCL8001.jpg"
LOCAL_JSON = "abbyy-output/0212_BCL8001.json"
REMOTE_IMAGE = "https://lwmincomingtradedirs.blob.core.windows.net/jpg/0212_BCL8001.jpg"

converter = LabelStudioConverter(input_format=Input.ABBYY)
converted_data = converter.convert(
    image=LOCAL_IMAGE,
    input_data=load_json(LOCAL_JSON),
    url=REMOTE_IMAGE,
)

save_json(converted_data, "import-me-into-label-studio.json")

Note that, in this example, we use the convenient utility function load_json to load the JSON file with the ABBYY results. In this example (and in any case where you have both a locally stored and a remote version of the image), if you don’t care too much about speed, you can just pass the URL as the image parameter. If you are processing 100+ images, the script will run much faster if you have locally stored images.

When imported into Label Studio, the result from the above example will contain different boxes and text (since ABBYY’s output differs from Tesseract’s), but otherwise it should look the same:

Label Studio interface after importing ABBYY FineReader’s resulting JSON
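For the 100+ images case mentioned above, one possible way to prepare the work is to pair each local image with its OCR output and its remote URL up front. The sketch below uses only the standard library. The collect_jobs helper, the folder layout (a .json file next to each .jpg with the same stem) and the base URL are all assumptions made for illustration.

```python
from pathlib import Path


def collect_jobs(folder, base_url):
    """Pair every local .jpg with its same-named .json and its remote URL.

    Hypothetical helper; assumes "<stem>.jpg" and "<stem>.json" sit side by
    side and that the remote image lives at "<base_url>/<stem>.jpg".
    """
    jobs = []
    for image_path in sorted(Path(folder).glob("*.jpg")):
        json_path = image_path.with_suffix(".json")
        if not json_path.exists():
            continue  # no OCR output for this image, so skip it
        jobs.append(
            {
                "image": str(image_path),
                "input_json": str(json_path),
                "url": f"{base_url.rstrip('/')}/{image_path.name}",
            }
        )
    return jobs
```

Each job can then be passed to the converter in a loop, in the same way as the single-image example: image=job["image"], input_data=load_json(job["input_json"]), url=job["url"].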

Risk for error

Because you run the data conversion on a local file, you must specify a url as a parameter in the .convert method (as we did above). Alternatively, after you export the data, you can adjust the "ocr" value in the resulting JSON file before importing it into Label Studio. Otherwise you will see the following error message:

Label Studio error message after importing faulty JSON
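If you have already exported a file without a url, the "ocr" value can be patched afterwards with a few lines of standard-library Python. This is a minimal sketch: set_ocr_url is a hypothetical helper, and it assumes that the "ocr" value sits under each task's "data" key, as in Label Studio's task format. Check your own file before relying on it.

```python
import json


def set_ocr_url(json_path, image_url):
    """Rewrite the "ocr" value of every task in a Label Studio import file.

    Hypothetical helper; assumes each task stores its image under data.ocr.
    """
    with open(json_path, encoding="utf-8") as f:
        tasks = json.load(f)
    # The file may hold a single task or a list of tasks.
    task_list = tasks if isinstance(tasks, list) else [tasks]
    for task in task_list:
        task.setdefault("data", {})["ocr"] = image_url
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(tasks, f, ensure_ascii=False, indent=2)
    return tasks
```

Because every task receives the same URL, this is meant for a file created from a single image, as in the examples on this page.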

Using Transkribus results instead

Out of the box, LabelStudioConverter also supports the XML files created by Transkribus. In this example, similar to the ABBYY example above, we provide a local image and a remote version of the same image, and use the load_xml_as_json utility function to read in the Transkribus XML as JSON data.

from ls_converter import LabelStudioConverter, Input
from ls_converter.utils import load_xml_as_json, save_json

LOCAL_IMAGE = "transkribus-output/0219_BCL8001.jpg"
LOCAL_XML = "transkribus-output/0219_BCL8001.xml"
REMOTE_IMAGE = "https://lwmincomingtradedirs.blob.core.windows.net/jpg/0219_BCL8001.jpg"

converter = LabelStudioConverter(input_format=Input.TRANSKRIBUS)
converted_data = converter.convert(
    image=LOCAL_IMAGE,
    input_data=load_xml_as_json(LOCAL_XML),
    url=REMOTE_IMAGE,
)

save_json(converted_data, "import-me-into-label-studio.json")

Because we provide the url to the remote image in the example above, the result looks similar when viewed in Label Studio, although the boxes and text differ due to Transkribus’s own OCR and layout parsing algorithm:

Label Studio interface after importing Transkribus’s resulting JSON
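For readers curious about what the Transkribus XML holds, the sketch below reads text lines and their bounding boxes from a PAGE XML string using only the standard library. It is not how load_xml_as_json works internally. The text_lines helper is hypothetical, the sample document (including its text) is made up, and the code assumes the PAGE 2013-07-15 namespace and comma-separated polygon points.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the PAGE XML schema used by the export.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}


def text_lines(xml_text):
    """Return (text, (left, top, width, height)) for each TextLine.

    Hypothetical helper; the polygon in Coords is reduced to its bounding box.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iterfind(".//pc:TextLine", PAGE_NS):
        coords = line.find("pc:Coords", PAGE_NS)
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", PAGE_NS)
        if coords is None or unicode_el is None:
            continue  # skip lines without geometry or transcription
        points = [
            tuple(int(float(n)) for n in pair.split(","))
            for pair in coords.get("points", "").split()
        ]
        if not points:
            continue
        xs, ys = zip(*points)
        box = (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))
        lines.append((unicode_el.text or "", box))
    return lines


# Made-up sample document for illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="100,200 400,200 400,240 100,240"/>
        <TextEquiv><Unicode>Smith John, baker</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

parsed = text_lines(SAMPLE)
```

The pixel boxes obtained this way would still need to be converted to percentages of the image size before Label Studio can display them.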

Change Log

0.0.1 (Dec 14, 2022)

  • First alpha version
