
A tool for extracting, labeling and linking entities in document images for Information Extraction tasks.



python3 -m build
twine upload dist/* --repository toolri --config-file .pypirc

ToolRI

ToolRI was created to simplify and standardize the creation of samples for Information Extraction from document images. The tool supports text extraction by OCR, the creation of document entities, and the labeling and linking of those entities. The project is written entirely in Python and can be run on any desktop platform. The graphical user interface is implemented with the amazing CustomTkinter library.

Installation

PyPI

Install the ToolRI package with pip:

pip install toolri

Source

Clone the ToolRI repository with:

git clone https://github.com/Victorgonl/ToolRI

Then install it with pip:

pip install ./ToolRI/
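
To confirm that either installation route worked, the installed version can be read from the package metadata. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and is not part of ToolRI.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("toolri")
    if version is None:
        print("toolri is not installed in this environment")
    else:
        print(f"toolri {version} is installed")
```

Run it with the same interpreter that pip installed into, since each Python environment keeps its own package metadata.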

Standalone

Download

You can download a portable binary of the tool and start using it right away. Download and run a version from the releases page of the ToolRI repository.

Build

To build the standalone version of ToolRI into a portable binary, clone the repository:

git clone https://github.com/Victorgonl/ToolRI

Change the current directory to ./ToolRI:

cd ./ToolRI

Install all the dependencies listed in requirements.txt:

pip install -r requirements.txt

And run the script toolri_build.py:

python3 toolri_build.py

The binary will be available in the dist folder.

Documentation

Under construction.

Tesseract OCR

To use the OCR function in ToolRI, Tesseract OCR must be installed separately.

For now, OCR is configured for English and Portuguese only; support for all available languages is planned.

Debian-based

Use the command:

sudo apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-por

Windows

  • Download and run the installer available at https://github.com/UB-Mannheim/tesseract/wiki.

  • Make sure to install Tesseract in C:\Program Files\Tesseract-OCR\ (the default directory), because the current ToolRI version relies on a predefined configuration that expects this path.
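
After installing Tesseract by either route above, it can help to check that the executable is reachable and that the English and Portuguese language data are present. The sketch below relies on the `tesseract --list-langs` command; the helper names `parse_langs` and `missing_langs` are hypothetical and are not part of ToolRI.

```python
import shutil
import subprocess
from typing import List, Set

# Languages the current ToolRI version is configured for, per the text above.
REQUIRED_LANGS = {"eng", "por"}


def parse_langs(output: str) -> Set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    The listing starts with a header line ending in ":", which is skipped.
    """
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        if line and not line.endswith(":"):
            langs.add(line)
    return langs


def missing_langs(output: str, required: Set[str] = REQUIRED_LANGS) -> List[str]:
    """Return the required language codes absent from the listing, sorted."""
    return sorted(required - parse_langs(output))


if __name__ == "__main__":
    exe = shutil.which("tesseract")
    if exe is None:
        print("tesseract was not found on PATH")
    else:
        result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
        # Some Tesseract versions print the listing to stderr, so read both streams.
        missing = missing_langs(result.stdout + "\n" + result.stderr)
        print("missing language data:", missing or "none")
```

On Windows the installer may not add Tesseract to PATH, so `shutil.which` can return None even when Tesseract sits in the default directory; in that case the check says nothing about whether ToolRI itself will find it.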

Usage

ToolRI was developed and used to create the UFLA-FORMS dataset. Download the dataset to try the tool on the available samples, or create new metadata for any document image.

Example

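import toolri

# Load the document image to annotate.
image = toolri.load_image("document_image.jpg")

# Define the entity labels; QUESTION declares a link to ANSWER.
labels = [
    toolri.ToolRILabel(name="QUESTION", color="#004B80", links=["ANSWER"], is_visible=True),
    toolri.ToolRILabel(name="ANSWER", color="#00943E", links=[], is_visible=True)
]

# Assumption: start from an empty list when there is no existing data;
# load previously saved data here instead to continue editing it.
data = []

# Run ToolRI on the image and keep the returned value as the updated data.
data = toolri.toolri(image=image, data=data, labels=labels)

# Draw the data on the image and display the result.
toolri.draw_data_on_image(image=image, data=data, labels=labels).show()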
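
Because each label names other labels in its `links` list, a typo in a link target is easy to miss. The sketch below checks a label set for link targets that are not defined. It uses a hypothetical `Label` dataclass that only mirrors the fields shown in the example above; it is not ToolRI's `ToolRILabel` class, and reading `links` as "labels this label may link to" is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Label:
    """Hypothetical stand-in with the fields used in the example; not ToolRI's class."""
    name: str
    color: str
    links: List[str] = field(default_factory=list)
    is_visible: bool = True


def undefined_link_targets(labels: List[Label]) -> Dict[str, List[str]]:
    """Map each label name to the link targets that are not defined label names."""
    known = {label.name for label in labels}
    problems = {}
    for label in labels:
        unknown = [target for target in label.links if target not in known]
        if unknown:
            problems[label.name] = unknown
    return problems


labels = [
    Label(name="QUESTION", color="#004B80", links=["ANSWER"]),
    Label(name="ANSWER", color="#00943E"),
]
print(undefined_link_targets(labels))  # → {}
```

An empty result means every link target matches a defined label name; a misspelled target such as "ANSWERS" would be reported under the label that declares it.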

Download files


Source Distribution

toolri-1.0.2.tar.gz (39.8 kB)

Uploaded Source

Built Distribution


toolri-1.0.2-py3-none-any.whl (56.9 kB)

Uploaded Python 3

File details

Details for the file toolri-1.0.2.tar.gz.

File metadata

  • Download URL: toolri-1.0.2.tar.gz
  • Upload date:
  • Size: 39.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.16

File hashes

Hashes for toolri-1.0.2.tar.gz
Algorithm Hash digest
SHA256 e47bb1af341be1f0cc18cc9ed0c1e7ba733b34e2beba52b84a581e3a91418939
MD5 3be63c46cae12a0a8d455a8c1905e443
BLAKE2b-256 5cd705cec5670fb22a9533dd5268818585f9f61533e254f0ad6ac3f6d2298688
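
A downloaded file can be checked against the digests listed above before it is installed. The following is a minimal sketch with the standard `hashlib` module; the helper names `file_digests` and `sha256_matches` are hypothetical, and the file is assumed to be in the current directory. MD5 is left out here but could be added to the same dictionary.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_digests(path: Path, chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Compute SHA256 and BLAKE2b-256 hex digests of a file, reading it in chunks."""
    hashers = {
        "SHA256": hashlib.sha256(),
        # BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def sha256_matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA256 digest with an expected hex string, ignoring case."""
    return file_digests(path)["SHA256"] == expected_hex.strip().lower()


if __name__ == "__main__":
    archive = Path("toolri-1.0.2.tar.gz")  # assumed download location
    expected = "e47bb1af341be1f0cc18cc9ed0c1e7ba733b34e2beba52b84a581e3a91418939"
    if archive.exists():
        print("SHA256 matches:", sha256_matches(archive, expected))
    else:
        print(f"{archive} not found")
```

pip can perform a similar check itself when a requirements file pins hashes and `--require-hashes` is used.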


File details

Details for the file toolri-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: toolri-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 56.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.16

File hashes

Hashes for toolri-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 c793793c9c1965324702fd88bdedf7e2b809a8e1ee5a5b3cc9c2636f8e644008
MD5 765b810ba4e7245832c81d4b4a41ecee
BLAKE2b-256 f1aa28f482b4c4bb062e66cd0ff11d2b1c9eb9611614fb7d1ee08df220630b5b

