A tool for extracting, labeling and linking entities in document images for Information Extraction tasks.
ToolRI
ToolRI was created to simplify and standardize the creation of samples for Information Extraction tasks on document images. The tool supports text extraction by OCR, the creation of document entities, and their labeling and linking. The project is written purely in Python and can be run on any desktop platform. The graphical user interface is built with the CustomTkinter library.
Installation
PyPI
Install the ToolRI package with pip:
pip install toolri
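To confirm that the installation succeeded, the standard library can report the installed version. The following is a minimal sketch; the helper name `installed_version` is hypothetical and not part of ToolRI.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "toolri" is the distribution name used in the pip command above.
    version = installed_version("toolri")
    print(f"toolri {version}" if version else "toolri is not installed")
```

The check reads package metadata only, so it does not import ToolRI or open the graphical interface.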
Source
Clone the ToolRI repository with:
git clone https://github.com/Victorgonl/ToolRI
Then install it using pip:
pip install ./ToolRI/
Standalone
Download
You can download a portable binary of the tool and start using it right away. Download and run a version from the releases page of the ToolRI repository.
Build
To build the standalone version of ToolRI into a portable binary, clone the repository:
git clone https://github.com/Victorgonl/ToolRI
Change the current directory to ./ToolRI:
cd ./ToolRI
Install all the dependencies listed in requirements.txt:
pip install -r requirements.txt
And run the script toolri_build.py:
python3 toolri_build.py
The binary will be available in the dist folder.
Documentation
Under construction.
Tesseract OCR
To be able to use the OCR function in ToolRI, Tesseract OCR must be installed separately.
For now, OCR is configured for English and Portuguese only. Support for all languages available in Tesseract is planned.
Debian-based
Use the command:
sudo apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-por
Windows
- Download and run the installer available at https://github.com/UB-Mannheim/tesseract/wiki.
- Make sure to install Tesseract in C:\Program Files\Tesseract-OCR\ (the default directory), because the current ToolRI version relies on this predefined location.
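Before launching ToolRI, it can help to check that Tesseract is reachable and that the English and Portuguese language data are present. The sketch below is not part of ToolRI: the helper names are hypothetical, it assumes the executable is called `tesseract` (or `tesseract.exe` in the default Windows directory above), and it relies on Tesseract's `--list-langs` option.

```python
import os
import shutil
import subprocess
from typing import List, Optional, Sequence

# Default Windows location named in this README; the file name is an assumption.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract() -> Optional[str]:
    """Return the path of the Tesseract executable, or None if it cannot be found."""
    found = shutil.which("tesseract")
    if found:
        return found
    if os.path.isfile(WINDOWS_DEFAULT):
        return WINDOWS_DEFAULT
    return None


def parse_langs(output: str) -> List[str]:
    """Extract language codes from `tesseract --list-langs` output, skipping the header line."""
    lines = [line.strip() for line in output.splitlines()]
    return [line for line in lines if line and not line.endswith(":")]


def missing_languages(installed: Sequence[str],
                      required: Sequence[str] = ("eng", "por")) -> List[str]:
    """Return the required language codes that are not installed."""
    return [lang for lang in required if lang not in installed]


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("Tesseract was not found")
    else:
        result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
        # Some Tesseract versions write the language list to stderr.
        langs = parse_langs(result.stdout or result.stderr)
        print("Missing languages:", missing_languages(langs) or "none")
```

If a language is reported as missing on a Debian-based system, installing the corresponding `tesseract-ocr-eng` or `tesseract-ocr-por` package from the command above should provide it.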
Usage
ToolRI was developed and used to create the UFLA-FORMS dataset. Download the dataset to try the tool on the available samples, or create new metadata for any document image you have.
Example
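import toolri

# Load the document image to annotate.
image = toolri.load_image("document_image.jpg")

# Two labels; QUESTION declares a link to ANSWER.
labels = [
    toolri.ToolRILabel(name="QUESTION", color="#004B80", links=["ANSWER"], is_visible=True),
    toolri.ToolRILabel(name="ANSWER", color="#00943E", links=[], is_visible=True),
]

# NOTE: `data` is passed in before being assigned in this snippet, so it must be defined beforehand.
data = toolri.toolri(image=image, data=data, labels=labels)
# Draw the resulting data on the image and display it.
toolri.draw_data_on_image(image=image, data=data, labels=labels).show()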
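When a label set grows, a misspelled name in `links` is easy to miss. The sketch below checks that every link target is a defined label. It assumes that `links` holds the names of other labels, as the example above suggests. `LabelSpec` is a hypothetical stand-in that only mirrors the keyword arguments passed to `ToolRILabel`; it is not ToolRI's own class.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LabelSpec:
    """Hypothetical stand-in with the same fields as the ToolRILabel calls above."""
    name: str
    color: str
    links: List[str] = field(default_factory=list)
    is_visible: bool = True


def undefined_links(labels: List[LabelSpec]) -> List[Tuple[str, str]]:
    """Return (label name, link target) pairs whose target is not a defined label."""
    names = {label.name for label in labels}
    return [
        (label.name, target)
        for label in labels
        for target in label.links
        if target not in names
    ]


if __name__ == "__main__":
    specs = [
        LabelSpec(name="QUESTION", color="#004B80", links=["ANSWER"]),
        LabelSpec(name="ANSWER", color="#00943E"),
    ]
    # An empty list means every link points at a defined label.
    print(undefined_links(specs))
```

Running the check on plain data before constructing the real labels keeps it independent of the graphical interface.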