Tools to work with the DocILE dataset and benchmark
DocILE: Document Information Localization and Extraction Benchmark
Repository to work with the DocILE dataset and benchmark, used in the DocILE'23 CLEF Lab and ICDAR Competition. The competition runs until May 10, 2023 and comes with a $9000 prize pool.
The repository consists of:
- A python library, `docile`, making it easy to load the dataset, work with its annotations, pdfs and pre-computed OCR and run the evaluation.
- An interactive dataset browser notebook to visualize the document annotations, predictions and evaluation results.
- Baseline methods (will appear soon).
Table of Contents:
- Download the dataset
- Installation
- Predictions format and running evaluation
- Pre-computed OCR
- Development instructions
- Dataset and benchmark paper
Also check the Tutorials to quickly get started with the repo.
Download the dataset
First you need to obtain a secret token by following the instructions at https://docile.rossum.ai/. Then download and unzip the dataset by running:
./download_dataset.sh TOKEN annotated-trainval data/docile --unzip
./download_dataset.sh TOKEN synthetic data/docile --unzip
./download_dataset.sh TOKEN unlabeled data/docile --unzip
Run `./download_dataset.sh --help` for more options, including how to only show the urls (to download with a different tool than curl), how to download smaller unlabeled/synthetic chunks, or how to download the unlabeled dataset without pdfs (with pre-computed OCR only).
You can also work with the zipped datasets when you turn off image caching (check Load and sample dataset tutorial for details).
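For illustration, loading a split with the `docile` library (installed below) could look like the following sketch. The `Dataset` class is part of the library; the commented-out caching option (`CachingConfig` / `cache_images`) is an assumption based on the Load and sample dataset tutorial, so check the tutorial for the exact argument names.

```python
from docile.dataset import Dataset

# Load the validation split from the unzipped dataset directory.
dataset = Dataset("val", "data/docile")
print(len(dataset))  # number of documents in the split

# Working with the zipped dataset should be possible with image caching
# turned off -- the option below is an assumption taken from the tutorial.
# from docile.dataset import CachingConfig
# dataset = Dataset("val", "data/docile.zip", cache_images=CachingConfig.OFF)
```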
Installation
Option 1: Install as a library
Install the library with:
pip install docile-benchmark
To convert pdfs into images, the library uses https://github.com/Belval/pdf2image. On Linux you might need to install:
apt install poppler-utils
And on macOS:
brew install poppler
Now you have all dependencies to work with the dataset annotations, pdfs, pre-computed OCR and to run the evaluation. You can install extra dependencies by running the following (although using one of the provided dockerfiles, as explained below, might be easier in this case):
pip install "docile-benchmark[interactive]"
pip install "docile-benchmark[ocr]"
The first line installs additional dependencies allowing you to use the interactive dataset browser in `docile/tools/dataset_browser.py` and the tutorials. The second line lets you rerun the OCR predictions from scratch (e.g., if you'd like to run it with different parameters), but to make it work you might need additional dependencies on your system. Check https://github.com/mindee/doctr for the installation instructions (for pytorch).
Option 2: Use docker
There are two Dockerfiles available with the dependencies preinstalled:
- `Dockerfile` is a lighter, CPU-only version with all necessary dependencies to use the dataset with the pre-computed OCR and the interactive browser.
- `Dockerfile.gpu` has CUDA GPU support and contains additional dependencies needed to recompute OCR predictions from scratch (not needed for standard usage).
You can use docker compose to manage the docker images. First update the settings in `docker-compose.yml` and the port for jupyter lab in `.env`. Then build the image with:
docker compose build jupyter[-gpu]
where `jupyter` uses `Dockerfile` and `jupyter-gpu` uses `Dockerfile.gpu`. You can then start the jupyter server:
docker compose up -d jupyter[-gpu]
Jupyter lab can then be accessed at `https://127.0.0.1:${JUPYTER_PORT}` (retrieve the token from the logs with `docker compose logs jupyter[-gpu]`). You can also log in to the container with:
docker compose exec jupyter bash
After that, run `poetry shell` to activate the virtual environment with the `docile` library and its dependencies installed.
Predictions format and running evaluation
To evaluate predictions for tasks KILE or LIR, use the following command:
docile_evaluate \
--task LIR \
--dataset-path path/to/dataset[.zip] \
--split val \
--predictions path/to/predictions.json \
--evaluate-x-shot-subsets "0,1-3,4+" \ # default, show evaluation for 0-shot, few-shot and many-shot layout clusters
--evaluate-synthetic-subsets \ # optional, show evaluation on layout clusters with available synthetic data
--evaluate-fieldtypes \ # optional, show breakdown per fieldtype
--evaluate-also-text \ # optional
--store-evaluation-result LIR_val_eval.json # optional, it can be loaded in the dataset browser
Run `docile_evaluate --help` for more information on the options. You can also run `docile_print_evaluation_report --evaluation-result-path LIR_val_eval.json` to print the results of a previously computed evaluation.
Predictions need to be stored in a single json file (for each task separately) containing a mapping from `docid` to the predictions for that document, i.e.:
{
"docid1": [
{
"page": 0,
"bbox": [0.2, 0.1, 0.4, 0.5],
"fieldtype": "line_item_order_id",
"line_item_id": 3,
"score": 0.8,
"text": "Order 38",
"use_only_for_ap": true
},
"..."
],
"docid2": [{"...": "..."}, "..."],
"..."
}
Explanation of the individual fields of the predictions:
- `page`: page index (from zero) the prediction belongs to
- `bbox`: relative coordinates (from 0 to 1) representing the `left`, `top`, `right`, `bottom` sides of the bbox, respectively
- `fieldtype`: the fieldtype (sometimes called category or key) of the prediction
- `line_item_id`: ID of the line item. This should be a different number for each line item, the order does not matter. Omit for KILE predictions.
- `score` [optional]: the confidence for this prediction, can be omitted (in that case predictions are taken in the order in which they are stored in the list)
- `text` [optional]: text of the prediction, evaluated in a secondary metric only (when `--evaluate-also-text` is used)
- `use_only_for_ap` [optional, default is False]: only use the prediction for the AP metric computation, not for f1, precision and recall (useful for less confident predictions).
You can use `docile.dataset.store_predictions` to store predictions represented with the `docile.dataset.Field` class to a json file with the required format.
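As an illustration, building and storing predictions in this format might look like the sketch below. The exact constructor arguments of `Field` and `BBox` are assumptions inferred from the prediction fields described above, so check the library docstrings for the precise signatures.

```python
from pathlib import Path

from docile.dataset import BBox, Field, store_predictions

# NOTE: the argument names below are assumptions based on the prediction
# fields described above; check docile.dataset for the exact signatures.
predictions = {
    "docid1": [
        Field(
            bbox=BBox(0.2, 0.1, 0.4, 0.5),  # relative left, top, right, bottom
            page=0,
            fieldtype="line_item_order_id",
            line_item_id=3,  # omit for KILE predictions
            score=0.8,
            text="Order 38",
        )
    ]
}

# Store the predictions to a json file in the format expected by docile_evaluate.
store_predictions(Path("predictions.json"), predictions)
```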
Pre-computed OCR
Pre-computed OCR is provided with the dataset. The predictions were generated with the DocTR library. On top of that, word boxes were snapped to text (check the code in `docile/dataset/document_ocr.py`). These snapped word boxes are used in the evaluation (description of the evaluation is coming soon).
While this should not be needed, it is possible to (re)generate OCR from scratch (including the snapping) with the provided `Dockerfile.gpu`. Just delete the `DATASET_PATH/ocr` directory and then access the ocr for each document and page with `doc.ocr.get_all_words(page, snapped=True)`.
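For reference, reading the pre-computed (snapped) word boxes could look like this sketch; `doc.ocr.get_all_words(page, snapped=True)` is the accessor mentioned above, while iterating the dataset and the `docid` attribute are assumptions based on the library.

```python
from docile.dataset import Dataset

dataset = Dataset("val", "data/docile")

for doc in dataset:
    # Snapped word boxes are the ones also used by the evaluation.
    words = doc.ocr.get_all_words(0, snapped=True)
    print(doc.docid, len(words))
    break  # just inspect the first document
```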
Development instructions
For development, install poetry and run `poetry install`. Start a shell with the virtual environment activated with `poetry shell`. No other dependencies are needed to run pre-commit and the tests. It's recommended to use docker (as explained above) if you need the extra (interactive or ocr) dependencies.
Install pre-commit with `pre-commit install` (don't forget you need to prepend all commands with `poetry run ...` if you did not run `poetry shell` first).
Run tests by calling `pytest tests`.
Dataset and benchmark paper
The dataset, the benchmark tasks and the evaluation criteria are described in detail in the dataset paper. To cite the dataset, please use the following BibTeX entry:
@misc{simsa2023docile,
title={{DocILE} Benchmark for Document Information Localization and Extraction},
author={{\v{S}}imsa, {\v{S}}t{\v{e}}p{\'a}n and {\v{S}}ulc, Milan and U{\v{r}}i{\v{c}}{\'a}{\v{r}}, Michal and Patel, Yash and Hamdi, Ahmed and Koci{\'a}n, Mat{\v{e}}j and Skalick{\`y}, Maty{\'a}{\v{s}} and Matas, Ji{\v{r}}{\'\i} and Doucet, Antoine and Coustaty, Micka{\"e}l and Karatzas, Dimosthenis},
url = {https://arxiv.org/abs/2302.05658},
journal={arXiv preprint arXiv:2302.05658},
year={2023}
}