
Python API for multilingual legal document classification with EuroVoc descriptors using BERT models.


IMPORTANT NOTE!

This repository is not yet compatible with the latest version of the transformers library. You need to install a version < 4.11.0. We recommend transformers==4.10.3, which is the version we tested with.

We are currently working to migrate the models to the latest version of transformers.
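As a minimal sketch of the constraint above, the following helper checks whether a version string is strictly below 4.11.0. The function names are hypothetical and not part of pyeurovoc; the parsing is deliberately simple and treats pre-releases of 4.11.0 as unsupported.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn a version string such as "4.10.3" into a tuple of ints."""
    numbers = []
    for piece in version.split(".")[:3]:
        # Keep only the leading digits of each component ("0rc1" -> 0).
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def is_supported_transformers(version: str) -> bool:
    """Return True if the version is strictly below 4.11.0."""
    return parse_version(version) < (4, 11, 0)


# Possible usage, assuming transformers is installed:
#   import transformers
#   print(is_supported_transformers(transformers.__version__))
print(is_supported_transformers("4.10.3"))
```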

EuroVoc-BERT

PyEuroVoc is a tool for legal document classification with EuroVoc descriptors. It supports 22 languages: Bulgarian (bg), Czech (cs), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Finnish (fi), French (fr), Hungarian (hu), Italian (it), Lithuanian (lt), Latvian (lv), Maltese (mt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Swedish (sv).
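The language codes listed above can be kept in a small lookup table to validate user input before a model is loaded. This is only an illustrative sketch: SUPPORTED_LANGUAGES and check_language are hypothetical names and are not provided by pyeurovoc.

```python
# Language codes and names copied from the list above.
SUPPORTED_LANGUAGES = {
    "bg": "Bulgarian", "cs": "Czech", "da": "Danish", "de": "German",
    "el": "Greek", "en": "English", "es": "Spanish", "et": "Estonian",
    "fi": "Finnish", "fr": "French", "hu": "Hungarian", "it": "Italian",
    "lt": "Lithuanian", "lv": "Latvian", "mt": "Maltese", "nl": "Dutch",
    "pl": "Polish", "pt": "Portuguese", "ro": "Romanian", "sk": "Slovak",
    "sl": "Slovenian", "sv": "Swedish",
}


def check_language(lang: str) -> str:
    """Normalize a language code and raise ValueError if it is not listed."""
    code = lang.strip().lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {lang!r}")
    return code


print(check_language("RO"), SUPPORTED_LANGUAGES[check_language("RO")])
```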

The tool uses BERT at its core. The list of BERT variants for each language can be found here. The performance of each model is outlined in our paper.

Installation

Make sure you have Python 3 and a suitable PyTorch package installed. Then, install pyeurovoc with pip:

pip install pyeurovoc

Usage

Import the EuroVocBERT class from pyeurovoc. Instantiate the class with the desired language (default is "en") and then pass the document text to the model.

from pyeurovoc import EuroVocBERT

model = EuroVocBERT(lang="en")
prediction = model("Commission Decision on a modification of the system of aid applied in Italy in respect of shipbuilding")

The model's prediction is a dictionary that maps the predicted descriptor IDs (keys) to their confidence scores (values).

{'1519': 0.9990228414535522, '889': 0.9199628829956055, '155': 0.8993383646011353, '5541': 0.6949614882469177, '365': 0.03358528017997742, '431': 0.03317515179514885}
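Since the result is a plain dictionary, low-confidence descriptors can be dropped with ordinary Python. The sketch below reuses the example output above; filter_by_threshold and the 0.5 cutoff are illustrative assumptions, not part of the pyeurovoc API.

```python
# Example prediction copied from the output above.
prediction = {
    '1519': 0.9990228414535522,
    '889': 0.9199628829956055,
    '155': 0.8993383646011353,
    '5541': 0.6949614882469177,
    '365': 0.03358528017997742,
    '431': 0.03317515179514885,
}


def filter_by_threshold(prediction: dict, threshold: float = 0.5) -> dict:
    """Keep only descriptors whose confidence is at least `threshold`."""
    return {desc: score for desc, score in prediction.items() if score >= threshold}


confident = filter_by_threshold(prediction)
print(list(confident))
```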

The number of most probable labels returned by the model is controlled by the num_labels parameter (default is 6).

prediction = model("Commission Decision on a modification of the system of aid applied in Italy in respect of shipbuilding", num_labels=4)

This outputs:

{'1519': 0.9990228414535522, '889': 0.9199628829956055, '155': 0.8993383646011353, '5541': 0.6949614882469177}
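Conceptually, limiting the output to the most probable labels amounts to a top-k selection over the confidence scores. The sketch below illustrates that idea on the example scores; it is not the library's internal implementation, and top_k is a hypothetical helper.

```python
def top_k(scores: dict, k: int) -> dict:
    """Return the k highest-scoring entries, ordered by descending score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


# Scores copied from the six-label example earlier in this section.
scores = {
    '1519': 0.9990228414535522,
    '889': 0.9199628829956055,
    '155': 0.8993383646011353,
    '5541': 0.6949614882469177,
    '365': 0.03358528017997742,
    '431': 0.03317515179514885,
}

print(top_k(scores, 4))
```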

Training your own models

Download Dataset

First, download the datasets using the download_datasets.sh script in the data directory.

./download_datasets.sh

Preprocess

Once the datasets have finished downloading, you need to preprocess them using the preprocess.py script. It takes as input the per-language model configuration file and the path to the dataset.

python preprocess.py --config [model_config] --data_path [dataset_path]
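When several languages need preprocessing, the command above can be generated once per configuration file. The following is a rough sketch: the configuration file names and the dataset path are placeholders, and only the --config and --data_path flags shown above are used.

```python
def build_preprocess_commands(config_paths, data_path):
    """Build one preprocess.py argument list per configuration file."""
    return [
        ["python", "preprocess.py", "--config", config, "--data_path", data_path]
        for config in config_paths
    ]


# Placeholder paths; substitute your own configuration files and dataset path.
commands = build_preprocess_commands(["config_en.json", "config_ro.json"], "dataset_dir")

for command in commands:
    # To actually run a command: subprocess.run(command, check=True)
    print(" ".join(command))
```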

Train

Training is done using the train.py script. It automatically loads the preprocessed files created in the previous step and saves the best model for each split at the path given by the --save_path argument. To view the full list of available arguments, run python train.py --help.

python train.py --config [model_config] --data_path [dataset_path] \
                --epochs [n_epochs] --batch_size [batch_size] \
                --max_grad_norm [max_grad_norm] \
                --device [device] \
                --save_path [model_save_path] \
                --logging_step [logging_step] \
                --verbose [verbose]
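For scripted experiments, the training command can be assembled from a set of options instead of being typed by hand. This is a sketch under assumptions: build_train_command is a hypothetical helper, all paths and values are placeholders, and only flags listed above are passed.

```python
import shlex


def build_train_command(config, data_path, **options):
    """Build the train.py argument list; each keyword becomes a --flag value pair."""
    command = ["python", "train.py", "--config", config, "--data_path", data_path]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


# Placeholder values chosen for illustration only.
command = build_train_command(
    "model_config.json",
    "dataset_dir",
    epochs=10,
    batch_size=8,
    device="cuda",
    save_path="models",
)

# Print a shell-safe version; run it with subprocess.run(command, check=True).
print(shlex.join(command))
```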

Evaluate

To evaluate the performance of each model on a split, run the evaluate.py script. As with training, it accepts several arguments, which can be listed with python evaluate.py --help.

python evaluate.py --config [model_config] --mt_labels [mt_labels_path] --data_path [dataset_path] \
                   --models_path [models_ckpt_path] \
                   --batch_size [batch_size] \
                   --device [device] \
                   --output_path [results_output_path] \
                   --loggin_step [logging_step] \
                   --verbose [verbose]

Acknowledgments

This research was supported by the EC grant no. INEA/CEF/ICT/A2017/1565710 for the Action no. 2017-EU-IA-0136 entitled “Multilingual Resources for CEF.AT in the legal domain” (MARCELL).

Credits

Please consider citing the following paper as a thank you to the authors of PyEuroVoc:

Avram, Andrei-Marius, Vasile Pais, and Dan Tufis. "PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors." arXiv preprint arXiv:2108.01139 (2021).

or in BibTeX format:

@article{avram2021pyeurovoc,
  title={PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors},
  author={Avram, Andrei-Marius and Pais, Vasile and Tufis, Dan},
  journal={arXiv preprint arXiv:2108.01139},
  year={2021}
}

