Python API for multilingual legal document classification with EuroVoc descriptors using BERT models.
IMPORTANT NOTE!
This repository is not yet compatible with the latest version of the transformers library. You need to install a version < 4.11.0. We recommend transformers==4.10.3, which is the version we tested with.
We are currently working on migrating the models to the latest version of transformers.
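To fail early on an incompatible installation, the version constraint above can be checked before loading a model. The following is a minimal sketch, not part of pyeurovoc: the helper names parse_version and is_supported_transformers are hypothetical, and the only fact taken from the note above is the < 4.11.0 limit.

```python
def parse_version(version: str) -> tuple:
    """Turn a version such as '4.10.3' into (4, 10, 3).

    Parsing stops at the first component that does not start with a digit,
    so a suffix such as '.dev0' is ignored.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_supported_transformers(version: str) -> bool:
    """Return True when the version is below 4.11.0 (the limit stated in the note)."""
    return parse_version(version) < (4, 11, 0)


print(is_supported_transformers("4.10.3"))  # the recommended version
print(is_supported_transformers("4.11.0"))  # first version outside the limit
```

In a real environment the argument would typically be transformers.__version__, read after importing the transformers package.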
EuroVoc-BERT
PyEuroVoc is a tool for legal document classification with EuroVoc descriptors. It supports 22 languages: Bulgarian (bg), Czech (cs), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Finnish (fi), French (fr), Hungarian (hu), Italian (it), Lithuanian (lt), Latvian (lv), Maltese (mt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Swedish (sv).
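When the language code comes from user input or a configuration file, it can help to validate it against the list above before building a model. This is a small sketch under that assumption: SUPPORTED_LANGUAGES and normalize_lang are hypothetical names introduced here, and only the 22 codes themselves come from the list above.

```python
# The 22 language codes listed above, mapped to their English names.
SUPPORTED_LANGUAGES = {
    "bg": "Bulgarian", "cs": "Czech", "da": "Danish", "de": "German",
    "el": "Greek", "en": "English", "es": "Spanish", "et": "Estonian",
    "fi": "Finnish", "fr": "French", "hu": "Hungarian", "it": "Italian",
    "lt": "Lithuanian", "lv": "Latvian", "mt": "Maltese", "nl": "Dutch",
    "pl": "Polish", "pt": "Portuguese", "ro": "Romanian", "sk": "Slovak",
    "sl": "Slovenian", "sv": "Swedish",
}


def normalize_lang(code: str) -> str:
    """Lower-case and trim a language code, rejecting unsupported ones."""
    cleaned = code.strip().lower()
    if cleaned not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {code!r}; expected one of: {supported}")
    return cleaned


print(normalize_lang(" RO "))  # ro
```

The returned code can then be passed as the lang argument when instantiating the model.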
The tool uses BERT at its core. The list of BERT variants used for each language can be found here. The performance of each model is outlined in our paper.
Installation
Make sure you have Python 3 and a suitable PyTorch package installed. Then install pyeurovoc with pip:
pip install pyeurovoc
Usage
Import the EuroVocBERT class from pyeurovoc. Instantiate the class with the desired language (default is "en") and then simply pass a document text to the model.
from pyeurovoc import EuroVocBERT
model = EuroVocBERT(lang="en")
prediction = model("Commission Decision on a modification of the system of aid applied in Italy in respect of shipbuilding")
The prediction of the model is a dictionary that maps each predicted descriptor ID (key) to its confidence score (value).
{'1519': 0.9990228414535522, '889': 0.9199628829956055, '155': 0.8993383646011353, '5541': 0.6949614882469177, '365': 0.03358528017997742, '431': 0.03317515179514885}
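Because the result is a plain dictionary, low-confidence descriptors can be dropped with ordinary Python. The sketch below reuses the example output shown above; filter_by_confidence and the 0.9 threshold are illustrative choices, not part of the pyeurovoc API.

```python
# Example prediction copied from the output above.
prediction = {
    '1519': 0.9990228414535522,
    '889': 0.9199628829956055,
    '155': 0.8993383646011353,
    '5541': 0.6949614882469177,
    '365': 0.03358528017997742,
    '431': 0.03317515179514885,
}


def filter_by_confidence(prediction, threshold=0.5):
    """Keep descriptors scoring at least `threshold`, highest score first."""
    kept = {label: score for label, score in prediction.items() if score >= threshold}
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


confident = filter_by_confidence(prediction, threshold=0.9)
print(confident)
# {'1519': 0.9990228414535522, '889': 0.9199628829956055}
```

A threshold keeps a variable number of labels per document, which may suit documents that have only one or two clearly relevant descriptors.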
The number of most probable labels returned by the model is controlled by the num_labels parameter (default is 6).
prediction = model("Commission Decision on a modification of the system of aid applied in Italy in respect of shipbuilding", num_labels=4)
Which outputs:
{'1519': 0.9990228414535522, '889': 0.9199628829956055, '155': 0.8993383646011353, '5541': 0.6949614882469177}
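The same call pattern extends to several documents. The following sketch assumes only what is shown above, namely that the model is a callable taking a text and an optional num_labels and returning a dictionary of scores; classify_documents and top_label are hypothetical helpers, not part of pyeurovoc.

```python
from typing import Callable, Dict, Iterable, List


def classify_documents(
    model: Callable[..., Dict[str, float]],
    documents: Iterable[str],
    num_labels: int = 6,
) -> List[Dict[str, float]]:
    """Call the model once per document and collect the predictions in order."""
    return [model(text, num_labels=num_labels) for text in documents]


def top_label(prediction: Dict[str, float]) -> str:
    """Return the descriptor ID with the highest confidence score."""
    return max(prediction, key=prediction.get)
```

With a model created as in the usage example, classify_documents(model, texts, num_labels=4) would return one dictionary per text, and top_label picks the single most confident descriptor from each.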
Training your own models
Download Dataset
First, you need to download the datasets. Use the download_datasets.sh script in the data directory to do that.
./download_datasets.sh
Preprocess
Once the datasets have finished downloading, you need to preprocess them using the preprocess.py script. It takes as input the per-language model configuration file and the path to the dataset.
python preprocess.py --config [model_config] --data_path [dataset_path]
Train
Training is done using the train.py script. It automatically loads the preprocessed files created by the previous step and saves the best model for each split at the path given by the --save_path argument. To view the full list of available arguments, run python train.py --help.
python train.py --config [model_config] --data_path [dataset_path] \
    --epochs [n_epochs] --batch_size [batch_size] \
    --max_grad_norm [max_grad_norm] \
    --device [device] \
    --save_path [model_save_path] \
    --logging_step [logging_step] \
    --verbose [verbose]
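When training several languages in a row, the command line above can be assembled from Python. This is a hedged sketch: build_command is a hypothetical helper, the flag names are the ones documented above, and every value (paths, epoch count, batch size) is a placeholder rather than a recommended setting.

```python
import sys
from typing import Dict, List


def build_command(script: str, options: Dict[str, object]) -> List[str]:
    """Build an argument list such as [python, script, --name, value, ...]."""
    command = [sys.executable, script]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


# Placeholder values only; replace them with your own configuration and paths.
train_command = build_command(
    "train.py",
    {
        "config": "path/to/model_config",
        "data_path": "path/to/dataset",
        "epochs": 10,
        "batch_size": 8,
        "save_path": "path/to/models",
    },
)
print(" ".join(train_command))

# To launch the training run from the repository root:
# import subprocess
# subprocess.run(train_command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.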
Evaluate
To evaluate the performance of each model on a split, run the evaluate.py script. As with training, it accepts several arguments, which can be listed with python evaluate.py --help.
python evaluate.py --config [model_config] --mt_labels [mt_labels_path] --data_path [dataset_path] \
    --models_path [models_ckpt_path] \
    --batch_size [batch_size] \
    --device [device] \
    --output_path [results_output_path] \
    --logging_step [logging_step] \
    --verbose [verbose]
Acknowledgments
This research was supported by the EC grant no. INEA/CEF/ICT/A2017/1565710 for the Action no. 2017-EU-IA-0136 entitled “Multilingual Resources for CEF.AT in the legal domain” (MARCELL).
Credits
Please consider citing the following paper as a thank you to the authors of PyEuroVoc:
Avram, Andrei-Marius, Vasile Pais, and Dan Tufis. "PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors." arXiv preprint arXiv:2108.01139 (2021).
or in BibTeX format:
@article{avram2021pyeurovoc,
title={PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors},
author={Avram, Andrei-Marius and Pais, Vasile and Tufis, Dan},
journal={arXiv preprint arXiv:2108.01139},
year={2021}
}