Python API for multilingual legal document classification with EuroVoc descriptors using BERT models.

EuroVoc-BERT

PyEuroVoc is a tool for legal document classification with EuroVoc descriptors. It supports 22 languages: Bulgarian (bg), Czech (cs), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Finnish (fi), French (fr), Hungarian (hu), Italian (it), Lithuanian (lt), Latvian (lv), Maltese (mt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Swedish (sv).
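
As a small convenience, the language codes listed above can be collected into a set and used to validate user input before a model is constructed. This is only a sketch: the helper `normalize_lang` and the constant `SUPPORTED_LANGS` are hypothetical names and are not part of the pyeurovoc API.

```python
# Language codes copied from the list above; the helper itself is hypothetical
# and not part of the pyeurovoc API.
SUPPORTED_LANGS = {
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
}


def normalize_lang(code: str) -> str:
    """Return a lower-cased, validated language code or raise ValueError."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGS:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalized
```

The returned code could then be passed as the `lang` argument described in the Usage section.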

The tool uses BERT at its core. The list of BERT variants used for each language can be found here. The performance of each model is outlined in our paper.

Installation

Make sure you have Python 3 and a suitable PyTorch package installed. Then install pyeurovoc with pip:

pip install pyeurovoc

Usage

Import the EuroVocBERT class from pyeurovoc. Instantiate the class with the desired language (default is "en"), then pass the text of a document to the model.

from pyeurovoc import EuroVocBERT

model = EuroVocBERT(lang="en")
prediction = model("Commission Decision on a modification of the system of aid applied in Italy in respect of shipbuilding")

The model's prediction is a dictionary that maps each predicted descriptor ID to its confidence score.

{'155': 0.9990228414535522, '365': 0.9199643731117249, '431': 0.8993396759033203, '889': 0.6949650645256042, '1519': 0.03358537331223488, '5541': 0.03317505866289139}
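
Because the prediction is a plain dictionary, it can be post-processed with ordinary Python. The sketch below keeps only the descriptors whose score reaches a cutoff. The function name `filter_by_confidence` and the 0.5 threshold are illustrative assumptions, not something pyeurovoc provides or recommends.

```python
# Sample output copied from the example above.
prediction = {
    "155": 0.9990228414535522,
    "365": 0.9199643731117249,
    "431": 0.8993396759033203,
    "889": 0.6949650645256042,
    "1519": 0.03358537331223488,
    "5541": 0.03317505866289139,
}


def filter_by_confidence(scores: dict, threshold: float = 0.5) -> list:
    """Return descriptor IDs scoring at least `threshold`, highest first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in kept]


confident_ids = filter_by_confidence(prediction)
print(confident_ids)  # ['155', '365', '431', '889']
```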

The number of most probable labels returned by the model is controlled by the num_labels parameter (default is 6).

prediction = model("Commission Decision on a modification of the system of aid applied in Italy in respect of shipbuilding", num_labels=4)

This outputs:

{'155': 0.9990228414535522, '365': 0.9199643731117249, '431': 0.8993396759033203, '889': 0.6949650645256042}
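
To classify several documents with the same settings, the call shown above can be wrapped in a loop. The following is a minimal sketch that assumes only the call signature used in this section, `model(text, num_labels=...)`. The helper `classify_many` is a hypothetical name and not part of pyeurovoc.

```python
from typing import Callable, Dict, Iterable, List


def classify_many(
    model: Callable[..., Dict[str, float]],
    documents: Iterable[str],
    num_labels: int = 6,
) -> List[Dict[str, float]]:
    """Call `model` once per document and collect the predictions in order."""
    return [model(text, num_labels=num_labels) for text in documents]
```

With `model = EuroVocBERT(lang="en")`, calling `classify_many(model, texts, num_labels=4)` would return one prediction dictionary per entry in `texts`.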

Training your own models

Download Dataset

First, download the datasets using the download_datasets.sh script in the data directory.

./download_datasets.sh

Preprocess

Once the datasets have finished downloading, preprocess them with the preprocess.py script. It takes as input the per-language model configuration file and the path to the dataset.

python preprocess.py --config [model_config] --data_path [dataset_path]

Train

Training is done using the train.py script. It automatically loads the preprocessed files created in the previous step and saves the best model for each split at the path given by the --save_path argument. To view the full list of available arguments, run python train.py --help.

python train.py --config [model_config] --data_path [dataset_path] \
                --epochs [n_epochs] --batch_size [batch_size] \
                --max_grad_norm [max_grad_norm] \
                --device [device] \
                --save_path [model_save_path] \
                --logging_step [logging_step] \
                --verbose [verbose]
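
If you prefer to launch training from Python, for example to loop over several configurations, the command line can be assembled from a dictionary of options. The sketch below uses only flags listed above. The helper `build_command` is a hypothetical name, and every path and value is a placeholder to be replaced with your own.

```python
import shlex
import sys
from typing import Dict, List


def build_command(script: str, options: Dict[str, object]) -> List[str]:
    """Turn {'epochs': 3} into [python, script, '--epochs', '3']."""
    command = [sys.executable, script]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


# All values below are placeholders; replace them with your own paths and settings.
train_command = build_command(
    "train.py",
    {
        "config": "path/to/model_config",
        "data_path": "path/to/dataset",
        "epochs": 10,
        "batch_size": 8,
        "save_path": "path/to/models",
    },
)
print(shlex.join(train_command))
# To launch training: subprocess.run(train_command, check=True)
```

Passing the command as a list avoids shell quoting issues when paths contain spaces.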

Evaluate

To evaluate the performance of each model on a split, run the evaluate.py script. As with training, it accepts several arguments, which can be listed with python evaluate.py --help.

python evaluate.py --config [model_config] --mt_labels [mt_labels_path] --data_path [dataset_path] \
                   --models_path [models_ckpt_path] \
                   --batch_size [batch_size] \
                   --device [device] \
                   --output_path [results_output_path] \
                   --logging_step [logging_step] \
                   --verbose [verbose]

Acknowledgments

This research was supported by the EC grant no. INEA/CEF/ICT/A2017/1565710 for the Action no. 2017-EU-IA-0136 entitled “Multilingual Resources for CEF.AT in the legal domain” (MARCELL).

Credits

Please consider citing the following paper as a thank you to the authors of PyEuroVoc:

Avram, Andrei-Marius, Vasile Pais, and Dan Tufis. "PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors." arXiv preprint arXiv:2108.01139 (2021).

or in BibTeX format:

@article{avram2021pyeurovoc,
  title={PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors},
  author={Avram, Andrei-Marius and Pais, Vasile and Tufis, Dan},
  journal={arXiv preprint arXiv:2108.01139},
  year={2021}
}

