Python API for multilingual legal document classification with EuroVoc descriptors using BERT models.

Project description

EuroVoc-BERT

PyEuroVoc is a tool for legal document classification with EuroVoc descriptors. It supports 22 languages: Bulgarian (bg), Czech (cs), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Finnish (fi), French (fr), Hungarian (hu), Italian (it), Lithuanian (lt), Latvian (lv), Maltese (mt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Swedish (sv).

The tool uses BERT at its core. The list of BERT variants for each language can be found here. The performance of each model is outlined in our paper.

Installation

Make sure you have Python 3 and a PyTorch build suitable for your system installed. Then install pyeurovoc with pip:

pip install pyeurovoc
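
One way to confirm that both packages are visible to the interpreter is a quick import check. The snippet below is a sketch using only the standard library; the helper name is_installed is illustrative, and it assumes PyTorch is importable under its usual module name torch.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found without importing it."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the two packages the installation step mentions.
for name in ("torch", "pyeurovoc"):
    print(name, "installed" if is_installed(name) else "missing")
```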

Usage

Import the EuroVocBERT class from pyeurovoc. Instantiate the class with the desired language (default is "en"), then pass a document text to the model.

from pyeurovoc import EuroVocBERT

model = EuroVocBERT(lang="en")
prediction = model("Commission Decision on a modification of the system of aid applied in Italy in respect of shipbuilding")

The model's prediction is a dictionary whose keys are the predicted descriptor IDs and whose values are the corresponding confidence scores.

{'155': 0.9990228414535522, '365': 0.9199643731117249, '431': 0.8993396759033203, '889': 0.6949650645256042, '1519': 0.03358537331223488, '5541': 0.03317505866289139}
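
Because the result is a plain dictionary, it can be post-processed with ordinary Python. The sketch below is an illustration only: the helper name filter_by_confidence and the 0.5 cut-off are assumptions, not part of pyeurovoc. It keeps the descriptors whose score reaches the threshold, using the example output above as input.

```python
# Illustrative helper (not part of pyeurovoc): keep only confident descriptors.
def filter_by_confidence(prediction, threshold=0.5):
    """Return the descriptor IDs and scores whose score is >= threshold."""
    return {label: score for label, score in prediction.items() if score >= threshold}


# Example output copied from the usage section.
prediction = {
    "155": 0.9990228414535522,
    "365": 0.9199643731117249,
    "431": 0.8993396759033203,
    "889": 0.6949650645256042,
    "1519": 0.03358537331223488,
    "5541": 0.03317505866289139,
}

confident = filter_by_confidence(prediction, threshold=0.5)
print(list(confident))  # descriptor IDs that passed the threshold
```

The threshold is a matter of taste; a stricter value trades recall for precision.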

The number of most probable labels returned by the model is controlled by the num_labels parameter (default is 6).

prediction = model("Commission Decision on a modification of the system of aid applied in Italy in respect of shipbuilding", num_labels=4)

This outputs:

{'155': 0.9990228414535522, '365': 0.9199643731117249, '431': 0.8993396759033203, '889': 0.6949650645256042}
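
The two example outputs suggest that num_labels simply keeps the highest-scoring descriptors. Assuming that is the case, the same truncation can be reproduced on an existing prediction without calling the model again. The helper top_labels below is hypothetical and only illustrates that relationship.

```python
# Illustrative helper (not part of pyeurovoc): keep the k highest-scoring labels.
def top_labels(prediction, k):
    """Return a dict with the k entries of `prediction` that have the highest scores."""
    ranked = sorted(prediction.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


# Default (six-label) output copied from the usage section.
six_labels = {
    "155": 0.9990228414535522,
    "365": 0.9199643731117249,
    "431": 0.8993396759033203,
    "889": 0.6949650645256042,
    "1519": 0.03358537331223488,
    "5541": 0.03317505866289139,
}

four_labels = top_labels(six_labels, 4)
print(four_labels)
```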

Training your own models

Download Dataset

First, download the datasets by running the download_datasets.sh script in the data directory.

./download_datasets.sh

Preprocess

Once the datasets have finished downloading, preprocess them with the preprocess.py script. It takes as input the per-language model configuration file and the path to the dataset.

python preprocess.py --config [model_config] --data_path [dataset_path]

Train

Training is done using the train.py script. It automatically loads the preprocessed files created by the previous step and saves the best model for each split at the path given by the --save_path argument. To view the full list of available arguments, run python train.py --help.

python train.py --config [model_config] --data_path [dataset_path] \
                --epochs [n_epochs] --batch_size [batch_size] \
                --max_grad_norm [max_grad_norm] \
                --device [device] \
                --save_path [model_save_path] \
                --logging_step [logging_step] \
                --verbose [verbose]
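
When training several languages in a row, it can be convenient to assemble the command in Python rather than typing it by hand. The following is a minimal sketch: the helper build_train_command, the file paths, and the option values are all hypothetical, and only flag names listed in the command above are used.

```python
import shlex
import sys


def build_train_command(config, data_path, save_path, **options):
    """Build the argument list for train.py; extra options become --name value pairs."""
    command = [
        sys.executable, "train.py",
        "--config", config,
        "--data_path", data_path,
        "--save_path", save_path,
    ]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


# Placeholder paths and values, chosen only for illustration.
command = build_train_command(
    "configs/en.json", "data/eurovoc", "models/",
    epochs=10, batch_size=8, device="cuda",
)
print(shlex.join(command))
# To launch training: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.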

Evaluate

To evaluate the performance of each model on a split, run the evaluate.py script. Like the training script, it accepts several arguments, which you can list with python evaluate.py --help.

python evaluate.py --config [model_config] --mt_labels [mt_labels_path] --data_path [dataset_path] \
                   --models_path [models_ckpt_path] \
                   --batch_size [batch_size] \
                   --device [device] \
                   --output_path [results_output_path] \
                   --loggin_step [logging_step] \
                   --verbose [verbose]

Acknowledgments

This research was supported by the EC grant no. INEA/CEF/ICT/A2017/1565710 for the Action no. 2017-EU-IA-0136 entitled “Multilingual Resources for CEF.AT in the legal domain” (MARCELL).

Credits

Please consider citing the following paper as a thank you to the authors of PyEuroVoc:

Avram, Andrei-Marius, Vasile Pais, and Dan Tufis. "PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors." arXiv preprint arXiv:2108.01139 (2021).

or in BibTeX format:

@article{avram2021pyeurovoc,
  title={PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors},
  author={Avram, Andrei-Marius and Pais, Vasile and Tufis, Dan},
  journal={arXiv preprint arXiv:2108.01139},
  year={2021}
}

Project details

Release history

1.3.0 (Sep 16, 2022)
1.2.1 (Jan 13, 2022)
1.2.0 (Jan 13, 2022)
1.1.1 (Aug 16, 2021)
1.1.0 (Aug 16, 2021), this version
1.0.5 (Aug 16, 2021)
1.0.4 (Aug 15, 2021)
1.0.3 (Aug 9, 2021)
1.0.2 (Aug 9, 2021)
1.0.1 (Aug 3, 2021)
1.0.0 (Aug 3, 2021)
0.2.0 (Aug 1, 2021)
0.1.0 (Jul 31, 2021)
0.0.4 (Jul 30, 2021)
0.0.3 (Jul 30, 2021)
0.0.2 (Jul 30, 2021)
0.0.1 (Jul 30, 2021)
0.0.0 (Jul 30, 2021)

Download files

Source Distribution

pyeurovoc-1.1.0.tar.gz (7.2 kB)

Uploaded Aug 16, 2021 Source

Built Distribution

pyeurovoc-1.1.0-py3-none-any.whl (7.5 kB)

Uploaded Aug 16, 2021 Python 3

File details

Details for the file pyeurovoc-1.1.0.tar.gz.

File metadata

Download URL: pyeurovoc-1.1.0.tar.gz
Upload date: Aug 16, 2021
Size: 7.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.5

File hashes

Hashes for pyeurovoc-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`81250e31d1fc4c2cc6babead64c73852110f4fe3390c957fa07cc8720a752b7a`
MD5	`38558d936e300373313b99b095129de2`
BLAKE2b-256	`f010a4b0961ac9032abebdbc808769e24ab1b5bfc2a2d3055dae8ccb01e981ba`
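
A downloaded archive can be checked against the published SHA256 digest with the standard library alone. The sketch below assumes the file was saved under its original name in the current directory; the helper names sha256_of_file and matches_published_hash are illustrative.

```python
import hashlib

# SHA256 digest published above for pyeurovoc-1.1.0.tar.gz.
EXPECTED_SHA256 = "81250e31d1fc4c2cc6babead64c73852110f4fe3390c957fa07cc8720a752b7a"


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()


# Usage, assuming the archive has been downloaded:
# print(matches_published_hash("pyeurovoc-1.1.0.tar.gz"))
```

The same functions work for the wheel if the wheel's digest is passed as expected.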

File details

Details for the file pyeurovoc-1.1.0-py3-none-any.whl.

File metadata

Download URL: pyeurovoc-1.1.0-py3-none-any.whl
Upload date: Aug 16, 2021
Size: 7.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.5

File hashes

Hashes for pyeurovoc-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1fb229d50ea7d6b9e023d2c28a5488b3d4eb1087c3794a9f8595c69fc7e6df58`
MD5	`78eb5624a99650d5d5bbcfde8cd05b97`
BLAKE2b-256	`c4faff75c98e732782359328db7e6e802ef3cf1077ae2598b31949e16687073b`
