Python API for multilingual legal document classification with EuroVoc descriptors using BERT models.

EuroVoc-BERT

PyEuroVoc is a tool for legal document classification with EuroVoc descriptors. It supports 22 languages: Bulgarian (bg), Czech (cs), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Finnish (fi), French (fr), Hungarian (hu), Italian (it), Lithuanian (lt), Latvian (lv), Maltese (mt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Swedish (sv).

The tool uses BERT at its core. The list of BERT variants for each language can be found here. The performance of each model is outlined in our paper.

Installation

Make sure you have Python 3 installed, along with a suitable PyTorch package for your platform. Then, install pyeurovoc with pip:

pip install pyeurovoc

Usage

Import the EuroVocBERT class from pyeurovoc. Instantiate the class with the desired language (default is "en") and then simply pass a document text to the model.

from pyeurovoc import EuroVocBERT

model = EuroVocBERT(lang="en")
prediction = model("Commission Decision on a modification of the system of aid applied in Italy in respect of shipbuilding")

The prediction of the model is a dictionary that contains the predicted ID descriptors as keys together with their confidence score as values.

{'155': 0.9995473027229309, '230': 0.9377984404563904, '889': 0.9193254113197327, '1519': 0.714003324508667, '5020': 0.5, '5541': 0.5}

The number of most probable labels returned by the model is controlled by the num_labels parameter (default is 6).

prediction = model("Commission Decision on a modification of the system of aid applied in Italy in respect of shipbuilding", num_labels=4)

Which outputs:

{'155': 0.9995473027229309, '230': 0.9377984404563904, '889': 0.9193254113197327, '1519': 0.714003324508667}
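Since the prediction is a plain Python dictionary, it can be post-processed with standard library code, for example to keep only descriptors above a confidence threshold. A minimal sketch (the `confident_labels` helper and the 0.7 threshold are illustrative, not part of the pyeurovoc API; the dictionary below is the sample output shown above):

```python
# Sample output of EuroVocBERT, as shown above: EuroVoc descriptor IDs
# (strings) mapped to confidence scores (floats).
prediction = {
    '155': 0.9995473027229309,
    '230': 0.9377984404563904,
    '889': 0.9193254113197327,
    '1519': 0.714003324508667,
    '5020': 0.5,
    '5541': 0.5,
}

def confident_labels(prediction, threshold=0.7):
    """Return descriptor IDs with confidence >= threshold, highest first."""
    ranked = sorted(prediction.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, score in ranked if score >= threshold]

print(confident_labels(prediction))  # ['155', '230', '889', '1519']
```

Filtering like this is often more robust than fixing num_labels in advance, since low-confidence trailing labels are dropped regardless of how many are returned.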

Training your own models

Download Dataset

First, you need to download the datasets. Use the download_datasets.sh script in the data directory to do that.

./download_datasets.sh

Preprocess

Once the datasets have finished downloading, you need to preprocess them using the preprocess.py script. It takes as input the model-per-language configuration file and the path to the dataset.

python preprocess.py --config [model_config] --data_path [dataset_path]

Train

Training is done using the train.py script. It will automatically load the preprocessed files created by the previous step, and will save the best model for each split at the path given by the --save_path argument. To view the full list of available arguments, run python train.py --help.

python train.py --config [model_config] --data_path [dataset_path] 
                --epochs [n_epochs] --batch_size [batch_size] 
                --max_grad_norm [max_grad_norm]
                --device [device]
                --save_path [model_save_path]
                --logging_step [logging_step]
                --verbose [verbose]

Evaluate

To evaluate the performance of each model on a split, run the evaluate.py script. As in the case of training, it provides several arguments that can be visualized with python evaluate.py --help.

python evaluate.py --config [model_config] --mt_labels [mt_labels_path] --data_path [dataset_path]
                   --models_path [models_ckpt_path] 
                   --batch_size [batch_size]
                   --device [device]
                   --output_path [results_output_path]
                   --logging_step [logging_step]
                   --verbose [verbose]

Credits

Coming soon...
