A Danish-speaking language model with entity-aware self-attention
DaLUKE: The Entity-aware, Danish Language Model
Implementation of the knowledge-enhanced transformer LUKE pretrained on the Danish Wikipedia and evaluated on named entity recognition (NER).
Installation
pip install daluke
To include the optional requirements needed for training and general analysis:
pip install daluke[full]
Python 3.8 or newer is required.
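A quick sketch for verifying this requirement from within Python:
import sys

# DaLUKE requires Python 3.8 or newer
assert sys.version_info >= (3, 8), "Python 3.8+ is required for daluke"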
Explanation
For an explanation of the model, see our bachelor's thesis or the original LUKE paper.
Usage
Inference on simple NER or masked language modeling (MLM) examples
Python
For performing NER predictions
from daluke import AutoNERDaLUKE, predict_ner
daluke = AutoNERDaLUKE()
document = "Det Kgl. Bibliotek forvalter Danmarks største tekstsamling, der strækker sig fra middelalderen til det nyeste litteratur."
iob_list = predict_ner(document, daluke)
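The predicted tags can then be paired with the words of the input. A minimal sketch, assuming iob_list holds one IOB tag per whitespace-separated word of the document:
# Continuing the example above: pair each word with its predicted IOB tag
# (assumption: one tag per whitespace-separated word)
for word, tag in zip(document.split(), iob_list):
    print(f"{word}\t{tag}")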
For testing MLM predictions
from daluke import AutoMLMDaLUKE, predict_mlm
daluke = AutoMLMDaLUKE()
# Empty list => No entity annotations in the string
document = "Professor i astrofysik, [MASK] [MASK], udtaler til avisen, at den nye måling sandsynligvis ikke er en fejl."
best_prediction, table = predict_mlm(document, list(), daluke)
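A minimal sketch of inspecting the result, assuming best_prediction is the document with the [MASK] tokens filled in and table holds the ranked candidate predictions:
# Continuing the example above (assumptions noted in the lead-in):
# print the filled-in document and the candidate ranking
print(best_prediction)
print(table)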
CLI
daluke ner --text "Thomas Delaney fører Danmark til sejr ved EM i fodbold."
daluke masked --text "Slutresultatet af kampen mellem Danmark og Rusland bliver [MASK]-[MASK]."
For Windows, or systems where #!/usr/bin/env python3 is not linked to the correct Python interpreter, the command python -m daluke.api.cli can be used instead of daluke.
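For example, the NER command above then becomes:
python -m daluke.api.cli ner --text "Thomas Delaney fører Danmark til sejr ved EM i fodbold."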
Training DaLUKE yourself
This section shows how to recreate the entire DaLUKE training pipeline, from dataset preparation to fine-tuning. The guide is designed to be run in a bash shell; if you use Windows, you will likely need to modify the shell scripts.
# Download forked luke submodule
git submodule update --init --recursive
# Install requirements
pip install -r requirements.txt
pip install -r optional-requirements.txt
pip install -r luke/requirements.txt
# Build dataset
# The script performs all the steps of building the dataset, including downloading the Danish Wikipedia
# You only need to modify DATA_PATH to where you want the data to be saved
# Be aware that this takes several hours
dev/build_data.sh
# Start pretraining using default hyperparameters
python daluke/pretrain/run.py <INSERT DATA_PATH HERE> -c configs/pretrain-main.ini --name daluke --save-every 5 --epochs 150 --fp16
# Optional: Make plots of pretraining
python daluke/plot/plot_pretraining.py <DATA_PATH>/daluke
# Fine-tune on DaNE
python daluke/collect_modelfile.py <DATA_PATH>/daluke <DATA_PATH>/ner/daluke.tar.gz
python daluke/ner/run.py <DATA_PATH>/ner/daluke -c configs/main-finetune.ini --model <DATA_PATH>/ner/daluke.tar.gz --name finetune --eval
# Evaluate on DaNE test set
python daluke/ner/run_eval.py <DATA_PATH>/ner/daluke/finetune --model <DATA_PATH>/ner/daluke/finetune/daluke_ner_best.tar.gz
# Optional: Fine-tuning plots
python daluke/plot/plot_finetune_ner.py <DATA_PATH>/ner/daluke/finetune/train-results
History
0.0.5
- Added batching to NER forward passes in the Python API
0.0.4
- Added a Python API for maintaining a stateful model and performing contextualized word representation (CWR), MLM, and NER predictions
0.0.3: Finalization of Bachelor's Project
- Allowed specifying entity spans in masked word prediction CLI
0.0.2
- Made the CLI work on Windows
0.0.1
- Simple single-example CLI released