A Danish-speaking language model with entity-aware self-attention

DaLUKE: The Entity-aware, Danish Language Model

Implementation of the knowledge-enhanced transformer LUKE pretrained on the Danish Wikipedia and evaluated on named entity recognition (NER).

Installation

pip install daluke

To also install the optional requirements needed for training and general analysis:

pip install daluke[full]

Python 3.8 or newer is required.

Explanation

For an explanation of the model, see our bachelor's thesis or the original LUKE paper.

Usage

Inference on simple NER or masked language modeling (MLM) examples

Python

For performing NER predictions

from daluke import AutoNERDaLUKE, predict_ner

daluke = AutoNERDaLUKE()

document = "Det Kgl. Bibliotek forvalter Danmarks største tekstsamling, der strækker sig fra middelalderen til det nyeste litteratur."
iob_list = predict_ner(document, daluke)
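
The returned IOB tags can be grouped into entity spans. The following is a minimal sketch, assuming that predict_ner yields one IOB tag (such as B-PER, I-PER or O) per whitespace-separated word of the document; that alignment is an assumption, and iob_to_spans is a hypothetical helper that is not part of the daluke package. The tags below are hand-written for illustration and are not model output.

```python
from typing import List, Optional, Tuple


def iob_to_spans(words: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Group parallel lists of words and IOB tags into (entity text, entity type) pairs.

    Hypothetical helper: assumes one tag per word, e.g. "B-PER", "I-PER", "O".
    """
    spans: List[Tuple[str, Optional[str]]] = []
    current_words: List[str] = []
    current_type: Optional[str] = None
    for word, tag in zip(words, tags):
        prefix, _, entity_type = tag.partition("-")
        # An I- tag of the same type continues the open span
        if prefix == "I" and current_words and current_type == entity_type:
            current_words.append(word)
            continue
        # Any other tag closes the open span, if there is one
        if current_words:
            spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        # B- starts a new span; a stray I- is also treated as a start
        if prefix in ("B", "I"):
            current_words, current_type = [word], entity_type
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


# Hand-labelled illustration, not actual DaLUKE predictions
words = "Thomas Delaney fører Danmark til sejr".split()
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
spans = iob_to_spans(words, tags)
print(spans)
```

With real predictions, the word list would come from splitting the document in the same way as the model does, which should be checked against the actual output length before zipping the two lists.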

For testing MLM predictions

from daluke import AutoMLMDaLUKE, predict_mlm

daluke = AutoMLMDaLUKE()
document = "Professor i astrofysik, [MASK] [MASK], udtaler til avisen, at den nye måling sandsynligvis ikke er en fejl."
# The empty list means that no entity annotations are given for the string
best_prediction, table = predict_mlm(document, list(), daluke)

CLI

daluke ner --text "Thomas Delaney fører Danmark til sejr ved EM i fodbold."
daluke masked --text "Slutresultatet af kampen mellem Danmark og Rusland bliver [MASK]-[MASK]."

On Windows, or on systems where #!/usr/bin/env python3 does not resolve to the correct Python interpreter, use the command python -m daluke.api.cli in place of daluke.
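
The module form of the command can also be launched from a script, which sidesteps the shebang issue by using the interpreter that is already running. This is a minimal sketch: daluke_command and run_daluke are hypothetical helper names, and only the ner and masked subcommands with the --text option shown above are assumed to exist.

```python
import subprocess
import sys
from typing import List


def daluke_command(task: str, text: str) -> List[str]:
    """Build the argument list for `python -m daluke.api.cli <task> --text <text>`."""
    if task not in ("ner", "masked"):
        raise ValueError(f"Unknown task: {task!r}")
    # sys.executable is the interpreter running this script, so the daluke
    # package installed for that interpreter is the one that gets used
    return [sys.executable, "-m", "daluke.api.cli", task, "--text", text]


def run_daluke(task: str, text: str) -> str:
    """Run the CLI and return its standard output (requires daluke to be installed)."""
    result = subprocess.run(
        daluke_command(task, text),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


cmd = daluke_command("ner", "Thomas Delaney fører Danmark til sejr ved EM i fodbold.")
print(cmd)
```

Passing the arguments as a list avoids shell quoting problems with the Danish example sentences.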

Training DaLUKE yourself

This section shows how to reproduce the entire DaLUKE training pipeline, from dataset preparation to fine-tuning. The guide assumes a bash shell; on Windows, the shell scripts will probably need some modification.

# Download forked luke submodule
git submodule update --init --recursive
# Install requirements
pip install -r requirements.txt
pip install -r optional-requirements.txt
pip install -r luke/requirements.txt

# Build dataset
# The script performs all the steps of building the dataset, including downloading the Danish Wikipedia
# You only need to modify DATA_PATH to where you want the data to be saved
# Be aware that this takes several hours
dev/build_data.sh

# Start pretraining using default hyperparameters
python daluke/pretrain/run.py <DATA_PATH> -c configs/pretrain-main.ini --name daluke --save-every 5 --epochs 150 --fp16
# Optional: Make plots of pretraining
python daluke/plot/plot_pretraining.py <DATA_PATH>/daluke

# Fine-tune on DaNE
python daluke/collect_modelfile.py <DATA_PATH>/daluke <DATA_PATH>/ner/daluke.tar.gz
python daluke/ner/run.py <DATA_PATH>/ner/daluke -c configs/main-finetune.ini --model <DATA_PATH>/ner/daluke.tar.gz --name finetune --eval
# Evaluate on DaNE test set
python daluke/ner/run_eval.py <DATA_PATH>/ner/daluke/finetune --model <DATA_PATH>/ner/daluke/finetune/daluke_ner_best.tar.gz
# Optional: Fine-tuning plots
python daluke/plot/plot_finetune_ner.py <DATA_PATH>/ner/daluke/finetune/train-results
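
The commands above share a directory layout under DATA_PATH. The sketch below collects those paths in one place; pipeline_paths is a hypothetical helper, the layout is inferred only from the commands in this guide, and the file name daluke_ner_best.tar.gz is copied from the evaluation command on the assumption that it does not depend on the run name.

```python
from pathlib import Path
from typing import Dict


def pipeline_paths(
    data_path: str,
    name: str = "daluke",
    finetune_name: str = "finetune",
) -> Dict[str, Path]:
    """Derive the locations used by the pretraining and fine-tuning commands."""
    root = Path(data_path)
    ner_dir = root / "ner" / name
    finetune_dir = ner_dir / finetune_name
    return {
        # Output of pretraining, input to plot_pretraining.py and collect_modelfile.py
        "pretrain_dir": root / name,
        # Archive written by collect_modelfile.py and passed to ner/run.py via --model
        "model_archive": root / "ner" / f"{name}.tar.gz",
        # Working directory of ner/run.py
        "ner_dir": ner_dir,
        # Directory of the fine-tuning run started with --name finetune
        "finetune_dir": finetune_dir,
        # Model passed to run_eval.py (file name assumed to be fixed)
        "best_ner_model": finetune_dir / "daluke_ner_best.tar.gz",
        # Input to plot_finetune_ner.py
        "finetune_results": finetune_dir / "train-results",
    }


# "/data/daluke-run" is a placeholder for your own DATA_PATH
paths = pipeline_paths("/data/daluke-run")
for key, path in paths.items():
    print(f"{key}: {path}")
```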

History

0.0.5

- Added batching to NER forward passes in the Python API

0.0.4

- Added a Python API for maintaining a stateful model and performing CWR, MLM and NER predictions

0.0.3: Finalization of Bachelor's Project

- Allowed specifying entity spans in masked word prediction CLI

0.0.2

- Made the CLI work on Windows

0.0.1

- Simple single-example CLI released
