
Module for finetuning DFM base models into sentence transformers


dfm-sentence-transformers

Code for curating data and training sentence transformers for the Danish Foundation Models project.

Training

Install the CLI:

pip install dfm_sentence_trf

WARNING: The package is not on PyPI yet, so this command will not work for now.

Config system (TODO)

In a config file you specify the basic model and training parameters, along with every task and dataset the model should be trained on.

[model]
name="dfm-sentence-encoder-small-v1"
base_model="chcaa/dfm-encoder-small-v1"
device="cpu"

[training]
epochs=5
warmup_steps=100
batch_size=120

[tasks]

[tasks.bornholmsk]
@tasks="multiple_negatives_ranking"
sentence1="da_bornholm"
sentence2="da"

[tasks.bornholmsk.dataset]
@loaders="load_dataset"
path="strombergnlp/bornholmsk_parallel"

[tasks.hestenet]
@tasks="multiple_negatives_ranking"
sentence1="question"
sentence2="answer"

[tasks.hestenet.dataset]
@loaders="load_dataset"
path="some/local/path"
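To make the structure of the config above concrete, here is a minimal sketch of how such a file could be read into a nested dictionary using only the Python standard library. It is not the package's own loader: the `parse_config` helper is hypothetical, and treating dotted section names as nested tables and values as JSON literals is an assumption based on how the example is written. What the package does with the `@tasks` and `@loaders` keys is not shown here; they are kept as plain strings.

```python
import configparser
import json

# A shortened copy of the example config from this README.
CONFIG_TEXT = '''
[model]
name="dfm-sentence-encoder-small-v1"
base_model="chcaa/dfm-encoder-small-v1"
device="cpu"

[training]
epochs=5
warmup_steps=100
batch_size=120

[tasks]

[tasks.bornholmsk]
@tasks="multiple_negatives_ranking"
sentence1="da_bornholm"
sentence2="da"

[tasks.bornholmsk.dataset]
@loaders="load_dataset"
path="strombergnlp/bornholmsk_parallel"
'''


def parse_config(text: str) -> dict:
    """Hypothetical helper: parse the config text into nested dicts.

    Assumes dotted section names denote nesting and that every value
    is a valid JSON literal (quoted string or number).
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys such as "@tasks" exactly as written
    parser.read_string(text)

    config: dict = {}
    for section in parser.sections():
        # Walk (and create) the nested dicts for "tasks.bornholmsk.dataset" etc.
        node = config
        for part in section.split("."):
            node = node.setdefault(part, {})
        # Decode each value: "cpu" becomes a str, 120 becomes an int.
        for key, raw_value in parser.items(section):
            node[key] = json.loads(raw_value)
    return config


config = parse_config(CONFIG_TEXT)
print(config["training"])
print(config["tasks"]["bornholmsk"]["dataset"]["path"])
```

Read this way, each entry under `tasks` carries its own sentence column names and a nested `dataset` table, which is why one config can describe several tasks side by side.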

You can then train a sentence transformer with the finetune command, passing the config file and an output folder.

python3 -m dfm_sentence_trf training.cfg --output_folder "output/"

You can push the finetuned model to the Hugging Face Hub:

python3 -m dfm_sentence_trf training.cfg --model_path "output/" --organization "chcaa"

Tasks (TODO)

ContrastiveParallel (TODO)

The task uses a contrastive loss on a parallel corpus, where negative examples (i.e. non-matching sentence pairs labelled with 0) are randomly sampled. You can specify the dataset and the number of negative samples for each positive sample, as well as basic training parameters.
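The negative sampling described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the function name `build_contrastive_pairs`, the `n_negatives` parameter, and the toy sentence pairs are all hypothetical. Matching pairs receive the label 1, and for each of them a fixed number of non-matching targets is drawn at random and labelled 0.

```python
import random


def build_contrastive_pairs(pairs, n_negatives=2, seed=0):
    """Hypothetical helper: turn a parallel corpus into labelled pairs.

    pairs: list of (source, target) tuples that belong together.
    Returns a list of (source, target, label) tuples, where label is 1
    for matching pairs and 0 for randomly sampled non-matching pairs.
    """
    rng = random.Random(seed)  # seeded so that sampling is reproducible
    examples = []
    for i, (source, target) in enumerate(pairs):
        # The aligned pair is the positive example.
        examples.append((source, target, 1))
        # Candidate negatives: targets of all other pairs, excluding any
        # that happen to be identical to the true target.
        candidates = [
            other_target
            for j, (_, other_target) in enumerate(pairs)
            if j != i and other_target != target
        ]
        # Sample without replacement, capped by the number of candidates.
        n_samples = min(n_negatives, len(candidates))
        for negative in rng.sample(candidates, n_samples):
            examples.append((source, negative, 0))
    return examples


# Made-up placeholder pairs, standing in for a real parallel corpus.
toy_pairs = [
    ("source A", "target A"),
    ("source B", "target B"),
    ("source C", "target C"),
    ("source D", "target D"),
]
examples = build_contrastive_pairs(toy_pairs, n_negatives=2)
for source, target, label in examples[:3]:
    print(label, source, "|", target)
```

With `n_negatives=2`, every positive pair contributes three training examples, so the ratio of negatives to positives is set directly by that parameter.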

