Module for finetuning dfm base-models to sentence transformers

dfm-sentence-transformers


Sentence transformers for the Danish Foundation Models Project.

Training

Install the package from PyPI:

pip install dfm-sentence-transformers

Specify the basic model and training parameters in a config file, along with every task and dataset the model should be trained on.

Here is an example of a config:

[model]
name="dfm-sentence-encoder-small-v1"
base_model="chcaa/dfm-encoder-small-v1"
device="cpu"

[training]
epochs=50
steps_per_epoch=500
warmup_steps=100
batch_size=64
wandb_project="dfm-sentence-transformers"
checkpoint_repo="checkpoints-dfm-sentence-encoder-small-v1"

[tasks]

[tasks.bornholmsk]
@tasks="multiple_negatives_ranking"
sentence1="da_bornholm"
sentence2="da"

[tasks.bornholmsk.dataset]
@loaders="load_dataset"
path="strombergnlp/bornholmsk_parallel"
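As a quick sanity check before training, the config can be inspected with plain Python. This is only a sketch: it uses the standard library's configparser to read the example above, and the package's own config parser may behave differently. The helper `value` and the derived `total_steps` are illustrative names, and the step count assumes each epoch runs exactly `steps_per_epoch` steps.

```python
import configparser

# The example config from above, embedded so the sketch is self-contained.
CONFIG_TEXT = """
[model]
name="dfm-sentence-encoder-small-v1"
base_model="chcaa/dfm-encoder-small-v1"
device="cpu"

[training]
epochs=50
steps_per_epoch=500
warmup_steps=100
batch_size=64
wandb_project="dfm-sentence-transformers"
checkpoint_repo="checkpoints-dfm-sentence-encoder-small-v1"

[tasks]

[tasks.bornholmsk]
@tasks="multiple_negatives_ranking"
sentence1="da_bornholm"
sentence2="da"

[tasks.bornholmsk.dataset]
@loaders="load_dataset"
path="strombergnlp/bornholmsk_parallel"
"""

# Assumption: stdlib parsing is good enough for inspection only.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)


def value(section, key):
    """Return a config value with the surrounding double quotes removed."""
    return parser[section][key].strip('"')


# Assumed relationship: total steps = epochs * steps_per_epoch.
total_steps = int(value("training", "epochs")) * int(value("training", "steps_per_epoch"))

# Task sections look like "tasks.<name>"; deeper sections hold their datasets.
task_names = [
    section.split(".")[1]
    for section in parser.sections()
    if section.startswith("tasks.") and section.count(".") == 1
]

print(total_steps, task_names)
```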

Then you can train a sentence transformer by using the finetune command.

python3 -m dfm_sentence_trf finetune training.cfg -o "model/"

You can push the finetuned model to HuggingFace Hub:

python3 -m dfm_sentence_trf push_to_hub training.cfg --model_path "model/"

Evaluation

You can evaluate trained models with the Scandinavian Embedding Benchmark.

pip install seb
python3 -m seb "model/" "da"

Tasks

You can add an arbitrary number of tasks to the model's config. Every task must have a unique name, but the names are ignored during training. Datasets of tasks that share a loss function are mixed together, so that the model learns them simultaneously in mixed batches. The package comes with three default tasks you can use for different objectives:

1. Multiple Negatives Ranking

If you have a parallel corpus of sentences (paraphrase, translation, etc.) use this task. Batches consist of positive sentence pairs, and negative samples are constructed by taking all non-matching pairs in a batch.
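The in-batch negative construction can be sketched in plain Python. This is an illustration under assumptions, not the package's implementation: it assumes cosine similarity as the similarity function and a cross-entropy over each row of the scaled similarity matrix, with the matching pair as the correct class. The function names and toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i].

    Every other positives[j] in the batch acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, candidate) for candidate in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)


# Toy 2-d "embeddings": each anchor is close to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [aligned[1], aligned[0]]  # pairs deliberately mismatched

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_shuffled = multiple_negatives_ranking_loss(anchors, shuffled)
print(loss_aligned < loss_shuffled)
```

Because negatives come from the other pairs in the batch, larger batches give each anchor more negatives to be ranked against.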

Parameters:

Param     | Type  | Description                                                 | Default
sentence1 | str   | Name of the first sentence column in the dataset.           | -
sentence2 | str   | Name of the second sentence column in the dataset.          | -
scale     | float | Output of similarity function is multiplied by scale value. | 20.0

[tasks.faroese]
@tasks="multiple_negatives_ranking"
sentence1="fo"
sentence2="da"

[tasks.faroese.dataset]
@loaders="load_dataset"
path="strombergnlp/itu_faroese_danish"

2. Cosine Similarity

Good for STS datasets. Minimizes the mean squared error between the estimated and the gold-standard cosine similarity of each sentence pair.
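A minimal sketch of this objective in plain Python follows. It assumes the gold scores are already on the same scale as cosine similarity; datasets with a different score range would need rescaling first. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs, gold):
    """Mean squared error between predicted cosine similarity and gold scores."""
    errors = [(cosine(u, v) - target) ** 2 for (u, v), target in zip(pairs, gold)]
    return sum(errors) / len(errors)


# Toy embeddings: the first pair is identical, the second is orthogonal.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]

loss_good = cosine_similarity_loss(pairs, [1.0, 0.0])  # gold agrees with geometry
loss_bad = cosine_similarity_loss(pairs, [0.0, 1.0])   # gold contradicts geometry
print(loss_good, loss_bad)
```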

Parameters:

Param      | Type | Description                                        | Default
sentence1  | str  | Name of the first sentence column in the dataset.  | -
sentence2  | str  | Name of the second sentence column in the dataset. | -
similarity | str  | Name of the gold standard similarity column.       | -

[tasks.sts]
@tasks="cosine_similarity"
sentence1="sent1"
sentence2="sent2"
similarity="label"

[tasks.sts.dataset]
...

3. Softmax

Good for NLI datasets. Uses softmax classification loss based on concatenated embeddings and their difference. Beware that these tasks are never joined due to potentially different labeling schemes.
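The classification objective can be sketched as follows. This is an illustration under assumptions: it uses the element-wise absolute difference as the "difference" feature and a single linear layer as the classifier; the package may differ on both points. All names and values are hypothetical.

```python
import math


def softmax_loss(u, v, label, weights, bias):
    """Cross-entropy of a linear classifier over [u, v, |u - v|]."""
    # Assumed feature layout: both embeddings plus their absolute difference.
    features = u + v + [abs(a - b) for a, b in zip(u, v)]
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    # Numerically stable log-sum-exp over the label logits.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[label]


# Toy sentence embeddings and an untrained (all-zero) 3-label classifier.
u = [0.2, 0.8]
v = [0.3, 0.7]
num_labels = 3
weights = [[0.0] * (3 * len(u)) for _ in range(num_labels)]
bias = [0.0] * num_labels

# With zero weights every label is equally likely, so the loss is log(3).
loss_uniform = softmax_loss(u, v, label=1, weights=weights, bias=bias)
print(loss_uniform)
```

Since the classifier's output size depends on the label set, it is plausible that two datasets with different labeling schemes cannot share one classifier, which matches the warning above about softmax tasks never being joined.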

Parameters:

Param     | Type | Description                                        | Default
sentence1 | str  | Name of the first sentence column in the dataset.  | -
sentence2 | str  | Name of the second sentence column in the dataset. | -
label     | str  | Name of the label column in the dataset.           | -

[tasks.nli]
@tasks="softmax"
sentence1="premise"
sentence2="hypothesis"
label="label"

[tasks.nli.dataset]
...
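The mixing rule described at the start of this section (tasks with the same loss share a pool, except softmax tasks, which stay separate) can be sketched like this. It is a hypothetical illustration, not the package's code: the `tasks` dictionary layout, the `mix_tasks` helper, and the toy examples are all assumptions.

```python
import random
from collections import defaultdict


def mix_tasks(tasks, seed=0):
    """Pool examples by loss function; keep each softmax task in its own pool."""
    pools = defaultdict(list)
    for name, task in tasks.items():
        # Softmax tasks may use different label schemes, so they are not merged.
        if task["loss"] == "softmax":
            key = (task["loss"], name)
        else:
            key = (task["loss"], None)
        pools[key].extend(task["examples"])
    rng = random.Random(seed)
    for examples in pools.values():
        rng.shuffle(examples)  # mixed batches are then drawn from each pool
    return dict(pools)


# Hypothetical toy tasks standing in for real datasets.
tasks = {
    "bornholmsk": {
        "loss": "multiple_negatives_ranking",
        "examples": [("bo1", "da1"), ("bo2", "da2")],
    },
    "faroese": {
        "loss": "multiple_negatives_ranking",
        "examples": [("fo1", "da3")],
    },
    "nli_a": {"loss": "softmax", "examples": [("p1", "h1", 0)]},
    "nli_b": {"loss": "softmax", "examples": [("p2", "h2", 1)]},
}

pools = mix_tasks(tasks)
print({key: len(examples) for key, examples in pools.items()})
```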

Datasets

Datasets for each task are loaded with the Hugging Face load_dataset() function, but only its first argument (the path) and a name are accepted. You can use local or remote datasets in any of the canonical file formats (JSON, JSONL, CSV, Parquet...).

...

[tasks.local.dataset]
@loaders="load_dataset"
path="local/dataset/file.jsonl"

...

[tasks.huggingface_hub.dataset]
@loaders="load_dataset"
path="username/dataset"
