
dfm-sentence-transformers


Sentence transformers for the Danish Foundation Models Project.

Training

Install the package from PyPI:

pip install dfm-sentence-transformers

You have to specify the base model and training parameters, as well as all the tasks/datasets the model should be trained on.

Here is an example of a config:

[model]
name="dfm-sentence-encoder-small-v1"
base_model="chcaa/dfm-encoder-small-v1"
device="cpu"

[training]
epochs=50
steps_per_epoch=500
warmup_steps=100
batch_size=64
wandb_project="dfm-sentence-transformers"
checkpoint_repo="checkpoints-dfm-sentence-encoder-small-v1"

[tasks]

[tasks.bornholmsk]
@tasks="multiple_negatives_ranking"
sentence1="da_bornholm"
sentence2="da"

[tasks.bornholmsk.dataset]
@loaders="load_dataset"
path="strombergnlp/bornholmsk_parallel"

Then you can train a sentence transformer using the finetune command:

python3 -m dfm_sentence_trf finetune training.cfg -o "model/"
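
The output directory can then be loaded like any regular Sentence Transformers model. A minimal usage sketch, assuming the standard sentence-transformers save format:

from sentence_transformers import SentenceTransformer

# Load the finetuned model from the output directory
model = SentenceTransformer("model/")

# Encode sentences into dense vectors
embeddings = model.encode(["Hej, hvordan går det?", "Jeg har det fint."])
print(embeddings.shape)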

You can push the finetuned model to the Hugging Face Hub:

python3 -m dfm_sentence_trf push_to_hub training.cfg --model_path "model/"

(NEW) Curating datasets with models you've pretrained

Similar to Microsoft's E5, we intend to train models on data curated by models that were themselves trained on heuristic-based sentence pairs. We provide a CLI for filtering the dataset based on consistency.

This uses a batch-based strategy: we take batches of sentence pairs and compute a similarity matrix between the left-side and right-side sentences. If a pair's similarity is above the 1-(1/(N*specificity))'th quantile of all similarities in the matrix, and the pair was originally annotated as a pair by the heuristics, we accept it as a positive pair. We also assign hard negatives: pairs whose similarity falls in the corresponding lower quantile and that were not originally annotated as pairs.
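
Here is a minimal sketch of that filtering step, assuming a sentence-transformers encoder; the function name and exact thresholding details are illustrative, not the package's internals:

import numpy as np
from sentence_transformers import SentenceTransformer

def filter_batch(left, right, model: SentenceTransformer, specificity=1.2):
    # Embed both sides of the candidate pairs and L2-normalize,
    # so the dot product below is cosine similarity.
    a = model.encode(left, normalize_embeddings=True)
    b = model.encode(right, normalize_embeddings=True)
    sim = a @ b.T  # (N, N) similarity matrix
    n = len(left)
    upper = np.quantile(sim, 1 - 1 / (n * specificity))
    lower = np.quantile(sim, 1 / (n * specificity))
    # Row i of `left` pairs with row i of `right` per the heuristics;
    # keep a pair only if its similarity clears the upper quantile.
    positives = [(left[i], right[i]) for i in range(n) if sim[i, i] >= upper]
    # Hard negatives: non-annotated combinations below the lower quantile.
    negatives = [(left[i], right[j])
                 for i in range(n) for j in range(n)
                 if i != j and sim[i, j] <= lower]
    return positives, negatives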

The hard-positive/hard-negative scheme is employed so that we can use AnglE to finetune models on this curated data.

The config scheme for data cleaning is the following:

[cleaning]
batch_size=1000
specificity=1.2
name="kardosdrur/folketing-wiki-clean"

[cleaning.model]
...(same as everywhere else)

[data]

[data.folketinget]
sentence1="comment"
sentence2="response"

[data.folketinget.dataset]
@loaders="load_dataset"
path="kardosdrur/folketinget-discussions"

Then you can clean the dataset:

python3 -m dfm_sentence_trf clean_dataset "config.cfg"

This will produce the JSONL file <dataset_name>.jsonl containing all examples.

Datasets can then be shuffled, split and pushed to the hub with the push_dataset command.

python3 -m dfm_sentence_trf push_dataset "config.cfg"

(NEW) Finetuning with AnglE

You can finetune a model with AnglE on supervised tasks. AnglE models have a different config format, namely:

[model]
...

[training]
epochs=5
batch_size=32
warmup_steps=100

[angle]
sentence1="premise"
sentence2="hypothesis"
label="label"

[angle.dataset]
@loaders="load_dataset"
path="kardosdrur/nb-nli"

AnglE models can only be trained on one supervised task, where the label is correlated with semantic similarity.

Note that you have to manually install AnglE.

pip install angle_emb

Then you can finetune:

python3 -m dfm_sentence_trf angle_finetune "config.cfg" -o "model/"

Models can be pushed to the Hub the same way as everything else. We recommend that you pretrain on sentence-pair datasets and then finetune with AnglE on NLI or STS tasks.

Evaluation

You can evaluate trained models with the Scandinavian Embedding Benchmark.

pip install seb
python3 -m seb "model/" "da"

Tasks

You can add an arbitrary number of tasks to the model's config. All tasks must have a unique name, but the name is ignored in the actual training procedure. Datasets of tasks with the same loss function are mixed together so that the model can learn them simultaneously in mixed batches. The package comes with four default tasks you can use for different objectives:
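
As a rough illustration of the mixing (a simplified sketch, not the package's actual batch sampler), examples from all tasks that share a loss are pooled and shuffled, so one batch can contain pairs from several datasets:

import random

def mixed_batches(task_examples: dict, batch_size: int):
    # task_examples maps task name -> list of examples; all tasks here
    # are assumed to use the same loss function.
    pool = [ex for examples in task_examples.values() for ex in examples]
    random.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]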

1. Multiple Negatives Ranking

If you have a parallel corpus of sentences (paraphrases, translations, etc.), use this task. Batches consist of positive sentence pairs, and negative samples are constructed by taking all non-matching pairs in the batch.

Parameters:

Param      Type   Description                                                      Default
sentence1  str    Name of the first sentence column in the dataset.                -
sentence2  str    Name of the second sentence column in the dataset.               -
scale      float  Output of the similarity function is multiplied by this value.   20.0

[tasks.faroese]
@tasks="multiple_negatives_ranking"
sentence1="fo"
sentence2="da"

[tasks.faroese.dataset]
@loaders="load_dataset"
path="strombergnlp/itu_faroese_danish"

2. Cosine Similarity

Good for STS datasets. Minimizes the mean squared error between estimated and gold-standard sentence cosine similarities.

Parameters:

Param       Type  Description                                          Default
sentence1   str   Name of the first sentence column in the dataset.    -
sentence2   str   Name of the second sentence column in the dataset.   -
similarity  str   Name of the gold standard similarity column.         -

[tasks.sts]
@tasks="cosine_similarity"
sentence1="sent1"
sentence2="sent1"
similarity="label"

[tasks.sts.dataset]
...
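
The objective is essentially mean squared error between predicted and gold cosine similarities (a sketch of the standard cosine similarity loss):

import torch
import torch.nn.functional as F

def cosine_similarity_loss(u: torch.Tensor, v: torch.Tensor, gold: torch.Tensor):
    # u, v: (N, dim) sentence embeddings; gold: (N,) gold similarity scores.
    pred = F.cosine_similarity(u, v)
    return F.mse_loss(pred, gold)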

3. Softmax

Good for NLI datasets. Uses a softmax classification loss over the concatenated embeddings and their difference. Beware that softmax tasks are never joined into mixed batches, since their labeling schemes may differ.

Parameters:

Param      Type  Description                                          Default
sentence1  str   Name of the first sentence column in the dataset.    -
sentence2  str   Name of the second sentence column in the dataset.   -
label      str   Name of the label column in the dataset.             -

[tasks.nli]
@tasks="softmax"
sentence1="premise"
sentence2="hypothesis"
label="label"

[tasks.nli.dataset]
...
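
Conceptually, the loss classifies the concatenation of the two embeddings and their absolute difference, as in SBERT's softmax loss (a sketch; the package's actual classification head may differ):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxHead(nn.Module):
    def __init__(self, dim: int, num_labels: int):
        super().__init__()
        # Classifier over [u; v; |u - v|], hence 3 * dim input features.
        self.classifier = nn.Linear(3 * dim, num_labels)

    def forward(self, u, v, labels):
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return F.cross_entropy(self.classifier(features), labels)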

4. Contrastive (new in 0.3.6)

Contrastive loss for hard negative and hard positive pairs.

Parameters:

Param      Type  Description                                          Default
sentence1  str   Name of the first sentence column in the dataset.    -
sentence2  str   Name of the second sentence column in the dataset.   -
label      str   Name of the label column in the dataset.             -

[tasks.contrastive]
@tasks="contrastive"
sentence1="text1"
sentence2="text2"
label="label"

[tasks.contrastive.dataset]
...
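
For intuition, a standard contrastive loss over cosine distance looks like this (a sketch; the margin value is an assumption, not necessarily the package's default):

import torch
import torch.nn.functional as F

def contrastive_loss(u, v, label, margin: float = 0.5):
    # label is 1.0 for (hard) positive pairs and 0.0 for (hard) negative pairs.
    d = 1 - F.cosine_similarity(u, v)  # cosine distance
    positive_term = label * d.pow(2)                         # pull positives together
    negative_term = (1 - label) * F.relu(margin - d).pow(2)  # push negatives apart
    return (positive_term + negative_term).mean()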

Datasets

Datasets for each task are loaded with the Hugging Face load_dataset() function, but only the first argument (the path) and a dataset name are accepted. You can use local or remote datasets, and they can be in any of the canonical file formats (JSON, JSONL, CSV, Parquet...).

...

[tasks.local.dataset]
@loaders="load_dataset"
path="local/dataset/file.jsonl"

...

[tasks.huggingface_hub.dataset]
@loaders="load_dataset"
path="username/dataset"
