A semantic search embedding model fine-tuning tool

These details have not been verified by PyPI

Project links

Development Status
- 3 - Alpha
Environment
- GPU
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

Nixietune: a fine-tuner for semantic search models


Nixietune is a GPU fine-tuning harness for semantic search models, built for the Nixiesearch search engine:

A set of state-of-the-art recipes for fine-tuning existing generic semantic search models such as E5, BGE and MiniLM on your data.
Based on the battle-tested sentence-transformers library, but uses the modern Hugging Face ecosystem for training: multi-GPU and distributed training, FP16/BF16 mixed precision, gradient checkpointing and accumulation, and dataset caching.
Works with or without hard negatives, and supports InfoNCE, cosine similarity, contrastive and triplet losses.

Features

What Nixietune can do for you:

Fine-tune an existing embedding model on your labeled data.
Generate synthetic queries and labels.
Train a cross-encoder reranker model.

Usage

Fine-tuning an embedding model

To fine-tune a semantic search embedding model on your data:

Install nixietune (a GPU is required).
Format your data in the nixietune format: a JSONL file with a specific schema.
Run the training: for base and small models it takes less than an hour on a single desktop GPU.
Tinker with the parameters: choose the best loss and make your model train faster.

Installation

Nixietune is published to PyPI:

# set up the environment
python -m venv .venv && source .venv/bin/activate
# install dependencies
pip install nixietune

Nixietune is tested with Python 3.10 and 3.11. Python 3.12 was not yet supported by PyTorch at the time of writing.

Data format

Nixietune expects training data in a specific JSONL format, with one JSON object per line. A single record, pretty-printed here for readability:

{
    "query": "pizza",
    "doc": "Standard Serious Pizza",
    "neg": [
        "Burgermeister",
        "Risa Chicken"
    ]
}

The document schema can be described as:

query: string. An anchor search query for the whole group of documents.
doc: string. One or more positive documents for the query above.
neg: list[string]. Zero or more negative documents for the query.
negscore: list[float]. Zero or more scores for the negatives.

All fields are formally optional, and different modules require different fields. Traditional embedding fine-tuning needs the query and doc fields, and optionally neg.
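As a rough illustration of the schema above, the following sketch checks records and writes them as JSONL using only the standard library. The function names `validate_record` and `to_jsonl` are hypothetical, and the validation rules (non-empty strings, one score per negative) are assumptions of this sketch, not checks that nixietune is known to perform.

```python
import json


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one training record.

    The rules are assumptions made for illustration, based on the
    field types listed in the schema (query, doc, neg, negscore).
    """
    problems = []
    for field in ("query", "doc"):
        if not isinstance(record.get(field), str) or not record[field]:
            problems.append(f"'{field}' must be a non-empty string")
    neg = record.get("neg", [])
    if not isinstance(neg, list) or not all(isinstance(n, str) for n in neg):
        problems.append("'neg' must be a list of strings")
    negscore = record.get("negscore")
    if negscore is not None:
        same_length = (
            isinstance(negscore, list)
            and isinstance(neg, list)
            and len(negscore) == len(neg)
        )
        if not same_length:
            problems.append("'negscore' must have one score per negative")
    return problems


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one compact JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


records = [
    {"query": "pizza", "doc": "Standard Serious Pizza",
     "neg": ["Burgermeister", "Risa Chicken"]},
    {"query": "burger", "doc": "Burgermeister"},  # no explicit negatives
]
for r in records:
    assert validate_record(r) == []

print(to_jsonl(records), end="")
```

Note that `json.dumps` never emits a trailing comma, so the output avoids a common hand-editing mistake in JSON files.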

Some losses like InfoNCE can be trained without negatives (so you need only query and doc fields in the training data), but usually you can get much better results with explicit negatives.
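To see why InfoNCE can work without explicit negatives, here is a minimal pure-Python sketch of the loss with in-batch negatives: for each query, every other document in the batch is treated as a negative. This is a toy illustration and not nixietune's implementation; the function names, the 2-d embeddings and the temperature value of 0.05 are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def infonce_loss(queries, docs, temperature=0.05):
    """Mean InfoNCE loss where docs[i] is the positive for queries[i]
    and all other documents in the batch act as negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        # log-sum-exp with the maximum subtracted for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # cross-entropy of picking the positive among all batch documents
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings (hypothetical): two queries and two documents.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its query
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each positive is close to the other query

aligned_loss = infonce_loss(queries, aligned)
swapped_loss = infonce_loss(queries, swapped)
print(aligned_loss, swapped_loss)
```

The loss is low when each query is closest to its own document and high otherwise. Explicit hard negatives add more, and more difficult, entries to the same denominator, which is one plausible reason they tend to help.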

Run the training

Let's fine-tune the sentence-transformers/all-MiniLM-L6-v2 embedding model on the nixiesearch/amazon-esci dataset, using the InfoNCE loss.

python -m nixietune.biencoder examples/esci.json

The esci.json configuration file is based on the Hugging Face Transformers TrainingArguments, with some extra settings:

{
    "seq_len": 128,
    "target": "infonce",
    "num_negatives": 8,
    "train_dataset": "nixiesearch/amazon-esci",
    "eval_dataset": "nixiesearch/amazon-esci",
    "train_split": "train[:10%]",
    "eval_split": "test_1k",
    "model_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
    "output_dir": "out",
    "num_train_epochs": 1,
    "seed": 33,
    "per_device_train_batch_size": 512,
    "per_device_eval_batch_size": 512,
    "fp16": true,
    "logging_dir": "logs",
    "gradient_checkpointing": true,
    "gradient_accumulation_steps": 1,
    "dataloader_num_workers": 14,
    "eval_steps": 0.1,
    "logging_steps": 0.1,
    "evaluation_strategy": "steps",
    "torch_compile": true,
    "report_to": [],
    "save_strategy": "epoch",
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "learning_rate": 5e-5
}

It takes around 60 minutes to fine-tune an all-MiniLM-L6-v2 on an Amazon ESCI dataset on a single RTX4090 GPU.
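Fractional values such as "eval_steps": 0.1 in the config above are read by the Hugging Face Trainer as a ratio of the total number of training steps. The sketch below approximates how batch size, gradient accumulation and epochs translate into optimizer steps and an evaluation interval. The helper `training_schedule` is hypothetical, the dataset size of 100,000 examples is made up for illustration, and exact rounding may differ between Transformers versions.

```python
import math


def training_schedule(num_examples, per_device_batch_size, num_gpus=1,
                      gradient_accumulation_steps=1, num_train_epochs=1,
                      eval_steps=0.1):
    """Approximate the number of optimizer steps and the evaluation interval.

    This is a back-of-the-envelope estimate, not a call into the Trainer.
    """
    effective_batch = per_device_batch_size * num_gpus * gradient_accumulation_steps
    # batches seen per epoch across all devices (last partial batch is kept)
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))
    # one optimizer update per gradient_accumulation_steps batches
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    total_steps = math.ceil(updates_per_epoch * num_train_epochs)
    # values below 1 are treated as a fraction of the total steps
    if eval_steps < 1:
        eval_every = math.ceil(total_steps * eval_steps)
    else:
        eval_every = int(eval_steps)
    return {
        "effective_batch_size": effective_batch,
        "total_steps": total_steps,
        "eval_every": eval_every,
    }


# Hypothetical dataset size; batch size and eval_steps taken from the config above.
schedule = training_schedule(num_examples=100_000, per_device_batch_size=512)
print(schedule)
```

With a ratio of 0.1 the model is evaluated roughly ten times per run regardless of dataset size, which is convenient when the same config is reused across datasets.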

Choosing the best parameters

The following training parameters are worth tuning:

target: the training recipe. Currently supported targets are infonce, cosine_similarity, contrastive and triplet. If not sure, start with infonce.
model_name_or_path: which model to fine-tune. Any SBERT-supported model should work.
per_device_train_batch_size: batch size per device. Values that are too small lead to sub-par quality and slow training; values that are too large need a lot of VRAM. Start with 128 and go up.
seq_len: context length of the model. It is usually around 128-160 for most models on the MTEB leaderboard.
gradient_checkpointing: reduces VRAM usage significantly (up to 70%) at the cost of roughly 10% slower training, because activations are recomputed during the backward pass instead of being stored. If unsure, choose true.
num_negatives: for the infonce and triplet targets, how many negatives to select from the dataset.
query_prefix and document_prefix: prompt prefixes for asymmetric models like E5, which distinguish between queries and document passages.
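To make the prefix parameters concrete: E5-style models expect inputs such as "query: ..." and "passage: ...". The sketch below shows what prepending these prefixes to a training record could look like. The helper `apply_prefixes` is hypothetical, and applying the document prefix to negatives as well is an assumption of this sketch rather than documented nixietune behaviour.

```python
def apply_prefixes(record, query_prefix="", document_prefix=""):
    """Return a copy of a training record with prefixes prepended to each text.

    Assumption: negatives are documents too, so they get document_prefix.
    """
    out = dict(record)
    if "query" in record:
        out["query"] = query_prefix + record["query"]
    if "doc" in record:
        out["doc"] = document_prefix + record["doc"]
    if "neg" in record:
        out["neg"] = [document_prefix + n for n in record["neg"]]
    return out


record = {
    "query": "pizza",
    "doc": "Standard Serious Pizza",
    "neg": ["Burgermeister", "Risa Chicken"],
}
# "query: " and "passage: " are the prefixes used by the E5 model family.
prefixed = apply_prefixes(record, query_prefix="query: ", document_prefix="passage: ")
print(prefixed)
```

The same prefixes have to be used at inference time, otherwise queries and documents are embedded differently from how the model was trained.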

Training a cross-encoder

Cross-encoders are not limited by the restrictions of cosine space and usually give much more precise results, at the cost of much more resource-hungry inference.

Training a cross-encoder with nixietune requires negatives in your data (the query, doc and neg fields) and can be done with the following config file:

{
    "seq_len": 128,
    "train_dataset": "nixiesearch/amazon-esci",
    "eval_dataset": "nixiesearch/amazon-esci",
    "train_split": "train",
    "eval_split": "test_1k",
    "model_name_or_path": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "output_dir": "out",
    "num_train_epochs": 1,
    "seed": 33,
    "per_device_train_batch_size": 1024,
    "per_device_eval_batch_size": 1024,
    "fp16": true,
    "logging_dir": "logs",
    "gradient_checkpointing": true,
    "gradient_accumulation_steps": 1,
    "dataloader_num_workers": 14,
    "eval_steps": 0.1,
    "logging_steps": 0.1,
    "evaluation_strategy": "steps",
    "torch_compile": false,
    "report_to": [],
    "save_strategy": "epoch",
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "learning_rate": 5e-5
}

It can be launched with the following command:

python -m nixietune.crossencoder examples/esci_ce.json
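One way to picture why a cross-encoder needs negatives: it scores a query and a document together, so each record can be unrolled into labeled pairs, and without negatives every label would be the same. The sketch below is only an illustration of that idea; the helper `to_labeled_pairs` and the 1.0/0.0 labels are assumptions, not necessarily how nixietune builds its training batches.

```python
def to_labeled_pairs(record):
    """Unroll one record into (query, document, label) triples.

    The positive document gets label 1.0, every negative gets 0.0.
    """
    pairs = [(record["query"], record["doc"], 1.0)]
    pairs += [(record["query"], neg, 0.0) for neg in record.get("neg", [])]
    return pairs


record = {
    "query": "pizza",
    "doc": "Standard Serious Pizza",
    "neg": ["Burgermeister", "Risa Chicken"],
}
pairs = to_labeled_pairs(record)
for query, document, label in pairs:
    print(f"{label:.1f}\t{query}\t{document}")
```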

Generating synthetic queries

Nixietune has a module for LLM-based synthetic query generation.

License

Apache 2.0


Release history

0.0.7 (this version): Apr 22, 2024
0.0.6: Jan 8, 2024
0.0.5: Jan 5, 2024
0.0.4: Jan 3, 2024
0.0.3: Jan 3, 2024

Download files


Source Distribution

nixietune-0.0.7.tar.gz (28.3 kB view details)

Uploaded Apr 22, 2024 Source

Built Distribution

nixietune-0.0.7-py3-none-any.whl (37.5 kB view details)

Uploaded Apr 22, 2024 Python 3

File details

Details for the file nixietune-0.0.7.tar.gz.

File metadata

Download URL: nixietune-0.0.7.tar.gz
Upload date: Apr 22, 2024
Size: 28.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.11.9

File hashes

Hashes for nixietune-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`1c74928ca3a798884fce5d481a1d132c44e8e24cfdb43c7e9619a5e3a263dba7`
MD5	`3084b818dc55c3356bef4906bf6bfdc4`
BLAKE2b-256	`4d566fd3adfbd5f1f3b8cb1b24a07ffd7b803e1f99a2329d162e8318f54fc848`


File details

Details for the file nixietune-0.0.7-py3-none-any.whl.

File metadata

Download URL: nixietune-0.0.7-py3-none-any.whl
Upload date: Apr 22, 2024
Size: 37.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.11.9

File hashes

Hashes for nixietune-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`863f02d94793a3051e35c5c7eb0d64609c385612e6e94b282ced4f78e5a0bb30`
MD5	`6765987609e6babd969ba6103d14b8f1`
BLAKE2b-256	`7d6884c394f059f972be1da6030930f3c778b07879c819032c773f82c6beaebe`