
A semantic search embedding model fine-tuning tool

Project description

Nixietune: a fine-tuner for semantic search models


Nixietune is a GPU fine-tuning harness for semantic search models. Built for the Nixiesearch search engine:

  • A set of state-of-the-art recipes to fine-tune existing generic semantic search models like E5/BGE/MiniLM on your data.
  • Based on the battle-tested sentence-transformers library, but uses the modern Hugging Face ecosystem for training: multi-GPU and distributed training, FP16/BF16 mixed precision, gradient checkpointing/accumulation, and dataset caching.
  • Can be used with and without hard negatives; supports InfoNCE/Cosine/Contrastive/Triplet losses.

Usage

To fine-tune a semantic search embedding model on your data:

  • Install nixietune: you need a GPU for that!
  • Format your data in the nixietune format: a JSONL file with a specific schema.
  • Run the training: for base/small models it takes less than an hour on a single GPU.
  • Tinker with the parameters: choose the best loss and make your training faster.

Installation

Nixietune is published to PyPI:

# setup the environment
python -m venv .venv && source .venv/bin/activate
# install dependencies
pip install nixietune
  • Nixietune is tested with Python 3.10 and 3.11.
  • Python 3.12 is not yet supported by PyTorch.
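
To double-check that the install worked and a GPU is visible, something like the following optional sanity check should run cleanly (this is just an illustration, not part of the nixietune docs; it assumes the package exposes a top-level nixietune module, which is how the python -m nixietune entrypoint is invoked):

# optional sanity check, assuming a CUDA-capable GPU
import torch
import nixietune  # just verifying that the package imports

assert torch.cuda.is_available(), "nixietune training expects a CUDA GPU"
print("GPU:", torch.cuda.get_device_name(0))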

Data format

Nixietune expects a specific JSONL input format for your documents:

{
    "query": "pizza",
    "positive": [
        "Standard Serious Pizza",
        "60 Seconds to Napoli"
    ],
    "negative": [
        "Burgermeister",
        "Risa Chicken"
    ]
}

The document schema can be described as:

  • query: required, string. An anchor search query for the whole group of documents.
  • positive: required, list[string]. One or more positive documents for the query above.
  • negative: optional, list[string]. Zero or more negative documents for the query.

The InfoNCE loss supports negative-less training: all other in-batch positives are treated as negatives for each query.
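
As an illustration (not part of nixietune itself), a training file in this format can be written with plain Python; the toy examples and the file name train.jsonl are placeholders:

import json

# hypothetical toy dataset in the nixietune JSONL schema
examples = [
    {
        "query": "pizza",
        "positive": ["Standard Serious Pizza", "60 Seconds to Napoli"],
        "negative": ["Burgermeister", "Risa Chicken"],
    },
    # negatives are optional, e.g. when training with the InfoNCE target
    {"query": "burger", "positive": ["Burgermeister"]},
]

# write one JSON object per line
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")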

Run the training

Let's fine-tune a sentence-transformers/all-MiniLM-L6-v2 embedding model on a nixiesearch/ms-marco-hard-negatives dataset, using the InfoNCE loss.

python -m nixietune examples/msmarco.json

The msmarco.json configuration file is based on the Hugging Face Transformers TrainingArguments, with some extra settings:

{
    "train_dataset": "nixiesearch/ms-marco-hard-negatives",
    "eval_dataset": "nixiesearch/ms_marco",
    "seq_len": 128,
    "target": "infonce",
    "model_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
    "output_dir": "minilm-msmarco-infonce8",
    "num_train_epochs": 1,
    "seed": 33,
    "per_device_train_batch_size": 256,
    "per_device_eval_batch_size": 256,
    "fp16": true,
    "logging_dir": "logs",
    "gradient_checkpointing": true,
    "gradient_accumulation_steps": 1,
    "dataloader_num_workers": 14,
    "eval_steps": 0.05,
    "logging_steps": 0.05,
    "evaluation_strategy": "steps",
    "torch_compile": true,
    "report_to": [],
    "save_strategy": "epoch",
    "num_negatives": 8,
    "query_prefix": "query: ",
    "document_prefix": "passage: "
}

It takes around 60 minutes to fine-tune all-MiniLM-L6-v2 on the MS MARCO hard negatives dataset on a single RTX 4090 GPU.
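
Since nixietune builds on sentence-transformers, the checkpoint written to output_dir can then be used like any other embedding model. A minimal sketch, assuming the checkpoint is saved in the standard sentence-transformers layout and using the prefixes from the config above:

from sentence_transformers import SentenceTransformer

# load the fine-tuned checkpoint from the output_dir of the config above
model = SentenceTransformer("minilm-msmarco-infonce8")

# the model was trained with prefixes, so prepend them at inference time too
query_emb = model.encode("query: best pizza in berlin")
doc_emb = model.encode("passage: Standard Serious Pizza")
print(query_emb.shape, doc_emb.shape)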

Choosing the best parameters

The following training parameters are worth tuning:

  • target: the training recipe. Currently supported targets are infonce/cosine_similarity/contrastive/triplet. If not sure, start with infonce.
  • model_name_or_path: which model to fine-tune. Any SBERT-supported model should work.
  • per_device_train_batch_size: batch size. Too small values lead to sub-par quality and slow training; too large values need a lot of VRAM. Start with 128 and go up.
  • seq_len: context length of the model. Usually it's around 128-160 for most models on the MTEB leaderboard.
  • gradient_checkpointing: reduces VRAM usage significantly (up to 70%) with a small ~10% performance penalty, as activations are recomputed instead of stored. If unsure, choose true.
  • num_negatives: for infonce/triplet targets, how many negatives from the dataset to select.
  • query_prefix and document_prefix: prompt labels for asymmetric models - when the model can distinguish between query and document passages.
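
A quick way to experiment with these is to copy the example config and override only the fields being tuned. A rough sketch (the derived file names, output_dir, and the chosen values are hypothetical, but every key appears in the msmarco.json example above):

import json

# start from the example config and tweak only the parameters worth tuning
with open("examples/msmarco.json") as f:
    config = json.load(f)

config.update({
    "target": "triplet",                  # try another supported recipe
    "per_device_train_batch_size": 128,   # smaller batch if VRAM is tight
    "num_negatives": 4,                   # fewer dataset negatives per query
    "output_dir": "minilm-msmarco-triplet",
})

with open("examples/msmarco-triplet.json", "w") as f:
    json.dump(config, f, indent=2)

Then run it the same way: python -m nixietune examples/msmarco-triplet.json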

License

Apache 2.0

Download files

Download the file for your platform.

Source Distribution

nixietune-0.0.4.tar.gz (56.5 kB, Source)

Built Distribution

nixietune-0.0.4-py3-none-any.whl (21.0 kB, Python 3)

File details

Details for the file nixietune-0.0.4.tar.gz.

File metadata

  • Download URL: nixietune-0.0.4.tar.gz
  • Upload date:
  • Size: 56.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for nixietune-0.0.4.tar.gz

  • SHA256: d6859cfd4864b608ae890f94935b42c76b38b5e07ad8af725d8f0636b2237eaf
  • MD5: d930cbe682f3d5d75142bd3a23cb5500
  • BLAKE2b-256: 5542ffcad41ee8414c66c8496c7792f8b11eed2a67f2f3947abaff1d99fe7e6c


File details

Details for the file nixietune-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: nixietune-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 21.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for nixietune-0.0.4-py3-none-any.whl

  • SHA256: a51bee0a7a82d6ffeb70751d4b98558ca6db7b63bd574e2617479d84ad4665a5
  • MD5: 13463c6f1e1b0d24e3e3bac252ddb663
  • BLAKE2b-256: b440c6ba5c0727b5149a981fa5733c418ad727d432cefc173b8fbba32902a318

