Nearly Inference Free embedding models in python

Project description

pyNIFE

NIFE compresses large embedding models into static, drop-in replacements with up to 200x faster query embedding see benchmarks.

Features

200x faster CPU query embedding
Fully aligned with their teacher models
Re-use your existing vector index

Introduction

Nearly Inference Free Embedding (NIFE) models are static embedding models that are fully aligned with a much larger model. Because static models are so small and fast, NIFE allows you to:

Speed up query time immensely: 200x embed time speed-up on CPU.
Get away with using a much smaller memory/compute footprint. Create embeddings in your DB service.
Reuse your big model index: Switch dynamically between your big model and the NIFE model.

Some possible use-cases for NIFE include search engines with slow and fast paths, RAGs in agent loops, and on-the-fly document comparisons.

Quickstart

This snippet loads stephantulkens/NIFE-mxbai-embed-large-v1, which is aligned with mixedbread-ai/mxbai-embed-large-v1. Use it in any spot where you use mixedbread-ai/mxbai-embed-large-v1.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("stephantulkens/NIFE-mxbai-embed-large-v1", device="cpu")
# Loads in 41ms.
query_vec = model.encode(["What is the capital of France?"])
# Embedding a query takes 90.4 microseconds.

big_model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", device="cpu")
# Four cities near France
index_doc = big_model.encode(["Paris is the largest city in France", "Lyon is pretty big", "Antwerp is really great, and in Belgium", "Berlin is pretty gloomy in winter", "France is a country in Europe"])

similarity = model.similarity(query_vec, index_doc)
print(similarity)
# It correctly retrieved the document containing the statement about paris.
# tensor([[0.7065, 0.5012, 0.3596, 0.2765, 0.6648]])

big_model_query_vec = big_model.encode(["What is the capital of France?"])
# Embedding a query takes 68.1 ms (~750 times slower)
similarity = model.similarity(big_model_query_vec, index_doc)
# Compare to the above. Very similar.
# tensor([[0.7460, 0.5301, 0.3816, 0.3423, 0.6692]])

similarity_queries = model.similarity(big_model_query_vec, query_vec)
# The two vectors are very similar.
# tensor([[0.9377]])

This snippet is an example of how you could use it. But in reality you should just use it wherever you encode a query using your teacher model. There's no need to keep the teacher in memory. This makes NIFE extremely flexible, because you can decouple the inference model from the indexing model. Because the models load extremely quickly, they can be used in edge environments and one-off things like lambda functions.

Installation

On PyPi:

pip install pynife

Usage

A NIFE model is just a sentence transformer router model, so you don't need to install pynife to use NIFE models. Nevertheless, NIFE contains some helper functions for loading a model trained with NIFE.

Note that with all NIFE models the teacher model is unchanged; so if you have a large set of documents indexed with the teacher model, you can use the NIFE model as a drop-in replacement.

Standalone

Use just like any other sentence transformer:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("stephantulkens/NIFE-mxbai-embed-large-v1", device="cpu")
X = model.encode(["What is the capital of France?"])

As a router

You can also use the small model and big model together as a single router using a helper function from pynife. This is useful for benchmarking; in production you should probably use the query model by itself.

from pynife import load_as_router

model = load_as_router("stephantulkens/NIFE-mxbai-embed-large-v1")
# Use the fast model
query = model.encode_query("What is the capital of France?")
# Use the slow model
docs = model.encode_document("What is the capital of France?")

print(model.similarity(query, docs))
# Same result as above in the quickstart.
# tensor([[0.9377]])

Rationale

For retrieval using dense models, the normal mode of operation is to embed your documents, and put them in some index. Then, using that same model, also embed your queries. In general, larger embedding models are better than smaller models, so you're often better off by making your embedder as large as possible. This however, makes inference more difficult; you need to host a larger model, and embedding queries might take longer.

For sparse models, like SPLADE, there is an interesting alternative, which they call doc-SPLADE, and which sentence transformers calls inference free. In doc-SPLADE, you only embed using the full model for documents in your index. When querying, you just index the sparse index using the tokenizer.

NIFE is the answer to the question: what would inference free dense retrieval be? It is called Nearly Inference Free, because you still need to have some mapping from tokens to embeddings.

See this table:

	Sparse	Dense
Full	SPLADE	Sentence transformer
Inference free	doc-SPLADE	NIFE

As in doc-SPLADE, you lose performance. No real way about it, but as with other fast models, the gap is smaller than you might think.

How does it work?

We use knowledge distillation from an initialized static model to the teacher we want to emulate. Some special things:

The static model is initialized directly from the teacher by inferring all tokens in the tokenizer through the whole model. This is similar to how this was done in model2vec, except we skip the PCA and weighting steps.
The knowledge distillation is done in cosine space. We don't guarantee any alignment in euclidean space. Using, e.g., MSE or KLDiv between the student and teacher did not work as well.
We train a custom tokenizer on our pre-training corpus, which is MsMARCO. This custom tokenizer is based on bert-base-uncased, but with a lot of added vocabulary. The models used in NIFE all have around 100k vocabulary size.
We perform two stages of training; following LEAF, we also train on queries. This raises performance considerably, but training on interleaved queries and documents does not work very well. So we first train on a corpus of documents (MsMarco), and then finetune using a lower learning rate on a large selection of queries from a variety of sources.
Unlike LEAF, we leave out all instructions from the knowledge distillation process. Static models can't deal with instructions, because there is no interaction between the instruction and other tokens in the document. Instructions can therefore at best be a constant offset of your embedding space. This can be really useful, but not for this specific task.

Caveats/weaknesses

NIFE can't do the following things:

Ignore words based on context: the query "What is the capital of France?" the word "France" will cause documents containing the term "France" to be retrieved. There is no way for the model to attenuate this vector and morph it into the answer ("Paris").
Deal with negation: for the same reason as above; there is no interaction between tokens, so the similarity between "Cars that aren't red" and "Cars that are red" will be really high.

Creating a NIFE model

To create a NIFE model, you can run the scripts in scripts, or directly use the code from the repository. First, you should create a corpus of embeddings for your embedder. You can also use pre-computed collections of embeddings I created:

Broadly construed, training a NIFE model has 5 separate steps.

1. Create a set of embeddings using the teacher

Let's assume we want to create embeddings on trivia QA, using mxbai-embed-large-v1 as a teacher.

from datasets import load_dataset
from pynife.distillation.infer import generate_and_save_embeddings
from sentence_transformers import SentenceTransformer

model_name = "mixedbread-ai/mxbai-embed-large-v1"
model = SentenceTransformer(model_name)

dataset_name = "mandarjoshi/trivia_qa"
dataset = load_dataset(dataset_name, "rc", split="train")
dataset_iterator = (x['question'] for x in dataset)

output_directory = "my-trivia-qa"

generate_and_save_embeddings(
    model=model,
    records=dataset_iterator,
    output_folder=output_directory,
    limit_batches=None,
    batch_size=8,
    save_every=512,
    max_length=512,
    model_name=model_name,
    dataset_name=dataset_name,
    lowercase=False,
    make_greedy=False,
    )

This piece of code loads the model, the dataset and then starts inference. Inference takes a while, and will stream snippets to disk as .txt files and torch tensor files. After the whole dataset has been inferenced, the .txt and tensor files are converted into parquet files, and the .txt and torch tensor files are deleted.

Your dataset will be ready and saved as parquet files in output_directory. If you want to upload these, please use the HfAPI, not dataset.push_to_hub, because we rely on some metadata embedded in the README to infer the base model later on. Note that the dataset iterator can be anything, and does not need to be a Hugging Face dataset. For example, it could also work with a stream from your database.

For a simple inference script with a lot of pre-made datasets, see the infer_datasets script.

2. (optional) Expanding a tokenizer

NIFE models work really well if you create a custom tokenizer for your domain. Empirically, it also works really well if you just expand the tokenizer of your teacher model with additional words. We call this tokenizer expansion. We have a pre-defined corpus to work on:

from transformers import AutoTokenizer

from datasets import load_dataset
from pynife.tokenizer.expand_tokenizer import expand_tokenizer


dataset = load_dataset("stephantulkens/msmarco-vocab", split="train")
print(dataset.tolist()[:5])
# [{'token': '.', 'frequency': 36174594, 'document_frequency': 8701009},
# {'token': 'the', 'frequency': 28806701, 'document_frequency': 7712172},
# {'token': ',', 'frequency': 25825435, 'document_frequency': 7411743},
# {'token': 'of', 'frequency': 15196930, 'document_frequency': 6562023},
# {'token': 'a', 'frequency': 13702107, 'document_frequency': 6064770},

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Function expects an iterator over dictionaries with "token" and "frequency" as keys.
new_tokenizer = expand_tokenizer(tokenizer, data, new_vocabulary_size=30000)
new_tokenizer.save_pretrained("my_tokenizer")

This will do a couple of things:

It will remove all tokens from the original tokenizer that aren't present in your data.
It will then add the most frequent tokens until the size of the tokenizer == new_vocabulary_size.

This works a lot better than training a tokenizer from scratch on equivalent data. For a runnable version, see the expand_tokenizer script.

To get frequency counts, you can use count_tokens_in_dataset, as follows:

from datasets import load_dataset, Dataset

from pynife.tokenizer.count_vocabulary import count_tokens_in_dataset

dataset = load_dataset("sentence-transformers/msmarco", "corpus", split="train", streaming=True)
dataset_iterator = (item["passage"] for item in dataset)
counts = count_tokens_in_dataset(dataset_iterator)

# Save the counts as a dataset if you want.
dataset = Dataset.from_list(counts, split="train")
dataset.push_to_hub("my_hub")

This dataset can be used directly to expand your tokenizer, above. For a runnable version, see the create_vocabulary script

3. Train

Given a dataset and optionally a tokenizer, there's two steps to complete for a successful training.

3a Initialize a static model using your teacher

Using your teacher model, initialize a static model. For example, when using mixedbread-ai/mxbai-embed-large-v1:

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

from pynife.initialization import initialize_from_model

teacher = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
# The tokenizer you trained in step 2. or an off-the-shelf tokenizer.
tokenizer = AutoTokenizer.from_pretrained("my_tokenizer")
model = initialize_from_model(teacher, tokenizer)

3b Actually train

Now you can train, just like a regular sentence transformer. In my experiments, I found that using the cosine distance as a loss function was superior to using MSE, so I recommend using that, find it in pynife.losses. In addition, I also recommend using Matryoshka Representation Learning. There's a bunch of helper functions in pynife to make training easier. In general, I recommend using hyperparameters like the following:

batch_size: 128
learning rate: 0.01
scheduler: "cosine_warmup_with_min_lr"
warmup_ratio: 0.1
weight_decay: 0.01
epochs: 5

It can be tempting to move to very high batch sizes, but this has a very large detrimental effect on performance, even with higher learning rates. As a consequence, GPU usage during training is actually pretty low, because there's very little actual computation happening. For a complete runnable training loop, including model initialization, see the training script.

from pynife.losses import CosineLoss
from pynife.data import get_datasets

# Fill with datasets you trained yourself.
datasets_you_made = [""]
train_dataset = get_datasets(datasets_you_made)

# Model is initialized in step 3a.
loss = CosineLoss(model=model)

# Train as usual.

This will train a model and report the result to wandb. The experiment_distillation script is otherwise completely the same as a regular sentence transformers training loop, so there's very little actual code involved.

License

MIT

Author

Stéphan Tulkens

Project details

Release history Release notifications | RSS feed

This version

0.1.0

Nov 3, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pynife-0.1.0.tar.gz (971.2 kB view details)

Uploaded Nov 3, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pynife-0.1.0-py3-none-any.whl (28.9 kB view details)

Uploaded Nov 3, 2025 Python 3

File details

Details for the file pynife-0.1.0.tar.gz.

File metadata

Download URL: pynife-0.1.0.tar.gz
Upload date: Nov 3, 2025
Size: 971.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for pynife-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c64a7cc9832e56706a82b0812b12c17ca087def6493c7765984b78f43d51d14e`
MD5	`81811719f03d030d530ba0d5da147d13`
BLAKE2b-256	`7acc824285f1597b1cd80d43e8776b261ce96317eb315c703889f687635a6d84`

See more details on using hashes here.

File details

Details for the file pynife-0.1.0-py3-none-any.whl.

File metadata

Download URL: pynife-0.1.0-py3-none-any.whl
Upload date: Nov 3, 2025
Size: 28.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for pynife-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cd3dd42c7c16ed19472cee800ab0373a6b03e2ef4d227274bded3de14ede870a`
MD5	`e1c143478c3b4d86812cbe591b7839aa`
BLAKE2b-256	`9981aba494a609e90f9fb156ae772acebe4ef6ca8edd455312a5ec27b9ed38bb`

See more details on using hashes here.

pynife 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

pyNIFE

Features

Introduction

Quickstart

Installation

Usage

Standalone

As a router

Rationale

How does it work?

Caveats/weaknesses

Creating a NIFE model

1. Create a set of embeddings using the teacher

2. (optional) Expanding a tokenizer

3. Train

3a Initialize a static model using your teacher

3b Actually train

License

Author

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes