Pre-train Static Embedders

Tokenlearn

Tokenlearn is a method for pre-training Model2Vec static embedding models.

The method is described in detail in our Tokenlearn blog post.

Quickstart

Install the package with:

pip install tokenlearn

Basic usage of Tokenlearn consists of two CLI scripts: featurize and train.

Tokenlearn is trained using means from a sentence transformer. To create these means, use the tokenlearn-featurize CLI:

python3 -m tokenlearn.featurize --model-name "baai/bge-base-en-v1.5" --output-dir "data/c4_features"
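
To make "means" concrete: the sketch below assumes the term refers to the average of the token embeddings a sentence transformer produces for one text, with padding positions left out. That reading is an assumption, and the helper `mean_pool` and its toy inputs are hypothetical; this illustrates mean pooling in general and is not the tokenlearn implementation.

```python
from typing import Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> list:
    """Average the token vectors whose mask entry is 1 (padding has mask 0)."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: three real tokens and one padding position, two dimensions each.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

sentence_mean = mean_pool(token_vectors, mask)
print(sentence_mean)  # [3.0, 4.0]
```

In a real pipeline the token vectors would come from the sentence transformer named in `--model-name`, and one such mean would be stored per text.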

NOTE: by default, featurization uses the C4 dataset. To use a different dataset, set the dataset options explicitly. The command below shows these options with the C4 values; replace them with your own:

python3 -m tokenlearn.featurize \
    --model-name "baai/bge-base-en-v1.5" \
    --output-dir "data/c4_features" \
    --dataset-path "allenai/c4" \
    --dataset-name "en" \
    --dataset-split "train"

To train a model on the featurized data, use the tokenlearn-train CLI:

python3 -m tokenlearn.train --model-name "baai/bge-base-en-v1.5" --data-path "data/c4_features" --save-path "<path-to-save-model>"
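
The featurize and train steps can also be assembled from a script, which keeps the feature directory consistent between them. The following is a minimal sketch that only builds and prints the two commands shown above. The helper names `featurize_command` and `train_command` and the save path `models/example` are hypothetical; the flags are the ones used in the commands above.

```python
import shlex


def featurize_command(model_name, output_dir, dataset_path=None, dataset_name=None, dataset_split=None):
    """Build the argv list for the featurize step; dataset flags are added only when given."""
    cmd = ["python3", "-m", "tokenlearn.featurize",
           "--model-name", model_name, "--output-dir", output_dir]
    optional = {
        "--dataset-path": dataset_path,
        "--dataset-name": dataset_name,
        "--dataset-split": dataset_split,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, value]
    return cmd


def train_command(model_name, data_path, save_path):
    """Build the argv list for the train step."""
    return ["python3", "-m", "tokenlearn.train",
            "--model-name", model_name, "--data-path", data_path, "--save-path", save_path]


features_dir = "data/c4_features"  # shared by both steps
steps = [
    featurize_command("baai/bge-base-en-v1.5", features_dir,
                      dataset_path="allenai/c4", dataset_name="en", dataset_split="train"),
    train_command("baai/bge-base-en-v1.5", features_dir, "models/example"),  # hypothetical save path
]
for cmd in steps:
    print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to `subprocess.run(cmd, check=True)`, which stops the script if a step fails.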

Training will create two models:

The base trained model.
The base model with weighting applied. This is the model that should be used for downstream tasks.

NOTE: the code assumes that the padding token ID in your tokenizer is 0. If this is not the case, you will need to modify the code.
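
A quick check before training can catch a tokenizer that breaks this assumption. The sketch below relies on the `pad_token_id` attribute that Hugging Face tokenizers expose; the helper `check_pad_token_id` is hypothetical, and the stand-in objects exist only so the example runs without downloading a tokenizer.

```python
from types import SimpleNamespace


def check_pad_token_id(tokenizer, expected: int = 0) -> None:
    """Raise ValueError if the tokenizer's padding token ID is not the expected one."""
    pad_id = getattr(tokenizer, "pad_token_id", None)
    if pad_id != expected:
        raise ValueError(f"padding token ID is {pad_id!r}, but {expected} is assumed")


# Stand-ins for real tokenizers (assumption: only pad_token_id matters here).
ok_tokenizer = SimpleNamespace(pad_token_id=0)
bad_tokenizer = SimpleNamespace(pad_token_id=1)

check_pad_token_id(ok_tokenizer)  # passes silently
try:
    check_pad_token_id(bad_tokenizer)
except ValueError as err:
    print(err)  # padding token ID is 1, but 0 is assumed
```

With a real model, the tokenizer loaded through `transformers.AutoTokenizer.from_pretrained(...)` for the chosen `--model-name` would be passed in instead of the stand-ins.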

Evaluation

To evaluate a model, first install the optional evaluation dependencies:

pip install evaluation@git+https://github.com/MinishLab/evaluation@main

Then run the following code:

from model2vec import StaticModel

from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta

# Get all available tasks
tasks = get_tasks()
# Define the CustomMTEB object with the specified tasks
evaluation = CustomMTEB(tasks=tasks)

# Load a trained model
model_name = "tokenlearn_model"
model = StaticModel.from_pretrained(model_name)

# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
    name=model_name, revision="no_revision_available", release_date=None, languages=None
)

# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder="results")

# Parse the results and summarize them
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)

# Print the results in a leaderboard format
print(make_leaderboard(task_scores))

License

MIT

Release history

0.2.1 (this version): Jun 2, 2025
0.2.0: May 30, 2025
0.1.2: Mar 7, 2025
0.1.1: Dec 14, 2024
0.1.0: Nov 10, 2024

Download files

Source Distribution

tokenlearn-0.2.1.tar.gz (149.0 kB), uploaded Jun 2, 2025, Source

Built Distribution

tokenlearn-0.2.1-py3-none-any.whl (11.9 kB), uploaded Jun 2, 2025, Python 3

File details

Details for the file tokenlearn-0.2.1.tar.gz.

File metadata

Download URL: tokenlearn-0.2.1.tar.gz
Upload date: Jun 2, 2025
Size: 149.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for tokenlearn-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`b1cdb5cb1bb9f60d132143cf78d8b38db69a85f044157788c40bd48315ea82be`
MD5	`c7adf3286d5c12d6c68afd42f72e4e6b`
BLAKE2b-256	`124c9dd1c2383c517442f666ea2419d02bf5aa522fba84e5da8b09294e0d1399`
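
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper `sha256_of` is hypothetical, and it assumes the archive was saved under its original file name in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the table above.
EXPECTED = "b1cdb5cb1bb9f60d132143cf78d8b38db69a85f044157788c40bd48315ea82be"
archive = Path("tokenlearn-0.2.1.tar.gz")  # assumed download location

if archive.exists():
    print("match" if sha256_of(archive) == EXPECTED else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```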


File details

Details for the file tokenlearn-0.2.1-py3-none-any.whl.

File metadata

Download URL: tokenlearn-0.2.1-py3-none-any.whl
Upload date: Jun 2, 2025
Size: 11.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for tokenlearn-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6961ef09418701cdb4badf83139d7aadf08c7bbfb5577e0276f1cb8e9b087792`
MD5	`031f0aadeb940c4373038376bcb527a9`
BLAKE2b-256	`8cdaddc617c9f6026c5ae59a0aae6dbaebf6dfc2c81b9c8cfe4ff75e5518eca7`
