Pre-train Static Embedders

Project description

Tokenlearn

Tokenlearn is a method for pre-training Model2Vec models.

The method is described in detail in our Tokenlearn blogpost.

Quickstart

Install the package with:

pip install tokenlearn

The basic usage of Tokenlearn consists of two CLI scripts: featurize and train.

Tokenlearn is trained on mean embeddings ("means") produced by a sentence transformer. To create these means, use the featurize CLI:

python3 -m tokenlearn.featurize --model-name "baai/bge-base-en-v1.5" --output-dir "data/c4_features"

NOTE: by default, featurization runs on the C4 dataset. To featurize a different dataset, pass the dataset arguments explicitly:

python3 -m tokenlearn.featurize \
    --model-name "baai/bge-base-en-v1.5" \
    --output-dir "data/c4_features" \
    --dataset-path "allenai/c4" \
    --dataset-name "en" \
    --dataset-split "train"
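Conceptually, the featurization step passes each text through the sentence transformer and stores the mean over its token embeddings. A minimal NumPy sketch of masked mean pooling (toy numbers for illustration; this is not the actual tokenlearn code):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, dim) per-token vectors from the encoder.
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    count = mask.sum()
    return summed / count

# Toy example: three token vectors, the last one is padding.
emb = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
mask = np.array([1, 1, 0])
print(mean_pool(emb, mask))  # mean of the first two rows: [2. 3.]
```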

To train a model on the featurized data, the tokenlearn-train CLI can be used:

python3 -m tokenlearn.train --model-name "baai/bge-base-en-v1.5" --data-path "data/c4_features" --save-path "<path-to-save-model>"

Training will create two models:

  • The base trained model.
  • The base model with weighting applied. This is the model that should be used for downstream tasks.

NOTE: the code assumes that the padding token ID in your tokenizer is 0. If this is not the case, you will need to modify the code.

Evaluation

To evaluate a model, you can use the following command after installing the optional evaluation dependencies:

pip install evaluation@git+https://github.com/MinishLab/evaluation@main

from model2vec import StaticModel

from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta

# Get all available tasks
tasks = get_tasks()
# Define the CustomMTEB object with the specified tasks
evaluation = CustomMTEB(tasks=tasks)

# Load a trained model
model_name = "tokenlearn_model"
model = StaticModel.from_pretrained(model_name)

# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
    name=model_name, revision="no_revision_available", release_date=None, languages=None
)

# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder="results")

# Parse the results and summarize them
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)

# Print the results in a leaderboard format
print(make_leaderboard(task_scores))

License

MIT

Download files


Source Distribution

tokenlearn-0.1.2.tar.gz (149.3 kB)

Uploaded Source

Built Distribution


tokenlearn-0.1.2-py3-none-any.whl (12.3 kB)

Uploaded Python 3

File details

Details for the file tokenlearn-0.1.2.tar.gz.

File metadata

  • Download URL: tokenlearn-0.1.2.tar.gz
  • Size: 149.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.14

File hashes

Hashes for tokenlearn-0.1.2.tar.gz
Algorithm Hash digest
SHA256 b81ebfc272630b293064ae7fcdfd0cfa4b3f0b9366b822d08b8cf25e931d49c0
MD5 e2687d0dfc390f19b7b63cfe93c6437a
BLAKE2b-256 550eeac0f3d40ecd8c26338b854b5be8ebd5a3c7cc05c7bc18d20a66ab8046df


File details

Details for the file tokenlearn-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: tokenlearn-0.1.2-py3-none-any.whl
  • Size: 12.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.14

File hashes

Hashes for tokenlearn-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 533f79dcbf063ef71e4b4b74b7d0b4fa75440d74a9a7ad4b4cc4c1a9fb60b952
MD5 6b6e11a811974d6292b3f1c2e6bc14ba
BLAKE2b-256 6e4663035e024a33684a6538e37057b9c9dd28f1b2ce0b1f8f26d8e83e1d01de

