
Pre-train Static Embedders


Tokenlearn

Tokenlearn is a method for pre-training Model2Vec models.

The method is described in detail in our Tokenlearn blog post.

Quickstart

Install the package with:

pip install tokenlearn

The basic usage of Tokenlearn consists of two CLI scripts: featurize and train.

Tokenlearn is trained on mean embeddings produced by a sentence transformer. To create these means, use the tokenlearn.featurize CLI:

python3 -m tokenlearn.featurize --model-name "BAAI/bge-base-en-v1.5" --output-dir "data/c4_features"
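Conceptually, featurization reduces each passage to a single target vector by mean-pooling the sentence transformer's token embeddings over the non-padding positions. The snippet below is a minimal NumPy sketch of that pooling step, not the actual featurize implementation, and the toy embedding values are made up for illustration:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average one text's token embeddings, ignoring padding positions."""
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # (dim,)
    counts = mask.sum()                             # number of real tokens
    return summed / counts

# Toy example: 4 tokens of dimension 3; the last position is padding.
embeddings = np.array([[1.0, 2.0, 3.0],
                       [3.0, 2.0, 1.0],
                       [2.0, 2.0, 2.0],
                       [9.0, 9.0, 9.0]])  # padding row, masked out
mask = np.array([1, 1, 1, 0])
print(mean_pool(embeddings, mask))  # → [2. 2. 2.]
```

Only the three real tokens contribute to the mean; the padding row is excluded by the mask.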

To train a model on the featurized data, use the tokenlearn.train CLI:

python3 -m tokenlearn.train --model-name "BAAI/bge-base-en-v1.5" --data-path "data/c4_features" --save-path "<path-to-save-model>"

Training will create two models:

  • The base trained model.
  • The base model with weighting applied. This is the model that should be used for downstream tasks.

NOTE: the code assumes that the padding token ID in your tokenizer is 0. If that is not the case, you will need to modify the code.
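The padding assumption matters because padding positions are identified by comparing token IDs against that fixed ID. The sketch below (an illustration, not Tokenlearn's actual code) shows how a mask derived from `pad_id = 0` would silently mask the wrong positions if your tokenizer pads with a different ID; the token ID values are arbitrary:

```python
import numpy as np

PAD_ID = 0  # Tokenlearn's assumption; adjust if your tokenizer pads differently.

def padding_mask(token_ids: np.ndarray, pad_id: int = PAD_ID) -> np.ndarray:
    """Return 1 for real tokens and 0 for padding positions."""
    return (token_ids != pad_id).astype(np.int64)

# A toy batch of two sequences, right-padded with ID 0.
batch = np.array([[101, 2023, 2003, 0, 0],
                  [101, 7592, 0, 0, 0]])
print(padding_mask(batch))
# → [[1 1 1 0 0]
#    [1 1 0 0 0]]
```

If your tokenizer used, say, ID 1 for padding, the mask above would treat the padding as real tokens and corrupt the pooled means.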

Evaluation

To evaluate a model, first install the optional evaluation dependencies:

pip install evaluation@git+https://github.com/MinishLab/evaluation@main

Then run the following:

from model2vec import StaticModel

from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta

# Get all available tasks
tasks = get_tasks()
# Define the CustomMTEB object with the specified tasks
evaluation = CustomMTEB(tasks=tasks)

# Load a trained model
model_name = "tokenlearn_model"
model = StaticModel.from_pretrained(model_name)

# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
    name=model_name, revision="no_revision_available", release_date=None, languages=None
)

# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder="results")

# Parse the results and summarize them
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)

# Print the results in a leaderboard format
print(make_leaderboard(task_scores))

License

MIT

