Pre-train Static Embedders
Tokenlearn
Tokenlearn is a method for pre-training Model2Vec static embedding models.
The method is described in detail in our Tokenlearn blogpost.
Quickstart
Install the package with:
pip install tokenlearn
The basic usage of Tokenlearn consists of two CLI scripts: featurize and train.
Tokenlearn is trained on mean embeddings ("means") produced by a sentence transformer. To create the means, use the tokenlearn-featurize CLI:
python3 -m tokenlearn.featurize --model-name "baai/bge-base-en-v1.5" --output-dir "data/c4_features"
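To make the idea of a "mean" concrete, here is a minimal, dependency-free sketch of mean pooling. It assumes that a mean is the average of the per-token output vectors a sentence transformer produces for one text. The function name `mean_pool` and the toy vectors are hypothetical and for illustration only; this is not the package's own implementation, which may treat special tokens or batching differently.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length token vectors into a single vector.

    Hypothetical helper for illustration; not part of Tokenlearn.
    """
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    count = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Toy example: three 2-dimensional "token embeddings" for one text.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_mean = mean_pool(toy_tokens)
print(toy_mean)
```

In practice the token vectors would come from the sentence transformer named in `--model-name`, and the featurize step would write the resulting means to `--output-dir`.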
To train a model on the featurized data, use the tokenlearn-train CLI:
python3 -m tokenlearn.train --model-name "baai/bge-base-en-v1.5" --data-path "data/c4_features" --save-path "<path-to-save-model>"
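If you prefer to drive both steps from a single Python script, the two commands above can be chained with `subprocess`. The sketch below is one possible way to do this, not part of Tokenlearn itself. It uses only the flags shown above; the helper names (`featurize_command`, `train_command`, `run_pipeline`) and the save path in the usage comment are hypothetical.

```python
import subprocess
import sys
from typing import List


def featurize_command(model_name: str, output_dir: str) -> List[str]:
    """Build the featurize command with the flags shown in this README."""
    return [
        sys.executable, "-m", "tokenlearn.featurize",
        "--model-name", model_name,
        "--output-dir", output_dir,
    ]


def train_command(model_name: str, data_path: str, save_path: str) -> List[str]:
    """Build the train command with the flags shown in this README."""
    return [
        sys.executable, "-m", "tokenlearn.train",
        "--model-name", model_name,
        "--data-path", data_path,
        "--save-path", save_path,
    ]


def run_pipeline(model_name: str, feature_dir: str, save_path: str) -> None:
    """Run featurize, then train; stop at the first failing step."""
    subprocess.run(featurize_command(model_name, feature_dir), check=True)
    subprocess.run(train_command(model_name, feature_dir, save_path), check=True)


# Example usage (the save path is a hypothetical placeholder):
# run_pipeline("baai/bge-base-en-v1.5", "data/c4_features", "models/my_model")
```

Passing `check=True` makes a failed featurize step raise an exception, so training never starts on incomplete features.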
Training will create two models:
- The base trained model.
- The base model with weighting applied. This is the model that should be used for downstream tasks.
NOTE: the code assumes that the padding token ID in your tokenizer is 0. If this is not the case, you will need to modify the code.
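A quick way to verify the padding assumption before training is to inspect the tokenizer's `pad_token_id`. The sketch below assumes a Hugging Face style tokenizer, which exposes that attribute; the helper name `assert_pad_token_is_zero` is hypothetical and not part of Tokenlearn.

```python
def assert_pad_token_is_zero(tokenizer) -> None:
    """Raise ValueError unless the tokenizer's padding token ID is 0.

    Hypothetical helper; works with any object exposing `pad_token_id`.
    """
    pad_id = getattr(tokenizer, "pad_token_id", None)
    if pad_id != 0:
        raise ValueError(
            f"Expected padding token ID 0, but the tokenizer reports {pad_id!r}."
        )


# Example usage (requires the `transformers` package and a model download):
# from transformers import AutoTokenizer
# assert_pad_token_is_zero(AutoTokenizer.from_pretrained("baai/bge-base-en-v1.5"))
```

If the check fails, the code needs to be adapted as described in the note above before you train.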
Evaluation
To evaluate a model, first install the optional evaluation dependencies:
pip install evaluation@git+https://github.com/MinishLab/evaluation@main
Then run the following code:
from model2vec import StaticModel
from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta
# Get all available tasks
tasks = get_tasks()
# Define the CustomMTEB object with the specified tasks
evaluation = CustomMTEB(tasks=tasks)
# Load a trained model
model_name = "tokenlearn_model"
model = StaticModel.from_pretrained(model_name)
# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
name=model_name, revision="no_revision_available", release_date=None, languages=None
)
# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder="results")
# Parse the results and summarize them
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)
# Print the results in a leaderboard format
print(make_leaderboard(task_scores))
License
MIT