WordLlama Embedding Utility

Project description

WordLlama

The power of 15 trillion tokens of training, extracted, flogged and minimized into a cute little package for word embedding.

Quick Start

Install:

pip install wordllama

Load the 256-dim model.

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882

# Rank documents based on their similarity to a query
query = "i went to the car"
candidates = ["i went to the park", "i went to the shop", "i went to the truck", "i went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('i went to the vehicle', 0.7441646856486314),
#   ('i went to the truck', 0.2832691551894259),
#   ('i went to the shop', 0.19732814982305436),
#   ('i went to the park', 0.15101404519322253)
# ]

# additional inference methods
wl.deduplicate(candidates, threshold=0.8) # fuzzy deduplication
wl.cluster(candidates, k=2, max_iterations=100, tolerance=1e-4) # cluster labels using kmeans/kmeans++ init
wl.filter(query, candidates, threshold=0.3) # filter candidates based on query
wl.topk(query, candidates, k=3) # return topk strings based on query

What is it?

WordLlama is an NLP utility and word embedding model that recycles components from large language models (LLMs) to create efficient and compact word representations, in the spirit of GloVe, Word2Vec or FastText. WordLlama begins by extracting the token embedding codebook from a state-of-the-art LLM (e.g., Llama 3 70B) and training a small, context-less model in a general-purpose embedding framework.

WordLlama improves on all MTEB benchmarks over word models like GloVe 300d, while being substantially smaller in size (16 MB default model at 256 dimensions vs. >2 GB).

Features of WordLlama include:

  1. Matryoshka Representations: Truncate embedding dimension as needed.
  2. Low Resource Requirements: A simple token lookup with average pooling enables fast operation on CPU.
  3. Binarization: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations. (coming soon)
  4. Numpy-only inference: Lightweight and simple.

For flexibility, WordLlama employs the Matryoshka representation learning technique during training. The largest model (1024-dim) can be truncated to 64, 128, 256 or 512 dimensions. For binary embedding models, we implement straight-through estimators during training. For dense embeddings, 256 dimensions captures most of the performance, while for binary embeddings validation accuracy is close to saturation at 512 dimensions (64 bytes packed).
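
As a quick illustration of truncation, here is a minimal sketch using the load() interface from the Quick Start (the loop and sentence are illustrative, not from the library docs):

from wordllama import WordLlama

# The same underlying model can be loaded at several truncation dimensions;
# smaller dimensions trade a little accuracy for smaller vectors.
for dim in (64, 128, 256):
    wl = WordLlama.load(trunc_dim=dim)
    emb = wl.embed(["the quick brown fox jumps over the lazy dog"])
    print(dim, emb.shape)  # (1, 64), (1, 128), (1, 256)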

The final weights are saved after weighting, projection and truncation of the entire tokenizer vocabulary. Thus, WordLlama becomes a single embedding matrix (nn.Embedding) that is considerably smaller than the gigabyte-sized LLM codebooks we start with. The original tokenizer is still used to preprocess the text into tokens, and the reduced-size token embeddings are average pooled. There is very little computation required, and the resulting model sizes range from 16 MB to 250 MB for the 128k Llama 3 vocabulary.
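
As a rough numpy sketch of that inference path (the matrix and token IDs below are stand-ins for illustration, not the library's internals):

import numpy as np

# Stand-in embedding matrix: (vocab_size, dim); the real one is loaded from disk
embedding_matrix = np.random.rand(128_256, 256).astype(np.float32)

def embed_tokens(token_ids):
    # Look up each token's vector, then average pool over the sequence
    return embedding_matrix[token_ids].mean(axis=0)

print(embed_tokens([101, 2023, 2003, 102]).shape)  # (256,)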

It's a good option for NLP-lite tasks. You can train sklearn classifiers on it, perform basic semantic matching, fuzzy deduplication, ranking and clustering. I think it should work well for creating LLM output evaluators, or for other preparatory tasks involved in multi-hop or agentic workflows. You can perform your own LLM surgery and train your own model on consumer GPUs in a few hours. Because it is fast and compact, it makes a good "Swiss-Army Knife" utility for exploratory analysis and utility applications.
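
For instance, here is a minimal sketch of training a scikit-learn classifier on WordLlama embeddings (the toy texts and labels are made up for illustration):

from sklearn.linear_model import LogisticRegression
from wordllama import WordLlama

wl = WordLlama.load()

# Toy sentiment data, for illustration only
texts = ["great product, would buy again", "works exactly as advertised",
         "broke after one day", "total waste of money"]
labels = [1, 1, 0, 0]

X = wl.embed(texts)  # (4, 256) numpy array
clf = LogisticRegression().fit(X, labels)
print(clf.predict(wl.embed(["pretty happy with this purchase"])))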

MTEB Results (l2_supercat)

| Metric | WL64 | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2 |
|--------|------|-------|-----------|-------|--------|------------|----------|------------------|
| Clustering | 30.27 | 32.20 | 33.25 | 33.40 | 33.62 | 27.73 | 26.57 | 42.35 |
| Reranking | 50.38 | 51.52 | 52.03 | 52.32 | 52.39 | 43.29 | 44.75 | 58.04 |
| Classification | 53.14 | 56.25 | 58.21 | 59.13 | 59.50 | 57.29 | 57.65 | 63.05 |
| Pair Classification | 75.80 | 77.59 | 78.22 | 78.50 | 78.60 | 70.92 | 72.94 | 82.37 |
| STS | 66.24 | 67.53 | 67.91 | 68.22 | 68.27 | 61.85 | 62.46 | 78.90 |
| CQA DupStack | 18.76 | 22.54 | 24.12 | 24.59 | 24.83 | 15.47 | 16.79 | 41.32 |
| SummEval | 30.79 | 29.99 | 30.99 | 29.56 | 29.39 | 28.87 | 30.49 | 30.81 |

The l2_supercat is a Llama 2-vocabulary model. To train it, I concatenated codebooks from several models, including Llama 2 70B and Phi-3 medium (after removing additional special tokens). Because several models share the Llama 2 tokenizer, their codebooks can be concatenated and trained together. Performance of the resulting model is comparable to training on the Llama 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).
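
A sketch of what that concatenation step can look like when models share a tokenizer; the array names, shapes and the feature-axis choice are assumptions for illustration, not the actual training code:

import numpy as np

# Hypothetical extracted codebooks sharing the 32k Llama 2 vocabulary,
# so each row corresponds to the same token in both arrays.
llama2_codebook = np.random.rand(32_000, 8192).astype(np.float32)
phi3_codebook = np.random.rand(32_000, 5120).astype(np.float32)

# Concatenate along the feature axis; training then learns a projection
# from this wide representation down to the target embedding dimension.
supercat = np.concatenate([llama2_codebook, phi3_codebook], axis=1)
print(supercat.shape)  # (32000, 13312)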

Embed Text

Here’s how you can load pre-trained embeddings and use them to embed text:

from wordllama import WordLlama

# Load pre-trained embeddings
# truncate dimension to 64
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["the quick brown fox jumps over the lazy dog", "and all that jazz"])
print(embeddings.shape)  # (2, 64)

Binary embedding models can be used like this:

# Binary embeddings are packed into uint32
# 64-dims => array of 2x uint32 
wl = WordLlama.load(trunc_dim=64, binary=True)  # this will download the binary model from huggingface
wl.embed("I went to the car") # Output: array([[3029168104, 2427562626]], dtype=uint32)

# load a binary model trained with the straight-through estimator
wl = WordLlama.load(dim=1024, binary=True)

# Uses hamming similarity for binarized embeddings
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.57421875

ranked_docs = wl.rank("i went to the car", ["van", "truck"])

wl.binary = False # turn off hamming and use cosine

# load a different model class
wl = WordLlama.load(config="l3_supercat", dim=1024) # downloads model from HF
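
For reference, Hamming similarity over packed binary embeddings can be computed with plain numpy along these lines (an illustrative sketch, not necessarily the library's internal implementation; the second vector is hypothetical):

import numpy as np

def hamming_similarity(a, b):
    # XOR the packed embeddings, count differing bits, and map to [0, 1]
    diff_bits = np.unpackbits(np.bitwise_xor(a, b).view(np.uint8)).sum()
    total_bits = a.size * a.itemsize * 8
    return 1.0 - diff_bits / total_bits

a = np.array([3029168104, 2427562626], dtype=np.uint32)  # packed embedding from above
b = np.array([1234567890, 987654321], dtype=np.uint32)   # hypothetical second embedding
print(hamming_similarity(a, b))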

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions; either 512 or 1024 dimensions is recommended for binary embeddings.

L2 Supercat was trained using a batch size of 512 on a single A100 for 12 hours.

Roadmap

  • Working on adding inference features:
    • Semantic text splitting
  • Add example notebooks
    • DSPy evaluators
    • RAG pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for Llama 3 models). You can then use the following snippet:

from wordllama.extract import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

HINT: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest, sometimes you have to snoop around and figure it out.
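
If you need to snoop around, one way to list the tensor names and shapes in a shard is with the safetensors library (the path below reuses the placeholder above; the tensor name in the comment is a common convention, not guaranteed):

from safetensors import safe_open

# Print every tensor name and shape; the token embedding matrix is often
# named something like "model.embed_tokens.weight".
with safe_open("path/to/saved/model-0001-of-00XX.safetensors", framework="numpy") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())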

For training, use the scripts in the GitHub repo. You'll need to add a configuration file (copy and modify an existing one in the folder).

$ pip install wordllama[train]
$ python train.py train --config your_new_config
(training stuff happens)
$ python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
(saves 1 model per matryoshka dim)

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.2.3}
}

License

This project is licensed under the MIT License.

Download files

Download the file for your platform.

Source Distribution

| File | Size | Uploaded |
|------|------|----------|
| wordllama-0.2.3.post35.tar.gz | 16.5 MB | Source |

Built Distributions

| File | Size | Python | Platform |
|------|------|--------|----------|
| wordllama-0.2.3.post35-cp312-cp312-win_amd64.whl | 16.4 MB | CPython 3.12 | Windows x86-64 |
| wordllama-0.2.3.post35-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 17.6 MB | CPython 3.12 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.3.post35-cp312-cp312-macosx_12_0_arm64.whl | 16.7 MB | CPython 3.12 | macOS 12.0+ ARM64 |
| wordllama-0.2.3.post35-cp312-cp312-macosx_10_9_x86_64.whl | 16.7 MB | CPython 3.12 | macOS 10.9+ x86-64 |
| wordllama-0.2.3.post35-cp311-cp311-win_amd64.whl | 16.4 MB | CPython 3.11 | Windows x86-64 |
| wordllama-0.2.3.post35-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 17.6 MB | CPython 3.11 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.3.post35-cp311-cp311-macosx_12_0_arm64.whl | 16.6 MB | CPython 3.11 | macOS 12.0+ ARM64 |
| wordllama-0.2.3.post35-cp311-cp311-macosx_10_9_x86_64.whl | 16.7 MB | CPython 3.11 | macOS 10.9+ x86-64 |
| wordllama-0.2.3.post35-cp310-cp310-win_amd64.whl | 16.4 MB | CPython 3.10 | Windows x86-64 |
| wordllama-0.2.3.post35-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 17.5 MB | CPython 3.10 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.3.post35-cp310-cp310-macosx_12_0_arm64.whl | 16.7 MB | CPython 3.10 | macOS 12.0+ ARM64 |
| wordllama-0.2.3.post35-cp310-cp310-macosx_10_9_x86_64.whl | 16.7 MB | CPython 3.10 | macOS 10.9+ x86-64 |
| wordllama-0.2.3.post35-cp39-cp39-win_amd64.whl | 16.4 MB | CPython 3.9 | Windows x86-64 |
| wordllama-0.2.3.post35-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 17.5 MB | CPython 3.9 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.3.post35-cp39-cp39-macosx_12_0_arm64.whl | 16.7 MB | CPython 3.9 | macOS 12.0+ ARM64 |
| wordllama-0.2.3.post35-cp39-cp39-macosx_10_9_x86_64.whl | 16.7 MB | CPython 3.9 | macOS 10.9+ x86-64 |
