WordLlama Embedding Utility

Project description

WordLlama

WordLlama is a fast, lightweight NLP toolkit that handles tasks like fuzzy deduplication, similarity, and ranking with minimal inference-time dependencies, optimized for CPU hardware.

Quick Start

Install:

pip install wordllama

Load the 256-dim model.

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882

# Rank documents based on their similarity to a query
query = "i went to the car"
candidates = ["i went to the park", "i went to the shop", "i went to the truck", "i went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('i went to the vehicle', 0.7441646856486314),
#   ('i went to the truck', 0.2832691551894259),
#   ('i went to the shop', 0.19732814982305436),
#   ('i went to the park', 0.15101404519322253)
# ]

# additional inference methods
wl.deduplicate(candidates, threshold=0.8) # fuzzy deduplication
wl.cluster(candidates, k=2, max_iterations=100, tolerance=1e-4) # cluster labels using kmeans/kmeans++ init
wl.filter(query, candidates, threshold=0.3) # filter candidates based on query
wl.topk(query, candidates, k=3) # return topk strings based on query

What is it?

WordLlama is an NLP utility and word embedding model that recycles components from large language models (LLMs) to create efficient and compact word representations (in the spirit of GloVe, Word2Vec or FastText). WordLlama begins by extracting the token embedding codebook from a state-of-the-art LLM (e.g., Llama3 70B) and training a small, context-less model in a general-purpose embedding framework.

WordLlama improves on word models like GloVe 300d across all MTEB benchmarks, while being substantially smaller in size (16MB default model @ 256-dim vs. >2GB).

Features of WordLlama include:

  1. Matryoshka Representations: Truncate embedding dimension as needed.
  2. Low Resource Requirements: A simple token lookup with average pooling enables fast operation on CPU.
  3. Binarization: Models trained with the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations.
  4. Numpy-only inference: Lightweight and simple.

For flexibility, WordLlama employs the Matryoshka representation learning technique. The largest model (1024-dim) can be truncated to 64, 128, 256 or 512 dimensions. For binary embedding models, we implement straight-through estimators during training. For dense embeddings, 256 dimensions sufficiently capture most of the performance, while for binary embeddings validation accuracy is close to saturation at 512 dimensions (64 bytes packed).
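To make the truncation idea concrete, here is a minimal sketch (not the library's internal code) of slicing precomputed embeddings down to their leading dimensions and re-normalizing, which is the essence of Matryoshka-style truncation; the emb array is assumed to come from wl.embed(...) on a 1024-dim model.

import numpy as np

# Sketch only: truncate Matryoshka-trained embeddings and re-normalize.
# `emb` is assumed to be an (n, 1024) float array, e.g. from wl.embed(texts).
def truncate_and_normalize(emb: np.ndarray, dim: int = 256) -> np.ndarray:
    truncated = emb[:, :dim]  # keep the leading `dim` components
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)  # unit-normalize for cosine similarity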

The final weights are saved after weighting, projection and truncation of the entire tokenizer vocabulary. Thus, WordLlama becomes a single embedding matrix (nn.Embedding) that is considerably smaller than the gigabyte-sized LLM codebooks we start with. The original tokenizer is still used to preprocess the text into tokens, and the reduced-size token embeddings are average pooled. There is very little computation required, and the resulting model sizes range from 16MB to 250MB for the 128k Llama3 vocabulary.
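Conceptually, inference is just a token lookup followed by average pooling. A minimal sketch of that idea (not the actual library code; tokenizer and embedding_matrix are placeholders for the bundled tokenizer and the saved embedding weights):

import numpy as np

def embed_text(text, tokenizer, embedding_matrix: np.ndarray) -> np.ndarray:
    # tokenize with the original LLM tokenizer (placeholder interface)
    token_ids = tokenizer.encode(text)
    # look up the reduced-size token embeddings and average pool them
    token_vectors = embedding_matrix[token_ids]  # (num_tokens, dim)
    return token_vectors.mean(axis=0)            # (dim,)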

It's a good option for lightweight NLP tasks. You can train sklearn classifiers on it, perform basic semantic matching, fuzzy deduplication, ranking and clustering. I think it should work well for creating LLM output evaluators, or other preparatory tasks involved in multi-hop or agentic workflows. You can perform your own LLM surgery and train your own model on consumer GPUs in a few hours. Because it is fast and portable, it makes a good "Swiss Army knife" utility for exploratory analysis and utility applications.
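For example, a quick sketch of training an sklearn classifier on WordLlama embeddings (the texts and labels below are made up for illustration):

from sklearn.linear_model import LogisticRegression
from wordllama import WordLlama

wl = WordLlama.load()

# toy sentiment data, illustrative only
texts = ["great product, works well", "terrible, broke after a day",
         "exceeded my expectations", "complete waste of money"]
labels = [1, 0, 1, 0]

X = wl.embed(texts)  # dense feature matrix, one 256-dim row per text
clf = LogisticRegression().fit(X, labels)
print(clf.predict(wl.embed(["really happy with this purchase"])))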

MTEB Results (l2_supercat)

| Metric              | WL64  | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2 |
|---------------------|-------|-------|-----------|-------|--------|------------|----------|------------------|
| Clustering          | 30.27 | 32.20 | 33.25     | 33.40 | 33.62  | 27.73      | 26.57    | 42.35            |
| Reranking           | 50.38 | 51.52 | 52.03     | 52.32 | 52.39  | 43.29      | 44.75    | 58.04            |
| Classification      | 53.14 | 56.25 | 58.21     | 59.13 | 59.50  | 57.29      | 57.65    | 63.05            |
| Pair Classification | 75.80 | 77.59 | 78.22     | 78.50 | 78.60  | 70.92      | 72.94    | 82.37            |
| STS                 | 66.24 | 67.53 | 67.91     | 68.22 | 68.27  | 61.85      | 62.46    | 78.90            |
| CQA DupStack        | 18.76 | 22.54 | 24.12     | 24.59 | 24.83  | 15.47      | 16.79    | 41.32            |
| SummEval            | 30.79 | 29.99 | 30.99     | 29.56 | 29.39  | 28.87      | 30.49    | 30.81            |

The l2_supercat is a Llama2-vocabulary model. To train this model, I concatenated codebooks from several models, including Llama2 70B and phi3 medium (after removing additional special tokens). Because several models have used the Llama2 tokenizer, their codebooks can be concatenated and trained together. Performance of the resulting model is comparable to training the Llama3 70B codebook, while being 4x smaller (32k vs 128k vocabulary).
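The concatenation step itself is conceptually simple; a rough sketch under the assumption that both codebooks share the 32k Llama2 vocabulary (the file names and tensor keys below are illustrative placeholders, not the actual training pipeline):

import numpy as np
from safetensors.numpy import load_file

# Illustrative only: per-token concatenation of two codebooks that share a tokenizer.
emb_a = load_file("llama2_70b_embeddings.safetensors")["embed_tokens.weight"]   # (vocab_size, d_a)
emb_b = load_file("phi3_medium_embeddings.safetensors")["embed_tokens.weight"]  # (vocab_size, d_b)

assert emb_a.shape[0] == emb_b.shape[0], "codebooks must share the same vocabulary"
supercat = np.concatenate([emb_a, emb_b], axis=1)  # features concatenated per token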

Other Models

  • Llama3-based: l3_supercat (see the Results page)

Embed Text

Here’s how you can load pre-trained embeddings and use them to embed text:

from wordllama import WordLlama

# Load pre-trained embeddings
# truncate dimension to 64
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["the quick brown fox jumps over the lazy dog", "and all that jazz"])
print(embeddings.shape)  # (2, 64)
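The returned embeddings are plain numpy arrays, so you can compute similarity downstream yourself. A small sketch of cosine similarity between the two embeddings above (roughly what wl.similarity does for dense embeddings, assuming the vectors are not pre-normalized):

import numpy as np

a, b = embeddings[0], embeddings[1]
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)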

Binary embedding models can be used like this:

# Binary embeddings are packed into uint64
# 64-dims => array of 1x uint64 
wl = WordLlama.load(trunc_dim=64, binary=True)  # this will download the binary model from huggingface
wl.embed("I went to the car") # Output: array([[3029168427562626]], dtype=uint64)

# load a binary model trained with a straight-through estimator
wl = WordLlama.load(dim=1024, binary=True)

# Uses Hamming similarity on the binarized embeddings
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.57421875

ranked_docs = wl.rank("i went to the car", ["van", "truck"])

wl.binary = False # turn off Hamming similarity and use cosine

# load a different model class
wl = WordLlama.load(config="l3_supercat", dim=1024) # downloads model from HF
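To make the Hamming comparison concrete, here is a small numpy sketch (not the library's internal routine) of similarity between two packed uint64 embeddings:

import numpy as np

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # a, b: packed binary embeddings as equal-length uint64 arrays
    xor = np.bitwise_xor(a, b)
    differing = np.unpackbits(xor.view(np.uint8)).sum()  # count differing bits
    total_bits = a.size * 64
    return 1.0 - differing / total_bits  # 1.0 means identical bit patterns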

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions; either 512 or 1024 is recommended for binary embeddings.

L2 Supercat was trained using a batch size of 512 on a single A100 for 12 hours.

Roadmap

  • Working on adding inference features:
    • Semantic text splitting
  • Add example notebooks
    • DSPy evaluators
    • RAG pipelines

Extracting Token Embeddings

Extract Token Embeddings Tutorial

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for Llama3 models). You can then use the following snippet:

from wordllama.extract.extract_safetensors import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

HINT: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest, sometimes you have to snoop around and figure it out.
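If you need to snoop around, one option (a sketch using the safetensors API; the path is illustrative) is to list the tensor names in each shard and look for an embedding-like key:

from safetensors import safe_open

# List tensor names in a shard without loading the weights
with safe_open("path/to/saved/model-0001-of-00XX.safetensors", framework="numpy") as f:
    for name in f.keys():
        if "embed" in name:
            print(name)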

For training, use the scripts in the GitHub repo. You will need to add a configuration file (copy and modify an existing one into the folder).

$ pip install wordllama[train]
$ python train.py train --config your_new_config
(training stuff happens)
$ python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
(saves 1 model per matryoshka dim)

Community Projects

Gradio demo on Hugging Face Spaces

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.2.6}
}

License

This project is licensed under the MIT License.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See the tutorial on generating distribution archives.

Built Distributions

wordllama-0.2.8.post0-cp312-cp312-win_amd64.whl (16.5 MB; CPython 3.12, Windows x86-64)
wordllama-0.2.8.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.8 MB; CPython 3.12, manylinux: glibc 2.17+ x86-64)
wordllama-0.2.8.post0-cp312-cp312-macosx_12_0_arm64.whl (16.7 MB; CPython 3.12, macOS 12.0+ ARM64)
wordllama-0.2.8.post0-cp312-cp312-macosx_10_13_x86_64.whl (16.8 MB; CPython 3.12, macOS 10.13+ x86-64)
wordllama-0.2.8.post0-cp311-cp311-win_amd64.whl (16.5 MB; CPython 3.11, Windows x86-64)
wordllama-0.2.8.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.8 MB; CPython 3.11, manylinux: glibc 2.17+ x86-64)
wordllama-0.2.8.post0-cp311-cp311-macosx_12_0_arm64.whl (16.7 MB; CPython 3.11, macOS 12.0+ ARM64)
wordllama-0.2.8.post0-cp311-cp311-macosx_10_9_x86_64.whl (16.8 MB; CPython 3.11, macOS 10.9+ x86-64)
wordllama-0.2.8.post0-cp310-cp310-win_amd64.whl (16.5 MB; CPython 3.10, Windows x86-64)
wordllama-0.2.8.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.7 MB; CPython 3.10, manylinux: glibc 2.17+ x86-64)
wordllama-0.2.8.post0-cp310-cp310-macosx_12_0_arm64.whl (16.7 MB; CPython 3.10, macOS 12.0+ ARM64)
wordllama-0.2.8.post0-cp310-cp310-macosx_10_9_x86_64.whl (16.8 MB; CPython 3.10, macOS 10.9+ x86-64)
wordllama-0.2.8.post0-cp39-cp39-win_amd64.whl (16.5 MB; CPython 3.9, Windows x86-64)
wordllama-0.2.8.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.7 MB; CPython 3.9, manylinux: glibc 2.17+ x86-64)
wordllama-0.2.8.post0-cp39-cp39-macosx_12_0_arm64.whl (16.7 MB; CPython 3.9, macOS 12.0+ ARM64)
wordllama-0.2.8.post0-cp39-cp39-macosx_10_9_x86_64.whl (16.8 MB; CPython 3.9, macOS 10.9+ x86-64)
wordllama-0.2.8.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.8 MB; CPython 3.8, manylinux: glibc 2.17+ x86-64)
wordllama-0.2.8.post0-cp38-cp38-macosx_10_9_x86_64.whl (16.8 MB; CPython 3.8, macOS 10.9+ x86-64)

