WordLlama

WordLlama is a fast, lightweight NLP toolkit for tasks like fuzzy deduplication, similarity, and ranking, with minimal inference-time dependencies and optimized for CPU hardware.

Quick Start

Install:

pip install wordllama

Load the default 256-dim model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882

# Rank documents based on their similarity to a query
query = "i went to the car"
candidates = ["i went to the park", "i went to the shop", "i went to the truck", "i went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('i went to the vehicle', 0.7441646856486314),
#   ('i went to the truck', 0.2832691551894259),
#   ('i went to the shop', 0.19732814982305436),
#   ('i went to the park', 0.15101404519322253)
# ]

# additional inference methods
wl.deduplicate(candidates, threshold=0.8) # fuzzy deduplication
wl.cluster(candidates, k=2, max_iterations=100, tolerance=1e-4) # labels using kmeans/kmeans++ init
wl.filter(query, candidates, threshold=0.3) # filter candidates based on query
wl.topk(query, candidates, k=3) # return topk strings based on query

What is it?

WordLlama is an NLP utility and word embedding model that recycles components from large language models (LLMs) to create efficient and compact word representations, in the spirit of GloVe, Word2Vec, or FastText. WordLlama begins by extracting the token embedding codebook from a state-of-the-art LLM (e.g., Llama3 70B) and training a small, context-less model within a general-purpose embedding framework.

WordLlama improves over word models like GloVe 300d on all MTEB benchmarks, while being substantially smaller in size (16MB default model @ 256-dim vs. >2GB).

Features of WordLlama include:

  1. Matryoshka Representations: Truncate embedding dimension as needed.
  2. Low Resource Requirements: A simple token lookup with average pooling enables fast operation on CPU.
  3. Binarization: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations. (coming soon)
  4. Numpy-only inference: Lightweight and simple.

For flexibility, WordLlama employs the Matryoshka representation learning technique. The largest model (1024-dim) can be truncated to 64, 128, 256, or 512 dimensions. For binary embedding models, we implement straight-through estimators during training. For dense embeddings, 256 dimensions capture most of the performance, while for binary embeddings validation accuracy is close to saturation at 512 dimensions (64 bytes packed).
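In practice, truncation is just a matter of which dimension you load. A brief sketch using the trunc_dim argument shown later in this document (the specific dimensions chosen here are only illustrative):

from wordllama import WordLlama

# Load the same underlying weights at two Matryoshka truncations
wl_full = WordLlama.load(trunc_dim=256)
wl_small = WordLlama.load(trunc_dim=64)

text = ["the quick brown fox jumps over the lazy dog"]
print(wl_full.embed(text).shape)   # (1, 256)
print(wl_small.embed(text).shape)  # (1, 64)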

The final weights are saved after weighting, projection, and truncation of the entire tokenizer vocabulary. Thus, WordLlama becomes a single embedding matrix (nn.Embedding) that is considerably smaller than the gigabyte-sized LLM codebooks we start with. The original tokenizer is still used to preprocess the text into tokens, and the reduced-size token embeddings are average pooled. There is very little computation required, and the resulting model sizes range from 16MB to 250MB for the 128k Llama3 vocabulary.
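Conceptually, inference reduces to a row lookup into that embedding matrix followed by average pooling. A rough numpy sketch of the idea (the embedding matrix and token IDs below are made-up stand-ins, not WordLlama's actual internals):

import numpy as np

# Stand-in codebook: one row per token in the vocabulary
vocab_size, dim = 128_256, 256
embedding_matrix = np.random.randn(vocab_size, dim).astype(np.float32)

# Stand-in token IDs, as produced by the original tokenizer for one string
token_ids = np.array([101, 2057, 2253, 2000, 1996, 2482])

# Row lookup, then average pooling over tokens
token_vectors = embedding_matrix[token_ids]   # (num_tokens, dim)
text_embedding = token_vectors.mean(axis=0)   # (dim,)

# Normalize before cosine similarity
text_embedding /= np.linalg.norm(text_embedding)
print(text_embedding.shape)  # (256,)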

It's a good option for lightweight NLP tasks. You can train sklearn classifiers on it, and perform basic semantic matching, fuzzy deduplication, ranking, and clustering. I think it should work well for creating LLM output evaluators, or other preparatory tasks in multi-hop or agentic workflows. You can perform your own LLM surgery and train your own model on consumer GPUs in a few hours. Because it is fast and compact, it makes a good "Swiss Army knife" utility for exploratory analysis and utility applications.
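As one example of the lightweight use case, the dense embeddings drop straight into scikit-learn. A minimal sketch with a toy sentiment dataset (the texts and labels here are made up for illustration):

from wordllama import WordLlama
from sklearn.linear_model import LogisticRegression

wl = WordLlama.load()

# Toy labeled data, for illustration only
texts = [
    "great product, works perfectly",
    "terrible, broke after a day",
    "really happy with this purchase",
    "complete waste of money",
]
labels = [1, 0, 1, 0]

# Embed and fit a standard sklearn classifier
X = wl.embed(texts)  # (4, 256) numpy array
clf = LogisticRegression().fit(X, labels)

print(clf.predict(wl.embed(["absolutely love it"])))  # likely [1] on this toy data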

MTEB Results (l2_supercat)

Metric              | WL64  | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2
Clustering          | 30.27 | 32.20 | 33.25     | 33.40 | 33.62  | 27.73      | 26.57    | 42.35
Reranking           | 50.38 | 51.52 | 52.03     | 52.32 | 52.39  | 43.29      | 44.75    | 58.04
Classification      | 53.14 | 56.25 | 58.21     | 59.13 | 59.50  | 57.29      | 57.65    | 63.05
Pair Classification | 75.80 | 77.59 | 78.22     | 78.50 | 78.60  | 70.92      | 72.94    | 82.37
STS                 | 66.24 | 67.53 | 67.91     | 68.22 | 68.27  | 61.85      | 62.46    | 78.90
CQA DupStack        | 18.76 | 22.54 | 24.12     | 24.59 | 24.83  | 15.47      | 16.79    | 41.32
SummEval            | 30.79 | 29.99 | 30.99     | 29.56 | 29.39  | 28.87      | 30.49    | 30.81

The l2_supercat is a Llama2-vocabulary model. To train this model, I concatenated codebooks from several models, including Llama2 70B and phi3 medium (after removing additional special tokens). Because several models have used the Llama2 tokenizer, their codebooks can be concatenated and trained together. Performance of the resulting model is comparable to training the Llama3 70B codebook, while being 4x smaller (32k vs 128k vocabulary).
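The concatenation itself is simple when the models share a tokenizer: each codebook has one row per token in the same 32k vocabulary, so the matrices can be stacked along the feature axis before projection and training. A rough numpy sketch of that idea (the shapes and the axis choice are my reading of the description above, not the actual training code):

import numpy as np

# Hypothetical codebooks from two models that share the Llama2 tokenizer;
# rows are aligned because both use the same 32k-token vocabulary.
codebook_a = np.random.randn(32_000, 8192).astype(np.float32)  # e.g. a 70B model
codebook_b = np.random.randn(32_000, 5120).astype(np.float32)  # e.g. a medium model

# Stack along the embedding (feature) axis to form the "supercat" codebook,
# which is then projected down and trained as described above.
supercat = np.concatenate([codebook_a, codebook_b], axis=1)
print(supercat.shape)  # (32000, 13312)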

Other Models

  • Results
  • Llama3-based: l3_supercat

Embed Text

Here’s how you can load pre-trained embeddings and use them to embed text:

from wordllama import WordLlama

# Load pre-trained embeddings
# truncate dimension to 64
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["the quick brown fox jumps over the lazy dog", "and all that jazz"])
print(embeddings.shape)  # (2, 64)
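Because embed returns a plain numpy array, you can also work with the vectors directly. For example, a cosine similarity computed by hand (for dense embeddings this mirrors what wl.similarity does, per the notes below on hamming vs. cosine):

import numpy as np
from wordllama import WordLlama

wl = WordLlama.load(trunc_dim=64)
a, b = wl.embed(["the quick brown fox jumps over the lazy dog", "and all that jazz"])

# Cosine similarity on the two 64-dim vectors
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine)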

Binary embedding models can be used like this:

# Binary embeddings are packed into uint64
# 64-dims => array of 1x uint64 
wl = WordLlama.load(trunc_dim=64, binary=True)  # this will download the binary model from huggingface
wl.embed("I went to the car") # Output: array([[3029168427562626]], dtype=uint64)

# load a binary model trained with the straight-through estimator
wl = WordLlama.load(dim=1024, binary=True)

# Uses hamming similarity on the binarized embeddings
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.57421875

ranked_docs = wl.rank("i went to the car", ["van", "truck"])

wl.binary = False # turn off hamming and use cosine

# load a different model class
wl = WordLlama.load(config="l3_supercat", dim=1024) # downloads model from HF
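For reference, hamming similarity on the packed representation can be computed with plain numpy. A small sketch assuming embeddings packed into uint64 words as shown above (this is illustrative, not WordLlama's internal implementation):

import numpy as np

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of matching bits between two uint64-packed binary embeddings."""
    # XOR sets a bit wherever the embeddings disagree; count those bits
    diff_bits = np.unpackbits(np.bitwise_xor(a, b).view(np.uint8)).sum()
    total_bits = a.size * 64
    return 1.0 - diff_bits / total_bits

x = np.array([3029168427562626], dtype=np.uint64)
y = np.array([1234567890123456], dtype=np.uint64)
print(hamming_similarity(x, y))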

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions; either 512 or 1024 dimensions is recommended for binary embeddings.

L2 Supercat was trained using a batch size of 512 on a single A100 for 12 hours.

Roadmap

  • Working on adding inference features:
    • Semantic text splitting
  • Add example notebooks
    • DSPy evaluators
    • RAG pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (required for Llama3 models). You can then use the following snippet:

from wordllama.extract import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

HINT: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest, sometimes you have to snoop around and figure it out.
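If you do need to snoop around, the safetensors library can list tensor names and shapes without loading the weights. A small sketch (the file path is the same placeholder as above, and the embedding key name varies by model, often something like "model.embed_tokens.weight"):

from safetensors import safe_open

# List tensor names and shapes to locate the token embedding codebook
with safe_open("path/to/saved/model-0001-of-00XX.safetensors", framework="pt") as f:
    for key in f.keys():
        print(key, f.get_slice(key).get_shape())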

For training, use the scripts in the GitHub repo. You will need to add a configuration file (copy an existing one into the config folder and modify it).

$ pip install wordllama[train]
$ python train.py train --config your_new_config
(training stuff happens)
$ python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
(saves 1 model per matryoshka dim)

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.2.6}
}

License

This project is licensed under the MIT License.

