
WordLlama Embedding Utility


WordLlama

WordLlama is a fast, lightweight NLP toolkit that handles tasks like fuzzy deduplication, similarity, and ranking with minimal inference-time dependencies, and is optimized for CPU hardware.



Quick Start

Install:

pip install wordllama

Load the default 256-dim model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882

# Rank documents based on their similarity to a query
query = "i went to the car"
candidates = ["i went to the park", "i went to the shop", "i went to the truck", "i went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('i went to the vehicle', 0.7441646856486314),
#   ('i went to the truck', 0.2832691551894259),
#   ('i went to the shop', 0.19732814982305436),
#   ('i went to the park', 0.15101404519322253)
# ]

# additional inference methods
wl.deduplicate(candidates, threshold=0.8) # fuzzy deduplication
wl.cluster(docs, k=5, max_iterations=100, tolerance=1e-4) # cluster labels using kmeans/kmeans++ init (docs: your own list of strings)
wl.filter(query, candidates, threshold=0.3) # filter candidates based on query
wl.topk(query, candidates, k=3) # return topk strings based on query

What is it?

WordLlama is an NLP utility and word embedding model that recycles components from large language models (LLMs) to create efficient and compact word representations, similar in spirit to GloVe, Word2Vec, or FastText. WordLlama begins by extracting the token embedding codebook from a state-of-the-art LLM (e.g., Llama3 70B) and training a small, context-less model in a general-purpose embedding framework.

WordLlama improves on word models like GloVe 300d across all MTEB benchmarks, while being substantially smaller in size (16MB default model @ 256-dim vs. >2GB).

Features of WordLlama include:

  1. Matryoshka Representations: Truncate embedding dimension as needed.
  2. Low Resource Requirements: A simple token lookup with average pooling enables fast operation on CPU.
  3. Binarization: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations. (coming soon)
  4. Numpy-only inference: Lightweight and simple.

For flexibility, WordLlama employs the Matryoshka representation learning training technique. The largest model (1024-dim) can be truncated to 64, 128, 256, or 512 dimensions. For binary embedding models, we implement straight-through estimators during training. For dense embeddings, 256 dimensions sufficiently capture most of the performance, while for binary embeddings validation accuracy is close to saturation at 512 dimensions (64 bytes packed).
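
As a rough illustration of the truncation itself (a minimal sketch, not WordLlama's internal code), a Matryoshka embedding is cut down by keeping its leading components and re-normalizing; full_embedding below is a hypothetical stand-in for a 1024-dim model output:

import numpy as np

# Hypothetical full-dimension embedding, standing in for a 1024-dim model output
full_embedding = np.random.randn(1024).astype(np.float32)

def truncate(embedding, dim):
    # Keep the leading `dim` components and re-normalize (Matryoshka-style truncation)
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

for dim in (64, 128, 256, 512):
    print(dim, truncate(full_embedding, dim).shape)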

The final weights are saved after weighting, projection, and truncation of the entire tokenizer vocabulary. Thus, WordLlama becomes a single embedding matrix (nn.Embedding) that is considerably smaller than the gigabyte-sized LLM codebooks we start with. The original tokenizer is still used to preprocess the text into tokens, and the reduced-size token embeddings are average pooled. There is very little computation required, and the resulting model sizes range from 16MB to 250MB for the 128k Llama3 vocabulary.
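
Conceptually, inference reduces to a table lookup and a mean. The sketch below uses toy numbers rather than the real tokenizer or trained weights, but it mirrors the nn.Embedding lookup plus average pooling described above:

import numpy as np

# Toy stand-ins for the trained embedding matrix and the tokenizer output
vocab_size, dim = 1000, 256
embedding_matrix = np.random.randn(vocab_size, dim).astype(np.float32)  # the saved nn.Embedding weights
token_ids = np.array([17, 42, 256, 999])  # token ids produced by the tokenizer for some text

# Inference: embedding lookup followed by average pooling
pooled = embedding_matrix[token_ids].mean(axis=0)
print(pooled.shape)  # (256,)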

It's a good option for lightweight NLP tasks. You can train sklearn classifiers on it, perform basic semantic matching, fuzzy deduplication, ranking, and clustering. I think it should work well for creating LLM output evaluators, or other preparatory tasks involved in multi-hop or agentic workflows. You can perform your own LLM surgery and train your own model on consumer GPUs in a few hours. Because it is fast and compact, it makes a good "Swiss-Army Knife" utility for exploratory analysis and utility applications.
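
For example, because wl.embed returns plain NumPy arrays, the vectors can be fed straight into scikit-learn. A minimal sketch with a made-up toy dataset (not from the project):

from sklearn.linear_model import LogisticRegression
from wordllama import WordLlama

wl = WordLlama.load()

# Tiny illustrative dataset
texts = ["great product, works well", "terrible, broke after a day",
         "really happy with this", "would not recommend"]
labels = [1, 0, 1, 0]

# Embed the texts and fit an ordinary sklearn classifier on the vectors
X = wl.embed(texts)
clf = LogisticRegression().fit(X, labels)
print(clf.predict(wl.embed(["pretty good overall"])))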

MTEB Results (l2_supercat)

Metric              | WL64  | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2
--------------------|-------|-------|-----------|-------|--------|------------|----------|-----------------
Clustering          | 30.27 | 32.20 | 33.25     | 33.40 | 33.62  | 27.73      | 26.57    | 42.35
Reranking           | 50.38 | 51.52 | 52.03     | 52.32 | 52.39  | 43.29      | 44.75    | 58.04
Classification      | 53.14 | 56.25 | 58.21     | 59.13 | 59.50  | 57.29      | 57.65    | 63.05
Pair Classification | 75.80 | 77.59 | 78.22     | 78.50 | 78.60  | 70.92      | 72.94    | 82.37
STS                 | 66.24 | 67.53 | 67.91     | 68.22 | 68.27  | 61.85      | 62.46    | 78.90
CQA DupStack        | 18.76 | 22.54 | 24.12     | 24.59 | 24.83  | 15.47      | 16.79    | 41.32
SummEval            | 30.79 | 29.99 | 30.99     | 29.56 | 29.39  | 28.87      | 30.49    | 30.81

The l2_supercat is a Llama2-vocabulary model. To train this model, I concatenated codebooks from several models, including Llama2 70B and phi3 medium (after removing additional special tokens). Because several models have used the Llama2 tokenizer, their codebooks can be concatenated and trained together. Performance of the resulting model is comparable to training the Llama3 70B codebook, while being 4x smaller (32k vs 128k vocabulary).
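
One way to picture the concatenation (a sketch with assumed hidden sizes, not the actual training code): models that share a tokenizer have codebooks with the same number of rows, so those codebooks can be stacked feature-wise and projected down during training.

import numpy as np

# Toy-sized stand-ins (real codebooks have ~32k rows, one per vocabulary token)
vocab = 1000
llama2_codebook = np.random.randn(vocab, 8192).astype(np.float32)  # assumed Llama2 70B hidden size
phi3_codebook = np.random.randn(vocab, 5120).astype(np.float32)    # assumed phi3 medium hidden size

# A shared tokenizer means the rows line up, so codebooks concatenate along the feature axis
supercat = np.concatenate([llama2_codebook, phi3_codebook], axis=1)
print(supercat.shape)  # (1000, 13312); the combined features are projected down during training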

Other Models

Results for other trained models are available in the repository:

  • Llama3-based: l3_supercat

Embed Text

Here’s how you can load pre-trained embeddings and use them to embed text:

from wordllama import WordLlama

# Load pre-trained embeddings
# truncate dimension to 64
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["the quick brown fox jumps over the lazy dog", "and all that jazz"])
print(embeddings.shape)  # (2, 64)
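
Since the result is an ordinary NumPy array, you can work with the vectors directly; for example, a hand-rolled cosine similarity between the two rows above (a quick sketch that normalizes explicitly, in case the rows are not already unit length):

import numpy as np

a, b = embeddings  # the two rows from the example above
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)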

Binary embedding models can be used like this:

# Binary embeddings are packed into uint64
# 64-dims => array of 1x uint64 
wl = WordLlama.load(trunc_dim=64, binary=True)  # this will download the binary model from huggingface
wl.embed("I went to the car") # Output: array([[3029168427562626]], dtype=uint64)

# load a binary model trained with the straight-through estimator
wl = WordLlama.load(dim=1024, binary=True)

# Uses Hamming similarity on the binarized embeddings
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.57421875

ranked_docs = wl.rank("i went to the car", ["van", "truck"])

wl.binary = False # turn off Hamming similarity and use cosine

# load a different model class
wl = WordLlama.load(config="l3_supercat", dim=1024) # downloads model from HF
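
For reference, Hamming similarity over packed uint64 embeddings can be computed with plain NumPy bit operations. A rough sketch of the idea (not necessarily WordLlama's exact implementation), using synthetic packed vectors:

import numpy as np

def hamming_similarity(a, b):
    # Fraction of matching bits between two packed uint64 bit vectors
    xor = np.bitwise_xor(a, b)
    mismatched_bits = np.unpackbits(xor.view(np.uint8)).sum()
    return 1.0 - mismatched_bits / (a.size * 64)

rng = np.random.default_rng(0)
a = rng.integers(0, 2**63, size=16, dtype=np.uint64)  # 1024 bits packed into 16x uint64
b = rng.integers(0, 2**63, size=16, dtype=np.uint64)
print(hamming_similarity(a, b))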

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions, and either 512 or 1024 dimensions is recommended for binary embeddings.
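
For context, a straight-through estimator binarizes in the forward pass while letting gradients flow as if no binarization had happened. A generic PyTorch sketch of the trick (not necessarily the exact code used in the training repo):

import torch

def ste_sign(x):
    # Forward: sign(x) in {-1, +1}; backward: gradient passes straight through to x
    return x + (torch.sign(x) - x).detach()

x = torch.randn(4, 8, requires_grad=True)
y = ste_sign(x)
y.sum().backward()
print(x.grad)  # all ones: the binarization step is transparent to the gradient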

L2 Supercat was trained using a batch size of 512 on a single A100 for 12 hours.

Roadmap

  • Working on adding inference features:
    • Semantic text splitting
  • Add example notebooks
    • DSPy evaluators
    • RAG pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (required for Llama3 models). You can then use the following snippet:

from wordllama.extract import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

HINT: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest, sometimes you have to snoop around and figure it out.
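
If you need to snoop, the safetensors library can list tensor names in a shard without loading everything; a quick sketch (tensor names vary by model family, so the embed_tokens name below is just the usual Llama convention):

from safetensors import safe_open

# Peek at a shard and look for the token embedding tensor (e.g. "model.embed_tokens.weight")
with safe_open("path/to/saved/model-0001-of-00XX.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name)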

For training, use the scripts in the GitHub repo. You will need to add a configuration file (copy and modify an existing one in the folder).

$ pip install wordllama[train]
$ python train.py train --config your_new_config
(training stuff happens)
$ python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
(saves 1 model per matryoshka dim)

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.2.5}
}

License

This project is licensed under the MIT License.

