
WordLlama Embedding Utility

Project description

WordLlama

WordLlama is a fast, lightweight NLP toolkit that handles tasks like fuzzy deduplication, similarity scoring and ranking, with minimal inference-time dependencies, optimized for CPU hardware.


Quick Start

Install:

pip install wordllama

Load the default 256-dim model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882

# Rank documents based on their similarity to a query
query = "i went to the car"
candidates = ["i went to the park", "i went to the shop", "i went to the truck", "i went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('i went to the vehicle', 0.7441646856486314),
#   ('i went to the truck', 0.2832691551894259),
#   ('i went to the shop', 0.19732814982305436),
#   ('i went to the park', 0.15101404519322253)
# ]

# Additional inference methods
wl.deduplicate(candidates, threshold=0.8)                  # fuzzy deduplication
wl.cluster(docs, k=5, max_iterations=100, tolerance=1e-4)  # kmeans/kmeans++ cluster labels for a list of strings (docs)
wl.filter(query, candidates, threshold=0.3)                # filter candidates by similarity to the query
wl.topk(query, candidates, k=3)                            # return the top-k candidates most similar to the query

What is it?

WordLlama is an NLP utility and word embedding model that recycles components from large language models (LLMs) to create efficient and compact word representations, in the spirit of GloVe, Word2Vec or FastText. WordLlama begins by extracting the token embedding codebook from a state-of-the-art LLM (e.g., Llama3 70B) and training a small, context-less model in a general-purpose embedding framework.

WordLlama improves on word models like GloVe 300d across all MTEB benchmarks, while being substantially smaller in size (16MB default model @ 256-dim vs >2GB).

Features of WordLlama include:

  1. Matryoshka Representations: Truncate embedding dimension as needed.
  2. Low Resource Requirements: A simple token lookup with average pooling enables fast operation on CPU.
  3. Binarization: Models trained with a straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations.
  4. Numpy-only inference: Lightweight and simple.

For flexibility, WordLlama employs the Matryoshka representation learning training technique. The largest model (1024-dim) can be truncated to 64, 128, 256 or 512 dimensions. For binary embedding models, we implement straight-through estimators during training. For dense embeddings, 256 dimensions capture most of the performance, while for binary embeddings validation accuracy is close to saturation at 512 dimensions (64 bytes packed).
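
As a rough sketch (using only the load and embed calls shown in this README, not WordLlama's internals), truncating a Matryoshka-trained embedding amounts to keeping the leading dimensions and renormalizing; trunc_dim does this for you:

import numpy as np
from wordllama import WordLlama

# Default model: 256-dim dense embeddings, as in the Quick Start
wl = WordLlama.load()

emb = wl.embed(["i went to the car", "i went to the pawn shop"])  # shape (2, 256)

# Matryoshka-style truncation: keep the leading k dimensions and renormalize
k = 64
trunc = emb[:, :k] / np.linalg.norm(emb[:, :k], axis=1, keepdims=True)

# Cosine similarity on the truncated vectors approximates the full-dimension score
print(float(trunc[0] @ trunc[1]))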

The final weights are saved after weighting, projection and truncation of the entire tokenizer vocabulary. Thus, WordLlama becomes a single embedding matrix (nn.Embedding) that is considerably smaller than the gigabyte-sized LLM codebooks we start with. The original tokenizer is still used to preprocess the text into tokens, and the reduced-size token embeddings are average pooled. There is very little computation required, and the resulting model sizes range from 16MB to 250MB for the 128k Llama3 vocabulary.
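
Conceptually, inference reduces to the following sketch: a row lookup in the embedding matrix, average pooling, and normalization. The matrix and token IDs below are random stand-ins, not real WordLlama weights:

import numpy as np

# Stand-in for the saved (vocab_size x dim) embedding matrix; the real one
# covers the full tokenizer vocabulary (e.g. ~128k rows for Llama3)
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((1000, 256)).astype(np.float32)

def embed_tokens(token_ids):
    """Look up each token's row and average-pool into a single vector."""
    vectors = embedding_matrix[token_ids]     # (num_tokens, dim) lookup
    pooled = vectors.mean(axis=0)             # average pooling
    return pooled / np.linalg.norm(pooled)    # L2-normalize for cosine use

# In the real pipeline, token_ids come from the original LLM tokenizer
print(embed_tokens([11, 42, 7, 256]).shape)   # (256,)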

It's a good option for some NLP-lite tasks. You can train sklearn classifiers on it, perform basic semantic matching, fuzzy deduplication, ranking and clustering. I think it should work well for creating LLM output evaluators, or for other preparatory tasks in multi-hop or agentic workflows. You can perform your own LLM surgery and train your own model on consumer GPUs in a few hours. Because it is fast and compact, it makes a good "Swiss-Army Knife" utility for exploratory analysis and utility applications.
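
For example, a quick classifier on top of WordLlama embeddings could look like this sketch (toy texts and labels, purely illustrative):

from sklearn.linear_model import LogisticRegression
from wordllama import WordLlama

wl = WordLlama.load()

# Toy training data
texts = ["the movie was great", "fantastic acting", "terrible plot", "i hated it"]
labels = [1, 1, 0, 0]

X = wl.embed(texts)                       # dense (n_samples, 256) features
clf = LogisticRegression().fit(X, labels)

print(clf.predict(wl.embed(["what a wonderful film", "worst film ever"])))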

MTEB Results (l2_supercat)

| Metric              | WL64  | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2 |
|---------------------|-------|-------|-----------|-------|--------|------------|----------|------------------|
| Clustering          | 30.27 | 32.20 | 33.25     | 33.40 | 33.62  | 27.73      | 26.57    | 42.35            |
| Reranking           | 50.38 | 51.52 | 52.03     | 52.32 | 52.39  | 43.29      | 44.75    | 58.04            |
| Classification      | 53.14 | 56.25 | 58.21     | 59.13 | 59.50  | 57.29      | 57.65    | 63.05            |
| Pair Classification | 75.80 | 77.59 | 78.22     | 78.50 | 78.60  | 70.92      | 72.94    | 82.37            |
| STS                 | 66.24 | 67.53 | 67.91     | 68.22 | 68.27  | 61.85      | 62.46    | 78.90            |
| CQA DupStack        | 18.76 | 22.54 | 24.12     | 24.59 | 24.83  | 15.47      | 16.79    | 41.32            |
| SummEval            | 30.79 | 29.99 | 30.99     | 29.56 | 29.39  | 28.87      | 30.49    | 30.81            |

The l2_supercat is a Llama2-vocabulary model. To train this model, I concatenated codebooks from several models, including Llama2 70B and phi3 medium (after removing additional special tokens). Because several models have used the Llama2 tokenizer, their codebooks can be concatenated and trained together. Performance of the resulting model is comparable to training the Llama3 70B codebook, while being 4x smaller (32k vs 128k vocabulary).
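
A minimal sketch of that concatenation step, assuming the codebooks have already been extracted to disk (the file names are placeholders; the shapes reflect Llama2 70B's 8192-dim and phi3 medium's 5120-dim hidden sizes):

import numpy as np

# Placeholder files: extracted token-embedding codebooks sharing the Llama2 32k vocabulary
llama2_codebook = np.load("llama2_70b_embed_tokens.npy")   # (32000, 8192)
phi3_codebook = np.load("phi3_medium_embed_tokens.npy")    # (32000, 5120), extra special tokens removed

# Concatenate along the feature axis; the combined matrix then goes through the
# same projection/truncation training as a single codebook would
supercat = np.concatenate([llama2_codebook, phi3_codebook], axis=1)
print(supercat.shape)  # (32000, 13312)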

Other Models

  • Results
  • Llama3-based: l3_supercat

Embed Text

Here’s how you can load pre-trained embeddings and use them to embed text:

from wordllama import WordLlama

# Load pre-trained embeddings
# truncate dimension to 64
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["the quick brown fox jumps over the lazy dog", "and all that jazz"])
print(embeddings.shape)  # (2, 64)

Binary embedding models can be used like this:

# Binary embeddings are packed into uint64
# 64-dims => array of 1x uint64 
wl = WordLlama.load(trunc_dim=64, binary=True)  # this will download the binary model from huggingface
wl.embed("I went to the car") # Output: array([[3029168427562626]], dtype=uint64)

# load a binary model trained with the straight-through estimator
wl = WordLlama.load(dim=1024, binary=True)

# Uses Hamming similarity over the binarized embeddings
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.57421875

ranked_docs = wl.rank("i went to the car", ["van", "truck"])

wl.binary = False # turn off hamming and use cosine

# load a different model class
wl = WordLlama.load(config="l3_supercat", dim=1024) # downloads model from HF
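
For reference, a Hamming-based similarity over packed uint64 embeddings can be computed along these lines. This is an illustrative formulation (fraction of matching bits), not necessarily WordLlama's exact implementation:

import numpy as np

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of matching bits between two packed uint64 bit vectors."""
    diff = np.bitwise_xor(a, b)                          # 1 bits wherever the vectors disagree
    mismatched = np.unpackbits(diff.view(np.uint8)).sum()
    return 1.0 - mismatched / (a.size * 64)

# e.g. two 1024-dim binary embeddings packed into 16 uint64 values each
rng = np.random.default_rng(0)
x = rng.integers(0, 2**63, size=16, dtype=np.uint64)
y = rng.integers(0, 2**63, size=16, dtype=np.uint64)
print(hamming_similarity(x, y))  # ~0.5 for random bit vectors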

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions; either 512 or 1024 dimensions is recommended for binary embeddings.

L2 Supercat was trained using a batch size of 512 on a single A100 for 12 hours.

Roadmap

  • Working on adding inference features:
    • Semantic text splitting
  • Add example notebooks
    • DSPy evaluators
    • RAG pipelines

Extracting Token Embeddings

Extract Token Embeddings Tutorial

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (required for Llama3 models). You can then use the following snippet:

from wordllama.extract.extract_safetensors import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

HINT: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest, sometimes you have to snoop around and figure it out.
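
One way to do that snooping without loading any tensor data is to read each shard's JSON header directly; the safetensors format stores an 8-byte little-endian header length followed by a JSON index of tensor names and shapes. The glob path below is a placeholder:

import json
import struct
from glob import glob

def list_tensors(path):
    """Yield (name, shape) from a safetensors file's JSON header, without loading tensors."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # little-endian uint64 header size
        header = json.loads(f.read(header_len))
    for name, meta in header.items():
        if name != "__metadata__":
            yield name, meta["shape"]

# Scan each shard for tensors that look like the token embedding codebook
for path in sorted(glob("path/to/saved/*.safetensors")):
    for name, shape in list_tensors(path):
        if "embed" in name.lower():
            print(path, name, shape)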

For training, use the scripts in the GitHub repo. You will need to add a configuration file (copy and modify an existing one in the configuration folder).

$ pip install wordllama[train]
$ python train.py train --config your_new_config
(training stuff happens)
$ python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
(saves 1 model per matryoshka dim)

Community Projects

  • Gradio Demo (Hugging Face Space)

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.2.6}
}

License

This project is licensed under the MIT License.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

wordllama-0.2.7.post0-cp312-cp312-win_amd64.whl (16.4 MB view details)

Uploaded CPython 3.12 Windows x86-64

wordllama-0.2.7.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.7 MB view details)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64

wordllama-0.2.7.post0-cp312-cp312-macosx_12_0_arm64.whl (16.7 MB view details)

Uploaded CPython 3.12 macOS 12.0+ ARM64

wordllama-0.2.7.post0-cp312-cp312-macosx_10_13_x86_64.whl (16.7 MB view details)

Uploaded CPython 3.12 macOS 10.13+ x86-64

wordllama-0.2.7.post0-cp311-cp311-win_amd64.whl (16.4 MB view details)

Uploaded CPython 3.11 Windows x86-64

wordllama-0.2.7.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.7 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

wordllama-0.2.7.post0-cp311-cp311-macosx_12_0_arm64.whl (16.7 MB view details)

Uploaded CPython 3.11 macOS 12.0+ ARM64

wordllama-0.2.7.post0-cp311-cp311-macosx_10_9_x86_64.whl (16.7 MB view details)

Uploaded CPython 3.11 macOS 10.9+ x86-64

wordllama-0.2.7.post0-cp310-cp310-win_amd64.whl (16.4 MB view details)

Uploaded CPython 3.10 Windows x86-64

wordllama-0.2.7.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

wordllama-0.2.7.post0-cp310-cp310-macosx_12_0_arm64.whl (16.7 MB view details)

Uploaded CPython 3.10 macOS 12.0+ ARM64

wordllama-0.2.7.post0-cp310-cp310-macosx_10_9_x86_64.whl (16.7 MB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

wordllama-0.2.7.post0-cp39-cp39-win_amd64.whl (16.4 MB view details)

Uploaded CPython 3.9 Windows x86-64

wordllama-0.2.7.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

wordllama-0.2.7.post0-cp39-cp39-macosx_12_0_arm64.whl (16.7 MB view details)

Uploaded CPython 3.9 macOS 12.0+ ARM64

wordllama-0.2.7.post0-cp39-cp39-macosx_10_9_x86_64.whl (16.7 MB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

wordllama-0.2.7.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

wordllama-0.2.7.post0-cp38-cp38-macosx_10_9_x86_64.whl (16.7 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

File details

Details for the file wordllama-0.2.7.post0-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 0d619634aad2dba6e0ba33b31bcb820639c371d7a1579de0ba3fa1da56920923
MD5 c9cd4970f95af41cb7bd2a1f850d5fe1
BLAKE2b-256 593be62beb775190988197aaf6076ba96c10e90aaa093352fddf24b84ad8f3f3

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 73507d3b19e144f1f24fe336e224b6b2ba556096c449bb0216b8909fc1289d12
MD5 3187be44ef4e766f40885bbc447b7f7f
BLAKE2b-256 71794e5ee075f38c8ad7302f1f81b2a64e245b657ef709d08ecf66d3c744e3a6

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp312-cp312-macosx_12_0_arm64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp312-cp312-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 3508394a9ec1a2d85704f739b1b65bbe74e52cc0327cf5a777c39ab4a50be50f
MD5 3fce15ba7dde44510ad9c05fb1b65fc3
BLAKE2b-256 e5dded1fb8e9c9d684f75e0ac66833528c48148b7f3bac7e16d67d58ce1ce698

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp312-cp312-macosx_10_13_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 94cc7b3203baca1e17c03b36cf48ec44aad73b452d33e71c7d57bd683c3a4e7e
MD5 51a05760d1744d599ec30ab4ded29b9f
BLAKE2b-256 fa63188b8870d264d30109a07103b26f1e4e1f483840c6bb13078dc96a6f44d4

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 048254bc6abf3754f1b26d17ba655bcbe3f6072b6e1d00573a55539f721c214d
MD5 2246ca79ac71b86b6e9c7e763d4aa4f9
BLAKE2b-256 93f2da60721833c05f9bd0339beb3fe983cf473bf1488b24234869bb6f5771ef

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b232901949d7bd86ea2fc8b99041bf95165c3247c4a0916bc9bdcb53abd4d91b
MD5 88e7183f4ccfcf36ac9a55cfd247c756
BLAKE2b-256 555f54169551d8a5648b8cfc2354fae87bc9178fc02b5272bb36278b8627852d

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp311-cp311-macosx_12_0_arm64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp311-cp311-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 43f66dacbbb11b74f98dc6def8099a048c8863d2e0337bdcb8f1e0b267d52a46
MD5 17936c4ede2006dd62c3a0ae62f5992a
BLAKE2b-256 3cad83c04e92f4e7b7765da42bf0b8f5247e03e4a644a1e5a301d6752c43567b

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 a9d19d76fa0d584ed7b7fdfd322890f14e2dd36ec48fbc10b76013e1429d01f0
MD5 be34876a97c82dd233fadbbf530ae628
BLAKE2b-256 f859b3c99c3115e70804d1a33cccba1c7bbd39ed72853fd88dd9154943e0e5ef

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 fcebbec8b5e4e436c6c7586bc54966c7d4d86402254d2535a56d066201dbc84a
MD5 f658211f5f7571e38382d49d60e3ddd0
BLAKE2b-256 31845f19125174a305dc0d459101765e26cea18ece1ff2ca8ecd582c8a3d4468

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 eb8abf5ef6b75c7ce6e1753ccc4437cfdcd1f096917cd99181731d956e8a8839
MD5 619550824678cea83854b154929dbc42
BLAKE2b-256 ad4c969e621a82a1aaeb43207a084220fe0fbeb57255c93a0f347c1ce4cb37b9

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp310-cp310-macosx_12_0_arm64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp310-cp310-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 7f59dfda780f9aa17b48a6642974a05cb7e99e6813539d71a15842609e6e3018
MD5 18d9470362f15aaf4eb21dc3c6fe3c77
BLAKE2b-256 643bdedf7f8f0753e661b439456f0aec514807d4abbe2dd0d4ae86d2c5fe4106

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 cec6b9b74484d58c8a75cb0bc97194a540f8922cd649a3ef2011b1a37f653ae1
MD5 2976b33812da9623d1d3d53dc109ee5c
BLAKE2b-256 a89861e72d9b03ad2da5b518301a659cd92dc980c69656bd100fedd6d9ee58bf

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 9204a3bf0602758c2d08e2553445f314f0ac3bc792dd6b631fb3fdc55514d7dd
MD5 9ca3f7272d631afa568c7a2af9eadf9d
BLAKE2b-256 d6fabc77e057dc9b9cbd1c096d1da0d55b14ea90d454901a1e746c5a4654eaa8

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7213f2c96604750ac29a0b21fdff9905dc77d78eddd40c01fafca576ad32f518
MD5 c3f3edf628fbfb4fb40acaed9036800d
BLAKE2b-256 733c88a27c7adeea03b3ff19ef6093d10e683de5a05ad5154665eaaef1614830

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp39-cp39-macosx_12_0_arm64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp39-cp39-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 b20e4880ab2e652200d3402ebd4639cd11e700a8ea6f911bb2a43af0d106328a
MD5 61c2109ea4908ee1df183443294b0aa3
BLAKE2b-256 9dfd2ede1bc4e8804f7acef28b730f1a3e22e680abf6f947244eb5a8e1b1d520

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 b9853c1def54afae795397e1346cc491c79a1fe4c8860d29b9eb6803da749b02
MD5 164e118265a112620bd672a7cdc29a1d
BLAKE2b-256 cccb088b66856a69222b82b862d1a32e82f5c2bf133d238599acedcf040b0791

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 cf6b0d93f21a32a7d77d3c4654e6a44a23621cf6a1e374b9a83a7d65cc9c283a
MD5 48426c97be11bc724b9ad859653515e9
BLAKE2b-256 a0940bf9ec950cc673cb7c85a4bf2992df94d62d18bafbfd18af563ac6fa4e04

See more details on using hashes here.

File details

Details for the file wordllama-0.2.7.post0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.2.7.post0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 eb5014eff280bdf07b9fe71db23dc7c2c42adf3e54b048f32ce47f9a9fda7a28
MD5 b35d20c2c00e55fbec219ae40aacf21b
BLAKE2b-256 783221087cf99b31cca49dc70d618cbfcc2aa6c8e601c535fc947ce0498822cb

See more details on using hashes here.
