
WordLlama Embedding Utility

Project description

WordLlama

WordLlama is a fast, lightweight NLP toolkit that handles tasks like fuzzy deduplication, similarity, and ranking with minimal inference-time dependencies, and is optimized for CPU hardware.



Quick Start

Install:

pip install wordllama

Load the default 256-dim model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882

# Rank documents based on their similarity to a query
query = "i went to the car"
candidates = ["i went to the park", "i went to the shop", "i went to the truck", "i went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('i went to the vehicle', 0.7441646856486314),
#   ('i went to the truck', 0.2832691551894259),
#   ('i went to the shop', 0.19732814982305436),
#   ('i went to the park', 0.15101404519322253)
# ]

# additional inference methods
wl.deduplicate(candidates, threshold=0.8) # fuzzy deduplication
wl.cluster(candidates, k=2, max_iterations=100, tolerance=1e-4) # labels using kmeans/kmeans++ init
wl.filter(query, candidates, threshold=0.3) # filter candidates based on query
wl.topk(query, candidates, k=3) # return topk strings based on query

What is it?

WordLlama is an NLP utility and word embedding model that recycles components from large language models (LLMs) to create efficient and compact word representations (similar to GloVe, Word2Vec, or FastText). WordLlama begins by extracting the token embedding codebook from a state-of-the-art LLM (e.g., Llama 3 70B) and training a small context-less model in a general-purpose embedding framework.

WordLlama improves on word models like GloVe 300d across all MTEB benchmarks, while being substantially smaller in size (16 MB default model @ 256-dim vs. >2 GB).

Features of WordLlama include:

  1. Matryoshka Representations: Truncate embedding dimension as needed.
  2. Low Resource Requirements: A simple token lookup with average pooling enables fast operation on CPU.
  3. Binarization: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations. (Coming soon.)
  4. Numpy-only inference: Lightweight and simple.

For flexibility, WordLlama employs the Matryoshka representation learning technique. The largest model (1024-dim) can be truncated to 64, 128, 256, or 512 dimensions. For binary embedding models, we implement straight-through estimators during training. For dense embeddings, 256 dimensions sufficiently capture most of the performance, while for binary embeddings validation accuracy is close to saturation at 512 dimensions (64 bytes packed).
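
As a quick illustration of the truncation idea, a Matryoshka-trained vector can be shortened by simply keeping its leading dimensions; a minimal numpy sketch (the random vector is a stand-in for a real 1024-dim embedding):

import numpy as np

full = np.random.randn(1024).astype(np.float32)     # stand-in for a 1024-dim embedding
truncated = full[:256]                              # keep the leading 256 dims
truncated = truncated / np.linalg.norm(truncated)   # re-normalize if you need unit vectors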

The final weights are saved after weighting, projection, and truncation of the entire tokenizer vocabulary. Thus, WordLlama becomes a single embedding matrix (nn.Embedding) that is considerably smaller than the gigabyte-sized LLM codebooks we start with. The original tokenizer is still used to preprocess the text into tokens, and the reduced-size token embeddings are average pooled. Very little computation is required, and the resulting model sizes range from 16 MB to 250 MB for the 128k Llama 3 vocabulary.
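
Conceptually, inference is just an embedding-matrix lookup followed by average pooling; a minimal sketch of the idea in numpy (the names here are illustrative, not the library's internals):

import numpy as np

def pool_embedding(embedding_matrix, token_ids):
    # embedding_matrix: (vocab_size, dim) trained WordLlama weights
    # token_ids: integer ids produced by the original tokenizer
    vectors = embedding_matrix[token_ids]   # token lookup
    return vectors.mean(axis=0)             # average pooling over the sequence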

It's a good option for lightweight NLP tasks. You can train sklearn classifiers on it, perform basic semantic matching, fuzzy deduplication, ranking, and clustering. It should also work well for creating LLM output evaluators and other preparatory tasks in multi-hop or agentic workflows. You can perform your own LLM surgery and train your own model on consumer GPUs in a few hours. Because it is fast and compact, it makes a good "Swiss Army knife" utility for exploratory analysis and utility applications.
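
For example, since wl.embed returns plain numpy arrays, the embeddings drop straight into scikit-learn; a small sketch with placeholder texts and labels:

from sklearn.linear_model import LogisticRegression
from wordllama import WordLlama

wl = WordLlama.load()
texts = ["great product", "terrible support", "works as expected", "broke after a day"]
labels = [1, 0, 1, 0]                       # placeholder sentiment labels

X = wl.embed(texts)                         # (4, 256) numpy array
clf = LogisticRegression().fit(X, labels)   # any sklearn classifier works here
print(clf.predict(wl.embed(["totally worth it"])))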

MTEB Results (l2_supercat)

| Metric              | WL64  | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2 |
|---------------------|-------|-------|-----------|-------|--------|------------|----------|------------------|
| Clustering          | 30.27 | 32.20 | 33.25     | 33.40 | 33.62  | 27.73      | 26.57    | 42.35            |
| Reranking           | 50.38 | 51.52 | 52.03     | 52.32 | 52.39  | 43.29      | 44.75    | 58.04            |
| Classification      | 53.14 | 56.25 | 58.21     | 59.13 | 59.50  | 57.29      | 57.65    | 63.05            |
| Pair Classification | 75.80 | 77.59 | 78.22     | 78.50 | 78.60  | 70.92      | 72.94    | 82.37            |
| STS                 | 66.24 | 67.53 | 67.91     | 68.22 | 68.27  | 61.85      | 62.46    | 78.90            |
| CQA DupStack        | 18.76 | 22.54 | 24.12     | 24.59 | 24.83  | 15.47      | 16.79    | 41.32            |
| SummEval            | 30.79 | 29.99 | 30.99     | 29.56 | 29.39  | 28.87      | 30.49    | 30.81            |

The l2_supercat is a Llama 2-vocabulary model. To train this model, I concatenated codebooks from several models, including Llama 2 70B and Phi-3 medium (after removing additional special tokens). Because several models have used the Llama 2 tokenizer, their codebooks can be concatenated and trained together. The performance of the resulting model is comparable to training on the Llama 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).
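
A rough sketch of the concatenation step, assuming both codebooks share the 32k Llama 2 vocabulary and are joined along the embedding dimension (paths and tensor names are illustrative; the exact recipe is in the training scripts):

import numpy as np
from safetensors.numpy import load_file

llama2 = load_file("llama2-70b/model-0001-of-00XX.safetensors")["model.embed_tokens.weight"]
phi3 = load_file("phi3-medium/model-0001-of-00XX.safetensors")["model.embed_tokens.weight"]

# drop any extra special tokens so both codebooks cover the same 32k vocabulary
supercat = np.concatenate([llama2, phi3[: llama2.shape[0]]], axis=1)  # (32k, d1 + d2)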

Other Models

Results

Llama3-based: l3_supercat

Embed Text

Here’s how you can load pre-trained embeddings and use them to embed text:

from wordllama import WordLlama

# Load pre-trained embeddings
# truncate dimension to 64
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["the quick brown fox jumps over the lazy dog", "and all that jazz"])
print(embeddings.shape)  # (2, 64)
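
Because the result is an ordinary numpy array, downstream similarity math is a one-liner; for instance, cosine similarity between the two embeddings above:

import numpy as np

a, b = embeddings
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)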

Binary embedding models can be used like this:

# Binary embeddings are packed into uint64
# 64-dims => array of 1x uint64 
wl = WordLlama.load(trunc_dim=64, binary=True)  # this will download the binary model from huggingface
wl.embed("I went to the car") # Output: array([[3029168427562626]], dtype=uint64)

# load a binary model trained with the straight-through estimator
wl = WordLlama.load(dim=1024, binary=True)

# similarity uses hamming distance for binary embeddings
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.57421875

ranked_docs = wl.rank("i went to the car", ["van", "truck"])

wl.binary = False # turn off hamming and use cosine

# load a different model class
wl = WordLlama.load(config="l3_supercat", dim=1024) # downloads model from HF
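
For intuition, hamming similarity on packed vectors reduces to an XOR plus a bit count; a numpy-only sketch of the idea (not the library's exact implementation):

import numpy as np

def hamming_similarity(a, b):
    # a, b: uint64-packed binary embeddings of the same shape
    differing = np.unpackbits(np.bitwise_xor(a, b).view(np.uint8)).sum()
    total_bits = a.size * 64
    return 1.0 - differing / total_bits  # 1.0 means identical bit patterns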

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions; either 512 or 1024 dimensions is recommended for binary embeddings.

L2 Supercat was trained using a batch size of 512 on a single A100 for 12 hours.

Roadmap

  • Working on adding inference features:
    • Semantic text splitting
  • Add example notebooks
    • DSPy evaluators
    • RAG pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for Llama 3 models). You can then use the following snippet:

from wordllama.extract import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

HINT: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest, sometimes you have to snoop around and figure it out.
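
One way to snoop is to list the tensor names in each shard with the safetensors library before committing to a file; a hedged sketch (the key shown is typical for Llama-family checkpoints, but not guaranteed):

from safetensors import safe_open

with safe_open("path/to/saved/model-0001-of-00XX.safetensors", framework="np") as f:
    for name in f.keys():
        print(name)  # look for something like "model.embed_tokens.weight"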

For training, use the scripts in the GitHub repo. You will need to add a configuration file (copy and modify an existing one into the folder).

$ pip install wordllama[train]
$ python train.py train --config your_new_config
(training stuff happens)
$ python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
(saves 1 model per matryoshka dim)

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.2.5}
}

License

This project is licensed under the MIT License.

