
WordLlama

The power of 15 trillion tokens of training, extracted, flogged and minimized into a cute little package for word embedding.


Quick Start

Install:

pip install wordllama

Load the 256-dim model.

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882

# Rank documents based on their similarity to a query
query = "i went to the car"
candidates = ["i went to the park", "i went to the shop", "i went to the truck", "i went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('i went to the vehicle', 0.7441646856486314),
#   ('i went to the truck', 0.2832691551894259),
#   ('i went to the shop', 0.19732814982305436),
#   ('i went to the park', 0.15101404519322253)
# ]

# additional inference methods
wl.deduplicate(candidates, threshold=0.8) # fuzzy deduplication
wl.cluster(candidates, k=2, max_iterations=100, tolerance=1e-4) # labels using kmeans/kmeans++ init
wl.filter(query, candidates, threshold=0.3) # filter candidates based on query
wl.topk(query, candidates, k=3) # return topk strings based on query

What is it?

WordLlama is a lightweight NLP utility and word embedding model that recycles components from large language models (LLMs) to create efficient and compact word representations, similar in spirit to GloVe, Word2Vec or FastText. WordLlama begins by extracting the token embedding codebook from a state-of-the-art LLM (e.g., Llama3 70B) and training a small, context-less model in a general-purpose embedding framework.

WordLlama improves on word models like GloVe 300d across all MTEB benchmarks, while being substantially smaller in size (16MB default model @ 256-dim vs. >2GB).

Features of WordLlama include:

  1. Matryoshka Representations: Truncate embedding dimension as needed.
  2. Low Resource Requirements: A simple token lookup with average pooling enables fast operation on CPU.
  3. Binarization: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations. (coming soon)
  4. Numpy-only inference: Lightweight and simple.

For flexibility, WordLlama employs the Matryoshka representation learning technique during training. The largest model (1024-dim) can be truncated to 64, 128, 256 or 512 dimensions. For binary embedding models, we implement straight-through estimators during training. For dense embeddings, 256 dimensions sufficiently capture most of the performance, while for binary embeddings validation accuracy is close to saturation at 512 dimensions (64 bytes packed).
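
Conceptually, Matryoshka truncation just keeps the leading dimensions of a vector and re-normalizes; below is a minimal numpy sketch of the idea (the full_embedding array is illustrative, not the library's internal representation):

import numpy as np

# An illustrative full 1024-dim embedding (random values as a stand-in)
full_embedding = np.random.randn(1024).astype(np.float32)

# Matryoshka truncation: keep the leading 256 dimensions...
truncated = full_embedding[:256]

# ...and re-normalize so cosine similarity stays meaningful
truncated = truncated / np.linalg.norm(truncated)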

The final weights are saved after weighting, projection and truncation of the entire tokenizer vocabulary. Thus, WordLlama becomes a single embedding matrix (nn.Embedding) that is considerably smaller than the gigabyte-sized LLM codebooks we start with. The original tokenizer is still used to preprocess the text into tokens, and the reduced-size token embeddings are average pooled. There is very little computation required, and the resulting model sizes range from 16MB to 250MB for the 128k Llama3 vocabulary.
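
In practice, inference reduces to an embedding lookup followed by average pooling over token vectors. A rough numpy sketch of that pipeline, where token_ids and embedding_matrix stand in for the tokenizer output and the saved weight matrix:

import numpy as np

# Stand-in for the saved embedding matrix: (vocab_size, dim)
embedding_matrix = np.random.randn(128_256, 256).astype(np.float32)

# Stand-in for the tokenizer output for one input string
token_ids = np.array([5, 991, 47, 12, 3004])

# Look up token vectors and average-pool into a single sentence vector
token_vectors = embedding_matrix[token_ids]    # (num_tokens, dim)
sentence_vector = token_vectors.mean(axis=0)   # (dim,)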

It's a good option for lightweight NLP tasks. You can train sklearn classifiers on it, perform basic semantic matching, fuzzy deduplication, ranking and clustering. I think it should work well for creating LLM output evaluators, or for other preparatory tasks in multi-hop or agentic workflows. You can perform your own LLM surgery and train your own model on consumer GPUs in a few hours. Because it is fast and compact, it makes a good "Swiss-Army Knife" utility for exploratory analysis and utility applications.
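
For example, the embeddings drop straight into scikit-learn as features; here is a small sketch with made-up texts and labels (purely illustrative):

from sklearn.linear_model import LogisticRegression
from wordllama import WordLlama

wl = WordLlama.load()

texts = [
    "great product, would buy again",
    "terrible, broke after a day",
    "works exactly as described",
    "total waste of money",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (made up for the example)

# Embed the texts and fit a simple classifier on the vectors
X = wl.embed(texts)
clf = LogisticRegression().fit(X, labels)

print(clf.predict(wl.embed(["pretty happy with this purchase"])))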

MTEB Results (l2_supercat)

| Metric              | WL64  | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2 |
|---------------------|-------|-------|-----------|-------|--------|------------|----------|------------------|
| Clustering          | 30.27 | 32.20 | 33.25     | 33.40 | 33.62  | 27.73      | 26.57    | 42.35            |
| Reranking           | 50.38 | 51.52 | 52.03     | 52.32 | 52.39  | 43.29      | 44.75    | 58.04            |
| Classification      | 53.14 | 56.25 | 58.21     | 59.13 | 59.50  | 57.29      | 57.65    | 63.05            |
| Pair Classification | 75.80 | 77.59 | 78.22     | 78.50 | 78.60  | 70.92      | 72.94    | 82.37            |
| STS                 | 66.24 | 67.53 | 67.91     | 68.22 | 68.27  | 61.85      | 62.46    | 78.90            |
| CQA DupStack        | 18.76 | 22.54 | 24.12     | 24.59 | 24.83  | 15.47      | 16.79    | 41.32            |
| SummEval            | 30.79 | 29.99 | 30.99     | 29.56 | 29.39  | 28.87      | 30.49    | 30.81            |

The l2_supercat is a Llama2-vocabulary model. To train this model, I concatenated codebooks from several models, including Llama2 70B and Phi-3 medium (after removing additional special tokens). Because several models have used the Llama2 tokenizer, their codebooks can be concatenated and trained together. Performance of the resulting model is comparable to training on the Llama3 70B codebook, while being 4x smaller (32k vs 128k vocabulary).
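
The concatenation here is along the feature axis: the models share the Llama2 vocabulary, so rows align by token id. A rough numpy sketch of the idea (the shapes are placeholders, not the actual hidden sizes):

import numpy as np

# Placeholder codebooks from two Llama2-tokenizer models: (32_000, dim_a) and (32_000, dim_b)
codebook_a = np.random.randn(32_000, 512).astype(np.float32)
codebook_b = np.random.randn(32_000, 256).astype(np.float32)

# Rows align by token id, so the codebooks can be stacked feature-wise before training the projection
supercat = np.concatenate([codebook_a, codebook_b], axis=1)  # (32_000, dim_a + dim_b)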

Embed Text

Here’s how you can load pre-trained embeddings and use them to embed text:

from wordllama import WordLlama

# Load pre-trained embeddings
# truncate dimension to 64
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["the quick brown fox jumps over the lazy dog", "and all that jazz"])
print(embeddings.shape)  # (2, 64)

Binary embedding models can be used like this:

# Binary embeddings are packed into uint32
# 64-dims => array of 2x uint32 
wl = WordLlama.load(trunc_dim=64, binary=True)  # this will download the binary model from huggingface
wl.embed("I went to the car") # Output: array([[3029168104, 2427562626]], dtype=uint32)

# load a binary model trained with the straight-through estimator
wl = WordLlama.load(dim=1024, binary=True)

# Uses Hamming similarity for binarized embeddings
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.57421875

ranked_docs = wl.rank("i went to the car", ["van", "truck"])

wl.binary = False # turn off hamming and use cosine

# load a different model class
wl = WordLlama.load(config="l3_supercat", dim=1024) # downloads model from HF
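
Under the hood, similarity on packed binary vectors comes down to XOR plus a bit count. A hypothetical helper illustrating the computation (not the library's own implementation):

import numpy as np

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of matching bits between two uint32-packed binary embeddings."""
    diff = np.bitwise_xor(a, b)                         # differing bits set to 1
    n_diff = np.unpackbits(diff.view(np.uint8)).sum()   # popcount over all words
    return 1.0 - n_diff / (a.size * 32)                 # 32 bits per uint32 word

a = np.array([3029168104, 2427562626], dtype=np.uint32)
b = np.array([3029168104, 2427562627], dtype=np.uint32)
print(hamming_similarity(a, b))  # the two vectors differ by a single bit: 0.984375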

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions; either 512 or 1024 dimensions is recommended for binary embeddings.

L2 Supercat was trained using a batch size of 512 on a single A100 for 12 hours.

Roadmap

  • Working on adding inference features:
    • Semantic text splitting
  • Add example notebooks
    • DSPy evaluators
    • RAG pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for Llama3 models). You can then use the following snippet:

from wordllama.extract import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

HINT: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest, sometimes you have to snoop around and figure it out.
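
One way to snoop is to list the tensor names in a shard with the safetensors library before deciding what to extract; a quick sketch (the path is a placeholder, as above):

from safetensors import safe_open

# Print the tensor names in a shard to locate the token embedding weights
with safe_open("path/to/saved/model-0001-of-00XX.safetensors", framework="np") as f:
    for name in f.keys():
        print(name)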

For training, use the scripts in the GitHub repo. You will need to add a configuration file (copy and modify an existing one into the configuration folder).

$ pip install wordllama[train]
$ python train.py train --config your_new_config
(training stuff happens)
$ python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
(saves 1 model per matryoshka dim)

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.2.3}
}

License

This project is licensed under the MIT License.


