
WordLlama Embedding Utility

Project description

WordLlama 📝🦙

WordLlama is a fast, lightweight NLP toolkit designed for tasks like fuzzy deduplication, similarity computation, ranking, clustering, and semantic text splitting. It operates with minimal inference-time dependencies and is optimized for CPU hardware, making it suitable for deployment in resource-constrained environments.

News and Updates 🔥

Quick Start

Install WordLlama via pip:

pip install wordllama

Load the default 256-dimensional model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Features

  • Fast Embeddings: Efficiently generate text embeddings using a simple token lookup with average pooling.
  • Similarity Computation: Calculate cosine similarity between texts.
  • Ranking: Rank documents based on their similarity to a query.
  • Fuzzy Deduplication: Remove duplicate texts based on a similarity threshold.
  • Clustering: Cluster documents into groups using KMeans clustering.
  • Filtering: Filter documents based on their similarity to a query.
  • Top-K Retrieval: Retrieve the top-K most similar documents to a query.
  • Semantic Text Splitting: Split text into semantically coherent chunks.
  • Binary Embeddings: Support for binary embeddings with Hamming similarity for even faster computations.
  • Matryoshka Representations: Truncate embedding dimensions as needed for flexibility.
  • Low Resource Requirements: Optimized for CPU inference with minimal dependencies.

What is WordLlama?

WordLlama is a utility for natural language processing (NLP) that recycles components from large language models (LLMs) to create efficient and compact word representations, similar to GloVe, Word2Vec, or FastText.

WordLlama begins by extracting the token embedding codebook from state-of-the-art LLMs (e.g., LLaMA 2, LLaMA 3 70B) and then trains a small context-less model within a general-purpose embedding framework. The result is a lightweight model that improves on traditional word models like GloVe 300d across all MTEB benchmarks while being substantially smaller (e.g., the default 256-dimensional model is 16MB).

WordLlama's key features include:

  1. Matryoshka Representations: Allows for truncation of the embedding dimension as needed, providing flexibility in model size and performance.
  2. Low Resource Requirements: Utilizes a simple token lookup with average pooling, enabling fast operation on CPUs without the need for GPUs.
  3. Binary Embeddings: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations.
  4. Numpy-only Inference: Lightweight inference pipeline relying solely on NumPy, facilitating easy deployment and integration.

Because it is fast and compact, WordLlama serves as a versatile tool for exploratory analysis and utility applications, such as evaluating LLM outputs or handling preparatory tasks in multi-hop or agentic workflows.
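
The core mechanism is easy to picture. Below is a minimal NumPy sketch of the token-lookup-plus-average-pooling step, using a small toy codebook and made-up token IDs; it illustrates the idea only and is not the WordLlama API or a real tokenizer.

import numpy as np

# Toy stand-in for an extracted token embedding codebook: vocab_size x dim.
# Real codebooks are much larger (e.g., 32k or 128k rows).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1000, 256)).astype(np.float32)

# Hypothetical token IDs produced by a tokenizer for one sentence.
token_ids = [101, 254, 73, 102]

token_vectors = codebook[token_ids]               # simple token lookup
sentence_embedding = token_vectors.mean(axis=0)   # average pooling
print(sentence_embedding.shape)                   # (256,)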

MTEB Results

The following table presents the performance of WordLlama models compared to other similar models.

| Metric              | WL64  | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2 |
|---------------------|-------|-------|-----------|-------|--------|------------|----------|------------------|
| Clustering          | 30.27 | 32.20 | 33.25     | 33.40 | 33.62  | 27.73      | 26.57    | 42.35            |
| Reranking           | 50.38 | 51.52 | 52.03     | 52.32 | 52.39  | 43.29      | 44.75    | 58.04            |
| Classification      | 53.14 | 56.25 | 58.21     | 59.13 | 59.50  | 57.29      | 57.65    | 63.05            |
| Pair Classification | 75.80 | 77.59 | 78.22     | 78.50 | 78.60  | 70.92      | 72.94    | 82.37            |
| STS                 | 66.24 | 67.53 | 67.91     | 68.22 | 68.27  | 61.85      | 62.46    | 78.90            |
| CQA DupStack        | 18.76 | 22.54 | 24.12     | 24.59 | 24.83  | 15.47      | 16.79    | 41.32            |
| SummEval            | 30.79 | 29.99 | 30.99     | 29.56 | 29.39  | 28.87      | 30.49    | 30.81            |

WL64 to WL1024: WordLlama models with embedding dimensions ranging from 64 to 1024. WL256 (X) is the default model.

Note: The l2_supercat is a LLaMA 2 vocabulary model. To train this model, we concatenated codebooks from several models, including LLaMA 2 70B and phi 3 medium, after removing additional special tokens. Because several models have used the LLaMA 2 tokenizer, their codebooks can be concatenated and trained together. The performance of the resulting model is comparable to training the LLaMA 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).
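
One way to picture the concatenation (a toy sketch, not the training code): because the models share the LLaMA 2 tokenizer, their codebooks have the same 32k rows and can, presumably, be stacked per token along the feature dimension. The widths below are placeholders, not the real hidden sizes.

import numpy as np

# Toy illustration only: two codebooks over the same 32k vocabulary,
# stacked column-wise so each token gets features from both models.
vocab_size = 32000
codebook_a = np.zeros((vocab_size, 8), dtype=np.float32)
codebook_b = np.zeros((vocab_size, 5), dtype=np.float32)

supercat = np.concatenate([codebook_a, codebook_b], axis=1)
print(supercat.shape)  # (32000, 13)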

Other Models

How Fast? ⚡

8k documents from the ag_news dataset

  • Single core performance (CPU), i9 12th gen, DDR4 3200
  • NVIDIA A4500 (GPU)

[Benchmark plot: WordLlama embedding throughput on 8k ag_news documents, CPU vs. GPU]

Usage

Embedding Text

Load pre-trained embeddings and embed text:

from wordllama import WordLlama

# Load pre-trained embeddings (truncate dimension to 64)
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["The quick brown fox jumps over the lazy dog", "And all that jazz"])
print(embeddings.shape)  # Output: (2, 64)

Calculating Similarity

Compute the similarity between two texts:

similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664
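
As a sanity check (assuming the dense models use cosine similarity, as described under Features), the score should closely match cosine similarity computed directly on the embeddings:

import numpy as np

a, b = wl.embed(["I went to the car", "I went to the pawn shop"])
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(cosine), 4))  # expected to be close to similarity_score above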

Ranking Documents

Rank documents based on their similarity to a query:

query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Fuzzy Deduplication

Remove duplicate texts based on a similarity threshold:

deduplicated_docs = wl.deduplicate(candidates, threshold=0.5)
print(deduplicated_docs)
# Output:
# ['I went to the park',
#  'I went to the shop',
#  'I went to the truck']

Clustering

Cluster documents into groups using KMeans clustering:

labels, inertia = wl.cluster(candidates, k=3, max_iterations=100, tolerance=1e-4, n_init=3)
print(labels, inertia)
# Output:
# [2, 0, 1, 1], 0.4150
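
To map the labels back to the documents (assuming labels are returned in the same order as the input candidates):

for doc, label in zip(candidates, labels):
    print(f"cluster {label}: {doc}")
# Output:
# cluster 2: I went to the park
# cluster 0: I went to the shop
# cluster 1: I went to the truck
# cluster 1: I went to the vehicle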

Filtering

Filter documents based on their similarity to a query:

filtered_docs = wl.filter(query, candidates, threshold=0.3)
print(filtered_docs)
# Output:
# ['I went to the vehicle']

Top-K Retrieval

Retrieve the top-K most similar documents to a query:

top_docs = wl.topk(query, candidates, k=2)
print(top_docs)
# Output:
# ['I went to the vehicle', 'I went to the truck']
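
Since rank() returns candidates sorted by similarity, topk() should agree with slicing the top of the ranked list (a quick sanity check, assuming both use the same similarity measure):

top_from_rank = [doc for doc, score in wl.rank(query, candidates)[:2]]
print(top_from_rank)
# Output:
# ['I went to the vehicle', 'I went to the truck']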

Semantic Text Splitting

Split text into semantic chunks:

long_text = "Your very long text goes here... " * 100
chunks = wl.split(long_text, target_size=1536)

print(list(map(len, chunks)))
# Output: [1055, 1055, 1187]

Note that the target size is also the maximum chunk size. The .split() method aggregates sections up to target_size while preserving the order of the text and, as far as possible, sentence and paragraph structure. It uses WordLlama embeddings to locate natural split points, so the output contains a range of chunk sizes up to the target size.

The recommended target size is 512 to 2048 characters, with a default of 1536. Chunks that need to be much larger should generally be batched after splitting; they will often already be aggregated from multiple semantic chunks.
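
A quick check of the behavior described above, assuming target_size is a hard upper bound on chunk length:

chunks = wl.split(long_text, target_size=512)
assert all(len(chunk) <= 512 for chunk in chunks)
print(len(chunks), max(len(chunk) for chunk in chunks))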

For more information see: technical overview

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions, and either 512 or 1024 dimensions are recommended for binary embeddings.
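
For intuition, here is a NumPy-only illustration of how sign-binarized embeddings can be packed into small integer arrays and compared with Hamming distance; this shows the general technique, not the WordLlama API.

import numpy as np

dim = 512
rng = np.random.default_rng(0)
a_bits = (rng.normal(size=dim) > 0).astype(np.uint8)  # sign-binarized vector
b_bits = (rng.normal(size=dim) > 0).astype(np.uint8)

a_packed = np.packbits(a_bits)  # 512 bits -> 64 uint8 values
b_packed = np.packbits(b_bits)

hamming = int(np.unpackbits(np.bitwise_xor(a_packed, b_packed)).sum())
print(hamming)  # number of differing bits; lower means more similar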

The L2 Supercat model was trained using a batch size of 512 on a single A100 GPU for 12 hours.

Roadmap

  • Adding Inference Features:
    • Semantic text splitting (completed)
  • Additional Example Notebooks:
    • DSPy evaluators
    • Retrieval-Augmented Generation (RAG) pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for LLaMA models). You can then use the following snippet:

from wordllama.extract.extract_safetensors import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

Hint: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest; sometimes you have to inspect and figure it out.

For training, use the scripts in the GitHub repository. You have to add a configuration file (copy/modify an existing one into the folder).

pip install wordllama[train]
python train.py train --config your_new_config
# (Training process begins)
python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
# (Saves one model per Matryoshka dimension)

Community Projects

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.2.6}
}

License

This project is licensed under the MIT License.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

| File | Size | Python | Platform |
|------|------|--------|----------|
| wordllama-0.2.10.post0-cp312-cp312-win_amd64.whl | 16.8 MB | CPython 3.12 | Windows x86-64 |
| wordllama-0.2.10.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 18.7 MB | CPython 3.12 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.10.post0-cp312-cp312-macosx_12_0_arm64.whl | 17.0 MB | CPython 3.12 | macOS 12.0+ ARM64 |
| wordllama-0.2.10.post0-cp312-cp312-macosx_10_13_x86_64.whl | 17.1 MB | CPython 3.12 | macOS 10.13+ x86-64 |
| wordllama-0.2.10.post0-cp311-cp311-win_amd64.whl | 16.8 MB | CPython 3.11 | Windows x86-64 |
| wordllama-0.2.10.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 18.8 MB | CPython 3.11 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.10.post0-cp311-cp311-macosx_12_0_arm64.whl | 17.0 MB | CPython 3.11 | macOS 12.0+ ARM64 |
| wordllama-0.2.10.post0-cp311-cp311-macosx_10_9_x86_64.whl | 17.1 MB | CPython 3.11 | macOS 10.9+ x86-64 |
| wordllama-0.2.10.post0-cp310-cp310-win_amd64.whl | 16.8 MB | CPython 3.10 | Windows x86-64 |
| wordllama-0.2.10.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 18.6 MB | CPython 3.10 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.10.post0-cp310-cp310-macosx_12_0_arm64.whl | 17.0 MB | CPython 3.10 | macOS 12.0+ ARM64 |
| wordllama-0.2.10.post0-cp310-cp310-macosx_10_9_x86_64.whl | 17.1 MB | CPython 3.10 | macOS 10.9+ x86-64 |
| wordllama-0.2.10.post0-cp39-cp39-win_amd64.whl | 16.8 MB | CPython 3.9 | Windows x86-64 |
| wordllama-0.2.10.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 18.6 MB | CPython 3.9 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.10.post0-cp39-cp39-macosx_12_0_arm64.whl | 17.0 MB | CPython 3.9 | macOS 12.0+ ARM64 |
| wordllama-0.2.10.post0-cp39-cp39-macosx_10_9_x86_64.whl | 17.1 MB | CPython 3.9 | macOS 10.9+ x86-64 |
| wordllama-0.2.10.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 18.7 MB | CPython 3.8 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.10.post0-cp38-cp38-macosx_10_9_x86_64.whl | 17.1 MB | CPython 3.8 | macOS 10.9+ x86-64 |

