WordLlama Embedding Utility

Project description

WordLlama 📝🦙

WordLlama is a fast, lightweight NLP toolkit designed for tasks like fuzzy deduplication, similarity computation, ranking, clustering, and semantic text splitting. It operates with minimal inference-time dependencies and is optimized for CPU hardware, making it suitable for deployment in resource-constrained environments.

News and Updates 🔥

Quick Start

Install WordLlama via pip:

pip install wordllama

Load the default 256-dimensional model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Features

  • Fast Embeddings: Efficiently generate text embeddings using a simple token lookup with average pooling.
  • Similarity Computation: Calculate cosine similarity between texts.
  • Ranking: Rank documents based on their similarity to a query.
  • Fuzzy Deduplication: Remove duplicate texts based on a similarity threshold.
  • Clustering: Cluster documents into groups using KMeans clustering.
  • Filtering: Filter documents based on their similarity to a query.
  • Top-K Retrieval: Retrieve the top-K most similar documents to a query.
  • Semantic Text Splitting: Split text into semantically coherent chunks.
  • Binary Embeddings: Support for binary embeddings with Hamming similarity for even faster computations.
  • Matryoshka Representations: Truncate embedding dimensions as needed for flexibility.
  • Low Resource Requirements: Optimized for CPU inference with minimal dependencies.

What is WordLlama?

WordLlama is a utility for natural language processing (NLP) that recycles components from large language models (LLMs) to create efficient and compact word representations, similar to GloVe, Word2Vec, or FastText.

WordLlama begins by extracting the token embedding codebook from state-of-the-art LLMs (e.g., LLaMA 2, LLaMA 3 70B) and then trains a small context-less model within a general-purpose embedding framework. This approach yields a lightweight model that improves on traditional word models like GloVe 300d across all MTEB benchmarks, while being substantially smaller in size (e.g., the default 256-dimensional model is 16MB).

WordLlama's key features include:

  1. Matryoshka Representations: Allows for truncation of the embedding dimension as needed, providing flexibility in model size and performance.
  2. Low Resource Requirements: Utilizes a simple token lookup with average pooling, enabling fast operation on CPUs without the need for GPUs.
  3. Binary Embeddings: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations (both ideas are sketched in code after this list).
  4. Numpy-only Inference: Lightweight inference pipeline relying solely on NumPy, facilitating easy deployment and integration.
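
Below is a minimal, self-contained sketch of the pooling (2) and binary Hamming (3) ideas; it is not the library's internals, and the codebook, dimensions, and token IDs are hypothetical stand-ins.

import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((32_000, 256)).astype(np.float32)  # (vocab, dim)

def embed_ids(token_ids):
    """Look up each token's vector and average-pool into one embedding."""
    return codebook[token_ids].mean(axis=0)  # shape: (dim,)

def binarize(embedding):
    """Sign-binarize and pack into a compact uint8 array (dim / 8 bytes)."""
    return np.packbits(embedding > 0)

def hamming(a_bits, b_bits):
    """Hamming distance between two packed binary embeddings."""
    return int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())

a = embed_ids([101, 2057, 2253])
b = embed_ids([101, 2057, 4497])
print(a.shape, hamming(binarize(a), binarize(b)))  # (256,) and an integer bit distance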

Because it is fast and compact, WordLlama serves as a versatile tool for exploratory analysis and utility applications, such as LLM output evaluation or preparatory tasks in multi-hop or agentic workflows.

MTEB Results

The following table presents the performance of WordLlama models compared to other similar models.

Metric               WL64   WL128  WL256 (X)  WL512  WL1024  GloVe 300d  Komninos  all-MiniLM-L6-v2
Clustering           30.27  32.20  33.25      33.40  33.62   27.73       26.57     42.35
Reranking            50.38  51.52  52.03      52.32  52.39   43.29       44.75     58.04
Classification       53.14  56.25  58.21      59.13  59.50   57.29       57.65     63.05
Pair Classification  75.80  77.59  78.22      78.50  78.60   70.92       72.94     82.37
STS                  66.24  67.53  67.91      68.22  68.27   61.85       62.46     78.90
CQA DupStack         18.76  22.54  24.12      24.59  24.83   15.47       16.79     41.32
SummEval             30.79  29.99  30.99      29.56  29.39   28.87       30.49     30.81

WL64 to WL1024: WordLlama models with embedding dimensions ranging from 64 to 1024.

Note: The l2_supercat is a LLaMA 2 vocabulary model. To train this model, we concatenated codebooks from several models, including LLaMA 2 70B and phi 3 medium, after removing additional special tokens. Because several models have used the LLaMA 2 tokenizer, their codebooks can be concatenated and trained together. The performance of the resulting model is comparable to training the LLaMA 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).
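
As a rough illustration of the concatenation idea in the note above (an assumption about the mechanics, not the actual training code): because the models share the LLaMA 2 tokenizer, each token ID indexes the same row in every codebook, so the codebooks can be joined along the embedding dimension before training.

import numpy as np

# Illustrative shapes only; real codebooks are loaded from model checkpoints.
llama2_70b_codebook = np.zeros((32_000, 8_192), dtype=np.float32)   # (vocab, dim_a)
phi3_medium_codebook = np.zeros((32_000, 5_120), dtype=np.float32)  # (vocab, dim_b)

# Same row = same token ID, so concatenate feature-wise into one "supercat" codebook.
supercat = np.concatenate([llama2_70b_codebook, phi3_medium_codebook], axis=1)
print(supercat.shape)  # (32000, 13312)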

Other Models

How Fast? ⚡

8k documents from the ag_news dataset

  • Single core performance (CPU), i9 12th gen, DDR4 3200
  • NVIDIA A4500 (GPU)

(Benchmark plot: embedding speed on the 8k ag_news documents, CPU vs. GPU.)
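
A simple way to measure throughput on your own hardware is sketched below; the document list is a stand-in for the 8k ag_news documents, not the project's benchmark harness.

import time
from wordllama import WordLlama

wl = WordLlama.load()
docs = ["Stocks rallied on Friday after the jobs report."] * 8000  # stand-in corpus

start = time.perf_counter()
embeddings = wl.embed(docs)
elapsed = time.perf_counter() - start
print(f"{len(docs) / elapsed:.0f} docs/sec, embeddings shape {embeddings.shape}")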

Usage

Embedding Text

Load pre-trained embeddings and embed text:

from wordllama import WordLlama

# Load pre-trained embeddings (truncate dimension to 64)
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["The quick brown fox jumps over the lazy dog", "And all that jazz"])
print(embeddings.shape)  # Output: (2, 64)

Calculating Similarity

Compute the similarity between two texts:

similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

Ranking Documents

Rank documents based on their similarity to a query:

query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Fuzzy Deduplication

Remove duplicate texts based on a similarity threshold:

deduplicated_docs = wl.deduplicate(candidates, threshold=0.5)
print(deduplicated_docs)
# Output:
# ['I went to the park',
#  'I went to the shop',
#  'I went to the truck']

Clustering

Cluster documents into groups using KMeans clustering:

labels, inertia = wl.cluster(candidates, k=3, max_iterations=100, tolerance=1e-4, n_init=3)
print(labels, inertia)
# Output:
# [2, 0, 1, 1], 0.4150

Filtering

Filter documents based on their similarity to a query:

filtered_docs = wl.filter(query, candidates, threshold=0.3)
print(filtered_docs)
# Output:
# ['I went to the vehicle']

Top-K Retrieval

Retrieve the top-K most similar documents to a query:

top_docs = wl.topk(query, candidates, k=2)
print(top_docs)
# Output:
# ['I went to the vehicle', 'I went to the truck']

Semantic Text Splitting

Split text into semantic chunks:

long_text = "Your very long text goes here... " * 100
chunks = wl.split(long_text, target_size=1536)

print(list(map(len, chunks)))
# Output: [1055, 1055, 1187]

Note that the target size is also the maximum size. The .split() method attempts to aggregate sections up to target_size while preserving the order of the text and, as much as possible, sentence and paragraph structure. It uses WordLlama embeddings to find natural indices to split on, so the output contains a range of chunk sizes up to the target size.

The recommended target size is from 512 to 2048 characters, with the default size at 1536. Chunks that need to be much larger should probably be batched after splitting, and will often be aggregated from multiple semantic chunks already.
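
If you do need larger units downstream, one simple approach (a sketch, not a library feature) is to greedily group the split chunks into batches under a character budget:

def batch_chunks(chunks, max_chars=8192):
    """Greedily group semantic chunks into batches of at most max_chars characters."""
    batches, current, size = [], [], 0
    for chunk in chunks:
        if current and size + len(chunk) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(chunk)
        size += len(chunk)
    if current:
        batches.append(current)
    return batches

batches = batch_chunks(chunks)  # `chunks` from the split example above
print([sum(map(len, batch)) for batch in batches])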

For more information, see the technical overview.

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions, and either 512 or 1024 dimensions are recommended for binary embeddings.

The L2 Supercat model was trained using a batch size of 512 on a single A100 GPU for 12 hours.

Roadmap

  • Adding Inference Features:
    • Semantic text splitting (completed)
  • Additional Example Notebooks:
    • DSPy evaluators
    • Retrieval-Augmented Generation (RAG) pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for LLaMA models). You can then use the following snippet:

from wordllama.extract.extract_safetensors import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

Hint: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest; sometimes you have to inspect and figure it out.
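
To locate the embedding tensor, one quick way to inspect a shard without loading the weights is to list its tensor names and shapes (a sketch using the safetensors library; the expected tensor name varies by model):

from safetensors import safe_open

# Same placeholder path as above; point this at an actual shard.
path = "path/to/saved/model-0001-of-00XX.safetensors"
with safe_open(path, framework="numpy") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())
# Look for a name like "model.embed_tokens.weight" with shape (vocab_size, hidden_dim).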

For training, use the scripts in the GitHub repository. You will need to add a configuration file (copy and modify an existing one in the configuration folder).

pip install wordllama[train]
python train.py train --config your_new_config
# (Training process begins)
python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
# (Saves one model per Matryoshka dimension)

Community Projects

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.2.6}
}

License

This project is licensed under the MIT License.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

  • wordllama-0.3.0.post0-cp312-cp312-win_amd64.whl (16.8 MB): CPython 3.12, Windows x86-64
  • wordllama-0.3.0.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.7 MB): CPython 3.12, manylinux (glibc 2.17+) x86-64
  • wordllama-0.3.0.post0-cp312-cp312-macosx_12_0_arm64.whl (17.0 MB): CPython 3.12, macOS 12.0+ ARM64
  • wordllama-0.3.0.post0-cp312-cp312-macosx_10_13_x86_64.whl (17.1 MB): CPython 3.12, macOS 10.13+ x86-64
  • wordllama-0.3.0.post0-cp311-cp311-win_amd64.whl (16.8 MB): CPython 3.11, Windows x86-64
  • wordllama-0.3.0.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.8 MB): CPython 3.11, manylinux (glibc 2.17+) x86-64
  • wordllama-0.3.0.post0-cp311-cp311-macosx_12_0_arm64.whl (17.0 MB): CPython 3.11, macOS 12.0+ ARM64
  • wordllama-0.3.0.post0-cp311-cp311-macosx_10_9_x86_64.whl (17.1 MB): CPython 3.11, macOS 10.9+ x86-64
  • wordllama-0.3.0.post0-cp310-cp310-win_amd64.whl (16.8 MB): CPython 3.10, Windows x86-64
  • wordllama-0.3.0.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.6 MB): CPython 3.10, manylinux (glibc 2.17+) x86-64
  • wordllama-0.3.0.post0-cp310-cp310-macosx_12_0_arm64.whl (17.0 MB): CPython 3.10, macOS 12.0+ ARM64
  • wordllama-0.3.0.post0-cp310-cp310-macosx_10_9_x86_64.whl (17.1 MB): CPython 3.10, macOS 10.9+ x86-64
  • wordllama-0.3.0.post0-cp39-cp39-win_amd64.whl (16.8 MB): CPython 3.9, Windows x86-64
  • wordllama-0.3.0.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.6 MB): CPython 3.9, manylinux (glibc 2.17+) x86-64
  • wordllama-0.3.0.post0-cp39-cp39-macosx_12_0_arm64.whl (17.0 MB): CPython 3.9, macOS 12.0+ ARM64
  • wordllama-0.3.0.post0-cp39-cp39-macosx_10_9_x86_64.whl (17.1 MB): CPython 3.9, macOS 10.9+ x86-64
  • wordllama-0.3.0.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.7 MB): CPython 3.8, manylinux (glibc 2.17+) x86-64
  • wordllama-0.3.0.post0-cp38-cp38-macosx_10_9_x86_64.whl (17.1 MB): CPython 3.8, macOS 10.9+ x86-64

