WordLlama Embedding Utility

Project description

WordLlama 📝🦙

WordLlama is a fast, lightweight NLP toolkit designed for tasks like fuzzy deduplication, similarity computation, ranking, clustering, and semantic text splitting. It operates with minimal inference-time dependencies and is optimized for CPU hardware, making it suitable for deployment in resource-constrained environments.

News and Updates 🔥

Quick Start

Install WordLlama via pip:

pip install wordllama

Load the default 256-dimensional model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Features

  • Fast Embeddings: Efficiently generate text embeddings using a simple token lookup with average pooling.
  • Similarity Computation: Calculate cosine similarity between texts.
  • Ranking: Rank documents based on their similarity to a query.
  • Fuzzy Deduplication: Remove duplicate texts based on a similarity threshold.
  • Clustering: Cluster documents into groups using KMeans clustering.
  • Filtering: Filter documents based on their similarity to a query.
  • Top-K Retrieval: Retrieve the top-K most similar documents to a query.
  • Semantic Text Splitting: Split text into semantically coherent chunks.
  • Binary Embeddings: Support for binary embeddings with Hamming similarity for even faster computations.
  • Matryoshka Representations: Truncate embedding dimensions as needed for flexibility.
  • Low Resource Requirements: Optimized for CPU inference with minimal dependencies.

What is WordLlama?

WordLlama is a utility for natural language processing (NLP) that recycles components from large language models (LLMs) to create efficient and compact word representations, similar to GloVe, Word2Vec, or FastText.

WordLlama starts by extracting the token embedding codebook from state-of-the-art LLMs (e.g., LLaMA 2, LLaMA 3 70B) and then trains a small, context-less model within a general-purpose embedding framework. The result is a lightweight model that outperforms traditional word models like GloVe 300d on all MTEB benchmarks while being substantially smaller (e.g., the default 256-dimensional model is 16MB).

WordLlama's key features include:

  1. Matryoshka Representations: Allows for truncation of the embedding dimension as needed, providing flexibility in model size and performance.
  2. Low Resource Requirements: Utilizes a simple token lookup with average pooling, enabling fast operation on CPUs without the need for GPUs.
  3. Binary Embeddings: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations.
  4. Numpy-only Inference: Lightweight inference pipeline relying solely on NumPy, facilitating easy deployment and integration.
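
The mechanics behind items 1 and 2 above are simple enough to sketch in a few lines of NumPy. This is an illustrative toy, not the actual WordLlama implementation: the codebook weights are random placeholders and the token IDs are made up.

import numpy as np

# Placeholder codebook: in WordLlama this is a trained vocab_size x dim matrix.
vocab_size, dim = 32_000, 256
codebook = np.random.randn(vocab_size, dim).astype(np.float32)

def embed_ids(token_ids):
    """Context-less embedding: token lookup followed by average pooling."""
    vectors = codebook[np.asarray(token_ids)]   # (n_tokens, dim)
    return vectors.mean(axis=0)                 # (dim,)

emb_256 = embed_ids([101, 2057, 2253, 2000, 1996, 2482])  # made-up token IDs

# Matryoshka-style truncation just keeps the leading dimensions
# (truncated vectors are typically re-normalized before cosine similarity).
emb_64 = emb_256[:64]
print(emb_256.shape, emb_64.shape)  # (256,) (64,)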

Because it is fast and compact, WordLlama serves as a versatile tool for exploratory analysis and utility applications, such as LLM output evaluators or preparatory tasks in multi-hop or agentic workflows.

MTEB Results

The following table presents the performance of WordLlama models compared to other similar models.

Metric              | WL64  | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2
--------------------|-------|-------|-----------|-------|--------|------------|----------|-----------------
Clustering          | 30.27 | 32.20 | 33.25     | 33.40 | 33.62  | 27.73      | 26.57    | 42.35
Reranking           | 50.38 | 51.52 | 52.03     | 52.32 | 52.39  | 43.29      | 44.75    | 58.04
Classification      | 53.14 | 56.25 | 58.21     | 59.13 | 59.50  | 57.29      | 57.65    | 63.05
Pair Classification | 75.80 | 77.59 | 78.22     | 78.50 | 78.60  | 70.92      | 72.94    | 82.37
STS                 | 66.24 | 67.53 | 67.91     | 68.22 | 68.27  | 61.85      | 62.46    | 78.90
CQA DupStack        | 18.76 | 22.54 | 24.12     | 24.59 | 24.83  | 15.47      | 16.79    | 41.32
SummEval            | 30.79 | 29.99 | 30.99     | 29.56 | 29.39  | 28.87      | 30.49    | 30.81

WL64 to WL1024: WordLlama models with embedding dimensions ranging from 64 to 1024; (X) marks the default 256-dimensional model.

Note: The l2_supercat is a LLaMA 2 vocabulary model. To train this model, we concatenated codebooks from several models, including LLaMA 2 70B and phi 3 medium, after removing additional special tokens. Because several models have used the LLaMA 2 tokenizer, their codebooks can be concatenated and trained together. The performance of the resulting model is comparable to training the LLaMA 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).
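
As a rough sketch of the codebook concatenation idea, assuming "concatenating codebooks" means stacking the embedding matrices of models that share the LLaMA 2 tokenizer along the feature axis after trimming extra special tokens (the shapes below are small placeholders, not the real model dimensions):

import numpy as np

# Illustrative embedding matrices from two models sharing the 32k LLaMA 2 vocabulary.
emb_a = np.random.randn(32_000, 512).astype(np.float32)  # placeholder for a large model's codebook
emb_b = np.random.randn(32_064, 384).astype(np.float32)  # placeholder; 64 extra special tokens

shared_vocab = 32_000
# Trim to the shared vocabulary, then concatenate along the feature axis.
supercat = np.concatenate([emb_a[:shared_vocab], emb_b[:shared_vocab]], axis=1)
print(supercat.shape)  # (32000, 896)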

Other Models

How Fast? ⚡

8k documents from the ag_news dataset

  • Single core performance (CPU), i9 12th gen, DDR4 3200
  • NVIDIA A4500 (GPU)

Usage

Embedding Text

Load pre-trained embeddings and embed text:

from wordllama import WordLlama

# Load pre-trained embeddings (truncate dimension to 64)
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["The quick brown fox jumps over the lazy dog", "And all that jazz"])
print(embeddings.shape)  # Output: (2, 64)

Calculating Similarity

Compute the similarity between two texts:

similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

Ranking Documents

Rank documents based on their similarity to a query:

query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates, sort=True)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Fuzzy Deduplication

Remove duplicate texts based on a similarity threshold:

deduplicated_docs = wl.deduplicate(candidates, return_indices=False, threshold=0.5)
print(deduplicated_docs)
# Output:
# ['I went to the park',
#  'I went to the shop',
#  'I went to the truck']

Clustering

Cluster documents into groups using KMeans clustering:

labels, inertia = wl.cluster(candidates, k=3, max_iterations=100, tolerance=1e-4, n_init=3)
print(labels, inertia)
# Output:
# [2, 0, 1, 1], 0.4150

Filtering

Filter documents based on their similarity to a query:

filtered_docs = wl.filter(query, candidates, threshold=0.3)
print(filtered_docs)
# Output:
# ['I went to the vehicle']

Top-K Retrieval

Retrieve the top-K most similar documents to a query:

top_docs = wl.topk(query, candidates, k=2)
print(top_docs)
# Output:
# ['I went to the vehicle', 'I went to the truck']

Semantic Text Splitting

Split text into semantic chunks:

long_text = "Your very long text goes here... " * 100
chunks = wl.split(long_text, target_size=1536)

print(list(map(len, chunks)))
# Output: [1055, 1055, 1187]

Note that the target size is also the maximum size. The .split() method aggregates sections up to target_size while preserving the order of the text and, as far as possible, sentence and paragraph structure. It uses WordLlama embeddings to find natural points to split on, so the output contains a range of chunk sizes up to the target size.

The recommended target size is 512 to 2048 characters, with a default of 1536. Chunks that need to be much larger should be batched after splitting; they will often already be aggregates of multiple semantic chunks.
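
Because the splitter returns plain strings, the chunks can be fed straight back into the other utilities. For example, a simple retrieval pass over the chunks (the query here is just an illustration):

# Retrieve the chunks most relevant to a query after splitting
chunks = wl.split(long_text, target_size=1536)
best_chunks = wl.topk("what does the text say about jazz?", chunks, k=2)
print(best_chunks)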

For more information see: technical overview

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions, and either 512 or 1024 dimensions are recommended for binary embeddings.
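
The Hamming trick itself is easy to picture in plain NumPy. The sketch below is independent of the WordLlama API; it just shows how sign-binarized embeddings can be packed into bytes and compared by counting differing bits:

import numpy as np

def pack_binary(embedding):
    """Binarize by sign and pack 8 bits per uint8 byte."""
    return np.packbits(embedding > 0, axis=-1)

def hamming_similarity(a_packed, b_packed):
    """1 minus the fraction of bits that differ between two packed codes."""
    differing = np.unpackbits(np.bitwise_xor(a_packed, b_packed), axis=-1).sum(axis=-1)
    total_bits = a_packed.shape[-1] * 8
    return 1.0 - differing / total_bits

a = pack_binary(np.random.randn(1024))  # a 1024-d embedding becomes 128 bytes
b = pack_binary(np.random.randn(1024))
print(hamming_similarity(a, b))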

The L2 Supercat model was trained using a batch size of 512 on a single A100 GPU for 12 hours.

Roadmap

  • Adding Inference Features:
    • Semantic text splitting (completed)
  • Additional Example Notebooks:
    • DSPy evaluators
    • Retrieval-Augmented Generation (RAG) pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for LLaMA models). You can then use the following snippet:

from wordllama.extract.extract_safetensors import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

Hint: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest; sometimes you have to inspect and figure it out.
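
One way to do that inspection is to list the tensor names (and shapes) in a shard with the safetensors library and look for the token embedding weight. The tensor name varies by model; "model.embed_tokens.weight" is common for LLaMA-style checkpoints, but treat that as an assumption and check the actual keys:

from safetensors import safe_open

# List tensor names and shapes without loading the full tensors.
# framework="pt" assumes torch is installed.
with safe_open("path/to/saved/model-0001-of-00XX.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())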

For training, use the scripts in the GitHub repository. You will need to add a configuration file (copy an existing one into the folder and modify it).

pip install wordllama[train]
python train.py train --config your_new_config
# (Training process begins)
python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
# (Saves one model per Matryoshka dimension)

Community Projects

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.3.2}
}

License

This project is licensed under the MIT License.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

  • wordllama-0.3.3.post0-cp312-cp312-win_amd64.whl (16.9 MB) - CPython 3.12, Windows x86-64
  • wordllama-0.3.3.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB) - CPython 3.12, manylinux: glibc 2.17+ x86-64
  • wordllama-0.3.3.post0-cp312-cp312-macosx_12_0_arm64.whl (17.1 MB) - CPython 3.12, macOS 12.0+ ARM64
  • wordllama-0.3.3.post0-cp312-cp312-macosx_10_13_x86_64.whl (17.2 MB) - CPython 3.12, macOS 10.13+ x86-64
  • wordllama-0.3.3.post0-cp311-cp311-win_amd64.whl (16.9 MB) - CPython 3.11, Windows x86-64
  • wordllama-0.3.3.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.2 MB) - CPython 3.11, manylinux: glibc 2.17+ x86-64
  • wordllama-0.3.3.post0-cp311-cp311-macosx_12_0_arm64.whl (17.1 MB) - CPython 3.11, macOS 12.0+ ARM64
  • wordllama-0.3.3.post0-cp311-cp311-macosx_10_9_x86_64.whl (17.2 MB) - CPython 3.11, macOS 10.9+ x86-64
  • wordllama-0.3.3.post0-cp310-cp310-win_amd64.whl (16.9 MB) - CPython 3.10, Windows x86-64
  • wordllama-0.3.3.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB) - CPython 3.10, manylinux: glibc 2.17+ x86-64
  • wordllama-0.3.3.post0-cp310-cp310-macosx_12_0_arm64.whl (17.1 MB) - CPython 3.10, macOS 12.0+ ARM64
  • wordllama-0.3.3.post0-cp310-cp310-macosx_10_9_x86_64.whl (17.2 MB) - CPython 3.10, macOS 10.9+ x86-64
  • wordllama-0.3.3.post0-cp39-cp39-win_amd64.whl (16.9 MB) - CPython 3.9, Windows x86-64
  • wordllama-0.3.3.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB) - CPython 3.9, manylinux: glibc 2.17+ x86-64
  • wordllama-0.3.3.post0-cp39-cp39-macosx_12_0_arm64.whl (17.1 MB) - CPython 3.9, macOS 12.0+ ARM64
  • wordllama-0.3.3.post0-cp39-cp39-macosx_10_9_x86_64.whl (17.2 MB) - CPython 3.9, macOS 10.9+ x86-64
  • wordllama-0.3.3.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB) - CPython 3.8, manylinux: glibc 2.17+ x86-64
  • wordllama-0.3.3.post0-cp38-cp38-macosx_10_9_x86_64.whl (17.2 MB) - CPython 3.8, macOS 10.9+ x86-64

File details

Details for the file wordllama-0.3.3.post0-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 886c0d70957ccc0953b7acd397dd93f9418b0ef43dbc4262ac1289270b00bd73
MD5 6afe3f76d6598d4f0900966202e2bf4f
BLAKE2b-256 ea09f64325ac5118ba86bdf80e4e38ce04d242acea85de4b4b2324512d7ea7d3

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 dd10c2fa059a71e349e4f2c1c5ebaf75a2489d3c179626c0f3b0b4756bc1eb66
MD5 432a8e78786b20f9dbf43907d3471832
BLAKE2b-256 26c87e59283002ddd2d67d256a096e5dab18e51c08a8e9c41c523d5eb291a3cd

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp312-cp312-macosx_12_0_arm64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp312-cp312-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 42618e6527de7cba5fd499ab8ae3236fe6043076cc4f6b7aedf5f33685f671e7
MD5 9838c998209918401e74d767e41875a2
BLAKE2b-256 e768dd4b8b6fb5ac1cd793880499c2c69e73b79f5eef98b3f438a84f9d76602f

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp312-cp312-macosx_10_13_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 65ef130c00631f071b8923369ea295fe45481a39e1680069a2ec54d63f9e9cba
MD5 a8e482fbecc30eef137a0bb64ad69845
BLAKE2b-256 58198aaa3dd3c703c26f83ea4556701c91df0e777d8161220ba39a67bf7c615d

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 2e27747f77d6ca4021823e9fa223c6b1e38f74c8af7075fed9df8f75ae7f318c
MD5 205d17b4f4ee5c3998492a76b50a1efa
BLAKE2b-256 b4729a22e85d2458047d1b345fa7a9475e407c3469b065773b3dae6bca45c1f4

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 ad90dad4f53eff7fcfca589e84d916db3e06c8ec82a275a7b4edf8ce67ff3813
MD5 00ff12880aa3a5298a127cbc73712bfe
BLAKE2b-256 81feab5b2cf66e5d0abc8207e7f897519b95896d9fc1d36a9383a0bb0e4558e5

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp311-cp311-macosx_12_0_arm64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp311-cp311-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 f13d41dd32f2d0bfcdc2b9bfe47fcd01be5f52f514f2e522402ae05186446f88
MD5 18c35b9be93ae16166b1783347eb6693
BLAKE2b-256 56ed029cc051969f7b7b7e765fbaf6a8ce7e9c1e871d5b5bcdce182db8c523cf

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 f6b9f9454c96f9f5eb7ea8264be767cb72eebdddbc5afa0d602a6b65ff299900
MD5 3223c022d5c14f5ca1d59a759f64546e
BLAKE2b-256 e694ce9da7fd67282b3fe16717130d100bf786e4a7fec2796d9f97aaee85250e

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 be12712ec116296314813cf7dcfd3e1dfb6c0ba93c90c887b243b9f37437d8cb
MD5 04967e903ca43fad2c6cc8c830909b30
BLAKE2b-256 27fc28cd96e6453025cbb6089ec0a4daa3cafdd55c0ef566b56afb84c750a7e1

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 75482d695d714dbf110c65b8c3d32162daff72da61629289490aa94145c5c4fe
MD5 06ea63edfaed444e9105d87fa86ca7e5
BLAKE2b-256 56f92e5a60022d21b93ecc9593bc752258fea15a7f4f01a47c00688cd1283812

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp310-cp310-macosx_12_0_arm64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp310-cp310-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 65931d07b915804886296f855231cebcd606bc457b4ec267d420c208b63e6559
MD5 99a8f1eaca8de0b06d576a46a4a2aaa2
BLAKE2b-256 5a0ffc2bbf5a16745444a3165ea80f60df4fdcea7a838d70d87964540b6d568c

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 a225ced61bf2812172d9f4865d2cb733b2a90475b197a7d4c668863cc12fdfaa
MD5 d2eb28f1547964649ad6b01a1d111f56
BLAKE2b-256 dc90d58b7667388f4563e09ffd9fefc791caa537aeedd1e3ca2f52ccda3ae439

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 6ba10567cc4345b37ca88fd42bfefd9683f786c24b62804e9710ba0c19a4ce06
MD5 e64f593b6596f00b046fc846c487aefa
BLAKE2b-256 d0c60cd68a79038f380f5acb2e6a002a5c06c9a9d0f99ee542a4f3edf18b6257

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 9c2b705dae4b29ae7263cfc85496b520fac0427b0744d0d8d22894e77ab8d646
MD5 997d117e7c674b7414fbaa0dc6919141
BLAKE2b-256 a3d0202ceed9586131d936955f928feb9fc332012ebe7687c8ce01cc71636704

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp39-cp39-macosx_12_0_arm64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp39-cp39-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 448a1e6308d59c449a7d043eb4c10dd979cae6b4e9d7da41792462b560910551
MD5 d8385ce884b54ead301d24944b84bb72
BLAKE2b-256 7f4154466b45378aa34483181a9039d9cb2fd8d0a3274d6ded4b5c4da2d6bd1c

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 aab0b39a8519a188a7898b8636e71d9c0ff1bdb0fcc8687cea7e66ad48e69c35
MD5 d3ec75138402cc5860208646c0eb2c42
BLAKE2b-256 42ba06b74443fd297024550364f65208ff85a25f995466932eff51d9cc3ff997

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a3a2af0f2cedd91bac82c8ef630a40123bed453a3eba1249d1b5f29df039b940
MD5 05785b64c3cd5eae51e33f7f5cf9b34a
BLAKE2b-256 5d65a1ba594fa47191781b6cd94bd63798e53422e974ae9d276d209c0903da33

See more details on using hashes here.

File details

Details for the file wordllama-0.3.3.post0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for wordllama-0.3.3.post0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 50e27fe3208e7a35802aec5e19af8fd3ab8f7f56cfe1d136c5fda7369882c025
MD5 993309ce8b68615b2ab59475605c722e
BLAKE2b-256 73fe1cd36a0d35dddd382e42407c39389766b9a47ffcffdc9b3a256ac36e7b7c

See more details on using hashes here.
