
WordLlama Embedding Utility

Project description

WordLlama 📝🦙

WordLlama is a fast, lightweight NLP toolkit designed for tasks like fuzzy deduplication, similarity computation, ranking, clustering, and semantic text splitting. It operates with minimal inference-time dependencies and is optimized for CPU hardware, making it suitable for deployment in resource-constrained environments.



Quick Start

Install WordLlama via pip:

pip install wordllama

Load the default 256-dimensional model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Features

  • Fast Embeddings: Efficiently generate text embeddings using a simple token lookup with average pooling.
  • Similarity Computation: Calculate cosine similarity between texts.
  • Ranking: Rank documents based on their similarity to a query.
  • Fuzzy Deduplication: Remove duplicate texts based on a similarity threshold.
  • Clustering: Cluster documents into groups using KMeans clustering.
  • Filtering: Filter documents based on their similarity to a query.
  • Top-K Retrieval: Retrieve the top-K most similar documents to a query.
  • Semantic Text Splitting: Split text into semantically coherent chunks.
  • Binary Embeddings: Support for binary embeddings with Hamming similarity for even faster computations.
  • Matryoshka Representations: Truncate embedding dimensions as needed for flexibility.
  • Low Resource Requirements: Optimized for CPU inference with minimal dependencies.

What is WordLlama?

WordLlama is a utility for natural language processing (NLP) that recycles components from large language models (LLMs) to create efficient and compact word representations, similar to GloVe, Word2Vec, or FastText.

WordLlama starts by extracting the token embedding codebook from state-of-the-art LLMs (e.g., LLaMA 2, LLaMA 3 70B) and then trains a small context-less model within a general-purpose embedding framework. The result is a lightweight model that improves on all MTEB benchmarks over traditional word models like GloVe 300d, while being substantially smaller in size (e.g., a 16MB default model at 256 dimensions).

WordLlama's key features include:

  1. Matryoshka Representations: Allows for truncation of the embedding dimension as needed, providing flexibility in model size and performance.
  2. Low Resource Requirements: Utilizes a simple token lookup with average pooling, enabling fast operation on CPUs without the need for GPUs.
  3. Binary Embeddings: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations.
  4. Numpy-only Inference: Lightweight inference pipeline relying solely on NumPy, facilitating easy deployment and integration.

Because it is fast and compact, WordLlama serves as a versatile tool for exploratory analysis and utility applications, such as LLM output evaluation or preparatory tasks in multi-hop or agentic workflows.
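
To make the inference path concrete, here is a minimal NumPy-only sketch of the idea: look up token embeddings in a codebook, average-pool them, and optionally truncate to a smaller Matryoshka dimension. The codebook, token IDs, and dimensions below are illustrative stand-ins, not the actual WordLlama weights or tokenizer output.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 32_000, 1024
codebook = rng.standard_normal((vocab_size, dim)).astype(np.float32)  # stand-in for a trained codebook

def embed_tokens(token_ids, trunc_dim=256):
    vectors = codebook[token_ids]           # token lookup: (n_tokens, dim)
    pooled = vectors.mean(axis=0)           # average pooling over the sequence
    pooled = pooled[:trunc_dim]             # Matryoshka truncation to the leading dimensions
    return pooled / np.linalg.norm(pooled)  # normalize so dot products give cosine similarity

print(embed_tokens([101, 2023, 2003, 102], trunc_dim=64).shape)  # (64,)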

MTEB Results

The following table presents the performance of WordLlama models compared to other similar models.

Metric               WL64   WL128  WL256 (X)  WL512  WL1024  GloVe 300d  Komninos  all-MiniLM-L6-v2
Clustering           30.27  32.20  33.25      33.40  33.62   27.73       26.57     42.35
Reranking            50.38  51.52  52.03      52.32  52.39   43.29       44.75     58.04
Classification       53.14  56.25  58.21      59.13  59.50   57.29       57.65     63.05
Pair Classification  75.80  77.59  78.22      78.50  78.60   70.92       72.94     82.37
STS                  66.24  67.53  67.91      68.22  68.27   61.85       62.46     78.90
CQA DupStack         18.76  22.54  24.12      24.59  24.83   15.47       16.79     41.32
SummEval             30.79  29.99  30.99      29.56  29.39   28.87       30.49     30.81

WL64 to WL1024: WordLlama models with embedding dimensions ranging from 64 to 1024; (X) marks the default 256-dimensional model.

Note: The l2_supercat is a LLaMA 2 vocabulary model. To train this model, we concatenated codebooks from several models, including LLaMA 2 70B and phi 3 medium, after removing additional special tokens. Because several models have used the LLaMA 2 tokenizer, their codebooks can be concatenated and trained together. The performance of the resulting model is comparable to training the LLaMA 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).
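
As a rough illustration of the concatenation step (not the actual training code), the sketch below assumes that each extracted codebook is an array of shape (vocab_size, hidden_dim) and that models sharing the LLaMA 2 tokenizer use the same token-to-row ordering, so the matrices can be stacked along the feature axis. The dimensions are scaled down for the example.

import numpy as np

rng = np.random.default_rng(0)
vocab = 32_000
codebook_a = rng.standard_normal((vocab, 64)).astype(np.float32)  # stand-in for a LLaMA 2 70B codebook (real hidden_dim is much larger)
codebook_b = rng.standard_normal((vocab, 48)).astype(np.float32)  # stand-in for a phi 3 medium codebook

# Same tokenizer, same row ordering: concatenate along the feature axis
supercat = np.concatenate([codebook_a, codebook_b], axis=1)
print(supercat.shape)  # (32000, 112): one wider row per shared token id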

Other Models

How Fast? ⚡

8k documents from the ag_news dataset

  • Single core performance (CPU), i9 12th gen, DDR4 3200
  • NVIDIA A4500 (GPU)


Usage

Embedding Text

Load pre-trained embeddings and embed text:

from wordllama import WordLlama

# Load pre-trained embeddings (truncate dimension to 64)
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["The quick brown fox jumps over the lazy dog", "And all that jazz"])
print(embeddings.shape)  # Output: (2, 64)

Calculating Similarity

Compute the similarity between two texts:

similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

Ranking Documents

Rank documents based on their similarity to a query:

query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates, sort=True)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Fuzzy Deduplication

Remove duplicate texts based on a similarity threshold:

deduplicated_docs = wl.deduplicate(candidates, threshold=0.5)
print(deduplicated_docs)
# Output:
# ['I went to the park',
#  'I went to the shop',
#  'I went to the truck']

Clustering

Cluster documents into groups using KMeans clustering:

labels, inertia = wl.cluster(candidates, k=3, max_iterations=100, tolerance=1e-4, n_init=3)
print(labels, inertia)
# Output:
# [2, 0, 1, 1], 0.4150

Filtering

Filter documents based on their similarity to a query:

filtered_docs = wl.filter(query, candidates, threshold=0.3)
print(filtered_docs)
# Output:
# ['I went to the vehicle']

Top-K Retrieval

Retrieve the top-K most similar documents to a query:

top_docs = wl.topk(query, candidates, k=2)
print(top_docs)
# Output:
# ['I went to the vehicle', 'I went to the truck']

Semantic Text Splitting

Split text into semantic chunks:

long_text = "Your very long text goes here... " * 100
chunks = wl.split(long_text, target_size=1536)

print(list(map(len, chunks)))
# Output: [1055, 1055, 1187]

Note that the target size is also the maximum size. The .split() method aggregates sections up to target_size while preserving the order of the text and, as far as possible, sentence and paragraph structure. It uses WordLlama embeddings to find natural indexes to split on, so the output contains a range of chunk sizes up to the target size.

The recommended target size is 512 to 2048 characters, with a default of 1536. Chunks that need to be much larger should probably be batched after splitting, and will often already be aggregates of multiple semantic chunks.
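
For example, one simple way to handle larger units is to split first and then group the resulting chunks into batches before embedding. This reuses the wl and chunks objects from the splitting example above; the batch size is an arbitrary choice for illustration.

# Group the semantic chunks into fixed-size batches before embedding
batch_size = 64  # arbitrary batch size for this example
batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
embeddings = [wl.embed(batch) for batch in batches]  # one (len(batch), dim) array per batch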

For more information, see the technical overview.

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions, and either 512 or 1024 dimensions are recommended for binary embeddings.
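
The mechanism behind the Hamming calculation can be sketched with NumPy alone, assuming embeddings are sign-binarized and packed into uint8 arrays; this illustrates the idea and is not the library's internal implementation.

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(1024), rng.standard_normal(1024)  # stand-ins for two 1024-dim embeddings

a_bits = np.packbits(a > 0)  # 1024 sign bits packed into 128 uint8 values
b_bits = np.packbits(b > 0)

# Hamming distance = number of differing bits; convert to a similarity in [0, 1]
hamming = np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum()
print(1.0 - hamming / a.size)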

The L2 Supercat model was trained using a batch size of 512 on a single A100 GPU for 12 hours.

Roadmap

  • Adding Inference Features:
    • Semantic text splitting (completed)
  • Additional Example Notebooks:
    • DSPy evaluators
    • Retrieval-Augmented Generation (RAG) pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for LLaMA models). You can then use the following snippet:

from wordllama.extract.extract_safetensors import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

Hint: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest; sometimes you have to inspect and figure it out.
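
If you need to hunt for the right shard, one option is to read just the safetensors header, which (per the format specification) is an 8-byte little-endian length prefix followed by a JSON blob listing each tensor's name, dtype, and shape. The helper below is a hedged sketch using only the standard library; the file path is the placeholder from the snippet above.

import json
import struct

def list_safetensors_tensors(path):
    """Return {tensor_name: shape} from a safetensors file without loading any weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian header length
        header = json.loads(f.read(header_len))          # JSON header with tensor metadata
    header.pop("__metadata__", None)
    return {name: info["shape"] for name, info in header.items()}

# Example: look for an entry shaped like [vocab_size, hidden_dim]
# for name, shape in list_safetensors_tensors("path/to/saved/model-0001-of-00XX.safetensors").items():
#     print(name, shape)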

For training, use the scripts in the GitHub repository. You will need to add a configuration file (copy and modify an existing one in the same folder).

pip install wordllama[train]
python train.py train --config your_new_config
# (Training process begins)
python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
# (Saves one model per Matryoshka dimension)

Community Projects

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.3.1}
}

License

This project is licensed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

  • wordllama-0.3.1.post0-cp312-cp312-win_amd64.whl (16.9 MB): CPython 3.12, Windows x86-64
  • wordllama-0.3.1.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB): CPython 3.12, manylinux: glibc 2.17+ x86-64
  • wordllama-0.3.1.post0-cp312-cp312-macosx_12_0_arm64.whl (17.1 MB): CPython 3.12, macOS 12.0+ ARM64
  • wordllama-0.3.1.post0-cp312-cp312-macosx_10_13_x86_64.whl (17.2 MB): CPython 3.12, macOS 10.13+ x86-64
  • wordllama-0.3.1.post0-cp311-cp311-win_amd64.whl (16.9 MB): CPython 3.11, Windows x86-64
  • wordllama-0.3.1.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.2 MB): CPython 3.11, manylinux: glibc 2.17+ x86-64
  • wordllama-0.3.1.post0-cp311-cp311-macosx_12_0_arm64.whl (17.1 MB): CPython 3.11, macOS 12.0+ ARM64
  • wordllama-0.3.1.post0-cp311-cp311-macosx_10_9_x86_64.whl (17.2 MB): CPython 3.11, macOS 10.9+ x86-64
  • wordllama-0.3.1.post0-cp310-cp310-win_amd64.whl (16.9 MB): CPython 3.10, Windows x86-64
  • wordllama-0.3.1.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB): CPython 3.10, manylinux: glibc 2.17+ x86-64
  • wordllama-0.3.1.post0-cp310-cp310-macosx_12_0_arm64.whl (17.1 MB): CPython 3.10, macOS 12.0+ ARM64
  • wordllama-0.3.1.post0-cp310-cp310-macosx_10_9_x86_64.whl (17.2 MB): CPython 3.10, macOS 10.9+ x86-64
  • wordllama-0.3.1.post0-cp39-cp39-win_amd64.whl (16.9 MB): CPython 3.9, Windows x86-64
  • wordllama-0.3.1.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB): CPython 3.9, manylinux: glibc 2.17+ x86-64
  • wordllama-0.3.1.post0-cp39-cp39-macosx_12_0_arm64.whl (17.1 MB): CPython 3.9, macOS 12.0+ ARM64
  • wordllama-0.3.1.post0-cp39-cp39-macosx_10_9_x86_64.whl (17.2 MB): CPython 3.9, macOS 10.9+ x86-64
  • wordllama-0.3.1.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB): CPython 3.8, manylinux: glibc 2.17+ x86-64
  • wordllama-0.3.1.post0-cp38-cp38-macosx_10_9_x86_64.whl (17.2 MB): CPython 3.8, macOS 10.9+ x86-64

File details

SHA256, MD5, and BLAKE2b-256 hashes for each built distribution:

wordllama-0.3.1.post0-cp312-cp312-win_amd64.whl
  SHA256:      138ee38149dc2394ec9db38a70f7e7d60d2cd47beb54880dd0eb3c08080219ca
  MD5:         20002cdcb4d8825d3d394f0177fa0274
  BLAKE2b-256: d2fcd74233626cfd827eb61135d81e8d6a575238d5bd502ce3863bb8a9b325c6

wordllama-0.3.1.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256:      f52e8d3b0a1332db86002b4190953c54a385981dc999c161f5553d113278344c
  MD5:         9ddf2e9659da1ddf4b28a334c5289147
  BLAKE2b-256: 8f742e33a275dc0949a35c80b868047e211c66a28c3f452ca137181dc354d1b3

wordllama-0.3.1.post0-cp312-cp312-macosx_12_0_arm64.whl
  SHA256:      986973aeccc94591c338fb75c9fdb6d0a5b0c78142e3f283d96273c2e981327a
  MD5:         9bf17bf5a0d234c25e8cdedf7940cd4b
  BLAKE2b-256: 96ee875660a3a3e240f459472081f13ca979696b465859e2f1b9b4e4d502d4f9

wordllama-0.3.1.post0-cp312-cp312-macosx_10_13_x86_64.whl
  SHA256:      a669f17474f014fe843a5e8923524082d55e392ae933a3211a29254acdab84dd
  MD5:         350ad5f5913fb79b764317931c31dde9
  BLAKE2b-256: d45e699011e1687c89d09d5780fae8e4ee647383468a963254c3b9b6891dc0bc

wordllama-0.3.1.post0-cp311-cp311-win_amd64.whl
  SHA256:      8f9138b111befd4dd2e53e1045b2e1a12d8e817e3305905536c973517e36ad44
  MD5:         1d5b0bb4a5c07cd51939909f58cd1800
  BLAKE2b-256: f471238a106c3603f7fed30c5389e7a491fb7194d8e363e98302495009c2da3e

wordllama-0.3.1.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256:      def7528d42969c9b06bafae86c25d3bea5c4cec1aeef34277fb1350e2ed8fbe2
  MD5:         e4460fbcb50b213817d9673266a57fbf
  BLAKE2b-256: 9ccfb62b15b220b3be8b5c68fb6c89ed9360fdf61522929b2a9196f16587c199

wordllama-0.3.1.post0-cp311-cp311-macosx_12_0_arm64.whl
  SHA256:      d9595e314580a778c240c8ba2011a388335cab6753bf1e9830ee693c40bab90c
  MD5:         5df3601311c373e2e57eeb0d6fed2282
  BLAKE2b-256: 663fb9da1383a49e8228eda5b7c4ec845afac26d7a85bc616c76b5bbd262362b

wordllama-0.3.1.post0-cp311-cp311-macosx_10_9_x86_64.whl
  SHA256:      300a8400b521984382ace122a464a07cb0406a4029c0f86f60d3167e29ac8623
  MD5:         003acd55a92c4f42a3f9a9d88ba09932
  BLAKE2b-256: a76bd978eb8b33dc6bbd5039d73515ba9aca8fb5aebd0fcd5cf44b1fc0244dee

wordllama-0.3.1.post0-cp310-cp310-win_amd64.whl
  SHA256:      60f8e60504c85ae06fc4c6fa4c36b5dc523f5b7971eb674aaceac7d487cbdf09
  MD5:         7d22a1383491a72fe2e39a5a89bdcefb
  BLAKE2b-256: d9bfa1627c1fa8548359533a48cb86dbd56e07eb5e343131b81ea47a74e653a4

wordllama-0.3.1.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256:      2c934adb795d982c315189a7b172b99168c39de7bb193e950dcedb707b54f806
  MD5:         3d3da7c695bd8cb5bb1a31e3be646431
  BLAKE2b-256: a553fa6b2c208933a7e5f0e43ad093a6f1eb4009a7f306bd88f0b2ab0768219c

wordllama-0.3.1.post0-cp310-cp310-macosx_12_0_arm64.whl
  SHA256:      4123b4eafea29dd7c0564b242606b421078dbd45de0a68861ba30d735a105950
  MD5:         1f3d2ebc8e189feaddbfca766fa74e01
  BLAKE2b-256: b5f2270f21f8efa242c6e661e64834542cb00b891be87d90988a2e74350b27cd

wordllama-0.3.1.post0-cp310-cp310-macosx_10_9_x86_64.whl
  SHA256:      3acaadf8fec37ca923e7d9dc1d670cb2d593dfa76338c8f000e9ef77e37c56f4
  MD5:         2d63568935d5ba100287428169b13b7a
  BLAKE2b-256: 611f0296964939491782abe06acb6f083decdc9e99138e65eef1641592627779

wordllama-0.3.1.post0-cp39-cp39-win_amd64.whl
  SHA256:      5948411250f4c4920706853133d9362a590a3709678cb6f06d48644a8497fbc4
  MD5:         f2958ac6ab3d37e61f5864f68ce07ace
  BLAKE2b-256: 08c764e491ba70600d6d7905e276f0eafd22e2e7a9d964a1a6a23ee8909ad7aa

wordllama-0.3.1.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256:      2b0821599ecb8f558189302d9e052d98ac97d49caad9d27a9eb79dab07b50311
  MD5:         009f95aca8df4de6b5f0bfabdf7de800
  BLAKE2b-256: 34bf4af7d06d3a4b4ffa3e1c18fed9d55f1344209ae7c5fb5a48913c97eeb1a1

wordllama-0.3.1.post0-cp39-cp39-macosx_12_0_arm64.whl
  SHA256:      9e771c796156cdb1f9e657293c4afe348692fa2ea6f3b5d7d651d57d19d442c8
  MD5:         bf042ff0b1f3b2032ce3e74dca9560da
  BLAKE2b-256: 1fef68be4ccd55252d744efc9d9ca1e35a56e98efa373a821f4376f70ad45cd1

wordllama-0.3.1.post0-cp39-cp39-macosx_10_9_x86_64.whl
  SHA256:      eefdf2b35fde2c2b3f2d03159df9c93766859fa1b5635c0d708db3de2c6ab1c6
  MD5:         e3d75627d5e1fa7177aa2e2c117f6102
  BLAKE2b-256: 03a45676615bae678261297de581480b3204a92e4fb6df84e4644e83441a16dd

wordllama-0.3.1.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256:      d5cf50677552d66d2ea2b7e5f6177feb41f968d3d614d51dc711f5c9739935ab
  MD5:         f5c166b6be2c7d11a6a2471979ad2f37
  BLAKE2b-256: 3daa7bafb59461f4b2d7b2c2fc2b1d7bb45a0be46b102b5b87525fb655d0594f

wordllama-0.3.1.post0-cp38-cp38-macosx_10_9_x86_64.whl
  SHA256:      3d75e55d6ab5d5f87714a2175306b28f14681b5df3526f91359d82054cf719e6
  MD5:         7acfb3d9ed7a640ec4a6ec29d37623f3
  BLAKE2b-256: 8d7d7e612ac5390e03ac365188f7a57072ca7a577d15e1b89af53de6b90d34c4

See more details on using hashes here.
