
WordLlama Embedding Utility

Project description

WordLlama 📝🦙

WordLlama is a fast, lightweight NLP toolkit designed for tasks like fuzzy deduplication, similarity computation, ranking, clustering, and semantic text splitting. It operates with minimal inference-time dependencies and is optimized for CPU hardware, making it suitable for deployment in resource-constrained environments.

News and Updates 🔥

Quick Start

Install WordLlama via pip:

pip install wordllama

Load the default 256-dimensional model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Features

  • Fast Embeddings: Efficiently generate text embeddings using a simple token lookup with average pooling (see the sketch after this list).
  • Similarity Computation: Calculate cosine similarity between texts.
  • Ranking: Rank documents based on their similarity to a query.
  • Fuzzy Deduplication: Remove duplicate texts based on a similarity threshold.
  • Clustering: Cluster documents into groups using KMeans clustering.
  • Filtering: Filter documents based on their similarity to a query.
  • Top-K Retrieval: Retrieve the top-K most similar documents to a query.
  • Semantic Text Splitting: Split text into semantically coherent chunks.
  • Binary Embeddings: Support for binary embeddings with Hamming similarity for even faster computations.
  • Matryoshka Representations: Truncate embedding dimensions as needed for flexibility.
  • Low Resource Requirements: Optimized for CPU inference with minimal dependencies.
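
The "Fast Embeddings" bullet above is essentially the whole inference path: each token id indexes a row of the embedding matrix, and the rows are averaged. Here is a minimal NumPy sketch of that idea; the toy matrix and token ids are made up for illustration, and WordLlama's own embed() handles tokenization and pooling for you.

import numpy as np

# Toy stand-ins: a 10-token vocabulary with 4-dimensional embeddings.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(10, 4)).astype(np.float32)  # token embedding matrix
token_ids = np.array([3, 7, 2])                         # pretend tokenized text

# Embedding = look up one row per token, then average-pool over the sequence.
text_embedding = codebook[token_ids].mean(axis=0)
print(text_embedding.shape)  # (4,)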

What is WordLlama?

WordLlama is a utility for natural language processing (NLP) that recycles components from large language models (LLMs) to create efficient and compact word representations, similar to GloVe, Word2Vec, or FastText.

WordLlama begins by extracting the token embedding codebook from state-of-the-art LLMs (e.g., LLaMA 2, LLaMA 3 70B) and then trains a small, context-less model within a general-purpose embedding framework. The result is a lightweight model that improves on all MTEB benchmarks over traditional word models like GloVe 300d, while being substantially smaller (e.g., 16MB for the default 256-dimensional model).

WordLlama's key features include:

  1. Matryoshka Representations: Allows for truncation of the embedding dimension as needed, providing flexibility in model size and performance.
  2. Low Resource Requirements: Utilizes a simple token lookup with average pooling, enabling fast operation on CPUs without the need for GPUs.
  3. Binary Embeddings: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations (a conceptual sketch follows this list).
  4. Numpy-only Inference: Lightweight inference pipeline relying solely on NumPy, facilitating easy deployment and integration.
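
The binary case in item 3 can be illustrated with plain NumPy: sign-binarize a vector, pack the bits into integers, and compare with Hamming distance. This is a conceptual sketch of the technique, not WordLlama's internal implementation or public API.

import numpy as np

def binarize_and_pack(vec):
    # Sign-binarize (>= 0 -> 1, < 0 -> 0) and pack 8 bits into each uint8.
    bits = (vec >= 0).astype(np.uint8)
    return np.packbits(bits)

def hamming_distance(a, b):
    # XOR the packed bytes and count the differing bits.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

x = binarize_and_pack(np.array([0.2, -1.3, 0.7, 0.0, -0.4, 1.1, -0.9, 0.3]))
y = binarize_and_pack(np.array([0.1, -0.2, -0.7, 0.5, -0.4, 1.0, 0.9, 0.3]))
print(hamming_distance(x, y))  # number of differing bits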

Because it is fast and compact, WordLlama is a versatile tool for exploratory analysis and utility applications, such as evaluating LLM outputs or handling preparatory tasks in multi-hop or agentic workflows.

MTEB Results

The following table compares the performance of WordLlama models with similar embedding models.

Metric               | WL64  | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2
---------------------|-------|-------|-----------|-------|--------|------------|----------|-----------------
Clustering           | 30.27 | 32.20 | 33.25     | 33.40 | 33.62  | 27.73      | 26.57    | 42.35
Reranking            | 50.38 | 51.52 | 52.03     | 52.32 | 52.39  | 43.29      | 44.75    | 58.04
Classification       | 53.14 | 56.25 | 58.21     | 59.13 | 59.50  | 57.29      | 57.65    | 63.05
Pair Classification  | 75.80 | 77.59 | 78.22     | 78.50 | 78.60  | 70.92      | 72.94    | 82.37
STS                  | 66.24 | 67.53 | 67.91     | 68.22 | 68.27  | 61.85      | 62.46    | 78.90
CQA DupStack         | 18.76 | 22.54 | 24.12     | 24.59 | 24.83  | 15.47      | 16.79    | 41.32
SummEval             | 30.79 | 29.99 | 30.99     | 29.56 | 29.39  | 28.87      | 30.49    | 30.81

WL64 to WL1024: WordLlama models with embedding dimensions from 64 to 1024; (X) marks the default 256-dimensional model.

Note: The l2_supercat is a LLaMA 2 vocabulary model. To train this model, we concatenated codebooks from several models, including LLaMA 2 70B and phi 3 medium, after removing additional special tokens. Because several models have used the LLaMA 2 tokenizer, their codebooks can be concatenated and trained together. The performance of the resulting model is comparable to training the LLaMA 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).
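
One way to picture the concatenation: for models that share the LLaMA 2 tokenizer, the codebooks are row-aligned by token id, so they can be stacked along the feature axis to give each token a single, longer vector. A tiny NumPy sketch with illustrative shapes only (real codebooks are roughly 32k tokens by several thousand dimensions, and the actual training setup lives in the repository):

import numpy as np

# Tiny illustrative shapes; not the real model dimensions.
vocab_size = 100
rng = np.random.default_rng(0)
codebook_a = rng.normal(size=(vocab_size, 8)).astype(np.float32)  # e.g. LLaMA 2 70B
codebook_b = rng.normal(size=(vocab_size, 6)).astype(np.float32)  # e.g. phi 3 medium

# Stack along the feature axis: one longer vector per shared token id.
supercat = np.concatenate([codebook_a, codebook_b], axis=1)
print(supercat.shape)  # (100, 14)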

Other Models

How Fast? ⚡

8k documents from the ag_news dataset

  • Single core performance (CPU), i9 12th gen, DDR4 3200
  • NVIDIA A4500 (GPU)

(Benchmark figure: embedding throughput on the documents above, single-core CPU vs. NVIDIA A4500 GPU.)
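
For a rough CPU throughput check of your own, timing the documented embed() call over a list of documents is enough. The stand-in corpus below is a placeholder for the ~8k ag_news documents used in the plot; absolute numbers will depend on your hardware.

import time
from wordllama import WordLlama

wl = WordLlama.load()

# Stand-in corpus; substitute ~8k ag_news documents to mirror the benchmark setup.
docs = ["The quick brown fox jumps over the lazy dog."] * 8_000

start = time.perf_counter()
embeddings = wl.embed(docs)
elapsed = time.perf_counter() - start
print(f"{len(docs) / elapsed:.0f} docs/sec, output shape {embeddings.shape}")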

Usage

Embedding Text

Load pre-trained embeddings and embed text:

from wordllama import WordLlama

# Load pre-trained embeddings (truncate dimension to 64)
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["The quick brown fox jumps over the lazy dog", "And all that jazz"])
print(embeddings.shape)  # Output: (2, 64)

Calculating Similarity

Compute the similarity between two texts:

similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664
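
To compare many texts at once, it can be cheaper to embed once and take a full similarity matrix. The bulk comparison below is ordinary NumPy on top of the documented embed() call, not a separate WordLlama API:

import numpy as np
from wordllama import WordLlama

wl = WordLlama.load()
texts = ["I went to the car", "I went to the pawn shop", "I went to the park"]

# Embed once, L2-normalize, then take all pairwise cosine similarities.
emb = wl.embed(texts)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim_matrix = emb @ emb.T
print(np.round(sim_matrix, 4))  # sim_matrix[i, j] compares texts[i] and texts[j]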

Ranking Documents

Rank documents based on their similarity to a query:

query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates, sort=True)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Fuzzy Deduplication

Remove duplicate texts based on a similarity threshold:

deduplicated_docs = wl.deduplicate(candidates, return_indices=False, threshold=0.5)
print(deduplicated_docs)
# Output:
# ['I went to the park',
#  'I went to the shop',
#  'I went to the truck']

Clustering

Cluster documents into groups using KMeans clustering:

labels, inertia = wl.cluster(candidates, k=3, max_iterations=100, tolerance=1e-4, n_init=3)
print(labels, inertia)
# Output:
# [2, 0, 1, 1], 0.4150
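
The returned labels are aligned with the input order, so grouping documents by cluster needs only plain Python over the values shown above:

from collections import defaultdict

# Group the candidate documents by their cluster label.
groups = defaultdict(list)
for doc, label in zip(candidates, labels):
    groups[label].append(doc)

for label, docs in sorted(groups.items()):
    print(label, docs)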

Filtering

Filter documents based on their similarity to a query:

filtered_docs = wl.filter(query, candidates, threshold=0.3)
print(filtered_docs)
# Output:
# ['I went to the vehicle']

Top-K Retrieval

Retrieve the top-K most similar documents to a query:

top_docs = wl.topk(query, candidates, k=2)
print(top_docs)
# Output:
# ['I went to the vehicle', 'I went to the truck']

Semantic Text Splitting

Split text into semantic chunks:

long_text = "Your very long text goes here... " * 100
chunks = wl.split(long_text, target_size=1536)

print(list(map(len, chunks)))
# Output: [1055, 1055, 1187]

Note that the target size is also the maximum size. The .split() method aggregates sections up to target_size while preserving the order of the text and sentence boundaries, and paragraph structure as much as possible. It uses WordLlama embeddings to locate natural indexes to split on, so the output contains a range of chunk sizes up to the target size.

The recommended target size is 512 to 2048 characters, with a default of 1536. Chunks that need to be much larger should probably be batched after splitting; they will often already be aggregates of multiple semantic chunks.
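
As a rough illustration of the idea (not the library's actual algorithm), a naive splitter can embed sentences with the documented embed() call and prefer to break where adjacent sentences are least similar, while never exceeding target_size characters. The 0.3 similarity threshold below is arbitrary.

import numpy as np
from wordllama import WordLlama

wl = WordLlama.load()

def naive_semantic_split(text, target_size=1536):
    # Crude sentence split; WordLlama's own splitter is more careful about structure.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    emb = wl.embed(sentences)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    chunks, current = [], ""
    for i, sentence in enumerate(sentences):
        too_big = len(current) + len(sentence) + 1 > target_size
        topic_shift = i > 0 and float(emb[i - 1] @ emb[i]) < 0.3
        if current and (too_big or topic_shift):
            chunks.append(current)   # close the current chunk at a natural break
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks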

For more information see: technical overview

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions, and either 512 or 1024 dimensions are recommended for binary embeddings.

The L2 Supercat model was trained using a batch size of 512 on a single A100 GPU for 12 hours.

Roadmap

  • Adding Inference Features:
    • Semantic text splitting (completed)
  • Additional Example Notebooks:
    • DSPy evaluators
    • Retrieval-Augmented Generation (RAG) pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for LLaMA models). You can then use the following snippet:

from wordllama.extract.extract_safetensors import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

Hint: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest; sometimes you have to inspect and figure it out.
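
One way to follow that hint is to list the tensors in each shard and look for the 2-D matrix whose first dimension matches the vocabulary size (e.g., 32000 or 128256). A sketch using the safetensors package's safe_open with the NumPy framework, reusing the placeholder path from above:

from safetensors import safe_open

# Print every tensor name and shape in a shard; the token embedding codebook is
# typically the tensor with a vocabulary-sized first dimension.
with safe_open("path/to/saved/model-0001-of-00XX.safetensors", framework="np") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())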

For training, use the scripts in the GitHub repository. You will need to add a configuration file (copy an existing one into the config folder and modify it).

pip install wordllama[train]
python train.py train --config your_new_config
# (Training process begins)
python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
# (Saves one model per Matryoshka dimension)

Community Projects

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.3.2}
}

License

This project is licensed under the MIT License.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

  • wordllama-0.3.2.post0-cp312-cp312-win_amd64.whl (16.9 MB): CPython 3.12, Windows x86-64
  • wordllama-0.3.2.post0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB): CPython 3.12, manylinux (glibc 2.17+), x86-64
  • wordllama-0.3.2.post0-cp312-cp312-macosx_12_0_arm64.whl (17.1 MB): CPython 3.12, macOS 12.0+, ARM64
  • wordllama-0.3.2.post0-cp312-cp312-macosx_10_13_x86_64.whl (17.2 MB): CPython 3.12, macOS 10.13+, x86-64
  • wordllama-0.3.2.post0-cp311-cp311-win_amd64.whl (16.9 MB): CPython 3.11, Windows x86-64
  • wordllama-0.3.2.post0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.2 MB): CPython 3.11, manylinux (glibc 2.17+), x86-64
  • wordllama-0.3.2.post0-cp311-cp311-macosx_12_0_arm64.whl (17.1 MB): CPython 3.11, macOS 12.0+, ARM64
  • wordllama-0.3.2.post0-cp311-cp311-macosx_10_9_x86_64.whl (17.2 MB): CPython 3.11, macOS 10.9+, x86-64
  • wordllama-0.3.2.post0-cp310-cp310-win_amd64.whl (16.9 MB): CPython 3.10, Windows x86-64
  • wordllama-0.3.2.post0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB): CPython 3.10, manylinux (glibc 2.17+), x86-64
  • wordllama-0.3.2.post0-cp310-cp310-macosx_12_0_arm64.whl (17.1 MB): CPython 3.10, macOS 12.0+, ARM64
  • wordllama-0.3.2.post0-cp310-cp310-macosx_10_9_x86_64.whl (17.2 MB): CPython 3.10, macOS 10.9+, x86-64
  • wordllama-0.3.2.post0-cp39-cp39-win_amd64.whl (16.9 MB): CPython 3.9, Windows x86-64
  • wordllama-0.3.2.post0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB): CPython 3.9, manylinux (glibc 2.17+), x86-64
  • wordllama-0.3.2.post0-cp39-cp39-macosx_12_0_arm64.whl (17.1 MB): CPython 3.9, macOS 12.0+, ARM64
  • wordllama-0.3.2.post0-cp39-cp39-macosx_10_9_x86_64.whl (17.2 MB): CPython 3.9, macOS 10.9+, x86-64
  • wordllama-0.3.2.post0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB): CPython 3.8, manylinux (glibc 2.17+), x86-64
  • wordllama-0.3.2.post0-cp38-cp38-macosx_10_9_x86_64.whl (17.2 MB): CPython 3.8, macOS 10.9+, x86-64

