
WordLlama Embedding Utility


WordLlama 📝🦙

WordLlama is a fast, lightweight NLP toolkit designed for tasks like fuzzy deduplication, similarity computation, ranking, clustering, and semantic text splitting. It operates with minimal inference-time dependencies and is optimized for CPU hardware, making it suitable for deployment in resource-constrained environments.


News and Updates 🔥

Quick Start

Install WordLlama via pip:

pip install wordllama

Load the default 256-dimensional model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Features

  • Fast Embeddings: Efficiently generate text embeddings using a simple token lookup with average pooling.
  • Similarity Computation: Calculate cosine similarity between texts.
  • Ranking: Rank documents based on their similarity to a query.
  • Fuzzy Deduplication: Remove duplicate texts based on a similarity threshold.
  • Clustering: Cluster documents into groups using KMeans clustering.
  • Filtering: Filter documents based on their similarity to a query.
  • Top-K Retrieval: Retrieve the top-K most similar documents to a query.
  • Semantic Text Splitting: Split text into semantically coherent chunks.
  • Binary Embeddings: Support for binary embeddings with Hamming similarity for even faster computations.
  • Matryoshka Representations: Truncate embedding dimensions as needed for flexibility.
  • Low Resource Requirements: Optimized for CPU inference with minimal dependencies.

What is WordLlama?

WordLlama is a utility for natural language processing (NLP) that recycles components from large language models (LLMs) to create efficient and compact word representations, similar to GloVe, Word2Vec, or FastText.

WordLlama starts by extracting the token embedding codebook from state-of-the-art LLMs (e.g., LLaMA 2, LLaMA 3 70B) and then trains a small, context-less model within a general-purpose embedding framework. The result is a lightweight model that improves on all MTEB benchmarks over traditional word models like GloVe 300d, while being substantially smaller (e.g., the default 256-dimensional model is 16MB).

WordLlama's key features include:

  1. Matryoshka Representations: Allows for truncation of the embedding dimension as needed, providing flexibility in model size and performance.
  2. Low Resource Requirements: Utilizes a simple token lookup with average pooling, enabling fast operation on CPUs without the need for GPUs.
  3. Binary Embeddings: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations.
  4. Numpy-only Inference: Lightweight inference pipeline relying solely on NumPy, facilitating easy deployment and integration.

Because it is fast and compact, WordLlama serves as a versatile tool for exploratory analysis and utility applications, such as LLM output evaluators or preparatory tasks in multi-hop or agentic workflows.
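
As a rough sketch of the token-lookup-with-average-pooling approach described above (illustrative only: the random codebook, vocabulary size, and token ids are stand-ins, and the real library also handles tokenization, normalization, and Matryoshka truncation):

import numpy as np

# Hypothetical codebook standing in for the trained WordLlama embeddings
vocab_size, dim = 32000, 256
codebook = np.random.rand(vocab_size, dim).astype(np.float32)

def embed(token_ids):
    # Token lookup followed by average pooling into a single text embedding
    return codebook[np.asarray(token_ids)].mean(axis=0)

vec = embed([101, 2023, 2003, 1037, 7953])  # hypothetical token ids
print(vec.shape)  # (256,)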

MTEB Results

The following table presents the performance of WordLlama models compared to other similar models.

Metric               WL64    WL128   WL256 (X)   WL512   WL1024   GloVe 300d   Komninos   all-MiniLM-L6-v2
Clustering           30.27   32.20   33.25       33.40   33.62    27.73        26.57      42.35
Reranking            50.38   51.52   52.03       52.32   52.39    43.29        44.75      58.04
Classification       53.14   56.25   58.21       59.13   59.50    57.29        57.65      63.05
Pair Classification  75.80   77.59   78.22       78.50   78.60    70.92        72.94      82.37
STS                  66.24   67.53   67.91       68.22   68.27    61.85        62.46      78.90
CQA DupStack         18.76   22.54   24.12       24.59   24.83    15.47        16.79      41.32
SummEval             30.79   29.99   30.99       29.56   29.39    28.87        30.49      30.81

WL64 to WL1024: WordLlama models with embedding dimensions ranging from 64 to 1024; (X) marks the default 256-dimensional model.

Note: The l2_supercat is a LLaMA 2 vocabulary model. To train this model, we concatenated codebooks from several models, including LLaMA 2 70B and phi 3 medium, after removing additional special tokens. Because several models have used the LLaMA 2 tokenizer, their codebooks can be concatenated and trained together. The performance of the resulting model is comparable to training the LLaMA 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).
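
A minimal sketch of the codebook-concatenation idea (the file names are hypothetical and the concatenation axis is an assumption; this is not the project's actual training code):

import numpy as np

# Each codebook is a (vocab_size, dim_i) embedding matrix from a model that uses
# the shared 32k LLaMA 2 vocabulary (hypothetical pre-extracted .npy files).
codebooks = [
    np.load("llama2_70b_embeddings.npy"),   # e.g. (32000, 8192)
    np.load("phi3_medium_embeddings.npy"),  # e.g. (32000, 5120)
]

# Rows align because the vocabulary is shared, so the matrices can be stacked
# feature-wise; the combined codebook is then the input to the small
# general-purpose embedding model that WordLlama trains on top.
supercat = np.concatenate(codebooks, axis=1)
print(supercat.shape)  # (32000, 8192 + 5120)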

Other Models

How Fast? ⚡

8k documents from the ag_news dataset

  • Single core performance (CPU), i9 12th gen, DDR4 3200
  • NVIDIA A4500 (GPU)

[Benchmark plot: embedding throughput on 8k ag_news documents, single-core CPU (i9 12th gen, DDR4 3200) vs. NVIDIA A4500 GPU]

Usage Examples

Embedding Text

Load pre-trained embeddings and embed text:

from wordllama import WordLlama

# Load pre-trained embeddings (truncate dimension to 64)
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["The quick brown fox jumps over the lazy dog", "And all that jazz"])
print(embeddings.shape)  # Output: (2, 64)

Calculating Similarity

Compute the similarity between two texts:

similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

Ranking Documents

Rank documents based on their similarity to a query:

query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Fuzzy Deduplication

Remove duplicate texts based on a similarity threshold:

deduplicated_docs = wl.deduplicate(candidates, threshold=0.5)
print(deduplicated_docs)
# Output:
# ['I went to the park',
#  'I went to the shop',
#  'I went to the truck']

Clustering

Cluster documents into groups using KMeans clustering:

labels, inertia = wl.cluster(candidates, k=3, max_iterations=100, tolerance=1e-4, n_init=3)
print(labels, inertia)
# Output:
# [2, 0, 1, 1], 0.4150

Filtering

Filter documents based on their similarity to a query:

filtered_docs = wl.filter(query, candidates, threshold=0.3)
print(filtered_docs)
# Output:
# ['I went to the vehicle']

Top-K Retrieval

Retrieve the top-K most similar documents to a query:

top_docs = wl.topk(query, candidates, k=2)
print(top_docs)
# Output:
# ['I went to the vehicle', 'I went to the truck']

Semantic Text Splitting

Split text into semantic chunks:

long_text = "Your very long text goes here... " * 100
chunks = wl.split(long_text, target_size=1536)

print(list(map(len, chunks)))
# Output: [1055, 1055, 1187]

Note that the target size is also the maximum size. The .split() feature aggregates sections up to target_size while retaining the order of the text and, as much as possible, sentence and paragraph structure. It uses WordLlama embeddings to locate more natural indices to split on, so the output will contain a range of chunk sizes up to the target size.

The recommended target size is 512 to 2048 characters, with a default of 1536. Chunks that need to be much larger should generally be batched after splitting; they will often already be aggregates of multiple semantic chunks.
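
For example, continuing the snippet above, a smaller target_size simply yields more, shorter chunks; since the target size is also the maximum, every chunk stays at or below it:

small_chunks = wl.split(long_text, target_size=512)
print(len(small_chunks), max(map(len, small_chunks)) <= 512)  # more chunks, all <= 512 chars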

For more information, see the technical overview.

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions, and either 512 or 1024 dimensions are recommended for binary embeddings.
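
As a rough illustration of why packed binary embeddings are fast to compare (a NumPy sketch of the general idea, not the library's internal code):

import numpy as np

# Dense vectors are binarized by sign, packed into uint8 arrays, and compared
# by counting differing bits (Hamming distance).
def pack_bits(x):
    return np.packbits((x > 0).astype(np.uint8), axis=-1)

def hamming_distance(a, b):
    # XOR the packed bytes, then count the differing bits
    return np.unpackbits(np.bitwise_xor(a, b), axis=-1).sum(axis=-1)

rng = np.random.default_rng(0)
u, v = rng.standard_normal((2, 1024))   # two 1024-d float embeddings
pu, pv = pack_bits(u), pack_bits(v)     # 128 bytes each instead of ~4 KB of floats
print(pu.nbytes, hamming_distance(pu, pv))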

The L2 Supercat model was trained using a batch size of 512 on a single A100 GPU for 12 hours.

Roadmap

  • Adding Inference Features:
    • Semantic text splitting (completed)
  • Additional Example Notebooks:
    • DSPy evaluators
    • Retrieval-Augmented Generation (RAG) pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for LLaMA models). You can then use the following snippet:

from wordllama.extract.extract_safetensors import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

Hint: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest; sometimes you have to inspect and figure it out.
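
If you need to check which shard holds the embeddings, one option is to read the safetensors JSON header directly (a sketch assuming the standard safetensors layout; tensor names vary by model):

import json, struct

def list_tensors(path):
    # A .safetensors file begins with an 8-byte little-endian header length,
    # followed by a JSON header mapping tensor names to dtype/shape/offsets.
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    return {name: info.get("shape") for name, info in header.items() if name != "__metadata__"}

print(list_tensors("path/to/saved/model-0001-of-00XX.safetensors"))
# Look for a (vocab_size, hidden_dim) tensor, e.g. a name like "model.embed_tokens.weight"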

For training, use the scripts in the GitHub repository. You will need to add a configuration file (copy and modify an existing one into the folder).

pip install wordllama[train]
python train.py train --config your_new_config
# (Training process begins)
python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
# (Saves one model per Matryoshka dimension)

Community Projects

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.2.6}
}

License

This project is licensed under the MIT License.
