WordLlama 📝🦙

WordLlama is a fast, lightweight NLP toolkit designed for tasks like fuzzy deduplication, similarity computation, ranking, clustering, and semantic text splitting. It operates with minimal inference-time dependencies and is optimized for CPU hardware, making it suitable for deployment in resource-constrained environments.

Quick Start

Install WordLlama via pip:

pip install wordllama

Load the default 256-dimensional model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Features

  • Fast Embeddings: Efficiently generate text embeddings using a simple token lookup with average pooling.
  • Similarity Computation: Calculate cosine similarity between texts.
  • Ranking: Rank documents based on their similarity to a query.
  • Fuzzy Deduplication: Remove duplicate texts based on a similarity threshold.
  • Clustering: Cluster documents into groups using KMeans clustering.
  • Filtering: Filter documents based on their similarity to a query.
  • Top-K Retrieval: Retrieve the top-K most similar documents to a query.
  • Semantic Text Splitting: Split text into semantically coherent chunks.
  • Binary Embeddings: Support for binary embeddings with Hamming similarity for even faster computations.
  • Matryoshka Representations: Truncate embedding dimensions as needed for flexibility.
  • Low Resource Requirements: Optimized for CPU inference with minimal dependencies.

What is WordLlama?

WordLlama is a utility for natural language processing (NLP) that recycles components from large language models (LLMs) to create efficient and compact word representations, similar to GloVe, Word2Vec, or FastText.

WordLlama begins by extracting the token embedding codebook from state-of-the-art LLMs (e.g., LLaMA 2, LLaMA 3 70B) and then trains a small, context-less model within a general-purpose embedding framework. The result is a lightweight model that improves on traditional word models like GloVe 300d across all MTEB benchmarks while being substantially smaller (e.g., the default 256-dimensional model is 16MB).

WordLlama's key features include:

  1. Matryoshka Representations: Allows for truncation of the embedding dimension as needed, providing flexibility in model size and performance.
  2. Low Resource Requirements: Utilizes a simple token lookup with average pooling, enabling fast operation on CPUs without the need for GPUs.
  3. Binary Embeddings: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations.
  4. Numpy-only Inference: Lightweight inference pipeline relying solely on NumPy, facilitating easy deployment and integration.

Because it is fast and compact, WordLlama serves as a versatile tool for exploratory analysis and utility applications, such as evaluating LLM outputs or handling preparatory tasks in multi-hop or agentic workflows.
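
To make items 2 and 3 above concrete, here is a minimal NumPy sketch of the two ideas: a context-less embedding is just a row lookup into a token-embedding codebook followed by average pooling, and a binary embedding can be packed into bytes and compared with Hamming distance. The codebook, token IDs, and helper names below are illustrative only, not WordLlama's internals.

import numpy as np

# Illustrative codebook; WordLlama's actual codebook and tokenizer differ.
vocab_size, dim = 32_000, 256
rng = np.random.default_rng(0)
codebook = rng.standard_normal((vocab_size, dim)).astype(np.float32)

def embed(token_ids):
    # Token lookup followed by average pooling (item 2).
    return codebook[token_ids].mean(axis=0)

def pack_binary(vec):
    # Sign-binarize and pack into uint8 for compact storage (item 3).
    return np.packbits(vec > 0)

def hamming_similarity(a, b):
    # 1 minus the normalized Hamming distance between packed binary vectors.
    distance = np.unpackbits(np.bitwise_xor(a, b)).sum()
    return 1.0 - distance / (a.size * 8)

v1, v2 = embed([101, 2023, 345]), embed([101, 2023, 678])
print(hamming_similarity(pack_binary(v1), pack_binary(v2)))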

MTEB Results

The following table presents the performance of WordLlama models compared to other similar models.

Metric               WL64   WL128  WL256 (X)  WL512  WL1024  GloVe 300d  Komninos  all-MiniLM-L6-v2
Clustering           30.27  32.20  33.25      33.40  33.62   27.73       26.57     42.35
Reranking            50.38  51.52  52.03      52.32  52.39   43.29       44.75     58.04
Classification       53.14  56.25  58.21      59.13  59.50   57.29       57.65     63.05
Pair Classification  75.80  77.59  78.22      78.50  78.60   70.92       72.94     82.37
STS                  66.24  67.53  67.91      68.22  68.27   61.85       62.46     78.90
CQA DupStack         18.76  22.54  24.12      24.59  24.83   15.47       16.79     41.32
SummEval             30.79  29.99  30.99      29.56  29.39   28.87       30.49     30.81

WL64 to WL1024: WordLlama models with embedding dimensions ranging from 64 to 1024; (X) marks the default 256-dimensional model.

Note: The l2_supercat is a LLaMA 2 vocabulary model. To train this model, we concatenated codebooks from several models, including LLaMA 2 70B and phi 3 medium, after removing additional special tokens. Because several models have used the LLaMA 2 tokenizer, their codebooks can be concatenated and trained together. The performance of the resulting model is comparable to training the LLaMA 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).
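
Conceptually, the "supercat" codebook can be pictured with a short sketch (illustrative only, with widths shrunk for the example; not the actual training code): because the donor models share the 32k-token LLaMA 2 vocabulary, row i of every codebook corresponds to the same token, so the matrices can be concatenated along the feature axis before training.

import numpy as np

# Hypothetical codebooks from two models sharing the LLaMA 2 tokenizer,
# with embedding widths shrunk for illustration.
codebook_a = np.random.randn(32_000, 64).astype(np.float32)
codebook_b = np.random.randn(32_000, 32).astype(np.float32)

# Rows align token-for-token, so concatenate along the feature axis.
supercat = np.concatenate([codebook_a, codebook_b], axis=1)
print(supercat.shape)  # (32000, 96)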

How Fast? ⚡

8k documents from the ag_news dataset

  • Single core performance (CPU), i9 12th gen, DDR4 3200
  • NVIDIA A4500 (GPU)

[Benchmark plot: embedding throughput on 8k ag_news documents, single-core CPU (i9 12th gen, DDR4 3200) vs. NVIDIA A4500 GPU]

Usage

Embedding Text

Load pre-trained embeddings and embed text:

from wordllama import WordLlama

# Load pre-trained embeddings (truncate dimension to 64)
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["The quick brown fox jumps over the lazy dog", "And all that jazz"])
print(embeddings.shape)  # Output: (2, 64)

Calculating Similarity

Compute the similarity between two texts:

similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664
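
Per the feature list, this score is a cosine similarity over the pooled embeddings. Assuming the wl object loaded above, the following sketch computes it directly and should land close to what wl.similarity returns:

import numpy as np

a, b = wl.embed(["I went to the car", "I went to the pawn shop"])
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # Expected to be close to the similarity score above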

Ranking Documents

Rank documents based on their similarity to a query:

query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates, sort=True, batch_size=64)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Fuzzy Deduplication

Remove duplicate texts based on a similarity threshold:

deduplicated_docs = wl.deduplicate(candidates, return_indices=False, threshold=0.5)
print(deduplicated_docs)
# Output:
# ['I went to the park',
#  'I went to the shop',
#  'I went to the truck']
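
Conceptually, fuzzy deduplication amounts to keeping a text only when it is not too similar to anything already kept. A simplified greedy sketch using wl.embed directly (not the library's exact algorithm, and reusing the candidates list from above) looks like this:

import numpy as np

def greedy_dedup(texts, threshold=0.5):
    # Keep a text only if its cosine similarity to every kept text is below the threshold.
    vecs = wl.embed(texts)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    kept_texts, kept_vecs = [], []
    for text, vec in zip(texts, vecs):
        if all(float(vec @ kept) < threshold for kept in kept_vecs):
            kept_texts.append(text)
            kept_vecs.append(vec)
    return kept_texts

# Likely drops 'I went to the vehicle' as a near-duplicate of 'I went to the truck'.
print(greedy_dedup(candidates, threshold=0.5))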

Clustering

Cluster documents into groups using KMeans clustering:

labels, inertia = wl.cluster(candidates, k=3, max_iterations=100, tolerance=1e-4, n_init=3)
print(labels, inertia)
# Output:
# [2, 0, 1, 1], 0.4150

Filtering

Filter documents based on their similarity to a query:

filtered_docs = wl.filter(query, candidates, threshold=0.3)
print(filtered_docs)
# Output:
# ['I went to the vehicle']

Top-K Retrieval

Retrieve the top-K most similar documents to a query:

top_docs = wl.topk(query, candidates, k=2)
print(top_docs)
# Output:
# ['I went to the vehicle', 'I went to the truck']

Semantic Text Splitting

Split text into semantic chunks:

long_text = "Your very long text goes here... " * 100
chunks = wl.split(long_text, target_size=1536)

print(list(map(len, chunks)))
# Output: [1055, 1055, 1187]

Note that the target size is also the maximum size. The .split() feature aggregates sections up to target_size while preserving the order of the text and, as far as possible, sentence and paragraph structure. It uses WordLlama embeddings to locate natural indices to split on, so the output will contain a range of chunk sizes up to the target size.

The recommended target size is 512 to 2048 characters, with a default of 1536. If you need much larger chunks, batch them after splitting; chunks at the default size are usually already aggregates of multiple semantic segments.
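
The mechanics can be pictured with a rough, sentence-level sketch (a toy illustration of the idea, not WordLlama's implementation): score adjacent-sentence similarity with the embeddings, aggregate sentences until the target size would be exceeded, then close the chunk at the weakest boundary in the current window.

import numpy as np

def naive_semantic_split(text, target_size=1536):
    # Toy sentence segmentation; the real splitter is more careful.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    vecs = wl.embed(sentences)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    # Similarity between each sentence and the next; low values are natural split points.
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)

    chunks, window = [], []
    for i, sentence in enumerate(sentences):
        if window and len(" ".join(window + [sentence])) > target_size:
            start = i - len(window)
            # Close the chunk at the least-similar adjacent pair inside the window.
            cut = start + int(np.argmin(sims[start:i])) + 1
            chunks.append(" ".join(sentences[start:cut]))
            window = sentences[cut:i]
        window.append(sentence)
    if window:
        chunks.append(" ".join(window))
    return chunks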

For more information see: technical overview

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions, and either 512 or 1024 dimensions are recommended for binary embeddings.

The L2 Supercat model was trained using a batch size of 512 on a single A100 GPU for 12 hours.

Roadmap

  • Adding Inference Features:
    • Semantic text splitting (completed)
  • Additional Example Notebooks:
    • DSPy evaluators
    • Retrieval-Augmented Generation (RAG) pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for LLaMA models). You can then use the following snippet:

from wordllama.extract.extract_safetensors import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

Hint: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest; sometimes you have to inspect and figure it out.
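
If you do need to inspect a shard, the safetensors header can be read with the standard library alone, without loading any weights. A minimal sketch (the tensor name mentioned follows the usual Hugging Face LLaMA convention and may differ for other models):

import json
import struct

def list_safetensors_tensors(path):
    # A .safetensors file starts with an 8-byte little-endian header length,
    # followed by a JSON header mapping tensor names to dtype/shape/offsets.
    with open(path, "rb") as f:
        header_size = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_size))
    for name, meta in header.items():
        if name != "__metadata__":
            print(name, meta["dtype"], meta["shape"])

# The codebook is typically a tensor like "model.embed_tokens.weight".
list_safetensors_tensors("path/to/saved/model-0001-of-00XX.safetensors")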

For training, use the scripts in the GitHub repository. You will need to add a configuration file (copy and modify an existing one in the configuration folder).

pip install wordllama[train]
python train.py train --config your_new_config
# (Training process begins)
python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
# (Saves one model per Matryoshka dimension)

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.3.3}
}

License

This project is licensed under the MIT License.
