
WordLlama NLP Utility

Project description

WordLlama 📝🦙

WordLlama is a fast, lightweight NLP toolkit designed for tasks like fuzzy deduplication, similarity computation, ranking, clustering, and semantic text splitting. It operates with minimal inference-time dependencies and is optimized for CPU hardware, making it suitable for deployment in resource-constrained environments.

News and Updates 🔥

Quick Start

Install WordLlama via pip:

pip install wordllama

Load the default 256-dimensional model:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Features

  • Fast Embeddings: Efficiently generate text embeddings using a simple token lookup with average pooling.
  • Similarity Computation: Calculate cosine similarity between texts.
  • Ranking: Rank documents based on their similarity to a query.
  • Fuzzy Deduplication: Remove duplicate texts based on a similarity threshold.
  • Clustering: Cluster documents into groups using KMeans clustering.
  • Filtering: Filter documents based on their similarity to a query.
  • Top-K Retrieval: Retrieve the top-K most similar documents to a query.
  • Semantic Text Splitting: Split text into semantically coherent chunks.
  • Binary Embeddings: Support for binary embeddings with Hamming similarity for even faster computations.
  • Matryoshka Representations: Truncate embedding dimensions as needed for flexibility.
  • Low Resource Requirements: Optimized for CPU inference with minimal dependencies.

What is WordLlama?

WordLlama is a utility for natural language processing (NLP) that recycles components from large language models (LLMs) to create efficient and compact word representations, similar to GloVe, Word2Vec, or FastText.

WordLlama starts by extracting the token embedding codebook from state-of-the-art LLMs (e.g., LLaMA 2, LLaMA 3 70B) and then trains a small, context-less model within a general-purpose embedding framework. The result is a lightweight model that outperforms traditional word models such as GloVe 300d on all MTEB benchmarks while being substantially smaller (e.g., the default 256-dimension model is 16MB).

WordLlama's key features include:

  1. Matryoshka Representations: Allows for truncation of the embedding dimension as needed, providing flexibility in model size and performance.
  2. Low Resource Requirements: Utilizes a simple token lookup with average pooling, enabling fast operation on CPUs without the need for GPUs.
  3. Binary Embeddings: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations.
  4. Numpy-only Inference: Lightweight inference pipeline relying solely on NumPy, facilitating easy deployment and integration (a conceptual sketch follows this list).
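
To make items 1–4 concrete, here is a rough conceptual sketch of lookup-plus-average-pooling embeddings, Matryoshka-style truncation, and bit-packed Hamming similarity. It is an illustration with made-up values, not WordLlama's internal code; the shapes, placeholder token IDs, and zero-threshold binarization are assumptions.

import numpy as np

# Toy codebook: a (vocab_size, dim) matrix of token embeddings.
# WordLlama ships a trained matrix; random values are used here only
# so the sketch runs standalone.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(32000, 256)).astype(np.float32)

def embed(token_ids, trunc_dim=256):
    """Token lookup + average pooling, with Matryoshka-style truncation."""
    vectors = codebook[token_ids, :trunc_dim]  # (n_tokens, trunc_dim)
    return vectors.mean(axis=0)                # (trunc_dim,)

def binarize(vec):
    """Sign-quantize and pack into bytes for Hamming comparisons."""
    return np.packbits(vec > 0)

def hamming_similarity(a_bits, b_bits):
    """Fraction of matching bits between two packed binary embeddings."""
    differing = np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum()
    return 1.0 - differing / (a_bits.size * 8)

# Placeholder token IDs standing in for a tokenizer's output.
doc_a = embed(np.array([101, 2057, 2253, 102]), trunc_dim=64)
doc_b = embed(np.array([101, 2057, 2253, 4029, 102]), trunc_dim=64)
print(hamming_similarity(binarize(doc_a), binarize(doc_b)))

In the library itself, the same ideas sit behind WordLlama.load(trunc_dim=...) and the similarity, ranking, and deduplication helpers shown in the Usage section below.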

Because it is fast and compact, WordLlama is a versatile tool for exploratory analysis and utility applications, such as LLM output evaluation or preparatory tasks in multi-hop or agentic workflows.

MTEB Results

The following table presents the performance of WordLlama models compared to other similar models.

| Metric | WL64 | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Clustering | 30.27 | 32.20 | 33.25 | 33.40 | 33.62 | 27.73 | 26.57 | 42.35 |
| Reranking | 50.38 | 51.52 | 52.03 | 52.32 | 52.39 | 43.29 | 44.75 | 58.04 |
| Classification | 53.14 | 56.25 | 58.21 | 59.13 | 59.50 | 57.29 | 57.65 | 63.05 |
| Pair Classification | 75.80 | 77.59 | 78.22 | 78.50 | 78.60 | 70.92 | 72.94 | 82.37 |
| STS | 66.24 | 67.53 | 67.91 | 68.22 | 68.27 | 61.85 | 62.46 | 78.90 |
| CQA DupStack | 18.76 | 22.54 | 24.12 | 24.59 | 24.83 | 15.47 | 16.79 | 41.32 |
| SummEval | 30.79 | 29.99 | 30.99 | 29.56 | 29.39 | 28.87 | 30.49 | 30.81 |

WL64 to WL1024: WordLlama models with embedding dimensions ranging from 64 to 1024.

Note: The l2_supercat is a LLaMA 2 vocabulary model. To train this model, we concatenated codebooks from several models, including LLaMA 2 70B and phi 3 medium, after removing additional special tokens. Because several models have used the LLaMA 2 tokenizer, their codebooks can be concatenated and trained together. The performance of the resulting model is comparable to training the LLaMA 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).
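
As a rough sketch of that concatenation step (the file names, array shapes, and concatenation axis below are assumptions for illustration, not the project's training code):

import numpy as np

# Hypothetical token-embedding codebooks extracted from two models that
# share the LLaMA 2 tokenizer (32k vocabulary), e.g. LLaMA 2 70B and
# phi 3 medium, with any extra special-token rows removed so that row
# order matches token IDs one-to-one.
codebook_a = np.load("llama2_70b_embed_tokens.npy")   # shape (32000, 8192)
codebook_b = np.load("phi3_medium_embed_tokens.npy")  # shape (32000, 5120)

# Because rows line up by token ID, the codebooks can be stacked along the
# feature axis into one wide matrix used as the "supercat" training input.
supercat = np.concatenate([codebook_a, codebook_b], axis=1)
print(supercat.shape)  # (32000, 13312)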

Other Models

How Fast? ⚡

Benchmark: embedding 8k documents from the ag_news dataset, measured on the hardware below (a rough reproduction sketch follows):

  • Single core performance (CPU), i9 12th gen, DDR4 3200
  • NVIDIA A4500 (GPU)
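
A minimal sketch of how a comparable CPU timing could be reproduced. Loading ag_news through the Hugging Face datasets package and pinning thread counts to approximate a single core are assumptions, and results will vary by machine.

import os
os.environ["OMP_NUM_THREADS"] = "1"  # rough single-core approximation

import time
from datasets import load_dataset
from wordllama import WordLlama

docs = load_dataset("ag_news", split="train[:8000]")["text"]
wl = WordLlama.load()

start = time.perf_counter()
embeddings = wl.embed(docs)
elapsed = time.perf_counter() - start
print(f"Embedded {len(docs)} documents in {elapsed:.2f}s ({len(docs) / elapsed:.0f} docs/s)")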

Usage

Embedding Text

Load pre-trained embeddings and embed text:

from wordllama import WordLlama

# Load pre-trained embeddings (truncate dimension to 64)
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["The quick brown fox jumps over the lazy dog", "And all that jazz"])
print(embeddings.shape)  # Output: (2, 64)

Calculating Similarity

Compute the similarity between two texts:

similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

Ranking Documents

Rank documents based on their similarity to a query:

query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates, sort=True, batch_size=64)
print(ranked_docs)
# Output:
# [
#   ('I went to the vehicle', 0.7441),
#   ('I went to the truck', 0.2832),
#   ('I went to the shop', 0.1973),
#   ('I went to the park', 0.1510)
# ]

Fuzzy Deduplication

Remove duplicate texts based on a similarity threshold:

deduplicated_docs = wl.deduplicate(candidates, return_indices=False, threshold=0.5)
print(deduplicated_docs)
# Output:
# ['I went to the park',
#  'I went to the shop',
#  'I went to the truck']

Clustering

Cluster documents into groups using KMeans clustering:

labels, inertia = wl.cluster(candidates, k=3, max_iterations=100, tolerance=1e-4, n_init=3)
print(labels, inertia)
# Output:
# [2, 0, 1, 1], 0.4150

Filtering

Filter documents based on their similarity to a query:

filtered_docs = wl.filter(query, candidates, threshold=0.3)
print(filtered_docs)
# Output:
# ['I went to the vehicle']

Top-K Retrieval

Retrieve the top-K most similar documents to a query:

top_docs = wl.topk(query, candidates, k=2)
print(top_docs)
# Output:
# ['I went to the vehicle', 'I went to the truck']

Semantic Text Splitting

Split text into semantic chunks:

long_text = "Your very long text goes here... " * 100
chunks = wl.split(long_text, target_size=1536)

print(list(map(len, chunks)))
# Output: [1055, 1055, 1187]

Note that the target size is also the maximum size. The .split() feature aggregates sections up to target_size while preserving the order of the text and sentence boundaries, and, as far as possible, paragraph structure. It uses WordLlama embeddings to locate natural indices to split on, so the output contains a range of chunk sizes up to the target size.

The recommended target size is from 512 to 2048 characters, with the default size at 1536. Chunks that need to be much larger should probably be batched after splitting, and will often be aggregated from multiple semantic chunks already.
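
For intuition only, here is a heavily simplified sketch of the idea: embed candidate pieces, start a new chunk when the next sentence would overflow target_size or is semantically dissimilar to the running chunk, and otherwise keep aggregating. The sentence splitting, similarity threshold, and chunk-vector update below are illustrative assumptions; the library's .split() chooses split points more carefully.

import numpy as np
from wordllama import WordLlama

wl = WordLlama.load()

def naive_semantic_split(text, target_size=1536, drift_threshold=0.3):
    """Toy illustration of embedding-guided splitting (not the real .split())."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    vecs = wl.embed(sentences)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    chunks, current, current_vec = [], "", None
    for sent, vec in zip(sentences, vecs):
        too_big = len(current) + len(sent) + 2 > target_size
        # A low cosine similarity to the running chunk vector marks a natural break.
        drifted = current_vec is not None and float(current_vec @ vec) < drift_threshold
        if current and (too_big or drifted):
            chunks.append(current)
            current, current_vec = "", None
        current = (current + " " + sent).strip() + "."
        mean = vec if current_vec is None else (current_vec + vec) / 2
        current_vec = mean / np.linalg.norm(mean)
    if current:
        chunks.append(current)
    return chunks

print(list(map(len, naive_semantic_split("Your very long text goes here... " * 100))))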

For more information see: technical overview

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions, and either 512 or 1024 dimensions are recommended for binary embeddings.

The L2 Supercat model was trained using a batch size of 512 on a single A100 GPU for 12 hours.

Roadmap

  • Adding Inference Features:
    • Semantic text splitting (completed)
  • Additional Example Notebooks:
    • DSPy evaluators
    • Retrieval-Augmented Generation (RAG) pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for LLaMA models). You can then use the following snippet:

from wordllama.extract.extract_safetensors import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

Hint: The token embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest/index file; otherwise you may have to inspect the shards and figure it out, for example as sketched below.
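
One way to do that inspection, assuming the safetensors package is installed (tensor names vary by model family, so the embedding key in the comment is only a typical example):

from safetensors import safe_open

# List the tensors stored in a shard to locate the token embedding matrix,
# e.g. a key like "model.embed_tokens.weight" in LLaMA-style checkpoints.
with safe_open("path/to/saved/model-0001-of-00XX.safetensors", framework="numpy") as f:
    for name in f.keys():
        print(name)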

For training, use the scripts in the GitHub repository. You have to add a configuration file (copy/modify an existing one into the folder).

pip install wordllama[train]
python train.py train --config your_new_config
# (Training process begins)
python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
# (Saves one model per Matryoshka dimension)

Community Projects

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.3.3}
}

License

This project is licensed under the MIT License.


Download files

Download the file for your platform.

Source Distribution

| File | Size | Type |
| --- | --- | --- |
| wordllama-0.3.6.post1.tar.gz | 17.5 MB | Source |

Built Distributions

| File | Size | Python | Platform |
| --- | --- | --- | --- |
| wordllama-0.3.6.post1-cp312-cp312-win_amd64.whl | 16.9 MB | CPython 3.12 | Windows x86-64 |
| wordllama-0.3.6.post1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 19.1 MB | CPython 3.12 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.3.6.post1-cp312-cp312-macosx_12_0_arm64.whl | 17.1 MB | CPython 3.12 | macOS 12.0+ ARM64 |
| wordllama-0.3.6.post1-cp312-cp312-macosx_10_13_x86_64.whl | 17.2 MB | CPython 3.12 | macOS 10.13+ x86-64 |
| wordllama-0.3.6.post1-cp311-cp311-win_amd64.whl | 16.9 MB | CPython 3.11 | Windows x86-64 |
| wordllama-0.3.6.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 19.2 MB | CPython 3.11 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.3.6.post1-cp311-cp311-macosx_12_0_arm64.whl | 17.1 MB | CPython 3.11 | macOS 12.0+ ARM64 |
| wordllama-0.3.6.post1-cp311-cp311-macosx_10_9_x86_64.whl | 17.2 MB | CPython 3.11 | macOS 10.9+ x86-64 |
| wordllama-0.3.6.post1-cp310-cp310-win_amd64.whl | 16.9 MB | CPython 3.10 | Windows x86-64 |
| wordllama-0.3.6.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 19.0 MB | CPython 3.10 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.3.6.post1-cp310-cp310-macosx_12_0_arm64.whl | 17.1 MB | CPython 3.10 | macOS 12.0+ ARM64 |
| wordllama-0.3.6.post1-cp310-cp310-macosx_10_9_x86_64.whl | 17.2 MB | CPython 3.10 | macOS 10.9+ x86-64 |
| wordllama-0.3.6.post1-cp39-cp39-win_amd64.whl | 16.9 MB | CPython 3.9 | Windows x86-64 |
| wordllama-0.3.6.post1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 19.0 MB | CPython 3.9 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.3.6.post1-cp39-cp39-macosx_12_0_arm64.whl | 17.1 MB | CPython 3.9 | macOS 12.0+ ARM64 |
| wordllama-0.3.6.post1-cp39-cp39-macosx_10_9_x86_64.whl | 17.2 MB | CPython 3.9 | macOS 10.9+ x86-64 |

File details

Details for the file wordllama-0.3.6.post1.tar.gz.

File metadata

  • Download URL: wordllama-0.3.6.post1.tar.gz
  • Upload date:
  • Size: 17.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for wordllama-0.3.6.post1.tar.gz
Algorithm Hash digest
SHA256 bcd96e4b6dcd5421aab35e634f406864bfd7e2fdbff99ffc3f6e264574c1e0c5
MD5 5dd1302cdc5cb505c7ebacbf1d1a4eca
BLAKE2b-256 032c953e8add3985c63b60bd5cd2ad086c3c1de8bbde4e3e2e99188ee0398b3e

File details

Details for the file wordllama-0.3.6.post1-cp312-cp312-win_amd64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 df2055f2b8971cd25914f437cecbddf267bb51c95314557058c50b624c43db2a
MD5 21a2fc36f8effdeccfdd7e3fbb5d3c83
BLAKE2b-256 089caf4e20a1bf0ac7f0da7aac51c4e73182a2803c13857d1553c3a4486ed6d9

File details

Details for the file wordllama-0.3.6.post1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d583b15abee50ae7bc00e091d411ae1fd1aa2e4ed750f0f88c0aed8c7694c89b
MD5 ac09fe924b80fa44f08b0b417e8c51c2
BLAKE2b-256 7d1fa9d3151e64416e816ea1b6d94d0b34d5f60be4ac8acfe7f4c3413c9e2b26

File details

Details for the file wordllama-0.3.6.post1-cp312-cp312-macosx_12_0_arm64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp312-cp312-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 a669b2d0ee5c981781f090e517b71f33a58922ab012f74aabef66edf39833563
MD5 962cb910772699cf71038478eb23393e
BLAKE2b-256 01d4d6bbe609880d5e47962e7f983411eb64752e89f562f13f961232231e6374

File details

Details for the file wordllama-0.3.6.post1-cp312-cp312-macosx_10_13_x86_64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 06e3acdd11efc612e68a89938ceacea9c78845450c953182e315414d98628192
MD5 f7befeb213af6b3469df54f621d1d34a
BLAKE2b-256 92bd1905e0ded902b36744e88863114e8934f9298d46c4feb8e69bc0eb468d55

File details

Details for the file wordllama-0.3.6.post1-cp311-cp311-win_amd64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 f1ac2539a4c173ee5f58cadf06e8684aabe3c802a1b59af3f40ebdf03a6c694a
MD5 3d818b3a3078719a53667e4b7b65ede8
BLAKE2b-256 bdb89b1e22bfce468cec281fbeb15d1e7c4e676c7b6a8d47e8b28e4f126352c8

File details

Details for the file wordllama-0.3.6.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 0cec8cce81b1f584526742d2d41790cda6f588328264be2da619b0cd470cf54b
MD5 694545ec4ac53f1bda96b4aeeb588b27
BLAKE2b-256 d89303bc3442d592a3de6293079129ad800e2dd3ad88590923925f0e6f0e7457

File details

Details for the file wordllama-0.3.6.post1-cp311-cp311-macosx_12_0_arm64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp311-cp311-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 ec1f976cc3702bb21e048852015c75ba7390d3efaf1e7b11d3b6e0f36f6b2862
MD5 64a1a706cfbbcffef7b1dc1eedfa2f16
BLAKE2b-256 a505c0d62ddab253d7274702e181c72b2fab76f086e06beaa0dd632556099262

File details

Details for the file wordllama-0.3.6.post1-cp311-cp311-macosx_10_9_x86_64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 acf6f46ba7b6e2a5223f3037d449f827b15b226f0483b02a19f496f07f28a846
MD5 53637e66630284f06699441ded93e709
BLAKE2b-256 b29bfbbc5e75b2e6527ae0eb8682e0731969f95d46138de6529075a48f0c4f63

File details

Details for the file wordllama-0.3.6.post1-cp310-cp310-win_amd64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 31ed3b4c019fe6410ee190c03716050e6157f269754127323d9a840c2fbc5d02
MD5 593d9f8a8da35494032795c8dc674bf4
BLAKE2b-256 911900518071cd5800236ce455d70f0cc7db7e50df5e7e6b8496146cc4e30e04

File details

Details for the file wordllama-0.3.6.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c2e325a8e70002879d65060d398e92213cd9978ea8ea1add3a1f270d456eb814
MD5 c7fa09495b246e236f264f80a97e0a80
BLAKE2b-256 15fa6b6c58e0a2809283f6d8db443b37befe90a4e0b30c779bd87f4d245bf54a

File details

Details for the file wordllama-0.3.6.post1-cp310-cp310-macosx_12_0_arm64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp310-cp310-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 9d9d1a2ae279d9ab65ff99e98fb73cf151aab65309c5f3459387306494c00f32
MD5 60b5f15469aafcf3a548cff5595b3e12
BLAKE2b-256 1dbcc0a77230c4a5b8c42f0551a51f19d30f14d32b0e1e59f8ba5f4e1b6609eb

File details

Details for the file wordllama-0.3.6.post1-cp310-cp310-macosx_10_9_x86_64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 55b6a855199b408691ff4234ecdb77b61be597d2d91d6753387df856819a71e4
MD5 dfbe4665a3376f582062e11192e049fe
BLAKE2b-256 04a0284c7603cb8c6c8b47120f0c2604e091a967d769fe3b0d2f4318527591c8

File details

Details for the file wordllama-0.3.6.post1-cp39-cp39-win_amd64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 0c6b928abef2e91d0b3ee10d11e650e934216f6afbd6df266de616b220ce5540
MD5 26b66f99a5667357559926d4a245313f
BLAKE2b-256 de187c521208c4ecfb2f9f9fc22be31b118754da37a06a9dc4a6bd565fa5b2f6

File details

Details for the file wordllama-0.3.6.post1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 26f7523a72301690286f55b2d9c4d3fdddeb03f26fb6a3b579c3c07ffcb41cfc
MD5 13c88ac512594948c1cfb38597fd5405
BLAKE2b-256 d311419d7208e258dece5d876f8e9e23596be7aeae02d1388f22eeedd0019901

File details

Details for the file wordllama-0.3.6.post1-cp39-cp39-macosx_12_0_arm64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp39-cp39-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 776489201be3df428e71edac9978cae6d38aaf9823af3c03dbf89c2fccd26dce
MD5 f51646a138f9d3ef7733ce0b203ef8ec
BLAKE2b-256 45b4d340e429ea702aa66dad3bfc8755c51f9025243acd8cfbc5d8a1503263f5

File details

Details for the file wordllama-0.3.6.post1-cp39-cp39-macosx_10_9_x86_64.whl.

File hashes

Hashes for wordllama-0.3.6.post1-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 dc07ae5504791b7e54f8c4a83661720a8ac00fdf68b31bf630a7f522f2dc25e3
MD5 7d271e4f821cddc476c522a80e10fbd2
BLAKE2b-256 8cea9075798578b4dc868195964376e487398a2654bda4d64073d003eb40e871
