WordLlama Embedding Utility

Project description

WordLlama

The power of 15 trillion tokens of training, extracted, flogged and minimized into a cute little package for word embedding.

Quick Start

Install:

pip install wordllama

Load the 256-dim model.

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882

# Rank documents based on their similarity to a query
query = "i went to the car"
candidates = ["i went to the park", "i went to the shop", "i went to the truck", "i went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('i went to the vehicle', 0.7441646856486314),
#   ('i went to the truck', 0.2832691551894259),
#   ('i went to the shop', 0.19732814982305436),
#   ('i went to the park', 0.15101404519322253)
# ]

# additional inference methods
wl.deduplicate(candidates, threshold=0.8) # fuzzy deduplication
wl.cluster(candidates, k=2, max_iterations=100, tolerance=1e-4) # cluster labels using kmeans/kmeans++ init
wl.filter(query, candidates, threshold=0.3) # filter candidates based on query
wl.topk(query, candidates, k=3) # return topk strings based on query

What is it?

WordLlama is an NLP utility and word embedding model that recycles components from large language models (LLMs) to create efficient and compact word representations, in the spirit of GloVe, Word2Vec or FastText. WordLlama begins by extracting the token embedding codebook from a state-of-the-art LLM (e.g., Llama 3 70B) and training a small, context-less model in a general-purpose embedding framework.

WordLlama improves on all MTEB benchmarks over word models like GloVe 300d, while being substantially smaller in size (16 MB default model at 256 dimensions vs. >2 GB).

Features of WordLlama include:

  1. Matryoshka Representations: Truncate embedding dimension as needed.
  2. Low Resource Requirements: A simple token lookup with average pooling enables fast operation on CPU.
  3. Binarization: Models trained using the straight-through estimator can be packed into small integer arrays for even faster Hamming distance calculations. (coming soon)
  4. Numpy-only inference: Lightweight and simple.

For flexibility, WordLlama employs the Matryoshka representation learning technique during training. The largest model (1024-dim) can be truncated to 64, 128, 256 or 512 dimensions. For binary embedding models, we implement straight-through estimators during training. For dense embeddings, 256 dimensions captures most of the performance, while for binary embeddings validation accuracy is close to saturation at 512 dimensions (64 bytes packed).
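
As a quick illustration of truncation, here is a minimal sketch using the load() interface from the Quick Start (the loop and sentence are illustrative, not from the library docs):

from wordllama import WordLlama

# The same underlying model can be loaded at several truncation dimensions;
# smaller dimensions trade a little accuracy for smaller vectors.
for dim in (64, 128, 256):
    wl = WordLlama.load(trunc_dim=dim)
    emb = wl.embed(["the quick brown fox jumps over the lazy dog"])
    print(dim, emb.shape)  # (1, 64), (1, 128), (1, 256)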

The final weights are saved after weighting, projection and truncation of the entire tokenizer vocabulary. Thus, WordLlama becomes a single embedding matrix (nn.Embedding) that is considerably smaller than the gigabyte-sized LLM codebooks we start with. The original tokenizer is still used to preprocess the text into tokens, and the reduced-size token embeddings are average pooled. There is very little computation required, and the resulting model sizes range from 16 MB to 250 MB for the 128k Llama 3 vocabulary.
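
As a rough numpy sketch of that inference path (the matrix and token IDs below are stand-ins for illustration, not the library's internals):

import numpy as np

# Stand-in embedding matrix: (vocab_size, dim); the real one is loaded from disk
embedding_matrix = np.random.rand(128_256, 256).astype(np.float32)

def embed_tokens(token_ids):
    # Look up each token's vector, then average pool over the sequence
    return embedding_matrix[token_ids].mean(axis=0)

print(embed_tokens([101, 2023, 2003, 102]).shape)  # (256,)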

It's a good option for NLP-lite tasks. You can train sklearn classifiers on it, perform basic semantic matching, fuzzy deduplication, ranking and clustering. I think it should work well for creating LLM output evaluators, or for other preparatory tasks involved in multi-hop or agentic workflows. You can perform your own LLM surgery and train your own model on consumer GPUs in a few hours. Because it is fast and compact, it makes a good "Swiss-Army Knife" utility for exploratory analysis and utility applications.
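
For instance, here is a minimal sketch of training a scikit-learn classifier on WordLlama embeddings (the toy texts and labels are made up for illustration):

from sklearn.linear_model import LogisticRegression
from wordllama import WordLlama

wl = WordLlama.load()

# Toy sentiment data, for illustration only
texts = ["great product, would buy again", "works exactly as advertised",
         "broke after one day", "total waste of money"]
labels = [1, 1, 0, 0]

X = wl.embed(texts)  # (4, 256) numpy array
clf = LogisticRegression().fit(X, labels)
print(clf.predict(wl.embed(["pretty happy with this purchase"])))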

MTEB Results (l2_supercat)

| Metric | WL64 | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2 |
|--------|------|-------|-----------|-------|--------|------------|----------|------------------|
| Clustering | 30.27 | 32.20 | 33.25 | 33.40 | 33.62 | 27.73 | 26.57 | 42.35 |
| Reranking | 50.38 | 51.52 | 52.03 | 52.32 | 52.39 | 43.29 | 44.75 | 58.04 |
| Classification | 53.14 | 56.25 | 58.21 | 59.13 | 59.50 | 57.29 | 57.65 | 63.05 |
| Pair Classification | 75.80 | 77.59 | 78.22 | 78.50 | 78.60 | 70.92 | 72.94 | 82.37 |
| STS | 66.24 | 67.53 | 67.91 | 68.22 | 68.27 | 61.85 | 62.46 | 78.90 |
| CQA DupStack | 18.76 | 22.54 | 24.12 | 24.59 | 24.83 | 15.47 | 16.79 | 41.32 |
| SummEval | 30.79 | 29.99 | 30.99 | 29.56 | 29.39 | 28.87 | 30.49 | 30.81 |

The l2_supercat is a Llama 2-vocabulary model. To train it, I concatenated codebooks from several models, including Llama 2 70B and Phi-3 medium (after removing additional special tokens). Because several models share the Llama 2 tokenizer, their codebooks can be concatenated and trained together. Performance of the resulting model is comparable to training on the Llama 3 70B codebook, while being 4x smaller (32k vs. 128k vocabulary).
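
A sketch of what that concatenation step can look like when models share a tokenizer; the array names, shapes and the feature-axis choice are assumptions for illustration, not the actual training code:

import numpy as np

# Hypothetical extracted codebooks sharing the 32k Llama 2 vocabulary,
# so each row corresponds to the same token in both arrays.
llama2_codebook = np.random.rand(32_000, 8192).astype(np.float32)
phi3_codebook = np.random.rand(32_000, 5120).astype(np.float32)

# Concatenate along the feature axis; training then learns a projection
# from this wide representation down to the target embedding dimension.
supercat = np.concatenate([llama2_codebook, phi3_codebook], axis=1)
print(supercat.shape)  # (32000, 13312)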

Embed Text

Here’s how you can load pre-trained embeddings and use them to embed text:

from wordllama import WordLlama

# Load pre-trained embeddings
# truncate dimension to 64
wl = WordLlama.load(trunc_dim=64)

# Embed text
embeddings = wl.embed(["the quick brown fox jumps over the lazy dog", "and all that jazz"])
print(embeddings.shape)  # (2, 64)

Binary embedding models can be used like this:

# Binary embeddings are packed into uint32
# 64-dims => array of 2x uint32 
wl = WordLlama.load(trunc_dim=64, binary=True)  # this will download the binary model from huggingface
wl.embed("I went to the car") # Output: array([[3029168104, 2427562626]], dtype=uint32)

# load a binary model trained with the straight-through estimator
wl = WordLlama.load(dim=1024, binary=True)

# Uses hamming similarity for binarized embeddings
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.57421875

ranked_docs = wl.rank("i went to the car", ["van", "truck"])

wl.binary = False # turn off hamming and use cosine

# load a different model class
wl = WordLlama.load(config="l3_supercat", dim=1024) # downloads model from HF
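
For reference, Hamming similarity over packed binary embeddings can be computed with plain numpy along these lines (an illustrative sketch, not necessarily the library's internal implementation; the second vector is hypothetical):

import numpy as np

def hamming_similarity(a, b):
    # XOR the packed embeddings, count differing bits, and map to [0, 1]
    diff_bits = np.unpackbits(np.bitwise_xor(a, b).view(np.uint8)).sum()
    total_bits = a.size * a.itemsize * 8
    return 1.0 - diff_bits / total_bits

a = np.array([3029168104, 2427562626], dtype=np.uint32)  # packed embedding from above
b = np.array([1234567890, 987654321], dtype=np.uint32)   # hypothetical second embedding
print(hamming_similarity(a, b))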

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions; either 512 or 1024 dimensions is recommended for binary embeddings.

L2 Supercat was trained using a batch size of 512 on a single A100 for 12 hours.

Roadmap

  • Working on adding inference features:
    • Semantic text splitting
  • Add example notebooks
    • DSPy evaluators
    • RAG pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for Llama 3 models). You can then use the following snippet:

from wordllama.extract import extract_safetensors

# Extract embeddings for the specified configuration
extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

HINT: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest, sometimes you have to snoop around and figure it out.
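
If you need to snoop around, one way to list the tensor names and shapes in a shard is with the safetensors library (the path below reuses the placeholder above; the tensor name in the comment is a common convention, not guaranteed):

from safetensors import safe_open

# Print every tensor name and shape; the token embedding matrix is often
# named something like "model.embed_tokens.weight".
with safe_open("path/to/saved/model-0001-of-00XX.safetensors", framework="numpy") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())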

For training, use the scripts in the GitHub repo. You'll need to add a configuration file (copy and modify an existing one in the folder).

$ pip install wordllama[train]
$ python train.py train --config your_new_config
(training stuff happens)
$ python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
(saves 1 model per matryoshka dim)

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

@software{miller2024wordllama,
  author = {Miller, D. Lee},
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
  version = {0.2.3}
}

License

This project is licensed under the MIT License.

Download files

Download the file for your platform.

Source Distribution

| File | Size | Uploaded |
|------|------|----------|
| wordllama-0.2.3.post35.tar.gz | 16.5 MB | Source |

Built Distributions

| File | Size | Python | Platform |
|------|------|--------|----------|
| wordllama-0.2.3.post35-cp312-cp312-win_amd64.whl | 16.4 MB | CPython 3.12 | Windows x86-64 |
| wordllama-0.2.3.post35-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 17.6 MB | CPython 3.12 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.3.post35-cp312-cp312-macosx_12_0_arm64.whl | 16.7 MB | CPython 3.12 | macOS 12.0+ ARM64 |
| wordllama-0.2.3.post35-cp312-cp312-macosx_10_9_x86_64.whl | 16.7 MB | CPython 3.12 | macOS 10.9+ x86-64 |
| wordllama-0.2.3.post35-cp311-cp311-win_amd64.whl | 16.4 MB | CPython 3.11 | Windows x86-64 |
| wordllama-0.2.3.post35-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 17.6 MB | CPython 3.11 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.3.post35-cp311-cp311-macosx_12_0_arm64.whl | 16.6 MB | CPython 3.11 | macOS 12.0+ ARM64 |
| wordllama-0.2.3.post35-cp311-cp311-macosx_10_9_x86_64.whl | 16.7 MB | CPython 3.11 | macOS 10.9+ x86-64 |
| wordllama-0.2.3.post35-cp310-cp310-win_amd64.whl | 16.4 MB | CPython 3.10 | Windows x86-64 |
| wordllama-0.2.3.post35-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 17.5 MB | CPython 3.10 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.3.post35-cp310-cp310-macosx_12_0_arm64.whl | 16.7 MB | CPython 3.10 | macOS 12.0+ ARM64 |
| wordllama-0.2.3.post35-cp310-cp310-macosx_10_9_x86_64.whl | 16.7 MB | CPython 3.10 | macOS 10.9+ x86-64 |
| wordllama-0.2.3.post35-cp39-cp39-win_amd64.whl | 16.4 MB | CPython 3.9 | Windows x86-64 |
| wordllama-0.2.3.post35-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 17.5 MB | CPython 3.9 | manylinux: glibc 2.17+ x86-64 |
| wordllama-0.2.3.post35-cp39-cp39-macosx_12_0_arm64.whl | 16.7 MB | CPython 3.9 | macOS 12.0+ ARM64 |
| wordllama-0.2.3.post35-cp39-cp39-macosx_10_9_x86_64.whl | 16.7 MB | CPython 3.9 | macOS 10.9+ x86-64 |
