Lightweight runner for static-embedding / bag-of-embeddings sentence models (numpy only, CPU, Windows-friendly)
static-embed-runner
A lightweight runner for static-embedding / bag-of-embeddings sentence models, powered by numpy only.
It runs the full pipeline — tokenize → mean pooling → (optional head) → L2 normalization —
in pure Python with a total dependency footprint of ~50 MB, without torch / transformers /
sentence-transformers. No native build step is required, so it runs anywhere
(including Windows, where compiler toolchains and runtime DLLs are a common source of pain —
here pip install is all you need).
Verified models
Loads static-embedding models in tokenizer.json + safetensors form, from a local
directory or a Hugging Face Hub repo id. The same verification and benchmark flow has been
run successfully on both Windows 11 and WSL Ubuntu 24.04.3 LTS. Each model below was
verified end-to-end on a 224-sentence ja/en corpus plus edge cases, against its reference
implementation:
| Model | Format | Verification |
|---|---|---|
| sentence-transformers/static-retrieval-mrl-en-v1 (English retrieval, official) | StaticEmbedding | token ids 224/224; max emb diff 5.2e-8 vs sentence-transformers |
| sentence-transformers/static-similarity-mrl-multilingual-v1 (multilingual, official) | StaticEmbedding | token ids 224/224; max emb diff 4.5e-8 (benchmark target below) |
| minishlab/potion-base-8M and other Model2Vec models | Model2Vec | token ids 224/224; max emb diff 1.2e-7 vs model2vec reference |
| RikkaBotan/quantized-stable-static-embedding-fast-retrieval-mrl-bilingual-ja-en (ja/en bilingual, 4-bit quantized) | SSE q4 | token ids 224/224; max emb diff 6e-8 vs sentence-transformers |
For Model2Vec models the runner mirrors the reference implementation's semantics of dropping the unknown token before pooling.
The tokenizer is a pure-Python implementation driven by tokenizer.json, covering
Unigram (Viterbi, byte_fallback) and WordPiece (including BertNormalizer /
BertPreTokenizer). Configurations outside this subset automatically fall back to the
tokenizers package when installed.
Installation
pip install static-embed-runner # numpy is the only dependency
pip install static-embed-runner[rust] # optional: Rust tokenizer backend
Usage
from static_embed_runner import StaticEmbedRunner, similarity
# Accepts a HF Hub repo id or a local directory. For Hub ids, only the files
# the runner needs are downloaded (via urllib) and cached under
# ~/.cache/static-embed-runner.
runner = StaticEmbedRunner.load("minishlab/potion-base-8M")
emb = runner.encode(["こんにちは", "Hello"]) # (2, dim) float32, L2-normalized
emb64 = runner.encode("Hello", truncate_dim=64) # Matryoshka truncation (MRL models)
sim = similarity(emb, emb) # bundled cosine-similarity helper (2, 2)
CLI:
static-embed-runner minishlab/potion-base-8M "Hello world" --bench --out emb.npy
API scope
What this library produces is raw, L2-normalized embedding vectors (numpy arrays).
Similarity computation and search are fundamentally the caller's responsibility, but since
dot product = cosine for normalized vectors, a thin similarity(a, b) helper is bundled
(essentially a @ b.T). ANN indexes, storage, and reranking are out of scope — the numpy
arrays plug directly into faiss / hnswlib / sqlite-vec and friends.
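Because the vectors are L2-normalized, a brute-force search is just a matrix product followed by a sort. The following is a minimal sketch of that idea, using random unit vectors as stand-ins for encoder output; `l2_normalize`, `cosine_sim`, and `top_k` are hypothetical names for illustration, not part of this library's API.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (all-zero rows stay zero)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity for already-normalized rows: a plain dot product."""
    return a @ b.T

def top_k(query: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most similar corpus rows for each query row."""
    scores = cosine_sim(query, corpus)
    return np.argsort(-scores, axis=1)[:, :k]

# Stand-in data: random unit vectors instead of real embeddings.
rng = np.random.default_rng(0)
corpus = l2_normalize(rng.standard_normal((5, 8)).astype(np.float32))
query = corpus[[2]]            # reuse a corpus row as the query
hits = top_k(query, corpus, k=3)
```

For corpora too large for an exhaustive scan, the same arrays can be handed to an ANN index instead of `top_k`.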
Options
- table="q4" (default): for 4-bit quantized models, keeps the table packed in memory (~26 MB RAM) and dequantizes only the rows actually looked up.
- table="f32": dequantizes the whole table at load time (~200 MB RAM). Fastest lookups.
- tokenizer_backend="lite" (default: auto): pure-Python tokenizer. "rust" uses the tokenizers package.
- encode(..., normalize=False): when you need pre-normalization vectors.
Benchmarks
Target model: sentence-transformers/static-similarity-mrl-multilingual-v1
(official multilingual static embedding model, vocab 105,879 × 1024 dims).
Baseline: the same model running on sentence-transformers + torch (CPU).
Environment: Windows 11 / i7-14700 (20C/28T) / Python 3.13.
Corpus: 224 mixed ja/en sentences (bench/texts.py). The runner's output matches the
baseline with identical token ids (224/224) and a max embedding diff of 4.5e-8
(float32 rounding only).
| Configuration | Deps size | Single p50 | Batch ms/text | Throughput |
|---|---|---|---|---|
| Baseline: sentence-transformers + torch (CPU) | 993.6 MB | 0.419 ms | 0.0246 | 40,587 txt/s |
| runner (numpy only) | 52.5 MB | 0.020 ms | 0.0159 | 62,848 txt/s |
| runner + tokenizers (optional) | 96.4 MB | 0.047 ms | 0.0216 | 46,301 txt/s |
- Dependency size: ~1/19 (52.5 MB vs 993.6 MB)
- Single-text latency: ~21× faster
- Batch throughput: 1.5×
- Load time: 0.28 s vs 7.0 s import+load for the baseline (with warm HF cache)
Cold numbers (fresh process, empty caches): single-text p50 is 0.049 ms, and the first batch call pays a one-time ~0.1 s warm-up (first touch of the 433 MB table + BLAS thread spin-up) before settling at the steady-state numbers above.
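The headline ratios and the 433 MB table size follow directly from the figures quoted above. As a quick sanity check, the arithmetic can be reproduced like this (the variable names are only for illustration):

```python
# Float32 embedding table: vocab rows x dims x 4 bytes, in MB (10**6 bytes).
vocab, dim, bytes_per_float32 = 105_879, 1024, 4
table_mb = vocab * dim * bytes_per_float32 / 1e6

deps_ratio = 993.6 / 52.5            # baseline deps size / runner deps size
latency_ratio = 0.419 / 0.020        # baseline single p50 / runner single p50
throughput_ratio = 62_848 / 40_587   # runner txt/s / baseline txt/s

print(f"table: {table_mb:.1f} MB")
print(f"deps: {deps_ratio:.1f}x, latency: {latency_ratio:.1f}x, "
      f"throughput: {throughput_ratio:.2f}x")
```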
4-bit quantized models (SSE format) show the same trend, with the table held packed at
~26 MB RAM in table="q4" mode.
With the word-level cache, the pure-Python tokenizer wins batch throughput on typical
corpora too; the [rust] backend (+44 MB) mainly pays off for bulk indexing of
low-redundancy text and for tokenizer.json configs outside the built-in subset.
All speedups come from algorithms and BLAS; no OS-specific optimizations are used.
Running the benchmark locally
Model weights are not included in this repository. The real-model smoke tests look
for ./model by default and are skipped when it is missing. To run them without
manually preparing ./model, pass a Hugging Face repo id or a local model directory
with RUNNER_MODEL:
python -m venv .venv
.venv/bin/python -m pip install -e . pytest
RUNNER_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
.venv/bin/python -m pytest -q
RUNNER_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
.venv/bin/python bench/bench_runner.py
bench_runner.py downloads only the files this runner needs and writes reproducible
artifacts under results/. Its result name includes the detected model format, actual
table storage, and tokenizer backend, for example
runner_format=static-embedding_table=f32_tok=lite or
runner_format=sse-q4_table=q4_tok=lite.
The sentence-transformers baseline uses a separate environment so its large dependency tree does not affect the runner dependency-size measurement:
python -m venv .venv-baseline
.venv-baseline/bin/python -m pip install sentence-transformers
BASELINE_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
.venv-baseline/bin/python bench/bench_baseline.py
Implementation notes
The pipeline is tokenize → mean pooling → head (if any) → MRL truncation → L2 normalize.
Everything runs on CPU; no GPU is used (static embeddings are just table lookups plus a
mean, so transfer overhead would dwarf the compute on a GPU; the baseline was also run
with device="cpu" for fairness).
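The pipeline is small enough to sketch in a few lines of numpy. The version below is an illustration under stated assumptions, not the library's code: it starts from token ids rather than text, omits the optional head, and maps an empty text to a zero vector (an assumption made for the example). `encode_ids` is a hypothetical name.

```python
import numpy as np

def encode_ids(table, id_lists, truncate_dim=None, normalize=True):
    """Mean-pool token embeddings per text, optionally truncate, then L2-normalize."""
    dim = table.shape[1]
    out = np.zeros((len(id_lists), dim), dtype=np.float32)
    for row, ids in enumerate(id_lists):
        if len(ids):                        # assumption: empty text -> zero vector
            out[row] = table[np.asarray(ids)].mean(axis=0)
    if truncate_dim is not None:            # MRL: keep only the leading dimensions
        out = out[:, :truncate_dim]
    if normalize:
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        out = out / np.maximum(norms, 1e-12)
    return out

# Stand-in table: 10 tokens x 6 dims of random float32 values.
rng = np.random.default_rng(0)
table = rng.standard_normal((10, 6)).astype(np.float32)
emb = encode_ids(table, [[1, 2, 3], [4], []], truncate_dim=4)
```

Truncation happens before normalization, so a truncated vector is still unit length.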
1. numpy as the only dependency (993.6 MB → 52.5 MB)
A static embedding model is really just "an embedding table plus three lines of math", yet the baseline drags in torch (518 MB), scipy, transformers, and more. So everything around the table is reimplemented from scratch:
- safetensors reader (safetensors_lite.py, ~40 lines): the format is just "8-byte header length + JSON metadata + raw buffer", readable with struct + json + numpy.
- Tokenizer (hf_tokenizer_lite.py): pure-Python, driven by tokenizer.json. Supports Unigram (Viterbi, byte_fallback, unk penalty) and WordPiece plus the major normalizers / pre-tokenizers. Anything outside this subset falls back to the tokenizers package.
- EmbeddingBag / head / normalization: a few lines of numpy each (e.g. beta * tanh(alpha * x + bias)).
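To make the safetensors point concrete, here is a minimal reader built on struct + json + numpy. It is a sketch rather than the contents of safetensors_lite.py: it handles only a handful of dtypes, and the small writer exists purely to produce a round-trip demo file. The tensor name `embedding.weight` is a made-up example.

```python
import json
import os
import struct
import tempfile

import numpy as np

# Subset of safetensors dtype names mapped to little-endian numpy dtypes.
DTYPES = {"F32": "<f4", "F16": "<f2", "I64": "<i8", "I32": "<i4", "U8": "|u1"}

def read_safetensors(path):
    """Return {name: ndarray} for a .safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian length
        header = json.loads(f.read(header_len))         # JSON tensor directory
        buf = f.read()                                  # raw byte buffer
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":                      # optional free-form metadata
            continue
        begin, end = info["data_offsets"]               # offsets relative to buf
        arr = np.frombuffer(buf[begin:end], dtype=DTYPES[info["dtype"]])
        tensors[name] = arr.reshape(info["shape"])
    return tensors

def write_safetensors(path, tensors):
    """Tiny float32-only writer, used here just to build a demo file."""
    header, chunks, offset = {}, [], 0
    for name, arr in tensors.items():
        raw = np.ascontiguousarray(arr, dtype="<f4").tobytes()
        header[name] = {"dtype": "F32", "shape": list(arr.shape),
                        "data_offsets": [offset, offset + len(raw)]}
        chunks.append(raw)
        offset += len(raw)
    blob = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)) + blob + b"".join(chunks))

demo = {"embedding.weight": np.arange(6, dtype=np.float32).reshape(3, 2)}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.safetensors")
    write_safetensors(path, demo)
    loaded = read_safetensors(path)
```

Arrays returned by `np.frombuffer` are read-only views over the byte buffer, which suits a lookup table that is never modified.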
2. Making a pure-Python tokenizer compete with parallel Rust
Initially the batch path lost 6× to the Rust tokenizer (rayon across 20 cores). What closed the gap:
- First-char → candidate-piece-length table: the Viterbi inner loop only tries piece lengths that can actually start at the current position, drastically cutting failed dict probes.
- Chunking + memoization (Unigram): the vocab is inspected, and if the only pieces containing a non-leading ▁ are pure ▁ runs, the optimal segmentation provably never crosses a word-end → ▁ boundary. The input is then split into ▁ + word chunks and Viterbi results are cached per chunk (an exact divide-and-conquer, not a greedy approximation). On English or repetitive corpora most chunks become cache hits.
- Per-char memoized normalization + raw-word cache (WordPiece): BertNormalizer's transforms (clean / CJK padding / strip accents / lowercase) all act on one character at a time, so the pipeline collapses into a lazily built char → replacement table. This is what makes CJK text fast, where every character otherwise goes through unicodedata. Because the normalizer is per-char, it commutes with whitespace splitting, so whole raw words can additionally be memoized straight from the input.
- Smaller wins: a single piece → (id, score) dict halves lookups; tuple allocations eliminated from the backtracking arrays.
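The first two ideas can be sketched on a toy vocabulary. This is an illustration only, not the code in hf_tokenizer_lite.py: the vocabulary and scores are invented, there is no unknown-token or byte_fallback handling, and `LENGTHS`, `viterbi`, and `tokenize` are hypothetical names. The toy vocab satisfies the stated condition (the only piece with a non-leading ▁ would be a pure ▁ run), so chunking per word is exact here.

```python
import math
from functools import lru_cache

# Toy Unigram vocabulary: piece -> log-probability score (invented values).
VOCAB = {"▁": -2.0, "▁hello": -3.0, "▁he": -4.0, "llo": -4.5, "▁world": -3.5,
         "▁wor": -5.0, "ld": -4.0, "h": -8.0, "e": -8.0, "l": -8.0, "o": -8.0,
         "w": -8.0, "r": -8.0, "d": -8.0}

# First char -> sorted piece lengths that can start with that char.
LENGTHS = {}
for piece in VOCAB:
    LENGTHS.setdefault(piece[0], set()).add(len(piece))
LENGTHS = {ch: sorted(lens) for ch, lens in LENGTHS.items()}

@lru_cache(maxsize=None)             # per-chunk memoization
def viterbi(chunk: str) -> tuple:
    """Best-scoring segmentation of one chunk under the toy vocab."""
    n = len(chunk)
    best = [-math.inf] * (n + 1)     # best[i]: best score for chunk[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last piece
    best[0] = 0.0
    for start in range(n):
        if best[start] == -math.inf:
            continue
        # Probe only lengths that some piece starting with this char has.
        for length in LENGTHS.get(chunk[start], ()):
            end = start + length
            if end > n:
                break                # lengths are sorted, so the rest overshoot too
            score = VOCAB.get(chunk[start:end])
            if score is not None and best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"cannot segment {chunk!r}")  # toy: no unk handling
    pieces, pos = [], n
    while pos > 0:
        pieces.append(chunk[back[pos]:pos])
        pos = back[pos]
    return tuple(reversed(pieces))

def tokenize(text: str) -> list:
    """Split into '▁' + word chunks and run the cached Viterbi on each chunk."""
    out = []
    for word in text.split():
        out.extend(viterbi("▁" + word))
    return out

tokens = tokenize("hello world hello")
```

The second "hello" never reaches the Viterbi loop: it is served from the chunk cache, which is where repetitive corpora gain most of their speed.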
3. Batch pooling as a BLAS matmul (~3× over reduceat)
The naive version (gather all token embeddings, then np.add.reduceat segment sums) was
the biggest bottleneck in profiles. Since bag-of-embeddings discards order, the math
rearranges into a single sgemm:
count matrix over the batch's unique tokens C (B×U) @ unique embeddings E (U×dim)
This wins twice: (a) BLAS uses all cores, and (b) gathering and 4-bit decoding only touch the U unique tokens instead of every token occurrence. An implicit float64 promotion via integer division was also eliminated (it had been doubling the cost of tanh and the matmul).
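Both formulations can be put side by side in numpy to show that they compute the same pooled means. This is a sketch assuming every text has at least one token (np.add.reduceat does not handle empty segments); `pool_reduceat` and `pool_matmul` are hypothetical names, not the library's functions.

```python
import numpy as np

def pool_reduceat(table, id_lists):
    """Naive path: gather every token occurrence, then segment-sum with reduceat."""
    lengths = np.array([len(ids) for ids in id_lists])
    flat = np.concatenate([np.asarray(ids, dtype=np.int64) for ids in id_lists])
    starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    sums = np.add.reduceat(table[flat], starts, axis=0)
    return sums / lengths[:, None].astype(np.float32)

def pool_matmul(table, id_lists):
    """Count-matrix path: C (B x U) @ E (U x dim), gathering each unique token once."""
    lengths = np.array([len(ids) for ids in id_lists])
    flat = np.concatenate([np.asarray(ids, dtype=np.int64) for ids in id_lists])
    unique, inverse = np.unique(flat, return_inverse=True)
    rows = np.repeat(np.arange(len(id_lists)), lengths)
    counts = np.zeros((len(id_lists), len(unique)), dtype=np.float32)
    np.add.at(counts, (rows, inverse), 1.0)   # C[b, u] = occurrences of token u in text b
    sums = counts @ table[unique]             # one float32 matmul, dispatched to BLAS
    # Divide by a float32 array: dividing by the int lengths would promote to float64.
    return sums / lengths[:, None].astype(np.float32)

rng = np.random.default_rng(0)
table = rng.standard_normal((50, 8)).astype(np.float32)
batch = [[1, 2, 2, 3], [3, 3, 7], [49]]
a = pool_reduceat(table, batch)
b = pool_matmul(table, batch)
```

The count matrix is dense here for simplicity. It stays small as long as the number of unique tokens per batch is modest.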
4. 4-bit table: one-shot LUT decode + two storage modes
- A precomputed 256×2 LUT maps each packed byte (0–255) to its dequantized (hi, lo) float32 pair, so unpacking is a single fancy-index, with no per-row bit twiddling.
- table="q4" (default): keep the table packed (~26 MB RAM), decode only referenced rows.
- table="f32": dequantize everything at load (~200 MB RAM), making lookups a pure gather. Fastest.
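The LUT trick can be sketched as follows. The quantization layout here is an assumption made for the example (high nibble first, code c decoding to (c - 8) × a per-row float32 scale); the actual SSE q4 format may differ. `dequantize_rows` and `quantize` are hypothetical helpers.

```python
import numpy as np

# Assumed layout (hypothetical): each byte holds two 4-bit codes, high nibble
# first; a code c decodes to (c - 8) * scale, with one float32 scale per row.
BYTES = np.arange(256, dtype=np.uint8)
LUT = np.stack([(BYTES >> 4).astype(np.float32) - 8.0,
                (BYTES & 0x0F).astype(np.float32) - 8.0], axis=1)  # shape (256, 2)

def dequantize_rows(packed, scales, row_ids):
    """Decode only the requested rows of a packed (vocab, dim // 2) uint8 table."""
    codes = LUT[packed[row_ids]]                # (n, dim // 2, 2) from one fancy index
    values = codes.reshape(len(row_ids), -1)    # interleave (hi, lo) back to (n, dim)
    return values * scales[row_ids, None]

def quantize(table):
    """Reference packer for the demo, matching the assumed layout above."""
    scales = (np.abs(table).max(axis=1) / 7.0).astype(np.float32)
    codes = np.clip(np.rint(table / scales[:, None]) + 8, 0, 15).astype(np.uint8)
    packed = (codes[:, 0::2] << 4) | codes[:, 1::2]
    return packed, scales

rng = np.random.default_rng(0)
table = rng.standard_normal((16, 8)).astype(np.float32)
packed, scales = quantize(table)
rows = np.array([3, 3, 10])
decoded = dequantize_rows(packed, scales, rows)
```

Under this layout, the "q4" mode corresponds to calling `dequantize_rows` per batch with only the referenced rows, while the "f32" mode corresponds to decoding every row once at load time.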
5. Correctness pinned against the baseline
After every optimization, two checks run against the reference implementation: exact
token-id equality (including edge cases: full-width characters, ZWJ emoji, control
characters, soft hyphens) and max absolute embedding error (≤6e-8 against
sentence-transformers and 1.2e-7 against the model2vec reference, i.e. float32 rounding
only). Edge cases such as empty strings (where tokenizers returns an empty id list) were
caught and matched this way.
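The two checks are simple to express in code. The sketch below uses stand-in data; in a real run the ids and embeddings would come from the runner and from the reference implementation. `compare_outputs` is a hypothetical helper, not part of the repository.

```python
import numpy as np

def compare_outputs(ids_a, ids_b, emb_a, emb_b):
    """Return (number of texts with identical token ids, max absolute embedding diff)."""
    same = sum(list(a) == list(b) for a, b in zip(ids_a, ids_b))
    max_diff = float(np.max(np.abs(np.asarray(emb_a) - np.asarray(emb_b))))
    return same, max_diff

# Stand-in data, including an empty string that yields an empty id list.
ref_ids = [[101, 7592, 102], [], [101, 102]]
run_ids = [[101, 7592, 102], [], [101, 102]]
ref_emb = np.array([[0.6, 0.8], [0.0, 0.0], [1.0, 0.0]], dtype=np.float32)
run_emb = ref_emb + np.float32(3e-8)     # simulate float32 rounding noise

same, max_diff = compare_outputs(ref_ids, run_ids, ref_emb, run_emb)
```

A check like this is cheap enough to rerun after every change, with a tolerance around 1e-6 to separate rounding noise from real regressions.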
License
MIT