Lightweight runner for static-embedding / bag-of-embeddings sentence models (numpy only, CPU, Windows-friendly)

static-embed-runner

A lightweight runner for static-embedding / bag-of-embeddings sentence models, powered by numpy only.

It runs the full pipeline (tokenize → mean pooling → optional head → L2 normalization) in pure Python with a total dependency footprint of about 50 MB (52.5 MB as measured in the benchmarks below), without torch / transformers / sentence-transformers. No native build step is required, so a plain pip install is all you need on any platform numpy supports, including Windows, where compiler toolchains and runtime DLLs are a common source of pain.

Verified models

Loads static-embedding models in tokenizer.json + safetensors form, from a local directory or a Hugging Face Hub repo id. The same verification and benchmark flow has been run successfully on both Windows 11 and WSL Ubuntu 24.04.3 LTS.

Each model below was verified end-to-end on a 224-sentence ja/en corpus plus edge cases, against its reference implementation:

Model	Format	Verification
sentence-transformers/static-retrieval-mrl-en-v1 (English retrieval, official)	StaticEmbedding	token ids 224/224; max emb diff 5.2e-8 vs sentence-transformers
sentence-transformers/static-similarity-mrl-multilingual-v1 (multilingual, official)	StaticEmbedding	token ids 224/224; max emb diff 4.5e-8 (benchmark target below)
minishlab/potion-base-8M and other Model2Vec models	Model2Vec	token ids 224/224; max emb diff 1.2e-7 vs `model2vec` reference
RikkaBotan/quantized-stable-static-embedding-fast-retrieval-mrl-bilingual-ja-en (ja/en bilingual, 4-bit quantized)	SSE q4	token ids 224/224; max emb diff 6e-8 vs sentence-transformers

For Model2Vec models the runner mirrors the reference implementation's semantics of dropping the unknown token before pooling.

The tokenizer is a pure-Python implementation driven by tokenizer.json, covering Unigram (Viterbi, byte_fallback) and WordPiece (including BertNormalizer / BertPreTokenizer). Configurations outside this subset automatically fall back to the tokenizers package when installed.

Installation

pip install static-embed-runner         # numpy is the only dependency
pip install "static-embed-runner[rust]"   # optional: Rust tokenizer backend (quoted so shells such as zsh do not glob the brackets)

Usage

from static_embed_runner import StaticEmbedRunner, similarity

# Accepts a HF Hub repo id or a local directory. For Hub ids, only the files
# the runner needs are downloaded (via urllib) and cached under
# ~/.cache/static-embed-runner.
runner = StaticEmbedRunner.load("minishlab/potion-base-8M")

emb = runner.encode(["こんにちは", "Hello"])         # (2, dim) float32, L2-normalized
emb64 = runner.encode("Hello", truncate_dim=64)      # Matryoshka truncation (MRL models)

sim = similarity(emb, emb)                           # bundled cosine-similarity helper (2, 2)

CLI:

static-embed-runner minishlab/potion-base-8M "Hello world" --bench --out emb.npy

API scope

This library produces raw, L2-normalized embedding vectors (numpy arrays). Similarity computation and search are the caller's responsibility, but since the dot product of L2-normalized vectors equals their cosine similarity, a thin similarity(a, b) helper is bundled (essentially a @ b.T). ANN indexes, storage, and reranking are out of scope; the numpy arrays plug directly into faiss / hnswlib / sqlite-vec and friends.
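
As an illustration of how little is needed on the caller's side, here is a minimal brute-force search sketch in plain numpy. It does not use this library's API: `l2_normalize` and `top_k` are hypothetical helper names, and the random vectors merely stand in for real embeddings.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def top_k(query: np.ndarray, corpus: np.ndarray, k: int):
    """Brute-force search; both inputs are assumed to be L2-normalized."""
    scores = query @ corpus.T                      # (Q, N) cosine similarities
    idx = np.argsort(-scores, axis=1)[:, :k]       # best k columns per query
    return idx, np.take_along_axis(scores, idx, axis=1)

# Stand-in data: 100 random "embeddings" of dimension 32.
rng = np.random.default_rng(0)
corpus = l2_normalize(rng.standard_normal((100, 32)).astype(np.float32))
query = corpus[[3, 42]]                            # two corpus rows used as queries
idx, scores = top_k(query, corpus, k=5)
```

For corpora that no longer fit a dense matmul comfortably, the same arrays can be handed to an ANN index instead.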

Options

table="q4" (default): for 4-bit quantized models, keeps the table packed in memory (~26 MB RAM) and dequantizes only the rows actually looked up.
table="f32": dequantizes the whole table at load time (~200 MB RAM). Fastest lookups.
tokenizer_backend: "lite" selects the pure-Python tokenizer and "rust" uses the tokenizers package. The default is "auto".
encode(..., normalize=False): when you need pre-normalization vectors.

Benchmarks

Target model: sentence-transformers/static-similarity-mrl-multilingual-v1 (official multilingual static embedding model, vocab 105,879 × 1024 dims). Baseline: the same model running on sentence-transformers + torch (CPU). Environment: Windows 11 / i7-14700 (20C/28T) / Python 3.13. Corpus: 224 mixed ja/en sentences (bench/texts.py). The runner's output matches the baseline with identical token ids (224/224) and a max embedding diff of 4.5e-8 (float32 rounding only).

Configuration	Deps size	Single p50	Batch ms/text	Throughput
Baseline: sentence-transformers + torch (CPU)	993.6 MB	0.419 ms	0.0246	40,587 txt/s
runner (numpy only)	52.5 MB	0.020 ms	0.0159	62,848 txt/s
runner + `tokenizers` (optional)	96.4 MB	0.047 ms	0.0216	46,301 txt/s

Dependency size: ~1/19 (52.5 MB vs 993.6 MB)
Single-text latency: ~21× lower (0.020 ms vs 0.419 ms p50)
Batch throughput: ~1.5× higher (62,848 vs 40,587 txt/s)
Load time: 0.28 s vs 7.0 s import+load for the baseline (with warm HF cache)

Cold numbers (fresh process, empty caches): single-text p50 is 0.049 ms, and the first batch call pays a one-time ~0.1 s warm-up (first touch of the 433 MB table + BLAS thread spin-up) before settling at the steady-state numbers above.
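
The headline ratios and the 433 MB table size quoted above can be re-derived from the figures in this section. The snippet below is only arithmetic on those published numbers; the variable names are illustrative.

```python
# Figures copied from the benchmark table and the target-model description.
deps_mb = {"baseline": 993.6, "runner": 52.5}
single_p50_ms = {"baseline": 0.419, "runner": 0.020}
throughput = {"baseline": 40_587, "runner": 62_848}
vocab_size, dim, bytes_per_float32 = 105_879, 1024, 4

dep_ratio = deps_mb["baseline"] / deps_mb["runner"]                    # ~18.9, i.e. ~1/19
latency_ratio = single_p50_ms["baseline"] / single_p50_ms["runner"]    # ~21
throughput_ratio = throughput["runner"] / throughput["baseline"]       # ~1.5
table_bytes = vocab_size * dim * bytes_per_float32                     # float32 table size

print(f"deps: 1/{dep_ratio:.1f}, latency: {latency_ratio:.1f}x, "
      f"throughput: {throughput_ratio:.2f}x, table: {table_bytes / 1e6:.1f} MB")
```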

4-bit quantized models (SSE format) show the same trend, with the table held packed at ~26 MB RAM in table="q4" mode.

With the word-level cache, the pure-Python tokenizer wins batch throughput on typical corpora too; the [rust] backend (+44 MB) mainly pays off for bulk indexing of low-redundancy text and for tokenizer.json configs outside the built-in subset.

All speedups come from algorithms and BLAS; no OS-specific optimizations are used.

Running the benchmark locally

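Model weights are not included in this repository. The real-model smoke tests look for ./model by default and are skipped when it is missing. To run them without manually preparing ./model, pass a Hugging Face repo id or a local model directory with RUNNER_MODEL. The commands below use POSIX shell syntax; in a Windows venv the interpreter lives at .venv\Scripts\python and the environment variable has to be set separately.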

python -m venv .venv
.venv/bin/python -m pip install -e . pytest
RUNNER_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
  .venv/bin/python -m pytest -q
RUNNER_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
  .venv/bin/python bench/bench_runner.py

bench_runner.py downloads only the files this runner needs and writes reproducible artifacts under results/. Its result name includes the detected model format, actual table storage, and tokenizer backend, for example runner_format=static-embedding_table=f32_tok=lite or runner_format=sse-q4_table=q4_tok=lite.

The sentence-transformers baseline uses a separate environment so its large dependency tree does not affect the runner dependency-size measurement:

python -m venv .venv-baseline
.venv-baseline/bin/python -m pip install sentence-transformers
BASELINE_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
  .venv-baseline/bin/python bench/bench_baseline.py

Implementation notes

The pipeline is tokenize → mean pooling → head (if any) → MRL truncation → L2 normalize. Everything runs on CPU; no GPU is used (static embeddings are just table lookups plus a mean, so transfer overhead would dwarf the compute on a GPU; the baseline was also run with device="cpu" for fairness).
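
To make the stages after tokenization concrete, here is a minimal numpy sketch of mean pooling, MRL truncation, and L2 normalization over already-tokenized input. It is not the library's implementation: `encode_ids` is a hypothetical name, the optional head is omitted, and returning a zero vector for an empty id list is an assumption of this sketch.

```python
import numpy as np

def encode_ids(table, batch_ids, truncate_dim=None, normalize=True):
    """Mean-pool table rows per text, optionally truncate and L2-normalize."""
    out = np.zeros((len(batch_ids), table.shape[1]), dtype=np.float32)
    for i, ids in enumerate(batch_ids):
        if ids:                                   # empty input stays a zero vector (assumption)
            out[i] = table[ids].mean(axis=0)      # bag-of-embeddings: order is discarded
    if truncate_dim is not None:
        out = out[:, :truncate_dim]               # Matryoshka truncation keeps leading dims
    if normalize:
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        out = out / np.maximum(norms, 1e-12)      # guard against division by zero
    return out

# Toy 4-token vocabulary with 3-dimensional embeddings.
table = np.arange(12, dtype=np.float32).reshape(4, 3) + 1.0
emb = encode_ids(table, [[0, 1], [2], []], truncate_dim=2)
```

Truncation happens before normalization, matching the stage order listed above, so truncated vectors are still unit length.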

1. numpy as the only dependency (993.6 MB → 52.5 MB)

A static embedding model is really just "an embedding table plus three lines of math", yet the baseline drags in torch (518 MB), scipy, transformers, and more. So everything around the table is reimplemented from scratch:

safetensors reader (safetensors_lite.py, ~40 lines): the format is just "8-byte header length + JSON metadata + raw buffer", readable with struct + json + numpy.
Tokenizer (hf_tokenizer_lite.py): pure-Python, driven by tokenizer.json. Supports Unigram (Viterbi, byte_fallback, unk penalty) and WordPiece plus the major normalizers / pre-tokenizers. Anything outside this subset falls back to the tokenizers package.
EmbeddingBag / head / normalization: a few lines of numpy each (e.g. beta * tanh(alpha * x + bias)).
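
The safetensors layout described above is simple enough to sketch in a few lines. The following is an illustrative reader, not the contents of safetensors_lite.py: the function names and the dtype subset are assumptions, and `write_f32` exists only so the example can round-trip a file.

```python
import json
import os
import struct
import tempfile

import numpy as np

# Subset of safetensors dtype tags mapped to little-endian numpy dtypes.
_DTYPES = {"F64": "<f8", "F32": "<f4", "F16": "<f2", "I64": "<i8", "I32": "<i4", "U8": "u1"}

def read_safetensors(path):
    """Return {name: array} from an 8-byte length + JSON header + raw buffer file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))   # little-endian uint64
        header = json.loads(f.read(header_len))
        data = f.read()                                  # raw tensor bytes
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":                       # optional free-form metadata
            continue
        begin, end = info["data_offsets"]                # offsets relative to the buffer start
        arr = np.frombuffer(data[begin:end], dtype=_DTYPES[info["dtype"]])
        tensors[name] = arr.reshape(info["shape"])
    return tensors

def write_f32(path, name, array):
    """Write a single float32 tensor (helper for the demo only)."""
    raw = np.ascontiguousarray(array, dtype="<f4").tobytes()
    meta = {name: {"dtype": "F32", "shape": list(array.shape), "data_offsets": [0, len(raw)]}}
    header = json.dumps(meta).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)) + header + raw)

table = np.arange(12, dtype=np.float32).reshape(4, 3)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.safetensors")
    write_f32(path, "embeddings", table)
    loaded = read_safetensors(path)
```

Arrays returned by `np.frombuffer` are read-only views over the bytes object, which is fine for an embedding table that is only ever indexed.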

2. Making a pure-Python tokenizer compete with parallel Rust

Initially the batch path lost 6× to the Rust tokenizer (rayon across 20 cores). What closed the gap:

First-char → candidate-piece-length table: the Viterbi inner loop only tries piece lengths that can actually start at the current position, drastically cutting failed dict probes.
Chunking + memoization (Unigram): the vocab is inspected, and if the only pieces containing a non-leading ▁ are pure ▁ runs, the optimal segmentation provably never crosses a word-end → ▁ boundary. The input is then split into ▁+word chunks and Viterbi results are cached per chunk (an exact divide-and-conquer, not a greedy approximation). On English or repetitive corpora most chunks become cache hits.
Per-char memoized normalization + raw-word cache (WordPiece): BertNormalizer's transforms (clean / CJK padding / strip accents / lowercase) all act on one character at a time, so the pipeline collapses into a lazily built char → replacement table — this is what makes CJK text fast, where every character otherwise goes through unicodedata. Because the normalizer is per-char, it commutes with whitespace splitting, so whole raw words can additionally be memoized straight from the input.
Smaller wins: a single piece → (id, score) dict halves lookups; tuple allocations eliminated from the backtracking arrays.
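
The first two ideas can be sketched on a toy vocabulary. This is a simplified illustration, not the code in hf_tokenizer_lite.py: the vocabulary, scores, unknown-character handling, and function names are all made up for the example, and the chunked path assumes the vocab condition described above already holds.

```python
from collections import defaultdict

UNK_ID = 0
UNK_SCORE = -20.0  # hypothetical penalty for an out-of-vocabulary character

def build_length_table(vocab):
    """Map each first character to the sorted piece lengths that can start with it."""
    table = defaultdict(set)
    for piece in vocab:
        table[piece[0]].add(len(piece))
    return {ch: sorted(lengths) for ch, lengths in table.items()}

def viterbi(text, vocab, length_table):
    """Best-scoring segmentation; only lengths valid for the current first char are probed."""
    n = len(text)
    best = [float("-inf")] * (n + 1)
    back_pos, back_id = [0] * (n + 1), [UNK_ID] * (n + 1)
    best[0] = 0.0
    for start in range(n):
        base, single_char_known = best[start], False
        for length in length_table.get(text[start], ()):
            end = start + length
            if end > n:
                break                                   # lengths are sorted ascending
            hit = vocab.get(text[start:end])
            if hit is None:
                continue
            piece_id, score = hit                       # one dict holds both id and score
            if length == 1:
                single_char_known = True
            if base + score > best[end]:
                best[end], back_pos[end], back_id[end] = base + score, start, piece_id
        if not single_char_known and base + UNK_SCORE > best[start + 1]:
            best[start + 1], back_pos[start + 1], back_id[start + 1] = base + UNK_SCORE, start, UNK_ID
    ids, pos = [], n
    while pos > 0:                                      # backtrack from the end
        ids.append(back_id[pos])
        pos = back_pos[pos]
    return ids[::-1]

def tokenize_chunked(text, vocab, length_table, cache):
    """Split into ▁+word chunks and memoize the Viterbi result per chunk."""
    ids = []
    for word in text.split():
        chunk = "▁" + word
        if chunk not in cache:
            cache[chunk] = viterbi(chunk, vocab, length_table)
        ids.extend(cache[chunk])
    return ids

vocab = {"▁": (1, -3.0), "▁hello": (2, -4.0), "▁hel": (3, -3.5), "lo": (4, -2.5),
         "▁world": (5, -4.5), "wor": (6, -3.0), "ld": (7, -3.0)}
length_table = build_length_table(vocab)
cache = {}
chunked = tokenize_chunked("hello world hello", vocab, length_table, cache)
whole = viterbi("▁hello▁world▁hello", vocab, length_table)
```

On this input the chunked and whole-string results agree, and the repeated word is served from the cache on its second occurrence.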

3. Batch pooling as a BLAS matmul (~3× over reduceat)

The naive version (gather all token embeddings, then np.add.reduceat segment sums) was the biggest bottleneck in profiles. Since bag-of-embeddings discards order, the math rearranges into a single sgemm:

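sums (B×dim) = C (B×U) @ E (U×dim), where C is the count matrix over the batch's U unique tokens and E holds their embeddings; dividing each row by its token count then gives the mean.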

This wins twice: (a) BLAS uses all cores, and (b) gathering and 4-bit decoding only touch the U unique tokens instead of every token occurrence. An implicit float64 promotion via integer division was also eliminated (it had been doubling the cost of tanh and the matmul).
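
A small numpy sketch of the rearrangement, comparing the reduceat formulation with the count-matrix matmul. The function names are hypothetical, every text is assumed to have at least one token, and the dense count matrix is built with `np.add.at` for clarity rather than speed.

```python
import numpy as np

def pool_reduceat(table, batch_ids):
    """Naive version: gather every token occurrence, then segment-sum."""
    lens = np.array([len(ids) for ids in batch_ids])
    flat = np.concatenate([np.asarray(ids, dtype=np.int64) for ids in batch_ids])
    starts = np.concatenate(([0], np.cumsum(lens)[:-1]))
    sums = np.add.reduceat(table[flat], starts, axis=0)
    # Cast the counts to float32: dividing by an int64 array would promote to float64.
    return sums / lens[:, None].astype(np.float32)

def pool_matmul(table, batch_ids):
    """Count matrix C (B x U) @ unique embeddings E (U x dim), then divide by counts."""
    lens = np.array([len(ids) for ids in batch_ids])
    flat = np.concatenate([np.asarray(ids, dtype=np.int64) for ids in batch_ids])
    rows = np.repeat(np.arange(len(batch_ids)), lens)        # text index per occurrence
    uniq, inverse = np.unique(flat, return_inverse=True)
    counts = np.zeros((len(batch_ids), len(uniq)), dtype=np.float32)
    np.add.at(counts, (rows, inverse.reshape(-1)), 1.0)      # accumulate repeated tokens
    sums = counts @ table[uniq]                              # single sgemm; gathers U rows only
    return sums / lens[:, None].astype(np.float32)

rng = np.random.default_rng(0)
table = rng.standard_normal((50, 8)).astype(np.float32)
batch_ids = [[1, 2, 2, 7], [7], [3, 3, 3, 1, 0]]
a = pool_reduceat(table, batch_ids)
b = pool_matmul(table, batch_ids)
```

Both functions return float32 and agree up to rounding; the matmul variant touches each unique token's row once regardless of how often it repeats in the batch.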

4. 4-bit table: one-shot LUT decode + two storage modes

A precomputed 256×2 LUT maps each packed byte (0–255) to its dequantized (hi, lo) float32 pair, so unpacking is a single fancy-index — no per-row bit twiddling.
table="q4" (default): keep the table packed (~26 MB RAM), decode only referenced rows.
table="f32": dequantize everything at load (~200 MB RAM), making lookups a pure gather. Fastest.
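
The LUT idea can be sketched as follows. The quantization scheme here (high nibble first, value = (nibble - 8) × per-row scale) is an assumption chosen for illustration and is not necessarily the SSE q4 layout; `pack_q4` and `decode_rows` are hypothetical names.

```python
import numpy as np

# 256 x 2 lookup table: byte -> (high nibble - 8, low nibble - 8) as float32.
_BYTES = np.arange(256, dtype=np.uint8)
LUT = np.stack(
    [(_BYTES >> 4).astype(np.float32) - 8.0, (_BYTES & 0x0F).astype(np.float32) - 8.0],
    axis=1,
)

def pack_q4(nibbles):
    """Pack (rows, dim) uint8 values in 0..15 into (rows, dim // 2) bytes; dim must be even."""
    return (nibbles[:, 0::2] << 4) | nibbles[:, 1::2]

def decode_rows(packed, scales, row_ids):
    """Dequantize only the requested rows with one fancy-index into the LUT."""
    row_ids = np.asarray(row_ids)
    pairs = LUT[packed[row_ids]]                   # (n, dim // 2, 2)
    values = pairs.reshape(len(row_ids), -1)       # interleave (hi, lo) back to (n, dim)
    return values * scales[row_ids, None]          # hypothetical per-row scale

rng = np.random.default_rng(0)
nibbles = rng.integers(0, 16, size=(6, 8), dtype=np.uint8)
scales = rng.uniform(0.01, 0.1, size=6).astype(np.float32)
packed = pack_q4(nibbles)                          # table="q4"-style packed storage
full = (nibbles.astype(np.float32) - 8.0) * scales[:, None]   # table="f32"-style full decode
rows = decode_rows(packed, scales, [4, 1, 1])
```

The packed array is what a q4-style mode would keep resident, while `full` corresponds to dequantizing everything up front so lookups become a plain gather.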

5. Correctness pinned against the baseline

After every optimization, two checks run against the reference implementation: exact token-id equality (including edge cases: full-width characters, ZWJ emoji, control characters, soft hyphens) and max absolute embedding error (at most 1.2e-7 across the models listed under Verified models, and at most 6e-8 for the sentence-transformers formats, i.e. float32 rounding only). Edge cases like empty strings (tokenizers returns an empty id list) were caught and matched this way.
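
The two checks amount to very little code. The sketch below shows one way to compute them; `parity_report` is a hypothetical helper, and the inputs are synthetic stand-ins rather than real tokenizer or model output.

```python
import numpy as np

def parity_report(ref_ids, new_ids, ref_emb, new_emb):
    """Return (texts with identical token ids, total texts, max abs embedding diff)."""
    matches = sum(list(a) == list(b) for a, b in zip(ref_ids, new_ids))
    diff = np.asarray(ref_emb, dtype=np.float64) - np.asarray(new_emb, dtype=np.float64)
    return matches, len(ref_ids), float(np.max(np.abs(diff)))

# Synthetic stand-ins: identical ids (including an empty text) and a tiny perturbation.
ref_ids = [[1, 2], [3], []]
new_ids = [[1, 2], [3], []]
ref_emb = np.eye(3, dtype=np.float64)
new_emb = ref_emb + 4e-8
matches, total, max_diff = parity_report(ref_ids, new_ids, ref_emb, new_emb)
```

Running such a report after each optimization turns any tokenization or numeric drift into an immediate failure.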

License

MIT

This version

0.1.0

Jun 12, 2026

Download files

File	Type	Size	Uploaded
static_embed_runner-0.1.0.tar.gz	Source distribution (tag: Source)	33.9 kB	Jun 12, 2026
static_embed_runner-0.1.0-py3-none-any.whl	Built distribution (tag: Python 3)	19.9 kB	Jun 12, 2026

Both files were uploaded via twine/6.2.0 CPython/3.12.3, without Trusted Publishing.

Hashes for static_embed_runner-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6ffcfc3c203c3b491711247e2f52b550f9dbe34246c5c4dad335e9bda07e7e9a`
MD5	`24fd6a57224c25c39d395be01c6cd43c`
BLAKE2b-256	`1790c13b779e06a724afc0b52de11a58e78f53b75e43136325fec8b4e55a6c2d`

Hashes for static_embed_runner-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ab116a1931b2e7c28745086557a6e67910419ab1befd93a2bd7c3ae70f8ce90e`
MD5	`9f908970b86f45026ddb45b807d9d0e4`
BLAKE2b-256	`7321c40f11f8a999fffa5ce64c79cbd06fcaf111d51c051db152cb8fdae27ba8`
