Lightweight runner for static-embedding / bag-of-embeddings sentence models (numpy only, CPU, Windows-friendly)

static-embed-runner

A lightweight runner for static-embedding / bag-of-embeddings sentence models, powered by numpy only.

It runs the full pipeline — tokenize → mean pooling → (optional head) → L2 normalization — in pure Python with a total dependency footprint of ~50 MB, without torch / transformers / sentence-transformers. No native build step is required, so it runs anywhere (including Windows, where compiler toolchains and runtime DLLs are a common source of pain — here pip install is all you need).

Verified models

Loads static-embedding models in tokenizer.json + safetensors form, from a local directory or a Hugging Face Hub repo id. Verified end-to-end on a 224-sentence ja/en corpus plus edge cases, against each model's reference implementation:

| Model | Format | Verification |
| --- | --- | --- |
| sentence-transformers/static-retrieval-mrl-en-v1 (English retrieval, official) | StaticEmbedding | token ids 224/224; max emb diff 5.2e-8 vs sentence-transformers |
| sentence-transformers/static-similarity-mrl-multilingual-v1 (multilingual, official) | StaticEmbedding | token ids 224/224; max emb diff 4.5e-8 (benchmark target below) |
| minishlab/potion-base-8M and other Model2Vec models | Model2Vec | token ids 224/224; max emb diff 1.2e-7 vs model2vec reference |
| RikkaBotan/quantized-stable-static-embedding-fast-retrieval-mrl-bilingual-ja-en (ja/en bilingual, 4-bit quantized) | SSE q4 | token ids 224/224; max emb diff 6e-8 vs sentence-transformers |

The same verification and benchmark flow has been run successfully on both Windows 11 and WSL Ubuntu 24.04.3 LTS.

For Model2Vec models the runner mirrors the reference implementation's semantics of dropping the unknown token before pooling.
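
The effect of dropping the unknown token can be illustrated with a toy table. This is a minimal sketch, not the runner's code: the function name mean_pool_drop_unk, the unk_id argument, and the zero-vector result for an all-unknown input are assumptions made for the example.

```python
import numpy as np

def mean_pool_drop_unk(token_ids, table, unk_id):
    """Mean-pool rows of `table`, ignoring every occurrence of `unk_id`."""
    kept = [t for t in token_ids if t != unk_id]
    if not kept:
        # Nothing left to pool: return zeros (an assumption of this sketch).
        return np.zeros(table.shape[1], dtype=table.dtype)
    return table[kept].mean(axis=0)

# Toy 3-token table; id 0 plays the role of the unknown token.
table = np.array([[9.0, 9.0],
                  [1.0, 0.0],
                  [0.0, 1.0]], dtype=np.float32)

pooled = mean_pool_drop_unk([1, 0, 2], table, unk_id=0)
```

With the unknown row excluded, the mean is taken over ids 1 and 2 only, so the large values in row 0 do not leak into the sentence vector.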

The tokenizer is a pure-Python implementation driven by tokenizer.json, covering Unigram (Viterbi, byte_fallback) and WordPiece (including BertNormalizer / BertPreTokenizer). Configurations outside this subset automatically fall back to the tokenizers package when installed.

Installation

pip install static-embed-runner           # numpy is the only dependency
pip install "static-embed-runner[rust]"   # optional: Rust tokenizer backend (quoted so shells such as zsh do not glob the brackets)

Usage

from static_embed_runner import StaticEmbedRunner, similarity

# Accepts a HF Hub repo id or a local directory. For Hub ids, only the files
# the runner needs are downloaded (via urllib) and cached under
# ~/.cache/static-embed-runner.
runner = StaticEmbedRunner.load("minishlab/potion-base-8M")

emb = runner.encode(["こんにちは", "Hello"])         # (2, dim) float32, L2-normalized
emb64 = runner.encode("Hello", truncate_dim=64)      # Matryoshka truncation (MRL models)

sim = similarity(emb, emb)                           # bundled cosine-similarity helper (2, 2)

CLI:

static-embed-runner minishlab/potion-base-8M "Hello world" --bench --out emb.npy

API scope

What this library produces is raw, L2-normalized embedding vectors (numpy arrays). Similarity computation and search are fundamentally the caller's responsibility, but since dot product = cosine for normalized vectors, a thin similarity(a, b) helper is bundled (essentially a @ b.T). ANN indexes, storage, and reranking are out of scope — the numpy arrays plug directly into faiss / hnswlib / sqlite-vec and friends.
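
Because the vectors are L2-normalized, a brute-force search needs nothing beyond a matrix product. The sketch below uses random vectors in place of real embeddings; l2_normalize, cosine_matrix, and top_k are hypothetical helper names for illustration, not part of the library's API.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def cosine_matrix(a, b):
    """Cosine similarity of every row of a against every row of b.
    Valid only when both inputs are already L2-normalized."""
    return a @ b.T

def top_k(query, corpus, k):
    """Indices and scores of the k corpus rows most similar to query."""
    scores = corpus @ query
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Random stand-ins for embeddings returned by an encoder.
rng = np.random.default_rng(0)
corpus = l2_normalize(rng.standard_normal((5, 8)).astype(np.float32))
query = corpus[3]

idx, scores = top_k(query, corpus, k=2)   # the best hit is row 3 itself
```

For corpora too large for an exhaustive scan, the same arrays can be handed to an ANN index instead.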

Options

  • table="q4" (default): for 4-bit quantized models, keeps the table packed in memory (~26 MB RAM) and dequantizes only the rows actually looked up.
  • table="f32": dequantizes the whole table at load time (~200 MB RAM). Fastest lookups.
  • tokenizer_backend="lite": pure-Python tokenizer. "rust" uses the tokenizers package. The default is auto.
  • encode(..., normalize=False): when you need pre-normalization vectors.

Benchmarks

  • Target model: sentence-transformers/static-similarity-mrl-multilingual-v1 (official multilingual static embedding model, vocab 105,879 × 1024 dims).
  • Baseline: the same model running on sentence-transformers + torch (CPU).
  • Environment: Windows 11 / i7-14700 (20C/28T) / Python 3.13.
  • Corpus: 224 mixed ja/en sentences (bench/texts.py).

The runner's output matches the baseline with identical token ids (224/224) and a max embedding diff of 4.5e-8 (float32 rounding only).

| Configuration | Deps size | Single p50 | Batch ms/text | Throughput |
| --- | --- | --- | --- | --- |
| Baseline: sentence-transformers + torch (CPU) | 993.6 MB | 0.419 ms | 0.0246 | 40,587 txt/s |
| runner (numpy only) | 52.5 MB | 0.020 ms | 0.0159 | 62,848 txt/s |
| runner + tokenizers (optional) | 96.4 MB | 0.047 ms | 0.0216 | 46,301 txt/s |

  • Dependency size: ~1/19 (52.5 MB vs 993.6 MB)
  • Single-text latency: ~21× faster
  • Batch throughput: 1.5×
  • Load time: 0.28 s vs 7.0 s import+load for the baseline (with warm HF cache)

Cold numbers (fresh process, empty caches): single-text p50 is 0.049 ms, and the first batch call pays a one-time ~0.1 s warm-up (first touch of the 433 MB table + BLAS thread spin-up) before settling at the steady-state numbers above.
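
The headline ratios and the 433 MB table size follow directly from numbers quoted in this section. The short calculation below only re-derives them; it measures nothing, and the variable names are chosen for the example.

```python
# Figures copied from the benchmark section above.
baseline_deps_mb, runner_deps_mb = 993.6, 52.5
baseline_p50_ms, runner_p50_ms = 0.419, 0.020
baseline_tps, runner_tps = 40_587, 62_848
vocab, dim, bytes_per_float32 = 105_879, 1024, 4

deps_ratio = baseline_deps_mb / runner_deps_mb      # about 19, i.e. ~1/19 the size
latency_ratio = baseline_p50_ms / runner_p50_ms     # about 21x
throughput_ratio = runner_tps / baseline_tps        # about 1.5x
table_mb = vocab * dim * bytes_per_float32 / 1e6    # float32 table, decimal MB

print(f"{deps_ratio:.1f} {latency_ratio:.1f} {throughput_ratio:.2f} {table_mb:.1f}")
```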

4-bit quantized models (SSE format) show the same trend, with the table held packed at ~26 MB RAM in table="q4" mode.

With the word-level cache, the pure-Python tokenizer wins batch throughput on typical corpora too; the [rust] backend (+44 MB) mainly pays off for bulk indexing of low-redundancy text and for tokenizer.json configs outside the built-in subset.

All speedups come from algorithms and BLAS; no OS-specific optimizations are used.

Running the benchmark locally

Model weights are not included in this repository. The real-model smoke tests look for ./model by default and are skipped when it is missing. To run them without manually preparing ./model, pass a Hugging Face repo id or a local model directory with RUNNER_MODEL:

python -m venv .venv
.venv/bin/python -m pip install -e . pytest
RUNNER_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
  .venv/bin/python -m pytest -q
RUNNER_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
  .venv/bin/python bench/bench_runner.py

bench_runner.py downloads only the files this runner needs and writes reproducible artifacts under results/. Its result name includes the detected model format, actual table storage, and tokenizer backend, for example runner_format=static-embedding_table=f32_tok=lite or runner_format=sse-q4_table=q4_tok=lite.

The sentence-transformers baseline uses a separate environment so its large dependency tree does not affect the runner dependency-size measurement:

python -m venv .venv-baseline
.venv-baseline/bin/python -m pip install sentence-transformers
BASELINE_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
  .venv-baseline/bin/python bench/bench_baseline.py

Implementation notes

The pipeline is tokenize → mean pooling → head (if any) → MRL truncation → L2 normalize. Everything runs on CPU; no GPU is used (static embeddings are just table lookups plus a mean, so transfer overhead would dwarf the compute on a GPU; the baseline was also run with device="cpu" for fairness).
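
As a rough illustration of that order of operations, the sketch below pools, truncates, and normalizes a batch of token ids. It is a simplified stand-in rather than the runner's implementation: the tokenizer and the optional head are left out, encode_sketch is a hypothetical name, and returning a zero row for an empty id list is an assumption.

```python
import numpy as np

def encode_sketch(batch_ids, table, truncate_dim=None, normalize=True):
    """Mean pooling -> MRL truncation -> L2 normalization over token-id lists."""
    out = np.zeros((len(batch_ids), table.shape[1]), dtype=np.float32)
    for i, ids in enumerate(batch_ids):
        if ids:                                  # empty input stays all-zero
            out[i] = table[ids].mean(axis=0)     # mean pooling
    if truncate_dim is not None:
        out = out[:, :truncate_dim]              # keep the leading dimensions
    if normalize:
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        out = out / np.maximum(norms, 1e-12)     # L2 normalize, avoid 0/0
    return out

# Toy embedding table: 6 tokens, 8 dimensions.
rng = np.random.default_rng(0)
table = rng.standard_normal((6, 8)).astype(np.float32)

emb = encode_sketch([[1, 2, 2], [5]], table, truncate_dim=4)   # shape (2, 4)
```

Truncation happens before normalization, so the shortened vectors are still unit length and the dot product remains a cosine.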

1. numpy as the only dependency (993.6 MB → 52.5 MB)

A static embedding model is really just "an embedding table plus three lines of math", yet the baseline drags in torch (518 MB), scipy, transformers, and more. So everything around the table is reimplemented from scratch:

  • safetensors reader (safetensors_lite.py, ~40 lines): the format is just "8-byte header length + JSON metadata + raw buffer", readable with struct + json + numpy.
  • Tokenizer (hf_tokenizer_lite.py): pure-Python, driven by tokenizer.json. Supports Unigram (Viterbi, byte_fallback, unk penalty) and WordPiece plus the major normalizers / pre-tokenizers. Anything outside this subset falls back to the tokenizers package.
  • EmbeddingBag / head / normalization: a few lines of numpy each (e.g. beta * tanh(alpha * x + bias)).
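
To show how little the safetensors layout demands, here is a minimal reader sketch built on struct, json, and numpy. It is not safetensors_lite.py: the function names are hypothetical, only three dtypes are mapped, and the small writer exists only to produce an input for the example.

```python
import json
import struct
import numpy as np

# Subset of safetensors dtype tags handled by this sketch (little-endian).
_DTYPES = {"F32": "<f4", "F16": "<f2", "U8": "u1"}

def read_safetensors(raw):
    """Parse safetensors bytes into {name: numpy array}."""
    (header_len,) = struct.unpack("<Q", raw[:8])       # 8-byte header length
    header = json.loads(raw[8:8 + header_len])         # JSON metadata
    data = memoryview(raw)[8 + header_len:]            # raw tensor buffer
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":                     # free-form metadata entry
            continue
        begin, end = info["data_offsets"]              # relative to the buffer start
        arr = np.frombuffer(data[begin:end], dtype=_DTYPES[info["dtype"]])
        tensors[name] = arr.reshape(info["shape"])
    return tensors

def write_safetensors_f32(tensors):
    """Tiny float32-only writer, used here just to build a test input."""
    header, chunks, offset = {}, [], 0
    for name, arr in tensors.items():
        blob = np.ascontiguousarray(arr, dtype="<f4").tobytes()
        header[name] = {"dtype": "F32", "shape": list(arr.shape),
                        "data_offsets": [offset, offset + len(blob)]}
        chunks.append(blob)
        offset += len(blob)
    head = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(head)) + head + b"".join(chunks)

weights = np.arange(6, dtype=np.float32).reshape(2, 3)
loaded = read_safetensors(write_safetensors_f32({"embedding.weight": weights}))
```

np.frombuffer returns a read-only view over the file bytes, so no copy of the table is made while parsing.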

2. Making a pure-Python tokenizer compete with parallel Rust

Initially the batch path was 6× slower than the Rust tokenizer (rayon across 20 cores). What closed the gap:

  • First-char → candidate-piece-length table: the Viterbi inner loop only tries piece lengths that can actually start at the current position, drastically cutting failed dict probes.
  • Chunking + memoization (Unigram): the vocab is inspected, and if the only pieces containing a non-leading ▁ are pure ▁ runs, the optimal segmentation provably never crosses a word-end → ▁ boundary. The input is then split into ▁+word chunks and Viterbi results are cached per chunk (an exact divide-and-conquer, not a greedy approximation). On English or repetitive corpora most chunks become cache hits.
  • Per-char memoized normalization + raw-word cache (WordPiece): BertNormalizer's transforms (clean / CJK padding / strip accents / lowercase) all act on one character at a time, so the pipeline collapses into a lazily built char → replacement table — this is what makes CJK text fast, where every character otherwise goes through unicodedata. Because the normalizer is per-char, it commutes with whitespace splitting, so whole raw words can additionally be memoized straight from the input.
  • Smaller wins: a single piece → (id, score) dict halves lookups; tuple allocations eliminated from the backtracking arrays.
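
The first two ideas in the list can be sketched on a toy Unigram vocabulary. Everything below is illustrative rather than taken from hf_tokenizer_lite.py: the vocabulary and scores are invented, the names are hypothetical, and byte_fallback and unknown-token handling are omitted. The toy vocab satisfies the stated condition (no piece contains a non-leading ▁), so per-word chunking is exact here.

```python
import math
from functools import lru_cache

# Toy Unigram vocab: piece -> log-probability score (invented values).
VOCAB = {"▁": -4.0, "▁he": -3.0, "▁hello": -2.0, "llo": -3.5,
         "▁wor": -3.0, "ld": -3.0, "▁world": -6.5,
         "h": -5.0, "e": -5.0, "l": -5.0, "o": -5.0,
         "w": -5.0, "r": -5.0, "d": -5.0}

# First char -> sorted piece lengths that can start with that char.
_lengths = {}
for _piece in VOCAB:
    _lengths.setdefault(_piece[0], set()).add(len(_piece))
LENGTHS = {ch: sorted(ls) for ch, ls in _lengths.items()}

@lru_cache(maxsize=None)
def viterbi_chunk(chunk):
    """Best-scoring segmentation of one ▁+word chunk, memoized per chunk."""
    n = len(chunk)
    best = [-math.inf] * (n + 1)   # best score of a segmentation ending at i
    back = [0] * (n + 1)           # start index of the last piece ending at i
    best[0] = 0.0
    for start in range(n):
        if best[start] == -math.inf:
            continue
        # Only try lengths that some piece starting with this char has.
        for length in LENGTHS.get(chunk[start], ()):
            end = start + length
            if end > n:
                break
            score = VOCAB.get(chunk[start:end])
            if score is not None and best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"no segmentation for {chunk!r} in this toy vocab")
    pieces, pos = [], n
    while pos > 0:
        pieces.append(chunk[back[pos]:pos])
        pos = back[pos]
    return tuple(reversed(pieces))

def tokenize(text):
    """Split on whitespace, prefix each word with ▁, segment each chunk."""
    pieces = []
    for word in text.split():
        pieces.extend(viterbi_chunk("▁" + word))
    return pieces

tokens = tokenize("hello world hello")
```

The second "hello" never reaches the Viterbi loop: it is answered from the per-chunk cache, which is where repetitive corpora gain most of their speed.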

3. Batch pooling as a BLAS matmul (~3× over reduceat)

The naive version (gather all token embeddings, then np.add.reduceat segment sums) was the biggest bottleneck in profiles. Since bag-of-embeddings discards order, the math rearranges into a single sgemm:

pooled sums (B×dim) = C (B×U) @ E (U×dim), where C is the count matrix over the batch's unique tokens and E holds their embeddings

This wins twice: (a) BLAS uses all cores, and (b) gathering and 4-bit decoding only touch the U unique tokens instead of every token occurrence. An implicit float64 promotion via integer division was also eliminated (it had been doubling the cost of tanh and the matmul).
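
The equivalence of the two formulations is easy to check on a toy batch. The sketch below is an illustration under assumptions, not the runner's code: the function names are hypothetical, every sentence is assumed to contain at least one token, and the count matrix is built densely with np.add.at for clarity.

```python
import numpy as np

def pool_reduceat(batch_ids, table):
    """Naive path: gather every token occurrence, then segment sums."""
    flat = np.concatenate(batch_ids)
    lengths = np.array([len(ids) for ids in batch_ids])
    offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    sums = np.add.reduceat(table[flat], offsets, axis=0)
    # Cast to float32 first: dividing by an int64 array would promote to float64.
    return sums / lengths[:, None].astype(np.float32)

def pool_matmul(batch_ids, table):
    """Count-matrix path: C (B x U) @ E (U x dim), touching unique tokens only."""
    flat = np.concatenate(batch_ids)
    lengths = np.array([len(ids) for ids in batch_ids])
    unique, inverse = np.unique(flat, return_inverse=True)
    rows = np.repeat(np.arange(len(batch_ids)), lengths)
    counts = np.zeros((len(batch_ids), len(unique)), dtype=np.float32)
    np.add.at(counts, (rows, inverse), 1.0)      # occurrences per (sentence, token)
    sums = counts @ table[unique]                # single matrix product
    return sums / lengths[:, None].astype(np.float32)

rng = np.random.default_rng(0)
table = rng.standard_normal((5, 4)).astype(np.float32)
batch = [[1, 2, 2], [3], [2, 3, 3, 3]]

a = pool_reduceat(batch, table)
b = pool_matmul(batch, table)
```

Here the batch has 8 token occurrences but only 3 unique ids, so the matmul path gathers 3 rows instead of 8.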

4. 4-bit table: one-shot LUT decode + two storage modes

  • A precomputed 256×2 LUT maps each packed byte (0–255) to its dequantized (hi, lo) float32 pair, so unpacking is a single fancy-index — no per-row bit twiddling.
  • table="q4" (default): keep the table packed (~26 MB RAM), decode only referenced rows.
  • table="f32": dequantize everything at load (~200 MB RAM), making lookups a pure gather. Fastest.
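
A LUT-based decode of this kind might look as follows. The quantization layout is an assumption made for the sketch, not the SSE format's specification: two 4-bit codes per byte with the high nibble first, each code c decoding to c − 8, and one float32 scale per row. The names pack_rows and decode_rows are hypothetical.

```python
import numpy as np

# 256 x 2 lookup table: packed byte -> (high nibble, low nibble) as float32,
# centered at zero (assumed layout, see the note above).
_codes = np.arange(256, dtype=np.uint8)
LUT = np.stack([(_codes >> 4).astype(np.float32) - 8.0,
                (_codes & 0x0F).astype(np.float32) - 8.0], axis=1)

def pack_rows(q):
    """Pack an (n, dim) array of codes in 0..15 into (n, dim // 2) bytes."""
    q = q.astype(np.uint8)
    return (q[:, 0::2] << 4) | q[:, 1::2]

def decode_rows(packed, scales, row_ids):
    """Dequantize only the requested rows with one fancy-index into LUT."""
    row_ids = np.asarray(row_ids)
    sub = packed[row_ids]                       # (k, dim // 2) uint8
    vals = LUT[sub]                             # (k, dim // 2, 2) float32
    return vals.reshape(len(row_ids), -1) * scales[row_ids, None]

rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=(4, 6))        # toy 4-row, 6-dim table
scales = rng.uniform(0.01, 0.1, size=4).astype(np.float32)
packed = pack_rows(codes)                       # stays packed, as in the q4 mode

rows = decode_rows(packed, scales, [2, 0])      # decode just rows 2 and 0
```

Decoding the whole table once with the same function corresponds to the f32 mode: more memory up front, then lookups become a plain gather.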

5. Correctness pinned against the baseline

After every optimization, two checks run against the reference implementation: exact token-id equality (including edge cases: full-width characters, ZWJ emoji, control characters, soft hyphens) and max absolute embedding error (at most 6e-8 against sentence-transformers and 1.2e-7 against the model2vec reference, i.e. float32 rounding only). Edge cases like empty strings (for which tokenizers returns an empty id list) were caught and matched this way.
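
The two checks reduce to a few lines. The helper below is a hypothetical sketch of such a comparison, fed with made-up toy values; it is not the project's test code and reproduces none of the reported results.

```python
import numpy as np

def compare_with_reference(ids, ref_ids, emb, ref_emb):
    """Return (sentences with identical token ids, total, max abs emb diff)."""
    matching = sum(list(a) == list(b) for a, b in zip(ids, ref_ids))
    diff = np.abs(np.asarray(emb, dtype=np.float64) -
                  np.asarray(ref_emb, dtype=np.float64))
    return matching, len(ref_ids), float(diff.max())

# Toy inputs: the second sentence is empty, so both sides give an empty id list.
ids = [[5, 9, 2], []]
ref_ids = [[5, 9, 2], []]
emb = np.array([[0.6, 0.8], [0.0, 0.0]], dtype=np.float32)
ref_emb = emb + np.float32(1e-8)

matching, total, max_diff = compare_with_reference(ids, ref_ids, emb, ref_emb)
```

Comparing ids per sentence rather than as one flat list keeps an off-by-one in a single sentence from masking matches elsewhere.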

License

MIT
