
nanoE5.c - blazing-fast 4-bit CPU text embeddings (multilingual-e5-small), model bundled, OpenAI-compatible server, zero ML dependencies


nanoE5.c

A blazing-fast, dependency-free CPU engine for multilingual-e5-small text embeddings.

A tiny C core (the .c is the whole point) packaged for one-command use: pip install nanoe5 from Python, or a single self-contained server binary.

The 4-bit model is bundled, so there is nothing to download or configure.

pip install nanoe5

import nanoe5
q = nanoe5.query("how much protein per day")     # 384-dim, L2-normalized
P = nanoe5.passage(["doc a", "doc b"])           # (2, 384)
scores = P @ q                                   # cosine similarity

…or run an OpenAI-compatible server (works with the official openai client):

nanoe5-serve --port 8000           # OpenAI-compatible embeddings API

No PyTorch. No transformers. No ONNX. No BLAS. Just C, libm, and OpenMP.


Why

  • One file to deploy. The 4-bit model is linked inside the ./e5 binary (~69 MB). Copy it to a server and run — nothing to download, install, or mount.
  • Fast where it counts. ~2 ms to embed a single query on a desktop CPU — about 7× faster than sentence-transformers for one-at-a-time serving.
  • Tiny. 72 MB 4-bit model vs 471 MB fp32. Instant startup (mmap).
  • Faithful. Real XLM-RoBERTa SentencePiece tokenizer + exact BERT forward pass; cosine 0.98–0.99 vs the fp32 reference, retrieval rankings preserved.
  • Handles long text. Inputs over 512 tokens are windowed automatically and transparently, in bounded memory.

Install

From PyPI (Python)

pip install nanoe5

That's it: the 4-bit model is inside the package. The tiny C engine compiles on install (needs a C compiler with OpenMP, e.g. gcc), then everything runs with no ML dependencies (just NumPy). The fast path needs an x86-64 CPU with AVX2; other CPUs fall back to a portable scalar build automatically.

From source (server binary + CLI)

# 1. download + quantize the model -> e5-small-q4.bin  (one-time, ~72 MB)
make convert        # pip install torch transformers safetensors tokenizers numpy

# 2a. build the self-contained server/CLI binary  ->  ./e5
make server

# 2b. (optional) build the Python shared library   ->  libe5.so
make lib

make convert is the only step that touches the Python ML stack. After it, the binary runs with no ML dependencies at all.


Use it: the OpenAI-compatible server

Start a server with one command — works with the official openai Python client out of the box (verified against openai>=1.0):

pip install nanoe5
nanoe5-serve --port 8000          # OpenAI-compatible embeddings server

from openai import OpenAI                       # the official OpenAI client

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.embeddings.create(
    model="e5-query",                           # see "Query vs passage" below
    input=["how much protein per day", "best protein sources"],
)
embeddings = [d.embedding for d in resp.data]   # two 384-dim vectors

Both encoding_format="float" and the client's default "base64" path are supported, so nothing in your existing OpenAI code needs to change — just point base_url at the server.

Prefer a single dependency-free binary? make server builds ./e5, which embeds the model and serves the same API with zero Python: ./e5 --server --port 8000.

…or hit it with plain curl:

curl http://localhost:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input": ["doc one", "doc two"], "input_type": "passage"}'

{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.031, -0.044, ...]},
    {"object": "embedding", "index": 1, "embedding": [0.018,  0.007, ...]}
  ],
  "model": "multilingual-e5-small-q4",
  "usage": {"prompt_tokens": 8, "total_tokens": 8}
}

Endpoints

| Method & path | Purpose |
|---|---|
| POST /v1/embeddings | Create embeddings (string or array of strings). |
| GET /v1/models | List the served model. |
| GET /health | Liveness check → {"status":"ok"}. |

Request fields

| Field | Values | Default |
|---|---|---|
| input | a string or an array of strings | required |
| encoding_format | "float" or "base64" | "float" |
| input_type | "query", "passage" (alias "document") | server default |
| model | any string; if it contains query/passage/doc it sets the modality | |

encoding_format: "base64" returns each embedding as base64-encoded little-endian float32 — this is what the official OpenAI Python client requests by default, and it's fully supported.
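If you call the server without the official client, the base64 payload has to be decoded by hand. The sketch below assumes only what is stated above (base64 text wrapping little-endian float32 values); `decode_embedding` is a hypothetical helper name, and the 4-value vector is made up for the round-trip demo (real embeddings have 384 values).

```python
import base64

import numpy as np


def decode_embedding(b64: str) -> np.ndarray:
    """Decode base64-encoded little-endian float32 values into a 1-D array."""
    raw = base64.b64decode(b64)
    return np.frombuffer(raw, dtype="<f4")


# Round-trip demo: encode a small made-up vector the way the server is
# described to, then decode it again.
vec = np.array([0.5, -0.25, 0.125, 1.0], dtype="<f4")
encoded = base64.b64encode(vec.tobytes()).decode("ascii")
decoded = decode_embedding(encoded)
```

The explicit `"<f4"` dtype keeps the decode correct on big-endian hosts as well, where a plain `np.float32` would read the bytes in the wrong order.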

Server flags

Both server forms take the same flags:

nanoe5-serve  [--host H] [--port P] [--threads N] [--default-type query|passage] [--model FILE]
./e5 --server [--host H] [--port P] [--threads N] [--default-type query|passage] [--model FILE]

  • --threads N caps OpenMP threads (default: all cores).
  • --default-type sets the modality when a request doesn't specify one (default query).
  • --model FILE loads an external e5-small-q4.bin (the binary otherwise uses its embedded copy; the pip server uses the bundled one).

Use it: from Python

The simplest form uses module-level helpers backed by a shared, hot model (loaded once, reused for every call):

import nanoe5

q = nanoe5.query("how much protein per day")     # (384,)
docs = nanoe5.passage([                            # (N, 384)
    "The recommended protein intake for adult women is about 46 g/day.",
    "Mount Everest is the highest mountain above sea level.",
])

scores = docs @ q          # already L2-normalized -> dot product = cosine
print(scores.argmax())     # -> 0

Or hold an explicit handle (e.g. to cap threads):

from nanoe5 import E5
model = E5(num_threads=8)
model.query("...");  model.passage(["...", "..."])

That's the whole API:

| Call | Prefix added | Returns |
|---|---|---|
| nanoe5.query(text or list) / model.query(...) | query: | (384,) or (N, 384) float32 |
| nanoe5.passage(text or list) / model.passage(...) | passage: | (384,) or (N, 384) float32 |
| nanoe5.encode(x, is_query=False) / model.encode(...) | either | generic form |

A single text is parallelized across all CPU cores (low latency); a list is parallelized across texts (high throughput).


Query vs passage

multilingual-e5-small is trained with two prefixes, and you should use the right one:

  • query: — short search queries / questions.
  • passage: — documents you want to retrieve.

Embed your documents with passage, your search queries with query, then rank documents by cosine similarity (a plain dot product, since outputs are normalized).

  • Python: model.query(...) vs model.passage(...).
  • Server: set "input_type": "query" or "passage" per request (or name the model e5-query / e5-passage), otherwise the server's --default-type is used.
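The selection rule for the server can be sketched in a few lines of Python. This is an illustration, not the server's code: `resolve_modality` and `add_prefix` are hypothetical names, and the precedence used here (explicit `input_type` first, then a hint in the model name, then the default) is an assumption, since the text above does not say which wins when both are present. The space after the colon follows the usual E5 prefix convention.

```python
def resolve_modality(input_type=None, model=None, default_type="query"):
    """Pick 'query' or 'passage' for one request (hypothetical helper).

    Assumed precedence: explicit input_type, then a hint in the model
    name, then the server default. The real server may differ.
    """
    if input_type is not None:
        kind = input_type.lower()
        if kind in ("passage", "document"):   # "document" is an alias
            return "passage"
        if kind == "query":
            return "query"
        raise ValueError(f"unknown input_type: {input_type!r}")
    name = (model or "").lower()
    if "query" in name:
        return "query"
    if "passage" in name or "doc" in name:
        return "passage"
    return default_type


def add_prefix(text, modality):
    """Prepend the E5 prefix for the chosen modality."""
    return f"{modality}: {text}"


prefixed = add_prefix("how much protein per day",
                      resolve_modality(model="e5-query"))
```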

Long inputs (automatic)

The base model maxes out at 512 tokens. Instead of truncating, nanoE5.c slides a window over longer text: it splits the input into ≤510-token windows, embeds each, and returns the token-count-weighted average, re-normalized. This matches mean-pooling over the whole document, except that tokens attend only within their own window, and it needs no API change: just pass a long string. Memory stays bounded (~350 MB) even for million-token inputs.
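The pooling arithmetic can be checked with a small NumPy sketch. It is not the engine's code: `embed_long` is a hypothetical name, the random array stands in for per-token encoder outputs (the real engine encodes each window separately), and it assumes the per-window vectors are averaged before normalization.

```python
import numpy as np


def embed_long(token_vectors: np.ndarray, window: int = 510) -> np.ndarray:
    """Token-count-weighted average of per-window means, L2-normalized.

    token_vectors: (T, D) stand-in for per-token encoder outputs, T >= 1.
    Only a running sum is kept, so memory does not grow with T.
    """
    total = np.zeros(token_vectors.shape[1], dtype=np.float64)
    count = 0
    for start in range(0, len(token_vectors), window):
        chunk = token_vectors[start:start + window]
        window_mean = chunk.mean(axis=0)       # per-window mean pooling
        total += len(chunk) * window_mean      # weight by token count
        count += len(chunk)
    pooled = total / count
    return pooled / np.linalg.norm(pooled)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((1300, 384))      # 3 windows: 510 + 510 + 280
windowed = embed_long(tokens)

# Reference: mean-pool all tokens at once, then normalize.
whole = tokens.mean(axis=0)
whole /= np.linalg.norm(whole)
```

With these stand-in vectors, `windowed` and `whole` agree to floating-point precision, which is the sense in which the weighted average reproduces whole-document mean pooling.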


CLI

The same binary is also a quick CLI:

./e5 query   "how much protein should a female eat"
./e5 passage "a document to index"
./e5 --model e5-small-q4.bin query "use an external model file"

How it works (short version)

  • 4-bit weights (Q4_0). Every large matrix is stored in 32-weight blocks with an fp16 scale (~4.5 bits/weight), about 7× less memory traffic than fp32 (32 / 4.5 ≈ 7.1).
  • int8 × int4 matmul. Activations are quantized to int8 and multiplied against the 4-bit weights with AVX2 integer MACs, with no fp32 dequant in the hot loop. Scalar fallback included for non-AVX2 CPUs.
  • One pass per batch. All tokens of a batch share a single matmul per layer, so weights stream once; attention runs per text.
  • OpenMP across matrix rows / texts; deterministic regardless of thread count.
  • Faithful tokenizer. XLM-RoBERTa SentencePiece-unigram (Viterbi) with the real Precompiled normalizer baked in as a per-codepoint table.
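To make the block format concrete, here is a simplified NumPy sketch of block-wise 4-bit quantization with one fp16 scale per 32 weights. It is an illustration under assumptions, not the layout used by e5.c: the rounding rule (symmetric, codes in [-7, 7]) and the names `quantize_q4` / `dequantize_q4` are made up for this example, and a real engine would pack two codes per byte.

```python
import numpy as np

BLOCK = 32  # weights per block, as described above


def quantize_q4(weights: np.ndarray):
    """Quantize a 1-D float array (length divisible by 32) into 4-bit blocks.

    Returns (codes, scales): int8 codes in [-7, 7] and one fp16 scale
    per block.
    """
    blocks = weights.astype(np.float32).reshape(-1, BLOCK)
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    scales = (max_abs / 7.0).astype(np.float16)
    # Guard all-zero blocks against division by zero.
    safe = np.where(scales == 0, 1.0, scales).astype(np.float32)
    codes = np.clip(np.rint(blocks / safe), -7, 7).astype(np.int8)
    return codes, scales


def dequantize_q4(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Rebuild approximate float32 weights from codes and per-block scales."""
    return (codes.astype(np.float32) * scales.astype(np.float32)).reshape(-1)


rng = np.random.default_rng(0)
w = rng.standard_normal(384 * 4).astype(np.float32)
codes, scales = quantize_q4(w)
w_hat = dequantize_q4(codes, scales)

bits_per_weight = (BLOCK * 4 + 16) / BLOCK   # 4-bit codes + one fp16 scale
max_err = float(np.abs(w - w_hat).max())
cosine = float(w @ w_hat / (np.linalg.norm(w) * np.linalg.norm(w_hat)))
```

The storage cost works out to (32 × 4 + 16) / 32 = 4.5 bits per weight, and the per-weight error stays within about half a quantization step of its block.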

The model is packed into one binary blob by convert.py; e5.c is the entire engine (loader, tokenizer, BERT, quantized matmul); server.c adds the HTTP server and CLI; e5.py is the ctypes wrapper.


Performance

On a Ryzen 7 5800X3D (8 cores / 16 threads, AVX2):

| | nanoE5.c (4-bit) | sentence-transformers (fp32) |
|---|---|---|
| single-query latency (hot) | ~2 ms | ~13 ms |
| batch throughput | ~190–340 texts/s | ~280 texts/s |
| model size | 72 MB | 471 MB |
| dependencies | libc, libm, OpenMP | torch + transformers |
| cold start | instant (mmap) | seconds |

For online serving (one query at a time, model hot) nanoE5.c is ~7× faster per call (~2 ms vs ~13 ms). For huge offline batch jobs, PyTorch's oneDNN GEMM can edge ahead on raw throughput, but nanoE5.c runs at roughly 1/6th the model footprint with zero ML dependencies.


Validate & stress

make test     # cosine parity vs the fp32 HF reference + speed
make stress   # hard edge-case / concurrency / server suite

make stress throws adversarial inputs at every layer and asserts: no crashes, no hangs, finite & unit-norm outputs, determinism, batch == single (exact), server == binding parity, base64 == float parity, real OpenAI-client compatibility, correct 4xx handling for malformed requests, survival of a raw garbage barrage, and 400 concurrent requests with zero errors or races.
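A couple of these invariants are easy to check in your own code as well. The sketch below is not taken from stress_test.py; `check_embeddings` is a hypothetical helper, the tolerance is an arbitrary choice, and the random vectors only stand in for real model output.

```python
import numpy as np


def check_embeddings(E, dim: int = 384, tol: float = 1e-3) -> None:
    """Assert that embeddings have the right width, are finite, and unit-norm."""
    E = np.atleast_2d(np.asarray(E, dtype=np.float32))
    assert E.shape[1] == dim, f"expected {dim} dims, got {E.shape[1]}"
    assert np.isfinite(E).all(), "non-finite values in embeddings"
    norms = np.linalg.norm(E, axis=1)
    assert np.allclose(norms, 1.0, atol=tol), "embeddings are not unit-norm"


# Stand-in data: three random vectors, normalized like the model's output.
rng = np.random.default_rng(0)
fake = rng.standard_normal((3, 384)).astype(np.float32)
fake /= np.linalg.norm(fake, axis=1, keepdims=True)
check_embeddings(fake)
```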


Files

e5.c / e5.h      the entire inference engine
server.c         OpenAI-compatible HTTP server + CLI
convert.py       build e5-small-q4.bin from the HF checkpoint (one-time)
nanoe5/          the pip package (engine + 4-bit model bundled)
pyproject.toml   / setup.py   packaging (compiles the engine, bundles the model)
e5.py            standalone ctypes wrapper (repo-local use)
test_parity.py   parity vs HF reference + benchmark
stress_test.py   hard stress / edge-case suite
Makefile

License

The code here is yours to use. The model weights are intfloat/multilingual-e5-small (MIT) — see the model card for details.

