nanoE5.c - blazing-fast 4-bit CPU text embeddings (multilingual-e5-small), model bundled, zero ML dependencies
nanoE5.c
A blazing-fast, dependency-free CPU engine for multilingual-e5-small text embeddings.
A tiny C core (the `.c` is the whole point) packaged for one-command use: `pip install nanoe5` from Python, or a single self-contained server binary.
Ships as a single self-contained binary with the 4-bit model baked inside — run an OpenAI-compatible embeddings server with one file and zero dependencies. Or call it from Python and keep the model hot in RAM.
./e5 --server --port 8000 # OpenAI-compatible server, one file, no deps
from e5 import E5
model = E5()
vec = model.query("how much protein per day") # 384-dim, L2-normalized
No PyTorch. No transformers. No ONNX. No BLAS. Just C, libm, and OpenMP.
Why
- One file to deploy. The 4-bit model is linked inside the `./e5` binary (~69 MB). Copy it to a server and run: nothing to download, install, or mount.
- Fast where it counts. ~2 ms to embed a single query on a desktop CPU, about 7× faster than `sentence-transformers` for one-at-a-time serving.
- Tiny. 72 MB 4-bit model vs 471 MB fp32. Instant startup (mmap).
- Faithful. Real XLM-RoBERTa SentencePiece tokenizer + exact BERT forward pass; cosine 0.98–0.99 vs the fp32 reference, retrieval rankings preserved.
- Handles long text. Inputs over 512 tokens are windowed automatically and transparently, in bounded memory.
Install
You need a C compiler with OpenMP (gcc/clang) and, once, Python to build the model file.
# 1. download + quantize the model -> e5-small-q4.bin (one-time, ~72 MB)
make convert # pip install torch transformers safetensors tokenizers numpy
# 2a. build the self-contained server/CLI binary -> ./e5
make server
# 2b. (optional) build the Python shared library -> libe5.so
make lib
make convert is the only step that touches the Python ML stack. After it, the
binary and the Python library run with no ML dependencies at all.
Use it: the server
Start it (the model is already inside the binary):
./e5 --server --host 0.0.0.0 --port 8000
It speaks the OpenAI embeddings API, so any OpenAI client works unchanged:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.embeddings.create(
    model="e5-query",  # see "Query vs passage" below
    input=["how much protein per day", "best protein sources"],
)
embeddings = [d.embedding for d in resp.data] # two 384-dim vectors
…or just curl:
curl http://localhost:8000/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"input": ["doc one", "doc two"], "input_type": "passage"}'
{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.031, -0.044, ...]},
    {"object": "embedding", "index": 1, "embedding": [0.018, 0.007, ...]}
  ],
  "model": "multilingual-e5-small-q4",
  "usage": {"prompt_tokens": 8, "total_tokens": 8}
}
Endpoints
| Method & path | Purpose |
|---|---|
| `POST /v1/embeddings` | Create embeddings (string or array of strings). |
| `GET /v1/models` | List the served model. |
| `GET /health` | Liveness check → `{"status":"ok"}`. |
Request fields
| Field | Values | Default |
|---|---|---|
| `input` | a string or an array of strings | required |
| `encoding_format` | `"float"` or `"base64"` | `"float"` |
| `input_type` | `"query"`, `"passage"` (alias `"document"`) | server default |
| `model` | any string; if it contains `query`/`passage`/`doc` it sets the modality | (none) |
encoding_format: "base64" returns each embedding as base64-encoded
little-endian float32 — this is what the official OpenAI Python client requests
by default, and it's fully supported.
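A client that does not go through the OpenAI SDK has to decode the base64 form itself. The following is a minimal sketch using only the Python standard library; `decode_embedding` and `encode_embedding` are hypothetical helper names, and the encoder exists only to build a sample payload for the round trip.

```python
import base64
import struct

def decode_embedding(b64: str) -> list[float]:
    """Decode a base64 string holding little-endian float32 values."""
    raw = base64.b64decode(b64)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    # "<" = little-endian, "f" = 32-bit float
    return list(struct.unpack(f"<{count}f", raw))

def encode_embedding(values: list[float]) -> str:
    """Inverse of decode_embedding; used here only to build a sample payload."""
    packed = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(packed).decode("ascii")

# Round trip with values that are exactly representable in float32.
sample = [0.5, -0.25, 1.0]
payload = encode_embedding(sample)
print(decode_embedding(payload))  # [0.5, -0.25, 1.0]
```

A 384-dimensional embedding decodes from 1536 raw bytes (384 × 4).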
Server flags
./e5 --server [--host H] [--port P] [--threads N]
[--default-type query|passage] [--model FILE]
- `--threads N` caps OpenMP threads (default: all cores).
- `--default-type` sets the modality when a request doesn't specify one (default `query`).
- `--model FILE` loads an external `e5-small-q4.bin` instead of the embedded one.
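The modality rules described in this section (the `input_type` field, the model-name shortcut, and `--default-type`) can be sketched as a small resolver. This is an illustration, not the server's code: `resolve_modality` and `add_prefix` are hypothetical names, and the precedence of `input_type` over the model name, the case-insensitive substring match, and the rejection of unknown `input_type` values are assumptions. The trailing space after the prefix follows the E5 model card convention.

```python
def resolve_modality(input_type=None, model=None, default_type="query"):
    """Return "query" or "passage" for one request (hypothetical helper)."""
    aliases = {"query": "query", "passage": "passage", "document": "passage"}
    # Assumption: an explicit input_type wins over the model name.
    if input_type is not None:
        if input_type not in aliases:
            raise ValueError(f"unknown input_type: {input_type!r}")
        return aliases[input_type]
    # Assumption: the model name is matched case-insensitively by substring.
    name = (model or "").lower()
    if "query" in name:
        return "query"
    if "passage" in name or "doc" in name:
        return "passage"
    # Neither field decided it, so fall back to the server default.
    return default_type

def add_prefix(text, modality):
    """Prepend the E5 prefix ("query: " or "passage: ") to the text."""
    return f"{modality}: {text}"

print(resolve_modality(model="e5-passage"))            # passage
print(resolve_modality(input_type="document"))         # passage
print(resolve_modality(model="text-embedding-3-small"))  # query (the default)
print(add_prefix("how much protein per day", "query"))
```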
Use it: from Python
The Python wrapper loads the model once and keeps it hot in RAM — every call reuses it with zero reload cost.
from e5 import E5
model = E5() # loads e5-small-q4.bin, stays hot
# single text -> shape (384,)
q = model.query("how much protein per day")
# a list -> shape (N, 384)
docs = model.passage([
    "The recommended protein intake for adult women is about 46 g/day.",
    "Mount Everest is the highest mountain above sea level.",
])
# cosine similarity (vectors are already L2-normalized, so just a dot product)
scores = docs @ q
print(scores.argmax()) # -> 0
That's the whole API:
| Method | Prefix added | Returns |
|---|---|---|
| `model.query(text \| list)` | `query:` | `(384,)` or `(N, 384)` float32 |
| `model.passage(text \| list)` | `passage:` | `(384,)` or `(N, 384)` float32 |
| `model.encode(text \| list, is_query=False)` | either | generic form |
E5(model_path=..., lib_path=..., num_threads=...) lets you point at a specific
model/library or cap threads.
A single text is parallelized across all CPU cores (low latency); a list is parallelized across texts (high throughput).
Query vs passage
multilingual-e5-small is trained with two prefixes, and you should use the
right one:
- `query:` for short search queries / questions.
- `passage:` for documents you want to retrieve.
Embed your documents with passage, your search queries with query, then rank
documents by cosine similarity (a plain dot product, since outputs are
normalized).
- Python: `model.query(...)` vs `model.passage(...)`.
- Server: set `"input_type": "query"` or `"passage"` per request (or name the model `e5-query`/`e5-passage`), otherwise the server's `--default-type` is used.
Long inputs (automatic)
The base model accepts at most 512 tokens. Instead of truncating, nanoE5.c slides a window over longer text: it splits the input into windows of at most 510 tokens, embeds each, and returns the token-count-weighted average of the window embeddings, re-normalized to unit length. Weighting each window's mean by its token count gives the same result as mean-pooling over all of the document's token states, with the caveat that each token attends only to the other tokens in its own window. No API change is needed: just pass a long string. Memory stays bounded (~350 MB) even for million-token inputs.
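The windowing arithmetic can be checked with a toy model. The sketch below is not the engine's code: `toy_token_state` is a made-up, context-free stand-in for a transformer hidden state, and all function names are hypothetical. It also assumes window embeddings are averaged before normalization, which the description above does not spell out.

```python
import math

WINDOW = 510  # maximum tokens per window, as described above

def split_windows(tokens, size=WINDOW):
    """Cut the token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def toy_token_state(token_id, dim=4):
    # Stand-in for a hidden state: deterministic and independent of context.
    return [float((token_id * (k + 3)) % 11) for k in range(dim)]

def mean_pool(tokens, dim=4):
    """Average the toy states of the given tokens."""
    total = [0.0] * dim
    for t in tokens:
        for k, x in enumerate(toy_token_state(t, dim)):
            total[k] += x
    return [x / len(tokens) for x in total]

def embed_long(tokens, size=WINDOW, dim=4):
    """Token-count-weighted average of window means, then re-normalized."""
    acc = [0.0] * dim
    for window in split_windows(tokens, size):
        pooled = mean_pool(window, dim)
        for k in range(dim):
            acc[k] += len(window) * pooled[k]  # weight by token count
    return l2_normalize([x / len(tokens) for x in acc])

tokens = list(range(1, 1201))  # 1200 tokens
print([len(w) for w in split_windows(tokens)])  # [510, 510, 180]
windowed = embed_long(tokens)
whole = l2_normalize(mean_pool(tokens))
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(windowed, whole)))  # True
```

In the toy, the two results agree because token states do not depend on context. In a real transformer they do, so the windowed result only approximates what an unbounded-context model would produce.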
CLI
The same binary is also a quick CLI:
./e5 query "how much protein should a female eat"
./e5 passage "a document to index"
./e5 --model e5-small-q4.bin query "use an external model file"
How it works (short version)
- 4-bit weights (Q4_0). Every large matrix is stored in 32-weight blocks with an fp16 scale (~4.5 bits/weight), which is ~7× less memory traffic than fp32 (32 / 4.5 ≈ 7.1).
- int8 × int4 matmul. Activations are quantized to int8 and multiplied against the 4-bit weights with AVX2 integer MACs — no fp32 dequant in the hot loop. Scalar fallback included for non-AVX CPUs.
- One pass per batch. All tokens of a batch share a single matmul per layer, so weights stream once; attention runs per text.
- OpenMP across matrix rows / texts; deterministic regardless of thread count.
- Faithful tokenizer. XLM-RoBERTa SentencePiece-unigram (Viterbi) with the real Precompiled normalizer baked in as a per-codepoint table.
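The block quantization and integer dot product from the first two bullets can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the engine's kernel: the symmetric scale choice (`amax / 7`), the code offset of 8, the nibble order, and the per-vector int8 activation scale are all assumptions, and every function name is hypothetical. The real code uses AVX2 where this uses a plain loop.

```python
import struct

BLOCK = 32  # weights per block, as stated above

def fp16_round(x):
    """Round a float through IEEE half precision, as an fp16 field would."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_q4_block(weights):
    """Return (scale, codes); codes are ints in 0..15, weight ~ scale * (code - 8)."""
    assert len(weights) == BLOCK
    scale = fp16_round(max(abs(w) for w in weights) / 7.0)
    if scale == 0.0:
        return 0.0, [8] * BLOCK  # an all-zero block
    codes = [max(0, min(15, round(w / scale) + 8)) for w in weights]
    return scale, codes

def pack_block(scale, codes):
    """2 bytes of fp16 scale followed by 16 bytes holding two 4-bit codes each."""
    nibbles = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return struct.pack("<e", scale) + nibbles

def quantize_int8(acts):
    """Return (scale, q); q are ints in -127..127, activation ~ scale * q."""
    amax = max(abs(a) for a in acts)
    if amax == 0.0:
        return 0.0, [0] * len(acts)
    scale = amax / 127.0
    return scale, [max(-127, min(127, round(a / scale))) for a in acts]

def block_dot(w_scale, w_codes, a_scale, a_q):
    """Integer multiply-accumulate, then a single float rescale for the block."""
    acc = sum((c - 8) * q for c, q in zip(w_codes, a_q))
    return acc * w_scale * a_scale

weights = [(i - 16) / 16 for i in range(BLOCK)]
acts = [((i * 7) % 13 - 6) / 6 for i in range(BLOCK)]
w_scale, w_codes = quantize_q4_block(weights)
a_scale, a_q = quantize_int8(acts)
packed = pack_block(w_scale, w_codes)
print(len(packed) * 8 / BLOCK)  # 4.5 bits per weight
exact = sum(w * a for w, a in zip(weights, acts))
print(exact, block_dot(w_scale, w_codes, a_scale, a_q))  # exact vs quantized
```

The 18-byte block (2 bytes of scale plus 16 bytes of codes) is where the ~4.5 bits/weight figure comes from. Keeping the inner loop in integers and applying the two scales once per block is what avoids dequantizing to fp32 inside the hot loop.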
The model is packed into one binary blob by convert.py; e5.c is the entire
engine (loader, tokenizer, BERT, quantized matmul); server.c adds the HTTP
server and CLI; e5.py is the ctypes wrapper.
Performance
On a Ryzen 7 5800X3D (8 cores / 16 threads, AVX2):
| | nanoE5.c (4-bit) | sentence-transformers (fp32) |
|---|---|---|
| single-query latency (hot) | ~2 ms | ~13 ms |
| batch throughput | ~190–340 texts/s | ~280 texts/s |
| model size | 72 MB | 471 MB |
| dependencies | libc, libm, OpenMP | torch + transformers |
| cold start | instant (mmap) | seconds |
For online serving (one query at a time, model hot) nanoE5.c is ~7× faster per call (~2 ms vs ~13 ms). For huge offline batch jobs, PyTorch's oneDNN GEMM edges ahead on raw throughput, but nanoE5.c runs at roughly 1/6th the model footprint (72 MB vs 471 MB) with no ML dependencies.
Validate & stress
make test # cosine parity vs the fp32 HF reference + speed
make stress # hard edge-case / concurrency / server suite
make stress throws adversarial inputs at every layer and asserts: no crashes,
no hangs, finite & unit-norm outputs, determinism, batch == single (exact),
server == binding parity, base64 == float parity, real OpenAI-client
compatibility, correct 4xx handling for malformed requests, survival of a raw
garbage barrage, and 400 concurrent requests with zero errors or races.
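A few of these invariants are easy to check from client code as well. The sketch below is not taken from `stress_test.py`: `check_embedding` and `check_batch_equals_single` are hypothetical helpers, the `1e-3` norm tolerance is an assumption, and `toy_embed` is a stand-in so the example runs without the model.

```python
import math

def check_embedding(vec, dim=384, tol=1e-3):
    """Return a list of problems found in one embedding (empty if none)."""
    problems = []
    if len(vec) != dim:
        problems.append(f"expected {dim} dims, got {len(vec)}")
    if not all(math.isfinite(x) for x in vec):
        problems.append("non-finite value")
        return problems  # the norm is meaningless with NaN or inf present
    norm = math.sqrt(sum(x * x for x in vec))
    if abs(norm - 1.0) > tol:
        problems.append(f"norm {norm:.6f} is not 1")
    return problems

def check_batch_equals_single(embed, texts):
    """embed is any callable mapping a list of texts to a list of vectors."""
    batch = embed(texts)
    singles = [embed([t])[0] for t in texts]
    return batch == singles  # exact equality, not approximate

def toy_embed(texts, dim=384):
    # Stand-in embedder: deterministic unit vectors derived from text length.
    out = []
    for t in texts:
        raw = [float((len(t) + k) % 5 + 1) for k in range(dim)]
        norm = math.sqrt(sum(x * x for x in raw))
        out.append([x / norm for x in raw])
    return out

print(check_embedding(toy_embed(["hello"])[0]))           # []
print(check_embedding([float("nan")] * 384))              # ['non-finite value']
print(check_batch_equals_single(toy_embed, ["a", "bb"]))  # True
```

To run the same checks against the real model, pass a callable that wraps the Python binding or the HTTP endpoint in place of `toy_embed`.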
Files
convert.py build e5-small-q4.bin from the HF checkpoint (one-time)
e5.c / e5.h the entire inference engine
server.c OpenAI-compatible HTTP server + CLI
e5.py Python wrapper (load once, keep hot)
test_parity.py parity vs HF reference + benchmark
stress_test.py hard stress / edge-case suite
Makefile
License
The code here is yours to use. The model weights are
intfloat/multilingual-e5-small (MIT) — see the model card for details.