nanoE5.c - blazing-fast 4-bit CPU text embeddings (multilingual-e5-small), model bundled, OpenAI-compatible server, zero ML dependencies
nanoE5.c
A blazing-fast, dependency-free CPU engine for multilingual-e5-small text embeddings.
A tiny C core (the .c is the whole point) packaged for one-command use: pip install nanoe5 from Python, or a single self-contained server binary.
The 4-bit model is bundled — there is nothing to download or configure. Use it from Python in two lines, or run an OpenAI-compatible server from a single self-contained binary.
pip install nanoe5
import nanoe5
q = nanoe5.query("how much protein per day") # 384-dim, L2-normalized
P = nanoe5.passage(["doc a", "doc b"]) # (2, 384)
scores = P @ q # cosine similarity
…or run an OpenAI-compatible server (works with the official openai client):
nanoe5-serve --port 8000 # OpenAI-compatible embeddings API
No PyTorch. No transformers. No ONNX. No BLAS. Just C, libm, and OpenMP.
Why
- One file to deploy. The 4-bit model is linked inside the ./e5 binary (~69 MB). Copy it to a server and run: nothing to download, install, or mount.
- Fast where it counts. ~2 ms to embed a single query on a desktop CPU, about 7× faster than sentence-transformers for one-at-a-time serving.
- Tiny. 72 MB 4-bit model vs 471 MB fp32. Instant startup (mmap).
- Faithful. Real XLM-RoBERTa SentencePiece tokenizer + exact BERT forward pass; cosine 0.98–0.99 vs the fp32 reference, retrieval rankings preserved.
- Handles long text. Inputs over 512 tokens are windowed automatically and transparently, in bounded memory.
Install
From PyPI (Python)
pip install nanoe5
That's it — the 4-bit model is inside the package. The tiny C engine compiles on
install (needs a C compiler with OpenMP, e.g. gcc), then everything runs with
no ML dependencies (just NumPy). Requires an x86-64 CPU with AVX2 for the
fast path; other CPUs fall back to a portable scalar build automatically.
From source (server binary + CLI)
# 1. download + quantize the model -> e5-small-q4.bin (one-time, ~72 MB)
make convert # pip install torch transformers safetensors tokenizers numpy
# 2a. build the self-contained server/CLI binary -> ./e5
make server
# 2b. (optional) build the Python shared library -> libe5.so
make lib
make convert is the only step that touches the Python ML stack. After it, the
binary runs with no ML dependencies at all.
Use it: the OpenAI-compatible server
Start a server with one command — works with the official openai Python
client out of the box (verified against openai>=1.0):
pip install nanoe5
nanoe5-serve --port 8000 # OpenAI-compatible embeddings server
from openai import OpenAI # the official OpenAI client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.embeddings.create(
model="e5-query", # see "Query vs passage" below
input=["how much protein per day", "best protein sources"],
)
embeddings = [d.embedding for d in resp.data] # two 384-dim vectors
Both encoding_format="float" and the client's default "base64" path are
supported, so nothing in your existing OpenAI code needs to change — just point
base_url at the server.
Prefer a single dependency-free binary?
make server builds ./e5, which embeds the model and serves the same API with zero Python: ./e5 --server --port 8000.
…or hit it with plain curl:
curl http://localhost:8000/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"input": ["doc one", "doc two"], "input_type": "passage"}'
{
"object": "list",
"data": [
{"object": "embedding", "index": 0, "embedding": [0.031, -0.044, ...]},
{"object": "embedding", "index": 1, "embedding": [0.018, 0.007, ...]}
],
"model": "multilingual-e5-small-q4",
"usage": {"prompt_tokens": 8, "total_tokens": 8}
}
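The same request can also be made from Python's standard library alone, which may be handy where neither curl nor the openai package is available. This is a minimal sketch: the helper names (build_embeddings_request, parse_embeddings, embed) are hypothetical, and the URL assumes the server from the examples above is listening on localhost:8000.

```python
import json
import urllib.request


def build_embeddings_request(texts, input_type="passage",
                             base_url="http://localhost:8000/v1"):
    """Build (but do not send) a POST /v1/embeddings request."""
    payload = {"input": texts, "input_type": input_type, "encoding_format": "float"}
    return urllib.request.Request(
        base_url + "/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(body):
    """Return the embeddings from a response body, ordered by their "index" field."""
    items = sorted(json.loads(body)["data"], key=lambda d: d["index"])
    return [item["embedding"] for item in items]


def embed(texts, **kwargs):
    """Send the request and parse the reply; needs a running server."""
    with urllib.request.urlopen(build_embeddings_request(texts, **kwargs)) as resp:
        return parse_embeddings(resp.read())
```

With a server running, embed(["doc one", "doc two"]) should return two lists of floats, one per input.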
Endpoints
| Method & path | Purpose |
|---|---|
| POST /v1/embeddings | Create embeddings (string or array of strings). |
| GET /v1/models | List the served model. |
| GET /health | Liveness check → {"status":"ok"}. |
Request fields
| Field | Values | Default |
|---|---|---|
| input | a string or an array of strings | required |
| encoding_format | "float" or "base64" | "float" |
| input_type | "query", "passage" (alias "document") | server default |
| model | any string; if it contains query/passage/doc it sets the modality | none |
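The way input_type, model, and the server default interact can be pictured as a small resolution function. The sketch below is an illustration only, not the server's code: it assumes an explicit input_type wins over the model name, that "query" is checked before "passage"/"doc" in the model string, and that an unrecognized input_type falls through rather than being rejected. None of these details are stated in this README, and resolve_modality is a hypothetical name.

```python
def resolve_modality(request, default_type="query"):
    """Pick "query" or "passage" for one request (assumed precedence, see lead-in)."""
    input_type = request.get("input_type")
    if input_type == "query":
        return "query"
    if input_type in ("passage", "document"):  # "document" is an alias
        return "passage"

    # Fall back to hints in the model name, e.g. "e5-query" or "e5-passage".
    model = str(request.get("model", "")).lower()
    if "query" in model:
        return "query"
    if "passage" in model or "doc" in model:
        return "passage"

    # Nothing in the request decided it: use the --default-type value.
    return default_type
```

For example, resolve_modality({"model": "e5-passage"}) yields "passage", while an empty request yields whatever default_type is.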
encoding_format: "base64" returns each embedding as base64-encoded
little-endian float32 — this is what the official OpenAI Python client requests
by default, and it's fully supported.
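If you call the endpoint without the OpenAI client and ask for "base64", each embedding arrives as a string that has to be decoded by hand. A minimal sketch of that decoding, based only on the little-endian float32 layout described above (decode_embedding is a hypothetical helper name, and the three-value vector is made-up demo data):

```python
import base64

import numpy as np


def decode_embedding(b64_text):
    """Decode base64-encoded little-endian float32 bytes into a 1-D NumPy array."""
    raw = base64.b64decode(b64_text)
    return np.frombuffer(raw, dtype="<f4")  # "<f4" = little-endian 32-bit float


# Round trip on a tiny made-up vector to show the encoding is lossless.
original = np.array([0.031, -0.044, 0.5], dtype="<f4")
encoded = base64.b64encode(original.tobytes()).decode("ascii")
decoded = decode_embedding(encoded)
```

A real 384-dimensional embedding decodes the same way from 1536 raw bytes (384 × 4).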
Server flags
Both server forms take the same flags:
nanoe5-serve [--host H] [--port P] [--threads N] [--default-type query|passage] [--model FILE]
./e5 --server [--host H] [--port P] [--threads N] [--default-type query|passage] [--model FILE]
- --threads N caps OpenMP threads (default: all cores).
- --default-type sets the modality when a request doesn't specify one (default query).
- --model FILE loads an external e5-small-q4.bin (the binary otherwise uses its embedded copy; the pip server uses the bundled one).
Use it: from Python
The simplest form uses module-level helpers backed by a shared, hot model (loaded once, reused for every call):
import nanoe5
q = nanoe5.query("how much protein per day") # (384,)
docs = nanoe5.passage([ # (N, 384)
"The recommended protein intake for adult women is about 46 g/day.",
"Mount Everest is the highest mountain above sea level.",
])
scores = docs @ q # already L2-normalized -> dot product = cosine
print(scores.argmax()) # -> 0
Or hold an explicit handle (e.g. to cap threads):
from nanoe5 import E5
model = E5(num_threads=8)
model.query("..."); model.passage(["...", "..."])
That's the whole API:
| Call | Prefix added | Returns |
|---|---|---|
| nanoe5.query(text or list) / model.query(...) | query: | (384,) or (N, 384) float32 |
| nanoe5.passage(text or list) / model.passage(...) | passage: | (384,) or (N, 384) float32 |
| nanoe5.encode(x, is_query=False) / model.encode(...) | either | generic form |
A single text is parallelized across all CPU cores (low latency); a list is parallelized across texts (high throughput).
Query vs passage
multilingual-e5-small is trained with two prefixes, and you should use the
right one:
- query: for short search queries / questions.
- passage: for documents you want to retrieve.
Embed your documents with passage, your search queries with query, then rank
documents by cosine similarity (a plain dot product, since outputs are
normalized).
- Python: model.query(...) vs model.passage(...).
- Server: set "input_type": "query" or "passage" per request (or name the model e5-query / e5-passage); otherwise the server's --default-type is used.
Long inputs (automatic)
The base model maxes out at 512 tokens. Instead of truncating, nanoE5.c slides a window over longer text: it splits the input into ≤510-token windows, embeds each, and returns the token-count-weighted average of the window embeddings (then re-normalizes). Weighting by token count makes the result equal to mean-pooling over every token vector in the document, with each token encoded inside its own window. No API change is needed: just pass a long string. Memory stays bounded (~350 MB) even for million-token inputs.
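The weighted-average step can be checked numerically. The sketch below is not the engine's code: it uses random vectors as stand-ins for per-token hidden states (in the real model each token's vector depends on the window it is encoded in), and it assumes each window's embedding is the unnormalized mean of its token vectors. embed_long is a hypothetical name.

```python
import numpy as np

WINDOW = 510  # max tokens per window, as described above


def embed_long(token_vectors, window=WINDOW):
    """Token-count-weighted average of per-window mean vectors, then L2-normalize.

    token_vectors: (T, D) array standing in for per-token hidden states.
    Only a running sum is kept, so memory does not grow with T.
    """
    total = np.zeros(token_vectors.shape[1], dtype=np.float64)
    count = 0
    for start in range(0, len(token_vectors), window):
        chunk = token_vectors[start:start + window]
        window_mean = chunk.mean(axis=0)   # this window's (unnormalized) embedding
        total += len(chunk) * window_mean  # weight by the window's token count
        count += len(chunk)
    average = total / count
    return average / np.linalg.norm(average)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((1300, 16))  # 1300 "tokens" -> windows of 510, 510, 280

windowed = embed_long(tokens)
whole = tokens.mean(axis=0)
whole = whole / np.linalg.norm(whole)     # mean-pool over all tokens at once
```

Under these assumptions windowed and whole agree up to floating-point rounding, which is the equivalence the paragraph above describes.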
CLI
The same binary is also a quick CLI:
./e5 query "how much protein should a female eat"
./e5 passage "a document to index"
./e5 --model e5-small-q4.bin query "use an external model file"
How it works (short version)
- 4-bit weights (Q4_0). Every large matrix is stored in 32-weight blocks with an fp16 scale (~4.5 bits/weight), about 7× less memory traffic than fp32 (32 / 4.5 ≈ 7.1).
- int8 × int4 matmul. Activations are quantized to int8 and multiplied against the 4-bit weights with AVX2 integer MACs, with no fp32 dequant in the hot loop. A scalar fallback is included for CPUs without AVX2.
- One pass per batch. All tokens of a batch share a single matmul per layer, so weights stream once; attention runs per text.
- OpenMP across matrix rows / texts; deterministic regardless of thread count.
- Faithful tokenizer. XLM-RoBERTa SentencePiece-unigram (Viterbi) with the real Precompiled normalizer baked in as a per-codepoint table.
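To make the first two bullets concrete, here is a NumPy sketch of block quantization and an integer dot product. It is an illustration under assumptions, not a port of e5.c: it uses a simple symmetric rounding rule (the engine's exact Q4_0 rounding is not given here), it keeps one 4-bit code per int8 instead of packing two per byte, and it also rounds the activation scales through fp16, which this README does not specify. The function names are hypothetical.

```python
import numpy as np

BLOCK = 32  # weights per block, as described above


def quantize_blocks(x, qmax):
    """Symmetric per-block quantization to integers in [-qmax - 1, qmax].

    Returns (codes, scales): int8 codes of shape (n_blocks, BLOCK) and
    float32 scales of shape (n_blocks, 1) that were rounded through fp16.
    """
    blocks = np.asarray(x, dtype=np.float32).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scales = (amax / qmax).astype(np.float16).astype(np.float32)
    safe = np.where(scales == 0, np.float32(1.0), scales)  # all-zero block guard
    codes = np.clip(np.rint(blocks / safe), -qmax - 1, qmax).astype(np.int8)
    return codes, scales


def dequantize(codes, scales):
    """Expand codes back to float32 (used here only as a reference)."""
    return (codes.astype(np.float32) * scales).reshape(-1)


def int_dot(w_codes, w_scales, a_codes, a_scales):
    """Integer multiply-accumulate inside each block, one float scale per block."""
    acc = (w_codes.astype(np.int32) * a_codes.astype(np.int32)).sum(axis=1, keepdims=True)
    return float((acc * (w_scales * a_scales)).sum())


rng = np.random.default_rng(0)
w = rng.standard_normal(8 * BLOCK).astype(np.float32)  # one weight row
a = rng.standard_normal(8 * BLOCK).astype(np.float32)  # one activation vector

w_codes, w_scales = quantize_blocks(w, qmax=7)     # 4-bit range: -8..7
a_codes, a_scales = quantize_blocks(a, qmax=127)   # 8-bit range: -128..127

bits_per_weight = (BLOCK * 4 + 16) / BLOCK  # 32 four-bit codes + one fp16 scale
quantized = int_dot(w_codes, w_scales, a_codes, a_scales)
reference = float(np.dot(dequantize(w_codes, w_scales), dequantize(a_codes, a_scales)))
```

The integer path gives the same result as dequantizing both operands and multiplying in floating point, up to rounding, which is why the hot loop can skip the fp32 expansion. bits_per_weight works out to the 4.5 quoted above.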
The model is packed into one binary blob by convert.py; e5.c is the entire
engine (loader, tokenizer, BERT, quantized matmul); server.c adds the HTTP
server and CLI; e5.py is the ctypes wrapper.
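For readers unfamiliar with the unigram (Viterbi) segmentation mentioned for the tokenizer, the idea is to pick the split of a string into vocabulary pieces with the highest total log-probability. The sketch below is a toy version under stated assumptions: toy_vocab and its scores are invented for illustration, unknown characters get a fixed penalty, and the real normalizer and XLM-RoBERTa vocabulary are not modeled.

```python
import math


def viterbi_segment(text, logp, unk_logp=-20.0):
    """Split text into the highest-scoring sequence of pieces from logp.

    logp maps piece -> log-probability. A single character missing from
    logp is allowed as an "unknown" piece with score unk_logp.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in text[:i]
    best[0] = 0.0
    max_len = max(map(len, logp))
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            score = logp.get(piece)
            if score is None:
                if end - start > 1:
                    continue      # multi-character pieces must be in the vocabulary
                score = unk_logp
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end to recover the pieces.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Invented scores; "▁" marks a word boundary as in SentencePiece.
toy_vocab = {"▁protein": -5.0, "▁pro": -3.0, "tein": -3.5, "s": -4.0}

whole_word = viterbi_segment("▁protein", toy_vocab)
plural = viterbi_segment("▁proteins", toy_vocab)
```

Here the single piece "▁protein" (score -5.0) beats "▁pro" + "tein" (score -6.5), so whole_word is one piece and plural is that piece followed by "s".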
Performance
On a Ryzen 7 5800X3D (8 cores / 16 threads, AVX2):
| | nanoE5.c (4-bit) | sentence-transformers (fp32) |
|---|---|---|
| single-query latency (hot) | ~2 ms | ~13 ms |
| batch throughput | ~190–340 texts/s | ~280 texts/s |
| model size | 72 MB | 471 MB |
| dependencies | libc, libm, OpenMP | torch + transformers |
| cold start | instant (mmap) | seconds |
For online serving (one query at a time, model hot) nanoE5.c is ~7× faster per call (~2 ms vs ~13 ms). For huge offline batch jobs, PyTorch's oneDNN GEMM can edge ahead on raw throughput, but nanoE5.c runs at about 1/6th the model footprint with zero ML dependencies.
Validate & stress
make test # cosine parity vs the fp32 HF reference + speed
make stress # hard edge-case / concurrency / server suite
make stress throws adversarial inputs at every layer and asserts: no crashes,
no hangs, finite & unit-norm outputs, determinism, batch == single (exact),
server == binding parity, base64 == float parity, real OpenAI-client
compatibility, correct 4xx handling for malformed requests, survival of a raw
garbage barrage, and 400 concurrent requests with zero errors or races.
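A few of these assertions are easy to reproduce against any embedding function. The following is a small sketch of such checks, not the contents of stress_test.py: check_invariants is a hypothetical helper, and toy_embed is a made-up stand-in embedder so the example runs without nanoe5 installed.

```python
import numpy as np


def check_invariants(embed, texts):
    """Assert finite, unit-norm, deterministic output and batch == single."""
    batch = np.asarray(embed(texts))
    assert np.isfinite(batch).all(), "non-finite output"
    assert np.allclose(np.linalg.norm(batch, axis=1), 1.0, atol=1e-4), "not unit-norm"
    assert np.array_equal(batch, np.asarray(embed(texts))), "not deterministic"
    for i, text in enumerate(texts):
        single = np.asarray(embed([text]))[0]
        assert np.array_equal(single, batch[i]), f"batch != single for item {i}"
    return True


def toy_embed(texts):
    """Stand-in embedder: a deterministic unit vector derived from the text bytes."""
    rows = []
    for text in texts:
        seed = int.from_bytes(text.encode("utf-8"), "little") % (2**32)
        v = np.random.default_rng(seed).standard_normal(8)
        rows.append(v / np.linalg.norm(v))
    return np.stack(rows).astype(np.float32)


ok = check_invariants(toy_embed, ["doc a", "doc b", ""])
```

With the package installed, passing nanoe5.passage in place of toy_embed should exercise the same properties on the real engine.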
Files
e5.c / e5.h the entire inference engine
server.c OpenAI-compatible HTTP server + CLI
convert.py build e5-small-q4.bin from the HF checkpoint (one-time)
nanoe5/ the pip package (engine + 4-bit model bundled)
pyproject.toml / setup.py packaging (compiles the engine, bundles the model)
e5.py standalone ctypes wrapper (repo-local use)
test_parity.py parity vs HF reference + benchmark
stress_test.py hard stress / edge-case suite
Makefile
License
The code here is yours to use. The model weights are
intfloat/multilingual-e5-small (MIT) — see the model card for details.