nanoE5.c - blazing-fast 4-bit CPU text embeddings (multilingual-e5-small), model bundled, zero ML dependencies

nanoE5.c

A blazing-fast, dependency-free CPU engine for multilingual-e5-small text embeddings.

A tiny C core (the .c is the whole point) packaged for one-command use: pip install nanoe5 from Python, or a single self-contained server binary.

Ships as a single self-contained binary with the 4-bit model baked inside — run an OpenAI-compatible embeddings server with one file and zero dependencies. Or call it from Python and keep the model hot in RAM.

./e5 --server --port 8000          # OpenAI-compatible server, one file, no deps

from e5 import E5
model = E5()
vec = model.query("how much protein per day")   # 384-dim, L2-normalized

No PyTorch. No transformers. No ONNX. No BLAS. Just C, libm, and OpenMP.

Why

One file to deploy. The 4-bit model is linked inside the ./e5 binary (~69 MB). Copy it to a server and run — nothing to download, install, or mount.
Fast where it counts. ~2 ms to embed a single query on a desktop CPU, versus ~13 ms for sentence-transformers: roughly 6–7× faster for one-at-a-time serving.
Tiny. 72 MB 4-bit model vs 471 MB fp32. Instant startup (mmap).
Faithful. Real XLM-RoBERTa SentencePiece tokenizer + exact BERT forward pass; cosine 0.98–0.99 vs the fp32 reference, retrieval rankings preserved.
Handles long text. Inputs over 512 tokens are windowed automatically and transparently, in bounded memory.

Install

You need a C compiler with OpenMP support (gcc or clang) and, for the one-time model conversion, Python.

# 1. download + quantize the model -> e5-small-q4.bin  (one-time, ~72 MB)
make convert        # needs: pip install torch transformers safetensors tokenizers numpy

# 2a. build the self-contained server/CLI binary  ->  ./e5
make server

# 2b. (optional) build the Python shared library   ->  libe5.so
make lib

make convert is the only step that touches the Python ML stack. After it, the binary and the Python library run with no ML dependencies at all.

Use it: the server

Start it (the model is already inside the binary):

./e5 --server --host 0.0.0.0 --port 8000

It speaks the OpenAI embeddings API, so any OpenAI client works unchanged:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.embeddings.create(
    model="e5-query",                       # see "Query vs passage" below
    input=["how much protein per day", "best protein sources"],
)
embeddings = [d.embedding for d in resp.data]   # two 384-dim vectors

…or just curl:

curl http://localhost:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input": ["doc one", "doc two"], "input_type": "passage"}'

{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.031, -0.044, ...]},
    {"object": "embedding", "index": 1, "embedding": [0.018,  0.007, ...]}
  ],
  "model": "multilingual-e5-small-q4",
  "usage": {"prompt_tokens": 8, "total_tokens": 8}
}
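If the `openai` package is not installed, the same call can be made with the Python standard library. The sketch below assumes only the endpoint path and the request and response shapes shown above; `build_request`, `parse_embeddings`, and `embed` are hypothetical helper names, and the host and port are the ones used in the examples.

```python
import json
import urllib.request

def build_request(texts, input_type="passage", base_url="http://localhost:8000"):
    """Build a POST /v1/embeddings request for a list of texts."""
    payload = {"input": texts, "input_type": input_type}
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embeddings(body):
    """Return the embedding lists from a response body, ordered by `index`."""
    items = sorted(json.loads(body)["data"], key=lambda d: d["index"])
    return [item["embedding"] for item in items]

def embed(texts, input_type="passage"):
    """Send the request to a running server and return the vectors."""
    with urllib.request.urlopen(build_request(texts, input_type), timeout=10) as resp:
        return parse_embeddings(resp.read())

# Parsing works on any body with the documented shape, no server needed:
sample = '{"data": [{"index": 1, "embedding": [0.2]}, {"index": 0, "embedding": [0.1]}]}'
print(parse_embeddings(sample))   # [[0.1], [0.2]]
```

With the server running, `embed(["doc one", "doc two"])` should return two 384-dim lists.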

Endpoints

| Method & path | Purpose |
| --- | --- |
| `POST /v1/embeddings` | Create embeddings (string or array of strings). |
| `GET /v1/models` | List the served model. |
| `GET /health` | Liveness check → `{"status":"ok"}`. |

Request fields

| Field | Values | Default |
| --- | --- | --- |
| `input` | a string or an array of strings | required |
| `encoding_format` | `"float"` or `"base64"` | `"float"` |
| `input_type` | `"query"`, `"passage"` (alias `"document"`) | server default |
| `model` | any string; if it contains `query`/`passage`/`doc` it sets the modality | (none) |

encoding_format: "base64" returns each embedding as base64-encoded little-endian float32 — this is what the official OpenAI Python client requests by default, and it's fully supported.
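When calling the endpoint without the OpenAI client, a base64 payload has to be decoded by hand. A minimal sketch using only the standard library, assuming the payload is exactly what is described above (consecutive little-endian float32 values); `decode_embedding` and `encode_embedding` are illustrative names, and the encoder exists only to build a sample payload.

```python
import base64
import struct

def decode_embedding(b64):
    """Decode base64-encoded little-endian float32 values into a list of floats."""
    raw = base64.b64decode(b64)
    if len(raw) % 4:
        raise ValueError("payload length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

def encode_embedding(values):
    """Inverse of decode_embedding; used here only to build a sample payload."""
    return base64.b64encode(struct.pack(f"<{len(values)}f", *values)).decode("ascii")

sample = encode_embedding([0.5, -0.25, 1.0])
print(decode_embedding(sample))   # [0.5, -0.25, 1.0]
```

A real response carries 384 values per embedding, so each decoded payload should be 1536 bytes long.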

Server flags

./e5 --server [--host H] [--port P] [--threads N]
              [--default-type query|passage] [--model FILE]

--threads N caps OpenMP threads (default: all cores).
--default-type sets the modality when a request doesn't specify one (default query).
--model FILE loads an external e5-small-q4.bin instead of the embedded one.

Use it: from Python

The Python wrapper loads the model once and keeps it hot in RAM — every call reuses it with zero reload cost.

from e5 import E5

model = E5()                                    # loads e5-small-q4.bin, stays hot

# single text -> shape (384,)
q = model.query("how much protein per day")

# a list -> shape (N, 384)
docs = model.passage([
    "The recommended protein intake for adult women is about 46 g/day.",
    "Mount Everest is the highest mountain above sea level.",
])

# cosine similarity (vectors are already L2-normalized, so just a dot product)
scores = docs @ q
print(scores.argmax())                          # -> 0

That's the whole API:

| Method | Prefix added | Returns |
| --- | --- | --- |
| `model.query(text \| list)` | `query:` | `(384,)` or `(N, 384)` `float32` |
| `model.passage(text \| list)` | `passage:` | `(384,)` or `(N, 384)` `float32` |
| `model.encode(text \| list, is_query=False)` | either | generic form |

E5(model_path=..., lib_path=..., num_threads=...) lets you point at a specific model/library or cap threads.

A single text is parallelized across all CPU cores (low latency); a list is parallelized across texts (high throughput).

Query vs passage

multilingual-e5-small is trained with two prefixes, and you should use the right one:

query: — short search queries / questions.
passage: — documents you want to retrieve.

Embed your documents with passage, your search queries with query, then rank documents by cosine similarity (a plain dot product, since outputs are normalized).
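The ranking step needs nothing beyond a dot product. A dependency-free sketch with toy 3-dim unit vectors standing in for real 384-dim embeddings; `dot` and `rank` are illustrative helpers, not part of the package.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank(query_vec, passage_vecs):
    """Return passage indices sorted from most to least similar to the query."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy unit vectors; for unit vectors the dot product is the cosine similarity.
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [0.6, 0.8, 0.0], [0.8, 0.6, 0.0]]
print(rank(query, passages))   # [2, 1, 0]
```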

Python: model.query(...) vs model.passage(...).
Server: set "input_type": "query" or "passage" per request (or name the model e5-query / e5-passage), otherwise the server's --default-type is used.
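One way to picture how the server could pick a modality from these three sources is sketched below. The precedence (explicit `input_type`, then a hint in the model name, then the default) and the case-insensitive matching are assumptions made for illustration; the description above only says that all three exist. `resolve_modality` is a hypothetical name.

```python
def resolve_modality(input_type=None, model=None, default_type="query"):
    """Pick 'query' or 'passage' for a request.

    Assumed precedence (not confirmed by the project): explicit input_type,
    then a hint in the model name, then the server default.
    """
    if input_type is not None:
        if input_type == "query":
            return "query"
        if input_type in ("passage", "document"):   # "document" is an alias
            return "passage"
        raise ValueError(f"unsupported input_type: {input_type!r}")
    name = (model or "").lower()                     # assumption: case-insensitive match
    if "query" in name:
        return "query"
    if "passage" in name or "doc" in name:
        return "passage"
    return default_type

print(resolve_modality(input_type="document"))   # passage
print(resolve_modality(model="e5-passage"))      # passage
print(resolve_modality(model="my-embedder"))     # query
```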

Long inputs (automatic)

The base model accepts at most 512 tokens. Instead of truncating, nanoE5.c splits longer text into windows of at most 510 tokens (512 minus the two special tokens), embeds each window, and returns the token-count-weighted average of the window embeddings, re-normalized to unit length. A weighted average of per-window means equals the mean over every token in the document, so the result matches whole-document mean-pooling, except that attention never crosses a window boundary. No API change is needed: just pass a long string. Memory stays bounded (~350 MB) even for million-token inputs.
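The averaging identity can be checked on toy data. In the sketch below the per-token vectors are made-up numbers standing in for the transformer's outputs, the windows are assumed to be consecutive and non-overlapping, and all function names are illustrative rather than taken from e5.c.

```python
import math

def split_windows(token_vectors, max_len=510):
    """Split a token sequence into consecutive windows of at most max_len items."""
    return [token_vectors[i:i + max_len] for i in range(0, len(token_vectors), max_len)]

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def pooled_embedding(token_vectors, max_len=510):
    """Token-count-weighted average of per-window means, then re-normalized."""
    total = len(token_vectors)
    acc = [0.0] * len(token_vectors[0])
    for window in split_windows(token_vectors, max_len):
        weight = len(window) / total
        acc = [a + weight * m for a, m in zip(acc, mean(window))]
    return l2_normalize(acc)

# 1200 fake 4-dim token vectors (placeholders for real hidden states).
tokens = [[((i * 7 + j * 3) % 11) / 10.0 for j in range(4)] for i in range(1200)]
print([len(w) for w in split_windows(tokens)])   # [510, 510, 180]

windowed = pooled_embedding(tokens)
whole = l2_normalize(mean(tokens))
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(windowed, whole)))   # True
```

The identity holds only for the pooling step: in the real engine each window's token vectors come from a separate forward pass, so they differ from what a single pass over the whole text would produce.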

CLI

The same binary is also a quick CLI:

./e5 query   "how much protein should a female eat"
./e5 passage "a document to index"
./e5 --model e5-small-q4.bin query "use an external model file"

How it works (short version)

4-bit weights (Q4_0). Every large matrix is stored in 32-weight blocks with an fp16 scale (~4.5 bits/weight), which is ~7× less memory traffic than fp32 (32 bits/weight).
int8 × int4 matmul. Activations are quantized to int8 and multiplied against the 4-bit weights with AVX2 integer MACs — no fp32 dequant in the hot loop. Scalar fallback included for non-AVX CPUs.
One pass per batch. All tokens of a batch share a single matmul per layer, so weights stream once; attention runs per text.
OpenMP across matrix rows / texts; deterministic regardless of thread count.
Faithful tokenizer. XLM-RoBERTa SentencePiece-unigram (Viterbi) with the real Precompiled normalizer baked in as a per-codepoint table.
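The first two points can be illustrated in plain Python. This is a scalar sketch of block-wise 4-bit quantization and an int8 × int4 dot product, not the engine's AVX2 kernel: the symmetric scale (`max|w| / 7`), the nibble order, the +8 offset, and the rounding mode are assumptions, and the real layout in e5.c may differ.

```python
import math
import struct

BLOCK = 32  # weights per quantization block

def to_fp16(x):
    """Round a float to the nearest fp16 value (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_q4(weights):
    """Quantize one 32-weight block to (fp16 scale, 16 packed bytes)."""
    amax = max(abs(w) for w in weights)
    scale = to_fp16(amax / 7.0)
    if scale == 0.0:                      # all-zero (or underflowing) block
        return 0.0, bytes([0x88] * (BLOCK // 2))
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    # Two 4-bit values per byte, stored with a +8 offset so they fit 0..15.
    packed = bytes((q[i] + 8) | ((q[i + 1] + 8) << 4) for i in range(0, BLOCK, 2))
    return scale, packed

def unpack_q4(packed):
    """Recover the signed 4-bit integers (-8..7) from the packed bytes."""
    out = []
    for b in packed:
        out.append((b & 0x0F) - 8)
        out.append((b >> 4) - 8)
    return out

def quantize_int8(acts):
    """Quantize activations to int8 with one float scale for the block."""
    amax = max(abs(a) for a in acts)
    if amax == 0.0:
        return 0.0, [0] * len(acts)
    scale = amax / 127.0
    return scale, [round(a / scale) for a in acts]

def dot_q4_q8(w_scale, w_packed, a_scale, a_q):
    """Integer multiply-accumulate, then a single float rescale per block."""
    acc = sum(wq * aq for wq, aq in zip(unpack_q4(w_packed), a_q))
    return w_scale * a_scale * acc

weights = [0.1 * math.sin(i) for i in range(BLOCK)]
acts = [math.cos(i) for i in range(BLOCK)]
w_scale, w_packed = quantize_q4(weights)
a_scale, a_q = quantize_int8(acts)

bits_per_weight = (len(w_packed) + 2) * 8 / BLOCK   # 16 data bytes + 2-byte fp16 scale
print(bits_per_weight)                               # 4.5
exact = sum(w * a for w, a in zip(weights, acts))
print(exact, dot_q4_q8(w_scale, w_packed, a_scale, a_q))   # fp reference vs quantized
```

The inner loop touches only integers; the two float scales are applied once per block, which is what lets a SIMD kernel stay in integer arithmetic.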

The model is packed into one binary blob by convert.py; e5.c is the entire engine (loader, tokenizer, BERT, quantized matmul); server.c adds the HTTP server and CLI; e5.py is the ctypes wrapper.

Performance

On a Ryzen 7 5800X3D (8 cores / 16 threads, AVX2):

| Metric | nanoE5.c (4-bit) | sentence-transformers (fp32) |
| --- | --- | --- |
| single-query latency (hot) | ~2 ms | ~13 ms |
| batch throughput | ~190–340 texts/s | ~280 texts/s |
| model size | 72 MB | 471 MB |
| dependencies | libc, libm, OpenMP | torch + transformers |
| cold start | instant (mmap) | seconds |

For online serving (one query at a time, model hot), nanoE5.c is ~7× faster per call. For huge offline batch jobs, PyTorch's oneDNN GEMM can edge ahead on raw throughput, but nanoE5.c runs at about 1/6th the model footprint and with zero ML dependencies.

Validate & stress

make test     # cosine parity vs the fp32 HF reference + speed
make stress   # hard edge-case / concurrency / server suite

make stress throws adversarial inputs at every layer and asserts: no crashes, no hangs, finite & unit-norm outputs, determinism, batch == single (exact), server == binding parity, base64 == float parity, real OpenAI-client compatibility, correct 4xx handling for malformed requests, survival of a raw garbage barrage, and 400 concurrent requests with zero errors or races.
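Two of these assertions (finite, unit-norm outputs and determinism) are easy to reproduce in your own tests. The sketch below is not taken from stress_test.py: `check_embedding`, `check_deterministic`, and the `1e-3` tolerance are illustrative, and `fake_embed` is a stand-in for a real call such as `model.query`.

```python
import math

def check_embedding(vec, dim=384, tol=1e-3):
    """Raise ValueError unless vec has `dim` finite entries and unit L2 norm."""
    if len(vec) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vec)}")
    if not all(math.isfinite(x) for x in vec):
        raise ValueError("embedding contains NaN or inf")
    norm = math.sqrt(sum(x * x for x in vec))
    if abs(norm - 1.0) > tol:
        raise ValueError(f"embedding is not unit-norm: {norm}")
    return norm

def check_deterministic(embed_fn, text, runs=3):
    """Raise ValueError if repeated calls on the same text differ."""
    first = list(embed_fn(text))
    for _ in range(runs - 1):
        if list(embed_fn(text)) != first:
            raise ValueError("embedding changed between identical calls")

def fake_embed(text, dim=384):
    """Stand-in embedder so the checks can run without the model."""
    raw = [math.sin((i + 1) * (len(text) + 1)) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

check_embedding(fake_embed("how much protein per day"))
check_deterministic(fake_embed, "how much protein per day")
print("ok")   # ok
```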

Files

convert.py       build e5-small-q4.bin from the HF checkpoint (one-time)
e5.c / e5.h      the entire inference engine
server.c         OpenAI-compatible HTTP server + CLI
e5.py            Python wrapper (load once, keep hot)
test_parity.py   parity vs HF reference + benchmark
stress_test.py   hard stress / edge-case suite
Makefile

License

The code here is yours to use. The model weights are intfloat/multilingual-e5-small (MIT) — see the model card for details.

Release history

0.1.1 (Jun 3, 2026)
0.1.0 (Jun 3, 2026, this version)

Download files

Source Distribution

nanoe5-0.1.0.tar.gz (65.6 MB)
Uploaded Jun 3, 2026; Source

Built Distributions

nanoe5-0.1.0-py3-none-musllinux_1_2_x86_64.whl (65.9 MB)
Uploaded Jun 3, 2026; Python 3, musllinux: musl 1.2+ x86-64

nanoe5-0.1.0-py3-none-musllinux_1_2_aarch64.whl (65.8 MB)
Uploaded Jun 3, 2026; Python 3, musllinux: musl 1.2+ ARM64

nanoe5-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (65.8 MB)
Uploaded Jun 3, 2026; Python 3, manylinux: glibc 2.17+ x86-64

nanoe5-0.1.0-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (65.8 MB)
Uploaded Jun 3, 2026; Python 3, manylinux: glibc 2.17+ ARM64

nanoe5-0.1.0-py3-none-macosx_11_0_arm64.whl (65.6 MB)
Uploaded Jun 3, 2026; Python 3, macOS 11.0+ ARM64

File details

Details for the file nanoe5-0.1.0.tar.gz.

File metadata

Download URL: nanoe5-0.1.0.tar.gz
Upload date: Jun 3, 2026
Size: 65.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for nanoe5-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d03826461683131e7b16d52c961e2149c9a20265fd5a67689c2fae6144751977`
MD5	`509bc888b0ea535b233b2577c56f8119`
BLAKE2b-256	`045b57f55fccb17fc9cc46fad2e9977782db847f5c9b8b3234389d745ca3dce1`

File details

Details for the file nanoe5-0.1.0-py3-none-musllinux_1_2_x86_64.whl.

File metadata

Download URL: nanoe5-0.1.0-py3-none-musllinux_1_2_x86_64.whl
Upload date: Jun 3, 2026
Size: 65.9 MB
Tags: Python 3, musllinux: musl 1.2+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nanoe5-0.1.0-py3-none-musllinux_1_2_x86_64.whl
Algorithm	Hash digest
SHA256	`6ca4479c6df86ff8dd93f025575b659db5058e43ee64d7f01ac118a99fb6e198`
MD5	`3e12a203284c622d53061cb998c0c1d0`
BLAKE2b-256	`5f721b0b417d15f49174560ed40b16ee3bebc5631446f9b5947cadb70ad60a67`

File details

Details for the file nanoe5-0.1.0-py3-none-musllinux_1_2_aarch64.whl.

File metadata

Download URL: nanoe5-0.1.0-py3-none-musllinux_1_2_aarch64.whl
Upload date: Jun 3, 2026
Size: 65.8 MB
Tags: Python 3, musllinux: musl 1.2+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nanoe5-0.1.0-py3-none-musllinux_1_2_aarch64.whl
Algorithm	Hash digest
SHA256	`f3400b8e9eb168339a8c4a392ac0ba81c76c44cd9dffd89ef1cc32602cc6b53d`
MD5	`552e3637de6f38bbf479d33a2c48cefc`
BLAKE2b-256	`287bd423af1c538efdf0c029b6d5e39833a3f27abe07a22b7a0f96ad51fa39cc`

File details

Details for the file nanoe5-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: nanoe5-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 3, 2026
Size: 65.8 MB
Tags: Python 3, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nanoe5-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`7ae59acb0828f09ffbaa7468db0040c8762f17096fb2fd3fb3dee8785662f052`
MD5	`1632a1a85ec065d4973e3deac7421377`
BLAKE2b-256	`b7d7223487d369129163737dce0584ce5e11af42ebb500c49c519e84e7124679`

File details

Details for the file nanoe5-0.1.0-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

Download URL: nanoe5-0.1.0-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Upload date: Jun 3, 2026
Size: 65.8 MB
Tags: Python 3, manylinux: glibc 2.17+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nanoe5-0.1.0-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm	Hash digest
SHA256	`861737f804ae9e04ce8ddb716f0ee6363c798e102fb6ff3c386bf50ef49e0ac2`
MD5	`b4a9de099db29275f286c37e2c3c88e7`
BLAKE2b-256	`5d098a2f3a465c61836d46b75331a209db495154512ba7115da5de42410d4d62`

File details

Details for the file nanoe5-0.1.0-py3-none-macosx_11_0_arm64.whl.

File metadata

Download URL: nanoe5-0.1.0-py3-none-macosx_11_0_arm64.whl
Upload date: Jun 3, 2026
Size: 65.6 MB
Tags: Python 3, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nanoe5-0.1.0-py3-none-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`86ffd7e7a37dd0ae627151fa08691cb401b37928d690064155420400d9936ae8`
MD5	`dd3108bc783d85daccd36c24a61b50b3`
BLAKE2b-256	`7dd540371b3b36c7c362ab1b4d5254b6f3041e1fa571f938152e750e7055e624`
