nanoE5.c - blazing-fast 4-bit CPU text embeddings (multilingual-e5-small), model bundled, OpenAI-compatible server, zero ML dependencies

nanoE5.c

A blazing-fast, dependency-free CPU engine for multilingual-e5-small text embeddings.

A tiny C core (the .c is the whole point) packaged for one-command use: pip install nanoe5 from Python, or a single self-contained server binary.

The 4-bit model is bundled, so there is nothing to download or configure.

pip install nanoe5

import nanoe5
q = nanoe5.query("how much protein per day")     # 384-dim, L2-normalized
P = nanoe5.passage(["doc a", "doc b"])           # (2, 384)
scores = P @ q                                   # cosine similarity

…or run an OpenAI-compatible server (works with the official openai client):

nanoe5-serve --port 8000           # OpenAI-compatible embeddings API

No PyTorch. No transformers. No ONNX. No BLAS. Just C, libm, and OpenMP.

Why

One file to deploy. The 4-bit model is linked inside the ./e5 binary (~69 MB). Copy it to a server and run — nothing to download, install, or mount.
Fast where it counts. ~2 ms to embed a single query on a desktop CPU — about 7× faster than sentence-transformers for one-at-a-time serving.
Tiny. 72 MB 4-bit model vs 471 MB fp32. Instant startup (mmap).
Faithful. Real XLM-RoBERTa SentencePiece tokenizer + exact BERT forward pass; cosine 0.98–0.99 vs the fp32 reference, retrieval rankings preserved.
Handles long text. Inputs over 512 tokens are windowed automatically and transparently, in bounded memory.

Install

From PyPI (Python)

pip install nanoe5

That's it: the 4-bit model is inside the package. The tiny C engine compiles on install (needs a C compiler with OpenMP, e.g. gcc), then everything runs with no ML dependencies (just NumPy). The fast path needs an x86-64 CPU with AVX2; other CPUs fall back to a portable scalar build automatically.

From source (server binary + CLI)

# 1. download + quantize the model -> e5-small-q4.bin  (one-time, ~72 MB)
make convert        # pip install torch transformers safetensors tokenizers numpy

# 2a. build the self-contained server/CLI binary  ->  ./e5
make server

# 2b. (optional) build the Python shared library   ->  libe5.so
make lib

make convert is the only step that touches the Python ML stack. After it, the binary runs with no ML dependencies at all.

Use it: the OpenAI-compatible server

Start a server with one command — works with the official openai Python client out of the box (verified against openai>=1.0):

pip install nanoe5
nanoe5-serve --port 8000          # OpenAI-compatible embeddings server

from openai import OpenAI                       # the official OpenAI client

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.embeddings.create(
    model="e5-query",                           # see "Query vs passage" below
    input=["how much protein per day", "best protein sources"],
)
embeddings = [d.embedding for d in resp.data]   # two 384-dim vectors

Both encoding_format="float" and the client's default "base64" path are supported, so nothing in your existing OpenAI code needs to change — just point base_url at the server.

Prefer a single dependency-free binary? make server builds ./e5, which embeds the model and serves the same API with zero Python: ./e5 --server --port 8000.

…or hit it with plain curl:

curl http://localhost:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input": ["doc one", "doc two"], "input_type": "passage"}'

{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.031, -0.044, ...]},
    {"object": "embedding", "index": 1, "embedding": [0.018,  0.007, ...]}
  ],
  "model": "multilingual-e5-small-q4",
  "usage": {"prompt_tokens": 8, "total_tokens": 8}
}
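
A response of this shape can be unpacked with the standard library alone. The following is a minimal sketch: `embeddings_from_response` is a hypothetical helper (not part of nanoe5), and sorting by `index` is a defensive assumption rather than something the server is documented to require.

```python
import json

def embeddings_from_response(body: str) -> list:
    """Return the embeddings from a /v1/embeddings response, ordered by "index"."""
    payload = json.loads(body)
    if payload.get("object") != "list":
        raise ValueError("unexpected response object: %r" % payload.get("object"))
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

# Shortened stand-in for a real response (real vectors have 384 entries).
sample = json.dumps({
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.018, 0.007]},
        {"object": "embedding", "index": 0, "embedding": [0.031, -0.044]},
    ],
    "model": "multilingual-e5-small-q4",
    "usage": {"prompt_tokens": 8, "total_tokens": 8},
})
vectors = embeddings_from_response(sample)
print(len(vectors), vectors[0])
```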

Endpoints

Method & path	Purpose
`POST /v1/embeddings`	Create embeddings (string or array of strings).
`GET /v1/models`	List the served model.
`GET /health`	Liveness check → `{"status":"ok"}`.

Request fields

Field	Values	Default
`input`	a string or an array of strings	required
`encoding_format`	`"float"` or `"base64"`	`"float"`
`input_type`	`"query"`, `"passage"` (alias `"document"`)	server default
`model`	any string; if it contains `query`/`passage`/`doc` it sets the modality	—

encoding_format: "base64" returns each embedding as base64-encoded little-endian float32 — this is what the official OpenAI Python client requests by default, and it's fully supported.
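
If you call the endpoint without the openai client, the base64 payload has to be decoded by hand. A minimal sketch, assuming only what is stated above (little-endian float32 values, base64-encoded); `decode_embedding` and `encode_embedding` are illustrative names, not part of nanoe5:

```python
import base64
import struct

def decode_embedding(b64: str) -> list:
    """Decode base64-encoded little-endian float32 values into Python floats."""
    raw = base64.b64decode(b64)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack("<%df" % count, raw))

def encode_embedding(values) -> str:
    """Inverse of decode_embedding; used here only to build a demo payload."""
    packed = struct.pack("<%df" % len(values), *values)
    return base64.b64encode(packed).decode("ascii")

# Values chosen to be exactly representable in float32.
payload = encode_embedding([0.5, -0.25, 1.0])
decoded = decode_embedding(payload)
print(decoded)
```

A real 384-dim embedding decodes from 1536 raw bytes (384 × 4).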

Server flags

Both server forms take the same flags:

nanoe5-serve  [--host H] [--port P] [--threads N] [--default-type query|passage] [--model FILE]
./e5 --server [--host H] [--port P] [--threads N] [--default-type query|passage] [--model FILE]

--threads N caps OpenMP threads (default: all cores).
--default-type sets the modality when a request doesn't specify one (default query).
--model FILE loads an external e5-small-q4.bin (the binary otherwise uses its embedded copy; the pip server uses the bundled one).

Use it: from Python

The simplest form uses module-level helpers backed by a shared, hot model (loaded once, reused for every call):

import nanoe5

q = nanoe5.query("how much protein per day")     # (384,)
docs = nanoe5.passage([                            # (N, 384)
    "The recommended protein intake for adult women is about 46 g/day.",
    "Mount Everest is the highest mountain above sea level.",
])

scores = docs @ q          # already L2-normalized -> dot product = cosine
print(scores.argmax())     # -> 0

Or hold an explicit handle (e.g. to cap threads):

from nanoe5 import E5

model = E5(num_threads=8)              # explicit handle with a thread cap
q = model.query("...")
P = model.passage(["...", "..."])

That's the whole API:

Call	Prefix added	Returns
`nanoe5.query(text | list)` / `model.query(...)`	`query:`	`(384,)` or `(N, 384)` `float32`
`nanoe5.passage(text | list)` / `model.passage(...)`	`passage:`	`(384,)` or `(N, 384)` `float32`
`nanoe5.encode(x, is_query=False)` / `model.encode(...)`	either	generic form

A single text is parallelized across all CPU cores (low latency); a list is parallelized across texts (high throughput).

Query vs passage

multilingual-e5-small is trained with two prefixes, and you should use the right one:

query: — short search queries / questions.
passage: — documents you want to retrieve.

Embed your documents with passage, your search queries with query, then rank documents by cosine similarity (a plain dot product, since outputs are normalized).
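
Ranking by dot product can be sketched in plain Python. The 3-dim vectors below are toy stand-ins for the 384-dim vectors nanoe5 returns, and `normalize` / `rank` are hypothetical helpers written for this illustration:

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def rank(query_vec, passage_vecs, top_k=3):
    """Return (index, score) pairs sorted by descending dot product."""
    scores = [sum(q * p for q, p in zip(query_vec, vec)) for vec in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]

query = normalize([1.0, 0.0, 0.0])
passages = [normalize(v) for v in ([0.0, 1.0, 0.0], [3.0, 4.0, 0.0], [1.0, 0.0, 0.0])]
ranking = rank(query, passages, top_k=2)
print(ranking)   # passage 2 ranks first, then passage 1
```

Because the inputs are unit-norm, the dot product equals the cosine similarity and no extra division is needed.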

Python: model.query(...) vs model.passage(...).
Server: set "input_type": "query" or "passage" per request (or name the model e5-query / e5-passage), otherwise the server's --default-type is used.
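
The selection rules above can be written out as a small function. This is only a sketch of the described behavior, not the server's code: the precedence (explicit `input_type`, then a hint in the model name, then the default) and the case-insensitive match are assumptions, and `resolve_modality` is a hypothetical name.

```python
def resolve_modality(input_type=None, model=None, default_type="query"):
    """Pick "query" or "passage" for one request.

    Assumed precedence: explicit input_type, then a hint in the model
    name, then the server default. The real server may differ.
    """
    if input_type is not None:
        aliases = {"query": "query", "passage": "passage", "document": "passage"}
        if input_type not in aliases:
            raise ValueError("unknown input_type: %r" % input_type)
        return aliases[input_type]
    name = (model or "").lower()
    if "query" in name:
        return "query"
    if "passage" in name or "doc" in name:
        return "passage"
    return default_type

print(resolve_modality(model="e5-passage"))
print(resolve_modality(input_type="document"))
```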

Long inputs (automatic)

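The base model maxes out at 512 tokens. Instead of truncating, nanoE5.c slides a window over longer text: it splits the input into ≤510-token windows, embeds each, and returns the token-count-weighted average of the window embeddings (then re-normalizes). Since each window embedding is itself a mean over that window's tokens, the weighted average matches mean-pooling over every token in the document; the one difference from a single full-length pass is that attention only spans one window at a time. No API change is needed: just pass a long string. Memory stays bounded (~350 MB) even for million-token inputs.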
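
The averaging step can be illustrated with a toy encoder. This sketch assumes consecutive, non-overlapping windows and that each window vector is an unnormalized mean over its tokens; neither detail is confirmed here, and `embed_long`, `split_windows`, and `toy_embed` are hypothetical names, not the engine's functions.

```python
import math

WINDOW = 510  # window size quoted above

def split_windows(token_ids, size=WINDOW):
    """Cut a token list into consecutive, non-overlapping windows."""
    return [token_ids[i:i + size] for i in range(0, len(token_ids), size)]

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def embed_long(token_ids, embed_window, size=WINDOW):
    """Token-count-weighted average of window vectors, then L2-normalize."""
    if not token_ids:
        raise ValueError("empty input")
    total = len(token_ids)
    acc = None
    for window in split_windows(token_ids, size):
        vec = embed_window(window)           # unnormalized mean-pooled vector
        if acc is None:
            acc = [0.0] * len(vec)
        weight = len(window) / total
        for j, value in enumerate(vec):
            acc[j] += weight * value
    return l2_normalize(acc)

def toy_embed(window):
    """Stand-in encoder: token t has state [t, 1.0]; return the window mean."""
    return [sum(window) / len(window), 1.0]

tokens = list(range(1, 1201))                # 1200 tokens -> windows of 510, 510, 180
windowed = embed_long(tokens, toy_embed)
whole = l2_normalize([sum(tokens) / len(tokens), 1.0])
print(max(abs(a - b) for a, b in zip(windowed, whole)))
```

With this toy encoder the windowed result agrees with pooling over all tokens at once, up to floating-point rounding. A real transformer would additionally differ because attention cannot cross window boundaries.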

CLI

The same binary is also a quick CLI:

./e5 query   "how much protein should a female eat"
./e5 passage "a document to index"
./e5 --model e5-small-q4.bin query "use an external model file"

How it works (short version)

4-bit weights (Q4_0). Every large matrix is stored in 32-weight blocks with an fp16 scale (~4.5 bits/weight), which is ~7× less memory traffic than fp32 (32 / 4.5 ≈ 7.1).
int8 × int4 matmul. Activations are quantized to int8 and multiplied against the 4-bit weights with AVX2 integer MACs, with no fp32 dequant in the hot loop. Scalar fallback included for CPUs without AVX2.
One pass per batch. All tokens of a batch share a single matmul per layer, so weights stream once; attention runs per text.
OpenMP across matrix rows / texts; deterministic regardless of thread count.
Faithful tokenizer. XLM-RoBERTa SentencePiece-unigram (Viterbi) with the real Precompiled normalizer baked in as a per-codepoint table.
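
To make the first two points concrete, here is a small Python model of one 32-weight block. It is a simplified symmetric scheme written for illustration (scale = max |w| / 7, codes clamped to [-8, 7]); the exact rounding and bit layout in e5.c may differ, and all function names are hypothetical.

```python
import struct

BLOCK = 32                                   # weights per block, as described above
BITS_PER_WEIGHT = (BLOCK * 4 + 16) / BLOCK   # 4-bit codes + one fp16 scale per block

def to_fp16(x):
    """Round a float to the nearest value representable in fp16."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_q4_block(weights):
    """Return (scale, codes) for one block; codes are ints in [-8, 7]."""
    if len(weights) != BLOCK:
        raise ValueError("expected %d weights" % BLOCK)
    amax = max(abs(w) for w in weights)
    scale = to_fp16(amax / 7.0)
    if scale == 0.0:
        return 0.0, [0] * BLOCK
    return scale, [max(-8, min(7, round(w / scale))) for w in weights]

def quantize_int8(acts):
    """Return (scale, codes) with codes in [-127, 127]."""
    amax = max(abs(a) for a in acts)
    if amax == 0.0:
        return 0.0, [0] * len(acts)
    scale = amax / 127.0
    return scale, [max(-127, min(127, round(a / scale))) for a in acts]

def block_dot(w_scale, w_codes, a_scale, a_codes):
    """Integer multiply-accumulate; both scales are applied once at the end."""
    acc = sum(w * a for w, a in zip(w_codes, a_codes))   # integers only
    return acc * w_scale * a_scale

weights = [((i % 7) - 3) / 10.0 for i in range(BLOCK)]
acts = [((i % 5) - 2) / 4.0 for i in range(BLOCK)]
w_scale, w_codes = quantize_q4_block(weights)
a_scale, a_codes = quantize_int8(acts)
approx = block_dot(w_scale, w_codes, a_scale, a_codes)
exact = sum(w * a for w, a in zip(weights, acts))
print(BITS_PER_WEIGHT, approx, exact)
```

The point of `block_dot` is that the inner loop touches only small integers; the floating-point scales are applied once per block, which is what lets a SIMD kernel stay in integer arithmetic.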

The model is packed into one binary blob by convert.py; e5.c is the entire engine (loader, tokenizer, BERT, quantized matmul); server.c adds the HTTP server and CLI; e5.py is the ctypes wrapper.

Performance

On a Ryzen 7 5800X3D (8 cores / 16 threads, AVX2):

	nanoE5.c (4-bit)	sentence-transformers (fp32)
single-query latency (hot)	~2 ms	~13 ms
batch throughput	~190–340 texts/s	~280 texts/s
model size	72 MB	471 MB
dependencies	libc, libm, OpenMP	torch + transformers
cold start	instant (mmap)	seconds

For online serving (one query at a time, model hot), nanoE5.c is ~7× faster per call (~2 ms vs ~13 ms). For huge offline batch jobs, PyTorch's oneDNN GEMM edges ahead on raw throughput, while nanoE5.c runs at about 1/6th the model footprint and with no ML dependencies.

Validate & stress

make test     # cosine parity vs the fp32 HF reference + speed
make stress   # hard edge-case / concurrency / server suite

make stress throws adversarial inputs at every layer and asserts: no crashes, no hangs, finite & unit-norm outputs, determinism, batch == single (exact), server == binding parity, base64 == float parity, real OpenAI-client compatibility, correct 4xx handling for malformed requests, survival of a raw garbage barrage, and 400 concurrent requests with zero errors or races.
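
A few of these invariants are easy to check in your own code as well. The sketch below is not taken from stress_test.py; `check_embedding`, `check_deterministic`, and the 1e-3 norm tolerance are assumptions made for illustration.

```python
import math

DIM = 384  # embedding width stated in this README

def check_embedding(vec, dim=DIM, tol=1e-3):
    """Raise ValueError unless vec has dim finite entries and unit L2 norm."""
    if len(vec) != dim:
        raise ValueError("expected %d values, got %d" % (dim, len(vec)))
    if not all(math.isfinite(x) for x in vec):
        raise ValueError("embedding contains NaN or infinity")
    norm = math.sqrt(sum(x * x for x in vec))
    if abs(norm - 1.0) > tol:
        raise ValueError("norm %.6f is not within %g of 1" % (norm, tol))
    return norm

def check_deterministic(embed, text, runs=3):
    """Call embed(text) several times and require identical results."""
    first = embed(text)
    return all(embed(text) == first for _ in range(runs - 1))

# Stand-in vector; with nanoe5 installed you would pass real outputs instead.
unit = [1.0 / math.sqrt(DIM)] * DIM
print(check_embedding(unit))
print(check_deterministic(lambda text: unit, "hello"))
```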

Files

e5.c / e5.h      the entire inference engine
server.c         OpenAI-compatible HTTP server + CLI
convert.py       build e5-small-q4.bin from the HF checkpoint (one-time)
nanoe5/          the pip package (engine + 4-bit model bundled)
pyproject.toml / setup.py   packaging (compiles the engine, bundles the model)
e5.py            standalone ctypes wrapper (repo-local use)
test_parity.py   parity vs HF reference + benchmark
stress_test.py   hard stress / edge-case suite
Makefile

License

The code here is yours to use. The model weights are intfloat/multilingual-e5-small (MIT) — see the model card for details.

Release history

0.1.1 (this version), Jun 3, 2026
0.1.0, Jun 3, 2026

Download files

Source Distribution

nanoe5-0.1.1.tar.gz (65.6 MB), uploaded Jun 3, 2026, Source

Built Distributions

nanoe5-0.1.1-py3-none-musllinux_1_2_x86_64.whl (65.9 MB), uploaded Jun 3, 2026, Python 3, musllinux: musl 1.2+ x86-64
nanoe5-0.1.1-py3-none-musllinux_1_2_aarch64.whl (65.8 MB), uploaded Jun 3, 2026, Python 3, musllinux: musl 1.2+ ARM64
nanoe5-0.1.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (65.8 MB), uploaded Jun 3, 2026, Python 3, manylinux: glibc 2.17+ x86-64
nanoe5-0.1.1-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (65.8 MB), uploaded Jun 3, 2026, Python 3, manylinux: glibc 2.17+ ARM64
nanoe5-0.1.1-py3-none-macosx_11_0_arm64.whl (65.6 MB), uploaded Jun 3, 2026, Python 3, macOS 11.0+ ARM64

File details

Details for the file nanoe5-0.1.1.tar.gz.

File metadata

Download URL: nanoe5-0.1.1.tar.gz
Upload date: Jun 3, 2026
Size: 65.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nanoe5-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`34c0fb50cf84dd43951faeeb0922ab7e5d464887d8793b459a357da92f513a2b`
MD5	`904c3d28dbdd2adf307320070f9d24e9`
BLAKE2b-256	`83d0a108a717ed4d74c91f6dcafe069cb42eef5b9e11551930d53d7893051287`

File details

Details for the file nanoe5-0.1.1-py3-none-musllinux_1_2_x86_64.whl.

File metadata

Download URL: nanoe5-0.1.1-py3-none-musllinux_1_2_x86_64.whl
Upload date: Jun 3, 2026
Size: 65.9 MB
Tags: Python 3, musllinux: musl 1.2+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nanoe5-0.1.1-py3-none-musllinux_1_2_x86_64.whl
Algorithm	Hash digest
SHA256	`d1d519b6acd9a3f57ea4a04d1abf5b54d3ff8de0bf5db00bad30eec826be6014`
MD5	`3c3c878f2bd8d033e70e056f9875ebcf`
BLAKE2b-256	`fe57169e889fc859dc0219083e0cafc65107bad1fd4b209bb5cddb801b31e8fc`

File details

Details for the file nanoe5-0.1.1-py3-none-musllinux_1_2_aarch64.whl.

File metadata

Download URL: nanoe5-0.1.1-py3-none-musllinux_1_2_aarch64.whl
Upload date: Jun 3, 2026
Size: 65.8 MB
Tags: Python 3, musllinux: musl 1.2+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nanoe5-0.1.1-py3-none-musllinux_1_2_aarch64.whl
Algorithm	Hash digest
SHA256	`15d3409860d2adfcd9353ab6f213986055e10e23873033c5e19c21dc7999b21a`
MD5	`0e2ca265dab7bdaede25fb5361a2fe94`
BLAKE2b-256	`82e2560ffa85a113cdb62b0eed4f3c7215a5c4ed008138295bf35f6ff843134f`

File details

Details for the file nanoe5-0.1.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: nanoe5-0.1.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 3, 2026
Size: 65.8 MB
Tags: Python 3, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nanoe5-0.1.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`1de6f7beff191b3dcffc761e68295a20c49af308ec3d6b978cf4783025b2098c`
MD5	`d5a54fc7520552160ae34b7fe645c7e5`
BLAKE2b-256	`c8a6cc76e4c8d1c1104e3917565a509ccbac1ee0a0c31fde85000da3c3e9621f`

File details

Details for the file nanoe5-0.1.1-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

Download URL: nanoe5-0.1.1-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Upload date: Jun 3, 2026
Size: 65.8 MB
Tags: Python 3, manylinux: glibc 2.17+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nanoe5-0.1.1-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm	Hash digest
SHA256	`40c359ffb1bac0a2713a0cbc79799775368519add88a6aeb982cc5a463d4ef21`
MD5	`f0fd9de9032c41634463d67528abe06a`
BLAKE2b-256	`152a104e4fedfd1eaa3391a5daaeca5be4f9d078edb93d9f92de2e8932a7a930`

File details

Details for the file nanoe5-0.1.1-py3-none-macosx_11_0_arm64.whl.

File metadata

Download URL: nanoe5-0.1.1-py3-none-macosx_11_0_arm64.whl
Upload date: Jun 3, 2026
Size: 65.6 MB
Tags: Python 3, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nanoe5-0.1.1-py3-none-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`2a7455ba42db15905a37bf29577d8fa27cf6cc9104603777d0e2bd433ca66023`
MD5	`875389ed186c9a00ddeae259c2dd90bc`
BLAKE2b-256	`77e662e1adc2acb7514d5a449918da53134f7933682caf89b15c474f13935f87`
