nanoE5.c - blazing-fast 4-bit CPU text embeddings (multilingual-e5-small), model bundled, zero ML dependencies

nanoE5.c

A blazing-fast, dependency-free CPU engine for multilingual-e5-small text embeddings.

A tiny C core (the .c is the whole point) packaged for one-command use: pip install nanoe5 from Python, or a single self-contained server binary.

Ships as a single self-contained binary with the 4-bit model baked inside — run an OpenAI-compatible embeddings server with one file and zero dependencies. Or call it from Python and keep the model hot in RAM.

./e5 --server --port 8000          # OpenAI-compatible server, one file, no deps

from e5 import E5
model = E5()
vec = model.query("how much protein per day")   # 384-dim, L2-normalized

No PyTorch. No transformers. No ONNX. No BLAS. Just C, libm, and OpenMP.


Why

  • One file to deploy. The 4-bit model is linked inside the ./e5 binary (~69 MB). Copy it to a server and run — nothing to download, install, or mount.
  • Fast where it counts. ~2 ms to embed a single query on a desktop CPU — about 7× faster than sentence-transformers for one-at-a-time serving.
  • Tiny. 72 MB 4-bit model vs 471 MB fp32. Instant startup (mmap).
  • Faithful. Real XLM-RoBERTa SentencePiece tokenizer + exact BERT forward pass; cosine 0.98–0.99 vs the fp32 reference, retrieval rankings preserved.
  • Handles long text. Inputs over 512 tokens are windowed automatically and transparently, in bounded memory.

Install

To build from source you need a C compiler with OpenMP support (gcc or clang) and, for the one-time model conversion, Python.

# 1. download + quantize the model -> e5-small-q4.bin  (one-time, ~72 MB)
make convert        # pip install torch transformers safetensors tokenizers numpy

# 2a. build the self-contained server/CLI binary  ->  ./e5
make server

# 2b. (optional) build the Python shared library   ->  libe5.so
make lib

make convert is the only step that touches the Python ML stack. After it, the binary and the Python library run with no ML dependencies at all.


Use it: the server

Start it (the model is already inside the binary):

./e5 --server --host 0.0.0.0 --port 8000

It speaks the OpenAI embeddings API, so any OpenAI client works unchanged:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.embeddings.create(
    model="e5-query",                       # see "Query vs passage" below
    input=["how much protein per day", "best protein sources"],
)
embeddings = [d.embedding for d in resp.data]   # two 384-dim vectors

…or just curl:

curl http://localhost:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input": ["doc one", "doc two"], "input_type": "passage"}'

{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.031, -0.044, ...]},
    {"object": "embedding", "index": 1, "embedding": [0.018,  0.007, ...]}
  ],
  "model": "multilingual-e5-small-q4",
  "usage": {"prompt_tokens": 8, "total_tokens": 8}
}

Endpoints

Method & path          Purpose
POST /v1/embeddings    Create embeddings (string or array of strings).
GET  /v1/models        List the served model.
GET  /health           Liveness check → {"status":"ok"}.

Request fields

Field            Values                                                             Default
input            a string or an array of strings                                    required
encoding_format  "float" or "base64"                                                "float"
input_type       "query", "passage" (alias "document")                              server default
model            any string; if it contains query/passage/doc it sets the modality

encoding_format: "base64" returns each embedding as base64-encoded little-endian float32 — this is what the official OpenAI Python client requests by default, and it's fully supported.
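If you call the server without the OpenAI client, you have to decode the base64 payload yourself. The following is a minimal sketch using only the standard library; `decode_embedding` is a hypothetical helper name, and the three-value sample vector is made up for the round-trip check (real vectors have 384 values).

```python
import base64
import struct

def decode_embedding(b64: str) -> list[float]:
    """Decode a base64 string of little-endian float32 values into floats."""
    raw = base64.b64decode(b64)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    # "<" = little-endian, "f" = 4-byte IEEE 754 float32
    return list(struct.unpack(f"<{count}f", raw))

# Round-trip check with a made-up vector whose values are exact in float32.
sample = [0.5, -0.25, 0.125]
encoded = base64.b64encode(struct.pack("<3f", *sample)).decode("ascii")
decoded = decode_embedding(encoded)
```

With NumPy available, `numpy.frombuffer(raw, dtype="<f4")` does the same job without the intermediate Python list.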

Server flags

./e5 --server [--host H] [--port P] [--threads N]
              [--default-type query|passage] [--model FILE]
  • --threads N caps OpenMP threads (default: all cores).
  • --default-type sets the modality when a request doesn't specify one (default query).
  • --model FILE loads an external e5-small-q4.bin instead of the embedded one.

Use it: from Python

The Python wrapper loads the model once and keeps it hot in RAM — every call reuses it with zero reload cost.

from e5 import E5

model = E5()                                    # loads e5-small-q4.bin, stays hot

# single text -> shape (384,)
q = model.query("how much protein per day")

# a list -> shape (N, 384)
docs = model.passage([
    "The recommended protein intake for adult women is about 46 g/day.",
    "Mount Everest is the highest mountain above sea level.",
])

# cosine similarity (vectors are already L2-normalized, so just a dot product)
scores = docs @ q
print(scores.argmax())                          # -> 0

That's the whole API:

Method                                      Prefix added   Returns
model.query(text | list)                    query:         (384,) or (N, 384) float32
model.passage(text | list)                  passage:       (384,) or (N, 384) float32
model.encode(text | list, is_query=False)   either         generic form

E5(model_path=..., lib_path=..., num_threads=...) lets you point at a specific model/library or cap threads.

A single text is parallelized across all CPU cores (low latency); a list is parallelized across texts (high throughput).


Query vs passage

multilingual-e5-small is trained with two prefixes, and you should use the right one:

  • query: — short search queries / questions.
  • passage: — documents you want to retrieve.

Embed your documents with passage, your search queries with query, then rank documents by cosine similarity (a plain dot product, since outputs are normalized).

  • Python: model.query(...) vs model.passage(...).
  • Server: set "input_type": "query" or "passage" per request (or name the model e5-query / e5-passage), otherwise the server's --default-type is used.
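The server-side selection rule described above can be sketched as a small function. This is an illustration, not the code in server.c: `resolve_prefix` is a hypothetical name, and two details are assumptions the README does not state, namely that `input_type` takes precedence over the model name and that an unknown `input_type` is rejected.

```python
def resolve_prefix(request: dict, default_type: str = "query") -> str:
    """Pick the E5 text prefix for a request body (hypothetical helper)."""
    aliases = {"query": "query", "passage": "passage", "document": "passage"}

    # Assumption: an explicit input_type wins over the model name.
    input_type = request.get("input_type")
    if input_type is not None:
        if input_type not in aliases:
            raise ValueError(f"unknown input_type: {input_type!r}")
        return aliases[input_type] + ": "

    # Otherwise look for a modality hint inside the model string.
    model = str(request.get("model", "")).lower()
    if "query" in model:
        return "query: "
    if "passage" in model or "doc" in model:
        return "passage: "

    # Fall back to the value given by --default-type.
    return aliases[default_type] + ": "

prefix = resolve_prefix({"model": "e5-passage", "input": ["doc one"]})
```

The returned prefix is what gets prepended to each input string before tokenization.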

Long inputs (automatic)

The base model maxes out at 512 tokens. Instead of truncating, nanoE5.c slides a window over longer text: it splits the input into ≤510-token windows (510 plus the two special tokens gives 512), embeds each, and returns the token-count-weighted average, re-normalized to unit length. Because each window's embedding is itself a mean over its tokens, the weighted average equals mean-pooling over every token of the document, with the caveat that each token attends only within its own window. No API change is needed: just pass a long string. Memory stays bounded (~350 MB) even for million-token inputs.
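The pooling arithmetic can be checked with a toy example. This sketch assumes non-overlapping windows and un-normalized per-window means, neither of which the README spells out, and it uses fixed made-up "hidden states" in place of the real model, so it demonstrates only the averaging step and not the effect of windowed attention. All function names are hypothetical.

```python
import math

WINDOW = 510  # max content tokens per window, as stated above

def mean_pool(token_vecs):
    """Average a list of equal-length vectors, column by column."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def embed_long(token_vecs, window=WINDOW):
    """Token-count-weighted average of per-window mean-pooled vectors."""
    total = len(token_vecs)
    acc = [0.0] * len(token_vecs[0])
    for start in range(0, total, window):
        chunk = token_vecs[start:start + window]
        pooled = mean_pool(chunk)        # stand-in for embedding one window
        weight = len(chunk) / total      # longer windows count for more
        acc = [a + weight * p for a, p in zip(acc, pooled)]
    return l2_normalize(acc)

# Toy data: 1200 "tokens" with 4-dim states (the real model uses 384 dims).
tokens = [[((i * 7 + d * 3) % 11) / 11.0 for d in range(4)] for i in range(1200)]
windowed = embed_long(tokens)                 # windows of 510, 510, 180 tokens
direct = l2_normalize(mean_pool(tokens))      # one mean over all 1200 tokens
max_diff = max(abs(a - b) for a, b in zip(windowed, direct))
```

`max_diff` comes out at floating-point rounding level, which is the sense in which the weighted average matches whole-document mean-pooling. Only one window needs to be resident at a time, which is consistent with the bounded-memory claim.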


CLI

The same binary is also a quick CLI:

./e5 query   "how much protein should a female eat"
./e5 passage "a document to index"
./e5 --model e5-small-q4.bin query "use an external model file"

How it works (short version)

  • 4-bit weights (Q4_0). Every large matrix is stored in 32-weight blocks with an fp16 scale (~4.5 bits/weight), so weight reads move about 7× fewer bytes than fp32 (32 bits vs ~4.5 bits per weight).
  • int8 × int4 matmul. Activations are quantized to int8 and multiplied against the 4-bit weights with AVX2 integer MACs — no fp32 dequant in the hot loop. Scalar fallback included for non-AVX CPUs.
  • One pass per batch. All tokens of a batch share a single matmul per layer, so weights stream once; attention runs per text.
  • OpenMP across matrix rows / texts; deterministic regardless of thread count.
  • Faithful tokenizer. XLM-RoBERTa SentencePiece-unigram (Viterbi) with the real Precompiled normalizer baked in as a per-codepoint table.
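The first two bullets can be illustrated for a single block. This is a scalar sketch, not the AVX2 kernel in e5.c: the rounding and the "code minus 8" offset follow the commonly used ggml-style Q4_0 convention, which is an assumption about this project, and the per-vector int8 activation scale is likewise assumed. Function names are hypothetical.

```python
import math
import struct

BLOCK = 32  # weights per block, as described above

def fp16_round(x):
    """Round a float to the nearest fp16 value (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_q4_block(weights):
    """Return (scale, codes): one fp16 scale plus 32 codes in 0..15."""
    peak = max(weights, key=abs)          # signed value of largest magnitude
    scale = fp16_round(peak / -8.0)
    if scale == 0.0:
        return 0.0, [8] * BLOCK           # code 8 decodes to zero
    codes = [min(15, max(0, int(w / scale + 8.5))) for w in weights]
    return scale, codes

def quantize_int8(acts):
    """Symmetric int8 quantization with a single scale for the vector."""
    peak = max(abs(a) for a in acts)
    if peak == 0.0:
        return 0.0, [0] * len(acts)
    scale = peak / 127.0
    return scale, [max(-127, min(127, round(a / scale))) for a in acts]

def block_dot(w_scale, w_codes, a_scale, a_q):
    """Integer multiply-accumulate, then one float rescale per block."""
    acc = sum((c - 8) * q for c, q in zip(w_codes, a_q))
    return acc * w_scale * a_scale

# Storage cost: 32 four-bit codes plus one 16-bit scale per block.
bits_per_weight = (BLOCK * 4 + 16) / BLOCK

# Made-up weights and activations for one block.
weights = [0.1 * math.sin(i) for i in range(BLOCK)]
acts = [math.cos(0.5 * i) for i in range(BLOCK)]
w_scale, w_codes = quantize_q4_block(weights)
a_scale, a_q = quantize_int8(acts)
exact = sum(w * a for w, a in zip(weights, acts))
approx = block_dot(w_scale, w_codes, a_scale, a_q)
```

The inner sum touches only small integers, which is what lets a vectorized kernel avoid dequantizing to fp32 inside the hot loop; the two float scales are applied once per block.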

The model is packed into one binary blob by convert.py; e5.c is the entire engine (loader, tokenizer, BERT, quantized matmul); server.c adds the HTTP server and CLI; e5.py is the ctypes wrapper.


Performance

On a Ryzen 7 5800X3D (8 cores / 16 threads, AVX2):

                             nanoE5.c (4-bit)      sentence-transformers (fp32)
single-query latency (hot)   ~2 ms                 ~13 ms
batch throughput             ~190–340 texts/s      ~280 texts/s
model size                   72 MB                 471 MB
dependencies                 libc, libm, OpenMP    torch + transformers
cold start                   instant (mmap)        seconds

For online serving (one query at a time, model hot), nanoE5.c is ~7× faster per call. For huge offline batch jobs, PyTorch's oneDNN GEMM edges ahead on raw throughput, while nanoE5.c keeps about 1/6th the model footprint and needs no ML dependencies.


Validate & stress

make test     # cosine parity vs the fp32 HF reference + speed
make stress   # hard edge-case / concurrency / server suite

make stress throws adversarial inputs at every layer and asserts:

  • no crashes and no hangs
  • finite, unit-norm outputs
  • determinism
  • batch == single (exact)
  • server == binding parity
  • base64 == float parity
  • real OpenAI-client compatibility
  • correct 4xx handling for malformed requests
  • survival of a raw garbage barrage
  • 400 concurrent requests with zero errors or races
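Two of these invariants, finite values and unit norm, are easy to check in your own integration tests. The sketch below is not taken from stress_test.py; `check_embedding` is a hypothetical helper and the tolerance of 1e-3 is an assumed value.

```python
import math

def check_embedding(vec, dim=384, tol=1e-3):
    """Return a list of problems; an empty list means the vector passes."""
    problems = []
    if len(vec) != dim:
        problems.append(f"expected {dim} values, got {len(vec)}")
    if not all(math.isfinite(x) for x in vec):
        problems.append("non-finite value")
        return problems                   # the norm is meaningless here
    norm = math.sqrt(sum(x * x for x in vec))
    if abs(norm - 1.0) > tol:
        problems.append(f"norm {norm:.6f} is not 1")
    return problems

# Made-up vectors: one valid unit vector and one that is not normalized.
good = [1.0 / math.sqrt(384)] * 384
bad = [0.5] * 384
good_report = check_embedding(good)
bad_report = check_embedding(bad)
```

Running the same check on vectors returned by both the server and the Python wrapper is a cheap way to catch decoding mistakes on the client side.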


Files

convert.py       build e5-small-q4.bin from the HF checkpoint (one-time)
e5.c / e5.h      the entire inference engine
server.c         OpenAI-compatible HTTP server + CLI
e5.py            Python wrapper (load once, keep hot)
test_parity.py   parity vs HF reference + benchmark
stress_test.py   hard stress / edge-case suite
Makefile

License

The code here is yours to use. The model weights are intfloat/multilingual-e5-small (MIT) — see the model card for details.
