
InferBit

v0.4.1 — Run any open LLM on CPU. One command.

pip install inferbit[cli]
inferbit quantize mistralai/Mistral-7B-Instruct-v0.3 -o model.ibf
inferbit chat model.ibf

InferBit converts HuggingFace models to a compact 4-bit PQv2 codebook format (.ibf) and runs them on CPU everywhere, with optional Apple Metal GPU acceleration on Apple Silicon and a drive mode that streams weights from disk, bringing an 8B model to roughly 1.4 GB of peak RAM. No GPU required, no Docker, no complex setup.

Install

# Library only
pip install inferbit

# Library + CLI
pip install inferbit[cli]

# Everything (library + CLI + server)
pip install inferbit[all]

Requires Python 3.9+. Prebuilt wheels for macOS (ARM64, x86_64), Linux (x86_64, ARM64), and Windows (x64, ARM64) — six platforms, all CPU-only by default. Apple Metal GPU is a build-from-source option (see Platform support below).

Quickstart

Command line

# Convert any HuggingFace model to PQv2 4-bit
inferbit quantize meta-llama/Llama-3.2-1B -o llama.ibf

# Convert a local safetensors file
inferbit quantize ./model.safetensors -o model.ibf

# Convert from Ollama (if installed)
inferbit quantize ollama://llama3:8b -o llama3.ibf

# Auto-calibrate: try INT2/INT4/INT8 and keep the first that hits the gate
inferbit quantize meta-llama/Llama-3.2-1B -o llama.ibf \
    --auto-calibrate --max-perplexity 12.0 --min-tokens-per-sec 30

# Quality-gated eval against a JSONL token dataset
inferbit eval-gates model.ibf --dataset tokens.jsonl \
    --max-perplexity 12.0 --min-tokens-per-sec 30

# Interactive chat
inferbit chat model.ibf

# Benchmark
inferbit bench model.ibf --tokens 128 --runs 3

# Model info
inferbit info model.ibf

# Serve OpenAI-compatible API (requires: pip install inferbit[server])
inferbit serve model.ibf --port 8000

Python API

from inferbit import InferbitModel

# Load from HuggingFace (downloads, converts, and loads automatically)
model = InferbitModel.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    bits=4,
)

# Generate text
output = model.generate("Explain gravity in one sentence:")
print(output)
# "Gravity is the force that attracts objects with mass towards each other."

# Stream tokens
for token in model.stream("Write a haiku about mountains:"):
    print(token, end="", flush=True)

# Or load a pre-converted model
model = InferbitModel.load("model.ibf")

Convert separately

from inferbit import convert

# Convert safetensors to IBF
convert("model.safetensors", "model.ibf", bits=4, sensitive_bits=8)

# Convert a HuggingFace directory (with config.json + sharded safetensors)
convert("./model_dir/", "model.ibf", bits=4)

# Convert with progress callback
convert("model.safetensors", "model.ibf", progress=lambda pct, stage: print(f"{pct:.0%} {stage}"))

Token-level API

from inferbit import InferbitModel

model = InferbitModel.load("model.ibf")

# Work with raw token IDs
token_ids = model.generate_tokens([1, 2, 3, 4, 5], max_tokens=20, temperature=0.7)

# Get raw logits
logits = model.forward([1, 2, 3])

# KV cache control
model.kv_clear()
model.kv_truncate(512)
print(model.kv_length)

Model info

model = InferbitModel.load("model.ibf")
print(model.architecture)   # "llama"
print(model.num_layers)      # 32
print(model.hidden_size)     # 4096
print(model.vocab_size)      # 32768
print(model.max_context)     # 32768
print(model.bits)            # 4
print(model.total_memory_mb) # 3971.0

Quality-gated quantization

from inferbit import search_quantization_profile, EvalGates

# Automatically find the most aggressive quantization that meets quality targets
result = search_quantization_profile(
    "model.safetensors",
    output_dir="./models",
    gates=EvalGates(max_perplexity=10.0, min_tokens_per_sec=5.0),
)
print(f"Selected: {result.selected.name} ({result.selected.bits}-bit)")
print(f"Speed: {result.eval_result.tokens_per_sec:.1f} tok/s")

Supported Sources

Source                         Example
HuggingFace Hub                inferbit quantize mistralai/Mistral-7B-Instruct-v0.3
Local safetensors              inferbit quantize model.safetensors
Sharded safetensors directory  inferbit quantize ./model_dir/
Local GGUF                     inferbit quantize model.gguf
Ollama models                  inferbit quantize ollama://llama3:8b

Supported Models

Any LLaMA-family architecture with public weights:

  • LLaMA 2, LLaMA 3, LLaMA 3.2
  • Mistral, Mixtral
  • TinyLlama
  • Code Llama
  • And any model with the same architecture (GQA/MQA/MHA, RMSNorm, SiLU, RoPE)

Benchmarks

Apple M4, full v0.4.1 cross-engine matrix. Perplexity is measured on the same tokenized 2048-token wikitext window for both engines (llama.cpp's tokenization is fed to both bench_ppl_run and llama-perplexity, so quality is compared over the identical token sequence). Prefill via bench_compare --prompt-tokens 64 and llama-bench -p 64; decode via --gen-tokens 128 and -n 128. Peak RAM from getrusage (InferBit) and /usr/bin/time -l (llama.cpp).
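For orientation, the rough shape of the commands behind each row is sketched below; the GGUF filename is a placeholder and the authoritative invocations live in the methodology doc linked after the tables.

# InferBit side (documented CLI)
inferbit bench model.ibf --tokens 128 --runs 3
# llama.cpp side, matching prompt/decode lengths
llama-bench -m model-q4_k_m.gguf -p 64 -n 128
# peak RAM on macOS
/usr/bin/time -l llama-bench -m model-q4_k_m.gguf -p 64 -n 128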

TinyLlama 1.1B-Chat

Engine / mode              File     Prefill   Decode     Peak RAM  PPL
InferBit PQv2 — Metal      528 MiB  437 t/s   55.5 t/s   1205 MB   13.06
InferBit PQv2 — CPU        528 MiB  27 t/s    24.9 t/s   627 MB    13.06
InferBit PQv2 — drive      528 MiB  287 t/s   9.4 t/s    297 MB    13.06
llama.cpp Q4_K_M — Metal   638 MiB  1347 t/s  121.3 t/s  704 MB    13.89
llama.cpp Q4_K_M — CPU     638 MiB  130 t/s   74.2 t/s   1293 MB   13.89

Llama-3.2-1B Instruct

Engine / mode              File     Prefill   Decode     Peak RAM  PPL
InferBit PQv2 — Metal      718 MiB  435 t/s   48.1 t/s   1258 MB   11.29
InferBit PQv2 — CPU        718 MiB  28 t/s    22.7 t/s   847 MB    11.37
InferBit PQv2 — drive      718 MiB  257 t/s   9.3 t/s    602 MB    11.29
llama.cpp Q4_K_M — Metal   770 MiB  1359 t/s  104.3 t/s  888 MB    12.33
llama.cpp Q4_K_M — CPU     770 MiB  132 t/s   64.3 t/s   1644 MB   12.33

Llama-3.1-8B Instruct

Engine / mode              File      Prefill   Decode    Peak RAM  PPL
InferBit PQv2 — Metal      3.75 GiB  65 t/s    8.5 t/s   3203 MB   6.34
InferBit PQv2 — CPU        3.75 GiB  4.5 t/s   4.2 t/s   4306 MB   6.36
InferBit PQv2 — drive      3.75 GiB  34.3 t/s  0.70 t/s  1359 MB   6.34
llama.cpp Q4_K_M — Metal   4.58 GiB  216 t/s   20.1 t/s  4784 MB   6.77
llama.cpp Q4_K_M — CPU     4.58 GiB  4.2 t/s   2.4 t/s   6755 MB   6.77

What the numbers say:

  • Quality — InferBit PQv2 perplexity is 6–8% lower than the same-bit-budget Q4_K_M on all three models (over the identical token stream).
  • File size — InferBit .ibf is 7–18% smaller than the equivalent Q4_K_M GGUF.
  • Speed — On M4 Metal, llama.cpp is 2–3× faster on decode and 3–6× on prefill; on pure CPU the engines are closer. Closing the Metal decode gap is active work; the bottleneck is PQv2's random-access codebook reads.
  • Memory — InferBit drive mode runs the 8B model in 1.36 GB of peak RAM at the same PPL as the in-memory path (3.20 GB), a 58% reduction at zero quality cost. Throughput drops at long contexts because weights are re-streamed at every position, so drive mode pays off when RAM is the binding constraint.

Full methodology + tooling notes in docs/34_METRICS_SNAPSHOT.md.

How it works

  1. Convert: reads safetensors/GGUF weights, quantizes the MLP weights with PQv2 (K=256 per-(chunk, subchunk) codebook + uint8 indices, 4-bit-equivalent) and attention/embeddings with INT8, and packs everything into a single mmap-friendly .ibf binary.
  2. Load: memory-maps the .ibf file for instant loading.
  3. Run: hand-tuned C kernels with multi-threaded matmul and parallel attention heads. On Apple Silicon, an optional Metal GPU backend (build from source) routes both prefill and decode through the GPU; the same .ibf works in both modes.
  4. Drive mode (IB_RESIDENCY_MODE=drive, macOS/Linux): weights stream from disk through a bounded GPU/CPU scratch ring instead of staying resident. Perplexity is bit-identical, and the 8B model runs in 1.36 GB of peak RAM (see Benchmarks and the invocation sketch after this list).
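A minimal sketch of step 4, assuming the environment variable is read at startup as described above:

# stream weights from disk instead of keeping them resident (macOS/Linux)
IB_RESIDENCY_MODE=drive inferbit chat model.ibf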

The .ibf format is 64-byte aligned, requires no parsing at load time, and stores the same K=256 codebooks the GPU kernels consume, so there is no quality difference between CPU and GPU.

Configuration

Quantization

Flag              Default  Description
--bits            4        Weight quantization bits (2, 4, or 8)
--sensitive-bits  8        Attention/embedding bits
--sparsity        0.0      Structured sparsity (0.0-0.6)
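An illustrative invocation combining the flags above (values are examples, not recommendations):

inferbit quantize meta-llama/Llama-3.2-1B -o llama.ibf --bits 4 --sensitive-bits 8 --sparsity 0.2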

Generation

Flag            Default  Description
--temperature   0.7      Sampling temperature
--top-k         40       Top-K sampling
--top-p         0.9      Nucleus sampling
--max-tokens    512      Max tokens to generate
--threads       auto     CPU threads
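For example, assuming these flags attach to the chat command (values illustrative):

inferbit chat model.ibf --temperature 0.7 --top-p 0.9 --max-tokens 256 --threads 8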

Platform support

Platform                     Wheel                   CPU SIMD           Metal GPU                   Drive mode
macOS Apple Silicon (arm64)  macosx_11_0_arm64       NEON               opt-in (build from source)  yes
macOS Intel (x86_64)         macosx_10_15_x86_64     portable C         no                          yes
Linux x86_64                 manylinux_2_17_x86_64   portable C         no                          yes
Linux ARM64 (aarch64)        manylinux_2_17_aarch64  NEON + dotprod     no                          yes
Windows x64                  win_amd64               portable C (MSVC)  no                          no
Windows ARM64                win_arm64               NEON (MSVC)        no                          no

Build with Metal GPU (Apple Silicon, recommended for best M-series throughput):

# clone the engine repo, then:
cmake -B build -DIB_ENABLE_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# point InferBit at the freshly-built dylib:
export INFERBIT_LIB_PATH="$PWD/build/libinferbit.dylib"
python -c "import inferbit; print(inferbit.__version__)"

Drive mode is currently macOS/Linux only (uses POSIX madvise/fcntl(F_NOCACHE)); on Windows the runtime keeps weights resident.

Architecture

libinferbit (C shared library)
    |
    +-- Python: pip install inferbit
    +-- Node.js: npm install @inferbit/{core,node,cli}

Single C engine, multiple language bindings. Same .ibf model file, same numerics, any language.

License

MIT
