InferBit

v0.2.3 — Run any open LLM on CPU. One command.

pip install inferbit[cli]
inferbit quantize mistralai/Mistral-7B-Instruct-v0.3 -o model.ibf
inferbit chat model.ibf

InferBit converts HuggingFace models to optimized INT4 and runs them on any CPU (Apple Silicon, x86) with no GPU, no Docker, and no complex setup.

Install

# Library only
pip install inferbit

# Library + CLI
pip install inferbit[cli]

# Everything (library + CLI + server)
pip install inferbit[all]

Requires Python 3.9+. Works on macOS (ARM/Intel), Linux (x86_64/aarch64), and Windows (x86-64/ARM64).

Quickstart

Command line

# Convert any HuggingFace model to INT4
inferbit quantize meta-llama/Llama-3.2-1B -o llama.ibf

# Convert a local safetensors file
inferbit quantize ./model.safetensors -o model.ibf

# Convert from Ollama (if installed)
inferbit quantize ollama://llama3:8b -o llama3.ibf

# Interactive chat
inferbit chat model.ibf

# Benchmark
inferbit bench model.ibf --tokens 128 --runs 3

# Model info
inferbit info model.ibf

# Serve with OpenAI-compatible API
inferbit serve model.ibf --port 8000
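
Because the server speaks the OpenAI wire format, the stock OpenAI Python client should work against it. The base URL and model name below are illustrative:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="model.ibf",  # name of the served model; may vary
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)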

Python API

from inferbit import InferbitModel

# Load from HuggingFace (downloads, converts, and loads automatically)
model = InferbitModel.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    bits=4,
)

# Generate text
output = model.generate("Explain gravity in one sentence:")
print(output)
# "Gravity is the force that attracts objects with mass towards each other."

# Stream tokens
for token in model.stream("Write a haiku about mountains:"):
    print(token, end="", flush=True)

# Or load a pre-converted model
model = InferbitModel.load("model.ibf")

Convert separately

from inferbit import convert

# Convert safetensors to IBF
convert("model.safetensors", "model.ibf", bits=4, sensitive_bits=8)

# Convert a HuggingFace directory (with config.json + sharded safetensors)
convert("./model_dir/", "model.ibf", bits=4)

# Convert with progress callback
convert("model.safetensors", "model.ibf", progress=lambda pct, stage: print(f"{pct:.0%} {stage}"))

Token-level API

from inferbit import InferbitModel

model = InferbitModel.load("model.ibf")

# Work with raw token IDs
token_ids = model.generate_tokens([1, 2, 3, 4, 5], max_tokens=20, temperature=0.7)

# Get raw logits
logits = model.forward([1, 2, 3])

# KV cache control
model.kv_clear()
model.kv_truncate(512)
print(model.kv_length)
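
These primitives compose into a manual decode loop. A sketch, assuming forward() extends the KV cache with the ids it is given and returns logits for the last position (verify against the actual semantics before relying on it):

model.kv_clear()
ids = [1, 2, 3, 4, 5]
logits = model.forward(ids)          # prefill the prompt
for _ in range(20):
    next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
    ids.append(next_id)
    logits = model.forward([next_id])  # decode one token; the cache holds the rest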

Model info

model = InferbitModel.load("model.ibf")
print(model.architecture)   # "llama"
print(model.num_layers)      # 32
print(model.hidden_size)     # 4096
print(model.vocab_size)      # 32768
print(model.max_context)     # 32768
print(model.bits)            # 4
print(model.total_memory_mb) # 3971.0

Quality-gated quantization

from inferbit import search_quantization_profile, EvalGates

# Automatically find the most aggressive quantization that meets quality targets
result = search_quantization_profile(
    "model.safetensors",
    output_dir="./models",
    gates=EvalGates(max_perplexity=10.0, min_tokens_per_sec=5.0),
)
print(f"Selected: {result.selected.name} ({result.selected.bits}-bit)")
print(f"Speed: {result.eval_result.tokens_per_sec:.1f} tok/s")

Supported Sources

Source                          Example
HuggingFace Hub                 inferbit quantize mistralai/Mistral-7B-Instruct-v0.3
Local safetensors               inferbit quantize model.safetensors
Sharded safetensors directory   inferbit quantize ./model_dir/
Local GGUF                      inferbit quantize model.gguf
Ollama models                   inferbit quantize ollama://llama3:8b

Supported Models

Any LLaMA-family architecture with public weights:

  • LLaMA 2, LLaMA 3, LLaMA 3.2
  • Mistral, Mixtral
  • TinyLlama
  • Code Llama
  • And any model with the same architecture (GQA/MQA/MHA, RMSNorm, SiLU, RoPE)
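
One quick compatibility check is to inspect the model's Hugging Face config.json. The heuristic below is ours and uses standard Transformers field names; it is not an InferBit API:

import json

def looks_llama_family(config_path):
    # GQA vs MQA vs MHA is encoded by num_key_value_heads
    # (== num_attention_heads -> MHA, 1 -> MQA, in between -> GQA).
    with open(config_path) as f:
        cfg = json.load(f)
    return (
        cfg.get("model_type") in {"llama", "mistral", "mixtral"}
        and cfg.get("hidden_act", "silu") == "silu"  # SiLU MLP
    )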

Benchmarks

Apple Silicon, INT4 + INT8 attention, 8 threads:

Model            File size   Decode speed   Quality
TinyLlama 1.1B   643 MB      34.6 tok/s     Good
Mistral 7B       3,971 MB    6.8 tok/s      Excellent

Compression: 3.5x vs the FP16 source, with no retraining required. As a sanity check, Mistral 7B's roughly 7.2B parameters in 3,971 MB work out to about 4.6 bits per weight, versus 16 bits for FP16.

How it works

  1. Convert: reads safetensors/GGUF weights, quantizes to INT4 (MLP layers) and INT8 (attention/embeddings), and packs the result into an optimized .ibf binary format (see the sketch after this list)
  2. Load: memory-maps the .ibf file for instant loading
  3. Run: SIMD-optimized kernels (NEON on ARM, AVX2 on x86) with multi-threaded matmul and parallel attention heads
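
For intuition about the Convert step, here is a minimal sketch of symmetric group-wise INT4 quantization. It is illustrative only: the group size, scale dtype, and rounding are assumptions about the general technique, not the .ibf layout.

import numpy as np

def quantize_int4_groupwise(w, group_size=32):
    # One scale per group of weights; integer values land in [-8, 7].
    # Assumes w.size is a multiple of group_size.
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                         # avoid divide-by-zero
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize(q, scales):
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

Grouping keeps outliers local: one large weight only inflates the scale of its own group rather than the whole tensor, which is what lets 4-bit weights preserve quality.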

The .ibf format is designed for fast loading: 64-byte aligned, mmap-friendly, no parsing at load time.
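
The "no parsing" part follows from how memory mapping works in general; a sketch of the technique (the .ibf internals themselves are not shown here):

import mmap

# mmap maps the file into the process address space without reading it up
# front; pages are faulted in lazily as tensors are first touched.
with open("model.ibf", "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# With fixed offsets and 64-byte alignment, tensor data can then be viewed
# in place (e.g. via numpy.frombuffer at a known offset) with no copying
# and no deserialization step.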

Configuration

Quantization

Flag               Default   Description
--bits             4         Weight quantization (2, 4, or 8 bits)
--sensitive-bits   8         Bits for attention/embedding weights
--sparsity         0.0       Structured sparsity ratio (0.0-0.6)
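
For example, a more aggressive conversion combining these flags (values are illustrative):

inferbit quantize meta-llama/Llama-3.2-1B -o llama.ibf --bits 4 --sensitive-bits 8 --sparsity 0.2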

Generation

Flag            Default   Description
--temperature   0.7       Sampling temperature
--top-k         40        Top-K sampling
--top-p         0.9       Nucleus (top-p) sampling
--max-tokens    512       Maximum tokens to generate
--threads       auto      Number of CPU threads
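
And a chat session tuned with the generation flags (again, illustrative values):

inferbit chat model.ibf --temperature 0.3 --top-p 0.95 --max-tokens 256 --threads 8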

Architecture

libinferbit (C shared library)
    |
    +-- Python: pip install inferbit
    +-- Node.js: npm install @inferbit/node (coming soon)

Single C engine, multiple language bindings. Same model, same results, any language.
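
A thin binding over a C shared library typically follows the pattern sketched below. Every symbol here is hypothetical, chosen only to illustrate the shape of such a binding; it is not InferBit's actual FFI:

import ctypes

lib = ctypes.CDLL("libinferbit.so")       # .dylib on macOS, .dll on Windows
lib.ib_load.restype = ctypes.c_void_p     # hypothetical symbol
lib.ib_load.argtypes = [ctypes.c_char_p]

handle = lib.ib_load(b"model.ibf")
# Each language binding wraps the same handle-based C API, which is why the
# same .ibf file gives identical results from Python or Node.js.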

License

MIT
