
InferBit

v0.4.0 — Run any open LLM on CPU. One command.

pip install inferbit[cli]
inferbit quantize mistralai/Mistral-7B-Instruct-v0.3 -o model.ibf
inferbit chat model.ibf

InferBit converts HuggingFace models to optimized INT4 and runs them on any CPU (Apple Silicon, x86) with no GPU, no Docker, and no complex setup.

Install

# Library only
pip install inferbit

# Library + CLI
pip install inferbit[cli]

# Everything (library + CLI + server)
pip install inferbit[all]

Requires Python 3.9+. Works on macOS (ARM/Intel) and Linux (x86_64).

Quickstart

Command line

# Convert any HuggingFace model to INT4
inferbit quantize meta-llama/Llama-3.2-1B -o llama.ibf

# Convert a local safetensors file
inferbit quantize ./model.safetensors -o model.ibf

# Convert from Ollama (if installed)
inferbit quantize ollama://llama3:8b -o llama3.ibf

# Interactive chat
inferbit chat model.ibf

# Benchmark
inferbit bench model.ibf --tokens 128 --runs 3

# Model info
inferbit info model.ibf

# Serve with OpenAI-compatible API
inferbit serve model.ibf --port 8000
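Because `serve` exposes an OpenAI-compatible API, any OpenAI-style client should be able to talk to it. A minimal standard-library sketch — the `/v1/chat/completions` path and payload shape follow the OpenAI convention and are assumptions here, not documented InferBit behavior:

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8000"):
    """Build an OpenAI-style chat completion request (hypothetical endpoint)."""
    payload = {
        "model": "model.ibf",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Explain gravity in one sentence:")
# With the server running: urllib.request.urlopen(req) returns the JSON completion.
print(req.full_url)
```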

Python API

from inferbit import InferbitModel

# Load from HuggingFace (downloads, converts, and loads automatically)
model = InferbitModel.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    bits=4,
)

# Generate text
output = model.generate("Explain gravity in one sentence:")
print(output)
# "Gravity is the force that attracts objects with mass towards each other."

# Stream tokens
for token in model.stream("Write a haiku about mountains:"):
    print(token, end="", flush=True)

# Or load a pre-converted model
model = InferbitModel.load("model.ibf")

Convert separately

from inferbit import convert

# Convert safetensors to IBF
convert("model.safetensors", "model.ibf", bits=4, sensitive_bits=8)

# Convert a HuggingFace directory (with config.json + sharded safetensors)
convert("./model_dir/", "model.ibf", bits=4)

# Convert with progress callback
convert("model.safetensors", "model.ibf", progress=lambda pct, stage: print(f"{pct:.0%} {stage}"))

Token-level API

from inferbit import InferbitModel

model = InferbitModel.load("model.ibf")

# Work with raw token IDs
token_ids = model.generate_tokens([1, 2, 3, 4, 5], max_tokens=20, temperature=0.7)

# Get raw logits
logits = model.forward([1, 2, 3])

# KV cache control
model.kv_clear()
model.kv_truncate(512)
print(model.kv_length)

Model info

model = InferbitModel.load("model.ibf")
print(model.architecture)   # "llama"
print(model.num_layers)      # 32
print(model.hidden_size)     # 4096
print(model.vocab_size)      # 32768
print(model.max_context)     # 32768
print(model.bits)            # 4
print(model.total_memory_mb) # 3971.0

Quality-gated quantization

from inferbit import search_quantization_profile, EvalGates

# Automatically find the most aggressive quantization that meets quality targets
result = search_quantization_profile(
    "model.safetensors",
    output_dir="./models",
    gates=EvalGates(max_perplexity=10.0, min_tokens_per_sec=5.0),
)
print(f"Selected: {result.selected.name} ({result.selected.bits}-bit)")
print(f"Speed: {result.eval_result.tokens_per_sec:.1f} tok/s")

Supported Sources

Source                          Example
HuggingFace Hub                 inferbit quantize mistralai/Mistral-7B-Instruct-v0.3
Local safetensors               inferbit quantize model.safetensors
Sharded safetensors directory   inferbit quantize ./model_dir/
Local GGUF                      inferbit quantize model.gguf
Ollama models                   inferbit quantize ollama://llama3:8b

Supported Models

Any LLaMA-family architecture with public weights:

  • LLaMA 2, LLaMA 3, LLaMA 3.2
  • Mistral, Mixtral
  • TinyLlama
  • Code Llama
  • And any model with the same architecture (GQA/MQA/MHA, RMSNorm, SiLU, RoPE)
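One of the shared building blocks named above, RMSNorm, is simple enough to state directly. A pure-Python sketch for illustration only — the real kernels operate on quantized tensors, not Python lists:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm as used in LLaMA-family models: divide x by its
    root-mean-square, then scale by a learned per-channel weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

out = rms_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```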

Benchmarks

Apple Silicon, INT4 + INT8 attention, 8 threads:

Model            File size   Decode speed   Quality
TinyLlama 1.1B   643 MB      34.6 tok/s     Good
Mistral 7B       3,971 MB    6.8 tok/s      Excellent

Compression: 3.5x vs FP16 source. No retraining required.
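The headline ratio can be sanity-checked with back-of-envelope arithmetic (the Mistral 7B parameter count is approximate):

```python
# Mistral 7B at FP16 is roughly 7.2e9 parameters x 2 bytes each;
# the quantized .ibf file above is 3,971 MB.
fp16_bytes = 7.2e9 * 2
ibf_bytes = 3_971e6
ratio = fp16_bytes / ibf_bytes
print(f"{ratio:.1f}x")  # -> 3.6x, in the ballpark of the stated 3.5x
```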

How it works

  1. Convert: reads safetensors/GGUF weights, quantizes to INT4 (MLP layers) and INT8 (attention/embeddings), packs into an optimized .ibf binary format
  2. Load: memory-maps the .ibf file for instant loading
  3. Run: SIMD-optimized kernels (NEON on ARM, AVX2 on x86) with multi-threaded matmul and parallel attention heads
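Step 1 can be sketched in miniature. The exact IBF quantization scheme is not documented here, so this per-block symmetric INT4 quantizer with nibble packing only illustrates the general idea:

```python
def quantize_int4(block):
    """Map floats to signed 4-bit integers [-8, 7] with a shared scale."""
    scale = max(abs(v) for v in block) / 7.0 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in block]
    return q, scale

def pack_nibbles(q):
    """Pack two signed 4-bit values per byte, low nibble first."""
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4)
                 for i in range(0, len(q), 2))

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9]
q, scale = quantize_int4(weights)   # q == [1, -4, 3, 7]
packed = pack_nibbles(q)            # 4 weights -> 2 bytes
```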

The .ibf format is designed for fast loading: 64-byte aligned, mmap-friendly, no parsing at load time.
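The load path can be illustrated with plain `mmap`. The header layout below is invented purely to show why 64-byte alignment lets tensor bytes be used in place without parsing or copying:

```python
import mmap
import os
import tempfile

ALIGN = 64

def pad_to(n, align=ALIGN):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align

# Write a toy file: a small header, zero padding to a 64-byte
# boundary, then raw tensor bytes (layout invented for illustration).
header = b"TOYF" + (1).to_bytes(4, "little")
offset = pad_to(len(header))
tensor = bytes(range(128))

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(header + b"\x00" * (offset - len(header)) + tensor)

# "Loading" is just mapping: the tensor is visible at a known,
# aligned offset with no deserialization step.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    assert offset % ALIGN == 0
    view = m[offset:offset + len(tensor)]

os.unlink(path)
```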

Configuration

Quantization

Flag               Default   Description
--bits             4         Weight quantization bits (2, 4, or 8)
--sensitive-bits   8         Bits for attention/embedding weights
--sparsity         0.0       Structured sparsity ratio (0.0-0.6)

Generation

Flag            Default   Description
--temperature   0.7       Sampling temperature
--top-k         40        Top-k sampling cutoff
--top-p         0.9       Nucleus (top-p) sampling threshold
--max-tokens    512       Maximum tokens to generate
--threads       auto      Number of CPU threads
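The sampling flags compose in the standard way: temperature rescales logits, top-k/top-p prune the candidate set, then one token is drawn from the renormalized distribution. A minimal sketch of temperature plus top-k (top-p omitted); this illustrates the convention, not InferBit's exact sampler:

```python
import math
import random

def sample_top_k(logits, temperature=0.7, top_k=40, rng=random):
    """Keep the k highest logits, apply temperature, softmax, then draw."""
    scaled = [l / temperature for l in logits]
    kept = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    mx = max(scaled[i] for i in kept)                      # for numerical stability
    weights = [math.exp(scaled[i] - mx) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

logits = [0.1, 5.0, 0.2, 4.8]
token = sample_top_k(logits, temperature=0.7, top_k=2)
# With top_k=2, only the two highest-logit tokens (indices 1 and 3)
# can ever be drawn.
```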

Architecture

libinferbit (C shared library)
    |
    +-- Python: pip install inferbit
    +-- Node.js: npm install @inferbit/node (coming soon)

Single C engine, multiple language bindings. Same model, same results, any language.

License

MIT
