InferBit
v0.2.0 — Run any open LLM on CPU. One command.
```bash
pip install inferbit[cli]
inferbit quantize mistralai/Mistral-7B-Instruct-v0.3 -o model.ibf
inferbit chat model.ibf
```
InferBit converts HuggingFace models to optimized INT4 and runs them on any CPU (Apple Silicon, x86) with no GPU, no Docker, and no complex setup.
Install
```bash
# Library only
pip install inferbit

# Library + CLI
pip install inferbit[cli]

# Everything (library + CLI + server)
pip install inferbit[all]
```
Requires Python 3.9+. Works on macOS (ARM/Intel) and Linux (x86_64).
Quickstart
Command line
```bash
# Convert any HuggingFace model to INT4
inferbit quantize meta-llama/Llama-3.2-1B -o llama.ibf

# Convert a local safetensors file
inferbit quantize ./model.safetensors -o model.ibf

# Convert from Ollama (if installed)
inferbit quantize ollama://llama3:8b -o llama3.ibf

# Interactive chat
inferbit chat model.ibf

# Benchmark
inferbit bench model.ibf --tokens 128 --runs 3

# Model info
inferbit info model.ibf

# Serve with OpenAI-compatible API
inferbit serve model.ibf --port 8000
```
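Once `inferbit serve` is running, any OpenAI-compatible client should be able to talk to it. A minimal sketch with the standard library (the `/v1/chat/completions` route and payload shape are assumptions based on the "OpenAI-compatible" claim, not fields confirmed by this README):

```python
import json
import urllib.request

# Standard OpenAI-style chat completion request. The route and field
# names below follow the OpenAI API convention and are assumed to apply.
payload = {
    "model": "model.ibf",
    "messages": [{"role": "user", "content": "Explain gravity in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```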
Python API
```python
from inferbit import InferbitModel

# Load from HuggingFace (downloads, converts, and loads automatically)
model = InferbitModel.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    bits=4,
)

# Generate text
output = model.generate("Explain gravity in one sentence:")
print(output)
# "Gravity is the force that attracts objects with mass towards each other."

# Stream tokens
for token in model.stream("Write a haiku about mountains:"):
    print(token, end="", flush=True)

# Or load a pre-converted model
model = InferbitModel.load("model.ibf")
```
Convert separately
```python
from inferbit import convert

# Convert safetensors to IBF
convert("model.safetensors", "model.ibf", bits=4, sensitive_bits=8)

# Convert a HuggingFace directory (with config.json + sharded safetensors)
convert("./model_dir/", "model.ibf", bits=4)

# Convert with progress callback
convert("model.safetensors", "model.ibf", progress=lambda pct, stage: print(f"{pct:.0%} {stage}"))
```
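The `progress` callback only needs to accept a fraction and a stage name, so a simple text progress bar is easy to wire up. A sketch using the `(pct, stage)` signature from the example above (the stage names here are illustrative, not InferBit's actual stage labels):

```python
def progress_bar(pct: float, stage: str, width: int = 30) -> str:
    # Render a fixed-width ASCII bar for the current conversion stage.
    filled = int(pct * width)
    bar = "#" * filled + "-" * (width - filled)
    line = f"[{bar}] {pct:4.0%} {stage}"
    print(line, end="\r", flush=True)
    return line

# Simulated callback invocations (stage names are made up):
progress_bar(0.25, "quantize")
progress_bar(1.00, "write")
# Real usage: convert("model.safetensors", "model.ibf", progress=progress_bar)
```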
Token-level API
```python
from inferbit import InferbitModel

model = InferbitModel.load("model.ibf")

# Work with raw token IDs
token_ids = model.generate_tokens([1, 2, 3, 4, 5], max_tokens=20, temperature=0.7)

# Get raw logits
logits = model.forward([1, 2, 3])

# KV cache control
model.kv_clear()
model.kv_truncate(512)
print(model.kv_length)
```
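`kv_truncate` makes it possible to keep a long-running chat inside the context window: when the cache approaches `max_context`, shrink it and continue from the retained prefix. A minimal bookkeeping sketch (`truncate_point` is an illustrative helper, not part of the InferBit API):

```python
def truncate_point(kv_length: int, max_context: int, reserve: int, keep_last: int) -> int:
    # Return the cache length to truncate to before generating `reserve`
    # more tokens, or kv_length unchanged if there is still room.
    if kv_length + reserve <= max_context:
        return kv_length
    return max(0, min(kv_length, keep_last))

# With a 32768-token context, a 30000-token cache, and 4096 tokens reserved
# for the reply, shrink the cache to its last 8192 entries:
new_len = truncate_point(30000, 32768, 4096, 8192)
# then: model.kv_truncate(new_len)
```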
Model info
```python
model = InferbitModel.load("model.ibf")

print(model.architecture)     # "llama"
print(model.num_layers)       # 32
print(model.hidden_size)      # 4096
print(model.vocab_size)       # 32768
print(model.max_context)      # 32768
print(model.bits)             # 4
print(model.total_memory_mb)  # 3971.0
```
Quality-gated quantization
```python
from inferbit import search_quantization_profile, EvalGates

# Automatically find the most aggressive quantization that meets quality targets
result = search_quantization_profile(
    "model.safetensors",
    output_dir="./models",
    gates=EvalGates(max_perplexity=10.0, min_tokens_per_sec=5.0),
)

print(f"Selected: {result.selected.name} ({result.selected.bits}-bit)")
print(f"Speed: {result.eval_result.tokens_per_sec:.1f} tok/s")
```
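Conceptually, a search like this can be as simple as trying profiles from most to least aggressive and accepting the first one that passes every gate. A hedged sketch of that loop (the profile names, `evaluate` signature, and scores below are illustrative, not the library's internals):

```python
from dataclasses import dataclass

@dataclass
class Gates:
    max_perplexity: float
    min_tokens_per_sec: float

def pick_profile(profiles, evaluate, gates: Gates):
    # `profiles` is ordered most-aggressive first (e.g. 2-bit before 8-bit);
    # `evaluate` returns (perplexity, tokens_per_sec) for a profile.
    for profile in profiles:
        ppl, tps = evaluate(profile)
        if ppl <= gates.max_perplexity and tps >= gates.min_tokens_per_sec:
            return profile
    return None  # no profile met the gates

# Toy evaluation table standing in for a real benchmark run:
fake_scores = {"int2": (14.2, 9.1), "int4": (8.7, 6.8), "int8": (7.9, 3.2)}
best = pick_profile(["int2", "int4", "int8"], fake_scores.__getitem__,
                    Gates(max_perplexity=10.0, min_tokens_per_sec=5.0))
# best is "int4": int2 fails the perplexity gate, int8 fails the speed gate
```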
Supported Sources
| Source | Example |
|---|---|
| HuggingFace Hub | `inferbit quantize mistralai/Mistral-7B-Instruct-v0.3` |
| Local safetensors | `inferbit quantize model.safetensors` |
| Sharded safetensors directory | `inferbit quantize ./model_dir/` |
| Local GGUF | `inferbit quantize model.gguf` |
| Ollama models | `inferbit quantize ollama://llama3:8b` |
Supported Models
Any LLaMA-family architecture with public weights:
- LLaMA 2, LLaMA 3, LLaMA 3.2
- Mistral, Mixtral
- TinyLlama
- Code Llama
- And any model with the same architecture (GQA/MQA/MHA, RMSNorm, SiLU, RoPE)
Benchmarks
Apple Silicon, INT4 + INT8 attention, 8 threads:
| Model | File size | Decode speed | Quality |
|---|---|---|---|
| TinyLlama 1.1B | 643 MB | 34.6 tok/s | Good |
| Mistral 7B | 3,971 MB | 6.8 tok/s | Excellent |
Compression: 3.5x vs FP16 source. No retraining required.
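The 3.5x figure checks out against the table: an FP16 checkpoint stores 2 bytes per parameter, and dividing that by the `.ibf` file size gives the ratio. A quick back-of-the-envelope check (using the nominal 1.1B and 7B parameter counts from the model names):

```python
def compression_ratio(params: float, ibf_mb: float) -> float:
    # FP16 size: 2 bytes per parameter, expressed in MB.
    fp16_mb = params * 2 / 1e6
    return fp16_mb / ibf_mb

tinyllama = compression_ratio(1.1e9, 643)   # roughly 3.4x
mistral = compression_ratio(7.0e9, 3971)    # roughly 3.5x
```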
How it works
- Convert: reads safetensors/GGUF weights, quantizes to INT4 (MLP layers) and INT8 (attention/embeddings), and packs them into an optimized `.ibf` binary format
- Load: memory-maps the `.ibf` file for instant loading
- Run: SIMD-optimized kernels (NEON on ARM, AVX2 on x86) with multi-threaded matmul and parallel attention heads
The .ibf format is designed for fast loading: 64-byte aligned, mmap-friendly, no parsing at load time.
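Memory-mapping is why loading is near-instant: the OS pages weights in on demand rather than copying the whole file into memory. A sketch of the idea against a stand-in file (the header layout here, a magic string padded to 64 bytes, is invented for illustration; the real `.ibf` layout is not documented in this README):

```python
import mmap
import os
import struct
import tempfile

# Write a stand-in file: a 64-byte-aligned header followed by a payload.
with tempfile.NamedTemporaryFile(delete=False, suffix=".ibf") as f:
    header = b"IBF0" + struct.pack("<I", 1)  # hypothetical magic + version
    f.write(header.ljust(64, b"\0"))         # pad header to 64 bytes
    f.write(b"\x11" * 128)                   # fake weight payload
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic = mm[:4]       # no parsing, no copy: just read from the mapping
    weights = mm[64:]    # payload starts at a 64-byte boundary
    mm.close()
os.remove(path)
```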
Configuration
Quantization
| Flag | Default | Description |
|---|---|---|
| `--bits` | 4 | Weight quantization (2, 4, 8) |
| `--sensitive-bits` | 8 | Attention/embedding bits |
| `--sparsity` | 0.0 | Structured sparsity (0.0-0.6) |
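To make the `--bits` trade-off concrete, here is a sketch of symmetric INT4 quantization over one weight group, the general technique the flag selects between (the rounding and grouping details are illustrative; InferBit's exact scheme is not specified here):

```python
def quantize_int4(weights):
    # Symmetric quantization: one scale per group, codes in [-8, 7].
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.07]
q, s = quantize_int4(w)
w_hat = dequantize(q, s)  # close to w, at 4 bits per value plus one scale
```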
Generation
| Flag | Default | Description |
|---|---|---|
| `--temperature` | 0.7 | Sampling temperature |
| `--top-k` | 40 | Top-K sampling |
| `--top-p` | 0.9 | Nucleus sampling |
| `--max-tokens` | 512 | Max tokens to generate |
| `--threads` | auto | CPU threads |
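The three sampling flags compose into one pipeline: divide the logits by `--temperature`, keep the `--top-k` highest, then keep the smallest set whose probability mass reaches `--top-p`, and sample from what remains. A self-contained sketch of that standard pipeline (this is the textbook sampling math, not InferBit's internal code):

```python
import math
import random

def sample(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    # 1. Temperature: sharpen (<1) or flatten (>1) the distribution.
    scaled = [l / temperature for l in logits]
    # 2. Top-K: keep only the k highest-scoring token ids.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # 3. Softmax over the survivors (max-subtracted for stability).
    m = max(scaled[i] for i in ranked)
    exps = [math.exp(scaled[i] - m) for i in ranked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-P: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in zip(ranked, probs):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    # 5. Renormalize and draw a sample.
    z = sum(p for _, p in kept)
    r, acc = rng.random() * z, 0.0
    for tok, p in kept:
        acc += p
        if acc >= r:
            return tok
    return kept[-1][0]

token = sample([2.0, 1.0, 0.2, -1.0], temperature=0.7, top_k=3, top_p=0.9)
```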
Architecture
```
libinferbit (C shared library)
 |
 +-- Python:  pip install inferbit
 +-- Node.js: npm install @inferbit/node (coming soon)
```
Single C engine, multiple language bindings. Same model, same results, any language.
License
MIT