
InferBit

v0.3.0 — Run any open LLM on CPU. One command.

pip install "inferbit[cli]"
inferbit quantize mistralai/Mistral-7B-Instruct-v0.3 -o model.ibf
inferbit chat model.ibf

InferBit converts HuggingFace models to optimized INT4 and runs them on any CPU (Apple Silicon, x86) with no GPU, no Docker, and no complex setup.

Install

# Library only
pip install inferbit

# Library + CLI (extras are quoted so shells like zsh don't expand the brackets)
pip install "inferbit[cli]"

# Everything (library + CLI + server)
pip install "inferbit[all]"

Requires Python 3.9+. Works on macOS (Apple Silicon/Intel), Linux (x86_64/aarch64), and Windows (x86-64/ARM64).

Quickstart

Command line

# Convert any HuggingFace model to INT4
inferbit quantize meta-llama/Llama-3.2-1B -o llama.ibf

# Convert a local safetensors file
inferbit quantize ./model.safetensors -o model.ibf

# Convert from Ollama (if installed)
inferbit quantize ollama://llama3:8b -o llama3.ibf

# Interactive chat
inferbit chat model.ibf

# Benchmark
inferbit bench model.ibf --tokens 128 --runs 3

# Model info
inferbit info model.ibf

# Serve with OpenAI-compatible API
inferbit serve model.ibf --port 8000
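The server speaks the OpenAI chat-completions protocol, so any OpenAI-compatible client should work against it. A minimal sketch of building the standard request body (the endpoint path and field names follow the OpenAI API convention, not anything InferBit-specific):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> str:
    """Build a standard OpenAI-style chat-completions request body."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)

# POST this body to http://localhost:8000/v1/chat/completions
# (path assumed from the OpenAI convention; check `inferbit serve --help`)
body = build_chat_request("model.ibf", "Explain gravity in one sentence:")
```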

Python API

from inferbit import InferbitModel

# Load from HuggingFace (downloads, converts, and loads automatically)
model = InferbitModel.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    bits=4,
)

# Generate text
output = model.generate("Explain gravity in one sentence:")
print(output)
# "Gravity is the force that attracts objects with mass towards each other."

# Stream tokens
for token in model.stream("Write a haiku about mountains:"):
    print(token, end="", flush=True)

# Or load a pre-converted model
model = InferbitModel.load("model.ibf")

Convert separately

from inferbit import convert

# Convert safetensors to IBF
convert("model.safetensors", "model.ibf", bits=4, sensitive_bits=8)

# Convert a HuggingFace directory (with config.json + sharded safetensors)
convert("./model_dir/", "model.ibf", bits=4)

# Convert with progress callback
convert("model.safetensors", "model.ibf", progress=lambda pct, stage: print(f"{pct:.0%} {stage}"))
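For intuition, block-wise symmetric INT4 quantization can be sketched in a few lines. This toy version (one FP scale per block, values clamped to the signed 4-bit range [-8, 7]) illustrates the general technique, not InferBit's actual conversion kernel:

```python
def quantize_int4(weights, block_size=32):
    """Toy symmetric INT4 quantization: one float scale per block of weights."""
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        # Scale maps the largest magnitude in the block onto 7 (max INT4 value).
        scale = max(abs(w) for w in block) / 7 or 1.0
        qs = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, qs))
    return blocks

def dequantize_int4(blocks):
    """Recover approximate floats: each int times its block's scale."""
    out = []
    for scale, qs in blocks:
        out.extend(q * scale for q in qs)
    return out
```

The worst-case error per weight is half a quantization step (scale / 2), which is why per-block scales beat a single global scale.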

Token-level API

from inferbit import InferbitModel

model = InferbitModel.load("model.ibf")

# Work with raw token IDs
token_ids = model.generate_tokens([1, 2, 3, 4, 5], max_tokens=20, temperature=0.7)

# Get raw logits
logits = model.forward([1, 2, 3])

# KV cache control
model.kv_clear()
model.kv_truncate(512)
print(model.kv_length)
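The KV-cache methods behave like a positional buffer of one (key, value) entry per processed token. A toy stand-in for illustration (this assumes kv_truncate keeps the first n positions, the usual prefix-reuse semantics; check the API reference):

```python
class ToyKVCache:
    """Illustrative stand-in for the kv_* methods, not InferBit's cache."""

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        """One entry is appended per token the model processes."""
        self.entries.append((key, value))

    def clear(self):
        """Analogue of model.kv_clear(): drop all cached positions."""
        self.entries = []

    def truncate(self, n):
        """Analogue of model.kv_truncate(n): keep only the first n positions."""
        self.entries = self.entries[:n]

    @property
    def length(self):
        """Analogue of model.kv_length."""
        return len(self.entries)
```

Truncation is what makes prefix reuse cheap: rewind to a shared prefix instead of re-running the prompt.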

Model info

model = InferbitModel.load("model.ibf")
print(model.architecture)   # "llama"
print(model.num_layers)      # 32
print(model.hidden_size)     # 4096
print(model.vocab_size)      # 32768
print(model.max_context)     # 32768
print(model.bits)            # 4
print(model.total_memory_mb) # 3971.0
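total_memory_mb can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a round 10% of weights stay at 8 bits (attention/embeddings) and ignores per-block scale overhead; both the fraction and the parameter count are illustrative assumptions, not InferBit's real split:

```python
def estimate_memory_mb(n_params_billion, bits=4, sensitive_fraction=0.1,
                       sensitive_bits=8):
    """Rough weight-memory estimate for a mixed-precision quantized model."""
    n = n_params_billion * 1e9
    bulk = n * (1 - sensitive_fraction) * bits / 8        # MLP weights at 4-bit
    sensitive = n * sensitive_fraction * sensitive_bits / 8  # attn/embed at 8-bit
    return (bulk + sensitive) / 1e6                       # bytes -> MB

# ~7.24B params (assumed) lands near the 3,971 MB reported above
estimate = estimate_memory_mb(7.24)
```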

Quality-gated quantization

from inferbit import search_quantization_profile, EvalGates

# Automatically find the most aggressive quantization that meets quality targets
result = search_quantization_profile(
    "model.safetensors",
    output_dir="./models",
    gates=EvalGates(max_perplexity=10.0, min_tokens_per_sec=5.0),
)
print(f"Selected: {result.selected.name} ({result.selected.bits}-bit)")
print(f"Speed: {result.eval_result.tokens_per_sec:.1f} tok/s")
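Conceptually, the search amounts to trying profiles from most to least aggressive and keeping the first one that clears every gate. A toy version of that loop (not the library's implementation; profile names and the evaluate callback are invented):

```python
from dataclasses import dataclass

@dataclass
class Gates:
    max_perplexity: float
    min_tokens_per_sec: float

def pick_profile(profiles, evaluate, gates):
    """Return the first profile (most aggressive first) passing every gate.

    evaluate(profile) is assumed to return (perplexity, tokens_per_sec).
    """
    for profile in profiles:
        ppl, tps = evaluate(profile)
        if ppl <= gates.max_perplexity and tps >= gates.min_tokens_per_sec:
            return profile
    return None  # no profile met the quality targets
```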

Supported Sources

Source                          Example
HuggingFace Hub                 inferbit quantize mistralai/Mistral-7B-Instruct-v0.3
Local safetensors               inferbit quantize model.safetensors
Sharded safetensors directory   inferbit quantize ./model_dir/
Local GGUF                      inferbit quantize model.gguf
Ollama models                   inferbit quantize ollama://llama3:8b

Supported Models

Any LLaMA-family architecture with public weights:

  • LLaMA 2, LLaMA 3, LLaMA 3.2
  • Mistral, Mixtral
  • TinyLlama
  • Code Llama
  • And any model with the same architecture (GQA/MQA/MHA, RMSNorm, SiLU, RoPE)

Benchmarks

Apple Silicon, INT4 + INT8 attention, 8 threads:

Model            File size   Decode speed   Quality
TinyLlama 1.1B   643 MB      34.6 tok/s     Good
Mistral 7B       3,971 MB    6.8 tok/s      Excellent

Compression: 3.5x vs FP16 source. No retraining required.
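The 3.5x figure checks out arithmetically if you assume a round 7.0B parameters at 2 bytes each for the FP16 source (the exact parameter count differs slightly):

```python
# Sanity-check the quoted compression ratio for Mistral 7B.
fp16_mb = 7.0e9 * 2 / 1e6   # 7B weights x 2 bytes (FP16) = 14,000 MB
ibf_mb = 3971                # .ibf file size from the table above
ratio = fp16_mb / ibf_mb     # ~3.5x
```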

How it works

  1. Convert: reads safetensors/GGUF weights, quantizes to INT4 (MLP layers) and INT8 (attention/embeddings), packs into an optimized .ibf binary format
  2. Load: memory-maps the .ibf file for instant loading
  3. Run: SIMD-optimized kernels (NEON on ARM, AVX2 on x86) with multi-threaded matmul and parallel attention heads

The .ibf format is designed for fast loading: 64-byte aligned, mmap-friendly, no parsing at load time.
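The mmap-friendly idea is easy to demonstrate with the standard library: write a fixed-size, 64-byte-aligned header, map the file, and read fields in place with no parsing pass. The layout below is invented for illustration and is not the real .ibf format:

```python
import mmap
import os
import struct
import tempfile

# Write a toy file: 4-byte magic, two uint32 fields, padded to 64 bytes.
path = os.path.join(tempfile.mkdtemp(), "toy.ibf")
with open(path, "wb") as f:
    f.write(struct.pack("<4sII52x", b"TOYF", 1, 32))  # magic, version, n_layers

# Map it and read the header fields directly from the mapped bytes.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic, version, n_layers = struct.unpack_from("<4sII", mm, 0)
    mm.close()
```

Because nothing is copied or deserialized up front, load time is independent of model size; pages fault in as the weights are first touched.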

Configuration

Quantization

Flag               Default   Description
--bits             4         Weight quantization width (2, 4, or 8)
--sensitive-bits   8         Bits for attention/embedding weights
--sparsity         0.0       Structured sparsity ratio (0.0-0.6)

Generation

Flag            Default   Description
--temperature   0.7       Sampling temperature
--top-k         40        Top-K sampling
--top-p         0.9       Nucleus (top-p) sampling
--max-tokens    512       Maximum tokens to generate
--threads       auto      Number of CPU threads
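Top-p (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches the threshold. A minimal sketch of just the filtering step (a real sampler also renormalizes the kept set and combines this with temperature and top-k):

```python
def top_p_filter(probs, top_p=0.9):
    """Return token indices in the nucleus: highest-probability tokens
    whose cumulative probability first reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return kept
```

Lower top-p values shrink the nucleus toward greedy decoding; higher values admit more of the tail.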

Architecture

libinferbit (C shared library)
    |
    +-- Python: pip install inferbit
    +-- Node.js: npm install @inferbit/node (coming soon)

Single C engine, multiple language bindings. Same model, same results, any language.

License

MIT

Download files

No source distribution is published for this release; install from one of the wheels below.

Wheel                                                Size      Platform
inferbit-0.3.0-py3-none-win_arm64.whl                325.5 kB  Windows ARM64
inferbit-0.3.0-py3-none-win_amd64.whl                355.8 kB  Windows x86-64
inferbit-0.3.0-py3-none-manylinux_2_17_x86_64.whl    400.0 kB  manylinux (glibc 2.17+) x86-64
inferbit-0.3.0-py3-none-manylinux_2_17_aarch64.whl   385.7 kB  manylinux (glibc 2.17+) ARM64
inferbit-0.3.0-py3-none-macosx_11_0_arm64.whl        183.0 kB  macOS 11.0+ ARM64
inferbit-0.3.0-py3-none-macosx_10_15_x86_64.whl      207.5 kB  macOS 10.15+ x86-64

File hashes (SHA256)

inferbit-0.3.0-py3-none-win_arm64.whl                fea9378bb34b334e2dc47f964b837629f184bc559392ef0d041da6ae61227ea3
inferbit-0.3.0-py3-none-win_amd64.whl                d863187fe55e4290928413d84e4cec9fb37cf7c6dd27a2ea111b337b56636877
inferbit-0.3.0-py3-none-manylinux_2_17_x86_64.whl    3554590a3828aee8a6c97d6b0f94f6a2e614a534cf8b2a489ba4708f568634b4
inferbit-0.3.0-py3-none-manylinux_2_17_aarch64.whl   c40f7ca3aaf8cc09cd045ecb38d32c8551aa71e64677af0f07821f62fe006cc3
inferbit-0.3.0-py3-none-macosx_11_0_arm64.whl        8b33dfafcafe553112d41602a077e52c98c0f1b7a331257e9214fa6ff5b94182
inferbit-0.3.0-py3-none-macosx_10_15_x86_64.whl      7c6767c2f13503f0227d311ef18294ad0bf79755055475afd1b0da3b12070f8d

Every wheel was published via Trusted Publishing (twine 6.1.0, CPython 3.13.12) with a provenance attestation from release.yml on demonarch/inferbit-py. Attestation values reflect the state at signing time and may no longer be current.
