
InferBit

Run any open LLM on CPU. One command.

pip install inferbit[cli]
inferbit quantize mistralai/Mistral-7B-Instruct-v0.3 -o model.ibf
inferbit chat model.ibf

InferBit converts HuggingFace models to optimized INT4 and runs them on any CPU (Apple Silicon, x86) with no GPU, no Docker, and no complex setup.

Install

# Library only
pip install inferbit

# Library + CLI
pip install inferbit[cli]

# Everything (library + CLI + server)
pip install inferbit[all]

Requires Python 3.9+. Works on macOS (ARM/Intel) and Linux (x86_64).

Quickstart

Command line

# Convert any HuggingFace model to INT4
inferbit quantize meta-llama/Llama-3.2-1B -o llama.ibf

# Convert a local safetensors file
inferbit quantize ./model.safetensors -o model.ibf

# Convert from Ollama (if installed)
inferbit quantize ollama://llama3:8b -o llama3.ibf

# Interactive chat
inferbit chat model.ibf

# Benchmark
inferbit bench model.ibf --tokens 128 --runs 3

# Model info
inferbit info model.ibf

# Serve with OpenAI-compatible API
inferbit serve model.ibf --port 8000
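Since `inferbit serve` advertises an OpenAI-compatible API, a client can talk to it with nothing but the standard library. A minimal sketch, assuming the server exposes the usual `/v1/chat/completions` route and request fields (check `inferbit serve --help` for the exact contract; the route and field names here are assumptions, not taken from InferBit's docs):

```python
import json
from urllib.request import Request, urlopen

# Standard OpenAI-style chat completion payload. Field names follow the
# OpenAI API convention; InferBit's server may accept a subset.
payload = {
    "model": "model.ibf",
    "messages": [{"role": "user", "content": "Explain gravity in one sentence."}],
    "max_tokens": 64,
    "temperature": 0.7,
}
body = json.dumps(payload).encode("utf-8")

req = Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# With a server running on port 8000, uncomment to send the request:
# with urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Because the wire format matches OpenAI's, existing OpenAI client libraries pointed at `http://localhost:8000/v1` should also work unchanged.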

Python API

from inferbit import InferbitModel

# Load from HuggingFace (downloads, converts, and loads automatically)
model = InferbitModel.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    bits=4,
)

# Generate text
output = model.generate("Explain gravity in one sentence:")
print(output)
# "Gravity is the force that attracts objects with mass towards each other."

# Stream tokens
for token in model.stream("Write a haiku about mountains:"):
    print(token, end="", flush=True)

# Or load a pre-converted model
model = InferbitModel.load("model.ibf")

Convert separately

from inferbit import convert

# Convert safetensors to IBF
convert("model.safetensors", "model.ibf", bits=4, sensitive_bits=8)

# Convert a HuggingFace directory (with config.json + sharded safetensors)
convert("./model_dir/", "model.ibf", bits=4)

# Convert with progress callback
convert("model.safetensors", "model.ibf", progress=lambda pct, stage: print(f"{pct:.0%} {stage}"))

Token-level API

from inferbit import InferbitModel

model = InferbitModel.load("model.ibf")

# Work with raw token IDs
token_ids = model.generate_tokens([1, 2, 3, 4, 5], max_tokens=20, temperature=0.7)

# Get raw logits
logits = model.forward([1, 2, 3])

# KV cache control
model.kv_clear()
model.kv_truncate(512)
print(model.kv_length)
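To make the `kv_clear` / `kv_truncate` / `kv_length` semantics concrete: a KV cache holds one key/value pair per processed token, and truncating to `n` keeps the first `n` positions so generation can resume from there without re-running the prefix. A toy stdlib illustration of those semantics (not InferBit's implementation):

```python
class ToyKVCache:
    """Toy model of KV-cache semantics: one cached entry per token position."""

    def __init__(self):
        self._entries = []  # (key, value) per token position

    def append(self, key, value):
        self._entries.append((key, value))

    @property
    def length(self):
        return len(self._entries)

    def clear(self):
        self._entries = []

    def truncate(self, n):
        # Keep positions [0, n); truncating past the end is a no-op.
        del self._entries[n:]

cache = ToyKVCache()
for pos in range(1000):
    cache.append(f"k{pos}", f"v{pos}")
cache.truncate(512)
print(cache.length)  # 512
```

This is why `kv_truncate(512)` is useful for multi-turn chat: the shared conversation prefix stays cached while later turns are rolled back.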

Model info

model = InferbitModel.load("model.ibf")
print(model.architecture)    # "llama"
print(model.num_layers)      # 32
print(model.hidden_size)     # 4096
print(model.vocab_size)      # 32768
print(model.max_context)     # 32768
print(model.bits)            # 4
print(model.total_memory_mb) # 3971.0

Quality-gated quantization

from inferbit import search_quantization_profile, EvalGates

# Automatically find the most aggressive quantization that meets quality targets
result = search_quantization_profile(
    "model.safetensors",
    output_dir="./models",
    gates=EvalGates(max_perplexity=10.0, min_tokens_per_sec=5.0),
)
print(f"Selected: {result.selected.name} ({result.selected.bits}-bit)")
print(f"Speed: {result.eval_result.tokens_per_sec:.1f} tok/s")
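Conceptually, the search tries profiles from most to least aggressive and keeps the first one that passes every gate. A hypothetical stdlib sketch of that loop (the profile names and evaluation numbers below are invented for illustration; `search_quantization_profile` is the real API):

```python
from dataclasses import dataclass

@dataclass
class Profile:
    name: str
    bits: int

@dataclass
class Gates:
    max_perplexity: float
    min_tokens_per_sec: float

def passes(gates, perplexity, tokens_per_sec):
    return (perplexity <= gates.max_perplexity
            and tokens_per_sec >= gates.min_tokens_per_sec)

def search(profiles, gates, evaluate):
    # Fewest bits first: the most aggressive profile that still meets
    # every quality gate wins.
    for profile in sorted(profiles, key=lambda p: p.bits):
        ppl, tps = evaluate(profile)
        if passes(gates, ppl, tps):
            return profile
    return None  # no profile met the gates

# Invented evaluation results, purely for illustration.
fake_eval = {2: (14.2, 9.1), 4: (8.7, 6.2), 8: (7.9, 3.8)}
profiles = [Profile("int2", 2), Profile("int4", 4), Profile("int8", 8)]
gates = Gates(max_perplexity=10.0, min_tokens_per_sec=5.0)
chosen = search(profiles, gates, lambda p: fake_eval[p.bits])
print(chosen.name)  # int4: int2 fails the perplexity gate
```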

Supported Sources

| Source | Example |
| --- | --- |
| HuggingFace Hub | `inferbit quantize mistralai/Mistral-7B-Instruct-v0.3` |
| Local safetensors | `inferbit quantize model.safetensors` |
| Sharded safetensors directory | `inferbit quantize ./model_dir/` |
| Local GGUF | `inferbit quantize model.gguf` |
| Ollama models | `inferbit quantize ollama://llama3:8b` |

Supported Models

Any LLaMA-family architecture with public weights:

  • LLaMA 2, LLaMA 3, LLaMA 3.2
  • Mistral, Mixtral
  • TinyLlama
  • Code Llama
  • And any model with the same architecture (GQA/MQA/MHA, RMSNorm, SiLU, RoPE)

Benchmarks

Apple Silicon, INT4 weights with INT8 attention, 8 threads:

| Model | File size | Decode speed | Quality |
| --- | --- | --- | --- |
| TinyLlama 1.1B | 643 MB | 34.6 tok/s | Good |
| Mistral 7B | 3,971 MB | 6.8 tok/s | Excellent |

Compression: 3.5x vs FP16 source. No retraining required.
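The 3.5x figure is consistent with a mostly-INT4 weight mix. A back-of-envelope check, where the 85/15 INT4/INT8 split is an illustrative assumption rather than a measured number:

```python
# Rough sanity check of the ~3.5x figure: most parameters in INT4 (MLP),
# a smaller share in INT8 (attention/embeddings). The 85/15 split is
# illustrative, and per-block scale overhead would shave a little off.
fp16_bits = 16.0
int4_share, int8_share = 0.85, 0.15
avg_bits = int4_share * 4 + int8_share * 8   # 4.6 effective bits per weight
ratio = fp16_bits / avg_bits
print(f"{ratio:.2f}x")  # 3.48x
```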

How it works

  1. Convert: reads safetensors/GGUF weights, quantizes to INT4 (MLP layers) and INT8 (attention/embeddings), packs into an optimized .ibf binary format
  2. Load: memory-maps the .ibf file for instant loading
  3. Run: SIMD-optimized kernels (NEON on ARM, AVX2 on x86) with multi-threaded matmul and parallel attention heads
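Step 1 can be sketched as symmetric per-block absmax quantization: each block of weights shares one floating-point scale and stores small integers, INT4 values in [-8, 7]. This is the generic technique, not necessarily InferBit's exact scheme:

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block of FP weights.

    Generic sketch: one scale per block, integers in
    [-2**(bits-1), 2**(bits-1) - 1]. Not InferBit's exact scheme.
    """
    qmax = 2 ** (bits - 1) - 1                        # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

block = [0.31, -0.72, 0.05, 1.4, -1.1, 0.0, 0.9, -0.4]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
# Round-trip error is bounded by half a quantization step (scale / 2).
err = max(abs(a - b) for a, b in zip(block, restored))
print(q)
```

Eight FP16 weights (16 bytes) become eight 4-bit codes plus one scale, which is where the bulk of the compression comes from.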

The .ibf format is designed for fast loading: 64-byte aligned, mmap-friendly, no parsing at load time.
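"Mmap-friendly, no parsing at load time" means fields live at fixed, aligned offsets and are read in place from the mapped file. A toy stdlib illustration with a hypothetical header layout (magic, version, bits, padded to 64 bytes; this is NOT the real .ibf spec):

```python
import mmap
import struct
import tempfile

# Hypothetical fixed-offset header: magic, format version, weight bits.
HEADER = struct.Struct("<4sII")

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(HEADER.pack(b"TOYF", 1, 4).ljust(64, b"\x00"))  # 64-byte aligned header
    f.write(b"\x00" * 128)                                  # stand-in for packed weights
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Fields are decoded in place at known offsets: no parsing pass,
    # no copy, so "loading" is essentially just the mmap call.
    magic, version, bits = HEADER.unpack_from(mm, 0)
    weights = mm[64:]             # payload begins at an aligned offset
    mm.close()

print(magic, version, bits, len(weights))  # b'TOYF' 1 4 128
```

The aligned payload offset also matters for SIMD kernels, which read fastest from aligned addresses.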

Configuration

Quantization

| Flag | Default | Description |
| --- | --- | --- |
| `--bits` | 4 | Weight quantization bits (2, 4, or 8) |
| `--sensitive-bits` | 8 | Attention/embedding quantization bits |
| `--sparsity` | 0.0 | Structured sparsity ratio (0.0-0.6) |

Generation

| Flag | Default | Description |
| --- | --- | --- |
| `--temperature` | 0.7 | Sampling temperature |
| `--top-k` | 40 | Top-k sampling |
| `--top-p` | 0.9 | Nucleus (top-p) sampling |
| `--max-tokens` | 512 | Maximum tokens to generate |
| `--threads` | auto | CPU threads |
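The sampling flags compose in the usual order: scale logits by temperature, keep the k highest-scoring tokens, then keep the smallest prefix whose probability mass reaches p, and draw from what remains. A generic sketch of that standard pipeline (not InferBit's kernel):

```python
import math
import random

def sample(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    """Standard temperature / top-k / top-p (nucleus) sampling over raw logits."""
    # 1. Temperature: <1 sharpens the distribution, >1 flattens it.
    scaled = [l / temperature for l in logits]
    # 2. Top-k: keep only the k highest-scoring token ids.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    m = max(scaled[i] for i in ranked)
    exps = [(i, math.exp(scaled[i] - m)) for i in ranked]
    total = sum(e for _, e in exps)
    probs = [(i, e / total) for i, e in exps]
    # 3. Top-p: smallest prefix (already sorted by probability) with mass >= p.
    kept, mass = [], 0.0
    for i, prob in probs:
        kept.append((i, prob))
        mass += prob
        if mass >= top_p:
            break
    # 4. Draw from the renormalized remainder.
    r = rng.random() * mass
    for i, prob in kept:
        r -= prob
        if r <= 0:
            return i
    return kept[-1][0]

token = sample([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=3, top_p=0.9)
print(token)  # 0 or 1: with these inputs, top-p drops id 2 from the nucleus
```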

Architecture

libinferbit (C shared library)
    |
    +-- Python: pip install inferbit
    +-- Node.js: npm install @inferbit/node (coming soon)

Single C engine, multiple language bindings. Same model, same results, any language.

License

MIT


