Run 70B+ LLMs on a single 4GB GPU — no quantization required. Layer-streaming inference for consumer hardware.

RabbitLLM

RabbitLLM is a fork of AirLLM. It enables inference on large language models (70B+ parameters) on consumer GPUs with as little as 4GB VRAM by streaming model layers one at a time through GPU memory. No quantization, distillation, or pruning needed — full model quality.

Compatibility (current status)

  • Tested and supported: Qwen2 (including Qwen2.5) and Qwen3 only. Use these families for reliable results.
  • Other architectures (Llama, Mistral, Mixtral, etc.) are present in the codebase but not yet compatible. Use them at your own risk.
  • Apple (macOS / Apple Silicon) is not supported; run on Linux or Windows with a CUDA-capable GPU (or CPU fallback on x86/ARM Linux).

How it works

Instead of loading the entire model into GPU memory, RabbitLLM:

  1. Splits the HuggingFace checkpoint into per-layer safetensors files (once, on first use).
  2. Streams each layer individually: load to GPU → forward pass → free GPU memory.
  3. Prefetches the next layer in a background thread while the current layer is computing.
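
The streaming and prefetching steps above can be illustrated with a toy simulation. This is a minimal sketch using only the standard library, not RabbitLLM's actual implementation: `load_layer`, `forward`, and `stream_forward` are hypothetical names, and a layer is modeled as a plain dictionary rather than a set of GPU tensors.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    """Stand-in for reading one layer's weights from disk (hypothetical)."""
    # Toy "weights": layer i adds i + 1 to the activation.
    return {"index": index, "bias": index + 1}


def forward(layer, activation):
    """Stand-in for one layer's forward pass."""
    return activation + layer["bias"]


def stream_forward(num_layers, activation):
    """Run all layers while holding at most two of them at any time."""
    if num_layers == 0:
        return activation
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)  # start loading layer 0
        for i in range(num_layers):
            layer = pending.result()  # wait until the current layer is loaded
            if i + 1 < num_layers:
                # Prefetch the next layer in the background thread.
                pending = pool.submit(load_layer, i + 1)
            activation = forward(layer, activation)  # overlaps with the load
            del layer  # drop the reference, mimicking "free GPU memory"
    return activation


print(stream_forward(4, 0))  # 1 + 2 + 3 + 4 = 10
```

Only the current layer and the one being prefetched exist at the same time, which is why peak memory depends on the size of a single layer rather than the whole model.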

Optional 4-bit/8-bit block-wise compression (via bitsandbytes) shrinks each layer further, giving up to a 3× speed-up in layer loading with minimal accuracy loss.

Installation

pip install rabbitllm

Optional — Flash Attention 2 (faster on Ampere+ GPUs, e.g. RTX 30xx/40xx):

pip install rabbitllm[flash]

If the prebuilt wheel is unavailable for your setup, install from flashattn.dev. Without it, SDPA is used automatically.

Quickstart

from rabbitllm import AutoModel

model = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # or any Qwen2 / Qwen3

input_tokens = model.tokenizer(
    ["What is the capital of France?"],
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=128,
    padding=False,
)

output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=50,
    use_cache=True,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(output.sequences[0]))

AutoModel automatically detects the model architecture from the HuggingFace config — no need to pick the right class manually.

Supported models

Only Qwen2 and Qwen3 are tested and supported. The table below lists every architecture present in the codebase; all families other than Qwen2 and Qwen3 are not yet compatible.

| Family | Architectures | Class | Status |
| --- | --- | --- | --- |
| Qwen2 / Qwen2.5 / Qwen3 | Qwen2ForCausalLM, Qwen3ForCausalLM | RabbitLLMQWen2 | Tested, supported |
| Llama 2 / 3 / 3.1 / 3.2 | LlamaForCausalLM | RabbitLLMLlama2 | Not yet compatible |
| Qwen v1 | QWenLMHeadModel | RabbitLLMQWen | Not yet compatible |
| Mistral | MistralForCausalLM | RabbitLLMMistral | Not yet compatible |
| Mixtral | MixtralForCausalLM | RabbitLLMMixtral | Not yet compatible |
| InternLM | InternLMForCausalLM | RabbitLLMInternLM | Not yet compatible |
| ChatGLM | ChatGLMModel | RabbitLLMChatGLM | Not yet compatible |
| Baichuan | BaichuanForCausalLM | RabbitLLMBaichuan | Not yet compatible |
| Gemma 2 / 3 | Gemma2ForCausalLM, Gemma3ForCausalLM | RabbitLLMLlama2 | Not yet compatible |
| DeepSeek V2 / V3 | DeepseekV2ForCausalLM, DeepseekV3ForCausalLM | RabbitLLMLlama2 | Not yet compatible |
| Phi 2 / 3 / 4 | Phi3ForCausalLM, Phi4ForCausalLM | RabbitLLMLlama2 | Not yet compatible |

Unknown architectures fall back to the Llama-based implementation with a warning.
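
The lookup-with-fallback behavior can be pictured as a dictionary lookup on the `architectures` list found in a HuggingFace `config.json`. The sketch below is illustrative only: `ARCH_TO_CLASS`, `FALLBACK_CLASS`, and `resolve_class` are hypothetical names, the mapping is copied from part of the table above, and class names are kept as strings so the example runs without RabbitLLM installed. It also assumes the fallback class is `RabbitLLMLlama2`.

```python
import warnings

# Architecture name -> wrapper class name (subset of the table above).
ARCH_TO_CLASS = {
    "Qwen2ForCausalLM": "RabbitLLMQWen2",
    "Qwen3ForCausalLM": "RabbitLLMQWen2",
    "LlamaForCausalLM": "RabbitLLMLlama2",
    "QWenLMHeadModel": "RabbitLLMQWen",
    "MistralForCausalLM": "RabbitLLMMistral",
    "MixtralForCausalLM": "RabbitLLMMixtral",
}

# Assumption: unknown architectures map to the Llama-based class.
FALLBACK_CLASS = "RabbitLLMLlama2"


def resolve_class(architectures):
    """Return the wrapper class name for a config's `architectures` list."""
    for name in architectures:
        if name in ARCH_TO_CLASS:
            return ARCH_TO_CLASS[name]
    warnings.warn(
        f"Unknown architecture {architectures!r}; "
        f"falling back to {FALLBACK_CLASS}"
    )
    return FALLBACK_CLASS


print(resolve_class(["Qwen3ForCausalLM"]))  # RabbitLLMQWen2
```

A successful lookup says nothing about compatibility: per the Status column, only the Qwen2 and Qwen3 entries are tested.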

Configuration

model = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-72B-Instruct",
    compression="4bit",          # "4bit" | "8bit" | None (default)
    attn_implementation="auto",  # "auto" | "flash_attention_2" | "sdpa" | "eager"
    max_seq_len=512,             # maximum sequence length
    prefetching=True,            # overlap layer loading with compute
    prefetch_pin_memory=True,    # faster CPU→GPU for small/medium models
    token="hf_...",              # HuggingFace token for gated repos
    layer_shards_saving_path="/path/to/cache",  # custom split cache directory
    profiling_mode=False,        # print per-layer timing
    delete_original=False,       # delete original shards after splitting
)

Compression

Block-wise quantization reduces on-disk and in-memory layer size:

  • 4-bit (NF4): ~28% of original size, up to 3× faster loading, minimal quality loss.
  • 8-bit: ~50% of original size.

model = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", compression="4bit")

Requires bitsandbytes: pip install bitsandbytes.
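
The ~28% and ~50% figures are consistent with simple block-wise arithmetic: each block of weights stores one extra scale value on top of the quantized weights. The sketch below assumes 16-bit original weights, a 32-bit scale per block, and block sizes of 64 (4-bit) and 4096 (8-bit). These values are assumptions for illustration and are not stated in this README; the function names are hypothetical.

```python
def bits_per_weight(quant_bits, block_size, scale_bits=32):
    """Effective bits per weight when every block also stores one scale."""
    return quant_bits + scale_bits / block_size


def fraction_of_fp16(quant_bits, block_size):
    """Compressed size as a fraction of a 16-bit original."""
    return bits_per_weight(quant_bits, block_size) / 16


def model_size_gb(num_params, fraction=1.0):
    """Approximate size in GB (10^9 bytes), starting from 2 bytes per weight."""
    return num_params * 2 * fraction / 1e9


nf4 = fraction_of_fp16(4, 64)      # (4 + 32/64) / 16 = 0.28125
int8 = fraction_of_fp16(8, 4096)   # (8 + 32/4096) / 16, about 0.5005

params = 72e9  # nominal parameter count for a 72B model
print(f"fp16 : {model_size_gb(params):.1f} GB")        # 144.0 GB
print(f"8-bit: {model_size_gb(params, int8):.1f} GB")  # 72.1 GB
print(f"4-bit: {model_size_gb(params, nf4):.1f} GB")   # 40.5 GB
```

Because layer streaming is dominated by moving weights from disk to the GPU, a smaller per-layer footprint translates fairly directly into faster loading.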

Gated models

Pass a HuggingFace token for repos that require access approval:

model = AutoModel.from_pretrained("Qwen/Qwen2.5-7B-Instruct", token="hf_YOUR_TOKEN")

Or set the HF_TOKEN environment variable.

Local model cache

To keep model downloads local and out of git, set HF_HOME before running:

export HF_HOME="$(pwd)/models"

The models/ directory is in .gitignore. RabbitLLM will store split layers alongside the HuggingFace cache.
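
The same setting can be applied from Python instead of the shell. This is a small sketch, assuming the variable must be set before `rabbitllm` (or any HuggingFace library) is imported, since cache locations are typically resolved at import time; `models_dir` is an illustrative name.

```python
import os
from pathlib import Path

# Set HF_HOME before importing rabbitllm / transformers / huggingface_hub.
# setdefault keeps any value already exported in the shell.
models_dir = Path.cwd() / "models"
os.environ.setdefault("HF_HOME", str(models_dir))

print(os.environ["HF_HOME"])
```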

Development

# Install with dev dependencies
pip install uv
uv sync --extra dev
# or: make install

# Run tests
make test

# Lint and format
make lint
make format

# Type check
make typecheck

FAQ

MetadataIncompleteBuffer on first run

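This safetensors error usually means a layer file was truncated, most often because the disk filled up while the model was being split. Splitting is disk-intensive: you need free space roughly equal to the model's size in the split output directory.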
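
Free space can be checked before the first run. The sketch below uses only the standard library; `has_room_for_split` and its `margin` parameter are hypothetical helpers, not part of RabbitLLM, and the 144 GB figure assumes a 72B-parameter model stored at 2 bytes per weight.

```python
import shutil


def has_room_for_split(path, model_size_bytes, margin=1.1):
    """Return True if `path` has at least model_size_bytes * margin free."""
    free = shutil.disk_usage(path).free
    return free >= model_size_bytes * margin


# Example: a ~144 GB checkpoint, checked against the current directory.
print(has_room_for_split(".", 144e9))
```

Point `path` at the directory where split layers are written, for example the value passed as `layer_shards_saving_path`.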

ValueError: max() arg is an empty sequence

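You are likely loading a Qwen or ChatGLM model with the wrong class. Use AutoModel, which selects the class from the model config (note that Qwen v1 and ChatGLM are currently listed as not yet compatible):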

from rabbitllm import AutoModel
model = AutoModel.from_pretrained("Qwen/Qwen-7B")

ValueError: Asking to pad but the tokenizer does not have a padding token

Turn off padding:

input_tokens = model.tokenizer(text, padding=False, truncation=True, max_length=128, return_tensors="pt")

License

MIT
