

Project description

RabbitLLM

RabbitLLM logo

Run 70B+ LLMs on a single 4GB GPU — no quantization required.

PyPI Python 3.10+ License: MIT CI Buy Me a Coffee

RabbitLLM is a fork of AirLLM. It runs inference with large language models (70B+ parameters) on consumer GPUs with as little as 4 GB of VRAM by streaming model layers through GPU memory one at a time. No quantization, distillation, or pruning is required; full model quality is preserved.

Compatibility (current status)

  • Tested and supported: only the Qwen2 and Qwen3 families. Use these for reliable results.
  • Other architectures (Llama, Mistral, Mixtral, etc.) are present in the codebase but not yet compatible — use at your own risk.
  • Apple (macOS / Apple Silicon) is not supported; run on Linux or Windows with a CUDA-capable GPU (or CPU fallback on x86/ARM Linux).

How it works

Instead of loading the entire model into GPU memory, RabbitLLM:

  1. Splits the HuggingFace checkpoint into per-layer safetensors files (once, on first use).
  2. Streams each layer individually: load to GPU → forward pass → free GPU memory.
  3. Prefetches the next layer in a background thread while the current layer is computing.

Optional 4-bit/8-bit block-wise compression (via bitsandbytes) shrinks each layer further, giving up to a 3× load-time speed-up with minimal accuracy loss.
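The three steps above can be sketched as a producer/consumer loop (a minimal illustration, not RabbitLLM's actual internals; `load_layer` and `forward` are stand-ins for the real shard loading and per-layer forward pass):

```python
import threading
from queue import Queue

def load_layer(index):
    # Stand-in for reading one per-layer safetensors shard from disk.
    return {"index": index, "weights": [index] * 4}

def forward(hidden, layer):
    # Stand-in for one transformer layer's forward pass.
    return hidden + sum(layer["weights"])

def stream_layers(num_layers, hidden=0):
    queue = Queue(maxsize=1)          # at most one layer prefetched ahead

    def prefetcher():
        for i in range(num_layers):
            queue.put(load_layer(i))  # blocks until the consumer catches up

    threading.Thread(target=prefetcher, daemon=True).start()
    for _ in range(num_layers):
        layer = queue.get()           # already loaded while we were computing
        hidden = forward(hidden, layer)
        del layer                     # "free GPU memory" in the real pipeline
    return hidden

print(stream_layers(4))  # 0*4 + 1*4 + 2*4 + 3*4 = 24
```

The bounded queue is what gives the overlap: while layer *i* computes, the background thread is already reading layer *i+1*.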

Installation

pip install rabbitllm

Optional — Flash Attention 2 (faster on Ampere+ GPUs, e.g. RTX 30xx/40xx):

pip install rabbitllm[flash]

If the prebuilt wheel is unavailable for your setup, install from flashattn.dev. Without it, SDPA is used automatically.
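One plausible way an "auto" attention setting could resolve (a sketch under assumptions, not the library's actual code) is to probe for the `flash_attn` package and fall back to SDPA when it is absent:

```python
import importlib.util

# Hypothetical resolver: prefer Flash Attention 2 when the flash_attn
# package is importable, otherwise fall back to PyTorch's built-in SDPA.
def resolve_attn_implementation(requested="auto"):
    if requested != "auto":
        return requested  # explicit user choice is kept as-is
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

print(resolve_attn_implementation())        # "flash_attention_2" or "sdpa"
print(resolve_attn_implementation("eager")) # "eager"
```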

Docker

Build and run with GPU support (requires NVIDIA Container Toolkit on the host):

docker build -t rabbitllm .
docker run --gpus all -it rabbitllm python scripts/inference_example.py --model Qwen/Qwen2.5-0.5B-Instruct --max-new-tokens 20

Optional env vars: HF_TOKEN for gated models, HF_HOME for model cache directory. Example:

docker run --gpus all -e HF_TOKEN=hf_... -it rabbitllm python scripts/inference_example.py --model Qwen/Qwen2.5-7B-Instruct

Quickstart

import warnings
import torch
from rabbitllm import AutoModel

# Use GPU if available, otherwise CPU
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message=".*CUDA.*unknown error.*", category=UserWarning)
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

# compression: "4bit" (recommended), "8bit", or None (bfloat16)
model = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    device=device,
    compression="4bit",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is the capital of France?"},
]

input_text = model.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
tokens = model.tokenizer(
    [input_text], return_tensors="pt", truncation=True, max_length=512
)
input_ids = tokens["input_ids"].to(device)
attention_mask = tokens.get("attention_mask")
if attention_mask is None:
    attention_mask = torch.ones_like(input_ids, dtype=torch.long, device=device)
else:
    attention_mask = attention_mask.to(device)

output = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=200,
    use_cache=True,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    return_dict_in_generate=True,
)

# Decode only the newly generated tokens
input_len = tokens["input_ids"].shape[1]
print(model.tokenizer.decode(output.sequences[0][input_len:], skip_special_tokens=True))

AutoModel automatically detects the model architecture from the HuggingFace config — no need to pick the right class manually.

Supported models

Only Qwen2 and Qwen3 are tested and supported. The following table lists the architectures present in the codebase; others are not yet compatible.

| Family | Architectures | Class | Status |
| --- | --- | --- | --- |
| Qwen2 / Qwen2.5 / Qwen3 | Qwen2ForCausalLM, Qwen3ForCausalLM | RabbitLLMQWen2 | Tested, supported |
| Llama 2 / 3 / 3.1 / 3.2 | LlamaForCausalLM | RabbitLLMLlama2 | Not yet compatible |
| Qwen v1 | QWenLMHeadModel | RabbitLLMQWen | Not yet compatible |
| Mistral | MistralForCausalLM | RabbitLLMMistral | Not yet compatible |
| Mixtral | MixtralForCausalLM | RabbitLLMMixtral | Not yet compatible |
| InternLM | InternLMForCausalLM | RabbitLLMInternLM | Not yet compatible |
| ChatGLM | ChatGLMModel | RabbitLLMChatGLM | Not yet compatible |
| Baichuan | BaichuanForCausalLM | RabbitLLMBaichuan | Not yet compatible |
| Gemma 2 / 3 | Gemma2ForCausalLM, Gemma3ForCausalLM | RabbitLLMLlama2 | Not yet compatible |
| DeepSeek V2 / V3 | DeepseekV2ForCausalLM, DeepseekV3ForCausalLM | RabbitLLMLlama2 | Not yet compatible |
| Phi 2 / 3 / 4 | Phi3ForCausalLM, Phi4ForCausalLM | RabbitLLMLlama2 | Not yet compatible |

Unknown architectures fall back to the Llama-based implementation with a warning.
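The dispatch can be pictured as a simple lookup (a hypothetical sketch using class names from the table above; the real mapping lives inside rabbitllm):

```python
# Hypothetical architecture-to-class table; entries taken from the
# supported-models table above.
ARCH_TO_CLASS = {
    "Qwen2ForCausalLM": "RabbitLLMQWen2",
    "Qwen3ForCausalLM": "RabbitLLMQWen2",
    "LlamaForCausalLM": "RabbitLLMLlama2",
    "MistralForCausalLM": "RabbitLLMMistral",
}

def pick_class(architecture):
    if architecture not in ARCH_TO_CLASS:
        print(f"warning: unknown architecture {architecture}, "
              "falling back to the Llama-based implementation")
        return "RabbitLLMLlama2"
    return ARCH_TO_CLASS[architecture]

print(pick_class("Qwen2ForCausalLM"))  # RabbitLLMQWen2
print(pick_class("FooForCausalLM"))    # RabbitLLMLlama2, with a warning
```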

Configuration

model = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-72B-Instruct",
    compression="4bit",          # "4bit" | "8bit" | None (default)
    attn_implementation="auto",  # "auto" | "flash_attention_2" | "sdpa" | "eager"
    max_seq_len=512,             # maximum sequence length
    prefetching=True,            # overlap layer loading with compute
    prefetch_pin_memory=True,    # faster CPU→GPU for small/medium models
    use_gds=True,                # GPU Direct Storage (kvikio) when available
    kv_cache_dir=None,           # path to offload KV cache for long context (50k+ tokens)
    token="hf_...",              # HuggingFace token for gated repos
    layer_shards_saving_path="/path/to/cache",  # custom split cache directory
    profiling_mode=False,        # print per-layer timing
    delete_original=False,       # delete original shards after splitting
)

Compression

Block-wise quantization reduces on-disk and in-memory layer size:

  • 4-bit (NF4): ~28% of original size, up to 3× faster loading, minimal quality loss.
  • 8-bit: ~50% of original size.

model = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", compression="4bit")

Requires bitsandbytes: pip install bitsandbytes.
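A back-of-envelope calculation shows where a figure like ~28% comes from, assuming a bf16 baseline and NF4 with one fp32 absmax scale per 64-weight block (the block size and scale width are illustrative assumptions, not measured from the library):

```python
# 4 bits per weight, plus the per-block scale amortized over the block.
bits_per_weight_bf16 = 16
bits_per_weight_nf4 = 4
block_size = 64            # assumed quantization block size
scale_bits_per_block = 32  # assumed: one fp32 absmax per block

effective_bits = bits_per_weight_nf4 + scale_bits_per_block / block_size
ratio = effective_bits / bits_per_weight_bf16
print(f"{ratio:.1%}")  # 28.1%
```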

GPU Direct Storage (optional)

For CUDA without compression, install kvikio-cu12 to load layers directly from disk to GPU, bypassing CPU and pin_memory (can significantly speed up 70B+ models):

pip install rabbitllm[gds]
# or: pip install kvikio-cu12

Set use_gds=False to disable.

Long context (KV cache on disk)

For 50k+ token contexts, pass kv_cache_dir to offload KV cache to SSD:

model = AutoModel.from_pretrained("Qwen/Qwen2.5-72B-Instruct", kv_cache_dir="./kv_cache")
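The idea behind disk offload can be sketched with a toy cache (a minimal illustration, not RabbitLLM's DiskKVCache API): each layer's key/value block lives in a file and is read back by offset, so fast memory only ever holds one layer's cache at a time.

```python
import os
import struct
import tempfile

class TinyDiskKV:
    """Toy fixed-size-block KV store backed by a single file."""

    def __init__(self, path, block_floats):
        self.path = path
        self.block_bytes = block_floats * 4  # float32
        self.fmt = f"{block_floats}f"

    def put(self, layer, values):
        mode = "r+b" if os.path.exists(self.path) else "wb"
        with open(self.path, mode) as f:
            f.seek(layer * self.block_bytes)   # each layer has a fixed slot
            f.write(struct.pack(self.fmt, *values))

    def get(self, layer):
        with open(self.path, "rb") as f:
            f.seek(layer * self.block_bytes)
            return list(struct.unpack(self.fmt, f.read(self.block_bytes)))

path = os.path.join(tempfile.mkdtemp(), "kv.bin")
cache = TinyDiskKV(path, block_floats=4)
cache.put(0, [1.0, 2.0, 3.0, 4.0])
cache.put(1, [5.0, 6.0, 7.0, 8.0])
print(cache.get(1))  # [5.0, 6.0, 7.0, 8.0]
```

On an SSD, seek-by-offset reads like this are cheap enough that very long contexts stop being bounded by VRAM.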

Benchmark

Scripts to measure throughput and compare configurations:

| Script | What it measures |
| --- | --- |
| scripts/benchmark_improvements.py | GDS (GPU Direct Storage) and long-context DiskKVCache improvements |
| scripts/benchmark_cpu_vs_cuda.py | CPU vs CUDA inference with layer-streaming (same model and prompt) |
| scripts/check_attention_and_benchmark.py --benchmark | Throughput comparison: auto vs SDPA vs eager attention |

GDS and DiskKVCache:

# Local: make install pulls in kvikio (--extra gds)
make install
uv run python scripts/benchmark_improvements.py --mode gds
uv run python scripts/benchmark_improvements.py --mode long_context

# Docker (make bash): install with GDS first
pip install -e ".[gds]"
python scripts/benchmark_improvements.py --mode gds

Quick CPU vs CUDA comparison:

uv run python scripts/benchmark_cpu_vs_cuda.py
uv run python scripts/benchmark_cpu_vs_cuda.py --model Qwen/Qwen2.5-1.5B-Instruct --runs 3

Attention implementation (auto vs SDPA vs eager):

uv run python scripts/check_attention_and_benchmark.py --benchmark

Detailed results and per-step breakdown for Qwen2.5-72B (e.g. pin_memory, async, 4-bit) are in docs/BENCHMARK_HISTORY.md.

Gated models

Pass a HuggingFace token for repos that require access approval:

model = AutoModel.from_pretrained("Qwen/Qwen2.5-7B-Instruct", token="hf_YOUR_TOKEN")

Or set the HF_TOKEN environment variable.

Local model cache

To keep model downloads local and out of git, set HF_HOME before running:

export HF_HOME="$(pwd)/models"

The models/ directory is in .gitignore. RabbitLLM will store split layers alongside the HuggingFace cache.

Documentation

Development

# Install with dev dependencies
pip install uv
uv sync --extra dev
# or: make install

# Run tests
make test

# Lint and format
make lint
make format

# Type check
make typecheck

FAQ

MetadataIncompleteBuffer on first run

The model splitting process is disk-intensive. Check available space — you need roughly the model size free in the split output directory.
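A quick pre-flight check with the standard library can catch this before splitting starts (the 14 GB figure is an illustrative estimate for a 7B model in bf16, not a value from the library):

```python
import shutil

def enough_space(path, model_bytes):
    # Compare free bytes in the split output directory against the
    # estimated on-disk model size.
    free = shutil.disk_usage(path).free
    return free >= model_bytes

# e.g. a 7B model in bf16 is roughly 14 GB on disk
print(enough_space(".", 14 * 1024**3))
```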

ValueError: max() arg is an empty sequence

You are likely loading a Qwen or ChatGLM model with the wrong class. Use AutoModel:

from rabbitllm import AutoModel
model = AutoModel.from_pretrained("Qwen/Qwen-7B")

ValueError: Asking to pad but the tokenizer does not have a padding token

Turn off padding:

input_tokens = model.tokenizer(text, padding=False, truncation=True, max_length=128, return_tensors="pt")

License

MIT

Project details


Download files

Download the file for your platform.

Source Distribution

rabbitllm-1.1.0.tar.gz (281.8 kB)

Uploaded Source

Built Distribution


rabbitllm-1.1.0-py3-none-any.whl (71.4 kB)

Uploaded Python 3

File details

Details for the file rabbitllm-1.1.0.tar.gz.

File metadata

  • Download URL: rabbitllm-1.1.0.tar.gz
  • Upload date:
  • Size: 281.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rabbitllm-1.1.0.tar.gz
Algorithm Hash digest
SHA256 0121f0a8313df0ec944579ac08a5d70edcb7caf37ce80aa7e66b4fa5e2e934aa
MD5 74bf3d8e72043921f87c4888eb50701b
BLAKE2b-256 e3bba8e81b48d55720582aa6f55e5c992e282474ccd016b24ba1ab7c89269451


Provenance

The following attestation bundles were made for rabbitllm-1.1.0.tar.gz:

Publisher: publish.yml on ManuelSLemos/RabbitLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rabbitllm-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: rabbitllm-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 71.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rabbitllm-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c3b0c6477c68730b889040cfe2f838aa33c9a0780cb118ec48620efcfc06cd90
MD5 04f98c44193e2c1e05ec0a77bb3a4b90
BLAKE2b-256 38b42ff2edcfa549ef85fdc0a6b4f2d7cadf102ee8012849872db77d5d80a7f1


Provenance

The following attestation bundles were made for rabbitllm-1.1.0-py3-none-any.whl:

Publisher: publish.yml on ManuelSLemos/RabbitLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
