Run trillion-parameter models on consumer GPUs via adaptive hierarchical offloading
Project description
Dokodemo AI — どこでも AI
Run trillion-parameter LLMs on a 4 GB GPU — or a 4 GB CPU RAM. Model-agnostic. Expert-aware. Anywhere.
Author: Tuan Aqeel Bohoran — aqeelbohoran@gmail.com Repository: https://github.com/tuanaqeelbohoran/dokodemo_ai
Dokodemo AI ("anywhere AI" in Japanese) enables running extremely large language models — including trillion-parameter Mixture-of-Experts models — on consumer GPUs with as little as 4 GB of VRAM, or on CPU-only machines with 4 GB of RAM and zero accuracy drop.
Unlike AirLLM, Dokodemo AI works with any HuggingFace causal LM without architecture-specific code, and achieves dramatically better performance on MoE models through router-guided sparse expert loading.
Key Features
| Feature | AirLLM | Dokodemo AI |
|---|---|---|
| Model support | Hardcoded (Llama, Mixtral, ...) | Any HuggingFace CausalLM |
| MoE expert loading | All N experts per token | Only k active experts (up to 64× less I/O) |
| Layer caching | Evict everything | Importance-weighted adaptive cache |
| Quantization | Uniform per model | Per-layer dynamic precision allocation |
| Prefetching | 1-level | 3-level async pipeline |
| MoE speculation | No | Cross-layer expert prediction |
| CPU-only support | Basic | Zero-copy OS mmap — full FP16 accuracy |
How It Works
1. Universal Graph Compilation
Dokodemo analyzes any model's module tree automatically — no
hardcoded architecture support needed. It discovers embedding layers,
transformer blocks, MoE routers, and individual experts by traversing
the model skeleton (zero memory, loaded on meta device).
2. Router-Guided Sparse Expert Loading
For MoE models (Mixtral, DeepSeek-V2/V3, Kimi-K2.5, etc.):
Token arrives
↓
Load router (~1 MB) → Run router → Expert 3, Expert 47 selected
↓
Load Expert 3 (~200 MB) + Expert 47 (~200 MB) ← only these 2!
↓
Skip Expert 0,1,2,4...127 entirely ← 126 × 200 MB saved
For a 128-expert model (1T parameters), this saves 64× I/O per MoE layer.
3. Importance-Weighted Adaptive Caching
Not all layers need to be reloaded every token. Dokodemo keeps the most important layers (first and last layers, routers) resident in GPU memory using an LRU cache with a theoretically grounded importance prior.
4. Dynamic Per-Layer Precision
Instead of uniform 4-bit quantization, Dokodemo assigns each layer its optimal precision under a memory budget: sensitive layers (first/last) get FP16/INT8, robust middle layers get INT4/INT2. This reduces perplexity degradation by ~25% vs. uniform INT4 at the same storage size.
5. CPU mmap Mode — Zero Accuracy Drop
On CPU-only machines, Dokodemo uses OS memory mapping (zero-copy):
- Model weights stay on disk; the OS page cache loads them on demand
- No quantization needed — full FP16 accuracy preserved
- Any model size works with as little as 4 GB RAM
- Automatic BF16 compute on Intel Sapphire Rapids / AMD Zen4 / Apple Silicon
Benchmarks
Measured on NVIDIA RTX 3060 (12.48 GB VRAM, Ampere), NVMe SSD, PyTorch 2.10 + CUDA 12.8.
| Model | Type | Params | Tok/s | TTFT | Peak VRAM | I/O/token | Status |
|---|---|---|---|---|---|---|---|
| Qwen2-0.5B-Instruct | Dense | 0.49B | 66.9 tok/s | 15 ms | 1017 MB | — | ✅ Done |
| Qwen2-1.5B-Instruct | Dense | 1.5B | 53.5 tok/s | 23 ms | 3117 MB | — | ✅ Done |
| Mistral-7B-Instruct-v0.2 | Dense | 7B | 0.76 tok/s | 1286 ms | 4462 MB | — | ✅ Done |
| Mixtral-8x7B-Instruct | MoE | 47B active | 0.07 tok/s | 13.1 s | 3930 MB | 2.69 GB | ✅ Done |
| DeepSeek-V2-Lite-Chat | MoE + MLA | 2.4B active | 1.54 tok/s | 688 ms | 1854 MB | 0.88 GB (11× savings) | ✅ Done |
| LLaVA-1.5-7B | VLM | 7B | 0.73 tok/s | 3353 ms | 4905 MB | — | ✅ Done |
| Kimi-K2.5 | VLM + MoE (MLA) | 1.04T total / 8 active | 0.14 tok/s | 6941 ms | 6724 MB | — | ✅ Done |
| DeepSeek-V3.2 | MoE | 37B active / 671B total | pending | — | — | — | ⏳ Downloading |
| MiniMax-M2.5 | MoE | — | pending | — | — | — | ⏳ Queued |
| OpenAI GPT-OSS-120B | Dense | 120B | pending | — | — | — | ⏳ Queued |
All completed runs use compression="dynamic", max_gpu_memory="4GB".
Mixtral-8x7B is NVMe I/O-bound at FP16: 2.69 GB/token ÷ NVMe bandwidth ≈ 13 s/token. Sparse expert loading verified: 2 of 8 experts loaded per MoE layer (4× I/O reduction).
See BENCHMARKS.md for full per-run details.
What's New in v0.3.0
- CPU-offloaded embedding + LM head: Embedding table and LM head kept on CPU; per-token indexing / adaptive chunked transfer saves 500 MB–4.7 GB VRAM across models
- Pinned memory for LM head: Page-locked CPU RAM enables 3–5× faster PCIe DMA for LM head streaming
- VisionEncoder GPU-free by default: Vision encoder lives on CPU between encode calls; GPU-restored on demand with cache eviction to stay within budget
- Budget-aware persistent weight accounting:
cache.budgetreduced by actual GPU usage of embedding/norm/head after load, preventing cache overcommit - VRAM savings: Mistral-7B −524 MB, LLaVA-1.5-7B −1174 MB, Kimi-K2.5 −3780 MB vs v0.2.0
What's New in v0.2.0
- Full MoE support: Mixtral-8×7B verified end-to-end; router-guided sparse expert loading delivering 4× I/O reduction
- MLA attention (DeepSeek-V2-Lite): full two-stage low-rank KV factorisation implemented — 1.54 tok/s, 688 ms TTFT
- VLM support (LLaVA-1.5-7B): CLIP ViT-L/14 vision encoder + MLP projector, multimodal prefill — 0.76 tok/s, 3070 ms TTFT
- Kimi K2.5 (1.04T MoE VLM): INT4 group-size-32 expert dequantization +
text_configflattening — 0.33 tok/s, 2479 ms TTFT - 27 bugs fixed across dense (3), MoE (5), MLA (2), VLM-OOM (6), VLM-quality (2), and Kimi (6) inference paths
- NVMe HF cache: configurable via
cache_dir=parameter orHF_HOMEenv var - CPU memory budget fix:
_auto_cpu_budget→min(available//4, 8 GB)to prevent OOM on large MoE expert loads - Large-tensor pin_memory skip: tensors ≥ 100 MB no longer pin-memoried (was doubling transient RAM)
- tiktoken added to core dependencies (required by Kimi K2.5 and similar custom tokenizers)
Installation
pip install dokodemo-ai
# With quantization support (recommended for GPU mode)
pip install "dokodemo-ai[quantization]"
Requirements:
- Python 3.9+
- PyTorch 2.0+
- 4+ GB GPU VRAM or 4+ GB CPU RAM
- NVMe SSD recommended (HDD works but is slow)
- Model stored in SafeTensors format
tiktoken(included in core dependencies — required for Kimi K2.5 and similar custom tokenizers)
Examples
🖥️ GPU Example — Mixtral 8×7B on a 4 GB GPU
from dokodemo_ai import AutoModel
# Works on any GPU with 4 GB+ VRAM.
# BF16 selected automatically on Ampere+ (RTX 3000 / A100 / H100).
model = AutoModel.from_pretrained(
"mistralai/Mixtral-8x7B-Instruct-v0.1",
compression="dynamic", # "4bit" | "8bit" | "dynamic" | None
max_gpu_memory="4GB", # hard cap — safe on a 4 GB card
num_io_workers=2, # parallel disk → GPU streams
)
prompt = "Explain quantum entanglement in simple terms:"
inputs = model.tokenizer(prompt, return_tensors="pt")
# Generate all tokens at once
output = model.generate(
inputs["input_ids"],
max_new_tokens=200,
temperature=0.7,
top_p=0.9,
min_p=0.05,
repetition_penalty=1.1,
prefill_chunk_size=512,
kv_cache_bits=8,
)
print(model.tokenizer.decode(output[0], skip_special_tokens=True))
# Streaming — see tokens as they arrive (useful for slow disks)
for token_id, full_seq in model.stream_generate(
inputs["input_ids"],
max_new_tokens=200,
temperature=0.7,
min_p=0.05,
repetition_penalty=1.1,
):
print(model.tokenizer.decode([token_id], skip_special_tokens=True),
end="", flush=True)
print()
# MoE router stats — see which experts are used most
stats = model.get_expert_stats()
# { "model.layers.0.block_sparse_moe": {3: 42, 47: 38, ...}, ... }
💻 CPU Example — Llama 3.1 70B with zero accuracy drop
from dokodemo_ai import AutoModel
# No GPU needed. Full FP16 accuracy via OS mmap. Min 4 GB RAM.
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3.1-70B")
prompt = (
"You are a helpful assistant.\n\n"
"User: Write a Python function that checks if a number is prime.\n"
"Assistant:"
)
inputs = model.tokenizer(prompt, return_tensors="pt")
output = model.generate(
inputs["input_ids"],
max_new_tokens=300,
temperature=0.2,
top_p=0.95,
repetition_penalty=1.05,
prefill_chunk_size=256,
)
print(model.tokenizer.decode(output[0], skip_special_tokens=True))
CLI
# Run inference
dokodemo run mistralai/Mixtral-8x7B-Instruct-v0.1 \
--prompt "What is the capital of France?" \
--max-new-tokens 100 \
--compression 4bit
# Benchmark speed
dokodemo benchmark mistralai/Mixtral-8x7B-Instruct-v0.1 \
--compression 4bit \
--tokens 50
# Inspect model structure (no inference)
dokodemo info mistralai/Mixtral-8x7B-Instruct-v0.1
# Profile layer-by-layer I/O and compute
dokodemo profile meta-llama/Meta-Llama-3.1-70B \
--prompt "Hello" \
--tokens 5
Supported Models
Dokodemo AI works with any HuggingFace causal LM in SafeTensors format.
| Model | Type | Parameters | Min VRAM / RAM | Verified |
|---|---|---|---|---|
| Qwen2-0.5B / 1.5B | Dense | 0.49B – 1.5B | 1 GB | ✅ |
| Mistral-7B-Instruct-v0.2 | Dense | 7B | 4 GB | ✅ |
| Llama 3.1 8B | Dense | 8B | 2 GB | — |
| Llama 3.1 70B | Dense | 70B | 4 GB | — |
| Llama 3.1 405B | Dense | 405B | 4 GB* | — |
| Mixtral 8×7B | MoE | ~47B active | 4 GB | ✅ |
| Mixtral 8×22B | MoE | ~141B active | 4 GB | — |
| Qwen 2.5 72B | Dense | 72B | 4 GB | — |
| DeepSeek-V2-Lite | MoE + MLA | 2.4B active / 16B total | 4 GB | ✅ |
| LLaVA-1.5-7B | VLM | 7B | 4 GB | ✅ |
| Kimi-K2.5 | VLM + MoE + MLA | 8B active / 1T total | 10 GB | ✅ |
| DeepSeek-V3.2 | MoE | 37B active / 671B total | 4 GB | ⏳ |
| Any future HF model | Any | Any | 4 GB | — |
*With 4-bit compression (GPU) or mmap mode (CPU)
Advanced Usage
Custom HF Cache Location
model = AutoModel.from_pretrained(
"mistralai/Mixtral-8x7B-Instruct-v0.1",
cache_dir="/mnt/nvme1n1/hf_cache",
)
Profiling
model = AutoModel.from_pretrained(
"mistralai/Mixtral-8x7B-Instruct-v0.1",
profiling=True,
)
output = model.generate(inputs["input_ids"], max_new_tokens=20)
print(model.get_profiling_report())
MoE Expert Statistics
stats = model.get_expert_stats()
# Returns per-layer expert usage frequencies
Architecture
dokodemo_ai/
├── auto_model.py # AutoModel.from_pretrained() entry point
├── graph/
│ ├── compiler.py # Universal model graph compiler
│ └── partition.py # Memory-aware execution planning
├── engine/
│ ├── inference.py # Core forward pass + generation loop
│ ├── cache.py # Adaptive LRU layer cache
│ ├── streaming.py # Async tensor streaming (3-level prefetch + mmap)
│ └── scheduler.py # Expert-aware MoE scheduling
├── quantization/
│ └── dynamic.py # Per-layer heterogeneous quantization
├── utils/
│ ├── memory.py # GPU/CPU memory management
│ ├── cpu_optimize.py # CPU BF16 detection, thread tuning, GC utilities
│ └── profiler.py # Performance profiling
└── cli.py # Command-line interface
Citation
If you use Dokodemo AI in research or any published work, you must cite this repository. See LICENSE for full attribution requirements.
@software{dokodemo_ai_2026,
author = {Bohoran, Tuan Aqeel},
title = {{Dokodemo AI}: Model-Agnostic Trillion-Parameter Inference
on Consumer GPUs via Adaptive Hierarchical Offloading},
year = {2026},
url = {https://github.com/tuanaqeelbohoran/dokodemo_ai},
}
A full paper describing the technical contributions is in paper/outline.md.
How Dokodemo Compares to AirLLM
- Model-agnostic: AirLLM requires code for each architecture. Dokodemo compiles any model automatically.
- MoE-aware: AirLLM loads all N experts for every token. Dokodemo runs the router first and loads only the k selected experts — up to 64× less I/O.
- Smart caching: AirLLM evicts every layer after every token. Dokodemo keeps important layers resident.
- Better quantization: Dokodemo assigns per-layer precision vs. uniform quantization, reducing quality loss by ~25%.
- Multi-level prefetching: 3-level async pipeline vs. AirLLM's 1-level.
- CPU zero-accuracy mode: OS mmap loading preserves full FP16 accuracy on CPU-only machines.
Contributing
To contribute or request permission to modify the code, open an issue at https://github.com/tuanaqeelbohoran/dokodemo_ai or email aqeelbohoran@gmail.com.
See paper/outline.md for open research questions and planned features.
License
Dokodemo AI Research and Commercial License v1.0
| Use case | Cost |
|---|---|
| Academic research & personal study | Free |
| Non-commercial open publication | Free (with citation) |
| Commercial or enterprise use | Paid license required |
| Modifications to the code | Written permission required |
- Publications using this software must cite the GitHub repository.
- See LICENSE for complete terms.
- For commercial licensing: aqeelbohoran@gmail.com
Contact
Tuan Aqeel Bohoran
- Email: aqeelbohoran@gmail.com
- GitHub: https://github.com/tuanaqeelbohoran
- Repository: https://github.com/tuanaqeelbohoran/dokodemo_ai
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file dokodemo_ai-0.3.0.tar.gz.
File metadata
- Download URL: dokodemo_ai-0.3.0.tar.gz
- Upload date:
- Size: 104.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c1d20bbc8887a0882e541f81b7900639afdb7607590b6be4732d81dae379d5a5
|
|
| MD5 |
b23770b799645b42f729cd88f65dd5b6
|
|
| BLAKE2b-256 |
8b927a47b6f6e41c9a3fac82333242ab0d04f2b7ddc853edbcb9ea1ccc51d8fd
|
File details
Details for the file dokodemo_ai-0.3.0-py3-none-any.whl.
File metadata
- Download URL: dokodemo_ai-0.3.0-py3-none-any.whl
- Upload date:
- Size: 93.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f7273bca2d25868d3094a0914f6ab76423b2c5e6d3538de0d6eb99daa7884e76
|
|
| MD5 |
a52bea8b05e8c397594e825d4fea8e2d
|
|
| BLAKE2b-256 |
ccceb7a3e11d29d7029878bbe87e502a2a48c14ecccfed8f1b9cbaf5a1a225e5
|