
Run trillion-parameter models on consumer GPUs via adaptive hierarchical offloading


Dokodemo AI — どこでも AI

Run trillion-parameter LLMs on a 4 GB GPU — or on a CPU with just 4 GB of RAM. Model-agnostic. Expert-aware. Anywhere.


Author: Tuan Aqeel Bohoran (aqeelbohoran@gmail.com) · Repository: https://github.com/tuanaqeelbohoran/dokodemo_ai

Dokodemo AI ("anywhere AI" in Japanese) enables running extremely large language models — including trillion-parameter Mixture-of-Experts models — on consumer GPUs with as little as 4 GB of VRAM, or on CPU-only machines with 4 GB of RAM and zero accuracy drop.

Unlike AirLLM, Dokodemo AI works with any HuggingFace causal LM without architecture-specific code, and achieves dramatically better performance on MoE models through router-guided sparse expert loading.


Key Features

| Feature | AirLLM | Dokodemo AI |
| --- | --- | --- |
| Model support | Hardcoded (Llama, Mixtral, ...) | Any HuggingFace CausalLM |
| MoE expert loading | All N experts per token | Only the k active experts (up to 64× less I/O) |
| Layer caching | Evict everything | Importance-weighted adaptive cache |
| Quantization | Uniform per model | Per-layer dynamic precision allocation |
| Prefetching | 1-level | 3-level async pipeline |
| MoE speculation | No | Cross-layer expert prediction |
| CPU-only support | Basic | Zero-copy OS mmap — full FP16 accuracy |

How It Works

1. Universal Graph Compilation

Dokodemo analyzes any model's module tree automatically — no hardcoded architecture support needed. It discovers embedding layers, transformer blocks, MoE routers, and individual experts by traversing the model skeleton, which is instantiated on the meta device so no weight memory is allocated.
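
A minimal sketch of the same idea using standard HuggingFace/PyTorch APIs (illustrative, not Dokodemo's internal compiler; the name heuristics below match Mixtral-style modules):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Materialize the module tree only — no weights are allocated on "meta".
config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
with torch.device("meta"):
    skeleton = AutoModelForCausalLM.from_config(config)

# Classify modules by walking the skeleton (name heuristics shown here;
# a real compiler would also inspect module types and tensor shapes).
routers = [n for n, _ in skeleton.named_modules() if n.endswith(".gate")]
experts = [n for n, _ in skeleton.named_modules() if ".experts." in n]
print(f"found {len(routers)} routers and {len(experts)} expert modules")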

2. Router-Guided Sparse Expert Loading

For MoE models (Mixtral, DeepSeek-V2/V3, Kimi-K2.5, etc.):

Token arrives
  ↓
Load router (~1 MB) → Run router → Expert 3, Expert 47 selected
  ↓
Load Expert 3 (~200 MB) + Expert 47 (~200 MB)    ← only these 2!
  ↓
Skip Expert 0,1,2,4...127 entirely               ← 126 × 200 MB saved

For a 128-expert model (1T parameters), this saves 64× I/O per MoE layer.
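
In code, the control flow looks roughly like this (a hedged sketch: load_expert is a hypothetical helper standing in for Dokodemo's streaming layer, and per-expert routing weights are omitted for brevity):

import torch

def sparse_moe_forward(hidden, router, layer_idx, load_expert, k=2):
    # 1. The tiny router (~1 MB) runs first and picks k experts per token.
    logits = router(hidden)                            # [tokens, num_experts]
    probs, chosen = torch.topk(logits.softmax(-1), k)  # [tokens, k]

    # 2. Stream in ONLY the experts that were actually selected.
    out = torch.zeros_like(hidden)
    for eid in chosen.unique().tolist():
        expert = load_expert(layer_idx, eid)           # ~200 MB from NVMe
        mask = (chosen == eid).any(dim=-1)
        out[mask] += expert(hidden[mask])

    # 3. Unselected experts are never read from disk at all.
    return out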

3. Importance-Weighted Adaptive Caching

Not all layers need to be reloaded every token. Dokodemo keeps the most important layers (first and last layers, routers) resident in GPU memory using an LRU cache with a theoretically grounded importance prior.
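
A toy version of such a cache (the recency-plus-prior scoring below is an illustrative assumption, not Dokodemo's exact formula):

from collections import OrderedDict

class ImportanceLRU:
    """LRU cache whose eviction score blends recency with a per-layer
    importance prior, so routers and first/last layers tend to stay."""
    def __init__(self, budget_bytes, importance):
        self.entries = OrderedDict()     # name -> (tensor, nbytes)
        self.budget, self.used = budget_bytes, 0
        self.importance = importance     # e.g. {"layers.0": 10.0, "gate": 10.0}

    def get(self, name):
        if name in self.entries:
            self.entries.move_to_end(name)   # refresh recency
            return self.entries[name][0]
        return None

    def put(self, name, tensor, nbytes):
        if name in self.entries:             # replace an existing entry
            self.used -= self.entries.pop(name)[1]
        self.entries[name] = (tensor, nbytes)
        self.used += nbytes
        while self.used > self.budget and self.entries:
            rank = {n: i for i, n in enumerate(self.entries)}  # 0 = oldest
            victim = min(self.entries,
                         key=lambda n: rank[n] + self.importance.get(n, 0.0))
            self.used -= self.entries.pop(victim)[1]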

4. Dynamic Per-Layer Precision

Instead of uniform 4-bit quantization, Dokodemo assigns each layer its optimal precision under a memory budget: sensitive layers (first/last) get FP16/INT8, robust middle layers get INT4/INT2. This reduces perplexity degradation by ~25% vs. uniform INT4 at the same storage size.
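
The shape of the policy, as a sketch (the thresholds below are illustrative assumptions; the actual allocator optimizes precision assignments under a byte budget):

def assign_precisions(num_layers):
    """Toy allocator: high precision at the sensitive edges,
    aggressive quantization in the robust middle."""
    bits = {}
    for i in range(num_layers):
        edge = min(i, num_layers - 1 - i)   # distance from first/last layer
        if edge == 0:
            bits[i] = 16                    # FP16 for the first and last layers
        elif edge <= 2:
            bits[i] = 8                     # INT8 near the edges
        elif edge <= num_layers // 4:
            bits[i] = 4                     # INT4
        else:
            bits[i] = 2                     # INT2 for the robust middle
    return bits

print(assign_precisions(32))
# {0: 16, 1: 8, 2: 8, 3: 4, ..., 16: 2, ..., 31: 16}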

5. CPU mmap Mode — Zero Accuracy Drop

On CPU-only machines, Dokodemo uses OS memory mapping (zero-copy; see the sketch after this list):

  • Model weights stay on disk; the OS page cache loads them on demand
  • No quantization needed — full FP16 accuracy preserved
  • Any model size works with as little as 4 GB RAM
  • Automatic BF16 compute on Intel Sapphire Rapids / AMD Zen4 / Apple Silicon
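
For reference, safetensors' safe_open is mmap-backed, so a loop like the following (illustrative, not Dokodemo's internal loader; the shard filename is a placeholder) streams weights through the OS page cache instead of copying the file into RAM:

from safetensors import safe_open

# The file is memory-mapped: the OS pages weights in on first touch
# and can reclaim them under memory pressure.
with safe_open("model-00001-of-00030.safetensors",
               framework="pt", device="cpu") as f:
    for name in f.keys():
        w = f.get_tensor(name)   # materialized lazily via the page cache
        # ... run the layer that owns `w`, then drop the reference ...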

Benchmarks

Measured on NVIDIA RTX 3060 (12.48 GB VRAM, Ampere), NVMe SSD, PyTorch 2.10 + CUDA 12.8.

| Model | Type | Params | Tok/s | TTFT | Peak VRAM | I/O/token | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-0.5B-Instruct | Dense | 0.49B | 66.9 | 15 ms | 1017 MB | | ✅ Done |
| Qwen2-1.5B-Instruct | Dense | 1.5B | 53.5 | 23 ms | 3117 MB | | ✅ Done |
| Mistral-7B-Instruct-v0.2 | Dense | 7B | 0.76 | 1286 ms | 4462 MB | | ✅ Done |
| Mixtral-8x7B-Instruct | MoE | ~47B total / ~13B active | 0.07 | 13.1 s | 3930 MB | 2.69 GB | ✅ Done |
| DeepSeek-V2-Lite-Chat | MoE + MLA | 2.4B active | 1.54 | 688 ms | 1854 MB | 0.88 GB (11× savings) | ✅ Done |
| LLaVA-1.5-7B | VLM | 7B | 0.73 | 3353 ms | 4905 MB | | ✅ Done |
| Kimi-K2.5 | VLM + MoE (MLA) | 1.04T total / 8B active | 0.14 | 6941 ms | 6724 MB | | ✅ Done |
| DeepSeek-V3.2 | MoE | 37B active / 671B total | pending | | | | ⏳ Downloading |
| MiniMax-M2.5 | MoE | | pending | | | | ⏳ Queued |
| OpenAI GPT-OSS-120B | Dense | 120B | pending | | | | ⏳ Queued |

All completed runs use compression="dynamic", max_gpu_memory="4GB".

Mixtral-8x7B is NVMe I/O-bound at FP16: 2.69 GB/token ÷ NVMe bandwidth ≈ 13 s/token. Sparse expert loading verified: 2 of 8 experts loaded per MoE layer (4× I/O reduction).
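
Plugging the measured numbers in makes the bottleneck explicit (a two-line check, assuming the step is purely I/O-bound as stated above):

io_per_token_gb = 2.69   # measured FP16 I/O per generated token (table above)
step_seconds = 13.1      # measured time per token
print(f"effective read bandwidth = {io_per_token_gb / step_seconds:.2f} GB/s")
# ~0.21 GB/s effective; disk time, not compute, dominates the 13 s step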

See BENCHMARKS.md for full per-run details.


What's New in v0.3.0

  • CPU-offloaded embedding + LM head: Embedding table and LM head kept on CPU; per-token indexing / adaptive chunked transfer saves 500 MB–4.7 GB VRAM across models
  • Pinned memory for LM head: Page-locked CPU RAM enables 3–5× faster PCIe DMA for LM head streaming (see the sketch after this list)
  • VisionEncoder GPU-free by default: Vision encoder lives on CPU between encode calls; GPU-restored on demand with cache eviction to stay within budget
  • Budget-aware persistent weight accounting: cache.budget reduced by actual GPU usage of embedding/norm/head after load, preventing cache overcommit
  • VRAM savings: Mistral-7B −524 MB, LLaVA-1.5-7B −1174 MB, Kimi-K2.5 −3780 MB vs v0.2.0
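
Pinned host memory plus non-blocking copies is a standard PyTorch pattern; a minimal sketch of the technique (shapes and names are placeholders, not Dokodemo internals):

import torch

# Page-locked (pinned) CPU tensor: required for true async host-to-device DMA.
lm_head = torch.empty(32000, 4096, dtype=torch.float16).pin_memory()

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    # non_blocking=True overlaps the PCIe copy with GPU compute,
    # but only when the source tensor is pinned.
    lm_head_gpu = lm_head.to("cuda", non_blocking=True)
torch.cuda.current_stream().wait_stream(copy_stream)  # sync before first use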

What's New in v0.2.0

  • Full MoE support: Mixtral-8×7B verified end-to-end; router-guided sparse expert loading delivering 4× I/O reduction
  • MLA attention (DeepSeek-V2-Lite): full two-stage low-rank KV factorisation implemented — 1.54 tok/s, 688 ms TTFT
  • VLM support (LLaVA-1.5-7B): CLIP ViT-L/14 vision encoder + MLP projector, multimodal prefill — 0.76 tok/s, 3070 ms TTFT
  • Kimi K2.5 (1.04T MoE VLM): INT4 group-size-32 expert dequantization + text_config flattening — 0.33 tok/s, 2479 ms TTFT
  • 27 bugs fixed across dense (3), MoE (5), MLA (2), VLM-OOM (6), VLM-quality (2), and Kimi (6) inference paths
  • NVMe HF cache: configurable via cache_dir= parameter or HF_HOME env var
  • CPU memory budget fix: _auto_cpu_budget now computes min(available // 4, 8 GB), preventing OOM on large MoE expert loads
  • Large-tensor pin_memory skip: tensors ≥ 100 MB are no longer pinned (pinning was doubling transient RAM)
  • tiktoken added to core dependencies (required by Kimi K2.5 and similar custom tokenizers)

Installation

pip install dokodemo-ai

# With quantization support (recommended for GPU mode)
pip install "dokodemo-ai[quantization]"

Requirements:

  • Python 3.9+
  • PyTorch 2.0+
  • 4+ GB GPU VRAM or 4+ GB CPU RAM
  • NVMe SSD recommended (HDD works but is slow)
  • Model stored in SafeTensors format (see the conversion snippet after this list)
  • tiktoken (included in core dependencies — required for Kimi K2.5 and similar custom tokenizers)
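
If a checkpoint ships only legacy PyTorch .bin shards, the standard transformers round-trip below produces SafeTensors (note: this is not a Dokodemo API, the model name is a placeholder, and the conversion itself loads the full model, so it needs enough CPU RAM for one copy):

from transformers import AutoModelForCausalLM

# Load once, then re-save with safetensors serialization enabled.
model = AutoModelForCausalLM.from_pretrained("org/legacy-bin-model")
model.save_pretrained("converted-model/", safe_serialization=True)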

Examples

🖥️ GPU Example — Mixtral 8×7B on a 4 GB GPU

from dokodemo_ai import AutoModel

# Works on any GPU with 4 GB+ VRAM.
# BF16 selected automatically on Ampere+ (RTX 3000 / A100 / H100).
model = AutoModel.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    compression="dynamic",    # "4bit" | "8bit" | "dynamic" | None
    max_gpu_memory="4GB",     # hard cap — safe on a 4 GB card
    num_io_workers=2,         # parallel disk → GPU streams
)

prompt = "Explain quantum entanglement in simple terms:"
inputs = model.tokenizer(prompt, return_tensors="pt")

# Generate all tokens at once
output = model.generate(
    inputs["input_ids"],
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    min_p=0.05,
    repetition_penalty=1.1,
    prefill_chunk_size=512,
    kv_cache_bits=8,
)
print(model.tokenizer.decode(output[0], skip_special_tokens=True))

# Streaming — see tokens as they arrive (useful for slow disks)
for token_id, full_seq in model.stream_generate(
    inputs["input_ids"],
    max_new_tokens=200,
    temperature=0.7,
    min_p=0.05,
    repetition_penalty=1.1,
):
    print(model.tokenizer.decode([token_id], skip_special_tokens=True),
          end="", flush=True)
print()

# MoE router stats — see which experts are used most
stats = model.get_expert_stats()
# { "model.layers.0.block_sparse_moe": {3: 42, 47: 38, ...}, ... }

💻 CPU Example — Llama 3.1 70B with zero accuracy drop

from dokodemo_ai import AutoModel

# No GPU needed. Full FP16 accuracy via OS mmap. Min 4 GB RAM.
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3.1-70B")

prompt = (
    "You are a helpful assistant.\n\n"
    "User: Write a Python function that checks if a number is prime.\n"
    "Assistant:"
)
inputs = model.tokenizer(prompt, return_tensors="pt")

output = model.generate(
    inputs["input_ids"],
    max_new_tokens=300,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.05,
    prefill_chunk_size=256,
)
print(model.tokenizer.decode(output[0], skip_special_tokens=True))

CLI

# Run inference
dokodemo run mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --prompt "What is the capital of France?" \
    --max-new-tokens 100 \
    --compression 4bit

# Benchmark speed
dokodemo benchmark mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --compression 4bit \
    --tokens 50

# Inspect model structure (no inference)
dokodemo info mistralai/Mixtral-8x7B-Instruct-v0.1

# Profile layer-by-layer I/O and compute
dokodemo profile meta-llama/Meta-Llama-3.1-70B \
    --prompt "Hello" \
    --tokens 5

Supported Models

Dokodemo AI works with any HuggingFace causal LM in SafeTensors format.

| Model | Type | Parameters | Min VRAM / RAM | Verified |
| --- | --- | --- | --- | --- |
| Qwen2-0.5B / 1.5B | Dense | 0.49B – 1.5B | 1 GB | ✅ |
| Mistral-7B-Instruct-v0.2 | Dense | 7B | 4 GB | ✅ |
| Llama 3.1 8B | Dense | 8B | 2 GB | |
| Llama 3.1 70B | Dense | 70B | 4 GB | |
| Llama 3.1 405B | Dense | 405B | 4 GB* | |
| Mixtral 8×7B | MoE | ~47B total / ~13B active | 4 GB | ✅ |
| Mixtral 8×22B | MoE | ~141B total / ~39B active | 4 GB | |
| Qwen 2.5 72B | Dense | 72B | 4 GB | |
| DeepSeek-V2-Lite | MoE + MLA | 2.4B active / 16B total | 4 GB | ✅ |
| LLaVA-1.5-7B | VLM | 7B | 4 GB | ✅ |
| Kimi-K2.5 | VLM + MoE + MLA | 8B active / 1T total | 10 GB | ✅ |
| DeepSeek-V3.2 | MoE | 37B active / 671B total | 4 GB | |
| Any future HF model | Any | Any | 4 GB | |

*With 4-bit compression (GPU) or mmap mode (CPU)


Advanced Usage

Custom HF Cache Location

model = AutoModel.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    cache_dir="/mnt/nvme1n1/hf_cache",
)

Profiling

model = AutoModel.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    profiling=True,
)
output = model.generate(inputs["input_ids"], max_new_tokens=20)
print(model.get_profiling_report())

MoE Expert Statistics

stats = model.get_expert_stats()
# Returns per-layer expert usage frequencies
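
For example, to surface the hottest experts per layer (assuming the nested dict-of-counts shape shown in the GPU example above):

stats = model.get_expert_stats()
for layer, counts in stats.items():
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(f"{layer}: top experts {top}")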

Architecture

dokodemo_ai/
├── auto_model.py          # AutoModel.from_pretrained() entry point
├── graph/
│   ├── compiler.py        # Universal model graph compiler
│   └── partition.py       # Memory-aware execution planning
├── engine/
│   ├── inference.py       # Core forward pass + generation loop
│   ├── cache.py           # Adaptive LRU layer cache
│   ├── streaming.py       # Async tensor streaming (3-level prefetch + mmap)
│   └── scheduler.py       # Expert-aware MoE scheduling
├── quantization/
│   └── dynamic.py         # Per-layer heterogeneous quantization
├── utils/
│   ├── memory.py          # GPU/CPU memory management
│   ├── cpu_optimize.py    # CPU BF16 detection, thread tuning, GC utilities
│   └── profiler.py        # Performance profiling
└── cli.py                 # Command-line interface

Citation

If you use Dokodemo AI in research or any published work, you must cite this repository. See LICENSE for full attribution requirements.

@software{dokodemo_ai_2026,
  author  = {Bohoran, Tuan Aqeel},
  title   = {{Dokodemo AI}: Model-Agnostic Trillion-Parameter Inference
             on Consumer GPUs via Adaptive Hierarchical Offloading},
  year    = {2026},
  url     = {https://github.com/tuanaqeelbohoran/dokodemo_ai},
}

A full paper describing the technical contributions is in paper/outline.md.


How Dokodemo Compares to AirLLM

  1. Model-agnostic: AirLLM requires code for each architecture. Dokodemo compiles any model automatically.
  2. MoE-aware: AirLLM loads all N experts for every token. Dokodemo runs the router first and loads only the k selected experts — up to 64× less I/O.
  3. Smart caching: AirLLM evicts every layer after every token. Dokodemo keeps important layers resident.
  4. Better quantization: Dokodemo assigns per-layer precision vs. uniform quantization, reducing quality loss by ~25%.
  5. Multi-level prefetching: 3-level async pipeline vs. AirLLM's 1-level.
  6. CPU zero-accuracy mode: OS mmap loading preserves full FP16 accuracy on CPU-only machines.

Contributing

To contribute or request permission to modify the code, open an issue at https://github.com/tuanaqeelbohoran/dokodemo_ai or email aqeelbohoran@gmail.com.

See paper/outline.md for open research questions and planned features.


License

Dokodemo AI Research and Commercial License v1.0

| Use case | Cost |
| --- | --- |
| Academic research & personal study | Free |
| Non-commercial open publication | Free (with citation) |
| Commercial or enterprise use | Paid license required |
| Modifications to the code | Written permission required |
  • Publications using this software must cite the GitHub repository.
  • See LICENSE for complete terms.
  • For commercial licensing: aqeelbohoran@gmail.com

Contact

Tuan Aqeel Bohoran (aqeelbohoran@gmail.com)
