Run trillion-parameter models on consumer GPUs via adaptive hierarchical offloading

These details have not been verified by PyPI

Project links

Project description

Dokodemo AI — どこでも AI

Run trillion-parameter LLMs on a 4 GB GPU — or a 4 GB CPU RAM. Model-agnostic. Expert-aware. Anywhere.

Author: Tuan Aqeel Bohoran — aqeelbohoran@gmail.com Repository: https://github.com/tuanaqeelbohoran/dokodemo_ai

Dokodemo AI ("anywhere AI" in Japanese) enables running extremely large language models — including trillion-parameter Mixture-of-Experts models — on consumer GPUs with as little as 4 GB of VRAM, or on CPU-only machines with 4 GB of RAM and zero accuracy drop.

Unlike AirLLM, Dokodemo AI works with any HuggingFace causal LM without architecture-specific code, and achieves dramatically better performance on MoE models through router-guided sparse expert loading.

Key Features

Feature	AirLLM	Dokodemo AI
Model support	Hardcoded (Llama, Mixtral, ...)	Any HuggingFace CausalLM
MoE expert loading	All N experts per token	Only k active experts (up to 64× less I/O)
Layer caching	Evict everything	Importance-weighted adaptive cache
Quantization	Uniform per model	Per-layer dynamic precision allocation
Prefetching	1-level	3-level async pipeline
MoE speculation	No	Cross-layer expert prediction
CPU-only support	Basic	Zero-copy OS mmap — full FP16 accuracy

How It Works

1. Universal Graph Compilation

Dokodemo analyzes any model's module tree automatically — no hardcoded architecture support needed. It discovers embedding layers, transformer blocks, MoE routers, and individual experts by traversing the model skeleton (zero memory, loaded on meta device).

2. Router-Guided Sparse Expert Loading

For MoE models (Mixtral, DeepSeek-V2/V3, Kimi-K2.5, etc.):

Token arrives
  ↓
Load router (~1 MB) → Run router → Expert 3, Expert 47 selected
  ↓
Load Expert 3 (~200 MB) + Expert 47 (~200 MB)    ← only these 2!
  ↓
Skip Expert 0,1,2,4...127 entirely               ← 126 × 200 MB saved

For a 128-expert model (1T parameters), this saves 64× I/O per MoE layer.

3. Importance-Weighted Adaptive Caching

Not all layers need to be reloaded every token. Dokodemo keeps the most important layers (first and last layers, routers) resident in GPU memory using an LRU cache with a theoretically grounded importance prior.

4. Dynamic Per-Layer Precision

Instead of uniform 4-bit quantization, Dokodemo assigns each layer its optimal precision under a memory budget: sensitive layers (first/last) get FP16/INT8, robust middle layers get INT4/INT2. This reduces perplexity degradation by ~25% vs. uniform INT4 at the same storage size.

5. CPU mmap Mode — Zero Accuracy Drop

On CPU-only machines, Dokodemo uses OS memory mapping (zero-copy):

Model weights stay on disk; the OS page cache loads them on demand
No quantization needed — full FP16 accuracy preserved
Any model size works with as little as 4 GB RAM
Automatic BF16 compute on Intel Sapphire Rapids / AMD Zen4 / Apple Silicon

Benchmarks

Measured on NVIDIA RTX 3060 (12.48 GB VRAM, Ampere), NVMe SSD, PyTorch 2.10 + CUDA 12.8.

Model	Type	Params	Tok/s	TTFT	Peak VRAM	I/O/token	Status
Qwen2-0.5B-Instruct	Dense	0.49B	66.9 tok/s	15 ms	1017 MB	—	✅ Done
Qwen2-1.5B-Instruct	Dense	1.5B	53.5 tok/s	23 ms	3117 MB	—	✅ Done
Mistral-7B-Instruct-v0.2	Dense	7B	0.76 tok/s	1286 ms	4462 MB	—	✅ Done
Mixtral-8x7B-Instruct	MoE	47B active	0.07 tok/s	13.1 s	3930 MB	2.69 GB	✅ Done
DeepSeek-V2-Lite-Chat	MoE + MLA	2.4B active	1.54 tok/s	688 ms	1854 MB	0.88 GB (11× savings)	✅ Done
LLaVA-1.5-7B	VLM	7B	0.73 tok/s	3353 ms	4905 MB	—	✅ Done
Kimi-K2.5	VLM + MoE (MLA)	1.04T total / 8 active	0.14 tok/s	6941 ms	6724 MB	—	✅ Done
DeepSeek-V3.2	MoE	37B active / 671B total	pending	—	—	—	⏳ Downloading
MiniMax-M2.5	MoE	—	pending	—	—	—	⏳ Queued
OpenAI GPT-OSS-120B	Dense	120B	pending	—	—	—	⏳ Queued

All completed runs use compression="dynamic", max_gpu_memory="4GB".

Mixtral-8x7B is NVMe I/O-bound at FP16: 2.69 GB/token ÷ NVMe bandwidth ≈ 13 s/token. Sparse expert loading verified: 2 of 8 experts loaded per MoE layer (4× I/O reduction).

See BENCHMARKS.md for full per-run details.

What's New in v0.3.0

CPU-offloaded embedding + LM head: Embedding table and LM head kept on CPU; per-token indexing / adaptive chunked transfer saves 500 MB–4.7 GB VRAM across models
Pinned memory for LM head: Page-locked CPU RAM enables 3–5× faster PCIe DMA for LM head streaming
VisionEncoder GPU-free by default: Vision encoder lives on CPU between encode calls; GPU-restored on demand with cache eviction to stay within budget
Budget-aware persistent weight accounting: cache.budget reduced by actual GPU usage of embedding/norm/head after load, preventing cache overcommit
VRAM savings: Mistral-7B −524 MB, LLaVA-1.5-7B −1174 MB, Kimi-K2.5 −3780 MB vs v0.2.0

What's New in v0.2.0

Full MoE support: Mixtral-8×7B verified end-to-end; router-guided sparse expert loading delivering 4× I/O reduction
MLA attention (DeepSeek-V2-Lite): full two-stage low-rank KV factorisation implemented — 1.54 tok/s, 688 ms TTFT
VLM support (LLaVA-1.5-7B): CLIP ViT-L/14 vision encoder + MLP projector, multimodal prefill — 0.76 tok/s, 3070 ms TTFT
Kimi K2.5 (1.04T MoE VLM): INT4 group-size-32 expert dequantization + text_config flattening — 0.33 tok/s, 2479 ms TTFT
27 bugs fixed across dense (3), MoE (5), MLA (2), VLM-OOM (6), VLM-quality (2), and Kimi (6) inference paths
NVMe HF cache: configurable via cache_dir= parameter or HF_HOME env var
CPU memory budget fix: _auto_cpu_budget → min(available//4, 8 GB) to prevent OOM on large MoE expert loads
Large-tensor pin_memory skip: tensors ≥ 100 MB no longer pin-memoried (was doubling transient RAM)
tiktoken added to core dependencies (required by Kimi K2.5 and similar custom tokenizers)

Installation

pip install dokodemo-ai

# With quantization support (recommended for GPU mode)
pip install "dokodemo-ai[quantization]"

Requirements:

Python 3.9+
PyTorch 2.0+
4+ GB GPU VRAM or 4+ GB CPU RAM
NVMe SSD recommended (HDD works but is slow)
Model stored in SafeTensors format
tiktoken (included in core dependencies — required for Kimi K2.5 and similar custom tokenizers)

Examples

🖥️ GPU Example — Mixtral 8×7B on a 4 GB GPU

from dokodemo_ai import AutoModel

# Works on any GPU with 4 GB+ VRAM.
# BF16 selected automatically on Ampere+ (RTX 3000 / A100 / H100).
model = AutoModel.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    compression="dynamic",    # "4bit" | "8bit" | "dynamic" | None
    max_gpu_memory="4GB",     # hard cap — safe on a 4 GB card
    num_io_workers=2,         # parallel disk → GPU streams
)

prompt = "Explain quantum entanglement in simple terms:"
inputs = model.tokenizer(prompt, return_tensors="pt")

# Generate all tokens at once
output = model.generate(
    inputs["input_ids"],
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    min_p=0.05,
    repetition_penalty=1.1,
    prefill_chunk_size=512,
    kv_cache_bits=8,
)
print(model.tokenizer.decode(output[0], skip_special_tokens=True))

# Streaming — see tokens as they arrive (useful for slow disks)
for token_id, full_seq in model.stream_generate(
    inputs["input_ids"],
    max_new_tokens=200,
    temperature=0.7,
    min_p=0.05,
    repetition_penalty=1.1,
):
    print(model.tokenizer.decode([token_id], skip_special_tokens=True),
          end="", flush=True)
print()

# MoE router stats — see which experts are used most
stats = model.get_expert_stats()
# { "model.layers.0.block_sparse_moe": {3: 42, 47: 38, ...}, ... }

💻 CPU Example — Llama 3.1 70B with zero accuracy drop

from dokodemo_ai import AutoModel

# No GPU needed. Full FP16 accuracy via OS mmap. Min 4 GB RAM.
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3.1-70B")

prompt = (
    "You are a helpful assistant.\n\n"
    "User: Write a Python function that checks if a number is prime.\n"
    "Assistant:"
)
inputs = model.tokenizer(prompt, return_tensors="pt")

output = model.generate(
    inputs["input_ids"],
    max_new_tokens=300,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.05,
    prefill_chunk_size=256,
)
print(model.tokenizer.decode(output[0], skip_special_tokens=True))

CLI

# Run inference
dokodemo run mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --prompt "What is the capital of France?" \
    --max-new-tokens 100 \
    --compression 4bit

# Benchmark speed
dokodemo benchmark mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --compression 4bit \
    --tokens 50

# Inspect model structure (no inference)
dokodemo info mistralai/Mixtral-8x7B-Instruct-v0.1

# Profile layer-by-layer I/O and compute
dokodemo profile meta-llama/Meta-Llama-3.1-70B \
    --prompt "Hello" \
    --tokens 5

Supported Models

Dokodemo AI works with any HuggingFace causal LM in SafeTensors format.

Model	Type	Parameters	Min VRAM / RAM	Verified
Qwen2-0.5B / 1.5B	Dense	0.49B – 1.5B	1 GB	✅
Mistral-7B-Instruct-v0.2	Dense	7B	4 GB	✅
Llama 3.1 8B	Dense	8B	2 GB	—
Llama 3.1 70B	Dense	70B	4 GB	—
Llama 3.1 405B	Dense	405B	4 GB*	—
Mixtral 8×7B	MoE	~47B active	4 GB	✅
Mixtral 8×22B	MoE	~141B active	4 GB	—
Qwen 2.5 72B	Dense	72B	4 GB	—
DeepSeek-V2-Lite	MoE + MLA	2.4B active / 16B total	4 GB	✅
LLaVA-1.5-7B	VLM	7B	4 GB	✅
Kimi-K2.5	VLM + MoE + MLA	8B active / 1T total	10 GB	✅
DeepSeek-V3.2	MoE	37B active / 671B total	4 GB	⏳
Any future HF model	Any	Any	4 GB	—

*With 4-bit compression (GPU) or mmap mode (CPU)

Advanced Usage

Custom HF Cache Location

model = AutoModel.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    cache_dir="/mnt/nvme1n1/hf_cache",
)

Profiling

model = AutoModel.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    profiling=True,
)
output = model.generate(inputs["input_ids"], max_new_tokens=20)
print(model.get_profiling_report())

MoE Expert Statistics

stats = model.get_expert_stats()
# Returns per-layer expert usage frequencies

Architecture

dokodemo_ai/
├── auto_model.py          # AutoModel.from_pretrained() entry point
├── graph/
│   ├── compiler.py        # Universal model graph compiler
│   └── partition.py       # Memory-aware execution planning
├── engine/
│   ├── inference.py       # Core forward pass + generation loop
│   ├── cache.py           # Adaptive LRU layer cache
│   ├── streaming.py       # Async tensor streaming (3-level prefetch + mmap)
│   └── scheduler.py       # Expert-aware MoE scheduling
├── quantization/
│   └── dynamic.py         # Per-layer heterogeneous quantization
├── utils/
│   ├── memory.py          # GPU/CPU memory management
│   ├── cpu_optimize.py    # CPU BF16 detection, thread tuning, GC utilities
│   └── profiler.py        # Performance profiling
└── cli.py                 # Command-line interface

Citation

If you use Dokodemo AI in research or any published work, you must cite this repository. See LICENSE for full attribution requirements.

@software{dokodemo_ai_2026,
  author  = {Bohoran, Tuan Aqeel},
  title   = {{Dokodemo AI}: Model-Agnostic Trillion-Parameter Inference
             on Consumer GPUs via Adaptive Hierarchical Offloading},
  year    = {2026},
  url     = {https://github.com/tuanaqeelbohoran/dokodemo_ai},
}

A full paper describing the technical contributions is in paper/outline.md.

How Dokodemo Compares to AirLLM

Model-agnostic: AirLLM requires code for each architecture. Dokodemo compiles any model automatically.
MoE-aware: AirLLM loads all N experts for every token. Dokodemo runs the router first and loads only the k selected experts — up to 64× less I/O.
Smart caching: AirLLM evicts every layer after every token. Dokodemo keeps important layers resident.
Better quantization: Dokodemo assigns per-layer precision vs. uniform quantization, reducing quality loss by ~25%.
Multi-level prefetching: 3-level async pipeline vs. AirLLM's 1-level.
CPU zero-accuracy mode: OS mmap loading preserves full FP16 accuracy on CPU-only machines.

Contributing

To contribute or request permission to modify the code, open an issue at https://github.com/tuanaqeelbohoran/dokodemo_ai or email aqeelbohoran@gmail.com.

See paper/outline.md for open research questions and planned features.

License

Dokodemo AI Research and Commercial License v1.0

Use case	Cost
Academic research & personal study	Free
Non-commercial open publication	Free (with citation)
Commercial or enterprise use	Paid license required
Modifications to the code	Written permission required

Publications using this software must cite the GitHub repository.
See LICENSE for complete terms.
For commercial licensing: aqeelbohoran@gmail.com

Contact

Tuan Aqeel Bohoran

Email: aqeelbohoran@gmail.com
GitHub: https://github.com/tuanaqeelbohoran
Repository: https://github.com/tuanaqeelbohoran/dokodemo_ai

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.3.0

Mar 23, 2026

0.1.3

Mar 13, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dokodemo_ai-0.3.0.tar.gz (104.5 kB view details)

Uploaded Mar 23, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

dokodemo_ai-0.3.0-py3-none-any.whl (93.6 kB view details)

Uploaded Mar 23, 2026 Python 3

File details

Details for the file dokodemo_ai-0.3.0.tar.gz.

File metadata

Download URL: dokodemo_ai-0.3.0.tar.gz
Upload date: Mar 23, 2026
Size: 104.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for dokodemo_ai-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`c1d20bbc8887a0882e541f81b7900639afdb7607590b6be4732d81dae379d5a5`
MD5	`b23770b799645b42f729cd88f65dd5b6`
BLAKE2b-256	`8b927a47b6f6e41c9a3fac82333242ab0d04f2b7ddc853edbcb9ea1ccc51d8fd`

See more details on using hashes here.

File details

Details for the file dokodemo_ai-0.3.0-py3-none-any.whl.

File metadata

Download URL: dokodemo_ai-0.3.0-py3-none-any.whl
Upload date: Mar 23, 2026
Size: 93.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for dokodemo_ai-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f7273bca2d25868d3094a0914f6ab76423b2c5e6d3538de0d6eb99daa7884e76`
MD5	`a52bea8b05e8c397594e825d4fea8e2d`
BLAKE2b-256	`ccceb7a3e11d29d7029878bbe87e502a2a48c14ecccfed8f1b9cbaf5a1a225e5`

See more details on using hashes here.

dokodemo-ai 0.3.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Dokodemo AI — どこでも AI

Key Features

How It Works

1. Universal Graph Compilation

2. Router-Guided Sparse Expert Loading

3. Importance-Weighted Adaptive Caching

4. Dynamic Per-Layer Precision

5. CPU mmap Mode — Zero Accuracy Drop

Benchmarks

What's New in v0.3.0

What's New in v0.2.0

Installation

Examples

🖥️ GPU Example — Mixtral 8×7B on a 4 GB GPU

💻 CPU Example — Llama 3.1 70B with zero accuracy drop

CLI

Supported Models

Advanced Usage

Custom HF Cache Location

Profiling

MoE Expert Statistics

Architecture

Citation

How Dokodemo Compares to AirLLM

Contributing

License

Contact

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes