Kandiga
Giant models. Tiny memory.
3 AI models. 161B parameters. One Mac. 5.5GB. Full agentic pipeline on Apple Silicon.
Kandiga is an open-source MoE inference engine + AI agent for Apple Silicon. Run models that normally need 20-224GB of RAM in 2-8GB — on any Mac. No cloud, no API keys.
What It Does
- Inference engine: Run 35B-397B parameter MoE models in 2-8GB RAM via Selective Expert Materialization (SEM)
- AI agent: Tool calling, web search, file operations, macOS integrations (Calendar, Reminders, Notes, Notifications) — all local
- 3-bit quantization: 21% faster and 22% smaller than 4-bit via MLX native mx.quantize(bits=3)
- Persistent KV cache: Follow-up turns process only new tokens — turn 50 is as fast as turn 1
- TurboQuant: 3.8x KV cache compression for longer conversations
Supported Models
| Model | Parameters | Active | Disk | Kandiga RAM | Decode* | Status |
|---|---|---|---|---|---|---|
| Qwen3.5-4B (3-bit) | 4B | 4B | 1.84 GB | ~1.8 GB | 31 tok/s | Proven |
| Qwen3.5-35B-A3B (full 3-bit) | 35B | 3B | 20 GB | ~1 GB | ~12 tok/s | Proven |
| Qwen3.5-122B-A10B (full 3-bit) | 122B | 10B | 70 GB | ~2.7 GB | ~4 tok/s | Proven |
| Gemma 4 26B-A4B | 26B | 4B | 13 GB | ~1.35 GB | ~10 tok/s | Proven |
| Qwen3.5-397B-A17B (full 3-bit) | 397B | 17B | 224 GB | est. ~5 GB | est. ~1 tok/s | Pending |
*MoE decode speed depends on SSD bandwidth. Estimates for internal NVMe on M4.
Install
pip install kandiga
# For maximum speed (includes ZMLX fused kernels):
pip install kandiga[fast]
Requirements: macOS with Apple Silicon (M1/M2/M3/M4), Python 3.10+
Quick Start
# One-time setup: choose model, download, prepare expert files
kandiga setup
# Interactive chat
kandiga chat
# Fast mode (K=4 experts, ~2x speed)
kandiga chat --fast
# AI agent mode — tools, skills, memory, macOS integrations
kandiga agent --fast
# Agent with web UI
kandiga agent --fast --web
# One-shot prompt
kandiga "What is the capital of France?"
# OpenAI-compatible API server
kandiga serve
# Benchmarks
kandiga bench
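Since kandiga serve exposes an OpenAI-compatible API, any standard client works against it. A minimal sketch with the official openai Python package (the port and model name here are assumptions; check the server's startup output for the actual values):
# Point an OpenAI client at the local kandiga server. The base_url port and
# model name are illustrative; no API key is needed for a local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="kandiga",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)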
Architecture
Inference Engine (SEM)
MoE models have hundreds of expert sub-networks per layer, but only activate a few per token. Kandiga exploits this sparsity:
- Selective Expert Materialization — shared layers on GPU (~1.4GB), expert weights on SSD. Only router-selected experts loaded per token.
- Custom Metal GPU kernels — prefill runs expert MLP entirely on GPU. One dispatch, zero Python overhead.
- CPU NEON decode — single-token expert MLP on CPU with NEON-vectorized 4-bit dequant. Faster than GPU for single tokens (no Metal dispatch overhead).
- Cross-layer speculation — predicts next layer's experts with 77% accuracy. Pre-fetches into OS page cache during current compute.
- TurboQuant KV compression — 3.8x compression (16-bit → 3-bit) via PolarQuant + QJL. Enables 32K context on 16GB.
- ZMLX fused kernels — optimized attention and norms.
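A minimal sketch of the SEM decode path, assuming a packed one-expert-after-another file layout (illustrative only, not the actual engine code; ExpertStore and dequant_mlp are hypothetical names):
import numpy as np

class ExpertStore:
    """Memory-maps one packed expert-weight file per layer (layout assumed)."""
    def __init__(self, path, expert_bytes):
        self.mm = np.memmap(path, dtype=np.uint8, mode="r")
        self.expert_bytes = expert_bytes  # size of one quantized expert

    def load(self, expert_id):
        # Slicing the memmap faults in only this expert's pages from SSD;
        # the OS page cache keeps recently used experts resident in RAM.
        start = expert_id * self.expert_bytes
        return self.mm[start:start + self.expert_bytes]

def moe_layer(x, router_logits, store, dequant_mlp, top_k=4):
    # The router selects top-k experts for this token; softmax their gates.
    top = np.argsort(router_logits)[-top_k:]
    gates = np.exp(router_logits[top] - router_logits[top].max())
    gates /= gates.sum()
    out = np.zeros_like(x)
    for gate, eid in zip(gates, top):
        packed = store.load(int(eid))         # SSD -> page cache -> RAM
        out += gate * dequant_mlp(packed, x)  # dequantize + run expert MLP
    return out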
3-Bit Weight Quantization
MLX's native mx.quantize(bits=3) with quantized_matmul(bits=3):
| Metric | 4-bit | 3-bit | Improvement |
|---|---|---|---|
| Speed | 112 tok/s | 136 tok/s | 21% faster |
| Load time | 3.6s | 0.9s | 4x faster |
| GPU memory | 2,368MB | 1,842MB | 526MB saved |
| Disk | 2.4GB | 1.84GB | 23% smaller |
| Quality | ✓ correct | ✓ correct | Same |
Conversion: one-time dequant 4-bit → requant 3-bit → save safetensors. Model saved at ~/.kandiga/models/Qwen3.5-4B-3bit/.
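The requant step can be sketched with MLX's native quantization API (the group size and per-layer handling here are assumptions, not the exact conversion script):
import mlx.core as mx

def requantize_3bit(w_q4, scales4, biases4, group_size=64):
    # Dequantize the 4-bit weights back to float...
    w = mx.dequantize(w_q4, scales4, biases4, group_size=group_size, bits=4)
    # ...then requantize at 3 bits; returns (w_q3, scales3, biases3).
    return mx.quantize(w, group_size=group_size, bits=3)

# At inference time the 3-bit matmul uses the same native kernel:
# y = mx.quantized_matmul(x, w_q3, scales3, biases3,
#                         transpose=True, group_size=64, bits=3)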
Full 3-bit MoE — shared layers on GPU + expert weights on SSD, both at 3-bit:
| Model | 4-bit | Full 3-bit | Speed gain | GPU savings |
|---|---|---|---|---|
| 35B-A3B | ~8 tok/s, 1.4 GB | ~12 tok/s, 1.0 GB | +50% | -22% |
| 122B-A10B | ~2 tok/s, 3.5 GB | ~4 tok/s, 2.7 GB | +100% | -22% |
| Gemma 4 26B-A4B | N/A | ~10 tok/s, 1.35 GB | — | — |
Conversion (one-time):
# Shared layers (GPU): dequant 4-bit → requant 3-bit → save safetensors
python scripts/convert_3bit.py mlx-community/Qwen3.5-35B-A3B-4bit \
~/.kandiga/models/Qwen3.5-35B-A3B-3bit-shared
# Expert weights (SSD): repack binary files at 3-bit (22% smaller = 22% less I/O)
python scripts/repack_experts_3bit.py ~/.kandiga/experts/Qwen3.5-35B-A3B-4bit/packed
Both are auto-detected on engine startup. The NEON-vectorized 3-bit dequant kernel matches MLX's bit layout, illustrated generically below.
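A generic 3-bit pack/unpack round-trip showing why the CPU kernel and MLX must agree on bit order (MLX's actual 3-bit layout may differ from this little-endian illustration):
# Eight 3-bit codes fit exactly in three bytes. A dequant kernel reading
# these bytes must use the same bit order as the code that packed them.
import numpy as np

def pack_3bit(codes):  # codes: uint8 array of values 0..7
    bits = np.unpackbits(codes[:, None], axis=1, bitorder="little")[:, :3]
    return np.packbits(bits.ravel(), bitorder="little")

def unpack_3bit(packed, n):
    bits = np.unpackbits(packed, bitorder="little")[: n * 3].reshape(n, 3)
    return np.packbits(bits, axis=1, bitorder="little")[:, 0]

codes = np.array([5, 1, 7, 0, 3, 6, 2, 4], dtype=np.uint8)
assert (unpack_3bit(pack_3bit(codes), 8) == codes).all()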
Agent System
Kandiga includes a full AI agent with native Qwen3.5 tool calling:
Architecture:
- 4B (3-bit, 136 tok/s): tool call JSON generation, route classification
- 35B K=4 (6.7 tok/s): response writing, reasoning via session KV cache
- 17 tools: filesystem (read/write/list/search), shell, web search, macOS (Calendar, Reminders, Notes, Notifications, Finder, Contacts, system info, text-to-speech)
- Skill engine: OpenClaw-compatible SKILL.md format
- Memory: MEMORY.md + daily notes + persistent KV cache sessions
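A hypothetical sketch of the dual-model loop described above: the 4B model emits a tool-call JSON, the host executes the tool, and the 35B model writes the reply. The tool name matches the list above, but the model callables and prompt formats are illustrative, not Kandiga's exact API:
import json, os

TOOLS = {
    "list_dir": lambda path=".": "\n".join(sorted(os.listdir(path))),
}

def agent_turn(user_msg, small_model, large_model):
    # 1. The fast 4B model decides on a tool call and emits JSON.
    plan = small_model(f"Emit a tool call (JSON) for: {user_msg}")
    try:
        call = json.loads(plan)  # e.g. {"tool": "list_dir", "args": {"path": "."}}
    except json.JSONDecodeError:
        call = None              # in Kandiga, json_repair.py recovers here
    # 2. Execute the tool locally; no cloud round-trip.
    observation = ""
    if call and call.get("tool") in TOOLS:
        observation = TOOLS[call["tool"]](**call.get("args", {}))
    # 3. The 35B model writes the final answer from the tool output,
    #    reusing the session KV cache for conversation context.
    return large_model(f"User: {user_msg}\nTool output: {observation}\nAnswer:")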
Agent performance (M4 Mac Mini 16GB):
| Task | Time | Tools Used |
|---|---|---|
| Hello | 2.5s | — |
| List files | 3s | list_dir |
| Create + run script | 6s | write_file, run_shell |
| Web search | 3s | web_search |
| Math (127 × 389) | 3s | — (direct) |
| What time is it | 3s | — (injected) |
| Delete file | 5s | run_shell |
| Recall turn 1 at turn 44 | 8s | — (KV cache) |
95% accuracy over a 44-turn conversation, averaging 3.1s per turn. The persistent KV cache maintains context across all turns.
Performance (M4 Mac Mini, 16GB)
| Model | Mode | Decode* | Follow-up TTFT | RAM |
|---|---|---|---|---|
| Qwen3.5-4B (3-bit) | dense | 31 tok/s | <1s | 1.8 GB |
| Qwen3.5-35B (full 3-bit) | K=4 | ~12 tok/s | 2-4s | ~1 GB |
| Qwen3.5-122B (full 3-bit) | K=4 | ~4 tok/s | 5-10s | ~2.7 GB |
| Gemma 4 26B-A4B | K=4 | ~10 tok/s | 2-4s | ~1.35 GB |
*MoE decode speed depends on SSD bandwidth. Estimates for M4 internal NVMe.
Follow-up TTFT is constant regardless of conversation length thanks to persistent KV cache.
Persistent KV Cache
| Turn | Without persistent cache | With persistent cache |
|---|---|---|
| 1 | 8s (reads document) | 8s (reads once) |
| 5 | 25s (re-reads all) | 3s (new tokens only) |
| 30 | 2min+ (re-reads all) | 3s (new tokens only) |
Save/load sessions to disk:
engine.save_session("~/session.npz") # Save KV cache state
engine.load_session("~/session.npz") # Resume instantly (<0.1s)
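A typical multi-turn flow built on these calls (engine.generate is an assumed method name for illustration):
engine.load_session("~/session.npz")     # restore the saved KV state (<0.1s)
reply = engine.generate("And turn 50?")  # prefills only the new user tokens
engine.save_session("~/session.npz")     # persist for the next run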
TQ3 Weight Quantization
TQ3 (TurboQuant 3-bit) applies Walsh-Hadamard Transform rotation before quantization for better quality:
- Algorithm: WHT rotation → Lloyd-Max 8-level codebook → 3-bit packing
- Quality: 0.990 cosine similarity per layer (proven across all 32 layers)
- Metal kernel: Fused GEMV with SIMD WHT butterfly (cosine 1.0, 62% memory savings)
- Status: Algorithm proven, MLX native 3-bit is faster for production use
For production: use mx.quantize(bits=3) (MLX native). TQ3 WHT rotation is for research/future optimization.
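The core TQ3 transform can be sketched in a few lines (the 8-level codebook below is a uniform placeholder; a real Lloyd-Max codebook is fitted to the rotated weight distribution, and helper names are illustrative):
import numpy as np

def wht(x):
    """Orthonormal fast Walsh-Hadamard Transform; len(x) a power of two."""
    x, h, n = x.copy(), 1, len(x)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal, so wht is its own inverse

def tq3_quantize(w, codebook):
    rotated = wht(w)
    scale = float(np.abs(rotated).max()) or 1.0
    # Nearest codebook index per value; these 3-bit indices get bit-packed.
    idx = np.abs(rotated[:, None] / scale - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

codebook = np.linspace(-1, 1, 8)  # placeholder for a fitted Lloyd-Max codebook
idx, scale = tq3_quantize(np.random.randn(64).astype(np.float32), codebook)
reconstructed = wht(codebook[idx] * scale)  # inverse rotation to dequantize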
File Structure
kandiga/
├── engine.py # SEM inference engine (1387 lines)
├── kv_compress.py # TurboQuant KV cache compression
├── speculative.py # Dual-model speculative decoding
├── cli.py # CLI interface
├── chat.py # Interactive chat (Rich terminal)
├── serve.py # OpenAI-compatible API server
├── agents/ # AI agent layer
│ ├── agent_loop.py # Native Qwen3.5 tool-calling loop
│ ├── agent_chat.py # Agent interactive chat
│ ├── agent_serve.py # Agent web server + UI
│ ├── dual_engine.py # 4B + 35B dual-model engine
│ ├── pipeline.py # Agent pipeline (routing, tools, verification)
│ ├── tools.py # 17 tools (filesystem, shell, web, macOS)
│ ├── macos.py # macOS native integrations via osascript
│ ├── skills.py # OpenClaw-compatible SKILL.md engine
│ ├── memory.py # Persistent memory (MEMORY.md + daily notes)
│ ├── cloud.py # Cloud escalation (Kimi/Claude/OpenAI)
│ ├── protocol.py # Typed dataclasses (ToolCall, ToolResult, AgentResult)
│ └── json_repair.py # 5-strategy JSON repair (never crashes)
├── tq3/ # TQ3 weight quantization
│ ├── quantize.py # WHT + Lloyd-Max + packing (vectorized)
│ ├── engine.py # TQ3Linear layer + save/load
│ ├── fused_kernel.py # Metal GEMV kernel (SIMD WHT)
│ ├── integrate.py # Model conversion pipeline
│ ├── loader.py # TQ3 model loader
│ ├── convert_experts.py # MoE expert conversion
│ ├── mlx_patch.py # MLX model patching
│ └── tq3_metal.metal # Metal compute shader
├── metal/ # C/Metal inference (6,600+ lines)
│ ├── kandiga_cpu_expert.m # NEON expert MLP (35B)
│ ├── kandiga_cpu_expert_lg.m # NEON expert MLP (122B/397B)
│ ├── attention.metal # GPU attention kernels
│ ├── expert_mlp.metal # GPU expert MLP kernels
│ └── moe_block.metal # GPU MoE block kernels
├── static/
│ └── agent.html # Agent web UI
└── tools/ # Optional tool integrations
Development
git clone https://github.com/kantheon/kandiga.git
cd kandiga
pip install -e ".[serve,fast]"
pytest tests/ -v
License
MIT