# Kandiga

Run 35B AI models in 1.5GB of RAM. Any Mac.
Kandiga is an open-source MoE inference engine that uses Selective Expert Materialization to run models that would normally require 20GB+ of memory in under 2GB on any Apple Silicon Mac.
## How it works
Large MoE (Mixture of Experts) models like Qwen3.5-35B-A3B have 256 experts per layer, but only activate 8 per token. Kandiga exploits this sparsity:
- Shared layers (attention, norms, embeddings) load to GPU memory (~1.5GB)
- Expert MLP weights stay on disk in packed binary files (~17GB SSD)
- Per token: the router selects 8 experts, whose weights are read from SSD via `pread`
- CPU computes the expert MLPs with NEON-vectorized 4-bit dequant + GCD parallelism
- GPU computes attention simultaneously via MLX (unified memory, zero copy)
This is the KTransformers architecture adapted for Apple Silicon's unified memory.
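The per-token routing described above can be sketched in a few lines. This is an illustrative NumPy version, not Kandiga's actual router code; the logits and the softmax-renormalized blend weights are schematic:

```python
import numpy as np

def route_topk(router_logits: np.ndarray, k: int = 8):
    """Select the top-k experts for one token and softmax-renormalize
    their gate weights (256 experts per layer, k=8 active, per the text)."""
    topk = np.argsort(router_logits)[-k:][::-1]               # k highest-scoring experts
    w = np.exp(router_logits[topk] - router_logits[topk].max())
    return topk, w / w.sum()                                  # experts to load, blend weights

logits = np.random.default_rng(0).standard_normal(256)        # dummy router output
experts, weights = route_topk(logits, k=8)
```

Only the 8 selected experts per layer ever need their weights in memory, which is what lets the remaining ~17GB stay on disk.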
## Install

```bash
pip install kandiga
```

Requirements: macOS with Apple Silicon (M1/M2/M3/M4), Python 3.10+
## Quick start

```bash
# One-time setup: download model + prepare expert files (~20 min)
kandiga setup

# Interactive chat
kandiga chat

# Fast mode (K=4 experts instead of 8, ~2x speed, slightly lower quality)
kandiga chat --fast

# One-shot prompt
kandiga "What is the capital of France?"

# Start an OpenAI-compatible API server
kandiga serve

# Run benchmarks
kandiga bench
```
## Benchmarks
Measured on M4 Mac Mini (16GB), Qwen3.5-35B-A3B-4bit:
| Mode | Experts | Speed | RAM | Quality |
|---|---|---|---|---|
| Quality (K=8) | 8/256 per layer | ~3.5 tok/s | 1.5GB | Full |
| Fast (K=4) | 4/256 per layer | ~6.5 tok/s | 1.5GB | Near-equal |
For comparison, loading the full model requires 20.4GB of RAM, and MLX alone achieves ~25 tok/s when the model fits in memory. Kandiga trades speed for accessibility: if your Mac has 8-16GB of RAM, it can now run a 35B model that previously required a 24GB+ machine.
## Architecture

```
User prompt
     |
     v
[Tokenizer + Chat Template]
     |
     v
[MLX Forward Pass]
     |
     +---> GPU: Attention + Norms + Router + Shared Expert + Blending
     |
     +---> CPU: Routed Expert MLP (NEON 4-bit dequant + GCD parallel)
     |        |
     |        +-- pread expert weights from SSD (OS page cache)
     |        +-- gate_proj matvec (512x2048)
     |        +-- up_proj matvec (512x2048)
     |        +-- SwiGLU activation
     |        +-- down_proj matvec (2048x512)
     |
     v
[Token Output]
```
Both CPU and GPU operate on the same physical DRAM (Apple Silicon unified memory), so there is zero data transfer overhead between them.
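The four CPU steps in the diagram (gate, up, SwiGLU, down) form a standard SwiGLU MLP. Here is a minimal NumPy sketch using the matvec shapes from the diagram; the random weights are for illustration only, and the real kernel operates on 4-bit-packed weights in C with NEON:

```python
import numpy as np

def swiglu_expert(x, gate_proj, up_proj, down_proj):
    """One routed expert: x is the (2048,) hidden state; gate/up are
    (512, 2048) matvecs, down is (2048, 512), matching the diagram."""
    gate = gate_proj @ x                      # gate_proj matvec -> (512,)
    up = up_proj @ x                          # up_proj matvec   -> (512,)
    act = gate / (1.0 + np.exp(-gate)) * up   # SwiGLU: SiLU(gate) * up
    return down_proj @ act                    # down_proj matvec -> (2048,)

rng = np.random.default_rng(0)
x = rng.standard_normal(2048).astype(np.float32)
g = rng.standard_normal((512, 2048)).astype(np.float32) * 0.02
u = rng.standard_normal((512, 2048)).astype(np.float32) * 0.02
d = rng.standard_normal((2048, 512)).astype(np.float32) * 0.02
y = swiglu_expert(x, g, u, d)
```

The outputs of the selected experts are then blended on the GPU using the router's gate weights.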
## API Server

Kandiga includes an OpenAI-compatible HTTP API:

```bash
kandiga serve --port 8340
```

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8340/v1", api_key="unused")
response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
```
## Project structure

```
kandiga/
    __init__.py            # Package version
    cli.py                 # CLI entry point (argparse)
    engine.py              # Core inference engine (SEM)
    chat.py                # Interactive chat (Rich terminal UI)
    serve.py               # OpenAI-compatible HTTP API (FastAPI)
    bench.py               # Benchmarking suite
    setup.py               # Model download + expert splitting + packing
    _split_experts.py      # Split stacked weights into per-expert files
    _pack_experts.py       # Pack per-expert files into binary format
    _build.py              # Compile CPU expert dylib from source
    metal/
        kandiga_cpu_expert.h   # C API header
        kandiga_cpu_expert.m   # NEON + GCD implementation
        Makefile               # Build the dylib
    tools/
        __init__.py            # Future: web search, file access
scripts/
    install.sh             # Quick install script
tests/
    ...
```
## Development

```bash
# Clone
git clone https://github.com/kantheon/kandiga.git
cd kandiga

# Install in development mode
pip install -e ".[serve]"

# Build the CPU expert library
cd kandiga/metal && make && cd ../..

# Run tests
pytest tests/ -v
```
## License

MIT