
Kandiga

Run 35B AI models in 1.5GB of RAM. Any Mac.

Kandiga is an open-source MoE inference engine that uses Selective Expert Materialization to run models that would normally require 20GB+ of memory in under 2GB on any Apple Silicon Mac.

How it works

Large Mixture-of-Experts (MoE) models like Qwen3.5-35B-A3B have 256 experts per layer but activate only 8 per token. Kandiga exploits this sparsity:

  1. Shared layers (attention, norms, embeddings) load to GPU memory (~1.5GB)
  2. Expert MLP weights stay on disk in packed binary files (~17GB SSD)
  3. Per token: the router selects 8 experts, which are read from SSD via pread
  4. CPU computes expert MLP with NEON-vectorized 4-bit dequant + GCD parallelism
  5. GPU computes attention simultaneously via MLX (unified memory, zero copy)

This is the KTransformers architecture adapted for Apple Silicon's unified memory.
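A minimal sketch of step 3 above in Python, assuming a packed file in which each expert occupies one fixed-size record; the record layout, constants, and helper names here are illustrative, not Kandiga's actual on-disk format:

import os
import numpy as np

K = 8                                 # experts activated per token
NUM_EXPERTS = 256                     # experts per layer
EXPERT_BYTES = 3 * (512 * 2048 // 2)  # hypothetical: three 4-bit matrices per expert

def route(router_logits: np.ndarray, k: int = K) -> np.ndarray:
    # The router picks the top-k scoring experts for this token.
    return np.argpartition(router_logits, -k)[-k:]

def read_expert(fd: int, layer: int, expert: int) -> bytes:
    # pread fetches the expert's packed weights at a computed offset
    # without seeking; the OS page cache keeps hot experts in RAM.
    offset = (layer * NUM_EXPERTS + expert) * EXPERT_BYTES
    return os.pread(fd, EXPERT_BYTES, offset)

Because only the selected experts' records are touched each step, resident memory stays bounded by the shared layers; the page cache, not the process, decides which expert weights stay hot.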

Install

pip install kandiga

Requirements: macOS with Apple Silicon (M1/M2/M3/M4), Python 3.10+

Quick start

# One-time setup: download model + prepare expert files (~20 min)
kandiga setup

# Interactive chat
kandiga chat

# Fast mode (K=4 experts instead of 8, ~2x speed, slightly less quality)
kandiga chat --fast

# One-shot prompt
kandiga "What is the capital of France?"

# Start an OpenAI-compatible API server
kandiga serve

# Run benchmarks
kandiga bench

Benchmarks

Measured on M4 Mac Mini (16GB), Qwen3.5-35B-A3B-4bit:

Mode           Experts          Speed       RAM    Quality
Quality (K=8)  8/256 per layer  ~3.5 tok/s  1.5GB  Full
Fast (K=4)     4/256 per layer  ~6.5 tok/s  1.5GB  Near-equal

For comparison, loading the full model requires 20.4GB of RAM, and MLX alone achieves ~25 tok/s when the model fits in memory. Kandiga trades speed for accessibility: on an 8-16GB Mac you can now run a 35B model that previously needed a 24GB+ machine.
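A rough back-of-envelope on per-token SSD traffic, using the 4-bit weights and the matrix shapes from the Architecture diagram below (per layer only; the total scales with the layer count, which isn't stated here):

bytes_per_matrix = 512 * 2048 // 2            # 4-bit weights: 0.5 bytes each -> 512 KiB
bytes_per_expert = 3 * bytes_per_matrix       # gate + up + down -> ~1.5 MiB
bytes_per_layer_token = 8 * bytes_per_expert  # K=8 -> ~12 MiB of reads per layer per token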

Architecture

User prompt
    |
    v
[Tokenizer + Chat Template]
    |
    v
[MLX Forward Pass]
    |
    +---> GPU: Attention + Norms + Router + Shared Expert + Blending
    |
    +---> CPU: Routed Expert MLP (NEON 4-bit dequant + GCD parallel)
    |         |
    |         +-- pread expert weights from SSD (OS page cache)
    |         +-- gate_proj matvec (512x2048)
    |         +-- up_proj matvec (512x2048)
    |         +-- SwiGLU activation
    |         +-- down_proj matvec (2048x512)
    |
    v
[Token Output]

Both CPU and GPU operate on the same physical DRAM (Apple Silicon unified memory), so there is zero data transfer overhead between them.
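A minimal NumPy sketch of the CPU branch above; the real path runs in C with NEON intrinsics directly on the 4-bit weights and fans the selected experts out across GCD queues, so treat this as the math, not the implementation (shapes follow the diagram; dequantization is assumed already done):

import numpy as np

def expert_mlp(x: np.ndarray, gate: np.ndarray, up: np.ndarray, down: np.ndarray) -> np.ndarray:
    # One routed expert applied to a single token vector.
    # x: (2048,)   gate, up: (512, 2048)   down: (2048, 512)
    g = gate @ x                      # gate_proj matvec -> (512,)
    u = up @ x                        # up_proj matvec   -> (512,)
    h = (g / (1.0 + np.exp(-g))) * u  # SwiGLU: SiLU(gate) * up
    return down @ h                   # down_proj matvec -> (2048,)

The K expert outputs are then blended with the router weights and the shared expert on the GPU side, as in the diagram.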

API Server

Kandiga includes an OpenAI-compatible HTTP API:

kandiga serve --port 8340

Then call it from any OpenAI client:

import openai

client = openai.OpenAI(base_url="http://localhost:8340/v1", api_key="unused")
response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

Project structure

kandiga/
  __init__.py          # Package version
  cli.py               # CLI entry point (argparse)
  engine.py            # Core inference engine (SEM)
  chat.py              # Interactive chat (Rich terminal UI)
  serve.py             # OpenAI-compatible HTTP API (FastAPI)
  bench.py             # Benchmarking suite
  setup.py             # Model download + expert splitting + packing
  _split_experts.py    # Split stacked weights into per-expert files
  _pack_experts.py     # Pack per-expert files into binary format
  _build.py            # Compile CPU expert dylib from source
  metal/
    kandiga_cpu_expert.h   # C API header
    kandiga_cpu_expert.m   # NEON + GCD implementation
    Makefile               # Build the dylib
  tools/
    __init__.py            # Future: web search, file access
scripts/
  install.sh           # Quick install script
tests/
  ...

Development

# Clone
git clone https://github.com/kantheon/kandiga.git
cd kandiga

# Install in development mode
pip install -e ".[serve]"

# Build the CPU expert library
cd kandiga/metal && make && cd ../..

# Run tests
pytest tests/ -v

License

MIT

Download files

Download the file for your platform.

Source Distribution

kandiga-0.3.0.tar.gz (37.6 kB)

Built Distribution


kandiga-0.3.0-py3-none-any.whl (37.1 kB)

File details

Details for the file kandiga-0.3.0.tar.gz.

File metadata

  • Download URL: kandiga-0.3.0.tar.gz
  • Upload date:
  • Size: 37.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for kandiga-0.3.0.tar.gz
Algorithm    Hash digest
SHA256       5d69f50803b743ca062d354b33c3d1d5301a24d043e0a3c21038f82ec1c1981e
MD5          c5a5780d20bcc020aabda938b49cca6b
BLAKE2b-256  b8d077ba7d5adf990fa25e12f8f0080f137d765afc1c466e4bc17d5ab4c5d6d0


File details

Details for the file kandiga-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: kandiga-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 37.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for kandiga-0.3.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       8538b6167d92b5534553d5bdf04f7ef26d51c603c62ada6164c491298270e46b
MD5          81afa452f6f59c14fa99d3d1d0ee5192
BLAKE2b-256  f68f1bb20d0d38042b3bece50c5cefd24f3fde301e8f631b9712a46166c5c138

