
3 AI models. 161B parameters. One Mac. 5.5GB. Full agentic pipeline on Apple Silicon.


Kandiga

Giant models. Tiny memory.

Kandiga is an open-source MoE inference engine + AI agent for Apple Silicon. Run models that normally need 20-224GB of RAM in 2-8GB — on any Mac. No cloud, no API keys.

What It Does

  • Inference engine: Run 35B-397B parameter MoE models in 2-8GB RAM via Selective Expert Materialization (SEM)
  • AI agent: Tool calling, web search, file operations, macOS integrations (Calendar, Reminders, Notes, Notifications) — all local
  • 3-bit quantization: 21% faster and 22% smaller than 4-bit via MLX native mx.quantize(bits=3)
  • Persistent KV cache: Follow-up turns process only new tokens — turn 50 is as fast as turn 1
  • TurboQuant: 3.8x KV cache compression for longer conversations

Supported Models

Model                           Parameters  Active  Disk         Kandiga RAM  Decode*    Status
Qwen3.5-4B (3-bit)              4B          4B      1.84 GB      ~1.8 GB      31 tok/s   Proven
Qwen3.5-35B-A3B (full 3-bit)    35B         3B      20 GB        ~1 GB        ~12 tok/s  Proven
Qwen3.5-122B-A10B (full 3-bit)  122B        10B     70 GB        ~2.7 GB      ~4 tok/s   Proven
Gemma 4 26B-A4B                 26B         4B      13 GB        ~1.35 GB     ~10 tok/s  Proven
Qwen3.5-397B-A17B (full 3-bit)  397B        17B     224 GB est.  ~5 GB est.   ~1 tok/s   Pending

*MoE decode speed depends on SSD bandwidth. Estimates for internal NVMe on M4.

Install

pip install kandiga

# For maximum speed (includes ZMLX fused kernels):
pip install 'kandiga[fast]'   # quotes needed under zsh, the macOS default shell

Requirements: macOS with Apple Silicon (M1/M2/M3/M4), Python 3.10+

Quick Start

# One-time setup: choose model, download, prepare expert files
kandiga setup

# Interactive chat
kandiga chat

# Fast mode (K=4 experts, ~2x speed)
kandiga chat --fast

# AI agent mode — tools, skills, memory, macOS integrations
kandiga agent --fast

# Agent with web UI
kandiga agent --fast --web

# One-shot prompt
kandiga "What is the capital of France?"

# OpenAI-compatible API server
kandiga serve

# Benchmarks
kandiga bench
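
The server speaks the OpenAI chat completions protocol, so any OpenAI client can point at it. A minimal sketch with the official openai package; the base URL, port, and model name below are assumptions, so check kandiga serve --help for the real defaults:

from openai import OpenAI

# Assumed address and model name; adjust to your `kandiga serve` settings.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="kandiga",  # placeholder model name
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)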

Architecture

Inference Engine (SEM)

MoE models have hundreds of expert sub-networks per layer, but only activate a few per token. Kandiga exploits this sparsity:

  1. Selective Expert Materialization — shared layers on GPU (~1.4GB), expert weights on SSD. Only router-selected experts are loaded per token (see the sketch after this list).
  2. Custom Metal GPU kernels — prefill runs expert MLP entirely on GPU. One dispatch, zero Python overhead.
  3. CPU NEON decode — single-token expert MLP on CPU with NEON-vectorized 4-bit dequant. Faster than GPU for single tokens (no Metal dispatch overhead).
  4. Cross-layer speculation — predicts next layer's experts with 77% accuracy. Pre-fetches into OS page cache during current compute.
  5. TurboQuant KV compression — 3.8x compression (16-bit → 3-bit) via PolarQuant + QJL. Enables 32K context on 16GB.
  6. ZMLX fused kernels — optimized attention and norms.
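
The core of SEM (items 1 and 4) fits in a few lines. The sketch below is illustrative only: real experts are gated MLPs, and the real loader adds cross-layer speculation and custom kernels.

import numpy as np

def moe_layer(x, router_logits, expert_files, k=4):
    # The router picks the K most relevant experts for this token.
    top_k = np.argsort(router_logits)[-k:]
    gates = np.exp(router_logits[top_k])
    gates /= gates.sum()                                 # softmax over the top-K
    out = np.zeros_like(x)
    for gate, idx in zip(gates, top_k):
        # mmap means only the selected expert's pages are faulted in from SSD,
        # so resident memory tracks the K active experts, not the full model.
        expert_w = np.load(expert_files[idx], mmap_mode="r")
        out += gate * (x @ expert_w)                     # simplified: one matrix per expert
    return out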

3-Bit Weight Quantization

MLX's native mx.quantize(bits=3) with quantized_matmul(bits=3):

Metric      4-bit      3-bit      Improvement
Speed       112 tok/s  136 tok/s  21% faster
Load time   3.6s       0.9s       4x faster
GPU memory  2,368 MB   1,842 MB   526 MB saved
Disk        2.4 GB     1.84 GB    23% smaller
Quality     ✓ correct  ✓ correct  Same

Conversion: one-time dequant 4-bit → requant 3-bit → save safetensors. Model saved at ~/.kandiga/models/Qwen3.5-4B-3bit/.
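
The conversion step maps directly onto MLX's public quantization API. A minimal sketch with a toy random weight, assuming the default group size of 64:

import mlx.core as mx

# Toy example: quantize a random weight to 4-bit, then requantize to 3-bit.
w = mx.random.normal((4096, 4096))
wq4, scales4, biases4 = mx.quantize(w, group_size=64, bits=4)

# One-time conversion: dequant 4-bit -> requant 3-bit.
w_fp = mx.dequantize(wq4, scales4, biases4, group_size=64, bits=4)
wq3, scales3, biases3 = mx.quantize(w_fp, group_size=64, bits=3)

# Inference then runs directly on the packed 3-bit weights.
x = mx.random.normal((1, 4096))
y = mx.quantized_matmul(x, wq3, scales3, biases3,
                        transpose=True, group_size=64, bits=3)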

Full 3-bit MoE — shared layers on GPU + expert weights on SSD, both at 3-bit:

Model            4-bit             Full 3-bit          Speed gain  GPU savings
35B-A3B          ~8 tok/s, 1.4 GB  ~12 tok/s, 1.0 GB   +50%        -22%
122B-A10B        ~2 tok/s, 3.5 GB  ~4 tok/s, 2.7 GB    +100%       -22%
Gemma 4 26B-A4B  N/A               ~10 tok/s, 1.35 GB

Conversion (one-time):

# Shared layers (GPU): dequant 4-bit → requant 3-bit → save safetensors
python scripts/convert_3bit.py mlx-community/Qwen3.5-35B-A3B-4bit \
    ~/.kandiga/models/Qwen3.5-35B-A3B-3bit-shared

# Expert weights (SSD): repack binary files at 3-bit (22% smaller = 22% less I/O)
python scripts/repack_experts_3bit.py ~/.kandiga/experts/Qwen3.5-35B-A3B-4bit/packed

Both are auto-detected at engine startup. The NEON-vectorized 3-bit dequant kernel matches MLX's bit layout.
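
For intuition, repacking at 3-bit stores each code in 3 bits instead of 4; raw packing is 25% denser, and the quoted 22% most likely reflects per-group scales and biases that stay at higher precision. An illustrative pack/unpack in NumPy (MLX's actual bit layout may differ; kandiga's kernel matches it exactly):

import numpy as np

def pack_3bit(vals):
    # vals: uint8 array with entries in [0, 7]
    bits = np.unpackbits(vals[:, None], axis=1, count=3, bitorder="little")
    return np.packbits(bits.reshape(-1), bitorder="little")

def unpack_3bit(packed, n):
    bits = np.unpackbits(packed, bitorder="little")[: n * 3]
    return np.packbits(bits.reshape(n, 3), axis=1, bitorder="little")[:, 0]

vals = np.array([5, 0, 7, 3, 1], dtype=np.uint8)
assert np.array_equal(unpack_3bit(pack_3bit(vals), len(vals)), vals)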

Agent System

Kandiga includes a full AI agent with native Qwen3.5 tool calling:

Architecture:

  • 4B (3-bit, 136 tok/s): tool call JSON generation, route classification
  • 35B K=4 (6.7 tok/s): response writing, reasoning via session KV cache
  • 17 tools: filesystem (read/write/list/search), shell, web search, macOS (Calendar, Reminders, Notes, Notifications, Finder, Contacts, system info, text-to-speech)
  • Skill engine: OpenClaw-compatible SKILL.md format
  • Memory: MEMORY.md + daily notes + persistent KV cache sessions
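
The loop behind these pieces is the standard tool-calling cycle: the 4B model emits a tool call as JSON, the runtime executes it, and the result is fed back for the next generation. A hypothetical sketch; the engine.generate and try_parse_tool_call names are illustrative, not kandiga's actual API:

import json

def agent_loop(engine, tools, user_msg, max_steps=8):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = engine.generate(messages)            # assumed generation API
        call = try_parse_tool_call(reply)
        if call is None:
            return reply                             # plain answer: done
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "tool", "content": json.dumps(result)})
    return reply

def try_parse_tool_call(text):
    # Minimal stand-in for kandiga's 5-strategy JSON repair.
    try:
        obj = json.loads(text)
        return obj if isinstance(obj, dict) and "name" in obj and "arguments" in obj else None
    except json.JSONDecodeError:
        return None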

Agent performance (M4 Mac Mini 16GB):

Task                      Time  Tools Used
Hello                     2.5s
List files                3s    list_dir
Create + run script       6s    write_file, run_shell
Web search                3s    web_search
Math (127 × 389)          3s    — (direct)
What time is it           3s    — (injected)
Delete file               5s    run_shell
Recall turn 1 at turn 44  8s    — (KV cache)

95% accuracy across a 44-turn conversation, at an average of 3.1s per turn; the persistent KV cache maintains context across all turns.

Performance (M4 Mac Mini, 16GB)

Model                      Mode   Decode*    Follow-up TTFT  RAM
Qwen3.5-4B (3-bit)         dense  31 tok/s   <1s             1.8 GB
Qwen3.5-35B (full 3-bit)   K=4    ~12 tok/s  2-4s            ~1 GB
Qwen3.5-122B (full 3-bit)  K=4    ~4 tok/s   5-10s           ~2.7 GB
Gemma 4 26B-A4B            K=4    ~10 tok/s  2-4s            ~1.35 GB

*MoE decode speed depends on SSD bandwidth. Estimates for M4 internal NVMe.

Follow-up TTFT is constant regardless of conversation length thanks to persistent KV cache.

Persistent KV Cache

Without persistent cache:       With persistent cache:
  Turn 1:  8s (reads document)     Turn 1:  8s (reads once)
  Turn 5:  25s (re-reads all)      Turn 5:  3s (new tokens only)
  Turn 30: 2min+ (re-reads all)    Turn 30: 3s (new tokens only)

Save/load sessions to disk:

# `engine` is a running Kandiga engine instance
engine.save_session("~/session.npz")   # save the KV cache state
engine.load_session("~/session.npz")   # resume instantly (<0.1s)

TQ3 Weight Quantization

TQ3 (TurboQuant 3-bit) applies a Walsh-Hadamard transform (WHT) rotation before quantization to improve quality:

  • Algorithm: WHT rotation → Lloyd-Max 8-level codebook → 3-bit packing
  • Quality: 0.990 cosine similarity per layer (proven across all 32 layers)
  • Metal kernel: Fused GEMV with SIMD WHT butterfly (cosine 1.0, 62% memory savings)
  • Status: Algorithm proven, MLX native 3-bit is faster for production use

For production: use mx.quantize(bits=3) (MLX native). TQ3 WHT rotation is for research/future optimization.
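
For intuition, here is the rotate-then-quantize idea in a short sketch. It is illustrative only: TQ3 fits a Lloyd-Max codebook per layer, whereas the uniform 8-level codebook below is a simplification, and none of this is the production code path.

import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform over the last axis (power-of-2 length).
    # Orthonormal, so the same function also inverts the rotation.
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_3bit(w):
    rotated = fwht(w)                                        # spread outliers across dims
    levels = np.linspace(rotated.min(), rotated.max(), 8)    # 2^3 = 8 levels (TQ3: Lloyd-Max)
    codes = np.abs(rotated[..., None] - levels).argmin(-1)   # nearest-level assignment
    return codes.astype(np.uint8), levels                    # dequant: fwht(levels[codes])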

File Structure

kandiga/
├── engine.py              # SEM inference engine (1387 lines)
├── kv_compress.py         # TurboQuant KV cache compression
├── speculative.py         # Dual-model speculative decoding
├── cli.py                 # CLI interface
├── chat.py                # Interactive chat (Rich terminal)
├── serve.py               # OpenAI-compatible API server
├── agents/                # AI agent layer
│   ├── agent_loop.py      # Native Qwen3.5 tool-calling loop
│   ├── agent_chat.py      # Agent interactive chat
│   ├── agent_serve.py     # Agent web server + UI
│   ├── dual_engine.py     # 4B + 35B dual-model engine
│   ├── pipeline.py        # Agent pipeline (routing, tools, verification)
│   ├── tools.py           # 17 tools (filesystem, shell, web, macOS)
│   ├── macos.py           # macOS native integrations via osascript
│   ├── skills.py          # OpenClaw-compatible SKILL.md engine
│   ├── memory.py          # Persistent memory (MEMORY.md + daily notes)
│   ├── cloud.py           # Cloud escalation (Kimi/Claude/OpenAI)
│   ├── protocol.py        # Typed dataclasses (ToolCall, ToolResult, AgentResult)
│   └── json_repair.py     # 5-strategy JSON repair (never crashes)
├── tq3/                   # TQ3 weight quantization
│   ├── quantize.py        # WHT + Lloyd-Max + packing (vectorized)
│   ├── engine.py          # TQ3Linear layer + save/load
│   ├── fused_kernel.py    # Metal GEMV kernel (SIMD WHT)
│   ├── integrate.py       # Model conversion pipeline
│   ├── loader.py          # TQ3 model loader
│   ├── convert_experts.py # MoE expert conversion
│   ├── mlx_patch.py       # MLX model patching
│   └── tq3_metal.metal    # Metal compute shader
├── metal/                 # C/Metal inference (6,600+ lines)
│   ├── kandiga_cpu_expert.m    # NEON expert MLP (35B)
│   ├── kandiga_cpu_expert_lg.m # NEON expert MLP (122B/397B)
│   ├── attention.metal         # GPU attention kernels
│   ├── expert_mlp.metal        # GPU expert MLP kernels
│   └── moe_block.metal         # GPU MoE block kernels
├── static/
│   └── agent.html         # Agent web UI
└── tools/                 # Optional tool integrations

Development

git clone https://github.com/kantheon/kandiga.git
cd kandiga
pip install -e ".[serve,fast]"
pytest tests/ -v

License

MIT
