Distributed LLM inference on Apple Silicon — Pipeline Parallelism + Speculative Decoding + OpenAI-compatible API
Hippo
Distributed LLM inference on Apple Silicon. Pipeline parallelism across dual Mac Minis, speculative decoding for single-machine speedup, and an OpenAI-compatible API. Built for governed, auditable AI deployments.
Quick Start · Performance · Modes · API · Configuration · Architecture
Performance
Measured on dual Mac Mini M4 (16 GB each).
| Model | Mode | Hardware | tok/s | Notes |
|---|---|---|---|---|
| Qwen3-4B | DFlash | Single machine | 47.8 | ~4× over standalone |
| Gemma-3-12B | Pipeline (Thunderbolt) | 2 machines | 8.3 | ~2.4× over standalone |
| Gemma-3-12B | Pipeline (Wi-Fi) | 2 machines | 7.0 | Works without Thunderbolt |
| Qwen3-4B | Standalone | Single machine | 12.0 | Baseline |
| Gemma-3-12B | Standalone | Single machine | 3.5 | Baseline |
Modes
Hippo supports three inference modes; pick one based on model size:
- Standalone — Single machine, vanilla MLX inference. For models that fit in RAM.
- DFlash — Single machine, speculative decoding via DFlash. ~4× faster than standalone. For models under ~8B.
- Pipeline — Two machines, model split across Thunderbolt or Wi-Fi. For models too large for one machine (8-15B on 16 GB machines).
Rule of thumb: Small model → DFlash. Big model → Pipeline. Don't stack them (ADR-163: 16 GB can't hold shard + target + draft simultaneously).
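For illustration, here is a minimal sketch of that decision rule in Python (the helper name and the exact thresholds are our assumptions, not Hippo's code):

```python
def pick_mode(params_b: float) -> str:
    """Rule of thumb above, assuming 16 GB machines (illustrative only)."""
    if params_b < 8:
        return "dflash"    # fits on one machine: speculative decoding
    if params_b <= 15:
        return "pipeline"  # too big for one 16 GB machine: shard across two
    raise ValueError("beyond what two 16 GB machines can comfortably hold")

print(pick_mode(4))    # qwen3-4b    -> dflash
print(pick_mode(12))   # gemma-3-12b -> pipeline
```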
Quick Start
Prerequisites
- Apple Silicon Mac (M1+)
- Python 3.11+
- MLX (`pip install mlx`)
Pipeline mode (two machines)
R1 (start first):
```bash
cd hippo/pipeline
./start.sh r1
```
R0 (start second):
```bash
cd hippo/pipeline
./start.sh r0 --model gemma-3-12b --prompt "Explain quantum computing"
```
DFlash mode (single machine)
```bash
./start.sh dflash --model qwen3-4b --prompt "Write a Python web server"
```
Benchmark
```bash
./benchmark.sh 3 50 thunderbolt   # 3 runs, 50 tokens, Thunderbolt
./benchmark.sh 3 50 wifi          # 3 runs, 50 tokens, Wi-Fi
```
OpenAI-Compatible API
Start the API server:
```bash
python hippo_api.py --config hippo.conf.yaml
```
Three endpoints, same shapes as OpenAI:
```bash
# List models
curl http://localhost:8002/v1/models

# Chat completion (streaming)
curl http://localhost:8002/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -d '{"model":"gemma-3-12b","messages":[{"role":"user","content":"Hello"}]}'

# Health check
curl http://localhost:8002/health
```
Works with Cursor, Open WebUI, Continue — change base_url and it just works.
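For example, with the official openai Python client, pointing base_url at the Hippo server should be the only change (model name and token taken from the curl examples above):

```python
from openai import OpenAI

# Same client code you'd use against api.openai.com; only base_url changes.
client = OpenAI(base_url="http://localhost:8002/v1", api_key="your-token")

stream = client.chat.completions.create(
    model="gemma-3-12b",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```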
Web UI
```bash
python hippo_web.py --config hippo.conf.yaml
```
Gradio chat interface at http://localhost:7860.
Configuration
hippo.conf.yaml drives everything:
```yaml
defaults:
  mode: standalone        # standalone | pipeline | dflash
  host: "0.0.0.0"
  port: 9998

models:
  qwen3-4b:
    repo: "Qwen/Qwen3-4B"
    precision: "bf16"
    size_gb: 7.8
    modes: [standalone, dflash]
    dflash:
      draft_repo: "Aryagm/dflash-draft-qwen3-4b"
  gemma-3-12b:
    repo: "google/gemma-3-12b-pt"
    precision: "qat-4bit"
    size_gb: 6.9
    modes: [standalone, pipeline]
    pipeline:
      shards: 2
      r0_layers: [0, 24]
      r1_layers: [25, 47]
```
Memory guard built in — refuses to load if it won't fit (RAM × safety_factor).
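A minimal sketch of such a guard (the function name and the 0.8 safety factor are our assumptions, not Hippo's actual implementation):

```python
def assert_model_fits(size_gb: float, ram_gb: float = 16.0,
                      safety_factor: float = 0.8) -> None:
    """Refuse to load a model whose footprint exceeds RAM × safety_factor."""
    budget_gb = ram_gb * safety_factor
    if size_gb > budget_gb:
        raise MemoryError(
            f"model needs {size_gb:.1f} GB, budget is {budget_gb:.1f} GB "
            f"({ram_gb:.0f} GB RAM × {safety_factor} safety factor)"
        )

assert_model_fits(7.8)   # qwen3-4b bf16 on 16 GB: passes at a 12.8 GB budget
```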
Architecture
```
R0 (Mac Mini 1)                       R1 (Mac Mini 2)
┌─────────────────┐                   ┌─────────────────┐
│ Layers 0-23     │   hidden state    │ Layers 24-47    │
│ (prefill +      │ ────────────────> │ (forward +      │
│  decode loop)   │   Thunderbolt/    │  lm_head)       │
│                 │   Wi-Fi           │                 │
│ sample token    │ <──────────────── │                 │
└─────────────────┘   top-k logits    └─────────────────┘
```
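Each decode step is one round trip: R0 runs its layers, ships the final hidden state to R1, and samples from the top-k logits R1 sends back. A simplified sketch of that loop (the framing and helper names are ours; the real protocol lives in tcp_transport.py):

```python
import pickle
import socket
import struct

def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def send_msg(sock: socket.socket, obj) -> None:
    # Length-prefixed pickle framing (illustrative, not Hippo's wire format).
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock: socket.socket):
    (size,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, size))

def decode_step(sock, r0_layers, hidden, sample):
    for layer in r0_layers:          # R0 runs its half of the model
        hidden = layer(hidden)
    send_msg(sock, hidden)           # hidden state -> R1 over Thunderbolt/Wi-Fi
    topk_logits = recv_msg(sock)     # R1 ran its layers + lm_head
    return sample(topk_logits)       # R0 samples the next token locally
```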
Why SD doesn't help Pipeline
Counter-intuitive but empirically verified: speculative decoding (including DFlash) does not accelerate pipeline inference. The bottleneck is R0's forward pass (~100 ms/step). SD saves time on sampling, but verifying draft tokens also requires an R0 forward pass, so SD does not reduce the number of forward passes. Net result: slower than the pipeline baseline (4.3 tok/s vs. 6.8 tok/s).
Pipeline solves the memory problem. SD solves the speed problem. They're orthogonal.
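Back-of-the-envelope arithmetic from the measured numbers makes the point (our derivation, not project-published figures):

```python
# Rough per-token step times derived from the measured throughputs above.
step_ms = 1000 / 6.8                 # ~147 ms/token at pipeline baseline
sd_step_ms = 1000 / 4.3              # ~233 ms/token with SD enabled
overhead_ms = sd_step_ms - step_ms   # ~86 ms/token of pure SD overhead
# Verification still costs an R0 forward (~100 ms), so SD cannot lower the
# per-token floor; drafting only adds time on top of it.
print(f"SD adds ~{overhead_ms:.0f} ms/token on top of the ~100 ms forward floor")
```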
Memory budget (16 GB machines)
| Model | Mode | Per-machine | Margin | Verdict |
|---|---|---|---|---|
| Gemma-3-12B | Pipeline | 3.5 GB | +4.1 GB | ✅ Comfortable |
| Qwen3-4B | DFlash | 8.8 GB | -1.2 GB | ❌ Needs 48 GB |
| Qwen3-4B | Standalone | 7.8 GB | -0.2 GB | ⚠️ Tight |
| Qwen3-8B | Pipeline | 7.8 GB | -0.2 GB | ⚠️ Tight |
Project structure
```
hippo/
├── hippo_api.py          # OpenAI-compatible API server
├── hippo_web.py          # Gradio chat UI
├── hippo_cli.py          # Unified CLI (serve/benchmark/list-models)
├── hippo.conf.yaml       # Configuration (models × modes)
├── start.sh              # One-command launcher
├── pipeline/             # Core inference engine
│   ├── rank0.py          # R0: autoregressive generation
│   ├── rank1.py          # R1: persistent server
│   ├── model_ops.py      # MLX ops (RoPE, quantized linear)
│   ├── tcp_transport.py  # Thunderbolt/Wi-Fi transport
│   └── benchmark.sh      # Multi-run benchmark tool
├── api.py                # Legacy Ollama-compatible API
├── model_manager.py      # Legacy model lifecycle
└── tests/                # Test suite
```
Roadmap
- Continuous batching
- Model hot-swap via API
- Qwen3-8B pipeline optimization
- Benchmark dashboard (Prometheus + Grafana)
- Audit logging for compliance reporting
Contributing
PRs welcome. See CONTRIBUTING.md.
License
MIT — see LICENSE.
Credits
Built by lawcontinue with help from the T-Mind agent family. Powered by MLX. Part of the Agora governance ecosystem — auditable, governed inference.
Download files
File details
Details for the file hippo_llm-0.2.0.tar.gz.
File metadata
- Download URL: hippo_llm-0.2.0.tar.gz
- Size: 35.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1327031c173cef3c8b12d3a995cf3604c7a2a180a33cc20ca5824f91312cce0a |
| MD5 | 9c827f13d81de5ee1696f3a40b4109d1 |
| BLAKE2b-256 | d3bba92349082e83095eade5d861ba1b59dce0e95d30ea12a74f679e490551c1 |
Provenance
The following attestation bundles were made for hippo_llm-0.2.0.tar.gz:
Publisher: publish.yml on lawcontinue/hippo
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hippo_llm-0.2.0.tar.gz
- Subject digest: 1327031c173cef3c8b12d3a995cf3604c7a2a180a33cc20ca5824f91312cce0a
- Sigstore transparency entry: 1435063515
- Permalink: lawcontinue/hippo@ba7d7200aa0a5a6829083fe7bae5a55013ad3318
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/lawcontinue
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ba7d7200aa0a5a6829083fe7bae5a55013ad3318
- Trigger Event: release
File details
Details for the file hippo_llm-0.2.0-py3-none-any.whl.
File metadata
- Download URL: hippo_llm-0.2.0-py3-none-any.whl
- Size: 42.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7f2b6fa238527dbbcba7702c652d47fd7996b0eb37e10fee744595c4c847b0f9 |
| MD5 | fa24e01a21cb041f2d98a415dcde81a4 |
| BLAKE2b-256 | 31f9dc80ffc07616c68005634dd1bc14e2acebd0246a57439cce25c3afc3889b |
Provenance
The following attestation bundles were made for hippo_llm-0.2.0-py3-none-any.whl:
Publisher: publish.yml on lawcontinue/hippo
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hippo_llm-0.2.0-py3-none-any.whl
- Subject digest: 7f2b6fa238527dbbcba7702c652d47fd7996b0eb37e10fee744595c4c847b0f9
- Sigstore transparency entry: 1435063518
- Permalink: lawcontinue/hippo@ba7d7200aa0a5a6829083fe7bae5a55013ad3318
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/lawcontinue
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ba7d7200aa0a5a6829083fe7bae5a55013ad3318
- Trigger Event: release