Local LLM inference server for Apple Silicon. Block-level paged KV cache for long-context workloads. 5.4× faster end-to-end on 4K-token prompts vs Ollama, less RAM, INT3 support for Qwen3. OpenAI-compatible API.
Project description
Squish
The fastest way to run local LLMs on Apple Silicon.
Sub-second model loads. Beats Ollama on throughput, tail latency, and full-response time. One OpenAI/Ollama-compatible daemon — no cloud, no API keys, fully offline.
███████╗██╗ ██╗██╗ ██╗ █████╗ █████╗ ██╗ ██╗ ██████╗ ██████╗ ██╗ ██╗
██╔════╝██║ ██║╚██╗██╔╝ ██╔══██╗ ██╔══██╗╚██╗██╔╝ ╚════██╗╚════██╗╚═╝██╔╝
███████╗███████║ ╚███╔╝ ╚██████║ ╚█████╔╝ ╚███╔╝ █████╔╝ █████╔╝ ██╔╝
╚════██║╚════██║ ██╔██╗ ╚═══██║ ██╔══██╗ ██╔██╗ ╚═══██╗██╔═══╝ ██╔╝
███████║ ██║██╔╝ ██╗ █████╔╝██╗╚█████╔╝██╔╝ ██╗ ██████╔╝███████╗██╔╝██╗
╚══════╝ ╚═╝╚═╝ ╚═╝ ╚════╝ ╚═╝ ╚════╝ ╚═╝ ╚═╝ ╚═════╝ ╚══════╝╚═╝ ╚═╝
faster cold start faster long-prompts less RAM
██████╗ ███████╗███████╗ ██████╗ ██╗ ██╗ ██╗███╗ ██╗████████╗██████╗
██╔═████╗ ██╔════╝██╔════╝ ╚════██╗██║ ██║ ██║████╗ ██║╚══██╔══╝╚════██╗
██║██╔██║ ███████╗███████╗ █████╔╝███████║ ██║██╔██╗ ██║ ██║ █████╔╝
████╔╝██║ ╚════██║╚════██║ ██╔═══╝ ╚════██║ ██║██║╚██╗██║ ██║ ╚═══██╗
╚██████╔╝██╗███████║███████║ ███████╗ ██║ ██║██║ ╚████║ ██║ ██████╔╝
╚═════╝ ╚═╝╚══════╝╚══════╝ ╚══════╝ ╚═╝ ╚═╝╚═╝ ╚═══╝ ╚═╝ ╚═════╝
cold load · 0.33–0.53s tok/s · beats Ollama quant default
██╗ ██╗███╗ ███╗███████╗ ██████╗ ███████╗ ██████╗ ██╗ ██████╗ ██████╗
███║███║████╗ ████║██╔════╝ ╚════██╗ ██╔════╝██╔════╝ ███║██╔═████╗██╔═████╗
╚██║╚██║██╔████╔██║███████╗ █████╔╝ ███████╗███████╗ ╚██║██║██╔██║██║██╔██║
██║ ██║██║╚██╔╝██║╚════██║ ╚═══██╗ ╚════██║██╔═══██╗ ██║████╔╝██║████╔╝██║
██║ ██║██║ ╚═╝ ██║███████║ ██████╔╝██╗███████║╚██████╔╝ ██║╚██████╔╝╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝╚══════╝ ╚═════╝ ╚═╝╚══════╝ ╚═════╝ ╚═╝ ╚═════╝ ╚═════╝
repeat TTFT · KV hit GB · smaller on disk inference modules
Squish separates how a model's weights are stored from how they run. Store them compressed and Metal-native; map them straight into unified memory; skip the dtype-conversion pass that makes every other loader slow. The result: a model that's ready in half a second, served by a persistent daemon that out-decodes Ollama and never re-does work it's already done.
The Numbers
Measured on an Apple M3 MacBook Pro, 16 GB — thermally controlled (each engine measured from the same ~50 °C baseline; validated by a first-vs-last drift check ≤ 1.7 % and live die-temperature logging, so the numbers reflect the engines, not the order they ran). Serving: Qwen2.5-7B-Instruct, Squish INT4/INT3 vs Ollama qwen2.5:7b (Q4_K_M), against both Ollama 0.18.2 and 0.30.7 (0.30.7 shown; 0.18.2 within noise).
| Metric | Ollama | Squish |
|---|---|---|
| Cold start — load + first token (1.5B) | 20–30 s | ≈ 0.5 s (54× load) |
| Full response @ 4000-token prompt | 37.5 s | 3.8 s (9.8× faster) |
| Decode throughput @ 75 tokens | 20.3 tok/s | 24.0 tok/s (INT3) |
| Inter-token tail (p95) @ 75 tokens | 52.4 ms | 42.7 ms (INT3) |
| Repeat-prompt TTFT (KV cache hit) | ~160 ms | 4–11 ms |
| Peak RAM during inference | 5.14 GB | 3.50 GB |
| Disk — 7B INT4 / INT3 | 4.36 GB / — | 4.00 / 3.56 GB |
| Cold short-prompt TTFT | 167 ms | 192 ms (honest loss) |
Squish wins decode throughput, inter-token tail latency, full-response time, and RAM — biggest on long contexts, where its KV cache reuses the prefill instead of re-running it. INT3 adds ~18 % decode over INT4 at no measured accuracy cost (arc_easy acc_norm 0.551 vs 0.541, tied). The one place Ollama wins is single-token latency on a cold, novel prompt — we say so plainly.
→ Methodology, thermal control, and the full ablation: docs/paper.md §4.4 · BENCHMARKS.md
Why Squish
Squish is built for the workload most local-LLM tools aren't tuned for: the same model called many times an hour, with shifting context — commit messages, code review, agent loops, multi-turn chat, document Q&A.
On a 16 GB Mac that workload fights the rest of your work. Ollama keeps ~5 GB resident and re-pays a long prefill on every new long prompt. Squish is a persistent daemon: the model loads once at login, and a two-cache architecture reuses prefill across requests — so an agent resending a 4,000-token system prompt every turn pays it once, not every turn.
Designed for one developer, one machine. Not a multi-tenant production API — and the docs never pretend otherwise.
Highlights
- Sub-second cold start — a three-tier weight cache maps Metal-native bf16 straight into unified memory, eliminating the dtype-conversion + CPU-heap pass that dominates
mlx_lm/safetensors cold load. 54× faster than a coldmlx_lmload, on 160 MB of load-phase RAM instead of 2.4 GB. - Faster decode than Ollama — a decoupled decode loop (one inference-thread handoff per request, not per token), GC suspended during generation, and P-core QoS pinning recover throughput the Python serving layer was wasting.
- Two-cache prefill reuse — a block-paged KV cache for shifting prefixes plus a prompt KV cache for exact repeats: single-digit-millisecond TTFT on a cache hit.
- Greedy-lossless speculation —
--prompt-lookupverifies a whole n-gram draft in one batched forward, token-for-token identical to greedy, ~1.6× faster on repetitive output. - INT4 / INT3 / INT8 quantization — INT3 is the recommended default; family-aware accuracy gates hard-block quant configs that would silently degrade.
- Drop-in compatible — OpenAI (
/v1/*) and Ollama (/api/*) endpoints on one server. Point your existing client at it and go. - 100+ composable optimization modules — KV compression, speculative decoding, quantization, attention acceleration, agent tool execution — each an independent flag on a single server.
- Native macOS surface — the SquishBar menu-bar app (status, tok/s, one-click model switch) and a cinematic dashboard ship alongside the CLI.
- Pre-squished models —
squish pullgrabs ready-to-run weights from huggingface.co/squishai.
Install
Requires Python 3.11–3.14 and macOS 13 (Ventura) or later on Apple Silicon.
# Homebrew (recommended — no compilation, all deps bundled)
brew tap konjoai/squish
brew install squish
squish doctor
# or pipx
pipx install squish-ai --python python3.13
squish doctor
The bundled squish_quant Rust extension installs automatically — squish doctor confirms it (✓ squish_quant Rust extension (6 GB/s quantizer)).
The PyPI package is
squish-ai; the CLI and Python module are bothsquish.
Quick Start
squish pull qwen2.5:7b # download a pre-squished model
squish run qwen2.5:7b # start the daemon (loads once, stays resident)
Use it from any OpenAI or Ollama client:
curl http://localhost:11435/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5:7b","messages":[{"role":"user","content":"Hello"}]}'
export OPENAI_BASE_URL=http://localhost:11435/v1 # OpenAI SDKs
export OPENAI_API_KEY=squish
export OLLAMA_HOST=http://localhost:11435 # Ollama clients
Browse models and start the daemon at login:
squish catalog # 40 models, 9 pre-squished on the Hub
squish search qwen3
squish pull qwen3:0.6b --int3 # INT3 variant (Qwen3, Qwen2.5, Llama families)
squish daemon install # macOS LaunchAgent — daemon starts at login
How it's fast
Storage ≠ runtime. Every standard loader pays the same boot tax: allocate a CPU buffer, read the safetensors, convert dtypes, copy to the accelerator — 2–30 s and ~2.4 GB of RAM, almost all of it wasted on bytes that never changed. Squish converts weights once into the exact bf16 Metal layout MLX uses, then mmaps them directly into the GPU address space. Zero conversion at load time.
The daemon never re-does work. A block-paged KV cache persists fixed-size token blocks to disk and reconstructs partial-prefix matches for shifting context; a prompt KV cache catches exact repeats. An agent loop that resends the same long prompt every turn hits the cache instead of re-prefilling.
Decode is bandwidth-bound, so we attack the right thing. On Apple Silicon each token streams the whole weight set from unified memory — a hard ceiling. The levers that move it are fewer weight bytes (INT3) and fewer forwards per token (greedy-lossless prompt-lookup). We measured the levers that don't help here (KV-cache quantization, small-draft speculation) and say so in the paper rather than shipping them as wins.
Accuracy gates are load-bearing. INT3 holds within ~1 pp of FP16 on Qwen3/Qwen2.5; Gemma-3 collapses (~15 pp). Squish enables INT3 only where it's safe and refuses the rest — you can't accidentally ship a config that quietly degrades.
Deep dive: docs/ARCHITECTURE.md · docs/paper.md.
What Squish Doesn't Do
Honesty is a feature. If any of these matter, Ollama or LM Studio is the right call:
- No GPU outside Apple Silicon. It's MLX-based; CUDA users want vLLM or llama.cpp.
- No multi-user serving. One developer, one machine — not a production API.
- No multimodal. Text only.
- Slower first token on a cold, short prompt than Ollama (192 ms vs 167 ms) — fundamental MLX prefill kernel cost. Squish's edge is everywhere else.
- Model conversion is slow. Squish needs models in its own format; first-time conversion takes minutes (
squish pullskips it with pre-squished weights).
Built the Konjo way
KONJO — Know, Outline, Nail, Justify, Optimize. ቆንጆ (beautiful) · 根性 (grit) · 건조 (strip to the essence).
Squish exists because nothing else was fast enough, so we built it — and held it to a higher floor than "it works." Every headline number is measured under thermal control. Every honest loss is printed next to the wins. Every line that isn't load-bearing is cut. Correctness is the floor; the ceiling is correct, fast, lean, and honest.
Project
- Website — squish.run — full docs, guides, and the benchmark report.
- Contributing — CONTRIBUTING.md. Issues, benchmarks, and PRs welcome.
- License — BUSL-1.1, see LICENSE.
- Models — huggingface.co/squishai
- Docs — Architecture · Paper · Benchmarks · Modules
- Org — konjoai · siblings: Squash (EU AI Act compliance), Vectro, Kohaku
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file squish_ai-9.34.1.tar.gz.
File metadata
- Download URL: squish_ai-9.34.1.tar.gz
- Upload date:
- Size: 2.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ee0b068906699e538325e1c02e4a92de355a3c453807800db3d44bab610bc2e5
|
|
| MD5 |
5c13d54894adb2e1bb59e706fc7c6eb2
|
|
| BLAKE2b-256 |
d8105e99283d4fa3ae9b8f4551d2fde4fd7fc3686fa429217865d6e0dcbe4941
|
Provenance
The following attestation bundles were made for squish_ai-9.34.1.tar.gz:
Publisher:
release.yml on konjoai/squish
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
squish_ai-9.34.1.tar.gz -
Subject digest:
ee0b068906699e538325e1c02e4a92de355a3c453807800db3d44bab610bc2e5 - Sigstore transparency entry: 1826273474
- Sigstore integration time:
-
Permalink:
konjoai/squish@8fe9f4da4643575d2395d1d22570b69a4f79ecc3 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/konjoai
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@8fe9f4da4643575d2395d1d22570b69a4f79ecc3 -
Trigger Event:
push
-
Statement type:
File details
Details for the file squish_ai-9.34.1-py3-none-any.whl.
File metadata
- Download URL: squish_ai-9.34.1-py3-none-any.whl
- Upload date:
- Size: 1.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a4980419fc37da13d0016e3694695c731a4706ea75d7e0bc8994d887b0dc2e48
|
|
| MD5 |
f31bcfbd6c56027cb2394c4ecf4a63c9
|
|
| BLAKE2b-256 |
33188a44b5f12c0aaf23411ee318d57e2f87906b7cd9bb0915324bd2dba635bc
|
Provenance
The following attestation bundles were made for squish_ai-9.34.1-py3-none-any.whl:
Publisher:
release.yml on konjoai/squish
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
squish_ai-9.34.1-py3-none-any.whl -
Subject digest:
a4980419fc37da13d0016e3694695c731a4706ea75d7e0bc8994d887b0dc2e48 - Sigstore transparency entry: 1826273686
- Sigstore integration time:
-
Permalink:
konjoai/squish@8fe9f4da4643575d2395d1d22570b69a4f79ecc3 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/konjoai
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@8fe9f4da4643575d2395d1d22570b69a4f79ecc3 -
Trigger Event:
push
-
Statement type: