Local LLM inference server for Apple Silicon. Block-level paged KV cache for long-context workloads. 9.8× faster end-to-end on 4K-token prompts vs Ollama, less RAM, INT3 support for Qwen3. OpenAI-compatible API.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

wscholl

These details have not been verified by PyPI

Project links

Documentation

Project description

Squish

The fastest way to run local LLMs on Apple Silicon.

Sub-second model loads. Beats Ollama on throughput, tail latency, and full-response time. One OpenAI/Ollama-compatible daemon — no cloud, no API keys, fully offline.

  ███████╗██╗  ██╗██╗  ██╗       █████╗     █████╗ ██╗  ██╗        ██████╗ ██████╗ ██╗ ██╗
  ██╔════╝██║  ██║╚██╗██╔╝      ██╔══██╗   ██╔══██╗╚██╗██╔╝        ╚════██╗╚════██╗╚═╝██╔╝
  ███████╗███████║ ╚███╔╝        ╚██████║   ╚█████╔╝ ╚███╔╝          █████╔╝ █████╔╝  ██╔╝
  ╚════██║╚════██║ ██╔██╗         ╚═══██║   ██╔══██╗ ██╔██╗          ╚═══██╗██╔═══╝  ██╔╝
  ███████║     ██║██╔╝ ██╗       █████╔╝██╗╚█████╔╝██╔╝ ██╗        ██████╔╝███████╗██╔╝██╗
  ╚══════╝     ╚═╝╚═╝  ╚═╝       ╚════╝ ╚═╝ ╚════╝ ╚═╝  ╚═╝        ╚═════╝ ╚══════╝╚═╝ ╚═╝
     faster cold start              faster long-prompts                    less RAM

 ██████╗    ███████╗███████╗          ██████╗ ██╗  ██╗          ██╗███╗   ██╗████████╗██████╗
██╔═████╗   ██╔════╝██╔════╝          ╚════██╗██║  ██║          ██║████╗  ██║╚══██╔══╝╚════██╗
██║██╔██║   ███████╗███████╗           █████╔╝███████║          ██║██╔██╗ ██║   ██║    █████╔╝
████╔╝██║   ╚════██║╚════██║          ██╔═══╝ ╚════██║          ██║██║╚██╗██║   ██║    ╚═══██╗
╚██████╔╝██╗███████║███████║          ███████╗     ██║          ██║██║ ╚████║   ██║   ██████╔╝
 ╚═════╝ ╚═╝╚══════╝╚══════╝          ╚══════╝     ╚═╝          ╚═╝╚═╝  ╚═══╝   ╚═╝   ╚═════╝
   cold load · 0.33–0.53s           tok/s · beats Ollama                quant default

 ██╗ ██╗███╗   ███╗███████╗     ██████╗    ███████╗ ██████╗          ██╗ ██████╗  ██████╗
███║███║████╗ ████║██╔════╝     ╚════██╗   ██╔════╝██╔════╝         ███║██╔═████╗██╔═████╗
╚██║╚██║██╔████╔██║███████╗      █████╔╝   ███████╗███████╗         ╚██║██║██╔██║██║██╔██║
 ██║ ██║██║╚██╔╝██║╚════██║      ╚═══██╗   ╚════██║██╔═══██╗         ██║████╔╝██║████╔╝██║
 ██║ ██║██║ ╚═╝ ██║███████║     ██████╔╝██╗███████║╚██████╔╝         ██║╚██████╔╝╚██████╔╝
 ╚═╝ ╚═╝╚═╝     ╚═╝╚══════╝     ╚═════╝ ╚═╝╚══════╝ ╚═════╝          ╚═╝ ╚═════╝  ╚═════╝
    repeat TTFT · KV hit            GB · smaller on disk              inference modules

Squish separates how a model's weights are stored from how they run. Store them compressed and Metal-native; map them straight into unified memory; skip the dtype-conversion pass that makes every other loader slow. The result: a model that's ready in half a second, served by a persistent daemon that out-decodes Ollama and never re-does work it's already done.

The Numbers

Measured on an Apple M3 MacBook Pro, 16 GB, under thermal control: each engine was measured from the same ~50 °C baseline, validated by a first-vs-last drift check (≤ 1.7 %) and live die-temperature logging, so the numbers reflect the engines, not the order they ran. Serving: Qwen2.5-7B-Instruct, Squish INT4/INT3 vs Ollama qwen2.5:7b (Q4_K_M), against both Ollama 0.18.2 and 0.30.7 (0.30.7 shown; 0.18.2 within noise). The cold-start row uses a 1.5B model, as marked in the table.

Metric	Ollama	Squish
Cold start: load + first token (1.5B)	20–30 s	≈ 0.5 s (54× faster load)
Full response @ 4000-token prompt	37.5 s	3.8 s (9.8× faster)
Decode throughput @ 75 tokens	20.3 tok/s	24.0 tok/s (INT3)
Inter-token tail (p95) @ 75 tokens	52.4 ms	42.7 ms (INT3)
Repeat-prompt TTFT (KV cache hit)	~160 ms	4–11 ms
Peak RAM during inference	5.14 GB	3.50 GB
Disk: 7B INT4 / INT3	4.36 GB / n/a	4.00 GB / 3.56 GB
Cold short-prompt TTFT	167 ms	192 ms (honest loss)

Squish wins decode throughput, inter-token tail latency, full-response time, and RAM. The gap is biggest on long contexts, where its KV cache reuses the prefill instead of re-running it. INT3 adds ~18 % decode over INT4 at no measured accuracy cost (arc_easy acc_norm 0.551 vs 0.541, tied). The one place Ollama wins is first-token latency on a cold, short prompt (167 ms vs 192 ms), and we say so plainly.
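The headline ratios can be re-derived from the table. The sketch below is only arithmetic on the rounded figures printed above; the helper and key names are illustrative. Small differences from the quoted multipliers (for example 37.5 / 3.8 ≈ 9.9 against the stated 9.8×) presumably come from rounding in the table.

```python
# Figures copied from the table above (Ollama vs Squish, 7B runs).
ollama = {"full_response_s": 37.5, "decode_tok_s": 20.3, "p95_ms": 52.4, "peak_ram_gb": 5.14}
squish = {"full_response_s": 3.8, "decode_tok_s": 24.0, "p95_ms": 42.7, "peak_ram_gb": 3.50}


def speedup(slow: float, fast: float) -> float:
    """Ratio of two time-like measurements (larger means bigger win)."""
    return slow / fast


def reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


summary = {
    "full_response_speedup_x": speedup(ollama["full_response_s"], squish["full_response_s"]),
    "decode_gain_pct": 100.0 * (squish["decode_tok_s"] / ollama["decode_tok_s"] - 1.0),
    "p95_reduction_pct": reduction_pct(ollama["p95_ms"], squish["p95_ms"]),
    "ram_reduction_pct": reduction_pct(ollama["peak_ram_gb"], squish["peak_ram_gb"]),
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```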

→ Methodology, thermal control, and the full ablation: docs/paper.md §4.4 · BENCHMARKS.md

Why Squish

Squish is built for the workload most local-LLM tools aren't tuned for: the same model called many times an hour, with shifting context — commit messages, code review, agent loops, multi-turn chat, document Q&A.

On a 16 GB Mac that workload fights the rest of your work. Ollama keeps ~5 GB resident and re-pays a long prefill on every new long prompt. Squish is a persistent daemon: the model loads once at login, and a two-cache architecture reuses prefill across requests — so an agent resending a 4,000-token system prompt every turn pays it once, not every turn.

Designed for one developer, one machine. Not a multi-tenant production API — and the docs never pretend otherwise.

Highlights

Sub-second cold start — a three-tier weight cache maps Metal-native bf16 straight into unified memory, eliminating the dtype-conversion + CPU-heap pass that dominates mlx_lm/safetensors cold load. 54× faster than a cold mlx_lm load, on 160 MB of load-phase RAM instead of 2.4 GB.
Faster decode than Ollama — a decoupled decode loop (one inference-thread handoff per request, not per token), GC suspended during generation, and P-core QoS pinning recover throughput the Python serving layer was wasting.
Two-cache prefill reuse — a block-paged KV cache for shifting prefixes plus a prompt KV cache for exact repeats: single-digit-millisecond TTFT on a cache hit.
Greedy-lossless speculation — --prompt-lookup verifies a whole n-gram draft in one batched forward, token-for-token identical to greedy, ~1.6× faster on repetitive output.
INT4 / INT3 / INT8 quantization — INT3 is the recommended default; family-aware accuracy gates hard-block quant configs that would silently degrade.
Drop-in compatible — OpenAI (/v1/*) and Ollama (/api/*) endpoints on one server. Point your existing client at it and go.
100+ composable optimization modules — KV compression, speculative decoding, quantization, attention acceleration, agent tool execution — each an independent flag on a single server.
Native macOS surface — the SquishBar menu-bar app (status, tok/s, one-click model switch) and a cinematic dashboard ship alongside the CLI.
Pre-squished models — squish pull grabs ready-to-run weights from huggingface.co/squishai.
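To make the greedy-lossless speculation highlight concrete, here is a minimal sketch of prompt-lookup decoding. It is not Squish's implementation: the `Model` type is a toy stand-in for a greedy next-token function, every name is hypothetical, and the verification loop calls the toy model once per position where a real engine would score the whole draft in one batched forward. What it does show is why the output stays identical to plain greedy decoding: a drafted token is kept only if the model would have produced it anyway.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[List[int]], int]  # toy stand-in: greedy next token for a context


def find_draft(tokens: Sequence[int], ngram: int, max_draft: int) -> List[int]:
    """Copy what followed the most recent earlier occurrence of the last n-gram."""
    if len(tokens) <= ngram:
        return []
    tail = list(tokens[-ngram:])
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            follow = start + ngram
            return list(tokens[follow:follow + max_draft])
    return []


def verify(model: Model, tokens: List[int], draft: List[int]) -> List[int]:
    """Accept the draft prefix that matches greedy, then add one model token."""
    accepted: List[int] = []
    for guess in draft:
        target = model(tokens + accepted)
        if target != guess:
            return accepted + [target]  # first mismatch: keep the model's token
        accepted.append(guess)
    return accepted + [model(tokens + accepted)]  # whole draft accepted


def generate(model: Model, prompt: List[int], n_new: int,
             ngram: int = 2, max_draft: int = 4, lookup: bool = True) -> Tuple[List[int], int]:
    """Return (new tokens, number of verification steps)."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < n_new:
        draft = find_draft(tokens, ngram, max_draft) if lookup else []
        tokens.extend(verify(model, tokens, draft))
        steps += 1
    return tokens[len(prompt):len(prompt) + n_new], steps


def repeating_model(context: List[int]) -> int:
    """Toy model with highly repetitive output: echo the token three back."""
    return context[-3]


plain, plain_steps = generate(repeating_model, [1, 2, 3], 12, lookup=False)
fast, fast_steps = generate(repeating_model, [1, 2, 3], 12, lookup=True)
print(plain == fast, plain_steps, fast_steps)
```

The gain depends entirely on how often the draft is right, which is why the benefit shows up on repetitive output and vanishes on text with no repeated n-grams.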

Install

Requires Python 3.11–3.14 and macOS 13 (Ventura) or later on Apple Silicon.

# Homebrew (recommended — no compilation, all deps bundled)
brew tap konjoai/squish
brew install squish
squish doctor

# or pipx
pipx install squish-ai --python python3.13
squish doctor

The bundled squish_quant Rust extension installs automatically; squish doctor confirms it with the line: ✓ squish_quant Rust extension (6 GB/s quantizer).

The PyPI package is squish-ai; the CLI and Python module are both squish.

Quick Start

squish pull qwen2.5:7b        # download a pre-squished model
squish run qwen2.5:7b         # start the daemon (loads once, stays resident)

Use it from any OpenAI or Ollama client:

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5:7b","messages":[{"role":"user","content":"Hello"}]}'

export OPENAI_BASE_URL=http://localhost:11435/v1   # OpenAI SDKs
export OPENAI_API_KEY=squish
export OLLAMA_HOST=http://localhost:11435           # Ollama clients
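The same request can be sent from Python with only the standard library. This is a sketch under two assumptions: the daemon is listening on the port used in the curl example, and the reply follows the usual OpenAI chat-completions shape (`choices[0].message.content`). The function names are illustrative and not part of Squish.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435/v1"  # port taken from the curl example above


def build_chat_request(prompt: str, model: str = "qwen2.5:7b",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the reply text (assumes an OpenAI-style response)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# With the daemon running:
#   print(chat("Hello"))
```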

Browse models and start the daemon at login:

squish catalog                 # 40 models, 9 pre-squished on the Hub
squish search qwen3
squish pull qwen3:0.6b --int3  # INT3 variant (Qwen3, Qwen2.5, Llama families)
squish daemon install          # macOS LaunchAgent — daemon starts at login

How it's fast

Storage ≠ runtime. Every standard loader pays the same boot tax: allocate a CPU buffer, read the safetensors, convert dtypes, copy to the accelerator — 2–30 s and ~2.4 GB of RAM, almost all of it wasted on bytes that never changed. Squish converts weights once into the exact bf16 Metal layout MLX uses, then mmaps them directly into the GPU address space. Zero conversion at load time.
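The convert-once, map-at-load idea can be illustrated with the standard library alone. This is a CPU-side analogy, not Squish's loader: it uses float32 rather than bf16, a plain file rather than a Metal-native layout, and hypothetical function names. The point is that the load path contains no parsing and no dtype conversion, only a mapping and a typed view.

```python
import mmap
import os
import tempfile
from array import array
from typing import List, Sequence


def convert_once(values: Sequence[float], path: str) -> None:
    """One-time step: write the weights in the exact byte layout the reader expects."""
    with open(path, "wb") as f:
        array("f", values).tofile(f)


def read_mapped(path: str) -> List[float]:
    """Load step: map the file and view it as float32, with no conversion pass."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with memoryview(mm) as raw, raw.cast("f") as weights:
            # `weights` is a zero-copy typed view over the mapped pages.
            # tolist() copies only so the demo can return after unmapping.
            return weights.tolist()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")
    convert_once([0.5, -1.25, 3.0], path)
    loaded = read_mapped(path)

print(loaded)
```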

The daemon never re-does work. A block-paged KV cache persists fixed-size token blocks to disk and reconstructs partial-prefix matches for shifting context; a prompt KV cache catches exact repeats. An agent loop that resends the same long prompt every turn hits the cache instead of re-prefilling.
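A rough sketch of how block-level prefix matching can work, assuming chained hashes over fixed-size token blocks. This is an illustration only: the block size, the class, and the string stand-ins for KV tensors are all hypothetical, and the dictionary here lives in memory where the text above describes blocks persisted to disk.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK = 4  # tokens per block; hypothetical, kept small for the demo


def block_keys(tokens: Sequence[int]) -> List[bytes]:
    """Chain-hash each full block so a key identifies the block and all tokens before it."""
    keys: List[bytes] = []
    prev = b""
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        block = list(tokens[i:i + BLOCK])
        prev = hashlib.sha256(prev + repr(block).encode("utf-8")).digest()
        keys.append(prev)
    return keys


class PrefixCache:
    def __init__(self) -> None:
        self.blocks: Dict[bytes, str] = {}  # key -> stand-in for stored KV tensors

    def store(self, tokens: Sequence[int]) -> None:
        for n, key in enumerate(block_keys(tokens)):
            self.blocks.setdefault(key, f"kv-block-{n}")

    def match(self, tokens: Sequence[int]) -> int:
        """Number of leading tokens whose KV state could be reused."""
        hit = 0
        for key in block_keys(tokens):
            if key not in self.blocks:
                break
            hit += BLOCK
        return hit


system_prompt = list(range(10))           # shared 10-token prefix
cache = PrefixCache()
cache.store(system_prompt + [100, 101])   # first request: 12 tokens, 3 full blocks

second = system_prompt + [200, 201, 202]  # same prefix, different tail
reused = cache.match(second)
print(reused, len(second) - reused)       # tokens reused, tokens still to prefill
```

Because each key folds in the previous one, a block only matches when everything before it matched too, which is what makes a partial-prefix hit safe to reuse.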

Decode is bandwidth-bound, so we attack the right thing. On Apple Silicon each token streams the whole weight set from unified memory — a hard ceiling. The levers that move it are fewer weight bytes (INT3) and fewer forwards per token (greedy-lossless prompt-lookup). We measured the levers that don't help here (KV-cache quantization, small-draft speculation) and say so in the paper rather than shipping them as wins.
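The bandwidth ceiling can be written as a one-line estimate: if every token reads each weight byte once, tokens per second cannot exceed memory bandwidth divided by weight size. The sketch below uses the 7B on-disk sizes from the table as a stand-in for bytes streamed per token, and a bandwidth figure that is purely an illustrative placeholder, not a number from this document. It ignores KV-cache reads and compute, so treat it as an upper bound under those assumptions.

```python
def decode_ceiling_tok_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s when each token streams the full weight set once."""
    if weight_gb <= 0 or bandwidth_gb_s <= 0:
        raise ValueError("sizes and bandwidth must be positive")
    return bandwidth_gb_s / weight_gb


ASSUMED_BANDWIDTH_GB_S = 100.0  # placeholder assumption; substitute your chip's figure

ceilings = {
    label: decode_ceiling_tok_s(size_gb, ASSUMED_BANDWIDTH_GB_S)
    for label, size_gb in (("INT4", 4.00), ("INT3", 3.56))  # disk sizes from the table
}

for label, tok_s in ceilings.items():
    print(f"{label}: at most {tok_s:.1f} tok/s at {ASSUMED_BANDWIDTH_GB_S:.0f} GB/s")
```

Whatever bandwidth is plugged in, the relation is the same: fewer weight bytes raise the ceiling, which is why quantization is one of the two levers named above.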

Accuracy gates are load-bearing. INT3 holds within ~1 pp of FP16 on Qwen3/Qwen2.5; Gemma-3 collapses (~15 pp). Squish enables INT3 only where it's safe and refuses the rest — you can't accidentally ship a config that quietly degrades.
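A family-aware gate of this kind could look like the sketch below. It is a hypothetical illustration, not Squish's gate: the accuracy drops are the approximate figures quoted above, while the threshold, the exception type, and the function name are invented for the example. Families with no data are refused rather than guessed at.

```python
from typing import Dict

# Approximate INT3 accuracy drop vs FP16, in percentage points, as quoted in the text.
INT3_DROP_PP: Dict[str, float] = {"qwen3": 1.0, "qwen2.5": 1.0, "gemma-3": 15.0}
MAX_DROP_PP = 2.0  # hypothetical threshold chosen for this example


class QuantGateError(ValueError):
    """Raised when a quantization config is refused."""


def check_int3(family: str) -> None:
    """Raise instead of silently running a config that would degrade accuracy."""
    drop = INT3_DROP_PP.get(family)
    if drop is None:
        raise QuantGateError(f"no INT3 accuracy data for {family!r}; refusing")
    if drop > MAX_DROP_PP:
        raise QuantGateError(
            f"INT3 on {family!r} loses about {drop:g} pp (limit {MAX_DROP_PP:g} pp); refusing"
        )


for family in ("qwen3", "gemma-3"):
    try:
        check_int3(family)
        print(f"{family}: INT3 allowed")
    except QuantGateError as err:
        print(f"{family}: blocked ({err})")
```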

Deep dive: docs/ARCHITECTURE.md · docs/paper.md.

What Squish Doesn't Do

Honesty is a feature. If any of these matter, another tool (Ollama, LM Studio, vLLM, or llama.cpp) is the right call:

No GPU outside Apple Silicon. It's MLX-based; CUDA users want vLLM or llama.cpp.
No multi-user serving. One developer, one machine — not a production API.
No multimodal. Text only.
Slower first token on a cold, short prompt than Ollama (192 ms vs 167 ms) — fundamental MLX prefill kernel cost. Squish's edge is everywhere else.
Model conversion is slow. Squish needs models in its own format; first-time conversion takes minutes (squish pull skips it with pre-squished weights).

Built the Konjo way

KONJO — Know, Outline, Nail, Justify, Optimize. ቆንጆ (beautiful) · 根性 (grit) · 건조 (strip to the essence).

Squish exists because nothing else was fast enough, so we built it — and held it to a higher floor than "it works." Every headline number is measured under thermal control. Every honest loss is printed next to the wins. Every line that isn't load-bearing is cut. Correctness is the floor; the ceiling is correct, fast, lean, and honest.

Project

Website — squish.run — full docs, guides, and the benchmark report.
Contributing — CONTRIBUTING.md. Issues, benchmarks, and PRs welcome.
License — BUSL-1.1, see LICENSE.
Models — huggingface.co/squishai
Docs — Architecture · Paper · Benchmarks · Modules
Org — konjoai · siblings: Squash (EU AI Act compliance), Vectro, Kohaku

Project details

Release history

Version	Released
9.34.1 (this version)	Jun 15, 2026
9.34.0	Jun 15, 2026
9.33.9	Jun 14, 2026
9.33.8	Jun 9, 2026
9.33.7	Jun 8, 2026
9.33.6	Jun 6, 2026
9.33.5	Jun 5, 2026
9.33.4	Jun 4, 2026
9.33.2	Jun 3, 2026
9.33.1	Jun 3, 2026
9.33.0	Jun 3, 2026
9.32.0	Jun 2, 2026

Download files

Source Distribution

squish_ai-9.34.1.tar.gz (2.0 MB)

Uploaded Jun 15, 2026 Source

Built Distribution

squish_ai-9.34.1-py3-none-any.whl (1.8 MB)

Uploaded Jun 15, 2026 Python 3

File details

Details for the file squish_ai-9.34.1.tar.gz.

File metadata

Download URL: squish_ai-9.34.1.tar.gz
Upload date: Jun 15, 2026
Size: 2.0 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for squish_ai-9.34.1.tar.gz
Algorithm	Hash digest
SHA256	`ee0b068906699e538325e1c02e4a92de355a3c453807800db3d44bab610bc2e5`
MD5	`5c13d54894adb2e1bb59e706fc7c6eb2`
BLAKE2b-256	`d8105e99283d4fa3ae9b8f4551d2fde4fd7fc3686fa429217865d6e0dcbe4941`
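A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a generic sketch rather than a PyPI or pip feature: the function names and the example file path are illustrative, and the expected digest is the one printed in the table for the source distribution.

```python
import hashlib
import hmac

# SHA256 of squish_ai-9.34.1.tar.gz, copied from the table above.
EXPECTED_SDIST_SHA256 = "ee0b068906699e538325e1c02e4a92de355a3c453807800db3d44bab610bc2e5"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected: str = EXPECTED_SDIST_SHA256) -> bool:
    """True if the file's SHA256 equals the expected hex digest."""
    return hmac.compare_digest(sha256_of(path), expected.lower())


# Example, assuming the sdist was downloaded to the current directory:
#   print(matches("squish_ai-9.34.1.tar.gz"))
```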

Provenance

The following attestation bundles were made for squish_ai-9.34.1.tar.gz:

Publisher: release.yml on konjoai/squish

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: squish_ai-9.34.1.tar.gz
- Subject digest: ee0b068906699e538325e1c02e4a92de355a3c453807800db3d44bab610bc2e5
- Sigstore transparency entry: 1826273474
- Sigstore integration time: Jun 15, 2026
Source repository:
- Permalink: konjoai/squish@8fe9f4da4643575d2395d1d22570b69a4f79ecc3
- Branch / Tag: refs/heads/main
- Owner: https://github.com/konjoai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8fe9f4da4643575d2395d1d22570b69a4f79ecc3
- Trigger Event: push

File details

Details for the file squish_ai-9.34.1-py3-none-any.whl.

File metadata

Download URL: squish_ai-9.34.1-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 1.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for squish_ai-9.34.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a4980419fc37da13d0016e3694695c731a4706ea75d7e0bc8994d887b0dc2e48`
MD5	`f31bcfbd6c56027cb2394c4ecf4a63c9`
BLAKE2b-256	`33188a44b5f12c0aaf23411ee318d57e2f87906b7cd9bb0915324bd2dba635bc`

Provenance

The following attestation bundles were made for squish_ai-9.34.1-py3-none-any.whl:

Publisher: release.yml on konjoai/squish

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: squish_ai-9.34.1-py3-none-any.whl
- Subject digest: a4980419fc37da13d0016e3694695c731a4706ea75d7e0bc8994d887b0dc2e48
- Sigstore transparency entry: 1826273686
- Sigstore integration time: Jun 15, 2026
Source repository:
- Permalink: konjoai/squish@8fe9f4da4643575d2395d1d22570b69a4f79ecc3
- Branch / Tag: refs/heads/main
- Owner: https://github.com/konjoai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8fe9f4da4643575d2395d1d22570b69a4f79ecc3
- Trigger Event: push

squish-ai 9.34.1
