Local LLM inference server for Apple Silicon. Block-level paged KV cache for long-context workloads. 5.4× faster end-to-end on 4K-token prompts vs Ollama, less RAM, INT3 support for Qwen3. OpenAI-compatible API.
Squish
Local LLM inference for Apple Silicon. Faster end-to-end response on long contexts, less RAM, INT3 support.
The Numbers (v9.32.0 / bench v5.1.1)
Measured 2026-06-02 on Apple M3 MacBook Pro, 16 GB unified memory.
Model: Qwen2.5-7B-Instruct. Quant: INT4 (squish) / Q4_K_M (Ollama).
Five-run medians. Raw artifacts in results/benchmarks_v5_1_1/.
| Metric | Ollama 0.18.2 | Squish (recommended) |
|---|---|---|
| E2E response @ 4000-token prompt | 69.63 s | 12.78 s (5.4× faster) |
| E2E response @ 75-token prompt | 8.09 s | 5.50 s (1.5× faster) |
| Peak RAM during inference | ~5 GB | 3.36 GB |
| Disk size — INT4 | 4.36 GB | 4.00 GB |
| Disk size — INT3 (Qwen3) | not supported | 3.56 GB |
| TTFT @ 75-token prompt | 131 ms | 279 ms (honest loss) |
Squish wins end-to-end response time at every prompt size measured, with the largest win on long contexts (5.4× at 4000 tokens), uses ~33% less RAM, and supports INT3 for compatible model families.
Ollama wins time-to-first-token at every prompt size, and inter-token jitter on long contexts. If first-byte latency matters more than full-response latency, Ollama is the right tool.
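The headline ratios can be re-derived from the table. The snippet below is a small arithmetic check, not part of the bench harness; the function names are made up for illustration, and the Ollama RAM figure is taken as exactly 5.0 GB even though the table only says "~5 GB".

```python
# Figures copied from the table above (Qwen2.5-7B-Instruct, M3, 16 GB).
OLLAMA_E2E_S = {4000: 69.63, 75: 8.09}   # prompt tokens -> seconds
SQUISH_E2E_S = {4000: 12.78, 75: 5.50}
OLLAMA_RAM_GB = 5.0    # the table reports "~5 GB", so this is approximate
SQUISH_RAM_GB = 3.36


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def ram_saving_pct(baseline_gb: float, candidate_gb: float) -> float:
    """Percentage of the baseline's RAM that the candidate does not use."""
    return 100.0 * (baseline_gb - candidate_gb) / baseline_gb


for tokens in sorted(OLLAMA_E2E_S):
    ratio = speedup(OLLAMA_E2E_S[tokens], SQUISH_E2E_S[tokens])
    print(f"{tokens:>4}-token prompt: {ratio:.1f}x faster")
print(f"RAM saving: ~{ram_saving_pct(OLLAMA_RAM_GB, SQUISH_RAM_GB):.0f}%")
```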
Full table, methodology, and ablation: docs/RESULTS.md (v5.1.1 section).
Why Squish
Squish is for the workload most local-LLM tools aren't tuned for: the same model called many times an hour from the terminal with shifting context — git-commit-message generation, code-review prompts, agent loops, multi-turn chat, document Q&A.
On a 16 GB Mac, that workload collides with the rest of your work. Ollama keeps ~5 GB resident and pays a long prefill cost on each new long prompt. Squish is a persistent daemon: the model loads once when the daemon starts, and a two-cache architecture (block-paged KV cache for shifting prefixes, prompt KV cache for exact repeats) avoids re-prefilling work the daemon has already done.
Designed for one developer on one machine. Not a production multi-tenant API.
Install
Prerequisite (macOS/Homebrew): Xcode Command Line Tools are required. Install them with xcode-select --install. If Homebrew reports "Command Line Tools are too outdated", update from System Settings -> General -> Software Update, or reinstall CLT.
# Homebrew (recommended on macOS)
brew tap konjoai/squish
brew trust konjoai/squish
brew install squish
# PyPI
pip install squish-ai
# From source
git clone https://github.com/konjoai/squish
cd squish
pip install -e .
Note: The PyPI package is squish-ai. After installing, the Python module and CLI are both named squish:
pip install squish-ai
squish run --version
python -c "import squish; print(squish.__version__)"
Optional Performance Enhancements
4x faster quantization - install the Rust extension:
cd squish_quant_rs && python3 -m maturin build --release && pip install .
Requirements: macOS 13+, Apple Silicon (M1–M5), Python 3.10+.
Corporate Networks / TLS Interception
If your company network uses TLS interception (for example Zscaler or an internal
proxy CA), squish pull may fail with SSL certificate errors unless Hugging Face
downloads trust your corporate CA bundle.
1. Export your corporate CA cert(s) to PEM
# Example: specific certificate
security find-certificate -c "ORG-Root" -p > ~/ORG-root.pem
# Example: collect all matching certs (works for many corporate setups)
security find-certificate -a -p -c "ORG" /Library/Keychains/System.keychain > ~/ORG-all.pem
2. Verify cert validity dates
openssl x509 -in ~/ORG-root.pem -noout -subject -issuer -dates
If notAfter is in the past, the cert is expired and must be rotated/reinstalled
by your IT/network team.
3. Pull with CA bundle + transport compatibility flags
This is the recommended command for corporate/proxy networks:
REQUESTS_CA_BUNDLE=~/ORG-all.pem \
HF_HUB_DISABLE_XET=1 \
HF_HUB_ENABLE_HF_TRANSFER=0 \
squish pull llama3.2:3b
Notes:
- REQUESTS_CA_BUNDLE tells Hugging Face/httpx which CA bundle to trust.
- HF_HUB_DISABLE_XET=1 and HF_HUB_ENABLE_HF_TRANSFER=0 force a plain path that is often more reliable behind TLS interception.
4. Last resort (insecure)
Only for temporary troubleshooting:
SQUISH_VERIFY_SSL=false \
HF_HUB_DISABLE_XET=1 \
HF_HUB_ENABLE_HF_TRANSFER=0 \
squish pull llama3.2:3b
Prefer fixing CA trust over running insecure SSL mode.
Quick Start
# Pull a pre-quantised model from the catalog
squish pull qwen2.5-7b-int4
# Start the daemon with both caches enabled (recommended config)
squish run qwen2.5-7b-int4 \
--block-kv-cache ~/.cache/squish/blocks \
--prompt-kv-cache ~/.cache/squish/pkv \
--port 8080
Query it through the OpenAI-compatible API:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-7b-int4",
"messages": [{"role": "user", "content": "Hello"}]
}'
Or point any OpenAI / Ollama client at it:
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=squish
# Ollama-compatible /api/* endpoints also work
export OLLAMA_HOST=http://localhost:8080
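The same request as the curl call can be sent from Python with only the standard library. This is a sketch under two assumptions: the daemon is listening on port 8080 as in the Quick Start, and the response follows the usual OpenAI chat-completions shape (choices[0].message.content). The helper names build_chat_request and chat are illustrative, not part of squish.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "qwen2.5-7b-int4",
                       base_url: str = "http://localhost:8080/v1") -> urllib.request.Request:
    """Build (but do not send) a POST to the chat-completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout_s: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout_s) as resp:
        payload = json.load(resp)
    # Assumed: standard OpenAI chat-completions response shape.
    return payload["choices"][0]["message"]["content"]
```

With the daemon running, print(chat("Hello")) should print the model's reply.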
Install the macOS LaunchAgent so the daemon starts at login:
squish daemon install
The SquishBar menu-bar app (apps/macos/SquishBar/) ships alongside the
daemon — model picker, load progress, and a global hotkey for the chat panel.
Build it from Xcode or grab the signed .app from the GitHub release page.
Configuration
| Flag | Purpose |
|---|---|
| --block-kv-cache <DIR> | Block-paged KV cache for shifting-prefix workloads (agents, multi-turn). Persists across daemon restarts via .safetensors blocks. |
| --prompt-kv-cache <DIR> | Exact-prompt KV cache. Single-digit-millisecond TTFT on verbatim repeats. |
| --block-kv-size N | Block size in tokens (default 64). |
| --draft-model <MODEL> | Speculative-decode draft model (opt-in; see v5.2 diagnosis for current status: net-negative on M3 INT4 with the draft models tested, kept off by default). |
| --draft-depth N | Speculative decode depth K. |
| --no-spec, --no-cache | Disable flags, intended for benchmark controls. |
| squish daemon install / uninstall | macOS LaunchAgent integration. |
Picking the right cache for your workload:
- Exact-prompt repeats (cached scripts, fixed templates, automated jobs): --prompt-kv-cache alone. ~9 ms TTFT on a cache hit.
- Shifting-prefix workloads (agents, multi-turn conversations): --block-kv-cache alone, or combined config.
- General use without knowing the workload: combined config (both caches enabled). Best end-to-end completion time across prompt sizes.
The combined config currently doesn't inherit PKV's fast-hit TTFT due to a
lookup ordering issue documented in
results/benchmarks_v5_1_1/DIAGNOSIS.md;
reordering is tracked as a v5.2 follow-up.
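The cache guidance above maps onto a squish run command line fairly mechanically. The helper below is a hypothetical wrapper, not something squish ships: the workload labels "exact", "shifting", and "general" are invented for this sketch, while the flags and cache directories are the ones used in the Quick Start.

```python
from pathlib import Path

CACHE_ROOT = Path.home() / ".cache" / "squish"   # same location as the Quick Start


def squish_run_argv(model: str, workload: str, port: int = 8080) -> list[str]:
    """Return a `squish run` command line for a workload category.

    `workload` is a hypothetical label: "exact", "shifting", or "general".
    """
    use_prompt = workload in ("exact", "general")
    use_block = workload in ("shifting", "general")
    if not (use_prompt or use_block):
        raise ValueError(f"unknown workload: {workload!r}")

    argv = ["squish", "run", model]
    if use_block:
        argv += ["--block-kv-cache", str(CACHE_ROOT / "blocks")]
    if use_prompt:
        argv += ["--prompt-kv-cache", str(CACHE_ROOT / "pkv")]
    return argv + ["--port", str(port)]
```

The returned list can be handed to subprocess.run, or joined with spaces to get a line you can paste into a terminal.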
Benchmarks
Full table, methodology, ablation, jitter analysis, and raw per-run JSON:
- docs/RESULTS.md: v5.1.1 section is the source of truth
- benchmarks/ollama_vs_squish/RESULTS.md: bench harness output
- results/benchmarks_v5_1_1/DIAGNOSIS.md: combined-cache ordering write-up
- results/benchmarks_v5_1_1/JITTER_ANALYSIS.md: inter-token p95 explanation
- results/benchmarks_v5_2/SPEC_DECODE_DIAGNOSIS.md: why speculative decoding is currently opt-in
Reproduce locally:
python benchmarks/ollama_vs_squish/bench_v5_1.py
What Squish Doesn't Do
In the spirit of honesty:
- No GPU support outside Apple Silicon. It's MLX-based. CUDA users should use vLLM or llama.cpp.
- No multi-user serving. Designed for one developer, one machine — not a production API.
- No multimodal models. Text only.
- Higher inter-token p95 on long prompts than Ollama. Conscious tradeoff (deferred KV-cache restore off the TTFT critical path); details in JITTER_ANALYSIS.md.
- Slower first-token on short prompts than Ollama. Fundamental MLX prefill kernel cost.
- Model conversion is slow and not user-friendly. Squish needs models in its own format. Conversion takes time and isn't fully automated.
If any of those matter for your workflow, Ollama or LM Studio is the right choice.
Architecture
Persistent daemon. The model loads once when the daemon starts and stays resident. Per-invocation model-load cost becomes a once-per-login cost.
Two-cache architecture. A block-paged KV cache stores KV state for
fixed-size token blocks on disk (.safetensors) and reconstructs partial-match
prefixes for shifting-prefix workloads. A prompt KV cache catches exact-prompt
repeats with single-digit-millisecond TTFT.
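To make the partial-match idea concrete, here is one common way to key fixed-size blocks so that a shared prefix can be found again: hash each block together with the key of the block before it. This is an illustrative sketch only; Squish's actual keying scheme and on-disk layout are not described here, and block_keys and cached_prefix_len are hypothetical names. The store is a plain set of keys standing in for the .safetensors blocks on disk.

```python
import hashlib

BLOCK_SIZE = 64  # tokens per block, the documented default


def block_keys(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block together with everything before it.

    Chaining the previous key into the next hash means a key identifies the
    whole prefix up to that block, not just the block's own tokens.
    """
    keys, prev = [], ""
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        chunk = ",".join(map(str, tokens[start:start + block_size]))
        prev = hashlib.sha256(f"{prev}|{chunk}".encode()).hexdigest()
        keys.append(prev)
    return keys


def cached_prefix_len(tokens: list[int], store: set[str],
                      block_size: int = BLOCK_SIZE) -> int:
    """Number of leading tokens whose KV state is already in `store`."""
    hits = 0
    for key in block_keys(tokens, block_size):
        if key not in store:
            break
        hits += 1
    return hits * block_size


# A 4000-token prompt fills 62 blocks; the last 32 tokens form a partial block.
first = list(range(4000))
store = set(block_keys(first))

# Same first 3000 tokens, different tail: 46 full blocks (2944 tokens) match.
second = first[:3000] + [-1] * 1000
print(len(store), cached_prefix_len(second, store))  # 62 2944
```

In this sketch only the 1056 tokens after the matched prefix would need a fresh prefill, which is where the saving on shifting-prefix workloads comes from.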
INT3 quantization with a hard-block list. INT3 behaviour is not uniform across model families. Qwen3 holds within ~1pp of FP16; Gemma-3 collapses (~15pp on common benchmarks). Squish enables INT3 only for families where it's safe and hard-blocks the rest. Try to load Gemma-3 at INT3 and the accuracy gate refuses — you can't accidentally ship a config that quietly degrades.
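A hard-block gate of this kind can be sketched as a lookup plus a threshold. Everything below is hypothetical: the 2pp threshold, the check_quant name, and the exception type are assumptions made for illustration, and the two drop figures are just the approximate values quoted above. Squish's real gate may work differently.

```python
# Approximate accuracy drop vs FP16 in percentage points, from the text above.
INT3_DROP_PP = {"qwen3": 1.0, "gemma-3": 15.0}
MAX_DROP_PP = 2.0  # hypothetical threshold, not Squish's actual setting


class QuantizationBlocked(ValueError):
    """Raised when a family/bit-width pair is not known to be safe."""


def check_quant(family: str, bits: int) -> None:
    """Refuse INT3 unless the family is measured and within the threshold."""
    if bits != 3:
        return  # only INT3 is gated in this sketch
    drop = INT3_DROP_PP.get(family.lower())
    if drop is None:
        raise QuantizationBlocked(f"{family}: no INT3 measurement, refusing")
    if drop > MAX_DROP_PP:
        raise QuantizationBlocked(f"{family}: INT3 drops ~{drop:g}pp, refusing")
```

Failing closed on unmeasured families is the design choice that matters here: a family is refused until someone has shown INT3 is safe for it.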
Contributing
See CONTRIBUTING.md. Issues, benchmarks, and PRs welcome.
The bench harness lives in benchmarks/ollama_vs_squish/; if you re-run on
different hardware, please share the raw JSON output.
License
BUSL-1.1 — see LICENSE.
Links
- Article: Local LLM Server That Wins End-to-End on Long Contexts — in progress
- Org: konjoai · konjoai.org
- Related: Kohaku, Vectro, Squash (EU AI Act compliance, extracted from squish in v9.15.0)
- HuggingFace models: huggingface.co/squish-community
- Module reference: MODULES.md