Local LLM inference server for Apple Silicon. Block-level paged KV cache for long-context workloads. 5.1× faster end-to-end on 4K-token prompts vs Ollama, less RAM, INT3 support for Qwen3. OpenAI-compatible API.

Maintainers

wscholl

Project description

Squish

Squeeze the Most Out of Your Models

Local LLM inference for Apple Silicon. Faster end-to-end response on long contexts, less RAM, INT3 support.

The Numbers

Measured 2026-06-04 on Apple M3 MacBook Pro, 16 GB unified memory. Model: Qwen3-8B. Quant: INT4.

Metric	Ollama 0.18.2	Squish (recommended)
E2E response @ 4000-token prompt	51.7 s	10.1 s (5.1× faster)
E2E response @ 75-token prompt	8.09 s	5.50 s (1.5× faster)
Peak RAM during inference	5.32 GB	2.75 GB
Disk size — INT4	4.36 GB	4.00 GB
Disk size — INT3 (Qwen3)	not supported	3.56 GB
TTFT @ 75-token prompt	131 ms	279 ms (honest loss)

Squish wins end-to-end response time at every prompt size measured, with the largest win on long contexts (5.1× at 4000 tokens: 51.7 s vs 10.1 s). It uses ~48% less peak RAM (2.75 GB vs 5.32 GB) and supports INT3 for compatible model families.

Ollama wins time-to-first-token at every prompt size, and inter-token jitter on long contexts. If first-byte latency matters more than full-response latency, Ollama is the right tool.
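
The ratios can be recomputed from the table itself. A minimal Python check, using only the figures listed above (the function and variable names are illustrative):

```python
# Figures copied from the benchmark table (seconds and GB).
ollama = {"e2e_4000": 51.7, "e2e_75": 8.09, "peak_ram_gb": 5.32}
squish = {"e2e_4000": 10.1, "e2e_75": 5.50, "peak_ram_gb": 2.75}


def speedup(baseline: float, candidate: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


def reduction_pct(baseline: float, candidate: float) -> float:
    """Percentage by which the candidate undercuts the baseline."""
    return 100.0 * (baseline - candidate) / baseline


long_prompt = speedup(ollama["e2e_4000"], squish["e2e_4000"])
short_prompt = speedup(ollama["e2e_75"], squish["e2e_75"])
ram_saved = reduction_pct(ollama["peak_ram_gb"], squish["peak_ram_gb"])

print(f"4000-token prompt: {long_prompt:.1f}x faster")
print(f"75-token prompt:   {short_prompt:.1f}x faster")
print(f"Peak RAM:          {ram_saved:.0f}% lower")
```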

Full methodology and ablation: docs/benchmark_guide.md

Why Squish

Squish is for the workload most local-LLM tools aren't tuned for: the same model called many times an hour from the terminal with shifting context — git-commit-message generation, code-review prompts, agent loops, multi-turn chat, document Q&A.

On a 16 GB Mac, that workload collides with the rest of your work. Ollama keeps ~5 GB resident and pays a long prefill cost on each new long prompt. Squish is a persistent daemon: the model loads once when the daemon starts, and a two-cache architecture (block-paged KV cache for shifting prefixes, prompt KV cache for exact repeats) avoids re-prefilling work the daemon has already done.

Designed for one developer on one machine. Not a production multi-tenant API.

Install

Prerequisite (macOS/Homebrew): Xcode Command Line Tools are required. Install them with xcode-select --install. If Homebrew reports "Command Line Tools are too outdated", update from System Settings -> General -> Software Update, or reinstall CLT.

# Homebrew (recommended on macOS)
brew install konjoai/squish/squish

# PyPI
pip install squish-ai

# From source
git clone https://github.com/konjoai/squish
cd squish
pip install -e .

Note: The PyPI package is squish-ai. After installing, the Python module and CLI are both named squish:
pip install squish-ai
squish --version
python -c "import squish; print(squish.__version__)"
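
Because the distribution name (squish-ai) and the import name (squish) differ, it can help to check the install by distribution name. A small sketch using only the standard library; `installed_version` is a hypothetical helper, not part of Squish:

```python
from importlib import metadata
from typing import Optional

DIST_NAME = "squish-ai"   # name used with pip / on PyPI
MODULE_NAME = "squish"    # name used with `import`


def installed_version(dist_name: str = DIST_NAME) -> Optional[str]:
    """Return the installed version of the distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print(f"{DIST_NAME} is not installed; try: pip install {DIST_NAME}")
    else:
        print(f"{DIST_NAME} {version} is installed; import it as `{MODULE_NAME}`")
```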

Optional Performance Enhancements

The squish_quant Rust extension is bundled and installs automatically. Verify it is active with squish doctor — you should see:

✓  squish_quant Rust extension (6 GB/s quantizer)

Models

squish catalog                 # browse all 40+ available models
squish search qwen3            # filter by name or tag
squish pull qwen3:8b           # download pre-squished from huggingface.co/squishai
squish pull qwen3:0.6b --int3  # INT3 variant (Qwen3, Qwen2.5, Llama families)

Quick Start

# Pull a pre-quantised model from the catalog
squish pull qwen3:8b

# Start the daemon
squish run qwen3:8b

Call it like any OpenAI-compatible server:

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
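
The same request from Python, as a sketch using only the standard library. The URL, port, and model name come from the curl example; the helper names are hypothetical, and the response parsing assumes the server returns the standard OpenAI chat-completions shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435/v1"  # port taken from the curl example


def build_chat_request(prompt: str, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "qwen3:8b", timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text.

    Assumes the OpenAI response shape:
    {"choices": [{"message": {"content": ...}}]}.
    """
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the daemon running (squish run qwen3:8b), `print(chat("Hello"))` should print the model's reply.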

Or point any OpenAI / Ollama client at it:

export OPENAI_BASE_URL=http://localhost:11435/v1
export OPENAI_API_KEY=squish
# Ollama-compatible /api/* endpoints also work
export OLLAMA_HOST=http://localhost:11435

Install the macOS LaunchAgent so the daemon starts at login:

squish daemon install

The SquishBar menu-bar app (apps/macos/SquishBar/) ships alongside the CLI and gives you a native menu bar icon with server status, tok/s display, and one-click model switching. Build locally with make (requires Xcode 15+ and macOS 13+):

cd apps/macos/SquishBar
make
open SquishBar.app

Configuration

See Server Flags for the full flag reference.

Benchmarks

Full table, methodology, ablation, and raw per-run JSON:

docs/benchmark_guide.md — bench methodology and how to reproduce
benchmarks/ollama_vs_squish/RESULTS.md — raw results

Reproduce locally:

bash scripts/test_cli.sh
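
The results separate time-to-first-token (TTFT) from end-to-end (E2E) response time. The sketch below is not the project's benchmark harness. It uses a simulated token stream to show how both metrics, plus the largest inter-token gap, fall out of a single run. `measure_stream` and `fake_stream` are hypothetical names.

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Consume a token stream; report TTFT, E2E and the largest gap (seconds)."""
    start = time.perf_counter()
    arrivals = []
    for _ in tokens:
        arrivals.append(time.perf_counter())
    if not arrivals:
        raise ValueError("stream produced no tokens")
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "ttft": arrivals[0] - start,    # wait until the first token
        "e2e": arrivals[-1] - start,    # wait until the last token
        "max_gap": max(gaps, default=0.0),
        "tokens": len(arrivals),
    }


def fake_stream(prefill_s: float, per_token_s: float, n: int) -> Iterator[str]:
    """Stand-in for a real server: sleep for 'prefill', then emit n tokens."""
    time.sleep(prefill_s)
    for i in range(n):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    stats = measure_stream(fake_stream(prefill_s=0.05, per_token_s=0.01, n=5))
    print({k: round(v, 3) if isinstance(v, float) else v for k, v in stats.items()})
```

A server can lose on TTFT and still win on E2E if its per-token time is lower, which is the pattern the results table reports for short prompts.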

What Squish Doesn't Do

In the spirit of honesty:

No GPU support outside Apple Silicon. It's MLX-based. CUDA users should use vLLM or llama.cpp.
No multi-user serving. Designed for one developer, one machine — not a production API.
No multimodal models. Text only.
Higher inter-token p95 on long prompts than Ollama. Conscious tradeoff (deferred KV-cache restore off the TTFT critical path); details in JITTER_ANALYSIS.md.
Slower first-token on short prompts than Ollama. Fundamental MLX prefill kernel cost.
Model conversion is slow and not user-friendly. Squish needs models in its own format. Conversion takes time and isn't fully automated.

If any of those matter for your workflow, Ollama or LM Studio is the right choice.

Architecture

Persistent daemon. The model loads once when the daemon starts and stays resident. Per-invocation model-load cost becomes a once-per-login cost.

Two-cache architecture. A block-paged KV cache stores KV state for fixed-size token blocks on disk (.safetensors) and reconstructs partial-match prefixes for shifting-prefix workloads. A prompt KV cache catches exact-prefix repeats with single-digit-millisecond TTFT.

INT3 quantization with a hard-block list. INT3 behaviour is not uniform across model families. Qwen3 holds within ~1pp of FP16; Gemma-3 collapses (~15pp on common benchmarks). Squish enables INT3 only for families where it's safe and hard-blocks the rest. Try to load Gemma-3 at INT3 and the accuracy gate refuses — you can't accidentally ship a config that quietly degrades.
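
The gate can be pictured as a check that runs before any weights are loaded. The sketch below is illustrative only: the names, the exception type, and the set-based structure are assumptions, and only the Qwen3 and Gemma-3 entries come from the paragraph above.

```python
# Illustrative allow/deny data. Only these two entries are taken from the
# text; other families are deliberately left out rather than guessed.
INT3_ALLOWED = {"qwen3"}
INT3_BLOCKED = {"gemma-3"}


class QuantizationBlocked(ValueError):
    """Raised when a family/bit-width pair is refused."""


def check_quant(family: str, bits: int) -> None:
    """Refuse INT3 unless the model family is explicitly allowed."""
    family = family.lower()
    if bits != 3:
        return  # this gate only guards INT3
    if family in INT3_BLOCKED:
        raise QuantizationBlocked(f"{family}: INT3 is hard-blocked (accuracy collapse)")
    if family not in INT3_ALLOWED:
        raise QuantizationBlocked(f"{family}: INT3 not validated for this family")


for family, bits in [("qwen3", 3), ("gemma-3", 4), ("gemma-3", 3)]:
    try:
        check_quant(family, bits)
        print(f"{family} @ INT{bits}: allowed")
    except QuantizationBlocked as err:
        print(f"refused -> {err}")
```

Failing closed for unknown families means a newly added family stays blocked until someone validates it.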

Contributing

See CONTRIBUTING.md. Issues, benchmarks, and PRs welcome.

License

BUSL-1.1 — see LICENSE.

Release history

Version	Release date
9.33.8	Jun 9, 2026
9.33.7	Jun 8, 2026
9.33.6	Jun 6, 2026
9.33.5	Jun 5, 2026 (this version)
9.33.4	Jun 4, 2026
9.33.2	Jun 3, 2026
9.33.1	Jun 3, 2026
9.33.0	Jun 3, 2026
9.32.0	Jun 2, 2026

Download files

Source Distribution

squish_ai-9.33.5.tar.gz (2.0 MB)

Uploaded Jun 5, 2026 Source

Built Distribution

squish_ai-9.33.5-py3-none-any.whl (1.8 MB)

Uploaded Jun 5, 2026 Python 3

File details

Details for the file squish_ai-9.33.5.tar.gz.

File metadata

Download URL: squish_ai-9.33.5.tar.gz
Upload date: Jun 5, 2026
Size: 2.0 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for squish_ai-9.33.5.tar.gz
Algorithm	Hash digest
SHA256	`7e399c3155bd16b03cc62d4774b60688ef31d18f473e0cd3338cb9c0a3a5b9ec`
MD5	`3d360ac467fd51a471d468c12dfca281`
BLAKE2b-256	`808271a12b87ddc4ca00d5a650427c11a0c2e12beb61518cfb55a3ec79c94843`
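
A downloaded file can be checked against the SHA256 digest above with the standard library. A minimal sketch: the digest is copied from the table, while the local file path and the helper names are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """True if the file's SHA-256 matches the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Digest from the table above; the local path is an assumption.
EXPECTED = "7e399c3155bd16b03cc62d4774b60688ef31d18f473e0cd3338cb9c0a3a5b9ec"
sdist = Path("squish_ai-9.33.5.tar.gz")
if sdist.exists():
    print("OK" if verify(sdist, EXPECTED) else "MISMATCH")
```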

Provenance

The following attestation bundles were made for squish_ai-9.33.5.tar.gz:

Publisher: release.yml on konjoai/squish

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: squish_ai-9.33.5.tar.gz
- Subject digest: 7e399c3155bd16b03cc62d4774b60688ef31d18f473e0cd3338cb9c0a3a5b9ec
- Sigstore transparency entry: 1734753671
- Sigstore integration time: Jun 5, 2026
Source repository:
- Permalink: konjoai/squish@0df057696af7cc6095a118419d920be3d53c31d0
- Branch / Tag: refs/heads/main
- Owner: https://github.com/konjoai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0df057696af7cc6095a118419d920be3d53c31d0
- Trigger Event: workflow_dispatch

File details

Details for the file squish_ai-9.33.5-py3-none-any.whl.

File metadata

Download URL: squish_ai-9.33.5-py3-none-any.whl
Upload date: Jun 5, 2026
Size: 1.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for squish_ai-9.33.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bef4ad0c87387cb27038fa8564a84ba577844fdbe8785d83b029418d4bedb7ea`
MD5	`71eb73ea7666c98b6b66a52861de0102`
BLAKE2b-256	`178677e520c5906ed2861679a71b95a09a10700d414b1f56e23253092bdaee33`

Provenance

The following attestation bundles were made for squish_ai-9.33.5-py3-none-any.whl:

Publisher: release.yml on konjoai/squish

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: squish_ai-9.33.5-py3-none-any.whl
- Subject digest: bef4ad0c87387cb27038fa8564a84ba577844fdbe8785d83b029418d4bedb7ea
- Sigstore transparency entry: 1734753722
- Sigstore integration time: Jun 5, 2026
Source repository:
- Permalink: konjoai/squish@0df057696af7cc6095a118419d920be3d53c31d0
- Branch / Tag: refs/heads/main
- Owner: https://github.com/konjoai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0df057696af7cc6095a118419d920be3d53c31d0
- Trigger Event: workflow_dispatch
