39% faster TTFT, 67% less KV cache, zero config — autotune optimises local LLMs on Ollama, LM Studio, and MLX

These details have not been verified by PyPI

Project links

Project description

autotune — Local LLM Inference Optimizer

Website & install guide → autotunellm.com

39% faster time-to-first-word. 3× less KV cache. Drop-in for Ollama, LM Studio, and MLX.

autotune is a middleware layer that makes your local LLMs noticeably faster and lighter — without changing your code or workflow. It computes the exact KV cache each request needs, pins your system prompt in memory, and manages context windows automatically.

pip install llm-autotune                        # macOS / Windows
pipx install llm-autotune                       # Linux (recommended — see install notes below)
brew install tanavc1/autotune/llm-autotune      # Homebrew (macOS)
docker pull tanavc1/llm-autotune                # Docker
autotune chat --model qwen3:8b                  # that's it

Works with Ollama, LM Studio, and MLX (Apple Silicon native) out of the box.

What autotune actually improves

Benchmarked on Apple M2 16 GB using Ollama's own nanosecond-precision internal timers — not Python wall-clock estimates. Results are means across 3 runs × 5 prompt types, with Wilcoxon signed-rank statistical testing and Cohen's d effect sizes.

Metric	llama3.2:3b	gemma4:e2b	qwen3:8b	Average
Time to first word (TTFT)	−35%	−29%	−53%	−39%
KV prefill time	−66%	−64%	−72%	−67%
KV cache RAM	−66%	−69%	−66%	−67%
Generation speed (tok/s)	±2%	±0.2%	±2.4%	unchanged

Timing source: prompt_eval_duration, load_duration, and total_duration from Ollama's Go runtime. Token counts (prompt_eval_count) are identical in both conditions — autotune right-sizes the buffer, not the content.

What the numbers mean

You wait 39% less for the first word. On qwen3:8b that's 53% faster. On a long-context prompt, up to 89% faster. You feel this on every message.

KV cache shrinks 3×. Raw Ollama allocates a fixed 4,096-token KV buffer regardless of prompt length. autotune computes the exact size each request needs — for a typical chat message that frees 300–400 MB before inference even starts.

Generation speed is unchanged. Token generation on Apple Silicon is Metal GPU-bound. The ±2% variance in the data is measurement noise. autotune is transparent about this.

122,778 KV buffer slots freed across all benchmark runs — slots Ollama would have allocated, zeroed, and initialized for nothing.

Verify it yourself

# Quick 45-second check on any model you have:
autotune proof --model qwen3:8b

# Full statistical benchmark with Wilcoxon p-values and Cohen's d:
autotune proof-suite --model qwen3:8b --runs 3

autotune proof runs two scenarios: a standard multi-turn session and a long-context code-review prompt where TTFT and KV allocation differences are most visible. Results are saved as JSON alongside your terminal output.

Quickstart

1. Install Ollama

macOS

brew install ollama

Or download the desktop app from https://ollama.com/download.

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows — download the installer from https://ollama.com/download.

Once installed, pull a model:

autotune pull qwen3:8b         # 5.2 GB — best general model for 16 GB machines

autotune starts Ollama in the background automatically — no separate ollama serve needed.

Not sure which model to use? Run autotune recommend after installing and it will pick the best model for your exact hardware.

2. Install autotune

macOS / Windows

pip install llm-autotune

Linux

Modern Linux distros (Ubuntu 23.04+, Debian 12+, Fedora 38+) block pip install to the system Python by default (PEP 668). Use pipx — it's the correct tool for CLI apps and keeps autotune isolated from your system packages:

# 1. Install pipx if you don't have it
sudo apt install pipx          # Debian / Ubuntu
sudo dnf install pipx          # Fedora
# or without sudo: pip install --user pipx

# 2. Install autotune
pipx install llm-autotune

# 3. Add ~/.local/bin to PATH (only needed once)
pipx ensurepath

# 4. Reload your shell — no need to open a new terminal
exec $SHELL

"autotune: command not found" right after installing? This means ~/.local/bin wasn't in your PATH before. Steps 3 and 4 above fix it — exec $SHELL reloads your shell in place without opening a new window.

If you'd rather not use pipx: pip install llm-autotune --break-system-packages works but may conflict with your distro's system packages. Not recommended.

Requirements: Python 3.9+, Ollama running locally.

# Apple Silicon acceleration (native Metal GPU kernels):
pip install "llm-autotune[mlx]"

# Development install:
git clone https://github.com/tanavc1/local-llm-autotune.git
cd local-llm-autotune && pip install -e ".[dev]"

3. Get a model recommendation for your hardware

autotune recommend

Profiles your CPU, RAM, and GPU, then scores every model in the registry against your hardware and recommends the best option with an exact autotune pull command to run.

4. Start chatting

autotune chat --model qwen3:8b                   # optimized chat, default profile
autotune chat --model qwen3:8b --profile fast    # minimum latency
autotune chat --model qwen3:8b --profile quality # largest context window
autotune chat --model qwen3:8b --no-swap         # guarantee no macOS swap
autotune chat --model qwen3:8b --system "You are a concise coding assistant."

5. Check what's running

autotune ps        # all models in memory — RAM, context, quant, age
autotune hardware  # CPU, RAM, GPU backend, and effective memory budget
autotune ls        # every locally installed model scored against your hardware

API server (OpenAI-compatible)

autotune serve
# → Listening at http://127.0.0.1:8765/v1

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)

Per-request headers

X-Autotune-Profile: fast          # override profile for this request
X-Conversation-Id: a3f92c1b       # attach to a persistent conversation

Endpoints

Endpoint	Description
`POST /v1/chat/completions`	OpenAI-compatible, streaming or non-streaming
`GET /v1/models`	All available models across all backends
`GET /health`	Server status, queue depth, memory pressure
`GET /api/hardware`	Live hardware snapshot
`GET /api/profiles`	Profile definitions
`GET /api/running_models`	Models in memory with RAM, context, quant, age
`POST/GET/DELETE /api/conversations`	Persistent conversation CRUD
`GET /api/conversations/{id}/export`	Export as Markdown

API key authentication

By default the server accepts all requests. To enforce API keys on all /v1/* routes:

export AUTOTUNE_ADMIN_KEY="your-secret-admin-key"
export AUTOTUNE_REQUIRE_API_KEY=1
autotune serve

Create and manage keys via the admin API (requires Authorization: Bearer $AUTOTUNE_ADMIN_KEY):

# Create a key
curl -s -X POST http://localhost:8765/admin/keys \
  -H "Authorization: Bearer $AUTOTUNE_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-app", "label": "Production"}' | jq .

# List all keys
curl -s http://localhost:8765/admin/keys \
  -H "Authorization: Bearer $AUTOTUNE_ADMIN_KEY" | jq .

# Per-key usage (last 30 days)
curl -s "http://localhost:8765/admin/usage/summary?days=30" \
  -H "Authorization: Bearer $AUTOTUNE_ADMIN_KEY" | jq .

# Revoke a key
curl -s -X DELETE http://localhost:8765/admin/keys/{id} \
  -H "Authorization: Bearer $AUTOTUNE_ADMIN_KEY" \
  -d '{"reason": "rotated"}'

Keys use the format sk-at-<token>. The plaintext is returned once on creation — only the SHA-256 hash is stored. Usage is logged per key per day to local SQLite with an optional Supabase mirror.

Admin endpoint	Description
`POST /admin/keys`	Create key — returns plaintext once
`GET /admin/keys`	List all keys
`GET /admin/keys/{id}`	Single key + 30-day usage
`DELETE /admin/keys/{id}`	Revoke (soft delete)
`GET /admin/usage`	Per-key/day/model breakdown
`GET /admin/usage/summary`	Aggregate totals per key

/health, /api/*, and /admin/* are always exempt from key enforcement.

Concurrency

AUTOTUNE_MAX_CONCURRENT=1    # parallel inference slots (default: 1)
AUTOTUNE_MAX_QUEUED=8        # max requests waiting (default: 8)
AUTOTUNE_WAIT_TIMEOUT=120    # seconds before a queued request gets 429 (default: 120)

Web dashboard

autotune ships a built-in monitoring and control dashboard at http://localhost:8765/dashboard. Start the server with autotune serve and open that URL in any browser.

Authentication

Set AUTOTUNE_ADMIN_KEY before starting the server. The dashboard login page requires that key — it issues an HMAC-signed session cookie (valid 12 hours by default). All dashboard API routes reject unauthenticated requests with HTTP 401.

export AUTOTUNE_ADMIN_KEY="your-secret-key"
autotune serve
# → open http://localhost:8765/dashboard

Dashboard panels

Panel	What it shows
Overview KPIs	RAM used (with pressure colour), running models, requests today, avg TTFT, avg tok/s, KV savings vs the 4 096-token Ollama default
Requests chart	24-bucket bar chart of the last 24 hours
TTFT sparkline	Last 100 requests coloured by latency tier (green / blue / yellow / red)
Raw vs Tuned	autotune's average dynamic context vs Ollama's fixed 4 096-token default — shows context reduction %, KV memory savings %, and measured avg TTFT
Per-model breakdown	Requests, avg/min/max TTFT, avg tok/s, avg context, total tokens, last used — for every model ever routed through autotune
Active API keys	Requests and tokens consumed today per key
Slow requests	Recent requests that took >5 s, with model, elapsed, TTFT, context, profile, timestamp
Suggestions	Rule-based guidance derived from real data: high TTFT, RAM pressure, KV savings, slow requests
Model catalog	Full 43-model registry with tier, parameter count, and hardware fit scores for your machine
Settings	Live read/write panel for all configurable keys (see below)

The dashboard auto-polls all data every 10 seconds. A live/offline indicator appears in the header.

Settings tab

Six settings can be read and updated live from the Settings tab or via the API:

Key	Type	Range	Default	Effect
`max_context_tokens`	int	512–131 072	32 768	Global context window ceiling
`kv_cache_quant`	string	`f16`/`q8`/`q4`	`f16`	KV precision override
`keep_alive_secs`	int	0–86 400	300	Model resident-memory duration in seconds
`telemetry_enabled`	bool	`true`/`false`	`false`	Opt-in anonymous telemetry
`retention_days`	int	7–3 650	90	Local run-observation history window
`log_slow_threshold_ms`	int	100–60 000	5 000	Slow-request alert threshold in ms

# Read all settings
curl -s http://localhost:8765/api/dashboard/settings \
  -H "Cookie: session=<your-session-cookie>" | jq .

# Batch update
curl -s -X POST http://localhost:8765/api/dashboard/settings \
  -H "Cookie: session=<your-session-cookie>" \
  -H "Content-Type: application/json" \
  -d '{"settings": {"max_context_tokens": 16384, "keep_alive_secs": 600}}'

# Prune old observations
curl -s -X POST http://localhost:8765/api/dashboard/settings/cleanup \
  -H "Cookie: session=<your-session-cookie>"

Security posture

AUTOTUNE_ADMIN_KEY — all dashboard API routes require a valid session cookie derived from this key. No key set → dashboard routes return 503.
HMAC-signed sessions — session cookies are signed with itsdangerous. Tampered tokens are rejected.
Revocation — logout invalidates the session server-side via a SQLite-backed revoked-token table. Stolen cookies cannot be reused after logout.
Rate limiting — sliding-window limits: write routes 30 req/hour, read routes 300 req/min, session refresh 60 req/min. Exceeded → HTTP 429.
CSP headers — Content-Security-Policy, X-Frame-Options: DENY, X-Content-Type-Options: nosniff, and Referrer-Policy set on all responses.
Whitelist — only the 6 documented keys can be written via the settings API. Arbitrary key injection is blocked at the validation layer.

Model recommendations by hardware

RAM	Recommended model	Pull command	Why
8 GB	`qwen3:4b`	`autotune pull qwen3:4b`	Best 4B available; hybrid thinking mode
16 GB	`qwen3:8b`	`autotune pull qwen3:8b`	Near-frontier quality; best 8B as of 2026
16 GB (coding)	`qwen2.5-coder:7b`	`autotune pull qwen2.5-coder:7b`	Near GPT-4o on HumanEval at 7B
24 GB	`qwen3:14b`	`autotune pull qwen3:14b`	Excellent reasoning; comfortable headroom
24 GB (coding)	`qwen2.5-coder:14b`	`autotune pull qwen2.5-coder:14b`	Best open coding model at this size
32 GB	`qwen3:30b-a3b`	`autotune pull qwen3:30b-a3b`	MoE: flagship quality at 7B inference cost
64 GB+	`qwen3:32b`	`autotune pull qwen3:32b`	Top dense open model
Reasoning	`deepseek-r1:14b`	`autotune pull deepseek-r1:14b`	Chain-of-thought; strong math and logic

Run autotune recommend to get a personalised pick with scores for your exact hardware configuration.

Features

Feature	What happens
Dynamic KV sizing	Computes the exact `num_ctx` each request needs — typically 4–8× less KV cache than Ollama's fixed 4,096-token default
KV prefix caching	Pins system-prompt tokens via `num_keep` so they're never re-evaluated each turn
Model keep-alive	Sets `keep_alive=-1` so the model stays loaded between conversations — eliminates reload latency
Adaptive KV precision	Automatically downgrades F16 → Q8 under memory pressure before any slowdown occurs
Flash attention	Enables `flash_attn=true` on every request — reduces peak KV activation memory
Prefill batching	Sets `num_batch=1024` (2× Ollama default) — fewer Metal kernel dispatches for long prompts
Context management	Trims conversation history at token budget thresholds, always at sentence/paragraph boundaries
Inference queue	FIFO queue (1 concurrent, 8 waiting) with HTTP 429 back-pressure — prevents memory thrashing
OpenAI-compatible API	Drop-in server at `localhost:8765/v1` — works with any OpenAI SDK
MLX backend	On M-series Macs, routes inference to MLX-LM for native Metal GPU kernels
Persistent memory	Every conversation saved to SQLite; semantically searches past sessions at startup
NoSwapGuard	Exact-math pre-flight check using model architecture — computes precise KV bytes and reduces context + KV precision until the allocation fits in available RAM

Agentic workloads

Raw Ollama's fixed num_ctx=4096 hurts most inside agent loops — where tool calls, observations, and reasoning steps accumulate. autotune sizes the session context once before the loop begins, holds it constant across all turns, and uses num_keep prefix caching so the system prompt is never re-evaluated after turn 1.

Measured on llama3.2:3b, multi-turn tool-calling agent task:

Metric	Raw Ollama	autotune
Agent wall time	74 s	40 s (−46%)
Model reloads per session	0–1	~0
Swap events	1 of 3 trials	0
Tool call errors	1 avg	0
Context tokens at session end	3,043	1,946 (−36%)
TTFT trend per turn	grows	shrinks (prefix cache)

For sessions with 3+ turns, prefix caching compounds — TTFT per turn falls as the conversation grows. Full methodology and raw data: AGENT_BENCHMARK.md

Chat commands

Command	What it does
`/help`	Show available commands
`/new`	Start a new conversation
`/history`	Show full conversation history
`/profile fast\|balanced\|quality`	Switch profile mid-conversation
`/model <id>`	Switch to a different model
`/system <text>`	Set or replace the system prompt
`/export`	Export conversation to Markdown
`/metrics`	Session stats: tok/s, TTFT, request count
`/recall`	Browse past conversations
`/recall search <query>`	Semantic search across all past sessions
`/pull <model>`	Pull a model from Ollama without leaving chat
`/quit`	Exit (also Ctrl-C)

Profiles

Profile	Context	Temperature	KV precision	Best for
`fast` ⚡	2,048	0.1	Q8	Quick lookups, autocomplete
`balanced` ⚖️	8,192	0.7	F16	General chat, coding
`quality` ✨	32,768	0.8	F16	Long documents, analysis

Apple Silicon (MLX)

pip install "llm-autotune[mlx]"
autotune mlx pull qwen3:8b        # download MLX-quantized model
autotune chat --model qwen3:8b    # automatically routes to MLX
autotune mlx list                 # show locally cached MLX models

MLX activates automatically on Apple Silicon — no configuration needed. Use Ollama-backed models when you need structured tool calls in agentic workflows.

Docker — Ollama + autotune bundled

autotune ships pre-built images to Docker Hub for linux/amd64 and linux/arm64. No local Python, no separate Ollama install.

Quickstart from Docker Hub

# autotune server only — point it at your existing Ollama instance
docker run -p 8765:8765 \
  -e AUTOTUNE_OLLAMA_URL=http://host.docker.internal:11434 \
  tanavc1/llm-autotune:latest

# pin to a specific version
docker pull tanavc1/llm-autotune:1.2.0

Bundled image — Ollama + autotune in one container

git clone https://github.com/tanavc1/local-llm-autotune.git
cd local-llm-autotune
docker compose --profile single up

Docker builds the bundled image (Ollama + autotune), starts both services, and exposes the API at http://localhost:8765/v1. Point any OpenAI-compatible client there.

Auto-pull a model on first boot

OLLAMA_MODEL=qwen3:8b docker compose --profile single up

The container pulls qwen3:8b from Ollama's registry on first start, then begins serving. Subsequent runs skip the pull because the model is cached in the volume.

Raw Docker (without Compose)

docker build -t autotune .
docker run -p 8765:8765 -v ollama_models:/root/.ollama autotune

# With a model:
docker run -p 8765:8765 -v ollama_models:/root/.ollama -e OLLAMA_MODEL=qwen3:8b autotune

# Also expose Ollama directly:
docker run -p 8765:8765 -p 11434:11434 -v ollama_models:/root/.ollama autotune

docker-compose — separate Ollama + autotune services

docker compose --profile multi up

In this mode, Ollama and autotune run as separate services. autotune receives AUTOTUNE_OLLAMA_URL=http://ollama:11434 so it routes to the Ollama service by name. Use a separate Dockerfile.autotune that contains only Python (~200 MB vs ~2 GB for the bundled image).

Environment variables

Variable	Default	Purpose
`OLLAMA_MODEL`	(empty)	Model to auto-pull on first container start
`AUTOTUNE_PORT`	`8765`	Port autotune binds inside the container
`OLLAMA_HOST`	`0.0.0.0`	Bind address passed to `ollama serve` inside the container
`AUTOTUNE_OLLAMA_URL`	`http://localhost:11434`	Where autotune reaches Ollama — set to `http://ollama:11434` for multi-container mode
`AUTOTUNE_REQUIRE_API_KEY`	`0`	Set to `1` to enforce API key auth on all `/v1/*` routes
`AUTOTUNE_ADMIN_KEY`	(unset)	Bearer token for all `/admin/*` endpoints — 503 if unset when accessed

GPU support

The bundled image is built on ollama/ollama:latest which includes CUDA and ROCm layers. Mount the appropriate devices:

# NVIDIA GPU
docker run --gpus all -p 8765:8765 -v ollama_models:/root/.ollama autotune

# AMD GPU (ROCm)
docker run --device /dev/kfd --device /dev/dri -p 8765:8765 \
  -v ollama_models:/root/.ollama autotune

Embedding autotune in your application

import autotune
from openai import OpenAI

autotune.start()                             # spawns server if not running; blocks until ready
client = OpenAI(**autotune.client_kwargs())  # {"base_url": "http://localhost:8765/v1", "api_key": "local"}

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello"}],
)

start() checks /health first and returns immediately if the server is already running.

Options

autotune.start(
    host="localhost",
    port=8765,
    timeout=30.0,       # raise TimeoutError if server isn't ready within this many seconds
    profile="balanced", # "fast" | "balanced" | "quality"
    use_mlx=False,      # True = MLX on Apple Silicon (faster, no tool calls)
    log_level="warning",
)

Error handling

try:
    response = client.chat.completions.create(...)
except Exception as e:
    error = e.response.json().get("detail", {})
    match error.get("type"):
        case "model_not_found":
            print(f"Run: autotune pull {error['model']}")
        case "memory_pressure":
            print("Not enough RAM. Try a smaller model or --profile fast.")
        case "backend_error":
            print(f"Backend error: {error['message']}\nSuggestion: {error['suggestion']}")

Server RAM footprint

Mode	Server RAM	Tool calling	Notes
`autotune.start()` (default)	~94 MB	✓	Ollama-backed
`autotune.start(use_mlx=True)`	~470 MB	✗	10–40% faster on Apple Silicon

Agentic frameworks

autotune's OpenAI-compatible server is a drop-in local LLM backend for any framework that accepts a custom base URL.

autotune serve

OpenClaw

# openclaw/config.yaml
providers:
  - name: autotune-local
    api: openai-responses
    baseUrl: http://localhost:8765/v1
    apiKey: sk-local
    model: qwen3:8b
    supportsTools: true

Hermes Agent

# ~/.hermes/config.yaml
model:
  provider: custom
  base_url: http://localhost:8765/v1
  api_key: sk-local
  name: qwen3:8b

Models confirmed for tool calling via Ollama: qwen3:8b, qwen3:14b, llama3.1:8b, qwen2.5-coder:14b, hermes3

How it works — all 14 optimizations

autotune sits between your code and Ollama as a transparent middleware layer. Every request passes through a stack of optimizations. Here's every one, explained plainly.

Full explanations with examples: see below, or visit the GitHub repo

The KV cache — the central concept

When an LLM generates text, every new token needs to "attend to" every previous token. The results of that attention computation — two tables of numbers per token called K (keys) and V (values) — are cached in RAM so they don't have to be recomputed. This is the KV cache.

Its size is mathematically exact:

2 × n_layers × kv_heads × head_dim × num_ctx × bytes_per_element

For qwen3:8b at 4,096 context: 576 MB. At 1,536 context: 216 MB. The KV cache scales linearly with context length — that's the big lever.

Memory optimizations

1. Dynamic context sizing — every request

Ollama allocates the full KV cache before generating the first token, using whatever num_ctx you've configured — even if your actual prompt is 50 words. autotune computes the minimum context each request actually needs:

num_ctx = clamp(input_tokens + max_new_tokens + 256, 512, profile_max)

A typical balanced-profile message (22-token prompt + 1024 reply + 256 buffer = 1,302 tokens) allocates ~145 MB instead of ~576 MB on qwen3:8b. No tokens are dropped — the context window grows naturally as the conversation grows.

2. KV cache precision control — per profile, adaptive

KV elements can be stored as F16 (2 bytes each) or Q8 (1 byte each). Q8 halves the entire KV cache footprint with negligible quality impact. This is separate from model quantization — it only affects the temporary computation cache, not the model weights.

fast profile: always Q8
balanced / quality: F16 by default, Q8 under memory pressure

3. NoSwapGuard — exact-math pre-flight guarantee — every request

NoSwapGuard is autotune's hard guarantee against swap — and it works completely differently from the Live pressure system below. Where Live pressure uses RAM percentage as a heuristic, NoSwapGuard queries the actual model architecture from Ollama and computes the precise bytes the KV allocation will need:

kv_bytes = 2 × n_layers × kv_heads × head_dim × num_ctx × precision_bytes

If kv_bytes + 1.5 GB safety margin > available RAM, it fires — not because RAM looks "80% full", but because the exact number of bytes won't fit. It reduces in levels, applied in order until the math clears:

Level	Action
0	Fits — no change
1	Trim context 25%
2	Halve context
3	Halve context + Q8 KV (saves ~50% more)
4	Quarter context + Q8
5	Minimum (512 tokens) + Q8 — emergency floor

The model's architecture (layers, KV heads, head dimension) is queried from Ollama's /api/show once and cached — every calculation is exact, not estimated. This is the fundamental difference from Live pressure's percentage tiers: NoSwapGuard knows precisely how many bytes are needed.

4. Live memory pressure response — every request, real-time

Live pressure is autotune's proactive, heuristic RAM tier system — entirely separate from NoSwapGuard's exact-math approach above. It reads the OS's RAM utilization percentage before every request and adjusts context + KV precision according to fixed thresholds, firing well before any swap risk:

RAM usage	Context	KV precision
< 80%	full	profile default
80–88%	−10%	profile default
88–93%	−25%	F16 → Q8
> 93%	halved	forced Q8

At 80% utilization there is still 20% RAM free — no swap danger — but autotune starts backing off preemptively. If Live pressure has already trimmed enough headroom, NoSwapGuard may not need to fire at all. Changes are reported in the chat interface. No user action needed.

5. Pre-flight model fit analysis — before loading

Before a model is loaded, autotune calculates whether it will fit: model_weights + kv_cache(context, precision) + runtime_overhead. It classifies the result as SAFE / MARGINAL / SWAP_RISK / OOM and sets a safe context ceiling. If the model is too heavy, it recommends a lighter quantization with the exact autotune pull command to run.

Speed optimizations

6. Context bucket snapping — every request

After computing the minimum context, autotune snaps it to the nearest bucket from a fixed list: [512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192, 12288, 16384, 32768].

Why: Ollama caches the KV buffer for the most recently used context length. If num_ctx changes request-to-request (1,286 → 1,157 → 1,308), Ollama reallocates the Metal buffer on every call — even with the model already loaded. This "KV thrashing" adds 100–300 ms per request. Buckets eliminate it: prompts of 50–200 tokens all map to 1,536, Ollama allocates it once and reuses it forever.

7. System prompt prefix caching — multi-turn conversations

Ollama re-processes the system prompt from scratch on every turn. autotune pins the system prompt tokens in the KV cache via num_keep — they're evaluated once at the start and never again. In agentic sessions with 10+ turns, this compounding effect means TTFT actually falls as the session grows.

8. Model keep-alive — between sessions

Ollama unloads models after 5 minutes of idle. autotune sets keep_alive="-1" (forever) on every request. The model stays in RAM between conversations, eliminating the 1–4 second cold-reload cost you'd otherwise pay every time a session goes idle. This doesn't cost more RAM — the weights were already loaded; it just keeps them committed.

9. Flash attention — every request

Passes flash_attn: true to Ollama. Flash attention computes attention in tiles rather than materializing the full N² attention matrix, dramatically reducing the peak activation memory spike during prefill. Zero quality impact — it's mathematically identical to standard attention. Models that don't support it silently ignore the flag.

10. Larger prefill batch size — long prompts

Sets num_batch=1024 (Ollama default: 512). During prefill (processing your prompt), tokens are fed through the model in chunks. A 700-token prompt with the default takes 2 GPU passes; with 1024, it takes 1. Fewer passes = fewer Metal kernel dispatches = lower TTFT for any prompt over 512 tokens. Short prompts are unaffected.

Adaptive intelligence

11. Hardware tuner — around each inference call

Makes real OS-level changes before inference and restores them after:

macOS QOS class: Sets the thread to USER_INTERACTIVE — the highest scheduling priority on macOS (same class as UI scrolling animations). The process gets more CPU time over background tasks.
Process priority (nice): Raises the autotune and Ollama process priorities on macOS/Linux for better CPU scheduling.
Python GC disabled: Python's garbage collector causes "stop the world" pauses of up to tens of milliseconds. Disabling it during inference eliminates hitches in streamed output.
Linux CPU governor: Attempts to set the CPU to performance mode (full clock speed) during inference (requires root; silently skipped otherwise).

12. Adaptive session advisor — live monitoring

Continuously watches RAM%, swap activity, tokens/sec, and TTFT. Computes a 0–100 health score every 30 seconds. When the score drops below thresholds, takes the least-disruptive available action from an ordered list:

Reduce concurrency
Reduce context window
Lower KV precision (F16 → Q8)
Enable prompt caching
Disable speculative decoding
Lower quantization
Suggest switching to a smaller model

There's a 20-second cooldown between actions and a 90-second stability window before scale-up. The advisor attributes events — it knows whether a RAM spike was caused by loading a model, KV growth, or a background application.

Context & conversation

13. Context compressor — long sessions

As conversation history grows toward the context limit, autotune compresses older messages in four tiers:

< 55%  FULL          — all turns verbatim
55–75% RECENT+FACTS  — last 8 turns + structured facts for older
75–90% COMPRESSED    — last 6 turns (lightly compressed) + compact summary
> 90%  EMERGENCY     — last 4 turns (compressed) + one-line summary

Compression strategies (lightest first): strip noise → compress JSON blobs → shorten tool output (head + tail) → trim assistant messages (keep first paragraph + code blocks + last paragraph) → trim user messages (preserve intent). Low-value chatter is dropped first; code blocks and stack traces are always preserved. All cuts happen at sentence boundaries. Facts extraction is deterministic — no extra LLM call required.

14. Conversation memory & recall — across sessions

Every conversation is saved to a local SQLite database (~/.autotune/recall.db). At the start of each new conversation, autotune searches your history for semantically relevant past context and quietly injects it as a system note.

Vector search (primary): Uses nomic-embed-text (local, ~274 MB, runs in Ollama) to find semantically similar past exchanges — even if they use different words.
FTS5 keyword fallback: Full-text search across all stored conversations when the embedding model isn't available.
Injection threshold: Only injects if cosine similarity > 0.38 — conservative by design. Better to show nothing than irrelevant noise. Up to 3 memories injected, capped at 1,200 characters total.

All data is local. Nothing is sent to any server.

What doesn't change

Generation speed (tok/s): Metal GPU-bound on Apple Silicon. autotune doesn't touch the generation loop. Benchmarks show ±2% variance — measurement noise.
Output quality: Model weights, sampling parameters, and temperature are unchanged. prompt_eval_count is identical — no tokens are dropped or skipped.
Turn 1 in agentic sessions: Pre-allocating a full session KV window makes turn 1 ~80% slower. From turn 2 onward, prefix-cache savings compound and total wall time comes out ~46% lower.

Conversation memory

Every conversation is saved to a local SQLite database with full-text and vector similarity search. No flags required.

Automatic context injection — at session start, autotune surfaces relevant facts from past conversations as a silent system note.
Session resume — use --conv-id <id> to continue an exact past session with full context.
In-chat recall — /recall to browse sessions; /recall search <topic> for semantic search.

Path	Contents
`~/.autotune/recall.db`	FTS5 + float32 vectors; turns, extracted facts
`~/Library/Application Support/autotune/autotune.db`	Hardware telemetry, run observations (macOS)
`~/.local/share/autotune/autotune.db`	Same (Linux)

Telemetry

autotune telemetry                    # last 20 inference runs
autotune telemetry --events           # notable events: swap spikes, OOMs
autotune telemetry --model qwen3:8b   # filter by model

Anonymous cloud telemetry is opt-in and off by default:

autotune telemetry --status    # check opt-in status
autotune telemetry --enable    # opt in
autotune telemetry --disable   # opt out

What is sent when opted in: CPU architecture, RAM size, GPU backend, tokens/sec, TTFT, context size, quantization label, session start/stop events. No hostnames, usernames, IP addresses, or conversation content. Data goes to a private Supabase instance and is never sold or shared.

The Supabase anon key embedded in the package is a public client token (INSERT-only, row-level security enforced). See SECURITY.md for a full explanation.

Troubleshooting

"error: externally-managed-environment" (Linux) → Your Linux distro blocks pip install to the system Python (PEP 668). Install via pipx instead:

sudo apt install pipx          # Debian/Ubuntu — or: pip install --user pipx
pipx install llm-autotune
pipx ensurepath && exec $SHELL

"autotune: command not found" after pipx install → ~/.local/bin was just added to your PATH but the current shell session doesn't know yet. No need to open a new terminal — just run:

exec $SHELL

If that doesn't work, run pipx ensurepath first, then exec $SHELL.

"Ollama is not running." → autotune starts Ollama automatically. If it still fails, install Ollama:

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Or download the desktop app from https://ollama.com/download.

"No models found." → Pull a model: autotune pull qwen3:8b or run autotune recommend for a hardware-matched suggestion.

"Memory pressure — context 8192→6144 tokens" → RAM is 88%+ full. Close other apps or switch to a smaller model.

HTTP 429 — queue full → Too many concurrent requests. Increase AUTOTUNE_MAX_QUEUED or wait for one to finish.

First message is slow → Expected — the model loads and the KV buffer initializes on the first request. Subsequent messages respond immediately.

CLI command reference

Get started

Command	What it does
`autotune run <model>`	Pre-flight RAM check + chat in one step. Best first command for any new model.
`autotune chat --model <id>`	Start an optimized chat session with a model already installed.
`autotune hardware`	Scan CPU/RAM/GPU, show which models fit, and suggest apps to close for more RAM.
`autotune recommend`	Profile your hardware and recommend the best model+settings. Prints exact `autotune pull` commands.

Manage models

Command	What it does
`autotune ls`	List downloaded models with fit scores, safe context window, and recommended profile.
`autotune ps`	Show every model currently loaded in RAM across Ollama, MLX, and LM Studio.
`autotune pull [model]`	Download an Ollama model. Omit the name to browse hardware-aware recommendations.
`autotune models`	List local models with size, architecture, and quality tier. `--registry` shows autotune's full catalog.
`autotune unload [model]`	Release a model from memory immediately. Interactive picker if no model specified.

Deploy & integrate

Command	What it does
`autotune serve`	Start an OpenAI-compatible API server on `localhost:8765`. All optimizations applied automatically.

Benchmarking & proof

Command	Duration	What it does
`autotune proof -m <model>`	~30 s	Quick head-to-head: raw Ollama vs autotune. Shows TTFT, KV RAM, swap events, RAM headroom.
`autotune proof-suite -m <model>`	~10 min	5-prompt statistical suite. Wilcoxon signed-rank + Cohen's d + 95% CI across multiple models.
`autotune bench -m <model>`	~15 min	Intensive multi-prompt benchmark with `--duel`, `--raw`, and `--compare` modes.
`autotune user-bench -m <model>`	~30 min	Real-world UX benchmark: swap events, TTFT consistency, CPU spikes, RAM headroom, 0–100 score.
`autotune agent-bench`	~1–2 h	Agentic multi-turn benchmark across 5 tasks. Shows TTFT growth curves (the key story).

# Typical proof workflow
autotune proof -m qwen3:8b                    # quick check (~30s)
autotune proof-suite -m qwen3:8b --runs 5     # statistical confirmation
autotune user-bench -m qwen3:8b --quick       # does it feel better?

Key flags for autotune proof:

Flag	Default	Description
`--model, -m`	auto	Ollama model ID. Auto-selects if omitted.
`--runs, -r`	`2`	Runs per condition. 3+ gives stabler numbers.
`--profile, -p`	`balanced`	autotune profile to test.
`--output, -o`	`proof_<model>.json`	Save JSON results.
`--list-models`	—	Print installed models and exit.

Conversation memory

Command	What it does
`autotune memory search "<query>"`	Search past conversations by meaning (vector) or keyword (FTS5 fallback).
`autotune memory list`	List recently stored memory chunks with timestamps and model names.
`autotune memory stats`	Show total chunks, vector coverage, DB size, date range, and per-model counts.
`autotune memory forget <id>`	Delete a specific memory chunk. `--all` wipes everything (with confirmation).
`autotune memory setup`	Pull `nomic-embed-text` (~274 MB) to enable semantic vector search.

autotune memory setup                          # one-time: enable semantic search
autotune memory search "FastAPI auth"          # find relevant past sessions
autotune memory list --days 7                  # recent memories
autotune memory forget 42                      # remove a specific chunk

Apple Silicon (MLX)

Command	What it does
`autotune mlx list`	List MLX models already cached locally.
`autotune mlx pull <model>`	Download MLX-quantized model from mlx-community on HuggingFace. Accepts Ollama names.
`autotune mlx resolve <model>`	Show which HuggingFace MLX model ID would be used for a given Ollama name.

MLX is 10–40% faster than Ollama on the same model by running on Apple's unified memory and Metal GPU kernels.

autotune mlx pull qwen3:8b                     # download 4-bit MLX version
autotune mlx pull qwen2.5-coder:14b --quant 8bit
autotune serve --mlx                           # start API server using MLX backend

Settings & diagnostics

Command	What it does
`autotune telemetry`	View recent inference runs (TTFT, tok/s, RAM, swap, CPU).
`autotune telemetry --enable`	Opt in to anonymous telemetry (hardware fingerprint + perf data).
`autotune telemetry --disable`	Opt out. No further data sent.
`autotune telemetry --status`	Show current consent status.
`autotune storage on\|off\|status`	Enable/disable local SQLite storage of performance observations.
`autotune doctor`	Full health check: Python, packages, Ollama connectivity, RAM/swap, DB health.

Architecture

autotune/
├── ttft/          ← TTFT optimisation (start here for latency work)
│   └── optimizer.py    TTFTOptimizer: dynamic num_ctx + keep_alive + num_keep
│
├── api/           ← Inference pipeline
│   ├── server.py       FastAPI server — OpenAI-compatible /v1 + FIFO queue
│   ├── chat.py         Terminal REPL with adaptive RAM + live stats
│   ├── kv_manager.py   KV options builder: flash_attn, num_batch, pressure tiers
│   ├── model_selector.py   Pre-flight fit analysis
│   └── backends/       Ollama, MLX, LM Studio, HuggingFace Inference API
│
├── context/       ← Context window management
│   ├── window.py       ContextWindow orchestrator
│   ├── budget.py       Tier thresholds (FULL → RECENT+FACTS → COMPRESSED → EMERGENCY)
│   ├── classifier.py   Message value scoring
│   ├── compressor.py   Tool output + long-content compression
│   └── extractor.py    Deterministic fact extraction
│
├── recall/        ← Conversation memory
│   ├── store.py        SQLite WAL: FTS5 full-text + float32 cosine vectors
│   └── manager.py      save / search / list conversations
│
├── db/            ← Persistence
│   └── store.py        SQLite: models, hardware, run_observations, telemetry_events
│
├── hardware/      ← Hardware detection
│   ├── profiler.py     CPU/GPU/RAM detection
│   └── ram_advisor.py  Real-time RAM pressure advice
│
├── memory/        ← Memory estimation + no-swap guarantee
│   ├── estimator.py    Model weights + KV + runtime overhead
│   └── noswap.py       NoSwapGuard: adjusts num_ctx to guarantee no swap
│
└── cli.py         ← Entry point (Click)

Support development

autotune is free and MIT-licensed. If it saves you time or compute costs, sponsoring helps keep it maintained and worth improving.

What sponsorships actually cover:

Cost	Monthly
Supabase (anonymous telemetry DB)	$0 — free tier
Vercel (autotunellm.com hosting)	$0 — hobby plan
Domain (autotunellm.com)	$1.25
Total infrastructure	~$1.25 / month

The real cost is developer time. All development is done evenings and weekends. Sponsorships are a direct signal that the project is worth that time.

Suggested tiers (set up at github.com/sponsors/tanavc1):

	What it means
$3 / month — Supporter	You use it and it works. Thank you.
$10 / month — Regular	Your name in the README sponsors list.
$25 / month — Patron	Priority issue responses + your name in the README.

No features are locked behind sponsorship. Everything in this repo is and will remain free.

Contributing & support

Bug reports and pull requests welcome. Open an issue on GitHub or email autotunellm@gmail.com.

For security vulnerabilities, see SECURITY.md — please do not open a public issue.

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.5.0

May 27, 2026

1.3.0

May 24, 2026

1.2.2

May 24, 2026

1.2.1

May 22, 2026

1.2.0

May 21, 2026

1.1.2

May 5, 2026

1.1.1

May 3, 2026

1.1.0

May 3, 2026

1.0.15

May 2, 2026

1.0.14

May 2, 2026

1.0.13

May 2, 2026

1.0.12

May 2, 2026

1.0.11

May 2, 2026

1.0.10

Apr 30, 2026

1.0.9

Apr 29, 2026

1.0.8

Apr 29, 2026

1.0.7

Apr 29, 2026

1.0.6

Apr 28, 2026

1.0.5

Apr 28, 2026

1.0.4

Apr 28, 2026

1.0.3

Apr 28, 2026

1.0.2

Apr 28, 2026

1.0.1

Apr 28, 2026

1.0.0

Apr 24, 2026

0.1.1

Apr 15, 2026

0.1.0

Apr 15, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_autotune-1.5.0.tar.gz (485.9 kB view details)

Uploaded May 27, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

llm_autotune-1.5.0-py3-none-any.whl (516.3 kB view details)

Uploaded May 27, 2026 Python 3

File details

Details for the file llm_autotune-1.5.0.tar.gz.

File metadata

Download URL: llm_autotune-1.5.0.tar.gz
Upload date: May 27, 2026
Size: 485.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_autotune-1.5.0.tar.gz
Algorithm	Hash digest
SHA256	`50067d7eff2f0a788bfccf119e7abf7dfbc7ed2d1008ea590ac176b41fdcf1f5`
MD5	`8d75db18aed033bf8064e3667eb881e4`
BLAKE2b-256	`8be53598b5f8fe537a174a274ea4fa8d9a8fcc021541a3b6e18ee832cf5d9941`

See more details on using hashes here.

File details

Details for the file llm_autotune-1.5.0-py3-none-any.whl.

File metadata

Download URL: llm_autotune-1.5.0-py3-none-any.whl
Upload date: May 27, 2026
Size: 516.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_autotune-1.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2c6af1c56d1f6659a8e727693e68162b14b23f621337a6fe0b30f8a2f500e2c2`
MD5	`8b6920a7138db7b73c808a553299925a`
BLAKE2b-256	`111112319a45db9eac00fdb5fc301b40a6f4a8a600e4aa1f0cf439ecb7cf4d98`

See more details on using hashes here.

llm-autotune 1.5.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

autotune — Local LLM Inference Optimizer

What autotune actually improves

What the numbers mean

Verify it yourself

Quickstart

1. Install Ollama

2. Install autotune

3. Get a model recommendation for your hardware

4. Start chatting

5. Check what's running

API server (OpenAI-compatible)

Per-request headers

Endpoints

API key authentication

Concurrency

Web dashboard

Authentication

Dashboard panels

Settings tab

Security posture

Model recommendations by hardware

Features

Agentic workloads

Chat commands

Profiles

Apple Silicon (MLX)

Docker — Ollama + autotune bundled

Quickstart from Docker Hub

Bundled image — Ollama + autotune in one container

Auto-pull a model on first boot

Raw Docker (without Compose)

docker-compose — separate Ollama + autotune services

Environment variables

GPU support

Embedding autotune in your application

Options

Error handling

Server RAM footprint

Agentic frameworks

OpenClaw

Hermes Agent

How it works — all 14 optimizations

The KV cache — the central concept

Memory optimizations

Speed optimizations

Adaptive intelligence

Context & conversation

What doesn't change

Conversation memory

Telemetry

Troubleshooting

CLI command reference

Get started

Manage models

Deploy & integrate

Benchmarking & proof

Conversation memory

Apple Silicon (MLX)

Settings & diagnostics

Architecture

Support development

Contributing & support

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed