autotune — Local LLM Inference Optimizer
A middleware layer that makes locally-running LLMs faster, lighter, and smarter on your own hardware — with zero changes to your existing setup.
Works with Ollama, LM Studio, and MLX (Apple Silicon native) backends out of the box.
What it does
autotune sits between your application and the local LLM backend. It automatically:
| Feature | What happens |
|---|---|
| Dynamic KV sizing | Computes the exact num_ctx each request needs instead of allocating the profile max — typically 4–8× less KV cache memory |
| KV prefix caching | Pins system-prompt tokens in Ollama's KV cache via num_keep so they're never re-evaluated each turn |
| Adaptive KV precision | Downgrades KV cache from F16 → Q8 under memory pressure (80% → −10% ctx, 88% → −25% ctx + Q8, 93% → −50% ctx + Q8) |
| Model keep-alive | Sets keep_alive=-1m so the model stays loaded in unified memory between turns — eliminates reload latency |
| Flash attention | Enables flash_attn=true on every request — reduces peak KV activation memory during attention computation; zero quality impact |
| Prefill batching | Sets num_batch=1024 (2× Ollama default) — reduces Metal kernel dispatches for long prompts; under critical RAM pressure drops to 256 |
| Multi-tier context management | Intelligently trims conversation history at token budget thresholds with no mid-sentence cuts |
| Inference queue | FIFO queue (default: 1 concurrent, 8 waiting) with HTTP 429 back-pressure — prevents parallel inference from thrashing memory |
| Profile-based optimization | fast / balanced / quality profiles tune temperature, context length, KV precision, and OS QoS class |
| OpenAI-compatible API | Drop-in replacement for localhost:8765/v1 — works with any OpenAI SDK |
| MLX backend (Apple Silicon) | On M-series Macs, routes inference to MLX-LM — native Metal GPU kernels, unified memory |
| Persistent conversation memory | Saves every conversation to SQLite; automatically injects relevant past context at session start; searchable by topic |
| Hardware telemetry | Samples RAM/Swap/CPU every 250 ms, persists structured metrics to SQLite |
Quickstart
1. Prerequisites
Install Ollama and pull at least one model:
ollama pull qwen3:8b # 5.2 GB — best general model for 16 GB laptops
ollama pull gemma4 # ~5.8 GB — Google's newest model, multimodal, 128k context
ollama pull qwen2.5-coder:14b # 9 GB — top coding model for 24+ GB RAM
2. Install autotune
git clone https://github.com/tanavc1/local-llm-autotune.git
cd local-llm-autotune
pip install -e .
Requirements: Python 3.10+, Ollama running locally.
3. Check your hardware
autotune hardware
Shows CPU, RAM, GPU backend, and the effective memory budget autotune uses when selecting models.
4. See what models fit
autotune ls
Scores every locally downloaded Ollama model against your hardware — shows whether it fits comfortably, has swap risk, or will OOM. Recommends a profile for each.
5. Start chatting
autotune run — pre-flight memory analysis + optimized chat. Checks whether the model fits in RAM, picks the right profile, then opens a chat session:
autotune run qwen3:8b
autotune chat — skip the pre-flight and go straight to optimized chat (adaptive-RAM monitoring, KV-manager, and context-optimizer are always active):
autotune chat --model qwen3:8b # balanced (default)
autotune chat --model qwen3:8b --profile fast # fastest responses
autotune chat --model qwen3:8b --profile quality # largest context
autotune chat --model qwen3:8b --no-swap # guarantee no macOS swap
Set a system prompt:
autotune chat --model qwen3:8b --system "You are a concise coding assistant."
Resume a previous conversation (the ID is shown in the chat header):
autotune chat --model qwen3:8b --conv-id a3f92c1b
6. Check what's running
autotune ps
Shows every model currently loaded in memory — across both Ollama and the MLX backend — with RAM usage, context size, quantization, and time loaded.
Model recommendations by hardware
autotune works with any Ollama-compatible model. Here are our current picks for each hardware tier:
| RAM | Recommended model | Size | Why |
|---|---|---|---|
| 8 GB | qwen3:4b | ~2.6 GB | Best 4B model available; hybrid thinking mode |
| 16 GB | qwen3:8b | ~5.2 GB | Near-frontier quality; best 8B as of 2026 |
| 16 GB | gemma4 | ~5.8 GB | Google's newest; multimodal, 128k context |
| 24 GB | qwen3:14b | ~9.0 GB | Excellent reasoning; fits well with headroom |
| 32 GB | qwen3:30b-a3b | ~17 GB | MoE: flagship quality at 7B inference cost |
| 64 GB+ | qwen3:32b | ~20 GB | Top dense open model |
| Coding | qwen2.5-coder:14b | ~9.0 GB | Best open coding model for 24 GB machines |
| Reasoning | deepseek-r1:14b | ~9.0 GB | Chain-of-thought; strong math and logic |
Run autotune ls to see how each downloaded model scores against your specific hardware.
Chat commands
Once inside a chat session, these slash commands are available:
| Command | What it does |
|---|---|
| /help | Show available commands |
| /new | Start a new conversation (keeps model and profile) |
| /history | Show the full conversation history |
| /profile fast\|balanced\|quality | Switch profile mid-conversation |
| /model <id> | Switch to a different model |
| /system <text> | Set or replace the system prompt |
| /export | Export conversation to a Markdown file |
| /metrics | Show session performance stats (tok/s, TTFT, request count) |
| /backends | Show which backends are running (Ollama, LM Studio, HF API) |
| /models | List all locally available models |
| /recall | Browse past conversations with dates and snippets; pick one to resume |
| /recall search <query> | Search past conversations by topic — finds semantically related sessions |
| /pull <model> | Pull a model via Ollama without leaving chat |
| /delete | Delete the current conversation from history |
| /quit | Exit (also Ctrl-C) |
Apple Silicon (MLX acceleration)
On M-series Macs, install the MLX backend to use native Metal GPU kernels:
pip install -e ".[mlx]" # install mlx-lm
autotune mlx pull qwen3:8b # download MLX-quantized model from mlx-community
autotune chat --model qwen3:8b # automatically routes to MLX on Apple Silicon
MLX is activated automatically when running on Apple Silicon — no configuration needed. autotune resolves the Ollama model name to the corresponding mlx-community HuggingFace repo.
autotune mlx list # show locally cached MLX models
autotune mlx resolve llama3.2 # check which MLX model ID would be used
API server (OpenAI-compatible)
Run autotune as a server and point any existing OpenAI client at it:
autotune serve
# Listening at http://127.0.0.1:8765/v1
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
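The same endpoint also supports streaming (see the endpoint table below). A minimal streaming sketch with the same OpenAI SDK client:

stream = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Summarise flash attention in two sentences."}],
    stream=True,  # tokens arrive incrementally as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)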
Autotune-specific headers
Pass these with any request to override behavior per-call:
X-Autotune-Profile: fast # override profile (fast | balanced | quality)
X-Conversation-Id: a3f92c1b # attach to a persistent conversation
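With the OpenAI Python SDK these can be passed per call via extra_headers (a standard SDK parameter, not autotune-specific). A minimal sketch:

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "One-line answer, please."}],
    extra_headers={
        "X-Autotune-Profile": "fast",      # override the profile for this call only
        "X-Conversation-Id": "a3f92c1b",   # attach this call to a persistent conversation
    },
)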
Endpoints
| Endpoint | Description |
|---|---|
| POST /v1/chat/completions | OpenAI-compatible, streaming or non-streaming |
| GET /v1/models | List all available models across all backends |
| GET /health | Server status, queue depth, memory pressure |
| GET /api/hardware | Live hardware snapshot |
| GET /api/profiles | Profile definitions |
| GET /api/running_models | All models currently in memory (Ollama + MLX), with RAM, ctx, quant, age |
| POST /api/conversations | Create a persistent conversation |
| GET /api/conversations | List conversations |
| GET /api/conversations/{id} | Get conversation + full message history |
| DELETE /api/conversations/{id} | Delete conversation |
| GET /api/conversations/{id}/export | Export as Markdown |
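For example, a quick health probe with the requests library (printing the raw JSON rather than assuming specific field names):

import requests

r = requests.get("http://127.0.0.1:8765/health", timeout=5)
r.raise_for_status()
print(r.json())   # server status, queue depth, memory pressure (see table above)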
Concurrency tuning
The server serialises inference by default (1 concurrent request, 8 queued). Requests beyond the queue limit receive HTTP 429 immediately. Tune with env vars:
AUTOTUNE_MAX_CONCURRENT=1 # parallel inference slots (default: 1)
AUTOTUNE_MAX_QUEUED=8 # max requests waiting for a slot (default: 8)
AUTOTUNE_WAIT_TIMEOUT=120 # seconds before a waiting request gets 429 (default: 120)
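Because over-limit requests are rejected immediately with HTTP 429, clients sharing one autotune server should back off and retry. A minimal sketch (the retry policy itself is illustrative, not part of autotune):

import time

import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")

def chat_with_backoff(messages, retries=5, delay=2.0):
    """Retry when autotune's FIFO queue is full (HTTP 429)."""
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model="qwen3:8b", messages=messages)
        except openai.RateLimitError:          # raised by the SDK on HTTP 429
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise RuntimeError("queue stayed full after all retries")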
Profiles
| Profile | Context | Temperature | KV precision | System QoS | Use when |
|---|---|---|---|---|---|
| fast ⚡ | 2,048 | 0.1 | Q8 | USER_INTERACTIVE | Quick lookups, autocomplete |
| balanced ⚖️ | 8,192 | 0.7 | F16 | USER_INITIATED | General chat, coding |
| quality ✨ | 32,768 | 0.8 | F16 | USER_INITIATED | Long-form writing, analysis |
autotune run with --profile auto (the default) analyses model size vs. available RAM and picks the profile automatically.
Telemetry and benchmarks
View past runs
autotune telemetry # last 20 inference runs
autotune telemetry --events # notable events: swap spikes, OOMs, slow tokens
Prove it works
autotune proof # default model (qwen3:8b)
autotune proof --model llama3.2:3b
autotune proof --with-noswap # include no-swap scenario analysis
autotune proof --with-cold # include cold-start phase
Runs your model twice on each prompt — once with plain Ollama, once with autotune — and reports an honest side-by-side comparison. Covers:
- First response time (cold KV allocation): load + prefill with raw ctx=4096 vs autotune's tighter ctx
- Warm TTFT: steady-state prefill per prompt type, including a ~700-token long-document test
- VRAM: actual Metal GPU memory measured via /api/ps
- Generation speed: shown unchanged (GPU-bound, autotune doesn't touch it — honesty matters)
- Swap (with --with-noswap): actual swap bytes observed + scenario table for different memory pressure levels
All timings come from Ollama's own internal timers, not estimated by Python.
Where data is stored
All runs persist to SQLite automatically:
- macOS: ~/Library/Application Support/autotune/autotune.db
- Linux: ~/.local/share/autotune/autotune.db
Benchmark results
What autotune actually improves: TTFT (time to first token). Throughput is GPU-bound and autotune does not change it. The numbers below are honest.
Methodology
Hardware: Apple M2, 16 GB unified memory (macOS Sequoia)
Model: qwen3:8b — Q4_K_M quantization, 5.2 GB weights, served via Ollama
Test script: scripts/stress_test.py — automated, no manual intervention
Scale: 63 inference calls across 18 distinct prompts and 6 test phases
Two configurations are compared throughout:
| Configuration | What it does |
|---|---|
| raw_ollama | Zero autotune involvement. Pure Ollama defaults: num_ctx unspecified (Ollama picks its own default, typically 4096), keep_alive=5m, no OS tuning. Direct HTTP to /v1/chat/completions. |
| autotune/balanced | The TTFT mechanisms described below, plus OS scheduling priority (USER_INITIATED QoS, GC disabled during inference). |
Metrics (psutil, sampled every 250 ms in a background thread):
- TTFT — wall time from HTTP request start to first streaming token byte
- Throughput (tok/s) — len(response) / 4 / elapsed_sec; same formula both sides
- CPU % — system-wide average across all cores during inference
- RAM Δ — used-after minus used-before; memory left behind per call
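A minimal client-side sketch of how the two headline metrics are defined (the 4-characters-per-token throughput estimate mirrors the formula above; it is an approximation, not a tokenizer count):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")

start = time.perf_counter()
first_token_at = None
text = ""
stream = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()   # TTFT: first streamed token
    text += delta
elapsed = time.perf_counter() - start

print(f"TTFT ≈ {(first_token_at - start) * 1000:.0f} ms")
print(f"throughput ≈ {len(text) / 4 / elapsed:.1f} tok/s")   # len(response) / 4 / elapsed_sec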
Results
Overall (18 prompts, 63 calls)
| Metric | Raw Ollama | autotune/balanced | Δ |
|---|---|---|---|
| TTFT | 626 ms | 349 ms | −44% |
| Throughput | 35.5 tok/s | 34.8 tok/s | −2% (noise) |
| CPU avg | 10.8% | 12.4% | +15% (wrapper overhead) |
| RAM Δ | +0.76 GB | +0.78 GB | neutral |
By scenario
| Scenario | Raw Ollama TTFT | autotune TTFT | Improvement |
|---|---|---|---|
| General mix (10 prompts × 2 runs) | 421 ms | 404 ms | −4% |
| Sustained back-to-back (6 calls, no pause) | 282 ms | 265 ms | −6% |
| Large-context input (>1 000 tokens) | 2 015 ms | 261 ms | −87% |
| Session continuity (keep_alive test) | 1 227 ms | 244 ms | −80% |
The large-context and session tests show where autotune's value is clearest and most consistent. The general-mix improvement is real but modest.
Where the TTFT reduction comes from
Four mechanisms work together, all owned by autotune/ttft/optimizer.py:
1. Dynamic num_ctx
Ollama allocates the entire KV cache before generating a single token. With the default num_ctx=4096 it allocates 4 096 token slots regardless of your actual input size — KV allocation is proportional to num_ctx, not to actual usage.
autotune computes the minimum that fits the request:
num_ctx = clamp(input_tokens + max_new_tokens + 256, min=512, max=profile_max)
Example with a 60-token question on the balanced profile:
| Configuration | num_ctx | KV allocation (qwen2.5-coder:14b, F16) |
|---|---|---|
| Raw Ollama | 4 096 | ~402 MB |
| autotune | 1 340 | ~131 MB |
A smaller KV allocation means less memory to initialise before the first token, and that initialisation is part of what TTFT measures. The −87% on large-context prompts is this mechanism at work: raw Ollama's fixed 4 096-token context can barely fit a 1 000-token input, while autotune right-sizes the allocation to the actual content.
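A minimal sketch of that sizing rule (function and variable names are illustrative, not autotune's internal API):

def compute_num_ctx(input_tokens: int, max_new_tokens: int, profile_max: int) -> int:
    """num_ctx = clamp(input_tokens + max_new_tokens + 256, 512, profile_max)."""
    needed = input_tokens + max_new_tokens + 256   # +256 headroom for template tokens
    return max(512, min(needed, profile_max))

# 60-token question on the balanced profile (max 8,192), 1,024 new tokens:
print(compute_num_ctx(60, 1024, 8192))   # 1340 — matches the table above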
2. keep_alive = -1
Ollama's default is keep_alive=5m — after five minutes idle the model is fully unloaded from unified memory. The next request pays a full model reload (1–4 s on a 5 GB model; longer for larger models).
autotune always sends keep_alive="-1m" (any negative Go duration = keep model resident indefinitely).
The cold-start test in the benchmark forces this condition explicitly: the raw-Ollama path unloads the model between calls, the autotune path does not.
Raw: call 1 → reload (1 304 ms TTFT) call 2 → reload (1 189 ms) call 3 → reload (1 187 ms)
autotune: call 1 → load ( 248 ms TTFT) call 2 → warm ( 242 ms) call 3 → warm ( 243 ms)
3. num_batch and flash_attn
autotune sets num_batch=1024 (Ollama default: 512). This doubles the number of prompt tokens processed per GPU pass during prefill. For a 700-token prompt, this reduces Metal kernel dispatches from 2 to 1 — directly cutting prefill time. Short prompts (<512 tokens) are unaffected since llama.cpp caps the actual batch at min(num_batch, remaining_tokens).
flash_attn=true is always enabled. Flash attention is mathematically equivalent to standard attention but reduces peak KV activation memory by fusing the softmax and matmul operations. Models that don't support it silently ignore the flag.
Under critical RAM pressure (≥93%), num_batch drops to 256. This piggybacks on the forced model reload that already happens at this tier — the reduced batch lowers the peak activation tensor footprint during that reload.
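Taken together, the per-request options sent to Ollama (per the mechanisms above) look roughly like this — a sketch of the request body, not a dump of autotune's code:

payload = {
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "..."}],
    "keep_alive": "-1m",           # keep the model resident indefinitely (mechanism 2)
    "options": {
        "num_ctx": 1340,           # right-sized per request (mechanism 1)
        "num_batch": 1024,         # 2× Ollama's default prefill batch (mechanism 3)
        "flash_attn": True,        # ignored by models that don't support it
    },
}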
4. num_keep (system-prompt prefix caching)
When a system prompt is present, autotune passes num_keep = <system_prompt_tokens> to Ollama. Ollama pins those tokens in the KV cache and never re-evaluates them on subsequent turns. Raw Ollama re-processes the full prompt from scratch on every call.
For a 120-token system prompt on a 30-turn conversation: autotune saves 120 tokens of attention computation on every single turn.
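A minimal sketch of deriving num_keep from the system prompt, using the same rough 4-characters-per-token estimate as elsewhere in this document (names are illustrative):

def system_prompt_num_keep(system_prompt: str) -> int:
    """Estimate the system prompt's token count so Ollama can pin it via num_keep."""
    return len(system_prompt) // 4   # rough 4-chars-per-token estimate

options = {"num_keep": system_prompt_num_keep("You are a concise coding assistant.")}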
What autotune does NOT improve
| Metric | Why it doesn't change |
|---|---|
| Throughput (tok/s) | Token generation on Apple Silicon runs on the Metal GPU. No software change above the Metal layer affects how fast the GPU generates tokens. autotune measured −2% (within noise). |
| RAM usage | At 16 GB with a 2.5 GB model there is no memory pressure, so the pressure guard never activates. RAM impact is neutral. |
| CPU % | The autotune wrapper adds Python overhead (KV option computation, hardware tuner, psutil calls) that slightly increases CPU%. Measured +15% CPU vs raw Ollama. |
Prompt-by-prompt TTFT
| Prompt | Raw (ms) | autotune (ms) | Δ |
|---|---|---|---|
| simple factual | 257 | 310 | +21% |
| code (fibonacci) | 383 | 345 | −10% |
| reasoning chain | 378 | 368 | −3% |
| code with long system prompt | 514 | 373 | −27% |
| code review (large input) | 892 | 845 | −5% |
| explain transformer | 334 | 318 | −5% |
| multi-turn follow-up | 417 | 399 | −4% |
| math proof | 322 | 312 | −3% |
| system design | 407 | 405 | ~0% |
| creative technical | 352 | 361 | +3% |
| large context (pressure test) | 2 015 | 261 | −87% |
| cold-start / warm session | 1 227 | 244 | −80% |
The simple factual prompt shows autotune slightly slower on run 1 because TTFTOptimizer has a small warm-up cost on the first call (GC collect, psutil snapshot, option computation). By run 2 of the same session this vanishes. The large-context and cold-start improvements are consistent across all runs.
Limitations
- Single model. All numbers are from qwen3:8b (5.2 GB, Q4_K_M) on Apple M2 16 GB. TTFT gains from keep_alive=-1m will be proportionally larger on bigger models — a 9 GB model has a longer reload penalty than a 5 GB one.
- num_ctx trade-off. Smaller context means fewer tokens of conversation history fit. For short sessions this is pure win. For very long conversations autotune may need to trim history earlier (handled by autotune.context.ContextWindow).
- Token estimation. Throughput uses len(response) / 4 / elapsed_sec. The same formula is used for both configurations so the comparison is fair, but absolute tok/s may differ from Ollama's internal tokenizer count.
How dynamic KV sizing works
Ollama allocates the entire KV cache upfront before generating a single token. If num_ctx=8192, it allocates memory for 8,192 tokens even if your conversation is 50 tokens.
autotune computes the minimum num_ctx each request actually needs:
num_ctx = clamp(input_tokens + max_new_tokens + 256, 512, profile_max)
For a short conversation on the balanced profile (max 8,192):
- Input: ~22 tokens → num_ctx = 22 + 1,024 + 256 = 1,302
- Savings on qwen2.5-coder:14b: 8,192 → 1,302 tokens = ~677 MB of KV cache freed
num_ctx grows naturally as the conversation grows since the full history is included in every request.
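The ~677 MB figure follows from the per-token KV cost implied earlier (~402 MB at num_ctx = 4 096 for qwen2.5-coder:14b at F16):

kv_mb_per_token = 402 / 4096               # ≈ 0.098 MB of KV cache per token slot (F16)
freed_mb = (8192 - 1302) * kv_mb_per_token
print(f"{freed_mb:.0f} MB freed")          # ≈ 676 MB — the ~677 MB quoted above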
Conversation memory and recall
autotune records every conversation turn to a local SQLite database (~/.autotune/recall.db) using both full-text search and vector similarity. Memory is always-on — no flags required.
What it does
- Automatic context injection — at the start of each chat session, autotune searches past conversations for topics similar to your current model and system prompt. Relevant facts are injected as a silent system message (only shown when --verbose is set). The 0.38 cosine similarity threshold filters out irrelevant memories (a minimal sketch of this check follows the list).
- Session linking — conversations are stored with a unique ID shown in the chat header. Use --conv-id <id> to resume an exact past session.
- In-chat recall — use /recall to browse recent sessions with dates, model names, and first-turn snippets. Use /recall search <topic> for semantic search across all past conversations. Both commands offer a numbered prompt to resume the selected session with full context restored.
- Model change detection — the background watcher polls Ollama every 30 s; if a model unloads unexpectedly (crash, OOM), the chat interface notifies you immediately.
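A minimal sketch of that similarity gate (the embedding step is omitted and the function names are illustrative; only the 0.38 threshold comes from autotune):

import numpy as np

RECALL_THRESHOLD = 0.38   # memories below this cosine similarity are not injected

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_relevant(query_vec: np.ndarray, memory_vec: np.ndarray) -> bool:
    return cosine_similarity(query_vec, memory_vec) >= RECALL_THRESHOLD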
Storage
| Path | Contents |
|---|---|
| ~/.autotune/recall.db | FTS5 + float32 vectors; conversation turns, extracted facts |
| ~/Library/Application Support/autotune/autotune.db | Hardware telemetry, run observations (macOS) |
| ~/.local/share/autotune/autotune.db | Same (Linux) |
Context management tiers
autotune monitors history_tokens / effective_budget and selects a strategy automatically:
< 55% FULL — all turns verbatim, nothing dropped
55–75% RECENT+FACTS — last 8 turns + structured facts block for older turns
75–90% COMPRESSED — last 6 turns (lightly compressed) + compact summary
> 90% EMERGENCY — last 4 turns (aggressively compressed) + one-line summary
Low-value chatter ("ok", "thanks") is dropped first. Code blocks, stack traces, and technical content are always preserved. All cutoffs happen at sentence or paragraph boundaries — never mid-sentence.
The facts block injected for older turns is extracted deterministically (no LLM call) and includes accomplishments, active decisions, key facts, errors, and topics covered.
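A minimal sketch of the tier selection, using the thresholds above (names are illustrative, not autotune's internal API):

def select_tier(history_tokens: int, effective_budget: int) -> str:
    """Pick a context-management strategy from the history/budget ratio."""
    ratio = history_tokens / effective_budget
    if ratio < 0.55:
        return "FULL"            # all turns verbatim
    if ratio < 0.75:
        return "RECENT+FACTS"    # last 8 turns + facts block for older turns
    if ratio < 0.90:
        return "COMPRESSED"      # last 6 turns + compact summary
    return "EMERGENCY"           # last 4 turns + one-line summary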
Architecture
autotune/
│
├── ttft/ ← TTFT optimisation layer (start here for latency work)
│ ├── optimizer.py # TTFTOptimizer: dynamic num_ctx + keep_alive + num_keep
│ │ # + flash_attn + num_batch + NoSwapGuard integration
│ └── __init__.py # Public API: TTFTOptimizer, KEEP_ALIVE_FOREVER
│
├── api/ Inference pipeline
│ ├── profiles.py # fast / balanced / quality profile definitions
│ ├── server.py # FastAPI server — OpenAI-compatible /v1 + FIFO queue
│ ├── chat.py # Terminal chat REPL (adaptive-RAM + KV-manager + ctx-optimizer)
│ │ # /recall + /recall search, live tok/s TTFT stats, model watcher
│ ├── running_models.py # Cross-backend model visibility (Ollama + MLX state file)
│ ├── conversation.py # SQLite-backed persistent conversation state
│ ├── ctx_utils.py # Token estimation + compute_num_ctx (used by ttft/)
│ ├── kv_manager.py # KV options builder: flash_attn, num_batch, pressure tiers
│ ├── hardware_tuner.py # OS-level tuning: nice, QoS class, GC, CPU governor
│ ├── model_selector.py # Pre-flight fit analysis: weights + KV + overhead
│ └── backends/ # Ollama, LM Studio, MLX, HuggingFace Inference API
│
├── context/ Context window management for long conversations
│ ├── window.py # ContextWindow orchestrator
│ ├── budget.py # Tier thresholds (FULL → RECENT+FACTS → COMPRESSED → EMERGENCY)
│ ├── classifier.py # Message value scoring (0.0 chatter → 1.0 technical)
│ ├── compressor.py # Tool output and long-content compression
│ └── extractor.py # Deterministic fact extraction for summary blocks
│
├── recall/ ← Conversation memory system
│ ├── store.py # SQLite WAL-mode: FTS5 full-text + float32 cosine vectors
│ ├── manager.py # save_conversation / get_context_for / list_conversations
│ └── extractor.py # Chunk extraction + conversation value scoring
│
├── bench/ Benchmarking framework
│ └── runner.py # run_raw_ollama / run_bench_ollama_only / BenchResult
│
├── db/ Persistence
│ └── store.py # SQLite: models, hardware, run_observations, telemetry_events
│
├── hardware/ Hardware detection
│ ├── profiler.py # CPU/GPU/RAM detection (psutil + py-cpuinfo)
│ └── ram_advisor.py # Real-time RAM pressure advice and swap risk scoring
│
├── memory/ Memory estimation + no-swap guarantee
│ ├── estimator.py # Model weights + KV cache + runtime overhead
│ └── noswap.py # NoSwapGuard: adjusts num_ctx/KV to guarantee no swap
│
├── models/ Model registry
│ └── registry.py # 9 OSS models with real MMLU/HumanEval/GSM8K scores
│
├── config/ Recommendation engine
│ └── generator.py # Multi-objective scoring: stability × speed × quality × context
│
└── cli.py Entry point (Click)
License
MIT