Automatic local-LLM inference configuration recommender for Ollama, LM Studio, and MLX

autotune — Local LLM Inference Optimizer

A middleware layer that makes locally-running LLMs faster, lighter, and smarter on your own hardware — with zero changes to your existing setup.

Works with Ollama, LM Studio, and MLX (Apple Silicon native) backends out of the box.


What it does

autotune sits between your application and the local LLM backend. It automatically:

| Feature | What happens |
| --- | --- |
| Dynamic KV sizing | Computes the exact num_ctx each request needs instead of allocating the profile max; typically 4–8× less KV cache memory |
| KV prefix caching | Pins system-prompt tokens in Ollama's KV cache via num_keep so they're never re-evaluated each turn |
| Adaptive KV precision | Downgrades KV cache from F16 → Q8 under memory pressure (80% → −10% ctx, 88% → −25% ctx + Q8, 93% → −50% ctx + Q8) |
| Model keep-alive | Sets keep_alive=-1m so the model stays loaded in unified memory between turns, eliminating reload latency |
| Flash attention | Enables flash_attn=true on every request; reduces peak KV activation memory during attention computation with zero quality impact |
| Prefill batching | Sets num_batch=1024 (2× Ollama default) to reduce Metal kernel dispatches for long prompts; drops to 256 under critical RAM pressure |
| Multi-tier context management | Trims conversation history at token budget thresholds with no mid-sentence cuts |
| Inference queue | FIFO queue (default: 1 concurrent, 8 waiting) with HTTP 429 back-pressure; prevents parallel inference from thrashing memory |
| Profile-based optimization | fast / balanced / quality profiles tune temperature, context length, KV precision, and OS QoS class |
| OpenAI-compatible API | Serves an OpenAI-compatible API at localhost:8765/v1; works with any OpenAI SDK |
| MLX backend (Apple Silicon) | On M-series Macs, routes inference to MLX-LM (native Metal GPU kernels, unified memory) |
| Persistent conversation memory | Saves every conversation to SQLite; automatically injects relevant past context at session start; searchable by topic |
| Hardware telemetry | Samples RAM/Swap/CPU every 250 ms, persists structured metrics to SQLite |

Quickstart

1. Prerequisites

Install Ollama and pull at least one model:

ollama pull qwen3:8b           # 5.2 GB — best general model for 16 GB laptops
ollama pull gemma4             # ~5.8 GB — Google's newest model, multimodal, 128k context
ollama pull qwen2.5-coder:14b  # 9 GB — top coding model for 24+ GB RAM

2. Install autotune

git clone https://github.com/tanavc1/local-llm-autotune.git
cd local-llm-autotune
pip install -e .

Requirements: Python 3.10+, Ollama running locally.

3. Check your hardware

autotune hardware

Shows CPU, RAM, GPU backend, and the effective memory budget autotune uses when selecting models.

4. See what models fit

autotune ls

Scores every locally downloaded Ollama model against your hardware — shows whether it fits comfortably, has swap risk, or will OOM. Recommends a profile for each.

5. Start chatting

autotune run — pre-flight memory analysis + optimized chat. Checks whether the model fits in RAM, picks the right profile, then opens a chat session:

autotune run qwen3:8b

autotune chat — skip the pre-flight and go straight to optimized chat (adaptive-RAM monitoring, KV-manager, and context-optimizer are always active):

autotune chat --model qwen3:8b                   # balanced (default)
autotune chat --model qwen3:8b --profile fast    # fastest responses
autotune chat --model qwen3:8b --profile quality # largest context
autotune chat --model qwen3:8b --no-swap         # guarantee no macOS swap

Set a system prompt:

autotune chat --model qwen3:8b --system "You are a concise coding assistant."

Resume a previous conversation (the ID is shown in the chat header):

autotune chat --model qwen3:8b --conv-id a3f92c1b

6. Check what's running

autotune ps

Shows every model currently loaded in memory — across both Ollama and the MLX backend — with RAM usage, context size, quantization, and time loaded.


Model recommendations by hardware

autotune works with any Ollama-compatible model. Here are our current picks for each hardware tier:

| RAM / use case | Recommended model | Size | Why |
| --- | --- | --- | --- |
| 8 GB | qwen3:4b | ~2.6 GB | Best 4B model available; hybrid thinking mode |
| 16 GB | qwen3:8b | ~5.2 GB | Near-frontier quality; best 8B as of 2026 |
| 16 GB | gemma4 | ~5.8 GB | Google's newest; multimodal, 128k context |
| 24 GB | qwen3:14b | ~9.0 GB | Excellent reasoning; fits well with headroom |
| 32 GB | qwen3:30b-a3b | ~17 GB | MoE: flagship quality at 7B inference cost |
| 64 GB+ | qwen3:32b | ~20 GB | Top dense open model |
| Coding | qwen2.5-coder:14b | ~9.0 GB | Best open coding model for 24 GB machines |
| Reasoning | deepseek-r1:14b | ~9.0 GB | Chain-of-thought; strong math and logic |

Run autotune ls to see how each downloaded model scores against your specific hardware.


Chat commands

Once inside a chat session, these slash commands are available:

| Command | What it does |
| --- | --- |
| /help | Show available commands |
| /new | Start a new conversation (keeps model and profile) |
| /history | Show the full conversation history |
| /profile fast\|balanced\|quality | Switch profile mid-conversation |
| /model <id> | Switch to a different model |
| /system <text> | Set or replace the system prompt |
| /export | Export conversation to a Markdown file |
| /metrics | Show session performance stats (tok/s, TTFT, request count) |
| /backends | Show which backends are running (Ollama, LM Studio, HF API) |
| /models | List all locally available models |
| /recall | Browse past conversations with dates and snippets; pick one to resume |
| /recall search <query> | Search past conversations by topic; finds semantically related sessions |
| /pull <model> | Pull a model via Ollama without leaving chat |
| /delete | Delete the current conversation from history |
| /quit | Exit (also Ctrl-C) |

Apple Silicon (MLX acceleration)

On M-series Macs, install the MLX backend to use native Metal GPU kernels:

pip install -e ".[mlx]"           # install mlx-lm
autotune mlx pull qwen3:8b        # download MLX-quantized model from mlx-community
autotune chat --model qwen3:8b    # automatically routes to MLX on Apple Silicon

MLX is activated automatically when running on Apple Silicon — no configuration needed. autotune resolves the Ollama model name to the corresponding mlx-community HuggingFace repo.

autotune mlx list                 # show locally cached MLX models
autotune mlx resolve llama3.2     # check which MLX model ID would be used

API server (OpenAI-compatible)

Run autotune as a server and point any existing OpenAI client at it:

autotune serve
# Listening at http://127.0.0.1:8765/v1

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)

Autotune-specific headers

Pass these with any request to override behavior per-call:

X-Autotune-Profile: fast        # override profile (fast | balanced | quality)
X-Conversation-Id: a3f92c1b     # attach to a persistent conversation
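
With the OpenAI Python SDK, per-request headers can be passed through the `extra_headers` argument of `chat.completions.create`. The sketch below builds the two headers listed above; the helper name `autotune_headers` and its validation are illustrative assumptions, not part of autotune.

```python
from typing import Dict, Optional

# Profile names taken from the header comment above.
VALID_PROFILES = {"fast", "balanced", "quality"}


def autotune_headers(profile: Optional[str] = None,
                     conversation_id: Optional[str] = None) -> Dict[str, str]:
    """Build per-request override headers (hypothetical helper)."""
    headers: Dict[str, str] = {}
    if profile is not None:
        if profile not in VALID_PROFILES:
            raise ValueError(f"unknown profile: {profile!r}")
        headers["X-Autotune-Profile"] = profile
    if conversation_id is not None:
        headers["X-Conversation-Id"] = conversation_id
    return headers


# Usage with the client from the example above (needs `autotune serve` running):
# response = client.chat.completions.create(
#     model="qwen3:8b",
#     messages=[{"role": "user", "content": "Hello!"}],
#     extra_headers=autotune_headers(profile="fast", conversation_id="a3f92c1b"),
# )
```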

Endpoints

| Endpoint | Description |
| --- | --- |
| POST /v1/chat/completions | OpenAI-compatible, streaming or non-streaming |
| GET /v1/models | List all available models across all backends |
| GET /health | Server status, queue depth, memory pressure |
| GET /api/hardware | Live hardware snapshot |
| GET /api/profiles | Profile definitions |
| GET /api/running_models | All models currently in memory (Ollama + MLX), with RAM, ctx, quant, age |
| POST /api/conversations | Create a persistent conversation |
| GET /api/conversations | List conversations |
| GET /api/conversations/{id} | Get conversation + full message history |
| DELETE /api/conversations/{id} | Delete conversation |
| GET /api/conversations/{id}/export | Export as Markdown |

Concurrency tuning

The server serialises inference by default (1 concurrent request, 8 queued). Requests beyond the queue limit receive HTTP 429 immediately. Tune with env vars:

AUTOTUNE_MAX_CONCURRENT=1    # parallel inference slots (default: 1)
AUTOTUNE_MAX_QUEUED=8        # max requests waiting for a slot (default: 8)
AUTOTUNE_WAIT_TIMEOUT=120    # seconds before a waiting request gets 429 (default: 120)
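
To make the queueing behaviour concrete, here is a minimal asyncio sketch of a bounded gate with the same three knobs. It is an illustration under stated assumptions, not autotune's server code: the names `InferenceGate` and `QueueFull` are hypothetical, and a real server would translate `QueueFull` into an HTTP 429 response.

```python
import asyncio


class QueueFull(Exception):
    """Raised when a request should be rejected (HTTP 429 in a real server)."""


class InferenceGate:
    """Hypothetical gate: N concurrent jobs, M waiting, timeout while waiting."""

    def __init__(self, max_concurrent: int = 1, max_queued: int = 8,
                 wait_timeout: float = 120.0) -> None:
        self._slots = asyncio.Semaphore(max_concurrent)
        self._capacity = max_concurrent + max_queued
        self._wait_timeout = wait_timeout
        self._in_flight = 0  # running + waiting requests

    async def run(self, job):
        # Check and increment happen without an await in between,
        # so the admission decision is atomic within one event loop.
        if self._in_flight >= self._capacity:
            raise QueueFull("queue limit reached")
        self._in_flight += 1
        try:
            try:
                await asyncio.wait_for(self._slots.acquire(), self._wait_timeout)
            except asyncio.TimeoutError:
                raise QueueFull("timed out waiting for a slot") from None
            try:
                return await job()
            finally:
                self._slots.release()
        finally:
            self._in_flight -= 1


# Usage inside a coroutine (call_backend is a placeholder for the real request):
# gate = InferenceGate(max_concurrent=1, max_queued=8, wait_timeout=120.0)
# result = await gate.run(lambda: call_backend(request))
```

Waiters on `asyncio.Semaphore` are woken in arrival order, which gives the FIFO behaviour described above.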

Profiles

| Profile | Context | Temperature | KV precision | System QoS | Use when |
| --- | --- | --- | --- | --- | --- |
| fast | 2,048 | 0.1 | Q8 | USER_INTERACTIVE | Quick lookups, autocomplete |
| balanced ⚖️ | 8,192 | 0.7 | F16 | USER_INITIATED | General chat, coding |
| quality | 32,768 | 0.8 | F16 | USER_INITIATED | Long-form writing, analysis |

autotune run with --profile auto (the default) analyses model size vs. available RAM and picks the profile automatically.


Telemetry and benchmarks

View past runs

autotune telemetry               # last 20 inference runs
autotune telemetry --events      # notable events: swap spikes, OOMs, slow tokens

Prove it works

autotune proof                         # default model (qwen3:8b)
autotune proof --model llama3.2:3b
autotune proof --with-noswap           # include no-swap scenario analysis
autotune proof --with-cold             # include cold-start phase

Runs your model twice on each prompt — once with plain Ollama, once with autotune — and reports an honest side-by-side comparison. Covers:

  • First response time (cold KV allocation): load + prefill with raw ctx=4096 vs autotune's tighter ctx
  • Warm TTFT: steady-state prefill per prompt type, including a ~700-token long-document test
  • VRAM: actual Metal GPU memory measured via /api/ps
  • Generation speed: shown unchanged (GPU-bound, autotune doesn't touch it — honesty matters)
  • Swap (with --with-noswap): actual swap bytes observed + scenario table for different memory pressure levels

All timings come from Ollama's own internal timers, not estimated by Python.

Where data is stored

All runs persist to SQLite automatically:

  • macOS: ~/Library/Application Support/autotune/autotune.db
  • Linux: ~/.local/share/autotune/autotune.db

Benchmark results

What autotune actually improves: TTFT (time to first token). Throughput is GPU-bound and autotune does not change it. The numbers below are honest.

Methodology

Hardware: Apple M2, 16 GB unified memory (macOS Sequoia)
Model: qwen3:8b — Q4_K_M quantization, 5.2 GB weights, served via Ollama
Test script: scripts/stress_test.py — automated, no manual intervention
Scale: 63 inference calls across 18 distinct prompts and 6 test phases

Two configurations are compared throughout:

| Configuration | What it does |
| --- | --- |
| raw_ollama | Zero autotune involvement. Pure Ollama defaults: num_ctx unspecified (Ollama picks its own default, typically 4096), keep_alive=5m, no OS tuning. Direct HTTP to /v1/chat/completions. |
| autotune/balanced | The TTFT mechanisms described below, plus OS scheduling priority (USER_INITIATED QoS, GC disabled during inference). |

Metrics (CPU and RAM read via psutil, sampled every 250 ms in a background thread):

  • TTFT — wall time from HTTP request start to first streaming token byte
  • Throughput (tok/s) — len(response) / 4 / elapsed_sec; same formula both sides
  • CPU % — system-wide average across all cores during inference
  • RAM Δ — used after minus used before; memory left behind per call
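
The throughput figure is an estimate based on character count rather than a tokenizer. A minimal sketch of that formula follows; the function name `estimate_throughput` is an assumption for illustration, and the 4-characters-per-token ratio is the one stated above.

```python
CHARS_PER_TOKEN = 4  # heuristic from the formula above, not a tokenizer count


def estimate_throughput(response: str, elapsed_sec: float) -> float:
    """Approximate tok/s as len(response) / 4 / elapsed_sec."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    return len(response) / CHARS_PER_TOKEN / elapsed_sec


# A 400-character response produced in 2 seconds counts as ~100 tokens.
rate = estimate_throughput("x" * 400, 2.0)
```

Because both configurations go through the same estimate, the comparison stays fair even where the absolute value drifts from the real token count.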

Results

Overall (18 prompts, 63 calls)

| Metric | Raw Ollama | autotune/balanced | Δ |
| --- | --- | --- | --- |
| TTFT | 626 ms | 349 ms | −44% |
| Throughput | 35.5 tok/s | 34.8 tok/s | −2% (noise) |
| CPU avg | 10.8% | 12.4% | +15% (wrapper overhead) |
| RAM Δ | +0.76 GB | +0.78 GB | neutral |

By scenario

| Scenario | Raw Ollama TTFT | autotune TTFT | Improvement |
| --- | --- | --- | --- |
| General mix (10 prompts × 2 runs) | 421 ms | 404 ms | −4% |
| Sustained back-to-back (6 calls, no pause) | 282 ms | 265 ms | −6% |
| Large-context input (>1 000 tokens) | 2 015 ms | 261 ms | −87% |
| Session continuity (keep_alive test) | 1 227 ms | 244 ms | −80% |

The large-context and session tests show where autotune's value is clearest and most consistent. The general-mix improvement is real but modest.


Where the TTFT reduction comes from

Four mechanisms work together, all owned by autotune/ttft/optimizer.py:

1. Dynamic num_ctx

Ollama allocates the entire KV cache before generating a single token. With the default num_ctx=4096 it allocates 4 096 token slots regardless of your actual input size — KV allocation is proportional to num_ctx, not to actual usage.

autotune computes the minimum that fits the request:

num_ctx = clamp(input_tokens + max_new_tokens + 256,  min=512,  max=profile_max)

Example with a 60-token question on the balanced profile:

| Configuration | num_ctx | KV allocation (qwen2.5-coder:14b F16) |
| --- | --- | --- |
| Raw Ollama | 4 096 | ~402 MB |
| autotune | 1 340 | ~131 MB |

A smaller KV allocation means less memory to initialise before the first token, and that initialisation time is part of what TTFT measures. The −87% on large-context prompts is this mechanism at work: raw Ollama's 4 096 context can barely fit a 1 000-token input, while autotune right-sizes to the actual content.
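
Since KV allocation is proportional to num_ctx, the 60-token example can be checked with simple scaling. The sketch below takes the ~402 MB at 4 096 tokens quoted above as its only reference point; the function name `estimate_kv_mb` is hypothetical, and the per-token cost is derived from that figure rather than from the model architecture.

```python
# Reference point quoted above for qwen2.5-coder:14b with an F16 KV cache.
REFERENCE_CTX = 4096
REFERENCE_KV_MB = 402.0


def estimate_kv_mb(num_ctx: int) -> float:
    """Scale the reference allocation linearly with num_ctx."""
    return REFERENCE_KV_MB * num_ctx / REFERENCE_CTX


raw_mb = estimate_kv_mb(4096)    # the Ollama default in this example
tuned_mb = estimate_kv_mb(1340)  # 60 input + 1 024 new + 256 headroom
saved_mb = raw_mb - tuned_mb
```

Under this linear assumption the right-sized allocation lands within a megabyte of the ~131 MB figure, roughly a third of the default.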

2. keep_alive = -1m

Ollama's default is keep_alive=5m — after five minutes idle the model is fully unloaded from unified memory. The next request pays a full model reload (1–4 s on a 5 GB model; longer for larger models).

autotune always sends keep_alive="-1m" (any negative Go duration = keep model resident indefinitely).

The cold-start test in the benchmark forces this condition explicitly: the raw-Ollama path unloads the model between calls, the autotune path does not.

Raw:      call 1 → reload (1 304 ms TTFT)   call 2 → reload (1 189 ms)   call 3 → reload (1 187 ms)
autotune: call 1 → load   ( 248 ms TTFT)   call 2 → warm  (  242 ms)   call 3 → warm  (  243 ms)

3. num_batch and flash_attn

autotune sets num_batch=1024 (Ollama default: 512). This doubles the number of prompt tokens processed per GPU pass during prefill. For a 700-token prompt, this reduces Metal kernel dispatches from 2 to 1 — directly cutting prefill time. Short prompts (<512 tokens) are unaffected since llama.cpp caps the actual batch at min(num_batch, remaining_tokens).

flash_attn=true is always enabled. Flash attention is mathematically equivalent to standard attention but reduces peak KV activation memory by fusing the softmax and matmul operations. Models that don't support it silently ignore the flag.

Under critical RAM pressure (≥93%), num_batch drops to 256. This piggybacks on the forced model reload that already happens at this tier — the reduced batch lowers the peak activation tensor footprint during that reload.
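
The batching arithmetic and the pressure tiers can be sketched together. This is an illustration under assumptions, not the logic in kv_manager.py: the function names are hypothetical, the thresholds are treated as inclusive, the base KV precision is assumed to be F16, and the context reductions are the ones listed in the feature table (80% → −10%, 88% → −25% + Q8, 93% → −50% + Q8).

```python
import math
from typing import Tuple

DEFAULT_NUM_BATCH = 1024   # value autotune sets, per the text above
CRITICAL_NUM_BATCH = 256   # value used at >= 93% RAM pressure


def prefill_passes(prompt_tokens: int, num_batch: int) -> int:
    """Number of prefill batches needed for a prompt of this length."""
    return math.ceil(prompt_tokens / num_batch)


def pressure_adjustments(ram_percent: float, base_ctx: int) -> Tuple[int, str, int]:
    """Return (num_ctx, kv_precision, num_batch) for a RAM usage percentage."""
    if ram_percent >= 93:
        return base_ctx * 50 // 100, "Q8", CRITICAL_NUM_BATCH
    if ram_percent >= 88:
        return base_ctx * 75 // 100, "Q8", DEFAULT_NUM_BATCH
    if ram_percent >= 80:
        return base_ctx * 90 // 100, "F16", DEFAULT_NUM_BATCH
    return base_ctx, "F16", DEFAULT_NUM_BATCH


# A 700-token prompt: two passes at Ollama's default 512, one pass at 1024.
passes_default = prefill_passes(700, 512)
passes_tuned = prefill_passes(700, DEFAULT_NUM_BATCH)
```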

4. num_keep (system-prompt prefix caching)

When a system prompt is present, autotune passes num_keep = <system_prompt_tokens> to Ollama. Ollama pins those tokens in the KV cache and never re-evaluates them on subsequent turns. Raw Ollama re-processes the full prompt from scratch on every call.

For a 120-token system prompt on a 30-turn conversation: autotune saves 120 tokens of attention computation on every single turn.
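
Put together, the options described in this section map onto an Ollama chat request roughly as follows. This is a sketch rather than the output of TTFTOptimizer: the function name `build_chat_payload` is hypothetical, it assumes the caller already knows the system-prompt token count, and flash_attn is left out because the sketch makes no assumption about how autotune passes it.

```python
from typing import Any, Dict, List


def build_chat_payload(model: str,
                       messages: List[Dict[str, str]],
                       num_ctx: int,
                       system_prompt_tokens: int = 0) -> Dict[str, Any]:
    """Assemble a request body for Ollama's /api/chat (illustrative only)."""
    options: Dict[str, Any] = {
        "num_ctx": num_ctx,   # right-sized context window
        "num_batch": 1024,    # prefill batch size from mechanism 3
    }
    if system_prompt_tokens > 0:
        # Pin the system-prompt prefix in the KV cache.
        options["num_keep"] = system_prompt_tokens
    return {
        "model": model,
        "messages": messages,
        "keep_alive": "-1m",  # negative duration: keep the model resident
        "options": options,
    }


payload = build_chat_payload(
    "qwen3:8b",
    [{"role": "system", "content": "You are a concise coding assistant."},
     {"role": "user", "content": "Hello!"}],
    num_ctx=1340,
    system_prompt_tokens=120,
)
```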


What autotune does NOT improve

| Metric | Why it doesn't change |
| --- | --- |
| Throughput (tok/s) | Token generation on Apple Silicon runs on the Metal GPU. No software change above the Metal layer affects how fast the GPU generates tokens. The benchmark measured −2% (within noise). |
| RAM usage | At 16 GB with a 5.2 GB model there is no memory pressure, so the pressure guard never activates. RAM impact is neutral. |
| CPU % | The autotune wrapper adds Python overhead (KV option computation, hardware tuner, psutil calls) that slightly increases CPU%. Measured +15% CPU vs raw Ollama. |

Prompt-by-prompt TTFT

| Prompt | Raw (ms) | autotune (ms) | Δ |
| --- | --- | --- | --- |
| simple factual | 257 | 310 | +21% |
| code (fibonacci) | 383 | 345 | −10% |
| reasoning chain | 378 | 368 | −3% |
| code with long system prompt | 514 | 373 | −27% |
| code review (large input) | 892 | 845 | −5% |
| explain transformer | 334 | 318 | −5% |
| multi-turn follow-up | 417 | 399 | −4% |
| math proof | 322 | 312 | −3% |
| system design | 407 | 405 | ~0% |
| creative technical | 352 | 361 | +3% |
| large context (pressure test) | 2 015 | 261 | −87% |
| cold-start / warm session | 1 227 | 244 | −80% |

The simple factual prompt shows autotune slightly slower on run 1 because TTFTOptimizer has a small warm-up cost on the first call (GC collect, psutil snapshot, option computation). By run 2 of the same session this vanishes. The large-context and cold-start improvements are consistent across all runs.


Limitations

  • Single model. All numbers are from qwen3:8b (5.2 GB, Q4_K_M) on Apple M2 16 GB. TTFT gains from keep_alive=-1m will be proportionally larger on bigger models — a 9 GB model has a longer reload penalty than a 5 GB one.
  • num_ctx trade-off. Smaller context means fewer tokens of conversation history fit. For short sessions this is a pure win. For very long conversations autotune may need to trim history earlier (handled by autotune.context.ContextWindow).
  • Token estimation. Throughput uses len(response) / 4 / elapsed_sec. The same formula is used for both configurations so the comparison is fair, but absolute tok/s may differ from Ollama's internal tokenizer count.

How dynamic KV sizing works

Ollama allocates the entire KV cache upfront before generating a single token. If num_ctx=8192, it allocates memory for 8,192 tokens even if your conversation is 50 tokens.

autotune computes the minimum num_ctx each request actually needs:

num_ctx = clamp(input_tokens + max_new_tokens + 256, 512, profile_max)

For a short conversation on the balanced profile (max 8,192):

  • Input: ~22 tokens → num_ctx = 22 + 1,024 + 256 = 1,302
  • Savings on qwen2.5-coder:14b: 8,192 → 1,302 tokens = ~677 MB of KV cache freed

num_ctx grows naturally as the conversation grows since the full history is included in every request.
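
The clamp formula is small enough to write out directly. The function below is a standalone sketch of the formula as stated, not the compute_num_ctx helper that lives in autotune's ctx_utils.py; its name and parameter names are assumptions.

```python
HEADROOM_TOKENS = 256  # fixed margin from the formula above
MIN_CTX = 512          # lower clamp bound from the formula above


def sketch_num_ctx(input_tokens: int, max_new_tokens: int, profile_max: int) -> int:
    """clamp(input_tokens + max_new_tokens + 256, 512, profile_max)"""
    needed = input_tokens + max_new_tokens + HEADROOM_TOKENS
    return max(MIN_CTX, min(needed, profile_max))


# The short-conversation example on the balanced profile (max 8,192).
short_chat_ctx = sketch_num_ctx(input_tokens=22, max_new_tokens=1024, profile_max=8192)
```

Requests that would exceed the profile maximum are capped at it, which is where history trimming takes over.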


Conversation memory and recall

autotune records every conversation turn to a local SQLite database (~/.autotune/recall.db) using both full-text search and vector similarity. Memory is always-on — no flags required.

What it does

  • Automatic context injection — at the start of each chat session, autotune searches past conversations for topics similar to your current model and system prompt. Relevant facts are injected as a silent system message (only shown when --verbose is set). The 0.38 cosine similarity threshold filters out irrelevant memories.
  • Session linking — conversations are stored with a unique ID shown in the chat header. Use --conv-id <id> to resume an exact past session.
  • In-chat recall — use /recall to browse recent sessions with dates, model names, and first-turn snippets. Use /recall search <topic> for semantic search across all past conversations. Both commands offer a numbered prompt to resume the selected session with full context restored.
  • Model change detection — the background watcher polls Ollama every 30 s; if a model unloads unexpectedly (crash, OOM), the chat interface notifies you immediately.
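
The similarity filter behind automatic context injection can be illustrated with plain Python. This is a minimal sketch assuming memories are stored as (text, vector) pairs and that a score at or above the 0.38 threshold counts as relevant; the function names are hypothetical and autotune's own embedding and ranking code is not shown here.

```python
import math
from typing import List, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.38  # cosine threshold quoted above


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevant_memories(query: Sequence[float],
                      memories: List[Tuple[str, Sequence[float]]],
                      threshold: float = SIMILARITY_THRESHOLD) -> List[str]:
    """Return memory texts at or above the threshold, most similar first."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for score, text in scored if score >= threshold]


# Toy 2-d vectors: an exact match, an orthogonal memory, and a partial match.
hits = relevant_memories(
    [1.0, 0.0],
    [("exact", [1.0, 0.0]), ("unrelated", [0.0, 1.0]), ("partial", [1.0, 1.0])],
)
```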

Storage

| Path | Contents |
| --- | --- |
| ~/.autotune/recall.db | FTS5 + float32 vectors; conversation turns, extracted facts |
| ~/Library/Application Support/autotune/autotune.db | Hardware telemetry, run observations (macOS) |
| ~/.local/share/autotune/autotune.db | Same (Linux) |

Context management tiers

autotune monitors history_tokens / effective_budget and selects a strategy automatically:

< 55%   FULL              — all turns verbatim, nothing dropped
55–75%  RECENT+FACTS      — last 8 turns + structured facts block for older turns
75–90%  COMPRESSED        — last 6 turns (lightly compressed) + compact summary
> 90%   EMERGENCY         — last 4 turns (aggressively compressed) + one-line summary

Low-value chatter ("ok", "thanks") is dropped first. Code blocks, stack traces, and technical content are always preserved. All cutoffs happen at sentence or paragraph boundaries — never mid-sentence.

The facts block injected for older turns is extracted deterministically (no LLM call) and includes accomplishments, active decisions, key facts, errors, and topics covered.
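
Tier selection reduces to a threshold lookup on the usage ratio. The sketch below follows the table above; the function name `select_tier` is hypothetical, and the handling of the exact boundary values (55%, 75%, 90%) is an assumption, since the ranges as written overlap at their edges.

```python
from typing import Optional, Tuple


def select_tier(history_tokens: int, effective_budget: int) -> Tuple[str, Optional[int]]:
    """Return (tier name, number of recent turns kept); None means all turns."""
    if effective_budget <= 0:
        raise ValueError("effective_budget must be positive")
    ratio = history_tokens / effective_budget
    if ratio < 0.55:
        return "FULL", None
    if ratio < 0.75:
        return "RECENT+FACTS", 8
    if ratio <= 0.90:
        return "COMPRESSED", 6
    return "EMERGENCY", 4


# 5,000 history tokens against an 8,192-token budget is about 61% usage.
tier, recent_turns = select_tier(5000, 8192)
```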


Architecture

autotune/
│
├── ttft/                  ← TTFT optimisation layer (start here for latency work)
│   ├── optimizer.py       #   TTFTOptimizer: dynamic num_ctx + keep_alive + num_keep
│   │                      #   + flash_attn + num_batch + NoSwapGuard integration
│   └── __init__.py        #   Public API: TTFTOptimizer, KEEP_ALIVE_FOREVER
│
├── api/                   Inference pipeline
│   ├── profiles.py        #   fast / balanced / quality profile definitions
│   ├── server.py          #   FastAPI server — OpenAI-compatible /v1 + FIFO queue
│   ├── chat.py            #   Terminal chat REPL (adaptive-RAM + KV-manager + ctx-optimizer)
│   │                      #   /recall + /recall search, live tok/s TTFT stats, model watcher
│   ├── running_models.py  #   Cross-backend model visibility (Ollama + MLX state file)
│   ├── conversation.py    #   SQLite-backed persistent conversation state
│   ├── ctx_utils.py       #   Token estimation + compute_num_ctx (used by ttft/)
│   ├── kv_manager.py      #   KV options builder: flash_attn, num_batch, pressure tiers
│   ├── hardware_tuner.py  #   OS-level tuning: nice, QoS class, GC, CPU governor
│   ├── model_selector.py  #   Pre-flight fit analysis: weights + KV + overhead
│   └── backends/          #   Ollama, LM Studio, MLX, HuggingFace Inference API
│
├── context/               Context window management for long conversations
│   ├── window.py          #   ContextWindow orchestrator
│   ├── budget.py          #   Tier thresholds (FULL → RECENT+FACTS → COMPRESSED → EMERGENCY)
│   ├── classifier.py      #   Message value scoring (0.0 chatter → 1.0 technical)
│   ├── compressor.py      #   Tool output and long-content compression
│   └── extractor.py       #   Deterministic fact extraction for summary blocks
│
├── recall/                ← Conversation memory system
│   ├── store.py           #   SQLite WAL-mode: FTS5 full-text + float32 cosine vectors
│   ├── manager.py         #   save_conversation / get_context_for / list_conversations
│   └── extractor.py       #   Chunk extraction + conversation value scoring
│
├── bench/                 Benchmarking framework
│   └── runner.py          #   run_raw_ollama / run_bench_ollama_only / BenchResult
│
├── db/                    Persistence
│   └── store.py           #   SQLite: models, hardware, run_observations, telemetry_events
│
├── hardware/              Hardware detection
│   ├── profiler.py        #   CPU/GPU/RAM detection (psutil + py-cpuinfo)
│   └── ram_advisor.py     #   Real-time RAM pressure advice and swap risk scoring
│
├── memory/                Memory estimation + no-swap guarantee
│   ├── estimator.py       #   Model weights + KV cache + runtime overhead
│   └── noswap.py          #   NoSwapGuard: adjusts num_ctx/KV to guarantee no swap
│
├── models/                Model registry
│   └── registry.py        #   9 OSS models with real MMLU/HumanEval/GSM8K scores
│
├── config/                Recommendation engine
│   └── generator.py       #   Multi-objective scoring: stability × speed × quality × context
│
└── cli.py                 Entry point (Click)

License

MIT
