Will it fit? GPU toolkit for AI models — simulate, benchmark, monitor, serve.
localfit
Will it fit? Say what model you want — localfit figures out the GPU layer.
Fits locally? Run it. Doesn't fit? Kaggle free GPU. Still too big? RunPod cloud. You never think about hardware.
LLMs today. Image and video generation coming next.
pipx install localfit
Quick Start
localfit # GPU dashboard + trending models
localfit run gemma4:e4b # Ollama-style: download + run
localfit run qwen3:14b # doesn't fit? auto-offers Kaggle/RunPod
localfit --launch claude # start model + launch Claude Code
Ollama-Compatible Commands
Same syntax you already know. But smarter.
localfit run gemma4:e4b # serve Gemma 4 E4B (4.6GB, vision+audio)
localfit run gemma4:26b # serve Gemma 4 26B MoE (12GB, best quality)
localfit run gemma4 # auto-pick best for your GPU
localfit run qwen35:a3b # Qwen 3.5 35B MoE
localfit pull gemma4:e4b # download only
localfit list # show installed models
localfit ps # show running models
localfit stop # stop server
localfit show ui-tars-1.5-7b # show all quants + fit analysis
localfit login kaggle # save Kaggle key (free cloud GPU)
localfit login runpod # save RunPod key (paid cloud GPU)
All --flag style commands still work (localfit --serve, localfit --ps, etc.)
What Ollama Can't Do
localfit is what Ollama is missing — deep GPU integration, fit analysis, and auto-cloud fallback.
"Doesn't fit" → Here are ALL your options
When a model doesn't fit your GPU, localfit shows every option. No other tool does this:
✗ Can't run qwen3:14b locally — no quant fits your 8GB GPU
1 Run locally — Q2_K (5.5GB) fits your 8GB GPU
Extreme quant — lower quality but runs full speed
2 Partial GPU offload — Q8_0 (14.5GB)
~55% on GPU, rest on CPU · -ngl 22 · ~16 tok/s
3 CPU-only — Q8_0 (14.5GB) ~3 tok/s · slow but works
4 Kaggle remote (free) — T4x2 (32GB) · BF16 (28.0GB) · 12h
5 RunPod cloud (paid) — ~$0.75/hr A6000
── tips ──
KV cache quantization: Q4_K_M (8.5GB) is close to fitting. With
-ctk q4_0 -ctv q4_0 (4-bit KV cache), you save ~2GB VRAM at 32K context.
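The KV-cache saving in that tip follows from the cache's geometry: two tensors (K and V) per layer, each storing one vector per token. A rough sketch; the layer/head counts below are illustrative, not any real model's, and q4_0's ~4.5 bits per element (4-bit values plus a per-block scale) is an approximation:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bits_per_elem):
    """Size of the K+V cache: 2 tensors per layer, each holding
    n_kv_heads * head_dim values per token of context."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens
    return elems * bits_per_elem / 8

# Illustrative geometry for a mid-size model with grouped-query attention
fp16 = kv_cache_bytes(40, 8, 128, 32_768, 16)
q4 = kv_cache_bytes(40, 8, 128, 32_768, 4.5)  # q4_0 stores ~4.5 bits/elem
print(f"fp16 KV: {fp16 / 2**30:.1f} GiB, q4_0 KV: {q4 / 2**30:.1f} GiB")
# → fp16 KV: 5.0 GiB, q4_0 KV: 1.4 GiB
```

The exact saving depends on the model's layer count and KV-head count, which is why the tool computes it per model.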
Options include:
- Smaller quant that fits your GPU
- Partial GPU offload — some layers GPU, rest on CPU
- CPU-only — slow but works
- Kaggle free T4/T4x2 GPU — via Cloudflare tunnel, from any PC
- RunPod paid cloud GPU — any size
- YOLO mode — swap to disk, 0.5 tok/s, you asked for it
- Tips: KV cache quantization, TurboQuant, missing quant creation hints
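The partial-offload option boils down to estimating how many transformer layers fit in VRAM. A minimal sketch, assuming equally sized layers (real layers vary, so a real tool would measure them; the headroom value is an assumption):

```python
def pick_ngl(model_gb, n_layers, vram_gb, headroom_gb=0.5):
    """How many transformer layers fit on the GPU, assuming layers
    are roughly equal in size (a simplification)."""
    per_layer = model_gb / n_layers
    usable = max(vram_gb - headroom_gb, 0)
    return min(n_layers, int(usable / per_layer))

# 14.5GB Q8_0 model with 48 layers on an 8GB GPU
print(pick_ngl(14.5, 48, 8.0))  # → 24 of 48 layers on GPU
```

The resulting count maps directly onto llama.cpp's `-ngl` flag.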
Remote GPU Serving (Kaggle Free / RunPod Paid)
Can't run locally? One command to serve on a free Kaggle GPU:
localfit run qwen3:14b --remote kaggle # free T4 GPU + Cloudflare tunnel
localfit run gemma4:27b --remote kaggle # auto-picks T4x2 (32GB) for bigger models
localfit run llama3:70b --remote runpod # paid cloud GPU
localfit --remote-status # check active session
localfit --remote-stop # stop session
How it works:
- Checks your model against Kaggle GPU tiers (T4 16GB, T4x2 32GB, P100 16GB)
- Picks the right GPU, best quant that fits
- Generates a notebook, pushes to Kaggle via API
- Starts Ollama + Cloudflare tunnel
- Gives you a public URL — use from any PC
Supports VLM models (vision-language) with automatic mmproj handling.
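The tier/quant matching in the first two steps can be sketched as: try the best (largest) quant first, and place it on the smallest GPU tier it fits. This is a hypothetical reconstruction; the tier list mirrors the ones named above, but the headroom value and quant sizes are illustrative:

```python
TIERS = [("T4", 16), ("P100", 16), ("T4x2", 32)]  # (name, VRAM GB), smallest first

def pick_tier(quants, headroom_gb=1.0):
    """Best (largest) quant first; smallest tier whose VRAM fits it."""
    for qname, size in sorted(quants.items(), key=lambda kv: -kv[1]):
        for name, vram in TIERS:
            if size + headroom_gb <= vram:
                return name, qname
    return None  # nothing fits any tier

print(pick_tier({"Q4_K_M": 8.5, "Q8_0": 14.5, "BF16": 28.0}))
# → ('T4x2', 'BF16')
```

This matches the behavior shown earlier, where a 28GB BF16 quant is routed to the T4x2 (32GB) tier.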
GPU Health & Fit Analysis
localfit health # GPU VRAM, temp, processes, memory pressure
localfit simulate # interactive "will this model fit?"
localfit show MODEL # all quants + fit check + cloud pricing
localfit specs # full machine specs
localfit trending # top models with fit/cloud tags
localfit bench # benchmark all installed models
localfit arena # leaderboard on YOUR hardware
Launch Any Tool (One Command)
localfit --launch claude # Claude Code (--bare, safe, no config changes)
localfit --launch claude --model gemma4:26b
localfit --launch codex # OpenAI Codex CLI
localfit --launch opencode # OpenCode
localfit --launch aider # aider
localfit --launch webui # Open WebUI (ChatGPT-style browser UI)
localfit --launch webui --tunnel # + public URL via Cloudflare Tunnel
How Launch Works
- Starts llama-server with the right model + optimal context for your VRAM
- Sets env vars scoped to the subprocess only (nothing persists)
- Launches the tool pointing at the local API endpoint
- For Claude Code, auto-starts an Anthropic compatibility proxy on localhost:8090
- When you exit, env vars die. Your normal tool setup is untouched.
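The "scoped to the subprocess only" behavior is plain environment passing: copy the current environment, add the variables, and hand the copy to the child. A minimal sketch (variable names taken from the manual-launch section below):

```python
import os
import subprocess
import sys

# Copy the environment, add endpoint variables to the copy only.
env = os.environ.copy()
env["ANTHROPIC_BASE_URL"] = "http://localhost:8090"
env["ANTHROPIC_AUTH_TOKEN"] = "localfit"

# The child process sees the variable...
subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['ANTHROPIC_BASE_URL'])"],
    env=env, check=True,
)

# ...but the parent (and your shell) never do.
assert "ANTHROPIC_BASE_URL" not in os.environ
```

Nothing is written to shell rc files, so there is nothing to clean up afterwards.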
Safety: We Never Break Your Setup
localfit never modifies these files:
- ~/.zshrc, ~/.bashrc (no permanent exports)
- ~/.claude.json, ~/.claude/settings.json (no Claude config changes)
If anything goes wrong:
localfit doctor # check all tool configs for corruption
localfit restore # restore configs from automatic backups
Configure Tools
localfit --config claude # show safe launch command
localfit --config codex # show safe launch command
localfit --config opencode # configure OpenCode
localfit --config aider # configure aider
Manual Launch (No localfit Required)
Claude Code:
python -m localfit.proxy --port 8090 --llama-url http://127.0.0.1:8089/v1/chat/completions &
ANTHROPIC_AUTH_TOKEN=localfit \
ANTHROPIC_BASE_URL=http://localhost:8090 \
ANTHROPIC_API_KEY= \
claude --bare --model gemma4-26b
Codex:
OPENAI_BASE_URL=http://localhost:8089/v1 \
OPENAI_API_KEY=sk-no-key-required \
codex --model local
Open WebUI:
OPENAI_API_BASE_URL=http://localhost:8089/v1 \
OPENAI_API_KEY=no-key-required \
open-webui serve
Cloud GPU
Kaggle (Free) — Setup
Kaggle gives you 30 hours/week of free GPU (T4 16GB or T4x2 32GB). localfit auto-deploys models there with a Cloudflare tunnel.
Step 1: Install Kaggle CLI
pipx install kaggle
Step 2: Get your Legacy API Key (not the new KGAT_ tokens — those don't work with kernel push)
- Go to https://www.kaggle.com/settings
- Scroll down to "Legacy API Credentials" (NOT "API Tokens")
- Click "Create Legacy API Key"
- A kaggle.json file downloads — it contains {"username":"you","key":"hex..."}
Step 3: Save credentials
localfit login kaggle
# Paste the JSON from kaggle.json, or enter username + key separately
Or manually:
mkdir -p ~/.kaggle
cp ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
Step 4: Run any model
localfit run gemma4:26b --remote kaggle # 26B MoE, fits T4
localfit run qwen3:14b --remote kaggle # 14B, fits T4
localfit run 0000/ui-tars-1.5-7b --remote kaggle # VLM, fits T4
Default: 10 min auto-stop + auto-delete (saves quota). Override with --duration 30.
✓ Qwen3-Coder-Next-GGUF
Quant: IQ3_S (27.7GB)
GPU: T4x2 (32GB)
Duration: 10 min (auto-stops + auto-deletes)
Quota: 29.8h remaining of 30h weekly
Cost: Free
RunPod (Paid)
localfit login runpod # save API key
localfit run MODEL --cloud # provision pod + serve
localfit --stop # stop pod + billing
How it works:
- Fetches live GPU pricing from RunPod API
- Matches model quant to GPU VRAM, shows best options for your budget
- Spins up a lightweight pod (~60s boot) with Ollama + Cloudflare tunnel
- Downloads model, creates public endpoint — use from any machine
- Auto-stops when budget expires
Benchmarks (RunPod RTX 3090, $0.46/hr):
| Model | Quant | Internal tok/s | Via Tunnel | Pull Time |
|---|---|---|---|---|
| Gemma 3 4B | Q4_K_M | 167 | 38 | 6s |
| Qwen 3 8B | Q4_K_M | 122 | 106 | 35s |
Docker Template (RunPod / Self-hosted)
Pre-built image with Ollama + Cloudflare tunnel. Zero setup time.
docker pull localfit/runpod:latest
Or use directly on RunPod as a custom template. See docker/ for Dockerfile and config.
Dynamic VRAM Context Sizing
localfit auto-calculates the optimal context window for your hardware:
free_vram = gpu_total - model_size - 512MB headroom
max_context = free_vram / 60MB per 1K tokens
| Machine | Model | Free VRAM | Context |
|---|---|---|---|
| M4 Pro 24GB (16GB Metal) | Gemma 4 26B (12GB) | 3.5GB | 32K |
| M4 Pro 24GB (16GB Metal) | Gemma 4 E4B (4.6GB) | 11GB | 128K |
| M4 Max 64GB (48GB Metal) | Gemma 4 26B (12GB) | 35GB | 128K |
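The formula and table above can be combined into one sizing rule. The snapping down to a power-of-two context and the 128K cap are assumptions on our part, but they reproduce every row of the table:

```python
def max_context_k(gpu_total_gb, model_gb, cap_k=128):
    """Context size in K tokens: free VRAM minus 512MB headroom,
    at ~60MB per 1K tokens, snapped down to a power of two and
    capped at the model's maximum (assumed 128K here)."""
    free_mb = (gpu_total_gb - model_gb) * 1024 - 512
    limit_k = free_mb / 60
    ctx = 1
    while ctx * 2 <= min(limit_k, cap_k):
        ctx *= 2
    return ctx

print(max_context_k(16, 12))    # Gemma 4 26B on 16GB Metal budget → 32
print(max_context_k(16, 4.6))   # Gemma 4 E4B on 16GB → 128
print(max_context_k(48, 12))    # Gemma 4 26B on 48GB → 128
```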
Monitor & Maintain
localfit health # GPU health dashboard
localfit specs # machine specs
localfit cleanup # free GPU memory
localfit debloat # disable macOS services stealing GPU
localfit check # check prerequisites
localfit doctor # check if localfit broke anything
localfit restore # restore all configs from backup
Supported Platforms
| Platform | GPU Detection | Monitoring |
|---|---|---|
| macOS Apple Silicon | Metal budget, memory pressure | ioreg |
| Linux NVIDIA | nvidia-smi VRAM, temp, fan | nvidia-smi |
| Linux AMD | rocm-smi | rocm-smi |
| Windows (WSL2) | nvidia-smi via WSL | nvidia-smi |
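On the NVIDIA rows above, the stats come from nvidia-smi's machine-readable query mode. A sketch of reading it (the query fields and flags are real nvidia-smi options; the parsing helper and dictionary keys are ours):

```python
import csv
import io
import subprocess

QUERY = "--query-gpu=name,memory.total,memory.used,temperature.gpu"

def parse(csv_text):
    """Turn nvidia-smi CSV rows into dicts (MiB values, Celsius temps)."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [{"name": n, "vram_total_mib": int(t),
             "vram_used_mib": int(u), "temp_c": int(c)}
            for n, t, u, c in rows]

def gpu_stats():
    """Query the driver directly; requires nvidia-smi on PATH."""
    out = subprocess.check_output(
        ["nvidia-smi", QUERY, "--format=csv,noheader,nounits"], text=True)
    return parse(out)

# Example of the CSV shape these flags produce:
print(parse("NVIDIA GeForce RTX 3090, 24576, 1021, 44\n"))
```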
Requirements
pipx install 'localfit[all]' # includes TUI + HF downloads
License
Apache-2.0
Download files
File details
Details for the file localfit-0.7.1.tar.gz.
File metadata
- Download URL: localfit-0.7.1.tar.gz
- Upload date:
- Size: 3.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ff9ae37a4418484ba954c8834c6fbc4a23f5b148cdf1484414b00bb172c7e324 |
| MD5 | 14a0a5797ba65f65f0126966c834105f |
| BLAKE2b-256 | c41b378a51ae0f79088580ad9bf401abe71fd91873acbf9d0278e215b9875b77 |
File details
Details for the file localfit-0.7.1-py3-none-any.whl.
File metadata
- Download URL: localfit-0.7.1-py3-none-any.whl
- Upload date:
- Size: 123.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b0def0eb1986fb2216c616a9ac6319b5463cec2abf2f17374a153f50da36ffe |
| MD5 | 6a798847b5cb4dcb8d45723aa7aaf977 |
| BLAKE2b-256 | b064352762a1c2e8b3419fe4c1f758dccf97cd9186a9389fe418efab12247fcc |