Will it fit? GPU toolkit for AI models — simulate, benchmark, monitor, serve.

localfit

Will it fit? Say what model you want — localfit figures out the GPU layer.

Fits locally? Run it. Doesn't fit? Kaggle free GPU. Still too big? RunPod cloud. You never think about hardware.

LLMs today. Image and video generation coming next.

pipx install localfit

Quick Start

localfit                              # GPU dashboard + trending models
localfit run gemma4:e4b               # Ollama-style: download + run
localfit run qwen3:14b                # doesn't fit? auto-offers Kaggle/RunPod
localfit --launch claude              # start model + launch Claude Code

Ollama-Compatible Commands

Same syntax you already know. But smarter.

localfit run gemma4:e4b               # serve Gemma 4 E4B (4.6GB, vision+audio)
localfit run gemma4:26b               # serve Gemma 4 26B MoE (12GB, best quality)
localfit run gemma4                   # auto-pick best for your GPU
localfit run qwen35:a3b               # Qwen 3.5 35B MoE
localfit pull gemma4:e4b              # download only
localfit list                         # show installed models
localfit ps                           # show running models
localfit stop                         # stop server
localfit show ui-tars-1.5-7b          # show all quants + fit analysis
localfit login kaggle                 # save Kaggle key (free cloud GPU)
localfit login runpod                 # save RunPod key (paid cloud GPU)

All --flag-style commands still work (localfit --serve, localfit --ps, etc.)

What Ollama Can't Do

localfit adds what Ollama is missing: deep GPU integration, fit analysis, and automatic cloud fallback.

"Doesn't fit" → Here are ALL your options

When a model doesn't fit your GPU, localfit shows every option. No other tool does this:

  ✗ Can't run qwen3:14b locally — no quant fits your 8GB GPU

  1  Run locally — Q2_K (5.5GB) fits your 8GB GPU
     Extreme quant — lower quality but runs full speed
  2  Partial GPU offload — Q8_0 (14.5GB)
     ~55% on GPU, rest on CPU · -ngl 22 · ~16 tok/s
  3  CPU-only — Q8_0 (14.5GB) ~3 tok/s · slow but works
  4  Kaggle remote (free) — T4x2 (32GB) · BF16 (28.0GB) · 12h
  5  RunPod cloud (paid) — ~$0.75/hr A6000

  ── tips ──
  KV cache quantization: Q4_K_M (8.5GB) is close to fitting. With
  -ctk q4_0 -ctv q4_0 (4-bit KV cache), you save ~2GB VRAM at 32K context.

Options include:

  • Smaller quant that fits your GPU
  • Partial GPU offload — some layers GPU, rest on CPU
  • CPU-only — slow but works
  • Kaggle free T4/T4x2 GPU — via Cloudflare tunnel, from any PC
  • RunPod paid cloud GPU — any size
  • YOLO mode — swap to disk, 0.5 tok/s, you asked for it
  • Tips: KV cache quantization, TurboQuant, missing quant creation hints
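The partial-offload option above can be sketched in a few lines of Python. This is a rough illustration only: the fixed headroom constant and the equal-layer-size assumption are mine, not localfit's actual heuristics.

```python
# Rough sketch of the fit analysis above. HEADROOM_GB and the
# equal-layer-size assumption are illustrative, not localfit's
# actual heuristics.
HEADROOM_GB = 0.5  # VRAM assumed reserved for the driver/display

def fit_options(vram_gb: float, quant_gb: float, n_layers: int) -> dict:
    """Classify how a quantized model can run on a given GPU."""
    usable = vram_gb - HEADROOM_GB
    if quant_gb <= usable:
        return {"mode": "local", "ngl": n_layers}
    # Partial offload: put as many layers on the GPU as fit,
    # treating all layers as roughly equal-sized.
    per_layer = quant_gb / n_layers
    ngl = int(usable / per_layer)
    if ngl > 0:
        return {"mode": "partial", "ngl": ngl,
                "gpu_fraction": round(ngl / n_layers, 2)}
    return {"mode": "cpu_or_cloud", "ngl": 0}

# A 14.5 GB Q8_0 on an 8 GB GPU (assuming 40 layers): about half
# the layers fit, in the same ballpark as the "-ngl 22, ~55% on GPU"
# shown in the example above.
print(fit_options(8.0, 14.5, 40))
```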

Remote GPU Serving (Kaggle Free / RunPod Paid)

Can't run locally? One command to serve on a free Kaggle GPU:

localfit run qwen3:14b --remote kaggle    # free T4 GPU + Cloudflare tunnel
localfit run gemma4:27b --remote kaggle   # auto-picks T4x2 (32GB) for bigger models
localfit run llama3:70b --remote runpod   # paid cloud GPU
localfit --remote-status                  # check active session
localfit --remote-stop                    # stop session

How it works:

  1. Checks your model against Kaggle GPU tiers (T4 16GB, T4x2 32GB, P100 16GB)
  2. Picks the right GPU, best quant that fits
  3. Generates a notebook, pushes to Kaggle via API
  4. Starts Ollama + Cloudflare tunnel
  5. Gives you a public URL — use from any PC

Supports vision-language models (VLMs) with automatic mmproj handling.

GPU Health & Fit Analysis

localfit health                       # GPU VRAM, temp, processes, memory pressure
localfit simulate                     # interactive "will this model fit?"
localfit show MODEL                   # all quants + fit check + cloud pricing
localfit specs                        # full machine specs
localfit trending                     # top models with fit/cloud tags
localfit bench                        # benchmark all installed models
localfit arena                        # leaderboard on YOUR hardware

Launch Any Tool (One Command)

localfit --launch claude              # Claude Code (--bare, safe, no config changes)
localfit --launch claude --model gemma4:26b
localfit --launch codex               # OpenAI Codex CLI
localfit --launch opencode            # OpenCode
localfit --launch aider               # aider
localfit --launch webui               # Open WebUI (ChatGPT-style browser UI)
localfit --launch webui --tunnel      # + public URL via Cloudflare Tunnel

How Launch Works

  1. Starts llama-server with the right model + optimal context for your VRAM
  2. Sets env vars scoped to the subprocess only (nothing persists)
  3. Launches the tool pointing at the local API endpoint
  4. For Claude Code, auto-starts an Anthropic compatibility proxy on localhost:8090
  5. When you exit, env vars die. Your normal tool setup is untouched.
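The env scoping in steps 2 and 5 is standard subprocess hygiene; a minimal sketch of the idea, with illustrative variable names and endpoint:

```python
# Minimal sketch of the env scoping in steps 2 and 5: the child gets
# the API variables, the parent environment is never modified.
import os
import subprocess
import sys

def launch_tool(cmd: list, base_url: str) -> int:
    env = os.environ.copy()            # copy; os.environ itself is untouched
    env["OPENAI_BASE_URL"] = base_url  # scoped to the child, dies with it
    env["OPENAI_API_KEY"] = "sk-no-key-required"
    return subprocess.call(cmd, env=env)

# The child sees the variable; the parent stays clean afterwards.
before = os.environ.get("OPENAI_BASE_URL")
launch_tool([sys.executable, "-c",
             "import os; assert 'OPENAI_BASE_URL' in os.environ"],
            "http://localhost:8089/v1")
assert os.environ.get("OPENAI_BASE_URL") == before
```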

Safety: We Never Break Your Setup

localfit never modifies these files:

  • ~/.zshrc, ~/.bashrc (no permanent exports)
  • ~/.claude.json, ~/.claude/settings.json (no Claude config changes)

If anything goes wrong:

localfit doctor                       # check all tool configs for corruption
localfit restore                      # restore configs from automatic backups

Configure Tools

localfit --config claude              # show safe launch command
localfit --config codex               # show safe launch command
localfit --config opencode            # configure OpenCode
localfit --config aider               # configure aider

Manual Launch (No localfit Required)

Claude Code:

python -m localfit.proxy --port 8090 --llama-url http://127.0.0.1:8089/v1/chat/completions &
ANTHROPIC_AUTH_TOKEN=localfit \
ANTHROPIC_BASE_URL=http://localhost:8090 \
ANTHROPIC_API_KEY= \
claude --bare --model gemma4-26b

Codex:

OPENAI_BASE_URL=http://localhost:8089/v1 \
OPENAI_API_KEY=sk-no-key-required \
codex --model local

Open WebUI:

OPENAI_API_BASE_URL=http://localhost:8089/v1 \
OPENAI_API_KEY=no-key-required \
open-webui serve

Cloud GPU

Kaggle (Free) — Setup

Kaggle gives you 30 hours/week of free GPU (T4 16GB or T4x2 32GB). localfit auto-deploys models there with a Cloudflare tunnel.

Step 1: Install Kaggle CLI

pipx install kaggle

Step 2: Get your Legacy API Key (not the new KGAT_ tokens — those don't work with kernel push)

  1. Go to https://www.kaggle.com/settings
  2. Scroll down to "Legacy API Credentials" (NOT "API Tokens")
  3. Click "Create Legacy API Key"
  4. A kaggle.json file downloads — it contains {"username":"you","key":"hex..."}

Step 3: Save credentials

localfit login kaggle
# Paste the JSON from kaggle.json, or enter username + key separately

Or manually:

mkdir -p ~/.kaggle
cp ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

Step 4: Run any model

localfit run gemma4:26b --remote kaggle        # 26B MoE, fits T4
localfit run qwen3:14b --remote kaggle         # 14B, fits T4
localfit run 0000/ui-tars-1.5-7b --remote kaggle  # VLM, fits T4

Default: 10 min auto-stop + auto-delete (saves quota). Override with --duration 30.

  ✓ Qwen3-Coder-Next-GGUF
    Quant:    IQ3_S (27.7GB)
    GPU:      T4x2 (32GB)
    Duration: 10 min (auto-stops + auto-deletes)
    Quota:    29.8h remaining of 30h weekly
    Cost:     Free

RunPod (Paid)

localfit login runpod                 # save API key
localfit run MODEL --cloud            # provision pod + serve
localfit --stop                       # stop pod + billing

How it works:

  1. Fetches live GPU pricing from RunPod API
  2. Matches model quant to GPU VRAM, shows best options for your budget
  3. Spins up a lightweight pod (~60s boot) with Ollama + Cloudflare tunnel
  4. Downloads model, creates public endpoint — use from any machine
  5. Auto-stops when budget expires

Benchmarks (RunPod RTX 3090, $0.46/hr):

  Model        Quant    Internal tok/s   Via tunnel tok/s   Pull time
  Gemma 3 4B   Q4_K_M   167              38                 6s
  Qwen 3 8B    Q4_K_M   122              106                35s

Docker Template (RunPod / Self-hosted)

Pre-built image with Ollama + Cloudflare tunnel. Zero setup time.

docker pull localfit/runpod:latest

Or use directly on RunPod as a custom template. See docker/ for Dockerfile and config.

Dynamic VRAM Context Sizing

localfit auto-calculates the optimal context window for your hardware:

free_vram = gpu_total - model_size - 512MB headroom
max_context = free_vram / 60MB per 1K tokens
  Machine                    Model                  Free VRAM   Context
  M4 Pro 24GB (16GB Metal)   Gemma 4 26B (12GB)     3.5GB       32K
  M4 Pro 24GB (16GB Metal)   Gemma 4 E4B (4.6GB)    11GB        128K
  M4 Max 64GB (48GB Metal)   Gemma 4 26B (12GB)     35GB        128K
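The two-line formula above can be written out as runnable Python. The 512 MB headroom and ~60 MB per 1K tokens are from the text; snapping down to standard context sizes with a 128K cap is my assumption, chosen to match the table.

```python
# The context-sizing formula above as runnable Python. The 512 MB
# headroom and ~60 MB per 1K tokens are from the text; snapping down
# to standard sizes with a 128K cap is an assumption matching the table.
def max_context_k(gpu_total_gb: float, model_gb: float,
                  mb_per_1k_tokens: float = 60.0) -> int:
    free_vram_mb = (gpu_total_gb - model_gb) * 1024 - 512
    raw_k = free_vram_mb / mb_per_1k_tokens
    for k in (128, 64, 32, 16, 8, 4):   # standard sizes, largest first
        if raw_k >= k:
            return k
    return 0

# M4 Pro (16 GB Metal budget) running Gemma 4 26B (12 GB) -> 32K,
# as in the first table row.
print(max_context_k(16, 12))
```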

Monitor & Maintain

localfit health                       # GPU health dashboard
localfit specs                        # machine specs
localfit cleanup                      # free GPU memory
localfit debloat                      # disable macOS services stealing GPU
localfit check                        # check prerequisites
localfit doctor                       # check if localfit broke anything
localfit restore                      # restore all configs from backup

Supported Platforms

  Platform              GPU Detection                   Monitoring
  macOS Apple Silicon   Metal budget, memory pressure   ioreg
  Linux NVIDIA          nvidia-smi VRAM, temp, fan      nvidia-smi
  Linux AMD             rocm-smi                        rocm-smi
  Windows (WSL2)        nvidia-smi via WSL              nvidia-smi

Requirements

pipx install 'localfit[all]'   # includes TUI + HF downloads

License

Apache-2.0
