
Will it fit? GPU toolkit for local LLMs — simulate, benchmark, monitor, serve.


localfit

The DirectX for local LLMs. Say what model you want — localfit figures out the GPU layer.

Fits locally? Run it. Doesn't fit? Kaggle free GPU. Still too big? RunPod cloud. You never think about hardware.

pipx install localfit

Quick Start

localfit                              # GPU dashboard + trending models
localfit run gemma4:e4b               # Ollama-style: download + run
localfit run qwen3:14b                # doesn't fit? auto-offers Kaggle/RunPod
localfit --launch claude              # start model + launch Claude Code

Ollama-Compatible Commands

Same syntax you already know. But smarter.

localfit run gemma4:e4b               # serve Gemma 4 E4B (4.6GB, vision+audio)
localfit run gemma4:26b               # serve Gemma 4 26B MoE (12GB, best quality)
localfit run gemma4                   # auto-pick best for your GPU
localfit run qwen35:a3b               # Qwen 3.5 35B MoE
localfit pull gemma4:e4b              # download only
localfit list                         # show installed models
localfit ps                           # show running models
localfit stop                         # stop server
localfit show ui-tars-1.5-7b          # show all quants + fit analysis
localfit login kaggle                 # save Kaggle key (free cloud GPU)
localfit login runpod                 # save RunPod key (paid cloud GPU)

All --flag-style commands still work (localfit --serve, localfit --ps, etc.)

What Ollama Can't Do

localfit adds what Ollama is missing: deep GPU integration, fit analysis, and auto-cloud fallback.

"Doesn't fit" → Here are ALL your options

When a model doesn't fit your GPU, localfit shows every option. No other tool does this:

  ✗ Can't run qwen3:14b locally — no quant fits your 8GB GPU

  1  Run locally — Q2_K (5.5GB) fits your 8GB GPU
     Extreme quant — lower quality but runs full speed
  2  Partial GPU offload — Q8_0 (14.5GB)
     ~55% on GPU, rest on CPU · -ngl 22 · ~16 tok/s
  3  CPU-only — Q8_0 (14.5GB) ~3 tok/s · slow but works
  4  Kaggle remote (free) — T4x2 (32GB) · BF16 (28.0GB) · 12h
  5  RunPod cloud (paid) — ~$0.75/hr A6000

  ── tips ──
  KV cache quantization: Q4_K_M (8.5GB) is close to fitting. With
  -ctk q4_0 -ctv q4_0 (4-bit KV cache), you save ~2GB VRAM at 32K context.

Options include:

  • Smaller quant that fits your GPU
  • Partial GPU offload — some layers GPU, rest on CPU
  • CPU-only — slow but works
  • Kaggle free T4/T4x2 GPU — via Cloudflare tunnel, from any PC
  • RunPod paid cloud GPU — any size
  • YOLO mode — swap to disk, 0.5 tok/s, you asked for it
  • Tips: KV cache quantization, TurboQuant, missing quant creation hints
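
The fit decision behind this list can be sketched in a few lines of Python. Every number here is an illustrative placeholder (the quant sizes, the 0.5GB headroom, the 2x-VRAM offload cutoff), not localfit's actual heuristics:

```python
def fit_options(quants, vram_gb, headroom_gb=0.5):
    """Classify each quant of a model as full-GPU, partial-offload, or neither.

    quants: dict of quant name -> file size in GB (illustrative values below).
    Returns (quant, strategy) pairs, largest quant first.
    """
    options = []
    for quant, size_gb in sorted(quants.items(), key=lambda kv: -kv[1]):
        if size_gb + headroom_gb <= vram_gb:
            options.append((quant, "full GPU"))
        elif size_gb <= 2 * vram_gb:  # rough cutoff for a useful partial offload
            gpu_frac = (vram_gb - headroom_gb) / size_gb
            options.append((quant, f"partial offload (~{gpu_frac:.0%} on GPU)"))
        else:
            options.append((quant, "CPU-only or remote GPU"))
    return options

# Hypothetical quant table for a 14B model on an 8GB GPU
for quant, strategy in fit_options({"Q8_0": 14.5, "Q4_K_M": 8.5, "Q2_K": 5.5},
                                   vram_gb=8):
    print(f"{quant:8s} {strategy}")
```

The real tool additionally maps the offload fraction to an `-ngl` layer count and a throughput estimate, as in the sample output above.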

Remote GPU Serving (Kaggle Free / RunPod Paid)

Can't run locally? One command to serve on a free Kaggle GPU:

localfit run qwen3:14b --remote kaggle    # free T4 GPU + Cloudflare tunnel
localfit run gemma4:27b --remote kaggle   # auto-picks T4x2 (32GB) for bigger models
localfit run llama3:70b --remote runpod   # paid cloud GPU
localfit --remote-status                  # check active session
localfit --remote-stop                    # stop session

How it works:

  1. Checks your model against Kaggle GPU tiers (T4 16GB, T4x2 32GB, P100 16GB)
  2. Picks the right GPU, best quant that fits
  3. Generates a notebook, pushes to Kaggle via API
  4. Builds llama-server with CUDA, starts Cloudflare tunnel
  5. Gives you a public URL — use from any PC

Supports vision-language models (VLMs) with automatic mmproj handling.
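
Steps 1 and 2 reduce to a small selection problem: prefer the highest-quality quant any free tier can hold, then the smallest tier that holds it. A hedged sketch (the tier list matches Kaggle's free offerings listed above; the quant table and 1GB headroom are illustrative):

```python
# Kaggle's free GPU tiers: (name, VRAM in GB)
KAGGLE_TIERS = [("T4", 16), ("P100", 16), ("T4x2", 32)]

def pick_tier_and_quant(quants, headroom_gb=1.0):
    """Pick the largest (highest-quality) quant that fits some free tier,
    then the smallest tier that fits it."""
    for quant, size_gb in sorted(quants.items(), key=lambda kv: -kv[1]):
        for tier, vram_gb in sorted(KAGGLE_TIERS, key=lambda t: t[1]):
            if size_gb + headroom_gb <= vram_gb:
                return tier, quant, size_gb
    return None  # nothing fits any free tier -> fall back to paid cloud

print(pick_tier_and_quant({"BF16": 28.0, "Q8_0": 14.5, "Q4_K_M": 8.5}))
# -> ('T4x2', 'BF16', 28.0)
```

This reproduces the behavior described above: a 28GB BF16 model skips the single T4 and lands on the T4x2 (32GB) tier.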

GPU Health & Fit Analysis

localfit health                       # GPU VRAM, temp, processes, memory pressure
localfit simulate                     # interactive "will this model fit?"
localfit show MODEL                   # all quants + fit check + cloud pricing
localfit specs                        # full machine specs
localfit trending                     # top models with fit/cloud tags
localfit bench                        # benchmark all installed models
localfit arena                        # leaderboard on YOUR hardware

Launch Any Tool (One Command)

localfit --launch claude              # Claude Code (--bare, safe, no config changes)
localfit --launch claude --model gemma4:26b
localfit --launch codex               # OpenAI Codex CLI
localfit --launch opencode            # OpenCode
localfit --launch aider               # aider
localfit --launch webui               # Open WebUI (ChatGPT-style browser UI)
localfit --launch webui --tunnel      # + public URL via Cloudflare Tunnel

How Launch Works

  1. Starts llama-server with the right model + optimal context for your VRAM
  2. Sets env vars scoped to the subprocess only (nothing persists)
  3. Launches the tool pointing at localhost:8089
  4. When you exit, env vars die. Your normal tool setup is untouched.
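
The scoped-environment trick in step 2 is standard subprocess hygiene: copy the parent environment, add overrides, and hand the copy only to the child. A minimal sketch (the variable names follow the Claude Code recipe later on this page; the child command here is a stand-in):

```python
import os
import subprocess
import sys

def launch_scoped(cmd, overrides):
    """Run cmd with extra env vars visible only to that one child process."""
    env = os.environ.copy()  # copy, never mutate, the parent environment
    env.update(overrides)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# Demonstrate scoping with a child that just echoes the variable back.
result = launch_scoped(
    [sys.executable, "-c",
     "import os; print(os.environ['ANTHROPIC_BASE_URL'])"],
    {"ANTHROPIC_BASE_URL": "http://localhost:8089",
     "ANTHROPIC_AUTH_TOKEN": "localfit"},
)
print(result.stdout.strip())               # the child saw the override
print("ANTHROPIC_BASE_URL" in os.environ)  # the parent never did
```

Because `os.environ` is never written, nothing can leak into your shell rc files or tool configs once the child exits.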

Safety: We Never Break Your Setup

localfit never modifies these files:

  • ~/.zshrc, ~/.bashrc (no permanent exports)
  • ~/.claude.json, ~/.claude/settings.json (no Claude config changes)

If anything goes wrong:

localfit doctor                       # check all tool configs for corruption
localfit restore                      # restore configs from automatic backups
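
Conceptually, doctor and restore form a checksum-verified backup: record a hash at backup time, compare it later, copy the backup back if anything drifted. A rough sketch of that idea (the paths and layout are illustrative, not localfit's internals):

```python
import hashlib
import shutil
from pathlib import Path

def backup(path: Path, backup_dir: Path) -> str:
    """Copy a config file aside and return its SHA-256 for later checks."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(path, backup_dir / path.name)
    return hashlib.sha256(path.read_bytes()).hexdigest()

def is_corrupted(path: Path, expected_sha: str) -> bool:
    """doctor: has the file changed since it was backed up?"""
    return hashlib.sha256(path.read_bytes()).hexdigest() != expected_sha

def restore(path: Path, backup_dir: Path) -> None:
    """restore: put the backed-up copy back in place."""
    shutil.copy2(backup_dir / path.name, path)
```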

Configure Tools

localfit --config claude              # show safe launch command
localfit --config codex               # show safe launch command
localfit --config opencode            # configure OpenCode
localfit --config aider               # configure aider

Manual Launch (No localfit Required)

Claude Code:

ANTHROPIC_AUTH_TOKEN=localfit \
ANTHROPIC_BASE_URL=http://localhost:8089 \
ANTHROPIC_API_KEY= \
claude --bare --model gemma4-26b

Codex:

OPENAI_BASE_URL=http://localhost:8089/v1 \
OPENAI_API_KEY=sk-no-key-required \
codex --model local

Open WebUI:

OPENAI_API_BASE_URL=http://localhost:8089/v1 \
OPENAI_API_KEY=no-key-required \
open-webui serve
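
All three recipes work because llama-server exposes an OpenAI-compatible chat-completions endpoint, so you can also talk to it from a script. A minimal sketch (assumes a server is already running on localhost:8089; the model name is whatever the server loaded):

```python
import json
import urllib.request

def build_chat_request(prompt, model="local"):
    """Payload in the OpenAI chat-completions format the server accepts."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def chat(prompt, base_url="http://localhost:8089/v1"):
    """POST one completion request to the local server (must be running)."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer sk-no-key-required"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Say hello in one word.")  # uncomment with llama-server running
```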

Cloud GPU

Kaggle (Free) — Setup

Kaggle gives you 30 hours/week of free GPU (T4 16GB or T4x2 32GB). localfit auto-deploys models there with a Cloudflare tunnel.

Step 1: Install Kaggle CLI

pipx install kaggle

Step 2: Get your Legacy API Key (not the new KGAT_ tokens — those don't work with kernel push)

  1. Go to https://www.kaggle.com/settings
  2. Scroll down to "Legacy API Credentials" (NOT "API Tokens")
  3. Click "Create Legacy API Key"
  4. A kaggle.json file downloads — it contains {"username":"you","key":"hex..."}

Step 3: Save credentials

localfit login kaggle
# Paste the JSON from kaggle.json, or enter username + key separately

Or manually:

mkdir -p ~/.kaggle
cp ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

Step 4: Run any model

localfit run gemma4:26b --remote kaggle        # 26B MoE, fits T4
localfit run qwen3:14b --remote kaggle         # 14B, fits T4
localfit run 0000/ui-tars-1.5-7b --remote kaggle  # VLM, fits T4

Default: 10 min auto-stop + auto-delete (saves quota). Override with --duration 30.

  ✓ Qwen3-Coder-Next-GGUF
    Quant:    IQ3_S (27.7GB)
    GPU:      T4x2 (32GB)
    Duration: 10 min (auto-stops + auto-deletes)
    Quota:    29.8h remaining of 30h weekly
    Cost:     Free

RunPod (Paid)

localfit login runpod                 # save API key
localfit run MODEL --cloud            # provision pod + serve
localfit --stop                       # stop pod + billing

Dynamic VRAM Context Sizing

localfit auto-calculates the optimal context window for your hardware:

free_vram = gpu_total - model_size - 512MB headroom
max_context = free_vram / 60MB per 1K tokens
  Machine                    Model                 Free VRAM   Context
  M4 Pro 24GB (16GB Metal)   Gemma 4 26B (12GB)    3.5GB       32K
  M4 Pro 24GB (16GB Metal)   Gemma 4 E4B (4.6GB)   11GB        128K
  M4 Max 64GB (48GB Metal)   Gemma 4 26B (12GB)    35GB        128K
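
The formula in code, plus one step the table implies but the formula omits: snapping down to a standard window size and capping at the model's 128K maximum. The snapping step is my inference from the table, not a documented rule, and the 60MB-per-1K-token figure is the page's rough constant (real KV cache cost varies by architecture):

```python
STANDARD_WINDOWS_K = (128, 64, 32, 16, 8, 4)  # in thousands of tokens

def max_context_tokens(gpu_total_gb, model_size_gb,
                       headroom_gb=0.5, mb_per_1k_tokens=60, model_max_k=128):
    """Largest standard context window whose KV cache fits in free VRAM."""
    free_mb = (gpu_total_gb - model_size_gb - headroom_gb) * 1024
    raw_k = free_mb / mb_per_1k_tokens  # tokens the free VRAM could afford
    for k in STANDARD_WINDOWS_K:        # snap down to a standard window
        if k <= min(raw_k, model_max_k):
            return k * 1024
    return 0  # model alone doesn't leave room for any context

print(max_context_tokens(16, 12))   # M4 Pro + Gemma 4 26B -> 32768 (32K)
print(max_context_tokens(16, 4.6))  # M4 Pro + Gemma 4 E4B -> 131072 (128K)
print(max_context_tokens(48, 12))   # M4 Max + Gemma 4 26B -> 131072 (128K)
```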

Monitor & Maintain

localfit health                       # GPU health dashboard
localfit specs                        # machine specs
localfit cleanup                      # free GPU memory
localfit debloat                      # disable macOS services stealing GPU
localfit check                        # check prerequisites
localfit doctor                       # check if localfit broke anything
localfit restore                      # restore all configs from backup

Supported Platforms

  Platform              GPU Detection                   Monitoring
  macOS Apple Silicon   Metal budget, memory pressure   ioreg
  Linux NVIDIA          nvidia-smi VRAM, temp, fan      nvidia-smi
  Linux AMD             rocm-smi                        rocm-smi
  Windows (WSL2)        nvidia-smi via WSL              nvidia-smi

Requirements

pipx install 'localfit[all]'   # includes TUI + HF downloads

License

Apache-2.0

