Will it fit? GPU toolkit for AI models — simulate, benchmark, monitor, serve.
localfit
Will it fit? Say what model you want — localfit figures out the GPU layer.
Fits locally? Run it. Doesn't fit? Kaggle free GPU. Still too big? RunPod cloud. You never think about hardware.
LLMs today. Image and video generation coming next.
pipx install localfit
Quick Start
localfit # GPU dashboard + trending models
localfit run gemma4:e4b # Ollama-style: download + run
localfit run qwen3:14b # doesn't fit? auto-offers Kaggle/RunPod
localfit --launch claude # start model + launch Claude Code
Ollama-Compatible Commands
Same syntax you already know. But smarter.
localfit run gemma4:e4b # serve Gemma 4 E4B (4.6GB, vision+audio)
localfit run gemma4:26b # serve Gemma 4 26B MoE (12GB, best quality)
localfit run gemma4 # auto-pick best for your GPU
localfit run qwen35:a3b # Qwen 3.5 35B MoE
localfit pull gemma4:e4b # download only
localfit list # show installed models
localfit ps # show running models
localfit stop # stop server
localfit show ui-tars-1.5-7b # show all quants + fit analysis
localfit login kaggle # save Kaggle key (free cloud GPU)
localfit login runpod # save RunPod key (paid cloud GPU)
All --flag style commands still work (localfit --serve, localfit --ps, etc.)
What Ollama Can't Do
localfit is what Ollama is missing — deep GPU integration, fit analysis, and auto-cloud fallback.
"Doesn't fit" → Here are ALL your options
When a model doesn't fit your GPU, localfit shows every option. No other tool does this:
✗ Can't run qwen3:14b locally — no quant fits your 8GB GPU
1 Run locally — Q2_K (5.5GB) fits your 8GB GPU
Extreme quant — lower quality but runs full speed
2 Partial GPU offload — Q8_0 (14.5GB)
~55% on GPU, rest on CPU · -ngl 22 · ~16 tok/s
3 CPU-only — Q8_0 (14.5GB) ~3 tok/s · slow but works
4 Kaggle remote (free) — T4x2 (32GB) · BF16 (28.0GB) · 12h
5 RunPod cloud (paid) — ~$0.75/hr A6000
── tips ──
KV cache quantization: Q4_K_M (8.5GB) is close to fitting. With
-ctk q4_0 -ctv q4_0 (4-bit KV cache), you save ~2GB VRAM at 32K context.
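The KV-cache saving in that tip follows from the cache's geometry: two tensors (K and V) per layer, each storing one vector per token. A rough sketch; the layer/head counts below are illustrative, not any real model's, and q4_0's ~4.5 bits per element (4-bit values plus a per-block scale) is an approximation:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bits_per_elem):
    """Size of the K+V cache: 2 tensors per layer, each holding
    n_kv_heads * head_dim values per token of context."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens
    return elems * bits_per_elem / 8

# Illustrative geometry for a mid-size model with grouped-query attention
fp16 = kv_cache_bytes(40, 8, 128, 32_768, 16)
q4 = kv_cache_bytes(40, 8, 128, 32_768, 4.5)  # q4_0 stores ~4.5 bits/elem
print(f"fp16 KV: {fp16 / 2**30:.1f} GiB, q4_0 KV: {q4 / 2**30:.1f} GiB")
# → fp16 KV: 5.0 GiB, q4_0 KV: 1.4 GiB
```

The exact saving depends on the model's layer count and KV-head count, which is why the tool computes it per model.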
Options include:
- Smaller quant that fits your GPU
- Partial GPU offload — some layers GPU, rest on CPU
- CPU-only — slow but works
- Kaggle free T4/T4x2 GPU — via Cloudflare tunnel, from any PC
- RunPod paid cloud GPU — any size
- YOLO mode — swap to disk, 0.5 tok/s, you asked for it
- Tips: KV cache quantization, TurboQuant, missing quant creation hints
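The partial-offload option boils down to estimating how many transformer layers fit in VRAM. A minimal sketch, assuming equally sized layers (real layers vary, so a real tool would measure them; the headroom value is an assumption):

```python
def pick_ngl(model_gb, n_layers, vram_gb, headroom_gb=0.5):
    """How many transformer layers fit on the GPU, assuming layers
    are roughly equal in size (a simplification)."""
    per_layer = model_gb / n_layers
    usable = max(vram_gb - headroom_gb, 0)
    return min(n_layers, int(usable / per_layer))

# 14.5GB Q8_0 model with 48 layers on an 8GB GPU
print(pick_ngl(14.5, 48, 8.0))  # → 24 of 48 layers on GPU
```

The resulting count maps directly onto llama.cpp's `-ngl` flag.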
Remote GPU Serving (Kaggle Free / RunPod Paid)
Can't run locally? One command to serve on a free Kaggle GPU:
localfit run qwen3:14b --remote kaggle # free T4 GPU + Cloudflare tunnel
localfit run gemma4:27b --remote kaggle # auto-picks T4x2 (32GB) for bigger models
localfit run llama3:70b --remote runpod # paid cloud GPU
localfit --remote-status # check active session
localfit --remote-stop # stop session
How it works:
- Checks your model against Kaggle GPU tiers (T4 16GB, T4x2 32GB, P100 16GB)
- Picks the right GPU, best quant that fits
- Generates a notebook, pushes to Kaggle via API
- Starts Ollama + Cloudflare tunnel
- Gives you a public URL — use from any PC
Supports VLM models (vision-language) with automatic mmproj handling.
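The tier/quant matching in the first two steps can be sketched as: try the best (largest) quant first, and place it on the smallest GPU tier it fits. This is a hypothetical reconstruction; the tier list mirrors the ones named above, but the headroom value and quant sizes are illustrative:

```python
TIERS = [("T4", 16), ("P100", 16), ("T4x2", 32)]  # (name, VRAM GB), smallest first

def pick_tier(quants, headroom_gb=1.0):
    """Best (largest) quant first; smallest tier whose VRAM fits it."""
    for qname, size in sorted(quants.items(), key=lambda kv: -kv[1]):
        for name, vram in TIERS:
            if size + headroom_gb <= vram:
                return name, qname
    return None  # nothing fits any tier

print(pick_tier({"Q4_K_M": 8.5, "Q8_0": 14.5, "BF16": 28.0}))
# → ('T4x2', 'BF16')
```

This matches the behavior shown earlier, where a 28GB BF16 quant is routed to the T4x2 (32GB) tier.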
GPU Health & Fit Analysis
localfit health # GPU VRAM, temp, processes, memory pressure
localfit simulate # interactive "will this model fit?"
localfit show MODEL # all quants + fit check + cloud pricing
localfit specs # full machine specs
localfit trending # top models with fit/cloud tags
localfit bench # benchmark all installed models
localfit arena # leaderboard on YOUR hardware
Launch Any Tool (One Command)
localfit --launch claude # Claude Code (--bare, safe, no config changes)
localfit --launch claude --model gemma4:26b
localfit --launch codex # OpenAI Codex CLI
localfit --launch opencode # OpenCode
localfit --launch aider # aider
localfit --launch webui # Open WebUI (ChatGPT-style browser UI)
localfit --launch webui --tunnel # + public URL via Cloudflare Tunnel
How Launch Works
- Starts llama-server with the right model + optimal context for your VRAM
- Sets env vars scoped to the subprocess only (nothing persists)
- Launches the tool pointing at the local API endpoint
- For Claude Code, auto-starts an Anthropic compatibility proxy on localhost:8090
- When you exit, env vars die. Your normal tool setup is untouched.
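The "scoped to the subprocess only" behavior is plain environment passing: copy the current environment, add the variables, and hand the copy to the child. A minimal sketch (variable names taken from the manual-launch section below):

```python
import os
import subprocess
import sys

# Copy the environment, add endpoint variables to the copy only.
env = os.environ.copy()
env["ANTHROPIC_BASE_URL"] = "http://localhost:8090"
env["ANTHROPIC_AUTH_TOKEN"] = "localfit"

# The child process sees the variable...
subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['ANTHROPIC_BASE_URL'])"],
    env=env, check=True,
)

# ...but the parent (and your shell) never do.
assert "ANTHROPIC_BASE_URL" not in os.environ
```

Nothing is written to shell rc files, so there is nothing to clean up afterwards.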
Safety: We Never Break Your Setup
localfit never modifies these files:
- ~/.zshrc, ~/.bashrc (no permanent exports)
- ~/.claude.json, ~/.claude/settings.json (no Claude config changes)
If anything goes wrong:
localfit doctor # check all tool configs for corruption
localfit restore # restore configs from automatic backups
Configure Tools
localfit --config claude # show safe launch command
localfit --config codex # show safe launch command
localfit --config opencode # configure OpenCode
localfit --config aider # configure aider
Manual Launch (No localfit Required)
Claude Code:
python -m localfit.proxy --port 8090 --llama-url http://127.0.0.1:8089/v1/chat/completions &
ANTHROPIC_AUTH_TOKEN=localfit \
ANTHROPIC_BASE_URL=http://localhost:8090 \
ANTHROPIC_API_KEY= \
claude --bare --model gemma4-26b
Codex:
OPENAI_BASE_URL=http://localhost:8089/v1 \
OPENAI_API_KEY=sk-no-key-required \
codex --model local
Open WebUI:
OPENAI_API_BASE_URL=http://localhost:8089/v1 \
OPENAI_API_KEY=no-key-required \
open-webui serve
Cloud GPU
Kaggle (Free) — Setup
Kaggle gives you 30 hours/week of free GPU (T4 16GB or T4x2 32GB). localfit auto-deploys models there with a Cloudflare tunnel.
Step 1: Install Kaggle CLI
pipx install kaggle
Step 2: Get your Legacy API Key (not the new KGAT_ tokens — those don't work with kernel push)
- Go to https://www.kaggle.com/settings
- Scroll down to "Legacy API Credentials" (NOT "API Tokens")
- Click "Create Legacy API Key"
- A kaggle.json file downloads — it contains {"username":"you","key":"hex..."}
Step 3: Save credentials
localfit login kaggle
# Paste the JSON from kaggle.json, or enter username + key separately
Or manually:
mkdir -p ~/.kaggle
cp ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
Step 4: Run any model
localfit run gemma4:26b --remote kaggle # 26B MoE, fits T4
localfit run qwen3:14b --remote kaggle # 14B, fits T4
localfit run 0000/ui-tars-1.5-7b --remote kaggle # VLM, fits T4
Default: 10 min auto-stop + auto-delete (saves quota). Override with --duration 30.
✓ Qwen3-Coder-Next-GGUF
Quant: IQ3_S (27.7GB)
GPU: T4x2 (32GB)
Duration: 10 min (auto-stops + auto-deletes)
Quota: 29.8h remaining of 30h weekly
Cost: Free
RunPod (Paid)
localfit login runpod # save API key
localfit run MODEL --cloud # provision pod + serve
localfit --stop # stop pod + billing
How it works:
- Fetches live GPU pricing from RunPod API
- Matches model quant to GPU VRAM, shows best options for your budget
- Spins up a lightweight pod (~60s boot) with Ollama + Cloudflare tunnel
- Downloads model, creates public endpoint — use from any machine
- Auto-stops when budget expires
Benchmarks (RunPod RTX 3090, $0.46/hr):
| Model | Quant | Internal tok/s | Via Tunnel | Pull Time |
|---|---|---|---|---|
| Gemma 3 4B | Q4_K_M | 167 | 38 | 6s |
| Qwen 3 8B | Q4_K_M | 122 | 106 | 35s |
Docker Template (RunPod / Self-hosted)
Pre-built image with Ollama + Cloudflare tunnel. Zero setup time.
docker pull localfit/runpod:latest
Or use directly on RunPod as a custom template. See docker/ for Dockerfile and config.
Dynamic VRAM Context Sizing
localfit auto-calculates the optimal context window for your hardware:
free_vram = gpu_total - model_size - 512MB headroom
max_context = free_vram / 60MB per 1K tokens
| Machine | Model | Free VRAM | Context |
|---|---|---|---|
| M4 Pro 24GB (16GB Metal) | Gemma 4 26B (12GB) | 3.5GB | 32K |
| M4 Pro 24GB (16GB Metal) | Gemma 4 E4B (4.6GB) | 11GB | 128K |
| M4 Max 64GB (48GB Metal) | Gemma 4 26B (12GB) | 35GB | 128K |
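The formula and table above can be combined into one sizing rule. The snapping down to a power-of-two context and the 128K cap are assumptions on our part, but they reproduce every row of the table:

```python
def max_context_k(gpu_total_gb, model_gb, cap_k=128):
    """Context size in K tokens: free VRAM minus 512MB headroom,
    at ~60MB per 1K tokens, snapped down to a power of two and
    capped at the model's maximum (assumed 128K here)."""
    free_mb = (gpu_total_gb - model_gb) * 1024 - 512
    limit_k = free_mb / 60
    ctx = 1
    while ctx * 2 <= min(limit_k, cap_k):
        ctx *= 2
    return ctx

print(max_context_k(16, 12))    # Gemma 4 26B on 16GB Metal budget → 32
print(max_context_k(16, 4.6))   # Gemma 4 E4B on 16GB → 128
print(max_context_k(48, 12))    # Gemma 4 26B on 48GB → 128
```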
Monitor & Maintain
localfit health # GPU health dashboard
localfit specs # machine specs
localfit cleanup # free GPU memory
localfit debloat # disable macOS services stealing GPU
localfit check # check prerequisites
localfit doctor # check if localfit broke anything
localfit restore # restore all configs from backup
Supported Platforms
| Platform | GPU Detection | Monitoring |
|---|---|---|
| macOS Apple Silicon | Metal budget, memory pressure | ioreg |
| Linux NVIDIA | nvidia-smi VRAM, temp, fan | nvidia-smi |
| Linux AMD | rocm-smi | rocm-smi |
| Windows (WSL2) | nvidia-smi via WSL | nvidia-smi |
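On the NVIDIA rows above, the stats come from nvidia-smi's machine-readable query mode. A sketch of reading it (the query fields and flags are real nvidia-smi options; the parsing helper and dictionary keys are ours):

```python
import csv
import io
import subprocess

QUERY = "--query-gpu=name,memory.total,memory.used,temperature.gpu"

def parse(csv_text):
    """Turn nvidia-smi CSV rows into dicts (MiB values, Celsius temps)."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [{"name": n, "vram_total_mib": int(t),
             "vram_used_mib": int(u), "temp_c": int(c)}
            for n, t, u, c in rows]

def gpu_stats():
    """Query the driver directly; requires nvidia-smi on PATH."""
    out = subprocess.check_output(
        ["nvidia-smi", QUERY, "--format=csv,noheader,nounits"], text=True)
    return parse(out)

# Example of the CSV shape these flags produce:
print(parse("NVIDIA GeForce RTX 3090, 24576, 1021, 44\n"))
```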
Requirements
pipx install 'localfit[all]' # includes TUI + HF downloads
License
Apache-2.0
Download files
File details
Details for the file localfit-0.7.1.tar.gz.
File metadata
- Download URL: localfit-0.7.1.tar.gz
- Upload date:
- Size: 3.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ff9ae37a4418484ba954c8834c6fbc4a23f5b148cdf1484414b00bb172c7e324 |
| MD5 | 14a0a5797ba65f65f0126966c834105f |
| BLAKE2b-256 | c41b378a51ae0f79088580ad9bf401abe71fd91873acbf9d0278e215b9875b77 |
File details
Details for the file localfit-0.7.1-py3-none-any.whl.
File metadata
- Download URL: localfit-0.7.1-py3-none-any.whl
- Upload date:
- Size: 123.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b0def0eb1986fb2216c616a9ac6319b5463cec2abf2f17374a153f50da36ffe |
| MD5 | 6a798847b5cb4dcb8d45723aa7aaf977 |
| BLAKE2b-256 | b064352762a1c2e8b3419fe4c1f758dccf97cd9186a9389fe418efab12247fcc |