
LLM Turbo-Optimizer CLI. detect hardware, parse GGUF models, generate optimized llama-server commands.


clanker

generates llama-server commands so you don't have to guess how many layers fit on your gpu. reads the gguf header, does the math, prints the command.

i got tired of trial-and-error vram tuning for moe models on consumer gpus.

Setup

requires python 3.10+ and pip.

pip install clanker-gguf

or install from source:

pip install git+https://github.com/Stavros-alt/clanker.git
pip install "clanker-gguf[all] @ git+https://github.com/Stavros-alt/clanker.git"

or clone and install locally:

git clone https://github.com/Stavros-alt/clanker.git
cd clanker
pip install -e ".[all]"

you also need:

  • nvidia-smi on PATH (cpu-only mode works without it)
  • llama-server from ik_llama.cpp (build it with clanker build, or install manually)
  • windows? good luck. this probably works on wsl.

Usage

clanker run <model>

clanker run ~/models/qwen.gguf          # direct path
clanker run qwen                         # fuzzy cache search
clanker run qwen.gguf --preset speed     # apply a preset
clanker run qwen.gguf --context 65536    # override context
clanker run qwen.gguf --execute          # run it

defaults to 8K context, q8_0 KV cache, no flash attention. pass --preset big-brain for the optimized 128k config.

clanker download <repo>

clanker download unsloth/Qwen3.6-35B-A3B-GGUF            # scan only
clanker download unsloth/Qwen3.6-35B-A3B-GGUF --execute   # download
clanker download unsloth/Qwen3.6-35B-A3B-GGUF -c 32768    # budget for 32k ctx

picks the highest-quality quant that fits your ram, with preference for models whose backbone fits on gpu. --fit handles the offloading at runtime.

quality warnings: if the best available quant is below IQ4_XS (the kneedle knee point from KLD benchmarks), clanker aborts and tells you to use a smaller model or a different repo. accuracy drops off a cliff below IQ4_XS (78.4% top-1 @ 16.4 gb).
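a minimal sketch of the selection rule described above: take the best quant that fits the ram budget, and abort if that best one is below IQ4_XS. the quality ladder, the function name pick_quant and the sizes in the usage line are assumptions for illustration, not clanker's internals, and the gpu-backbone preference is left out.

```python
# illustrative quality ladder, worst to best. this ordering is an assumption
# for the sketch, not clanker's internal table.
QUALITY_ORDER = ["IQ2_XS", "IQ3_XS", "IQ4_XS", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]
QUALITY_FLOOR = "IQ4_XS"  # from the text: anything below this aborts


def pick_quant(sizes_gb: dict[str, float], ram_budget_gb: float) -> str:
    """Return the best-ranked quant whose file fits the budget, or raise."""
    rank = {name: i for i, name in enumerate(QUALITY_ORDER)}
    # keep only quants we can rank and that fit in the budget
    fitting = [q for q, size in sizes_gb.items()
               if q in rank and size <= ram_budget_gb]
    if not fitting:
        raise RuntimeError("nothing fits in the ram budget")
    best = max(fitting, key=lambda q: rank[q])
    if rank[best] < rank[QUALITY_FLOOR]:
        raise RuntimeError(
            f"best fitting quant {best} is below {QUALITY_FLOOR}; "
            "use a smaller model or a different repo"
        )
    return best


# made-up file sizes, 24 gb budget
choice = pick_quant({"Q8_0": 36.9, "Q4_K_M": 21.0, "IQ3_XS": 14.0}, 24.0)
```

with a 16 gb budget the same call raises, because only IQ3_XS fits and it sits below the floor.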

unsloth _XL warning: some unsloth _XL quants contain f16 tensors that crash ik_llama.cpp. clanker warns and lets you filter out the risky quant to pick the next best.

Other commands

clanker discover         # scan hf cache for gguf models
clanker discover --json  # json output
clanker ls               # alias for discover
clanker search           # open hf search with hardware-tuned bounds
clanker presets          # list all presets with their settings
clanker build            # clone and compile ik_llama.cpp

clanker search caps results by vram budget at IQ4_XS quality. if your kv cache eats all the vram (128k ctx on 12gb cards), it warns that models will offload entirely to ram.
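to see why a long context can eat the whole card, here is a rough kv cache estimate. it uses the standard formula for a plain attention model (k and v each store layers x kv_heads x head_dim x ctx elements) and the ggml block sizes for each cache type. the function name and the model shape in the usage line are hypothetical, and this is not necessarily the exact math clanker does (hybrid or sliding-window architectures differ).

```python
# bytes per 32-element block for some ggml cache types
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_1": 24, "q5_0": 22, "q4_1": 20}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, ctx: int,
                   k_type: str = "q8_0", v_type: str = "q8_0") -> int:
    """Estimate kv cache size in bytes for a plain attention model."""
    # element count for one of the two tensors (k or v)
    elems = n_layers * n_kv_heads * head_dim * ctx
    k_bytes = elems * BLOCK_BYTES[k_type] // 32
    v_bytes = elems * BLOCK_BYTES[v_type] // 32
    return k_bytes + v_bytes


# hypothetical model: 32 layers, 8 kv heads, head_dim 128, 128k context
size = kv_cache_bytes(32, 8, 128, 131072, k_type="q8_0", v_type="q5_1")
print(f"{size / 2**30:.2f} GiB")  # 7.25 GiB
```

for that made-up shape the cache alone takes more than half of a 12 gb card before any weights are loaded.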

Presets

all presets enable mtp speculative decoding when the model supports it.

Name       Context  KV Cache        Notes
big-brain  128K     k=q8_0, v=q5_1  mlock, no-mmap, flash-attn
speed      32K      k=q6_0, v=q5_0  mlock, flash-attn
infinite   512K     k=q5_0, v=q4_1  no-mmap, flash-attn
coding     64K      k=q8_0, v=q5_1  mlock, flash-attn, temp=0.2

kv cache quants are from KLD benchmarks (anbeeld). mixed k/v pairs outperform symmetric ones on the pareto frontier.
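the preset table maps naturally onto a small lookup. this sketch is an assumption about how it could be represented, not clanker's source. PRESETS and kv_flags are hypothetical names, and only the context and cache-type flags (-c, -ctk, -ctv from llama.cpp) are rendered; the mlock, mmap, flash-attn and temperature settings are left out.

```python
K = 1024  # the docs use 65536 for 64k, so K means 1024 tokens

# values copied from the preset table
PRESETS = {
    "big-brain": {"context": 128 * K, "k": "q8_0", "v": "q5_1"},
    "speed":     {"context": 32 * K,  "k": "q6_0", "v": "q5_0"},
    "infinite":  {"context": 512 * K, "k": "q5_0", "v": "q4_1"},
    "coding":    {"context": 64 * K,  "k": "q8_0", "v": "q5_1"},
}


def kv_flags(preset: str) -> list[str]:
    """Render context and kv cache type flags for one preset."""
    p = PRESETS[preset]
    return ["-c", str(p["context"]), "-ctk", p["k"], "-ctv", p["v"]]


print(" ".join(kv_flags("speed")))  # -c 32768 -ctk q6_0 -ctv q5_0
```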

How it works

  1. reads your gpu vram via nvidia-smi
  2. detects physical cpu cores (uses cores - 1 for thread count)
  3. parses the gguf header for architecture, layers, experts, quant type
  4. figures out layer split, kv cache size, fit-margin based on available vram
  5. spits out a llama-server command with thread tuning, ubatch tuning, and mlock
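step 3 starts with the fixed part of the gguf header. a minimal sketch of reading it, assuming gguf version 2 or later (version 1 used 32-bit counts). read_gguf_header is a hypothetical name, not clanker's api. the architecture, layer count and expert count live in the metadata key/value pairs that follow this header (keys such as general.architecture), which this sketch does not parse.

```python
import io
import struct

# little-endian: 4-byte magic, uint32 version, uint64 tensor count,
# uint64 metadata key/value count
GGUF_HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(stream) -> dict:
    """Read the fixed gguf header fields from a binary stream."""
    raw = stream.read(GGUF_HEADER.size)
    if len(raw) < GGUF_HEADER.size:
        raise ValueError("file too short to be gguf")
    magic, version, tensor_count, kv_count = GGUF_HEADER.unpack(raw)
    if magic != b"GGUF":
        raise ValueError(f"bad magic: {magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# fake header built in memory so the example runs without a model file
fake = io.BytesIO(struct.pack("<4sIQQ", b"GGUF", 3, 291, 24))
header = read_gguf_header(fake)
```

with a real model you would pass open(path, "rb") instead of the in-memory buffer.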

Notes

  • --fit --fit-margin N means "keep N MB of vram free". the base margin is 1664 MB for ik_llama and 4608 MB for mainline. mtp adds 2048 MB of overhead.
  • thread count is physical_cores - 1 (min 1). keeps one core free for gpu scheduling and os tasks.
  • -ub 2048 sets the micro-batch size for max memory-bandwidth utilization during prompt prefill.
  • mtp flags are only added when the model file has "mtp" in its name.
  • the downloader tries llama-cli -hf first, falls back to huggingface_hub. if llama-cli downloads then ooms, clanker detects the existing file and skips re-download.
  • quantization detection from gguf metadata is unreliable (returns int enums). filenames are more accurate, so clanker parses the filename.
  • combined expert tensors (qwen3, deepseek) are handled by dividing total size by expert_count from metadata.
  • --backend llama switches to mainline llama.cpp flags (spec-type, spec-draft-n-max, etc). default is ik_llama.
