LLM Turbo-Optimizer CLI. detect hardware, parse GGUF models, generate optimized llama-server commands.

clanker

generates llama-server commands so you don't have to guess how many layers fit on your gpu. reads the gguf header, does the math, prints the command.

i got tired of trial-and-error vram tuning for moe models on consumer gpus.

Setup

requires python 3.10+ and pip.

pip install clanker-gguf

or install from source:

pip install git+https://github.com/Stavros-alt/clanker.git                         # core only
pip install "clanker-gguf[all] @ git+https://github.com/Stavros-alt/clanker.git"   # with all optional extras

or clone and install locally:

git clone https://github.com/Stavros-alt/clanker.git
cd clanker
pip install -e ".[all]"

you also need:

  • nvidia-smi on PATH (cpu-only mode works without it)
  • llama-server from ik_llama.cpp (build it with clanker build, or install manually)
  • windows? good luck. this probably works on wsl.

Usage

clanker run <model>

clanker run ~/models/qwen.gguf          # direct path
clanker run qwen                         # fuzzy cache search
clanker run qwen.gguf --preset speed     # apply a preset
clanker run qwen.gguf --context 65536    # override context
clanker run qwen.gguf --execute          # run it

defaults to 8K context, q8_0 KV cache, and no flash attention. pass --preset big-brain for the optimized 128K config.

clanker download <repo>

clanker download unsloth/Qwen3.6-35B-A3B-GGUF            # scan only
clanker download unsloth/Qwen3.6-35B-A3B-GGUF --execute   # download
clanker download unsloth/Qwen3.6-35B-A3B-GGUF -c 32768    # budget for 32k ctx

picks the highest-quality quant that fits your ram, with preference for models whose backbone fits on gpu. --fit handles the offloading at runtime.
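
a rough sketch of that selection rule, not clanker's actual code. the Quant fields, the pick_quant name, the sizes, and using file size as a stand-in for quality are all assumptions for illustration. the context budget from -c is not modeled here.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class Quant:
    name: str            # quant label, e.g. "Q4_K_M" (hypothetical input)
    total_bytes: int     # size of the whole gguf file
    backbone_bytes: int  # non-expert weights (hypothetical field)


def pick_quant(candidates, ram_bytes, vram_bytes):
    """Return the preferred quant that fits in RAM, or None if nothing fits."""
    fitting = [q for q in candidates if q.total_bytes <= ram_bytes]
    if not fitting:
        return None
    # Prefer quants whose backbone fits on the GPU, then the largest file.
    # File size is only a rough proxy for quality (assumption).
    return max(fitting, key=lambda q: (q.backbone_bytes <= vram_bytes, q.total_bytes))


# Made-up sizes, purely illustrative.
quants = [
    Quant("Q8_0", 36 * GIB, 9 * GIB),
    Quant("Q5_K_M", 25 * GIB, 7 * GIB),
    Quant("Q4_K_M", 21 * GIB, 6 * GIB),
]

choice = pick_quant(quants, ram_bytes=32 * GIB, vram_bytes=8 * GIB)
print(choice.name if choice else "nothing fits")
```

sorting on a (fits-on-gpu, size) tuple keeps the gpu preference strict: a smaller quant whose backbone fits wins over a bigger one whose backbone does not.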

Other commands

clanker discover         # scan hf cache for gguf models
clanker discover --json  # json output
clanker ls               # alias for discover
clanker search           # open hf search with hardware-tuned bounds
clanker presets          # list all presets with their settings
clanker build            # clone and compile ik_llama.cpp

Presets

all presets enable mtp speculative decoding when the model supports it (detected by "mtp" in the filename).

Name       Context  KV Cache          Notes
big-brain  128K     k=q8_0, v=q5_1    mlock, no-mmap, flash-attn
speed      32K      k=q6_0, v=q5_0    mlock, flash-attn
infinite   512K     k=q5_0, v=q4_1    no-mmap, flash-attn
coding     64K      k=q8_0, v=q5_1    mlock, flash-attn, temp=0.2

kv cache quants are from KLD benchmarks (anbeeld). mixed k/v pairs outperform symmetric ones on the pareto frontier.
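
to get a feel for what the k/v pairs in the presets cost in memory, here is a back-of-the-envelope estimate. it is a sketch under assumptions: every layer uses standard attention, the model dimensions in the example are made up, and the q6_0 block size (an ik_llama.cpp type) is my assumption. the other block sizes follow the usual ggml layout of 32 values per block.

```python
# Bytes per 32-element block: fp16 scale(s) plus packed values.
# q6_0 is an ik_llama.cpp type; 26 bytes per block is an assumption.
BLOCK_BYTES = {
    "f16": 64,
    "q8_0": 34,
    "q6_0": 26,
    "q5_1": 24,
    "q5_0": 22,
    "q4_1": 20,
    "q4_0": 18,
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, k_type, v_type):
    """Estimate KV cache size for a model where every layer has a K and V cache."""
    elems = n_ctx * n_layers * n_kv_heads * head_dim  # elements in K (V is the same)
    bytes_per_elem = (BLOCK_BYTES[k_type] + BLOCK_BYTES[v_type]) / 32
    return int(elems * bytes_per_elem)


# Hypothetical model: 48 layers, 8 KV heads, head_dim 128, 128K = 131072 tokens.
for k_type, v_type in [("f16", "f16"), ("q8_0", "q5_1"), ("q5_0", "q4_1")]:
    size = kv_cache_bytes(131072, 48, 8, 128, k_type, v_type)
    print(f"k={k_type}, v={v_type}: {size / 1024**3:.2f} GiB")
```

models that mix in non-attention layers will need less than this, so treat the number as an upper-bound style estimate.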

How it works

  1. reads your gpu vram via nvidia-smi
  2. detects physical cpu cores (uses cores - 1 for thread count)
  3. parses the gguf header for architecture, layers, experts, quant type
  4. figures out layer split, kv cache size, fit-margin based on available vram
  5. spits out a llama-server command with thread tuning, ubatch tuning, and mlock
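
steps 1 and 3 can be sketched like this. it is not clanker's code: the function names are hypothetical, the nvidia-smi query flags are the standard ones, and the header layout assumes gguf version 2 or later (64-bit counts). architecture, layer and expert counts live in the metadata key/value section that follows the header, which this sketch does not decode.

```python
import struct
import subprocess


def parse_vram_csv(text):
    """Parse 'total, free' rows (MiB) from nvidia-smi CSV output, one row per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        total, free = (int(x) for x in line.split(","))
        gpus.append({"total_mib": total, "free_mib": free})
    return gpus


def query_vram():
    """Ask nvidia-smi for per-GPU memory; returns [] when it is missing or fails."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=memory.total,memory.free",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []  # treat as cpu-only
    return parse_vram_csv(out)


def read_gguf_header(data):
    """Decode the fixed 24-byte prefix: magic, version, tensor count, metadata count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a gguf file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}


if __name__ == "__main__":
    print(query_vram())
```

keeping the csv parsing separate from the subprocess call makes it easy to test on a machine without a gpu. for a real file, pass the first 24 bytes read in binary mode to read_gguf_header.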

Notes

  • --fit --fit-margin N means "keep N MB of vram free". the base margin is 1664 MB for ik_llama and 4608 MB for mainline. mtp adds another 2048 MB of overhead.
  • thread count is physical_cores - 1 (min 1). keeps one core free for gpu scheduling and os tasks.
  • -ub 2048 sets the micro-batch size for max memory-bandwidth utilization during prompt prefill.
  • mtp flags are only added when the model file has "mtp" in its name.
  • the downloader tries llama-cli -hf first and falls back to huggingface_hub. if llama-cli finishes the download and then runs out of memory, clanker detects the existing file and skips the re-download.
  • quantization detection from gguf metadata is unreliable (returns int enums). filenames are more accurate, so clanker parses the filename.
  • combined expert tensors (qwen3, deepseek) are handled by dividing total size by expert_count from metadata.
  • --backend llama switches to mainline llama.cpp flags (spec-type, spec-draft-n-max, etc). default is ik_llama.
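
a few of the rules above written out as code. this is a sketch with hypothetical names, not the real implementation. assumptions: the mtp overhead is added on top of the base margin, the "mtp" filename check is case-insensitive, and the quant regex covers only common labels.

```python
import re

BASE_MARGIN_MB = {"ik_llama": 1664, "llama": 4608}  # values from the notes
MTP_OVERHEAD_MB = 2048

# Common quant labels only; not an exhaustive list (assumption).
QUANT_RE = re.compile(
    r"(IQ\d_[A-Z]+|Q\d_K(?:_(?:XL|S|M|L))?|Q\d_\d|BF16|F16|F32)",
    re.IGNORECASE,
)


def has_mtp(filename):
    """True when the model filename mentions mtp (case-insensitive by assumption)."""
    return "mtp" in filename.lower()


def fit_margin_mb(backend, filename):
    """VRAM to keep free, in MB: base margin plus mtp overhead when applicable."""
    extra = MTP_OVERHEAD_MB if has_mtp(filename) else 0
    return BASE_MARGIN_MB[backend] + extra


def thread_count(physical_cores):
    """Leave one physical core free, but never go below one thread."""
    return max(1, physical_cores - 1)


def quant_from_filename(filename):
    """Pull the quant label out of a gguf filename, or None if none is found."""
    match = QUANT_RE.search(filename)
    return match.group(1).upper() if match else None


print(fit_margin_mb("ik_llama", "qwen-mtp-Q8_0.gguf"))
print(thread_count(8))
print(quant_from_filename("Qwen3.6-35B-A3B-Q4_K_M.gguf"))
```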

Download files

Source Distribution

clanker_gguf-1.0.4.tar.gz (27.3 kB)

Uploaded Source

Built Distribution

clanker_gguf-1.0.4-py3-none-any.whl (25.2 kB)

Uploaded Python 3

File details

Details for the file clanker_gguf-1.0.4.tar.gz.

File metadata

  • Download URL: clanker_gguf-1.0.4.tar.gz
  • Upload date:
  • Size: 27.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for clanker_gguf-1.0.4.tar.gz
Algorithm Hash digest
SHA256 fbc5336fe5b687905ad2b1018917620dec579f182c4d5fd809f75d4d5a2c6752
MD5 db6c0d4d7f02d41cc3516f2ce9917b1f
BLAKE2b-256 1ac836fce056e3f645e94de9b6c547e91d3166f01fcb10ade9d506fd998761b7

Provenance

The following attestation bundles were made for clanker_gguf-1.0.4.tar.gz:

Publisher: publish.yml on Stavros-alt/clanker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file clanker_gguf-1.0.4-py3-none-any.whl.

File metadata

  • Download URL: clanker_gguf-1.0.4-py3-none-any.whl
  • Upload date:
  • Size: 25.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for clanker_gguf-1.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 e3f52d7f2cf6eab72ef3ab3dd641109b4349a7d4a8188f1a1ee6db672dc16807
MD5 5f20138460892664e8133a7543ef0dba
BLAKE2b-256 6bc99c481a647268731b91d8b8bda595a3e9777592b4daea8671b81161ec5d75

Provenance

The following attestation bundles were made for clanker_gguf-1.0.4-py3-none-any.whl:

Publisher: publish.yml on Stavros-alt/clanker
