
llm-cal

CI PyPI Docs License

LLM inference hardware calculator — architecture-aware, engine-version-aware, honest-labeled.

English · 中文 · Docs · 中文文档

Give it a HuggingFace / ModelScope model id and a GPU, get back:

  • real weight size (summed from safetensors API, not params × precision)
  • architecture profile — MHA / GQA / MQA / MLA / NSA / CSA+HCA, MoE active-expert ratio, sliding window, tied embeddings
  • KV cache per request at multiple context lengths, with TP-aware sharding (sketched after this list)
  • fleet size: min / dev / prod tiers that respect num_heads TP divisibility
  • prefill latency + decode throughput with named coefficients and citations
  • K/L concurrency bounds with bottleneck classification (memory vs compute vs bandwidth)
  • engine compatibility from a curated matrix (vLLM + SGLang × 16 model families × 32 entries)
  • a ready-to-paste vllm serve or sglang launch_server command
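
As a rough illustration of the KV-cache bullet above, here is a minimal Python sketch. The per-token formula and the min(tp_size, kv_heads) sharding rule follow the Methodology section below; the function name and example numbers are illustrative assumptions, not llm-cal's actual code.

def kv_cache_gb(layers, kv_heads, head_dim, context_len, dtype_bytes=2, tp_size=1):
    """Per-request KV cache: K and V, one entry per layer, per KV head, per token."""
    total = 2 * layers * kv_heads * head_dim * context_len * dtype_bytes
    # Tensor parallelism can only shard across the available KV heads.
    per_gpu = total / min(tp_size, kv_heads)
    return per_gpu / 1024**3

# e.g. a GQA model: 64 layers, 8 KV heads, head_dim 128, 32K context, FP16 cache, TP=4
print(f"{kv_cache_gb(64, 8, 128, 32_768, tp_size=4):.1f} GB per GPU per request")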

Every number in the output carries a provenance label. --explain prints the full derivation trace. --llm-review (opt-in) sends the trace to any OpenAI-compatible endpoint for a second opinion.


Why another calculator?

Existing tools (gpu_poor, llm-vram-calculator, APXML, SelfHostLLM, ...) compute weight size with params × precision. That silently fails on mixed-precision quantization:

Model                              gpu_poor               Real safetensors   llm-cal
DeepSeek-V4-Flash (FP4+FP8 pack)   284 GB (FP8 assumed)   160 GB             160 GB
DeepSeek-V3 (pure FP8)             685 GB                 688 GB             688 GB
Qwen2.5-72B (FP16)                 140 GB                 145 GB             145 GB

llm-cal reads real bytes from the HF API, reconciles against every known quantization scheme, picks the best match, and surfaces ties when multiple schemes share the same bits/param:

Quantization reconciliation
  FP4_FP8_MIXED    160.01 GB   0.2%  ← wins (tied with GPTQ_INT4, AWQ_INT4
  GPTQ_INT4        160.01 GB   0.2%    at bpp=0.55 — need per-tensor dtype
  AWQ_INT4         160.01 GB   0.2%    to distinguish, deferred to v0.2)
  FP8              290.94 GB  45.1%  ← the gpu_poor trap
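
As a hedged sketch of the idea behind the reconciliation above, the snippet below sums real safetensors bytes via the Hugging Face Hub API and ranks candidate quantization schemes by how close their bits/param is to the observed value. The candidate table and its numbers are illustrative assumptions, not llm-cal's actual registry.

from huggingface_hub import HfApi

def real_weight_bytes(repo_id: str) -> int:
    """Sum actual .safetensors file sizes instead of assuming params x precision."""
    info = HfApi().model_info(repo_id, files_metadata=True)
    return sum(s.size for s in info.siblings if s.rfilename.endswith(".safetensors"))

def reconcile(byte_count: int, param_count: int) -> list[tuple[str, float]]:
    """Rank candidate schemes by the gap between their bits/param and the observed value."""
    observed_bpp = byte_count * 8 / param_count
    candidates = {"FP16": 16.0, "FP8": 8.0, "GPTQ_INT4": 4.4, "AWQ_INT4": 4.4}  # illustrative
    return sorted(candidates.items(), key=lambda kv: abs(kv[1] - observed_bpp))

An FP8 assumption over-counts a 4-bit pack by roughly a factor of two, which is exactly the trap the comparison table above illustrates.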

This tie was caught during dogfood testing by --llm-review, running MiniMax-M2 against the tool's own output. It was the first real bug found by LLM review, fixed in v0.1.0.


The honesty principle — 7 labels

Every number carries one of these:

Label           Meaning                                          Example
[verified]      Direct read from API or file                     safetensors bytes: 159.62 GB
[inferred]      One-step derivation from verified data           bits/param: 4.39 (bytes ÷ params)
[estimated]     Formula-based, coefficient from a source         prefill latency: 735 ms
[cited]         From a paper / PR / release note                 vLLM ≥0.19.0 supports CSA+HCA
[unverified]    Matrix entry without evidence, flagged           SGLang day-0 support pending
[unknown]       Graceful degrade — unknown model type            New model_type not in registry
[llm-opinion]   Opt-in LLM audit, never overrides the 6 above    --llm-review output only

The first 6 labels are deterministic. [llm-opinion] is explicitly tagged as non-authoritative.
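
One way to picture the honesty principle is as a value type that cannot exist without a label. The sketch below is a minimal illustration of that idea, not llm-cal's internal representation.

from dataclasses import dataclass
from typing import Literal

Label = Literal["verified", "inferred", "estimated", "cited",
                "unverified", "unknown", "llm-opinion"]

@dataclass(frozen=True)
class LabeledValue:
    value: float | str
    label: Label
    source: str  # where the number came from: API field, formula, paper, ...

weights = LabeledValue(159.62, "verified", "safetensors bytes via HF API")
prefill = LabeledValue(735, "estimated", "2 * params * input_tokens (Kaplan et al. 2020)")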


Install

Python 3.11+.

# pipx (cleanest)
pipx install git+https://github.com/FlyTOmeLight/llm-cal.git@v0.1.0

# uv
uv tool install git+https://github.com/FlyTOmeLight/llm-cal.git@v0.1.0

# pip
pip install git+https://github.com/FlyTOmeLight/llm-cal.git@v0.1.0

Gated models (Llama, Gemma):

export HF_TOKEN=hf_...

Mainland China HF mirror:

export HF_ENDPOINT=https://hf-mirror.com

Quickstart

# Basic evaluation
llm-cal deepseek-ai/DeepSeek-V4-Flash --gpu H800

# Chinese output + longer context
llm-cal Qwen/Qwen2.5-72B-Instruct --gpu A100-80G --context-length 32768 --lang zh

# Full derivation trace (every formula + input + step + source)
llm-cal mistralai/Mixtral-8x7B-v0.1 --gpu H100 --explain

# LLM audit of the derivation (opt-in, needs env vars)
export LLM_CAL_REVIEWER_API_KEY=sk-...
export LLM_CAL_REVIEWER_BASE_URL=https://api.deepseek.com/v1
export LLM_CAL_REVIEWER_MODEL=deepseek-chat
llm-cal deepseek-ai/DeepSeek-V3 --gpu H800 --explain --llm-review

# All 53 supported GPUs
llm-cal --list-gpus

# Run the curated benchmark (8 models × 33 checks vs reference truth)
llm-cal --benchmark

Abbreviated output:

┌─ deepseek-ai/DeepSeek-V4-Flash  via huggingface @ 6c858e7 ─┐

Architecture
  model_type         deepseek_v4                             [verified]
  attention          CSA_HCA (heads=64, kv_heads=1, hd=512)  [verified]
  moe                256 routed + 1 shared, top-6            [verified]
  sliding_window     128                                     [verified]

Weights
  safetensors bytes  159.62 GB      [verified]
  quantization       FP4_FP8_MIXED  [inferred]  (tied with GPTQ_INT4, AWQ_INT4)

Fleet — H800
  tier       GPUs    concurrent @ 128K    concurrent @ 1.0M
  min          4           ~14                  ~1
  dev ★        4           ~14                  ~1
  prod         8           ~23                  ~2

Performance — dev tier (4× H800)
  prefill latency   735 ms @ 2000 input tokens     [estimated, Kaplan 2020]
  decode throughput 48 tok/s per user              [estimated, Kwon SOSP 2023]
  bottleneck        memory bandwidth               [inferred]

Generated command
  vllm serve deepseek-ai/DeepSeek-V4-Flash \
    --tensor-parallel-size 4 --max-model-len 1048576 \
    --trust-remote-code --gpu-memory-utilization 0.9 \
    --attention-backend auto

CLI reference

llm-cal [MODEL_ID] [OPTIONS]

Core:
  --gpu TEXT                     GPU id (see --list-gpus). Aliases accepted, case-insensitive.
  --engine [vllm|sglang]         Default: vllm
  --gpu-count INT                Force fleet size (skips min/dev/prod auto-pick)
  --context-length INT           Context length for KV cache estimation
  --lang [en|zh]                 Output language (default: auto-detect from LANG)

Performance tuning (all have honest defaults — see docs/methodology.md):
  --input-tokens INT             Prefill input budget. Default: 2000
  --output-tokens INT            Decode output budget. Default: 512
  --target-tokens-per-sec FLOAT  SLA for per-user decode. Default: 30
  --prefill-util FLOAT           Compute utilization factor. Default: 0.40
  --decode-bw-util FLOAT         Memory-BW utilization factor. Default: 0.50
  --concurrency-degradation FLOAT  High-load efficiency loss. Default: 1.0 (honest baseline)

Introspection:
  --explain                      Print full derivation trace for every non-trivial number
  --llm-review                   Send derivation to LLM for second opinion (opt-in)
                                 Requires: LLM_CAL_REVIEWER_API_KEY / _BASE_URL / _MODEL

Meta:
  --list-gpus                    List all 53 supported GPUs and exit
  --benchmark                    Run the curated dataset (8 models × 33 checks)
  --refresh                      Bypass cache, re-fetch from HF/ModelScope

Supported hardware (53 GPUs)

Vendor                      Models
NVIDIA                      B200, GB200, H100, H800, H200, H20, GH200, L40S, L40, L4, RTX6000-Ada, RTX4090, A100-80G/40G, A40, A10, A10G, V100-SXM2/PCIe-32G, T4
AMD                         MI325X, MI300X, MI250X, MI210
Intel Habana                Gaudi3, Gaudi2
Huawei Ascend (华为昇腾)      910A, 910B1, 910B2, 910B3, 910B4, 910C, Atlas-300I-Duo
MetaX (沐曦)                 MXC500, MXC550
Kunlunxin (昆仑芯)            Kunlun-P800, Kunlun-R200
Biren (壁仞)                 BR100, BR104
Iluvatar CoreX (天数智芯)     BI-V100
Moore Threads (摩尔线程)      MTT-S4000, MTT-S3000, MR-V100
Cambricon (寒武纪)            MLU370-X8, MLU590
Hygon (海光)                 K100-AI, Z100

Each entry carries spec_source (vendor page, datasheet, or verified benchmark URL) and bilingual notes.

Full details: llm-cal --list-gpus. Missing one? PR src/llm_cal/hardware/gpu_database.yaml — data-only change, no code.


Engine × architecture matrix (32 entries / 16 families)

Covers vLLM 0.6–0.19 and SGLang 0.4–0.5:

  • Dense: llama, mistral, qwen2, qwen3, phi, gemma, internlm
  • MoE: mixtral, qwen3_moe, deepseek_v3, deepseek_v3_2, deepseek_v4, phi_moe
  • Sparse attention: deepseek_v3_2 (NSA), deepseek_v4 (CSA+HCA)
  • Sliding window: mistral, qwen3_moe

Every matrix entry carries verification_level (verified / cited / unverified) and sources[] with URL + captured_date. v0.1 has no verified entries — the author has no test hardware. Community-tested contributions are welcome.

Full matrix: src/llm_cal/engine_compat/matrix.yaml.


Benchmark (8 models × 33 checks)

llm-cal --benchmark runs the curated dataset and compares tool output against reference truth (HF API sizes, model card claims, vLLM recipes).

Model                           Ref weight   llm-cal     Quant
deepseek-ai/DeepSeek-V4-Flash   160 GB       159.62 GB   FP4_FP8_MIXED
deepseek-ai/DeepSeek-V3         688 GB       688.59 GB   FP8
deepseek-ai/DeepSeek-V3.2       688 GB       687.84 GB   FP8 (NSA)
Qwen/Qwen2.5-72B-Instruct       145 GB       145.41 GB   FP16
Qwen/Qwen3-30B-A3B              61 GB        60.82 GB    FP16 (MoE)
Qwen/Qwen2.5-7B                 14.2 GB      14.19 GB    FP16
mistralai/Mixtral-8x7B-v0.1     93 GB        93.41 GB    FP16 (MoE)
microsoft/Phi-4                 28 GB        28.17 GB    FP16

Exit code 0 on all-pass, 1 on any FAIL. Runnable in CI.


Methodology

Every formula and coefficient has a primary source; there are no magic numbers. A short numeric sketch follows the list.

  • Prefill FLOPs: 2 × params × input_tokens (Kaplan et al. 2020, Scaling Laws for Neural Language Models)
  • Decode throughput: bandwidth × util / weight_bytes (Kwon et al. SOSP 2023, Efficient Memory Management for LLM Serving with PagedAttention)
  • KV cache layout: matches vLLM PagedAttention and SGLang RadixAttention source behavior
  • TP sharding: per_gpu_KV = total_KV / min(tp_size, num_kv_heads) — empirically verified against vLLM runtime
  • Utilization coefficients: prefill_util=0.40, decode_bw_util=0.50, concurrency_degradation=1.0 (honest defaults; override per-workload via CLI flags)
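
To make the first two formulas concrete, here is a back-of-the-envelope Python sketch. The GPU figures are nominal H800-class numbers and the model size is illustrative; it reproduces the shape of the estimator under the default utilization coefficients, not its exact code.

def prefill_latency_ms(active_params, input_tokens, gpu_tflops, gpu_count, prefill_util=0.40):
    """Prefill FLOPs = 2 * params * input_tokens (Kaplan et al. 2020), spread across the fleet."""
    flops = 2 * active_params * input_tokens
    return flops / (gpu_tflops * 1e12 * prefill_util * gpu_count) * 1000

def decode_tokens_per_s(weight_bytes, gpu_bw_gbs, gpu_count, decode_bw_util=0.50):
    """Each decoded token streams the (sharded) weights from HBM once."""
    per_gpu_bytes = weight_bytes / gpu_count
    return gpu_bw_gbs * 1e9 * decode_bw_util / per_gpu_bytes

# Illustrative numbers only: ~37B active params, 160 GB of weights, 4 GPUs at ~990 TFLOPs / ~3350 GB/s
print(f"prefill  ~{prefill_latency_ms(37e9, 2000, 990, 4):.0f} ms")
print(f"decode   ~{decode_tokens_per_s(160e9, 3350, 4):.0f} tok/s per user")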

Full writeup with citations: docs/methodology.md · 中文.


Documentation


Scope of v0.1

Shipped:

  • HuggingFace + ModelScope as model sources, real bytes from safetensors metadata
  • Architecture detection: Dense / MoE / GQA / MQA / MLA / NSA / CSA+HCA / Sliding Window
  • KV cache with traits composition, TP-aware sharding
  • Fleet planner (min/dev/prod, TP divisibility)
  • Prefill / decode performance estimator
  • K/L concurrency bounds + bottleneck classification
  • Engine compat matrix (vLLM + SGLang, 32 entries)
  • Command generator (vLLM + SGLang with required flags)
  • Bilingual output (en / zh) with label localization
  • --explain derivation trace
  • --llm-review opt-in LLM audit (any OpenAI-compatible endpoint)
  • --benchmark curated regression suite
  • --list-gpus discovery
  • 53-GPU database with spec_source traceability

v0.2 roadmap:

  • Per-tensor dtype read from safetensors metadata (resolves the FP4 / GPTQ / AWQ tie)
  • Lazy matrix loading when entries > 100
  • Ollama / GGUF support
  • Multimodal models (Qwen-VL, InternVL)
  • LoRA / adapter VRAM math
  • --offline mode for air-gapped environments
  • Community-contributed verified matrix entries (requires real hardware runs)

Contributing

Especially welcome:

  1. New GPUs: src/llm_cal/hardware/gpu_database.yaml (data only, no code)
  2. New engine entries: src/llm_cal/engine_compat/matrix.yaml with sources[]
  3. New model architectures: 10-step checklist
  4. verified matrix entries — if you have real hardware and can run a config, send us the tested result

See CONTRIBUTING.md for dev setup.


License

Apache-2.0. See LICENSE.
