Estimate, benchmark, and generate fine-tuning recipes for LLMs on consumer GPUs.
Project description
can-i-finetune-this
Estimate, benchmark, and generate fine-tuning recipes for LLMs on consumer GPUs.
You have one consumer-grade NVIDIA GPU. You want to fine-tune an open-weight LLM with LoRA or QLoRA, but you do not want to download 14 GB of weights just to discover that your 12 GB / 16 GB / 24 GB card OOMs on step 1.
canifinetune answers, before you spend the disk and the time:
- Can I fine-tune this model?
- About how much VRAM will it use?
- What batch size / sequence length / LoRA rank / quantization should I use?
- If I can't, how should I downsize?
- Is there local benchmark evidence for that answer?
- Can I get a ready-to-run Hugging Face + PEFT + TRL training script for that config?
It is a single Python package with a CLI:
canifinetune doctor
canifinetune estimate --model Qwen/Qwen2.5-1.5B-Instruct --method qlora --gpu-vram-gb 16 --seq-len 2048 --micro-batch-size 1 --lora-rank 16
canifinetune recommend --model Qwen/Qwen2.5-1.5B-Instruct --gpu-vram-gb 16
canifinetune bench --model sshleifer/tiny-gpt2 --method lora --steps 3
canifinetune calibrate --benchmarks benchmarks/results
canifinetune recipe --model Qwen/Qwen2.5-1.5B-Instruct --method qlora --output recipes/qwen2.5-1.5b-qlora-4080
canifinetune report --benchmarks benchmarks/results --out report.md
canifinetune compare --benchmarks benchmarks/results --out compare.md
What canifinetune estimate actually prints:
+-------- Qwen/Qwen2.5-1.5B-Instruct (qlora) --------+
| feasible: YES ratio = 0.20 confidence = medium |
+------------------------------------------------------+
Memory breakdown (GB)
+---------------------------------+
| Component | Value |
|-----------------------+---------|
| static model | 0.737 |
| quantization overhead | 0.018 |
| trainable params | 4.4 MB |
| gradients | 0.008 |
| optimizer states | 0.010 |
| activations | 0.328 |
| CUDA / fragmentation | 1.280 |
| safety margin | 0.800 |
| total | 3.163 |
+---------------------------------+
Static estimate says 3.16 GB; on a real RTX 4080 the same config measures
7.10 GB (heavy bitsandbytes unpacking buffers at seq_len=2048). canifinetune bench and canifinetune calibrate close that gap on your machine —
that is the point of the project.
Install
canifinetune runs in two layers:
| Layer | Install | What you get |
|---|---|---|
| Core (estimate / recommend / recipe / report) | pip install canifinetune |
All CLI commands. No PyTorch required. |
| Training (bench / real fine-tuning) | pip install canifinetune[train] |
Adds torch, transformers, peft, bitsandbytes, trl, datasets. |
| Reporting extras | pip install canifinetune[report] |
Pandas/tabulate for prettier tables. |
| Development | pip install canifinetune[dev] |
pytest, ruff, mypy. |
If you use uv:
uv venv
uv pip install -e ".[dev,report]"
# Add training deps when you want to run benchmarks:
uv pip install -e ".[dev,train,report]"
PyTorch should generally be installed with the CUDA wheel that matches your driver, e.g.
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
See docs/troubleshooting.md for Windows / WSL / bitsandbytes specifics.
Quickstart
# 1. See what your machine looks like
canifinetune doctor
# 2. Ask if a model fits on your card
canifinetune estimate \
--model Qwen/Qwen2.5-1.5B-Instruct \
--method qlora \
--gpu-vram-gb 16 \
--seq-len 2048 \
--micro-batch-size 1 \
--lora-rank 16
# 3. Have it search for a feasible config
canifinetune recommend --model Qwen/Qwen2.5-1.5B-Instruct --gpu-vram-gb 16
# 4. Run a tiny real benchmark (downloads sshleifer/tiny-gpt2, ~5 MB)
canifinetune bench --model sshleifer/tiny-gpt2 --method lora --steps 3
# 5. Generate a ready-to-run training recipe
canifinetune recipe \
--model Qwen/Qwen2.5-1.5B-Instruct \
--method qlora \
--seq-len 2048 \
--output recipes/qwen2.5-1.5b-qlora-4080
What's different from accelerate estimate-memory?
accelerate estimate-memory tells you how much memory loading a model takes.
That is not enough to know whether you can train it.
This project tries to answer the harder question. It models:
- Model weights, in fp32 / fp16 / bf16 / int8 / NF4 + double-quant
- LoRA / QLoRA trainable parameter count for typical
target_modules - Gradients only for trainable parameters
- AdamW vs 8-bit / paged AdamW optimizer states
- Activations as a function of
seq_len,batch_size,hidden_size,num_layers, with and without gradient checkpointing - A fragmentation / CUDA / buffer safety margin
- A feasibility decision against your actual GPU
- Concrete degradation suggestions when not feasible
Estimates are always marked with an assumptions block and a confidence
level, because activation memory in particular is hard to predict statically.
Run canifinetune bench and canifinetune calibrate to ground them in real
measurements on your machine.
RTX 4080 baselines
docs/rtx4080_baselines.md contains real measurements collected on a single
RTX 4080 (16 GB). These are not synthetic. If a configuration was not run, the
table says "not run", not a guessed number.
Highlights (more in the doc):
| model | method | seq_len | measured peak | tok/sec |
|---|---|---|---|---|
Qwen/Qwen2.5-0.5B-Instruct |
qlora | 1024 | 3.30 GB | 1995 |
Qwen/Qwen2.5-1.5B-Instruct |
qlora | 1024 | 4.36 GB | 1352 |
Qwen/Qwen2.5-1.5B-Instruct |
qlora | 2048 | 7.10 GB | 1470 |
Qwen/Qwen2.5-3B-Instruct |
qlora | 1024 | 5.54 GB | 1158 |
sshleifer/tiny-gpt2 (smoke) |
lora | 128 | 0.12 GB | 1735 |
Repository layout
src/canifinetune/ # package code (estimator, bench, recipes, reports, cli)
benchmarks/ # configs/, results/ (JSON), calibration/
docs/ # design, memory model, troubleshooting
examples/ # end-to-end recipe folders
tests/ # pytest tests (CPU-only, no large downloads)
scripts/ # helper scripts for collecting baselines
.github/workflows/ # CI (ruff + pytest on CPU)
Roadmap
The current scope is "single consumer GPU, single node, LoRA / QLoRA, causal LM, Hugging Face stack". Possible directions, none committed:
- DeepSpeed ZeRO and FSDP estimation for multi-GPU setups
- Heuristics for sequence-classification / encoder-decoder training
- Throughput modeling (tokens / sec), not just feasibility
- Auto-tuning of
gradient_accumulation_stepsfor a target effective batch size - A web UI on top of the CLI
Contributions welcome.
License
MIT. See LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file canifinetune-0.1.0.tar.gz.
File metadata
- Download URL: canifinetune-0.1.0.tar.gz
- Upload date:
- Size: 46.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f5b815d7962af7db9dae8af68f3a5b691bb99d5532383ffa956845b1887588bf
|
|
| MD5 |
ae6cbdc374b8ed9756381af13b3c9100
|
|
| BLAKE2b-256 |
db590d1c98e8021c8df448d9c484931b4527dc74791c8b3e335f51317e4e799b
|
File details
Details for the file canifinetune-0.1.0-py3-none-any.whl.
File metadata
- Download URL: canifinetune-0.1.0-py3-none-any.whl
- Upload date:
- Size: 60.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
51481f0a0feace835b10b7b71faccf1f133f679cbce1d05cf5ccaefd71b0faf6
|
|
| MD5 |
0f62a76f754fe1a7e16ba55d71313db0
|
|
| BLAKE2b-256 |
db1d895c65074cd83c293ae235535c4be9f9b19fc49c5218d9c0a940120a49a7
|