can-i-finetune-this

Estimate, benchmark, and generate fine-tuning recipes for LLMs on consumer GPUs.

[Figure: can-i-finetune-this architecture]

You have one consumer-grade NVIDIA GPU. You want to fine-tune an open-weight LLM with LoRA or QLoRA, but you do not want to download 14 GB of weights just to discover that your 12 GB / 16 GB / 24 GB card OOMs on step 1.

canifinetune answers these questions before you spend the disk space and the time:

Can I fine-tune this model?
About how much VRAM will it use?
What batch size / sequence length / LoRA rank / quantization should I use?
If I can't, how should I downsize?
Is there local benchmark evidence for that answer?
Can I get a ready-to-run Hugging Face + PEFT + TRL training script for that config?

It is a single Python package with a CLI:

canifinetune doctor
canifinetune estimate --model Qwen/Qwen2.5-1.5B-Instruct --method qlora --gpu-vram-gb 16 --seq-len 2048 --micro-batch-size 1 --lora-rank 16
canifinetune recommend --model Qwen/Qwen2.5-1.5B-Instruct --gpu-vram-gb 16
canifinetune bench    --model sshleifer/tiny-gpt2 --method lora --steps 3
canifinetune calibrate --benchmarks benchmarks/results
canifinetune recipe   --model Qwen/Qwen2.5-1.5B-Instruct --method qlora --output recipes/qwen2.5-1.5b-qlora-4080
canifinetune report   --benchmarks benchmarks/results --out report.md
canifinetune compare  --benchmarks benchmarks/results --out compare.md

What canifinetune estimate actually prints:

+-------- Qwen/Qwen2.5-1.5B-Instruct  (qlora) --------+
| feasible: YES    ratio = 0.20    confidence = medium |
+------------------------------------------------------+
       Memory breakdown (GB)
+---------------------------------+
| Component             |   Value |
|-----------------------+---------|
| static model          |   0.737 |
| quantization overhead |   0.018 |
| trainable params      |  4.4 MB |
| gradients             |   0.008 |
| optimizer states      |   0.010 |
| activations           |   0.328 |
| CUDA / fragmentation  |   1.280 |
| safety margin         |   0.800 |
| total                 |   3.163 |
+---------------------------------+

The static estimate is 3.16 GB; on a real RTX 4080 the same config measures 7.10 GB, more than twice as much (heavy bitsandbytes unpacking buffers at seq_len=2048). canifinetune bench and canifinetune calibrate close that gap on your machine, which is the point of the project.
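To make the size of that gap concrete, here is a minimal arithmetic sketch using only the numbers quoted above. It assumes the printed `ratio` is the estimated total divided by the card's VRAM (3.163 / 16 ≈ 0.20 is consistent with that), and the single multiplicative `correction_factor` is a hypothetical illustration, not a description of how `canifinetune calibrate` works internally.

```python
# Numbers copied from the example output and the RTX 4080 measurement above.
estimated_total_gb = 3.163
measured_peak_gb = 7.10
gpu_vram_gb = 16.0


def feasibility_ratio(total_gb: float, vram_gb: float) -> float:
    """Fraction of the card's VRAM that a given footprint would occupy."""
    return total_gb / vram_gb


def correction_factor(measured_gb: float, estimated_gb: float) -> float:
    """Hypothetical one-number calibration: how far off the static estimate was."""
    return measured_gb / estimated_gb


ratio = feasibility_ratio(estimated_total_gb, gpu_vram_gb)
measured_ratio = feasibility_ratio(measured_peak_gb, gpu_vram_gb)
factor = correction_factor(measured_peak_gb, estimated_total_gb)

print(f"estimated ratio      = {ratio:.2f}")           # 0.20
print(f"measured ratio       = {measured_ratio:.2f}")  # 0.44
print(f"measured / estimated = {factor:.2f}")          # 2.24
```

Even at the measured 7.10 GB the config still fits a 16 GB card, but a correction of this size would flip the answer on a smaller card, which is why local measurements matter.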

Install

canifinetune installs in two layers (core and training), plus optional reporting and development extras:

Layer	Install	What you get
Core (estimate / recommend / recipe / report)	`pip install canifinetune`	All CLI commands. No PyTorch required.
Training (bench / real fine-tuning)	`pip install canifinetune[train]`	Adds `torch`, `transformers`, `peft`, `bitsandbytes`, `trl`, `datasets`.
Reporting extras	`pip install canifinetune[report]`	Pandas/tabulate for prettier tables.
Development	`pip install canifinetune[dev]`	pytest, ruff, mypy.

If you use uv:

uv venv
uv pip install -e ".[dev,report]"
# Add training deps when you want to run benchmarks:
uv pip install -e ".[dev,train,report]"

PyTorch should generally be installed with the CUDA wheel that matches your driver, e.g.

uv pip install torch --index-url https://download.pytorch.org/whl/cu121

See docs/troubleshooting.md for Windows / WSL / bitsandbytes specifics.
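After installing the training extras, it can help to confirm that the PyTorch wheel you got is a CUDA build and that it sees your GPU. The following is a standalone sketch using standard PyTorch calls; the function name `describe_torch_cuda` is made up for this example, and this is not a description of what `canifinetune doctor` checks.

```python
def describe_torch_cuda() -> dict:
    """Report which PyTorch build is installed and whether a GPU is visible."""
    try:
        import torch
    except ImportError:
        # Core-only install: PyTorch is not required for estimates.
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only wheels
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        info["gpu_name"] = props.name
        info["gpu_vram_gb"] = round(props.total_memory / 1024**3, 2)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

If `cuda_build` is `None` or `cuda_available` is `False` on a machine with an NVIDIA GPU, the installed wheel most likely does not match the driver.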

Quickstart

# 1. See what your machine looks like
canifinetune doctor

# 2. Ask if a model fits on your card
canifinetune estimate \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --method qlora \
  --gpu-vram-gb 16 \
  --seq-len 2048 \
  --micro-batch-size 1 \
  --lora-rank 16

# 3. Have it search for a feasible config
canifinetune recommend --model Qwen/Qwen2.5-1.5B-Instruct --gpu-vram-gb 16

# 4. Run a tiny real benchmark (downloads sshleifer/tiny-gpt2, ~5 MB)
canifinetune bench --model sshleifer/tiny-gpt2 --method lora --steps 3

# 5. Generate a ready-to-run training recipe
canifinetune recipe \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --method qlora \
  --seq-len 2048 \
  --output recipes/qwen2.5-1.5b-qlora-4080

What's different from `accelerate estimate-memory`?

accelerate estimate-memory tells you how much memory loading a model takes. That is not enough to know whether you can train it.

This project tries to answer the harder question. It models:

Model weights, in fp32 / fp16 / bf16 / int8 / NF4 + double-quant
LoRA / QLoRA trainable parameter count for typical target_modules
Gradients only for trainable parameters
AdamW vs 8-bit / paged AdamW optimizer states
Activations as a function of seq_len, batch_size, hidden_size, num_layers, with and without gradient checkpointing
A fragmentation / CUDA / buffer safety margin
A feasibility decision against your actual GPU
Concrete degradation suggestions when not feasible
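As a rough illustration of the static components in this list, the sketch below computes weight memory per dtype, a LoRA trainable-parameter count, and gradient and optimizer-state sizes. It is not the package's memory model. The shape values are assumptions for a Qwen2.5-1.5B-style model (read the real ones from the model's `config.json`), the choice of `q_proj` and `v_proj` as target modules is an assumption, and the byte sizes per state are typical defaults rather than anything canifinetune guarantees. NF4 is counted as 0.5 bytes per parameter, ignoring quantization constants.

```python
from dataclasses import dataclass

# Approximate storage cost per parameter for each weight format.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "nf4": 0.5}


@dataclass
class ModelShape:
    # Assumed values; replace with the numbers from config.json.
    n_params: float = 1.54e9
    num_layers: int = 28
    hidden_size: int = 1536
    kv_size: int = 256  # num_key_value_heads * head_dim


def weight_bytes(shape: ModelShape, dtype: str) -> float:
    """Memory for the frozen base weights in the given format."""
    return shape.n_params * BYTES_PER_PARAM[dtype]


def lora_params(shape: ModelShape, rank: int) -> int:
    """LoRA adds A (in x r) and B (r x out) to each targeted linear layer.

    Assumes only q_proj (hidden -> hidden) and v_proj (hidden -> kv) are targeted.
    """
    q_proj = rank * (shape.hidden_size + shape.hidden_size)
    v_proj = rank * (shape.hidden_size + shape.kv_size)
    return shape.num_layers * (q_proj + v_proj)


def trainable_state_bytes(n_trainable: int, eight_bit_optimizer: bool) -> dict:
    """Gradients and AdamW states exist only for trainable parameters."""
    grads = n_trainable * 4  # assumed fp32 gradients
    # AdamW keeps two moments per parameter: fp32 (4 B each) or 8-bit (1 B each).
    optimizer = n_trainable * 2 * (1 if eight_bit_optimizer else 4)
    return {"gradients": grads, "optimizer": optimizer}


shape = ModelShape()
n_lora = lora_params(shape, rank=16)
print(n_lora)  # 2179072
print(f"nf4 weights: {weight_bytes(shape, 'nf4') / 1e9:.2f} GB")
print(f"bf16 weights: {weight_bytes(shape, 'bf16') / 1e9:.2f} GB")
print(trainable_state_bytes(n_lora, eight_bit_optimizer=True))
```

The key point the list makes is visible here: with LoRA, gradients and optimizer states scale with a few million adapter parameters, not with the full model, so the frozen weights and the activations dominate.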

Estimates are always marked with an assumptions block and a confidence level, because activation memory in particular is hard to predict statically. Run canifinetune bench and canifinetune calibrate to ground them in real measurements on your machine.
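To see why activation memory is the hard part, consider a commonly quoted approximation for a standard transformer layer trained in 16-bit precision: about s·b·h·(34 + 5·a·s/h) bytes per layer, where s is the sequence length, b the micro-batch size, h the hidden size, and a the number of attention heads. The sketch below applies it with and without full gradient checkpointing. This is an assumption-laden illustration, not the formula canifinetune uses: it ignores fused attention kernels, grouped-query attention, the model's actual MLP shape, and quantization buffers, and the shape values are assumed.

```python
def activation_bytes_per_layer(
    seq_len: int, micro_batch: int, hidden_size: int, num_heads: int
) -> float:
    """Approximate 16-bit activations kept for backward in one layer.

    The 5*a*s/h term is the materialized attention scores, which fused
    attention kernels largely avoid.
    """
    sbh = seq_len * micro_batch * hidden_size
    return sbh * (34 + 5 * num_heads * seq_len / hidden_size)


def activation_bytes(
    seq_len: int,
    micro_batch: int,
    hidden_size: int,
    num_heads: int,
    num_layers: int,
    gradient_checkpointing: bool,
) -> float:
    per_layer = activation_bytes_per_layer(seq_len, micro_batch, hidden_size, num_heads)
    if not gradient_checkpointing:
        return num_layers * per_layer
    # Full checkpointing: keep only each layer's 16-bit input (2*s*b*h bytes),
    # plus one layer's full activations while that layer is being recomputed.
    layer_inputs = num_layers * 2 * seq_len * micro_batch * hidden_size
    return layer_inputs + per_layer


# Assumed shape for a Qwen2.5-1.5B-style model at the settings used above.
common = dict(seq_len=2048, micro_batch=1, hidden_size=1536, num_heads=12, num_layers=28)
no_ckpt = activation_bytes(**common, gradient_checkpointing=False)
ckpt = activation_bytes(**common, gradient_checkpointing=True)
print(f"no checkpointing:   {no_ckpt / 1024**3:.2f} GiB")
print(f"with checkpointing: {ckpt / 1024**3:.2f} GiB")
```

The two results differ by more than an order of magnitude, and the real figure depends on which attention kernel and checkpointing strategy your stack ends up using. That spread is the reason estimates carry a confidence level.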

RTX 4080 baselines

docs/rtx4080_baselines.md contains real measurements collected on a single RTX 4080 (16 GB). These are not synthetic. If a configuration was not run, the table says "not run", not a guessed number.

Highlights (more in the doc):

model	method	seq_len	measured peak	tok/sec
`Qwen/Qwen2.5-0.5B-Instruct`	qlora	1024	3.30 GB	1995
`Qwen/Qwen2.5-1.5B-Instruct`	qlora	1024	4.36 GB	1352
`Qwen/Qwen2.5-1.5B-Instruct`	qlora	2048	7.10 GB	1470
`Qwen/Qwen2.5-3B-Instruct`	qlora	1024	5.54 GB	1158
`sshleifer/tiny-gpt2` (smoke)	lora	128	0.12 GB	1735
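These rows can be turned into two practical numbers: how much VRAM headroom remains on the 16 GB card, and roughly how long a pass over a dataset would take at the measured throughput. The sketch below does that arithmetic with the table values; the 10-million-token dataset size is a hypothetical figure chosen for illustration, and the time estimate assumes throughput stays constant for the whole run.

```python
# (model, method, seq_len, measured_peak_gb, tokens_per_sec) copied from the table above.
BASELINES = [
    ("Qwen/Qwen2.5-0.5B-Instruct", "qlora", 1024, 3.30, 1995),
    ("Qwen/Qwen2.5-1.5B-Instruct", "qlora", 1024, 4.36, 1352),
    ("Qwen/Qwen2.5-1.5B-Instruct", "qlora", 2048, 7.10, 1470),
    ("Qwen/Qwen2.5-3B-Instruct", "qlora", 1024, 5.54, 1158),
]

GPU_VRAM_GB = 16.0
DATASET_TOKENS = 10_000_000  # hypothetical dataset size


def headroom_gb(peak_gb: float, vram_gb: float = GPU_VRAM_GB) -> float:
    """VRAM left over at the measured peak."""
    return vram_gb - peak_gb


def hours_for_tokens(n_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock hours to process n_tokens at a constant throughput."""
    return n_tokens / tokens_per_sec / 3600


for model, method, seq_len, peak_gb, tok_s in BASELINES:
    print(
        f"{model:30s} {method} seq_len={seq_len:<5d} "
        f"headroom={headroom_gb(peak_gb):5.2f} GB  "
        f"one pass over {DATASET_TOKENS:,} tokens ~ {hours_for_tokens(DATASET_TOKENS, tok_s):.2f} h"
    )
```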

Repository layout

src/canifinetune/        # package code (estimator, bench, recipes, reports, cli)
benchmarks/              # configs/, results/ (JSON), calibration/
docs/                    # design, memory model, troubleshooting
examples/                # end-to-end recipe folders
tests/                   # pytest tests (CPU-only, no large downloads)
scripts/                 # helper scripts for collecting baselines
.github/workflows/       # CI (ruff + pytest on CPU)

Roadmap

The current scope is "single consumer GPU, single node, LoRA / QLoRA, causal LM, Hugging Face stack". Possible directions, none committed:

DeepSpeed ZeRO and FSDP estimation for multi-GPU setups
Heuristics for sequence-classification / encoder-decoder training
Throughput modeling (tokens / sec), not just feasibility
Auto-tuning of gradient_accumulation_steps for a target effective batch size
A web UI on top of the CLI
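For the gradient_accumulation_steps item, the underlying relationship is simple: effective batch size = micro-batch size × accumulation steps × number of GPUs. A minimal sketch of solving it for the accumulation steps follows; the function name and signature are hypothetical and do not describe an existing or planned canifinetune API.

```python
import math


def grad_accum_steps(
    target_effective_batch: int, micro_batch_size: int, num_gpus: int = 1
) -> int:
    """Smallest accumulation count that reaches at least the target batch size."""
    if target_effective_batch < 1 or micro_batch_size < 1 or num_gpus < 1:
        raise ValueError("all arguments must be positive integers")
    samples_per_step = micro_batch_size * num_gpus
    return max(1, math.ceil(target_effective_batch / samples_per_step))


# Micro-batch size 1 on a single GPU, aiming for an effective batch of 32.
steps = grad_accum_steps(target_effective_batch=32, micro_batch_size=1)
print(steps)  # 32
```

When the target is not a multiple of the per-step sample count, rounding up means the effective batch slightly exceeds the target rather than falling short of it.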

Contributions welcome.

License

MIT. See LICENSE.

Project details

Version 0.1.0, released May 16, 2026.

Download files

File	Type	Size	Uploaded	Uploaded via	Trusted Publishing
`canifinetune-0.1.0.tar.gz`	Source distribution	46.3 kB	May 16, 2026	twine/6.2.0 CPython/3.11.15	No
`canifinetune-0.1.0-py3-none-any.whl`	Built distribution (Python 3)	60.0 kB	May 16, 2026	twine/6.2.0 CPython/3.11.15	No

File hashes

File	Algorithm	Hash digest
`canifinetune-0.1.0.tar.gz`	SHA256	`f5b815d7962af7db9dae8af68f3a5b691bb99d5532383ffa956845b1887588bf`
`canifinetune-0.1.0.tar.gz`	MD5	`ae6cbdc374b8ed9756381af13b3c9100`
`canifinetune-0.1.0.tar.gz`	BLAKE2b-256	`db590d1c98e8021c8df448d9c484931b4527dc74791c8b3e335f51317e4e799b`
`canifinetune-0.1.0-py3-none-any.whl`	SHA256	`51481f0a0feace835b10b7b71faccf1f133f679cbce1d05cf5ccaefd71b0faf6`
`canifinetune-0.1.0-py3-none-any.whl`	MD5	`0f62a76f754fe1a7e16ba55d71313db0`
`canifinetune-0.1.0-py3-none-any.whl`	BLAKE2b-256	`db1d895c65074cd83c293ae235535c4be9f9b19fc49c5218d9c0a940120a49a7`
