can-i-finetune-this

Estimate, benchmark, and generate fine-tuning recipes for LLMs on consumer GPUs.

[Figure: can-i-finetune-this architecture]

You have one consumer-grade NVIDIA GPU. You want to fine-tune an open-weight LLM with LoRA or QLoRA, but you do not want to download 14 GB of weights just to discover that your 12 GB / 16 GB / 24 GB card OOMs on step 1.

Before you spend the disk space and the time, canifinetune answers:

  1. Can I fine-tune this model?
  2. About how much VRAM will it use?
  3. What batch size / sequence length / LoRA rank / quantization should I use?
  4. If I can't, how should I downsize?
  5. Is there local benchmark evidence for that answer?
  6. Can I get a ready-to-run Hugging Face + PEFT + TRL training script for that config?

It is a single Python package with a CLI:

canifinetune doctor
canifinetune estimate --model Qwen/Qwen2.5-1.5B-Instruct --method qlora --gpu-vram-gb 16 --seq-len 2048 --micro-batch-size 1 --lora-rank 16
canifinetune recommend --model Qwen/Qwen2.5-1.5B-Instruct --gpu-vram-gb 16
canifinetune bench    --model sshleifer/tiny-gpt2 --method lora --steps 3
canifinetune calibrate --benchmarks benchmarks/results
canifinetune recipe   --model Qwen/Qwen2.5-1.5B-Instruct --method qlora --output recipes/qwen2.5-1.5b-qlora-4080
canifinetune report   --benchmarks benchmarks/results --out report.md
canifinetune compare  --benchmarks benchmarks/results --out compare.md

What canifinetune estimate actually prints:

+-------- Qwen/Qwen2.5-1.5B-Instruct  (qlora) --------+
| feasible: YES    ratio = 0.20    confidence = medium |
+------------------------------------------------------+
       Memory breakdown (GB)
+---------------------------------+
| Component             |   Value |
|-----------------------+---------|
| static model          |   0.737 |
| quantization overhead |   0.018 |
| trainable params      |  4.4 MB |
| gradients             |   0.008 |
| optimizer states      |   0.010 |
| activations           |   0.328 |
| CUDA / fragmentation  |   1.280 |
| safety margin         |   0.800 |
| total                 |   3.163 |
+---------------------------------+

The static estimate is 3.16 GB, but on a real RTX 4080 the same config measures 7.10 GB, about 2.2x higher (heavy bitsandbytes unpacking buffers at seq_len=2048). canifinetune bench and canifinetune calibrate close that gap on your machine; that is the point of the project.
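The relationship between these numbers is plain arithmetic. The sketch below is illustrative only: the function names are hypothetical, it assumes the printed ratio is total estimated memory divided by available VRAM (which matches the 0.20 shown), and the single multiplicative correction is an assumed simplification, not a description of how canifinetune calibrate works internally.

```python
def feasibility_ratio(total_gb: float, gpu_vram_gb: float) -> float:
    """Fraction of the card's VRAM that a run is expected to use (assumed definition)."""
    return total_gb / gpu_vram_gb


def correction_factor(measured_gb: float, estimated_gb: float) -> float:
    """How far a real measurement sits above the static estimate."""
    return measured_gb / estimated_gb


# Numbers taken from the example output and the RTX 4080 measurement above.
estimated_gb = 3.163
measured_gb = 7.10
gpu_vram_gb = 16.0

print(f"static ratio:         {feasibility_ratio(estimated_gb, gpu_vram_gb):.2f}")
print(f"measured ratio:       {feasibility_ratio(measured_gb, gpu_vram_gb):.2f}")
print(f"measured / estimated: {correction_factor(measured_gb, estimated_gb):.2f}x")
```

Even with the measured figure, the run uses well under half of a 16 GB card, so the feasibility verdict holds. On a smaller card the same gap could flip the answer.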


Install

canifinetune runs in two layers, core and training, with optional extras for reporting and development:

| Layer                                         | Install                          | What you get                                                 |
|-----------------------------------------------+----------------------------------+--------------------------------------------------------------|
| Core (estimate / recommend / recipe / report) | pip install canifinetune         | All CLI commands. No PyTorch required.                       |
| Training (bench / real fine-tuning)           | pip install canifinetune[train]  | Adds torch, transformers, peft, bitsandbytes, trl, datasets. |
| Reporting extras                              | pip install canifinetune[report] | Pandas/tabulate for prettier tables.                         |
| Development                                   | pip install canifinetune[dev]    | pytest, ruff, mypy.                                          |

If you use uv:

uv venv
uv pip install -e ".[dev,report]"
# Add training deps when you want to run benchmarks:
uv pip install -e ".[dev,train,report]"

PyTorch should generally be installed from the CUDA wheel index that matches your driver, for example:

uv pip install torch --index-url https://download.pytorch.org/whl/cu121
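To confirm that the wheel you installed actually sees your GPU, a small check like the one below can help. It is a minimal sketch, not what canifinetune doctor runs: the function name training_stack_status is hypothetical, and it relies only on standard torch calls. It also works when torch is absent, which is the normal state for a core-only install.

```python
import importlib.util


def training_stack_status() -> dict:
    """Report whether torch is importable and whether it can use a CUDA GPU."""
    status = {
        "torch_installed": False,
        "cuda_available": False,
        "cuda_build": None,   # CUDA version the wheel was built against
        "gpu_name": None,
        "gpu_vram_gb": None,
    }
    # Core-only installs have no torch; that is not an error.
    if importlib.util.find_spec("torch") is None:
        return status
    try:
        import torch
    except Exception:
        # torch is present but broken (for example, a mismatched wheel).
        return status

    status["torch_installed"] = True
    status["cuda_build"] = torch.version.cuda  # None for CPU-only wheels
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        status["cuda_available"] = True
        status["gpu_name"] = props.name
        status["gpu_vram_gb"] = round(props.total_memory / 1024**3, 2)
    return status


if __name__ == "__main__":
    for key, value in training_stack_status().items():
        print(f"{key}: {value}")
```

If torch_installed is True but cuda_build is None, the CPU-only wheel was installed and the index URL above is the likely fix.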

See docs/troubleshooting.md for Windows / WSL / bitsandbytes specifics.


Quickstart

# 1. See what your machine looks like
canifinetune doctor

# 2. Ask if a model fits on your card
canifinetune estimate \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --method qlora \
  --gpu-vram-gb 16 \
  --seq-len 2048 \
  --micro-batch-size 1 \
  --lora-rank 16

# 3. Have it search for a feasible config
canifinetune recommend --model Qwen/Qwen2.5-1.5B-Instruct --gpu-vram-gb 16

# 4. Run a tiny real benchmark (downloads sshleifer/tiny-gpt2, ~5 MB)
canifinetune bench --model sshleifer/tiny-gpt2 --method lora --steps 3

# 5. Generate a ready-to-run training recipe
canifinetune recipe \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --method qlora \
  --seq-len 2048 \
  --output recipes/qwen2.5-1.5b-qlora-4080

What's different from accelerate estimate-memory?

accelerate estimate-memory tells you how much memory loading a model takes. That is not enough to know whether you can train it.

This project tries to answer the harder question. It models:

  • Model weights, in fp32 / fp16 / bf16 / int8 / NF4 + double-quant
  • LoRA / QLoRA trainable parameter count for typical target_modules
  • Gradients only for trainable parameters
  • AdamW vs 8-bit / paged AdamW optimizer states
  • Activations as a function of seq_len, batch_size, hidden_size, num_layers, with and without gradient checkpointing
  • A fragmentation / CUDA / buffer safety margin
  • A feasibility decision against your actual GPU
  • Concrete degradation suggestions when not feasible

Estimates are always marked with an assumptions block and a confidence level, because activation memory in particular is hard to predict statically. Run canifinetune bench and canifinetune calibrate to ground them in real measurements on your machine.
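The static components listed above can be approximated with simple per-parameter byte counts. The sketch below is an assumption-laden illustration, not the package's estimator: the function names, the byte tables, and the toy model shape are all hypothetical. It deliberately leaves out activations, quantization constants, and the CUDA / fragmentation margin, which are the parts that need measurement.

```python
# Approximate storage per weight, in bytes. NF4 packs two 4-bit values per byte.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "nf4": 0.5}

# Optimizer state per trainable parameter (assumption): AdamW keeps two fp32
# moments; 8-bit variants keep two 1-byte moments (their small scaling
# statistics are ignored here).
OPTIMIZER_BYTES_PER_PARAM = {"adamw": 8.0, "adamw_8bit": 2.0}

GB = 1024**3


def lora_trainable_params(layer_shapes, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted linear layer."""
    return sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)


def static_memory_gb(n_base_params, weight_dtype, n_trainable,
                     optimizer="adamw", trainable_dtype="bf16"):
    """Weights, adapters, gradients and optimizer states, in GB (no activations)."""
    parts = {
        "weights": n_base_params * BYTES_PER_PARAM[weight_dtype],
        "adapters": n_trainable * BYTES_PER_PARAM[trainable_dtype],
        # Gradients exist only for trainable (adapter) parameters.
        "gradients": n_trainable * BYTES_PER_PARAM[trainable_dtype],
        "optimizer": n_trainable * OPTIMIZER_BYTES_PER_PARAM[optimizer],
    }
    return {name: size / GB for name, size in parts.items()}


# Hypothetical toy model: 1.5e9 base parameters, 24 layers, and two square
# 2048x2048 projections adapted per layer. Real models often have differently
# shaped projections, so read the actual shapes from the model config.
shapes = [(2048, 2048)] * (24 * 2)
n_trainable = lora_trainable_params(shapes, rank=16)
breakdown = static_memory_gb(1.5e9, "nf4", n_trainable, optimizer="adamw")

print(f"trainable params: {n_trainable:,}")
for name, gb in breakdown.items():
    print(f"{name:>10}: {gb:.4f} GB")
```

The point of the exercise is the ordering of magnitudes: with LoRA or QLoRA, the quantized base weights dominate the static total, while gradients and optimizer states for the adapters are tiny by comparison.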


RTX 4080 baselines

docs/rtx4080_baselines.md contains real measurements collected on a single RTX 4080 (16 GB). These are not synthetic. If a configuration was not run, the table says "not run", not a guessed number.

Highlights (more in the doc):

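| model                       | method | seq_len | measured peak | tok/sec |
|-----------------------------+--------+---------+---------------+---------|
| Qwen/Qwen2.5-0.5B-Instruct  | qlora  |    1024 |       3.30 GB |    1995 |
| Qwen/Qwen2.5-1.5B-Instruct  | qlora  |    1024 |       4.36 GB |    1352 |
| Qwen/Qwen2.5-1.5B-Instruct  | qlora  |    2048 |       7.10 GB |    1470 |
| Qwen/Qwen2.5-3B-Instruct    | qlora  |    1024 |       5.54 GB |    1158 |
| sshleifer/tiny-gpt2 (smoke) | lora   |     128 |       0.12 GB |    1735 |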
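The tok/sec column can be turned into a rough wall-clock estimate for a run. This is a back-of-the-envelope sketch: the helper name and the dataset size are hypothetical, and it assumes the short-benchmark throughput holds for a full training run, which is not guaranteed.

```python
def training_seconds(total_tokens: int, tokens_per_sec: float) -> float:
    """Time to push total_tokens through training at a steady throughput."""
    return total_tokens / tokens_per_sec


# Measured throughput at seq_len=1024, copied from the table above.
baselines = {
    "Qwen/Qwen2.5-0.5B-Instruct": 1995,
    "Qwen/Qwen2.5-1.5B-Instruct": 1352,
    "Qwen/Qwen2.5-3B-Instruct": 1158,
}

# Hypothetical dataset: 10,000 examples packed to 1024 tokens each, one epoch.
total_tokens = 10_000 * 1024

for model, tok_per_sec in baselines.items():
    hours = training_seconds(total_tokens, tok_per_sec) / 3600
    print(f"{model}: about {hours:.1f} h per epoch")
```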

Repository layout

src/canifinetune/        # package code (estimator, bench, recipes, reports, cli)
benchmarks/              # configs/, results/ (JSON), calibration/
docs/                    # design, memory model, troubleshooting
examples/                # end-to-end recipe folders
tests/                   # pytest tests (CPU-only, no large downloads)
scripts/                 # helper scripts for collecting baselines
.github/workflows/       # CI (ruff + pytest on CPU)

Roadmap

The current scope is "single consumer GPU, single node, LoRA / QLoRA, causal LM, Hugging Face stack". Possible directions, none committed:

  • DeepSpeed ZeRO and FSDP estimation for multi-GPU setups
  • Heuristics for sequence-classification / encoder-decoder training
  • Throughput modeling (tokens / sec), not just feasibility
  • Auto-tuning of gradient_accumulation_steps for a target effective batch size
  • A web UI on top of the CLI
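As an illustration of the gradient_accumulation_steps item, the underlying arithmetic is small. This is a hypothetical sketch of the idea, not an existing canifinetune feature; the function name and parameters are made up for the example.

```python
import math


def grad_accum_steps(target_effective_batch: int, micro_batch_size: int,
                     num_gpus: int = 1) -> int:
    """Smallest accumulation count that reaches the target effective batch size.

    Effective batch = micro_batch_size * num_gpus * accumulation steps.
    """
    per_step = micro_batch_size * num_gpus
    return max(1, math.ceil(target_effective_batch / per_step))


# A 16 GB card that only fits micro_batch_size=1 can still train with an
# effective batch of 32 by accumulating gradients over 32 micro-steps.
print(grad_accum_steps(32, micro_batch_size=1))
print(grad_accum_steps(32, micro_batch_size=4))
```

When the target is not a multiple of the per-step batch, rounding up overshoots slightly: a target of 10 with micro_batch_size=4 gives 3 steps and an effective batch of 12.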

Contributions welcome.


License

MIT. See LICENSE.
