Estimate GPU memory, training time, and costs for LLM fine-tuning
⚡ ftune
Know your GPU costs before you hit OOM.
Estimate memory, training time, and cloud costs for LLM fine-tuning — in seconds.
🔥 The Problem
You want to fine-tune Llama 3.1 70B. You spin up an A100, start training, and... CUDA out of memory 💀
Or worse — you rent 8×H100s for $30/hr, only to realize you could've done it with QLoRA on a single $1.50/hr GPU.
ftune fixes this. Get VRAM estimates, training time projections, and cost comparisons across 8 cloud providers — all before you spend a single dollar.
What makes ftune different?
| Feature | ftune | Manual math | HF accelerate estimate |
|---|---|---|---|
| Works with any HuggingFace model | ✅ auto-fetches from Hub | ❌ | ✅ |
| Multi-GPU / ZeRO / FSDP sharding | ✅ ZeRO 1/2/3 + FSDP | ❌ | ❌ |
| Cloud cost comparison (8 providers) | ✅ with spot pricing | ❌ | ❌ |
| Budget optimizer ("I have $50") | ✅ | ❌ | ❌ |
| Hardware calibration mode | ✅ | ❌ | ❌ |
| FlashAttention-2 memory savings | ✅ | manual | ❌ |
| Validation against real runs | ✅ W&B, JSON, CSV | ❌ | ❌ |
| Zero ML dependencies | ✅ pure Python | ✅ | ❌ needs torch |
🚀 Quick Start
pip install ftune
from ftune import Estimator
est = Estimator(
model="meta-llama/Llama-3.1-8B",
method="qlora",
quantization="4bit",
lora_rank=16,
batch_size=4,
seq_length=2048,
flash_attention=True,
)
# Memory
mem = est.estimate_memory()
print(f"Total VRAM: {mem.total_gb:.2f} GB")
# Training time
time = est.estimate_time(gpu="A100-80GB", dataset_size=50000, epochs=3)
print(f"Training time: {time.total_hours:.1f} hours")
# Cost comparison across all providers
costs = est.full_comparison(dataset_size=50000, epochs=3)
for c in costs.estimates[:5]:
    print(f"{c.provider:15s} | {c.gpu:15s} | ${c.total_cost:.2f}")
Works with ANY HuggingFace model
ftune auto-fetches model architecture from HuggingFace Hub — no configuration needed:
est = Estimator(model="NousResearch/Llama-2-7b-hf", method="qlora", quantization="4bit")
est = Estimator(model="tiiuae/falcon-40b", method="lora", lora_rank=32)
est = Estimator(model="bigscience/bloom-7b1", method="qlora", quantization="4bit")
🌐 Web UI
ftune includes a full interactive web calculator built with Streamlit.
pip install ftune[web]
streamlit run src/ftune/app.py
Four tabs: Memory (VRAM breakdown + chart), Training Time (all GPUs), Cost (provider comparison + spot pricing), GPU Compatibility (utilization chart).
🔗 Try it live: ftuneai.streamlit.app
✨ Features
📊 Memory Estimation
Component-level VRAM breakdown with support for FlashAttention-2, gradient checkpointing, and 5 optimizer types:
est = Estimator(
model="meta-llama/Llama-3.1-8B",
method="qlora",
quantization="4bit",
flash_attention=True, # 25-50% activation memory reduction
gradient_checkpointing=True, # 5x activation reduction
optimizer="adam_8bit", # 75% less optimizer memory vs AdamW
)
mem = est.estimate_memory()
print(f"Model weights: {mem.model_weights_gb:.2f} GB")
print(f"LoRA adapters: {mem.trainable_params_gb:.2f} GB")
print(f"Gradients: {mem.gradients_gb:.2f} GB")
print(f"Optimizer states: {mem.optimizer_states_gb:.2f} GB")
print(f"Activations: {mem.activations_gb:.2f} GB")
print(f"CUDA overhead: {mem.overhead_gb:.2f} GB")
print(f"TOTAL: {mem.total_gb:.2f} GB")
FlashAttention-2 avoids materializing the full N×N attention matrix, cutting activation memory by roughly 25-50% depending on the configuration. In this example the saving is 25%:
Without FlashAttention: 9.09 GB
With FlashAttention: 6.79 GB ← saved 2.30 GB (25%)
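To get a feel for why the N×N matrix matters, here is a rough sketch (not ftune's internal formula) that sizes the score tensor a naive attention implementation would materialize for a single layer. The function name, the 32-head count, and the 2-byte element size are illustrative assumptions.

```python
def attention_scores_gib(
    batch_size: int,
    num_heads: int,
    seq_length: int,
    bytes_per_element: int = 2,  # assumption: fp16/bf16 scores
) -> float:
    """Size in GiB of one layer's full attention score tensor (batch x heads x N x N)."""
    elements = batch_size * num_heads * seq_length * seq_length
    return elements * bytes_per_element / 1024**3


# Illustrative values only: batch of 4, 32 attention heads.
for n in (2048, 4096, 8192):
    print(f"seq_length={n}: {attention_scores_gib(4, 32, n):.1f} GiB per layer")
```

Doubling the sequence length quadruples this term, which is the part of activation memory that FlashAttention-2 avoids storing.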
Supported methods: Full Fine-Tuning, LoRA, QLoRA (4-bit / 8-bit)
Supported optimizers: AdamW, Adam, SGD, 8-bit Adam (bitsandbytes), Adafactor
⏱️ Training Time Estimation
FLOPs-based wall-clock time estimates with multi-GPU scaling:
# Single GPU
time = est.estimate_time(gpu="A100-80GB", dataset_size=50000, epochs=3)
# Compare all compatible GPUs
for t in est.estimate_time_all_gpus(dataset_size=50000, epochs=3):
    print(f"{t.gpu_name:<18} {t.total_hours:>6.1f}h")
💰 Cloud Cost Comparison
Compare across 8 cloud providers including spot pricing:
costs = est.full_comparison(dataset_size=50000, epochs=3)
print(f"🏆 Cheapest: {costs.cheapest}")
print(f"💡 Best value: {costs.best_value}")
Providers: AWS, Google Cloud, Microsoft Azure, Lambda Labs, RunPod, Vast.ai, Together AI, Modal
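The arithmetic behind a cost comparison is simple enough to sketch by hand. The example below is not ftune's implementation: the `Offer` class is hypothetical, the hourly rates are the two figures from The Problem section, and the training hours are made-up placeholders rather than real estimates or quotes.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    label: str         # placeholder description, not a real provider quote
    hourly_usd: float  # price per hour for the whole instance
    hours: float       # assumed training time on that instance

    @property
    def total_usd(self) -> float:
        return self.hourly_usd * self.hours


offers = [
    Offer("8xH100 node", hourly_usd=30.0, hours=1.0),            # hours are invented
    Offer("single GPU with QLoRA", hourly_usd=1.50, hours=12.0),  # hours are invented
]

# A slower but much cheaper GPU can still win on total cost.
cheapest = min(offers, key=lambda offer: offer.total_usd)
print(f"{cheapest.label}: ${cheapest.total_usd:.2f}")
```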
🎯 Budget Optimizer
Reverse the logic — tell ftune your constraints and it finds the optimal configuration:
from ftune import BudgetOptimizer
recs = BudgetOptimizer.optimize(
model="meta-llama/Llama-3.1-8B",
budget=50.0, # max $50
gpu="RTX-3090-24GB", # hardware constraint
dataset_size=10000,
epochs=1,
priority="cost", # "cost", "speed", or "quality"
)
print(BudgetOptimizer.format_recommendations(recs))
╭────────────────────────────────────────────────╮
│ 🎯 ftune Budget Optimizer — Recommendations │
╰────────────────────────────────────────────────╯
#1: LORA none, rank=8
GPU: RTX-3090-24GB (Vast.ai) | 1 GPU(s)
Batch: 1 × 1 accum | Optimizer: adamw
Memory: 17.8 GB | Time: 8.3h | Cost: $2.07
💡 FlashAttention-2 enabled, Gradient checkpointing ON
#2: LORA none, rank=16
GPU: RTX-3090-24GB (Vast.ai) | 1 GPU(s)
Batch: 1 × 1 accum | Optimizer: adamw
Memory: 17.9 GB | Time: 8.3h | Cost: $2.07
💡 FlashAttention-2 enabled, Gradient checkpointing ON
The optimizer searches across methods, LoRA ranks, batch sizes, optimizers, and FlashAttention to find configurations that fit your budget and hardware.
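Conceptually, this kind of search can be pictured as a filtered grid search. The sketch below is only an illustration of that idea, not the actual `BudgetOptimizer` logic: `toy_memory_gb`, `toy_cost_usd`, and every constant in them are invented stand-ins for real estimates.

```python
from itertools import product


def toy_memory_gb(method: str, rank: int, batch_size: int) -> float:
    """Invented memory model: base footprint + small rank term + per-sample activations."""
    base = {"lora": 16.0, "qlora": 6.0}[method]
    return base + 0.01 * rank + 1.5 * batch_size


def toy_cost_usd(method: str, batch_size: int, hourly_usd: float = 0.25) -> float:
    """Invented cost model: larger batches finish sooner, QLoRA runs a bit slower."""
    hours = 8.0 / batch_size * (1.2 if method == "qlora" else 1.0)
    return hours * hourly_usd


def search(budget_usd: float, vram_gb: float) -> list[tuple]:
    """Return feasible (cost, memory, method, rank, batch) tuples, cheapest first."""
    candidates = []
    for method, rank, batch in product(("lora", "qlora"), (8, 16, 32), (1, 2, 4)):
        memory = toy_memory_gb(method, rank, batch)
        cost = toy_cost_usd(method, batch)
        if memory <= vram_gb and cost <= budget_usd:
            candidates.append((cost, memory, method, rank, batch))
    return sorted(candidates)


results = search(budget_usd=50.0, vram_gb=24.0)
cost, memory, method, rank, batch = results[0]
print(f"Best: {method} rank={rank} batch={batch} | {memory:.1f} GB | ${cost:.2f}")
```

Sorting by a different tuple element (for example time instead of cost) is one way a `priority` setting could be modeled.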
🔀 Multi-GPU & Sharding
ftune supports DeepSpeed ZeRO Stages 1/2/3 and PyTorch FSDP for multi-GPU memory estimation:
# Single GPU — doesn't fit
est = Estimator(model="meta-llama/Llama-3.1-8B", method="full", batch_size=1)
print(est.estimate_memory().total_gb) # 104.4 GB ❌
# ZeRO-3 on 4 GPUs — fits!
est = Estimator(
model="meta-llama/Llama-3.1-8B",
method="full",
batch_size=1,
sharding="zero_3",
num_gpus=4,
)
print(est.estimate_memory().total_gb) # 27.8 GB per GPU ✅
How sharding reduces per-GPU memory:
| Strategy | What's sharded | 8B Full FT (per GPU, 4 GPUs) |
|---|---|---|
| None (single GPU) | Nothing | 104.4 GB |
| ZeRO Stage 1 | Optimizer states | 52.8 GB |
| ZeRO Stage 2 | + Gradients | 39.9 GB |
| ZeRO Stage 3 / FSDP | + Model weights | 27.8 GB |
This means ftune can now tell you: "This 70B model won't fit on one A100, but it will fit on 4×A100s using ZeRO-3 at about 17.4 GB per GPU."
est = Estimator(
model="meta-llama/Llama-3.1-70B",
method="qlora",
quantization="4bit",
sharding="zero_3",
num_gpus=4,
)
mem = est.estimate_memory()
print(f"70B QLoRA ZeRO-3: {mem.total_gb:.1f} GB per GPU") # 17.4 GB ✅
Supported strategies: none, zero_1, zero_2, zero_3, fsdp, fsdp_shard_grad
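The shape of the sharding table can be reproduced with a few lines of arithmetic. This is a simplified sketch, not ftune's code: it assumes bf16 weights and gradients (2 bytes each) and AdamW states (8 bytes per parameter), uses a round 8B parameter count, and ignores activations and overhead.

```python
BYTES_WEIGHTS = 2  # bf16 weights
BYTES_GRADS = 2    # bf16 gradients
BYTES_ADAMW = 8    # two fp32 moments per parameter


def sharded_state_gb(num_parameters: float, num_gpus: int, zero_stage: int) -> float:
    """Per-GPU GB for weights + gradients + optimizer states only."""
    weights = num_parameters * BYTES_WEIGHTS
    grads = num_parameters * BYTES_GRADS
    optimizer = num_parameters * BYTES_ADAMW
    if zero_stage >= 1:  # stage 1 shards optimizer states
        optimizer /= num_gpus
    if zero_stage >= 2:  # stage 2 also shards gradients
        grads /= num_gpus
    if zero_stage >= 3:  # stage 3 also shards the weights
        weights /= num_gpus
    return (weights + grads + optimizer) / 1e9


for stage in range(4):
    print(f"ZeRO stage {stage}: {sharded_state_gb(8e9, 4, stage):.1f} GB per GPU")
```

For 8B parameters on 4 GPUs this gives 96, 48, 36, and 24 GB. The totals in the table are higher because they also include activations and overhead.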
🔧 Calibration
Generic MFU (Model FLOPs Utilization) constants can be off by 2-10x depending on your hardware, drivers, and framework. Calibration corrects for this.
Run a quick 10-step benchmark on your GPU, then feed the results to ftune:
from ftune import Estimator, Calibrator
est = Estimator(model="meta-llama/Llama-3.1-8B", method="qlora", quantization="4bit")
mem = est.estimate_memory()
time = est.estimate_time(gpu="A100-80GB", dataset_size=50000, epochs=3)
# After running 10 real training steps, you measured:
cal = Calibrator.from_benchmark(
estimated_memory_gb=mem.total_gb, # ftune's estimate
actual_memory_gb=11.2, # nvidia-smi peak
estimated_time_hours=time.total_hours, # ftune's estimate
actual_time_hours=5.0, # extrapolated from benchmark
gpu_name="A100-80GB",
)
# Now all future estimates are hardware-calibrated
adjusted_time = cal.adjust_time(time.total_hours)
adjusted_memory = cal.adjust_memory(mem.total_gb)
print(f"Calibrated time: {adjusted_time:.1f}h (was {time.total_hours:.1f}h)")
print(f"Calibrated memory: {adjusted_memory:.1f} GB (was {mem.total_gb:.1f} GB)")
# Save calibration for reuse
Calibrator.save(cal.result, "~/.ftune/my_a100_calibration.json")
# Load it later
from ftune.core.models import CalibrationResult
saved = Calibrator.load("~/.ftune/my_a100_calibration.json")
You can also use the measured MFU directly:
# Calibration found your actual MFU is 0.52
time = est.estimate_time(gpu="A100-80GB", dataset_size=50000, epochs=3, mfu_override=0.52)
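If you want to derive an MFU value from a benchmark yourself, one common approach is to divide achieved FLOPs per second by the GPU's peak. The sketch below is an assumption about how such a number could be obtained, not a description of what `Calibrator` does internally; it reuses the 6 × parameters FLOPs-per-token rule from the Training Time Formula, and the throughput figure is a made-up measurement.

```python
def measured_mfu(num_parameters: float, tokens_per_second: float, peak_tflops: float) -> float:
    """Model FLOPs Utilization: achieved FLOPs/s divided by the GPU's peak FLOPs/s."""
    achieved_flops_per_second = 6 * num_parameters * tokens_per_second
    return achieved_flops_per_second / (peak_tflops * 1e12)


# Hypothetical benchmark: 8B parameters, 3,380 tokens/s on a 312 TFLOPS GPU.
mfu = measured_mfu(num_parameters=8e9, tokens_per_second=3380, peak_tflops=312)
print(f"Measured MFU: {mfu:.2f}")
```

The resulting value is what you would pass as `mfu_override`.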
📊 Validation
Compare ftune estimates against actual training metrics from real runs:
from ftune import Estimator
from ftune.validation import Validator, ActualMetrics
est = Estimator(model="meta-llama/Llama-3.1-8B", method="qlora", quantization="4bit")
actual = ActualMetrics(
peak_memory_gb=11.2,
training_time_hours=4.5,
total_cost=8.50,
gpu_name="A100-80GB",
dataset_size=50000,
epochs=3,
)
result = Validator.compare(est, actual)
print(Validator.format_report(result))
Load metrics from multiple sources:
actual = Validator.from_json("training_metrics.json")
actual = Validator.from_wandb("username/project/run_id") # pip install ftune[wandb]
metrics_list = Validator.from_csv("all_runs.csv") # batch validation
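The core of any such comparison is a signed error between estimate and measurement. A minimal sketch follows, assuming a plain percent-error metric; the helper name and the estimated values are hypothetical, and the real report from `Validator` may use different metrics.

```python
def percent_error(estimated: float, actual: float) -> float:
    """Signed error in percent: positive means the estimate was too high."""
    return (estimated - actual) / actual * 100.0


# Actual values match the example above; the estimates are invented for illustration.
memory_error = percent_error(estimated=12.32, actual=11.2)
time_error = percent_error(estimated=3.6, actual=4.5)
print(f"Memory: {memory_error:+.1f}%")
print(f"Time:   {time_error:+.1f}%")
```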
⌨️ CLI
pip install ftune[cli]
# Full estimate
ftune estimate --model meta-llama/Llama-3.1-8B --method qlora --quantization 4bit
# With all options
ftune estimate \
--model meta-llama/Llama-3.1-8B \
--method qlora \
--quantization 4bit \
--lora-rank 16 \
--batch-size 4 \
--seq-length 2048 \
--dataset-size 50000 \
--epochs 3 \
--output json
# List models and GPUs
ftune models
ftune gpus
# Check pricing
ftune pricing --gpu A100-80GB --hours 10
# Validate against actual metrics
ftune validate --model meta-llama/Llama-3.1-8B --method qlora --metrics metrics.json
🧮 How It Works
Memory Formula
| Component | Full Fine-Tune | LoRA | QLoRA (4-bit) |
|---|---|---|---|
| Model Weights | params × dtype_bytes | params × dtype_bytes | params × 0.5 + quant overhead |
| Trainable Params | (same as weights) | modules × 2 × hidden × rank | Same as LoRA |
| Gradients | params × 2B | lora_params × 2B | lora_params × 2B |
| Optimizer (AdamW) | params × 8B | lora_params × 8B | lora_params × 8B |
| Optimizer (8-bit Adam) | params × 2B | lora_params × 2B | lora_params × 2B |
| Activations | batch × seq × hidden × layers × factor | Same | Same |
| FlashAttention-2 | Activations × 0.5 | Activations × 0.5 | Activations × 0.5 |
| Overhead | ~15% buffer | ~15% buffer | ~15% buffer |
| ZeRO-3 / FSDP | Weights, grads, optimizer ÷ N GPUs | Same | Same |
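As a worked example of the LoRA column, the sketch below plugs numbers into `modules × 2 × hidden × rank`. It is an illustration under stated assumptions, not ftune's code: "modules" is read as layers × adapted projections per layer, every adapted projection is treated as hidden × hidden, adapters are stored in bf16, and the 32-layer, 4096-hidden configuration is only an example.

```python
def lora_param_count(num_layers: int, targets_per_layer: int, hidden_size: int, rank: int) -> int:
    """Each adapted module adds two low-rank matrices: hidden x rank and rank x hidden."""
    modules = num_layers * targets_per_layer
    return modules * 2 * hidden_size * rank


# Assumed setup: 32 layers, 4 adapted projections per layer, hidden size 4096, rank 16.
lora_params = lora_param_count(num_layers=32, targets_per_layer=4, hidden_size=4096, rank=16)

breakdown_mb = {
    "adapters (bf16)": lora_params * 2 / 1e6,
    "gradients": lora_params * 2 / 1e6,
    "optimizer (AdamW)": lora_params * 8 / 1e6,
    "optimizer (8-bit Adam)": lora_params * 2 / 1e6,
}

print(f"LoRA parameters: {lora_params:,}")
for name, megabytes in breakdown_mb.items():
    print(f"{name:<24} {megabytes:8.1f} MB")
```

With these inputs the adapters come to about 16.8M parameters, so gradients and optimizer states are measured in megabytes rather than the tens of gigabytes needed for full fine-tuning.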
Training Time Formula
FLOPs per token ≈ 6 × num_parameters
Total FLOPs = flops_per_token × dataset_tokens × epochs
Time = Total FLOPs / (GPU TFLOPS × MFU × num_gpus × scaling_efficiency)
MFU defaults to 0.30-0.35 (conservative). Use calibration mode for hardware-specific values.
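The formula translates directly into code. This is a minimal sketch of the arithmetic above, not ftune's implementation; the function name and the example inputs (8B parameters, 10,000 samples of 1,024 tokens, one epoch, a 312 TFLOPS GPU) are assumptions chosen for illustration.

```python
def training_hours(
    num_parameters: float,
    dataset_tokens: float,
    epochs: int,
    gpu_tflops: float,
    mfu: float = 0.30,
    num_gpus: int = 1,
    scaling_efficiency: float = 1.0,
) -> float:
    """Wall-clock hours from the FLOPs-based formula."""
    total_flops = 6 * num_parameters * dataset_tokens * epochs
    effective_flops_per_second = gpu_tflops * 1e12 * mfu * num_gpus * scaling_efficiency
    return total_flops / effective_flops_per_second / 3600


# Assumes every sample is a full 1,024 tokens.
tokens = 10_000 * 1_024
hours = training_hours(num_parameters=8e9, dataset_tokens=tokens, epochs=1, gpu_tflops=312)
print(f"Estimated time: {hours:.2f} h")
```

Because MFU sits in the denominator, moving it from 0.30 to 0.60 halves the estimate, which is why a calibrated value matters so much.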
Limitations & Accuracy
ftune provides analytical estimates, not profiling results. All numbers are derived from architecture-level formulas with empirical correction factors — no PyTorch, no GPU required, but also no runtime measurement.
Known assumptions and their impact:
| Assumption | Impact | Mitigation |
|---|---|---|
| MFU defaults (0.30-0.35) | Time estimates can be off by 2-3x depending on hardware, batch size, and framework optimizations | Use calibration mode with a real 10-step benchmark |
| Activation memory formula | Simplified — doesn't model per-op memory peaks or memory allocator behavior | Conservative factors partially compensate |
| Static cloud pricing | Prices change frequently; bundled data may be stale | Check provider websites for current rates |
| LoRA on MoE models | Assumes standard (non-expert) LoRA targets | Expert-specific LoRA estimation not yet supported |
When to trust ftune estimates:
- Relative comparisons (QLoRA vs LoRA, GPU A vs GPU B) — high confidence
- Will-it-fit checks (does this config OOM on 24GB?) — good confidence with ~20% margin
- Absolute wall-clock time — use calibration mode; defaults can be off by 2-3x
- Absolute cost — treat as order-of-magnitude; verify provider pricing
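One way to apply the ~20% margin to a will-it-fit check is to inflate the estimate before comparing it with the card's VRAM. The helper below is a hypothetical sketch of that reading of the margin, not an ftune API.

```python
def fits_with_margin(estimated_gb: float, vram_gb: float, margin: float = 0.20) -> bool:
    """True if the estimate, inflated by the safety margin, still fits in VRAM."""
    return estimated_gb * (1 + margin) <= vram_gb


print(fits_with_margin(17.8, 24.0))  # 17.8 GB estimate on a 24 GB card
print(fits_with_margin(21.0, 24.0))  # 21.0 GB estimate on a 24 GB card
```

The first check passes (21.4 GB with margin), while the second fails (25.2 GB with margin) even though the raw estimate is below 24 GB.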
When NOT to trust ftune:
- Sequence lengths near the model's maximum (attention memory scaling is nonlinear)
- Exotic architectures not in the model database (use HuggingFace Hub auto-detect and verify)
- Multi-node training (ftune models single-node multi-GPU only)
📋 Supported Models, GPUs & Providers
Built-in Models (15+)
| Model | Parameters | Default dtype |
|---|---|---|
| meta-llama/Llama-3.1-8B / 70B / 405B | 8B / 70B / 405B | bf16 |
| mistralai/Mistral-7B-v0.3 | 7B | bf16 |
| mistralai/Mixtral-8x7B-v0.1 | 47B (MoE) | bf16 |
| google/gemma-2-9b / 27b | 9B / 27B | bf16 |
| Qwen/Qwen2.5-7B / 72B | 7B / 72B | bf16 |
| microsoft/phi-3-mini / medium | 3.8B / 14B | bf16 |
| deepseek-ai/DeepSeek-V2-Lite | 16B | bf16 |
| + Yi, Falcon, StableLM | various | bf16 |
Plus ANY model on HuggingFace Hub via auto-detect.
GPUs (11)
| GPU | VRAM | FP16 TFLOPS |
|---|---|---|
| NVIDIA H100 | 80 GB | 989 |
| NVIDIA A100 | 40 / 80 GB | 312 |
| NVIDIA A10G | 24 GB | 125 |
| NVIDIA L4 | 24 GB | 121 |
| RTX 4090 / 4080 | 24 / 16 GB | 165 / 97 |
| RTX 3090 | 24 GB | 71 |
| Tesla T4 / V100 | 16 / 16-32 GB | 65 / 125 |
Cloud Providers (8)
| Provider | GPUs Available | Spot Pricing |
|---|---|---|
| AWS | H100, A100, T4, L4 | ✅ |
| Google Cloud | H100, A100, T4, L4, V100 | ✅ |
| Microsoft Azure | H100, A100, T4, V100 | ✅ |
| Lambda Labs | H100, A100, RTX 4090 | — |
| RunPod | H100, A100, RTX 4090/3090, L4 | ✅ |
| Vast.ai | H100, A100, RTX 4090/3090 | — |
| Together AI | H100, A100 | — |
| Modal | H100, A100, L4, T4 | — |
📦 Installation
pip install ftune # Core library (zero ML dependencies)
pip install ftune[cli] # + CLI with Rich terminal output
pip install ftune[web] # + Streamlit web UI
pip install ftune[wandb] # + Weights & Biases validation
pip install ftune[all] # Everything
🗺️ Roadmap
- Memory estimation (full, LoRA, QLoRA)
- Training time estimation (FLOPs-based, multi-GPU)
- Cloud cost comparison (8 providers, spot pricing)
- HuggingFace Hub auto-detect
- FlashAttention-2 memory optimization
- ZeRO Stages 1/2/3 + FSDP sharding
- Hardware calibration mode
- Budget optimizer ("I have $50, what's optimal?")
- Validation mode (manual, JSON, W&B, CSV)
- Streamlit web UI
- CLI with Rich output
- GitHub Actions CI/CD
- Streamlit Cloud deployment (public hosted version)
- PyPI publish
- Community validation dataset (crowdsourced accuracy data)
- Active pricing API (real-time provider rates)
- Exportable PDF reports
🔑 Design Principles
- Zero ML dependencies — Pure Python calculator. No PyTorch, no TensorFlow, no GPU required.
- Works with any model — HuggingFace Hub integration for instant support of thousands of models.
- Hardware-aware — Calibration mode closes the gap between theory and your specific setup.
- Enterprise-ready — ZeRO/FSDP sharding makes ftune relevant for serious multi-GPU training.
- Validates itself — Compare estimates against actual runs. No black box.
- Fast — Every estimate returns in under 1 second.
🤝 Contributing
The most valuable contribution is validation data. Run ftune against your actual training runs and share the results:
from ftune import Estimator, Calibrator
est = Estimator(model="your-model", method="qlora", quantization="4bit", ...)
cal = Calibrator.from_benchmark(
estimated_memory_gb=est.estimate_memory().total_gb,
actual_memory_gb=..., # from nvidia-smi
estimated_time_hours=est.estimate_time(...).total_hours,
actual_time_hours=..., # wall-clock
gpu_name="...",
)
print(cal.format_report())
# Share this in an issue or PR!
Other ways to help: update pricing data, add models/GPUs, fix bugs, improve the web UI.
git clone https://github.com/ritikmahy5/ftune.git
cd ftune
pip install -e ".[dev]"
PYTHONPATH=src pytest tests/ -v
See CONTRIBUTING.md for full guidelines.
📄 License
MIT License — See LICENSE for details.
If ftune saved you from an OOM error or an expensive cloud bill, consider giving it a ⭐
Built by Ritik Mahyavanshi