Estimate GPU memory, training time, and costs for LLM fine-tuning

⚡ ftune

Know your GPU costs before you hit OOM.
Estimate memory, training time, and cloud costs for LLM fine-tuning — in seconds.

🔥 The Problem

You want to fine-tune Llama 3.1 70B. You spin up an A100, start training, and... CUDA out of memory 💀

Or worse — you rent 8×H100s for $30/hr, only to realize you could've done it with QLoRA on a single $1.50/hr GPU.

ftune fixes this. Get VRAM estimates, training time projections, and cost comparisons across 8 cloud providers — all before you spend a single dollar.

What makes ftune different?

Feature	ftune	Manual math	HF `accelerate estimate-memory`
Works with any HuggingFace model	✅ auto-fetches from Hub	❌	✅
Multi-GPU / ZeRO / FSDP sharding	✅ ZeRO 1/2/3 + FSDP	❌	❌
Cloud cost comparison (8 providers)	✅ with spot pricing	❌	❌
Budget optimizer ("I have $50")	✅	❌	❌
Hardware calibration mode	✅	❌	❌
FlashAttention-2 memory savings	✅	manual	❌
Validation against real runs	✅ W&B, JSON, CSV	❌	❌
Zero ML dependencies	✅ pure Python	✅	❌ needs torch

🚀 Quick Start

pip install ftuneai

from ftuneai import Estimator

est = Estimator(
    model="meta-llama/Llama-3.1-8B",
    method="qlora",
    quantization="4bit",
    lora_rank=16,
    batch_size=4,
    seq_length=2048,
    flash_attention=True,
)

# Memory
mem = est.estimate_memory()
print(f"Total VRAM: {mem.total_gb:.2f} GB")

# Training time
time = est.estimate_time(gpu="A100-80GB", dataset_size=50000, epochs=3)
print(f"Training time: {time.total_hours:.1f} hours")

# Cost comparison across all providers
costs = est.full_comparison(dataset_size=50000, epochs=3)
for c in costs.estimates[:5]:
    print(f"{c.provider:15s} | {c.gpu:15s} | ${c.total_cost:.2f}")

Works with ANY HuggingFace model

ftune auto-fetches model architecture from HuggingFace Hub — no configuration needed:

est = Estimator(model="NousResearch/Llama-2-7b-hf", method="qlora", quantization="4bit")
est = Estimator(model="tiiuae/falcon-40b", method="lora", lora_rank=32)
est = Estimator(model="bigscience/bloom-7b1", method="qlora", quantization="4bit")

🌐 Web UI

ftune includes a full interactive web calculator built with Streamlit.

pip install "ftuneai[web]"
streamlit run src/ftune/app.py

Four tabs: Memory (VRAM breakdown + chart), Training Time (all GPUs), Cost (provider comparison + spot pricing), GPU Compatibility (utilization chart).

🔗 Try it live: ftuneai.streamlit.app

✨ Features

📊 Memory Estimation

Component-level VRAM breakdown with support for FlashAttention-2, gradient checkpointing, and 5 optimizer types:

est = Estimator(
    model="meta-llama/Llama-3.1-8B",
    method="qlora",
    quantization="4bit",
    flash_attention=True,        # 25-50% activation memory reduction
    gradient_checkpointing=True, # 5x activation reduction
    optimizer="adam_8bit",       # 75% less optimizer memory vs AdamW
)

mem = est.estimate_memory()
print(f"Model weights:    {mem.model_weights_gb:.2f} GB")
print(f"LoRA adapters:    {mem.trainable_params_gb:.2f} GB")
print(f"Gradients:        {mem.gradients_gb:.2f} GB")
print(f"Optimizer states: {mem.optimizer_states_gb:.2f} GB")
print(f"Activations:      {mem.activations_gb:.2f} GB")
print(f"CUDA overhead:    {mem.overhead_gb:.2f} GB")
print(f"TOTAL:            {mem.total_gb:.2f} GB")

FlashAttention-2 avoids materializing the full N×N attention matrix, cutting activation memory by roughly 25-50%. In this example the saving is 2.30 GB (25%):

Without FlashAttention: 9.09 GB
With FlashAttention:    6.79 GB  ← saved 2.30 GB (25%)

Supported methods: Full Fine-Tuning, LoRA, QLoRA (4-bit / 8-bit)

Supported optimizers: AdamW, Adam, SGD, 8-bit Adam (bitsandbytes), Adafactor

⏱️ Training Time Estimation

FLOPs-based wall-clock time estimates with multi-GPU scaling:

# Single GPU
time = est.estimate_time(gpu="A100-80GB", dataset_size=50000, epochs=3)

# Compare all compatible GPUs
for t in est.estimate_time_all_gpus(dataset_size=50000, epochs=3):
    print(f"{t.gpu_name:<18} {t.total_hours:>6.1f}h")

💰 Cloud Cost Comparison

Compare across 8 cloud providers including spot pricing:

costs = est.full_comparison(dataset_size=50000, epochs=3)
print(f"🏆 Cheapest: {costs.cheapest}")
print(f"💡 Best value: {costs.best_value}")

Providers: AWS, Google Cloud, Microsoft Azure, Lambda Labs, RunPod, Vast.ai, Together AI, Modal

🎯 Budget Optimizer

Reverse the logic — tell ftune your constraints and it finds the optimal configuration:

from ftuneai import BudgetOptimizer

recs = BudgetOptimizer.optimize(
    model="meta-llama/Llama-3.1-8B",
    budget=50.0,              # max $50
    gpu="RTX-3090-24GB",      # hardware constraint
    dataset_size=10000,
    epochs=1,
    priority="cost",          # "cost", "speed", or "quality"
)

print(BudgetOptimizer.format_recommendations(recs))

╭────────────────────────────────────────────────╮
│    🎯 ftune Budget Optimizer — Recommendations  │
╰────────────────────────────────────────────────╯

  #1: LORA none, rank=8
      GPU: RTX-3090-24GB (Vast.ai) | 1 GPU(s)
      Batch: 1 × 1 accum | Optimizer: adamw
      Memory: 17.8 GB | Time: 8.3h | Cost: $2.07
      💡 FlashAttention-2 enabled, Gradient checkpointing ON

  #2: LORA none, rank=16
      GPU: RTX-3090-24GB (Vast.ai) | 1 GPU(s)
      Batch: 1 × 1 accum | Optimizer: adamw
      Memory: 17.9 GB | Time: 8.3h | Cost: $2.07
      💡 FlashAttention-2 enabled, Gradient checkpointing ON

The optimizer searches across methods, LoRA ranks, batch sizes, optimizers, and FlashAttention to find configurations that fit your budget and hardware.
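
The search described above can be pictured as a filtered grid search. The sketch below is only an illustration of that idea, not ftune's actual optimizer: `search_configs` and `toy_estimate` are hypothetical names, and the memory and cost numbers are made up.

```python
from itertools import product


def search_configs(estimate, ranks, batch_sizes, max_memory_gb, budget_usd):
    """Return feasible (cost, config) pairs, cheapest first.

    `estimate` is any callable mapping (rank, batch_size) to
    (memory_gb, cost_usd); here it stands in for a real estimator.
    """
    feasible = []
    for rank, batch in product(ranks, batch_sizes):
        memory_gb, cost_usd = estimate(rank, batch)
        # Keep only configurations that fit both the GPU and the budget.
        if memory_gb <= max_memory_gb and cost_usd <= budget_usd:
            config = {"lora_rank": rank, "batch_size": batch, "memory_gb": memory_gb}
            feasible.append((cost_usd, config))
    return sorted(feasible, key=lambda item: item[0])


def toy_estimate(rank, batch):
    # Made-up formulas for illustration only (not ftune's numbers).
    memory_gb = 16.0 + 0.01 * rank + 1.5 * batch
    cost_usd = 4.0 / batch
    return memory_gb, cost_usd


results = search_configs(toy_estimate, [8, 16, 32], [1, 2, 4, 8],
                         max_memory_gb=24.0, budget_usd=50.0)
for cost, config in results[:3]:
    print(f"${cost:.2f}  {config}")
```

A real search would also vary method, optimizer, and FlashAttention, and would rank by the chosen priority rather than by cost alone.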

🔀 Multi-GPU & Sharding

ftune supports DeepSpeed ZeRO Stages 1/2/3 and PyTorch FSDP for multi-GPU memory estimation:

# Single GPU — doesn't fit
est = Estimator(model="meta-llama/Llama-3.1-8B", method="full", batch_size=1)
print(est.estimate_memory().total_gb)  # 104.4 GB ❌

# ZeRO-3 on 4 GPUs — fits!
est = Estimator(
    model="meta-llama/Llama-3.1-8B",
    method="full",
    batch_size=1,
    sharding="zero_3",
    num_gpus=4,
)
print(est.estimate_memory().total_gb)  # 27.8 GB per GPU ✅

How sharding reduces per-GPU memory:

Strategy	What's sharded	8B Full FT (per GPU)
None (single GPU)	Nothing	104.4 GB
ZeRO Stage 1	Optimizer states	52.8 GB
ZeRO Stage 2	+ Gradients	39.9 GB
ZeRO Stage 3 / FSDP	+ Model weights	27.8 GB
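
The table follows the usual ZeRO accounting, where each stage divides one more component by the number of GPUs. A minimal sketch of that arithmetic, assuming illustrative component sizes: the function name and inputs are hypothetical, not ftune's internals, and the sketch ignores communication buffers, so it will not reproduce the table exactly.

```python
def per_gpu_memory_gb(weights, grads, optimizer, other, num_gpus, stage):
    """Per-GPU memory in GB under ZeRO stage 0-3 (0 means no sharding)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if stage >= 1:
        optimizer /= num_gpus  # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus      # Stage 2 also shards gradients
    if stage >= 3:
        weights /= num_gpus    # Stage 3 / FSDP also shards model weights
    # Activations and overhead ("other") are not sharded in this sketch.
    return weights + grads + optimizer + other


# Assumed sizes in GB, roughly an 8B model in bf16 with AdamW.
sizes = dict(weights=16.0, grads=16.0, optimizer=64.0, other=4.0)
for stage in range(4):
    gb = per_gpu_memory_gb(**sizes, num_gpus=4, stage=stage)
    print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```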

This means ftune can tell you: "This 70B model won't fit on one A100, but with QLoRA it will fit on 4×A100s using ZeRO-3, at 17.4 GB per GPU."

est = Estimator(
    model="meta-llama/Llama-3.1-70B",
    method="qlora",
    quantization="4bit",
    sharding="zero_3",
    num_gpus=4,
)
mem = est.estimate_memory()
print(f"70B QLoRA ZeRO-3: {mem.total_gb:.1f} GB per GPU")  # 17.4 GB ✅

Supported strategies: none, zero_1, zero_2, zero_3, fsdp, fsdp_shard_grad

🔧 Calibration

Generic MFU constants can put time estimates off by 2-3x or more, depending on your hardware, drivers, and framework. Calibration fixes this.

Run a quick 10-step benchmark on your GPU, then feed the results to ftune:

from ftuneai import Estimator, Calibrator

est = Estimator(model="meta-llama/Llama-3.1-8B", method="qlora", quantization="4bit")
mem = est.estimate_memory()
time = est.estimate_time(gpu="A100-80GB", dataset_size=50000, epochs=3)

# After running 10 real training steps, you measured:
cal = Calibrator.from_benchmark(
    estimated_memory_gb=mem.total_gb,     # ftune's estimate
    actual_memory_gb=11.2,                 # nvidia-smi peak
    estimated_time_hours=time.total_hours, # ftune's estimate
    actual_time_hours=5.0,                 # extrapolated from benchmark
    gpu_name="A100-80GB",
)

# Now all future estimates are hardware-calibrated
adjusted_time = cal.adjust_time(time.total_hours)
adjusted_memory = cal.adjust_memory(mem.total_gb)
print(f"Calibrated time: {adjusted_time:.1f}h (was {time.total_hours:.1f}h)")
print(f"Calibrated memory: {adjusted_memory:.1f} GB (was {mem.total_gb:.1f} GB)")

# Save calibration for reuse
Calibrator.save(cal.result, "~/.ftune/my_a100_calibration.json")

# Load it later
from ftuneai.core.models import CalibrationResult
saved = Calibrator.load("~/.ftune/my_a100_calibration.json")

You can also use the measured MFU directly:

# Calibration found your actual MFU is 0.52
time = est.estimate_time(gpu="A100-80GB", dataset_size=50000, epochs=3, mfu_override=0.52)
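
One plausible way to read calibration is as a pair of ratios between measured and estimated values. The sketch below is an assumption about the idea, not ftune's actual `Calibrator` implementation; `SimpleCalibration` and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleCalibration:
    memory_factor: float  # actual / estimated memory
    time_factor: float    # actual / estimated time

    @classmethod
    def from_benchmark(cls, est_mem_gb, actual_mem_gb, est_hours, actual_hours):
        if est_mem_gb <= 0 or est_hours <= 0:
            raise ValueError("estimates must be positive")
        return cls(actual_mem_gb / est_mem_gb, actual_hours / est_hours)

    def adjust_memory(self, est_mem_gb):
        return est_mem_gb * self.memory_factor

    def adjust_time(self, est_hours):
        return est_hours * self.time_factor

    def calibrated_mfu(self, default_mfu):
        # Time is inversely proportional to MFU, so a run that took
        # time_factor times longer implies an MFU lower by that factor.
        return default_mfu / self.time_factor


# Hypothetical benchmark: estimated 10.0 GB / 4.0 h, measured 11.2 GB / 5.0 h.
cal = SimpleCalibration.from_benchmark(10.0, 11.2, 4.0, 5.0)
print(f"Adjusted time for an 8.0 h estimate: {cal.adjust_time(8.0):.1f} h")
print(f"Implied MFU from a 0.35 default: {cal.calibrated_mfu(0.35):.2f}")
```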

📊 Validation

Compare ftune estimates against actual training metrics from real runs:

from ftuneai import Estimator
from ftuneai.validation import Validator, ActualMetrics

est = Estimator(model="meta-llama/Llama-3.1-8B", method="qlora", quantization="4bit")

actual = ActualMetrics(
    peak_memory_gb=11.2,
    training_time_hours=4.5,
    total_cost=8.50,
    gpu_name="A100-80GB",
    dataset_size=50000,
    epochs=3,
)

result = Validator.compare(est, actual)
print(Validator.format_report(result))
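
At its simplest, the comparison comes down to a signed relative error per metric. A minimal sketch, assuming that definition; `percent_error` is a hypothetical helper, not part of `ftuneai.validation`.

```python
def percent_error(estimated: float, actual: float) -> float:
    """Signed error of an estimate relative to the measured value, in percent."""
    if actual <= 0:
        raise ValueError("actual must be positive")
    return (estimated - actual) / actual * 100.0


# Hypothetical estimate of 12.0 GB against a measured peak of 11.2 GB.
print(f"Memory error: {percent_error(12.0, 11.2):+.1f}%")
```

A positive value means the estimate was conservative (higher than measured), and a negative one means it was too optimistic.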

Load metrics from multiple sources:

actual = Validator.from_json("training_metrics.json")
actual = Validator.from_wandb("username/project/run_id")  # pip install ftuneai[wandb]
metrics_list = Validator.from_csv("all_runs.csv")          # batch validation

⌨️ CLI

pip install "ftuneai[cli]"

# Full estimate
ftune estimate --model meta-llama/Llama-3.1-8B --method qlora --quantization 4bit

# With all options
ftune estimate \
  --model meta-llama/Llama-3.1-8B \
  --method qlora \
  --quantization 4bit \
  --lora-rank 16 \
  --batch-size 4 \
  --seq-length 2048 \
  --dataset-size 50000 \
  --epochs 3 \
  --output json

# List models and GPUs
ftune models
ftune gpus

# Check pricing
ftune pricing --gpu A100-80GB --hours 10

# Validate against actual metrics
ftune validate --model meta-llama/Llama-3.1-8B --method qlora --metrics metrics.json

🧮 How It Works

Memory Formula

Component	Full Fine-Tune	LoRA	QLoRA (4-bit)
Model Weights	`params × dtype_bytes`	`params × dtype_bytes`	`params × 0.5` + quant overhead
Trainable Params	(same as weights)	`modules × 2 × hidden × rank`	Same as LoRA
Gradients	`params × 2B`	`lora_params × 2B`	`lora_params × 2B`
Optimizer (AdamW)	`params × 8B`	`lora_params × 8B`	`lora_params × 8B`
Optimizer (8-bit Adam)	`params × 2B`	`lora_params × 2B`	`lora_params × 2B`
Activations	`batch × seq × hidden × layers × factor`	Same	Same
FlashAttention-2	Activations × 0.5	Activations × 0.5	Activations × 0.5
Overhead	~15% buffer	~15% buffer	~15% buffer
ZeRO-3 / FSDP	Weights, grads, optimizer ÷ N GPUs	Same	Same
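
The LoRA column can be turned into a few lines of arithmetic. This is a sketch under stated assumptions rather than ftune's code: `lora_memory_gb` is a hypothetical name, gigabytes are decimal (1e9 bytes), adapters are stored in the same dtype as the weights, activations are passed in by the caller because the activation factor is not specified here, and the 15% buffer is applied to the subtotal.

```python
GB = 1e9  # decimal gigabytes (assumption; a tool may use GiB instead)


def lora_memory_gb(params, hidden, num_modules, rank,
                   dtype_bytes=2, activations_gb=0.0, overhead=0.15):
    """Apply the LoRA column of the memory table to one configuration."""
    # Each adapted module adds two low-rank matrices of shape hidden x rank.
    lora_params = num_modules * 2 * hidden * rank
    weights_gb = params * dtype_bytes / GB
    adapters_gb = lora_params * dtype_bytes / GB
    gradients_gb = lora_params * 2 / GB   # 2 bytes per trainable parameter
    optimizer_gb = lora_params * 8 / GB   # AdamW: 8 bytes per trainable parameter
    subtotal = weights_gb + adapters_gb + gradients_gb + optimizer_gb + activations_gb
    return {
        "lora_params": lora_params,
        "weights_gb": weights_gb,
        "adapters_gb": adapters_gb,
        "gradients_gb": gradients_gb,
        "optimizer_gb": optimizer_gb,
        "total_gb": subtotal * (1 + overhead),
    }


# Assumed shapes: 8e9 parameters, hidden size 4096, 64 adapted modules, rank 16.
breakdown = lora_memory_gb(params=8e9, hidden=4096, num_modules=64, rank=16)
for name, value in breakdown.items():
    print(f"{name:>13}: {value:,.3f}")
```

With these inputs the frozen base weights dominate the total, which is why quantizing them (QLoRA) saves far more memory than lowering the LoRA rank.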

Training Time Formula

FLOPs per token ≈ 6 × num_parameters
Total FLOPs = flops_per_token × dataset_tokens × epochs
Time = Total FLOPs / (GPU TFLOPS × MFU × num_gpus × scaling_efficiency)

MFU defaults to 0.30-0.35 (conservative). Use calibration mode for hardware-specific values.
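
The three lines above translate directly into code. The sketch below is a worked example of the formula, not ftune's implementation: `training_hours` is a hypothetical name, and the token count assumes every sample fills the full sequence length.

```python
def training_hours(num_params, dataset_tokens, epochs, gpu_tflops,
                   mfu=0.30, num_gpus=1, scaling_efficiency=1.0):
    """Wall-clock hours from the FLOPs-based formula."""
    flops_per_token = 6 * num_params
    total_flops = flops_per_token * dataset_tokens * epochs
    # Sustained throughput in FLOP/s across all GPUs.
    effective_flops = gpu_tflops * 1e12 * mfu * num_gpus * scaling_efficiency
    return total_flops / effective_flops / 3600


# Assumed workload: 8e9 parameters, 50,000 samples x 2,048 tokens, 3 epochs,
# one A100 at 312 FP16 TFLOPS with the default MFU of 0.30.
tokens = 50_000 * 2_048
hours = training_hours(8e9, tokens, epochs=3, gpu_tflops=312)
print(f"Estimated time: {hours:.1f} hours")
```

Because MFU sits in the denominator, an MFU that is off by a factor of two changes the time estimate by the same factor, which is the reason calibration matters.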

Limitations & Accuracy

ftune provides analytical estimates, not profiling results. All numbers are derived from architecture-level formulas with empirical correction factors — no PyTorch, no GPU required, but also no runtime measurement.

Known assumptions and their impact:

Assumption	Impact	Mitigation
MFU defaults (0.30-0.35)	Time estimates can be off by 2-3x depending on hardware, batch size, and framework optimizations	Use calibration mode with a real 10-step benchmark
Activation memory formula	Simplified — doesn't model per-op memory peaks or memory allocator behavior	Conservative factors partially compensate
Static cloud pricing	Prices change frequently; bundled data may be stale	Check provider websites for current rates
LoRA on MoE models	Assumes standard (non-expert) LoRA targets	Expert-specific LoRA estimation not yet supported

When to trust ftune estimates:

Relative comparisons (QLoRA vs LoRA, GPU A vs GPU B) — high confidence
Will-it-fit checks (does this config OOM on 24GB?) — good confidence with ~20% margin
Absolute wall-clock time — use calibration mode; defaults can be off by 2-3x
Absolute cost — treat as order-of-magnitude; verify provider pricing

When NOT to trust ftune:

Sequence lengths near the model's maximum (attention memory scaling is nonlinear)
Exotic architectures not in the model database (use HuggingFace Hub auto-detect and verify)
Multi-node training (ftune models single-node multi-GPU only)

📋 Supported Models, GPUs & Providers

Built-in Models (15+)

Model	Parameters	Default dtype
meta-llama/Llama-3.1-8B / 70B / 405B	8B / 70B / 405B	bf16
mistralai/Mistral-7B-v0.3	7B	bf16
mistralai/Mixtral-8x7B-v0.1	47B (MoE)	bf16
google/gemma-2-9b / 27b	9B / 27B	bf16
Qwen/Qwen2.5-7B / 72B	7B / 72B	bf16
microsoft/phi-3-mini / medium	3.8B / 14B	bf16
deepseek-ai/DeepSeek-V2-Lite	16B	bf16
+ Yi, Falcon, StableLM	various	bf16

Plus ANY model on HuggingFace Hub via auto-detect.

GPUs (11)

GPU	VRAM	FP16 TFLOPS
NVIDIA H100	80 GB	989
NVIDIA A100	40 / 80 GB	312
NVIDIA A10G	24 GB	125
NVIDIA L4	24 GB	121
RTX 4090 / 4080	24 / 16 GB	165 / 97
RTX 3090	24 GB	71
Tesla T4 / V100	16 / 16-32 GB	65 / 125

Cloud Providers (8)

Provider	GPUs Available	Spot Pricing
AWS	H100, A100, T4, L4	✅
Google Cloud	H100, A100, T4, L4, V100	✅
Microsoft Azure	H100, A100, T4, V100	✅
Lambda Labs	H100, A100, RTX 4090	—
RunPod	H100, A100, RTX 4090/3090, L4	✅
Vast.ai	H100, A100, RTX 4090/3090	—
Together AI	H100, A100	—
Modal	H100, A100, L4, T4	—

📦 Installation

pip install ftuneai              # Core library (zero ML dependencies)
pip install "ftuneai[cli]"       # + CLI with Rich terminal output
pip install "ftuneai[web]"       # + Streamlit web UI
pip install "ftuneai[wandb]"     # + Weights & Biases validation
pip install "ftuneai[all]"       # Everything

🗺️ Roadmap

Memory estimation (full, LoRA, QLoRA)
Training time estimation (FLOPs-based, multi-GPU)
Cloud cost comparison (8 providers, spot pricing)
HuggingFace Hub auto-detect
FlashAttention-2 memory optimization
ZeRO Stages 1/2/3 + FSDP sharding
Hardware calibration mode
Budget optimizer ("I have $50, what's optimal?")
Validation mode (manual, JSON, W&B, CSV)
Streamlit web UI
CLI with Rich output
GitHub Actions CI/CD
Streamlit Cloud deployment (public hosted version)
PyPI publish
Community validation dataset (crowdsourced accuracy data)
Active pricing API (real-time provider rates)
Exportable PDF reports

🔑 Design Principles

Zero ML dependencies — Pure Python calculator. No PyTorch, no TensorFlow, no GPU required.
Works with any model — HuggingFace Hub integration for instant support of thousands of models.
Hardware-aware — Calibration mode closes the gap between theory and your specific setup.
Enterprise-ready — ZeRO/FSDP sharding makes ftune relevant for serious multi-GPU training.
Validates itself — Compare estimates against actual runs. No black box.
Fast — Every estimate returns in under 1 second.

🤝 Contributing

The most valuable contribution is validation data. Run ftune against your actual training runs and share the results:

from ftuneai import Estimator, Calibrator

est = Estimator(model="your-model", method="qlora", quantization="4bit")  # add your other settings here
cal = Calibrator.from_benchmark(
    estimated_memory_gb=est.estimate_memory().total_gb,
    actual_memory_gb=...,       # from nvidia-smi
    estimated_time_hours=est.estimate_time(...).total_hours,
    actual_time_hours=...,       # wall-clock
    gpu_name="...",
)
print(cal.format_report())
# Share this in an issue or PR!

Other ways to help: update pricing data, add models/GPUs, fix bugs, improve the web UI.

git clone https://github.com/ritikmahy5/ftune.git
cd ftune
pip install -e ".[dev]"
PYTHONPATH=src pytest tests/ -v

See CONTRIBUTING.md for full guidelines.

📄 License

MIT License — See LICENSE for details.

If ftune saved you from an OOM error or an expensive cloud bill, consider giving it a ⭐

Built by Ritik Mahyavanshi

Release history

Version	Release date
0.4.1	Apr 10, 2026
0.4.0	Apr 9, 2026
0.3.1	Apr 9, 2026
0.3.0 (this version)	Apr 9, 2026
0.2.1	Apr 9, 2026
0.2.0	Apr 9, 2026

Download files

Source Distribution

ftuneai-0.3.0.tar.gz (60.9 kB)
Upload date: Apr 9, 2026
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

Algorithm	Hash digest
SHA256	`520daaf282918e7df139f8da10ddd9d1090fc6a594f1c66f4d2c4c0192bdac5b`
MD5	`6f5c11aca14f332a9c75d2f072007070`
BLAKE2b-256	`1f19ed002c49142e6db5cd5574eaa749f456ecba6770a68a48b37eba36f455e5`

Built Distribution

ftuneai-0.3.0-py3-none-any.whl (54.6 kB)
Upload date: Apr 9, 2026
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

Algorithm	Hash digest
SHA256	`18ecd32c74a1d63c9ae1403a96dbc895f8b61b4458754d4a5bfeb383eb607a92`
MD5	`471042c242ceb6728675aa9c7d4f2625`
BLAKE2b-256	`9e6aed801694d71df31c79474d454b59183b6c9080d024f6b0e3d2779d17027f`
