HF VRAM Calculator
A professional Python CLI tool for estimating GPU memory requirements for Hugging Face models with different data types and parallelization strategies.
⚡ Latest Features: Smart dtype detection, MHA/MQA/GQA-aware KV cache, 12 quantization formats, 20+ GPU models, professional Rich UI
Quick Demo
# Install and run
pip install hf-vram-calc
# Set up authentication (required for most models)
hf auth login --token yourtoken --add-to-git-credential
# Calculate memory requirements
hf-vram-calc microsoft/DialoGPT-medium
# Output: Beautiful tables showing 0.9GB inference, GPU compatibility, parallelization strategies
Features
- Automatic Model Analysis: Fetch configurations from the Hugging Face Hub automatically
- Smart Data Type Detection: Intelligent dtype recommendation from model names, config, or defaults
- Comprehensive Data Type Support: fp32, fp16, bf16, fp8, int8, int4, mxfp4, nvfp4, awq_int4, gptq_int4, nf4, fp4
- Multi-Scenario Memory Estimation:
  - Inference: Model weights + KV cache overhead (MHA/MQA/GQA-aware, ×1.2 factor)
  - Training: Full Adam optimizer states (×4 ×1.3 factors)
  - LoRA Fine-tuning: Low-rank adaptation with trainable parameter overhead
- Advanced Parallelization Analysis:
  - Tensor Parallelism (TP): 1, 2, 4, 8
  - Pipeline Parallelism (PP): 1, 2, 4, 8
  - Expert Parallelism (EP) for MoE models
  - Data Parallelism (DP): 2, 4, 8
  - Combined strategies (TP + PP combinations)
- GPU Compatibility Matrix:
  - 20+ GPU models (RTX 4090, A100, H100, L40S, etc.)
  - Automatic compatibility checking for inference/training/LoRA
  - Minimum GPU memory requirement calculations
- Professional Rich UI:
  - Color-coded tables and panels
  - Real-time progress indicators
  - Modern CLI interface
  - Smart recommendations and warnings
- Flexible Configuration:
  - Customizable LoRA rank, batch size, and sequence length
  - External JSON configuration files
  - User-defined GPU models and data types
- Parameter Display: Raw count + human-readable format (e.g., "405,016,576 (405.0M)")
Installation
Quick Install (from PyPI)
pip install hf-vram-calc
Build from Source
# Clone the repository
git clone <repository-url>
cd hf-vram-calc
# Build with uv (recommended)
uv build
uv pip install dist/hf_vram_calc-1.0.0-py3-none-any.whl
# Or install directly
uv pip install .
Dependencies:
requests (HTTP), rich (beautiful CLI), Python ≥ 3.8
For detailed build instructions, see: BUILD.md
Authentication Setup
Many models require a Hugging Face token. Get yours at https://huggingface.co/settings/tokens, then:
hf auth login --token yourtoken --add-to-git-credential
Usage
Basic Usage - Smart Dtype Detection
# Automatic dtype recommendation from model config/name
hf-vram-calc microsoft/DialoGPT-medium
# Model name contains dtype - automatically detects FP4
hf-vram-calc nvidia/DeepSeek-R1-0528-FP4
Specify Data Type Override
# Override with specific data type
hf-vram-calc meta-llama/Llama-2-7b-hf --dtype bf16
hf-vram-calc mistralai/Mistral-7B-v0.1 --dtype bf16,fp8
Advanced Configuration
# Custom batch size and sequence length
hf-vram-calc mistralai/Mistral-7B-v0.1 --batch-size 4 --sequence-length 4096
# Custom LoRA rank for fine-tuning estimation
hf-vram-calc microsoft/DialoGPT-medium --lora-rank 128
# Detailed analysis (disabled by default)
hf-vram-calc meta-llama/Llama-2-7b-hf --verbose
System Information
# List all available data types and GPU models
hf-vram-calc --list-types
# Use custom configuration directory
hf-vram-calc --config-dir ./my_config microsoft/DialoGPT-medium
# Show help
hf-vram-calc --help
Command Line Arguments
Required
model_name: Hugging Face model name (e.g., microsoft/DialoGPT-medium)
Data Type Control
- --dtype {fp32,fp16,bf16,fp8,int8,int4,mxfp4,nvfp4,awq_int4,fp4,nf4,gptq_int4}: Override automatic dtype detection
- --list-types: List all available data types and GPU models
Memory Estimation Parameters
- --batch-size BATCH_SIZE: Batch size for activation estimation (default: 1)
- --sequence-length SEQUENCE_LENGTH: Sequence length for memory calculation (default: 2048)
- --lora-rank LORA_RANK: LoRA rank for fine-tuning estimation (default: 64)
Display & Configuration
- --verbose: Show detailed parallelization and GPU compatibility (default: disabled)
- --config-dir CONFIG_DIR: Custom configuration directory path
- --help: Show complete help message with examples
Smart Behavior
- No --dtype: Uses intelligent priority (model name → config → fp16 default)
- With --dtype: Overrides automatic detection with the specified type
- Invalid model: Graceful error handling with helpful suggestions
Quick Start Examples
# Set up authentication first time
hf auth login --token yourtoken --add-to-git-credential
# Estimate memory for different models
hf-vram-calc microsoft/DialoGPT-medium   # → 0.9GB inference (FP16)
hf-vram-calc meta-llama/Llama-2-7b-hf    # → ~13GB inference
hf-vram-calc nvidia/DeepSeek-R1-0528-FP4 # → Auto-detects FP4 from name
# Estimate size for specific quantization formats
hf-vram-calc meta-llama/Llama-2-7b-hf --dtype fp16      # → ~13GB
hf-vram-calc meta-llama/Llama-2-7b-hf --dtype int4      # → ~3.5GB
hf-vram-calc meta-llama/Llama-2-7b-hf --dtype awq_int4  # → ~3.5GB
# For private or gated models, using --local-config is recommended
hf-vram-calc meta-llama/Llama-4-Scout-17B-16E-Instruct --local-config config.json
# Find optimal parallelization strategy
hf-vram-calc mistralai/Mistral-7B-v0.1 --verbose  # → TP/PP recommendations
# Check what's available
hf-vram-calc --list-types  # → All types & GPUs
Data Type Priority & Detection
Automatic Data Type Recommendation
The tool uses intelligent priority-based dtype selection:
- Model Name Detection (highest priority)
  - model-fp16, model-bf16 → extracts the dtype from the model name
  - model-4bit, model-gptq, model-awq → detects the quantization format
- Config torch_dtype (medium priority)
  - Reads torch_dtype from the model's config.json
  - Maps torch.float16 → fp16, torch.bfloat16 → bf16, etc.
- Default Fallback (lowest priority)
  - Defaults to fp16 when no dtype is detected
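The priority chain above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the tool's actual code, and the pattern table is only a subset of the detection patterns documented below:

```python
# Illustrative sketch of the name -> config -> default priority chain.
# Quantization-specific patterns are checked before generic int4/fp4
# so that e.g. "model-awq-int4" resolves to awq_int4.
NAME_PATTERNS = {
    "fp16": ["fp16", "float16", "half"],
    "bf16": ["bf16", "bfloat16"],
    "fp8": ["fp8", "float8"],
    "awq_int4": ["awq"],
    "gptq_int4": ["gptq"],
    "nf4": ["nf4", "bnb-4bit"],
    "nvfp4": ["nvfp4"],
    "mxfp4": ["mxfp4"],
    "int8": ["int8", "8bit"],
    "int4": ["int4", "4bit"],
    "fp4": ["fp4"],
}
TORCH_DTYPE_MAP = {"float32": "fp32", "float16": "fp16", "bfloat16": "bf16"}

def recommend_dtype(model_name: str, config: dict) -> str:
    name = model_name.lower()
    # 1. Highest priority: dtype hints embedded in the model name
    for dtype, patterns in NAME_PATTERNS.items():
        if any(p in name for p in patterns):
            return dtype
    # 2. Medium priority: torch_dtype from the model's config.json
    torch_dtype = str(config.get("torch_dtype", "")).replace("torch.", "")
    if torch_dtype in TORCH_DTYPE_MAP:
        return TORCH_DTYPE_MAP[torch_dtype]
    # 3. Lowest priority: fall back to fp16
    return "fp16"

print(recommend_dtype("nvidia/DeepSeek-R1-0528-FP4", {}))  # fp4
```

Note that insertion order in the pattern dict matters: substring matching is crude, so more specific names must be tested first.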
Supported Data Types
| Data Type | Bytes/Param | Description | Detection Patterns |
|---|---|---|---|
| fp32 | 4.0 | 32-bit floating point | fp32, float32 |
| fp16 | 2.0 | 16-bit floating point | fp16, float16, half |
| bf16 | 2.0 | Brain Float 16 | bf16, bfloat16 |
| fp8 | 1.0 | 8-bit floating point | fp8, float8 |
| int8 | 1.0 | 8-bit integer | int8, 8bit |
| int4 | 0.5 | 4-bit integer | int4, 4bit |
| mxfp4 | 0.5 | Microsoft FP4 | mxfp4 |
| nvfp4 | 0.5 | NVIDIA FP4 | nvfp4 |
| awq_int4 | 0.5 | AWQ 4-bit quantization | awq, awq-int4 |
| gptq_int4 | 0.5 | GPTQ 4-bit quantization | gptq, gptq-int4 |
| nf4 | 0.5 | 4-bit NormalFloat | nf4, bnb-4bit |
| fp4 | 0.5 | 4-bit floating point | fp4 |
Parallelization Strategies
Tensor Parallelism (TP)
Splits model weights by tensor dimensions across multiple GPUs.
Pipeline Parallelism (PP)
Distributes different model layers to different GPUs.
Expert Parallelism (EP)
For MoE (Mixture of Experts) models, distributes expert networks to different GPUs.
Data Parallelism (DP)
Each GPU holds a complete model copy, only splitting data.
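The memory effect of these strategies can be approximated with simple division. This is a simplified sketch, not the tool's implementation: weights shard evenly across TP × PP ranks, while DP (and EP, which depends on the MoE routing layout) keeps full or expert-local replicas per GPU:

```python
def memory_per_gpu(total_gb: float, tp: int = 1, pp: int = 1) -> float:
    """Idealized per-GPU footprint: weights shard evenly across TP x PP ranks.

    Real frameworks add communication buffers, uneven layer splits, and
    replicated embeddings, so treat this as a lower bound.
    """
    return total_gb / (tp * pp)

# Reproduce the strategy table below for a 0.91 GB FP16 inference footprint
for name, tp, pp in [("Single GPU", 1, 1), ("Tensor Parallel", 2, 1), ("TP + PP", 4, 4)]:
    print(f"{name}: {memory_per_gpu(0.91, tp, pp)} GB/GPU")
```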
Example Output
Smart Dtype Detection Example
$ hf-vram-calc microsoft/DialoGPT-medium --verbose
Using recommended data type: FP16
Use --dtype to specify different type, or see --list-types for all options
Fetching configuration for microsoft/DialoGPT-medium...
Parsing model configuration...
Calculating model parameters...
Computing memory requirements...
╭──────── Model Information ───────╮
│                                  │
│ Model: microsoft/DialoGPT-medium │
│ Architecture: gpt2               │
│ Parameters: 406,286,336 (406.3M) │
│ Recommended dtype: FP16          │
│                                  │
╰──────────────────────────────────╯
Memory Requirements by Data Type and Scenario
╭───────────┬────────────┬──────────┬────────────┬─────────────┬──────╮
│           │ Model Size │ KV Cache │ Inference  │ Training    │ LoRA │
│ Data Type │ (GB)       │ (GB)     │ Total (GB) │ (Adam) (GB) │ (GB) │
├───────────┼────────────┼──────────┼────────────┼─────────────┼──────┤
│ FP16      │ 0.76       │ 0.19     │ 0.91       │ 3.94        │ 0.94 │
╰───────────┴────────────┴──────────┴────────────┴─────────────┴──────╯
================================================================================
Parallelization Strategies (FP16 Inference)
┌────────────────────┬────┬────┬────┬────┬────────────┬──────────┐
│                    │    │    │    │    │ Memory/GPU │ Min GPU  │
│ Strategy           │ TP │ PP │ EP │ DP │ (GB)       │ Required │
├────────────────────┼────┼────┼────┼────┼────────────┼──────────┤
│ Single GPU         │ 1  │ 1  │ 1  │ 1  │ 0.91       │ 4GB+     │
│ Tensor Parallel    │ 2  │ 1  │ 1  │ 1  │ 0.45       │ 4GB+     │
│ TP + PP            │ 4  │ 4  │ 1  │ 1  │ 0.06       │ 4GB+     │
└────────────────────┴────┴────┴────┴────┴────────────┴──────────┘
GPU Compatibility Matrix
┌───────────┬────────┬───────────┬──────────┬──────┐
│ GPU Type  │ Memory │ Inference │ Training │ LoRA │
├───────────┼────────┼───────────┼──────────┼──────┤
│ RTX 4090  │ 24GB   │ ✓         │ ✓        │ ✓    │
│ A100 80GB │ 80GB   │ ✓         │ ✓        │ ✓    │
│ H100 80GB │ 80GB   │ ✓         │ ✓        │ ✓    │
└───────────┴────────┴───────────┴──────────┴──────┘
╭── Minimum GPU Requirements ──╮
│                              │
│ Single GPU Inference: 0.9GB  │
│ Single GPU Training: 3.9GB   │
│ Single GPU LoRA: 0.9GB       │
│                              │
╰──────────────────────────────╯
Large Model with User Override
$ hf-vram-calc nvidia/DeepSeek-R1-0528-FP4 --dtype nvfp4
╭───────── Model Information ────────╮
│                                    │
│ Model: nvidia/DeepSeek-R1-0528-FP4 │
│ Architecture: deepseek_v3          │
│ Parameters: 30,510,606,336 (30.5B) │
│ Original torch_dtype: bfloat16     │
│ User specified dtype: NVFP4        │
│                                    │
╰────────────────────────────────────╯
Memory Requirements by Data Type and Scenario
╭───────────┬────────────┬───────────┬─────────────┬───────╮
│           │ Total Size │ Inference │ Training    │ LoRA  │
│ Data Type │ (GB)       │ (GB)      │ (Adam) (GB) │ (GB)  │
├───────────┼────────────┼───────────┼─────────────┼───────┤
│ NVFP4     │ 14.21      │ 17.05     │ 73.88       │ 19.34 │
╰───────────┴────────────┴───────────┴─────────────┴───────╯
List Available Types
$ hf-vram-calc --list-types
Available Data Types:
╭───────────┬─────────────┬─────────────────────────╮
│ Data Type │ Bytes/Param │ Description             │
├───────────┼─────────────┼─────────────────────────┤
│ FP32      │ 4           │ 32-bit floating point   │
│ FP16      │ 2           │ 16-bit floating point   │
│ BF16      │ 2           │ Brain Float 16          │
│ NVFP4     │ 0.5         │ NVIDIA FP4              │
│ AWQ_INT4  │ 0.5         │ AWQ 4-bit quantization  │
│ GPTQ_INT4 │ 0.5         │ GPTQ 4-bit quantization │
╰───────────┴─────────────┴─────────────────────────╯
Available GPU Types:
╭───────────┬─────────────┬────────────┬──────────────╮
│ GPU Name  │ Memory (GB) │ Category   │ Architecture │
├───────────┼─────────────┼────────────┼──────────────┤
│ RTX 4090  │ 24          │ consumer   │ Ada Lovelace │
│ A100 80GB │ 80          │ datacenter │ Ampere       │
│ H100 80GB │ 80          │ datacenter │ Hopper       │
╰───────────┴─────────────┴────────────┴──────────────╯
Calculation Formulas
Inference Memory
Inference Memory = Model Weights × 1.2
Includes model weights and KV cache overhead.
KV Cache Memory
KV Cache (GB) = 2 × Batch_Size × Sequence_Length × Head_Dim × Num_KV_Heads × Num_Layers × Precision ÷ 1,073,741,824
- Head_Dim = hidden_size ÷ num_attention_heads
- Num_KV_Heads = config.num_key_value_heads (if present) else num_attention_heads
- Automatically supports MHA, MQA, and GQA via model config; KV cache uses FP16/BF16 for quantized models
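Both rules translate directly into code. The following is a sketch under the formulas above, not the tool's actual implementation; the usage numbers are a GPT-2-medium-like configuration (hidden size 1024, 16 heads, 24 layers):

```python
def kv_cache_gb(batch_size, seq_len, hidden_size, num_attention_heads,
                num_layers, num_kv_heads=None, bytes_per_elem=2.0):
    """KV cache size per the formula above (factor 2 = one K + one V tensor)."""
    head_dim = hidden_size // num_attention_heads
    # MHA default; MQA/GQA models set num_key_value_heads < num_attention_heads
    kv_heads = num_kv_heads if num_kv_heads is not None else num_attention_heads
    total = 2 * batch_size * seq_len * head_dim * kv_heads * num_layers * bytes_per_elem
    return total / 1_073_741_824  # bytes -> GiB

def inference_gb(model_weights_gb):
    # Weights plus the x1.2 KV cache / activation overhead factor
    return model_weights_gb * 1.2

print(round(kv_cache_gb(1, 2048, 1024, 16, 24), 2))  # 0.19
```

The 0.19 GB result matches the KV cache column in the example output for DialoGPT-medium at the default batch size 1 and sequence length 2048.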
Training Memory (with Adam)
Training Memory = Model Weights × 4 × 1.3
- 4x factor: Model weights (1x) + Gradients (1x) + Adam optimizer states (2x)
- 1.3x factor: 30% additional overhead (activation caching, etc.)
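As a quick sanity check, the formula is a one-liner (sketch only, not the tool's code):

```python
def training_gb(model_weights_gb):
    # 4x: weights (1x) + gradients (1x) + Adam first/second moments (2x)
    # 1.3x: ~30% extra for activations, caching, and framework overhead
    return model_weights_gb * 4 * 1.3

print(training_gb(0.76))  # ~3.95 GB for the 0.76 GB FP16 example above
```

The example table shows 3.94 GB because the tool works from the unrounded weight size rather than the displayed 0.76.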
LoRA Fine-tuning Memory
LoRA Memory = (Model Weights + LoRA Parameter Overhead) × 1.2
LoRA parameter overhead calculated based on rank and target module ratio.
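A sketch of this calculation, following the (Model + LoRA_params × 4) × 1.2 form shown in the Memory Calculation Details table further down. The adapter-size estimate and both helper names are illustrative assumptions, not the tool's actual parameter-counting logic:

```python
def lora_adapter_gb(num_adapted_matrices, in_dim, out_dim, rank=64, bytes_per_param=2.0):
    # Each adapted weight W (out_dim x in_dim) gains low-rank factors
    # A (rank x in_dim) and B (out_dim x rank): rank * (in_dim + out_dim) params.
    params = num_adapted_matrices * rank * (in_dim + out_dim)
    return params * bytes_per_param / 1_073_741_824

def lora_total_gb(model_weights_gb, adapter_gb):
    # Adapters carry their own gradients + Adam states (x4) on top of the
    # frozen base model; x1.2 inference-style overhead on the whole footprint
    return (model_weights_gb + adapter_gb * 4) * 1.2
```

Because the adapter is tiny relative to the base model, LoRA memory lands close to inference memory, as in the 0.94 GB vs 0.91 GB example above.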
Advanced Features
Configuration System
External JSON configuration files for maximum flexibility:
- data_types.json - Add custom quantization formats
- gpu_types.json - Define new GPU models and specifications
- display_settings.json - Customize UI appearance and limits
# Use custom config directory
hf-vram-calc --config-dir ./custom_config model_name
# Add custom data type example (data_types.json)
{
"my_custom_int2": {
"bytes_per_param": 0.25,
"description": "Custom 2-bit quantization"
}
}
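Loading such a file might look like the following sketch. The schema is inferred from the example above, and the function names and built-in table are illustrative, not the tool's actual API:

```python
import json
from pathlib import Path

# Minimal built-in table for illustration; the real tool ships many more types
BUILTIN_TYPES = {"fp16": {"bytes_per_param": 2.0, "description": "16-bit floating point"}}

def load_data_types(config_dir):
    """Overlay user-defined dtype entries from data_types.json on the built-ins."""
    types = dict(BUILTIN_TYPES)
    path = Path(config_dir) / "data_types.json"
    if path.exists():
        types.update(json.loads(path.read_text()))
    return types

def weights_gb(num_params, dtype, types):
    # Raw weight footprint: parameter count x bytes per parameter, in GiB
    return num_params * types[dtype]["bytes_per_param"] / 1_073_741_824
```

Under the custom 2-bit entry from the example above, a 7B-parameter model would come out to roughly 7e9 × 0.25 bytes ≈ 1.63 GB.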
Memory Calculation Details
| Scenario | Formula | Explanation |
|---|---|---|
| Inference | Model × 1.2 | Includes KV cache and activation overhead |
| Training | Model × 4 × 1.3 | Weights (1x) + Gradients (1x) + Adam (2x) + 30% overhead |
| LoRA | (Model + LoRA_params × 4) × 1.2 | Base model + trainable parameters with optimizer |
Parallelization Efficiency
- TP (Tensor Parallel): Near-linear scaling, slight communication overhead
- PP (Pipeline Parallel): Good efficiency, pipeline bubble ~10-15%
- EP (Expert Parallel): MoE-specific, depends on expert routing efficiency
- DP (Data Parallel): No memory reduction per GPU, full model replica
Supported Architectures
Fully Supported
- GPT Family: GPT-2, GPT-3, GPT-4, GPT-NeoX, etc.
- LLaMA Family: LLaMA, LLaMA-2, Code Llama, Vicuna, etc.
- Mistral Family: Mistral 7B, Mixtral 8x7B (MoE), etc.
- Other Transformers: BERT, RoBERTa, T5, FLAN-T5, etc.
- New Architectures: DeepSeek, Qwen, ChatGLM, Baichuan, etc.
Architecture Detection
- Automatic field mapping for different config.json formats
- Fallback support for uncommon architectures
- MoE handling for Mixture-of-Experts models
Accuracy & Limitations
Highly Accurate For:
- Parameter counting (exact calculation)
- Memory estimation (within 5-10% of actual)
- Parallelization ratios (theoretical maximum)
⚠️ Considerations:
- Activation memory varies with sequence length and optimization
- Real-world efficiency may differ due to framework overhead
- Quantization accuracy depends on specific implementation
- MoE models require expert routing consideration
Build & Development
Built with modern Python tooling:
- uv: Fast Python package management and building
- Rich: Professional terminal interface
- Requests: HTTP client for model config fetching
- JSON configuration: Flexible external configuration system
For development setup, see: BUILD.md
Contributing
We welcome contributions! Areas for improvement:
- New quantization formats (add to data_types.json)
- GPU models (update gpu_types.json)
- Architecture support (enhance config parsing)
- Performance optimizations
- Documentation improvements
- Test coverage expansion
See Also
- BUILD.md - Complete build and installation guide
- CONFIG_GUIDE.md - Configuration customization details
- Examples in help: hf-vram-calc --help for usage examples
Version History
- v1.0.0: Complete rewrite with uv build, smart dtype detection, professional UI
- v0.x: Legacy single-file version (deprecated)
License
MIT License - see LICENSE file for details.
Made with ❤️ for the ML community | Built with uv and Rich