GPU memory calculator for Hugging Face models with different data types and parallelization strategies

Project description

Model VRAM Calculator

A Python CLI tool for estimating GPU memory requirements for Hugging Face models with different data types and parallelization strategies.

Features

  • 🔍 Automatically fetch model configurations from Hugging Face
  • 📊 Support for multiple data types: fp32, fp16/bf16, fp8, int8, int4, mxfp4, nvfp4
  • 🎯 Memory estimation for different scenarios:
    • Inference: model weights + KV cache overhead
    • Training: including gradients and optimizer states (Adam)
    • LoRA fine-tuning: low-rank adaptation memory requirements
  • ⚡ Memory distribution across parallelization strategies:
    • Tensor Parallelism (TP): 1, 2, 4, 8
    • Pipeline Parallelism (PP): 1, 2, 4, 8
    • Expert Parallelism (EP)
    • Data Parallelism (DP)
    • Combined strategies (TP + PP)
  • 🎮 GPU compatibility checks:
    • Recommendations for common GPU types (RTX 4090, A100, H100, etc.)
    • Minimum GPU memory requirement calculations
  • 📈 Polished table output using the Rich library:
    • 🎨 Color coding and styled borders
    • 📊 Progress bars and status displays
    • 🚀 Modern CLI experience
  • 🔧 Customizable parameters: LoRA rank, batch size, sequence length

Installation

pip3 install -r requirements.txt

Main dependencies: requests and rich (for beautiful tables and progress display)

Usage

Basic Usage

python3 vram_calculator.py microsoft/DialoGPT-medium

Specify Data Type

python3 vram_calculator.py meta-llama/Llama-2-7b-hf --dtype bf16

Custom Batch Size and Sequence Length

python3 vram_calculator.py mistralai/Mistral-7B-v0.1 --batch-size 4 --sequence-length 4096

Show Detailed Parallelization Strategies and GPU Recommendations

python3 vram_calculator.py --show-detailed microsoft/DialoGPT-medium

Custom LoRA Rank for Fine-tuning Memory Estimation

python3 vram_calculator.py --lora-rank 128 --show-detailed microsoft/DialoGPT-medium

View Available Data Types and GPU Models

python3 vram_calculator.py --list-types

Use Custom Configuration

# Use custom configuration directory
python3 vram_calculator.py --config-dir ./my_config microsoft/DialoGPT-medium

Command Line Arguments

  • model_name: Hugging Face model name (required)
  • --dtype: Specify data type (optional, default: show all types)
  • --batch-size: Batch size for activation memory estimation (default: 1)
  • --sequence-length: Sequence length for activation memory estimation (default: 2048)
  • --lora-rank: LoRA rank parameter for fine-tuning (default: 64)
  • --show-detailed: Show detailed parallelization strategies and GPU recommendations
  • --config-dir: Specify custom configuration directory
  • --list-types: List all available data types and GPU models
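
The argument list above maps naturally onto `argparse`. The sketch below is a hypothetical reconstruction of the CLI surface, not the actual code of vram_calculator.py; only the flag names and defaults come from the documentation.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser matching the documented flags and defaults.
    parser = argparse.ArgumentParser(
        description="Estimate GPU memory requirements for a Hugging Face model")
    parser.add_argument("model_name",
                        help="Hugging Face model name, e.g. microsoft/DialoGPT-medium")
    parser.add_argument("--dtype", default=None,
                        help="data type (default: show all types)")
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--sequence-length", type=int, default=2048)
    parser.add_argument("--lora-rank", type=int, default=64)
    parser.add_argument("--show-detailed", action="store_true")
    parser.add_argument("--config-dir", default=None)
    return parser

# Example: parse the flags from one of the usage examples above.
args = build_parser().parse_args(["meta-llama/Llama-2-7b-hf", "--dtype", "bf16"])
print(args.model_name, args.dtype, args.batch_size)  # meta-llama/Llama-2-7b-hf bf16 1
```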

Configuration System

The tool uses separate JSON configuration files to manage data types and GPU specifications, allowing flexible user customization:

Configuration File Structure

  • data_types.json - Define data types and bytes per parameter
  • gpu_types.json - Define GPU models and memory specifications
  • display_settings.json - Control display styles and behavior
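
A loader for this layout might look like the sketch below. The three file names come from the documentation; the fallback-to-empty behavior is my assumption, not necessarily what the tool does.

```python
import json
from pathlib import Path

def load_config(config_dir: str = ".") -> dict:
    """Load the three documented JSON config files from config_dir."""
    config = {}
    for name in ("data_types", "gpu_types", "display_settings"):
        path = Path(config_dir) / f"{name}.json"
        if path.exists():
            config[name] = json.loads(path.read_text())
        else:
            config[name] = {}  # assumed fallback when a file is absent
    return config
```

Pointing `--config-dir` at a directory containing only some of these files would then override just those pieces.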

Adding Custom Data Types

Edit the data_types.json file:

{
  "your_custom_format": {
    "bytes_per_param": 0.75,
    "description": "Your custom 6-bit format"
  }
}

Adding Custom GPU Models

Edit the gpu_types.json file:

{
  "name": "RTX 5090",
  "memory_gb": 32,
  "category": "consumer",
  "architecture": "Blackwell"
}

For detailed configuration instructions, please refer to: CONFIG_GUIDE.md

Supported Data Types

Data Type   Bytes per Parameter   Description
fp32        4                     32-bit floating point
fp16        2                     16-bit floating point
bf16        2                     Brain Float 16
fp8         1                     8-bit floating point
int8        1                     8-bit integer
int4        0.5                   4-bit integer
mxfp4       0.5                   Microscaling (MX) FP4
nvfp4       0.5                   NVIDIA FP4
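
Raw weight memory follows directly from this table: parameters × bytes per parameter. The tiny sketch below (helper name is mine) reproduces the DialoGPT-medium "Total Size" column from the example output; note that GB here means GiB (2^30 bytes).

```python
# Bytes per parameter, straight from the table above.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1,
                   "int8": 1, "int4": 0.5, "mxfp4": 0.5, "nvfp4": 0.5}

def weight_memory_gb(num_params: int, dtype: str) -> float:
    # GiB: parameters times bytes per parameter, divided by 2**30.
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

# DialoGPT-medium has 404,966,400 parameters (see example output below).
print(round(weight_memory_gb(404_966_400, "fp32"), 2))  # 1.51
print(round(weight_memory_gb(404_966_400, "bf16"), 2))  # 0.75
```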

Parallelization Strategies

Tensor Parallelism (TP)

Splits model weights by tensor dimensions across multiple GPUs.

Pipeline Parallelism (PP)

Distributes different model layers to different GPUs.

Expert Parallelism (EP)

For MoE (Mixture of Experts) models, distributes expert networks to different GPUs.

Data Parallelism (DP)

Each GPU holds a complete model copy, only splitting data.
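
As a hedged sketch (helper names are mine, not the tool's API): TP and PP both shard the weights, so the per-GPU figure in the detailed output is the inference footprint divided by the TP × PP grid, while DP replicates the model and does not reduce it. Real deployments also need activation and communication buffers on top of this.

```python
def memory_per_gpu_gb(weights_gb: float, tp: int = 1, pp: int = 1) -> float:
    inference_gb = weights_gb * 1.2   # weights + KV-cache overhead (see formulas below)
    return inference_gb / (tp * pp)   # TP and PP both shard the weights

# DialoGPT-medium in BF16, matching the example output's BF16 inference column.
bf16_weights = 404_966_400 * 2 / 2**30
print(round(memory_per_gpu_gb(bf16_weights), 2))            # 0.91 (single GPU)
print(round(memory_per_gpu_gb(bf16_weights, tp=2, pp=2), 2))  # 0.23 (TP + PP)
```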

Example Output

Basic Output (Default Mode)

================================================================================
Model: microsoft/DialoGPT-medium
Architecture: gpt2
Parameters: 404,966,400
================================================================================

Memory Requirements by Data Type and Scenario:              
================================================================================
Data Type    Total Size   Inference   Training      LoRA
             (GB)         (GB)        (Adam) (GB)   (GB)
──────────────────────────────────────────────────────────────────────────────
FP32         1.51        1.81        7.84        1.84       
FP16         0.75        0.91        3.92        0.94       
BF16         0.75        0.91        3.92        0.94       
INT8         0.38        0.45        1.96        0.48       
INT4         0.19        0.23        0.98        0.26       

Detailed Output (--show-detailed mode)

================================================================================
Model: microsoft/DialoGPT-medium
Architecture: gpt2
Parameters: 404,966,400
================================================================================

Memory Requirements by Data Type and Scenario:              
================================================================================
Data Type    Total Size   Inference   Training      LoRA
             (GB)         (GB)        (Adam) (GB)   (GB)
──────────────────────────────────────────────────────────────────────────────
FP32         1.51        1.81        7.84        1.84       
FP16         0.75        0.91        3.92        0.94       
BF16         0.75        0.91        3.92        0.94       
INT8         0.38        0.45        1.96        0.48       
INT4         0.19        0.23        0.98        0.26       

Parallelization Strategies (BF16 Inference):                
================================================================================
Strategy             TP   PP   EP   DP   Memory/GPU (GB) Min GPUs  
──────────────────────────────────────────────────────────────────────────────
Single GPU           1    1    1    1    0.91           4GB+      
Tensor Parallel      2    1    1    1    0.45           4GB+      
Tensor Parallel      4    1    1    1    0.23           4GB+      
Tensor Parallel      8    1    1    1    0.11           4GB+      
Pipeline Parallel    1    2    1    1    0.45           4GB+      
Pipeline Parallel    1    4    1    1    0.23           4GB+      
Pipeline Parallel    1    8    1    1    0.11           4GB+      
TP + PP              2    2    1    1    0.23           4GB+      
TP + PP              2    4    1    1    0.11           4GB+      
TP + PP              4    2    1    1    0.11           4GB+      
TP + PP              4    4    1    1    0.06           4GB+      

Recommendations:                                            
================================================================================
GPU Type        Memory     Inference    Training     LoRA
──────────────────────────────────────────────────────────────────────────────
RTX 4090        24 GB      ✓            ✓            ✓
A100 40GB       40 GB      ✓            ✓            ✓
A100 80GB       80 GB      ✓            ✓            ✓
H100            80 GB      ✓            ✓            ✓

Minimum GPU Requirements:                                   
──────────────────────────────────────────────────────────────────────────────
Single GPU Inference: 0.9GB
Single GPU Training: 3.9GB
Single GPU LoRA: 0.9GB

Calculation Formulas

Inference Memory

Inference Memory = Model Weights × 1.2

Includes model weights and KV cache overhead.

Training Memory (with Adam)

Training Memory = Model Weights × 4 × 1.3

  • 4× factor: model weights (1×) + gradients (1×) + Adam optimizer states (2×)
  • 1.3× factor: 30% additional overhead (activation caching, etc.)

LoRA Fine-tuning Memory

LoRA Memory = (Model Weights + LoRA Parameter Overhead) × 1.2

The LoRA parameter overhead is calculated from the LoRA rank and the ratio of target modules.
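
The three formulas combine into the sketch below (function names are mine). The LoRA overhead term is left as an input because, as noted above, its exact computation depends on the rank and target-module ratio.

```python
def inference_gb(weights_gb: float) -> float:
    return weights_gb * 1.2               # weights + KV-cache overhead

def training_gb(weights_gb: float) -> float:
    return weights_gb * 4 * 1.3           # weights + grads + 2x Adam states, +30% overhead

def lora_gb(weights_gb: float, lora_overhead_gb: float) -> float:
    return (weights_gb + lora_overhead_gb) * 1.2

# DialoGPT-medium in FP32 (1.51 GB of weights), matching the example output.
fp32_weights = 404_966_400 * 4 / 2**30
print(round(inference_gb(fp32_weights), 2))  # 1.81
print(round(training_gb(fp32_weights), 2))   # 7.84
```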

Notes

  1. Activation Memory: the current estimate is simplified; actual usage may be significantly lower thanks to optimizations such as gradient checkpointing
  2. Parallelization Efficiency: figures assume ideal conditions; communication overhead causes some variation in practice
  3. LoRA Estimation: based on a typical configuration (25% of modules targeted); actual usage varies with the specific implementation
  4. Mixed Data Types: with mixed precision, actual memory may fall between the values listed for the individual data types
  5. Model Architecture Differences: some architectures (such as MoE) have special memory distribution patterns

Supported Model Architectures

The tool currently targets Transformer-architecture models, including but not limited to:

  • GPT series
  • LLaMA series
  • Mistral series
  • BERT series
  • T5 series

Contributing

Issues and pull requests to improve this tool are welcome!

Project details


Download files

Download the file for your platform.

Source Distribution

model_vram_calc-1.0.0.tar.gz (16.3 kB)

Uploaded Source

Built Distribution

model_vram_calc-1.0.0-py3-none-any.whl (16.5 kB)

Uploaded Python 3

File details

Details for the file model_vram_calc-1.0.0.tar.gz.

File metadata

  • Download URL: model_vram_calc-1.0.0.tar.gz
  • Upload date:
  • Size: 16.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for model_vram_calc-1.0.0.tar.gz
Algorithm Hash digest
SHA256 bc04eba604d62f27b5c9c05070187cd079cf943a4f2d0149a960fab7739941af
MD5 9ab40291baed1385de89eea3b4c2753c
BLAKE2b-256 a65954f5eb2ba5e71ec6d13fcdfa0da2b67abb88630712c5d4456464615a78bf
File details

Details for the file model_vram_calc-1.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for model_vram_calc-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4b7dae838e7d4d11a1f1d89dac55e7431be3e5582ca684f42985312ae406136c
MD5 32dcb70c213c9ed149228ccf65b3b556
BLAKE2b-256 6cdc18db49c42849017fc3efd8ad5cd01be10a71c087f2c579a99698d600bd23