
Model VRAM Calculator

A Python CLI tool for estimating GPU memory requirements for Hugging Face models with different data types and parallelization strategies.

Features

  • Automatically fetch model configurations from Hugging Face
  • Support for multiple data types: fp32, fp16/bf16, fp8, int8, int4, mxfp4, nvfp4
  • Memory estimation for different scenarios:
    • Inference: model weights + KV cache overhead
    • Training: gradients and optimizer states (Adam) included
    • LoRA fine-tuning: low-rank adaptation memory requirements
  • Memory distribution across parallelization strategies:
    • Tensor Parallelism (TP): 1, 2, 4, 8
    • Pipeline Parallelism (PP): 1, 2, 4, 8
    • Expert Parallelism (EP)
    • Data Parallelism (DP)
    • Combined strategies (TP + PP)
  • GPU compatibility checks:
    • Recommendations for common GPU types (RTX 4090, A100, H100, etc.)
    • Minimum GPU memory requirement calculations
  • Table output via the Rich library:
    • Color coding and clean borders
    • Progress bars and status displays
    • A modern CLI experience
  • ๐Ÿ”ง Customizable parameters: LoRA rank, batch size, sequence length

Installation

pip3 install -r requirements.txt

Main dependencies: requests and rich (for table rendering and progress display)

Usage

Basic Usage

python3 vram_calculator.py microsoft/DialoGPT-medium

Specify Data Type

python3 vram_calculator.py meta-llama/Llama-2-7b-hf --dtype bf16

Custom Batch Size and Sequence Length

python3 vram_calculator.py mistralai/Mistral-7B-v0.1 --batch-size 4 --sequence-length 4096

Show Detailed Parallelization Strategies and GPU Recommendations

python3 vram_calculator.py --show-detailed microsoft/DialoGPT-medium

Custom LoRA Rank for Fine-tuning Memory Estimation

python3 vram_calculator.py --lora-rank 128 --show-detailed microsoft/DialoGPT-medium

View Available Data Types and GPU Models

python3 vram_calculator.py --list-types

Use Custom Configuration

# Use custom configuration directory
python3 vram_calculator.py --config-dir ./my_config microsoft/DialoGPT-medium

Command Line Arguments

  • model_name: Hugging Face model name (required)
  • --dtype: Specify data type (optional, default: show all types)
  • --batch-size: Batch size for activation memory estimation (default: 1)
  • --sequence-length: Sequence length for activation memory estimation (default: 2048)
  • --lora-rank: LoRA rank parameter for fine-tuning (default: 64)
  • --show-detailed: Show detailed parallelization strategies and GPU recommendations
  • --config-dir: Specify custom configuration directory
  • --list-types: List all available data types and GPU models

Configuration System

The tool uses separate JSON configuration files to manage data types and GPU specifications, allowing flexible user customization:

Configuration File Structure

  • data_types.json - Define data types and bytes per parameter
  • gpu_types.json - Define GPU models and memory specifications
  • display_settings.json - Control display styles and behavior
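
As a rough illustration, the sketch below shows one way such a configuration directory could be loaded. The file names match the structure above, but the loader itself (and its silent fallback for missing files) is an assumption, not the tool's actual implementation:

import json
from pathlib import Path

def load_config(config_dir: str = ".") -> dict:
    """Read the three JSON configuration files from a directory.

    Illustrative sketch only: the real schema used by hf_vram_calc
    may differ from this simple name -> parsed-JSON mapping.
    """
    config = {}
    for name in ("data_types", "gpu_types", "display_settings"):
        path = Path(config_dir) / f"{name}.json"
        if path.exists():  # quietly skip files the user did not override
            config[name] = json.loads(path.read_text())
    return config

# Mirrors --config-dir ./my_config from the usage examples above
cfg = load_config("./my_config")
print(sorted(cfg))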

Adding Custom Data Types

Edit the data_types.json file. bytes_per_param is the bit width divided by 8, so a 6-bit format uses 0.75 bytes per parameter:

{
  "your_custom_format": {
    "bytes_per_param": 0.75,
    "description": "Your custom 6-bit format"
  }
}

Adding Custom GPU Models

Edit the gpu_types.json file:

{
  "name": "RTX 5090",
  "memory_gb": 32,
  "category": "consumer",
  "architecture": "Blackwell"
}

For detailed configuration instructions, see CONFIG_GUIDE.md.

Supported Data Types

Data Type   Bytes per Parameter   Description
──────────────────────────────────────────────────────────
fp32        4                     32-bit floating point
fp16        2                     16-bit floating point
bf16        2                     Brain Float 16 (bfloat16)
fp8         1                     8-bit floating point
int8        1                     8-bit integer
int4        0.5                   4-bit integer
mxfp4       0.5                   Microscaling (MX) FP4
nvfp4       0.5                   NVIDIA FP4
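
Given this table, the raw weight footprint is simply parameter count × bytes per parameter. A minimal sketch (the function name is illustrative) that reproduces the "Total Size" column of the example output below:

BYTES_PER_PARAM = {
    "fp32": 4, "fp16": 2, "bf16": 2,
    "fp8": 1, "int8": 1,
    "int4": 0.5, "mxfp4": 0.5, "nvfp4": 0.5,
}

def weight_size_gb(num_params: int, dtype: str) -> float:
    # Raw model weight size in GiB for the chosen data type
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

# DialoGPT-medium from the example output: 404,966,400 parameters
print(f"{weight_size_gb(404_966_400, 'fp32'):.2f}")  # ~1.51
print(f"{weight_size_gb(404_966_400, 'bf16'):.2f}")  # ~0.75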

Parallelization Strategies

Tensor Parallelism (TP)

Splits model weights by tensor dimensions across multiple GPUs.

Pipeline Parallelism (PP)

Distributes different model layers to different GPUs.

Expert Parallelism (EP)

For MoE (Mixture of Experts) models, distributes expert networks to different GPUs.

Data Parallelism (DP)

Each GPU holds a complete copy of the model; only the input data is split across GPUs.
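
Assuming the ideal, evenly divided splits used in the tables below, per-GPU weight memory is just the total divided by the product of the parallel degrees. A minimal sketch of that arithmetic (not the tool's actual code; DP is omitted because it replicates rather than partitions the model):

def memory_per_gpu(total_gb: float, tp: int = 1, pp: int = 1, ep: int = 1) -> float:
    # Ideal per-GPU memory: an even split with no communication buffers
    return total_gb / (tp * pp * ep)

# BF16 inference footprint of DialoGPT-medium (~0.91 GB) under TP=2, PP=2
print(f"{memory_per_gpu(0.91, tp=2, pp=2):.2f} GB per GPU")  # ~0.23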

Example Output

Basic Output (Default Mode)

================================================================================
Model: microsoft/DialoGPT-medium
Architecture: gpt2
Parameters: 404,966,400
================================================================================

Memory Requirements by Data Type and Scenario:              
================================================================================
Data Type    Total Size   Inference    Training      LoRA
             (GB)         (GB)         (Adam) (GB)   (GB)
────────────────────────────────────────────────────────────────────────────────
FP32         1.51         1.81         7.84          1.84
FP16         0.75         0.91         3.92          0.94
BF16         0.75         0.91         3.92          0.94
INT8         0.38         0.45         1.96          0.48
INT4         0.19         0.23         0.98          0.26

Detailed Output (--show-detailed mode)

================================================================================
Model: microsoft/DialoGPT-medium
Architecture: gpt2
Parameters: 404,966,400
================================================================================

Memory Requirements by Data Type and Scenario:              
================================================================================
Data Type    Total Size   Inference    Training      LoRA
             (GB)         (GB)         (Adam) (GB)   (GB)
────────────────────────────────────────────────────────────────────────────────
FP32         1.51         1.81         7.84          1.84
FP16         0.75         0.91         3.92          0.94
BF16         0.75         0.91         3.92          0.94
INT8         0.38         0.45         1.96          0.48
INT4         0.19         0.23         0.98          0.26

Parallelization Strategies (BF16 Inference):                
================================================================================
Strategy             TP   PP   EP   DP   Memory/GPU (GB) Min GPU Mem
────────────────────────────────────────────────────────────────────────────────
Single GPU           1    1    1    1    0.91           4GB+      
Tensor Parallel      2    1    1    1    0.45           4GB+      
Tensor Parallel      4    1    1    1    0.23           4GB+      
Tensor Parallel      8    1    1    1    0.11           4GB+      
Pipeline Parallel    1    2    1    1    0.45           4GB+      
Pipeline Parallel    1    4    1    1    0.23           4GB+      
Pipeline Parallel    1    8    1    1    0.11           4GB+      
TP + PP              2    2    1    1    0.23           4GB+      
TP + PP              2    4    1    1    0.11           4GB+      
TP + PP              4    2    1    1    0.11           4GB+      
TP + PP              4    4    1    1    0.06           4GB+      

Recommendations:                                            
================================================================================
GPU Type        Memory     Inference    Training     LoRA        
────────────────────────────────────────────────────────────────────────────────
RTX 4090        24 GB      ✓            ✓            ✓
A100 40GB       40 GB      ✓            ✓            ✓
A100 80GB       80 GB      ✓            ✓            ✓
H100            80 GB      ✓            ✓            ✓

Minimum GPU Requirements:                                   
────────────────────────────────────────────────────────────────────────────────
Single GPU Inference: 0.9GB
Single GPU Training: 3.9GB
Single GPU LoRA: 0.9GB

Calculation Formulas

Inference Memory

Inference Memory = Model Weights × 1.2

Includes model weights and KV cache overhead.

Training Memory (with Adam)

Training Memory = Model Weights × 4 × 1.3

  • 4x factor: model weights (1x) + gradients (1x) + Adam optimizer states (2x)
  • 1.3x factor: 30% additional overhead (activation caching, etc.)

LoRA Fine-tuning Memory

LoRA Memory = (Model Weights + LoRA Parameter Overhead) × 1.2

The LoRA parameter overhead is computed from the LoRA rank and the fraction of modules targeted.
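
Taken together, the three formulas fit in a few lines of Python. The sketch below uses the multipliers above; the 0.02 GB of LoRA adapter weights in the example is an illustrative placeholder, since the real overhead depends on the rank and target-module ratio:

def inference_memory_gb(weights_gb: float) -> float:
    # Model weights plus ~20% KV cache overhead
    return weights_gb * 1.2

def training_memory_gb(weights_gb: float) -> float:
    # Weights + gradients + Adam states (4x), plus 30% overhead
    return weights_gb * 4 * 1.3

def lora_memory_gb(weights_gb: float, lora_overhead_gb: float) -> float:
    # Frozen base weights plus LoRA parameters, same 20% overhead
    return (weights_gb + lora_overhead_gb) * 1.2

# FP32 DialoGPT-medium (~1.51 GB of weights), matching the example output
w = 1.51
print(f"{inference_memory_gb(w):.2f}")    # ~1.81
print(f"{training_memory_gb(w):.2f}")     # ~7.85
print(f"{lora_memory_gb(w, 0.02):.2f}")   # ~1.84 (0.02 GB adapters assumed)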

Notes

  1. Activation Memory: the current estimate is simplified; actual activation memory can be significantly lower in practice thanks to optimization strategies such as gradient checkpointing
  2. Parallelization Efficiency: ideal conditions are assumed; actual per-GPU memory may vary slightly due to communication overhead
  3. LoRA Estimation: based on a typical configuration (25% of modules targeted); actual usage depends on the specific implementation
  4. Mixed Data Types: some setups use mixed precision, so actual memory may fall between the values listed for two data types
  5. Model Architecture Differences: some architectures (such as MoE) have special memory distribution patterns

Supported Model Architectures

The tool currently targets Transformer-architecture models, including but not limited to:

  • GPT series
  • LLaMA series
  • Mistral series
  • BERT series
  • T5 series

Contributing

Issues and Pull Requests to improve this tool are welcome!
