rtx50-compat: Enable RTX 50-series GPUs in PyTorch
Enable NVIDIA RTX 50-series GPU support (sm_120) in PyTorch and the entire Python AI ecosystem with a single import.
🎯 Why This Exists
The RTX 5090 features the new sm_120 compute capability, which isn't recognized by current PyTorch/CUDA libraries. This package provides a runtime patch that makes your RTX 5090 work seamlessly with existing AI frameworks.
Quick Start
Installation
# Recommended: use uv
uv pip install rtx50-compat
# Or with pip
pip install rtx50-compat
Basic Usage
import rtx50_compat # Must be imported before PyTorch!
import torch
# Verify GPU is recognized
print(torch.cuda.get_device_name(0)) # NVIDIA GeForce RTX 5090
print(torch.cuda.is_available()) # True
# Now use PyTorch normally
model = torch.nn.Linear(1024, 1024).cuda()
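To confirm the shim is actually active, you can also inspect the compute capability PyTorch reports; with the patch applied, an RTX 5090 masquerades as an H100 (see Technical Details below), so the tuple should read (9, 0) rather than (12, 0):
import rtx50_compat
import torch

# With the patch active, sm_120 hardware reports as sm_90
print(torch.cuda.get_device_capability(0))  # expected: (9, 0)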
Realistic Benchmarks
Estimates based on the RTX 5090's 32GB of GDDR7 VRAM and roughly 70 TFLOPS of compute:
Models that fit entirely in VRAM (fastest)
| Model | RTX 5090 | i9-14900K | Speedup |
|---|---|---|---|
| Llama 3-8B | 180-250 tokens/s | 8-12 tokens/s | ~20x |
| Llama 3-13B | 120-180 tokens/s | 4-6 tokens/s | ~30x |
| Stable Diffusion XL | 40-60 img/min | 0.5 img/min | ~100x |
Large models with partial offloading
| Model | RTX 5090 (with offload) | i9-14900K | Speedup |
|---|---|---|---|
| Llama 3-70B Q4 | 25-35 tokens/s | 1-3 tokens/s | ~15x |
| Mixtral 8x7B | 40-60 tokens/s | 2-4 tokens/s | ~20x |
Note: a 70B model requires ~35GB even at Q4 quantization (70B parameters × ~4 bits ≈ 35GB), exceeding the RTX 5090's 32GB VRAM, so some layers must be offloaded to system RAM; performance then depends on offloading efficiency.
Examples
Hello World - Verify Installation
import rtx50_compat
import torch

# Check if patch was applied
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

    # Quick performance test
    x = torch.randn(10000, 10000, device='cuda')
    y = torch.matmul(x, x)
    torch.cuda.synchronize()  # wait for the kernel to finish before declaring success
    print("✅ CUDA operations working!")
else:
    print("❌ CUDA not available")
Running Llama 3-8B (Fits in VRAM)
import rtx50_compat
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model - fits entirely in 32GB VRAM
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda"
)

# Generate at 180-250 tokens/s!
inputs = tokenizer("The future of AI is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
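To sanity-check the quoted throughput on your own card, here is a minimal timing sketch, reusing the model, tokenizer, and inputs from the example above; the numbers you get will vary with prompt length and generation settings:
import time
import torch

# Warm-up run so CUDA kernels are compiled and cached before timing
model.generate(**inputs, max_new_tokens=16)
torch.cuda.synchronize()

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")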
Running Llama 3-70B with llama.cpp (Recommended for large models)
# First convert to GGUF format for efficient memory usage
# pip install llama-cpp-python
import rtx50_compat
from llama_cpp import Llama

# Load 70B model with automatic GPU/CPU splitting
llm = Llama(
    model_path="llama-3-70b-q4_k_m.gguf",
    n_gpu_layers=-1,  # -1 offloads every layer; lower this if the model exceeds VRAM
    n_ctx=4096,
    verbose=False
)

# Generate at 25-35 tokens/s with partial offloading
response = llm("The meaning of life is", max_tokens=100)
print(response['choices'][0]['text'])
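For interactive use, llama-cpp-python can also stream tokens as they are generated; a short sketch reusing the llm object above:
# Print tokens as they arrive instead of waiting for the full completion
for chunk in llm("The meaning of life is", max_tokens=100, stream=True):
    print(chunk['choices'][0]['text'], end='', flush=True)
print()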
Stable Diffusion XL
import rtx50_compat
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16"
).to("cuda")

# Generate at 40-60 images per minute!
images = pipe(
    "A majestic mountain landscape at sunset, highly detailed, 8k",
    num_images_per_prompt=4,
    guidance_scale=7.5
).images
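The pipeline returns a list of PIL images, so saving the batch is a one-liner per image (the filename pattern below is just an example):
# Write each generated image to disk
for i, image in enumerate(images):
    image.save(f"landscape_{i}.png")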
🔧 Technical Details
What the patch does:
- Capability Masquerading: Makes sm_120 report as sm_90 (H100) for compatibility
- CUDA Compilation: Adds sm_120 flags when compiling CUDA extensions
- Memory Management: Optimizes for consumer GPU memory patterns
- Library Fixes: Patches flash-attention, xformers, and other CUDA libraries
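The package's exact compile-flag mechanism isn't reproduced here, but a minimal sketch of the idea uses PyTorch's standard TORCH_CUDA_ARCH_LIST environment variable (the variable itself is a real PyTorch knob; treating it as what rtx50-compat sets is our assumption):
import os

# Hypothetical sketch: point torch.utils.cpp_extension at sm_120 (plus sm_90
# as a fallback) when it JIT-compiles CUDA extensions; "12.0" maps to sm_120.
os.environ.setdefault("TORCH_CUDA_ARCH_LIST", "9.0;12.0")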
How it works:
# The patch intercepts CUDA capability queries
import torch

original_get_device_capability = torch.cuda.get_device_capability

def patched_get_device_capability(device=None):
    major, minor = original_get_device_capability(device)
    if major == 12 and minor == 0:  # sm_120 (RTX 50-series)
        return (9, 0)  # Masquerade as sm_90 (H100)
    return (major, minor)
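Presumably the replacement is then installed at import time by monkey-patching the torch.cuda namespace; a hedged sketch of that final step (an assumption about the package's internals, not its verbatim source):
# Install the shim so every downstream capability query sees sm_90
torch.cuda.get_device_capability = patched_get_device_capability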
🦇 Batman Mode
For subtle operations:
export RTX50_BATMAN_MODE=1
python your_script.py
Output:
🦇 I am Batman - at your local jujitsu establishment
RTX 5090 successfully disguised as H100
You didn't see anything...
Repository Structure
rtx50-compat/
├── rtx50_compat.py              # Main compatibility layer
├── patches/                     # PyTorch & vLLM patches (for reference)
│   ├── pytorch_rtx5090.patch
│   ├── vllm_rtx5090.patch
│   └── README.md                # Patch application guide
├── benchmarks/                  # Benchmark scripts
│   ├── benchmark_8b.py          # Llama 3-8B benchmark
│   ├── benchmark_70b.py         # Llama 3-70B with offloading
│   └── benchmark_sd.py          # Stable Diffusion benchmark
├── examples/                    # Usage examples
│   ├── hello_world.py
│   ├── comfyui_integration.py
│   └── llama_cpp_example.py
├── tests/                       # Unit tests
│   └── test_compatibility.py
├── LICENSE
├── README.md
└── setup.py
Troubleshooting
"No kernel image available" Error
Ensure rtx50_compat is imported before any other CUDA/PyTorch imports:
import rtx50_compat # MUST be first
import torch # Now this works
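If you want a belt-and-braces check in your own entry point, one option (our suggestion, not part of rtx50-compat) is to fail fast when torch sneaks in first:
import sys

# Hypothetical guard: torch must not be imported before the compat shim
assert "torch" not in sys.modules, "import rtx50_compat before torch"
import rtx50_compat
import torch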
Memory Errors with Large Models
For models exceeding 32GB VRAM, use quantization and offloading:
from transformers import AutoModelForCausalLM

# Use 4-bit quantization (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    model_id,  # same model_id as in the examples above
    load_in_4bit=True,
    device_map="auto"  # Automatic CPU/GPU splitting
)
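On recent transformers releases the bare load_in_4bit flag is deprecated in favor of an explicit quantization config; an equivalent sketch:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 while weights stay 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)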
Performance Lower Than Expected
- Check whether the model fits entirely in VRAM with nvidia-smi
- Use appropriate batch sizes (larger = more efficient)
- Enable flash attention if available (see the sketch below)
- Consider using specialized inference engines (vLLM, TGI, llama.cpp)
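For the flash-attention point, transformers exposes this as a load-time option; a sketch (assumes the flash-attn package is installed and the model is loaded in fp16 or bf16):
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # pip install flash-attn
    device_map="cuda",
)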
🤖 AI Assistant Integration
Using with Claude CLI
# Install the package and verify it works
claude "I have an RTX 5090. Help me set up rtx50-compat and run a Llama 3-8B model for maximum performance"
# Optimize for 70B models with offloading
claude "Show me how to run Llama 3-70B on my RTX 5090 using llama.cpp with optimal settings"
# Debug performance issues
claude "My RTX 5090 is only getting 10 tokens/s on Llama 3-13B. Help me diagnose and fix this"
# Integration with existing projects
claude "Add rtx50-compat support to my ComfyUI installation at ~/ComfyUI"
Using with Gemini CLI
# Setup and verification
gemini -p "I have an RTX 5090 with 32GB VRAM. Guide me through installing rtx50-compat and running a benchmark"
# Model recommendations
gemini -p "What's the largest LLM I can run entirely in VRAM on my RTX 5090? Include quantization options"
# Performance optimization
gemini -p "Analyze my RTX 5090 setup and suggest optimizations for running Mixtral 8x7B at maximum speed"
# Troubleshooting
gemini -p "Getting 'no kernel image' error with RTX 5090 in PyTorch. Show me how to fix with rtx50-compat"
Prompt Templates for Complex Tasks
Full Stack Setup
Help me set up a complete local AI workstation with RTX 5090:
1. Install rtx50-compat
2. Configure vLLM for serving
3. Set up Stable Diffusion XL
4. Create benchmarks for both text and image generation
Production Deployment
I need to deploy a Llama 3-70B model on RTX 5090 for production use:
- Optimize for throughput (multiple users)
- Set up proper memory management
- Configure monitoring and logging
- Handle model switching between 8B/13B/70B based on load
🤝 Contributing
PRs welcome! Areas needing help:
- RTX 5080/5070 Ti testing
- Additional framework patches (JAX, MXNet)
- Performance optimizations
- Documentation improvements
Upstream Integration
We're working on getting these patches merged upstream:
- PyTorch: [PR #pending]
- vLLM: [PR #pending]
License
MIT License - see LICENSE
Acknowledgments
- NVIDIA for the incredible RTX 5090 hardware
- PyTorch team for the amazing framework
- The local LLM community for inspiration and testing
Note: This is a community compatibility layer. Once PyTorch officially supports sm_120, this package will become obsolete. Until then, enjoy running large models locally at impressive speeds!
Download files
File details: rtx50_compat-3.0.2.tar.gz (source distribution)
- Size: 7.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.13

| Algorithm | Hash digest |
|---|---|
| SHA256 | 52b0e80f40d9c8acc1d503f07a534940b1c1900ff500228efd6a739e9782834c |
| MD5 | 38196d852a3a77cd896c5ec561882ac6 |
| BLAKE2b-256 | 5f54de5345fef1ceae4ad3e2094f9ea46093e445db7e2259a533ef33402bd2b2 |
File details: rtx50_compat-3.0.2-py3-none-any.whl (built distribution)
- Size: 8.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.13

| Algorithm | Hash digest |
|---|---|
| SHA256 | f190915ed4b47074b63f49f7ef5190051c5056b6425248e52969285df5b9dedb |
| MD5 | d631f72d76554d3a39de54014014fd4a |
| BLAKE2b-256 | ea322d1cc563049a6f2e0963996e4b7722d3021c7284ccb3e21246efafd89103 |