
RTX 50-series GPU compatibility layer for PyTorch and CUDA - enables sm_120 support

Reason this release was yanked: broken

Project description

rtx50-compat: Enable RTX 50-series GPUs in PyTorch 🚀


Enable NVIDIA RTX 50-series GPU support (sm_120) in PyTorch and the entire Python AI ecosystem with a single import.

🎯 Why This Exists

The RTX 5090 features the new sm_120 compute capability, which isn't recognized by current PyTorch/CUDA libraries. This package provides a runtime patch that makes your RTX 5090 work seamlessly with existing AI frameworks.

🚀 Quick Start

Installation

# Recommended: use uv
uv pip install rtx50-compat

# Or with pip
pip install rtx50-compat

Basic Usage

import rtx50_compat  # Must be imported before PyTorch!
import torch

# Verify GPU is recognized
print(torch.cuda.get_device_name(0))  # NVIDIA GeForce RTX 5090
print(torch.cuda.is_available())      # True

# Now use PyTorch normally
model = torch.nn.Linear(1024, 1024).cuda()

📊 Realistic Benchmarks

Based on RTX 5090's 32GB GDDR7 VRAM and 70 TFLOPS compute:

Models that fit entirely in VRAM (fastest)

| Model | RTX 5090 | i9-14900K | Speedup |
|-------|----------|-----------|---------|
| Llama 3-8B | 180-250 tokens/s | 8-12 tokens/s | ~20x |
| Llama 3-13B | 120-180 tokens/s | 4-6 tokens/s | ~30x |
| Stable Diffusion XL | 40-60 img/min | 0.5 img/min | ~100x |

Large models with partial offloading

| Model | RTX 5090 (with offload) | i9-14900K | Speedup |
|-------|-------------------------|-----------|---------|
| Llama 3-70B Q4 | 25-35 tokens/s | 1-3 tokens/s | ~15x |
| Mixtral 8x7B | 40-60 tokens/s | 2-4 tokens/s | ~20x |

Note: 70B models require ~35GB for Q4 quantization, exceeding the RTX 5090's 32GB VRAM. Performance depends on offloading efficiency.
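The VRAM arithmetic behind that note is easy to check yourself. A back-of-envelope sketch (the `vram_needed_gb` helper and the 4-bits-per-parameter figure for Q4 are illustrative assumptions, not part of this package):

```python
def vram_needed_gb(params_billions: float, bits_per_param: float) -> float:
    """Rough VRAM estimate for model weights alone (ignores KV cache and activations)."""
    # 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB simplifies to:
    return params_billions * bits_per_param / 8

# Llama 3-70B at ~4 bits/param: ~35 GB, exceeding the RTX 5090's 32 GB
print(vram_needed_gb(70, 4))   # 35.0 -> needs partial CPU offload
# Llama 3-8B at fp16 (16 bits/param): 16 GB, fits comfortably
print(vram_needed_gb(8, 16))   # 16.0
```

Real usage is higher: the KV cache grows with context length and batch size, so leave headroom below the 32GB ceiling.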

📖 Examples

Hello World - Verify Installation

import rtx50_compat
import torch

# Check if patch was applied
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

    # Quick performance test
    x = torch.randn(10000, 10000, device='cuda')
    y = torch.matmul(x, x)
    print("✅ CUDA operations working!")
else:
    print("❌ CUDA not available")

Running Llama 3-8B (Fits in VRAM)

import rtx50_compat
import torch  # needed below for torch.float16
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model - fits entirely in 32GB VRAM
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda"
)

# Generate at 180-250 tokens/s!
inputs = tokenizer("The future of AI is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0]))

Running Llama 3-70B with llama.cpp (Recommended for large models)

# First convert to GGUF format for efficient memory usage
# pip install llama-cpp-python

import rtx50_compat
from llama_cpp import Llama

# Load 70B model with automatic GPU/CPU splitting
llm = Llama(
    model_path="llama-3-70b-q4_k_m.gguf",
    n_gpu_layers=-1,  # -1 offloads all layers; lower this if you run out of VRAM
    n_ctx=4096,
    verbose=False
)

# Generate at 25-35 tokens/s with partial offloading
response = llm("The meaning of life is", max_tokens=100)
print(response['choices'][0]['text'])

Stable Diffusion XL

import rtx50_compat
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16"
).to("cuda")

# Generate at 40-60 images per minute!
images = pipe(
    "A majestic mountain landscape at sunset, highly detailed, 8k",
    num_images_per_prompt=4,
    guidance_scale=7.5
).images

🔧 Technical Details

What the patch does:

  1. Capability Masquerading: Makes sm_120 report as sm_90 (H100) for compatibility
  2. CUDA Compilation: Adds sm_120 flags when compiling CUDA extensions
  3. Memory Management: Optimizes for consumer GPU memory patterns
  4. Library Fixes: Patches flash-attention, xformers, and other CUDA libraries

How it works:

# The patch intercepts CUDA capability queries
import torch

original_get_device_capability = torch.cuda.get_device_capability

def patched_get_device_capability(device=None):
    major, minor = original_get_device_capability(device)
    if major == 12 and minor == 0:  # sm_120 (RTX 50-series)
        return (9, 0)  # Masquerade as sm_90 (H100)
    return (major, minor)

# Install the patch so all later capability queries see sm_90
torch.cuda.get_device_capability = patched_get_device_capability
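Step 2 above (CUDA compilation flags) can be sketched in the same spirit. A minimal illustration, assuming the patch sets the standard `TORCH_CUDA_ARCH_LIST` environment variable consulted by PyTorch's extension builder (the exact mechanism inside rtx50_compat may differ):

```python
import os

def add_sm120_arch_flag():
    """Append compute capability 12.0 so torch.utils.cpp_extension targets sm_120."""
    arch_list = os.environ.get("TORCH_CUDA_ARCH_LIST", "")
    if "12.0" not in arch_list:
        # lstrip handles the case where the variable was previously unset
        os.environ["TORCH_CUDA_ARCH_LIST"] = (arch_list + ";12.0").lstrip(";")

add_sm120_arch_flag()
print(os.environ["TORCH_CUDA_ARCH_LIST"])
```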

🦇 Batman Mode

For subtle operations:

export RTX50_BATMAN_MODE=1
python your_script.py

Output:

🦇 I am Batman - at your local jujitsu establishment
RTX 5090 successfully disguised as H100
You didn't see anything... 🌙

📂 Repository Structure

rtx50-compat/
├── rtx50_compat.py      # Main compatibility layer
├── patches/             # PyTorch & vLLM patches (for reference)
│   ├── pytorch_rtx5090.patch
│   ├── vllm_rtx5090.patch
│   └── README.md        # Patch application guide
├── benchmarks/          # Benchmark scripts
│   ├── benchmark_8b.py  # Llama 3-8B benchmark
│   ├── benchmark_70b.py # Llama 3-70B with offloading
│   └── benchmark_sd.py  # Stable Diffusion benchmark
├── examples/            # Usage examples
│   ├── hello_world.py
│   ├── comfyui_integration.py
│   └── llama_cpp_example.py
├── tests/               # Unit tests
│   └── test_compatibility.py
├── LICENSE
├── README.md
└── setup.py

๐Ÿ› Troubleshooting

"No kernel image available" Error

Ensure rtx50_compat is imported before any other CUDA/PyTorch imports:

import rtx50_compat  # MUST be first
import torch  # Now this works
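If you would rather fail fast than debug a silently unpatched session, a small guard can check `sys.modules` at startup (`assert_imported_before_torch` is a hypothetical helper for your own scripts, not part of rtx50_compat):

```python
import sys

def assert_imported_before_torch():
    """Raise if torch was already imported, since the patch cannot take effect then."""
    if "torch" in sys.modules:
        raise RuntimeError(
            "torch is already imported; import rtx50_compat before torch"
        )

# Call this at the very top of your script, before `import torch`:
# assert_imported_before_torch()
```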

Memory Errors with Large Models

For models exceeding 32GB VRAM, use quantization and offloading:

import rtx50_compat
from transformers import AutoModelForCausalLM

# Use 4-bit quantization (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="auto"  # Automatic CPU/GPU splitting
)

Performance Lower Than Expected

  1. Check if model fits entirely in VRAM: nvidia-smi
  2. Use appropriate batch sizes (larger = more efficient)
  3. Enable flash attention if available
  4. Consider using specialized inference engines (vLLM, TGI, llama.cpp)
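When diagnosing throughput, measure tokens/s directly rather than eyeballing it. A framework-agnostic sketch, where `generate_fn` is a stand-in for whatever generation call you use (e.g. a lambda wrapping `model.generate`):

```python
import time

def tokens_per_second(generate_fn, n_tokens: int, warmup: int = 1) -> float:
    """Time a callable that produces n_tokens and return throughput."""
    for _ in range(warmup):       # warm-up runs exclude one-time setup costs
        generate_fn()
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Dummy workload standing in for a real generation call
rate = tokens_per_second(lambda: time.sleep(0.01), n_tokens=100)
print(f"{rate:.0f} tokens/s")
```

For CUDA workloads, call `torch.cuda.synchronize()` before reading the clock so queued kernels are not mistaken for finished work.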

🤖 AI Assistant Integration

Using with Claude CLI

# Install the package and verify it works
claude "I have an RTX 5090. Help me set up rtx50-compat and run a Llama 3-8B model for maximum performance"

# Optimize for 70B models with offloading
claude "Show me how to run Llama 3-70B on my RTX 5090 using llama.cpp with optimal settings"

# Debug performance issues
claude "My RTX 5090 is only getting 10 tokens/s on Llama 3-13B. Help me diagnose and fix this"

# Integration with existing projects
claude "Add rtx50-compat support to my ComfyUI installation at ~/ComfyUI"

Using with Gemini CLI

# Setup and verification
gemini -p "I have an RTX 5090 with 32GB VRAM. Guide me through installing rtx50-compat and running a benchmark"

# Model recommendations
gemini -p "What's the largest LLM I can run entirely in VRAM on my RTX 5090? Include quantization options"

# Performance optimization
gemini -p "Analyze my RTX 5090 setup and suggest optimizations for running Mixtral 8x7B at maximum speed"

# Troubleshooting
gemini -p "Getting 'no kernel image' error with RTX 5090 in PyTorch. Show me how to fix with rtx50-compat"

Prompt Templates for Complex Tasks

Full Stack Setup

Help me set up a complete local AI workstation with RTX 5090:
1. Install rtx50-compat
2. Configure vLLM for serving
3. Set up Stable Diffusion XL
4. Create benchmarks for both text and image generation

Production Deployment

I need to deploy a Llama 3-70B model on RTX 5090 for production use:
- Optimize for throughput (multiple users)
- Set up proper memory management
- Configure monitoring and logging
- Handle model switching between 8B/13B/70B based on load

๐Ÿค Contributing

PRs welcome! Areas needing help:

  • RTX 5080/5070 Ti testing
  • Additional framework patches (JAX, MXNet)
  • Performance optimizations
  • Documentation improvements

Upstream Integration

We're working on getting these patches merged upstream:

  • PyTorch: [PR #pending]
  • vLLM: [PR #pending]

📄 License

MIT License - see LICENSE

๐Ÿ™ Acknowledgments

  • NVIDIA for the incredible RTX 5090 hardware
  • PyTorch team for the amazing framework
  • The local LLM community for inspiration and testing

Note: This is a community compatibility layer. Once PyTorch officially supports sm_120, this package will become obsolete. Until then, enjoy running large models locally at impressive speeds! 🚀



Download files

Download the file for your platform.

Source Distribution

rtx50_compat-3.0.2.tar.gz (7.9 kB)

Uploaded Source

Built Distribution


rtx50_compat-3.0.2-py3-none-any.whl (8.2 kB)

Uploaded Python 3

File details

Details for the file rtx50_compat-3.0.2.tar.gz.

File metadata

  • Download URL: rtx50_compat-3.0.2.tar.gz
  • Upload date:
  • Size: 7.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for rtx50_compat-3.0.2.tar.gz
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 52b0e80f40d9c8acc1d503f07a534940b1c1900ff500228efd6a739e9782834c |
| MD5 | 38196d852a3a77cd896c5ec561882ac6 |
| BLAKE2b-256 | 5f54de5345fef1ceae4ad3e2094f9ea46093e445db7e2259a533ef33402bd2b2 |


File details

Details for the file rtx50_compat-3.0.2-py3-none-any.whl.

File metadata

  • Download URL: rtx50_compat-3.0.2-py3-none-any.whl
  • Upload date:
  • Size: 8.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for rtx50_compat-3.0.2-py3-none-any.whl
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | f190915ed4b47074b63f49f7ef5190051c5056b6425248e52969285df5b9dedb |
| MD5 | d631f72d76554d3a39de54014014fd4a |
| BLAKE2b-256 | ea322d1cc563049a6f2e0963996e4b7722d3021c7284ccb3e21246efafd89103 |

