Optimized RGB-D depth refinement using Vision Transformers with CUDA, MPS, and CPU support

Project description

title: rgbd-depth
emoji: 🎨
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "4.44.1"
python_version: "3.10"
app_file: app.py
pinned: false
license: apache-2.0

Camera Depth Models (CDM)

Optimized Python package for RGB-D depth refinement using Vision Transformer encoders. This implementation is aligned with the ByteDance CDM reference implementation with additional performance optimizations for CUDA, MPS (Apple Silicon), and CPU.

🎮 Try it Online

Try rgbd-depth directly in your browser with the interactive Gradio demo on Hugging Face Spaces; no installation is required.

Upload your RGB and depth images, adjust parameters (camera model, precision, resolution), and download the refined depth map. Models are downloaded automatically from the Hugging Face Hub on first use.

Overview

Camera Depth Models (CDMs) are sensor-specific depth models trained to produce clean, simulation-like depth maps from noisy real-world depth camera data. By bridging the visual gap between simulation and reality through depth perception, CDMs enable robotic policies trained purely in simulation to transfer directly to real robots.

Original work by ByteDance Research. This package provides an optimized implementation with:

✅ Pixel-perfect alignment with reference implementation (verified: 0 pixel difference)
⚡ Device-specific optimizations: xFormers (CUDA), SDPA fallback, torch.compile
🎯 Mixed precision support: FP16 (CUDA/MPS), BF16 (CUDA)
🔧 Better CLI: Device selection, optimization control, precision modes
📦 Easy installation: Single pip install command

Why This Package?

This is an optimized, production-ready version of ByteDance's Camera Depth Models with several improvements:

| Feature | ByteDance Original | This Package |
|---|---|---|
| Installation | Manual setup | `pip install rgbd-depth` |
| CUDA Optimization | Basic | xFormers (~8% faster) + torch.compile |
| Apple Silicon (MPS) | Not optimized | Native support with fallbacks |
| Mixed Precision | Manual | Automatic FP16/BF16 with `--precision` flag |
| CLI | Basic | Enhanced with device selection, optimization control |
| Documentation | Minimal | Comprehensive guides (README + OPTIMIZATION.md) |
| Testing | None | CI/CD with automated tests |
| PyPI Package | No | ✅ Yes (`rgbd-depth`) |

Choose this package if you want:

🚀 Faster inference on CUDA (xFormers) or Apple Silicon (MPS)
🎯 Easy mixed precision (FP16/BF16) without code changes
📦 Simple installation via PyPI
🔧 Production-ready CLI with device/precision control
✅ Maintained with CI/CD and tests

Key Features

Metric Depth Estimation: Produces accurate absolute depth measurements in meters
Multi-Camera Support: Optimized models for various depth sensors (RealSense D405/D435/L515, ZED 2i, Azure Kinect)
Performance Optimizations: ~8% faster on CUDA with xFormers, automatic backend selection
Mixed Precision: FP16/BF16 support for faster inference on compatible hardware
Sim-to-Real Ready: Generates simulation-quality depth from real camera data

Architecture

CDM uses a dual-branch Vision Transformer architecture:

RGB Branch: Extracts semantic information from RGB images
Depth Branch: Processes noisy depth sensor data
Cross-Attention Fusion: Combines RGB semantics with depth scale information
DPT Decoder: Produces final metric depth estimation

Supported ViT encoder sizes:

vits: Small (64 features, 384 output channels)
vitb: Base (128 features, 768 output channels)
vitl: Large (256 features, 1024 output channels)
vitg: Giant (384 features, 1536 output channels)

All pretrained models we provide are based on vitl.
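
The size list above can be captured as a small lookup table. This is only a sketch: `EncoderConfig`, `ENCODER_CONFIGS`, and `encoder_config` are hypothetical names used for illustration and are not part of the package API. The numbers are copied from the list above.

```python
from typing import NamedTuple


class EncoderConfig(NamedTuple):
    features: int         # "features" value from the list above
    output_channels: int  # "output channels" value from the list above


# Values copied from the encoder list in this README.
ENCODER_CONFIGS = {
    "vits": EncoderConfig(features=64, output_channels=384),
    "vitb": EncoderConfig(features=128, output_channels=768),
    "vitl": EncoderConfig(features=256, output_channels=1024),
    "vitg": EncoderConfig(features=384, output_channels=1536),
}


def encoder_config(name: str) -> EncoderConfig:
    """Return the documented sizes for an encoder, or raise on an unknown name."""
    try:
        return ENCODER_CONFIGS[name]
    except KeyError:
        valid = ", ".join(sorted(ENCODER_CONFIGS))
        raise ValueError(f"unknown encoder {name!r}; expected one of: {valid}") from None


print(encoder_config("vitl").features)
```

The `vitl` entry matches the `features=256` argument used in the Python API example further down.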

🌐 Hugging Face Spaces Demo

The easiest way to try rgbd-depth is via Hugging Face Spaces—completely free, no installation needed:

1. Open the interactive demo
2. Upload an RGB image and a depth map (PNG or JPG)
3. Configure camera model, precision, and visualization options
4. Click "Refine Depth" and download the result

What happens:

Models are auto-downloaded from Hugging Face Hub on first use
Runs on free CPU hardware (inference: ~10-30s)
GPU hardware available for faster processing (~2-5s)
All computation runs server-side on the Space, so uploaded images are processed remotely rather than on your machine

Limitations (HF Spaces CPU):

No xFormers optimization (CUDA-only)
Inference slower than local GPU
Best suited to testing and prototyping

For production workflows or faster inference, use the local installation below.

📌 Note: This README is optimized for GitHub, PyPI, and Hugging Face Spaces. The YAML metadata (top of file) is auto-detected by HF Spaces and not displayed.

Installation

From PyPI (recommended)

Basic installation (core dependencies only):

pip install rgbd-depth

Installation with extras:

# With CUDA optimizations (xFormers, ~8% faster)
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "rgbd-depth[xformers]"

# With Gradio demo interface
pip install "rgbd-depth[demo]"

# With HuggingFace Hub model downloads
pip install "rgbd-depth[download]"

# With development tools (pytest, black, ruff, etc.)
pip install "rgbd-depth[dev]"

# Install everything (all extras)
pip install "rgbd-depth[all]"

Development installation (editable):

git clone https://github.com/Aedelon/rgbd-depth.git
cd rgbd-depth
pip install -e ".[dev]"  # or uv sync --extra dev

Requirements:

Python 3.10+ (support for Python 3.8 and 3.9 was dropped as of v1.0.2)
PyTorch 2.0+ with appropriate CUDA/MPS support
OpenCV, NumPy, Pillow
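
A quick way to check an environment against these requirements is sketched below. It uses only the standard library. The helper names (`python_ok`, `missing_modules`) are hypothetical, and the script checks only that the modules can be found; it does not verify the PyTorch version or CUDA/MPS support.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)
# Import names for the listed requirements: PyTorch, OpenCV, NumPy, Pillow.
REQUIRED_MODULES = ("torch", "cv2", "numpy", "PIL")


def python_ok(version=None) -> bool:
    """True when the interpreter meets the documented Python 3.10+ requirement."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON


def missing_modules(names=REQUIRED_MODULES) -> list:
    """Return the modules from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("Missing modules:", missing_modules() or "none")
```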

Quick Start

Easiest: No Installation (HF Spaces)

👉 Open interactive demo in your browser ← Start here!

Local Installation

After pip install rgbd-depth:

# CUDA (optimizations auto-enabled, FP16 for best speed)
python infer.py --input rgb.png --depth depth.png --precision fp16

# Apple Silicon (MPS)
python infer.py --input rgb.png --depth depth.png --device mps

# CPU (FP32 only)
python infer.py --input rgb.png --depth depth.png --device cpu

Example images are provided in example_data/. Pre-trained models can be downloaded from Hugging Face.

Usage

Command Line Interface

Basic inference:

python infer.py \
    --input /path/to/rgb.png \
    --depth /path/to/depth.png \
    --output refined_depth.png

CUDA with optimizations (default):

# FP32 (best accuracy)
python infer.py --input rgb.png --depth depth.png

# FP16 (best speed, ~2× faster)
python infer.py --input rgb.png --depth depth.png --precision fp16

# BF16 (best stability)
python infer.py --input rgb.png --depth depth.png --precision bf16

# Disable optimizations (debugging)
python infer.py --input rgb.png --depth depth.png --no-optimize

Apple Silicon (MPS):

# FP32 (default)
python infer.py --input rgb.png --depth depth.png --device mps

# FP16 (faster)
python infer.py --input rgb.png --depth depth.png --device mps --precision fp16

CPU:

# FP32 only (FP16 not recommended on CPU)
python infer.py --input rgb.png --depth depth.png --device cpu

Command Line Arguments

Required:

--input: Path to RGB input image (JPG/PNG)
--depth: Path to depth input image (PNG, 16-bit or 32-bit)

Optional:

--output: Output visualization path (default: output.png)
--device: Device to use: auto, cuda, mps, cpu (default: auto)
--precision: Precision mode: fp32, fp16, bf16 (default: fp32)
--no-optimize: Disable optimizations on CUDA (for debugging)
--encoder: Model size: vits, vitb, vitl, vitg (default: vitl)
--input-size: Input resolution for inference (default: 518)
--depth-scale: Scale factor for depth values (default: 1000.0)
--max-depth: Maximum valid depth in meters (default: 6.0)
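
For reference, the flags and defaults listed above map onto an `argparse` parser roughly as follows. This is an illustrative sketch, not the actual contents of `infer.py`; `build_parser` is a hypothetical name, and only the flag names, choices, and defaults come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="RGB-D depth refinement (sketch)")
    # Required arguments
    parser.add_argument("--input", required=True, help="Path to RGB input image (JPG/PNG)")
    parser.add_argument("--depth", required=True, help="Path to depth input image (PNG)")
    # Optional arguments with the documented defaults
    parser.add_argument("--output", default="output.png", help="Output visualization path")
    parser.add_argument("--device", default="auto", choices=["auto", "cuda", "mps", "cpu"])
    parser.add_argument("--precision", default="fp32", choices=["fp32", "fp16", "bf16"])
    parser.add_argument("--no-optimize", action="store_true", help="Disable optimizations")
    parser.add_argument("--encoder", default="vitl", choices=["vits", "vitb", "vitl", "vitg"])
    parser.add_argument("--input-size", type=int, default=518)
    parser.add_argument("--depth-scale", type=float, default=1000.0)
    parser.add_argument("--max-depth", type=float, default=6.0)
    return parser


args = build_parser().parse_args(["--input", "rgb.png", "--depth", "depth.png"])
print(args.device, args.precision, args.input_size)
```

Note that `argparse` exposes dashed flags with underscores, so `--no-optimize` becomes `args.no_optimize` and `--input-size` becomes `args.input_size`.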

Python API

import cv2
import numpy as np
import torch
from rgbddepth import RGBDDepth

# Pick the device once so model placement and autocast stay consistent
device = 'cuda'  # or 'mps', 'cpu'

# Load model with optimizations
model = RGBDDepth(encoder='vitl', features=256, use_xformers=True)
# map_location avoids load errors when the checkpoint was saved on another device
model.load_state_dict(torch.load('model.pth', map_location='cpu'))
model = model.to(device).eval()

# Optional: compile for extra speed on CUDA
model = torch.compile(model)

# Load images
rgb = cv2.imread('rgb.jpg')[:, :, ::-1]  # BGR to RGB
depth = cv2.imread('depth.png', cv2.IMREAD_UNCHANGED) / 1000.0  # Convert to meters

# Create similarity depth (inverse depth); invalid pixels (depth == 0) stay 0
simi_depth = np.zeros_like(depth)
simi_depth[depth > 0] = 1 / depth[depth > 0]

# Run inference with mixed precision (FP16 is intended for CUDA/MPS)
with torch.amp.autocast(device, dtype=torch.float16):
    pred_depth = model.infer_image(rgb, simi_depth, input_size=518)
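
To make the preprocessing arithmetic concrete, the sketch below repeats the depth-to-inverse-depth step on a tiny 2×2 example. It uses plain Python lists instead of NumPy so the numbers are easy to follow; `to_inverse_depth` is a hypothetical helper, not a package function.

```python
def to_inverse_depth(raw_depth, depth_scale=1000.0):
    """Convert raw sensor values to inverse depth (1/m); invalid pixels (<= 0) stay 0."""
    inverse = []
    for row in raw_depth:
        out_row = []
        for value in row:
            meters = value / depth_scale  # e.g. millimetres -> metres
            out_row.append(1.0 / meters if meters > 0 else 0.0)
        inverse.append(out_row)
    return inverse


# Raw values as they might appear in a 16-bit depth PNG stored in millimetres.
raw = [[0, 1000], [2000, 4000]]
print(to_inverse_depth(raw))  # [[0.0, 1.0], [0.5, 0.25]]
```

A raw value of 2000 with a scale of 1000.0 is 2.0 m, which gives an inverse depth of 0.5; the zero pixel is treated as a missing measurement and left at 0.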

Model Training

CDMs are trained on synthetic datasets generated using camera-specific noise models:

1. Noise Model Training: Learn hole and value noise patterns from real camera data
2. Synthetic Data Generation: Apply learned noise to clean simulation depth
3. CDM Training: Train depth estimation model on synthetic noisy data
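
The synthetic data generation step can be illustrated with a toy example. Note the substitution: the real pipeline applies noise models learned from camera data, whereas this sketch uses fixed random pixel dropout for hole noise and Gaussian jitter for value noise. The function name and parameters are hypothetical.

```python
import random


def add_synthetic_noise(clean_depth, hole_prob=0.1, value_sigma=0.01, seed=None):
    """Apply stand-in hole noise (dropped pixels -> 0) and value noise (Gaussian jitter)."""
    rng = random.Random(seed)
    noisy = []
    for row in clean_depth:
        noisy_row = []
        for depth in row:
            if rng.random() < hole_prob:
                noisy_row.append(0.0)  # hole: the sensor returned no measurement
            else:
                # value noise: perturb the true depth, never going below zero
                noisy_row.append(max(0.0, depth + rng.gauss(0.0, value_sigma)))
        noisy.append(noisy_row)
    return noisy


clean = [[1.0, 1.2], [0.8, 2.5]]  # clean simulation depth in metres
print(add_synthetic_noise(clean, seed=0))
```

Pairs of `(noisy, clean)` maps produced this way would then serve as input and target for the training step.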

Training datasets: HyperSim, DREDS, HISS, IRS (280,000+ images total)

Supported Cameras

Pre-trained models are currently available for:

Intel RealSense D405/D435/L515
Stereolabs ZED 2i (2 modes: Quality, Neural)
Microsoft Azure Kinect

File Structure

rgbd-depth/
├── app.py                      # Gradio web demo for HuggingFace Spaces
├── infer.py                    # CLI inference script (main entry point)
├── pyproject.toml              # Modern package config (PEP 621, replaces setup.py)
├── setup.py                    # Legacy setuptools build script
├── requirements.txt            # Minimal deps for HuggingFace Spaces
├── uv.lock                     # UV package manager lock file
├── LICENSE                     # Apache 2.0 license
├── README.md                   # This file (GitHub/PyPI/HF Spaces unified)
├── OPTIMIZATION.md             # Performance benchmarks and optimization guide
├── CHANGELOG.md                # Version history and release notes
├── VIRAL_STRATEGY.md           # GitHub/PyPI marketing strategy
│
├── rgbddepth/                  # Main Python package
│   ├── __init__.py             # Public API exports (RGBDDepth, DinoVisionTransformer, __version__)
│   ├── dpt.py                  # RGBDDepth model (dual-branch ViT + DPT decoder)
│   ├── dinov2.py               # DINOv2 Vision Transformer encoder
│   ├── flexible_attention.py   # Cross-attention w/ xFormers + SDPA fallback
│   │
│   ├── dinov2_layers/          # Vision Transformer building blocks (from Meta DINOv2)
│   │   ├── __init__.py
│   │   ├── attention.py        # Self-attention w/ optional xFormers (MemEffAttention)
│   │   ├── block.py            # Transformer encoder block (NestedTensorBlock)
│   │   ├── mlp.py              # Feed-forward network (Mlp)
│   │   ├── patch_embed.py      # Image → patch embeddings (PatchEmbed)
│   │   ├── swiglu_ffn.py       # SwiGLU activation FFN
│   │   ├── drop_path.py        # Stochastic depth regularization
│   │   └── layer_scale.py      # LayerScale normalization
│   │
│   └── util/                   # Utilities
│       ├── __init__.py
│       ├── blocks.py           # DPT decoder blocks (FeatureFusionBlock, ResidualConvUnit)
│       └── transform.py        # Image preprocessing (Resize, PrepareForNet)
│
├── tests/                      # Test suite (42 tests, runs in GitHub Actions)
│   ├── test_import.py          # Basic imports and smoke tests
│   └── test_model.py           # Architecture, forward pass, attention, preprocessing
│
├── example_data/               # Example RGB-D pairs for testing
│   ├── color_12.png            # RGB image sample
│   ├── depth_12.png            # Depth map sample
│   └── result.png              # Expected output
│
└── .github/workflows/          # CI/CD automation
    ├── test.yml                # Run tests on Python 3.10-3.12 (Ubuntu/macOS/Windows)
    ├── publish-pypi.yml        # Auto-publish to PyPI on release tags
    └── deploy-hf.yml           # Auto-deploy to HuggingFace Spaces on push to main

Performance

Accuracy

This implementation achieves pixel-perfect alignment with the ByteDance reference:

✅ 0 pixel difference between vanilla and optimized inference (verified on test images)
✅ Identical checkpoint loading (weights are fully compatible)
✅ Numerical precision preserved (min=0.2036, max=1.1217, exact match)
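
A "0 pixel difference" check of this kind can be sketched as a simple element-wise comparison of two depth maps. This is an assumption about how such a comparison might be done, not the project's actual test; `count_pixel_differences` is a hypothetical helper operating on nested lists.

```python
def count_pixel_differences(reference, candidate, tolerance=0.0):
    """Count pixels whose absolute difference exceeds `tolerance`."""
    same_shape = len(reference) == len(candidate) and all(
        len(ref_row) == len(cand_row) for ref_row, cand_row in zip(reference, candidate)
    )
    if not same_shape:
        raise ValueError("depth maps must have the same shape")
    return sum(
        1
        for ref_row, cand_row in zip(reference, candidate)
        for ref, cand in zip(ref_row, cand_row)
        if abs(ref - cand) > tolerance
    )


vanilla = [[0.2036, 0.5], [0.75, 1.1217]]
optimized = [[0.2036, 0.5], [0.75, 1.1217]]
print(count_pixel_differences(vanilla, optimized))  # 0
```

With `tolerance=0.0` the check demands exact equality; a small positive tolerance would be the natural choice when comparing FP16 or BF16 output against FP32.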

CDMs achieve state-of-the-art performance on metric depth estimation:

Superior accuracy compared to existing prompt-based depth models
Zero-shot generalization across different camera types
Real-time inference suitable for robot control (lightweight ViT variants)

Performance optimizations:

xFormers support on CUDA (~8% faster than native SDPA)
Mixed precision (FP16/BF16) for faster inference
Device-specific optimizations (CUDA/MPS/CPU)

For detailed optimization strategies and benchmarks, see OPTIMIZATION.md.

What's Different from Reference?

This implementation maintains 100% compatibility with ByteDance CDM while adding:

1. Performance Optimizations

xFormers support: ~8% faster attention on CUDA (automatic fallback to SDPA)
torch.compile: JIT compilation (CUDA only, auto-enabled)
Mixed precision: FP16/BF16 support via torch.amp.autocast
Device-specific strategies: Optimizations only where beneficial

2. Better CLI/API

--device flag: Force specific device (auto/cuda/mps/cpu)
--precision flag: Choose FP32/FP16/BF16
--no-optimize flag: Disable optimizations for debugging
Automatic device detection and optimization selection

3. Improved Architecture

FlexibleCrossAttention: Inherits from nn.MultiheadAttention for checkpoint compatibility
Automatic backend selection: xFormers (CUDA) → SDPA (fallback)
Device-aware preprocessing: Uses model's device instead of auto-detection
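
The xFormers-to-SDPA fallback described above could look roughly like this. It is a sketch under stated assumptions, not the package's code: `select_attention_backend` is a hypothetical name, and the rule "use xFormers only on CUDA when the package is importable" is inferred from the description.

```python
import importlib.util


def select_attention_backend(device: str, use_xformers: bool = True) -> str:
    """Pick 'xformers' only on CUDA when the package is importable; otherwise 'sdpa'."""
    xformers_installed = importlib.util.find_spec("xformers") is not None
    if use_xformers and device == "cuda" and xformers_installed:
        return "xformers"
    # Fallback: PyTorch's built-in scaled dot-product attention (PyTorch 2.0+)
    return "sdpa"


for name in ("cuda", "mps", "cpu"):
    print(name, "->", select_attention_backend(name))
```

Checking availability with `importlib.util.find_spec` avoids importing xFormers just to find out whether it is installed.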

4. Code Quality

Type hints and better documentation
Cleaner argument parsing
Validation for precision/device combinations
Helpful warnings for incompatible configurations
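
As an illustration of device detection and precision validation, the sketch below encodes the combinations documented in this README (FP16 on CUDA/MPS, BF16 on CUDA, FP32 only on CPU). The function names, the CUDA-then-MPS-then-CPU preference order, and the fall-back-to-FP32 behaviour are all assumptions made for the example.

```python
import warnings

# Precision modes per device, as documented in this README.
SUPPORTED_PRECISIONS = {
    "cuda": {"fp32", "fp16", "bf16"},
    "mps": {"fp32", "fp16"},
    "cpu": {"fp32"},
}


def resolve_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Resolve 'auto' to a concrete device; explicit requests are returned unchanged."""
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def resolve_precision(device: str, precision: str) -> str:
    """Return a usable precision, warning and falling back to fp32 if unsupported."""
    if precision in SUPPORTED_PRECISIONS[device]:
        return precision
    warnings.warn(f"{precision} is not supported on {device}; falling back to fp32")
    return "fp32"


device = resolve_device("auto", cuda_available=False, mps_available=True)
print(device, resolve_precision(device, "fp16"))  # mps fp16
```

In real use the two availability flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.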

All changes are backwards compatible with original checkpoints and produce identical numerical results.

Citation

If you use CDM in your research, please cite:

@article{liu2025manipulation,
  title={Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots},
  author={Liu, Minghuan and Zhu, Zhengbang and Han, Xiaoshen and Hu, Peng and Lin, Haotong and
          Li, Xinyao and Chen, Jingxiao and Xu, Jiafeng and Yang, Yichu and Lin, Yunfeng and
          Li, Xinghang and Yu, Yong and Zhang, Weinan and Kong, Tao and Kang, Bingyi},
  journal={arXiv preprint},
  year={2025}
}

License

This project is licensed under the Apache 2.0 License. See LICENSE for details.

Available on: GitHub | PyPI | HF Spaces

Project details

Release history

1.0.3 (Nov 27, 2025), this version
1.0.2 (Nov 25, 2025)

Download files

Source Distribution

rgbd_depth-1.0.3.tar.gz (42.1 kB view details)

Uploaded Nov 27, 2025 Source

Built Distribution

rgbd_depth-1.0.3-py3-none-any.whl (33.0 kB view details)

Uploaded Nov 27, 2025 Python 3

File details

Details for the file rgbd_depth-1.0.3.tar.gz.

File metadata

Download URL: rgbd_depth-1.0.3.tar.gz
Upload date: Nov 27, 2025
Size: 42.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rgbd_depth-1.0.3.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | `2e38eb5165228557e64de2a89d62ef9cd427617c4ddcda5b7af194201412033f` |
| MD5 | `cd9a678dc374e661f3a0e50845e1f714` |
| BLAKE2b-256 | `3f553a4790fd7b76512e31460c7d99e236578a09ed19f499adf9c9a6be42c893` |
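
A downloaded file can be checked against the SHA256 digest above with Python's standard `hashlib` module. This is a minimal sketch; the helper names are hypothetical, and the file path in the commented-out call assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """True when the file's SHA256 equals the published digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest of rgbd_depth-1.0.3.tar.gz, as listed in the table above.
EXPECTED_SDIST_SHA256 = "2e38eb5165228557e64de2a89d62ef9cd427617c4ddcda5b7af194201412033f"
# Example call, assuming the archive was downloaded to the current directory:
# print(matches_published_hash("rgbd_depth-1.0.3.tar.gz", EXPECTED_SDIST_SHA256))
```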

Provenance

The following attestation bundles were made for rgbd_depth-1.0.3.tar.gz:

Publisher: publish-pypi.yml on Aedelon/rgbd-depth

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rgbd_depth-1.0.3.tar.gz
- Subject digest: 2e38eb5165228557e64de2a89d62ef9cd427617c4ddcda5b7af194201412033f
- Sigstore transparency entry: 729721812
- Sigstore integration time: Nov 27, 2025
Source repository:
- Permalink: Aedelon/rgbd-depth@6c8b897f5179e6b1cf0e6d98215458338cee55f1
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/Aedelon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@6c8b897f5179e6b1cf0e6d98215458338cee55f1
- Trigger Event: push

File details

Details for the file rgbd_depth-1.0.3-py3-none-any.whl.

File metadata

Download URL: rgbd_depth-1.0.3-py3-none-any.whl
Upload date: Nov 27, 2025
Size: 33.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rgbd_depth-1.0.3-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | `227582c9c59d2c42aa6b99399bcd1876216dfee2056e23164b9c1ffb144f1ddf` |
| MD5 | `d829b247cd41811cdf5a5d0e83a39349` |
| BLAKE2b-256 | `6607f01577a68b629e1a8b62103b226a675dd45ef2fc7d582293381810cd528d` |

Provenance

The following attestation bundles were made for rgbd_depth-1.0.3-py3-none-any.whl:

Publisher: publish-pypi.yml on Aedelon/rgbd-depth

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rgbd_depth-1.0.3-py3-none-any.whl
- Subject digest: 227582c9c59d2c42aa6b99399bcd1876216dfee2056e23164b9c1ffb144f1ddf
- Sigstore transparency entry: 729721816
- Sigstore integration time: Nov 27, 2025
Source repository:
- Permalink: Aedelon/rgbd-depth@6c8b897f5179e6b1cf0e6d98215458338cee55f1
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/Aedelon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@6c8b897f5179e6b1cf0e6d98215458338cee55f1
- Trigger Event: push
