PyRoboVision
Advanced autonomous driving perception and vision-language foundation models for robotics. Built on top of PyRoboFrames.
Focus: Advanced perception pipelines and multimodal understanding that consume data loaded by PyRoboFrames.
What's Inside
Autonomous Driving (v0.5)
- Cylindrical panoramic stitching — 360° multi-camera fusion (Waymo, nuScenes)
- Advanced blending — Laplacian pyramid + graph-cut seams
- Bird's-eye-view (BEV) — 3D projection for autonomous perception
- GPU acceleration — CuPy (NVIDIA), MLX (Apple Silicon), NumPy (CPU)
- Sensor fusion — Lidar/Radar + occupancy grid mapping
- Dataset loaders — Waymo TFRecord, nuScenes JSON, KITTI stereo
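The GPU acceleration bullet lists three array backends in order of preference. A minimal sketch of that fallback order is shown here, assuming only the usual import names (`cupy`, `mlx`, `numpy`); `pick_backend` and its `is_available` hook are hypothetical helpers for illustration, not part of PyRoboVision.

```python
import importlib.util

# Preference order taken from the feature list: CuPy (NVIDIA), MLX (Apple Silicon), NumPy (CPU).
BACKEND_PREFERENCE = ("cupy", "mlx", "numpy")


def pick_backend(preference=BACKEND_PREFERENCE, is_available=None):
    """Return the first backend name that can be found, or None if none is installed.

    `is_available` is a hypothetical hook (useful for testing). By default it
    checks whether the module can be located without importing it.
    """
    if is_available is None:
        def is_available(name):
            return importlib.util.find_spec(name) is not None
    for name in preference:
        if is_available(name):
            return name
    return None


if __name__ == "__main__":
    print("Selected array backend:", pick_backend())
```

Finding a module only shows that it is installed. A real check would also confirm that a usable device is present, for example by allocating a small array on it.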
Foundation Models (Phase 7)
- SAM3 segmentation — Instance segmentation + temporal tracking
- CLIP embeddings — Scene understanding, text-image similarity
- Grounding DINO — Open-vocabulary object detection
- Multi-modal fusion — Unified detection + segmentation + classification
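Text-image similarity with CLIP-style embeddings usually comes down to cosine similarity between an image vector and one vector per candidate caption. The sketch below shows only that scoring step in plain Python. The three-dimensional vectors are placeholders made up for illustration, not real CLIP output, and `rank_labels` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_labels(image_embedding, text_embeddings):
    """Return (label, similarity) pairs sorted from best to worst match."""
    scores = {
        label: cosine_similarity(image_embedding, vec)
        for label, vec in text_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder vectors; real embeddings have hundreds of dimensions.
image_vec = [0.9, 0.1, 0.0]
text_vecs = {
    "a highway": [1.0, 0.0, 0.0],
    "a parking lot": [0.0, 1.0, 0.0],
}
ranking = rank_labels(image_vec, text_vecs)
best_label = ranking[0][0]
```

With real embeddings the same ranking step applies unchanged; only the vectors come from the model's image and text encoders.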
Installation
# Requires PyRoboFrames v1.0+
pip install pyroboframes pyrobovision
# From source
git clone https://github.com/Mullassery/PyRoboVision.git
cd PyRoboVision
pip install -e .
Quick Start
Autonomous Driving: 360° Panoramic Perception
from pyrobovision.automotive import (
    CylindricalStitcher,
    get_waymo_layout,
)
# Stitch 5 cameras into 360° panorama
layout = get_waymo_layout()
stitcher = CylindricalStitcher(layout, blend_method="laplacian")
frames = {
    "FRONT": ...,
    "FRONT_LEFT": ...,
    # ... other cameras
}
panorama = stitcher.stitch(frames)  # [1, H, W, 3] seamless 360°
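For readers unfamiliar with cylindrical stitching, the geometry behind it can be sketched in a few lines. The code below uses the textbook pinhole-to-cylinder mapping and is not taken from `CylindricalStitcher`. The function names, the camera yaw parameter, and the choice of panorama radius are all assumptions made for illustration.

```python
import math


def pinhole_to_cylinder(x, y, focal_px, cx, cy, yaw_rad=0.0):
    """Map a pixel in a pinhole image to (azimuth, height) on a unit cylinder.

    (cx, cy) is the principal point; yaw_rad is the camera heading relative
    to the panorama's forward axis (an assumed convention).
    """
    dx = x - cx
    dy = y - cy
    theta = math.atan2(dx, focal_px) + yaw_rad
    height = dy / math.hypot(dx, focal_px)
    return theta, height


def cylinder_to_panorama(theta, height, pano_width, pano_height):
    """Convert cylinder coordinates to pixel coordinates in a 360° panorama."""
    # A full turn spans the panorama width, which fixes the cylinder radius in pixels.
    radius_px = pano_width / (2.0 * math.pi)
    # Wrap the azimuth into [-pi, pi) so cameras on either side of the seam line up.
    wrapped = (theta + math.pi) % (2.0 * math.pi) - math.pi
    u = (wrapped + math.pi) * radius_px
    v = pano_height / 2.0 + height * radius_px
    return u, v


# The centre pixel of a forward-facing camera lands in the middle of the panorama.
theta, height = pinhole_to_cylinder(960, 640, focal_px=2000.0, cx=960, cy=640)
u, v = cylinder_to_panorama(theta, height, pano_width=4096, pano_height=1024)
```

A stitcher would typically run this mapping in reverse (panorama pixel to source pixel) and resample, then blend the overlapping regions between neighbouring cameras.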
Foundation Models: Multi-Modal Scene Understanding
from pyrobovision.foundation_models import MultiModalFusion
fusion = MultiModalFusion(
    detection_prompt="car . pedestrian . cyclist",
    device="mlx",  # or "cuda"
)
scene = fusion.understand(frame)
for obj in scene.objects:
    print(f"{obj.object_class}: {obj.semantic_label}")
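The `detection_prompt` above follows the Grounding DINO convention of separating class phrases with periods. A small sketch of building and splitting such a prompt is given here. Both helper names are hypothetical, and whether `MultiModalFusion` needs lowercase text or a trailing period is not stated in this README, so the lowercasing is an assumption.

```python
def build_detection_prompt(class_names):
    """Join class phrases into a period-separated prompt such as 'car . pedestrian'."""
    cleaned = [name.strip().lower() for name in class_names if name.strip()]
    return " . ".join(cleaned)


def split_detection_prompt(prompt):
    """Recover the individual class phrases from a period-separated prompt."""
    return [part.strip() for part in prompt.split(".") if part.strip()]


prompt = build_detection_prompt(["Car", "Pedestrian ", " cyclist"])
classes = split_detection_prompt(prompt)
```

Keeping the class list as data and generating the prompt from it makes it easier to map detections back to the phrases that produced them.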
Architecture
Dependency Graph
PyRoboVision/
├── automotive/ # v0.5 AV perception
│ ├── stitching.py
│ ├── blending.py
│ ├── bev.py
│ ├── perception_3d.py
│ ├── tfrecord_utils.py
│ ├── nuscenes_utils.py
│ └── datasets.py
│
└── foundation_models/ # Phase 7
    ├── sam3_segmentation.py
    ├── clip_embeddings.py
    ├── grounding_dino.py
    └── multimodal_fusion.py
↓ Depends on PyRoboFrames v1.0+ (dataloader)
PyRoboFrames/
├── RoboFrameDataset # Load LeRobot
├── ProprioceptiveLoader # Load state/action
├── DataLoader # Device selection
└── [video decode, sensor fusion, etc.]
Key design: PyRoboVision is a consumer library, not a foundation. It uses PyRoboFrames to load data, then applies perception algorithms.
Features
| Phase | Feature | Status | Tests |
|---|---|---|---|
| 1 | Cylindrical panoramic projection | ✅ | 10 |
| 2 | Laplacian pyramid blending | ✅ | 5 |
| 3 | Bird's-eye-view (BEV) projection | ✅ | 5 |
| 4a | GPU acceleration (CuPy/MLX/NumPy) | ✅ | 6 |
| 4b | Optical flow seam tracking | ✅ | 10 |
| 5 | Waymo/nuScenes/KITTI loaders | ✅ | 9 |
| 6 | Lidar/Radar fusion + Occupancy grids | ✅ | 18 |
| 7a | SAM3 temporal segmentation | ✅ | 18 |
| 7b | CLIP scene embeddings | ✅ | 25 |
| 7c | Grounding DINO detection | ✅ | 26 |
| 7d | Multi-modal fusion | ✅ | 17 |
Total: 149 tests, all passing
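As a rough illustration of the occupancy grids listed under Phase 6, the sketch below bins 2D points in the vehicle frame into a binary grid. It covers only the binning step and assumes a fixed square region and cell size. The function name and defaults are hypothetical, and a production grid would more likely use ray casting with log-odds updates.

```python
def build_occupancy_grid(points_xy, x_range=(-10.0, 10.0), y_range=(-10.0, 10.0), cell_size=0.5):
    """Mark grid cells that contain at least one point.

    Returns a list of rows; grid[row][col] is 1 if occupied, else 0.
    Rows index y and columns index x, both starting at the range minimum.
    """
    n_cols = int(round((x_range[1] - x_range[0]) / cell_size))
    n_rows = int(round((y_range[1] - y_range[0]) / cell_size))
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points_xy:
        # Floor division keeps points just below the range minimum out of cell 0.
        col = int((x - x_range[0]) // cell_size)
        row = int((y - y_range[0]) // cell_size)
        if 0 <= col < n_cols and 0 <= row < n_rows:
            grid[row][col] = 1
    return grid


# Two points inside the region and one far outside it, which is ignored.
grid = build_occupancy_grid([(0.1, 0.1), (-9.9, -9.9), (50.0, 0.0)])
occupied_cells = sum(sum(row) for row in grid)
```

Points outside the configured range are dropped rather than clamped, so distant returns do not pile up in the border cells.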
Use Cases
Autonomous Driving
- Waymo perception pipeline (panoramic stitching + 3D fusion)
- nuScenes multi-camera understanding
- Real-time BEV mapping
Mobile Manipulation
- Egocentric robot perception (360° view from mobile base)
- Scene understanding for pick-and-place
Robot Dog Navigation
- Panoramic localization (where am I in the scene?)
- Terrain classification from multi-camera fusion
Related Projects
- PyRoboFrames — Fast ML dataloader for robot learning (core dependency)
- LeRobot — HuggingFace robotics datasets
- Segment Anything 3 (SAM3) — Instance segmentation
- CLIP — Vision-language models
- Grounding DINO — Open-vocabulary detection
License
MIT (same as PyRoboFrames)
Contributing
Contributions welcome. Please open issues and PRs on GitHub.
For architectural decisions, see ARCHITECTURE.md.
Citation
@software{mullassery2025pyrobovision,
  title={PyRoboVision: Advanced perception and vision-language models for robotics},
  author={Mullassery, Georgi},
  url={https://github.com/Mullassery/PyRoboVision},
  year={2025}
}
Download files
File details
Details for the file pyrobovision-0.5.0.tar.gz.
File metadata
- Download URL: pyrobovision-0.5.0.tar.gz
- Upload date:
- Size: 23.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 98e454c4aa918ede38e882c2c0ae4218f542405e420412e4908fd29195205472 |
| MD5 | 8e4b2e728ee239fdca95a0ffeb7ea2cc |
| BLAKE2b-256 | e2c8632ce4180571a7ccbbdeda025bae3278c86773a71fb9114ae5565d7bae0a |
File details
Details for the file pyrobovision-0.5.0-py3-none-any.whl.
File metadata
- Download URL: pyrobovision-0.5.0-py3-none-any.whl
- Upload date:
- Size: 3.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 594d324a0537627728bf9060e7ad775487c5cbdb3e245d2f00dddf4a2cadf394 |
| MD5 | 3ede410cd3ac76114bb6b0c89a9a5c99 |
| BLAKE2b-256 | 8dceea66fa5d7e8df18d0957165e70cd4da8860661a55507710991eff169b09c |