Unified deep learning models for Computer Vision
UniCV
UniCV is a unified, extensible framework for computer vision models that operate across heterogeneous input and output representations. It wraps state-of-the-art models — depth estimators, Gaussian splat predictors, mesh generators, and more — behind a single, composable VisionModule interface.
Philosophy
Modern computer vision has fragmented into dozens of incompatible APIs: each model ships with its own preprocessing, its own output format, and its own integration burden.
The architecture and design philosophy of UniCV is inspired by modular deep learning ecosystems such as PyTorch and Hugging Face's Transformers, as well as recent efforts toward foundation models and generalist perception systems in computer vision. Rather than prescribing fixed pipelines (e.g. RGB → Depth or RGB → Mesh), UniCV abstracts vision algorithms as composable transformations between representation spaces.
The core abstraction of UniCV is VisionModule, which defines a standardized interface for mapping any combination of visual input modalities to any combination of output modalities. These modalities include, but are not limited to:
- RGB images
- Depth maps
- Point clouds
- Meshes
- Gaussian splats and other implicit or semi-implicit scene representations
Concrete vision algorithms—such as monocular depth estimation, RGB-to-point-cloud reconstruction, or RGB-D refinement—are implemented as subclasses of this abstract interface. Existing models available online (e.g. DepthPro, MiDaS, CDM, or point-cloud reconstruction networks) can be redefined within this framework without altering their internal logic, allowing them to be seamlessly integrated into a shared system.
UniCV is hence designed to accommodate the full spectrum of modern 3D perception: classical CNNs, ViT-based backbones, implicit neural representations, Gaussian splatting, and diffusion-based generation.
This abstraction enables UniCV to decouple input modality, latent processing, and output representation, encouraging reuse, composition, and extension of vision algorithms. Models may share encoders, latent spaces, or decoders, and can be combined or chained to support progressive or multi-stage reconstruction pipelines.
The conceptual motivation for UniCV is closely aligned with the emergence of foundation models for perception, where a single system is expected to reason across tasks, representations, and data sources. By enforcing a common interface at the representation level, UniCV facilitates cross-representation supervision, multi-task learning, and interoperability between otherwise incompatible vision methods.
Motivation
UniCV aims to support vision systems that are:
- Modality-agnostic, capable of ingesting arbitrary combinations of visual inputs (e.g. RGB, depth, point clouds) without architectural redesign.
- Representation-agnostic, able to emit multiple scene representations from a shared latent abstraction.
- Algorithm-agnostic, allowing existing and future computer vision models to be wrapped, extended, or replaced under a common interface.
- Composable, enabling complex pipelines to be constructed by chaining or jointly training multiple VisionModule-based modules.
- Extensible, supporting both classical CV algorithms and modern deep learning approaches, including implicit scene representations and neural rendering techniques.
- Foundation-ready, serving as an architectural substrate for training large, generalist vision models capable of cross-task and cross-representation transfer.
In addition to standard convolutional and Transformer-based architectures, UniCV is designed to accommodate emerging paradigms such as implicit neural representations, Gaussian splatting, and hybrid geometric–neural pipelines, enabling a unified experimental platform for next-generation 3D perception systems.
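The "composable" goal above can be illustrated with a small sketch. It does not use UniCV itself: `chain`, `toy_depth`, and `toy_points` are hypothetical names, and a "stage" here is simply any callable that takes keyword inputs and returns a dict of named outputs. The real VisionModule interface may compose differently.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in: a stage maps keyword inputs to a dict of outputs.
# This is NOT UniCV's VisionModule, only an illustration of chaining.
Stage = Callable[..., Dict[str, Any]]


def chain(*stages: Stage) -> Stage:
    """Run stages in order, feeding each one everything produced so far."""
    def run(**inputs: Any) -> Dict[str, Any]:
        state = dict(inputs)
        for stage in stages:
            state.update(stage(**state))
        return state
    return run


def toy_depth(rgb, **_):
    # Toy "depth estimator": mean pixel intensity stands in for depth.
    return {"depth": [[sum(px) / 3.0 for px in row] for row in rgb]}


def toy_points(depth, **_):
    # Toy "lifter": one (u, v, z) tuple per pixel.
    return {
        "point_cloud": [
            (u, v, z) for v, row in enumerate(depth) for u, z in enumerate(row)
        ]
    }


pipeline = chain(toy_depth, toy_points)
out = pipeline(rgb=[[(3, 3, 3), (6, 6, 6)]])
```

Because every stage reads from and writes to the same dictionary of named representations, a stage can be swapped for another that produces the same key without touching the rest of the pipeline.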
Architecture Overview
src/unicv/
├── utils/
│   └── types.py             # Modality and InputForm enums
├── models/
│   ├── base.py              # VisionModule abstract base class
│   ├── depth_pro/           # DepthPro (Apple, 2024)
│   ├── depth_anything_3/    # Depth Anything 3 (ByteDance, 2025)
│   ├── cdm/                 # Camera Depth Model (ByteDance, 2025)
│   ├── sharp/               # SHARP (Apple, 2024)
│   └── simplerecon/         # SimpleRecon (Niantic, 2022)
└── nn/
    ├── decoder.py           # MultiresConvDecoder, FeatureFusionBlock2d
    ├── dpt.py               # DPTDecoder, Reassemble, FeatureFusionBlock
    ├── fov.py               # FOVNetwork (field-of-view estimation)
    ├── sdt.py               # SDTHead (AnyDepth lightweight decoder)
    ├── gaussian.py          # GaussianHead
    └── geometry.py          # backproject_depth, homography_warp
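The tree lists a backproject_depth utility in geometry.py. Its signature is not documented here, so the following is only a sketch of the standard pinhole back-projection such a function typically performs. The name `backproject_depth_sketch` and the parameters `fx`, `fy`, `cx`, `cy` are assumptions for illustration, not UniCV's API.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_depth_sketch(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift a depth map to camera-space points with a pinhole model.

    For pixel (u, v) with depth z:
        X = (u - cx) * z / fx,  Y = (v - cy) * z / fy,  Z = z
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip invalid or missing depth values
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, float(z)))
    return points


pts = backproject_depth_sketch(
    [[2.0, 2.0], [0.0, 4.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5
)
```

A tensor-based implementation would vectorise the same arithmetic over a pixel grid; the loop form is used here only to keep the example dependency-free.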
The VisionModule interface
Every model in UniCV inherits from the VisionModule interface, which is defined by two attributes and one method:
- input_spec: dict[Modality, InputForm]: declares what the model needs (e.g. a single RGB image, a temporal sequence of RGB frames).
- output_modalities: list[Modality]: declares what the model produces (e.g. a depth map, a Gaussian cloud).
- forward(**inputs) -> dict[Modality, Any]: the actual computation.
Calling an instance validates inputs, dispatches to forward, and validates outputs automatically. Models can be chained, swapped, or jointly trained without rewriting pipelines.
An example is shown below:
from unicv.models.base import VisionModule
from unicv.utils.types import Modality, InputForm

class MyModel(VisionModule):
    input_spec = {Modality.RGB: InputForm.SINGLE}
    output_modalities = [Modality.DEPTH]

    def forward(self, **inputs):
        rgb = inputs["rgb"]  # validated automatically
        depth = ...  # your model logic
        return {Modality.DEPTH: depth}

model = MyModel()
result = model(rgb=image_tensor)  # → {Modality.DEPTH: tensor}
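To make the "validate inputs, dispatch to forward, validate outputs" sequence concrete, here is a minimal sketch of how such a call wrapper could work. `SketchModule` and `Doubler` are hypothetical stand-ins, the local `Modality` enum only mimics the one in unicv.utils.types, and the real base class may check more (for example input forms and tensor shapes).

```python
from enum import Enum
from typing import Any, Dict, List


class Modality(Enum):  # stand-in for unicv.utils.types.Modality
    RGB = "rgb"
    DEPTH = "depth"


class SketchModule:
    """Hypothetical stand-in for VisionModule; not the real base class."""

    input_spec: Dict[Modality, str] = {}
    output_modalities: List[Modality] = []

    def forward(self, **inputs: Any) -> Dict[Modality, Any]:
        raise NotImplementedError

    def __call__(self, **inputs: Any) -> Dict[Modality, Any]:
        # 1. Validate inputs against the declared spec.
        expected = {m.value for m in self.input_spec}
        given = set(inputs)
        if expected != given:
            raise ValueError(
                f"missing inputs {sorted(expected - given)}, "
                f"unexpected inputs {sorted(given - expected)}"
            )
        # 2. Dispatch to the model-specific computation.
        outputs = self.forward(**inputs)
        # 3. Validate that every declared output was produced.
        absent = [m.name for m in self.output_modalities if m not in outputs]
        if absent:
            raise ValueError(f"forward() did not return: {absent}")
        return outputs


class Doubler(SketchModule):
    input_spec = {Modality.RGB: "single"}
    output_modalities = [Modality.DEPTH]

    def forward(self, **inputs: Any) -> Dict[Modality, Any]:
        return {Modality.DEPTH: [2 * x for x in inputs["rgb"]]}


result = Doubler()(rgb=[1, 2])
```

Putting the checks in `__call__` rather than in each `forward` keeps subclasses short and means a mis-wired pipeline fails at the module boundary with a descriptive error.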
unicv.nn building blocks
| Class | File | Purpose |
|---|---|---|
| MultiresConvDecoder | decoder.py | Fuses multi-scale encoder maps finest → coarsest; used by DepthPro |
| FeatureFusionBlock2d | decoder.py | Single fusion step with optional residual skip and deconv upsampling |
| DPTDecoder | dpt.py | Full DPT pipeline: reassemble patch tokens → spatial maps, fuse; used by DA3 and CDM |
| Reassemble | dpt.py | Converts flat ViT patch tokens to 2-D feature maps at a configurable scale |
| FeatureFusionBlock | dpt.py | DPT-style fusion with 2× bilinear upsampling |
| FOVNetwork | fov.py | Estimates a scalar field-of-view from a low-resolution feature map |
| SDTHead | sdt.py | Lightweight decoder from AnyDepth: per-level attention + depth-wise fusion |
| GaussianHead | gaussian.py | Regresses per-pixel Gaussian parameters (scales, rotations, opacities, SH coefficients) |
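The core reshaping step behind a Reassemble-style block can be sketched in plain Python. This is an illustration only: `tokens_to_feature_map` is a hypothetical helper, it assumes row-major patch order with no class token, and it omits the projection and resampling to a configurable scale that the table describes.

```python
from typing import List, Sequence


def tokens_to_feature_map(
    tokens: Sequence[Sequence[float]], grid_h: int, grid_w: int
) -> List[List[List[float]]]:
    """Turn N = grid_h * grid_w patch tokens of width C into a C x H x W map."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count does not match the patch grid")
    channels = len(tokens[0])
    return [
        [
            [tokens[r * grid_w + c][ch] for c in range(grid_w)]
            for r in range(grid_h)
        ]
        for ch in range(channels)
    ]


# Six tokens with two channels each, laid out on a 2 x 3 patch grid.
tokens = [[0, 10], [1, 11], [2, 12], [3, 13], [4, 14], [5, 15]]
fmap = tokens_to_feature_map(tokens, grid_h=2, grid_w=3)
```

With tensors, the same operation is usually a transpose followed by a reshape; the nested lists here just make the index mapping explicit.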
Implemented Models
| Model | Paper | Class | Input → Output | Pretrained |
|---|---|---|---|---|
| DepthPro | Apple, 2024 | DepthProModel | RGB → Depth | DepthProModel.from_pretrained() |
| Depth Anything 3 | ByteDance, 2025 | DepthAnything3Model | RGB → Depth | DepthAnything3Model.from_pretrained(variant=...) |
| Camera Depth Model | ByteDance, 2025 | CameraDepthModel | RGB + Depth → Depth | CameraDepthModel.from_pretrained(camera=...) |
| SHARP | Apple, 2024 | SHARPModel | RGB → Splat | SHARPModel.from_pretrained() |
| SimpleRecon | Niantic, 2022 | SimpleReconModel | RGB (temporal) → Depth | None |
Installation
From PyPI
pip install unicv
From source
git clone https://github.com/aether-raid/unicv.git
cd unicv
pip install .
Development environment
Requires uv:
uv sync --dev
uv run pytest -v # run the full test suite
Loading Pretrained Weights
Each model with a from_pretrained classmethod downloads the official checkpoint and loads it into the UniCV architecture. Install the optional dependencies first:
pip install huggingface_hub timm safetensors
DepthPro
Downloads Apple's DepthPro weights from apple/DepthPro on Hugging Face. Requires huggingface_hub and timm.
from unicv.models.depth_pro import DepthProModel
model = DepthProModel.from_pretrained()  # includes FoV head
# Alternatively, load the depth-only variant:
model = DepthProModel.from_pretrained(use_fov_head=False)
model.eval()
result = model(rgb=image_tensor)  # image_tensor: (B, 3, 1536, 1536)
depth = result["depth"]  # metric depth in metres
Depth Anything 3
Downloads from the depth-anything organisation. Requires huggingface_hub and safetensors.
| variant | Hugging Face repo | Backbone embed dim |
|---|---|---|
| "vit_s" | depth-anything/DA3-SMALL | 384 |
| "vit_b" | depth-anything/DA3-BASE | 768 |
| "vit_l" | depth-anything/DA3-LARGE (default) | 1024 |
| "vit_g" | depth-anything/DA3-GIANT | 1536 |
from unicv.models.depth_anything_3 import DepthAnything3Model
model = DepthAnything3Model.from_pretrained(variant="vit_l")
model.eval()
result = model(rgb=image_tensor)  # image_tensor: (B, 3, H, W)
depth = result["depth"] # inverse-depth, (B, 1, H, W)
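Since the output above is described as inverse depth, a downstream consumer that expects depth has to invert it. The helper below is a minimal sketch under stated assumptions: `inverse_depth_to_depth` and its `eps` clamp are hypothetical, and if the prediction is only defined up to an unknown scale or shift, the result is relative rather than metric depth.

```python
from typing import List, Sequence


def inverse_depth_to_depth(
    inv_depth: Sequence[Sequence[float]], eps: float = 1e-6
) -> List[List[float]]:
    """Invert each value, clamping at eps to avoid division by zero."""
    return [[1.0 / max(v, eps) for v in row] for row in inv_depth]


depth_map = inverse_depth_to_depth([[0.5, 2.0], [0.0, 1.0]])
```

Values clamped at `eps` map to very large depths, so in practice they are often masked out instead of being used as geometry.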
Camera Depth Model (CDM)
Downloads camera-specific checkpoints. Choose the variant that matches your depth sensor. Requires huggingface_hub.
| camera | Sensor |
|---|---|
| "d405" (default) | Intel RealSense D405 |
| "d435" | Intel RealSense D435 |
| "l515" | Intel RealSense L515 |
| "kinect" | Azure Kinect |
from unicv.models.cdm import CameraDepthModel
model = CameraDepthModel.from_pretrained(camera="d405")
model.eval()
result = model(rgb=rgb_tensor, depth=raw_depth_tensor)
refined = result["depth"] # (B, 1, H, W)
SHARP
Downloads directly from Apple's CDN — no Hugging Face dependency required.
from unicv.models.sharp import SHARPModel
model = SHARPModel.from_pretrained()
model.eval()
result = model(rgb=image_tensor)  # image_tensor: (B, 3, H, W)
cloud = result["splat"] # GaussianCloud with N = H×W Gaussians
Note: the official SHARP checkpoint uses a DepthPro-based encoder (SlidingPyramidNetwork + TimmViT), not DINOv2. UniCV applies best-effort partial remappings for fusion-block conv weights; backbone weights are not transferable without a full architectural realignment. A UserWarning lists any missing keys at load time.
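With one Gaussian per pixel (N = H×W), output size grows quickly with resolution. The sketch below estimates it using the parameter layout that is conventional in 3D Gaussian splatting: 3 scales, a 4-component quaternion rotation, 1 opacity, and 3·(L+1)² spherical-harmonic coefficients for SH degree L. That layout and the function name `gaussian_param_counts` are assumptions; UniCV's GaussianHead may use different channel counts.

```python
from typing import Tuple


def gaussian_param_counts(
    height: int, width: int, sh_degree: int = 0
) -> Tuple[int, int, int]:
    """Return (num_gaussians, params_per_gaussian, total_params)."""
    num_gaussians = height * width  # one Gaussian per pixel
    sh_coeffs = 3 * (sh_degree + 1) ** 2  # RGB coefficients per SH band set
    per_gaussian = 3 + 4 + 1 + sh_coeffs  # scales + rotation + opacity + SH
    return num_gaussians, per_gaussian, num_gaussians * per_gaussian


n, per, total = gaussian_param_counts(2, 3, sh_degree=0)
n3, per3, total3 = gaussian_param_counts(2, 3, sh_degree=3)
```

Positions are left out of the count because the README's list of regressed parameters does not include them.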
Cache directory
All from_pretrained methods accept a cache_dir keyword:
model = DepthAnything3Model.from_pretrained(
variant="vit_l",
cache_dir="/data/model_cache",
)
Roadmap
Foundation — complete
- VisionModule base class, Modality / InputForm type system
- unicv.nn: DPTDecoder, Reassemble, FeatureFusionBlock (DPT architecture)
- unicv.nn: SDTHead (AnyDepth lightweight decoder)
- unicv.nn: MultiresConvDecoder, FOVNetwork (DepthPro)
- unicv.nn: GaussianHead, GaussianCloud, TriangleMesh
- unicv.nn: backproject_depth, homography_warp (camera projection utilities)
- unicv.nn: plane-sweep cost volume
- Full pytest suite (200+ tests, mocked external downloads, offline)
Depth estimation
| Model | Status |
|---|---|
| DepthPro | Done — architecture + pretrained weights |
| Depth Anything 3 | Done — architecture + pretrained weights |
| Camera Depth Model | Done — architecture + pretrained weights |
| SimpleRecon | Done — architecture (no public checkpoint) |
Gaussian splat models
| Model | Status |
|---|---|
| SHARP | Done — architecture + partial pretrained weights |
| DepthSplat | Planned — joint depth + splat from stereo/multi-view pairs |
| InstantSplat | Planned — pose-free pipeline via DUSt3R initialisation |
| LongSplat | Planned — online real-time splats from long video |
Mesh models
| Model | Status |
|---|---|
| SuGaR | Planned — Gaussian-based surface reconstruction |
| Hunyuan3D-2.1 | Planned — diffusion-based single-image 3D generation |
| TRELLIS.2 | Planned — sparse 3D VAE + flow-matching transformer |
Point cloud models
| Model | Status |
|---|---|
| POMATO | Planned — pose-aware multi-frame RGB → point cloud |
| MASt3R-SLAM | Planned — monocular SLAM via DUSt3R matching transformer |
Full Model Catalogue
| Model | Input | Form | Output |
|---|---|---|---|
| DepthPro | RGB | Single | Depth |
| Depth Anything 3 | RGB | Single | Depth |
| Camera Depth Model | RGB + Depth | Single | Depth |
| SimpleRecon | RGB | Temporal | Depth |
| SHARP | RGB | Single | Splat |
| DepthSplat | RGB | List | Splat |
| InstantSplat | RGB | Temporal | Splat |
| LongSplat | RGB | Temporal | Splat |
| SuGaR | RGB | List | Mesh |
| Hunyuan3D-2.1 | RGB | Single | Mesh |
| TRELLIS.2 | RGB | Single | Mesh |
| POMATO | RGB | Temporal | Point Cloud |
| MASt3R-SLAM | RGB | Temporal | Point Cloud |
Contributing
Contributions are welcome! See CONTRIBUTING.md for guidelines on adding new models, writing tests, and submitting pull requests.
Code of Conduct
This project follows the Contributor Covenant. By participating, you agree to uphold a welcoming and respectful environment for everyone.
License
MIT — see LICENSE.