
Unified deep learning models for Computer Vision


UniCV


UniCV is a unified, extensible framework for computer vision models that operate across heterogeneous input and output representations. It wraps state-of-the-art models — depth estimators, Gaussian splat predictors, mesh generators, and more — behind a single, composable VisionModule interface.


Philosophy

Modern computer vision has fragmented into dozens of incompatible APIs: each model ships with its own preprocessing, its own output format, and its own integration burden.

UniCV's architecture and design philosophy are inspired by modular deep learning ecosystems such as PyTorch and Hugging Face's Transformers, as well as recent efforts toward foundation models and generalist perception systems in computer vision. Rather than prescribing fixed pipelines (e.g. RGB → Depth or RGB → Mesh), UniCV abstracts vision algorithms as composable transformations between representation spaces.

The core abstraction of UniCV is VisionModule, which defines a standardized interface for mapping any combination of visual input modalities to any combination of output modalities. These modalities include, but are not limited to:

  • RGB images
  • Depth maps
  • Point clouds
  • Meshes
  • Gaussian splats and other implicit or semi-implicit scene representations

Concrete vision algorithms—such as monocular depth estimation, RGB-to-point-cloud reconstruction, or RGB-D refinement—are implemented as subclasses of this abstract interface. Existing models available online (e.g. DepthPro, MiDaS, CDM, or point-cloud reconstruction networks) can be redefined within this framework without altering their internal logic, allowing them to be seamlessly integrated into a shared system.

UniCV is therefore designed to accommodate the full spectrum of modern 3D perception: classical CNNs, ViT-based backbones, implicit neural representations, Gaussian splatting, and diffusion-based generation.

This abstraction enables UniCV to decouple input modality, latent processing, and output representation, encouraging reuse, composition, and extension of vision algorithms. Models may share encoders, latent spaces, or decoders, and can be combined or chained to support progressive or multi-stage reconstruction pipelines.
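As a rough illustration of chaining, the sketch below composes two stages that each map a dict of named modalities to a dict of named modalities. The names `chain`, `fake_depth_stage`, and `fake_pointcloud_stage` are hypothetical stand-ins invented for this example and are not part of UniCV; in practice the stages would be VisionModule subclasses.

```python
from typing import Any, Callable

# A stage is any callable that maps named modalities to named modalities.
Stage = Callable[..., dict]


def chain(*stages: Stage) -> Stage:
    """Compose stages so each one sees every modality produced so far."""

    def run(**inputs: Any) -> dict:
        available = dict(inputs)
        for stage in stages:
            available.update(stage(**available))
        return available

    return run


def fake_depth_stage(rgb, **unused):
    # Stand-in for a depth estimator: one constant depth value per pixel.
    return {"depth": [[1.0 for _pixel in row] for row in rgb]}


def fake_pointcloud_stage(depth, **unused):
    # Stand-in for a reconstruction stage: emit one (x, y, z) tuple per pixel.
    points = [
        (float(x), float(y), z)
        for y, row in enumerate(depth)
        for x, z in enumerate(row)
    ]
    return {"point_cloud": points}


pipeline = chain(fake_depth_stage, fake_pointcloud_stage)
result = pipeline(rgb=[[0, 0], [0, 0]])  # a 2x2 dummy "image"
```

Because every stage receives the accumulated dict, a later stage can consume both the original input and any intermediate representation, which is one simple way to express a multi-stage pipeline.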

The conceptual motivation for UniCV is closely aligned with the emergence of foundation models for perception, where a single system is expected to reason across tasks, representations, and data sources. By enforcing a common interface at the representation level, UniCV facilitates cross-representation supervision, multi-task learning, and interoperability between otherwise incompatible vision methods.

Motivation

UniCV aims to support vision systems that are:

  1. Modality-agnostic, capable of ingesting arbitrary combinations of visual inputs (e.g. RGB, depth, point clouds) without architectural redesign.
  2. Representation-agnostic, able to emit multiple scene representations from a shared latent abstraction.
  3. Algorithm-agnostic, allowing existing and future computer vision models to be wrapped, extended, or replaced under a common interface.
  4. Composable, enabling complex pipelines to be constructed by chaining or jointly training multiple VisionModule subclasses.
  5. Extensible, supporting both classical CV algorithms and modern deep learning approaches, including implicit scene representations and neural rendering techniques.
  6. Foundation-ready, serving as an architectural substrate for training large, generalist vision models capable of cross-task and cross-representation transfer.

In addition to standard convolutional and Transformer-based architectures, UniCV is designed to accommodate emerging paradigms such as implicit neural representations, Gaussian splatting, and hybrid geometric–neural pipelines, enabling a unified experimental platform for next-generation 3D perception systems.


Architecture Overview

src/unicv/
├── utils/
│   └── types.py              # Modality and InputForm enums
├── models/
│   ├── base.py               # VisionModule abstract base class
│   ├── depth_pro/            # DepthPro (Apple, 2024)
│   ├── depth_anything_3/     # Depth Anything 3 (ByteDance, 2025)
│   ├── cdm/                  # Camera Depth Model (ByteDance, 2025)
│   ├── sharp/                # SHARP (Apple, 2024)
│   └── simplerecon/          # SimpleRecon (Niantic, 2022)
└── nn/
    ├── decoder.py            # MultiresConvDecoder, FeatureFusionBlock2d
    ├── dpt.py                # DPTDecoder, Reassemble, FeatureFusionBlock
    ├── fov.py                # FOVNetwork (field-of-view estimation)
    ├── sdt.py                # SDTHead (AnyDepth lightweight decoder)
    ├── gaussian.py           # GaussianHead
    └── geometry.py           # backproject_depth, homography_warp
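The tree lists a backproject_depth utility in geometry.py. For readers unfamiliar with the operation, here is a minimal pure-Python sketch of standard pinhole back-projection. It is not UniCV's implementation: the function name `backproject_pixels`, its parameters, and the list-of-rows depth format are assumptions made for illustration.

```python
def backproject_pixels(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows, in metres) to camera-space 3D points.

    Assumes an ideal pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy), all expressed in pixels.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - cx) * z / fx  # horizontal offset scaled by depth
            y = (v - cy) * z / fy  # vertical offset scaled by depth
            points.append((x, y, z))
    return points


depth_map = [[2.0, 2.0], [4.0, 4.0]]
cloud = backproject_pixels(depth_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

A tensor-based version would vectorise the two loops, but the per-pixel arithmetic is the same.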

The VisionModule interface

Every model in UniCV inherits from VisionModule, whose interface consists of two declarative attributes and one method:

  • input_spec: dict[Modality, InputForm] — declares what the model needs (e.g. a single RGB image, a temporal sequence of RGB frames).
  • output_modalities: list[Modality] — declares what the model produces (e.g. a depth map, a Gaussian cloud).
  • forward(**inputs) -> dict[Modality, Any] — the actual computation.

Calling an instance validates inputs, dispatches to forward, and validates outputs automatically. Models can be chained, swapped, or jointly trained without rewriting pipelines.

An example is shown below:

from unicv.models.base import VisionModule
from unicv.utils.types import Modality, InputForm

class MyModel(VisionModule):
    input_spec         = {Modality.RGB: InputForm.SINGLE}
    output_modalities  = [Modality.DEPTH]

    def forward(self, **inputs):
        rgb   = inputs["rgb"]   # validated automatically
        depth = ...             # your model logic
        return {Modality.DEPTH: depth}

model  = MyModel()
result = model(rgb=image_tensor)   # → {Modality.DEPTH: tensor}
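
The validate, dispatch, validate flow that happens when an instance is called could look roughly like the following. This is a simplified stand-in written for illustration, not UniCV's actual base class: `SketchModality`, `SketchModule`, and `ConstantDepth` are hypothetical names, and the sketch ignores the InputForm that the real input_spec attaches to each input.

```python
from enum import Enum
from typing import Any


class SketchModality(Enum):
    RGB = "rgb"
    DEPTH = "depth"


class SketchModule:
    # Subclasses declare what they consume and what they produce.
    input_modalities: list = []
    output_modalities: list = []

    def forward(self, **inputs: Any) -> dict:
        raise NotImplementedError

    def __call__(self, **inputs: Any) -> dict:
        # 1. Validate the keyword arguments against the declared inputs.
        expected = {m.value for m in self.input_modalities}
        missing = expected - set(inputs)
        unexpected = set(inputs) - expected
        if missing or unexpected:
            raise ValueError(
                f"missing inputs: {sorted(missing)}, "
                f"unexpected inputs: {sorted(unexpected)}"
            )
        # 2. Dispatch to the subclass's computation.
        outputs = self.forward(**inputs)
        # 3. Validate that exactly the declared outputs came back.
        if set(outputs) != set(self.output_modalities):
            raise ValueError("forward() must return exactly the declared outputs")
        return outputs


class ConstantDepth(SketchModule):
    input_modalities = [SketchModality.RGB]
    output_modalities = [SketchModality.DEPTH]

    def forward(self, **inputs):
        rgb = inputs["rgb"]
        return {SketchModality.DEPTH: [[1.0] * len(row) for row in rgb]}


model = ConstantDepth()
result = model(rgb=[[0, 0, 0]])
```

Keeping the checks in `__call__` rather than in each `forward` means a subclass only has to declare its interface once.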

unicv.nn building blocks

| Class | File | Purpose |
|---|---|---|
| MultiresConvDecoder | decoder.py | Fuses multi-scale encoder maps finest → coarsest; used by DepthPro |
| FeatureFusionBlock2d | decoder.py | Single fusion step with optional residual skip and deconv upsampling |
| DPTDecoder | dpt.py | Full DPT pipeline: reassemble patch tokens → spatial maps, fuse; used by DA3 and CDM |
| Reassemble | dpt.py | Converts flat ViT patch tokens to 2-D feature maps at a configurable scale |
| FeatureFusionBlock | dpt.py | DPT-style fusion with 2× bilinear upsampling |
| FOVNetwork | fov.py | Estimates a scalar field-of-view from a low-resolution feature map |
| SDTHead | sdt.py | Lightweight decoder from AnyDepth: per-level attention + depth-wise fusion |
| GaussianHead | gaussian.py | Regresses per-pixel Gaussian parameters (scales, rotations, opacities, SH coefficients) |
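
To make the Reassemble entry concrete, the core rearrangement from a flat token sequence to a 2-D feature map can be sketched in plain Python. The helper `tokens_to_grid` is hypothetical, assumes row-major patch order, and leaves out the configurable rescaling that the real block is described as performing.

```python
def tokens_to_grid(tokens, grid_h, grid_w):
    """Rearrange N = grid_h * grid_w tokens of width C into a C x H x W map."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    channels = len(tokens[0])
    return [
        [
            [tokens[r * grid_w + c][ch] for c in range(grid_w)]
            for r in range(grid_h)
        ]
        for ch in range(channels)
    ]


# Six 2-channel tokens laid out on a 2 x 3 patch grid.
tokens = [[i, 10 * i] for i in range(6)]
feature_map = tokens_to_grid(tokens, grid_h=2, grid_w=3)
```

With tensors, the same step is typically a transpose followed by a reshape; the nested lists here only make the index mapping explicit.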

Implemented Models

| Model | Paper | Class | Input → Output | Pretrained |
|---|---|---|---|---|
| DepthPro | Apple, 2024 | DepthProModel | RGB → Depth | DepthProModel.from_pretrained() |
| Depth Anything 3 | ByteDance, 2025 | DepthAnything3Model | RGB → Depth | DepthAnything3Model.from_pretrained(variant=...) |
| Camera Depth Model | ByteDance, 2025 | CameraDepthModel | RGB + Depth → Depth | CameraDepthModel.from_pretrained(camera=...) |
| SHARP | Apple, 2024 | SHARPModel | RGB → Splat | SHARPModel.from_pretrained() |
| SimpleRecon | Niantic, 2022 | SimpleReconModel | RGB (temporal) → Depth | |

Installation

From PyPI

pip install unicv

From source

git clone https://github.com/aether-raid/unicv.git
cd unicv
pip install .

Development environment

Requires uv:

uv sync --dev
uv run pytest -v   # run the full test suite

Loading Pretrained Weights

Each model with a from_pretrained classmethod downloads the official checkpoint and loads it into the UniCV architecture. Install the optional dependencies first:

pip install huggingface_hub timm safetensors

DepthPro

Downloads Apple's DepthPro weights from apple/DepthPro on Hugging Face. Requires huggingface_hub and timm.

from unicv.models.depth_pro import DepthProModel

model = DepthProModel.from_pretrained()               # includes FoV head
model = DepthProModel.from_pretrained(use_fov_head=False)   # depth-only

model.eval()
result = model(rgb=image_tensor)   # image_tensor: (B, 3, 1536, 1536)
depth  = result["depth"]           # metric depth in metres

Depth Anything 3

Downloads from the depth-anything organisation. Requires huggingface_hub and safetensors.

| variant | Hugging Face repo | Backbone embed dim |
|---|---|---|
| "vit_s" | depth-anything/DA3-SMALL | 384 |
| "vit_b" | depth-anything/DA3-BASE | 768 |
| "vit_l" | depth-anything/DA3-LARGE (default) | 1024 |
| "vit_g" | depth-anything/DA3-GIANT | 1536 |

from unicv.models.depth_anything_3 import DepthAnything3Model

model = DepthAnything3Model.from_pretrained(variant="vit_l")
model.eval()

result = model(rgb=image_tensor)   # image_tensor: (B, 3, H, W)
depth  = result["depth"]           # inverse-depth, (B, 1, H, W)
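
The variant table can also be kept as data, for example to check a variant string before attempting a download. The dictionary below copies the table rows; `DA3_VARIANTS` and `resolve_variant` are hypothetical names for this sketch and are not part of the UniCV API.

```python
# variant -> (Hugging Face repo, backbone embed dim), copied from the table.
DA3_VARIANTS = {
    "vit_s": ("depth-anything/DA3-SMALL", 384),
    "vit_b": ("depth-anything/DA3-BASE", 768),
    "vit_l": ("depth-anything/DA3-LARGE", 1024),
    "vit_g": ("depth-anything/DA3-GIANT", 1536),
}


def resolve_variant(variant="vit_l"):
    """Return (repo_id, embed_dim) or raise a readable error."""
    try:
        return DA3_VARIANTS[variant]
    except KeyError:
        known = ", ".join(sorted(DA3_VARIANTS))
        raise ValueError(
            f"unknown variant {variant!r}; expected one of: {known}"
        ) from None


repo_id, embed_dim = resolve_variant("vit_b")
```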

Camera Depth Model (CDM)

Downloads camera-specific checkpoints. Choose the variant that matches your depth sensor. Requires huggingface_hub.

| camera | Sensor |
|---|---|
| "d405" (default) | Intel RealSense D405 |
| "d435" | Intel RealSense D435 |
| "l515" | Intel RealSense L515 |
| "kinect" | Azure Kinect |

from unicv.models.cdm import CameraDepthModel

model  = CameraDepthModel.from_pretrained(camera="d405")
model.eval()

result  = model(rgb=rgb_tensor, depth=raw_depth_tensor)
refined = result["depth"]   # (B, 1, H, W)

SHARP

Downloads directly from Apple's CDN — no Hugging Face dependency required.

from unicv.models.sharp import SHARPModel

model = SHARPModel.from_pretrained()
model.eval()

result = model(rgb=image_tensor)   # image_tensor: (B, 3, H, W)
cloud  = result["splat"]           # GaussianCloud with N = H×W Gaussians

Note — the official SHARP checkpoint uses a DepthPro-based encoder (SlidingPyramidNetwork + TimmViT), not DINOv2. UniCV applies best-effort partial remappings for fusion-block conv weights; backbone weights are not transferable without a full architectural realignment. A UserWarning lists any missing keys at load time.

Cache directory

All from_pretrained methods accept a cache_dir keyword:

model = DepthAnything3Model.from_pretrained(
    variant="vit_l",
    cache_dir="/data/model_cache",
)

Roadmap

Foundation — complete

  • VisionModule base class, Modality / InputForm type system
  • unicv.nn: DPTDecoder, Reassemble, FeatureFusionBlock (DPT architecture)
  • unicv.nn: SDTHead (AnyDepth lightweight decoder)
  • unicv.nn: MultiresConvDecoder, FOVNetwork (DepthPro)
  • unicv.nn: GaussianHead, GaussianCloud, TriangleMesh
  • unicv.nn: backproject_depth, homography_warp (camera projection utilities)
  • unicv.nn: plane-sweep cost volume
  • Full pytest suite (200+ tests, mocked external downloads, offline)

Depth estimation

| Model | Status | Notes |
|---|---|---|
| DepthPro | Done | architecture + pretrained weights |
| Depth Anything 3 | Done | architecture + pretrained weights |
| Camera Depth Model | Done | architecture + pretrained weights |
| SimpleRecon | Done | architecture (no public checkpoint) |

Gaussian splat models

| Model | Status | Notes |
|---|---|---|
| SHARP | Done | architecture + partial pretrained weights |
| DepthSplat | Planned | joint depth + splat from stereo/multi-view pairs |
| InstantSplat | Planned | pose-free pipeline via DUSt3R initialisation |
| LongSplat | Planned | online real-time splats from long video |

Mesh models

| Model | Status | Notes |
|---|---|---|
| SuGaR | Planned | Gaussian-based surface reconstruction |
| Hunyuan3D-2.1 | Planned | diffusion-based single-image 3D generation |
| TRELLIS.2 | Planned | sparse 3D VAE + flow-matching transformer |

Point cloud models

| Model | Status | Notes |
|---|---|---|
| POMATO | Planned | pose-aware multi-frame RGB → point cloud |
| MASt3R-SLAM | Planned | monocular SLAM via DUSt3R matching transformer |

Full Model Catalogue

| Model | Input | Form | Output |
|---|---|---|---|
| DepthPro | RGB | Single | Depth |
| Depth Anything 3 | RGB | Single | Depth |
| Camera Depth Model | RGB + Depth | Single | Depth |
| SimpleRecon | RGB | Temporal | Depth |
| SHARP | RGB | Single | Splat |
| DepthSplat | RGB | List | Splat |
| InstantSplat | RGB | Temporal | Splat |
| LongSplat | RGB | Temporal | Splat |
| SuGaR | RGB | List | Mesh |
| Hunyuan3D-2.1 | RGB | Single | Mesh |
| TRELLIS.2 | RGB | Single | Mesh |
| POMATO | RGB | Temporal | Point Cloud |
| MASt3R-SLAM | RGB | Temporal | Point Cloud |

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines on adding new models, writing tests, and submitting pull requests.

Code of Conduct

This project follows the Contributor Covenant. By participating, you agree to uphold a welcoming and respectful environment for everyone.

License

MIT — see LICENSE.
