Skip to main content

GPU-accelerated video stabilization with a deep-learning homography model and a custom fused CUDA autograd kernel.

Project description

gimbal-engine

GPU video stabilization with two interchangeable camera motion estimators, a custom CUDA pipeline and a learned homography network, benchmarked head to head.

The gimbal CLI on a stabilize run

Install

pip install gimbal-engine

Building the compiled CUDA extension needs an NVIDIA GPU and a CUDA toolkit. The trained weights ship inside the package, so a successful install can stabilize immediately with no extra download. If you do not have a local toolchain, the Docker path under Build and install builds everything.

Gimbal Engine stabilizes shaky video on the GPU. Its real subject is a head to head comparison of two interchangeable camera motion estimators: a classical pipeline written in CUDA (pyramidal Lucas-Kanade tracking with RANSAC homography fitting) and an iterative homography network (IHN) trained from scratch. Both sit behind one shared Estimator interface and feed the same back end (trajectory smoothing, GPU warping, auto crop, and the standard stabilization metrics), so they can be swapped and measured on identical footage. The two share more than that interface: each turns its four corner estimates into a homography through the same differentiable Tensor-DLT, solved on the GPU.

Stabilization, side by side

Three NUS clips, across rotation, running, and crowd scenes. Each row is one clip: the shaky input, the gimbal IHN result, and the classical CUDA result, with the stability score under each.

shaky input gimbal IHN result classical result
Shaky input gimbal IHN
stability 0.928
Classical
stability 0.264
QuickRotation/19.avi
shaky input gimbal IHN result classical result
Shaky input gimbal IHN
stability 0.973
Classical
stability 0.631
Running/1.avi
shaky input gimbal IHN result classical result
Shaky input gimbal IHN
stability 0.908
Classical
stability 0.495
Crowd/14.avi

These are clips where the IHN is strongest. On large zoom and parallax the classical pipeline is steadier, and the full per category numbers, wins and losses, are in Results.

Highlights · What is inside · Architecture · Results · Correctness · Build and install

Highlights

  • A fused local correlation CUDA operator with its own forward and backward pass. Against the PyTorch reference it is 26.3x faster and uses 1.72x less memory (forward and backward, RTX 5070 Ti laptop).
  • The trained IHN reaches a sub pixel mean average corner error of 0.863 px on held out synthetic pairs, against 6.489 px for a single shot regression baseline. The iterative refinement is the difference.
  • A mesh (multi homography) model that fits a grid of local homographies and reduces exactly to the single global homography at a 1x1 grid. On synthetic parallax it lowers corner error against the global model by 6.5 px (see the mesh study).
  • One Estimator interface for all three. The classical pipeline, the global IHN, and the mesh model return the same MotionField, so the pipeline never knows which one it is running.
  • Field standard evaluation: the NUS dataset, reported as the cropping ratio, distortion value, and stability score triplet, plus CUDA event timing.
  • The whole inference loop captured into a CUDA graph, which removes the launch overhead that dominates at this size and gives an 11.4x end to end speedup.

What is inside

Two estimators, one back end. The core of the project is the comparison between them and the geometry they share.

Component What it is
Classical estimator Shi-Tomasi corners, pyramidal Lucas-Kanade tracking, RANSAC homography fitting, all in CUDA
Learned estimator (IHN) Feature encoder, local correlation cost volume, iterative 4 point refinement, differentiable Tensor-DLT
Regression baseline Single shot 4 point regression (the ablation control for the IHN)
Mesh estimator A grid of per cell homographies (MeshFlow style), reducing to the global model at a 1x1 grid
Fused correlation op A compiled CUDA autograd operator for the cost volume, gated behind a gradient check
Shared back end Trajectory smoothing, GPU warp, auto crop, and the stabilization metric triplet
Smoothers Gaussian, Kalman RTS, and L1-TV camera path smoothing

The classical and learned estimators are interchangeable because they agree on one contract: take two consecutive grayscale frames, return a MotionField that maps frame A coordinates to frame B coordinates. Everything downstream, the smoothing, the warp, the metrics, sees only that result and never the model that produced it.

Architecture

flowchart LR
  V[Input clip] --> P[grayscale frame pairs]
  P --> E{Estimator interface}
  E -->|classical| C[CUDA LK plus RANSAC]
  E -->|learned| I[IHN]
  E -->|mesh| M[MeshIHN]
  C --> F[MotionField]
  I --> F
  M --> F
  F --> S[trajectory smoothing]
  S --> W[GPU warp and auto crop]
  W --> O[Stabilized clip plus metrics]

The classical estimator runs the parallel path entirely in CUDA: Shi-Tomasi corner detection, pyramidal Lucas-Kanade tracking of those corners across the frame pair, and a RANSAC homography fit over the surviving matches, with a degenerate fit falling back to identity rather than a bad warp.

The IHN follows the iterative homography idea. A shared encoder turns both frames into feature maps. At each of six iterations the model builds a local correlation cost volume between the current warped features and the target, predicts an update to four corner offsets, and turns those offsets into a homography with the Tensor-DLT. The cost volume is the hot path, which is why it has a dedicated fused CUDA operator. The mesh model replaces the single set of four corners with a grid of cells, each with its own local homography, blended into a smooth sampling field; with a 1x1 grid it is identical to the global IHN.

Both paths produce a homography per frame pair. The shared back end turns that sequence into a stabilized clip: it accumulates the per frame motion into a camera path, smooths the path (Gaussian, Kalman RTS, or L1-TV), warps each frame by the difference between the original and smoothed path on the GPU, and auto crops to the largest rectangle that stays inside every warped frame.

Results

All numbers below are measured on an RTX 5070 Ti laptop GPU (Blackwell, sm_120), torch 2.11.0+cu128.

Stabilization on NUS

Classical against the learned IHN across all six NUS scene categories (144 clips, the shipped IHN trained only on synthetic data so the entire NUS set is held out). Higher stability is better; the throughput column is the per frame rate.

CategoryClassical stabilityIHN stabilityClassical fpsIHN fps
Regular0.8860.86416.527.8
QuickRotation0.8620.89716.327.2
Zooming0.8790.76616.925.2
Parallax0.8770.81216.325.6
Crowd0.8480.83316.724.9
Running0.8480.85216.326.7
Mean0.8670.83716.526.3

The IHN wins the hard rotation case and runs about 1.6x faster everywhere. The classical pipeline is steadier on large zoom and parallax, which are the motions furthest from the IHN's synthetic training distribution.

NUS benchmark dashboard

Quality against speed, per category

Training ablation

Mean average corner error (MACE) on held out synthetic COCO pairs, lower is better. The iterative model and the single shot regression baseline use the same data and encoder.

ModelBest MACE
IHN (iterative, 6 steps)0.863 px
Regression baseline (single shot)6.489 px

Iterative refinement is roughly 7.5x more accurate than predicting the homography in one shot, and it lands at sub pixel error.

Systems study

The cost volume operator and the inference loop, measured on the same GPU. The full study, including the roofline and the optimization log, is in perf_study.

MeasurementResult
Fused correlation against the PyTorch reference26.3x faster, 1.72x less memory
CUDA graph replay against eager inference11.4x faster (41.2 ms to 3.61 ms per call)
fp16 accuracy cost (MACE)+0.002 px
bf16 accuracy cost (MACE)+0.029 px

Fused correlation roofline

The roofline shows why the simplest kernel wins: at a 16x16 cost volume the operation is latency and occupancy bound, not compute bound, so launching enough threads with coalesced loads beats reducing arithmetic.

Correctness

Each GPU component is checked against an independent reference. The full suite is 41 tests.

Check Reference Result
Scharr gradient kernel OpenCV cv2.Scharr match to 1e-3
Gaussian downsample kernel OpenCV cv2.pyrDown match to 1e-2
Shi-Tomasi corner response NumPy and OpenCV reference match
Classical estimator known homography (cv2.warpPerspective) recovers translation and rotation
Tensor-DLT known homography match to 1e-3
Tensor-DLT gradient torch.autograd.gradcheck passes
Fused correlation forward PyTorch reference match to 1e-4
Fused correlation backward PyTorch autograd match to 1e-4
Fused correlation gradient gradcheck in float64 passes
No pivot DLT solve cuSOLVER (torch.linalg.solve) match to 1e-4
CUDA graph replay eager execution max error 4e-5
Mesh 1x1 grid single global homography match to 1e-5
Global MotionField path raw homography product bit exact
Phase B adoption guard a deliberately degrading run reverts to the kept weights

The fused correlation operator is gated: it is only used after it passes the gradient check against the PyTorch reference, otherwise the model falls back to the reference implementation.

Build and install

The package is published as a source distribution. pip compiles the CUDA extension on your machine at install time, so it adapts to your CUDA version and GPU architecture, and the trained weights are bundled inside the package.

With a CUDA toolchain

pip install gimbal-engine

This needs an NVIDIA GPU and a CUDA toolkit (nvcc) that matches your PyTorch build. The build detects your GPU architecture; if nvcc is missing it stops with a clear message rather than a compiler error.

With Docker

If you do not have a local toolchain, the included image carries CUDA, PyTorch, and the build tools.

./run.ps1 image          # build the image
./run.ps1 cli stabilize input.mp4 output.mp4 --estimator ihn

Use it

gimbal stabilize input.mp4 output.mp4 --estimator ihn      # learned model, bundled weights
gimbal stabilize input.mp4 output.mp4 --estimator classical
gimbal benchmark                                            # classical against IHN on NUS
gimbal info                                                 # GPU and library versions

gimbal requires CUDA cores, so will likely require an external GPU. It will proceed to error without this.

License

MIT. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gimbal_engine-2.1.2.tar.gz (3.2 MB view details)

Uploaded Source

File details

Details for the file gimbal_engine-2.1.2.tar.gz.

File metadata

  • Download URL: gimbal_engine-2.1.2.tar.gz
  • Upload date:
  • Size: 3.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for gimbal_engine-2.1.2.tar.gz
Algorithm Hash digest
SHA256 0621cfac7055531b22e1b75b687ebacba925484bbc58c37ed9bfddfa86a29721
MD5 62fcd9c9c7f8697c1ce97f5cac8459c8
BLAKE2b-256 a3558528b7bdde1b766335054c73724553a04292f9ac06f0c8fa0874345c608b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page