GPU-accelerated video stabilization with a deep-learning homography model and a custom fused CUDA autograd kernel.

gimbal-engine

GPU video stabilization with two interchangeable camera motion estimators, a custom CUDA pipeline and a learned homography network, benchmarked head to head.

Figure: the gimbal CLI on a stabilize run.

Install
pip install gimbal_engine
Building the compiled CUDA extension needs an NVIDIA GPU and a CUDA toolkit. The trained weights ship inside the package, so a successful install can stabilize immediately with no extra download. If you do not have a local toolchain, the Docker path under Build and install builds everything.

Gimbal Engine stabilizes shaky video on the GPU. Its real subject is a head to head comparison of two interchangeable camera motion estimators: a classical pipeline written in CUDA (pyramidal Lucas-Kanade tracking with RANSAC homography fitting) and an iterative homography network (IHN) trained from scratch. Both sit behind one shared Estimator interface and feed the same back end (trajectory smoothing, GPU warping, auto crop, and the standard stabilization metrics), so they can be swapped and measured on identical footage. The two share more than that interface: each turns its four corner estimates into a homography through the same differentiable Tensor-DLT, solved on the GPU.
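To make the four corner step concrete, here is a minimal sketch of the direct linear transform that turns four corner offsets into a 3x3 homography. It is plain Python on the CPU, it is not differentiable, and it uses partial pivoting for simplicity, so it illustrates the geometry only and is not the package's Tensor-DLT. The names `solve_linear`, `four_point_to_homography`, and `apply_homography` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix = List[List[float]]


def solve_linear(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def four_point_to_homography(corners: Sequence[Point], offsets: Sequence[Point]) -> Matrix:
    """Build the 8x8 DLT system for four correspondences, with h33 fixed to 1."""
    rows, rhs = [], []
    for (x, y), (dx, dy) in zip(corners, offsets):
        u, v = x + dx, y + dy  # where this corner lands in frame B
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h: Matrix, point: Point) -> Point:
    """Map a point through h, including the projective divide."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

Moving all four corners of a 100x100 frame by the same offset should recover a pure translation, which is a quick sanity check on the system's signs.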

Stabilization, side by side

Three NUS clips, across rotation, running, and crowd scenes. Each row is one clip, with the stability score of the gimbal IHN result and of the classical CUDA result.

Clip	gimbal IHN stability	Classical stability
QuickRotation/19.avi	0.928	0.264
Running/1.avi	0.973	0.631
Crowd/14.avi	0.908	0.495

These are clips where the IHN is strongest. On large zoom and parallax the classical pipeline is steadier, and the full per category numbers, wins and losses, are in Results.

Highlights · What is inside · Architecture · Results · Correctness · Build and install

Highlights

A fused local correlation CUDA operator with its own forward and backward pass. Against the PyTorch reference it is 26.3x faster and uses 1.72x less memory (forward and backward, RTX 5070 Ti laptop).
The trained IHN reaches a sub pixel mean average corner error of 0.863 px on held out synthetic pairs, against 6.489 px for a single shot regression baseline. The iterative refinement is the difference.
A mesh (multi homography) model that fits a grid of local homographies and reduces exactly to the single global homography at a 1x1 grid. On synthetic parallax it lowers corner error against the global model by 6.5 px (see the mesh study).
One Estimator interface for all three. The classical pipeline, the global IHN, and the mesh model return the same MotionField, so the pipeline never knows which one it is running.
Field standard evaluation: the NUS dataset, reported as the cropping ratio, distortion value, and stability score triplet, plus CUDA event timing.
The whole inference loop captured into a CUDA graph, which removes the launch overhead that dominates at this size and gives an 11.4x end to end speedup.

What is inside

Two estimators, one back end. The core of the project is the comparison between them and the geometry they share.

Component	What it is
Classical estimator	Shi-Tomasi corners, pyramidal Lucas-Kanade tracking, RANSAC homography fitting, all in CUDA
Learned estimator (IHN)	Feature encoder, local correlation cost volume, iterative 4 point refinement, differentiable Tensor-DLT
Regression baseline	Single shot 4 point regression (the ablation control for the IHN)
Mesh estimator	A grid of per cell homographies (MeshFlow style), reducing to the global model at a 1x1 grid
Fused correlation op	A compiled CUDA autograd operator for the cost volume, gated behind a gradient check
Shared back end	Trajectory smoothing, GPU warp, auto crop, and the stabilization metric triplet
Smoothers	Gaussian, Kalman RTS, and L1-TV camera path smoothing

The classical and learned estimators are interchangeable because they agree on one contract: take two consecutive grayscale frames, return a MotionField that maps frame A coordinates to frame B coordinates. Everything downstream, the smoothing, the warp, the metrics, sees only that result and never the model that produced it.
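The contract can be sketched as a small protocol. This is an illustration of the idea only: `MotionField`, `Estimator`, `ShiftEstimator`, `IdentityEstimator`, and `track_point` below are hypothetical stand-ins written for this example, and the real classes may carry different fields and method names.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence, Tuple

Frame = Sequence[Sequence[float]]  # grayscale image as rows of intensities
Matrix3 = List[List[float]]


@dataclass(frozen=True)
class MotionField:
    """Hypothetical result type: a global homography from frame A to frame B."""
    homography: Matrix3

    def map_point(self, x: float, y: float) -> Tuple[float, float]:
        h = self.homography
        w = h[2][0] * x + h[2][1] * y + h[2][2]
        return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
                (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


class Estimator(Protocol):
    """Anything that maps two consecutive frames to a MotionField."""
    def estimate(self, frame_a: Frame, frame_b: Frame) -> MotionField: ...


class ShiftEstimator:
    """Toy estimator that always reports the same translation."""
    def __init__(self, dx: float, dy: float) -> None:
        self.dx, self.dy = dx, dy

    def estimate(self, frame_a: Frame, frame_b: Frame) -> MotionField:
        return MotionField([[1.0, 0.0, self.dx], [0.0, 1.0, self.dy], [0.0, 0.0, 1.0]])


class IdentityEstimator:
    """Toy estimator that reports no motion at all."""
    def estimate(self, frame_a: Frame, frame_b: Frame) -> MotionField:
        return MotionField([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])


def track_point(estimator: Estimator, frames: Sequence[Frame],
                start: Tuple[float, float]) -> Tuple[float, float]:
    """Downstream code: sees only MotionField, never the model behind it."""
    x, y = start
    for frame_a, frame_b in zip(frames, frames[1:]):
        x, y = estimator.estimate(frame_a, frame_b).map_point(x, y)
    return x, y
```

Because `track_point` is typed against the protocol, either toy estimator can be passed in without changing it, which is the property that lets the estimators be swapped and measured on identical footage.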

Architecture

flowchart LR
  V[Input clip] --> P[grayscale frame pairs]
  P --> E{Estimator interface}
  E -->|classical| C[CUDA LK plus RANSAC]
  E -->|learned| I[IHN]
  E -->|mesh| M[MeshIHN]
  C --> F[MotionField]
  I --> F
  M --> F
  F --> S[trajectory smoothing]
  S --> W[GPU warp and auto crop]
  W --> O[Stabilized clip plus metrics]

The classical estimator runs the parallel path entirely in CUDA: Shi-Tomasi corner detection, pyramidal Lucas-Kanade tracking of those corners across the frame pair, and a RANSAC homography fit over the surviving matches, with a degenerate fit falling back to identity rather than a bad warp.
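The identity fallback can be pictured as a guard placed after the RANSAC fit. The sketch below is an assumption about what such a guard might check (inlier count, finite entries, a non-vanishing determinant); the function name `accept_or_identity` and both thresholds are hypothetical and are not taken from the package.

```python
import math
from typing import Optional, Sequence, Tuple

Row = Tuple[float, float, float]
Matrix3 = Tuple[Row, Row, Row]

IDENTITY: Matrix3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


def determinant3(h: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (h[0][0] * (h[1][1] * h[2][2] - h[1][2] * h[2][1])
            - h[0][1] * (h[1][0] * h[2][2] - h[1][2] * h[2][0])
            + h[0][2] * (h[1][0] * h[2][1] - h[1][1] * h[2][0]))


def accept_or_identity(h: Optional[Sequence[Sequence[float]]], num_inliers: int,
                       min_inliers: int = 8, min_det: float = 1e-6) -> Matrix3:
    """Return the fitted homography if it looks usable, otherwise identity.

    min_inliers and min_det are illustrative thresholds, not tuned values.
    """
    if h is None or num_inliers < min_inliers:
        return IDENTITY  # too little support for the fit
    if not all(math.isfinite(v) for row in h for v in row):
        return IDENTITY  # NaN or inf would poison the warp
    if abs(determinant3(h)) < min_det:
        return IDENTITY  # near-singular: the frame would collapse
    return tuple(tuple(float(v) for v in row) for row in h)  # type: ignore[return-value]
```

Returning identity leaves the frame pair uncorrected, which costs some stability on that pair but avoids warping the frame with a bad fit.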

The IHN follows the iterative homography idea. A shared encoder turns both frames into feature maps. At each of six iterations the model builds a local correlation cost volume between the current warped features and the target, predicts an update to four corner offsets, and turns those offsets into a homography with the Tensor-DLT. The cost volume is the hot path, which is why it has a dedicated fused CUDA operator. The mesh model replaces the single set of four corners with a grid of cells, each with its own local homography, blended into a smooth sampling field; with a 1x1 grid it is identical to the global IHN.
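For readers unfamiliar with a local correlation cost volume, the following is a slow reference sketch in plain Python. It assumes features stored as nested lists indexed `[channel][row][column]`, a square search window of a given radius, zero cost outside the image, and no normalization; the fused CUDA operator and the PyTorch reference in the package may differ on each of these points. `local_correlation` is a hypothetical name.

```python
from typing import List, Tuple

Features = List[List[List[float]]]  # [channel][row][column]


def local_correlation(feat_a: Features, feat_b: Features,
                      radius: int) -> Tuple[List[List[List[float]]], List[Tuple[int, int]]]:
    """For every pixel of feat_a, dot its feature vector with feat_b's
    vectors at each displacement inside a (2*radius+1)^2 window."""
    channels = len(feat_a)
    height, width = len(feat_a[0]), len(feat_a[0][0])
    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)]
    volume = []
    for y in range(height):
        row = []
        for x in range(width):
            costs = []
            for dy, dx in offsets:
                yy, xx = y + dy, x + dx
                if 0 <= yy < height and 0 <= xx < width:
                    costs.append(sum(feat_a[c][y][x] * feat_b[c][yy][xx]
                                     for c in range(channels)))
                else:
                    costs.append(0.0)  # displacement falls outside the image
            row.append(costs)
        volume.append(row)
    return volume, offsets  # volume[y][x][k] pairs with offsets[k]
```

The triple loop makes the memory access pattern visible: every output element reads one vector from each feature map and performs very little arithmetic per read.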

Both paths produce a homography per frame pair. The shared back end turns that sequence into a stabilized clip: it accumulates the per frame motion into a camera path, smooths the path (Gaussian, Kalman RTS, or L1-TV), warps each frame by the difference between the original and smoothed path on the GPU, and auto crops to the largest rectangle that stays inside every warped frame.
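The accumulate, smooth, and correct steps can be sketched on a one dimensional path. This is a deliberate simplification: it tracks a single translation component per frame instead of a full homography, it shows only the Gaussian smoother, and it clamps at the clip boundaries, which is an assumption of this example. The function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def accumulate(motions: Sequence[float]) -> List[float]:
    """Turn per-frame-pair motion into a camera path, starting at 0."""
    path = [0.0]
    for m in motions:
        path.append(path[-1] + m)
    return path


def gaussian_smooth(path: Sequence[float], sigma: float = 2.0,
                    radius: Optional[int] = None) -> List[float]:
    """Gaussian-weighted moving average with clamped (replicated) edges."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    taps = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in taps]
    total = sum(weights)
    last = len(path) - 1
    smoothed = []
    for i in range(len(path)):
        acc = 0.0
        for k, w in zip(taps, weights):
            j = min(max(i + k, 0), last)  # clamp index at clip boundaries
            acc += w * path[j]
        smoothed.append(acc / total)
    return smoothed


def corrections(path: Sequence[float], smoothed: Sequence[float]) -> List[float]:
    """Per-frame shift that moves each frame from the shaky path to the smooth one."""
    return [s - p for p, s in zip(path, smoothed)]
```

A larger `sigma` gives a steadier path but larger corrections, and larger corrections push more of each warped frame outside the common rectangle, so the auto crop keeps less of the image.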

Results

All numbers below are measured on an RTX 5070 Ti laptop GPU (Blackwell, sm_120), torch 2.11.0+cu128.

Stabilization on NUS

Classical against the learned IHN across all six NUS scene categories (144 clips). The shipped IHN was trained only on synthetic data, so the entire NUS set is held out. Higher stability is better; the fps columns give throughput in frames per second.

Category	Classical stability	IHN stability	Classical fps	IHN fps
Regular	0.886	0.864	16.5	27.8
QuickRotation	0.862	0.897	16.3	27.2
Zooming	0.879	0.766	16.9	25.2
Parallax	0.877	0.812	16.3	25.6
Crowd	0.848	0.833	16.7	24.9
Running	0.848	0.852	16.3	26.7
Mean	0.867	0.837	16.5	26.3

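The IHN wins the hard rotation case (QuickRotation, 0.897 against 0.862), edges ahead on Running, and runs about 1.5x to 1.7x faster in every category (1.6x on the mean). The classical pipeline is steadier on the other four categories, most clearly on Zooming and Parallax, which are the motions furthest from the IHN's synthetic training distribution.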
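The speed and stability comparison can be recomputed directly from the table. The snippet below only restates the published per category numbers; the variable names are chosen for this example.

```python
# (classical stability, IHN stability, classical fps, IHN fps), copied from the table
nus = {
    "Regular":       (0.886, 0.864, 16.5, 27.8),
    "QuickRotation": (0.862, 0.897, 16.3, 27.2),
    "Zooming":       (0.879, 0.766, 16.9, 25.2),
    "Parallax":      (0.877, 0.812, 16.3, 25.6),
    "Crowd":         (0.848, 0.833, 16.7, 24.9),
    "Running":       (0.848, 0.852, 16.3, 26.7),
}

# Throughput ratio of the IHN over the classical pipeline, per category
speedup = {name: ihn_fps / cls_fps
           for name, (_, _, cls_fps, ihn_fps) in nus.items()}

# Categories where the IHN has the higher stability score
ihn_wins = [name for name, (cls_stab, ihn_stab, _, _) in nus.items()
            if ihn_stab > cls_stab]

# Stability gap (classical minus IHN); positive means classical is steadier
stability_gap = {name: round(cls_stab - ihn_stab, 3)
                 for name, (cls_stab, ihn_stab, _, _) in nus.items()}

for name in nus:
    print(f"{name:14s} speedup {speedup[name]:.2f}x  gap {stability_gap[name]:+.3f}")
```

Zooming shows the largest gap in favor of the classical pipeline, and the IHN is faster in every row.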

Figure: NUS benchmark dashboard.

Figure: quality against speed, per category.

Training ablation

Mean average corner error (MACE) on held out synthetic COCO pairs, lower is better. The iterative model and the single shot regression baseline use the same data and encoder.

Model	Best MACE
IHN (iterative, 6 steps)	0.863 px
Regression baseline (single shot)	6.489 px

Iterative refinement is roughly 7.5x more accurate than predicting the homography in one shot, and it lands at sub pixel error.
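For reference, MACE is usually computed as the Euclidean distance between predicted and ground truth corner positions, averaged over the four corners and then over all pairs. The sketch below follows that common definition; it is an assumption that the project's evaluation code matches it exactly, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Corners = Sequence[Point]


def average_corner_error(pred: Corners, truth: Corners) -> float:
    """Mean Euclidean distance over the corners of one image pair, in pixels."""
    distances = [math.hypot(px - tx, py - ty)
                 for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(distances) / len(distances)


def mace(pairs: Sequence[Tuple[Corners, Corners]]) -> float:
    """Mean of the per-pair average corner error over a dataset."""
    return sum(average_corner_error(p, t) for p, t in pairs) / len(pairs)


# Ratio of the two reported results: 6.489 px against 0.863 px
improvement = 6.489 / 0.863
```

With this definition a prediction that places every corner 3 px right and 4 px down of the truth has an average corner error of 5 px.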

Systems study

The cost volume operator and the inference loop, measured on the same GPU. The full study, including the roofline and the optimization log, is in perf_study.

Measurement	Result
Fused correlation against the PyTorch reference	26.3x faster, 1.72x less memory
CUDA graph replay against eager inference	11.4x faster (41.2 ms to 3.61 ms per call)
fp16 accuracy cost (MACE)	+0.002 px
bf16 accuracy cost (MACE)	+0.029 px

Figure: fused correlation roofline.

The roofline shows why the simplest kernel wins: at a 16x16 cost volume the operation is latency and occupancy bound, not compute bound, so launching enough threads with coalesced loads beats reducing arithmetic.

Correctness

Each GPU component is checked against an independent reference. The full suite is 41 tests.

Check	Reference	Result
Scharr gradient kernel	OpenCV `cv2.Scharr`	match to 1e-3
Gaussian downsample kernel	OpenCV `cv2.pyrDown`	match to 1e-2
Shi-Tomasi corner response	NumPy and OpenCV reference	match
Classical estimator	known homography (`cv2.warpPerspective`)	recovers translation and rotation
Tensor-DLT	known homography	match to 1e-3
Tensor-DLT gradient	`torch.autograd.gradcheck`	passes
Fused correlation forward	PyTorch reference	match to 1e-4
Fused correlation backward	PyTorch autograd	match to 1e-4
Fused correlation gradient	gradcheck in float64	passes
No pivot DLT solve	cuSOLVER (`torch.linalg.solve`)	match to 1e-4
CUDA graph replay	eager execution	max error 4e-5
Mesh 1x1 grid	single global homography	match to 1e-5
Global MotionField path	raw homography product	bit exact
Phase B adoption guard	a deliberately degrading run	reverts to the kept weights

The fused correlation operator is gated: it is only used after it passes the gradient check against the PyTorch reference, otherwise the model falls back to the reference implementation.
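The gating pattern can be illustrated without a GPU. The sketch below checks a candidate operator's hand-written backward pass against central finite differences and falls back to a reference when the check fails. It is a toy on Python lists, not the package's gate: the real check runs on the CUDA operator, and `numeric_grad`, `passes_gradient_check`, and `select_op` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

Forward = Callable[[Sequence[float]], float]
Backward = Callable[[Sequence[float]], List[float]]
Op = Tuple[Forward, Backward]


def numeric_grad(f: Forward, x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Central finite-difference gradient of a scalar function."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def passes_gradient_check(forward: Forward, backward: Backward,
                          x: Sequence[float], tol: float = 1e-4) -> bool:
    """Compare the analytic backward pass with the numeric gradient."""
    analytic = backward(x)
    numeric = numeric_grad(forward, x)
    return all(abs(a - n) <= tol for a, n in zip(analytic, numeric))


def select_op(candidate: Op, reference: Op, probe: Sequence[float],
              tol: float = 1e-4) -> Op:
    """Use the candidate only if its forward matches the reference and its
    gradient passes the check; otherwise fall back to the reference."""
    fwd, bwd = candidate
    ref_fwd, _ = reference
    if abs(fwd(probe) - ref_fwd(probe)) > tol:
        return reference
    if not passes_gradient_check(fwd, bwd, probe, tol):
        return reference
    return candidate


# Toy operators: sum of squares, whose gradient is 2x
reference: Op = (lambda x: sum(v * v for v in x), lambda x: [2.0 * v for v in x])
good_fast: Op = (lambda x: sum(v * v for v in x), lambda x: [2.0 * v for v in x])
broken_fast: Op = (lambda x: sum(v * v for v in x), lambda x: [v for v in x])
```

The broken candidate has a correct forward pass and a wrong backward pass, which is the kind of bug a forward-only comparison would miss and a gradient check catches.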

Build and install

The package is published as a source distribution. pip compiles the CUDA extension on your machine at install time, so it adapts to your CUDA version and GPU architecture, and the trained weights are bundled inside the package.

With a CUDA toolchain

pip install gimbal_engine

This needs an NVIDIA GPU and a CUDA toolkit (nvcc) that matches your PyTorch build. The build detects your GPU architecture; if nvcc is missing it stops with a clear message rather than a compiler error.
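Before running the install, it can help to confirm that `nvcc` is on the PATH. The helper below is a convenience sketch for that check and is not the package's own build-time detection; `toolchain_report` is a hypothetical name. It relies only on `shutil.which` and on `nvcc --version`.

```python
import shutil
import subprocess


def toolchain_report() -> str:
    """Report whether nvcc is reachable and, if so, its version line."""
    path = shutil.which("nvcc")
    if path is None:
        return "nvcc not found on PATH; install a CUDA toolkit or use the Docker path"
    try:
        result = subprocess.run([path, "--version"], capture_output=True,
                                text=True, check=False, timeout=10)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"nvcc found at {path} but could not be run: {exc}"
    lines = [line for line in result.stdout.splitlines() if line.strip()]
    version = lines[-1] if lines else "no version output"
    return f"nvcc found at {path}: {version}"


if __name__ == "__main__":
    print(toolchain_report())
```

To check the match with PyTorch, the reported toolkit version can be compared by hand against `torch.version.cuda` in the environment where the package will be installed.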

With Docker

If you do not have a local toolchain, the included image carries CUDA, PyTorch, and the build tools.

./run.ps1 image          # build the image
./run.ps1 cli stabilize input.mp4 output.mp4 --estimator ihn

Use it

gimbal stabilize input.mp4 output.mp4 --estimator ihn      # learned model, bundled weights
gimbal stabilize input.mp4 output.mp4 --estimator classical
gimbal benchmark                                            # classical against IHN on NUS
gimbal info                                                 # GPU and library versions

gimbal needs a CUDA-capable NVIDIA GPU and exits with an error if none is available.

License

MIT. See LICENSE.

Release history

2.1.2, Jun 6, 2026
2.1.1, Jun 6, 2026 (this version)

Download files


Source Distribution

gimbal_engine-2.1.1.tar.gz (3.2 MB)

Uploaded Jun 6, 2026 Source

File details

Details for the file gimbal_engine-2.1.1.tar.gz.

File metadata

Download URL: gimbal_engine-2.1.1.tar.gz
Upload date: Jun 6, 2026
Size: 3.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for gimbal_engine-2.1.1.tar.gz
Algorithm	Hash digest
SHA256	`d8af1b6580e79ce0d258bdeb063c7c20729404a26ef2c704fefe2c0f5d04ed0f`
MD5	`7543d02777a2f73af4e29eb3756d5351`
BLAKE2b-256	`b4f8fd7029be1d9471ef79de86e2c06a810bc3020e6159e24bb651bd645e1ef7`

