vidkompy
Intelligent Video Overlay and Synchronization
vidkompy is a powerful command-line tool engineered to overlay a foreground video onto a background video with exceptional precision and automatic alignment. The system intelligently handles discrepancies in resolution, frame rate, duration, and audio, prioritizing content integrity and synchronization accuracy over raw processing speed.
The core philosophy of vidkompy is to treat the foreground video as the definitive source of quality and timing. All its frames are preserved without modification or re-timing. The background video is dynamically adapted—stretched, retimed, and selectively sampled—to synchronize perfectly with every frame of the foreground content, ensuring a seamless and coherent final output.
Features
- Automatic Spatial Alignment: Intelligently detects the optimal x/y offset to position the foreground video within the background, even if they are cropped differently.
- Advanced Temporal Synchronization: Aligns videos with different start times, durations, and frame rates, eliminating temporal drift and ensuring content matches perfectly over time.
- Foreground-First Principle: Guarantees that every frame of the foreground video is included in the output, preserving its original timing and quality. The background video is adapted to match the foreground.
- Drift-Free Alignment: Utilizes Dynamic Time Warping (DTW) to create a globally optimal, monotonic alignment, preventing the common "drift-and-catchup" artifacts seen with simpler methods.
- High-Performance Processing: Leverages multi-core processing, perceptual hashing, and optimized video I/O to deliver results quickly.
  - Frame fingerprinting is 100-1000x faster than traditional pixel-wise comparison.
  - Sequential video composition is 10-100x faster than random-access methods.
- Smart Audio Handling: Automatically uses the foreground audio track if available, falling back to the background audio. The audio is correctly synchronized with the final video.
- Flexible Operation Modes: Supports specialized modes like border matching for aligning content based on visible background edges, and smooth blending for seamless visual integration.
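The Drift-Free Alignment feature names Dynamic Time Warping. As a rough illustration of why DTW yields a monotonic, globally optimal mapping, here is a minimal sketch of the classic algorithm on toy 1-D sequences. The function name `dtw_path` and the use of one number per frame are assumptions for illustration; this is not vidkompy's implementation.

```python
def dtw_path(fg, bg):
    """Return (total_cost, path) aligning two 1-D sequences with classic DTW."""
    n, m = len(fg), len(bg)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning fg[:i] with bg[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(fg[i - 1] - bg[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end; every step moves left, up, or diagonally,
    # so the recovered path can never go backward in either sequence.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy data: the "background" shows each value twice (half the playback speed).
total, path = dtw_path([1, 2, 3], [1, 1, 2, 2, 3, 3])
print(total, path)
```

Each pair in `path` is `(fg_index, bg_index)`; both indices are non-decreasing along the path, which is the property that rules out "drift-and-catchup" jumps.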
How It Works
The vidkompy pipeline is a multi-stage process designed for precision and accuracy:
- Video Analysis: The tool begins by probing both background (BG) and foreground (FG) videos using ffprobe to extract essential metadata: resolution, frames per second (FPS), duration, frame count, and audio stream information.
- Spatial Alignment: To determine where to place the foreground on the background, vidkompy extracts a sample frame from the middle of each video (where content is most likely to be stable). It then calculates the optimal (x, y) offset.
- Temporal Alignment: This is the core of vidkompy. To determine when to start the overlay and how to map frames over time, the tool generates "fingerprints" of frames from both videos and uses Dynamic Time Warping (DTW) to find the best alignment path. This ensures every foreground frame is matched to the most suitable background frame.
- Video Composition: Once the spatial and temporal alignments are known, vidkompy composes the final video. It reads both video streams sequentially (for maximum performance) and, for each foreground frame, fetches the corresponding background frame as determined by the alignment map. The foreground is then overlaid at the correct spatial position.
- Audio Integration: After the silent video is composed, vidkompy adds the appropriate audio track (preferring the foreground's audio) with the correct offset to ensure it's perfectly synchronized with the video content.
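The Video Analysis and Audio Integration steps can be sketched as follows. The ffprobe options used here (`-print_format json`, `-show_streams`, `-show_format`) are standard ffprobe flags, but whether vidkompy calls ffprobe this way is an assumption, and the helper names `ffprobe_command`, `parse_probe`, and `pick_audio_source` are hypothetical.

```python
import json
from fractions import Fraction


def ffprobe_command(path):
    """Build an ffprobe invocation that prints stream and format info as JSON."""
    return ["ffprobe", "-v", "error", "-print_format", "json",
            "-show_streams", "-show_format", str(path)]


def parse_probe(probe_json):
    """Extract resolution, FPS, duration, frame count, and audio presence."""
    data = json.loads(probe_json)
    streams = data.get("streams", [])
    video = next(s for s in streams if s.get("codec_type") == "video")
    fps = float(Fraction(video["r_frame_rate"]))  # rate is a string like "30000/1001"
    duration = float(data["format"]["duration"])
    return {
        "width": int(video["width"]),
        "height": int(video["height"]),
        "fps": fps,
        "duration": duration,
        # nb_frames is not always reported, so fall back to an estimate
        "frame_count": int(video.get("nb_frames") or round(duration * fps)),
        "has_audio": any(s.get("codec_type") == "audio" for s in streams),
    }


def pick_audio_source(fg_info, bg_info):
    """Prefer the foreground audio track, then the background, else none."""
    if fg_info["has_audio"]:
        return "fg"
    if bg_info["has_audio"]:
        return "bg"
    return None
```

In practice the JSON string would come from running `ffprobe_command(path)` with `subprocess.run(..., capture_output=True, text=True)` and passing its `stdout` to `parse_probe`.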
The Algorithms
vidkompy employs several sophisticated algorithms to achieve its high-precision results.
Frame Fingerprinting (Perceptual Hashing)
Instead of comparing the millions of pixels in a frame, vidkompy creates a compact "fingerprint" (a perceptual hash) for each frame. Comparing these small fingerprints is orders of magnitude faster than pixel-wise comparison, and the hashes are tolerant of minor changes introduced by video compression.
The FrameFingerprinter module is designed for ultra-fast and robust frame comparison. It uses perceptual hashing, which generates a compact representation of a frame's visual structure.
The process works as follows:
- Standardization: The input frame is resized to a small, standard size (e.g., 64x64 pixels) and converted to grayscale. This ensures consistency and focuses on structural information over color.
- Multi-Algorithm Hashing: To improve robustness, vidkompy computes several types of perceptual hashes for each frame, as different algorithms are sensitive to different visual features:
  - pHash (Perceptual Hash): Analyzes the frequency domain (using DCT), making it robust to changes in brightness, contrast, and gamma correction.
  - AverageHash: Computes a hash based on the average color of the frame.
  - ColorMomentHash: Captures the color distribution statistics of the frame.
  - MarrHildrethHash: Detects edges and shapes, making it sensitive to structural features.
- Combined Fingerprint: The results from these hashers, along with a color histogram, are combined into a single "fingerprint" dictionary for the frame.
- Comparison: To compare two frames, their fingerprints are compared. The similarity is calculated using a weighted average of the normalized Hamming distance between their hashes and the correlation between their histograms. The weights are tuned based on the reliability of each hash type for video content. This entire process is parallelized across multiple CPU cores for maximum speed.
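The hashing and comparison steps above can be illustrated with a minimal pure-Python sketch of an average hash, a normalized Hamming similarity, and a weighted combination of per-hash scores. It is a simplified stand-in for the FrameFingerprinter described here: the function names, the 8x8 hash size, and the example weights are assumptions, and the other hash types and the histogram term are omitted.

```python
def average_hash(gray, hash_size=8):
    """Average hash of a grayscale frame (list of rows, values 0-255).

    Assumes the frame is at least hash_size x hash_size pixels.
    """
    h, w = len(gray), len(gray[0])
    cells = []
    for cy in range(hash_size):
        y0, y1 = cy * h // hash_size, (cy + 1) * h // hash_size
        for cx in range(hash_size):
            x0, x1 = cx * w // hash_size, (cx + 1) * w // hash_size
            # Block-average the pixels that fall into this cell
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: is the cell brighter than the frame average?
    return [1 if c > mean else 0 for c in cells]


def hamming_similarity(hash_a, hash_b):
    """1.0 for identical hashes, 0.0 when every bit differs."""
    differing = sum(a != b for a, b in zip(hash_a, hash_b))
    return 1.0 - differing / len(hash_a)


def weighted_similarity(scores, weights):
    """Weighted average of per-hash similarity scores."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


# A horizontal gradient, a uniformly brighter copy, and an inverted copy.
frame = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[v + 10 for v in row] for row in frame]
inverted = [[255 - v for v in row] for row in frame]
print(hamming_similarity(average_hash(frame), average_hash(brighter)))
print(hamming_similarity(average_hash(frame), average_hash(inverted)))
```

The brighter copy hashes identically because every cell shifts together with the mean, while the inverted copy flips every bit. This is the sense in which such hashes capture structure and ignore uniform brightness changes.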
Spatial Alignment (Template Matching)
To find the correct position for the foreground video, the tool takes a screenshot from the middle of it and searches for that exact image within a screenshot from the background video.
Spatial alignment determines the (x, y) coordinates at which to overlay the foreground frame onto the background. vidkompy uses a highly accurate and efficient method based on template matching.
- Frame Selection: A single frame is extracted from the temporal midpoint of both the foreground and background videos. This is done to get a representative frame, avoiding potential opening/closing titles or black frames.
- Grayscale Conversion: The frames are converted to grayscale. This speeds up matching by roughly 3x (one channel is compared instead of three) and makes the alignment more robust to minor color variations between the videos.
- Template Matching: The core of the alignment is cv2.matchTemplate using the TM_CCOEFF_NORMED method. This function effectively "slides" the smaller foreground frame image across the larger background frame image and calculates a normalized cross-correlation score at each position.
- Locating the Best Match: The position with the highest correlation score (from cv2.minMaxLoc) is considered the best match. This location (x_offset, y_offset) represents the top-left corner where the foreground should be placed. The confidence of this match is the correlation score itself, which typically approaches 1.0 for a perfect match.
- Scaling: The system checks if the foreground video is larger than the background. If so, it is scaled down to fit, and the scale factor is recorded.
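To make the matching step concrete, the sketch below computes the same zero-mean normalized correlation score that TM_CCOEFF_NORMED is defined by, in plain Python on small grayscale grids. It is an illustration of the technique, not a replacement for cv2.matchTemplate, and the function name `match_template` and the toy data are assumptions.

```python
import math


def match_template(bg, fg):
    """Return ((x, y), score) for the best zero-mean normalized correlation.

    bg and fg are grayscale frames as lists of rows; fg must fit inside bg.
    """
    bh, bw = len(bg), len(bg[0])
    fh, fw = len(fg), len(fg[0])
    fg_flat = [v for row in fg for v in row]
    fg_mean = sum(fg_flat) / len(fg_flat)
    fg_zero = [v - fg_mean for v in fg_flat]
    fg_norm = math.sqrt(sum(v * v for v in fg_zero))

    best_pos, best_score = (0, 0), -1.0
    # Slide the foreground over every position where it fits entirely
    for y in range(bh - fh + 1):
        for x in range(bw - fw + 1):
            win = [bg[y + dy][x + dx] for dy in range(fh) for dx in range(fw)]
            win_mean = sum(win) / len(win)
            win_zero = [v - win_mean for v in win]
            denom = fg_norm * math.sqrt(sum(v * v for v in win_zero))
            if denom == 0:
                continue  # flat window or flat template: score is undefined
            score = sum(a * b for a, b in zip(fg_zero, win_zero)) / denom
            if score > best_score:
                best_pos, best_score = (x, y), score
    return best_pos, best_score


# Toy data: a 2x3 patch pasted into an otherwise black 6x8 background.
patch = [[10, 200, 30], [90, 5, 150]]
background = [[0] * 8 for _ in range(6)]
for dy, row in enumerate(patch):
    for dx, value in enumerate(row):
        background[2 + dy][4 + dx] = value

(x_offset, y_offset), confidence = match_template(background, patch)
print(x_offset, y_offset, confidence)
```

The returned position is the top-left corner of the best match, and the score plays the role of the confidence value described above.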
Temporal Alignment Engines
vidkompy offers two high-performance temporal alignment engines optimized for different scenarios:
- Full (default): Direct pixel comparison with sliding windows for maximum accuracy
- Mask: Content-focused comparison with intelligent masking for letterboxed content
Temporal alignment is the most critical and complex part of vidkompy. The goal is to create a mapping FrameAlignment(fg_frame_idx, bg_frame_idx) for every single foreground frame. The two engines are described below.
Full Engine (Default)
The Full Engine uses direct pixel-by-pixel frame comparison with a sliding window approach for maximum accuracy:
- Bidirectional Matching:
  - Forward Pass: Starts from the first FG frame, searches for best match in BG within a sliding window
  - Backward Pass: Starts from the last FG frame, searches backward
  - Merges both passes for robust alignment
- Sliding Window Constraint:
  - Enforces monotonicity by design - can only search forward from the last matched frame
  - Window size controls the maximum temporal displacement
  - Prevents temporal jumps and ensures smooth progression
- Direct Pixel Comparison:
  - Compares actual pixel values between FG and BG frames
  - No information loss from hashing or fingerprinting
  - More sensitive to compression artifacts but potentially more accurate
Characteristics:
- Processing time: ~40 seconds for an 8-second video (d10-w10 configuration)
- Zero drift by design due to monotonic constraints
- Perfect confidence scores (1.000)
- Best overall performance for standard videos
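The sliding window constraint of the Full Engine can be sketched as a forward pass over toy frames. This is a minimal illustration under stated assumptions: frames are flat lists of pixel values, the distance is a mean absolute difference, and the names `frame_distance` and `forward_pass` are hypothetical. The backward pass and the merge step are omitted because the text does not specify how the two passes are combined.

```python
def frame_distance(a, b):
    """Mean absolute pixel difference between two equally sized flat frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def forward_pass(fg_frames, bg_frames, window=10):
    """Map each FG frame to a BG index without ever moving backward in BG."""
    mapping = []
    last = 0
    for fg in fg_frames:
        # Search only from the last matched BG frame up to `window` frames ahead
        stop = min(last + window + 1, len(bg_frames))
        best = min(range(last, stop),
                   key=lambda j: frame_distance(fg, bg_frames[j]))
        mapping.append(best)
        last = best  # the next search starts here, so the mapping is monotonic
    return mapping


# Toy data: BG frame i is uniformly 10*i; FG frames are noisy copies of BG 3, 4, 6.
bg_frames = [[i * 10] * 4 for i in range(10)]
fg_frames = [[31] * 4, [39] * 4, [62] * 4]
print(forward_pass(fg_frames, bg_frames, window=5))
```

Because each search starts at the previous match, the mapping cannot jump backward, and the window bounds how far ahead a single step can move.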
Mask Engine (Content-Focused)
The Mask Engine extends the Full engine approach with intelligent masking for letterboxed or pillarboxed content:
- Content Mask Generation:
  - Automatically detects content regions (non-black areas) in FG frames
  - Creates binary mask to focus comparison on actual content
  - Helps with letterboxed or pillarboxed videos
- Masked Comparison:
  - Only compares pixels within the mask region
  - Ignores black borders and letterboxing
  - More robust for videos with varying aspect ratios
- Same Bidirectional Approach:
  - Uses forward and backward passes like Full engine
  - Applies mask during all comparisons
  - Maintains monotonicity constraints
Characteristics:
- Processing time: ~45 seconds for an 8-second video (d10-w10 configuration)
- Perfect confidence scores (1.000)
- Better handling of videos with black borders
- Ideal for videos where content doesn't fill the entire frame
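A minimal sketch of the masking idea, assuming grayscale frames stored as lists of rows and a simple brightness threshold for "non-black". The threshold value and the function names are hypothetical choices for illustration, not taken from vidkompy.

```python
def content_mask(frame, threshold=16):
    """Mark pixels brighter than `threshold` as content (True)."""
    return [[value > threshold for value in row] for row in frame]


def mean_abs_diff(fg, bg):
    """Unmasked comparison: mean absolute difference over every pixel."""
    diffs = [abs(f - b) for fr, br in zip(fg, bg) for f, b in zip(fr, br)]
    return sum(diffs) / len(diffs)


def masked_distance(fg, bg, mask):
    """Mean absolute difference over masked (content) pixels only."""
    total, count = 0, 0
    for fg_row, bg_row, mask_row in zip(fg, bg, mask):
        for f, b, keep in zip(fg_row, bg_row, mask_row):
            if keep:
                total += abs(f - b)
                count += 1
    # With no content pixels there is nothing to compare
    return total / count if count else float("inf")


# Letterboxed FG: black bars on the top and bottom rows.
fg = [[0, 0, 0], [100, 120, 140], [100, 120, 140], [0, 0, 0]]
# BG has the same content in the middle but real picture where FG has bars.
bg = [[50, 50, 50], [100, 120, 140], [100, 120, 140], [50, 50, 50]]

mask = content_mask(fg)
print(mean_abs_diff(fg, bg), masked_distance(fg, bg, mask))
```

The unmasked distance is inflated by the bars even though the content matches exactly; the masked distance ignores them.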
Engine Comparison
| Aspect | Full | Mask |
|---|---|---|
| Algorithm | Direct pixel comparison | Masked pixel comparison |
| Speed | ~5x real-time | ~6x real-time |
| Drift | Zero (monotonic) | Zero (monotonic) |
| Memory | Medium | Medium |
| Confidence | Perfect (1.000) | Perfect (1.000) |
| Best For | Standard videos | Letterboxed/pillarboxed content |
Usage
Prerequisites
You must have the FFmpeg binary installed on your system and accessible in your system's PATH. vidkompy depends on it for all video and audio processing tasks.
Installation
The tool is a Python package. It is recommended to install it from the repository to get the latest version.
# Clone the repository
git clone https://github.com/twardoch/vidkompy.git
cd vidkompy
# Install using uv (or pip)
uv pip install .
Command-Line Interface (CLI)
The tool is run from the command line, providing paths to the background and foreground videos.
Basic Examples:
# Full engine (default) - direct pixel comparison with zero drift
python -m vidkompy --bg background.mp4 --fg foreground.mp4
# Mask engine for letterboxed/pillarboxed content
python -m vidkompy --bg background.mp4 --fg foreground.mp4 --engine mask
# Custom output path
python -m vidkompy --bg bg.mp4 --fg fg.mp4 --output result.mp4
# Fine-tune performance with drift interval and window size
python -m vidkompy --bg bg.mp4 --fg fg.mp4 --drift_interval 10 --window 10
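For batch jobs, the invocations above can be wrapped from Python. The sketch below only uses flags that appear in the examples (`--bg`, `--fg`, `--engine`, `--output`); the helper names `build_command` and `run_batch` are hypothetical, and nothing here is part of vidkompy itself.

```python
import subprocess
import sys


def build_command(bg, fg, output=None, engine=None):
    """Assemble a `python -m vidkompy` command line from the documented flags."""
    cmd = [sys.executable, "-m", "vidkompy", "--bg", str(bg), "--fg", str(fg)]
    if engine:
        cmd += ["--engine", engine]
    if output:
        cmd += ["--output", str(output)]
    return cmd


def run_batch(jobs, engine=None):
    """Run vidkompy once per (bg, fg, output) tuple, stopping on the first failure."""
    for bg, fg, output in jobs:
        subprocess.run(build_command(bg, fg, output, engine), check=True)


print(build_command("background.mp4", "foreground.mp4", engine="mask")[1:])
```

Using `sys.executable` keeps the subprocess in the same Python environment where vidkompy is installed.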
CLI Help:
INFO: Showing help with the command '__main__.py -- --help'.
NAME
__main__.py - Overlay foreground video onto background video with intelligent alignment.
SYNOPSIS
__main__.py BG FG <flags>
DESCRIPTION
Overlay foreground video onto background video with intelligent alignment.
POSITIONAL ARGUMENTS
BG
Type: str | pathlib.Path
Background video path
FG
Type: str | pathlib.Path
Foreground video path
FLAGS
-o, --output=OUTPUT
Type: Optional[str | pathlib...
Default: None
Output video path (auto-generated if not provided)
-e, --engine=ENGINE
Type: str
Default: 'fast'
Temporal alignment engine - 'fast', 'precise', 'mask', 'tunnel_full', or 'tunnel_mask' (default: 'fast')
-m, --margin=MARGIN
Type: int
Default: 8
Border thickness for border matching mode (default: 8)
-s, --smooth=SMOOTH
Type: bool
Default: False
Enable smooth blending at frame edges
-g, --gpu=GPU
Type: bool
Default: False
Enable GPU acceleration (future feature)
-v, --verbose=VERBOSE
Type: bool
Default: False
Enable verbose logging
NOTES
You can also use flags syntax for POSITIONAL ARGUMENTS
Performance
Recent updates have significantly improved vidkompy's performance and accuracy:
Real-World Performance Comparison
Based on actual benchmarks with an 8-second test video (1920x1080 background, 1920x870 foreground, ~480 frames):
| Engine | Processing Time | Speed Ratio | Confidence | Notes |
|---|---|---|---|---|
| Full (default) | 40.9 seconds | ~5x real-time | 1.000 (perfect) | Fastest overall with zero drift |
| Mask | 45.8 seconds | ~6x real-time | 1.000 (perfect) | Best for letterboxed content |
Key Performance Insights:
- Full Engine: Delivers perfect confidence scores (1.000) while taking about 5x the video's duration to process (~5x real-time). Uses direct frame mapping, which completely eliminates drift.
- Mask Engine: Slightly slower than the Full engine but achieves the same perfect confidence. Ideal for content with black borders or letterboxing, where content-focused comparison is beneficial.
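The speed ratios in the table follow directly from the measured times: processing time divided by video duration. A short calculation, using only the benchmark figures quoted above (the frame count is the approximate ~480 from the benchmark description, and the helper names are illustrative):

```python
def speed_ratio(processing_seconds, video_seconds):
    """Seconds of processing spent per second of video."""
    return processing_seconds / video_seconds


def throughput(frame_count, processing_seconds):
    """Frames aligned and composed per second of processing."""
    return frame_count / processing_seconds


VIDEO_SECONDS = 8
FRAME_COUNT = 480  # approximate, per the benchmark description

for engine, seconds in [("full", 40.9), ("mask", 45.8)]:
    print(f"{engine}: {speed_ratio(seconds, VIDEO_SECONDS):.1f}x real-time, "
          f"{throughput(FRAME_COUNT, seconds):.1f} frames/s")
```

This gives roughly 5.1x for the Full engine and 5.7x for the Mask engine, which the table rounds to ~5x and ~6x.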
Technical Optimizations
- Zero Drift Design: Both engines use sliding window constraints that enforce monotonicity by design, completely eliminating temporal drift.
- Optimized Compositing: Sequential frame reading instead of random access yields a 10-100x speedup in the final composition stage.
- Direct Pixel Comparison: Frame comparison uses actual pixel values without information loss from hashing or compression.
- Bidirectional Matching: Forward and backward passes are merged for robust alignment results.
- Efficient Memory Usage: Both engines use streaming processing with reasonable memory footprints.
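The sequential compositing optimization relies on the alignment being monotonic: the background stream only ever needs to advance, never seek backward. A minimal sketch of that loop, assuming frames arrive from plain Python iterables and that `overlay` is a caller-supplied function (both are assumptions for illustration):

```python
def compose(fg_frames, bg_frames, alignment, overlay):
    """Yield composited frames while reading the BG stream strictly forward.

    alignment[i] is the BG frame index matched to FG frame i.
    """
    bg_iter = iter(bg_frames)
    bg_idx, bg_frame = -1, None
    for fg_idx, fg_frame in enumerate(fg_frames):
        target = alignment[fg_idx]
        if target < bg_idx:
            raise ValueError("alignment must be monotonic for sequential reading")
        # Advance the BG stream until it reaches the matched frame
        while bg_idx < target:
            try:
                bg_frame = next(bg_iter)
            except StopIteration:
                raise IndexError("alignment points past the end of the BG stream") from None
            bg_idx += 1
        yield overlay(bg_frame, fg_frame)


# Toy run with strings standing in for frames.
fg = ["f0", "f1", "f2"]
bg = [f"b{i}" for i in range(6)]
print(list(compose(fg, bg, [1, 1, 4], lambda b, f: f"{b}+{f}")))
```

Every foreground frame produces exactly one output frame, and a background frame is reused when consecutive foreground frames map to the same index.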
Choosing the Right Engine
Use the Full Engine (default) when:
- Working with standard videos without letterboxing
- You need the fastest processing with perfect accuracy
- Videos have consistent content filling the frame
- General-purpose video synchronization
Use the Mask Engine when:
- Working with letterboxed or pillarboxed content
- Videos have significant black borders
- Content doesn't fill the entire frame
- Aspect ratio mismatches between foreground and background
Development
To contribute to vidkompy, set up a development environment using hatch.
Setup
- Clone the repository.
- Ensure you have hatch installed (pip install hatch).
- The project is managed through hatch environments defined in pyproject.toml.
Key Commands
Run these commands from the root of the repository.
- Run Tests:
hatch run test
- Run Tests with Coverage Report:
hatch run test-cov
- Run Type Checking:
hatch run type-check
- Check Formatting and Linting:
hatch run lint
- Automatically Fix Formatting and Linting Issues:
hatch run fix
License
This project is licensed under the MIT License. See the LICENSE file for details.