Convert Parquet files to video and back, inspired by xarrayvideo
videoparquet
Inspired by xarrayvideo, videoparquet is a Python library that converts Parquet files (containing array-like or tabular data) to video files and back. It encodes with ffmpeg and stores the metadata needed to reverse the conversion.
THIS IS JUST A FUN EXPERIMENT! DO NOT TAKE IT TOO SERIOUSLY ⚠️
Features
- Convert Parquet files to video (mp4/mkv) using ffmpeg codecs (lossy or lossless)
- Store and recover all necessary metadata for roundtrip conversion
- Support for normalization, denormalization, and PCA (dimensionality reduction)
- Multi-array and multi-video support per Parquet file
- Flexible codec and bit-depth selection
- Automated recipe generation for batch processing
- Strict, xarrayvideo-style test suite for lossless and lossy roundtrip
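As a rough illustration of the PCA feature listed above, the sketch below reduces a wide table to three components (one per video channel) and restores it. It uses plain NumPy SVD; the function names `pca_reduce` and `pca_restore` are hypothetical and are not part of the videoparquet API, whose internals are not shown in this README.

```python
import numpy as np

def pca_reduce(data, n_components):
    """Project rows of `data` onto the top principal components."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T
    return scores, components, mean

def pca_restore(scores, components, mean):
    """Map component scores back to the original column space."""
    return scores @ components + mean

# Synthetic table: 6 columns that are mixtures of 3 latent signals,
# so 3 components are enough to restore it (up to float rounding).
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
mixing = rng.normal(size=(3, 6))
table = latent @ mixing + 5.0

scores, components, mean = pca_reduce(table, n_components=3)
restored = pca_restore(scores, components, mean)
print("max abs error:", np.abs(restored - table).max())
```

For data that is not exactly low-rank, the restored table is only an approximation, so PCA belongs with the lossy options.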
Installation
pip install -r requirements.txt
# or, for development:
# pip install -e .
Usage
Basic: Parquet to Video and Back
from videoparquet.parquet2video import parquet2video
from videoparquet.video2parquet import video2parquet
import pandas as pd
import numpy as np
# Create synthetic data and save as Parquet
arr = np.random.rand(4, 4, 3)  # (frames, pixels, channels)
df = pd.DataFrame(arr.reshape(4, -1))
df.columns = df.columns.astype(str)  # some pandas versions require string column names for Parquet
df.to_parquet('data.parquet')
# Define conversion rules (see below for details)
conversion_rules = {
'arr1': (list(df.columns), arr.shape, 0, {'c:v': 'libx264'}, 8, [arr.min(), arr.max()])
}
# Parquet -> Video
parquet2video('data.parquet', 'exampleid', conversion_rules, output_path='.')
# Video -> Parquet
video2parquet('.', 'exampleid', name='arr1')
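The six-element rule tuple is easier to read with names attached. The sketch below wraps it in a `NamedTuple`; every field name is a hypothetical label inferred from the values in the example above, not a documented videoparquet name. In particular, the meaning of the third element is a guess (possibly a PCA component count, with 0 meaning disabled, by analogy with xarrayvideo).

```python
from typing import NamedTuple
import numpy as np

class ConversionRule(NamedTuple):
    # All field names are illustrative assumptions, in positional order.
    columns: list         # DataFrame columns that make up the array
    shape: tuple          # original array shape
    n_components: int     # ASSUMPTION: PCA components, 0 = no PCA
    ffmpeg_params: dict   # ffmpeg options such as {'c:v': 'libx264'}
    bits: int             # bit depth used for encoding
    value_range: list     # [min, max] of the data

arr = np.random.rand(4, 4, 3)
columns = [str(i) for i in range(arr.shape[1] * arr.shape[2])]

rule = ConversionRule(
    columns=columns,
    shape=arr.shape,
    n_components=0,
    ffmpeg_params={'c:v': 'libx264'},
    bits=8,
    value_range=[float(arr.min()), float(arr.max())],
)

# A NamedTuple is a tuple; tuple() yields the plain form used in the example.
conversion_rules = {'arr1': tuple(rule)}
```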
Advanced: Using Lossless Codecs
conversion_rules = {
# Use lossless codec (ffv1, 3-channel RGB only)
'arr_lossless': (list(df.columns), arr.shape, 0, {'c:v': 'ffv1'}, 16, [arr.min(), arr.max()])
}
parquet2video('data.parquet', 'exampleid', conversion_rules, output_path='.')
video2parquet('.', 'exampleid', name='arr_lossless')
Automated Recipe Generation
from videoparquet.get_recipe import get_recipe
import pandas as pd
df = pd.read_parquet('data.parquet')
recipe = get_recipe(df)  # Returns a dict of conversion rules
Testing
Run the test suite to verify strict roundtrip and lossy scenarios:
pytest tests/test_roundtrip.py
What is tested?
- Lossless roundtrip: Only 3-channel ffv1+gbrp16le is supported and tested. Max error is <0.001 per channel.
- Lossy roundtrip: A test with libx264 (rgb24) is included for comparison. Max error is typically 2–3 per channel (on a 0–95 range).
Example output:
Max abs error per channel: [0.00068665 0.00068665 0.00068665] # ffv1+gbrp16le (lossless)
[libx264] Max abs error per channel: [2.588234 2.588234 2.588234] # libx264 (lossy)
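One way to read these figures, as an assumption since this README does not describe the mapping, is that float data is min-max scaled onto 2^bits - 1 integer levels before encoding. Under that assumption even a bit-exact codec leaves a rounding error of up to half a quantization step: about 95 / 65535 / 2 ≈ 0.0007 for 16 bits over a 0–95 range, the same order as the ffv1 figure above. The helper names below are hypothetical.

```python
import numpy as np

def quantize(values, vmin, vmax, bits):
    """Min-max scale floats to integer codes in [0, 2**bits - 1]."""
    levels = 2**bits - 1
    scaled = (values - vmin) / (vmax - vmin)
    return np.round(scaled * levels).astype(np.uint16)

def dequantize(codes, vmin, vmax, bits):
    """Map integer codes back to floats in [vmin, vmax]."""
    levels = 2**bits - 1
    return codes.astype(np.float64) / levels * (vmax - vmin) + vmin

vmin, vmax = 0.0, 95.0
values = np.linspace(vmin, vmax, 10_001)

errors = {}
for bits in (8, 16):
    restored = dequantize(quantize(values, vmin, vmax, bits), vmin, vmax, bits)
    max_err = float(np.abs(restored - values).max())
    half_step = (vmax - vmin) / (2**bits - 1) / 2  # worst-case rounding error
    errors[bits] = (max_err, half_step)
    print(f"{bits}-bit: max error {max_err:.6f}, half-step bound {half_step:.6f}")
```

The libx264 error of 2 to 3 units is several 8-bit steps, so in that case most of the loss comes from the codec rather than from quantization.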
Motivation
This project enables efficient storage, compression, and sharing of large datasets by leveraging video codecs, while maintaining the ability to recover the original data using Parquet as the canonical format.
Codec and Pixel Format Restrictions
Important: For robust, lossless roundtrip, only the ffv1 codec with the gbrp16le pixel format (planar RGB, 3 channels, no padding) is supported and tested. If your ffmpeg build does not support this combination, the library will raise a clear error. This ensures that timeseries/tabular data can be reliably converted to and from video without data loss or row padding issues.
Other codecs (e.g., libx264) may be used for lossy compression, but roundtrip is not guaranteed.
🚀 Benchmark Highlights
- Achieve up to 25x compression over Parquet for timeseries/tabular data using video codecs (ffv1/gbrp16le)
- Fast encoding/decoding: Video roundtrip in under a second for typical scientific arrays
- Lossless roundtrip supported (with ffv1/gbrp16le and compatible ffmpeg)
- See BENCHMARK.md for details and reproducibility
Benchmarking
See BENCHMARK.md for a summary of benchmark results comparing Parquet and video-based storage for timeseries/tabular data. This includes size, compression ratio, and performance metrics for scientific reproducibility.
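A compression ratio of this kind can be checked locally by comparing file sizes. The sketch below is a minimal helper, not the benchmark script; the name `compression_ratio` is hypothetical, and the list of output files the library writes (video plus any metadata) is an assumption to adapt.

```python
import tempfile
from pathlib import Path

def compression_ratio(original_path, compressed_paths):
    """Size of the original file divided by the total size of the outputs."""
    original = Path(original_path).stat().st_size
    compressed = sum(Path(p).stat().st_size for p in compressed_paths)
    if compressed == 0:
        raise ValueError("compressed outputs are empty")
    return original / compressed

# Demo with placeholder files standing in for a Parquet file and its video.
with tempfile.TemporaryDirectory() as tmp:
    parquet_file = Path(tmp) / "data.parquet"
    video_file = Path(tmp) / "data.mkv"
    parquet_file.write_bytes(b"\0" * 2500)
    video_file.write_bytes(b"\0" * 100)
    ratio = compression_ratio(parquet_file, [video_file])

print(f"compression ratio: {ratio:.1f}x")
```

Counting every file needed for the roundtrip, including metadata, keeps the comparison fair.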
⚠️ ffmpeg, ffv1, and Pixel Format Limitations
IMPORTANT:
- For true lossless roundtrip and compression, videoparquet requires ffmpeg to encode ffv1 videos with the gbrp16le (planar RGB) pixel format.
- On macOS (Homebrew) and many Linux builds, ffmpeg may encode ffv1 as bgr0 instead of gbrp16le, even though gbrp16le is listed as supported. This is a known limitation/quirk of many ffmpeg builds.
- bgr0 is not true planar RGB and may have padding/alpha issues. It is not guaranteed to be robust for scientific roundtrip.
- The test suite will skip strict roundtrip and compression tests if gbrp16le is not available, and will warn the user. Only platforms with ffv1/gbrp16le will run and require these tests to pass.
- To check your ffmpeg's pixel format support for ffv1, run:
ffmpeg -h encoder=ffv1 | grep gbrp
- For scientific reproducibility, use a Docker image or reference ffmpeg build known to support ffv1/gbrp16le.
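The same encoder check can be scripted. The sketch below parses the "Supported pixel formats:" line that ffmpeg prints in its encoder help; the function names are hypothetical and the sample text is an abbreviated stand-in for real ffmpeg output. As noted above, a format being listed is necessary but not sufficient, since some builds still write bgr0.

```python
import shutil
import subprocess

def parse_supported_pix_fmts(help_text):
    """Extract pixel format names from `ffmpeg -h encoder=...` output."""
    for line in help_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Supported pixel formats:"):
            return stripped.split(":", 1)[1].split()
    return []

def ffv1_pix_fmts():
    """Return the formats the local ffv1 encoder lists, or [] without ffmpeg."""
    if shutil.which("ffmpeg") is None:
        return []
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-h", "encoder=ffv1"],
        capture_output=True, text=True, check=False,
    )
    return parse_supported_pix_fmts(result.stdout)

# Abbreviated, illustrative sample of the help text (not real output).
sample = (
    "Encoder ffv1 [FFmpeg video codec #1]:\n"
    "    Supported pixel formats: yuv420p bgr0 gbrp16le\n"
)
sample_formats = parse_supported_pix_fmts(sample)
print("gbrp16le listed:", "gbrp16le" in sample_formats)
```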
How to Check Your ffmpeg
Run this command:
ffmpeg -f lavfi -i testsrc2=duration=1:size=2x2:rate=1 -pix_fmt gbrp16le -c:v ffv1 -y test_ffv1_gbrp16le.mkv && ffprobe -v error -select_streams v:0 -show_entries stream=pix_fmt -of default=noprint_wrappers=1:nokey=1 test_ffv1_gbrp16le.mkv
- If the output is gbrp16le, your ffmpeg is suitable for scientific roundtrip.
- If the output is bgr0, your ffmpeg will not guarantee true lossless roundtrip.
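For use in a script or CI step, the command above can be wrapped in Python. This is a sketch under the assumption that ffmpeg and ffprobe are on the PATH; the function names are hypothetical, and the arguments are the ones from the command above with an added `-v error` on the encode step and a temporary output directory.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def is_suitable(pix_fmt):
    """True only for the pixel format this README treats as roundtrip-safe."""
    return pix_fmt.strip() == "gbrp16le"

def probe_ffv1_pix_fmt():
    """Encode a tiny ffv1 test clip and return the pixel format actually written.

    Returns None if ffmpeg/ffprobe are missing or the encode fails.
    """
    if shutil.which("ffmpeg") is None or shutil.which("ffprobe") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "test_ffv1_gbrp16le.mkv"
        encode = subprocess.run(
            ["ffmpeg", "-v", "error", "-f", "lavfi",
             "-i", "testsrc2=duration=1:size=2x2:rate=1",
             "-pix_fmt", "gbrp16le", "-c:v", "ffv1", "-y", str(out)],
            capture_output=True, text=True, check=False,
        )
        if encode.returncode != 0:
            return None
        probe = subprocess.run(
            ["ffprobe", "-v", "error", "-select_streams", "v:0",
             "-show_entries", "stream=pix_fmt",
             "-of", "default=noprint_wrappers=1:nokey=1", str(out)],
            capture_output=True, text=True, check=False,
        )
        return probe.stdout.strip() or None

if __name__ == "__main__":
    fmt = probe_ffv1_pix_fmt()
    if fmt is None:
        print("ffmpeg/ffprobe not available or encode failed")
    else:
        print(f"pix_fmt={fmt}, suitable={is_suitable(fmt)}")
```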
For Scientific Reproducibility & CI
- Use a reference ffmpeg build (e.g., static Linux build from https://johnvansickle.com/ffmpeg/) or a Docker container with a known-good ffmpeg.
- The test suite is designed to run in CI (GitHub Actions) as long as the correct ffmpeg build is available.
- See the code and error messages for more details.