
Convert Parquet files to video and back, inspired by xarrayvideo


videoparquet

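Inspired by xarrayvideo, videoparquet is a Python library that converts Parquet files (containing array-like or tabular data) to video files and back. It encodes the data with ffmpeg codecs and stores the metadata (normalization ranges, optional PCA, codec settings) needed to recover a Parquet file from the video.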

THIS IS JUST A FUN EXPERIMENT! DO NOT TAKE IT TOO SERIOUSLY ⚠️

Features

  • Convert Parquet files to video (mp4/mkv) using ffmpeg codecs (lossy or lossless)
  • Store and recover all necessary metadata for roundtrip conversion
  • Support for normalization, denormalization, and PCA (dimensionality reduction)
  • Multi-array and multi-video support per Parquet file
  • Flexible codec and bit-depth selection
  • Automated recipe generation for batch processing
  • Strict, xarrayvideo-style test suite for lossless and lossy roundtrip

Installation

pip install videoparquet
# or, from a clone of the repository:
# pip install -r requirements.txt
# or, for development:
# pip install -e .

Usage

Basic: Parquet to Video and Back

from videoparquet.parquet2video import parquet2video
from videoparquet.video2parquet import video2parquet
import pandas as pd
import numpy as np

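# Create synthetic data and save as Parquet
arr = np.random.rand(4, 4, 3)  # (frames, pixels, channels)
df = pd.DataFrame(arr.reshape(4, -1))  # one row per frame, 4 * 3 = 12 columns
df.columns = df.columns.astype(str)  # some Parquet engines require string column names
df.to_parquet('data.parquet')

# Define conversion rules: each entry maps a name to a tuple of
# (columns, array shape, 0, ffmpeg params, bits, [min, max] of the data)
conversion_rules = {
    'arr1': (list(df.columns), arr.shape, 0, {'c:v': 'libx264'}, 8, [arr.min(), arr.max()])
}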

# Parquet -> Video
parquet2video('data.parquet', 'exampleid', conversion_rules, output_path='.')

# Video -> Parquet
video2parquet('.', 'exampleid', name='arr1')

Advanced: Using Lossless Codecs

conversion_rules = {
    # Use lossless codec (ffv1, 3-channel RGB only)
    'arr_lossless': (list(df.columns), arr.shape, 0, {'c:v': 'ffv1'}, 16, [arr.min(), arr.max()])
}
parquet2video('data.parquet', 'exampleid', conversion_rules, output_path='.')
video2parquet('.', 'exampleid', name='arr_lossless')

Automated Recipe Generation

from videoparquet.get_recipe import get_recipe
import pandas as pd

df = pd.read_parquet('data.parquet')
recipe = get_recipe(df)  # Returns a dict of conversion rules

Testing

Run the test suite to verify strict roundtrip and lossy scenarios:

pytest tests/test_roundtrip.py

What is tested?

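  • Lossless roundtrip: Only 3-channel ffv1+gbrp16le is supported and tested. The codec itself is lossless, but the recovered values are not bit-identical to the input: the maximum absolute error is below 0.001 per channel.
  • Lossy roundtrip: A test with libx264 (rgb24) is included for comparison. The maximum absolute error is typically 2–3 per channel (on a 0–95 range).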

Example output:

Max abs error per channel: [0.00068665 0.00068665 0.00068665]  # ffv1+gbrp16le (lossless)
[libx264] Max abs error per channel: [2.588234 2.588234 2.588234]  # libx264 (lossy)
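
The small but non-zero error on the ffv1 line is what one would expect if values are scaled to 16-bit integers before encoding. That is an assumption about the normalization step, not something this README states. A minimal, dependency-free sketch of min-max quantization and its worst-case error follows; the function names are hypothetical and are not part of videoparquet:

```python
import random


def quantize(values, vmin, vmax, bits):
    """Map floats in [vmin, vmax] to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    scale = levels / (vmax - vmin)
    return [round((v - vmin) * scale) for v in values]


def dequantize(codes, vmin, vmax, bits):
    """Map integer codes back to floats in [vmin, vmax]."""
    levels = (1 << bits) - 1
    step = (vmax - vmin) / levels
    return [vmin + c * step for c in codes]


def max_roundtrip_error(values, vmin, vmax, bits):
    """Largest absolute difference after quantizing and dequantizing."""
    codes = quantize(values, vmin, vmax, bits)
    restored = dequantize(codes, vmin, vmax, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


random.seed(0)
data = [random.uniform(0.0, 95.0) for _ in range(10_000)]
for bits in (8, 16):
    # Rounding to the nearest code can be off by at most half a step.
    half_step = 95.0 / ((1 << bits) - 1) / 2
    err = max_roundtrip_error(data, 0.0, 95.0, bits)
    print(f"{bits}-bit: max error {err:.6f} (bound {half_step:.6f})")
```

If the lossless test also uses a 0–95 range, the 16-bit bound is about 0.0007, which is in line with the <0.001 figure above. The 8-bit bound is about 0.19, so most of the libx264 error would come from the lossy codec rather than from quantization.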

Motivation

This project enables efficient storage, compression, and sharing of large datasets by leveraging video codecs, while maintaining the ability to recover the original data using Parquet as the canonical format.

Codec and Pixel Format Restrictions

Important: For robust, lossless roundtrip, only the ffv1 codec with the gbrp16le pixel format (planar RGB, 3 channels, no padding) is supported and tested. If your ffmpeg build does not support this combination, the library will raise a clear error. This ensures that timeseries/tabular data can be reliably converted to and from video without data loss or row padding issues.

Other codecs (e.g., libx264) may be used for lossy compression, but roundtrip is not guaranteed.
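
One way to keep these restrictions out of hand-written rule dictionaries is a small helper that builds each rule and rejects unsupported lossless requests early. The sketch below is hypothetical: `make_rule` is not part of videoparquet, the tuple layout is copied from the usage examples in this README, and the 3-channel check assumes the last axis of the shape holds the channels.

```python
def make_rule(columns, shape, value_range, lossless=False):
    """Build one conversion-rule tuple in the positional layout of the README examples."""
    channels = shape[-1]  # assumption: last axis is the channel axis
    if lossless and channels != 3:
        raise ValueError(
            f"lossless ffv1/gbrp16le needs exactly 3 channels, got {channels}"
        )
    # Codec and bit depth pairs are the ones used in the README examples.
    codec, bits = ("ffv1", 16) if lossless else ("libx264", 8)
    # The third field stays at 0 exactly as in the README examples;
    # its meaning is not documented here, so this sketch does not interpret it.
    return (list(columns), tuple(shape), 0, {"c:v": codec}, bits, list(value_range))


columns = [str(i) for i in range(12)]
rules = {
    "arr_lossless": make_rule(columns, (4, 4, 3), (0.0, 1.0), lossless=True),
    "arr_lossy": make_rule(columns, (4, 4, 3), (0.0, 1.0)),
}
print(rules["arr_lossless"][3:5])
print(rules["arr_lossy"][3:5])
```

The resulting dictionary has the same shape as the `conversion_rules` passed to `parquet2video` in the usage section; whether the library performs a similar check itself is not stated here.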

🚀 Benchmark Highlights

  • Achieve up to 25x compression over Parquet for timeseries/tabular data using video codecs (ffv1/gbrp16le)
  • Fast encoding/decoding: Video roundtrip in under a second for typical scientific arrays
  • Lossless roundtrip supported (with ffv1/gbrp16le and compatible ffmpeg)
  • See BENCHMARK.md for details and reproducibility

Benchmarking

See BENCHMARK.md for a summary of benchmark results comparing Parquet and video-based storage for timeseries/tabular data. This includes size, compression ratio, and performance metrics for scientific reproducibility.
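
For a quick check on your own data, the compression ratio can be computed from file sizes alone. This is a minimal sketch, not the method used in BENCHMARK.md; `compression_ratio` is a hypothetical helper and the stand-in files below only demonstrate the arithmetic.

```python
import os
import tempfile


def compression_ratio(original_path, compressed_path):
    """Return original size / compressed size (above 1 means the second file is smaller)."""
    original = os.path.getsize(original_path)
    compressed = os.path.getsize(compressed_path)
    if compressed == 0:
        raise ValueError("compressed file is empty")
    return original / compressed


# Stand-in files; real use would pass the Parquet file and the generated video.
with tempfile.TemporaryDirectory() as tmp:
    parquet_path = os.path.join(tmp, "data.parquet")
    video_path = os.path.join(tmp, "data.mkv")
    with open(parquet_path, "wb") as f:
        f.write(b"\0" * 2500)
    with open(video_path, "wb") as f:
        f.write(b"\0" * 100)
    ratio = compression_ratio(parquet_path, video_path)

print(f"{ratio:.1f}x")
```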

⚠️ ffmpeg, ffv1, and Pixel Format Limitations

IMPORTANT:

  • For true lossless roundtrip and compression, videoparquet requires ffmpeg to encode ffv1 videos with the gbrp16le (planar RGB) pixel format.

  • On macOS (Homebrew) and many Linux builds, ffmpeg may encode ffv1 as bgr0 instead of gbrp16le, even though gbrp16le is listed as supported. This is a known limitation/quirk of many ffmpeg builds.

  • bgr0 is not true planar RGB and may have padding/alpha issues. It is not guaranteed to be robust for scientific roundtrip.

  • The test suite skips the strict roundtrip and compression tests, with a warning, if gbrp16le is not available. These tests run, and are required to pass, only on platforms where ffv1/gbrp16le works.

  • To check your ffmpeg's pixel format support for ffv1, run:

    ffmpeg -h encoder=ffv1 | grep gbrp
    
  • For scientific reproducibility, use a Docker image or reference ffmpeg build known to support ffv1/gbrp16le.
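
The `grep` check above can also be done from Python, for example in a setup script. The sketch below assumes the encoder help prints a `Supported pixel formats:` line, as current ffmpeg builds do; the function names are hypothetical. As noted above, a listed format does not guarantee that the build actually encodes with it.

```python
import shutil
import subprocess


def supported_pix_fmts(help_text):
    """Extract the formats from the 'Supported pixel formats:' line of the encoder help."""
    for line in help_text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == "Supported pixel formats":
            return rest.split()
    return []


def ffv1_lists_gbrp16le():
    """Return True or False, or None when ffmpeg is not on PATH."""
    if shutil.which("ffmpeg") is None:
        return None
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-h", "encoder=ffv1"],
        capture_output=True,
        text=True,
        check=False,
    )
    return "gbrp16le" in supported_pix_fmts(result.stdout)


print("ffv1 lists gbrp16le:", ffv1_lists_gbrp16le())
```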

How to Check Your ffmpeg

Run this command:

ffmpeg -f lavfi -i testsrc2=duration=1:size=2x2:rate=1 -pix_fmt gbrp16le -c:v ffv1 -y test_ffv1_gbrp16le.mkv && \
  ffprobe -v error -select_streams v:0 -show_entries stream=pix_fmt -of default=noprint_wrappers=1:nokey=1 test_ffv1_gbrp16le.mkv

  • If the output is gbrp16le, your ffmpeg is suitable for scientific roundtrip.
  • If the output is bgr0, your ffmpeg will not guarantee true lossless roundtrip.
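
The same encode-and-probe check can be scripted so it runs before a batch job or in CI. The sketch below reuses the exact ffmpeg and ffprobe arguments from the command above; the function names and the returned labels are hypothetical, and the helper returns None when the tools are missing or the encode fails.

```python
import os
import shutil
import subprocess
import tempfile


def classify_pix_fmt(ffprobe_output):
    """Interpret the reported pixel format the way the list above does."""
    pix_fmt = ffprobe_output.strip()
    if pix_fmt == "gbrp16le":
        return "suitable"
    if pix_fmt == "bgr0":
        return "not guaranteed lossless"
    return "unknown"


def probe_ffv1_pix_fmt():
    """Encode a tiny test clip with ffv1 and return the pixel format ffprobe reports."""
    if shutil.which("ffmpeg") is None or shutil.which("ffprobe") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test_ffv1_gbrp16le.mkv")
        encode = subprocess.run(
            ["ffmpeg", "-f", "lavfi", "-i", "testsrc2=duration=1:size=2x2:rate=1",
             "-pix_fmt", "gbrp16le", "-c:v", "ffv1", "-y", path],
            capture_output=True, text=True, check=False,
        )
        if encode.returncode != 0:
            return None
        probe = subprocess.run(
            ["ffprobe", "-v", "error", "-select_streams", "v:0",
             "-show_entries", "stream=pix_fmt",
             "-of", "default=noprint_wrappers=1:nokey=1", path],
            capture_output=True, text=True, check=False,
        )
        return probe.stdout.strip() or None


reported = probe_ffv1_pix_fmt()
print(reported, "->", classify_pix_fmt(reported or ""))
```

Writing the clip to a temporary directory avoids leaving test_ffv1_gbrp16le.mkv behind in the working directory.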

For Scientific Reproducibility & CI

  • Use a reference ffmpeg build (e.g., static Linux build from https://johnvansickle.com/ffmpeg/) or a Docker container with a known-good ffmpeg.
  • The test suite is designed to run in CI (GitHub Actions) as long as the correct ffmpeg build is available.
  • See the code and error messages for more details.


