
A library to compress ESRF data and reduce their footprint


ESRF Data Compressor

ESRF Data Compressor is a command-line tool and Python library designed to compress large ESRF HDF5 datasets (3D volumes) and verify data consistency via SSIM. The default compression backend uses Blosc2 + Grok (JPEG2000).
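For readers unfamiliar with SSIM, the sketch below computes a simplified, single-window structural similarity score between two frames and applies it to the first and last frames of a volume. It is an illustration only: the function names, the `data_range` parameter, and the use of plain nested lists are assumptions made for this example, and standard SSIM implementations average the score over local sliding windows rather than over the whole frame. It does not reproduce the tool's actual implementation.

```python
from statistics import fmean


def global_ssim(frame_a, frame_b, data_range):
    """Single-window SSIM over two equally sized 2D frames (lists of rows).

    Hypothetical helper: a simplification of windowed SSIM, for illustration.
    """
    a = [float(v) for row in frame_a for v in row]
    b = [float(v) for row in frame_b for v in row]
    if not a or len(a) != len(b):
        raise ValueError("frames must be non-empty and have the same size")

    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean((v - mu_a) ** 2 for v in a)
    var_b = fmean((v - mu_b) ** 2 for v in b)
    cov = fmean((x - mu_a) * (y - mu_b) for x, y in zip(a, b))

    # Stabilising constants from the usual SSIM definition (K1=0.01, K2=0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def check_first_and_last(raw_volume, compressed_volume, data_range):
    """Return the SSIM of the first (0) and last (-1) frames along axis 0."""
    return {
        index: global_ssim(raw_volume[index], compressed_volume[index], data_range)
        for index in (0, -1)
    }
```

A score of 1.0 means the two frames are identical; lossy compression pushes the score below 1.0, and the acceptable threshold depends on the experiment.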


Features

  • Discover raw HDF5 dataset files under an experiment’s RAW_DATA

    • Goes through the HDF5 Virtual Datasets to find the data to compress
    • Allows filtering scan by scan, based on the value of a key
  • Slice-by-slice compression

    • Uses Blosc2 + Grok (JPEG2000) on every slice of each 3D dataset (axis 0)
    • User-configurable compression ratio (e.g. --cratio 10)
  • Parallel execution

    • Automatically factors CPU cores into worker processes × per-process threads
    • By default, each worker runs up to 4 Blosc2 threads (or falls back to 1 thread if fewer than 4 cores are available)
  • Non-destructive workflow

    1. compress writes a sibling file <basename>_<compression_method>.h5 next to each original
    2. check computes SSIM (first and last frames) and writes a report
    3. overwrite (optional) swaps out the raw frame file (irreversible)
  • Four simple CLI subcommands

    • list: Show all raw HDF5 files to be processed
    • compress: Generate compressed siblings
    • check: Produce a per-dataset SSIM report between raw & compressed
    • overwrite: Atomically replace each raw frame file (irreversible)
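
The worker × thread split described under "Parallel execution" can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual algorithm: the function name `plan_parallelism` and its parameters are hypothetical, and the rule used here (4 threads per worker when at least 4 cores exist, otherwise 1 thread, with the remaining cores divided into workers) is one plausible reading of the description above.

```python
import os


def plan_parallelism(total_cores=None, max_threads_per_worker=4):
    """Split CPU cores into (workers, threads_per_worker).

    Hypothetical helper: the real tool may factor cores differently.
    """
    if total_cores is None:
        # os.cpu_count() can return None on unusual platforms.
        total_cores = os.cpu_count() or 1
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")

    # Use the full thread budget only when enough cores exist, else 1 thread.
    if total_cores >= max_threads_per_worker:
        threads = max_threads_per_worker
    else:
        threads = 1

    # Integer division keeps workers * threads within the core count.
    workers = max(1, total_cores // threads)
    return workers, threads


if __name__ == "__main__":
    for cores in (2, 4, 16):
        workers, threads = plan_parallelism(cores)
        print(f"{cores} cores -> {workers} workers x {threads} threads")
```

With this rule, a 16-core machine runs 4 workers with 4 threads each, while a 2-core machine runs 2 single-threaded workers. Integer division means some cores may stay idle when the count is not a multiple of the thread budget.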

Installation

From PyPI

pip install esrf-data-compressor

Once installed, the compress-hdf5 command will be available in your PATH.
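
To confirm from Python that the command is reachable (for example inside a batch script or a notebook), a small check with the standard library is enough. The helper name `find_cli` is made up for this example; only the command name `compress-hdf5` comes from this README.

```python
import shutil


def find_cli(command="compress-hdf5"):
    """Return the absolute path of the command if it is on PATH, else None."""
    return shutil.which(command)


if __name__ == "__main__":
    path = find_cli()
    if path:
        print(f"compress-hdf5 found at {path}")
    else:
        print("compress-hdf5 not found on PATH; is the right environment active?")
```

If the command is not found after installation, the most common cause is that the virtual environment used for `pip install` is not the one currently activated.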

From Source (for development)

git clone https://gitlab.esrf.fr/dau/esrf-data-compressor.git
cd esrf-data-compressor

# (Optional) Create & activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install build dependencies & the package itself
pip install .

Documentation

Full documentation is available online: ESRF Data Compressor Docs

Contributing & Development

  • Clone the repository:

    git clone https://gitlab.esrf.fr/dau/esrf-data-compressor.git
    cd esrf-data-compressor
    
  • Install dependencies (in a virtual environment):

    python -m venv venv
    source venv/bin/activate
    pip install -e ".[dev]"
    
  • Run tests with coverage:

    pytest -v --cov=esrf_data_compressor --cov-report=term-missing
    
  • Style:

    • black .
    • flake8 .
    • ruff check .
  • Build docs (Sphinx + pydata theme):

    sphinx-build doc build/html
    

License

This project is licensed under the MIT License. See LICENSE for full text.


Changelog

All noteworthy changes are recorded in CHANGELOG.md. Version 0.1.0 marks the first public release with:

  • Initial implementation of Blosc2 + Grok (JPEG2000) compression for 3D HDF5 datasets.
  • SSIM-based integrity check (first & last slice).
  • Four-command CLI (list, compress, check, overwrite).
  • Parallelism with worker×thread auto-factoring.

For more details, see the full history in CHANGELOG.md.


