A library to compress ESRF data and reduce their footprint

ESRF Data Compressor

ESRF Data Compressor is a command-line tool and Python library that compresses large ESRF HDF5 datasets (3D volumes) and verifies that the compressed data remain consistent with the originals using the structural similarity index (SSIM). The default compression backend uses Blosc2 + Grok (JPEG2000).
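For readers unfamiliar with SSIM, here is a minimal sketch of the index in its simplest, single-window form, computed over a whole frame at once. It is an illustration only and not necessarily what the package does: the function name `global_ssim` is hypothetical, the frames are assumed to be flattened to plain sequences of numbers, and practical implementations usually average SSIM over small sliding windows instead.

```python
def global_ssim(x, y, data_range):
    """Single-window SSIM between two equally sized, flattened frames.

    Simplified illustration: the statistics are taken over the whole
    frame instead of being averaged over local sliding windows.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("frames must be non-empty and the same size")
    n = len(x)
    # Standard stabilising constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


raw_frame = [0, 50, 100, 150, 200, 250]
lossy_frame = [1, 49, 101, 150, 199, 250]
print(global_ssim(raw_frame, raw_frame, data_range=255))
print(global_ssim(raw_frame, lossy_frame, data_range=255))
```

A value close to 1 means the two frames are structurally almost identical; lower values indicate that compression has altered the image more strongly.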


Features

  • Discover raw HDF5 dataset files under an experiment’s RAW_DATA

    • Goes through the HDF5 Virtual Datasets to find the data to compress
    • Lets you narrow the selection scan by scan, based on the value of a key
  • Slice-by-slice compression

    • Uses Blosc2 + Grok (JPEG2000) on every slice of each 3D dataset (axis 0)
    • User-configurable compression ratio (e.g. --cratio 10)
  • Parallel execution

    • Automatically factors CPU cores into worker processes × per-process threads
    • By default, each worker runs up to 4 Blosc2 threads (or falls back to 1 thread if < 4 cores)
  • Non-destructive workflow

    1. compress writes a sibling file <basename>_<compression_method>.h5 next to each original
    2. check computes SSIM (first and last frames) and writes a report
    3. overwrite (optional) swaps out the raw frame file (irreversible)
  • Four simple CLI subcommands

    • compress-hdf5 list: show all raw HDF5 files to be processed
    • compress-hdf5 compress: generate compressed siblings
    • compress-hdf5 check: produce a per-dataset SSIM report between raw & compressed
    • compress-hdf5 overwrite: atomically replace each raw frame file (irreversible)
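Two of the features above, the split of CPU cores into workers × threads and the naming of the compressed sibling file, can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the package's actual code: the function names `plan_parallelism` and `sibling_path` are hypothetical, the integer-division rule for the number of workers is a guess at how the factoring might work, and the method label `jp2k` is only a placeholder.

```python
import os
from pathlib import Path


def plan_parallelism(n_cores=None, max_threads=4):
    """Split the available cores into (workers, threads_per_worker).

    Assumed rule: use `max_threads` threads per worker when enough cores
    exist, otherwise fall back to 1 thread; workers fill the rest.
    """
    if n_cores is None:
        n_cores = os.cpu_count() or 1
    threads = max_threads if n_cores >= max_threads else 1
    workers = max(1, n_cores // threads)
    return workers, threads


def sibling_path(raw_file, method):
    """Build <basename>_<compression_method>.h5 next to the raw file."""
    raw = Path(raw_file)
    return raw.with_name(f"{raw.stem}_{method}.h5")


# Hypothetical inputs, for illustration only.
print(plan_parallelism(16))
print(plan_parallelism(2))
print(sibling_path("RAW_DATA/sample/scan_0001.h5", "jp2k"))
```

Keeping the compressed file beside the original is what makes the workflow non-destructive: nothing is replaced until the overwrite step is run explicitly.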

Installation

From PyPI

pip install esrf-data-compressor

Once installed, the compress-hdf5 command will be available in your PATH.
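To confirm that the entry point really is on your PATH (for example after activating a virtual environment), a small check such as the one below can help. It relies only on the standard library's `shutil.which`; the helper name `find_cli` is hypothetical.

```python
import shutil


def find_cli(name="compress-hdf5"):
    """Return the absolute path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


location = find_cli()
if location is None:
    print("compress-hdf5 not found on PATH; is the right environment active?")
else:
    print(f"compress-hdf5 found at {location}")
```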

From Source (for development)

git clone https://gitlab.esrf.fr/dau/esrf-data-compressor.git
cd esrf-data-compressor

# (Optional) Create & activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install build dependencies & the package itself
pip install .

Documentation

Full documentation is available online: ESRF Data Compressor Docs

Contributing & Development

  • Clone the repository:

    git clone https://gitlab.esrf.fr/dau/esrf-data-compressor.git
    cd esrf-data-compressor
    
  • Install dependencies (in a virtual environment):

    python -m venv venv
    source venv/bin/activate
    pip install -e ".[dev]"
    
  • Run tests with coverage:

    pytest -v --cov=esrf_data_compressor --cov-report=term-missing
    
  • Style:

    • black .
    • flake8 .
    • ruff check .
  • Build docs (Sphinx + pydata theme):

    sphinx-build doc build/html
    

License

This project is licensed under the MIT License. See LICENSE for full text.


Changelog

All noteworthy changes are recorded in CHANGELOG.md. Version 0.1.0 marks the first public release with:

  • Initial implementation of Blosc2 + Grok (JPEG2000) compression for 3D HDF5 datasets.
  • SSIM-based integrity check (first & last slice).
  • Four-command CLI (compress-hdf5 list, compress-hdf5 compress, compress-hdf5 check, compress-hdf5 overwrite).
  • Parallelism with worker×thread auto-factoring.

For more details, see the full history in CHANGELOG.md.
