
A library to compress ESRF data and reduce their footprint

ESRF Data Compressor

ESRF Data Compressor is a command-line tool and Python library designed to compress large ESRF HDF5 datasets (3D volumes) and verify data consistency via SSIM. The default compression backend uses Blosc2 + Grok (JPEG2000).
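
As a rough illustration of what an SSIM-based consistency check measures, here is a minimal sketch in pure Python. It is not the tool's implementation: it computes a single-window (global) SSIM over a whole frame, whereas typical SSIM implementations average over local windows. The function names, the `data_range` default and the `k1`/`k2` constants (the conventional 0.01 and 0.03) are assumptions made for this example.

```python
from statistics import fmean


def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally sized 2D frames (lists of rows).

    Simplified sketch: one window covering the whole frame, no Gaussian weighting.
    """
    a = [float(v) for row in x for v in row]
    b = [float(v) for row in y for v in row]
    if not a or len(a) != len(b):
        raise ValueError("frames must be non-empty and have the same size")

    # Stabilising constants, scaled by the dynamic range of the data.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean([(v - mu_a) * (v - mu_a) for v in a])
    var_b = fmean([(v - mu_b) * (v - mu_b) for v in b])
    cov = fmean([(p - mu_a) * (q - mu_b) for p, q in zip(a, b)])

    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a * mu_a + mu_b * mu_b + c1) * (var_a + var_b + c2)
    return numerator / denominator


def check_first_and_last(raw_volume, compressed_volume, data_range=255.0):
    """Compare only the first and last slices along axis 0 (hypothetical helper)."""
    return {
        "first": global_ssim(raw_volume[0], compressed_volume[0], data_range),
        "last": global_ssim(raw_volume[-1], compressed_volume[-1], data_range),
    }
```

A value of 1.0 means the two frames are identical under this measure; lossy compression at a higher ratio would be expected to lower the score.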


Features

  • Discover raw HDF5 dataset files under an experiment’s RAW_DATA

    • Goes through the HDF5 Virtual Datasets to find the data to compress
    • Supports filtering scan by scan based on the value of a key
  • Slice-by-slice compression

    • Uses Blosc2 + Grok (JPEG2000) on every slice of each 3D dataset (axis 0)
    • User-configurable compression ratio (e.g. --cratio 10)
  • Parallel execution

    • Automatically factors CPU cores into worker processes × per-process threads
    • By default, each worker runs up to 4 Blosc2 threads (falling back to 1 thread when fewer than 4 cores are available)
  • Non-destructive workflow

    1. compress writes compressed files either:
      • next to each source as <basename>_<compression_method>.h5 (--layout sibling), or
      • under a mirrored RAW_DATA_COMPRESSED tree using the same source file names, while copying non-compressed folders/files (--layout mirror, default)
    2. check computes SSIM (first and last frames) and writes a report
    3. overwrite (optional) replaces the raw frame file with its compressed counterpart (irreversible)
  • Four simple CLI subcommands

    • compress-hdf5 list  Show all raw HDF5 files to be processed
    • compress-hdf5 compress Generate compressed files (sibling or mirror layout)
    • compress-hdf5 check  Produce a per-dataset SSIM report between raw & compressed
    • compress-hdf5 overwrite Atomically replace each raw frame file (irreversible)
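
The two output layouts described above can be sketched as a pure path computation. This is an illustrative sketch only, not the package's code: the function name `compressed_path`, the method label `jp2k` and the example paths are hypothetical, and the sketch assumes the source path contains a `RAW_DATA` directory component.

```python
from pathlib import PurePosixPath


def compressed_path(source, layout="mirror", method="jp2k"):
    """Return where a compressed copy of `source` would be written.

    Hypothetical helper: no filesystem access, path arithmetic only.
    """
    source = PurePosixPath(source)

    if layout == "sibling":
        # <basename>_<compression_method>.h5, next to the source file.
        return source.with_name(f"{source.stem}_{method}.h5")

    if layout == "mirror":
        # Same file name, under a RAW_DATA_COMPRESSED tree mirroring RAW_DATA.
        parts = list(source.parts)
        if "RAW_DATA" not in parts:
            raise ValueError("source is not located under a RAW_DATA directory")
        parts[parts.index("RAW_DATA")] = "RAW_DATA_COMPRESSED"
        return PurePosixPath(*parts)

    raise ValueError(f"unknown layout: {layout!r}")


# Example with a made-up experiment path.
src = "/data/visitor/exp/RAW_DATA/sample/scan.h5"
print(compressed_path(src, layout="sibling"))
print(compressed_path(src, layout="mirror"))
```

In both layouts the raw file is left untouched, which is what keeps the workflow non-destructive until overwrite is run.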

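One way to read the worker × thread factoring from the Parallel execution feature is sketched below. The function name `plan_parallelism` and the integer-division rule are assumptions for illustration; the tool's actual factoring may distribute cores differently.

```python
import os


def plan_parallelism(n_cores=None, max_threads=4):
    """Split available cores into (worker processes, threads per worker).

    Hypothetical sketch: use `max_threads` threads per worker when enough
    cores exist, otherwise fall back to single-threaded workers.
    """
    if n_cores is None:
        n_cores = os.cpu_count() or 1
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")

    threads = max_threads if n_cores >= max_threads else 1
    workers = max(1, n_cores // threads)
    return workers, threads


print(plan_parallelism(16))  # 4 workers x 4 threads
print(plan_parallelism(2))   # 2 workers x 1 thread
```

With integer division, a core count that is not a multiple of 4 leaves some cores unused in this sketch (for example, 6 cores gives 1 worker with 4 threads).
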
Installation

From PyPI

pip install esrf-data-compressor

Once installed, the compress-hdf5 command will be available in your PATH.

From Source (for development)

git clone https://gitlab.esrf.fr/dau/esrf-data-compressor.git
cd esrf-data-compressor

# (Optional) Create & activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install build dependencies & the package itself
pip install .

Documentation

Full documentation is available online: ESRF Data Compressor Docs

Contributing & Development

  • Clone the repository:

    git clone https://gitlab.esrf.fr/dau/esrf-data-compressor.git
    cd esrf-data-compressor
    
  • Install dependencies (in a virtual environment):

    python -m venv venv
    source venv/bin/activate
    pip install -e ".[dev]"
    
  • Run tests with coverage:

    pytest -v --cov=esrf_data_compressor --cov-report=term-missing
    
  • Style:

    • black .
    • flake8 .
    • ruff check .
  • Build docs (Sphinx + pydata theme):

    sphinx-build doc build/html
    

License

This project is licensed under the MIT License. See LICENSE for full text.


Changelog

All noteworthy changes are recorded in CHANGELOG.md. Version 0.1.0 marks the first public release, which includes:

  • Initial implementation of Blosc2 + Grok (JPEG2000) compression for 3D HDF5 datasets.
  • SSIM-based integrity check (first & last slice).
  • Four-command CLI (compress-hdf5 list, compress-hdf5 compress, compress-hdf5 check, compress-hdf5 overwrite).
  • Parallelism with worker×thread auto-factoring.

For more details, see the full history in CHANGELOG.md.
