
A library to compress ESRF data and reduce their footprint


ESRF Data Compressor

ESRF Data Compressor is a command-line tool and Python library designed to compress large ESRF HDF5 datasets (3D volumes) and verify data consistency via SSIM. The default compression backend uses Blosc2 + Grok (JPEG2000).


Features

  • Discover raw HDF5 dataset files under an experiment’s RAW_DATA

    • Goes through the HDF5 Virtual Datasets to find the data to compress
    • Allows filtering scan by scan based on the value of a key
  • Slice-by-slice compression

    • Uses Blosc2 + Grok (JPEG2000) on every slice of each 3D dataset (axis 0)
    • User-configurable compression ratio (e.g. --cratio 10)
  • Parallel execution

    • Automatically factors CPU cores into worker processes × per-process threads
    • By default, each worker runs up to 2 Blosc2 threads (or falls back to 1 thread when fewer than 2 cores are available)

  • Non-destructive workflow

    1. compress writes compressed files either:
      • next to each source as <basename>_<compression_method>.h5 (--layout sibling), or
      • under a mirrored RAW_DATA_COMPRESSED tree using the same source file names, while copying non-compressed folders/files (--layout mirror, default)
    2. check computes SSIM (first and last frames) and writes a report
    3. overwrite (optional) swaps out the raw frame file (irreversible)
  • Four simple CLI subcommands

    • compress-hdf5 list  Show all raw HDF5 files to be processed
    • compress-hdf5 compress Write compressed copies (mirrored tree by default, or siblings with --layout sibling)
    • compress-hdf5 check  Produce a per-dataset SSIM report between raw & compressed
    • compress-hdf5 overwrite Atomically replace each raw frame file (irreversible)
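
The core factoring and the two output layouts listed above can be illustrated with a minimal sketch. This is not the package's actual code: the helper names plan_parallelism and compressed_path, the method suffix "jp2k", the worker-count formula, and the assumption that RAW_DATA_COMPRESSED sits beside RAW_DATA are all hypothetical, chosen only to match the behaviour described in the feature list.

```python
import os
from pathlib import Path


def plan_parallelism(total_cores=None, threads_per_worker=2):
    """Split CPU cores into (workers, threads).

    Hypothetical helper: follows the documented default of up to 2 Blosc2
    threads per worker, with 1 thread when fewer than 2 cores are available.
    The worker count (cores // threads) is an assumption.
    """
    cores = total_cores or os.cpu_count() or 1
    threads = threads_per_worker if cores >= threads_per_worker else 1
    workers = max(1, cores // threads)
    return workers, threads


def compressed_path(src, layout="mirror", method="jp2k"):
    """Return where a compressed copy of `src` would be written.

    Hypothetical helper: "jp2k" stands in for <compression_method>, and the
    mirrored tree is assumed to replace the RAW_DATA path component.
    """
    src = Path(src)
    if layout == "sibling":
        # <basename>_<compression_method>.h5 next to the source file
        return src.with_name(f"{src.stem}_{method}.h5")
    if layout == "mirror":
        # Same file name under a mirrored RAW_DATA_COMPRESSED tree
        parts = ["RAW_DATA_COMPRESSED" if p == "RAW_DATA" else p for p in src.parts]
        return Path(*parts)
    raise ValueError(f"unknown layout: {layout!r}")


if __name__ == "__main__":
    print(plan_parallelism(8))
    print(compressed_path("exp/RAW_DATA/sample/scan0001.h5", "sibling"))
    print(compressed_path("exp/RAW_DATA/sample/scan0001.h5", "mirror"))
```

Under these assumptions, the sibling layout keeps outputs beside the raw files, while the mirror layout leaves the RAW_DATA tree untouched until overwrite is run.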

Installation

From PyPI

pip install esrf-data-compressor

Once installed, the compress-hdf5 command will be available in your PATH.
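
To confirm that the entry point was installed into the active environment, a short check along these lines may help. It relies only on the standard library; the helper name find_cli is hypothetical and not part of the package.

```python
import shutil


def find_cli(name="compress-hdf5"):
    """Return the absolute path of the command if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_cli()
    if path:
        print(f"compress-hdf5 found at {path}")
    else:
        print("compress-hdf5 not found on PATH; is the right environment active?")
```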

From Source (for development)

git clone https://gitlab.esrf.fr/dau/esrf-data-compressor.git
cd esrf-data-compressor

# (Optional) Create & activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install build dependencies & the package itself
pip install .

Documentation

Full documentation is available online: ESRF Data Compressor Docs

Contributing & Development

  • Clone the repository:

    git clone https://gitlab.esrf.fr/dau/esrf-data-compressor.git
    cd esrf-data-compressor
    
  • Install dependencies (in a virtual environment):

    python -m venv venv
    source venv/bin/activate
    pip install -e ".[dev]"
    
  • Run tests with coverage:

    pytest -v --cov=esrf_data_compressor --cov-report=term-missing
    
  • Style:

    • black .
    • flake8 .
    • ruff check .
  • Build docs (Sphinx + pydata theme):

    sphinx-build doc build/html
    

License

This project is licensed under the MIT License. See LICENSE for full text.


Changelog

All noteworthy changes are recorded in CHANGELOG.md. Version 0.1.0 marks the first public release with:

  • Initial implementation of Blosc2 + Grok (JPEG2000) compression for 3D HDF5 datasets.
  • SSIM-based integrity check (first & last slice).
  • Four-command CLI (compress-hdf5 list, compress-hdf5 compress, compress-hdf5 check, compress-hdf5 overwrite).
  • Parallelism with worker×thread auto-factoring.

For more details, see the full history in CHANGELOG.md.
