Python package that hashes content of Earth science data files and compares to reference files.
earthdata-hashdiff
This repository contains functionality to read Earth science data file formats and hash the contents of those files into a JSON object. This enables the easy storage of a smaller artefact for tasks such as regression tests, while omitting metadata and data attributes that may change between test executions (such as timestamps in history attributes).
Features
Generating hashed files
JSON files that contain SHA-256 hash values for all variables and groups in
a netCDF4 or HDF5 file can be generated using either the create_h5_hash_file
or create_nc4_hash_file function:
from earthdata_hashdiff import create_nc4_hash_file
create_nc4_hash_file('path/to/netcdf/file.nc4', 'path/to/output/hash.json')
The functions to create the hash files have two additional optional arguments:
- skipped_metadata_attributes: this is a set of strings. When specified, the hashing functionality will not include metadata attributes with that exact name in the calculation of the hash for all variables or groups.
- xarray_kwargs: this dictionary allows users to specify keyword arguments to xarray when the input file is opened as a dictionary of group objects. The default value for this kwarg is to turn off all xarray decoding for CF Conventions, coordinates, times and time deltas.
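To illustrate why skipping a volatile attribute keeps a hash stable between runs, here is a minimal conceptual sketch that uses only the standard library. It is not the package's implementation: the hash_variable helper, the JSON serialisation of values and attributes, and the sample attribute values are all assumptions made for illustration.

```python
import hashlib
import json


def hash_variable(values, attributes, skipped_metadata_attributes=frozenset()):
    """Return a SHA-256 hex digest for a variable's values and metadata.

    Hypothetical helper for illustration only; earthdata-hashdiff may
    serialise and combine data and metadata differently.
    """
    # Drop attributes whose names exactly match an entry in the skip set.
    kept = {
        name: value
        for name, value in attributes.items()
        if name not in skipped_metadata_attributes
    }
    # Sort keys so the digest does not depend on attribute ordering.
    payload = json.dumps({'values': values, 'attributes': kept}, sort_keys=True)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


# Two runs that differ only in a timestamped "history" attribute.
values = [271.5, 272.0, 273.15]
run_one = {'units': 'K', 'history': '2024-01-01T00:00:00 created'}
run_two = {'units': 'K', 'history': '2024-06-30T12:00:00 created'}

# Without skipping, the changing attribute changes the hash.
hashes_differ = hash_variable(values, run_one) != hash_variable(values, run_two)

# Skipping "history" makes the two runs hash identically.
hashes_match = hash_variable(values, run_one, {'history'}) == hash_variable(
    values, run_two, {'history'}
)
```

In this sketch the skip set changes the digest itself, so a hash generated with one skip set can only be compared against a hash generated with the same skip set.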
Comparisons against reference files
When a JSON file exists with hashed values, it can be used for comparisons. The
public API provides h5_matches_reference_hash_file and
nc4_matches_reference_hash_file, although both are aliases for the same
underlying functionality using xarray:
from earthdata_hashdiff import nc4_matches_reference_hash_file
assert nc4_matches_reference_hash_file(
    'path/to/netcdf/file.nc4',
    'path/to/json/with/hashes.json',
)
The comparison functions have three optional arguments:
- skipped_variables_or_groups: the input for this kwarg is a set of strings. The strings are the full paths to variables and groups whose generated hashes should not be checked against the values in the JSON reference hash file. Note that the comparison function will still check that the input file contains the named variables and/or groups, even though it doesn't check their hashed values.
- skipped_metadata_attributes: this set of strings, when specified, omits matching metadata attributes from the hash calculation for all variables and groups. If metadata attributes were specified as skipped when generating the JSON file containing hashes, the same metadata attributes will need to be specified as skipped during comparison, to ensure the hashes match.
- xarray_kwargs: this dictionary allows users to specify keyword arguments to xarray when the input file is opened as a dictionary of group objects. The default value for this kwarg is to turn off all xarray decoding for CF Conventions, coordinates, times and time deltas.
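The behaviour described for skipped_variables_or_groups (presence is still checked, the hash value is not) can be sketched with plain dictionaries. This is a conceptual illustration only: the matches_reference helper, the placeholder hash strings and the variable paths are hypothetical, and requiring both files to contain exactly the same set of paths is an assumption rather than confirmed package behaviour.

```python
def matches_reference(generated, reference, skipped_variables_or_groups=frozenset()):
    """Compare two mappings of variable/group path -> hash string.

    Hypothetical helper for illustration; not part of earthdata-hashdiff.
    """
    # Assumption: both mappings must list the same paths. This means a
    # skipped path must still be present in the generated hashes.
    if set(generated) != set(reference):
        return False
    # Only compare hash values for paths that were not skipped.
    return all(
        generated[path] == reference[path]
        for path in reference
        if path not in skipped_variables_or_groups
    )


reference = {'/lat': 'aaa', '/lon': 'bbb', '/group/time': 'ccc'}
generated = {'/lat': 'aaa', '/lon': 'bbb', '/group/time': 'zzz'}

# The hash for /group/time differs, so a strict comparison fails.
strict_result = matches_reference(generated, reference)

# Skipping /group/time ignores its hash value, so the comparison passes.
skipped_result = matches_reference(generated, reference, {'/group/time'})

# A skipped path that is missing from the input still fails the comparison.
missing_result = matches_reference(
    {'/lat': 'aaa', '/lon': 'bbb'}, reference, {'/group/time'}
)
```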
Installing
Using pip
Install the latest version of the package from PyPI using pip:
$ pip install earthdata-hashdiff
Other methods:
For local development, it is possible to clone the repository and then install the version being developed in editable mode:
$ git clone https://github.com/nasa/earthdata-hashdiff
$ cd earthdata-hashdiff
$ pip install -e .
Developing
Development within this repository should occur on a feature branch. Pull
Requests (PRs) are created with a target of the main branch before being
reviewed and merged.
Releases are created when a feature branch is merged to main and that branch
also contains an update to the earthdata_hashdiff.__about__.py file.
Development Setup:
Prerequisites:
- Python 3.10+, ideally installed in a virtual environment, such as
pyenv or conda.
- A local copy of this repository.
For example, to set up a conda virtual environment:
conda create --name earthdata-hashdiff python=3.12 --channel conda-forge \
--override-channels -y
conda activate earthdata-hashdiff
Install dependencies:
pip install -r requirements.txt -r dev-requirements.txt -r tests/test_requirements.txt
Running tests
earthdata-hashdiff uses pytest to execute tests. Once test requirements have
been installed via pip, you can execute the tests:
pytest tests
The CI/CD workflows that execute the tests also make use of pytest plugins to
additionally create code test coverage reports and JUnit XML output. These
extra outputs can be produced with the following command:
pytest tests --junitxml=tests/reports/earthdata-hashdiff_junit.xml \
--cov earthdata_hashdiff --cov-report html:tests/coverage --cov-report term
This will produce:
- The test results (pass/fail) in the terminal.
- A coverage report in the terminal running the tests. The coverage report will
cover the contents within the earthdata_hashdiff directory.
- An HTML format coverage report in the tests/coverage directory. The entry point for this output is tests/coverage/index.html.
- JUnit style output in tests/reports/earthdata-hashdiff_junit.xml.
pre-commit hooks
This repository uses pre-commit to enable pre-commit checks that enforce coding standard best practices. These include:
- Removing trailing whitespace.
- Removing blank lines at the end of a file.
- Ensuring JSON files have valid formats.
- ruff Python linting checks.
- black Python code formatting checks.
- mypy Type hint checking and enforcement.
To enable these checks locally:
# Install pre-commit Python package:
pip install pre-commit
# Install the git hook scripts:
pre-commit install
Versioning
Releases of earthdata-hashdiff adhere to semantic version
numbers: major.minor.patch.
- Major increments: These are non-backwards compatible API changes.
- Minor increments: These are backwards compatible API changes.
- Patch increments: These updates do not affect the package's API.
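As a small illustration of the major.minor.patch scheme, the sketch below classifies the difference between two version strings. The parse_version and bump_type helpers are hypothetical and not part of the package, and the version numbers other than 1.0.0 are made up for the example.

```python
def parse_version(version):
    """Split a 'major.minor.patch' string into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split('.'))
    return major, minor, patch


def bump_type(old, new):
    """Name the most significant version component that changed."""
    labels = ('major', 'minor', 'patch')
    for label, before, after in zip(labels, parse_version(old), parse_version(new)):
        if before != after:
            return label
    return 'none'


major_bump = bump_type('1.0.0', '2.0.0')  # non-backwards compatible API change
minor_bump = bump_type('1.0.0', '1.1.0')  # backwards compatible API change
patch_bump = bump_type('1.1.0', '1.1.1')  # no API change
```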
Contributing
Contributions are welcome! For more information see CONTRIBUTING.md.