Package that performs compression by rounding.

elePyant

elePyant (pronounced elephant) provides a set of tools for compressing netCDF files, xarray Datasets and xarray DataArrays. It works by ridding data of 'meaningless digits' before saving the rounded dataset as a compressed netCDF file.

To give an idea of the performance that can be obtained, I was able to reduce a file from 1 GB to 30 MB using only the functions within this package. No qualitative difference is visible in the dataset.

The compression is based on work by Milan Kloewer. Often when working with data, we only 'know' the value of a quantity to a few significant figures. When we store it, however, we save the value as a 32- or 64-bit floating-point number, which carries roughly 7 or 15 to 16 significant decimal digits respectively. This is usually overkill. By rounding away the digits of surplus precision in our dataset, we create repeating patterns in the binary used to encode the data. Lossless compression algorithms can then exploit these patterns to reduce the file size.
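The effect described above can be seen without any netCDF machinery. The following is a minimal sketch, not elePyant's own code: it assumes synthetic normally distributed data and uses zlib from the Python standard library as a stand-in for the lossless compressor, then compares the compressed size of an array before and after rounding.

```python
import zlib

import numpy as np

# Synthetic data standing in for a real field (an assumption for illustration).
rng = np.random.default_rng(0)
data = rng.normal(loc=15.0, scale=5.0, size=100_000).astype(np.float32)

# Lossy step: discard digits beyond two decimal places.
rounded = np.around(data, decimals=2)

# Lossless step: compress the raw bytes of each array with the same algorithm.
raw_size = len(zlib.compress(data.tobytes()))
rounded_size = len(zlib.compress(rounded.tobytes()))

print(f"unrounded: {raw_size} bytes, rounded: {rounded_size} bytes")
```

The rounded array takes only a few thousand distinct values, so its byte stream repeats far more often than the unrounded one and compresses to a smaller size. The exact ratio depends on the data and the compressor.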

The compression relies on the user having a good understanding of the data they are working with. It is up to the user to decide the 'true' precision of their dataset so that they can select appropriate rounding. The method of compression may not be suitable for all purposes as the initial rounding stage of the process is lossy.
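One way to sanity-check a choice of rounding is to measure how much the data actually change. The sketch below is an illustration only: `max_rounding_error` is a hypothetical helper, not part of elePyant, and the temperature values and the assumed sensor accuracy are made up.

```python
import numpy as np


def max_rounding_error(values, decimal_places):
    """Largest absolute change introduced by rounding to `decimal_places`."""
    values = np.asarray(values, dtype=np.float64)
    return float(np.max(np.abs(np.around(values, decimal_places) - values)))


# Hypothetical temperatures from a sensor assumed accurate to about 0.1 degrees.
temperatures = np.array([12.3456, 15.9871, 9.0042, 21.5555])

# Report the worst-case change for several candidate roundings.
for dp in (0, 1, 2, 3):
    print(dp, max_rounding_error(temperatures, dp))
```

If the largest error for a candidate rounding sits comfortably below the uncertainty you believe the data have, that rounding discards only digits you could not trust anyway.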

Example usage

The functions contained within the package have been designed to work with objects from the xarray ecosystem. For anyone already using xarray objects in their workflow, adopting the package takes very little effort. For instance, saving an xarray dataset goes from

ds.to_netcdf('output_file')

to

import elePyant as ep
ep.compress_dataset(ds, 'output_file', decimal_places=2)

The new function takes the dataset, ds, rounds all the data variables (but not coordinate variables) within it to two decimal places and then saves the resulting dataset to the file 'output_file'. Similar functions exist for xr.DataArray objects and netCDF files.
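For readers curious what such a round-then-save step might look like in plain xarray, here is a rough sketch. It is not elePyant's implementation: `round_data_vars` and `save_rounded` are hypothetical names, and the choice of zlib compression at level 4 is an assumption.

```python
import numpy as np
import xarray as xr


def round_data_vars(ds, decimal_places):
    """Return a copy of `ds` with every data variable rounded; coordinates are untouched."""
    rounded = ds.copy()
    for name in ds.data_vars:
        rounded[name] = ds[name].round(decimal_places)
    return rounded


def save_rounded(ds, path, decimal_places, complevel=4):
    """Round the data variables, then write them with zlib compression enabled."""
    rounded = round_data_vars(ds, decimal_places)
    encoding = {
        name: {"zlib": True, "complevel": complevel} for name in rounded.data_vars
    }
    rounded.to_netcdf(path, encoding=encoding)


# A tiny made-up dataset with one data variable and one coordinate.
ds = xr.Dataset(
    {"temperature": ("x", np.array([1.23456, 2.34567, 3.45678]))},
    coords={"x": np.array([0.12345, 0.23456, 0.34567])},
)
rounded = round_data_vars(ds, 2)
print(rounded["temperature"].values, rounded["x"].values)
```

Calling save_rounded(ds, 'output_file.nc', decimal_places=2) would then write the file, provided a netCDF backend that supports compression (for example h5netcdf or netCDF4) is installed.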

Advanced functionality allows the user to specify the rounding to use for each variable in a netCDF file. Users can also specify which variables not to round. For instance, if you had an xr.Dataset object containing the data variables 'UVEL', 'VVEL' and 'WVEL', you could use the following command

ep.compress_dataset(ds, 'out.nc', decimal_places={'UVEL': 2,
                                                  'VVEL': 2,
                                                  'WVEL': 6})

which will round both 'UVEL' and 'VVEL' to two decimal places, but 'WVEL' to six. Alternatively, you may not wish to round 'WVEL' at all, in which case you could use

ep.compress_dataset(ds, 'out.nc', decimal_places=2, ignore_vars='WVEL')

Note that by default coordinates are never rounded. If you wish to round a coordinate, then the argument decimal_places must be a dictionary containing the coordinate you wish to round as a key.
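The rules above (a single number applies to data variables only, a dictionary can opt a coordinate in, and ignored variables are skipped) can be summarised as a small lookup. The function below is a hypothetical sketch of that logic, not the package's code. In particular, it assumes that variables missing from a dictionary are left unrounded, which the description above does not state.

```python
def resolve_decimal_places(data_vars, coords, decimal_places, ignore_vars=()):
    """Map each variable that should be rounded to its number of decimal places."""
    # Accept a single name as well as a list of names.
    if isinstance(ignore_vars, str):
        ignore_vars = [ignore_vars]
    ignored = set(ignore_vars)

    if isinstance(decimal_places, dict):
        # Only names listed explicitly are rounded; this is how a coordinate opts in.
        # Assumption: anything not listed is left unrounded.
        known = set(data_vars) | set(coords)
        plan = {name: dp for name, dp in decimal_places.items() if name in known}
    else:
        # A single number applies to every data variable and to no coordinate.
        plan = {name: decimal_places for name in data_vars}

    return {name: dp for name, dp in plan.items() if name not in ignored}


plan = resolve_decimal_places(
    ["UVEL", "VVEL", "WVEL"], ["lat", "lon"], 2, ignore_vars="WVEL"
)
print(plan)
```

Passing a dictionary such as {'UVEL': 2, 'lat': 4} in place of the number 2 would yield a plan that also rounds the 'lat' coordinate.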

Installation

To run elePyant you will need a version of Python 3 with the following packages installed:

  • numpy
  • xarray
  • h5netcdf

To install in development mode from the command line you can use:

pip install -e git+https://github.com/fraserwg/elePyant.git#egg=elePyant

which checks out the repository and installs it in editable mode, so changes pulled into that checkout take effect without reinstalling.

Alternatively you can clone the repository to your computer using

git clone https://github.com/fraserwg/elePyant.git
cd elePyant
pip install -e ./

or

git clone https://github.com/fraserwg/elePyant.git
cd elePyant
python setup.py build
python setup.py install

You can then update to the latest version as and when you like by performing a git pull.

In the future I may make it possible to install from PyPI or using conda.

Updates and feature requests

If you make a modification to the code that you think would be cool to share with the world, I welcome pull requests. Ditto for bug reports. Alternatively, if you have an idea that you think I should implement, let me know and I'll see what I can do.
