
Bitshuffle filter for improving typed data compression.


Filter for improving compression of typed binary data.

Bitshuffle is an algorithm that rearranges typed, binary data to improve compression, as well as a python/C package that implements this algorithm within the Numpy framework.

The library can be used alongside HDF5 to compress and decompress datasets and is integrated through the dynamically loaded filters framework. Bitshuffle is HDF5 filter number 32008.

Algorithmically, Bitshuffle is closely related to HDF5’s Shuffle filter except it operates at the bit level instead of the byte level. Arranging a typed data array into a matrix with the elements as the rows and the bits within the elements as the columns, Bitshuffle “transposes” the matrix, such that all the least-significant bits are in one row, all the next-most-significant bits in the next, and so on. This transpose is performed within blocks of data roughly 8 kB long [1].
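As a rough illustration of the idea, here is a minimal numpy sketch (not the library’s optimized implementation; the real code handles bit ordering and blocking more carefully):

import numpy as np

def toy_bit_transpose(arr):
    # View each element as a row of bits, then transpose so that bits
    # occupying the same position across elements become contiguous.
    elem_bits = arr.dtype.itemsize * 8
    bits = np.unpackbits(arr.view(np.uint8)).reshape(arr.size, elem_bits)
    return np.packbits(bits.T)

# Small integers leave the high-order bit columns almost entirely zero,
# producing long runs that a fast duplicate-string compressor handles well.
data = np.arange(1024, dtype=np.uint16)
print(toy_bit_transpose(data)[:16])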

This does not in itself compress data, only rearranges it for more efficient compression. To perform the actual compression you will need a compression library. Bitshuffle has been designed to be well matched to Marc Lehmann’s LZF as well as LZ4 and ZSTD. Note that because Bitshuffle modifies the data at the bit level, sophisticated entropy-reducing compression libraries such as GZIP and BZIP are unlikely to achieve significantly better compression than simpler and faster duplicate-string-elimination algorithms such as LZF, LZ4 and ZSTD. Bitshuffle thus includes routines (and HDF5 filter options) to apply LZ4 and ZSTD compression to each block after shuffling [2].

The Bitshuffle algorithm relies on neighbouring elements of a dataset being highly correlated to improve data compression. Any correlations that span at least 24 elements of the dataset may be exploited to improve compression.

Bitshuffle was designed with performance in mind. On most machines the time required for Bitshuffle+LZ4 is insignificant compared to the time required to read or write the compressed data to disk. Because it is able to exploit the SSE and AVX instruction sets present on modern Intel and AMD processors, on these machines compression is only marginally slower than an out-of-cache memory copy. On modern x86 processors you can expect Bitshuffle to have a throughput of roughly 1 byte per clock cycle, and on the Haswell generation of Intel processors (2013) and later, you can expect up to 2 bytes per clock cycle. In addition, Bitshuffle is parallelized using OpenMP.

As a bonus, Bitshuffle ships with a dynamically loaded version of h5py’s LZF compression filter, such that the filter can be transparently used outside of python and in command line utilities such as h5dump.

Applications

Bitshuffle might be right for your application if:

  • You need to compress typed binary data.

  • Your data is arranged such that adjacent elements over the fastest varying index of your dataset are similar (highly correlated).

  • A special case of the previous point is if you are only exercising a subset of the bits in your data-type, as is often true of integer data.

  • You need both high compression ratios and high performance.

Comparing Bitshuffle to other compression algorithms and HDF5 filters:

  • Bitshuffle is less general than many other compression algorithms. To achieve good compression ratios, consecutive elements of your data must be highly correlated.

  • For the right datasets, Bitshuffle is one of the few compression algorithms that promises both high throughput and high compression ratios.

  • Bitshuffle should have roughly the same throughput as Shuffle, but may obtain higher compression ratios.

  • The MAFISC filter actually includes something similar to Bitshuffle as one of its prefilters. However, MAFISC’s emphasis is on obtaining high compression ratios at all costs, sacrificing throughput.

Installation for Python

In most cases bitshuffle can be installed by pip:

pip install bitshuffle

On Linux and macOS x86_64 platforms binary wheels are available; on other platforms a source build will be performed. The binary wheels are built with AVX2 support and will only run on processors that support these instructions (most processors from 2015 onwards, i.e. Intel Haswell, AMD Excavator and later). On an unsupported processor these builds of bitshuffle will crash with SIGILL. To run on unsupported x86_64 processors, or to target newer instructions such as AVX512, perform a build from source. This can be forced by giving pip the --no-binary=bitshuffle option.

Source installation requires python 2.7+ or 3.3+, HDF5 1.8.4 or later, HDF5 for python (h5py), Numpy and Cython. Bitshuffle is linked against HDF5. To use the dynamically loaded HDF5 filter requires HDF5 1.8.11 or later.

For total control, bitshuffle can be built using python setup.py. If ZSTD support is to be enabled, the ZSTD repo needs to be pulled into bitshuffle before installation with:

git submodule update --init

To build and install bitshuffle:

python setup.py install [--h5plugin [--h5plugin-dir=spam] --zstd]

To get finer control of installation options, including whether to compile with OpenMP multi-threading and the target microarchitecture, copy setup.cfg.example to setup.cfg and edit the values therein.

If using the dynamically loaded HDF5 filter (which gives you access to the Bitshuffle and LZF filters outside of python), set the environment variable HDF5_PLUGIN_PATH to the value of --h5plugin-dir or use HDF5’s default search location of /usr/local/hdf5/lib/plugin.

ZSTD support is enabled with --zstd.

If you get an error about missing source files when building the extensions, try upgrading setuptools. There is a weird bug where setuptools prior to 0.7 doesn’t work properly with Cython in some cases.

Usage from Python

The bitshuffle module contains routines for shuffling and unshuffling Numpy arrays.
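A minimal round-trip sketch (assuming the module-level routines bitshuffle.bitshuffle, bitshuffle.bitunshuffle, bitshuffle.compress_lz4 and bitshuffle.decompress_lz4; see the docstrings for the exact signatures):

import numpy as np
import bitshuffle

data = np.arange(4096, dtype=np.uint32)

# Shuffle and unshuffle in memory (no compression involved).
shuffled = bitshuffle.bitshuffle(data)
restored = bitshuffle.bitunshuffle(shuffled)
assert np.array_equal(data, restored)

# Shuffle and LZ4-compress each block, then reverse the whole pipeline.
compressed = bitshuffle.compress_lz4(data)
decompressed = bitshuffle.decompress_lz4(compressed, data.shape, data.dtype)
assert np.array_equal(data, decompressed)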

If installed with the dynamically loaded filter plugins, Bitshuffle can be used in conjunction with HDF5 both inside and outside of python, in the same way as any other filter, simply by specifying the filter number 32008. Otherwise the filter will be available only within python and only after importing bitshuffle.h5. Reading Bitshuffle-encoded datasets is transparent. The filter can be added to new datasets either through the h5py low-level interface or through the convenience functions provided in bitshuffle.h5. See the docstrings and unit tests for examples. For h5py version 2.5.0 and later, Bitshuffle can be added to new datasets through the high-level interface, as in the example below.

The compression algorithm can be configured using the filter_opts argument of bitshuffle.h5.create_dataset(). LZ4 is chosen with (BLOCK_SIZE, h5.H5_COMPRESS_LZ4) and ZSTD with (BLOCK_SIZE, h5.H5_COMPRESS_ZSTD, COMP_LVL). See test_h5filter.py for an example.

Example h5py

import h5py
import numpy
import bitshuffle.h5

print(h5py.__version__) # >= '2.5.0'

filename = "example.h5"  # placeholder path for this example
f = h5py.File(filename, "w")

# block_size = 0 lets Bitshuffle choose its value
block_size = 0

dataset = f.create_dataset(
    "data",
    (100, 100, 100),
    compression=bitshuffle.h5.H5FILTER,
    compression_opts=(block_size, bitshuffle.h5.H5_COMPRESS_LZ4),
    dtype='float32',
    )

# create some random data
array = numpy.random.rand(100, 100, 100)
array = array.astype('float32')

dataset[:] = array

f.close()
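If the filter was built with --zstd, ZSTD can be selected instead by changing compression_opts. A hedged variant of the example above (the file name and compression level are placeholders):

import h5py
import numpy
import bitshuffle.h5

block_size = 0  # 0 lets Bitshuffle choose the block size
comp_lvl = 3    # placeholder ZSTD compression level

with h5py.File("example_zstd.h5", "w") as f:
    dataset = f.create_dataset(
        "data",
        (100, 100, 100),
        compression=bitshuffle.h5.H5FILTER,
        compression_opts=(block_size, bitshuffle.h5.H5_COMPRESS_ZSTD, comp_lvl),
        dtype="float32",
        )
    dataset[:] = numpy.random.rand(100, 100, 100).astype("float32")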

Usage from C

If you wish to use Bitshuffle in your C program and would prefer not to use the HDF5 dynamically loaded filter, the C library in the src/ directory is self-contained and complete.

Usage from Java

Bitshuffle can also be used from Java: the shuffling and unshuffling routines have been ported to snappy-java. To use them, add the following dependency to your pom.xml:

<dependency>
  <groupId>org.xerial.snappy</groupId>
  <artifactId>snappy-java</artifactId>
  <version>1.1.3-M1</version>
</dependency>

First, import org.xerial.snappy.BitShuffle in your Java code:

import org.xerial.snappy.BitShuffle;

Then, you use them like this:

int[] data = new int[] {1, 3, 34, 43, 34};
byte[] shuffledData = BitShuffle.bitShuffle(data);
int[] result = BitShuffle.bitUnShuffleIntArray(shuffledData);

Rust HDF5 plugin

If you wish to open HDF5 files compressed with bitshuffle in your Rust program, there is a Rust binding for it. In your Cargo.toml:

[dependencies]
...
hdf5-bitshuffle = "0.9"
...

To register the plugin in your code:

use hdf5_bitshuffle::register_bitshuffle_plugin;

fn main() {
    register_bitshuffle_plugin();
}

Anaconda

The conda package can be built via:

conda build conda-recipe

For Best Results

Here are a few tips to help you get the most out of Bitshuffle:

  • For multi-dimensional datasets, order your data such that the fastest varying dimension is the one over which your data is most correlated (i.e. has values that change the least), or fake this using chunks (see the sketch after this list).

  • To achieve the highest throughput, use a data type that is 64 bytes or smaller. If you have a very large compound data type, consider adding a dimension to your datasets instead.

  • To make full use of the SSE2 instruction set, use a data type whose size is a multiple of 2 bytes. For the AVX2 instruction set, use a data type whose size is a multiple of 4 bytes.
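As an illustration of the first tip, here is a hedged sketch (the file name and data are placeholders) that stores detector-style frames with one frame per chunk, so each Bitshuffle block sees only the well-correlated pixel axes:

import numpy
import h5py
import bitshuffle.h5

# 50 frames of 512 x 512 pixels; neighbouring pixels within a frame are similar.
frames = numpy.random.randint(0, 200, size=(50, 512, 512)).astype("uint16")

f = h5py.File("frames.h5", "w")
dataset = f.create_dataset(
    "frames",
    frames.shape,
    dtype=frames.dtype,
    chunks=(1, 512, 512),  # one frame per chunk: pixel axes vary fastest
    compression=bitshuffle.h5.H5FILTER,
    compression_opts=(0, bitshuffle.h5.H5_COMPRESS_LZ4),
    )
dataset[:] = frames
f.close()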

Citing Bitshuffle

Bitshuffle was initially described in https://doi.org/10.1016/j.ascom.2015.07.002, pre-print available at https://arxiv.org/abs/1503.00638.

