Skip to main content

Compression techniques for numpy arrays

Project description

fewerbytes

A numpy-based compression library to make your data require fewerbytes

Quick Start

pip install fewerbytes

import fewerbytes as fb
import numpy as np
arr = np.array([50000, 55000, 60000, 65000, 70000], dtype=np.uint32)
arr.nbytes  # 20 Bytes
new_arr, new_arr_type, details = fb.integer_minimize_compression(arr)
new_arr  # [0, 5000, 10000, 15000, 20000] dtype=uint16
new_arr_type  # NumpyType with kind UNSIGNED and size SHORT (16bit)
details  # IntegerMinimizeTransformation with reference_value 50000
decomp_arr = fb.integer_decompression_from_transform(new_arr, details)  # decompressed array
decomp_arr  # [50000, 55000, 60000, 65000, 70000]

Integer Compression

fewerbytes offers four types of integer compression

Simple Downcast

This technique simply calculates the minimum and maximum values of the array, and then attempts to down-cast the integers to a smaller storage size, e.g. from 64-bit integers to 16-bit integers

import fewerbytes as fb
import numpy as np
arr = np.full(10, fill_value=1, dtype=np.int64)
new_arr, new_arr_type = fb.downcast_integers(arr)  # values are stored in 8-bit unsigned integers

Minimize

This compression technique calculates the minimum array value and subtracts the value from each array element. This allows switching from signed integers to unsigned integers, which opens an extra bit, doubling the range of absolute values that can be stored in each class of integers (8-bit, 16-bit, etc.).

This technique is effective when values in an array are large, but when the difference between the min and max are not as large.

import fewerbytes as fb
import numpy as np
arr = np.full(10, fill_value=1000000)
fb.integer_minimize_compression(arr)  # (array, NumpyType) - array is all 0's

Derivative or Element-Wise

This compression technique calculates the element-wise difference of the array. The returned array has 1 fewer element than the original array.

This technique is effective when differences in values in an array are similar. For example, unix timestamps from some regularly occurring measurement.

import fewerbytes as fb
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
fb.integer_derivative_compression(arr)  # (array, numpy compressed array is [1, 1, 1, 1]

Hash Set

This compression technique calculates a unique set of values in the array. Then, for each element in the array, it stores the index of its value in the unique set. The transformation details contain the unique values.

This technique is effective when the original array has a relatively small set of unique values. Due to the compression overhead of doing key-lookups, this technique is abandoned if the byte-storage efficiency is improved by less than 20%.

import fewerbytes as fb
import numpy as np
arr = np.array([1000000, 1000001, 1000000, 1000000, 1000000], dtype=np.uint32)
arr.nbytes  # 20 bytes
keys, keys_type, transform = fb.integer_hash_compression(arr)
keys  # array [0, 1, 0, 0, 0]
keys_type  # UNSIGNED BYTE
transform.key_values  # array [1000000, 1000001]
transform.key_values_type  # UNSIGNED SINGLE (32-bit)

Integer Decompression

Integer decompression can be achieved using any of the following functions?

import fewerbytes as fb
# arr and transform obtained from compression
fb.integer_decompression_from_transform(arr, transform)
fb.integer_minimize_decompression(arr, transform)
fb.integer_derivative_decompression(arr, transform)
fb.integer_hash_decompression(arr, transform)
fb.integer_decompression_from_transforms(arr, list_of_transforms)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fewerbytes-0.0.1.tar.gz (8.9 kB view details)

Uploaded Source

Built Distribution

fewerbytes-0.0.1-py3-none-any.whl (9.4 kB view details)

Uploaded Python 3

File details

Details for the file fewerbytes-0.0.1.tar.gz.

File metadata

  • Download URL: fewerbytes-0.0.1.tar.gz
  • Upload date:
  • Size: 8.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/39.2.0 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/2.7.12

File hashes

Hashes for fewerbytes-0.0.1.tar.gz
Algorithm Hash digest
SHA256 789bd421fd1787aba8e13e5649f986cd4fe0accb9444180f781653a34c78038d
MD5 d0c811abf2c462e26eb272a9d3a55ba8
BLAKE2b-256 bed14da7283f967b93268bf854ea599ff34c7dfa3e6681cc39e59d7b9212241d

See more details on using hashes here.

File details

Details for the file fewerbytes-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: fewerbytes-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 9.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/39.2.0 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/2.7.12

File hashes

Hashes for fewerbytes-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 82a8bd17f854b6ffb84f3385f466c74691da71575a918867317a057a128f1488
MD5 ffb7231f37f0544ba6c651cef82ebbd4
BLAKE2b-256 a40169d7e542a47c4af06ae3caaa1e3c467d1934eecfff13043a3d7c37975e35

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page