
A minimal implementation of chunked, compressed, N-dimensional arrays for Python.


Installation

Installation currently requires NumPy and Cython to be installed beforehand. zarr is currently compatible only with Python >= 3.4.

Install from PyPI:

$ pip install -U zarr

Install from GitHub:

$ pip install -U git+https://github.com/alimanfoo/zarr.git@master

Status

Highly experimental, pre-alpha. Bug reports and pull requests are very welcome.

Design goals

  • Chunking in multiple dimensions

  • Resize any dimension

  • Concurrent reads

  • Concurrent writes

  • Release the GIL during compression and decompression
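
The concurrency goals fit together: if compression releases the GIL, several threads can compress or decompress different chunks at the same time. The sketch below illustrates the idea with the standard library only. It uses zlib (which, in CPython, releases the GIL while it works on a buffer) as a stand-in for zarr's Blosc-based compression, and the helper names compress_chunks and decompress_chunks are hypothetical, not part of zarr.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(chunks, level=5, workers=4):
    """Compress each chunk (a bytes object) in a worker thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so result i belongs to chunk i.
        return list(pool.map(lambda c: zlib.compress(c, level), chunks))


def decompress_chunks(compressed, workers=4):
    """Decompress each chunk in a worker thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.decompress, compressed))


# Ten independent 400,000-byte chunks of highly compressible data.
chunks = [bytes([i]) * 400_000 for i in range(10)]
compressed = compress_chunks(chunks)
restored = decompress_chunks(compressed)
```

Because each chunk is compressed independently, no coordination between threads is needed beyond collecting the results.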

Usage

Create an array:

>>> import numpy as np
>>> import zarr
>>> z = zarr.empty((10000, 1000), dtype='i4', chunks=(1000, 100))
>>> z
zarr.ext.Array((10000, 1000), int32, chunks=(1000, 100), cname='blosclz', clevel=5, shuffle=1)
  nbytes: 38.1M; cbytes: 0

Fill it with some data:

>>> z[:] = np.arange(10000000, dtype='i4').reshape(10000, 1000)
>>> z
zarr.ext.Array((10000, 1000), int32, chunks=(1000, 100), cname='blosclz', clevel=5, shuffle=1)
  nbytes: 38.1M; cbytes: 2.0M; ratio: 19.3
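
The nbytes figure can be reproduced by hand. The sketch below assumes that "M" means 2**20 bytes, which matches the numbers printed above; array_stats is a hypothetical helper, not a zarr function.

```python
import math


def array_stats(shape, chunks, itemsize):
    """Return (uncompressed size in bytes, number of chunks) for a chunked array."""
    nbytes = math.prod(shape) * itemsize
    # Each dimension needs ceil(length / chunk_length) chunks.
    nchunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    return nbytes, nchunks


# 'i4' items are 4 bytes each.
nbytes, nchunks = array_stats((10000, 1000), (1000, 100), itemsize=4)
size_m = round(nbytes / 2**20, 1)  # size in units of 2**20 bytes
```

Here nbytes is 40,000,000 (shown as 38.1M) and the array is split into 100 chunks. The printed ratio is consistent with nbytes divided by cbytes.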

Obtain a NumPy array by slicing:

>>> z[:]
array([[      0,       1,       2, ...,     997,     998,     999],
       [   1000,    1001,    1002, ...,    1997,    1998,    1999],
       [   2000,    2001,    2002, ...,    2997,    2998,    2999],
       ...,
       [9997000, 9997001, 9997002, ..., 9997997, 9997998, 9997999],
       [9998000, 9998001, 9998002, ..., 9998997, 9998998, 9998999],
       [9999000, 9999001, 9999002, ..., 9999997, 9999998, 9999999]], dtype=int32)
>>> z[:100]
array([[    0,     1,     2, ...,   997,   998,   999],
       [ 1000,  1001,  1002, ...,  1997,  1998,  1999],
       [ 2000,  2001,  2002, ...,  2997,  2998,  2999],
       ...,
       [97000, 97001, 97002, ..., 97997, 97998, 97999],
       [98000, 98001, 98002, ..., 98997, 98998, 98999],
       [99000, 99001, 99002, ..., 99997, 99998, 99999]], dtype=int32)
>>> z[:, :100]
array([[      0,       1,       2, ...,      97,      98,      99],
       [   1000,    1001,    1002, ...,    1097,    1098,    1099],
       [   2000,    2001,    2002, ...,    2097,    2098,    2099],
       ...,
       [9997000, 9997001, 9997002, ..., 9997097, 9997098, 9997099],
       [9998000, 9998001, 9998002, ..., 9998097, 9998098, 9998099],
       [9999000, 9999001, 9999002, ..., 9999097, 9999098, 9999099]], dtype=int32)
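
Each of these slices only overlaps some of the chunks. The sketch below counts the overlapping chunks for a selection, assuming a regular chunk grid and non-empty (start, stop) ranges; the function names are hypothetical and this is not zarr's internal code.

```python
def chunk_range(start, stop, chunk_len):
    """Indices of the chunks overlapping the half-open interval [start, stop)."""
    return range(start // chunk_len, (stop - 1) // chunk_len + 1)


def chunks_touched(selection, chunks):
    """Count chunks overlapping a selection given as (start, stop) per dimension."""
    count = 1
    for (start, stop), chunk_len in zip(selection, chunks):
        count *= len(chunk_range(start, stop, chunk_len))
    return count


chunks = (1000, 100)
rows_100 = chunks_touched([(0, 100), (0, 1000)], chunks)      # z[:100]
cols_100 = chunks_touched([(0, 10000), (0, 100)], chunks)     # z[:, :100]
everything = chunks_touched([(0, 10000), (0, 1000)], chunks)  # z[:]
```

With chunks of (1000, 100), both z[:100] and z[:, :100] overlap 10 of the 100 chunks, although z[:100] returns only 100,000 items and z[:, :100] returns 1,000,000.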

Resize the array and add more data:

>>> z.resize(20000, 1000)
>>> z
zarr.ext.Array((20000, 1000), int32, chunks=(1000, 100), cname='blosclz', clevel=5, shuffle=1)
  nbytes: 76.3M; cbytes: 2.0M; ratio: 38.5
>>> z[10000:, :] = np.arange(10000000, dtype='i4').reshape(10000, 1000)
>>> z
zarr.ext.Array((20000, 1000), int32, chunks=(1000, 100), cname='blosclz', clevel=5, shuffle=1)
  nbytes: 76.3M; cbytes: 4.0M; ratio: 19.3

For convenience, an append() method is also available, which can be used to append data to any axis:

>>> a = np.arange(10000000, dtype='i4').reshape(10000, 1000)
>>> z = zarr.array(a, chunks=(1000, 100))
>>> z
zarr.ext.Array((10000, 1000), int32, chunks=(1000, 100), cname='blosclz', clevel=5, shuffle=1)
  nbytes: 38.1M; cbytes: 2.0M; ratio: 19.3
>>> z.append(a+a)
>>> z
zarr.ext.Array((20000, 1000), int32, chunks=(1000, 100), cname='blosclz', clevel=5, shuffle=1)
  nbytes: 76.3M; cbytes: 3.6M; ratio: 21.2
>>> z.append(np.vstack([a, a]), axis=1)
>>> z
zarr.ext.Array((20000, 2000), int32, chunks=(1000, 100), cname='blosclz', clevel=5, shuffle=1)
  nbytes: 152.6M; cbytes: 7.6M; ratio: 20.2
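
The shapes in the append() examples follow a simple rule, sketched below under the assumption that, as with NumPy concatenation, every dimension other than the append axis must match. The helper appended_shape is hypothetical and only mirrors the shape bookkeeping, not zarr's implementation.

```python
def appended_shape(shape, data_shape, axis=0):
    """Shape after appending data along axis; other dimensions must match."""
    if len(shape) != len(data_shape):
        raise ValueError("data must have the same number of dimensions")
    for i, (n, m) in enumerate(zip(shape, data_shape)):
        if i != axis and n != m:
            raise ValueError(f"dimension {i} mismatch: {n} != {m}")
    return tuple(n + m if i == axis else n
                 for i, (n, m) in enumerate(zip(shape, data_shape)))


# z.append(a+a): append 10000 rows along axis 0.
step1 = appended_shape((10000, 1000), (10000, 1000))
# z.append(np.vstack([a, a]), axis=1): append 1000 columns along axis 1.
step2 = appended_shape(step1, (20000, 1000), axis=1)
```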

Tuning

zarr is designed for use in parallel computations that work chunk-wise over data. Try it with dask.array.

zarr is optimised for accessing and storing data in contiguous slices that are the same size as, or larger than, a chunk. It is not and will never be optimised for single item access.
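
To see why single item access is costly, note that a compressed chunk generally has to be decompressed in full before any one item in it can be read. The sketch below demonstrates this with zlib and struct from the standard library as stand-ins; it is an illustration of the cost, not zarr's storage format, and read_item is a hypothetical helper.

```python
import struct
import zlib

ITEMSIZE = 4              # bytes per int32 item
CHUNK_ITEMS = 1000 * 100  # items in one (1000, 100) chunk

# Build one chunk of little-endian int32 values and compress it.
raw = struct.pack(f"<{CHUNK_ITEMS}i", *range(CHUNK_ITEMS))
stored = zlib.compress(raw, 5)


def read_item(stored_chunk, index):
    """Read a single item; the whole chunk is decompressed to reach it."""
    data = zlib.decompress(stored_chunk)
    (value,) = struct.unpack_from("<i", data, index * ITEMSIZE)
    return value, len(data)


value, bytes_decompressed = read_item(stored, 12345)
```

Reading one 4-byte item here means decompressing all 400,000 bytes of the chunk, whereas a slice covering the whole chunk pays the same decompression cost for 100,000 items.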

Chunk sizes >= 1M are generally good. Optimal chunk shape will depend on the correlation structure in your data.
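
A quick way to check a candidate chunk shape against this guideline is to compute its uncompressed size. The text does not say whether "1M" counts bytes or items; the sketch below assumes bytes (2**20), and chunk_nbytes is a hypothetical helper.

```python
import math


def chunk_nbytes(chunks, itemsize):
    """Uncompressed size in bytes of one chunk."""
    return math.prod(chunks) * itemsize


# Chunk shape used in the usage examples, with 4-byte 'i4' items.
small = chunk_nbytes((1000, 100), itemsize=4)
# A hypothetical larger chunk shape for comparison.
large = chunk_nbytes((1000, 1000), itemsize=4)
meets_guideline = large >= 2**20
```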

Acknowledgments

zarr uses c-blosc internally for compression and decompression and borrows code heavily from bcolz.

Download files


Source Distribution

zarr-0.2.7.tar.gz (427.3 kB view details)

Uploaded Source

File details

Details for the file zarr-0.2.7.tar.gz.

File metadata

  • Download URL: zarr-0.2.7.tar.gz
  • Upload date:
  • Size: 427.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for zarr-0.2.7.tar.gz
Algorithm Hash digest
SHA256 ff4521d4e17521bdfd95689228b7515323a29633a13648f37ea0639ab31f74ec
MD5 b5f9aab43435ae4574aef7d2f34c620d
BLAKE2b-256 5231bafd4d70674f35b19c2b11da41873645d0973392cc2f7cf60b1976a9cc03
