Skip to main content

Provides a sharded Zarr store.

Project description

shardedstore

image Test DOI

Provides a sharded Zarr store.

Features

  • For large Zarr stores, avoid an excessive number of objects or extremely large objects, which bypasses filesystem inode usage and object store limitations.
  • Performance-sensitive implementation.
  • Use existing Zarr v2 stores.
  • Mix and match shard store types.
  • Serialize and deserialize the ShardedStore in JSON.
  • Shard groups or array chunks.
  • Easily run transformations on store shards.

Installation

pip install shardedstore

Example

from shardedstore import ShardedStore, array_shard_directory_store, to_zip_store_with_prefix

from zarr.storage import DirectoryStore

# xarray example, but works with zarr in general
import xarray as xr
from datatree import DataTree, open_datatree
import json
import numpy as np
import os

Create component shard stores

base_store = DirectoryStore("base.zarr")
shard1 = DirectoryStore("shard1.zarr")
shard2 = DirectoryStore("shard2.zarr")
array_shards1 = array_shard_directory_store("array_shards1")
array_shards2 = array_shard_directory_store("array_shards2")

Generate data for the example

# xarray-datatree Quick Overview
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
# Sharded array dimensions must have a chunk shape of 1.
data = data.chunk([1,2])
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
ds2 = ds.interp(coords={"x": [10, 12, 14, 16, 18, 20]})
ds2 = ds2.chunk({'x':1, 'y':2})
ds3 = xr.Dataset(
    dict(people=["alice", "bob"], heights=("people", [1.57, 1.82])),
    coords={"species": "human"},
    )
dt = DataTree.from_dict({"simulation/coarse": ds, "simulation/fine": ds2, "/": ds3})

A monolithic store

single_store = DirectoryStore("single.zarr")
dt.to_zarr(single_store)

A sharded store demonstrating sharding on groups and arrays.

Arrays are sharded over 1 dimension.

sharded_store = ShardedStore(base_store,
    {'people': shard1, 'species': shard2},
    {'simulation/coarse/foo': (1, array_shards1), 'simulation/fine/foo': (1, array_shards2)})
dt.to_zarr(sharded_store)

Serialize / deserialize

config = sharded_store.get_config()
config_str = json.dumps(config)
config = json.loads(config_str)
sharded_store = ShardedStore.from_config(config)

Validate

from_single = open_datatree(single_store, engine='zarr').compute()
from_sharded = open_datatree(sharded_store, engine='zarr').compute()
assert from_single.identical(from_sharded)

Run transformations over component shards with map_shards

to_zip_stores = to_zip_store_with_prefix("zip_stores")
zip_sharded_stores = sharded_store.map_shards(to_zip_stores)

Development

Contributions are welcome and appreciated.

git clone https://github.com/thewtex/shardedstore
cd shardedstore
pip install -e ".[test]"
pytest

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

shardedstore-0.3.1.tar.gz (13.1 kB view hashes)

Uploaded Source

Built Distribution

shardedstore-0.3.1-py3-none-any.whl (10.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page