
A tool for joining multiple homogeneous tensorstore datasets that use the N5 driver into a single dataset.

Project description


tsconcat


This is a neat little tool that concatenates tensorstore datasets using the n5/zarr driver along a given axis, without having to write a new dataset (which would double the storage used on disk) or iterate over all datasets in a Python for-loop.

NOTE: This tool works under the assumption that your individual datasets are homogeneous, i.e. they all use the same block structure, datatype, and compression scheme.

Installation

You can install the package from PyPI

pip install tsconcat

or directly from GitHub by running

pip install git+https://github.com/luisherrmann/tsconcat

How it works


The tool exploits the file hierarchy of n5/zarr driver datasets to create concatenated datasets. Suppose you have two tensorstore datasets ds1 and ds2 with shapes [2, 2, 2] and [2, 3, 2], respectively. Furthermore, let's say the datasets use a block size of [1, 2, 1]. Then the respective nested file hierarchies would be:

ds1

.
├── metadata
├── 0
│   └── 0
│       ├── 0
│       └── 1
└── 1
    └── 0
        ├── 0
        └── 1

and

ds2

.
├── metadata
├── 0
│   ├── 0
│   │   ├── 0
│   │   └── 1
│   └── 1
│       ├── 0
│       └── 1
└── 1
    ├── 0
    │   ├── 0
    │   └── 1
    └── 1
        ├── 0
        └── 1

Here, directory nesting level 0 corresponds to dimension 0, nesting level 1 to dimension 1, and so on.
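
To make the mapping concrete, here is a minimal sketch of how the block-file paths follow from shape and block size; block_paths is a hypothetical helper for illustration, not part of tsconcat's API:

import math

def block_paths(shape, block_size):
    # Number of blocks along each dimension; termination blocks may be smaller.
    counts = [math.ceil(s / b) for s, b in zip(shape, block_size)]
    # Build the nested paths dimension by dimension.
    paths = [""]
    for n in counts:
        paths = [f"{p}/{i}" if p else str(i) for p in paths for i in range(n)]
    return paths

print(block_paths([2, 2, 2], [1, 2, 1]))  # ds1: ['0/0/0', '0/0/1', '1/0/0', '1/0/1']
print(block_paths([2, 3, 2], [1, 2, 1]))  # ds2: 8 paths, two block indices along dimension 1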

Let's say we want to concatenate along dimension 1 to obtain a dataset ds12. By assumption of homogeneity (same block size), the file hierarchy for a concatenated dataset must look exactly the same on all nesting levels above the concatenation level. So, all we have to do is link all directories from nesting level 1 into common directories. For the above example:

  1. ds1/0/0 -> ds12/0/0
  2. ds2/0/0 -> ds12/0/1
  3. ds2/0/1 -> ds12/0/2
  4. ds1/1/0 -> ds12/1/0
  5. ds2/1/0 -> ds12/1/1
  6. ds2/1/1 -> ds12/1/2

This leads to the following file hierarchy in the concatenated dataset:

ds12

.
├── metadata
├── 0
│   ├── 0 <---- 'ds1/0/0'
│   │   ├── 0
│   │   └── 1
│   ├── 1 <---- 'ds2/0/0'
│   │   ├── 0
│   │   └── 1
│   └── 2 <---- 'ds2/0/1'
│       ├── 0
│       └── 1
└── 1
    ├── 0 <---- 'ds1/1/0'
    │   ├── 0
    │   └── 1
    ├── 1 <---- 'ds2/1/0'
    │   ├── 0
    │   └── 1
    └── 2 <---- 'ds2/1/1'
        ├── 0
        └── 1

The metadata object of ds12 is written to match the concatenated dataset.
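
The linking step itself boils down to shifting block indices along the concatenation dimension. The following is a minimal sketch of that idea under the assumptions above (nested directories, symlinks); link_blocks and all names are illustrative, not tsconcat's actual implementation:

import itertools
import math
import os

def link_blocks(ds_paths, concat_path, shapes, block_size, cat_dim):
    offset = 0  # running block offset along the concatenation dimension
    for ds, shape in zip(ds_paths, shapes):
        counts = [math.ceil(s / b) for s, b in zip(shape, block_size)]
        # Enumerate all block-index prefixes down to (and including) cat_dim.
        for idx in itertools.product(*(range(n) for n in counts[:cat_dim + 1])):
            src = os.path.join(ds, *map(str, idx))
            # Shift the index along cat_dim by the number of blocks linked so far.
            shifted = idx[:cat_dim] + (idx[cat_dim] + offset,)
            dst = os.path.join(concat_path, *map(str, shifted))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            os.symlink(os.path.abspath(src), dst)
        offset += counts[cat_dim]

# For the example: ds1 contributes block 0 along dimension 1, ds2 blocks 1 and 2.
link_blocks(["ds1", "ds2"], "ds12", [[2, 2, 2], [2, 3, 2]], [1, 2, 1], cat_dim=1)

Writing the metadata object (shape [2, 5, 2] in this example) is a separate step, as noted above.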

NOTE: The n5 and zarr driver specifications require all blocks to be the same size (except for termination blocks). The specification is respected in the above example, where all blocks have size [1, 2, 1] except for the termination blocks 0/2/0, 0/2/1, 1/2/0 and 1/2/1, which have size [1, 1, 1]. But what if we change the concatenation order and build a dataset ds21 by concatenating ds2 with ds1?

In that case, all blocks would have size [1, 2, 1] except for the blocks 0/1/0, 0/1/1, 1/1/0 and 1/1/1, which have size [1, 1, 1]. These were termination blocks in ds2, but in ds21 they are NOT termination blocks! However, we can pretend they are blocks of the correct size [1, 2, 1] and mask out the excess.
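
A rough NumPy illustration of the masking idea (values and names are purely illustrative):

import numpy as np

# ds2's termination block holds a single valid row along dimension 1, but in
# ds21 it sits at the interior block index 1. Pretend it has the regular
# block size [1, 2, 1] ...
block = np.zeros((1, 2, 1))
valid = 1                       # rows along dimension 1 that actually hold data
data = block[:, :valid, :]      # ... and mask out the excess row on read
print(data.shape)               # (1, 1, 1)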

The zarr driver specification additionally allows for flat directory structures using . as a dimension separator. E.g. ds1 in the above example would look like this:

ds1

.
├── metadata
├── 0.0.0
├── 0.0.1
├── 1.0.0
└── 1.0.1

Since the number of blocks, and thus the number of files to be linked, does not change, flat and nested directory structures are handled equivalently, apart from the dimension separator.
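
The separator logic itself is a one-liner; block_filename is a hypothetical helper for illustration:

def block_filename(idx, sep="."):
    # '.' yields zarr's flat layout, '/' the nested one.
    return sep.join(map(str, idx))

print(block_filename((0, 0, 1)))        # 0.0.1
print(block_filename((0, 0, 1), "/"))   # 0/0/1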

Usage

You can perform the concatenation by running

tsconcat <CONCAT_PATH> <TS_PATH1> <TS_PATH2> ... <CAT_DIM> [-d <DRIVER>] [-s <DIMSEP>] [-p]

where CONCAT_PATH is the path of the target directory in which the concatenated dataset is created, and TS_PATH1, ... are the paths of the tensorstore datasets to be concatenated. CAT_DIM is the dimension along which to concatenate. As optional arguments, you can provide the <DRIVER> used by your tensorstores ('n5' or 'zarr') and the dimension separator <DIMSEP> to use (where '.' is only supported by the zarr driver). The -p flag enables a progress bar.
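
For example, to build the dataset ds12 from above by concatenating ds1 and ds2 along dimension 1 with the n5 driver and a progress bar (the paths are illustrative):

tsconcat ds12 ds1 ds2 1 -d n5 -p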

You can read from the concatenated dataset in Python as follows:

from tsconcat import ConcatDataset

concat_path = "ds12"  # path of the concatenated dataset (illustrative)
ds = ConcatDataset(concat_path)
# .read() returns a future; .result() blocks until the data is available.
data = ds[:].read().result()

Project details


Download files

Download the file for your platform.

Source Distribution

tsconcat-0.1.0.tar.gz (10.6 kB)


Built Distribution


tsconcat-0.1.0-py3-none-any.whl (9.4 kB)


File details

Details for the file tsconcat-0.1.0.tar.gz.

File metadata

  • Download URL: tsconcat-0.1.0.tar.gz
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.8.16 Darwin/22.1.0

File hashes

Hashes for tsconcat-0.1.0.tar.gz
  • SHA256: 88f0719086e39acc1f16f4618daaf8c67cda71b5b2e5ae3b4414155e2878f8ef
  • MD5: aa3208ea4918fece2e3643aa1c23268a
  • BLAKE2b-256: ef765f21d132738e8499f6049d844bfa0a66f5b736bb1a3a1de8e0adadff38f2


File details

Details for the file tsconcat-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tsconcat-0.1.0-py3-none-any.whl
  • Size: 9.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.8.16 Darwin/22.1.0

File hashes

Hashes for tsconcat-0.1.0-py3-none-any.whl
  • SHA256: 3b2c99a05a8a9ade490c35b78aff9175f6aacd86685173987ae07b19b4b20f5a
  • MD5: 38a9c2264af52c0a78f075156d3c062c
  • BLAKE2b-256: fb1f309ad4a762d018a6e4da1bc2c6a6f6662ce7a9af27267a98fb25a3979427

