
A tool for joining multiple homogeneous tensorstore datasets that use the N5 driver into a single dataset.

Project description


tsconcat


This is a neat little tool that concatenates tensorstore datasets using the n5/zarr driver along a given axis, without having to write a new dataset (which would double the storage used on disk) or iterate over all datasets in a Python for-loop.

NOTE: This tool works under the assumption that your individual datasets are homogeneous, i.e. they all use the same block structure, datatype, and compression scheme.

Installation

You can install the package from PyPI

pip install tsconcat

or directly from GitHub by running

pip install git+https://github.com/luisherrmann/tsconcat

How it works


The tool exploits the file hierarchy of n5/zarr driver datasets to create concatenated datasets. Suppose you have two tensorstore datasets ds1 and ds2 with shapes [2, 2, 2] and [2, 3, 2], respectively. Furthermore, let's say the datasets use a block size of [1, 2, 1]. Then the respective nested file hierarchies would be:

ds1

.
├── metadata
├── 0
│   └── 0
│       ├── 0
│       └── 1
└── 1
    └── 0
        ├── 0
        └── 1

and

ds2

.
├── metadata
├── 0
│   ├── 0
│   │   ├── 0
│   │   └── 1
│   └── 1
│       ├── 0
│       └── 1
└── 1
    ├── 0
    │   ├── 0
    │   └── 1
    └── 1
        ├── 0
        └── 1

Here, directory nesting level 0 corresponds to dimension 0, nesting level 1 to dimension 1, and so on.
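
To make the mapping concrete, here is a minimal sketch of how the block-file paths follow from shape and block size; block_paths is a hypothetical helper for illustration, not part of tsconcat's API:

import math

def block_paths(shape, block_size):
    # Number of blocks along each dimension; termination blocks may be smaller.
    counts = [math.ceil(s / b) for s, b in zip(shape, block_size)]
    # Build the nested paths dimension by dimension.
    paths = [""]
    for n in counts:
        paths = [f"{p}/{i}" if p else str(i) for p in paths for i in range(n)]
    return paths

print(block_paths([2, 2, 2], [1, 2, 1]))  # ds1: ['0/0/0', '0/0/1', '1/0/0', '1/0/1']
print(block_paths([2, 3, 2], [1, 2, 1]))  # ds2: 8 paths, two block indices along dimension 1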

Let's say we want to concatenate along dimension 1 to obtain a dataset ds12. By assumption of homogeneity (same block size), the file hierarchy for a concatenated dataset must look exactly the same on all nesting levels above the concatenation level. So, all we have to do is link all directories from nesting level 1 into common directories. For the above example:

  1. ds1/0/0 -> ds12/0/0
  2. ds2/0/0 -> ds12/0/1
  3. ds2/0/1 -> ds12/0/2
  4. ds1/1/0 -> ds12/1/0
  5. ds2/1/0 -> ds12/1/1
  6. ds2/1/1 -> ds12/1/2

This leads to the following file hierarchy in the concatenated dataset:

ds12

.
├── metadata
├── 0
│   ├── 0 <---- 'ds1/0/0'
│   │   ├── 0
│   │   └── 1
│   ├── 1 <---- 'ds2/0/0'
│   │   ├── 0
│   │   └── 1
│   └── 2 <---- 'ds2/0/1'
│       ├── 0
│       └── 1
└── 1
    ├── 0 <---- 'ds1/1/0'
    │   ├── 0
    │   └── 1
    ├── 1 <---- 'ds2/1/0'
    │   ├── 0
    │   └── 1
    └── 2 <---- 'ds2/1/1'
        ├── 0
        └── 1

The metadata object of ds12 is written to match the concatenated dataset.
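
The linking step itself boils down to shifting block indices along the concatenation dimension. The following is a minimal sketch of that idea under the assumptions above (nested directories, symlinks); link_blocks and all names are illustrative, not tsconcat's actual implementation:

import itertools
import math
import os

def link_blocks(ds_paths, concat_path, shapes, block_size, cat_dim):
    offset = 0  # running block offset along the concatenation dimension
    for ds, shape in zip(ds_paths, shapes):
        counts = [math.ceil(s / b) for s, b in zip(shape, block_size)]
        # Enumerate all block-index prefixes down to (and including) cat_dim.
        for idx in itertools.product(*(range(n) for n in counts[:cat_dim + 1])):
            src = os.path.join(ds, *map(str, idx))
            # Shift the index along cat_dim by the number of blocks linked so far.
            shifted = idx[:cat_dim] + (idx[cat_dim] + offset,)
            dst = os.path.join(concat_path, *map(str, shifted))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            os.symlink(os.path.abspath(src), dst)
        offset += counts[cat_dim]

# For the example: ds1 contributes block 0 along dimension 1, ds2 blocks 1 and 2.
link_blocks(["ds1", "ds2"], "ds12", [[2, 2, 2], [2, 3, 2]], [1, 2, 1], cat_dim=1)

Writing the metadata object (shape [2, 5, 2] in this example) is a separate step, as noted above.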

NOTE: The n5 and zarr driver specifications require all blocks to be the same size (except for termination blocks). The specification is respected in the above example, where all blocks have size [1, 2, 1] except for the termination blocks 0/2/0, 0/2/1, 1/2/0 and 1/2/1, which have size [1, 1, 1]. But what if we change the concatenation order and build a dataset ds21 by concatenating ds2 with ds1?

In that case, all blocks would have size [1, 2, 1] except for the blocks 0/1/0, 0/1/1, 1/1/0 and 1/1/1, which have size [1, 1, 1]. These were termination blocks in ds2, but in ds21 they are NOT termination blocks! However, we can pretend they are blocks of the correct size [1, 2, 1] and mask out the excess.
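
A rough NumPy illustration of the masking idea (values and names are purely illustrative):

import numpy as np

# ds2's termination block holds a single valid row along dimension 1, but in
# ds21 it sits at the interior block index 1. Pretend it has the regular
# block size [1, 2, 1] ...
block = np.zeros((1, 2, 1))
valid = 1                       # rows along dimension 1 that actually hold data
data = block[:, :valid, :]      # ... and mask out the excess row on read
print(data.shape)               # (1, 1, 1)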

The zarr driver specification additionally allows for flat directory structures using . as a dimension separator. E.g. ds1 in the above example would look like this:

ds1

.
├── metadata
├── 0.0.0
├── 0.0.1
├── 1.0.0
└── 1.0.1

Since the number of blocks, and thus the number of files to be linked, does not change, flat and nested directory structures are handled equivalently, apart from the dimension separator.
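
The separator logic itself is a one-liner; block_filename is a hypothetical helper for illustration:

def block_filename(idx, sep="."):
    # '.' yields zarr's flat layout, '/' the nested one.
    return sep.join(map(str, idx))

print(block_filename((0, 0, 1)))        # 0.0.1
print(block_filename((0, 0, 1), "/"))   # 0/0/1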

Usage

You can perform the concatenation by running

tsconcat <CONCAT_PATH> <TS_PATH1> <TS_PATH2> ... <CAT_DIM> [-d <DRIVER>] [-s <DIMSEP>] [-p]

where CONCAT_PATH is the path of the target directory in which the concatenated dataset is created, and TS_PATH1, ... are the paths of the tensorstore datasets to be concatenated. CAT_DIM is the dimension along which to concatenate. As optional arguments, you can provide the <DRIVER> used by your tensorstores ('n5' or 'zarr') and the dimension separator <DIMSEP> to use (where '.' is only supported by the zarr driver). The -p flag enables a progress bar.
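
For example, to build the dataset ds12 from above by concatenating ds1 and ds2 along dimension 1 with the n5 driver and a progress bar (the paths are illustrative):

tsconcat ds12 ds1 ds2 1 -d n5 -p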

You can read from the concatenated dataset in Python as follows:

from tsconcat import ConcatDataset

concat_path = "ds12"  # path of the concatenated dataset (illustrative)
ds = ConcatDataset(concat_path)
# .read() returns a future; .result() blocks until the data is available.
data = ds[:].read().result()

Project details


Download files

Download the file for your platform.

Source Distribution

tsconcat-0.1.0.tar.gz (10.6 kB)


Built Distribution


tsconcat-0.1.0-py3-none-any.whl (9.4 kB)


File details

Details for the file tsconcat-0.1.0.tar.gz.

File metadata

  • Download URL: tsconcat-0.1.0.tar.gz
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.8.16 Darwin/22.1.0

File hashes

Hashes for tsconcat-0.1.0.tar.gz
  • SHA256: 88f0719086e39acc1f16f4618daaf8c67cda71b5b2e5ae3b4414155e2878f8ef
  • MD5: aa3208ea4918fece2e3643aa1c23268a
  • BLAKE2b-256: ef765f21d132738e8499f6049d844bfa0a66f5b736bb1a3a1de8e0adadff38f2


File details

Details for the file tsconcat-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tsconcat-0.1.0-py3-none-any.whl
  • Size: 9.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.8.16 Darwin/22.1.0

File hashes

Hashes for tsconcat-0.1.0-py3-none-any.whl
  • SHA256: 3b2c99a05a8a9ade490c35b78aff9175f6aacd86685173987ae07b19b4b20f5a
  • MD5: 38a9c2264af52c0a78f075156d3c062c
  • BLAKE2b-256: fb1f309ad4a762d018a6e4da1bc2c6a6f6662ce7a9af27267a98fb25a3979427

