A tool for joining multiple homogeneous tensorstore datasets using the n5/zarr driver into a single dataset.
Project description
tsconcat
This is a neat little tool that concatenates tensorstore datasets using the n5/zarr driver along a given axis. It produces the concatenated dataset without writing a new copy (which would double the storage used on disk) and without iterating over all datasets in a Python for-loop.
NOTE: This tool works under the assumption that your individual datasets are homogeneous, i.e. they all use the same block structure, datatype, and compression scheme.
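For intuition, such a homogeneity check amounts to comparing the metadata fields that must agree across all inputs. A minimal sketch, assuming zarr-style metadata dicts with hypothetical field names (this is not tsconcat's actual validation code):

```python
def are_homogeneous(metadatas, keys=("chunks", "dtype", "compressor")):
    """Return True if all metadata dicts agree on block structure,
    datatype, and compression (hypothetical helper)."""
    first = metadatas[0]
    return all(all(m[k] == first[k] for k in keys) for m in metadatas[1:])

# Two datasets with identical block size, dtype, and compression are compatible.
ds1_meta = {"chunks": [1, 2, 1], "dtype": "<f4", "compressor": {"id": "blosc"}}
ds2_meta = {"chunks": [1, 2, 1], "dtype": "<f4", "compressor": {"id": "blosc"}}
print(are_homogeneous([ds1_meta, ds2_meta]))  # True
```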
Installation
You can install the package from PyPI
pip install tsconcat
or directly from GitHub by running
pip install git+https://github.com/luisherrmann/tsconcat
How it works
The tool exploits the file hierarchy of n5/zarr driver datasets to create concatenated datasets. Suppose you have two tensorstore datasets ds1 and ds2 with shapes [2, 2, 2] and [2, 3, 2], respectively. Furthermore, let's say both datasets use a block size of [1, 2, 1]. Then the respective nested file hierarchies would be:
ds1
.
├── metadata
├── 0
│   └── 0
│       ├── 0
│       └── 1
└── 1
    └── 0
        ├── 0
        └── 1
and
ds2
.
├── metadata
├── 0
│   ├── 0
│   │   ├── 0
│   │   └── 1
│   └── 1
│       ├── 0
│       └── 1
└── 1
    ├── 0
    │   ├── 0
    │   └── 1
    └── 1
        ├── 0
        └── 1
Here, directory nesting level 0 corresponds to dimension 0, nesting level 1 corresponds to dimension 1, and so on.
Let's say we want to concatenate along dimension 1 to obtain a dataset ds12. By the homogeneity assumption (same block size), the file hierarchy of the concatenated dataset must look exactly the same at all nesting levels above the concatenation level. So all we have to do is link the block directories at nesting level 1 into a common hierarchy, renumbering them along the concatenation dimension. For the above example:
ds1/0/0 -> ds12/0/0
ds2/0/0 -> ds12/0/1
ds2/0/1 -> ds12/0/2
ds1/1/0 -> ds12/1/0
ds2/1/0 -> ds12/1/1
ds2/1/1 -> ds12/1/2
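The renumbering can be sketched as a small standalone helper (illustrative only, not part of tsconcat): given the number of blocks each dataset contributes along the concatenation axis, it assigns every source block its index in the concatenated dataset.

```python
from typing import List, Tuple

def remap_blocks(block_counts: List[int]) -> List[Tuple[int, int, int]]:
    """For each dataset, map its block indices along the concatenation
    axis to indices in the concatenated dataset.

    Returns (dataset_index, source_block, target_block) triples.
    """
    mapping = []
    offset = 0  # running block count of all preceding datasets
    for ds_idx, count in enumerate(block_counts):
        for src in range(count):
            mapping.append((ds_idx, src, offset + src))
        offset += count
    return mapping

# Along dimension 1, ds1 contributes 1 block and ds2 contributes 2
# (block size 2 over extents 2 and 3).
print(remap_blocks([1, 2]))  # [(0, 0, 0), (1, 0, 1), (1, 1, 2)]
```

The same mapping is applied under every parent directory above the concatenation level, which is why 0/… and 1/… are relinked identically in the listing above.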
This leads to the following file hierarchy in the concatenated dataset:
ds12
.
├── metadata
├── 0
│   ├── 0 <---- 'ds1/0/0'
│   │   ├── 0
│   │   └── 1
│   ├── 1 <---- 'ds2/0/0'
│   │   ├── 0
│   │   └── 1
│   └── 2 <---- 'ds2/0/1'
│       ├── 0
│       └── 1
└── 1
    ├── 0 <---- 'ds1/1/0'
    │   ├── 0
    │   └── 1
    ├── 1 <---- 'ds2/1/0'
    │   ├── 0
    │   └── 1
    └── 2 <---- 'ds2/1/1'
        ├── 0
        └── 1
The metadata object of *ds12* is written to match the concatenated dataset.
NOTE: The n5 and zarr driver specifications require all blocks to be of the same size, except for termination blocks (the final blocks along a dimension). The above example respects this: all blocks have size [1, 2, 1], except for the termination blocks under 0/2 and 1/2, which have size [1, 1, 1]. But what if we reverse the concatenation order and build a dataset ds21 by concatenating ds2 with ds1?
In that case, all blocks would have size [1, 2, 1] except for the blocks under 0/1 and 1/1, which have size [1, 1, 1]. But these are NOT termination blocks! However, we can pretend they are blocks of the correct size [1, 2, 1] and mask out the excess.
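Which blocks end up truncated follows from simple ceiling-division arithmetic, sketched below (illustrative only, not tsconcat code):

```python
import math

def blocks_along(extent: int, block: int) -> int:
    """Number of chunk files along one dimension."""
    return math.ceil(extent / block)

def last_block_size(extent: int, block: int) -> int:
    """Size of the final (possibly truncated) block along one dimension."""
    rem = extent % block
    return rem if rem else block

# Along dimension 1: ds1 has extent 2, ds2 has extent 3, block size 2.
print(blocks_along(2, 2), last_block_size(2, 2))  # 1 2 -> ds1's only block is full
print(blocks_along(3, 2), last_block_size(3, 2))  # 2 1 -> ds2 ends in a truncated block
```

In ds12, ds2's truncated blocks land at the very end of dimension 1 and remain valid termination blocks; in ds21 they land in the middle, which is why the excess has to be masked out.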
The zarr driver specification additionally allows for flat directory structures using . as a dimension separator. E.g. ds1 in the above example would look like this:
ds1
.
├── metadata
├── 0.0.0
├── 0.0.1
├── 1.0.0
└── 1.0.1
Since the number of blocks and thus of files to be linked does not change, scenarios with flat or nested directory structures are equivalent with the exception of the dimension separator.
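The only difference between the two layouts is how a chunk's per-dimension indices are joined into a file name, as in this sketch (a hypothetical helper, not tsconcat's API):

```python
def flat_chunk_name(indices, dimension_separator="."):
    """Join per-dimension chunk indices into a chunk key,
    e.g. (0, 0, 1) -> '0.0.1' for the flat layout;
    the nested layout uses '/' instead."""
    return dimension_separator.join(str(i) for i in indices)

print(flat_chunk_name((0, 0, 1)))       # 0.0.1
print(flat_chunk_name((0, 0, 1), "/"))  # 0/0/1
```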
Usage
You can perform the concatenation by running
tsconcat <CONCAT_PATH> <TS_PATH1> <TS_PATH2> ... <CAT_DIM> [-d <DRIVER>] [-s <DIMSEP>] [-p]
Here, CONCAT_PATH is the path of the target directory in which the concatenated dataset is created, and TS_PATH1, ... are the paths of the tensorstore datasets to be concatenated. CAT_DIM is the dimension along which to concatenate. As optional arguments, you can provide the <DRIVER> used by your tensorstores ('n5' or 'zarr') and the dimension separator <DIMSEP> to use (where '.' is only supported by the zarr driver). The -p flag enables a progress bar.
You can read from the datasets in python as follows:
from tsconcat import ConcatDataset
ds = ConcatDataset(concat_path)
data = ds[:].read().result()
Download files
File details
Details for the file tsconcat-0.1.0.tar.gz.
File metadata
- Download URL: tsconcat-0.1.0.tar.gz
- Upload date:
- Size: 10.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.8.16 Darwin/22.1.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
88f0719086e39acc1f16f4618daaf8c67cda71b5b2e5ae3b4414155e2878f8ef
|
|
| MD5 |
aa3208ea4918fece2e3643aa1c23268a
|
|
| BLAKE2b-256 |
ef765f21d132738e8499f6049d844bfa0a66f5b736bb1a3a1de8e0adadff38f2
|
File details
Details for the file tsconcat-0.1.0-py3-none-any.whl.
File metadata
- Download URL: tsconcat-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.8.16 Darwin/22.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3b2c99a05a8a9ade490c35b78aff9175f6aacd86685173987ae07b19b4b20f5a |
| MD5 | 38a9c2264af52c0a78f075156d3c062c |
| BLAKE2b-256 | fb1f309ad4a762d018a6e4da1bc2c6a6f6662ce7a9af27267a98fb25a3979427 |