Fast format for datasets

Granular

Granular is a format for datasets, from simple to complex. Each Granular dataset is a collection of linked files in bag file format, a seekable container structure. Granular comes with a high-performance data loader.

pip install granular

Features

🚀 Performance: High read and write throughput locally and on Cloud.
🔎 Seeking: Fast random access from disk by datapoint index.
🎞️ Sequences: Datapoints can contain seekable lists of modalities.
🤸 Flexibility: User provides encoders and decoders; examples available.
👥 Sharding: Store datasets into shards to split processing workloads.
🔄 Determinism: Deterministic and resumable global shuffling per epoch.
✅ Correctness: A suite of unit tests with high code coverage.

Quickstart

import pathlib
import granular
import numpy as np

directory = pathlib.Path('./dataset')

Writing

spec = {
    'foo': 'int',      # integer
    'bar': 'utf8[]',   # *list* of strings
    'baz': 'msgpack',  # packed structure
    'abc': 'jpg',      # image
    'xyz': 'array',    # array
}

with granular.DatasetWriter(directory, spec, granular.encoders) as writer:
  for i in range(10):
    datapoint = {
        'foo': i,
        'bar': ['hello'] * i,
        'baz': {'a': 1},
        'abc': np.zeros((60, 80, 3), np.uint8),
        'xyz': np.arange(0, 1 + i, dtype=np.float32),
    }
    writer.append(datapoint)

print(sorted(x.name for x in directory.glob('*')))
# ['abc.bag', 'bar.bag', 'baz.bag', 'foo.bag', 'refs.bag', 'spec.json', 'xyz.bag']

Reading

with granular.DatasetReader(directory, granular.decoders) as reader:
  print(reader.spec)    # {'foo': 'int', 'bar': 'utf8[]', 'baz': 'msgpack', ...}
  print(reader.size)    # Dataset size in bytes
  print(len(reader))    # Number of datapoints

  datapoint = reader[2]
  print(datapoint['foo'])        # 2
  print(datapoint['bar'])        # ['hello', 'hello']
  print(datapoint['abc'].shape)  # (60, 80, 3)

def preproc(datapoint, seed):
  return {'image': datapoint['abc'], 'label': datapoint['foo']}

source = granular.sources.Epochs(reader, shuffle=True, seed=0)
source = granular.sources.Transform(source, preproc)

loader = granular.Loader(source, batch=8, workers=64)

print(loader.spec)
# {'image': (np.uint8, (60, 80, 3)), 'label': (np.int64, ())}

dataset = iter(loader)
for _ in range(100):
  batch = next(dataset)
  print(batch['image'].shape)  # (8, 60, 80, 3)

Advanced

Filesystems

Custom filesystems are supported by providing different Path implementations. For example, on Google Cloud you can use the Path from elements that is optimized for data loading throughput:

import elements  # pip install elements

directory = elements.Path('gs://<bucket>/dataset')

reader = granular.DatasetReader(directory, ...)
writer = granular.DatasetWriter(directory, ...)

Formats

Granular does not impose a serialization solution on the user. Any strings can be used as types in spec, as long as their encoder and decoder functions are provided, for example:

import msgpack

encoders = {
    'bytes': lambda x: x,
    'utf8': lambda x: x.encode('utf-8'),
    'msgpack': msgpack.packb,
}

decoders = {
    'bytes': lambda x: x,
    'utf8': lambda x: x.decode('utf-8'),
    'msgpack': msgpack.unpackb,
}

Examples of common encode and decode functions are provided in formats.py. These support Numpy arrays, images, videos, and more. They can be used as granular.encoders and granular.decoders.
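As a sketch of how a custom type might be added, the following registers a hypothetical 'json' type next to the entries shown above. It assumes, as the lambdas above suggest, that an encoder maps a value to bytes and a decoder maps bytes back to a value. The 'json' name, the helper functions, and the example spec are illustrative and not part of Granular.

```python
import json

def encode_json(value):
  # Serialize to a compact UTF-8 byte string.
  return json.dumps(value, separators=(',', ':')).encode('utf-8')

def decode_json(buffer):
  # Accept bytes-like input and parse it back into Python objects.
  return json.loads(bytes(buffer).decode('utf-8'))

encoders = {
    'bytes': lambda x: x,
    'utf8': lambda x: x.encode('utf-8'),
    'json': encode_json,  # Hypothetical type name chosen for this example.
}

decoders = {
    'bytes': lambda x: x,
    'utf8': lambda x: x.decode('utf-8'),
    'json': decode_json,
}

def base_type(typename):
  # Strip the sequence suffix so 'utf8[]' resolves to the 'utf8' functions.
  return typename[:-2] if typename.endswith('[]') else typename

spec = {'captions': 'utf8[]', 'meta': 'json'}

# Check up front that every type in the spec can be encoded and decoded.
missing = [
    t for t in map(base_type, spec.values())
    if t not in encoders or t not in decoders]
assert not missing, missing

value = {'a': 1, 'tags': ['x', 'y']}
assert decoders['json'](encoders['json'](value)) == value
```

Checking the spec against both dictionaries before writing catches a missing decoder early, rather than when the dataset is first read.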

Resuming

The dataloader is fully deterministic and resumable, given only the step and seed integers. To resume, save the state dictionary returned by loader.save() together with your checkpoint and pass it into loader.load() when restoring from that checkpoint.

state = loader.save()
print(state)  # {'step': 100, 'seed': 0}
loader.load(state)
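To illustrate why a step and a seed can be enough to resume, here is a minimal sketch of seeded per-epoch shuffling. It is not Granular's implementation: the function names are hypothetical and the standard library shuffle stands in for whatever the loader does internally. The only idea it demonstrates is that the order of each epoch is a pure function of the seed and the epoch number.

```python
import random

def epoch_order(length, seed, epoch):
  # A fresh permutation per epoch that depends only on (seed, epoch).
  order = list(range(length))
  random.Random(seed * 1_000_003 + epoch).shuffle(order)
  return order

def index_at(step, length, seed):
  # Map a global step to a datapoint index without any stored history.
  epoch, position = divmod(step, length)
  return epoch_order(length, seed, epoch)[position]

def iterate(state, length, count):
  # Yield `count` datapoint indices starting from the given state.
  for step in range(state['step'], state['step'] + count):
    yield index_at(step, length, state['seed'])

# An uninterrupted run and a run resumed at step 7 visit the same indices.
full = list(iterate({'step': 0, 'seed': 0}, length=10, count=25))
resumed = list(iterate({'step': 7, 'seed': 0}, length=10, count=18))
assert resumed == full[7:]
```

Recomputing the permutation on every lookup is wasteful and only keeps the sketch short; a real loader would cache the order of the current epoch.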

Caching

Retrieving a datapoint requires first reading from refs.bag to find the references into the other bag files, and then reading from each of the modality bag files. If some of the modalities are small enough, they can be cached in RAM by setting cache_keys. In general, it is recommended to cache refs as well as all small modalities, such as integer labels.

Additionally, reading from a Bag file requires two read operations. The first operation looks at the index table at the end of the file to locate the byte offset of the record. The second operation retrieves the actual record. In general, it is recommended to cache the index for all Bag files. Together, the tables take up 8 * len(spec) * len(reader) bytes of RAM.

reader = granular.DatasetReader(
    directory, decoders,
    cache_index=True,            # Cache index tables of all bag files in memory.
    cache_keys=('refs', 'foo'),  # Fully cache refs.bag and foo.bag in memory.
)
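The two-read pattern and the effect of caching the index can be sketched with a toy container. This is not the real Bag layout: the file structure below (records, then a table of 8-byte offsets, then a record count) and all names are assumptions made for illustration. The last function simply evaluates the memory formula given above.

```python
import io
import struct

def write_toy_bag(records):
  # Concatenate the records, then append an offset table and a record count.
  buf = io.BytesIO()
  offsets = [0]
  for record in records:
    buf.write(record)
    offsets.append(buf.tell())
  # Index table: one 8-byte little-endian offset per record boundary.
  buf.write(struct.pack(f'<{len(offsets)}Q', *offsets))
  # Footer: the record count, so a reader can locate the start of the table.
  buf.write(struct.pack('<Q', len(records)))
  return buf.getvalue()

class ToyBagReader:

  def __init__(self, data, cache_index=False):
    self.file = io.BytesIO(data)
    self.reads = 0  # Counts read operations issued per record lookup.
    (self.length,) = struct.unpack('<Q', data[-8:])  # Read once on open.
    self.table_start = len(data) - 8 - 8 * (self.length + 1)
    self.index = None
    if cache_index:
      table = data[self.table_start:-8]
      self.index = struct.unpack(f'<{self.length + 1}Q', table)

  def _read(self, start, size):
    self.reads += 1
    self.file.seek(start)
    return self.file.read(size)

  def __getitem__(self, i):
    if self.index is not None:
      # Cached index: the byte range is already known in memory.
      start, stop = self.index[i], self.index[i + 1]
    else:
      # First read: fetch the two offsets that delimit record i.
      entry = self._read(self.table_start + 8 * i, 16)
      start, stop = struct.unpack('<2Q', entry)
    # Final read: fetch the record itself.
    return self._read(start, stop - start)

def index_cache_bytes(num_keys, num_datapoints):
  # The estimate from the text: 8 * len(spec) * len(reader).
  return 8 * num_keys * num_datapoints

data = write_toy_bag([b'a', b'bcd', b'efgh'])
uncached = ToyBagReader(data)
cached = ToyBagReader(data, cache_index=True)
assert uncached[1] == cached[1] == b'bcd'
print(uncached.reads, cached.reads)  # 2 1
```

By the formula, a dataset with 5 keys and 1,000,000 datapoints needs about 40 MB of RAM for its index tables, which is usually a small price for halving the number of reads per record.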

Masking

It is possible to load the values of only a subset of keys of a datapoint. For this, provide a mask in addition to the datapoint index. This reduces the number of read requests to only the bag files that are actually needed:

print(reader.spec)  # {'foo': 'int', 'bar': 'utf8', 'baz': 'array'}

mask = {'foo': True, 'baz': True}
datapoint = reader[index, mask]
print('foo' in datapoint)  # True
print('bar' in datapoint)  # False
print('baz' in datapoint)  # True

Sequences

Each dataset is a list of datapoints. Each datapoint is a dictionary with string keys and either individual byte values or lists of byte values. To use sequence values, add the [] suffix to the type in the spec:

spec = {
    'title': 'utf8',
    'frames': 'jpg[]',
    'captions': 'utf8[]',
    'times': 'int[]',
}

Sequence fields not only store values of variable length but also allow reading a range of a sequence, via masking, without loading the whole sequence from disk:

available = reader.available(index)
print(available)
# {'title': True, 'frames': range(54), 'captions': range(7), 'times': range(7)}

mask = {
    'title': True,            # Read the title modality
    'frames': range(32, 42),  # Read a range of 10 frames.
    'captions': range(0, 7),  # Read all captions.
    'times': True,            # Another way to read the full list.
}
datapoint = reader[index, mask]
print(len(datapoint['frames']))  # 10

Ranges are loaded using a single read operation, corresponding to a single download request on Cloud infrastructure.
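When reading fixed-size windows from sequences of varying length, the requested range has to stay within what is available. The helper below is a hypothetical convenience, not part of Granular. It assumes that the availability dictionary has the form printed above, where sequence keys map to a range starting at 0 and other keys map to True.

```python
def window_mask(available, key, start, length):
  # Build a mask that reads up to `length` items of `key` from `start` on.
  mask = {k: True for k in available}  # Read all other keys in full.
  total = len(available[key])
  begin = max(0, min(start, total))
  end = min(begin + length, total)
  mask[key] = range(begin, end)
  return mask

available = {
    'title': True, 'frames': range(54),
    'captions': range(7), 'times': range(7),
}

mask = window_mask(available, 'frames', start=50, length=10)
print(mask['frames'])  # range(50, 54)
```

The window is clipped to the 54 available frames, so the resulting mask requests 4 frames instead of 10.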

Sharding

Large datasets can be stored as a list of smaller datasets, called shards, which makes processing easy to parallelize: each shard can be processed individually in a different process or on a different machine. The shard length specifies the number of datapoints per shard. A good default is to choose the shard length such that each shard is around 10 GB in size.

# Write into a sharded dataset.
writer = granular.ShardedDatasetWriter(directory, spec, encoders, shardlen=10000)

# Read from a sharded dataset.
reader = granular.ShardedDatasetReader(directory, decoders)
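A shard length for a target shard size can be estimated from the average encoded size of a datapoint. The helper below is a hypothetical sketch that takes the target to be 10**10 bytes; the average datapoint size is something you would measure on a sample of your own data.

```python
def shardlen_for(avg_datapoint_bytes, target_shard_bytes=10 * 10**9):
  # Keep at least one datapoint per shard, even for very large datapoints.
  return max(1, target_shard_bytes // avg_datapoint_bytes)

print(shardlen_for(1_000_000))  # 10000
```

With datapoints of about 1 MB each, this gives the shardlen=10000 used in the example above.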

The file structure of a sharded dataset is one folder per shard, named after the shard number. Each shard itself is a dataset and can also be read using the non-sharded granular.DatasetReader.

$ tree ./directory
.
├── 000000
│  ├── spec.json
│  ├── refs.bag
│  ├── foo.bag
│  ├── bar.bag
│  └── baz.bag
├── 000001
│  ├── spec.json
│  ├── refs.bag
│  ├── foo.bag
│  ├── bar.bag
│  └── baz.bag
└── ...

When processing a dataset with a large number of shards using a smaller number of workers, specify shardstart and shardstep so each worker reads and writes its dedicated subset of shards.

# Write into a sharded dataset.
writer = granular.ShardedDatasetWriter(
    directory, spec, encoders, shardlen=10000,
    shardstart=worker_id,   # Start writing at this shard.
    shardstep=num_workers,  # Afterwards, jump this many shards ahead.
)

# Read from a sharded dataset.
reader = granular.ShardedDatasetReader(
    directory, decoders,
    shardstart=worker_id,   # Start reading at this shard.
    shardstep=num_workers,  # Afterwards, jump this many shards ahead.
)
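The assignment of shards to workers that these two parameters describe can be sketched in plain Python. This mirrors the comments above (start at one shard, then jump a fixed number of shards ahead) and is not taken from Granular's internals; the function name and the shard count are hypothetical.

```python
def shards_for_worker(num_shards, shardstart, shardstep):
  # Start at `shardstart`, then jump `shardstep` shards ahead each time.
  return list(range(shardstart, num_shards, shardstep))

num_shards, num_workers = 10, 4
assignment = {
    worker_id: shards_for_worker(num_shards, worker_id, num_workers)
    for worker_id in range(num_workers)
}
print(assignment[1])  # [1, 5, 9]

# Every shard is handled by exactly one worker.
covered = sorted(s for shards in assignment.values() for s in shards)
assert covered == list(range(num_shards))
```

Because the subsets are disjoint and together cover all shards, workers never write to the same shard folder and no shard is skipped.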

Questions

If you have a question, please file an issue.

Release history

0.23.0

Mar 23, 2026

0.22.0

Mar 6, 2026

0.22.0a1 pre-release

Jan 2, 2026

0.21.2

Jan 1, 2026

This version

0.21.1

Jan 1, 2026

0.21.0

Jan 17, 2025

0.20.3

Oct 16, 2024

0.20.2

Oct 15, 2024

0.20.1

Oct 15, 2024

0.20.0

Oct 15, 2024

0.19.0

Oct 10, 2024

0.18.0

Oct 9, 2024

0.17.3

Sep 13, 2024

0.17.2

Sep 9, 2024

0.17.1

Aug 19, 2024

0.17.0

Aug 18, 2024

0.16.4

Aug 15, 2024

0.16.3

Aug 13, 2024

0.16.2

Aug 13, 2024

0.16.1

Jul 30, 2024

0.16.0

Jul 30, 2024

0.15.5

Jul 25, 2024

0.15.4

Jul 24, 2024

0.15.3

Jul 17, 2024

0.15.2

Jul 9, 2024

0.15.1

Jul 8, 2024

0.15.0

Jul 8, 2024

0.14.2

Jul 5, 2024

0.14.1

Jul 4, 2024

0.14.0

Jul 4, 2024

0.13.1

Jul 4, 2024

0.13.0

Jul 4, 2024

0.12.0

Jul 3, 2024

0.11.2

Jul 3, 2024

0.11.1

Jul 3, 2024

0.11.0

Jul 3, 2024

0.10.4

Jul 3, 2024

0.10.3

Jul 2, 2024

0.10.2

Jul 2, 2024

0.10.1

Jul 2, 2024

0.10.0

Jul 2, 2024

0.9.2

Jul 2, 2024

0.9.1

Jul 2, 2024

0.9.0

Jul 2, 2024

0.8.2

Jul 2, 2024

0.8.1

Jul 2, 2024

0.8.0

Jul 1, 2024

0.7.0

Jul 1, 2024

0.6.5

Jul 1, 2024

0.6.4

Jul 1, 2024

0.6.3

Jul 1, 2024

0.6.2

Jul 1, 2024

0.6.1

Jul 1, 2024

0.6.0

Jun 30, 2024

0.1.0

Jun 30, 2024

Download files

Source Distribution

granular-0.21.1.tar.gz (13.1 kB)

Uploaded Jan 1, 2026 Source

Built Distribution

granular-0.21.1-py3-none-any.whl (16.0 kB)

Uploaded Jan 1, 2026 Python 3

File details

Details for the file granular-0.21.1.tar.gz.

File metadata

Download URL: granular-0.21.1.tar.gz
Upload date: Jan 1, 2026
Size: 13.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for granular-0.21.1.tar.gz
Algorithm	Hash digest
SHA256	`71bb0659b3f6ddd6e501ea9aaa30a4d32459b05c3f3526630968939fbc998c9d`
MD5	`bb0ded19697652f68b7d3b90d33c24a1`
BLAKE2b-256	`60c2d915a6b73163069f3cd40dd586520f0ef95be79b8e510694a0692b4d505b`

File details

Details for the file granular-0.21.1-py3-none-any.whl.

File metadata

Download URL: granular-0.21.1-py3-none-any.whl
Upload date: Jan 1, 2026
Size: 16.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for granular-0.21.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3a44842ec74fb56305f86ccffead0786c1636dae01a2bc00f4ccb1d71aadfc1c`
MD5	`cb005b964d96a39aae75d9231dd0c4c2`
BLAKE2b-256	`ec9cc1c9e750ec1c45a25d29494000cb46fcd8fae1f2c02b020841daece413c1`
