Granular: Fast format for datasets

Granular is a library for reading and writing multimodal datasets. Each dataset is a collection of linked files in the bag file format, a simple seekable container structure.
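
To make "seekable container" concrete, here is a toy sketch of one possible layout, not the actual bag format: records stored back to back, followed by a table of byte offsets, so any record can be read with a couple of seeks instead of scanning the whole file.

import struct

# Illustrative only; the real bag layout is defined by granular itself.
def write_container(path, records):
  offsets = []
  with open(path, 'wb') as f:
    for record in records:
      offsets.append(f.tell())
      f.write(record)
    index_start = f.tell()
    for offset in offsets:  # Offset table enables random access.
      f.write(struct.pack('<Q', offset))
    f.write(struct.pack('<Q', index_start))
    f.write(struct.pack('<Q', len(records)))

def read_record(path, index):
  with open(path, 'rb') as f:
    f.seek(-16, 2)  # Trailer holds the index position and record count.
    index_start, count = struct.unpack('<QQ', f.read(16))
    f.seek(index_start + 8 * index)
    start = struct.unpack('<Q', f.read(8))[0]
    if index + 1 < count:
      end = struct.unpack('<Q', f.read(8))[0]
    else:
      end = index_start
    f.seek(start)
    return f.read(end - start)  # Reads only the requested record.

With this layout, write_container('toy.bag', [b'hello', b'world']) followed by read_record('toy.bag', 1) returns b'world' after reading only a handful of bytes.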

Features

  • 🚀 Performance: Minimal overhead for maximum read and write throughput.
  • 🔎 Seekable: Fast random access from disk by datapoint index.
  • 🎞️ Sequences: Datapoints can contain seekable ranges of modalities.
  • 🤸 Flexible: The user provides encoders and decoders; examples are available.
  • 👥 Sharding: Store datasets as shards to split processing workloads.

Installation

Granular is a single file, so you can just copy it to your project directory. Or you can install the package:

pip install granular

Quickstart

Writing

import granular
import msgpack

spec = {
    'foo': 'int',      # integer
    'bar': 'utf8[]',   # list of strings
    'baz': 'msgpack',  # packed structure
}

# Or use the provided `granular.encoders`.
encoders = {
    'int': lambda x: x.to_bytes(8, 'little'),
    'utf8': lambda x: x.encode('utf-8'),
    'msgpack': msgpack.packb,
}

directory = 'directory'  # Output folder; one subdirectory is created per shard.

with granular.ShardedDatasetWriter(
    directory, spec, encoders, shardlen=1000) as writer:
  writer.append({'foo': 42, 'bar': ['hello', 'world'], 'baz': {'a': 1}})
  # ...
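
The `[]` suffix in the spec marks a list field. Its elements are encoded one by one with the base type's encoder, which is what later makes it possible to read a range of elements without decoding the whole list. A minimal sketch of this convention (illustrative, not granular's internal code):

def encode_value(typename, value, encoders):
  # Hypothetical helper: list types such as 'utf8[]' encode each element
  # separately, so slices can later be fetched without touching the rest.
  if typename.endswith('[]'):
    encode = encoders[typename[:-2]]
    return [encode(element) for element in value]
  return encoders[typename](value)

assert encode_value('utf8[]', ['hello', 'world'], encoders) == [b'hello', b'world']
assert encode_value('int', 42, encoders) == (42).to_bytes(8, 'little')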

Files

$ tree directory
.
├── 000000
│  ├── spec.json
│  ├── refs.bag
│  ├── foo.bag
│  ├── bar.bag
│  └── baz.bag
├── 000001
│  ├── spec.json
│  ├── refs.bag
│  ├── foo.bag
│  ├── bar.bag
│  └── baz.bag
└── ...

Reading

# Or use the provided `granular.decoders`.
decoders = {
    'int': lambda x: int.from_bytes(x, 'little'),
    'utf8': lambda x: x.decode('utf-8'),
    'msgpack': msgpack.unpackb,
}

with granular.ShardedDatasetReader(directory, decoders) as reader:
  print(len(reader))    # Number of datapoints in the dataset.
  print(reader.size)    # Dataset size in bytes.
  print(reader.shards)  # Number of shards.

  # Read data points by index. This will read only the relevant bytes from
  # disk. An additional small read is used when caching index tables is
  # disabled, supporting arbitrarily large datasets with minimal overhead.
  assert reader[0] == {'foo': 42, 'bar': ['hello', 'world'], 'baz': {'a': 1}}

  # Read a subset of keys of a datapoint. For example, this allows quickly
  # iterating over the metadata fields of all datapoints without accessing
  # expensive image or video modalities.
  assert reader[0, {'foo': True, 'baz': True}] == {'foo': 42, 'baz': {'a': 1}}

  # Read only a slice of the 'bar' list. Only the requested slice will be
  # fetched from disk. For example, this could be used to load a subsequence
  # of a long video that is stored as a list of consecutive MP4 clips.
  assert reader[0, {'bar': range(1, 2)}] == {'bar': ['world']}

For small datasets where sharding is not necessary, you can also use DatasetReader and DatasetWriter.
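
A minimal sketch of the unsharded variant, assuming it mirrors the sharded quickstart above minus the shard arguments (the 'small_dataset' path is just a placeholder):

with granular.DatasetWriter('small_dataset', spec, encoders) as writer:
  writer.append({'foo': 42, 'bar': ['hello', 'world'], 'baz': {'a': 1}})

with granular.DatasetReader('small_dataset', decoders) as reader:
  assert reader[0]['foo'] == 42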

For distributed processing using multiple processes or machines, use ShardedDatasetReader and ShardedDatasetWriter and set shardstart to the worker index and shardstep to the total number of workers.
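
For example, a sketch of how four workers could each read a disjoint subset of shards; the keyword names follow the description above, but treat the exact signature as an assumption:

num_workers = 4
worker = 0  # This worker's index, e.g. from a scheduler or environment variable.

with granular.ShardedDatasetReader(
    directory, decoders,
    shardstart=worker, shardstep=num_workers) as reader:
  for index in range(len(reader)):
    datapoint = reader[index]  # Only this worker's shards are visited.
    # ...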

Formats

Granular does not impose a serialization format on the user. Any string can be used as a type name, as long as matching encoder and decoder functions are provided.

Examples of common encode and decode functions are provided in formats.py. These support Numpy arrays, JPG and PNG images, MP4 videos, and more. They can be used as granular.encoders and granular.decoders.
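
For example, a hedged sketch of what such a pair of functions for Numpy arrays could look like; the actual implementations in formats.py may differ:

import io
import numpy as np

def encode_array(value):
  buffer = io.BytesIO()
  np.save(buffer, value)  # Serializes dtype and shape along with the data.
  return buffer.getvalue()

def decode_array(blob):
  return np.load(io.BytesIO(blob))

encoders = {'array': encode_array}
decoders = {'array': decode_array}
spec = {'image': 'array'}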

Questions

If you have a question, please file an issue.
