Granular: Fast format for datasets

Granular is a library for reading and writing multimodal datasets. Each dataset is a collection of linked files in the Bag file format, a simple seekable container format.

Features

  • 🚀 Performance: Minimal overhead for maximum read and write throughput.
  • 🔎 Seekable: Fast random access from disk by datapoint index.
  • 🎞️ Sequences: Datapoints can contain seekable ranges of modalities.
  • 🤸 Flexible: User provides encoders and decoders; examples available.
  • 👥 Sharding: Store datasets into shards to split processing workloads.

Installation

Granular is a single file, so you can just copy it to your project directory. Or you can install the package:

pip install granular

Quickstart

Writing

import granular
import msgpack
import numpy as np

encoders = {
    'utf8': lambda x: x.encode('utf-8'),
    'int': lambda x, size: x.to_bytes(int(size), 'little'),
    'msgpack': msgpack.packb,
}

spec = {
    'foo': 'int(8)',   # 8-byte integer
    'bar': 'utf8[]',   # list of strings
    'baz': 'msgpack',  # packed structure
}

shardsize = 10 * 1024 ** 3  # 10GB shards
directory = 'directory'     # Output directory; one subdirectory per shard.

with granular.ShardedDatasetWriter(directory, spec, encoders, shardsize) as writer:
  writer.append({'foo': 42, 'bar': ['hello', 'world'], 'baz': {'a': 1}})
  # ...

Files

$ tree directory
.
├── 000000
│  ├── spec.json
│  ├── refs.bag
│  ├── foo.bag
│  ├── bar.bag
│  └── baz.bag
├── 000001
│  ├── spec.json
│  ├── refs.bag
│  ├── foo.bag
│  ├── bar.bag
│  └── baz.bag
└── ...

Reading

decoders = {
    'utf8': lambda x: x.decode('utf-8'),
    'int': lambda x, size=None: int.from_bytes(x, 'little'),
    'msgpack': msgpack.unpackb,
}

with granular.ShardedDatasetReader(directory, decoders) as reader:
  print(len(reader))  # Total number of datapoints.
  print(reader.size)  # Total dataset size in bytes.
  print(reader.shards)

  # Read data points by index. This will read only the relevant bytes from
  # disk. An additional small read is used when caching index tables is
  # disabled, supporting arbitrarily large datasets with minimal overhead.
  assert reader[0] == {'foo': 42, 'bar': ['hello', 'world'], 'baz': {'a': 1}}

  # Read a subset of keys of a datapoint. For example, this allows quickly
  # iterating over the metadata fields of all datapoints without accessing
  # expensive image or video modalities.
  assert reader[0, {'foo': True, 'baz': True}] == {'foo': 42, 'baz': {'a': 1}}

  # Read only a slice of the 'bar' list. Only the requested slice will be
  # fetched from disk. For example, this could be used to load a subsequence
  # of a long video that is stored as a list of consecutive MP4 clips.
  assert reader[0, {'bar': range(1, 2)}] == {'bar': ['world']}

For small datasets where sharding is not necessary, you can also use DatasetReader and DatasetWriter. These can also be used to look at the individual shards of a sharded dataset.
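
For example, a single unsharded dataset could be written and read like this (a minimal sketch, assuming DatasetWriter and DatasetReader take the same arguments as their sharded counterparts minus the shard size):

with granular.DatasetWriter('small_dataset', spec, encoders) as writer:
  writer.append({'foo': 42, 'bar': ['hello', 'world'], 'baz': {'a': 1}})

with granular.DatasetReader('small_dataset', decoders) as reader:
  assert reader[0]['foo'] == 42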

For distributed processing using multiple processes or machines, use ShardedDatasetReader and ShardedDatasetWriter and set shard_start to the worker index and shard_step to the total number of workers.
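
A hedged sketch of the reader side (the shard_start and shard_step keyword names follow the description above; check the API for the exact signature):

num_workers = 4
worker = 0  # This worker's index, e.g. from the job scheduler.

with granular.ShardedDatasetReader(
    directory, decoders,
    shard_start=worker,      # First shard owned by this worker.
    shard_step=num_workers,  # Skip over shards owned by other workers.
) as reader:
  for index in range(len(reader)):
    datapoint = reader[index]
    # ... process datapoint ...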

Formats

Granular does not impose a serialization solution on the user. Any word can be used as a type, as long as an encoder and decoder are provided.

Examples of encode and decode functions for common types are provided in formats.py and include:

  • Numpy
  • JPEG
  • PNG
  • MP4

Types can be parameterized with arguments that are forwarded to the encoder and decoder, for example array(float32,64,128).
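
As a hedged illustration of how a parameterized type could be wired up (the actual implementations in formats.py may differ), the arguments from the spec string arrive at the encoder and decoder as strings:

import numpy as np

# Hypothetical 'array' codec for spec entries like 'array(float32,64,128)'.
def encode_array(value, dtype, *shape):
  return np.asarray(value, dtype).tobytes()

def decode_array(buffer, dtype, *shape):
  return np.frombuffer(buffer, dtype).reshape([int(n) for n in shape])

encoders = {'array': encode_array}
decoders = {'array': decode_array}

spec = {'image': 'array(float32,64,128)'}  # One 64x128 float32 array per datapoint.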

Bag

The Bag file format is a simple container file type. It stores a list of byte blobs, followed by an index table of integers giving the start location of each blob in the file. The start locations are encoded as 8-byte unsigned little-endian integers, and the table also includes the end offset of the last blob.

This format allows for fast random access, either by loading the index table into memory upfront, or by doing one small read to find the start and end locations followed by a targeted large read for the blob content.
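
The following sketch illustrates this layout (it is not the library's implementation; it only assumes that the index table directly follows the blobs, so the final table entry also marks where the table begins):

import struct

# Write a Bag-style file: the blobs first, then the index table of 8-byte
# unsigned little-endian start offsets plus the end offset of the last blob.
blobs = [b'hello', b'world', b'!']
with open('example.bag', 'wb') as f:
  offsets = []
  for blob in blobs:
    offsets.append(f.tell())
    f.write(blob)
  offsets.append(f.tell())  # End offset of the last blob.
  for offset in offsets:
    f.write(struct.pack('<Q', offset))

# Random access to blob i without loading the whole file. The last table
# entry equals the offset where the table begins; a reader that caches the
# table (or knows the blob count) can skip the first small read.
with open('example.bag', 'rb') as f:
  f.seek(-8, 2)
  table = struct.unpack('<Q', f.read(8))[0]
  i = 1
  f.seek(table + 8 * i)
  start, end = struct.unpack('<QQ', f.read(16))
  f.seek(start)
  assert f.read(end - start) == b'world'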

Granular builds on top of Bag to read and write datasets with multiple modalities, where datapoints can contain sequences of blobs of a modality, with efficient seeking both by datapoint index and for range queries into modalities.

Questions

If you have a question, please file an issue.
