Simple and fast format for storing datasets.

Project description

👜 Bags: Fast format for storing datasets

Bags is a library for reading and writing multimodal datasets. Each dataset is a collection of linked files of the bag type, a simple container format.

Features

Performance: Minimal overhead for maximum read and write throughput.
Seekable: Fast random access from disk by datapoint index.
Sharding: Automatically splits large datasets into multiple files.
Flexible: No predefined types, user provides encoders and decoders.
Sequences: Datapoints can reference record ranges in other bag files.
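
To illustrate the idea behind the "Seekable" feature, here is a minimal sketch of random access by index using an offset table. It is not the actual bag file layout: the `write_records` and `read_record` helpers and the layout (records, then a table of 8-byte offsets, then a 16-byte footer) are assumptions made for illustration only.

```python
import io
import struct

def write_records(stream, records):
    """Write byte records back to back, followed by an offset table and footer."""
    offsets = []
    for record in records:
        offsets.append(stream.tell())
        stream.write(record)
    offsets.append(stream.tell())  # End offset of the last record.
    table_start = stream.tell()
    for offset in offsets:
        stream.write(struct.pack('<Q', offset))
    # Footer: where the offset table starts and how many records exist.
    stream.write(struct.pack('<QQ', table_start, len(records)))

def read_record(stream, index):
    """Fetch one record by index with a few small reads instead of a full scan."""
    stream.seek(-16, io.SEEK_END)
    table_start, count = struct.unpack('<QQ', stream.read(16))
    if not 0 <= index < count:
        raise IndexError(index)
    # Two consecutive table entries give the start and end of the record.
    stream.seek(table_start + 8 * index)
    start, stop = struct.unpack('<QQ', stream.read(16))
    stream.seek(start)
    return stream.read(stop - start)

buffer = io.BytesIO()
write_records(buffer, [b'hello', b'', b'world!'])
second = read_record(buffer, 2)
```

Because every table entry has a fixed size, the position of any record can be computed from its index, so the cost of a lookup does not grow with the number of records.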

Installation

Bags is a single file, so you can just copy it to your project directory. Or you can install the package:

pip install bags

Quickstart

Writing

import bags
import msgpack
import numpy as np

encoders = {
    'utf8': lambda x: x.encode('utf-8'),
    'int': lambda x, size: x.to_bytes(int(size), 'little'),
    'msgpack': msgpack.packb,
}

spec = {
    'foo': 'int(8)',   # 8-byte integer
    'bar': 'utf8[]',   # list of strings
    'baz': 'msgpack',  # packed structure
}

shardsize = 10 * 1024 ** 3  # 10 GiB shards
directory = 'path/to/dataset'  # placeholder: where the dataset files are written

with bags.DatasetWriter(directory, spec, encoders, shardsize) as writer:
  writer.append({'foo': 42, 'bar': ['hello', 'world'], 'baz': {'a': 1}})

Files

$ ls directory
spec.json
refs-00001.bag
foo-00001.bag
bar-00001.bag
baz-00001.bag

Reading

decoders = {
    'utf8': lambda x: x.decode('utf-8'),
    'int': lambda x, size=None: int.from_bytes(x, 'little'),  # same byte order as the encoder
    'msgpack': msgpack.unpackb,
}

with bags.DatasetReader(directory, decoders) as reader:
  print(len(reader))

  # Read datapoints by index. This reads only the relevant bytes from disk.
  # An additional small read is needed when caching of index tables is
  # disabled, which supports arbitrarily large datasets with minimal overhead.
  assert reader[0] == {'foo': 42, 'bar': ['hello', 'world'], 'baz': {'a': 1}}

  # Read a subset of keys of a datapoint. For example, this allows quickly
  # iterating over the metadata fields of all datapoints without accessing
  # expensive image or video modalities.
  assert reader[0, {'foo': True, 'baz': True}] == {'foo': 42, 'baz': {'a': 1}}

  # Read only a slice of the 'bar' list. Only the requested slice is fetched
  # from disk. For example, this could be used to load a subsequence of a long
  # video that is stored as a list of consecutive MP4 clips.
  assert reader[0, {'bar': range(1, 2)}] == {'bar': ['world']}
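
Since encoders and decoders are supplied separately, each pair has to be an exact inverse of the other, including details such as byte order. The sketch below checks this without touching Bags at all. The `roundtrip` helper is hypothetical, and two things are assumptions inferred from the quickstart rather than documented behavior: that type args arrive as strings, and that encoders return `bytes`.

```python
encoders = {
    'utf8': lambda x: x.encode('utf-8'),
    'int': lambda x, size: x.to_bytes(int(size), 'little'),
}

decoders = {
    'utf8': lambda x: x.decode('utf-8'),
    # The byte order is given explicitly so that it matches the encoder.
    'int': lambda x, size=None: int.from_bytes(x, 'little'),
}

def roundtrip(name, value, *args):
    """Encode and then decode a value with the pair registered under `name`."""
    encoded = encoders[name](value, *args)
    assert isinstance(encoded, bytes), 'assumed here: encoders return bytes'
    return decoders[name](encoded, *args)

assert roundtrip('utf8', 'hello') == 'hello'
assert roundtrip('int', 42, '8') == 42
```

Running a check like this for every type in the spec before writing a large dataset is cheap and surfaces mismatched pairs early.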

Serialization

Bags does not impose a serialization solution on the user. Any word can be used as a type name, as long as a matching encoder and decoder are provided.

Examples of commonly used type strings and the corresponding encode and decode functions are provided in formats.py.

Types can be parameterized with args that are passed to the encoder and decoder, such as array(float32,64,128).
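
To make the type string convention concrete, here is a sketch of how a string such as array(float32,64,128) could be split into a type name and args. The `parse_type` and `encode` helpers and the regular expression are hypothetical and are not taken from the Bags source. Passing args as strings and treating a trailing `[]` as a list marker are assumptions based on the quickstart spec.

```python
import re

# name, optional "(arg,arg,...)", optional "[]" list suffix.
TYPE_PATTERN = re.compile(r'^(\w+)(?:\(([^)]*)\))?(\[\])?$')

def parse_type(type_string):
    """Split 'name(arg1,arg2)[]' into (name, args, is_list)."""
    match = TYPE_PATTERN.match(type_string.strip())
    if not match:
        raise ValueError(f'Invalid type string: {type_string!r}')
    name, raw_args, list_suffix = match.groups()
    args = tuple(a.strip() for a in raw_args.split(',')) if raw_args else ()
    return name, args, bool(list_suffix)

def encode(encoders, type_string, value):
    """Look up the encoder by name and pass the parsed args after the value."""
    name, args, is_list = parse_type(type_string)
    fn = encoders[name]
    if is_list:
        return [fn(item, *args) for item in value]
    return fn(value, *args)

encoders = {
    'utf8': lambda x: x.encode('utf-8'),
    'int': lambda x, size: x.to_bytes(int(size), 'little'),
}
```

With this sketch, `parse_type('array(float32,64,128)')` yields the name `'array'` and the string args `'float32'`, `'64'`, and `'128'`, which an array encoder would then have to convert to a dtype and a shape itself.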

Questions

If you have a question, please file an issue.

Release history

0.5.1 (Jun 29, 2024)
0.5.0 (Jun 29, 2024)
0.4.0 (Jun 28, 2024)
0.3.1 (Jun 28, 2024)
0.3.0 (Jun 27, 2024), this version
0.2.0 (Jun 27, 2024)
0.1.0 (Jun 27, 2024)

Download files

Source Distribution

bags-0.3.0.tar.gz (9.0 kB)

Uploaded Jun 27, 2024 Source

Hashes for bags-0.3.0.tar.gz

Algorithm	Hash digest
SHA256	`f379cda521603cc17c151a3d34a5bdd791967925436c3267004580f3ed3381a7`
MD5	`fd4243b2b4be0d0133a11e5ca94dbde3`
BLAKE2b-256	`fcf0b393ea0d2d21cd297e75d54e0120599b0075fd824299dac60e4cfd507d2f`