A format for storing a sequence of byte-array records
Project description
Bagz
Freethreading fork
This is a fork of bagz which adds freethreading (nogil) support and wheels.
It publishes to PyPI as bagz-freethreading but still imports as
bagz. Do not install both bagz and bagz-freethreading at the same time.
uv pip install bagz-freethreading
Versioning for this fork follows upstream and uses PEP 440 post-releases,
for example 0.2.0.post1, 0.2.0.post2, and so on. When upstream moves to
0.3.0, this fork will continue as 0.3.0.post1.
What follows is the original readme.
Overview
Bagz is a format for storing a sequence of byte-array records, typically serialised protocol buffers. It supports per-record compression and fast index-based lookup. All indexing is zero based.
Installation
To install Bagz from source we recommend using uv.
The install will download required dependencies apart from curl
(libcurl-devel) and OpenSSL (openssl-devel). These need to be installed in a
location that cmake's find_package searches.
uv pip install .
On Linux you can install the latest version from PyPI.
uv pip install bagz
Python API
Python Reader
Reader for reading a single or sharded Bagz file-set.
from collections.abc import Sequence, Iterable
import bagz
import numpy as np
# Bagz Readers support random access. The order of elements within a Bagz
# file is the order in which they are written. Records are returned as `bytes`
# objects.
data = bagz.Reader('/path/to/data.bagz')
# Bagz Readers can be configured like this - here we require that the file was
# written with separate limits.
data_separate_limits = bagz.Reader('/path/to/data.bagz', bagz.Reader.Options(
limits_placement=bagz.LimitsPlacement.SEPARATE,
))
# Bagz Readers are Sequences and support slicing, iterating, etc.
assert isinstance(data, Sequence)
# Bagz Readers have a length.
assert len(data) > 10
# Can access record by row-index.
fifth_value: bytes = data[5]
# Can slice.
data_from_5: bagz.Reader = data[5:]
# Slices are still Readers.
assert isinstance(data_from_5, bagz.Reader)
assert data_from_5[0] == fifth_value
# Can access records by multiple row-indices.
fourth, second, tenth = data.read_indices([4, 2, 10])
assert fourth == data[4]
assert second == data[2]
assert tenth == data[10]
# Can iterate records.
for record in data:
do_something_else(record)
# Can read all records. This eager version can be faster than iteration.
all_records = data.read()
# Can iterate sub-range of records.
for record in data[4:9]:
do_something_else(record)
# Can read a sub-range of records. This eager form can be faster than
# iteration.
sub_range = data[4:9].read()
# Can use an infinite iterator as source of indices. (Reads ahead in parallel.)
def my_generator(size: int) -> Iterable[int]:
rng = np.random.default_rng(42)
while True:
yield rng.integers(size).item()
data_iter: Iterable[bytes] = data.read_indices_iter(my_generator(len(data)))
for i in range(10):
random_item: bytes = next(data_iter)
Python Reader - Index and MultiIndex
You can use Index to find the first index of a record and MultiIndex to find
all instances of an item.
keys = bagz.Reader('/path/to/keys.bag')
# Get the index of the first occurrence of key.
index = bagz.Index(keys)
key_index: int = index[b'example_key']
# Get all occurrences of key.
multi_index = bagz.MultiIndex(keys)
all_indices: list[int] = multi_index[b'example_key']
Python Writer
For writing a single Bagz file.
Example:
import bagz
# Compression is selected based on the file extension:
# `.bagz` will use Zstandard compression with default settings.
# `.bag` will use no compression.
with bagz.Writer('/path/to/data.bagz') as writer:
for d in generate_records():
writer.write(d)
# Adjust compression level explicitly.
# Note this will no longer use the extension to detemine whether to compress.
with bagz.Writer(
'/path/to/data.bagz',
bagz.Writer.Options(
compression=bagz.CompressionZstd(level=3)
),
) as writer:
for d in generate_records():
writer.write(d)
Options
Reader Options
bagz.Reader.Options has these optional arguments.
compression: Can be one of:bagz.CompressionAutoDetect(): Default - Uses extension whether to compress. (.bagz- Compressed (ZStandard),.bag- Uncompressed)bagz.CompressionNone(): Records are not decompressed.bagz.CompressionZstd(): Records are decompressed using Zstandard.
limits_placement: Can be one of:bagz.LimitsPlacement.TAIL: Default- Reads limits from a tail of file.bagz.LimitsPlacement.SEPARATE: Reads limits from a separate file.
limits_storage: Can be one of:bagz.LimitsStorage.ON_DISK: Default - Reads limits from disk for each read.bagz.LimitsStorage.IN_MEMORY: Reads all limits from disk in one go.
max_parallelism: Default number of threads when reading many records.sharding_layout: Can be one of:
Writer Options
bagz.Writer.Options has these optional arguments.
compression: Can be one of:bagz.CompressionAutoDetect(): Default - Uses extension whether to compress. (.bagz- Compressed (Zstandard),.bag- Uncompressed)bagz.CompressionNone(): Records are not compressed.bagz.CompressionZstd(level = 3): Records are compressed using Zstandard the level of the compression can be specified.
limits_placement: Can be one of:bagz.LimitsPlacement.TAIL: Default - Writes limits to a tail of file.bagz.LimitsPlacement.SEPARATE: Writes limits to a separate file.
Apache Beam Support
Bagz also provides Apache Beam connectors for reading and writing Bagz files in Beam pipelines.
Ensure you have Apache Beam installed.
uv pip install apache_beam
Bagz Source
import apache_beam as beam
from bagz.beam import bagzio
import tensorflow as tf
with beam.Pipeline() as pipeline:
examples = (
pipeline
| 'ReadData' >> bagzio.ReadFromBagz('/path/to/your/data@*.bagz')
| 'Decode' >> beam.Map(tf.train.Example.FromString)
)
# Continue your pipeline.
Bagz Sink
from bagz.beam import bagzio
import tensorflow as tf
def create_tf_example(data):
# Replace with your actual feature creation logic.
feature = {
'data': tf.train.Feature(bytes_list=tf.train.BytesList(value=[data])),
}
return tf.train.Example(features=tf.train.Features(feature=feature))
with beam.Pipeline() as pipeline:
data = [b'record1', b'record2', b'record3']
examples = (
pipeline
| 'CreateData' >> beam.Create(data)
| 'Encode' >> beam.Map(lambda x: create_tf_example(x).SerializeToString())
| 'WriteData' >> bagzio.WriteToBagz('/path/to/output/data@*.bagz')
)
GCS Support
Bagz supports Posix file-systems and Google Cloud Storage (GCS). If you have
files on GCS you can access them with using the prefix /gs: to the path. These
examples assume you have gcloud CLI installed.
From the shell:
gcloud config set project your-project-name
gcloud auth application-default login
Then use the 'gs:' file-system prefix.
import pathlib
import bagz
# (This may freeze if you have not configured the project.)
reader = bagz.Reader('gs://your-bucket-name/your-file.bagz')
# Path supports a leading slash to work well with pathlib.
bucket = pathlib.Path('/gs://your-bucket-name')
reader = bagz.Reader(bucket / 'your-file.bagz')
Sharding
An ordered collection of Bagz-formatted files ("shards") may be opened together
and indexed via a single global-index. The global-index is mapped to a
shard and an index within that shard (shard-index) in one of two ways:
-
Concatenated (default). Indexing is equivalent to the records in each Bagz-formatted shard being concatenated into a single sequence of records.
Example:
When opening four Bagz-formatted files with sizes
[8, 4, 0, 5], theglobal-indexwith range[0, 17)(shown as the table entries) maps toshardandshard-indexlike this:| shard-index shard | 0 1 2 3 4 5 6 7 -------------- | ----------------------- 00000-of-00004 | 0 1 2 3 4 5 6 7 00001-of-00004 | 8 9 10 11 00002-of-00004 | 00003-of-00004 | 12 13 14 15 16Mappings
global-indexshardshard-index000000-of-000040100000-of-000041200000-of-000042... ... ... 800001-of-000040900001-of-000041... ... ... 1500003-of-0000431600003-of-000044 -
Interleaved where the global-index is interleaved in a round-robin manner across all the shards.
Example:
When opening three Bagz-formatted files with sizes
[6, 6, 5], theglobal-indexwith range[0, 17)(shown as the table entries) maps toshardandshard-indexlike this:| shard-index shard | 0 1 2 3 4 5 -------------- | ----------------- 00000-of-00003 | 0 3 6 9 12 15 00001-of-00003 | 1 4 7 10 13 16 00002-of-00003 | 2 5 8 11 14Mappings
global-indexshardshard-index000000-of-000030100001-of-000030200002-of-000030... ... ... 600000-of-000032700001-of-000032800002-of-000032... ... ... 1500000-of-0000351600001-of-000035
Bagz file format
The Bagz file format has two parts: the records section and the limits
section.
- The
recordssection consists of the concatenation of all (possibly compressed) records. (There are no additional bytes inside or between records, and records are not aligned in any way.) - The
limitssection is a dense array of the end-offsets of each record in order, encoded in little-endian 64-bit unsigned integers.
These can be stored with tail-limits in one file, where the limits sections
is appended to the record section or separate-limits where they are stored
in separate files.
Tail-limits example
Given Bagz file formatted file with the following 3 uncompressed records:
| Records |
|---|
abcdef |
123 |
catcat |
The raw bytes of the Bagz file formated file corresponding to the records above:
0x61 a 0x62 b 0x63 c 0x64 d 0x65 e 0x66 f
0x31 1 0x32 2 0x33 3
0x63 c 0x61 a 0x74 t 0x63 c 0x61 a 0x74 t
0x06 0x00 0x00 0x00 0x00 0x00 0x00 0x00 # 6 byte offset
0x09 0x00 0x00 0x00 0x00 0x00 0x00 0x00 # 9 byte offset
0x0f 0x00 0x00 0x00 0x00 0x00 0x00 0x00 # 15 byte offset
The last 8 bytes represent the end-offset of the last record. This is also the
start of the limits section. Therefore reading the last 8 bytes will directly
tell you the offset of the records/limits boundary.
Disclaimer
This is not an official Google product.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file bagz_freethreading-0.2.0.post1.tar.gz.
File metadata
- Download URL: bagz_freethreading-0.2.0.post1.tar.gz
- Upload date:
- Size: 71.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a067ef1bcc96877db4b8882d78394dccf2cb6e479fc32034b009ecf6f55d8d55
|
|
| MD5 |
4ce17792cf0582649872d2db410416a1
|
|
| BLAKE2b-256 |
6ce6eb679aa12f429dc3a2b700a1e304c8ebf88bcd1dfddff3768125b4cc3339
|
Provenance
The following attestation bundles were made for bagz_freethreading-0.2.0.post1.tar.gz:
Publisher:
publish.yml on lucasb-eyer/bagz
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
bagz_freethreading-0.2.0.post1.tar.gz -
Subject digest:
a067ef1bcc96877db4b8882d78394dccf2cb6e479fc32034b009ecf6f55d8d55 - Sigstore transparency entry: 787435669
- Sigstore integration time:
-
Permalink:
lucasb-eyer/bagz@2a33f1e6b7cf5a67b59b16b3cec08bcb81d4b75a -
Branch / Tag:
refs/tags/v0.2.0.post1 - Owner: https://github.com/lucasb-eyer
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@2a33f1e6b7cf5a67b59b16b3cec08bcb81d4b75a -
Trigger Event:
push
-
Statement type:
File details
Details for the file bagz_freethreading-0.2.0.post1-cp313-cp313-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: bagz_freethreading-0.2.0.post1-cp313-cp313-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 9.5 MB
- Tags: CPython 3.13, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
61b2f3531175a52c5e8f3af9d37e866338069fdc164c0ea6dfc060fa39cd14cd
|
|
| MD5 |
e04d9e6e6965a1742319e0d6943de100
|
|
| BLAKE2b-256 |
67eb2b1d2e33b827443fcf8b961d09f96350604a5857677da879ff31405c8c8e
|
Provenance
The following attestation bundles were made for bagz_freethreading-0.2.0.post1-cp313-cp313-manylinux_2_28_x86_64.whl:
Publisher:
publish.yml on lucasb-eyer/bagz
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
bagz_freethreading-0.2.0.post1-cp313-cp313-manylinux_2_28_x86_64.whl -
Subject digest:
61b2f3531175a52c5e8f3af9d37e866338069fdc164c0ea6dfc060fa39cd14cd - Sigstore transparency entry: 787435675
- Sigstore integration time:
-
Permalink:
lucasb-eyer/bagz@2a33f1e6b7cf5a67b59b16b3cec08bcb81d4b75a -
Branch / Tag:
refs/tags/v0.2.0.post1 - Owner: https://github.com/lucasb-eyer
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@2a33f1e6b7cf5a67b59b16b3cec08bcb81d4b75a -
Trigger Event:
push
-
Statement type: