
A format for storing a sequence of byte-array records

Project description

Säckli

This is a friendly fork of bagz.

Additions so far:

  • Merge some PRs, such as the S3 support PR by @KefanXIAO, and compile fixes.
  • Add access_pattern and cache_policy reader hints:
    • On POSIX filesystems, these can add mmap hints or use pread to optimize for random access and larger-than-RAM data.
    • On Linux, support O_DIRECT for even faster random-access reads of larger-than-RAM data.
  • Make it compatible with Python versions past 3.13.
  • Make it compatible with free-threading (nogil) Python.
  • Add CI, stress-tests and automatic wheel releases to PyPI.

Versioning of this fork is detached from the original bagz library at the point it was forked (v0.2.0).

Overview

Säckli is a format for storing a sequence of byte-array records. It supports per-record compression and fast index-based lookup. All indexing is zero based.

Installation

The recommended installation on Linux is via the pre-built wheels on PyPI:

uv pip install sackli

If you want to build locally to work on this, just run uv pip install . (note the trailing dot). However, building can be slow because of GCS and S3 support; to skip both of these dependencies for much faster builds, you can do:

CMAKE_ARGS="-DSACKLI_ENABLE_GCS=OFF -DSACKLI_ENABLE_S3=OFF" uv pip install .

Python API

Python Reader

Reader for reading a single or sharded Säckli file-set.

from collections.abc import Iterable, Iterator, Sequence

import sackli
import numpy as np

# Säckli Readers support random access. The order of elements within a Säckli
# file is the order in which they are written. Records are returned as `bytes`
# objects.
data = sackli.Reader('/path/to/data.bagz')

# Säckli Readers can be configured like this - here we require that the file was
# written with separate limits.
data_separate_limits = sackli.Reader('/path/to/data.bagz', sackli.Reader.Options(
    limits_placement=sackli.LimitsPlacement.SEPARATE,
))

# Säckli Readers are Sequences and support slicing, iterating, etc.
assert isinstance(data, Sequence)

# Säckli Readers have a length.
assert len(data) > 10

# Can access record by row-index.
fifth_value: bytes = data[5]

# Can slice.
data_from_5: sackli.Reader = data[5:]

# Slices are still Readers.
assert isinstance(data_from_5, sackli.Reader)

assert data_from_5[0] == fifth_value

# Can access records by multiple row-indices.
fourth, second, tenth = data.read_indices([4, 2, 10])
assert fourth == data[4]
assert second == data[2]
assert tenth == data[10]

# Can iterate records.
for record in data:
  do_something_else(record)

# Can read all records. This eager version can be faster than iteration.
all_records = data.read()

# Can iterate sub-range of records.
for record in data[4:9]:
  do_something_else(record)

# Can read a sub-range of records. This eager form can be faster than
# iteration.
sub_range = data[4:9].read()

# Can use an infinite iterator as a source of indices. (Reads ahead in parallel.)
def my_generator(size: int) -> Iterable[int]:
  rng = np.random.default_rng(42)
  while True:
    yield rng.integers(size).item()

data_iter: Iterator[bytes] = data.read_indices_iter(my_generator(len(data)))
for i in range(10):
  random_item: bytes = next(data_iter)

Python Reader - Index and MultiIndex

You can use Index to find the first index of a record and MultiIndex to find all instances of an item.

keys = sackli.Reader('/path/to/keys.bag')
# Get the index of the first occurrence of key.
index = sackli.Index(keys)
key_index: int = index[b'example_key']

# Get all occurrences of key.
multi_index = sackli.MultiIndex(keys)
all_indices: list[int] = multi_index[b'example_key']
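
A common usage sketch (the keys and paths are illustrative, and it assumes keys.bag and data.bagz were written in the same order so their row-indices line up): use the index over the keys file to look up records in a parallel data file.

import sackli

keys = sackli.Reader('/path/to/keys.bag')
data = sackli.Reader('/path/to/data.bagz')  # assumed written in the same order as keys
index = sackli.Index(keys)

# Look up a single record by key.
record: bytes = data[index[b'example_key']]

# Fetch several keyed records in one call via read_indices.
wanted = [b'example_key', b'another_key']
records = data.read_indices([index[k] for k in wanted])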

Python Writer

For writing a single Säckli file.

Example:

import sackli

# Compression is selected based on the file extension:
# `.bagz` will use Zstandard compression with default settings.
# `.bag` will use no compression.
with sackli.Writer('/path/to/data.bagz') as writer:
  for d in generate_records():
    writer.write(d)

# Adjust compression level explicitly.
# Note this will no longer use the extension to determine whether to compress.
with sackli.Writer(
    '/path/to/data.bagz',
    sackli.Writer.Options(
        compression=sackli.CompressionZstd(level=3)
    ),
) as writer:
  for d in generate_records():
    writer.write(d)
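
The Writer writes one file at a time; a sharded file-set (see Sharding below) is simply an ordered collection of such files. A minimal sketch of writing a concatenated-layout shard set by hand, using the shard naming shown in the Sharding section (the paths and the per-shard record source are illustrative):

import sackli

num_shards = 4
for shard in range(num_shards):
  path = f'/path/to/data-{shard:05d}-of-{num_shards:05d}.bagz'
  with sackli.Writer(path) as writer:
    for d in generate_records_for_shard(shard):  # hypothetical per-shard record source
      writer.write(d)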

Options

Reader Options

sackli.Reader.Options has these optional arguments.

  • compression: Can be one of:
    • sackli.CompressionAutoDetect(): Default - Uses the file extension to decide whether to decompress. (.bagz - compressed (Zstandard), .bag - uncompressed)
    • sackli.CompressionNone(): Records are not decompressed.
    • sackli.CompressionZstd(): Records are decompressed using Zstandard.
  • limits_placement: Can be one of:
    • sackli.LimitsPlacement.TAIL: Default - Reads limits from the tail of the file.
    • sackli.LimitsPlacement.SEPARATE: Reads limits from a separate file.
  • limits_storage: Can be one of:
    • sackli.LimitsStorage.ON_DISK: Default - Reads limits from disk for each read.
    • sackli.LimitsStorage.IN_MEMORY: Reads all limits from disk into memory up front.
  • access_pattern: Can be one of:
    • sackli.AccessPattern.SYSTEM: Default - no specific hint to the OS.
    • sackli.AccessPattern.RANDOM: Hints that you read entries in random order.
    • sackli.AccessPattern.SEQUENTIAL: Hints that you read entries roughly sequentially.
  • cache_policy: Can be one of:
    • sackli.CachePolicy.SYSTEM: Default - no specific hint to the OS.
    • sackli.CachePolicy.DROP_AFTER_READ: Reads data in such a way that the OS is unlikely to hold any of it in cache. For POSIX filesystems, this means using pread with specific flags. This is more efficient when you read more data than fits in RAM before doing any repeats (i.e. one epoch > RAM).
    • sackli.CachePolicy.DIRECT_IO: Uses Linux O_DIRECT for record reads. This is the most direct reading option: the OS does not get involved at all, no page cache, no readahead, nothing. It can be the best option for random reads over huge data with rare re-reads. Not all file-systems support it. Uses STATX_DIOALIGN if supported, otherwise probes the alignment. For the unaligned tail of the file, a one-time standard read is done at init.
  • max_parallelism: Default number of threads when reading many records.
  • sharding_layout: Can be one of:
    • sackli.ShardingLayout.CONCATENATED: Default - See Sharding
    • sackli.ShardingLayout.INTERLEAVED: See Sharding

access_pattern and cache_policy are currently interpreted only for local POSIX files and influence OS-level behaviour on page cache and cache lines.

For tail-formatted files, non-default POSIX record-cache policies open a second POSIX read handle to the same file so limits metadata reads keep the default cache policy.
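
For example, a reader tuned for random reads over a larger-than-RAM dataset might combine these options as follows (a sketch only; whether these hints actually help depends on your filesystem and access pattern):

import sackli

reader = sackli.Reader(
    '/path/to/data.bagz',
    sackli.Reader.Options(
        limits_storage=sackli.LimitsStorage.IN_MEMORY,    # load all limits up front
        access_pattern=sackli.AccessPattern.RANDOM,       # hint random-order reads to the OS
        cache_policy=sackli.CachePolicy.DROP_AFTER_READ,  # avoid filling the page cache
        max_parallelism=16,                               # threads used for multi-record reads
    ),
)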

Writer Options

sackli.Writer.Options has these optional arguments.

  • compression: Can be one of:
    • sackli.CompressionAutoDetect(): Default - Uses the file extension to decide whether to compress. (.bagz - compressed (Zstandard), .bag - uncompressed)
    • sackli.CompressionNone(): Records are not compressed.
    • sackli.CompressionZstd(level = 3): Records are compressed using Zstandard; the compression level can be specified.
  • limits_placement: Can be one of:
    • sackli.LimitsPlacement.TAIL: Default - Writes limits to the tail of the file.
    • sackli.LimitsPlacement.SEPARATE: Writes limits to a separate file.
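
As a sketch of how the limits_placement option pairs up between writer and reader (the path and records are illustrative):

import sackli

# Write the limits to a separate file rather than to the tail of data.bagz.
with sackli.Writer(
    '/path/to/data.bagz',
    sackli.Writer.Options(limits_placement=sackli.LimitsPlacement.SEPARATE),
) as writer:
  writer.write(b'first record')
  writer.write(b'second record')

# A reader of such a file must also be told that the limits are separate.
reader = sackli.Reader(
    '/path/to/data.bagz',
    sackli.Reader.Options(limits_placement=sackli.LimitsPlacement.SEPARATE),
)
assert reader[0] == b'first record'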

Apache Beam Support

Säckli also provides Apache Beam connectors for reading and writing Säckli files in Beam pipelines.

Ensure you have Apache Beam installed.

uv pip install apache_beam

Säckli Source

import apache_beam as beam
from sackli.beam import sacklio
import tensorflow as tf

with beam.Pipeline() as pipeline:
  examples = (
      pipeline
      | 'ReadData' >> sacklio.ReadFromSackli('/path/to/your/data@*.bagz')
      | 'Decode' >> beam.Map(tf.train.Example.FromString)
  )
  # Continue your pipeline.

Säckli Sink

import apache_beam as beam
from sackli.beam import sacklio
import tensorflow as tf

def create_tf_example(data):
  # Replace with your actual feature creation logic.
  feature = {
      'data': tf.train.Feature(bytes_list=tf.train.BytesList(value=[data])),
  }
  return tf.train.Example(features=tf.train.Features(feature=feature))

with beam.Pipeline() as pipeline:
  data = [b'record1', b'record2', b'record3']

  examples = (
      pipeline
      | 'CreateData' >> beam.Create(data)
      | 'Encode' >> beam.Map(lambda x: create_tf_example(x).SerializeToString())
      | 'WriteData' >> sacklio.WriteToSackli('/path/to/output/data@*.bagz')
  )

GCS Support

Säckli supports POSIX file-systems and Google Cloud Storage (GCS). If you have files on GCS, you can access them by adding the gs: prefix to the path. These examples assume you have the gcloud CLI installed.

From the shell:

gcloud config set project your-project-name
gcloud auth application-default login

Then use the 'gs:' file-system prefix.

import pathlib
import sackli

# (This may freeze if you have not configured the project.)
reader = sackli.Reader('gs://your-bucket-name/your-file.bagz')

# Path supports a leading slash to work well with pathlib.
bucket = pathlib.Path('/gs://your-bucket-name')
reader = sackli.Reader(bucket / 'your-file.bagz')

Sharding

An ordered collection of Säckli-formatted files ("shards") may be opened together and indexed via a single global-index. The global-index is mapped to a shard and an index within that shard (shard-index) in one of two ways:

  1. Concatenated (default). Indexing is equivalent to the records in each Säckli-formatted shard being concatenated into a single sequence of records.

    Example:

    When opening four Säckli-formatted files with sizes [8, 4, 0, 5], the global-index with range [0, 17) (shown as the table entries) maps to shard and shard-index like this:

                   | shard-index
    shard          |  0  1  2  3  4  5  6  7
    -------------- | -----------------------
    00000-of-00004 |  0  1  2  3  4  5  6  7
    00001-of-00004 |  8  9 10 11
    00002-of-00004 |
    00003-of-00004 | 12 13 14 15 16
    

    Mappings

    global-index | shard          | shard-index
    ------------ | -------------- | -----------
    0            | 00000-of-00004 | 0
    1            | 00000-of-00004 | 1
    2            | 00000-of-00004 | 2
    ...          | ...            | ...
    8            | 00001-of-00004 | 0
    9            | 00001-of-00004 | 1
    ...          | ...            | ...
    15           | 00003-of-00004 | 3
    16           | 00003-of-00004 | 4
  2. Interleaved. The global-index is interleaved across all the shards in a round-robin manner.

    Example:

    When opening three Säckli-formatted files with sizes [6, 6, 5], the global-index with range [0, 17) (shown as the table entries) maps to shard and shard-index like this:

                   |  shard-index
    shard          |  0  1  2  3  4  5
    -------------- | -----------------
    00000-of-00003 |  0  3  6  9 12 15
    00001-of-00003 |  1  4  7 10 13 16
    00002-of-00003 |  2  5  8 11 14
    

    Mappings

    global-index | shard          | shard-index
    ------------ | -------------- | -----------
    0            | 00000-of-00003 | 0
    1            | 00001-of-00003 | 0
    2            | 00002-of-00003 | 0
    ...          | ...            | ...
    6            | 00000-of-00003 | 2
    7            | 00001-of-00003 | 2
    8            | 00002-of-00003 | 2
    ...          | ...            | ...
    15           | 00000-of-00003 | 5
    16           | 00001-of-00003 | 5
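
The two mappings above can be reproduced in a few lines of plain Python. This sketch is purely illustrative (it is not part of the sackli API) and takes the per-shard record counts as input:

def concatenated(global_index: int, sizes: list[int]) -> tuple[int, int]:
  # Walk the shards in order, subtracting each shard's size until the
  # remaining index falls inside a shard.
  for shard, size in enumerate(sizes):
    if global_index < size:
      return shard, global_index
    global_index -= size
  raise IndexError('global index out of range')

def interleaved(global_index: int, sizes: list[int]) -> tuple[int, int]:
  # Round-robin over the shards; shards that run out of records drop out
  # of the rotation in later rounds.
  shard_index = 0
  while True:
    remaining = [s for s, size in enumerate(sizes) if size > shard_index]
    if not remaining:
      raise IndexError('global index out of range')
    if global_index < len(remaining):
      return remaining[global_index], shard_index
    global_index -= len(remaining)
    shard_index += 1

# The examples above:
assert concatenated(12, [8, 4, 0, 5]) == (3, 0)
assert interleaved(16, [6, 6, 5]) == (1, 5)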

Säckli file format

The Säckli file format has two parts: the records section and the limits section.

  • The records section consists of the concatenation of all (possibly compressed) records. (There are no additional bytes inside or between records, and records are not aligned in any way.)
  • The limits section is a dense array of the end-offsets of each record in order, encoded as little-endian 64-bit unsigned integers.

These can be stored with tail-limits in one file, where the limits section is appended to the records section, or with separate-limits, where the two sections are stored in separate files.

Tail-limits example

Given a Säckli-formatted file with the following 3 uncompressed records:

Records
abcdef
123
catcat

The raw bytes of the Säckli-formatted file corresponding to the records above are:

0x61 0x62 0x63 0x64 0x65 0x66                  # "abcdef"
0x31 0x32 0x33                                 # "123"
0x63 0x61 0x74 0x63 0x61 0x74                  # "catcat"
0x06 0x00 0x00 0x00 0x00 0x00 0x00 0x00        # end-offset  6
0x09 0x00 0x00 0x00 0x00 0x00 0x00 0x00        # end-offset  9
0x0f 0x00 0x00 0x00 0x00 0x00 0x00 0x00        # end-offset 15

The last 8 bytes represent the end-offset of the last record. This is also the start of the limits section. Therefore reading the last 8 bytes will directly tell you the offset of the records/limits boundary.
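
As a sketch of what this layout means in practice, an uncompressed tail-limits file can be decoded with nothing but the standard library (for illustration only; use sackli.Reader for real reading):

import struct

def decode_tail_limits(path: str) -> list[bytes]:
  with open(path, 'rb') as f:
    raw = f.read()
  # The last 8 bytes are the end-offset of the last record, i.e. the
  # records/limits boundary.
  (limits_start,) = struct.unpack('<Q', raw[-8:])
  num_records = (len(raw) - limits_start) // 8
  ends = struct.unpack(f'<{num_records}Q', raw[limits_start:])
  records, start = [], 0
  for end in ends:
    records.append(raw[start:end])
    start = end
  return records

# For the example above this returns [b'abcdef', b'123', b'catcat'].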

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sackli-0.2.3.tar.gz (89.4 kB)

Uploaded: Source

Built Distributions


sackli-0.2.3-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (11.9 MB)

Uploaded: CPython 3.14t, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

sackli-0.2.3-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (11.9 MB)

Uploaded: CPython 3.14, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

sackli-0.2.3-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (11.9 MB)

Uploaded: CPython 3.13t, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

sackli-0.2.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (11.9 MB)

Uploaded: CPython 3.13, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

File details

Details for the file sackli-0.2.3.tar.gz.

File metadata

  • Download URL: sackli-0.2.3.tar.gz
  • Size: 89.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sackli-0.2.3.tar.gz
Algorithm Hash digest
SHA256 1a1e3cb1df9d806d6beefe5864109f4b0c5143cd3cd86e89a482c713afa7b900
MD5 3ff5432c9d90f052c0d1a57eb79a33a5
BLAKE2b-256 6a7d9501eef91e28026a35afbe51827d2eb27fa20979e3e0d1d96f3e97fea801


Provenance

The following attestation bundles were made for sackli-0.2.3.tar.gz:

Publisher: publish.yml on lucasb-eyer/sackli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sackli-0.2.3-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for sackli-0.2.3-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 5db9166c19f1b94cc10d44bd5805c4248b845ce17d5b6ea875b1ce57b3312d54
MD5 695a7333ed993938f2c898b36016e0f4
BLAKE2b-256 9d0fb1f5cd809ec702ef32529138c938cda20b26969277fd5c74ef57d0c38c5f


Provenance

The following attestation bundles were made for sackli-0.2.3-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on lucasb-eyer/sackli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sackli-0.2.3-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for sackli-0.2.3-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 225a07c12675be3997d01f72cbf41851e4d91e6da353c130c6521838fbf7332f
MD5 ea6b40bf00186829fbce080f52451c96
BLAKE2b-256 e8c061138c3fd2722ddc25830b592412f74bc9ab415d53c60a36846ff4a0178a


Provenance

The following attestation bundles were made for sackli-0.2.3-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on lucasb-eyer/sackli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sackli-0.2.3-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for sackli-0.2.3-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 638c1585254689ccb4de9ddb229704ef173d8a9120eed26a515e248cfcf5765e
MD5 9ad04a2dd31bec4c6c913126d4c6adb7
BLAKE2b-256 638d220357be7d7468bb7357c9eaa14062bbafceb19b22a38804812c28ab1ad5


Provenance

The following attestation bundles were made for sackli-0.2.3-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on lucasb-eyer/sackli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sackli-0.2.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for sackli-0.2.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 05bc2dd241a1f5c6651d19e9fb5f4d7413038edd459f012893e78767c666207d
MD5 68440f33b9011ec37074328f0b1621ff
BLAKE2b-256 2c2b902ac39f37c605d42cdb8d4d3985fd93e5781e706a0a4c3f01771667555b


Provenance

The following attestation bundles were made for sackli-0.2.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on lucasb-eyer/sackli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
