
A format for storing a sequence of byte-array records


Säckli

This is a friendly fork of bagz.

Additions so far:

  • Merge some upstream PRs, such as the S3 support PR by @KefanXIAO, and compile fixes.
  • Add access_pattern and cache_policy reader hints:
    • On POSIX filesystems, this can add mmap hints or use pread-based no-cache reads to optimize for random access and larger-than-RAM data.
    • On Linux, support O_DIRECT for even faster random-access reads of larger-than-RAM data.
    • On macOS, support F_NOCACHE, MAP_NOCACHE, and madvise-based cache hints on Apple silicon.
  • Make it compatible with Python versions beyond 3.13.
  • Make it compatible with free-threading (nogil) Python.
  • Add macOS support and wheels.
  • Add CI, stress-tests and automatic wheel releases to PyPI for Linux x86_64 and macOS arm64 (macOS 14+).

Versioning of this fork is detached from the original bagz library at the point it was forked (v0.2.0).

Overview

Säckli is a format for storing a sequence of byte-array records. It supports per-record compression and fast index-based lookup. All indexing is zero based.

Installation

The recommended installation on Linux and Mac is via the pre-built wheels on PyPI. PyPI currently ships Linux x86_64 wheels and macOS arm64 wheels for macOS 14+:

uv pip install sackli

If you want to build locally to work on this, just run uv pip install . in the repository root. However, building can be slow because of GCS and S3 support; to skip both of these dependencies for much faster builds, you can do:

CMAKE_ARGS="-DSACKLI_ENABLE_GCS=OFF -DSACKLI_ENABLE_S3=OFF" uv pip install .

Python API

Python Reader

Reader for reading a single Säckli file or a sharded Säckli file-set.

from collections.abc import Sequence, Iterable

import sackli
import numpy as np

# Säckli Readers support random access. The order of elements within a Säckli
# file is the order in which they are written. Records are returned as `bytes`
# objects.
data = sackli.Reader('/path/to/data.bagz')

# Säckli Readers can be configured like this - here we require that the file was
# written with separate limits.
data_separate_limits = sackli.Reader('/path/to/data.bagz', sackli.Reader.Options(
    limits_placement=sackli.LimitsPlacement.SEPARATE,
))

# Säckli Readers are Sequences and support slicing, iterating, etc.
assert isinstance(data, Sequence)

# Säckli Readers have a length.
assert len(data) > 10

# Can access record by row-index.
fifth_value: bytes = data[5]

# Can slice.
data_from_5: sackli.Reader = data[5:]

# Slices are still Readers.
assert isinstance(data_from_5, sackli.Reader)

assert data_from_5[0] == fifth_value

# Can access records by multiple row-indices.
fourth, second, tenth = data.read_indices([4, 2, 10])
assert fourth == data[4]
assert second == data[2]
assert tenth == data[10]

# Can iterate records.
for record in data:
  do_something_else(record)

# Can read all records. This eager version can be faster than iteration.
all_records = data.read()

# Can iterate sub-range of records.
for record in data[4:9]:
  do_something_else(record)

# Can read a sub-range of records. This eager form can be faster than
# iteration.
sub_range = data[4:9].read()

# Can use an infinite iterator as source of indices. (Reads ahead in parallel.)
def my_generator(size: int) -> Iterable[int]:
  rng = np.random.default_rng(42)
  while True:
    yield rng.integers(size).item()

data_iter: Iterable[bytes] = data.read_indices_iter(my_generator(len(data)))
for i in range(10):
  random_item: bytes = next(data_iter)

Python Reader - Index and MultiIndex

You can use Index to find the first index of a record and MultiIndex to find all indices at which a record occurs.

keys = sackli.Reader('/path/to/keys.bag')
# Get the index of the first occurrence of key.
index = sackli.Index(keys)
key_index: int = index[b'example_key']

# Get all occurrences of key.
multi_index = sackli.MultiIndex(keys)
all_indices: list[int] = multi_index[b'example_key']
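Conceptually, Index and MultiIndex behave like dictionaries built over the key records. The following pure-Python sketch illustrates the idea (for intuition only; it is not the library's actual implementation):

```python
from collections import defaultdict

def build_first_index(records: list[bytes]) -> dict[bytes, int]:
    """Map each record to the index of its first occurrence (like Index)."""
    first: dict[bytes, int] = {}
    for i, rec in enumerate(records):
        first.setdefault(rec, i)
    return first

def build_multi_index(records: list[bytes]) -> dict[bytes, list[int]]:
    """Map each record to all indices at which it occurs (like MultiIndex)."""
    multi: dict[bytes, list[int]] = defaultdict(list)
    for i, rec in enumerate(records):
        multi[rec].append(i)
    return multi

keys = [b'a', b'b', b'a', b'c', b'b']
assert build_first_index(keys)[b'a'] == 0
assert build_multi_index(keys)[b'b'] == [1, 4]
```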

Python Writer

For writing a single Säckli file.

Example:

import sackli

# Compression is selected based on the file extension:
# `.bagz` will use Zstandard compression with default settings.
# `.bag` will use no compression.
with sackli.Writer('/path/to/data.bagz') as writer:
  for d in generate_records():
    writer.write(d)

# Adjust compression level explicitly.
# Note this will no longer use the extension to determine whether to compress.
with sackli.Writer(
    '/path/to/data.bagz',
    sackli.Writer.Options(
        compression=sackli.CompressionZstd(level=3)
    ),
) as writer:
  for d in generate_records():
    writer.write(d)

Options

Reader Options

sackli.Reader.Options has these optional arguments.

  • compression: Can be one of:
    • sackli.CompressionAutoDetect(): Default - Uses the file extension to decide whether to decompress. (.bagz - Compressed (Zstandard), .bag - Uncompressed)
    • sackli.CompressionNone(): Records are not decompressed.
    • sackli.CompressionZstd(): Records are decompressed using Zstandard.
  • limits_placement: Can be one of:
    • sackli.LimitsPlacement.TAIL: Default - Reads limits from the tail of the file.
    • sackli.LimitsPlacement.SEPARATE: Reads limits from a separate file.
  • limits_storage: Can be one of:
    • sackli.LimitsStorage.ON_DISK: Default - Reads limits from disk for each read.
    • sackli.LimitsStorage.IN_MEMORY: Reads all limits from disk in one go.
  • access_pattern: Can be one of:
    • sackli.AccessPattern.SYSTEM: Default - no specific hint to the OS.
    • sackli.AccessPattern.RANDOM: Hints that you read entries in random order.
    • sackli.AccessPattern.SEQUENTIAL: Hints that you read entries roughly sequentially.
  • cache_policy: Can be one of:
    • sackli.CachePolicy.SYSTEM: Default - no specific hint to the OS.
    • sackli.CachePolicy.DROP_AFTER_READ: Reads data in such a way that the OS is unlikely to hold any of it in cache. On POSIX filesystems, this uses OS-specific no-cache hints: Linux uses pread with posix_fadvise, while macOS uses MAP_NOCACHE plus madvise for mmap-backed reads and F_NOCACHE for streaming reads. This is more efficient when you read more data than fits in RAM before doing any repeats (i.e. when an epoch is larger than RAM).
    • sackli.CachePolicy.DIRECT_IO: Uses O_DIRECT on Linux and F_NOCACHE on macOS to read records. This is the most aggressive OS-cache-avoidance option and can be best for random reads on huge data with rare re-reads. Linux uses STATX_DIOALIGN if supported, otherwise probes from a conservative page-aligned starting point derived from file/filesystem metadata. For the unaligned tail, it does a one-time standard read at init.
  • max_parallelism: Default number of threads to use when reading many records.
  • sharding_layout: Can be one of:
    • sackli.ShardingLayout.CONCATENATED: Default - See Sharding
    • sackli.ShardingLayout.INTERLEAVED: See Sharding

access_pattern and cache_policy are currently interpreted only for local POSIX files and influence OS-level page-cache behaviour.

For tail-formatted files, non-default POSIX record-cache policies open a second POSIX read handle to the same file so limits metadata reads keep the default cache policy.

Writer Options

sackli.Writer.Options has these optional arguments.

  • compression: Can be one of:
    • sackli.CompressionAutoDetect(): Default - Uses the file extension to decide whether to compress. (.bagz - Compressed (Zstandard), .bag - Uncompressed)
    • sackli.CompressionNone(): Records are not compressed.
    • sackli.CompressionZstd(level = 3): Records are compressed using Zstandard; the compression level can be specified.
  • limits_placement: Can be one of:
    • sackli.LimitsPlacement.TAIL: Default - Writes limits to a tail of file.
    • sackli.LimitsPlacement.SEPARATE: Writes limits to a separate file.

Sharding

An ordered collection of Säckli-formatted files ("shards") may be opened together and indexed via a single global-index. The global-index is mapped to a shard and an index within that shard (shard-index) in one of two ways:

  1. Concatenated (default). Indexing is equivalent to the records in each Säckli-formatted shard being concatenated into a single sequence of records.

    Example:

    When opening four Säckli-formatted files with sizes [8, 4, 0, 5], the global-index with range [0, 17) (shown as the table entries) maps to shard and shard-index like this:

                   | shard-index
    shard          |  0  1  2  3  4  5  6  7
    -------------- | -----------------------
    00000-of-00004 |  0  1  2  3  4  5  6  7
    00001-of-00004 |  8  9 10 11
    00002-of-00004 |
    00003-of-00004 | 12 13 14 15 16
    

    Mappings

    global-index   shard            shard-index
    0              00000-of-00004   0
    1              00000-of-00004   1
    2              00000-of-00004   2
    ...            ...              ...
    8              00001-of-00004   0
    9              00001-of-00004   1
    ...            ...              ...
    15             00003-of-00004   3
    16             00003-of-00004   4
  2. Interleaved, where the global-index is distributed round-robin across all the shards.

    Example:

    When opening three Säckli-formatted files with sizes [6, 6, 5], the global-index with range [0, 17) (shown as the table entries) maps to shard and shard-index like this:

                   |  shard-index
    shard          |  0  1  2  3  4  5
    -------------- | -----------------
    00000-of-00003 |  0  3  6  9 12 15
    00001-of-00003 |  1  4  7 10 13 16
    00002-of-00003 |  2  5  8 11 14
    

    Mappings

    global-index   shard            shard-index
    0              00000-of-00003   0
    1              00001-of-00003   0
    2              00002-of-00003   0
    ...            ...              ...
    6              00000-of-00003   2
    7              00001-of-00003   2
    8              00002-of-00003   2
    ...            ...              ...
    15             00000-of-00003   5
    16             00001-of-00003   5
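Both layouts reduce to small mapping functions. The sketch below mirrors the tables above; note the simple round-robin formula for the interleaved layout assumes, as in the example, that shard sizes differ by at most one:

```python
def concatenated(global_index: int, sizes: list[int]) -> tuple[int, int]:
    """Map a global-index to (shard, shard-index) in the concatenated layout."""
    for shard, size in enumerate(sizes):
        if global_index < size:
            return shard, global_index
        global_index -= size  # Skip past this shard's records.
    raise IndexError(global_index)

def interleaved(global_index: int, num_shards: int) -> tuple[int, int]:
    """Round-robin mapping; valid when shard sizes differ by at most one."""
    return global_index % num_shards, global_index // num_shards

# Matches the concatenated example with sizes [8, 4, 0, 5]:
assert concatenated(8, [8, 4, 0, 5]) == (1, 0)
assert concatenated(16, [8, 4, 0, 5]) == (3, 4)
# Matches the interleaved example with three shards:
assert interleaved(7, 3) == (1, 2)
assert interleaved(16, 3) == (1, 5)
```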

Apache Beam Support

Säckli also provides Apache Beam connectors for reading and writing Säckli files in Beam pipelines.

Ensure you have Apache Beam installed.

uv pip install apache_beam

Säckli Source

import apache_beam as beam
from sackli.beam import sacklio
import tensorflow as tf

with beam.Pipeline() as pipeline:
  examples = (
      pipeline
      | 'ReadData' >> sacklio.ReadFromSackli('/path/to/your/data@*.bagz')
      | 'Decode' >> beam.Map(tf.train.Example.FromString)
  )
  # Continue your pipeline.

Säckli Sink

from sackli.beam import sacklio
import tensorflow as tf

def create_tf_example(data):
  # Replace with your actual feature creation logic.
  feature = {
      'data': tf.train.Feature(bytes_list=tf.train.BytesList(value=[data])),
  }
  return tf.train.Example(features=tf.train.Features(feature=feature))

with beam.Pipeline() as pipeline:
  data = [b'record1', b'record2', b'record3']

  examples = (
      pipeline
      | 'CreateData' >> beam.Create(data)
      | 'Encode' >> beam.Map(lambda x: create_tf_example(x).SerializeToString())
      | 'WriteData' >> sacklio.WriteToSackli('/path/to/output/data@*.bagz')
  )

Cloud Storage

Säckli supports POSIX file-systems, Google Cloud Storage (GCS), and Amazon S3. These can be enabled or disabled at compile-time, but the PyPI-deployed wheels have support for both built-in.

GCS authentication

These examples assume you have the gcloud CLI installed.

gcloud config set project your-project-name
gcloud auth application-default login

S3 authentication

Authentication uses the standard AWS credential chain (environment variables, ~/.aws/credentials, IAM roles, etc.).

aws configure

Paths

Use the gs: and s3: file-system prefixes in paths.

import pathlib
import sackli

# This may freeze if you have not configured the GCS project.
gcs_reader = sackli.Reader('gs://your-bucket-name/your-file.bagz')
s3_reader = sackli.Reader('s3://your-bucket-name/your-file.bagz')

# Path supports a leading slash to work well with pathlib.
gcs_bucket = pathlib.Path('/gs://your-bucket-name')
gcs_reader = sackli.Reader(gcs_bucket / 'your-file.bagz')

s3_bucket = pathlib.Path('/s3://your-bucket-name')
s3_reader = sackli.Reader(s3_bucket / 'your-file.bagz')

Säckli/Bagz file format

For now, Säckli still preserves exactly the Bagz file format. However, this is not guaranteed to remain the case.

The Bagz file format has two parts: the records section and the limits section.

  • The records section consists of the concatenation of all (possibly compressed) records. (There are no additional bytes inside or between records, and records are not aligned in any way.)
  • The limits section is a dense array of the end-offsets of each record in order, encoded in little-endian 64-bit unsigned integers.

These can be stored as tail-limits in one file, where the limits section is appended to the records section, or as separate-limits, where they are stored in separate files.

Tail-limits example

Given a Bagz-formatted file with the following 3 uncompressed records:

Records
abcdef
123
catcat

The raw bytes of the Bagz-formatted file corresponding to the records above:

0x61 a 0x62 b 0x63 c 0x64 d 0x65 e 0x66 f
0x31 1 0x32 2 0x33 3
0x63 c 0x61 a 0x74 t 0x63 c 0x61 a 0x74 t
0x06   0x00   0x00   0x00   0x00   0x00   0x00   0x00  # 6 byte offset
0x09   0x00   0x00   0x00   0x00   0x00   0x00   0x00  # 9 byte offset
0x0f   0x00   0x00   0x00   0x00   0x00   0x00   0x00  # 15 byte offset

The last 8 bytes represent the end-offset of the last record. This is also the start of the limits section. Therefore reading the last 8 bytes will directly tell you the offset of the records/limits boundary.
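The tail-limits layout above can be reproduced with a few lines of Python using the struct module. This is a sketch of the on-disk byte layout for uncompressed records, not the library's writer:

```python
import struct

def pack_tail(records: list[bytes]) -> bytes:
    """Concatenate records, then append each end-offset as a little-endian u64."""
    body = b''.join(records)
    end, limits = 0, b''
    for rec in records:
        end += len(rec)
        limits += struct.pack('<Q', end)
    return body + limits

def unpack_tail(blob: bytes) -> list[bytes]:
    """The last u64 is the records/limits boundary; slice records from the limits."""
    (records_end,) = struct.unpack('<Q', blob[-8:])
    num_records = (len(blob) - records_end) // 8
    ends = struct.unpack(f'<{num_records}Q', blob[records_end:])
    starts = (0,) + ends[:-1]
    return [blob[s:e] for s, e in zip(starts, ends)]

blob = pack_tail([b'abcdef', b'123', b'catcat'])
assert blob[-8:] == struct.pack('<Q', 15)  # End-offset of the last record.
assert unpack_tail(blob) == [b'abcdef', b'123', b'catcat']
```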
