
A format for storing a sequence of byte-array records


Säckli

This is a friendly fork of bagz.

Additions so far:

  • Merge some PRs, such as the S3 support PR by @KefanXIAO, and compile fixes.
  • Add access_pattern and cache_policy reader hints:
    • On POSIX filesystems, this can add mmap hints or use pread-based no-cache reads to optimize for random access and larger-than-RAM data.
    • On Linux, support O_DIRECT for even faster random-access reads of larger-than-RAM data.
    • On macOS, support F_NOCACHE, MAP_NOCACHE, and madvise-based cache hints on Apple silicon.
  • Make it compatible with Python versions past 3.13.
  • Make it compatible with free-threading (nogil) Python.
  • Add macOS support and wheels.
  • Add CI, stress-tests and automatic wheel releases to PyPI for Linux x86_64 and macOS arm64 (macOS 14+).

Versioning of this fork is detached from the original bagz library at the point it was forked (v0.2.0).

Overview

Säckli is a format for storing a sequence of byte-array records. It supports per-record compression and fast index-based lookup. All indexing is zero based.

Installation

The recommended installation on Linux and macOS is via the pre-built wheels on PyPI. PyPI currently ships Linux x86_64 wheels and macOS arm64 wheels for macOS 14+:

uv pip install sackli

If you want to build locally to work on this, just run uv pip install . in the checkout. However, building can be slow because of GCS and S3 support; to skip both of these dependencies for much faster builds, you can do:

CMAKE_ARGS="-DSACKLI_ENABLE_GCS=OFF -DSACKLI_ENABLE_S3=OFF" uv pip install .

Python API

Python Reader

Reader for reading a single Säckli file or a sharded file-set.

from collections.abc import Sequence, Iterable

import sackli
import numpy as np

# Säckli Readers support random access. The order of elements within a Säckli
# file is the order in which they are written. Records are returned as `bytes`
# objects.
data = sackli.Reader('/path/to/data.bagz')

# Säckli Readers can be configured like this - here we require that the file was
# written with separate limits.
data_separate_limits = sackli.Reader('/path/to/data.bagz', sackli.Reader.Options(
    limits_placement=sackli.LimitsPlacement.SEPARATE,
))

# Säckli Readers are Sequences and support slicing, iterating, etc.
assert isinstance(data, Sequence)

# Säckli Readers have a length.
assert len(data) > 10

# Can access record by row-index.
fifth_value: bytes = data[5]

# Can slice.
data_from_5: sackli.Reader = data[5:]

# Slices are still Readers.
assert isinstance(data_from_5, sackli.Reader)

assert data_from_5[0] == fifth_value

# Can access records by multiple row-indices.
fourth, second, tenth = data.read_indices([4, 2, 10])
assert fourth == data[4]
assert second == data[2]
assert tenth == data[10]

# Can iterate records.
for record in data:
  do_something_else(record)

# Can read all records. This eager version can be faster than iteration.
all_records = data.read()

# Can iterate sub-range of records.
for record in data[4:9]:
  do_something_else(record)

# Can read a sub-range of records. This eager form can be faster than
# iteration.
sub_range = data[4:9].read()

# Can use an infinite iterator as source of indices. (Reads ahead in parallel.)
def my_generator(size: int) -> Iterable[int]:
  rng = np.random.default_rng(42)
  while True:
    yield rng.integers(size).item()

data_iter: Iterable[bytes] = data.read_indices_iter(my_generator(len(data)))
for i in range(10):
  random_item: bytes = next(data_iter)

Python Reader - Index and MultiIndex

You can use Index to find the first index of a record and MultiIndex to find all instances of an item.

keys = sackli.Reader('/path/to/keys.bag')
# Get the index of the first occurrence of key.
index = sackli.Index(keys)
key_index: int = index[b'example_key']

# Get all occurrences of key.
multi_index = sackli.MultiIndex(keys)
all_indices: list[int] = multi_index[b'example_key']
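
Conceptually, Index and MultiIndex behave like lookup tables built over the key records. The following pure-Python helpers are illustrative only (they are not part of the sackli API), but they compute the same mappings over any sequence of bytes records:

```python
def build_index(records):
    # First occurrence of each record, like sackli.Index.
    first: dict[bytes, int] = {}
    for i, rec in enumerate(records):
        first.setdefault(rec, i)  # keep only the first index per record
    return first

def build_multi_index(records):
    # All occurrences of each record, like sackli.MultiIndex.
    every: dict[bytes, list[int]] = {}
    for i, rec in enumerate(records):
        every.setdefault(rec, []).append(i)  # keep every index per record
    return every

keys = [b'a', b'b', b'a', b'c', b'a']
print(build_index(keys)[b'a'])        # -> 0
print(build_multi_index(keys)[b'a'])  # -> [0, 2, 4]
```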

Python Writer

For writing a single Säckli file.

Example:

import sackli

# Compression is selected based on the file extension:
# `.bagz` will use Zstandard compression with default settings.
# `.bag` will use no compression.
with sackli.Writer('/path/to/data.bagz') as writer:
  for d in generate_records():
    writer.write(d)

# Adjust compression level explicitly.
# Note this will no longer use the extension to determine whether to compress.
with sackli.Writer(
    '/path/to/data.bagz',
    sackli.Writer.Options(
        compression=sackli.CompressionZstd(level=3)
    ),
) as writer:
  for d in generate_records():
    writer.write(d)

Options

Reader Options

sackli.Reader.Options has these optional arguments.

  • compression: Can be one of:
    • sackli.CompressionAutoDetect(): Default - Uses the file extension to decide whether records are compressed. (.bagz - Compressed (Zstandard), .bag - Uncompressed)
    • sackli.CompressionNone(): Records are not decompressed.
    • sackli.CompressionZstd(): Records are decompressed using Zstandard.
  • limits_placement: Can be one of:
    • sackli.LimitsPlacement.TAIL: Default - Reads limits from the tail of the file.
    • sackli.LimitsPlacement.SEPARATE: Reads limits from a separate file.
  • limits_storage: Can be one of:
    • sackli.LimitsStorage.ON_DISK: Default - Reads limits from disk for each read.
    • sackli.LimitsStorage.IN_MEMORY: Reads all limits from disk in one go.
  • access_pattern: Can be one of:
    • sackli.AccessPattern.SYSTEM: Default - no specific hint to the OS.
    • sackli.AccessPattern.RANDOM: Hints that you read entries in random order.
    • sackli.AccessPattern.SEQUENTIAL: Hints that you read entries roughly sequentially.
  • cache_policy: Can be one of:
    • sackli.CachePolicy.SYSTEM: Default - no specific hint to the OS.
    • sackli.CachePolicy.DROP_AFTER_READ: Reads data in such a way that the OS is unlikely to hold any of it in cache. For POSIX filesystems, this means using OS-specific no-cache hints: Linux uses pread with posix_fadvise, while macOS uses MAP_NOCACHE plus madvise for mmap-backed reads and F_NOCACHE for streaming reads. This is more efficient when you read more data than your RAM before doing any repeats (i.e. when an epoch is larger than RAM).
    • sackli.CachePolicy.DIRECT_IO: Uses O_DIRECT on Linux and F_NOCACHE on macOS to read records. This is the most aggressive OS-cache avoidance option and can be best for random reads on huge data with rare re-reads. Linux uses STATX_DIOALIGN if supported, otherwise probes from a conservative page-aligned starting point derived from file/filesystem metadata. For the unaligned tail, it does a one-time standard read at init.
  • max_parallelism: Default number of threads when reading many records.
  • sharding_layout: Can be one of:
    • sackli.ShardingLayout.CONCATENATED: Default - See Sharding
    • sackli.ShardingLayout.INTERLEAVED: See Sharding

access_pattern and cache_policy are currently interpreted only for local POSIX files and influence OS-level page-cache behaviour.

For tail-formatted files, non-default POSIX record-cache policies open a second POSIX read handle to the same file so limits metadata reads keep the default cache policy.
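
As a rough illustration of what a pread-based no-cache read could look like on POSIX, here is a standard-library-only sketch. It is illustrative only; the actual reader's platform-specific behaviour may differ:

```python
import os

def read_range_dropping_cache(path: str, offset: int, length: int) -> bytes:
    """Read [offset, offset + length) and hint the OS to drop it from cache.

    Sketch of a DROP_AFTER_READ-style policy: pread the range, then ask
    the kernel to evict those pages so a larger-than-RAM scan does not
    churn the page cache.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.pread(fd, length, offset)
        # posix_fadvise is available on Linux but not on macOS, where a
        # real implementation would use F_NOCACHE instead.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
        return data
    finally:
        os.close(fd)
```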

Writer Options

sackli.Writer.Options has these optional arguments.

  • compression: Can be one of:
    • sackli.CompressionAutoDetect(): Default - Uses the file extension to decide whether records are compressed. (.bagz - Compressed (Zstandard), .bag - Uncompressed)
    • sackli.CompressionNone(): Records are not compressed.
    • sackli.CompressionZstd(level = 3): Records are compressed using Zstandard; the compression level can be specified.
  • limits_placement: Can be one of:
    • sackli.LimitsPlacement.TAIL: Default - Writes limits to a tail of file.
    • sackli.LimitsPlacement.SEPARATE: Writes limits to a separate file.

Sharding

An ordered collection of Säckli-formatted files ("shards") may be opened together and indexed via a single global-index. The global-index is mapped to a shard and an index within that shard (shard-index) in one of two ways:

  1. Concatenated (default). Indexing is equivalent to the records in each Säckli-formatted shard being concatenated into a single sequence of records.

    Example:

    When opening four Säckli-formatted files with sizes [8, 4, 0, 5], the global-index with range [0, 17) (shown as the table entries) maps to shard and shard-index like this:

                   | shard-index
    shard          |  0  1  2  3  4  5  6  7
    -------------- | -----------------------
    00000-of-00004 |  0  1  2  3  4  5  6  7
    00001-of-00004 |  8  9 10 11
    00002-of-00004 |
    00003-of-00004 | 12 13 14 15 16
    

    Mappings

    global-index   shard            shard-index
    0              00000-of-00004   0
    1              00000-of-00004   1
    2              00000-of-00004   2
    ...            ...              ...
    8              00001-of-00004   0
    9              00001-of-00004   1
    ...            ...              ...
    15             00003-of-00004   3
    16             00003-of-00004   4
  2. Interleaved, where the global-index is assigned round-robin across all the shards.

    Example:

    When opening three Säckli-formatted files with sizes [6, 6, 5], the global-index with range [0, 17) (shown as the table entries) maps to shard and shard-index like this:

                   |  shard-index
    shard          |  0  1  2  3  4  5
    -------------- | -----------------
    00000-of-00003 |  0  3  6  9 12 15
    00001-of-00003 |  1  4  7 10 13 16
    00002-of-00003 |  2  5  8 11 14
    

    Mappings

    global-index   shard            shard-index
    0              00000-of-00003   0
    1              00001-of-00003   0
    2              00002-of-00003   0
    ...            ...              ...
    6              00000-of-00003   2
    7              00001-of-00003   2
    8              00002-of-00003   2
    ...            ...              ...
    15             00000-of-00003   5
    16             00001-of-00003   5
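
The two mappings above can be sketched in a few lines of pure Python. These helpers are illustrative only (they are not part of the sackli API); the interleaved variant assumes round-robin writing, so shard sizes differ by at most one:

```python
import bisect

def concatenated_to_shard(sizes: list[int], g: int) -> tuple[int, int]:
    """Map a global-index to (shard, shard-index) for the CONCATENATED layout."""
    starts, total = [], 0
    for n in sizes:
        starts.append(total)
        total += n
    if not 0 <= g < total:
        raise IndexError(g)
    # bisect_right lands past any empty shards sharing the same start offset.
    shard = bisect.bisect_right(starts, g) - 1
    return shard, g - starts[shard]

def interleaved_to_shard(sizes: list[int], g: int) -> tuple[int, int]:
    """Map a global-index to (shard, shard-index) for the INTERLEAVED layout."""
    shard, shard_index = g % len(sizes), g // len(sizes)
    if shard_index >= sizes[shard]:
        raise IndexError(g)
    return shard, shard_index

# Reproduce the last rows of the two example tables above.
print(concatenated_to_shard([8, 4, 0, 5], 16))  # -> (3, 4)
print(interleaved_to_shard([6, 6, 5], 16))      # -> (1, 5)
```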

Apache Beam Support

Säckli also provides Apache Beam connectors for reading and writing Säckli files in Beam pipelines.

Ensure you have Apache Beam installed.

uv pip install apache_beam

Säckli Source

import apache_beam as beam
from sackli.beam import sacklio
import tensorflow as tf

with beam.Pipeline() as pipeline:
  examples = (
      pipeline
      | 'ReadData' >> sacklio.ReadFromSackli('/path/to/your/data@*.bagz')
      | 'Decode' >> beam.Map(tf.train.Example.FromString)
  )
  # Continue your pipeline.

Säckli Sink

from sackli.beam import sacklio
import tensorflow as tf

def create_tf_example(data):
  # Replace with your actual feature creation logic.
  feature = {
      'data': tf.train.Feature(bytes_list=tf.train.BytesList(value=[data])),
  }
  return tf.train.Example(features=tf.train.Features(feature=feature))

with beam.Pipeline() as pipeline:
  data = [b'record1', b'record2', b'record3']

  examples = (
      pipeline
      | 'CreateData' >> beam.Create(data)
      | 'Encode' >> beam.Map(lambda x: create_tf_example(x).SerializeToString())
      | 'WriteData' >> sacklio.WriteToSackli('/path/to/output/data@*.bagz')
  )

Cloud Storage

Säckli supports POSIX file-systems, Google Cloud Storage (GCS), and Amazon S3. These can be enabled or disabled at compile-time, but the PyPI-deployed wheels have support for both built in.

GCS authentication

These examples assume you have the gcloud CLI installed.

gcloud config set project your-project-name
gcloud auth application-default login

S3 authentication

Authentication uses the standard AWS credential chain (environment variables, ~/.aws/credentials, IAM roles, etc.).

aws configure

Paths

Use the gs: and s3: file-system prefixes in paths.

import pathlib
import sackli

# This may freeze if you have not configured the GCS project.
gcs_reader = sackli.Reader('gs://your-bucket-name/your-file.bagz')
s3_reader = sackli.Reader('s3://your-bucket-name/your-file.bagz')

# Path supports a leading slash to work well with pathlib.
gcs_bucket = pathlib.Path('/gs://your-bucket-name')
gcs_reader = sackli.Reader(gcs_bucket / 'your-file.bagz')

s3_bucket = pathlib.Path('/s3://your-bucket-name')
s3_reader = sackli.Reader(s3_bucket / 'your-file.bagz')

Säckli/Bagz file format

For now, Säckli still preserves exactly the Bagz file format. However, this is not guaranteed to remain the case.

The Bagz file format has two parts: the records section and the limits section.

  • The records section consists of the concatenation of all (possibly compressed) records. (There are no additional bytes inside or between records, and records are not aligned in any way.)
  • The limits section is a dense array of the end-offsets of each record in order, encoded in little-endian 64-bit unsigned integers.

These can be stored as tail-limits in one file, where the limits section is appended to the records section, or as separate-limits, where they are stored in separate files.

Tail-limits example

Given a Bagz-formatted file with the following 3 uncompressed records:

Records
abcdef
123
catcat

The raw bytes of the Bagz-formatted file corresponding to the records above:

0x61 a 0x62 b 0x63 c 0x64 d 0x65 e 0x66 f
0x31 1 0x32 2 0x33 3
0x63 c 0x61 a 0x74 t 0x63 c 0x61 a 0x74 t
0x06   0x00   0x00   0x00   0x00   0x00   0x00   0x00  # 6 byte offset
0x09   0x00   0x00   0x00   0x00   0x00   0x00   0x00  # 9 byte offset
0x0f   0x00   0x00   0x00   0x00   0x00   0x00   0x00  # 15 byte offset

The last 8 bytes represent the end-offset of the last record. This is also the start of the limits section. Therefore reading the last 8 bytes will directly tell you the offset of the records/limits boundary.
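
Since the limits section is just little-endian uint64 end-offsets, a tail-limits blob can be decoded with nothing but the standard library. A minimal sketch (illustrative; not part of the sackli API), checked against the example above:

```python
import struct

def parse_tail_limits(blob: bytes) -> list[bytes]:
    """Split a tail-limits Bagz blob into its (uncompressed) records.

    The last 8 bytes hold the end-offset of the final record, which is
    also the byte offset where the limits section begins.
    """
    records_end = struct.unpack('<Q', blob[-8:])[0]
    limits_raw = blob[records_end:]
    limits = struct.unpack(f'<{len(limits_raw) // 8}Q', limits_raw)
    records, start = [], 0
    for end in limits:  # each limit is the end-offset of one record
        records.append(blob[start:end])
        start = end
    return records

# The three-record example file from above: records, then limits 6, 9, 15.
blob = b'abcdef' + b'123' + b'catcat' + struct.pack('<3Q', 6, 9, 15)
print(parse_tail_limits(blob))  # -> [b'abcdef', b'123', b'catcat']
```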
