Skip to main content

A Rust-backed Python library for caching Arrow record-batch streams so they can be replayed multiple times

Project description

batchcorder

A Rust-backed Python library for caching Arrow record-batch streams so they can be replayed multiple times from a source that can only be read once.

The problem

Arrow RecordBatchReader is a single-use stream — once consumed, it is gone. Training loops and multi-pass data pipelines need to iterate the same stream repeatedly without re-reading from disk or the network each time.

What batchcorder does

StreamCache wraps any Arrow stream source (anything that implements __arrow_c_stream__) and stores each RecordBatch in a two-tier hybrid cache (memory + disk) backed by Foyer. Multiple independent readers can then replay the stream concurrently, each maintaining their own position in the batch sequence.

flowchart LR
    U["upstream source<br/>(read once)"] --> D["StreamCache<br/>[mem + disk cache]"]
    D --> R0["StreamCacheReader 0<br/>(from batch 0)"]
    D --> R1["StreamCacheReader 1<br/>(from batch 0)"]
    D --> R2["StreamCacheReader 2<br/>(from batch 3)"]

Installation

pip install batchcorder

Usage

import pyarrow as pa
from batchcorder import StreamCache

table = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})

ds = StreamCache(
    table.to_reader(max_chunksize=1),  # any __arrow_c_stream__ source
    memory_capacity=64 * 1024 * 1024,  # 64 MB
    disk_path="/tmp/batchcorder-cache",
    disk_capacity=512 * 1024 * 1024,  # 512 MB
)

# Replay as many times as needed
for batch in ds:
    print(batch)

# Or get an independent reader handle
reader = ds.reader()
result = pa.RecordBatchReader.from_stream(reader).read_all()

# Pre-ingest everything upfront
ds.ingest_all()

Compatibility

StreamCache and StreamCacheReader implement both __arrow_c_stream__ and __arrow_c_schema__, so they work with any Arrow-compatible library:

import pyarrow as pa
import duckdb

pa.table(ds)             # PyArrow
pa.table(ds.reader())    # via StreamCacheReader
duckdb.table("ds")       # DuckDB

Key properties

  • Single-read source: the upstream stream is consumed exactly once; all subsequent reads come from the cache.
  • Concurrent readers: multiple StreamCacheReader instances from the same stream cache are fully independent and thread-safe.
  • Lazy ingestion: batches are fetched from the upstream source on demand as readers advance, not upfront.
  • Replay from any position: ds.reader(from_start=True) replays from batch 0; ds.reader(from_start=False) (default) starts from the frontier (next batch not yet ingested).

Eviction caveat

Foyer evicts cache entries under memory/disk pressure. If an entry is evicted before a reader reaches it, that reader will raise an error. Size the cache to hold at least as many batches as the span between the slowest and fastest concurrent reader.

Development

# Install dependencies and build the extension
uv sync
maturin develop --uv

# Run tests
uv run pytest

# Run all pre-commit checks
uv run pre-commit run --all-files

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

batchcorder-0.1.0.tar.gz (138.5 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

batchcorder-0.1.0-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ x86-64

batchcorder-0.1.0-cp311-abi3-win_amd64.whl (3.7 MB view details)

Uploaded CPython 3.11+Windows x86-64

batchcorder-0.1.0-cp311-abi3-win32.whl (3.3 MB view details)

Uploaded CPython 3.11+Windows x86

batchcorder-0.1.0-cp311-abi3-manylinux_2_28_aarch64.whl (4.4 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.28+ ARM64

batchcorder-0.1.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.17+ x86-64

batchcorder-0.1.0-cp311-abi3-macosx_11_0_arm64.whl (3.8 MB view details)

Uploaded CPython 3.11+macOS 11.0+ ARM64

batchcorder-0.1.0-cp311-abi3-macosx_10_12_x86_64.whl (4.1 MB view details)

Uploaded CPython 3.11+macOS 10.12+ x86-64

File details

Details for the file batchcorder-0.1.0.tar.gz.

File metadata

  • Download URL: batchcorder-0.1.0.tar.gz
  • Upload date:
  • Size: 138.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.6

File hashes

Hashes for batchcorder-0.1.0.tar.gz
Algorithm Hash digest
SHA256 5e1e2b34b6458ef6e1b6798f85f1cfa234ef9b7eba282d714f4a201a88349789
MD5 8e9e40e6c3c4bada88f5a588d4b51ef5
BLAKE2b-256 e0ca21605913b7bf03520099548315d0feb26df68907b2ea7f251e75c7be7269

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.0-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.0-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 322917379cc95d001e512091bf4c96c6b17623c88d7c17ae0d968c883ea6550a
MD5 0037b887939a2f8940c839c575c1ca61
BLAKE2b-256 e8da5f137d0ae53d5ee558dc4b2d845d0b72681a990816c04c4f2c729717b62c

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.0-cp311-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.0-cp311-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 70b649d1bf270a0b49cbb6caed43890c680b6e352371a62b923841bc533748ab
MD5 6dd51fe7e7256c0dbe53080ebd218d74
BLAKE2b-256 124297ddc001c648d2295d00e516f7e5f75fd8f7c1b0a3a8282030319caf4a14

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.0-cp311-abi3-win32.whl.

File metadata

  • Download URL: batchcorder-0.1.0-cp311-abi3-win32.whl
  • Upload date:
  • Size: 3.3 MB
  • Tags: CPython 3.11+, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.6

File hashes

Hashes for batchcorder-0.1.0-cp311-abi3-win32.whl
Algorithm Hash digest
SHA256 472fc7f7e082f37322551b0cac1accb7cf6d1bfeb5f4c692ec7d89954f7757d0
MD5 4f775bfedd09566ee81a3f958af61175
BLAKE2b-256 e4504fd013887a33fc97aed3579380ae92be70aa577469d8c7dabd2ae8eaff5f

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.0-cp311-abi3-manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.0-cp311-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 3ac38445766f3741f1804edf8047206cc3a213b1d707e1864a3a2845812ebf68
MD5 3a565fc2fae09207c6de935a2a14b6b7
BLAKE2b-256 c38069c7fa4c265180bb58c248be37411a27c14bd9a5359f7885799733b27c1f

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 81cd3b74e58b63729e150088dc09be85041010834408d383b1ee56dd5cb83ee1
MD5 be38ad9a1073084a2d85550463ec47db
BLAKE2b-256 994f6c4076ead7d0066cf3836ffe1761cf4d928647ddcae812a9829c360c67aa

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.0-cp311-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.0-cp311-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f6c0ace05ea4be5905b394e3b15cd790e755ef4c31b903bf553005187917f6f6
MD5 8ef86eaac0e517549cb5c67d776dd88d
BLAKE2b-256 ac0aade41109948eaf0c9d427730e7057f8007e99fc9126950bce0eafab423cc

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.0-cp311-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.0-cp311-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 5ecd82bbf1c28754cc95d7f65bf83f8ce9a693c2e4946fb046ea9c39430b0794
MD5 210680d76c938c2ff7f85a67a59718c3
BLAKE2b-256 808793724a003a7632d2b00356dfd4fc9401f9c88831573e710b453fa87d866d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page