Skip to main content

A Rust-backed Python library for caching Arrow record-batch streams so they can be replayed multiple times

Project description

batchcorder

A Rust-backed Python library for caching Arrow record-batch streams so they can be replayed multiple times from a source that can only be read once.

The problem

Arrow RecordBatchReader is a single-use stream — once consumed, it is gone. Training loops and multi-pass data pipelines need to iterate the same stream repeatedly without re-reading from disk or the network each time.

What batchcorder does

StreamCache wraps any Arrow stream source (anything that implements __arrow_c_stream__) and stores each RecordBatch in a Foyer cache. Two storage modes are supported:

  • Memory-only (default): all batches are kept in RAM; no files are created on disk.
  • Hybrid memory+disk: batches evicted from the memory tier spill to disk, allowing the working set to exceed available RAM.

Multiple independent readers can replay the stream concurrently, each maintaining their own position in the batch sequence.

flowchart LR
    U["upstream source<br/>(read once)"] --> D["StreamCache<br/>[mem cache / mem+disk cache]"]
    D --> R0["StreamCacheReader 0<br/>(from batch 0)"]
    D --> R1["StreamCacheReader 1<br/>(from batch 0)"]
    D --> R2["StreamCacheReader 2<br/>(from batch 3)"]

Installation

pip install batchcorder

Usage

import tempfile

import pyarrow as pa

from batchcorder import StreamCache

table = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})

# Memory-only (default capacity = total physical RAM)
ds = StreamCache(table.to_reader(max_chunksize=1))

# Memory-only with explicit capacity
ds = StreamCache(
    table.to_reader(max_chunksize=1),
    memory_capacity=64 * 1024 * 1024,  # 64 MB
)

# Hybrid memory+disk
tmp = tempfile.mkdtemp()
ds = StreamCache(
    table.to_reader(max_chunksize=1),
    memory_capacity=64 * 1024 * 1024,  # 64 MB in RAM
    disk_path=tmp,
    disk_capacity=512 * 1024 * 1024,  # 512 MB on disk
)

# Replay as many times as needed
for batch in ds:
    print(batch)

# Or get an independent reader handle
reader = ds.reader()
result = pa.RecordBatchReader.from_stream(reader).read_all()

# Pre-ingest everything upfront
ds.ingest_all()

Compatibility

StreamCache and StreamCacheReader implement both __arrow_c_stream__ and __arrow_c_schema__, so they work with any Arrow-compatible library:

import pyarrow as pa
import duckdb

pa.table(ds)  # PyArrow
pa.table(ds.reader())  # via StreamCacheReader
duckdb.table("ds")  # DuckDB

Key properties

  • Single-read source: the upstream stream is consumed exactly once; all subsequent reads come from the cache.
  • Concurrent readers: multiple StreamCacheReader instances from the same stream cache are fully independent and thread-safe.
  • Lazy ingestion: batches are fetched from the upstream source on demand as readers advance, not upfront.
  • Replay from any position: ds.reader(from_start=True) (default) replays from batch 0; ds.reader(from_start=False) starts from the current ingestion frontier (next batch not yet ingested).

Eviction caveat

Foyer evicts cache entries under memory/disk pressure. If an entry is evicted before a reader reaches it, that reader will raise an error. Size the cache to hold at least as many batches as the span between the slowest and fastest concurrent reader.

Development

# Install dependencies and build the extension
uv sync --no-install-project --dev
uv run maturin develop --uv

# Run tests
uv run pytest

# Run all pre-commit checks
uv run pre-commit run --all-files

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

batchcorder-0.1.1.tar.gz (201.9 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

batchcorder-0.1.1-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ x86-64

batchcorder-0.1.1-cp311-abi3-win_amd64.whl (3.7 MB view details)

Uploaded CPython 3.11+Windows x86-64

batchcorder-0.1.1-cp311-abi3-win32.whl (3.4 MB view details)

Uploaded CPython 3.11+Windows x86

batchcorder-0.1.1-cp311-abi3-manylinux_2_28_aarch64.whl (4.5 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.28+ ARM64

batchcorder-0.1.1-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.17+ x86-64

batchcorder-0.1.1-cp311-abi3-macosx_11_0_arm64.whl (3.9 MB view details)

Uploaded CPython 3.11+macOS 11.0+ ARM64

batchcorder-0.1.1-cp311-abi3-macosx_10_12_x86_64.whl (4.2 MB view details)

Uploaded CPython 3.11+macOS 10.12+ x86-64

File details

Details for the file batchcorder-0.1.1.tar.gz.

File metadata

  • Download URL: batchcorder-0.1.1.tar.gz
  • Upload date:
  • Size: 201.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.6

File hashes

Hashes for batchcorder-0.1.1.tar.gz
Algorithm Hash digest
SHA256 cbe6784759376cb87dae4a4333ec387f9f1b2d3412099963a30dd7ba4ed6abb2
MD5 5ae77ce878225f5df0c2ae17c93aca9a
BLAKE2b-256 0a5adc1272ce156bf1454f6705bd4a8d9a45828add54d3bda6ed54e4e8b5918a

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.1-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.1-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 ac22b4defd1a77925a29bbcba1e08a1f3bcd9a33eeed5d985c42ab6cdba20873
MD5 ffc5bc8d125704c921fb22f7950014c1
BLAKE2b-256 35d8b5568d347afa5bf18eeb7d345a4cf6433c943e5263482467c532caa65dbf

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.1-cp311-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.1-cp311-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 536c84a59320b30227eaa17c674329ea87142020aee00cef9684f0f8cb5bba1d
MD5 13e920018870721d6e62fdcf515e1640
BLAKE2b-256 3c0823a0d79c9d972b038f646c06bf88b7b853e15662d111b547e1d1b599c943

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.1-cp311-abi3-win32.whl.

File metadata

  • Download URL: batchcorder-0.1.1-cp311-abi3-win32.whl
  • Upload date:
  • Size: 3.4 MB
  • Tags: CPython 3.11+, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.6

File hashes

Hashes for batchcorder-0.1.1-cp311-abi3-win32.whl
Algorithm Hash digest
SHA256 9dd4b41424b0d28439a55fac6c979086b2c6b43ff9b18614f0a523edd68b20cc
MD5 bda1b2e05df65cc8e682052960a12e9c
BLAKE2b-256 1ea36cfc006e3cd7f73bd3fae6c1b6d43d0b6a58d7b81a7a2bcb3f33b1cd9127

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.1-cp311-abi3-manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.1-cp311-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 8a6fd22ea201f6773e802aa1c0c08c36bdc88eb12da28f8f8092642d2ed09371
MD5 b907edc061d63f8d1abf1e0d37c8ca78
BLAKE2b-256 9fbbd1dd2176b1d5a41ba64934bcdf04d268dd26b92d56a0ad1b594eb6f09d86

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.1-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.1-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 0e3a07ed4a850f793b3a97ce50b35c91258e984f01549447378b52d33ac869ef
MD5 7ad2d04c2a931e8c400fd5de4f02f77a
BLAKE2b-256 055ffc098712fbcb9c4c0865c84cf6c5c0623a4ee4cd312d0851efa8f88051e7

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.1-cp311-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.1-cp311-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 5d196b2f21fc716eb786d48c648feeb43937ac3a3b30631bf85e153e9753d1ff
MD5 1b2293706ca09d43ecbc48f04e534349
BLAKE2b-256 6b864482f954fe64bf0856552700280325d6684c68ca989d43128a86c7a60e7c

See more details on using hashes here.

File details

Details for the file batchcorder-0.1.1-cp311-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for batchcorder-0.1.1-cp311-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 b1a22ca25c17ecdac28f30ec1c222f0b400f964f4cb9b3bfe4581b935e9f49e6
MD5 3610e690a979cf8faf440f17fb0863f8
BLAKE2b-256 7ef4ec054995441391f36b7fd91e368644646b2249f69842b4d874f68d6260f4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page