A Rust-backed Python library for caching Arrow record-batch streams so they can be replayed multiple times
Project description
batchcorder
A Rust-backed Python library for caching Arrow record-batch streams so they can be replayed multiple times from a source that can only be read once.
The problem
Arrow RecordBatchReader is a single-use stream — once consumed, it is gone.
Training loops and multi-pass data pipelines need to iterate the same stream
repeatedly without re-reading from disk or the network each time.
What batchcorder does
StreamCache wraps any Arrow stream source (anything that implements
__arrow_c_stream__) and stores each RecordBatch in a
Foyer cache. Two storage modes are
supported:
- Memory-only (default): all batches are kept in RAM; no files are created on disk.
- Hybrid memory+disk: batches evicted from the memory tier spill to disk, allowing the working set to exceed available RAM.
Multiple independent readers can replay the stream concurrently, each maintaining their own position in the batch sequence.
flowchart LR
U["upstream source<br/>(read once)"] --> D["StreamCache<br/>[mem cache / mem+disk cache]"]
D --> R0["StreamCacheReader 0<br/>(from batch 0)"]
D --> R1["StreamCacheReader 1<br/>(from batch 0)"]
D --> R2["StreamCacheReader 2<br/>(from batch 3)"]
Installation
pip install batchcorder
Usage
import tempfile
import pyarrow as pa
from batchcorder import StreamCache
table = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
# Memory-only (default capacity = total physical RAM)
ds = StreamCache(table.to_reader(max_chunksize=1))
# Memory-only with explicit capacity
ds = StreamCache(
table.to_reader(max_chunksize=1),
memory_capacity=64 * 1024 * 1024, # 64 MB
)
# Hybrid memory+disk
tmp = tempfile.mkdtemp()
ds = StreamCache(
table.to_reader(max_chunksize=1),
memory_capacity=64 * 1024 * 1024, # 64 MB in RAM
disk_path=tmp,
disk_capacity=512 * 1024 * 1024, # 512 MB on disk
)
# Replay as many times as needed
for batch in ds:
print(batch)
# Or get an independent reader handle
reader = ds.reader()
result = pa.RecordBatchReader.from_stream(reader).read_all()
# Pre-ingest everything upfront
ds.ingest_all()
Compatibility
StreamCache and StreamCacheReader implement both __arrow_c_stream__
and __arrow_c_schema__, so they work with any Arrow-compatible library:
import pyarrow as pa
import duckdb
pa.table(ds) # PyArrow
pa.table(ds.reader()) # via StreamCacheReader
duckdb.table("ds") # DuckDB
Key properties
- Single-read source: the upstream stream is consumed exactly once; all subsequent reads come from the cache.
- Concurrent readers: multiple
StreamCacheReaderinstances from the same stream cache are fully independent and thread-safe. - Lazy ingestion: batches are fetched from the upstream source on demand as readers advance, not upfront.
- Replay from any position:
ds.reader(from_start=True)(default) replays from batch 0;ds.reader(from_start=False)starts from the current ingestion frontier (next batch not yet ingested).
Eviction caveat
Foyer evicts cache entries under memory/disk pressure. If an entry is evicted before a reader reaches it, that reader will raise an error. Size the cache to hold at least as many batches as the span between the slowest and fastest concurrent reader.
Development
# Install dependencies and build the extension
uv sync --no-install-project --dev
uv run maturin develop --uv
# Run tests
uv run pytest
# Run all pre-commit checks
uv run pre-commit run --all-files
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file batchcorder-0.1.1.tar.gz.
File metadata
- Download URL: batchcorder-0.1.1.tar.gz
- Upload date:
- Size: 201.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
cbe6784759376cb87dae4a4333ec387f9f1b2d3412099963a30dd7ba4ed6abb2
|
|
| MD5 |
5ae77ce878225f5df0c2ae17c93aca9a
|
|
| BLAKE2b-256 |
0a5adc1272ce156bf1454f6705bd4a8d9a45828add54d3bda6ed54e4e8b5918a
|
File details
Details for the file batchcorder-0.1.1-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: batchcorder-0.1.1-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 4.7 MB
- Tags: PyPy, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ac22b4defd1a77925a29bbcba1e08a1f3bcd9a33eeed5d985c42ab6cdba20873
|
|
| MD5 |
ffc5bc8d125704c921fb22f7950014c1
|
|
| BLAKE2b-256 |
35d8b5568d347afa5bf18eeb7d345a4cf6433c943e5263482467c532caa65dbf
|
File details
Details for the file batchcorder-0.1.1-cp311-abi3-win_amd64.whl.
File metadata
- Download URL: batchcorder-0.1.1-cp311-abi3-win_amd64.whl
- Upload date:
- Size: 3.7 MB
- Tags: CPython 3.11+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
536c84a59320b30227eaa17c674329ea87142020aee00cef9684f0f8cb5bba1d
|
|
| MD5 |
13e920018870721d6e62fdcf515e1640
|
|
| BLAKE2b-256 |
3c0823a0d79c9d972b038f646c06bf88b7b853e15662d111b547e1d1b599c943
|
File details
Details for the file batchcorder-0.1.1-cp311-abi3-win32.whl.
File metadata
- Download URL: batchcorder-0.1.1-cp311-abi3-win32.whl
- Upload date:
- Size: 3.4 MB
- Tags: CPython 3.11+, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9dd4b41424b0d28439a55fac6c979086b2c6b43ff9b18614f0a523edd68b20cc
|
|
| MD5 |
bda1b2e05df65cc8e682052960a12e9c
|
|
| BLAKE2b-256 |
1ea36cfc006e3cd7f73bd3fae6c1b6d43d0b6a58d7b81a7a2bcb3f33b1cd9127
|
File details
Details for the file batchcorder-0.1.1-cp311-abi3-manylinux_2_28_aarch64.whl.
File metadata
- Download URL: batchcorder-0.1.1-cp311-abi3-manylinux_2_28_aarch64.whl
- Upload date:
- Size: 4.5 MB
- Tags: CPython 3.11+, manylinux: glibc 2.28+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8a6fd22ea201f6773e802aa1c0c08c36bdc88eb12da28f8f8092642d2ed09371
|
|
| MD5 |
b907edc061d63f8d1abf1e0d37c8ca78
|
|
| BLAKE2b-256 |
9fbbd1dd2176b1d5a41ba64934bcdf04d268dd26b92d56a0ad1b594eb6f09d86
|
File details
Details for the file batchcorder-0.1.1-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: batchcorder-0.1.1-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 4.7 MB
- Tags: CPython 3.11+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0e3a07ed4a850f793b3a97ce50b35c91258e984f01549447378b52d33ac869ef
|
|
| MD5 |
7ad2d04c2a931e8c400fd5de4f02f77a
|
|
| BLAKE2b-256 |
055ffc098712fbcb9c4c0865c84cf6c5c0623a4ee4cd312d0851efa8f88051e7
|
File details
Details for the file batchcorder-0.1.1-cp311-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: batchcorder-0.1.1-cp311-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 3.9 MB
- Tags: CPython 3.11+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5d196b2f21fc716eb786d48c648feeb43937ac3a3b30631bf85e153e9753d1ff
|
|
| MD5 |
1b2293706ca09d43ecbc48f04e534349
|
|
| BLAKE2b-256 |
6b864482f954fe64bf0856552700280325d6684c68ca989d43128a86c7a60e7c
|
File details
Details for the file batchcorder-0.1.1-cp311-abi3-macosx_10_12_x86_64.whl.
File metadata
- Download URL: batchcorder-0.1.1-cp311-abi3-macosx_10_12_x86_64.whl
- Upload date:
- Size: 4.2 MB
- Tags: CPython 3.11+, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b1a22ca25c17ecdac28f30ec1c222f0b400f964f4cb9b3bfe4581b935e9f49e6
|
|
| MD5 |
3610e690a979cf8faf440f17fb0863f8
|
|
| BLAKE2b-256 |
7ef4ec054995441391f36b7fd91e368644646b2249f69842b4d874f68d6260f4
|