Python bindings for s4-codec — in-process CPU compression (zstd / gzip) for ML and ETL pipelines. GPU codecs are not exposed in Python in v1.0; route GPU workloads through the s4 server gateway.
s4-codec (Python bindings)
In-process CPU compression (zstd + gzip) from Python — no S4 gateway required.
Wraps the same Rust s4-codec crate that powers the S4
S3-compatible storage gateway, so a Python notebook / Airflow task / Spark
UDF can compress and decompress with the exact same byte format as
objects sitting in an S4 bucket. (GPU codecs are intentionally NOT exposed
in Python in v1.0 — they require a CUDA toolchain + GPU at runtime, which
is a poor fit for pip install. Workloads that need GPU compression
should route through the s4 server gateway instead. Python GPU exposure
is a v1.x roadmap candidate.)
Install
pip install s4-codec # CPU codecs only (zstd + gzip)
Example
from s4_codec import CpuZstd, CpuGzip, gpu_available
codec = CpuZstd(level=3)
data = b"hello squished s3 " * 10_000
compressed, original_size, crc = codec.compress(data)
roundtrip = codec.decompress(compressed, original_size, crc)
assert roundtrip == data
# RFC 1952 gzip output — decodable by any standard `gunzip`-aware client.
gz_compressed, *_ = CpuGzip(level=6).compress(data)
assert gz_compressed[:2] == b"\x1f\x8b"
print("GPU available:", gpu_available())
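Because the gzip output follows RFC 1952, it should be readable with Python's standard `gzip` module as well. The sketch below illustrates that check; the helper names `make_gzip` and `decode_with_stdlib` are hypothetical, and when the wheel is not installed the producer falls back to the standard library so the snippet still runs.

```python
import gzip

try:
    from s4_codec import CpuGzip

    def make_gzip(data: bytes) -> bytes:
        # compress() returns (compressed, original_size, crc32c); keep the payload only.
        compressed, _size, _crc = CpuGzip(level=6).compress(data)
        return compressed
except ImportError:
    def make_gzip(data: bytes) -> bytes:
        # Stand-in producer for environments without the s4-codec wheel.
        return gzip.compress(data, compresslevel=6)


def decode_with_stdlib(blob: bytes) -> bytes:
    """Decode a gzip payload using only the standard library."""
    if blob[:2] != b"\x1f\x8b":  # RFC 1952 magic bytes
        raise ValueError("not a gzip stream")
    return gzip.decompress(blob)
```

Note that this path bypasses the size and checksum verification that `decompress(data, original_size, crc32c)` takes as inputs, so it suits interoperability checks more than integrity-sensitive reads.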
API
| Class / function | Purpose |
|---|---|
| CpuZstd(level: int = 3) | CPU zstd, level 1..=22. |
| CpuGzip(level: int = 6) | CPU gzip (RFC 1952), level 0..=9. |
| <codec>.compress(data: bytes) -> (bytes, int, int) | Returns (compressed, original_size, crc32c). |
| <codec>.decompress(data, original_size, crc32c) -> bytes | Inverse of compress. |
| gpu_available() -> bool | True iff the wheel was built with --features nvcomp-gpu and a CUDA-capable GPU is reachable. |
The (original_size, crc32c) tuple corresponds to the
ChunkManifest.original_size / ChunkManifest.crc32c fields the Rust
crate uses; round-trip them alongside the compressed payload (e.g. as
JSON sidecar fields).
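One possible way to keep the three values together is a small JSON file next to the payload. The following is a minimal sketch under stated assumptions, not part of the package: the helper names, the `.meta.json` suffix, and the JSON field names are hypothetical. A zlib-based stand-in with the same `compress` / `decompress` shape is used so the snippet runs without the wheel; it computes plain CRC-32 rather than CRC-32C. With the wheel installed, pass `CpuZstd()` as `codec` instead.

```python
import json
import tempfile
import zlib
from pathlib import Path


class StandInCodec:
    """Hypothetical stand-in mirroring the (compressed, original_size, crc) shape.

    Uses zlib and plain CRC-32 (NOT CRC-32C); swap in s4_codec.CpuZstd() for real use.
    """

    def compress(self, data: bytes):
        return zlib.compress(data), len(data), zlib.crc32(data)

    def decompress(self, data: bytes, original_size: int, crc: int) -> bytes:
        out = zlib.decompress(data)
        if len(out) != original_size or zlib.crc32(out) != crc:
            raise ValueError("size or checksum mismatch")
        return out


def sidecar_for(payload_path: Path) -> Path:
    """Derive the sidecar path (naming convention is an assumption)."""
    return payload_path.with_name(payload_path.name + ".meta.json")


def write_with_sidecar(codec, data: bytes, payload_path: Path) -> None:
    """Compress `data`, write the payload, and store size + checksum as JSON."""
    compressed, original_size, crc = codec.compress(data)
    payload_path.write_bytes(compressed)
    meta = {"original_size": original_size, "crc32c": crc}
    sidecar_for(payload_path).write_text(json.dumps(meta))


def read_with_sidecar(codec, payload_path: Path) -> bytes:
    """Load the payload and its sidecar, then pass all three values to decompress."""
    meta = json.loads(sidecar_for(payload_path).read_text())
    return codec.decompress(
        payload_path.read_bytes(), meta["original_size"], meta["crc32c"]
    )


def sidecar_roundtrip(data: bytes) -> bytes:
    """Round-trip `data` through a temporary directory using the stand-in codec."""
    codec = StandInCodec()
    with tempfile.TemporaryDirectory() as tmp:
        payload = Path(tmp) / "chunk.bin"
        write_with_sidecar(codec, data, payload)
        return read_with_sidecar(codec, payload)
```

Both integers survive JSON encoding unchanged, so the sidecar can be read back and passed straight to `decompress`.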
Build from source
pip install maturin
cd crates/s4-codec-py
maturin build --release
ls target/wheels/ # *.whl is here
The --features nvcomp-gpu flag forwards to the underlying s4-codec-rs
crate's GPU codecs at the Rust level, but the Python module does NOT
expose Python classes for the GPU codecs in v1.0 (see the GPU note in
the introduction). Building with --features nvcomp-gpu therefore only
affects what gpu_available() reports, not which Python classes are importable.
maturin develop installs the wheel into the current virtualenv for
iterative development.
Running tests
maturin develop
pip install -e ".[dev]"
pytest tests/
To run the suite against a GPU-enabled build, pass --features nvcomp-gpu.
As with a source build, in v1.0 this only affects what gpu_available()
reports; the Python module does NOT add GPU codec classes when built
with this feature.
maturin develop --release --features nvcomp-gpu
The pytest suite covers CPU codec round-trips, RFC 1952 gzip compatibility,
GIL-release threading, version inheritance, and the exception class
hierarchy that mirrors the CodecError variants (v0.8.5 #85). A separate
Rust-side test (tests/version_matches_workspace.rs) guards the workspace
version inheritance.
Error handling
The binding raises a subclass tree per CodecError variant so callers can
branch programmatically instead of string-matching:
| Exception class | CodecError variant | Base class |
|---|---|---|
| S4Error | (base + TruncatedStream) | ValueError |
| S4CrcMismatchError | CrcMismatch | S4Error |
| S4SizeMismatchError | SizeMismatch | S4Error |
| S4CodecMismatchError | CodecMismatch | S4Error |
| S4UnregisteredCodecError | UnregisteredCodec | S4Error |
| S4ManifestSizeExceedsLimitError | ManifestSizeExceedsLimit | S4Error |
| S4ManifestSizeMismatchError | ManifestSizeMismatch | S4Error |
| S4BackendError | Backend / Join | RuntimeError |
| S4IoError | Io | OSError |
S4Error inherits from ValueError for backward compat with code that
caught the previous flat ValueError mapping. S4BackendError and
S4IoError deliberately escape that hierarchy so existing retry-on-IOError
middleware continues to fire on the right class.
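As an illustration of branching on this hierarchy, the sketch below maps an exception to a coarse action. It rests on assumptions: the text does not confirm that the exception classes are importable from the top-level `s4_codec` module, so the import is guarded and falls back to stand-in classes with the same base classes as the table. The `classify_failure` helper and its policy labels are hypothetical.

```python
try:
    # Assumption: the exception classes are exported from the top-level module.
    from s4_codec import S4BackendError, S4CrcMismatchError, S4Error, S4IoError
except ImportError:
    # Stand-ins with the same base classes as the table above, so the
    # sketch still runs where the wheel is not installed.
    class S4Error(ValueError):
        pass

    class S4CrcMismatchError(S4Error):
        pass

    class S4BackendError(RuntimeError):
        pass

    class S4IoError(OSError):
        pass


def classify_failure(exc: BaseException) -> str:
    """Map a codec exception to a coarse action (hypothetical policy)."""
    # Check the most specific class first: S4CrcMismatchError is also an S4Error.
    if isinstance(exc, S4CrcMismatchError):
        return "refetch"
    # These two sit outside the ValueError tree, so test them explicitly.
    if isinstance(exc, (S4IoError, S4BackendError)):
        return "retry"
    if isinstance(exc, S4Error):
        return "reject"
    return "unknown"
```

Code written against the older flat mapping keeps working, because an `except ValueError` clause still catches every `S4Error` subclass.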
Workspace integration
The crate ships a cdylib only and uses PyO3's extension-module
feature, so cargo check -p s4-codec-py and cargo build --workspace
succeed on a CI runner with no Python development headers installed —
no libpython link is performed; the Python interpreter that loads the
.so provides those symbols at runtime.
If you ever see a link error like
undefined reference to PyExc_…, drop pyo3/extension-module from the
features and you'll get the diagnostic build that does link libpython.
Threading / GIL
Both CpuZstd.compress() and CpuGzip.compress() (and their decompress()
counterparts) release the Python GIL while running, so other Python threads
make progress concurrently. This is safe for:
- Django / Flask workers
- ASGI / asyncio event loops (use asyncio.to_thread() to wrap the blocking call)
- multi-threaded data pipelines
Example (asyncio):
import asyncio
from s4_codec import CpuZstd

async def compress_async(data: bytes) -> bytes:
    codec = CpuZstd()
    # Run the blocking call in a worker thread so the event loop stays responsive.
    compressed, orig_size, crc = await asyncio.to_thread(codec.compress, data)
    return compressed
Note: the methods themselves are synchronous — they don't return awaitables. The GIL release means another Python thread can run during the compress; it doesn't make the call async-aware.
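For the multi-threaded pipeline case, a thread pool is one way to take advantage of the GIL release. The sketch below is illustrative only: `compress_many` is a hypothetical helper, and the default `zlib_stand_in` callable (which returns the same three-value shape, with plain CRC-32) exists only so the snippet runs without the wheel. With the wheel installed, a bound method such as `CpuZstd().compress` could be passed instead. The text above does not say whether one codec instance may be shared across threads, so creating one instance per task is the conservative choice.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple

# Shape of <codec>.compress: bytes -> (compressed, original_size, checksum).
CompressFn = Callable[[bytes], Tuple[bytes, int, int]]


def zlib_stand_in(data: bytes) -> Tuple[bytes, int, int]:
    """Stand-in for a codec's compress(); uses zlib and CRC-32, not CRC-32C."""
    return zlib.compress(data), len(data), zlib.crc32(data)


def compress_many(
    chunks: Iterable[bytes],
    compress: CompressFn = zlib_stand_in,
    max_workers: int = 4,
) -> List[Tuple[bytes, int, int]]:
    """Compress chunks on worker threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the inputs were submitted.
        return list(pool.map(compress, chunks))
```

Threads only help here because the compress call releases the GIL; a pure-Python transform in the same pool would still run one thread at a time.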
Supported codecs
| Codec | Default |
|---|---|
| CpuZstd | ✓ |
| CpuGzip | ✓ |
The GPU codecs (nvcomp-zstd, nvcomp-bitcomp, nvcomp-gdeflate) are intentionally not exposed as Python classes in v1.0 — they require a CUDA toolchain at build time and a GPU at runtime, which is a poor fit for the pip install s4-codec packaging story. The nvcomp-gpu feature on the underlying Rust crate exists for the server path; the Python module's runtime classes are the two CPU codecs above. gpu_available() -> bool is exposed for clients that want to gate their own logic on GPU presence (e.g. to decide whether to route a workload through the s4 server gateway instead of the in-process Python decoder), but it does not enable any new Python class on its own. GPU codec exposure in Python is a v1.x roadmap candidate.
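A client-side gate on `gpu_available()` might look like the sketch below. Everything beyond `gpu_available()` itself is an assumption made for illustration: the `pick_route` helper, the route labels, and the 64 MiB threshold are hypothetical and not part of the package. The import is guarded so the snippet also runs where the wheel is not installed.

```python
try:
    from s4_codec import gpu_available
except ImportError:
    def gpu_available() -> bool:
        # Stand-in for environments without the wheel: report no GPU.
        return False


def pick_route(
    payload_bytes: int,
    gpu_present: bool,
    min_gateway_bytes: int = 64 * 1024 * 1024,  # hypothetical cut-off
) -> str:
    """Choose between the in-process CPU codec and the s4 server gateway."""
    # Only large payloads are worth the network hop in this toy policy.
    if gpu_present and payload_bytes >= min_gateway_bytes:
        return "gateway"
    return "in-process"
```

A caller would typically evaluate `gpu_available()` once at startup and pass the result in, which keeps the routing function easy to test.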
Publishing status
- PyPI publish is manual (no CI automation as of v0.8.5):
  cd crates/s4-codec-py
  maturin build --release
  twine upload target/wheels/*
- Workspace version inheritance was fixed in v0.8.5 #82; the published wheel version now matches the gateway version.
- target/wheels/ is gitignored; never commit .whl files.
License
Apache-2.0 — same as the rest of the S4 project.