Pydantic-compatible streaming data validator for CSV, JSONL, Parquet, and Arrow
streamval
Streaming, Pydantic-backed validation for CSV, JSONL, Parquet, Arrow, and HTTP NDJSON / SSE.
Existing data-validation libraries (Pydantic, Pandera, Great Expectations,
Cerberus) all assume the dataset fits in memory. streamval keeps the
file (or HTTP response) on disk / on the wire and validates it row by
row through a Pydantic schema, so you can validate a multi-gigabyte
file with a few tens of megabytes of RAM and start consuming valid
rows immediately. The same streaming model handles LLM token streams,
log services, and any REST endpoint that emits NDJSON or Server-Sent
Events.
Install
pip install streamval
# faster JSON + lazy CSV via polars/orjson:
pip install "streamval[fast]"
# HTTP NDJSON / LLM streaming via httpx:
pip install "streamval[http]"
# everything:
pip install "streamval[fast,http]"
Quickstart
from pydantic import BaseModel
from streamval import stream_csv

class User(BaseModel):
    id: int
    name: str
    score: float
    active: bool

for result in stream_csv("users.csv", User, on_error="collect"):
    if result.valid:
        user = result.data
        # ... do something with the parsed model ...
    else:
        for err in result.errors:
            print(f"row {result.row_index}: {err}")
The generator finishes when the file ends. Stats are available on the underlying validator:
from streamval import StreamValidator

v = StreamValidator(User, on_error="skip", batch_size=2000)
for r in v.stream_csv("users.csv"):
    handle(r.data)
print(v.stats)  # rows_total, rows_valid, throughput_rps, peak_memory_mb, ...
Performance
streamval optimises for bounded memory with strong throughput as
a secondary goal. The v0.2 Arrow batch fast path validates an entire
pyarrow.RecordBatch per Python ↔ Rust boundary crossing instead of
one row dict at a time:
| Mode | Approx rps (CI target) | Peak memory |
|---|---|---|
| streamval CSV — batch (Arrow path) | 35 000+ (polars installed) | < 5 MB |
| streamval Parquet — batch (Arrow path) | 45 000+ | < 5 MB |
| streamval CSV — row mode (polars) | ~14 000 | < 5 MB |
| streamval CSV — row mode (aiofiles fallback) | ~11 000 | < 5 MB |
| Naive Pydantic loop | ~120 000 | ~1 GB (reads whole file) |
The naive loop is faster on small files but loads the entire dataset into RAM.
streamval is the right choice when files don't fit in memory or you want to start consuming valid rows immediately.
Numbers were measured on a developer Windows laptop with Python 3.13. Real CI hardware (Linux x86, faster I/O) typically shows 2-3× higher throughput. Run
STREAMVAL_BENCH=1 pytest tests/benchmarks/ to measure on your own machine.
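The memory side of this trade-off can be reproduced without streamval. The sketch below uses only the standard library and is not the project's benchmark: `make_csv`, `parse`, `load_all`, and `stream` are hypothetical helpers, and `parse` is a plain type-coercion stand-in for schema validation. It compares the peak traced allocation of materialising every parsed row against handling one row at a time.

```python
import csv
import io
import tracemalloc


def make_csv(rows: int) -> str:
    """Build an in-memory CSV document with a header and `rows` data rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "name", "score"])
    for i in range(rows):
        writer.writerow([i, f"user{i}", i * 0.5])
    return buf.getvalue()


def parse(row: dict) -> tuple:
    """Stand-in for schema validation: coerce each field to its type."""
    return int(row["id"]), row["name"], float(row["score"])


def peak_bytes(fn) -> int:
    """Run fn() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def load_all(text: str) -> int:
    # Naive approach: keep every parsed row alive in one list.
    rows = [parse(r) for r in csv.DictReader(io.StringIO(text))]
    return len(rows)


def stream(text: str) -> int:
    # Streaming approach: parse a row, use it, let it go.
    count = 0
    for r in csv.DictReader(io.StringIO(text)):
        parse(r)
        count += 1
    return count


data = make_csv(20_000)
peak_all = peak_bytes(lambda: load_all(data))
peak_stream = peak_bytes(lambda: stream(data))
print(f"load all: {peak_all / 1e6:.1f} MB, stream: {peak_stream / 1e6:.1f} MB")
```

The absolute figures depend on the interpreter and platform; only the gap between the two is the point here.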
Performance tuning
- Install streamval[fast] to unlock the polars Arrow path for CSV. Parquet gets the Arrow fast path with no extra dependency.
- use_arrow=True is the default for CSV and Parquet on the StreamValidator constructor. Pass use_arrow=False to fall back to the row-mode pipeline (useful for adapters or strategies that need per-row Python dicts).
- batch_size is the main throughput / memory dial: larger batches mean fewer Python ↔ Rust crossings but slightly higher peak memory. The defaults give comfortable bounded-memory behaviour:
  - batch_size=100 → ~0.05 MB peak
  - batch_size=1000 → ~0.4 MB peak (default)
  - batch_size=5000 → ~1.8 MB peak
  - batch_size=10000 → ~3.5 MB peak
- workers > 1 enables a thread pool. Pydantic's Rust core is thread-safe; per-row ordering is preserved.
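How batch_size and workers can interact is sketched below with the standard library only. This is an illustration under assumptions, not streamval's implementation: `batched` and `validate_parallel` are hypothetical names, and `validate` is any callable (with Pydantic it could be a model's `model_validate`). Only one batch is in flight at a time, so memory stays bounded by the batch size, and `Executor.map` returns results in input order, so row ordering is kept.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def validate_parallel(
    rows: Iterable[T],
    validate: Callable[[T], R],
    batch_size: int = 1000,
    workers: int = 4,
) -> Iterator[R]:
    """Validate rows one batch at a time on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(rows, batch_size):
            # Executor.map yields results in the order of its inputs,
            # regardless of which worker finishes first.
            yield from pool.map(validate, batch)


doubled = list(validate_parallel(range(10), lambda x: x * 2, batch_size=3, workers=2))
print(doubled)
```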
Formats
| Format | Source | Requires |
|---|---|---|
| CSV | file / path | (none, or streamval[fast] for polars path) |
| JSONL | file / path | (none, or streamval[fast] for orjson) |
| Parquet | file / path | pyarrow (always-on dependency) |
| Arrow | file / path | pyarrow (always-on dependency) |
| NDJSON | HTTP URL | streamval[http] (httpx) |
| SSE/LLM | HTTP URL | streamval[http] (httpx) |
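For the HTTP NDJSON row, the core difficulty is that network chunks do not align with line boundaries. A minimal, stdlib-only sketch of incremental decoding follows; `iter_ndjson` is a hypothetical helper, not part of streamval, and the chunk source is assumed to be any iterable of bytes (for example the byte iterator of an HTTP client's streaming response).

```python
import json
from typing import Any, Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[Any]:
    """Decode newline-delimited JSON values from byte chunks of arbitrary size."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is a set of complete lines;
        # the remainder stays in the buffer until more data arrives.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():
                yield json.loads(line)
    # A final record may arrive without a trailing newline.
    if buffer.strip():
        yield json.loads(buffer)


chunks = [b'{"id": 1}\n{"i', b'd": 2}\n', b'{"id": 3}']
records = list(iter_ndjson(chunks))
print(records)
```

Each decoded record can then be passed to a validator one at a time, so nothing beyond the current partial line has to be buffered.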
Why not Pydantic / Pandera / Great Expectations?
| Library | Loads whole file? | Streams? | Multi-format? | Async? |
|---|---|---|---|---|
| Pydantic v2 | yes (caller decides) | no | no | no |
| Pandera | yes (DataFrame) | no | DataFrame only | no |
| Great Expectations | yes (DataFrame) | no | DataFrame only | no |
| Cerberus | per-record only | no | no | no |
| streamval | no | yes | CSV / JSONL / Parquet / Arrow / HTTP NDJSON / SSE | yes |
How it works
- Each format has a tiny async-generator adapter that yields one row dict at a time without loading the whole file.
- A BatchBuffer chunks the row stream into fixed-size lists so peak memory stays bounded by batch_size.
- Each batch is run through a CompiledValidationPlan (a per-model, cached wrapper around model.model_validate).
- A pluggable error strategy (fail_fast, collect, skip) decides whether each row is emitted, dropped, or terminates the run.
- A StatsAccumulator records per-field error counts, throughput, and peak memory via tracemalloc.
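The first three stages above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the library's code: `csv_rows`, `batches`, `to_user`, and `run` are hypothetical names, and `to_user` is a plain-Python stand-in for a Pydantic model's `model_validate`.

```python
import asyncio
import csv
import io
from typing import Any, AsyncIterator, Dict, List, Tuple

Row = Dict[str, Any]


async def csv_rows(text: str) -> AsyncIterator[Row]:
    """Adapter stage: yield one row dict at a time."""
    for row in csv.DictReader(io.StringIO(text)):
        yield row
        await asyncio.sleep(0)  # hand control back to the event loop


async def batches(rows: AsyncIterator[Row], batch_size: int) -> AsyncIterator[List[Row]]:
    """Buffer stage: group rows into lists of at most batch_size."""
    batch: List[Row] = []
    async for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_user(row: Row) -> Row:
    """Validation stand-in: raises ValueError when id is not an integer."""
    return {"id": int(row["id"]), "name": row["name"]}


async def run(text: str, batch_size: int) -> Tuple[List[Row], List[Tuple[int, str]]]:
    """Validate batch by batch, recording (row_index, message) for failures."""
    valid: List[Row] = []
    invalid: List[Tuple[int, str]] = []
    index = 0
    async for batch in batches(csv_rows(text), batch_size):
        for row in batch:
            try:
                valid.append(to_user(row))
            except ValueError as exc:
                invalid.append((index, str(exc)))
            index += 1
    return valid, invalid


SAMPLE = "id,name\n1,ada\nx,bob\n3,cy\n"
valid, invalid = asyncio.run(run(SAMPLE, batch_size=2))
print(valid, invalid)
```

At any moment only one batch of raw rows is held in the buffer, which is what keeps peak memory tied to batch_size rather than to file size.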
Error strategies
- fail_fast: raise StreamValidationError on the first invalid row.
- collect: emit every row; if max_errors is exceeded, raise on finalize.
- skip: drop invalid rows silently (logged at WARNING level).
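The behaviour of the three strategies can be illustrated with a small generator. This is a sketch under assumptions rather than streamval's implementation: `apply_strategy` and `RowError` are hypothetical (the latter stands in for StreamValidationError), validation failures are assumed to surface as `ValueError`, and results are plain `(row_index, value, error)` tuples.

```python
import logging
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple

logger = logging.getLogger("strategy_sketch")

Result = Tuple[int, Any, Optional[str]]


class RowError(Exception):
    """Hypothetical stand-in for the library's StreamValidationError."""


def apply_strategy(
    rows: Iterable[Any],
    validate: Callable[[Any], Any],
    on_error: str = "collect",
    max_errors: Optional[int] = None,
) -> Iterator[Result]:
    """Yield (row_index, value, error) triples according to on_error."""
    if on_error not in ("fail_fast", "collect", "skip"):
        raise ValueError(f"unknown on_error: {on_error!r}")
    error_count = 0
    for index, row in enumerate(rows):
        try:
            value = validate(row)
        except ValueError as exc:
            if on_error == "fail_fast":
                # Stop the whole run at the first invalid row.
                raise RowError(f"row {index}: {exc}") from exc
            if on_error == "skip":
                # Drop the row, leaving only a log record behind.
                logger.warning("dropping row %d: %s", index, exc)
                continue
            # collect: emit the failure alongside valid rows.
            error_count += 1
            yield index, None, str(exc)
        else:
            yield index, value, None
    # collect: the error budget is checked once the input is exhausted.
    if max_errors is not None and error_count > max_errors:
        raise RowError(f"{error_count} invalid rows exceed max_errors={max_errors}")


rows = ["1", "oops", "3"]
collected = list(apply_strategy(rows, int, on_error="collect"))
skipped = list(apply_strategy(rows, int, on_error="skip"))
```

With `collect`, the caller sees every row and decides what to do with failures; with `skip`, the output contains valid rows only.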
Contributing
git clone https://github.com/AmeerTechsoft/streamval
cd streamval
pip install -e ".[dev]"
pytest
License
MIT — see LICENSE.