Pydantic-compatible streaming data validator for CSV, JSONL, Parquet, and Arrow

These details have not been verified by PyPI

Project description

streamval

Streaming, Pydantic-backed validation for CSV, JSONL, Parquet, Arrow, and HTTP NDJSON / SSE.

Existing data-validation libraries (Pydantic, Pandera, Great Expectations, Cerberus) either validate an in-memory DataFrame or leave file reading to the caller, so in practice the whole dataset ends up in memory. streamval keeps the file (or HTTP response) on disk or on the wire and validates it row by row through a Pydantic schema, so you can validate a multi-gigabyte file with a few tens of megabytes of RAM and start consuming valid rows immediately. The same streaming model handles LLM token streams, log services, and any REST endpoint that emits NDJSON or Server-Sent Events.

Install

pip install streamval
# faster JSON + lazy CSV via polars/orjson:
pip install "streamval[fast]"
# HTTP NDJSON / LLM streaming via httpx:
pip install "streamval[http]"
# everything:
pip install "streamval[fast,http]"

Quickstart

from pydantic import BaseModel
from streamval import stream_csv

class User(BaseModel):
    id: int
    name: str
    score: float
    active: bool

for result in stream_csv("users.csv", User, on_error="collect"):
    if result.valid:
        user = result.data
        # ... do something with the parsed model ...
    else:
        for err in result.errors:
            print(f"row {result.row_index}: {err}")

The generator finishes when the file ends. Stats are available on the underlying validator:

from streamval import StreamValidator

# Reuses the User model defined in the Quickstart above.
v = StreamValidator(User, on_error="skip", batch_size=2000)
for r in v.stream_csv("users.csv"):
    print(r.data)  # replace with your own row handling
print(v.stats)  # rows_total, rows_valid, throughput_rps, peak_memory_mb, ...

Performance

streamval optimises for bounded memory with strong throughput as a secondary goal. The v0.2 Arrow batch fast path validates an entire pyarrow.RecordBatch per Python ↔ Rust boundary crossing instead of one row dict at a time:

Mode	Approx. throughput in rows/s (CI target)	Peak memory
streamval CSV — batch (Arrow path)	35 000+ (polars installed)	< 5 MB
streamval Parquet — batch (Arrow path)	45 000+	< 5 MB
streamval CSV — row mode (polars)	~14 000	< 5 MB
streamval CSV — row mode (aiofiles fallback)	~11 000	< 5 MB
Naive Pydantic loop	~120 000	~1 GB (reads whole file)

The naive loop is faster on small files but loads the entire dataset into RAM. streamval is the right choice when files don't fit in memory or you want to start consuming valid rows immediately.

These numbers were measured on a developer Windows laptop running Python 3.13. CI hardware (Linux x86, faster I/O) typically shows 2-3× higher throughput. To measure on your own machine, run STREAMVAL_BENCH=1 pytest tests/benchmarks/.

Performance tuning

Install streamval[fast] to unlock the polars Arrow path for CSV. Parquet gets the Arrow fast path with no extra dependency.
use_arrow=True is the default for CSV and Parquet on the StreamValidator constructor. Pass use_arrow=False to fall back to the row-mode pipeline (useful for adapters or strategies that need per-row Python dicts).
batch_size is the main throughput / memory dial — larger batches mean fewer Python ↔ Rust crossings but slightly higher peak memory. The defaults give comfortable bounded-memory behaviour:
```
batch_size=100   → ~0.05 MB peak
batch_size=1000  → ~0.4 MB peak  (default)
batch_size=5000  → ~1.8 MB peak
batch_size=10000 → ~3.5 MB peak
```
workers > 1 enables a thread pool. Pydantic's Rust core is thread-safe; per-row ordering is preserved.
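
The ordering guarantee for workers > 1 can be illustrated with the standard library alone. The sketch below is not streamval's implementation: `validate_row` is a hypothetical stand-in for Pydantic validation, and the example relies only on the documented behaviour that `ThreadPoolExecutor.map` returns results in input order even when tasks finish out of order.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def validate_row(row: dict) -> tuple[int, bool]:
    """Hypothetical stand-in for per-row validation (not streamval's API)."""
    # Earlier rows sleep longer, so they finish after later rows.
    time.sleep(0.01 * (5 - row["id"]))
    return row["id"], isinstance(row["score"], float)


rows = [{"id": i, "score": float(i)} for i in range(5)]
rows[3]["score"] = "oops"  # one deliberately invalid row

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in the order of the inputs, not completion order.
    results = list(pool.map(validate_row, rows))

print(results)
```

Because the results come back in input order, a consumer can still pair each result with its row index without any extra bookkeeping.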

Formats

Format	Source	Requires
CSV	file / path	(none, or `streamval[fast]` for polars path)
JSONL	file / path	(none, or `streamval[fast]` for orjson)
Parquet	file / path	`pyarrow` (always-on dependency)
Arrow	file / path	`pyarrow` (always-on dependency)
NDJSON	HTTP URL	`streamval[http]` (httpx)
SSE/LLM	HTTP URL	`streamval[http]` (httpx)
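
For readers unfamiliar with the two HTTP formats, here is a minimal sketch of how NDJSON and Server-Sent Events can be parsed line by line. It uses only the standard library and in-memory streams; the helper names `iter_ndjson` and `iter_sse_data` are hypothetical and are not part of streamval, and the SSE parser is simplified to the `data` field.

```python
import io
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event.

    Simplified: only `data:` lines are handled; `event`, `id`, and `retry`
    fields are ignored.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips a single leading space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)


ndjson = io.StringIO('{"id": 1}\n\n{"id": 2}\n')
sse = io.StringIO(': ping\ndata: {"token": "Hel"}\n\ndata: {"token": "lo"}\n\n')

records = list(iter_ndjson(ndjson))
tokens = [json.loads(d)["token"] for d in iter_sse_data(sse)]
print(records, tokens)
```

Both helpers are generators, so each record can be handed to a validator as soon as its line arrives rather than after the response completes.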

Why not Pydantic / Pandera / Great Expectations?

Library	Loads whole file?	Streams?	Multi-format?	Async?
Pydantic v2	yes (caller decides)	no	no	no
Pandera	yes (DataFrame)	no	DataFrame only	no
Great Expectations	yes (DataFrame)	no	DataFrame only	no
Cerberus	per-record only	no	no	no
streamval	no	yes	CSV / JSONL / Parquet / Arrow / HTTP NDJSON / SSE	yes

How it works

Each format has a tiny async-generator adapter that yields one row dict at a time without loading the whole file.
A BatchBuffer chunks the row stream into fixed-size lists so peak memory stays bounded by batch_size.
Each batch is run through a CompiledValidationPlan (a per-model, cached wrapper around model.model_validate).
A pluggable error strategy (fail_fast, collect, skip) decides whether each row is emitted, dropped, or terminates the run.
A StatsAccumulator records per-field error counts, throughput, and peak memory via tracemalloc.
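
The steps above can be sketched with the standard library. This is an illustration under stated assumptions, not streamval's code: `read_rows`, `batched`, `validate`, and `run` are hypothetical names, and `validate` is a hand-written stand-in for `model.model_validate` so that the example runs without Pydantic.

```python
import csv
import io
import tracemalloc
from itertools import islice
from typing import Iterable, Iterator


def read_rows(text_stream) -> Iterator[dict]:
    """Yield one row dict at a time; csv.DictReader reads lazily."""
    yield from csv.DictReader(text_stream)


def batched(rows: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Chunk a row stream into lists of at most batch_size rows."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch


def validate(row: dict) -> list[str]:
    """Hypothetical stand-in for model_validate: return error strings."""
    errors = []
    try:
        int(row["id"])
    except ValueError:
        errors.append("id: not an integer")
    if not row["name"]:
        errors.append("name: empty")
    return errors


def run(text_stream, batch_size: int = 1000) -> dict:
    """Stream, batch, validate, and collect simple stats."""
    stats = {"rows_total": 0, "rows_valid": 0, "field_errors": {}}
    tracemalloc.start()
    for batch in batched(read_rows(text_stream), batch_size):
        for row in batch:
            stats["rows_total"] += 1
            errors = validate(row)
            if not errors:
                stats["rows_valid"] += 1
            for err in errors:
                field = err.split(":", 1)[0]
                counts = stats["field_errors"]
                counts[field] = counts.get(field, 0) + 1
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    stats["peak_memory_mb"] = peak / (1024 * 1024)
    return stats


sample = io.StringIO("id,name\n1,ada\nx,bob\n3,\n")
stats = run(sample, batch_size=2)
print(stats)
```

Only one batch is alive at a time, which is why peak memory scales with batch_size rather than with the size of the input.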

Error strategies

fail_fast — raise StreamValidationError on the first invalid row.
collect — emit every row; if max_errors is exceeded, raise on finalize.
skip — drop invalid rows silently (logged at WARNING level).
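
As a rough sketch of how the three strategies differ, the generator below applies each one to a stream of already-validated results. Everything here is hypothetical and for illustration only: `RowResult`, `StreamError`, and `apply_strategy` are not streamval names (the library raises StreamValidationError), and the exact semantics of max_errors are an assumption based on the description above.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class RowResult(NamedTuple):
    row_index: int
    valid: bool
    errors: list


class StreamError(Exception):
    """Hypothetical exception standing in for StreamValidationError."""


def apply_strategy(results: Iterable[RowResult], on_error: str,
                   max_errors: Optional[int] = None) -> Iterator[RowResult]:
    error_count = 0
    for result in results:
        if result.valid:
            yield result
            continue
        error_count += 1
        if on_error == "fail_fast":
            raise StreamError(f"row {result.row_index}: {result.errors}")
        if on_error == "collect":
            yield result  # invalid rows are emitted too
        elif on_error != "skip":
            raise ValueError(f"unknown strategy: {on_error}")
        # "skip": the row is dropped (a real implementation would log it)
    # Finalize step: "collect" raises only after the stream is exhausted.
    if (on_error == "collect" and max_errors is not None
            and error_count > max_errors):
        raise StreamError(f"{error_count} errors exceeded max_errors={max_errors}")


rows = [
    RowResult(0, True, []),
    RowResult(1, False, ["id: not an integer"]),
    RowResult(2, True, []),
]
skipped = [r.row_index for r in apply_strategy(rows, "skip")]
collected = [r.row_index for r in apply_strategy(rows, "collect")]
print(skipped, collected)
```

With collect, the caller sees every row and decides what to do with failures; with skip, only valid rows reach the caller; with fail_fast, iteration stops at the first invalid row.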

Contributing

git clone https://github.com/AmeerTechsoft/streamval
cd streamval
pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.

Project details

Release history

Version	Released
0.2.2	May 21, 2026
0.2.1	May 21, 2026
0.2.0 (this version)	May 21, 2026

Download files

Download the file for your platform.

Source Distribution

streamval-0.2.0.tar.gz (56.6 kB)

Uploaded May 21, 2026 Source

Built Distribution

streamval-0.2.0-py3-none-any.whl (39.0 kB)

Uploaded May 21, 2026 Python 3

File details

Details for the file streamval-0.2.0.tar.gz.

File metadata

Download URL: streamval-0.2.0.tar.gz
Upload date: May 21, 2026
Size: 56.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for streamval-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`e01be479276c539401fec32f1d7739ff8a3715a6eed2256e22e1c35719408607`
MD5	`a12580aa6321dada81be66caf1de321a`
BLAKE2b-256	`ed7c847b484414a2f309dce63ebdb723ce0ffda20d4800932b617201df76380f`

Provenance

The following attestation bundles were made for streamval-0.2.0.tar.gz:

Publisher: release.yml on AmeerTechsoft/streamval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: streamval-0.2.0.tar.gz
- Subject digest: e01be479276c539401fec32f1d7739ff8a3715a6eed2256e22e1c35719408607
- Sigstore transparency entry: 1591848717
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: AmeerTechsoft/streamval@8b1eabf12aafbb51a57d375b47b42f72da33e9ef
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/AmeerTechsoft
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8b1eabf12aafbb51a57d375b47b42f72da33e9ef
- Trigger Event: push

File details

Details for the file streamval-0.2.0-py3-none-any.whl.

File metadata

Download URL: streamval-0.2.0-py3-none-any.whl
Upload date: May 21, 2026
Size: 39.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for streamval-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cf22933f4a6a8787b55ce5d464057d9f42b6e2f1947b143351bdafe43c51dbed`
MD5	`6d5d46e96071eb1ad04c22d92aef2414`
BLAKE2b-256	`13e07767f416df7dead69f4b41adf0a6f5456ee2b61c767e4c3244f4be018ea8`

Provenance

The following attestation bundles were made for streamval-0.2.0-py3-none-any.whl:

Publisher: release.yml on AmeerTechsoft/streamval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: streamval-0.2.0-py3-none-any.whl
- Subject digest: cf22933f4a6a8787b55ce5d464057d9f42b6e2f1947b143351bdafe43c51dbed
- Sigstore transparency entry: 1591848736
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: AmeerTechsoft/streamval@8b1eabf12aafbb51a57d375b47b42f72da33e9ef
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/AmeerTechsoft
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8b1eabf12aafbb51a57d375b47b42f72da33e9ef
- Trigger Event: push
