Pydantic-compatible streaming data validator for CSV, JSONL, Parquet, and Arrow

streamval

Streaming, Pydantic-backed validation for CSV, JSONL, Parquet, Arrow, and HTTP NDJSON / SSE.

Existing data-validation libraries (Pydantic, Pandera, Great Expectations, Cerberus) either assume the dataset fits in memory or leave streaming to the caller. streamval keeps the file (or HTTP response) on disk / on the wire and validates it row by row through a Pydantic schema, so you can validate a multi-gigabyte file with a few tens of megabytes of RAM and start consuming valid rows immediately. The same streaming model handles LLM token streams, log services, and any REST endpoint that emits NDJSON or Server-Sent Events.

Install

pip install streamval
# faster JSON + lazy CSV via polars/orjson:
pip install "streamval[fast]"
# HTTP NDJSON / LLM streaming via httpx:
pip install "streamval[http]"
# everything:
pip install "streamval[fast,http]"

Quickstart

from pydantic import BaseModel
from streamval import stream_csv

class User(BaseModel):
    id: int
    name: str
    score: float
    active: bool

for result in stream_csv("users.csv", User, on_error="collect"):
    if result.valid:
        user = result.data
        # ... do something with the parsed model ...
    else:
        for err in result.errors:
            print(f"row {result.row_index}: {err}")

The generator finishes when the file ends. Stats are available on the underlying validator:

from streamval import StreamValidator
v = StreamValidator(User, on_error="skip", batch_size=2000)
for r in v.stream_csv("users.csv"):
    handle(r.data)  # handle() is a placeholder for your own row-processing function
print(v.stats)  # rows_total, rows_valid, throughput_rps, peak_memory_mb, ...

Performance

streamval optimises for bounded memory with strong throughput as a secondary goal. The v0.2 Arrow batch fast path validates an entire pyarrow.RecordBatch per Python ↔ Rust boundary crossing instead of one row dict at a time:

| Mode | Approx. rps (CI target) | Peak memory |
| --- | --- | --- |
| streamval CSV, batch (Arrow path) | 35 000+ (polars installed) | < 5 MB |
| streamval Parquet, batch (Arrow path) | 45 000+ | < 5 MB |
| streamval CSV, row mode (polars) | ~14 000 | < 5 MB |
| streamval CSV, row mode (aiofiles fallback) | ~11 000 | < 5 MB |
| Naive Pydantic loop | ~120 000 | ~1 GB (reads whole file) |

The naive loop is faster on small files but loads the entire dataset into RAM. streamval is the right choice when files don't fit in memory or you want to start consuming valid rows immediately.

These numbers were measured on a developer Windows laptop running Python 3.13. Real CI hardware (Linux x86, faster I/O) typically shows 2-3× higher throughput. To measure on your own machine, run STREAMVAL_BENCH=1 pytest tests/benchmarks/.

Performance tuning

  • Install streamval[fast] to unlock the polars Arrow path for CSV. Parquet gets the Arrow fast path with no extra dependency.

  • use_arrow=True is the default for CSV and Parquet on the StreamValidator constructor. Pass use_arrow=False to fall back to the row-mode pipeline (useful for adapters or strategies that need per-row Python dicts).

  • batch_size is the main throughput / memory dial — larger batches mean fewer Python ↔ Rust crossings but slightly higher peak memory. The defaults give comfortable bounded-memory behaviour:

    batch_size=100   → ~0.05 MB peak
    batch_size=1000  → ~0.4 MB peak  (default)
    batch_size=5000  → ~1.8 MB peak
    batch_size=10000 → ~3.5 MB peak
    
  • workers > 1 enables a thread pool. Pydantic's Rust core is thread-safe; per-row ordering is preserved.
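
The ordering guarantee for workers > 1 can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not streamval's implementation: `batched`, `validate_batch`, and `validate_in_order` are hypothetical names, and a simple digit check stands in for Pydantic validation. Batches are submitted to a thread pool, at most `workers` futures are kept in flight, and results are always taken from the oldest future first, so rows come out in input order while memory stays bounded.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch


def validate_batch(batch):
    """Hypothetical stand-in for schema validation: (row, is_valid) pairs."""
    return [(row, row.get("id", "").isdigit()) for row in batch]


def validate_in_order(rows, batch_size=1000, workers=4):
    """Validate batches on a thread pool, yielding results in input order."""
    pending = deque()  # futures in submission order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(rows, batch_size):
            pending.append(pool.submit(validate_batch, batch))
            if len(pending) >= workers:
                # Wait on the oldest batch first so output order matches input.
                yield from pending.popleft().result()
        while pending:  # drain whatever is still in flight
            yield from pending.popleft().result()
```

The in-flight window is used instead of `Executor.map` because `map` submits every batch up front, which would read the whole input before the first result is consumed.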

Formats

| Format | Source | Requires |
| --- | --- | --- |
| CSV | file / path | none, or streamval[fast] for the polars path |
| JSONL | file / path | none, or streamval[fast] for orjson |
| Parquet | file / path | pyarrow (always-on dependency) |
| Arrow | file / path | pyarrow (always-on dependency) |
| NDJSON | HTTP URL | streamval[http] (httpx) |
| SSE/LLM | HTTP URL | streamval[http] (httpx) |
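
For readers unfamiliar with the two HTTP wire formats, the sketch below shows how each one breaks down into row dicts. It is an illustration only and does not use streamval or httpx: `iter_ndjson` and `iter_sse_data` are hypothetical helpers that take any iterable of text lines, and the SSE helper assumes every event carries a single `data:` line whose payload is JSON (multi-line data fields are not handled).

```python
import json


def iter_ndjson(lines):
    """Yield one parsed object per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_sse_data(lines):
    """Yield the JSON payload of each single-line `data:` field.

    Comment lines (starting with ':') and other fields such as `event:`
    are ignored in this simplified sketch.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            payload = line[len("data:"):]
            if payload.startswith(" "):
                payload = payload[1:]  # one optional space after the colon
            yield json.loads(payload)
```

Either generator produces plain dicts one at a time, which is the shape a row-by-row validator needs.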

Why not Pydantic / Pandera / Great Expectations?

| Library | Loads whole file? | Streams? | Multi-format? | Async? |
| --- | --- | --- | --- | --- |
| Pydantic v2 | yes (caller decides) | no | no | no |
| Pandera | yes (DataFrame) | no | DataFrame only | no |
| Great Expectations | yes (DataFrame) | no | DataFrame only | no |
| Cerberus | per-record only | no | no | no |
| streamval | no | yes | CSV / JSONL / Parquet / Arrow / HTTP NDJSON / SSE | yes |

How it works

  • Each format has a tiny async-generator adapter that yields one row dict at a time without loading the whole file.
  • A BatchBuffer chunks the row stream into fixed-size lists so peak memory stays bounded by batch_size.
  • Each batch is run through a CompiledValidationPlan (a per-model, cached wrapper around model.model_validate).
  • A pluggable error strategy (fail_fast, collect, skip) decides whether each row is emitted, dropped, or terminates the run.
  • A StatsAccumulator records per-field error counts, throughput, and peak memory via tracemalloc.
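
The pipeline described above can be approximated in a few lines of standard-library Python. This is a rough sketch of the idea, not streamval's code: `csv_rows`, `batches`, `check_row`, and `run` are hypothetical names, `check_row` stands in for a Pydantic model, and the stats dict only borrows the `rows_total` / `rows_valid` field names mentioned earlier.

```python
import asyncio
import csv
import io


async def csv_rows(fileobj):
    """Adapter: yield one row dict at a time from an open text file."""
    for row in csv.DictReader(fileobj):
        yield row
        await asyncio.sleep(0)  # let other tasks run between rows


async def batches(rows, batch_size):
    """Buffer: group an async row stream into lists of at most batch_size."""
    batch = []
    async for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def check_row(row):
    """Hypothetical validator; returns a list of error strings (empty = valid)."""
    errors = []
    if not (row.get("id") or "").isdigit():
        errors.append("id: not an integer")
    if not row.get("name"):
        errors.append("name: missing")
    return errors


async def run(fileobj, batch_size=2):
    """Drive adapter -> buffer -> validation and accumulate simple stats."""
    stats = {"rows_total": 0, "rows_valid": 0}
    async for batch in batches(csv_rows(fileobj), batch_size):
        for row in batch:
            stats["rows_total"] += 1
            if not check_row(row):
                stats["rows_valid"] += 1
    return stats


sample = io.StringIO("id,name\n1,Ada\nx,Bob\n3,\n4,Dee\n")
stats = asyncio.run(run(sample))
print(stats)  # {'rows_total': 4, 'rows_valid': 2}
```

Only one batch of rows is held at a time, which is why peak memory scales with batch_size rather than with file size.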

Error strategies

  • fail_fast — raise StreamValidationError on the first invalid row.
  • collect — emit every row; if max_errors is exceeded, raise on finalize.
  • skip — drop invalid rows silently (logged at WARNING level).
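
The three strategies can be sketched as a single generator. This is an illustrative approximation based only on the descriptions above: `apply_strategy` and `RowValidationError` are hypothetical names (the latter stands in for StreamValidationError), and `check` is any callable that returns a list of error strings for a row.

```python
import logging

logger = logging.getLogger("strategy_sketch")


class RowValidationError(Exception):
    """Hypothetical stand-in for streamval's StreamValidationError."""


def apply_strategy(rows, check, on_error="collect", max_errors=None):
    """Yield (row_index, row, errors) according to the chosen strategy."""
    if on_error not in {"fail_fast", "collect", "skip"}:
        raise ValueError(f"unknown strategy: {on_error!r}")
    error_count = 0
    for index, row in enumerate(rows):
        errors = check(row)
        if not errors:
            yield index, row, []
        elif on_error == "fail_fast":
            # Stop the whole run at the first invalid row.
            raise RowValidationError(f"row {index}: {errors}")
        elif on_error == "skip":
            # Drop the row but leave a trace in the log.
            error_count += 1
            logger.warning("dropping row %d: %s", index, errors)
        else:  # collect: emit the invalid row together with its errors
            error_count += 1
            yield index, row, errors
    # "Finalize" step: collect raises only after every row has been emitted.
    if on_error == "collect" and max_errors is not None and error_count > max_errors:
        raise RowValidationError(
            f"{error_count} invalid rows (max_errors={max_errors})"
        )
```

Raising at the end rather than mid-stream lets a caller in collect mode see every row before deciding that the file as a whole is unacceptable.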

Contributing

git clone https://github.com/AmeerTechsoft/streamval
cd streamval
pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.
