Pydantic-compatible streaming data validator for CSV, JSONL, Parquet, and Arrow

streamval

Streaming, Pydantic-backed validation for CSV, JSONL, Parquet, Arrow, and HTTP NDJSON / SSE.

Most validation workflows built on Pydantic, Pandera, Great Expectations, or Cerberus end up holding the whole dataset in memory: the DataFrame-based tools need the full frame, and the per-record validators leave file reading to the caller. streamval keeps the file (or HTTP response) on disk or on the wire and validates it row by row through a Pydantic schema, so you can validate a multi-gigabyte file with a few tens of megabytes of RAM and start consuming valid rows immediately. The same streaming model handles LLM token streams, log services, and any REST endpoint that emits NDJSON or Server-Sent Events.

Install

pip install streamval
# faster JSON + lazy CSV via polars/orjson:
pip install "streamval[fast]"
# HTTP NDJSON / LLM streaming via httpx:
pip install "streamval[http]"
# everything:
pip install "streamval[fast,http]"

Quickstart

from pydantic import BaseModel
from streamval import stream_csv

class User(BaseModel):
    id: int
    name: str
    score: float
    active: bool

for result in stream_csv("users.csv", User, on_error="collect"):
    if result.valid:
        user = result.data
        # ... do something with the parsed model ...
    else:
        for err in result.errors:
            print(f"row {result.row_index}: {err}")

The generator finishes when the file ends. Stats are available on the underlying validator:

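from streamval import StreamValidator

# on_error="skip" drops invalid rows, so every result here carries parsed data.
v = StreamValidator(User, on_error="skip", batch_size=2000)
for r in v.stream_csv("users.csv"):
    handle(r.data)  # handle() is a placeholder for your own processing function
print(v.stats)  # rows_total, rows_valid, throughput_rps, peak_memory_mb, ...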
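
To give a feel for what fields such as rows_total, throughput_rps, and peak_memory_mb represent, here is a minimal standard-library sketch. It is not streamval's StatsAccumulator; the `Stats` class, `run_with_stats`, and the `check` validator are hypothetical names invented for illustration.

```python
import time
import tracemalloc
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Stats:
    """Hypothetical stats container mirroring the field names shown above."""
    rows_total: int = 0
    rows_valid: int = 0
    field_errors: Counter = field(default_factory=Counter)
    elapsed_s: float = 0.0
    peak_memory_mb: float = 0.0

    @property
    def throughput_rps(self) -> float:
        # Rows per second; guard against a zero-length run.
        return self.rows_total / self.elapsed_s if self.elapsed_s > 0 else 0.0


def run_with_stats(rows, validate) -> Stats:
    """Validate rows one by one while tracking counts, time, and peak memory.

    `validate` is assumed to return a list of failing field names
    (an empty list means the row is valid).
    """
    stats = Stats()
    tracemalloc.start()
    started = time.perf_counter()
    try:
        for row in rows:
            stats.rows_total += 1
            bad_fields = validate(row)
            if bad_fields:
                stats.field_errors.update(bad_fields)
            else:
                stats.rows_valid += 1
        stats.elapsed_s = time.perf_counter() - started
        # get_traced_memory() returns (current, peak) in bytes.
        stats.peak_memory_mb = tracemalloc.get_traced_memory()[1] / (1024 * 1024)
    finally:
        tracemalloc.stop()
    return stats


def check(row: dict) -> list:
    """Toy validator: the id field must be all digits."""
    return [] if str(row.get("id", "")).isdigit() else ["id"]


stats = run_with_stats([{"id": "1"}, {"id": "x"}, {"id": "2"}], check)
print(stats.rows_total, stats.rows_valid, dict(stats.field_errors))
```

Note that tracemalloc only sees allocations made through Python's allocator, so memory used inside native extensions may not be fully reflected in a figure obtained this way.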

Performance

streamval optimises for bounded memory with strong throughput as a secondary goal. The v0.2 Arrow batch fast path validates an entire pyarrow.RecordBatch per Python ↔ Rust boundary crossing instead of one row dict at a time:

| Mode | Approx rps (CI target) | Peak memory |
|---|---|---|
| streamval CSV, batch (Arrow path) | 35 000+ (polars installed) | < 5 MB |
| streamval Parquet, batch (Arrow path) | 45 000+ | < 5 MB |
| streamval CSV, row mode (polars) | ~14 000 | < 5 MB |
| streamval CSV, row mode (aiofiles fallback) | ~11 000 | < 5 MB |
| Naive Pydantic loop | ~120 000 | ~1 GB (reads whole file) |

The naive loop is faster on small files but loads the entire dataset into RAM. streamval is the right choice when files don't fit in memory or you want to start consuming valid rows immediately.
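
The memory trade-off can be reproduced with the standard library alone. The sketch below is an illustration, not streamval code: `parse_row` is a hypothetical stand-in for a Pydantic schema, and the CSV is generated in memory so the example needs no file on disk. It compares the tracemalloc peak of a loop that materialises every row first against one that validates rows as they are read.

```python
import csv
import io
import tracemalloc


def make_csv(n_rows: int) -> str:
    """Build CSV text with a header and n_rows data rows."""
    lines = ["id,name,score"]
    lines += [f"{i},user{i},{i * 0.5}" for i in range(n_rows)]
    return "\n".join(lines) + "\n"


def parse_row(row: dict) -> tuple:
    """Stand-in for a schema: coerce types, raising ValueError on bad input."""
    return int(row["id"]), row["name"], float(row["score"])


def naive_count(text: str) -> int:
    """Load every row into a list, then validate: memory grows with input size."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return sum(1 for row in rows if parse_row(row))


def streaming_count(text: str) -> int:
    """Validate each row as it is read: rows are not accumulated."""
    return sum(1 for row in csv.DictReader(io.StringIO(text)) if parse_row(row))


def peak_bytes(fn, text: str) -> int:
    """Run fn(text) under tracemalloc and return the peak traced allocation."""
    tracemalloc.start()
    try:
        fn(text)
        return tracemalloc.get_traced_memory()[1]
    finally:
        tracemalloc.stop()


text = make_csv(20_000)
print("naive peak bytes:    ", peak_bytes(naive_count, text))
print("streaming peak bytes:", peak_bytes(streaming_count, text))
```

Absolute figures depend on the interpreter and platform; the point is only that the naive peak scales with the number of rows while the streaming peak does not.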

These numbers were measured on a developer laptop running Windows and Python 3.13. CI hardware (Linux x86, faster I/O) typically shows 2-3× higher throughput. To measure on your own machine, run STREAMVAL_BENCH=1 pytest tests/benchmarks/.

Performance tuning

  • Install streamval[fast] to unlock the polars Arrow path for CSV. Parquet gets the Arrow fast path with no extra dependency.

  • use_arrow=True is the default for CSV and Parquet on the StreamValidator constructor. Pass use_arrow=False to fall back to the row-mode pipeline (useful for adapters or strategies that need per-row Python dicts).

  • batch_size is the main throughput / memory dial — larger batches mean fewer Python ↔ Rust crossings but slightly higher peak memory. The defaults give comfortable bounded-memory behaviour:

    batch_size=100   → ~0.05 MB peak
    batch_size=1000  → ~0.4 MB peak  (default)
    batch_size=5000  → ~1.8 MB peak
    batch_size=10000 → ~3.5 MB peak
    
  • workers > 1 enables a thread pool. Pydantic's Rust core is thread-safe; per-row ordering is preserved.
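
One common way to combine a thread pool with preserved ordering is `concurrent.futures.Executor.map`, which yields results in input order regardless of which worker finishes first. The sketch below illustrates that idea only; it is not taken from streamval, and `validate_row` and `validate_batch` are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor


def validate_row(row: dict) -> dict:
    """Hypothetical per-row check: coerce id to int and flag failures."""
    try:
        return {"valid": True, "data": {"id": int(row["id"])}}
    except (KeyError, ValueError) as exc:
        return {"valid": False, "error": str(exc)}


def validate_batch(rows: list, workers: int = 4) -> list:
    """Validate rows concurrently; Executor.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(validate_row, rows))


rows = [{"id": "1"}, {"id": "oops"}, {"id": "3"}]
results = validate_batch(rows)
print([r["valid"] for r in results])  # [True, False, True]
```

Whether threads actually raise throughput depends on how much of the validation work releases the GIL, which this sketch does not attempt to measure.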

Formats

| Format | Source | Requires |
|---|---|---|
| CSV | file / path | none, or streamval[fast] for polars path |
| JSONL | file / path | none, or streamval[fast] for orjson |
| Parquet | file / path | pyarrow (always-on dependency) |
| Arrow | file / path | pyarrow (always-on dependency) |
| NDJSON | HTTP URL | streamval[http] (httpx) |
| SSE/LLM | HTTP URL | streamval[http] (httpx) |
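
For readers unfamiliar with the two HTTP formats: NDJSON is one JSON document per line, while Server-Sent Events group `data:` lines into events separated by blank lines. The following standard-library sketch shows how each can be split into records before validation. It is a simplified illustration, not streamval's adapter code; the function names are hypothetical, and the SSE parser handles only `data:` fields and comment lines.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event.

    A blank line ends an event; several data: lines in one event are joined
    with newlines. Comments (lines starting with ':') and other fields are
    ignored. An event not terminated by a blank line is discarded.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)


ndjson_lines = ['{"id": 1}\n', "\n", '{"id": 2}\n']
records = list(iter_ndjson(ndjson_lines))

sse_lines = [
    'data: {"token": "Hel"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"token": "lo"}\n',
    "\n",
]
tokens = [json.loads(payload)["token"] for payload in iter_sse_data(sse_lines)]
print(records, tokens)
```

In a real client the `lines` iterable would come from a streaming HTTP response rather than a list, so records can be validated as they arrive.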

Why not Pydantic / Pandera / Great Expectations?

| Library | Loads whole file? | Streams? | Multi-format? | Async? |
|---|---|---|---|---|
| Pydantic v2 | yes (caller decides) | no | no | no |
| Pandera | yes (DataFrame) | no | DataFrame only | no |
| Great Expectations | yes (DataFrame) | no | DataFrame only | no |
| Cerberus | per-record only | no | no | no |
| streamval | no | yes | CSV / JSONL / Parquet / Arrow / HTTP NDJSON / SSE | yes |

How it works

  • Each format has a tiny async-generator adapter that yields one row dict at a time without loading the whole file.
  • A BatchBuffer chunks the row stream into fixed-size lists so peak memory stays bounded by batch_size.
  • Each batch is run through a CompiledValidationPlan (a per-model, cached wrapper around model.model_validate).
  • A pluggable error strategy (fail_fast, collect, skip) decides whether each row is emitted, dropped, or terminates the run.
  • A StatsAccumulator records per-field error counts, throughput, and peak memory via tracemalloc.
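
The first three stages above can be sketched with async generators from the standard library. This is an illustrative approximation under stated assumptions, not the library's internals: `csv_rows`, `batched`, `validate`, and `run` are hypothetical names, and `validate` stands in for a call to model.model_validate.

```python
import asyncio
import csv
import io
from typing import AsyncIterator


async def csv_rows(text: str) -> AsyncIterator[dict]:
    """Adapter: yield one row dict at a time from CSV text."""
    for row in csv.DictReader(io.StringIO(text)):
        yield row
        await asyncio.sleep(0)  # let other tasks run between rows


async def batched(rows: AsyncIterator[dict], batch_size: int) -> AsyncIterator[list]:
    """Batch buffer: group rows into lists of at most batch_size items."""
    batch = []
    async for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch


def validate(row: dict) -> tuple:
    """Hypothetical validator: returns (ok, parsed data or error message)."""
    try:
        return True, {"id": int(row["id"]), "score": float(row["score"])}
    except (KeyError, TypeError, ValueError) as exc:
        return False, str(exc)


async def run(text: str, batch_size: int = 2) -> tuple:
    """Drive the pipeline and collect valid rows, errors, and batch sizes."""
    valid, errors, batch_sizes = [], [], []
    async for batch in batched(csv_rows(text), batch_size):
        batch_sizes.append(len(batch))
        for row in batch:
            ok, payload = validate(row)
            (valid if ok else errors).append(payload)
    return valid, errors, batch_sizes


CSV_TEXT = "id,score\n1,0.5\n2,bad\n3,1.5\n"
valid, errors, batch_sizes = asyncio.run(run(CSV_TEXT))
print(valid, len(errors), batch_sizes)
```

Because only one batch is alive at a time, the number of rows held in memory is capped by batch_size rather than by the size of the input.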

Error strategies

  • fail_fast: raise StreamValidationError on the first invalid row.
  • collect: emit every row, valid or not; if the error count exceeds max_errors, raise when the stream is finalized.
  • skip: drop invalid rows without raising (each is logged at WARNING level).
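
The three behaviours can be illustrated with a small generator. This is a sketch of the semantics described above, not streamval's implementation: `RowError` is a hypothetical stand-in for StreamValidationError, and `apply_strategy` and its tuple output are invented for the example.

```python
import logging

logger = logging.getLogger("strategy_sketch")


class RowError(Exception):
    """Hypothetical exception standing in for StreamValidationError."""


def apply_strategy(rows, validate, on_error="collect", max_errors=None):
    """Yield (row_index, ok, payload) tuples according to the error strategy."""
    if on_error not in ("fail_fast", "collect", "skip"):
        raise ValueError(f"unknown strategy: {on_error!r}")
    error_count = 0
    for index, row in enumerate(rows):
        try:
            data = validate(row)
        except ValueError as exc:
            error_count += 1
            if on_error == "fail_fast":
                # Stop the whole run at the first invalid row.
                raise RowError(f"row {index}: {exc}") from exc
            if on_error == "collect":
                # Emit the invalid row so the caller can inspect it.
                yield index, False, str(exc)
            else:
                # skip: drop the row, leaving only a log record.
                logger.warning("row %d skipped: %s", index, exc)
            continue
        yield index, True, data
    # collect: enforce the error budget once the stream is exhausted.
    if on_error == "collect" and max_errors is not None and error_count > max_errors:
        raise RowError(f"{error_count} errors exceeded max_errors={max_errors}")


rows = ["1", "x", "3"]
collected = list(apply_strategy(rows, int, on_error="collect"))
skipped = list(apply_strategy(rows, int, on_error="skip"))
print([ok for _, ok, _ in collected])  # [True, False, True]
print(skipped)  # [(0, True, 1), (2, True, 3)]
```

Deferring the max_errors check to the end of the stream means every row is still emitted before the run is reported as failed, which matches the "raise on finalize" wording above.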

Contributing

git clone https://github.com/AmeerTechsoft/streamval
cd streamval
pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.
