Declarative data validation for pandas DataFrames — 33 built-in rules, powered by Rust and Polars.

invr

Declarative data validation engine for Rust.

Define invariants (validation rules) and evaluate them against a dataset using a typed execution engine.

Features

  • 33 built-in invariant types (nullability, uniqueness, numeric, string, date, relational, statistical, …)
  • Lazy Polars execution backend
  • Load specs from YAML
  • Engine-agnostic core — bring your own backend
  • Fully typed: no stringly-typed rule names

Installation

[dependencies]
invr = { version = "0.2", features = ["polars"] }

To also load specs from YAML:

invr = { version = "0.2", features = ["polars", "yaml"] }

Quick start

Programmatic spec

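use invr::prelude::*;
use polars::prelude::*;

// Assumes invr's error types implement std::error::Error so `?` can box them.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Sample data to validate.
    let df = df![
        "age" => [25, 30, 45],
        "email" => ["a@b.com", "c@d.com", "e@f.com"],
    ]?;

    // Two invariants: a column-level null check and a dataset-level row-count floor.
    let spec = Spec::from_invariants(vec![
        Invariant::new(
            InvariantId::new("age_not_null")?,
            PolarsKind::NotNull,
            Scope::column("age"),
        ),
        Invariant::new(
            InvariantId::new("row_count_min")?,
            PolarsKind::RowCountMin,
            Scope::Dataset,
        )
        .with_param_value("min", "1"),
    ]);

    // Evaluate the spec with the Polars DataFrame engine.
    let runner = RunSpec::new(EnginePolarsDataFrame);
    let report = runner.run(&df, &spec)?;

    // Print every Error-severity violation.
    if report.failed() {
        for v in report.errors() {
            eprintln!("violation: {}", v.reason());
        }
    }

    Ok(())
}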

YAML spec

# spec.yaml
invariants:
  - id: age_not_null
    kind: not_null
    scope:
      type: column
      name: age

  - id: email_unique
    kind: unique
    scope:
      type: column
      name: email
    severity: error

  - id: row_count_check
    kind: row_count_min
    scope:
      type: dataset
    params:
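      min: "10"

Load the spec from the file and run it against a DataFrame:

use invr::prelude::*;

// Requires the `yaml` feature.
// `df` is a Polars DataFrame, such as the one built in the programmatic example.
let yaml = std::fs::read_to_string("spec.yaml")?;
let spec = spec_from_str(&yaml)?;

let runner = RunSpec::new(EnginePolarsDataFrame);
let report = runner.run(&df, &spec)?;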

Invariant types

Category     Kinds
Nullability  not_null, null_ratio_max
Uniqueness   unique, composite_unique, duplicate_ratio_max
Row count    row_count_min, row_count_max, row_count_between
Structure    column_exists, column_missing, dtype_is, schema_equals
Numeric      value_min, value_max, value_between, mean_between, stddev_max, sum_between
Date / Time  date_between, no_future_dates, monotonic_increasing, no_gaps_in_sequence
String       regex_match, string_length_min, string_length_max, string_length_between
Domain       allowed_values, forbidden_values
Statistical  outlier_ratio_max, percentile_between
Relational   foreign_key, column_equals, conditional_not_null
Custom       custom_expr
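
To make two of the kinds above concrete, here is a minimal standard-library sketch of the quantities that null_ratio_max and unique appear to bound. It is not invr's implementation: the function names are hypothetical, and the treatment of empty columns and of nulls in the uniqueness check are assumptions.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Fraction of entries that are `None`, i.e. the value a
/// `null_ratio_max` rule would compare against `max_ratio`.
/// Assumption: an empty column has a null ratio of 0.0.
fn null_ratio<T>(column: &[Option<T>]) -> f64 {
    if column.is_empty() {
        return 0.0;
    }
    let nulls = column.iter().filter(|v| v.is_none()).count();
    nulls as f64 / column.len() as f64
}

/// True when no non-null value occurs more than once.
/// Assumption: nulls are skipped rather than compared with each other.
fn is_unique<T: Eq + Hash>(column: &[Option<T>]) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false on the first repeated value, which stops `all`.
    column.iter().flatten().all(|v| seen.insert(v))
}
```

With these helpers, a column such as `[Some(1), None, Some(3), None]` has a null ratio of 0.5, so it would pass a max_ratio of "0.5" but not "0.25".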

Parameters reference

All param values are strings. Numeric values are passed as string literals (e.g. "42", "0.05").
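
Because every parameter travels as a string, it has to be parsed back into a typed value before it can be compared. The sketch below shows that step using only the standard library. `parse_ratio`, `parse_count`, and `split_list` are hypothetical helpers written for illustration, not part of invr's API, and the whitespace trimming is an assumption.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: parse a ratio parameter such as "0.05" and
/// require it to lie in the closed interval 0.0..=1.0.
fn parse_ratio(raw: &str) -> Result<f64, String> {
    let value: f64 = raw
        .trim()
        .parse()
        .map_err(|e| format!("invalid ratio {raw:?}: {e}"))?;
    // NaN and infinities fail this range check as well.
    if (0.0..=1.0).contains(&value) {
        Ok(value)
    } else {
        Err(format!("ratio {value} is outside 0.0..=1.0"))
    }
}

/// Hypothetical helper: parse a row-count parameter such as "42".
fn parse_count(raw: &str) -> Result<u64, ParseIntError> {
    raw.trim().parse::<u64>()
}

/// Hypothetical helper: split a comma-separated parameter such as "A,B,C",
/// trimming whitespace and dropping empty items.
fn split_list(raw: &str) -> Vec<&str> {
    raw.split(',')
        .map(str::trim)
        .filter(|item| !item.is_empty())
        .collect()
}
```

Parsing once, up front, lets a malformed value such as "1.5" for a ratio surface as a configuration error instead of a confusing validation result.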

Kind                   Scope    Required params
not_null               Column   (none)
null_ratio_max         Column   max_ratio (float, 0.0–1.0)
unique                 Column   (none)
composite_unique       Dataset  columns (comma-separated column list, e.g. "a,b")
duplicate_ratio_max    Column   max_ratio (float, 0.0–1.0)
row_count_min          Dataset  min
row_count_max          Dataset  max
row_count_between      Dataset  min, max
column_exists          Column   (none)
column_missing         Column   (none)
dtype_is               Column   dtype (e.g. "Int64", "Utf8", "Float64")
schema_equals          Dataset  schema (comma-separated col:dtype pairs, e.g. "a:Int64,b:Utf8")
value_min              Column   min
value_max              Column   max
value_between          Column   min, max
mean_between           Column   min, max
stddev_max             Column   max
sum_between            Column   min, max
date_between           Column   start, end (ISO 8601, e.g. "2024-01-01")
no_future_dates        Column   (none)
monotonic_increasing   Column   (none)
no_gaps_in_sequence    Column   (none)
regex_match            Column   pattern (regex string)
string_length_min      Column   min
string_length_max      Column   max
string_length_between  Column   min, max
allowed_values         Column   values (comma-separated list, e.g. "A,B,C")
forbidden_values       Column   values (comma-separated list)
outlier_ratio_max      Column   z (Z-score threshold), max_ratio (float, 0.0–1.0)
percentile_between     Column   p (percentile, 0.0–1.0), min, max
foreign_key            Column   allowed_values (comma-separated valid FK values)
column_equals          Column   other_column (column name to compare against)
conditional_not_null   Column   condition_column, condition_value
custom_expr            Column   column (column name)
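
The two parameters of outlier_ratio_max interact, so a worked sketch may help. The code below is one plausible reading, not invr's implementation: it counts values whose absolute Z-score exceeds z and compares that fraction with max_ratio. The use of the population standard deviation, the handling of empty or constant columns, and both function names are assumptions.

```rust
/// Fraction of values whose absolute Z-score exceeds `z`.
/// Assumptions: population standard deviation (divide by n), and a result
/// of 0.0 for an empty slice or a constant column (stddev = 0).
fn outlier_ratio(values: &[f64], z: f64) -> f64 {
    if values.is_empty() {
        return 0.0;
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let stddev = variance.sqrt();
    if stddev == 0.0 {
        return 0.0;
    }
    let outliers = values
        .iter()
        .filter(|v| ((**v - mean) / stddev).abs() > z)
        .count();
    outliers as f64 / n
}

/// True when the outlier fraction stays at or below `max_ratio`.
fn passes_outlier_ratio_max(values: &[f64], z: f64, max_ratio: f64) -> bool {
    outlier_ratio(values, z) <= max_ratio
}
```

Under this reading, nine values of 1.0 plus a single 100.0 give one outlier in ten at z = 2.0, so the check passes with a max_ratio of 0.2 and fails with 0.05.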

Report API

report.failed()          // true if any Error-severity violation exists
report.violations()      // all violations
report.errors()          // iterator over Error violations
report.warnings()        // iterator over Warn violations
report.error_count()     // number of Error violations
report.metrics()         // execution_time_ms, total_invariants, violations
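
The relationship between severity and failed() can be modeled in a few lines. The types below are illustrative stand-ins that mirror the documented behavior; they are not invr's actual definitions, and the field names are hypothetical.

```rust
/// Stand-in for the documented severity levels (not invr's own type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Info,
    Warn,
    Error,
}

/// Stand-in violation record; `id` is a hypothetical field.
struct Violation {
    id: &'static str,
    severity: Severity,
}

/// Stand-in report holding every violation that was raised.
struct Report {
    violations: Vec<Violation>,
}

impl Report {
    /// True only if at least one Error-severity violation exists.
    fn failed(&self) -> bool {
        self.violations.iter().any(|v| v.severity == Severity::Error)
    }

    /// Iterator over Error-severity violations.
    fn errors(&self) -> impl Iterator<Item = &Violation> + '_ {
        self.violations.iter().filter(|v| v.severity == Severity::Error)
    }

    /// Iterator over Warn-severity violations.
    fn warnings(&self) -> impl Iterator<Item = &Violation> + '_ {
        self.violations.iter().filter(|v| v.severity == Severity::Warn)
    }

    /// Number of Error-severity violations.
    fn error_count(&self) -> usize {
        self.errors().count()
    }
}
```

In this model a report containing only Warn or Info violations does not fail, which is why downgrading a rule's severity is a way to observe it without blocking a pipeline.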

Severity

Each invariant defaults to Error. Override with:

invariant.with_severity(Severity::Warn)

Or in YAML:

severity: warn   # info | warn | error

Feature flags

Feature  Description
polars   Enables the Polars execution engine
yaml     Enables loading specs from YAML strings

License

Apache-2.0
