A powerful data quality validation framework inspired by Great Expectations

🚀 ValidateX

A powerful, extensible data quality validation framework for Python.

ValidateX provides a comprehensive suite of tools for validating, profiling, and monitoring data quality across Pandas and PySpark DataFrames. Inspired by Great Expectations, it offers a simpler, more focused approach with modern, production-ready HTML reports and an intuitive API.

๐Ÿ–ผ๏ธ Report Preview

ValidateX Report โ€” Overview

Column Health Summary

Column Health Summary with mini bar charts

Expectations Table

Severity-tagged Expectations with human-readable output


🤔 Why ValidateX?

| Feature | ValidateX | Great Expectations |
| --- | --- | --- |
| Setup | pip install → validate in 5 lines | Multi-step setup with contexts & stores |
| API | Fluent, chainable Python API | Heavy config system |
| Severity levels | ✔ (Critical, Warning, Info) | ❌ |
| Quality score | ✔ (Weighted 0–100) | ❌ |
| Auto-suggest expectations | ✔ | ✔ |
| Reports | Modern dark-theme HTML with mini charts | Basic data docs |
| Output data types | Clean native Python types | NumPy types leak into JSON |
| PySpark support | ✔ | ✔ |
| Polars support | Soon | ✔ |
| CI/CD friendly CLI | ✔ | ❌ |
| Downloads | JSON / CSV / clipboard built into report | Separate export |
| Learning curve | Minutes | Hours to days |

ValidateX is not a replacement for Great Expectations; it's a focused alternative for teams that want production-grade data validation without the overhead.


🎯 Who Is This For?

  • Startup data teams: ship data quality checks in minutes, not days
  • ML engineers: validate feature stores and training data before model runs
  • CI/CD pipelines: gate deployments on data quality with a single CLI command
  • Analytics teams: catch data issues before they reach dashboards
  • dbt users: lightweight validation alongside your transformation layer
  • Data platform teams: monitor data quality across dozens of tables

✨ Features

| Feature | Description |
| --- | --- |
| 25+ Built-in Expectations | Column-level, table-level, and aggregate validations |
| Dual Engine Support | Pandas and PySpark execution engines |
| 🎯 Data Quality Score | Weighted score (0–100) based on severity of checks |
| 🔴🟡🔵 Severity Levels | Critical / Warning / Info classification for every expectation |
| 📊 Column Health Summary | At-a-glance per-column health with mini bar charts |
| Modern HTML Reports | Self-contained dark-theme reports with animations |
| 📥 Download Buttons | Export reports as JSON, CSV, or copy summary to clipboard |
| 📈 Drift Detection | Track changes between validation runs |
| Data Profiling | Auto-analyse datasets and suggest expectations |
| YAML/JSON Config | Define expectations declaratively |
| CLI Interface | Run validations from the command line |
| Checkpoint System | Tie data sources and suites together |
| Extensible | Create custom expectations with the registry pattern |
| Clean Output | All values are native Python types, with no NumPy leakage |

📦 Installation

# Basic install (editable; run from the root of a repository checkout)
pip install -e .

# With PySpark support
pip install -e ".[spark]"

# With database support
pip install -e ".[database]"

# Full install
pip install -e ".[all]"

# Development
pip install -e ".[dev]"

๐Ÿ Quick Start

Python API

import pandas as pd
import validatex as vx

# Create your data
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 42],
    "email": ["alice@test.com", "bob@test.com", "charlie@test.com",
              "diana@test.com", "eve@test.com"],
    "status": ["active", "active", "inactive", "active", "pending"],
})

# Build an expectation suite
suite = (
    vx.ExpectationSuite("user_quality")
    .add("expect_column_to_not_be_null", column="user_id")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
    .add("expect_column_values_to_be_in_set",
         column="status", value_set=["active", "inactive", "pending"])
    .add("expect_column_values_to_match_regex",
         column="email", regex=r"^[\w.]+@[\w]+\.\w+$")
)

# Validate
result = vx.validate(df, suite)

# Print summary (includes Quality Score)
print(result.summary())

# Generate reports
result.to_html("report.html")
result.to_json_file("report.json")
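The email pattern used in the suite above is deliberately simple. As a quick check that is independent of ValidateX, the standard-library sketch below runs the same pattern over a few made-up addresses to show what it accepts and what it rejects:

```python
import re

# Same pattern as the expect_column_values_to_match_regex call above.
EMAIL_PATTERN = re.compile(r"^[\w.]+@[\w]+\.\w+$")

# Illustrative addresses only; they are not part of the original example.
samples = [
    "alice@test.com",        # plain address
    "first.last@test.com",   # dot in the local part
    "first-last@test.com",   # hyphen in the local part
    "bob@mail.example.com",  # subdomain
]

accepted = [s for s in samples if EMAIL_PATTERN.match(s)]
rejected = [s for s in samples if not EMAIL_PATTERN.match(s)]

print("accepted:", accepted)
print("rejected:", rejected)
```

Hyphenated local parts and addresses with subdomains are rejected, so the pattern is worth loosening if real data contains them.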

CLI

# Initialize a project
validatex init

# Profile a dataset
validatex profile --data data.csv --suggest --output auto_suite.yaml

# Run validation
validatex validate --data data.csv --suite suite.yaml --report report.html

# Run checkpoint
validatex run --checkpoint checkpoint.yaml

# List available expectations
validatex list-expectations

🤖 Automate with CI/CD

ValidateX is designed to be lightweight and CI-friendly. You can easily integrate it into your GitHub Actions, GitLab CI, or Jenkins pipelines to gate deployments on data quality.

Example: GitHub Actions

name: Data Quality Validation
on: [push, pull_request]

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          
      - name: Install ValidateX
        run: pip install validatex
        
      - name: Run Data Validation
        run: |
          validatex validate \
            --data data/production_data.csv \
            --suite tests/data_quality/suite.yaml \
            --report dq_report.html
            
      - name: Archive production artifacts
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: validatex-report
          path: dq_report.html

🎯 Data Quality Score

ValidateX computes a weighted quality score (0–100) based on the severity of each expectation:

| Severity | Weight | Example Expectations |
| --- | --- | --- |
| 🔴 Critical | ×3 | Null checks, uniqueness, column existence, row count |
| 🟡 Warning | ×2 | Range checks, set membership, regex, type checks |
| 🔵 Info | ×1 | Mean/stdev bounds, string lengths, distinct values |

Formula: Score = 100 × (weighted_passed / weighted_total)

A critical failure impacts the score 3× more than an info-level check. This gives decision-makers a single number to assess data health.

result = vx.validate(df, suite)
score = result.compute_quality_score()
print(f"Data Quality Score: {score}/100")
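To make the formula concrete, here is a minimal pure-Python sketch of the arithmetic. The `(severity, passed)` input shape, the `quality_score` helper, and the handling of an empty suite are assumptions made for illustration; this is not ValidateX's internal code.

```python
# Severity weights taken from the table above.
WEIGHTS = {"critical": 3, "warning": 2, "info": 1}


def quality_score(results):
    """Compute 100 * weighted_passed / weighted_total.

    `results` is a list of (severity, passed) pairs. This input shape is
    an assumption for the sketch, not ValidateX's result format.
    """
    weighted_total = sum(WEIGHTS[severity] for severity, _ in results)
    if weighted_total == 0:
        return 100.0  # assumption: an empty suite counts as fully healthy
    weighted_passed = sum(WEIGHTS[severity] for severity, ok in results if ok)
    return round(100 * weighted_passed / weighted_total, 2)


# Four checks with total weight 3 + 3 + 2 + 1 = 9.
critical_failure = [("critical", True), ("critical", False),
                    ("warning", True), ("info", True)]
info_failure = [("critical", True), ("critical", True),
                ("warning", True), ("info", False)]

print(quality_score(critical_failure))  # 66.67 (6 of 9 weighted points)
print(quality_score(info_failure))      # 88.89 (8 of 9 weighted points)
```

The failed critical check costs 33.33 points while the failed info check costs 11.11, which is the 3× ratio described above.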

Custom Severity

Override the default severity on any expectation via meta:

expectations:
  - expectation_type: expect_column_mean_to_be_between
    column: revenue
    kwargs:
      min_value: 1000
      max_value: 50000
    meta:
      severity: critical   # Override default "info" → "critical"
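One way to picture the override is as a lookup that prefers `meta.severity` and falls back to the expectation type's default. The sketch below assumes that behaviour; `DEFAULT_SEVERITY` and `effective_severity` are hypothetical names, and the defaults are copied from the expectation tables in this README.

```python
# Defaults for a few expectation types, copied from the tables in this README.
DEFAULT_SEVERITY = {
    "expect_column_to_not_be_null": "critical",
    "expect_column_values_to_be_between": "warning",
    "expect_column_mean_to_be_between": "info",
}
VALID_SEVERITIES = {"critical", "warning", "info"}


def effective_severity(entry):
    """Return meta.severity when present, else the default for the type."""
    override = (entry.get("meta") or {}).get("severity")
    if override is None:
        return DEFAULT_SEVERITY[entry["expectation_type"]]
    if override not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity: {override!r}")
    return override


# The YAML entry above, as the dict a loader would produce.
overridden = {
    "expectation_type": "expect_column_mean_to_be_between",
    "column": "revenue",
    "kwargs": {"min_value": 1000, "max_value": 50000},
    "meta": {"severity": "critical"},
}
# The same entry without the meta block.
plain = {k: v for k, v in overridden.items() if k != "meta"}

print(effective_severity(overridden))  # critical
print(effective_severity(plain))       # info
```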

📊 Column Health Summary

The HTML report includes a Column Health Summary that aggregates all expectations per column:

| Column | Checks | Passed | Failed | Health | Null % | Unique % |
| --- | --- | --- | --- | --- | --- | --- |
| user_id | 3 | 3 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| email | 4 | 4 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| status | 1 | 1 | 0 | 100% ███ | - | - |

Each metric includes a mini CSS bar chart for instant visual scanning.

for col in result.column_health():
    print(f"{col.column}: {col.health_score}% health, "
          f"{col.passed}/{col.checks} passed")
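The per-column aggregation can be sketched in plain Python. This version assumes health is the simple share of passed checks per column; whether ValidateX weights health by severity is not stated here, and `column_health` below is an illustrative helper rather than the library method of the same name.

```python
from collections import defaultdict


def column_health(results):
    """Group (column, passed) pairs and compute a pass percentage per column."""
    counts = defaultdict(lambda: {"checks": 0, "passed": 0})
    for column, passed in results:
        counts[column]["checks"] += 1
        counts[column]["passed"] += int(passed)
    return {
        column: {
            "checks": c["checks"],
            "passed": c["passed"],
            "failed": c["checks"] - c["passed"],
            "health": round(100 * c["passed"] / c["checks"], 1),
        }
        for column, c in counts.items()
    }


# Made-up outcomes: one email check fails, everything else passes.
outcomes = [
    ("user_id", True), ("user_id", True), ("user_id", True),
    ("email", True), ("email", False),
    ("status", True),
]

summary = column_health(outcomes)
for column, stats in summary.items():
    print(f"{column}: {stats['health']}% health, "
          f"{stats['passed']}/{stats['checks']} passed")
```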

📋 Available Expectations

Column-Level (16)

| Expectation | Severity | Description |
| --- | --- | --- |
| expect_column_to_exist | 🔴 Critical | Column exists in DataFrame |
| expect_column_to_not_be_null | 🔴 Critical | No null values |
| expect_column_values_to_be_unique | 🔴 Critical | All values unique |
| expect_column_values_to_be_between | 🟡 Warning | Values within range |
| expect_column_values_to_be_in_set | 🟡 Warning | Values in allowed set |
| expect_column_values_to_not_be_in_set | 🟡 Warning | Values not in forbidden set |
| expect_column_values_to_match_regex | 🟡 Warning | Values match regex pattern |
| expect_column_values_to_be_of_type | 🟡 Warning | Column dtype matches |
| expect_column_values_to_be_dateutil_parseable | 🟡 Warning | Values parseable as dates |
| expect_column_value_lengths_to_be_between | 🔵 Info | String lengths within range |
| expect_column_max_to_be_between | 🔵 Info | Column max within bounds |
| expect_column_min_to_be_between | 🔵 Info | Column min within bounds |
| expect_column_mean_to_be_between | 🔵 Info | Column mean within bounds |
| expect_column_stdev_to_be_between | 🔵 Info | Column std dev within bounds |
| expect_column_distinct_values_to_be_in_set | 🔵 Info | All distinct values in set |
| expect_column_proportion_of_unique_values_to_be_between | 🔵 Info | Uniqueness ratio in range |

Table-Level (5)

| Expectation | Severity | Description |
| --- | --- | --- |
| expect_table_row_count_to_equal | 🔴 Critical | Exact row count |
| expect_table_row_count_to_be_between | 🔴 Critical | Row count in range |
| expect_table_columns_to_match_ordered_list | 🔴 Critical | Column order matches |
| expect_table_columns_to_match_set | 🔴 Critical | Column names match (unordered) |
| expect_table_column_count_to_equal | 🔴 Critical | Exact column count |

Aggregate / Cross-Column (4)

| Expectation | Severity | Description |
| --- | --- | --- |
| expect_column_pair_values_a_to_be_greater_than_b | 🟡 Warning | Column A > Column B |
| expect_column_pair_values_to_be_equal | 🟡 Warning | Two columns equal |
| expect_multicolumn_sum_to_equal | 🟡 Warning | Row-wise sum equals target |
| expect_compound_columns_to_be_unique | 🔴 Critical | Compound key uniqueness |
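For readers new to compound keys, the sketch below illustrates what "compound key uniqueness" means: no two rows may share the same combination of values across the key columns. It uses plain dictionaries and a hypothetical `duplicate_compound_keys` helper, and it is not how ValidateX implements the check.

```python
from collections import Counter


def duplicate_compound_keys(rows, key_columns):
    """Return each key tuple that appears in more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Made-up order lines: order_id repeats, but (order_id, line_no) should not.
rows = [
    {"order_id": 1, "line_no": 1},
    {"order_id": 1, "line_no": 2},
    {"order_id": 2, "line_no": 1},
    {"order_id": 2, "line_no": 1},  # duplicate compound key
]

print(duplicate_compound_keys(rows, ["order_id", "line_no"]))  # [(2, 1)]
print(duplicate_compound_keys(rows, ["order_id"]))             # [(1,), (2,)]
```

A column that is not unique on its own can still form a unique key together with another column, which is why the check takes a list of columns.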

📊 Data Profiling

import pandas as pd
from validatex import DataProfiler

df = pd.read_csv("data.csv")
profiler = DataProfiler()

# Profile
profile = profiler.profile(df)
print(profile.summary())

# Auto-suggest expectations
suite = profiler.suggest_expectations(df, suite_name="auto_suite")
suite.save("auto_suite.yaml")

🔧 YAML Suite Configuration

suite_name: my_data_quality
meta:
  description: "Quality checks for production data"

expectations:
  - expectation_type: expect_column_to_not_be_null
    column: id
    meta:
      severity: critical

  - expectation_type: expect_column_values_to_be_between
    column: age
    kwargs:
      min_value: 0
      max_value: 150

  - expectation_type: expect_column_values_to_be_in_set
    column: status
    kwargs:
      value_set: ["active", "inactive"]
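The declarative entries line up with the fluent `.add(...)` calls from the Quick Start. The sketch below flattens the suite above into the equivalent call arguments; the mapping is inferred from the two examples in this README, `to_add_calls` is a hypothetical helper, and how the real loader handles `meta` is not covered.

```python
# The suite above, as the plain dict a YAML or JSON loader would return.
config = {
    "suite_name": "my_data_quality",
    "expectations": [
        {"expectation_type": "expect_column_to_not_be_null",
         "column": "id",
         "meta": {"severity": "critical"}},
        {"expectation_type": "expect_column_values_to_be_between",
         "column": "age",
         "kwargs": {"min_value": 0, "max_value": 150}},
        {"expectation_type": "expect_column_values_to_be_in_set",
         "column": "status",
         "kwargs": {"value_set": ["active", "inactive"]}},
    ],
}


def to_add_calls(config):
    """Flatten each entry into (expectation_type, keyword_arguments)."""
    calls = []
    for entry in config["expectations"]:
        kwargs = {}
        if "column" in entry:
            kwargs["column"] = entry["column"]
        kwargs.update(entry.get("kwargs") or {})
        calls.append((entry["expectation_type"], kwargs))
    return calls


calls = to_add_calls(config)
for name, kwargs in calls:
    args = ", ".join(f"{key}={value!r}" for key, value in kwargs.items())
    print(f'.add("{name}", {args})')
```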

๐Ÿ—๏ธ Architecture

validatex/
โ”œโ”€โ”€ core/
โ”‚   โ”œโ”€โ”€ expectation.py     # Base class + registry
โ”‚   โ”œโ”€โ”€ result.py          # ValidationResult, QualityScore, Severity, ColumnHealth
โ”‚   โ”œโ”€โ”€ suite.py           # ExpectationSuite (fluent API)
โ”‚   โ””โ”€โ”€ validator.py       # Validation orchestrator
โ”œโ”€โ”€ expectations/
โ”‚   โ”œโ”€โ”€ column_expectations.py     # 16 column-level checks
โ”‚   โ”œโ”€โ”€ table_expectations.py      # 5 table-level checks
โ”‚   โ””โ”€โ”€ aggregate_expectations.py  # 4 cross-column checks
โ”œโ”€โ”€ datasources/
โ”‚   โ”œโ”€โ”€ csv_source.py      # CSV files
โ”‚   โ”œโ”€โ”€ parquet_source.py  # Parquet files
โ”‚   โ”œโ”€โ”€ database_source.py # SQL databases (SQLAlchemy)
โ”‚   โ””โ”€โ”€ dataframe_source.py # Direct DataFrames
โ”œโ”€โ”€ profiler/
โ”‚   โ””โ”€โ”€ profiler.py        # Auto-profiling & suggestion engine
โ”œโ”€โ”€ reporting/
โ”‚   โ”œโ”€โ”€ html_report.py     # Production HTML reports
โ”‚   โ””โ”€โ”€ json_report.py     # JSON reports
โ”œโ”€โ”€ config/
โ”‚   โ””โ”€โ”€ loader.py          # YAML/JSON config loading
โ””โ”€โ”€ cli/
    โ””โ”€โ”€ main.py            # CLI (validate, run, profile, init, list-expectations)

🧪 Testing

# Run all tests (66 tests)
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=validatex --cov-report=html

# Unit tests only
pytest tests/unit/ -v

# Integration tests
pytest tests/integration/ -v

๐Ÿค Creating Custom Expectations

from dataclasses import dataclass, field
from validatex.core.expectation import Expectation, register_expectation
from validatex.core.result import ExpectationResult

@register_expectation
@dataclass
class ExpectColumnValuesToBePositive(Expectation):
    """Expect all values in a numeric column to be positive."""

    expectation_type: str = field(
        init=False, default="expect_column_values_to_be_positive"
    )

    def _validate_pandas(self, df) -> ExpectationResult:
        series = df[self.column].dropna()
        total = len(series)
        # Zero is not positive, so the mask flags every value <= 0.
        non_positive_mask = series <= 0
        unexpected_count = int(non_positive_mask.sum())
        pct = (unexpected_count / total * 100) if total > 0 else 0.0

        return self._build_result(
            success=(unexpected_count == 0),
            element_count=total,
            unexpected_count=unexpected_count,
            unexpected_percent=pct,
            unexpected_values=series[non_positive_mask].tolist()[:20],
        )

🧹 Clean Output

ValidateX converts all internal types to native Python before rendering. You'll never see np.int64(20) in reports or JSON, only a clean 20.

result = vx.validate(df, suite)
data = result.to_dict()

# Observed values are always clean:
# {'min': 20, 'max': 69}        โ† NOT {'min': np.int64(20), ...}
# "Unique: 100/100 (100.0%)"    โ† NOT "100 unique out of 100"
# "Distinct values: 3"          โ† NOT "{'unique_values': 3}"
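A sanitising step of this kind can be sketched with duck typing: NumPy scalars and arrays both expose `tolist()`, which returns built-in Python values. The `to_native` helper below is an illustration under that assumption, not ValidateX's code, and `FakeInt64` stands in for `np.int64` so the sketch runs without NumPy installed.

```python
import json


def to_native(value):
    """Recursively convert values exposing tolist() to built-in Python types."""
    if isinstance(value, dict):
        return {to_native(k): to_native(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_native(v) for v in value]
    tolist = getattr(value, "tolist", None)
    if callable(tolist):
        return tolist()
    return value


class FakeInt64:
    """Stand-in for a NumPy scalar; only the tolist() method is imitated."""

    def __init__(self, raw):
        self._raw = raw

    def tolist(self):
        return self._raw


observed = {"min": FakeInt64(20), "max": FakeInt64(69), "sample": (1, 2)}
clean = to_native(observed)

print(clean)              # {'min': 20, 'max': 69, 'sample': [1, 2]}
print(json.dumps(clean))  # serialises without a custom encoder
```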

🚀 Roadmap

  • 25+ built-in expectations (column, table, aggregate)
  • Pandas + PySpark dual-engine support
  • Severity modeling (Critical / Warning / Info)
  • Weighted data quality score (0–100)
  • Column health summary with mini charts
  • Modern HTML reports with dark theme
  • Download buttons (JSON, CSV, clipboard)
  • Drift detection foundation
  • Data profiler with auto-suggestion
  • CLI with validate, profile, run, init commands
  • YAML/JSON declarative configuration
  • Native Python type sanitization
  • Slack / Teams notifications on failure
  • GitHub Action template for CI/CD
  • Polars engine support
  • Baseline history tracking & trend charts
  • Anomaly detection expectations
  • Great Expectations suite import/migration
  • Web dashboard for multi-dataset monitoring
  • dbt integration plugin

Versioning

ValidateX follows Semantic Versioning.

  • MAJOR version for incompatible API changes
  • MINOR version for backwards-compatible new functionality
  • PATCH version for backwards-compatible bug fixes

📄 License

MIT License


Built with ❤️ by the ValidateX Team
If this project helps you, consider giving it a ⭐
