A powerful data quality validation framework inspired by Great Expectations
ValidateX
A powerful, extensible data quality validation framework for Python.
ValidateX provides a comprehensive suite of tools for validating, profiling, and monitoring data quality across Pandas and PySpark DataFrames. Inspired by Great Expectations, it offers a simpler, more focused approach with modern, production-ready HTML reports and an intuitive API.
Table of Contents
- Report Preview
- Why ValidateX?
- Who Is This For?
- Features
- Installation
- Quick Start
- Automate with CI/CD
- Data Quality Score
- Available Expectations
- Roadmap
Report Preview
- Column Health Summary with mini bar charts
- Severity-tagged Expectations with human-readable output
Why ValidateX?
| Feature | ValidateX | Great Expectations |
|---|---|---|
| Setup | pip install → validate in 5 lines | Multi-step setup with contexts & stores |
| API | Fluent, chainable Python API | Heavy config system |
| Severity levels | ✅ (Critical, Warning, Info) | ❌ |
| Quality score | ✅ (Weighted 0–100) | ❌ |
| Auto-suggest expectations | โ | โ |
| Reports | Modern dark-theme HTML with minicharts | Basic data docs |
| Output Data Types | Clean native Python types | NumPy types leak into JSON |
| PySpark Support | ✅ | ✅ |
| Polars Support | Soon | ❌ |
| CI/CD friendly CLI | โ | โ |
| Downloads | JSON / CSV / clipboard built into report | Separate export |
| Learning curve | Minutes | Hours to days |
ValidateX is not a replacement for Great Expectations; it is a focused alternative for teams that want production-grade data validation without the overhead.
Who Is This For?
- Startup data teams: Ship data quality checks in minutes, not days
- ML engineers: Validate feature stores and training data before model runs
- CI/CD pipelines: Gate deployments on data quality with a single CLI command
- Analytics teams: Catch data issues before they reach dashboards
- dbt users: Lightweight validation alongside your transformation layer
- Data platform teams: Monitor data quality across dozens of tables
Features
| Feature | Description |
|---|---|
| 25+ Built-in Expectations | Column-level, table-level, and aggregate validations |
| Dual Engine Support | Pandas and PySpark execution engines |
| Data Quality Score | Weighted score (0–100) based on severity of checks |
| 🔴🟡🔵 Severity Levels | Critical / Warning / Info classification for every expectation |
| Column Health Summary | At-a-glance per-column health with mini bar charts |
| Modern HTML Reports | Stunning, self-contained dark-theme reports with animations |
| Download Buttons | Export reports as JSON, CSV, or copy summary to clipboard |
| Drift Detection | Track changes between validation runs |
| Data Profiling | Auto-analyse datasets and suggest expectations |
| YAML/JSON Config | Define expectations declaratively |
| CLI Interface | Run validations from the command line |
| Checkpoint System | Tie data sources and suites together |
| Extensible | Create custom expectations with the registry pattern |
| Clean Output | All values are native Python types, with no NumPy types leaking into output |
Installation
# Install the released package from PyPI
pip install validatex
# Editable install from a source checkout
pip install -e .
# With PySpark support
pip install -e ".[spark]"
# With database support
pip install -e ".[database]"
# Full install
pip install -e ".[all]"
# Development
pip install -e ".[dev]"
Quick Start
Python API
import pandas as pd
import validatex as vx
# Create your data
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 42],
    "email": ["alice@test.com", "bob@test.com", "charlie@test.com",
              "diana@test.com", "eve@test.com"],
    "status": ["active", "active", "inactive", "active", "pending"],
})
# Build an expectation suite
suite = (
    vx.ExpectationSuite("user_quality")
    .add("expect_column_to_not_be_null", column="user_id")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
    .add("expect_column_values_to_be_in_set",
         column="status", value_set=["active", "inactive", "pending"])
    .add("expect_column_values_to_match_regex",
         column="email", regex=r"^[\w.]+@[\w]+\.\w+$")
)
# Validate
result = vx.validate(df, suite)
# Print summary (includes Quality Score)
print(result.summary())
# Generate reports
result.to_html("report.html")
result.to_json_file("report.json")
CLI
# Initialize a project
validatex init
# Profile a dataset
validatex profile --data data.csv --suggest --output auto_suite.yaml
# Run validation
validatex validate --data data.csv --suite suite.yaml --report report.html
# Run checkpoint
validatex run --checkpoint checkpoint.yaml
# List available expectations
validatex list-expectations
Automate with CI/CD
ValidateX is designed to be lightweight and CI-friendly. You can easily integrate it into your GitHub Actions, GitLab CI, or Jenkins pipelines to gate deployments on data quality.
Example: GitHub Actions
name: Data Quality Validation
on: [push, pull_request]
jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install ValidateX
        run: pip install validatex
      - name: Run Data Validation
        run: |
          validatex validate \
            --data data/production_data.csv \
            --suite tests/data_quality/suite.yaml \
            --report dq_report.html
      - name: Archive production artifacts
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: validatex-report
          path: dq_report.html
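If a pipeline should fail on a low quality score rather than on individual expectations, one option is a small gate script. The sketch below is an assumption-laden illustration: the helper names `exit_code_for_score` and `main`, and the threshold of 90, are hypothetical and not part of ValidateX. It only shows how a numeric score could be mapped to a process exit code.

```python
import sys


def exit_code_for_score(score: float, minimum: float = 90.0) -> int:
    """Return 0 when the score meets the threshold, 1 otherwise.

    The default threshold of 90 is an arbitrary example value.
    """
    return 0 if score >= minimum else 1


def main(argv: list[str]) -> int:
    """Read a score from the first argument and return the exit code.

    Assumption: the score was obtained elsewhere, for example from
    result.compute_quality_score() in the Python API, and is numeric.
    """
    score = float(argv[0])
    code = exit_code_for_score(score)
    status = "passed" if code == 0 else "failed"
    print(f"Quality gate {status}: score={score}")
    return code


# In a pipeline step this would typically end with:
#     sys.exit(main(sys.argv[1:]))
```

A non-zero exit code is what makes a CI step fail, so the gate works the same way in GitHub Actions, GitLab CI, or Jenkins.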
Data Quality Score
ValidateX computes a weighted quality score (0–100) based on the severity of each expectation:
| Severity | Weight | Example Expectations |
|---|---|---|
| 🔴 Critical | ×3 | Null checks, uniqueness, column existence, row count |
| 🟡 Warning | ×2 | Range checks, set membership, regex, type checks |
| 🔵 Info | ×1 | Mean/stdev bounds, string lengths, distinct values |
Formula: Score = 100 × (weighted_passed / weighted_total)
A critical failure impacts the score 3× more than an info-level check. This gives decision-makers a single number to assess data health.
result = vx.validate(df, suite)
score = result.compute_quality_score()
print(f"Data Quality Score: {score}/100")
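To make the formula concrete, here is a small worked sketch that recomputes the score by hand. It is not ValidateX's implementation: the function name `weighted_quality_score`, the `(severity, passed)` input shape, and the choice to score an empty suite as 100 are all assumptions made for illustration. Only the weights come from the table above.

```python
# Weights taken from the severity table: Critical x3, Warning x2, Info x1.
SEVERITY_WEIGHTS = {"critical": 3, "warning": 2, "info": 1}


def weighted_quality_score(results):
    """Compute 100 * weighted_passed / weighted_total.

    `results` is an iterable of (severity, passed) pairs; this input
    shape is hypothetical and chosen only for the example.
    """
    results = list(results)  # allow generators; we iterate twice
    weighted_total = sum(SEVERITY_WEIGHTS[sev] for sev, _ in results)
    if weighted_total == 0:
        return 100.0  # assumption: an empty suite counts as healthy
    weighted_passed = sum(SEVERITY_WEIGHTS[sev] for sev, ok in results if ok)
    return round(100 * weighted_passed / weighted_total, 2)


checks = [
    ("critical", True),
    ("critical", False),  # one failed critical check
    ("warning", True),
    ("info", True),
]
score = weighted_quality_score(checks)
print(score)  # 66.67, since 6 of 9 weighted points passed
```

Had the failing check been info-level instead, the same suite would lose only 1 of 9 weighted points, which is the 3× effect described above.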
Custom Severity
Override the default severity on any expectation via meta:
expectations:
  - expectation_type: expect_column_mean_to_be_between
    column: revenue
    kwargs:
      min_value: 1000
      max_value: 50000
    meta:
      severity: critical  # Override default "info" → "critical"
Column Health Summary
The HTML report includes a Column Health Summary that aggregates all expectations per column:
| Column | Checks | Passed | Failed | Health | Null % | Unique % |
|---|---|---|---|---|---|---|
| user_id | 3 | 3 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| | 4 | 4 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| status | 1 | 1 | 0 | 100% ███ | n/a | n/a |
Each metric includes a mini CSS bar chart for instant visual scanning.
for col in result.column_health():
    print(f"{col.column}: {col.health_score}% health, "
          f"{col.passed}/{col.checks} passed")
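The per-column aggregation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's code: `summarize_column_health` and the `(column, passed)` input pairs are hypothetical, and only the Checks, Passed, Failed, and Health columns of the table are reproduced.

```python
from collections import defaultdict


def summarize_column_health(results):
    """Aggregate (column, passed) pairs into per-column counts.

    Health is the percentage of checks on that column that passed.
    """
    counts = defaultdict(lambda: {"checks": 0, "passed": 0})
    for column, ok in results:
        counts[column]["checks"] += 1
        counts[column]["passed"] += int(ok)

    summary = {}
    for column, c in counts.items():
        summary[column] = {
            "checks": c["checks"],
            "passed": c["passed"],
            "failed": c["checks"] - c["passed"],
            "health": round(100 * c["passed"] / c["checks"], 1),
        }
    return summary


example = [
    ("user_id", True),
    ("user_id", True),
    ("age", True),
    ("age", False),
]
for column, row in summarize_column_health(example).items():
    print(column, row)
```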
Available Expectations
Column-Level (16)
| Expectation | Severity | Description |
|---|---|---|
| `expect_column_to_exist` | 🔴 Critical | Column exists in DataFrame |
| `expect_column_to_not_be_null` | 🔴 Critical | No null values |
| `expect_column_values_to_be_unique` | 🔴 Critical | All values unique |
| `expect_column_values_to_be_between` | 🟡 Warning | Values within range |
| `expect_column_values_to_be_in_set` | 🟡 Warning | Values in allowed set |
| `expect_column_values_to_not_be_in_set` | 🟡 Warning | Values not in forbidden set |
| `expect_column_values_to_match_regex` | 🟡 Warning | Values match regex pattern |
| `expect_column_values_to_be_of_type` | 🟡 Warning | Column dtype matches |
| `expect_column_values_to_be_dateutil_parseable` | 🟡 Warning | Values parseable as dates |
| `expect_column_value_lengths_to_be_between` | 🔵 Info | String lengths within range |
| `expect_column_max_to_be_between` | 🔵 Info | Column max within bounds |
| `expect_column_min_to_be_between` | 🔵 Info | Column min within bounds |
| `expect_column_mean_to_be_between` | 🔵 Info | Column mean within bounds |
| `expect_column_stdev_to_be_between` | 🔵 Info | Column std dev within bounds |
| `expect_column_distinct_values_to_be_in_set` | 🔵 Info | All distinct values in set |
| `expect_column_proportion_of_unique_values_to_be_between` | 🔵 Info | Uniqueness ratio in range |
Table-Level (5)
| Expectation | Severity | Description |
|---|---|---|
| `expect_table_row_count_to_equal` | 🔴 Critical | Exact row count |
| `expect_table_row_count_to_be_between` | 🔴 Critical | Row count in range |
| `expect_table_columns_to_match_ordered_list` | 🔴 Critical | Column order matches |
| `expect_table_columns_to_match_set` | 🔴 Critical | Column names match (unordered) |
| `expect_table_column_count_to_equal` | 🔴 Critical | Exact column count |
Aggregate / Cross-Column (4)
| Expectation | Severity | Description |
|---|---|---|
| `expect_column_pair_values_a_to_be_greater_than_b` | 🟡 Warning | Column A > Column B |
| `expect_column_pair_values_to_be_equal` | 🟡 Warning | Two columns equal |
| `expect_multicolumn_sum_to_equal` | 🟡 Warning | Row-wise sum equals target |
| `expect_compound_columns_to_be_unique` | 🔴 Critical | Compound key uniqueness |
Data Profiling
import pandas as pd
from validatex import DataProfiler
df = pd.read_csv("data.csv")
profiler = DataProfiler()
# Profile
profile = profiler.profile(df)
print(profile.summary())
# Auto-suggest expectations
suite = profiler.suggest_expectations(df, suite_name="auto_suite")
suite.save("auto_suite.yaml")
YAML Suite Configuration
suite_name: my_data_quality
meta:
  description: "Quality checks for production data"
expectations:
  - expectation_type: expect_column_to_not_be_null
    column: id
    meta:
      severity: critical
  - expectation_type: expect_column_values_to_be_between
    column: age
    kwargs:
      min_value: 0
      max_value: 150
  - expectation_type: expect_column_values_to_be_in_set
    column: status
    kwargs:
      value_set: ["active", "inactive"]
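To show how the declarative form lines up with the fluent `.add(name, column=..., **kwargs)` calls from the Quick Start, the sketch below flattens a parsed suite into `(expectation_type, kwargs)` pairs. It is illustrative only: `suite_config_to_add_calls` is a hypothetical helper, the config is written as a Python dict mirroring the YAML above so that no YAML parser is needed, and how ValidateX's own loader handles `meta` is not covered here.

```python
def suite_config_to_add_calls(config):
    """Turn a parsed suite config into (expectation_type, kwargs) pairs.

    Each pair corresponds to one `.add(expectation_type, **kwargs)` call.
    The per-expectation `meta` block is ignored in this sketch.
    """
    calls = []
    for entry in config.get("expectations", []):
        kwargs = dict(entry.get("kwargs", {}))
        if "column" in entry:
            kwargs["column"] = entry["column"]
        calls.append((entry["expectation_type"], kwargs))
    return calls


# Same content as the YAML suite above, expressed as a dict.
config = {
    "suite_name": "my_data_quality",
    "expectations": [
        {"expectation_type": "expect_column_to_not_be_null",
         "column": "id", "meta": {"severity": "critical"}},
        {"expectation_type": "expect_column_values_to_be_between",
         "column": "age", "kwargs": {"min_value": 0, "max_value": 150}},
        {"expectation_type": "expect_column_values_to_be_in_set",
         "column": "status", "kwargs": {"value_set": ["active", "inactive"]}},
    ],
}

calls = suite_config_to_add_calls(config)
for name, kwargs in calls:
    print(name, kwargs)
```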
Architecture
validatex/
├── core/
│   ├── expectation.py          # Base class + registry
│   ├── result.py               # ValidationResult, QualityScore, Severity, ColumnHealth
│   ├── suite.py                # ExpectationSuite (fluent API)
│   └── validator.py            # Validation orchestrator
├── expectations/
│   ├── column_expectations.py      # 16 column-level checks
│   ├── table_expectations.py       # 5 table-level checks
│   └── aggregate_expectations.py   # 4 cross-column checks
├── datasources/
│   ├── csv_source.py           # CSV files
│   ├── parquet_source.py       # Parquet files
│   ├── database_source.py      # SQL databases (SQLAlchemy)
│   └── dataframe_source.py     # Direct DataFrames
├── profiler/
│   └── profiler.py             # Auto-profiling & suggestion engine
├── reporting/
│   ├── html_report.py          # Production HTML reports
│   └── json_report.py          # JSON reports
├── config/
│   └── loader.py               # YAML/JSON config loading
└── cli/
    └── main.py                 # CLI (validate, run, profile, init, list-expectations)
Testing
# Run all tests (66 tests)
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=validatex --cov-report=html
# Unit tests only
pytest tests/unit/ -v
# Integration tests
pytest tests/integration/ -v
Creating Custom Expectations
from dataclasses import dataclass, field
from validatex.core.expectation import Expectation, register_expectation
from validatex.core.result import ExpectationResult
@register_expectation
@dataclass
class ExpectColumnValuesToBePositive(Expectation):
    """Expect all values in a numeric column to be positive."""
    expectation_type: str = field(
        init=False, default="expect_column_values_to_be_positive"
    )
    def _validate_pandas(self, df) -> ExpectationResult:
        # Nulls are skipped; only non-null values are checked
        series = df[self.column].dropna()
        total = len(series)
        # Zero is not positive, so every value <= 0 is unexpected
        non_positive_mask = series <= 0
        unexpected_count = int(non_positive_mask.sum())
        pct = (unexpected_count / total * 100) if total > 0 else 0.0
        return self._build_result(
            success=(unexpected_count == 0),
            element_count=total,
            unexpected_count=unexpected_count,
            unexpected_percent=pct,
            # Cap the sample of offending values at 20
            unexpected_values=series[non_positive_mask].tolist()[:20],
        )
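For readers unfamiliar with the registry pattern that `register_expectation` relies on, here is a generic, standalone sketch of the idea. It does not reproduce ValidateX internals: `_REGISTRY`, `register`, `create`, and the `PositiveCheck` class are hypothetical names used only to show how a decorator can map a string identifier to a class so that config files can refer to checks by name.

```python
_REGISTRY = {}  # maps expectation_type strings to classes


def register(cls):
    """Class decorator: record the class under its expectation_type."""
    _REGISTRY[cls.expectation_type] = cls
    return cls


def create(expectation_type, **kwargs):
    """Instantiate a registered class by name."""
    try:
        cls = _REGISTRY[expectation_type]
    except KeyError:
        raise ValueError(f"Unknown expectation: {expectation_type}") from None
    return cls(**kwargs)


@register
class PositiveCheck:
    expectation_type = "expect_column_values_to_be_positive"

    def __init__(self, column):
        self.column = column

    def validate(self, values):
        """Return True when every non-null value is strictly positive."""
        return all(v > 0 for v in values if v is not None)


check = create("expect_column_values_to_be_positive", column="age")
print(check.column, check.validate([25, 30, None, 42]))
```

The benefit of this design is that a suite file only needs the string name; the lookup table resolves it to the right class at load time.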
Clean Output
ValidateX converts all internal types to native Python before rendering. You'll never see np.int64(20) in reports or JSON, only a clean 20.
result = vx.validate(df, suite)
data = result.to_dict()
# Observed values are always clean:
# {'min': 20, 'max': 69}  (not {'min': np.int64(20), ...})
# "Unique: 100/100 (100.0%)"  (not "100 unique out of 100")
# "Distinct values: 3"  (not "{'unique_values': 3}")
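One common way to achieve this kind of sanitization is a recursive converter that relies on the `tolist()` and `item()` methods NumPy scalars and arrays expose. The sketch below is an assumption about the general technique, not ValidateX's actual code: `to_native` is a hypothetical name, and `FakeScalar` is a stand-in class so the example runs without NumPy installed.

```python
def to_native(value):
    """Recursively convert containers and NumPy-like values to builtins."""
    if isinstance(value, dict):
        return {key: to_native(val) for key, val in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_native(val) for val in value]
    if hasattr(value, "tolist"):
        # NumPy arrays and scalars both provide tolist()
        return to_native(value.tolist())
    if hasattr(value, "item"):
        # Fallback for scalar-like objects that only provide item()
        return value.item()
    return value


class FakeScalar:
    """Stand-in for a NumPy scalar such as np.int64 (illustration only)."""

    def __init__(self, raw):
        self._raw = raw

    def item(self):
        return self._raw


observed = {"min": FakeScalar(20), "max": FakeScalar(69), "sample": (1, 2)}
clean = to_native(observed)
print(clean)
```

With NumPy installed, passing `np.int64(20)` in place of `FakeScalar(20)` goes through the `tolist()` branch and yields the same plain `int`.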
Roadmap
- 25+ built-in expectations (column, table, aggregate)
- Pandas + PySpark dual-engine support
- Severity modeling (Critical / Warning / Info)
- Weighted data quality score (0–100)
- Column health summary with mini charts
- Modern HTML reports with dark theme
- Download buttons (JSON, CSV, clipboard)
- Drift detection foundation
- Data profiler with auto-suggestion
- CLI with validate, profile, run, init commands
- YAML/JSON declarative configuration
- Native Python type sanitization
- Slack / Teams notifications on failure
- GitHub Action template for CI/CD
- Polars engine support
- Baseline history tracking & trend charts
- Anomaly detection expectations
- Great Expectations suite import/migration
- Web dashboard for multi-dataset monitoring
- dbt integration plugin
Versioning
ValidateX follows Semantic Versioning.
- MAJOR version for incompatible API changes
- MINOR version for backwards-compatible new functionality
- PATCH version for backwards-compatible bug fixes
License
MIT License
Built with ❤️ by the ValidateX Team
If this project helps you, consider giving it a ⭐