A SQL-powered data quality validation library for pandas and spark DataFrames.
Project description
vowl
vowl (vee-owl 🦉) is a validation engine for Open Data Contract Standard (ODCS) data contracts. Define your validation rules once in a declarative YAML contract and get rich, actionable reports on your data's quality.
🏆 Official ODCS Vendor: vowl is actively maintained and listed on the official ODCS vendors list as a natively compatible tool.
Table of Contents
- Features
- Getting Started
- Concepts
- Usage Patterns
- Local DataFrame (Pandas/Polars)
- PySpark
- Ibis Connections (20+ Backends)
- Compatibility Mode (DuckDB ATTACH)
- Explicit Adapter with Filter Conditions
- Multi-Source Validation
- Custom Adapters and Executors
- Using Servers Defined in Data Contract
- Loading Contracts from Git (GitHub/GitLab)
- Loading Contracts from S3
- Roadmap
- Contributing
- License
Features
- Extensible Check Engine: Ships with a SQL check engine out of the box, with the architecture designed to support custom check types beyond SQL.
- Auto-Generated Rules: Checks are automatically derived from contract metadata (
logicalType,logicalTypeOptions,required,unique,primaryKey) and library metrics (nullValues,missingValues,invalidValues,duplicateValues,rowCount). - Any DataFrame, Any Backend: Load any Narwhals-compatible DataFrame type (pandas, Polars, PySpark, etc.) or connect to 20+ backends via Ibis. SQL dialect translation is handled by SQLGlot.
- Server-Side Execution: SQL checks run server-side through Ibis without materialising tables on the client.
- Multi-Source Validation: Validate across tables in different source systems with cross-database joins.
- Declarative ODCS Contracts: Define validation rules in YAML following the Open Data Contract Standard.
- Flexible Filtering: Filter conditions with wildcard pattern matching, ideal for incremental validation of new data.
- Rich Reporting: Detailed summaries, row-level failure analysis, saveable reports, and a chainable
ValidationResultAPI. - No Silent Gaps: Unimplemented or unrecognised checks surface as
ERROR, not quietly skipped, so nothing slips through the cracks.
Getting Started
Installation
pip install vowl
Optional extras are available: vowl[spark], vowl[all].
For local development, testing, and release workflow, see CONTRIBUTING.md.
Validate in 3 lines
import pandas as pd # or any Narwhals-compatible DataFrame
from vowl import validate_data
df = pd.read_csv("tests/hdb_resale/HDBResaleWithErrors.csv")
result = validate_data("tests/hdb_resale/hdb_resale_simple.yaml", df=df)
result.display_full_report()
Output (click to expand)
=== Data Quality Validation Results ===
Contract Version: v3.1.0
Contract ID: c11443ee-542f-4442-b28d-2d224342be37
Schemas: hdb_resale_prices
OVERALL DATA QUALITY
Overall:
Checks Pass Rate: 17 / 20 (85.0%)
hdb_resale_prices:
Overall:
Checks Pass Rate: 17 / 20 (85.0%)
ERRORED Checks: 0
Single Table:
Checks Pass Rate: 17 / 20 (85.0%)
ERRORED Checks: 0
Unique Passed Rows: 201,863 / 201,879 (99.9%)
Multi Table:
Checks Pass Rate: 0 / 0 (N/A)
ERRORED Checks: 0
Non-unique Failed Rows: 0
CHECK RESULTS
+-----------------------------------------+---------------------------------------+-------------------+--------+---------------+---------------+--------+----------------+
| check_id | Target | tables_in_query | status | operator | expected | actual | execution time |
+-----------------------------------------+---------------------------------------+-------------------+--------+---------------+---------------+--------+----------------+
| Month | hdb_resale_prices.month | hdb_resale_prices | FAILED | mustBe | 0 | 2 | 17.84 ms |
| Year | hdb_resale_prices.lease_commence_date | hdb_resale_prices | FAILED | mustBe | 0 | 2 | 26.09 ms |
| floor_area_must_be_less_than_200 | hdb_resale_prices.floor_area_sqm | hdb_resale_prices | FAILED | mustBe | 0 | 12 | 13.58 ms |
+-----------------------------------------+---------------------------------------+-------------------+--------+---------------+---------------+--------+----------------+
| AddressBlockHouseNumber | hdb_resale_prices.block | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 17.26 ms |
| block_column_exists_check | hdb_resale_prices.block | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 4.56 ms |
| flat_model_column_exists_check | hdb_resale_prices.flat_model | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 31.60 ms |
| flat_type_column_exists_check | hdb_resale_prices.flat_type | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 4.45 ms |
| flat_type_invalidValues | hdb_resale_prices.flat_type | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 17.79 ms |
| floor_area_sqm_column_exists_check | hdb_resale_prices.floor_area_sqm | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 4.90 ms |
| hdb_resale_prices_rowCount | hdb_resale_prices | hdb_resale_prices | PASSED | mustBeBetween | [0, 30000000] | 201879 | 6.32 ms |
| lease_commence_date_column_exists_check | hdb_resale_prices.lease_commence_date | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 4.24 ms |
| month_column_exists_check | hdb_resale_prices.month | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 4.43 ms |
| month_logical_type_check | hdb_resale_prices.month | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 8.16 ms |
| remaining_lease_column_exists_check | hdb_resale_prices.remaining_lease | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 3.81 ms |
| resale_price_column_exists_check | hdb_resale_prices.resale_price | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 4.33 ms |
| resale_price_must_not_exceed_2m | hdb_resale_prices.resale_price | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 18.07 ms |
| storey_range_column_exists_check | hdb_resale_prices.storey_range | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 3.94 ms |
| street_name_column_exists_check | hdb_resale_prices.street_name | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 4.83 ms |
| town_column_exists_check | hdb_resale_prices.town | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 4.53 ms |
| town_nullValues | hdb_resale_prices.town | hdb_resale_prices | PASSED | mustBe | 0 | 0 | 10.13 ms |
+-----------------------------------------+---------------------------------------+-------------------+--------+---------------+---------------+--------+----------------+
Total Execution: 210.88 ms
=== Failed Checks and Rows (up to 5 row(s) per failed check) ===
hdb_resale_prices
Single checks
[Month]
Operator: mustBe
Expected: 0
Actual: 2
Target: hdb_resale_prices.month
Details: Based on ISO 8601, assumed to be in UTC +8 | YYYY-MM
Rule: SELECT COUNT(*) FROM "hdb_resale_prices" WHERE NOT REGEXP_MATCHES(TRY_CAST(month AS TEXT), '^[0-9]{4}-(0[1-9]|1[0-2])$')
Rows shown: 2 of 2
+----------+--------+-----------+-------+--------------+--------------+----------------+---------------+---------------------+--------------------+--------------+
| month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | resale_price |
+----------+--------+-----------+-------+--------------+--------------+----------------+---------------+---------------------+--------------------+--------------+
| 2017-jan | BEDOK | 5 ROOM | 21 | CHAI CHEE RD | 07 TO 09 | 130.0 | Adjoined flat | 1972 | 54 years 06 months | 530000.0 |
| 2017-jan | BISHAN | 3 ROOM | 105 | BISHAN ST 12 | 04 TO 06 | 4.0 | Simplified | 1985 | 67 years 11 months | 395000.0 |
+----------+--------+-----------+-------+--------------+--------------+----------------+---------------+---------------------+--------------------+--------------+
[floor_area_must_be_less_than_200]
Operator: mustBe
Expected: 0
Actual: 12
Target: hdb_resale_prices.floor_area_sqm
Details: Validates that floor area must be less than 200
Rule: SELECT COUNT(*) FROM "hdb_resale_prices" WHERE TRY_CAST(floor_area_sqm AS BIGINT) >= 200
Rows shown: 5 of 12
+---------+-----------------+-----------+-------+---------------------+--------------+----------------+--------------------+---------------------+--------------------+--------------+
| month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | resale_price |
+---------+-----------------+-----------+-------+---------------------+--------------+----------------+--------------------+---------------------+--------------------+--------------+
| 2017-06 | KALLANG/WHAMPOA | 3 ROOM | 38 | JLN BAHAGIA | 01 TO 03 | 215.0 | Terrace | 1972 | 54 years 01 month | 830000.0 |
| 2017-09 | CHOA CHU KANG | EXECUTIVE | 641 | CHOA CHU KANG ST 64 | 16 TO 18 | 215.0 | Premium Maisonette | 1998 | 79 years 04 months | 888000.0 |
| 2017-12 | KALLANG/WHAMPOA | 3 ROOM | 65 | JLN MA'MOR | 01 TO 03 | 249.0 | Terrace | 1972 | 53 years 07 months | 1053888.0 |
| 2018-01 | CHOA CHU KANG | EXECUTIVE | 639 | CHOA CHU KANG ST 64 | 10 TO 12 | 215.0 | Premium Maisonette | 1998 | 79 years | 900000.0 |
| 2018-09 | KALLANG/WHAMPOA | 3 ROOM | 41 | JLN BAHAGIA | 01 TO 03 | 237.0 | Terrace | 1972 | 52 years 10 months | 1185000.0 |
+---------+-----------------+-----------+-------+---------------------+--------------+----------------+--------------------+---------------------+--------------------+--------------+
[Year]
Operator: mustBe
Expected: 0
Actual: 2
Target: hdb_resale_prices.lease_commence_date
Details: Based on ISO 8601, assumed to be in UTC +8 | YYYY
Rule: SELECT COUNT(*) FROM "hdb_resale_prices" WHERE NOT REGEXP_MATCHES(TRY_CAST(lease_commence_date AS TEXT), '^[0-9]{4}$')
Rows shown: 2 of 2
+---------+------------+-----------+-------+------------------+--------------+----------------+----------------+---------------------+--------------------+--------------+
| month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | resale_price |
+---------+------------+-----------+-------+------------------+--------------+----------------+----------------+---------------------+--------------------+--------------+
| 2017-01 | ANG MO KIO | 3 ROOM | 219 | ANG MO KIO AVE 1 | 07 TO 09 | 67.0 | New Generation | 1977.0 | 59 years 06 months | 297000.0 |
| 2017-01 | ANG MO KIO | 3 ROOM | 211 | ANG MO KIO AVE 3 | 01 TO 03 | 67.0 | New Generation | abc | 59 years 03 months | 325000.0 |
+---------+------------+-----------+-------+------------------+--------------+----------------+----------------+---------------------+--------------------+--------------+
See Usage Patterns for PySpark, Ibis connections, multi-source validation, and more.
Concepts
Data Contracts
Instead of writing validation logic in Python, you declare it in a YAML file following the Open Data Contract Standard (ODCS). This separates your rules from your code, making them easier to manage, version, and share.
Example hdb_resale_simple.yaml (trimmed for readability):
kind: DataContract
apiVersion: v3.1.0
version: 1.0.0
id: c11443ee-542f-4442-b28d-2d224342be37
status: draft
name: HDB Resale Flat Prices
schema:
- name: hdb_resale_prices # This becomes the table name in your SQL queries
properties:
# --- SQL Check: regex-based format validation ---
- name: month
logicalType: string
quality:
- type: sql
name: Month
description: Based on ISO 8601, assumed to be in UTC +8 | YYYY-MM
mustBe: 0
query: |-
SELECT COUNT(*)
FROM "hdb_resale_prices"
WHERE CAST(month AS TEXT) !~ '^[0-9]{4}-(0[1-9]|1[0-2])$';
dimension: conformity
# --- Library Metric: null-value check ---
- name: town
quality:
- type: library
metric: nullValues
mustBe: 0
dimension: completeness
# --- Library Metric: valid-value list ---
- name: flat_type
quality:
- type: library
metric: invalidValues
mustBe: 0
dimension: conformity
arguments:
validValues:
- 1 ROOM
- 2 ROOM
- 3 ROOM
- 4 ROOM
- 5 ROOM
- EXECUTIVE
- MULTI-GENERATION
# --- SQL Check: business rule ---
- name: floor_area_sqm
quality:
- name: floor_area_must_be_less_than_200
description: Validates that floor area must be less than 200
type: sql
dimension: consistency
query: SELECT COUNT(*) FROM "hdb_resale_prices" WHERE floor_area_sqm >= 200
mustBe: 0
# --- SQL Check: resale price cap ---
- name: resale_price
quality:
- name: resale_price_must_not_exceed_2m
description: Resale price must not be more than 2 million SGD
type: sql
dimension: conformity
query: >-
SELECT COUNT(*) FROM "hdb_resale_prices" WHERE resale_price > 2000000
mustBe: 0
# --- Table-Level Library Metric ---
quality:
- type: library
metric: rowCount
mustBeBetween:
- 0
- 30000000
Automatic check_references
When a contract is loaded, vowl automatically builds CheckReference objects for every executable check in the contract via Contract.get_check_references_by_schema().
This includes both user-authored checks in quality blocks and synthetic checks derived from column metadata. The generated references are grouped by schema, and the auto-generated ones run before explicit quality checks.
| Reference type | Trigger in contract | JSONPath stored in the reference |
|---|---|---|
| Table check | Entry under schema-level quality |
$.schema[N].quality[M] |
| Column check | Entry under property-level quality |
$.schema[N].properties[M].quality[K] |
| Library column metric | type: library under property-level quality |
$.schema[N].properties[M].quality[K] |
| Library table metric | type: library under schema-level quality |
$.schema[N].quality[M] |
| Declared column exists check | Property has a name |
$.schema[N].properties[M] |
| Logical type check | logicalType present on a property |
$.schema[N].properties[M].logicalType |
| Logical type options check | Supported key under logicalTypeOptions |
$.schema[N].properties[M].logicalTypeOptions.<optionKey> |
| Required check | required: true |
$.schema[N].properties[M].required |
| Unique check | unique: true |
$.schema[N].properties[M].unique |
| Primary key check | primaryKey: true |
$.schema[N].properties[M].primaryKey |
The auto-generated check types currently cover:
| Generated from | What vowl validates |
|---|---|
name |
Column declared in the contract exists in the source table |
logicalType |
Values can be cast to the declared SQL type for integer, number, boolean, date, timestamp, and time |
logicalTypeOptions.minLength |
String length is at least the configured minimum |
logicalTypeOptions.maxLength |
String length does not exceed the configured maximum |
logicalTypeOptions.pattern |
String values match the configured regex pattern |
logicalTypeOptions.minimum |
Value is greater than or equal to the configured minimum |
logicalTypeOptions.maximum |
Value is less than or equal to the configured maximum |
logicalTypeOptions.exclusiveMinimum |
Value is strictly greater than the configured minimum |
logicalTypeOptions.exclusiveMaximum |
Value is strictly less than the configured maximum |
logicalTypeOptions.multipleOf |
Value is a multiple of the configured number |
logicalTypeOptions.format |
Value satisfies the declared format (see Format Checks below) |
required: true |
Column contains no NULL values |
unique: true |
Non-null values are unique |
primaryKey: true |
Values are both unique and non-null |
Library Metrics (type: library)
Instead of writing SQL by hand, you can declare common data quality metrics using type: library in your quality blocks. vowl auto-generates the appropriate SQL at runtime.
Column-level metrics (under a property's quality):
metric |
What it checks | Arguments |
|---|---|---|
nullValues |
Count of NULL values in the column |
- |
missingValues |
Count of values matching a configurable missing-values list | arguments.missingValues: list of sentinel values (use null for SQL NULL) |
invalidValues |
Count of values that fail valid-value or pattern criteria | arguments.validValues: allowed values list and/or arguments.pattern: regex |
duplicateValues |
Count of duplicate non-NULL values in the column | - |
Table-level metrics (under a schema's quality):
metric |
What it checks | Arguments |
|---|---|---|
rowCount |
Total number of rows in the table | - |
duplicateValues |
Count of duplicate rows across specified columns | arguments.properties: list of column names to check |
All library metrics support unit: "percent" to return the result as a percentage of total rows instead of an absolute count. They also accept any of the standard check operators (mustBe, mustBeGreaterThan, etc.).
Example:
properties:
- name: town
quality:
- type: library
metric: nullValues
mustBe: 0
dimension: completeness
- name: flat_type
quality:
- type: library
metric: invalidValues
mustBe: 0
dimension: conformity
arguments:
validValues:
- 3 ROOM
- 4 ROOM
- 5 ROOM
- EXECUTIVE
quality:
- type: library
metric: rowCount
mustBeGreaterThan: 0
dimension: completeness
- type: library
metric: duplicateValues
mustBe: 0
dimension: uniqueness
arguments:
properties:
- month
- block
- street_name
In practice, a property like this:
- name: block
logicalType: string
logicalTypeOptions:
maxLength: 10
required: true
produces three generated check references pointing at:
| Check path | Check type |
|---|---|
$.schema[0].properties[...] |
DeclaredColumnExistsCheckReference |
$.schema[0].properties[...].logicalTypeOptions.maxLength |
LogicalTypeOptionsCheckReference |
$.schema[0].properties[...].required |
RequiredCheckReference |
Because string does not currently generate a SQL cast-based type check, the logicalType entry above contributes metadata for option checks rather than a standalone type-validation query. If you use integer, number, boolean, date, timestamp, or time, vowl also generates a logicalType SQL check automatically. You only need to define extra quality entries when you want custom business rules beyond the contract metadata.
Format Checks
The logicalTypeOptions.format key validates that column values conform to a declared format. The check generated depends on the column's logicalType:
Integer formats — validates that values fall within the range of a fixed-width integer type:
format |
Min | Max |
|---|---|---|
i8 |
-128 | 127 |
i16 |
-32,768 | 32,767 |
i32 |
-2,147,483,648 | 2,147,483,647 |
i64 |
-9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 |
u8 |
0 | 255 |
u16 |
0 | 65,535 |
u32 |
0 | 4,294,967,295 |
u64 |
0 | 18,446,744,073,709,551,615 |
i128 and u128 are recognised but skipped because their ranges exceed what SQL engines can represent.
String formats — validates values against a built-in regex pattern:
format |
What it checks |
|---|---|
uuid |
UUID v1-v5 hex format (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx) |
email |
Basic local@domain.tld structure |
ipv4 |
Dotted-decimal IPv4 address (0.0.0.0 - 255.255.255.255) |
ipv6 |
Full-form colon-separated IPv6 address |
hostname |
RFC-952 hostname with TLD |
uri |
URI with a valid scheme prefix (e.g. https:, s3:) |
password, byte, and binary are recognised but skipped because they cannot be validated against data.
Number formats — f32 and f64 are recognised but produce no check (metadata-only).
Date, timestamp and time formats — accepts a JDK DateTimeFormatter pattern (e.g. yyyy-MM-dd). vowl converts the pattern to a regex and validates that string-cast values match. Supported tokens include yyyy, yy, MM, dd, HH, mm, ss, SSS, and timezone offsets (X/XXX/Z). If a pattern contains tokens vowl cannot translate, the check is skipped with a warning.
- name: age
logicalType: integer
logicalTypeOptions:
format: u8 # 0 – 255
- name: request_id
logicalType: string
logicalTypeOptions:
format: uuid
- name: created_at
logicalType: timestamp
logicalTypeOptions:
format: "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"
Validation Results
The validate_data function returns a powerful ValidationResult object that provides multiple ways to interact with your validation results.
Core Methods
| Method/Property | What It Does | Returns |
|---|---|---|
print_summary() |
Prints high-level statistics (pass/fail counts, success rate, performance) | self (chainable) |
show_failed_rows(max_rows=5) |
Displays sample of failed rows in console. Use max_rows=-1 for all rows. |
self (chainable) |
display_full_report(max_rows=5) |
Prints summary + shows failed rows (convenience method) | self (chainable) |
save(output_dir=".", prefix="vowl_results") |
Saves enhanced CSV and summary JSON to disk | self (chainable) |
get_output_dfs(checks=None) |
Returns per-check failed rows as {check_id: DataFrame} |
Dict[str, DataFrame] |
get_consolidated_output_dfs(checks=None) |
Deduplicates failed rows across checks, grouped by table | Dict[str, DataFrame] |
.passed (property) |
Boolean indicating if all checks passed | True/False |
Architecture
vowl has a modular architecture built around Ibis as the universal query layer.
┌─────────────────────────────────────────────────────────────────────────────┐
│ validate_data() │
│ (Main Entry Point) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ DataSourceMapper │
│ (Auto-detects input type → creates adapter) │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌──────────────────────────┼──────────────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ IbisAdapter │ │ MultiSourceAdapter│ │ Custom Adapter │
│ │ │ │ │ │
│ • pandas/Polars │ │ • Cross-database │ │ • Extend │
│ • PySpark │ │ validation │ │ BaseAdapter │
│ • PostgreSQL │ │ • Data federation│ │ │
│ • Snowflake │ │ │ │ │
│ • BigQuery │ │ │ │ │
│ • 20+ backends │ │ │ │ │
└──────────────────┘ └──────────────────┘ └──────────────────┘
│ │ │
└──────────────────────────┼──────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Executors │
│ │
│ ┌─────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ IbisSQLExecutor│ │MultiSourceSQLExecutor│ │ Custom Executor │ │
│ │ │ │ │ │ │ │
│ │ Runs SQL checks │ │ Mode 1: delegate to │ │ Extend BaseExecutor │ │
│ │ via Ibis │ │ backend (same conn) │ │ or SQLExecutor │ │
│ │ (server-side) │ │ Mode 2: materialise │ │ │ │
│ │ │ │ to DuckDB via Arrow │ │ │ │
│ └─────────────────┘ └─────────────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ ValidationResult │
│ │
│ • Per-check failed rows with check_id & tables_in_query columns │
│ • Detailed check results and metrics │
│ • Export to CSV/JSON │
└─────────────────────────────────────────────────────────────────────────────┘
Key Components
| Component | Description |
|---|---|
| DataSourceMapper | Auto-detects a single input source (DataFrame, Spark object, Ibis backend, or connection string) and creates the appropriate adapter |
| IbisAdapter | Universal adapter supporting 20+ backends via Ibis (pandas, Polars, PySpark, PostgreSQL, Snowflake, BigQuery, etc.) |
| MultiSourceAdapter | Routes checks across multiple data sources, separating single-table checks (delegated to per-schema adapters) from multi-table checks (sent to MultiSourceSQLExecutor) |
| IbisSQLExecutor | Executes SQL-based quality checks through the Ibis query layer (server-side) |
| MultiSourceSQLExecutor | Executes cross-source SQL with two modes: direct delegation when all tables share the same compatible backend, or DuckDB materialisation when backends differ. Tables are exported as Arrow and loaded into a local DuckDB for cross-database joins |
| Contract | Parses ODCS YAML contracts into executable validation rules |
| ValidationResult | Rich result object with enhanced DataFrames, metrics, and export capabilities |
Usage Patterns
Interactive demo: Try the usage patterns notebook for a hands-on walkthrough of the examples below.
Local DataFrame (Pandas/Polars)
import pandas as pd
from vowl import validate_data
df = pd.read_csv("data.csv")
result = validate_data("contract.yaml", df=df)
result.display_full_report()
PySpark
from pyspark.sql import SparkSession
from vowl import validate_data
# Create SparkSession (user-managed)
spark = SparkSession.builder.appName("vowl").getOrCreate()
try:
spark_df = spark.read.table("my_table")
result = validate_data("contract.yaml", df=spark_df)
result.display_full_report()
finally:
# User is responsible for stopping the SparkSession
spark.stop()
Note: The library does not manage the SparkSession lifecycle. You must create and stop it yourself. This is by design - SparkSession is a heavy, application-owned resource with specific configuration requirements.
Ibis Connections (20+ Backends)
# Ibis supports: Amazon Athena, BigQuery, ClickHouse, Dask, Databricks, DataFusion,
# Druid, DuckDB, Exasol, Flink, Impala, MSSQL, MySQL, Oracle, pandas, Polars,
# PostgreSQL, PySpark, RisingWave, SingleStoreDB, Snowflake, SQLite, Trino, ...
# Find out more at https://github.com/ibis-project/ibis
import ibis
from vowl import validate_data
from vowl.adapters import IbisAdapter
con = ibis.postgres.connect(...) # Redshift can be supported via Postgres connections too
result = validate_data("contract.yaml", adapter=IbisAdapter(con))
result.display_full_report()
For MySQL, select the database when you create the connection, for example via
ibis.mysql.connect(..., database="my_db") or a connection URI that already
includes the database name. vowl does not issue USE database during
validation; it runs read-only SELECT queries against the active database on
the existing connection. If you need to avoid relying on the connection's
default database, use qualified table names such as my_db.my_table in your
contract queries.
Compatibility Mode (DuckDB ATTACH)
import ibis
from vowl import validate_data
from vowl.adapters import IbisAdapter
# ATTACH lets DuckDB query your remote database directly.
# Data is streamed on demand, not materialised locally.
# All SQL is evaluated by DuckDB, so dialect differences are eliminated.
con = ibis.duckdb.connect()
con.raw_sql("ATTACH 'postgresql://user:pass@host:5432/mydb' AS pg (TYPE postgres, READ_ONLY)")
con.raw_sql("USE pg") # Allows querying tables without the pg. alias
result = validate_data("contract.yaml", adapter=IbisAdapter(con))
result.display_full_report()
When to use this: Your remote backend doesn't support a SQL feature that a check needs, or you want a single local engine for reproducible results regardless of the source database. DuckDB ATTACH supports PostgreSQL, MySQL, and SQLite.
Explicit Adapter with Filter Conditions
from vowl import validate_data
from vowl.adapters import IbisAdapter
from datetime import datetime, timedelta
import ibis
date_limit = (datetime.today() - timedelta(days=7)).strftime("%Y-%m-%d")
con = ibis.postgres.connect(...)
# Using dict for filter conditions with wildcard patterns
# Wildcards use glob-style matching: * (any chars), ? (single char), [seq] (char in seq)
adapter = IbisAdapter(
con,
filter_conditions={
# Exact match
"TableA": {
"field": "date_dt",
"operator": ">=",
"value": date_limit
},
# Wildcard: matches employees, emp_history, emp_details, etc.
"emp*": {
"field": "date_dt",
"operator": ">=",
"value": date_limit
},
# Wildcard: matches orders_archive, customers_archive, etc.
"*_archive": {
"field": "is_deleted",
"operator": "=",
"value": False
},
# Apply to ALL tables
"*": {
"field": "tenant_id",
"operator": "=",
"value": 123
},
}
)
# Note: If multiple patterns match a table, conditions are combined with AND
# Multiple filter conditions on same table (combined with AND)
adapter = IbisAdapter(
con,
filter_conditions={
"TableA": [
{"field": "date_dt", "operator": ">=", "value": date_limit},
{"field": "status", "operator": "=", "value": "active"},
]
}
)
result = validate_data("contract.yaml", adapter=adapter)
result.display_full_report()
Multi-Source Validation
There are two ways to validate across tables in different databases.
Option A: DuckDB ATTACH (recommended: streams data, no materialisation)
import ibis
from vowl import validate_data
from vowl.adapters import IbisAdapter
con = ibis.duckdb.connect()
# Attach multiple remote databases
con.raw_sql("ATTACH 'postgresql://user:pass@host:5432/salesdb' AS pg_sales (TYPE postgres, READ_ONLY)")
con.raw_sql("ATTACH 'sqlite:///path/to/users.db' AS sqlite_users (TYPE sqlite, READ_ONLY)")
# Switch back to local DuckDB so views live in memory
con.raw_sql("USE memory")
# Create views as prefix-free shortcuts to the attached tables
con.raw_sql("CREATE VIEW transactions AS SELECT * FROM pg_sales.transactions")
con.raw_sql("CREATE VIEW users AS SELECT * FROM sqlite_users.users")
# Now vowl (and your contract queries) can reference tables without alias prefixes
result = validate_data("contract.yaml", adapter=IbisAdapter(con))
result.display_full_report()
Note: DuckDB evaluates views dynamically at query time, so this does not materialise or copy data. It streams live from your attached databases; you just get cleaner, prefix-free table names in your contracts. DuckDB ATTACH supports PostgreSQL, MySQL, and SQLite.
Option B: Multi-Source Adapters (materialises data locally)
from vowl import validate_data
from vowl.adapters import IbisAdapter
import ibis
con_a = ibis.postgres.connect(...)
con_b = ibis.sqlite.connect(...)
adapters = {
"table_a": IbisAdapter(con_a),
"table_b": IbisAdapter(con_b)
}
result = validate_data("contract.yaml", adapters=adapters)
result.display_full_report()
Why this exists: A fallback for backends that DuckDB ATTACH does not support (e.g. Snowflake, BigQuery, Databricks, Oracle, MSSQL). The
MultiSourceAdaptermaterialises entire tables on the client via Arrow into a local DuckDB instance, so prefer ATTACH whenever possible. DuckDB ATTACH only supports PostgreSQL, MySQL, and SQLite. It cannot be used as a general-purpose multi-source strategy because of namespace, credential, and filter limitations. It also preserves a known dark pattern: SQL checks can reference tables not declared in the contract'sschemablock, and those queries succeed withMultiSourceAdapter(everything is materialised locally) but fail with DuckDB ATTACH (only explicitly attached tables are visible).
Custom Adapters and Executors
BaseAdapter, BaseExecutor, and SQLExecutor are intended as boilerplate extension points for teams building custom integrations. The typical pattern is to wrap an existing adapter, register custom executors, and then add backend-specific behavior incrementally.
from typing import Optional
import ibis
from vowl.adapters import BaseAdapter, IbisAdapter
from vowl.executors import BaseExecutor, SQLExecutor
class CustomAdapter(BaseAdapter):
def __init__(self, con, **kwargs):
super().__init__(executors={
"sql": CustomSQLExecutor,
"xxx": CustomEngineExecutor,
})
self._wrapped = IbisAdapter(con, **kwargs)
def get_connection(self):
return self._wrapped.get_connection()
@property
def filter_conditions(self):
return self._wrapped.filter_conditions
def test_connection(self, table_name: str) -> Optional[str]:
return self._wrapped.test_connection(table_name)
class CustomEngineExecutor(BaseExecutor):
...
class CustomSQLExecutor(SQLExecutor):
...
con = ibis.duckdb.connect()
adapter = CustomAdapter(con)
executors = adapter.get_executors()
assert "sql" in executors
This section documents the extension boilerplate rather than a guaranteed drop-in validate_data(..., adapter=...) path for arbitrary non-Ibis adapters. For end-to-end validation in the built-in runner today, the supported runtime adapter type is IbisAdapter.
Using Servers Defined in Data Contract
from vowl import validate_data
from vowl.contracts import Contract
from vowl.adapters import IbisAdapter
import ibis
# Load the contract and get server configuration
contract = Contract.load("contract.yaml")
server = contract.get_server("my-postgres-server") # Match by server name
# Or: contract.get_server("uat") # falls back to matching by environment
# Or: contract.get_server() # returns the first server
# Create connection based on server config
con = ibis.postgres.connect(
host=server["server"],
port=server.get("port", 5432),
database=server.get("database", ""),
)
# Create adapter and validate
adapter = IbisAdapter(con)
result = validate_data("contract.yaml", adapter=adapter)
result.display_full_report()
Loading Contracts from Git (GitHub/GitLab)
from vowl import validate_data
# GitHub - blob URL (auto-converted to raw)
result = validate_data(
"https://github.com/org/repo/blob/main/contracts/my_contract.yaml",
df=df
)
result.display_full_report()
# GitHub - raw URL
result = validate_data(
"https://raw.githubusercontent.com/org/repo/main/contracts/my_contract.yaml",
df=df
)
result.display_full_report()
# GitLab - blob URL (auto-converted to raw)
result = validate_data(
"https://gitlab.com/org/repo/-/blob/main/contracts/my_contract.yaml",
df=df
)
result.display_full_report()
# Note: `requests` is included in base install.
Loading Contracts from S3
from vowl import validate_data
# S3 URI format
result = validate_data("s3://my-bucket/contracts/my_contract.yaml", df=df)
result.display_full_report()
# Note: `boto3` is not included in the base install.
# Install it with: pip install vowl[all] or pip install boto3
# Uses default AWS credentials (environment variables, ~/.aws/credentials, IAM role, etc.)
Roadmap
Completed
| Capability | Description |
|---|---|
| ✅ Ibis Connectors | Interoperability with 20+ data sources via Ibis (PostgreSQL, Snowflake, BigQuery, Databricks, etc.) |
| ✅ Remote Contract Loading | Load contracts from S3 (s3://) and Git (GitHub/GitLab URLs) |
| ✅ JSONPath Navigation | Navigate contract elements using JSONPath expressions (contract.resolve("$.schema[0].name")) |
| ✅ Static Checks | Auto-generated checks from contract elements: logicalType, logicalTypeOptions, required, unique, primaryKey |
| ✅ Library Metrics | Declare common data quality metrics (nullValues, missingValues, invalidValues, duplicateValues, rowCount) with type: library. SQL auto-generated at runtime |
| ✅ ODCS Schema Validation | Contracts validated against ODCS JSON Schema before execution |
| ✅ Filter Conditions | Incremental quality testing with wildcard pattern matching - optimised for append-only data sources |
| ✅ Multi-Schema Checks | Cross-table referential checks within a single contract |
| ✅ Multi-Connection Checks | Cross-table referential checks between different servers/databases via MultiSourceAdapter |
| ✅ Optional Extras | Add optional Spark support with .[spark] or install .[all] |
| ✅ Custom Adapters & Executors | Extensible architecture - create custom adapters and executors by extending BaseAdapter, BaseExecutor, or SQLExecutor |
Planned
| Capability | Description | Status |
|---|---|---|
| � Alternative Check Engines | Support for dqx, dbt, Soda, Great Expectations (subject to licensing review) | Planned |
| 📅 Parallel Check Execution | Run checks in parallel for faster validation across large contracts | Planned |
| 📅 vowl-ui | Web-based validation interface for vowl | Planned |
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to get started.
License
This project is licensed under the MIT License.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file vowl-0.0.2.tar.gz.
File metadata
- Download URL: vowl-0.0.2.tar.gz
- Upload date:
- Size: 9.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a981187b50f4276b5f81ac2ee07446dd8d988ae02ac976ce7297050378734219
|
|
| MD5 |
adaf0d12697e71c1e45083b8eca067e7
|
|
| BLAKE2b-256 |
06d2590d3d6262c666cf1240eef11f54508be4d0ecfa6590c7768a10f84a888f
|
Provenance
The following attestation bundles were made for vowl-0.0.2.tar.gz:
Publisher:
ci.yml on govtech-data-practice/vowl
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
vowl-0.0.2.tar.gz -
Subject digest:
a981187b50f4276b5f81ac2ee07446dd8d988ae02ac976ce7297050378734219 - Sigstore transparency entry: 1396712774
- Sigstore integration time:
-
Permalink:
govtech-data-practice/vowl@545fdadb91743785a7d15613539f4ae04fbaa014 -
Branch / Tag:
refs/tags/v0.0.2 - Owner: https://github.com/govtech-data-practice
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
ci.yml@545fdadb91743785a7d15613539f4ae04fbaa014 -
Trigger Event:
push
-
Statement type:
File details
Details for the file vowl-0.0.2-py3-none-any.whl.
File metadata
- Download URL: vowl-0.0.2-py3-none-any.whl
- Upload date:
- Size: 156.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7cd22610389561f4b0085bcdc2b5a1353f88fe06f3d1e1947675392316cd33ad
|
|
| MD5 |
29aafc9db30a066eee72f9b100f45282
|
|
| BLAKE2b-256 |
29bf6b7d0de768b17f8d941ebdfc2e4289ae42b4b0392b8c35110900ea7c77f4
|
Provenance
The following attestation bundles were made for vowl-0.0.2-py3-none-any.whl:
Publisher:
ci.yml on govtech-data-practice/vowl
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
vowl-0.0.2-py3-none-any.whl -
Subject digest:
7cd22610389561f4b0085bcdc2b5a1353f88fe06f3d1e1947675392316cd33ad - Sigstore transparency entry: 1396712781
- Sigstore integration time:
-
Permalink:
govtech-data-practice/vowl@545fdadb91743785a7d15613539f4ae04fbaa014 -
Branch / Tag:
refs/tags/v0.0.2 - Owner: https://github.com/govtech-data-practice
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
ci.yml@545fdadb91743785a7d15613539f4ae04fbaa014 -
Trigger Event:
push
-
Statement type: