# Metadata-driven Data Quality Assessment Framework for Databricks / Delta Lake

Open-source Python library reimplementing the SQL Server DQ Assessment Framework natively on Databricks with Delta tables, PySpark DataFrames, and Python-based validation functions.

## What this Library Provides

A metadata-driven PySpark port of a SQL Server Data Quality Assessment Framework for Databricks/Delta Lake. It validates data in curated Delta tables using configurable rules — no code changes are needed to add new fields or update logic.
| Feature | SQL Server Implementation | Databricks Implementation |
|---|---|---|
| Storage | SQL Server tables | Delta Lake tables (Unity Catalog) |
| Pattern evaluation | T-SQL TVFs (generated dynamically) | Python closures (generated dynamically) |
| Row-level evaluation | CROSS APPLY TVF | PySpark UDF + DeltaTable.merge() |
| Custom rules | SQL expressions in configCustomQuery | Plain regex patterns or named validators |
| Email validation | Complex LIKE / CHARINDEX SQL | Native Python regex (enhanced) |
| Output | SELECT result sets | Spark DataFrames |
| Same field configuration | Script_02 SQL INSERT | ConfigManager Python API or raw SQL |
| Same pattern library | 118 patterns | 118 patterns (identical) |
| Same precedence logic | L0A RANK() CTE | Python ranking equivalent |
| Same validation levels | L01→L02→L03→L04→L99 | L01→L02→L03→L04→L99 (identical) |
| Same DQ columns | DQEligible BIT, DQViolations, DQFields | DQEligible BOOLEAN, DQViolations STRING, DQFields STRING |
| Same dual logging | auditDQChecks + statDQChecks | Same Delta tables |
| Same ExecutionID | NEWID() GUID | uuid.uuid4() |
| Same 1=1 scope filtering | WHERE 1=1 AND (...) | Python conditional filter |
| Same sticky DQ logic | CASE WHEN DQEligible=0 THEN 0 ELSE Result END | Spark when() equivalent |
## Architecture at a Glance

```
masterField (config) ──► configFieldAllowedPattern ──► mapDQChecks
        │                            │
 ConfigManager API      resolve_pattern_rules.py (L0A precedence)
        │                            │
        └────────────────────────────┘
                      │
        generate_rule_functions.py
                      │
          FunctionRegistry (UDFs)
                      │
     data_assessment_rules.py (DQRunner)
                      │
  Curated Table: DQEligible | DQViolations | DQFields
                      │
       ┌──────────────┴───────────────┐
 auditDQChecks                  statDQChecks
       └──────────────┬───────────────┘
               Reporting Views
      (v_auditDQChecks, v_statDQChecks)
```
Data flow: Config tables → precedence resolution → Python closure generation → UDF-based row assessment → DQ column updates + dual audit logging → reporting.
## Quick Start

```python
from dq_framework import DQFramework

# 1. Initialise
dq = DQFramework(spark, catalog="main", schema="dq")
dq.setup()

# 2. Configure a field (ConfigManager API — no SQL required)
cfg = dq.config
cfg.register_field(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS", data_category_type_id=4)
cfg.set_field_values(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS", min_data_length=6, max_data_length=255)
cfg.block_category(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS", "DataEmptiness")
cfg.allow_pattern(2, "Source.SRC_ContactPoint.EMAIL_ADDRESS", "Has At Sign")
cfg.add_custom_query_regex(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS",
                           r"^[^@\s]+@[^@\s]+\.[^@\s]+$", must_match=True)
cfg.add_mapping(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS",
                target_schema_name="Curated", target_table_name="Entity_Email_Denorm",
                target_field_name="EMAIL_ADDRESS", target_catalog_name="main")

# 3. Validate config FK integrity before generating functions
cfg.verify_config()

# 4. Generate field-level checker functions from config
dq.generate_rule_functions()

# 4b. Optional: validate SQL-type custom queries before assessment
dq.validate_custom_queries_sql()

# 5. Run the DQ assessment
exec_id = dq.run_assessment(schema_name="Curated")

# 6. Query results
dq.violations(exec_id).display()
dq.quality_scores(exec_id).display()
dq.fields_below_threshold(threshold=80).display()
dq.summary_by_table().display()
```

See `sample_usage/` for hands-on demo notebooks, sample CSV data, and the Excel config template.
## Built-in Usage Guide

```python
dq.guide()               # Full workflow, FullFieldName patterns, all ConfigManager methods
dq.sample_usage(spark)   # Print path to demo notebooks, sample data, and Excel template
```
## Prerequisite: DQ Columns on Curated Tables

Each curated Delta table being assessed must have three extra columns. Use the built-in helper — it reads mapDQChecks and adds them to every mapped table at once:

```python
dq.prepare_curated_tables()                   # all tables referenced in mapDQChecks
dq.prepare_curated_table("Schema", "Table")   # or a single table
```

Both methods are idempotent (safe to re-run) and require ALTER privilege on each target table. The assessment runner will raise a clear error if columns are missing, pointing to these methods.

The three columns added are:

```sql
DQEligible   BOOLEAN  -- 1=all checks passed, 0=at least one failed, NULL=not assessed
DQViolations STRING   -- [field: ViolationType], ... (accumulated violations)
DQFields     STRING   -- [field1], [field2] (all fields assessed on this row)
```
| Column | Value | Meaning |
|---|---|---|
| DQEligible | None / NULL | Row not yet assessed |
| DQEligible | True | All configured checks passed |
| DQEligible | False | At least one check failed |
| DQViolations | e.g. `[email_address: Data Type]` | Violated rules (sticky — accumulates across fields) |
| DQFields | e.g. `[email_address], [phone]` | All fields assessed on this row |
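The sticky behaviour can be sketched in plain Python. This is illustrative only — the framework implements it with Spark `when()` expressions, and `apply_assessment` is a hypothetical name, not a library function:

```python
def apply_assessment(dq_eligible, dq_violations, field, passed, violation_type=None):
    """Sticky DQ update for one field on one row.

    Mirrors CASE WHEN DQEligible=0 THEN 0 ELSE Result END:
    once a row has failed any check, later passes cannot flip it back.
    """
    if not passed:
        entry = f"[{field}: {violation_type}]"
        violations = f"{dq_violations}, {entry}" if dq_violations else entry
        return False, violations
    # A pass keeps eligibility only if the row has not already failed.
    return (False if dq_eligible is False else True), dq_violations

# email fails, then phone passes — the earlier failure is sticky
elig, viol = apply_assessment(None, None, "email_address", False, "Data Type")
elig, viol = apply_assessment(elig, viol, "phone", True)
```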
## Package Structure

```
dq_framework/
├── __init__.py                    # Exports: DQFramework, ConfigManager
├── framework.py                   # DQFramework facade — top-level orchestration
├── ddl_framework_tables.py        # ≡ Script_00_DDL_Framework_Tables — Delta table DDL (9 tables)
├── seed_master_data.py            # ≡ Script_01_Master_Reference_Data — 27 categories + 118 patterns
├── config.py                      # ConfigManager — Python API for all 5 config tables
├── engine/
│   ├── pattern_checks.py          # All 118 pattern check implementations
│   ├── resolve_pattern_rules.py   # Pattern precedence (L0A CTE equivalent)
│   ├── generate_rule_functions.py # ≡ p_DQ_GenerateRuleFunctions — FunctionRegistry + closures
│   └── data_assessment_rules.py   # ≡ p_DQ_DataAssessmentRules — DQRunner
├── reporting/
│   └── views.py                   # v_auditDQChecks + v_statDQChecks equivalents
└── agents/
    ├── __init__.py                # Exports SYSTEM_PROMPT, DQ_TOOLS, suggest_config
    ├── context.py                 # LLM system prompt + short context
    ├── tools.py                   # OpenAI-compatible tool definitions + execute_tool_call()
    └── suggest.py                 # Schema-based config auto-suggester (no LLM required)

dq_framework/sample_usage/         # Bundled inside the PyPI package — extracted to /Workspace/Users/{you}/...
├── 00-infrastructure.py           # One-time: create catalog/schema/volume
├── 01-install.py                  # Per-session: %pip install + dq.setup() + copy CSVs to volume + load as Delta
├── 02-config-pyspark.py           # Seed all config tables via Python ConfigManager methods
├── 02-config-sql.py               # Same config, seeded via SQL MERGE statements
├── 03-run.py                      # generate_rule_functions → run_assessment → reporting
├── mock_curated_contacts.csv      # Sample data — contacts
├── mock_curated_locations.csv     # Sample data — locations
├── mock_curated_vendors.csv       # Sample data — vendors
└── DBX DQ Framework - Data Model v2.xlsx  # Excel config authoring template (see Excel Template section)
# Run dq.sample_usage(spark) to extract all files to your user workspace

notebooks/                         # Developer / contributor notebooks
├── 00_quickstart.py               # Step-by-step walkthrough: configure → assess → results
├── 01_agent_assistant.py          # AI agent integration: auto-suggest, tool-calling, system prompt
├── 02_integration_test.py         # Integration test suite (run on cluster before publishing)
└── 03_seed_config.py              # Paste Excel-generated INSERT SQL here

tests/
├── conftest.py                    # Path setup + test split explanation
└── unit/
    ├── test_pattern_checks.py           # All 118 pattern functions (no cluster needed)
    ├── test_resolve_pattern_rules.py    # Pattern precedence logic
    ├── test_generate_rule_functions.py  # Closure building + FunctionRegistry
    ├── test_seed_master_data.py         # Seed data integrity (counts, IDs, FK)
    └── test_suggest.py                  # Schema-based config suggester
```
Getting sample files: run `dq.sample_usage(spark)` — it extracts the bundled sample notebooks, CSV data, and Excel template to `/Workspace/Users/{you}/databricks-dq-framework/sample_usage/` automatically. No Repos clone needed. Works on serverless and all compute types.
## 9 Framework Tables

All tables live in the dq schema (configurable). Three categories:

| Category | Tables |
|---|---|
| Framework-Seeded (auto-populated by `setup()`) | masterDataCategory, masterPattern |
| User-Managed (populated via ConfigManager or SQL) | masterField, configFieldValues, configFieldAllowedPattern, configCustomQuery, mapDQChecks |
| Results (auto-populated by `run_assessment()`) | auditDQChecks, statDQChecks |

| Table | Purpose |
|---|---|
| masterDataCategory | 27 field type classifications |
| masterPattern | 118 built-in validation patterns. Custom patterns use `_ID >= 1000` (IDs 1–999 are reserved for built-ins) |
| masterField | Source fields registered for assessment |
| configFieldValues | Data length (L01) and value range (L04) boundaries |
| configFieldAllowedPattern | Pattern allow/block rules per field |
| configCustomQuery | Custom validation expressions — plain regex, Spark SQL, or named validators (L02) |
| mapDQChecks | Maps source field → curated table column |
| auditDQChecks | Row-level violation audit log (failures only, append-only) |
| statDQChecks | Aggregated pass/fail statistics per execution (append-only) |

`generate_rule_functions()` output: prints a • line for each registered checker with its SQL expression count. Any DQ function referenced in `mapDQChecks` that has no rules configured (in `configFieldValues`, `configFieldAllowedPattern`, or `configCustomQuery`) is flagged with ⚠ and skipped by the assessment runner — avoiding misleading 100% quality scores for unchecked fields.
## Validation Levels (in order, early-exit on first failure)
Each generated field checker implements the same 5-level hierarchy as the SQL TVF:
| Level | Check Type | Priority |
|---|---|---|
| L01 | Data Length range (min/max character count) | Checked first |
| L02 | Custom expression rules (regex, Spark SQL, or named validator) | Checked second |
| L03 | 118-pattern checks (in PatternPriority order) | Checked third |
| L04 | Data Value range (min/max value comparison) | Checked fourth |
| L99 | Default PASS (if all prior checks pass) | Final |
Early exit: The checker returns immediately on first failure. Re-run after fixing the first violation to surface the next.
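The early-exit hierarchy can be sketched as a closure builder in plain Python. This is a minimal stand-in for the generated checkers, not the library's actual closure generator — the function name and return shape are illustrative:

```python
import re

def make_field_checker(min_len=None, max_len=None, custom_checks=(),
                       pattern_checks=(), min_value=None, max_value=None):
    """Build a closure implementing the L01→L02→L03→L04→L99 hierarchy,
    returning (level, passed) and exiting on the first failure."""
    def check(value):
        s = "" if value is None else str(value)
        # L01 — data length range
        if min_len is not None and len(s) < min_len:
            return ("L01", False)
        if max_len is not None and len(s) > max_len:
            return ("L01", False)
        # L02 — custom expression rules
        for rule in custom_checks:
            if not rule(s):
                return ("L02", False)
        # L03 — pattern checks, in PatternPriority order
        for pattern in pattern_checks:
            if not pattern(s):
                return ("L03", False)
        # L04 — data value range
        if min_value is not None and float(s) < min_value:
            return ("L04", False)
        if max_value is not None and float(s) > max_value:
            return ("L04", False)
        # L99 — default PASS
        return ("L99", True)
    return check

email_check = make_field_checker(
    min_len=6, max_len=255,
    custom_checks=[lambda v: bool(re.search(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", v))],
)
```

Because the checker returns at the first failing level, a value that is both too short and malformed reports only the L01 failure — matching the re-run behaviour described above.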
## 118 Patterns across 10 Categories

| Category | Count | Examples |
|---|---|---|
| DataType1 | 3 | Is Fully Numeric, Is Fully Decimal, Is Fully Text |
| DataType2 | 1 | Is AlphaNumeric |
| DataType3 | 4 | Is Date, Is Time, Is Timestamp, Is Boolean |
| SpecialCharacter | 32 | Has Comma, Has At Sign, Has Hyphen, Has Pipe Symbol, … |
| SpaceFound | 1 | Has Space |
| DataEmptiness | 34 | Is Empty or NULL, Is Virtually Empty with Spaces, … |
| InvalidKeyword | 40 | Has Keyword-n/a, Has Keyword-null, Has Keyword-missing, … |
| FullyDuplicatedCharacter | 1 | Has Fully Duplicated Character (e.g. "aaaa") |
| UnicodeCharacters | 1 | Has Unicode Characters (outside printable ASCII 0x20–0x7E) |
| CasingCheck | 2 | Has Lowercase Character, Has Uppercase Character |
Total: 118 built-in patterns. Pattern IDs 1–999 are reserved for framework use.
To add custom patterns (e.g. domain-specific invalid keywords), use IDs >= 1000.
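Two of the single-pattern categories are simple enough to sketch directly. These are plausible pure-Python approximations of what such checks do, not the packaged implementations in `pattern_checks.py`:

```python
def has_unicode_characters(value):
    """Sketch of the UnicodeCharacters pattern: flag any character
    outside the printable ASCII range 0x20–0x7E."""
    return any(not (0x20 <= ord(ch) <= 0x7E) for ch in (value or ""))

def has_fully_duplicated_character(value):
    """Sketch of the FullyDuplicatedCharacter pattern: the whole value is
    one character repeated, e.g. "aaaa" or "--------"."""
    return bool(value) and len(value) > 1 and len(set(value)) == 1
```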
## Configuring Pattern Rules (configFieldAllowedPattern)

This table is the heart of the framework — it controls which of the 118 built-in patterns apply to each field, and whether each pattern is required, blocked, or suppressed.

### The three selector columns

Each row in configFieldAllowedPattern points into masterPattern via one of three columns. Populate one column per row and leave the other two NULL.

| Column | Selects | Patterns activated |
|---|---|---|
| PatternCategory | An entire category family | Many — e.g. all 32 SpecialCharacter patterns at once |
| PatternSubCategory | A named group within a category | Several — e.g. all 8 Separator chars: `. , - _ / \ ; :` |
| PatternName | One exact pattern by name | Exactly 1 — e.g. only Has At Sign |
Precedence when two rules conflict on the same pattern: PatternName > PatternSubCategory > PatternCategory.
The most specific rule always wins. Within the same specificity level, Allowed wins over Not Allowed.
```python
cfg.block_category(1, "Source.T.FIELD", "SpecialCharacter")   # block all 32 special chars
cfg.allow_pattern(2, "Source.T.FIELD", "Has Hyphen")          # except hyphen (PatternName wins)
cfg.allow_pattern(3, "Source.T.FIELD", "Has Full Stop")       # except .
cfg.allow_pattern(4, "Source.T.FIELD", "Has At Sign")         # except @
```
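The precedence rule can be sketched as a small resolver. This is a toy stand-in for `resolve_pattern_rules.py` — the dict shapes and function name are illustrative, not the library's API:

```python
# Specificity rank: lower = more specific = wins (like the L0A RANK() CTE).
SPECIFICITY = {"PatternName": 0, "PatternSubCategory": 1, "PatternCategory": 2}

def resolve_pattern(rules, pattern):
    """Resolve Allowed/Not Allowed for one pattern from config rows.

    Most specific selector wins; within the same specificity,
    Allowed wins over Not Allowed. Returns None if no rule applies.
    """
    key_for = {"PatternName": pattern["name"],
               "PatternSubCategory": pattern["sub_category"],
               "PatternCategory": pattern["category"]}
    matches = [r for r in rules if r["value"] == key_for[r["selector"]]]
    if not matches:
        return None
    best = min(matches, key=lambda r: (SPECIFICITY[r["selector"]], not r["allowed"]))
    return best["allowed"]

rules = [
    {"selector": "PatternCategory", "value": "SpecialCharacter", "allowed": False},
    {"selector": "PatternName", "value": "Has At Sign", "allowed": True},
]
at_sign = {"name": "Has At Sign", "sub_category": "Symbol", "category": "SpecialCharacter"}
comma = {"name": "Has Comma", "sub_category": "Separator", "category": "SpecialCharacter"}
```

Here the category-level block covers both patterns, but the name-level Allowed rule overrides it for `Has At Sign` only.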
### When to use each level

**PatternCategory** — use when you want a blanket policy for a whole family:

- SpecialCharacter Not Allowed → blocks all 32 special chars in one row
- InvalidKeyword Not Allowed → blocks all 40 junk keywords (null, n/a, test, missing …)
- DataType3 Not Allowed → blocks values that look like dates, times, or booleans
- UnicodeCharacters Not Allowed → blocks any character outside printable ASCII (0x20–0x7E)
- FullyDuplicatedChar Not Allowed → blocks "aaaaaaa", "--------" etc.

**PatternSubCategory** — use when you want a subset without naming every pattern:

- Separator Allowed → allows `. , - _ / \ ; :` all at once (instead of 8 separate rows)
- Bracket Not Allowed → blocks `( ) [ ] { }`
- Emptiness Not Allowed → blocks NULL, all-spaces, and all 32 "virtually empty with X" patterns
- Symbol Not Allowed → blocks `! @ # $ % ^ & * ~ + < > ? =`
Available SubCategories in masterPattern:

| Category | SubCategory | Patterns included |
|---|---|---|
| SpecialCharacter | Symbol | `! @ # $ % ^ & * ~ + < > ? =` (15) |
| SpecialCharacter | Bracket | `( ) [ ] { }` (6) |
| SpecialCharacter | Separator | `. , - _ / \ ; : \|` (9) |
| SpecialCharacter | Quote | `` ' " ` `` (3) |
| DataEmptiness | Null | Is Empty or NULL only |
| DataEmptiness | Spaces | Is Virtually Empty with Spaces only |
| DataEmptiness | Character | All 32 "Is Virtually Empty with X" patterns |
| DataEmptiness | Emptiness (config alias) | All three: Null + Spaces + Character groups |
| InvalidKeyword | Generic | All 40 junk keywords |
**PatternName** — use for precision, one specific pattern:

- Has At Sign Allowed → `@` is OK (email local-part)
- Has Hyphen Allowed → `-` is OK (compound names, postal codes)
- Has Underscore Allowed → `_` is OK (email, login names)
- Has Full Stop Allowed → `.` is OK (email domain separator)
- Has Space Allowed → space is OK (city, vendor name, address)
- Has Space Not Allowed → no spaces (email, country code, postal code)
- Is Empty or NULL Allowed → field is optional — NULL is accepted
- Is Empty or NULL Not Allowed → field is mandatory — NULL fails
- Has Lowercase Character Not Allowed → enforces ALL-CAPS (country code, vendor code)
- Is Fully Numeric Not Allowed → value must not be a pure number (name fields)
- Is Fully Text Allowed (*) → value MUST contain no digits (see DataType rules below)
### Allowed vs Not Allowed — two different behaviours

The IsPatternAllowed flag works differently depending on the pattern category:

#### One-directional patterns (most categories)

These patterns only ask "is the unwanted thing present?".

| IsPatternAllowed | Behaviour |
|---|---|
| 0 / Not Allowed | Check fires → fail if the pattern is detected in the value |
| 1 / Allowed | Check is skipped — the pattern is acknowledged as acceptable, not asserted |

Applies to: SpecialCharacter, SpaceFound, CasingCheck, DataEmptiness, InvalidKeyword, FullyDuplicatedCharacter, UnicodeCharacters.

`Allowed` here is a suppressor, not a validator. It tells the framework "this pattern will appear in this field — don't flag it."
#### Bidirectional patterns (DataType1, DataType2, DataType3)

These check type conformance in both directions:

| IsPatternAllowed | Behaviour |
|---|---|
| 0 / Not Allowed | Fail if value IS this type — e.g. no pure numbers in a name field |
| 1 / Allowed | Fail if value IS NOT this type — enforces the type as mandatory |

Examples:

- Is Fully Numeric Not Allowed → vendor_name must not be a pure number → flags "99999", "123"
- Is Fully Numeric Allowed → raw_phone_number must ONLY be digits → flags "+91 9698..." (has +/space)
- Is Fully Text Allowed → state/city must contain no digits → flags "3H", "1st Avenue"
- Is Date Not Allowed → email_address cannot look like a date → flags "2024-01-15"
- Is AlphaNumeric Not Allowed → state field rejects codes that contain both letters AND digits
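The two behaviours can be captured in one small sketch. `evaluate` and its signature are illustrative, not the library's API:

```python
import re

def is_fully_numeric(value):
    """Toy DataType1 pattern: value consists only of digits."""
    return bool(re.fullmatch(r"\d+", value or ""))

def evaluate(pattern_fn, value, allowed, bidirectional):
    """Sketch of Allowed/Not Allowed semantics. Returns True (pass) or
    False (violation).

    - Not Allowed: violation whenever the pattern is present.
    - Allowed on a bidirectional (DataType) pattern: violation when the
      pattern is ABSENT — the type is mandatory.
    - Allowed on a one-directional pattern: check suppressed, always passes.
    """
    present = pattern_fn(value)
    if not allowed:
        return not present       # fail when the unwanted pattern is found
    if bidirectional:
        return present           # fail when the mandatory type is missing
    return True                  # suppressor: never fails
```

With this sketch, `Is Fully Numeric Not Allowed` flags `"99999"` in a vendor name, while `Is Fully Numeric Allowed` (bidirectional) flags `"+91 9698..."` in a digits-only phone field.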
### CasingCheck — one-directional only

Allowed has no effect on casing checks. Only Not Allowed enforces anything.

| Rule | Effect |
|---|---|
| Has Lowercase Character Not Allowed | Enforces ALL-CAPS — any a-z found = violation |
| Has Uppercase Character Not Allowed | Enforces all-lowercase — any A-Z found = violation |
| Has Lowercase Character Allowed | No effect (check silently suppressed) |

The check uses binary/case-sensitive matching (COLLATE Latin1_General_BIN equivalent). Accented characters (é, ü, ñ) only trigger the Unicode check, not the casing check — they are outside the a-z / A-Z ASCII ranges.
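The ASCII-range behaviour is easy to demonstrate with a minimal sketch of the lowercase check (an approximation, not the packaged implementation):

```python
def has_lowercase_character(value):
    """Sketch of the casing check: binary comparison against the ASCII
    a-z range only, so accented letters (é, ü, ñ) do not trigger it —
    they fall to the Unicode check instead."""
    return any("a" <= ch <= "z" for ch in (value or ""))
```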
### The override rule in practice

The design pattern across every field: block broadly at category level, then punch exceptions at pattern level.

email_address — 14 rules:

- Category SpecialCharacter Not Allowed → blocks all 32 special chars
- Category DataType3 Not Allowed → blocks date/time/timestamp/boolean values
- Category InvalidKeyword Not Allowed → blocks "null", "n/a", "test" etc.
- Category FullyDupChar Not Allowed → blocks "aaaaaaa"
- Category UnicodeCharacters Not Allowed → blocks non-ASCII (é, ü, 田中, шрифт)
- SubCat Emptiness Not Allowed → blocks NULL, spaces, all virtually-empty patterns
- PatternName Has At Sign Allowed → overrides SpecialCharacter block → `@` is OK ✓
- PatternName Has Hyphen Allowed → overrides → `-` is OK ✓
- PatternName Has Underscore Allowed → overrides → `_` is OK ✓
- PatternName Has Full Stop Allowed → overrides → `.` is OK ✓
- PatternName Has Space Not Allowed → spaces are still blocked (no override) ✓
- PatternName Is Empty or NULL Allowed → email is optional — NULL accepted ✓
### Typical starting configs by field type

| Field type | Category blocks | Pattern overrides (Allowed) |
|---|---|---|
| Email address | SpecialCharacter, DataType3, InvalidKeyword, FullyDupChar, UnicodeCharacters, Emptiness | Has At Sign, Has Full Stop, Has Hyphen, Has Underscore |
| Phone (formatted) | DataType1, DataType2, DataType3, InvalidKeyword, FullyDupChar, UnicodeCharacters | Has Hyphen, Has Plus Sign, Has Space, Has Open Parenthesis, Has Close Parenthesis |
| Phone (digits-only) | SpecialCharacter, DataType3, InvalidKeyword, FullyDupChar, UnicodeCharacters, Emptiness | Is Fully Numeric (Allowed = digits only, mandatory) |
| Postal / ZIP code | SpecialCharacter, DataType3, InvalidKeyword, FullyDupChar, UnicodeCharacters, Emptiness | Has Hyphen, Has Space |
| Person name | DataType1, DataType2, DataType3, SpecialCharacter, InvalidKeyword, FullyDupChar, Emptiness | Has Hyphen, Has Full Stop, Has Single Quote, Has Space |
| Vendor / company name | DataType3, SpecialCharacter, InvalidKeyword, FullyDupChar, UnicodeCharacters, Emptiness | Has Space, Has Hyphen, Has Full Stop |
| Country code (2-char) | DataType1, DataType2, DataType3, SpecialCharacter, InvalidKeyword, FullyDupChar, UnicodeCharacters, Emptiness | Is Fully Text (Allowed); add Has Lowercase Character Not Allowed for ALL-CAPS |
| Street address | DataType3, InvalidKeyword, FullyDupChar, UnicodeCharacters, Emptiness | SpecialChar Separator (Allowed), Has Space; block Has Exclamation Mark, Has Question Mark |
| Trade / lookup code | InvalidKeyword, FullyDupChar, UnicodeCharacters, Emptiness, 100% Numeric | SpecialChar-L1, SpecialChar-L2, Has Semicolon, Has Space |
## Custom Validation Rules (L02)

The configCustomQuery table stores a CustomQueryExpression for each field. Three options are available.

### Option 1: Plain regex (recommended for non-technical users)

```python
# Must match: value must satisfy the regex to PASS
cfg.add_custom_query_regex(1, "Source.T.EMAIL_ADDRESS",
                           r"^[^@\s]+@[^@\s]+\.[^@\s]+$", must_match=True,
                           description="Basic email format")

# Must NOT match: if the regex matches, the row FAILS
cfg.add_custom_query_regex(2, "Source.T.EMAIL_ADDRESS",
                           r"noemaildress", must_match=False,
                           description="Reject placeholder values")
```

The engine auto-detects regex patterns by the presence of metacharacters (`^ $ + * ? [ ( { | \`) and applies `re.search()` automatically.
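A rough sketch of that auto-detection heuristic — the library's exact detection logic may differ, and `passes_custom_rule` is an illustrative name:

```python
import re

def passes_custom_rule(value, expr, must_match=True):
    """Approximate the auto-detection described above: expressions
    containing regex metacharacters are applied with re.search();
    anything else is treated as a plain substring test."""
    if any(ch in expr for ch in "^$+*?[({|\\"):
        found = re.search(expr, value) is not None
    else:
        found = expr in value
    return found if must_match else not found
```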
### Option 2: Spark SQL expression

```python
# @InputValue is replaced with the actual column reference at assessment time
cfg.add_custom_query_sql(1, "Source.T.AMOUNT",
                         "CAST(@InputValue AS DOUBLE) > 0",
                         is_condition_allowed=True,
                         description="Amount must be positive")
```

SQL expressions are applied at the DataFrame level via `F.expr()` — not inside a Python UDF — so they support any Spark SQL built-in function.
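The placeholder step amounts to a string substitution before the expression reaches `F.expr()`. A minimal sketch — the backtick quoting and helper name are assumptions, not the library's exact behaviour:

```python
def build_sql_condition(expression, column):
    """Sketch of @InputValue substitution: swap the placeholder for a
    column reference before handing the string to F.expr()."""
    return expression.replace("@InputValue", f"`{column}`")

cond = build_sql_condition("CAST(@InputValue AS DOUBLE) > 0", "AMOUNT")
# cond is now a condition string suitable for df.withColumn("ok", F.expr(cond))
```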
### Option 3: Named validator (for complex logic)

Register a Python function once, then reference it by name in any field config:

```python
import re

# Register (once, at cluster startup or in a setup cell)
dq.register_validator("au_mobile", lambda v: bool(re.match(r'^04[0-9]{8}$', v or '')))

# Reference by name in config — query_id is optional, auto-assigned if omitted
cfg.add_custom_query("Source.T.MOBILE_NUMBER", "au_mobile",
                     is_condition_allowed=True, custom_query_type="PYTHON")
```

Built-in email validators are pre-registered:
| Validator Name | Description |
|---|---|
| email_basic_format | Regex: basic x@y.z format |
| email_at_validation | @ position, count, dot after @ |
| email_domain_format | 1–3 dots, no trailing dot, 2+ char labels |
| email_local_part_dots | Max 2 dots in local part |
| email_no_placeholder | Must not contain 'noemaildress' |
| email_no_trailing_special | Must not end with ., -, or _ |
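For a sense of what these validators do, here are plausible pure-Python sketches of two of them — approximations of the descriptions above, not the packaged implementations:

```python
import re

def email_basic_format(v):
    """Plausible sketch of email_basic_format: basic x@y.z shape."""
    return bool(re.search(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", v or ""))

def email_no_trailing_special(v):
    """Plausible sketch of email_no_trailing_special:
    address must not end with '.', '-' or '_'."""
    return bool(v) and v[-1] not in ".-_"
```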
## Scope Filtering (1=1 Pattern Preserved)

```python
# All schemas, all tables, all fields
dq.run_assessment()

# Specific schema
dq.run_assessment(schema_name="Curated")

# Specific table
dq.run_assessment(schema_name="Curated", table_name="Individual_Denorm")

# Single field
dq.run_assessment(schema_name="Curated",
                  table_name="Individual_Denorm",
                  field_name="FIRST_NAME")

# Reset previously-failed rows before re-assessment
dq.run_assessment(schema_name="Curated", reset_eligible_flag=True)
```
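The preserved `1=1` idiom is simply a way to build a predicate chain where every scope argument is optional. A sketch of the idea — column names here are illustrative, not the framework's actual metadata columns:

```python
def build_scope_filter(schema_name=None, table_name=None, field_name=None):
    """Sketch of the WHERE 1=1 pattern: start from an always-true base
    predicate and append one clause per supplied scope argument, so the
    query stays valid no matter which combination is passed."""
    clauses = ["1=1"]
    if schema_name:
        clauses.append(f"SchemaName = '{schema_name}'")
    if table_name:
        clauses.append(f"TableName = '{table_name}'")
    if field_name:
        clauses.append(f"FieldName = '{field_name}'")
    return " AND ".join(clauses)
```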
## Reporting

Four helper methods return Spark DataFrames ready for `.display()` or further transformation:

```python
exec_id = dq.run_assessment(schema_name="Curated")

dq.violations(exec_id).display()                    # Row-level violations (field, violation type, value)
dq.quality_scores(exec_id).display()                # Pass/fail counts + % per field
dq.fields_below_threshold(threshold=80).display()   # Fields with quality < 80%
dq.summary_by_table(exec_id).display()              # Aggregated quality score per table
```

Two persistent reporting views are also created in the dq schema by `setup()`:

| View | Purpose |
|---|---|
| v_auditDQChecks | Row-level violations joined with field metadata |
| v_statDQChecks | Aggregated pass/fail statistics with PercentageQualified |
## AI Agent Support

The dq_framework.agents module provides three ways to use AI to configure the framework. No agent framework (Semantic Kernel, LangChain, AutoGen) is required.

### Approach A — Auto-suggest from schema (no LLM needed)

```python
from dq_framework.agents import suggest_config

schema = [
    ("EMAIL_ADDRESS", "string"),
    ("FIRST_NAME", "string"),
    ("POSTAL_CODE", "string"),
    ("CUSTOMER_ID", "int"),
]

# Returns ready-to-run ConfigManager code
code = suggest_config(schema, source_schema="Source", source_table="SRC_Customer",
                      curated_schema="Curated", curated_table="Customer_Denorm",
                      catalog="main")
print(code)
```

Or read the schema directly from a live Delta table:

```python
schema = spark.table("main.Source.SRC_Customer").dtypes
code = suggest_config(schema, source_schema="Source", source_table="SRC_Customer", ...)
```
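To illustrate the general idea of schema-driven suggestion (this toy classifier is NOT the library's `suggest_config` — its name-matching rules and output strings are invented for illustration):

```python
def suggest_rules(column_name, dtype):
    """Toy name-based heuristic in the spirit of a schema-driven
    suggester: map column name/type to a starting rule set."""
    name = column_name.lower()
    if dtype != "string":
        return ["skip: non-string columns are typically range-checked via configFieldValues"]
    if "email" in name:
        return ["block_category SpecialCharacter", "allow_pattern Has At Sign",
                "allow_pattern Has Full Stop", "add_custom_query_regex email format"]
    if "name" in name:
        return ["block_pattern Is Fully Numeric", "allow_pattern Has Space"]
    return ["block_category InvalidKeyword"]
```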
### Approach B — Chat assistant with tool calling

The 14 framework operations are exposed as OpenAI-compatible JSON Schema tool definitions. They work unchanged with OpenAI, Azure OpenAI, Anthropic Claude, Semantic Kernel, AutoGen, and LangChain.

```python
from dq_framework.agents import SYSTEM_PROMPT, DQ_TOOLS
from dq_framework.agents.tools import execute_tool_call
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="...", api_key="...", api_version="2024-05-01-preview")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Configure DQ rules for Source.SRC_Party.EMAIL_ADDRESS"},
    ],
    tools=DQ_TOOLS,
)

for call in response.choices[0].message.tool_calls or []:
    result = execute_tool_call(call.function.name, call.function.arguments, dq, cfg)
    print(f"[Tool] {call.function.name} → {result}")
```

See `notebooks/01_agent_assistant.py` for Azure OpenAI, Anthropic Claude, and Semantic Kernel examples.
Note: `notebooks/` contains developer/contributor notebooks. The consumer-facing demo is in `sample_usage/` — run `dq.sample_usage(spark)` for the path.
### Approach C — Paste system prompt into ChatGPT / Copilot

```python
from dq_framework.agents import SYSTEM_PROMPT
print(SYSTEM_PROMPT)  # Copy and paste into any chat interface
```
## Testing

### Unit tests — no cluster needed

Run locally with Python and pytest. No Databricks, no Spark, no Delta Lake required.

```shell
pip install dq-framework pytest
pytest tests/unit/ -v
```

| Test file | What it covers |
|---|---|
| test_pattern_checks.py | All major pattern check functions (DataEmptiness, SpecialChar, InvalidKeyword, …) |
| test_resolve_pattern_rules.py | Pattern precedence — specific overrides broad, inactive excluded |
| test_generate_rule_functions.py | L01/L02/L04 closure building, FunctionRegistry, stale function removal |
| test_seed_master_data.py | Seed data integrity: 27 categories, 118 patterns, no duplicate IDs |
| test_suggest.py | Schema-based config suggester: column classification, code generation |
### Integration tests — requires a Databricks cluster
Open notebooks/02_integration_test.py on a Databricks cluster. It creates throwaway test schemas, runs the full pipeline end-to-end, and drops everything on completion. For a hands-on consumer demo instead, use the notebooks in sample_usage/.
| Test | What it verifies |
|---|---|
| Test 1 | Framework setup — 9 tables created, 27 categories, 118 patterns seeded |
| Test 2 | setup() is idempotent — no duplicate data on second call |
| Test 3 | Synthetic curated table created with known good/bad email values |
| Test 4 | ConfigManager — all 5 user-managed tables populated |
| Test 5 | generate_rule_functions() — checker built and directly invocable |
| Test 6 | run_assessment() — DQ columns written correctly, specific rows verified |
| Test 7 | Reporting views — violations count, quality scores, threshold filter, summary |
| Test 8 | reset_eligible_flag — failed rows cleared, re-assessment gives same result |
| Test 9 | Custom keyword extensibility via add_invalid_keyword() |
| Test 10 | cfg.verify_config() passes with no FK issues |
## Platform Requirements
| Requirement | Detail |
|---|---|
| Databricks Runtime | 12.0+ (includes PySpark 3.3+) |
| Delta Lake | 2.0+ |
| Python | 3.9+ |
| Unity Catalog | Recommended (pass catalog="" for legacy Hive metastore) |
| python-dateutil | Included in Databricks Runtime; required for full date/time validation |
## Adding a New Source Field

1. Register the field: `cfg.register_field(id, "Schema.Table.Column", data_category_type_id=N)`
2. Set length bounds: `cfg.set_field_values(id, ffn, min_data_length=M, max_data_length=N)`
3. Add pattern rules: `cfg.block_category(...)`, `cfg.allow_pattern(...)`, `cfg.block_pattern(...)`
4. Add custom expressions: `cfg.add_custom_query_regex(id, ffn, r"your_regex", must_match=True)`
5. (Optional) Add custom invalid keywords: `dq.add_invalid_keyword(id, "your_keyword")`
6. Map to curated column: `cfg.add_mapping(id, ffn, target_schema_name=..., target_table_name=..., ...)`
7. Validate FK integrity: `cfg.verify_config()`
8. Re-run: `dq.generate_rule_functions()`
9. (Optional) Pre-flight SQL validation: `dq.validate_custom_queries_sql()`
## Excel Template — Config Authoring Workbook

The workbook (`DBX DQ Framework - Data Model v2.xlsx`, located in `sample_usage/`) is the recommended way to author and maintain DQ configurations for any non-trivial field set.
### Why use the template?
Configuring DQ rules manually in Python or SQL for 10+ fields — each with dozens of pattern allow/block rules, length bounds, and custom expressions — is repetitive, error-prone, and hard to audit. The Excel template solves this by acting as both a configuration record and a code generator:
- One row per rule; columns map directly to framework concepts (field name, pattern category, allow/block, expression, description, active flag …)
- Built-in formulas generate ready-to-run Python method calls and SQL MERGE rows from the same data — no manual escaping or formatting required
- The filled-in workbook becomes a human-readable audit trail of what rules are configured, why, and when — something that is hard to get from raw code alone
- Non-developers (data stewards, business analysts) can review, edit, and sign off on rules in a familiar spreadsheet before anyone runs a single notebook cell
### Sheets and what to fill in

| Sheet | Maps to table | What to fill in |
|---|---|---|
| masterField | masterField | Field ID, FullFieldName (pattern A or B), DataCategoryTypeID |
| configFieldValues | configFieldValues | Per-field min/max data length (L01) and min/max value range (L04) |
| configFieldAllowedPattern | configFieldAllowedPattern | Pattern allow/block rules — one row per rule (L03) |
| configCustomQuery | configCustomQuery | Custom regex, Spark SQL, or named validator expressions (L02) |
| mapDQChecks | mapDQChecks | Field → curated table column mappings |
### Deciding pattern rules for a field (configFieldAllowedPattern)

This is the most complex sheet to fill in and the one where the template adds the most value. See Configuring Pattern Rules above for the full guide — three-column selector rules, Allowed vs Not Allowed behaviour for each category, CasingCheck specifics, the override mechanism, and ready-made starting configs by field type.

Quick summary for the Excel sheet:

- One row per rule; exactly one of `PatternCategory` / `PatternSubCategory` / `PatternName` populated per row
- Start with broad category blocks, then add PatternName exceptions (Allowed) to punch holes
- `Allowed` on DataType patterns is a mandatory type assertion (value MUST be that type)
- `Allowed` on all other categories is a suppressor (skip the check for this pattern)
- A typical email field needs ~10–14 rows; a postal code ~6–8; a name field ~8–12

The template makes it easy to copy a field's block as a starting point for a similar field and adjust.
### Two output columns — M (PySpark) and N (SQL)

Each data sheet has two formula output columns that generate ready-to-paste code:

| Column | Generates | Paste destination |
|---|---|---|
| M | `dq.config.<method>(...)` Python call | Directly into a Python notebook cell |
| N | `SELECT ... UNION ALL` row | Inside the body of a SQL MERGE template (see below) |
Column M is the recommended path. Copy the M cells for your rows, paste into
02-config-pyspark.py (or any Python cell), and run. The formula handles all
Databricks-compatible escaping internally — backslash doubling and single-quote escaping
are applied in the correct order so add_custom_query() stores the expression correctly.
Column N (SQL MERGE path) produces rows for a bulk MERGE statement. The generated rows must be embedded inside a Python raw string (`r"""..."""`) — never an f-string (`f"""..."""`):

```python
# Correct: raw string for the SQL body, f-string only for catalog/schema names
tgt = f"`{MY_CATALOG}`.`{DQ_SCHEMA}`.`configCustomQuery`"
query = (
    r"MERGE INTO " + tgt + r" AS t "
    r"USING (SELECT * FROM (VALUES "
    r" (1, 'Schema.Table.Field', 'REGEX', '\\d+', ...) "  # ← paste N-column rows here
    r") AS s(...) WHERE _ID IS NOT NULL) ON t._ID = s._ID "
    r"WHEN MATCHED AND (...) THEN UPDATE SET ... "
    r"WHEN NOT MATCHED THEN INSERT ..."
)
spark.sql(query)
```
Why `r"""..."""`? Python f-strings process backslash sequences before Spark SQL ever sees the string — `\'` becomes `'` silently, breaking `@InputValue` expressions with `PARSE_SYNTAX_ERROR: Syntax error at or near '@'`. A raw string passes the SQL verbatim.
If you paste N-column rows directly into a SQL notebook cell (not a Python string), replace every `''` inside expression values with `\'` — Databricks SQL legacy mode uses backslash escaping, not ANSI quote-doubling.
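The raw-string difference is visible in plain Python, independent of Spark:

```python
# In a normal (or f-) string Python consumes the backslash escape;
# in a raw string the backslash survives and reaches Spark SQL intact.
plain = "\'"   # one character: just the quote
raw = r"\'"    # two characters: backslash + quote
```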
### Escaping rules (handled automatically by the formula)

| In your expression | Formula produces | Stored in table | Spark SQL sees |
|---|---|---|---|
| `\d+` (regex backslash) | `\\d+` | `\\d+` | `\d+` ✓ |
| `O'Brien` (quote) | `O\'Brien` (M) / `O''Brien` (N) | `O'Brien` | `O'Brien` ✓ |

You do not need to pre-escape anything before typing into the Excel cells. Fill in the raw expression (e.g. `^\d{4}$`); the formula handles the rest.
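The escaping order matters. A Python sketch consistent with the table above — the actual logic lives in the Excel formulas, so the exact ordering shown here is an assumption:

```python
def escape_for_m_column(expr):
    """Sketch of the M-column (Python) escaping: double backslashes FIRST,
    then escape single quotes. Reversing the order would also double the
    backslash just inserted for the quote."""
    return expr.replace("\\", "\\\\").replace("'", "\\'")

def escape_for_n_column(expr):
    """Sketch of the N-column (SQL) escaping: backslash doubling plus
    SQL-style quote doubling."""
    return expr.replace("\\", "\\\\").replace("'", "''")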
## Known Limitations

| # | Area | Description |
|---|---|---|
| 1 | Bug — `add_custom_query()` | The CustomQueryDescription column is NOT NULL in the DDL, but the method passes null when no description argument is provided. Always supply a description when calling `add_custom_query()`, `add_custom_query_regex()`, or `add_custom_query_sql()` to avoid an INSERT constraint violation. |
| 2 | Silent pattern skip | If a PatternName in configFieldAllowedPattern does not match any row in masterPattern, the rule is silently ignored (equivalent to SQL's ELSE NULL). A typo in a pattern name will go undetected unless `verify_config()` is run. |
| 3 | No transaction guarantees | The assessment writes DQ columns, audit rows, and stat rows in three separate Spark actions. A cluster failure between steps can leave partial results. Re-run the assessment with `reset_eligible_flag=True` to recover. |
| 4 | Date validation outside Databricks | If python-dateutil is not installed, date/time pattern checks fall back to three hard-coded regex formats. This is rarely an issue on Databricks Runtime but may affect local unit test runs that exercise the date validators. |
| 5 | No parameterised SQL | Config is written via f-string SQL. Backslash and single-quote escaping uses Databricks legacy Spark SQL mode (`\` as escape character). Field names and expressions provided by trusted developers are safe; do not pass untrusted user input directly into ConfigManager methods. |
*Data Quality Non-Functional Assessment Framework — Databricks Edition. Generalised for open-source publication — no client-specific data included.*