
DQ Tester

A lightweight, simple data quality testing tool for CSV files and databases.

Overview

DQ Tester allows you to define data quality checks in SQL and organize them into reusable test catalogs. Execute tests against CSV files or databases, and monitor results through an interactive Streamlit dashboard.

Key Features:

  • 📝 Simple YAML configuration for tests and catalogs
  • 🗄️ Support for CSV files and databases (via ODBC)
  • 📊 Built-in Streamlit dashboard for monitoring test results
  • 🔍 Customizable SQL-based data quality checks
  • 💾 Test results stored in DuckDB for analysis

Requirements

  • Python 3.9+
  • ODBC driver for your database (currently tested with PostgreSQL)

Installation

pip install dq_tester

Setup with Claude

To use DQ Tester with Claude:

  1. Upload prompt_custom_instructions.txt as a project instruction file
  2. Upload the following example files to your project:
    • examples/catalog.yaml
    • examples/connections.yaml
    • examples/db_test_plan.yaml
    • examples/file_test_plan.yaml

Claude will use these files to help you create and manage data quality tests.

Python API

DQ Tester can be used directly in Python for custom workflows and integrations:

import sys

import dq_tester

# Run every test in the plan against the catalog's check definitions.
results = dq_tester.run_tests('examples/catalog.yaml', 'examples/file_test_plan.yaml')

# Collect failures and exit non-zero so CI pipelines can gate on them.
failed = [r for r in results if r['status'] == 'FAIL']

if failed:
    print(f"{len(failed)} tests failed")
    sys.exit(1)
print("All tests passed")

Quick Start

1. Configure Database Connections

Create a connections.yaml file:

connections:
  - name: sample_db
    driver: "PostgreSQL"
    server: "myserver"
    database: "demo_source"
    username: "user"
    password: "password"

  - name: results_db  # Required: results stored here
    driver: "DuckDB"
    database: "./results/test_results.duckdb"

Note: The results_db connection is required for storing test results.

2. Create a Catalog of DQ Checks

Create a catalog.yaml file defining reusable data quality checks:

dq_checks:
  - name: null_values
    type: sql
    sql: |
      select count(1)
      from {table_name}
      where {column_name} is null
  
  - name: duplicate_key
    type: sql
    sql: |
      select count(1)
      from (
        select {key_cols}
        from {table_name}
        group by {key_cols}
        having count(1) > 1
      ) t1
  
  - name: invalid_email_duckdb
    type: sql
    sql: |
      select count(1)
      from {table_name}
      WHERE NOT REGEXP_FULL_MATCH(
          {column_name},
          '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{{2,}}$'
      )

Parameters in SQL: Use {parameter_name} in your SQL to define parameters. These must be provided in your test plan.
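One way to picture the substitution (a sketch, not DQ Tester's internal code): the placeholders behave like Python `str.format` fields, with values supplied by the matching keys in the test plan entry. This is also consistent with the doubled braces (`{{2,}}`) in the email regex above, which escape literal braces under format-style templating.

```python
# Sketch of how {parameter_name} placeholders could be filled from a test
# plan entry -- illustrative only, not DQ Tester's actual implementation.
sql_template = (
    "select count(1)\n"
    "from {table_name}\n"
    "where {column_name} is null"
)

# Keys here mirror the parameters the test plan must supply.
test_params = {"table_name": "customers", "column_name": '"Email"'}

rendered = sql_template.format(**test_params)
print(rendered)
```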

3. Create a Test Plan

For CSV Files

Quick Start with csv-to-yaml:

DQ Tester can generate the YAML structure for your CSV file automatically:

dq_tester -a csv-to-yaml --csv-path examples/datasets/customers.csv

This outputs the column definitions which you can copy into your test plan:

has_header: true
delimiter: ','
columns:
  - name: Customer Id
    type: VARCHAR
  - name: First Name
    type: VARCHAR
  - name: Email
    type: VARCHAR
  # ... etc

Complete Test Plan:

Create a test plan (e.g., file_test_plan.yaml) by adding the generated output plus source, file_name, and dq_tests:

source: csv
file_name: examples/datasets/customers.csv
has_header: true
delimiter: ','
columns:
  - name: Customer Id
    type: VARCHAR
  - name: First Name
    type: VARCHAR
  - name: Email
    type: VARCHAR

dq_tests:
  - dq_test_name: null_values
    column_name: '"Customer Id"'
    threshold:
      type: count
      operator: "=="
      value: 0
  
  - dq_test_name: invalid_email_duckdb
    column_name: '"Email"'
    threshold:
      type: count
      operator: "=="
      value: 0
  
  - dq_test_name: total_records
    threshold:
      type: count
      operator: ">"
      value: 1000

CSV Requirements:

  • file_name: Path to the CSV file
  • has_header: Whether the CSV has a header row
  • delimiter: Column delimiter (usually ,)
  • columns: Column definitions (types are optional)

For Databases

Create a test plan (e.g., db_test_plan.yaml):

source: database
connection_name: sample_db
table_name: sales.salesorderdetail

dq_tests:
  - dq_test_name: null_values
    column_name: salesorderid
    threshold:
      type: count
      operator: "=="
      value: 0
  
  - dq_test_name: duplicate_key
    key_cols: salesorderdetailid
    threshold:
      type: count
      operator: "=="
      value: 0

Database Requirements:

  • connection_name: Name from connections.yaml
  • table_name: Fully qualified table name (e.g., schema.table)

CLI Commands

Run Tests

Execute a test plan against your data:

dq_tester -a run \
  -c examples/catalog.yaml \
  -t examples/file_test_plan.yaml \
  --connections-path connections.yaml

Required Options:

  • -a run or --action run: Action to execute
  • -c or --catalog-path: Path to catalog YAML file
  • -t or --test-plan-path: Path to test plan YAML file

Optional:

  • --connections-path: Path to connections YAML file (defaults to searching standard locations)

Validate Configuration

Validate your catalog and test plan files without running tests:

dq_tester -a validate \
  -c examples/catalog.yaml \
  -t examples/file_test_plan.yaml

Required Options:

  • -a validate or --action validate: Action to execute
  • -c or --catalog-path: Path to catalog YAML file
  • -t or --test-plan-path: Path to test plan YAML file

This checks for:

  • Valid YAML syntax
  • Required fields present
  • Proper test configuration
  • Parameter matching between catalog and test plan
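The parameter-matching step can be pictured with Python's `string.Formatter`, which extracts the named fields a SQL template expects so they can be compared against the keys a test plan supplies. This is a sketch under that assumption, not the tool's actual validator:

```python
# Sketch: extract the {placeholders} a catalog check expects, then diff
# them against the keys a test plan entry supplies. Illustrative only.
import string

def placeholders(sql_template: str) -> set:
    """Return the set of named format fields in a SQL template."""
    return {field for _, field, _, _ in string.Formatter().parse(sql_template)
            if field is not None}

check_sql = "select count(1) from {table_name} where {column_name} is null"
supplied = {"table_name", "column_name"}

# Any leftover placeholders would be flagged as a validation error.
missing = placeholders(check_sql) - supplied
```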

Generate CSV YAML Structure

Automatically generate the YAML structure for a CSV file:

dq_tester -a csv-to-yaml --csv-path examples/datasets/customers.csv

Required Options:

  • -a csv-to-yaml or --action csv-to-yaml: Action to execute
  • --csv-path: Path to the CSV file to analyze

This analyzes your CSV file and outputs the column definitions with inferred data types. Copy this output into your test plan to get started quickly.

Output example:

has_header: true
delimiter: ','
columns:
  - name: Index
    type: BIGINT
  - name: Customer Id
    type: VARCHAR
  - name: Email
    type: VARCHAR
  # ... additional columns

Launch Dashboard

Start the interactive Streamlit dashboard:

# Use default connections.yaml locations
dq_dashboard

# Specify connections file
dq_dashboard connections.yaml

# Use custom port
dq_dashboard --port 8080 connections.yaml

Options:

  • connections_file (positional, optional): Path to connections YAML file
  • --port: Port number for the dashboard (default: 8501)

Default connections.yaml locations:

If no connections file is specified, the dashboard searches in order:

  1. ./connections.yaml (current directory)
  2. ~/.dq_tester/connections.yaml (user home directory)
  3. /etc/dq_tester/connections.yaml (system-wide)
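The search order above can be sketched as follows (illustrative; the real CLI may differ in details):

```python
# Sketch of the documented connections.yaml search order -- not DQ Tester's
# internal code, just the resolution logic described above.
from pathlib import Path

SEARCH_PATHS = [
    Path("./connections.yaml"),                       # 1. current directory
    Path.home() / ".dq_tester" / "connections.yaml",  # 2. user home
    Path("/etc/dq_tester/connections.yaml"),          # 3. system-wide
]

def find_connections_file(explicit=None):
    """Return an explicit path if given, else the first existing candidate."""
    if explicit is not None:
        return Path(explicit)
    for candidate in SEARCH_PATHS:
        if candidate.exists():
            return candidate
    return None
```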

Command Reference

# Show help
dq_tester -h

# Available actions
dq_tester -a {validate,run,csv-to-yaml}

# Full options
dq_tester [-h] 
          [-c CATALOG_PATH] 
          [-t TEST_PLAN_PATH] 
          [--csv-path CSV_PATH]
          [-a {validate,run,csv-to-yaml}]
          [--connections-path CONNECTIONS_PATH]

Built-in DQ Checks

For CSV Files

  • invalid_records - Count of records that fail to parse
  • total_records - Total number of records
  • expected_delimiter - Validates the delimiter used
  • valid_header - Validates header structure

For Databases

  • total_records - Total number of records in the table

Note: Thresholds for built-in checks can be configured in your test plan.

Thresholds

Tests use thresholds to determine PASS/FAIL status:

Threshold Types

  • count: Compare absolute count values
  • pct: Compare percentage values (0-100)

Threshold Operators

  • ==: Equal to
  • !=: Not equal to
  • <: Less than
  • <=: Less than or equal to
  • >: Greater than
  • >=: Greater than or equal to

Example Thresholds

# Count-based threshold
threshold:
  type: count
  operator: "=="
  value: 0

# Percentage-based threshold
threshold:
  type: pct
  operator: "<="
  value: 5  # 5% or less
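The operator strings map naturally onto Python comparisons; a minimal sketch of threshold evaluation (illustrative, not DQ Tester's internal code):

```python
# Sketch: map a threshold's operator string to a comparison function and
# apply it to the measured value. Illustrative only.
import operator as op

OPERATORS = {
    "==": op.eq, "!=": op.ne,
    "<": op.lt, "<=": op.le,
    ">": op.gt, ">=": op.ge,
}

def threshold_passes(measured, threshold):
    """True when the measured count/percentage satisfies the threshold."""
    compare = OPERATORS[threshold["operator"]]
    return compare(measured, threshold["value"])
```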

Test Results

Each test produces one of three statuses:

  • PASS: The test result meets the threshold criteria (threshold comparison returns TRUE)
  • FAIL: The test result does not meet the threshold criteria
  • ERROR: The test execution failed (SQL error, connection issue, etc.)

Results are stored in the results_db DuckDB database and can be viewed in the dashboard.

Example Project Structure

my-dq-project/
├── connections.yaml
├── catalog.yaml
├── test_plans/
│   ├── customers_test_plan.yaml
│   └── orders_test_plan.yaml
└── results/
    └── test_results.duckdb

Configuration Reference

Catalog Configuration

dq_checks:
  - name: check_name           # Unique name for the check
    type: sql                  # Currently only 'sql' is supported
    sql: |                     # SQL query template
      select count(1)
      from {table_name}
      where {parameter_name} condition

Test Plan Configuration

# For CSV
source: csv | database
file_name: path/to/file.csv    # CSV only
has_header: true | false       # CSV only
delimiter: ','                 # CSV only
columns:                       # CSV only
  - name: column_name
    type: data_type            # Optional

# For Database
connection_name: name          # Database only
table_name: schema.table       # Database only

dq_tests:
  - dq_test_name: check_name   # References catalog
    parameter_name: value      # Match parameters in SQL
    threshold:
      type: count | pct
      operator: "==|!=|<|<=|>|>="
      value: number

Connections Configuration

connections:
  - name: connection_name
    driver: "PostgreSQL"       # ODBC driver name
    server: "hostname"
    database: "database_name"
    username: "user"
    password: "password"
  
  - name: results_db           # Required for storing results
    driver: "DuckDB"
    database: "./path/to/results.duckdb"

ODBC Driver Setup

PostgreSQL

Linux:

sudo apt-get install unixodbc unixodbc-dev odbc-postgresql

macOS:

brew install unixodbc psqlodbc

Windows: Download and install the PostgreSQL ODBC driver from the official website.

Other Databases

DQ Tester should work with any database that has an ODBC driver. Install the appropriate ODBC driver for your database and configure it in connections.yaml.
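For intuition, a connections.yaml entry maps onto a standard ODBC connection string (as used by libraries such as pyodbc). The exact string DQ Tester builds internally is an assumption; this sketch just shows the conventional field mapping:

```python
# Sketch: render a dict shaped like a connections.yaml entry as an ODBC
# connection string. The field mapping is an assumption, not DQ Tester's
# internal code.
def odbc_connection_string(entry):
    """Build a DRIVER=...;SERVER=...;... string, skipping empty fields."""
    parts = {
        "DRIVER": "{" + entry["driver"] + "}",
        "SERVER": entry.get("server", ""),
        "DATABASE": entry["database"],
        "UID": entry.get("username", ""),
        "PWD": entry.get("password", ""),
    }
    return ";".join(f"{key}={value}" for key, value in parts.items() if value)
```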

Dashboard Features

The Streamlit dashboard provides:

  • 📊 Key Metrics: Total tests, pass rate, recent failures, columns tested
  • 🎯 Status Distribution: Visual breakdown of PASS/FAIL/ERROR
  • 📈 Trends Over Time: Historical test results
  • 🔗 Connection Health: Test results by connection
  • 📋 Object Analysis: Test results by tested object
  • 🏷️ Column-Level Health: Identify problematic columns
  • 🔍 Cascading Filters: Filter by connection → object → column
  • 📥 CSV Export: Download filtered results

Examples

Complete examples are available in the examples/ directory:

  • catalog.yaml - Sample DQ check definitions
  • file_test_plan.yaml - CSV testing example
  • db_test_plan.yaml - Database testing example
  • connections.yaml - Connection configuration template

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For questions, issues, or feature requests, please contact: chad@kodda.io
