DataBase Quality Tool

These details have not been verified by PyPI

Project links

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Software Development :: Quality Assurance

Project description

DBQT (DataBase Quality Tool)

DBQT is a lightweight, Python-first data quality testing framework that helps data teams maintain high-quality data through automated checks and intelligent suggestions.

Current Tools

Column Comparison Tool (dbqt colcompare / dbqt compare)

Compare schemas between databases or files:

Table-level comparison
Column-level comparison with data type compatibility checks
Support for CSV and Parquet files
Handles nested Parquet schemas (arrays, structs, maps)
Database-backed comparison — fetch metadata directly from source/target databases
Retrieves precision information (datetime, numeric precision, numeric scale)
Supports table, schema.table, and db.schema.table patterns in CSV files
Intelligent data type compatibility checking
Customizable type mappings via YAML configuration
Generates interactive HTML report (powered by Tabulator) with:
- Table differences
- Column differences
- Data type mismatches
- Sortable, filterable, searchable tables
- Built-in XLSX export button

Usage:

# Basic file-based comparison
dbqt colcompare source_schema.csv target_schema.csv

# Compare Parquet files directly
dbqt colcompare source.parquet target.parquet

# Database-backed comparison (fetches metadata from live databases)
dbqt colcompare --source-config source_config.yaml --target-config target_config.yaml

# Use a custom type mappings config for database-backed comparison
dbqt colcompare --source-config source_config.yaml --target-config target_config.yaml --type-config my_types.yaml

# Generate a default column type mappings configuration file (works with both colcompare and dbstats)
dbqt colcompare --generate-col-mappings

# Generate config with custom output path
dbqt colcompare --generate-col-mappings --output my_types.yaml

# Use custom type mappings for comparison (file-based)
dbqt colcompare source.csv target.csv --config my_types.yaml

# Use custom type mappings for comparison (database-backed)
dbqt colcompare --source-config source_config.yaml --target-config target_config.yaml --type-config my_types.yaml

# Exclude specific columns from comparison (CLI flag)
dbqt colcompare source.csv target.csv --excluded-cols CREATED_AT UPDATED_AT

# Also available via alias
dbqt compare source.csv target.csv

Customizing Type Mappings:

The tool uses intelligent type compatibility checking (e.g., INT and BIGINT are considered compatible). You can customize these mappings:

Generate a default configuration file:
```
dbqt compare --generate-col-mappings
```

Edit the generated colcompare_config.yaml file to add or modify type groups and excluded columns:

description: Column comparison type mappings configuration. Each key represents a type group, and the list contains equivalent types. excluded_cols is a list of column names to ignore during comparison.
type_mappings:
  INTEGER:
  - INT
  - INTEGER
  - BIGINT
  - SMALLINT
  - TINYINT
  - NUMBER
  VARCHAR:
  - VARCHAR
  - TEXT
  - CHAR
  - STRING
  - NVARCHAR
  - VARCHAR2
  # Add your custom types here
# Column names to exclude from comparison (case-insensitive)
excluded_cols:
  - CREATED_AT
  - UPDATED_AT

Use your custom configuration:

dbqt compare source.csv target.csv --config colcompare_config.yaml

Note: excluded_cols can also be placed directly in your connection YAML configs (source and/or target). When running database-backed comparisons, exclusions from all configs are merged automatically.

To generate CSV schema files from your database, run this query:

SELECT
    upper(table_schema) as SCH, --optional
    upper(table_name) as TABLE_NAME,
    upper(column_name) as COL_NAME,
    upper(data_type) as DATA_TYPE --optional
FROM information_schema.columns
where UPPER(table_schema) = UPPER('YOUR_SCHEMA')
order by table_name, ordinal_position;

Export the results to CSV format to use with the compare tool.

Parquet Combine Tool (dbqt combine)

Combine multiple Parquet files into a single file:

Validates schema compatibility
Preserves nested data structures
Handles large datasets efficiently

Usage:

dbqt combine [output.parquet]  # Combines all .parquet files in current directory

Database Statistics Tool (dbqt dbstats / dbqt rowcount / dbqt stats)

Collect and analyze database statistics with multiple modes:

rowcount (default) — Fetches table row counts in parallel
colcompare — Compares column schemas between source and target databases
both — Runs row counts then column comparison in one command
Supports both single table analysis and source/target table comparisons
Supports separate source and target database configurations for cross-database comparisons
Supports table, schema.table, and db.schema.table patterns in CSV files
Automatically calculates differences and percentage changes for comparisons
Generates interactive HTML reports with sortable, filterable tables
Built-in XLSX export from the browser
Configurable through YAML

Usage:

# Row counts — single database mode
dbqt dbstats --config config.yaml

# Row counts — dual database mode (source and target in different databases)
dbqt dbstats --source-config source_config.yaml --target-config target_config.yaml

# Column comparison via dbstats
dbqt dbstats colcompare --source-config source_config.yaml --target-config target_config.yaml

# Both row counts and column comparison in one run
dbqt dbstats both --source-config source_config.yaml --target-config target_config.yaml

# Both mode with custom type mappings
dbqt dbstats both --source-config source_config.yaml --target-config target_config.yaml --type-config colcompare_config.yaml

# Generate a default column type mappings configuration file
dbqt dbstats --generate-col-mappings

# Also available via aliases
dbqt rowcount --config config.yaml
dbqt stats --config config.yaml

Example config.yaml (single database mode):

# Database connection configuration
connection:
  type: mysql  # mysql, snowflake, duckdb, csv, parquet, s3parquet, postgresql, sqlserver, athena
  host: localhost
  user: myuser
  password: mypassword
  database: mydb
  # Optional AWS configs for s3parquet
  # aws_profile: default
  # aws_region: us-west-2
  # bucket: my-bucket

  # Snowflake-specific configs
  # type: snowflake
  # account: your_account.region
  # warehouse: YOUR_WAREHOUSE
  # database: YOUR_DB
  # schema: YOUR_SCHEMA
  # role: YOUR_ROLE
  # authenticator: externalbrowser  # Optional: use SSO authentication
  # user: your_username
  # password: your_password  # Not needed if using externalbrowser auth

# Path to CSV file containing table names to analyze
tables_file: tables.csv

# Optional: number of parallel workers (default: 4)
max_workers: 10

# Optional: exclude tables matching SQL-like patterns (case-insensitive)
# % matches any sequence of characters
excluded_tables:
  - "%_FINAL"
  - "TMP_%"
  - "%_BAK_%"

Example dual database configuration (for cross-database comparisons):

source_config.yaml:

connection:
  type: mysql
  host: source-db.example.com
  user: source_user
  password: source_pass
  database: source_db

tables_file: tables.csv  # CSV with source_table and target_table columns
max_workers: 10

target_config.yaml:

connection:
  type: snowflake
  account: myorg.snowflakecomputing.com
  user: target_user
  password: target_pass
  warehouse: COMPUTE_WH
  database: TARGET_DB
  schema: PUBLIC
  role: ANALYST

The tables.csv file should contain either:

A table_name column for single table analysis (adds row_count and notes columns)
source_table and target_table columns for comparison analysis (adds row counts, notes, difference, and percentage difference columns)

Auto-discovery: If tables_file is omitted from the YAML config, dbqt will automatically discover all tables and views in the configured database schema and use those for row counts and/or column comparisons. In dual-database mode, only tables present in both databases will have row counts collected — tables that exist in only one side are reported with a note like "Only in source, row count skipped" since comparing a missing table is not meaningful.

Excluding tables: Add excluded_tables to your YAML config to filter out tables by pattern. Patterns use SQL-like % wildcards and are case-insensitive:

excluded_tables:
  - "%_FINAL"    # any table ending in _FINAL
  - "TMP_%"      # any table starting with TMP_
  - "%_BAK_%"    # any table containing _BAK_

This works with both CSV-provided table lists and auto-discovered tables.

Table names in the CSV support flexible path formats:

my_table — uses database/schema from YAML config
my_schema.my_table — overrides schema, uses database from config
my_db.my_schema.my_table — fully qualified, ignores config defaults

Note: When using dual database mode (--source-config and --target-config), the tool will:

Connect to the source database to count rows in source_table column
Connect to the target database to count rows in target_table column
This enables comparing tables across different database systems (e.g., MySQL to Snowflake migration validation)

Null Column Check Tool (dbqt nullcheck)

Check for columns where all records are null across multiple tables in Snowflake.

Identifies completely empty columns.
Reports on columns with low-distinct values (<=5).
Efficiently checks multiple tables in parallel.
Generates a markdown report summarizing the findings.

Usage:

dbqt nullcheck --config snowflake_config.yaml

This tool currently only supports Snowflake.

Dynamic Query Tool (dbqt dynamic-query)

Run a dynamic SQL query against Athena for a list of values from a CSV file.

Substitutes values from a CSV into a query template.
Executes queries sequentially and writes results to an output file.
Useful for running the same query against multiple tables or with different parameters.

Usage:

dbqt dynamic-query --config athena_config.yaml --csv values.csv --query "SELECT COUNT(1) FROM {var_from_csv}"

This tool currently only supports AWS Athena.

Parquetizer Tool (dbqt parquetizer)

A utility to recursively find files that are Parquet but lack the .parquet extension and rename them.

Scans a directory for files without extensions.
Validates if a file is a Parquet file by checking its magic bytes.
Renames valid Parquet files to include the .parquet extension.

Usage:

dbqt parquetizer [directory] # Scans from the specified directory (or current if not provided)

Key Finder Tool (dbqt keyfinder)

Automatically discover composite keys (primary keys) in database tables by analyzing column combinations.

Finds minimal composite keys (smallest column combinations that uniquely identify rows)
Checks for NULL values (columns with NULLs cannot be part of a key)
Prioritizes ID-like columns (columns named id, *_id, id_*) for faster discovery
Supports filtering columns with --include-only or --exclude options
Configurable maximum key size and column limits
Parallel-safe with progress tracking
Works with any supported database connector (Snowflake, MySQL, PostgreSQL, etc.)

Usage:

# Basic usage - find keys in a table
dbqt keyfinder --config config.yaml --table users

# Limit search to keys of size 3 or less
dbqt keyfinder --config config.yaml --table orders --max-size 3

# Exclude certain columns from the search
dbqt keyfinder --config config.yaml --table products --exclude created_at updated_at

# Only search specific columns
dbqt keyfinder --config config.yaml --table transactions --include-only user_id transaction_date store_id

# Force execution even with high combination counts
dbqt keyfinder --config config.yaml --table large_table --force

# Verbose output with progress updates
dbqt keyfinder --config config.yaml --table data --verbose

Example config.yaml:

connection:
  type: snowflake
  user: myuser
  password: mypass
  account: myaccount
  database: mydb
  schema: myschema
  warehouse: mywh
  role: myrole

Performance Tips:

The tool checks combinations starting from size 1 and stops when it finds minimal keys
ID-like columns are checked first for faster discovery
Use --max-size to limit the search space for tables with many columns
Use --include-only to focus on specific columns you suspect form a key
Use --max-columns to limit the number of columns analyzed (default: 20)
The tool will warn if the combination count exceeds 50,000 (use --force to proceed)

Future Plans

Core DBQT Features (Coming Soon)

AI-Powered column classification using Qwen2 0.5B
Automatic check suggestions
20+ built-in data quality checks
Python-first API
No backend required
Customizable check framework

Planned Checks

Completeness checks (null values)
Uniqueness validation
Format validation (regex, dates, emails)
Range/boundary checks
Value validation
Statistical analysis
Dependency checks

Integration Plans

Data pipeline integration
Scheduled runs
Parallel check execution
Multiple database backend support

📄 License

This project is licensed under the MIT License.

Project details

These details have not been verified by PyPI

Project links

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Software Development :: Quality Assurance

Release history Release notifications | RSS feed

This version

0.3.0

Mar 23, 2026

0.2.2

Mar 3, 2026

0.2.1

Feb 12, 2026

0.1.15

Oct 21, 2025

0.1.14

Oct 1, 2025

0.1.13

Sep 3, 2025

0.1.12

Aug 16, 2025

0.1.11

Aug 13, 2025

0.1.10

Aug 7, 2025

0.1.9

Jul 8, 2025

0.1.8

Jun 26, 2025

0.1.7

Apr 15, 2025

0.1.6

Mar 19, 2025

0.1.5

Feb 19, 2025

0.1.4

Feb 7, 2025

0.1.3

Feb 7, 2025

0.1.2

Feb 6, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbqt-0.3.0.tar.gz (41.7 kB view details)

Uploaded Mar 23, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

dbqt-0.3.0-py3-none-any.whl (44.0 kB view details)

Uploaded Mar 23, 2026 Python 3

File details

Details for the file dbqt-0.3.0.tar.gz.

File metadata

Download URL: dbqt-0.3.0.tar.gz
Upload date: Mar 23, 2026
Size: 41.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for dbqt-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`e9d7ae3f5e4dc1a6d5d6b3d1c8a9456ace9d0b6b7a24e77e29be7bb25a31636c`
MD5	`5eabd04407f70f6c1f10e94d32f91794`
BLAKE2b-256	`8f930d31285bdb4b0461702a40e56cf6993aeca33529a6eaf7952d7964cf83aa`

See more details on using hashes here.

File details

Details for the file dbqt-0.3.0-py3-none-any.whl.

File metadata

Download URL: dbqt-0.3.0-py3-none-any.whl
Upload date: Mar 23, 2026
Size: 44.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.9

File hashes

Hashes for dbqt-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8c86a4aac82071833b5875ca86aed5c19bbed75bc1df25470b20b9cd6a41c8d5`
MD5	`d76c5f2473743bf2a551bc2843312edd`
BLAKE2b-256	`5d4c3d029fc76f1190cabfe93d4d5e65c1af3df643343d072a36bf1d0f973a6a`

See more details on using hashes here.

dbqt 0.3.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

DBQT (DataBase Quality Tool)

Current Tools

Column Comparison Tool (dbqt colcompare / dbqt compare)

Parquet Combine Tool (dbqt combine)

Database Statistics Tool (dbqt dbstats / dbqt rowcount / dbqt stats)

Null Column Check Tool (dbqt nullcheck)

Dynamic Query Tool (dbqt dynamic-query)

Parquetizer Tool (dbqt parquetizer)

Key Finder Tool (dbqt keyfinder)

Future Plans

Core DBQT Features (Coming Soon)

Planned Checks

Integration Plans

📄 License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes