DataBase Quality Tool
Project description
DBQT (DataBase Quality Tool) 🎯
DBQT is a lightweight, Python-first data quality testing framework that helps data teams maintain high-quality data through automated checks and intelligent suggestions.
🛠️ Current Tools
Column Comparison Tool (dbqt compare)
Compare schemas between databases or files:
- Table-level comparison
- Column-level comparison with data type compatibility checks
- Support for CSV and Parquet files
- Handles nested Parquet schemas (arrays, structs, maps)
- Intelligent data type compatibility checking
- Generates detailed Excel report with:
- Table differences
- Column differences
- Data type mismatches
- Formatted worksheets for easy analysis
Usage:
dbqt compare source_schema.csv target_schema.csv
# Or compare Parquet files directly:
dbqt compare source.parquet target.parquet
To generate CSV schema files from your database, run this query:
SELECT
upper(table_schema) as SCH, --optional
upper(table_name) as TABLE_NAME,
upper(column_name) as COL_NAME,
upper(data_type) as DATA_TYPE --optional
FROM information_schema.columns
where UPPER(table_schema) = UPPER('YOUR_SCHEMA')
order by table_name, ordinal_position;
Export the results to CSV format to use with the compare tool.
Parquet Combine Tool (dbqt combine)
Combine multiple Parquet files into a single file:
- Validates schema compatibility
- Preserves nested data structures
- Handles large datasets efficiently
Usage:
dbqt combine [output.parquet] # Combines all .parquet files in current directory
Database Statistics Tool (dbqt dbstats)
Collect and analyze database statistics:
- Fetches table row counts in parallel for faster execution
- Supports both single table analysis and source/target table comparisons
- NEW: Supports separate source and target database configurations for cross-database comparisons
- Automatically calculates differences and percentage changes for comparisons
- Updates statistics in a CSV file with comprehensive error reporting
- Configurable through YAML
Usage:
# Single database mode (source and target in same database)
dbqt dbstats --config config.yaml
# Dual database mode (source and target in different databases)
dbqt dbstats --source-config source_config.yaml --target-config target_config.yaml
Example config.yaml (single database mode):
# Database connection configuration
connection:
type: mysql # mysql, snowflake, duckdb, csv, parquet, s3parquet, postgresql, sqlserver, athena
host: localhost
user: myuser
password: mypassword
database: mydb
# Optional AWS configs for s3parquet
# aws_profile: default
# aws_region: us-west-2
# bucket: my-bucket
# Snowflake-specific configs
# type: snowflake
# account: your_account.region
# warehouse: YOUR_WAREHOUSE
# database: YOUR_DB
# schema: YOUR_SCHEMA
# role: YOUR_ROLE
# authenticator: externalbrowser # Optional: use SSO authentication
# user: your_username
# password: your_password # Not needed if using externalbrowser auth
# Path to CSV file containing table names to analyze
tables_file: tables.csv
# Optional: number of parallel workers (default: 4)
max_workers: 10
Example dual database configuration (for cross-database comparisons):
source_config.yaml:
connection:
type: mysql
host: source-db.example.com
user: source_user
password: source_pass
database: source_db
tables_file: tables.csv # CSV with source_table and target_table columns
max_workers: 10
target_config.yaml:
connection:
type: snowflake
account: myorg.snowflakecomputing.com
user: target_user
password: target_pass
warehouse: COMPUTE_WH
database: TARGET_DB
schema: PUBLIC
role: ANALYST
The tables.csv file should contain either:
- A
table_namecolumn for single table analysis (addsrow_countandnotescolumns) source_tableandtarget_tablecolumns for comparison analysis (adds row counts, notes, difference, and percentage difference columns)
Note: When using dual database mode (--source-config and --target-config), the tool will:
- Connect to the source database to count rows in
source_tablecolumn - Connect to the target database to count rows in
target_tablecolumn - This enables comparing tables across different database systems (e.g., MySQL to Snowflake migration validation)
Null Column Check Tool (dbqt nullcheck)
Check for columns where all records are null across multiple tables in Snowflake.
- Identifies completely empty columns.
- Reports on columns with low-distinct values (<=5).
- Efficiently checks multiple tables in parallel.
- Generates a markdown report summarizing the findings.
Usage:
dbqt nullcheck --config snowflake_config.yaml
This tool currently only supports Snowflake.
Dynamic Query Tool (dbqt dynamic-query)
Run a dynamic SQL query against Athena for a list of values from a CSV file.
- Substitutes values from a CSV into a query template.
- Executes queries sequentially and writes results to an output file.
- Useful for running the same query against multiple tables or with different parameters.
Usage:
dbqt dynamic-query --config athena_config.yaml --csv values.csv --query "SELECT COUNT(1) FROM {var_from_csv}"
This tool currently only supports AWS Athena.
Parquetizer Tool (dbqt parquetizer)
A utility to recursively find files that are Parquet but lack the .parquet extension and rename them.
- Scans a directory for files without extensions.
- Validates if a file is a Parquet file by checking its magic bytes.
- Renames valid Parquet files to include the
.parquetextension.
Usage:
dbqt parquetizer [directory] # Scans from the specified directory (or current if not provided)
Key Finder Tool (dbqt keyfinder)
Automatically discover composite keys (primary keys) in database tables by analyzing column combinations.
- Finds minimal composite keys (smallest column combinations that uniquely identify rows)
- Checks for NULL values (columns with NULLs cannot be part of a key)
- Prioritizes ID-like columns (columns named
id,*_id,id_*) for faster discovery - Supports filtering columns with
--include-onlyor--excludeoptions - Configurable maximum key size and column limits
- Parallel-safe with progress tracking
- Works with any supported database connector (Snowflake, MySQL, PostgreSQL, etc.)
Usage:
# Basic usage - find keys in a table
dbqt keyfinder --config config.yaml --table users
# Limit search to keys of size 3 or less
dbqt keyfinder --config config.yaml --table orders --max-size 3
# Exclude certain columns from the search
dbqt keyfinder --config config.yaml --table products --exclude created_at updated_at
# Only search specific columns
dbqt keyfinder --config config.yaml --table transactions --include-only user_id transaction_date store_id
# Force execution even with high combination counts
dbqt keyfinder --config config.yaml --table large_table --force
# Verbose output with progress updates
dbqt keyfinder --config config.yaml --table data --verbose
Example config.yaml:
connection:
type: snowflake
user: myuser
password: mypass
account: myaccount
database: mydb
schema: myschema
warehouse: mywh
role: myrole
Performance Tips:
- The tool checks combinations starting from size 1 and stops when it finds minimal keys
- ID-like columns are checked first for faster discovery
- Use
--max-sizeto limit the search space for tables with many columns - Use
--include-onlyto focus on specific columns you suspect form a key - Use
--max-columnsto limit the number of columns analyzed (default: 20) - The tool will warn if the combination count exceeds 50,000 (use
--forceto proceed)
🚀 Future Plans
Core DBQT Features (Coming Soon)
- AI-Powered column classification using Qwen2 0.5B
- Automatic check suggestions
- 20+ built-in data quality checks
- Python-first API
- No backend required
- Customizable check framework
Planned Checks
- Completeness checks (null values)
- Uniqueness validation
- Format validation (regex, dates, emails)
- Range/boundary checks
- Value validation
- Statistical analysis
- Dependency checks
Integration Plans
- Data pipeline integration
- Scheduled runs
- Parallel check execution
- Multiple database backend support
📄 License
This project is licensed under the MIT License.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file dbqt-0.1.14.tar.gz.
File metadata
- Download URL: dbqt-0.1.14.tar.gz
- Upload date:
- Size: 30.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1e77c084c8e3dfa80b4c0e16fa41844708850d79aa27f3a4ad4a74dee49efbae
|
|
| MD5 |
4db6fb190bb8b2034d9e695635c1577d
|
|
| BLAKE2b-256 |
4babedb33f4d86f4dba148635920865e5e3d9bbbcd9de1522feffe9d3afd3073
|
File details
Details for the file dbqt-0.1.14-py3-none-any.whl.
File metadata
- Download URL: dbqt-0.1.14-py3-none-any.whl
- Upload date:
- Size: 32.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3a0304412601acf7c45de2e9e4b5d8be0b4a861c6985f23e188fa7452ad2bd89
|
|
| MD5 |
65e0b408a2c7201dbbcb4604ef216333
|
|
| BLAKE2b-256 |
731a22d4654ffec9c1905059121158e45e275669e3a3a48c52275986d195ac4e
|