Skip to main content

A lightweight data quality checkup CLI — no YAML, no rule syntax to remember.

Project description

dq-doctor

One command to profile tables, generate data quality checks, and catch dirty data.

PyPI Python 3.9+ Tests License: MIT

pip install dq-doctor
dqdoctor demo --dirty
dqdoctor check --db dirty.duckdb --all-tables --out report.html

dq-doctor report screenshot

Highlights

  • Zero config — profile, generate rules, validate, output HTML in one command
  • Dirty data demodqdoctor demo --dirty generates a database with intentional issues
  • PII detection — flags emails, phone numbers, ID cards in your columns
  • Referential integrity — finds orphan rows across tables
  • 5 export formats — dbt, Great Expectations, Soda CL, Deequ, Markdown
  • PostgreSQL / MySQL — DuckDB first-class, PG/MySQL via connection string
  • Custom SQL rules — write your own validation queries
  • LLM suggestions — optional AI-powered rule generation (experimental)

English | 中文说明

Quick Start

pip install dq-doctor

# Try it now — generates a demo database with data quality issues
dqdoctor demo --dirty

# Run a full check
dqdoctor check --db dirty.duckdb --all-tables --out report.html

# Open the report
open report.html       # macOS
start report.html      # Windows

That's it. Three commands to see dq-doctor in action.

All Commands

# Health check your installation
dqdoctor doctor

# List tables
dqdoctor tables --db demo.duckdb

# Profile a table (column stats + PII detection)
dqdoctor profile --db demo.duckdb --table orders

# Full check: profile + rules + validate + HTML report
dqdoctor check --db demo.duckdb --table orders --out report.html

# Check all tables
dqdoctor check --db demo.duckdb --all-tables --out report.html

# Generate editable rules file
dqdoctor rules-init --db demo.duckdb --table orders --out rules.yml

# Discover foreign keys across tables
dqdoctor fk --db demo.duckdb

# Check referential integrity (orphan detection)
dqdoctor refint --db demo.duckdb

# Compare profiles for drift
dqdoctor profile --db demo.duckdb --table orders --out v1.json
dqdoctor profile --db demo.duckdb --table orders --out v2.json
dqdoctor drift --old v1.json --new v2.json

# Export to dbt / GX / Soda / Deequ / Markdown
dqdoctor export --db demo.duckdb --table orders --format dbt --out schema.yml

# Generate config file
dqdoctor init

What It Does

DuckDB (first-class) / PostgreSQL / MySQL
  → Profile table structure & column distributions
  → Auto-generate quality rules (not_null, unique, accepted_values, range, freshness)
  → Execute validations + custom SQL rules
  → PII detection (email, phone, ID card, IP, etc.)
  → Cross-table FK discovery & referential integrity checks
  → Column correlation detection
  → Profile drift comparison
  → Output HTML report
  → Export to dbt / GX / Soda CL / Deequ / Markdown

Every rule comes with a human-readable reason — so you know why the rule was suggested, not just what it checks.

Example Output

dirty_orders: Rules 14  Passed 7  Failed 7  Suggested 0
  PASS not_null on order_id: All 20 rows have non-null 'order_id'.
  FAIL not_null on user_id: 3 rows have null 'user_id'.
  FAIL range on total_amount: 4 values outside [-200.00, 99999.99].
  FAIL accepted_values on status: 2 values not in accepted set.
  FAIL freshness on created_at: Latest value is 168.0h old (max 24h).
  PII detected:
    email in column 'email'
    phone_cn in column 'phone'

Supported Rules

Rule How It's Triggered Example
not_null Column has zero nulls, or is an identifier field order_id has no nulls
unique Identifier field with ≥98% distinct rate user_id is nearly unique
accepted_values Category field with ≤20 distinct values status has 4 values
range Numeric column total_amount in [45.00, 680.00]
freshness Timestamp field created_at within 24h

Configuration

dqdoctor init                          # Create .dqdoctor.yml
dqdoctor check --config .dqdoctor.yml  # Use config
# .dqdoctor.yml
db: demo.duckdb
tables:
  orders:
    disable_rules: [range:user_id]
    severity:
      order_id:not_null: high
    sql_rules:
      - name: amount_positive
        query: "SELECT COUNT(*) FROM orders WHERE total_amount <= 0"
        expect: 0

See docs/config.md for full reference.

Export Formats

Format Command
dbt schema.yml dqdoctor export --format dbt --out schema.yml
Great Expectations dqdoctor export --format gx --out suite.json
Soda CL dqdoctor export --format soda --out checks.yml
Deequ dqdoctor export --format deequ --out checks.json
Markdown dqdoctor export --format markdown --out dict.md

PostgreSQL / MySQL

pip install dq-doctor[sql]
dqdoctor check --db "postgresql://user:pass@host:5432/dbname" --table orders
dqdoctor check --db "mysql://user:pass@host:3306/dbname" --table orders

See examples/postgres-demo/ and examples/mysql-demo/ for docker-compose setups.

LLM-Enhanced Rules (Experimental)

pip install dq-doctor[llm]
dqdoctor check --db demo.duckdb --table orders \
  --llm-key "sk-xxx" \
  --llm-base-url "https://api.deepseek.com/v1" \
  --llm-model "deepseek-chat"

LLM rules appear as SUGGEST status (not validated) — separate from pass/fail counts.

CI Mode

dqdoctor check --db demo.duckdb --table orders --ci --max-failures 0

Exits with code 1 when failures exceed threshold.

Experimental Features

Feature Status
LLM rule suggestions Experimental — quality depends on model
Column correlation Experimental — Pearson only
Data lineage Experimental — FK + correlation heuristics
Soda CL / Deequ export Starter — basic metric mapping
Web dashboard Starter — read-only

See docs/limitations.md for details.

Documentation

Contributing

See CONTRIBUTING.md.

中文说明

dq-doctor 是一个轻量级数据质量体检 CLI 工具。一行命令,自动 profile 表结构、生成质量规则、执行校验并输出 HTML 报告。

dq-doctor 报告截图

30 秒体验

pip install dq-doctor
dqdoctor demo --dirty
dqdoctor check --db dirty.duckdb --all-tables --out report.html

核心特性

  • 零配置 — 一行命令完成 profile + 规则生成 + 校验 + 报告
  • 脏数据 Demodqdoctor demo --dirty 自动生成带问题的数据库
  • PII 检测 — 自动识别邮箱、手机号、身份证等敏感字段
  • 参照完整性 — 发现跨表孤立记录
  • 5 种导出格式 — dbt / GX / Soda CL / Deequ / Markdown
  • PostgreSQL / MySQL — 通过连接字符串直接连
  • 自定义 SQL 规则 — 写自己的校验查询
  • LLM 建议 — 可选的 AI 规则生成(实验性)

配置文件

dqdoctor init  # 生成 .dqdoctor.yml

CI 模式

dqdoctor check --db demo.duckdb --table orders --ci --max-failures 0

技术栈

Python 3.9+ / Typer / DuckDB / Pydantic / Jinja2 / Rich

Development

git clone https://github.com/pugyy/dq-doctor.git
cd dq-doctor
pip install -e ".[dev]"

pytest tests/ -v    # 101 tests
ruff check .        # lint

Roadmap

  • DuckDB + PostgreSQL / MySQL support
  • PII detection, FK discovery, referential integrity
  • Configuration file + custom SQL rules
  • Profile drift comparison + data lineage
  • Export: dbt / GX / Soda CL / Deequ / Markdown
  • Dirty demo + web dashboard + Airflow operator
  • dqdoctor doctor health check
  • dqdoctor rules-init editable rules
  • PyPI published (v0.5.0)
  • Demo GIF

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dq_doctor-0.6.0.tar.gz (186.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dq_doctor-0.6.0-py3-none-any.whl (43.6 kB view details)

Uploaded Python 3

File details

Details for the file dq_doctor-0.6.0.tar.gz.

File metadata

  • Download URL: dq_doctor-0.6.0.tar.gz
  • Upload date:
  • Size: 186.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.7

File hashes

Hashes for dq_doctor-0.6.0.tar.gz
Algorithm Hash digest
SHA256 c4e5154ab8273aa5afcac6ba9b2197be634b7a347b17b492cbe41da1cd6f2f02
MD5 b88bf4c5e30b70e626b3e660b1f2626c
BLAKE2b-256 f2cbd7fbf709606ea433c457a9f14ab7f0824580661a13a320675f625b15de0a

See more details on using hashes here.

File details

Details for the file dq_doctor-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: dq_doctor-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 43.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.7

File hashes

Hashes for dq_doctor-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2be7b2b03a7e2c1dab09b5742a79e6c93dc0e621a07046f4b20b83868bdb179f
MD5 789c5869a0804eb09355af7e20feb818
BLAKE2b-256 488222866f8f345d63a33f3019fd69b84f8cd1e22c36f5f92a8a6d19e59ef90b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page