Skip to main content

A lightweight data quality checkup CLI — no YAML, no rule syntax to remember.

Project description

dq-doctor

Generate data quality reports from your database in minutes — no YAML, no rule syntax to remember.

A lightweight CLI that profiles your database tables, auto-generates quality check rules, runs validations, and outputs an HTML report. One command, zero config.

dq-doctor report screenshot

English | 中文说明

Quick Start

# Install
pip install dq-doctor

# Generate a demo database to try it out
dqdoctor demo

# List tables
dqdoctor tables --db examples/ecommerce/demo.duckdb

# Profile a table
dqdoctor profile --db examples/ecommerce/demo.duckdb --table orders

# Full check: profile + rules + validate + HTML report
dqdoctor check --db examples/ecommerce/demo.duckdb --table orders --out report.html

# Check all tables at once
dqdoctor check --db examples/ecommerce/demo.duckdb --all-tables --out report.html

# Export rules to dbt / Great Expectations / Markdown
dqdoctor export --db examples/ecommerce/demo.duckdb --table orders --format dbt --out schema.yml
dqdoctor export --db examples/ecommerce/demo.duckdb --table orders --format gx --out suite.json
dqdoctor export --db examples/ecommerce/demo.duckdb --table orders --format markdown --out dict.md

That's it. Open report.html in your browser.

What It Does

DuckDB (first-class) / PostgreSQL / MySQL
  → Profile table structure & column distributions
  → Auto-generate quality rules (not_null, unique, accepted_values, range, freshness)
  → Execute validations + custom SQL rules
  → PII detection (email, phone, ID card, IP, etc.)
  → Cross-table FK discovery & referential integrity checks
  → Column correlation detection
  → Profile drift comparison
  → Output HTML report
  → Export to dbt / GX / Soda CL / Deequ / Markdown

Every rule comes with a human-readable reason — so you know why the rule was suggested, not just what it checks.

Example Output

orders: Rules 14  Passed 14  Failed 0
  PASS not_null on order_id: All 20 rows have non-null 'order_id'.
  PASS unique on order_id: All 20 values in 'order_id' are unique.
  PASS range on total_amount: All 20 values within [45.00, 680.00].
  PASS accepted_values on status: All 20 non-null values in accepted set.
  PASS freshness on created_at: Latest value is 3.0h old (max 24h).

Supported Rules

Rule How It's Triggered Example
not_null Column has zero nulls, or is an identifier field order_id has no nulls → require not_null
unique Identifier field with ≥98% distinct rate user_id is nearly unique → require unique
accepted_values Category field with ≤20 distinct values status has 4 values → constrain to that set
range Numeric column total_amount in [45.00, 680.00]
freshness Timestamp field created_at should be within 24h

Export Formats

# Starter dbt schema.yml with column tests
dqdoctor export --format dbt --out schema.yml

# Great Expectations Expectation Suite JSON
dqdoctor export --format gx --out suite.json

# Markdown data dictionary
dqdoctor export --format markdown --out dict.md

# Soda CL checks
dqdoctor export --format soda --out checks.yml

# Deequ-style JSON
dqdoctor export --format deequ --out checks.json

Note: dbt export generates a starter schema.yml structure. You may need to adjust test types (e.g. range) to match your dbt version and packages.

LLM-Enhanced Rules (Experimental)

Pass an LLM API key to get additional business rules beyond the heuristic ones:

LLM suggested rules

dqdoctor check --db demo.duckdb --table orders ^
  --llm-key "sk-xxx" ^
  --llm-base-url "https://api.deepseek.com/v1" ^
  --llm-model "deepseek-chat"

Without --llm-key, dqdoctor runs purely with deterministic heuristic rules. Requires pip install dq-doctor[llm].

CI Mode

Use in CI/CD pipelines — exits with code 1 when failures exceed threshold:

dqdoctor check --db demo.duckdb --table orders --ci --max-failures 0

Configuration File

Generate a .dqdoctor.yml config to persist settings:

dqdoctor init
# .dqdoctor.yml
db: demo.duckdb

tables:
  orders:
    freshness:
      created_at:
        max_age_hours: 48
    disable_rules:
      - range:user_id
    severity:
      order_id:not_null: high
    sql_rules:
      - name: amount_positive
        query: "SELECT COUNT(*) FROM orders WHERE total_amount <= 0"
        expect: 0
dqdoctor check --config .dqdoctor.yml --table orders

Advanced Commands

# Discover foreign keys across tables
dqdoctor fk --db demo.duckdb

# Check referential integrity (orphan detection)
dqdoctor refint --db demo.duckdb

# Detect column correlations
dqdoctor correlate --db demo.duckdb --table orders

# Compare profile drift between runs
dqdoctor drift --old profile_v1.json --new profile_v2.json

# Discover data lineage (FK + correlations)
dqdoctor lineage --db demo.duckdb

# Generate dirty demo data (with intentional issues)
dqdoctor demo --dirty
dqdoctor check --db dirty.duckdb --all-tables

# Web dashboard
pip install dq-doctor[dashboard]
dqdoctor serve --db demo.duckdb

Why Not Great Expectations / Soda / dbt?

dq-doctor is not a replacement — it's a quick checkup layer that runs before you invest in heavy tooling:

  • Great Expectations / Soda: Powerful but require YAML configs, expectation suites, and setup. dqdoctor gives you a first-pass report with zero config.
  • dbt tests: Great for ongoing CI, but you need to write tests first. dqdoctor suggests tests for you and can export a starter schema.yml.
  • Think of it as: dqdoctor check → discover issues → export to dbt/GX → refine.

中文说明

dqdoctor 是一个轻量级数据质量体检 CLI 工具。你不需要手写 YAML,不需要记 Great Expectations 或 dbt 的规则语法,只需要一行命令,就能对数据库表做 profiling、自动生成质量检查规则、执行校验并输出 HTML 报告。

dq-doctor 报告截图

快速开始

# 安装
pip install dq-doctor

# 生成示例数据库
dqdoctor demo

# 列出所有表
dqdoctor tables --db examples/ecommerce/demo.duckdb

# 对单表做 profiling
dqdoctor profile --db examples/ecommerce/demo.duckdb --table orders

# 完整检查:profiling + 规则生成 + 校验 + HTML 报告
dqdoctor check --db examples/ecommerce/demo.duckdb --table orders --out report.html

# 一次性检查所有表
dqdoctor check --db examples/ecommerce/demo.duckdb --all-tables --out report.html

# 导出规则为 dbt / Great Expectations / Markdown
dqdoctor export --db examples/ecommerce/demo.duckdb --table orders --format dbt --out schema.yml
dqdoctor export --db examples/ecommerce/demo.duckdb --table orders --format gx --out suite.json
dqdoctor export --db examples/ecommerce/demo.duckdb --table orders --format markdown --out dict.md

打开 report.html 即可查看报告。

它做了什么

DuckDB(一等支持)/ PostgreSQL / MySQL
  → 分析表结构和字段分布
  → 自动生成质量规则(not_null, unique, accepted_values, range, freshness)
  → 执行校验 + 自定义 SQL 规则
  → PII 敏感数据检测(邮箱、手机、身份证、IP 等)
  → 跨表外键发现 + 参照完整性校验
  → 列间相关性检测
  → Profile 漂移对比
  → 输出 HTML 报告
  → 导出 dbt / GX / Soda CL / Deequ / Markdown

每条规则都有可读的生成原因 — 你不仅知道检查了什么,还知道为什么建议这个规则。

示例输出

orders: Rules 14  Passed 14  Failed 0  Suggested 0
  PASS not_null on order_id: All 20 rows have non-null 'order_id'.
  PASS unique on order_id: All 20 values in 'order_id' are unique.
  PASS range on total_amount: All 20 values within [45.00, 680.00].
  PASS accepted_values on status: All 20 non-null values in accepted set.
  PASS freshness on created_at: Latest value is 3.0h old (max 24h).

支持的规则

规则 触发条件 示例
not_null 字段零空值,或被推断为标识符 order_id 没有空值 → 要求 not_null
unique 标识符字段且唯一率 ≥98% user_id 近乎唯一 → 要求 unique
accepted_values 分类字段且不同值 ≤20 个 status 有 4 个值 → 约束为这 4 个
range 数值字段 total_amount 在 [45.00, 680.00]
freshness 时间戳字段 created_at 应在 24 小时以内

导出格式

# dbt schema.yml(含 column tests)
dqdoctor export --format dbt --out schema.yml

# Great Expectations Expectation Suite JSON
dqdoctor export --format gx --out suite.json

# Markdown 数据字典
dqdoctor export --format markdown --out dict.md

# Soda CL checks
dqdoctor export --format soda --out checks.yml

# Deequ-style JSON
dqdoctor export --format deequ --out checks.json

注意:dbt 导出生成的是 starter 格式的 schema.yml。range 规则使用 dbt_utils.expression_is_true,需要安装 dbt-utils 包。

LLM 增强规则(实验性)

传入 LLM API key 可以在启发式规则之外获得额外的业务规则建议:

LLM 建议规则截图

dqdoctor check --db demo.duckdb --table orders ^
  --llm-key "sk-xxx" ^
  --llm-base-url "https://api.deepseek.com/v1" ^
  --llm-model "deepseek-chat"

不传 --llm-key 时 dqdoctor 只运行确定性启发式规则。LLM 功能需要 pip install dq-doctor[llm]

LLM 生成的规则显示为 SUGGEST 状态(未实际校验),与通过/失败的规则分开统计。

CI 模式

在 CI/CD 流水线中使用 — 失败数超过阈值时 exit 1:

dqdoctor check --db demo.duckdb --table orders --ci --max-failures 0

配置文件

一键生成 .dqdoctor.yml 配置文件:

dqdoctor init
# .dqdoctor.yml
db: demo.duckdb

tables:
  orders:
    freshness:
      created_at:
        max_age_hours: 48
    disable_rules:
      - range:user_id
    severity:
      order_id:not_null: high
    sql_rules:
      - name: amount_positive
        query: "SELECT COUNT(*) FROM orders WHERE total_amount <= 0"
        expect: 0
dqdoctor check --config .dqdoctor.yml --table orders

高级命令

# 发现跨表外键
dqdoctor fk --db demo.duckdb

# 参照完整性校验(孤立记录检测)
dqdoctor refint --db demo.duckdb

# 列间相关性检测
dqdoctor correlate --db demo.duckdb --table orders

# Profile 漂移对比
dqdoctor drift --old profile_v1.json --new profile_v2.json

# 数据血缘发现(FK + 相关性)
dqdoctor lineage --db demo.duckdb

# 生成脏数据 demo(故意有问题)
dqdoctor demo --dirty
dqdoctor check --db dirty.duckdb --all-tables

# Web 看板
pip install dq-doctor[dashboard]
dqdoctor serve --db demo.duckdb

为什么不用 Great Expectations / Soda / dbt?

dq-doctor 不是替代品 — 它是一个快速体检层,在你投入重型工具之前先跑一轮:

  • Great Expectations / Soda:功能强大但需要 YAML 配置、Expectation Suite 和初始化。dqdoctor 零配置给你第一轮报告。
  • dbt tests:适合持续 CI,但你得先写测试。dqdoctor 帮你建议测试,还能导出 starter schema.yml。
  • 定位dqdoctor check → 发现问题 → 导出到 dbt/GX → 精细化。

适用人群

数据开发工程师、数仓工程师、数据平台实习生。

技术栈

  • Python 3.9+
  • Typer — CLI 框架
  • DuckDB — 嵌入式分析数据库
  • Pydantic — 数据模型
  • Jinja2 — HTML 报告模板
  • Rich — 终端输出

Tech Stack

  • Python 3.9+
  • Typer — CLI framework
  • DuckDB — embedded analytical database
  • Pydantic — data models
  • Jinja2 — HTML report templates
  • Rich — terminal output

Development

git clone https://github.com/pugyy/dq-doctor.git
cd dq-doctor
pip install -e ".[dev]"

# Run tests (98 tests)
pytest tests/ -v

# Lint
ruff check dqdoctor/ tests/

# Try the demo
dqdoctor demo
dqdoctor check --db examples/ecommerce/demo.duckdb --table orders

Roadmap

  • PostgreSQL / MySQL connector framework + CI integration testing
  • Configuration file (.dqdoctor.yml) + custom SQL rules
  • PII detection, FK discovery, referential integrity
  • Column correlation, profile drift, data lineage
  • Export: dbt / GX / Soda CL / Deequ / Markdown
  • Dirty demo data + Web dashboard + Airflow operator
  • PyPI published (v0.4.0)
  • Demo GIF

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dq_doctor-0.4.1.tar.gz (178.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dq_doctor-0.4.1-py3-none-any.whl (43.4 kB view details)

Uploaded Python 3

File details

Details for the file dq_doctor-0.4.1.tar.gz.

File metadata

  • Download URL: dq_doctor-0.4.1.tar.gz
  • Upload date:
  • Size: 178.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.7

File hashes

Hashes for dq_doctor-0.4.1.tar.gz
Algorithm Hash digest
SHA256 eacb02caab7252ac78637237874f69f6b22302ceb80400e9ac1fe1ee36019415
MD5 a295cd23c4b551f5278500219a7fa0da
BLAKE2b-256 56f284d8dc9e3d3a29faba2520ee73ec56f07f4dd80e6da579094280b03b41b1

See more details on using hashes here.

File details

Details for the file dq_doctor-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: dq_doctor-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 43.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.7

File hashes

Hashes for dq_doctor-0.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d781c559f94ad55424535074e03a93128a591ab07a295c3995012752bf9ea359
MD5 d7cec569f708633c40e448841088b57a
BLAKE2b-256 bea656ed16c09bdc3f919d9b2cb893a3b894f95e346bad1a825b9587f5508303

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page