
Data quality CLI demo with deterministic checks, anomaly detection, and replayable artifacts


dq_agent


Chinese documentation: README.zh-CN.md

What it is

  • A local/offline data quality CLI for CSV/Parquet inputs with YAML/JSON rules.
  • A deterministic runner that emits report.json (machine) and report.md (human).
  • A practical gate for CI quality checks using typed exit codes and artifacts.

What it isn't

  • Not a distributed compute engine.
  • Not an automatic data repair/correction system.
  • Not an LLM agent.

Installation

Install from PyPI:

pip install dq-agent

Install with pipx (requires package to be published to PyPI first):

pipx install dq-agent

Install from source (repo usage):

pip install -e ".[test]"

Demo in one command

dq demo

Use a custom output root:

dq demo --output-dir artifacts/demo_runs

Use Make targets for setup and demo:

make bootstrap
make demo

Quickstart (Demo)

Requirements: Python 3.11+

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install -e ".[test]"

Run the demo:

python -m dq_agent demo

Use idempotency keys to keep outputs stable across reruns:

python -m dq_agent demo --idempotency-key demo-001 --idempotency-mode reuse

It prints something like:

{"report_json_path": "artifacts/<run_id>/report.json", "report_md_path": "artifacts/<run_id>/report.md", "run_record_path": "artifacts/<run_id>/run_record.json", "trace_path": "artifacts/<run_id>/trace.jsonl", "checkpoint_path": "artifacts/<run_id>/checkpoint.json"}

CI Integration (GitHub Actions)

Use dq run as a quality gate in CI:

name: dq-gate
on: [push, pull_request]
jobs:
  dq:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install
        run: |
          python -m pip install -U pip
          pip install dq-agent
          # For repository usage instead:
          # pip install -e ".[test]"
      - name: Run data quality gate
        run: |
          dq run --data path/to/data.parquet --config path/to/rules.yml --fail-on ERROR

Exit code semantics:

  • 0: run completed and --fail-on was not triggered.
  • 1: I/O/config errors.
  • 2: guardrail/schema failures, idempotency conflict/regression, or --fail-on triggered.

Outputs

A demo/run creates a new run directory:

  • artifacts/<run_id>/report.json
  • artifacts/<run_id>/report.md
  • artifacts/<run_id>/run_record.json (replayable run record)
  • artifacts/<run_id>/trace.jsonl (run trace events, NDJSON)
  • artifacts/<run_id>/checkpoint.json (resume checkpoint)

report.json and run_record.json include schema_version: 1 at the top level. report.json includes an observability section with timing and rule/anomaly counts.
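A consumer can guard on `schema_version` before reading further fields. A minimal sketch against an illustrative report dict (`schema_version: 1` is documented above; the observability key names below are assumptions for the sketch):

```python
# Illustrative report payload; `schema_version` is documented, but the
# observability field names shown here are assumptions.
report = {
    "schema_version": 1,
    "observability": {"rule_count": 5, "anomaly_count": 2},
}

# Reject payloads written by a schema version we do not understand.
if report.get("schema_version") != 1:
    raise ValueError(f"unsupported schema_version: {report.get('schema_version')}")

obs = report.get("observability", {})
print(obs.get("rule_count"), obs.get("anomaly_count"))
```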

Sample outputs (committed for quick preview):

  • examples/report.md
  • examples/report.json

Note: artifacts/ is ignored by git. Use examples/ if you want committed sample reports.

Trace file

Each run writes a minimal trace log to trace.jsonl (newline-delimited JSON). The trace includes run_start, stage_start, stage_end, and run_end events with elapsed milliseconds since start.

Inspect it with standard shell tools:

tail -n +1 artifacts/<run_id>/trace.jsonl | head
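Since the trace is newline-delimited JSON, it is also easy to post-process programmatically. A sketch that computes per-stage durations from illustrative events (the exact field names `event`, `stage`, and `elapsed_ms` are assumptions based on the description above):

```python
import json

# Illustrative trace lines; real field names may differ slightly.
trace_lines = [
    '{"event": "run_start", "elapsed_ms": 0}',
    '{"event": "stage_start", "stage": "load", "elapsed_ms": 1}',
    '{"event": "stage_end", "stage": "load", "elapsed_ms": 12}',
    '{"event": "run_end", "elapsed_ms": 20}',
]

starts = {}
durations = {}
for line in trace_lines:
    evt = json.loads(line)
    if evt["event"] == "stage_start":
        starts[evt["stage"]] = evt["elapsed_ms"]
    elif evt["event"] == "stage_end":
        # Duration is the gap between matching stage_start/stage_end events.
        durations[evt["stage"]] = evt["elapsed_ms"] - starts[evt["stage"]]

print(durations)  # stage name -> elapsed milliseconds
```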

Run on your own data

python -m dq_agent run --data path/to/table.parquet --config path/to/rules.yml

Supported:

  • Data: CSV / Parquet
  • Config: YAML / JSON

Guardrails (optional safety limits):

python -m dq_agent run \
  --data path/to/table.parquet \
  --config path/to/rules.yml \
  --max-input-mb 50 \
  --max-rows 500000 \
  --max-cols 200 \
  --max-rules 2000 \
  --max-anomalies 200 \
  --max-wall-time-s 30

Fail the run when issues reach a severity threshold:

python -m dq_agent run --data path/to/table.parquet --config path/to/rules.yml --fail-on ERROR

Exit code behavior:

  • 0: run completed without triggering --fail-on
  • 1: I/O or config parsing errors (missing/unreadable files, invalid config)
  • 2: guardrail violations or schema validation failures, or the --fail-on severity threshold was reached

Idempotency controls:

python -m dq_agent run \
  --data path/to/table.parquet \
  --config path/to/rules.yml \
  --idempotency-key run-001 \
  --idempotency-mode reuse

Modes:

  • reuse (default): if report.json + run_record.json already exist for the key, return their paths and skip the pipeline.
  • overwrite: re-run and overwrite artifacts in the deterministic run directory.
  • fail: return a structured error (idempotency_conflict) with exit code 2 when artifacts exist.
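The three modes boil down to "check for existing artifacts for the key, then short-circuit, overwrite, or fail". A simplified sketch of that decision, not the actual implementation (the per-key directory layout here is hypothetical):

```python
import tempfile
from pathlib import Path

def resolve_idempotent_run(root: Path, key: str, mode: str = "reuse"):
    """Sketch of idempotency handling; the run-directory naming is hypothetical."""
    run_dir = root / key  # assume one deterministic directory per key
    report = run_dir / "report.json"
    record = run_dir / "run_record.json"
    exists = report.exists() and record.exists()
    if exists and mode == "reuse":
        return "reused", report, record       # skip the pipeline entirely
    if exists and mode == "fail":
        raise RuntimeError("idempotency_conflict")  # CLI maps this to exit code 2
    return "run", report, record               # fresh run, or overwrite

root = Path(tempfile.mkdtemp())
action, report, record = resolve_idempotent_run(root, "run-001")

# Simulate a completed run, then call again: reuse short-circuits.
run_dir = root / "run-001"
run_dir.mkdir()
(run_dir / "report.json").write_text("{}")
(run_dir / "run_record.json").write_text("{}")
action2, _, _ = resolve_idempotent_run(root, "run-001")
print(action, action2)
```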

Failure contract (typed errors)

Failures are first-class artifacts. When a run fails, both report.json and run_record.json include an error object, and the CLI prints a JSON payload with the error and any written output paths.

Error schema fields:

  • type: guardrail_violation | io_error | config_error | schema_validation_error | internal_error | idempotency_conflict | regression
  • code: short, stable machine code (e.g., max_rows, data_not_found, invalid_config)
  • message: stable human-readable message
  • is_retryable: boolean retry hint
  • suggested_next_step: short actionable hint
  • details: optional, small JSON object

Failure output example:

{
  "error": {
    "type": "guardrail_violation",
    "code": "max_rows",
    "message": "Row count 2500 exceeds limit 10.",
    "is_retryable": false,
    "suggested_next_step": "Adjust guardrail limits or reduce input size before retrying.",
    "details": {"limit": 10, "observed": 2500}
  },
  "report_json_path": "artifacts/<run_id>/report.json",
  "run_record_path": "artifacts/<run_id>/run_record.json"
}
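CI wrappers can branch on the typed fields rather than scraping the message. A sketch that consumes the payload above (the literal is copied from the example):

```python
import json

payload = json.loads("""
{
  "error": {
    "type": "guardrail_violation",
    "code": "max_rows",
    "message": "Row count 2500 exceeds limit 10.",
    "is_retryable": false,
    "suggested_next_step": "Adjust guardrail limits or reduce input size before retrying.",
    "details": {"limit": 10, "observed": 2500}
  },
  "report_json_path": "artifacts/<run_id>/report.json",
  "run_record_path": "artifacts/<run_id>/run_record.json"
}
""")

err = payload.get("error")
if err is not None:
    # Use the retry hint to decide what the wrapper should do next.
    action = "retry" if err["is_retryable"] else "abort"
    print(err["type"], err["code"], action)
```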

See all CLI options:

python -m dq_agent --help
python -m dq_agent run --help

Shadow (baseline vs candidate)

Compare two configs against the same data before promoting changes:

python -m dq_agent shadow \
  --data path/to/table.parquet \
  --baseline-config path/to/baseline.yml \
  --candidate-config path/to/candidate.yml \
  --fail-on-regression

Outputs are grouped under one shadow run directory:

artifacts/<shadow_run_id>/
  baseline/<baseline_run_id>/...
  candidate/<candidate_run_id>/...
  shadow_diff.json

The --fail-on-regression flag exits with code 2 when the candidate is worse than the baseline, while still writing shadow_diff.json and both run artifacts. The CLI prints a single JSON payload with the baseline and candidate report paths plus any typed error details.
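Conceptually, the regression decision compares issue counts between the two runs. A sketch of that idea under assumed field names (the real shadow_diff.json structure may differ):

```python
# Hypothetical per-severity issue counts; the real shadow_diff.json
# fields are not documented here, so these names are assumptions.
baseline = {"ERROR": 1, "WARN": 3}
candidate = {"ERROR": 2, "WARN": 2}

def is_regression(base: dict, cand: dict) -> bool:
    """Candidate regresses if it raises more ERROR-severity issues than baseline."""
    return cand.get("ERROR", 0) > base.get("ERROR", 0)

print(is_regression(baseline, candidate))
```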

Replay a run

After a run completes, use the run_record.json to replay deterministically:

python -m dq_agent replay --run-record artifacts/<run_id>/run_record.json --strict

Resume a run with checkpoints

Each run writes a checkpoint.json alongside the other artifacts. If artifacts go missing, resume can repair them:

python -m dq_agent resume --run-dir artifacts/<run_id>

Example flow:

python -m dq_agent demo --output-dir artifacts
rm artifacts/<run_id>/report.md
python -m dq_agent resume --run-dir artifacts/<run_id>
python -m dq_agent validate --kind report --path artifacts/<run_id>/report.json
python -m dq_agent validate --kind run_record --path artifacts/<run_id>/run_record.json
python -m dq_agent validate --kind checkpoint --path artifacts/<run_id>/checkpoint.json

Schema + validation

Print JSON Schema for each output:

python -m dq_agent schema --kind report
python -m dq_agent schema --kind run_record
python -m dq_agent schema --kind checkpoint

Validate a generated output:

python -m dq_agent validate --kind report --path artifacts/<run_id>/report.json
python -m dq_agent validate --kind run_record --path artifacts/<run_id>/run_record.json
python -m dq_agent validate --kind checkpoint --path artifacts/<run_id>/checkpoint.json

Config format (YAML)

Minimal example:

version: 1
dataset:
  name: demo_orders
  primary_key: [order_id]
  time_column: created_at

columns:
  order_id:
    type: string
    required: true
    checks:
      - unique: true

  user_id:
    type: string
    required: true
    checks:
      - not_null: { max_null_rate: 0.01 }
      - string_noise: { contains: ["*", "''"], max_rate: 0.0 }
    anomalies:
      - missing_rate: { max_rate: 0.02 }

  amount:
    type: float
    required: true
    checks:
      - range: { min: 0, max: 10000 }
    anomalies:
      - outlier_mad: { z: 6.0 }

  status:
    type: string
    required: true
    checks:
      - allowed_values: { values: ["PAID","REFUND","CANCEL","PENDING"] }

Demo config lives at: dq_agent/resources/demo_rules.yml.
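Since JSON configs are also accepted, the same structure can be written as JSON. A direct translation of part of the YAML above (abbreviated to two columns):

```json
{
  "version": 1,
  "dataset": {
    "name": "demo_orders",
    "primary_key": ["order_id"],
    "time_column": "created_at"
  },
  "columns": {
    "order_id": {
      "type": "string",
      "required": true,
      "checks": [{"unique": true}]
    },
    "amount": {
      "type": "float",
      "required": true,
      "checks": [{"range": {"min": 0, "max": 10000}}],
      "anomalies": [{"outlier_mad": {"z": 6.0}}]
    }
  }
}
```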

What it checks

Deterministic rules:

  • not_null
  • unique
  • range
  • allowed_values
  • string_noise (substring / regex pattern based)

Statistical anomalies:

  • outlier_mad (robust outliers via MAD z-score)
  • missing_rate (null-rate anomaly)
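The MAD z-score technique behind outlier_mad can be sketched in a few lines. This is an illustration of the general method, not the project's implementation:

```python
import statistics

def mad_outliers(values, z=6.0):
    """Flag values whose robust z-score (via MAD) exceeds z.

    The 0.6745 consistency constant makes the MAD approximate the
    standard deviation for normally distributed data.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # degenerate case: no spread to normalize by
    return [v for v in values if abs(0.6745 * (v - med) / mad) > z]

data = [10, 11, 9, 10, 12, 10, 500]
print(mad_outliers(data))  # the extreme value 500 is flagged
```

Unlike a mean/standard-deviation z-score, the median-based version is not dragged toward the outlier itself, which is why it is robust.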

Benchmarks (real labeled datasets)

We evaluate dq_agent as an error detector on datasets that have real labels via paired dirty.csv / clean.csv.

Metrics:

  • cell-level precision / recall / F1 (detect wrong cells)
  • row-level precision / recall / F1 (detect rows containing any wrong cell)
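These follow the standard precision/recall/F1 definitions over true-positive, false-positive, and false-negative counts. A sketch, using the Raha clean-profile micro cell counts reported below as a worked example:

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from tp/fp/fn counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Micro cell-level counts for the Raha clean profile (tp, fp, fn).
p, r, f1 = prf1(tp=16662, fp=3720, fn=2538)
print(round(f1, 6))  # matches the micro_cell_f1 value in the table
```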

Reproduce (Raha):

bash scripts/run_raha_and_save.sh

Notes:

  • Benchmark harness: scripts/eval_dirty_clean.py (auto-generates a config from profile data).
  • The string_noise check is enabled by default for open-domain string columns in the harness (use --no-string-noise to ablate).

Reproduce (PED):

bash scripts/run_ped_and_save.sh

To refresh the README benchmark block from benchmarks/ artifacts:

python scripts/update_readme_benchmarks.py

Raha (7 datasets; dirty vs clean profiles)

| profile | datasets | macro_cell_f1 | micro_cell_f1 | macro_row_f1 | micro_row_f1 | cell_tp/fp/fn | row_tp/fp/fn |
|---|---|---|---|---|---|---|---|
| clean | 7 | 0.817778 | 0.841898 | 0.915870 | 0.925301 | (16662, 3720, 2538) | (12288, 724, 1260) |
| dirty | 7 | 0.405580 | 0.303776 | 0.593699 | 0.466883 | (4183, 3488, 15686) | (4395, 714, 9323) |

Full breakdown: benchmarks/raha_compare.md.

Raha string-noise ablation (patterns: *, '')

| metric | base | union | Δ |
|---|---|---|---|
| macro_cell_f1 | 0.760961 | 0.817760 | 0.056798 |
| micro_cell_f1 | 0.806936 | 0.841877 | 0.034940 |
| macro_row_f1 | 0.859008 | 0.915870 | 0.056862 |
| micro_row_f1 | 0.876437 | 0.925301 | 0.048864 |

Largest per-dataset gain (from the committed compare file):

| dataset | base cell_f1 | union cell_f1 | Δ cell_f1 | base row_f1 | union row_f1 | Δ row_f1 |
|---|---|---|---|---|---|---|
| raha/tax | 0.319744 | 0.718032 | 0.398288 | 0.324004 | 0.722462 | 0.398457 |

Full breakdown: benchmarks/raha_noise_union/compare.md.

PED (additional dirty/clean datasets)

| profile | datasets | macro_cell_f1 | micro_cell_f1 | macro_row_f1 | micro_row_f1 | cell_tp/fp/fn | row_tp/fp/fn |
|---|---|---|---|---|---|---|---|
| clean | 14 | 0.846318 | 0.056526 | 0.857537 | 0.104772 | (40776, 917613, 443580) | (45356, 455137, 319955) |
| dirty | 14 | 0.200416 | 0.010962 | 0.212178 | 0.035870 | (7726, 917533, 476630) | (14983, 455111, 350328) |

Full breakdown: benchmarks/ped_compare.md.

Dev / Tests

pip install -e ".[test]"
python -m pytest -q

If you see:

  • ModuleNotFoundError: typer / No module named pytest

you most likely forgot to activate the virtual environment:

source .venv/bin/activate

Project layout (high-level)

  • dq_agent/cli.py – Typer CLI (run, demo)
  • dq_agent/loader.py – CSV/Parquet loader
  • dq_agent/config.py – config loading (YAML/JSON)
  • dq_agent/contract.py – contract validation
  • dq_agent/rules/ – deterministic checks (registry-based)
  • dq_agent/anomalies/ – anomaly detectors (registry-based)
  • dq_agent/report/ – JSON + Markdown report writers
  • tests/ – unit + integration tests

Spec / Roadmap

Full design doc: A0_SPEC.md

Project docs

  • CHANGELOG.md
  • CONTRIBUTING.md
  • CODE_OF_CONDUCT.md
  • SECURITY.md
  • docs/publishing.md

License

Apache-2.0
