dq_agent
Data quality CLI demo with deterministic checks, anomaly detection, and replayable artifacts.
Chinese documentation: README.zh-CN.md
What it is
- A local/offline data quality CLI for CSV/Parquet inputs with YAML/JSON rules.
- A deterministic runner that emits report.json (machine-readable) and report.md (human-readable).
- A practical gate for CI quality checks using typed exit codes and artifacts.
What it isn't
- Not a distributed compute engine.
- Not an automatic data repair/correction system.
- Not an LLM agent.
Installation
Install from PyPI:
pip install dq-agent
Install with pipx:
pipx install dq-agent
Install from source (repo usage):
pip install -e ".[test]"
Demo in one command
dq demo
Use a custom output root:
dq demo --output-dir artifacts/demo_runs
Use Make targets for setup and demo:
make bootstrap
make demo
Quickstart (Demo)
Requirements: Python 3.11+
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install -e ".[test]"
Run the demo:
python -m dq_agent demo
Use idempotency keys to keep outputs stable across reruns:
python -m dq_agent demo --idempotency-key demo-001 --idempotency-mode reuse
It prints something like:
{"report_json_path": "artifacts/<run_id>/report.json", "report_md_path": "artifacts/<run_id>/report.md", "run_record_path": "artifacts/<run_id>/run_record.json", "trace_path": "artifacts/<run_id>/trace.jsonl", "checkpoint_path": "artifacts/<run_id>/checkpoint.json"}
CI Integration (GitHub Actions)
Use dq run as a quality gate in CI:
name: dq-gate
on: [push, pull_request]
jobs:
  dq:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install
        run: |
          python -m pip install -U pip
          pip install dq-agent
          # For repository usage instead:
          # pip install -e ".[test]"
      - name: Run data quality gate
        run: |
          dq run --data path/to/data.parquet --config path/to/rules.yml --fail-on ERROR
Exit code semantics:
- 0: run completed and --fail-on was not triggered.
- 1: I/O or config errors.
- 2: guardrail/schema failures, idempotency conflict/regression, or --fail-on triggered.
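A wrapper script can branch on these codes; the sketch below is illustrative (the `classify` helper is hypothetical, only the exit-code meanings come from the semantics documented here):

```python
import subprocess  # used by the commented-out invocation below

def classify(returncode: int) -> str:
    """Map dq exit codes (per the documented semantics) to a CI action."""
    if returncode == 0:
        return "pass"
    if returncode == 1:
        return "setup_error"   # I/O or config problem: fix inputs first
    if returncode == 2:
        return "quality_gate"  # guardrail/schema/--fail-on: block the merge
    return "unknown"

# Example invocation (commented out so the sketch stays self-contained):
# result = subprocess.run(["dq", "run", "--data", "data.parquet",
#                          "--config", "rules.yml", "--fail-on", "ERROR"])
# print(classify(result.returncode))
```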
Outputs
A demo/run creates a new run directory:
- artifacts/<run_id>/report.json
- artifacts/<run_id>/report.md
- artifacts/<run_id>/run_record.json (replayable run record)
- artifacts/<run_id>/trace.jsonl (run trace events, NDJSON)
- artifacts/<run_id>/checkpoint.json (resume checkpoint)
report.json and run_record.json include schema_version: 1 at the top level. report.json includes an observability section with timing and rule/anomaly counts.
Sample outputs (committed for quick preview):
- examples/report.md
- examples/report.json
Note: artifacts/ is ignored by git. Use examples/ if you want committed sample reports.
Trace file
Each run writes a minimal trace log to trace.jsonl (newline-delimited JSON). The trace includes run_start,
stage_start, stage_end, and run_end events with elapsed milliseconds since start.
Inspect it with standard shell tools:
tail -n +1 artifacts/<run_id>/trace.jsonl | head
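Because the trace is newline-delimited JSON, per-stage timings can be derived with a few lines of Python. The sketch below assumes each event carries `event`, `stage`, and `elapsed_ms` fields; the exact field names are a guess based on the description above, so check a real trace.jsonl first:

```python
import json

# Synthetic trace lines in the shape described above (field names assumed).
lines = [
    '{"event": "run_start", "elapsed_ms": 0}',
    '{"event": "stage_start", "stage": "load", "elapsed_ms": 1}',
    '{"event": "stage_end", "stage": "load", "elapsed_ms": 12}',
    '{"event": "stage_start", "stage": "rules", "elapsed_ms": 12}',
    '{"event": "stage_end", "stage": "rules", "elapsed_ms": 40}',
    '{"event": "run_end", "elapsed_ms": 41}',
]

starts, durations = {}, {}
for line in lines:
    ev = json.loads(line)
    if ev["event"] == "stage_start":
        starts[ev["stage"]] = ev["elapsed_ms"]
    elif ev["event"] == "stage_end":
        durations[ev["stage"]] = ev["elapsed_ms"] - starts[ev["stage"]]

print(durations)  # per-stage wall time in milliseconds
```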
Run on your own data
python -m dq_agent run --data path/to/table.parquet --config path/to/rules.yml
Supported:
- Data: CSV / Parquet
- Config: YAML / JSON
Guardrails (optional safety limits):
python -m dq_agent run \
--data path/to/table.parquet \
--config path/to/rules.yml \
--max-input-mb 50 \
--max-rows 500000 \
--max-cols 200 \
--max-rules 2000 \
--max-anomalies 200 \
--max-wall-time-s 30
Fail the run when issues reach a severity threshold:
python -m dq_agent run --data path/to/table.parquet --config path/to/rules.yml --fail-on ERROR
Exit code behavior:
- 0: run completed without triggering --fail-on.
- 1: I/O or config parsing errors (missing/unreadable files, invalid config).
- 2: guardrail violation or schema validation failures, plus --fail-on severity reached.
Idempotency controls:
python -m dq_agent run \
--data path/to/table.parquet \
--config path/to/rules.yml \
--idempotency-key run-001 \
--idempotency-mode reuse
Modes:
- reuse (default): if report.json + run_record.json already exist for the key, return their paths and skip the pipeline.
- overwrite: re-run and overwrite artifacts in the deterministic run directory.
- fail: return a structured error (idempotency_conflict) with exit code 2 when artifacts exist.
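The mode semantics above can be summarized as a small decision function. This is an illustration of the documented contract, not the package's internals:

```python
def idempotency_action(artifacts_exist: bool, mode: str = "reuse") -> str:
    """Decide what a keyed run should do, per the documented modes."""
    if not artifacts_exist:
        return "run"                   # no prior artifacts: always execute
    if mode == "reuse":
        return "return_paths"          # skip pipeline, return existing paths
    if mode == "overwrite":
        return "rerun_overwrite"       # re-run, overwrite the run directory
    if mode == "fail":
        return "idempotency_conflict"  # structured error, exit code 2
    raise ValueError(f"unknown mode: {mode}")

print(idempotency_action(True, "reuse"))
```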
Failure contract (typed errors)
Failures are first-class artifacts. When a run fails, both report.json and run_record.json include an error
object, and the CLI prints a JSON payload with the error and any written output paths.
Error schema fields:
- type: guardrail_violation | io_error | config_error | schema_validation_error | internal_error | idempotency_conflict | regression
- code: short, stable machine code (e.g., max_rows, data_not_found, invalid_config)
- message: stable human-readable message
- is_retryable: boolean retry hint
- suggested_next_step: short actionable hint
- details: optional, small JSON object
Failure output example:
{
"error": {
"type": "guardrail_violation",
"code": "max_rows",
"message": "Row count 2500 exceeds limit 10.",
"is_retryable": false,
"suggested_next_step": "Adjust guardrail limits or reduce input size before retrying.",
"details": {"limit": 10, "observed": 2500}
},
"report_json_path": "artifacts/<run_id>/report.json",
"run_record_path": "artifacts/<run_id>/run_record.json"
}
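A caller can branch on the typed error instead of parsing messages. A minimal sketch using the example payload above (only the documented fields are accessed):

```python
import json

# The failure payload from the example above.
payload = json.loads("""{
  "error": {
    "type": "guardrail_violation",
    "code": "max_rows",
    "message": "Row count 2500 exceeds limit 10.",
    "is_retryable": false,
    "suggested_next_step": "Adjust guardrail limits or reduce input size before retrying.",
    "details": {"limit": 10, "observed": 2500}
  },
  "report_json_path": "artifacts/run1/report.json",
  "run_record_path": "artifacts/run1/run_record.json"
}""")

err = payload.get("error")
if err is not None and not err["is_retryable"]:
    # Non-retryable: surface the hint rather than retrying blindly.
    print(f"{err['type']}/{err['code']}: {err['suggested_next_step']}")
```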
See all CLI options:
python -m dq_agent --help
python -m dq_agent run --help
Shadow (baseline vs candidate)
Compare two configs against the same data before promoting changes:
python -m dq_agent shadow \
--data path/to/table.parquet \
--baseline-config path/to/baseline.yml \
--candidate-config path/to/candidate.yml \
--fail-on-regression
Outputs are grouped under one shadow run directory:
artifacts/<shadow_run_id>/
baseline/<baseline_run_id>/...
candidate/<candidate_run_id>/...
shadow_diff.json
The --fail-on-regression flag exits with code 2 when the candidate is worse than the baseline, while still
writing shadow_diff.json and both run artifacts. The CLI prints a single JSON payload with the baseline and
candidate report paths plus any typed error details.
Replay a run
After a run completes, use the run_record.json to replay deterministically:
python -m dq_agent replay --run-record artifacts/<run_id>/run_record.json --strict
Resume a run with checkpoints
Each run writes a checkpoint.json alongside the other artifacts. If artifacts go missing, resume can repair them:
python -m dq_agent resume --run-dir artifacts/<run_id>
Example flow:
python -m dq_agent demo --output-dir artifacts
rm artifacts/<run_id>/report.md
python -m dq_agent resume --run-dir artifacts/<run_id>
python -m dq_agent validate --kind report --path artifacts/<run_id>/report.json
python -m dq_agent validate --kind run_record --path artifacts/<run_id>/run_record.json
python -m dq_agent validate --kind checkpoint --path artifacts/<run_id>/checkpoint.json
Schema + validation
Print JSON Schema for each output:
python -m dq_agent schema --kind report
python -m dq_agent schema --kind run_record
python -m dq_agent schema --kind checkpoint
Validate a generated output:
python -m dq_agent validate --kind report --path artifacts/<run_id>/report.json
python -m dq_agent validate --kind run_record --path artifacts/<run_id>/run_record.json
python -m dq_agent validate --kind checkpoint --path artifacts/<run_id>/checkpoint.json
Config format (YAML)
Minimal example:
version: 1
dataset:
  name: demo_orders
  primary_key: [order_id]
  time_column: created_at
columns:
  order_id:
    type: string
    required: true
    checks:
      - unique: true
  user_id:
    type: string
    required: true
    checks:
      - not_null: { max_null_rate: 0.01 }
      - string_noise: { contains: ["*", "''"], max_rate: 0.0 }
    anomalies:
      - missing_rate: { max_rate: 0.02 }
  amount:
    type: float
    required: true
    checks:
      - range: { min: 0, max: 10000 }
    anomalies:
      - outlier_mad: { z: 6.0 }
  status:
    type: string
    required: true
    checks:
      - allowed_values: { values: ["PAID","REFUND","CANCEL","PENDING"] }
Demo config lives at: dq_agent/resources/demo_rules.yml.
What it checks
Deterministic rules:
- not_null
- unique
- range
- allowed_values
- string_noise (substring / regex pattern based)
Statistical anomalies:
- outlier_mad (robust outliers via MAD z-score)
- missing_rate (null-rate anomaly)
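As background, a robust MAD z-score works as sketched below. This is a generic illustration of the technique, not dq_agent's implementation; the 0.6745 constant scales the MAD to be comparable to a standard deviation under normality:

```python
import statistics

def mad_z_scores(values):
    """Robust z-scores: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

data = [10, 11, 9, 10, 12, 10, 500]  # one gross outlier
# Flag values whose robust z-score exceeds the threshold (z: 6.0 in the demo config).
flagged = [v for v, z in zip(data, mad_z_scores(data)) if abs(z) > 6.0]
print(flagged)
```

Unlike a mean/stddev z-score, the median and MAD are barely moved by the outlier itself, so a single extreme value cannot mask its own detection.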
Benchmarks (real labeled datasets)
We evaluate dq_agent as an error detector on datasets that have real labels
via paired dirty.csv / clean.csv.
Metrics:
- cell-level precision / recall / F1 (detect wrong cells)
- row-level precision / recall / F1 (detect rows containing any wrong cell)
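These metrics are standard; the sketch below shows how cell- and row-level precision/recall/F1 can be computed from predicted vs. true error masks (illustrative, not the harness's exact code):

```python
def prf1(tp, fp, fn):
    """Precision, recall, F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Boolean masks over a 3x2 table: True = "this cell is wrong".
pred = [[True, False], [False, False], [True, True]]
true = [[True, False], [True, False], [True, False]]

cells = [(p, t) for pr, tr in zip(pred, true) for p, t in zip(pr, tr)]
cell_tp = sum(p and t for p, t in cells)
cell_fp = sum(p and not t for p, t in cells)
cell_fn = sum(t and not p for p, t in cells)

# A row counts as positive if any of its cells is flagged/wrong.
rows = [(any(pr), any(tr)) for pr, tr in zip(pred, true)]
row_tp = sum(p and t for p, t in rows)
row_fp = sum(p and not t for p, t in rows)
row_fn = sum(t and not p for p, t in rows)

print(prf1(cell_tp, cell_fp, cell_fn), prf1(row_tp, row_fp, row_fn))
```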
Reproduce (Raha):
bash scripts/run_raha_and_save.sh
Notes:
- Benchmark harness: scripts/eval_dirty_clean.py (auto-generates a config from profiled data).
- The string_noise check is enabled by default for open-domain string columns in the harness (use --no-string-noise to ablate).
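The idea behind a substring-based noise check can be sketched as follows; this illustrates the technique with the demo config's patterns, and dq_agent's actual implementation may differ:

```python
def noise_rate(values, patterns=("*", "''")):
    """Fraction of string values containing any noise pattern."""
    strings = [v for v in values if isinstance(v, str)]
    if not strings:
        return 0.0
    noisy = sum(any(p in s for p in patterns) for s in strings)
    return noisy / len(strings)

col = ["alice", "b*b", "''", "carol", None]
rate = noise_rate(col)
print(rate)  # 2 noisy values out of 4 strings
violates = rate > 0.0  # max_rate: 0.0, as in the config example
```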
Reproduce (PED):
bash scripts/run_ped_and_save.sh
To refresh the README benchmark block from benchmarks/ artifacts:
python scripts/update_readme_benchmarks.py
Raha (7 datasets; dirty vs clean profiles)
| profile | datasets | macro_cell_f1 | micro_cell_f1 | macro_row_f1 | micro_row_f1 | cell_tp/fp/fn | row_tp/fp/fn |
|---|---|---|---|---|---|---|---|
| clean | 7 | 0.817778 | 0.841898 | 0.915870 | 0.925301 | (16662, 3720, 2538) | (12288, 724, 1260) |
| dirty | 7 | 0.405580 | 0.303776 | 0.593699 | 0.466883 | (4183, 3488, 15686) | (4395, 714, 9323) |
Full breakdown: benchmarks/raha_compare.md.
Raha string-noise ablation (patterns: *, '')
| metric | base | union | Δ |
|---|---|---|---|
| macro_cell_f1 | 0.760961 | 0.817760 | 0.056798 |
| micro_cell_f1 | 0.806936 | 0.841877 | 0.034940 |
| macro_row_f1 | 0.859008 | 0.915870 | 0.056862 |
| micro_row_f1 | 0.876437 | 0.925301 | 0.048864 |
Largest per-dataset gain (from the committed compare file):
| dataset | base cell_f1 | union cell_f1 | Δ | base row_f1 | union row_f1 | Δ |
|---|---|---|---|---|---|---|
| raha/tax | 0.319744 | 0.718032 | 0.398288 | 0.324004 | 0.722462 | 0.398457 |
Full breakdown: benchmarks/raha_noise_union/compare.md.
PED (additional dirty/clean datasets)
| profile | datasets | macro_cell_f1 | micro_cell_f1 | macro_row_f1 | micro_row_f1 | cell_tp/fp/fn | row_tp/fp/fn |
|---|---|---|---|---|---|---|---|
| clean | 14 | 0.846318 | 0.056526 | 0.857537 | 0.104772 | (40776, 917613, 443580) | (45356, 455137, 319955) |
| dirty | 14 | 0.200416 | 0.010962 | 0.212178 | 0.035870 | (7726, 917533, 476630) | (14983, 455111, 350328) |
Full breakdown: benchmarks/ped_compare.md.
Dev / Tests
pip install -e ".[test]"
python -m pytest -q
If you see errors like:
ModuleNotFoundError: No module named 'typer'
ModuleNotFoundError: No module named 'pytest'
you almost certainly forgot to activate the virtual environment:
source .venv/bin/activate
Project layout (high-level)
- dq_agent/cli.py – Typer CLI (run, demo)
- dq_agent/loader.py – CSV/Parquet loader
- dq_agent/config.py – config loading (YAML/JSON)
- dq_agent/contract.py – contract validation
- dq_agent/rules/ – deterministic checks (registry-based)
- dq_agent/anomalies/ – anomaly detectors (registry-based)
- dq_agent/report/ – JSON + Markdown report writers
- tests/ – unit + integration tests
Spec / Roadmap
Full design doc: A0_SPEC.md
Project docs
- CHANGELOG.md
- CONTRIBUTING.md
- CODE_OF_CONDUCT.md
- SECURITY.md
- docs/publishing.md
License
Apache-2.0