
Data quality CLI demo with deterministic checks, anomaly detection, and replayable artifacts


dq_agent


Chinese documentation: README.zh-CN.md

What it is

  • A local/offline data quality CLI for CSV/Parquet inputs with YAML/JSON rules.
  • A deterministic runner that emits report.json (machine) and report.md (human).
  • A practical gate for CI quality checks using typed exit codes and artifacts.

What it isn't

  • Not a distributed compute engine.
  • Not an automatic data repair/correction system.
  • Not an LLM agent.

Installation

Install from PyPI:

pip install dq-agent

Install with pipx (requires package to be published to PyPI first):

pipx install dq-agent

Install from source (repo usage):

pip install -e ".[test]"

Demo in one command

dq demo

Use a custom output root:

dq demo --output-dir artifacts/demo_runs

Use Make targets for setup and demo:

make bootstrap
make demo

Quickstart (Demo)

Requirements: Python 3.11+

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install -e ".[test]"

Run the demo:

python -m dq_agent demo

Use idempotency keys to keep outputs stable across reruns:

python -m dq_agent demo --idempotency-key demo-001 --idempotency-mode reuse

It prints something like:

{"report_json_path": "artifacts/<run_id>/report.json", "report_md_path": "artifacts/<run_id>/report.md", "run_record_path": "artifacts/<run_id>/run_record.json", "trace_path": "artifacts/<run_id>/trace.jsonl", "checkpoint_path": "artifacts/<run_id>/checkpoint.json"}

CI Integration (GitHub Actions)

Use dq run as a quality gate in CI:

name: dq-gate
on: [push, pull_request]
jobs:
  dq:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install
        run: |
          python -m pip install -U pip
          pip install dq-agent
          # For repository usage instead:
          # pip install -e ".[test]"
      - name: Run data quality gate
        run: |
          dq run --data path/to/data.parquet --config path/to/rules.yml --fail-on ERROR

Exit code semantics:

  • 0: run completed and --fail-on was not triggered.
  • 1: I/O/config errors.
  • 2: guardrail/schema failures, idempotency conflict/regression, or --fail-on triggered.

Outputs

A demo/run creates a new run directory:

  • artifacts/<run_id>/report.json
  • artifacts/<run_id>/report.md
  • artifacts/<run_id>/run_record.json (replayable run record)
  • artifacts/<run_id>/trace.jsonl (run trace events, NDJSON)
  • artifacts/<run_id>/checkpoint.json (resume checkpoint)

report.json and run_record.json include schema_version: 1 at the top level. report.json includes an observability section with timing and rule/anomaly counts.
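A consumer can guard on `schema_version` before reading further fields. A minimal sketch against an illustrative report dict (`schema_version: 1` is documented above; the observability key names below are assumptions for the sketch):

```python
# Illustrative report payload; `schema_version` is documented, but the
# observability field names shown here are assumptions.
report = {
    "schema_version": 1,
    "observability": {"rule_count": 5, "anomaly_count": 2},
}

# Reject payloads written by a schema version we do not understand.
if report.get("schema_version") != 1:
    raise ValueError(f"unsupported schema_version: {report.get('schema_version')}")

obs = report.get("observability", {})
print(obs.get("rule_count"), obs.get("anomaly_count"))
```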

Sample outputs (committed for quick preview):

  • examples/report.md
  • examples/report.json

Note: artifacts/ is ignored by git. Use examples/ if you want committed sample reports.

Trace file

Each run writes a minimal trace log to trace.jsonl (newline-delimited JSON). The trace includes run_start, stage_start, stage_end, and run_end events with elapsed milliseconds since start.

Inspect it with standard shell tools:

tail -n +1 artifacts/<run_id>/trace.jsonl | head
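Since the trace is newline-delimited JSON, it is also easy to post-process programmatically. A sketch that computes per-stage durations from illustrative events (the exact field names `event`, `stage`, and `elapsed_ms` are assumptions based on the description above):

```python
import json

# Illustrative trace lines; real field names may differ slightly.
trace_lines = [
    '{"event": "run_start", "elapsed_ms": 0}',
    '{"event": "stage_start", "stage": "load", "elapsed_ms": 1}',
    '{"event": "stage_end", "stage": "load", "elapsed_ms": 12}',
    '{"event": "run_end", "elapsed_ms": 20}',
]

starts = {}
durations = {}
for line in trace_lines:
    evt = json.loads(line)
    if evt["event"] == "stage_start":
        starts[evt["stage"]] = evt["elapsed_ms"]
    elif evt["event"] == "stage_end":
        # Duration is the gap between matching stage_start/stage_end events.
        durations[evt["stage"]] = evt["elapsed_ms"] - starts[evt["stage"]]

print(durations)  # stage name -> elapsed milliseconds
```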

Run on your own data

python -m dq_agent run --data path/to/table.parquet --config path/to/rules.yml

Supported:

  • Data: CSV / Parquet
  • Config: YAML / JSON

Guardrails (optional safety limits):

python -m dq_agent run \
  --data path/to/table.parquet \
  --config path/to/rules.yml \
  --max-input-mb 50 \
  --max-rows 500000 \
  --max-cols 200 \
  --max-rules 2000 \
  --max-anomalies 200 \
  --max-wall-time-s 30

Fail the run when issues reach a severity threshold:

python -m dq_agent run --data path/to/table.parquet --config path/to/rules.yml --fail-on ERROR

Exit code behavior:

  • 0: run completed without triggering --fail-on
  • 1: I/O or config parsing errors (missing/unreadable files, invalid config)
  • 2: guardrail violations or schema validation failures, or the --fail-on severity threshold was reached

Idempotency controls:

python -m dq_agent run \
  --data path/to/table.parquet \
  --config path/to/rules.yml \
  --idempotency-key run-001 \
  --idempotency-mode reuse

Modes:

  • reuse (default): if report.json + run_record.json already exist for the key, return their paths and skip the pipeline.
  • overwrite: re-run and overwrite artifacts in the deterministic run directory.
  • fail: return a structured error (idempotency_conflict) with exit code 2 when artifacts exist.
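The three modes boil down to "check for existing artifacts for the key, then short-circuit, overwrite, or fail". A simplified sketch of that decision, not the actual implementation (the per-key directory layout here is hypothetical):

```python
import tempfile
from pathlib import Path

def resolve_idempotent_run(root: Path, key: str, mode: str = "reuse"):
    """Sketch of idempotency handling; the run-directory naming is hypothetical."""
    run_dir = root / key  # assume one deterministic directory per key
    report = run_dir / "report.json"
    record = run_dir / "run_record.json"
    exists = report.exists() and record.exists()
    if exists and mode == "reuse":
        return "reused", report, record       # skip the pipeline entirely
    if exists and mode == "fail":
        raise RuntimeError("idempotency_conflict")  # CLI maps this to exit code 2
    return "run", report, record               # fresh run, or overwrite

root = Path(tempfile.mkdtemp())
action, report, record = resolve_idempotent_run(root, "run-001")

# Simulate a completed run, then call again: reuse short-circuits.
run_dir = root / "run-001"
run_dir.mkdir()
(run_dir / "report.json").write_text("{}")
(run_dir / "run_record.json").write_text("{}")
action2, _, _ = resolve_idempotent_run(root, "run-001")
print(action, action2)
```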

Failure contract (typed errors)

Failures are first-class artifacts. When a run fails, both report.json and run_record.json include an error object, and the CLI prints a JSON payload with the error and any written output paths.

Error schema fields:

  • type: guardrail_violation | io_error | config_error | schema_validation_error | internal_error | idempotency_conflict | regression
  • code: short, stable machine code (e.g., max_rows, data_not_found, invalid_config)
  • message: stable human-readable message
  • is_retryable: boolean retry hint
  • suggested_next_step: short actionable hint
  • details: optional, small JSON object

Failure output example:

{
  "error": {
    "type": "guardrail_violation",
    "code": "max_rows",
    "message": "Row count 2500 exceeds limit 10.",
    "is_retryable": false,
    "suggested_next_step": "Adjust guardrail limits or reduce input size before retrying.",
    "details": {"limit": 10, "observed": 2500}
  },
  "report_json_path": "artifacts/<run_id>/report.json",
  "run_record_path": "artifacts/<run_id>/run_record.json"
}
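CI wrappers can branch on the typed fields rather than scraping the message. A sketch that consumes the payload above (the literal is copied from the example):

```python
import json

payload = json.loads("""
{
  "error": {
    "type": "guardrail_violation",
    "code": "max_rows",
    "message": "Row count 2500 exceeds limit 10.",
    "is_retryable": false,
    "suggested_next_step": "Adjust guardrail limits or reduce input size before retrying.",
    "details": {"limit": 10, "observed": 2500}
  },
  "report_json_path": "artifacts/<run_id>/report.json",
  "run_record_path": "artifacts/<run_id>/run_record.json"
}
""")

err = payload.get("error")
if err is not None:
    # Use the retry hint to decide what the wrapper should do next.
    action = "retry" if err["is_retryable"] else "abort"
    print(err["type"], err["code"], action)
```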

See all CLI options:

python -m dq_agent --help
python -m dq_agent run --help

Shadow (baseline vs candidate)

Compare two configs against the same data before promoting changes:

python -m dq_agent shadow \
  --data path/to/table.parquet \
  --baseline-config path/to/baseline.yml \
  --candidate-config path/to/candidate.yml \
  --fail-on-regression

Outputs are grouped under one shadow run directory:

artifacts/<shadow_run_id>/
  baseline/<baseline_run_id>/...
  candidate/<candidate_run_id>/...
  shadow_diff.json

The --fail-on-regression flag exits with code 2 when the candidate is worse than the baseline, while still writing shadow_diff.json and both run artifacts. The CLI prints a single JSON payload with the baseline and candidate report paths plus any typed error details.
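Conceptually, the regression decision compares issue counts between the two runs. A sketch of that idea under assumed field names (the real shadow_diff.json structure may differ):

```python
# Hypothetical per-severity issue counts; the real shadow_diff.json
# fields are not documented here, so these names are assumptions.
baseline = {"ERROR": 1, "WARN": 3}
candidate = {"ERROR": 2, "WARN": 2}

def is_regression(base: dict, cand: dict) -> bool:
    """Candidate regresses if it raises more ERROR-severity issues than baseline."""
    return cand.get("ERROR", 0) > base.get("ERROR", 0)

print(is_regression(baseline, candidate))
```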

Replay a run

After a run completes, use the run_record.json to replay deterministically:

python -m dq_agent replay --run-record artifacts/<run_id>/run_record.json --strict

Resume a run with checkpoints

Each run writes a checkpoint.json alongside the other artifacts. If artifacts go missing, resume can repair them:

python -m dq_agent resume --run-dir artifacts/<run_id>

Example flow:

python -m dq_agent demo --output-dir artifacts
rm artifacts/<run_id>/report.md
python -m dq_agent resume --run-dir artifacts/<run_id>
python -m dq_agent validate --kind report --path artifacts/<run_id>/report.json
python -m dq_agent validate --kind run_record --path artifacts/<run_id>/run_record.json
python -m dq_agent validate --kind checkpoint --path artifacts/<run_id>/checkpoint.json

Schema + validation

Print JSON Schema for each output:

python -m dq_agent schema --kind report
python -m dq_agent schema --kind run_record
python -m dq_agent schema --kind checkpoint

Validate a generated output:

python -m dq_agent validate --kind report --path artifacts/<run_id>/report.json
python -m dq_agent validate --kind run_record --path artifacts/<run_id>/run_record.json
python -m dq_agent validate --kind checkpoint --path artifacts/<run_id>/checkpoint.json

Config format (YAML)

Minimal example:

version: 1
dataset:
  name: demo_orders
  primary_key: [order_id]
  time_column: created_at

columns:
  order_id:
    type: string
    required: true
    checks:
      - unique: true

  user_id:
    type: string
    required: true
    checks:
      - not_null: { max_null_rate: 0.01 }
      - string_noise: { contains: ["*", "''"], max_rate: 0.0 }
    anomalies:
      - missing_rate: { max_rate: 0.02 }

  amount:
    type: float
    required: true
    checks:
      - range: { min: 0, max: 10000 }
    anomalies:
      - outlier_mad: { z: 6.0 }

  status:
    type: string
    required: true
    checks:
      - allowed_values: { values: ["PAID","REFUND","CANCEL","PENDING"] }

Demo config lives at: dq_agent/resources/demo_rules.yml.
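Since JSON configs are also accepted, the same structure can be written as JSON. A direct translation of part of the YAML above (abbreviated to two columns):

```json
{
  "version": 1,
  "dataset": {
    "name": "demo_orders",
    "primary_key": ["order_id"],
    "time_column": "created_at"
  },
  "columns": {
    "order_id": {
      "type": "string",
      "required": true,
      "checks": [{"unique": true}]
    },
    "amount": {
      "type": "float",
      "required": true,
      "checks": [{"range": {"min": 0, "max": 10000}}],
      "anomalies": [{"outlier_mad": {"z": 6.0}}]
    }
  }
}
```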

What it checks

Deterministic rules:

  • not_null
  • unique
  • range
  • allowed_values
  • string_noise (substring / regex pattern based)

Statistical anomalies:

  • outlier_mad (robust outliers via MAD z-score)
  • missing_rate (null-rate anomaly)
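The MAD z-score technique behind outlier_mad can be sketched in a few lines. This is an illustration of the general method, not the project's implementation:

```python
import statistics

def mad_outliers(values, z=6.0):
    """Flag values whose robust z-score (via MAD) exceeds z.

    The 0.6745 consistency constant makes the MAD approximate the
    standard deviation for normally distributed data.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # degenerate case: no spread to normalize by
    return [v for v in values if abs(0.6745 * (v - med) / mad) > z]

data = [10, 11, 9, 10, 12, 10, 500]
print(mad_outliers(data))  # the extreme value 500 is flagged
```

Unlike a mean/standard-deviation z-score, the median-based version is not dragged toward the outlier itself, which is why it is robust.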

Benchmarks (real labeled datasets)

We evaluate dq_agent as an error detector on datasets that have real labels via paired dirty.csv / clean.csv.

Metrics:

  • cell-level precision / recall / F1 (detect wrong cells)
  • row-level precision / recall / F1 (detect rows containing any wrong cell)
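These follow the standard precision/recall/F1 definitions over true-positive, false-positive, and false-negative counts. A sketch, using the Raha clean-profile micro cell counts reported below as a worked example:

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from tp/fp/fn counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Micro cell-level counts for the Raha clean profile (tp, fp, fn).
p, r, f1 = prf1(tp=16662, fp=3720, fn=2538)
print(round(f1, 6))  # matches the micro_cell_f1 value in the table
```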

Reproduce (Raha):

bash scripts/run_raha_and_save.sh

Notes:

  • Benchmark harness: scripts/eval_dirty_clean.py (auto-generates a config from profile data).
  • The string_noise check is enabled by default for open-domain string columns in the harness (use --no-string-noise to ablate).

Reproduce (PED):

bash scripts/run_ped_and_save.sh

To refresh the README benchmark block from benchmarks/ artifacts:

python scripts/update_readme_benchmarks.py

Raha (7 datasets; dirty vs clean profiles)

| profile | datasets | macro_cell_f1 | micro_cell_f1 | macro_row_f1 | micro_row_f1 | cell_tp/fp/fn | row_tp/fp/fn |
|---|---|---|---|---|---|---|---|
| clean | 7 | 0.817778 | 0.841898 | 0.915870 | 0.925301 | (16662, 3720, 2538) | (12288, 724, 1260) |
| dirty | 7 | 0.405580 | 0.303776 | 0.593699 | 0.466883 | (4183, 3488, 15686) | (4395, 714, 9323) |

Full breakdown: benchmarks/raha_compare.md.

Raha string-noise ablation (patterns: *, '')

| metric | base | union | Δ |
|---|---|---|---|
| macro_cell_f1 | 0.760961 | 0.817760 | 0.056798 |
| micro_cell_f1 | 0.806936 | 0.841877 | 0.034940 |
| macro_row_f1 | 0.859008 | 0.915870 | 0.056862 |
| micro_row_f1 | 0.876437 | 0.925301 | 0.048864 |

Largest per-dataset gain (from the committed compare file):

| dataset | base cell_f1 | union cell_f1 | Δ cell_f1 | base row_f1 | union row_f1 | Δ row_f1 |
|---|---|---|---|---|---|---|
| raha/tax | 0.319744 | 0.718032 | 0.398288 | 0.324004 | 0.722462 | 0.398457 |

Full breakdown: benchmarks/raha_noise_union/compare.md.

PED (additional dirty/clean datasets)

| profile | datasets | macro_cell_f1 | micro_cell_f1 | macro_row_f1 | micro_row_f1 | cell_tp/fp/fn | row_tp/fp/fn |
|---|---|---|---|---|---|---|---|
| clean | 14 | 0.846318 | 0.056526 | 0.857537 | 0.104772 | (40776, 917613, 443580) | (45356, 455137, 319955) |
| dirty | 14 | 0.200416 | 0.010962 | 0.212178 | 0.035870 | (7726, 917533, 476630) | (14983, 455111, 350328) |

Full breakdown: benchmarks/ped_compare.md.

Dev / Tests

pip install -e ".[test]"
python -m pytest -q

If you see:

  • ModuleNotFoundError: typer / No module named pytest

you most likely forgot to activate the virtual environment:

source .venv/bin/activate

Project layout (high-level)

  • dq_agent/cli.py – Typer CLI (run, demo)
  • dq_agent/loader.py – CSV/Parquet loader
  • dq_agent/config.py – config loading (YAML/JSON)
  • dq_agent/contract.py – contract validation
  • dq_agent/rules/ – deterministic checks (registry-based)
  • dq_agent/anomalies/ – anomaly detectors (registry-based)
  • dq_agent/report/ – JSON + Markdown report writers
  • tests/ – unit + integration tests

Spec / Roadmap

Full design doc: A0_SPEC.md

Project docs

  • CHANGELOG.md
  • CONTRIBUTING.md
  • CODE_OF_CONDUCT.md
  • SECURITY.md
  • docs/publishing.md

License

Apache-2.0
