Open, audit-grade agentic data quality framework with portable industry packs
Aegis DQ
Open-source agentic data quality: validate, diagnose, and explain data failures — with an LLM that tells you exactly why.
$ aegis run rules.yaml
Aegis DQ — loading rules from rules.yaml
Loaded 3 rules
LLM: Anthropic (claude-haiku-4-5-20251001)
╭─────────────────────────────────────────────────╮
│             Aegis Validation Report             │
├──────────────────┬──────────────────────────────┤
│ Metric           │ Value                        │
├──────────────────┼──────────────────────────────┤
│ Rules checked    │ 3                            │
│ Passed           │ 2                            │
│ Failed           │ 1                            │
│ Pass rate        │ 66.67%                       │
│ LLM cost         │ $0.000183                    │
╰──────────────────┴──────────────────────────────╯
Failures:
orders_no_nulls (critical) — orders
  Rows failed: 47 / 10,000
  Explanation: 47 rows have NULL order_id, violating the completeness rule.
  Likely cause: ETL pipeline failed to populate order_id for orders placed via
                the mobile API between 2024-01-14 02:00–04:00 UTC.
  Action: Re-run the mobile-api ingestion job for that window and
          backfill the missing order_ids from the events table.
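The summary figures in the report above follow from simple counts. As a minimal sketch (the `summarize` function and its dictionary keys are illustrative names, not part of the Aegis API), the pass rate is the number of passed rules divided by the number of rules checked:

```python
def summarize(results):
    """Summarize a list of per-rule outcomes (True means the rule passed)."""
    checked = len(results)
    passed = sum(1 for ok in results if ok)
    failed = checked - passed
    # Guard against an empty rule set to avoid dividing by zero.
    pass_rate = 100.0 * passed / checked if checked else 0.0
    return {
        "checked": checked,
        "passed": passed,
        "failed": failed,
        "pass_rate": f"{pass_rate:.2f}%",
    }


# Three rules, one of which failed, as in the sample report.
summary = summarize([True, True, False])
print(summary)
```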
Why Aegis?
| | Aegis DQ | Great Expectations / Soda | Monte Carlo / Anomalo |
|---|---|---|---|
| Open source | ✅ Apache 2.0 | ✅ | ❌ Commercial |
| Agentic LLM diagnosis + RCA | ✅ | ❌ | ✅ Proprietary |
| Audit trail (per-decision log) | ✅ | Partial | ✅ Proprietary |
| Pluggable LLM (Anthropic, OpenAI, Ollama) | ✅ | ❌ | ❌ |
| dbt integration | ✅ | ✅ | Partial |
| Portable open rule standard | ✅ | Partial | ❌ |
Install
pip install aegis-dq
| Extra | What it adds |
|---|---|
| aegis-dq[bigquery] | BigQuery adapter |
| aegis-dq[databricks] | Databricks adapter |
| aegis-dq[athena] | AWS Athena adapter |
| aegis-dq[openai] | OpenAI LLM provider |
| aegis-dq[ollama] | Ollama (local) LLM provider |
| aegis-dq[airflow] | Airflow AegisOperator |
| aegis-dq[mcp] | MCP server for Claude Desktop |
5-minute quickstart
pip install aegis-dq
Seed a demo DuckDB database:
import duckdb

# Create (or open) a local DuckDB file and seed 10,000 clean orders.
con = duckdb.connect("demo.db")
con.execute("""
    CREATE TABLE orders AS
    SELECT i AS order_id, 'placed' AS status, i * 9.99 AS revenue
    FROM range(1, 10001) t(i)
""")
# Introduce some bad data.
# NULL out every 200th order_id.
con.execute("UPDATE orders SET order_id = NULL WHERE order_id % 200 = 0")
# Negative revenue for every 500th order_id that is still non-NULL.
con.execute("UPDATE orders SET revenue = -5.00 WHERE order_id % 500 = 0")
con.close()
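The second UPDATE filters on order_id after some ids have already been set to NULL, so it touches fewer rows than the modulus alone suggests. The sketch below models the two statements in plain Python, with `None` standing in for SQL NULL, to show how many bad rows the seed script should leave behind. It makes no DuckDB calls and only mirrors the SQL semantics.

```python
# Model each row as a dict; None stands in for SQL NULL.
rows = [{"order_id": i, "revenue": i * 9.99} for i in range(1, 10001)]

# UPDATE orders SET order_id = NULL WHERE order_id % 200 = 0
for row in rows:
    if row["order_id"] is not None and row["order_id"] % 200 == 0:
        row["order_id"] = None

# UPDATE orders SET revenue = -5.00 WHERE order_id % 500 = 0
# A NULL order_id never satisfies the WHERE clause, so multiples of 1000
# (already set to NULL above) are skipped.
for row in rows:
    if row["order_id"] is not None and row["order_id"] % 500 == 0:
        row["revenue"] = -5.00

null_ids = sum(1 for row in rows if row["order_id"] is None)
negative_revenue = sum(1 for row in rows if row["revenue"] < 0)
print(null_ids, negative_revenue)  # 50 10
```

A not-null rule on order_id should therefore report 50 failing rows for this demo database, and a non-negative revenue rule should report 10.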
Generate a starter rules file and run:
# create rules.yaml
aegis init
# edit rules.yaml — set warehouse: duckdb and table: orders
# then run validation
export ANTHROPIC_API_KEY=sk-ant-...
aegis run rules.yaml --db demo.db
Run without an API key (validation only, no LLM diagnosis):
aegis run rules.yaml --db demo.db --no-llm
Pipeline
Every aegis run passes your data through a 7-node LangGraph pipeline:
rules.yaml
  │
  ▼
plan → execute → reconcile → classify → diagnose → rca → report
          │                      │          │       │       │
       28 rule               heuristic  LLM asks lineage JSON +
        types                  + LLM     "why?"  context  Slack
- plan — parse and validate rules.yaml, build an execution graph
- execute — run all 28 rule types against your warehouse
- reconcile — compare results against expected thresholds
- classify — heuristic triage (severity, category, affected rows)
- diagnose — LLM writes a plain-English explanation per failure
- rca — root-cause analysis using lineage context and run history
- report — structured JSON + optional Slack notification
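The seven stages above can be pictured as functions that each take a shared state and return it enriched. The following is a plain-Python sketch of that idea, not the LangGraph graph Aegis actually builds. Every name in it (`run_pipeline`, the state keys, `check`, `max_failing_rows`) is hypothetical, and the diagnose and rca stages are stubs where the real pipeline would call an LLM and consult lineage.

```python
def plan(state):
    # Record the order in which rules will run.
    state["plan"] = [rule["id"] for rule in state["rules"]]
    return state


def execute(state):
    # Count failing rows per rule by applying its check to every row.
    state["failing_rows"] = {
        rule["id"]: sum(1 for row in state["rows"] if not rule["check"](row))
        for rule in state["rules"]
    }
    return state


def reconcile(state):
    # A rule fails when its failing-row count exceeds its allowed threshold.
    state["failed"] = [
        rule["id"] for rule in state["rules"]
        if state["failing_rows"][rule["id"]] > rule.get("max_failing_rows", 0)
    ]
    return state


def classify(state):
    # Heuristic triage: carry over each failed rule's declared severity.
    severity = {rule["id"]: rule["severity"] for rule in state["rules"]}
    state["triage"] = {rule_id: severity[rule_id] for rule_id in state["failed"]}
    return state


def diagnose(state):
    # Stand-in for the LLM call: a templated explanation per failure.
    state["explanations"] = {
        rule_id: f"{state['failing_rows'][rule_id]} rows violate {rule_id}"
        for rule_id in state["failed"]
    }
    return state


def rca(state):
    # Stand-in for root-cause analysis; no lineage or run history here.
    state["root_causes"] = {
        rule_id: "not analysed in this sketch" for rule_id in state["failed"]
    }
    return state


def report(state):
    checked = len(state["plan"])
    state["report"] = {
        "rules_checked": checked,
        "passed": checked - len(state["failed"]),
        "failed": state["failed"],
        "explanations": state["explanations"],
    }
    return state


STAGES = [plan, execute, reconcile, classify, diagnose, rca, report]


def run_pipeline(rules, rows):
    state = {"rules": rules, "rows": rows}
    for stage in STAGES:
        state = stage(state)
    return state["report"]


rules = [
    {"id": "orders_no_nulls", "severity": "critical",
     "check": lambda row: row["order_id"] is not None},
    {"id": "orders_revenue_non_negative", "severity": "critical",
     "check": lambda row: row["revenue"] >= 0},
]
rows = [
    {"order_id": 1, "revenue": 9.99},
    {"order_id": None, "revenue": 19.98},
    {"order_id": 3, "revenue": 29.97},
]
result = run_pipeline(rules, rows)
print(result)
```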
Rule types (28 total)
| Category | Types |
|---|---|
| Completeness | not_null not_empty_string null_percentage_below |
| Uniqueness | unique composite_unique duplicate_percentage_below |
| Validity | sql_expression between min_value_check max_value_check regex_match accepted_values not_accepted_values no_future_dates column_exists |
| Referential | foreign_key conditional_not_null |
| Statistical | mean_between stddev_below column_sum_between |
| Timeliness | freshness date_order |
| Volume | row_count row_count_between custom_sql |
| Cross-table | row_count_match column_sum_match set_inclusion set_equality |
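To give a feel for what a few of these rule types check, here is a small illustrative sketch in plain Python. It reflects the usual meaning of `not_null`, `unique`, and `accepted_values`, each returning a count of failing rows. The function signatures are assumptions made for this example and are not Aegis's implementation, which runs against the warehouse.

```python
from collections import Counter


def not_null(rows, column):
    """Count rows where the column is missing."""
    return sum(1 for row in rows if row[column] is None)


def unique(rows, column):
    """Count rows whose non-null value is shared with at least one other row."""
    counts = Counter(row[column] for row in rows if row[column] is not None)
    return sum(n for n in counts.values() if n > 1)


def accepted_values(rows, column, values):
    """Count rows whose non-null value is outside the allowed set."""
    allowed = set(values)
    return sum(
        1 for row in rows
        if row[column] is not None and row[column] not in allowed
    )


rows = [
    {"order_id": 1, "status": "placed"},
    {"order_id": 2, "status": "shipped"},
    {"order_id": 2, "status": "lost"},
    {"order_id": None, "status": "placed"},
]

failures = {
    "not_null": not_null(rows, "order_id"),
    "unique": unique(rows, "order_id"),
    "accepted_values": accepted_values(rows, "status", ["placed", "shipped"]),
}
print(failures)  # {'not_null': 1, 'unique': 2, 'accepted_values': 1}
```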
Example rule:
rules:
  - apiVersion: aegis.dev/v1
    kind: DataQualityRule
    metadata:
      id: orders_revenue_non_negative
      severity: critical
      owner: revenue-team
      tags: [revenue, validity]
    scope:
      warehouse: duckdb
      table: orders
    logic:
      type: sql_expression
      expression: "revenue >= 0"
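One plausible way to evaluate a `sql_expression` rule is to count the rows for which the expression is false. The sketch below assumes that approach. It uses Python's built-in `sqlite3` as a dependency-free stand-in for the warehouse, and the helper `failing_rows_query` is a hypothetical name; the SQL that Aegis actually generates may differ.

```python
import sqlite3

# The example rule, written as a Python dict to avoid a YAML dependency.
rule = {
    "metadata": {"id": "orders_revenue_non_negative", "severity": "critical"},
    "scope": {"warehouse": "duckdb", "table": "orders"},
    "logic": {"type": "sql_expression", "expression": "revenue >= 0"},
}


def failing_rows_query(rule):
    """Build a query counting rows for which the rule's expression is false."""
    table = rule["scope"]["table"]
    expression = rule["logic"]["expression"]
    # The table and expression come from a trusted rules file, so they are
    # interpolated directly. Rows where the expression is NULL are not counted.
    return f"SELECT COUNT(*) FROM {table} WHERE NOT ({expression})"


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, revenue REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 9.99), (2, -5.0), (3, 29.97), (4, -5.0)],
)
failed = con.execute(failing_rows_query(rule)).fetchone()[0]
total = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
con.close()

print(f"Rows failed: {failed} / {total}")  # Rows failed: 2 / 4
```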
Warehouse adapters
| Adapter | Install | Status |
|---|---|---|
| DuckDB | built-in | ✅ |
| BigQuery | aegis-dq[bigquery] | ✅ |
| Databricks | aegis-dq[databricks] | ✅ |
| AWS Athena | aegis-dq[athena] | ✅ |
| Snowflake | aegis-dq[snowflake] | 🚧 coming v1.0 |
| Postgres / Redshift | aegis-dq[postgres] | 🚧 v1.0 |
LLM providers
| Provider | Install | Default model |
|---|---|---|
| Anthropic (Claude) | built-in | claude-haiku-4-5 |
| OpenAI | aegis-dq[openai] | gpt-4o-mini |
| Ollama (local) | aegis-dq[ollama] | llama3.2 |
Switch providers at the CLI:
aegis run rules.yaml --llm openai --llm-model gpt-4o
aegis run rules.yaml --llm ollama --llm-model llama3.2
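The interaction between `--llm` and `--llm-model` amounts to "use the override if given, otherwise the provider's default". A minimal sketch of that resolution, using the defaults from the table above (the `resolve_model` function is a hypothetical helper, not an Aegis API):

```python
# Default model per provider, taken from the providers table.
DEFAULT_MODELS = {
    "anthropic": "claude-haiku-4-5",
    "openai": "gpt-4o-mini",
    "ollama": "llama3.2",
}


def resolve_model(provider="anthropic", model=None):
    """Return the model to use: an explicit override wins over the default."""
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"unknown LLM provider: {provider!r}")
    return model or DEFAULT_MODELS[provider]


print(resolve_model())                    # claude-haiku-4-5
print(resolve_model("openai"))            # gpt-4o-mini
print(resolve_model("openai", "gpt-4o"))  # gpt-4o
```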
Integrations
| Integration | What it does |
|---|---|
| aegis-dq[airflow] | AegisOperator: drop-in Airflow task |
| aegis-dq[mcp] | MCP server for Claude Desktop / tool use |
| aegis dbt generate | Convert dbt manifest.json to Aegis rules |
| GitHub Action (#27) | CI/CD gate on PRs (coming v1.0) |
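To illustrate what converting a dbt manifest to rules could involve, the sketch below walks a toy manifest fragment and maps dbt's built-in generic tests onto Aegis rule types with matching names. Everything here is an assumption made for illustration: the mapping table, the `rules_from_manifest` helper, the generated rule ids, and the `column` key under `logic`. The real `aegis dbt generate` command may behave differently, and real manifests carry many more fields.

```python
# Assumed mapping from dbt generic test names to Aegis rule types.
DBT_TO_AEGIS = {
    "not_null": "not_null",
    "unique": "unique",
    "accepted_values": "accepted_values",
    "relationships": "foreign_key",
}


def rules_from_manifest(manifest):
    """Turn generic dbt tests into rule dicts shaped like the YAML example."""
    rules = []
    for node in manifest.get("nodes", {}).values():
        meta = node.get("test_metadata")
        if node.get("resource_type") != "test" or not meta:
            continue
        rule_type = DBT_TO_AEGIS.get(meta["name"])
        if rule_type is None:
            continue  # skip tests with no obvious counterpart
        column = meta["kwargs"]["column_name"]
        # "model.shop.orders" -> "orders"; this sketch takes the first
        # dependency, which is too naive for two-model tests.
        table = node["depends_on"]["nodes"][0].split(".")[-1]
        rules.append({
            "apiVersion": "aegis.dev/v1",
            "kind": "DataQualityRule",
            "metadata": {"id": f"{table}_{column}_{rule_type}"},
            "scope": {"table": table},
            "logic": {"type": rule_type, "column": column},  # "column" is assumed
        })
    return rules


# Toy manifest fragment (hypothetical project "shop").
manifest = {
    "nodes": {
        "model.shop.orders": {"resource_type": "model"},
        "test.shop.not_null_orders_order_id": {
            "resource_type": "test",
            "test_metadata": {"name": "not_null", "kwargs": {"column_name": "order_id"}},
            "depends_on": {"nodes": ["model.shop.orders"]},
        },
        "test.shop.unique_orders_order_id": {
            "resource_type": "test",
            "test_metadata": {"name": "unique", "kwargs": {"column_name": "order_id"}},
            "depends_on": {"nodes": ["model.shop.orders"]},
        },
    }
}

generated = rules_from_manifest(manifest)
for generated_rule in generated:
    print(generated_rule["metadata"]["id"], generated_rule["logic"]["type"])
```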
CLI reference
| Command | Description |
|---|---|
| aegis init | Generate a starter rules.yaml |
| aegis validate <config> | Check YAML syntax + schema (no warehouse needed) |
| aegis run <config> | Run validation, diagnose failures, produce a report |
| aegis rules list | Browse built-in rule templates |
| aegis audit trajectory <run-id> | Inspect the LLM decision trail for a past run |
| aegis audit search <query> | Full-text search across audit logs (FTS5) |
| aegis dbt generate <manifest> | Convert a dbt manifest to Aegis rules |
| aegis mcp serve | Start the MCP server for Claude Desktop |
aegis run flags:
| Flag | Default | Description |
|---|---|---|
| --db | :memory: | DuckDB file path |
| --llm | anthropic | LLM provider: anthropic, openai, or ollama |
| --llm-model | (provider default) | Override model name |
| --no-llm | false | Skip LLM diagnosis entirely |
| --output-json | (none) | Write full JSON report to file |
| --notify | (none) | Slack webhook URL |
| --notify-on | failures | When to notify: all, failures, or critical |
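The flags and defaults above can be modeled with `argparse`, which also makes the `--notify-on` choices concrete. This is a sketch of how such a CLI might be declared, not Aegis's actual command-line code. The `should_notify` helper encodes one reasonable reading of the three modes: `all` always notifies, `failures` notifies when any rule failed, and `critical` notifies only when a critical rule failed.

```python
import argparse


def build_parser():
    """Declare the `aegis run` flags with the defaults from the table."""
    parser = argparse.ArgumentParser(prog="aegis run")
    parser.add_argument("config", help="path to rules.yaml")
    parser.add_argument("--db", default=":memory:", help="DuckDB file path")
    parser.add_argument("--llm", default="anthropic",
                        choices=["anthropic", "openai", "ollama"])
    parser.add_argument("--llm-model", default=None)
    parser.add_argument("--no-llm", action="store_true")
    parser.add_argument("--output-json", default=None)
    parser.add_argument("--notify", default=None, help="Slack webhook URL")
    parser.add_argument("--notify-on", default="failures",
                        choices=["all", "failures", "critical"])
    return parser


def should_notify(notify_on, failed_severities):
    """Decide whether to send a notification (assumed semantics)."""
    if notify_on == "all":
        return True
    if notify_on == "critical":
        return "critical" in failed_severities
    # "failures": notify when at least one rule failed.
    return bool(failed_severities)


args = build_parser().parse_args(["rules.yaml", "--db", "demo.db", "--no-llm"])
print(args.db, args.llm, args.no_llm, args.notify_on)
print(should_notify(args.notify_on, ["warning"]))  # True
print(should_notify("critical", ["warning"]))      # False
```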
Roadmap
| Phase | Version | Items | Status |
|---|---|---|---|
| Foundation | v0.1 | Core agent, DuckDB, CLI, audit trail | ✅ Done |
| Differentiate | v0.5 | BigQuery, Databricks, Athena, Airflow, Ollama, RCA, ShareGPT export, FTS5 search, dbt, MCP | ✅ Done |
| Mature | v1.0 | Postgres, REST API, GitHub Action, parallel subagents, ML anomaly detection, banking/healthcare packs | 🚧 In progress |
Full issue tracker: github.com/aegis-dq/aegis-dq/issues
Contributing
Contributions are welcome. See CONTRIBUTING.md to get started.
Good first issues: label:good first issue
License
Apache 2.0