Open, audit-grade agentic data quality framework with portable industry packs


Aegis DQ


Aegis DQ Demo

Open-source agentic data quality: validate, diagnose, and explain data failures — with an LLM that tells you exactly why.


```
$ aegis run rules.yaml

Aegis DQ — loading rules from rules.yaml
Loaded 3 rules
LLM: Anthropic (claude-haiku-4-5-20251001)

╭─────────────────────────────────────────────────╮
│         Aegis Validation Report                 │
├──────────────────┬──────────────────────────────┤
│ Metric           │ Value                        │
├──────────────────┼──────────────────────────────┤
│ Rules checked    │ 3                            │
│ Passed           │ 2                            │
│ Failed           │ 1                            │
│ Pass rate        │ 66.67%                       │
│ LLM cost         │ $0.000183                    │
╰──────────────────┴──────────────────────────────╯

Failures:

  orders_no_nulls (critical) — orders
  Rows failed: 47 / 10,000
  Explanation:  47 rows have NULL order_id, violating the completeness rule.
  Likely cause: ETL pipeline failed to populate order_id for orders placed via
                the mobile API between 2024-01-14 02:00–04:00 UTC.
  Action:       Re-run the mobile-api ingestion job for that window and
                backfill the missing order_ids from the events table.
```

Why Aegis?

| Capability | Aegis DQ | Great Expectations / Soda | Monte Carlo / Anomalo |
|---|---|---|---|
| Open source | ✅ Apache 2.0 | ✅ | ❌ Commercial |
| Agentic LLM diagnosis + RCA | ✅ | | Proprietary |
| Audit trail (per-decision log) | ✅ | Partial | Proprietary |
| Pluggable LLM (Anthropic, OpenAI, Ollama) | ✅ | | |
| dbt integration | ✅ | Partial | |
| Portable open rule standard | ✅ | Partial | |

Install

```bash
pip install aegis-dq
```

| Extra | What it adds |
|---|---|
| `aegis-dq[bigquery]` | BigQuery adapter |
| `aegis-dq[databricks]` | Databricks adapter |
| `aegis-dq[athena]` | AWS Athena adapter |
| `aegis-dq[openai]` | OpenAI LLM provider |
| `aegis-dq[ollama]` | Ollama (local) LLM provider |
| `aegis-dq[airflow]` | Airflow `AegisOperator` |
| `aegis-dq[mcp]` | MCP server for Claude Desktop |

5-minute quickstart

```bash
pip install aegis-dq
```

Seed a demo DuckDB database:

```python
import duckdb

con = duckdb.connect("demo.db")
con.execute("""
    CREATE TABLE orders AS
    SELECT i AS order_id, 'placed' AS status, i * 9.99 AS revenue
    FROM range(1, 10001) t(i)
""")
# introduce some bad data
con.execute("UPDATE orders SET order_id = NULL WHERE order_id % 200 = 0")
con.execute("UPDATE orders SET revenue = -5.00 WHERE order_id % 500 = 0")
con.close()
```

Generate a starter rules file and run:

```bash
# create rules.yaml
aegis init

# edit rules.yaml — set warehouse: duckdb and table: orders
# then run validation
export ANTHROPIC_API_KEY=sk-ant-...
aegis run rules.yaml --db demo.db
```

Run without an API key (validation only, no LLM diagnosis):

```bash
aegis run rules.yaml --db demo.db --no-llm
```
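
For the demo table above, a minimal `rules.yaml` might look like the following. The layout follows the rule example later in this README; the exact starter file `aegis init` emits may differ, and the `column` field on the `not_null` rule is an assumed name:

```yaml
rules:
  - apiVersion: aegis.dev/v1
    kind: DataQualityRule
    metadata:
      id: orders_no_nulls
      severity: critical
    scope:
      warehouse: duckdb
      table: orders
    logic:
      type: not_null
      column: order_id   # assumed field name for this rule type
  - apiVersion: aegis.dev/v1
    kind: DataQualityRule
    metadata:
      id: orders_revenue_non_negative
      severity: critical
    scope:
      warehouse: duckdb
      table: orders
    logic:
      type: sql_expression
      expression: "revenue >= 0"
```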

Pipeline

Every aegis run passes your data through a 7-node LangGraph pipeline:

```
rules.yaml
    │
    ▼
  plan → execute → reconcile → classify → diagnose → rca → report
           │                       │           │        │       │
        28 rule               heuristic    LLM asks  lineage  JSON +
        types                  + LLM       "why?"    context  Slack
```

- **plan** — parse and validate `rules.yaml`, build an execution graph
- **execute** — run all 28 rule types against your warehouse
- **reconcile** — compare results against expected thresholds
- **classify** — heuristic triage (severity, category, affected rows)
- **diagnose** — LLM writes a plain-English explanation per failure
- **rca** — root-cause analysis using lineage context and run history
- **report** — structured JSON + optional Slack notification
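
Aegis wires these nodes up with LangGraph, but conceptually a run reduces to threading one shared state object through the seven nodes in order. A dependency-free sketch of that shape (all function names and state fields here are illustrative, not the real Aegis internals):

```python
from typing import Callable

State = dict  # stand-in for the shared pipeline state


def plan(s: State) -> State:
    # parse rules into executable check ids
    s["checks"] = [r["metadata"]["id"] for r in s["rules"]]
    return s


def execute(s: State) -> State:
    # run each check; outcomes are pre-seeded here instead of querying a warehouse
    s["results"] = {c: s["outcomes"].get(c, True) for c in s["checks"]}
    return s


def reconcile(s: State) -> State:
    # collect checks whose result missed the expected threshold
    s["failures"] = [c for c, ok in s["results"].items() if not ok]
    return s


def classify(s: State) -> State:
    # heuristic triage; real logic would weigh severity, category, affected rows
    s["severity"] = {c: "critical" for c in s["failures"]}
    return s


def diagnose(s: State) -> State:
    # where Aegis would ask the LLM "why?"; stubbed out here
    s["explanations"] = {c: f"{c} failed" for c in s["failures"]}
    return s


def rca(s: State) -> State:
    # root-cause analysis would use lineage context and run history
    s["root_causes"] = dict.fromkeys(s["failures"], "unknown")
    return s


def report(s: State) -> State:
    # final structured summary
    s["report"] = {"checked": len(s["checks"]), "failed": len(s["failures"])}
    return s


NODES: list[Callable[[State], State]] = [
    plan, execute, reconcile, classify, diagnose, rca, report,
]


def run(state: State) -> State:
    # a linear graph: each node's output state feeds the next node
    for node in NODES:
        state = node(state)
    return state
```

A linear chain like this is the degenerate case of a graph; LangGraph earns its keep once nodes branch (e.g. skipping `diagnose`/`rca` under `--no-llm`).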

Rule types (28 total)

| Category | Types |
|---|---|
| Completeness | `not_null` `not_empty_string` `null_percentage_below` |
| Uniqueness | `unique` `composite_unique` `duplicate_percentage_below` |
| Validity | `sql_expression` `between` `min_value_check` `max_value_check` `regex_match` `accepted_values` `not_accepted_values` `no_future_dates` `column_exists` |
| Referential | `foreign_key` `conditional_not_null` |
| Statistical | `mean_between` `stddev_below` `column_sum_between` |
| Timeliness | `freshness` `date_order` |
| Volume | `row_count` `row_count_between` `custom_sql` |
| Cross-table | `row_count_match` `column_sum_match` `set_inclusion` `set_equality` |

Example rule:

```yaml
rules:
  - apiVersion: aegis.dev/v1
    kind: DataQualityRule
    metadata:
      id: orders_revenue_non_negative
      severity: critical
      owner: revenue-team
      tags: [revenue, validity]
    scope:
      warehouse: duckdb
      table: orders
    logic:
      type: sql_expression
      expression: "revenue >= 0"
```
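
A `sql_expression` rule ultimately boils down to counting rows that violate the predicate. A minimal sketch of that evaluation using stdlib `sqlite3` in place of a warehouse adapter (the helper name is hypothetical, and whether NULL predicate results count as failures is a design choice — this sketch counts them):

```python
import sqlite3


def failing_rows(con: sqlite3.Connection, table: str, expression: str) -> int:
    # Rows where the predicate evaluates to false OR NULL both count as
    # failures, hence NOT COALESCE(expr, 0) rather than plain NOT expr.
    sql = f"SELECT COUNT(*) FROM {table} WHERE NOT COALESCE({expression}, 0)"
    return con.execute(sql).fetchone()[0]


# tiny fixture mirroring the quickstart's orders table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, revenue REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 9.99), (2, -5.00), (3, None)],
)
```

Here `failing_rows(con, "orders", "revenue >= 0")` returns 2: the negative row plus the NULL-revenue row, whose predicate is neither true nor false.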

Warehouse adapters

| Adapter | Install | Status |
|---|---|---|
| DuckDB | built-in | ✅ |
| BigQuery | `aegis-dq[bigquery]` | ✅ |
| Databricks | `aegis-dq[databricks]` | ✅ |
| AWS Athena | `aegis-dq[athena]` | ✅ |
| Snowflake | `aegis-dq[snowflake]` | 🚧 coming v1.0 |
| Postgres / Redshift | `aegis-dq[postgres]` | 🚧 v1.0 |

LLM providers

| Provider | Install | Default model |
|---|---|---|
| Anthropic (Claude) | built-in | `claude-haiku-4-5` |
| OpenAI | `aegis-dq[openai]` | `gpt-4o-mini` |
| Ollama (local) | `aegis-dq[ollama]` | `llama3.2` |

Switch providers at the CLI:

```bash
aegis run rules.yaml --llm openai --llm-model gpt-4o
aegis run rules.yaml --llm ollama --llm-model llama3.2
```

Integrations

| Integration | What it does |
|---|---|
| `aegis-dq[airflow]` | `AegisOperator` — drop-in Airflow task |
| `aegis-dq[mcp]` | MCP server for Claude Desktop / tool use |
| `aegis dbt generate` | Convert dbt `manifest.json` to Aegis rules |
| GitHub Action (#27) | CI/CD gate on PRs (coming v1.0) |

CLI reference

| Command | Description |
|---|---|
| `aegis init` | Generate a starter `rules.yaml` |
| `aegis validate <config>` | Check YAML syntax + schema (no warehouse needed) |
| `aegis run <config>` | Run validation, diagnose failures, produce a report |
| `aegis rules list` | Browse built-in rule templates |
| `aegis audit trajectory <run-id>` | Inspect the LLM decision trail for a past run |
| `aegis audit search <query>` | Full-text search across audit logs (FTS5) |
| `aegis dbt generate <manifest>` | Convert a dbt manifest to Aegis rules |
| `aegis mcp serve` | Start the MCP server for Claude Desktop |
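
`aegis audit search` leans on SQLite's FTS5 extension. A minimal sketch of how full-text search over an audit log works with stdlib `sqlite3` — the table and column names are hypothetical, not Aegis's actual schema:

```python
import sqlite3

# in-memory stand-in for the audit database
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE audit_log USING fts5(run_id, node, message)")
con.executemany(
    "INSERT INTO audit_log VALUES (?, ?, ?)",
    [
        ("run-001", "diagnose", "47 rows have NULL order_id"),
        ("run-001", "rca", "mobile API ingestion gap suspected"),
        ("run-002", "report", "all rules passed"),
    ],
)


def search(query: str) -> list:
    # MATCH is FTS5's full-text operator; bm25() ranks by relevance
    return con.execute(
        "SELECT run_id, node, message FROM audit_log "
        "WHERE audit_log MATCH ? ORDER BY bm25(audit_log)",
        (query,),
    ).fetchall()
```

For example, `search("ingestion")` pulls back only the RCA entry from `run-001`.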

aegis run flags:

| Flag | Default | Description |
|---|---|---|
| `--db` | `:memory:` | DuckDB file path |
| `--llm` | `anthropic` | LLM provider: `anthropic` \| `openai` \| `ollama` |
| `--llm-model` | (provider default) | Override model name |
| `--no-llm` | `false` | Skip LLM diagnosis entirely |
| `--output-json` | (none) | Write full JSON report to file |
| `--notify` | (none) | Slack webhook URL |
| `--notify-on` | `failures` | When to notify: `all` \| `failures` \| `critical` |
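
`--output-json` makes it easy to gate a pipeline on the report. A sketch of such a gate, mirroring the `--notify-on` levels — the report's field names below are assumptions for illustration, not a documented schema:

```python
import json

# hypothetical shape of a report written by `aegis run ... --output-json`
sample = json.loads("""
{
  "rules_checked": 3,
  "passed": 2,
  "failed": 1,
  "failures": [{"id": "orders_no_nulls", "severity": "critical"}]
}
""")


def should_block(report: dict, gate_on: str = "critical") -> bool:
    """Decide whether CI should fail, using the --notify-on levels."""
    if gate_on == "all":
        return True
    if gate_on == "failures":
        return report["failed"] > 0
    # gate_on == "critical": block only on critical-severity failures
    return any(f["severity"] == "critical" for f in report["failures"])
```

With the sample report, `should_block(sample, "critical")` is true, so a CI job reading the JSON file would exit non-zero.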

Roadmap

| Phase | Version | Items | Status |
|---|---|---|---|
| Foundation | v0.1 | Core agent, DuckDB, CLI, audit trail | ✅ Done |
| Differentiate | v0.5 | BigQuery, Databricks, Athena, Airflow, Ollama, RCA, ShareGPT export, FTS5 search, dbt, MCP | ✅ Done |
| Mature | v1.0 | Postgres, REST API, GitHub Action, parallel subagents, ML anomaly detection, banking/healthcare packs | 🚧 In progress |

Full issue tracker: github.com/aegis-dq/aegis-dq/issues


Contributing

Contributions are welcome. See CONTRIBUTING.md to get started.

Good first issues: label:good first issue

License

Apache 2.0
