Skip to main content

Agentic EDA for relational databases via natural language: safe SQL, multi-engine, multi-LLM, with governance, observability, and persistence built in.

Project description

๐Ÿ” OmniQuery Explorer

Agentic Exploratory Data Analysis for relational databases โ€” in plain English.

Ask your database questions in natural language. Get safe SQL, real rows, and a structured analytical report โ€” automatically.

Python 3.12 License: MIT CI Docker Architecture


๐Ÿ“– Table of Contents


๐Ÿ’ก Why OmniQuery?

Most NL-to-SQL tools stop at generating a query. That's the easy half. The hard half is everything around it:

  • โŒ The model invents column names that don't exist.
  • โŒ It writes a DELETE masquerading as a SELECT inside a CTE.
  • โŒ It nukes the warehouse with a 200M-row full scan.
  • โŒ It leaks email, ssn, or credit_card into a prompt or report.
  • โŒ It returns a CSV with no narrative, no chart, no follow-up.

OmniQuery Explorer is an end-to-end EDA platform that ships every guardrail you would otherwise have to build yourself: AST-based SQL hardening, cost gates, PII masking, query budgets, schema caching, automated SQL repair, structured reports, multi-LLM, multi-engine, observability, and a clean hexagonal architecture you can extend.

It's built for data analysts, platform engineers, and product teams who need fast exploration without sacrificing control, traceability, or architecture quality.


๐Ÿงญ What it does

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Plain-English      โ”‚ โ†’  โ”‚ Multi-agent      โ”‚ โ†’  โ”‚ Verified, safe   โ”‚ โ†’  โ”‚ Structured EDA   โ”‚
โ”‚ question           โ”‚    โ”‚ pipeline         โ”‚    โ”‚ SELECT execution โ”‚    โ”‚ report + chart   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Under the hood, a LangGraph state machine runs eight specialised agents:

Agent Responsibility
introspect Extract schema metadata (tables, columns, PKs, FKs).
profile Compute statistical profiles (rows, nulls, cardinality).
build_graph Build FK graph, run PageRank, rank tables.
propose_questions Suggest exploratory questions aligned with the domain.
generate_sql Two-phase generation: select tables โ†’ emit SQL.
execute_sql Run the SELECT under guard, timeout, and EXPLAIN gate.
fix_sql Self-heal on DB errors via bounded LLM repair loop.
generate_report Produce the markdown EDA narrative.

โœจ Feature highlights

๐Ÿ›ก๏ธ Bullet-proof SQL pipeline

  • AST-based read-only guard (sqlglot): rejects DML, DDL, CTE-wrapped DML, multiple statements, and a curated blocklist (pg_sleep, dblink, xp_cmdshell, utl_file, โ€ฆ).
  • Dialect-aware LIMIT / FETCH FIRST rewriting via AST manipulation โ€” never string concatenation.
  • Two-phase SQL generation: the model picks tables from the full list before it ever sees a column, killing the "hallucinated join key" failure mode.
  • Self-healing retry loop: on ProgrammingError/OperationalError, the verified DDL + the DB's own message are fed back to the LLM; bounded by configurable retries.
  • Per-statement timeout: SET LOCAL statement_timeout (Postgres), SET SESSION MAX_EXECUTION_TIME (MySQL), asyncio.wait_for fallback (Oracle).

๐ŸŒ Multi-engine, multi-LLM, multilingual

  • Engines: ๐Ÿ˜ PostgreSQL ยท ๐Ÿฌ MySQL/MariaDB ยท ๐ŸŸง Oracle ยท ๐Ÿชถ SQLite ยท ๐Ÿฆ† DuckDB.
  • LLMs: ๐Ÿฆ™ Ollama (fully local) ยท ๐Ÿค– OpenAI ยท ๐Ÿง  Anthropic ยท โ˜๏ธ AWS Bedrock ยท โ˜๏ธ Google Vertex AI. Switch with one env var; provider-specific retries via tenacity.
  • Languages: ๐Ÿ‡ฌ๐Ÿ‡ง English ยท ๐Ÿ‡ช๐Ÿ‡ธ Spanish, with LLM_LANGUAGE=auto to detect per question. The model answers questions, returns reports, and produces DB summaries in the same language the analyst asked in.
  • Pooled AsyncEngine: process-wide LRU cache, pool_pre_ping, recycle every 30 min. No engine churn per query.

๐Ÿ’ฐ Governance built-in

  • Cost guard: EXPLAIN (FORMAT JSON) on Postgres and MySQL rejects queries above configurable plan cost or estimated rows โ€” before they touch the data.
  • Budget tracker: caps queries and LLM tokens per session in memory.
  • PII policy: regex denylist redacts sensitive columns from the LLM prompt and masks values in returned rows. Default denylist covers email, ssn, password, credit_card, iban, phone, address, dob, api_key, secret, โ€ฆ

๐Ÿš€ Two first-class interfaces

  • CLI (omniquery) with ask, explore, suggest, profile, schema. Rich tables, charts, progress spinners.
  • HTTP API (omniquery-web โ†’ uvicorn):
    • POST /ask โ€” synchronous JSON.
    • POST /ask/stream โ€” Server-Sent Events streaming agent-by-agent (started, sql, rows, report, done).
    • POST /explore, POST /schema, GET /health.
    • API-key auth (X-API-Key) and token-bucket rate limiter per identity (key or IP).

๐Ÿ’พ Persistence + caching

  • Sessions, queries, reports stored in SQLite by default (Postgres in prod) and managed with Alembic migrations applied at boot.
  • Disk-backed cache for introspected schemas and embeddings โ€” keyed by SHA-256 fingerprint, TTL-driven.

๐Ÿ”ญ Observability

  • OpenTelemetry spans on every LangGraph node (agent.introspect, agent.generate_sql, โ€ฆ) and every LLM call (llm.call tagged with provider, model, call name). Toggle with OBS_OTEL_ENABLED, export via OTLP/HTTP.
  • Structured JSON logging with session_id / agent correlation throughout the pipeline.

๐Ÿ—๏ธ Production delivery

  • GitHub Actions CI: ruff + mypy + pytest + bandit + pip-audit.
  • Multi-arch Docker image (linux/amd64 + linux/arm64) published to GHCR on tag push, with SBOM and provenance attestations.
  • docker-compose.yml for the single-machine path; Kubernetes manifests under deploy/k8s/.
  • Typed configuration via Pydantic Settings โ€” every knob is type-checked and documented in src/omniquery/config.py.

๐Ÿ›๏ธ Architecture

OmniQuery follows Hexagonal Architecture (Ports & Adapters) with explicit DDD boundaries:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                         Driving adapters                           โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚  CLI (Typer)   โ”‚                       โ”‚  Web (FastAPI+SSE) โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
            โ”‚                                         โ”‚
            โ–ผ                                         โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                          Application                               โ”‚
โ”‚                                                                    โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚
โ”‚  โ”‚  RunEdaUseCase          โ”‚    โ”‚  EdaSessionGraph (LangGraph)โ”‚    โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                     โ”‚                               โ”‚
                     โ–ผ                               โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                  Domain ports (interfaces only)                    โ”‚
โ”‚                                                                    โ”‚
โ”‚   DatabasePort ยท LlmPort ยท EmbeddingPort ยท ProfilingPort           โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                     โ”‚                               โ”‚
                     โ–ผ                               โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                       Driven adapters                              โ”‚
โ”‚                                                                    โ”‚
โ”‚  db/        โ†’  postgres ยท mysql ยท oracle ยท sqlite ยท duckdb         โ”‚
โ”‚  llm/       โ†’  ollama ยท openai ยท anthropic                         โ”‚
โ”‚  graph/     โ†’  schema_graph_service ยท schema_linker                โ”‚
โ”‚  cache/     โ†’  disk_cache ยท cached_database ยท cached_embedding     โ”‚
โ”‚  governance/โ†’  sql_guard ยท cost_guard ยท pii_policy                 โ”‚
โ”‚  observability/โ†’  OpenTelemetry tracer                             โ”‚
โ”‚  persistence/โ†’  SQLAlchemy ORM ยท Alembic                           โ”‚
โ”‚  logging/   โ†’  structured JSON                                     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Why this matters

  • The domain layer has zero infrastructure imports. Swap PostgreSQL for DuckDB or Ollama for Anthropic by changing a single env var; the agents don't notice.
  • Every cross-cutting concern is a port. PII, cost, observability, persistence โ€” each lives in its own bounded module and is wired by the container at startup.
  • One container, many interfaces. The CLI and the FastAPI app share the same singleton Container (LLM client, profiler, schema linker, caches), so warmth and quotas carry across.

๐Ÿš€ Quickstart

Option A โ€” Run with Docker Compose (fastest)

git clone https://github.com/rsaglobaltech/omniquery-explorer.git
cd omniquery-explorer

docker compose up -d
docker compose exec ollama ollama pull llama3.2:latest

curl http://localhost:8000/health
# {"status":"ok","environment":"development","llm_provider":"ollama","llm_model":"llama3.2:latest"}

Option B โ€” Local install with uv

# 1. Install dependencies
uv sync

# 2. Configure the target database + LLM
export DATABASE_URL="postgresql+asyncpg://user:pwd@localhost:5432/db"
export LLM_PROVIDER=ollama
export LLM_MODEL=llama3.2:latest

# 3. Start the model server (local-first path)
ollama pull llama3.2:latest
ollama serve &

# 4. Ask a question
uv run omniquery ask "What are the top 10 customers by total orders?"

Option C โ€” Pull the published image

docker run --rm -p 8000:8000 \
  -e LLM_PROVIDER=openai \
  -e LLM_OPENAI_API_KEY=sk-... \
  -e DATABASE_URL='postgresql+asyncpg://user:pwd@host/db' \
  -e WEB_API_KEYS=secret-key \
  ghcr.io/rsaglobaltech/omniquery-explorer:0.1.0

๐Ÿงช Usage

๐Ÿ–ฅ๏ธ CLI

# Single natural-language EDA query
omniquery ask "Which customers spent the most last quarter?"

# Full exploration: schema โ†’ profile โ†’ propose questions โ†’ answer best one
omniquery explore

# Generate suggested EDA questions only
omniquery suggest

# Show statistical profile of the most important tables
omniquery profile --top 10

# Print the full schema (tables, columns, PKs, FKs)
omniquery schema

Override defaults with --url, --max-rows, or by exporting DATABASE_URL.

๐ŸŒ HTTP API

Synchronous JSON:

curl -X POST http://localhost:8000/ask \
  -H 'Content-Type: application/json' \
  -H 'X-API-Key: secret-key' \
  -d '{
    "question": "Top 5 products by revenue this year",
    "connection_url": "postgresql+asyncpg://user:pwd@host/db",
    "max_rows": 100
  }'

Server-Sent Events (one event per pipeline stage):

curl -N -X POST http://localhost:8000/ask/stream \
  -H 'Content-Type: application/json' \
  -H 'X-API-Key: secret-key' \
  -d '{"question":"...","connection_url":"..."}'

# event: started
# data: {"question":"..."}
#
# event: sql
# data: {"sql":"SELECT ..."}
#
# event: rows
# data: {"count":5,"preview":[...]}
#
# event: report
# data: {"markdown":"# Top 5 Products..."}
#
# event: done
# data: {"row_count":5}

OpenAPI docs are served at http://localhost:8000/docs.

๐Ÿ Python

The application layer is import-clean โ€” you can drive it directly:

import asyncio
from omniquery.domain.entities.eda_query import EdaQuery
from omniquery.infrastructure.container import get_container

async def main():
    container = get_container()
    use_case = container.eda_use_case("postgresql+asyncpg://user:pwd@host/db")
    result = await use_case.run_eda(
        EdaQuery(
            question="How many active subscriptions per plan?",
            connection_url="postgresql+asyncpg://user:pwd@host/db",
            max_rows=200,
        )
    )
    print(result.generated_sql)
    print(result.report)

asyncio.run(main())

โš™๏ธ Configuration

Every knob is a typed Pydantic Setting. Source of truth: src/omniquery/config.py.

๐Ÿ”Œ LLM provider

Variable Default Description
LLM_PROVIDER ollama ollama ยท openai ยท anthropic ยท bedrock ยท vertex.
LLM_MODEL llama3.2:latest Provider-specific model identifier.
LLM_EMBEDDING_MODEL nomic-embed-text Used by the semantic schema linker.
LLM_TIMEOUT 300.0 HTTP timeout in seconds.
LLM_MAX_RETRIES 3 Tenacity retry attempts on 5xx / 429.
LLM_LANGUAGE auto en ยท es ยท auto (detect per question).
LLM_OLLAMA_BASE_URL http://localhost:11434 Ollama HTTP endpoint.
LLM_OPENAI_API_KEY unset Required when provider=openai.
LLM_ANTHROPIC_API_KEY unset Required when provider=anthropic.
LLM_BEDROCK_REGION us-east-1 AWS region for Bedrock (creds from boto3 chain).
LLM_VERTEX_PROJECT unset GCP project id for Vertex AI (ADC for creds).
LLM_VERTEX_REGION us-east5 Vertex region.

๐Ÿ—„๏ธ Target database

Variable Default Description
DATABASE_URL unset Default connection URL for the CLI.
DB_STATEMENT_TIMEOUT_MS 30000 Per-statement timeout sent down to the engine.
DB_MAX_ROWS_DEFAULT 500 Default cap on rows returned (overridable per call).

๐ŸŒ Web adapter

Variable Default Description
WEB_API_KEYS unset Comma-separated allowlist; required in prod.
WEB_RATE_LIMIT_PER_MINUTE 60 Token-bucket rate per identity. 0 disables.
WEB_CORS_ORIGINS * Comma-separated CORS origins.
WEB_HOST / WEB_PORT 0.0.0.0 / 8000 Bind address used by omniquery-web.

๐Ÿ’ฐ Cost guard

Variable Default Description
COST_EXPLAIN_ENABLED false Enable the EXPLAIN plan gate.
COST_MAX_PLAN_COST 1_000_000 Engine planner units cap.
COST_MAX_PLAN_ROWS 50_000_000 Rejects table scans above this estimate.
COST_MAX_QUERIES_PER_SESSION 100 In-memory per-session query cap.
COST_MAX_TOKENS_PER_SESSION 1_000_000 In-memory per-session LLM token cap.

๐Ÿ” PII

Variable Default Description
PII_ENABLED true Master switch.
PII_DENYLIST_PATTERNS (curated regex; see config.py) Case-insensitive regex for column names.
PII_MASK_VALUE *** Replacement token in returned rows.

๐Ÿ’พ Persistence & cache

Variable Default Description
PERSIST_DATABASE_URL sqlite+aiosqlite:///.tmp/omniquery.db Move to Postgres in prod.
CACHE_ENABLED true Master cache toggle.
CACHE_DIR .tmp/cache Local cache root.
CACHE_SCHEMA_TTL_SECONDS 3600 Schema cache TTL.
CACHE_EMBEDDING_TTL_SECONDS 86400 Embedding cache TTL.

๐Ÿ”ญ Observability

Variable Default Description
OBS_OTEL_ENABLED false Enable OpenTelemetry export.
OBS_OTEL_ENDPOINT unset OTLP/HTTP collector URL (e.g. http://otel:4318/v1/traces).
OBS_LOG_LEVEL INFO Root log level.
OBS_LOG_PAYLOAD_LIMIT 2000 Truncation cap for logged prompts and SQL.

๐Ÿ›ก๏ธ Security & Governance

OmniQuery treats generating safe SQL as defence-in-depth, not the only barrier. Production deployments should still grant the app a read-only DB role. On top of that, the pipeline enforces:

  1. Single SELECT only: every SQL string is parsed with sqlglot. Anything that isn't a pure SELECT/UNION/WITH โ€ฆ SELECT is rejected.
  2. No DML inside CTEs: walks the AST to forbid INSERT/UPDATE/DELETE/MERGE anywhere in the tree.
  3. No DDL or admin commands: Create, Drop, Alter, TruncateTable, Command nodes raise SqlGuardError.
  4. No dangerous functions: pg_sleep, pg_read_file, dblink, lo_import, xp_cmdshell, utl_file, dbms_lock, โ€ฆ rejected by name.
  5. Per-statement timeout: applied to the session before the query runs.
  6. EXPLAIN cost gate (optional): Postgres EXPLAIN (FORMAT JSON) and MySQL EXPLAIN FORMAT=JSON veto queries whose planner estimate exceeds thresholds.
  7. Per-session quotas: query count and LLM token count tracked in memory; surpassing the cap fails fast without touching DB or LLM.
  8. PII redaction: sensitive columns are stripped from the schema the LLM sees, and replaced with *** in the rows returned to the caller.
  9. API-key auth + rate limiter on the HTTP adapter.
  10. CI security scans: bandit (medium+) on every push; pip-audit for CVEs.

๐Ÿ”ญ Observability

Every node in the LangGraph pipeline and every LLM call is wrapped in an OpenTelemetry span:

session
โ””โ”€โ”€ agent.introspect
โ””โ”€โ”€ agent.profile
โ””โ”€โ”€ agent.build_graph
โ””โ”€โ”€ agent.propose_questions
โ””โ”€โ”€ agent.generate_sql
    โ””โ”€โ”€ llm.call  (provider=ollama, call_name=table_selection)
    โ””โ”€โ”€ llm.call  (provider=ollama, call_name=generate_sql)
โ””โ”€โ”€ agent.execute_sql
โ””โ”€โ”€ agent.generate_report
    โ””โ”€โ”€ llm.call  (provider=ollama, call_name=generate_report)

Spans carry session_id, agent, and (for LLM calls) provider, model, call_name. Set OBS_OTEL_ENABLED=true and point OBS_OTEL_ENDPOINT at any OTLP/HTTP collector (Tempo, Jaeger, Honeycomb, Grafana Agent).

In parallel, structured JSON logging writes a record per agent transition with input/output snapshots and durations.


๐Ÿ“ Evaluation Harness

OmniQuery ships a pytest-driven text-to-SQL eval harness under tests/eval/. Each dataset is a YAML file pairing a fixture DB with NL questions and (optionally) ground-truth rows.

Metrics tracked per dataset:

  • execution_accuracy โ€” fraction of cases whose rows match the ground truth (or simply returned data when no ground truth is given).
  • fix_rate โ€” fraction of cases that needed the LLM repair loop.
  • latency_p50, latency_p95 โ€” wall-clock latency percentiles per case.

Run:

# Sanity tests (no LLM, run in default CI)
uv run pytest tests/eval/test_harness_meta.py -q

# Full harness against the configured provider (requires Ollama/OpenAI/Anthropic)
uv run pytest tests/eval -m eval -q

# Produce a baseline JSON report
uv run python -m tests.eval.runner tests/eval/datasets/ecommerce.yaml > baseline.json

See tests/eval/README.md for the dataset schema and how to add new cases.


๐Ÿšข Deployment

Two topologies are supported out of the box:

  1. Single machine โ€” docker compose up. Fits most teams running โ‰ค 10 analysts.
  2. Kubernetes โ€” manifests under deploy/k8s/ (namespace, configmap, secret template, deployment, service, optional ingress).

The full deployment guide โ€” env reference, production checklist, troubleshooting table, smoke-test snippets โ€” lives in docs/DEPLOYMENT.md.

Production checklist

  • โœ… Pin the image to a published tag (ghcr.io/<org>/omniquery-explorer:vX.Y.Z), never :latest.
  • โœ… Use Postgres for persistence (PERSIST_DATABASE_URL=postgresql+asyncpg://...).
  • โœ… Set ENVIRONMENT=production so WEB_API_KEYS is enforced.
  • โœ… Mount the app under a read-only DB role.
  • โœ… Enable COST_EXPLAIN_ENABLED=true against large warehouses.
  • โœ… Wire OBS_OTEL_ENABLED=true + OBS_OTEL_ENDPOINT to your collector.
  • โœ… Put a real gateway (Cloudflare, nginx, Envoy) in front for multi-pod rate limiting and TLS termination.

๐Ÿ“‚ Project Structure

src/omniquery/
โ”œโ”€โ”€ adapters/
โ”‚   โ”œโ”€โ”€ cli/                  # Typer + Rich CLI
โ”‚   โ””โ”€โ”€ web/                  # FastAPI + SSE + rate limiter + API-key auth
โ”œโ”€โ”€ application/
โ”‚   โ”œโ”€โ”€ agents/               # LangGraph state machine (EdaSessionGraph)
โ”‚   โ””โ”€โ”€ use_cases/            # RunEdaUseCase
โ”œโ”€โ”€ domain/
โ”‚   โ”œโ”€โ”€ entities/             # Table, Column, DatabaseSchema, ScoredTable, โ€ฆ
โ”‚   โ””โ”€โ”€ ports/
โ”‚       โ”œโ”€โ”€ inbound/          # EdaUseCase
โ”‚       โ””โ”€โ”€ outbound/         # DatabasePort, LlmPort, EmbeddingPort, ProfilingPort
โ”œโ”€โ”€ infrastructure/
โ”‚   โ”œโ”€โ”€ db/                   # Postgres / MySQL / Oracle / SQLite / DuckDB
โ”‚   โ”‚                         # + engine_pool + sql_guard + statement_timeout
โ”‚   โ”‚                         # + sql_profiling_adapter
โ”‚   โ”œโ”€โ”€ llm/                  # Ollama / OpenAI / Anthropic + shared prompts
โ”‚   โ”œโ”€โ”€ graph/                # schema_graph_service + schema_linker
โ”‚   โ”œโ”€โ”€ cache/                # disk_cache + cached_database + cached_embedding
โ”‚   โ”œโ”€โ”€ governance/           # cost_guard + pii_policy
โ”‚   โ”œโ”€โ”€ observability/        # OpenTelemetry tracer
โ”‚   โ”œโ”€โ”€ persistence/          # SQLAlchemy ORM + Alembic migrations
โ”‚   โ””โ”€โ”€ logging/              # Structured JSON logging
โ””โ”€โ”€ config.py                 # Typed Pydantic Settings
deploy/k8s/                   # Production manifests
docs/DEPLOYMENT.md            # Deployment guide
tests/{unit,integration,e2e,eval}

๐Ÿ› ๏ธ Tech Stack

  • Language: Python 3.12 ยท uv for dep management.
  • Agents: LangGraph ยท LangChain Core.
  • LLMs: Ollama ยท OpenAI ยท Anthropic ยท tenacity for retry/backoff.
  • DB drivers: SQLAlchemy 2 async ยท asyncpg ยท aiomysql ยท oracledb ยท aiosqlite ยท duckdb_engine.
  • SQL hardening: sqlglot AST parser.
  • HTTP: FastAPI ยท uvicorn ยท httpx async.
  • CLI: Typer ยท Rich.
  • Graph / ranking: NetworkX ยท Matplotlib.
  • Config & validation: Pydantic 2 ยท pydantic-settings.
  • Persistence & migrations: SQLAlchemy ORM ยท Alembic.
  • Observability: OpenTelemetry SDK + OTLP/HTTP exporter.
  • CI/CD: GitHub Actions ยท ruff ยท mypy ยท pytest ยท bandit ยท pip-audit ยท Docker Buildx (multi-arch, SBOM, provenance).

๐Ÿ—บ๏ธ Roadmap

The full prioritised plan with status and commit references is in IMPROVEMENTS.md. Highlights of what's next:

  • ๐Ÿง  Semantic question cache (pgvector) for near-instant replies on similar prompts.
  • ๐Ÿ–ผ๏ธ Visualisation agent producing Vega-Lite specs.
  • ๐Ÿชœ Multi-hop join reasoning via Steiner trees over the FK graph.
  • ๐Ÿ—ฃ๏ธ Conversational memory between turns (LangGraph MemorySaver).
  • ๐Ÿข Workspaces + RBAC for multi-tenant deployments.
  • ๐Ÿ“Š BigQuery / Snowflake / MSSQL adapters.
  • ๐ŸŒ Web UI (Next.js).

๐Ÿค Contributing

Contributions are welcome. The repo enforces a small but firm bar:

  • Tests first. Every new public function ships with at least one unit test.
  • Type-safe by default. Curated modules pass mypy; new modules should join the list.
  • Pass CI locally before pushing:
    uv run ruff check src tests
    uv run mypy
    uv run pytest tests/unit tests/integration tests/eval/test_harness_meta.py -q
    uv run bandit -r src -c pyproject.toml -ll
    
  • Commit style: conventional commits (feat:, fix:, refactor:, docs:, test:, ci:).
  • Open a PR against develop. main only receives merges from develop at release time.

๐Ÿ“„ License

MIT โ€” see LICENSE.


Built with โค๏ธ for analysts who want answers, not boilerplate.

Quickstart ยท Architecture ยท Deployment ยท Roadmap

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

omniquery_explorer-0.2.0.tar.gz (29.6 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

omniquery_explorer-0.2.0-py3-none-any.whl (128.2 kB view details)

Uploaded Python 3

File details

Details for the file omniquery_explorer-0.2.0.tar.gz.

File metadata

  • Download URL: omniquery_explorer-0.2.0.tar.gz
  • Upload date:
  • Size: 29.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for omniquery_explorer-0.2.0.tar.gz
Algorithm Hash digest
SHA256 ffceeae68ecdc6c195bb18e7638a366877a1153185a3488c47fba0fb46d3b8f1
MD5 5d6e31865383580ef6fa8213bf8d6a7e
BLAKE2b-256 c3bdfc5d94ae3914c5e8abbd26611c7569605f1cf05eb6265adc1f673d4b1cee

See more details on using hashes here.

Provenance

The following attestation bundles were made for omniquery_explorer-0.2.0.tar.gz:

Publisher: release.yml on rsaglobaltech/omniquery-explorer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file omniquery_explorer-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for omniquery_explorer-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8eb29d7e8fb12134e39b09afc8f251ad63c3af6f5b8725e3be192444347d0c35
MD5 34d81c364e8b2749c2a339e626dd6410
BLAKE2b-256 39bec040b607415c12a9a279ea25ba9ec935ccd6eb03cd27e047a50736c01924

See more details on using hashes here.

Provenance

The following attestation bundles were made for omniquery_explorer-0.2.0-py3-none-any.whl:

Publisher: release.yml on rsaglobaltech/omniquery-explorer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page