Skip to main content

Deterministic, validation-first Text-to-SQL system for business databases

Project description

AetherDialect — validation-first Text-to-SQL for PostgreSQL and Databricks

aetherdialect turns analytical questions into read-only SELECT pipelines: a structured intent representation, multi-stage validation (including dialect EXPLAIN), template reuse from accepted answers, and bounded learning from rejections. The language model fills bounded slots in that intent; it does not author unconstrained SQL.

Why this exists

Teams need answers from relational data without shipping opaque generated SQL. AetherDialect targets analysts and integrators who want a repeatable path from question to result: the same question can return cached SQL with no model round-trip, schema drift surfaces as an explicit migration stop instead of silent breakage, and every generated statement is checked against the catalog and engine before it runs.

Install

pip install aetherdialect
pip install "aetherdialect[postgresql]"
pip install "aetherdialect[databricks]"
pip install "aetherdialect[postgresql,databricks]"

Requires Python 3.10 or newer. Configure the LLM and database via a TOML config_file (recommended) and/or process environment; the full key list lives in the API reference.

Quick start

from aetherdialect import SchemaContext, Text2SQL

t2s = Text2SQL(
    SchemaContext(),
    artifacts_dir="./my_run",
    config_file="./aetherdialect.toml",
)
t2s.run_interactive()

run_interactive prompts once per invocation; call it again for another question. For programmatic UIs, use Text2SQL.session() or Text2SQL.asession() and drive SessionStep objects — see the Integrator guide.

dry_run_warmup exercises a newline-delimited seed question file through validation and execution without persisting templates; see the User guide — Seed warmup.

What makes this different

  • Constant-learning cache: exact q_norm reuse returns SQL with zero LLM calls; near-paraphrases (token Levenshtein at most 2) reuse the same template with one bounded LLM call that only extracts parameters. (How it works)

  • Schema overrides are a JSON file you read, edit, and version. Every override (descriptions, roles, sensitivity, added or suppressed foreign keys, primary key endorsements) is replayed on every cache invalidation. (API reference)

  • Migration is never silent. When the catalog changes structurally, the engine writes a schema_migration_map.json skeleton and stops. You decide the action; it resumes. (User guide)

  • Generated SQL passes through four validation layers (intent JSON, dialect AST, schema/catalog alignment, dialect EXPLAIN). The LLM never emits raw SQL; it fills bounded slots in a structured intent IR. (Security)

  • Reader / writer split is built in. Many readers can ask questions; the engine drains write_queue.jsonl at the start of every writer-mode turn under the artifacts lock so learning persists without readers touching the partitioned template store files. (Integrator guide)

Documentation

Doc When to read it
User guide Install, first run, asking questions, notes, overrides, migration, seed warmup and dry-run warmup, pitfalls.
Integrator guide Embedding patterns, sessions, multi-turn relay, threading, reader/writer split and queue, audit and diagnostics, cache reset.
API reference Types, config_file TOML schema, methods, schema overrides JSON, diagnostic codes, exceptions.
How it works Architecture diagrams, schema build, engine storage, question pipeline, migration, overrides, validation, learning model, configuration, observability, warmup/QSim, offline-mock design pointer.
Offline testing and mock LLM (design) Planned mock LLM provider and fixture workflow for hermetic tests; links to dev_workspace/mock.txt (not on PyPI).
Security Threat model, LLM context inventory, on-disk inventory, sensitivity model, deny lists, raw-SQL impossibility, EXPLAIN gate, network.
Support matrix Per-engine table, IR-unsupported constructs and reformulations.

License

See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aetherdialect-0.1.7.tar.gz (958.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

aetherdialect-0.1.7-py3-none-any.whl (659.6 kB view details)

Uploaded Python 3

File details

Details for the file aetherdialect-0.1.7.tar.gz.

File metadata

  • Download URL: aetherdialect-0.1.7.tar.gz
  • Upload date:
  • Size: 958.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for aetherdialect-0.1.7.tar.gz
Algorithm Hash digest
SHA256 a58fafd3fe3972bf655ef19f56e4f2c1209ca6117778e8350938f4649154c2de
MD5 a248328482b6e5dff6f02175b219eebb
BLAKE2b-256 477cbc09f953bd39cb7acccce11ed9f5f5ed415237fc1c959ddc924eb4f3f3e1

See more details on using hashes here.

File details

Details for the file aetherdialect-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: aetherdialect-0.1.7-py3-none-any.whl
  • Upload date:
  • Size: 659.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for aetherdialect-0.1.7-py3-none-any.whl
Algorithm Hash digest
SHA256 328edc535d502a23084637db5a75ef403c6b1e7f8bb79e68e3065858c426f915
MD5 87457f0eaa330ba7f6079275636dc0aa
BLAKE2b-256 854a3186065bee1c9ebc247f7afe86555248801e010eb23998af43d418a012c2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page