Skip to main content

Deterministic, validation-first Text-to-SQL system for business databases

Project description

Deterministic, validation-first Text-to-SQL for business databases

This library turns analytical questions into read-only SELECT pipelines on PostgreSQL or Databricks: structured intent, heavy validation (including dialect AST and EXPLAIN), optional template reuse from accepted answers, and negative memory from rejections. When you construct Text2SQL, it checks database connectivity, LLM reachability, and whether on-disk artifacts still match the live schema.

Practical tips: Questions resolve more reliably when you state intent explicitly—entities, grain, filters, time scope, and ordering—instead of leaving those details implied. The same goes for optional domain notes (SchemaContext.notes_file, see API_REFERENCE.md): richer notes and clearer questions generally improve routing speed and SQL quality.

Internals: for how the schema graph is built, how columns become visible to the LLM, how joins are picked, and how stored artifacts migrate when the schema drifts, see OVERVIEW.md.

Installation

pip install aetherdialect
pip install "aetherdialect[postgresql]"
pip install "aetherdialect[databricks]"
pip install "aetherdialect[postgresql,databricks]"

Requires Python ≥ 3.10 and either an OpenAI API key or Azure OpenAI credentials. Construction verifies LLM connectivity for each distinct model or deployment the run uses; see API_REFERENCE.md for required variables, optional deployment-name overrides on Azure, and Databricks SQL warehouse vs PySpark.

Extra Brings in Use when
(base) SQLAlchemy (shared introspection / execution interface) Always installed
postgresql PostgreSQL driver (psycopg2-binary), pglast PostgreSQL via PG* / POSTGRES_* env (see API_REFERENCE.md)
databricks Databricks SQL connector (preferred), PySpark (fallback), databricks-sqlalchemy, sqlglot Databricks via DATABRICKS_* / related aliases (see API_REFERENCE.md)

SQL parsing for validation: PostgreSQL uses pglast for structural AST checks (join pairs, CTE bodies, ast_validate). Databricks / Spark SQL uses sqlglot with the Spark dialect.


Quickstart

from text2sql import SchemaContext, Text2SQL

t2s = Text2SQL(
    SchemaContext(include="tables"),
    artifacts_dir="./my_run",
    env_file=".env",
)

t2s.run_interactive()

Set database and LLM variables in the process environment or in env_file. The full matrix is in API_REFERENCE.md. Pass artifacts_dir= so artifacts are written under <root>/text2sql; when omitted, a platform user-data directory is used.

Interactive two ways: run_interactive() is a stdin loop. For your own UI or protocol, use Text2SQL.pipeline_session() with PipelineSession.ask and PipelineSession.step, which return SessionStep until done is true. Details are in API_REFERENCE.md.


When to use it

Good fit: star/snowflake-style models, clear foreign keys, repeated BI-style questions, PostgreSQL or Databricks.

Poor fit: schemas without relationships, heavy procedural logic, or expectations of arbitrary SQL outside the supported analytical subset (see What it supports and What it is not below).

Typical scale: The toolkit targets business-facing warehouses on the order of tens to hundreds of tables. If your catalog is much larger, narrow SchemaContext (allow_objects, include, deny lists) so the reflected graph matches how analysts work.


What this is

A validation-first layer for stable business schemas and repeated analytical questions, not open-ended “any SQL” generation.

Core idea: the system prefers reuse and rules over open-ended generation. Your question is matched to stored, validated patterns when possible; otherwise the model proposes structured analytical intent, which is repaired and checked against the schema, then turned into read-only SELECT SQL that must pass static and EXPLAIN gates before it runs.

  • Natural language is turned into a structured intent (tables, select expressions, filters, grouping, ordering, optional CTEs) that is shared across dialects; dialect-specific SQL is produced later.
  • Templates store previously accepted query patterns; negative memory records rejections so bad shapes are less likely to repeat.
  • LLM calls run at temperature 0; for the same inputs and schema state, behavior is repeatable.
  • Bounded LLM use: strong paths reuse templates or deterministic structure before asking the model for SQL.

What you can use it for

  • Repeated BI-style questions on a schema you control: faster turns when templates already fit, consistent SQL shape across users.
  • Governed analytical subset of SQL—not arbitrary strings—so reviewers have a predictable surface area.
  • Learning from usage: accepting good answers strengthens reuse; rejecting bad ones feeds negative memory. When the session asks what was wrong, your short free-text reason is normalized and categorized so similar intents are steered away more reliably than with a reject flag alone.
  • Bootstrapping coverage: seed warmup (gold intents from a seed file → expansion → validation/execute → optional NL for new templates) and QSim (reproducible synthetic questions from schema and profiles) for datasets, regression lists, or onboarding.
  • Operational feedback: construction checks database connectivity, LLM reachability for every configured model, and reconciles on-disk artifacts with the current schema (see Artifacts and migration below).

LLM: three fixed models

The library uses three named models internally. On Azure OpenAI you expose three deployments whose default names match those internal names, or you map each name to your deployment with optional env vars (API_REFERENCE.md lists the exact strings and variables).

When Text2SQL is constructed, a short completion is sent once per distinct configured model name (OpenAI) or deployment name (Azure), so bad keys, endpoints, or deployment maps fail immediately with ConfigError instead of halfway through a session.

When the process environment exposes credentials for both providers, OpenAI is tried first and Azure is used as the fallback when the OpenAI preflight fails; if neither preflight succeeds the construction raises ConfigError.


Database credentials and permissions

Treat credentials as you would for any read-only analyst account.

  • The engine needs to reflect the tables or views in your SchemaContext, run SELECT (and EXPLAIN) on generated queries, and execute the paths you enable (interactive display, warmup, etc.).
  • Least privilege is recommended: a role limited to SELECT (and whatever your database requires for EXPLAIN) on the objects you include. The library enforces an analytical SELECT-only policy in generated SQL, but that is not a substitute for database- and network-level security.
  • Scope matters: allow/deny lists and include settings restrict what is visible; they also feed fingerprinting so template stores stay aligned when you change scope (API_REFERENCE.md).

Artifacts and migration

Everything learned or cached for a connection “shape” lives under artifacts_dir (see Quickstart): resolved to <root>/text2sql, or a platform user-data directory if you omit artifacts_dir. That folder holds the schema snapshot, template store, QSim skeletons, seed-warmup cache, and a small manifest of fingerprints—not your raw database.

Each time Text2SQL(...) runs, the live schema graph is compared to the stored manifest. The outcome is one of four migration tiers (construction step 6 and the table below; for a conceptual walkthrough see OVERVIEW.md § Migration tiers):

Tier What it means for you
No change Stored fingerprints already match the live graph; existing templates and caches apply as-is.
Soft refresh The same structural scope changed only in ways that do not invalidate stored SQL shapes (for example, refreshed notes or profiling metadata). Fingerprints on disk are updated; templates are kept.
Remap The engine detected a rename-only style drift (tables/columns) within the same scope. Stored templates may be rewritten to the new names; a Migration applied summary is printed to stdout when the tier is not no_change.
Destructive The drift is too large to remap safely. Applying this tier clears persisted templates and related caches on disk (including the compressed schema snapshot, QSim skeletons, and seed-warmup zip) under that artifact root, then stamps a fresh manifest. Your database is not modified—only local learning files. The live graph in memory for the current process is already built; the next process run rewrites artifacts as needed.

When destructive migration is required, construction does not wipe disk immediately. The in-memory template store starts empty until you confirm on the first interactive or programmatic turn:

  • run_interactive(): you enter a question line as usual; ask then returns a SessionStep that asks for migration confirmation before that question runs. Answer on the next prompt; n returns a terminal SessionStep with kind=error and guidance to use a new artifacts_dir (or remove the old artifact tree) before continuing.
  • PipelineSession: the first ask may return a SessionStep whose prompt asks for confirmation (your question string is held until you confirm). Supply y / n via step. Declining migration returns the same terminal SessionStep error shape instead of raising.

After construction, the migration outcome is printed when non-trivial; use Text2SQL.show_config() for a redacted configuration snapshot. For backups, copy the whole <root>/text2sql directory; to reset learning, remove it or switch artifacts_dir.


How to improve results (without touching code)

  • Schema quality: declared foreign keys in the database, sensible types, and a stable star/snowflake-style layout are ideal. If production metadata is thin, the graph still gains inferred FK-style links from naming conventions where those rules apply, plus semantic join neighbors from profiled column value overlap—so imperfect warehouses get extra join signal, not only whatever the catalog declared.
  • Domain language: optional notes file (SchemaContext.notes_file) and concrete questions (entities, time range, grain) improve routing and SQL quality; see the opening paragraph of this README.
  • Scope: use allow/deny and include deliberately so the graph matches how analysts think about the warehouse; changing scope changes fingerprints and can trigger migration.
  • Operational learning: accept good SQL; when you reject, answer the “what was wrong?” prompt when it appears—the reason improves negative learning (see above). Use seed warmup or QSim to broaden template coverage in a controlled way.

Philosophy

  • Determinism over creativity — prefer a correct, boring plan to a novel one.
  • Correct joins over clever SQL — join paths come from declared and inferred foreign-key structure, precomputed paths, and profiled semantic links where applicable, not free-form guessing.
  • Validate before execute — schema checks, intent consistency, join shape, and dialect-safe read-only SELECT rules.
  • Minimal LLM surface — parse intent, resolve ambiguous joins when needed; everything else is rules and stores.
  • Safe defaults for non-analysts — narrow allowed SQL; you still choose database credentials (read-only vs write-capable is outside this library).

What it supports (at a glance)

Backends

  • PostgreSQL
  • Databricks

Schema

  • Load from live introspection (primary). Set SchemaContext.include to "tables" (base tables and materialized views), "views" (ordinary views only), or "both" (union of the two passes; ambiguous relation names are rejected when building the graph).
  • Optional CREATE TABLE file as extra or fallback (especially when you cannot reach all metadata from the driver).
  • Cached schema snapshot per connection fingerprint so restarts avoid re-reflecting unchanged databases.
  • Table roles (e.g. fact vs dimension), column roles (measure, categorical, temporal, identifier, etc.), filter / aggregation / HAVING allowances per column, value domains from profiling — all assigned when the graph is built (reflection, DDL, profiling, and optional notes).
  • Profiling captures the mode frequency for each column. When one value occupies ≥99% of the non-null distribution (a sentinel like 0, -1, or 'Unknown') the column is hidden from the LLM by the same gate as columns that are ≥99% null — sentinel-dominated columns carry no useful filter or grouping signal.
  • Optional human notes (plain text), via SchemaContext.notes_file (see API_REFERENCE.md): merged when the graph is built or when notes change; if the cache already contains notes and you omit notes_file on a later run, cached roles and hints are kept.
  • Optional deny_columns and allow_objects in SchemaContext; they participate in scope hashing so template stores reconcile when scope changes. Each deny_columns entry is either a qualified "table.column" (denies that exact column) or a bare "column" name (denies that column name on every table where it appears — qualify if you want one-table scope). Denied columns are hidden from the LLM context and rejected anywhere they would appear in the IR (bare select, filter, GROUP BY, HAVING, ORDER BY, aggregate).
  • Optional allow_columns in SchemaContext complements deny_columns: when non-empty, only the listed columns survive reflection. Same grammar as deny_columns (qualified or bare). Pragmatic auto-include: primary key columns and any column appearing in a foreign key edge (source or destination) are always retained so the join graph survives a narrow allow list. Participates in scope hashing.
  • Per-column sensitivity tag on ColumnMetadata accepts "pii" or "restricted". Both hide the column from the LLM context. "pii" additionally rejects bare select-list projection and GROUP BY references; aggregates and equality filters remain available. "restricted" hides from the LLM only — IR references that survive other validators are permitted.

Intent / SQL shape (analytical subset)

  • Queries: SELECT only (enforced with pattern checks and dialect parsing). CTEs reuse the same intent model as the outer query.
  • Joins: foreign keys and precomputed paths, plus profiled value-overlap links between columns when samples agree; when several valid paths could link the same tables, one coherent path is chosen for the whole query. CTEs act as virtual tables in join discovery; self-joins use CTEs instead of repeating the same base table.
  • Select list: bare columns, aggregates (COUNT, SUM, AVG, MIN, MAX, etc.), arithmetic and string expressions where the schema allows them, DISTINCT, and scalar functions subject to column metadata.
  • Filters / boolean logic: comparisons, AND / OR, IN, LIKE; ILIKE / NOT ILIKE on PostgreSQL only (intent stays dialect-agnostic; SQL rendering differs). Null / boolean value normalization in the deterministic intent and SQL preparation chain.
  • BETWEEN in intent is decomposed into a pair of comparable predicates.
  • Grouping / ordering: GROUP BY, HAVING (aggregate-aware), ORDER BY, LIMIT; rules tie grain (row-level vs grouped) to aggregates and grouped columns.
  • Dates: structured date_window (anchor unit + offset) and date-difference filters between columns where supported. Calendar grains supplied by the LLM are accepted in colloquial forms — monthly, quarterly, yearly, annual, daily, weekly, hourly, plurals, and short abbreviations like yr, mo, qtr, wk — and are folded internally to a single canonical token before SQL is rendered, so dialect emit functions see only day / week / month / quarter / year / hour / minute / second.
  • Windows: ROW_NUMBER, RANK, DENSE_RANK, and windowed SUM/AVG, LAG, LEAD, FIRST_VALUE, and LAST_VALUE on select columns (main query and CTEs).
  • CASE / WHEN: only in the select list in the intent model (not in WHERE / HAVING).
  • Arrays / lists: membership-style filters; SQL uses dialect-appropriate forms; optional UNNEST / EXPLODE-style expansion in CTE select lists for typed array columns.
  • Metadata: UNIQUE (and related) when reflection or DDL exposes it, for ranking “human readable” identifiers.

Operational modes

  • Interactive — ask questions, accept/reject, results export; via run_interactive() or a programmatic PipelineSession (see Quickstart above and API_REFERENCE.md).
  • Seed warmup — seed questions → gold intents → deterministic expansion (many operators, deduplicated) → validate/execute → NL question generation for new templates.
  • QSim — reproducible synthetic questions from schema and profiles (seeded randomness).

What it is not

  • Not a full SQL or stored-procedure generator: no UNION/INTERSECT/EXCEPT, no correlated subqueries, no EXISTS, no LATERAL, no DML/DDL, no arbitrary RIGHT/FULL OUTER join policy in the constrained path.
  • Not a substitute for database security (see Database credentials and permissions above).
  • Not schema-agnostic: quality depends on relationship signal (declared FKs, inferred links, semantic profile edges), sensible types, and optional notes for domain language.

How a question becomes SQL

  1. Template match — if a trusted pattern fits, reuse parameterized SQL (often no SQL LLM call).
  2. Intent parse — structured intent from the question + schema summary, then a long deterministic repair chain.
  3. Join resolution — choose among valid FK paths for the intent’s tables; disambiguation may use the LLM when multiple paths tie.
  4. SQL generation & validation — deterministic skeleton, injected joins, LLM fill under constraints where applicable, then semantic validation, SELECT-only / forbidden-pattern checks, dialect AST validation (pglast or sqlglot), and EXPLAIN.
  5. Execute (where the mode allows) and learn — accept → promote template trust; reject → record a categorized negative pattern (optionally after a short rejection reason when the UI or interactive loop asks for one).

Validation (layers)

  • Safety / shapeSELECT-only enforcement, configurable forbidden SQL substrings, then dialect ast_validate (pglast or sqlglot). EXPLAIN is used as an extra executability check. Intent handling is structured (contracts and parsers); a few narrow string passes still exist for SQL post-processing where AST coverage is not yet unified. Natural-language questions are not executed as raw SQL; they pass through structured intent and these gates, with values bound as parameters or validated literals rather than unconstrained interpolation into executable SQL (together with least-privilege roles, see What it is not).
  • Schema vs intent — tables/columns/CTEs, selectability, access and sensitivity policy, window / CASE / array shapes, filter and HAVING ops per column, aggregate roles in select/HAVING/ORDER BY, scalar function typing, filter value types vs column types, null/date-window/date-diff rules.
  • Semantic consistency — grouped queries require proper aggregation; contradictions and impossible HAVING; grain alignment.
  • Joins — paths use the FK graph, virtual CTE bridges, and stored semantic-profile edges where applicable; guarded path avoids ad-hoc join guessing.

Learning and reuse

  • Accepted templates — intent fingerprint, parameterized SQL, optional example question, trust that rises with validation and falls with rejection.
  • Rejected templates (“negative memory”) — failures are stored with categories (and optional user rejection reasons when collected) so similar bad intents are discouraged on later turns.
  • Loader reconciliation — when you open an existing template file, rows that no longer match the current graph (missing tables, columns, or join segments) are pruned, negative memory for removed rejects is cleared, and stale failure-log rows from older hashes are filtered before the store is saved for the current scope. Large fingerprint jumps are handled by the migration path above, not only this incremental prune.
  • Persistence lives next to the manifest under your artifacts_dir tree (API_REFERENCE.md); back it up or reset it as described under Artifacts and migration.

Seed warmup (brief)

  1. Parse each seed line into a gold intent.
  2. Expand with a fixed set of deterministic operators (filters, aggregates, joins, etc.), deduplicated by body similarity across depths.
  3. Classify gold rows against the template store; resolve joins per table set; validate and execute the full pool (subject to caps).
  4. Stratified sampling on execute successes, then an LLM step (up to three paraphrases) from the SQL with a realism filter.

Warmup loads at most 500 seed lines per run (fixed internal cap; larger files are truncated). run_warmup_preflight runs the same execute path without question LLM or template persistence and writes warmup_preflight_report_v{m}.json.


QSim (brief)

Generates reproducible question lists from the schema and profiled values. Same seed → same output. Use for regression-style testing or dataset building.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aetherdialect-0.1.6.tar.gz (823.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

aetherdialect-0.1.6-py3-none-any.whl (547.5 kB view details)

Uploaded Python 3

File details

Details for the file aetherdialect-0.1.6.tar.gz.

File metadata

  • Download URL: aetherdialect-0.1.6.tar.gz
  • Upload date:
  • Size: 823.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for aetherdialect-0.1.6.tar.gz
Algorithm Hash digest
SHA256 a26e01390bbb895aeb5881bd156607d5011ad09100e30377a7a747d115958672
MD5 efda0e18b850ebbdad5d9af6ed07b226
BLAKE2b-256 5b2bb0670b4e6e3569543a53492d4e49512748d569af2db587492e0988280bde

See more details on using hashes here.

File details

Details for the file aetherdialect-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: aetherdialect-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 547.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for aetherdialect-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 96db5fa756ab75df5302249ab8126f7f53c147fd8919d9c0984c1309df0d60ea
MD5 febcc989787d278f6910b8a3539cd846
BLAKE2b-256 5f26a5f5785d4cc7b5669a7444ba2ede319436ece6a9b8386baf9f31a7bc3c51

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page