Skip to main content

Deterministic, validation-first Text-to-SQL system for business databases

Project description

Deterministic, validation-first Text-to-SQL for business databases

Installation

pip install text2sql
pip install "text2sql[postgresql]"
pip install "text2sql[databricks]"
pip install "text2sql[postgresql,databricks]"

Requires Python ≥ 3.10 and an OpenAI API key.

Extra Brings in Use when
postgresql SQLAlchemy, PostgreSQL driver, pglast engine="postgresql"
databricks PySpark, Databricks SQL connector engine="databricks"

SQL parsing for validation: PostgreSQL uses pglast for structural AST checks (join pairs, CTE bodies, ast_validate). Databricks / Spark SQL uses sqlglot with the Spark dialect. The base package already depends on sqlglot; pglast is installed with the postgresql extra.


Quickstart

from text2sql import Text2SQL

t2s = Text2SQL(
    engine="postgresql",
    host="localhost",
    database="mydb",
    password="secret",
    openai_api_key="sk-...",
)

t2s.run_interactive()

Constructor options, modes, optional files, and the full method list are in USAGE.md.


What this is

A validation-first text-to-SQL layer for PostgreSQL and Databricks. It targets stable business schemas and repeated analytical questions, not open-ended “any SQL” generation.

  • Natural language is turned into a structured intent (tables, select expressions, filters, grouping, ordering, optional CTEs) that is shared across dialects; dialect-specific SQL is produced later.
  • Templates store previously accepted query patterns; negative memory records rejections so bad shapes are less likely to repeat.
  • LLM calls run at temperature 0; for the same inputs and schema state, behavior is repeatable.
  • Bounded LLM use: strong paths reuse templates or deterministic structure before asking the model for SQL.

Philosophy

  • Determinism over creativity — prefer a correct, boring plan to a novel one.
  • Correct joins over clever SQL — join paths come from foreign keys and precomputed paths, not free-form guessing.
  • Validate before execute — schema checks, intent consistency, join shape, and dialect-safe read-only SELECT rules.
  • Minimal LLM surface — parse intent, resolve ambiguous joins when needed, generate or repair SQL; everything else is rules and stores.
  • Safe defaults for non-analysts — narrow allowed SQL; you still choose database credentials (read-only vs write-capable is outside this library).

What it supports (at a glance)

Backends

  • PostgreSQL via SQLAlchemy.
  • Databricks via Unity Catalog introspection (with optional DDL file fallback when the catalog is empty).

Schema

  • Load from live introspection (primary).
  • Optional CREATE TABLE file as extra or fallback (especially when you cannot reach all metadata from the driver).
  • Cached schema snapshot per connection fingerprint so restarts avoid re-reflecting unchanged databases.
  • Table roles (e.g. fact vs dimension), column roles (measure, categorical, temporal, identifier, etc.), filter / aggregation / HAVING allowances per column, value domains from profiling.
  • Optional human notes (plain text) fed once when the schema graph is built: richer descriptions, roles, and optional sensitivity labels — without renaming tables or inventing columns.
  • Optional deny lists for tables or columns so they stay out of prompts and can be stripped from intents.

Intent / SQL shape (analytical subset)

  • Queries: SELECT only (enforced with pattern checks and dialect parsing). CTEs reuse the same intent model as the outer query.
  • Joins: only along FK-backed paths; join type for injected joins is chosen deterministically from table roles (e.g. dimension side as LEFT where applicable).
  • Select list: bare columns, aggregates (COUNT, SUM, AVG, MIN, MAX, etc.), arithmetic and string expressions where the schema allows them, DISTINCT, and scalar functions subject to column metadata.
  • Filters / boolean logic: comparisons, AND / OR, IN, LIKE; ILIKE / NOT ILIKE on PostgreSQL only (intent stays dialect-agnostic; SQL rendering differs). Null / boolean value normalization in the repair chain.
  • BETWEEN in intent is decomposed into a pair of comparable predicates.
  • Grouping / ordering: GROUP BY, HAVING (aggregate-aware), ORDER BY, LIMIT; rules tie grain (row-level vs grouped) to aggregates and grouped columns.
  • Dates: structured date_window (anchor unit + offset) and date-difference filters between columns where supported.
  • Windows: ROW_NUMBER, RANK, DENSE_RANK, and windowed SUM / AVG on select columns (main query and CTEs).
  • CASE / WHEN: only in the select list in the intent model (not in WHERE / HAVING).
  • Arrays / lists: membership-style filters; SQL uses dialect-appropriate forms; optional UNNEST / EXPLODE-style expansion in CTE select lists for typed array columns.
  • Metadata: UNIQUE (and related) when reflection or DDL exposes it, for ranking “human readable” identifiers.

Operational modes

  • Interactive — ask questions, accept/reject, results export.
  • Coverage simulator — seed questions → gold intents → deterministic expansion (many operators, deduplicated) → validate/execute → NL question generation for new templates.
  • QSim — reproducible synthetic questions from schema and profiles (seeded randomness).

What it is not

  • Not a full SQL or stored-procedure generator: no UNION/INTERSECT/EXCEPT, no correlated subqueries, no EXISTS, no LATERAL, no DML/DDL, no arbitrary RIGHT/FULL OUTER join policy in the constrained path.
  • Not a substitute for database security: use credentials with least privilege (SELECT / EXPLAIN as you prefer).
  • Not schema-agnostic: quality depends on FKs, sensible types, and optional notes for domain language.

How a question becomes SQL

  1. Template match — if a trusted pattern fits, reuse parameterized SQL (often no SQL LLM call).
  2. Intent parse — structured intent from the question + schema summary, then a long deterministic repair chain (see below).
  3. Join resolution — choose among valid FK paths for the intent’s tables; disambiguation may use the LLM when multiple paths tie.
  4. SQL generation & validation — deterministic skeleton, injected joins, LLM fill/repair under constraints, then semantic validation, SELECT-only / forbidden-pattern checks, dialect AST validation (pglast or sqlglot), and optionally EXPLAIN when an engine is available.
  5. Execute (where the mode allows) and learn — accept → promote template trust; reject → record negative pattern.

Validation (layers)

  • Safety / shapeSELECT-only enforcement, configurable forbidden SQL substrings, then dialect ast_validate (pglast on PostgreSQL, sqlglot (Spark) on Databricks). When a live engine is passed in, EXPLAIN can be used as an extra executability check.
  • Schema vs intent — tables/columns/CTEs, selectability, access and sensitivity policy, window / CASE / array shapes, filter and HAVING ops per column, aggregate roles in select/HAVING/ORDER BY, scalar function typing, filter value types vs column types, null/date-window/date-diff rules.
  • Semantic consistency — grouped queries require proper aggregation; contradictions and impossible HAVING; grain alignment (one of many cross-checks, not the only story).
  • Joins — paths must match the FK graph; guarded path avoids ad-hoc join guessing.

Deterministic repairs (after intent parse)

Applied in order (high level): COUNT(*) normalization; CTE naming and output aliases; qualify CTE outputs; sanitize table names; grain rules for grouped CTE usage and grain consistency; strip redundant GROUP BY; normalize filters/HAVING; null-equality fixes; strip join-condition leakage into filters; per-CTE sort order; simplify expressions; IN value normalization; date-diff classification and raw-value fixes; BETWEEN → paired predicates; auto-repair filters/HAVING; strip impossible HAVING; FK filter type repair; filter value case / enum alignment; boolean and null filter values; expand FK selects to descriptive columns; deduplicate contradictory filters; redundant PK re-qualification; window, CASE, and array intent repairs; sensitivity and access policy enforcement on the intent.


Learning and reuse

  • Accepted templates — intent fingerprint, parameterized SQL, optional example question, trust that rises with validation and falls with rejection.
  • Rejected templates — categorized failures so similar bad intents are discouraged.
  • Persistence is under a per-connection artifact directory (see USAGE); you can back it up or reset it by removing that directory.

Coverage simulator (brief)

  1. Parse each seed line into a gold intent.
  2. Expand with a fixed set of deterministic operators (filters, aggregates, joins, time windows, numeric transforms, distinct/limit/OR-groups, expressions, window variants, etc.), deduplicated across depths.
  3. Resolve joins once per table set where possible; validate and execute as a gate.
  4. One LLM step to produce a natural language question from the SQL, with a realism filter.

The simulator loads at most 500 seed lines per run (fixed internal cap; larger files are truncated). estimate_simulator_costs uses the same loader, so its seed count reflects that cap. It returns rough LLM-call and execution estimates from a seed file and schema stats.


QSim (brief)

Generates reproducible question lists from the schema and profiled values. Same seed → same output. Use for regression-style testing or dataset building.


When to use it

Good fit: star/snowflake-style models, clear FKs, repeated BI-style questions, PostgreSQL or Databricks.

Poor fit: schemas without relationships, heavy procedural logic, or expectations of arbitrary SQL features outside the supported analytical subset.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aetherdialect-0.1.0.tar.gz (426.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

aetherdialect-0.1.0-py3-none-any.whl (287.8 kB view details)

Uploaded Python 3

File details

Details for the file aetherdialect-0.1.0.tar.gz.

File metadata

  • Download URL: aetherdialect-0.1.0.tar.gz
  • Upload date:
  • Size: 426.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for aetherdialect-0.1.0.tar.gz
Algorithm Hash digest
SHA256 645ec51b8d1b1ee4f8388215939a709d6852cb4541bbc136b72c6afd04622f24
MD5 804a9a5cb775313c73d616e93dee1c7c
BLAKE2b-256 24b2ffeb82fffe40783f00ef77686eab8d6b1c280476f24770921be317f706e4

See more details on using hashes here.

File details

Details for the file aetherdialect-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: aetherdialect-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 287.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for aetherdialect-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3963f3b38f0785d9e34da285d8b31ede1179b5dbe263ee23deebd1dbfebae9ce
MD5 7aba3afb40098469b4fd36077da1ae55
BLAKE2b-256 f233255c760999159169dd33a6a31d0d52be6acd11249c81471c849cd43f1192

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page