Skip to main content

Deterministic, validation-first Text-to-SQL system for business databases

Project description

Deterministic, validation-first Text-to-SQL for business databases

Questions resolve more reliably when you state analytical intent explicitly—entities, grain, filters, time scope, and ordering—instead of leaving those details implied.

Installation

pip install text2sql
pip install "text2sql[postgresql]"
pip install "text2sql[databricks]"
pip install "text2sql[postgresql,databricks]"

Requires Python ≥ 3.10 and either an OpenAI API key or Azure OpenAI credentials.

Extra Brings in Use when
(base) SQLAlchemy (shared introspection / execution interface) Always installed
postgresql PostgreSQL driver (psycopg2-binary), pglast engine="postgresql"
databricks Databricks SQL connector (preferred), PySpark (fallback), databricks-sqlalchemy, sqlglot engine="databricks"

SQL parsing for validation: PostgreSQL uses pglast for structural AST checks (join pairs, CTE bodies, ast_validate). Databricks / Spark SQL uses sqlglot with the Spark dialect.


Quickstart

from text2sql import Text2SQL

t2s = Text2SQL(
    engine="postgresql",
    host="localhost",
    database="mydb",
    password="secret",
    openai_api_key="sk-...",
)

t2s.run_interactive()

Constructor options, credentials (set_openai_api_key, set_azure_openai_api_key, set_env), modes, and the full API are in USAGE.md. Pass artifacts_dir= to put the per-connection cache under a root you choose; otherwise it lives under the platform user-data directory (see USAGE.md).

Interactive two ways: run_interactive() is a stdin loop. For your own UI or protocol, use Text2SQL.pipeline_session() and drive PipelineSession step by step: one natural-language question is a turn that may return several SessionEvent objects (prompt / answer / …) until done is true. Types and methods are documented in USAGE.md.


What this is

A validation-first text-to-SQL layer for PostgreSQL and Databricks. It targets stable business schemas and repeated analytical questions, not open-ended “any SQL” generation.

  • Natural language is turned into a structured intent (tables, select expressions, filters, grouping, ordering, optional CTEs) that is shared across dialects; dialect-specific SQL is produced later.
  • Templates store previously accepted query patterns; negative memory records rejections so bad shapes are less likely to repeat.
  • LLM calls run at temperature 0; for the same inputs and schema state, behavior is repeatable.
  • Bounded LLM use: strong paths reuse templates or deterministic structure before asking the model for SQL.

Philosophy

  • Determinism over creativity — prefer a correct, boring plan to a novel one.
  • Correct joins over clever SQL — join paths come from foreign keys and precomputed paths, not free-form guessing.
  • Validate before execute — schema checks, intent consistency, join shape, and dialect-safe read-only SELECT rules.
  • Minimal LLM surface — parse intent, resolve ambiguous joins when needed, repair SQL; everything else is rules and stores.
  • Safe defaults for non-analysts — narrow allowed SQL; you still choose database credentials (read-only vs write-capable is outside this library).

What it supports (at a glance)

Backends

  • PostgreSQL
  • Databricks

Schema

  • Load from live introspection (primary).
  • Optional CREATE TABLE file as extra or fallback (especially when you cannot reach all metadata from the driver).
  • Cached schema snapshot per connection fingerprint so restarts avoid re-reflecting unchanged databases.
  • Table roles (e.g. fact vs dimension), column roles (measure, categorical, temporal, identifier, etc.), filter / aggregation / HAVING allowances per column, value domains from profiling — all assigned when the graph is built (reflection, DDL, profiling, and optional notes).
  • Optional human notes (plain text), via Text2SQL(..., notes_file=...) (see USAGE.md): merged when the graph is built or when notes change; if the cache already contains notes and you omit notes_file on a later run, cached roles and hints are kept.
  • Optional deny lists for tables or columns so they stay out of prompts and can be stripped from intents.

Intent / SQL shape (analytical subset)

  • Queries: SELECT only (enforced with pattern checks and dialect parsing). CTEs reuse the same intent model as the outer query.
  • Joins: related tables are wired using the schema’s relationships; when more than one valid path could link the same tables, one coherent path is chosen for the whole query, and join style follows table roles (e.g. LEFT toward dimensions where that fits). Self-joins use CTEs instead of repeating the same base table in one FROM chain.
  • Select list: bare columns, aggregates (COUNT, SUM, AVG, MIN, MAX, etc.), arithmetic and string expressions where the schema allows them, DISTINCT, and scalar functions subject to column metadata.
  • Filters / boolean logic: comparisons, AND / OR, IN, LIKE; ILIKE / NOT ILIKE on PostgreSQL only (intent stays dialect-agnostic; SQL rendering differs). Null / boolean value normalization in the repair chain.
  • BETWEEN in intent is decomposed into a pair of comparable predicates.
  • Grouping / ordering: GROUP BY, HAVING (aggregate-aware), ORDER BY, LIMIT; rules tie grain (row-level vs grouped) to aggregates and grouped columns.
  • Dates: structured date_window (anchor unit + offset) and date-difference filters between columns where supported.
  • Windows: ROW_NUMBER, RANK, DENSE_RANK, and windowed SUM/AVG, LAG, LEAD, FIRST_VALUE, and LAST_VALUE on select columns (main query and CTEs).
  • CASE / WHEN: only in the select list in the intent model (not in WHERE / HAVING).
  • Arrays / lists: membership-style filters; SQL uses dialect-appropriate forms; optional UNNEST / EXPLODE-style expansion in CTE select lists for typed array columns.
  • Metadata: UNIQUE (and related) when reflection or DDL exposes it, for ranking “human readable” identifiers.

Operational modes

  • Interactive — ask questions, accept/reject, results export; via run_interactive() or a programmatic PipelineSession (see Quickstart above and USAGE.md).
  • Coverage simulator — seed questions → gold intents → deterministic expansion (many operators, deduplicated) → validate/execute → NL question generation for new templates.
  • QSim — reproducible synthetic questions from schema and profiles (seeded randomness).

What it is not

  • Not a full SQL or stored-procedure generator: no UNION/INTERSECT/EXCEPT, no correlated subqueries, no EXISTS, no LATERAL, no DML/DDL, no arbitrary RIGHT/FULL OUTER join policy in the constrained path.
  • Not a substitute for database security: use credentials with least privilege (SELECT and EXPLAIN).
  • Not schema-agnostic: quality depends on FKs, sensible types, and optional notes for domain language.

How a question becomes SQL

  1. Template match — if a trusted pattern fits, reuse parameterized SQL (often no SQL LLM call).
  2. Intent parse — structured intent from the question + schema summary, then a long deterministic repair chain.
  3. Join resolution — choose among valid FK paths for the intent’s tables; disambiguation may use the LLM when multiple paths tie.
  4. SQL generation & validation — deterministic skeleton, injected joins, LLM fill/repair under constraints, then semantic validation, SELECT-only / forbidden-pattern checks, dialect AST validation (pglast or sqlglot), and EXPLAIN.
  5. Execute (where the mode allows) and learn — accept → promote template trust; reject → record negative pattern.

Validation (layers)

  • Safety / shapeSELECT-only enforcement, configurable forbidden SQL substrings, then dialect ast_validate (pglast or sqlglot). EXPLAIN is used as an extra executability check.
  • Schema vs intent — tables/columns/CTEs, selectability, access and sensitivity policy, window / CASE / array shapes, filter and HAVING ops per column, aggregate roles in select/HAVING/ORDER BY, scalar function typing, filter value types vs column types, null/date-window/date-diff rules.
  • Semantic consistency — grouped queries require proper aggregation; contradictions and impossible HAVING; grain alignment.
  • Joins — paths must match the FK graph; guarded path avoids ad-hoc join guessing.

Learning and reuse

  • Accepted templates — intent fingerprint, parameterized SQL, optional example question, trust that rises with validation and falls with rejection.
  • Rejected templates — categorized failures so similar bad intents are discouraged.
  • Persistence is under a per-connection artifact directory (see USAGE.md); you can back it up or reset it by removing that directory.

Coverage simulator (brief)

  1. Parse each seed line into a gold intent.
  2. Expand with a fixed set of deterministic operators (filters, aggregates, joins, time windows, numeric transforms, distinct/limit/OR-groups, expressions, window variants, etc.), deduplicated across depths.
  3. Resolve joins once per table set where possible; validate and execute as a gate.
  4. One LLM step to produce a natural language question from the SQL, with a realism filter.

The simulator loads at most 500 seed lines per run (fixed internal cap; larger files are truncated). estimate_simulator_costs uses the same loader, prints rough per-phase upper-bound lines to stdout, and returns None.


QSim (brief)

Generates reproducible question lists from the schema and profiled values. Same seed → same output. Use for regression-style testing or dataset building.


When to use it

Good fit: star/snowflake-style models, clear FKs, repeated BI-style questions, PostgreSQL or Databricks.

Poor fit: schemas without relationships, heavy procedural logic, or expectations of arbitrary SQL features outside the supported analytical subset.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aetherdialect-0.1.2.tar.gz (488.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

aetherdialect-0.1.2-py3-none-any.whl (329.6 kB view details)

Uploaded Python 3

File details

Details for the file aetherdialect-0.1.2.tar.gz.

File metadata

  • Download URL: aetherdialect-0.1.2.tar.gz
  • Upload date:
  • Size: 488.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for aetherdialect-0.1.2.tar.gz
Algorithm Hash digest
SHA256 35760f2b21bb0f416174c94e29a0dda0811fbc5a12c77ca33bb795adfe652e0f
MD5 3863445dc007505c0f1a1d2e81cd2fa3
BLAKE2b-256 bb8b750b5e269c4e3ab02bd988f6d578ee5413b182a2f5a3d3a7d482da72e1ce

See more details on using hashes here.

File details

Details for the file aetherdialect-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: aetherdialect-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 329.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for aetherdialect-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 82da8230c2763afbaa096e3d381e83f722fe6230f82ab0e05ad3bca55b05ccd1
MD5 df0702f27ca4474276baf6fee9b788b0
BLAKE2b-256 8a12cb3454878822897854f081c3efb43d3f309d0155ba78c681d450cc080a5c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page