Deterministic, validation-first Text-to-SQL system for business databases

These details have not been verified by PyPI

Project links

Homepage

Project description

Deterministic, validation-first Text-to-SQL for business databases

Installation

pip install text2sql
pip install "text2sql[postgresql]"
pip install "text2sql[databricks]"
pip install "text2sql[postgresql,databricks]"

Requires Python ≥ 3.10 and either an OpenAI API key or Azure OpenAI credentials.

Extra	Brings in	Use when
(base)	SQLAlchemy (shared introspection / execution interface)	Always installed
`postgresql`	PostgreSQL driver (`psycopg2-binary`), `pglast`	`engine="postgresql"`
`databricks`	PySpark, Databricks SQL connector, `databricks-sqlalchemy`, `sqlglot`	`engine="databricks"`

SQL parsing for validation: PostgreSQL uses pglast for structural AST checks (join pairs, CTE bodies, ast_validate). Databricks / Spark SQL uses sqlglot with the Spark dialect.

Quickstart

from text2sql import Text2SQL

t2s = Text2SQL(
    engine="postgresql",
    host="localhost",
    database="mydb",
    password="secret",
    openai_api_key="sk-...",
)

t2s.run_interactive()

Constructor options, modes, optional files, and the full method list are in USAGE.md.

What this is

A validation-first text-to-SQL layer for PostgreSQL and Databricks. It targets stable business schemas and repeated analytical questions, not open-ended “any SQL” generation.

Natural language is turned into a structured intent (tables, select expressions, filters, grouping, ordering, optional CTEs) that is shared across dialects; dialect-specific SQL is produced later.
Templates store previously accepted query patterns; negative memory records rejections so bad shapes are less likely to repeat.
LLM calls run at temperature 0; for the same inputs and schema state, behavior is repeatable.
Bounded LLM use: strong paths reuse templates or deterministic structure before asking the model for SQL.

Philosophy

Determinism over creativity — prefer a correct, boring plan to a novel one.
Correct joins over clever SQL — join paths come from foreign keys and precomputed paths, not free-form guessing.
Validate before execute — schema checks, intent consistency, join shape, and dialect-safe read-only SELECT rules.
Minimal LLM surface — parse intent, resolve ambiguous joins when needed, repair SQL; everything else is rules and stores.
Safe defaults for non-analysts — narrow allowed SQL; you still choose database credentials (read-only vs write-capable is outside this library).

What it supports (at a glance)

Backends

PostgreSQL
Databricks

Schema

Load from live introspection (primary).
Optional CREATE TABLE file as extra or fallback (especially when you cannot reach all metadata from the driver).
Cached schema snapshot per connection fingerprint so restarts avoid re-reflecting unchanged databases.
Table roles (e.g. fact vs dimension), column roles (measure, categorical, temporal, identifier, etc.), filter / aggregation / HAVING allowances per column, value domains from profiling — all assigned when the graph is built (reflection, DDL, profiling, and optional notes).
Optional human notes (plain text), via Text2SQL(..., notes_file=...) (see USAGE.md): read once when the schema graph is first created (not on cache load). The LLM uses them to refine table and column roles, natural-language descriptions, and optional sensitivity labels — without renaming tables or inventing columns.
Optional deny lists for tables or columns so they stay out of prompts and can be stripped from intents.

Intent / SQL shape (analytical subset)

Queries: SELECT only (enforced with pattern checks and dialect parsing). CTEs reuse the same intent model as the outer query.
Joins: related tables are wired using the schema’s relationships; when more than one valid path could link the same tables, one coherent path is chosen for the whole query, and join style follows table roles (e.g. LEFT toward dimensions where that fits). Self-joins use CTEs instead of repeating the same base table in one FROM chain.
Select list: bare columns, aggregates (COUNT, SUM, AVG, MIN, MAX, etc.), arithmetic and string expressions where the schema allows them, DISTINCT, and scalar functions subject to column metadata.
Filters / boolean logic: comparisons, AND / OR, IN, LIKE; ILIKE / NOT ILIKE on PostgreSQL only (intent stays dialect-agnostic; SQL rendering differs). Null / boolean value normalization in the repair chain.
BETWEEN in intent is decomposed into a pair of comparable predicates.
Grouping / ordering: GROUP BY, HAVING (aggregate-aware), ORDER BY, LIMIT; rules tie grain (row-level vs grouped) to aggregates and grouped columns.
Dates: structured date_window (anchor unit + offset) and date-difference filters between columns where supported.
Windows: ROW_NUMBER, RANK, DENSE_RANK, and windowed SUM/AVG, LAG, LEAD, FIRST_VALUE, and LAST_VALUE on select columns (main query and CTEs).
CASE / WHEN: only in the select list in the intent model (not in WHERE / HAVING).
Arrays / lists: membership-style filters; SQL uses dialect-appropriate forms; optional UNNEST / EXPLODE-style expansion in CTE select lists for typed array columns.
Metadata: UNIQUE (and related) when reflection or DDL exposes it, for ranking “human readable” identifiers.

Operational modes

Interactive — ask questions, accept/reject, results export.
Coverage simulator — seed questions → gold intents → deterministic expansion (many operators, deduplicated) → validate/execute → NL question generation for new templates.
QSim — reproducible synthetic questions from schema and profiles (seeded randomness).

What it is not

Not a full SQL or stored-procedure generator: no UNION/INTERSECT/EXCEPT, no correlated subqueries, no EXISTS, no LATERAL, no DML/DDL, no arbitrary RIGHT/FULL OUTER join policy in the constrained path.
Not a substitute for database security: use credentials with least privilege (SELECT and EXPLAIN).
Not schema-agnostic: quality depends on FKs, sensible types, and optional notes for domain language.

How a question becomes SQL

Template match — if a trusted pattern fits, reuse parameterized SQL (often no SQL LLM call).
Intent parse — structured intent from the question + schema summary, then a long deterministic repair chain.
Join resolution — choose among valid FK paths for the intent’s tables; disambiguation may use the LLM when multiple paths tie.
SQL generation & validation — deterministic skeleton, injected joins, LLM fill/repair under constraints, then semantic validation, SELECT-only / forbidden-pattern checks, dialect AST validation (pglast or sqlglot), and EXPLAIN.
Execute (where the mode allows) and learn — accept → promote template trust; reject → record negative pattern.

Validation (layers)

Safety / shape — SELECT-only enforcement, configurable forbidden SQL substrings, then dialect ast_validate (pglast or sqlglot). EXPLAIN is used as an extra executability check.
Schema vs intent — tables/columns/CTEs, selectability, access and sensitivity policy, window / CASE / array shapes, filter and HAVING ops per column, aggregate roles in select/HAVING/ORDER BY, scalar function typing, filter value types vs column types, null/date-window/date-diff rules.
Semantic consistency — grouped queries require proper aggregation; contradictions and impossible HAVING; grain alignment.
Joins — paths must match the FK graph; guarded path avoids ad-hoc join guessing.

Learning and reuse

Accepted templates — intent fingerprint, parameterized SQL, optional example question, trust that rises with validation and falls with rejection.
Rejected templates — categorized failures so similar bad intents are discouraged.
Persistence is under a per-connection artifact directory (see USAGE.md); you can back it up or reset it by removing that directory.

Coverage simulator (brief)

Parse each seed line into a gold intent.
Expand with a fixed set of deterministic operators (filters, aggregates, joins, time windows, numeric transforms, distinct/limit/OR-groups, expressions, window variants, etc.), deduplicated across depths.
Resolve joins once per table set where possible; validate and execute as a gate.
One LLM step to produce a natural language question from the SQL, with a realism filter.

The simulator loads at most 500 seed lines per run (fixed internal cap; larger files are truncated). estimate_simulator_costs uses the same loader, so its seed count reflects that cap. It returns rough LLM-call and execution estimates from a seed file and schema stats.

QSim (brief)

Generates reproducible question lists from the schema and profiled values. Same seed → same output. Use for regression-style testing or dataset building.

When to use it

Good fit: star/snowflake-style models, clear FKs, repeated BI-style questions, PostgreSQL or Databricks.

Poor fit: schemas without relationships, heavy procedural logic, or expectations of arbitrary SQL features outside the supported analytical subset.

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

0.1.7

May 20, 2026

0.1.6

May 6, 2026

0.1.5

May 5, 2026

0.1.4

May 5, 2026

0.1.3

Apr 30, 2026

0.1.2

Apr 10, 2026

This version

0.1.1

Apr 6, 2026

0.1.0

Mar 30, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aetherdialect-0.1.1.tar.gz (452.2 kB view details)

Uploaded Apr 6, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

aetherdialect-0.1.1-py3-none-any.whl (301.8 kB view details)

Uploaded Apr 6, 2026 Python 3

File details

Details for the file aetherdialect-0.1.1.tar.gz.

File metadata

Download URL: aetherdialect-0.1.1.tar.gz
Upload date: Apr 6, 2026
Size: 452.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for aetherdialect-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`6d0d056623c3e24cc381a6c6a91348eca1ce7626b0639fd170e201762a23ad01`
MD5	`45d65d1c1d1b483ec0bfffdc498a5828`
BLAKE2b-256	`d40172d33b534379a86960af4eac9e388ec33b958024be2c274f07a16bccabb3`

See more details on using hashes here.

File details

Details for the file aetherdialect-0.1.1-py3-none-any.whl.

File metadata

Download URL: aetherdialect-0.1.1-py3-none-any.whl
Upload date: Apr 6, 2026
Size: 301.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for aetherdialect-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4ae9d12107c98b4f9321ef8b2c671a7cab7e1f4275ec46148c5ebe5700d06a55`
MD5	`3c029c1815d225b9eeeaa4618f2c9e47`
BLAKE2b-256	`e35e48bc07e27705e9678287e2c72ef2a80503d69d4e0aacb97f81abe6dc30f7`

See more details on using hashes here.

aetherdialect 0.1.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

Deterministic, validation-first Text-to-SQL for business databases

Installation

Quickstart

What this is

Philosophy

What it supports (at a glance)

What it is not

How a question becomes SQL

Validation (layers)

Learning and reuse

Coverage simulator (brief)

QSim (brief)

When to use it

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes