Deterministic, validation-first Text-to-SQL system for business databases
Project description
Deterministic, validation-first Text-to-SQL for business databases
Installation
pip install text2sql
pip install "text2sql[postgresql]"
pip install "text2sql[databricks]"
pip install "text2sql[postgresql,databricks]"
Requires Python ≥ 3.10 and either an OpenAI API key or Azure OpenAI credentials.
| Extra | Brings in | Use when |
|---|---|---|
| (base) | SQLAlchemy (shared introspection / execution interface) | Always installed |
postgresql |
PostgreSQL driver (psycopg2-binary), pglast |
engine="postgresql" |
databricks |
PySpark, Databricks SQL connector, databricks-sqlalchemy, sqlglot |
engine="databricks" |
SQL parsing for validation: PostgreSQL uses pglast for structural AST checks (join pairs, CTE bodies, ast_validate). Databricks / Spark SQL uses sqlglot with the Spark dialect.
Quickstart
from text2sql import Text2SQL
t2s = Text2SQL(
engine="postgresql",
host="localhost",
database="mydb",
password="secret",
openai_api_key="sk-...",
)
t2s.run_interactive()
Constructor options, modes, optional files, and the full method list are in USAGE.md.
What this is
A validation-first text-to-SQL layer for PostgreSQL and Databricks. It targets stable business schemas and repeated analytical questions, not open-ended “any SQL” generation.
- Natural language is turned into a structured intent (tables, select expressions, filters, grouping, ordering, optional CTEs) that is shared across dialects; dialect-specific SQL is produced later.
- Templates store previously accepted query patterns; negative memory records rejections so bad shapes are less likely to repeat.
- LLM calls run at temperature 0; for the same inputs and schema state, behavior is repeatable.
- Bounded LLM use: strong paths reuse templates or deterministic structure before asking the model for SQL.
Philosophy
- Determinism over creativity — prefer a correct, boring plan to a novel one.
- Correct joins over clever SQL — join paths come from foreign keys and precomputed paths, not free-form guessing.
- Validate before execute — schema checks, intent consistency, join shape, and dialect-safe read-only
SELECTrules. - Minimal LLM surface — parse intent, resolve ambiguous joins when needed, repair SQL; everything else is rules and stores.
- Safe defaults for non-analysts — narrow allowed SQL; you still choose database credentials (read-only vs write-capable is outside this library).
What it supports (at a glance)
Backends
- PostgreSQL
- Databricks
Schema
- Load from live introspection (primary).
- Optional
CREATE TABLEfile as extra or fallback (especially when you cannot reach all metadata from the driver). - Cached schema snapshot per connection fingerprint so restarts avoid re-reflecting unchanged databases.
- Table roles (e.g. fact vs dimension), column roles (measure, categorical, temporal, identifier, etc.), filter / aggregation / HAVING allowances per column, value domains from profiling — all assigned when the graph is built (reflection, DDL, profiling, and optional notes).
- Optional human notes (plain text), via
Text2SQL(..., notes_file=...)(see USAGE.md): read once when the schema graph is first created (not on cache load). The LLM uses them to refine table and column roles, natural-language descriptions, and optional sensitivity labels — without renaming tables or inventing columns. - Optional deny lists for tables or columns so they stay out of prompts and can be stripped from intents.
Intent / SQL shape (analytical subset)
- Queries:
SELECTonly (enforced with pattern checks and dialect parsing). CTEs reuse the same intent model as the outer query. - Joins: related tables are wired using the schema’s relationships; when more than one valid path could link the same tables, one coherent path is chosen for the whole query, and join style follows table roles (e.g.
LEFTtoward dimensions where that fits). Self-joins use CTEs instead of repeating the same base table in oneFROMchain. - Select list: bare columns, aggregates (
COUNT,SUM,AVG,MIN,MAX, etc.), arithmetic and string expressions where the schema allows them,DISTINCT, and scalar functions subject to column metadata. - Filters / boolean logic: comparisons,
AND/OR,IN,LIKE;ILIKE/NOT ILIKEon PostgreSQL only (intent stays dialect-agnostic; SQL rendering differs). Null / boolean value normalization in the repair chain. BETWEENin intent is decomposed into a pair of comparable predicates.- Grouping / ordering:
GROUP BY,HAVING(aggregate-aware),ORDER BY,LIMIT; rules tie grain (row-level vs grouped) to aggregates and grouped columns. - Dates: structured
date_window(anchor unit + offset) and date-difference filters between columns where supported. - Windows:
ROW_NUMBER,RANK,DENSE_RANK, and windowedSUM/AVG,LAG,LEAD,FIRST_VALUE, andLAST_VALUEon select columns (main query and CTEs). CASE/WHEN: only in the select list in the intent model (not inWHERE/HAVING).- Arrays / lists: membership-style filters; SQL uses dialect-appropriate forms; optional UNNEST / EXPLODE-style expansion in CTE select lists for typed array columns.
- Metadata: UNIQUE (and related) when reflection or DDL exposes it, for ranking “human readable” identifiers.
Operational modes
- Interactive — ask questions, accept/reject, results export.
- Coverage simulator — seed questions → gold intents → deterministic expansion (many operators, deduplicated) → validate/execute → NL question generation for new templates.
- QSim — reproducible synthetic questions from schema and profiles (seeded randomness).
What it is not
- Not a full SQL or stored-procedure generator: no
UNION/INTERSECT/EXCEPT, no correlated subqueries, noEXISTS, noLATERAL, no DML/DDL, no arbitrary RIGHT/FULL OUTER join policy in the constrained path. - Not a substitute for database security: use credentials with least privilege (
SELECTandEXPLAIN). - Not schema-agnostic: quality depends on FKs, sensible types, and optional notes for domain language.
How a question becomes SQL
- Template match — if a trusted pattern fits, reuse parameterized SQL (often no SQL LLM call).
- Intent parse — structured intent from the question + schema summary, then a long deterministic repair chain.
- Join resolution — choose among valid FK paths for the intent’s tables; disambiguation may use the LLM when multiple paths tie.
- SQL generation & validation — deterministic skeleton, injected joins, LLM fill/repair under constraints, then semantic validation,
SELECT-only / forbidden-pattern checks, dialect AST validation (pglastorsqlglot), andEXPLAIN. - Execute (where the mode allows) and learn — accept → promote template trust; reject → record negative pattern.
Validation (layers)
- Safety / shape —
SELECT-only enforcement, configurable forbidden SQL substrings, then dialectast_validate(pglastorsqlglot).EXPLAINis used as an extra executability check. - Schema vs intent — tables/columns/CTEs, selectability, access and sensitivity policy, window / CASE / array shapes, filter and HAVING ops per column, aggregate roles in select/HAVING/ORDER BY, scalar function typing, filter value types vs column types, null/date-window/date-diff rules.
- Semantic consistency — grouped queries require proper aggregation; contradictions and impossible HAVING; grain alignment.
- Joins — paths must match the FK graph; guarded path avoids ad-hoc join guessing.
Learning and reuse
- Accepted templates — intent fingerprint, parameterized SQL, optional example question, trust that rises with validation and falls with rejection.
- Rejected templates — categorized failures so similar bad intents are discouraged.
- Persistence is under a per-connection artifact directory (see USAGE.md); you can back it up or reset it by removing that directory.
Coverage simulator (brief)
- Parse each seed line into a gold intent.
- Expand with a fixed set of deterministic operators (filters, aggregates, joins, time windows, numeric transforms, distinct/limit/OR-groups, expressions, window variants, etc.), deduplicated across depths.
- Resolve joins once per table set where possible; validate and execute as a gate.
- One LLM step to produce a natural language question from the SQL, with a realism filter.
The simulator loads at most 500 seed lines per run (fixed internal cap; larger files are truncated). estimate_simulator_costs uses the same loader, so its seed count reflects that cap. It returns rough LLM-call and execution estimates from a seed file and schema stats.
QSim (brief)
Generates reproducible question lists from the schema and profiled values. Same seed → same output. Use for regression-style testing or dataset building.
When to use it
Good fit: star/snowflake-style models, clear FKs, repeated BI-style questions, PostgreSQL or Databricks.
Poor fit: schemas without relationships, heavy procedural logic, or expectations of arbitrary SQL features outside the supported analytical subset.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file aetherdialect-0.1.1.tar.gz.
File metadata
- Download URL: aetherdialect-0.1.1.tar.gz
- Upload date:
- Size: 452.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6d0d056623c3e24cc381a6c6a91348eca1ce7626b0639fd170e201762a23ad01
|
|
| MD5 |
45d65d1c1d1b483ec0bfffdc498a5828
|
|
| BLAKE2b-256 |
d40172d33b534379a86960af4eac9e388ec33b958024be2c274f07a16bccabb3
|
File details
Details for the file aetherdialect-0.1.1-py3-none-any.whl.
File metadata
- Download URL: aetherdialect-0.1.1-py3-none-any.whl
- Upload date:
- Size: 301.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4ae9d12107c98b4f9321ef8b2c671a7cab7e1f4275ec46148c5ebe5700d06a55
|
|
| MD5 |
3c029c1815d225b9eeeaa4618f2c9e47
|
|
| BLAKE2b-256 |
e35e48bc07e27705e9678287e2c72ef2a80503d69d4e0aacb97f81abe6dc30f7
|