Fine-tuned text-to-SQL: turn plain-English questions into correct SQL, measured with real execution accuracy on Spider.

These details have not been verified by PyPI

Project links

Project description

SQLForge 🛠️

A small open LLM, fine-tuned to turn plain-English questions into correct SQL — measured on the Spider benchmark with real execution accuracy, not string matching.

📊 Headline result

Fine-tuning a 1.5B model with a single LoRA adapter lifted execution accuracy by +8.1 points on the full Spider dev set — and cut the number of crashing queries by 35%.

	Execution accuracy	Correct (of 1034)	Crashing queries
Base Qwen2.5-Coder-1.5B (zero-shot)	57.45%	594	228
+ QLoRA fine-tune (SQLForge)	65.57%	678	148
	+8.12 pts	+84	−80 (−35%)

Measured on the full 1034-example Spider validation split, by running every generated query against the real SQLite database and comparing result sets. Trained on a single RTX 3070 (8GB). See the runs on W&B →

What this is

A from-scratch fine-tuning project, not an API wrapper. It takes a small open code model, teaches it text-to-SQL with QLoRA, and proves the improvement with a proper execution-accuracy harness on Spider.

The goal is to demonstrate the full ML loop — data → train → evaluate → serve → ship — with experiment tracking, an honest before/after, error analysis, and an interactive demo that runs the generated SQL against live databases.

Why execution accuracy (and why it matters)

Most text-to-SQL demos compare the predicted query string to the gold string. That's fragile — two very different queries can be equally correct, and two near-identical strings can differ by one wrong column.

SQLForge instead runs each generated query against the actual SQLite database and compares the result sets (order-insensitive unless the gold query uses ORDER BY). If the gold query itself fails, the example is skipped rather than counted. This is the same metric used by the Spider execution-accuracy leaderboard, and it's the honest one.

The agentic twist: self-correction

When a generated query crashes (e.g. no such column), SQLForge doesn't just mark it wrong — it feeds the SQL and the database error back to the model, hands it the exact valid table/column names, and lets it retry. You can watch this happen live in the demo's self-correction trace.

Honest note: self-correction recovers some crashes but only nudges overall accuracy by ~1 point — a 1.5B model often repeats its own mistake. It's a genuinely useful, transparent feature, and an honest illustration of a small model's reasoning ceiling. See Limitations.

Architecture

flowchart LR
    A[Spider<br/>question + schema] --> B[Qwen2.5-Coder-1.5B<br/>4-bit + QLoRA adapter]
    B --> C[generated SQL]
    C --> D{run on<br/>real SQLite DB}
    D -- error --> E[self-correction:<br/>feed error + valid names back]
    E --> B
    D -- rows --> F[compare result sets<br/>= execution accuracy]

Live demo

A FastAPI server + a custom dark UI: pick a database, ask a question, and watch the model write the SQL, run it, and show the results — with the self-correction trace when it retries.

🤗 Hosted: huggingface.co/spaces/Abdullahkousa2/sqlforge

💻 Local:

python -m uvicorn app.server:app --port 8000
# then open http://localhost:8000

Install (library)

pip install sqlforge

Use the inference + evaluation toolkit directly:

from sqlforge.inference import load_model, generate_sql

model, tok = load_model("Qwen/Qwen2.5-Coder-1.5B-Instruct",
                        adapter_path="Abdullahkousa2/sqlforge-qwen2.5-coder-1.5b")

schema = '''CREATE TABLE singer ("Singer_ID" int, "Name" text, "Age" int);'''
print(generate_sql(model, tok, schema, "How many singers are there?"))
# -> SELECT count(*) FROM singer

Or from the command line:

sqlforge --db mydata.sqlite --question "How many users signed up after 2020?"

Reproduce the pipeline

The whole thing runs on a single 8GB GPU. Steps are also available as a guided notebook: run_pipeline.ipynb.

# 0. install (torch is assumed pre-installed; see requirements.txt)
pip install -r requirements.txt

# 1. data
python scripts/download_data.py        # Spider question/SQL pairs (HF)
python scripts/download_databases.py   # Spider SQLite databases
python scripts/prepare_data.py         # -> data/processed/{train,validation}.jsonl

# 2. baseline (the "before" number)
python scripts/evaluate.py --no-4bit --wandb

# 3. fine-tune (QLoRA, ~3 epochs on an RTX 3070)
python scripts/train.py

# 4. fine-tuned eval (the "after" number)
python scripts/evaluate.py --adapter outputs/qwen2.5-coder-1.5b-sql --no-4bit --wandb

# 5. self-correction eval
python scripts/evaluate.py --adapter outputs/qwen2.5-coder-1.5b-sql --no-4bit --self-correct --wandb

Training configuration


Base model	`Qwen/Qwen2.5-Coder-1.5B-Instruct`
Method	QLoRA — 4-bit NF4 base + LoRA adapter
LoRA	r=16, α=32, dropout=0.05, all attention + MLP projections
Schedule	3 epochs, lr 2e-4 cosine, warmup 0.03
Batching	per-device 1 × grad-accum 16 (effective 16)
Memory	gradient checkpointing, paged AdamW 8-bit, bf16
Seq length	1280 (the ~3% longer examples are dropped, never truncated)

Full config: configs/train_config.yaml.

Repository layout

sqlforge/            # the library: prompt format, inference, execution-accuracy grader
  prompts.py         #   single source of truth for the prompt (no train/eval skew)
  inference.py       #   model loading, generation, self-correction loop
  exec_eval.py       #   run SQL on SQLite, compare result sets
scripts/             # download → prepare → train → evaluate
app/                 # FastAPI demo server + custom dark UI (static/)
configs/             # training config
run_pipeline.ipynb   # the whole pipeline as a guided notebook

Limitations

Stated plainly, because honest evaluation is the point:

The model is small (1.5B). Its dominant failure is over-joining — building an unnecessary JOIN and referencing a column on the wrong table (no such column). Fine-tuning reduced this by a third but couldn't eliminate it.
~90% needs a different class of system. Top of the Spider leaderboard means a frontier model (GPT-4 / Claude) inside an agentic pipeline (DIN-SQL / DAIL-SQL: schema linking, decomposition, self-consistency). A locally-trained 1.5B realistically tops out in the 60s–70s.
A 200-example eval is noise. An early 200-sample read showed 49.5%→51.5%; the full 1034-example set revealed the true 57.4%→65.6%. Always evaluate the full split.

Roadmap

Scale to a 3B/7B base for the mid-70s
Stop-string decoding to speed up worst-case generation
BIRD dataset augmentation
Schema-linking pre-step before generation

Tech stack

PyTorch · 🤗 Transformers · PEFT · TRL · bitsandbytes · Datasets · Weights & Biases · FastAPI

License

MIT © Abdullah Kousa

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Jun 14, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sqlforge-0.1.0.tar.gz (16.5 kB view details)

Uploaded Jun 14, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sqlforge-0.1.0-py3-none-any.whl (12.9 kB view details)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file sqlforge-0.1.0.tar.gz.

File metadata

Download URL: sqlforge-0.1.0.tar.gz
Upload date: Jun 14, 2026
Size: 16.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for sqlforge-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4eb77f73ce9601d6ca9c915ce3c0b1545135178c7762154a05f9f251ab990416`
MD5	`bb5594f8863b1278a058b91c3a3b7dd6`
BLAKE2b-256	`dbd49e204af0cf7b0f02dbc5ff751f269bd2f3bc7e049415970220fa916e5a7e`

See more details on using hashes here.

File details

Details for the file sqlforge-0.1.0-py3-none-any.whl.

File metadata

Download URL: sqlforge-0.1.0-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 12.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for sqlforge-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3f7b769607e5eb3140295cb1612224014e06fd886918a5a6a62c66933e478fe0`
MD5	`65ab36763a9ed12b83cfeeea6a5a2e9b`
BLAKE2b-256	`3668bdb054fc84c2977b87be5f6e18af1f053e0aaea115ce323ff72afc5766b5`

See more details on using hashes here.

sqlforge 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

SQLForge 🛠️

📊 Headline result

What this is

Why execution accuracy (and why it matters)

The agentic twist: self-correction

Architecture

Live demo

Install (library)

Reproduce the pipeline

Training configuration

Repository layout

Limitations

Roadmap

Tech stack

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes