Fine-tuned text-to-SQL: turn plain-English questions into correct SQL, measured with real execution accuracy on Spider.
Project description
SQLForge 🛠️
A small open LLM, fine-tuned to turn plain-English questions into correct SQL — measured on the Spider benchmark with real execution accuracy, not string matching.
📊 Headline result
Fine-tuning a 1.5B model with a single LoRA adapter lifted execution accuracy by +8.1 points on the full Spider dev set — and cut the number of crashing queries by 35%.
| Execution accuracy | Correct (of 1034) | Crashing queries | |
|---|---|---|---|
| Base Qwen2.5-Coder-1.5B (zero-shot) | 57.45% | 594 | 228 |
| + QLoRA fine-tune (SQLForge) | 65.57% | 678 | 148 |
| +8.12 pts | +84 | −80 (−35%) |
Measured on the full 1034-example Spider validation split, by running every generated query against the real SQLite database and comparing result sets. Trained on a single RTX 3070 (8GB). See the runs on W&B →
What this is
A from-scratch fine-tuning project, not an API wrapper. It takes a small open code model, teaches it text-to-SQL with QLoRA, and proves the improvement with a proper execution-accuracy harness on Spider.
The goal is to demonstrate the full ML loop — data → train → evaluate → serve → ship — with experiment tracking, an honest before/after, error analysis, and an interactive demo that runs the generated SQL against live databases.
Why execution accuracy (and why it matters)
Most text-to-SQL demos compare the predicted query string to the gold string. That's fragile — two very different queries can be equally correct, and two near-identical strings can differ by one wrong column.
SQLForge instead runs each generated query against the actual SQLite database and compares the result sets (order-insensitive unless the gold query uses ORDER BY). If the gold query itself fails, the example is skipped rather than counted. This is the same metric used by the Spider execution-accuracy leaderboard, and it's the honest one.
The agentic twist: self-correction
When a generated query crashes (e.g. no such column), SQLForge doesn't just mark it wrong — it feeds the SQL and the database error back to the model, hands it the exact valid table/column names, and lets it retry. You can watch this happen live in the demo's self-correction trace.
Honest note: self-correction recovers some crashes but only nudges overall accuracy by ~1 point — a 1.5B model often repeats its own mistake. It's a genuinely useful, transparent feature, and an honest illustration of a small model's reasoning ceiling. See Limitations.
Architecture
flowchart LR
A[Spider<br/>question + schema] --> B[Qwen2.5-Coder-1.5B<br/>4-bit + QLoRA adapter]
B --> C[generated SQL]
C --> D{run on<br/>real SQLite DB}
D -- error --> E[self-correction:<br/>feed error + valid names back]
E --> B
D -- rows --> F[compare result sets<br/>= execution accuracy]
Live demo
A FastAPI server + a custom dark UI: pick a database, ask a question, and watch the model write the SQL, run it, and show the results — with the self-correction trace when it retries.
- 🤗 Hosted: huggingface.co/spaces/Abdullahkousa2/sqlforge
- 💻 Local:
python -m uvicorn app.server:app --port 8000 # then open http://localhost:8000
Install (library)
pip install sqlforge
Use the inference + evaluation toolkit directly:
from sqlforge.inference import load_model, generate_sql
model, tok = load_model("Qwen/Qwen2.5-Coder-1.5B-Instruct",
adapter_path="Abdullahkousa2/sqlforge-qwen2.5-coder-1.5b")
schema = '''CREATE TABLE singer ("Singer_ID" int, "Name" text, "Age" int);'''
print(generate_sql(model, tok, schema, "How many singers are there?"))
# -> SELECT count(*) FROM singer
Or from the command line:
sqlforge --db mydata.sqlite --question "How many users signed up after 2020?"
Reproduce the pipeline
The whole thing runs on a single 8GB GPU. Steps are also available as a guided notebook: run_pipeline.ipynb.
# 0. install (torch is assumed pre-installed; see requirements.txt)
pip install -r requirements.txt
# 1. data
python scripts/download_data.py # Spider question/SQL pairs (HF)
python scripts/download_databases.py # Spider SQLite databases
python scripts/prepare_data.py # -> data/processed/{train,validation}.jsonl
# 2. baseline (the "before" number)
python scripts/evaluate.py --no-4bit --wandb
# 3. fine-tune (QLoRA, ~3 epochs on an RTX 3070)
python scripts/train.py
# 4. fine-tuned eval (the "after" number)
python scripts/evaluate.py --adapter outputs/qwen2.5-coder-1.5b-sql --no-4bit --wandb
# 5. self-correction eval
python scripts/evaluate.py --adapter outputs/qwen2.5-coder-1.5b-sql --no-4bit --self-correct --wandb
Training configuration
| Base model | Qwen/Qwen2.5-Coder-1.5B-Instruct |
| Method | QLoRA — 4-bit NF4 base + LoRA adapter |
| LoRA | r=16, α=32, dropout=0.05, all attention + MLP projections |
| Schedule | 3 epochs, lr 2e-4 cosine, warmup 0.03 |
| Batching | per-device 1 × grad-accum 16 (effective 16) |
| Memory | gradient checkpointing, paged AdamW 8-bit, bf16 |
| Seq length | 1280 (the ~3% longer examples are dropped, never truncated) |
Full config: configs/train_config.yaml.
Repository layout
sqlforge/ # the library: prompt format, inference, execution-accuracy grader
prompts.py # single source of truth for the prompt (no train/eval skew)
inference.py # model loading, generation, self-correction loop
exec_eval.py # run SQL on SQLite, compare result sets
scripts/ # download → prepare → train → evaluate
app/ # FastAPI demo server + custom dark UI (static/)
configs/ # training config
run_pipeline.ipynb # the whole pipeline as a guided notebook
Limitations
Stated plainly, because honest evaluation is the point:
- The model is small (1.5B). Its dominant failure is over-joining — building an unnecessary JOIN and referencing a column on the wrong table (
no such column). Fine-tuning reduced this by a third but couldn't eliminate it. - ~90% needs a different class of system. Top of the Spider leaderboard means a frontier model (GPT-4 / Claude) inside an agentic pipeline (DIN-SQL / DAIL-SQL: schema linking, decomposition, self-consistency). A locally-trained 1.5B realistically tops out in the 60s–70s.
- A 200-example eval is noise. An early 200-sample read showed 49.5%→51.5%; the full 1034-example set revealed the true 57.4%→65.6%. Always evaluate the full split.
Roadmap
- Scale to a 3B/7B base for the mid-70s
- Stop-string decoding to speed up worst-case generation
- BIRD dataset augmentation
- Schema-linking pre-step before generation
Tech stack
PyTorch · 🤗 Transformers · PEFT · TRL · bitsandbytes · Datasets · Weights & Biases · FastAPI
License
MIT © Abdullah Kousa
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sqlforge-0.1.0.tar.gz.
File metadata
- Download URL: sqlforge-0.1.0.tar.gz
- Upload date:
- Size: 16.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4eb77f73ce9601d6ca9c915ce3c0b1545135178c7762154a05f9f251ab990416
|
|
| MD5 |
bb5594f8863b1278a058b91c3a3b7dd6
|
|
| BLAKE2b-256 |
dbd49e204af0cf7b0f02dbc5ff751f269bd2f3bc7e049415970220fa916e5a7e
|
File details
Details for the file sqlforge-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sqlforge-0.1.0-py3-none-any.whl
- Upload date:
- Size: 12.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3f7b769607e5eb3140295cb1612224014e06fd886918a5a6a62c66933e478fe0
|
|
| MD5 |
65ab36763a9ed12b83cfeeea6a5a2e9b
|
|
| BLAKE2b-256 |
3668bdb054fc84c2977b87be5f6e18af1f053e0aaea115ce323ff72afc5766b5
|