Skip to main content

Fine-tuned text-to-SQL: turn plain-English questions into correct SQL, measured with real execution accuracy on Spider.

Project description

SQLForge 🛠️

A small open LLM, fine-tuned to turn plain-English questions into correct SQL — measured on the Spider benchmark with real execution accuracy, not string matching.

Python PyTorch QLoRA Base W&B PyPI HF Space License


📊 Headline result

Fine-tuning a 1.5B model with a single LoRA adapter lifted execution accuracy by +8.1 points on the full Spider dev set — and cut the number of crashing queries by 35%.

Execution accuracy Correct (of 1034) Crashing queries
Base Qwen2.5-Coder-1.5B (zero-shot) 57.45% 594 228
+ QLoRA fine-tune (SQLForge) 65.57% 678 148
+8.12 pts +84 −80 (−35%)

Measured on the full 1034-example Spider validation split, by running every generated query against the real SQLite database and comparing result sets. Trained on a single RTX 3070 (8GB). See the runs on W&B →


What this is

A from-scratch fine-tuning project, not an API wrapper. It takes a small open code model, teaches it text-to-SQL with QLoRA, and proves the improvement with a proper execution-accuracy harness on Spider.

The goal is to demonstrate the full ML loopdata → train → evaluate → serve → ship — with experiment tracking, an honest before/after, error analysis, and an interactive demo that runs the generated SQL against live databases.

Why execution accuracy (and why it matters)

Most text-to-SQL demos compare the predicted query string to the gold string. That's fragile — two very different queries can be equally correct, and two near-identical strings can differ by one wrong column.

SQLForge instead runs each generated query against the actual SQLite database and compares the result sets (order-insensitive unless the gold query uses ORDER BY). If the gold query itself fails, the example is skipped rather than counted. This is the same metric used by the Spider execution-accuracy leaderboard, and it's the honest one.

The agentic twist: self-correction

When a generated query crashes (e.g. no such column), SQLForge doesn't just mark it wrong — it feeds the SQL and the database error back to the model, hands it the exact valid table/column names, and lets it retry. You can watch this happen live in the demo's self-correction trace.

Honest note: self-correction recovers some crashes but only nudges overall accuracy by ~1 point — a 1.5B model often repeats its own mistake. It's a genuinely useful, transparent feature, and an honest illustration of a small model's reasoning ceiling. See Limitations.

Architecture

flowchart LR
    A[Spider<br/>question + schema] --> B[Qwen2.5-Coder-1.5B<br/>4-bit + QLoRA adapter]
    B --> C[generated SQL]
    C --> D{run on<br/>real SQLite DB}
    D -- error --> E[self-correction:<br/>feed error + valid names back]
    E --> B
    D -- rows --> F[compare result sets<br/>= execution accuracy]

Live demo

A FastAPI server + a custom dark UI: pick a database, ask a question, and watch the model write the SQL, run it, and show the results — with the self-correction trace when it retries.

Install (library)

pip install sqlforge

Use the inference + evaluation toolkit directly:

from sqlforge.inference import load_model, generate_sql

model, tok = load_model("Qwen/Qwen2.5-Coder-1.5B-Instruct",
                        adapter_path="Abdullahkousa2/sqlforge-qwen2.5-coder-1.5b")

schema = '''CREATE TABLE singer ("Singer_ID" int, "Name" text, "Age" int);'''
print(generate_sql(model, tok, schema, "How many singers are there?"))
# -> SELECT count(*) FROM singer

Or from the command line:

sqlforge --db mydata.sqlite --question "How many users signed up after 2020?"

Reproduce the pipeline

The whole thing runs on a single 8GB GPU. Steps are also available as a guided notebook: run_pipeline.ipynb.

# 0. install (torch is assumed pre-installed; see requirements.txt)
pip install -r requirements.txt

# 1. data
python scripts/download_data.py        # Spider question/SQL pairs (HF)
python scripts/download_databases.py   # Spider SQLite databases
python scripts/prepare_data.py         # -> data/processed/{train,validation}.jsonl

# 2. baseline (the "before" number)
python scripts/evaluate.py --no-4bit --wandb

# 3. fine-tune (QLoRA, ~3 epochs on an RTX 3070)
python scripts/train.py

# 4. fine-tuned eval (the "after" number)
python scripts/evaluate.py --adapter outputs/qwen2.5-coder-1.5b-sql --no-4bit --wandb

# 5. self-correction eval
python scripts/evaluate.py --adapter outputs/qwen2.5-coder-1.5b-sql --no-4bit --self-correct --wandb

Training configuration

Base model Qwen/Qwen2.5-Coder-1.5B-Instruct
Method QLoRA — 4-bit NF4 base + LoRA adapter
LoRA r=16, α=32, dropout=0.05, all attention + MLP projections
Schedule 3 epochs, lr 2e-4 cosine, warmup 0.03
Batching per-device 1 × grad-accum 16 (effective 16)
Memory gradient checkpointing, paged AdamW 8-bit, bf16
Seq length 1280 (the ~3% longer examples are dropped, never truncated)

Full config: configs/train_config.yaml.

Repository layout

sqlforge/            # the library: prompt format, inference, execution-accuracy grader
  prompts.py         #   single source of truth for the prompt (no train/eval skew)
  inference.py       #   model loading, generation, self-correction loop
  exec_eval.py       #   run SQL on SQLite, compare result sets
scripts/             # download → prepare → train → evaluate
app/                 # FastAPI demo server + custom dark UI (static/)
configs/             # training config
run_pipeline.ipynb   # the whole pipeline as a guided notebook

Limitations

Stated plainly, because honest evaluation is the point:

  • The model is small (1.5B). Its dominant failure is over-joining — building an unnecessary JOIN and referencing a column on the wrong table (no such column). Fine-tuning reduced this by a third but couldn't eliminate it.
  • ~90% needs a different class of system. Top of the Spider leaderboard means a frontier model (GPT-4 / Claude) inside an agentic pipeline (DIN-SQL / DAIL-SQL: schema linking, decomposition, self-consistency). A locally-trained 1.5B realistically tops out in the 60s–70s.
  • A 200-example eval is noise. An early 200-sample read showed 49.5%→51.5%; the full 1034-example set revealed the true 57.4%→65.6%. Always evaluate the full split.

Roadmap

  • Scale to a 3B/7B base for the mid-70s
  • Stop-string decoding to speed up worst-case generation
  • BIRD dataset augmentation
  • Schema-linking pre-step before generation

Tech stack

PyTorch · 🤗 Transformers · PEFT · TRL · bitsandbytes · Datasets · Weights & Biases · FastAPI

License

MIT © Abdullah Kousa

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sqlforge-0.1.0.tar.gz (16.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sqlforge-0.1.0-py3-none-any.whl (12.9 kB view details)

Uploaded Python 3

File details

Details for the file sqlforge-0.1.0.tar.gz.

File metadata

  • Download URL: sqlforge-0.1.0.tar.gz
  • Upload date:
  • Size: 16.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for sqlforge-0.1.0.tar.gz
Algorithm Hash digest
SHA256 4eb77f73ce9601d6ca9c915ce3c0b1545135178c7762154a05f9f251ab990416
MD5 bb5594f8863b1278a058b91c3a3b7dd6
BLAKE2b-256 dbd49e204af0cf7b0f02dbc5ff751f269bd2f3bc7e049415970220fa916e5a7e

See more details on using hashes here.

File details

Details for the file sqlforge-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sqlforge-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 12.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for sqlforge-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3f7b769607e5eb3140295cb1612224014e06fd886918a5a6a62c66933e478fe0
MD5 65ab36763a9ed12b83cfeeea6a5a2e9b
BLAKE2b-256 3668bdb054fc84c2977b87be5f6e18af1f053e0aaea115ce323ff72afc5766b5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page