SQLAS — SQL Agent Scoring Framework. Production-grade evaluation for Text-to-SQL and SQL AI agents. 20 metrics across 8 categories.
Project description
SQLAS — SQL Agent Scoring Framework
Production-grade evaluation framework for Text-to-SQL and SQL AI agents. 20 metrics. 8 categories. Any LLM.
SQLAS scores your SQL agent the way production demands — execution accuracy, semantic correctness, context quality, cost efficiency, safety, and more. Built on industry benchmarks (Spider, BIRD) and real-world observability patterns (Arize, MLflow).
Author: Pradip Tivhale
Install
# From PyPI
pip install sqlas
# From source
git clone https://github.com/thepradip/SQLAS.git
cd SQLAS
pip install .
# With MLflow integration
pip install sqlas[mlflow]
# With dev tools
pip install sqlas[dev]
Quick Start
from sqlas import evaluate
# Your LLM judge function (any LLM: OpenAI, Anthropic, local, etc.)
def my_llm_judge(prompt: str) -> str:
return client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
).choices[0].message.content
# Evaluate a single query
scores = evaluate(
question="How many active users are there?",
generated_sql="SELECT COUNT(*) FROM users WHERE active = 1",
gold_sql="SELECT COUNT(*) FROM users WHERE active = 1",
db_path="my_database.db",
llm_judge=my_llm_judge,
response="There are 1,523 active users.",
result_data={"columns": ["COUNT(*)"], "rows": [[1523]], "row_count": 1, "execution_time_ms": 2.1},
)
print(scores.overall_score) # 0.95
print(scores.summary())
Evaluate Without Gold SQL
Gold SQL is optional. Without it, SQLAS uses semantic equivalence (LLM judge) and execution success:
scores = evaluate(
question="Show top 10 products by revenue",
generated_sql="SELECT name, SUM(price * qty) AS rev FROM orders GROUP BY name ORDER BY rev DESC LIMIT 10",
llm_judge=my_llm_judge,
response="The top products are...",
result_data={"columns": ["name", "rev"], "rows": [...], "row_count": 10, "execution_time_ms": 15},
)
Run a Test Suite
from sqlas import run_suite, TestCase
test_cases = [
TestCase(
question="How many users signed up this month?",
gold_sql="SELECT COUNT(*) FROM users WHERE created_at >= '2026-03-01'",
category="easy",
),
TestCase(
question="Average order value by country",
gold_sql="SELECT country, AVG(total) FROM orders GROUP BY country",
category="medium",
),
]
def my_agent(question: str) -> dict:
# Your SQL agent pipeline
sql = generate_sql(question)
result = execute(sql)
response = narrate(result)
return {"sql": sql, "response": response, "data": result}
results = run_suite(
test_cases=test_cases,
agent_fn=my_agent,
llm_judge=my_llm_judge,
db_path="my_database.db",
pass_threshold=0.6, # configurable
)
print(results["summary"]["overall_score"]) # 0.88
Metric Framework
v1: Production Composite Score (15 metrics, 6 categories)
The default WEIGHTS profile uses 15 metrics:
SQLAS = 40% Execution Accuracy
+ 15% Semantic Correctness
+ 15% Cost Efficiency
+ 10% Execution Quality
+ 10% Task Success
+ 10% Safety
v2: Full RAGAS-Mapped Score (20 metrics, 8 categories)
Use WEIGHTS_V2 for the full 20-metric evaluation with context quality:
from sqlas import evaluate, WEIGHTS_V2
scores = evaluate(..., weights=WEIGHTS_V2)
SQLAS v2 = 35% Execution Accuracy
+ 13% Semantic Correctness
+ 10% Context Quality (NEW — RAGAS-mapped)
+ 12% Cost Efficiency
+ 8% Execution Quality
+ 8% Task Success
+ 4% Result Similarity (NEW)
+ 10% Safety
Detailed Breakdown (v2 — 20 metrics)
| Category | Metric | v1 Weight | v2 Weight | Method |
|---|---|---|---|---|
| Execution Accuracy | execution_accuracy | 40% | 35% | Automated: output + structure + efficiency |
| Semantic Correctness | semantic_equivalence | 15% | 13% | LLM-as-judge |
| Context Quality | context_precision | — | 3% | Automated: schema element precision vs gold |
| context_recall | — | 3% | Automated: schema element recall vs gold | |
| entity_recall | — | 2% | Automated: strict entity-level recall | |
| noise_robustness | — | 2% | Automated: irrelevant schema resistance | |
| Cost Efficiency | efficiency_score | 5% | 4% | Automated: VES |
| data_scan_efficiency | 5% | 4% | Automated: scan detection | |
| sql_quality | 3% | 2% | LLM: join/agg/filter | |
| schema_compliance | 2% | 2% | Automated: sqlglot | |
| Execution Quality | execution_success | 5% | 4% | Automated |
| complexity_match | 3% | 2% | LLM-as-judge | |
| empty_result_penalty | 2% | 2% | Automated | |
| Task Success | faithfulness | 4% | 3% | LLM-as-judge |
| answer_relevance | 3% | 2% | LLM-as-judge | |
| answer_completeness | 2% | 2% | LLM-as-judge | |
| fluency | 1% | 1% | LLM-as-judge | |
| Result Similarity | result_set_similarity | — | 4% | Automated: Jaccard on result sets |
| Safety | read_only_compliance | 5% | 5% | Automated: DDL/DML |
| safety_score | 5% | 5% | Automated: PII/injection |
Custom Weights
my_weights = {
"execution_accuracy": 0.50, # increase correctness weight
"semantic_equivalence": 0.10,
"safety_score": 0.15, # stricter safety
# ... other metrics (must sum to 1.0)
}
scores = evaluate(..., weights=my_weights)
Use Individual Metrics
from sqlas import execution_accuracy, schema_compliance, safety_score
from sqlas import context_precision, context_recall, entity_recall
# Just check execution accuracy
score, details = execution_accuracy(
generated_sql="SELECT COUNT(*) FROM users",
gold_sql="SELECT COUNT(*) FROM users",
db_path="my.db",
)
# Just check schema compliance
score, details = schema_compliance(
sql="SELECT name FROM users",
valid_tables={"users", "orders"},
valid_columns={"users": {"id", "name", "email"}, "orders": {"id", "user_id", "total"}},
)
# Just check safety
score, details = safety_score(
sql="SELECT * FROM users",
pii_columns=["email", "phone", "ssn"],
)
# Context quality (requires gold SQL)
precision, details = context_precision(
generated_sql="SELECT name, age FROM users WHERE active = 1",
gold_sql="SELECT name FROM users WHERE active = 1",
)
# precision < 1.0 — 'age' is extra
recall, details = context_recall(
generated_sql="SELECT name FROM users",
gold_sql="SELECT name FROM users WHERE active = 1",
)
# recall < 1.0 — 'active' is missing
Metric Mapping (vs. RAG Evaluation Standards)
| Standard Metric | SQLAS Equivalent | Description |
|---|---|---|
| Faithfulness | faithfulness |
Claims grounded in SQL result data |
| Answer Relevance | answer_relevance |
Response answers the question |
| Answer Correctness | execution_accuracy |
SQL returns correct results |
| Answer Similarity | result_set_similarity |
Result set Jaccard similarity |
| Context Precision | context_precision |
Only relevant schema elements used |
| Context Recall | context_recall |
All required schema elements used |
| Context Entity Recall | entity_recall |
Strict entity match (tables, columns, literals, functions) |
| Noise Sensitivity | noise_robustness |
Resistance to irrelevant schema context |
| — | semantic_equivalence |
SQL answers the intent (LLM judge) |
| — | safety_score |
PII + injection + DDL protection |
| — | schema_compliance |
Valid tables/columns via AST |
Production Features
- Read-only DB: All query execution uses read-only connections
- Timeout guard: SQL execution timeout (default 30s) prevents hangs
- LLM resilience: All LLM judge calls wrapped with error handling
- Input validation: Empty SQL, missing db_path, weight sum checks
- Structured logging: Uses Python
loggingmodule (not print) - Type-checked: Ships
py.typedmarker for mypy/pyright
LLM Judge
SQLAS is LLM-agnostic. Provide any function (prompt: str) -> str:
# OpenAI
def judge(prompt):
return openai_client.chat.completions.create(
model="gpt-4o", messages=[{"role": "user", "content": prompt}]
).choices[0].message.content
# Anthropic
def judge(prompt):
return anthropic_client.messages.create(
model="claude-sonnet-4-20250514", max_tokens=500,
messages=[{"role": "user", "content": prompt}]
).content[0].text
# Local (Ollama)
def judge(prompt):
import requests
return requests.post("http://localhost:11434/api/generate",
json={"model": "llama3", "prompt": prompt}
).json()["response"]
Example: SQL AI Agent (LangGraph + SQLAS)
See thepradip/SQL-AI-Agent — a full-stack NL-to-SQL application powered by LangGraph that uses SQLAS for:
- Pre-execution safety gate —
read_only_compliance,safety_score,schema_complianceblock unsafe queries - Post-response quality scoring — full
evaluate()scores every query on 20 metrics - Evaluation suite — 25 test cases across 4 difficulty tiers scored by SQLAS
License
MIT License - Pradip Tivhale
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sqlas-1.1.1.tar.gz.
File metadata
- Download URL: sqlas-1.1.1.tar.gz
- Upload date:
- Size: 26.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f9bac8228acfd6cc5d1a76ff65a69212c21cc6ac7e4e8ae316406c9f2ff0b1e0
|
|
| MD5 |
6fe63e9e56ced6d7ace26e13c6c5e787
|
|
| BLAKE2b-256 |
9e6e7880f90d3e05d8aac46c7997e0a9ef2a23d6205c88b6a2b948ff3801777c
|
File details
Details for the file sqlas-1.1.1-py3-none-any.whl.
File metadata
- Download URL: sqlas-1.1.1-py3-none-any.whl
- Upload date:
- Size: 25.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
daee8ddd558a4f487193fb763e1456b25bef542ee72c7a873b5f2c3f69636e57
|
|
| MD5 |
d6fe277fc50632a969fb853cf6be4be8
|
|
| BLAKE2b-256 |
8d9b7cd75aae5e5880ebec2e9e4270cc7cfd33396cfd76db04bf39afe6be2b37
|