SQLAS — SQL Agent Scoring Framework. Production-grade evaluation for Text-to-SQL and SQL AI agents. 20 metrics across 8 categories.

These details have not been verified by PyPI

Project links

Project description

SQLAS Logo

SQLAS — SQL Agent Scoring Framework

Production-grade evaluation framework for Text-to-SQL and SQL AI agents. 20 metrics. 8 categories. Any LLM.

Python License

SQLAS scores your SQL agent the way production demands — execution accuracy, semantic correctness, context quality, cost efficiency, safety, and more. Built on industry benchmarks (Spider, BIRD) and real-world observability patterns (Arize, MLflow).

Author: Pradip Tivhale

Install

# From PyPI
pip install sqlas

# From source
git clone https://github.com/thepradip/SQLAS.git
cd SQLAS
pip install .

# With MLflow integration
pip install sqlas[mlflow]

# With dev tools
pip install sqlas[dev]

Quick Start

from sqlas import evaluate

# Your LLM judge function (any LLM: OpenAI, Anthropic, local, etc.)
def my_llm_judge(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

# Evaluate a single query
scores = evaluate(
    question="How many active users are there?",
    generated_sql="SELECT COUNT(*) FROM users WHERE active = 1",
    gold_sql="SELECT COUNT(*) FROM users WHERE active = 1",
    db_path="my_database.db",
    llm_judge=my_llm_judge,
    response="There are 1,523 active users.",
    result_data={"columns": ["COUNT(*)"], "rows": [[1523]], "row_count": 1, "execution_time_ms": 2.1},
)

print(scores.overall_score)  # 0.95
print(scores.summary())

Evaluate Without Gold SQL

Gold SQL is optional. Without it, SQLAS uses semantic equivalence (LLM judge) and execution success:

scores = evaluate(
    question="Show top 10 products by revenue",
    generated_sql="SELECT name, SUM(price * qty) AS rev FROM orders GROUP BY name ORDER BY rev DESC LIMIT 10",
    llm_judge=my_llm_judge,
    response="The top products are...",
    result_data={"columns": ["name", "rev"], "rows": [...], "row_count": 10, "execution_time_ms": 15},
)

Run a Test Suite

from sqlas import run_suite, TestCase

test_cases = [
    TestCase(
        question="How many users signed up this month?",
        gold_sql="SELECT COUNT(*) FROM users WHERE created_at >= '2026-03-01'",
        category="easy",
    ),
    TestCase(
        question="Average order value by country",
        gold_sql="SELECT country, AVG(total) FROM orders GROUP BY country",
        category="medium",
    ),
]

def my_agent(question: str) -> dict:
    # Your SQL agent pipeline
    sql = generate_sql(question)
    result = execute(sql)
    response = narrate(result)
    return {"sql": sql, "response": response, "data": result}

results = run_suite(
    test_cases=test_cases,
    agent_fn=my_agent,
    llm_judge=my_llm_judge,
    db_path="my_database.db",
    pass_threshold=0.6,  # configurable
)

print(results["summary"]["overall_score"])  # 0.88

Metric Framework

v1: Production Composite Score (15 metrics, 6 categories)

The default WEIGHTS profile uses 15 metrics:

SQLAS = 40% Execution Accuracy
      + 15% Semantic Correctness
      + 15% Cost Efficiency
      + 10% Execution Quality
      + 10% Task Success
      + 10% Safety

v2: Full RAGAS-Mapped Score (20 metrics, 8 categories)

Use WEIGHTS_V2 for the full 20-metric evaluation with context quality:

from sqlas import evaluate, WEIGHTS_V2

scores = evaluate(..., weights=WEIGHTS_V2)

SQLAS v2 = 35% Execution Accuracy
         + 13% Semantic Correctness
         + 10% Context Quality (NEW — RAGAS-mapped)
         + 12% Cost Efficiency
         +  8% Execution Quality
         +  8% Task Success
         +  4% Result Similarity (NEW)
         + 10% Safety

Detailed Breakdown (v2 — 20 metrics)

Category	Metric	v1 Weight	v2 Weight	Method
Execution Accuracy	execution_accuracy	40%	35%	Automated: output + structure + efficiency
Semantic Correctness	semantic_equivalence	15%	13%	LLM-as-judge
Context Quality	context_precision	—	3%	Automated: schema element precision vs gold
	context_recall	—	3%	Automated: schema element recall vs gold
	entity_recall	—	2%	Automated: strict entity-level recall
	noise_robustness	—	2%	Automated: irrelevant schema resistance
Cost Efficiency	efficiency_score	5%	4%	Automated: VES
	data_scan_efficiency	5%	4%	Automated: scan detection
	sql_quality	3%	2%	LLM: join/agg/filter
	schema_compliance	2%	2%	Automated: sqlglot
Execution Quality	execution_success	5%	4%	Automated
	complexity_match	3%	2%	LLM-as-judge
	empty_result_penalty	2%	2%	Automated
Task Success	faithfulness	4%	3%	LLM-as-judge
	answer_relevance	3%	2%	LLM-as-judge
	answer_completeness	2%	2%	LLM-as-judge
	fluency	1%	1%	LLM-as-judge
Result Similarity	result_set_similarity	—	4%	Automated: Jaccard on result sets
Safety	read_only_compliance	5%	5%	Automated: DDL/DML
	safety_score	5%	5%	Automated: PII/injection

Custom Weights

my_weights = {
    "execution_accuracy": 0.50,  # increase correctness weight
    "semantic_equivalence": 0.10,
    "safety_score": 0.15,        # stricter safety
    # ... other metrics (must sum to 1.0)
}

scores = evaluate(..., weights=my_weights)

Use Individual Metrics

from sqlas import execution_accuracy, schema_compliance, safety_score
from sqlas import context_precision, context_recall, entity_recall

# Just check execution accuracy
score, details = execution_accuracy(
    generated_sql="SELECT COUNT(*) FROM users",
    gold_sql="SELECT COUNT(*) FROM users",
    db_path="my.db",
)

# Just check schema compliance
score, details = schema_compliance(
    sql="SELECT name FROM users",
    valid_tables={"users", "orders"},
    valid_columns={"users": {"id", "name", "email"}, "orders": {"id", "user_id", "total"}},
)

# Just check safety
score, details = safety_score(
    sql="SELECT * FROM users",
    pii_columns=["email", "phone", "ssn"],
)

# Context quality (requires gold SQL)
precision, details = context_precision(
    generated_sql="SELECT name, age FROM users WHERE active = 1",
    gold_sql="SELECT name FROM users WHERE active = 1",
)
# precision < 1.0 — 'age' is extra

recall, details = context_recall(
    generated_sql="SELECT name FROM users",
    gold_sql="SELECT name FROM users WHERE active = 1",
)
# recall < 1.0 — 'active' is missing

Metric Mapping (vs. RAG Evaluation Standards)

Standard Metric	SQLAS Equivalent	Description
Faithfulness	`faithfulness`	Claims grounded in SQL result data
Answer Relevance	`answer_relevance`	Response answers the question
Answer Correctness	`execution_accuracy`	SQL returns correct results
Answer Similarity	`result_set_similarity`	Result set Jaccard similarity
Context Precision	`context_precision`	Only relevant schema elements used
Context Recall	`context_recall`	All required schema elements used
Context Entity Recall	`entity_recall`	Strict entity match (tables, columns, literals, functions)
Noise Sensitivity	`noise_robustness`	Resistance to irrelevant schema context
—	`semantic_equivalence`	SQL answers the intent (LLM judge)
—	`safety_score`	PII + injection + DDL protection
—	`schema_compliance`	Valid tables/columns via AST

Production Features

Read-only DB: All query execution uses read-only connections
Timeout guard: SQL execution timeout (default 30s) prevents hangs
LLM resilience: All LLM judge calls wrapped with error handling
Input validation: Empty SQL, missing db_path, weight sum checks
Structured logging: Uses Python logging module (not print)
Type-checked: Ships py.typed marker for mypy/pyright

LLM Judge

SQLAS is LLM-agnostic. Provide any function (prompt: str) -> str:

# OpenAI
def judge(prompt):
    return openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

# Anthropic
def judge(prompt):
    return anthropic_client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text

# Local (Ollama)
def judge(prompt):
    import requests
    return requests.post("http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt}
    ).json()["response"]

Example: SQL AI Agent (LangGraph + SQLAS)

See thepradip/SQL-AI-Agent — a full-stack NL-to-SQL application powered by LangGraph that uses SQLAS for:

Pre-execution safety gate — read_only_compliance, safety_score, schema_compliance block unsafe queries
Post-response quality scoring — full evaluate() scores every query on 20 metrics
Evaluation suite — 25 test cases across 4 difficulty tiers scored by SQLAS

License

MIT License - Pradip Tivhale

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

2.8.0

May 9, 2026

2.6.0

May 5, 2026

2.5.0

May 4, 2026

2.0.0

Apr 30, 2026

1.3.0

Apr 29, 2026

This version

1.1.1

Apr 15, 2026

1.1.0

Mar 30, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sqlas-1.1.1.tar.gz (26.2 kB view details)

Uploaded Apr 15, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sqlas-1.1.1-py3-none-any.whl (25.0 kB view details)

Uploaded Apr 15, 2026 Python 3

File details

Details for the file sqlas-1.1.1.tar.gz.

File metadata

Download URL: sqlas-1.1.1.tar.gz
Upload date: Apr 15, 2026
Size: 26.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.4

File hashes

Hashes for sqlas-1.1.1.tar.gz
Algorithm	Hash digest
SHA256	`f9bac8228acfd6cc5d1a76ff65a69212c21cc6ac7e4e8ae316406c9f2ff0b1e0`
MD5	`6fe63e9e56ced6d7ace26e13c6c5e787`
BLAKE2b-256	`9e6e7880f90d3e05d8aac46c7997e0a9ef2a23d6205c88b6a2b948ff3801777c`

See more details on using hashes here.

File details

Details for the file sqlas-1.1.1-py3-none-any.whl.

File metadata

Download URL: sqlas-1.1.1-py3-none-any.whl
Upload date: Apr 15, 2026
Size: 25.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.4

File hashes

Hashes for sqlas-1.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`daee8ddd558a4f487193fb763e1456b25bef542ee72c7a873b5f2c3f69636e57`
MD5	`d6fe277fc50632a969fb853cf6be4be8`
BLAKE2b-256	`8d9b7cd75aae5e5880ebec2e9e4270cc7cfd33396cfd76db04bf39afe6be2b37`

See more details on using hashes here.

sqlas 1.1.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

SQLAS — SQL Agent Scoring Framework

Install

Quick Start

Evaluate Without Gold SQL

Run a Test Suite

Metric Framework

v1: Production Composite Score (15 metrics, 6 categories)

v2: Full RAGAS-Mapped Score (20 metrics, 8 categories)

Detailed Breakdown (v2 — 20 metrics)

Custom Weights

Use Individual Metrics

Metric Mapping (vs. RAG Evaluation Standards)

Production Features

LLM Judge

Example: SQL AI Agent (LangGraph + SQLAS)

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes