T2S-Metrics
A small evaluation toolkit for text-to-SPARQL systems. It runs a configurable set of metrics over JSONL datasets and can execute queries against local RDF files or SPARQL endpoints.
Features
- Metrics for query exact match, token overlap, answer-set quality, BLEU/ROUGE, CodeBLEU, and more.
- Execution backends for local RDF (RDFLib) and remote SPARQL endpoints.
- Pluggable LLM-based judging via an Ollama backend.
- Python API for quick experiments.
Installation
The package is available on PyPI and can be installed directly with pip:
pip install t2s-metrics
For development (editable install), you can use:
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install -e .
Usage
Expected JSONL format
Input evaluation files must be JSON Lines (.jsonl) with one object per line.
Each object must include:
- id (string): unique query/case identifier
- golden (string): reference SPARQL query
- generated (string): system-generated SPARQL query
- order_matters (boolean): whether answer order must be preserved
This is exactly what JsonlEval expects in t2smetrics/core/eval.py.
Example (from datasets/ck25/eval/AIFB.jsonl):
{"id": "ck25:1-en", "golden": "PREFIX pv: <http://ld.company.org/prod-vocab/>\nSELECT DISTINCT ?result\nWHERE\n{\n <http://ld.company.org/prod-instances/empl-Karen.Brant%40company.org> pv:memberOf ?result .\n ?result a pv:Department .\n}\n", "generated": "SELECT ?department WHERE { ?person :name \"Ms. Brant\"; :worksIn ?department. }", "order_matters": false}
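Before running an experiment, it can help to check a file against the field list above. The following is a minimal sketch using only the standard library. The helper name validate_record is hypothetical and not part of t2s-metrics, and whether JsonlEval performs similar checks itself is not stated here.

```python
import json

# Field names and types follow the list above; this table is an illustration,
# not something exported by t2s-metrics.
REQUIRED_FIELDS = {
    "id": str,
    "golden": str,
    "generated": str,
    "order_matters": bool,
}


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"field {name} should be {expected.__name__}")
    return problems


good = '{"id": "q1", "golden": "SELECT * WHERE { ?s ?p ?o }", "generated": "SELECT ?s WHERE { ?s ?p ?o }", "order_matters": false}'
bad = '{"id": "q2", "golden": "ASK { ?s ?p ?o }", "order_matters": "no"}'

print(validate_record(good))  # []
print(validate_record(bad))   # ['missing field: generated', 'field order_matters should be bool']
```

To check a whole file, call validate_record on each non-empty line and report the line number alongside any problems.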
Execution backends
The library supports two execution backend families:
- Local RDF file execution with RDFLibBackend
- Remote SPARQL endpoint execution with SparqlEndpointBackend
SparqlEndpointBackend is generic SPARQL 1.1 and works with endpoints such as
QLever and Corese (and also GraphDB, Fuseki, Virtuoso, Blazegraph, etc.).
from t2smetrics.execution.rdflib_backend import RDFLibBackend
from t2smetrics.execution.sparql_endpoint_backend import SparqlEndpointBackend
# Option 1: local KG file
local_backend = RDFLibBackend("./datasets/example/kg/example.ttl")
# Option 2: remote endpoint (e.g., QLever/Corese)
endpoint_backend = SparqlEndpointBackend("http://localhost:8886/")
LLM backend (local Ollama + extensible)
For LLM-based metrics (for example LLMJudge), the library currently provides
OllamaBackend for local inference.
from t2smetrics.llm.ollama_backend import OllamaBackend
llm_backend = OllamaBackend(model="gemma3:4b")
The LLM layer is extensible via LLMBackend (t2smetrics/llm/base.py).
To plug another provider, implement judge(prompt: str, timeout: int = 30) -> dict
and return a dictionary with a numeric score (recommended in [0, 1]).
from t2smetrics.llm.base import LLMBackend
class MyLLMBackend(LLMBackend):
    def judge(self, prompt: str, timeout: int = 30) -> dict:
        # Call your provider/client here
        return {"score": 0.85, "raw": "optional provider response"}
Then pass your backend to Experiment(..., llm_backend=...).
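A custom judge usually has to turn a free-text model reply into the numeric score described above. The following is a minimal sketch of that step, using only the standard library. The function to_judge_result is hypothetical, and both the "first number in the reply" rule and the fallback to 0.0 are assumptions, not behavior defined by t2s-metrics.

```python
import re


def to_judge_result(raw_reply: str) -> dict:
    """Build a judge()-style dict from a raw reply (illustrative helper).

    Assumption: the first number found in the reply is the score.
    Assumption: a reply without any number is scored 0.0.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", raw_reply)
    score = float(match.group()) if match else 0.0
    # Clamp into the recommended [0, 1] range.
    score = min(1.0, max(0.0, score))
    return {"score": score, "raw": raw_reply}


print(to_judge_result("Score: 0.85"))
print(to_judge_result("I cannot rate this query."))
```

Inside a real backend, judge() would send the prompt to the provider and pass the reply text through a helper like this before returning. Prompting the model to answer with a bare number makes the parsing step less fragile.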
Python (minimal example)
from t2smetrics.core.experiment import Experiment
from t2smetrics.core.eval import JsonlEval
from t2smetrics.metrics.text_metrics import Bleu
from t2smetrics.metrics.token import TokenF1
jsonl_eval = JsonlEval("./datasets/example/eval/example.jsonl")
metrics = [Bleu(), TokenF1()]
experiment = Experiment(jsonl_eval, metrics)
_, summary = experiment.run()
print("\n=== SUMMARY ===")
for k, v in summary.items():
    print(f"{k}: {v:.4f}")
Python (full example with execution backends)
from t2smetrics.core.experiment import Experiment
from t2smetrics.core.eval import JsonlEval
from t2smetrics.execution.rdflib_backend import RDFLibBackend
from t2smetrics.llm.ollama_backend import OllamaBackend
from t2smetrics.metrics.answer_set.f1 import AnswerSetF1
from t2smetrics.metrics.answer_set.precision import AnswerSetPrecision
from t2smetrics.metrics.answer_set.precision_qald import PrecisionQALD
from t2smetrics.metrics.answer_set.recall import AnswerSetRecall
from t2smetrics.metrics.answer_set.recall_qald import RecallQALD
from t2smetrics.metrics.exact import QueryExactMatch
from t2smetrics.metrics.codebleu.codebleu import CodeBLEU
from t2smetrics.metrics.answer_set.f1_qald import F1QALD
from t2smetrics.metrics.answer_set.f1_spinach import F1Spinach
from t2smetrics.metrics.answer_set.mrr import MRR
from t2smetrics.metrics.answer_set.hit_at_k import HitAtK
from t2smetrics.metrics.answer_set.ndcg import NDCG
from t2smetrics.metrics.answer_set.p_at_k import PrecisionAtK
from t2smetrics.metrics.distance import (
    LevenshteinDistance,
    JaccardSimilarity,
    CosineSimilarity,
    EuclideanDistance,
)
from t2smetrics.metrics.llm_judge import LLMJudge
from t2smetrics.metrics.text_metrics import Bleu, RougeN, Meteor, SPBleu
from t2smetrics.metrics.uri.uri_hallucination import URIHallucination
from t2smetrics.metrics.query_execution import QueryExecution
from t2smetrics.metrics.token import SPF1, TokenRecall, TokenPrecision, TokenF1
jsonl_eval = JsonlEval("./datasets/example/eval/example.jsonl")
execution_backend = RDFLibBackend("./datasets/example/kg/example.ttl")
llm_backend = OllamaBackend()
metrics = [
    AnswerSetPrecision(),
    AnswerSetRecall(),
    AnswerSetF1(),
    Bleu(),
    SPBleu(),
    CodeBLEU(),
    CosineSimilarity(),
    EuclideanDistance(),
    F1QALD(),
    PrecisionQALD(),
    RecallQALD(),
    F1Spinach(),
    HitAtK(k=5),
    JaccardSimilarity(),
    LLMJudge(),
    LevenshteinDistance(),
    MRR(),
    Meteor(),
    NDCG(),
    PrecisionAtK(k=1),
    QueryExecution(),
    QueryExactMatch(),
    RougeN(1),
    RougeN(2),
    RougeN(3),
    RougeN(4),
    TokenF1(),
    SPF1(),
    TokenPrecision(),
    TokenRecall(),
    URIHallucination(),
]
experiment = Experiment(
    jsonl_eval=jsonl_eval,
    metrics=metrics,
    execution_backend=execution_backend,
    llm_backend=llm_backend,
    verbose=True,
)
results, summary = experiment.run()
print("=== PER QUERY RESULTS ===")
for r in results:
    print(r)
print("\n=== SUMMARY ===")
for k, v in summary.items():
    print(f"{k}: {v:.4f}")
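For orientation, answer-set metrics are commonly defined by comparing the set of answers returned by the golden query with the set returned by the generated query. The sketch below shows the textbook set-based precision, recall, and F1. It is not taken from the t2s-metrics source, the function name answer_set_prf is hypothetical, and conventions for empty answer sets differ between benchmarks (this sketch simply scores them as 0.0).

```python
def answer_set_prf(gold: set, predicted: set) -> tuple[float, float, float]:
    """Textbook set-based precision, recall, and F1 (illustrative only)."""
    overlap = len(gold & predicted)
    # Assumption: an empty set on either side yields 0.0 for that measure.
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold_answers = {"dept:Sales", "dept:Marketing", "dept:HR"}
predicted_answers = {"dept:Marketing", "dept:HR", "dept:Legal", "dept:IT"}

p, r, f1 = answer_set_prf(gold_answers, predicted_answers)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
# precision=0.5000 recall=0.6667 f1=0.5714
```

Two of the four predicted answers are correct (precision 0.5) and two of the three golden answers are found (recall 2/3), which gives an F1 of 4/7.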
Full workflow example (dataset + endpoint + export)
For a complete run over multiple systems and export of aggregated metrics to JSON,
see t2smetrics/run_text2sparql.py.
Typical workflow:
- Choose a dataset folder (for example datasets/ck25).
- Put input files under datasets/<dataset>/eval/*.jsonl.
- Start your SPARQL endpoint (for example QLever/Corese).
- Set endpoint URL in the script (example: http://localhost:8886/).
- Run:
python -m t2smetrics.run_text2sparql
The script writes timestamped summary files under:
datasets/<dataset>/results/<dataset>-YYYYMMDD-HHMMSS.json
These result files are then directly consumable by the dashboard.
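If you write your own export script and want its output to follow the same naming pattern, the path can be built as in the sketch below. This is an illustration based only on the pattern shown above. The function result_path is hypothetical and is not how run_text2sparql.py is known to be implemented.

```python
from datetime import datetime
from pathlib import Path


def result_path(dataset: str, now: datetime) -> Path:
    """Build datasets/<dataset>/results/<dataset>-YYYYMMDD-HHMMSS.json."""
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return Path("datasets") / dataset / "results" / f"{dataset}-{stamp}.json"


path = result_path("ck25", datetime(2026, 3, 6, 13, 32, 27))
print(path.as_posix())
# datasets/ck25/results/ck25-20260306-133227.json
```

Passing the timestamp in as an argument, rather than calling datetime.now() inside the function, keeps the helper deterministic and easy to test.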
Dashboard
The dashboard reads JSON result files (generated in datasets/*/results/*.json)
and serves an interactive UI (Radar, Bar, Correlation Heatmap, Parallel Coordinates,
Scatter Matrix).
Launch with auto-discovery:
python -m t2smetrics.cli dashboard
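To preview which files match the datasets/*/results/*.json pattern before launching, a glob over the repository root is enough. The sketch below is an assumption about what discovery amounts to, not the CLI's actual code. The function discover_results is hypothetical, and the demo builds a throwaway directory tree so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def discover_results(root: Path) -> list[Path]:
    """Return result files matching datasets/*/results/*.json under root."""
    return sorted(root.glob("datasets/*/results/*.json"))


# Demo on a temporary tree that mimics the repository layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for dataset in ("ck25", "db25"):
        results_dir = root / "datasets" / dataset / "results"
        results_dir.mkdir(parents=True)
        (results_dir / f"{dataset}-20260306-120000.json").write_text("{}")
    # Files outside results/ are not picked up by the pattern.
    eval_dir = root / "datasets" / "ck25" / "eval"
    eval_dir.mkdir()
    (eval_dir / "AIFB.jsonl").write_text("")
    found = [p.relative_to(root).as_posix() for p in discover_results(root)]

print(found)
```

In a real checkout, calling discover_results(Path(".")) from the repository root lists the candidate files.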
Launch with explicit files:
python -m t2smetrics.cli dashboard \
datasets/ck25/results/ck25-20260306-133227.json \
datasets/db25/results/db25-20260306-132100.json
Then open:
http://127.0.0.1:8050
Development
Build
python setup.py sdist bdist_wheel
Tests
There are no automated tests yet. If you add tests, run them with:
python -m pytest
License
t2s-metrics
t2s-metrics is provided under the terms of the GNU Affero General Public License 3.0 (AGPL-3.0).
Redistribution of third-party software and data
This repository provides several third-party contributions redistributed with their original licenses.
CK25 Dataset
t2s-metrics reuses the CK25 Corporate Knowledge Reference Dataset for Benchmarking Text-2-SPARQL QA Approaches, which we converted to the JSONL format that the toolkit requires.
The modified version is redistributed in the datasets/ck25 directory under the terms of the Creative Commons Attribution 4.0 International license (CC-BY-4.0).
QCan library
t2s-metrics reuses the QCan software for canonicalising SPARQL queries.
QCan is written in Java. In this repository, we distribute the compiled jar of QCan v1.1, third_party_lib/qcan-1.1-jar-with-dependencies.jar, under the terms of the Apache 2.0 license.