
T2S-Metrics



A small evaluation toolkit for text-to-SPARQL systems. It runs a configurable set of metrics over JSONL datasets and can execute queries against local RDF files or SPARQL endpoints.


Features

  • Metrics for query exact match, token overlap, answer-set quality, BLEU/ROUGE, CodeBLEU, and more.
  • Execution backends for local RDF (RDFLib) and remote SPARQL endpoints.
  • Pluggable LLM-based judging via an Ollama backend.
  • Python API for quick experiments.
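As a rough illustration of what the token-overlap family computes, here is a minimal sketch of token-level F1. This is not the library's actual implementation; it assumes plain whitespace tokenization, whereas the real metric may tokenize SPARQL more carefully:

```python
from collections import Counter

def token_f1(golden: str, generated: str) -> float:
    """Token-level F1 between two queries (naive whitespace tokenization)."""
    gold = Counter(golden.split())
    gen = Counter(generated.split())
    overlap = sum((gold & gen).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

# Only the last variable differs, so 7 of 8 tokens overlap on each side.
print(token_f1("SELECT ?x WHERE { ?x a ?y }", "SELECT ?x WHERE { ?x a ?z }"))  # → 0.875
```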

Installation

The package is available on PyPI and can be installed directly with pip:

pip install t2s-metrics

For development (editable install), you can use:

uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install -e .

Usage

Expected JSONL format

Input evaluation files must be JSON Lines (.jsonl) with one object per line. Each object must include:

  • id (string): unique query/case identifier
  • golden (string): reference SPARQL query
  • generated (string): system-generated SPARQL query
  • order_matters (boolean): whether answer order must be preserved

This is exactly what JsonlEval expects in t2smetrics/core/eval.py.

Example (from datasets/ck25/eval/AIFB.jsonl):

{"id": "ck25:1-en", "golden": "PREFIX pv: <http://ld.company.org/prod-vocab/>\nSELECT DISTINCT ?result\nWHERE\n{\n  <http://ld.company.org/prod-instances/empl-Karen.Brant%40company.org> pv:memberOf ?result .\n  ?result a pv:Department .\n}\n", "generated": "SELECT ?department WHERE { ?person :name \"Ms. Brant\"; :worksIn ?department. }", "order_matters": false}
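To sanity-check a file before running an experiment, a quick stdlib-only validation pass might look like this. The required fields are the four listed above; JsonlEval itself may perform additional or stricter checks:

```python
import json

# Required field names and types, per the format description above.
REQUIRED = {"id": str, "golden": str, "generated": str, "order_matters": bool}

def validate_jsonl(path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file looks OK."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {lineno}: invalid JSON ({e})")
                continue
            for field, ftype in REQUIRED.items():
                if field not in obj:
                    problems.append(f"line {lineno}: missing '{field}'")
                elif not isinstance(obj[field], ftype):
                    problems.append(f"line {lineno}: '{field}' should be {ftype.__name__}")
    return problems
```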

Execution backends

The library supports two execution backend families:

  1. Local RDF file execution with RDFLibBackend
  2. Remote SPARQL endpoint execution with SparqlEndpointBackend

SparqlEndpointBackend speaks the generic SPARQL 1.1 protocol and works with endpoints such as QLever and Corese (as well as GraphDB, Fuseki, Virtuoso, Blazegraph, etc.).

from t2smetrics.execution.rdflib_backend import RDFLibBackend
from t2smetrics.execution.sparql_endpoint_backend import SparqlEndpointBackend

# Option 1: local KG file
local_backend = RDFLibBackend("./datasets/example/kg/example.ttl")

# Option 2: remote endpoint (e.g., QLever/Corese)
endpoint_backend = SparqlEndpointBackend("http://localhost:8886/")
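Under the SPARQL 1.1 Protocol, a remote backend's job boils down to plain HTTP: the query is sent (typically by POST) with an Accept header requesting application/sparql-results+json. The stdlib sketch below illustrates the protocol only; it is not SparqlEndpointBackend's actual code, and the endpoint URL is a placeholder:

```python
import json
import urllib.parse
import urllib.request

def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol POST request asking for JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )

def run_query(endpoint: str, query: str) -> dict:
    """Execute the query and parse the SPARQL JSON results format."""
    request = build_sparql_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```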

LLM backend (local Ollama + extensible)

For LLM-based metrics (for example LLMJudge), the library currently provides OllamaBackend for local inference.

from t2smetrics.llm.ollama_backend import OllamaBackend

llm_backend = OllamaBackend(model="gemma3:4b")

The LLM layer is extensible via LLMBackend (t2smetrics/llm/base.py). To plug in another provider, implement judge(prompt: str, timeout: int = 30) -> dict and return a dictionary containing a numeric score (values in [0, 1] are recommended).

from t2smetrics.llm.base import LLMBackend


class MyLLMBackend(LLMBackend):
    def judge(self, prompt: str, timeout: int = 30) -> dict:
        # Call your provider/client here
        return {"score": 0.85, "raw": "optional provider response"}

Then pass your backend to Experiment(..., llm_backend=...).

Python (minimal example)

from t2smetrics.core.experiment import Experiment
from t2smetrics.core.eval import JsonlEval
from t2smetrics.metrics.text_metrics import Bleu
from t2smetrics.metrics.token import TokenF1


jsonl_eval = JsonlEval("./datasets/example/eval/example.jsonl")
metrics = [Bleu(), TokenF1()]
experiment = Experiment(jsonl_eval, metrics)
_, summary = experiment.run()

print("\n=== SUMMARY ===")
for k, v in summary.items():
    print(f"{k}: {v:.4f}")

Python (full example with execution backends)

from t2smetrics.core.experiment import Experiment
from t2smetrics.core.eval import JsonlEval
from t2smetrics.execution.rdflib_backend import RDFLibBackend

from t2smetrics.llm.ollama_backend import OllamaBackend
from t2smetrics.metrics.answer_set.f1 import AnswerSetF1
from t2smetrics.metrics.answer_set.precision import AnswerSetPrecision
from t2smetrics.metrics.answer_set.precision_qald import PrecisionQALD
from t2smetrics.metrics.answer_set.recall import AnswerSetRecall
from t2smetrics.metrics.answer_set.recall_qald import RecallQALD
from t2smetrics.metrics.exact import QueryExactMatch
from t2smetrics.metrics.codebleu.codebleu import CodeBLEU
from t2smetrics.metrics.answer_set.f1_qald import F1QALD
from t2smetrics.metrics.answer_set.f1_spinach import F1Spinach
from t2smetrics.metrics.answer_set.mrr import MRR
from t2smetrics.metrics.answer_set.hit_at_k import HitAtK
from t2smetrics.metrics.answer_set.ndcg import NDCG
from t2smetrics.metrics.answer_set.p_at_k import PrecisionAtK
from t2smetrics.metrics.distance import (
    LevenshteinDistance,
    JaccardSimilarity,
    CosineSimilarity,
    EuclideanDistance,
)
from t2smetrics.metrics.llm_judge import LLMJudge
from t2smetrics.metrics.text_metrics import Bleu, RougeN, Meteor, SPBleu
from t2smetrics.metrics.uri.uri_hallucination import URIHallucination
from t2smetrics.metrics.query_execution import QueryExecution
from t2smetrics.metrics.token import SPF1, TokenRecall, TokenPrecision, TokenF1


jsonl_eval = JsonlEval("./datasets/example/eval/example.jsonl")

execution_backend = RDFLibBackend("./datasets/example/kg/example.ttl")

llm_backend = OllamaBackend()

metrics = [
    AnswerSetPrecision(),
    AnswerSetRecall(),
    AnswerSetF1(),
    Bleu(),
    SPBleu(),
    CodeBLEU(),
    CosineSimilarity(),
    EuclideanDistance(),
    F1QALD(),
    PrecisionQALD(),
    RecallQALD(),
    F1Spinach(),
    HitAtK(k=5),
    JaccardSimilarity(),
    LLMJudge(),
    LevenshteinDistance(),
    MRR(),
    Meteor(),
    NDCG(),
    PrecisionAtK(k=1),
    QueryExecution(),
    QueryExactMatch(),
    RougeN(1),
    RougeN(2),
    RougeN(3),
    RougeN(4),
    TokenF1(),
    SPF1(),
    TokenPrecision(),
    TokenRecall(),
    URIHallucination(),
]

experiment = Experiment(
    jsonl_eval=jsonl_eval,
    metrics=metrics,
    execution_backend=execution_backend,
    llm_backend=llm_backend,
    verbose=True,
)

results, summary = experiment.run()

print("=== PER QUERY RESULTS ===")
for r in results:
    print(r)

print("\n=== SUMMARY ===")
for k, v in summary.items():
    print(f"{k}: {v:.4f}")

Full workflow example (dataset + endpoint + export)

For a complete run over multiple systems and export of aggregated metrics to JSON, see t2smetrics/run_text2sparql.py.

Typical workflow:

  1. Choose a dataset folder (for example datasets/ck25).
  2. Put input files under datasets/<dataset>/eval/*.jsonl.
  3. Start your SPARQL endpoint (for example QLever/Corese).
  4. Set endpoint URL in the script (example: http://localhost:8886/).
  5. Run:
python -m t2smetrics.run_text2sparql

The script writes timestamped summary files under:

datasets/<dataset>/results/<dataset>-YYYYMMDD-HHMMSS.json

These result files are then directly consumable by the dashboard.
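If you post-process these result files yourself, the timestamp suffix is easy to match and parse with the stdlib. The pattern below is assumed from the filename template above:

```python
import re
from datetime import datetime

# Matches "<dataset>-YYYYMMDD-HHMMSS.json"; the dataset name may itself contain dashes.
RESULT_RE = re.compile(r"^(?P<dataset>.+)-(?P<ts>\d{8}-\d{6})\.json$")

def parse_result_filename(name: str) -> tuple[str, datetime]:
    """Split a result filename into its dataset name and timestamp."""
    m = RESULT_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected result filename: {name!r}")
    return m.group("dataset"), datetime.strptime(m.group("ts"), "%Y%m%d-%H%M%S")
```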

Dashboard

The dashboard reads JSON result files (generated in datasets/*/results/*.json) and serves an interactive UI (Radar, Bar, Correlation Heatmap, Parallel Coordinates, Scatter Matrix).

Launch with auto-discovery:

python -m t2smetrics.cli dashboard

Launch with explicit files:

python -m t2smetrics.cli dashboard \
    datasets/ck25/results/ck25-20260306-133227.json \
    datasets/db25/results/db25-20260306-132100.json

Then open:

http://127.0.0.1:8050

Development

Build

python setup.py sdist bdist_wheel

Tests

There are no automated tests yet. If you add tests, run them with:

python -m pytest

License

t2s-metrics

t2s-metrics is provided under the terms of the GNU Affero General Public License 3.0 (AGPL-3.0).

Redistribution of third-party software and data

This repository redistributes several third-party contributions under their original licenses.

CK25 Dataset

t2s-metrics reuses the CK25 Corporate Knowledge Reference Dataset for Benchmarking Text-2-SPARQL QA Approaches, which we modified to meet the toolkit's file-format requirements (JSONL).

The modified version is redistributed in the directory datasets/ck25 under the terms of the Creative Commons Attribution 4.0 International license (CC-BY-4.0).

QCan library

t2s-metrics reuses the QCan software for canonicalising SPARQL queries.

QCan is written in Java. In this repository, we distribute the compiled jar of QCan v1.1, third_party_lib/qcan-1.1-jar-with-dependencies.jar, under the terms of the Apache 2.0 license.
