T2S-Metrics

T2S-Metrics provides a modular abstraction layer that decouples metric specification from implementation, enabling consistent, transparent, and reproducible evaluation of SPARQL-based QA systems.

A small evaluation toolkit for text-to-SPARQL systems. It runs a configurable set of metrics over JSONL datasets and can execute queries against local RDF files or SPARQL endpoints.


Features

  • Full evaluation pipeline for text-to-SPARQL systems, from JSONL inputs to exportable JSON results.
  • Rich metric coverage for answer-set quality, text similarity, structural similarity, ranking, distance, and execution validity.
  • Two execution backend families:
    • Local RDF graphs with RDFLib.
    • Remote SPARQL endpoints (QLever, Corese, Fuseki, GraphDB, Virtuoso, Blazegraph, etc.).
  • Simple Python API for research workflows and reproducible experiment scripts.
  • CLI to run evaluations and launch dashboards.
  • Static dashboard export support for sharing reports without running a server.

Prerequisites

  • Python 3.12 or later.
  • uv (recommended for local development) or pip.
  • A SPARQL endpoint only if you use execution metrics with a remote KG (for example QLever/Corese).
  • Ollama only if you enable LLM-based metrics.
  • QCan jar only if you use qcan-related metrics. The repository includes it under third_party_lib.
  • NLTK data only if you use BLEU- and METEOR-related metrics.

Installation

PyPI

For users who only want to run the library and CLI:

pip install t2s-metrics

After installation, the CLI entry point is available as:

t2s --help

[!TIP] If you would like to follow the examples below, obtain the evaluation datasets and knowledge graphs from the datasets folder of the GitHub repository. These files are not included in the PyPI package.

For development (editable install):

  1. Clone the repository:
git clone https://github.com/Wimmics/t2s-metrics.git
  2. Navigate to the project directory:
cd t2s-metrics
  3. Install the dependencies:

Using uv:

uv sync

# With dev dependencies (pytest, ruff, twine)
uv sync --all-extras

Using pip:

pip install -e .

# With dev dependencies (pytest, ruff, twine)
pip install -e ".[dev]"

Adding NLTK data (check Prerequisites)

python -c "import nltk; nltk.download('punkt_tab'); nltk.download('wordnet')"

Usage

This section focuses on practical usage for both PyPI users and repository users.

1. Prepare your evaluation data

Input files must be JSON Lines (.jsonl) with one object per line.

Required keys:

  • id: unique query/case identifier.
  • golden: reference SPARQL query.
  • generated: system-generated SPARQL query.
  • order_matters: whether result ordering must be preserved.

Example (from datasets/ck25/eval/AIFB.jsonl):

{"id": "ck25:1-en", "golden": "PREFIX pv: <http://ld.company.org/prod-vocab/>\nSELECT DISTINCT ?result\nWHERE\n{\n  <http://ld.company.org/prod-instances/empl-Karen.Brant%40company.org> pv:memberOf ?result .\n  ?result a pv:Department .\n}\n", "generated": "SELECT ?department WHERE { ?person :name \"Ms. Brant\"; :worksIn ?department. }", "order_matters": false}

2. Choose your execution backend

You must provide one execution backend when running execution-aware metrics:

  • Local graph file with --execution_backend_graph_path or -eg.
  • SPARQL endpoint URL with --execution_backend_endpoint_url or -ee.

Python examples:

from t2smetrics.execution.rdflib_backend import RDFLibBackend
from t2smetrics.execution.sparql_endpoint_backend import SparqlEndpointBackend

# Local file backend
local_backend = RDFLibBackend("./datasets/example/kg/example.ttl")

# Remote endpoint backend
endpoint_backend = SparqlEndpointBackend("http://localhost:8886/")
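
For reference, the local backend builds on RDFLib; querying a Turtle file directly with RDFLib looks like this (a standalone sketch, independent of the t2smetrics API):

from rdflib import Graph

# Load the same Turtle file the RDFLibBackend would read.
g = Graph()
g.parse("./datasets/example/kg/example.ttl", format="turtle")

# Evaluate a SPARQL SELECT query against the in-memory graph.
for row in g.query("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"):
    print(row.s)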

3. Run from Python (based on run_example.py)

The minimal complete workflow from t2smetrics/run_example.py:

from t2smetrics import run_experiments
from t2smetrics.metrics import (
    AnswerSetPrecision,
    AnswerSetRecall,
    AnswerSetF1,
    Bleu,
    CodeBLEU,
    QueryExecution,
    QueryExactMatch,
)

run_experiments.run(
    dataset="example",
    jsonl_evals=["./datasets/example/eval/example.jsonl"],
    metrics_list=[
        AnswerSetPrecision(),
        AnswerSetRecall(),
        AnswerSetF1(),
        Bleu(),
        CodeBLEU(),
        QueryExecution(),
        QueryExactMatch(),
    ],
    execution_backend_graph_path="./datasets/example/kg/example.ttl",
    verbose=True,
)

4. Run from Python on ck25 (based on run_text2sparql.py)

t2smetrics/run_text2sparql.py demonstrates a multi-system run on the ck25 dataset with an endpoint backend and parallel execution.

uv run ./t2smetrics/run_text2sparql.py
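
If you prefer to adapt this in your own script, the sketch below assumes the Python API mirrors the CLI flags -s, -ee, and -p; the systems_name, execution_backend_endpoint_url, and parallel keyword arguments are assumptions, so check t2smetrics/run_text2sparql.py for the exact signature:

from t2smetrics import run_experiments
from t2smetrics.metrics import AnswerSetF1, QueryExecution

run_experiments.run(
    dataset="ck25",
    jsonl_evals=[
        "./datasets/ck25/eval/system_a.jsonl",
        "./datasets/ck25/eval/system_b.jsonl",
    ],
    # Assumed kwargs mirroring -s, -ee, and -p; verify against the script.
    systems_name=["system-a", "system-b"],
    metrics_list=[AnswerSetF1(), QueryExecution()],
    execution_backend_endpoint_url="http://localhost:8886/",
    parallel=True,
)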

This generates timestamped JSON results under:

datasets/ck25/results/ck25-YYYYMMDD-HHMMSS.json
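
To inspect an exported results file programmatically, a small sketch using only the standard library (the result schema is not documented here, so this only previews the structure):

import glob
import json
import os

# Pick the most recent ck25 results file by modification time.
paths = glob.glob("./datasets/ck25/results/ck25-*.json")
latest = max(paths, key=os.path.getmtime)

with open(latest, encoding="utf-8") as f:
    results = json.load(f)

# The schema may vary between versions; preview the top of the file.
print(latest)
print(json.dumps(results, indent=2)[:500])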

5. Run from CLI (recommended for daily use)

Show command help:

t2s --help
t2s run --help

Here is an example command with specific metrics, a SPARQL endpoint, verbosity, and parallel processing:

t2s run -d ck25 -j ./datasets/ck25/eval/ -m 'hit@1' 'answerset_f1' 'answerset_precision' 'answerset_recall' 'bleu' 'codebleu' 'cosine_sim' 'euclidean' 'f1_qald' 'f1_spinach' 'jaccard' 'levenshtein' 'meteor' 'mrr' 'ndcg' 'p@1' 'precision_qald' 'query_exact_match' 'recall_qald' 'rouge_4' 'sp-bleu' 'sp-f1' 'token_f1' 'token_precision' 'token_recall' 'uri_hallucination' 'query_execution' -ee http://localhost:8886/ -v -p

[!NOTE] By default, the results are automatically exported to the directory ./datasets/{dataset}/results/. You can change this behaviour by using the -ep flag.

Here is an example with all metrics, a local TTL backend, an explicit export path, and detailed output per request instead:

t2s run -d ck25 -j ./datasets/ck25/eval/ \
  -m __all__ \
  -eg ./datasets/ck25/kg/dataset.ttl \
  -ep ./custom_results_folder \
  -eq -v -p

[!IMPORTANT] If you use the LLM-as-a-judge metric, either directly or via the __all__ keyword, you will need Ollama running with either the requested model or the default gemma3:4b model. After installing Ollama, you can use the following commands:

ollama serve          # run the server
ollama pull gemma3:4b # default model used by t2s-metrics

Common useful flags:

  • -s/--systems_name for explicit system names.
  • -p/--parallel for multiprocessing.
  • -eq/--export_per_query to include per-query values in output JSON.
  • -ep/--export_path to control output location.
  • -eg/--execution_backend_graph_path to run on local RDF files instead of endpoint mode.

6. Launch the dashboard

Auto-discover results under datasets/*/results/*.json:

t2s dashboard

[!IMPORTANT] The results are discovered automatically relative to the folder in which the command is executed. If you are not in the root directory of the cloned GitHub project, use the -f flag to avoid a FileNotFoundError.

Load explicit result files:

t2s dashboard -f \
  datasets/ck25/results/ck25-20260306-133227.json \
  datasets/db25/results/db25-20260306-132100.json

Generate a static dashboard snapshot:

t2s dashboard --static --output static_dashboard_snapshot

When running the dashboard server (not the static export), open:

http://127.0.0.1:8050

Development

Build

uv build

Tests

Run the test suite with:

uv run pytest

Release updates

For full details by version, see CHANGELOG.md.

License

t2s-metrics

t2s-metrics is provided under the terms of the GNU Affero General Public License 3.0 (AGPL-3.0).

Redistribution of third-party software and data

This repository redistributes several third-party contributions under their original licenses.

CK25 Dataset

t2s-metrics reuses the CK25 Corporate Knowledge Reference Dataset for Benchmarking Text-2-SPARQL QA Approaches, which we modified to meet the toolkit's file format requirements (JSONL).

The modified version is redistributed in directory datasets/ck25 under the terms of the Creative Commons Attribution 4.0 International license (CC-BY-4.0).

QCan library

t2s-metrics reuses the QCan software for canonicalising SPARQL queries.

QCan is written in Java. In this repository, we distribute the compiled jar of QCan v1.1, third_party_lib/qcan-1.1-jar-with-dependencies.jar, under the terms of the Apache 2.0 license.
