TruthBench
A pipeline-based framework to evaluate factual consistency metrics.
truthbench is a modular pipeline designed to generate controlled factual perturbations of ground-truth answers. These
perturbations enable fine-grained meta-evaluation of factuality metrics used to assess large language model (LLM)
outputs.
While many tools exist to judge whether LLM-generated answers are "factual," their own sensitivity, reliability, and
robustness remain underexplored. truthbench provides a way to systematically test these tools using corrupted versions
of correct answers, ranging from semantically faithful paraphrases to subtly or severely inaccurate alternatives.
Key Features
- 🧠 LLM-based Paraphrasing and Corruption: Produces answer variants (A0–A4) that span a factuality spectrum.
- 🏗️ Step-by-Step Pipeline Architecture: Modular components for paraphrasing, information extraction, perturbation, and grouping.
- 🎯 Controlled Evaluation Levels: Supports reproducible degradation of factual content while preserving fluency and answer structure.
- 🔍 Built for Evaluating Evaluators: Enables validation of popular factuality metrics like RAGAS, FactScore, and LLM-as-judge models.
Use Cases
- Meta-evaluating factuality metrics in open-ended QA settings.
- Building datasets with graded factual errors.
- Benchmarking the sensitivity of evaluation tools to fine-grained truth degradation.
How It Works
The pipeline takes a question and ground-truth answer, and produces 5 graded answers:
| Answer | Description |
|---|---|
| A0 | Faithful paraphrase of the ground truth |
| A1 | Mild factual perturbation |
| A2 | Moderate factual error |
| A3 | High factual degradation |
| A4 | Severely incorrect or misleading response |
Internally, the pipeline follows these stages:
- Paraphrase Ground Truth (A0)
- Extract Key Factual Components
- Filter Overlap with Question
- Rank Factual Importance
- Group by Perturbation Level
- Generate Perturbed Answers (A1–A4)
Each step is implemented as a modular Step class, enabling customization and extension.
Example
Example: Who did the United States win its independence from?
A0 (Reference)
Independence Day, commonly known as the Fourth of July or July Fourth, is a federal holiday in the United States celebrating the adoption of the Declaration of Independence on July 4, 1776. On this day, the Continental Congress announced that the thirteen American colonies considered themselves a new nation, called the United States of America, and were no longer under British rule. Interestingly, the Congress had voted to declare independence *two days* earlier, on July 2.
A1 (Low perturbation)
... celebrating the adoption of the Declaration of Independence ~~on July 4, 1776~~ on August 5, 1776 ...
A2 (Medium perturbation)
... celebrating the Declaration of Independence on August 5, 1781. ~~On this day~~ On that moment, ...
A3 (High perturbation)
... is an unofficial event ... celebrating a proposal of the Declaration of Independence **on August 5, 1781** ...
A4 (Extreme perturbation)
... celebrating a proposal of the drafting of Independence on August 5, 1781 ... called the United States of the Colonies, and were no longer under Spanish rule.
Using the perturbation pipeline
CLI Usage
You can run the TruthBench pipeline directly from the command line.
Installation
Install the package with optional OpenAI dependencies:
pip install truthbench[openai]
Download required spaCy model
TruthBench relies on the spaCy English model. Download it once with:
python -m spacy download en_core_web_sm
Set your OpenAI API key
Export your OpenAI API key as an environment variable:
export OPENAI_API_KEY="your_openai_api_key_here"
Run the pipeline
truthbench --input-file path/to/input.json --output-dir path/to/output_dir
This will create report.json and dataset.json inside output_dir.
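The input file is a JSON list of question/ground-truth pairs, assuming the CLI consumes the same format as the JsonReader described further below. A minimal example with illustrative contents:

```json
[
  {
    "question": "why is the sky blue?",
    "ground_truth": "The sky appears blue because of Rayleigh scattering of sunlight in the atmosphere."
  },
  {
    "question": "who did the United States win its independence from?",
    "ground_truth": "The United States won its independence from Great Britain."
  }
]
```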
Output File Formats
After running the pipeline, two main output files are generated in the output directory:
1. dataset.json
This file contains the input questions along with multiple generated answer variants.
- Structure:
{
  "questions": [
    {
      "id": 0,                                                      // Unique identifier for the question
      "question": "why is the sky blue?",                           // The original question text
      "ground_truth": "The sky appears to be blue because...",      // The correct answer text
      "answers": {                                                  // Answer variants with increasing perturbation levels
        "A0": "The sky looks blue because...",
        "A1": "...",
        "A2": "...",
        "A3": "...",
        "A4": "..."
      }
    },
    // ...
  ]
}
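As an illustration of the meta-evaluation use case, a script can load dataset.json and score every graded answer with the metric under test. A minimal sketch; the token-overlap scorer below is only a stand-in for whatever factuality metric you are actually evaluating:

```python
import json


def overlap_score(reference: str, answer: str) -> float:
    """Toy stand-in for a factuality metric: fraction of reference tokens present in the answer."""
    ref_tokens = set(reference.lower().split())
    ans_tokens = set(answer.lower().split())
    return len(ref_tokens & ans_tokens) / len(ref_tokens) if ref_tokens else 0.0


# Path produced by the CLI run above.
with open("output_dir/dataset.json") as f:
    dataset = json.load(f)

# A well-behaved metric should, on average, score A0 highest and A4 lowest.
for q in dataset["questions"]:
    scores = {level: overlap_score(q["ground_truth"], text) for level, text in q["answers"].items()}
    print(q["id"], scores)
```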
2. report.json
This file contains all the processing details.
{
  "report": {              // Summary metrics about the evaluation (counts of samples, errors, etc.)
    "input_samples": 100,
    "find_factual_data_error": 0,
    "json_parse_ranking_error": 3,
    "index_ranking_error": 52,
    "ranking_factual_data_error": 2,
    "output_samples": 100
  },
  "questions": [           // The complete processing trace for every dataset sample
    {
      "question": "what do the 3 dots mean in math?",
      "ground_truth": "In logical argument...",
      "raw_factual_data": [
        "logical reasoning",
        "..."
      ],
      "with_brackets": {
        "A0": "In [logical reasoning] and [mathematics] ..."
        // ...
      },
      // ...
    },
    // ...
  ]
}
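The counters in report make it easy to spot steps that failed on a non-trivial share of samples. A small sketch, assuming the structure shown above:

```python
import json

with open("output_dir/report.json") as f:
    report = json.load(f)

counters = report["report"]
total = counters["input_samples"]

# Print every error counter together with the share of inputs it affected.
for name, count in counters.items():
    if name.endswith("_error"):
        print(f"{name}: {count} ({count / total:.1%} of inputs)")
```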
Creating a Custom Reader, Step, and Using an Open-Source LLM in the Pipeline
You can customize the pipeline to your needs, combining your own implementations with the available components or overriding any of them.
The Pipeline runs on three abstractions:
- Reader: fetches data;
- Step: provides the processing logic;
- Pipeline: holds a sequence of steps and executes them.
You can declare a pipeline by chaining a sequence of Steps and run it like this:
from truthbench import Pipeline
from truthbench.steps.counter import CounterStep
from truthbench.steps.paraphrase import ParaphraseStep

llm = ...
reader = ...

p = (
    Pipeline()
    .with_step(ParaphraseStep(llm))
    .with_step(CounterStep(expected_levels=5))
)
samples, tracker = p.run(reader)
samples contains the processing trace for each sample, while tracker holds general statistics about the run.
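For example, after the run you can inspect both objects directly (the exact fields present in each trace depend on the steps you added):

```python
# Counters accumulated by the steps, e.g. {"output_samples": ...}.
print(tracker)

# One processing trace; with the full pipeline this holds fields such as
# "answers", "with_brackets" and "raw_factual_data" (see report.json above).
print(samples[0].keys())
```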
Adding a custom step requires subclassing the Step abstract class.
from typing import Dict, Any

from truthbench import Step


class WordCountStep(Step):
    def __init__(self):
        super().__init__(required_fields={"paraphrased_question"}, counters=frozenset({"word_counted"}))

    def step(self, sample: Dict[str, Any], tracker: Dict[str, int]) -> None:
        question = sample["paraphrased_question"]
        sample["word_count"] = len(question.split())
        tracker["word_counted"] += 1
Each step may depend on previous processing. In the example above, it requires that a previous step has computed paraphrased_question; if that field is missing, you likely have a dependency issue or a bug worth investigating. A step can also declare the set of counters it uses to track stats; in the example above, it declares that it may increment word_counted.
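You can also exercise a custom step in isolation before wiring it into a pipeline. A quick check of the example above, calling step() directly and thereby bypassing the pipeline's validation of required fields:

```python
# Run the step on a hand-built sample and tracker, outside any pipeline.
sample = {"paraphrased_question": "why is the sky blue?"}
tracker = {"word_counted": 0}

WordCountStep().step(sample, tracker)

print(sample["word_count"])     # 5
print(tracker["word_counted"])  # 1
```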
The following steps are available:
| Step Name | Description | Updated Counters | Required Fields |
|---|---|---|---|
| ParaphraseStep | Generates a faithful paraphrase of the ground-truth answer using the LLM. | (none) | ground_truth |
| FactualDataStep | Identifies factual spans in a sentence using spaCy and brackets them. | find_factual_data_error | answers |
| BlacklistItemsFromQuestionStep | Removes factual items from raw_factual_data if they appear in the question (minus stopwords). | (none) | question, raw_factual_data |
| RankFactualDataStep | Uses an LLM to assign an importance ranking to factual terms based on a bracketed sentence. | ranked_factual_data, index_ranking_error, ranking_factual_data_error, json_parse_ranking_error | with_brackets, raw_factual_data |
| FilterFactualDataStep | Keeps top-ranked factual items and removes those blacklisted (present in the question). | (none) | ranked_factual_data, blacklisted |
| CreateNoiseExamplesStep | Generates noisy paraphrases with varying levels of factual degradation using factual spans. | (none) | factual_data, with_brackets, answers |
| CounterStep | Verifies that the expected number of answer levels is present and increments a counter. | output_samples | answers |
A pipeline also needs a data source to fetch samples from. You can declare your own data-fetching mechanism by subclassing Reader.
from typing import List, Dict, Any

from truthbench import Reader


class StaticReader(Reader):
    def samples(self) -> List[Dict[str, Any]]:
        return [
            {
                "question": "why is the sky blue?",
                "ground_truth": "The sky appears blue because of Rayleigh scattering..."
            }
        ]
Readers are generally expected to output at least two fields: question and ground_truth.
Currently, we provide a JsonReader that expects a JSON file with the following structure:
[
  {
    "question": "who is playing the halftime show at super bowl 2016?",
    "ground_truth": "The Super Bowl 50 Halftime Show took place on..."
  },
  // ...
]
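If the bundled JsonReader does not fit your layout, a file-backed reader for that exact format is only a few lines. The class below is an illustrative sketch built on the Reader abstraction, not the packaged implementation:

```python
import json
from typing import List, Dict, Any

from truthbench import Reader


class FileReader(Reader):
    """Illustrative reader for a JSON file in the format shown above."""

    def __init__(self, path: str):
        self.path = path

    def samples(self) -> List[Dict[str, Any]]:
        with open(self.path) as f:
            return json.load(f)
```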
Lastly, some steps need access to a running large language model (LLM). We provide support for OpenAI models via [GPT](truthbench/src/truthbench/llms/openai.py) (requires installing with pip install truthbench[openai]), but you can implement your own LLM access by subclassing LLM:
from typing import List, Dict

from truthbench import LLM


class OpenSourceLLM(LLM):
    def __init__(self, model):
        self.model = model  # e.g., from HuggingFace or llama-cpp

    def query(self, messages: List[Dict[str, str]]) -> str:
        prompt = ...  # Convert messages if needed
        response = self.model.generate(prompt)  # Use the appropriate method
        return response
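As a concrete sketch of such a subclass, a query() backed by a local Hugging Face chat model could look roughly like this. The model id, generation settings, and the reliance on transformers (plus accelerate for device_map) are assumptions; adjust them to your setup:

```python
from typing import List, Dict

from transformers import AutoModelForCausalLM, AutoTokenizer

from truthbench import LLM


class HuggingFaceLLM(LLM):
    """Sketch of an LLM backend using a local Hugging Face chat model."""

    def __init__(self, model_name: str = "Qwen/Qwen2.5-7B-Instruct"):  # illustrative model id
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype="auto", device_map="auto"
        )

    def query(self, messages: List[Dict[str, str]]) -> str:
        # Render the chat-style messages with the model's own chat template.
        prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output = self.model.generate(**inputs, max_new_tokens=512, do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        generated = output[0][inputs["input_ids"].shape[-1]:]
        return self.tokenizer.decode(generated, skip_special_tokens=True)
```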
Pipeline validation
To ensure the quality of the factual perturbations, we conducted a human evaluation comparing outputs from the truthbench pipeline with those created by experts.
Two evaluators were shown factual Q&A pairs with five answer variants (A0–A4) and asked to blindly choose which version (AI- or expert-generated) better fit the intended level of factuality — or indicate a tie.
Key results:
- 🟰 82.5% of evaluations resulted in ties, indicating that AI and human answers were often perceptually indistinguishable.
- ✅ The AI pipeline was statistically non-inferior to human performance.
- ❗ Only 2.5% of examples showed conflicting preferences between evaluators.
Known limitations
Our perturbation pipeline systematically applies linguistic and semantic modifications using dependency parsers and predefined operators. However, the effectiveness of these perturbations can vary depending on the properties of the target text:
- 🧩 Variation in sensitivity: Verbose or highly detailed answers (e.g., those generated by large language models) may require more targeted or intensive perturbations to induce meaningful semantic changes. In contrast, shorter, more concise answers tend to be more sensitive to even minor modifications. Consequently, the uniformity of perturbation strength across different questions and answers is not guaranteed.
- 🛡️ Core content preservation: Some perturbations might alter surface-level phrasing without affecting the core factual content. For example, for the question “Who breaks a tie in the US Senate?,” truthbench will fail to modify “the Vice President” in “The Vice President serves as the ex officio President of the Senate but is only permitted to vote to resolve a tie.” Although we currently lack quantitative evidence on how widespread these cases are, this limitation is especially relevant for verbose answers where the main fact constitutes only a small fraction of the text. Our evaluators were not specifically instructed on handling these borderline cases, indicating a need for further analysis and possibly alternative perturbation strategies.
- ⚠️ Semantic inconsistencies: Certain perturbations may introduce contradictions or inconsistencies. For instance, for the question “Who wrote the text for Jeanie with the Light Brown Hair?,” truthbench can produce “Jeanie with the Light Brown Hair is a folk song created by Henry Bishop [...]. Foster composed the song thinking of [...].” Such examples fail our semantic guidelines and should be marked as rejected, either by accepting a valid human alternative or by rejecting both answers.
- 🌐 Language dependency: Although the approach is designed to be language-agnostic in principle, it relies heavily on the availability and quality of dependency parsers and language models for the target language. Languages with complex morphology or syntax, or those that are low-resource, may experience reduced perturbation accuracy and coverage.