TruthBench
A pipeline-based framework to evaluate factual consistency metrics.
truthbench is a modular pipeline designed to generate controlled factual perturbations of ground-truth answers. These
perturbations enable fine-grained meta-evaluation of factuality metrics used to assess large language model (LLM)
outputs.
While many tools exist to judge whether LLM-generated answers are "factual," their own sensitivity, reliability, and
robustness remain underexplored. truthbench provides a way to systematically test these tools using corrupted versions
of correct answers, ranging from semantically faithful paraphrases to subtly or severely inaccurate alternatives.
Key Features
- 🧠 LLM-based Paraphrasing and Corruption: Produces answer variants (A0–A4) that span a factuality spectrum.
- 🏗️ Step-by-Step Pipeline Architecture: Modular components for paraphrasing, information extraction, perturbation, and grouping.
- 🎯 Controlled Evaluation Levels: Supports reproducible degradation of factual content while preserving fluency and answer structure.
- 🔍 Built for Evaluating Evaluators: Enables validation of popular factuality metrics like RAGAS, FactScore, and LLM-as-judge models.
Use Cases
- Meta-evaluating factuality metrics in open-ended QA settings.
- Building datasets with graded factual errors.
- Benchmarking the sensitivity of evaluation tools to fine-grained truth degradation.
How It Works
The pipeline takes a question and ground-truth answer, and produces 5 graded answers:
| Answer | Description |
|---|---|
| A0 | Faithful paraphrase of the ground truth |
| A1 | Mild factual perturbation |
| A2 | Moderate factual error |
| A3 | High factual degradation |
| A4 | Severely incorrect or misleading response |
Internally, the pipeline follows these stages:
- Paraphrase Ground Truth (A0)
- Extract Key Factual Components
- Filter Overlap with Question
- Rank Factual Importance
- Group by Perturbation Level
- Generate Perturbed Answers (A1–A4)
Each step is implemented as a modular Step class, enabling customization and extension.
Example
Example: Who did the United States win its independence from?
A0 (Reference)
Independence Day, commonly known as the Fourth of July or July Fourth, is a federal holiday in the United States celebrating the adoption of the Declaration of Independence on July 4, 1776. On this day, the Continental Congress announced that the thirteen American colonies considered themselves a new nation, called the United States of America, and were no longer under British rule. Interestingly, the Congress had voted to declare independence *two days* earlier, on July 2.
A1 (Low perturbation)
... celebrating the adoption of the Declaration of Independence ~~on July 4, 1776~~ on August 5, 1776 ...
A2 (Medium perturbation)
... celebrating the Declaration of Independence on August 5, 1781. ~~On this day~~ On that moment, ...
A3 (High perturbation)
... is an unofficial event ... celebrating a proposal of the Declaration of Independence **on August 5, 1781** ...
A4 (Extreme perturbation)
... celebrating a proposal of the drafting of Independence on August 5, 1781 ... called the United States of the Colonies, and were no longer under Spanish rule.
Using the perturbation pipeline
CLI Usage
You can run the TruthBench pipeline directly from the command line.
Installation
Install the package with optional OpenAI dependencies:
pip install truthbench[openai]
Download required spaCy model
TruthBench relies on the spaCy English model. Download it once with:
python -m spacy download en_core_web_sm
Set your OpenAI API key
Export your OpenAI API key as an environment variable:
export OPENAI_API_KEY="your_openai_api_key_here"
Run the pipeline
truthbench --input-file path/to/input.json --output-dir path/to/output_dir
This will create report.json and dataset.json inside output_dir.
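The input file is a JSON list of question/ground-truth pairs, assuming the CLI consumes the same format as the JsonReader described further below. A minimal example with illustrative contents:

```json
[
  {
    "question": "why is the sky blue?",
    "ground_truth": "The sky appears blue because of Rayleigh scattering of sunlight in the atmosphere."
  },
  {
    "question": "who did the United States win its independence from?",
    "ground_truth": "The United States won its independence from Great Britain."
  }
]
```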
Output File Formats
After running the pipeline, two main output files are generated in the output directory:
1. dataset.json
This file contains the input questions along with multiple generated answer variants.
- Structure:
{
  "questions": [
    {
      "id": 0,                                                      // Unique identifier for the question
      "question": "why is the sky blue?",                           // The original question text
      "ground_truth": "The sky appears to be blue because...",      // The correct answer text
      "answers": {                                                  // Answer variants with increasing perturbation levels
        "A0": "The sky looks blue because...",
        "A1": "...",
        "A2": "...",
        "A3": "...",
        "A4": "..."
      }
    },
    // ...
  ]
}
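As an illustration of the meta-evaluation use case, a script can load dataset.json and score every graded answer with the metric under test. A minimal sketch; the token-overlap scorer below is only a stand-in for whatever factuality metric you are actually evaluating:

```python
import json


def overlap_score(reference: str, answer: str) -> float:
    """Toy stand-in for a factuality metric: fraction of reference tokens present in the answer."""
    ref_tokens = set(reference.lower().split())
    ans_tokens = set(answer.lower().split())
    return len(ref_tokens & ans_tokens) / len(ref_tokens) if ref_tokens else 0.0


# Path produced by the CLI run above.
with open("output_dir/dataset.json") as f:
    dataset = json.load(f)

# A well-behaved metric should, on average, score A0 highest and A4 lowest.
for q in dataset["questions"]:
    scores = {level: overlap_score(q["ground_truth"], text) for level, text in q["answers"].items()}
    print(q["id"], scores)
```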
2. report.json
This file contains all the processing details.
{
  "report": {              // Summary metrics about the evaluation (counts of samples, errors, etc.)
    "input_samples": 100,
    "find_factual_data_error": 0,
    "json_parse_ranking_error": 3,
    "index_ranking_error": 52,
    "ranking_factual_data_error": 2,
    "output_samples": 100
  },
  "questions": [           // The complete processing trace for every dataset sample
    {
      "question": "what do the 3 dots mean in math?",
      "ground_truth": "In logical argument...",
      "raw_factual_data": [
        "logical reasoning",
        "..."
      ],
      "with_brackets": {
        "A0": "In [logical reasoning] and [mathematics] ..."
        // ...
      },
      // ...
    },
    // ...
  ]
}
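The counters in report make it easy to spot steps that failed on a non-trivial share of samples. A small sketch, assuming the structure shown above:

```python
import json

with open("output_dir/report.json") as f:
    report = json.load(f)

counters = report["report"]
total = counters["input_samples"]

# Print every error counter together with the share of inputs it affected.
for name, count in counters.items():
    if name.endswith("_error"):
        print(f"{name}: {count} ({count / total:.1%} of inputs)")
```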
Creating a Custom Reader, Step, and Using an Open-Source LLM in the Pipeline
You can customize the pipeline to your needs, combining your own implementations with the available components or overriding any of them.
The Pipeline runs on three abstractions:
- Reader: fetches data;
- Step: provides the processing logic;
- Pipeline: holds a sequence of steps and executes them.
You can declare a pipeline by chaining a sequence of Steps and run it like this:
from truthbench import Pipeline
from truthbench.steps.counter import CounterStep
from truthbench.steps.paraphrase import ParaphraseStep

llm = ...
reader = ...

p = (
    Pipeline()
    .with_step(ParaphraseStep(llm))
    .with_step(CounterStep(expected_levels=5))
)
samples, tracker = p.run(reader)
samples contains the processing trace for each sample, while tracker holds general statistics about the run.
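For example, after the run you can inspect both objects directly (the exact fields present in each trace depend on the steps you added):

```python
# Counters accumulated by the steps, e.g. {"output_samples": ...}.
print(tracker)

# One processing trace; with the full pipeline this holds fields such as
# "answers", "with_brackets" and "raw_factual_data" (see report.json above).
print(samples[0].keys())
```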
Adding a custom step requires subclassing the Step abstract class.
from typing import Dict, Any

from truthbench import Step


class WordCountStep(Step):
    def __init__(self):
        super().__init__(required_fields={"paraphrased_question"}, counters=frozenset({"word_counted"}))

    def step(self, sample: Dict[str, Any], tracker: Dict[str, int]) -> None:
        question = sample["paraphrased_question"]
        sample["word_count"] = len(question.split())
        tracker["word_counted"] += 1
Each step may depend on previous processing. In the example above, it requires that a previous step has computed paraphrased_question; if that field is missing, you likely have a dependency issue or a bug worth investigating. A step can also declare the set of counters it uses to track stats; in the example above, it declares that it may increment word_counted.
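You can also exercise a custom step in isolation before wiring it into a pipeline. A quick check of the example above, calling step() directly and thereby bypassing the pipeline's validation of required fields:

```python
# Run the step on a hand-built sample and tracker, outside any pipeline.
sample = {"paraphrased_question": "why is the sky blue?"}
tracker = {"word_counted": 0}

WordCountStep().step(sample, tracker)

print(sample["word_count"])     # 5
print(tracker["word_counted"])  # 1
```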
The following steps are available:
| Step Name | Description | Updated Counters | Required Fields |
|---|---|---|---|
| ParaphraseStep | Generates a faithful paraphrase of the ground-truth answer using the LLM. | (none) | ground_truth |
| FactualDataStep | Identifies factual spans in a sentence using spaCy and brackets them. | find_factual_data_error | answers |
| BlacklistItemsFromQuestionStep | Removes factual items from raw_factual_data if they appear in the question (minus stopwords). | (none) | question, raw_factual_data |
| RankFactualDataStep | Uses an LLM to assign an importance ranking to factual terms based on a bracketed sentence. | ranked_factual_data, index_ranking_error, ranking_factual_data_error, json_parse_ranking_error | with_brackets, raw_factual_data |
| FilterFactualDataStep | Keeps top-ranked factual items and removes those blacklisted (present in the question). | (none) | ranked_factual_data, blacklisted |
| CreateNoiseExamplesStep | Generates noisy paraphrases with varying levels of factual degradation using factual spans. | (none) | factual_data, with_brackets, answers |
| CounterStep | Verifies that the expected number of answer levels is present and increments a counter. | output_samples | answers |
A pipeline also needs a data source to fetch samples from. You can declare your own data-fetching mechanism by subclassing Reader.
from typing import List, Dict, Any

from truthbench import Reader


class StaticReader(Reader):
    def samples(self) -> List[Dict[str, Any]]:
        return [
            {
                "question": "why is the sky blue?",
                "ground_truth": "The sky appears blue because of Rayleigh scattering..."
            }
        ]
Readers are generally expected to output at least two fields: question and ground_truth.
Currently, we provide a JsonReader that expects a JSON file with the following structure:
[
  {
    "question": "who is playing the halftime show at super bowl 2016?",
    "ground_truth": "The Super Bowl 50 Halftime Show took place on..."
  },
  // ...
]
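If the bundled JsonReader does not fit your layout, a file-backed reader for that exact format is only a few lines. The class below is an illustrative sketch built on the Reader abstraction, not the packaged implementation:

```python
import json
from typing import List, Dict, Any

from truthbench import Reader


class FileReader(Reader):
    """Illustrative reader for a JSON file in the format shown above."""

    def __init__(self, path: str):
        self.path = path

    def samples(self) -> List[Dict[str, Any]]:
        with open(self.path) as f:
            return json.load(f)
```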
Lastly, some steps need access to a running large language model (LLM). We provide support for OpenAI models via [GPT](truthbench/src/truthbench/llms/openai.py) (requires installing with pip install truthbench[openai]), but you can implement your own LLM access by subclassing LLM:
from typing import List, Dict

from truthbench import LLM


class OpenSourceLLM(LLM):
    def __init__(self, model):
        self.model = model  # e.g., from HuggingFace or llama-cpp

    def query(self, messages: List[Dict[str, str]]) -> str:
        prompt = ...  # Convert messages if needed
        response = self.model.generate(prompt)  # Use the appropriate method
        return response
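As a concrete sketch of such a subclass, a query() backed by a local Hugging Face chat model could look roughly like this. The model id, generation settings, and the reliance on transformers (plus accelerate for device_map) are assumptions; adjust them to your setup:

```python
from typing import List, Dict

from transformers import AutoModelForCausalLM, AutoTokenizer

from truthbench import LLM


class HuggingFaceLLM(LLM):
    """Sketch of an LLM backend using a local Hugging Face chat model."""

    def __init__(self, model_name: str = "Qwen/Qwen2.5-7B-Instruct"):  # illustrative model id
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype="auto", device_map="auto"
        )

    def query(self, messages: List[Dict[str, str]]) -> str:
        # Render the chat-style messages with the model's own chat template.
        prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output = self.model.generate(**inputs, max_new_tokens=512, do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        generated = output[0][inputs["input_ids"].shape[-1]:]
        return self.tokenizer.decode(generated, skip_special_tokens=True)
```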
Pipeline validation
To ensure the quality of the factual perturbations, we conducted a human evaluation comparing outputs from the truthbench pipeline with those created by experts.
Two evaluators were shown factual Q&A pairs with five answer variants (A0–A4) and asked to blindly choose which version (AI- or expert-generated) better fit the intended level of factuality — or indicate a tie.
Key results:
- 🟰 82.5% of evaluations resulted in ties, indicating that AI and human answers were often perceptually indistinguishable.
- ✅ The AI pipeline was statistically non-inferior to human performance.
- ❗ Only 2.5% of examples showed conflicting preferences between evaluators.
Known limitations
Our perturbation pipeline systematically applies linguistic and semantic modifications using dependency parsers and predefined operators. However, the effectiveness of these perturbations can vary depending on the properties of the target text:
- 🧩 Variation in sensitivity: Verbose or highly detailed answers (e.g., those generated by large language models) may require more targeted or intensive perturbations to induce meaningful semantic changes. In contrast, shorter, more concise answers tend to be more sensitive to even minor modifications. Consequently, the uniformity of perturbation strength across different questions and answers is not guaranteed.
- 🛡️ Core content preservation: Some perturbations might alter surface-level phrasing without affecting the core factual content. For example, for the question “Who breaks a tie in the US Senate?,” truthbench will fail to modify “the Vice President” in “The Vice President serves as the ex officio President of the Senate but is only permitted to vote to resolve a tie.” Although we currently lack quantitative evidence on how widespread these cases are, this limitation is especially relevant for verbose answers where the main fact constitutes only a small fraction of the text. Our evaluators were not specifically instructed on handling these borderline cases, indicating a need for further analysis and possibly alternative perturbation strategies.
- ⚠️ Semantic inconsistencies: Certain perturbations may introduce contradictions or inconsistencies. For instance, for the question “Who wrote the text for Jeanie with the Light Brown Hair?,” truthbench can produce “Jeanie with the Light Brown Hair is a folk song created by Henry Bishop [...]. Foster composed the song thinking of [...].” Such examples fail our semantic guidelines and should be marked as rejected, either by accepting a valid human alternative or by rejecting both answers.
- 🌐 Language dependency: Although the approach is designed to be language-agnostic in principle, it relies heavily on the availability and quality of dependency parsers and language models for the target language. Languages with complex morphology or syntax, or those that are low-resource, may experience reduced perturbation accuracy and coverage.