Factuality Assessment for Foundation Models
FactReasoner
A probabilistic factuality assessment framework for Large Language Models (LLMs). FactReasoner provides fine-grained factuality evaluation by decomposing LLM responses into atomic claims and verifying them against external knowledge sources using probabilistic reasoning.
Overview
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text, yet they frequently produce factually incorrect information—a phenomenon known as "hallucination." Assessing the factuality of long-form text generations is particularly challenging because these responses may contain numerous informative statements, and validating each piece of information against reliable sources is time-consuming, costly, and error-prone.
FactReasoner is a novel factuality assessment framework that leverages probabilistic reasoning to evaluate the factual correctness of LLM-generated responses. Unlike traditional binary classification approaches, FactReasoner provides calibrated probability estimates for each claim, enabling nuanced uncertainty quantification.
The Problem
Given a long-form text y generated by an LLM in response to a query x, we assume that y consists of n atomic units (or atoms) that can be either true or false, denoted as A_y = {a_1, a_2, ..., a_n}. An atomic unit is defined as a short sentence conveying one piece of information.
Given an external knowledge source C (e.g., Wikipedia, the Web, or a document collection), an atomic unit a_i is considered supported if there exists at least one piece of information (a context) in C that undeniably supports a_i. Otherwise, the atomic unit is not supported.
Factuality Metrics
FactReasoner computes several factuality metrics:
Factual Precision - The proportion of supported atoms in the response:
Precision(y) = S(y) / |A_y|
where S(y) is the number of supported atomic units.
Factual Recall at K - Measures recall up to K supported atoms:
R_K(y) = min(S(y) / K, 1)
F1@K Score - Combines precision and recall:
F1@K(y) = 2 * Precision(y) * R_K(y) / (Precision(y) + R_K(y))
Entropy-based Measure - A novel metric leveraging posterior probabilities:
E(y) = (1/n) * Σ -P(a_i) * log(P(a_i))
When all atoms have posterior probability P(a_i) = 0.5 (undecided), E(y) ≈ 0.15 (using base-10 logarithms, since -0.5 · log₁₀ 0.5 ≈ 0.15). When all atoms are true with certainty (P(a_i) = 1), E(y) = 0. Lower entropy indicates higher factual confidence.
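The four metrics above can be sketched in a few lines of Python. The base-10 logarithm in the entropy is an assumption inferred from the E(y) ≈ 0.15 figure quoted for P(a_i) = 0.5; the function names here are illustrative, not the package's API:

```python
import math

def factual_precision(supported: int, total: int) -> float:
    """Precision(y) = S(y) / |A_y|."""
    return supported / total

def recall_at_k(supported: int, k: int) -> float:
    """R_K(y) = min(S(y) / K, 1)."""
    return min(supported / k, 1.0)

def f1_at_k(supported: int, total: int, k: int) -> float:
    """F1@K(y): harmonic mean of precision and recall at K."""
    p = factual_precision(supported, total)
    r = recall_at_k(supported, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def avg_entropy(posteriors: list[float]) -> float:
    """E(y) = (1/n) * sum(-P(a_i) * log10(P(a_i))), skipping P = 0 terms."""
    return sum(-p * math.log10(p) for p in posteriors if p > 0) / len(posteriors)

# Example: 9 of 12 atoms supported, K = 10
print(round(factual_precision(9, 12), 3))  # 0.75
print(round(recall_at_k(9, 10), 3))        # 0.9
print(round(f1_at_k(9, 12, 10), 3))        # 0.818
print(round(avg_entropy([0.5, 0.5]), 3))   # 0.151
```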
How FactReasoner Works
FactReasoner addresses hallucination detection through a principled five-stage pipeline:
1. Atomic Decomposition (Atomizer): The LLM response is decomposed into minimal, verifiable claims (atoms) using few-shot prompting. Each atom represents a single piece of information that can be independently verified.
2. Decontextualization (Reviser): Atoms are revised to be standalone by resolving pronouns (e.g., "he", "she", "it"), demonstrative references (e.g., "this", "that"), unknown entities, and incomplete names. This ensures each atom can be verified without additional context.
3. Context Retrieval (Retriever): For each atom, relevant evidence is gathered from external knowledge sources. FactReasoner supports multiple retrieval backends:
   - Wikipedia: Using LangChain's WikipediaRetriever
   - Google Search: Via the Serper API with optional full-page content extraction
   - ChromaDB: Custom vector stores with semantic search
4. NLI-based Verification (NLI Extractor): Natural Language Inference (NLI) is used to determine the relationship between each atom and its retrieved contexts:
   - Entailment: The context supports/implies the atom
   - Contradiction: The context contradicts the atom
   - Neutral: The context neither supports nor contradicts the atom
5. Probabilistic Reasoning (Evaluator): A Markov Network (undirected graphical model) is constructed where:
   - Nodes represent atoms and contexts as binary random variables
   - Edges encode NLI relationships with associated probabilities
   - Factors define the joint probability distribution based on entailment/contradiction strengths
   The Merlin inference engine performs belief propagation to compute posterior marginal probabilities P(a_i) for each atom, determining whether it is supported by the retrieved evidence.
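To make the final stage concrete, here is a toy sketch of how entailing and contradicting contexts combine into a posterior for one atom. The factor values and priors below are invented for illustration; the actual parameterization is defined by FactReasoner and evaluated by Merlin's belief propagation rather than the brute-force enumeration used here:

```python
from itertools import product

# Toy Markov network: one atom a, context c1 entails a (strength 0.9),
# context c2 contradicts a (strength 0.7). All numbers are illustrative.

def entail_factor(a: int, c: int, p: float) -> float:
    """Reward (a=1, c=1) for an entailment edge of strength p."""
    return (p if a == 1 else 1 - p) if c == 1 else 0.5

def contradict_factor(a: int, c: int, p: float) -> float:
    """Reward (a=0, c=1) for a contradiction edge of strength p."""
    return (p if a == 0 else 1 - p) if c == 1 else 0.5

def prior(c: int) -> float:
    """Contexts are assumed likely reliable."""
    return 0.8 if c == 1 else 0.2

def posterior_atom() -> float:
    """Brute-force the marginal P(a = 1) over all binary assignments."""
    weights = {0: 0.0, 1: 0.0}
    for a, c1, c2 in product([0, 1], repeat=3):
        w = (prior(c1) * prior(c2)
             * entail_factor(a, c1, 0.9)
             * contradict_factor(a, c2, 0.7))
        weights[a] += w
    return weights[1] / (weights[0] + weights[1])

print(round(posterior_atom(), 3))  # 0.701
```

Note the intermediate probability: the strong entailing context pulls P(a) toward 1, the contradicting context pulls it back, and the model settles between them instead of forcing a binary verdict.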
Pipeline Versions
FactReasoner supports three configurations based on how atom-context relationships are modeled:
| Version | Relationships | Description |
|---|---|---|
| FR1 | Atom ↔ Own Contexts | Each atom is connected only to its k retrieved contexts. Simplest model with localized reasoning. |
| FR2 | Atom ↔ All Contexts | Duplicate contexts are removed, and each atom is connected to all m unique contexts. Enables cross-atom evidence sharing. |
| FR3 | FR2 + Context ↔ Context | Adds context-to-context relationships, allowing the model to reason about consistency between different evidence sources. |
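The difference between the FR1 and FR2 edge sets can be sketched as follows, assuming a simple mapping from atom ids to their retrieved context ids (this data layout is illustrative, not the package's internal representation):

```python
def fr1_edges(atom_contexts: dict[str, list[str]]) -> set[tuple[str, str]]:
    """FR1: each atom connects only to its own retrieved contexts."""
    return {(a, c) for a, ctxs in atom_contexts.items() for c in ctxs}

def fr2_edges(atom_contexts: dict[str, list[str]]) -> set[tuple[str, str]]:
    """FR2: each atom connects to every unique context (duplicates removed)."""
    unique = {c for ctxs in atom_contexts.values() for c in ctxs}
    return {(a, c) for a in atom_contexts for c in unique}

# Two atoms sharing context c1
atoms = {"a0": ["c0", "c1"], "a1": ["c1", "c2"]}
print(len(fr1_edges(atoms)), len(fr2_edges(atoms)))  # 4 6
```

In FR2, atom a0 also sees c2 (retrieved for a1), which is what enables cross-atom evidence sharing; FR3 would additionally add edges among c0, c1, and c2.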
Why Probabilistic Reasoning?
Traditional factuality methods (like FactScore) make independent binary decisions for each atom. FactReasoner's probabilistic approach offers several advantages:
- Uncertainty Quantification: Instead of binary verdicts, you get calibrated probabilities reflecting confidence levels
- Evidence Aggregation: Multiple pieces of evidence (entailing and contradicting) are combined coherently
- Global Consistency: The Markov Network jointly reasons over all atoms and contexts, ensuring globally consistent assessments
- Handling Conflicting Evidence: When contexts disagree, the model produces intermediate probabilities rather than arbitrary decisions
Key Features
- Probabilistic Factuality Scoring: Returns calibrated probability estimates rather than binary verdicts
- Multiple Knowledge Sources: Support for Wikipedia, Google Search API, and ChromaDB vector stores
- Baseline Implementations: Includes FactScore and VeriScore methods for comparison
- Modular Architecture: Each component (atomizer, retriever, NLI, summarizer) can be configured independently
- Async Support: Batch processing with asynchronous LLM calls for efficiency
- Caching: SQLite-based caching for search API results
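The SQLite-based caching of search API results can be pictured with a minimal sketch. The class name and table schema below are illustrative assumptions, not the package's actual layout:

```python
import json
import sqlite3

class SearchCache:
    """Minimal query -> results cache backed by SQLite (illustrative schema)."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (query TEXT PRIMARY KEY, results TEXT)"
        )

    def get(self, query: str):
        """Return cached results for a query, or None on a cache miss."""
        row = self.conn.execute(
            "SELECT results FROM cache WHERE query = ?", (query,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, query: str, results: list) -> None:
        """Store (or overwrite) the results for a query."""
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)",
            (query, json.dumps(results)),
        )
        self.conn.commit()

cache = SearchCache()
cache.put("albert einstein", [{"title": "Albert Einstein - Wikipedia"}])
print(cache.get("albert einstein"))  # [{'title': 'Albert Einstein - Wikipedia'}]
print(cache.get("missing query"))    # None
```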
Installation
From PyPI
uv pip install fact_reasoner
From Source
git clone https://github.com/IBM/FactReasoner
cd FactReasoner
uv sync
. .venv/bin/activate # To activate the virtual environment
Internal IBM Usage
For internal access to IBM RITS backends, install mellea-ibm as follows:
pip install "git+ssh://git@github.ibm.com/generative-computing/mellea-ibm.git"
Dependencies
FactReasoner requires:
- Python >= 3.11
- Merlin: C++ probabilistic inference engine (must be compiled locally)
- Mellea: LLM interaction library
Environment Variables
Set the following environment variables:
# Google Search retrieval via Serper API:
export SERPER_API_KEY=your_serper_api_key
# Internal IBM inference service
export RITS_API_KEY=your_RITS_api_key
Quick Start
Basic Usage
from mellea.backends import ModelOption
from mellea_ibm.rits import RITSBackend, RITS
from fact_reasoner import FactReasoner
from fact_reasoner.core.atomizer import Atomizer
from fact_reasoner.core.reviser import Reviser
from fact_reasoner.core.retriever import ContextRetriever, Retriever
from fact_reasoner.core.summarizer import ContextSummarizer
from fact_reasoner.core.nli import NLIExtractor
from fact_reasoner.core.query_builder import QueryBuilder
# Initialize the LLM backend
backend = RITSBackend(
RITS.LLAMA_3_3_70B_INSTRUCT,
model_options={ModelOption.MAX_NEW_TOKENS: 4096}
)
# Create pipeline components
query_builder = QueryBuilder(backend)
atom_extractor = Atomizer(backend)
atom_reviser = Reviser(backend)
retriever = Retriever(
service_type="google", # or "wikipedia", "chromadb"
top_k=5,
fetch_text=True,
query_builder=query_builder,
num_workers=4
)
context_summarizer = ContextSummarizer(backend)
context_retriever = ContextRetriever(
retriever=retriever,
num_workers=4,
)
nli_extractor = NLIExtractor(backend)
# Create the FactReasoner pipeline
pipeline = FactReasoner(
atom_extractor=atom_extractor,
atom_reviser=atom_reviser,
context_retriever=context_retriever,
context_summarizer=context_summarizer,
nli_extractor=nli_extractor,
merlin_path="/path/to/merlin"
)
# Build and score
pipeline.build(
query="Tell me about Albert Einstein",
response="Albert Einstein was born in 1879 in Ulm, Germany...",
topic="Albert Einstein",
revise_atoms=True,
summarize_contexts=False
)
results, marginals = pipeline.score()
print(f"Factuality Score: {results['factuality_score']:.2%}")
Loading from Pre-processed Data
import json
# Load pre-processed atoms and contexts
with open("data/example.json", "r") as f:
data = json.load(f)
pipeline.from_dict_with_contexts(data)
pipeline.build(
has_atoms=True,
has_contexts=True,
revise_atoms=False,
rel_atom_context=True,
rel_context_context=False
)
results, marginals = pipeline.score()
Architecture
Core Components
| Component | Class | Description |
|---|---|---|
| Atomizer | Atomizer | Decomposes text into atomic claims using few-shot prompting |
| Reviser | Reviser | Decontextualizes atoms by resolving pronouns and vague references |
| Retriever | ContextRetriever | Retrieves relevant evidence from Wikipedia, Google, or vector stores |
| Summarizer | ContextSummarizer | Summarizes retrieved contexts with respect to specific atoms |
| NLI Extractor | NLIExtractor | Predicts entailment/contradiction/neutral relationships |
| Fact Graph | FactGraph | Graph representation of atoms, contexts, and their relationships |
| Search API | SearchAPI | Google Search via Serper API with SQLite caching |
Pipeline Flow
LLM Response
│
▼
┌─────────────┐
│ Atomizer │ ──► Atomic Claims (atoms)
└─────────────┘
│
▼
┌─────────────┐
│ Reviser │ ──► Decontextualized atoms
└─────────────┘
│
▼
┌─────────────┐
│ Retriever │ ──► External contexts per atom
└─────────────┘
│
▼
┌─────────────┐
│ Summarizer │ ──► Relevant summaries (optional)
└─────────────┘
│
▼
┌─────────────┐
│ NLI │ ──► Entailment/Contradiction relationships
└─────────────┘
│
▼
┌─────────────┐
│ FactGraph + │
│ Markov │ ──► Probabilistic inference (Merlin)
│ Network │
└─────────────┘
│
▼
Factuality Score + Per-atom Marginals
Context Retrieval Options
Google Search (via Serper API)
retriever = ContextRetriever(
service_type="google",
top_k=5,
cache_dir="/path/to/cache.db", # SQLite cache for API results
fetch_text=True, # Fetch full page content from links
query_builder=query_builder # Optional query reformulation
)
Wikipedia
retriever = ContextRetriever(
service_type="wikipedia",
top_k=3
)
ChromaDB Vector Store
retriever = ContextRetriever(
service_type="chromadb",
collection_name="my_documents",
persist_dir="/path/to/chroma_db",
top_k=5
)
Baseline Methods
FactReasoner includes implementations of existing factuality assessment methods for comparison:
FactScore
from fact_reasoner.baselines.factscore import FactScore
scorer = FactScore(
backend=backend,
atom_extractor=atom_extractor,
atom_reviser=atom_reviser,
context_retriever=context_retriever
)
scorer.build(query=query, response=response, topic=topic)
results = scorer.score()
VeriScore
from fact_reasoner.baselines.veriscore import VeriScore
scorer = VeriScore(
backend=backend,
atom_extractor=atom_extractor,
atom_reviser=atom_reviser,
context_retriever=context_retriever
)
scorer.build(query=query, response=response, topic=topic)
results = scorer.score()
Output Format
The score() method returns a results dictionary:
{
"factuality_score": 0.75, # Overall precision score (0-1)
"factuality_score_per_atom": [...], # Per-atom scores and support labels
"num_atoms": 12, # Total atomic units
"num_contexts": 48, # Total retrieved contexts
"num_true_atoms": 9, # Atoms with P(a) > 0.5
"num_false_atoms": 3, # Atoms with P(a) < 0.5
"num_uniform_atoms": 0, # Atoms with P(a) = 0.5
"predictions": { # Per-atom predictions
"a0": "S", # S = Supported
"a1": "NS", # NS = Not Supported
...
},
"marginals": [ # Posterior probabilities
{"variable": "a0", "probabilities": [0.1, 0.9]},
...
],
"entropy": 2.34, # Total entropy
"avg_entropy": 0.195, # Average entropy per atom
"elapsed_time": 45.2 # Processing time in seconds
}
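As a sketch of consuming this dictionary, the marginals can be turned back into per-atom labels and a precision score. The convention that P(a = true) is the second entry of "probabilities" is an assumption inferred from the example marginal [0.1, 0.9] for a supported atom:

```python
def summarize(results: dict) -> dict:
    """Derive S/NS/U labels and precision from posterior marginals."""
    labels = {}
    for m in results["marginals"]:
        p_true = m["probabilities"][1]  # assumed [P(false), P(true)] ordering
        labels[m["variable"]] = "S" if p_true > 0.5 else ("NS" if p_true < 0.5 else "U")
    supported = sum(1 for v in labels.values() if v == "S")
    return {"labels": labels, "precision": supported / len(labels)}

results = {"marginals": [
    {"variable": "a0", "probabilities": [0.1, 0.9]},
    {"variable": "a1", "probabilities": [0.6, 0.4]},
]}
out = summarize(results)
print(out["labels"], out["precision"])  # {'a0': 'S', 'a1': 'NS'} 0.5
```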
Input/Output File Formats
Input Format (JSON)
{
"input": "Tell me a bio of Albert Einstein",
"output": "Albert Einstein was a German-born physicist...",
"topic": "Albert Einstein",
"atoms": [
{
"id": "a0",
"text": "Albert Einstein was German-born.",
"original": "Albert Einstein was German-born.",
"label": "S",
"contexts": ["c_a0_0", "c_a0_1"]
}
],
"contexts": [
{
"id": "c_a0_0",
"title": "Albert Einstein - Wikipedia",
"text": "Albert Einstein was born in Ulm...",
"snippet": "German-born theoretical physicist",
"link": "https://en.wikipedia.org/wiki/Albert_Einstein"
}
]
}
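Before feeding pre-processed data to from_dict_with_contexts, it can be useful to check that every context id referenced by an atom actually appears in the contexts list. The validator below is a hypothetical helper, not part of the package:

```python
import json

def validate(doc: dict) -> list[str]:
    """Return an error message for every dangling context reference."""
    known = {c["id"] for c in doc.get("contexts", [])}
    errors = []
    for atom in doc.get("atoms", []):
        for cid in atom.get("contexts", []):
            if cid not in known:
                errors.append(f"atom {atom['id']} references unknown context {cid}")
    return errors

doc = json.loads("""{
  "atoms": [{"id": "a0", "contexts": ["c_a0_0", "c_a0_9"]}],
  "contexts": [{"id": "c_a0_0"}]
}""")
print(validate(doc))  # ['atom a0 references unknown context c_a0_9']
```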
Project Structure
FactReasoner/
├── src/fact_reasoner/
│ ├── __init__.py # Package exports
│ ├── assessor.py # Main FactReasoner class
│ ├── corrector.py # FactCorrector (WIP)
│ ├── fact_graph.py # Graph representation
│ ├── search_api.py # Google Search API wrapper
│ ├── utils.py # Utility functions
│ ├── core/
│ │ ├── atomizer.py # Atomic decomposition
│ │ ├── reviser.py # Atom decontextualization
│ │ ├── retriever.py # Context retrieval
│ │ ├── summarizer.py # Context summarization
│ │ ├── nli.py # NLI extraction
│ │ ├── query_builder.py # Search query generation
│ │ └── utils.py # Core utilities
│ ├── baselines/
│ │ ├── factscore.py # FactScore implementation
│ │ ├── factverify.py # FactVerify implementation
│ │ └── veriscore.py # VeriScore implementation
│ └── eval/
│ └── eval_dataset.py # Dataset evaluation utilities
├── docs/
│ ├── examples/
│ │ ├── assessors/ # Assessor examples
│ │ ├── correctors/ # Corrector examples
│ │ └── core/ # Core component examples
│ └── papers/ # Papers
├── tests/ # Unit tests
├── pyproject.toml # Package configuration
└── README.md
Examples
See the docs/ directory for complete examples:
| Example | Description |
|---|---|
| docs/examples/assessors/ex_factreasoner.py | Full FactReasoner pipeline |
| docs/examples/assessors/ex_factscore.py | FactScore baseline |
| docs/examples/assessors/ex_veriscore.py | VeriScore baseline |
| docs/examples/core/ex_atomizer.py | Standalone atomization |
| docs/examples/core/ex_nli.py | NLI extraction |
| docs/examples/core/ex_retriever.py | Context retrieval |
| docs/examples/core/ex_summarizer.py | Context summarization |
Citation
If you use FactReasoner in your research, please cite:
@misc{marinescu2025factreasonerprobabilisticapproachlongform,
title={FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models},
author={Radu Marinescu and Debarun Bhattacharjya and Junkyu Lee and Tigran Tchrakian and Javier Carnerero Cano and Yufang Hou and Elizabeth Daly and Alessandra Pascale},
year={2025},
eprint={2502.18573},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.18573},
}
License
Apache License 2.0 - see LICENSE for details.
Authors
- Radu Marinescu (radu.marinescu@ie.ibm.com)
- Javier Carnerero Cano (javier.cano@ibm.com)
- Massimiliano Pronesti (massimiliano.pronesti@ibm.com)
Contributing
Contributions are welcome! Please open an issue or submit a pull request on GitHub.