Factuality Assessment for Foundation Models
FactReasoner
A probabilistic factuality assessment framework for Large Language Models (LLMs). FactReasoner provides fine-grained factuality evaluation by decomposing LLM responses into atomic claims and verifying them against external knowledge sources using probabilistic reasoning.
Overview
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text, yet they frequently produce factually incorrect information—a phenomenon known as "hallucination." Assessing the factuality of long-form text generations is particularly challenging because these responses may contain numerous informative statements, and validating each piece of information against reliable sources is time-consuming, costly, and error-prone.
FactReasoner is a novel factuality assessment framework that leverages probabilistic reasoning to evaluate the factual correctness of LLM-generated responses. Unlike traditional binary classification approaches, FactReasoner provides calibrated probability estimates for each claim, enabling nuanced uncertainty quantification.
The Problem
Given a long-form text y generated by an LLM in response to a query x, we assume that y consists of n atomic units (or atoms) that can be either true or false, denoted as A_y = {a_1, a_2, ..., a_n}. An atomic unit is defined as a short sentence conveying one piece of information.
Given an external knowledge source C (e.g., Wikipedia, the Web, or a document collection), an atomic unit a_i is considered supported if there exists at least one piece of information (a context) in C that undeniably supports a_i. Otherwise, the atomic unit is not supported.
Factuality Metrics
FactReasoner computes several factuality metrics:
Factual Precision - The proportion of supported atoms in the response:
Precision(y) = S(y) / |A_y|
where S(y) is the number of supported atomic units.
Factual Recall at K - Measures recall up to K supported atoms:
R_K(y) = min(S(y) / K, 1)
F1@K Score - Combines precision and recall:
F1@K(y) = 2 * Precision(y) * R_K(y) / (Precision(y) + R_K(y))
Entropy-based Measure - A novel metric leveraging posterior probabilities:
E(y) = (1/n) * Σ -P(a_i) * log(P(a_i))
When all atoms have posterior probability P(a_i) = 0.5 (undecided), E(y) ≈ 0.15 (using base-10 logarithms, since -0.5 · log₁₀ 0.5 ≈ 0.15). When all atoms are true with certainty (P(a_i) = 1), E(y) = 0. Lower entropy indicates higher factual confidence.
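The four metrics above can be sketched in a few lines of Python. The base-10 logarithm in the entropy is an assumption inferred from the E(y) ≈ 0.15 figure quoted for P(a_i) = 0.5; the function names here are illustrative, not the package's API:

```python
import math

def factual_precision(supported: int, total: int) -> float:
    """Precision(y) = S(y) / |A_y|."""
    return supported / total

def recall_at_k(supported: int, k: int) -> float:
    """R_K(y) = min(S(y) / K, 1)."""
    return min(supported / k, 1.0)

def f1_at_k(supported: int, total: int, k: int) -> float:
    """F1@K(y): harmonic mean of precision and recall at K."""
    p = factual_precision(supported, total)
    r = recall_at_k(supported, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def avg_entropy(posteriors: list[float]) -> float:
    """E(y) = (1/n) * sum(-P(a_i) * log10(P(a_i))), skipping P = 0 terms."""
    return sum(-p * math.log10(p) for p in posteriors if p > 0) / len(posteriors)

# Example: 9 of 12 atoms supported, K = 10
print(round(factual_precision(9, 12), 3))  # 0.75
print(round(recall_at_k(9, 10), 3))        # 0.9
print(round(f1_at_k(9, 12, 10), 3))        # 0.818
print(round(avg_entropy([0.5, 0.5]), 3))   # 0.151
```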
How FactReasoner Works
FactReasoner addresses hallucination detection through a principled five-stage pipeline:
1. Atomic Decomposition (Atomizer): The LLM response is decomposed into minimal, verifiable claims (atoms) using few-shot prompting. Each atom represents a single piece of information that can be independently verified.
2. Decontextualization (Reviser): Atoms are revised to be standalone by resolving pronouns (e.g., "he", "she", "it"), demonstrative references (e.g., "this", "that"), unknown entities, and incomplete names. This ensures each atom can be verified without additional context.
3. Context Retrieval (Retriever): For each atom, relevant evidence is gathered from external knowledge sources. FactReasoner supports multiple retrieval backends:
   - Wikipedia: Using LangChain's WikipediaRetriever
   - Google Search: Via the Serper API with optional full-page content extraction
   - ChromaDB: Custom vector stores with semantic search
4. NLI-based Verification (NLI Extractor): Natural Language Inference (NLI) is used to determine the relationship between each atom and its retrieved contexts:
   - Entailment: The context supports/implies the atom
   - Contradiction: The context contradicts the atom
   - Neutral: The context neither supports nor contradicts the atom
5. Probabilistic Reasoning (Evaluator): A Markov Network (undirected graphical model) is constructed where:
   - Nodes represent atoms and contexts as binary random variables
   - Edges encode NLI relationships with associated probabilities
   - Factors define the joint probability distribution based on entailment/contradiction strengths
   The Merlin inference engine performs belief propagation to compute posterior marginal probabilities P(a_i) for each atom, determining whether it is supported by the retrieved evidence.
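To make the final stage concrete, here is a toy sketch of how entailing and contradicting contexts combine into a posterior for one atom. The factor values and priors below are invented for illustration; the actual parameterization is defined by FactReasoner and evaluated by Merlin's belief propagation rather than the brute-force enumeration used here:

```python
from itertools import product

# Toy Markov network: one atom a, context c1 entails a (strength 0.9),
# context c2 contradicts a (strength 0.7). All numbers are illustrative.

def entail_factor(a: int, c: int, p: float) -> float:
    """Reward (a=1, c=1) for an entailment edge of strength p."""
    return (p if a == 1 else 1 - p) if c == 1 else 0.5

def contradict_factor(a: int, c: int, p: float) -> float:
    """Reward (a=0, c=1) for a contradiction edge of strength p."""
    return (p if a == 0 else 1 - p) if c == 1 else 0.5

def prior(c: int) -> float:
    """Contexts are assumed likely reliable."""
    return 0.8 if c == 1 else 0.2

def posterior_atom() -> float:
    """Brute-force the marginal P(a = 1) over all binary assignments."""
    weights = {0: 0.0, 1: 0.0}
    for a, c1, c2 in product([0, 1], repeat=3):
        w = (prior(c1) * prior(c2)
             * entail_factor(a, c1, 0.9)
             * contradict_factor(a, c2, 0.7))
        weights[a] += w
    return weights[1] / (weights[0] + weights[1])

print(round(posterior_atom(), 3))  # 0.701
```

Note the intermediate probability: the strong entailing context pulls P(a) toward 1, the contradicting context pulls it back, and the model settles between them instead of forcing a binary verdict.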
Pipeline Versions
FactReasoner supports three configurations based on how atom-context relationships are modeled:
| Version | Relationships | Description |
|---|---|---|
| FR1 | Atom ↔ Own Contexts | Each atom is connected only to its k retrieved contexts. Simplest model with localized reasoning. |
| FR2 | Atom ↔ All Contexts | Duplicate contexts are removed, and each atom is connected to all m unique contexts. Enables cross-atom evidence sharing. |
| FR3 | FR2 + Context ↔ Context | Adds context-to-context relationships, allowing the model to reason about consistency between different evidence sources. |
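The difference between the FR1 and FR2 edge sets can be sketched as follows, assuming a simple mapping from atom ids to their retrieved context ids (this data layout is illustrative, not the package's internal representation):

```python
def fr1_edges(atom_contexts: dict[str, list[str]]) -> set[tuple[str, str]]:
    """FR1: each atom connects only to its own retrieved contexts."""
    return {(a, c) for a, ctxs in atom_contexts.items() for c in ctxs}

def fr2_edges(atom_contexts: dict[str, list[str]]) -> set[tuple[str, str]]:
    """FR2: each atom connects to every unique context (duplicates removed)."""
    unique = {c for ctxs in atom_contexts.values() for c in ctxs}
    return {(a, c) for a in atom_contexts for c in unique}

# Two atoms sharing context c1
atoms = {"a0": ["c0", "c1"], "a1": ["c1", "c2"]}
print(len(fr1_edges(atoms)), len(fr2_edges(atoms)))  # 4 6
```

In FR2, atom a0 also sees c2 (retrieved for a1), which is what enables cross-atom evidence sharing; FR3 would additionally add edges among c0, c1, and c2.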
Why Probabilistic Reasoning?
Traditional factuality methods (like FactScore) make independent binary decisions for each atom. FactReasoner's probabilistic approach offers several advantages:
- Uncertainty Quantification: Instead of binary verdicts, you get calibrated probabilities reflecting confidence levels
- Evidence Aggregation: Multiple pieces of evidence (entailing and contradicting) are combined coherently
- Global Consistency: The Markov Network jointly reasons over all atoms and contexts, ensuring globally consistent assessments
- Handling Conflicting Evidence: When contexts disagree, the model produces intermediate probabilities rather than arbitrary decisions
Key Features
- Probabilistic Factuality Scoring: Returns calibrated probability estimates rather than binary verdicts
- Multiple Knowledge Sources: Support for Wikipedia, Google Search API, and ChromaDB vector stores
- Baseline Implementations: Includes FactScore and VeriScore methods for comparison
- Modular Architecture: Each component (atomizer, retriever, NLI, summarizer) can be configured independently
- Async Support: Batch processing with asynchronous LLM calls for efficiency
- Caching: SQLite-based caching for search API results
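The SQLite-based caching of search API results can be pictured with a minimal sketch. The class name and table schema below are illustrative assumptions, not the package's actual layout:

```python
import json
import sqlite3

class SearchCache:
    """Minimal query -> results cache backed by SQLite (illustrative schema)."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (query TEXT PRIMARY KEY, results TEXT)"
        )

    def get(self, query: str):
        """Return cached results for a query, or None on a cache miss."""
        row = self.conn.execute(
            "SELECT results FROM cache WHERE query = ?", (query,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, query: str, results: list) -> None:
        """Store (or overwrite) the results for a query."""
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)",
            (query, json.dumps(results)),
        )
        self.conn.commit()

cache = SearchCache()
cache.put("albert einstein", [{"title": "Albert Einstein - Wikipedia"}])
print(cache.get("albert einstein"))  # [{'title': 'Albert Einstein - Wikipedia'}]
print(cache.get("missing query"))    # None
```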
Installation
From PyPI
uv pip install fact_reasoner
From Source
git clone https://github.com/IBM/FactReasoner
cd FactReasoner
uv sync
. .venv/bin/activate # To activate the virtual environment
Internal IBM Usage
For internal access to IBM RITS backends, install mellea-ibm as follows:
pip install "git+ssh://git@github.ibm.com/generative-computing/mellea-ibm.git"
Dependencies
FactReasoner requires:
- Python >= 3.11
- Merlin: C++ probabilistic inference engine (must be compiled locally)
- Mellea: LLM interaction library
Environment Variables
Set the following environment variables:
# Google Search retrieval via Serper API:
export SERPER_API_KEY=your_serper_api_key
# Internal IBM inference service
export RITS_API_KEY=your_RITS_api_key
Quick Start
Basic Usage
from mellea.backends import ModelOption
from mellea_ibm.rits import RITSBackend, RITS
from fact_reasoner import FactReasoner
from fact_reasoner.core.atomizer import Atomizer
from fact_reasoner.core.reviser import Reviser
from fact_reasoner.core.retriever import ContextRetriever, Retriever
from fact_reasoner.core.summarizer import ContextSummarizer
from fact_reasoner.core.nli import NLIExtractor
from fact_reasoner.core.query_builder import QueryBuilder
# Initialize the LLM backend
backend = RITSBackend(
RITS.LLAMA_3_3_70B_INSTRUCT,
model_options={ModelOption.MAX_NEW_TOKENS: 4096}
)
# Create pipeline components
query_builder = QueryBuilder(backend)
atom_extractor = Atomizer(backend)
atom_reviser = Reviser(backend)
retriever = Retriever(
service_type="google", # or "wikipedia", "chromadb"
top_k=5,
fetch_text=True,
query_builder=query_builder,
num_workers=4
)
context_summarizer = ContextSummarizer(backend)
context_retriever = ContextRetriever(
retriever=retriever,
num_workers=4,
)
nli_extractor = NLIExtractor(backend)
# Create the FactReasoner pipeline
pipeline = FactReasoner(
atom_extractor=atom_extractor,
atom_reviser=atom_reviser,
context_retriever=context_retriever,
context_summarizer=context_summarizer,
nli_extractor=nli_extractor,
merlin_path="/path/to/merlin"
)
# Build and score
pipeline.build(
query="Tell me about Albert Einstein",
response="Albert Einstein was born in 1879 in Ulm, Germany...",
topic="Albert Einstein",
revise_atoms=True,
summarize_contexts=False
)
results, marginals = pipeline.score()
print(f"Factuality Score: {results['factuality_score']:.2%}")
Loading from Pre-processed Data
import json
# Load pre-processed atoms and contexts
with open("data/example.json", "r") as f:
data = json.load(f)
pipeline.from_dict_with_contexts(data)
pipeline.build(
has_atoms=True,
has_contexts=True,
revise_atoms=False,
rel_atom_context=True,
rel_context_context=False
)
results, marginals = pipeline.score()
Architecture
Core Components
| Component | Class | Description |
|---|---|---|
| Atomizer | Atomizer | Decomposes text into atomic claims using few-shot prompting |
| Reviser | Reviser | Decontextualizes atoms by resolving pronouns and vague references |
| Retriever | ContextRetriever | Retrieves relevant evidence from Wikipedia, Google, or vector stores |
| Summarizer | ContextSummarizer | Summarizes retrieved contexts with respect to specific atoms |
| NLI Extractor | NLIExtractor | Predicts entailment/contradiction/neutral relationships |
| Fact Graph | FactGraph | Graph representation of atoms, contexts, and their relationships |
| Search API | SearchAPI | Google Search via Serper API with SQLite caching |
Pipeline Flow
LLM Response
│
▼
┌─────────────┐
│ Atomizer │ ──► Atomic Claims (atoms)
└─────────────┘
│
▼
┌─────────────┐
│ Reviser │ ──► Decontextualized atoms
└─────────────┘
│
▼
┌─────────────┐
│ Retriever │ ──► External contexts per atom
└─────────────┘
│
▼
┌─────────────┐
│ Summarizer │ ──► Relevant summaries (optional)
└─────────────┘
│
▼
┌─────────────┐
│ NLI │ ──► Entailment/Contradiction relationships
└─────────────┘
│
▼
┌─────────────┐
│ FactGraph + │
│ Markov │ ──► Probabilistic inference (Merlin)
│ Network │
└─────────────┘
│
▼
Factuality Score + Per-atom Marginals
Context Retrieval Options
Google Search (via Serper API)
retriever = ContextRetriever(
service_type="google",
top_k=5,
cache_dir="/path/to/cache.db", # SQLite cache for API results
fetch_text=True, # Fetch full page content from links
query_builder=query_builder # Optional query reformulation
)
Wikipedia
retriever = ContextRetriever(
service_type="wikipedia",
top_k=3
)
ChromaDB Vector Store
retriever = ContextRetriever(
service_type="chromadb",
collection_name="my_documents",
persist_dir="/path/to/chroma_db",
top_k=5
)
Baseline Methods
FactReasoner includes implementations of existing factuality assessment methods for comparison:
FactScore
from fact_reasoner.baselines.factscore import FactScore
scorer = FactScore(
backend=backend,
atom_extractor=atom_extractor,
atom_reviser=atom_reviser,
context_retriever=context_retriever
)
scorer.build(query=query, response=response, topic=topic)
results = scorer.score()
VeriScore
from fact_reasoner.baselines.veriscore import VeriScore
scorer = VeriScore(
backend=backend,
atom_extractor=atom_extractor,
atom_reviser=atom_reviser,
context_retriever=context_retriever
)
scorer.build(query=query, response=response, topic=topic)
results = scorer.score()
Output Format
The score() method returns a results dictionary:
{
"factuality_score": 0.75, # Overall precision score (0-1)
"factuality_score_per_atom": [...], # Per-atom scores and support labels
"num_atoms": 12, # Total atomic units
"num_contexts": 48, # Total retrieved contexts
"num_true_atoms": 9, # Atoms with P(a) > 0.5
"num_false_atoms": 3, # Atoms with P(a) < 0.5
"num_uniform_atoms": 0, # Atoms with P(a) = 0.5
"predictions": { # Per-atom predictions
"a0": "S", # S = Supported
"a1": "NS", # NS = Not Supported
...
},
"marginals": [ # Posterior probabilities
{"variable": "a0", "probabilities": [0.1, 0.9]},
...
],
"entropy": 2.34, # Total entropy
"avg_entropy": 0.195, # Average entropy per atom
"elapsed_time": 45.2 # Processing time in seconds
}
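As a sketch of consuming this dictionary, the marginals can be turned back into per-atom labels and a precision score. The convention that P(a = true) is the second entry of "probabilities" is an assumption inferred from the example marginal [0.1, 0.9] for a supported atom:

```python
def summarize(results: dict) -> dict:
    """Derive S/NS/U labels and precision from posterior marginals."""
    labels = {}
    for m in results["marginals"]:
        p_true = m["probabilities"][1]  # assumed [P(false), P(true)] ordering
        labels[m["variable"]] = "S" if p_true > 0.5 else ("NS" if p_true < 0.5 else "U")
    supported = sum(1 for v in labels.values() if v == "S")
    return {"labels": labels, "precision": supported / len(labels)}

results = {"marginals": [
    {"variable": "a0", "probabilities": [0.1, 0.9]},
    {"variable": "a1", "probabilities": [0.6, 0.4]},
]}
out = summarize(results)
print(out["labels"], out["precision"])  # {'a0': 'S', 'a1': 'NS'} 0.5
```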
Input/Output File Formats
Input Format (JSON)
{
"input": "Tell me a bio of Albert Einstein",
"output": "Albert Einstein was a German-born physicist...",
"topic": "Albert Einstein",
"atoms": [
{
"id": "a0",
"text": "Albert Einstein was German-born.",
"original": "Albert Einstein was German-born.",
"label": "S",
"contexts": ["c_a0_0", "c_a0_1"]
}
],
"contexts": [
{
"id": "c_a0_0",
"title": "Albert Einstein - Wikipedia",
"text": "Albert Einstein was born in Ulm...",
"snippet": "German-born theoretical physicist",
"link": "https://en.wikipedia.org/wiki/Albert_Einstein"
}
]
}
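Before feeding pre-processed data to from_dict_with_contexts, it can be useful to check that every context id referenced by an atom actually appears in the contexts list. The validator below is a hypothetical helper, not part of the package:

```python
import json

def validate(doc: dict) -> list[str]:
    """Return an error message for every dangling context reference."""
    known = {c["id"] for c in doc.get("contexts", [])}
    errors = []
    for atom in doc.get("atoms", []):
        for cid in atom.get("contexts", []):
            if cid not in known:
                errors.append(f"atom {atom['id']} references unknown context {cid}")
    return errors

doc = json.loads("""{
  "atoms": [{"id": "a0", "contexts": ["c_a0_0", "c_a0_9"]}],
  "contexts": [{"id": "c_a0_0"}]
}""")
print(validate(doc))  # ['atom a0 references unknown context c_a0_9']
```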
Project Structure
FactReasoner/
├── src/fact_reasoner/
│ ├── __init__.py # Package exports
│ ├── assessor.py # Main FactReasoner class
│ ├── corrector.py # FactCorrector (WIP)
│ ├── fact_graph.py # Graph representation
│ ├── search_api.py # Google Search API wrapper
│ ├── utils.py # Utility functions
│ ├── core/
│ │ ├── atomizer.py # Atomic decomposition
│ │ ├── reviser.py # Atom decontextualization
│ │ ├── retriever.py # Context retrieval
│ │ ├── summarizer.py # Context summarization
│ │ ├── nli.py # NLI extraction
│ │ ├── query_builder.py # Search query generation
│ │ └── utils.py # Core utilities
│ ├── baselines/
│ │ ├── factscore.py # FactScore implementation
│ │ ├── factverify.py # FactVerify implementation
│ │ └── veriscore.py # VeriScore implementation
│ └── eval/
│ └── eval_dataset.py # Dataset evaluation utilities
├── docs/
│ ├── examples/
│ │ ├── assessors/ # Assessor examples
│ │ ├── correctors/ # Corrector examples
│ │ └── core/ # Core component examples
│ └── papers/ # Papers
├── tests/ # Unit tests
├── pyproject.toml # Package configuration
└── README.md
Examples
See the docs/ directory for complete examples:
| Example | Description |
|---|---|
| docs/examples/assessors/ex_factreasoner.py | Full FactReasoner pipeline |
| docs/examples/assessors/ex_factscore.py | FactScore baseline |
| docs/examples/assessors/ex_veriscore.py | VeriScore baseline |
| docs/examples/core/ex_atomizer.py | Standalone atomization |
| docs/examples/core/ex_nli.py | NLI extraction |
| docs/examples/core/ex_retriever.py | Context retrieval |
| docs/examples/core/ex_summarizer.py | Context summarization |
Citation
If you use FactReasoner in your research, please cite:
@misc{marinescu2025factreasonerprobabilisticapproachlongform,
title={FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models},
author={Radu Marinescu and Debarun Bhattacharjya and Junkyu Lee and Tigran Tchrakian and Javier Carnerero Cano and Yufang Hou and Elizabeth Daly and Alessandra Pascale},
year={2025},
eprint={2502.18573},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.18573},
}
License
Apache License 2.0 - see LICENSE for details.
Authors
- Radu Marinescu (radu.marinescu@ie.ibm.com)
- Javier Carnerero Cano (javier.cano@ibm.com)
- Massimiliano Pronesti (massimiliano.pronesti@ibm.com)
Contributing
Contributions are welcome! Please open an issue or submit a pull request on GitHub.