Recursive self-checking for LLM hallucination reduction via Verdict Stability Score (VSS)

Varity v0.1.11

Recursive Self-Checking for LLM Hallucination Reduction

Overview

Try the Interactive BYOK Simulator / Landing Page locally via docs/index.html or live at Varity UI

📖 Read the Architectural Whitepaper: Dive into the mathematical models behind the Verdict Stability Score (VSS) and Recursive Interrogation at docs/CONCEPTS.md.

Varity is a lightweight Python library, built without vendor SDKs, that mitigates Large Language Model (LLM) hallucinations. It decomposes a generated response into atomic claims, recursively verifies each claim across iterative context depths, and computes a Verdict Stability Score (VSS).

Unlike traditional single-pass evaluation frameworks, Varity works from the premise that hallucinated or uncertain generations are unstable: when the LLM is challenged to verify its own sub-claims recursively, unstable claims tend to "flip" their verdicts. Varity counts these flips and uses them to score its confidence in each claim.

Key Capabilities

Recursive Verification (Depth N): Stresses the model to re-evaluate claims repeatedly to track verdict stability.
Verdict Stability Score (VSS): A mathematical metric bounding the resilience of an LLM generation against self-contradiction.
Provider Agnostic (BYOK): Supports Anthropic, OpenAI, and Google Gemini via raw HTTP integrations, ensuring zero telemetry and guaranteeing Bring-Your-Own-Key data sovereignty.
Graceful Degradation: Safely handles upstream provider rate limits (HTTP 429) and degradation faults without interrupting the execution pipeline.
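
The graceful-degradation behaviour can be pictured with a small sketch. This is not Varity's internal code: `RateLimited`, `call_with_fallback`, `make_flaky`, and the `"unverified"` fallback verdict are hypothetical names chosen for illustration. The assumed idea is that a rate-limited call is retried with backoff a few times and then replaced by a neutral result instead of raising.

```python
import asyncio
from typing import Awaitable, Callable


class RateLimited(Exception):
    """Hypothetical stand-in for a provider answering with HTTP 429."""


async def call_with_fallback(
    call: Callable[[], Awaitable[str]],
    retries: int = 2,
    base_delay: float = 0.01,
    fallback: str = "unverified",
) -> str:
    """Retry a rate-limited call with exponential backoff, then degrade."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except RateLimited:
            if attempt < retries:
                # Wait base_delay, then 2x, then 4x, before the next attempt.
                await asyncio.sleep(base_delay * (2 ** attempt))
    # Every attempt was rate limited: return a neutral verdict instead of raising.
    return fallback


def make_flaky(failures: int) -> Callable[[], Awaitable[str]]:
    """Build a fake provider call that is rate limited `failures` times."""
    state = {"left": failures}

    async def call() -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise RateLimited()
        return "supported"

    return call


recovered = asyncio.run(call_with_fallback(make_flaky(1)))  # succeeds on retry
degraded = asyncio.run(call_with_fallback(make_flaky(5)))   # exhausts retries
print(recovered, degraded)
```

With this shape, one claim that cannot be verified does not abort the rest of the pipeline; the caller decides how to treat the neutral verdict.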

Why Varity?

Problem	Varity's Approach
Single-pass fact-checking misses nuanced errors	Recursive depth-N verification exposes instability
External knowledge bases go stale	Uses the LLM's own parametric knowledge as the oracle
Heavy SDK dependencies increase attack surface	Zero vendor SDKs — raw `httpx` only
API keys leak through telemetry	Strict BYOK — keys are never logged, cached, or transmitted beyond the provider endpoint

Installation

pip install varity

Requires Python 3.9+. Core dependencies: pydantic>=2.0, httpx>=0.25, tiktoken>=0.5.

📊 Benchmark Performance & Supported Providers

Varity talks to each provider over raw HTTP, with no vendor SDKs required. Supported providers include OpenAI (gpt-4o-mini), Google Gemini (gemini-2.0-flash), and Anthropic (claude-3-5-sonnet). OpenAI-compatible routers such as OpenRouter are also supported.

Recent Accuracy Test (v0.1.10)

Tested against a small dataset of 8 statements (common AI hallucinations, historical misconceptions, and scientific myths, mixed with factual statements) using openai/gpt-4o-mini (via OpenRouter).

Detection Accuracy: 100% (8/8 mixed facts and hallucinations correctly flagged)
Average VSS Score: 100% (Mathematical stability)
False Positive Rate: 0%
Avg Confidence on Hallucinations: ~19.5%

Example Detection Run:

  Statement: "India got its independence in 1998."
  Verdict   : ❌ HALLUCINATION  (expected: hallucination)
  Confidence: 20.0%  |  VSS: 100.0%  |  Time: 11.1s  [OK]
  Correction: India reportedly got its independence in 1947....

  Statement: "Water boils at 100 degrees Celsius at sea level."
  Verdict   : ✅ FACTUAL  (expected: factual)
  Confidence: 93.0%  |  VSS: 100.0%  |  Time: 13.6s  [OK]

Quick Start

1. Set your API key

# Option A: Environment variable
export VARITY_PROVIDER="gemini"
export VARITY_API_KEY="your-api-key"

# Option B: Create a .env file in your project root
echo 'VARITY_PROVIDER=gemini' > .env
echo 'VARITY_API_KEY=your-api-key' >> .env
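
If you read these variables in your own script, a fail-fast check along the following lines can catch a missing key before any request is made. This is a minimal sketch, not part of Varity: `load_settings` and the `SUPPORTED` set are hypothetical, and the `"anthropic"` identifier is assumed by analogy with the `gemini` and `openai` values used elsewhere in this README.

```python
import os
from typing import Mapping, Tuple

# Provider identifiers; "anthropic" is an assumption, the other two appear in this README.
SUPPORTED = {"anthropic", "openai", "gemini"}


def load_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (provider, api_key) or raise ValueError with a clear message."""
    provider = env.get("VARITY_PROVIDER", "").strip().lower()
    api_key = env.get("VARITY_API_KEY", "").strip()
    if provider not in SUPPORTED:
        raise ValueError(f"VARITY_PROVIDER must be one of {sorted(SUPPORTED)}")
    if not api_key:
        raise ValueError("VARITY_API_KEY is not set")
    return provider, api_key


# A plain dict stands in for the process environment in this demo.
provider, api_key = load_settings(
    {"VARITY_PROVIDER": "gemini", "VARITY_API_KEY": "your-api-key"}
)
print(provider)
```

The sketch only sees variables that are already present in the process environment; it does not parse a .env file itself.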

2. Verify a response programmatically

import asyncio
from varity import Varity, VarityConfig
from varity.providers import get_provider

async def main():
    # Build a provider client for the chosen backend.
    provider = get_provider("gemini", api_key="your-api-key")
    # depth=1 adds one recursive re-check; claims under 0.6 confidence are flagged.
    config = VarityConfig(depth=1, confidence_threshold=0.6)
    varity = Varity(provider=provider, config=config)

    try:
        result = await varity.acheck(
            "The Eiffel Tower is 10,000 feet tall and was completed in 1887."
        )

        # Response-level summary.
        print(f"Confidence : {result.overall_confidence:.2f}")
        print(f"VSS        : {result.vss_score:.2f}")
        print(f"Claims     : {len(result.claims)}")
        print(f"Flagged    : {len(result.flagged_claims)}")

        # Per-claim detail for everything that was flagged.
        for claim in result.flagged_claims:
            print(f"  [FLAGGED] {claim.text}")
            print(f"            verdict={claim.verdict}, vss={claim.vss_score:.2f}")

        # Present only when corrections were generated.
        if result.corrected_response:
            print(f"\nCorrected  : {result.corrected_response}")
    finally:
        # Close the provider even if the check raises.
        await provider.close()

asyncio.run(main())

3. Use the CLI

 __      __        _ _         
 \ \    / /       (_) |        
  \ \  / /_ _ _ __ _| |_ _   _ 
   \ \/ / _` | '__| | __| | | |
    \  / (_| | |  | | |_| |_| |
     \/ \__,_|_|  |_|\__|\__, |  v0.1
                          __/ |
                         |___/

# Single-text evaluation
varity check "Einstein won the Nobel Prize for Relativity." --provider gemini

# Batch processing from JSONL
varity batch input.jsonl output.jsonl --provider openai

# Interactive demo
varity demo

CI/CD Integration

Varity is designed to be easily integrated into CI/CD pipelines to enforce hallucination checks on generated outputs before deployment.

Example: GitHub Actions

Create a .github/workflows/varity-check.yml file:

name: Varity Hallucination Check
on: [push, pull_request]

jobs:
  varity_check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.9"
      - name: Install dependencies
        run: pip install varity
      - name: Run dynamic cycle checks
        env:
          VARITY_PROVIDER: ${{ secrets.VARITY_PROVIDER }}
          VARITY_API_KEY: ${{ secrets.VARITY_API_KEY }}
        run: |
          # Example: Run 5 evaluation cycles on your test script
          python test101.py --cycles 5
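
For the workflow to block a deployment, the script it runs has to exit with a non-zero status when a check fails. The following is a minimal sketch of such a gate, under stated assumptions: `Summary` and `exit_code` are hypothetical helpers, the thresholds are illustrative, and the two inputs are meant to come from `result.overall_confidence` and `len(result.flagged_claims)`.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Stand-in for the two CheckResult values the gate needs."""
    overall_confidence: float
    flagged_count: int


def exit_code(summary: Summary, min_confidence: float = 0.6, max_flagged: int = 0) -> int:
    """Return 0 when the check passes and 1 when the CI job should fail."""
    if summary.overall_confidence < min_confidence:
        return 1
    if summary.flagged_count > max_flagged:
        return 1
    return 0


passing = exit_code(Summary(overall_confidence=0.93, flagged_count=0))
failing = exit_code(Summary(overall_confidence=0.20, flagged_count=1))
print(passing, failing)
```

In a real script the returned value would be passed to `sys.exit`, which is what makes the GitHub Actions step fail.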

How It Works

Core Architecture

Varity runs a fixed five-stage evaluation flow, in which stages 2 and 3 both operate on the decomposed claims:

  ┌──────────────────────┐
  │ Raw Response Payload │
  └──────────┬───────────┘
             │
             ▼
  ┌──────────────────────┐
  │  1. Claim Decomposer │
  └─────┬──────────┬─────┘
        │          │
        ▼          ▼
  ┌───────────┐ ┌─────────────────────┐
  │ 2. Recur- │ │ 3. Independent      │
  │    sive   │ │    Cross-Check      │
  │  Self-    │ └─────────┬───────────┘
  │  Verifier │           │
  └─────┬─────┘           │
        │                 │
        ▼                 ▼
  ┌──────────────────────────┐
  │ 4. Confidence Aggregator │
  └──────────┬───────────────┘
             │
             ▼
  ┌──────────────────────────┐
  │ 5. Correction Generator  │
  └──────────┬───────────────┘
             │
             ▼
  ┌──────────────────────────┐
  │ Validated Output Struct  │
  └──────────────────────────┘

1. Claim Decomposition: Splits the response text into isolated, atomic claims, each represented as a Claim object.
2. Recursive Self-Verification: Re-verifies each claim in separate passes (depth 0 to N) and tracks how its verdict changes from pass to pass.
3. Cross-Checking: Verifies each claim again in an independent pass that does not see the original context, which removes the initial contextual bias.
4. Confidence Aggregator: Combines the number of verdict "flips" with the base verification scores to compute the overall vss_score.
5. Correction Generation: Rebuilds the text, omitting claims that scored below the confidence threshold.
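
The data flow through the five stages can be sketched with stubs. Everything in this block is illustrative: the `Claim` dataclass and the functions `decompose`, `self_verify`, `cross_check`, `aggregate`, `correct`, and `run` are hypothetical, the verifiers always answer "supported", and the confidence formula is a toy. Varity's actual stages call an LLM and are not reproduced in this README.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    self_verdicts: List[str]
    cross_verdict: str = ""
    confidence: float = 0.0


def decompose(response: str) -> List[Claim]:
    # Stage 1: naive sentence split, standing in for real claim extraction.
    return [Claim(text=s.strip(), self_verdicts=[]) for s in response.split(".") if s.strip()]


async def self_verify(claim: Claim, depth: int) -> None:
    # Stage 2: one verdict per pass for depth 0..N (stubbed as "supported").
    for _ in range(depth + 1):
        claim.self_verdicts.append("supported")


async def cross_check(claim: Claim) -> None:
    # Stage 3: independent check without the original context (stubbed).
    claim.cross_verdict = "supported"


def aggregate(claim: Claim) -> None:
    # Stage 4: toy score, the fraction of all verdicts that are "supported".
    verdicts = claim.self_verdicts + [claim.cross_verdict]
    claim.confidence = verdicts.count("supported") / len(verdicts)


def correct(claims: List[Claim], threshold: float) -> str:
    # Stage 5: rebuild the text from claims at or above the threshold.
    return " ".join(c.text + "." for c in claims if c.confidence >= threshold)


async def run(response: str, depth: int = 1, threshold: float = 0.5) -> str:
    claims = decompose(response)
    for claim in claims:
        # Stages 2 and 3 are independent of each other, so they run concurrently.
        await asyncio.gather(self_verify(claim, depth), cross_check(claim))
        aggregate(claim)
    return correct(claims, threshold)


output = asyncio.run(run("Water boils at 100 C at sea level. Paris is in France."))
print(output)
```

Because the stubs never contradict a claim, the output equals the input; swapping a stub for a verifier that returns "contradicted" drops that claim from the rebuilt text.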

Verdict Stability Score (VSS): For each claim, Varity counts how many times the verdict flipped between supported and contradicted across recursive depths. A claim verified as supported at every depth receives VSS = 1.0. A claim that flips on every pass approaches VSS = 0.0. Claims below the configured confidence_threshold are flagged and eligible for automatic correction.
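
One way to turn the flip count into a score is sketched below. The normalisation is an assumption made for illustration (one minus the fraction of adjacent passes whose verdict changed); the exact formula Varity uses is described in docs/CONCEPTS.md and may differ. `verdict_stability` is a hypothetical helper name.

```python
from typing import Sequence


def verdict_stability(verdicts: Sequence[str]) -> float:
    """Return 1.0 for a verdict that never changes, 0.0 for one that flips every pass."""
    if len(verdicts) < 2:
        # A single pass has nothing to flip against.
        return 1.0
    # Count adjacent pairs of passes whose verdicts differ.
    flips = sum(1 for prev, cur in zip(verdicts, verdicts[1:]) if prev != cur)
    return 1.0 - flips / (len(verdicts) - 1)


stable = verdict_stability(["supported", "supported", "supported"])
unstable = verdict_stability(["supported", "contradicted", "supported"])
mixed = verdict_stability(["supported", "supported", "contradicted"])
print(stable, unstable, mixed)  # 1.0 0.0 0.5
```

Under this definition a claim that is contradicted on every pass is also perfectly stable, which matches the example run above where a hallucination is reported with VSS 100.0% and a confidence of 20.0%: stability measures consistency of the verdict, and confidence is reported separately.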

Configuration Reference

VarityConfig accepts the following parameters:

Parameter	Type	Default	Description
`depth`	`int`	`1`	Number of recursive self-verification passes (0 = single pass)
`confidence_threshold`	`float`	`0.5`	Claims scoring below this are flagged
`vss_threshold`	`float`	`0.5`	Claims with VSS below this are flagged (independently of confidence)
`strategy`	`str`	`"standard"`	Verification strategy (`"quick"`, `"standard"`, `"thorough"`)
`max_claims`	`int`	`20`	Maximum number of claims to extract per response
`enable_correction`	`bool`	`True`	Whether to generate corrected text for flagged claims
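
The two thresholds in the table act independently, which can be read as a logical OR. The sketch below illustrates that reading; `is_flagged` is a hypothetical helper, not a function exported by Varity, and its defaults mirror the table.

```python
def is_flagged(
    confidence: float,
    vss: float,
    confidence_threshold: float = 0.5,
    vss_threshold: float = 0.5,
) -> bool:
    """Flag a claim when either score falls below its own threshold."""
    return confidence < confidence_threshold or vss < vss_threshold


low_confidence = is_flagged(confidence=0.20, vss=1.0)  # stable but judged false
unstable = is_flagged(confidence=0.80, vss=0.30)        # confident but flips
clean = is_flagged(confidence=0.93, vss=1.0)
print(low_confidence, unstable, clean)
```

Raising either threshold makes flagging stricter without touching the other.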

Return Schema

CheckResult contains:

Field	Type	Description
`original_response`	`str`	The input text that was evaluated
`claims`	`list[Claim]`	All extracted atomic claims with individual scores
`flagged_claims`	`list[Claim]`	Subset of claims below the confidence threshold
`corrected_response`	`str \| None`	Auto-corrected text (if corrections were generated)
`overall_confidence`	`float`	Weighted average confidence across all claims
`vss_score`	`float`	Average VSS across all claims
`verification_chain`	`list[VerificationStep]`	Full audit trail of every verification pass
`duration_ms`	`int`	Wall-clock execution time in milliseconds
`token_usage`	`dict`	Estimated token consumption breakdown

Commercial Use Cases

Because Varity filters out unstable generations, it can serve as the underlying engine for applications where hallucinations are costly:

1. "Zero-Hallucination" Legal or Medical Writers

General LLMs are dangerous in high-stakes fields because they can invent case studies or medical facts with complete confidence. By piping raw LLM output through Varity (depth=3) and rendering only the corrected_response in your UI, you reduce the risk of hallucinated content reaching professionals who cannot afford it.

2. Academic & SEO Fact-Checking Automation

Content teams and researchers spend countless hours manually fact-checking AI outputs. Varity can be wrapped into a Chrome Extension or text-editor plugin where users highlight generated text and instantly receive a boolean breakdown of Verified vs. Hallucinated claims, drastically reducing manual audit times.

Literature & Academic Context

The mathematical and theoretical foundation of Varity addresses a critical gap identified across recent LLM alignment and self-reflection literature:

1. The Hallucination Gap

Modern LLMs are prone to generating highly plausible but factually incorrect statements (hallucinations) because they prioritize statistical token likelihood over factual grounding. Traditional mitigation strategies like Retrieval-Augmented Generation (RAG) suffer when external data is stale or unavailable.

Reference: "A Survey of Hallucination in Large Foundation Models" (Rawte et al., 2023)

2. Self-Reflection and Iterative Refinement

Recent studies demonstrate that LLMs possess latent capabilities to critique and refine their own outputs when forced into iterative feedback loops. However, prior work mostly relied on single-pass heuristic prompting rather than algorithmic scoring. Varity operationalizes this via Recursive Verification (Depth N).

Reference: "Self-Refine: Iterative Refinement with Self-Feedback" (Madaan et al., 2023)

3. Stability as a Proxy for Truth

The core algorithmic thesis of Varity, the Verdict Stability Score (VSS), is inspired by research showing that hallucinated claims tend to be inconsistent across sampled responses and under cross-examination, whereas true facts tend to remain consistent.

Reference: "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models" (Manakul et al., 2023)
Reference: "Chain-of-Verification Reduces Hallucination in Large Language Models" (Dhuliawala et al., 2023)

By combining atomic extraction (Claim Decomposition) with iterative internal probing (VSS), Varity turns these academic concepts into a deployable engineering framework with no vendor SDK dependencies.

Stress Testing

The included test101.py script runs Varity against a known-hallucination payload over a configurable number of cycles:

# Run 100 consecutive evaluation cycles
python test101.py --cycles 100

# Or configure via environment
export VARITY_CYCLES=50
python test101.py
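
test101.py itself is not reproduced here. As a rough sketch of how a script could accept the cycle count from either source, the helper below gives the `--cycles` flag priority over `VARITY_CYCLES`. Both the precedence and the fallback default of 10 are assumptions, and `resolve_cycles` is a hypothetical name.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def resolve_cycles(
    argv: Optional[Sequence[str]] = None,
    env: Mapping[str, str] = os.environ,
    default: int = 10,
) -> int:
    """Pick the cycle count: --cycles first, then VARITY_CYCLES, then the default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--cycles", type=int, default=None)
    args = parser.parse_args(argv)
    if args.cycles is not None:
        return args.cycles
    if "VARITY_CYCLES" in env:
        return int(env["VARITY_CYCLES"])
    return default


# Explicit argument lists and dicts stand in for sys.argv and os.environ.
from_flag = resolve_cycles(["--cycles", "100"], env={"VARITY_CYCLES": "50"})
from_env = resolve_cycles([], env={"VARITY_CYCLES": "50"})
fallback = resolve_cycles([], env={})
print(from_flag, from_env, fallback)
```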

Development

# Clone and install in development mode
git clone https://github.com/charchitd/Varity-v0.1.git
cd Varity-v0.1
pip install -e ".[dev]"

# Run the test suite (76 unit tests + 10 integration tests)
pytest tests/ -v

# Lint and type-check
ruff check .
mypy --strict varity/

License

Distributed under the MIT License. See LICENSE for details.

Release history

Version	Release date
0.1.12	Apr 20, 2026
0.1.11 (this version)	Apr 20, 2026
0.1.10	Apr 17, 2026
0.1.9	Apr 16, 2026
0.1.8	Apr 12, 2026
0.1.7	Apr 12, 2026
0.1.6	Apr 12, 2026
0.1.5	Apr 12, 2026
0.1.4	Apr 12, 2026
0.1.3	Apr 12, 2026
0.1.2	Apr 12, 2026
0.1.1	Apr 12, 2026
0.1.0	Apr 10, 2026

Download files

Source Distribution

varity-0.1.11.tar.gz (42.9 kB view details)

Uploaded Apr 20, 2026 Source

Built Distribution

varity-0.1.11-py3-none-any.whl (37.8 kB view details)

Uploaded Apr 20, 2026 Python 3

File details

Details for the file varity-0.1.11.tar.gz.

File metadata

Download URL: varity-0.1.11.tar.gz
Upload date: Apr 20, 2026
Size: 42.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for varity-0.1.11.tar.gz
Algorithm	Hash digest
SHA256	`69c6a1e8d0d9861eea57e5e28eef41f4385e5626c2fba7d5ab2328940779a1b2`
MD5	`24f13485bf3d6eb2e3cd256ad3ed8c9a`
BLAKE2b-256	`0b2204f321044c3718f4358ded20cbb4ee9b6a60e75d678ef3e87070106949ea`

See more details on using hashes here.

File details

Details for the file varity-0.1.11-py3-none-any.whl.

File metadata

Download URL: varity-0.1.11-py3-none-any.whl
Upload date: Apr 20, 2026
Size: 37.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for varity-0.1.11-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c471a829bb9ea8e4fd2b0abef8f3432e8b90737addb7b2ceb20e86ee740e8c48`
MD5	`5072d833c7d2d02ede319cdc0640dc83`
BLAKE2b-256	`a83c074fb28197b9fc49a66b1a4f0b27f6d29757d7428ee9d57a9a073238a39c`

See more details on using hashes here.
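
A downloaded file can be compared against the SHA256 digests listed above with the standard library. The sketch below assumes the archive has already been downloaded; `sha256_of` is a hypothetical helper, and the demo hashes a temporary file so that it runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read 64 KiB at a time so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; for a real check, point sha256_of at the downloaded
# wheel or sdist and compare the result with the SHA256 value listed above.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"varity")
    matches = sha256_of(sample) == hashlib.sha256(b"varity").hexdigest()
print(matches)
```

pip can also enforce this at install time through its hash-checking mode, where a requirements file pins the version together with a `--hash=sha256:...` value and is installed with `--require-hashes`.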
