
Recursive self-checking for LLM hallucination reduction via Verdict Stability Score (VSS)


Varity v0.1.3

Recursive Self-Checking for LLM Hallucination Reduction



Overview

🚀 Try the Interactive BYOK Simulator / Landing Page locally via docs/index.html or live at Varity UI

📖 Read the Architectural Whitepaper: Dive into the mathematical models behind the Verdict Stability Score (VSS) and Recursive Interrogation at docs/CONCEPTS.md.

Varity is a lightweight Python library with no vendor-SDK dependencies, designed to mitigate Large Language Model (LLM) hallucinations. It operates by systematically decomposing generated responses into atomic claims, recursively verifying each claim across iterative context depths, and computing a Verdict Stability Score (VSS).

Unlike traditional single-pass evaluation frameworks, Varity is built on the premise that hallucinatory or uncertain generations are unstable: when the LLM is challenged to verify its own sub-claims recursively, unstable claims "flip" their verdicts under analytical pressure. Varity measures these flips to compute rigorous confidence bounds.

Key Capabilities

  • Recursive Verification (Depth N): Stresses the model to re-evaluate claims repeatedly to track verdict stability.
  • Verdict Stability Score (VSS): A mathematical metric bounding the resilience of an LLM generation against self-contradiction.
  • Provider Agnostic (BYOK): Supports Anthropic, OpenAI, and Google Gemini via raw HTTP integrations, ensuring zero telemetry and guaranteeing Bring-Your-Own-Key data sovereignty.
  • Graceful Degradation: Safely handles upstream provider rate limits (HTTP 429) and degradation faults without interrupting the execution pipeline.
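The graceful-degradation behavior can be sketched as a retry loop with exponential backoff (a minimal illustration only; `call_with_backoff` and its parameters are hypothetical names, not Varity's internal API):

```python
import asyncio
import random

async def call_with_backoff(send, max_retries=3, base_delay=1.0):
    """Retry a provider call on rate limits (HTTP 429) instead of
    aborting the pipeline. `send` is any coroutine function that
    returns (status_code, payload)."""
    for attempt in range(max_retries + 1):
        status, payload = await send()
        if status != 429:
            return payload
        if attempt == max_retries:
            return None  # degrade gracefully: the caller skips this claim
        # exponential backoff with jitter before the next attempt
        await asyncio.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
```

On repeated 429s the helper gives up and returns `None` rather than raising, so a single rate-limited claim never halts the whole evaluation run.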

Why Varity?

| Problem | Varity's Approach |
| --- | --- |
| Single-pass fact-checking misses nuanced errors | Recursive depth-N verification exposes instability |
| External knowledge bases go stale | Uses the LLM's own parametric knowledge as the oracle |
| Heavy SDK dependencies increase attack surface | Zero vendor SDKs; raw httpx only |
| API keys leak through telemetry | Strict BYOK: keys are never logged, cached, or transmitted beyond the provider endpoint |

Installation

pip install varity

Requires Python 3.9+. Core dependencies: pydantic>=2.0, httpx>=0.25, tiktoken>=0.5.

Supported Providers

| Provider | Default Model | Free Tier |
| --- | --- | --- |
| Google Gemini | gemini-2.0-flash | Yes |
| Anthropic Claude | claude-sonnet-4-20250514 | No (credits required) |
| OpenAI | gpt-4o-mini | No (credits required) |

All providers are accessed via direct HTTP; no google-generativeai, anthropic, or openai SDK packages are required.
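As an illustration of the SDK-free approach, a prompt can be sent to Gemini's public generateContent REST endpoint with nothing but httpx (a hedged sketch: the helper functions below are our own, not part of Varity's API):

```python
def build_gemini_request(prompt: str, model: str = "gemini-2.0-flash") -> tuple:
    """Assemble the URL and JSON body for Google's generateContent REST endpoint."""
    url = (
        "https://generativelanguage.googleapis.com/v1beta/"
        f"models/{model}:generateContent"
    )
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return url, body

def ask_gemini(prompt: str, api_key: str) -> str:
    """Send the request over plain HTTPS -- no vendor SDK involved."""
    import httpx  # the project's only HTTP dependency
    url, body = build_gemini_request(prompt)
    resp = httpx.post(url, params={"key": api_key}, json=body, timeout=30.0)
    resp.raise_for_status()
    # Gemini responses nest text under candidates -> content -> parts
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
```

Because the request is plain JSON over HTTPS, the API key appears only in the outgoing request to the provider endpoint, which is what makes the BYOK guarantee auditable.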

Quick Start

1. Set your API key

# Option A: Environment variable
export VARITY_PROVIDER="gemini"
export VARITY_API_KEY="your-api-key"

# Option B: Create a .env file in your project root
echo 'VARITY_PROVIDER=gemini' > .env
echo 'VARITY_API_KEY=your-api-key' >> .env

2. Verify a response programmatically

import asyncio
from varity import Varity, VarityConfig
from varity.providers import get_provider

async def main():
    provider = get_provider("gemini", api_key="your-api-key")
    config = VarityConfig(depth=1, confidence_threshold=0.6)
    varity = Varity(provider=provider, config=config)

    result = await varity.acheck(
        "The Eiffel Tower is 10,000 feet tall and was completed in 1887."
    )

    print(f"Confidence : {result.overall_confidence:.2f}")
    print(f"VSS        : {result.vss_score:.2f}")
    print(f"Claims     : {len(result.claims)}")
    print(f"Flagged    : {len(result.flagged_claims)}")

    for claim in result.flagged_claims:
        print(f"  [FLAGGED] {claim.text}")
        print(f"            verdict={claim.verdict}, vss={claim.vss_score:.2f}")

    if result.corrected_response:
        print(f"\nCorrected  : {result.corrected_response}")

    await provider.close()

asyncio.run(main())

3. Use the CLI

 __      __        _ _         
 \ \    / /       (_) |        
  \ \  / /_ _ _ __ _| |_ _   _ 
   \ \/ / _` | '__| | __| | | |
    \  / (_| | |  | | |_| |_| |
     \/ \__,_|_|  |_|\__|\__, |  v0.1
                          __/ |
                         |___/ 
# Single-text evaluation
varity check "Einstein won the Nobel Prize for Relativity." --provider gemini

# Batch processing from JSONL
varity batch input.jsonl output.jsonl --provider openai

# Interactive demo
varity demo

CI/CD Integration

Varity is designed to be easily integrated into CI/CD pipelines to enforce hallucination checks on generated outputs before deployment.

Example: GitHub Actions

Create a .github/workflows/varity-check.yml file:

name: Varity Hallucination Check
on: [push, pull_request]

jobs:
  varity_check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - name: Install dependencies
        run: pip install varity
      - name: Run dynamic cycle checks
        env:
          VARITY_PROVIDER: ${{ secrets.VARITY_PROVIDER }}
          VARITY_API_KEY: ${{ secrets.VARITY_API_KEY }}
        run: |
          # Example: Run 5 evaluation cycles on your test script
          python test101.py --cycles 5

How It Works

Core Architecture

Varity runs a strict, deterministic five-stage evaluation flow:

  ┌───────────────────────┐
  │ Raw Response Payload  │
  └───────────┬───────────┘
              │
              ▼
  ┌───────────────────────┐
  │ 1. Claim Decomposer   │
  └──────┬─────────┬──────┘
         │         │
         ▼         ▼
  ┌─────────────┐ ┌─────────────────────┐
  │ 2. Recursive│ │ 3. Independent      │
  │    Self-    │ │    Cross-Check      │
  │    Verifier │ └──────────┬──────────┘
  └──────┬──────┘            │
         │                   │
         ▼                   ▼
  ┌──────────────────────────┐
  │ 4. Confidence Aggregator │
  └────────────┬─────────────┘
               │
               ▼
  ┌──────────────────────────┐
  │ 5. Correction Generator  │
  └────────────┬─────────────┘
               │
               ▼
  ┌──────────────────────────┐
  │ Validated Output Struct  │
  └──────────────────────────┘
  1. Claim Decomposition: splits the response text into isolated, atomic Claim schema nodes.
  2. Recursive Self-Verification: re-verifies each claim over iterative passes (depth 0...N), tracking the history of verdicts.
  3. Cross-Checking: runs an independent verification of each claim, stripped of the original context, to remove contextual bias.
  4. Confidence Aggregation: maps the number of verdict "flips" and base confidence signals into the final vss_score.
  5. Correction Generation: rebuilds the text, omitting claims that score below the configured confidence threshold.
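The five stages can be condensed into a toy sketch (our own illustration, not Varity's implementation; `run_pipeline`, `verify`, and `score_fn` are hypothetical names standing in for the real components):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Claim:
    text: str
    verdicts: list = field(default_factory=list)
    score: float = 1.0

def run_pipeline(
    response: str,
    verify: Callable[[str, int], str],       # stands in for stages 2 and 3
    score_fn: Callable[[list], float],       # stands in for stage 4
    depth: int = 1,
    threshold: float = 0.5,
):
    # 1. Claim decomposition (toy: one claim per sentence)
    claims = [Claim(s.strip() + ".") for s in response.split(".") if s.strip()]
    for claim in claims:
        # 2 + 3. One verification pass per depth level, recording each verdict
        claim.verdicts = [verify(claim.text, d) for d in range(depth + 1)]
        # 4. Aggregate the verdict history into a single confidence score
        claim.score = score_fn(claim.verdicts)
    # 5. Correction: rebuild the text without sub-threshold claims
    flagged = [c for c in claims if c.score < threshold]
    corrected = " ".join(c.text for c in claims if c.score >= threshold)
    return flagged, corrected
```

The key structural point is that every claim carries its full verdict history forward, so the aggregator scores stability over time rather than a single pass.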

Verdict Stability Score (VSS): For each claim, Varity counts how many times the verdict flipped between supported and contradicted across recursive depths. A claim verified as supported at every depth receives VSS = 1.0. A claim that flips on every pass approaches VSS = 0.0. Claims below the configured confidence_threshold are flagged and eligible for automatic correction.

Configuration Reference

VarityConfig accepts the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| depth | int | 1 | Number of recursive self-verification passes (0 = single pass) |
| confidence_threshold | float | 0.5 | Claims scoring below this are flagged |
| vss_threshold | float | 0.5 | Claims with VSS below this are flagged (independently of confidence) |
| strategy | str | "standard" | Verification strategy ("quick", "standard", "thorough") |
| max_claims | int | 20 | Maximum number of claims to extract per response |
| enable_correction | bool | True | Whether to generate corrected text for flagged claims |

Return Schema

CheckResult contains:

| Field | Type | Description |
| --- | --- | --- |
| original_response | str | The input text that was evaluated |
| claims | list[Claim] | All extracted atomic claims with individual scores |
| flagged_claims | list[Claim] | Subset of claims below the confidence threshold |
| corrected_response | str \| None | Auto-corrected text (if corrections were generated) |
| overall_confidence | float | Weighted average confidence across all claims |
| vss_score | float | Average VSS across all claims |
| verification_chain | list[VerificationStep] | Full audit trail of every verification pass |
| duration_ms | int | Wall-clock execution time in milliseconds |
| token_usage | dict | Estimated token consumption breakdown |

Commercial Use Cases

Because Varity filters out unstable generations, it is well suited as the underlying engine for high-value applications where hallucinations are unacceptable:

1. "Zero-Hallucination" Legal or Medical Writers

General LLMs are dangerous in high-stakes fields because they can invent case studies or medical facts with complete semantic confidence. Piping raw LLM output through Varity (depth=3) and rendering only the corrected_response in your UI substantially reduces the risk of showing fabricated facts to professionals who cannot afford hallucinations.

2. Academic & SEO Fact-Checking Automation

Content teams and researchers spend countless hours manually fact-checking AI outputs. Varity can be wrapped into a Chrome Extension or text-editor plugin where users highlight generated text and instantly receive a boolean breakdown of Verified vs. Hallucinated claims, drastically reducing manual audit times.

Stress Testing

The included test101.py script runs Varity against a known-hallucination payload over a configurable number of cycles:

# Run 100 consecutive evaluation cycles
python test101.py --cycles 100

# Or configure via environment
export VARITY_CYCLES=50
python test101.py

Development

# Clone and install in development mode
git clone https://github.com/charchitd/Varity-v0.1.git
cd Varity-v0.1
pip install -e ".[dev]"

# Run the test suite (76 unit tests + 10 integration tests)
pytest tests/ -v

# Lint and type-check
ruff check .
mypy --strict varity/

License

Distributed under the MIT License. See LICENSE for details.
