
factlens

Geometric LLM hallucination detection. No second LLM. Deterministic. Auditable.


Documentation | Research Papers | Examples | Contributing


factlens detects LLM hallucinations using embedding geometry instead of a second LLM. It computes deterministic, auditable scores from the spatial relationships between questions, responses, and source context in an embedding space. The result is a verification signal you can explain in an audit, reproduce on demand, and run in regulated environments.

Why factlens?

| Problem | How factlens solves it |
|---|---|
| Second-LLM judges are non-deterministic and expensive | Single embedding model (all-mpnet-base-v2), deterministic output, sub-second latency |
| Probabilistic scores cannot be audited | Geometric ratios and angular measurements with clear mathematical definitions |
| Regulatory compliance requires explainability | Every score traces to Euclidean distances and cosine similarities in R^n |
| One method does not fit all use cases | SGI for RAG/context verification, DGI for context-free chat, evaluate() auto-selects |

SGI: Semantic Grounding Index | DGI: Directional Grounding Index

Installation

pip install factlens

With LLM provider support:

pip install "factlens[openai]"       # OpenAI
pip install "factlens[anthropic]"    # Anthropic
pip install "factlens[google]"       # Google Generative AI
pip install "factlens[providers]"    # All providers

With framework integrations:

pip install "factlens[langchain]"    # LangChain
pip install "factlens[crewai]"       # CrewAI
pip install "factlens[semantic-kernel]"  # Semantic Kernel
pip install "factlens[autogen]"      # AutoGen
pip install "factlens[all]"          # Everything

Requirements: Python 3.10+, numpy, sentence-transformers.

Quick start

SGI -- with context (RAG verification)

SGI (Semantic Grounding Index) measures whether a response engaged with the provided context or stayed anchored to the question. It requires three inputs.

from factlens import compute_sgi

result = compute_sgi(
    question="What is the capital of France?",
    context="France is in Western Europe. Its capital is Paris.",
    response="The capital of France is Paris.",
)

print(result.value)       # 1.23 — ratio of distances
print(result.normalized)  # 0.61 — mapped to [0, 1]
print(result.flagged)     # False — above review threshold
print(result.explanation) # "SGI=1.230 — strong context engagement (pass)"

Interpretation: SGI > 1.0 means the response is closer to the context than to the question in embedding space. The response engaged with the source material.
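The geometry behind this ratio can be sketched with plain numpy. The vectors below are hypothetical low-dimensional stand-ins for real sentence embeddings (factlens uses 768-dimensional all-mpnet-base-v2 vectors internally):

```python
import numpy as np

# Hypothetical unit-length embeddings standing in for phi(question),
# phi(context), and phi(response).
q = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 1.0, 0.0])
r = np.array([0.2, 0.9, 0.1])
r = r / np.linalg.norm(r)

# SGI = dist(response, question) / dist(response, context)
sgi = np.linalg.norm(r - q) / np.linalg.norm(r - c)
print(sgi > 1.0)  # True: the response sits closer to the context than to the question
```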

DGI -- without context

DGI (Directional Grounding Index) detects hallucinations without requiring source context. It checks whether the question-to-response displacement vector aligns with the characteristic direction of verified grounded responses.

from factlens import compute_dgi

result = compute_dgi(
    question="What causes seasons on Earth?",
    response="Seasons are caused by Earth's 23.5-degree axial tilt.",
)

print(result.value)       # 0.42 — cosine similarity to reference direction
print(result.normalized)  # 0.71 — mapped to [0, 1]
print(result.flagged)     # False — above pass threshold (0.30)

Domain calibration improves DGI accuracy from AUROC ~0.76 (generic) to 0.90-0.99:

from factlens import compute_dgi

result = compute_dgi(
    question="What is the statute of limitations for breach of contract in California?",
    response="Four years under California Code of Civil Procedure Section 337.",
    reference_csv="legal_calibration_pairs.csv",
)

evaluate() -- auto-select

The evaluate() function picks the right method automatically: SGI when context is provided, DGI when it is not.

from factlens import evaluate

# With context -> SGI
score = evaluate(
    question="What is X?",
    response="X is Y.",
    context="According to the manual, X is Y.",
)
assert score.method == "sgi"

# Without context -> DGI
score = evaluate(
    question="What is X?",
    response="X is Y.",
)
assert score.method == "dgi"

Batch evaluation

from factlens import evaluate_batch

items = [
    {"question": "Q1?", "response": "A1.", "context": "Source."},
    {"question": "Q2?", "response": "A2."},
    {"question": "Q3?", "response": "A3.", "context": "Reference."},
]

results = evaluate_batch(items)
flagged = [r for r in results if r.flagged]
print(f"{len(flagged)}/{len(results)} flagged for review")

CLI

# Single response check
factlens check \
  --question "What is the capital of France?" \
  --response "The capital of France is Paris." \
  --context "France is in Western Europe. Its capital is Paris."

# Batch CSV evaluation
factlens evaluate input.csv --output results.csv

# Domain calibration
factlens calibrate --pairs domain_pairs.csv --output calibration.json

# Run the confabulation benchmark
factlens benchmark

LLM provider guard

Provider wrappers score each completion automatically and attach the result to the response:

from factlens.providers.openai import OpenAIProvider

provider = OpenAIProvider(model="gpt-4o")
response = provider.complete(
    prompt="Summarize this document.",
    context="The document text here...",
)

if response.factlens_score and response.factlens_score.flagged:
    print("Hallucination risk detected — review recommended.")
else:
    print(response.text)

Architecture

factlens/
├── __init__.py              # Public API: compute_sgi, compute_dgi, evaluate, calibrate
├── sgi.py                   # Semantic Grounding Index (context-required)
├── dgi.py                   # Directional Grounding Index (context-free)
├── evaluate.py              # High-level evaluate() and evaluate_batch()
├── calibrate.py             # Domain-specific DGI calibration
├── score.py                 # Result types: SGIResult, DGIResult, FactlensScore
├── _version.py              # CalVer version (2026.4.28)
├── _internal/               # Private implementation
│   ├── geometry.py          # Euclidean distance, displacement, unit normalize
│   ├── embeddings.py        # Sentence transformer encoding
│   ├── thresholds.py        # Decision boundaries and normalization
│   └── csv_loader.py        # Calibration data loading
├── cli/
│   └── main.py              # CLI: check, evaluate, calibrate, benchmark
├── providers/               # LLM provider wrappers
│   ├── _base.py             # BaseLLMProvider protocol + LLMResponse
│   ├── openai.py            # OpenAI provider
│   ├── anthropic.py         # Anthropic provider
│   └── google.py            # Google Generative AI provider
└── integrations/            # Framework integrations
    ├── langchain/           # LangChain evaluator + callback
    ├── crewai/              # CrewAI tool
    ├── semantic_kernel/     # Semantic Kernel filter
    └── autogen/             # AutoGen checker

The architecture follows a layered design:

┌─────────────────────────────────────────────┐
│            Public API (evaluate)             │
├──────────────────┬──────────────────────────┤
│   SGI (sgi.py)   │      DGI (dgi.py)        │
├──────────────────┴──────────────────────────┤
│        _internal (geometry, embeddings)      │
├─────────────────────────────────────────────┤
│  sentence-transformers (all-mpnet-base-v2)   │
└─────────────────────────────────────────────┘
         ▲                          ▲
         │                          │
   ┌─────┴─────┐            ┌──────┴──────┐
   │ Providers  │            │Integrations │
   │ (OpenAI,   │            │ (LangChain, │
   │  Anthropic,│            │  CrewAI,    │
   │  Google)   │            │  SK, AutoGen│
   └────────────┘            └─────────────┘

Scoring methods

SGI (Semantic Grounding Index)

SGI = dist(phi(response), phi(question)) / dist(phi(response), phi(context))

| Score | Interpretation |
|---|---|
| SGI > 1.20 | Strong context engagement (pass) |
| 0.95 < SGI < 1.20 | Partial engagement (review recommended) |
| SGI < 0.95 | Weak engagement (flagged) |
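The decision boundaries above amount to a small classifier. The function below is an illustrative sketch using the documented cutoffs, not factlens's internal implementation:

```python
def interpret_sgi(sgi: float) -> str:
    # Cutoffs from the SGI scoring table: 1.20 (pass) and 0.95 (flag).
    if sgi > 1.20:
        return "pass"     # strong context engagement
    if sgi > 0.95:
        return "review"   # partial engagement
    return "flagged"      # weak engagement

print(interpret_sgi(1.23))  # pass
```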

DGI (Directional Grounding Index)

delta = phi(response) - phi(question)
DGI = dot(delta / ||delta||, mu_hat)

| Score | Interpretation |
|---|---|
| DGI > 0.30 | Aligns with grounded patterns (pass) |
| 0.00 < DGI < 0.30 | Weak alignment (flagged) |
| DGI < 0.00 | Opposes grounded direction (high risk) |
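The displacement computation can be sketched in numpy. Here mu_hat is a hypothetical calibrated reference direction, and the vectors are toy stand-ins for real embeddings:

```python
import numpy as np

def dgi(q_emb, r_emb, mu_hat):
    """DGI: cosine of the question-to-response displacement against
    the reference direction mu_hat (assumed unit-length)."""
    delta = np.asarray(r_emb) - np.asarray(q_emb)
    return float(np.dot(delta / np.linalg.norm(delta), mu_hat))

mu_hat = np.array([0.0, 1.0, 0.0])  # hypothetical reference direction
q = np.array([1.0, 0.0, 0.0])
r = np.array([0.5, 0.8, 0.0])
print(dgi(q, r, mu_hat) > 0.30)  # True: displacement aligns with mu_hat
```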

Providers and integrations

| Component | Install extra | Description |
|---|---|---|
| OpenAI | openai | Wraps the openai SDK with automatic scoring |
| Anthropic | anthropic | Wraps the anthropic SDK with automatic scoring |
| Google | google | Wraps google-generativeai with automatic scoring |
| LangChain | langchain | Evaluator + callback handler |
| CrewAI | crewai | Tool for agent pipelines |
| Semantic Kernel | semantic-kernel | Function-calling filter |
| AutoGen | autogen | Agent chat checker |

Domain calibration

Generic DGI uses a bundled reference direction that achieves AUROC ~0.76. For production use, calibrate with 20-100 verified question-response pairs from your domain:

from factlens import calibrate

result = calibrate(csv_path="my_domain_pairs.csv")
print(f"Concentration: {result.concentration:.2f}")
result.save("calibration.json")

Domain-specific calibration typically reaches AUROC 0.90-0.99. The confabulation benchmark (arXiv:2603.13259) reports DGI AUROC 0.958 with domain calibration.
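One plausible reading of how such a reference direction is calibrated (an assumption for illustration, not factlens's verified internals): average the unit question-to-response displacements of the verified pairs, and take the mean's norm as the concentration.

```python
import numpy as np

def calibrate_direction(question_embs, response_embs):
    """Mean of unit question-to-response displacements.
    Returns (mu_hat, concentration): mu_hat is the unit reference
    direction; concentration in (0, 1] measures how tightly the
    displacements cluster (1.0 = perfectly aligned)."""
    deltas = np.asarray(response_embs) - np.asarray(question_embs)
    units = deltas / np.linalg.norm(deltas, axis=1, keepdims=True)
    mean = units.mean(axis=0)
    concentration = float(np.linalg.norm(mean))
    return mean / concentration, concentration

# Toy 3-D stand-ins for verified grounded pairs
qs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
rs = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
mu_hat, conc = calibrate_direction(qs, rs)
print(conc)  # 1.0 here: both displacements point the same way
```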

Research

factlens implements the methods described in three research papers:

  1. Semantic Grounding Index (SGI): Marin, J. (2025). Semantic Grounding Index for LLM Hallucination Detection. arXiv:2512.13771

  2. Directional Grounding Index (DGI): Marin, J. (2026). A Geometric Taxonomy of Hallucinations in Large Language Models. arXiv:2602.13224

  3. Confabulation Benchmark: Marin, J. (2026). Rotational Dynamics of Factual Constraint Processing in Large Language Models. arXiv:2603.13259

Contributing

See CONTRIBUTING.md for development setup, code standards, and PR process.

License

MIT -- Javier Marin (javier@jmarin.info)
