Geometric LLM hallucination detection. No second LLM. Deterministic. Auditable.
factlens detects LLM hallucinations using embedding geometry instead of a second LLM. It computes deterministic, auditable scores from the spatial relationships between questions, responses, and source context in an embedding space. The result is a verification signal you can explain in an audit, reproduce on demand, and run in regulated environments.
Why factlens?
| Problem | How factlens solves it |
|---|---|
| Second-LLM judges are non-deterministic and expensive | Single embedding model (all-mpnet-base-v2), deterministic output, sub-second latency |
| Probabilistic scores cannot be audited | Geometric ratios and angular measurements with clear mathematical definitions |
| Regulatory compliance requires explainability | Every score traces to Euclidean distances and cosine similarities in $\mathbb{R}^n$ |
| One method does not fit all use cases | SGI for RAG/context verification, DGI for context-free chat, evaluate() auto-selects |
SGI: Semantic Grounding Index | DGI: Directional Grounding Index
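For reference, the two primitives named in the table have standard definitions for vectors $u, v \in \mathbb{R}^n$. The symbols $d$, $u$, and $v$ are notation chosen for this note and do not come from the factlens source.

```latex
d(u, v) = \lVert u - v \rVert_2 = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2},
\qquad
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2}
```

SGI is a ratio of two such distances, and DGI is a cosine similarity between a displacement vector and a reference direction.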
Installation
pip install factlens
With LLM provider support:
pip install "factlens[openai]" # OpenAI
pip install "factlens[anthropic]" # Anthropic
pip install "factlens[google]" # Google Generative AI
pip install "factlens[providers]" # All providers
With framework integrations:
pip install "factlens[langchain]" # LangChain
pip install "factlens[crewai]" # CrewAI
pip install "factlens[semantic-kernel]" # Semantic Kernel
pip install "factlens[autogen]" # AutoGen
pip install "factlens[all]" # Everything
Requirements: Python 3.10+, numpy, sentence-transformers.
Quick start
SGI -- with context (RAG verification)
SGI (Semantic Grounding Index) measures whether a response engaged with the provided context or stayed anchored to the question. It requires three inputs: the question, the source context, and the response.
from factlens import compute_sgi
result = compute_sgi(
    question="What is the capital of France?",
    context="France is in Western Europe. Its capital is Paris.",
    response="The capital of France is Paris.",
)
print(result.value) # 1.23 — ratio of distances
print(result.normalized) # 0.61 — mapped to [0, 1]
print(result.flagged) # False — above review threshold
print(result.explanation) # "SGI=1.230 — strong context engagement (pass)"
Interpretation: SGI > 1.0 means the response is closer to the context than to the question in embedding space. The response engaged with the source material.
DGI -- without context
DGI (Directional Grounding Index) detects hallucinations without requiring source context. It checks whether the question-to-response displacement vector aligns with the characteristic direction of verified grounded responses.
from factlens import compute_dgi
result = compute_dgi(
    question="What causes seasons on Earth?",
    response="Seasons are caused by Earth's 23.5-degree axial tilt.",
)
print(result.value) # 0.42 — cosine similarity to reference direction
print(result.normalized) # 0.71 — mapped to [0, 1]
print(result.flagged) # False — above pass threshold (0.30)
Domain calibration improves DGI accuracy from AUROC ~0.76 (generic) to 0.90-0.99:
from factlens import compute_dgi
result = compute_dgi(
    question="What is the statute of limitations for breach of contract in California?",
    response="Four years under California Code of Civil Procedure Section 337.",
    reference_csv="legal_calibration_pairs.csv",
)
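One way to check AUROC figures like these on your own labeled data is to compute AUROC directly from DGI scores. The sketch below is a generic pairwise AUROC in plain Python and is not part of factlens; the function name `auroc`, the toy scores, and the label convention (1 = grounded, 0 = hallucinated) are illustrative assumptions.

```python
def auroc(scores, labels):
    """Pairwise AUROC: the probability that a randomly chosen grounded item
    (label 1) scores higher than a randomly chosen hallucinated item
    (label 0). Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one item of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up DGI scores and labels, for illustration only.
scores = [0.42, 0.25, 0.10, 0.31, -0.05]
labels = [1, 1, 0, 0, 0]

# 5 of the 6 (grounded, hallucinated) pairs are ranked correctly.
print(round(auroc(scores, labels), 4))  # 0.8333
```

Running the same function on scores produced with and without `reference_csv` gives a like-for-like comparison on your own domain. The pairwise loop is quadratic, which is fine for a calibration set of a few hundred items.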
evaluate() -- auto-select
The evaluate() function picks the right method automatically: SGI when context is provided, DGI when it is not.
from factlens import evaluate
# With context -> SGI
score = evaluate(
    question="What is X?",
    response="X is Y.",
    context="According to the manual, X is Y.",
)
assert score.method == "sgi"
# Without context -> DGI
score = evaluate(
    question="What is X?",
    response="X is Y.",
)
assert score.method == "dgi"
Batch evaluation
from factlens import evaluate_batch
items = [
    {"question": "Q1?", "response": "A1.", "context": "Source."},
    {"question": "Q2?", "response": "A2."},
    {"question": "Q3?", "response": "A3.", "context": "Reference."},
]
results = evaluate_batch(items)
flagged = [r for r in results if r.flagged]
print(f"{len(flagged)}/{len(results)} flagged for review")
CLI
# Single response check
factlens check \
  --question "What is the capital of France?" \
  --response "The capital of France is Paris." \
  --context "France is in Western Europe. Its capital is Paris."
# Batch CSV evaluation
factlens evaluate input.csv --output results.csv
# Domain calibration
factlens calibrate --pairs domain_pairs.csv --output calibration.json
# Run the confabulation benchmark
factlens benchmark
LLM provider guard
from factlens.providers.openai import OpenAIProvider
provider = OpenAIProvider(model="gpt-4o")
response = provider.complete(
    prompt="Summarize this document.",
    context="The document text here...",
)
# Only surface the text when the response was not flagged.
if response.factlens_score and response.factlens_score.flagged:
    print("Hallucination risk detected; review recommended.")
else:
    print(response.text)
Architecture
factlens/
├── __init__.py # Public API: compute_sgi, compute_dgi, evaluate, calibrate
├── sgi.py # Semantic Grounding Index (context-required)
├── dgi.py # Directional Grounding Index (context-free)
├── evaluate.py # High-level evaluate() and evaluate_batch()
├── calibrate.py # Domain-specific DGI calibration
├── score.py # Result types: SGIResult, DGIResult, FactlensScore
├── _version.py # CalVer version (2026.4.28)
├── _internal/ # Private implementation
│ ├── geometry.py # Euclidean distance, displacement, unit normalize
│ ├── embeddings.py # Sentence transformer encoding
│ ├── thresholds.py # Decision boundaries and normalization
│ └── csv_loader.py # Calibration data loading
├── cli/
│ └── main.py # CLI: check, evaluate, calibrate, benchmark
├── providers/ # LLM provider wrappers
│ ├── _base.py # BaseLLMProvider protocol + LLMResponse
│ ├── openai.py # OpenAI provider
│ ├── anthropic.py # Anthropic provider
│ └── google.py # Google Generative AI provider
└── integrations/ # Framework integrations
    ├── langchain/ # LangChain evaluator + callback
    ├── crewai/ # CrewAI tool
    ├── semantic_kernel/ # Semantic Kernel filter
    └── autogen/ # AutoGen checker
The architecture follows a layered design:
┌─────────────────────────────────────────────┐
│            Public API (evaluate)            │
├──────────────────┬──────────────────────────┤
│   SGI (sgi.py)   │       DGI (dgi.py)       │
├──────────────────┴──────────────────────────┤
│      _internal (geometry, embeddings)       │
├─────────────────────────────────────────────┤
│  sentence-transformers (all-mpnet-base-v2)  │
└─────────────────────────────────────────────┘
        ▲                        ▲
        │                        │
  ┌─────┴──────┐          ┌──────┴───────┐
  │ Providers  │          │ Integrations │
  │ (OpenAI,   │          │ (LangChain,  │
  │  Anthropic,│          │  CrewAI,     │
  │  Google)   │          │  SK, AutoGen)│
  └────────────┘          └──────────────┘
Scoring methods
SGI (Semantic Grounding Index)
SGI = dist(phi(response), phi(question)) / dist(phi(response), phi(context))
| Score | Interpretation |
|---|---|
| SGI > 1.20 | Strong context engagement (pass) |
| 0.95 < SGI < 1.20 | Partial engagement (review recommended) |
| SGI < 0.95 | Weak engagement (flagged) |
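The ratio and the bands above can be traced by hand on toy vectors. The following is a minimal sketch in plain Python, not the factlens implementation: the helper names `euclidean`, `sgi`, and `sgi_band` are hypothetical, the 2-D vectors stand in for real embeddings, and the handling of scores exactly equal to 0.95 or 1.20 is an assumption, since the table leaves those boundaries open.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def sgi(question_vec, context_vec, response_vec):
    """SGI = dist(response, question) / dist(response, context)."""
    to_context = euclidean(response_vec, context_vec)
    if to_context == 0.0:
        raise ValueError("response and context embeddings coincide")
    return euclidean(response_vec, question_vec) / to_context


def sgi_band(score):
    """Map a score to the bands in the table (boundary handling assumed)."""
    if score > 1.20:
        return "pass"
    if score > 0.95:
        return "review"
    return "flagged"


# Toy 2-D "embeddings", made up for illustration.
question = [0.0, 0.0]
context = [4.0, 0.0]
response = [3.0, 0.0]  # 3 units from the question, 1 unit from the context

score = sgi(question, context, response)
print(score, sgi_band(score))  # 3.0 pass
```

A response that sits three times closer to the context than to the question yields SGI = 3.0, well inside the pass band. Moving the response toward the question shrinks the numerator and pushes the score toward the flagged band.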
DGI (Directional Grounding Index)
delta = phi(response) - phi(question)
DGI = dot(delta / ||delta||, mu_hat)
| Score | Interpretation |
|---|---|
| DGI > 0.30 | Aligns with grounded patterns (pass) |
| 0.00 < DGI < 0.30 | Weak alignment (flagged) |
| DGI < 0.00 | Opposes grounded direction (high risk) |
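The two-line definition above can be sketched the same way. This is an illustration under stated assumptions rather than the library's code: `unit`, `dgi`, and `dgi_band` are hypothetical helper names, `mu_hat` is assumed to be unit length already, the vectors are toy values, and scores exactly equal to 0.00 or 0.30 are assigned to the lower band by assumption.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero-length vector has no direction")
    return [x / norm for x in v]


def dgi(question_vec, response_vec, mu_hat):
    """DGI = dot(delta / ||delta||, mu_hat), with delta = response - question."""
    delta = [r - q for q, r in zip(question_vec, response_vec)]
    return sum(d * m for d, m in zip(unit(delta), mu_hat))


def dgi_band(score):
    """Map a score to the bands in the table (boundary handling assumed)."""
    if score > 0.30:
        return "pass"
    if score > 0.0:
        return "flagged"
    return "high risk"


mu_hat = [1.0, 0.0]     # toy reference direction, already unit length
question = [1.0, 1.0]
aligned = [4.0, 5.0]    # delta = (3, 4), unit delta = (0.6, 0.8)
opposed = [-2.0, 1.0]   # delta = (-3, 0), unit delta = (-1, 0)

print(round(dgi(question, aligned, mu_hat), 3), dgi_band(dgi(question, aligned, mu_hat)))
print(dgi(question, opposed, mu_hat), dgi_band(dgi(question, opposed, mu_hat)))
```

Because the displacement is normalized before the dot product, only the direction of the move from question to response matters, not how far the response lands from the question.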
Providers and integrations
| Component | Install extra | Description |
|---|---|---|
| OpenAI | openai | Wraps openai SDK with automatic scoring |
| Anthropic | anthropic | Wraps anthropic SDK with automatic scoring |
| Google | google | Wraps google-generativeai with automatic scoring |
| LangChain | langchain | Evaluator + callback handler |
| CrewAI | crewai | Tool for agent pipelines |
| Semantic Kernel | semantic-kernel | Function calling filter |
| AutoGen | autogen | Agent chat checker |
Domain calibration
Generic DGI uses a bundled reference direction that achieves AUROC ~0.76. For production use, calibrate with 20-100 verified question-response pairs from your domain:
from factlens import calibrate
result = calibrate(csv_path="my_domain_pairs.csv")
print(f"Concentration: {result.concentration:.2f}")
result.save("calibration.json")
Domain-specific calibration typically reaches AUROC 0.90-0.99. The confabulation benchmark (arXiv:2603.13259) reports DGI AUROC 0.958 with domain calibration.
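To make the idea of a calibrated reference direction concrete, here is one plausible construction: average the unit displacement vectors of the verified pairs and normalize the result. This is a sketch under assumptions, not a description of what `calibrate()` does internally. In particular, the source does not define `concentration`; the sketch assumes it is the mean resultant length (the norm of the averaged unit vectors, between 0 and 1). The helper names and toy vectors are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero-length vector has no direction")
    return [x / norm for x in v]


def reference_direction(pairs):
    """Estimate (mu_hat, concentration) from (question_vec, response_vec) pairs.

    mu_hat is the normalized mean of the unit displacement vectors.
    concentration is the length of that mean before normalization
    (assumed definition): 1.0 when all displacements point the same way,
    near 0.0 when they are scattered."""
    if not pairs:
        raise ValueError("need at least one calibration pair")
    units = [unit([r - q for q, r in zip(qv, rv)]) for qv, rv in pairs]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    concentration = math.sqrt(sum(x * x for x in mean))
    return unit(mean), concentration


# Toy 2-D pairs: one displacement along x, one along y.
pairs = [([0.0, 0.0], [1.0, 0.0]), ([0.0, 0.0], [0.0, 1.0])]
mu_hat, concentration = reference_direction(pairs)
print([round(x, 4) for x in mu_hat], round(concentration, 4))
```

Under this reading, a low concentration would suggest that the verified pairs do not share a consistent direction, so the resulting reference may discriminate poorly.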
Research
factlens implements the methods described in three peer-reviewed papers:
- Semantic Grounding Index (SGI): Marin, J. (2025). Semantic Grounding Index for LLM Hallucination Detection. arXiv:2512.13771
- Directional Grounding Index (DGI): Marin, J. (2026). A Geometric Taxonomy of Hallucinations in Large Language Models. arXiv:2602.13224
- Confabulation Benchmark: Marin, J. (2026). Rotational Dynamics of Factual Constraint Processing in Large Language Models. arXiv:2603.13259
Contributing
See CONTRIBUTING.md for development setup, code standards, and PR process.
License
MIT -- Javier Marin (javier@jmarin.info)