Hallucination detection for LLMs via semantic consistency checking
groundy 🌱
Keep your LLM grounded - no ground truth required.
A grounded model agrees with itself: ask the same question a few different ways and the
answer holds. A model that's improvising scatters. groundy wraps that check into one
decorator that returns an answer you can trust - or a refusal when the model is just
making things up. No labels, no fine-tuning, no retrieval.
from groundy import groundy
@groundy
def ask(q: str) -> str:
    return my_llm(q)  # your LLM call - any provider, returns a str
ask("Who proved Fermat's Last Theorem?")  # → "Andrew Wiles."
ask("Who was the 14th person on the Moon?")  # → "I'm not confident enough to answer that reliably."
Same signature, same str return. Nothing downstream changes - the answer just became
trustworthy.
Get started
1. Install (not on PyPI yet):
uv add git+https://github.com/lopoc/groundy.git
That's the full library, ready to use - the @groundy decorator and the local embeddings
backend work out of the box, no extras needed. Two optional extras add heavier integrations
only if you want them:
| Extra | Adds | Use it for |
|---|---|---|
| fastembed | ONNX embedding backend (no torch) | ~15× lighter import (CLI cold start ~10s → ~1-2s). Select with backend="fastembed". |
| langfuse | Langfuse tracing adapter | Trace every check (tracer=LangfuseTracer()). See Observability. |
Add them in the brackets (comma-separated for several) - note the quotes and the name @
prefix when you include an extra:
uv add "groundy[fastembed,langfuse] @ git+https://github.com/lopoc/groundy.git"
Skip the extras and nothing breaks: fastembed and the Langfuse SDK are imported lazily,
only when you actually select that backend or construct the tracer, so a plain install
never needs them.
2. Give groundy an API key, a provider, and a model name. It makes one call of its own
(reformulation, over any OpenAI-compatible API), configured entirely under its own
GROUNDY_* namespace:
export GROUNDY_API_KEY=sk-...
export GROUNDY_BASE_URL=https://api.openai.com/v1 # your provider - name it, no default (OpenRouter, Groq, a local server…)
export GROUNDY_MODEL=gpt-4o-mini # the reformulation model (required, no default)
3. Decorate your LLM call and use it as usual:
from openai import OpenAI
from groundy import groundy
client = OpenAI()
@groundy
def ask(q: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": q}]
    ).choices[0].message.content
print(ask("Who proved Fermat's Last Theorem?"))
print(ask("Who was the 14th person on the Moon?"))
That's it. A ready-to-run version (decorator + cache + raw checker) ships in the repo:
uv run python examples/basic.py.
💡 export GROUNDY_DEBUG=1 prints every reformulation, answer, and score.
Vibe-check it from the terminal
No code needed - groundy asks your question a few ways and shows you the matrix: each
distinct answer with a bar for how much it agrees with the rest (groundy's own signal),
consensus on top, outliers at the bottom. Identical answers collapse to one ×N row:
export GROUNDY_API_KEY=sk-...
export GROUNDY_BASE_URL=https://api.openai.com/v1 # your provider - required, no default
export GROUNDY_MODEL=gpt-4o-mini
groundy "Who was the 14th person to walk on the Moon?"
🌱 groundy
? Who was the 14th person to walk on the Moon?
uncertain consistency 0.50 · 17.8s
I'm not confident enough to answer that reliably.
scatter
█████░░░ 0.61 Eugene Cernan (the last person to walk on the Moon, Apollo 17)…
████░░░░ 0.52 Eugene Cernan was the last (12th) person to walk on the Moon…
███░░░░░ 0.41 Harrison Schmitt ×2
On a reliable question the bars stand tall together and collapse to one row
(████████ 1.00 Paris ×5); on a shaky one they fan down as the answers pull apart.
Want the raw structure? --matrix prints the full NรN pairwise heatmap - mutually-agreeing
answers light up as bright blocks, so you see the clusters with no threshold and nothing
aggregated:
scatter
[5×5 pairwise heatmap, rows and columns a-e; cell shading not recoverable]
a Eugene Cernan was the last (12th)…
b Gene Cernan
c Eugene Cernan was the last (12th)…
d Eugene Cernan was the last (12th)…
e Eugene Cernan (the last person…)
It reads GROUNDY_API_KEY, GROUNDY_BASE_URL, and GROUNDY_MODEL like everything else. Pipe a
question in (echo "…" | groundy), add -q for answer-only output, --matrix for the heatmap,
-n/-t to tune, or --debug for the raw reformulation log.
How it works
An uncertain model disagrees with itself when you rephrase the question; a confident one
doesn't. With @groundy(n=5), each call:
- Rephrases the query 4 ways - the one LLM call groundy makes itself.
- Answers all 5 tersely. A verify_prompt is prepended so the comparison is about substance, not phrasing. These are the verify answers.
- Scores agreement - embeds the verify answers locally (sentence-transformers) and averages their pairwise cosine similarity into a consistency_score in [0, 1].
- Decides: reliable = consistency_score >= threshold.
- Answers your way - only if reliable. It calls your function once more on the raw query for the served answer (your verbosity/prompt) and returns it. Unreliable → it skips this call and returns your on_unreliable string.
You serve the answer the way you want it, but verification is terse so verbosity can't hide disagreement. Cost: 7 LLM calls when reliable (1 reformulation + 5 verify + 1 served), 6 when unreliable, all synchronous - which is exactly why you cache it.
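The scoring step and the call count above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not groundy's implementation: the function names are hypothetical, the vectors stand in for sentence-transformers embeddings, and clamping the mean into [0, 1] is a guess at how the score stays in that range.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_score(vectors):
    """Mean pairwise cosine similarity over all answer embeddings."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two answers to compare")
    mean = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    # Assumption: clamp into [0, 1]; the real mapping may differ.
    return min(1.0, max(0.0, mean))


def llm_calls(n, reliable):
    """1 reformulation + n verify answers + 1 served answer (only if reliable)."""
    return 1 + n + (1 if reliable else 0)
```

With n answers there are n(n-1)/2 pairs, so n=5 averages 10 similarities. One outlier answer affects 4 of those 10 pairs, which is why a single divergent answer pulls the score down noticeably.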
Cache it - pay once per cluster of questions
groundy is expensive, so hand it a cache and it runs only on a miss. A cache is anything
with get(key) -> str | None and set(key, value). The real win is a semantic cache: a
hit fires on any question close enough in meaning, so groundy runs once per cluster of
similar questions and serves the whole neighbourhood for free.
from groundy import groundy
# Bring any semantic cache exposing get(key) -> str | None and set(key, value). A hit fires
# on questions close in *meaning*, so groundy runs once per cluster (GPTCache, Momento,
# Upstash, Redis + RedisVL - a 3-line adapter if the method names differ).
cache = SemanticCache(threshold=0.9)
@groundy(cache=cache)
def ask(q: str) -> str:
    return client.chat.completions.create(...).choices[0].message.content  # the RAW model
ask("Who discovered penicillin?")  # MISS → full check → verdict cached
ask("Who was penicillin discovered by?")  # HIT → same meaning, zero LLM calls
On a hit groundy never runs. On a miss it checks, then stores the verdict with cache.set -
refusals included, so "the model can't answer this" is remembered too.
⚠️ The one rule: groundy goes above your semantic cache, never below it. If a semantic cache sits inside the wrapped call, the reformulations - semantically equivalent by design - all hit the same entry, score a perfect 1.0, and every check falsely passes. The semantic cache belongs on top (via cache=), caching the verdict.
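For reference, the smallest object that satisfies the get/set contract described above might look like the sketch below. The class name is hypothetical, and it matches on normalized text only, so it is an exact-match cache rather than a semantic one. A real semantic cache would compare embeddings instead.

```python
from typing import Optional


class NormalizedDictCache:
    """Hypothetical in-memory cache exposing get(key) -> str | None and set(key, value)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _normalize(key: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share one entry.
        return " ".join(key.lower().split())

    def get(self, key: str) -> Optional[str]:
        # None signals a miss, which is when the full check would run.
        return self._store.get(self._normalize(key))

    def set(self, key: str, value: str) -> None:
        # Stores whatever verdict it is given, refusals included.
        self._store[self._normalize(key)] = value
```

An instance would be handed over as @groundy(cache=NormalizedDictCache()). It only saves calls on near-identical wording, so it is mainly useful for tests and local experiments.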
When you want the numbers
The decorator hides the scores on purpose. Reach past it for the rich result:
from groundy import GroundyChecker
checker = GroundyChecker(n=5, threshold=0.75)
r = checker.check("What does Italian Civil Code art. 2043 establish?", answer_fn=my_llm)
r.consistency_score # 0.0-1.0
r.is_reliable # bool
r.best_answer # the served answer if reliable, else None
r.consensus_answer, r.agreement_scores, r.similarity_scores, r.latency_ms
best_answer is the served answer (your raw call) when reliable, and None when not
- on a genuine split the right move is to refuse, not guess. The decorator turns that
None into your on_unreliable string. (consensus_answer, the verify answer that agrees most with the rest, is diagnostic only.)
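The step from the rich result to the decorator's plain string can be sketched as follows. StubResult is a hypothetical stand-in that carries only three of the fields listed above. It is not groundy's GroundyResult class, and the default refusal text is simply copied from the example output earlier in this README.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the refusal wording shown in the examples above.
DEFAULT_REFUSAL = "I'm not confident enough to answer that reliably."


@dataclass
class StubResult:
    """Hypothetical stand-in for the result object, for illustration only."""

    consistency_score: float
    is_reliable: bool
    best_answer: Optional[str]


def to_reply(result: StubResult, on_unreliable: str = DEFAULT_REFUSAL) -> str:
    """Return the served answer when there is one, otherwise the refusal string."""
    if result.best_answer is None:
        return on_unreliable
    return result.best_answer
```

Keeping the result object around lets you log consistency_score next to every reply, which is the data you need later to tune the threshold.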
Run on any vendor
There are two independent LLM tasks, configured separately:
- Answering - your decorated function. OpenAI, LiteLLM, Ollama, anything returning a str. There's no answer_model= knob: the answer call is your function.
- Reformulating - groundy's own OpenAI-compatible call. Set GROUNDY_MODEL + GROUNDY_BASE_URL (both required, no default provider), or pass model / base_url / api_key.
So you can reformulate on a cheap, fast model and answer on a stronger one - even across providers:
@groundy(
    model="llama-3.3-70b-versatile",  # reformulate on Groq…
    base_url="https://api.groq.com/openai/v1",
    api_key="gsk_...",
)
def ask(q: str) -> str:
    return openai_client.chat.completions.create(  # …answer on OpenAI
        model="gpt-4o", messages=[{"role": "user", "content": q}]
    ).choices[0].message.content
Any OpenAI-compatible endpoint works - that covers OpenAI, OpenRouter, Groq, Together, Fireworks, and local servers (vLLM, llama.cpp, Ollama).
Knobs
| Param | Default | What it does |
|---|---|---|
| n | 5 | Answers compared: original + n-1 reformulations. Must be ≥ 2. Higher = sturdier + pricier. |
| threshold | 0.75 | Score below this → refusal. Calibrate it (see limits). |
| backend | "embeddings" | embeddings (local, sentence-transformers), fastembed (ONNX, needs the fastembed extra), or llm_judge (stub). |
| model | None | Reformulation model - required (no default). None → GROUNDY_MODEL, else ValueError. |
| temperature | 0.0 | Reformulator temperature (0.0 = reproducible). Set None to omit it for models that reject the param. |
| base_url | None | Reformulation provider - required (no default). None → GROUNDY_BASE_URL, else ValueError. |
| api_key | None | None → GROUNDY_API_KEY (may be unset for keyless local servers). |
| verify_prompt | (terse instruction) | Prepended to each verify question (not the served one). None verifies with your raw answers. |
| cache | None | Any object with get/set. Runs groundy only on a miss. |
| tracer | None | Any object with the Tracer protocol. Emits a nested trace per check. Langfuse adapter in groundy[langfuse]. |
| on_unreliable | (a refusal) | Returned/cached when the model disagrees with itself. |
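The "None → environment variable, else ValueError" rule in the table can be read as the small resolver below. It is a sketch of the documented behaviour, not groundy's code: resolve_setting and its parameters are hypothetical names, and the env argument exists only so the function can be exercised without touching the real environment.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    explicit: Optional[str],
    env_var: str,
    *,
    required: bool,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Prefer the explicit argument, then the environment variable."""
    source = os.environ if env is None else env
    value = explicit if explicit is not None else source.get(env_var)
    if value is None and required:
        raise ValueError(f"{env_var} is not set and no explicit value was passed")
    return value
```

Under this reading, model and base_url would resolve with required=True, while api_key would use required=False so that keyless local servers still work.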
Honest limits - read this
groundy measures self-consistency, not correctness. Know the failure modes:
- Consistent confabulation passes. A confidently, consistently wrong model scores high. This catches uncertainty that surfaces as divergence - a large subset of hallucination, not all of it. Terse verify answers help: verbose hedging hides disagreement (verbose answers to "the 14th person on the Moon" all hedge alike and score ~0.9; terse ones confabulate different names → ~0.30, flagged). That's why verification is terse by default while your served answer stays verbose.
- Calibrate the threshold. With the default all-MiniLM-L6-v2 backend, scores cluster high (~0.75-0.95) for any related text. 0.75 is a starting point - tune it on your prompts.
- It costs ~N+2 LLM calls per check (n=5 → 7, sequential). Hence cache=: vet a question once, serve it free forever after.
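One way to calibrate is to collect consistency scores for two small sets of your own prompts, ones the model should answer and ones it cannot know, and pick the cut-off that separates them best. The helper below is a hypothetical sketch of that idea using balanced accuracy. It is not part of groundy, and the scores it takes are whatever you measured yourself.

```python
def pick_threshold(answerable_scores, unanswerable_scores):
    """Return (threshold, balanced_accuracy) for the best-separating cut-off.

    A score counts as "reliable" when score >= threshold, matching the rule
    reliable = consistency_score >= threshold.
    """
    if not answerable_scores or not unanswerable_scores:
        raise ValueError("both score lists must be non-empty")
    candidates = sorted(set(answerable_scores) | set(unanswerable_scores))
    best_threshold, best_accuracy = candidates[0], -1.0
    for t in candidates:
        # Share of answerable prompts that would be served.
        served = sum(s >= t for s in answerable_scores) / len(answerable_scores)
        # Share of unanswerable prompts that would be refused.
        refused = sum(s < t for s in unanswerable_scores) / len(unanswerable_scores)
        accuracy = (served + refused) / 2
        if accuracy > best_accuracy:  # ties keep the lowest threshold
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy
```

Because embedding scores bunch up near the top of the range, expect the chosen value to sit well above the midpoint. Re-run the calibration whenever you change the embedding backend or the verify_prompt.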
Observability
Optional and agnostic. Pass a tracer (a tiny Tracer protocol, just like cache=) and
every check() emits a nested trace: reformulate → verify ×n → score → served. Default
tracer=None → no tracing, zero overhead.
A Langfuse adapter ships in the box - add the langfuse extra:
uv add "groundy[langfuse] @ git+https://github.com/lopoc/groundy.git"
from groundy.observability.langfuse import LangfuseTracer
@groundy(tracer=LangfuseTracer()) # reads LANGFUSE_* from the env
def ask(q: str) -> str:
    ...
The core imports no vendor SDK - only you import the adapter. groundy owns one LLM call
(reformulation), so that node carries the model, temperature, token usage, and a prompt hash;
the answer_fn nodes show text + timing only. Prefer to log it yourself? The full
GroundyResult is still right there. For dev, GROUNDY_DEBUG=1 prints reformulations,
answers, and scores.
Develop
git clone https://github.com/lopoc/groundy.git
cd groundy
uv sync # creates .venv, installs runtime + dev tools
uv run python examples/basic.py # smoke test (needs GROUNDY_API_KEY + GROUNDY_BASE_URL + GROUNDY_MODEL)
uv run ruff check groundy # lint
uv run ruff format groundy # format
uv run pytest # tests (once a tests/ dir exists)
Roadmap
- CLI: groundy "your query"
- async def acheck() - parallelize the N calls
- llm_judge backend (structured 0-1 scoring - sharper than embeddings)
- Tests + benchmark (measured reliable-vs-hallucinated separation)
Origin
A practical take on the Laplace agent from the Socrates/Laplace judicial-AI framework.
MIT License