Hallucination detection for LLMs via semantic consistency checking

groundy 🌱

Keep your LLM grounded - no ground truth required.

A grounded model agrees with itself: ask the same question a few different ways and the answer holds. A model that's improvising scatters. groundy wraps that check into one decorator that returns an answer you can trust - or a refusal when the model is just making things up. No labels, no fine-tuning, no retrieval.

from groundy import groundy

@groundy
def ask(q: str) -> str:
    return my_llm(q)   # your LLM call - any provider, returns a str

ask("Who proved Fermat's Last Theorem?")     # → "Andrew Wiles."
ask("Who was the 14th person on the Moon?")  # → "I'm not confident enough to answer that reliably."

Same signature, same str return. Nothing downstream changes - the answer just became trustworthy.

Get started

1. Install (not on PyPI yet):

uv add git+https://github.com/lopoc/groundy.git

That's the full library, ready to use - the @groundy decorator and the local embeddings backend work out of the box, no extras needed. Two optional extras add heavier integrations only if you want them:

| Extra | Adds | Use it for |
|---|---|---|
| fastembed | ONNX embedding backend (no torch) | ~15× lighter import (CLI cold start ~10s → ~1–2s). Select with backend="fastembed". |
| langfuse | Langfuse tracing adapter | Trace every check (tracer=LangfuseTracer()). See Observability. |

Add them in the brackets (comma-separated for several) - note the quotes and the name @ prefix when you include an extra:

uv add "groundy[fastembed,langfuse] @ git+https://github.com/lopoc/groundy.git"

Skip the extras and nothing breaks: fastembed and the Langfuse SDK are imported lazily - only when you actually select that backend or construct the tracer - so a plain install never needs them.

2. Give groundy an API key, a provider, and a model name. It makes one call of its own - reformulation, over any OpenAI-compatible API - all under its own GROUNDY_* namespace:

export GROUNDY_API_KEY=sk-...
export GROUNDY_BASE_URL=https://api.openai.com/v1   # your provider - name it, no default (OpenRouter, Groq, a local server…)
export GROUNDY_MODEL=gpt-4o-mini                     # the reformulation model (required, no default)

3. Decorate your LLM call and use it as usual:

from openai import OpenAI
from groundy import groundy

client = OpenAI()

@groundy
def ask(q: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": q}]
    ).choices[0].message.content

print(ask("Who proved Fermat's Last Theorem?"))
print(ask("Who was the 14th person on the Moon?"))

That's it. A ready-to-run version (decorator + cache + raw checker) ships in the repo: uv run python examples/basic.py.

💡 export GROUNDY_DEBUG=1 prints every reformulation, answer, and score.

Vibe-check it from the terminal

No code needed - groundy asks your question a few ways and shows you the matrix: each distinct answer with a bar for how much it agrees with the rest (groundy's own signal), consensus on top, outliers at the bottom. Identical answers collapse to one ×N row:

export GROUNDY_API_KEY=sk-...
export GROUNDY_BASE_URL=https://api.openai.com/v1   # your provider - required, no default
export GROUNDY_MODEL=gpt-4o-mini

groundy "Who was the 14th person to walk on the Moon?"
🌱 groundy

  ? Who was the 14th person to walk on the Moon?

  ⚠ uncertain   consistency 0.50   · 17.8s

  I'm not confident enough to answer that reliably.

  scatter
    █████░░░ 0.61  Eugene Cernan (the last person to walk on the Moon, Apollo 17)…
    ████░░░░ 0.52  Eugene Cernan was the last (12th) person to walk on the Moon…
    ███░░░░░ 0.41  Harrison Schmitt ×2

On a reliable question the bars stand tall together and collapse to one row (████████ 1.00 Paris ×5); on a shaky one they fan down as the answers pull apart.
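A rough sketch of how such a scatter view could be rendered, for readers who want to reproduce it from the raw scores. The helper names (bar, scatter), the 8-cell width, and the round-to-nearest rule are assumptions chosen to match the sample output above, not groundy's actual rendering code:

```python
from collections import Counter


def bar(score: float, width: int = 8) -> str:
    """Render a score in [0, 1] as a fixed-width block bar (assumed: round to nearest cell)."""
    filled = round(max(0.0, min(1.0, score)) * width)
    return "█" * filled + "░" * (width - filled)


def scatter(answers: list[str], agreement: list[float]) -> list[str]:
    """Collapse identical answers into one row and sort rows by agreement, best first."""
    counts = Counter(answers)
    best: dict[str, float] = {}
    for text, score in zip(answers, agreement):
        best[text] = max(score, best.get(text, 0.0))
    rows = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [
        f"{bar(score)} {score:.2f}  {text}"
        + (f" ×{counts[text]}" if counts[text] > 1 else "")
        for text, score in rows
    ]


if __name__ == "__main__":
    for line in scatter(["Paris"] * 5, [1.0] * 5):
        print(line)
```

With five identical answers scoring 1.00 this yields a single full bar with a ×5 suffix, which is the "stand tall together" case described above.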

Want the raw structure? --matrix prints the full N×N pairwise heatmap - mutually-agreeing answers light up as bright blocks, so you see the clusters with no threshold and nothing aggregated:

  scatter
       a b c d e
    a  ██░░██████  Eugene Cernan was the last (12th)…
    b  ░░██░░░░░░  Gene Cernan
    c  ██░░██████  Eugene Cernan was the last (12th)…
    d  ██░░██████  Eugene Cernan was the last (12th)…
    e  ██░░██████  Eugene Cernan (the last person…)

It reads GROUNDY_API_KEY, GROUNDY_BASE_URL, and GROUNDY_MODEL like everything else. Pipe a question in (echo "…" | groundy), add -q for answer-only output, --matrix for the heatmap, -n/-t to tune, or --debug for the raw reformulation log.

How it works

An uncertain model disagrees with itself when you rephrase the question; a confident one doesn't. With @groundy(n=5), each call:

  1. Rephrases the query 4 ways - the one LLM call groundy makes itself.
  2. Answers all 5 tersely. A verify_prompt is prepended so the comparison is about substance, not phrasing. These are the verify answers.
  3. Scores agreement - embeds the verify answers locally (sentence-transformers) and averages their pairwise cosine similarity into a consistency_score in [0, 1].
  4. Decides: reliable = consistency_score >= threshold.
  5. Answers your way - only if reliable. It calls your function once more on the raw query for the served answer (your verbosity/prompt) and returns it. Unreliable → it skips this call and returns your on_unreliable string.

You serve the answer the way you want it, but verification is terse so verbosity can't hide disagreement. Cost: 7 LLM calls when reliable (1 reformulation + 5 verify + 1 served), 6 when unreliable, all synchronous - which is exactly why you cache it.
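The five steps can be sketched in plain Python. This is an illustration of the flow described above, not groundy's source: check, reformulate_fn, embed_fn, and the one-hot toy embedder are hypothetical names, and the real library embeds with sentence-transformers rather than a lookup table.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consistency_score(vectors):
    """Average cosine similarity over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def check(query, answer_fn, reformulate_fn, embed_fn, n=5, threshold=0.75,
          verify_prompt="Answer in a few words: ", on_unreliable="(refusal)"):
    """Return (text, score): the served answer if consistent, else on_unreliable."""
    if n < 2:
        raise ValueError("n must be >= 2")
    queries = [query] + reformulate_fn(query, n - 1)             # step 1: one call
    verify = [answer_fn(verify_prompt + q) for q in queries]     # step 2: n terse answers
    score = consistency_score([embed_fn(a) for a in verify])     # step 3: local scoring
    if score >= threshold:                                       # step 4: decide
        return answer_fn(query), score                           # step 5: served answer
    return on_unreliable, score                                  # served call skipped


def one_hot_embedder(dim=8):
    """Toy embedding (assumption): each distinct string gets its own axis."""
    axes = {}

    def embed(text):
        vec = [0.0] * dim
        vec[axes.setdefault(text, len(axes))] = 1.0
        return vec

    return embed
```

With n=5 the sketch makes 1 reformulation call, 5 verify calls, and 1 served call on the reliable path (7 in total), and skips the served call on the unreliable path (6), which matches the cost stated above.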

Cache it - pay once per cluster of questions

groundy is expensive, so hand it a cache and it runs only on a miss. A cache is anything with get(key) -> str | None and set(key, value). The real win is a semantic cache: a hit fires on any question close enough in meaning, so groundy runs once per cluster of similar questions and serves the whole neighbourhood for free.

from groundy import groundy

# Bring any semantic cache exposing get(key) -> str | None and set(key, value). A hit fires
# on questions close in *meaning*, so groundy runs once per cluster (GPTCache, Momento,
# Upstash, Redis + RedisVL - a 3-line adapter if the method names differ).
cache = SemanticCache(threshold=0.9)

@groundy(cache=cache)
def ask(q: str) -> str:
    return client.chat.completions.create(...).choices[0].message.content   # the RAW model

ask("Who discovered penicillin?")          # MISS → full check → verdict cached
ask("Who was penicillin discovered by?")   # HIT  → same meaning, zero LLM calls

On a hit groundy never runs. On a miss it checks, then cache.sets the verdict - refusals included, so "the model can't answer this" is remembered too.
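To make the get/set contract concrete, here is a minimal exact-match cache that satisfies it, plus the miss-then-store flow in isolation. DictCache and cached are illustrative names, not part of groundy; a real semantic cache would match on meaning rather than on a normalised string.

```python
from typing import Callable, Optional, Protocol


class Cache(Protocol):
    """The contract described above: get returns None on a miss."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class DictCache:
    """Exact-match cache; case and whitespace are normalised so trivial variants hit."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _norm(key: str) -> str:
        return " ".join(key.lower().split())

    def get(self, key: str) -> Optional[str]:
        return self._store.get(self._norm(key))

    def set(self, key: str, value: str) -> None:
        self._store[self._norm(key)] = value


def cached(fn: Callable[[str], str], cache: Cache) -> Callable[[str], str]:
    """Run fn only on a miss and remember whatever it returned, refusals included."""

    def wrapper(q: str) -> str:
        hit = cache.get(q)
        if hit is not None:
            return hit
        value = fn(q)
        cache.set(q, value)
        return value

    return wrapper
```

An object like DictCache should be usable wherever a get/set cache is expected, though it only saves calls on near-identical strings; the per-cluster savings need a semantic cache.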

โš ๏ธ The one rule: groundy goes above your semantic cache, never below it. If a semantic cache sits inside the wrapped call, the reformulations - semantically equivalent by design - all hit the same entry, score a perfect 1.0, and every check falsely passes. The semantic cache belongs on top (via cache=), caching the verdict.

When you want the numbers

The decorator hides the scores on purpose. Reach past it for the rich result:

from groundy import GroundyChecker

checker = GroundyChecker(n=5, threshold=0.75)
r = checker.check("What does Italian Civil Code art. 2043 establish?", answer_fn=my_llm)

r.consistency_score   # 0.0–1.0
r.is_reliable         # bool
r.best_answer         # the served answer if reliable, else None
r.consensus_answer, r.agreement_scores, r.similarity_scores, r.latency_ms

best_answer is the served answer (your raw call) when reliable, and None when not - on a genuine split the right move is to refuse, not guess. The decorator turns that None into your on_unreliable string. (consensus_answer, the verify answer that agrees most with the rest, is diagnostic only.)
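One plausible reading of "agrees most with the rest", sketched on a pairwise similarity matrix. The functions below are illustrative and assume per-answer agreement is the mean similarity to the other answers; groundy's exact aggregation may differ.

```python
def agreement_scores(sim):
    """Mean similarity of each answer to all the others (diagonal excluded)."""
    n = len(sim)
    return [sum(sim[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]


def consensus_index(sim):
    """Index of the answer with the highest agreement score."""
    scores = agreement_scores(sim)
    return max(range(len(scores)), key=scores.__getitem__)


def served(best_answer, on_unreliable):
    """Mirror of the decorator's behaviour described above: None becomes the refusal."""
    return best_answer if best_answer is not None else on_unreliable


# Toy symmetric matrix for three verify answers (made-up numbers).
sim = [
    [1.0, 0.75, 0.25],
    [0.75, 1.0, 0.5],
    [0.25, 0.5, 1.0],
]
print(agreement_scores(sim), consensus_index(sim))
```

Here the second answer sits closest to the other two, so it would be the consensus. That is useful when inspecting a split, but it is not something to serve once the check has failed.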

Run on any vendor

There are two independent LLM tasks, configured separately:

  • Answering - your decorated function. OpenAI, LiteLLM, Ollama, anything returning a str. There's no answer_model= knob: the answer call is your function.
  • Reformulating - groundy's own OpenAI-compatible call. Set GROUNDY_MODEL + GROUNDY_BASE_URL (both required, no default provider), or pass model / base_url / api_key.

So you can reformulate on a cheap, fast model and answer on a stronger one - even across providers:

@groundy(
    model="llama-3.3-70b-versatile",            # reformulate on Groq…
    base_url="https://api.groq.com/openai/v1",
    api_key="gsk_...",
)
def ask(q: str) -> str:
    return openai_client.chat.completions.create(   # …answer on OpenAI
        model="gpt-4o", messages=[{"role": "user", "content": q}]
    ).choices[0].message.content

Any OpenAI-compatible endpoint works - that covers OpenAI, OpenRouter, Groq, Together, Fireworks, and local servers (vLLM, llama.cpp, Ollama).

Knobs

| Param | Default | What it does |
|---|---|---|
| n | 5 | Answers compared: original + n-1 reformulations. Must be ≥ 2. Higher = sturdier + pricier. |
| threshold | 0.75 | Score below this → refusal. Calibrate it (see limits). |
| backend | "embeddings" | embeddings (local, sentence-transformers), fastembed (ONNX, needs the fastembed extra), or llm_judge (stub). |
| model | None | Reformulation model - required (no default). None → GROUNDY_MODEL, else ValueError. |
| temperature | 0.0 | Reformulator temperature (0.0 = reproducible). Set None to omit it for models that reject the param. |
| base_url | None | Reformulation provider - required (no default). None → GROUNDY_BASE_URL, else ValueError. |
| api_key | None | None → GROUNDY_API_KEY (may be unset for keyless local servers). |
| verify_prompt | (terse instruction) | Prepended to the prompts for the verify answers (not the served one). None verifies with your raw answers. |
| cache | None | Any object with get/set. Runs groundy only on a miss. |
| tracer | None | Any object with the Tracer protocol. Emits a nested trace per check. Langfuse adapter in groundy[langfuse]. |
| on_unreliable | (a refusal) | Returned/cached when the model disagrees with itself. |
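The "None → environment variable, else ValueError" rule for model, base_url, and api_key can be pictured as below. resolve is a hypothetical helper written for this illustration; only the environment variable names come from the table.

```python
import os


def resolve(explicit, env_name, required=True, environ=None):
    """Explicit argument wins, then the environment, then ValueError if required."""
    env = os.environ if environ is None else environ
    if explicit is not None:
        return explicit
    value = env.get(env_name)
    if value:
        return value
    if required:
        raise ValueError(f"{env_name} is not set and no explicit value was passed")
    return None  # optional setting, e.g. an API key for a keyless local server


if __name__ == "__main__":
    fake_env = {"GROUNDY_MODEL": "gpt-4o-mini"}
    print(resolve(None, "GROUNDY_MODEL", environ=fake_env))
    print(resolve(None, "GROUNDY_API_KEY", required=False, environ=fake_env))
```

Passing environ explicitly keeps the sketch testable without touching the real process environment.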

Honest limits - read this

groundy measures self-consistency, not correctness. Know the failure modes:

  • Consistent confabulation passes. A confidently, consistently wrong model scores high. This catches uncertainty that surfaces as divergence - a large subset of hallucination, not all of it. Terse verify answers help: verbose hedging hides disagreement (verbose answers to "the 14th person on the Moon" all hedge alike and score ~0.9; terse ones confabulate different names → ~0.30, flagged). That's why verification is terse by default while your served answer stays verbose.
  • Calibrate the threshold. With the default all-MiniLM-L6-v2 backend, scores cluster high (~0.75–0.95) for any related text. 0.75 is a starting point - tune it on your prompts.
  • It costs ~N+2 LLM calls per check (n=5 ≈ 7, sequential). Hence cache=: vet a question once, serve it free forever after.
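One simple way to calibrate, assuming you have collected consistency scores for questions you know the model answers well and for questions you know it fumbles. best_threshold is a hypothetical helper, and the scores in the demo are invented for illustration, not measurements.

```python
def best_threshold(reliable_scores, unreliable_scores):
    """Pick the cut-off that maximises accuracy on hand-labelled scores."""
    candidates = sorted(set(reliable_scores) | set(unreliable_scores))
    total = len(reliable_scores) + len(unreliable_scores)

    def accuracy(t):
        kept = sum(s >= t for s in reliable_scores)       # correctly served
        refused = sum(s < t for s in unreliable_scores)   # correctly refused
        return (kept + refused) / total

    return max(candidates, key=accuracy)


if __name__ == "__main__":
    # Invented scores: everything lands above 0.75, as tends to happen when
    # embedding scores cluster high for any related text.
    reliable = [0.93, 0.90, 0.88, 0.95]
    unreliable = [0.80, 0.84, 0.78]
    print(best_threshold(reliable, unreliable))
```

On these made-up numbers a cut-off of 0.75 would let every unreliable answer through, while the sweep picks a value that separates the two groups. A real calibration would want far more samples and a held-out set.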

Observability

Optional and agnostic. Pass a tracer (a tiny Tracer protocol, just like cache=) and every check() emits a nested trace: reformulate → verify ×n → score → served. Default tracer=None → no tracing, zero overhead.

A Langfuse adapter ships in the box - add the langfuse extra:

uv add "groundy[langfuse] @ git+https://github.com/lopoc/groundy.git"

from groundy import groundy
from groundy.observability.langfuse import LangfuseTracer

@groundy(tracer=LangfuseTracer())   # reads LANGFUSE_* from the env
def ask(q: str) -> str:
    ...

The core imports no vendor SDK - only you import the adapter. groundy owns one LLM call (reformulation), so that node carries the model, temperature, token usage, and a prompt hash; the answer_fn nodes show text + timing only. Prefer to log it yourself? The full GroundyResult is still right there. For dev, GROUNDY_DEBUG=1 prints reformulations, answers, and scores.

Develop

git clone https://github.com/lopoc/groundy.git
cd groundy
uv sync                              # creates .venv, installs runtime + dev tools

uv run python examples/basic.py      # smoke test (needs GROUNDY_API_KEY, GROUNDY_BASE_URL, GROUNDY_MODEL)
uv run ruff check groundy            # lint
uv run ruff format groundy           # format
uv run pytest                        # tests (once a tests/ dir exists)

Roadmap

  • CLI: groundy "your query"
  • async def acheck() - parallelize the N calls
  • llm_judge backend (structured 0–1 scoring - sharper than embeddings)
  • Tests + benchmark (measured reliable-vs-hallucinated separation)

Origin

A practical take on the Laplace agent from the Socrates/Laplace judicial-AI framework.

MIT License
