Hallucination detection for LLMs via semantic consistency checking

Maintainers

lopoc

Project description

groundy 🌱

Keep your LLM grounded - no ground truth required.

A grounded model agrees with itself: ask the same question a few different ways and the answer holds. A model that's improvising scatters. groundy wraps that check into one decorator that returns an answer you can trust - or a refusal when the model is just making things up. No labels, no fine-tuning, no retrieval.

from groundy import groundy

@groundy
def ask(q: str) -> str:
    return my_llm(q)   # your LLM call - any provider, returns a str

ask("Who proved Fermat's Last Theorem?")     # → "Andrew Wiles."
ask("Who was the 14th person on the Moon?")  # → "I'm not confident enough to answer that reliably."

Same signature, same str return. Nothing downstream changes - the answer just became trustworthy.

Get started

1. Install (not on PyPI yet):

uv add git+https://github.com/lopoc/groundy.git

That's the full library, ready to use — the @groundy decorator and the local embeddings backend work out of the box, no extras needed. Two optional extras add heavier integrations only if you want them:

Extra	Adds	Use it for
`fastembed`	ONNX embedding backend (no torch)	~15× lighter import (CLI cold start ~10s → ~1–2s). Select with `backend="fastembed"`.
`langfuse`	Langfuse tracing adapter	Trace every check (`tracer=LangfuseTracer()`). See Observability.

Add them in the brackets (comma-separated for several) — note the quotes and the name @ prefix when you include an extra:

uv add "groundy[fastembed,langfuse] @ git+https://github.com/lopoc/groundy.git"

Skip the extras and nothing breaks: fastembed and the Langfuse SDK are imported lazily — only when you actually select that backend or construct the tracer — so a plain install never needs them.

2. Give groundy an API key, a provider, and a model name. It makes one call of its own (reformulation) over any OpenAI-compatible API, all under its own GROUNDY_* namespace:

export GROUNDY_API_KEY=sk-...
export GROUNDY_BASE_URL=https://api.openai.com/v1   # your provider — name it, no default (OpenRouter, Groq, a local server…)
export GROUNDY_MODEL=gpt-4o-mini                     # the reformulation model (required, no default)

3. Decorate your LLM call and use it as usual:

from openai import OpenAI
from groundy import groundy

client = OpenAI()

@groundy
def ask(q: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": q}]
    ).choices[0].message.content

print(ask("Who proved Fermat's Last Theorem?"))
print(ask("Who was the 14th person on the Moon?"))

That's it. A ready-to-run version (decorator + cache + raw checker) ships in the repo: uv run python examples/basic.py.

💡 export GROUNDY_DEBUG=1 prints every reformulation, answer, and score.

Vibe-check it from the terminal

No code needed - groundy asks your question a few ways and shows you the matrix: each distinct answer with a bar for how much it agrees with the rest (groundy's own signal), consensus on top, outliers at the bottom. Identical answers collapse to one ×N row:

export GROUNDY_API_KEY=sk-...
export GROUNDY_BASE_URL=https://api.openai.com/v1   # your provider — required, no default
export GROUNDY_MODEL=gpt-4o-mini

groundy "Who was the 14th person to walk on the Moon?"

🌱 groundy

  ? Who was the 14th person to walk on the Moon?

  ⚠ uncertain   consistency 0.50   · 17.8s

  I'm not confident enough to answer that reliably.

  scatter
    █████░░░ 0.61  Eugene Cernan (the last person to walk on the Moon, Apollo 17)…
    ████░░░░ 0.52  Eugene Cernan was the last (12th) person to walk on the Moon…
    ███░░░░░ 0.41  Harrison Schmitt ×2

On a reliable question the bars stand tall together and collapse to one row (████████ 1.00 Paris ×5); on a shaky one they fan down as the answers pull apart.
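The rows above can be approximated in a few lines. This is a sketch of one plausible rendering, not groundy's actual CLI code: the helper names bar, collapse, and row are hypothetical, and the 8-cell width and the rounding rule are assumptions inferred from the sample output.

```python
from collections import Counter


def bar(score: float, width: int = 8) -> str:
    """Render a score in [0, 1] as a fixed-width block bar (assumed layout)."""
    clamped = max(0.0, min(1.0, score))
    filled = round(clamped * width)
    return "█" * filled + "░" * (width - filled)


def collapse(answers: list[str]) -> list[tuple[str, int]]:
    """Group identical answers, most frequent first."""
    return Counter(answers).most_common()


def row(score: float, answer: str, count: int = 1) -> str:
    """One output row: bar, score, answer, and a ×N suffix for repeats."""
    suffix = f" ×{count}" if count > 1 else ""
    return f"{bar(score)} {score:.2f}  {answer}{suffix}"


# Five identical answers collapse to a single row with a full bar.
(answer, count), = collapse(["Paris"] * 5)
print(row(1.0, answer, count))
print(row(0.41, "Harrison Schmitt", 2))
```

With this rounding rule, scores of 0.61, 0.52, and 0.41 fill 5, 4, and 3 cells, which matches the bars in the sample run.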

Want the raw structure? --matrix prints the full N×N pairwise heatmap - mutually-agreeing answers light up as bright blocks, so you see the clusters with no threshold and nothing aggregated:

  scatter
       a b c d e
    a  ██░░██████  Eugene Cernan was the last (12th)…
    b  ░░██░░░░░░  Gene Cernan
    c  ██░░██████  Eugene Cernan was the last (12th)…
    d  ██░░██████  Eugene Cernan was the last (12th)…
    e  ██░░██████  Eugene Cernan (the last person…)

It reads GROUNDY_API_KEY, GROUNDY_BASE_URL, and GROUNDY_MODEL like everything else. Pipe a question in (echo "…" | groundy), add -q for answer-only output, --matrix for the heatmap, -n/-t to tune, or --debug for the raw reformulation log.

How it works

An uncertain model disagrees with itself when you rephrase the question; a confident one doesn't. With @groundy(n=5), each call:

1. Rephrases the query 4 ways - the only LLM call groundy makes on its own.
2. Answers all 5 (the original plus the 4 rephrasings) tersely by calling your function. A verify_prompt is prepended so the comparison is about substance, not phrasing. These are the verify answers.
3. Scores agreement - embeds the verify answers locally (sentence-transformers) and averages their pairwise cosine similarity into a consistency_score in [0, 1].
4. Decides: reliable = consistency_score >= threshold.
5. Answers your way - only if reliable. It calls your function once more on the raw query for the served answer (your verbosity/prompt) and returns it. Unreliable → it skips this call and returns your on_unreliable string.

You serve the answer the way you want it, but verification is terse so verbosity can't hide disagreement. Cost: 7 LLM calls when reliable (1 reformulation + 5 verify + 1 served), 6 when unreliable, all synchronous - which is exactly why you cache it.
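The scoring and decision steps can be sketched without any model at all. The code below is a minimal illustration, not groundy's implementation: toy 2-D vectors stand in for real sentence embeddings, the function names are hypothetical, and clamping the mean into [0, 1] is an assumption about how negative cosine values are handled.

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consistency_score(embeddings: list[list[float]]) -> float:
    """Mean cosine similarity over all unordered pairs, clamped to [0, 1]."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two answers to compare")
    mean = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return min(1.0, max(0.0, mean))  # clamping is an assumption


def is_reliable(embeddings: list[list[float]], threshold: float = 0.75) -> bool:
    """The decision rule from the steps above: score >= threshold."""
    return consistency_score(embeddings) >= threshold


# Toy "embeddings": three answers that agree, and three where one diverges.
agreeing = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
scattered = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

print(consistency_score(agreeing), is_reliable(agreeing))
print(round(consistency_score(scattered), 3), is_reliable(scattered))
```

In the scattered case only one of the three pairs agrees, so a single outlier pulls the mean down to about a third, well below the default threshold.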

Cache it - pay once per cluster of questions

groundy is expensive, so hand it a cache and it runs only on a miss. A cache is anything with get(key) -> str | None and set(key, value). The real win is a semantic cache: a hit fires on any question close enough in meaning, so groundy runs once per cluster of similar questions and serves the whole neighbourhood for free.

from groundy import groundy

# Bring any semantic cache exposing get(key) -> str | None and set(key, value). A hit fires
# on questions close in *meaning*, so groundy runs once per cluster (GPTCache, Momento,
# Upstash, Redis + RedisVL - a 3-line adapter if the method names differ).
cache = SemanticCache(threshold=0.9)

@groundy(cache=cache)
def ask(q: str) -> str:
    return client.chat.completions.create(...).choices[0].message.content   # the RAW model

ask("Who discovered penicillin?")          # MISS → full check → verdict cached
ask("Who was penicillin discovered by?")   # HIT  → same meaning, zero LLM calls

On a hit groundy never runs. On a miss it checks, then cache.sets the verdict - refusals included, so "the model can't answer this" is remembered too.

⚠️ The one rule: groundy goes above your semantic cache, never below it. If a semantic cache sits inside the wrapped call, the reformulations - semantically equivalent by design - all hit the same entry, score a perfect 1.0, and every check falsely passes. The semantic cache belongs on top (via cache=), caching the verdict.
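For local experiments, the get/set protocol described above can be met by a plain exact-match cache. The class below is a hypothetical sketch (DictCache is not part of groundy), and it only normalizes case and whitespace, so it gives none of the per-cluster savings of a real semantic cache.

```python
from typing import Optional


class DictCache:
    """Exact-match cache exposing get(key) -> str | None and set(key, value)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _normalize(key: str) -> str:
        # Case- and whitespace-insensitive only; this is NOT semantic matching.
        return " ".join(key.lower().split())

    def get(self, key: str) -> Optional[str]:
        return self._store.get(self._normalize(key))

    def set(self, key: str, value: str) -> None:
        self._store[self._normalize(key)] = value


cache = DictCache()
cache.set("Who discovered penicillin?", "Alexander Fleming.")

hit = cache.get("  who discovered   PENICILLIN? ")     # same text after normalizing
miss = cache.get("Who was penicillin discovered by?")  # same meaning, different text
print(hit, miss)
```

The second lookup misses even though the meaning is identical. That gap is what a semantic cache closes.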

When you want the numbers

The decorator hides the scores on purpose. Reach past it for the rich result:

from groundy import GroundyChecker

checker = GroundyChecker(n=5, threshold=0.75)
r = checker.check("What does Italian Civil Code art. 2043 establish?", answer_fn=my_llm)

r.consistency_score   # 0.0–1.0
r.is_reliable         # bool
r.best_answer         # the served answer if reliable, else None
r.consensus_answer, r.agreement_scores, r.similarity_scores, r.latency_ms

best_answer is the served answer (your raw call) when reliable, and None when not: on a genuine split the right move is to refuse, not guess. The decorator turns that None into your on_unreliable string. (consensus_answer, the verify answer that agrees most with the rest, is diagnostic only.)
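To make "agrees most with the rest" concrete, here is a small sketch that picks a consensus answer from a pairwise similarity matrix. It assumes each answer's agreement is its mean similarity to the other answers; the function names and the similarity values are made up for illustration and may not match how groundy computes agreement_scores.

```python
def agreement_scores(sim: list[list[float]]) -> list[float]:
    """Mean similarity of each answer to every other answer (diagonal excluded)."""
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two answers")
    return [
        sum(sim[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def consensus_index(sim: list[list[float]]) -> int:
    """Index of the answer with the highest agreement score."""
    scores = agreement_scores(sim)
    return max(range(len(scores)), key=scores.__getitem__)


# Illustrative verify answers and a made-up symmetric similarity matrix.
answers = ["Eugene Cernan", "Harrison Schmitt", "Gene Cernan"]
sim = [
    [1.0, 0.4, 0.9],
    [0.4, 1.0, 0.5],
    [0.9, 0.5, 1.0],
]

scores = agreement_scores(sim)
print([round(s, 2) for s in scores])
print(answers[consensus_index(sim)])
```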

Run on any vendor

There are two independent LLM tasks, configured separately:

Answering - your decorated function. OpenAI, LiteLLM, Ollama, anything returning a str. There's no answer_model= knob: the answer call is your function.
Reformulating - groundy's own OpenAI-compatible call. Set GROUNDY_MODEL + GROUNDY_BASE_URL (both required, no default provider), or pass model / base_url / api_key.

So you can reformulate on a cheap, fast model and answer on a stronger one - even across providers:

@groundy(
    model="llama-3.3-70b-versatile",            # reformulate on Groq…
    base_url="https://api.groq.com/openai/v1",
    api_key="gsk_...",
)
def ask(q: str) -> str:
    return openai_client.chat.completions.create(   # …answer on OpenAI
        model="gpt-4o", messages=[{"role": "user", "content": q}]
    ).choices[0].message.content

Any OpenAI-compatible endpoint works - that covers OpenAI, OpenRouter, Groq, Together, Fireworks, and local servers (vLLM, llama.cpp, Ollama).

Knobs

Param	Default	What it does
`n`	`5`	Answers compared: original + n-1 reformulations. Must be ≥ 2. Higher = sturdier + pricier.
`threshold`	`0.75`	Score below this → refusal. Calibrate it (see limits).
`backend`	`"embeddings"`	`embeddings` (local, sentence-transformers), `fastembed` (ONNX, needs the `fastembed` extra), or `llm_judge` (stub).
`model`	`None`	Reformulation model - required (no default). `None` → `GROUNDY_MODEL`, else `ValueError`.
`temperature`	`0.0`	Reformulator temperature (`0.0` = reproducible). Set `None` to omit it for models that reject the param.
`base_url`	`None`	Reformulation provider — required (no default). `None` → `GROUNDY_BASE_URL`, else `ValueError`.
`api_key`	`None`	`None` → `GROUNDY_API_KEY` (may be unset for keyless local servers).
`verify_prompt`	(terse instruction)	Prepended to the verify answers (not the served one). `None` verifies with your raw answers.
`cache`	`None`	Any object with `get`/`set`. Runs groundy only on a miss.
`tracer`	`None`	Any object with the `Tracer` protocol. Emits a nested trace per check. Langfuse adapter in `groundy[langfuse]`.
`on_unreliable`	(a refusal)	Returned/cached when the model disagrees with itself.
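The `None` → environment variable → `ValueError` fallback described for `model`, `base_url`, and `api_key` can be sketched as below. The resolve helper is hypothetical and only mirrors the behavior stated in the table; treating an empty variable as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve(
    value: Optional[str],
    env_name: str,
    *,
    required: bool,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Explicit argument wins; otherwise fall back to the environment."""
    if value is not None:
        return value
    from_env = env.get(env_name)
    if from_env:  # an empty string is treated as unset (assumption)
        return from_env
    if required:
        raise ValueError(f"{env_name} is not set and no explicit value was passed")
    return None


fake_env = {"GROUNDY_MODEL": "gpt-4o-mini"}

# model: required, taken from the environment when not passed explicitly.
print(resolve(None, "GROUNDY_MODEL", required=True, env=fake_env))
# api_key: optional, may stay None for keyless local servers.
print(resolve(None, "GROUNDY_API_KEY", required=False, env=fake_env))
```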

Honest limits - read this

groundy measures self-consistency, not correctness. Know the failure modes:

Consistent confabulation passes. A confidently, consistently wrong model scores high. This catches uncertainty that surfaces as divergence - a large subset of hallucination, not all of it. Terse verify answers help: verbose hedging hides disagreement (verbose answers to "the 14th person on the Moon" all hedge alike and score ~0.9; terse ones confabulate different names → ~0.30, flagged). That's why verification is terse by default while your served answer stays verbose.
Calibrate the threshold. With the default all-MiniLM-L6-v2 backend, scores cluster high (~0.75–0.95) for any related text. 0.75 is a starting point - tune it on your prompts.
It costs ~N+2 LLM calls per check (n=5 ≈ 7, sequential). Hence cache=: vet a question once, serve it free forever after.
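Both the cost figure and the calibration advice reduce to a little arithmetic. The sketch below counts calls per check as stated above, then picks a threshold from labeled scores by simple accuracy. The helper names are hypothetical, the labeled data is made up, and accuracy is just one possible calibration criterion.

```python
def llm_calls(n: int, reliable: bool) -> int:
    """1 reformulation + n verify answers + 1 served answer (only if reliable)."""
    if n < 2:
        raise ValueError("n must be >= 2")
    return 1 + n + (1 if reliable else 0)


def best_threshold(scored: list[tuple[float, bool]]) -> float:
    """Pick the observed score that best separates correct from incorrect answers.

    scored: (consistency_score, answer_was_actually_correct) pairs that you
    collected and labeled yourself on your own prompts.
    """
    def accuracy(t: float) -> float:
        return sum((s >= t) == ok for s, ok in scored) / len(scored)

    candidates = sorted({s for s, _ in scored})
    return max(candidates, key=accuracy)


# Made-up calibration set: scores cluster high, as the limits above warn.
labeled = [
    (0.95, True), (0.90, True), (0.88, True),
    (0.80, False), (0.78, False), (0.60, False),
]

print(llm_calls(5, reliable=True), llm_calls(5, reliable=False))
print(best_threshold(labeled))
```

On this toy set the best cut lands at 0.88, above the 0.75 default, which illustrates why the default is only a starting point.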

Observability

Optional and agnostic. Pass a tracer (a tiny Tracer protocol, just like cache=) and every check() emits a nested trace: reformulate → verify ×n → score → served. Default tracer=None → no tracing, zero overhead.

A Langfuse adapter ships in the box — add the langfuse extra:

uv add "groundy[langfuse] @ git+https://github.com/lopoc/groundy.git"

from groundy.observability.langfuse import LangfuseTracer

@groundy(tracer=LangfuseTracer())   # reads LANGFUSE_* from the env
def ask(q: str) -> str:
    ...

The core imports no vendor SDK - only you import the adapter. groundy owns one LLM call (reformulation), so that node carries the model, temperature, token usage, and a prompt hash; the answer_fn nodes show text + timing only. Prefer to log it yourself? The full GroundyResult is still right there. For dev, GROUNDY_DEBUG=1 prints reformulations, answers, and scores.

Develop

git clone https://github.com/lopoc/groundy.git
cd groundy
uv sync                              # creates .venv, installs runtime + dev tools

uv run python examples/basic.py      # smoke test (needs GROUNDY_API_KEY, GROUNDY_BASE_URL, GROUNDY_MODEL)
uv run ruff check groundy            # lint
uv run ruff format groundy           # format
uv run pytest                        # tests (once a tests/ dir exists)

Roadmap

CLI: groundy "your query"
async def acheck() - parallelize the N calls
llm_judge backend (structured 0–1 scoring - sharper than embeddings)
Tests + benchmark (measured reliable-vs-hallucinated separation)

Origin

A practical take on the Laplace agent from the Socrates/Laplace judicial-AI framework.

MIT License

Project details

Release history

This version

0.3.0

Jul 2, 2026

Download files

Source Distribution

groundy-0.3.0.tar.gz (28.0 kB)

Uploaded Jul 2, 2026 Source

Built Distribution

groundy-0.3.0-py3-none-any.whl (28.5 kB)

Uploaded Jul 2, 2026 Python 3

File details

Details for the file groundy-0.3.0.tar.gz.

File metadata

Download URL: groundy-0.3.0.tar.gz
Upload date: Jul 2, 2026
Size: 28.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for groundy-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`a04b69892bd045da6d056a3eab2f4a697f1ffdb5f66e00e216e69244c9459a3a`
MD5	`fb3d05cd3b387a92cb1b9217de833d41`
BLAKE2b-256	`939798c5ac06364b1205a2c6c611b60dcc91009fbe399117766377c10fb030dd`

Provenance

The following attestation bundles were made for groundy-0.3.0.tar.gz:

Publisher: release.yml on lopoc/groundy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: groundy-0.3.0.tar.gz
- Subject digest: a04b69892bd045da6d056a3eab2f4a697f1ffdb5f66e00e216e69244c9459a3a
- Sigstore transparency entry: 2049392765
- Sigstore integration time: Jul 2, 2026
Source repository:
- Permalink: lopoc/groundy@9ecc05b813a96ea12412c1786a9457df11fa435d
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/lopoc
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@9ecc05b813a96ea12412c1786a9457df11fa435d
- Trigger Event: push

File details

Details for the file groundy-0.3.0-py3-none-any.whl.

File metadata

Download URL: groundy-0.3.0-py3-none-any.whl
Upload date: Jul 2, 2026
Size: 28.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for groundy-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f5d540a2169d9292f0002e8ec979129ecb0b59931943c28d4be77e973152b3cf`
MD5	`b48a4f6e4c3e27c668626bde7c26a637`
BLAKE2b-256	`17aed0cd1d5ac83534fecdf1988feb86cdbfc6fab68b732b4ea6565fe152ed95`

Provenance

The following attestation bundles were made for groundy-0.3.0-py3-none-any.whl:

Publisher: release.yml on lopoc/groundy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: groundy-0.3.0-py3-none-any.whl
- Subject digest: f5d540a2169d9292f0002e8ec979129ecb0b59931943c28d4be77e973152b3cf
- Sigstore transparency entry: 2049393110
- Sigstore integration time: Jul 2, 2026
Source repository:
- Permalink: lopoc/groundy@9ecc05b813a96ea12412c1786a9457df11fa435d
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/lopoc
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@9ecc05b813a96ea12412c1786a9457df11fa435d
- Trigger Event: push
