Zero-token, precision-guaranteed verifier for LLM/agent multi-hop relational reasoning

grounded-reasoning — Grounded, Guaranteed Reasoning for LLMs & Agents

License: MIT · Python 3.11+

TL;DR. LLMs hallucinate on multi-hop relational reasoning. This is a relation-algebra verifier an agent calls to check a claim before asserting it: zero model tokens, precision-guaranteed (accepts a claim iff a grounded proof path exists), language-agnostic, and provider-agnostic. Plugs in as a library, a function-calling tool, or an MCP server. Validated on real LLMs (DeepSeek et al.) and the public CLUTRR benchmark. See docs/integration.md.

📄 Full paper: PAPER.md · Integration guide: docs/integration.md · Try it in 30 seconds: quickstart notebook

Read in Vietnamese: README.vi.md


Why this exists

LLMs are solid on one-hop facts but collapse on composition — chaining several correct facts into a multi-step conclusion. On CLUTRR (kinship reasoning), DeepSeek's accuracy falls off with depth, while a grounded operator-composition solver holds ~100% flat — at zero tokens:

acc
100% ●─────●─────●─────●─────●─────●─────●   ● Grounded solver (algebra, 0 tokens)
 90% |
 80% ○
 70% |  ╲
 60% |   ╲
 50% |    ╲
 40% |     ○           ○                     ○ DeepSeek (LLM)
 30% |      ╲         ╱ ╲
 20% |       ○─────○     ╲
 10% |                    ○─────○
  0% +──┴─────┴─────┴─────┴─────┴─────┴─────┴─
      hop 2    3     4     5     6     7     8   (composition steps)

     hop:      2     3     4     5     6     7     8
     DeepSeek: 83%   42%   25%   25%   42%   17%   8%
     Solver:   100%  100%  100%  100%  100%  100%  100%

(CLUTRR/v1 gen_train234_test2to10, clean-chain, n=12/hop; full test set n=635: solver covers 99.5%, accuracy 99.2%. src/experiments/clutrr_eval.py.)
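To make "operator composition" concrete, here is a minimal sketch of how a composition-table solver might walk a clean kinship chain. The table entries and the names `COMPOSE` and `solve_chain` are illustrative assumptions, not the code in src/experiments/clutrr_eval.py. The solver folds the chain one relation at a time and abstains (returns None) when a step is not in the table, which is one way coverage can fall slightly below 100% while accuracy on answered items stays high.

```python
from functools import reduce

# Hypothetical composition table (an assumption for illustration).
# Reading: if A is the r1 of B and B is the r2 of C,
# then A is the COMPOSE[(r1, r2)] of C.
COMPOSE = {
    ("parent", "parent"): "grandparent",
    ("grandparent", "parent"): "great-grandparent",
    ("parent", "grandparent"): "great-grandparent",
    ("sibling", "parent"): "aunt-or-uncle",
}


def solve_chain(relations):
    """Fold a chain of relations left to right.

    Returns the composed relation, or None if the chain is empty or
    any step is missing from the table (the solver abstains).
    """
    if not relations:
        return None

    def step(acc, rel):
        # Once a step fails, keep propagating None to the end of the chain.
        if acc is None:
            return None
        return COMPOSE.get((acc, rel))

    return reduce(step, relations[1:], relations[0])


print(solve_chain(["parent", "parent", "parent"]))  # great-grandparent
print(solve_chain(["parent", "sibling"]))           # None (not in the table)
```

No model call appears anywhere in the loop: each hop is a dictionary lookup, so the cost does not grow in tokens as the chain gets longer.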


What it is / is NOT (honestly)

Is: a guaranteed reasoning-verification layer built on relation operator algebra.

  • Precision = 1.0, guaranteed (Theorem G) — accepts a claim only if a grounded proof path exists.
  • Zero extra tokens — local matrix multiplication, no LLM call. Compare to "have the LLM self-verify," which costs +110% tokens for 34% precision.
  • Two-sided guarantee (Theorem I) — precision and recall both have tight bounds.
  • No external KB required (SGDC) — uses the LLM's own internal consistency.
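The precision guarantee in the first bullet comes down to "accept only when a proof path exists". A minimal sketch of that idea, using plain breadth-first search instead of the package's matrix formulation: `find_proof` is a hypothetical helper, not the library's API, and it assumes the source and target entities differ.

```python
from collections import deque


def find_proof(facts, source, target, relation):
    """Return a path [source, ..., target] along `relation` edges, or None.

    facts: iterable of (head, relation, tail) triples.
    Assumes source != target (a proof needs at least one hop).
    """
    # Adjacency list restricted to the requested relation.
    edges = {}
    for head, rel, tail in facts:
        if rel == relation:
            edges.setdefault(head, []).append(tail)

    parent = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt in parent:
                continue
            parent[nxt] = node
            if nxt == target:
                # Walk back to the source to recover the proof path.
                path = [nxt]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


facts = [("alice", "parent", "bob"), ("bob", "parent", "carol")]
print(find_proof(facts, "alice", "carol", "parent"))  # ['alice', 'bob', 'carol']
print(find_proof(facts, "alice", "zed", "parent"))    # None
```

Every accepted claim comes with the path that justifies it, so a false accept can only come from a wrong fact in the graph, never from the search itself.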

Is not: an "unprecedented breakthrough." The Katz index, the Neumann series, graph reachability, and neuro-symbolic grounding are all classical math and technique. The contribution here is unification, a measured guarantee, and benchmark numbers — not a new primitive. The guard needs a relation graph (supplied, or extracted from LLM facts); flexibility is bounded (see PAPER §5).

How this differs from the usual fixes

| Approach | Extra tokens | Guarantee | Needs an external KB |
|---|---|---|---|
| LLM self-verification (2nd call) | +110% | none (measured 34% precision) | no |
| Self-consistency / majority vote | multiplies with sample count | none, statistical only | no |
| RAG / external KG grounding | varies | only as good as retrieval | yes |
| This guard | +0 | precision = 1.0 (Theorem G) | no |
| This guard, self-grounded (SGDC) | +0 | precision = 1.0 given sound atomic facts (Theorem I) | no |
| This guard, conformal | +0 | coverage ≥ 1−α, distribution-free (Theorem K) | no |

Three theorems, one operator (F = G = H)

The reasoning core rests on a single unification (numerically verified, zero error):

| View | Theorem | Content |
|---|---|---|
| Fuzzy diffusion inference | F | conf(a→b) = Σ αᵏ(Pᵏ)[a,b], calibrated + grounded |
| Relation operator algebra | G | composition = operator product, transitive closure = Σ powers |
| Spectral analysis (Katz) | H | engine.infer = resolvent (I−αP)⁻¹−I (matches 0.0 error) |

⟹ fuzzy inference is spectral analysis of the relation operator. src/reasoning/.
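The identity behind the unification is the Neumann series: Σ_{k≥1} αᵏPᵏ = (I−αP)⁻¹ − I whenever the series converges. The sketch below checks it exactly on a three-node chain, as an illustration only and not the verification in src/theory/theorems.py. It assumes an acyclic graph, so P is nilpotent and the sum is finite, and it uses exact fractions and tests (I−αP)(I+S) = I so that no matrix inverse is needed. The helper names are hypothetical.

```python
from fractions import Fraction


def matmul(A, B):
    """Plain square-matrix product on lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]


def katz_series(P, alpha):
    """S = sum over k >= 1 of alpha^k P^k, for a nilpotent n x n matrix P."""
    n = len(P)
    S = [[Fraction(0)] * n for _ in range(n)]
    term = identity(n)
    for _ in range(n):  # P^n = 0 when P is nilpotent, so n terms suffice
        term = [[alpha * x for x in row] for row in matmul(term, P)]
        S = [[s + t for s, t in zip(rs, rt)] for rs, rt in zip(S, term)]
    return S


# Chain a -> b -> c as an adjacency matrix (rows = source, columns = target).
P = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
alpha = Fraction(1, 2)

S = katz_series(P, alpha)
I = identity(3)
I_minus_aP = [[I[i][j] - alpha * P[i][j] for j in range(3)] for i in range(3)]
I_plus_S = [[I[i][j] + S[i][j] for j in range(3)] for i in range(3)]

# (I - alpha P)(I + S) = I  is equivalent to  S = (I - alpha P)^-1 - I.
assert matmul(I_minus_aP, I_plus_S) == I
print(S[0][1], S[0][2])  # 1/2 1/4
```

The two-hop entry S[a,c] = α² = 1/4 is smaller than the one-hop entry S[a,b] = α = 1/2, which is how the diffusion score discounts longer proof paths.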

Four further theorems extend this core: I (two-sided precision/recall guarantee for a self-grounded, no-external-KB variant), J (closure-learning completeness, validated on CLUTRR), K (conformal reasoning — distribution-free coverage under a noisy relation graph, including one extracted by an LLM from raw text), and L (Horn forward-chaining, generalizing transitive closure to conjunctive rules). All seven are stated, proved, and numerically verified in PAPER.md.
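For readers unfamiliar with Horn forward-chaining, here is a minimal sketch of the idea behind Theorem L, restricted to two-atom chain rules of the form r1(x,y) ∧ r2(y,z) → r3(x,z). The rule format and the name `forward_chain` are assumptions for illustration; src/reasoning/horn.py may represent rules differently. Transitive closure is the special case where r1 = r2 = r3.

```python
def forward_chain(facts, rules):
    """Naive fixed-point iteration over chain rules.

    facts: iterable of (x, relation, y) triples.
    rules: list of ((r1, r2), r3), read as r1(x,y) & r2(y,z) -> r3(x,z).
    Returns the set of all derivable facts (the least model).
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for (r1, r2), head in rules:
            # Join r1-facts with r2-facts on the shared middle entity.
            new = {
                (x, head, z)
                for (x, a, y) in known if a == r1
                for (y2, b, z) in known if b == r2 and y2 == y
            }
            if not new <= known:
                known |= new
                changed = True
    return known


facts = [("alice", "parent", "bob"),
         ("bob", "parent", "carol"),
         ("carol", "parent", "dave")]
rules = [(("parent", "parent"), "grandparent"),
         (("grandparent", "parent"), "great-grandparent")]

model = forward_chain(facts, rules)
print(("alice", "great-grandparent", "dave") in model)  # True
```

The loop terminates because the set of entities and relation names is finite, so only finitely many triples can ever be added.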


Evidence on real LLMs (DeepSeek)

| Experiment | Result |
|---|---|
| Hallucination guard (kinship) | precision 33% → 100%, catches 94/94, 0 false rejects |
| Guard token cost | +0 tokens (vs. LLM self-verify: +110% tokens, 34% precision) |
| SGDC (self-grounded, no external KB) | precision 78% → 100% from internal consistency alone |
| CLUTRR (public benchmark) | solver ~100% at every hop vs. DeepSeek 83%→8% |
| Hard passage (9-step chain) | DeepSeek fabricates 2/10 (wrong direction); grounded system 10/10, with proofs (examples/hallucination_demo.py) |

Guaranteed reasoning over a graph an LLM extracted from raw text

The guard/solver needs a clean graph, but a graph that an LLM extracts from natural-language text is noisy (missing or spurious edges). Conformal Reasoning (Theorem K) targets exactly that case: use the operator confidence as a score and calibrate an acceptance threshold ⟹ distribution-free coverage ≥ 1−α, even on a noisy graph.
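The calibration step can be sketched with the standard split-conformal recipe. This is not the package's ConformalReasoner; `calibrate_threshold` and `accept` are hypothetical names, and the sketch assumes the calibration claims and future claims are exchangeable. Given confidence scores for n claims known to be true, the threshold is the k-th largest score with k = ⌈(n+1)(1−α)⌉, so a new true claim scores at or above it with probability at least 1−α.

```python
import math


def calibrate_threshold(true_claim_scores, alpha):
    """Split-conformal threshold on confidence scores (higher = more confident).

    true_claim_scores: confidences the operator assigned to held-out claims
    that are known to be true. Returns the acceptance threshold.
    """
    n = len(true_claim_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: accept everything,
        # which keeps coverage valid at the cost of efficiency.
        return float("-inf")
    return sorted(true_claim_scores, reverse=True)[k - 1]


def accept(score, threshold):
    """Accept a claim when its confidence reaches the calibrated threshold."""
    return score >= threshold


calibration = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
t = calibrate_threshold(calibration, alpha=0.5)
print(t, accept(0.55, t), accept(0.45, t))
```

A noisier graph pushes true-claim scores down, which lowers the threshold: coverage is preserved, but more false claims pass, which is the efficiency cost reported in the table below.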

End-to-end demo: DeepSeek extracts an "is a" graph from text → conformal runs on that extracted graph (ground truth is used only for scoring):

| Text | LLM extraction (P / R) | Coverage (target ≥90%) | Efficiency (FPR) |
|---|---|---|---|
| Easy | 100% / 99.7% | 91.3% | 0.0 |
| Hard (nested clauses + near-miss distractors) | 99.5% / 68.5% | 93.0% | 0.77 |

On the hard text, the LLM's extraction misses about 31.5% of the edges (recall 68.5%), a genuinely noisy graph. The coverage guarantee still holds (93.0% ≥ 90%); only efficiency degrades (FPR 0.77). Validity always holds; efficiency scales with graph quality.

⟹ A path to guaranteed reasoning over natural-language relations — where the hard guard can't reach. src/experiments/conformal_llm_eval.py.


Quickstart

git clone https://github.com/ALEXaquarius/grounded-reasoning
cd grounded-reasoning && pip install -e ".[dev]"     # not yet on PyPI — install from source
pytest tests/                       # every theorem + offline-locked logic, no network needed

# Use it right now (no LLM/network needed):
python -c "from grounded_reasoning import GroundedReasoner as G; r=G(); r.add_facts([('a','p','b'),('b','p','c')]); print(r.verify('a','c',via='p'))"

# Real-LLM experiments (need a key — read from an env var, NEVER hardcoded):
export DEEPSEEK_API_KEY=sk-...        # bring your own; .env is gitignored
python -m src.experiments.guard_llm_eval        # hallucination guard
python -m src.experiments.self_grounded_eval    # SGDC
python -m src.experiments.clutrr_eval           # public CLUTRR benchmark
python -m src.experiments.conformal_llm_eval    # end-to-end conformal (LLM-extracted graph)
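If you write your own experiment script, the "read from an env var, never hardcoded" rule might look like the sketch below. `load_api_key` is a hypothetical helper, not a function from this package; only the variable name DEEPSEEK_API_KEY comes from the commands above.

```python
import os


def load_api_key(var_name="DEEPSEEK_API_KEY"):
    """Read the provider key from the environment and fail fast if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set {var_name} in the environment before running the real-LLM experiments."
        )
    return key
```

Failing before the first request keeps a missing key from surfacing later as an opaque authentication error.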

Integrating with an Agent / LLM (src/agent/)

A relation-reasoning verifier for agents: check a multi-hop claim before asserting it — zero model tokens, precision guaranteed (accepts iff a grounded proof path exists).

from grounded_reasoning import GroundedReasoner

gr = GroundedReasoner()
# Facts are (subject, relation, object) triples.
gr.add_facts([("alice", "parent", "bob"), ("bob", "parent", "carol")])
gr.verify("alice", "carol", via="parent")   # Verdict(grounded=True, proof=['alice','bob','carol'])
gr.verify("alice", "zed", via="parent")     # Verdict(grounded=False, proof=None)  ← hallucination blocked

Three integration paths (details: docs/integration.md):

  • Library: GroundedReasoner.verify / filter_claims / contradictions.
  • Function-calling: TOOL_SPEC (Anthropic) / openai_tool_spec() (OpenAI) + run_tool — a stateless verify_relation tool.
  • MCP server: python -m src.agent.mcp_server — plugs into Claude or any MCP-compatible agent.
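The two function-calling formats differ only in their envelope. The sketch below shows the general shape of an Anthropic-style tool definition and its OpenAI-style equivalent. The tool name verify_relation is from this README, but the schema fields (`facts`, `subject`, `object`, `relation`), the description text, and the helper `to_openai_tool` are assumptions for illustration. The package's actual TOOL_SPEC and openai_tool_spec() may differ, so check docs/integration.md.

```python
# Anthropic-style tool definition: name, description, input_schema (JSON Schema).
# The schema fields below are hypothetical, chosen to keep the tool stateless:
# every call carries the facts it should be checked against.
EXAMPLE_SPEC = {
    "name": "verify_relation",
    "description": "Check whether a relational claim is grounded in the supplied facts.",
    "input_schema": {
        "type": "object",
        "properties": {
            "facts": {
                "type": "array",
                "items": {"type": "array", "items": {"type": "string"}},
                "description": "List of [subject, relation, object] triples.",
            },
            "subject": {"type": "string"},
            "object": {"type": "string"},
            "relation": {"type": "string"},
        },
        "required": ["facts", "subject", "object", "relation"],
    },
}


def to_openai_tool(spec):
    """Wrap an Anthropic-style spec in the OpenAI chat-completions tool envelope."""
    return {
        "type": "function",
        "function": {
            "name": spec["name"],
            "description": spec["description"],
            "parameters": spec["input_schema"],
        },
    }


openai_tool = to_openai_tool(EXAMPLE_SPEC)
print(openai_tool["function"]["name"])  # verify_relation
```

Both formats carry the same JSON Schema, so one definition can serve both provider families.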

Multi-provider (not just DeepSeek): LLMClient(provider=...) for DeepSeek / OpenAI / Groq / OpenRouter / Together / Mistral / Ollama (local). All are OpenAI-compatible, so you can switch providers without changing code. Multilingual: entities/relations are opaque Unicode strings ⟹ works with any language (cha, والد…) with zero configuration.

A real function-calling demo (agent verifies itself, blocks hallucination): python -m src.experiments.agent_demo. When the graph is noisy (relations extracted by an LLM from text), use ConformalReasoner for a coverage ≥1−α guarantee instead of hard precision.


Source map

| Path | Content |
|---|---|
| grounded_reasoning/ | Public package: GroundedReasoner, verify_relation, TOOL_SPEC, ConformalReasoner, LLMClient |
| src/agent/{verifier,tool,mcp_server}.py | Public API implementation: HallucinationGuard, function-calling tool, MCP server |
| src/reasoning/abstract_inference.py | FuzzyInferenceEngine, TypedInferenceEngine, HallucinationGuard (Theorem F) |
| src/reasoning/operator_algebra.py | Relation operator algebra (Theorem G) |
| src/reasoning/relation_spectrum.py | Spectrum, nilpotency, Katz resolvent (Theorem H) |
| src/reasoning/conformal_reasoning.py | Conformal: coverage guarantee under noise (Theorem K) |
| src/reasoning/composition_algebra.py | Composition-table learning, validated on CLUTRR (Theorem J) |
| src/reasoning/horn.py | Horn forward-chaining, least-model semantics (Theorem L) |
| src/reasoning/llm_client.py | Provider-agnostic LLM client (key read from an env var) |
| src/theory/theorems.py | Seven theorems (F–L) with numerical verification |
| src/experiments/{guard_llm,self_grounded,nl_ontology,guard_cost,clutrr,conformal_llm,inference}_eval.py | Real-LLM and benchmark experiments backing every claim above |
| examples/hallucination_demo.py | End-to-end function-calling demo |
| examples/quickstart.ipynb | Runnable tour of the library (offline, Colab-ready) |

Origin story

This project began as an attempt to invent an embedding-free retrieval algorithm that could compete with dense/RAG retrieval. That research question reached a rigorous, fully honest negative conclusion (ties BM25, loses significantly to dense embeddings — with a proof of why). The same mathematical toolkit — operator algebra, spectral analysis — turned out to have real, measurable value on a different problem: guaranteeing multi-hop relational reasoning. This repository ships only that validated, tested reasoning system; the full retrieval research trail (including every failed attempt, honestly recorded) lives in a separate research repository and is not part of this package. See PAPER.md §1 for the full framing.


Contributing & Community


Principle: proof before code, formal definitions, falsifiability, and honest reporting of negative results — see CONTRIBUTING.md.
