Reliable, checkable structured output from a small local LLM, by wrapping it in a deterministic feedback loop: a regime gate + exact graph analysis + explicit refusal, plus a bounded re-extraction loop. Zero runtime dependencies; runs with no model at all.

llm-feedback-control

Get reliable, checkable structured output from a small, local language model — by wrapping it in ordinary deterministic code.


The problem this solves

Large language models — the technology behind ChatGPT and similar tools — are brilliant at reading plain English and writing fluent, confident answers. But they have a well-known flaw: they make things up, and they sound exactly as sure when they're wrong as when they're right.

That flaw bites hardest when you ask a model to pull structure out of text — the steps of a process, the states of a workflow, the fields of a form. It will get most of it right, then quietly invent a step that isn't there, or drop one that is. For anything you actually need to rely on, "usually right, never tells you when it isn't" is not good enough.

This library fixes that for a whole class of jobs: turning free text into structured data you can trust. The version you pip install today ships one fully-worked, measured example — turning a described process into a state machine — and the same engine generalises to other targets (form fields, records, entities); see Extending to other targets. It works by pairing the language model with ordinary, deterministic code that:

  • double-checks the model's answer against provable facts,
  • fills in anything the model missed, by asking again with the gaps pointed out,
  • and, crucially, says "I'm not sure" instead of guessing when it can't verify the result.

The payoff: a small model you can run for free on your own laptop becomes reliable enough to use, because the checking — not the model's size — is doing the heavy lifting. (In our tests a 3.8B model wrapped this way matches a model about seven times larger; see results.)

Use it to:

  • extract a trustworthy state machine (states + transitions) from a process described in plain prose;
  • audit a process for dead ends, unreachable steps, and loops — with every finding backed by a check, not a guess;
  • get reliable structured output from a small, free, local model instead of paying for a giant model or a cloud API;
  • know when to stop trusting the model — it refuses input it can't analyse exactly, and flags incomplete results, rather than inventing answers.

Who it's for: anyone who needs dependable structured output — workflow and state-machine extraction, process auditing, config parsing — from a language model, without paying for a giant model or a cloud API, and without silently trusting a guess.

If you just want to try it, jump to Quickstart. To see exactly what it produces, read on.

What it actually does

The package ships one fully-worked instantiation — workflow / process extraction — which is the running example throughout. You hand it a process written in plain English:

"A claim enters Intake. From Intake it goes to Triage. Triage goes to FastTrack or to Investigation. FastTrack goes to Payout. Investigation goes to Payout or to Denied. Payout goes to Closed. Denied goes to Closed."

and it:

  1. turns that into a state machine — the steps (states) and the arrows between them (transitions);
  2. computes provable facts about it — which steps are dead ends, whether there are loops, which steps can't be reached from the start;
  3. writes a report where every statement is backed by one of those checked facts — so it can't quietly make things up;
  4. knows its own limits. If the text isn't actually a finite step-by-step process (e.g. "prices drift up as confidence grows"), it refuses instead of inventing a fake state machine. And if the model's first pass missed part of the process, it loops to fill the gaps — or refuses if it can't.
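The "provable facts" in step 2 are ordinary graph computations. The following is a minimal sketch of that kind of analysis on the claim example, assuming the state machine is already available as a list of (source, destination) pairs. The function name `analyse` and the shape of its return value are illustrative assumptions, not the package's API.

```python
from collections import deque

# The transitions from the claim example, written out by hand.
TRANSITIONS = [
    ("Intake", "Triage"),
    ("Triage", "FastTrack"),
    ("Triage", "Investigation"),
    ("FastTrack", "Payout"),
    ("Investigation", "Payout"),
    ("Investigation", "Denied"),
    ("Payout", "Closed"),
    ("Denied", "Closed"),
]


def analyse(transitions, start):
    """Return checked facts about a directed graph given as (src, dst) pairs."""
    states = {s for edge in transitions for s in edge}
    successors = {s: set() for s in states}
    for src, dst in transitions:
        successors[src].add(dst)

    # Dead ends: states with no outgoing transition.
    terminals = sorted(s for s in states if not successors[s])

    # Unreachable states: breadth-first search from the start state.
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in successors[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    unreachable = sorted(states - seen)

    # Loops: Kahn's algorithm. Repeatedly remove states with no incoming
    # edges; if some states can never be removed, the graph has a cycle.
    indegree = {s: 0 for s in states}
    for src in successors:
        for dst in successors[src]:
            indegree[dst] += 1
    ready = [s for s in states if indegree[s] == 0]
    removed = 0
    while ready:
        state = ready.pop()
        removed += 1
        for nxt in successors[state]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    has_loop = removed < len(states)

    return {"terminals": terminals, "unreachable": unreachable, "has_loop": has_loop}


facts = analyse(TRANSITIONS, "Intake")
print(facts)  # {'terminals': ['Closed'], 'unreachable': [], 'has_loop': False}
```

Every value in the result comes from a traversal of the graph, so a report built only from these facts cannot name a state or a loop that is not there.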

The point: you get higher-quality, auditable structured output from a small model, trading a few extra passes (latency) for accuracy — no extra parameters, no special mathematics, no cloud. It runs on a laptop, and the deterministic parts run with no model at all.

Quickstart (works with no model)

pip install llm-feedback-control      # zero dependencies; pulls nothing else

Then, in Python:

from llm_feedback_control import run_audit

r = run_audit("A claim enters Intake. From Intake it goes to Triage. "
              "Triage goes to FastTrack or to Investigation.")
print(r["result"])         # OK
print(r["report_facts"])   # terminals, loops, unreachable steps — all checked

That already works on a bare install: with no model reachable it uses a deterministic regex extractor plus exact graph analysis. Plug in a model and the extraction quality goes up — nothing else changes.
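To give a feel for what a deterministic regex extractor can do with no model, here is a small sketch. It is not the package's extractor, whose patterns are not shown here. The patterns, the function name `extract_state_machine`, and the assumption that state names are capitalised single words are all illustrative.

```python
import re

# Matches "Triage goes to FastTrack or to Investigation" and
# "From Intake it goes to Triage". Assumes capitalised state names.
GOES_TO = re.compile(
    r"\b(?:From\s+)?(?P<src>[A-Z]\w*)\s+(?:it\s+)?goes\s+to\s+(?P<targets>[^.]+)"
)
STATE_NAME = re.compile(r"\b[A-Z]\w*")
START = re.compile(r"\benters\s+([A-Z]\w*)")


def extract_state_machine(text):
    """Pull (src, dst) transitions out of simple 'X goes to Y' sentences."""
    transitions = []
    for match in GOES_TO.finditer(text):
        # One sentence may name several targets: "... to A or to B".
        for dst in STATE_NAME.findall(match.group("targets")):
            transitions.append((match.group("src"), dst))
    # Keep states in order of first appearance.
    states = list(dict.fromkeys(s for edge in transitions for s in edge))
    start = START.search(text)
    return {
        "start": start.group(1) if start else None,
        "states": states,
        "transitions": transitions,
    }


machine = extract_state_machine(
    "A claim enters Intake. From Intake it goes to Triage. "
    "Triage goes to FastTrack or to Investigation."
)
print(machine["transitions"])
# [('Intake', 'Triage'), ('Triage', 'FastTrack'), ('Triage', 'Investigation')]
```

A pattern-based extractor like this only handles phrasings it was written for, which is why plugging in a model improves extraction quality on messier prose.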

From the command line:

lfc "A ticket opens in New. New goes to Assigned. Assigned goes to Resolved."
lfc --check        # tells you exactly what backend is available and what to do
lfc --demo         # runs the three worked demos

Add a model (optional, recommended)

The library is not tied to any provider. Three ways to give it a model:

# 1. Local, free, private — install Ollama (https://ollama.com), then:
ollama pull phi3:mini

# 2. OpenAI (stdlib HTTP, no SDK):
export CEILING_BACKEND=openai OPENAI_API_KEY=sk-...

# 3. Bring your own (Python): pass any callable f(prompt, fmt=None) -> str
from llm_feedback_control import run_audit

def my_llm(prompt, fmt=None):
    ...                       # call Anthropic, a local server, anything

run_audit(text, generate=my_llm)   # text: the process description to audit

Run lfc --check any time to see what's wired up.
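Because the bring-your-own backend is just a callable, it can be wrapped like any other function. The sketch below adds call logging around a backend with the `f(prompt, fmt=None) -> str` signature. The names `with_logging` and `stub_backend` are hypothetical helpers, and the value `"json"` passed as `fmt` is an assumption made for the demo, since the accepted `fmt` values are not documented here.

```python
def with_logging(generate, log):
    """Wrap a backend callable, recording every prompt/reply pair in `log`."""
    def wrapped(prompt, fmt=None):
        reply = generate(prompt, fmt=fmt)
        log.append({"prompt": prompt, "fmt": fmt, "reply": reply})
        return reply
    return wrapped


def stub_backend(prompt, fmt=None):
    """Stand-in for a real model call: returns a fixed string, no network."""
    return "stub reply"


calls = []
generate = with_logging(stub_backend, calls)
reply = generate("Extract the states from: New goes to Assigned.", fmt="json")
print(len(calls), reply)  # 1 stub reply
```

A wrapped callable keeps the same signature, so it could be handed to `run_audit` through the `generate=` argument in the same way as `my_llm`.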

How it works — "feedback control", explained

The design is borrowed from electronics. A raw LLM is like a very high-gain amplifier: hugely powerful, but left to run "open-loop" it overshoots — fluent, yet it drifts and hallucinates. Engineers tame such an amplifier by adding a feedback loop: feed the output back, compare it against a stable reference, and trade some raw power for precision and stability. This library is that feedback loop for an LLM. The "reference" is plain deterministic code — graph checks and schema rules — that the model's output is measured against.

There are two kinds of feedback, and the library uses both:

Negative feedback — the stabilising checks (run_audit)

This is the half that grounds and refuses. In plain terms:

step                  what it means
regime gate           First decide whether the text is even the kind of thing we can analyse exactly (a finite, step-by-step process) versus something fuzzy and continuous. Refuse the fuzzy ones.
extraction + schema   Ask the model for the state machine, but force the answer into a strict shape, and fall back to a deterministic regex extractor if it won't comply (or if there's no model).
exact analysis        Compute provable facts about the graph: dead ends, loops, unreachable steps. (Plus an optional finite-field "spectral fingerprint"; see below.)
grounded report       Write the summary using only those verified facts, naming only real states.
explicit refusal      When the input is out of regime, or a result can't be made exact, say so; don't guess.
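The regime gate and the explicit refusal can be pictured with a toy version. This is a sketch under stated assumptions, not the shipped gate. The cue lists, the `GateDecision` type, and the choice to refuse mixed inputs are all illustrative.

```python
import re
from typing import NamedTuple


class GateDecision(NamedTuple):
    accepted: bool
    reason: str


# Hypothetical cue lists: phrases that suggest discrete steps versus
# phrases that suggest a continuous, drifting quantity.
DISCRETE_CUES = re.compile(
    r"\b(goes to|moves to|enters|transitions to|returns to)\b", re.I
)
CONTINUOUS_CUES = re.compile(
    r"\b(drift\w*|gradual\w*|grows?|rises?|falls?|increas\w*|decreas\w*)\b", re.I
)


def regime_gate(text):
    """Accept only text that reads as a finite, step-by-step process."""
    discrete = len(DISCRETE_CUES.findall(text))
    continuous = len(CONTINUOUS_CUES.findall(text))
    if discrete == 0:
        return GateDecision(False, "no discrete transitions found")
    if continuous > 0:
        # Conservative choice for this sketch: refuse mixed inputs.
        return GateDecision(False, "mixed discrete and continuous language")
    return GateDecision(True, "finite step-by-step process")


print(regime_gate("A claim enters Intake. From Intake it goes to Triage."))
print(regime_gate("Prices drift up as confidence grows."))
```

The refusal is a normal return value with a reason attached, so downstream code can report it instead of a fabricated state machine.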

Positive feedback — the gap-filling loop (extract_iterative)

A one-shot extraction often silently drops a branch — the model says "OK" while quietly missing Investigation → Denied. Positive feedback fixes that: it re-asks the model about anything the source text mentions that's missing from the answer, and repeats until nothing is missing (a fixed point).

Positive feedback is where capability and instability both live, so it's bounded by two negative-feedback safeguards: a deterministic consistency check (does the graph cover everything the text mentions?) and a refusal clamp — if it can't converge within a few passes, it refuses to report a confident-but-incomplete result rather than running away. This refusal-as-stabilizer is what makes the regenerative loop safe.
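The bounded loop described above can be sketched in a few lines. This is an illustration of the idea, not the code behind `extract_iterative`. The function names, the `max_passes` parameter, the result dictionaries, and the two stand-in "models" are assumptions, and the sketch tracks states only to stay short.

```python
import re

TEXT = (
    "Investigation goes to Payout or to Denied. "
    "Payout goes to Closed. Denied goes to Closed."
)


def reference_states(text):
    """Deterministic reference: every state name the text mentions."""
    sources = re.findall(r"\b([A-Z]\w*)\s+goes\s+to\b", text)
    targets = re.findall(r"\bto\s+([A-Z]\w*)", text)
    return set(sources) | set(targets)


def find_missing(text, found):
    """Consistency check: which mentioned states are absent from the answer?"""
    return reference_states(text) - found


def extract_with_feedback(text, extract, max_passes=3):
    """Re-ask until nothing is missing; refuse if the pass bound is hit."""
    found, missing = set(), set()
    for passes in range(1, max_passes + 1):
        found |= extract(text, hint=sorted(missing))
        missing = find_missing(text, found)
        if not missing:  # fixed point: the answer covers the text
            return {"status": "ok", "states": sorted(found), "passes": passes}
    # Refusal clamp: never report a confident but incomplete result.
    return {"status": "refused", "missing": sorted(missing), "passes": max_passes}


def forgetful_model(text, hint):
    """Stand-in for an LLM: drops 'Denied' unless the hint points at it."""
    return {"Investigation", "Payout", "Closed"} | {h for h in hint if h in text}


def stubborn_model(text, hint):
    """Stand-in for an LLM that never recovers the missing state."""
    return {"Investigation", "Payout", "Closed"}


converged = extract_with_feedback(TEXT, forgetful_model)
gave_up = extract_with_feedback(TEXT, stubborn_model)
print(converged)  # status 'ok' after 2 passes
print(gave_up)    # status 'refused', missing ['Denied']
```

The same loop either reaches a fixed point or stops at the bound with an explicit refusal; it has no path that returns a partial answer labelled as complete.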

What's measured so far

Indicative results, not benchmarks — small corpora, a 3.8B local model (phi3:mini), greedy decoding. See docs/results.md for the full tables and method.

Headline (run on EC2 against a ~28 GB ceiling model, mixtral 8x7B): on a messy, branchy, distractor-laden workflow corpus, the small model + the feedback loop essentially matches a model ~7× its size.

configuration                                   states F1   transitions F1
small model (phi3:mini), one-shot               0.98        0.89
small model + feedback loop                     1.00        0.90
big ceiling model (mixtral, ~28 GB), one-shot   1.00        0.91

Result: the loop recovers 100% of the small→big gap on states and 77% on transitions. On several individual workflows the closed-loop small model beat the big model, because the deterministic reference catches edges that raw fluency invents or drops.
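"Gap recovered" is read here as the share of the one-shot small-to-big difference that the loop closes. That definition is an assumption, and `gap_recovered` is a hypothetical helper. The sketch applies it to the two-decimal F1 values from the table. On those rounded inputs the transitions figure comes out at about 50%, so the 77% quoted above presumably reflects the unrounded scores in docs/results.md.

```python
def gap_recovered(small, looped, big):
    """Fraction of the small-to-big gap closed by adding the feedback loop."""
    gap = big - small
    if gap <= 0:
        raise ValueError("the big model does not outperform the small one")
    return (looped - small) / gap


# Inputs are the rounded F1 values from the table above.
states = gap_recovered(small=0.98, looped=1.00, big=1.00)
transitions = gap_recovered(small=0.89, looped=0.90, big=0.91)
print(f"states: {states:.0%}, transitions: {transitions:.0%}")
# states: 100%, transitions: 50%
```

With gaps this small, a change in the third decimal place moves the percentage a long way, which is one reason to treat these figures as indicative.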

Other measured pieces: extraction states precision/recall ≈ 1.00 / 0.92; the regime gate scores 1.00 precision/recall separating finite from continuous on a clean corpus (it's brittle on deliberately mixed inputs — an open problem).
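For readers who want to reproduce scores of this kind, one common way to compute precision, recall, and F1 is to compare the extracted set against a hand-labelled gold set. The sketch below assumes exact-match, set-based scoring; the method actually used for the numbers above is described in docs/results.md, and `prf1` is a hypothetical helper.

```python
def prf1(predicted, gold):
    """Set-based precision, recall and F1 with exact matching."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


GOLD = {
    ("Intake", "Triage"), ("Triage", "FastTrack"), ("Triage", "Investigation"),
    ("FastTrack", "Payout"), ("Investigation", "Payout"),
    ("Investigation", "Denied"), ("Payout", "Closed"), ("Denied", "Closed"),
}
# A one-shot answer that silently drops the Investigation -> Denied branch.
ONE_SHOT = GOLD - {("Investigation", "Denied")}

precision, recall, f1 = prf1(ONE_SHOT, GOLD)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=1.000 recall=0.875 f1=0.933
```

A dropped branch leaves precision untouched and shows up only in recall, which matches the pattern of high precision and lower recall reported above.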

Documentation

doc                    contents
docs/index.md          overview and where to start
docs/architecture.md   the op-amp model in depth; the pipeline; refusal-as-stabilizer
docs/usage.md          install, the API, the CLI, configuration, bring-your-own-backend
docs/results.md        the measured results, method, and honest scope
docs/api.md            reference for every public function
docs/faq.md            "do I need a GPU?", "what models?", "does it work offline?" …
docs/CHANGELOG.md      release history

Repository layout

src/llm_feedback_control/   the package (zero-dependency, pure standard library)
  llm.py                    the LLM client + injectable backend + a doctor()
  auditor.py                the negative-feedback pipeline (run_audit)
  feedback.py               the bounded positive-feedback loop (extract_iterative)
  __main__.py               the `lfc` command-line tool
experiments/                repro scripts for the measured results (not shipped)
aws/                        optional: run a large ceiling model on EC2 (not shipped)
docs/                       the documentation suite
tests/                      deterministic tests (no model / no network)

Honest scope

  • A reliability architecture, not a model improvement. The win is "the system knows what it can compute exactly and refuses the rest" — orthogonal to model scale. It helps on the structured / verifiable slice (workflows, state machines, configs), not open-ended generation.
  • It uses no special mathematics. The deterministic reference is plain graph/text consistency. (The finite-field "spectral fingerprint" is an optional extra exact check, honestly redundant with graph analysis for most workflow audits — keep it or ignore it.)
  • Needs a deterministic reference. Where there's nothing to check against, the gate (correctly) refuses to claim exactness.
  • Results are indicative. Small corpora; treat the numbers as direction, not guarantees.

Extending to other targets

Workflow extraction is the worked example, not the limit. Underneath it is a general engine for reliable structured extraction:

extract with the LLM → check the result against a deterministic reference → re-ask to fill the gaps → refuse when it can't be verified.

To point it at a new kind of target you supply two things:

  1. a target schema — what fields/shape you want out;
  2. a deterministic reference — cheap code that, without the LLM, says what's missing or wrong. This is the part that makes the loop converge, and it's the part that decides whether the engine helps: it pays off exactly where you can write such a reference.

For the shipped workflow instantiation, the reference is "does the graph cover every state the text mentions?". Swap it and the same loop handles other targets:

  • form fields (against a field schema + format/required-field rules),
  • records / tables (against a known column set + types),
  • entities / relations (against a gazetteer or pattern set),
  • configs / specs (against a schema or grammar).

This isn't hypothetical: a form-field instantiation has been prototyped on the same loop, with the reference being a field schema plus independent regex detectors (email, date, currency, phone, custom patterns). It does what constrained-decoding libraries (which guarantee output shape but not truth) and cloud OCR do not: it verifies each value against the source text, recovers a value the model hallucinated by reading it back out of the document, and refuses when a required field is genuinely absent rather than inventing one. On a small local model it lifted per-field accuracy and refused correctly on a missing-field case.
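The form-field behaviour described above (verify against the source, recover a hallucinated value, refuse on a missing required field) can be sketched as follows. This is not the prototype's code. The detector patterns, the `verify_fields` name, and the result shape are assumptions made for illustration.

```python
import re

# Hypothetical independent detectors, one per field type.
DETECTORS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "total": re.compile(r"[$€£]\s?\d+(?:\.\d{2})?"),
}


def verify_fields(source, extracted, required):
    """Check each model-supplied value against the source document."""
    verified, missing = {}, []
    for field in required:
        value = extracted.get(field)
        if value and value in source:
            verified[field] = value  # grounded: appears verbatim in the source
            continue
        detector = DETECTORS.get(field)
        match = detector.search(source) if detector else None
        if match:
            verified[field] = match.group(0)  # recovered from the document
        else:
            missing.append(field)  # absent: refuse rather than invent
    status = "refused" if missing else "ok"
    return {"status": status, "fields": verified, "missing": missing}


SOURCE = "Contact: jane.doe@example.org. Invoice issued 2024-03-15."
# The stand-in model output gets the date right but hallucinates the email.
MODEL_OUTPUT = {"email": "jane@example.com", "date": "2024-03-15", "total": "$40.00"}

recovered = verify_fields(SOURCE, MODEL_OUTPUT, required=["email", "date"])
refused = verify_fields(SOURCE, MODEL_OUTPUT, required=["email", "date", "total"])
print(recovered)  # email replaced by the address found in the source
print(refused)    # status 'refused', missing ['total']
```

The model's value is kept only when the source text supports it, so the hallucinated total is not passed through even though the model supplied one.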

Where no deterministic reference exists (open-ended summarising, sentiment, theme extraction), the engine refuses to claim exactness — by design, not by accident.

Origin

This project is the practical, validated spin-off of an internal research investigation. The investigation's grander mathematical claims did not hold up under measurement; this engineering architecture — LLM feedback control with refusal-as-stabilizer — is the part that did. It stands on its own.

License

MIT with an attribution clause — see LICENSE. Built with llm-feedback-control by Edward Chalk (sapientronic.ai).
