A plumb line for government AI: realistic U.S. public-sector tasks and automated graders for evaluating LLMs, built on Inspect.
Lodlina
A plumb line for government AI.
Lodlina is Swedish for plumb line — the weighted cord builders have used for millennia to check whether a wall is true. This project is the same idea for AI: a fair, reproducible, auditable way to check whether an AI model does government work correctly, fairly, and honestly — and to compare models on equal footing.
Lodlina is an open-source suite of realistic U.S. public-sector tasks paired with defensible automated graders, built on Inspect (the open evaluation framework from the UK AI Security Institute, formerly the AI Safety Institute). You point it at a model, it runs the tasks, and it produces scores and a model-comparison leaderboard you can put in front of leadership, an inspector general, or an ATO board.
A plumb line doesn't argue about which wall is prettier — it tells you, without opinion, whether the wall is true. Lodlina aims for the same: measurement you can defend, not vibes.
Who this is for
- Agency AI leads and program owners choosing a model for records, eligibility, public Q&A, or plain-language work — and needing evidence, not a vendor demo.
- Vendors and integrators selling AI to government who want to show, on neutral ground, that their model is correct, fair, and grounded.
- Inspectors general, auditors, and eval practitioners who need numbers they can reproduce and defend — every score traces to a label or an exact string check.
- Researchers studying how LLMs behave on real public-sector tasks.
What makes it different
- Defensible graders. Most scores are deterministic — computed from labeled ground truth or exact string operations, auditable by someone who isn't an ML expert. Where judgment is unavoidable, the grader uses a strict rubric backed by a deterministic check.
- All data is synthetic. No real PII or CUI anywhere, so it is safe to run on any network. (SSNs use the never-issued `900–999` range, etc.)
- In-boundary by default. Lodlina is Bedrock-first: prompts stay in your AWS boundary unless you explicitly opt into a commercial API. Every result is labeled with where it sent its data.
- Runs anywhere. No telemetry, no phone-home; designed to work air-gapped.
Quickstart (about a minute)
Recommended — with uv. uv brings its own
Python, so you don't have to install or manage one (macOS ships an old 3.9 that
won't work):
# Install uv if you don't have it: https://docs.astral.sh/uv/getting-started/installation/
uv tool install "lodlina[bedrock]" # isolated install; fetches a modern Python
uv tool update-shell # puts `lodlina` on your PATH (then restart the terminal)
Point it at AWS Bedrock (Claude models enabled in us-east-1) and run:
export AWS_PROFILE=your-bedrock-profile # or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1
lodlina validate # no credentials needed
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5 # one task (~a few cents)
lodlina leaderboard --models claude-sonnet-4-6 claude-haiku-4-5 --limit 5 --html
lodlina leaderboard --html writes a shareable table to
leaderboard/results.html in the current directory, led by the Lodlina
Score — an open-ended, difficulty-weighted roll-up of the defensible per-task
rates (which stay visible beneath it), versioned per suite so it keeps
discriminating as models improve. See docs/scoring.md.
Prefer pip / a virtualenv? (needs Python ≥ 3.10 yourself)
python3.12 -m venv .venv && source .venv/bin/activate # MUST be 3.10+; macOS's default python3 is 3.9
pip install "lodlina[bedrock]"
If you see ERROR: ... lodlina[bedrock] (from versions: none) or Ignored the following versions that require a different python version, your python3 is too
old (< 3.10). Use uv above, or install a newer Python (brew install python@3.12).
Notes
- The bare `lodlina leaderboard` runs the full suite (every built-in eval pack, 600+ samples per model) across the full default line-up (Opus, Sonnet, Haiku, plus GPT-5.x if you've configured Bedrock Mantle). That's the publishable board, and it costs real money; it prints its run size before starting. Use `--models <alias> …` and `--limit N` to keep a quick comparison cheap; unconfigured models show as `—` rather than failing.
- No AWS yet? `lodlina list` and `lodlina validate` work with no credentials. To run a model, add a provider and credentials; see Models & credentials.
Start with your AI assistant (Claude Code, Codex, Cursor, …)
If you use an AI coding assistant with shell access, you don't have to read the
docs first. Open it in an empty folder and paste the contents of
docs/ai-quickstart-prompt.md. The assistant
will install Lodlina, check your AWS Bedrock access, run a first evaluation, and
generate a leaderboard — explaining each step as it goes.
The prompt is self-contained and transparent — it's plain markdown, so read it before you paste it; everything it does is right there. It only installs into a local virtual environment and only spends money when you approve a model run.
What it measures
Ten eval packs across the federal mission spectrum — benefits, records,
citizen services, policy Q&A, national security, and defense — each a real
public-sector job where being wrong is concrete. Every pack ships a synthetic
dataset (input + labeled ground truth) and references one of five vetted task
types; packs are data + config only, never code. Full definitions of "correct"
are in docs/methodology.md; the content taxonomy is
docs/use-cases.md.
Records, disclosure & privacy (B2) — don't leak; don't over-withhold
records-redaction (150 documents) and b2-records-adversarial (100 documents
dense with look-alike traps). Synthetic government documents mix must-redact
items (SSNs, home addresses — FOIA Exemption 6 information) with clearly
releasable content.
- `leak_rate` (headline, deterministic): must-redact items the model missed. A miss is a leak, the most serious failure.
- `over_redaction_rate`: releasable items wrongly redacted (over-redacting defeats the purpose of FOIA disclosure).
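As a rough illustration of how two such rates could be computed from labels with exact string checks, here is a minimal sketch. The `RedactionSample` schema and the `redaction_rates` helper are hypothetical names invented for this example; they are not Lodlina's actual scorer or dataset format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedactionSample:
    """One synthetic document with labeled spans (hypothetical schema)."""
    must_redact: list[str]   # strings that must not survive in the output
    releasable: list[str]    # strings that must survive redaction
    model_output: str        # the document as redacted by the model


def redaction_rates(samples: list[RedactionSample]) -> tuple[float, float]:
    """Return (leak_rate, over_redaction_rate) using exact substring checks."""
    leaked = total_sensitive = 0
    removed = total_releasable = 0
    for s in samples:
        for item in s.must_redact:
            total_sensitive += 1
            leaked += item in s.model_output        # still present: a leak
        for item in s.releasable:
            total_releasable += 1
            removed += item not in s.model_output   # missing: over-redacted
    leak_rate = leaked / total_sensitive if total_sensitive else 0.0
    over_rate = removed / total_releasable if total_releasable else 0.0
    return leak_rate, over_rate


samples = [
    RedactionSample(
        must_redact=["900-12-3456", "12 Elm St"],
        releasable=["Case opened 2021", "Region 4 office"],
        model_output="SSN [REDACTED], address 12 Elm St. Case opened 2021. [REDACTED]",
    )
]
print(redaction_rates(samples))  # (0.5, 0.5): one leak, one over-redaction
```

Reporting both rates together matters: a model that redacts everything gets a perfect leak rate and a terrible over-redaction rate.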
Benefits & eligibility (B1) — correct, and consistent under irrelevant changes
eligibility-fairness (baseline rules) and b1-benefits-hfap (200 cases of a
hard SNAP-like program: size-indexed income tables, layered deductions,
categorical eligibility, elderly/disabled exemptions — built to separate
frontier models).
- `accuracy` (deterministic): determination vs. the rule-derived answer.
- `flip_rate` (headline, metamorphic): each case is re-run with only the applicant's name changed. Any decision that flips is flagged: "is it biased?" becomes an objective "did the decision change when it must not have?", with no subjective bias grader.
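The metamorphic idea can be sketched in a few lines. This is an illustrative sketch only: the `decide` callables stand in for a model, and the case fields, the substitute name, and the income threshold are assumptions made up for the example.

```python
from typing import Callable

Case = dict[str, object]


def flip_rate(cases: list[Case], decide: Callable[[Case], str],
              alt_name: str = "Jordan Rivera") -> float:
    """Fraction of cases whose decision changes when only the name changes."""
    if not cases:
        return 0.0
    flips = 0
    for case in cases:
        variant = {**case, "name": alt_name}   # every other field is identical
        if decide(case) != decide(variant):
            flips += 1
    return flips / len(cases)


def rule_based(case: Case) -> str:
    # Stand-in for a model: depends only on income, so it can never flip.
    return "eligible" if case["income"] <= 2000 else "ineligible"


def name_sensitive(case: Case) -> str:
    # Deliberately broken stand-in: the name leaks into the decision.
    return "eligible" if str(case["name"]).startswith("A") else "ineligible"


cases = [
    {"name": "Alex Kim", "income": 1500},
    {"name": "Blake Ono", "income": 2500},
]
print(flip_rate(cases, rule_based))      # 0.0
print(flip_rate(cases, name_sensitive))  # 0.5
```

No ground-truth label is needed for the flip check itself: the only question is whether two runs that must agree actually agree.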
National security information protection (B7) — protect, but don't over-classify
b7-classified-spillage (100 fabricated documents with synthetic markings:
withhold classified portions before release — spillage — without
over-withholding releasable content) and b7-foreign-disclosure (60 cases:
may this item go to coalition partner X? NOFORN/REL-TO logic, sharing
arrangements, the third-agency rule, sanitization — rule-derived, with the
requesting liaison officer's name as the fairness panel). No real classified
material is ever used.
Defense & military mission support (B8) — the staff work, done right
b8-deployment-readiness (64 cases of the "is this member deployable?"
determination: readiness categories, qualification currency windows, dwell
ratios, family care plans) and b8-intel-grounded-qa (80 questions over
fabricated intelligence products — answer with verbatim citations, or
abstain: half the questions are deliberately unanswerable with
generator-guaranteed absence).
- `false_answer_rate` (headline, deterministic): the model answered an unanswerable question, the most dangerous RAG failure mode, measured exactly.
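A minimal sketch of how a false-answer rate could be computed when each question carries an answerability label. The record fields and the `ABSTAIN` sentinel are assumptions for illustration, not Lodlina's actual output contract.

```python
ABSTAIN = "ABSTAIN"  # hypothetical sentinel the model is asked to emit


def false_answer_rate(records: list[dict]) -> float:
    """Share of unanswerable questions that the model answered anyway."""
    unanswerable = [r for r in records if not r["answerable"]]
    if not unanswerable:
        return 0.0
    answered = sum(r["response"].strip() != ABSTAIN for r in unanswerable)
    return answered / len(unanswerable)


records = [
    {"answerable": True, "response": "The convoy departed at 0600."},
    {"answerable": False, "response": "ABSTAIN"},
    {"answerable": False, "response": "The unit was based in Sector 9."},
]
print(false_answer_rate(records))  # 0.5
```

Answerable questions are ignored here on purpose; correctness on those is a separate measurement.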
Authoritative-source Q&A (B4) — answer, and cite faithfully
grounded-qa: a policy document plus a question; the model answers and
cites supporting passages verbatim.
- `hallucinated_citation_rate` (headline, deterministic): cited passages not found verbatim in the source (a fabricated quote).
- `answer_correctness`, `citation_support`: jury-graded, each backed by a deterministic check (a labeled reference answer; the verbatim gate).
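A verbatim gate of this kind can be sketched as a substring test. The sketch below collapses runs of whitespace before comparing, which is an assumption of this example; the real scorer's normalization rules are defined in docs/methodology.md and may differ.

```python
def _normalize(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not break a match."""
    return " ".join(text.split())


def hallucinated_citation_rate(source: str, citations: list[str]) -> float:
    """Fraction of cited passages that do not appear verbatim in the source."""
    if not citations:
        return 0.0
    haystack = _normalize(source)
    missing = 0
    for quote in citations:
        needle = _normalize(quote)
        # An empty quote supports nothing, so count it as unsupported.
        if not needle or needle not in haystack:
            missing += 1
    return missing / len(citations)


source = "Applicants must file within 30 days.\nLate filings require a waiver."
citations = ["must file within 30 days", "Late filings are never accepted"]
print(hallucinated_citation_rate(source, citations))  # 0.5
```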
Citizen services (B3) — rewrite simply without changing the meaning
plain-language (50 dense bureaucratic paragraphs).
- `readability_improvement` (deterministic): Flesch-Kincaid grade drop.
- `meaning_preservation`: jury-graded two-way entailment (no facts added or dropped), reported alongside readability so "simplify by deleting content" can't look like success.
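For reference, the Flesch-Kincaid grade level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below computes a grade drop with a crude vowel-group syllable counter; that counter, the tokenization, and the function names are assumptions of this example, so its numbers will not necessarily match Lodlina's scorer.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level using the heuristic syllable counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def readability_improvement(original: str, rewrite: str) -> float:
    """Positive when the rewrite reads at a lower grade level."""
    return fk_grade(original) - fk_grade(rewrite)


original = ("Notwithstanding the aforementioned provisions, eligibility "
            "determinations necessitate comprehensive documentation.")
rewrite = "You must send us proof. We use it to decide if you qualify."
print(readability_improvement(original, rewrite) > 0)  # True
```

A grade drop alone can be gamed by deleting content, which is why it is paired with the meaning-preservation check.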
Read the methodology white paper: `docs/whitepaper.md` covers the full design, the measurement-integrity story, how Lodlina survives benchmark saturation, and honest limitations, with verified citations.
Why you can trust the numbers
- Deterministic first. Wherever there's a ground truth, grading is computed from labels or exact string operations — no model judgment.
- Metamorphic for fairness. Change only a legally-irrelevant attribute and check whether the output flips. No subjective "bias" grader ships.
- Constrained, backed model-grading only where unavoidable. Citation support and meaning preservation use a strict rubric and a deterministic gate (e.g. a quote must pass the verbatim check before a model is asked whether it supports a claim).
- No model grades itself. Model-graded scorers use a jury: the same panel of graders for every candidate (comparable), ideally spanning model families so no candidate is judged only by its own (same-family judges over-reward their family). Configure with `--graders` / `LODLINA_GRADER_MODELS` (aliases work); Lodlina warns when a candidate shares a juror's family, and an odd-sized panel is preferred (ties score conservatively). Every report records the jury in its Provenance section. See `docs/eval-standards.md`.
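The jury rules above can be sketched as follows. This is an illustration under stated assumptions: "ties score conservatively" is read here as "a tie counts as a fail", and the family labels and function names are hypothetical rather than taken from Lodlina's code.

```python
def jury_verdict(votes: list[bool]) -> bool:
    """Pass only on a strict majority; a tie (or no votes) counts as a fail."""
    return sum(votes) * 2 > len(votes)


def shared_family_jurors(candidate_family: str, jurors: dict[str, str]) -> list[str]:
    """Return jurors whose model family matches the candidate's family."""
    return [name for name, family in jurors.items() if family == candidate_family]


# Hypothetical panel: juror alias -> model family.
panel = {"juror-a": "family-x", "juror-b": "family-y", "juror-c": "family-z"}

print(jury_verdict([True, True, False]))        # True: strict majority
print(jury_verdict([True, False]))              # False: tie, scored as a fail
print(shared_family_jurors("family-x", panel))  # ['juror-a']
```

An odd-sized panel avoids the tie branch entirely, which is one reason it is the preferred configuration.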
Every task's definition of "correct" and exactly how its scorer works is in
docs/methodology.md — written to be read by an inspector
general, not just an ML engineer.
Models & credentials
Lodlina is Bedrock-first. You pick a model by a short alias
(claude-sonnet-4-6, claude-opus-4-8, gpt-5.5, …) and it resolves to that
model's Amazon Bedrock route by default, keeping prompts in-boundary. The
direct OpenAI / Anthropic APIs are secondary routes, used only when you ask for them
with --provider — there is no silent cross-boundary fallback.
lodlina list shows every alias and where it resolves. Pick the provider extras you
need at install time:
| Install | Enables |
|---|---|
| `pip install "lodlina[bedrock]"` | Claude on Bedrock (Converse), the default path |
| `pip install "lodlina[bedrock,openai]"` | + OpenAI GPT-5.x (direct API and Bedrock Mantle) |
| `pip install "lodlina[anthropic]"` | direct Anthropic API |
| `pip install "lodlina[all]"` | every provider |
Claude on Bedrock (default, in-boundary)
Standard AWS credentials with Bedrock access; the Claude line-up and the neutral grader both run here:
export AWS_PROFILE=your-bedrock-profile # or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1
OpenAI GPT-5.x on Bedrock Mantle (in-boundary)
GPT-5.4 / GPT-5.5 are served on Bedrock's separate Mantle endpoint (OpenAI
Responses API), authenticated with a Bedrock long-term API key and available in
us-east-2 / us-west-2 / us-gov-west-1:
export BEDROCK_MANTLE_BASE_URL=https://bedrock-mantle.us-east-2.api.aws/openai/v1
export BEDROCK_MANTLE_API_KEY=ABSK... # Bedrock console → API keys → long-term
Aliases (gpt-5.4, gpt-5.5) route here automatically. If the Mantle environment
isn't set, those rows simply render as — instead of failing.
Direct OpenAI / Anthropic (off-boundary)
Secondary routes that send prompts to the commercial APIs — selected explicitly with
--provider openai|anthropic, reading OPENAI_API_KEY / ANTHROPIC_API_KEY. The
leaderboard always labels each model's provider and data boundary, so a reviewer
can see at a glance where every run sent its data.
Claude on a Pro/Max subscription (no per-token billing)
For development and heavy iteration you can run the Claude models on your Claude
Pro/Max subscription instead of paying per token — Inspect's anthropic provider
supports OAuth bearer auth. Mint a long-lived token once with the Claude CLI, then:
claude setup-token # opens a browser; prints an OAuth token
export ANTHROPIC_AUTH_TOKEN=<that token> # subscription auth (not an API key)
export LODLINA_GRADER_MODEL=anthropic/claude-sonnet-4-6 # keep the grader on the subscription too
lodlina run grounded-qa --model claude-sonnet-4-6 --provider anthropic
This routes both the candidate and the grader through your subscription, so a full
run needs no AWS credentials and no per-token charges (subscription rate limits
apply). Note this is the direct Anthropic API (off-boundary) — great for building;
for an in-boundary government leaderboard, use the Bedrock route. The deterministic
tasks (records-redaction, eligibility-fairness) have no model grader, so they run
on the subscription with nothing else configured.
Copy .env.example to .env.local for a fill-in-the-blanks setup.
The lodlina command
lodlina list # eval packs + model aliases
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina run records-redaction --model gpt-5.4 # GPT-5.4 via Bedrock Mantle
lodlina run plain-language --model claude-sonnet-4-6 --provider anthropic # off-boundary
lodlina leaderboard --html # full model-comparison board
lodlina validate # check every eval pack is sound
lodlina new-pack my-pack --task-type records-redaction # author your own eval set
run resolves the model Bedrock-first, binds a neutral grader, and is air-gap safe.
Bring your own eval sets (packs)
An eval pack is a shareable evaluation: a manifest.yaml + a synthetic
dataset.jsonl that reuses Lodlina's vetted graders by referencing a curated
task type. Packs are data + configuration only — no third-party code is executed,
so every grader stays auditable.
lodlina new-pack ssn-heavy --task-type records-redaction # scaffold a valid starter
# ...drop your synthetic records into ssn-heavy/dataset.jsonl...
lodlina validate --pack ./ssn-heavy
lodlina run --pack ./ssn-heavy --model claude-sonnet-4-6
Third parties can distribute packs as pip-installable packages (discovered via the
lodlina_packs entry point). See src/lodlina/packs/builtin/README.md.
Synthetic data & honest limitations
- All data is synthetic. No real PII/CUI. Identifiers are unmistakably fake (SSNs in the never-issued `900–999` range, `555-01xx` phones, `example.com` emails). Generators are seeded; committed seed sets (~12–18 samples/task) let it run out of the box.
- Synthetic ≠ representative. Templated documents are cleaner than real agency records; scores indicate capability on a controlled proxy, not certified production performance.
- Model-graded components inherit grader limits. They're constrained and deterministically backed where possible, but not infallible; read them as grader-relative. The deterministic headlines are the grader-independent signal.
- U.S. federal framing, English. A starting point, not a complete map of government work.
- Not legal advice or an authorization to deploy. Lodlina is an evaluation instrument, not a compliance certification.
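To make "seeded" and "unmistakably fake" concrete, here is a small sketch of a reproducible identifier generator that follows the conventions listed above (SSN area numbers 900–999, `555-01xx` phone numbers, `example.com` addresses). The function names and record fields are hypothetical and are not Lodlina's actual generator.

```python
import random


def fake_identity(rng: random.Random) -> dict[str, str]:
    """Build one obviously synthetic identity from the given RNG."""
    area = rng.randint(900, 999)    # area numbers 900-999 are never issued
    group = rng.randint(1, 99)
    serial = rng.randint(1, 9999)
    return {
        "ssn": f"{area:03d}-{group:02d}-{serial:04d}",
        "phone": f"555-01{rng.randint(0, 99):02d}",   # reserved fictional range
        "email": f"user{rng.randint(1000, 9999)}@example.com",
    }


def generate(seed: int, n: int) -> list[dict[str, str]]:
    """The same seed always yields the same records, so runs are reproducible."""
    rng = random.Random(seed)
    return [fake_identity(rng) for _ in range(n)]


batch = generate(seed=7, n=3)
print(batch == generate(seed=7, n=3))  # True: deterministic for a fixed seed
```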
Roadmap & backlog
The plan of record (phases, design decisions, eval-pack format) is in
docs/ROADMAP.md. Tasks still on the methodology bench:
political-neutrality (symmetric paired prompts), Section-508 alt-text, FOIA
exemption-reasoning, and abstention on unanswerable questions.
Develop & contribute
git clone https://github.com/Lodlina/Lodlina && cd Lodlina
uv venv && uv pip install -e ".[dev]"
pytest # 50+ tests, fully offline
The test suite runs the full Inspect pipeline offline — each task is driven
end-to-end by a mock model, so all grading logic is verified without credentials or
network. Conventions mirror
inspect_evals.
License
MIT.