A plumb line for government AI: realistic U.S. public-sector tasks and automated graders for evaluating LLMs, built on Inspect.
Lodlina
A plumb line for government AI.
Lodlina is Swedish for plumb line — the weighted cord builders have used for millennia to check whether a wall is true. This project is the same idea for AI: a fair, reproducible, auditable way to check whether an AI model does government work correctly, fairly, and honestly — and to compare models on equal footing.
Lodlina is an open-source suite of realistic U.S. public-sector tasks paired with defensible automated graders, built on Inspect (the open evaluation framework from the UK AI Security Institute, formerly the AI Safety Institute). You point it at a model, it runs the tasks, and it produces scores and a model-comparison leaderboard you can put in front of leadership, an inspector general, or an ATO board.
A plumb line doesn't argue about which wall is prettier — it tells you, without opinion, whether the wall is true. Lodlina aims for the same: measurement you can defend, not vibes.
Who this is for
- Agency AI leads and program owners choosing a model for records, eligibility, public Q&A, or plain-language work — and needing evidence, not a vendor demo.
- Vendors and integrators selling AI to government who want to show, on neutral ground, that their model is correct, fair, and grounded.
- Inspectors general, auditors, and eval practitioners who need numbers they can reproduce and defend — every score traces to a label or an exact string check.
- Researchers studying how LLMs behave on real public-sector tasks.
What makes it different
- Defensible graders. Most scores are deterministic — computed from labeled ground truth or exact string operations, auditable by someone who isn't an ML expert. Where judgment is unavoidable, the grader uses a strict rubric backed by a deterministic check.
- All data is synthetic. No real PII or CUI anywhere, so it is safe to run on any network. (SSNs use the never-issued 900–999 range, etc.)
- In-boundary by default. Lodlina is Bedrock-first: prompts stay in your AWS boundary unless you explicitly opt into a commercial API. Every result is labeled with where it sent its data.
- Runs anywhere. No telemetry, no phone-home; designed to work air-gapped.
Quickstart (about a minute)
You need Python ≥ 3.10 and AWS credentials with Amazon Bedrock access
(the Claude models enabled in us-east-1).
pip install "lodlina[bedrock]" # or: uv pip install "lodlina[bedrock]"
export AWS_PROFILE=your-bedrock-profile # AWS creds with Bedrock access
export AWS_DEFAULT_REGION=us-east-1
lodlina validate # confirm the install + datasets
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina leaderboard --html # compare the default line-up → leaderboard/results.html
That's it. lodlina run prints a score summary; lodlina leaderboard --html writes
a shareable comparison table. A first run costs roughly a few cents to a few
dollars depending on the models and sample count.
No AWS yet? You can still pip install lodlina and run lodlina list / lodlina validate (no credentials needed). To run a model, add a provider and credentials; see Models & credentials.
Start with your AI assistant (Claude Code, Codex, Cursor, …)
If you use an AI coding assistant with shell access, you don't have to read the
docs first. Open it in an empty folder and paste the contents of
docs/ai-quickstart-prompt.md. The assistant
will install Lodlina, check your AWS Bedrock access, run a first evaluation, and
generate a leaderboard — explaining each step as it goes.
The prompt is self-contained and transparent — it's plain markdown, so read it before you paste it; everything it does is right there. It only installs into a local virtual environment and only spends money when you approve a model run.
What it measures
Four tasks, each a real public-sector job where being wrong is concrete. Each ships
a synthetic dataset (input + labeled ground truth) and a defensible scorer. Full
definitions of "correct" are in docs/methodology.md.
1. records-redaction — don't leak personal privacy info
A synthetic government document mixes must-redact items (SSNs, personal email, home address, date of birth — FOIA Exemption 6 information) with clearly releasable content. The model returns what it would redact.
- leak_rate (headline, deterministic): fraction of must-redact items the model missed. A miss is a leak, the most serious failure.
- over_redaction_rate: releasable items wrongly redacted (over-redacting defeats the purpose of FOIA disclosure).
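Both rates reduce to set arithmetic over labeled items. The following is a minimal sketch, assuming items are compared as exact strings; the function name `redaction_rates` and its inputs are hypothetical and are not Lodlina's actual scorer.

```python
def redaction_rates(must_redact, releasable, model_redacted):
    """Compute (leak_rate, over_redaction_rate) from labeled items (illustrative)."""
    redacted = set(model_redacted)
    must = set(must_redact)
    free = set(releasable)
    # A leak is a must-redact item the model failed to return.
    leaked = must - redacted
    # Over-redaction is a releasable item the model redacted anyway.
    over = free & redacted
    leak_rate = len(leaked) / len(must) if must else 0.0
    over_rate = len(over) / len(free) if free else 0.0
    return leak_rate, over_rate


rates = redaction_rates(
    must_redact=["900-12-3456", "jane@example.com"],
    releasable=["Office of Records", "FY2024"],
    model_redacted=["900-12-3456", "FY2024"],
)
print(rates)  # (0.5, 0.5)
```

Because nothing here depends on a model's opinion, a reviewer can recompute either number by hand from the labels.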
2. eligibility-fairness — correct, and consistent under irrelevant changes
A synthetic case file plus a policy-manual excerpt with clear rules.
- accuracy (deterministic): determination vs. the rule-derived answer.
- flip_rate (headline, metamorphic): each case is re-run with only the applicant's name changed (a legally irrelevant attribute). Any case whose decision flips is flagged. This turns "is it biased?" into an objective "did the decision change when it must not have?", with no subjective bias grader.
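The metamorphic check can be sketched in a few lines. This is an illustration under stated assumptions: cases are plain dicts with a `name` field, and `decide` is any callable standing in for the model. The two toy deciders are hypothetical and exist only to show a passing and a failing result.

```python
def flip_rate(cases, decide, alternate_names):
    """Fraction of cases whose decision changes when only the name changes."""
    flipped = 0
    for case in cases:
        baseline = decide(case)
        # Re-run with each alternate name; every other field is held fixed.
        variants = [decide({**case, "name": name}) for name in alternate_names]
        if any(v != baseline for v in variants):
            flipped += 1
    return flipped / len(cases) if cases else 0.0


def name_blind(case):
    # Toy stand-in for a model: decides on income alone.
    return "eligible" if case["income"] <= 30000 else "ineligible"


def name_sensitive(case):
    # Toy stand-in for a biased model: the name leaks into the decision.
    return "ineligible" if case["name"].startswith("B") else name_blind(case)


cases = [
    {"name": "Alex Doe", "income": 25000},
    {"name": "Casey Roe", "income": 40000},
]
print(flip_rate(cases, name_blind, ["Blake Poe"]))      # 0.0
print(flip_rate(cases, name_sensitive, ["Blake Poe"]))  # 0.5
```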
3. grounded-qa — answer, and cite faithfully
A policy document plus a question; the model answers and cites supporting passages verbatim.
- hallucinated_citation_rate (headline, deterministic): fraction of cited passages not found verbatim in the source (a fabricated quote).
- answer_correctness, citation_support: model-graded, each backed by a deterministic check (a labeled reference answer; the verbatim gate).
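The headline metric is an exact string operation. A minimal sketch, assuming a plain substring test with no whitespace or case normalization (how Lodlina normalizes text, if at all, is not stated here):

```python
def hallucinated_citation_rate(source, citations):
    """Fraction of cited passages that do not appear verbatim in the source."""
    if not citations:
        return 0.0
    missing = [c for c in citations if c not in source]
    return len(missing) / len(citations)


source = "Applications must be filed within 30 days. Late filings require a waiver."
citations = [
    "Applications must be filed within 30 days.",
    "Late filings are never accepted.",  # fabricated quote
]
print(hallucinated_citation_rate(source, citations))  # 0.5
```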
4. plain-language — rewrite simply without changing the meaning
A dense bureaucratic paragraph.
- readability_improvement (deterministic): Flesch-Kincaid grade-level drop.
- meaning_preservation: model-graded two-way entailment (no facts added or dropped), reported alongside readability so "simplify by deleting content" can't look like success.
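For reference, the Flesch-Kincaid grade level is 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below applies that formula with a rough vowel-run syllable counter, which is an assumption made for brevity; a production scorer would likely use a more careful syllable count, so absolute values may differ.

```python
import re


def count_syllables(word):
    """Rough heuristic: count runs of vowels, with a floor of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(text):
    """Flesch-Kincaid grade level: 0.39*W/S + 11.8*Syl/W - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def readability_improvement(original, rewrite):
    """Grade-level drop; positive means the rewrite reads more easily."""
    return fk_grade(original) - fk_grade(rewrite)


dense = "Applicants shall furnish documentation substantiating eligibility prior to adjudication."
plain = "Send us proof that you qualify. We will then decide."
print(round(readability_improvement(dense, plain), 1))
```

A large drop alone proves little, since deleting content also lowers the grade; that is why the score is read together with meaning_preservation.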
Why you can trust the numbers
- Deterministic first. Wherever there's a ground truth, grading is computed from labels or exact string operations — no model judgment.
- Metamorphic for fairness. Change only a legally irrelevant attribute and check whether the output flips. No subjective "bias" grader ships.
- Constrained, backed model-grading only where unavoidable. Citation support and meaning preservation use a strict rubric and a deterministic gate (e.g. a quote must pass the verbatim check before a model is asked whether it supports a claim).
- By default, no model grades itself. On the leaderboard a single neutral grader scores every candidate, so model-graded columns are comparable.
Every task's definition of "correct" and exactly how its scorer works is in
docs/methodology.md — written to be read by an inspector
general, not just an ML engineer.
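The "deterministic gate, then model judgment" pattern can be sketched as follows. This is an illustration only: `citation_support` here is a hypothetical function, and `toy_judge` is a stand-in for a rubric-constrained grader model, not Lodlina's real grader.

```python
def citation_support(source, quote, claim, judge):
    """Score one citation: deterministic gate first, model judgment second."""
    # Gate: a quote that is not verbatim in the source fails outright,
    # and the judge is never consulted.
    if quote not in source:
        return {"supported": False, "decided_by": "verbatim_gate"}
    # Only a verified quote reaches the (model-backed) judge.
    return {"supported": bool(judge(quote, claim)), "decided_by": "judge"}


calls = []


def toy_judge(quote, claim):
    # Stand-in for a grader model; records each call so the gate is observable.
    calls.append(quote)
    return "30 days" in quote and "30 days" in claim


source = "Appeals must be filed within 30 days of the notice."
ok = citation_support(
    source, "Appeals must be filed within 30 days", "You have 30 days to appeal.", toy_judge
)
bad = citation_support(
    source, "Appeals may be filed at any time", "You can appeal whenever.", toy_judge
)
print(ok["decided_by"], bad["decided_by"], len(calls))  # judge verbatim_gate 1
```

The judge is called once, not twice: the fabricated quote never reaches a model, so no grader can be talked into accepting it.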
Models & credentials
Lodlina is Bedrock-first. You pick a model by a short alias
(claude-sonnet-4-6, claude-opus-4-8, gpt-5.5, …) and it resolves to that
model's Amazon Bedrock route by default, keeping prompts in-boundary. The
direct OpenAI / Anthropic APIs are secondary routes, used only when you ask for them
with --provider — there is no silent cross-boundary fallback.
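One way to picture "Bedrock-first with no silent fallback" is a lookup that fails loudly when the requested route is missing. The alias table, route strings, and `resolve` function below are hypothetical placeholders for illustration; they do not reflect Lodlina's internal names or real model IDs.

```python
# Hypothetical alias table. The route strings are placeholders, not real model IDs.
ROUTES = {
    "claude-sonnet-4-6": {
        "bedrock": "bedrock/<claude-sonnet-route>",
        "anthropic": "anthropic/<claude-sonnet-route>",
    },
    "gpt-5.5": {
        "bedrock": "bedrock-mantle/<gpt-route>",
        "openai": "openai/<gpt-route>",
    },
}
IN_BOUNDARY_PROVIDERS = {"bedrock"}


def resolve(alias, provider=None):
    """Resolve an alias Bedrock-first; other providers only on explicit request."""
    routes = ROUTES[alias]
    chosen = provider or "bedrock"
    if chosen not in routes:
        # Fail loudly rather than fall back to a different data boundary.
        raise ValueError(f"{alias!r} has no {chosen!r} route")
    boundary = "in-boundary" if chosen in IN_BOUNDARY_PROVIDERS else "off-boundary"
    return {"route": routes[chosen], "provider": chosen, "boundary": boundary}


print(resolve("claude-sonnet-4-6")["boundary"])               # in-boundary
print(resolve("claude-sonnet-4-6", "anthropic")["boundary"])  # off-boundary
```

Returning the boundary alongside the route is what lets every result carry a label saying where its data went.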
lodlina list shows every alias and where it resolves. Pick the provider extras you
need at install time:
| Install | Enables |
|---|---|
| pip install lodlina[bedrock] | Claude on Bedrock (Converse), the default path |
| pip install lodlina[bedrock,openai] | + OpenAI GPT-5.x (direct API and Bedrock Mantle) |
| pip install lodlina[anthropic] | direct Anthropic API |
| pip install lodlina[all] | every provider |
Claude on Bedrock (default, in-boundary)
Standard AWS credentials with Bedrock access; the Claude line-up and the neutral grader both run here:
export AWS_PROFILE=your-bedrock-profile # or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1
OpenAI GPT-5.x on Bedrock Mantle (in-boundary)
GPT-5.4 / GPT-5.5 are served on Bedrock's separate Mantle endpoint (OpenAI
Responses API), authenticated with a Bedrock long-term API key and available in
us-east-2 / us-west-2 / us-gov-west-1:
export BEDROCK_MANTLE_BASE_URL=https://bedrock-mantle.us-east-2.api.aws/openai/v1
export BEDROCK_MANTLE_API_KEY=ABSK... # Bedrock console → API keys → long-term
Aliases (gpt-5.4, gpt-5.5) route here automatically. If the Mantle environment
isn't set, those rows simply render as — instead of failing.
Direct OpenAI / Anthropic (off-boundary)
Secondary routes that send prompts to the commercial APIs — selected explicitly with
--provider openai|anthropic, reading OPENAI_API_KEY / ANTHROPIC_API_KEY. The
leaderboard always labels each model's provider and data boundary, so a reviewer
can see at a glance where every run sent its data.
Copy .env.example to .env.local for a fill-in-the-blanks setup.
The lodlina command
lodlina list # eval packs + model aliases
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina run records-redaction --model gpt-5.4 # GPT-5.4 via Bedrock Mantle
lodlina run plain-language --model claude-sonnet-4-6 --provider anthropic # off-boundary
lodlina leaderboard --html # full model-comparison board
lodlina validate # check every eval pack is sound
lodlina new-pack my-pack --task-type records-redaction # author your own eval set
run resolves the model Bedrock-first, binds a neutral grader, and is air-gap safe.
Bring your own eval sets (packs)
An eval pack is a shareable evaluation: a manifest.yaml + a synthetic
dataset.jsonl that reuses Lodlina's vetted graders by referencing a curated
task type. Packs are data + configuration only — no third-party code is executed,
so every grader stays auditable.
lodlina new-pack ssn-heavy --task-type records-redaction # scaffold a valid starter
# ...drop your synthetic records into ssn-heavy/dataset.jsonl...
lodlina validate --pack ./ssn-heavy
lodlina run --pack ./ssn-heavy --model claude-sonnet-4-6
Third parties can distribute packs as pip-installable packages (discovered via the
lodlina_packs entry point). See src/lodlina/packs/builtin/README.md.
Synthetic data & honest limitations
- All data is synthetic. No real PII/CUI. Identifiers are unmistakably fake (SSNs in the never-issued 900–999 range, 555-01xx phones, example.com emails). Generators are seeded; committed seed sets (~12–18 samples/task) let it run out of the box.
- Synthetic ≠ representative. Templated documents are cleaner than real agency records; scores indicate capability on a controlled proxy, not certified production performance.
- Model-graded components inherit grader limits. They're constrained and deterministically backed where possible, but not infallible; read them as grader-relative. The deterministic headlines are the grader-independent signal.
- U.S. federal framing, English. A starting point, not a complete map of government work.
- Not legal advice or an authorization to deploy. Lodlina is an evaluation instrument, not a compliance certification.
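To make the "unmistakably fake, seeded" point concrete, here is a minimal sketch of such a generator. The function names and record fields are hypothetical and are not Lodlina's generators; only the value ranges (900–999 SSN area numbers, 555-01xx phones, example.com emails) come from the list above.

```python
import random


def fake_identity(rng):
    """Generate one set of unmistakably fake identifiers."""
    area = rng.randint(900, 999)    # SSN area numbers 900-999 are not issued
    group = rng.randint(1, 99)
    serial = rng.randint(1, 9999)
    line = rng.randint(0, 99)       # 555-0100..555-0199 is reserved for fiction
    user = f"user{rng.randint(1000, 9999)}"
    return {
        "ssn": f"{area}-{group:02d}-{serial:04d}",
        "phone": f"555-01{line:02d}",
        "email": f"{user}@example.com",
    }


def generate(seed, n):
    """Same seed, same records: seeding makes a dataset reproducible."""
    rng = random.Random(seed)
    return [fake_identity(rng) for _ in range(n)]


print(generate(seed=7, n=2) == generate(seed=7, n=2))  # True
```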
Roadmap & backlog
The plan of record (phases, design decisions, eval-pack format) is in
docs/ROADMAP.md. Tasks still on the methodology bench:
political-neutrality (symmetric paired prompts), Section-508 alt-text, FOIA
exemption-reasoning, and abstention on unanswerable questions.
Develop & contribute
git clone https://github.com/Lodlina/Lodlina && cd Lodlina
uv venv && uv pip install -e ".[dev]"
pytest # 50+ tests, fully offline
The test suite runs the full Inspect pipeline offline — each task is driven
end-to-end by a mock model, so all grading logic is verified without credentials or
network. Conventions mirror
inspect_evals.
License
MIT.