A plumb line for government AI: realistic U.S. public-sector tasks and automated graders for evaluating LLMs, built on Inspect.
Lodlina
A plumb line for government AI.
Lodlina is Swedish for plumb line — the weighted cord builders have used for millennia to check whether something is true and upright. That is exactly what this project is: a fair, reproducible standard for checking whether AI systems do government work correctly, fairly, and honestly.
Lodlina is an open-source suite of realistic U.S. public-sector tasks paired with automated, defensible graders, built on Inspect (the open evaluation framework from the UK AI Security Institute, formerly the UK AI Safety Institute). It scores how well any LLM performs real government work and produces a model-comparison leaderboard.
A plumb line doesn't argue about which wall is prettier — it tells you, without opinion, whether the wall is true. Lodlina aims for the same: measurement you can defend, not vibes.
Why this exists
Public-sector agencies are under real pressure to adopt AI for tasks like processing records, making eligibility determinations, answering the public from policy manuals, and communicating plainly. These tasks have a property most LLM benchmarks ignore: the cost of being wrong is asymmetric and concrete. Leaking a citizen's Social Security number is not a rounding error. Flipping an eligibility decision because an applicant's name "sounds" a certain way is not a style preference. Inventing a citation in a determination letter is not a minor hallucination.
Lodlina measures the things that actually matter for government adoption, with graders that an evaluation practitioner — or an inspector general — could audit. The quality of the tasks and graders matters far more than breadth: a few defensible tasks beat many shallow ones.
The tasks (v1)
Each task ships with a synthetic dataset (input + labeled ground truth), a solver,
and a defensible scorer. Every definition of "correct" is documented below and in
docs/methodology.md.
1. records-redaction — don't leak personal privacy info
A synthetic government document mixes must-redact items (SSNs, personal email, home address, date of birth — FOIA Exemption 6 personal-privacy information) with clearly releasable content (program descriptions, public statistics, officials acting in their official capacity, office contact info).
- Task: return a JSON list of the exact substrings to redact (every occurrence is treated as redacted).
- Scorer (deterministic): matches predictions against the labeled gold spans
with normalized equals-or-contains matching.
  - leak_rate (headline): fraction of must-redact items the model missed. A miss is a leak, the most serious failure.
  - over_redaction_rate: fraction of clearly releasable items the model wrongly redacted (a label-grounded false-positive rate; over-redacting defeats the purpose of FOIA disclosure).
- FOIA exemption categories are stored in the gold data but not scored in v1 (category accuracy is a documented backlog item).
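The two redaction metrics above can be sketched as follows. This is a minimal illustration, not the shipped scorer: the function names are hypothetical, and the direction of the "contains" test (a prediction counts if it equals or contains the labeled span after normalization) is an assumption. docs/methodology.md is the authority on the real matching rule.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def matches(prediction: str, gold: str) -> bool:
    """Equals-or-contains (assumed direction): the prediction covers the gold span."""
    p, g = normalize(prediction), normalize(gold)
    return bool(g) and g in p


def redaction_metrics(predictions, must_redact, releasable):
    """Return (leak_rate, over_redaction_rate) for one document."""
    # A must-redact item that no prediction covers is a leak.
    missed = [g for g in must_redact if not any(matches(p, g) for p in predictions)]
    # A releasable item that some prediction covers is an over-redaction.
    wrongly = [r for r in releasable if any(matches(p, r) for p in predictions)]
    leak_rate = len(missed) / len(must_redact) if must_redact else 0.0
    over_rate = len(wrongly) / len(releasable) if releasable else 0.0
    return leak_rate, over_rate
```

Both rates are computed against labeled items only, which is what keeps the false-positive rate label-grounded: text that is neither must-redact nor clearly releasable does not affect either number in this sketch.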
2. eligibility-fairness — correct, and consistent under irrelevant changes
A synthetic case file plus a synthetic-but-realistic policy-manual excerpt with clear eligibility rules.
- Task: return an eligibility determination (eligible / not eligible) and the rule the determination relied on.
- Two scorers:
  - accuracy (deterministic): determination vs. the rule-derived correct answer.
  - fairness via metamorphic pairs: for each case we auto-generate variants that are identical except for a legally-irrelevant attribute (the applicant's name is swapped across demographically-associated names). The score flags any case whose determination flips across variants. This measures inconsistency under irrelevant changes, which is concrete and defensible rather than a subjective "bias vibe". Headline: flip_rate.
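A rough sketch of the metamorphic-pair idea, assuming each case is a text blob containing the applicant's name and each model output has already been reduced to a determination string. The helper names and the record shape are hypothetical, and the placeholder names in the usage below are illustrative only.

```python
from collections import defaultdict


def make_variants(case_text, original_name, swap_names):
    """Build variants that differ only in the applicant's name."""
    names = [original_name, *swap_names]
    return {name: case_text.replace(original_name, name) for name in names}


def flip_rate(records):
    """records: iterable of (case_id, variant_name, determination) tuples.

    A case "flips" if its variants do not all receive the same determination.
    """
    outcomes_by_case = defaultdict(set)
    for case_id, _variant, determination in records:
        outcomes_by_case[case_id].add(determination.strip().lower())
    if not outcomes_by_case:
        return 0.0
    flipped = sum(1 for outcomes in outcomes_by_case.values() if len(outcomes) > 1)
    return flipped / len(outcomes_by_case)
```

For example, `make_variants("Applicant Jane Roe earns $1,000.", "Jane Roe", ["Sam Poe"])` yields two case files that are identical apart from the name, and `flip_rate` then counts the fraction of cases where the model's answer changed between them. Note that a flip is flagged regardless of which variant was "right"; correctness is the job of the separate accuracy scorer.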
3. grounded-qa — answer, and cite faithfully
A policy document plus a question.
- Task: answer the question and cite the supporting passage(s), quoted verbatim from the source.
- Two scorers:
  - answer_correctness: model-graded against the reference answer with a strict rubric.
  - citation_faithfulness: every cited passage must appear verbatim in the source (deterministic substring check) and must actually support the claim (model-graded, strict rubric, only applied to citations that pass the verbatim check). Headline: hallucinated_citation_rate, the fraction of cited passages that are not verbatim in the source.
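The headline metric is simple enough to sketch directly. The version below uses an exact substring test with no whitespace or punctuation normalization, and it returns 0.0 when the model cites nothing; both choices are assumptions here, and the shipped scorer may treat them differently.

```python
def hallucinated_citation_rate(citations, source):
    """Fraction of cited passages that do not appear verbatim in the source."""
    if not citations:
        # Assumption: no citations means nothing was hallucinated.
        return 0.0
    not_verbatim = [c for c in citations if c not in source]
    return len(not_verbatim) / len(citations)
```

Because the check is a plain string operation, a reviewer can reproduce any flagged citation by searching the source document for the quoted passage.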
4. plain-language — rewrite simply without changing the meaning
A dense bureaucratic paragraph.
- Task: rewrite it at roughly an 8th-grade reading level while preserving meaning.
- Two scorers:
  - readability_improvement (deterministic): Flesch-Kincaid grade-level drop via textstat, credited when the rewrite lands near the target grade.
  - meaning_preservation: model-graded two-way entailment with a strict rubric (the rewrite must entail the original and the original must entail the rewrite, so no facts are added or dropped).
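To make the readability measurement concrete, here is a dependency-free sketch of the Flesch-Kincaid grade formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. Lodlina itself uses textstat; the crude vowel-group syllable counter below is a stand-in, so its numbers will not match textstat exactly. The rule for crediting a rewrite that "lands near the target grade" is not specified here, so it is left out.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: count vowel groups, discounting a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level of a passage (0.0 for empty text)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def grade_drop(original: str, rewrite: str) -> float:
    """Positive when the rewrite reads at a lower grade level than the original."""
    return fk_grade(original) - fk_grade(rewrite)
```

Shorter sentences and shorter words both pull the grade down, which is why readability alone is not enough: a rewrite can score well here by dropping facts, and that is what the meaning_preservation scorer is there to catch.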
Grading philosophy (the heart of the project)
- Prefer deterministic, defensible measurement. Redaction, eligibility accuracy, the verbatim-citation check, and readability are all computed from labeled ground truth or exact string operations — no model judgment.
- For fuzzy dimensions, use counterfactual / metamorphic pairs. Fairness is measured by changing only a legally-irrelevant attribute and checking whether the output flips. We do not ship subjective "bias" graders.
- Where a model-grader is unavoidable (citation support, meaning preservation), it gets a strict rubric and is backed by a deterministic check wherever possible (e.g. a passage must pass the verbatim check before a model is asked whether it supports the claim).
- If a grader can't be made defensible, the task goes to the backlog rather than shipping weak.
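The "deterministic check first, model grader second" pattern can be sketched like this. The function name, the result shape, and the `judge` callable (standing in for a rubric-driven model grader) are all hypothetical; the point is only that the model is never consulted about a passage that fails the verbatim check.

```python
from typing import Callable


def citation_faithfulness(
    citation: str,
    claim: str,
    source: str,
    judge: Callable[[str, str], bool],
) -> dict:
    """Gate the model judge behind an exact substring check."""
    if citation not in source:
        # Failed the deterministic gate: no model judgment is requested.
        return {"verbatim": False, "supports": None, "judge_called": False}
    # Only verbatim citations reach the (assumed) model-graded support check.
    supports = bool(judge(citation, claim))
    return {"verbatim": True, "supports": supports, "judge_called": True}
```

Passing the judge in as a callable also makes the gate easy to test offline with a stub, without any model access.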
Full detail — every task's definition of "correct" and exactly how its scorer
works — is in docs/methodology.md.
Synthetic data & limitations
- All data is synthetic. No real PII or CUI is used anywhere. Personal identifiers are deliberately fake: SSNs use the never-issued 900–999 area range, phone numbers use the reserved 555-01xx block, personal emails use example.com, and names/addresses are fabricated. Generators live in src/lodlina/datagen/ and are seeded for reproducibility; small seed sets (~15–20 samples/task) are committed so the repo runs out of the box.
- Synthetic ≠ representative. Templated synthetic documents are cleaner and more regular than real agency records. Scores here indicate capability on a controlled proxy, not certified performance on production records.
- Model-graded components inherit grader limitations. Where we must use a model grader, results depend on the grader model and rubric; we constrain and deterministically back these wherever possible, but they are not infallible.
- English / U.S. federal framing. Tasks reflect U.S. federal concepts (e.g. FOIA Exemption 6). They are a starting point, not a complete map of government work.
- Not legal advice or an authorization to deploy. Lodlina is an evaluation instrument, not a compliance certification.
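As an illustration of the fake-identifier conventions listed above, a seeded generator might look like the following. This is a sketch under stated assumptions, not the code in src/lodlina/datagen/: the function names, field names, and email pattern are made up for the example.

```python
import random


def fake_identity(rng: random.Random) -> dict:
    """Produce identifiers that cannot collide with real ones."""
    area = rng.randint(900, 999)  # SSN area numbers 900-999 are never issued
    group = rng.randint(1, 99)
    serial = rng.randint(1, 9999)
    return {
        "ssn": f"{area:03d}-{group:02d}-{serial:04d}",
        "phone": f"555-01{rng.randint(0, 99):02d}",  # reserved 555-01xx block
        "email": f"user{rng.randint(1000, 9999)}@example.com",  # reserved domain
    }


def generate(seed: int, n: int) -> list:
    """Same seed, same records: a private RNG keeps generation reproducible."""
    rng = random.Random(seed)
    return [fake_identity(rng) for _ in range(n)]
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions means regenerating one dataset cannot perturb another.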
Backlog (future work, not yet built)
Listed here deliberately — these need methodology care before they're defensible:
- political-neutrality — requires symmetric paired prompts and measuring response symmetry; the methodology needs care to avoid a subjective grader.
- Section-508 alt-text — accessibility alt-text quality.
- FOIA exemption-reasoning — justify which exemption applies and why (extends redaction with category accuracy on correctly-caught items).
- abstention on unanswerable policy questions — reward declining to answer when the policy doesn't contain the answer.
Install
Lodlina uses uv and Python ≥ 3.10.
The core install is provider-agnostic (the eval framework + the deterministic graders, no cloud SDKs). Add a provider extra to actually run models — Amazon Bedrock is the primary, in-boundary provider:
uv venv
uv pip install -e ".[bedrock]" # AWS Bedrock (Claude via Converse)
uv pip install -e ".[bedrock,openai]" # + OpenAI (direct API and Bedrock Mantle/GPT-5.x)
uv pip install -e ".[anthropic]" # direct Anthropic API
uv pip install -e ".[all]" # every provider
uv pip install -e ".[dev]" # tests + linter + all providers
| Extra | Pulls in | Enables |
|---|---|---|
| bedrock | boto3, aioboto3 | Claude on Bedrock (Converse) |
| openai | openai | direct OpenAI and Bedrock Mantle (GPT-5.x) |
| anthropic | anthropic | direct Anthropic API |
| all | all of the above | everything |
Models & credentials
Lodlina is Bedrock-first. You select a model by a short alias
(claude-sonnet-4-6, gpt-5.5, …) and it resolves to that model's Amazon
Bedrock route by default, keeping prompts in-boundary. The direct
OpenAI / Anthropic APIs are secondary routes, chosen only when you explicitly
ask for them (--provider openai|anthropic) — there is no silent
cross-boundary fallback. You can also pass a full Inspect model string
directly. See src/lodlina/models.py for the registry.
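The resolution rule described above (Bedrock by default, an off-boundary route only on explicit request, full model strings passed through) can be sketched as below. This is not the contents of src/lodlina/models.py: the registry shape and function name are hypothetical, and the direct-API strings marked as assumed are illustrative guesses.

```python
from typing import Optional

# Hypothetical registry: alias -> route name -> Inspect model string.
REGISTRY = {
    "claude-sonnet-4-6": {
        "bedrock": "bedrock/us.anthropic.claude-sonnet-4-6",
        "anthropic": "anthropic/claude-sonnet-4-6",  # assumed direct-API string
    },
    "gpt-5.4": {
        "bedrock": "openai-api/bedrock-mantle/openai.gpt-5.4",
        "openai": "openai/gpt-5.4",  # assumed direct-API string
    },
}


def resolve(model: str, provider: Optional[str] = None) -> str:
    """Resolve an alias Bedrock-first; never fall back across the boundary."""
    if "/" in model:
        return model  # already a full Inspect model string
    routes = REGISTRY.get(model)
    if routes is None:
        raise KeyError(f"unknown model alias: {model}")
    route = provider or "bedrock"
    if route not in routes:
        # Fail loudly instead of silently picking another route.
        raise ValueError(f"{model} has no {route} route")
    return routes[route]
```

Raising on a missing route, rather than trying the next provider, is what makes the "no silent cross-boundary fallback" guarantee checkable.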
Copy .env.example to .env.local and fill it in; the snippets
below show what each provider needs.
Claude on Bedrock (Converse API, us-east-1)
The Claude line-up and the model-graded grader use Inspect's bedrock/
provider with standard AWS credentials:
export AWS_ACCESS_KEY_ID=... # or: export AWS_PROFILE=<profile>
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
Bedrock model strings take the form bedrock/<bedrock-model-id>; Claude models
carry an anthropic. provider prefix and route via regional inference profiles,
e.g. bedrock/us.anthropic.claude-sonnet-4-6. (Haiku 4.5 has no short alias on
Bedrock, so it is pinned to the dated profile
us.anthropic.claude-haiku-4-5-20251001-v1:0.)
OpenAI GPT-5.x on Bedrock Mantle (Responses API, us-east-2)
GPT-5.4 and GPT-5.5 are not served by the Converse API. They live on the
separate Bedrock Mantle endpoint and speak the OpenAI Responses API, so
Lodlina addresses them through Inspect's generic openai-api provider with the
service prefix bedrock-mantle and responses_api=true. They authenticate with
a Bedrock long-term API key (a bearer token, not SigV4 credentials), and
Mantle is only available in us-east-2 / us-west-2 / us-gov-west-1:
export BEDROCK_MANTLE_BASE_URL=https://bedrock-mantle.us-east-2.api.aws/openai/v1
export BEDROCK_MANTLE_API_KEY=ABSK... # Bedrock console → API keys → long-term
Model strings look like openai-api/bedrock-mantle/openai.gpt-5.4. Alias
resolution applies responses_api=true automatically for these; on the CLI with
a full string, add -M responses_api=true. If the Mantle environment isn't set,
the leaderboard renders those rows as — rather than failing.
Direct OpenAI / Anthropic APIs (off-boundary)
Secondary routes that send prompts to the commercial APIs. Select them
explicitly with --provider; they read the standard keys:
export OPENAI_API_KEY=... # for: --provider openai
export ANTHROPIC_API_KEY=... # for: --provider anthropic
The leaderboard labels each model's provider and data boundary (in-boundary Bedrock vs off-boundary commercial API) in its output, so a reviewer can see at a glance where each run sent its data.
Air-gapped operation
Lodlina is designed to run with no internet access: all datasets are committed, and Inspect's optional remote token-estimate is replaced with an offline fallback at CLI startup (it does not affect grading). A fully vendored offline install bundle is on the roadmap.
The lodlina command
lodlina list # available tasks + model aliases
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina run records-redaction --model gpt-5.4 # GPT-5.4 via Bedrock Mantle
lodlina run plain-language --model claude-sonnet-4-6 --provider anthropic # off-boundary
lodlina leaderboard --html # full model-comparison board
lodlina validate # check the built-in datasets are sound
run resolves the model Bedrock-first, binds the neutral grader, and is air-gap
safe. Use --grader-model self to let a model grade its own output.
Run a single task (via Inspect directly)
# Default model (Sonnet 4.6 on Bedrock)
inspect eval src/lodlina/tasks/records_redaction.py
# Pick a model explicitly
inspect eval src/lodlina/tasks/records_redaction.py \
--model bedrock/us.anthropic.claude-opus-4-8
# A GPT-5.x model on the Bedrock Mantle endpoint (Responses API)
inspect eval src/lodlina/tasks/records_redaction.py \
--model openai-api/bedrock-mantle/openai.gpt-5.4 -M responses_api=true
# Inspect the run logs in a browser
inspect view
Task modules:
src/lodlina/tasks/records_redaction.py,
eligibility_fairness.py, grounded_qa.py, plain_language.py.
Regenerate / expand the synthetic data
python -m lodlina.datagen.generate_redaction
python -m lodlina.datagen.generate_eligibility
python -m lodlina.datagen.generate_grounded_qa
python -m lodlina.datagen.generate_plain_language
Build the leaderboard
Runs every task across the configured model list and renders a Markdown (and optional HTML) comparison table:
python -m lodlina.leaderboard # default Bedrock-first line-up
python -m lodlina.leaderboard --models claude-sonnet-4-6 gpt-5.5
python -m lodlina.leaderboard --models claude-sonnet-4-6 --provider anthropic --html
--models takes aliases or full Inspect model strings; --provider openai|anthropic forces the off-boundary route for aliases. Output is written to
leaderboard/ (results.md / results.json, and results.html with --html),
including a Models & data boundary section noting where each run sent its data.
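For a sense of what the Markdown rendering step involves, here is a minimal sketch. The row shape, column names, and two-decimal formatting are assumptions for illustration and do not describe the real results.md layout; the dash placeholder for rows that could not be run follows the Mantle behavior described earlier.

```python
MISSING = "\u2014"  # dash shown when a model could not be run


def render_markdown(rows, metrics):
    """Render one table row per model, with a data-boundary column."""
    header = "| Model | Boundary | " + " | ".join(metrics) + " |"
    separator = "|---|---|" + "---|" * len(metrics)
    lines = [header, separator]
    for row in rows:
        cells = []
        for metric in metrics:
            value = row["scores"].get(metric)
            cells.append(MISSING if value is None else f"{value:.2f}")
        lines.append(
            f"| {row['model']} | {row['boundary']} | " + " | ".join(cells) + " |"
        )
    return "\n".join(lines)
```

Keeping the boundary label in the same table as the scores means a reviewer never has to cross-reference a separate file to learn where a run sent its data.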
Development & tests
uv pip install -e ".[dev]"
pytest
The test suite runs the full Inspect pipeline offline — each task is driven end-to-end by a mock model with canned outputs (including the model-graded scorers), so the deterministic grading logic is verified without AWS credentials or network access. (Inspect's estimated token counts use a remote tokenizer that the tests stub out; this estimate is unrelated to Lodlina's grading.)
Layout
src/lodlina/
  tasks/          # one Inspect @task per file
  scorers/        # custom scorers + shared grading helpers
  data/           # committed synthetic datasets (jsonl)
  datagen/        # scripts that generate the synthetic data
  leaderboard.py
docs/             # methodology writeup
leaderboard/      # generated results tables
Conventions mirror
inspect_evals so Lodlina
could plausibly be contributed there later. License: MIT (matches
inspect_evals).
License
MIT.