Skip to main content

A plumb line for government AI: realistic U.S. public-sector tasks and automated graders for evaluating LLMs, built on Inspect.

Project description

Lodlina

A plumb line for government AI.

Lodlina is Swedish for plumb line — the weighted cord builders have used for millennia to check whether something is true and upright. That is exactly what this project is: a fair, reproducible standard for checking whether AI systems do government work correctly, fairly, and honestly.

Lodlina is an open-source suite of realistic U.S. public-sector tasks paired with automated, defensible graders, built on Inspect (the open evaluation framework from the UK AI Safety Institute). It scores how well any LLM performs real government work and produces a model-comparison leaderboard.

A plumb line doesn't argue about which wall is prettier — it tells you, without opinion, whether the wall is true. Lodlina aims for the same: measurement you can defend, not vibes.


Why this exists

Public-sector agencies are under real pressure to adopt AI for tasks like processing records, making eligibility determinations, answering the public from policy manuals, and communicating plainly. These tasks have a property most LLM benchmarks ignore: the cost of being wrong is asymmetric and concrete. Leaking a citizen's Social Security number is not a rounding error. Flipping an eligibility decision because an applicant's name "sounds" a certain way is not a style preference. Inventing a citation in a determination letter is not a minor hallucination.

Lodlina measures the things that actually matter for government adoption, with graders that an evaluation practitioner — or an inspector general — could audit. The quality of the tasks and graders matters far more than breadth: a few defensible tasks beat many shallow ones.


The tasks (v1)

Each task ships with a synthetic dataset (input + labeled ground truth), a solver, and a defensible scorer. Every definition of "correct" is documented below and in docs/methodology.md.

1. records-redactiondon't leak personal privacy info

A synthetic government document mixes must-redact items (SSNs, personal email, home address, date of birth — FOIA Exemption 6 personal-privacy information) with clearly releasable content (program descriptions, public statistics, officials acting in their official capacity, office contact info).

  • Task: return a JSON list of the exact substrings to redact (every occurrence is treated as redacted).
  • Scorer (deterministic): matches predictions against the labeled gold spans with normalized equals-or-contains matching.
    • leak_rate (headline) — fraction of must-redact items the model missed. A miss is a leak, the most serious failure.
    • over_redaction_rate — fraction of clearly-releasable items the model wrongly redacted (a label-grounded false-positive rate; over-redacting defeats the purpose of FOIA disclosure).
  • FOIA exemption categories are stored in the gold data but not scored in v1 (category accuracy is a documented backlog item).

2. eligibility-fairnesscorrect, and consistent under irrelevant changes

A synthetic case file plus a synthetic-but-realistic policy-manual excerpt with clear eligibility rules.

  • Task: return an eligibility determination (eligible / not eligible) and the rule the determination relied on.
  • Two scorers:
    • accuracy (deterministic) — determination vs. the rule-derived correct answer.
    • fairness via metamorphic pairs — for each case we auto-generate variants that are identical except for a legally-irrelevant attribute (the applicant's name is swapped across demographically-associated names). The score flags any case whose determination flips across variants. This measures inconsistency on irrelevant changes — concrete and defensible, not a subjective "bias vibe". Headline: flip_rate.

3. grounded-qaanswer, and cite faithfully

A policy document plus a question.

  • Task: answer the question and cite the supporting passage(s), quoted verbatim from the source.
  • Two scorers:
    • answer_correctness — model-graded against the reference answer with a strict rubric.
    • citation_faithfulness — every cited passage must appear verbatim in the source (deterministic substring check) and must actually support the claim (model-graded, strict rubric, only applied to citations that pass the verbatim check). Headline: hallucinated_citation_rate — the fraction of cited passages that are not verbatim in the source.

4. plain-languagerewrite simply without changing the meaning

A dense bureaucratic paragraph.

  • Task: rewrite it at roughly an 8th-grade reading level while preserving meaning.
  • Two scorers:
    • readability_improvement (deterministic) — Flesch-Kincaid grade-level drop via textstat, credited when the rewrite lands near the target grade.
    • meaning_preservation — model-graded two-way entailment with a strict rubric (the rewrite must entail the original and the original must entail the rewrite — no added or dropped facts).

Grading philosophy (the heart of the project)

  1. Prefer deterministic, defensible measurement. Redaction, eligibility accuracy, the verbatim-citation check, and readability are all computed from labeled ground truth or exact string operations — no model judgment.
  2. For fuzzy dimensions, use counterfactual / metamorphic pairs. Fairness is measured by changing only a legally-irrelevant attribute and checking whether the output flips. We do not ship subjective "bias" graders.
  3. Where a model-grader is unavoidable (citation support, meaning preservation), it gets a strict rubric and is backed by a deterministic check wherever possible (e.g. a passage must pass the verbatim check before a model is asked whether it supports the claim).
  4. If a grader can't be made defensible, the task goes to the backlog rather than shipping weak.

Full detail — every task's definition of "correct" and exactly how its scorer works — is in docs/methodology.md.


Synthetic data & limitations

  • All data is synthetic. No real PII or CUI is used anywhere. Personal identifiers are deliberately fake: SSNs use the never-issued 900–999 area range, phone numbers use the reserved 555-01xx block, personal emails use example.com, and names/addresses are fabricated. Generators live in src/lodlina/datagen/ and are seeded for reproducibility; small seed sets (~15–20 samples/task) are committed so the repo runs out of the box.
  • Synthetic ≠ representative. Templated synthetic documents are cleaner and more regular than real agency records. Scores here indicate capability on a controlled proxy, not certified performance on production records.
  • Model-graded components inherit grader limitations. Where we must use a model grader, results depend on the grader model and rubric; we constrain and deterministically back these wherever possible, but they are not infallible.
  • English / U.S. federal framing. Tasks reflect U.S. federal concepts (e.g. FOIA Exemption 6). They are a starting point, not a complete map of government work.
  • Not legal advice or an authorization to deploy. Lodlina is an evaluation instrument, not a compliance certification.

Backlog (future work, not yet built)

Listed here deliberately — these need methodology care before they're defensible:

  • political-neutrality — requires symmetric paired prompts and measuring response symmetry; the methodology needs care to avoid a subjective grader.
  • Section-508 alt-text — accessibility alt-text quality.
  • FOIA exemption-reasoning — justify which exemption applies and why (extends redaction with category accuracy on correctly-caught items).
  • abstention on unanswerable policy questions — reward declining to answer when the policy doesn't contain the answer.

Install

Lodlina uses uv and Python ≥ 3.10.

The core install is provider-agnostic (the eval framework + the deterministic graders, no cloud SDKs). Add a provider extra to actually run models — Amazon Bedrock is the primary, in-boundary provider:

uv venv
uv pip install -e ".[bedrock]"     # AWS Bedrock (Claude via Converse)
uv pip install -e ".[bedrock,openai]"   # + OpenAI (direct API and Bedrock Mantle/GPT-5.x)
uv pip install -e ".[anthropic]"   # direct Anthropic API
uv pip install -e ".[all]"         # every provider
uv pip install -e ".[dev]"         # tests + linter + all providers
Extra Pulls in Enables
bedrock boto3, aioboto3 Claude on Bedrock (Converse)
openai openai direct OpenAI and Bedrock Mantle (GPT-5.x)
anthropic anthropic direct Anthropic API
all all of the above everything

Models & credentials

Lodlina is Bedrock-first. You select a model by a short alias (claude-sonnet-4-6, gpt-5.5, …) and it resolves to that model's Amazon Bedrock route by default, keeping prompts in-boundary. The direct OpenAI / Anthropic APIs are secondary routes, chosen only when you explicitly ask for them (--provider openai|anthropic) — there is no silent cross-boundary fallback. You can also pass a full Inspect model string directly. See src/lodlina/models.py for the registry.

Copy .env.example to .env.local and fill it in; the snippets below show what each provider needs.

Claude on Bedrock (Converse API, us-east-1)

The Claude line-up and the model-graded grader use Inspect's bedrock/ provider with standard AWS credentials:

export AWS_ACCESS_KEY_ID=...        # or: export AWS_PROFILE=<profile>
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1

Bedrock model strings take the form bedrock/<bedrock-model-id>; Claude models carry an anthropic. provider prefix and route via regional inference profiles, e.g. bedrock/us.anthropic.claude-sonnet-4-6. (Haiku 4.5 has no short alias on Bedrock, so it is pinned to the dated profile us.anthropic.claude-haiku-4-5-20251001-v1:0.)

OpenAI GPT-5.x on Bedrock Mantle (Responses API, us-east-2)

GPT-5.4 and GPT-5.5 are not served by the Converse API. They live on the separate Bedrock Mantle endpoint and speak the OpenAI Responses API, so Lodlina addresses them through Inspect's generic openai-api provider with the service prefix bedrock-mantle and responses_api=true. They authenticate with a Bedrock long-term API key (a bearer token, not SigV4 credentials), and Mantle is only available in us-east-2 / us-west-2 / us-gov-west-1:

export BEDROCK_MANTLE_BASE_URL=https://bedrock-mantle.us-east-2.api.aws/openai/v1
export BEDROCK_MANTLE_API_KEY=ABSK...   # Bedrock console → API keys → long-term

Model strings look like openai-api/bedrock-mantle/openai.gpt-5.4. Alias resolution applies responses_api=true automatically for these; on the CLI with a full string, add -M responses_api=true. If the Mantle environment isn't set, the leaderboard renders those rows as rather than failing.

Direct OpenAI / Anthropic APIs (off-boundary)

Secondary routes that send prompts to the commercial APIs. Select them explicitly with --provider; they read the standard keys:

export OPENAI_API_KEY=...      # for: --provider openai
export ANTHROPIC_API_KEY=...   # for: --provider anthropic

The leaderboard labels each model's provider and data boundary (in-boundary Bedrock vs off-boundary commercial API) in its output, so a reviewer can see at a glance where each run sent its data.

Air-gapped operation

Lodlina is designed to run with no internet access: all datasets are committed, and Inspect's optional remote token-estimate is replaced with an offline fallback at CLI startup (it does not affect grading). A fully vendored offline install bundle is on the roadmap.

The lodlina command

lodlina list                          # available tasks + model aliases
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina run records-redaction --model gpt-5.4     # GPT-5.4 via Bedrock Mantle
lodlina run plain-language --model claude-sonnet-4-6 --provider anthropic  # off-boundary
lodlina leaderboard --html            # full model-comparison board
lodlina validate                      # check the built-in datasets are sound

run resolves the model Bedrock-first, binds the neutral grader, and is air-gap safe. Use --grader-model self to let a model grade its own output.

Run a single task (via Inspect directly)

# Default model (Sonnet 4.6 on Bedrock)
inspect eval src/lodlina/tasks/records_redaction.py

# Pick a model explicitly
inspect eval src/lodlina/tasks/records_redaction.py \
  --model bedrock/us.anthropic.claude-opus-4-8

# A GPT-5.x model on the Bedrock Mantle endpoint (Responses API)
inspect eval src/lodlina/tasks/records_redaction.py \
  --model openai-api/bedrock-mantle/openai.gpt-5.4 -M responses_api=true

# Inspect the run logs in a browser
inspect view

Task modules: src/lodlina/tasks/records_redaction.py, eligibility_fairness.py, grounded_qa.py, plain_language.py.

Regenerate / expand the synthetic data

python -m lodlina.datagen.generate_redaction
python -m lodlina.datagen.generate_eligibility
python -m lodlina.datagen.generate_grounded_qa
python -m lodlina.datagen.generate_plain_language

Build the leaderboard

Runs every task across the configured model list and renders a Markdown (and optional HTML) comparison table:

python -m lodlina.leaderboard                          # default Bedrock-first line-up
python -m lodlina.leaderboard --models claude-sonnet-4-6 gpt-5.5
python -m lodlina.leaderboard --models claude-sonnet-4-6 --provider anthropic --html

--models takes aliases or full Inspect model strings; --provider openai|anthropic forces the off-boundary route for aliases. Output is written to leaderboard/ (results.md / results.json, and results.html with --html), including a Models & data boundary section noting where each run sent its data.

Development & tests

uv pip install -e ".[dev]"
pytest

The test suite runs the full Inspect pipeline offline — each task is driven end-to-end by a mock model with canned outputs (including the model-graded scorers), so the deterministic grading logic is verified without AWS credentials or network access. (Inspect's estimated token counts use a remote tokenizer that the tests stub out; this estimate is unrelated to Lodlina's grading.)


Layout

src/lodlina/
  tasks/        # one Inspect @task per file
  scorers/      # custom scorers + shared grading helpers
  data/         # committed synthetic datasets (jsonl)
  datagen/      # scripts that generate the synthetic data
  leaderboard.py
docs/           # methodology writeup
leaderboard/    # generated results tables

Conventions mirror inspect_evals so Lodlina could plausibly be contributed there later. License: MIT (matches inspect_evals).


License

MIT.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lodlina-0.2.0.tar.gz (72.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lodlina-0.2.0-py3-none-any.whl (71.4 kB view details)

Uploaded Python 3

File details

Details for the file lodlina-0.2.0.tar.gz.

File metadata

  • Download URL: lodlina-0.2.0.tar.gz
  • Upload date:
  • Size: 72.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lodlina-0.2.0.tar.gz
Algorithm Hash digest
SHA256 306117a912cb3b47396bbcd375bca35da23fa57326a8a2ec45686ae06faeb747
MD5 e3c118d00e8cfcb71e5cadbf9c9ab032
BLAKE2b-256 c3e09a629fed978e87b023bca9a8724895077b8b7f1dc21ed8b149259c11ad57

See more details on using hashes here.

Provenance

The following attestation bundles were made for lodlina-0.2.0.tar.gz:

Publisher: release.yml on Lodlina/Lodlina

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lodlina-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: lodlina-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 71.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lodlina-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 85a81b017f35b10cea5535615c3d7d0b128b55cf2d1a060c7fa3478e0816154a
MD5 a93b11d02e45d9c1160168c6e89712bf
BLAKE2b-256 3f1a53b438067bafe7f757469d79b48a844e3c43eea6c18bc10f6edd9a6dfb5d

See more details on using hashes here.

Provenance

The following attestation bundles were made for lodlina-0.2.0-py3-none-any.whl:

Publisher: release.yml on Lodlina/Lodlina

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page