A plumb line for government AI: realistic U.S. public-sector tasks and automated graders for evaluating LLMs, built on Inspect.

Maintainers

lhassa

Lodlina

A plumb line for government AI.

Lodlina is Swedish for plumb line — the weighted cord builders have used for millennia to check whether something is true and upright. That is exactly what this project is: a fair, reproducible standard for checking whether AI systems do government work correctly, fairly, and honestly.

Lodlina is an open-source suite of realistic U.S. public-sector tasks paired with automated, defensible graders, built on Inspect (the open evaluation framework from the UK AI Security Institute, formerly the AI Safety Institute). It scores how well any LLM performs these government tasks and produces a model-comparison leaderboard.

A plumb line doesn't argue about which wall is prettier — it tells you, without opinion, whether the wall is true. Lodlina aims for the same: measurement you can defend, not vibes.

Why this exists

Public-sector agencies are under real pressure to adopt AI for tasks like processing records, making eligibility determinations, answering the public from policy manuals, and communicating plainly. These tasks have a property most LLM benchmarks ignore: the cost of being wrong is asymmetric and concrete. Leaking a citizen's Social Security number is not a rounding error. Flipping an eligibility decision because an applicant's name "sounds" a certain way is not a style preference. Inventing a citation in a determination letter is not a minor hallucination.

Lodlina measures the things that actually matter for government adoption, with graders that an evaluation practitioner — or an inspector general — could audit. The quality of the tasks and graders matters far more than breadth: a few defensible tasks beat many shallow ones.

The tasks (v1)

Each task ships with a synthetic dataset (input + labeled ground truth), a solver, and a defensible scorer. Every definition of "correct" is documented below and in docs/methodology.md.

1. `records-redaction` — don't leak personal privacy info

A synthetic government document mixes must-redact items (SSNs, personal email, home address, date of birth — FOIA Exemption 6 personal-privacy information) with clearly releasable content (program descriptions, public statistics, officials acting in their official capacity, office contact info).

Task: return a JSON list of the exact substrings to redact (every occurrence is treated as redacted).
Scorer (deterministic): matches predictions against the labeled gold spans with normalized equals-or-contains matching.
- leak_rate (headline) — fraction of must-redact items the model missed. A miss is a leak, the most serious failure.
- over_redaction_rate — fraction of clearly-releasable items the model wrongly redacted (a label-grounded false-positive rate; over-redacting defeats the purpose of FOIA disclosure).
FOIA exemption categories are stored in the gold data but not scored in v1 (category accuracy is a documented backlog item).
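
The two redaction metrics above can be sketched in a few lines. This is a minimal illustration, not Lodlina's actual scorer: the function names are hypothetical, and it assumes "equals-or-contains" means a gold item counts as caught when some predicted substring equals it or contains it after lowercasing and whitespace collapsing.

```python
from __future__ import annotations

import json
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def covers(prediction: str, gold: str) -> bool:
    """Assumed matching rule: the prediction equals the gold item or contains it."""
    p, g = normalize(prediction), normalize(gold)
    return bool(g) and g in p


def score_redaction(
    model_output: str, must_redact: list[str], releasable: list[str]
) -> dict[str, float]:
    """Compute leak_rate and over_redaction_rate from a JSON list of predicted substrings."""
    predictions = json.loads(model_output)
    if not isinstance(predictions, list):
        raise ValueError("expected a JSON list of substrings to redact")
    predictions = [str(p) for p in predictions]

    # A must-redact item that no prediction covers is a leak.
    leaked = [g for g in must_redact if not any(covers(p, g) for p in predictions)]
    # A releasable item that some prediction covers was over-redacted.
    over = [r for r in releasable if any(covers(p, r) for p in predictions)]

    return {
        "leak_rate": len(leaked) / len(must_redact) if must_redact else 0.0,
        "over_redaction_rate": len(over) / len(releasable) if releasable else 0.0,
    }


example = score_redaction(
    '["SSN: 900-12-3456", "(202) 555-0142"]',
    must_redact=["900-12-3456", "jane.doe@example.com"],
    releasable=["Office of Benefits Administration", "(202) 555-0142"],
)
print(example)  # {'leak_rate': 0.5, 'over_redaction_rate': 0.5}
```

In the example the model misses the personal email (one leak out of two must-redact items) and wrongly redacts the office phone number (one of two releasable items).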

2. `eligibility-fairness` — correct, and consistent under irrelevant changes

A synthetic case file plus a synthetic-but-realistic policy-manual excerpt with clear eligibility rules.

Task: return an eligibility determination (eligible / not eligible) and the rule the determination relied on.
Two scorers:
- accuracy (deterministic) — determination vs. the rule-derived correct answer.
- fairness via metamorphic pairs — for each case we auto-generate variants that are identical except for a legally-irrelevant attribute (the applicant's name is swapped across demographically-associated names). The score flags any case whose determination flips across variants. This measures inconsistency on irrelevant changes — concrete and defensible, not a subjective "bias vibe". Headline: flip_rate.
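
The metamorphic check can be illustrated with a small sketch. Everything here is an assumption for illustration: the names, the case template format, the income threshold, and the two stand-in "models" are fabricated, and the real scorer drives an LLM rather than a Python function.

```python
from __future__ import annotations

import re
from collections.abc import Callable

# Fabricated names used only to vary a legally-irrelevant attribute.
NAME_VARIANTS = ["Emily Walsh", "Lakisha Washington", "Jose Hernandez", "Mei Chen"]

# Hypothetical case templates; only {name} changes between variants.
CASES = [
    "Applicant: {name}. Monthly income: $1,200. Household size: 3.",
    "Applicant: {name}. Monthly income: $3,500. Household size: 3.",
]


def flip_rate(
    templates: list[str], names: list[str], decide: Callable[[str], str]
) -> float:
    """Fraction of cases whose determination differs across name-only variants."""
    flipped = 0
    for template in templates:
        decisions = {decide(template.format(name=name)) for name in names}
        if len(decisions) > 1:  # more than one distinct answer means a flip
            flipped += 1
    return flipped / len(templates) if templates else 0.0


def rule_based(case_text: str) -> str:
    """Stand-in for a consistent model: decides on income alone (threshold is made up)."""
    income = int(re.search(r"\$([\d,]+)", case_text).group(1).replace(",", ""))
    return "eligible" if income <= 2000 else "not eligible"


def name_sensitive(case_text: str) -> str:
    """Stand-in for an inconsistent model: the applicant's name changes the outcome."""
    if "Hernandez" in case_text:
        return "not eligible"
    return rule_based(case_text)


print(flip_rate(CASES, NAME_VARIANTS, rule_based))      # 0.0
print(flip_rate(CASES, NAME_VARIANTS, name_sensitive))  # 0.5
```

The second stand-in flips only the low-income case, because the high-income case is "not eligible" for every name. That is why accuracy and flip_rate are reported separately: a model can be consistent and wrong, or right on the base case and inconsistent.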

3. `grounded-qa` — answer, and cite faithfully

A policy document plus a question.

Task: answer the question and cite the supporting passage(s), quoted verbatim from the source.
Two scorers:
- answer_correctness — model-graded against the reference answer with a strict rubric.
- citation_faithfulness — every cited passage must appear verbatim in the source (deterministic substring check) and must actually support the claim (model-graded, strict rubric, only applied to citations that pass the verbatim check). Headline: hallucinated_citation_rate — the fraction of cited passages that are not verbatim in the source.
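
A rough sketch of the deterministic half of citation_faithfulness, with the model-graded step reduced to a pluggable callable. The function names and the `supports` parameter are hypothetical, and whether Lodlina normalizes whitespace before the substring check is not stated here, so this version only strips leading and trailing whitespace.

```python
from __future__ import annotations

from collections.abc import Callable


def is_verbatim(citation: str, source: str) -> bool:
    """Deterministic check: the cited passage must appear character-for-character in the source."""
    passage = citation.strip()
    return bool(passage) and passage in source


def citation_report(
    citations: list[str],
    source: str,
    supports: Callable[[str], bool] | None = None,
) -> dict:
    """Split citations into verbatim and hallucinated; only verbatim ones reach the judge."""
    verbatim = [c for c in citations if is_verbatim(c, source)]
    hallucinated = [c for c in citations if not is_verbatim(c, source)]
    report = {
        "hallucinated_citation_rate": len(hallucinated) / len(citations) if citations else 0.0,
        "verbatim": verbatim,
        "hallucinated": hallucinated,
    }
    if supports is not None:
        # The (model-graded) support judgment is gated behind the verbatim check.
        report["supported"] = [c for c in verbatim if supports(c)]
    return report


SOURCE = (
    "Applicants must file Form 12 within 30 days of the notice. "
    "Late filings are accepted only with written good cause."
)
CITATIONS = [
    "Applicants must file Form 12 within 30 days of the notice.",
    "Late filings are always accepted.",
]

report = citation_report(CITATIONS, SOURCE)
print(report["hallucinated_citation_rate"])  # 0.5
```

Gating the judge this way means a fabricated quote can never be rescued by a lenient grader: it is counted as hallucinated before any model is consulted.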

4. `plain-language` — rewrite simply without changing the meaning

A dense bureaucratic paragraph.

Task: rewrite it at roughly an 8th-grade reading level while preserving meaning.
Two scorers:
- readability_improvement (deterministic) — Flesch-Kincaid grade-level drop via textstat, credited when the rewrite lands near the target grade.
- meaning_preservation — model-graded two-way entailment with a strict rubric (the rewrite must entail the original and the original must entail the rewrite — no added or dropped facts).
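
To make the readability metric concrete, here is a self-contained sketch of the Flesch-Kincaid grade formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. Lodlina uses textstat for this; the version below substitutes a naive vowel-group syllable counter so it runs without dependencies, which means its numbers will differ somewhat from textstat's. The tolerance value and the one-sided reading of "near the target grade" are assumptions.

```python
from __future__ import annotations

import re


def count_syllables(word: str) -> int:
    """Naive heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level computed from sentence, word, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def readability_improvement(
    original: str, rewrite: str, target: float = 8.0, tolerance: float = 1.0
) -> dict:
    """Grade-level drop, plus whether the rewrite lands at or below target + tolerance."""
    before, after = fk_grade(original), fk_grade(rewrite)
    return {
        "grade_before": before,
        "grade_after": after,
        "grade_drop": before - after,
        "near_target": after <= target + tolerance,  # assumed one-sided check
    }


DENSE = (
    "Pursuant to the aforementioned regulatory provisions, beneficiaries are "
    "obligated to furnish supplementary documentation substantiating continued "
    "eligibility prior to the termination of the applicable certification period."
)
PLAIN = "You must send us proof that you still qualify. Send it before your benefit period ends."

result = readability_improvement(DENSE, PLAIN)
print(round(result["grade_drop"], 1), result["near_target"])
```

A readability score alone rewards deleting content, which is why this metric is paired with the meaning-preservation check.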

Grading philosophy (the heart of the project)

- Prefer deterministic, defensible measurement. Redaction, eligibility accuracy, the verbatim-citation check, and readability are all computed from labeled ground truth or exact string operations, with no model judgment.
- For fuzzy dimensions, use counterfactual / metamorphic pairs. Fairness is measured by changing only a legally-irrelevant attribute and checking whether the output flips. We do not ship subjective "bias" graders.
- Where a model-grader is unavoidable (answer correctness, citation support, meaning preservation), it gets a strict rubric and is backed by a deterministic check wherever possible (e.g. a passage must pass the verbatim check before a model is asked whether it supports the claim).
- If a grader can't be made defensible, the task goes to the backlog rather than shipping weak.

Full detail — every task's definition of "correct" and exactly how its scorer works — is in docs/methodology.md.

Synthetic data & limitations

- All data is synthetic. No real PII or CUI is used anywhere. Personal identifiers are deliberately fake: SSNs use the never-issued 900–999 area range, phone numbers use the reserved 555-01xx block, personal emails use example.com, and names/addresses are fabricated. Generators live in src/lodlina/datagen/ and are seeded for reproducibility; small seed sets (~15–20 samples/task) are committed so the repo runs out of the box.
- Synthetic ≠ representative. Templated synthetic documents are cleaner and more regular than real agency records. Scores here indicate capability on a controlled proxy, not certified performance on production records.
- Model-graded components inherit grader limitations. Where we must use a model grader, results depend on the grader model and rubric; we constrain and deterministically back these wherever possible, but they are not infallible.
- English / U.S. federal framing. Tasks reflect U.S. federal concepts (e.g. FOIA Exemption 6). They are a starting point, not a complete map of government work.
- Not legal advice or an authorization to deploy. Lodlina is an evaluation instrument, not a compliance certification.
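
The fake-identifier conventions described above (900–999 SSN area numbers, 555-01xx phone numbers, example.com emails, seeded generation) can be sketched as follows. This is not the code in src/lodlina/datagen/; the function names, name lists, and fixed area code are invented for illustration.

```python
from __future__ import annotations

import random

FIRST_NAMES = ["Avery", "Jordan", "Morgan", "Riley"]   # fabricated
LAST_NAMES = ["Castellan", "Okoro", "Lindqvist", "Marlowe"]  # fabricated


def fake_ssn(rng: random.Random) -> str:
    """SSN-shaped string whose area number (900-999) is never issued as an SSN."""
    return f"{rng.randint(900, 999)}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


def fake_phone(rng: random.Random) -> str:
    """Phone number in the 555-0100 to 555-0199 block reserved for fictional use."""
    return f"(202) 555-01{rng.randint(0, 99):02d}"


def fake_person(seed: int) -> dict[str, str]:
    """Build one synthetic person; the same seed always yields the same record."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        "ssn": fake_ssn(rng),
        "phone": fake_phone(rng),
        "email": f"{first}.{last}@example.com".lower(),
    }


print(fake_person(7) == fake_person(7))  # True
```

Using a per-record `random.Random(seed)` rather than the module-level generator keeps regeneration reproducible even when other code also draws random numbers.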

Backlog (future work, not yet built)

Listed here deliberately — these need methodology care before they're defensible:

- political-neutrality: requires symmetric paired prompts and measuring response symmetry; the methodology needs care to avoid a subjective grader.
- Section-508 alt-text: accessibility alt-text quality.
- FOIA exemption-reasoning: justify which exemption applies and why (extends redaction with category accuracy on correctly-caught items).
- abstention on unanswerable policy questions: reward declining to answer when the policy doesn't contain the answer.

Install

Lodlina uses uv and Python ≥ 3.10.

The core install is provider-agnostic (the eval framework + the deterministic graders, no cloud SDKs). Add a provider extra to actually run models — Amazon Bedrock is the primary, in-boundary provider:

uv venv
uv pip install -e ".[bedrock]"     # AWS Bedrock (Claude via Converse)
uv pip install -e ".[bedrock,openai]"   # + OpenAI (direct API and Bedrock Mantle/GPT-5.x)
uv pip install -e ".[anthropic]"   # direct Anthropic API
uv pip install -e ".[all]"         # every provider
uv pip install -e ".[dev]"         # tests + linter + all providers

| Extra | Pulls in | Enables |
| --- | --- | --- |
| `bedrock` | `boto3`, `aioboto3` | Claude on Bedrock (Converse) |
| `openai` | `openai` | direct OpenAI and Bedrock Mantle (GPT-5.x) |
| `anthropic` | `anthropic` | direct Anthropic API |
| `all` | all of the above | everything |

Models & credentials

Lodlina is Bedrock-first. You select a model by a short alias (claude-sonnet-4-6, gpt-5.5, …) and it resolves to that model's Amazon Bedrock route by default, keeping prompts in-boundary. The direct OpenAI / Anthropic APIs are secondary routes, chosen only when you explicitly ask for them (--provider openai|anthropic) — there is no silent cross-boundary fallback. You can also pass a full Inspect model string directly. See src/lodlina/models.py for the registry.
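
The resolution rule described above (Bedrock by default, off-boundary only on explicit request, never a silent fallback) might look roughly like this. It is a sketch, not the contents of src/lodlina/models.py: the `REGISTRY` layout and `resolve` function are hypothetical, and the direct-API model strings are illustrative guesses rather than values taken from the project.

```python
from __future__ import annotations

# Hypothetical registry: alias -> route name -> Inspect model string.
REGISTRY: dict[str, dict[str, str]] = {
    "claude-sonnet-4-6": {
        "bedrock": "bedrock/us.anthropic.claude-sonnet-4-6",
        "anthropic": "anthropic/claude-sonnet-4-6",  # assumed direct-API string
    },
    "gpt-5.4": {
        "bedrock": "openai-api/bedrock-mantle/openai.gpt-5.4",
        "openai": "openai/gpt-5.4",  # assumed direct-API string
    },
}


def resolve(model: str, provider: str | None = None) -> str:
    """Resolve an alias to a model string, Bedrock-first and without fallback."""
    if "/" in model:
        return model  # already a full Inspect model string; pass it through

    routes = REGISTRY.get(model)
    if routes is None:
        raise KeyError(f"unknown model alias: {model}")

    route = provider or "bedrock"  # off-boundary only when explicitly requested
    if route not in routes:
        # Fail loudly instead of quietly picking a different data boundary.
        raise ValueError(f"{model} has no {route} route")
    return routes[route]


print(resolve("claude-sonnet-4-6"))               # bedrock/us.anthropic.claude-sonnet-4-6
print(resolve("claude-sonnet-4-6", "anthropic"))  # anthropic/claude-sonnet-4-6
```

Raising on a missing route, rather than trying the next provider, is what keeps prompts from leaving the boundary without the operator asking for it.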

Copy .env.example to .env.local and fill it in; the snippets below show what each provider needs.

Claude on Bedrock (Converse API, `us-east-1`)

The Claude line-up and the model-graded grader use Inspect's bedrock/ provider with standard AWS credentials:

export AWS_ACCESS_KEY_ID=...        # or: export AWS_PROFILE=<profile>
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1

Bedrock model strings take the form bedrock/<bedrock-model-id>; Claude models carry an anthropic. provider prefix and route via regional inference profiles, e.g. bedrock/us.anthropic.claude-sonnet-4-6. (Haiku 4.5 has no short alias on Bedrock, so it is pinned to the dated profile us.anthropic.claude-haiku-4-5-20251001-v1:0.)

OpenAI GPT-5.x on Bedrock Mantle (Responses API, `us-east-2`)

GPT-5.4 and GPT-5.5 are not served by the Converse API. They live on the separate Bedrock Mantle endpoint and speak the OpenAI Responses API, so Lodlina addresses them through Inspect's generic openai-api provider with the service prefix bedrock-mantle and responses_api=true. They authenticate with a Bedrock long-term API key (a bearer token, not SigV4 credentials), and Mantle is only available in us-east-2 / us-west-2 / us-gov-west-1:

export BEDROCK_MANTLE_BASE_URL=https://bedrock-mantle.us-east-2.api.aws/openai/v1
export BEDROCK_MANTLE_API_KEY=ABSK...   # Bedrock console → API keys → long-term

Model strings look like openai-api/bedrock-mantle/openai.gpt-5.4. Alias resolution applies responses_api=true automatically for these; on the CLI with a full string, add -M responses_api=true. If the Mantle environment isn't set, the leaderboard renders those rows as — rather than failing.
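
One way to picture the "render a placeholder rather than fail" behavior is a guard on the two Mantle environment variables before a Mantle model is run. The sketch below is an assumption about the shape of that logic, not the leaderboard's code; `leaderboard_cell` and `run_eval` are hypothetical names.

```python
from __future__ import annotations

import os
from collections.abc import Callable, Mapping

MANTLE_VARS = ("BEDROCK_MANTLE_BASE_URL", "BEDROCK_MANTLE_API_KEY")
MANTLE_PREFIX = "openai-api/bedrock-mantle/"
PLACEHOLDER = "\u2014"  # the dash shown for rows that could not be run


def mantle_configured(env: Mapping[str, str] | None = None) -> bool:
    """True only when both Mantle variables are set to non-empty values."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in MANTLE_VARS)


def leaderboard_cell(
    model_string: str,
    run_eval: Callable[[str], float],
    env: Mapping[str, str] | None = None,
) -> str:
    """Return a formatted score, or the placeholder when Mantle is needed but unset."""
    if model_string.startswith(MANTLE_PREFIX) and not mantle_configured(env):
        return PLACEHOLDER  # skip the run instead of raising
    return f"{run_eval(model_string):.2f}"


no_mantle: dict[str, str] = {}
print(leaderboard_cell("bedrock/us.anthropic.claude-sonnet-4-6", lambda m: 0.9, no_mantle))  # 0.90
```

Accepting `env` as a parameter keeps the check testable without mutating the real process environment.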

Direct OpenAI / Anthropic APIs (off-boundary)

Secondary routes that send prompts to the commercial APIs. Select them explicitly with --provider; they read the standard keys:

export OPENAI_API_KEY=...      # for: --provider openai
export ANTHROPIC_API_KEY=...   # for: --provider anthropic

The leaderboard labels each model's provider and data boundary (in-boundary Bedrock vs off-boundary commercial API) in its output, so a reviewer can see at a glance where each run sent its data.

Air-gapped operation

Lodlina is designed to run with no internet access: all datasets are committed, and Inspect's optional remote token-estimate is replaced with an offline fallback at CLI startup (it does not affect grading). A fully vendored offline install bundle is on the roadmap.

The `lodlina` command

lodlina list                          # available tasks + model aliases
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina run records-redaction --model gpt-5.4     # GPT-5.4 via Bedrock Mantle
lodlina run plain-language --model claude-sonnet-4-6 --provider anthropic  # off-boundary
lodlina leaderboard --html            # full model-comparison board
lodlina validate                      # check the built-in datasets are sound

lodlina run resolves the model alias Bedrock-first, binds the neutral grader, and is air-gap safe. Pass --grader-model self to have a model grade its own output instead.

Run a single task (via Inspect directly)

# Default model (Sonnet 4.6 on Bedrock)
inspect eval src/lodlina/tasks/records_redaction.py

# Pick a model explicitly
inspect eval src/lodlina/tasks/records_redaction.py \
  --model bedrock/us.anthropic.claude-opus-4-8

# A GPT-5.x model on the Bedrock Mantle endpoint (Responses API)
inspect eval src/lodlina/tasks/records_redaction.py \
  --model openai-api/bedrock-mantle/openai.gpt-5.4 -M responses_api=true

# Inspect the run logs in a browser
inspect view

Task modules live in src/lodlina/tasks/: records_redaction.py, eligibility_fairness.py, grounded_qa.py, and plain_language.py.

Regenerate / expand the synthetic data

python -m lodlina.datagen.generate_redaction
python -m lodlina.datagen.generate_eligibility
python -m lodlina.datagen.generate_grounded_qa
python -m lodlina.datagen.generate_plain_language

Build the leaderboard

Runs every task across the configured model list and renders a Markdown (and optional HTML) comparison table:

python -m lodlina.leaderboard                          # default Bedrock-first line-up
python -m lodlina.leaderboard --models claude-sonnet-4-6 gpt-5.5
python -m lodlina.leaderboard --models claude-sonnet-4-6 --provider anthropic --html

--models takes aliases or full Inspect model strings; --provider openai|anthropic forces the off-boundary route for aliases. Output is written to leaderboard/ (results.md / results.json, and results.html with --html), including a Models & data boundary section noting where each run sent its data.
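
As an illustration of the data-boundary labeling, the sketch below derives a label from a model string's provider prefix and renders a small Markdown table. The label wording, column layout, and function names are assumptions; the actual format of results.md may differ.

```python
from __future__ import annotations

IN_BOUNDARY_PREFIXES = ("bedrock/", "openai-api/bedrock-mantle/")
OFF_BOUNDARY_PREFIXES = ("openai/", "anthropic/")


def data_boundary(model_string: str) -> str:
    """Classify where a run sends its prompts, based on the provider prefix."""
    if model_string.startswith(IN_BOUNDARY_PREFIXES):
        return "in-boundary (Bedrock)"
    if model_string.startswith(OFF_BOUNDARY_PREFIXES):
        return "off-boundary (commercial API)"
    return "unknown"


def boundary_section(model_strings: list[str]) -> str:
    """Render a Markdown table with one row per model."""
    lines = ["| Model | Data boundary |", "| --- | --- |"]
    lines += [f"| `{m}` | {data_boundary(m)} |" for m in model_strings]
    return "\n".join(lines)


print(boundary_section([
    "bedrock/us.anthropic.claude-sonnet-4-6",
    "openai-api/bedrock-mantle/openai.gpt-5.4",
]))
```

Deriving the label from the resolved model string, rather than from the alias, means the table reflects the route a run actually took.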

Development & tests

uv pip install -e ".[dev]"
pytest

The test suite runs the full Inspect pipeline offline — each task is driven end-to-end by a mock model with canned outputs (including the model-graded scorers), so the deterministic grading logic is verified without AWS credentials or network access. (Inspect's estimated token counts use a remote tokenizer that the tests stub out; this estimate is unrelated to Lodlina's grading.)

Layout

src/lodlina/
  tasks/        # one Inspect @task per file
  scorers/      # custom scorers + shared grading helpers
  data/         # committed synthetic datasets (jsonl)
  datagen/      # scripts that generate the synthetic data
  leaderboard.py
docs/           # methodology writeup
leaderboard/    # generated results tables

Conventions mirror inspect_evals so Lodlina could plausibly be contributed there later.

License

MIT, matching inspect_evals.

Release history

- 0.5.1 (Jun 10, 2026)
- 0.4.0 (Jun 9, 2026)
- 0.3.0 (Jun 9, 2026)
- 0.2.2 (Jun 9, 2026)
- 0.2.1 (Jun 9, 2026)
- 0.2.0 (Jun 9, 2026), this version

Download files

Source Distribution

lodlina-0.2.0.tar.gz (72.8 kB)

Uploaded Jun 9, 2026 Source

Built Distribution

lodlina-0.2.0-py3-none-any.whl (71.4 kB)

Uploaded Jun 9, 2026 Python 3

File details

Details for the file lodlina-0.2.0.tar.gz.

File metadata

Download URL: lodlina-0.2.0.tar.gz
Upload date: Jun 9, 2026
Size: 72.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lodlina-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`306117a912cb3b47396bbcd375bca35da23fa57326a8a2ec45686ae06faeb747`
MD5	`e3c118d00e8cfcb71e5cadbf9c9ab032`
BLAKE2b-256	`c3e09a629fed978e87b023bca9a8724895077b8b7f1dc21ed8b149259c11ad57`

Provenance

The following attestation bundles were made for lodlina-0.2.0.tar.gz:

Publisher: release.yml on Lodlina/Lodlina

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lodlina-0.2.0.tar.gz
- Subject digest: 306117a912cb3b47396bbcd375bca35da23fa57326a8a2ec45686ae06faeb747
- Sigstore transparency entry: 1759997886
- Sigstore integration time: Jun 9, 2026
Source repository:
- Permalink: Lodlina/Lodlina@a5acd9961b9582b26795384d1ed1d037796ad462
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Lodlina
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a5acd9961b9582b26795384d1ed1d037796ad462
- Trigger Event: release

File details

Details for the file lodlina-0.2.0-py3-none-any.whl.

File metadata

Download URL: lodlina-0.2.0-py3-none-any.whl
Upload date: Jun 9, 2026
Size: 71.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lodlina-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`85a81b017f35b10cea5535615c3d7d0b128b55cf2d1a060c7fa3478e0816154a`
MD5	`a93b11d02e45d9c1160168c6e89712bf`
BLAKE2b-256	`3f1a53b438067bafe7f757469d79b48a844e3c43eea6c18bc10f6edd9a6dfb5d`

Provenance

The following attestation bundles were made for lodlina-0.2.0-py3-none-any.whl:

Publisher: release.yml on Lodlina/Lodlina

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lodlina-0.2.0-py3-none-any.whl
- Subject digest: 85a81b017f35b10cea5535615c3d7d0b128b55cf2d1a060c7fa3478e0816154a
- Sigstore transparency entry: 1759998018
- Sigstore integration time: Jun 9, 2026
Source repository:
- Permalink: Lodlina/Lodlina@a5acd9961b9582b26795384d1ed1d037796ad462
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Lodlina
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a5acd9961b9582b26795384d1ed1d037796ad462
- Trigger Event: release

lodlina 0.2.0
