Skip to main content

A plumb line for government AI: realistic U.S. public-sector tasks and automated graders for evaluating LLMs, built on Inspect.

Project description

Lodlina

A plumb line for government AI.

Lodlina is Swedish for plumb line — the weighted cord builders have used for millennia to check whether a wall is true. This project is the same idea for AI: a fair, reproducible, auditable way to check whether an AI model does government work correctly, fairly, and honestly — and to compare models on equal footing.

Lodlina is an open-source suite of realistic U.S. public-sector tasks paired with defensible automated graders, built on Inspect (the open evaluation framework from the UK AI Safety Institute). You point it at a model, it runs the tasks, and it produces scores and a model-comparison leaderboard you can put in front of leadership, an inspector general, or an ATO board.

A plumb line doesn't argue about which wall is prettier — it tells you, without opinion, whether the wall is true. Lodlina aims for the same: measurement you can defend, not vibes.

Who this is for

  • Agency AI leads and program owners choosing a model for records, eligibility, public Q&A, or plain-language work — and needing evidence, not a vendor demo.
  • Vendors and integrators selling AI to government who want to show, on neutral ground, that their model is correct, fair, and grounded.
  • Inspectors general, auditors, and eval practitioners who need numbers they can reproduce and defend — every score traces to a label or an exact string check.
  • Researchers studying how LLMs behave on real public-sector tasks.

What makes it different

  • Defensible graders. Most scores are deterministic — computed from labeled ground truth or exact string operations, auditable by someone who isn't an ML expert. Where judgment is unavoidable, the grader uses a strict rubric backed by a deterministic check.
  • All data is synthetic. No real PII or CUI anywhere — safe to run on any network. (SSNs use the never-issued 900–999 range, etc.)
  • In-boundary by default. Lodlina is Bedrock-first: prompts stay in your AWS boundary unless you explicitly opt into a commercial API. Every result is labeled with where it sent its data.
  • Runs anywhere. No telemetry, no phone-home; designed to work air-gapped.

Quickstart (about a minute)

You need Python ≥ 3.10 and AWS credentials with Amazon Bedrock access (the Claude models enabled in us-east-1).

pip install "lodlina[bedrock]"          # or: uv pip install "lodlina[bedrock]"

export AWS_PROFILE=your-bedrock-profile # AWS creds with Bedrock access
export AWS_DEFAULT_REGION=us-east-1

lodlina validate                        # confirm the install + datasets
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina leaderboard --html              # compare the default line-up → leaderboard/results.html

That's it. lodlina run prints a score summary; lodlina leaderboard --html writes a shareable comparison table. A first run costs roughly a few cents to a few dollars depending on the models and sample count.

No AWS yet? You can still pip install lodlina and run lodlina list / lodlina validate (no credentials needed). To run a model, add a provider and credentials — see Models & credentials.

Start with your AI assistant (Claude Code, Codex, Cursor, …)

If you use an AI coding assistant with shell access, you don't have to read the docs first. Open it in an empty folder and paste the contents of docs/ai-quickstart-prompt.md. The assistant will install Lodlina, check your AWS Bedrock access, run a first evaluation, and generate a leaderboard — explaining each step as it goes.

The prompt is self-contained and transparent — it's plain markdown, so read it before you paste it; everything it does is right there. It only installs into a local virtual environment and only spends money when you approve a model run.


What it measures

Four tasks, each a real public-sector job where being wrong is concrete. Each ships a synthetic dataset (input + labeled ground truth) and a defensible scorer. Full definitions of "correct" are in docs/methodology.md.

1. records-redactiondon't leak personal privacy info

A synthetic government document mixes must-redact items (SSNs, personal email, home address, date of birth — FOIA Exemption 6 information) with clearly releasable content. The model returns what it would redact.

  • leak_rate (headline, deterministic) — fraction of must-redact items the model missed. A miss is a leak: the most serious failure.
  • over_redaction_rate — releasable items wrongly redacted (over-redacting defeats the purpose of FOIA disclosure).

2. eligibility-fairnesscorrect, and consistent under irrelevant changes

A synthetic case file plus a policy-manual excerpt with clear rules.

  • accuracy (deterministic) — determination vs. the rule-derived answer.
  • flip_rate (headline, metamorphic) — each case is re-run with only the applicant's name changed (a legally-irrelevant attribute). Any case whose decision flips is flagged. This turns "is it biased?" into an objective "did the decision change when it must not have?" — no subjective bias grader.

3. grounded-qaanswer, and cite faithfully

A policy document plus a question; the model answers and cites supporting passages verbatim.

  • hallucinated_citation_rate (headline, deterministic) — fraction of cited passages not found verbatim in the source (a fabricated quote).
  • answer_correctness, citation_support — model-graded, each backed by a deterministic check (a labeled reference answer; the verbatim gate).

4. plain-languagerewrite simply without changing the meaning

A dense bureaucratic paragraph.

  • readability_improvement (deterministic) — Flesch-Kincaid grade-level drop.
  • meaning_preservation — model-graded two-way entailment (no facts added or dropped), reported alongside readability so "simplify by deleting content" can't look like success.

Why you can trust the numbers

  1. Deterministic first. Wherever there's a ground truth, grading is computed from labels or exact string operations — no model judgment.
  2. Metamorphic for fairness. Change only a legally-irrelevant attribute and check whether the output flips. No subjective "bias" grader ships.
  3. Constrained, backed model-grading only where unavoidable. Citation support and meaning preservation use a strict rubric and a deterministic gate (e.g. a quote must pass the verbatim check before a model is asked whether it supports a claim).
  4. By default, no model grades itself. On the leaderboard a single neutral grader scores every candidate, so model-graded columns are comparable.

Every task's definition of "correct" and exactly how its scorer works is in docs/methodology.md — written to be read by an inspector general, not just an ML engineer.


Models & credentials

Lodlina is Bedrock-first. You pick a model by a short alias (claude-sonnet-4-6, claude-opus-4-8, gpt-5.5, …) and it resolves to that model's Amazon Bedrock route by default, keeping prompts in-boundary. The direct OpenAI / Anthropic APIs are secondary routes, used only when you ask for them with --provider — there is no silent cross-boundary fallback.

lodlina list shows every alias and where it resolves. Pick the provider extras you need at install time:

Install Enables
pip install lodlina[bedrock] Claude on Bedrock (Converse) — the default path
pip install lodlina[bedrock,openai] + OpenAI GPT-5.x (direct API and Bedrock Mantle)
pip install lodlina[anthropic] direct Anthropic API
pip install lodlina[all] every provider

Claude on Bedrock (default, in-boundary)

Standard AWS credentials with Bedrock access; the Claude line-up and the neutral grader both run here:

export AWS_PROFILE=your-bedrock-profile   # or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1

OpenAI GPT-5.x on Bedrock Mantle (in-boundary)

GPT-5.4 / GPT-5.5 are served on Bedrock's separate Mantle endpoint (OpenAI Responses API), authenticated with a Bedrock long-term API key and available in us-east-2 / us-west-2 / us-gov-west-1:

export BEDROCK_MANTLE_BASE_URL=https://bedrock-mantle.us-east-2.api.aws/openai/v1
export BEDROCK_MANTLE_API_KEY=ABSK...   # Bedrock console → API keys → long-term

Aliases (gpt-5.4, gpt-5.5) route here automatically. If the Mantle environment isn't set, those rows simply render as instead of failing.

Direct OpenAI / Anthropic (off-boundary)

Secondary routes that send prompts to the commercial APIs — selected explicitly with --provider openai|anthropic, reading OPENAI_API_KEY / ANTHROPIC_API_KEY. The leaderboard always labels each model's provider and data boundary, so a reviewer can see at a glance where every run sent its data.

Copy .env.example to .env.local for a fill-in-the-blanks setup.


The lodlina command

lodlina list                                   # eval packs + model aliases
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina run records-redaction --model gpt-5.4               # GPT-5.4 via Bedrock Mantle
lodlina run plain-language --model claude-sonnet-4-6 --provider anthropic   # off-boundary
lodlina leaderboard --html                     # full model-comparison board
lodlina validate                               # check every eval pack is sound
lodlina new-pack my-pack --task-type records-redaction     # author your own eval set

run resolves the model Bedrock-first, binds a neutral grader, and is air-gap safe.

Bring your own eval sets (packs)

An eval pack is a shareable evaluation: a manifest.yaml + a synthetic dataset.jsonl that reuses Lodlina's vetted graders by referencing a curated task type. Packs are data + configuration only — no third-party code is executed, so every grader stays auditable.

lodlina new-pack ssn-heavy --task-type records-redaction   # scaffold a valid starter
# ...drop your synthetic records into ssn-heavy/dataset.jsonl...
lodlina validate --pack ./ssn-heavy
lodlina run --pack ./ssn-heavy --model claude-sonnet-4-6

Third parties can distribute packs as pip-installable packages (discovered via the lodlina_packs entry point). See src/lodlina/packs/builtin/README.md.


Synthetic data & honest limitations

  • All data is synthetic. No real PII/CUI. Identifiers are unmistakably fake (SSNs in the never-issued 900–999 range, 555-01xx phones, example.com emails). Generators are seeded; committed seed sets (~12–18 samples/task) let it run out of the box.
  • Synthetic ≠ representative. Templated documents are cleaner than real agency records; scores indicate capability on a controlled proxy, not certified production performance.
  • Model-graded components inherit grader limits. They're constrained and deterministically backed where possible, but not infallible; read them as grader-relative. The deterministic headlines are the grader-independent signal.
  • U.S. federal framing, English. A starting point, not a complete map of government work.
  • Not legal advice or an authorization to deploy. Lodlina is an evaluation instrument, not a compliance certification.

Roadmap & backlog

The plan of record (phases, design decisions, eval-pack format) is in docs/ROADMAP.md. Tasks still on the methodology bench: political-neutrality (symmetric paired prompts), Section-508 alt-text, FOIA exemption-reasoning, and abstention on unanswerable questions.

Develop & contribute

git clone https://github.com/Lodlina/Lodlina && cd Lodlina
uv venv && uv pip install -e ".[dev]"
pytest          # 50+ tests, fully offline

The test suite runs the full Inspect pipeline offline — each task is driven end-to-end by a mock model, so all grading logic is verified without credentials or network. Conventions mirror inspect_evals.

License

MIT.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lodlina-0.2.1.tar.gz (74.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lodlina-0.2.1-py3-none-any.whl (70.3 kB view details)

Uploaded Python 3

File details

Details for the file lodlina-0.2.1.tar.gz.

File metadata

  • Download URL: lodlina-0.2.1.tar.gz
  • Upload date:
  • Size: 74.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lodlina-0.2.1.tar.gz
Algorithm Hash digest
SHA256 3c02e15581f213cb015a860c5786ea2a64121042333991d1556ef5abd0b2eef2
MD5 90829a54f960c5782123c451c5398c6e
BLAKE2b-256 d79a32fd8a88302adb16825da8bc534ca76b172535589befd5f97300cadf706b

See more details on using hashes here.

Provenance

The following attestation bundles were made for lodlina-0.2.1.tar.gz:

Publisher: release.yml on Lodlina/Lodlina

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lodlina-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: lodlina-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 70.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lodlina-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 3b4711154bf13e7520f414e5db590c10fa3841387ab1d6cfcc91dcad97c7fde0
MD5 d6bfa00da1837c254e338516898407ba
BLAKE2b-256 7aefbcfb220e6e6e87edbfeab731a099c5313cab0ca1ea14587c566295b3cfad

See more details on using hashes here.

Provenance

The following attestation bundles were made for lodlina-0.2.1-py3-none-any.whl:

Publisher: release.yml on Lodlina/Lodlina

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page