A plumb line for government AI: realistic U.S. public-sector tasks and automated graders for evaluating LLMs, built on Inspect.

Lodlina

A plumb line for government AI.

Lodlina is Swedish for plumb line — the weighted cord builders have used for millennia to check whether a wall is true. This project is the same idea for AI: a fair, reproducible, auditable way to check whether an AI model does government work correctly, fairly, and honestly — and to compare models on equal footing.

Lodlina is an open-source suite of realistic U.S. public-sector tasks paired with defensible automated graders, built on Inspect (the open evaluation framework from the UK AI Security Institute, formerly the AI Safety Institute). You point it at a model, it runs the tasks, and it produces scores and a model-comparison leaderboard you can put in front of leadership, an inspector general, or an ATO (Authorization to Operate) board.

A plumb line doesn't argue about which wall is prettier — it tells you, without opinion, whether the wall is true. Lodlina aims for the same: measurement you can defend, not vibes.

Who this is for

  • Agency AI leads and program owners choosing a model for records, eligibility, public Q&A, or plain-language work — and needing evidence, not a vendor demo.
  • Vendors and integrators selling AI to government who want to show, on neutral ground, that their model is correct, fair, and grounded.
  • Inspectors general, auditors, and eval practitioners who need numbers they can reproduce and defend — every score traces to a label or an exact string check.
  • Researchers studying how LLMs behave on real public-sector tasks.

What makes it different

  • Defensible graders. Most scores are deterministic — computed from labeled ground truth or exact string operations, auditable by someone who isn't an ML expert. Where judgment is unavoidable, the grader uses a strict rubric backed by a deterministic check.
  • All data is synthetic. No real PII or CUI anywhere — safe to run on any network. (SSNs use the never-issued 900–999 range, etc.)
  • In-boundary by default. Lodlina is Bedrock-first: prompts stay in your AWS boundary unless you explicitly opt into a commercial API. Every result is labeled with where it sent its data.
  • Runs anywhere. No telemetry, no phone-home; designed to work air-gapped.

Quickstart (about a minute)

Recommended — with uv. uv brings its own Python, so you don't have to install or manage one (macOS ships an old 3.9 that won't work):

# Install uv if you don't have it: https://docs.astral.sh/uv/getting-started/installation/
uv tool install "lodlina[bedrock]"     # isolated install; fetches a modern Python
uv tool update-shell                   # puts `lodlina` on your PATH (then restart the terminal)

Point it at AWS Bedrock (Claude models enabled in us-east-1) and run:

export AWS_PROFILE=your-bedrock-profile   # or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1

lodlina validate                                              # no credentials needed
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5   # one task (~a few cents)
lodlina leaderboard --models claude-sonnet-4-6 claude-haiku-4-5 --limit 5 --html

lodlina leaderboard --html writes a shareable table to leaderboard/results.html in the current directory.

Prefer pip / a virtualenv? (You need to supply Python ≥ 3.10 yourself.)
python3.12 -m venv .venv && source .venv/bin/activate   # MUST be 3.10+; macOS's default python3 is 3.9
pip install "lodlina[bedrock]"

If you see ERROR: ... lodlina[bedrock] (from versions: none) or Ignored the following versions that require a different python version, your python3 is too old (< 3.10). Use uv above, or install a newer Python (brew install python@3.12).
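If you are unsure which interpreter a virtualenv picked up, a check along these lines can confirm it before you install. This is an illustrative sketch, not part of Lodlina: `python_is_supported` and `MIN_PYTHON` are hypothetical names, and the 3.10 floor is taken from the requirement stated above.

```python
import sys

# Minimum version taken from the README text (Python >= 3.10).
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=None):
    """Return True if the given (or running) interpreter meets MIN_PYTHON."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    verdict = "ok" if python_is_supported() else "too old for lodlina"
    print(f"Python {current}: {verdict}")
```

Run it with the same interpreter you plan to pass to `-m venv`, since that is the one pip will use.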

Notes

  • The bare lodlina leaderboard runs the full suite (every built-in eval pack, 600+ samples per model) across the full default line-up (Opus, Sonnet, and Haiku, plus GPT-5.x if you've configured Bedrock Mantle). That's the publishable board, and it costs real money; it prints its run size before starting. Use --models <alias> … and --limit N to keep a quick comparison cheap. Unconfigured models are shown with a placeholder rather than failing the run.
  • No AWS yet? lodlina list and lodlina validate work with no credentials. To run a model, add a provider and credentials — see Models & credentials.

Start with your AI assistant (Claude Code, Codex, Cursor, …)

If you use an AI coding assistant with shell access, you don't have to read the docs first. Open it in an empty folder and paste the contents of docs/ai-quickstart-prompt.md. The assistant will install Lodlina, check your AWS Bedrock access, run a first evaluation, and generate a leaderboard — explaining each step as it goes.

The prompt is self-contained and transparent — it's plain markdown, so read it before you paste it; everything it does is right there. It only installs into a local virtual environment and only spends money when you approve a model run.


What it measures

Four tasks, each a real public-sector job where being wrong is concrete. Each ships a synthetic dataset (input + labeled ground truth) and a defensible scorer. Full definitions of "correct" are in docs/methodology.md.

1. records-redaction: don't leak personal privacy info

A synthetic government document mixes must-redact items (SSNs, personal email, home address, date of birth — FOIA Exemption 6 information) with clearly releasable content. The model returns what it would redact.

  • leak_rate (headline, deterministic) — fraction of must-redact items the model missed. A miss is a leak: the most serious failure.
  • over_redaction_rate — releasable items wrongly redacted (over-redacting defeats the purpose of FOIA disclosure).
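The two rates above reduce to set arithmetic over labeled items. The following is a minimal sketch under stated assumptions, not Lodlina's actual scorer: it assumes items are compared by exact string match, that over_redaction_rate is normalized by the number of releasable items, and the function name `redaction_metrics` is hypothetical.

```python
def redaction_metrics(must_redact, releasable, model_redacted):
    """Return (leak_rate, over_redaction_rate) for one document.

    Assumption: items are matched as exact strings against the labels.
    """
    must = set(must_redact)
    free = set(releasable)
    flagged = set(model_redacted)

    missed = must - flagged    # must-redact items the model left exposed
    wrongly = free & flagged   # releasable items the model redacted anyway

    leak_rate = len(missed) / len(must) if must else 0.0
    over_redaction_rate = len(wrongly) / len(free) if free else 0.0
    return leak_rate, over_redaction_rate


# Synthetic example: one of three sensitive items is missed,
# and one of two releasable items is redacted by mistake.
must = ["900-12-3456", "jane@example.com", "12 Elm St"]
free = ["Case opened 2024-01-05", "Office of Records"]
model = ["900-12-3456", "jane@example.com", "Office of Records"]

leak, over = redaction_metrics(must, free, model)
print(f"leak_rate={leak:.2f} over_redaction_rate={over:.2f}")
```

Because both numbers come from labeled sets, a reviewer can recompute them by hand from the dataset and the model's output.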

2. eligibility-fairness: correct, and consistent under irrelevant changes

A synthetic case file plus a policy-manual excerpt with clear rules.

  • accuracy (deterministic) — determination vs. the rule-derived answer.
  • flip_rate (headline, metamorphic) — each case is re-run with only the applicant's name changed (a legally-irrelevant attribute). Any case whose decision flips is flagged. This turns "is it biased?" into an objective "did the decision change when it must not have?" — no subjective bias grader.
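The metamorphic check can be sketched in a few lines. This is an illustration of the idea only: `flip_rate`, the case-file fields, and the two toy decision functions are hypothetical stand-ins for a real model call, and it assumes flip_rate is the fraction of cases flagged.

```python
def flip_rate(decide, cases, alternate_names):
    """Fraction of cases whose decision changes when only the name changes."""
    flipped = 0
    for case in cases:
        baseline = decide(case)
        for name in alternate_names:
            variant = {**case, "name": name}  # every other field is untouched
            if decide(variant) != baseline:
                flipped += 1
                break  # a case is flagged once, however many names flip it
    return flipped / len(cases) if cases else 0.0


def fair_rule(case):
    # Toy rule-derived decision: depends only on income.
    return "eligible" if case["income"] <= 30000 else "ineligible"


def biased_rule(case):
    # Toy biased decision: the applicant's name leaks into the outcome.
    if case["name"].startswith("J"):
        return "ineligible"
    return fair_rule(case)


cases = [
    {"name": "Alex Smith", "income": 18000},
    {"name": "Sam Lee", "income": 52000},
]
print(flip_rate(fair_rule, cases, ["Jordan Park"]))
print(flip_rate(biased_rule, cases, ["Jordan Park"]))
```

No ground-truth label is needed for this metric: the only question is whether the output stayed the same when it was required to.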

3. grounded-qa: answer, and cite faithfully

A policy document plus a question; the model answers and cites supporting passages verbatim.

  • hallucinated_citation_rate (headline, deterministic) — fraction of cited passages not found verbatim in the source (a fabricated quote).
  • answer_correctness, citation_support — model-graded, each backed by a deterministic check (a labeled reference answer; the verbatim gate).
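The verbatim gate behind hallucinated_citation_rate is essentially a substring test. A minimal sketch, assuming exact matching with no whitespace or case normalization (the shipped scorer may normalize differently); the function name is hypothetical.

```python
def hallucinated_citation_rate(source, citations):
    """Fraction of cited passages that do not appear verbatim in the source."""
    if not citations:
        return 0.0
    fabricated = [quote for quote in citations if quote not in source]
    return len(fabricated) / len(citations)


source = (
    "Applications must be filed within 30 days. "
    "Late filings require a waiver."
)
citations = [
    "filed within 30 days",       # present verbatim in the source
    "Late filings are rejected",  # not in the source: a fabricated quote
]
print(hallucinated_citation_rate(source, citations))
```

Only quotes that pass this check would go on to a model-graded question about whether they support the answer.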

4. plain-language: rewrite simply without changing the meaning

A dense bureaucratic paragraph.

  • readability_improvement (deterministic) — Flesch-Kincaid grade-level drop.
  • meaning_preservation — model-graded two-way entailment (no facts added or dropped), reported alongside readability so "simplify by deleting content" can't look like success.
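For reference, the Flesch-Kincaid grade level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, and the improvement is the original's grade minus the rewrite's. The sketch below uses a crude vowel-group syllable counter, which is an assumption made for brevity; Lodlina's scorer may count syllables differently, so absolute values will not match it exactly.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count runs of vowels (heuristic only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(text):
    """Flesch-Kincaid grade level of a passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


def readability_improvement(original, rewrite):
    """Grade-level drop; positive means the rewrite reads more easily."""
    return fk_grade(original) - fk_grade(rewrite)


original = (
    "Applicants are required to furnish documentation substantiating "
    "eligibility prior to adjudication."
)
rewrite = "You must send proof that you qualify. We need it before we decide."
print(round(readability_improvement(original, rewrite), 1))
```

A rewrite that simply deletes content would also score well here, which is why the metric is read alongside meaning_preservation.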

Why you can trust the numbers

  1. Deterministic first. Wherever there's a ground truth, grading is computed from labels or exact string operations — no model judgment.
  2. Metamorphic for fairness. Change only a legally-irrelevant attribute and check whether the output flips. No subjective "bias" grader ships.
  3. Constrained, backed model-grading only where unavoidable. Citation support and meaning preservation use a strict rubric and a deterministic gate (e.g. a quote must pass the verbatim check before a model is asked whether it supports a claim).
  4. No model grades itself. Model-graded scorers use a jury: the same panel of graders for every candidate, so scores are comparable, and ideally a panel that spans model families, so no candidate is judged only by its own family (same-family judges over-reward their family). Configure the panel with --graders / LODLINA_GRADER_MODELS (aliases work). Lodlina warns when a candidate shares a juror's family, and an odd-sized panel is preferred (ties score conservatively). Every report records the jury in its Provenance section. See docs/eval-standards.md.
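The jury rule in point 4 can be pictured as a strict-majority vote plus a family check. This is a sketch under assumptions, not the shipped logic: it reads "ties score conservatively" as "a tie counts as a fail", and it guesses a model's family from the first segment of its alias. Both function names are hypothetical.

```python
def jury_verdict(votes):
    """Pass only on a strict majority of passing votes; a tie is a fail."""
    passes = sum(1 for vote in votes if vote)
    return passes * 2 > len(votes)


def family_of(alias):
    # Assumption: the family is the alias prefix, e.g. "claude" or "gpt".
    return alias.split("-")[0]


def same_family_jurors(candidate, jurors):
    """Jurors that share the candidate's family (worth a warning)."""
    return [j for j in jurors if family_of(j) == family_of(candidate)]


print(jury_verdict([True, True, False]))  # odd panel, clear majority
print(jury_verdict([True, False]))        # even panel, tie
print(same_family_jurors("claude-sonnet-4-6", ["claude-haiku-4-5", "gpt-5.5"]))
```

An odd-sized panel avoids the tie branch entirely, which is one reason to prefer it.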

Every task's definition of "correct" and exactly how its scorer works is in docs/methodology.md — written to be read by an inspector general, not just an ML engineer.


Models & credentials

Lodlina is Bedrock-first. You pick a model by a short alias (claude-sonnet-4-6, claude-opus-4-8, gpt-5.5, …) and it resolves to that model's Amazon Bedrock route by default, keeping prompts in-boundary. The direct OpenAI / Anthropic APIs are secondary routes, used only when you ask for them with --provider — there is no silent cross-boundary fallback.
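The routing rule described here (in-boundary by default, off-boundary only on explicit request, never a silent fallback) can be sketched as a small lookup. Everything in this block is hypothetical and for illustration only: the `DEFAULT_ROUTE` table, the `resolve` function, and its `configured` parameter are not Lodlina's API.

```python
# Hypothetical table: alias -> default in-boundary route.
DEFAULT_ROUTE = {
    "claude-sonnet-4-6": "bedrock",
    "gpt-5.5": "bedrock-mantle",
}
# Routes that send prompts to a commercial API.
OFF_BOUNDARY = {"openai", "anthropic"}


def resolve(alias, provider=None, configured=("bedrock",)):
    """Return (route, boundary), or None if the chosen route is not set up.

    There is deliberately no fallback: an unconfigured route yields None
    rather than quietly switching to a different provider.
    """
    route = provider if provider is not None else DEFAULT_ROUTE[alias]
    if route not in configured:
        return None
    boundary = "off-boundary" if route in OFF_BOUNDARY else "in-boundary"
    return route, boundary


print(resolve("claude-sonnet-4-6"))
print(resolve("gpt-5.5"))  # Mantle not configured in this example
print(resolve("gpt-5.5", provider="openai", configured=("bedrock", "openai")))
```

Returning the boundary alongside the route is what lets a report label where each run sent its data.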

lodlina list shows every alias and where it resolves. Pick the provider extras you need at install time:

Install                                   Enables
pip install "lodlina[bedrock]"            Claude on Bedrock (Converse), the default path
pip install "lodlina[bedrock,openai]"     + OpenAI GPT-5.x (direct API and Bedrock Mantle)
pip install "lodlina[anthropic]"          direct Anthropic API
pip install "lodlina[all]"                every provider

Quote the package spec as shown: shells such as zsh (the macOS default) otherwise treat the square brackets as a glob pattern.

Claude on Bedrock (default, in-boundary)

Standard AWS credentials with Bedrock access; the Claude line-up and the neutral grader both run here:

export AWS_PROFILE=your-bedrock-profile   # or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1

OpenAI GPT-5.x on Bedrock Mantle (in-boundary)

GPT-5.4 / GPT-5.5 are served on Bedrock's separate Mantle endpoint (OpenAI Responses API), authenticated with a Bedrock long-term API key and available in us-east-2 / us-west-2 / us-gov-west-1:

export BEDROCK_MANTLE_BASE_URL=https://bedrock-mantle.us-east-2.api.aws/openai/v1
export BEDROCK_MANTLE_API_KEY=ABSK...   # Bedrock console → API keys → long-term

Aliases (gpt-5.4, gpt-5.5) route here automatically. If the Mantle environment variables aren't set, those rows render as a placeholder instead of failing.

Direct OpenAI / Anthropic (off-boundary)

Secondary routes that send prompts to the commercial APIs — selected explicitly with --provider openai|anthropic, reading OPENAI_API_KEY / ANTHROPIC_API_KEY. The leaderboard always labels each model's provider and data boundary, so a reviewer can see at a glance where every run sent its data.

Claude on a Pro/Max subscription (no per-token billing)

For development and heavy iteration you can run the Claude models on your Claude Pro/Max subscription instead of paying per token — Inspect's anthropic provider supports OAuth bearer auth. Mint a long-lived token once with the Claude CLI, then:

claude setup-token                         # opens a browser; prints an OAuth token
export ANTHROPIC_AUTH_TOKEN=<that token>   # subscription auth (not an API key)
export LODLINA_GRADER_MODEL=anthropic/claude-sonnet-4-6   # keep the grader on the subscription too

lodlina run grounded-qa --model claude-sonnet-4-6 --provider anthropic

This routes both the candidate and the grader through your subscription, so a full run needs no AWS credentials and no per-token charges (subscription rate limits apply). Note this is the direct Anthropic API (off-boundary) — great for building; for an in-boundary government leaderboard, use the Bedrock route. The deterministic tasks (records-redaction, eligibility-fairness) have no model grader, so they run on the subscription with nothing else configured.

Copy .env.example to .env.local for a fill-in-the-blanks setup.


The lodlina command

lodlina list                                   # eval packs + model aliases
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina run records-redaction --model gpt-5.4               # GPT-5.4 via Bedrock Mantle
lodlina run plain-language --model claude-sonnet-4-6 --provider anthropic   # off-boundary
lodlina leaderboard --html                     # full model-comparison board
lodlina validate                               # check every eval pack is sound
lodlina new-pack my-pack --task-type records-redaction     # author your own eval set

run resolves the model Bedrock-first, binds a neutral grader, and is air-gap safe.

Bring your own eval sets (packs)

An eval pack is a shareable evaluation: a manifest.yaml + a synthetic dataset.jsonl that reuses Lodlina's vetted graders by referencing a curated task type. Packs are data + configuration only — no third-party code is executed, so every grader stays auditable.

lodlina new-pack ssn-heavy --task-type records-redaction   # scaffold a valid starter
# ...drop your synthetic records into ssn-heavy/dataset.jsonl...
lodlina validate --pack ./ssn-heavy
lodlina run --pack ./ssn-heavy --model claude-sonnet-4-6

Third parties can distribute packs as pip-installable packages (discovered via the lodlina_packs entry point). See src/lodlina/packs/builtin/README.md.
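Before running lodlina validate on a new pack, a quick structural check of dataset.jsonl can catch copy-paste damage early. The sketch below only confirms that each non-empty line parses as a JSON object; it assumes nothing about the pack's field names, which are defined by the pack documentation, and `check_jsonl` is a hypothetical helper rather than part of Lodlina.

```python
import json


def check_jsonl(lines):
    """Return a list of (line_number, problem) for lines that are not JSON objects."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "expected a JSON object"))
    return problems


sample = ['{"id": 1}', "not json", "[1, 2]", ""]
for number, problem in check_jsonl(sample):
    print(f"line {number}: {problem}")
```

To check a real file, pass an open file handle, for example `check_jsonl(open("ssn-heavy/dataset.jsonl", encoding="utf-8"))`. This does not replace lodlina validate, which checks the pack against its task type.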


Synthetic data & honest limitations

  • All data is synthetic. No real PII/CUI. Identifiers are unmistakably fake (SSNs in the never-issued 900–999 range, 555-01xx phones, example.com emails). Generators are seeded; committed seed sets (~12–18 samples/task) let it run out of the box.
  • Synthetic ≠ representative. Templated documents are cleaner than real agency records; scores indicate capability on a controlled proxy, not certified production performance.
  • Model-graded components inherit grader limits. They're constrained and deterministically backed where possible, but not infallible; read them as grader-relative. The deterministic headlines are the grader-independent signal.
  • U.S. federal framing, English. A starting point, not a complete map of government work.
  • Not legal advice or an authorization to deploy. Lodlina is an evaluation instrument, not a compliance certification.
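The "unmistakably fake" identifiers described in the first bullet above are straightforward to produce with a seeded generator. This is a minimal sketch of the idea, assuming the formats named in that bullet (900-999 SSN area numbers, 555-01xx phone numbers, example.com addresses); the function names are hypothetical and this is not Lodlina's generator.

```python
import random


def fake_ssn(rng):
    """SSN-shaped string with an area number in the never-issued 900-999 range."""
    area = rng.randint(900, 999)
    group = rng.randint(1, 99)
    serial = rng.randint(1, 9999)
    return f"{area:03d}-{group:02d}-{serial:04d}"


def fake_phone(rng):
    """Phone number in the 555-01xx block reserved for fictional use."""
    return f"555-01{rng.randint(0, 99):02d}"


def fake_email(rng):
    """Address on example.com, a domain reserved for documentation."""
    return f"user{rng.randint(1000, 9999)}@example.com"


# Seeding the generator makes the dataset reproducible run to run.
rng = random.Random(42)
print(fake_ssn(rng), fake_phone(rng), fake_email(rng))
```

Because the generator is seeded, two people with the same seed get byte-identical records, which is what makes a score reproducible by an auditor.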

Roadmap & backlog

The plan of record (phases, design decisions, eval-pack format) is in docs/ROADMAP.md. Tasks still on the methodology bench: political-neutrality (symmetric paired prompts), Section-508 alt-text, FOIA exemption-reasoning, and abstention on unanswerable questions.

Develop & contribute

git clone https://github.com/Lodlina/Lodlina && cd Lodlina
uv venv && uv pip install -e ".[dev]"
pytest          # 50+ tests, fully offline

The test suite runs the full Inspect pipeline offline — each task is driven end-to-end by a mock model, so all grading logic is verified without credentials or network. Conventions mirror inspect_evals.

License

MIT.
