Skip to main content

A plumb line for government AI: realistic U.S. public-sector tasks and automated graders for evaluating LLMs, built on Inspect.

Project description

Lodlina

A plumb line for government AI.

Lodlina is Swedish for plumb line — the weighted cord builders have used for millennia to check whether a wall is true. This project is the same idea for AI: a fair, reproducible, auditable way to check whether an AI model does government work correctly, fairly, and honestly — and to compare models on equal footing.

Lodlina is an open-source suite of realistic U.S. public-sector tasks paired with defensible automated graders, built on Inspect (the open evaluation framework from the UK AI Safety Institute). You point it at a model, it runs the tasks, and it produces scores and a model-comparison leaderboard you can put in front of leadership, an inspector general, or an ATO board.

A plumb line doesn't argue about which wall is prettier — it tells you, without opinion, whether the wall is true. Lodlina aims for the same: measurement you can defend, not vibes.

Who this is for

  • Agency AI leads and program owners choosing a model for records, eligibility, public Q&A, or plain-language work — and needing evidence, not a vendor demo.
  • Vendors and integrators selling AI to government who want to show, on neutral ground, that their model is correct, fair, and grounded.
  • Inspectors general, auditors, and eval practitioners who need numbers they can reproduce and defend — every score traces to a label or an exact string check.
  • Researchers studying how LLMs behave on real public-sector tasks.

What makes it different

  • Defensible graders. Most scores are deterministic — computed from labeled ground truth or exact string operations, auditable by someone who isn't an ML expert. Where judgment is unavoidable, the grader uses a strict rubric backed by a deterministic check.
  • All data is synthetic. No real PII or CUI anywhere — safe to run on any network. (SSNs use the never-issued 900–999 range, etc.)
  • In-boundary by default. Lodlina is Bedrock-first: prompts stay in your AWS boundary unless you explicitly opt into a commercial API. Every result is labeled with where it sent its data.
  • Runs anywhere. No telemetry, no phone-home; designed to work air-gapped.

Quickstart (about a minute)

Recommended — with uv. uv brings its own Python, so you don't have to install or manage one (macOS ships an old 3.9 that won't work):

# Install uv if you don't have it: https://docs.astral.sh/uv/getting-started/installation/
uv tool install "lodlina[bedrock]"     # isolated install; fetches a modern Python
uv tool update-shell                   # puts `lodlina` on your PATH (then restart the terminal)

Point it at AWS Bedrock (Claude models enabled in us-east-1) and run:

export AWS_PROFILE=your-bedrock-profile   # or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1

lodlina validate                                              # no credentials needed
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5   # one task (~a few cents)
lodlina leaderboard --models claude-sonnet-4-6 claude-haiku-4-5 --limit 5 --html

lodlina leaderboard --html writes a shareable table to leaderboard/results.html in the current directory, led by the Lodlina Score — an open-ended, difficulty-weighted roll-up of the defensible per-task rates (which stay visible beneath it), versioned per suite so it keeps discriminating as models improve. See docs/scoring.md.

Prefer pip / a virtualenv? (needs Python ≥ 3.10 yourself)
python3.12 -m venv .venv && source .venv/bin/activate   # MUST be 3.10+; macOS's default python3 is 3.9
pip install "lodlina[bedrock]"

If you see ERROR: ... lodlina[bedrock] (from versions: none) or Ignored the following versions that require a different python version, your python3 is too old (< 3.10). Use uv above, or install a newer Python (brew install python@3.12).

Notes

  • The bare lodlina leaderboard runs the full suite (every built-in eval pack — 600+ samples per model) across the full default line-up (Opus, Sonnet, Haiku — plus GPT-5.x if you've configured Bedrock Mantle). That's the publishable board, and it costs real money; it prints its run size before starting. Use --models <alias> … and --limit N to keep a quick comparison cheap; unconfigured models show as rather than failing.
  • No AWS yet? lodlina list and lodlina validate work with no credentials. To run a model, add a provider and credentials — see Models & credentials.

Start with your AI assistant (Claude Code, Codex, Cursor, …)

If you use an AI coding assistant with shell access, you don't have to read the docs first. Open it in an empty folder and paste the contents of docs/ai-quickstart-prompt.md. The assistant will install Lodlina, check your AWS Bedrock access, run a first evaluation, and generate a leaderboard — explaining each step as it goes.

The prompt is self-contained and transparent — it's plain markdown, so read it before you paste it; everything it does is right there. It only installs into a local virtual environment and only spends money when you approve a model run.


What it measures

Ten eval packs across the federal mission spectrum — benefits, records, citizen services, policy Q&A, national security, and defense — each a real public-sector job where being wrong is concrete. Every pack ships a synthetic dataset (input + labeled ground truth) and references one of five vetted task types; packs are data + config only, never code. Full definitions of "correct" are in docs/methodology.md; the content taxonomy is docs/use-cases.md.

Records, disclosure & privacy (B2) — don't leak; don't over-withhold

records-redaction (150 documents) and b2-records-adversarial (100 documents dense with look-alike traps). Synthetic government documents mix must-redact items (SSNs, home addresses — FOIA Exemption 6 information) with clearly releasable content.

  • leak_rate (headline, deterministic) — must-redact items the model missed. A miss is a leak: the most serious failure.
  • over_redaction_rate — releasable items wrongly redacted (over-redacting defeats the purpose of FOIA disclosure).

Benefits & eligibility (B1) — correct, and consistent under irrelevant changes

eligibility-fairness (baseline rules) and b1-benefits-hfap (200 cases of a hard SNAP-like program: size-indexed income tables, layered deductions, categorical eligibility, elderly/disabled exemptions — built to separate frontier models).

  • accuracy (deterministic) — determination vs. the rule-derived answer.
  • flip_rate (headline, metamorphic) — each case re-run with only the applicant's name changed. Any decision that flips is flagged: "is it biased?" becomes an objective "did the decision change when it must not have?" — no subjective bias grader.

National security information protection (B7) — protect, but don't over-classify

b7-classified-spillage (100 fabricated documents with synthetic markings: withhold classified portions before release — spillage — without over-withholding releasable content) and b7-foreign-disclosure (60 cases: may this item go to coalition partner X? NOFORN/REL-TO logic, sharing arrangements, the third-agency rule, sanitization — rule-derived, with the requesting liaison officer's name as the fairness panel). No real classified material is ever used.

Defense & military mission support (B8) — the staff work, done right

b8-deployment-readiness (64 cases of the "is this member deployable?" determination: readiness categories, qualification currency windows, dwell ratios, family care plans) and b8-intel-grounded-qa (80 questions over fabricated intelligence products — answer with verbatim citations, or abstain: half the questions are deliberately unanswerable with generator-guaranteed absence).

  • false_answer_rate (headline, deterministic) — answered an unanswerable question: the most dangerous RAG failure mode, measured exactly.

Authoritative-source Q&A (B4) — answer, and cite faithfully

grounded-qa: a policy document plus a question; the model answers and cites supporting passages verbatim.

  • hallucinated_citation_rate (headline, deterministic) — cited passages not found verbatim in the source (a fabricated quote).
  • answer_correctness, citation_support — jury-graded, each backed by a deterministic check (a labeled reference answer; the verbatim gate).

Citizen services (B3) — rewrite simply without changing the meaning

plain-language (50 dense bureaucratic paragraphs).

  • readability_improvement (deterministic) — Flesch-Kincaid grade drop.
  • meaning_preservation — jury-graded two-way entailment (no facts added or dropped), reported alongside readability so "simplify by deleting content" can't look like success.

Read the methodology white paper: docs/whitepaper.md — the full design, the measurement-integrity story, how Lodlina survives benchmark saturation, and honest limitations, with verified citations.

Why you can trust the numbers

  1. Deterministic first. Wherever there's a ground truth, grading is computed from labels or exact string operations — no model judgment.
  2. Metamorphic for fairness. Change only a legally-irrelevant attribute and check whether the output flips. No subjective "bias" grader ships.
  3. Constrained, backed model-grading only where unavoidable. Citation support and meaning preservation use a strict rubric and a deterministic gate (e.g. a quote must pass the verbatim check before a model is asked whether it supports a claim).
  4. No model grades itself. Model-graded scorers use a jury — the same panel of graders for every candidate (comparable), ideally spanning model families so no candidate is judged only by its own (same-family judges over-reward their family). Configure with --graders / LODLINA_GRADER_MODELS (aliases work); Lodlina warns when a candidate shares a juror's family, and an odd-sized panel is preferred (ties score conservatively). Every report records the jury in its Provenance section. See docs/eval-standards.md.

Every task's definition of "correct" and exactly how its scorer works is in docs/methodology.md — written to be read by an inspector general, not just an ML engineer.


Models & credentials

Lodlina is Bedrock-first. You pick a model by a short alias (claude-sonnet-4-6, claude-opus-4-8, gpt-5.5, …) and it resolves to that model's Amazon Bedrock route by default, keeping prompts in-boundary. The direct OpenAI / Anthropic APIs are secondary routes, used only when you ask for them with --provider — there is no silent cross-boundary fallback.

lodlina list shows every alias and where it resolves. Pick the provider extras you need at install time:

Install Enables
pip install lodlina[bedrock] Claude on Bedrock (Converse) — the default path
pip install lodlina[bedrock,openai] + OpenAI GPT-5.x (direct API and Bedrock Mantle)
pip install lodlina[anthropic] direct Anthropic API
pip install lodlina[all] every provider

Claude on Bedrock (default, in-boundary)

Standard AWS credentials with Bedrock access; the Claude line-up and the neutral grader both run here:

export AWS_PROFILE=your-bedrock-profile   # or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1

OpenAI GPT-5.x on Bedrock Mantle (in-boundary)

GPT-5.4 / GPT-5.5 are served on Bedrock's separate Mantle endpoint (OpenAI Responses API), authenticated with a Bedrock long-term API key and available in us-east-2 / us-west-2 / us-gov-west-1:

export BEDROCK_MANTLE_BASE_URL=https://bedrock-mantle.us-east-2.api.aws/openai/v1
export BEDROCK_MANTLE_API_KEY=ABSK...   # Bedrock console → API keys → long-term

Aliases (gpt-5.4, gpt-5.5) route here automatically. If the Mantle environment isn't set, those rows simply render as instead of failing.

Direct OpenAI / Anthropic (off-boundary)

Secondary routes that send prompts to the commercial APIs — selected explicitly with --provider openai|anthropic, reading OPENAI_API_KEY / ANTHROPIC_API_KEY. The leaderboard always labels each model's provider and data boundary, so a reviewer can see at a glance where every run sent its data.

Claude on a Pro/Max subscription (no per-token billing)

For development and heavy iteration you can run the Claude models on your Claude Pro/Max subscription instead of paying per token — Inspect's anthropic provider supports OAuth bearer auth. Mint a long-lived token once with the Claude CLI, then:

claude setup-token                         # opens a browser; prints an OAuth token
export ANTHROPIC_AUTH_TOKEN=<that token>   # subscription auth (not an API key)
export LODLINA_GRADER_MODEL=anthropic/claude-sonnet-4-6   # keep the grader on the subscription too

lodlina run grounded-qa --model claude-sonnet-4-6 --provider anthropic

This routes both the candidate and the grader through your subscription, so a full run needs no AWS credentials and no per-token charges (subscription rate limits apply). Note this is the direct Anthropic API (off-boundary) — great for building; for an in-boundary government leaderboard, use the Bedrock route. The deterministic tasks (records-redaction, eligibility-fairness) have no model grader, so they run on the subscription with nothing else configured.

Copy .env.example to .env.local for a fill-in-the-blanks setup.


The lodlina command

lodlina list                                   # eval packs + model aliases
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina run records-redaction --model gpt-5.4               # GPT-5.4 via Bedrock Mantle
lodlina run plain-language --model claude-sonnet-4-6 --provider anthropic   # off-boundary
lodlina leaderboard --html                     # full model-comparison board
lodlina validate                               # check every eval pack is sound
lodlina new-pack my-pack --task-type records-redaction     # author your own eval set

run resolves the model Bedrock-first, binds a neutral grader, and is air-gap safe.

Bring your own eval sets (packs)

An eval pack is a shareable evaluation: a manifest.yaml + a synthetic dataset.jsonl that reuses Lodlina's vetted graders by referencing a curated task type. Packs are data + configuration only — no third-party code is executed, so every grader stays auditable.

lodlina new-pack ssn-heavy --task-type records-redaction   # scaffold a valid starter
# ...drop your synthetic records into ssn-heavy/dataset.jsonl...
lodlina validate --pack ./ssn-heavy
lodlina run --pack ./ssn-heavy --model claude-sonnet-4-6

Third parties can distribute packs as pip-installable packages (discovered via the lodlina_packs entry point). See src/lodlina/packs/builtin/README.md.


Synthetic data & honest limitations

  • All data is synthetic. No real PII/CUI. Identifiers are unmistakably fake (SSNs in the never-issued 900–999 range, 555-01xx phones, example.com emails). Generators are seeded; committed seed sets (~12–18 samples/task) let it run out of the box.
  • Synthetic ≠ representative. Templated documents are cleaner than real agency records; scores indicate capability on a controlled proxy, not certified production performance.
  • Model-graded components inherit grader limits. They're constrained and deterministically backed where possible, but not infallible; read them as grader-relative. The deterministic headlines are the grader-independent signal.
  • U.S. federal framing, English. A starting point, not a complete map of government work.
  • Not legal advice or an authorization to deploy. Lodlina is an evaluation instrument, not a compliance certification.

Roadmap & backlog

The plan of record (phases, design decisions, eval-pack format) is in docs/ROADMAP.md. Tasks still on the methodology bench: political-neutrality (symmetric paired prompts), Section-508 alt-text, FOIA exemption-reasoning, and abstention on unanswerable questions.

Develop & contribute

git clone https://github.com/Lodlina/Lodlina && cd Lodlina
uv venv && uv pip install -e ".[dev]"
pytest          # 50+ tests, fully offline

The test suite runs the full Inspect pipeline offline — each task is driven end-to-end by a mock model, so all grading logic is verified without credentials or network. Conventions mirror inspect_evals.

License

MIT.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lodlina-0.5.1.tar.gz (538.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lodlina-0.5.1-py3-none-any.whl (247.2 kB view details)

Uploaded Python 3

File details

Details for the file lodlina-0.5.1.tar.gz.

File metadata

  • Download URL: lodlina-0.5.1.tar.gz
  • Upload date:
  • Size: 538.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lodlina-0.5.1.tar.gz
Algorithm Hash digest
SHA256 54cc390cb77521d7eb9396aa75ec481aa9ecabc80362d1056f02863730d2ad63
MD5 766a5893408b07e0557d733e76ac6407
BLAKE2b-256 4a0f336972fe05cd1e494d0196bb3fb2ef145bc0cf8c68aca84f80721a330caf

See more details on using hashes here.

Provenance

The following attestation bundles were made for lodlina-0.5.1.tar.gz:

Publisher: release.yml on Lodlina/Lodlina

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lodlina-0.5.1-py3-none-any.whl.

File metadata

  • Download URL: lodlina-0.5.1-py3-none-any.whl
  • Upload date:
  • Size: 247.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lodlina-0.5.1-py3-none-any.whl
Algorithm Hash digest
SHA256 f902ac9a09cdb0cbf8f378041035f3177c7db84b89d165f646cc12a0b150c48f
MD5 d682b47d52b8a4bb583767ac083a91ca
BLAKE2b-256 18e10baaefd8bb0a580d9d173b9ffc54965f43e24b8974af65c435a8f65c6905

See more details on using hashes here.

Provenance

The following attestation bundles were made for lodlina-0.5.1-py3-none-any.whl:

Publisher: release.yml on Lodlina/Lodlina

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page