Regression canary for LLM apps: YAML test suites, offline-first assertions, baseline diffing, and agent-trace policy gates for CI

Maintainers

okssusucha

llm-canary

Regression canary for LLM apps in CI: declarative YAML test suites for prompts, baseline drift detection without golden answers, and policy gates for agent traces (tool order, cost budgets, runaway loops). Use it as a CLI in CI, or self-host it as a service with run history, a dashboard, and team-shared baselines. Offline-first — the whole test suite and the bundled examples run with zero API keys.


Why

Prompt changes are silent regressions. A one-line prompt tweak can break JSON output, leak text it shouldn't, or double your token bill — and nothing in a normal CI pipeline notices. llm-canary run turns those into failing builds.
You rarely have golden answers. LLM output isn't byte-stable, so snapshot tests don't work. record/check compares against a baseline with semantic similarity and cost-drift thresholds instead of exact matches.
Agents act; outputs aren't enough. In 2026 the risk moved from "what did the model say" to "what did the agent do". trace gates a JSONL action log against a policy: forbidden tools, required ordering, step/cost budgets, loop detection.


Quickstart

# install (Python 3.11+)
uv tool install llm-canary    # or: pip install llm-canary
# from source: git clone && uv sync && uv run llm-canary ...

# scaffold and run a starter suite — works offline, no keys
llm-canary init
llm-canary run canary.yaml

# the bundled, fully offline example suite
llm-canary run canary.example.yaml

# check an agent trace against a policy
llm-canary trace examples/agent-trace/trace.jsonl \
  --policy examples/agent-trace/policy.yaml

The exit code is 0 when everything passes and 1 on any failure, so the command drops straight into CI.

Suite YAML

name: support-bot
providers:
  - name: openai            # echo / fixture / openai / anthropic
    model: gpt-4o-mini
judge:                      # optional: provider used by `judge` assertions
  name: anthropic
  model: claude-haiku-4-5
cases:
  - name: refund-policy
    prompt: "A customer asks: can I get a refund for {product}?"
    vars:
      product: "a keyboard bought 2 weeks ago"
    assertions:
      - type: contains
        value: "30 days"
      - type: json_schema
        value: {type: object, required: [eligible]}
      - type: judge
        value: "Politely explains the refund policy"
        threshold: 0.7
      - type: max_cost_usd
        value: 0.01

  - name: tone-by-language        # matrix: one case → cartesian product
    prompt: "Reply to a {mood} customer in {lang}: where is my order?"
    matrix:
      mood: [calm, angry]
      lang: [English, Japanese]
    assertions:
      - type: judge
        value: "Stays polite and de-escalates"
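
The matrix expansion described in the suite above can be pictured as a cartesian product over the listed values. The following is a minimal sketch, not llm-canary's actual implementation: the function name `expand_matrix` is hypothetical, and it assumes `{name}` placeholders behave like Python's `str.format`, which the README does not confirm.

```python
from itertools import product


def expand_matrix(prompt: str, matrix: dict[str, list[str]]) -> list[tuple[dict[str, str], str]]:
    """Expand one templated case into a (vars, rendered prompt) pair per combination."""
    keys = list(matrix)
    expanded = []
    for values in product(*(matrix[key] for key in keys)):
        variables = dict(zip(keys, values))
        # Assumption: placeholders are filled like str.format fields.
        expanded.append((variables, prompt.format(**variables)))
    return expanded


cases = expand_matrix(
    "Reply to a {mood} customer in {lang}: where is my order?",
    {"mood": ["calm", "angry"], "lang": ["English", "Japanese"]},
)
for variables, rendered in cases:
    print(variables, "->", rendered)
```

Two moods and two languages yield four concrete cases, each checked against the same assertions. A prompt containing literal braces would need escaping under this `str.format` assumption.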

Useful extras:

llm-canary validate suite.yaml: lint the suite (typos in provider or assertion names, missing judge) without calling anything
--max-workers 8: run cases concurrently against slow real APIs
--json: machine-readable results
Direct-API mode can load your app's real system prompt so it is on the tested path:

providers:
  - name: openai
    model: gpt-4o-mini
    options:
      system_prompt_file: ../app/prompts/support.txt   # the SAME file your app loads

Assertions

type	checks
`contains` / `not_contains`	substring (optionally `case_insensitive`)
`regex`	pattern match
`equals`	exact match, whitespace-trimmed
`json_valid`	parseable JSON (handles ``` fences and surrounding prose)
`json_schema`	JSON Schema validation
`similarity`	semantic similarity vs. reference (`threshold`)
`judge`	LLM-as-judge score vs. criteria (`threshold`)
`max_latency_ms` / `max_cost_usd` / `max_output_tokens`	budget gates
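
To illustrate what "handles fences and prose" could mean for `json_valid`, here is a rough sketch of tolerant JSON extraction. It is an assumption about the general technique, not the library's code: `extract_json` is a hypothetical helper that tries the whole reply, then any fenced block, then the first decodable object or array embedded in prose.

```python
import json
import re

TICKS = "`" * 3  # a Markdown code fence, built indirectly to keep this listing readable
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def extract_json(text: str):
    """Return the first JSON value found in `text`, or raise ValueError."""
    candidates = [text.strip()] + [body.strip() for body in FENCE.findall(text)]
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            pass
    # Fall back to scanning prose for the first decodable object or array.
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON found")


print(extract_json('The answer is {"eligible": false} as requested.'))
```

A `json_schema` check would then validate the extracted value rather than the raw reply.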

Providers

echo: returns the prompt; deterministic, free, offline.
fixture: regex-routed canned replies; ideal for offline demos and as an offline judge.
openai / anthropic: real APIs via OPENAI_API_KEY / ANTHROPIC_API_KEY (the base_url option points openai at any OpenAI-compatible endpoint).
command / http: your actual bot, whatever it is (see below).
Cost is estimated from a built-in price table, which is good enough for budget gates.
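
As a rough picture of "regex-routed canned replies", the sketch below returns the reply of the first pattern that matches the prompt. The class name `FixtureProvider`, its `complete` method, and the route format are all hypothetical; the README does not show how fixture routes are configured.

```python
import re


class FixtureProvider:
    """Return the canned reply of the first route whose pattern matches the prompt."""

    def __init__(self, routes: list[tuple[str, str]], default: str = ""):
        # Compile once; matching is case-insensitive in this sketch.
        self.routes = [(re.compile(pattern, re.IGNORECASE), reply) for pattern, reply in routes]
        self.default = default

    def complete(self, prompt: str) -> str:
        for pattern, reply in self.routes:
            if pattern.search(prompt):
                return reply
        return self.default


bot = FixtureProvider(
    routes=[
        (r"refund", '{"eligible": true, "note": "Refunds are accepted within 30 days."}'),
        (r"where is my order", "Your order is on its way."),
    ],
    default="Sorry, I did not understand that.",
)
print(bot.complete("A customer asks: can I get a refund for a keyboard?"))
```

Because the replies are fixed, assertions such as `contains` behave identically on every run, which is what makes this style of provider usable offline.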

Test YOUR bot, not the raw model

A canary is only meaningful if the thing you change (your system prompt, your RAG pipeline, your pre/post-processing) is on the execution path. The command and http providers put your real application under test, however it is built.

command: anything executable (Python, Node, Go, shell, …). The prompt replaces {prompt} in the arguments, or is piped to stdin when there is no placeholder, and stdout is the reply:

providers:
  - name: command
    options:
      cmd: "python my_bot.py --ask {prompt}"   # or just "python my_bot.py" (stdin)
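
The placeholder-or-stdin behaviour can be sketched with the standard library. This is an illustration under assumptions, not the provider's source: `run_command` is a hypothetical helper, and splitting the command before substitution is a design choice made here so the prompt stays a single argument and never passes through a shell.

```python
import shlex
import subprocess
import sys


def run_command(cmd: str, prompt: str, timeout: float = 30.0) -> str:
    """Run `cmd` with the prompt as an argument or on stdin and return its stdout."""
    args = shlex.split(cmd)
    if any("{prompt}" in arg for arg in args):
        # Substitute after splitting: the prompt cannot inject extra arguments.
        args = [arg.replace("{prompt}", prompt) for arg in args]
        stdin_text = ""
    else:
        stdin_text = prompt
    result = subprocess.run(
        args, input=stdin_text, capture_output=True, text=True, timeout=timeout, check=True
    )
    return result.stdout.strip()


# Two stand-in "bots" built from the current interpreter, one per input style.
python = shlex.quote(sys.executable)
arg_bot = f'{python} -c "import sys; print(sys.argv[1].upper())" {{prompt}}'
stdin_bot = f'{python} -c "import sys; print(sys.stdin.read()[::-1])"'
print(run_command(arg_bot, "hi there"))
print(run_command(stdin_bot, "abc"))
```

With `check=True`, a bot that exits non-zero raises `subprocess.CalledProcessError` instead of returning partial output.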

http: anything with an HTTP API. {prompt} is substituted into the body, params, or URL; the reply is extracted from the response JSON with a dot path:

providers:
  - name: http
    options:
      url: http://localhost:8000/chat
      body: {message: "{prompt}", session: "ci"}
      response_path: reply.text          # or e.g. choices.0.message.content
      headers: {Authorization: "Bearer ${BOT_TOKEN}"}
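
A dot path such as `reply.text` or `choices.0.message.content` can be resolved with a short loop. The helper below is a sketch of that idea only; the name `extract_path` and the rule that numeric segments index into lists are assumptions inferred from the two example paths.

```python
from typing import Any


def extract_path(payload: Any, path: str) -> Any:
    """Walk `payload` along a dot path; segments index into lists when the current node is a list."""
    current = payload
    for segment in path.split("."):
        if isinstance(current, list):
            current = current[int(segment)]
        else:
            current = current[segment]
    return current


simple = {"reply": {"text": "Refunds are accepted within 30 days."}}
openai_style = {"choices": [{"message": {"content": "Hello!"}}]}
print(extract_path(simple, "reply.text"))
print(extract_path(openai_style, "choices.0.message.content"))
```

A missing key or out-of-range index raises `KeyError` or `IndexError` here, which a real provider would presumably report as a failed case.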

In CI, boot your bot and point the canary at it:

- run: docker compose up -d my-chatbot
- run: llm-canary run suite.yaml        # http provider hits the real stack

Security: the command provider executes processes, so the self-hosted server rejects it unless started with llm-canary serve --allow-command (or CANARY_ALLOW_COMMAND=1).

Baseline drift

llm-canary record canary.yaml          # snapshot outputs + costs
llm-canary check canary.yaml           # rerun and gate on drift
llm-canary check canary.yaml --similarity-threshold 0.85 --cost-drift 0.1

check fails when an output's semantic similarity to the baseline drops below the threshold, or when its cost grows beyond the allowed ratio. The default embedder is a deterministic offline hash embedder (pluggable).
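
The two drift gates can be sketched as follows. Everything here is illustrative: the bag-of-words hashing scheme, the function names, and the reading of `--cost-drift 0.1` as "at most 10% growth over the baseline" are assumptions, and the library's real embedder may work differently.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 256) -> list[float]:
    """Deterministic bag-of-words embedding: each token increments one hashed bucket."""
    vector = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vector


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check_drift(baseline: dict, current: dict,
                similarity_threshold: float = 0.85, cost_drift: float = 0.1) -> list[str]:
    """Return human-readable drift failures; an empty list means the gate passes."""
    failures = []
    similarity = cosine(hash_embed(baseline["text"]), hash_embed(current["text"]))
    if similarity < similarity_threshold:
        failures.append(f"similarity {similarity:.2f} below {similarity_threshold}")
    if baseline["cost_usd"] > 0:
        growth = (current["cost_usd"] - baseline["cost_usd"]) / baseline["cost_usd"]
        if growth > cost_drift:
            failures.append(f"cost grew {growth:.0%}, allowed {cost_drift:.0%}")
    return failures


baseline = {"text": "Refunds are accepted within 30 days.", "cost_usd": 0.002}
print(check_drift(baseline, {"text": "Refunds are accepted within 30 days.", "cost_usd": 0.004}))
```

`hashlib` is used instead of the built-in `hash()` because the latter is salted per process, which would make the embedding differ between the `record` and `check` runs.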

Agent trace gates

{"type": "tool_call", "tool": "query_sales_db", "cost_usd": 0.002}
{"type": "tool_call", "tool": "post_slack", "cost_usd": 0.001}

# policy.yaml
max_steps: 10
max_cost_usd: 0.05
forbidden_tools: [delete_records, send_email]
required_order: [query_sales_db, post_slack]
max_tool_repeats: 3        # catch runaway loops

llm-canary trace trace.jsonl --policy policy.yaml

Emit one JSON object per agent step from your framework of choice and gate it in CI.
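
To make the policy fields concrete, the sketch below evaluates the example trace against the example policy. It is a hypothetical reading of the semantics, not the engine in `trace.py`: here `required_order` means the listed tools appear in that relative order, and `max_tool_repeats` caps the total number of calls per tool.

```python
import json


def check_trace(steps: list[dict], policy: dict) -> list[str]:
    """Return policy violations for a list of agent steps; empty means the trace passes."""
    violations = []
    tools = [step["tool"] for step in steps if step.get("type") == "tool_call"]

    if "max_steps" in policy and len(steps) > policy["max_steps"]:
        violations.append(f"too many steps: {len(steps)}")
    total_cost = sum(step.get("cost_usd", 0.0) for step in steps)
    if "max_cost_usd" in policy and total_cost > policy["max_cost_usd"]:
        violations.append(f"cost budget exceeded: {total_cost:.4f}")
    for tool in sorted(set(tools) & set(policy.get("forbidden_tools", []))):
        violations.append(f"forbidden tool used: {tool}")

    # Subsequence check: `in` on an iterator consumes it up to the match,
    # so each required tool must appear after the previous one.
    remaining = iter(tools)
    for required in policy.get("required_order", []):
        if required not in remaining:
            violations.append(f"required order broken at: {required}")
            break

    limit = policy.get("max_tool_repeats")
    if limit is not None:
        for tool in sorted(set(tools)):
            if tools.count(tool) > limit:
                violations.append(f"tool repeated too often: {tool}")
    return violations


trace_text = """
{"type": "tool_call", "tool": "query_sales_db", "cost_usd": 0.002}
{"type": "tool_call", "tool": "post_slack", "cost_usd": 0.001}
"""
policy = {
    "max_steps": 10,
    "max_cost_usd": 0.05,
    "forbidden_tools": ["delete_records", "send_email"],
    "required_order": ["query_sales_db", "post_slack"],
    "max_tool_repeats": 3,
}
steps = [json.loads(line) for line in trace_text.splitlines() if line.strip()]
print(check_trace(steps, policy))
```

A CI wrapper would exit with status 1 whenever the returned list is non-empty, matching the CLI's documented exit codes.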

Self-hosting

Run llm-canary as a service inside your own infrastructure. Suites, outputs, traces, and baselines are stored in a local SQLite file, so prompts and agent logs never leave your network (except for calls to providers you explicitly configure).

docker compose up -d          # serves on :8080, history persisted in a volume
# or without Docker:
pip install 'llm-canary[server]'
llm-canary serve --port 8080 --token "$CANARY_TOKEN"

With --token (or CANARY_TOKEN), every endpoint except /healthz requires Authorization: Bearer <token>. Set it whenever the server is reachable from anywhere other than 127.0.0.1.

Teams and CI jobs talk to it over HTTP:

# run a suite (body = suite spec as JSON)
curl -X POST localhost:8080/api/runs -H 'content-type: application/json' \
  -d @suite.json

# gate an agent trace
curl -X POST localhost:8080/api/traces/check -H 'content-type: application/json' \
  -d '{"steps": [...], "policy": {"forbidden_tools": ["delete_records"]}}'

# record a team-shared baseline, then check drift against it
curl -X PUT  localhost:8080/api/baselines/main -d @suite.json \
  -H 'content-type: application/json'
curl -X POST localhost:8080/api/baselines/main/check -d '{"suite": ...}' \
  -H 'content-type: application/json'

GET /: dashboard with run history
GET /api/runs, GET /api/runs/{id}: history and full detail
GET /healthz: liveness probe

Baselines live on the server, so the whole team (and every CI job) gates against the same baseline instead of per-machine files.

GitHub Actions

- name: LLM regression gate
  run: |
    uv run llm-canary run canary.yaml --junit junit.xml --md summary.md
    uv run llm-canary trace trace.jsonl --policy policy.yaml

--junit integrates with test reporters; --md produces a summary ready to post as a PR comment.

Architecture

suite YAML ─▶ runner ─▶ provider (echo | fixture | openai | anthropic | command | http)
                │              │
                ▼              ▼
           assertions ◀── completion {text, tokens, cost, latency}
                │
                ├─▶ reports: console / JUnit XML / Markdown
                └─▶ baseline: record / drift check (hash embedder)

trace JSONL ─▶ policy checks ─▶ violations (exit 1)

src/llm_canary/config.py — pydantic specs for suites & policies
src/llm_canary/providers/ — provider registry (offline + remote)
src/llm_canary/assertions/ — assertion registry (basic + quality)
src/llm_canary/baseline.py — snapshot & drift detection
src/llm_canary/trace.py — agent-trace policy engine
src/llm_canary/report.py — console / JUnit / Markdown reporters
src/llm_canary/server.py — self-hosted FastAPI server (REST + dashboard)
src/llm_canary/storage.py — SQLite history & team-shared baselines

Development

uv sync
uv run pytest          # entire suite is offline — no keys, no network
uv run ruff check .

License

MIT


This version

0.4.0

Jun 10, 2026


File details

Details for the file llm_canary-0.4.0.tar.gz.

File metadata

Download URL: llm_canary-0.4.0.tar.gz
Upload date: Jun 10, 2026
Size: 95.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_canary-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`c049653c7e59804d4a8282b3436a6e5ce260cf51c12b163e9f18709d712565f3`
MD5	`16795875de02dd9a5bac993658aec145`
BLAKE2b-256	`670010de3048112c6076ee9ddc3d25676d859f71361af8a1f5af388df8a8d7e6`


Provenance

The following attestation bundles were made for llm_canary-0.4.0.tar.gz:

Publisher: release.yml on okssusucha/llm-canary

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_canary-0.4.0.tar.gz
- Subject digest: c049653c7e59804d4a8282b3436a6e5ce260cf51c12b163e9f18709d712565f3
- Sigstore transparency entry: 1773001787
- Sigstore integration time: Jun 10, 2026
Source repository:
- Permalink: okssusucha/llm-canary@eef6572ded10a6f8c3c79753bdccab72d65bd581
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/okssusucha
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@eef6572ded10a6f8c3c79753bdccab72d65bd581
- Trigger Event: push

File details

Details for the file llm_canary-0.4.0-py3-none-any.whl.

File metadata

Download URL: llm_canary-0.4.0-py3-none-any.whl
Upload date: Jun 10, 2026
Size: 32.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_canary-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dbbc1dacba1de990d7be5a1c126c9d294a242356424afe826f4d93015e03f233`
MD5	`41fa0f4abfd2575fd91a48f091cb8bd7`
BLAKE2b-256	`605b296d31420c99d715f16ba3bc211d22416552040f7f65149ac56afd27bdab`


Provenance

The following attestation bundles were made for llm_canary-0.4.0-py3-none-any.whl:

Publisher: release.yml on okssusucha/llm-canary

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_canary-0.4.0-py3-none-any.whl
- Subject digest: dbbc1dacba1de990d7be5a1c126c9d294a242356424afe826f4d93015e03f233
- Sigstore transparency entry: 1773002206
- Sigstore integration time: Jun 10, 2026
Source repository:
- Permalink: okssusucha/llm-canary@eef6572ded10a6f8c3c79753bdccab72d65bd581
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/okssusucha
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@eef6572ded10a6f8c3c79753bdccab72d65bd581
- Trigger Event: push
