Generate eval suites from prompt logs and catch LLM regressions locally.

redline

Automatic eval suites from the prompt logs you already have.

redline turns real prompt-response logs into regression tests. It watches or imports existing outputs, selects representative cases, replays your changed prompt, and shows the behavioral diff before a bad prompt ships.

Product Promise

In under five minutes, on a real prompt log, redline should catch one regression you did not want to ship.

That promise is intentionally narrow. redline is not a hosted eval platform, a generic score, or a replacement for human judgment. It is the local safety loop between "I changed the prompt" and "this is safe enough to merge."

Why It Exists

Most teams already have the raw material for evals: prompts, outputs, support tickets, traces, model responses, and production logs. What they usually do not have is time to hand-write a full regression suite before every prompt edit.

redline makes the first suite free:

Use the logs you already have.
Cluster behavior into representative cases.
Re-run the suite after a prompt change.
See exactly what broke: JSON keys, required numbers, URLs, tables, code blocks, refusals, empty answers, and other high-signal changes.
Mark intentional changes, accept reviewed outputs, and keep the suite moving with the product.
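
To make "cluster behavior into representative cases" concrete, here is a minimal sketch that groups outputs by a coarse structural signature and keeps one example per group. It illustrates the idea only and is not redline's actual selection algorithm; the function names and the three signature features are assumptions made for this example.

```python
import json


def signature(output: str) -> tuple:
    """Coarse structural fingerprint of one response (hypothetical features)."""
    text = output.strip()
    try:
        is_json = isinstance(json.loads(text), (dict, list))
    except ValueError:
        is_json = False
    return (
        is_json,          # parses as a JSON object or array
        "http" in text,   # mentions at least one URL
        text == "",       # empty answer
    )


def pick_representatives(outputs: list[str]) -> list[str]:
    """Keep the first output seen for each distinct signature."""
    groups: dict[tuple, str] = {}
    for out in outputs:
        groups.setdefault(signature(out), out)
    return list(groups.values())
```

Two JSON responses with different keys share a signature here, so only the first is kept; a real selector would likely use finer-grained features.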

Start Here

Install from PyPI:

python -m pip install redline-ai

Run the public proof:

redline demo --public --compact

The demo catches ten synthetic regressions without API keys, private logs, a cloud account, or an LLM judge. It writes JSON, Markdown, and self-contained HTML reports under .redline/demo.

Open the local report index:

redline dashboard --open

Real Workflow

Build a suite from baseline logs:

redline suite logs/baseline.jsonl --out redline-suite.json
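
The baseline file is JSONL: one prompt-response record per line. The sketch below writes and reads such a file. The `input` and `output` field names are assumptions for illustration; the `input_field` and `output_field` config keys exist so you can point redline at whatever names your logs use.

```python
import json
from pathlib import Path


def write_jsonl(path: Path, records: list[dict]) -> None:
    """Write one JSON object per line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read records back, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```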

Evaluate a changed prompt file through your configured runner:

redline eval --prompt prompts/v2.txt

Or compare candidate outputs you already generated:

redline diff redline-suite.json logs/candidate.jsonl

When redline finds a blocking change, it exits non-zero for CI and prints the reason:

REGRESSION case_004
- candidate missing JSON keys: owner, required_action
- candidate missing URL: https://example.com/policies/refunds

Confidence: HIGH | fix blocking cases before shipping
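
If you drive redline from a Python script rather than directly from a CI step, the non-zero exit code is the only contract you need. A minimal sketch, where `run_gate` is a hypothetical helper name and not part of redline:

```python
import subprocess
import sys


def run_gate(cmd: list[str]) -> bool:
    """Run a command and report whether it passed (exit code 0)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Echo the tool's own output so the reason stays visible in CI logs.
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)
    return result.returncode == 0
```

For example, `run_gate(["redline", "diff", "redline-suite.json", "logs/candidate.jsonl"])` returns `False` when redline reports a blocking change, and the calling script can then exit non-zero itself.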

What redline Catches

| Signal | Example regression |
| --- | --- |
| JSON validity and keys | Candidate stops returning valid JSON or drops `owner`. |
| Tables, lists, and code blocks | Markdown table becomes prose; code fence disappears. |
| Numbers, URLs, and entities | Refund window, ticket ID, policy URL, or owner is missing. |
| Empty outputs and refusals | Candidate newly refuses a safe task or returns nothing. |
| Content drift | Same-shape response changes substantially. |
| Explicit requirements | Pinned cases require or forbid exact strings. |

redline is deterministic and local-first by default. Optional judge commands are available for ambiguous changed cases, but redline does not call a cloud model unless you explicitly configure that command.
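
To show what a deterministic structural check looks like, here is a minimal sketch of two of the signals above: dropped JSON keys and dropped URLs. It is an illustration written for this README, not redline's implementation; the function names and the URL pattern are assumptions.

```python
import json
import re

# Deliberately simple URL pattern; stops at whitespace, quotes, and angle brackets.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def missing_json_keys(baseline: str, candidate: str) -> list[str]:
    """Top-level keys in the baseline object that the candidate lacks."""
    try:
        base = json.loads(baseline)
    except ValueError:
        return []  # baseline is not JSON, so there is nothing to compare
    if not isinstance(base, dict):
        return []
    try:
        cand = json.loads(candidate)
    except ValueError:
        cand = None  # invalid candidate JSON counts as missing every key
    if not isinstance(cand, dict):
        cand = {}
    return sorted(key for key in base if key not in cand)


def missing_urls(baseline: str, candidate: str) -> list[str]:
    """URLs that appear in the baseline text but not in the candidate."""
    candidate_urls = set(URL_RE.findall(candidate))
    return sorted(set(URL_RE.findall(baseline)) - candidate_urls)
```

Both checks need no model call and return the same result on every run, which is the property the "deterministic and local-first" default relies on.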

Trust Boundary

A green redline run means no configured high-signal structural blockers were found. It does not prove factual correctness, tone, hallucination safety, policy compliance, or subtle reasoning quality.

That boundary is visible in CLI output and reports because over-trusting eval tools is dangerous. Use requirements or an optional judge for semantic risks that structural checks cannot prove.

Product Surface

redline is built around the full prompt-regression loop:

redline watch: collect prompt-response observations from logs or Python code.
redline cluster: inspect behavior groups before suite generation.
redline suite: generate a representative eval suite from baseline logs.
redline suite add: pin hand-picked edge cases the algorithm should never miss.
redline eval: replay each suite case through your local app or model runner.
redline diff: compare candidate JSONL outputs against the suite baseline.
redline mark and redline accept: review intentional changes and promote the new baseline.
redline require: add deterministic must-include or must-not-include rules.
redline history, redline compare, and redline dashboard: track quality over time and inspect reports locally.
redline-mcp: let AI coding assistants run checks inside Claude, Codex, Cursor, Kiro, or any MCP client.
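
The must-include and must-not-include rules behind `redline require` are exact-string checks. A minimal sketch of that idea, assuming plain substring matching; `check_requirements` and its message wording are hypothetical and do not reflect redline's internals:

```python
def check_requirements(output, must_include=(), must_not_include=()):
    """Return human-readable violations; an empty list means the case passes."""
    problems = []
    for required in must_include:
        if required not in output:
            problems.append(f"missing required string: {required!r}")
    for forbidden in must_not_include:
        if forbidden in output:
            problems.append(f"contains forbidden string: {forbidden!r}")
    return problems
```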

Connect Your App

Any command that reads a prompt from stdin and prints a response to stdout can be a redline runner:

redline init --runner stdio --copy-runner --github-action
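
A runner under that contract can be very small. The sketch below reads the whole prompt from stdin and writes one response to stdout. `respond` is a placeholder for your own app or model call, and this file is an illustration, not the runner that `--copy-runner` generates.

```python
import sys


def respond(prompt: str) -> str:
    """Placeholder for your app or model call (hypothetical)."""
    return f"echo: {prompt.strip()}"


def main(stdin, stdout) -> None:
    """Read one prompt from stdin and print one response to stdout."""
    prompt = stdin.read()
    stdout.write(respond(prompt) + "\n")
```

To use it as a command, save it as a script and call `main(sys.stdin, sys.stdout)` under the usual `if __name__ == "__main__":` guard.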

Built-in adapters cover provider-neutral stdio, OpenAI, Anthropic, LiteLLM, HTTP APIs, Python chains, and JSONL log imports:

redline runners
redline runners --copy all

Runner details live in docs/runners.md. Log import adapters are for building suites from exported logs, not for redline eval replay.

AI Assistant Native

redline ships a local Model Context Protocol server:

redline-mcp

Use docs/mcp.md to wire redline into an MCP client. The MCP surface exposes safe read/eval/report tools and workflow prompts like check_prompt_change, build_suite_from_logs, and review_candidate_outputs. It does not expose baseline mutation commands.

CI And GitHub

Create config plus a GitHub Actions workflow:

redline init --runner stdio --copy-runner --github-action

Use redline as a composite GitHub Action from another repo:

- uses: gowtham0992/redline@v0.1.0
  with:
    prompt-path: prompts/v2.txt

The action writes JSON, Markdown, HTML, JUnit, history, and dashboard artifacts under .redline/, appends the report and trend summary to the GitHub step summary, and exits with the eval gate status.

Reports

Every diff and eval run can write:

JSON for machines and dashboards
Markdown for PR comments and summaries
self-contained HTML for side-by-side inspection
JUnit XML for CI test reporting
GitHub annotations for changed or blocking cases

Example:

redline diff redline-suite.json logs/candidate.jsonl \
  --out-json .redline/reports/diff.json \
  --out-md .redline/reports/diff.md \
  --out-html .redline/reports/diff.html \
  --out-junit .redline/reports/diff.xml

Optional Judges

Use judges only where structural checks are not enough. redline sends only ambiguous changed cases to the configured command as JSON on stdin:

redline diff logs/candidate.jsonl --judge "python examples/judge_changed.py"

Included templates:

Calibration guidance lives in docs/judges.md.

Config

redline init writes redline.json with a $schema reference for editor autocomplete. Important keys:

| Key | Purpose |
| --- | --- |
| `suite` | Suite baseline path, default `redline-suite.json`. |
| `input_field`, `output_field` | JSONL field paths for prompts and responses. |
| `max_cases` | Maximum representative cases selected for a suite. |
| `replay` | Command used by `eval`; prompts go to stdin unless it contains `{prompt}`. |
| `workers` | Number of replay cases to run concurrently. |
| `fail_on` | Statuses that fail `diff` or `eval`; use `"none"` for report-only setup. |
| `reports` | JSON, Markdown, HTML, and JUnit output paths. |
| `judge` | Optional command for ambiguous `changed` cases. |
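
As a rough picture of how these keys fit together, the sketch below builds a config dictionary and renders it as JSON. The key names come from the table above; every value except the documented `suite` default and `"none"` is a placeholder, and the value shapes are assumptions. `reports`, `judge`, and `$schema` are left out because their exact shapes are not described here. Prefer the file that `redline init` writes.

```python
import json

config = {
    "suite": "redline-suite.json",    # documented default
    "input_field": "input",           # hypothetical field name in your logs
    "output_field": "output",         # hypothetical field name in your logs
    "max_cases": 50,                  # placeholder value
    "replay": "python my_runner.py",  # hypothetical command; no {prompt}, so stdin is used
    "workers": 4,                     # placeholder value
    "fail_on": "none",                # report-only setup
}


def render_config(cfg: dict) -> str:
    """Serialize the config the way a hand-edited redline.json might look."""
    return json.dumps(cfg, indent=2) + "\n"
```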

Check setup before relying on a suite:

redline doctor --strict
redline validate redline-suite.json --strict
redline summary redline-suite.json

Dogfood Assets

The public fixture is synthetic, shaped after public instruction/chat dataset patterns, and documented in examples/public_dogfood_sources.md.

python -m redline suite examples/public_dogfood_baseline.jsonl --out /tmp/redline-public-suite.json --all-cases
python -m redline diff /tmp/redline-public-suite.json examples/public_dogfood_candidate.jsonl --compact --fail-on none

For AI-assistant session dogfood, use docs/ai-session-dogfood-prompts.jsonl and normalize raw exports with scripts/normalize_ai_session_logs.py.

From a repo checkout, record the public demo:

bash scripts/demo_terminal.sh
bash scripts/demo_gif.sh .redline/launch .redline/launch/redline-demo.gif

Development

python -m pip install -e ".[dev]"
python -m pytest -q
python -m ruff check .
python -m mypy redline tests scripts examples

Before cutting a release or asking someone else to try a branch:

bash scripts/release_check.sh

Project Docs

docs/release.md: package, tag, PyPI, and MCP Registry release flow
docs/launch.md: public alpha launch plan
docs/dogfood.md: first-user dogfood protocol
docs/runners.md: runner and log adapter setup
docs/mcp.md: MCP server setup
docs/repository.md: GitHub repository controls
CONTRIBUTING.md: contributor validation
SECURITY.md: privacy and vulnerability reporting
LICENSE: MIT open source license

Website source for GitHub Pages lives in site/ and deploys from the committed static assets on main.

Project details

This version

0.1.0

May 24, 2026

Download files

Source Distribution

redline_ai-0.1.0.tar.gz (494.7 kB)

Uploaded May 24, 2026 Source

Built Distribution

redline_ai-0.1.0-py3-none-any.whl (92.2 kB)

Uploaded May 24, 2026 Python 3

File details

Details for the file redline_ai-0.1.0.tar.gz.

File metadata

Download URL: redline_ai-0.1.0.tar.gz
Upload date: May 24, 2026
Size: 494.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for redline_ai-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`761d5f8af9fb1cdec7229f3c0b137188f98cdef6f7731ee9bc62188de10152f6`
MD5	`e7321c45199d6505d2ed26e284847a67`
BLAKE2b-256	`5d42c02ce820549b74a93d86a35d4a9b1f86e9a1f91e237c5620604953b433c1`

File details

Details for the file redline_ai-0.1.0-py3-none-any.whl.

File metadata

Download URL: redline_ai-0.1.0-py3-none-any.whl
Upload date: May 24, 2026
Size: 92.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for redline_ai-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d22f986fb5c177c04ad922f30c05313008093c58c9fe7a725102ca5a92aa5659`
MD5	`45e298960785cda0a7b16fc277177525`
BLAKE2b-256	`f4eb322513f000717d37d226666f03a69baaefe38d671a5e17a442a6564def9a`
