Scaffold

CI-native regression testing and migration for LLMs.

Catch quality drops before they merge. Migrate models without breaking things.

Scaffold is not an observability tool — it's a pre-merge safety gate. Define eval datasets, set quality thresholds, and let CI block bad changes to your prompts, models, or pipelines.

Installation

pip install llmci

The PyPI package is llmci; the CLI command is scaffold.

Requires Python 3.11+.

Quick Start

1. Initialize

scaffold init

This creates a scaffold.yaml config and a starter eval dataset. You'll be asked:

  • Target mode — command (run any script) or direct (call an LLM API)
  • Task type — classification, open-ended, or agent
  • Eval name — what to call this eval

2. Define your eval dataset

Edit the generated evals/<name>.jsonl. Each line is a JSON object:

{"input": "My printer won't connect to wifi", "expected": "hardware"}
{"input": "I need a refund for order #882", "expected": "billing"}

Or add examples interactively:

scaffold dataset add --name my-eval
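To sanity-check a dataset file before running it, a few lines of Python are enough. This is a minimal sketch and not part of Scaffold: load_dataset is a hypothetical helper, and it assumes every record carries the input and expected keys shown above (open-ended or agent datasets may use other fields).

```python
import json


def load_dataset(lines):
    """Parse JSONL lines into a list of examples, checking for the two keys."""
    examples = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = {"input", "expected"} - set(record)
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        examples.append(record)
    return examples


sample = [
    '{"input": "My printer won\'t connect to wifi", "expected": "hardware"}',
    '{"input": "I need a refund for order #882", "expected": "billing"}',
]
examples = load_dataset(sample)
print(len(examples), [e["expected"] for e in examples])
```

To check a real file, pass an open file handle instead of the sample list, since iterating a file yields its lines.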

3. Run

scaffold run

Output:

## Scaffold Eval Report

| Eval | Metric | Score | Threshold | Status |
|------|--------|-------|-----------|--------|
| ticket-classification | accuracy | 0.950 | ≥ 0.9 | ✅ |
| ticket-classification | f1_macro | 0.940 | ≥ 0.85 | ✅ |

Exit code 0 = all thresholds pass. Exit code 1 = regression detected.

Configuration

scaffold.yaml defines your target, evals, and settings:

version: 1

target:
  command: "python3 run_prompt.py --input {input_file} --output {output_file}"

evals:
  - name: ticket-classification
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.90
        mode: absolute
      - name: f1_macro
        threshold: 0.85
        mode: absolute

settings:
  parallelism: 5
  timeout_per_call: 30
  retries: 1

Target Modes

Command mode — wrap any script, any language:

target:
  command: "python3 my_pipeline.py --input {input_file} --output {output_file}"

Your script reads a JSON input file and writes a JSON output file with an "output" key.
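A command-mode script can be quite small. The sketch below is illustrative only: it assumes the input JSON carries the example text under an "input" key (the exact input schema is not specified here), and classify is a placeholder for your real prompt or model call. The demo at the bottom round-trips through temporary files instead of real CLI arguments.

```python
import argparse
import json
import os
import tempfile


def classify(text):
    # Placeholder logic standing in for a real prompt or model call.
    return "billing" if "refund" in text.lower() else "hardware"


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)

    with open(args.input, encoding="utf-8") as f:
        payload = json.load(f)

    # Assumption: the example text lives under the "input" key.
    result = {"output": classify(payload["input"])}

    with open(args.output, "w", encoding="utf-8") as f:
        json.dump(result, f)


# Demo: write an input file, run main(), and read back the output file.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "in.json")
    out_path = os.path.join(tmp, "out.json")
    with open(in_path, "w", encoding="utf-8") as f:
        json.dump({"input": "I need a refund for order #882"}, f)
    main(["--input", in_path, "--output", out_path])
    with open(out_path, encoding="utf-8") as f:
        written = json.load(f)
print(written)
```

In a real script, replace the demo with a plain main() call under an if __name__ == "__main__" guard so the arguments come from the command line.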

Direct API mode — call an LLM provider directly:

target:
  direct:
    provider: openai
    model: gpt-4o-mini
  prompt_file: prompt.txt

Uses litellm under the hood, so any provider that litellm supports should work (OpenAI, Anthropic, Azure, etc.). Set credentials via environment variables.

For internal proxies or custom gateways, add base_url:

target:
  direct:
    provider: openai
    model: gpt-4o
    base_url: https://llm-proxy.internal.company.com/v1
  prompt_file: prompt.txt

Judges

| Type | Use case | Config |
|------|----------|--------|
| exact_match | Classification, deterministic outputs | judge: exact_match |
| llm | Open-ended generation, summarization | judge: {type: llm, model: gpt-4o, rubric: [...]} |
| custom | Domain-specific logic (JSON validation, etc.) | judge: {type: custom, module: ./judge.py, function: evaluate} |
| composite | Agent evaluation with multiple criteria | judge: {type: composite, criteria: [...]} |
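For the custom judge, a judge.py along these lines illustrates the idea. The signature is an assumption: this README does not document what arguments Scaffold passes to evaluate or what it expects back, so the (input_text, expected, actual) parameters and the float return value are hypothetical. Here expected is taken to be a list of required JSON keys.

```python
import json


def evaluate(input_text, expected, actual):
    """Return 1.0 if `actual` is a JSON object with every key in `expected`, else 0.0.

    Assumed signature; check the Scaffold docs for the real contract.
    """
    try:
        parsed = json.loads(actual)
    except (TypeError, ValueError):
        return 0.0  # not a string, or not valid JSON
    if not isinstance(parsed, dict):
        return 0.0  # valid JSON, but not an object
    return 1.0 if set(expected) <= set(parsed) else 0.0


good = evaluate("q", ["category", "priority"], '{"category": "billing", "priority": 2}')
bad = evaluate("q", ["category"], "not json")
print(good, bad)
```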

Metrics

Score-based:

  • accuracy — fraction of exact matches (score = 1.0)
  • pass_rate — fraction of examples scoring >= 0.5
  • mean_score — average judge score
  • median_score — median judge score (robust to outliers)
  • min_score / max_score — worst and best scores in dataset
  • error_rate — fraction of examples that errored
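As a rough illustration of how these numbers relate, the sketch below computes them from a list of per-example judge scores. It is not Scaffold's implementation: score_metrics is a hypothetical helper, None marks an errored example, and excluding errored examples from every metric except error_rate is an assumption.

```python
from statistics import mean, median


def score_metrics(scores):
    """Summarize judge scores; None entries are examples that errored."""
    valid = [s for s in scores if s is not None]
    return {
        "accuracy": sum(s == 1.0 for s in valid) / len(valid),
        "pass_rate": sum(s >= 0.5 for s in valid) / len(valid),
        "mean_score": mean(valid),
        "median_score": median(valid),
        "min_score": min(valid),
        "max_score": max(valid),
        "error_rate": (len(scores) - len(valid)) / len(scores),
    }


metrics = score_metrics([1.0, 1.0, 0.5, 0.0, None])
for name, value in metrics.items():
    print(f"{name}: {value}")
```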

Classification:

  • f1_macro, f1_micro, f1_weighted — F1 score variants
  • precision_macro, precision_micro, precision_weighted — precision variants
  • recall_macro, recall_micro, recall_weighted — recall variants
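The three averaging variants differ only in how per-label scores are combined. The following sketch shows this for F1 on single-label predictions, using the standard definitions; f1_scores is a hypothetical helper, not Scaffold's code. Precision and recall variants follow the same pattern.

```python
from collections import Counter


def f1_scores(expected, predicted):
    """Macro, micro, and weighted F1 for single-label classification."""
    labels = sorted(set(expected) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for e, p in zip(expected, predicted):
        if e == p:
            tp[e] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[e] += 1  # true label gets a false negative

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    per_label = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    support = Counter(expected)  # number of true examples per label
    return {
        # macro: unweighted mean over labels
        "f1_macro": sum(per_label.values()) / len(labels),
        # micro: pool all counts, then compute a single F1
        "f1_micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # weighted: mean over labels, weighted by support
        "f1_weighted": sum(per_label[l] * support[l] for l in labels) / len(expected),
    }


result = f1_scores(
    ["billing", "billing", "hardware", "hardware"],
    ["billing", "hardware", "hardware", "hardware"],
)
print(result)
```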

Similarity:

  • cosine_similarity — token-overlap cosine similarity between expected and actual
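One way to read "token-overlap cosine similarity" is cosine similarity over bag-of-words count vectors. The sketch below takes that reading; the lowercase whitespace tokenization is an assumption, and Scaffold's actual tokenizer may differ.

```python
import math
from collections import Counter


def cosine_similarity(expected, actual):
    """Cosine similarity between token-count vectors of two strings."""
    a = Counter(expected.lower().split())
    b = Counter(actual.lower().split())
    # Only tokens present in both strings contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


similarity = cosine_similarity("reset the router", "reset the modem")
print(round(similarity, 3))
```

Two of the three tokens overlap here, so the score is 2/3.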

Latency:

  • latency_mean, latency_p50, latency_p90, latency_p99 — mean and percentile response times (ms)
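For intuition, these can be computed from raw per-call timings as sketched below. The nearest-rank percentile method is an assumption; Scaffold may interpolate differently, which matters mostly for small datasets.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [120, 135, 150, 160, 180, 210, 250, 300, 420, 900]
summary = {
    "latency_mean": mean(latencies_ms),
    "latency_p50": percentile(latencies_ms, 50),
    "latency_p90": percentile(latencies_ms, 90),
    "latency_p99": percentile(latencies_ms, 99),
}
print(summary)
```

Note how a single slow call (900 ms) pulls the mean well above the median, which is why percentile thresholds are often the more useful gate.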

Each metric supports two threshold modes:

  • absolute — score must be >= threshold (for latency metrics, must be <= threshold)
  • max_regression — drop from baseline must be <= threshold (e.g., 0.05 = max 5% drop)
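The two modes can be expressed as a small predicate. This is a sketch under stated assumptions, not Scaffold's logic: passes is a hypothetical helper, the drop is treated as an absolute difference from the baseline rather than a relative one, and for latency metrics a "regression" is taken to mean an increase.

```python
def passes(metric, score, threshold, mode="absolute", baseline=None):
    """Return True if `score` satisfies the threshold under the given mode."""
    lower_is_better = metric.startswith("latency_")
    if mode == "absolute":
        return score <= threshold if lower_is_better else score >= threshold
    if mode == "max_regression":
        if baseline is None:
            raise ValueError("max_regression needs a baseline score")
        # Assumption: regression is measured as an absolute difference.
        drop = (score - baseline) if lower_is_better else (baseline - score)
        return drop <= threshold
    raise ValueError(f"unknown mode: {mode}")


print(passes("accuracy", 0.95, 0.90))
print(passes("latency_p90", 850, 800))
print(passes("accuracy", 0.92, 0.05, mode="max_regression", baseline=0.95))
print(passes("accuracy", 0.88, 0.05, mode="max_regression", baseline=0.95))
```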

CI Integration

GitHub Actions

Add to your workflow:

- uses: alexminnaar/scaffold@main
  with:
    compare-to: origin/main
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Or use the CLI directly:

- run: pip install llmci
- run: scaffold run --compare-to=origin/main
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

When running in GitHub Actions, Scaffold automatically posts eval results as a PR comment.

For matrix CI (multiple services in parallel), set a unique slice per job so reports merge into one comment:

env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  SCAFFOLD_REPORT_SLICE: ${{ matrix.service }}/${{ matrix.config }}

Baselines

Store baseline scores on your main branch:

scaffold run --update-baseline

Then compare PRs against that baseline:

scaffold run --compare-to=main

Model Migration

When switching models (e.g., GPT-4o to GPT-4.5), Scaffold can automatically tune your prompt to maintain quality parity:

scaffold migrate \
  --from gpt-4o \
  --to gpt-4.5 \
  --eval ticket-classification \
  --optimizer-model gpt-4o

The optimizer:

  1. Splits your dataset into train/validation/holdout
  2. Iteratively suggests minimal prompt modifications
  3. Stops when improvement plateaus (early stopping)
  4. Reports the final holdout score vs. the original model

Agent Evaluation

Test tool-using and conversational agents with composite judging:

evals:
  - name: agent-tool-use
    level: agent
    dataset: ./evals/scenarios.jsonl
    judge:
      type: composite
      criteria:
        - name: constraints
          type: constraint
          weight: 1.0
        - name: outcome
          type: outcome
          weight: 2.0
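One plausible way the weights combine is a weighted average of per-criterion scores, sketched below. This README does not state the aggregation rule, so the formula and the composite_score helper are assumptions.

```python
def composite_score(criteria):
    """Weighted average of (score, weight) pairs. Assumed aggregation rule."""
    total_weight = sum(weight for _, weight in criteria)
    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    return sum(score * weight for score, weight in criteria) / total_weight


# constraints scored 1.0 at weight 1.0, outcome scored 0.25 at weight 2.0
combined = composite_score([(1.0, 1.0), (0.25, 2.0)])
print(combined)
```

With these weights the outcome criterion counts twice as much as the constraints criterion.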

Your agent runs as a command that reads Scaffold input JSON and writes trace JSON. Use scaffold.trace.TraceBuilder to build output, or scaffold.integrations.openai_agents for the OpenAI Agents SDK — see examples/10-agent-openai-agents.

Supports:

  • Single-turn and multi-turn conversations
  • Constraint checking — tool call budgets, required/forbidden tools, token limits
  • Outcome judging — LLM-based evaluation of final output
  • Trajectory judging — LLM-based evaluation of execution path quality
  • Full replay or history injection modes for multi-turn
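Constraint checking amounts to simple assertions over a trace. The sketch below is hypothetical: it models a trace as a list of tool names plus a token count, which is not necessarily Scaffold's trace format, and check_constraints and its parameters are illustrative names.

```python
def check_constraints(
    tool_calls,
    total_tokens,
    *,
    max_tool_calls=None,
    required_tools=(),
    forbidden_tools=(),
    max_tokens=None,
):
    """Return a list of human-readable violations; an empty list means pass."""
    violations = []
    if max_tool_calls is not None and len(tool_calls) > max_tool_calls:
        violations.append(
            f"tool call budget exceeded: {len(tool_calls)} > {max_tool_calls}"
        )
    used = set(tool_calls)
    for tool in required_tools:
        if tool not in used:
            violations.append(f"required tool not called: {tool}")
    for tool in forbidden_tools:
        if tool in used:
            violations.append(f"forbidden tool called: {tool}")
    if max_tokens is not None and total_tokens > max_tokens:
        violations.append(f"token limit exceeded: {total_tokens} > {max_tokens}")
    return violations


violations = check_constraints(
    ["lookup_order", "issue_refund", "lookup_order"],
    total_tokens=1800,
    max_tool_calls=2,
    required_tools=("lookup_order",),
    forbidden_tools=("delete_account",),
    max_tokens=2000,
)
print(violations)
```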

Dataset Tools

# Initialize a new dataset
scaffold dataset init --name my-eval --type classification

# Add examples interactively
scaffold dataset add --name my-eval

# Analyze coverage and quality
scaffold dataset check --name my-eval

# Import from CSV or JSON
scaffold dataset import --name my-eval --from data.csv

Migrating from Promptfoo

scaffold import-promptfoo promptfooconfig.yaml

Converts providers, test assertions, and variables into Scaffold's format.

Reference integration

The scaffold-testbed repository is a realistic customer monorepo that dogfoods llmci against full HTTP services, RAG pipelines, agents, and migration workflows. Each service maps to a docs case study and runs in GitHub Actions with mock LLM mode (no API cost on PRs).

| Testbed path | Case study |
|--------------|------------|
| services/ticket-classifier | FastAPI service |
| services/rag-qa | RAG pipeline |
| services/summarizer | Summarization QA |
| services/support-agent | Support agent |
| migration | Model migration |

Examples

| Example | What it demonstrates |
|---------|----------------------|
| 01-ci-regression | Ticket classifier with exact_match + F1 |
| 02-model-migration | Prompt optimization across models |
| 03-llm-as-judge | Open-ended generation with rubric judging |
| 04-custom-judge | JSON schema validation with a Python judge |
| 05-agent-single-turn | Tool-using agent with constraint checking |
| 06-agent-multi-turn | Multi-turn conversation testing |
| 07-pipeline-level | Full RAG pipeline end-to-end |
| 08-fastapi-service | Pre/post processing pipeline with dual-level testing |
| 09-summarization-qa | Multi-criteria LLM judge with reference-free evaluation |
| 10-agent-openai-agents | TraceBuilder + OpenAI Agents SDK adapter |

CLI Reference

scaffold run              Run evals and report results
scaffold migrate          Optimize prompts for a new model
scaffold init             Generate scaffold.yaml interactively
scaffold dataset init     Create a new eval dataset
scaffold dataset add      Add examples interactively
scaffold dataset check    Analyze dataset coverage
scaffold dataset import   Import from CSV/JSON
scaffold import-promptfoo Convert a Promptfoo config

Global flags: -v (verbose), --debug (full logging), --version.

See CHANGELOG.md for release history.

License

Apache 2.0
