CI-native regression testing and migration for LLMs
llmci
Catch quality drops before they merge. Migrate models without breaking things.
llmci is not an observability tool — it's a pre-merge safety gate. Define eval datasets, set quality thresholds, and let CI block bad changes to your prompts, models, or pipelines.
Installation
pip install llmci
Requires Python 3.10+.
Quick Start
1. Initialize
llmci init
This creates a llmci.yaml config and a starter eval dataset. You'll be asked:
- Target mode: command (run any script) or direct (call an LLM API)
- Task type: classification, open-ended, or agent
- Eval name: what to call this eval
2. Define your eval dataset
Edit the generated evals/<name>.jsonl. Each line is a JSON object:
{"input": "My printer won't connect to wifi", "expected": "hardware"}
{"input": "I need a refund for order #882", "expected": "billing"}
Or add examples interactively:
llmci dataset add --name my-eval
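A hand-edited JSONL file is easy to break, so it can help to check it before committing. The sketch below is not part of llmci; `load_examples` is a hypothetical helper that only checks for the two keys shown in the example lines above.

```python
import json


def load_examples(lines):
    """Parse JSONL lines into dicts and check for the keys used above."""
    examples = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = {"input", "expected"} - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        examples.append(record)
    return examples


sample = [
    '{"input": "My printer won\'t connect to wifi", "expected": "hardware"}',
    '{"input": "I need a refund for order #882", "expected": "billing"}',
]
examples = load_examples(sample)
print(len(examples), "examples,", len({e["expected"] for e in examples}), "labels")
```

For a real dataset, pass an open file handle instead of `sample`, since iterating a file yields its lines.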
3. Run
llmci run
Output:
## llmci Eval Report
| Eval | Metric | Score | Threshold | Status |
|------|--------|-------|-----------|--------|
| ticket-classification | accuracy | 0.950 | ≥ 0.9 | ✅ |
| ticket-classification | f1_macro | 0.940 | ≥ 0.85 | ✅ |
Exit code 0 = all thresholds pass. Exit code 1 = regression detected.
Configuration
llmci.yaml defines your target, evals, and settings:
version: 1
target:
  command: "python3 run_prompt.py --input {input_file} --output {output_file}"
evals:
  - name: ticket-classification
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.90
        mode: absolute
      - name: f1_macro
        threshold: 0.85
        mode: absolute
settings:
  parallelism: 5
  timeout_per_call: 30
  retries: 1
Use --config when your eval config has a different name or lives in a service directory:
llmci run --config llmci-prompt-level.yaml
For monorepos, discover configs and run them all:
llmci discover
llmci run --all
llmci run --all --root services/ticket-classifier
llmci run --all --include "services/**" --exclude "services/summarizer/llmci.yaml"
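To reason about which configs an include/exclude pair selects, here is a rough model of the filtering in Python. It is an illustration only: it uses the standard library's `fnmatch`, and llmci's actual glob semantics may differ. `select_configs` is a hypothetical name.

```python
from fnmatch import fnmatch


def select_configs(paths, include=("**",), exclude=()):
    """Keep paths that match any include pattern and no exclude pattern."""
    selected = []
    for path in paths:
        if not any(fnmatch(path, pattern) for pattern in include):
            continue
        if any(fnmatch(path, pattern) for pattern in exclude):
            continue
        selected.append(path)
    return selected


found = [
    "services/ticket-classifier/llmci.yaml",
    "services/summarizer/llmci.yaml",
    "tools/llmci.yaml",
]
print(select_configs(
    found,
    include=["services/**"],
    exclude=["services/summarizer/llmci.yaml"],
))
```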
Target Modes
Command mode — wrap any script, any language:
target:
  command: "python3 my_pipeline.py --input {input_file} --output {output_file}"
Your script reads a JSON input file and writes a JSON output file with an "output" key.
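A minimal sketch of such a script follows. Only the "output" key is documented above. The "input" key read from the input file is an assumption that mirrors the dataset format, and `classify` is a placeholder for your real prompt or pipeline call.

```python
import argparse
import json
import sys


def classify(text: str) -> str:
    """Placeholder for a real model call: a keyword rule, for illustration."""
    return "billing" if "refund" in text.lower() else "hardware"


def run(input_file: str, output_file: str) -> None:
    with open(input_file, encoding="utf-8") as f:
        payload = json.load(f)
    # Assumption: the example text lives under an "input" key.
    result = {"output": classify(payload["input"])}
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(result, f)


def main(argv) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    run(args.input, args.output)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Saved as my_pipeline.py, this matches the `--input {input_file} --output {output_file}` command template shown above.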
Direct API mode — call an LLM provider directly:
target:
  direct:
    provider: openai
    model: gpt-4o-mini
    prompt_file: prompt.txt
Uses litellm under the hood, so any provider works (OpenAI, Anthropic, Azure, etc.). Set credentials via environment variables.
For internal proxies or custom gateways, add base_url:
target:
  direct:
    provider: openai
    model: gpt-4o
    base_url: https://llm-proxy.internal.company.com/v1
    prompt_file: prompt.txt
Judges
| Type | Use case | Config |
|---|---|---|
| exact_match | Classification, deterministic outputs | `judge: exact_match` |
| llm | Open-ended generation, summarization | `judge: {type: llm, model: gpt-4o, rubric: [...]}` |
| custom | Domain-specific logic (JSON validation, etc.) | `judge: {type: custom, module: ./judge.py, function: evaluate}` |
| composite | Agent evaluation with multiple criteria | `judge: {type: composite, criteria: [...]}` |
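As an illustration of the custom row, a judge.py along these lines could score JSON outputs. The function name `evaluate` comes from the config above. The argument list, the return type (a float in [0, 1]), and `REQUIRED_KEYS` are assumptions made for this sketch, so check the llmci docs or examples/04-custom-judge for the real contract.

```python
import json

# Hypothetical schema: the keys a valid answer is expected to contain.
REQUIRED_KEYS = ("category", "priority")


def evaluate(expected, actual):
    """Score a model output: 0.0 if it is not a JSON object,
    otherwise the fraction of required keys present.

    `expected` is unused here; it is kept so the assumed signature
    carries both the reference and the model output.
    """
    try:
        parsed = json.loads(actual)
    except (TypeError, ValueError):
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    present = sum(key in parsed for key in REQUIRED_KEYS)
    return present / len(REQUIRED_KEYS)
```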
Metrics
Score-based:
- accuracy: fraction of exact matches (score = 1.0)
- pass_rate: fraction of examples scoring >= 0.5
- mean_score: average judge score
- median_score: median judge score (robust to outliers)
- min_score / max_score: worst and best scores in the dataset
- error_rate: fraction of examples that errored
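The score-based definitions above reduce to a few lines over the per-example judge scores. This is a sketch of the arithmetic, not llmci's code, and `score_metrics` is a hypothetical name. error_rate is left out because it depends on how errors are recorded.

```python
from statistics import mean, median


def score_metrics(scores):
    """Aggregate per-example judge scores, each assumed to lie in [0, 1]."""
    n = len(scores)
    return {
        "accuracy": sum(s == 1.0 for s in scores) / n,
        "pass_rate": sum(s >= 0.5 for s in scores) / n,
        "mean_score": mean(scores),
        "median_score": median(scores),
        "min_score": min(scores),
        "max_score": max(scores),
    }


metrics = score_metrics([1.0, 1.0, 0.5, 0.0])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```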
Classification:
- f1_macro, f1_micro, f1_weighted: F1 score variants
- precision_macro, precision_micro, precision_weighted: precision variants
- recall_macro, recall_micro, recall_weighted: recall variants
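The three averaging variants differ only in how per-label scores are combined. The sketch below shows this for F1 using the standard definitions. It is not llmci's implementation, and `f1_scores` is a hypothetical name.

```python
from collections import Counter


def f1_scores(expected, predicted):
    """Compute macro, micro, and weighted F1 for single-label classification."""
    labels = sorted(set(expected) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for e, p in zip(expected, predicted):
        if e == p:
            tp[e] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[e] += 1  # expected label gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    per_label = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    support = Counter(expected)
    return {
        # macro: unweighted mean over labels
        "f1_macro": sum(per_label.values()) / len(labels),
        # micro: pool the counts, then compute one F1
        "f1_micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # weighted: mean over labels, weighted by how often each is expected
        "f1_weighted": sum(per_label[l] * support[l] for l in labels) / len(expected),
    }


result = f1_scores(
    ["billing", "billing", "hardware", "hardware"],
    ["billing", "hardware", "hardware", "hardware"],
)
print({name: round(value, 3) for name, value in result.items()})
```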
Similarity:
- cosine_similarity: token-overlap cosine similarity between expected and actual
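One plausible reading of "token-overlap cosine similarity" is the cosine between bag-of-words count vectors. The sketch below assumes lowercase whitespace tokenization, which may not match what llmci does.

```python
import math
from collections import Counter


def cosine_similarity(expected: str, actual: str) -> float:
    """Cosine between token-count vectors of two strings."""
    a = Counter(expected.lower().split())
    b = Counter(actual.lower().split())
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm_a * norm_b)


print(round(cosine_similarity("restart the router", "please restart the router"), 3))
```

Under this definition, word order is ignored, so a paraphrase that reuses the same words scores high even if the meaning changes.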
Latency:
- latency_mean, latency_p50, latency_p90, latency_p99: response time mean and percentiles (ms)
Each metric supports two threshold modes:
- absolute: score must be >= threshold (for latency metrics, must be <= threshold)
- max_regression: drop from baseline must be <= threshold (e.g., 0.05 = max 5% drop)
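The two modes can be read as the following check. This is a sketch under stated assumptions: `passes` is a hypothetical function, the drop is taken as an absolute difference (baseline minus score) rather than a relative one, and max_regression is shown for quality scores only, since the text above does not say how it applies to latency.

```python
def passes(metric, mode, threshold, score, baseline=None):
    """Return True if a metric value satisfies its threshold."""
    lower_is_better = metric.startswith("latency_")
    if mode == "absolute":
        return score <= threshold if lower_is_better else score >= threshold
    if mode == "max_regression":
        if baseline is None:
            raise ValueError("max_regression needs a baseline score")
        # Assumption: drop is an absolute difference in score points.
        return (baseline - score) <= threshold
    raise ValueError(f"unknown mode: {mode}")


print(passes("accuracy", "absolute", 0.90, 0.95))                  # quality gate
print(passes("latency_p90", "absolute", 800, 950))                 # latency gate
print(passes("accuracy", "max_regression", 0.05, 0.89, baseline=0.95))
```

With max_regression, an improvement over the baseline gives a negative drop and always passes.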
CI Integration
GitHub Actions
Add to your workflow:
- uses: llmci-cli/llmci@main
  with:
    compare-to: origin/main
    llmci-version: 0.1.9
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Or use the CLI directly:
- run: pip install llmci
- run: llmci run --compare-to=origin/main
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
For monorepos, pass the service config explicitly:
- uses: llmci-cli/llmci@main
  with:
    config: services/api/llmci.yaml
    compare-to: origin/main
    llmci-version: 0.1.9
Or run every discovered config:
- uses: llmci-cli/llmci@main
  with:
    all: "true"
    include: "services/**"
    exclude: "services/experimental/**"
    compare-to: origin/main
    llmci-version: 0.1.9
When running in GitHub Actions, llmci automatically posts eval results as a PR comment.
For matrix CI (multiple services in parallel), set a unique slice per job so reports merge into one comment:
env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  LLMCI_REPORT_SLICE: ${{ matrix.service }}/${{ matrix.config }}
Baselines
Store baseline scores on your main branch:
llmci run --update-baseline
Then compare PRs against that baseline:
llmci run --compare-to=main
Model Migration
When switching models (e.g., GPT-4o to GPT-4.5), llmci can automatically tune your prompt to maintain quality parity:
llmci migrate \
  --from gpt-4o \
  --to gpt-4.5 \
  --eval ticket-classification \
  --optimizer-model gpt-4o
The optimizer:
- Splits your dataset into train/validation/holdout
- Iteratively suggests minimal prompt modifications
- Stops when improvement plateaus (early stopping)
- Reports the final holdout score vs. the original model
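The shape of that loop can be sketched as below. This is an illustration, not llmci's optimizer: `split_dataset` and `optimize` are hypothetical names, the 60/20/20 split and the patience value are assumptions, and `propose` and `score` stand in for the optimizer model and the validation eval.

```python
import random


def split_dataset(examples, seed=0, train=0.6, validation=0.2):
    """Shuffle deterministically and cut into train/validation/holdout."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def optimize(prompt, propose, score, patience=2, max_rounds=10):
    """Keep a candidate only if it scores higher; stop after `patience`
    consecutive rounds without improvement (early stopping)."""
    best_prompt, best_score = prompt, score(prompt)
    stale = 0
    for _ in range(max_rounds):
        candidate = propose(best_prompt)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_prompt, best_score


# Toy stand-ins: each proposal appends "!" and the score saturates at 5.
best, best_score = optimize("abc", lambda p: p + "!", lambda p: min(len(p), 5))
print(best, best_score)
```

The holdout split is never passed to `score` during the loop, which is what makes the final holdout comparison a fair estimate.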
Agent Evaluation
Test tool-using and conversational agents with composite judging:
evals:
  - name: agent-tool-use
    level: agent
    dataset: ./evals/scenarios.jsonl
    judge:
      type: composite
      criteria:
        - name: constraints
          type: constraint
          weight: 1.0
        - name: outcome
          type: outcome
          weight: 2.0
Your agent runs as a command that reads llmci input JSON and writes trace JSON. Use llmci.trace.TraceBuilder to build output, or llmci.integrations.openai_agents for the OpenAI Agents SDK — see examples/10-agent-openai-agents.
Supports:
- Single-turn and multi-turn conversations
- Constraint checking — tool call budgets, required/forbidden tools, token limits
- Outcome judging — LLM-based evaluation of final output
- Trajectory judging — LLM-based evaluation of execution path quality
- Full replay or history injection modes for multi-turn
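To make constraint checking and weighted criteria concrete, here is a small sketch. It does not use llmci's trace format: tool calls are represented as a plain list of tool names, `check_constraints` and `composite_score` are hypothetical names, and combining criteria by weighted average is an assumption about how the weights are applied.

```python
def check_constraints(tool_calls, max_calls=None, required=(), forbidden=()):
    """Return a list of human-readable violations (empty means pass)."""
    violations = []
    if max_calls is not None and len(tool_calls) > max_calls:
        violations.append(
            f"tool call budget exceeded: {len(tool_calls)} > {max_calls}"
        )
    used = set(tool_calls)
    for tool in required:
        if tool not in used:
            violations.append(f"required tool not called: {tool}")
    for tool in forbidden:
        if tool in used:
            violations.append(f"forbidden tool called: {tool}")
    return violations


def composite_score(scores, weights):
    """Weighted average of per-criterion scores."""
    total = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total


calls = ["search_orders", "issue_refund", "search_orders"]
violations = check_constraints(
    calls, max_calls=2, required=["search_orders"], forbidden=["delete_account"]
)
print(violations)
print(round(composite_score({"constraints": 1.0, "outcome": 0.5},
                            {"constraints": 1.0, "outcome": 2.0}), 3))
```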
Dataset Tools
# Initialize a new dataset
llmci dataset init --name my-eval --type classification
# Add examples interactively
llmci dataset add --name my-eval
# Analyze coverage and quality
llmci dataset check --name my-eval
# Import from CSV or JSON
llmci dataset import --name my-eval --from data.csv
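If your CSV needs reshaping before import, converting it to the JSONL format yourself is straightforward. The sketch below is independent of `llmci dataset import`. `csv_to_jsonl` is a hypothetical helper, and the column names are assumptions about your file.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text, input_col="input", expected_col="expected"):
    """Convert CSV text with a header row into JSONL eval examples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [
        json.dumps({"input": row[input_col], "expected": row[expected_col]})
        for row in reader
    ]
    return "\n".join(lines)


csv_text = (
    "text,label\n"
    "My printer won't connect to wifi,hardware\n"
    '"I need a refund, order #882",billing\n'
)
print(csv_to_jsonl(csv_text, input_col="text", expected_col="label"))
```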
Migrating from Promptfoo
llmci import-promptfoo promptfooconfig.yaml
Converts providers, test assertions, and variables into llmci's format.
Reference integration
The llmci-testbed repository is a realistic customer monorepo that dogfoods llmci against full HTTP services, RAG pipelines, agents, and migration workflows. Each service maps to a docs case study and runs in GitHub Actions with mock LLM mode (no API cost on PRs).
| Testbed path | Case study |
|---|---|
| services/ticket-classifier | FastAPI service |
| services/rag-qa | RAG pipeline |
| services/summarizer | Summarization QA |
| services/support-agent | Support agent |
| migration | Model migration |
Examples
| Example | What it demonstrates |
|---|---|
| 01-ci-regression | Ticket classifier with exact_match + F1 |
| 02-model-migration | Prompt optimization across models |
| 03-llm-as-judge | Open-ended generation with rubric judging |
| 04-custom-judge | JSON schema validation with a Python judge |
| 05-agent-single-turn | Tool-using agent with constraint checking |
| 06-agent-multi-turn | Multi-turn conversation testing |
| 07-pipeline-level | Full RAG pipeline end-to-end |
| 08-fastapi-service | Pre/post processing pipeline with dual-level testing |
| 09-summarization-qa | Multi-criteria LLM judge with reference-free evaluation |
| 10-agent-openai-agents | TraceBuilder + OpenAI Agents SDK adapter |
CLI Reference
llmci run               Run evals and report results
llmci migrate           Optimize prompts for a new model
llmci init              Generate llmci.yaml interactively
llmci dataset init      Create a new eval dataset
llmci dataset add       Add examples interactively
llmci dataset check     Analyze dataset coverage
llmci dataset import    Import from CSV/JSON
llmci import-promptfoo  Convert a Promptfoo config
Global flags: -v (verbose), --debug (full logging), --version.
See CHANGELOG.md for release history.
License
Apache 2.0