CI-native regression testing and migration for LLMs

llmci

CI-native regression testing and migration for LLMs.

Catch quality drops before they merge. Migrate models without breaking things.

llmci is not an observability tool — it's a pre-merge safety gate. Define eval datasets, set quality thresholds, and let CI block bad changes to your prompts, models, or pipelines.

Installation

pip install llmci

Requires Python 3.10+.

Quick Start

1. Initialize

llmci init

This creates an llmci.yaml config and a starter eval dataset. You'll be asked for:

Target mode — command (run any script) or direct (call an LLM API)
Task type — classification, open-ended, or agent
Eval name — what to call this eval

2. Define your eval dataset

Edit the generated evals/<name>.jsonl. Each line is a JSON object:

{"input": "My printer won't connect to wifi", "expected": "hardware"}
{"input": "I need a refund for order #882", "expected": "billing"}

Or add examples interactively:

llmci dataset add --name my-eval
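
Before committing a hand-edited dataset, it can help to sanity-check it. The sketch below is not part of llmci; `load_examples` is a hypothetical helper that parses JSONL lines and checks for the two keys shown in the example above (`input` and `expected`). Other task types may use different keys.

```python
import json

def load_examples(lines):
    """Parse JSONL lines into dicts and check the keys used in the example above."""
    examples = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
        missing = {"input", "expected"} - set(record)
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        examples.append(record)
    return examples

sample = [
    '{"input": "My printer won\'t connect to wifi", "expected": "hardware"}',
    '{"input": "I need a refund for order #882", "expected": "billing"}',
]
examples = load_examples(sample)
```

To check a real file, pass an open file handle instead of `sample`, since iterating a file yields its lines.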

3. Run

llmci run

Output:

## llmci Eval Report

| Eval | Metric | Score | Threshold | Status |
|------|--------|-------|-----------|--------|
| ticket-classification | accuracy | 0.950 | ≥ 0.9 | ✅ |
| ticket-classification | f1_macro | 0.940 | ≥ 0.85 | ✅ |

Exit code 0 means every threshold passed. Exit code 1 means at least one metric missed its threshold (a regression), which fails the CI job.

Configuration

llmci.yaml defines your target, evals, and settings:

version: 1

target:
  command: "python3 run_prompt.py --input {input_file} --output {output_file}"

evals:
  - name: ticket-classification
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.90
        mode: absolute
      - name: f1_macro
        threshold: 0.85
        mode: absolute

settings:
  parallelism: 5
  timeout_per_call: 30
  retries: 1

Use --config when your eval config has a different name or lives in a service directory:

llmci run --config llmci-prompt-level.yaml

For monorepos, discover configs and run them all:

llmci discover
llmci run --all
llmci run --all --root services/ticket-classifier
llmci run --all --include "services/**" --exclude "services/summarizer/llmci.yaml"
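
To reason about which configs a given `--include`/`--exclude` pair might select, here is a rough sketch of glob-based discovery. It is an illustration only, not llmci's implementation: `discover_configs` is a hypothetical function, it only looks for files named `llmci.yaml`, and it uses `fnmatch` semantics (where `*` also matches `/`), which may differ from how llmci matches patterns.

```python
from fnmatch import fnmatch
from pathlib import Path

def discover_configs(root, include=("*",), exclude=()):
    """Return llmci.yaml paths under root (relative, POSIX-style) after filtering."""
    root = Path(root)
    found = []
    for path in sorted(root.rglob("llmci.yaml")):
        rel = path.relative_to(root).as_posix()
        if not any(fnmatch(rel, pattern) for pattern in include):
            continue  # not selected by any include pattern
        if any(fnmatch(rel, pattern) for pattern in exclude):
            continue  # explicitly excluded
        found.append(rel)
    return found
```

With this sketch, `discover_configs(".", include=["services/**"], exclude=["services/summarizer/llmci.yaml"])` would keep every `llmci.yaml` under `services/` except the summarizer's.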

Target Modes

Command mode — wrap any script, any language:

target:
  command: "python3 my_pipeline.py --input {input_file} --output {output_file}"

Your script reads a JSON input file and writes a JSON output file with an "output" key.
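
A minimal sketch of a script that satisfies this contract might look like the following. The `--input` and `--output` flags match the command template above, and the `"output"` key is the one llmci expects. Everything else is an assumption for illustration: the input file's schema is not documented here, so the `"input"` key is a guess, and `classify` is a placeholder for your real prompt or pipeline call.

```python
import argparse
import json

def classify(text: str) -> str:
    """Placeholder logic; replace with your real prompt or pipeline call."""
    return "billing" if "refund" in text.lower() else "hardware"

def run(input_file: str, output_file: str) -> None:
    with open(input_file, encoding="utf-8") as f:
        payload = json.load(f)
    # Assumption: the example text is stored under an "input" key.
    result = classify(payload["input"])
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump({"output": result}, f)  # llmci reads the "output" key

def main(argv=None) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    run(args.input, args.output)
```

To use it as a script, add the usual `if __name__ == "__main__": main()` guard at the bottom of the file.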

Direct API mode — call an LLM provider directly:

target:
  direct:
    provider: openai
    model: gpt-4o-mini
  prompt_file: prompt.txt

Uses litellm under the hood, so any provider that litellm supports works (OpenAI, Anthropic, Azure, etc.). Set credentials via environment variables, for example OPENAI_API_KEY.

For internal proxies or custom gateways, add base_url:

target:
  direct:
    provider: openai
    model: gpt-4o
    base_url: https://llm-proxy.internal.company.com/v1
  prompt_file: prompt.txt

Judges

| Type | Use case | Config |
|------|----------|--------|
| `exact_match` | Classification, deterministic outputs | `judge: exact_match` |
| `llm` | Open-ended generation, summarization | `judge: {type: llm, model: gpt-4o, rubric: [...]}` |
| `custom` | Domain-specific logic (JSON validation, etc.) | `judge: {type: custom, module: ./judge.py, function: evaluate}` |
| `composite` | Agent evaluation with multiple criteria | `judge: {type: composite, criteria: [...]}` |
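
As an illustration of the `custom` type, a `judge.py` could look roughly like this. The function name `evaluate` comes from the config above, but its signature is an assumption: this page does not document what arguments llmci passes or what it expects back. The sketch assumes three arguments and a float score in [0, 1]. `REQUIRED_KEYS` is a made-up schema.

```python
import json

# Hypothetical schema, for illustration only.
REQUIRED_KEYS = {"category", "priority"}

def evaluate(example_input: str, expected: str, actual: str) -> float:
    """Return 1.0 if `actual` is a JSON object containing the required keys, else 0.0.

    Assumed signature; check llmci's documentation for the real one.
    """
    try:
        parsed = json.loads(actual)
    except (json.JSONDecodeError, TypeError):
        return 0.0  # not valid JSON at all
    if not isinstance(parsed, dict):
        return 0.0  # valid JSON, but not an object
    return 1.0 if REQUIRED_KEYS.issubset(parsed) else 0.0
```

Returning a float rather than a bool keeps the judge compatible with score-based metrics such as mean_score.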

Metrics

Score-based:

accuracy — fraction of exact matches (score = 1.0)
pass_rate — fraction of examples scoring >= 0.5
mean_score — average judge score
median_score — median judge score (robust to outliers)
min_score / max_score — worst and best scores in dataset
error_rate — fraction of examples that errored
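
The definitions above can be sketched in a few lines. This follows the descriptions in the list, not llmci's source; `score_metrics` is a hypothetical name, and error_rate is left out because it needs per-example error flags, not scores.

```python
from statistics import mean, median

def score_metrics(scores):
    """Summarize per-example judge scores, each assumed to lie in [0, 1]."""
    n = len(scores)  # an empty list raises ZeroDivisionError below
    return {
        "accuracy": sum(s == 1.0 for s in scores) / n,   # exact matches only
        "pass_rate": sum(s >= 0.5 for s in scores) / n,  # partial credit counts
        "mean_score": mean(scores),
        "median_score": median(scores),
        "min_score": min(scores),
        "max_score": max(scores),
    }

metrics = score_metrics([1.0, 1.0, 0.6, 0.0])
```

For these four scores, accuracy is 0.5 while pass_rate is 0.75, which shows how the two diverge once a judge awards partial credit.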

Classification:

f1_macro, f1_micro, f1_weighted — F1 score variants
precision_macro, precision_micro, precision_weighted — precision variants
recall_macro, recall_micro, recall_weighted — recall variants
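
For readers unfamiliar with the macro/micro distinction, here is a small pure-Python sketch using the standard F1 definitions. It is not llmci's code: `f1_scores` is a hypothetical helper, the label set is taken as the union of expected and predicted labels (an assumption), and the weighted variant is omitted.

```python
def f1_scores(expected, predicted):
    """Compute macro- and micro-averaged F1 for single-label classification."""
    pairs = list(zip(expected, predicted))
    labels = sorted(set(expected) | set(predicted))
    per_label = []
    tp_total = fp_total = fn_total = 0
    for label in labels:
        tp = sum(e == label and p == label for e, p in pairs)
        fp = sum(e != label and p == label for e, p in pairs)
        fn = sum(e == label and p != label for e, p in pairs)
        denom = 2 * tp + fp + fn
        per_label.append(2 * tp / denom if denom else 0.0)
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn
    micro_denom = 2 * tp_total + fp_total + fn_total
    return {
        # macro: average the per-label F1 values, so rare labels weigh equally
        "f1_macro": sum(per_label) / len(per_label) if per_label else 0.0,
        # micro: pool the counts first, so frequent labels dominate
        "f1_micro": 2 * tp_total / micro_denom if micro_denom else 0.0,
    }

result = f1_scores(
    ["hardware", "billing", "billing", "hardware"],
    ["hardware", "billing", "hardware", "hardware"],
)
```

In this example one of the two billing tickets is misrouted, which lowers f1_macro more than f1_micro because the billing label's own F1 takes the full hit.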

Similarity:

cosine_similarity — token-overlap cosine similarity between expected and actual
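
A token-overlap cosine similarity can be sketched as below. The tokenization (lowercasing and splitting on whitespace) is an assumption; llmci may tokenize differently, so scores from this sketch are not guaranteed to match its output.

```python
import math
from collections import Counter

def cosine_similarity(expected: str, actual: str) -> float:
    """Cosine of the angle between the two texts' token-count vectors."""
    a = Counter(expected.lower().split())
    b = Counter(actual.lower().split())
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0  # an empty text has no direction
```

Because it only counts shared tokens, this metric rewards wording overlap, not meaning: a correct paraphrase can score low.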

Latency:

latency_mean, latency_p50, latency_p90, latency_p99 (mean and 50th/90th/99th-percentile response time, in ms)
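
As a rough illustration of how these summaries relate, the sketch below computes them from a list of per-call latencies. It uses the nearest-rank percentile method, which is an assumption; llmci may interpolate instead, which gives slightly different values on small samples. `percentile` and `latency_metrics` are hypothetical names.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_metrics(latencies_ms):
    return {
        "latency_mean": mean(latencies_ms),
        "latency_p50": percentile(latencies_ms, 50),
        "latency_p90": percentile(latencies_ms, 90),
        "latency_p99": percentile(latencies_ms, 99),
    }

stats = latency_metrics([100, 120, 130, 150, 200, 220, 250, 300, 400, 900])
```

A single slow call (900 ms here) pulls the mean well above the median, which is why tail percentiles are usually the better gate.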

Each metric supports two threshold modes:

absolute — score must be >= threshold (for latency metrics, must be <= threshold)
max_regression — drop from baseline must be <= threshold (e.g., 0.05 = max 5% drop)
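
The two modes can be sketched as a single check. This is an interpretation of the rules above, not llmci's code. Two points are assumptions: the drop is measured as an absolute difference in score (so 0.05 means five points on a 0 to 1 scale, not 5% of the baseline), and for latency metrics a regression is taken to mean an increase over the baseline.

```python
def passes(metric, score, threshold, mode="absolute", baseline=None):
    """Return True if `score` satisfies the threshold under the given mode."""
    lower_is_better = metric.startswith("latency_")
    if mode == "absolute":
        return score <= threshold if lower_is_better else score >= threshold
    if mode == "max_regression":
        if baseline is None:
            raise ValueError("max_regression needs a baseline score")
        # Assumption: "regression" is how far the score moved in the bad direction.
        worsening = score - baseline if lower_is_better else baseline - score
        return worsening <= threshold
    raise ValueError(f"unknown mode: {mode}")
```

For example, `passes("accuracy", 0.92, 0.05, mode="max_regression", baseline=0.95)` holds because the drop is about 0.03, while a score of 0.88 against the same baseline would fail.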

CI Integration

GitHub Actions

Add to your workflow:

- uses: llmci-cli/llmci@main
  with:
    compare-to: origin/main
    llmci-version: 0.1.9
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Or use the CLI directly:

- run: pip install llmci
- run: llmci run --compare-to=origin/main
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

For monorepos, pass the service config explicitly:

- uses: llmci-cli/llmci@main
  with:
    config: services/api/llmci.yaml
    compare-to: origin/main
    llmci-version: 0.1.9

Or run every discovered config:

- uses: llmci-cli/llmci@main
  with:
    all: "true"
    include: "services/**"
    exclude: "services/experimental/**"
    compare-to: origin/main
    llmci-version: 0.1.9

When running in GitHub Actions, llmci automatically posts eval results as a PR comment.

For matrix CI (multiple services in parallel), set a unique slice per job so reports merge into one comment:

env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  LLMCI_REPORT_SLICE: ${{ matrix.service }}/${{ matrix.config }}

Baselines

Store baseline scores on your main branch:

llmci run --update-baseline

Then compare PRs against that baseline:

llmci run --compare-to=main

Model Migration

When switching models (e.g., GPT-4o to GPT-4.5), llmci can automatically tune your prompt to maintain quality parity:

llmci migrate \
  --from gpt-4o \
  --to gpt-4.5 \
  --eval ticket-classification \
  --optimizer-model gpt-4o

The optimizer:

Splits your dataset into train/validation/holdout
Iteratively suggests minimal prompt modifications
Stops when improvement plateaus (early stopping)
Reports the final holdout score vs. the original model
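
The loop described above can be sketched generically as follows. This is not llmci's optimizer: the 60/20/20 split, the `patience` and `min_gain` parameters, and the `propose`/`score` callables are all assumptions chosen for illustration. In practice `propose` would ask the optimizer model for a small prompt edit and `score` would run the eval on the validation split.

```python
import random

def split_dataset(examples, train=0.6, validation=0.2, seed=0):
    """Shuffle deterministically and split into train/validation/holdout."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # holdout: never seen during optimization
    )

def optimize(initial_prompt, propose, score, max_rounds=10, patience=2, min_gain=0.01):
    """Keep a candidate only if it beats the best score by min_gain; stop after
    `patience` consecutive rounds without improvement (early stopping)."""
    best_prompt, best_score = initial_prompt, score(initial_prompt)
    stale = 0
    for _ in range(max_rounds):
        candidate = propose(best_prompt)
        candidate_score = score(candidate)
        if candidate_score > best_score + min_gain:
            best_prompt, best_score = candidate, candidate_score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_prompt, best_score
```

Scoring the final prompt once on the holdout split, which the loop never touched, is what makes the reported comparison against the original model trustworthy.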

Agent Evaluation

Test tool-using and conversational agents with composite judging:

evals:
  - name: agent-tool-use
    level: agent
    dataset: ./evals/scenarios.jsonl
    judge:
      type: composite
      criteria:
        - name: constraints
          type: constraint
          weight: 1.0
        - name: outcome
          type: outcome
          weight: 2.0
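
One plausible reading of the weights above is a weighted average of the per-criterion scores. The sketch below shows that reading; how llmci actually combines criteria is not stated here, so treat `composite_score` as a hypothetical illustration.

```python
def composite_score(criterion_scores, weights):
    """Weighted average of per-criterion scores (assumed combination rule)."""
    total_weight = sum(weights.values())
    return sum(criterion_scores[name] * w for name, w in weights.items()) / total_weight

weights = {"constraints": 1.0, "outcome": 2.0}
combined = composite_score({"constraints": 1.0, "outcome": 0.5}, weights)
```

Under this reading, an agent that respects every constraint but only half-achieves the outcome scores about 0.67, because outcome counts twice as much.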

Your agent runs as a command that reads llmci input JSON and writes trace JSON. Use llmci.trace.TraceBuilder to build output, or llmci.integrations.openai_agents for the OpenAI Agents SDK — see examples/10-agent-openai-agents.

Supports:

Single-turn and multi-turn conversations
Constraint checking — tool call budgets, required/forbidden tools, token limits
Outcome judging — LLM-based evaluation of final output
Trajectory judging — LLM-based evaluation of execution path quality
Full replay or history injection modes for multi-turn
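
To make constraint checking concrete, here is a sketch that checks a tool call budget plus required and forbidden tools. The trace is modeled as a plain list of tool names, which is a simplification; llmci's real trace format and constraint config are not shown here, and `check_constraints` and the tool names are hypothetical.

```python
def check_constraints(tool_calls, max_calls=None, required=(), forbidden=()):
    """Return a list of human-readable violations; an empty list means the run passed."""
    violations = []
    if max_calls is not None and len(tool_calls) > max_calls:
        violations.append(f"tool call budget exceeded: {len(tool_calls)} > {max_calls}")
    for tool in required:
        if tool not in tool_calls:
            violations.append(f"required tool not called: {tool}")
    for tool in forbidden:
        if tool in tool_calls:
            violations.append(f"forbidden tool called: {tool}")
    return violations

violations = check_constraints(
    ["lookup_order", "issue_refund", "issue_refund"],
    max_calls=2,
    required=["lookup_order"],
    forbidden=["delete_account"],
)
```

Constraint checks like these are deterministic, so they complement the LLM-based outcome and trajectory judges, whose scores can vary between runs.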

Dataset Tools

# Initialize a new dataset
llmci dataset init --name my-eval --type classification

# Add examples interactively
llmci dataset add --name my-eval

# Analyze coverage and quality
llmci dataset check --name my-eval

# Import from CSV or JSON
llmci dataset import --name my-eval --from data.csv
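
If your source data needs reshaping before import, a conversion to the JSONL format shown in Quick Start takes only a few lines. This sketch is independent of llmci; `csv_to_jsonl` is a hypothetical helper, and the default column names `input` and `expected` are assumptions about your CSV.

```python
import csv
import io
import json

def csv_to_jsonl(csv_text, input_col="input", expected_col="expected"):
    """Convert CSV text with a header row into one JSON object per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [
        json.dumps({"input": row[input_col], "expected": row[expected_col]})
        for row in reader
    ]
    return "\n".join(lines)

jsonl = csv_to_jsonl(
    "input,expected\n"
    "My printer won't connect to wifi,hardware\n"
    '"Refund, please",billing\n'
)
```

Using the `csv` module instead of splitting on commas keeps quoted fields such as `"Refund, please"` intact.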

Migrating from Promptfoo

llmci import-promptfoo promptfooconfig.yaml

Converts providers, test assertions, and variables into llmci's format.

Reference integration

The llmci-testbed repository is a realistic customer monorepo that dogfoods llmci against full HTTP services, RAG pipelines, agents, and migration workflows. Each service maps to a docs case study and runs in GitHub Actions with mock LLM mode (no API cost on PRs).

| Testbed path | Case study |
|--------------|------------|
| `services/ticket-classifier` | FastAPI service |
| `services/rag-qa` | RAG pipeline |
| `services/summarizer` | Summarization QA |
| `services/support-agent` | Support agent |
| `migration` | Model migration |

Examples

| Example | What it demonstrates |
|---------|----------------------|
| `01-ci-regression` | Ticket classifier with exact_match + F1 |
| `02-model-migration` | Prompt optimization across models |
| `03-llm-as-judge` | Open-ended generation with rubric judging |
| `04-custom-judge` | JSON schema validation with a Python judge |
| `05-agent-single-turn` | Tool-using agent with constraint checking |
| `06-agent-multi-turn` | Multi-turn conversation testing |
| `07-pipeline-level` | Full RAG pipeline end-to-end |
| `08-fastapi-service` | Pre/post processing pipeline with dual-level testing |
| `09-summarization-qa` | Multi-criteria LLM judge with reference-free evaluation |
| `10-agent-openai-agents` | TraceBuilder + OpenAI Agents SDK adapter |

CLI Reference

llmci run              Run evals and report results
llmci discover         Find eval configs in a monorepo
llmci migrate          Optimize prompts for a new model
llmci init             Generate llmci.yaml interactively
llmci dataset init     Create a new eval dataset
llmci dataset add      Add examples interactively
llmci dataset check    Analyze dataset coverage
llmci dataset import   Import from CSV/JSON
llmci import-promptfoo Convert a Promptfoo config

Global flags: -v (verbose), --debug (full logging), --version.

See CHANGELOG.md for release history.

License

Apache 2.0

Project details

Release history

0.4.1

Jun 7, 2026

0.4.0

Jun 6, 2026

0.3.0

Jun 6, 2026

0.2.0

Jun 6, 2026

This version

0.1.9

Jun 1, 2026

0.1.8

May 31, 2026

0.1.7

May 31, 2026

0.1.6

May 31, 2026

0.1.5

May 25, 2026

0.1.3

May 24, 2026

0.1.2

May 24, 2026

0.1.1

May 24, 2026

0.1.0

May 24, 2026

Download files

Source Distribution

llmci-0.1.9.tar.gz (52.6 kB)

Uploaded Jun 1, 2026 Source

Built Distribution

llmci-0.1.9-py3-none-any.whl (61.2 kB)

Uploaded Jun 1, 2026 Python 3

File details

Details for the file llmci-0.1.9.tar.gz.

File metadata

Download URL: llmci-0.1.9.tar.gz
Upload date: Jun 1, 2026
Size: 52.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llmci-0.1.9.tar.gz
Algorithm	Hash digest
SHA256	`900766856b7a3d80271ace95a009e6ed399f3903aa69fdf1b561572c92949e14`
MD5	`5076dd7756997f72991ddbea42f149ad`
BLAKE2b-256	`e8c56799785cd721ddfcede6166c7d2969a59fdc263f2d6c98fc09f636cd2263`

Provenance

The following attestation bundles were made for llmci-0.1.9.tar.gz:

Publisher: publish.yml on llmci-cli/llmci

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmci-0.1.9.tar.gz
- Subject digest: 900766856b7a3d80271ace95a009e6ed399f3903aa69fdf1b561572c92949e14
- Sigstore transparency entry: 1687971629
- Sigstore integration time: Jun 1, 2026
Source repository:
- Permalink: llmci-cli/llmci@9680172d0a6b558a9f7fa8e64c3c6834395c8f40
- Branch / Tag: refs/tags/v0.1.9
- Owner: https://github.com/llmci-cli
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9680172d0a6b558a9f7fa8e64c3c6834395c8f40
- Trigger Event: release

File details

Details for the file llmci-0.1.9-py3-none-any.whl.

File metadata

Download URL: llmci-0.1.9-py3-none-any.whl
Upload date: Jun 1, 2026
Size: 61.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llmci-0.1.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b6f9ecffe9ab7f1ddf7e0b5365a5ce974a9fab658ae56fdb2ba1afa9914a1799`
MD5	`aa165610b9acfd7c44fee150be4c2d41`
BLAKE2b-256	`e76e90bfbcf67ac4e8ca0861b642a48162cea48918c5f0bef7a56096f2341688`

Provenance

The following attestation bundles were made for llmci-0.1.9-py3-none-any.whl:

Publisher: publish.yml on llmci-cli/llmci

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmci-0.1.9-py3-none-any.whl
- Subject digest: b6f9ecffe9ab7f1ddf7e0b5365a5ce974a9fab658ae56fdb2ba1afa9914a1799
- Sigstore transparency entry: 1687971689
- Sigstore integration time: Jun 1, 2026
Source repository:
- Permalink: llmci-cli/llmci@9680172d0a6b558a9f7fa8e64c3c6834395c8f40
- Branch / Tag: refs/tags/v0.1.9
- Owner: https://github.com/llmci-cli
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9680172d0a6b558a9f7fa8e64c3c6834395c8f40
- Trigger Event: release
