phoenix2pytest

Turn production LLM failures into regression tests. Automatically.

Built for: Google Cloud Rapid Agent Hackathon, Arize track

Live demo: phoenix2pytest-etm7pvfo3a-nw.a.run.app. The example is pre-filled, so you only need to click Generate. Single-trace mode is served at /, batch mode at /batch.

Stack: Google Cloud, Vertex AI, Gemini 2.5, Agent Builder, Arize Phoenix MCP, FastAPI, pytest

Status: Stable (v1.0.0). Originally built for the Devpost submission cycle of June 2026, now stable within the documented scope (see Limits).


The problem

You ship an LLM feature. Three weeks later, a Slack thread mentions a customer got a weird response. You dig. The prompt has been edited twice since release. The model has been quietly re-quantised by the provider. Nobody added a test that would have caught it.

Existing eval frameworks ask you to predict failures up front. You write evals against your LLM, run them, get scores. That works for known failure modes you can imagine. It does not work for the failure that just got escalated to your phone.

The idea

phoenix2pytest goes the other direction. It reads traces from your Arize Phoenix project, picks the ones flagged as failures, and synthesises pytest cases that would have caught them. Production traffic feeds your regression suite without manual translation.

| Existing tools | phoenix2pytest |
| --- | --- |
| Direction: spec to eval to run | Direction: trace to failure to test |
| You predict what to test | You react to what broke |
| Eval scores | Concrete pytest assertions |
| Catches what you imagined | Catches what actually happened |

How it works

  1. Your LLM application emits traces to Arize Phoenix (standard OpenInference instrumentation).
  2. You annotate failed traces in the Phoenix UI (manual review, or via Phoenix evals).
  3. phoenix2pytest reads annotated traces via the Phoenix MCP server.
  4. A Gemini agent extracts evidence and assertion strategy per failure.
  5. A second pass synthesises a runnable pytest file.
  6. You drop the test into your repo and your CI catches the regression next time.
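To make step 6 concrete, here is a sketch of the kind of file you might end up committing for a hallucination failure. It is illustrative only: the actual output format of phoenix2pytest is not shown in this README, and `call_app`, the prompt, and the forbidden string are all made-up placeholders.

```python
# Hypothetical example of a generated regression test.
# Every name and string here is an illustrative assumption.

# Bad string observed in the failing trace (invented for this sketch).
FORBIDDEN_CLAIM = "founded in 1987"


def call_app(prompt: str) -> str:
    """Stand-in for your real LLM entry point; replace with an actual call."""
    return "I do not have a verified founding date for that company."


def test_no_hallucinated_founding_date():
    # Replay the prompt from the failing trace and check that the
    # fabricated fact does not reappear in the new output.
    output = call_app("When was Acme Robotics founded?")
    assert FORBIDDEN_CLAIM not in output.lower()
```

Running `pytest` against a file like this goes red while the regression is present and green once the prompt or model change removes the bad string.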

The pipeline runs end-to-end on a single trace (/) or on many annotated traces in one request (/batch). Both routes are available in the local web UI and in the Cloud Run deployment. Batch mode groups traces by failure mode and folds traces that share a mode into one parametrised test.

Architecture

flowchart LR
    A[Phoenix project<br/>annotated traces] -->|MCP| B[Agent Builder<br/>orchestrator]
    B --> C[Gemini Flash<br/>extractor]
    B --> D[Gemini Pro<br/>synthesiser]
    C --> D
    D --> E[Generated<br/>pytest file]
    E --> F[CI / dev<br/>runs pytest]

The orchestrator runs on Cloud Run, fetches traces through the Arize Phoenix MCP server, calls Gemini twice per trace (Flash for evidence extraction, Pro for code generation), and writes the synthesised test file.

Quickstart

The web UI is the primary entry point. A console-script CLI is planned (see Roadmap).

Local web UI:

git clone https://github.com/golikovichev/phoenix2pytest
cd phoenix2pytest
pip install -e ".[dev]"

Create a .env file in the repo root:

PHOENIX_BASE_URL=https://app.phoenix.arize.com/s/your-space
PHOENIX_API_KEY=your-phoenix-api-key
GOOGLE_CLOUD_PROJECT=your-gcp-project
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_GENAI_USE_VERTEXAI=True

Application Default Credentials are picked up automatically for Vertex AI, so no API key is needed if you have run gcloud auth application-default login. If you prefer the direct Gemini API, set GEMINI_API_KEY instead of the Vertex variables.
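If you want to check which credential path your environment would select before starting the server, a small preflight helper along these lines may be useful. It is a sketch only: `select_gemini_backend` is a hypothetical function, and the precedence it applies (Vertex variables first, then `GEMINI_API_KEY`) is an assumption rather than documented behaviour.

```python
import os
from typing import Mapping


def select_gemini_backend(env: Mapping[str, str] = os.environ) -> str:
    """Return "vertex" or "gemini-api" based on the variables described above.

    Hypothetical preflight check; the real application may resolve
    credentials differently.
    """
    use_vertex = env.get("GOOGLE_GENAI_USE_VERTEXAI", "").lower() == "true"
    has_project = bool(env.get("GOOGLE_CLOUD_PROJECT"))
    has_location = bool(env.get("GOOGLE_CLOUD_LOCATION"))
    if use_vertex and has_project and has_location:
        return "vertex"
    if env.get("GEMINI_API_KEY"):
        return "gemini-api"
    raise RuntimeError(
        "Set the Vertex AI variables or GEMINI_API_KEY before starting the app."
    )
```

Note that this only inspects environment variables; it does not verify that Application Default Credentials are actually present.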

Run the FastAPI web UI and open it in a browser:

uvicorn phoenix2pytest.web:app --reload --port 8000
# http://localhost:8000

Cloud Run deploy (see cloudbuild.yaml for the full pipeline):

gcloud builds submit --config cloudbuild.yaml

Demo

A 3-minute walkthrough video accompanies the Devpost submission. The demo shows a real Phoenix trace with a hallucination, phoenix2pytest extracting the failure, generating a pytest file, the run showing red, a prompt fix, and the run showing green.

How is this different from DeepEval, Opik, pytest-evals, Langfuse?

Short answer: those tools are about running evals you wrote. phoenix2pytest is about generating tests from failures you saw. Different direction, different mental model.

| Tool | What it is | When to use |
| --- | --- | --- |
| DeepEval | pytest-style framework for writing LLM evals | You know the failure modes you care about and want to define metrics |
| Opik | LLM observability with pytest integration | You want eval scores in CI |
| pytest-evals | Minimal pytest plugin for running evals at scale | You want parametrised eval runs |
| Langfuse | LLM tracing platform with evals | You want production tracing plus scoring |
| phoenix2pytest | Generates pytest tests from observed failures | You want your regression suite to keep up with production reality |

You can use phoenix2pytest alongside the others. It does not compete with eval frameworks; it feeds them. The output of phoenix2pytest is a pytest file you can run via DeepEval, Opik, pytest-evals, or plain pytest. Your choice.

What it catches and does not catch

Catches:

  • Hallucinations of specific facts when those facts appear as identifiable strings in the bad output
  • Format breaks: JSON wrapped in markdown when pure JSON was demanded, missing fields, wrong types
  • Refusals where the model should have answered
  • Wrong reasoning when a correct answer or a clarifying question was reachable
  • Stale-data claims when the model invented current information
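For two of the categories above, format breaks and refusals, the assertions a test might contain can be sketched as below. These helpers are hypothetical and are not taken from the generator's output; the refusal markers in particular are an invented, deliberately small list.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker

# Illustrative refusal phrases; a real suite would tune these per application.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


def assert_pure_json(output: str, required_fields: dict[str, type]) -> dict:
    """Fail on fenced JSON, unparseable JSON, missing fields, or wrong types."""
    assert not output.lstrip().startswith(FENCE), "JSON is wrapped in a markdown fence"
    data = json.loads(output)  # raises json.JSONDecodeError on invalid JSON
    for name, expected_type in required_fields.items():
        assert name in data, f"missing field: {name}"
        assert isinstance(data[name], expected_type), f"wrong type for field: {name}"
    return data


def assert_not_refusal(output: str) -> None:
    """Fail when the output contains one of the known refusal phrases."""
    lowered = output.lower()
    assert not any(marker in lowered for marker in REFUSAL_MARKERS), "model refused"
```

Both checks are string-level, which is consistent with the note above that hallucinations are caught when the bad fact appears as an identifiable string.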

Does not catch yet:

  • Semantic-level paraphrased failures where the model fabricates the same fact in different words
  • Failures that only appear in long context or multi-turn flows
  • Subtle quality degradations without a clear bad-string pattern

The roadmap covers paraphrase tolerance via embedding-similarity assertions.
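As a rough picture of what an embedding-similarity assertion could look like, the sketch below compares an output with a forbidden claim by cosine similarity. It is not the planned v1.1 design, which is not described here. `toy_embed` is a bag-of-words stand-in chosen so the example runs offline; it does not capture paraphrases, and a real sentence-embedding model would have to be substituted for it. The 0.8 threshold is an arbitrary assumption.

```python
import math
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Word-count vector; a placeholder for a real embedding model."""
    return {word: float(count) for word, count in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_semantically_close(
    output: str,
    forbidden_claim: str,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.8,  # arbitrary cut-off for this sketch
) -> bool:
    """True when the output is at least `threshold` similar to the forbidden claim."""
    return cosine(embed(output), embed(forbidden_claim)) >= threshold
```

A test would then assert `not is_semantically_close(output, forbidden_claim, embed=real_model)` instead of a plain substring check, where `real_model` is whatever embedding function you supply.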

Limits

  • Loads all matched traces into memory. For projects with hundreds of thousands of failed traces you will want a streaming variant.
  • Generated tests assume the same model the failure was observed against. Cross-model regression suites need explicit configuration.
  • Test quality depends on Gemini synthesis quality. For unusual failure modes, manual review of the generated test is prudent.

Honesty notes

  • The synthesised pytest is a starting point, not a final test. Engineers review it before commit.
  • The agent does not auto-classify arbitrary traces. It works only on traces already labelled as failures (by manual annotation, an eval framework, or a heuristic). This is intentional: automatic failure detection on raw traces is not reliable for hallucinations of facts the classifier itself cannot verify.
  • Built and maintained solo. The scope is deliberately narrow (see Limits) and stable within it; the roadmap covers where it grows next.

Roadmap

  • v1.0 (June 2026): one-trace and many-trace generation; hallucination, format_break, refusal_bug, stale_data, wrong_reasoning, and off_topic_drift coverage; Cloud Run hosting; web UI; generated code is validated as parseable Python before it is returned.
  • v1.1: paraphrase-tolerant assertions via embedding similarity.
  • v1.2: multi-turn trace handling.
  • v1.3: phoenix2pytest console-script CLI and broader documentation.

Contributing

Contributions are welcome. See CONTRIBUTING.md for setup and the workflow.

License

MIT. See LICENSE.

Acknowledgements

Built on Arize Phoenix, Google Cloud Vertex AI, OpenTelemetry, and OpenInference semantic conventions. Thanks to the maintainers of all four projects.
