Automated quality assurance for AI applications

pixie-qa

Eval-driven development for Python LLM applications.

pixie-qa ships two complementary tools:

  • eval-driven-dev agent skill — guides a coding agent through the full eval-driven development loop: instrument → capture → build dataset → test → investigate → iterate.
  • pixie-qa Python package — the runtime: wrap() for data-boundary instrumentation, Runnable for dataset-driven test execution, built-in and custom evaluators, and the pixie CLI.

Agent Skill

Install

npx skills add yiouli/pixie-qa

Usage

Open a conversation with your coding agent and say something like:

"set up QA for my app"

The agent follows a six-step workflow:

  1. Understand the app — entry point, execution flow, expected behaviors
  2. Instrument with wrap() — mark data boundaries in the production code path
  3. Define evaluators — map quality criteria to built-in or custom evaluators
  4. Build a dataset — diverse representative scenarios in JSON
  5. Run pixie test — real pass/fail scores for every scenario
  6. Investigate & iterate — root-cause failures and fix

Python Package

Install

pip install pixie-qa
# with an LLM provider auto-instrumentor:
pip install "pixie-qa[openai]"   # openai | anthropic | langchain | google | dspy | all

wrap() — instrument data boundaries

Call wrap() at data boundaries in your application code. At test time, wrap(purpose="input") values are injected from the dataset; wrap(purpose="output") values are captured and scored by evaluators.

from pixie import wrap

db_result = wrap(fetch_from_db(user_id), purpose="input", name="db_result")
response   = wrap(generate_response(db_result), purpose="output", name="response")

Purpose    Meaning
"input"    External data fed into the LLM (injected at test time)
"output"   Final or intermediate output to evaluate
"state"    Intermediate state captured for debugging

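To make the inject-and-capture behavior concrete, here is a minimal sketch of how a wrap-style function could behave at test time. It is not pixie's implementation: `toy_wrap`, `INJECTED`, and `CAPTURED` are hypothetical names used only for illustration, and the real library may work differently.

```python
from typing import Any

# Hypothetical stand-ins for test-time state; pixie's real internals may differ.
INJECTED: dict[str, Any] = {}   # dataset values keyed by wrap name
CAPTURED: dict[str, Any] = {}   # outputs and state recorded during a run


def toy_wrap(value: Any, *, purpose: str, name: str) -> Any:
    """Illustrative only: inject inputs, record outputs and state."""
    if purpose == "input" and name in INJECTED:
        return INJECTED[name]        # replace the live value with dataset data
    if purpose in ("output", "state"):
        CAPTURED[name] = value       # record for evaluators or debugging
    return value                     # otherwise pass the value through


# Simulate a test run: the dataset supplies db_result, so the live value is ignored.
INJECTED["db_result"] = {"balance": 120.5}
db_result = toy_wrap({"balance": 0.0}, purpose="input", name="db_result")
response = toy_wrap(
    f"Your current balance is ${db_result['balance']:.2f}.",
    purpose="output",
    name="response",
)
```

In this sketch the application code is identical in production and under test; only the contents of the hypothetical `INJECTED` table decide whether a live value or a dataset value flows downstream.
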
Runnable — run the app against each dataset entry

Implement the Runnable protocol so pixie test and pixie trace know how to run your app:

from pydantic import BaseModel
import pixie

class MyArgs(BaseModel):
    user_id: str
    message: str

class MyAppRunnable(pixie.Runnable[MyArgs]):
    @classmethod
    def create(cls) -> "MyAppRunnable":
        return cls()

    async def setup(self) -> None:
        pass  # one-time initialization before entries run

    async def run(self, args: MyArgs) -> None:
        await my_app.handle(args.user_id, args.message)

    async def teardown(self) -> None:
        pass  # one-time cleanup after all entries finish

run() is called concurrently for all dataset entries — protect shared mutable state with asyncio.Semaphore or asyncio.Lock if needed.
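As a rough illustration of that advice, the sketch below guards shared counters with asyncio.Lock and caps parallelism with asyncio.Semaphore. `CountingRunner` is a hypothetical stand-in rather than a pixie class, and `asyncio.gather` only approximates how a test harness might launch entries concurrently.

```python
import asyncio


class CountingRunner:
    """Hypothetical stand-in for a Runnable-style class with shared state."""

    def __init__(self, max_concurrency: int = 2) -> None:
        self.calls = 0                                     # shared mutable state
        self.peak = 0                                      # highest observed parallelism
        self._active = 0
        self._lock = asyncio.Lock()                        # guards the counters
        self._slots = asyncio.Semaphore(max_concurrency)   # caps parallel runs

    async def run(self, message: str) -> None:
        async with self._slots:            # at most max_concurrency runs at once
            async with self._lock:
                self.calls += 1
                self._active += 1
                self.peak = max(self.peak, self._active)
            await asyncio.sleep(0.01)      # stands in for the real app call
            async with self._lock:
                self._active -= 1


async def main() -> CountingRunner:
    runner = CountingRunner(max_concurrency=2)
    # Launch every entry at once, roughly as a concurrent test runner would.
    await asyncio.gather(*(runner.run(f"msg {i}") for i in range(5)))
    return runner


runner = asyncio.run(main())
```

A lock is enough when the goal is only to keep updates consistent; the semaphore is the tool to reach for when a downstream resource (a database pool, a rate-limited API) cannot handle every entry at once.
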

Dataset JSON format

{
  "runnable": "pixie_qa/scripts/run_app.py:MyAppRunnable",
  "evaluators": ["Factuality"],
  "entries": [
    {
      "entry_kwargs": { "user_id": "u1", "message": "What is my balance?" },
      "test_case": {
        "eval_input": [
          { "purpose": "input", "name": "db_result", "data": { "balance": 120.5 } }
        ],
        "expectation": "Your current balance is $120.50.",
        "description": "basic balance query"
      }
    }
  ]
}

Use pixie trace and pixie format to capture real traces and convert them into dataset entries with the correct data shapes.
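Before running tests against a hand-edited dataset, a quick structural check can catch typos early. The sketch below is an assumption-laden helper, not part of pixie: `check_dataset` is a hypothetical function, and the set of keys it treats as required is inferred only from the example above, so pixie's own schema may be stricter or looser.

```python
import json

# Keys taken from the example dataset; treating them as required is an assumption.
REQUIRED_TOP_LEVEL = ("runnable", "evaluators", "entries")


def check_dataset(dataset: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape matches the example."""
    problems = [
        f"missing top-level key: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in dataset
    ]
    for i, entry in enumerate(dataset.get("entries", [])):
        if "entry_kwargs" not in entry:
            problems.append(f"entries[{i}]: missing entry_kwargs")
        test_case = entry.get("test_case", {})
        for item in test_case.get("eval_input", []):
            missing = {"purpose", "name", "data"} - item.keys()
            if missing:
                problems.append(
                    f"entries[{i}]: eval_input item missing {sorted(missing)}"
                )
    return problems


raw = """
{
  "runnable": "pixie_qa/scripts/run_app.py:MyAppRunnable",
  "evaluators": ["Factuality"],
  "entries": [
    {
      "entry_kwargs": {"user_id": "u1", "message": "What is my balance?"},
      "test_case": {
        "eval_input": [
          {"purpose": "input", "name": "db_result", "data": {"balance": 120.5}}
        ],
        "expectation": "Your current balance is $120.50.",
        "description": "basic balance query"
      }
    }
  ]
}
"""
problems = check_dataset(json.loads(raw))
```
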

Evaluators

Evaluator             Task
Factuality            LLM-as-judge factual accuracy
ClosedQA              LLM-as-judge Q&A with reference answer
AnswerCorrectness     RAGAS combined factual + semantic similarity
EmbeddingSimilarity   Cosine similarity between output and expectation
ExactMatch            Deterministic exact string match
create_llm_evaluator  Custom prompt-based LLM-as-judge

Full evaluator list: docs/pixie/index.md
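For intuition about the two non-LLM scoring ideas in the table, the sketch below computes an exact string match and a cosine similarity in plain Python. These are hypothetical helper functions, not pixie's evaluator classes; in particular, whether pixie's ExactMatch normalizes case or whitespace, and which embedding model EmbeddingSimilarity uses, are not stated here.

```python
import math
from typing import Sequence


def exact_match(output: str, expectation: str) -> float:
    """Score 1.0 when the strings are identical, else 0.0."""
    return 1.0 if output == expectation else 0.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0   # define similarity as 0.0 for a zero vector


same = exact_match("Your current balance is $120.50.", "Your current balance is $120.50.")
near = exact_match("Your balance is $120.50.", "Your current balance is $120.50.")
sim = cosine_similarity([1.0, 0.0], [1.0, 1.0])   # toy 2-d vectors, not real embeddings
```

The `near` case shows why exact matching suits only deterministic outputs: a correct but reworded answer scores zero, which is where similarity-based or LLM-as-judge evaluators become useful.
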

CLI reference

Command                                         Description
pixie test [path]                               Run eval tests; open scorecard in browser
pixie trace --runnable R --input I --output O   Run a Runnable, capture trace to JSONL
pixie format --input I --output O               Convert a trace JSONL to a dataset entry JSON
pixie analyze <test_run_id>                     LLM analysis of a completed test run
pixie init [root]                               Scaffold the pixie_qa/ working directory
pixie start [root]                              Launch the web UI at http://localhost:7118

Web UI

View all eval artifacts (results, datasets, markdown docs) in a live-updating local web UI:

pixie start              # initializes pixie_qa/ (if needed) and opens http://localhost:7118
pixie start my_dir       # use a custom artifact root
pixie init               # scaffolds pixie_qa/ without starting the server

Changes to artifacts are pushed to the browser in real time via SSE.

Configuration

Pixie reads configuration from environment variables and a local .env file. Variables already set in the process environment take priority over values in .env.

Variable                   Description
PIXIE_ROOT                 Root directory for all generated artifacts
PIXIE_RATE_LIMIT_ENABLED   Set to true to enable evaluator throttling
PIXIE_RATE_LIMIT_RPS       Max requests per second for LLM-as-judge calls
PIXIE_RATE_LIMIT_RPM       Max requests per minute
PIXIE_RATE_LIMIT_TPS       Max tokens per second
PIXIE_RATE_LIMIT_TPM       Max tokens per minute
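The precedence rule (process environment over .env) can be sketched as a simple merge. This is an illustration of the stated rule, not pixie's loader: `parse_dotenv` and `resolve_config` are hypothetical helpers, the parser handles only plain KEY=VALUE lines, and the values shown are made-up examples.

```python
from typing import Mapping


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skips blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_config(dotenv_text: str, environ: Mapping[str, str]) -> dict[str, str]:
    """Merge so that process environment values override .env values."""
    merged = parse_dotenv(dotenv_text)
    merged.update({k: v for k, v in environ.items() if k.startswith("PIXIE_")})
    return merged


# Example .env contents; the values are illustrative, not recommended defaults.
dotenv_text = """
# local defaults
PIXIE_ROOT=pixie_qa
PIXIE_RATE_LIMIT_ENABLED=true
PIXIE_RATE_LIMIT_RPS=4
"""
# Simulated process environment; pass os.environ in real use.
process_env = {"PIXIE_RATE_LIMIT_RPS": "2", "HOME": "/home/me"}
config = resolve_config(dotenv_text, process_env)
```

Here `PIXIE_RATE_LIMIT_RPS` resolves to the process value `"2"` even though .env says `4`, which makes it possible to override a committed .env for a single run from the shell.
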

Download files

Source Distribution

pixie_qa-0.6.0.tar.gz (326.7 kB view details)

Uploaded Source

Built Distribution

pixie_qa-0.6.0-py3-none-any.whl (339.8 kB view details)

Uploaded Python 3

File details

Details for the file pixie_qa-0.6.0.tar.gz.

File metadata

  • Download URL: pixie_qa-0.6.0.tar.gz
  • Upload date:
  • Size: 326.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pixie_qa-0.6.0.tar.gz
Algorithm     Hash digest
SHA256        6e042c68a87e4afbeb6b73298fac5b28faa58c5e24cdb17e48f0585161c9f001
MD5           9a902fcf45383c06316325692ffd2f29
BLAKE2b-256   c1657ee320c5df8f7ceef771139ce4c6a1e856ef5e7ca65bc3fb7fa13d23f701

Provenance

The following attestation bundles were made for pixie_qa-0.6.0.tar.gz:

Publisher: publish.yml on yiouli/pixie-qa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pixie_qa-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: pixie_qa-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 339.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pixie_qa-0.6.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        a03c13e8e23e416a3a641714db47a5fc6b429f3e85cd43d283f94ce4d8a2d317
MD5           01701d0d45eeaca08508ccd938d29d8a
BLAKE2b-256   1cc55abd88f616fdf4c96b0a424871e6f5d4e7c0f7a10aabbb50836859b3a5cc

Provenance

The following attestation bundles were made for pixie_qa-0.6.0-py3-none-any.whl:

Publisher: publish.yml on yiouli/pixie-qa
