Automated quality assurance for AI applications

pixie-qa

Eval-driven development for Python LLM applications.

pixie-qa ships two complementary tools:

  • eval-driven-dev agent skill — guides a coding agent through the full eval-driven development loop: instrument → capture → build dataset → test → investigate → iterate.
  • pixie-qa Python package — the runtime: wrap() for data-boundary instrumentation, Runnable for dataset-driven test execution, built-in and custom evaluators, and the pixie CLI.

Agent Skill

Install

npx skills add yiouli/pixie-qa

Usage

Open a conversation with your coding agent and say something like:

"set up QA for my app"

The agent follows a six-step workflow:

  1. Understand the app — entry point, execution flow, expected behaviors
  2. Instrument with wrap() — mark data boundaries in the production code path
  3. Define evaluators — map quality criteria to built-in or custom evaluators
  4. Build a dataset — diverse representative scenarios in JSON
  5. Run pixie test — real pass/fail scores for every scenario
  6. Investigate & iterate — root-cause failures and fix

Python Package

Install

pip install pixie-qa
# with an LLM provider auto-instrumentor:
pip install "pixie-qa[openai]"   # openai | anthropic | langchain | google | dspy | all

wrap() — instrument data boundaries

Call wrap() at data boundaries in your application code. At test time, wrap(purpose="input") values are injected from the dataset; wrap(purpose="output") values are captured and scored by evaluators.

from pixie import wrap

db_result = wrap(fetch_from_db(user_id), purpose="input", name="db_result")
response   = wrap(generate_response(db_result), purpose="output", name="response")

| Purpose | Meaning |
| --- | --- |
| "input" | External data fed into the LLM (injected at test time) |
| "output" | Final or intermediate output to evaluate |
| "state" | Intermediate state captured for debugging |

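To make the inject-versus-capture behavior concrete, here is a minimal sketch of the idea. ToyHarness and its wrap method are hypothetical stand-ins written for this illustration; they are not pixie's implementation, and the real wrap() may behave differently in its details.

```python
from typing import Any, Dict, Optional


class ToyHarness:
    """Toy stand-in for illustration only; NOT pixie's implementation."""

    def __init__(self, injected: Optional[Dict[str, Any]] = None) -> None:
        self.injected = injected or {}       # name -> value supplied by a dataset entry
        self.captured: Dict[str, Any] = {}   # name -> value recorded for scoring or debugging

    def wrap(self, value: Any, *, purpose: str, name: str) -> Any:
        if purpose == "input" and name in self.injected:
            # Test time: the dataset value replaces the live one.
            return self.injected[name]
        if purpose in ("output", "state"):
            # Recorded, then passed through unchanged.
            self.captured[name] = value
        return value


# Simulate a test run where the dataset supplies db_result.
harness = ToyHarness(injected={"db_result": {"balance": 120.5}})
db_result = harness.wrap({"balance": 0.0}, purpose="input", name="db_result")
response = harness.wrap(
    f"Your balance is ${db_result['balance']:.2f}.",
    purpose="output",
    name="response",
)
print(db_result)
print(harness.captured)
```

In this sketch the live value ({"balance": 0.0}) is discarded in favor of the dataset value, and the generated response is recorded under its name so that an evaluator could score it later.
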
Runnable — run the app against each dataset entry

Implement the Runnable protocol so pixie test and pixie trace know how to run your app:

from pydantic import BaseModel
import pixie

class MyArgs(BaseModel):
    user_id: str
    message: str

class MyAppRunnable(pixie.Runnable[MyArgs]):
    @classmethod
    def create(cls) -> "MyAppRunnable":
        return cls()

    async def setup(self) -> None:
        pass  # one-time initialization before entries run

    async def run(self, args: MyArgs) -> None:
        # my_app stands for your own application module; import it at the top of this file.
        await my_app.handle(args.user_id, args.message)

    async def teardown(self) -> None:
        pass  # one-time cleanup after all entries finish

run() is called concurrently for all dataset entries — protect shared mutable state with asyncio.Semaphore or asyncio.Lock if needed.
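
The following sketch shows one way to apply that advice using only asyncio. SharedClient and its attributes are hypothetical names invented for this example; it assumes run() would delegate to a shared object like this one.

```python
import asyncio


class SharedClient:
    """Hypothetical shared resource used by every concurrent run() call."""

    def __init__(self, max_parallel: int) -> None:
        self._slots = asyncio.Semaphore(max_parallel)  # caps concurrent calls
        self._lock = asyncio.Lock()                    # guards the counter update
        self.in_flight = 0
        self.peak = 0
        self.total = 0

    async def handle(self, message: str) -> None:
        async with self._slots:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            await asyncio.sleep(0.01)  # stands in for a real network call
            async with self._lock:
                # Read-modify-write that spans an await: without the lock,
                # another task could interleave here and an update would be lost.
                current = self.total
                await asyncio.sleep(0)
                self.total = current + 1
            self.in_flight -= 1


async def main() -> SharedClient:
    client = SharedClient(max_parallel=2)
    await asyncio.gather(*(client.handle(f"msg-{i}") for i in range(6)))
    return client


client = asyncio.run(main())
print(client.peak, client.total)
```

The semaphore bounds how many entries touch the resource at once, while the lock serializes an update that would otherwise race across an await point.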

Dataset JSON format

{
  "runnable": "pixie_qa/scripts/run_app.py:MyAppRunnable",
  "evaluators": ["Factuality"],
  "entries": [
    {
      "input_data": { "user_id": "u1", "message": "What is my balance?" },
      "test_case": {
        "eval_input": [
          {
            "purpose": "input",
            "name": "db_result",
            "data": { "balance": 120.5 }
          }
        ],
        "expectation": "Your current balance is $120.50.",
        "description": "basic balance query"
      }
    }
  ]
}

Use pixie trace + pixie format to capture real traces and turn them into dataset entries with the correct data shapes.
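
When writing entries by hand, a quick shape check can catch missing keys before a test run. The sketch below uses only the standard library; check_dataset is a hypothetical helper, and the set of required keys is an assumption inferred from the example above, not pixie's actual schema.

```python
import json
from typing import List

# Assumption: these keys are inferred from the example dataset, not from a published schema.
REQUIRED_TOP_LEVEL = ("runnable", "evaluators", "entries")
REQUIRED_EVAL_INPUT = ("purpose", "name", "data")


def check_dataset(text: str) -> List[str]:
    """Return a list of problems; an empty list means the shape matches the example."""
    data = json.loads(text)
    problems = [f"missing top-level key: {k}" for k in REQUIRED_TOP_LEVEL if k not in data]
    for i, entry in enumerate(data.get("entries", [])):
        if "input_data" not in entry:
            problems.append(f"entries[{i}]: missing input_data")
        for item in entry.get("test_case", {}).get("eval_input", []):
            for key in REQUIRED_EVAL_INPUT:
                if key not in item:
                    problems.append(f"entries[{i}].eval_input: missing {key}")
    return problems


SAMPLE = """
{
  "runnable": "pixie_qa/scripts/run_app.py:MyAppRunnable",
  "evaluators": ["Factuality"],
  "entries": [
    {
      "input_data": {"user_id": "u1", "message": "What is my balance?"},
      "test_case": {
        "eval_input": [
          {"purpose": "input", "name": "db_result", "data": {"balance": 120.5}}
        ],
        "expectation": "Your current balance is $120.50.",
        "description": "basic balance query"
      }
    }
  ]
}
"""

print(check_dataset(SAMPLE))
print(check_dataset('{"entries": [{}]}'))
```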

Evaluators

| Evaluator | Task |
| --- | --- |
| Factuality | LLM-as-judge factual accuracy |
| ClosedQA | LLM-as-judge Q&A with reference answer |
| AnswerCorrectness | RAGAS combined factual + semantic similarity |
| EmbeddingSimilarity | Cosine similarity between output and expectation |
| ExactMatch | Deterministic exact string match |
| create_llm_evaluator | Custom prompt-based LLM-as-judge |

Full evaluator list: docs/pixie/index.md
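
For intuition about the two deterministic scores in the table, here is a sketch of what an exact string match and a cosine similarity compute. These functions are illustrations written for this README, not pixie's evaluator code; a real embedding-based evaluator would first turn both texts into vectors with an embedding model, which is replaced here by hand-written toy vectors.

```python
import math
from typing import Sequence


def exact_match(output: str, expectation: str) -> float:
    """1.0 when the strings are identical, otherwise 0.0."""
    return 1.0 if output == expectation else 0.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


print(exact_match("Your current balance is $120.50.", "Your current balance is $120.50."))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal toy vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel toy vectors
```

Exact match suits outputs with a single correct string, whereas a similarity score tolerates rewording at the cost of needing a pass threshold.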

CLI reference

| Command | Description |
| --- | --- |
| pixie test [path] | Run eval tests; open scorecard in browser |
| pixie trace --runnable R --input I --output O | Run a Runnable, capture trace to JSONL |
| pixie format --input I --output O | Convert a trace JSONL to a dataset entry JSON |
| pixie init [root] | Scaffold the pixie_qa/ working directory |
| pixie start [root] | Launch the web UI at http://localhost:7118 |

Web UI

View all eval artifacts (results, datasets, markdown docs) in a live-updating local web UI:

pixie start              # initializes pixie_qa/ (if needed) and opens http://localhost:7118
pixie start my_dir       # use a custom artifact root
pixie init               # scaffolds pixie_qa/ without starting the server

Changes to artifacts are pushed to the browser in real time via SSE.
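
For readers unfamiliar with SSE (server-sent events), the sketch below parses the standard text/event-stream wire format. The event name and payloads in the sample are hypothetical; the events pixie actually sends are not documented here.

```python
from typing import Dict, List


def parse_sse(stream: str) -> List[Dict[str, str]]:
    """Split a text/event-stream body into events; a blank line ends each event."""
    events: List[Dict[str, str]] = []
    event_type, data_lines = "message", []
    for line in stream.split("\n"):
        if line == "":
            if data_lines:  # dispatch only when the event carried data
                events.append({"event": event_type, "data": "\n".join(data_lines)})
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space after the colon is dropped
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)
    return events


# Hypothetical stream: the event name and payloads are invented for this example.
sample = "event: artifact_changed\ndata: results/run-1.json\n\n: keep-alive\n\ndata: hello\n\n"
events = parse_sse(sample)
print(events)
```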

Configuration

Pixie reads configuration from environment variables and a local .env file. Existing process env vars take priority over .env values.

| Variable | Description |
| --- | --- |
| PIXIE_ROOT | Root directory for all generated artifacts |
| PIXIE_RATE_LIMIT_ENABLED | Set to true to enable evaluator throttling |
| PIXIE_RATE_LIMIT_RPS | Max requests per second for LLM-as-judge calls |
| PIXIE_RATE_LIMIT_RPM | Max requests per minute for LLM-as-judge calls |
| PIXIE_RATE_LIMIT_TPS | Max tokens per second for LLM-as-judge calls |
| PIXIE_RATE_LIMIT_TPM | Max tokens per minute for LLM-as-judge calls |
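
The precedence rule (process environment over .env) can be sketched as a simple merge. parse_dotenv and resolve are hypothetical helpers written for this illustration, the parser handles only plain KEY=VALUE lines, and the values are made up; pixie's own loader may differ.

```python
from typing import Dict, Mapping


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse plain KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def resolve(dotenv_text: str, environ: Mapping[str, str]) -> Dict[str, str]:
    """Start from .env values, then let existing process variables override them."""
    merged = parse_dotenv(dotenv_text)
    merged.update({k: v for k, v in environ.items() if k.startswith("PIXIE_")})
    return merged


# Illustrative values only.
dotenv = "# local settings\nPIXIE_ROOT=from_dotenv\nPIXIE_RATE_LIMIT_RPS=4\n"
process_env = {"PIXIE_ROOT": "from_process", "HOME": "/home/user"}
config = resolve(dotenv, process_env)
print(config)
```

Here PIXIE_ROOT comes from the process environment because it was already set, while PIXIE_RATE_LIMIT_RPS falls back to the .env value.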
