Automated quality assurance for AI applications

pixie-qa

Eval-driven development for Python LLM applications.

pixie-qa ships three complementary tools:

  • eval-driven-dev agent skill — guides a coding agent through the full eval-driven development loop: instrument → capture → build dataset → test → investigate → iterate.
  • pixie-qa Python package — the runtime: wrap() for data-boundary instrumentation, Runnable for dataset-driven test execution, built-in and custom evaluators, and the pixie CLI.
  • pixie-qa TypeScript package (pixie-ts/) — a full TypeScript port of the Python package with identical functionality and camelCase naming.

Agent Skill

Install

npx skills add yiouli/pixie-qa

Usage

Open a conversation with your coding agent and say something like:

"set up QA for my app"

The agent follows a six-step workflow:

  1. Understand the app — entry point, execution flow, expected behaviors
  2. Instrument with wrap() — mark data boundaries in the production code path
  3. Define evaluators — map quality criteria to built-in or custom evaluators
  4. Build a dataset — diverse representative scenarios in JSON
  5. Run pixie test — real pass/fail scores for every scenario
  6. Investigate & iterate — root-cause failures and fix

Python Package

Install

pip install pixie-qa
# with an LLM provider auto-instrumentor:
pip install "pixie-qa[openai]"   # openai | anthropic | langchain | google | dspy | all

wrap() — instrument data boundaries

Call wrap() at data boundaries in your application code. At test time, wrap(purpose="input") values are injected from the dataset; wrap(purpose="output") values are captured and scored by evaluators.

from pixie import wrap

db_result = wrap(fetch_from_db(user_id), purpose="input", name="db_result")
response   = wrap(generate_response(db_result), purpose="output", name="response")

Purpose    Meaning
"input"    External data fed into the LLM (injected at test time)
"output"   Final or intermediate output to evaluate
"state"    Intermediate state captured for debugging
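To make the inject/capture behavior concrete, here is a toy stand-in for the idea behind wrap(). It is a sketch, not pixie's implementation: toy_wrap, injected, and captured are hypothetical names, and the real package may work quite differently internally.

```python
from typing import Any, Dict

# Hypothetical test-time state; none of this is part of pixie's API.
injected: Dict[str, Any] = {}   # dataset values keyed by wrap name
captured: Dict[str, Any] = {}   # outputs and state recorded for later inspection


def toy_wrap(value: Any, *, purpose: str, name: str) -> Any:
    """Illustrative only: mimic the input/output/state semantics described above."""
    if purpose == "input" and name in injected:
        # At test time the dataset value replaces the live one.
        return injected[name]
    if purpose in ("output", "state"):
        # Record the value so an evaluator (or a human) can look at it later.
        captured[name] = value
    return value


# Simulate one dataset entry that supplies db_result.
injected["db_result"] = {"balance": 120.5}

db_result = toy_wrap({"balance": 0.0}, purpose="input", name="db_result")
response = toy_wrap(
    f"Your balance is ${db_result['balance']:.2f}.",
    purpose="output",
    name="response",
)
```

The point of the sketch is that the application code stays the same in production and in tests; only the state behind the wrapper changes.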

Runnable — run the app against each dataset entry

Implement the Runnable protocol so pixie test and pixie trace know how to run your app:

from pydantic import BaseModel
import pixie

class MyArgs(BaseModel):
    user_id: str
    message: str

class MyAppRunnable(pixie.Runnable[MyArgs]):
    @classmethod
    def create(cls) -> "MyAppRunnable":
        return cls()

    async def setup(self) -> None:
        pass  # one-time initialization before entries run

    async def run(self, args: MyArgs) -> None:
        await my_app.handle(args.user_id, args.message)

    async def teardown(self) -> None:
        pass  # one-time cleanup after all entries finish

run() is called concurrently for all dataset entries — protect shared mutable state with asyncio.Semaphore or asyncio.Lock if needed.
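The concurrency note above can be sketched with the standard library alone. In this example run_entry is a hypothetical stand-in for a Runnable's run() method; the Semaphore caps how many calls are in flight and the Lock guards the shared counters.

```python
import asyncio
from typing import Tuple


async def run_all(n_entries: int, max_in_flight: int) -> Tuple[int, int]:
    """Run n_entries fake run() calls; return (completed count, peak concurrency)."""
    lock = asyncio.Lock()                        # guards the shared counters
    limiter = asyncio.Semaphore(max_in_flight)   # caps concurrent calls
    completed = 0
    in_flight = 0
    peak = 0

    async def run_entry() -> None:
        nonlocal completed, in_flight, peak
        async with limiter:
            async with lock:
                in_flight += 1
                peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the real app call
            async with lock:
                in_flight -= 1
                completed += 1

    await asyncio.gather(*(run_entry() for _ in range(n_entries)))
    return completed, peak


completed, peak = asyncio.run(run_all(n_entries=10, max_in_flight=3))
```

A Lock matters most when the critical section itself contains an await, for example lazily opening a shared client; a Semaphore is the usual choice when the goal is to limit load on a downstream service.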

Dataset JSON format

{
  "runnable": "pixie_qa/scripts/run_app.py:MyAppRunnable",
  "evaluators": ["Factuality"],
  "entries": [
    {
      "input_data": { "user_id": "u1", "message": "What is my balance?" },
      "test_case": {
        "eval_input": [
          {
            "purpose": "input",
            "name": "db_result",
            "data": { "balance": 120.5 }
          }
        ],
        "expectation": "Your current balance is $120.50.",
        "description": "basic balance query"
      }
    }
  ]
}

Use pixie trace followed by pixie format to capture real traces and turn them into dataset entries with the correct data shapes.
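When writing dataset files by hand, a quick shape check can catch typos before a test run. The sketch below only checks for the keys that appear in the example above; which keys pixie actually requires is an assumption, and check_dataset is a hypothetical helper, not part of the package.

```python
import json
from typing import List

# Keys taken from the example dataset; treating them as mandatory is an assumption.
REQUIRED_TOP = ("runnable", "evaluators", "entries")
REQUIRED_ENTRY = ("input_data", "test_case")
REQUIRED_EVAL_INPUT = ("purpose", "name", "data")


def check_dataset(text: str) -> List[str]:
    """Return a list of problems found; an empty list means the shape looks right."""
    data = json.loads(text)
    problems = [f"missing top-level key: {k}" for k in REQUIRED_TOP if k not in data]
    for i, entry in enumerate(data.get("entries", [])):
        for k in REQUIRED_ENTRY:
            if k not in entry:
                problems.append(f"entries[{i}] missing key: {k}")
        eval_input = entry.get("test_case", {}).get("eval_input", [])
        for j, item in enumerate(eval_input):
            for k in REQUIRED_EVAL_INPUT:
                if k not in item:
                    problems.append(f"entries[{i}].eval_input[{j}] missing key: {k}")
    return problems


sample = json.dumps({
    "runnable": "pixie_qa/scripts/run_app.py:MyAppRunnable",
    "evaluators": ["Factuality"],
    "entries": [{
        "input_data": {"user_id": "u1", "message": "What is my balance?"},
        "test_case": {
            "eval_input": [
                {"purpose": "input", "name": "db_result", "data": {"balance": 120.5}}
            ],
            "expectation": "Your current balance is $120.50.",
            "description": "basic balance query",
        },
    }],
})
problems = check_dataset(sample)
```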

Evaluators

Evaluator             Task
Factuality            LLM-as-judge factual accuracy
ClosedQA              LLM-as-judge Q&A with reference answer
AnswerCorrectness     RAGAS combined factual + semantic similarity
EmbeddingSimilarity   Cosine similarity between output and expectation
ExactMatch            Deterministic exact string match
create_llm_evaluator  Custom prompt-based LLM-as-judge

Full evaluator list: docs/pixie/index.md
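For intuition about the two non-LLM evaluators, here is what "exact string match" and "cosine similarity" compute, written as plain functions. These are not pixie's implementations: the function names are hypothetical, and an embedding-based evaluator would presumably embed both texts with a model first, whereas this sketch takes the vectors as given.

```python
import math
from typing import Sequence


def exact_match(output: str, expectation: str) -> float:
    """Score 1.0 when the strings are identical, otherwise 0.0."""
    return 1.0 if output == expectation else 0.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)
```

Exact match is strict about whitespace and punctuation, so it suits outputs with a single canonical form; similarity-based or LLM-as-judge scoring is the better fit for free-form text.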

CLI reference

Command                                         Description
pixie test [path]                               Run eval tests; open scorecard in browser
pixie trace --runnable R --input I --output O   Run a Runnable, capture trace to JSONL
pixie format --input I --output O               Convert a trace JSONL to a dataset entry JSON
pixie init [root]                               Scaffold the pixie_qa/ working directory
pixie start [root]                              Launch the web UI at http://localhost:7118

Web UI

View all eval artifacts (results, datasets, markdown docs) in a live-updating local web UI:

pixie start              # initializes pixie_qa/ (if needed) and opens http://localhost:7118
pixie start my_dir       # use a custom artifact root
pixie init               # scaffolds pixie_qa/ without starting the server

Changes to artifacts are pushed to the browser in real time via SSE.
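For readers unfamiliar with SSE (Server-Sent Events), each update is a small text message sent over a long-lived HTTP response. The sketch below shows the wire format only; the event name and payload are hypothetical, and nothing here reflects how pixie's server actually names or structures its events.

```python
import json
from typing import Any, Dict, Optional


def format_sse(data: Dict[str, Any], event: Optional[str] = None) -> str:
    """Serialize one Server-Sent Events message."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps produces a single line by default, so one data: field is enough.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


# Hypothetical event name and payload, for illustration only.
message = format_sse({"path": "results/run1.json"}, event="artifact_changed")
```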

Configuration

Pixie reads configuration from environment variables and a local .env file. Existing process env vars take priority over .env values.

Variable                   Description
PIXIE_ROOT                 Root directory for all generated artifacts
PIXIE_RATE_LIMIT_ENABLED   true to enable evaluator throttling
PIXIE_RATE_LIMIT_RPS       Max requests per second for LLM-as-judge calls
PIXIE_RATE_LIMIT_RPM       Max requests per minute
PIXIE_RATE_LIMIT_TPS       Max tokens per second
PIXIE_RATE_LIMIT_TPM       Max tokens per minute
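The precedence rule (process environment over .env) can be illustrated with a small merge function. This is a sketch of the rule only, not pixie's loader: parse_dotenv and resolve_config are hypothetical helpers, the parser handles only simple KEY=VALUE lines, and the values shown are made up.

```python
from typing import Dict, Mapping


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def resolve_config(process_env: Mapping[str, str], dotenv_text: str) -> Dict[str, str]:
    """Merge settings so that process environment values override .env values."""
    merged = parse_dotenv(dotenv_text)
    # Only PIXIE_* variables are relevant; later updates win over .env values.
    merged.update({k: v for k, v in process_env.items() if k.startswith("PIXIE_")})
    return merged


dotenv_text = """
# local defaults (example values)
PIXIE_ROOT=pixie_qa
PIXIE_RATE_LIMIT_RPS=2
"""
config = resolve_config({"PIXIE_RATE_LIMIT_RPS": "5", "HOME": "/tmp"}, dotenv_text)
```

In this example PIXIE_ROOT comes from the .env text, while PIXIE_RATE_LIMIT_RPS takes the process value, which mirrors the rule stated above.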

TypeScript Package

A full TypeScript port lives in pixie-ts/. It provides the same functionality as the Python package, with TypeScript-idiomatic (camelCase) naming. To build it and run the CLI:

cd pixie-ts
npm install && npm run build
npx pixie-qa test [path]          # same CLI commands as Python pixie

See pixie-ts/README.md for full documentation.

Download files

Source Distribution

pixie_qa-0.8.1.tar.gz (604.8 kB)

Uploaded Source

Built Distribution

pixie_qa-0.8.1-py3-none-any.whl (620.3 kB)

Uploaded Python 3

File details

Details for the file pixie_qa-0.8.1.tar.gz.

File metadata

  • Download URL: pixie_qa-0.8.1.tar.gz
  • Size: 604.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pixie_qa-0.8.1.tar.gz
Algorithm     Hash digest
SHA256        922e5ea1d2f8fd8302d731f41c1f34e74751855b094bd630a46a28cd66e800db
MD5           3dba9f69a0f1c0840d02b54c99df2395
BLAKE2b-256   7d9cb8801337cb30ca008abf934079f0a0433bb97fdbaf5ea11b5888ec45c0d8

Provenance

The following attestation bundles were made for pixie_qa-0.8.1.tar.gz:

Publisher: publish.yml on yiouli/pixie-qa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pixie_qa-0.8.1-py3-none-any.whl.

File metadata

  • Download URL: pixie_qa-0.8.1-py3-none-any.whl
  • Size: 620.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pixie_qa-0.8.1-py3-none-any.whl
Algorithm     Hash digest
SHA256        96a058c2c1266b5a26fb498c68f842bce1ea0b5bf70572307c5b0d6ab13597c2
MD5           6f56a574fd492ae40120a87295d2b1ed
BLAKE2b-256   080da3aefdfc3736a5f79a360378521270ab971c28961458c2b217dc43c76702

Provenance

The following attestation bundles were made for pixie_qa-0.8.1-py3-none-any.whl:

Publisher: publish.yml on yiouli/pixie-qa
