
QAlityDeep

Pre-deploy CI/CD QA for LLM and AI-agent outputs

Verify that your LLM responses, AI-generated code, and agent outputs are correct -- before they ship.



Why QAlityDeep?

73% of developers use AI coding tools, but 46% don't trust the output. There's a gap between "AI wrote this" and "we shipped this." QAlityDeep fills that gap.

  • The missing testing layer -- QAlityDeep sits between AI-generated output and your production deploy, catching regressions, hallucinations, and broken code before they reach users.
  • Works with ANY LLM output -- ChatGPT, Claude, Cursor, Copilot, custom agents, RAG pipelines, or any system that produces text or code.
  • 9 programmatic metrics -- free, instant, no API key required. Validate syntax, match patterns, run code, and compare ASTs in milliseconds.
  • 6 LLM-as-judge metrics -- semantic correctness, relevancy, hallucination detection, tool call validation, multi-agent coordination, and trajectory analysis.
  • CI/CD native -- JUnit XML output, configurable thresholds, non-zero exit codes on failure. Drop it into GitHub Actions, GitLab CI, or any pipeline.

Quick Start (60 seconds)

# Install
pip install qalitydeep

# Scaffold a project with sample config and eval files
qalitydeep init

# Run your first evaluation
qalitydeep run

qalitydeep init creates a qalitydeep.yaml config file containing sample test suites, plus an evals/ directory of Python-based eval definitions. qalitydeep run loads the config, executes all test cases against the configured metrics, and prints a pass/fail results table.


Installation

Core (CLI + programmatic metrics):

pip install qalitydeep

With Streamlit dashboard:

pip install "qalitydeep[dashboard]"

With FastAPI server:

pip install "qalitydeep[api]"

Everything:

pip install "qalitydeep[all]"

Development (editable install):

git clone https://github.com/jatinderDH/qalitydeep.git
cd qalitydeep
pip install -e ".[all]"

Requires Python 3.9+. Tested on Python 3.9 through 3.13.


Configuration

QAlityDeep uses a qalitydeep.yaml file in your project root. Run qalitydeep init to generate one, or create it manually.

version: "1"

# Default settings applied to all suites unless overridden
defaults:
  metrics: [correctness, relevancy]   # Metrics to run on every test case
  threshold: 0.7                       # Minimum score to pass (0.0 - 1.0)

suites:
  # A QA suite for chatbot responses
  - name: chatbot_qa
    description: "Verify support chatbot answers"
    test_cases:
      - input: "What is your refund policy?"
        expected_output: "We offer a 30-day full refund on all purchases"

      - input: "Do you ship internationally?"
        expected_output: "Yes, we ship worldwide with delivery in 5-10 business days"

      - input: "How do I reset my password?"
        expected_output: "Go to Settings > Account > Reset Password"

  # A code quality suite with different metrics
  - name: code_quality
    description: "Validate AI-generated code"
    metrics: [code_syntax, exact_match]   # Override default metrics for this suite
    threshold: 0.8                         # Override default threshold
    test_cases:
      - input: "Write a hello world function"
        expected_output: |
          def hello():
              return "Hello, World!"

      # When actual_output is provided, the LLM is not invoked --
      # the output is evaluated directly against the metrics.
      - input: "Write a fibonacci function"
        actual_output: |
          def fib(n):
              if n <= 1:
                  return n
              return fib(n - 1) + fib(n - 2)
        expected_output: |
          def fib(n):
              if n <= 1:
                  return n
              return fib(n - 1) + fib(n - 2)

Configuration reference

Field Type Description
version string Config format version. Currently "1".
defaults.metrics list[string] Metrics applied to all suites unless overridden.
defaults.threshold float Global pass/fail threshold (0.0 - 1.0).
defaults.provider string LLM backend: openai, anthropic, or ollama.
defaults.model string Model name override (e.g. gpt-4.1-mini).
suites[].name string Unique name for the suite.
suites[].description string Human-readable description.
suites[].metrics list[string] Override defaults.metrics for this suite.
suites[].threshold float Override defaults.threshold for this suite.
suites[].test_cases[].input string The prompt or input text.
suites[].test_cases[].expected_output string The expected correct output.
suites[].test_cases[].actual_output string Pre-computed output (skips LLM invocation).
suites[].test_cases[].language string Language hint for code metrics: python, javascript, json.
suites[].test_cases[].tags list[string] Tags for filtering and organization.
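
The per-case language and tags fields, along with defaults.provider and defaults.model, slot into the same YAML layout as the example above. A minimal sketch (field names come from the reference table; the values are illustrative):

version: "1"

defaults:
  metrics: [correctness]
  threshold: 0.7
  provider: anthropic                  # openai, anthropic, or ollama
  model: claude-3-5-sonnet-20241022    # optional model override

suites:
  - name: json_outputs
    metrics: [code_syntax, json_valid]
    test_cases:
      - input: "Return a JSON object with keys name and age"
        expected_output: '{"name": "Ada", "age": 36}'
        language: json                 # hint for the code metrics
        tags: [smoke, structured-output]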

Python API

Define evaluation suites directly in Python using decorators. Place files matching eval_*.py in your project and QAlityDeep discovers them automatically.

from qalitydeep import eval_suite, eval_case


@eval_suite(metrics=["code_syntax", "exact_match"], threshold=0.8)
def test_basic_functions():
    """Test basic Python function generation."""
    return [
        eval_case(
            input="Write a function that adds two numbers",
            expected_output="def add(a, b):\n    return a + b",
        ),
        eval_case(
            input="Write a function that checks if a number is even",
            expected_output="def is_even(n):\n    return n % 2 == 0",
        ),
    ]


@eval_suite(metrics=["correctness", "relevancy"], threshold=0.7)
def test_qa_responses():
    """Test customer support responses."""
    return [
        eval_case(
            input="What is your refund policy?",
            expected_output="We offer a 30-day full refund on all purchases",
        ),
    ]

Run with:

qalitydeep run

QAlityDeep discovers eval_*.py files in the current directory and evals/ subdirectory, executing all @eval_suite-decorated functions.
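
For example, a layout like the following would be picked up without extra configuration (file names are arbitrary as long as they match the eval_*.py pattern):

your-project/
  qalitydeep.yaml
  eval_smoke.py            # discovered: matches eval_*.py in the current directory
  evals/
    eval_code_quality.py   # discovered: lives in the evals/ subdirectory
    helpers.py             # not discovered: does not match eval_*.py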


CLI Commands Reference

Command Description
qalitydeep run Run evaluations from YAML config, Python eval files, or a legacy dataset.
qalitydeep init Scaffold a new project with sample config and eval files.
qalitydeep doctor Check environment, dependencies, and configuration health.
qalitydeep watch Watch config and eval files for changes, re-running evaluations automatically.
qalitydeep serve-api Start the FastAPI evaluation server.
qalitydeep history List recent evaluation runs with scores and pass rates.
qalitydeep metrics List all available evaluation metrics.
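
For instance, a quick pre-flight check before your first evaluation might look like this (descriptions as per the table above):

qalitydeep doctor     # verify environment, dependencies, and configuration health
qalitydeep metrics    # list every available evaluation metric
qalitydeep history    # show recent runs with scores and pass rates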

qalitydeep run

The primary command. Loads configuration, executes test suites, and reports results.

# Run all suites from auto-discovered config
qalitydeep run

# Run a specific config file
qalitydeep run --config path/to/qalitydeep.yaml

# Run only one suite
qalitydeep run --suite chatbot_qa

# Output as JSON
qalitydeep run --output json

# Output as JUnit XML (for CI systems)
qalitydeep run --output junit --junit-file results.xml

# Override the pass/fail threshold
qalitydeep run --threshold 0.9

# Don't fail the process on threshold violations
qalitydeep run --no-fail-on-error

Flags:

Flag Short Description
--config -c Path to config file. Auto-discovers qalitydeep.yaml when omitted.
--suite -s Run only a specific suite by name.
--output -o Output format: table (default), json, or junit.
--junit-file Path to write JUnit XML file.
--threshold -t Override pass/fail threshold (0.0 - 1.0).
--fail-on-error / --no-fail-on-error Exit code 1 when any case fails the threshold (default: enabled).
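
The flags combine as expected. An illustrative invocation using the short forms -- running a single suite from a non-default config (configs/qalitydeep.ci.yaml is a placeholder path), raising the threshold, and emitting JUnit XML:

qalitydeep run -c configs/qalitydeep.ci.yaml -s code_quality -t 0.85 -o junit --junit-file results.xml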

qalitydeep watch

Watches your config and eval files for changes and automatically re-runs evaluations.

qalitydeep watch
qalitydeep watch --config my_config.yaml --output json

qalitydeep serve-api

Starts a FastAPI server for programmatic evaluation via HTTP.

qalitydeep serve-api --host 0.0.0.0 --port 8000

Requires the api extra: pip install "qalitydeep[api]".


Available Metrics

Programmatic Metrics

Free, instant, no API key required. These run locally and return results in milliseconds.

Metric What it checks Details
exact_match Exact string equality Score 1.0 if actual_output exactly matches expected_output (whitespace-trimmed).
contains Substring presence Score 1.0 if expected_output is found as a substring in actual_output.
contains_all Multiple substring presence Score = fraction of required substrings found in actual_output.
regex_match Regex pattern matching Score 1.0 if the regex pattern matches anywhere in actual_output.
json_valid Valid JSON check Score 1.0 if actual_output is parseable as valid JSON.
starts_with Prefix check Score 1.0 if actual_output starts with the given prefix.
code_syntax Python/JS/JSON syntax validity Parses code using ast (Python), node (JS), or json module. Auto-detects language.
code_diff AST-level code similarity Compares code structure using AST analysis (Python) with text similarity fallback. Weighted: 70% AST + 30% text.
code_execution Run code and check output Executes code in a sandboxed subprocess, compares stdout to expected_output. Supports Python and Node.js.
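
As a sketch of how these compose in a suite -- assuming, as the descriptions above state, that contains reads the required substring from expected_output and that json_valid needs no expected output:

suites:
  - name: structured_responses
    metrics: [json_valid, contains]
    threshold: 1.0
    test_cases:
      - input: "Return the order status as JSON"
        actual_output: '{"order_id": 42, "status": "shipped"}'
        expected_output: '"status"'    # substring the contains metric looks for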

LLM-as-Judge Metrics

Require an API key (OPENAI_API_KEY or ANTHROPIC_API_KEY). These use an LLM to evaluate output quality semantically.

Metric What it checks Details
correctness Semantic equivalence GEval-based scoring of whether actual_output conveys the same meaning as expected_output.
relevancy Answer relevancy DeepEval AnswerRelevancyMetric -- does the answer address the input question?
hallucination Fact grounding DeepEval HallucinationMetric -- detects claims not grounded in the provided context.
tool_correctness Tool call accuracy Validates that the agent called the right tools with correct parameters.
coordination Multi-agent alignment GEval-based scoring of communication clarity and consistency across agents.
trajectory Step efficiency GEval-based analysis of reasoning step appropriateness and efficiency.

CI/CD Integration

GitHub Actions

name: LLM Eval
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - run: pip install qalitydeep

      - run: qalitydeep run --output junit --junit-file results.xml

      - uses: dorny/test-reporter@v1
        if: always()
        with:
          name: QAlityDeep Results
          path: results.xml
          reporter: java-junit

How it works:

  1. qalitydeep run loads your qalitydeep.yaml and executes all test suites.
  2. --output junit formats results as JUnit XML, the standard CI test format.
  3. --junit-file results.xml writes the XML to disk for the test reporter.
  4. If any test case scores below the configured threshold, qalitydeep run exits with code 1, failing the CI pipeline.
  5. The dorny/test-reporter step renders results as a GitHub check with inline annotations.

Other CI Systems

QAlityDeep works with any CI system that supports exit codes and JUnit XML:

# GitLab CI, CircleCI, Jenkins, etc.
pip install qalitydeep
qalitydeep run --output junit --junit-file results.xml --threshold 0.8

Exit code 0 means all cases passed. Exit code 1 means at least one case fell below the threshold. Use --no-fail-on-error to always exit 0 (useful for advisory-only runs).
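
For GitLab CI specifically, a job along these lines (illustrative; the job name, image, and stage are placeholders) publishes the JUnit report so failures show up in merge request widgets:

llm-eval:
  image: python:3.11
  stage: test
  script:
    - pip install qalitydeep
    - qalitydeep run --output junit --junit-file results.xml --threshold 0.8
  artifacts:
    when: always
    reports:
      junit: results.xml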


Environment Variables

Configure LLM backends and integrations via environment variables. Copy .env.example to .env.local and fill in your values.

cp .env.example .env.local

Variable Description Default
LLM_BACKEND LLM provider: openai, anthropic, or ollama openai
OPENAI_API_KEY OpenAI API key (required when LLM_BACKEND=openai) --
OPENAI_MODEL OpenAI model name gpt-4.1-mini
ANTHROPIC_API_KEY Anthropic API key (required when LLM_BACKEND=anthropic) --
ANTHROPIC_MODEL Anthropic model name claude-3-5-sonnet-20241022
OLLAMA_BASE_URL Ollama server URL http://localhost:11434
OLLAMA_MODEL Ollama model name llama3.1
LANGSMITH_API_KEY LangSmith API key (optional, for trajectory evals) --
LANGSMITH_PROJECT LangSmith project name qalitydeep-predeploy
QALITYDEEP_DATA_DIR Directory for storing datasets and run results ./data
APP_ENV Application environment: dev, test, or prod dev

Programmatic metrics (exact_match, code_syntax, etc.) do not require any API key. You only need an API key when using LLM-as-judge metrics like correctness, relevancy, or hallucination.
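
For a one-off local run against the LLM-as-judge metrics, exporting the backend and key before invoking the CLI is enough (the key value below is a placeholder):

export LLM_BACKEND=openai
export OPENAI_API_KEY=sk-...        # placeholder -- use your real key
export OPENAI_MODEL=gpt-4.1-mini    # optional; this is the default
qalitydeep run --suite chatbot_qa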


Project Structure

qalitydeep/
  __init__.py              # Package entry: exports eval_suite, eval_case
  cli.py                   # Typer CLI application (7 commands)
  eval_config.py           # Pydantic models for YAML config
  models.py                # Core data models (TestCase, EvalRun, EvalCaseResult)
  yaml_loader.py           # YAML config discovery and parsing
  decorators.py            # @eval_suite and eval_case() Python API
  discovery.py             # Auto-discover eval_*.py files
  scaffolding.py           # qalitydeep init project generator
  doctor.py                # qalitydeep doctor health checks
  watcher.py               # File watcher for qalitydeep watch
  evals.py                 # LLM-as-judge evaluation logic
  metrics/
    __init__.py             # Metric registry and discovery
    base.py                 # BaseMetric abstract class
    programmatic.py         # ExactMatch, Contains, Regex, JSON, StartsWith
    code_syntax.py          # Python/JS/JSON syntax validation
    code_diff.py            # AST-level code comparison
    code_execution.py       # Sandboxed code execution
  formatters/
    __init__.py             # Formatter registry
    table.py                # Rich table output
    json_fmt.py             # JSON output
    junit.py                # JUnit XML output
  llm_backends.py          # OpenAI / Anthropic / Ollama abstraction
  langgraph_flows.py       # LangGraph multi-agent workflows
  api_server.py            # FastAPI server (qalitydeep serve-api)
  storage.py               # JSON-based run persistence
  cost_tracker.py          # LLM cost estimation
  sandbox.py               # Code execution sandboxing
  templates/               # Scaffolding templates
data/
  datasets/                # Uploaded evaluation datasets (CSV/JSON)
  runs/                    # Persisted evaluation run results (JSON)
  sample/                  # Sample datasets for demos
pyproject.toml             # Package configuration (hatchling)
qalitydeep.yaml            # Your evaluation config (created by init)
streamlit_app.py           # Streamlit dashboard UI

Contributing

Contributions are welcome. Please see CONTRIBUTING.md for guidelines on setting up a development environment, running tests, and submitting pull requests.

# Development setup
git clone https://github.com/jatinderDH/qalitydeep.git
cd qalitydeep
python -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"

# Run tests
pytest

# Lint and format
ruff check .
ruff format .

License

MIT License. See LICENSE for the full text.

