AI-powered test orchestration platform: traditional testing (pytest, jest, playwright) + DebuggAI natural language tests, commit-based generation, and visual failure analysis

SystemEval

A unified evaluation framework providing objective, deterministic, and traceable test execution across any project.

Homepage: debugg.ai | Docs: debugg.ai/docs/systemeval

See COMMANDMENTS.md for the core principles and design philosophy.

🤖 AI Agent Quick Start

AI agents can get structured documentation directly from the CLI:

systemeval ai-help --format json  # Machine-parseable complete reference
systemeval ai-help --format text  # Human-readable overview

Core Rules for AI Agents:

ALWAYS use systemeval - Never run pytest/jest/npm test directly
ALWAYS use --json flag - For machine-parseable output
Trust the verdict - Don't parse raw test output
Three verdicts only - PASS (exit 0), FAIL (exit 1), ERROR (exit 2)

Common Agent Workflow:

# Step 1: Run tests with JSON output
systemeval test -c unit --json

# Step 2: Parse JSON response
# {
#   "verdict": "PASS"|"FAIL"|"ERROR",
#   "exit_code": 0|1|2,
#   "total": <count>,
#   "passed": <count>,
#   "failed": <count>,
#   "errors": <count>,
#   "duration": <seconds>,
#   "timestamp": "<iso8601>"
# }

# Step 3: React to verdict
# - PASS: Done, tests passed
# - FAIL: Read failure details, fix code, repeat
# - ERROR: Fix config/setup issue, repeat
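
The workflow above could be scripted roughly as follows. This is a minimal sketch, assuming `systemeval` is on the PATH and that `--json` prints a single JSON object with the fields listed above on stdout. The names `run_systemeval`, `next_action`, and `NEXT_ACTION` are illustrative and not part of SystemEval.

```python
import json
import subprocess

# Hypothetical mapping from verdict to the agent's next step,
# using the wording of Step 3 above.
NEXT_ACTION = {
    "PASS": "done",
    "FAIL": "read failure details, fix code, repeat",
    "ERROR": "fix config/setup issue, repeat",
}


def run_systemeval(category: str = "unit") -> dict:
    """Run `systemeval test -c <category> --json` and return the parsed result."""
    proc = subprocess.run(
        ["systemeval", "test", "-c", category, "--json"],
        capture_output=True,
        text=True,
        check=False,  # exit codes 1 and 2 are expected outcomes, not exceptions
    )
    # Assumption: stdout contains only the JSON document.
    return json.loads(proc.stdout)


def next_action(result: dict) -> str:
    """Decide the next step from the verdict alone, without parsing raw test output."""
    return NEXT_ACTION[result["verdict"]]
```

An agent loop would call `next_action(run_systemeval("unit"))` and stop once it returns `"done"`.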

Discovery Commands:

systemeval list categories    # Available test categories
systemeval list environments  # Available test environments
systemeval list templates     # Available output formats

Philosophy

SystemEval exists to solve a fundamental problem: test results should be facts, not opinions.

Traditional test runners produce ambiguous output that requires human interpretation. Did the build pass? Sort of. Are we ready to deploy? Probably. SystemEval eliminates this ambiguity with three core principles:

1. Objective Verdicts

Every evaluation produces one of three verdicts: PASS, FAIL, or ERROR. There is no "mostly passing" or "acceptable failure rate." The verdict is computed deterministically from metrics using cascade logic:

ANY metric fails    --> session FAILS
ANY session fails   --> sequence FAILS
exit_code == 2      --> ERROR (collection/config problem)
total == 0          --> ERROR (nothing ran)
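
The first two cascade rules can be sketched in a few lines. This is an illustration only, not SystemEval's implementation: the `Metric` and `Session` classes and the `sequence_verdict` function are hypothetical names, and the behaviour for a session with no metrics is not specified in this document (the sketch treats it as PASS).

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    passed: bool


@dataclass
class Session:
    name: str
    metrics: list[Metric] = field(default_factory=list)

    def verdict(self) -> str:
        # ANY metric fails --> session FAILS
        return "PASS" if all(m.passed for m in self.metrics) else "FAIL"


def sequence_verdict(sessions: list[Session]) -> str:
    # ANY session fails --> sequence FAILS
    return "PASS" if all(s.verdict() == "PASS" for s in sessions) else "FAIL"
```

A single failing metric is enough to fail its session, and a single failing session is enough to fail the whole sequence; nothing is averaged or weighted.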

2. Non-Fungible Runs

Every evaluation run is uniquely identifiable and traceable:

Run ID: UUID for the specific execution
Timestamp: ISO 8601 UTC timestamp
Exit Code: 0 (PASS), 1 (FAIL), or 2 (ERROR)

The same test results always produce the same verdict. A flaky test that fails during a run fails that run; there is no retry-until-green.
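
As an illustration of what such an identity record might contain, the sketch below generates a UUID and an ISO 8601 UTC timestamp with the standard library. The function name and the `run_id` key are assumptions, not SystemEval's actual field names.

```python
import uuid
from datetime import datetime, timezone


def new_run_identity() -> dict:
    """Return a fresh run ID and a UTC timestamp for one execution."""
    return {
        # Random (version 4) UUID, unique per execution.
        "run_id": str(uuid.uuid4()),
        # ISO 8601 with millisecond precision; Python renders UTC as "+00:00".
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
    }
```

Note that Python writes the UTC offset as `+00:00`, whereas the schema example in this document uses the equivalent `Z` suffix.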

3. Machine-Parseable Output

Results are structured data first, human-readable second:

JSON schema for programmatic consumption
Jinja2 templates for human-friendly formats
Designed for CI pipelines, agentic review, and automated comparison

Installation

# From PyPI
pip install systemeval

# With pytest support (recommended)
pip install "systemeval[pytest]"

# From source
git clone https://github.com/debugg-ai/systemeval
cd systemeval
pip install -e ".[pytest]"

Requirements: Python 3.9+

Quick Start

Initialize Configuration

cd your-project
systemeval init

This creates systemeval.yaml with auto-detected settings for your project type (Django, Next.js, generic Python, etc.).

Run Tests

# Run all tests
systemeval test

# Run specific category
systemeval test --category unit

# Run with JSON output for CI
systemeval test --json

# Run with specific template
systemeval test --template markdown

Check Results

# The exit code carries the verdict: 0 = PASS, 1 = FAIL, 2 = ERROR
systemeval test; code=$?
case $code in 0) echo "PASS";; 1) echo "FAIL";; *) echo "ERROR";; esac

Configuration

Create systemeval.yaml in your project root:

# Adapter: which test framework to use
adapter: pytest

# Project metadata
project_root: .
test_directory: tests

# Test categories with markers
categories:
  unit:
    description: "Fast isolated unit tests"
    markers: [unit]
  integration:
    description: "Tests with external dependencies"
    markers: [integration]
  api:
    description: "API endpoint tests"
    markers: [api]
  e2e:
    description: "End-to-end browser tests"
    markers: [e2e]
    requires: [browser]
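
One plausible way a category could translate into a pytest invocation is through pytest's `-m` marker expression. The sketch below is an assumption about the mapping, not the adapter's actual code: `CATEGORIES` mirrors the YAML above as a plain dict, and `pytest_args_for` is a hypothetical helper. For this to select anything, the tests themselves need matching decorators such as `@pytest.mark.unit`.

```python
# Mirrors the `categories` section of the systemeval.yaml example above.
CATEGORIES = {
    "unit": {"markers": ["unit"]},
    "integration": {"markers": ["integration"]},
    "api": {"markers": ["api"]},
    "e2e": {"markers": ["e2e"], "requires": ["browser"]},
}


def pytest_args_for(category: str, test_directory: str = "tests") -> list[str]:
    """Build a pytest command line that selects the category's markers."""
    markers = CATEGORIES[category]["markers"]
    # Assumption: several markers in one category are combined with "or".
    return ["pytest", test_directory, "-m", " or ".join(markers)]
```

For example, `pytest_args_for("unit")` yields the arguments for `pytest tests -m unit`.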

Output Schema

Every test run produces a result conforming to this schema. The verdict is one of PASS, FAIL, or ERROR; the example shows a failing run:

{
  "verdict": "FAIL",
  "exit_code": 1,
  "timestamp": "2024-01-15T10:30:00.000Z",
  "total": 150,
  "passed": 148,
  "failed": 2,
  "errors": 0,
  "skipped": 5,
  "duration_seconds": 12.345,
  "category": "unit",
  "coverage_percent": 87.5
}

Verdict Logic

Condition	Verdict	Exit Code
`exit_code == 2`	ERROR	2
`total == 0`	ERROR	2
`failed > 0 OR errors > 0`	FAIL	1
All tests pass	PASS	0
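
Read top to bottom with the first matching row winning, the table corresponds to a function like the one below. This is a sketch under that ordering assumption; `compute_verdict` is a hypothetical name, and its `exit_code` parameter stands for the exit code reported by the underlying test runner.

```python
def compute_verdict(total: int, failed: int, errors: int, exit_code: int = 0) -> tuple[str, int]:
    """Apply the rows of the verdict table in order; the first match wins."""
    if exit_code == 2:
        return "ERROR", 2  # collection/config problem
    if total == 0:
        return "ERROR", 2  # nothing ran
    if failed > 0 or errors > 0:
        return "FAIL", 1
    return "PASS", 0
```

Because the ERROR rows are checked first, a run that collected nothing is never reported as PASS, even though it has zero failures.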

Extended Schema (Sequence Results)

For multi-session evaluations:

{
  "sequence_id": "uuid",
  "sequence_name": "full-pipeline",
  "verdict": "PASS | FAIL",
  "exit_code": 0,
  "duration_seconds": 45.2,
  "pass_count": 3,
  "fail_count": 0,
  "sessions": [
    {
      "session_id": "uuid",
      "session_name": "unit-tests",
      "verdict": "PASS",
      "duration_seconds": 12.1,
      "metrics": [
        {
          "name": "tests_passed",
          "value": 150,
          "passed": true,
          "failure_message": null
        }
      ]
    }
  ]
}
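
A consumer of this structure will usually want the failed metrics and their messages. The sketch below walks the `sessions` and `metrics` arrays using only the field names shown above; `failed_metrics` is a hypothetical helper, and the `api-tests` session in the sample is made-up data for illustration.

```python
import json


def failed_metrics(sequence: dict) -> list[tuple[str, str, str]]:
    """Collect (session_name, metric_name, failure_message) for every failed metric."""
    failures = []
    for session in sequence.get("sessions", []):
        for metric in session.get("metrics", []):
            if not metric["passed"]:
                failures.append(
                    (session["session_name"], metric["name"], metric.get("failure_message") or "")
                )
    return failures


# Illustrative sample only; the second session is invented for this example.
example = json.loads("""
{
  "sequence_name": "full-pipeline",
  "verdict": "FAIL",
  "sessions": [
    {"session_name": "unit-tests", "verdict": "PASS",
     "metrics": [{"name": "tests_passed", "value": 150, "passed": true, "failure_message": null}]},
    {"session_name": "api-tests", "verdict": "FAIL",
     "metrics": [{"name": "tests_passed", "value": 9, "passed": false, "failure_message": "1 test failed"}]}
  ]
}
""")
print(failed_metrics(example))  # [('api-tests', 'tests_passed', '1 test failed')]
```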

CLI Reference

Commands

Command	Description
`systemeval test`	Run tests using configured adapter
`systemeval init`	Create configuration file
`systemeval validate`	Validate configuration
`systemeval list categories`	Show available test categories
`systemeval list adapters`	Show available test adapters
`systemeval list templates`	Show available output templates
`systemeval list environments`	Show configured environments
`systemeval docker status`	Show Docker container status
`systemeval docker logs`	View container logs
`systemeval docker exec`	Execute command in test container
`systemeval docker ready`	Check if containers are healthy

Design Requirements

Do not introduce hard-coded strings/numbers; use configuration files, constants, or environment variables for values that may change between environments.
Keep modules focused and digestible: split files that exceed ~600 lines and avoid functions longer than a screen so reasoning and tests stay simple.
Maintain clear separation of concerns: configuration, command parsing, orchestration, and environment management should live in distinct layers.
Document any deliberate exceptions to these rules (legacy constraints, temporary hacks) so reviewers know the rationale.

Test Options

systemeval test [OPTIONS]

Options:
  -c, --category TEXT         Test category (unit, integration, api, e2e)
  -a, --app TEXT              Specific app/module to test
  -f, --file TEXT             Specific test file to run
  -p, --parallel              Run tests in parallel
  --coverage                  Collect coverage data
  -x, --failfast              Stop on first failure
  -v, --verbose               Verbose output
  --json                      Output results as JSON
  -t, --template TEXT         Output template name
  --env-mode [auto|docker|local]  Execution environment (default: auto)
  --config PATH               Path to config file
  -e, --env TEXT              Environment to run in
  -s, --suite TEXT            Test suite to run
  --keep-running              Keep services running after tests
  --attach                    Attach to running containers (skip build/up)

Exit Codes

Code	Meaning
0	All tests passed (PASS)
1	One or more tests failed (FAIL)
2	Configuration, collection, or execution error (ERROR)

Output Templates

SystemEval includes built-in templates for different output needs:

Template	Use Case
`summary`	One-line CI log output
`table`	ASCII table for terminal
`markdown`	Full report in markdown
`json`	Use `--json` flag instead
`junit`	JUnit XML for test tools
`github`	GitHub Actions annotations
`slack`	Slack message format
`ci`	Structured CI/CD format

Usage

# Terminal table
systemeval test --template table

# Markdown report
systemeval test --template markdown > report.md

# GitHub annotations
systemeval test --template github

Custom Templates

Templates use Jinja2 syntax. Create custom templates:

# From file
systemeval test --template ./my-template.j2

# Available context variables:
# verdict, exit_code, total, passed, failed, errors, skipped
# duration, timestamp, category, coverage_percent
# pass_rate, failure_rate, verdict_emoji
# failures (list of failure details)

Adapters

Adapters bridge SystemEval to specific test frameworks.

Pytest (Default)

adapter: pytest

Features:

Test discovery via pytest collection API
Marker-based category filtering
Parallel execution (pytest-xdist)
Coverage reporting (pytest-cov)
Django auto-detection and configuration

Jest (Coming Soon)

adapter: jest
jest:
  config_file: jest.config.js

Creating Custom Adapters

from systemeval.adapters import BaseAdapter, TestItem, TestResult, register_adapter


class MyAdapter(BaseAdapter):
    """Skeleton adapter; replace each NotImplementedError with framework-specific logic."""

    def discover(self, category=None, app=None, file=None) -> list[TestItem]:
        # Return the tests matching the optional category/app/file filters
        raise NotImplementedError

    def execute(self, tests=None, **kwargs) -> TestResult:
        # Run tests and return results
        raise NotImplementedError

    def get_available_markers(self) -> list[str]:
        # Return available categories/markers
        raise NotImplementedError

    def validate_environment(self) -> bool:
        # Return True if the framework is installed and configured
        raise NotImplementedError


# Register the adapter under the name "my-adapter"
register_adapter("my-adapter", MyAdapter)

Docker Compose Environments

SystemEval provides first-class Docker Compose support with auto-discovery, lifecycle management, and remote Docker host support.

Quick Start

# Minimal config - auto-discovers everything from docker-compose.yml
environments:
  backend:
    type: docker-compose

SystemEval will automatically:

Find compose files (docker-compose.yml, compose.yml, local.yml, etc.)
Detect services with source mounts as test candidates
Infer test commands from pytest.ini, package.json, etc.
Extract health check endpoints from compose healthchecks
Configure appropriate ports

Full Configuration

environments:
  backend:
    type: docker-compose
    compose_file: local.yml           # Compose file (auto-detected if omitted)
    services: [django, postgres, redis]  # Services to manage (all if omitted)
    test_service: django              # Container to run tests in
    test_command: pytest              # Test command (auto-detected)
    working_dir: .                    # Project directory

    # Health check (auto-detected from compose healthcheck)
    health_check:
      endpoint: /api/health/
      port: 8000
      timeout: 120

    # Remote Docker host (optional)
    docker:
      host: ssh://user@remote-server
      # Or use a Docker context instead of host:
      # context: my-remote-context
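
The `health_check` settings describe a poll-until-ready loop: request the endpoint on the given port until it answers or the timeout expires. A minimal standard-library sketch of that idea follows. The function names, the 2-second polling interval, and the `localhost` host in the usage note are assumptions, not SystemEval internals.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(probe: Callable[[], bool], timeout: float = 120, interval: float = 2) -> bool:
    """Call `probe` until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the configuration above, the call would look like `wait_until_ready(lambda: http_probe("http://localhost:8000/api/health/"), timeout=120)`.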

Attach Mode

Connect to already-running containers without lifecycle management:

environments:
  dev:
    type: docker-compose
    attach: true  # Skip build/up, just exec into running containers
    test_service: django

# Containers already running from docker compose up
systemeval test --env dev --attach

Auto-Discovery

SystemEval searches for compose files in priority order:

docker-compose.yml
docker-compose.yaml
compose.yml
compose.yaml
local.yml / local.yaml
dev.yml / dev.yaml
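
The priority search amounts to returning the first candidate that exists. The sketch below assumes the search is limited to the project root and that `.yml` is tried before `.yaml` for the `local` and `dev` names; `find_compose_file` and `COMPOSE_CANDIDATES` are hypothetical names.

```python
from pathlib import Path
from typing import Optional

# Candidate names in the priority order listed above.
COMPOSE_CANDIDATES = [
    "docker-compose.yml",
    "docker-compose.yaml",
    "compose.yml",
    "compose.yaml",
    "local.yml",
    "local.yaml",
    "dev.yml",
    "dev.yaml",
]


def find_compose_file(project_root: Path) -> Optional[Path]:
    """Return the first existing candidate in priority order, or None."""
    for name in COMPOSE_CANDIDATES:
        candidate = project_root / name
        if candidate.is_file():
            return candidate
    return None
```

If a project contains both `compose.yml` and `local.yml`, this search returns `compose.yml`; setting `compose_file` explicitly avoids relying on the order.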

From the compose file, it infers:

Test service: First service with source mount + build context
Health port: From port mapping (e.g., 8000:8000 → port 8000)
Health endpoint: From compose healthcheck command
Test command: From pytest.ini, package.json, pyproject.toml
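
For the health port, the inference can be pictured as parsing a Compose short-syntax port mapping. The sketch assumes the container-side port is the one wanted (the `8000:8000` example above does not distinguish the two sides) and does not handle port ranges; `container_port` is a hypothetical helper.

```python
def container_port(mapping: str) -> int:
    """Extract the container-side port from a short-syntax compose port mapping."""
    # Accepted forms: "8000", "8000:8000", "127.0.0.1:8000:8000", each with an
    # optional protocol suffix such as "/tcp".
    without_protocol = mapping.split("/", 1)[0]
    # The container port is the last colon-separated field.
    return int(without_protocol.rsplit(":", 1)[-1])
```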

CLI Commands

# Run tests in Docker environment
systemeval test --env backend

# Attach to running containers
systemeval test --env backend --attach

# Docker-specific commands
systemeval docker status              # Show container status
systemeval docker logs [service]      # View container logs
systemeval docker exec <cmd>          # Execute command in test container
systemeval docker ready               # Check if containers are healthy

Pre-flight Checks

Before starting containers, SystemEval validates:

Docker binary is installed
Docker daemon is running
Docker Compose V2 is available
Compose file exists and is valid YAML
Referenced services exist in compose file
Test service is defined
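
Several of these checks can be reproduced with the standard library and the Docker CLI. The sketch below covers the first three checks plus the existence of the compose file; YAML validation and service lookups are left out because they need a YAML parser. `preflight` and `_command_ok` are hypothetical names, and the 15-second timeout is an arbitrary choice.

```python
import shutil
import subprocess
from pathlib import Path


def _command_ok(args: list[str]) -> bool:
    """Return True if the command runs and exits with status 0."""
    try:
        return subprocess.run(args, capture_output=True, timeout=15).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def preflight(compose_file: Path) -> dict[str, bool]:
    """Run a subset of the pre-flight checks and report each result by name."""
    has_binary = shutil.which("docker") is not None
    return {
        "docker_binary": has_binary,
        # `docker info` exits non-zero when the daemon is unreachable.
        "docker_daemon": has_binary and _command_ok(["docker", "info"]),
        # `docker compose version` succeeds only if the Compose V2 plugin is present.
        "compose_v2": has_binary and _command_ok(["docker", "compose", "version"]),
        "compose_file_exists": compose_file.is_file(),
    }
```

Calling `preflight(Path("docker-compose.yml"))` returns one boolean per check, so a caller can report exactly which prerequisite is missing.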

Remote Docker Hosts

Run tests against remote Docker daemons:

environments:
  staging:
    type: docker-compose
    docker:
      host: ssh://deploy@staging.example.com
    attach: true  # Usually attach to remote, don't manage lifecycle

Or use Docker contexts:

docker context create staging --docker "host=ssh://deploy@staging.example.com"

environments:
  staging:
    type: docker-compose
    docker:
      context: staging

Example Projects

See example-usage-projects/ for complete working examples:

Project	Compose File	Stack	Test Framework
`django-rest-api/`	`docker-compose.yml`	Django + Postgres + Redis	pytest
`express-mongo-api/`	`compose.yml`	Express + MongoDB	jest
`fastapi-react-fullstack/`	`local.yml`	FastAPI + React + Postgres + nginx	pytest + jest

Standalone Environments

For non-Docker services:

environments:
  frontend:
    type: standalone
    command: npm run dev
    test_command: npm test
    ready_endpoint: http://localhost:3000

Running Tests

# Run in specific environment
systemeval test --env backend

# Run in default environment
systemeval test

# Keep containers running after tests
systemeval test --env backend --keep-running

Design Principles

Deterministic: Same inputs always produce same verdict
Objective: No subjective interpretation of results
Traceable: Every run is uniquely identifiable
Machine-First: JSON output designed for automation
Framework-Agnostic: Adapters hide implementation details
CI-Native: Exit codes and output formats for pipelines

Design Requirements

Avoid embedding "magic" strings or numbers; prefer constants, YAML fields, or env vars so behavior is configurable.
Break files that grow beyond ~600 lines into cohesive, testable pieces and keep functions short unless the domain demands special handling.
Enforce single-responsibility layering: parsing, orchestration, and runtime helpers should be maintained in separate modules.
Document intentional deviations so future agents understand why the rule was relaxed.
Refer to ../docs/crawl-e2e-api-reference.md before wiring CLI integrations to reuse the documented crawl and E2E API shapes.

The Testing Philosophy

The Process

Investigate Why Tests Missed It
Write Test That FAILS
Fix The Code
Test Now PASSES

The Philosophy

Never fix a bug you can't reproduce in a test.

Comparison with Other Tools

Feature	SystemEval	pytest	jest
Unified CLI	Yes	No	No
Framework agnostic	Yes	Python only	JS only
Strict verdicts	PASS/FAIL/ERROR	Exit codes vary	Exit codes vary
JSON schema	Versioned	Plugin required	Custom
Environment orchestration	Built-in	External	External

Contributing

See the adapter documentation in systemeval/adapters/README.md for details on extending SystemEval.

License

MIT License - see LICENSE for details.

Release history

Version	Release date
0.8.1 (this version)	Mar 25, 2026
0.8.0	Feb 6, 2026
0.7.1	Jan 30, 2026
0.7.0	Jan 30, 2026
0.6.1	Jan 30, 2026
0.6.0	Jan 29, 2026
0.5.0	Jan 27, 2026
0.4.0	Jan 24, 2026
0.3.0	Jan 21, 2026
0.2.2	Jan 20, 2026
0.2.0	Dec 31, 2025
0.1.3	Dec 30, 2025
0.1.2	Dec 30, 2025
0.1.1	Dec 30, 2025
0.1.0	Dec 30, 2025

Download files

Source Distribution

systemeval-0.8.1.tar.gz (419.9 kB)

Uploaded Mar 25, 2026 Source

Built Distribution

systemeval-0.8.1-py3-none-any.whl (276.3 kB)

Uploaded Mar 25, 2026 Python 3

File details

Details for the file systemeval-0.8.1.tar.gz.

File metadata

Download URL: systemeval-0.8.1.tar.gz
Upload date: Mar 25, 2026
Size: 419.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for systemeval-0.8.1.tar.gz
Algorithm	Hash digest
SHA256	`4de2a4a607cef8772769e63a7c20a5b13be6d07afa7f648d89c8d196994c1375`
MD5	`47ade30f6011d473f518b0b31288b699`
BLAKE2b-256	`f9c80951400aba6fe4e96ab8e46a39e930739047c89c3d74abadd180f0a26075`

File details

Details for the file systemeval-0.8.1-py3-none-any.whl.

File metadata

Download URL: systemeval-0.8.1-py3-none-any.whl
Upload date: Mar 25, 2026
Size: 276.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for systemeval-0.8.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6ad1da726b4dc8d399b716fd4c09b9d809e3be7de9385660fd87e7e598da4bd`
MD5	`c3914a42e0802d7bcf7e9c4e5cc4ef53`
BLAKE2b-256	`201a6fec8588f243303a49da000cfe5b2e837980ce7566d359082daab6cb6f48`
