AI-powered test orchestration platform: traditional testing (pytest, jest, playwright) + DebuggAI natural language tests, commit-based generation, and visual failure analysis
SystemEval
A unified evaluation framework providing objective, deterministic, and traceable test execution across any project.
Homepage: debugg.ai | Docs: debugg.ai/docs/systemeval
See COMMANDMENTS.md for the core principles and design philosophy.
🤖 AI Agent Quick Start
For AI agents: Get structured documentation instantly:
systemeval ai-help --format json # Machine-parseable complete reference
systemeval ai-help --format text # Human-readable overview
Core Rules for AI Agents:
- ALWAYS use systemeval - Never run pytest/jest/npm test directly
- ALWAYS use --json flag - For machine-parseable output
- Trust the verdict - Don't parse raw test output
- Three verdicts only - PASS (exit 0), FAIL (exit 1), ERROR (exit 2)
Common Agent Workflow:
# Step 1: Run tests with JSON output
systemeval test -c unit --json
# Step 2: Parse JSON response
# {
# "verdict": "PASS"|"FAIL"|"ERROR",
# "exit_code": 0|1|2,
# "total": <count>,
# "passed": <count>,
# "failed": <count>,
# "errors": <count>,
# "duration": <seconds>,
# "timestamp": "<iso8601>"
# }
# Step 3: React to verdict
# - PASS: Done, tests passed
# - FAIL: Read failure details, fix code, repeat
# - ERROR: Fix config/setup issue, repeat
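The workflow above can be scripted. The following is a minimal Python sketch that relies only on the `systemeval test -c unit --json` command and the `verdict` field shown above. The helper names `next_action` and `run_unit_tests` and the `ACTIONS` table are illustrative and not part of SystemEval, and the sketch assumes the JSON document is the only thing written to stdout.

```python
import json
import subprocess

# Hypothetical mapping from verdict to the agent's next step (Step 3 above).
ACTIONS = {
    "PASS": "done",
    "FAIL": "read failure details, fix code, rerun",
    "ERROR": "fix config/setup, rerun",
}


def next_action(raw_json: str) -> str:
    """Parse the JSON emitted by `systemeval test --json` and pick a next step."""
    result = json.loads(raw_json)
    verdict = result["verdict"]
    if verdict not in ACTIONS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return ACTIONS[verdict]


def run_unit_tests() -> str:
    """Run the unit category and return the next step (needs systemeval on PATH)."""
    proc = subprocess.run(
        ["systemeval", "test", "-c", "unit", "--json"],
        capture_output=True,
        text=True,
    )
    # Assumption: stdout contains only the JSON result.
    return next_action(proc.stdout)
```

Keeping the parsing in `next_action` separate from the subprocess call makes the decision logic easy to exercise without a configured project.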
Discovery Commands:
systemeval list categories # Available test categories
systemeval list environments # Available test environments
systemeval list templates # Available output formats
Philosophy
SystemEval exists to solve a fundamental problem: test results should be facts, not opinions.
Traditional test runners produce ambiguous output that requires human interpretation. Did the build pass? Sort of. Are we ready to deploy? Probably. SystemEval eliminates this ambiguity with three core principles:
1. Objective Verdicts
Every evaluation produces one of three verdicts: PASS, FAIL, or ERROR. There is no "mostly passing" or "acceptable failure rate." The verdict is computed deterministically from metrics using cascade logic:
ANY metric fails --> session FAILS
ANY session fails --> sequence FAILS
exit_code == 2 --> ERROR (collection/config problem)
total == 0 --> ERROR (nothing ran)
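The first two cascade rules can be illustrated with a short Python sketch. The `Metric` and `Session` classes and the `sequence_verdict` function are hypothetical stand-ins written for this example, not SystemEval's own types. The two ERROR rules apply to an individual test run and are not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    passed: bool


@dataclass
class Session:
    name: str
    metrics: list[Metric] = field(default_factory=list)

    def verdict(self) -> str:
        # ANY metric fails --> session FAILS
        # (an empty metric list passes in this sketch; ERROR rules are out of scope)
        return "PASS" if all(m.passed for m in self.metrics) else "FAIL"


def sequence_verdict(sessions: list[Session]) -> str:
    # ANY session fails --> sequence FAILS
    return "PASS" if all(s.verdict() == "PASS" for s in sessions) else "FAIL"
```

A single failing metric therefore propagates all the way up: one failed metric fails its session, and that session fails the sequence.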
2. Non-Fungible Runs
Every evaluation run is uniquely identifiable and traceable:
- Run ID: UUID for the specific execution
- Timestamp: ISO 8601 UTC timestamp
- Exit Code: 0 (PASS), 1 (FAIL), or 2 (ERROR)
Same inputs always produce the same verdict. If a test is flaky, it fails - there is no retry-until-green.
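As an illustration of what such a traceable record could look like, here is a minimal Python sketch. The function name `new_run_record` and the key `run_id` are assumptions made for this example; only the three properties listed above (UUID, ISO 8601 UTC timestamp, exit code) come from this document.

```python
import uuid
from datetime import datetime, timezone

# Verdict-to-exit-code mapping as documented above.
EXIT_CODES = {"PASS": 0, "FAIL": 1, "ERROR": 2}


def new_run_record(verdict: str) -> dict:
    """Build an identifying record for one run (hypothetical field names)."""
    return {
        "run_id": str(uuid.uuid4()),                          # unique per execution
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        "verdict": verdict,
        "exit_code": EXIT_CODES[verdict],
    }
```

Two runs with the same verdict still get distinct IDs, which is what makes individual runs comparable and auditable.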
3. Machine-Parseable Output
Results are structured data first, human-readable second:
- JSON schema for programmatic consumption
- Jinja2 templates for human-friendly formats
- Designed for CI pipelines, agentic review, and automated comparison
Installation
# From PyPI
pip install systemeval
# With pytest support (recommended)
pip install systemeval[pytest]
# From source
git clone https://github.com/debugg-ai/systemeval
cd systemeval
pip install -e ".[pytest]"
Requirements: Python 3.9+
Quick Start
Initialize Configuration
cd your-project
systemeval init
This creates systemeval.yaml with auto-detected settings for your project type (Django, Next.js, generic Python, etc.).
Run Tests
# Run all tests
systemeval test
# Run specific category
systemeval test --category unit
# Run with JSON output for CI
systemeval test --json
# Run with specific template
systemeval test --template markdown
Check Results
# The exit code carries the verdict: 0 = PASS, 1 = FAIL, 2 = ERROR
systemeval test && echo "PASS" || echo "FAIL or ERROR (exit code $?)"
Configuration
Create systemeval.yaml in your project root:
# Adapter: which test framework to use
adapter: pytest
# Project metadata
project_root: .
test_directory: tests
# Test categories with markers
categories:
  unit:
    description: "Fast isolated unit tests"
    markers: [unit]
  integration:
    description: "Tests with external dependencies"
    markers: [integration]
  api:
    description: "API endpoint tests"
    markers: [api]
  e2e:
    description: "End-to-end browser tests"
    markers: [e2e]
    requires: [browser]
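To make the category-to-marker relationship concrete, the sketch below shows how a category could be turned into a pytest invocation using pytest's `-m` marker expression. This is an assumption about how such a mapping might work, not a description of SystemEval's internals; the function `pytest_args_for` and the choice to join several markers with `or` are hypothetical.

```python
# Mirrors the `categories` section of the configuration above as a plain dict.
CATEGORIES = {
    "unit": {"markers": ["unit"]},
    "integration": {"markers": ["integration"]},
    "api": {"markers": ["api"]},
    "e2e": {"markers": ["e2e"], "requires": ["browser"]},
}


def pytest_args_for(category: str, categories: dict = CATEGORIES) -> list[str]:
    """Build a pytest command line that selects one category by marker."""
    if category not in categories:
        raise KeyError(f"unknown category: {category}")
    markers = categories[category]["markers"]
    # Assumption: several markers are combined with "or".
    return ["pytest", "-m", " or ".join(markers)]
```

For example, `pytest_args_for("unit")` yields `["pytest", "-m", "unit"]`.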
Output Schema
Every test run produces a result conforming to this schema. The verdict is one of PASS, FAIL, or ERROR; the example shows a failing run:
{
  "verdict": "FAIL",
  "exit_code": 1,
  "timestamp": "2024-01-15T10:30:00.000Z",
  "total": 150,
  "passed": 148,
  "failed": 2,
  "errors": 0,
  "skipped": 5,
  "duration_seconds": 12.345,
  "category": "unit",
  "coverage_percent": 87.5
}
Verdict Logic
| Condition | Verdict | Exit Code |
|---|---|---|
| exit_code == 2 | ERROR | 2 |
| total == 0 | ERROR | 2 |
| failed > 0 OR errors > 0 | FAIL | 1 |
| All tests pass | PASS | 0 |
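The table can be read as a decision procedure. The Python sketch below assumes the rows are evaluated top to bottom with the first match winning, and that `exit_code` is the raw exit code reported by the underlying test framework; both are assumptions, and `run_verdict` is a name chosen for this example.

```python
def run_verdict(exit_code: int, total: int, failed: int, errors: int) -> tuple[str, int]:
    """Apply the table rows in order; the first matching row wins."""
    if exit_code == 2:
        return "ERROR", 2          # collection/config problem
    if total == 0:
        return "ERROR", 2          # nothing ran
    if failed > 0 or errors > 0:
        return "FAIL", 1
    return "PASS", 0               # all tests passed
```

Checking the ERROR rows first means a run that collected nothing is never reported as PASS simply because no test failed.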
Extended Schema (Sequence Results)
For multi-session evaluations:
{
  "sequence_id": "uuid",
  "sequence_name": "full-pipeline",
  "verdict": "PASS | FAIL",
  "exit_code": 0,
  "duration_seconds": 45.2,
  "pass_count": 3,
  "fail_count": 0,
  "sessions": [
    {
      "session_id": "uuid",
      "session_name": "unit-tests",
      "verdict": "PASS",
      "duration_seconds": 12.1,
      "metrics": [
        {
          "name": "tests_passed",
          "value": 150,
          "passed": true,
          "failure_message": null
        }
      ]
    }
  ]
}
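A consumer of this structure typically wants to know which metric caused a failure. The following Python sketch walks a parsed sequence result and collects failing metrics. It uses only the keys shown in the schema above; the function name `failing_metrics` is illustrative.

```python
def failing_metrics(sequence: dict) -> list[tuple[str, str, str]]:
    """Return (session_name, metric_name, failure_message) for each failed metric."""
    found = []
    for session in sequence.get("sessions", []):
        for metric in session.get("metrics", []):
            if not metric["passed"]:
                # failure_message may be null in the JSON; normalize to ""
                message = metric.get("failure_message") or ""
                found.append((session["session_name"], metric["name"], message))
    return found
```

Pass it the dictionary obtained from `json.loads` on the sequence output; an empty list means every metric in every session passed.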
CLI Reference
Commands
| Command | Description |
|---|---|
| systemeval test | Run tests using configured adapter |
| systemeval init | Create configuration file |
| systemeval validate | Validate configuration |
| systemeval list categories | Show available test categories |
| systemeval list adapters | Show available test adapters |
| systemeval list templates | Show available output templates |
| systemeval list environments | Show configured environments |
| systemeval docker status | Show Docker container status |
| systemeval docker logs | View container logs |
| systemeval docker exec | Execute command in test container |
| systemeval docker ready | Check if containers are healthy |
Design Requirements
- Do not introduce hard-coded strings/numbers; use configuration files, constants, or environment variables for values that may change between environments.
- Keep modules focused and digestible—split files that exceed ~600 lines and avoid functions longer than a screen so reasoning and tests stay simple.
- Maintain clear separation of concerns: configuration, command parsing, orchestration, and environment management should live in distinct layers.
- Document any deliberate exceptions to these rules (legacy constraints, temporary hacks) so reviewers know the rationale.
Test Options
systemeval test [OPTIONS]
Options:
-c, --category TEXT Test category (unit, integration, api, e2e)
-a, --app TEXT Specific app/module to test
-f, --file TEXT Specific test file to run
-p, --parallel Run tests in parallel
--coverage Collect coverage data
-x, --failfast Stop on first failure
-v, --verbose Verbose output
--json Output results as JSON
-t, --template TEXT Output template name
--env-mode [auto|docker|local] Execution environment (default: auto)
--config PATH Path to config file
-e, --env TEXT Environment to run in
-s, --suite TEXT Test suite to run
--keep-running Keep services running after tests
--attach Attach to running containers (skip build/up)
Exit Codes
| Code | Meaning |
|---|---|
| 0 | All tests passed (PASS) |
| 1 | One or more tests failed (FAIL) |
| 2 | Configuration, collection, or execution error (ERROR) |
Output Templates
SystemEval includes built-in templates for different output needs:
| Template | Use Case |
|---|---|
| summary | One-line CI log output |
| table | ASCII table for terminal |
| markdown | Full report in markdown |
| json | Use --json flag instead |
| junit | JUnit XML for test tools |
| github | GitHub Actions annotations |
| slack | Slack message format |
| ci | Structured CI/CD format |
Usage
# Terminal table
systemeval test --template table
# Markdown report
systemeval test --template markdown > report.md
# GitHub annotations
systemeval test --template github
Custom Templates
Templates use Jinja2 syntax. Create custom templates:
# From file
systemeval test --template ./my-template.j2
# Available context variables:
# verdict, exit_code, total, passed, failed, errors, skipped
# duration, timestamp, category, coverage_percent
# pass_rate, failure_rate, verdict_emoji
# failures (list of failure details)
Adapters
Adapters bridge SystemEval to specific test frameworks.
Pytest (Default)
adapter: pytest
Features:
- Test discovery via pytest collection API
- Marker-based category filtering
- Parallel execution (pytest-xdist)
- Coverage reporting (pytest-cov)
- Django auto-detection and configuration
Jest (Coming Soon)
adapter: jest
jest:
  config_file: jest.config.js
Creating Custom Adapters
from systemeval.adapters import BaseAdapter, TestResult, TestItem


class MyAdapter(BaseAdapter):
    def discover(self, category=None, app=None, file=None) -> list[TestItem]:
        # Return discovered tests
        pass

    def execute(self, tests=None, **kwargs) -> TestResult:
        # Run tests and return results
        pass

    def get_available_markers(self) -> list[str]:
        # Return available categories/markers
        pass

    def validate_environment(self) -> bool:
        # Check framework is configured
        pass
Register in the adapter registry:
from systemeval.adapters import register_adapter
register_adapter("my-adapter", MyAdapter)
Docker Compose Environments
SystemEval provides first-class Docker Compose support with auto-discovery, lifecycle management, and remote Docker host support.
Quick Start
# Minimal config - auto-discovers everything from docker-compose.yml
environments:
  backend:
    type: docker-compose
SystemEval will automatically:
- Find compose files (docker-compose.yml, compose.yml, local.yml, etc.)
- Detect services with source mounts as test candidates
- Infer test commands from pytest.ini, package.json, etc.
- Extract health check endpoints from compose healthchecks
- Configure appropriate ports
Full Configuration
environments:
  backend:
    type: docker-compose
    compose_file: local.yml              # Compose file (auto-detected if omitted)
    services: [django, postgres, redis]  # Services to manage (all if omitted)
    test_service: django                 # Container to run tests in
    test_command: pytest                 # Test command (auto-detected)
    working_dir: .                       # Project directory
    # Health check (auto-detected from compose healthcheck)
    health_check:
      endpoint: /api/health/
      port: 8000
      timeout: 120
    # Remote Docker host (optional)
    docker:
      host: ssh://user@remote-server
      # Or use Docker context
      context: my-remote-context
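The `health_check` settings describe a poll-until-ready loop. The Python sketch below shows one way such a loop could work with only the standard library; it is not SystemEval's implementation. The names `http_ok` and `wait_until_healthy`, the 2-second polling interval, the assumption that `timeout` is in seconds, and the assumption that the service is reachable on `localhost` are all hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    url: str,
    timeout: float = 120.0,
    interval: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `probe(url)` until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the configuration above, a call might look like `wait_until_healthy("http://localhost:8000/api/health/", timeout=120)`. The `probe` parameter exists so the loop can be exercised without a running service.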
Attach Mode
Connect to already-running containers without lifecycle management:
environments:
  dev:
    type: docker-compose
    attach: true  # Skip build/up, just exec into running containers
    test_service: django
# Containers already running from docker compose up
systemeval test --env dev --attach
Auto-Discovery
SystemEval searches for compose files in priority order:
1. docker-compose.yml
2. docker-compose.yaml
3. compose.yml
4. compose.yaml
5. local.yml / local.yaml
6. dev.yml / dev.yaml
From the compose file, it infers:
- Test service: First service with source mount + build context
- Health port: From port mapping (e.g., 8000:8000 → port 8000)
- Health endpoint: From compose healthcheck command
- Test command: From pytest.ini, package.json, pyproject.toml
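The priority search over compose file names can be sketched in a few lines of Python. This is an illustration of the lookup order described above, not SystemEval's code; `find_compose_file` and `COMPOSE_CANDIDATES` are names chosen for the example, and the ordering of the `.yml` variant before `.yaml` within the `local` and `dev` pairs is an assumption.

```python
from pathlib import Path
from typing import Optional

# Candidate names in the priority order listed above.
COMPOSE_CANDIDATES = [
    "docker-compose.yml",
    "docker-compose.yaml",
    "compose.yml",
    "compose.yaml",
    "local.yml",
    "local.yaml",
    "dev.yml",
    "dev.yaml",
]


def find_compose_file(project_dir: str) -> Optional[Path]:
    """Return the first existing compose file in priority order, or None."""
    root = Path(project_dir)
    for name in COMPOSE_CANDIDATES:
        candidate = root / name
        if candidate.is_file():
            return candidate
    return None
```

If a project contains both `compose.yml` and `local.yml`, this lookup returns `compose.yml`; setting `compose_file` explicitly avoids relying on the order.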
CLI Commands
# Run tests in Docker environment
systemeval test --env backend
# Attach to running containers
systemeval test --env backend --attach
# Docker-specific commands
systemeval docker status # Show container status
systemeval docker logs [service] # View container logs
systemeval docker exec <cmd> # Execute command in test container
systemeval docker ready # Check if containers are healthy
Pre-flight Checks
Before starting containers, SystemEval validates:
- Docker binary is installed
- Docker daemon is running
- Docker Compose V2 is available
- Compose file exists and is valid YAML
- Referenced services exist in compose file
- Test service is defined
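The first three checks in this list can be approximated with standard tooling: `shutil.which` to locate the binary, `docker info` to reach the daemon, and `docker compose version` to confirm Compose V2. The Python sketch below is an assumption about how such checks could be written, not SystemEval's implementation; the function names and messages are hypothetical, and the compose-file checks are left out.

```python
import shutil
import subprocess
from typing import Callable, Optional


def _succeeds(cmd: list[str]) -> bool:
    """Return True if the command runs and exits with status 0."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=30).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def preflight(
    which: Callable[[str], Optional[str]] = shutil.which,
    succeeds: Callable[[list[str]], bool] = _succeeds,
) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    if which("docker") is None:
        # Without the binary the remaining checks cannot run.
        return ["docker binary not found on PATH"]
    problems = []
    if not succeeds(["docker", "info"]):
        problems.append("docker daemon is not reachable")
    if not succeeds(["docker", "compose", "version"]):
        problems.append("Docker Compose V2 is not available")
    return problems
```

The `which` and `succeeds` parameters are injection points so the logic can be checked on a machine without Docker installed.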
Remote Docker Hosts
Run tests against remote Docker daemons:
environments:
  staging:
    type: docker-compose
    docker:
      host: ssh://deploy@staging.example.com
    attach: true  # Usually attach to remote, don't manage lifecycle
Or use Docker contexts:
docker context create staging --docker "host=ssh://deploy@staging.example.com"
environments:
  staging:
    type: docker-compose
    docker:
      context: staging
Example Projects
See example-usage-projects/ for complete working examples:
| Project | Compose File | Stack | Test Framework |
|---|---|---|---|
| django-rest-api/ | docker-compose.yml | Django + Postgres + Redis | pytest |
| express-mongo-api/ | compose.yml | Express + MongoDB | jest |
| fastapi-react-fullstack/ | local.yml | FastAPI + React + Postgres + nginx | pytest + jest |
Standalone Environments
For non-Docker services:
environments:
  frontend:
    type: standalone
    command: npm run dev
    test_command: npm test
    ready_endpoint: http://localhost:3000
Running Tests
# Run in specific environment
systemeval test --env backend
# Run in default environment
systemeval test
# Keep containers running after tests
systemeval test --env backend --keep-running
Design Principles
- Deterministic: Same inputs always produce same verdict
- Objective: No subjective interpretation of results
- Traceable: Every run is uniquely identifiable
- Machine-First: JSON output designed for automation
- Framework-Agnostic: Adapters hide implementation details
- CI-Native: Exit codes and output formats for pipelines
Design Requirements
- Avoid embedding "magic" strings or numbers; prefer constants, YAML fields, or env vars so behavior is configurable.
- Break files that grow beyond ~600 lines into cohesive, testable pieces and keep functions short unless the domain demands special handling.
- Enforce single-responsibility layering: parsing, orchestration, and runtime helpers should be maintained in separate modules.
- Document intentional deviations so future agents understand why the rule was relaxed.
- Refer to ../docs/crawl-e2e-api-reference.md before wiring CLI integrations to reuse the documented crawl and E2E API shapes.
The Testing Philosophy
The Process
- Investigate Why Tests Missed It
- Write Test That FAILS
- Fix The Code
- Test Now PASSES
The Philosophy
Never fix a bug you can't reproduce in a test.
Comparison with Other Tools
| Feature | SystemEval | pytest | jest |
|---|---|---|---|
| Unified CLI | Yes | No | No |
| Framework agnostic | Yes | Python only | JS only |
| Strict verdicts | PASS/FAIL/ERROR | Exit codes vary | Exit codes vary |
| JSON schema | Versioned | Plugin required | Custom |
| Environment orchestration | Built-in | External | External |
Contributing
See the adapter documentation in systemeval/adapters/README.md for details on extending SystemEval.
Links
- Homepage: debugg.ai
- Documentation: debugg.ai/docs/systemeval
- Repository: github.com/debugg-ai/systemeval
- PyPI: pypi.org/project/systemeval
License
MIT License - see LICENSE for details.