Code Eval
Automated evaluation pipeline for AI-generated code. Supports two evaluation modes — full-project eval and lightweight snippet — covering Python and Java (Maven).
Two Modes
| | code-eval eval | code-eval snippet |
|---|---|---|
| Purpose | Full-project evaluation with tests, lint, security, and complexity | Quick static analysis of a single code snippet |
| Input | Directory / file paths / git diff | Inline code (-c) or single file (--file) |
| Scanners | All 9 scanners (incl. test runners & dependency auditors) | Static-analysis only (no pytest / maven-test / pip-audit) |
| Scoring | 4 dimensions: correctness, quality, security, maintainability | 3 dimensions: quality, security, maintainability (no correctness) |
| Output | evaluation.json — full report with metrics, issues, scores | Compact SnippetResult JSON with score (0-100) and issues |
| Use Case | CI/CD pipelines, batch project evaluation | Code review, quick checks, editor integration |
Features
- Two evaluation modes: eval (project) and snippet (single file / inline code)
- Three input modes (eval): directory, file path, git diff
- Two language adapters: Python + Java (Maven)
- Nine scanners:
  - Python: pytest, ruff, bandit, radon, pip-audit
  - Java: maven-test, java-lint, java-security, java-complexity
- Multi-dimensional scoring: correctness (0.40), quality (0.25), security (0.20), maintainability (0.15)
- Two-layer diff awareness: file-level + line-level tracking (in_diff tagging)
- Configurable Docker sandbox: optional container isolation with resource limits
- Batch evaluation: concurrent target processing with progress reporting
- Structured output: evaluation.json with metrics, issues, scores, and summary
Installation
pip install code-eval
Or install from source:
pip install -e .
Mode 1: code-eval eval
Full-project evaluation — runs all scanners (tests, lint, security, complexity) and produces a comprehensive structured report.
Directory mode
Evaluate a project directory (language auto-detected by markers such as pyproject.toml or pom.xml):
code-eval eval --targets ./my_project
File mode
Evaluate specific files:
code-eval eval --targets ./src/auth.py ./src/api.py
For Java, file mode also works (project root resolved via pom.xml):
code-eval eval --targets ./my-java-project/src/main/java/com/example/App.java
Git diff mode
Evaluate only files changed since main:
code-eval eval --git-diff --base main
Multiple targets
code-eval eval --targets ./project_a ./project_b
Save output to file
code-eval eval --targets ./my_project --output evaluation.json
Generate markdown summary
code-eval eval --targets ./my_project --output evaluation.json --summary summary.md
Custom configuration
code-eval eval --targets ./my_project --config .env.production
Eval Output Format
The evaluation.json output contains:
{
"meta": {
"timestamp": "2025-01-01T00:00:00Z",
"pipeline_version": "0.1.0",
"total_targets": 1,
"total_duration_seconds": 5.2
},
"results": [
{
"target": "/path/to/project",
"language": "python",
"duration_seconds": 5.2,
"scores": {
"correctness": { "value": 0.85, "weight": 0.40, "detail": "17/20 tests passed" },
"quality": { "value": 0.96, "weight": 0.25, "detail": "2 lint issues in diff" },
"security": { "value": 1.0, "weight": 0.20, "detail": "No security issues" },
"maintainability": { "value": 0.9, "weight": 0.15, "detail": "Average complexity: 6.2" },
"overall": 0.91
},
"metrics": {
"tests_total": 20,
"tests_passed": 17,
"tests_failed": 3,
"lint_issues": 2,
"security_issues": 0,
"avg_complexity": 6.2,
"files_evaluated": 8
},
"issues": [ "..." ]
}
],
"summary": {
"avg_overall_score": 0.91,
"total_issues": 5,
"critical_issues": 0,
"targets_passed": 1,
"targets_failed": 0
}
}
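To gate a CI job on this report, a short script reading evaluation.json is enough. The sketch below assumes the schema shown above; the 0.80 threshold and the file path are only examples, not part of the tool.

```python
# ci_gate.py -- illustrative sketch, not part of code-eval itself.
# Reads the evaluation.json shown above and fails the build on a low
# average score or on any critical issue.
import json
import sys

THRESHOLD = 0.80  # example threshold, pick your own

with open("evaluation.json") as f:
    report = json.load(f)

summary = report["summary"]
score = summary["avg_overall_score"]
critical = summary["critical_issues"]

print(f"avg overall score: {score:.2f}, critical issues: {critical}")
if score < THRESHOLD or critical > 0:
    sys.exit(1)
```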
Eval Scoring Dimensions
| Dimension | Weight | Source | Scoring Logic |
|---|---|---|---|
| Correctness | 0.40 | pytest / maven-test | tests_passed / tests_total; no tests → 0.5; compilation failed → 0.0 |
| Quality | 0.25 | ruff / java-lint | -0.02 per in-diff lint issue; -0.002 per out-of-diff |
| Security | 0.20 | bandit / java-security | Deductions: critical -0.30, high -0.15, medium -0.05, low -0.02 |
| Maintainability | 0.15 | radon / java-complexity | CC≤5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
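The overall score is the weighted sum of the four dimension values. For concreteness, here is that aggregation recomputed from the example report above (illustrative only; the tool does this internally):

```python
# Illustrative: recompute the overall score from the example report above.
weights = {"correctness": 0.40, "quality": 0.25, "security": 0.20, "maintainability": 0.15}
values = {"correctness": 0.85, "quality": 0.96, "security": 1.00, "maintainability": 0.90}

overall = sum(weights[d] * values[d] for d in weights)
print(f"{overall:.3f}")  # ~0.915, in line with the 0.91 overall shown above
```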
Mode 2: code-eval snippet
Lightweight snippet evaluation — runs static-analysis scanners only (no test runners or dependency auditors) and produces a compact result with a 0-100 score.
Inline code
Evaluate a code string directly:
code-eval snippet -c "import os; os.system('rm -rf /')" --lang python
File input
Evaluate a single code file:
code-eval snippet --file ./utils.py
Language is auto-detected from the file extension. You can override it:
code-eval snippet --file ./script.txt --lang python
Save snippet result
code-eval snippet -c "print('hello')" --lang python --output result.json
Snippet Output Format
The snippet result JSON is a compact schema:
{
"language": "python",
"file": "snippet.py",
"duration_seconds": 0.45,
"score": 85.0,
"issues_count": 3,
"issues": [
{
"id": "SNIPPET-001",
"severity": "high",
"type": "security",
"message": "Possible shell injection via os.system()",
"file": "snippet.py",
"line": 1
}
],
"severity_summary": {
"critical": 0,
"high": 1,
"medium": 1,
"low": 1,
"info": 0
}
}
Snippet Scoring Dimensions
Snippet mode uses 3 dimensions (no correctness, since there are no tests):
| Dimension | Weight | Source | Scoring Logic |
|---|---|---|---|
| Quality | 0.40 | ruff / java-lint | -0.02 per lint issue |
| Security | 0.35 | bandit / java-security | Deductions: critical -0.30, high -0.15, medium -0.05, low -0.02 |
| Maintainability | 0.25 | radon / java-complexity | CC≤5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
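The 0-100 snippet score can be thought of as the weighted 0-1 score scaled by 100. That scaling is an assumption made for the illustration below, and the dimension values are made up rather than produced by the tool:

```python
# Rough illustration of snippet scoring (assumes: final score = weighted 0-1 score * 100).
quality = 1.0 - 0.02 * 2     # e.g. two lint findings
security = 1.0 - 0.15        # e.g. one high-severity finding
maintainability = 1.0        # e.g. all functions at CC <= 5

weighted = 0.40 * quality + 0.35 * security + 0.25 * maintainability
print(round(weighted * 100))  # about 93 on these made-up inputs
```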
Snippet Scanners by Language
| Language | Scanners |
|---|---|
| Python | ruff, bandit, radon |
| Java | java-lint, java-security, java-complexity |
Note: Test runners (pytest, maven-test) and dependency auditors (pip-audit) are excluded from snippet mode since snippets have no project structure.
Exit Codes (snippet)
| Code | Meaning |
|---|---|
| 0 | No critical or high severity issues |
| 1 | At least one critical or high severity issue found |
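Because of these exit codes, snippet mode drops straight into scripts and hooks. A minimal sketch (the file name is just an example):

```python
# Run snippet mode on a file and act on the documented exit code.
# Assumes code-eval is installed and on PATH.
import subprocess
import sys

result = subprocess.run(["code-eval", "snippet", "--file", "utils.py", "--lang", "python"])

if result.returncode != 0:  # 1 = at least one critical/high severity issue
    print("code-eval snippet found critical/high severity issues", file=sys.stderr)
    sys.exit(result.returncode)
```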
Configuration
Create a .env file (see .env.example) to customize behavior:
# Sandbox
SANDBOX_ENABLED=false # Global toggle (default: false)
SANDBOX_PYTHON_ENABLED=true # Per-language override
SANDBOX_JAVA_ENABLED= # Per-language override for Java
SANDBOX_MEMORY_LIMIT=512m # Docker memory limit
SANDBOX_CPU_LIMIT=1 # Docker CPU limit
SANDBOX_TIMEOUT=300 # Total timeout in seconds
SANDBOX_NETWORK=none # Docker network mode
# Concurrency
MAX_CONCURRENT=4 # Max parallel evaluations
# Issue limits
MAX_ISSUES_PER_TARGET=50 # Max issues per target in report
# Scoring weights (auto-normalized if they don't sum to 1.0)
SCORE_WEIGHT_CORRECTNESS=0.40
SCORE_WEIGHT_QUALITY=0.25
SCORE_WEIGHT_SECURITY=0.20
SCORE_WEIGHT_MAINTAINABILITY=0.15
# Java / Maven
JAVA_MVN_PATH= # Optional mvn path (fallback: PATH lookup)
JAVA_MVN_SETTINGS= # Optional settings.xml
JAVA_MVN_TIMEOUT=300 # Maven timeout in seconds
JAVA_MVN_SKIP_TESTS=false # Run compile instead of test
JAVA_MVN_THREADS= # Optional -T value (e.g. 2C)
Sandbox resolution order
For each language: per-language override → global toggle → default (false)
Example: SANDBOX_ENABLED=false + SANDBOX_PYTHON_ENABLED=true → Python runs in sandbox, others run directly.
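The same precedence, written out as a small illustrative helper (not the actual implementation inside code_eval):

```python
# Illustrative only: the sandbox resolution order described above.
from typing import Optional

def sandbox_enabled(per_language: Optional[bool], global_toggle: Optional[bool]) -> bool:
    if per_language is not None:   # per-language override wins
        return per_language
    if global_toggle is not None:  # then the global toggle
        return global_toggle
    return False                   # default: sandbox off

# SANDBOX_ENABLED=false + SANDBOX_PYTHON_ENABLED=true:
print(sandbox_enabled(per_language=True, global_toggle=False))   # True  -> Python sandboxed
print(sandbox_enabled(per_language=None, global_toggle=False))   # False -> Java runs directly
```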
Docker Sandbox
To build the evaluation Docker image:
docker build -f docker/Dockerfile.python -t code-eval-python .
Enable sandbox in .env:
SANDBOX_ENABLED=true
Project Structure
code_eval/
├── __init__.py
├── cli.py # Click CLI entry point (eval + snippet sub-commands)
├── config.py # Configuration from .env
├── adapters/ # Language adapter interface + Python/Java implementations
├── core/ # Runner, scheduler, sandbox, models
├── extractors/ # Issue extractors (Python + Java)
├── reporting/ # JSON & markdown report generation
├── resolvers/ # Target resolution & language detection
├── scanners/ # Scanner interface + Python/Java scanner implementations
├── schemas/ # Pydantic data models (Issue, Metrics, EvaluationReport, SnippetResult)
├── scoring/ # Score computation
└── snippet/ # Snippet-mode runner & scanner selection
Development
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
python -m pytest tests/ -v
License