Code Eval

Automated evaluation pipeline for AI-generated code. Supports two evaluation modes — full-project eval and lightweight snippet — covering Python and Java (Maven).

Two Modes

code-eval eval
  • Purpose: full-project evaluation with tests, lint, security, and complexity
  • Input: directory / file paths / git diff
  • Scanners: all 9 scanners (incl. test runners & dependency auditors)
  • Scoring: 4 dimensions (correctness, quality, security, maintainability)
  • Output: evaluation.json, a full report with metrics, issues, and scores
  • Use case: CI/CD pipelines, batch project evaluation

code-eval snippet
  • Purpose: quick static analysis of a single code snippet
  • Input: inline code (-c) or single file (--file)
  • Scanners: static-analysis only (no pytest / maven-test / pip-audit)
  • Scoring: 3 dimensions (quality, security, maintainability; no correctness)
  • Output: compact SnippetResult JSON with a 0-100 score and issues
  • Use case: code review, quick checks, editor integration

Features

  • Two evaluation modes: eval (project) and snippet (single file / inline code)
  • Three input modes (eval): directory, file path, git-diff
  • Two language adapters: Python + Java (Maven)
  • Nine scanners:
    • Python: pytest, ruff, bandit, radon, pip-audit
    • Java: maven-test, java-lint, java-security, java-complexity
  • Multi-dimensional scoring: correctness (0.40), quality (0.25), security (0.20), maintainability (0.15)
  • Two-layer diff awareness: file-level + line-level tracking (in_diff tagging)
  • Configurable Docker sandbox: optional container isolation with resource limits
  • Batch evaluation: concurrent target processing with progress reporting
  • Structured output: evaluation.json with metrics, issues, scores, and summary

Installation

pip install code-eval

Or install from source:

pip install -e .

Mode 1: code-eval eval

Full-project evaluation — runs all scanners (tests, lint, security, complexity) and produces a comprehensive structured report.

Directory mode

Evaluate a project directory (language auto-detected by markers such as pyproject.toml or pom.xml):

code-eval eval --targets ./my_project

File mode

Evaluate specific files:

code-eval eval --targets ./src/auth.py ./src/api.py

For Java, file mode also works (project root resolved via pom.xml):

code-eval eval --targets ./my-java-project/src/main/java/com/example/App.java

Git diff mode

Evaluate only files changed since main:

code-eval eval --git-diff --base main

Multiple targets

code-eval eval --targets ./project_a ./project_b

Save output to file

code-eval eval --targets ./my_project --output evaluation.json

Generate markdown summary

code-eval eval --targets ./my_project --output evaluation.json --summary summary.md

Custom configuration

code-eval eval --targets ./my_project --config .env.production

Eval Output Format

The evaluation.json output contains:

{
  "meta": {
    "timestamp": "2025-01-01T00:00:00Z",
    "pipeline_version": "0.1.0",
    "total_targets": 1,
    "total_duration_seconds": 5.2
  },
  "results": [
    {
      "target": "/path/to/project",
      "language": "python",
      "duration_seconds": 5.2,
      "scores": {
        "correctness": { "value": 0.85, "weight": 0.40, "detail": "17/20 tests passed" },
        "quality": { "value": 0.96, "weight": 0.25, "detail": "2 lint issues in diff" },
        "security": { "value": 1.0, "weight": 0.20, "detail": "No security issues" },
        "maintainability": { "value": 0.9, "weight": 0.15, "detail": "Average complexity: 6.2" },
        "overall": 0.91
      },
      "metrics": {
        "tests_total": 20,
        "tests_passed": 17,
        "tests_failed": 3,
        "lint_issues": 2,
        "security_issues": 0,
        "avg_complexity": 6.2,
        "files_evaluated": 8
      },
      "issues": [ "..." ]
    }
  ],
  "summary": {
    "avg_overall_score": 0.91,
    "total_issues": 5,
    "critical_issues": 0,
    "targets_passed": 1,
    "targets_failed": 0
  }
}
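A report in this schema can be consumed programmatically, for example to gate a CI pipeline on the aggregate score. A minimal sketch (the threshold and helper name are illustrative choices, not part of the package's API):

```python
import json


def gate(report_path: str, threshold: float = 0.8) -> bool:
    """Return True if an evaluation.json report passes a CI gate.

    Reads the schema shown above: summary.avg_overall_score and
    summary.critical_issues.
    """
    with open(report_path) as f:
        report = json.load(f)
    summary = report["summary"]
    # Fail on any critical issue, or an average overall score below threshold.
    return summary["critical_issues"] == 0 and summary["avg_overall_score"] >= threshold
```

In a CI job this would run after `code-eval eval --output evaluation.json` and convert the result into a pass/fail signal.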

Eval Scoring Dimensions

  • Correctness (weight 0.40, from pytest / maven-test): tests_passed / tests_total; no tests → 0.5; compilation failed → 0.0
  • Quality (weight 0.25, from ruff / java-lint): -0.02 per in-diff lint issue; -0.002 per out-of-diff issue
  • Security (weight 0.20, from bandit / java-security): deductions: critical -0.30, high -0.15, medium -0.05, low -0.02
  • Maintainability (weight 0.15, from radon / java-complexity): CC ≤ 5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0
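The overall score is a weighted sum of the dimension scores, with average complexity mapped through the piecewise-linear CC curve above. A sketch of that arithmetic (function names are illustrative, and the tool's exact interpolation and rounding may differ):

```python
def complexity_score(cc: float) -> float:
    """Map average cyclomatic complexity to [0, 1], per the table above:
    CC <= 5 -> 1.0; 5-15 -> linear 1.0 to 0.5; 15-25 -> linear 0.5 to 0.0."""
    if cc <= 5:
        return 1.0
    if cc <= 15:
        return 1.0 - 0.5 * (cc - 5) / 10
    if cc <= 25:
        return 0.5 - 0.5 * (cc - 15) / 10
    return 0.0


def overall_score(scores: dict, weights: dict) -> float:
    """Weighted sum over the four dimensions."""
    return sum(scores[d] * weights[d] for d in weights)


# Values from the example evaluation.json above:
weights = {"correctness": 0.40, "quality": 0.25, "security": 0.20, "maintainability": 0.15}
scores = {"correctness": 0.85, "quality": 0.96, "security": 1.0, "maintainability": 0.9}
```

With the example values, `overall_score(scores, weights)` is 0.915, consistent with the `"overall": 0.91` shown in the report.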

Mode 2: code-eval snippet

Lightweight snippet evaluation — runs static-analysis scanners only (no test runners or dependency auditors) and produces a compact result with a 0-100 score.

Inline code

Evaluate a code string directly:

code-eval snippet -c "import os; os.system('rm -rf /')" --lang python

File input

Evaluate a single code file:

code-eval snippet --file ./utils.py

Language is auto-detected from the file extension. You can override it:

code-eval snippet --file ./script.txt --lang python

Save snippet result

code-eval snippet -c "print('hello')" --lang python --output result.json

Snippet Output Format

The snippet result JSON is a compact schema:

{
  "language": "python",
  "file": "snippet.py",
  "duration_seconds": 0.45,
  "score": 85.0,
  "issues_count": 3,
  "issues": [
    {
      "id": "SNIPPET-001",
      "severity": "high",
      "type": "security",
      "message": "Possible shell injection via os.system()",
      "file": "snippet.py",
      "line": 1
    }
  ],
  "severity_summary": {
    "critical": 0,
    "high": 1,
    "medium": 1,
    "low": 1,
    "info": 0
  }
}

Snippet Scoring Dimensions

Snippet mode uses 3 dimensions (no correctness, since there are no tests):

  • Quality (weight 0.40, from ruff / java-lint): -0.02 per lint issue
  • Security (weight 0.35, from bandit / java-security): deductions: critical -0.30, high -0.15, medium -0.05, low -0.02
  • Maintainability (weight 0.25, from radon / java-complexity): CC ≤ 5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0
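The security deductions in both modes follow the same pattern: start from 1.0, subtract a fixed penalty per finding by severity, and floor at zero. A sketch, assuming the listed penalties apply once per issue:

```python
# Per-issue penalties from the scoring tables above.
PENALTIES = {"critical": 0.30, "high": 0.15, "medium": 0.05, "low": 0.02}


def security_score(severity_counts: dict) -> float:
    """1.0 minus per-issue deductions, floored at 0.0."""
    deduction = sum(PENALTIES.get(sev, 0.0) * n for sev, n in severity_counts.items())
    return max(0.0, 1.0 - deduction)
```

With the severity_summary from the snippet example above (one high, one medium, one low), this yields 1.0 - 0.15 - 0.05 - 0.02 = 0.78.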

Snippet Scanners by Language

  • Python: ruff, bandit, radon
  • Java: java-lint, java-security, java-complexity

Note: Test runners (pytest, maven-test) and dependency auditors (pip-audit) are excluded from snippet mode since snippets have no project structure.

Exit Codes (snippet)

  • 0: no critical or high severity issues
  • 1: at least one critical or high severity issue found
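This exit-code convention makes snippet mode easy to script; the same gate can also be reproduced from the saved JSON output. A sketch over the severity_summary field shown earlier (helper name is illustrative):

```python
def snippet_exit_code(severity_summary: dict) -> int:
    """0 if no critical/high findings, 1 otherwise, matching the table above."""
    blocking = severity_summary.get("critical", 0) + severity_summary.get("high", 0)
    return 1 if blocking > 0 else 0
```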

Configuration

Create a .env file (see .env.example) to customize behavior:

# Sandbox
SANDBOX_ENABLED=false              # Global toggle (default: false)
SANDBOX_PYTHON_ENABLED=true        # Per-language override
SANDBOX_JAVA_ENABLED=              # Per-language override for Java
SANDBOX_MEMORY_LIMIT=512m          # Docker memory limit
SANDBOX_CPU_LIMIT=1                # Docker CPU limit
SANDBOX_TIMEOUT=300                # Total timeout in seconds
SANDBOX_NETWORK=none               # Docker network mode

# Concurrency
MAX_CONCURRENT=4                   # Max parallel evaluations

# Issue limits
MAX_ISSUES_PER_TARGET=50           # Max issues per target in report

# Scoring weights (auto-normalized if they don't sum to 1.0)
SCORE_WEIGHT_CORRECTNESS=0.40
SCORE_WEIGHT_QUALITY=0.25
SCORE_WEIGHT_SECURITY=0.20
SCORE_WEIGHT_MAINTAINABILITY=0.15

# Java / Maven
JAVA_MVN_PATH=                     # Optional mvn path (fallback: PATH lookup)
JAVA_MVN_SETTINGS=                 # Optional settings.xml
JAVA_MVN_TIMEOUT=300               # Maven timeout in seconds
JAVA_MVN_SKIP_TESTS=false          # If true, run compile instead of test
JAVA_MVN_THREADS=                  # Optional -T value (e.g. 2C)
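Auto-normalization of the SCORE_WEIGHT_* values means the four weights are divided by their sum before use, so any positive numbers express relative importance. A sketch of that normalization (not the package's actual code):

```python
def normalize_weights(weights: dict) -> dict:
    """Scale weights so they sum to 1.0, preserving relative proportions."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

For example, setting the weights to 2 / 1 / 1 / 1 would behave the same as 0.4 / 0.2 / 0.2 / 0.2.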

Sandbox resolution order

For each language: per-language override → global toggle → default (false)

Example: SANDBOX_ENABLED=false + SANDBOX_PYTHON_ENABLED=true → Python runs in sandbox, others run directly.
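The resolution order amounts to a three-step fallback. A sketch with illustrative names (the package's internal config API may differ):

```python
from typing import Optional


def sandbox_enabled(per_lang: Optional[bool], global_toggle: Optional[bool]) -> bool:
    """Per-language override wins; then the global toggle; then the default (False)."""
    if per_lang is not None:
        return per_lang
    if global_toggle is not None:
        return global_toggle
    return False
```

Under the example above, Python resolves as `sandbox_enabled(True, False)` (sandboxed) while Java, with no per-language value set, resolves as `sandbox_enabled(None, False)` (direct execution).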

Docker Sandbox

To build the evaluation Docker image:

docker build -f docker/Dockerfile.python -t code-eval-python .

Enable sandbox in .env:

SANDBOX_ENABLED=true

Project Structure

code_eval/
├── __init__.py
├── cli.py              # Click CLI entry point (eval + snippet sub-commands)
├── config.py           # Configuration from .env
├── adapters/           # Language adapter interface + Python/Java implementations
├── core/               # Runner, scheduler, sandbox, models
├── extractors/         # Issue extractors (Python + Java)
├── reporting/          # JSON & markdown report generation
├── resolvers/          # Target resolution & language detection
├── scanners/           # Scanner interface + Python/Java scanner implementations
├── schemas/            # Pydantic data models (Issue, Metrics, EvaluationReport, SnippetResult)
├── scoring/            # Score computation
└── snippet/            # Snippet-mode runner & scanner selection

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
python -m pytest tests/ -v

License

MIT
