Code Eval
Automated evaluation pipeline for AI-generated code. Supports two evaluation modes — full-project eval and lightweight snippet — covering Python and Java (Maven).
Two Modes
| | code-eval eval | code-eval snippet |
|---|---|---|
| Purpose | Full-project evaluation with tests, lint, security, and complexity | Quick static analysis of a single code snippet |
| Input | Directory / file paths / git diff | Inline code (-c) or single file (--file) |
| Scanners | All 9 scanners (incl. test runners & dependency auditors) | Static-analysis only (no pytest / maven-test / pip-audit) |
| Scoring | 4 dimensions: correctness, quality, security, maintainability | 3 dimensions: quality, security, maintainability (no correctness) |
| Output | evaluation.json — full report with metrics, issues, scores | Compact SnippetResult JSON with score (0-100) and issues |
| Use Case | CI/CD pipelines, batch project evaluation | Code review, quick checks, editor integration |
Features
- Two evaluation modes: eval (project) and snippet (single file / inline code)
- Three input modes (eval): directory, file path, git diff
- Two language adapters: Python + Java (Maven)
- Nine scanners:
  - Python: pytest, ruff, bandit, radon, pip-audit
  - Java: maven-test, java-lint, java-security, java-complexity
- Multi-dimensional scoring: correctness (0.40), quality (0.25), security (0.20), maintainability (0.15)
- Two-layer diff awareness: file-level + line-level tracking (in_diff tagging)
- Configurable Docker sandbox: optional container isolation with resource limits
- Batch evaluation: concurrent target processing with progress reporting
- Structured output: evaluation.json with metrics, issues, scores, and summary
Installation
pip install code-eval
Or install from source:
pip install -e .
Mode 1: code-eval eval
Full-project evaluation — runs all scanners (tests, lint, security, complexity) and produces a comprehensive structured report.
Directory mode
Evaluate a project directory (language auto-detected by markers such as pyproject.toml or pom.xml):
code-eval eval --targets ./my_project
File mode
Evaluate specific files:
code-eval eval --targets ./src/auth.py ./src/api.py
For Java, file mode also works (project root resolved via pom.xml):
code-eval eval --targets ./my-java-project/src/main/java/com/example/App.java
Git diff mode
Evaluate only files changed since main:
code-eval eval --git-diff --base main
Multiple targets
code-eval eval --targets ./project_a ./project_b
Save output to file
code-eval eval --targets ./my_project --output evaluation.json
Generate markdown summary
code-eval eval --targets ./my_project --output evaluation.json --summary summary.md
Custom configuration
code-eval eval --targets ./my_project --config .env.production
Eval Output Format
The evaluation.json output contains:
{
"meta": {
"timestamp": "2025-01-01T00:00:00Z",
"pipeline_version": "0.1.0",
"total_targets": 1,
"total_duration_seconds": 5.2
},
"results": [
{
"target": "/path/to/project",
"language": "python",
"duration_seconds": 5.2,
"scores": {
"correctness": { "value": 0.85, "weight": 0.40, "detail": "17/20 tests passed" },
"quality": { "value": 0.96, "weight": 0.25, "detail": "2 lint issues in diff" },
"security": { "value": 1.0, "weight": 0.20, "detail": "No security issues" },
"maintainability": { "value": 0.9, "weight": 0.15, "detail": "Average complexity: 6.2" },
"overall": 0.91
},
"metrics": {
"tests_total": 20,
"tests_passed": 17,
"tests_failed": 3,
"lint_issues": 2,
"security_issues": 0,
"avg_complexity": 6.2,
"files_evaluated": 8
},
"issues": [ "..." ]
}
],
"summary": {
"avg_overall_score": 0.91,
"total_issues": 5,
"critical_issues": 0,
"targets_passed": 1,
"targets_failed": 0
}
}
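To gate a CI job on this report, a short script reading evaluation.json is enough. The sketch below assumes the schema shown above; the 0.80 threshold and the file path are only examples, not part of the tool.

```python
# ci_gate.py -- illustrative sketch, not part of code-eval itself.
# Reads the evaluation.json shown above and fails the build on a low
# average score or on any critical issue.
import json
import sys

THRESHOLD = 0.80  # example threshold, pick your own

with open("evaluation.json") as f:
    report = json.load(f)

summary = report["summary"]
score = summary["avg_overall_score"]
critical = summary["critical_issues"]

print(f"avg overall score: {score:.2f}, critical issues: {critical}")
if score < THRESHOLD or critical > 0:
    sys.exit(1)
```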
Eval Scoring Dimensions
| Dimension | Weight | Source | Scoring Logic |
|---|---|---|---|
| Correctness | 0.40 | pytest / maven-test | tests_passed / tests_total; no tests → 0.5; compilation failed → 0.0 |
| Quality | 0.25 | ruff / java-lint | -0.02 per in-diff lint issue; -0.002 per out-of-diff |
| Security | 0.20 | bandit / java-security | Deductions: critical -0.30, high -0.15, medium -0.05, low -0.02 |
| Maintainability | 0.15 | radon / java-complexity | CC≤5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
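The overall score is the weighted sum of the four dimension values. For concreteness, here is that aggregation recomputed from the example report above (illustrative only; the tool does this internally):

```python
# Illustrative: recompute the overall score from the example report above.
weights = {"correctness": 0.40, "quality": 0.25, "security": 0.20, "maintainability": 0.15}
values = {"correctness": 0.85, "quality": 0.96, "security": 1.00, "maintainability": 0.90}

overall = sum(weights[d] * values[d] for d in weights)
print(f"{overall:.3f}")  # ~0.915, in line with the 0.91 overall shown above
```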
Mode 2: code-eval snippet
Lightweight snippet evaluation — runs static-analysis scanners only (no test runners or dependency auditors) and produces a compact result with a 0-100 score.
Inline code
Evaluate a code string directly:
code-eval snippet -c "import os; os.system('rm -rf /')" --lang python
File input
Evaluate a single code file:
code-eval snippet --file ./utils.py
Language is auto-detected from the file extension. You can override it:
code-eval snippet --file ./script.txt --lang python
Save snippet result
code-eval snippet -c "print('hello')" --lang python --output result.json
Snippet Output Format
The snippet result JSON is a compact schema:
{
"language": "python",
"file": "snippet.py",
"duration_seconds": 0.45,
"score": 85.0,
"issues_count": 3,
"issues": [
{
"id": "SNIPPET-001",
"severity": "high",
"type": "security",
"message": "Possible shell injection via os.system()",
"file": "snippet.py",
"line": 1
}
],
"severity_summary": {
"critical": 0,
"high": 1,
"medium": 1,
"low": 1,
"info": 0
}
}
Snippet Scoring Dimensions
Snippet mode uses 3 dimensions (no correctness, since there are no tests):
| Dimension | Weight | Source | Scoring Logic |
|---|---|---|---|
| Quality | 0.40 | ruff / java-lint | -0.02 per lint issue |
| Security | 0.35 | bandit / java-security | Deductions: critical -0.30, high -0.15, medium -0.05, low -0.02 |
| Maintainability | 0.25 | radon / java-complexity | CC≤5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
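The 0-100 snippet score can be thought of as the weighted 0-1 score scaled by 100. That scaling is an assumption made for the illustration below, and the dimension values are made up rather than produced by the tool:

```python
# Rough illustration of snippet scoring (assumes: final score = weighted 0-1 score * 100).
quality = 1.0 - 0.02 * 2     # e.g. two lint findings
security = 1.0 - 0.15        # e.g. one high-severity finding
maintainability = 1.0        # e.g. all functions at CC <= 5

weighted = 0.40 * quality + 0.35 * security + 0.25 * maintainability
print(round(weighted * 100))  # about 93 on these made-up inputs
```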
Snippet Scanners by Language
| Language | Scanners |
|---|---|
| Python | ruff, bandit, radon |
| Java | java-lint, java-security, java-complexity |
Note: Test runners (pytest, maven-test) and dependency auditors (pip-audit) are excluded from snippet mode since snippets have no project structure.
Exit Codes (snippet)
| Code | Meaning |
|---|---|
| 0 | No critical or high severity issues |
| 1 | At least one critical or high severity issue found |
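Because of these exit codes, snippet mode drops straight into scripts and hooks. A minimal sketch (the file name is just an example):

```python
# Run snippet mode on a file and act on the documented exit code.
# Assumes code-eval is installed and on PATH.
import subprocess
import sys

result = subprocess.run(["code-eval", "snippet", "--file", "utils.py", "--lang", "python"])

if result.returncode != 0:  # 1 = at least one critical/high severity issue
    print("code-eval snippet found critical/high severity issues", file=sys.stderr)
    sys.exit(result.returncode)
```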
Configuration
Create a .env file (see .env.example) to customize behavior:
# Sandbox
SANDBOX_ENABLED=false # Global toggle (default: false)
SANDBOX_PYTHON_ENABLED=true # Per-language override
SANDBOX_JAVA_ENABLED= # Per-language override for Java
SANDBOX_MEMORY_LIMIT=512m # Docker memory limit
SANDBOX_CPU_LIMIT=1 # Docker CPU limit
SANDBOX_TIMEOUT=300 # Total timeout in seconds
SANDBOX_NETWORK=none # Docker network mode
# Concurrency
MAX_CONCURRENT=4 # Max parallel evaluations
# Issue limits
MAX_ISSUES_PER_TARGET=50 # Max issues per target in report
# Scoring weights (auto-normalized if they don't sum to 1.0)
SCORE_WEIGHT_CORRECTNESS=0.40
SCORE_WEIGHT_QUALITY=0.25
SCORE_WEIGHT_SECURITY=0.20
SCORE_WEIGHT_MAINTAINABILITY=0.15
# Java / Maven
JAVA_MVN_PATH= # Optional mvn path (fallback: PATH lookup)
JAVA_MVN_SETTINGS= # Optional settings.xml
JAVA_MVN_TIMEOUT=300 # Maven timeout in seconds
JAVA_MVN_SKIP_TESTS=false # Run compile instead of test
JAVA_MVN_THREADS= # Optional -T value (e.g. 2C)
Sandbox resolution order
For each language: per-language override → global toggle → default (false)
Example: SANDBOX_ENABLED=false + SANDBOX_PYTHON_ENABLED=true → Python runs in sandbox, others run directly.
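The same precedence, written out as a small illustrative helper (not the actual implementation inside code_eval):

```python
# Illustrative only: the sandbox resolution order described above.
from typing import Optional

def sandbox_enabled(per_language: Optional[bool], global_toggle: Optional[bool]) -> bool:
    if per_language is not None:   # per-language override wins
        return per_language
    if global_toggle is not None:  # then the global toggle
        return global_toggle
    return False                   # default: sandbox off

# SANDBOX_ENABLED=false + SANDBOX_PYTHON_ENABLED=true:
print(sandbox_enabled(per_language=True, global_toggle=False))   # True  -> Python sandboxed
print(sandbox_enabled(per_language=None, global_toggle=False))   # False -> Java runs directly
```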
Docker Sandbox
To build the evaluation Docker image:
docker build -f docker/Dockerfile.python -t code-eval-python .
Enable sandbox in .env:
SANDBOX_ENABLED=true
Project Structure
code_eval/
├── __init__.py
├── cli.py # Click CLI entry point (eval + snippet sub-commands)
├── config.py # Configuration from .env
├── adapters/ # Language adapter interface + Python/Java implementations
├── core/ # Runner, scheduler, sandbox, models
├── extractors/ # Issue extractors (Python + Java)
├── reporting/ # JSON & markdown report generation
├── resolvers/ # Target resolution & language detection
├── scanners/ # Scanner interface + Python/Java scanner implementations
├── schemas/ # Pydantic data models (Issue, Metrics, EvaluationReport, SnippetResult)
├── scoring/ # Score computation
└── snippet/ # Snippet-mode runner & scanner selection
Development
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
python -m pytest tests/ -v
License