Optelier
DX-first evaluator: submission → evaluation → score
A minimal, local-first framework for evaluating code submissions against problems. Start with simple toy problems and grow toward safe execution later.
Installation
pip install optelier
For development:
git clone <repo-url>
cd optelier
pip install -e .
Optional: Install Docker for maximum security isolation (required for DockerProblem):
- Docker Desktop for Mac/Windows
- Docker Engine for Linux
Quick Start
Using the CLI
# Sum problem with a Python script submission
optelier run sum-json --submission examples/candidates/sum_candidate.py --print-artifacts
# Regex problem with a JSON submission
optelier run text-regex --submission examples/candidates/regex.json --print-artifacts
Using the API
from optelier import evaluate, load_submission
from optelier.problems import REGISTRY
# Create problem with fixtures directory
problem = REGISTRY["sum-json"](fixtures_dir="examples/fixtures")
# Load submission
submission = load_submission("examples/candidates/sum_candidate.py")
# Evaluate - it's that simple!
result = evaluate(problem, submission)
# Access results
print(f"Score: {result.score}, Time: {result.time_ms}ms, OK: {result.ok}")
# Or unpack as tuple (Gym-style)
score, ok, artifacts, error, time_ms = result
print(f"Score: {score}, OK: {ok}")
Core Concepts
Problem Protocol
Minimal API: Problems implement a single evaluate() method.
from optelier import EvalResult

class MyProblem:
    name = "my-problem"

    def __init__(self, data_dir):
        # Configure with whatever you need
        self.data_dir = data_dir

    def evaluate(self, submission) -> dict | float | EvalResult:
        # submission can be ANY format you define (path, dict, object, etc.)

        # Option 1: Just artifacts (no score - useful for analysis/reports)
        return {"backtest_results": ..., "metrics": ...}

        # Option 2: Just score (simple case)
        return 0.95

        # Option 3: Both score and artifacts
        return EvalResult(score=0.95, artifacts={"details": ...})
Key principles:
- One method: evaluate(submission) -> dict | float | EvalResult
- You choose submission format: Path, dict, object - whatever makes sense
- Pass config via init: No forced abstractions
- Scoring is optional: Return just artifacts for analysis, score later
- Flexible returns: Dict (artifacts), float (score), or EvalResult (both)
Submission Format
You decide! Each problem defines what a submission is:
- Path to file: problem.evaluate("submission.py")
- Dictionary: problem.evaluate({"code": "...", "params": {...}})
- Custom object: problem.evaluate(MySubmission(...))
The load_submission() utility is optional:
- .py files → {"kind": "python_script", "code": "..."}
- .json files → parsed dict
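Concretely, with the bundled example files, the mapping above looks roughly like this (a quick sketch):
from optelier import load_submission

# A .py file is wrapped into a dict carrying its source code
sub = load_submission("examples/candidates/sum_candidate.py")
print(sub["kind"])      # "python_script"
print(sub["code"])      # the script's source text

# A .json file is parsed into a plain dict
sub = load_submission("examples/candidates/regex.json")
print(sub["pattern"])   # whatever pattern the example defines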
EvalResult (NamedTuple)
from optelier import EvalResult
# Create results
result = EvalResult(score=0.95)
result = EvalResult(score=0.95, artifacts={"details": ...})
# Access by name
print(result.score) # 0.95
print(result.ok) # True
print(result.artifacts) # {...}
print(result.time_ms) # Timing added by evaluate()
# Or unpack as tuple (Gym-style)
score, ok, artifacts, error, time_ms = result
Fields:
- score: Numeric score (float, -inf if not provided)
- ok: Whether evaluation succeeded (bool)
- artifacts: Data produced by evaluation (dict)
- error: Error traceback if failed (str | None)
- time_ms: Execution time in milliseconds (int)
Note: If your evaluate() returns just a dict (artifacts only), score will be -inf. This is useful for:
- Backtesting / simulation workflows where you analyze results before scoring
- Data collection / exploratory analysis
- Applying multiple scoring functions to the same artifacts
- Generating reports without immediate scoring
See examples/artifacts_only_example.py for a complete example.
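As a minimal sketch of that pattern (the problem, artifact keys, and values below are all hypothetical), you can evaluate once and apply several scoring functions afterwards:
from optelier import evaluate

class BacktestProblem:                     # hypothetical problem
    name = "backtest"

    def evaluate(self, submission):
        # Returns artifacts only; scoring happens later
        return {"metrics": {"sharpe": 1.3, "max_drawdown": 0.12}}

result = evaluate(BacktestProblem(), {"params": {}})
print(result.score)                        # -inf: nothing scored yet
print(result.ok)                           # expected: True

# Apply as many scoring functions as you like to the same artifacts
def sharpe_score(artifacts):
    return artifacts["metrics"]["sharpe"]

def drawdown_score(artifacts):
    return -artifacts["metrics"]["max_drawdown"]

scores = {fn.__name__: fn(result.artifacts) for fn in (sharpe_score, drawdown_score)}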
Security
Optelier provides three levels of code execution security for evaluating Python code submissions. Choose the level that matches your threat model and performance requirements.
Security Levels
| Level | Class | Isolation | Overhead | Use Case |
|---|---|---|---|---|
| None | SecureCodeProblem | In-process | ~0ms | Trusted internal code, maximum speed |
| Medium | SubprocessProblem | Process isolation | ~400ms | Semi-trusted teams, internal tools |
| Maximum | DockerProblem | Container isolation | ~600ms | Untrusted external code, competitions |
All three levels include AST-based validation to prevent malicious imports and dangerous builtins.
Quick Start: Secure Code Evaluation
Level 1: In-Process (Fast, Trusted Code)
from optelier.problems.secure import SecureCodeProblem
class FastFunctionProblem(SecureCodeProblem):
    def __init__(self):
        super().__init__(
            name="fast_math",
            allowed_imports=set(),  # No imports needed
        )

    def setup_environment(self):
        return {"x": 10, "y": 20}

    def validate_output(self, namespace):
        if "result" not in namespace:
            raise ValueError("Must assign to 'result'")
        score = 1.0 if namespace["result"] == 30 else 0.0
        return score, {"result": namespace["result"]}

    def _execute_code(self, code, environment):
        namespace = environment.copy()
        exec(code, namespace)  # In-process execution
        return namespace
# Use it
problem = FastFunctionProblem()
result = problem.evaluate({"code": "result = x + y"})
print(f"Score: {result.score}") # 1.0
Level 2: Subprocess (Medium Security)
from optelier.problems.secure import SubprocessProblem
class FeatureProblem(SubprocessProblem):
    def __init__(self, data):
        super().__init__(
            name="feature_eng",
            timeout=30,
            allowed_imports={"pandas", "numpy"},
        )
        self.data = data

    def setup_environment(self):
        return {"df": self.data}

    def validate_output(self, namespace):
        if "feature" not in namespace:
            raise ValueError("Must assign to 'feature'")
        feature = namespace["feature"]
        variance = feature.var()
        score = min(1.0, variance / 1.0)
        return score, {
            "variance": variance,
            "null_rate": feature.isna().mean(),
        }

# Use it
import pandas as pd
problem = FeatureProblem(pd.DataFrame({"price": [100, 200, 150]}))
result = problem.evaluate({
    "code": "feature = df['price'].pct_change()"
})
print(f"Score: {result.score}")
Level 3: Docker (Maximum Security)
from optelier.problems.secure import DockerProblem
class UntrustedCodeProblem(DockerProblem):
    def __init__(self, test_cases):
        super().__init__(
            name="untrusted_challenge",
            image="python:3.10-slim",
            timeout=30,
            memory_limit="256m",
        )
        self.test_cases = test_cases  # [(inputs, expected), ...]

    def setup_environment(self):
        return {}  # No environment needed

    def validate_output(self, namespace):
        if "solve" not in namespace:
            raise ValueError("Must define 'solve' function")
        func = namespace["solve"]
        passed = sum(1 for inputs, expected in self.test_cases
                     if func(*inputs) == expected)
        score = passed / len(self.test_cases)
        return score, {
            "passed": passed,
            "total": len(self.test_cases),
        }

# Use it
problem = UntrustedCodeProblem([
    ((2, 3), 5),
    ((10, 5), 15),
])
result = problem.evaluate({
    "code": "def solve(a, b): return a + b"
})
print(f"Score: {result.score}") # 1.0
Security Features
AST-Based Validation (All Levels):
- Whitelist allowed imports (e.g., pandas, numpy)
- Blacklist dangerous builtins (eval, exec, compile, __import__, open)
- Code length limits
- Syntax validation
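For instance, reusing the FastFunctionProblem from Level 1, a submission that touches a blocked import or builtin should be rejected before it runs; this sketch assumes the failure surfaces as ok=False with the reason in error rather than as an exception:
problem = FastFunctionProblem()

# Uses a non-whitelisted import and a blacklisted builtin
bad_submission = {"code": "import os\nresult = eval('x + y')"}

result = problem.evaluate(bad_submission)
print(result.ok)     # expected: False (AST validation rejects the code)
print(result.error)  # expected: message naming the blocked import/builtin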
Subprocess Isolation (Medium):
- Separate process space (crashes don't affect parent)
- Configurable timeout
- Pickle serialization (supports DataFrames, complex objects)
- Automatic cleanup
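To illustrate the timeout, here is a minimal problem with a short limit; the sketch assumes a run that exceeds the limit comes back as a failed EvalResult rather than raising (see the Security Guide for the exact behavior):
from optelier.problems.secure import SubprocessProblem

class TimeoutDemo(SubprocessProblem):
    def __init__(self):
        super().__init__(name="timeout_demo", timeout=2, allowed_imports=set())

    def setup_environment(self):
        return {}

    def validate_output(self, namespace):
        return 1.0, {}

result = TimeoutDemo().evaluate({"code": "while True:\n    pass"})
print(result.ok)     # expected: False once the 2-second timeout is hit
print(result.error)  # expected: a timeout message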
Docker Isolation (Maximum):
- Complete filesystem isolation
- Network disabled by default
- Memory and CPU limits enforced
- Read-only execution
- Automatic container cleanup
When to Use Each Level
Use SecureCodeProblem (in-process) when:
- Code is from trusted internal developers
- Maximum speed is critical
- AST validation provides sufficient safety
Use SubprocessProblem when:
- Code is from semi-trusted sources (internal teams, controlled access)
- You need crash isolation
- Working with DataFrames or complex objects
- ~400ms overhead is acceptable
Use DockerProblem when:
- Code is from untrusted external sources
- Running competitions or public APIs
- Maximum isolation is required
- ~600ms overhead is acceptable
- Network access must be blocked
Learn More
- Security Guide - Detailed security documentation
- Migration Guide - Migrate from other evaluation frameworks
- Validation Helpers - Reusable validation utilities
Included Problems
sum-json
Evaluates Python scripts that read a JSON file with numbers and output their sum.
Contract:
- Input: numbers.json with {"numbers": [...]}
- Script receives: python script.py <input> <output>
- Expected output: {"sum": <int>}
Example submission: examples/candidates/sum_candidate.py
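For reference, a script that satisfies this contract could look like the sketch below (the bundled sum_candidate.py may differ in its details):
# sum-json submission sketch: invoked as `python script.py <input> <output>`
import json
import sys

with open(sys.argv[1]) as f:
    numbers = json.load(f)["numbers"]

with open(sys.argv[2], "w") as f:
    json.dump({"sum": sum(numbers)}, f)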
text-regex
Evaluates regex patterns (or Python scripts that output patterns) against a labeled text dataset.
Contract:
- Input: text.txt with lines labeled OK: (match) or NO: (no match)
- Submission provides: {"pattern": "regex"} or a Python script printing the pattern
- Scoring: F1 score of matches
Example submission: examples/candidates/regex.json
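A JSON submission is simply {"pattern": "..."}. As a rough, non-authoritative sketch of how F1 over such a fixture can be computed (the framework's own parsing of text.txt may differ):
import re

def f1_for_pattern(pattern, lines):
    # lines look like "OK:some text" or "NO:other text" (assumed format)
    tp = fp = fn = 0
    for line in lines:
        label, text = line.split(":", 1)
        matched = re.search(pattern, text) is not None
        if matched and label == "OK":
            tp += 1
        elif matched and label == "NO":
            fp += 1
        elif not matched and label == "OK":
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0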
Integrating Existing Evaluators
Don't rewrite your code! Wrap any existing evaluator in ~15 lines:
# Wrap a simple function - NO context needed!
from your_legacy_code import evaluate_submission
class MyAdapter:
    name = "my-evaluator"

    def __init__(self, data_dir):
        self.data_dir = data_dir  # Pass config via init

    def evaluate(self, submission_path: str) -> float:
        # Just call your existing function!
        return evaluate_submission(submission_path)
That's it. No Context objects, no two-step evaluation, no forced dict formats.
Supported Patterns
- ✅ Functions: Single callable returning a score
- ✅ Classes: Evaluator with __init__ and evaluate() methods
- ✅ CLI Tools: Command-line scripts outputting JSON/scores (see the sketch after this list)
- ✅ Test Suites: pytest, unittest, or custom test runners
- ✅ ML Pipelines: scikit-learn, PyTorch, TensorFlow evaluations
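For example, a command-line grader can be wrapped like this; the tool path and its JSON output format are hypothetical:
import json
import subprocess

from optelier import EvalResult

class CliToolAdapter:
    name = "cli-evaluator"

    def __init__(self, tool_path):
        self.tool_path = tool_path  # e.g. "./grade.sh" (hypothetical tool)

    def evaluate(self, submission_path: str) -> EvalResult:
        # Assumes the tool prints JSON like {"score": 0.87, "report": {...}}
        proc = subprocess.run(
            [self.tool_path, submission_path],
            capture_output=True, text=True, check=True,
        )
        payload = json.loads(proc.stdout)
        return EvalResult(score=payload["score"], artifacts=payload.get("report", {}))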
Documentation
- Adapter Patterns - Complete guide to all adapter types
- Integration Guide - Step-by-step integration (30-60 min)
- Dockerization - Containerize evaluators for isolation & scale
Example Adapters
See examples/adapters/ for working examples:
- callable_adapter.py - Wrap simple functions
- class_adapter.py - Wrap class-based evaluators with scoring policies
- cli_adapter.py - Wrap command-line tools
See examples/legacy/ for mock legacy code you can adapt.
Extending
Adding New Problems
- Create a new problem class in src/optelier/problems/:
# src/optelier/problems/my_problem.py
class MyProblem:
    name = "my-problem"

    def __init__(self, fixtures_dir):
        self.fixtures_dir = fixtures_dir  # pass config via init

    def evaluate(self, submission):
        # Your execution and scoring logic
        # Return artifacts (dict), a score (float), or an EvalResult
        return 1.0
- Register it in src/optelier/problems/__init__.py:
from .my_problem import MyProblem
REGISTRY = {
    # ... existing problems
    "my-problem": MyProblem,
}
- Add fixtures to examples/fixtures/ and example submissions to examples/candidates/
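Once registered, the problem plugs into the same API (and the optelier run CLI) as the built-in ones; the submission path below is a placeholder:
from optelier import evaluate, load_submission
from optelier.problems import REGISTRY

problem = REGISTRY["my-problem"](fixtures_dir="examples/fixtures")
submission = load_submission("examples/candidates/my_candidate.py")  # placeholder path
result = evaluate(problem, submission)
print(result.score, result.ok)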
Adding Execution Environments
The framework is designed to grow:
- Add time/memory limits
- Add Docker/WASM runners
- Add parallel execution
- Add result caching
- Add leaderboard storage
Start simple with local execution, then add safety layers as needed.
Project Structure
optelier/
├── src/optelier/
│ ├── __init__.py # Main exports
│ ├── core.py # Evaluation framework
│ ├── cli.py # CLI interface
│ └── problems/
│ ├── __init__.py # Problem registry
│ ├── secure.py # Secure code evaluation base classes
│ ├── validation.py # Validation helpers
│ ├── sum_json.py # Sum problem
│ ├── text_regex.py # Regex problem
│ └── examples/ # Example problem implementations
│ ├── feature.py # Feature engineering example
│ ├── untrusted.py # Untrusted code example
│ └── function_test.py # Function testing example
├── examples/
│ ├── adapters/ # Adapter pattern examples
│ │ ├── callable_adapter.py
│ │ ├── class_adapter.py
│ │ └── cli_adapter.py
│ ├── legacy/ # Mock legacy code to adapt
│ │ ├── simple_evaluator.py
│ │ └── class_evaluator.py
│ ├── fixtures/ # Test data
│ │ ├── numbers.json
│ │ └── text.txt
│ └── candidates/ # Example submissions
│ ├── sum_candidate.py
│ └── regex.json
├── docs/
│ ├── adapters.md # Adapter pattern guide
│ ├── integration-guide.md # Step-by-step integration
│ ├── dockerization.md # Docker & deployment guide
│ ├── security.md # Security guide (3 levels)
│ ├── migration.md # Migration from other frameworks
│ └── validation_helpers.md # Validation utilities
├── Dockerfile
├── pyproject.toml
└── README.md
License
MIT License - see LICENSE file for details.