
Optelier

DX-first evaluator: submission → evaluation → score

A minimal, local-first framework for evaluating code submissions against problems. Start with simple toy problems and grow toward safe execution later.

Installation

pip install optelier

For development:

git clone <repo-url>
cd optelier
pip install -e .

Optional: install Docker for maximum security isolation (required for DockerProblem).

Quick Start

Using the CLI

# Sum problem with a Python script submission
optelier run sum-json --submission examples/candidates/sum_candidate.py --print-artifacts

# Regex problem with a JSON submission
optelier run text-regex --submission examples/candidates/regex.json --print-artifacts

Using the API

from optelier import evaluate, load_submission
from optelier.problems import REGISTRY

# Create problem with fixtures directory
problem = REGISTRY["sum-json"](fixtures_dir="examples/fixtures")

# Load submission
submission = load_submission("examples/candidates/sum_candidate.py")

# Evaluate - it's that simple!
result = evaluate(problem, submission)

# Access results
print(f"Score: {result.score}, Time: {result.time_ms}ms, OK: {result.ok}")

# Or unpack as tuple (Gym-style)
score, ok, artifacts, error, time_ms = result
print(f"Score: {score}, OK: {ok}")

Core Concepts

Problem Protocol

Minimal API: Problems implement a single evaluate() method.

class MyProblem:
    name = "my-problem"

    def __init__(self, data_dir):
        # Configure with whatever you need
        self.data_dir = data_dir

    def evaluate(self, submission) -> dict | float | EvalResult:
        # submission can be ANY format you define (path, dict, object, etc.)

        # Option 1: Just artifacts (no score - useful for analysis/reports)
        return {"backtest_results": ..., "metrics": ...}

        # Option 2: Just score (simple case)
        return 0.95

        # Option 3: Both score and artifacts
        return EvalResult(score=0.95, artifacts={"details": ...})

Key principles:

  • One method: evaluate(submission) -> dict | float | EvalResult
  • You choose submission format: Path, dict, object - whatever makes sense
  • Pass config via init: No forced abstractions
  • Scoring is optional: Return just artifacts for analysis, score later
  • Flexible returns: Dict (artifacts), float (score), or EvalResult (both)

Submission Format

You decide! Each problem defines what a submission is:

  • Path to file: problem.evaluate("submission.py")
  • Dictionary: problem.evaluate({"code": "...", "params": {...}})
  • Custom object: problem.evaluate(MySubmission(...))

The load_submission() utility is optional:

  • .py files → {"kind": "python_script", "code": "..."}
  • .json files → parsed dict
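
For example, using the files shipped under examples/:

from optelier import load_submission

sub = load_submission("examples/candidates/sum_candidate.py")
sub["kind"]  # "python_script"
sub["code"]  # the script's source text

cfg = load_submission("examples/candidates/regex.json")
cfg          # the parsed dict, e.g. {"pattern": "..."}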

EvalResult (NamedTuple)

from optelier import EvalResult

# Create results
result = EvalResult(score=0.95)
result = EvalResult(score=0.95, artifacts={"details": ...})

# Access by name
print(result.score)      # 0.95
print(result.ok)         # True
print(result.artifacts)  # {...}
print(result.time_ms)    # Timing added by evaluate()

# Or unpack as tuple (Gym-style)
score, ok, artifacts, error, time_ms = result

Fields:

  • score: Numeric score (float, -inf if not provided)
  • ok: Whether evaluation succeeded (bool)
  • artifacts: Data produced by evaluation (dict)
  • error: Error traceback if failed (str | None)
  • time_ms: Execution time in milliseconds (int)

Note: If your evaluate() returns just a dict (artifacts only), score will be -inf. This is useful for:

  • Backtesting / simulation workflows where you analyze results before scoring
  • Data collection / exploratory analysis
  • Applying multiple scoring functions to the same artifacts
  • Generating reports without immediate scoring

See examples/artifacts_only_example.py for a complete example.
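
A minimal sketch of that score-later pattern (ReportProblem and its artifacts are invented for illustration):

from optelier import evaluate

class ReportProblem:
    name = "report-only"

    def evaluate(self, submission):
        # Artifacts only; the framework fills in score = -inf
        return {"returns": [0.01, -0.02, 0.03]}

result = evaluate(ReportProblem(), None)
assert result.score == float("-inf")

# Score the same artifacts as many ways as you like, after the fact
returns = result.artifacts["returns"]
mean_return = sum(returns) / len(returns)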

Security

Optelier provides three levels of code execution security for evaluating Python code submissions. Choose the level that matches your threat model and performance requirements.

Security Levels

Level    Class              Isolation            Overhead  Use Case
None     SecureCodeProblem  In-process           ~0ms      Trusted internal code, maximum speed
Medium   SubprocessProblem  Process isolation    ~400ms    Semi-trusted teams, internal tools
Maximum  DockerProblem      Container isolation  ~600ms    Untrusted external code, competitions

All three levels include AST-based validation that blocks disallowed imports and dangerous builtins.

Quick Start: Secure Code Evaluation

Level 1: In-Process (Fast, Trusted Code)

from optelier.problems.secure import SecureCodeProblem

class FastFunctionProblem(SecureCodeProblem):
    def __init__(self):
        super().__init__(
            name="fast_math",
            allowed_imports=set(),  # No imports needed
        )

    def setup_environment(self):
        return {"x": 10, "y": 20}

    def validate_output(self, namespace):
        if "result" not in namespace:
            raise ValueError("Must assign to 'result'")
        score = 1.0 if namespace["result"] == 30 else 0.0
        return score, {"result": namespace["result"]}

    def _execute_code(self, code, environment):
        namespace = environment.copy()
        exec(code, namespace)  # In-process execution
        return namespace

# Use it
problem = FastFunctionProblem()
result = problem.evaluate({"code": "result = x + y"})
print(f"Score: {result.score}")  # 1.0

Level 2: Subprocess (Medium Security)

from optelier.problems.secure import SubprocessProblem

class FeatureProblem(SubprocessProblem):
    def __init__(self, data):
        super().__init__(
            name="feature_eng",
            timeout=30,
            allowed_imports={"pandas", "numpy"}
        )
        self.data = data

    def setup_environment(self):
        return {"df": self.data}

    def validate_output(self, namespace):
        if "feature" not in namespace:
            raise ValueError("Must assign to 'feature'")

        feature = namespace["feature"]
        variance = feature.var()
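        # Cap at 1.0: a variance of 1.0 or more earns the full score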
        score = min(1.0, variance / 1.0)

        return score, {
            "variance": variance,
            "null_rate": feature.isna().mean()
        }

# Use it
import pandas as pd
problem = FeatureProblem(pd.DataFrame({"price": [100, 200, 150]}))
result = problem.evaluate({
    "code": "feature = df['price'].pct_change()"
})
print(f"Score: {result.score}")

Level 3: Docker (Maximum Security)

from optelier.problems.secure import DockerProblem

class UntrustedCodeProblem(DockerProblem):
    def __init__(self, test_cases):
        super().__init__(
            name="untrusted_challenge",
            image="python:3.10-slim",
            timeout=30,
            memory_limit="256m"
        )
        self.test_cases = test_cases  # [(inputs, expected), ...]

    def setup_environment(self):
        return {}  # No environment needed

    def validate_output(self, namespace):
        if "solve" not in namespace:
            raise ValueError("Must define 'solve' function")

        func = namespace["solve"]
        passed = sum(1 for inputs, expected in self.test_cases
                     if func(*inputs) == expected)
        score = passed / len(self.test_cases)

        return score, {
            "passed": passed,
            "total": len(self.test_cases)
        }

# Use it
problem = UntrustedCodeProblem([
    ((2, 3), 5),
    ((10, 5), 15)
])
result = problem.evaluate({
    "code": "def solve(a, b): return a + b"
})
print(f"Score: {result.score}")  # 1.0

Security Features

AST-Based Validation (All Levels):

  • Whitelist allowed imports (e.g., pandas, numpy)
  • Blacklist dangerous builtins (eval, exec, compile, __import__, open)
  • Code length limits
  • Syntax validation
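
As a rough illustration of this kind of check (a sketch, not Optelier's actual validator):

import ast

FORBIDDEN_BUILTINS = {"eval", "exec", "compile", "__import__", "open"}

def validate_code(code, allowed_imports):
    tree = ast.parse(code)  # rejects invalid syntax with SyntaxError
    for node in ast.walk(tree):
        # Blacklist dangerous builtins
        if isinstance(node, ast.Name) and node.id in FORBIDDEN_BUILTINS:
            raise ValueError(f"forbidden builtin: {node.id}")
        # Whitelist imports by top-level module name
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        for root in roots:
            if root not in allowed_imports:
                raise ValueError(f"import not allowed: {root}")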

Subprocess Isolation (Medium):

  • Separate process space (crashes don't affect parent)
  • Configurable timeout
  • Pickle serialization (supports DataFrames, complex objects)
  • Automatic cleanup

Docker Isolation (Maximum):

  • Complete filesystem isolation
  • Network disabled by default
  • Memory and CPU limits enforced
  • Read-only execution
  • Automatic container cleanup

When to Use Each Level

Use SecureCodeProblem (in-process) when:

  • Code is from trusted internal developers
  • Maximum speed is critical
  • AST validation provides sufficient safety

Use SubprocessProblem when:

  • Code is from semi-trusted sources (internal teams, controlled access)
  • You need crash isolation
  • Working with DataFrames or complex objects
  • ~400ms overhead is acceptable

Use DockerProblem when:

  • Code is from untrusted external sources
  • Running competitions or public APIs
  • Maximum isolation is required
  • ~600ms overhead is acceptable
  • Network access must be blocked

Learn More

See docs/security.md for the full three-level security guide.

Included Problems

sum-json

Evaluates Python scripts that read a JSON file with numbers and output their sum.

Contract:

  • Input: numbers.json with {"numbers": [...]}
  • Script receives: python script.py <input> <output>
  • Expected output: {"sum": <int>}

Example submission: examples/candidates/sum_candidate.py
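
A candidate that satisfies this contract can be as small as the following sketch:

# Invoked as: python script.py <input> <output>
import json
import sys

with open(sys.argv[1]) as f:
    numbers = json.load(f)["numbers"]

with open(sys.argv[2], "w") as f:
    json.dump({"sum": sum(numbers)}, f)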

text-regex

Evaluates regex patterns (or Python scripts that output patterns) against a labeled text dataset.

Contract:

  • Input: text.txt with lines labeled OK: (match) or NO: (no match)
  • Submission provides: {"pattern": "regex"} or Python script printing pattern
  • Scoring: F1 score of matches

Example submission: examples/candidates/regex.json
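
For reference, F1 over the labeled lines works like this (a sketch of the metric, not necessarily Optelier's exact code):

import re

def f1_for_pattern(pattern, labeled_lines):
    # labeled_lines: [(text, should_match), ...] parsed from the OK:/NO: labels
    regex = re.compile(pattern)
    tp = fp = fn = 0
    for text, should_match in labeled_lines:
        matched = regex.search(text) is not None
        if matched and should_match:
            tp += 1
        elif matched:
            fp += 1
        elif should_match:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)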

Integrating Existing Evaluators

Don't rewrite your code! Wrap any existing evaluator in ~15 lines:

# Wrap a simple function - NO context needed!
from your_legacy_code import evaluate_submission

class MyAdapter:
    name = "my-evaluator"

    def __init__(self, data_dir):
        self.data_dir = data_dir  # Pass config via init

    def evaluate(self, submission_path: str) -> float:
        # Just call your existing function!
        return evaluate_submission(submission_path)

That's it. No Context objects, no two-step evaluation, no forced dict formats.

Supported Patterns

  • Functions: Single callable returning a score
  • Classes: Evaluator with __init__ and evaluate() methods
  • CLI Tools: Command-line scripts outputting JSON/scores (sketched below)
  • Test Suites: pytest, unittest, or custom test runners
  • ML Pipelines: scikit-learn, PyTorch, TensorFlow evaluations
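
For instance, a CLI-tool adapter might look like the following sketch (the command and its JSON output format are assumptions):

import json
import subprocess

class CliAdapter:
    name = "cli-evaluator"

    def __init__(self, command):
        self.command = list(command)  # e.g. ["./grade", "--json"] (hypothetical tool)

    def evaluate(self, submission_path):
        # Run the tool and parse a score out of its JSON stdout
        proc = subprocess.run(
            self.command + [submission_path],
            capture_output=True, text=True, check=True,
        )
        return float(json.loads(proc.stdout)["score"])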

Documentation

See docs/adapters.md for the adapter pattern guide and docs/integration-guide.md for step-by-step integration.

Example Adapters

See examples/adapters/ for working examples:

  • callable_adapter.py - Wrap simple functions
  • class_adapter.py - Wrap class-based evaluators with scoring policies
  • cli_adapter.py - Wrap command-line tools

See examples/legacy/ for mock legacy code you can adapt.


Extending

Adding New Problems

  1. Create a new problem class in src/optelier/problems/:
# src/optelier/problems/my_problem.py

class MyProblem:
    name = "my-problem"

    def __init__(self, fixtures_dir=None):
        # Pass any configuration you need via __init__
        self.fixtures_dir = fixtures_dir

    def evaluate(self, submission):
        # Your execution and scoring logic
        return {"result": "..."}  # or a float score, or an EvalResult
  2. Register it in src/optelier/problems/__init__.py:
from .my_problem import MyProblem

REGISTRY = {
    # ... existing problems
    "my-problem": MyProblem,
}
  3. Add fixtures to examples/fixtures/ and candidates to examples/candidates/
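
Once registered, the problem is usable like any built-in (a sketch; the submission is whatever format MyProblem accepts):

from optelier import evaluate
from optelier.problems import REGISTRY

problem = REGISTRY["my-problem"](fixtures_dir="examples/fixtures")
result = evaluate(problem, {"anything": "you defined"})
print(result.score)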

Adding Execution Environments

The framework is designed to grow:

  • Add time/memory limits
  • Add Docker/WASM runners
  • Add parallel execution
  • Add result caching
  • Add leaderboard storage

Start simple with local execution, then add safety layers as needed.

Project Structure

optelier/
├── src/optelier/
│   ├── __init__.py          # Main exports
│   ├── core.py              # Evaluation framework
│   ├── cli.py               # CLI interface
│   └── problems/
│       ├── __init__.py      # Problem registry
│       ├── secure.py        # Secure code evaluation base classes
│       ├── validation.py    # Validation helpers
│       ├── sum_json.py      # Sum problem
│       ├── text_regex.py    # Regex problem
│       └── examples/        # Example problem implementations
│           ├── feature.py           # Feature engineering example
│           ├── untrusted.py         # Untrusted code example
│           └── function_test.py     # Function testing example
├── examples/
│   ├── adapters/            # Adapter pattern examples
│   │   ├── callable_adapter.py
│   │   ├── class_adapter.py
│   │   └── cli_adapter.py
│   ├── legacy/              # Mock legacy code to adapt
│   │   ├── simple_evaluator.py
│   │   └── class_evaluator.py
│   ├── fixtures/            # Test data
│   │   ├── numbers.json
│   │   └── text.txt
│   └── candidates/          # Example submissions
│       ├── sum_candidate.py
│       └── regex.json
├── docs/
│   ├── adapters.md          # Adapter pattern guide
│   ├── integration-guide.md # Step-by-step integration
│   ├── dockerization.md     # Docker & deployment guide
│   ├── security.md          # Security guide (3 levels)
│   ├── migration.md         # Migration from other frameworks
│   └── validation_helpers.md # Validation utilities
├── Dockerfile
├── pyproject.toml
└── README.md

License

MIT License - see LICENSE file for details.
