Epsilab Python SDK

Official Python client for the Epsilab model evaluation and improvement platform.

What is Epsilab?

Epsilab runs model and harness evaluations on workflow-level tasks, detects recurring capability gaps, and exports evals, trajectories, preference data, SFT examples, and regression tests. Training-data exports anonymize model identities by default using labels such as target_model and reference_A.

Installation

pip install epsilab

Or install from source:

git clone https://github.com/EpsilabAI/epsilab-python.git
cd epsilab-python
pip install -e .

Quick Start

from epsilab import Epsilab

client = Epsilab(api_key="sk-...")

# Compare multiple models in one evaluation (use any OpenRouter model slug)
eval_result = client.create_evaluation(
    ["provider/model-a", "provider/model-b", "provider/model-c"],
    name="Frontier comparison",
    max_tasks=25,
)
print(f"Evaluation started: {eval_result.evaluation_id}")

# Wait for completion
run = client.wait_for_completion(eval_result.runs[0].run_id)
print(f"Completed: {run.task_count} tasks, {run.gap_count} gaps found")

# View capability gaps
for gap in client.get_gaps(run.run_id):
    print(f"  {gap.capability}: alpha={gap.alpha_score:.3f}")

# Export targeted training data (model identities are anonymized by default)
client.export_run(run.run_id, format="dpo", path="output/dpo_pairs.jsonl")

Configuration

| Environment Variable | Constructor Param | Description |
| --- | --- | --- |
| EPSILAB_API_KEY | api_key | Your API key |
| EPSILAB_API_BASE | api_base | API base URL (default: production) |
| EPSILAB_HTTP_TIMEOUT | timeout_seconds | Request timeout in seconds (default: 120) |
| | max_retries | Auto-retry count for 429/5xx (default: 3) |
| | backoff_base | Initial retry backoff in seconds (default: 1.0) |
| | load_dotenv | Also read a local .env file (default: false) |

The SDK reads process environment variables automatically. To also read a local .env file, opt in explicitly:

client = Epsilab(load_dotenv=True)
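
As a rough illustration of how environment variables and constructor parameters could map onto each other, the sketch below resolves three of the settings listed above. It assumes that an explicit argument takes precedence over the matching environment variable, which this README does not state. `resolve_settings` is a hypothetical helper written for this example and is not part of the SDK.

```python
import os

DEFAULT_TIMEOUT_SECONDS = 120.0  # documented default for the request timeout


def resolve_settings(api_key=None, api_base=None, timeout_seconds=None, environ=None):
    """Hypothetical helper: merge explicit arguments with environment variables.

    Assumption (not confirmed by the SDK docs): explicit arguments win.
    """
    env = os.environ if environ is None else environ
    if timeout_seconds is None:
        timeout_seconds = env.get("EPSILAB_HTTP_TIMEOUT", DEFAULT_TIMEOUT_SECONDS)
    return {
        "api_key": api_key if api_key is not None else env.get("EPSILAB_API_KEY"),
        # None here stands for "use the production default".
        "api_base": api_base if api_base is not None else env.get("EPSILAB_API_BASE"),
        "timeout_seconds": float(timeout_seconds),
    }


# Pass a plain dict instead of os.environ to keep the demo deterministic.
settings = resolve_settings(environ={"EPSILAB_API_KEY": "sk-from-env"})
print(settings)
```

Passing `environ` explicitly keeps the helper easy to test; in normal use it falls back to `os.environ`.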

Multi-Model Evaluations

Compare multiple models side-by-side on the same task set:

# Simple: just pass model IDs (any OpenRouter-compatible slug)
eval_result = client.create_evaluation(
    ["provider/model-a", "provider/model-b", "provider/model-c"],
    name="Three-way comparison",
)

# Advanced: per-model harness overrides
eval_result = client.create_evaluation(
    [
        {"model_id": "provider/model-a", "harness": "codex"},
        {"model_id": "provider/model-b", "harness": "openhands"},
        "provider/model-c",  # uses default_harness
    ],
    default_harness="codex",
    max_tasks=50,
    domains=["coding", "math"],
)

# Check cost before running
estimate = client.estimate_evaluation_cost(
    ["provider/model-a", "provider/model-b"],
    max_tasks=25,
)
print(f"Cost: {estimate.total_credits} credits (balance: {estimate.balance})")
print(f"Sufficient: {estimate.sufficient}")
for m in estimate.per_model:
    print(f"  {m.model_id}: {m.credits} credits, {m.task_count} tasks")
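
One way to use the estimate is to gate the evaluation on it. The sketch below starts an evaluation only when `estimate.sufficient` is true. `create_evaluation_if_affordable` and `FakeClient` are hypothetical names introduced for this example; the fake client only mimics the two methods and the fields shown above so the snippet runs without network access.

```python
from types import SimpleNamespace


def create_evaluation_if_affordable(client, models, max_tasks):
    """Start an evaluation only when the estimate reports enough credits.

    Returns the evaluation result, or None when credits are insufficient.
    """
    estimate = client.estimate_evaluation_cost(models, max_tasks=max_tasks)
    if not estimate.sufficient:
        return None
    return client.create_evaluation(models, max_tasks=max_tasks)


class FakeClient:
    """Stand-in for the real client so the sketch runs offline."""

    def __init__(self, total_credits, balance):
        self.total_credits = total_credits
        self.balance = balance
        self.started = []

    def estimate_evaluation_cost(self, models, max_tasks):
        return SimpleNamespace(
            total_credits=self.total_credits,
            balance=self.balance,
            sufficient=self.balance >= self.total_credits,
        )

    def create_evaluation(self, models, max_tasks):
        self.started.append(list(models))
        return SimpleNamespace(evaluation_id="eval-demo")


models = ["provider/model-a", "provider/model-b"]
rich = FakeClient(total_credits=40, balance=100)
poor = FakeClient(total_credits=40, balance=10)

started = create_evaluation_if_affordable(rich, models, max_tasks=25)
skipped = create_evaluation_if_affordable(poor, models, max_tasks=25)
```

With a real client, the same function can be called with `Epsilab(api_key=...)` in place of `FakeClient`.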

Bring Your Own Model

Evaluate any OpenAI-compatible endpoint:

run = client.create_run(
    "internal-llm-v3",
    base_url="https://my-company.example.com/v1",
    api_key="sk-model-key",
)

Your model credentials are used only during the evaluation and are never stored. Training-data exports anonymize model identities by default using labels such as target_model and reference_A.

Client Methods

Models

| Method | Description |
| --- | --- |
| list_models(search, provider, limit) | Browse available models with live pricing |

Evaluations

| Method | Description |
| --- | --- |
| create_evaluation(models, ...) | Compare multiple models in one evaluation |
| estimate_evaluation_cost(models, ...) | Estimate credit cost before running |
| suggest_scope(instructions) | AI-generated scope suggestions from a description |

Runs

| Method | Description |
| --- | --- |
| create_run(model_name, ...) | Submit a single model for evaluation |
| get_run(run_id) | Get run status and summary |
| list_runs(status, limit, offset) | List your evaluation runs (single page) |
| iter_runs(status, page_size) | Auto-paginating iterator over all runs |
| wait_for_completion(run_id, ...) | Block until run completes or fails |
| cancel_run(run_id) | Cancel a queued or running evaluation |
| retry_run(run_id) | Retry a failed run, reusing completed results |
| resume_run(run_id, ...) | Resume a failed run with optional new credentials |
| delete_run(run_id) | Delete a run |
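
To make the blocking behaviour of wait_for_completion concrete, here is a minimal polling sketch built on get_run. It is not the SDK's implementation. The `status` attribute, the status strings, the `interval` and `timeout` parameters, and the names `poll_until_done` and `FakeClient` are all assumptions made for this example.

```python
import time
from types import SimpleNamespace

# Assumed status names; this README mentions queued, running, completed, and failed runs.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_until_done(client, run_id, interval=5.0, timeout=3600.0, sleep=time.sleep):
    """Call client.get_run until the run reaches a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        run = client.get_run(run_id)
        if run.status in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} did not finish within {timeout}s")
        sleep(interval)


class FakeClient:
    """Stand-in client that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)
        self.calls = 0

    def get_run(self, run_id):
        self.calls += 1
        return SimpleNamespace(run_id=run_id, status=next(self._statuses))


fake = FakeClient(["queued", "running", "completed"])
# Replace the real sleep with a no-op so the demo finishes instantly.
final = poll_until_done(fake, "run-demo", sleep=lambda seconds: None)
print(final.status, fake.calls)
```

In practice the built-in wait_for_completion is the simpler choice; a hand-written loop like this is mainly useful when extra logging or custom timeouts are needed between polls.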

Results & Insights

| Method | Description |
| --- | --- |
| get_gaps(run_id) | Get capability gaps from a completed run |
| get_artifacts(run_id, ...) | Get generated artifacts (single page) |
| iter_artifacts(run_id, ...) | Auto-paginating iterator over all artifacts |
| get_insights(run_id) | Get model rankings, J1/J2/J3 metrics, and analytics |
| request_review(run_id, gap_ids) | Request human review for specific gaps |
| forge(run_id) | Generate new tasks targeting run gaps |
| export_run(run_id, format, path) | Export training data or reports |

Cross-Run Analytics

| Method | Description |
| --- | --- |
| get_leaderboard() | Cross-run model leaderboard |
| get_domain_leaderboard() | Per-domain model scores across runs |
| get_cost_analysis() | Cost-efficiency rankings with live pricing |
| get_precomputed_insights() | Per-domain best-model recommendations |

Tasks

| Method | Description |
| --- | --- |
| get_task(task_id) | Get details for a specific task |
| create_task(task) | Create a single custom evaluation task |
| upload_custom_tasks(tasks) | Batch upload custom evaluation tasks |
| get_task_upload_limits() | Get max file size and task count per batch |
| classify_tasks(tasks) | Auto-classify tasks by domain and capability |
| list_tasks(...) | List available tasks (single page) |
| iter_tasks(...) | Auto-paginating iterator over all tasks |
| delete_task(task_id) | Delete a custom task |

API Keys

| Method | Description |
| --- | --- |
| list_api_keys() | List your API keys |
| create_api_key(label) | Create a new API key |
| revoke_api_key(key_id) | Revoke an API key |

Billing

| Method | Description |
| --- | --- |
| get_credit_balance() | Get current credit balance |
| get_credit_ledger(...) | Get credit transaction history |
| get_usage(period) | Get monthly usage summary |

Export Formats

| Format | Use Case |
| --- | --- |
| dpo | Direct Preference Optimization (chosen/rejected pairs) |
| quality_dpo | DPO pairs enriched with quality scores and feedback |
| sft | Supervised Fine-Tuning (prompt/completion pairs) |
| kto | Kahneman-Tversky Optimization (binary desirability) |
| grpo | Group Relative Policy Optimization (grouped completions) |
| sharegpt | Multi-turn conversation format |
| jsonl | Raw artifacts as NDJSON |
| report | Human-readable evaluation report |
| yaml | YAML configuration for reproduction |
| pytest | Pytest test cases from capability gaps |

Training data exports use anonymized model labels (e.g. target_model, reference_A) rather than real model identifiers. Chosen/reference answers are verified gold answers, not raw model outputs. Evaluation prompts are included for enterprise accounts; standard accounts receive task ID references.
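
Once an export such as output/dpo_pairs.jsonl is on disk, it can be inspected with the standard library. The sketch below assumes the file is line-delimited JSON with one record per line, and that DPO records carry `chosen` and `rejected` fields. The field names and the sample record are made up for illustration, so check a real export for the actual schema. `read_jsonl` is a hypothetical helper, not an SDK function.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Made-up sample record; the real export schema may differ.
sample = {"prompt": "task-123", "chosen": "gold answer", "rejected": "target_model answer"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dpo_pairs.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        # The trailing blank line checks that empty lines are ignored.
        handle.write(json.dumps(sample) + "\n\n")
    pairs = read_jsonl(path)

# Keep only records that have both halves of a preference pair.
complete = [p for p in pairs if "chosen" in p and "rejected" in p]
print(len(pairs), len(complete))
```

Filtering for complete pairs before training is a cheap sanity check that surfaces schema mismatches early.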

Automatic Retries

The SDK automatically retries rate-limit responses (429), transient server errors (500, 502, 503, 504), and transient network failures, using exponential backoff with jitter. For 429 responses, it honors the Retry-After header when the value is valid.

# Default: 3 retries with 1s base backoff
client = Epsilab(api_key="sk-...")

# Customize retry behaviour
client = Epsilab(api_key="sk-...", max_retries=5, backoff_base=2.0)

# Disable retries entirely
client = Epsilab(api_key="sk-...", max_retries=0)
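
For intuition about how backoff_base influences the wait between attempts, the sketch below computes a retry delay using exponential backoff with full jitter. This is one common strategy and not necessarily the one the SDK uses. `retry_delay`, the `max_delay` cap, and the handling of Retry-After as a plain number of seconds are assumptions made for this example (HTTP-date values are treated as invalid here).

```python
import random

# Status codes this README lists as retryable.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def retry_delay(attempt, backoff_base=1.0, retry_after=None, max_delay=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    A valid, non-negative Retry-After value takes priority; otherwise the
    delay is drawn uniformly from [0, backoff_base * 2**attempt], capped
    at max_delay.
    """
    if retry_after is not None:
        try:
            seconds = float(retry_after)
        except (TypeError, ValueError):
            seconds = -1.0  # treat unparsable values as invalid
        if seconds >= 0:
            return min(seconds, max_delay)
    ceiling = min(backoff_base * (2 ** attempt), max_delay)
    return random.uniform(0, ceiling)


for attempt in range(3):
    print(attempt, round(retry_delay(attempt, backoff_base=2.0), 2))
```

Full jitter spreads retries from many clients across the whole interval, which helps avoid synchronized bursts against a recovering server.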

Pagination

List endpoints return a single page by default. Use the iter_* methods to auto-paginate:

# Iterate over all runs without manual offset management
for run in client.iter_runs(status="completed"):
    print(run.run_id, run.gap_count)

# Same for artifacts and tasks (run_id is the ID of an existing run)
for artifact in client.iter_artifacts(run_id):
    print(artifact.artifact_type)

for task in client.iter_tasks(domain="coding"):
    print(task["task_id"])
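
The iter_* methods hide the offset bookkeeping that single-page calls require. The generator below is a minimal sketch of that idea, not the SDK's implementation. It assumes a page-fetching callable that takes `limit` and `offset` (mirroring the list_runs parameters) and treats a short page as the end of the data. `iter_pages` and `fake_list_runs` are hypothetical names used only in this example.

```python
from typing import Callable, Iterator, List


def iter_pages(fetch_page: Callable[[int, int], List[dict]], page_size: int = 50) -> Iterator[dict]:
    """Yield every item from a limit/offset API, one page at a time."""
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        # A page shorter than page_size (including an empty one) means no more data.
        if len(page) < page_size:
            return
        offset += len(page)


# Stand-in for a single-page list call so the sketch runs offline.
_FAKE_RUNS = [{"run_id": f"run-{i}"} for i in range(7)]


def fake_list_runs(limit, offset):
    return _FAKE_RUNS[offset:offset + limit]


run_ids = [run["run_id"] for run in iter_pages(fake_list_runs, page_size=3)]
print(run_ids)
```

Because the generator is lazy, a caller that breaks out of the loop early never requests the remaining pages.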

Error Handling

from epsilab import Epsilab, AuthError, InsufficientCreditsError, RateLimitError, ApiError

client = Epsilab(api_key="sk-...")

try:
    eval_result = client.create_evaluation(["provider/model-a", "provider/model-b"])
except AuthError:
    print("Invalid API key")
except InsufficientCreditsError as e:
    print(f"Not enough credits: {e}")
except RateLimitError as e:
    print(f"Rate limited. Retry after {e.retry_after}s")
except ApiError as e:
    print(f"API error: {e.status_code}")

Examples

See examples/example.py for a complete workflow.

License

Apache 2.0 — see LICENSE.
