
Python SDK + CLI for Verifiable Labs — evaluate frontier LLMs on conformal-calibrated scientific RL environments.

verifiable-labs

Python SDK for the Verifiable Labs Hosted Evaluation API — evaluate frontier LLMs on conformal-calibrated scientific RL environments without writing any HTTP plumbing.

v0.1.0a1 (alpha). The Hosted Evaluation API itself is v0.1.0-alpha: open, rate-limited, unauthenticated, and backed by a single-process session store. This SDK is a thin httpx wrapper that mirrors the API's 8 endpoints; it is intended to keep working when auth and persistence are added in v0.2.

Install

pip install verifiable-labs

Python >=3.11 required.

Quickstart

Synchronous

from verifiable_labs import Client

with Client() as client:                                # localhost:8000 by default
    print(client.health().version)                      # "0.1.0-alpha"

    env = client.env("stelioszach/sparse-fourier-recovery")
    result = env.evaluate(
        seed=0,
        answer='{"support_idx": [12, 47, 91], "support_amp_x1000": [800, -300, 1200]}',
        env_kwargs={"calibration_quantile": 2.0},
    )
    print(f"reward={result.reward:.3f}  parse_ok={result.parse_ok}")

Asynchronous

import asyncio
from verifiable_labs import AsyncClient

async def main():
    async with AsyncClient(base_url="https://api.verifiable-labs.com") as client:
        env = client.env("sparse-fourier-recovery")
        # Multi-turn flow: keep submitting until session.complete is True
        session = await env.start_session(seed=42)
        while not session.complete:
            answer = my_agent.solve(session.observation)         # your code
            await session.submit(answer_text=answer)
        print("turns:", len(session.history))

asyncio.run(main())
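
The loop above runs until session.complete becomes True, so a session that never completes would spin indefinitely. The sketch below shows one way to cap the number of turns. It does not import the SDK: FakeSession is a hypothetical stand-in that only mimics the complete, observation, history, and submit(answer_text=...) members listed on this page, and run_session, echo_solver, and max_turns are illustrative names, not part of verifiable_labs.

```python
import asyncio


class FakeSession:
    """Stand-in for the SDK's Session; completes after a fixed number of turns."""

    def __init__(self, turns_until_done: int) -> None:
        self._turns_until_done = turns_until_done
        self.observation = "observation for turn 0"
        self.history: list[str] = []

    @property
    def complete(self) -> bool:
        return len(self.history) >= self._turns_until_done

    async def submit(self, answer_text: str) -> None:
        self.history.append(answer_text)
        self.observation = f"observation for turn {len(self.history)}"


async def run_session(session, solve, max_turns: int = 5) -> int:
    """Submit answers until the session completes or max_turns is reached."""
    turns = 0
    while not session.complete and turns < max_turns:
        await session.submit(answer_text=solve(session.observation))
        turns += 1
    return turns


def echo_solver(observation: str) -> str:
    # Hypothetical agent: a real one would compute an answer from the observation.
    return f"answer to: {observation}"


finished = asyncio.run(run_session(FakeSession(turns_until_done=3), echo_solver))
capped = asyncio.run(
    run_session(FakeSession(turns_until_done=100), echo_solver, max_turns=5)
)
print(finished, capped)
```

With a real Session from env.start_session(...), the same run_session helper should apply unchanged, assuming submit is awaitable as in the async quickstart.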

Leaderboard

lb = client.leaderboard("sparse-fourier-recovery")
for row in lb.top_models(n=3):
    print(f"{row.model:35s}  mean={row.mean_reward:.3f}  n={row.n}")

Public surface

name                                     sync / async     purpose
Client(api_key=None, base_url=...)       sync             top-level client
AsyncClient(api_key=None, base_url=...)  async            top-level client
client.health()                          both             liveness + version
client.environments()                    both             list all 10 envs
client.env(env_id)                       both             returns Environment handle
client.leaderboard(env_id)               both             aggregated benchmark numbers
env.evaluate(seed, answer)               both             one-shot eval, returns SubmitResponse
env.start_session(seed)                  both             returns multi-turn Session
session.submit(answer_text=...)          both             append a turn, returns score
session.history                          sync (property)  list of past SubmitResponses
session.complete                         sync (property)  bool: env signalled done
session.refresh()                        both             re-fetch state from the server

Exceptions

The SDK raises typed exceptions on non-2xx HTTP responses and on transport failures, so callers can catch the specific failure mode.

from verifiable_labs import (
    VerifiableLabsError,        # base class
    TransportError,             # network / timeout
    InvalidRequestError,        # 400 / 422
    NotFoundError,              # 404
    RateLimitError,             # 429
    ServerError,                # 5xx
)
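
Because the API is rate-limited, one common pattern is to retry on RateLimitError and TransportError with exponential backoff. The following is a minimal sketch of that pattern, not something the SDK is documented to provide: call_with_retry and its parameters are hypothetical, and the two exception classes are local stand-ins so the example runs without the SDK installed.

```python
import time


class RateLimitError(Exception):
    """Local stand-in for verifiable_labs.RateLimitError (HTTP 429)."""


class TransportError(Exception):
    """Local stand-in for verifiable_labs.TransportError (network / timeout)."""


def call_with_retry(fn, retry_on=(RateLimitError, TransportError),
                    max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)


# Demo: a call that is rate-limited twice, then succeeds.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


# Record the delays instead of actually sleeping, to keep the demo instant.
result = call_with_retry(flaky, sleep=delays.append)
print(result, delays)
```

In real use you would import the exception classes from verifiable_labs instead of defining them, and pass something like lambda: env.evaluate(seed=0, answer=text) as fn.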

Configuration

Client(
    api_key=None,               # forward-compat for v0.2; no effect in v0.1
    base_url="http://localhost:8000",
    timeout=30.0,               # httpx total-timeout in seconds
    http_client=None,           # inject your own httpx.Client for custom transport
)

AsyncClient takes the same arguments, except that http_client accepts an httpx.AsyncClient.
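
Since base_url defaults to localhost, deployments usually need to override it. A minimal sketch of building the constructor arguments from environment variables follows. The variable names (VERIFIABLE_LABS_BASE_URL and so on) and the client_kwargs_from_env helper are assumptions made for illustration; nothing on this page says the SDK reads them itself.

```python
import os


def client_kwargs_from_env(environ=os.environ) -> dict:
    """Build keyword arguments for Client / AsyncClient from environment variables."""
    return {
        # Variable names are hypothetical; the SDK is not documented to read them.
        "base_url": environ.get("VERIFIABLE_LABS_BASE_URL", "http://localhost:8000"),
        "timeout": float(environ.get("VERIFIABLE_LABS_TIMEOUT", "30.0")),
        "api_key": environ.get("VERIFIABLE_LABS_API_KEY"),  # unused in v0.1
    }


# Demo with an explicit mapping instead of the real process environment.
kwargs = client_kwargs_from_env(
    {"VERIFIABLE_LABS_BASE_URL": "https://api.verifiable-labs.com"}
)
print(kwargs)
```

The result can then be unpacked into the constructor, for example Client(**client_kwargs_from_env()).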

What's NOT in v0.1

Same caveats as the Hosted Evaluation API:

  • No authentication. api_key= is accepted for forward-compat but unused.
  • Multi-turn sessions don't yet route turns through the env's residual-feedback rollout (server records turns but doesn't dispatch). The SDK exposes the full Session API anyway so the shape is stable for v0.2.
  • Structured answer dicts return HTTP 422; pass strings.
  • No persistence — session store is in-memory on the API side.
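
Because structured answer dicts are rejected with HTTP 422, an answer held as a dict needs to be serialized before it is submitted. A small sketch using the standard library's json module is shown here; answer_to_text is a hypothetical helper name, and the answer fields are the ones from the synchronous quickstart.

```python
import json


def answer_to_text(answer: dict) -> str:
    """Serialize a structured answer to a JSON string."""
    return json.dumps(answer)


answer_text = answer_to_text(
    {"support_idx": [12, 47, 91], "support_amp_x1000": [800, -300, 1200]}
)
print(answer_text)
```

The resulting string can be passed as answer= to env.evaluate or as answer_text= to session.submit.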

License

Apache-2.0. See LICENSE.


Download files

Source Distribution

verifiable_labs-0.1.0a2.tar.gz (26.5 kB view details)

Uploaded Source

Built Distribution

verifiable_labs-0.1.0a2-py3-none-any.whl (24.5 kB view details)

Uploaded Python 3

File details

Details for the file verifiable_labs-0.1.0a2.tar.gz.

File metadata

  • Download URL: verifiable_labs-0.1.0a2.tar.gz
  • Size: 26.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for verifiable_labs-0.1.0a2.tar.gz
Algorithm    Hash digest
SHA256       6d191b6bbe58f45c66257ab0f36a9dbe3264a2b5d9ad2b83ceae99517fbfa3a5
MD5          69ddbfd297a9d6b2426441f7f1f3666e
BLAKE2b-256  b2e5fc9b294967b5ace8a730ac4133af7e0d14670e12e18d397426070e9a88db


File details

Details for the file verifiable_labs-0.1.0a2-py3-none-any.whl.

File hashes

Hashes for verifiable_labs-0.1.0a2-py3-none-any.whl
Algorithm    Hash digest
SHA256       cc05a0a7d4f324dead408127c0021dd3a0e7b4f21d1dc088c7ec113131efe343
MD5          5067808b23c4829b07df2fc309c0e8db
BLAKE2b-256  dfbcbf2b187aed1cbe305759b3688d1ef25805e078f558aebcfa7cd07a84a780

