Python SDK + CLI for Verifiable Labs — evaluate frontier LLMs on conformal-calibrated scientific RL environments.

verifiable-labs

Python SDK for the Verifiable Labs Hosted Evaluation API — evaluate frontier LLMs on conformal-calibrated scientific RL environments without writing any HTTP plumbing.

v0.1.0a1 — alpha. The Hosted Evaluation API itself is v0.1.0-alpha (open / rate-limited / no auth / single-process session store). This SDK is a thin httpx wrapper that mirrors the 8-endpoint API surface; it'll keep working when we add auth + persistence in v0.2.

Install

pip install verifiable-labs

Python >=3.11 required.

Quickstart

Synchronous

from verifiable_labs import Client

with Client() as client:                                # localhost:8000 by default
    print(client.health().version)                      # "0.1.0-alpha"

    env = client.env("stelioszach/sparse-fourier-recovery")
    result = env.evaluate(
        seed=0,
        answer='{"support_idx": [12, 47, 91], "support_amp_x1000": [800, -300, 1200]}',
        env_kwargs={"calibration_quantile": 2.0},
    )
    print(f"reward={result.reward:.3f}  parse_ok={result.parse_ok}")

Asynchronous

import asyncio
from verifiable_labs import AsyncClient

async def main():
    async with AsyncClient(base_url="https://api.verifiable-labs.com") as client:
        env = client.env("sparse-fourier-recovery")
        # Multi-turn flow: keep submitting until session.complete is True
        session = await env.start_session(seed=42)
        while not session.complete:
            answer = my_agent.solve(session.observation)         # your code
            await session.submit(answer_text=answer)
        print("turns:", len(session.history))

asyncio.run(main())
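Because v0.1 records turns without dispatching them (see the caveats further down), a loop on session.complete may be worth bounding. The sketch below wraps the same loop with a turn limit. It relies only on the session.complete, session.observation and session.submit(answer_text=...) names shown above. run_session, max_turns and FakeSession are hypothetical names introduced here for illustration and are not part of the SDK.

```python
import asyncio


async def run_session(session, solve, max_turns=10):
    """Submit answers until the session completes or max_turns is reached.

    `solve` is your own callable: observation in, answer string out.
    Returns the number of turns submitted.
    """
    turns = 0
    while not session.complete and turns < max_turns:
        answer = solve(session.observation)
        await session.submit(answer_text=answer)
        turns += 1
    return turns


class FakeSession:
    """Local stand-in exposing only the attributes used above.

    This is NOT the SDK's Session class; it exists so the sketch runs offline.
    """

    def __init__(self, turns_needed):
        self.observation = "stub observation"
        self.history = []
        self._turns_needed = turns_needed

    @property
    def complete(self):
        return len(self.history) >= self._turns_needed

    async def submit(self, answer_text):
        self.history.append(answer_text)


# A session that finishes after 3 turns, and one that never finishes in time.
finished = FakeSession(turns_needed=3)
stuck = FakeSession(turns_needed=100)
turns_finished = asyncio.run(run_session(finished, lambda obs: "{}"))
turns_stuck = asyncio.run(run_session(stuck, lambda obs: "{}", max_turns=5))
```

With the real SDK, the same helper would presumably be called as await run_session(session, my_agent.solve) inside the async with block.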

Leaderboard

from verifiable_labs import Client

with Client() as client:
    lb = client.leaderboard("sparse-fourier-recovery")
    for row in lb.top_models(n=3):                      # three best-scoring models
        print(f"{row.model:35s}  mean={row.mean_reward:.3f}  n={row.n}")

Public surface

name                                     sync / async     purpose
Client(api_key=None, base_url=...)       sync             top-level client
AsyncClient(api_key=None, base_url=...)  async            top-level client
client.health()                          both             liveness + version
client.environments()                    both             list all 10 envs
client.env(env_id)                       both             returns Environment handle
client.leaderboard(env_id)               both             aggregated benchmark numbers
env.evaluate(seed, answer)               both             one-shot eval, returns SubmitResponse
env.start_session(seed)                  both             returns multi-turn Session
session.submit(answer_text=...)          both             append a turn, returns score
session.history                          sync (property)  list of past SubmitResponses
session.complete                         sync (property)  bool: env signalled done
session.refresh()                        both             re-fetch state from the server

Exceptions

The SDK raises typed exceptions on non-2xx HTTP status codes and on network or timeout failures, so callers can catch the specific failure mode they want to handle.

from verifiable_labs import (
    VerifiableLabsError,        # base class
    TransportError,             # network / timeout
    InvalidRequestError,        # 400 / 422
    NotFoundError,              # 404
    RateLimitError,             # 429
    ServerError,                # 5xx
)
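Since the API is rate-limited, one common way to use the typed exceptions is to retry on RateLimitError (and possibly TransportError) with exponential backoff. The following is a minimal sketch, not something the SDK ships: call_with_backoff and its parameters are hypothetical, and FakeRateLimitError is a stand-in so the example runs without the SDK installed. Whether the SDK already retries internally is not stated here, so treat this as optional.

```python
import time


def call_with_backoff(fn, retry_on, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on the given exception types."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...


# Stand-in for RateLimitError so the sketch runs offline (assumption).
class FakeRateLimitError(Exception):
    pass


calls = {"n": 0}


def flaky():
    """Fail twice with the stand-in error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimitError("429")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, FakeRateLimitError, sleep=delays.append)
```

With the SDK, the call would presumably look like call_with_backoff(lambda: env.evaluate(seed=0, answer=answer), (RateLimitError, TransportError)). InvalidRequestError and NotFoundError are better left to propagate, since repeating the same request is unlikely to change the outcome.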

Configuration

Client(
    api_key=None,               # forward-compat for v0.2; no effect in v0.1
    base_url="http://localhost:8000",
    timeout=30.0,               # httpx total-timeout in seconds
    http_client=None,           # inject your own httpx.Client for custom transport
)

AsyncClient takes the same arguments, except that http_client accepts an httpx.AsyncClient.

What's NOT in v0.1

Same caveats as the Hosted Evaluation API:

  • No authentication. api_key= is accepted for forward-compat but unused.
  • Multi-turn sessions don't yet route turns through the env's residual-feedback rollout (server records turns but doesn't dispatch). The SDK exposes the full Session API anyway so the shape is stable for v0.2.
  • Passing the answer as a structured dict returns HTTP 422; pass a JSON-encoded string instead, as in the Quickstart.
  • No persistence — session store is in-memory on the API side.
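Because answers must be strings, a structured answer can be serialized with the standard library's json.dumps before it is submitted. The sketch below uses the field names from the Quickstart example. The encode_answer helper is hypothetical, and the int coercion is an assumption made so that NumPy integer values, which json.dumps cannot serialize directly, are handled too.

```python
import json


def encode_answer(support_idx, support_amp_x1000):
    """Serialize a sparse-Fourier answer to the JSON string the API expects."""
    payload = {
        "support_idx": [int(i) for i in support_idx],
        "support_amp_x1000": [int(a) for a in support_amp_x1000],
    }
    return json.dumps(payload)


answer = encode_answer([12, 47, 91], [800, -300, 1200])
print(answer)
```

The resulting string can then be passed as answer= to env.evaluate or as answer_text= to session.submit.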

License

Apache-2.0. See LICENSE.
