
Python SDK + CLI for Verifiable Labs — evaluate frontier LLMs on conformal-calibrated scientific RL environments.

Project description

verifiable-labs

Python SDK for the Verifiable Labs Hosted Evaluation API — evaluate frontier LLMs on conformal-calibrated scientific RL environments without writing any HTTP plumbing.

v0.1.0a3 — alpha. The Hosted Evaluation API itself is v0.1.0-alpha (open, rate-limited, no auth, single-process session store). This SDK is a thin httpx wrapper that mirrors the 8-endpoint API surface; it will keep working when auth and persistence land in v0.2.

Install

pip install verifiable-labs

Python >=3.11 required.

Quickstart

Synchronous

from verifiable_labs import Client

with Client() as client:                                # localhost:8000 by default
    print(client.health().version)                      # "0.1.0-alpha"

    env = client.env("stelioszach/sparse-fourier-recovery")
    result = env.evaluate(
        seed=0,
        answer='{"support_idx": [12, 47, 91], "support_amp_x1000": [800, -300, 1200]}',
        env_kwargs={"calibration_quantile": 2.0},
    )
    print(f"reward={result.reward:.3f}  parse_ok={result.parse_ok}")

Asynchronous

import asyncio
from verifiable_labs import AsyncClient

async def main():
    async with AsyncClient(base_url="https://api.verifiable-labs.com") as client:
        env = client.env("sparse-fourier-recovery")
        # Multi-turn flow: keep submitting until session.complete is True
        session = await env.start_session(seed=42)
        while not session.complete:
            answer = my_agent.solve(session.observation)         # your code
            await session.submit(answer_text=answer)
        print("turns:", len(session.history))

asyncio.run(main())

Leaderboard

lb = client.leaderboard("sparse-fourier-recovery")
for row in lb.top_models(n=3):
    print(f"{row.model:35s}  mean={row.mean_reward:.3f}  n={row.n}")

Public surface

name                                     sync / async      purpose
Client(api_key=None, base_url=...)       sync              top-level client
AsyncClient(api_key=None, base_url=...)  async             top-level client
client.health()                          both              liveness + version
client.environments()                    both              list all 10 envs
client.env(env_id)                       both              returns an Environment handle
client.leaderboard(env_id)               both              aggregated benchmark numbers
env.evaluate(seed, answer)               both              one-shot eval, returns SubmitResponse
env.start_session(seed)                  both              returns a multi-turn Session
session.submit(answer_text=...)          both              append a turn, returns its score
session.history                          property (sync)   list of past SubmitResponses
session.complete                         property (sync)   bool, True once the env signals done
session.refresh()                        both              re-fetch state from the server

Exceptions

The SDK raises typed exceptions on non-2xx HTTP status codes, so callers can catch the specific failure mode they care about.

from verifiable_labs import (
    VerifiableLabsError,        # base class
    TransportError,             # network / timeout
    InvalidRequestError,        # 400 / 422
    NotFoundError,              # 404
    RateLimitError,             # 429
    ServerError,                # 5xx
)
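Because RateLimitError maps to HTTP 429, a small backoff-and-retry wrapper is a natural companion to these exceptions. The helper below is our own sketch, not part of the SDK; pass it RateLimitError (or any exception tuple) from the import above:

```python
import time

def retry_on(fn, exc_types, attempts=3, base_delay=1.0):
    """Call fn(); on the given exception types, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except exc_types:
            if attempt == attempts - 1:
                raise          # out of attempts: let the caller see the error
            time.sleep(base_delay * (2 ** attempt))

# Usage against the SDK might look like (client/env as in the quickstart):
#   result = retry_on(lambda: env.evaluate(seed=0, answer="..."), RateLimitError)
```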

Configuration

Client(
    api_key=None,               # forward-compat for v0.2; no effect in v0.1
    base_url="http://localhost:8000",
    timeout=30.0,               # httpx total-timeout in seconds
    http_client=None,           # inject your own httpx.Client for custom transport
)

AsyncClient accepts the same arguments, except that http_client= takes an httpx.AsyncClient.
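For deployments you may want base_url to come from the environment rather than being hard-coded. This is a sketch of our own convention (the SDK does not read VERIFIABLE_LABS_BASE_URL itself):

```python
import os

def resolve_base_url(default="http://localhost:8000"):
    """Pick the API base URL from the environment, falling back to the SDK default."""
    return os.environ.get("VERIFIABLE_LABS_BASE_URL", default)

# Then construct the client with it:
#   client = Client(base_url=resolve_base_url())
```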

What's NOT in v0.1

Same caveats as the Hosted Evaluation API:

  • No authentication. api_key= is accepted for forward-compat but unused.
  • Multi-turn sessions don't yet route turns through the env's residual-feedback rollout (server records turns but doesn't dispatch). The SDK exposes the full Session API anyway so the shape is stable for v0.2.
  • Submitting a structured answer dict returns HTTP 422; serialize answers to strings first.
  • No persistence — session store is in-memory on the API side.
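Since structured dicts are rejected with 422 in v0.1, one workaround is to JSON-serialize the answer before submitting. A tiny sketch (the as_answer_text helper is ours, not the SDK's):

```python
import json

def as_answer_text(answer):
    """Serialize a structured answer to the JSON string the v0.1 API expects."""
    return answer if isinstance(answer, str) else json.dumps(answer)

# as_answer_text({"support_idx": [12, 47, 91]}) -> '{"support_idx": [12, 47, 91]}'
# Strings pass through unchanged.
```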

License

Apache-2.0. See LICENSE.


Download files


Source Distribution

verifiable_labs-0.1.0a3.tar.gz (26.6 kB)


Built Distribution


verifiable_labs-0.1.0a3-py3-none-any.whl (24.6 kB)


File details

Details for the file verifiable_labs-0.1.0a3.tar.gz.

File metadata

  • Download URL: verifiable_labs-0.1.0a3.tar.gz
  • Size: 26.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing: No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for verifiable_labs-0.1.0a3.tar.gz
Algorithm Hash digest
SHA256 625f78098f0f4d1793fda9a76c2726e532c7262330e9492354e65b59b1dad4dc
MD5 791c0b539f148fb05d5d8690bbb11093
BLAKE2b-256 e438edd927a7f8cc2080f47ac24cd94642b34ad8590b707492e0448126f81dcc
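To check a downloaded sdist against the SHA256 digest above, a standard-library sketch:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare sha256_of("verifiable_labs-0.1.0a3.tar.gz") against the digest listed above.
```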


File details

Details for the file verifiable_labs-0.1.0a3-py3-none-any.whl.


File hashes

Hashes for verifiable_labs-0.1.0a3-py3-none-any.whl
Algorithm Hash digest
SHA256 fa0a5170582b6d7e8458d7768a9058a675c533e9e6c485e68f9031440b1442da
MD5 b650b0a8e68dbf26591f1e0d1da4fbea
BLAKE2b-256 4f10e56b7c872379093eb1762d3929811c106b66ebeed9030aba9a24d4c4dfd0

