A deterministic, reproducible test environment for AI browser agents

Resurf

Realistic, Reproducible Test Framework for AI Browser Agents

Systematic testing of browser agents is hard today: real websites are flaky, rate-limited, and expensive to reach (bot unblocking), while static-HTML benchmarks lack state and dynamic behavior. Resurf gives your browser agent a realistic, stateful, instrumented framework built on synthetic websites with failure-mode injection.

Comparison with Mind2Web and WebVoyager (table columns: Mind2Web, WebVoyager, Resurf), by criterion:

  • Realistic, dynamic, interactive environment
  • Deterministic & reproducible
  • Failure-mode injection (latency, payment errors, 5xx)
  • Auditable success eval (DB state, not LLM judge)
  • No dependency on live websites

What's in v0

  • shop_v1 — A production-shaped synthetic e-commerce site (FastAPI + React + SQLite) running in a single Docker container. Catalog, cart, multi-step checkout, auth, accounts, returns. Mobile-responsive. Deliberate ambiguous-UI inventory for testing agent reasoning.
  • Modifiers (failure-mode injection) — FastAPI middleware that toggles network latency, payment outcomes (declined / 3DS / timeout), server error rates, and session expiration per task. Configured per-task in YAML; no code changes to add new failure-mode combinations.
  • Authoring stack — A YAML task schema, ~10 failure-mode templates with scripted reference trajectories, and a CLI: task new, task from-template, task validate, task try.
  • Adapters: browser-use (DOM/AX-tree native), stagehand (via Node subprocess), and a vision-only baseline. All optional installs.
  • Trajectory recording — Per-step DOM snapshots, screenshots, agent actions, token counts, latencies. Deterministic via SQLite snapshot reset and seeded faker.

v0 ships a single site — shop_v1 — to keep the focus on getting the abstractions (Environment, modifiers, success predicates, the /__test__/* admin protocol) right against several adapters. E-commerce was picked because it exercises forms, multi-step flows, auth, money, and obvious failure-mode hooks in a small amount of code. Adding more sites is now content work, not platform work — see Adding a new site.
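The determinism mentioned in the Trajectory recording bullet (SQLite snapshot reset plus seeded data) can be illustrated with a minimal sketch. This is not resurf's actual reset code: the `product` table, the helper names, and the use of `random.Random` as a stand-in for a seeded faker are all assumptions made for illustration.

```python
import random
import sqlite3


def seed_database(conn: sqlite3.Connection, seed: int) -> None:
    """Fill a toy catalog using a seeded RNG (a stand-in for a seeded faker)."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price_cents INTEGER)"
    )
    for i in range(1, 6):
        conn.execute(
            "INSERT INTO product (id, name, price_cents) VALUES (?, ?, ?)",
            (i, f"Product {rng.randint(1000, 9999)}", rng.randint(500, 50_000)),
        )
    conn.commit()


def take_snapshot(live: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the live database into a separate in-memory snapshot."""
    snapshot = sqlite3.connect(":memory:")
    live.backup(snapshot)
    return snapshot


def reset_from_snapshot(live: sqlite3.Connection, snapshot: sqlite3.Connection) -> None:
    """Overwrite the live database with the snapshot contents."""
    live.commit()  # make sure the backup target is not mid-transaction
    snapshot.backup(live)


def dump(conn: sqlite3.Connection) -> list:
    """Return all product rows in a stable order."""
    return conn.execute(
        "SELECT id, name, price_cents FROM product ORDER BY id"
    ).fetchall()


live = sqlite3.connect(":memory:")
seed_database(live, seed=42)
snapshot = take_snapshot(live)
baseline = dump(live)

# Simulate an agent run that mutates state, then roll back to the snapshot.
live.execute("DELETE FROM product WHERE id = 1")
live.commit()
reset_from_snapshot(live, snapshot)
```

After the reset, `dump(live)` matches `baseline` again, and seeding a second database with the same seed yields identical rows. That is the property a reproducible task run depends on.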

5-minute quickstart

Prerequisites: Python 3.11+, Docker, Node 20+ (only if you plan to use the Stagehand adapter), and Chromium for Playwright.

# 1. Install resurf with the browser-use adapter
pip install 'resurf[browser-use]'   # imports as `resurf`

# --- Local-development install (use this until resurf is published to PyPI) ---
# git clone https://github.com/revar-ai/resurf
# cd resurf
# python3.11 -m venv .venv && source .venv/bin/activate
# pip install -e packages/shared-models
# pip install -e 'packages/core-py[dev,browser-use]'

# 2. Install Chromium for Playwright (one-time)
playwright install chromium

# 3. Start shop_v1
docker compose up -d shop_v1
# Wait ~3s for the container to be healthy
curl -s http://localhost:8080/api/health

# 4. List bundled tasks
resurf task list

# 5. Run a task with the browser-use adapter
export OPENAI_API_KEY=sk-...
python examples/run_browser_use.py tasks/shop_v1/find/find_product_by_name.yaml

# Tip: set REVAR_HEADED=1 to watch the browser drive in a visible window.
#   REVAR_HEADED=1 python examples/run_browser_use.py <task.yaml>
# Works for all four example runners (browser-use, scripted, vision, stagehand).

You should see something like:

[resurf] reset shop_v1 to seed=42
[resurf] running browser-use agent on find_product_by_name
[resurf] step 1: nav http://localhost:8080/
[resurf] step 2: click [aria-label="Search"]
[resurf] step 3: type "Acme Bluetooth Speaker"
...
[resurf] passed=True steps=7 tokens=4123 wall_clock=12.4s
[resurf] trajectory saved to ./trajectories/...
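Step 3 of the quickstart waits a fixed ~3 seconds before hitting /api/health. In scripts or CI it may be more robust to poll the endpoint instead. The helper below is a sketch using only the standard library; the function names, retry count, and delay are illustrative choices, not part of resurf.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, reset, timeout, or an HTTP error status.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8080/api/health",
    attempts: int = 20,
    delay_s: float = 0.5,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no sleep after the final failed attempt
    return False
```

Calling `wait_until_healthy()` after `docker compose up -d shop_v1` returns True once the health endpoint responds, or False after roughly ten seconds with the defaults above. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running container.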

Running shop_v1 backend without Docker

Useful if you want to iterate on the backend or read live SQL while debugging:

pip install -e sites/shop_v1/backend
REVAR_TEST_MODE=1 uvicorn app.main:app --reload --port 8080 --app-dir sites/shop_v1/backend

For frontend hacking, run cd sites/shop_v1/frontend && npm install && npm run dev. This starts Vite on port 5173 and proxies /api/* to the backend.

Running tests

The full Python test suite runs against resurf and resurf-models:

# From the repo root, with the venv from quickstart activated
pytest

This is also exactly what CI runs. Useful subsets:

pytest packages/core-py            # SDK + CLI + task validation
pytest packages/shared-models      # SQLModel schema and seed determinism
pytest -k schema                   # task YAML schema validation only

End-to-end smoke test against a running shop_v1 (Docker container or local uvicorn):

docker compose up -d shop_v1
python examples/run_scripted.py tasks/shop_v1/find/find_product_by_name.yaml
# expects: passed=True

Testing the Stagehand adapter

Stagehand runs in Node, so the Stagehand adapter shells out to adapters/stagehand/bridge.mjs over stdio. One-time setup:

# 1. Install Node 20+, then install Stagehand into the bridge folder
cd adapters/stagehand
npm install @browserbasehq/stagehand

# 2. (Optional) Smoke-test the bridge in isolation — proves Node + Stagehand are wired up
#    Requires a running shop_v1 on :8080 and OPENAI_API_KEY set.
export OPENAI_API_KEY=sk-...
npm run smoke
# Expect: a single-line JSON `{ "steps": [...], ... }` and exit 0.

Then drive a real task through the Python adapter:

cd ../..    # back to repo root
docker compose up -d shop_v1
python examples/run_stagehand.py tasks/shop_v1/find/find_acme_bluetooth_speaker.yaml
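To make the "shells out over stdio" idea concrete, here is a rough sketch of a Python caller that sends one JSON request to a child process and reads a single-line JSON reply. The request shape (`goal`) and the helper name `run_bridge` are hypothetical; the real protocol spoken by bridge.mjs is not documented here. A Python child process stands in for Node so the sketch runs anywhere.

```python
import json
import subprocess
import sys
from typing import Any, Sequence


def run_bridge(
    command: Sequence[str], request: dict[str, Any], timeout_s: float = 300.0
) -> dict[str, Any]:
    """Send one JSON request on stdin; parse the last non-empty stdout line as JSON."""
    proc = subprocess.run(
        list(command),
        input=json.dumps(request) + "\n",
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"bridge exited with {proc.returncode}: {proc.stderr.strip()}")
    lines = [line for line in proc.stdout.splitlines() if line.strip()]
    if not lines:
        raise RuntimeError("bridge produced no output")
    return json.loads(lines[-1])


# Stand-in child process (assumption): echoes the goal back with an empty step list.
# A real call might use ["node", "adapters/stagehand/bridge.mjs"] instead.
FAKE_BRIDGE = [
    sys.executable,
    "-c",
    "import json, sys; req = json.loads(sys.stdin.readline()); "
    "print(json.dumps({'steps': [], 'goal': req['goal']}))",
]

result = run_bridge(FAKE_BRIDGE, {"goal": "find the speaker"})
```

Reading only the last non-empty line tolerates any log lines the child prints before its final JSON result, which fits the "single-line JSON and exit 0" expectation from the smoke test.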

Authoring your own task

The fastest path is to clone an existing failure-mode template:

resurf task from-template checkout/payment_declined_recovery \
    -p product_name="Acme Bluetooth Speaker" \
    -p product_slug=acme-bluetooth-speaker \
    --out tasks/shop_v1/checkout/

resurf task validate tasks/shop_v1/checkout/acme_bluetooth_speaker_payment_declined_recovery.yaml
resurf task try      tasks/shop_v1/checkout/acme_bluetooth_speaker_payment_declined_recovery.yaml

Or write YAML by hand — see docs/architecture.md for the schema.

Modifiers (failure-mode injection)

Modifiers are how a task says "make the site behave like X." They're set under modifiers: in the task YAML, and resurf applies them at reset time via POST /__test__/configure. Nothing in shop_v1's product code needs to know about specific failure modes — middlewares handle them transparently.

Available modifiers

| Key | Type | Values / shape | Default | Effect |
| --- | --- | --- | --- | --- |
| latency_profile | string | none, fast, realistic, or slow_3g | fast | Sleeps before each /api/* response per the profile's per-route (min, max) seconds. |
| payment_outcome | string, list, or object | success, declined, 3ds_required, or timeout; or [..., ...] for a sequence consumed in order; or { sequence: [...] } | success | Drives /api/checkout/confirm. Sequences let you script "decline once, then succeed." |
| server_error_rate | number (0.0–1.0) | e.g. 0.1 for ~10% 503s | 0.0 | Probability of injecting a 503 on requests matching server_error_paths. |
| server_error_paths | list[string] | e.g. ["/api/products", "/api/cart"] | ["/api/products"] | URL prefixes eligible for the error injector above. |
| session_ttl_s | int or null | seconds | site default | Forces a shorter session expiry; useful for testing mid-flow re-auth. |
| frozen_time_iso | string (ISO 8601) | e.g. 2026-05-01T12:00:00Z | unset | Pins server-side "now"; use it to make coupon expiry / order timestamps deterministic. |
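The shapes listed for each modifier can be checked before a task is run. The sketch below mirrors the documented keys and value sets in plain Python; it is not resurf's validator (resurf task validate is the supported path), and the function names are made up for illustration.

```python
from datetime import datetime
from typing import Any

LATENCY_PROFILES = {"none", "fast", "realistic", "slow_3g"}
PAYMENT_OUTCOMES = {"success", "declined", "3ds_required", "timeout"}
KNOWN_KEYS = {
    "latency_profile", "payment_outcome", "server_error_rate",
    "server_error_paths", "session_ttl_s", "frozen_time_iso",
}


def normalize_payment_outcome(value: Any) -> list[str]:
    """Turn the string, list, or {sequence: [...]} form into a plain list."""
    if isinstance(value, str):
        outcomes = [value]
    elif isinstance(value, list):
        outcomes = value
    elif isinstance(value, dict) and isinstance(value.get("sequence"), list):
        outcomes = value["sequence"]
    else:
        raise ValueError(f"unsupported payment_outcome shape: {value!r}")
    for outcome in outcomes:
        if not isinstance(outcome, str) or outcome not in PAYMENT_OUTCOMES:
            raise ValueError(f"unknown payment outcome: {outcome!r}")
    return list(outcomes)


def check_modifiers(modifiers: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the block looks valid."""
    problems = [f"unknown key: {key}" for key in modifiers if key not in KNOWN_KEYS]

    profile = modifiers.get("latency_profile", "fast")
    if not isinstance(profile, str) or profile not in LATENCY_PROFILES:
        problems.append(f"bad latency_profile: {profile!r}")

    try:
        normalize_payment_outcome(modifiers.get("payment_outcome", "success"))
    except ValueError as exc:
        problems.append(str(exc))

    rate = modifiers.get("server_error_rate", 0.0)
    if isinstance(rate, bool) or not isinstance(rate, (int, float)) or not 0.0 <= rate <= 1.0:
        problems.append(f"server_error_rate must be in [0.0, 1.0]: {rate!r}")

    paths = modifiers.get("server_error_paths", ["/api/products"])
    if not isinstance(paths, list) or not all(isinstance(p, str) for p in paths):
        problems.append("server_error_paths must be a list of strings")

    ttl = modifiers.get("session_ttl_s")
    if ttl is not None and (isinstance(ttl, bool) or not isinstance(ttl, int)):
        problems.append("session_ttl_s must be an int or null")

    frozen = modifiers.get("frozen_time_iso")
    if frozen is not None:
        try:
            # Map a trailing "Z" to an explicit offset for older Python versions.
            datetime.fromisoformat(str(frozen).replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"frozen_time_iso is not ISO 8601: {frozen!r}")
    return problems
```

For instance, `check_modifiers({"latency_profile": "warp"})` reports one problem, while a block that uses only the documented values returns an empty list.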

Example: payment-declined-then-succeed checkout

id: shop_v1.checkout.acme_speaker_decline_recovery
site: shop_v1
goal: |
  Buy 1 Acme Bluetooth Speaker. Your first payment will be declined; retry
  to recover and complete the order.
modifiers:
  latency_profile: realistic
  payment_outcome:
    sequence: [declined, success]   # first /confirm fails, second succeeds
  server_error_rate: 0.0
  frozen_time_iso: "2026-05-01T12:00:00Z"
budget:
  max_steps: 25
  max_wall_clock_s: 120
success:
  type: state_predicate
  query: SELECT COUNT(*) AS count FROM "order" WHERE status = 'paid'
  predicate: result == 1
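The success block above pairs a SQL query with a predicate over its result. One plausible reading, sketched below, binds `result` to the first column of the first row and evaluates a single comparison against it. How resurf actually binds `result` and which predicate syntax it accepts are assumptions here; the toy "order" table is likewise invented for the example.

```python
import ast
import operator
import sqlite3

_COMPARATORS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
}


def check_predicate(predicate: str, result: object) -> bool:
    """Evaluate a predicate of the form `result <op> <constant>` without eval()."""
    node = ast.parse(predicate, mode="eval").body
    if not (
        isinstance(node, ast.Compare)
        and len(node.ops) == 1
        and isinstance(node.left, ast.Name)
        and node.left.id == "result"
        and isinstance(node.comparators[0], ast.Constant)
    ):
        raise ValueError(f"unsupported predicate: {predicate!r}")
    compare = _COMPARATORS.get(type(node.ops[0]))
    if compare is None:
        raise ValueError(f"unsupported operator in: {predicate!r}")
    return bool(compare(result, node.comparators[0].value))


def evaluate_state_predicate(conn: sqlite3.Connection, query: str, predicate: str) -> bool:
    """Run the query and test the first column of the first row (assumed binding)."""
    row = conn.execute(query).fetchone()
    result = row[0] if row is not None else None
    return check_predicate(predicate, result)


# Toy database standing in for shop_v1's state after an agent run.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "order" (id INTEGER PRIMARY KEY, status TEXT)')
conn.executemany('INSERT INTO "order" (status) VALUES (?)', [("paid",), ("pending",)])
conn.commit()

QUERY = """SELECT COUNT(*) AS count FROM "order" WHERE status = 'paid'"""
passed = evaluate_state_predicate(conn, QUERY, "result == 1")
```

Because the verdict comes from database state, the same trajectory yields the same pass/fail result on every replay, which is the point of the "DB state, not LLM judge" criterion.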

Configuring at runtime (without a task)

The same knobs are available via the admin API — handy for ad-hoc poking or building your own runner:

curl -s http://localhost:8080/__test__/configure -H 'content-type: application/json' \
  -d '{"latency_profile":"slow_3g","payment_outcome":"3ds_required"}'

Or from Python:

env = Environment(site="shop_v1")
env.configure({"latency_profile": "slow_3g", "payment_outcome": "3ds_required"})
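If you are building your own runner without the SDK, the curl call above translates directly to the standard library. This is a sketch: the endpoint path and JSON body come from the curl example, while the helper names and the choice to return only the HTTP status are assumptions (the response body format is not documented here).

```python
import json
import urllib.request
from typing import Any


def build_configure_request(
    modifiers: dict[str, Any], base_url: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build the POST /__test__/configure request without sending it."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/__test__/configure",
        data=json.dumps(modifiers).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def configure(
    modifiers: dict[str, Any],
    base_url: str = "http://localhost:8080",
    timeout_s: float = 5.0,
) -> int:
    """Send the modifiers to a running site and return the HTTP status code."""
    request = build_configure_request(modifiers, base_url)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return resp.status


request = build_configure_request(
    {"latency_profile": "slow_3g", "payment_outcome": "3ds_required"}
)
```

With shop_v1 running, `configure({"latency_profile": "slow_3g"})` sends the same request as the curl command. Building the request separately makes it easy to inspect or log before sending.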

Adding a new modifier

The contract is intentionally small. A modifier is just a key in ModifierConfig plus the middleware/handler that reads it. Concretely:

  1. Add the field in sites/shop_v1/backend/app/modifiers.py (ModifierConfig dataclass, reset(), update(), to_dict()).
  2. Implement the behavior — either as a new ASGI middleware in sites/shop_v1/backend/app/middleware/, or inline in the API handler that should react to it.
  3. Register it in sites/shop_v1/backend/app/main.py (only if it's a middleware).
  4. Document the YAML key in this README's table and in docs/architecture.md's modifier table.
  5. (Optional) Update the task JSON Schema at packages/core-py/resurf/schemas/task.schema.json if you want strict validation of the new key.

A good first example to copy is LatencyMiddleware (one file, ~30 lines) — it shows the read-config-and-act pattern end-to-end.
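The steps above can be sketched end to end with a hypothetical modifier. Everything in this example is illustrative: `maintenance_mode` is an invented key, the `ModifierConfig` shown is a simplified stand-in for the real dataclass in app/modifiers.py, and the middleware is written as plain ASGI so it runs without FastAPI installed.

```python
import asyncio
from dataclasses import asdict, dataclass, fields


@dataclass
class ModifierConfig:
    latency_profile: str = "fast"
    maintenance_mode: bool = False  # hypothetical new modifier (step 1)

    def reset(self) -> None:
        """Restore every field to its declared default."""
        for f in fields(self):
            setattr(self, f.name, f.default)

    def update(self, values: dict) -> None:
        """Apply a partial update, rejecting unknown keys."""
        known = {f.name for f in fields(self)}
        for key, value in values.items():
            if key not in known:
                raise KeyError(f"unknown modifier: {key}")
            setattr(self, key, value)

    def to_dict(self) -> dict:
        return asdict(self)


class MaintenanceMiddleware:
    """Plain ASGI middleware (step 2): answer 503 on /api/* while the flag is on."""

    def __init__(self, app, config: ModifierConfig):
        self.app = app
        self.config = config

    async def __call__(self, scope, receive, send):
        if (
            scope["type"] == "http"
            and self.config.maintenance_mode
            and scope["path"].startswith("/api/")
        ):
            await send({
                "type": "http.response.start",
                "status": 503,
                "headers": [(b"content-type", b"text/plain")],
            })
            await send({"type": "http.response.body", "body": b"maintenance"})
            return
        await self.app(scope, receive, send)


async def ok_app(scope, receive, send):
    """Trivial downstream app that always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


def status_for(app, path: str) -> int:
    """Drive one fake GET through the ASGI app and return the response status."""
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    asyncio.run(app({"type": "http", "method": "GET", "path": path}, receive, send))
    return sent[0]["status"]


config = ModifierConfig()
app = MaintenanceMiddleware(ok_app, config)
```

In a FastAPI app, step 3 would typically be a single `app.add_middleware(...)` call in main.py. The middleware reads the shared config on every request, so a later POST /__test__/configure takes effect without a restart.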

Adding a new site

A site is just an HTTP service that implements six admin endpoints — GET /api/health, POST /__test__/{reset,configure,query,freeze_time}, and GET /__test__/state — guarded by REVAR_TEST_MODE=1. The contract is HTTP-shaped, so any backend stack works.

To add one: scaffold under sites/<your_site>/, add a deterministic seeder under packages/shared-models/resurf_models/<your_site>/, register a service block in docker-compose.yml, and drop tasks under tasks/<your_site>/ with site: <your_site> in the YAML. Copy from shop_v1: app/modifiers.py and api/test_endpoints.py are reusable nearly verbatim. Contributions welcome.
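As a rough checklist for a new site, the six endpoints can be written out explicitly and compared against what your service exposes. The helper names below are hypothetical, and the guard assumes that only the /__test__/* routes are gated by REVAR_TEST_MODE=1, which is one reading of the contract rather than a confirmed detail.

```python
import os
from typing import Mapping, Optional

# The six admin endpoints named in the site contract.
ADMIN_ENDPOINTS = [
    ("GET", "/api/health"),
    ("POST", "/__test__/reset"),
    ("POST", "/__test__/configure"),
    ("POST", "/__test__/query"),
    ("POST", "/__test__/freeze_time"),
    ("GET", "/__test__/state"),
]


def test_mode_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True when REVAR_TEST_MODE=1 is set in the given (or process) environment."""
    env = os.environ if environ is None else environ
    return env.get("REVAR_TEST_MODE") == "1"


def missing_endpoints(implemented: set) -> list:
    """List the contract endpoints that a site does not implement yet."""
    return [endpoint for endpoint in ADMIN_ENDPOINTS if endpoint not in implemented]


def is_request_allowed(
    method: str, path: str, environ: Optional[Mapping[str, str]] = None
) -> bool:
    """Assumed guard: /__test__/* is reachable only in test mode."""
    if path.startswith("/__test__/"):
        return test_mode_enabled(environ)
    return True


# Example: a half-finished site that only has health and reset so far.
todo = missing_endpoints({("GET", "/api/health"), ("POST", "/__test__/reset")})
```

Here `todo` lists the four endpoints still to be written. Since the contract is plain HTTP, the same list could drive a smoke test against a site built on any stack.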

Releasing

Cut a release by bumping versions with scripts/bump_version.py X.Y.Z, updating CHANGELOG.md, and pushing a vX.Y.Z tag. CI handles the PyPI upload (via Trusted Publishing) and the GitHub Release. Full runbook in RELEASING.md.

License

resurf is licensed under the Apache License 2.0. See LICENSE and NOTICE.

Contributing

Contributions of new tasks, templates, and bug fixes are very welcome. By submitting a pull request, you agree that your contribution is licensed under Apache 2.0 (per Section 5 of the license). For larger changes, please open a GitHub Discussion first so we can align on direction. See CONTRIBUTING.md for details.

Roadmap

See docs/roadmap.md. Headline v1 items: a correlation study against real OSS shop demos, adversarial modifiers (CAPTCHA, anti-bot, dark patterns), prompt-driven task generation, a second site vertical, and a native Node SDK.
