A deterministic, reproducible test environment for AI browser agents
Project description
Resurf
Realistic, Reproducible Test Framework for AI Browser Agents
Systematic testing of browser agents today is not easy: real websites are flaky, rate-limited and expensive (bot unblocking), while static-HTML benchmarks lack state and dynamic behavior. Resurf gives your browser agent a realistic, stateful, instrumented framework — built on synthetic websites with failure-mode injection.
| Mind2Web | WebVoyager | Resurf | |
|---|---|---|---|
| Realistic, dynamic, interactive environment | ❌ | ✅ | ✅ |
| Deterministic & reproducible | ✅ | ❌ | ✅ |
| Failure-mode injection (latency, payment errors, 5xx) | ❌ | ❌ | ✅ |
| Auditable success eval (DB state, not LLM judge) | ❌ | ❌ | ✅ |
| No dependency on live websites | ✅ | ❌ | ✅ |
What's in v0
shop_v1— A production-shaped synthetic e-commerce site (FastAPI + React + SQLite) running in a single Docker container. Catalog, cart, multi-step checkout, auth, accounts, returns. Mobile-responsive. Deliberate ambiguous-UI inventory for testing agent reasoning.- Modifiers (failure-mode injection) — FastAPI middleware that toggles network latency, payment outcomes (declined / 3DS / timeout), server error rates, and session expiration per task. Configured per-task in YAML; no code changes to add new failure-mode combinations.
- Authoring stack — A YAML task schema, ~10 failure-mode templates with scripted reference trajectories, and a CLI:
task new,task from-template,task validate,task try. - Adapters —
browser-use(DOM/AX-tree native),stagehand(via Node subprocess), and a vision-only baseline. All optional installs. - Trajectory recording — Per-step DOM snapshots, screenshots, agent actions, token counts, latencies. Deterministic via SQLite snapshot reset and seeded faker.
v0 ships a single site — shop_v1 — to keep the focus on getting the abstractions (Environment, modifiers, success predicates, the /__test__/* admin protocol) right against several adapters. E-commerce was picked because it exercises forms, multi-step flows, auth, money, and obvious failure-mode hooks in a small amount of code. Adding more sites is now content work, not platform work — see Adding a new site.
5-minute quickstart
Prerequisites: Python 3.11+, Docker, Node 20+ (only if you plan to use the Stagehand adapter), and Chromium for Playwright.
# 1. Install resurf with the browser-use adapter
pip install 'resurf[browser-use]' # imports as `resurf`
# --- Local-development install (use this until resurf is published to PyPI) ---
# git clone https://github.com/revar-ai/resurf
# cd resurf
# python3.11 -m venv .venv && source .venv/bin/activate
# pip install -e packages/shared-models
# pip install -e 'packages/core-py[dev,browser-use]'
# 2. Install Chromium for Playwright (one-time)
playwright install chromium
# 3. Start shop_v1
docker compose up -d shop_v1
# Wait ~3s for the container to be healthy
curl -s http://localhost:8080/api/health
# 4. List bundled tasks
resurf task list
# 5. Run a task with the browser-use adapter
export OPENAI_API_KEY=sk-...
python examples/run_browser_use.py tasks/shop_v1/find/find_product_by_name.yaml
# Tip: set REVAR_HEADED=1 to watch the browser drive in a visible window.
# REVAR_HEADED=1 python examples/run_browser_use.py <task.yaml>
# Works for all four example runners (browser-use, scripted, vision, stagehand).
You should see something like:
[resurf] reset shop_v1 to seed=42
[resurf] running browser-use agent on find_product_by_name
[resurf] step 1: nav https://localhost:8080/
[resurf] step 2: click [aria-label="Search"]
[resurf] step 3: type "Acme Bluetooth Speaker"
...
[resurf] passed=True steps=7 tokens=4123 wall_clock=12.4s
[resurf] trajectory saved to ./trajectories/...
Running shop_v1 backend without Docker
Useful if you want to iterate on the backend or read live SQL while debugging:
pip install -e sites/shop_v1/backend
REVAR_TEST_MODE=1 uvicorn app.main:app --reload --port 8080 --app-dir sites/shop_v1/backend
For frontend hacking, cd sites/shop_v1/frontend && npm install && npm run dev runs Vite on port 5173 and proxies /api/* to the backend.
Running tests
The full Python test suite runs against resurf and resurf-models:
# From the repo root, with the venv from quickstart activated
pytest
This is also exactly what CI runs. Useful subsets:
pytest packages/core-py # SDK + CLI + task validation
pytest packages/shared-models # SQLModel schema and seed determinism
pytest -k schema # task YAML schema validation only
End-to-end smoke test against a running shop_v1 (Docker container or local uvicorn):
docker compose up -d shop_v1
python examples/run_scripted.py tasks/shop_v1/find/find_product_by_name.yaml
# expects: passed=True
Testing the Stagehand adapter
Stagehand runs in Node, so the Stagehand adapter shells out to adapters/stagehand/bridge.mjs over stdio. One-time setup:
# 1. Install Node 20+, then install Stagehand into the bridge folder
cd adapters/stagehand
npm install @browserbasehq/stagehand
# 2. (Optional) Smoke-test the bridge in isolation — proves Node + Stagehand are wired up
# Requires a running shop_v1 on :8080 and OPENAI_API_KEY set.
export OPENAI_API_KEY=sk-...
npm run smoke
# Expect: a single-line JSON `{ "steps": [...], ... }` and exit 0.
Then drive a real task through the Python adapter:
cd ../.. # back to repo root
docker compose up -d shop_v1
python examples/run_stagehand.py tasks/shop_v1/find/find_acme_bluetooth_speaker.yaml
Authoring your own task
The fastest path is to clone an existing failure-mode template:
resurf task from-template checkout/payment_declined_recovery \
-p product_name="Acme Bluetooth Speaker" \
-p product_slug=acme-bluetooth-speaker \
--out tasks/shop_v1/checkout/
resurf task validate tasks/shop_v1/checkout/acme_bluetooth_speaker_payment_declined_recovery.yaml
resurf task try tasks/shop_v1/checkout/acme_bluetooth_speaker_payment_declined_recovery.yaml
Or write YAML by hand — see docs/architecture.md for the schema.
Modifiers (failure-mode injection)
Modifiers are how a task says "make the site behave like X." They're set under modifiers: in the task YAML, and resurf applies them at reset time via POST /__test__/configure. Nothing in shop_v1's product code needs to know about specific failure modes — middlewares handle them transparently.
Available modifiers
| Key | Type | Values / shape | Default | Effect |
|---|---|---|---|---|
latency_profile |
string | none | fast | realistic | slow_3g |
fast |
Sleeps before each /api/* response per the profile's per-route (min, max) seconds. |
payment_outcome |
string | list | object | success | declined | 3ds_required | timeout, or [..., ...] for a sequence consumed in order, or { sequence: [...] } |
success |
Drives /api/checkout/confirm. Sequences let you script "decline once, then succeed." |
server_error_rate |
number (0.0–1.0) | e.g. 0.1 for ~10% 503s |
0.0 |
Probability of injecting a 503 on requests matching server_error_paths. |
server_error_paths |
list[string] | e.g. ["/api/products", "/api/cart"] |
["/api/products"] |
URL-prefixes eligible for the error injector above. |
session_ttl_s |
int | null | seconds | site default | Forces a shorter session expiry — useful for testing mid-flow re-auth. |
frozen_time_iso |
string (ISO 8601) | e.g. 2026-05-01T12:00:00Z |
unset | Pins server-side "now" — use it to make coupon expiry / order timestamps deterministic. |
Example: payment-declined-then-succeed checkout
id: shop_v1.checkout.acme_speaker_decline_recovery
site: shop_v1
goal: |
Buy 1 Acme Bluetooth Speaker. Your first payment will be declined; retry
to recover and complete the order.
modifiers:
latency_profile: realistic
payment_outcome:
sequence: [declined, success] # first /confirm fails, second succeeds
server_error_rate: 0.0
frozen_time_iso: "2026-05-01T12:00:00Z"
budget:
max_steps: 25
max_wall_clock_s: 120
success:
type: state_predicate
query: SELECT COUNT(*) AS count FROM "order" WHERE status = 'paid'
predicate: result == 1
Configuring at runtime (without a task)
The same knobs are available via the admin API — handy for ad-hoc poking or building your own runner:
curl -s http://localhost:8080/__test__/configure -H 'content-type: application/json' \
-d '{"latency_profile":"slow_3g","payment_outcome":"3ds_required"}'
Or from Python:
env = Environment(site="shop_v1")
env.configure({"latency_profile": "slow_3g", "payment_outcome": "3ds_required"})
Adding a new modifier
The contract is intentionally small. A modifier is just a key in ModifierConfig plus the middleware/handler that reads it. Concretely:
- Add the field in
sites/shop_v1/backend/app/modifiers.py(ModifierConfigdataclass,reset(),update(),to_dict()). - Implement the behavior — either as a new ASGI middleware in
sites/shop_v1/backend/app/middleware/, or inline in the API handler that should react to it. - Wire it in in
sites/shop_v1/backend/app/main.py(only if it's a middleware). - Document the YAML key in this README's table and in
docs/architecture.md's modifier table. - (Optional) Update the task JSON Schema at
packages/core-py/resurf/schemas/task.schema.jsonif you want strict validation of the new key.
A good first example to copy is LatencyMiddleware (one file, ~30 lines) — it shows the read-config-and-act pattern end-to-end.
Adding a new site
A site is just an HTTP service that implements six admin endpoints — GET /api/health, POST /__test__/{reset,configure,query,freeze_time}, and GET /__test__/state — guarded by REVAR_TEST_MODE=1. The contract is HTTP-shaped, so any backend stack works.
To add one: scaffold under sites/<your_site>/, add a deterministic seeder under packages/shared-models/resurf_models/<your_site>/, register a service block in docker-compose.yml, and drop tasks under tasks/<your_site>/ with site: <your_site> in the YAML. Copy from shop_v1 — app/modifiers.py and api/test_endpoints.py are near-verbatim reusable. Contributions welcome.
Releasing
Cut a release by bumping versions with scripts/bump_version.py X.Y.Z, updating CHANGELOG.md, and pushing a vX.Y.Z tag. CI handles the PyPI upload (via Trusted Publishing) and the GitHub Release. Full runbook in RELEASING.md.
License
resurf is licensed under the Apache License 2.0. See LICENSE and NOTICE.
Contributing
Contributions of new tasks, templates, and bug fixes are very welcome. By submitting a pull request, you agree that your contribution is licensed under Apache 2.0 (per Section 5 of the license). For larger changes, please open a GitHub Discussion first so we can align on direction. See CONTRIBUTING.md for details.
Roadmap
See docs/roadmap.md. Headline v1 items: a correlation study against real OSS shop demos, adversarial modifiers (CAPTCHA, anti-bot, dark patterns), prompt-driven task generation, a second site vertical, and a native Node SDK.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file resurf-0.1.1.tar.gz.
File metadata
- Download URL: resurf-0.1.1.tar.gz
- Upload date:
- Size: 24.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
fadc0a09919221cb122c5ad1af9f99a377d63203c5ddbc7c4b10423c9166a694
|
|
| MD5 |
8ddc432d798f64f1e2e07ac59d26613f
|
|
| BLAKE2b-256 |
c2635fb3ab60e396ade7ad6bea38166141dedd4e85ea41ce4ea16791dbbaef6d
|
File details
Details for the file resurf-0.1.1-py3-none-any.whl.
File metadata
- Download URL: resurf-0.1.1-py3-none-any.whl
- Upload date:
- Size: 30.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
25b98d9dcdbeacecb022b9b5f7a17e5b7de205d64708e42d4c61f9fe7429244c
|
|
| MD5 |
a2d8e50ccefaf70a5eda7dcf55b028c9
|
|
| BLAKE2b-256 |
04a87ef1c5a3522075dfdbfcd689e5fac8eded9c8850e212a5ff0dfff925c7d1
|