AI-driven UI automation testing framework with pluggable platform adapters.
vibe-tester
AI-driven UI automation testing for desktop and web apps — Cucumber-style tests, pluggable platform adapters, ships with the AI assets your coding agent needs to author and run them.
Status: alpha. Public API may change. The windows-desktop and
web adapters are implemented; the macos adapter is a stub.
What it does
- Lets you describe a UI test in natural language (in Copilot Chat, Claude CLI, Cursor, …) and generates a runnable Gherkin `.feature` file using real element locators from your project's element store.
- Executes scenarios at any granularity (one feature, all of them, or a tag expression) and produces a Markdown report plus optional JSON output for the AI to parse.
- Walks your app interactively with you to record UI element paths into a YAML store the executor can resolve.
The framework ships AI assets (agents, skills, an AGENTS.md
template) and a deterministic CLI (vibe-tester). It does not
embed an LLM and does not run an MCP server: your AI tool of choice
provides the intelligence, and the CLI is the integration surface.
Do I need an AI agent?
No. The framework is a Cucumber/behave runner with a UI-automation adapter and a YAML element vocabulary — you can author and run tests entirely by hand. The shipped agents are productivity multipliers, not runtime dependencies.
| Capability | AI needed? |
|---|---|
| Run tests (`vibe-tester run …`) | No |
| Write `.feature` files by hand using `elements.yaml` | No |
| Element collection: basic capture (`vibe-tester collect …`) | No |
| Element collection: interactive "navigate to the next page" loop | Recommended (agent) |
| Project customizations (`features/hooks/`, `features/steps/steps.py`) | No |
| Visual regression baselines + assertions | No |
| `@setup:` / `@clean:` tag-driven scenario isolation | No |
| Markdown / JSON reports | No |
| Translating a natural-language request → `.feature` | Yes (Test Writer) |
| Structured root-cause analysis on a failed scenario | Yes (Test Debugger) |
| Auto-proposing `@clean:` tags + handler stubs from element `role:` | Yes (Test Writer) |
| Detecting unmapped step phrases + scaffolding custom-step stubs | Yes (Test Writer) |
Bottom line: CLI + framework run standalone. The agents add
natural-language authoring and structured failure triage. If you don't
have Copilot / Claude CLI / Cursor available, skip the .github/agents/
prompts and write .feature files directly — every step phrase the
runner accepts is documented in the uia-assertions and
element-locators skill files (also shipped to your project, plain
Markdown, readable without an LLM).
Install
# default — every adapter that ships today
pip install vibe-tester
# pick one (smaller install)
pip install vibe-tester[windows-desktop]
# pick several
pip install vibe-tester[windows-desktop,web]
| Extra | Drives | Status |
|---|---|---|
| `windows-desktop` | WinUI3 / Win32 / WPF / WebView2 / tray / shell menu | Implemented |
| `web` | Browser SUTs (Playwright) | Implemented |
| `macos` | macOS-native SUTs | Stub |
Quickstart
# 1. Create a fresh test project (or scaffold into an existing folder)
mkdir my-app-tests
cd my-app-tests
vibe-tester init
# 2. Capture your SUT (interactive — your app should be running)
vibe-tester collect
# 3. Ask your AI agent (Copilot Chat / Claude CLI / …) to write a test:
# "Write a smoke test that opens Settings and verifies the title."
# The Test Writer agent uses elements.yaml + the framework's CLI.
# 4. Run it
vibe-tester run
Project layout — one project = one SUT:
After step 1 your project looks like:
my-app-tests/
├── AGENTS.md # AI instructions for this project
├── .github/
│ ├── agents/ # element-collector, test-writer, test-runner, test-debugger
│ └── skills/ # element-locators, uia-assertions, web-locators,
│ # web-assertions, image-testing, custom-steps,
│ # failure-diagnosis (adapter-relevant ones only)
└── features/
├── environment.py # framework glue — do not edit
└── steps/
└── _framework.py # framework glue — do not edit
After step 2 the element store is created at the project root:
my-app-tests/
├── elements.yaml # the element vocabulary your tests use
├── features/
│ ├── *.feature # Gherkin tests (the AI writes these)
│ ├── baselines/ # visual regression PNGs (optional)
│ ├── steps/
│ │ ├── _framework.py # framework glue — do not edit
│ │ └── steps.py # your custom step defs (optional)
│ └── hooks/ # optional
│ ├── environment.py # your before/after hooks
│ └── handlers.py # @setup: / @clean: tag handlers
└── ...
The project root is the SUT — there's no nested per-app folder.
Multiple SUTs (aggregation mode)
Some products span more than one surface — say an admin desktop tool whose changes must show up in a sibling website. vibe-tester lets you keep each surface as its own focused single-SUT project, then add an aggregation root on top that orchestrates integration scenarios across both. Layout:
my-product-tests/ ← aggregation root (NO elements.yaml here)
├── features/ ← integration scenarios only
│ ├── environment.py # framework glue — do not edit
│ ├── *.feature # uses `on "<sut>"` per-step prefix
│ └── steps/
│ ├── _framework.py # framework glue — do not edit
│ └── steps.py # integration custom steps (optional)
├── admin-tool/ ← child SUT #1 — full single-SUT layout
│ ├── elements.yaml
│ └── features/
│ └── ...
└── customer-site/ ← child SUT #2 — full single-SUT layout
├── elements.yaml
└── features/
└── ...
Mode is auto-detected when behave starts:
| Project root has… | Mode |
|---|---|
| `elements.yaml` | single SUT |
| no root `elements.yaml`, but at least one child folder has one | aggregation |
| neither | uninitialized |
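The detection rules in the table above can be sketched in a few lines of Python. This is an illustrative approximation, not the framework's own code: the function name `detect_mode` is made up for this example, and looking only at direct child folders is an assumption.

```python
from pathlib import Path


def detect_mode(root: Path) -> str:
    """Classify a project root using the rules in the mode table."""
    if (root / "elements.yaml").is_file():
        return "single SUT"
    # Assumption: only direct child folders count as candidate child SUTs.
    for child in sorted(root.iterdir()):
        if child.is_dir() and (child / "elements.yaml").is_file():
            return "aggregation"
    return "uninitialized"
```

Calling `detect_mode(Path("."))` from a project root is a quick way to check which mode a layout would fall into before starting a run.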
Integration scenarios use a per-step on "<sut>" prefix to name the
target SUT — the value matches the app.name declared inside that
child's elements.yaml, not the folder name:
Feature: Admin change shows up on the customer site
Scenario: Editing a theme propagates within 5 seconds
Given on "admin-tool" the app is open
When on "admin-tool" I click "themes.edit_button"
And on "admin-tool" I type "Sunset" into "themes.name_input"
Then on "customer-site" element "homepage.theme_banner" should be visible
The framework lazy-launches each SUT on first reference and shuts them
all down once the run finishes. Five integration phrasings ship out of the
box (the app is open, I click, I type … into, should exist,
should be visible); for anything beyond that, write custom steps in
features/steps/steps.py and look up the active SUT via
context.suts.get("<name>").
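A custom integration step usually wants a clear error when the SUT name is misspelled. The helper below is a minimal sketch of that lookup. It assumes `context.suts` behaves like a dict, so that `.get()` returns `None` for an unknown name; the helper name `resolve_sut` and the stand-in context are hypothetical.

```python
from types import SimpleNamespace


def resolve_sut(context, name: str):
    """Return the SUT handle registered under `name`, or fail with a clear message."""
    sut = context.suts.get(name)  # lookup documented in this README
    if sut is None:  # assumption: unknown names yield None, as with a dict
        raise LookupError(
            f'No SUT named "{name}"; check app.name in the child\'s elements.yaml'
        )
    return sut


# Stand-in for the behave context, for illustration only.
fake_context = SimpleNamespace(suts={"admin-tool": object()})
handle = resolve_sut(fake_context, "admin-tool")
```

Inside a real step definition in features/steps/steps.py you would pass the `context` argument that behave hands to the step. What the returned handle exposes is up to the adapter and is not shown here.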
Running vibe-tester run from the aggregation root executes only
the integration features at that root. To run a single child SUT's own
tests in isolation, cd into that child and run there — each child is
itself a fully-functional single-SUT project.
@setup: / @clean: handlers and @requires: flag-based skips are
single-SUT only — there's no one "active adapter" to scope them to in
aggregation mode.
CLI reference
| Command | What it does |
|---|---|
| `vibe-tester init [--target] [--adapter] [--overwrite] [--json]` | Scaffold a project from shipped assets |
| `vibe-tester list adapters [--json]` | Show installed adapters |
| `vibe-tester list features [--json]` | List .feature files |
| `vibe-tester list elements [--details] [--json]` | Print the project's element vocabulary |
| `vibe-tester collect [--name] [--kind]` | Interactive element capture |
| `vibe-tester run [--feature\|--tag] [--scenario] [--json]` | Execute behave + emit Markdown / JSON report |
All commands accept --json for machine-readable output (intended for
the AI agent to parse). Default output is human-friendly Rich tables
and Markdown reports under ./results/.
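If you script against the CLI yourself rather than through an agent, a thin wrapper along these lines may be enough. It is a sketch under stated assumptions: that the JSON payload is written to stdout and that a successful command exits with status 0. The payload schema is not documented here, so the wrapper returns whatever `json.loads` produces. The `runner` parameter is a hypothetical seam for testing.

```python
import json
import subprocess


def run_cli_json(args, runner=subprocess.run):
    """Run `vibe-tester <args> --json` and return the decoded JSON payload."""
    completed = runner(
        ["vibe-tester", *args, "--json"],
        capture_output=True,  # assumption: the JSON document arrives on stdout
        text=True,
        check=True,  # assumption: success means exit status 0
    )
    return json.loads(completed.stdout)
```

For example, `run_cli_json(["list", "features"])` would invoke `vibe-tester list features --json`. For `vibe-tester run` you may prefer `check=False`, since the exit status after failing scenarios is not specified above.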
How the AI assets work
vibe-tester init drops four agents and the adapter-relevant skills
into .github/ plus an AGENTS.md at the project root. Any AI
coding tool that follows the AGENTS.md convention —
Copilot, Claude CLI, Cursor, etc. — will pick them up automatically.
Skills are filtered by the adapter(s) you scaffold: a web-only
project won't get uia-assertions, and a windows-desktop-only
project won't get web-locators.
Agents (one each):
| Agent | Use when |
|---|---|
| Element Collector | Adding the SUT or new pages to it |
| Test Writer | Authoring .feature files from a natural-language ask |
| Test Runner | Executing tests and producing a Markdown report |
| Test Debugger | A test failed and you want a structured RCA |
Skills:
| Skill | Adapter | Topic |
|---|---|---|
| element-locators | windows-desktop | UIA locator syntax, dot-notation, element store schema |
| uia-assertions | windows-desktop | All assertion types the Windows adapter supports |
| web-locators | web | Playwright locator strategy and element store schema |
| web-assertions | web | All assertion types the web adapter supports |
| image-testing | any | Visual regression / baseline strategy |
| custom-steps | any | Authoring project-level custom Gherkin step definitions |
| failure-diagnosis | any | RCA methodology + known-issues catalog |
Spec-first delegation: handing a task to a coding agent
Use this workflow when you want to delegate a feature to a coding
agent (Copilot, Claude CLI, Cursor, …) and have a .feature file
serve as the binding acceptance contract — written and approved
before coding starts, untouched while coding happens, and proven
green when you come back.
What you get out of vibe-tester
vibe-tester is built around two files that, together, give you a spec you can sign off on up front:
- A Gherkin `.feature` file: what the feature must do, in business language. References UI elements by name only ("the Save Theme button"), never by selector.
- An `elements.yaml` entry per referenced element: the locator the agent commits to creating (`AutomationId=btn_save_theme`, `data-testid=themes-save`, …).
Because the .feature file holds no locators, freezing it after
your approval does not constrain how the UI is built. Because every
locator the test will ever try to use is declared in elements.yaml
before code is written, the agent has no room to redefine "done"
later — the test will fail unless the built UI exposes exactly those
locators.
The workflow, step by step
- You describe the task in plain English to your AI agent.
- The agent drafts two files and shows them to you:
  - `features/<feature>.feature`: the scenarios in semantic names.
  - New entries appended to `elements.yaml`: locator strings for every element the scenarios reference.
- You review and approve both files. Edit the prose, add missing scenarios, rename anything that smells like implementation detail. Approve when it reads like the acceptance criteria you'd write yourself.
- The agent codes against the approved contract. It writes product code, step glue, and unit tests, but it does not edit the approved `.feature` file. Treat it as locked.
- The agent runs `vibe-tester run`. A scenario passes only if the live UI exposes the locator declared in `elements.yaml`. Mismatches surface as test failures, not as silent edits.
- You come back to a Markdown report under `./results/` and decide whether to ship.
If you want belt-and-suspenders enforcement, commit the approved
.feature and elements.yaml in their own PR and protect them with
a CI check that fails on any subsequent change to either file
without a --rewrite-acceptance reason recorded in the commit.
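One way such a CI check could look is sketched below. The function name, the locked path patterns, and the idea of passing in the changed-file list and commit message as plain values are all assumptions for illustration. How you obtain them (for instance from `git diff --name-only`) depends on your CI system. The check only looks for the marker text; it does not validate that a reason follows it.

```python
import fnmatch

# Assumed locations of the approved contract files, relative to the repo root.
LOCKED_PATTERNS = ("elements.yaml", "features/*.feature")
OVERRIDE_MARKER = "--rewrite-acceptance"


def locked_files_touched(changed_files, commit_message):
    """Return locked files changed by a commit that carries no override marker."""
    if OVERRIDE_MARKER in commit_message:
        return []
    return [
        path
        for path in changed_files
        if any(fnmatch.fnmatch(path, pattern) for pattern in LOCKED_PATTERNS)
    ]
```

A CI job would fail the build whenever the returned list is non-empty and print the offending paths.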
How the agent picks locators before the UI exists
The instinct is to discover a locator by inspecting a built UI. That forces the test to be written after coding, which destroys its value as a prior commitment.
vibe-tester's workflow inverts this: the agent declares the
locator string in the same act that promises to render the element.
The locator file becomes a forward-looking contract — "I will ship
a button whose AutomationId is btn_save_theme" — not a recording
of what happened to be built. The implementation must satisfy the
contract, not the other way round.
This is reliable on every stack where the agent controls the source of the locator string:
| Your stack | Pre-commit a locator? | How |
|---|---|---|
| Web (React / Vue / Svelte / plain HTML) | Yes | Use a data-testid convention |
| WinUI 3 / WPF / UWP | Yes | Set AutomationProperties.AutomationId explicitly |
| Win32 / MFC | Mostly | Owned controls via control ID; wrap shell UI |
| iOS / Android native | Yes | accessibilityIdentifier / contentDescription |
| Closed-source 3rd-party widgets | Wrap first | Locate the wrapper you control |
| Auto-generated framework IDs (e.g. Angular) | Forbid | Require an explicit testid via lint |
Starting from your project type
Greenfield (the agent is also writing the app from scratch).
The easiest case. Tell your agent in AGENTS.md to (a) adopt one
naming convention for locators (e.g. every interactive element gets
data-testid shaped as <feature>-<role>-<purpose>) and (b) add a
lint rule that fails the build on any interactive element missing
the attribute. From there, every feature PR appends to
elements.yaml before any code is written.
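For plain HTML output, the lint rule in (b) can be approximated with the standard library alone. This is a minimal sketch: the set of tags treated as interactive is an assumption, the class and function names are made up, and JSX or template sources would need a parser for that syntax instead.

```python
from html.parser import HTMLParser

# Assumed definition of "interactive element"; adjust to your project.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class MissingTestIdFinder(HTMLParser):
    """Collects interactive start tags that lack a non-empty data-testid."""

    def __init__(self):
        super().__init__()
        self.missing = []  # (tag, line number) pairs

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS and not dict(attrs).get("data-testid"):
            self.missing.append((tag, self.getpos()[0]))


def find_missing_testids(html: str):
    finder = MissingTestIdFinder()
    finder.feed(html)
    finder.close()
    return finder.missing
```

A build step would run `find_missing_testids` over each rendered page or static template and exit non-zero if any result comes back.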
Brownfield with an existing elements.yaml. Point the agent at
the file and tell it to follow the existing convention for new
elements. The store itself is the reasoning input.
Brownfield without an elements.yaml yet. Run
vibe-tester collect once against the current build as a one-time
baseline. The agent then has both a snapshot of what exists and a
sample of the project's locator style to imitate. After that single
pass the project behaves like the case above.
What to watch for
Three failure modes are worth naming up front:
- Locator typos. The agent writes `data-testid="save-theme"` in `elements.yaml` but ships JSX with `save_theme` or no testid at all. The corresponding test scenario will fail on element lookup (which is the point), but you should treat that failure as the agent breaking its own contract, not as a flaky test.
- Convention drift. Across many features the agent invents slightly different naming schemes. Add a one-line CI check that greps `elements.yaml` for entries that don't match your convention regex; drift becomes a build failure rather than review burden.
- Semantic names that leak implementation. "the third div in the sidebar" is a locator in disguise. Keep names role-based ("Recently used themes list") so the spec stays implementation-agnostic and the agent retains room to build the UI well.
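The convention-drift check can be a plain text scan, which avoids depending on the exact `elements.yaml` schema. The sketch below assumes web locators appear in the file as `data-testid=<value>` strings, as in the locator examples earlier in this README, and uses the `<feature>-<role>-<purpose>` shape as the default convention. The function name and both regexes are illustrative.

```python
import re

# Captures the value after data-testid=, with or without surrounding quotes.
TESTID_VALUE = re.compile(r"""data-testid=["']?([^"'\s,\]]+)""")
# Assumed convention: <feature>-<role>-<purpose>, lowercase alphanumerics.
CONVENTION = re.compile(r"[a-z0-9]+-[a-z0-9]+-[a-z0-9]+")


def find_convention_drift(elements_yaml_text: str, convention=CONVENTION):
    """Return (line number, testid) pairs whose value does not match the convention."""
    drift = []
    for number, line in enumerate(elements_yaml_text.splitlines(), start=1):
        for value in TESTID_VALUE.findall(line):
            if not convention.fullmatch(value):
                drift.append((number, value))
    return drift
```

In CI, read `elements.yaml` as text, pass it in, and fail the build if the list is non-empty. Swap in your own compiled pattern through the `convention` argument if your naming scheme differs.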
Architecture (one paragraph)
A user project is one SUT with one element store
(elements.yaml at the project root). Its app.kind (e.g.
windows-desktop) tells the executor which adapter to use. The
CLI dispatches to that adapter for collect / launch / click /
screenshot operations; the core layer is adapter-agnostic and
never imports an adapter directly. New platforms plug in by adding a
sub-package under vibe_tester/adapters/.
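To make the "core never imports an adapter directly" idea concrete, here is a generic registry pattern that gives the same decoupling. It is a hypothetical illustration only: none of these names come from vibe-tester's source, and the real adapter interface and discovery mechanism may differ.

```python
from typing import Callable, Dict, Protocol


class Adapter(Protocol):
    """Hypothetical adapter surface; the real interface may differ."""

    def click(self, element_name: str) -> str: ...


_REGISTRY: Dict[str, Callable[[], Adapter]] = {}


def register_adapter(kind: str, factory: Callable[[], Adapter]) -> None:
    """Called by an adapter package to announce itself under an app.kind value."""
    _REGISTRY[kind] = factory


def adapter_for(kind: str) -> Adapter:
    """Core-side lookup: it knows only the kind string, never a concrete class."""
    factory = _REGISTRY.get(kind)
    if factory is None:
        raise ValueError(f"No adapter installed for app.kind={kind!r}")
    return factory()


class FakeWindowsAdapter:
    """Stand-in adapter used only for this example."""

    def click(self, element_name: str) -> str:
        return f"clicked {element_name}"


register_adapter("windows-desktop", FakeWindowsAdapter)
result = adapter_for("windows-desktop").click("themes.edit_button")
```

The point of the pattern is that adding a platform means registering one more factory, with no change to the code that calls `adapter_for`.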
Aggregation projects layer an integration coordinator on top —
multiple sibling single-SUT projects under a parent, integration
features at the parent driving them via an on "<sut>" per-step
prefix; child adapters are launched lazily and shut down together at
suite end. See
doc/design/architecture.md for the full
picture.
Contributing
This repo is the framework itself. See AGENTS.md for dev-context guidance (rules, layout, common tasks). Bug reports and PRs welcome at https://github.com/Haroldlei/vibe-tester.
License: MIT.