AI-driven UI automation testing framework with pluggable platform adapters.

vibe-tester

AI-driven UI automation testing for desktop and web apps — Cucumber-style tests, pluggable platform adapters, ships with the AI assets your coding agent needs to author and run them.

Status: alpha. Public API may change. The windows-desktop and web adapters are implemented; the macos adapter is a stub.


What it does

  1. Lets you describe a UI test in natural language (in Copilot Chat, Claude CLI, Cursor, …) and generates a runnable Gherkin .feature file using real element locators from your project's element store.
  2. Executes scenarios at any granularity (one feature, all of them, or a tag expression) and produces a Markdown report plus optional JSON output for the AI to parse.
  3. Walks your app interactively with you to record UI element paths into a YAML store the executor can resolve.

The framework ships AI assets (agents, skills, an AGENTS.md template) and a deterministic CLI (vibe-tester). It does not embed an LLM and does not run an MCP server: your AI tool of choice provides the intelligence, and the CLI is the integration surface.


Do I need an AI agent?

No. The framework is a Cucumber/behave runner with a UI-automation adapter and a YAML element vocabulary — you can author and run tests entirely by hand. The shipped agents are productivity multipliers, not runtime dependencies.

| Capability | AI needed? |
| --- | --- |
| Run tests (vibe-tester run …) | No |
| Write .feature files by hand using elements.yaml | No |
| Element collection: basic capture (vibe-tester collect …) | No |
| Element collection: interactive "navigate to the next page" loop | Recommended (agent) |
| Project customizations (features/hooks/, features/steps/steps.py) | No |
| Visual regression baselines + assertions | No |
| @setup: / @clean: tag-driven scenario isolation | No |
| Markdown / JSON reports | No |
| Translating a natural-language request → .feature | Yes (Test Writer) |
| Structured root-cause analysis on a failed scenario | Yes (Test Debugger) |
| Auto-proposing @clean: tags + handler stubs from element role: | Yes (Test Writer) |
| Detecting unmapped step phrases + scaffolding custom-step stubs | Yes (Test Writer) |

Bottom line: CLI + framework run standalone. The agents add natural-language authoring and structured failure triage. If you don't have Copilot / Claude CLI / Cursor available, skip the .github/agents/ prompts and write .feature files directly — every step phrase the runner accepts is documented in the uia-assertions and element-locators skill files (also shipped to your project, plain Markdown, readable without an LLM).


Install

# default — every adapter that ships today
pip install vibe-tester

# pick one (smaller install); the quotes keep shells such as zsh from globbing the brackets
pip install "vibe-tester[windows-desktop]"

# pick several
pip install "vibe-tester[windows-desktop,web]"

| Extra | Drives | Status |
| --- | --- | --- |
| windows-desktop | WinUI3 / Win32 / WPF / WebView2 / tray / shell menu | Implemented |
| web | Browser SUTs (Playwright) | Implemented |
| macos | macOS-native SUTs | Stub |

Quickstart

# 1. Create a fresh test project (or scaffold into an existing folder)
mkdir my-app-tests
cd my-app-tests
vibe-tester init

# 2. Capture your SUT (interactive — your app should be running)
vibe-tester collect

# 3. Ask your AI agent (Copilot Chat / Claude CLI / …) to write a test:
#    "Write a smoke test that opens Settings and verifies the title."
#    The Test Writer agent uses elements.yaml + the framework's CLI.

# 4. Run it
vibe-tester run

Project layout — one project = one SUT:

After step 1 your project looks like:

my-app-tests/
├── AGENTS.md                  # AI instructions for this project
├── .github/
│   ├── agents/                # element-collector, test-writer, test-runner, test-debugger
│   └── skills/                # element-locators, uia-assertions, web-locators,
│                              # web-assertions, image-testing, custom-steps,
│                              # failure-diagnosis (adapter-relevant ones only)
└── features/
    ├── environment.py         # framework glue — do not edit
    └── steps/
        └── _framework.py      # framework glue — do not edit

After step 2 the element store is created at the project root:

my-app-tests/
├── elements.yaml              # the element vocabulary your tests use
├── features/
│   ├── *.feature              # Gherkin tests (the AI writes these)
│   ├── baselines/             # visual regression PNGs (optional)
│   ├── steps/
│   │   ├── _framework.py      # framework glue — do not edit
│   │   └── steps.py           # your custom step defs (optional)
│   └── hooks/                 # optional
│       ├── environment.py     # your before/after hooks
│       └── handlers.py        # @setup: / @clean: tag handlers
└── ...

The project root is the SUT — there's no nested per-app folder.

Multiple SUTs (aggregation mode)

Some products span more than one surface — say an admin desktop tool whose changes must show up in a sibling website. vibe-tester lets you keep each surface as its own focused single-SUT project, then add an aggregation root on top that orchestrates integration scenarios across both. Layout:

my-product-tests/                ← aggregation root (NO elements.yaml here)
├── features/                    ← integration scenarios only
│   ├── environment.py           # framework glue — do not edit
│   ├── *.feature                # uses `on "<sut>"` per-step prefix
│   └── steps/
│       ├── _framework.py        # framework glue — do not edit
│       └── steps.py             # integration custom steps (optional)
├── admin-tool/                  ← child SUT #1 — full single-SUT layout
│   ├── elements.yaml
│   └── features/
│       └── ...
└── customer-site/               ← child SUT #2 — full single-SUT layout
    ├── elements.yaml
    └── features/
        └── ...

Mode is auto-detected when behave starts:

| Project root has… | Mode |
| --- | --- |
| elements.yaml | single SUT |
| no root elements.yaml, but at least one child folder has one | aggregation |
| neither | uninitialized |
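
The detection rules in the table above can be sketched in a few lines. This is an illustration of the rules as stated, not the framework's own code: the function name `detect_mode` is hypothetical, and checking only direct child folders is an assumption.

```python
from pathlib import Path

STORE = "elements.yaml"  # element store file name used throughout this README


def detect_mode(root: Path) -> str:
    """Classify a project root using the three rules from the mode table."""
    if (root / STORE).is_file():
        return "single SUT"
    # Assumption: only direct child folders are inspected for a store.
    for child in sorted(root.iterdir()):
        if child.is_dir() and (child / STORE).is_file():
            return "aggregation"
    return "uninitialized"
```

Calling `detect_mode(Path.cwd())` from a project folder returns one of the three mode names listed in the table.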

Integration scenarios use a per-step on "<sut>" prefix to name the target SUT — the value matches the app.name declared inside that child's elements.yaml, not the folder name:

Feature: Admin change shows up on the customer site

  Scenario: Editing a theme propagates within 5 seconds
    Given on "admin-tool" the app is open
    When  on "admin-tool" I click "themes.edit_button"
    And   on "admin-tool" I type "Sunset" into "themes.name_input"
    Then  on "customer-site" element "homepage.theme_banner" should be visible

The framework lazy-launches each SUT on first reference and shuts them all down once the run finishes. Five integration phrasings ship out of the box (the app is open, I click, I type … into, should exist, should be visible); for anything beyond that, write custom steps in features/steps/steps.py and look up the active SUT via context.suts.get("<name>").
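
To make the prefix and the lazy-launch behaviour concrete, here is a minimal stdlib-only sketch. It assumes the step text arrives with the Gherkin keyword (Given / When / Then) already removed. `split_step` and `LazySuts` are hypothetical names for illustration; they are not the framework's implementation of `context.suts`.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Matches the per-step prefix used in the scenario above: on "<sut>" <rest of step>
_PREFIX = re.compile(r'^on "(?P<sut>[^"]+)"\s+(?P<rest>.+)$')


def split_step(text: str) -> Tuple[Optional[str], str]:
    """Return (sut_name, remaining_phrase); sut_name is None without a prefix."""
    match = _PREFIX.match(text.strip())
    if match is None:
        return None, text.strip()
    return match.group("sut"), match.group("rest")


class LazySuts:
    """Hypothetical registry: start a SUT on first use, stop all at the end."""

    def __init__(self, launchers: Dict[str, Callable[[], object]]) -> None:
        self._launchers = launchers
        self._running: Dict[str, object] = {}

    def get(self, name: str) -> object:
        if name not in self._running:
            # The first reference triggers the launch; later calls reuse it.
            self._running[name] = self._launchers[name]()
        return self._running[name]

    def shutdown(self) -> List[str]:
        """Forget every launched SUT and return the names that were running."""
        stopped = list(self._running)
        self._running.clear()
        return stopped
```

A SUT that no step ever names is never launched, which is the point of resolving the prefix per step rather than per feature.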

Running vibe-tester run from the aggregation root executes only the integration features at that root. To run a single child SUT's own tests in isolation, cd into that child and run there — each child is itself a fully-functional single-SUT project.

@setup: / @clean: handlers and @requires: flag-based skips are single-SUT only — there's no one "active adapter" to scope them to in aggregation mode.


CLI reference

| Command | What it does |
| --- | --- |
| vibe-tester init [--target] [--adapter] [--overwrite] [--json] | Scaffold a project from shipped assets |
| vibe-tester list adapters [--json] | Show installed adapters |
| vibe-tester list features [--json] | List .feature files |
| vibe-tester list elements [--details] [--json] | Print the project's element vocabulary |
| vibe-tester collect [--name] [--kind] | Interactive element capture |
| vibe-tester run [--feature\|--tag] [--scenario] [--json] | Execute behave + emit Markdown / JSON report |

Commands that list --json above emit machine-readable output (intended for the AI agent to parse). Default output is human-friendly Rich tables and Markdown reports under ./results/.


How the AI assets work

vibe-tester init drops four agents and the adapter-relevant skills into .github/ plus an AGENTS.md at the project root. Any AI coding tool that follows the AGENTS.md convention — Copilot, Claude CLI, Cursor, etc. — will pick them up automatically. Skills are filtered by the adapter(s) you scaffold: a web-only project won't get uia-assertions, and a windows-desktop-only project won't get web-locators.

Agents (one each):

| Agent | Use when |
| --- | --- |
| Element Collector | Adding the SUT or new pages to it |
| Test Writer | Authoring .feature files from a natural-language ask |
| Test Runner | Executing tests and producing a Markdown report |
| Test Debugger | A test failed and you want a structured RCA |

Skills:

| Skill | Adapter | Topic |
| --- | --- | --- |
| element-locators | windows-desktop | UIA locator syntax, dot-notation, element store schema |
| uia-assertions | windows-desktop | All assertion types the Windows adapter supports |
| web-locators | web | Playwright locator strategy and element store schema |
| web-assertions | web | All assertion types the web adapter supports |
| image-testing | any | Visual regression / baseline strategy |
| custom-steps | any | Authoring project-level custom Gherkin step definitions |
| failure-diagnosis | any | RCA methodology + known-issues catalog |
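
The adapter column works as a filter at scaffold time. The sketch below copies the skill-to-adapter mapping from the skills table and shows how a selection could be computed. The function name `skills_for` is hypothetical; how vibe-tester init actually performs the filtering is not documented here.

```python
from typing import Iterable, List

# Skill -> adapter, copied from the skills table ("any" ships with every adapter).
SKILLS = {
    "element-locators": "windows-desktop",
    "uia-assertions": "windows-desktop",
    "web-locators": "web",
    "web-assertions": "web",
    "image-testing": "any",
    "custom-steps": "any",
    "failure-diagnosis": "any",
}


def skills_for(adapters: Iterable[str]) -> List[str]:
    """Return the skills relevant to the chosen adapters, in table order."""
    chosen = set(adapters)
    return [
        name
        for name, adapter in SKILLS.items()
        if adapter == "any" or adapter in chosen
    ]
```

With this mapping, `skills_for(["web"])` leaves out element-locators and uia-assertions, matching the web-only behaviour described earlier in this section.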

Spec-first delegation: handing a task to a coding agent

Use this workflow when you want to delegate a feature to a coding agent (Copilot, Claude CLI, Cursor, …) and have a .feature file serve as the binding acceptance contract — written and approved before coding starts, untouched while coding happens, and proven green when you come back.

What you get out of vibe-tester

vibe-tester is built around two files that, together, give you a spec you can sign off on up front:

  • A Gherkin .feature file — what the feature must do, in business language. References UI elements by name only ("the Save Theme button"), never by selector.
  • An elements.yaml entry per referenced element — the locator the agent commits to creating (AutomationId=btn_save_theme, data-testid=themes-save, …).

Because the .feature file holds no locators, freezing it after your approval does not constrain how the UI is built. Because every locator the test will ever try to use is declared in elements.yaml before code is written, the agent has no room to redefine "done" later — the test will fail unless the built UI exposes exactly those locators.

The workflow, step by step

  1. You describe the task in plain English to your AI agent.
  2. The agent drafts two files and shows them to you:
    • features/<feature>.feature — the scenarios in semantic names.
    • New entries appended to elements.yaml — locator strings for every element the scenarios reference.
  3. You review and approve both files. Edit the prose, add missing scenarios, rename anything that smells like implementation detail. Approve when it reads like the acceptance criteria you'd write yourself.
  4. The agent codes against the approved contract. Product code, step glue, unit tests — but it does not edit the approved .feature file. Treat it as locked.
  5. The agent runs vibe-tester run. A scenario passes only if the live UI exposes the locator declared in elements.yaml. Mismatches surface as test failures, not as silent edits.
  6. You come back to a Markdown report under ./results/ and decide whether to ship.

If you want belt-and-suspenders enforcement, commit the approved .feature and elements.yaml in their own PR and protect them with a CI check that fails on any subsequent change to either file without a --rewrite-acceptance reason recorded in the commit.
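
One way to sketch that CI check is a pure function that receives the changed paths and the commit message. In a real job those inputs might come from `git diff --name-only` and `git log -1 --format=%B`. The function names are hypothetical, and treating `--rewrite-acceptance` as a substring of the commit message is an assumption about how the reason gets recorded.

```python
from typing import Iterable, List

MARKER = "--rewrite-acceptance"  # override marker named in the paragraph above


def is_contract_file(path: str) -> bool:
    """True for files the workflow treats as locked after approval."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    return name == "elements.yaml" or name.endswith(".feature")


def contract_violations(changed: Iterable[str], commit_message: str) -> List[str]:
    """Return locked files changed by a commit that lacks the override marker."""
    if MARKER in commit_message:
        return []
    return [path for path in changed if is_contract_file(path)]
```

A CI step would fail the build whenever `contract_violations(...)` returns a non-empty list and print the offending paths.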

How the agent picks locators before the UI exists

The instinct is to discover a locator by inspecting a built UI. That forces the test to be written after coding, which destroys its value as a prior commitment.

vibe-tester's workflow inverts this: the agent declares the locator string in the same act that promises to render the element. The locator file becomes a forward-looking contract — "I will ship a button whose AutomationId is btn_save_theme" — not a recording of what happened to be built. The implementation must satisfy the contract, not the other way round.

This is reliable on every stack where the agent controls the source of the locator string:

| Your stack | Pre-commit a locator? | How |
| --- | --- | --- |
| Web (React / Vue / Svelte / plain HTML) | Yes | Use a data-testid convention |
| WinUI 3 / WPF / UWP | Yes | Set AutomationProperties.AutomationId explicitly |
| Win32 / MFC | Mostly | Owned controls via control ID; wrap shell UI |
| iOS / Android native | Yes | accessibilityIdentifier / contentDescription |
| Closed-source 3rd-party widgets | Wrap first | Locate the wrapper you control |
| Auto-generated framework IDs (e.g. Angular) | Forbid | Require an explicit testid via lint |

Starting from your project type

Greenfield (the agent is also writing the app from scratch). The easiest case. Tell your agent in AGENTS.md to (a) adopt one naming convention for locators (e.g. every interactive element gets data-testid shaped as <feature>-<role>-<purpose>) and (b) add a lint rule that fails the build on any interactive element missing the attribute. From there, every feature PR appends to elements.yaml before any code is written.
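
As a rough idea of what the lint in (b) could look like for plain or rendered HTML, the sketch below uses the standard-library `html.parser`. The set of tags counted as interactive and the names `MissingTestIdLint` and `find_missing_testids` are assumptions for illustration. JSX or other template sources would need a parser for that syntax instead.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Which tags count as "interactive" is an assumption; adjust per project.
INTERACTIVE = {"a", "button", "input", "select", "textarea"}


class MissingTestIdLint(HTMLParser):
    """Collect interactive elements that lack a non-empty data-testid."""

    def __init__(self) -> None:
        super().__init__()
        self.missing: List[Tuple[str, int]] = []  # (tag, line number)

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE and not dict(attrs).get("data-testid"):
            self.missing.append((tag, self.getpos()[0]))


def find_missing_testids(html: str) -> List[Tuple[str, int]]:
    """Return (tag, line) pairs for interactive elements without a testid."""
    lint = MissingTestIdLint()
    lint.feed(html)
    lint.close()
    return lint.missing
```

A build step could run this over generated markup and exit non-zero when the returned list is not empty.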

Brownfield with an existing elements.yaml. Point the agent at the file and tell it to follow the existing convention for new elements. The store itself is the reasoning input.

Brownfield without an elements.yaml yet. Run vibe-tester collect once against the current build as a one-time baseline. The agent then has both a snapshot of what exists and a sample of the project's locator style to imitate. After that single pass the project behaves like the case above.

What to watch for

Three failure modes are worth naming up front:

  1. Locator typos. The agent writes data-testid="save-theme" in elements.yaml but ships JSX with save_theme or no testid at all. The corresponding test scenario will fail on element lookup (which is the point), but you should treat that failure as the agent breaking its own contract, not as a flaky test.
  2. Convention drift. Across many features the agent invents slightly different naming schemes. Add a one-line CI check that greps elements.yaml for entries that don't match your convention regex; drift becomes a build failure rather than review burden.
  3. Semantic names that leak implementation. "the third div in the sidebar" is a locator in disguise. Keep names role-based ("Recently used themes list") so the spec stays implementation-agnostic and the agent retains room to build the UI well.
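
The convention-drift check from item 2 could be sketched as a text scan rather than a YAML parse, in keeping with the "grep" framing. Both patterns are assumptions: the convention regex encodes a lowercase `<feature>-<role>-<purpose>` shape, and locators are assumed to appear in the store as `data-testid=<value>`. The actual elements.yaml schema is not shown in this README.

```python
import re
from typing import List

# Assumed convention: <feature>-<role>-<purpose>, lowercase alphanumerics.
CONVENTION = re.compile(r"^[a-z0-9]+-[a-z0-9]+-[a-z0-9]+$")
# Assumed locator spelling in the store: data-testid=<value>, optionally quoted.
TESTID = re.compile(r"data-testid=[\"']?([A-Za-z0-9_-]+)")


def drifted_testids(store_text: str) -> List[str]:
    """Return data-testid values that break the naming convention."""
    return [
        value
        for value in TESTID.findall(store_text)
        if not CONVENTION.match(value)
    ]
```

In CI, the script would read elements.yaml as text, call `drifted_testids`, and fail the build if any values come back.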

Architecture (one paragraph)

A user project is one SUT with one element store (elements.yaml at the project root). Its app.kind (e.g. windows-desktop) tells the executor which adapter to use. The CLI dispatches to that adapter for collect / launch / click / screenshot operations; the core layer is adapter-agnostic and never imports an adapter directly. New platforms plug in by adding a sub-package under vibe_tester/adapters/. Aggregation projects layer an integration coordinator on top — multiple sibling single-SUT projects under a parent, integration features at the parent driving them via an on "<sut>" per-step prefix; child adapters are launched lazily and shut down together at suite end. See doc/design/architecture.md for the full picture.
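
The "core never imports an adapter directly" idea is commonly realised with a registry keyed by app.kind. The sketch below shows that general pattern only. Every name in it (`Adapter`, `register`, `adapter_for`, `FakeWebAdapter`) is hypothetical, and the real mechanism is the one described in doc/design/architecture.md.

```python
from typing import Callable, Dict, Protocol


class Adapter(Protocol):
    """Hypothetical adapter surface; the method name is illustrative."""

    def click(self, locator: str) -> str: ...


# Core-side registry: maps an app.kind string to an adapter factory.
_REGISTRY: Dict[str, Callable[[], Adapter]] = {}


def register(kind: str, factory: Callable[[], Adapter]) -> None:
    """Called by an adapter package to make itself available to the core."""
    _REGISTRY[kind] = factory


def adapter_for(kind: str) -> Adapter:
    """Look up and instantiate the adapter for a store's app.kind value."""
    try:
        return _REGISTRY[kind]()
    except KeyError:
        raise LookupError(f"no adapter installed for app.kind={kind!r}") from None


class FakeWebAdapter:
    """Stand-in adapter used only to exercise the registry."""

    def click(self, locator: str) -> str:
        return f"clicked {locator}"


register("web", FakeWebAdapter)
```

The core only ever calls `adapter_for(kind)`, so adding a platform means adding a package that registers itself, with no change to the core.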


Contributing

This repo is the framework itself. See AGENTS.md for dev-context guidance (rules, layout, common tasks). Bug reports and PRs welcome at https://github.com/Haroldlei/vibe-tester.

License: MIT.
