Python SDK for Midscene.js - AI-powered UI automation using natural language

PyMidscene


What is PyMidscene?

PyMidscene is a Python port of Midscene.js, an AI-powered UI automation framework. It lets you drive web browsers with natural-language instructions instead of CSS selectors or XPath expressions.

No more fragile selectors! Just describe what you want to click, type, or extract:

# Instead of: page.click("#submit-btn-primary")
await agent.ai_click("the blue Submit button")

# Instead of: page.fill("input[name='email']", "test@example.com")  
await agent.ai_input("email input field", "test@example.com")

# Extract structured data with natural language
result = await agent.ai_query({
    "title": "the page title",
    "price": "the product price as a number"
})

Features

  • Natural Language Automation - Describe elements in plain English/Chinese, no selectors needed
  • Multi-Model Support - Works with Doubao, Qwen, GPT-4V, Claude, and other vision LLMs
  • Playwright Integration - Seamless integration with Playwright for web automation
  • Android Integration - Control real devices over ADB with pymidscene[android] (see pymidscene/android/README.md)
  • iOS Integration - Drive iPhones / simulators through WebDriverAgent (see pymidscene/ios/README.md)
  • XPath Caching - Smart caching system compatible with Midscene.js format
  • Visual Reports - Generate beautiful HTML reports for debugging and sharing
  • Type-Safe - Full type hints for excellent IDE support

Installation

pip install pymidscene

# Install Playwright browsers
playwright install chromium

Or with Poetry:

poetry add pymidscene
playwright install chromium

Quick Start

1. Set up your API key

# For Doubao (recommended for Chinese users)
export MIDSCENE_MODEL_NAME="doubao-seed-1-6-251015"
export MIDSCENE_MODEL_API_KEY="your-api-key"
export MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
export MIDSCENE_MODEL_FAMILY="doubao-vision"

# For Qwen (set these instead of the Doubao values above)
export MIDSCENE_MODEL_NAME="qwen-vl-max"
export MIDSCENE_MODEL_API_KEY="your-api-key"
export MIDSCENE_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export MIDSCENE_MODEL_FAMILY="qwen2.5-vl"
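
A missing variable may only surface once the first model call is made, so it can help to check the configuration before launching a browser. The sketch below is a hypothetical helper (`missing_model_settings` is not part of PyMidscene); the only thing it takes from this README is the four variable names shown above.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the setup step above. The helper itself is
# hypothetical and not part of the PyMidscene API.
REQUIRED_VARS = (
    "MIDSCENE_MODEL_NAME",
    "MIDSCENE_MODEL_API_KEY",
    "MIDSCENE_MODEL_BASE_URL",
    "MIDSCENE_MODEL_FAMILY",
)


def missing_model_settings(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or blank, in a fixed order."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_model_settings()` with no arguments inspects the current process environment; a script could stop early with a readable message if the returned list is not empty.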

2. Write your automation script

import asyncio
import os
from playwright.async_api import async_playwright
from pymidscene import PlaywrightAgent

async def main():
    # Configure the model here, or export these variables in your shell as in step 1
    os.environ["MIDSCENE_MODEL_NAME"] = "doubao-seed-1-6-251015"
    os.environ["MIDSCENE_MODEL_API_KEY"] = "your-api-key"
    os.environ["MIDSCENE_MODEL_BASE_URL"] = "https://ark.cn-beijing.volces.com/api/v3"
    os.environ["MIDSCENE_MODEL_FAMILY"] = "doubao-vision"

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()

        # Create agent with optional caching
        agent = PlaywrightAgent(page, cache_id="my_task")

        # Navigate to page
        await page.goto("https://www.example.com")

        # Use natural language to interact
        await agent.ai_click("the search box")
        await agent.ai_input("search input", "Python automation")
        await agent.ai_click("search button")

        # Extract data
        result = await agent.ai_query({
            "results_count": "number of search results",
            "first_title": "title of the first result"
        })
        print(f"Found: {result}")

        # Assert page state
        await agent.ai_assert("search results are displayed")

        # Generate visual report
        report_path = agent.finish()
        print(f"Report saved to: {report_path}")

        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())

Documentation

Core API

Method                            Description
ai_click(description)             Click an element described in natural language
ai_input(description, text)       Type text into an input field
ai_locate(description)            Locate an element and return its coordinates
ai_query(schema)                  Extract structured data from the page
ai_assert(assertion)              Assert that a condition is true
ai_action(task)                   Execute a complex task with an AI planning loop (plan-execute-replan)
ai_wait_for(assertion, timeout)   Wait until a page condition is met (polling)
ai_scroll(direction, distance)    Scroll the page with AI assistance
finish()                          Generate HTML report and return the path
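
The methods in the table compose into ordinary async functions. The sketch below illustrates this under several assumptions: `AgentLike`, `search_and_collect`, and `RecordingAgent` are hypothetical names, the positional argument order follows the table, the unit of `timeout` is not stated in this README (so the caller supplies it), and the return shape of `ai_query` is assumed to mirror the schema passed in.

```python
import asyncio
from typing import Any, Protocol


class AgentLike(Protocol):
    """Hypothetical structural type covering only the methods used below."""

    async def ai_input(self, description: str, text: str) -> Any: ...
    async def ai_click(self, description: str) -> Any: ...
    async def ai_wait_for(self, assertion: str, timeout: Any) -> Any: ...
    async def ai_query(self, schema: dict) -> Any: ...


async def search_and_collect(agent: AgentLike, term: str, timeout: Any) -> Any:
    """Type a search term, wait for results, then extract the first title."""
    await agent.ai_input("search input", term)
    await agent.ai_click("search button")
    # The timeout unit is not documented here, so it is passed through unchanged.
    await agent.ai_wait_for("search results are displayed", timeout)
    return await agent.ai_query({"first_title": "title of the first result"})


class RecordingAgent:
    """Stand-in that records calls, so the flow runs without a browser or model."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    async def ai_input(self, description: str, text: str) -> None:
        self.calls.append(("ai_input", description, text))

    async def ai_click(self, description: str) -> None:
        self.calls.append(("ai_click", description))

    async def ai_wait_for(self, assertion: str, timeout: Any) -> None:
        self.calls.append(("ai_wait_for", assertion, timeout))

    async def ai_query(self, schema: dict) -> dict:
        self.calls.append(("ai_query", schema))
        # A real agent would return extracted values; the stub returns placeholders.
        return {key: None for key in schema}
```

Writing the flow against a small protocol rather than a concrete class keeps it testable offline; with a real `PlaywrightAgent` the same function would be awaited inside the `async with async_playwright()` block from the Quick Start.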

Supported Models

Model                    Family          Provider
doubao-seed-1-6-251015   doubao-vision   Bytedance/Volcano
qwen-vl-max              qwen2.5-vl      Alibaba
gpt-4-vision-preview     openai          OpenAI
claude-3-opus            claude          Anthropic
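
Each model name in the table is paired with the value expected in `MIDSCENE_MODEL_FAMILY`. A small lookup such as the one below can keep the two in sync. This is a sketch with hypothetical names (`MODEL_FAMILIES`, `model_env`); the API key and base URL are supplied by the caller because this README lists endpoints only for Doubao and Qwen.

```python
# Model name -> family, copied from the table above.
MODEL_FAMILIES = {
    "doubao-seed-1-6-251015": "doubao-vision",
    "qwen-vl-max": "qwen2.5-vl",
    "gpt-4-vision-preview": "openai",
    "claude-3-opus": "claude",
}


def model_env(model_name: str, api_key: str, base_url: str) -> dict[str, str]:
    """Build the four MIDSCENE_MODEL_* settings for a model listed in the table."""
    try:
        family = MODEL_FAMILIES[model_name]
    except KeyError:
        raise ValueError(f"No family listed for model {model_name!r}") from None
    return {
        "MIDSCENE_MODEL_NAME": model_name,
        "MIDSCENE_MODEL_API_KEY": api_key,
        "MIDSCENE_MODEL_BASE_URL": base_url,
        "MIDSCENE_MODEL_FAMILY": family,
    }
```

The returned dict can be applied with `os.environ.update(...)` before the agent is created.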

Cache System

PyMidscene uses XPath-based caching compatible with Midscene.js:

# midscene_run/cache/my_task.cache.yaml
midsceneVersion: 1.0.0
cacheId: my_task
caches:
  - type: locate
    prompt: the login button
    cache:
      xpaths:
        - /html/body/div[1]/button[1]

This means:

  • Cache files are interchangeable between JS and Python versions
  • XPath-based caching works across different window sizes, because an XPath addresses an element by its position in the DOM rather than by pixel coordinates
  • Cache invalidation happens automatically when elements move
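
To make the file layout concrete, the sketch below looks up the cached XPaths for a prompt in an already-parsed cache document. `cached_xpaths` is a hypothetical helper, not PyMidscene's internal lookup. It assumes the YAML has been loaded into a dict (for example with PyYAML's `yaml.safe_load`) and that prompts are matched by exact string equality, which this README does not confirm.

```python
from typing import Any


def cached_xpaths(cache_doc: dict[str, Any], prompt: str) -> list[str]:
    """Return the XPaths stored for a locate prompt, or an empty list if none."""
    for entry in cache_doc.get("caches", []):
        # Exact-match lookup is an assumption made for this illustration.
        if entry.get("type") == "locate" and entry.get("prompt") == prompt:
            return list(entry.get("cache", {}).get("xpaths", []))
    return []


# Same structure as the YAML example above, written as a Python dict.
example_cache = {
    "midsceneVersion": "1.0.0",
    "cacheId": "my_task",
    "caches": [
        {
            "type": "locate",
            "prompt": "the login button",
            "cache": {"xpaths": ["/html/body/div[1]/button[1]"]},
        }
    ],
}

print(cached_xpaths(example_cache, "the login button"))
# ['/html/body/div[1]/button[1]']
```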

Examples

Check out the examples/ directory:

  • basic_usage.py - Getting started
  • login_demo.py - Login automation with visual report
  • login_demo.html - Test page for login demo

Project Structure

pymidscene/
├── pymidscene/           # Main package
│   ├── core/             # Core automation logic
│   │   ├── agent/        # Agent implementation
│   │   ├── ai_model/     # AI model integration
│   │   └── dump.py       # Report generation
│   ├── web_integration/  # Browser integrations
│   │   └── playwright/   # Playwright adapter
│   └── shared/           # Shared utilities
├── examples/             # Usage examples
├── tests/                # Test suite
└── docs/                 # Documentation

Related Projects

This is the Python implementation of Midscene.js.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

# Development setup
git clone https://github.com/AIPythoner/pymidscene.git
cd pymidscene
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black pymidscene tests

License

MIT License - see LICENSE file for details.

Acknowledgments


Made with love by the PyMidscene community

