Spongecake is the easiest way to launch OpenAI computer use agents.


Open source SDK to launch OpenAI computer use agents

What is spongecake?
Prerequisites
Quick Start
Demos
1. LinkedIn Prospecting
2. Amazon Shopping
3. Data Entry
(Optional) Building & Running the Docker Container
Connecting to the Virtual Desktop
Documentation
1. Desktop Client Documentation
Contributing
Roadmap
Team

What is spongecake?

🍰 spongecake is the easiest way to launch OpenAI-powered “computer use” agents. It simplifies:

Spinning up a Docker container with a virtual desktop (including Xfce, VNC, etc.).
Controlling that virtual desktop programmatically using an SDK (click, scroll, keyboard actions).
Integrating with OpenAI to drive an agent that can interact with a real Linux-based GUI.

Prerequisites

You’ll need the following to get started (click to download):

Quick Start

Clone the repo (if you haven’t already):
```
git clone https://github.com/aditya-nadkarni/spongecake.git
cd spongecake/examples
```

Set up a Python virtual environment and install the spongecake package:
```
python3 -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate

python3 -m pip install --upgrade spongecake
python3 -m pip install --upgrade dotenv
python3 -m pip install --upgrade openai  # Make sure you have the latest version of openai for the responses API
```

Run the example script:
```
cd examples # If needed
```
```
python3 example.py
```
Feel free to edit the example.py script to try out your own commands.

Note: This deploys a Docker container in your local Docker environment. If the spongecake default image isn't available, it will pull the image from Docker Hub.
Create your own scripts: The example script is largely for demonstration purposes. To make this work for your own use cases, create your own scripts using the SDK or integrate it into your own systems.

Demos

LinkedIn Prospecting

Using spongecake to automate linkedin prospecting (see examples/linkedin_example.py)

Amazon Shopping

Using spongecake to automate amazon shopping (see examples/amazon_example.py)

Data Entry

Using spongecake to automate data entry (see examples/data_entry_example.py)

(Optional) Building & Running the Docker Container

If you want to manually build and run the included Docker image for a virtual desktop environment, follow these steps. To make your own changes to the Docker container, fork the repository and edit it however you need. This is a good way to add dependencies specific to your workflows.

Navigate to the Docker folder (e.g., cd spongecake/docker).
Build the image:
```
docker build -t <name of your image> .
```

Run the container:
```
docker run -d -p 5900:5900 --name <name of your container> <name of your image>
```
This starts a container with the name you chose and exposes VNC on port 5900.

Shell into the container (optional):
```
docker exec -it <name of your container> bash
```
This is useful for debugging, installing extra dependencies, etc.
You can then pass the name of your container or image to the SDK (the name and docker_image arguments of Desktop).

Connecting to the Virtual Desktop

If you're working on a Mac:

Right-click Finder and select Connect to Server...
OR
In the Finder window, navigate to Go > Connect to Server... in the menu bar.
Enter the VNC host and port. For the default container this is vnc://localhost:5900.
When asked for a password, enter "secret" (the default in the stock Docker container).
Your Mac will connect to the VNC server. You can view and control the container's desktop from there.

Other options:

Install a VNC Viewer, such as TigerVNC or RealVNC.
Open the VNC client and connect to:
```
localhost:5900
```
Enter the password when needed (set to "secret" in the default docker container).
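
Before opening a VNC client, it can help to confirm that something is actually answering on the VNC port. The sketch below reads the 12-byte version banner that an RFB (VNC) server sends as soon as a client connects, for example `RFB 003.008`. The helper names read_rfb_banner and looks_like_vnc are made up for illustration and are not part of spongecake.

```python
import socket


def read_rfb_banner(host: str = "localhost", port: int = 5900, timeout: float = 3.0) -> str:
    """Connect to host:port and return the first 12 bytes the server sends, as text."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        banner = b""
        # The RFB ProtocolVersion message is exactly 12 bytes long.
        while len(banner) < 12:
            chunk = conn.recv(12 - len(banner))
            if not chunk:
                break  # server closed the connection early
            banner += chunk
    return banner.decode("ascii", errors="replace").strip()


def looks_like_vnc(banner: str) -> bool:
    """True if the banner has the shape of an RFB version string."""
    return banner.startswith("RFB ")


# Typical use once the container is running (raises OSError if nothing listens):
#   print(looks_like_vnc(read_rfb_banner("localhost", 5900)))
```

If the call raises a connection error, the container is probably not running or the port mapping differs from 5900.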

Documentation

Desktop Client Documentation

Below is the Desktop class, which provides functionality for managing and interacting with a Docker container that simulates a Linux desktop environment. This class enables you to control mouse/keyboard actions, retrieve screenshots, and integrate with OpenAI for higher-level agent logic.

Class: `Desktop`

Arguments:

name (str): A unique name for the container. Defaults to "newdesktop". This should be unique for different containers
docker_image (str): The Docker image name to pull/run if not already available. Defaults to "spongebox/spongecake:latest".
vnc_port (int): The host port mapped to the container’s VNC server. Defaults to 5900.
api_port (int): The host port mapped to the container’s internal API. Defaults to 8000.
openai_api_key (str): An optional API key for OpenAI. If not provided, the class attempts to read OPENAI_API_KEY from the environment.

Raises:

SpongecakeException if any port is in use.
SpongecakeException if no OpenAI API key is supplied.

Description: Creates a Docker client, sets up container parameters, and initializes an internal OpenAI client for agent integration.
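
Since the constructor raises when a port is taken or no API key is available, a small preflight check can report both problems up front. This is a minimal sketch using only the standard library. The functions port_is_free and preflight are hypothetical helpers, not part of the SDK, and the check binds to 127.0.0.1, which may not match exactly how Docker publishes ports on your machine.

```python
import os
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if we can bind host:port, i.e. nothing else holds it right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def preflight(vnc_port: int = 5900, api_port: int = 8000, env=None) -> list[str]:
    """Collect human-readable problems instead of stopping at the first one."""
    env = os.environ if env is None else env
    problems = []
    for label, port in (("vnc_port", vnc_port), ("api_port", api_port)):
        if not port_is_free(port):
            problems.append(f"{label} {port} is already in use")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems
```

An empty list means the defaults should be usable. Otherwise, pass different vnc_port / api_port values or set the key before constructing Desktop.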

`start()`

```
def start(self) -> Container:
    """
    Starts the container if it's not already running.
    """
```

Behavior:

Starts the Docker container that was configured in the Desktop() constructor.
Checks if a container with the specified name already exists.
If the container exists but is not running, it starts it.
Note: In this case, it will not pull the latest image.
If the container does not exist, the method attempts to run it:
- It will attempt to pull the latest image before starting the container.
Waits a short time (2 seconds) for services to initialize.
Returns the running container object.

Returns:

A Docker Container object representing the running container.

Exceptions:

RuntimeError if it fails to find or pull the specified image
docker.errors.APIError for any issue with running the container.

`stop()`

```
def stop(self) -> None:
    """
    Stops and removes the container.
    """
```

Behavior:

Stops and removes the container.
Prints a status message.
If the container does not exist, prints a warning.

Returns:

None
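
Because start() and stop() come as a pair, one way to make sure the container is always cleaned up is to wrap them in a context manager. The sketch below is not part of the SDK. The running helper and the FakeDesktop stand-in are hypothetical, and the stand-in exists only so the pattern can be shown without Docker.

```python
from contextlib import contextmanager


@contextmanager
def running(desktop):
    """Start the desktop, hand it to the caller, and always stop it afterwards."""
    desktop.start()
    try:
        yield desktop
    finally:
        desktop.stop()


class FakeDesktop:
    """Stand-in that records calls; a real Desktop would talk to Docker instead."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")

    def exec(self, command):
        self.calls.append(f"exec {command}")
        return {"result": "", "returncode": 0}


fake = FakeDesktop()
try:
    with running(fake) as d:
        d.exec("echo hello")
        raise ValueError("simulated failure mid-task")
except ValueError:
    pass

print(fake.calls)  # ['start', 'exec echo hello', 'stop']
```

Note that stop() also removes the container, so this pattern suits throwaway sessions rather than containers you want to keep between runs.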

`exec(command)`

```
def exec(self, command: str) -> dict:
    """
    Runs a shell command inside the container.
    """
```

Arguments:

command (str): The shell command to execute.

Behavior:

Runs a shell command in the Docker container.
Captures stdout and stderr.
Logs the command output.

Returns: A dictionary with:

{
  "result": (string output),
  "returncode": (integer exit code)
}
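
Given that shape, callers usually want the output on success and an exception on failure. A minimal sketch follows. checked_output and CommandFailed are hypothetical names introduced here, and the function only relies on the two documented keys.

```python
class CommandFailed(RuntimeError):
    """Raised when a command run through exec() exits with a non-zero code."""


def checked_output(result: dict) -> str:
    """Return the command output, or raise CommandFailed if returncode != 0."""
    code = result.get("returncode")
    output = result.get("result", "")
    if code != 0:
        raise CommandFailed(f"exit code {code}: {output.strip()}")
    return output


# With a real desktop this would be: checked_output(desktop.exec("ls /tmp"))
print(checked_output({"result": "file.txt\n", "returncode": 0}))
```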

Desktop Actions

`click(x, y, click_type="left")`

```
def click(self, x: int, y: int, click_type: str = "left") -> None:
    """
    Move the mouse to (x, y) and click the specified button.
    click_type can be 'left', 'middle', or 'right'.
    """
```

Arguments:

x, y (int): The screen coordinates to move the mouse.
click_type (str): The mouse button to click ("left", "middle", or "right").

Returns:

None

`scroll(x, y, scroll_x=0, scroll_y=0)`

```
def scroll(
    self,
    x: int,
    y: int,
    scroll_x: int = 0,
    scroll_y: int = 0
) -> None:
    """
    Move to (x, y) and scroll horizontally or vertically.
    """
```

Arguments:

x, y (int): The screen coordinates to move the mouse.
scroll_x (int): Horizontal scroll offset.
- Negative => Scroll left (button 6)
- Positive => Scroll right (button 7)
scroll_y (int): Vertical scroll offset.
- Negative => Scroll up (button 4)
- Positive => Scroll down (button 5)

Behavior:

Moves the mouse to (x, y).
Scrolls by scroll_x and scroll_y

Returns:

None
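
The sign convention above can be captured in a few lines. This sketch only mirrors the documented mapping from offsets to mouse button numbers. scroll_buttons is a hypothetical helper, and how many button presses the SDK issues per offset unit is not stated here, so it is left out.

```python
def scroll_buttons(scroll_x: int = 0, scroll_y: int = 0) -> list[int]:
    """Return the mouse button numbers implied by the signs of the offsets."""
    buttons = []
    if scroll_x < 0:
        buttons.append(6)  # scroll left
    elif scroll_x > 0:
        buttons.append(7)  # scroll right
    if scroll_y < 0:
        buttons.append(4)  # scroll up
    elif scroll_y > 0:
        buttons.append(5)  # scroll down
    return buttons


print(scroll_buttons(scroll_y=-3))  # [4]
print(scroll_buttons(2, 5))         # [7, 5]
```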

`keypress(keys: list[str])`

```
def keypress(self, keys: list[str]) -> None:
    """
    Press (and possibly hold) keys in sequence.
    """
```

Arguments:

keys (list[str]): A list of keys to press. Example: ["CTRL", "f"] for Ctrl+F.

Behavior:

Executes a keypress.
Supports shortcuts like Ctrl+F.

Returns:

None
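
If your scripts store shortcuts as strings such as "Ctrl+F", a small converter can produce the list form that keypress expects. This is a sketch under assumptions: only ["CTRL", "f"] is documented above, so the upper-case spelling of other modifiers ("ALT", "SHIFT") and the pass-through of longer key names are guesses, and shortcut_to_keys is a hypothetical helper.

```python
MODIFIERS = {"ctrl", "alt", "shift"}  # assumed spellings; only CTRL is documented


def shortcut_to_keys(shortcut: str) -> list[str]:
    """Turn 'Ctrl+F' into ['CTRL', 'f'] for use with keypress()."""
    keys = []
    for part in shortcut.split("+"):
        name = part.strip()
        if not name:
            raise ValueError(f"empty key in shortcut {shortcut!r}")
        if name.lower() in MODIFIERS:
            keys.append(name.upper())   # modifiers are upper-cased
        elif len(name) == 1:
            keys.append(name.lower())   # single characters are lower-cased
        else:
            keys.append(name)           # other key names are passed through unchanged
    return keys


print(shortcut_to_keys("Ctrl+F"))  # ['CTRL', 'f']
```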

`type_text(text: str)`

```
def type_text(self, text: str) -> None:
    """
    Type a string of text (like using a keyboard) at the current cursor location.
    """
```

Arguments:

text (str): The string of text to type.

Behavior:

Types a string of text at the current cursor location.

Returns:

None

`get_screenshot()`

```
def get_screenshot(self) -> str:
    """
    Takes a screenshot of the current desktop.
    Returns the base64-encoded PNG screenshot.
    """
```

Behavior:

Takes a screenshot as a PNG.
Captures the base64 result.
Returns that base64 string.

Returns:

(str): A base64-encoded PNG screenshot.

Exceptions:

RuntimeError if the screenshot command fails.
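
Since the return value is a base64 string rather than a file, saving it takes one decoding step. The sketch below decodes the string, checks for the standard 8-byte PNG signature, and writes the bytes to disk. save_screenshot is a hypothetical helper, not an SDK function.

```python
import base64
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def save_screenshot(b64_png: str, path) -> Path:
    """Decode a base64-encoded PNG and write it to `path`."""
    raw = base64.b64decode(b64_png)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    out = Path(path)
    out.write_bytes(raw)
    return out


# With a real desktop this would be:
#   save_screenshot(desktop.get_screenshot(), "desktop.png")
```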

OpenAI Agent Integration

`action(input_text=None, acknowledged_safety_checks=False, ignore_safety_and_input=False, complete_handler=None, needs_input_handler=None, needs_safety_check_handler=None, error_handler=None)`

For more details, see the guide to using the action command later in this document.

Purpose

The action function lets you control the desktop environment via an agent, managing commands, user inputs, and security checks in a streamlined way.

Arguments

input_text (str, optional):
New commands or responses to agent prompts.
acknowledged_safety_checks (bool, optional):
Set True after the user confirms pending security checks.
ignore_safety_and_input (bool, optional):
Automatically approves security checks and inputs. Use cautiously.
Handlers (callables, optional):
Customize how different statuses are handled:
- complete_handler(data): Final results.
- needs_input_handler(messages): Collects user input.
- needs_safety_check_handler(safety_checks, pending_call): Approves security checks.
- error_handler(error_message): Manages errors.

How it works

The action function returns one of four statuses:

Status Handling

COMPLETE:
Task finished successfully. Handle final output.
ERROR:
Review the returned error message and handle accordingly.
NEEDS_INPUT:
Provide additional user input and call action() again with this input.
NEEDS_SECURITY_CHECK:
Review security warnings and confirm with acknowledged_safety_checks=True.

Example workflow:

```
status, data = agent.action(input_text="Open Chrome")

if status == AgentStatus.COMPLETE:
    print("Done:", data)
elif status == AgentStatus.ERROR:
    print("Error:", data)
elif status == AgentStatus.NEEDS_INPUT:
    user_reply = input(f"Input needed: {data}")
    agent.action(input_text=user_reply)
elif status == AgentStatus.NEEDS_SECURITY_CHECK:
    confirm = input(f"Security checks: {data['safety_checks']} Proceed? (y/N): ")
    if confirm.lower() == "y":
        agent.action(acknowledged_safety_checks=True)
    else:
        print("Action cancelled.")
```

Auto Mode

Set ignore_safety_and_input=True for automatic handling of inputs and security checks. Use carefully as this bypasses user prompts and approvals.

Using Handlers

Provide handler functions to automate status management, simplifying your code:

```
agent.action(
    input_text="Open Chrome",
    complete_handler=lambda data: print("Done:", data),
    error_handler=lambda error: print("Error:", error),
    needs_input_handler=lambda msgs: input(f"Agent asks: {msgs}"),
    needs_safety_check_handler=lambda checks, call: input(f"Approve {checks}? (y/N): ").lower() == "y"
)
```

🚀 Guide: Using the `action` Command

The action function lets your agent execute tasks in the desktop environment. It handles:

Starting a new conversation with a command.
Continuing a conversation by supplying user input.
Acknowledging safety checks for a pending call.
Auto-handling safety checks and input if ignore_safety_and_input=True.
Custom handler delegation for each status.

Internally, action manages state and either returns a (status, data) tuple for you to process or calls the appropriate handler if provided.

📌 Quick Overview

```
def action(
    input_text=None,
    acknowledged_safety_checks=False,
    ignore_safety_and_input=False,
    complete_handler=None,
    needs_input_handler=None,
    needs_safety_check_handler=None,
    error_handler=None
):
    ...  # body elided
```

input_text (str, optional):
- A new command to start a conversation.
- A user’s response if the agent has asked for more input.
- None if you’re just confirming safety checks.
acknowledged_safety_checks (bool, optional):
- Indicates that the user has confirmed pending checks.
- Only relevant if a NEEDS_SECURITY_CHECK status was returned previously.
ignore_safety_and_input (bool, optional):
- If True, the function automatically handles safety checks and input requests, requiring no user interaction.
Handlers (callables, optional):
- complete_handler(data): Handles COMPLETE.
- needs_input_handler(messages): Handles NEEDS_INPUT.
- needs_safety_check_handler(checks, pending_call): Handles NEEDS_SECURITY_CHECK.
- error_handler(error_message): Handles ERROR.

Return Value:

A tuple (status, data), where:
- status is one of:
  - COMPLETE: The agent finished successfully.
  - ERROR: An error occurred.
  - NEEDS_INPUT: The agent needs more user input.
  - NEEDS_SECURITY_CHECK: The agent needs confirmation for a risky action.
- data:
  - For COMPLETE: The final response object.
  - For ERROR: An error message.
  - For NEEDS_INPUT: A list of messages asking for input.
  - For NEEDS_SECURITY_CHECK: A list of safety checks and the pending call.

When handlers are provided, action may not return a status in the usual way—it delegates behavior to those handlers.

🌀 Handling the Workflow (Interactive Example)

action covers multiple scenarios:

Starting a conversation with input_text (e.g., a command: “Open Chrome”).
Continuing a conversation by providing user input if the agent is waiting for it.
Acknowledging safety checks with acknowledged_safety_checks=True if the agent flagged a security concern.
Auto-handling if ignore_safety_and_input=True, which bypasses user checks.

When you call action, you get (status, data) back (unless you use handlers). Use status to decide your next move:

COMPLETE:
- The task is done. data contains the final response.
- You can display or log it.
ERROR:
- data holds an error message explaining what went wrong.
- You can retry, log, or show it to the user.
NEEDS_INPUT:
- The agent requires more information. Use data (often a list of prompts) to know what it wants.
- Get input from the user, then call action again, supplying that text in input_text.
NEEDS_SECURITY_CHECK:
- The agent found a risky action. You must confirm it’s safe.
- Call action again with acknowledged_safety_checks=True to proceed.
- No extra input_text is required unless the agent specifically requests it.

Here’s a straightforward example:

```
status, data = agent.action(input_text="Open Firefox")

if status == AgentStatus.COMPLETE:
    print("Done:", data)
elif status == AgentStatus.ERROR:
    print("Error:", data)
elif status == AgentStatus.NEEDS_INPUT:
    user_reply = input(f"Input needed: {data}")
    agent.action(input_text=user_reply)
elif status == AgentStatus.NEEDS_SECURITY_CHECK:
    confirm = input(f"Security checks: {data['safety_checks']} Proceed? (y/N): ")
    if confirm.lower() == "y":
        agent.action(acknowledged_safety_checks=True)
    else:
        print("Action cancelled.")
```

For a more robust loop-based approach:

```
status, data = agent.action(input_text="Open a file")

while status in [AgentStatus.NEEDS_INPUT, AgentStatus.NEEDS_SECURITY_CHECK]:
    if status == AgentStatus.NEEDS_INPUT:
        user_reply = input(f"Agent needs more info: {data}")
        status, data = agent.action(input_text=user_reply)
    elif status == AgentStatus.NEEDS_SECURITY_CHECK:
        confirm = input(f"Security checks: {data['safety_checks']} Proceed? (y/N): ")
        if confirm.lower() == "y":
            status, data = agent.action(acknowledged_safety_checks=True)
        else:
            print("Action cancelled.")
            break

if status == AgentStatus.COMPLETE:
    print("Final result:", data)
elif status == AgentStatus.ERROR:
    print("Error:", data)
```
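
The loop above runs for as long as the agent keeps asking. If you would rather cap the number of follow-up rounds, the same logic can be wrapped in a function that takes the action callable as a parameter. This is a sketch, not SDK code: run_task is a hypothetical helper, and the AgentStatus enum and fake_action below are stand-ins so the example runs without a container. In real use you would pass your own action method and the SDK's AgentStatus.

```python
from enum import Enum


class AgentStatus(Enum):
    """Stand-in mirroring the four documented statuses."""
    COMPLETE = "complete"
    ERROR = "error"
    NEEDS_INPUT = "needs_input"
    NEEDS_SECURITY_CHECK = "needs_security_check"


def run_task(action, command, ask_user, confirm, max_rounds=5):
    """Drive `action` until it finishes, the user refuses a check, or rounds run out."""
    status, data = action(input_text=command)
    for _ in range(max_rounds):
        if status == AgentStatus.NEEDS_INPUT:
            status, data = action(input_text=ask_user(data))
        elif status == AgentStatus.NEEDS_SECURITY_CHECK:
            if not confirm(data):
                return status, data  # unconfirmed check is handed back to the caller
            status, data = action(acknowledged_safety_checks=True)
        else:
            break  # COMPLETE or ERROR
    return status, data


# Scripted replies standing in for a real agent.
script = iter([
    (AgentStatus.NEEDS_INPUT, ["Which file?"]),
    (AgentStatus.NEEDS_SECURITY_CHECK, {"safety_checks": ["untrusted site"]}),
    (AgentStatus.COMPLETE, "opened notes.txt"),
])
calls = []


def fake_action(**kwargs):
    calls.append(kwargs)
    return next(script)


final = run_task(fake_action, "Open a file",
                 ask_user=lambda msgs: "notes.txt",
                 confirm=lambda data: True)
print(final[0].name, final[1])  # COMPLETE opened notes.txt
```

If max_rounds is exhausted, the function returns the last non-final status, so the caller can tell the task is still pending.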

🤖 Automated (Non-Interactive) Mode

Set ignore_safety_and_input=True to:

Automatically approve safety checks.
Automatically generate responses to the agent's questions so the task can continue.

This is useful for:

Automated actions that must run without user interaction.
Headless or server-based scenarios.

CAUTION: Auto mode is inherently risky because it skips all manual confirmations and user input. Ensure your applications and use cases are safe before using it.

Example:

```
status, data = agent.action(
    input_text="Open Chrome",
    ignore_safety_and_input=True
)

if status == AgentStatus.COMPLETE:
    print("Completed:", data)
elif status == AgentStatus.ERROR:
    print("Error:", data)
```

Examples Using Handlers

You can avoid manual if/else checks by supplying handlers for each status. action will call them automatically:

complete_handler(data): Called when the agent finishes.
needs_input_handler(messages): Called if the agent wants more input.
needs_safety_check_handler(safety_checks, pending_call): Called if the agent flags a safety check.
error_handler(error_message): Called if something goes wrong.

Why Handlers?

They allow complex logic in a more organized, modular style.
They help you integrate with other tools or services, since each status is handled by a dedicated function.
They reduce repeated conditional code in your main flow.

Example:

```
result = [None]  # a mutable container to store final output or None

def complete_handler(data):
    """COMPLETE -- handle final results"""
    print("\n✅ Task completed successfully!")
    result[0] = data

def needs_input_handler(messages):
    """NEEDS_INPUT -- prompt the user and return the response"""
    for msg in messages:
        if hasattr(msg, "content"):
            text_parts = [part.text for part in msg.content if hasattr(part, "text")]
            print(f"\n💬 Agent asks: {' '.join(text_parts)}")

    user_says = input("Enter your response (or 'exit'/'quit'): ").strip()
    if user_says.lower() in ("exit", "quit"):
        print("Exiting as per user request.")
        result[0] = None
        return None
    return user_says

def needs_safety_check_handler(safety_checks, pending_call):
    """NEEDS_SECURITY_CHECK -- confirm or deny safety checks"""
    for check in safety_checks:
        if hasattr(check, "message"):
            print(f"☢️  Pending Safety Check: {check.message}")

    ack = input("Type 'ack' to confirm, or 'exit'/'quit': ").strip().lower()
    if ack in ("exit", "quit"):
        print("Exiting as per user request.")
        result[0] = None
        return False
    if ack == "ack":
        print("Acknowledged. Proceeding with the computer call...")
        return True
    return False

def error_handler(error_message):
    """ERROR -- print error and store None"""
    print(f"😱 ERROR: {error_message}")
    result[0] = None

# Provide handlers to `action`:
status, data = desktop.action(
    input_text="Open Chrome",
    complete_handler=complete_handler,
    needs_input_handler=needs_input_handler,
    needs_safety_check_handler=needs_safety_check_handler,
    error_handler=error_handler
)
```

When handlers are specified, action manages each status internally and continues until it hits COMPLETE or ERROR (unless you stop it prematurely).
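
The handler example keeps its result in a one-element list. An alternative, sketched here, is to bundle the four handlers around a small dataclass that records what happened. Outcome and make_handlers are hypothetical names, not part of the SDK. The keyword names in the returned dict match the documented handler parameters, and the return values follow the example above (the input handler returns the reply text, the safety handler returns True or False).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Outcome:
    """Records the final result, any error, and the order of statuses seen."""
    result: Any = None
    error: Optional[str] = None
    log: list = field(default_factory=list)


def make_handlers(outcome: Outcome,
                  ask: Callable[[str], str] = input,
                  approve: Callable[[Any], bool] = lambda checks: False) -> dict:
    """Build the four handlers as a dict of keyword arguments for action()."""

    def complete_handler(data):
        outcome.log.append("complete")
        outcome.result = data

    def needs_input_handler(messages):
        outcome.log.append("needs_input")
        return ask(f"Agent asks: {messages} ")

    def needs_safety_check_handler(safety_checks, pending_call):
        outcome.log.append("needs_safety_check")
        return bool(approve(safety_checks))  # denies by default

    def error_handler(error_message):
        outcome.log.append("error")
        outcome.error = error_message

    return {
        "complete_handler": complete_handler,
        "needs_input_handler": needs_input_handler,
        "needs_safety_check_handler": needs_safety_check_handler,
        "error_handler": error_handler,
    }


# With a real desktop this would be:
#   outcome = Outcome()
#   desktop.action(input_text="Open Chrome", **make_handlers(outcome))
```

Passing ask and approve as parameters keeps the prompts swappable, for example a terminal prompt during development and a queue or web form in a service.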

Key Takeaways

Scenarios: Start new tasks, resume with user input, or confirm safety checks.
Statuses: Always handle COMPLETE, ERROR, NEEDS_INPUT, NEEDS_SECURITY_CHECK.
Resuming: Pass new input (input_text) or confirm checks (acknowledged_safety_checks=True) to continue.
Auto-mode: ignore_safety_and_input=True is convenient but risky.
Handlers: Offer a cleaner, more modular way to manage status-based logic.

For any additional questions, contact founders@passage-team.com

Appendix

Contributing

Feel free to open issues for any feature requests or if you encounter any bugs! We love and appreciate contributions of all forms.

Pull Request Guidelines

Fork the repo and create a new branch from main.
Commit changes with clear and descriptive messages.
Include tests, if possible. If adding a feature or fixing a bug, please include or update the relevant tests.
Open a Pull Request with a clear title and description explaining your work.

Roadmap

Support for other computer-use agents
Support for browser-only environments
Integrating human-in-the-loop
(and much more...)

Team

Made with 🍰 in San Francisco

Release history

0.1.15: Apr 25, 2025
0.1.14: Apr 18, 2025
0.1.13: Apr 15, 2025
0.1.12: Apr 10, 2025
0.1.11: Apr 8, 2025
0.1.10: Apr 4, 2025
0.1.9: Mar 27, 2025
0.1.8: Mar 26, 2025
0.1.7: Mar 26, 2025
0.1.6: Mar 21, 2025 (this version)
0.1.5: Mar 20, 2025
0.1.4: Mar 19, 2025
0.1.3: Mar 18, 2025
0.1.2: Mar 14, 2025
0.1.1: Mar 14, 2025
0.1.0: Mar 14, 2025

Download files

Source Distribution

spongecake-0.1.6.tar.gz (34.1 kB), uploaded Mar 21, 2025, Source

Built Distribution

spongecake-0.1.6-py3-none-any.whl (27.2 kB), uploaded Mar 21, 2025, Python 3

File details

Details for the file spongecake-0.1.6.tar.gz.

File metadata

Download URL: spongecake-0.1.6.tar.gz
Upload date: Mar 21, 2025
Size: 34.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.11

File hashes

Hashes for spongecake-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`8e2cc9a8b2bc0f2ad636edb9d8fec819d9510e4021da54d31018b1e5cb19cfa2`
MD5	`74e48ed7d89f9716bbfc2f373dd83d49`
BLAKE2b-256	`a65427540a5a1c81004d6175faa020c31a8f82f9207e9b7d0517caffaa70c0fe`

File details

Details for the file spongecake-0.1.6-py3-none-any.whl.

File metadata

Download URL: spongecake-0.1.6-py3-none-any.whl
Upload date: Mar 21, 2025
Size: 27.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.11

File hashes

Hashes for spongecake-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5a460848020396a5bc9c88b4e15a426a48e389dfe57c3ba1977546dfbf2ad88c`
MD5	`756baa8913ef05baedbce6632a792109`
BLAKE2b-256	`e33b7b544756addffe1d23b171c64636d840614b443f3068145ffa67b2e57358`
