Self-Operating Computer Framework
A framework to enable multimodal models to operate a computer.
Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
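The view-decide-act cycle described above can be sketched as a small loop. This is a minimal illustration only, not the framework's actual code: the `Action` schema, `run_loop`, and the `capture`/`decide`/`execute` callables are hypothetical names, and the screen and model are replaced by stand-ins so the example runs anywhere.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema for this sketch)."""
    kind: str        # "click", "type", or "done"
    x: float = 0.0   # click position as a fraction of screen width
    y: float = 0.0   # click position as a fraction of screen height
    text: str = ""   # text to type for a "type" action


def run_loop(objective: str,
             capture: Callable[[], bytes],
             decide: Callable[[str, bytes, List[Action]], Action],
             execute: Callable[[Action], None],
             max_steps: int = 10) -> List[Action]:
    """Repeat screenshot -> decide -> act until the model reports "done"."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                        # what a human would see
        action = decide(objective, screenshot, history)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                               # mouse or keyboard step
    return history


# Stand-ins so the sketch runs without a real screen or model.
def fake_capture() -> bytes:
    return b"fake-screenshot"


def scripted_decide(objective: str, screenshot: bytes,
                    history: List[Action]) -> Action:
    script = [Action("click", 0.5, 0.1),
              Action("type", text=objective),
              Action("done")]
    return script[len(history)]


executed: List[Action] = []
history = run_loop("open a browser", fake_capture, scripted_decide,
                   executed.append)
print([a.kind for a in history])  # ['click', 'type', 'done']
```

The `max_steps` cap is a design choice for the sketch: it keeps a model that never reports "done" from looping forever.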
Key Features
- Compatibility: Designed for various multimodal models.
- Integration: Currently integrated with GPT-4V as the default model, with extended support for Gemini Pro Vision.
- Future Plans: Support for additional models.
Current Challenges
Note: GPT-4V's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.
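One way to put a number on click-location error is to compare predicted click points against the bounding boxes of the intended targets. The sketch below is an assumption about how such a measurement could look, not a metric defined by this project; `click_hit`, `click_distance`, and `error_rate` are hypothetical helpers, and the coordinates are made-up pixel values.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def click_hit(pred: Point, target: Box) -> bool:
    """True when the predicted click lands inside the target box."""
    left, top, right, bottom = target
    return left <= pred[0] <= right and top <= pred[1] <= bottom


def click_distance(pred: Point, target: Box) -> float:
    """Euclidean distance from the predicted click to the box center."""
    cx = (target[0] + target[2]) / 2
    cy = (target[1] + target[3]) / 2
    return math.hypot(pred[0] - cx, pred[1] - cy)


def error_rate(preds: List[Point], targets: List[Box]) -> float:
    """Fraction of predicted clicks that miss their target box."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal length")
    misses = sum(not click_hit(p, t) for p, t in zip(preds, targets))
    return misses / len(preds)


# Illustrative data: the first click hits its button, the second misses.
targets = [(100, 100, 200, 140), (300, 50, 360, 80)]
preds = [(150, 120), (400, 60)]
print(error_rate(preds, targets))  # 0.5
```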
Ongoing Development
At HyperWriteAI, we are developing Agent-1-Vision, a multimodal model with more accurate click location predictions.
Agent-1-Vision Model API Access
We will soon be offering API access to our Agent-1-Vision model.
If you're interested in gaining access to this API, sign up here.
Run Self-Operating Computer
- Install the project
pip install self-operating-computer
- Run the project
operate
- Enter your OpenAI key: If you don't have one, you can obtain an OpenAI key here.
- Give Terminal app the required permissions: As a last step, the Terminal app will ask for permission for "Screen Recording" and "Accessibility" in the "Security & Privacy" page of Mac's "System Preferences".
Alternative installation with .sh
- Clone the repo to a directory on your computer:
git clone https://github.com/OthersideAI/self-operating-computer.git
- Change into the directory:
cd self-operating-computer
- Run the installation script:
./run.sh
Using operate
Modes
Multimodal Models -m
An additional model is now compatible with the Self-Operating Computer Framework. Try Google's gemini-pro-vision by following the instructions below.
Add your Google AI Studio API key to your .env file. If you don't have one, you can obtain a key here after setting up your Google AI Studio account. You may also need to authorize credentials for a desktop application. It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR. The .env entry looks like this:
GOOGLE_API_KEY='your-key-here'
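For readers unfamiliar with .env files, the entry above is a plain `KEY='value'` line. The sketch below shows, under simplifying assumptions, how such a line can be read into a dictionary; `parse_env` is a hypothetical helper written for illustration and is not taken from this project, which may load the file differently.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


env = parse_env("# Google AI Studio\nGOOGLE_API_KEY='your-key-here'\n")
print(env["GOOGLE_API_KEY"])  # your-key-here
```

The sketch deliberately skips features such as multi-line values and variable expansion; a dedicated .env loader is a better fit for real use.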
Start operate with the Gemini model
operate -m gemini-pro-vision
Set-of-Mark Prompting -m gpt-4-with-som
The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting through the gpt-4-with-som model option. This visual prompting method enhances the visual grounding capabilities of large multimodal models.
Learn more about SoM Prompting in the detailed arXiv paper: here.
For this initial version, a simple YOLOv8 model is trained for button detection, and the best.pt file is included under model/weights/. Users are encouraged to swap in their own best.pt file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
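The general idea behind Set-of-Mark prompting is to overlay labeled marks on regions of the image so the model can name a mark instead of guessing raw coordinates. The sketch below illustrates only the bookkeeping side of that idea, assuming a detector has already returned button bounding boxes; `assign_marks` and `mark_to_click` are hypothetical helpers, the boxes are made-up values, and none of this is claimed to match the framework's internals.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def assign_marks(boxes: List[Box]) -> Dict[str, Box]:
    """Label boxes "1", "2", ... in reading order (top to bottom, left to right)."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {str(i): box for i, box in enumerate(ordered, start=1)}


def mark_to_click(marks: Dict[str, Box], label: str) -> Tuple[int, int]:
    """Translate the label chosen by the model into a click at the box center."""
    left, top, right, bottom = marks[label]
    return ((left + right) // 2, (top + bottom) // 2)


# Made-up detections standing in for the output of a button detector.
detected = [(300, 40, 380, 70), (20, 40, 120, 70), (20, 200, 220, 240)]
marks = assign_marks(detected)

# Suppose the model answers with mark "2" after seeing the annotated screenshot.
print(mark_to_click(marks, "2"))  # (340, 55)
```

Because the click point comes from the detector's box rather than from the language model, the accuracy of the final click depends on the quality of the detection weights, which is why swapping in a better best.pt can matter.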
Start operate with the SoM model
operate -m gpt-4-with-som
Voice Mode --voice
The framework supports voice inputs for the objective. Try voice by following the instructions below.
Install the additional requirements-audio.txt
pip install -r requirements-audio.txt
Install device requirements
For Mac users:
brew install portaudio
For Linux users:
sudo apt install portaudio19-dev python3-pyaudio
Run with voice mode
operate --voice
Contributions are Welcome!
If you want to contribute yourself, see CONTRIBUTING.md.
Feedback
For any input on improving this project, feel free to reach out to Josh on Twitter.
Join Our Discord Community
For real-time discussions and community support, join our Discord server.
- If you're already a member, join the discussion in #self-operating-computer.
- If you're new, first join our Discord Server and then navigate to the #self-operating-computer channel.
Compatibility
- This project is compatible with macOS, Windows, and Linux (with X server installed).
OpenAI Rate Limiting Note
The gpt-4-vision-preview model is required. To unlock access to this model, your account needs to have spent at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5.
Learn more here
Hashes for self-operating-computer-1.2.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | d25c79f0d09baec9845fc7ba541de2c4141904c5c2f93e9cfdea37829508bf6e
MD5 | 4949486eefe77a21e7c5aa388ce8e8a6
BLAKE2b-256 | 9c5e6648b747d72c6b52ff6e93619dac1b8943b69f3df68c697d2ee23108b316
Hashes for self_operating_computer-1.2.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 3c1e63fc855fb0d7b36e43a47a973bb1d924a4ffd2d34e91e254d66b387a7056
MD5 | 50b21869d2b91e75bb5aa1c8b5eaeaff
BLAKE2b-256 | 2e3aed15b8dbe7674f94c33ddee43ce1f4299efc641d73160829b2e7544afa7e
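A downloaded file can be checked against the SHA256 digests listed above with Python's standard hashlib module. This is a minimal sketch; `sha256_of` and `matches` are hypothetical helper names, and the file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes the source distribution sits in the current directory):
# matches("self-operating-computer-1.2.2.tar.gz",
#         "d25c79f0d09baec9845fc7ba541de2c4141904c5c2f93e9cfdea37829508bf6e")
```

Reading in chunks keeps memory use flat regardless of file size.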