Self-Operating Computer Framework

A framework to enable multimodal models to operate a computer.

Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
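The view-decide-act cycle described above can be sketched roughly as follows. This is an illustrative outline only, not the framework's actual code: `Action`, `run_objective`, and the three callables passed in (`capture_screen`, `decide`, `execute`) are hypothetical names introduced for this example, and the `"done"` action kind and the `max_steps` cap are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Action:
    """One step chosen by the model (hypothetical structure)."""
    kind: str                      # e.g. "click", "type", or "done"
    payload: Dict[str, Any] = field(default_factory=dict)


def run_objective(
    objective: str,
    capture_screen: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: look at the screen, ask the model for an action, perform it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                    # what a human would see
        action = decide(objective, screenshot, history)  # model picks the next step
        if action.kind == "done":                        # model reports the objective is met
            break
        execute(action)                                  # mouse or keyboard input
        history.append(action)
    return history
```

In a real setup, `decide` would call a multimodal model and `execute` would drive the mouse and keyboard; here they are left as parameters so the loop itself stays independent of any particular model or input library.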

Key Features

Compatibility: Designed for various multimodal models.
Integration: Currently integrated with GPT-4V, Gemini Pro Vision, and LLaVA.
Future Plans: Support for additional models.

Ongoing Development

At HyperWriteAI, we are developing Agent-1-Vision, a multimodal model with more accurate click location predictions.

Agent-1-Vision Model API Access

We will soon be offering API access to our Agent-1-Vision model.

If you're interested in gaining access to this API, sign up here.

Demo

https://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0

Run `Self-Operating Computer`

Install the project

pip install self-operating-computer

Run the project

operate

Enter your OpenAI key: If you don't have one, you can obtain an OpenAI key here.

Give the Terminal app the required permissions: As a last step, the Terminal app will ask for the "Screen Recording" and "Accessibility" permissions on the "Security & Privacy" page of the Mac's "System Preferences".

Alternative installation with `.sh`

Clone the repo to a directory on your computer:

git clone https://github.com/OthersideAI/self-operating-computer.git

Change into the directory:

cd self-operating-computer

Run the installation script:

./run.sh

Using `operate` Modes

Multimodal Models `-m`

An additional model is now compatible with the Self-Operating Computer Framework. Try Google's gemini-pro-vision by following the instructions below.

Start operate with the Gemini model

operate -m gemini-pro-vision

Enter your Google AI Studio API key when the terminal prompts you for it. If you don't have one, you can obtain a key here after setting up your Google AI Studio account. You may also need to authorize credentials for a desktop application. It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.

Locally Hosted LLaVA Through Ollama

If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
Note: Ollama currently only supports macOS and Linux.

First, install Ollama on your machine from https://ollama.ai/download.

Once Ollama is installed, pull the LLaVA model:

ollama pull llava

This will download the model to your machine, which takes approximately 5 GB of storage.

When Ollama has finished pulling LLaVA, start the server:

ollama serve

That's it! Now start operate and select the LLaVA model:

operate -m llava

Important: Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.

Learn more about Ollama at its GitHub Repository
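Before running `operate -m llava`, it can help to confirm that the Ollama server is reachable and that the model has been pulled. The sketch below is a convenience check, not part of the framework. It assumes Ollama is listening on its default local address (`http://localhost:11434`) and uses Ollama's `/api/tags` endpoint, which lists locally available models; the function names `has_model` and `llava_is_available` are made up for this example.

```python
import json
import urllib.error
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def has_model(tags_response: dict, name: str) -> bool:
    """Return True if `name` appears in a parsed /api/tags response.

    Model names are reported with a tag suffix (for example "llava:latest"),
    so both the full name and the part before the colon are compared.
    """
    for model in tags_response.get("models", []):
        model_name = model.get("name", "")
        if model_name == name or model_name.split(":")[0] == name:
            return True
    return False


def llava_is_available(url: str = OLLAMA_TAGS_URL, timeout: float = 3.0) -> bool:
    """Query the local Ollama server and report whether LLaVA has been pulled."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return has_model(json.load(response), "llava")
    except (urllib.error.URLError, OSError, ValueError):
        # Server not running, connection refused, timeout, or unexpected payload.
        return False
```

Calling `llava_is_available()` returns `False` both when the server is down and when the model is missing, so a `False` result means either `ollama serve` or `ollama pull llava` still needs to be run.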

Voice Mode `--voice`

The framework supports voice inputs for the objective. Try voice by following the instructions below.

Clone the repo to a directory on your computer:

git clone https://github.com/OthersideAI/self-operating-computer.git

Change into the directory:

cd self-operating-computer

Install the additional requirements from requirements-audio.txt:

pip install -r requirements-audio.txt

Install device requirements:

For Mac users:

brew install portaudio

For Linux users:

sudo apt install portaudio19-dev python3-pyaudio

Run with voice mode:

operate --voice

Optical Character Recognition Mode `-m gpt-4-with-ocr`

The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the gpt-4-with-ocr mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to click an element by its text, and the code then looks up that text in the hash map to get the coordinates of the element GPT-4 wanted to click.

Based on recent tests, OCR performs better than SoM and vanilla GPT-4, so we made it the default for the project. To use the OCR mode you can simply write:

operate

Running operate -m gpt-4-with-ocr will also work.
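The text-to-coordinates lookup behind the OCR mode can be illustrated with a small sketch. It is not the framework's implementation: the detection format (a text string plus a pixel bounding box), the choice to click the center of the box, the case-insensitive matching, and the names `build_click_map` and `resolve_click` are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical OCR output: (recognized text, (left, top, right, bottom)) in pixels.
Detection = Tuple[str, Tuple[int, int, int, int]]
Point = Tuple[int, int]


def build_click_map(detections: List[Detection]) -> Dict[str, Point]:
    """Map each recognized text to the center point of its bounding box."""
    click_map: Dict[str, Point] = {}
    for text, (left, top, right, bottom) in detections:
        key = text.strip().lower()
        # Keep the first occurrence when the same text appears more than once.
        if key and key not in click_map:
            click_map[key] = ((left + right) // 2, (top + bottom) // 2)
    return click_map


def resolve_click(click_map: Dict[str, Point], text: str) -> Optional[Point]:
    """Return the coordinates for the text the model asked to click, if known."""
    return click_map.get(text.strip().lower())


detections = [
    ("Submit", (100, 200, 180, 240)),
    ("Cancel", (200, 200, 280, 240)),
]
click_map = build_click_map(detections)
print(resolve_click(click_map, "submit"))   # (140, 220)
print(resolve_click(click_map, "Help"))     # None
```

The point of the design is that the model only has to name the text it wants to click, which it reads reliably, instead of predicting pixel coordinates directly.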

Set-of-Mark Prompting `-m gpt-4-with-som`

The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the gpt-4-with-som command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.

Learn more about SoM Prompting in the detailed arXiv paper: here.

For this initial version, a simple YOLOv8 model is trained for button detection, and the best.pt file is included under model/weights/. Users are encouraged to swap in their best.pt file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).

Start operate with the SoM model

operate -m gpt-4-with-som

Contributions Are Welcome!

If you want to contribute yourself, see CONTRIBUTING.md.

Feedback

For any input on improving this project, feel free to reach out to Josh on Twitter.

Join Our Discord Community

For real-time discussions and community support, join our Discord server.

If you're already a member, join the discussion in #self-operating-computer.
If you're new, first join our Discord server and then navigate to #self-operating-computer.

Follow HyperWriteAI for More Updates

Stay updated with the latest developments:

Follow HyperWriteAI on Twitter.
Follow HyperWriteAI on LinkedIn.

Compatibility

This project is compatible with macOS, Windows, and Linux (with X server installed).

OpenAI Rate Limiting Note

The gpt-4-vision-preview model is required. To unlock access to this model, your account needs to spend at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5.
Learn more here

Project details

Release history

Version	Release date
1.5.8	Feb 28, 2025
1.5.7	Jan 23, 2025
1.5.6	Jan 23, 2025
1.5.5	Dec 19, 2024
1.5.1	Dec 18, 2024
1.5.0	Dec 18, 2024
1.4.6	Jul 9, 2024
1.4.5	Mar 21, 2024
1.4.2	Mar 20, 2024
1.4.1	Mar 20, 2024
1.4.0	Mar 20, 2024
1.3.2	Feb 17, 2024
1.3.1 (this version)	Feb 9, 2024
1.3.0	Feb 9, 2024
1.2.9	Feb 2, 2024
1.2.8	Jan 25, 2024
1.2.7	Jan 24, 2024
1.2.6	Jan 24, 2024
1.2.5	Jan 19, 2024
1.2.4	Jan 19, 2024
1.2.3	Jan 19, 2024
1.2.2	Jan 19, 2024
1.2.1	Jan 19, 2024
1.2.0	Jan 16, 2024
1.1.2	Jan 10, 2024
1.1.1	Jan 7, 2024
1.0.9	Dec 30, 2023
1.0.8	Dec 20, 2023
1.0.7	Dec 19, 2023
1.0.6	Dec 13, 2023
1.0.5	Dec 9, 2023
1.0.3	Dec 6, 2023
1.0.2	Dec 6, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

self-operating-computer-1.3.1.tar.gz (5.7 MB)

Uploaded Feb 9, 2024 Source

Built Distribution

self_operating_computer-1.3.1-py3-none-any.whl (5.7 MB)

Uploaded Feb 9, 2024 Python 3

File details

Details for the file self-operating-computer-1.3.1.tar.gz.

File metadata

Download URL: self-operating-computer-1.3.1.tar.gz
Upload date: Feb 9, 2024
Size: 5.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for self-operating-computer-1.3.1.tar.gz
Algorithm	Hash digest
SHA256	`5046be139701c2f25e1c2376115e8e6f09b998f09cbae16a1eb54c9fc54d2c1a`
MD5	`2b943342a218ec0477a05718ad8ebc59`
BLAKE2b-256	`fa152f315c1e56b9999a021a7df22fc2b85bfb108bc41b8412fcb6e141e01531`

See more details on using hashes here.

File details

Details for the file self_operating_computer-1.3.1-py3-none-any.whl.

File metadata

Download URL: self_operating_computer-1.3.1-py3-none-any.whl
Upload date: Feb 9, 2024
Size: 5.7 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for self_operating_computer-1.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`94f877a621bf1833b89363dfda9a2e401d6b75f42abd5e81a9d76125d9b2af84`
MD5	`dddafa617140d29d4f35e260c1b4b8e5`
BLAKE2b-256	`674cd8ed353bf7aee9529c69f76e0525a12bd929f91fd4a87cacacc658367332`

See more details on using hashes here.
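The digests listed above can be checked against a downloaded file. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `sha256_of` and `matches_sha256` are introduced here for illustration, and the file path in the usage note assumes the wheel was saved to the current directory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_sha256("self_operating_computer-1.3.1-py3-none-any.whl", "94f877a621bf1833b89363dfda9a2e401d6b75f42abd5e81a9d76125d9b2af84")` should return `True` for an intact copy of the wheel listed above.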
