
Project description

Hanzo Operator

Self-Operating Computer Framework

A framework to enable multimodal models to operate a computer.

Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
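The observe-decide-act loop described above can be sketched roughly as follows. The function names and action format here are illustrative placeholders, not the framework's actual API:

```python
# Rough sketch of the loop described above. capture_screen, ask_model, and
# perform are stand-ins for the real screenshot, model-call, and
# input-automation layers.
def operate(objective, capture_screen, ask_model, perform, max_steps=10):
    for _ in range(max_steps):
        screenshot = capture_screen()
        # The model sees the screen and the objective, then picks an action,
        # e.g. {"op": "click", "x": 120, "y": 300} or {"op": "done"}.
        action = ask_model(objective, screenshot)
        if action["op"] == "done":
            return True
        perform(action)
    return False
```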

Key Features

  • Compatibility: Designed for various multimodal models.
  • Integration: Currently integrated with GPT-4o, o1, Gemini Pro Vision, Claude 3, Qwen-VL and LLaVa.
  • Future Plans: Support for additional models.

Demo

https://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0

Run Self-Operating Computer

  1. Install the project:
pip install hanzo-operator
  2. Run the project:
hanzo
  3. Enter your OpenAI key: If you don't have one, you can obtain an OpenAI key here. If you need to change your key at a later point, run vim .env to open the .env file and replace the old key.
  4. Give the Terminal app the required permissions: As a last step, the Terminal app will ask for "Screen Recording" and "Accessibility" permissions in the "Security & Privacy" pane of Mac's "System Preferences".
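If you prefer to set the key non-interactively instead of editing .env in vim, you can write it to the .env file yourself. OPENAI_API_KEY is assumed here to be the variable name the framework reads:

```shell
# Hypothetical alternative to editing .env by hand; replace the placeholder
# with your real key before running.
echo 'OPENAI_API_KEY=your-key-here' > .env
```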

Using hanzo Modes

OpenAI models

The default model for the project is gpt-4o which you can use by simply typing hanzo. To try running OpenAI's new o1 model, use the command below.

hanzo -m o1-with-ocr

Multimodal Models -m

Try Google's gemini-pro-vision by following the instructions below. Start hanzo with the Gemini model:

hanzo -m gemini-pro-vision

Enter your Google AI Studio API key when the terminal prompts you for it. If you don't have one, you can obtain a key here after setting up your Google AI Studio account. You may also need to authorize credentials for a desktop application. It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.

Try Claude -m claude-3

Use Claude 3 with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the Claude dashboard to get an API key and run the command below to try it.

hanzo -m claude-3

Try Qwen -m qwen-vl

Use Qwen-VL with vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the Qwen dashboard to get an API key and run the command below to try it.

hanzo -m qwen-vl

Try LLaVa Hosted Through Ollama -m llava

If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
Note: Ollama currently supports macOS and Linux, with Windows support in preview.

First, install Ollama on your machine from https://ollama.ai/download.

Once Ollama is installed, pull the LLaVA model:

ollama pull llava

This will download the model to your machine, which takes approximately 5 GB of storage.

When Ollama has finished pulling LLaVA, start the server:

ollama serve

That's it! Now start hanzo and select the LLaVA model:

hanzo -m llava

Important: Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.

Learn more about Ollama at its GitHub Repository

Voice Mode --voice

The framework supports voice inputs for the objective. Try voice by following the instructions below. Clone the repo to a directory on your computer:

git clone https://github.com/OthersideAI/self-operating-computer.git

Change into the directory:

cd self-operating-computer

Install the additional requirements from requirements-audio.txt:

pip install -r requirements-audio.txt

Install device requirements. For Mac users:

brew install portaudio

For Linux users:

sudo apt install portaudio19-dev python3-pyaudio

Run with voice mode:

hanzo --voice

Optical Character Recognition Mode -m gpt-4-with-ocr

The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the gpt-4-with-ocr mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to click an element by its text, and the code then looks up that text in the hash map to get the coordinates of the element GPT-4 wants to click.
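The text-to-coordinates lookup might be sketched like this. This is a minimal illustration of the idea, not the framework's actual code; the names and the OCR result format are assumptions:

```python
# Hypothetical sketch of the OCR click flow described above.
# ocr_results is assumed to be a list of (text, (x, y, width, height)) pairs.

def build_click_map(ocr_results):
    """Map each recognized text string to the center of its bounding box."""
    click_map = {}
    for text, (x, y, w, h) in ocr_results:
        click_map[text] = (x + w / 2, y + h / 2)
    return click_map

def resolve_click(click_map, target_text):
    """Return coordinates for the element the model chose to click by text."""
    return click_map.get(target_text)

ocr = [("Submit", (100, 200, 80, 30)), ("Cancel", (200, 200, 80, 30))]
cmap = build_click_map(ocr)
print(resolve_click(cmap, "Submit"))  # center point of the "Submit" button
```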

Based on recent tests, OCR performs better than SoM and vanilla GPT-4, so we made it the default for the project. To use OCR mode, simply run:

hanzo

Running hanzo -m gpt-4-with-ocr will also work.

Set-of-Mark Prompting -m gpt-4-with-som

The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the gpt-4-with-som command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.

Learn more about SoM Prompting in the detailed arXiv paper: here.

For this initial version, a simple YOLOv8 model is trained for button detection, and the best.pt file is included under model/weights/. Users are encouraged to swap in their best.pt file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
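Conceptually, SoM prompting numbers each detected element on the screenshot so the model can answer with a mark number instead of raw pixel coordinates. A minimal sketch of that mark-to-coordinates mapping, with illustrative names that are not the framework's actual code:

```python
def assign_marks(boxes):
    """Label each detected box "1".."N"; these marks are drawn onto the
    screenshot so the model can refer to elements by number."""
    return {str(i + 1): box for i, box in enumerate(boxes)}

def mark_to_point(marks, chosen_mark):
    """Translate the model's chosen mark back to a clickable center point."""
    x, y, w, h = marks[chosen_mark]
    return (x + w / 2, y + h / 2)

# (x, y, width, height) boxes as a button detector might emit them.
buttons = [(40, 100, 120, 32), (200, 100, 120, 32)]
marks = assign_marks(buttons)
print(mark_to_point(marks, "2"))  # center of the second detected button
```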

Start hanzo with the SoM model:

hanzo -m gpt-4-with-som

Contributions are Welcome!

If you want to contribute yourself, see CONTRIBUTING.md.

Feedback

For any input on improving this project, feel free to reach out to Josh on Twitter.

Join Our Discord Community

For real-time discussions and community support, join our Discord server.

Follow HyperWriteAI for More Updates

Stay updated with the latest developments.

Compatibility

  • This project is compatible with Mac OS, Windows, and Linux (with X server installed).

OpenAI Rate Limiting Note

The gpt-4o model is required. To unlock access to this model, your account needs to spend at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5.
Learn more here

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hanzo_operator-0.0.1.tar.gz (5.7 MB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hanzo_operator-0.0.1-py3-none-any.whl (5.7 MB)

Uploaded Python 3

File details

Details for the file hanzo_operator-0.0.1.tar.gz.

File metadata

  • Download URL: hanzo_operator-0.0.1.tar.gz
  • Upload date:
  • Size: 5.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for hanzo_operator-0.0.1.tar.gz:

  • SHA256: 92123b344f294adf521f8c73231dc69b153d3c83816f6203dde4faa0cca450c6
  • MD5: e860a6f7c69de2e121b07f3333fca374
  • BLAKE2b-256: fd8856d59fb372bd58221428dd6678466ccda2e963a611781b1d6eb1c3d948cf

See more details on using hashes here.

File details

Details for the file hanzo_operator-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: hanzo_operator-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 5.7 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for hanzo_operator-0.0.1-py3-none-any.whl:

  • SHA256: af9bc0d5cf9a58e563ceb0043509c9054f7e91631e36c8724e8f131f0da24afb
  • MD5: 37a263cb085a9224d020879ed25b2a8a
  • BLAKE2b-256: a4f31ebbf407855afc454e61bfd5ba79d41160acbe83ae8ed4e8d1a03e863520

See more details on using hashes here.
