A CLI interface for the METR task standard

Development Status
- 1 - Planning
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- OS Independent
Project description

METR CLI

Overview

METR_CLI is a command line interface for easily creating, developing, and submitting tasks made with the METR Task Standard.

Additionally, this project contains Python packages and utilities for developing METR tasks with Python. (Eventually, these will be separated into their own Python package.)

Background

METR is an organization studying the science of evaluating the capabilities and tendencies of state-of-the-art AI models.

The goal of the METR Task Standard is to define a common format for such tasks, so that everyone can evaluate their agents on tasks developed by other parties, rather than just their own. Making and validating informative tasks is a large amount of work, so de-duplicating efforts is important for the overall evals ecosystem. More info here.

1 Motivation and Note to METR

Back in spring of 2024, Gatlen Culp and other MAIA members participated in a METR hackathon. The repository is still rough. (TODO: Write this)

2 Setup

1. Install metr-cli: `pip install metr-cli`
2. Create your METR task from our template: `metr task create <dir/to/clone/to> --name <task_family_name> --type <task_type=swe,cybersecurity,etc.>`
3. Make sure Docker is running (install it if you haven't already).
4. Test that everything is working with Docker, your task, and the CLI: `metr task run <dir/containing/task_family> addition` (`addition` is a predefined task in our template).
5. Develop your task, following METR's task standard documentation (https://github.com/METR/task-template).

Requirements

Python 3.10+
- Pydantic (Data Validation + Better Objects)
- Click (CLI library)
- Rich (Higher-Quality Console Interfaces)
- Cookiecutter (Project Templating Software)
Node and NPM
Docker
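
The requirements above can be probed from Python before starting. This is a minimal sketch and not part of metr-cli: the function name `check_requirements` and the executable names `docker`, `node`, and `npm` are assumptions, and finding `docker` on the PATH does not confirm that the Docker daemon is actually running.

```python
import shutil
import sys

# Executable names are assumptions based on the Requirements list.
REQUIRED_TOOLS = ("docker", "node", "npm")
MIN_PYTHON = (3, 10)


def check_requirements(which=shutil.which, version=sys.version_info):
    """Return a mapping of requirement name -> True/False."""
    results = {"python>=3.10": tuple(version[:2]) >= MIN_PYTHON}
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        results[tool] = which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{'ok' if ok else 'MISSING':8}{name}")
```

The `which` and `version` parameters exist only so the check can be exercised without the tools installed.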

3 Development Workflow

1. Make sure the project works locally by running the entrypoint in the development repository: `python3.10 ./src/metr/metr_cli.py run ./examples/my_task/ addition`
2. Build the project from pyproject.toml with `python -m build`, which saves a source distribution and a wheel to METR_CLI/dist.
3. Install the wheel: `pip install ./dist/metr_cli-0.0.1-py3-none-any.whl --force-reinstall`
4. Test that the CLI is available from your shell: `metr` (should print usage info).
5. Test other metr commands (e.g. `metr task run ./examples/my_task/my_task.py addition`).
6. Set up a .pypirc in your home directory with your PyPI API key and username.
7. Upload the package: `python -m twine upload dist/*`
8. Follow the Setup section above to confirm everything is working.
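
The build, install, and upload commands from the workflow above can be generated for a given version so the wheel file name stays in sync. This is a sketch only: `release_commands` is a hypothetical helper, and the wheel name simply follows the pattern shown in the workflow.

```python
def release_commands(version: str) -> list[str]:
    """Return the shell commands from the workflow above, in order."""
    # Wheel name pattern copied from the install step (pure-Python wheel).
    wheel = f"./dist/metr_cli-{version}-py3-none-any.whl"
    return [
        "python -m build",
        f"pip install {wheel} --force-reinstall",
        "metr",
        "python -m twine upload dist/*",
    ]


for command in release_commands("0.0.1"):
    print(command)
```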

4 CLI Usage

Everything is under the metr task command for now

4.A CREATE

Create Task (Status: ✅ Working)

metr task create <dir/to/clone/to> --name <task_family_name> --type <task_type=swe,cybersecurity,etc.>

This will create a template METR task in the directory you specify

Uses the Cookiecutter Library
Future features:
- Will walk you through setting up the METR task standard with your personal information
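
To make the shape of the create command concrete, here is a stand-in parser built with the standard library's `argparse`. It is an illustration under assumptions, not the project's implementation: the real CLI uses a different CLI library, whether `--name` and `--type` are mandatory is a guess, and the context keys `task_family_name` and `task_type` are hypothetical names for values a templating tool such as Cookiecutter could substitute.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser for `metr task create` (illustration only)."""
    parser = argparse.ArgumentParser(prog="metr")
    groups = parser.add_subparsers(dest="group", required=True)
    task = groups.add_parser("task")
    commands = task.add_subparsers(dest="command", required=True)

    create = commands.add_parser("create")
    create.add_argument("directory", help="directory to clone the template into")
    # Assumption: both options are mandatory; the real CLI may behave differently.
    create.add_argument("--name", required=True, help="task family name")
    create.add_argument("--type", dest="task_type", required=True, help="e.g. swe")
    return parser


def template_context(args: argparse.Namespace) -> dict:
    """Collect template values; the key names are hypothetical."""
    return {"task_family_name": args.name, "task_type": args.task_type}


args = build_parser().parse_args(
    ["task", "create", "./my_task", "--name", "my_task", "--type", "swe"]
)
print(args.directory, template_context(args))
```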

Register Task with CLI (Status: 🤔 Hypothetical)

metr task register <cli_task_name> <dir/containing/task_family>

Will register your task with the CLI as <cli_task_name> for easier interaction (no need to use the container name) and cleanup.

Future features
- Also be able to register an agent for easier development

4.B DEVELOP

Run Task (Status: ✅ Working)

metr task run <dir/containing/task_family> <task_name>

Launches your task in a Docker container for an agent to complete

Run tests in a task environment

npm:
- All tests: npm run test -- "taskFamilyDirectory" "taskName" "testFileName"
- Single test: npm run test -- "taskFamilyDirectory" "taskName" "testFileName::testName"
CLI:
- All tests: metr task test <task_family_directory> <task_name> <test_file>
- Single test: metr task test <task_family_directory> <task_name> <test_file> --test-name <test_name>
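
The correspondence between the CLI form and the npm form listed above can be sketched as a small function that builds the npm argument list. `npm_test_command` is a hypothetical name, and this shows only the argument mapping, not how the CLI actually invokes npm.

```python
def npm_test_command(task_family_directory, task_name, test_file, test_name=None):
    """Build the npm argv equivalent to a `metr task test` call."""
    # A single test is addressed as "testFileName::testName".
    target = test_file if test_name is None else f"{test_file}::{test_name}"
    return ["npm", "run", "test", "--", task_family_directory, task_name, target]


print(npm_test_command("./examples/my_task", "addition", "my_task_test.py"))
print(npm_test_command("./examples/my_task", "addition", "my_task_test.py", "test_sum"))
```

The file and test names in the usage lines are placeholders.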

Run Agent to Complete Task (Status: 🦴 Skeleton Code)

metr task run-agent "<container_name> <standard_agent_name|path/to/agent>:<path/in/VM> [start_agent_command]"

Run an agent inside a task environment to complete the task.

Score Task (Status: 🦴 Skeleton Code)

metr task score <container_name>

Scores the outcome of a task according to your TaskFamily's scoring function.

Future features
- Instead of having to specify a container name, the CLI will keep track of the running container, properly running, stopping, scoring, or running an agent within it given only the name of the TaskFamily.
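
One way the planned container tracking could work is a small JSON file that maps a TaskFamily name to its container name. This is purely a sketch of the idea: the class `ContainerRegistry`, its methods, and the file name `containers.json` are all hypothetical and do not exist in metr-cli.

```python
import json
import tempfile
from pathlib import Path


class ContainerRegistry:
    """Persist a TaskFamily name -> container name mapping in a JSON file."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self):
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text())

    def register(self, task_family, container_name):
        entries = self._load()
        entries[task_family] = container_name
        self.path.write_text(json.dumps(entries, indent=2))

    def lookup(self, task_family):
        # Returns None when no container is recorded for this task family.
        return self._load().get(task_family)

    def forget(self, task_family):
        entries = self._load()
        entries.pop(task_family, None)
        self.path.write_text(json.dumps(entries, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    registry = ContainerRegistry(Path(tmp) / "containers.json")
    registry.register("my_task", "container-123")
    found = registry.lookup("my_task")
    registry.forget("my_task")
    missing = registry.lookup("my_task")

print(found, missing)
```

With such a mapping, commands like score or destroy could accept the TaskFamily name and resolve the container themselves.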

Destroy Task (Status: 🦴 Skeleton Code)

metr task destroy <task_environment_identifier>

Deletes the task container for cleanup

Export Files from Container (Status: 🦴 Skeleton Code)

metr task export <container_name> <file1> <file2> ...

Export files from a task container to (your local codebase?)

4.C SUBMIT

Validate Task (Status: 🤔 Hypothetical)

metr task validate <task/project/dir>

Will run various tests to confirm that the project is ready for submission to METR

- Checks that eval_info.json is configured
- Checks that task docs are configured (detail.md, summary.md)
- Checks that QA docs are configured (qa/progress.md, qa/qa.json, review.md)
- Checks that the task can be launched properly and that there are no issues
- Checks that the repository remote is NOT public (ex: a GitHub repository anyone can check)
- Tests the connection to METR's submission endpoint
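
The file-presence part of these checks could look roughly like the sketch below. It assumes the listed files live at the given paths relative to the task project directory, which the list above does not state explicitly, and `missing_files` is a hypothetical helper. The launch, remote-visibility, and endpoint checks are not covered.

```python
import tempfile
from pathlib import Path

# Paths copied from the checklist; their location relative to the
# project root is an assumption.
REQUIRED_FILES = (
    "eval_info.json",
    "detail.md",
    "summary.md",
    "qa/progress.md",
    "qa/qa.json",
    "review.md",
)


def missing_files(project_dir):
    """Return the required files that are absent from project_dir."""
    root = Path(project_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "eval_info.json").write_text("{}")
    (root / "qa").mkdir()
    (root / "qa" / "qa.json").write_text("{}")
    still_missing = missing_files(root)

print(still_missing)
```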

Package Task (Status: 🤔 Hypothetical)

metr task package <input/task/project/dir> <output/task/package/dir>

Packages your METR task into an encrypted tar file for submission. Runs metr task validate prior to submitting
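
The archiving half of this step can be sketched with the standard library's `tarfile` module. This covers only the tar creation: encryption is deliberately left out because the scheme is not specified here, and the helper name `package_task` and the `<project name>.tar.gz` output name are assumptions.

```python
import tarfile
import tempfile
from pathlib import Path


def package_task(project_dir, output_dir):
    """Write <output_dir>/<project name>.tar.gz and return its path."""
    project = Path(project_dir).resolve()
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    archive_path = output / f"{project.name}.tar.gz"
    with tarfile.open(str(archive_path), "w:gz") as archive:
        # arcname keeps paths inside the archive relative to the project folder.
        archive.add(str(project), arcname=project.name)
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "my_task"
    project.mkdir()
    (project / "my_task.py").write_text("# task family code\n")
    archive_path = package_task(project, Path(tmp) / "packages")
    with tarfile.open(str(archive_path)) as archive:
        packaged_names = archive.getnames()

print(packaged_names)
```

The output directory should sit outside the project directory so the archive does not end up packaging itself.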

Submit Task (Status: 🤔 Hypothetical)

metr task submit <task/package/file>

Submits your METR task file to METR's submission endpoint

5 Python Package Usage

5.A `metr.core`

The metr.core package contains

Recommendations on How to Maintain

Worries

Known Issues

Mapping of npm functions to CLI functions

Note: The metr task create command doesn't directly map to an npm function. It's a custom command for creating new task definitions in your project structure.

Release history

- 0.0.2 (Aug 22, 2024, this version)
- 0.0.1 (Aug 22, 2024)

Download files

Source Distribution

metr_cli-0.0.2.tar.gz (3.2 MB), uploaded Aug 22, 2024

Built Distribution

metr_cli-0.0.2-py3-none-any.whl (6.5 MB), uploaded Aug 22, 2024

File details

Details for the file metr_cli-0.0.2.tar.gz.

File metadata

Download URL: metr_cli-0.0.2.tar.gz
Upload date: Aug 22, 2024
Size: 3.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.1

File hashes

Hashes for metr_cli-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`d7285e8067ec531f6c79d480678028790bf00ff14039e51363d1f9603f2e1971`
MD5	`9cd2eb6a4465b0752462606bd9c634ad`
BLAKE2b-256	`f59d727f64f6561dc57b0f4b60f7cde94656a401589d97e018498929feb55a36`

File details

Details for the file metr_cli-0.0.2-py3-none-any.whl.

File metadata

Download URL: metr_cli-0.0.2-py3-none-any.whl
Upload date: Aug 22, 2024
Size: 6.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.1

File hashes

Hashes for metr_cli-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`16bbc4987d35dcc97c0ff5c274b9d0cc935272c21fed3f1a53349754edc237e2`
MD5	`d192149a668de66a39768b5fde1faa0d`
BLAKE2b-256	`3f9834314bc94be0d083f406ba8aebdd9a4a2a6af0c748ff5b11524e5254a4e4`

