AutoArena

Create leaderboards ranking LLM outputs against one another using automated judge evaluation

🏆 Rank outputs from different LLMs, RAG setups, and prompts to find the best configuration of your system
⚔️ Perform automated head-to-head evaluation using judges from OpenAI, Anthropic, Cohere, and more
🤖 Define and run your own custom judges, connecting to internal services or implementing bespoke logic
💻 Run the application locally, giving you full control over your environment and data

🤔 Why Head-to-Head Evaluation?

LLMs are better at judging responses head-to-head than they are in isolation (arXiv:2408.08688). Leaderboard rankings computed using Elo scores from many automated side-by-side comparisons should therefore be more trustworthy than leaderboards using metrics computed on each model's responses independently.
The LMSYS Chatbot Arena has replaced benchmarks for many people as the trusted true leaderboard for foundation model performance (arXiv:2403.04132). Why not apply this approach to your own foundation model selection, RAG system setup, or prompt engineering efforts?
Using a "jury" of multiple smaller models from different model families like gpt-4o-mini, command-r, and claude-3-haiku generally yields better accuracy than a single frontier judge like gpt-4o — while being faster and much cheaper to run. AutoArena is built around this technique, called PoLL: Panel of LLM evaluators (arXiv:2404.18796).
Automated side-by-side comparison of model outputs is one of the most prevalent evaluation practices (arXiv:2402.10524). AutoArena makes it easy to get this process up and running.
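
As a rough illustration of the ideas above, the sketch below combines a panel of judge verdicts by majority vote and feeds the result into a standard Elo update. It is not AutoArena's implementation: the function names, the "A"/"B"/"-" verdict encoding, the starting rating of 1000, and the K-factor of 32 are all assumptions made for this example.

```python
from collections import Counter


def panel_vote(votes):
    """Majority vote over judge verdicts: "A", "B", or "-" for a tie."""
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    # A split panel with no unique plurality is treated as a tie.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "-"
    return ranked[0][0]


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, winner, k=32.0):
    """Return updated (rating_a, rating_b) after one head-to-head comparison."""
    score_a = {"A": 1.0, "B": 0.0, "-": 0.5}[winner]
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical verdicts from a three-judge panel on two prompts.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
panel_results = [["A", "A", "B"], ["A", "-", "A"]]
for votes in panel_results:
    winner = panel_vote(votes)
    ratings["model-a"], ratings["model-b"] = update_elo(
        ratings["model-a"], ratings["model-b"], winner
    )
print(ratings)
```

Each comparison moves the two ratings by equal and opposite amounts, so a model that keeps winning against a higher-rated opponent climbs quickly, while expected wins barely change the ranking.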

🔥 Getting Started

Install from PyPI:

pip install autoarena

Run as a module and visit localhost:8899 in your browser:

python -m autoarena

With the application running, getting started is simple:

1. Create a project via the UI.
2. Add responses from a model by selecting a CSV file with prompt and response columns.
3. Configure an automated judge via the UI. Note that most judges require credentials, e.g. X_API_KEY in the environment where you're running AutoArena.
4. Add responses from a second model. This kicks off an automated judging task in which the judges you configured in the previous step decide which of the uploaded models provided the better response to each prompt.

That's it! After these steps you're fully set up for automated evaluation on AutoArena.
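
Since most judges need credentials in the environment, it can help to check for them before launching. The sketch below is only an illustration: the variable names are assumptions that follow the X_API_KEY pattern mentioned above, and the helper `missing_keys` is hypothetical, not part of AutoArena.

```python
import os

# Assumed names following the X_API_KEY pattern; adjust to the judges you use.
ASSUMED_KEY_NAMES = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "COHERE_API_KEY"]


def missing_keys(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_keys(ASSUMED_KEY_NAMES)
if missing:
    print("Not set in this environment:", ", ".join(missing))
else:
    print("All assumed judge credentials are set.")
```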

📄 Formatting Your Data

AutoArena requires two pieces of information to test a model: the input prompt and corresponding model response.

prompt: the inputs to your model. When uploading responses, any other models that have been run on the same prompts are matched and evaluated using the automated judges you have configured.
response: the output from your model. Judges decide which of two models produced a better response, given the same prompt.
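
A file in this shape can be produced with Python's standard `csv` module. The sketch below is a minimal example; the file name `model-a.csv`, the helper `write_responses`, and the sample rows are made up for illustration.

```python
import csv


def write_responses(path, rows):
    """Write (prompt, response) pairs to a CSV with the two required columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["prompt", "response"])
        writer.writeheader()
        for prompt, response in rows:
            writer.writerow({"prompt": prompt, "response": response})


# Hypothetical example data for one model.
rows = [
    ("What is 2 + 2?", "4"),
    ("Name a primary color.", "Red, for example."),
]
write_responses("model-a.csv", rows)
```

Writing a second file for another model with the same prompt strings gives the judges matching pairs to compare.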

📂 Data Storage

Data is stored in ./data/<project>.duckdb files in the directory where you invoked AutoArena. See data/README.md for more details on data storage in AutoArena.
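
To see which projects exist on disk, the data directory can be listed directly. This sketch relies only on the ./data/<project>.duckdb naming described above; the helper `list_project_files` is hypothetical and does not open or inspect the database files.

```python
from pathlib import Path


def list_project_files(data_dir="data"):
    """Return {project_name: path} for every <project>.duckdb file in data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        return {}
    return {p.stem: p for p in sorted(root.glob("*.duckdb"))}


for name, path in list_project_files().items():
    print(f"{name}: {path} ({path.stat().st_size} bytes)")
```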

🦾 Development

AutoArena uses uv to manage dependencies. To set up this repository for development, run:

uv venv && source .venv/bin/activate
uv pip install --all-extras -r pyproject.toml
uv tool run pre-commit install
uv run python3 -m autoarena --dev

To run AutoArena for development, you will need to run both the backend and frontend services:

Backend: uv run python3 -m autoarena --dev (the --dev/-d flag enables automatic service reloading when source files change)
Frontend: see ui/README.md

To build a release tarball in the ./dist directory:

./scripts/build.sh

Release history

0.1.0b11 pre-release, Oct 7, 2024
0.1.0b10 pre-release, Sep 18, 2024
0.1.0b9 pre-release, Sep 17, 2024
0.1.0b8 pre-release, Sep 13, 2024
0.1.0b7 pre-release, Sep 11, 2024 (this version)
0.1.0b6 pre-release, Sep 10, 2024
0.1.0b5 pre-release, Sep 9, 2024
0.1.0b4 pre-release, Sep 9, 2024
0.1.0b3 pre-release, Sep 5, 2024

Download files

Source Distribution

autoarena-0.1.0b7.tar.gz (1.2 MB)

Uploaded Sep 11, 2024 Source

File details

Details for the file autoarena-0.1.0b7.tar.gz.

File metadata

Download URL: autoarena-0.1.0b7.tar.gz
Upload date: Sep 11, 2024
Size: 1.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for autoarena-0.1.0b7.tar.gz
Algorithm	Hash digest
SHA256	`c83f8a857dba15b713b01a12f158e1cdf848ee577c2425191e1eff6f66b0c0bc`
MD5	`bc4929ac1f63a3759a268a18d50babdf`
BLAKE2b-256	`a3c8fb516580595cb63694c90862212c7f8f69c4ae94cad3b0988be38d0ef373`
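
A downloaded source distribution can be checked against the SHA256 digest listed above using Python's `hashlib`. This is a minimal sketch: the helper names are made up for illustration, and the local path passed to `matches_expected` is wherever you saved the tarball.

```python
import hashlib

# SHA256 digest for autoarena-0.1.0b7.tar.gz, copied from the table above.
EXPECTED_SHA256 = "c83f8a857dba15b713b01a12f158e1cdf848ee577c2425191e1eff6f66b0c0bc"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("autoarena-0.1.0b7.tar.gz")` returns True only if the file in the current directory has exactly the published digest.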