Peer-based LLM cross-evaluation system

These details have not been verified by PyPI

Project links

Homepage

Project description

LLMRank

SlopRank is an evaluation framework that ranks LLMs using peer-based cross-evaluation and PageRank. It aims to provide unbiased, dynamic, and scalable benchmarking of multiple models, fostering transparency and innovation in the development of AI systems.

You can run it on one large set of heterogeneous prompts to get an overall ranking, or on smaller sets to get rankings for your particular use case.

Definitive ranking:

=== PageRank Rankings ===
	model	pagerank_score
0	o1-preview	0.179404
1	gpt-4o	0.178305
2	deepseek-chat	0.167105
3	gemini-2.0-flash-thinking-exp-1219	0.164732
4	claude-3-5-sonnet-latest	0.155571
5	gemini-exp-1206	0.154884

Supported models include ChatGPT-4o, Claude-3.7-Sonnet, Deepseek-Reasoner, Gemini-2.0-Pro, O1, and others.

Features

Peer-Based Evaluation: Models evaluate each other's responses, mimicking a collaborative and competitive environment.
Customizable Scoring:
- Numeric Ratings (1–10) for granular evaluation.
- Upvote/Downvote for simple binary scoring.
Subset Evaluation: Reduce API costs by limiting the models each evaluator reviews.
Graph-Based Ranking: Endorsements are represented in a graph, and PageRank is used to compute relative rankings.
Scalable Benchmarking: Add more models or prompts with ease, maintaining flexibility and efficiency.
Graph Visualization: Visualize model endorsements with interactive and static graph visualizations.
Category-Based Analysis: Evaluate model performance across different prompt categories (reasoning, coding, etc.).
Statistical Confidence: Calculate confidence intervals and significance tests for model rankings.
Interactive Dashboard: Explore results through a web-based dashboard with interactive visualizations.
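
The two scoring modes listed above can both be reduced to an edge weight for the endorsement graph. The sketch below is illustrative only: the function names and the 0.0 to 1.0 weight scale are assumptions, not SlopRank's actual internals.

```python
def numeric_to_weight(rating: int) -> float:
    """Map a 1-10 numeric rating onto a 0.0-1.0 endorsement weight (assumed scale)."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be in 1..10, got {rating}")
    return (rating - 1) / 9


def vote_to_weight(upvote: bool) -> float:
    """Map an upvote/downvote onto a binary endorsement weight."""
    return 1.0 if upvote else 0.0


# Hypothetical judgments from one evaluator about two peers.
numeric_weights = {"model_b": numeric_to_weight(8), "model_c": numeric_to_weight(3)}
binary_weights = {"model_b": vote_to_weight(True), "model_c": vote_to_weight(False)}
```

Numeric ratings keep more information per judgment, while binary votes are coarser but leave less room for evaluators to disagree on scale.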

How It Works

Prompt Collection: Define a set of questions or tasks to test the models.
Model Responses: Each model generates a response to the prompts.
Cross-Evaluation:
- Each model evaluates the quality of other models' responses.
- Evaluations are collected via predefined scoring methods.
Graph Construction: Build a directed graph where nodes are models, and edges represent endorsements.
Ranking: Apply the PageRank algorithm to rank models based on their relative endorsements.
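
The graph construction and ranking steps above can be sketched in a few lines. This is a minimal, dependency-free power-iteration PageRank, written in plain Python so it runs on its own; SlopRank itself lists networkx as its graph library, and the edge format, model names, and weights here are made up for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (evaluator, evaluated, endorsement weight)


def pagerank(edges: List[Edge], damping: float = 0.85, iterations: int = 100) -> Dict[str, float]:
    """Weighted PageRank by power iteration over a list of directed edges."""
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})
    n = len(nodes)

    # Total outgoing weight per node, used to normalise each node's endorsements.
    out_weight = {node: 0.0 for node in nodes}
    for src, _, w in edges:
        out_weight[src] += w

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing weight is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, dst, w in edges:
            if out_weight[src] > 0.0:
                new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank


# Hypothetical endorsements between three models.
edges = [
    ("a", "b", 0.9), ("a", "c", 0.3),
    ("b", "c", 0.8), ("b", "a", 0.4),
    ("c", "b", 0.7), ("c", "a", 0.2),
]
scores = pagerank(edges)
for model, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}\t{score:.6f}")
```

A model ranks highly when it is endorsed by models that are themselves highly ranked, which is why the scores in a ranking sum to 1 and are only meaningful relative to each other.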

Installation

Prerequisites

Python 3.8+
Simon Willison's llm library
networkx for graph computations
dotenv for environment variable management

Setup

SlopRank is on PyPI, so you can install it via:

pip install sloprank

From Source: If you prefer, clone this repo and install locally:

git clone https://github.com/strangeloopcanon/llmrank.git
cd llmrank
pip install .

API Keys Setup

Set up API keys using Simon Willison's llm tool:

llm keys set anthropic
llm keys set openai

Or create a .env file with:

OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
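
If you go the environment-variable route, a quick pre-flight check can catch a missing key before any API calls are made. The helper below is a hypothetical sketch using only the standard library; it is not part of SlopRank, and it only inspects the process environment (keys stored with llm keys set live in llm's own key store and will not show up here).

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_keys(required: Iterable[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example against a fake environment rather than the real one.
fake_env = {"OPENAI_API_KEY": "your_openai_key"}
absent = missing_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"], fake_env)
if absent:
    print("Missing API keys:", ", ".join(absent))
```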

Usage

After installing, you can run the entire SlopRank workflow via the sloprank command. By default, SlopRank uses the models defined in DEFAULT_CONFIG. You can override this by passing --models with a comma-separated list.

Basic Usage

sloprank --prompts prompts.xlsx --output-dir results

--prompts prompts.xlsx tells SlopRank where to find your list of prompts.
--output-dir results puts all CSV and JSON outputs in the results/ folder.

If you want to override the default models:

sloprank --prompts prompts.xlsx --output-dir results \
         --models "gpt-4o,claude-3-5-sonnet-latest,o1-preview"

Configuration

Models: Update the MODEL_NAMES list in the notebook to include the models you want to evaluate.
Prompts: Define your prompts in the raw_prompts list.
Evaluation Method: Choose between numeric ratings (EVALUATION_METHOD = 1) or upvotes/downvotes (EVALUATION_METHOD = 2).
Subset Evaluation: Toggle USE_SUBSET_EVALUATION to reduce evaluation costs.
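
To see why subset evaluation reduces cost: with n models, full cross-evaluation needs n × (n − 1) judgments per prompt, while giving each evaluator only k peers needs n × k. The sketch below shows one way such an assignment could be drawn; the function name, the random sampling strategy, and the seed parameter are assumptions, not SlopRank's actual behavior.

```python
import random
from typing import Dict, List


def assign_subsets(models: List[str], k: int, seed: int = 0) -> Dict[str, List[str]]:
    """Give each evaluator a random sample of up to k other models to review."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    assignments = {}
    for evaluator in models:
        peers = [m for m in models if m != evaluator]  # a model never judges itself
        assignments[evaluator] = rng.sample(peers, min(k, len(peers)))
    return assignments


models = ["gpt-4o", "claude-3-5-sonnet-latest", "o1-preview", "deepseek-chat"]
assignments = assign_subsets(models, k=2)
full_cost = len(models) * (len(models) - 1)          # judgments per prompt, full evaluation
subset_cost = sum(len(v) for v in assignments.values())  # judgments per prompt, subset
```

The trade-off is a sparser endorsement graph, so rankings from small subsets are likely to be noisier than those from full cross-evaluation.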

Advanced Features

Visualization, Confidence Intervals, and Categories

Run SlopRank with all advanced features:

sloprank run --prompts prompts.xlsx --output-dir results --visualize --confidence --categories
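
As a rough picture of what category-based analysis involves, the sketch below groups per-prompt scores by category and averages them per model. The record layout and function name are hypothetical; the source does not describe how SlopRank stores or aggregates category results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, float]  # (category, model, score)


def mean_score_by_category(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Average each model's scores within each prompt category."""
    buckets: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for category, model, score in records:
        buckets[category][model].append(score)
    return {
        category: {model: mean(scores) for model, scores in per_model.items()}
        for category, per_model in buckets.items()
    }


# Made-up scores for illustration.
records = [
    ("coding", "model_a", 8), ("coding", "model_a", 6), ("coding", "model_b", 5),
    ("reasoning", "model_a", 4), ("reasoning", "model_b", 9),
]
by_category = mean_score_by_category(records)
```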

Interactive Dashboard

Add the --dashboard flag to launch an interactive web dashboard:

sloprank run --prompts prompts.xlsx --output-dir results --dashboard

Launch the dashboard for existing results:

sloprank dashboard --output-dir results

Using Individual Tools

The examples/ directory contains standalone scripts for each advanced feature:

Graph Visualization:

python examples/generate_visualization.py

Confidence Intervals:
```
python examples/compute_confidence.py
```

Prompt Categorization:

python examples/prompt_categorization.py

Dashboard Generation:

python examples/generate_dashboard.py
python examples/dashboard.py

Using the Notebook

If you prefer using Jupyter Notebook:

Open llmrank.ipynb
Run the cells to execute the workflow
Inspect the results

Outputs

Ranked Models: A list of models ordered by their PageRank scores.
Graph Representation: A directed graph showing the flow of endorsements.
Processing Times: Benchmark of evaluation times for each model.
Interactive Visualizations: HTML-based interactive graphs with node and edge details.
Static Visualizations: PNG images of the endorsement graph.
Confidence Intervals: Statistical confidence bounds for model rankings.
Significance Tests: Statistical significance indicators between adjacent ranks.
Category Rankings: Model performance across different prompt categories.
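
For readers unfamiliar with confidence intervals on rankings, the sketch below shows a percentile bootstrap over one model's scores. The source does not say which method SlopRank uses, so treat this as one common approach under assumed inputs, not a description of the tool's output.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(scores: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up ratings received by one model across eight prompts.
ratings = [7, 8, 6, 9, 7, 8, 5, 9]
low, high = bootstrap_ci(ratings)
print(f"mean={mean(ratings):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

When the intervals of two adjacent models overlap heavily, the difference in their ranks should not be read as meaningful.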

Example Dashboard

The dashboard provides:

Overall model rankings with confidence intervals
Category-specific performance analysis
Interactive graph visualizations
Model comparison tools

Applications

Benchmarking: Evaluate and rank new or existing LLMs.
Specialization Analysis: Test domain-specific capabilities (e.g., legal, medical).
Model Optimization: Identify strengths and weaknesses for targeted fine-tuning.
Public Leaderboards: Maintain transparency and foster healthy competition among models.

Ideas for Contributions

Suggested Improvements

Improve visualization options and customization.
Add more statistical analysis methods.
Develop a public leaderboard to showcase rankings.
Enhance the web dashboard with more interactive features.
Add support for multi-language evaluation by introducing localized prompts.
Implement cost estimation and optimization features.

Contributions are welcome! If you have ideas for improving the framework, feel free to open an issue or submit a pull request.

Acknowledgments

Special thanks to:

Simon Willison for the llm library.
The AI community.

Project details

Release history

0.3.17

Sep 10, 2025

0.3.16

Sep 10, 2025

0.3.15

Sep 10, 2025

0.3.14

Sep 10, 2025

0.3.13

Sep 10, 2025

0.3.11

Sep 9, 2025

0.3.10

Apr 8, 2025

0.3.9

Apr 8, 2025

0.3.8

Apr 8, 2025

0.3.7

Apr 8, 2025

0.3.6

Apr 8, 2025

0.3.5

Apr 8, 2025

0.3.4

Apr 8, 2025

0.3.3

Apr 8, 2025

0.3.2

Apr 7, 2025

0.3.0

Apr 7, 2025

0.2.6

Apr 7, 2025

0.2.5

Apr 7, 2025

0.2.4

Apr 7, 2025

This version

0.2.3

Feb 28, 2025

0.2.2

Feb 28, 2025

0.2.0

Feb 28, 2025

0.1.2

Feb 6, 2025

0.1.1

Jan 31, 2025

0.1.0

Jan 28, 2025

Download files

Source Distribution

sloprank-0.2.3.tar.gz (39.3 kB)

Uploaded Feb 28, 2025 Source

Built Distribution

sloprank-0.2.3-py3-none-any.whl (36.0 kB)

Uploaded Feb 28, 2025 Python 3

File details

Details for the file sloprank-0.2.3.tar.gz.

File metadata

Download URL: sloprank-0.2.3.tar.gz
Upload date: Feb 28, 2025
Size: 39.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.6

File hashes

Hashes for sloprank-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`533b64d9d5a1728db6634be13346840cb6ea71be430de576b4a62a5c6ed1cd15`
MD5	`27a9e68b7c91184c939c6f816ebe8ce4`
BLAKE2b-256	`12f33da00fdd995f1d0325f0bbf0bdb448c598d1a713dfc20d038c5e5a4511cd`

File details

Details for the file sloprank-0.2.3-py3-none-any.whl.

File metadata

Download URL: sloprank-0.2.3-py3-none-any.whl
Upload date: Feb 28, 2025
Size: 36.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.6

File hashes

Hashes for sloprank-0.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`826113c11fc7656a0c396b6c6156949ffefefef08a490ff418751447b98bed56`
MD5	`b7b28bae3cf809b0be022ab5a71ae576`
BLAKE2b-256	`ae6e436a7d78b6bf286842317b1786dbede45ca81b367051ba757d9036ea4531`
