Peer-based cross-evaluation system
LLMRank
"SlopRank" is an eval framework for ranking LLMs using peer-based cross-evaluation and PageRank. It enables unbiased, dynamic, and scalable benchmarking of multiple models, fostering transparency and innovation in the development of AI systems.
You can run it on one large set of heterogeneous prompts to get an overall ranking, or on smaller sets to get rankings for your particular use case.
Definitive ranking:
=== PageRank Rankings ===
model pagerank_score
0 o1-preview 0.179404
1 gpt-4o 0.178305
2 deepseek-chat 0.167105
3 gemini-2.0-flash-thinking-exp-1219 0.164732
4 claude-3-5-sonnet-latest 0.155571
5 gemini-exp-1206 0.154884
Features
- Peer-Based Evaluation: Models evaluate each other's responses, mimicking a collaborative and competitive environment.
- Customizable Scoring:
  - Numeric Ratings (1–10) for granular evaluation.
  - Upvote/Downvote for simple binary scoring.
- Subset Evaluation: Reduce API costs by limiting the models each evaluator reviews.
- Graph-Based Ranking: Endorsements are represented in a graph, and PageRank is used to compute relative rankings.
- Scalable Benchmarking: Add more models or prompts with ease, maintaining flexibility and efficiency.
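As a rough illustration of the two scoring modes and of subset evaluation, here is a minimal sketch. The helper names `rating_to_weight` and `pick_targets`, the `subset_size` parameter, and the scaling of scores to the range 0 to 1 are assumptions made for this example; they are not taken from SlopRank's code.

```python
import random


def rating_to_weight(raw, method):
    """Map a raw evaluation to an endorsement weight between 0 and 1.

    method 1: numeric rating on a 1-10 scale.
    method 2: upvote/downvote, passed as True/False.
    The 0-to-1 scaling is an assumption of this sketch.
    """
    if method == 1:
        if not 1 <= raw <= 10:
            raise ValueError("numeric ratings must be between 1 and 10")
        return (raw - 1) / 9
    if method == 2:
        return 1.0 if raw else 0.0
    raise ValueError(f"unknown evaluation method: {method}")


def pick_targets(evaluator, models, subset_size, rng):
    """Choose which other models' responses this evaluator reviews."""
    others = [m for m in models if m != evaluator]
    if subset_size is None or subset_size >= len(others):
        return others  # no subsetting: review every peer
    return rng.sample(others, subset_size)


models = ["gpt-4o", "claude-3-5-sonnet-latest", "o1-preview", "deepseek-chat"]
rng = random.Random(0)  # fixed seed so the assignment is repeatable
assignments = {m: pick_targets(m, models, 2, rng) for m in models}
```

With four models and a subset size of 2, each evaluator reviews two peers instead of three, which cuts the number of evaluation calls by a third.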
How It Works
- Prompt Collection: Define a set of questions or tasks to test the models.
- Model Responses: Each model generates a response to the prompts.
- Cross-Evaluation:
  - Each model evaluates the quality of other models' responses.
  - Evaluations are collected via predefined scoring methods.
- Graph Construction: Build a directed graph where nodes are models, and edges represent endorsements.
- Ranking: Apply the PageRank algorithm to rank models based on their relative endorsements.
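The graph construction and ranking steps can be sketched as follows. This is a plain-Python weighted PageRank by power iteration, written to be dependency-free; SlopRank itself lists networkx for graph computations, so treat this as an illustration of the idea rather than the project's implementation. The model names, the evaluation tuples, and the damping factor of 0.85 are assumptions for the example.

```python
from collections import defaultdict


def build_graph(evaluations):
    """Sum endorsement weights into graph[evaluator][target]."""
    graph = defaultdict(lambda: defaultdict(float))
    for evaluator, target, weight in evaluations:
        if evaluator != target and weight > 0:
            graph[evaluator][target] += weight
    return graph


def pagerank(graph, nodes, damping=0.85, iterations=100):
    """Weighted PageRank computed by power iteration."""
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for src in nodes:
            out = graph.get(src, {})
            total = sum(out.values())
            if total == 0:
                # Dangling node: spread its rank evenly over all nodes.
                for dst in nodes:
                    new_rank[dst] += damping * rank[src] / n
            else:
                # Pass rank along each edge in proportion to its weight.
                for dst, weight in out.items():
                    new_rank[dst] += damping * rank[src] * weight / total
        rank = new_rank
    return rank


# Hypothetical evaluations: (evaluator, evaluated model, endorsement weight).
evaluations = [
    ("model_a", "model_b", 0.9),
    ("model_a", "model_c", 0.4),
    ("model_b", "model_a", 0.7),
    ("model_b", "model_c", 0.5),
    ("model_c", "model_a", 0.6),
    ("model_c", "model_b", 0.8),
]
nodes = ["model_a", "model_b", "model_c"]
scores = pagerank(build_graph(evaluations), nodes)
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Each edge points from the evaluator to the model it endorses, so a model's score grows when highly ranked peers rate it well. The scores sum to 1, which matches the shape of the ranking table shown earlier.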
Installation
Prerequisites
- Python 3.8+
- SimonW's llm library
- networkx for graph computations
- dotenv for environment variable management
Setup
SlopRank is on PyPI, so you can install it via:
pip install sloprank
From Source: If you prefer, clone this repo and install locally:
git clone https://github.com/strangeloopcanon/llmrank.git
cd llmrank
pip install .
Usage: After installation, you can run the CLI:
sloprank --help
Or, if you just want to use the Jupyter notebook:
- Clone the repository
- Install dependencies
- Set up API keys for your LLMs by creating a .env file:
  OPENAI_API_KEY=your_openai_key
  ANTHROPIC_API_KEY=your_anthropic_key
I set them using llm keys set [MODEL]
Usage
After installing, you can run the entire SlopRank workflow via the sloprank command. By default, SlopRank uses the models defined in DEFAULT_CONFIG. You can override this by passing --models with a comma-separated list.
For example, in the same directory as prompts.xlsx, run:
sloprank --prompts prompts.xlsx --output-dir results
--prompts prompts.xlsx tells SlopRank where to find your list of prompts. --output-dir results puts all CSV and JSON outputs in the results/ folder. If you want to override the default models:
sloprank --prompts prompts.xlsx --output-dir results \
--models "gpt-4o,claude-3-5-sonnet-latest,o1-preview"
Configuration
- Models: Update the MODEL_NAMES list in the notebook to include the models you want to evaluate.
- Prompts: Define your prompts in the raw_prompts list.
- Evaluation Method: Choose between numeric ratings (EVALUATION_METHOD = 1) or upvotes/downvotes (EVALUATION_METHOD = 2).
- Subset Evaluation: Toggle USE_SUBSET_EVALUATION to reduce evaluation costs.
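For orientation, a notebook configuration cell using these four settings might look like the sketch below. The variable names come from the list above; the model names are the ones used in the CLI example, and the two prompts are placeholders invented for illustration.

```python
# Models to evaluate (same names as in the CLI example).
MODEL_NAMES = ["gpt-4o", "claude-3-5-sonnet-latest", "o1-preview"]

# Prompts every model answers; these two are placeholders.
raw_prompts = [
    "Explain the difference between a process and a thread.",
    "Summarize the main causes of the French Revolution in three sentences.",
]

# 1 = numeric ratings (1-10), 2 = upvote/downvote.
EVALUATION_METHOD = 1

# True = each evaluator reviews only a subset of peers, reducing API calls.
USE_SUBSET_EVALUATION = True
```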
Running the Framework
- Open and run the notebook.
- Inspect the results:
  - Ranked models based on PageRank.
  - Visualization of the endorsement graph (optional).
Outputs
- Ranked Models: A list of models ordered by their PageRank scores.
- Graph Representation: A directed graph showing the flow of endorsements.
- Processing Times: Benchmark of evaluation times for each model.
Applications
- Benchmarking: Evaluate and rank new or existing LLMs.
- Specialization Analysis: Test domain-specific capabilities (e.g., legal, medical).
- Model Optimization: Identify strengths and weaknesses for targeted fine-tuning.
- Public Leaderboards: Maintain transparency and foster healthy competition among models.
Ideas for Contributions
Suggested Improvements
- Add graph visualization to show endorsement flow between models.
- Create a heatmap of model scores across different prompts.
- Develop a public leaderboard to showcase rankings.
- Build a simple GUI for easier usage of the framework.
- Add support for multi-language evaluation by introducing localized prompts.
Contributions are welcome! If you have ideas for improving the framework, feel free to open an issue or submit a pull request.
Acknowledgments
Special thanks to:
- SimonW for the llm library.
- The AI community
Download files
File details
Details for the file sloprank-0.1.1.tar.gz.
File metadata
- Download URL: sloprank-0.1.1.tar.gz
- Upload date:
- Size: 19.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 658877f21e2baf9875a74b046d46c6035cf4541f66625620c9da95532938de79 |
| MD5 | 3bdf46b08452382e4f8e24c4e4cb411c |
| BLAKE2b-256 | cca99cad2c01cfc0b5bea7c69a6cde69dd1fa9e1b088ea30a1487b40091ec551 |
File details
Details for the file sloprank-0.1.1-py3-none-any.whl.
File metadata
- Download URL: sloprank-0.1.1-py3-none-any.whl
- Upload date:
- Size: 13.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d0fd981d56ee94707478c01b74512cef8962b6e8f7b38a93c41f7e3e1ac0ce61 |
| MD5 | d98ebfd769a2d8f72f580b788e4180f7 |
| BLAKE2b-256 | 957461518f642f11a6d56d02f766b0e121b887bbda5714e412c7ae587c5a0647 |