Peer-based cross-evaluation system

SlopRank

SlopRank is an eval framework for ranking LLMs using peer-based cross-evaluation and PageRank. It aims for less biased, dynamic, and scalable benchmarking of multiple models: because the models score one another, no single judge decides the outcome, and rankings can be recomputed as models or prompts are added.

You can run it on one large set of heterogeneous prompts to get an overall ranking, or on smaller sets to get rankings for your particular use case.

Definitive ranking:

=== PageRank Rankings ===
   model                               pagerank_score
0  o1-preview                          0.179404
1  gpt-4o                              0.178305
2  deepseek-chat                       0.167105
3  gemini-2.0-flash-thinking-exp-1219  0.164732
4  claude-3-5-sonnet-latest            0.155571
5  gemini-exp-1206                     0.154884

Features

  • Peer-Based Evaluation: Models evaluate each other's responses, mimicking a collaborative and competitive environment.
  • Customizable Scoring:
    • Numeric Ratings (1–10) for granular evaluation.
    • Upvote/Downvote for simple binary scoring.
  • Subset Evaluation: Reduce API costs by limiting the models each evaluator reviews.
  • Graph-Based Ranking: Endorsements are represented in a graph, and PageRank is used to compute relative rankings.
  • Scalable Benchmarking: Add more models or prompts with ease, maintaining flexibility and efficiency.
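
As a rough illustration of subset evaluation, the sketch below gives each evaluator a random sample of its peers instead of all of them. The helper name `assign_review_targets` and the `subset_size` parameter are hypothetical and are not taken from the SlopRank code; the framework's actual selection strategy may differ.

```python
import random

def assign_review_targets(models, subset_size, seed=None):
    """Map each evaluator to the peers whose responses it will review.

    Hypothetical helper: each model reviews at most `subset_size`
    other models and never reviews itself.
    """
    rng = random.Random(seed)
    assignments = {}
    for evaluator in models:
        peers = [m for m in models if m != evaluator]
        k = min(subset_size, len(peers))
        assignments[evaluator] = rng.sample(peers, k)
    return assignments

models = ["gpt-4o", "o1-preview", "deepseek-chat", "gemini-exp-1206"]
assignments = assign_review_targets(models, subset_size=2, seed=0)
for evaluator, targets in assignments.items():
    print(evaluator, "reviews", targets)
```

With N models, full cross-evaluation needs N × (N − 1) reviews per prompt, while a subset of size k needs N × k. For the four models above that is 12 reviews versus 8.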

How It Works

  1. Prompt Collection: Define a set of questions or tasks to test the models.
  2. Model Responses: Each model generates a response to the prompts.
  3. Cross-Evaluation:
    • Each model evaluates the quality of other models' responses.
    • Evaluations are collected via predefined scoring methods.
  4. Graph Construction: Build a directed graph where nodes are models, and edges represent endorsements.
  5. Ranking: Apply the PageRank algorithm to rank models based on their relative endorsements.
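
Steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the project's implementation: the ratings are made-up numbers for three placeholder models, edges are assumed to point from the evaluator to the model it rated, and PageRank is computed with a plain power iteration (damping 0.85, the conventional default) so that the example has no dependencies. The framework itself lists networkx for graph computations.

```python
def build_endorsement_graph(ratings):
    """Sum ratings into weighted edges: graph[evaluator][target] = total score."""
    graph = {}
    for evaluator, target, score in ratings:
        graph.setdefault(evaluator, {})
        graph.setdefault(target, {})
        graph[evaluator][target] = graph[evaluator].get(target, 0.0) + score
    return graph

def pagerank(graph, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration over a dict-of-dicts graph."""
    nodes = list(graph)
    n = len(nodes)
    out_total = {v: sum(graph[v].values()) for v in nodes}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing weight is spread evenly.
        dangling = sum(rank[v] for v in nodes if out_total[v] == 0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source in nodes:
            if out_total[source] == 0:
                continue
            for target, weight in graph[source].items():
                # Each evaluator passes on its rank in proportion to its scores.
                new_rank[target] += damping * rank[source] * weight / out_total[source]
        rank = new_rank
    return rank

# Made-up numeric ratings (1-10): (evaluator, evaluated model, score).
ratings = [
    ("model-a", "model-b", 9), ("model-a", "model-c", 4),
    ("model-b", "model-a", 6), ("model-b", "model-c", 5),
    ("model-c", "model-a", 7), ("model-c", "model-b", 8),
]
scores = pagerank(build_endorsement_graph(ratings))
order = sorted(scores, key=scores.get, reverse=True)
print(order)
```

With networkx installed, building a `DiGraph` with the same weighted edges and calling `networkx.pagerank(G, weight="weight")` should give comparable scores.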

Installation

Prerequisites

  • Python 3.8+
  • Simon Willison's llm library
  • networkx for graph computations
  • python-dotenv for environment variable management

Setup

  1. Clone the repository
  2. Install dependencies
  3. Set up API keys for your LLMs by creating a .env file:
    OPENAI_API_KEY=your_openai_key
    ANTHROPIC_API_KEY=your_anthropic_key
    
    Alternatively, set them with llm keys set [MODEL], which is what the author does.
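
Before a run, it can help to confirm that the keys from the .env file are actually visible to Python. The sketch below is one way to do that; `missing_keys` and `REQUIRED_KEYS` are hypothetical names, and the python-dotenv import is optional so the check still works when the variables are exported in the shell instead.

```python
import os

try:
    # python-dotenv: load_dotenv() reads KEY=value pairs from a .env file if one is found.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # fall back to variables already present in the environment

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing API keys:", ", ".join(missing))
```

Keys stored with llm keys set are managed by the llm tool rather than exposed as environment variables, so this check only covers the .env route.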

Usage

Configuration

  • Models: Update the MODEL_NAMES list in the notebook to include the models you want to evaluate.
  • Prompts: Define your prompts in the raw_prompts list.
  • Evaluation Method: Choose between numeric ratings (EVALUATION_METHOD = 1) or upvotes/downvotes (EVALUATION_METHOD = 2).
  • Subset Evaluation: Toggle USE_SUBSET_EVALUATION to reduce evaluation costs.
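
A configuration cell might look roughly like the following. The four setting names come from the list above; the prompt texts, the chosen values, and the `validate_config` helper are illustrative assumptions rather than part of the notebook.

```python
# Illustrative values; only the setting names are taken from the README.
MODEL_NAMES = ["gpt-4o", "o1-preview", "deepseek-chat", "claude-3-5-sonnet-latest"]

raw_prompts = [
    "Explain the difference between a process and a thread.",
    "Summarize the main trade-offs between SQL and NoSQL databases.",
]

EVALUATION_METHOD = 1         # 1 = numeric ratings (1-10), 2 = upvote/downvote
USE_SUBSET_EVALUATION = True  # each evaluator reviews only some of its peers

def validate_config(model_names, prompts, evaluation_method):
    """Hypothetical check: return a list of problems, empty if none were found."""
    problems = []
    if len(set(model_names)) < 2:
        problems.append("need at least two distinct models for peer evaluation")
    if not prompts:
        problems.append("raw_prompts is empty")
    if evaluation_method not in (1, 2):
        problems.append("EVALUATION_METHOD must be 1 or 2")
    return problems

problems = validate_config(MODEL_NAMES, raw_prompts, EVALUATION_METHOD)
print(problems or "configuration looks OK")
```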

Running the Framework

  1. Open and run the notebook.
  2. Inspect the results:
    • Ranked models based on PageRank.
    • Visualization of the endorsement graph (optional).

Outputs

  • Ranked Models: A list of models ordered by their PageRank scores.
  • Graph Representation: A directed graph showing the flow of endorsements.
  • Processing Times: Benchmark of evaluation times for each model.
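
To show the shape of these outputs, the sketch below sorts a score dictionary into a ranked list and measures how long a call takes. The scores are copied from the ranking near the top of this page; `rank_models` and `timed` are hypothetical helpers, and in the real framework the timed call would be a model evaluation rather than a sort.

```python
import time

def rank_models(scores):
    """Sort (model, score) pairs from highest to lowest PageRank score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Scores from the published ranking, deliberately listed out of order.
scores = {
    "gpt-4o": 0.178305,
    "gemini-exp-1206": 0.154884,
    "o1-preview": 0.179404,
    "claude-3-5-sonnet-latest": 0.155571,
    "deepseek-chat": 0.167105,
    "gemini-2.0-flash-thinking-exp-1219": 0.164732,
}

ranked, elapsed = timed(rank_models, scores)
for position, (model, score) in enumerate(ranked):
    print(f"{position}  {model:<36}{score:.6f}")
print(f"ranking took {elapsed:.6f}s")
```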

Applications

  • Benchmarking: Evaluate and rank new or existing LLMs.
  • Specialization Analysis: Test domain-specific capabilities (e.g., legal, medical).
  • Model Optimization: Identify strengths and weaknesses for targeted fine-tuning.
  • Public Leaderboards: Maintain transparency and foster healthy competition among models.

Ideas for Contributions

Suggested Improvements

  1. Add graph visualization to show endorsement flow between models.
  2. Create a heatmap of model scores across different prompts.
  3. Develop a public leaderboard to showcase rankings.
  4. Build a simple GUI for easier usage of the framework.
  5. Add support for multi-language evaluation by introducing localized prompts.

Contributions are welcome! If you have ideas for improving the framework, feel free to open an issue or submit a pull request.


Acknowledgments

Special thanks to:

  • SimonW for the llm library.
  • The AI community
