Peer-based LLM cross-evaluation system

SlopRank

SlopRank is a high-performance evaluation framework for ranking LLMs using peer-based cross-evaluation and PageRank. Built with Bodo for parallel processing, it enables unbiased, dynamic, and scalable benchmarking of multiple models, fostering transparency and innovation in the development of AI systems.

You can use it with a large set of heterogeneous prompts to get overall rankings, or with smaller targeted sets to evaluate models for your specific use case.

🚀 Performance: Powered by Bodo for parallel DataFrame operations and JIT compilation
📊 Scalable: Efficiently handles large datasets with optimized memory usage
🔗 Compatible: Direct integration with Simon Willison's llm library

Interactive Dashboard

➡️ View Interactive Dashboard

Example Ranking (OpenRouter run):

=== PageRank Rankings ===
   model                                   pagerank_score
0  openrouter/openai/gpt-5                 0.168470
1  openrouter/qwen/qwen3-max               0.155266
2  openrouter/google/gemini-2.5-pro        0.145787
3  openrouter/anthropic/claude-opus-4.1    0.135553
4  openrouter/x-ai/grok-4                  0.135202
5  openrouter/anthropic/claude-sonnet-4    0.133854
6  openrouter/nousresearch/hermes-4-405b   0.125868

Models in this run: GPT-5, Claude Opus 4.1, Claude Sonnet 4, Grok 4, Qwen3 Max, Gemini 2.5 Pro, and Hermes 4 405B. Results were computed using peer cross-evaluation and PageRank over 37 prompts.
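As a quick sanity check on the table above: PageRank scores form a probability distribution over the models, so they should sum to 1, and a uniform split across 7 models would give each about 0.143. A minimal sketch using the scores copied from this run (the variable names are illustrative only):

```python
# PageRank scores copied from the example OpenRouter run above.
scores = {
    "openrouter/openai/gpt-5": 0.168470,
    "openrouter/qwen/qwen3-max": 0.155266,
    "openrouter/google/gemini-2.5-pro": 0.145787,
    "openrouter/anthropic/claude-opus-4.1": 0.135553,
    "openrouter/x-ai/grok-4": 0.135202,
    "openrouter/anthropic/claude-sonnet-4": 0.133854,
    "openrouter/nousresearch/hermes-4-405b": 0.125868,
}

total = sum(scores.values())      # should be 1 up to rounding
uniform = 1 / len(scores)         # share each model would get if all tied

print(f"sum of scores: {total:.6f}")
print(f"uniform share: {uniform:.6f}")
for model, score in scores.items():
    # Positive means the model scored above a uniform share.
    print(f"{model}: {score - uniform:+.6f}")
```

Reading the scores relative to the uniform share makes it easier to see how narrow the spread is in this run.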

It supports any model that can be run with the llm library.

Features

🚀 High-Performance Processing

Bodo Integration: Parallel DataFrame operations with JIT compilation for maximum performance
Memory Efficient: Optimized memory usage for large-scale evaluations
Scalable: Handles thousands of prompts and dozens of models efficiently

🤖 Advanced Evaluation

Peer-Based Evaluation: Models evaluate each other's responses, mimicking a collaborative and competitive environment
Customizable Scoring: Numeric ratings (1–10) for granular evaluation or upvote/downvote for binary scoring
Subset Evaluation: Reduce API costs by limiting the models each evaluator reviews
Graph-Based Ranking: Endorsements are represented in a graph, and PageRank is used to compute relative rankings
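To illustrate subset evaluation, here is a minimal sketch of how each evaluator could be assigned a limited set of peers to review. How SlopRank actually picks the subset is not described here; the random sampling, the function name `assign_review_subsets`, and the model names are assumptions made for illustration.

```python
import random


def assign_review_subsets(models, subset_size, seed=0):
    """Give each evaluator a random subset of the other models to review.

    Hypothetical helper: random sampling is an assumption, not SlopRank's
    documented behavior.
    """
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    assignments = {}
    for evaluator in models:
        # A model never reviews its own response.
        peers = [model for model in models if model != evaluator]
        assignments[evaluator] = rng.sample(peers, min(subset_size, len(peers)))
    return assignments


models = ["model_a", "model_b", "model_c", "model_d"]  # placeholder names
assignments = assign_review_subsets(models, subset_size=2)
for evaluator, peers in assignments.items():
    print(evaluator, "reviews", peers)
```

With 4 models and a subset size of 2, each prompt needs 8 evaluation calls instead of the 12 required for full cross-evaluation.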

📊 Rich Analytics

Statistical Confidence: Calculate confidence intervals and significance tests for model rankings
Category-Based Analysis: Evaluate model performance across different prompt categories (reasoning, coding, etc.)
Graph Visualization: Interactive and static graph visualizations of model endorsements
Interactive Dashboard: Explore results through a web-based dashboard with interactive visualizations
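The statistical method behind the confidence intervals is not specified here. As one common approach, the sketch below computes a percentile bootstrap interval for the mean rating a model receives. The function name `bootstrap_ci` and the sample ratings are hypothetical, and this is not necessarily what SlopRank implements.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


# Hypothetical 1-10 ratings received by one model across ten prompts.
ratings = [7, 8, 6, 9, 7, 8, 5, 9, 8, 7]
lower, upper = bootstrap_ci(ratings)
print(f"mean={statistics.fmean(ratings):.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Two models whose intervals overlap heavily should not be treated as clearly separated in the ranking.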

🔗 Flexible Integration

LLM Library: Direct integration with Simon Willison's llm library for broad model support
Provider Agnostic: Works with OpenAI, Anthropic, OpenRouter, and local models
Easy Configuration: Simple CSV-based prompt input and JSON output

How It Works

Prompt Collection: Define a set of questions or tasks to test the models.
Model Responses: Each model generates a response to the prompts.
Cross-Evaluation:
- Each model evaluates the quality of other models' responses.
- Evaluations are collected via predefined scoring methods.
Graph Construction: Build a directed graph where nodes are models, and edges represent endorsements.
Ranking: Apply the PageRank algorithm to rank models based on their relative endorsements.
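The graph construction and ranking steps can be sketched in a few lines. The example below is a plain-Python power-iteration PageRank over a weighted endorsement graph, written without dependencies for illustration. SlopRank itself lists networkx for graph computations, so this is not its actual code, and the ratings, model names, and damping factor of 0.85 are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    `edges` maps each evaluator to {evaluated_model: endorsement_weight}.
    """
    nodes = sorted(set(edges) | {t for targets in edges.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for source in nodes:
            targets = edges.get(source, {})
            total = sum(targets.values())
            if total == 0:
                # A node that endorses nobody spreads its rank evenly.
                for node in nodes:
                    new_rank[node] += damping * rank[source] / n
            else:
                # Otherwise rank flows along edges in proportion to weight.
                for target, weight in targets.items():
                    new_rank[target] += damping * rank[source] * weight / total
        rank = new_rank
    return rank


# Hypothetical 1-10 ratings: evaluator -> {rated model: score}.
ratings = {
    "model_a": {"model_b": 9, "model_c": 4},
    "model_b": {"model_a": 6, "model_c": 5},
    "model_c": {"model_a": 7, "model_b": 8},
}

scores = pagerank(ratings)
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for model, score in ranking:
    print(f"{model}: {score:.4f}")
```

A model ranks highly when it is rated well by models that are themselves rated well, which is what separates this from simply averaging the scores.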

Installation

Prerequisites

Python 3.9+ (required for Bodo compatibility)
Bodo for high-performance parallel processing (included by default)
Simon Willison's llm library for model access
networkx for graph computations
python-dotenv for environment variable management

Optional Compatibility Mode

pandas for compatibility mode (if you specifically need regular pandas)

Setup

Standard Installation (includes Bodo, for a reported 3-5x speedup on DataFrame operations):

pip install sloprank

Compatibility Installation (regular pandas only):

pip install "sloprank[pandas]"

From Source:

git clone https://github.com/strangeloopcanon/llmrank.git
cd llmrank                  # git clone creates a directory named after the repository
pip install .               # Standard installation (includes Bodo)
pip install ".[pandas]"     # Or: compatibility mode (regular pandas)

API Keys Setup

SlopRank uses the llm library for model access. Set up API keys using Simon Willison's llm tool:

# Install llm library (included as dependency)
pip install llm

# Set up API keys for various providers
llm keys set anthropic 
llm keys set openai
llm keys set openrouter  # For OpenRouter models

Or create a .env file with:

OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
OPENROUTER_API_KEY=your_openrouter_key
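If you rely on environment variables (for example, loaded from the .env file above), a small pre-flight check can catch a missing key before a long run starts. This is a sketch only: the helper name `missing_api_keys` is hypothetical, and it does not see keys stored with `llm keys set`, which the llm tool keeps in its own key store rather than in the environment.

```python
import os

# Variable names taken from the .env example above.
EXPECTED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "OPENROUTER_API_KEY"]


def missing_api_keys(environ=None, expected=EXPECTED_KEYS):
    """Return the expected key names that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in expected if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"OPENAI_API_KEY": "your_openai_key", "ANTHROPIC_API_KEY": ""}
print("missing:", missing_api_keys(example_env))
```

You only need the keys for the providers whose models you actually include in a run.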

Supported Models: Any model supported by the llm library, including:

OpenAI (GPT-4, GPT-3.5, etc.)
Anthropic (Claude models)
OpenRouter (access to many models)
Local models via llm plugins

Backend Configuration

SlopRank automatically detects and uses the best available pandas backend.

Check Current Backend:

sloprank backend

Force Specific Backend:

# Force Bodo for maximum performance
export SLOPRANK_USE_BODO=true
sloprank run --prompts prompts.csv

# Force regular pandas for compatibility
export SLOPRANK_USE_BODO=false
sloprank run --prompts prompts.csv

# Alternative syntax
SLOPRANK_PANDAS_BACKEND=bodo sloprank run --prompts prompts.csv
SLOPRANK_PANDAS_BACKEND=pandas sloprank run --prompts prompts.csv

Auto-Detection Behavior:

Default: Uses Bodo automatically (included in standard installation, 3-5x performance boost)
Fallback: Uses regular pandas if Bodo unavailable (compatibility mode)
Override: Manual environment variables always take precedence
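The selection rules above can be sketched as a small function. This is an illustration of the described behavior, not SlopRank's implementation: the function name `choose_backend` is hypothetical, and the precedence of SLOPRANK_PANDAS_BACKEND over SLOPRANK_USE_BODO when both are set is an assumption, since that case is not described above.

```python
def choose_backend(environ, bodo_available):
    """Pick "bodo" or "pandas" from environment overrides and availability."""
    # Assumption: the explicit backend name wins if both variables are set.
    explicit = environ.get("SLOPRANK_PANDAS_BACKEND", "").strip().lower()
    if explicit in ("bodo", "pandas"):
        return explicit

    use_bodo = environ.get("SLOPRANK_USE_BODO", "").strip().lower()
    if use_bodo == "true":
        return "bodo"
    if use_bodo == "false":
        return "pandas"

    # No override: default to Bodo, fall back to pandas if it is missing.
    return "bodo" if bodo_available else "pandas"


print(choose_backend({}, bodo_available=True))
print(choose_backend({"SLOPRANK_USE_BODO": "false"}, bodo_available=True))
print(choose_backend({}, bodo_available=False))
```

Run `sloprank backend` to see which backend the real tool selected in your environment.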

Usage

After installing, you can run the entire SlopRank workflow via the sloprank command. By default, SlopRank uses the models defined in DEFAULT_CONFIG. You can override this by passing --models with a comma-separated list.

Basic Usage

sloprank run --prompts prompts.csv --output-dir results

--prompts prompts.csv tells SlopRank where to find your list of prompts.
--output-dir results puts all CSV and JSON outputs in the results/ folder.
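A prompts file can be generated programmatically. The sketch below writes and reads a one-column CSV with Python's csv module. The column header `prompt` and both helper names are hypothetical, since the header SlopRank expects is not documented here, so check the project's sample prompts file before relying on it.

```python
import csv
import os
import tempfile


def write_prompts_csv(path, prompts, column="prompt"):
    """Write one prompt per row under a single header column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        for prompt in prompts:
            writer.writerow([prompt])  # csv handles commas and quotes


def read_prompts_csv(path, column="prompt"):
    """Read the prompts back as a list of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[column] for row in csv.DictReader(handle)]


prompts = [
    "Explain PageRank in two sentences.",
    'Write a function that reverses a string, e.g. "abc" -> "cba".',
]
path = os.path.join(tempfile.mkdtemp(), "prompts.csv")
write_prompts_csv(path, prompts)
print(read_prompts_csv(path))
```

Using the csv module rather than joining strings by hand keeps prompts that contain commas, quotes, or newlines intact.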

If you want to override the default models:

sloprank run --prompts prompts.csv --output-dir results --models "chatgpt-4o,o1,claude-3-7-sonnet-latest,deepseek-reasoner,gemini-2.0-pro-exp-02-05" --visualize --confidence

Configuration

Models: Update the MODEL_NAMES list to include the models you want to evaluate.
Prompts: Define your prompts in the raw_prompts list.
Evaluation Method: Choose between numeric ratings (EVALUATION_METHOD = 1) or upvotes/downvotes (EVALUATION_METHOD = 2).
Subset Evaluation: Toggle USE_SUBSET_EVALUATION to reduce evaluation costs.
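To make the two evaluation methods concrete, the sketch below maps a single evaluation to an edge weight for the endorsement graph. How SlopRank converts ratings and votes into weights is not described here, so the mapping (rating used directly, upvote = 1, downvote = 0) and the function name `endorsement_weight` are assumptions.

```python
def endorsement_weight(method, value):
    """Convert one evaluation into a graph edge weight.

    method 1: numeric rating from 1 to 10 (used directly as the weight).
    method 2: "upvote" or "downvote" (1.0 or 0.0).
    Hypothetical mapping for illustration only.
    """
    if method == 1:
        rating = float(value)
        if not 1 <= rating <= 10:
            raise ValueError(f"rating out of range: {value!r}")
        return rating
    if method == 2:
        vote = str(value).strip().lower()
        if vote == "upvote":
            return 1.0
        if vote == "downvote":
            return 0.0
        raise ValueError(f"unknown vote: {value!r}")
    raise ValueError(f"unknown evaluation method: {method!r}")


print(endorsement_weight(1, 8))
print(endorsement_weight(2, "Upvote"))
print(endorsement_weight(2, "downvote"))
```

Numeric ratings preserve how strongly one model endorses another, while votes reduce each evaluation to a yes or no.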

Advanced Features

Visualization, Confidence Intervals, and Categories

Run SlopRank with all advanced features:

sloprank run --prompts prompts.csv --output-dir results --visualize --confidence --categories

Interactive Dashboard

Add the --dashboard flag to launch an interactive web dashboard:

sloprank run --prompts prompts.csv --output-dir results --dashboard

Launch the dashboard for existing results:

sloprank dashboard --output-dir results

Using Individual Tools

The examples/ directory contains standalone scripts for each advanced feature:

Graph Visualization:

python examples/generate_visualization.py

Confidence Intervals:
```
python examples/compute_confidence.py
```

Prompt Categorization:

python examples/prompt_categorization.py

Dashboard Generation:

python examples/generate_dashboard.py
python examples/dashboard.py

Outputs

Ranked Models: A list of models ordered by their PageRank scores.
Graph Representation: A directed graph showing the flow of endorsements.
Processing Times: Benchmark of evaluation times for each model.
Interactive Visualizations: HTML-based interactive graphs with node and edge details.
Static Visualizations: PNG images of the endorsement graph.
Confidence Intervals: Statistical confidence bounds for model rankings.
Significance Tests: Statistical significance indicators between adjacent ranks.
Category Rankings: Model performance across different prompt categories.
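As an illustration of category rankings, the sketch below averages scores per model within each prompt category. The record layout, the function name `category_means`, and the sample data are hypothetical, and SlopRank's own output files may be structured differently.

```python
from collections import defaultdict


def category_means(records):
    """Average score per (category, model).

    `records` is an iterable of (category, model, score) tuples.
    Returns {category: {model: mean_score}}.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (category, model) -> [total, count]
    for category, model, score in records:
        entry = sums[(category, model)]
        entry[0] += score
        entry[1] += 1

    means = defaultdict(dict)
    for (category, model), (total, count) in sums.items():
        means[category][model] = total / count
    return dict(means)


# Hypothetical evaluation records.
records = [
    ("reasoning", "model_a", 8),
    ("reasoning", "model_a", 6),
    ("reasoning", "model_b", 7),
    ("coding", "model_a", 9),
    ("coding", "model_b", 5),
]
means = category_means(records)
for category, by_model in means.items():
    best = max(by_model, key=by_model.get)
    print(f"{category}: best={best}, scores={by_model}")
```

Splitting by category can reveal a model that is mid-ranked overall but leads in one area.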

Dashboard Details

The dashboard provides:

Overall model rankings with confidence intervals
Category-specific performance analysis
Interactive graph visualizations
Model comparison tools

Download Options

⬇️ Download Dashboard HTML - Save and open locally in any browser

Applications

Benchmarking: Evaluate and rank new or existing LLMs.
Specialization Analysis: Test domain-specific capabilities (e.g., legal, medical).
Model Optimization: Identify strengths and weaknesses for targeted fine-tuning.
Public Leaderboards: Maintain transparency and foster healthy competition among models.

Development

Release Process

To build and release a new version of SlopRank to PyPI:

Update the version number in pyproject.toml following semantic versioning
Update CHANGELOG.md with all changes
Clean previous builds: rm -rf build/ dist/ *.egg-info/
Build the package: python -m build
Validate the package: twine check dist/*
Upload to PyPI: twine upload dist/*
Create a GitHub release with the changelog info
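For the first step, a small helper can compute the next version number under semantic versioning. This is a sketch with a hypothetical function name, `bump_version`, and it assumes plain MAJOR.MINOR.PATCH strings without pre-release or build suffixes.

```python
def bump_version(version, part="patch"):
    """Return the next semantic version for a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"          # breaking change
    if part == "minor":
        return f"{major}.{minor + 1}.0"    # backward-compatible feature
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"  # backward-compatible fix
    raise ValueError(f"unknown version part: {part!r}")


print(bump_version("0.3.17"))
print(bump_version("0.3.17", "minor"))
```

The result still has to be written into pyproject.toml by hand or by your release tooling.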

Troubleshooting Releases

If you get permission errors during upload, check your PyPI credentials
If the build fails, ensure all dependencies are correctly listed in pyproject.toml
If the package fails validation, fix the issues before attempting to upload again

Version History

Recent Updates (v0.3.15+)

🚀 Major Performance Upgrade: Bodo-First Architecture

✅ Bodo is now the default - included in standard installation
✅ 3-5x faster DataFrame operations by default - no configuration needed
✅ Switchable backend system - environment variable control
✅ Direct Bodo integration for maximum performance
✅ Intelligent fallback to pandas when needed
✅ Simplified high-performance installation model

See the CHANGELOG.md file for a detailed version history and release notes.

Ideas for Contributions

Suggested Improvements

Improve visualization options and customization.
Add more statistical analysis methods.
Develop a public leaderboard to showcase rankings.
Enhance the web dashboard with more interactive features.
Add support for multi-language evaluation by introducing localized prompts.
Implement cost estimation and optimization features.

Contributions are welcome! If you have ideas for improving the framework, feel free to open an issue or submit a pull request.

Acknowledgments

Special thanks to:

Bodo.ai for the high-performance parallel computing platform
Simon Willison for the excellent llm library and ecosystem
The AI community for driving innovation in model evaluation

Flexible High-Performance Processing

SlopRank features a switchable pandas backend system that automatically optimizes for your environment:

# Standard installation (includes Bodo for high performance)
pip install sloprank

# Compatibility installation (regular pandas only)
pip install "sloprank[pandas]"

# SlopRank automatically uses the best backend (Bodo by default)
sloprank run --prompts prompts.csv --output-dir results --models "gpt-4o,claude-3.5-sonnet-latest"

Direct usage from Python, with automatic backend selection:

from sloprank.pandas_backend import pd  # Uses Bodo by default, pandas fallback
from sloprank.collect import collect_responses

# prompt_pairs and config are not defined in this snippet; build them from
# your prompts and evaluation settings before calling collect_responses.
# Efficient processing for large datasets (3-5x faster with Bodo by default)
responses_df = collect_responses(prompt_pairs, config)
print(responses_df)

This integration provides:

Parallel DataFrame Operations: Automatic parallelization of pandas operations across multiple cores
Memory Efficiency: Optimized memory usage for large datasets with intelligent caching
High Performance: JIT compilation for compute-intensive operations (graph building, PageRank)
Direct LLM Integration: Streamlined model access via Simon Willison's llm library
Production Ready: Robust error handling and fallback mechanisms

Performance Benefits

Benchmark improvements with Bodo integration:

⚡ 3-5x faster DataFrame operations on large evaluation datasets
💾 50-70% less memory usage compared to standard pandas
🔄 Automatic parallelization of PageRank computations
📈 Linear scalability with dataset size and number of models

Ideal for:

Large-scale model comparisons (10+ models, 1000+ prompts)
Academic research requiring statistical rigor
Enterprise benchmarking with performance requirements
Continuous evaluation pipelines

Release history

0.3.17  Sep 10, 2025 (this version)
0.3.16  Sep 10, 2025
0.3.15  Sep 10, 2025
0.3.14  Sep 10, 2025
0.3.13  Sep 10, 2025
0.3.11  Sep 9, 2025
0.3.10  Apr 8, 2025
0.3.9   Apr 8, 2025
0.3.8   Apr 8, 2025
0.3.7   Apr 8, 2025
0.3.6   Apr 8, 2025
0.3.5   Apr 8, 2025
0.3.4   Apr 8, 2025
0.3.3   Apr 8, 2025
0.3.2   Apr 7, 2025
0.3.0   Apr 7, 2025
0.2.6   Apr 7, 2025
0.2.5   Apr 7, 2025
0.2.4   Apr 7, 2025
0.2.3   Feb 28, 2025
0.2.2   Feb 28, 2025
0.2.0   Feb 28, 2025
0.1.2   Feb 6, 2025
0.1.1   Jan 31, 2025
0.1.0   Jan 28, 2025

Download files

Source Distribution

sloprank-0.3.17.tar.gz (48.7 kB)

Uploaded Sep 10, 2025 Source

Built Distribution

sloprank-0.3.17-py3-none-any.whl (42.6 kB)

Uploaded Sep 10, 2025 Python 3

File details

Details for the file sloprank-0.3.17.tar.gz.

File metadata

Download URL: sloprank-0.3.17.tar.gz
Upload date: Sep 10, 2025
Size: 48.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.6

File hashes

Hashes for sloprank-0.3.17.tar.gz
Algorithm	Hash digest
SHA256	`9c33c49b6378813145d81539739bc55e5bfe35f00d59006e591f5622f10b3011`
MD5	`7dd249a5cee3a420ec5de6011089c1bb`
BLAKE2b-256	`5cea7b447cb452d50d3a710cff59871c295af1855b1e18dabd1762d529445623`

File details

Details for the file sloprank-0.3.17-py3-none-any.whl.

File metadata

Download URL: sloprank-0.3.17-py3-none-any.whl
Upload date: Sep 10, 2025
Size: 42.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.6

File hashes

Hashes for sloprank-0.3.17-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7b425e71f97da96f96c5eb1552385b76c2abad055ef8912fe183aa93e36acea8`
MD5	`a705a75164a7cd478be7ad6a194f816b`
BLAKE2b-256	`45bb34a9d296e56cd21c037f9a5680c38da11ffc93313bc85333173436826bae`
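The published digests can be used to verify a downloaded file. The sketch below uses Python's hashlib to compute a SHA256 digest and compare it with the values listed above. The helper names `sha256_of` and `matches_published_hash` are illustrative, and the file path is whatever location you downloaded to.

```python
import hashlib

# SHA256 digests copied from the file hash tables above.
EXPECTED_SHA256 = {
    "sloprank-0.3.17.tar.gz":
        "9c33c49b6378813145d81539739bc55e5bfe35f00d59006e591f5622f10b3011",
    "sloprank-0.3.17-py3-none-any.whl":
        "7b425e71f97da96f96c5eb1552385b76c2abad055ef8912fe183aa93e36acea8",
}


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, filename):
    """True if the file at `path` has the digest published for `filename`."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, after downloading the source distribution into the current directory, `matches_published_hash("sloprank-0.3.17.tar.gz", "sloprank-0.3.17.tar.gz")` should return True. A mismatch means the file is corrupted or is not the one published.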
