Human evaluation tools for AI models and datasets
Crowd Evaluation for Machine Learning Training
A Python library for integrating crowd evaluation into your machine learning training loops. This library provides asynchronous, non-blocking evaluation of model outputs (currently supporting image generation) with automatic logging to Weights & Biases (wandb).
Features
- Asynchronous Evaluation: Evaluations run in the background without blocking your training loop
- Wandb Integration: Results are automatically logged to your wandb runs with proper ordering
- Image Evaluation: Built-in support for evaluating generated images on multiple criteria
- Crowd-in-the-Loop: Uses Rapidata for high-quality crowd evaluation
- Easy Integration: Add evaluation to your training loop with just a few lines of code
Quick Start
import wandb
from checkpoint_evaluation.image_checkpoint_evaluator import ImageEvaluator

# Initialize wandb
run = wandb.init(project="my-project")

# Create evaluator
evaluator = ImageEvaluator(wandb_run=run, model_name="my-model")

# In your training loop
for step in range(100):
    # ... your training code ...

    # Generate or load validation images (every N steps)
    if step % 10 == 0:
        validation_images = ["path/to/image_1.png", "path/to/image_2.png"]
        # Fire-and-forget evaluation - returns immediately!
        evaluator.evaluate(validation_images)

    # ... continue training ...

# Wait for all evaluations to complete before finishing
evaluator.wait_for_all_evaluations()
run.finish()
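To illustrate the fire-and-forget pattern used in the Quick Start, here is a minimal sketch built only on the Python standard library. It is not how this library is implemented: `BackgroundEvaluator`, its `evaluate(step, images)` signature, and `slow_fake_score` are hypothetical stand-ins, and the "evaluation" is just a short sleep. It only shows why submitting work returns immediately and why a final wait is needed before the run ends.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class BackgroundEvaluator:
    """Hypothetical stand-in for an asynchronous evaluator (not the crowd_eval API)."""

    def __init__(self, score_fn: Callable[[List[str]], float], max_workers: int = 2) -> None:
        self._score_fn = score_fn
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending: Dict[int, Future] = {}
        self.results: Dict[int, float] = {}

    def evaluate(self, step: int, images: List[str]) -> None:
        # submit() returns at once; the scoring runs on a worker thread.
        self._pending[step] = self._pool.submit(self._score_fn, images)

    def wait_for_all_evaluations(self) -> None:
        # Block until every submitted job is done, collecting results in step order.
        for step in sorted(self._pending):
            self.results[step] = self._pending[step].result()
        self._pending.clear()
        self._pool.shutdown()


def slow_fake_score(images: List[str]) -> float:
    """Pretend crowd evaluation: sleeps briefly, then returns the image count."""
    time.sleep(0.05)
    return float(len(images))


evaluator_sketch = BackgroundEvaluator(slow_fake_score)
for step in range(0, 30, 10):
    # The loop is not held up by the 0.05 s "evaluation".
    evaluator_sketch.evaluate(step, [f"img_{step}_a.png", f"img_{step}_b.png"])

# Without this call the program could end before the results are collected.
evaluator_sketch.wait_for_all_evaluations()
print(evaluator_sketch.results)
```

In this sketch the results are keyed by the step at which they were submitted, which is one simple way to keep late-arriving results in order.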
Installation
Prerequisites
- Python 3.9+
- A Rapidata account with API credentials
- A Weights & Biases account
Dependencies
Prerequisites
Install uv if you haven't already:
# For MacOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# For Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Setup Instructions
- Create and activate a virtual environment:
uv venv
# On Unix/macOS
source .venv/bin/activate
# On Windows
.venv\Scripts\activate
- Install dependencies:
uv sync
Environment Setup
Create a .env file in your project root:
OPENAI_API_KEY=your_openai_api_key # If running the example file
RAPIDATA_CLIENT_ID=your_rapidata_client_id # If running on a server
RAPIDATA_CLIENT_SECRET=your_rapidata_client_secret # If running on a server
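As a quick sanity check before starting a run on a server, the sketch below reports which Rapidata variables are missing. The helper name `missing_rapidata_vars` is hypothetical and not part of this library. It reads only the process environment, so it assumes the .env file has already been loaded (for example with `load_dotenv()` from python-dotenv) or that the variables were exported in the shell.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the .env example above.
RAPIDATA_VARS = ("RAPIDATA_CLIENT_ID", "RAPIDATA_CLIENT_SECRET")


def missing_rapidata_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the Rapidata variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in RAPIDATA_VARS if not env.get(name)]


missing = missing_rapidata_vars()
if missing:
    print("Not set (needed when running on a server):", ", ".join(missing))
```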
Detailed Usage
Image Evaluation
The ImageEvaluator evaluates generated images on three key metrics:
- Preference: Overall crowd preference for the image
- Alignment: How well the image matches its text description
- Coherence: Visual quality and absence of artifacts
Image Requirements
For the evaluator to work, image filenames must follow a naming convention: the name must end with "_{prompt_id}" before the file extension (for example, *_{prompt_id}.png). The rest of the filename is not significant.
Here {prompt_id} is a prompt ID from the evaluation dataset. The evaluator automatically validates that your images match available prompts.
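If you want to check filenames yourself before submitting them, the naming rule can be sketched as follows. This is not the evaluator's own validation code: both helper names are hypothetical, the example IDs are made up, and the sketch assumes that prompt IDs contain no underscores and can be compared as strings.

```python
from pathlib import Path
from typing import Iterable, List, Set


def prompt_id_from_path(image_path: str) -> str:
    """Return the text after the last underscore in the file name, minus the extension."""
    stem = Path(image_path).stem
    if "_" not in stem:
        raise ValueError(f"{image_path!r} does not end with '_<prompt_id>'")
    return stem.rsplit("_", 1)[1]


def find_invalid_images(image_paths: Iterable[str], known_ids: Set[str]) -> List[str]:
    """List the paths whose trailing ID is missing or not in `known_ids`."""
    invalid = []
    for path in image_paths:
        try:
            if prompt_id_from_path(path) not in known_ids:
                invalid.append(path)
        except ValueError:
            invalid.append(path)
    return invalid


# Hypothetical IDs for illustration; real IDs come from the evaluation dataset.
example_ids = {"1", "2"}
paths = ["out/step_10_1.png", "out/step_10_2.png", "out/step_10_99.png", "out/nounderscore.png"]
bad = find_invalid_images(paths, example_ids)
print(bad)
```

Running a check like this before calling the evaluator lets you catch a misnamed file locally instead of through an "Invalid prompt ids" error.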
Complete Example with Image Generation
Before running this example, run the following commands:
uv venv
source .venv/bin/activate
uv sync
uv add openai python-dotenv requests
and log in to wandb:
wandb login
import os

import openai
import requests
import wandb
from checkpoint_evaluation.image_checkpoint_evaluator import ImageEvaluator
from dotenv import load_dotenv

# Read OPENAI_API_KEY (and any Rapidata credentials) from the .env file
load_dotenv()

# Setup
openai.api_key = os.getenv("OPENAI_API_KEY")
run = wandb.init(project="dalle-evaluation")
evaluator = ImageEvaluator(wandb_run=run, model_name="dalle-3")


def generate_and_save_image(prompt: str, file_location: str) -> str:
    """Generate image using DALL-E and save to disk."""
    os.makedirs(os.path.dirname(file_location), exist_ok=True)
    response = openai.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        quality="standard",
        n=1,
    )
    # Download and save image
    image_url = response.data[0].url
    image_data = requests.get(image_url).content
    with open(file_location, "wb") as f:
        f.write(image_data)
    return file_location


if __name__ == "__main__":
    # Training simulation
    for step in range(3):
        # Simulate training
        run.log({"Some training metric": step})

        # Generate images for evaluation (using first 2 prompts);
        # each filename ends with the prompt ID, as the evaluator requires
        validation_images = [
            generate_and_save_image(prompt, f"validation_images/generated_image_run_{step}_{prompt_id}.png")
            for prompt_id, prompt in list(evaluator.prompts.items())[:2]
        ]

        # Evaluate asynchronously
        evaluator.evaluate(validation_images)
        print("This will run immediately, but the evaluations will run in the background.")

    # Wait for all evaluations
    evaluator.wait_for_all_evaluations()
    run.finish()
Troubleshooting
Common Issues
"Invalid prompt ids" error:
- Ensure image filenames follow the pattern: *_{prompt_id}.png
- Check that {prompt_id} exists in the evaluation dataset
Evaluations not appearing in wandb:
- Call evaluator.wait_for_all_evaluations() before run.finish()
- Check your Rapidata API credentials
- Verify internet connectivity for API calls
"Module not found" error:
- Ensure you have the correct dependencies installed
- Ensure your example code is run from the root of the repository
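To narrow down a "Module not found" error, a small diagnostic like the one below can show which module Python fails to locate from the current working directory. The helper `module_available` is hypothetical and not part of this library; it relies only on `importlib.util.find_spec` from the standard library. The module names checked are the ones imported in the examples in this README.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located without importing the target module itself."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package of a dotted name is missing or fails to import.
        return False


for name in ("checkpoint_evaluation.image_checkpoint_evaluator", "wandb"):
    status = "found" if module_available(name) else "NOT found"
    print(f"{name}: {status}")
```

If the first module is reported as not found, re-check that the dependencies are installed in the active virtual environment and that the script is started from the repository root.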
Environment Variables
Required:
- RAPIDATA_CLIENT_ID: Your Rapidata client ID (not required if running locally)
- RAPIDATA_CLIENT_SECRET: Your Rapidata client secret (not required if running locally)
Optional:
- OPENAI_API_KEY: For image generation examples