Human evaluation tools for AI models and datasets

Project description

Crowd Evaluation for Machine Learning Training

A Python library for integrating crowd evaluation into your machine learning training loops. This library provides asynchronous, non-blocking evaluation of model outputs (currently supporting image generation) with automatic logging to Weights & Biases (wandb).

Features

Asynchronous Evaluation: Evaluations run in the background without blocking your training loop
Wandb Integration: Results are automatically logged to your wandb runs with proper ordering
Image Evaluation: Built-in support for evaluating generated images on multiple criteria
Crowd-in-the-Loop: Uses Rapidata for high-quality crowd evaluation
Easy Integration: Add evaluation to your training loop with just a few lines of code
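
To make the asynchronous, non-blocking behaviour described above concrete, here is a minimal sketch of the general pattern using only the Python standard library. It is not the crowd-eval implementation: the class BackgroundEvaluator, its methods, and the slow_score function are hypothetical names chosen for illustration, and a short sleep stands in for a crowd evaluation request.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Tuple


class BackgroundEvaluator:
    """Hypothetical stand-in that runs evaluations on worker threads."""

    def __init__(self, max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending: List[Tuple[int, Future]] = []

    def evaluate(self, step: int, score_fn: Callable[[int], int]) -> None:
        # submit() returns right away; the work happens on a worker thread.
        self._pending.append((step, self._pool.submit(score_fn, step)))

    def wait_for_all(self) -> List[Tuple[int, int]]:
        # Block until every evaluation finishes, then report in step order.
        results = [(step, future.result()) for step, future in self._pending]
        self._pending.clear()
        self._pool.shutdown(wait=True)
        return sorted(results, key=lambda item: item[0])


def slow_score(step: int) -> int:
    # Earlier steps sleep longer, so they finish last on purpose.
    time.sleep(0.01 * (3 - step))
    return step * 10


evaluator = BackgroundEvaluator()
for step in range(3):
    evaluator.evaluate(step, slow_score)  # does not block the loop

results = evaluator.wait_for_all()
print(results)
```

Even though later steps finish first in this sketch, the results come back ordered by step, which is the kind of ordering guarantee the wandb logging feature refers to.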

Quick Start

import wandb
from checkpoint_evaluation.image_checkpoint_evaluator import ImageEvaluator

# Initialize wandb
run = wandb.init(project="my-project")

# Create evaluator
evaluator = ImageEvaluator(wandb_run=run, model_name="my-model")

# In your training loop
for step in range(100):
    # ... your training code ...
    
    # Generate or load validation images (every N steps)
    if step % 10 == 0:
        validation_images = ["path/to/image_1.png", "path/to/image_2.png"]
        
        # Fire-and-forget evaluation - returns immediately!
        evaluator.evaluate(validation_images)
    
    # ... continue training ...

# Wait for all evaluations to complete before finishing
evaluator.wait_for_all_evaluations()
run.finish()

Installation

Prerequisites

Python 3.9+
A Rapidata account with API credentials
A Weights & Biases account

Dependencies

Prerequisites

Install uv if you haven't already:

# For macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# For Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Setup Instructions

Create and activate a virtual environment:

uv venv

# On Unix/macOS
source .venv/bin/activate

# On Windows
.venv\Scripts\activate

Install dependencies:
```
uv sync
```

Environment Setup

Create a .env file in your project root:

OPENAI_API_KEY=your_openai_api_key  # If running the example file
RAPIDATA_CLIENT_ID=your_rapidata_client_id # If running on a server
RAPIDATA_CLIENT_SECRET=your_rapidata_client_secret # If running on a server
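
If you want to fail early when credentials are missing, a small check like the following can help. This is a sketch only: missing_env_vars is a hypothetical helper, not part of this library, and the variable names are simply the ones listed in the .env example above.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the .env example above.
RAPIDATA_VARS = ("RAPIDATA_CLIENT_ID", "RAPIDATA_CLIENT_SECRET")


def missing_env_vars(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or blank in env (hypothetical helper)."""
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars(RAPIDATA_VARS)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Rapidata credentials found in the environment.")
```

Call this after the .env file has been loaded, so that values defined there are visible in os.environ.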

Detailed Usage

Image Evaluation

The ImageEvaluator evaluates generated images on three key metrics:

Preference: Overall crowd preference for the image
Alignment: How well the image matches its text description
Coherence: Visual quality and absence of artifacts

Image Requirements

For the evaluator to work, each image filename must end with "_{prompt_id}" immediately before the file extension (for example, generated_image_run_0_{prompt_id}.png), where {prompt_id} is a prompt ID from the evaluation dataset. The rest of the filename is not significant.

The evaluator automatically validates that your images match available prompts.
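
The naming rule can be checked locally before submitting images. The sketch below is an assumption about how such a check might look, not the evaluator's own validation code: extract_prompt_id and find_invalid_images are hypothetical helpers, and the sketch assumes prompt IDs themselves contain no underscore.

```python
from pathlib import Path
from typing import Iterable, List, Set


def extract_prompt_id(image_path: str) -> str:
    """Return the text after the last underscore of the filename stem."""
    stem = Path(image_path).stem  # filename without directory or extension
    _, separator, suffix = stem.rpartition("_")
    if not separator or not suffix:
        raise ValueError(f"{image_path!r} does not end with '_<prompt_id>'")
    return suffix


def find_invalid_images(
    image_paths: Iterable[str], known_prompt_ids: Set[str]
) -> List[str]:
    """List the paths whose prompt ID is missing or not in known_prompt_ids."""
    invalid = []
    for path in image_paths:
        try:
            prompt_id = extract_prompt_id(path)
        except ValueError:
            invalid.append(path)
            continue
        if prompt_id not in known_prompt_ids:
            invalid.append(path)
    return invalid
```

For example, find_invalid_images(paths, set(evaluator.prompts)) would compare your files against the prompt IDs exposed by evaluator.prompts, assuming those IDs are strings.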

Complete Example with Image Generation

Before running this example, run the following commands:

uv venv
source .venv/bin/activate
uv sync
uv add openai python-dotenv requests

and log in to wandb:

wandb login

import os

import openai
import requests
import wandb
from checkpoint_evaluation.image_checkpoint_evaluator import ImageEvaluator
from dotenv import load_dotenv

load_dotenv()

# Setup
openai.api_key = os.getenv("OPENAI_API_KEY")
run = wandb.init(project="dalle-evaluation")
evaluator = ImageEvaluator(wandb_run=run, model_name="dalle-3")

def generate_and_save_image(prompt: str, file_location: str) -> str:
    """Generate image using DALL-E and save to disk."""
    os.makedirs(os.path.dirname(file_location), exist_ok=True)
    
    response = openai.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        quality="standard",
        n=1
    )
    
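    # Download the generated image and save it to disk
    image_url = response.data[0].url
    image_response = requests.get(image_url, timeout=60)
    image_response.raise_for_status()  # Fail loudly on a bad download
    with open(file_location, "wb") as f:
        f.write(image_response.content)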
    
    return file_location

if __name__ == "__main__":
    # Training simulation
    for step in range(3):
        # Simulate training
        run.log({"Some training metric": step})
        
        # Generate images for evaluation (using the first 2 prompts)
        validation_images = [
            generate_and_save_image(prompt, f"validation_images/generated_image_run_{step}_{prompt_id}.png")
            for prompt_id, prompt in list(evaluator.prompts.items())[:2]
        ]
        
        # Evaluate asynchronously
        evaluator.evaluate(validation_images)

    print("This will run immediately, but the evaluations will run in the background.")

    # Wait for all evaluations
    evaluator.wait_for_all_evaluations()
    run.finish()

Troubleshooting

Common Issues

"Invalid prompt ids" error:

Ensure image filenames follow the pattern: *_{prompt_id}.png
Check that {prompt_id} exists in the evaluation dataset

Evaluations not appearing in wandb:

Call evaluator.wait_for_all_evaluations() before run.finish()
Check your Rapidata API credentials
Verify internet connectivity for API calls
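
One way to make sure the wait always happens before the run is closed, even if training raises an exception, is to wrap the loop in try/finally. The sketch below is illustrative: finish_after_evaluations is a hypothetical helper, and the two stand-in classes only record call order so the example runs without wandb or Rapidata. With the real objects you would pass your wandb run and your ImageEvaluator.

```python
from typing import Callable, List

calls: List[str] = []  # records the order of calls in this demonstration


def finish_after_evaluations(run, evaluator, train_fn: Callable[[], None]) -> None:
    """Run training, then wait for evaluations, then finish the run."""
    try:
        train_fn()
    finally:
        try:
            # Pending results must be logged while the run is still open.
            evaluator.wait_for_all_evaluations()
        finally:
            run.finish()


class FakeEvaluator:
    """Stand-in exposing only the method name used in this README."""

    def wait_for_all_evaluations(self) -> None:
        calls.append("wait_for_all_evaluations")


class FakeRun:
    """Stand-in for a wandb run; only finish() is modelled."""

    def finish(self) -> None:
        calls.append("finish")


finish_after_evaluations(FakeRun(), FakeEvaluator(), lambda: calls.append("train"))
print(calls)
```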

"Module not found" error:

Ensure you have the correct dependencies installed
Ensure your example code is run from the root of the repository

Environment Variables

Required when running on a server (not needed when running locally):

RAPIDATA_CLIENT_ID: Your Rapidata client ID
RAPIDATA_CLIENT_SECRET: Your Rapidata client secret

Optional:

OPENAI_API_KEY: For image generation examples

Project details

Release history

0.3.0

Jun 18, 2025

0.2.0

Jun 13, 2025

0.1.10

Jun 13, 2025

0.1.9

Jun 12, 2025

0.1.8

Jun 10, 2025

This version

0.1.7

Jun 10, 2025

0.1.6

Jun 10, 2025

0.1.5

Jun 10, 2025

0.1.4

Jun 10, 2025

0.1.3

Jun 10, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

crowd_eval-0.1.7.tar.gz (46.8 kB)

Uploaded Jun 10, 2025 Source

Built Distribution

crowd_eval-0.1.7-py3-none-any.whl (9.3 kB)

Uploaded Jun 10, 2025 Python 3

File details

Details for the file crowd_eval-0.1.7.tar.gz.

File metadata

Download URL: crowd_eval-0.1.7.tar.gz
Upload date: Jun 10, 2025
Size: 46.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.12

File hashes

Hashes for crowd_eval-0.1.7.tar.gz
Algorithm	Hash digest
SHA256	`e4c2678e35d2cb71fc572e5e3dca36b8c8cbda841715a70f0692b6994bf81c2b`
MD5	`54f1982ec13dc1c95c3bbe357abbecd9`
BLAKE2b-256	`085ffdfa90cce220ca2f31b02fb2e50682f8455df67d1bfbd71f297b5ad5a2d1`

File details

Details for the file crowd_eval-0.1.7-py3-none-any.whl.

File metadata

Download URL: crowd_eval-0.1.7-py3-none-any.whl
Upload date: Jun 10, 2025
Size: 9.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.12

File hashes

Hashes for crowd_eval-0.1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`03a82a836de9cfd711240e56ce0c701e4bcd99be975ab839178ce1fdcd41f23f`
MD5	`b44788d5571b55122df176f45718472e`
BLAKE2b-256	`3eba860c5ad332c0c6586dd9959bc5af78936b0c3e932f0fdb62811d92d5c5cd`

crowd-eval 0.1.7
