Generate high-quality Rubrics based on custom dimensions, descriptors, criteria and scoring system.

Rubric Kit

A rubric framework for creating, refining, and applying evaluation rubrics, powered by AI.

Features

  • Rubric Generation - Create rubrics from Q&A pairs or chat sessions
  • Multi-Judge Panel - Multiple LLMs with consensus mechanisms (quorum, majority, unanimous)
  • Multi-Provider Support - OpenAI, Google Vertex AI, IBM WatsonX, Anthropic, Ollama, and 100+ providers via LiteLLM
  • Flexible Grading - Binary (pass/fail) and score-based (0-3 scale) grading
  • Tool Call Validation - Define required, optional, and prohibited tool calls
  • PDF Reports - Comprehensive reports with charts and breakdowns
  • Export Formats - YAML (source of truth), PDF, CSV, JSON
  • Self-Contained Outputs - Re-run evaluations from previous results
  • Cost Estimation - Dry-run mode to estimate costs before running evaluations
  • Metrics Tracking - Token usage, latency, and cost tracking per LLM call

Installation

Requires Python 3.10 or higher.

pip install rubric-kit

For development:

git clone https://github.com/your-org/rubric-kit
cd rubric-kit
pip install -e ".[dev]"

Quick Start

export OPENAI_API_KEY="your-api-key"

# Generate a rubric from Q&A
rubric-kit generate --from-qna qa_input.txt --output-file rubric.yaml

# Evaluate a chat session
rubric-kit evaluate --from-chat-session chat.txt --rubric-file rubric.yaml --output-file results.yaml

# Export to PDF
rubric-kit export results.yaml --format pdf --output report.pdf

LLM Provider Setup

Rubric Kit uses LiteLLM to support 100+ LLM providers. API keys are configured via environment variables (never in config files or CLI arguments).

Some Supported Providers (most popular)

| Provider | Model Format | Environment Variables |
|---|---|---|
| OpenAI | gpt-4, gpt-4o | OPENAI_API_KEY |
| Google AI Studio | gemini/gemini-2.5-flash | GEMINI_API_KEY |
| Google Vertex AI | vertex_ai/gemini-2.5-flash | gcloud auth or GOOGLE_APPLICATION_CREDENTIALS |
| IBM WatsonX | watsonx/meta-llama/llama-3-8b-instruct | WATSONX_APIKEY, WATSONX_PROJECT_ID |
| Anthropic | claude-3-5-sonnet-20241022 | ANTHROPIC_API_KEY |
| Ollama (local) | ollama/llama3.1 | None (uses localhost:11434) |
| Ollama (remote) | ollama/granite4 | OLLAMA_API_BASE |
| Azure OpenAI | azure/gpt-4 | AZURE_API_KEY, AZURE_API_BASE |

📚 Full documentation: See LiteLLM Providers for the complete list of supported providers, model formats, and required environment variables.
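
Because keys are read only from the environment, it can help to check that they are set before starting a run. The following is a minimal sketch of such a pre-flight check. The variable names come from the table above; the `missing_env_vars` helper and its prefix matching are illustrative assumptions and not part of Rubric Kit or LiteLLM.

```python
import os

# Environment variables per provider, copied from the table above.
# Matching on model-string prefixes is an illustrative assumption,
# not how Rubric Kit or LiteLLM resolves providers.
REQUIRED_ENV = {
    "gemini/": ["GEMINI_API_KEY"],
    "watsonx/": ["WATSONX_APIKEY", "WATSONX_PROJECT_ID"],
    "azure/": ["AZURE_API_KEY", "AZURE_API_BASE"],
    "anthropic/": ["ANTHROPIC_API_KEY"],
    "claude-": ["ANTHROPIC_API_KEY"],
    "gpt-": ["OPENAI_API_KEY"],
}


def missing_env_vars(model, environ=None):
    """Return the variables from the table that are unset for this model string."""
    environ = os.environ if environ is None else environ
    for prefix, names in REQUIRED_ENV.items():
        if model.startswith(prefix):
            return [name for name in names if not environ.get(name)]
    # Local Ollama, or Vertex AI authenticated through gcloud: nothing to check here.
    return []


print(missing_env_vars("watsonx/meta-llama/llama-3-8b-instruct", {"WATSONX_APIKEY": "x"}))
# prints ['WATSONX_PROJECT_ID']
```

Passing a plain dict instead of `os.environ` keeps the check easy to test without touching real credentials.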

CLI Examples

# OpenAI
export OPENAI_API_KEY="sk-..."
rubric-kit generate --from-qna qna.yaml --output-file rubric.yaml --model gpt-4o

# Google AI Studio (Gemini)
export GEMINI_API_KEY="..."
rubric-kit generate --from-qna qna.yaml --output-file rubric.yaml --model gemini/gemini-2.5-flash

# Google Vertex AI
gcloud auth application-default login
rubric-kit generate --from-qna qna.yaml --output-file rubric.yaml --model vertex_ai/gemini-2.5-flash

# IBM WatsonX
export WATSONX_APIKEY="..."
export WATSONX_PROJECT_ID="..."
rubric-kit generate --from-qna qna.yaml --output-file rubric.yaml --model watsonx/meta-llama/llama-3-8b-instruct

# Local Ollama
rubric-kit generate --from-qna qna.yaml --output-file rubric.yaml --model ollama/llama3

OpenAI-Compatible Endpoints

For custom OpenAI-compatible endpoints (vLLM, LocalAI, etc.), use base_url:

judges:
  - name: custom-endpoint
    model: mistral-7b  # Model name as expected by your endpoint
    base_url: http://your-endpoint:8000/v1

Commands

| Command | Description |
|---|---|
| generate | Create a rubric from a Q&A pair or chat session |
| evaluate | Evaluate content against a rubric (outputs YAML) |
| refine | Improve an existing rubric with AI feedback |
| export | Convert YAML to PDF, CSV, or JSON |
| rerun | Re-evaluate using settings from a previous output |
| arena | Compare multiple contestants against the same rubric |

Use rubric-kit <command> --help for detailed options.

Common Options

| Flag | Commands | Description |
|---|---|---|
| --dry-run | evaluate, generate | Estimate costs without making LLM calls |
| --no-metrics | evaluate, generate, refine | Disable metrics collection in output |
| --include-call-log | evaluate | Include detailed per-call metrics in output |
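
As a rough illustration of the arithmetic behind a cost estimate, the sketch below multiplies token counts by per-token prices. It is not Rubric Kit's implementation: the `estimate_cost_usd` function is hypothetical, and the prices are placeholder values, since real prices depend on the provider and model.

```python
def estimate_cost_usd(prompt_tokens, completion_tokens,
                      input_price_per_1k, output_price_per_1k):
    """Cost = (tokens / 1000) * price per 1k tokens, summed over input and output."""
    input_cost = (prompt_tokens / 1000) * input_price_per_1k
    output_cost = (completion_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# Hypothetical prices in USD per 1k tokens, chosen only for the example.
cost = estimate_cost_usd(2000, 500, input_price_per_1k=0.01, output_price_per_1k=0.03)
print(round(cost, 4))
```

A dry run can only approximate completion tokens, because the model's output length is not known until the call is made.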

YAML Formats

See examples/ for complete format examples.

Rubric Structure

The minimal rubric below shows how a rubric is composed: a list of dimensions, plus criteria that each reference one dimension.

TODO: For Rubrics best practices, please read the RUBRICS file.

dimensions:
  - factual_correctness: Evaluates factual accuracy
    grading_type: binary

  - quality: Evaluates response quality
    grading_type: score
    scores:
      0: Poor
      1: Fair
      2: Good
      3: Excellent

  - tool_use: Evaluates whether the expected tools were called
    grading_type: binary

criteria:
  accuracy_check:
    category: Output
    weight: 3
    dimension: factual_correctness
    criterion: Response must correctly state X.

  tool_usage:
    category: Tools
    weight: 2
    dimension: tool_use
    tool_calls:
      respect_order: false
      required:
        - get_info:
            min_calls: 1
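
To make the `tool_calls` semantics concrete, here is a minimal sketch of how required calls, `min_calls`, and `respect_order` could be checked against the tools an assistant actually called. The `check_tool_calls` function, the `prohibited` parameter name, and the ordering rule (first occurrences must follow the listed order) are assumptions for illustration, not Rubric Kit's actual logic.

```python
from collections import Counter


def check_tool_calls(observed, required, prohibited=(), respect_order=False):
    """Return a list of violations; an empty list means the criterion passes.

    observed:   tool names in the order they were called
    required:   list of (tool_name, min_calls) pairs
    prohibited: tool names that must not appear (parameter name is an assumption)
    """
    counts = Counter(observed)
    problems = []
    for name, min_calls in required:
        if counts[name] < min_calls:
            problems.append(f"{name}: expected >= {min_calls} calls, saw {counts[name]}")
    for name in prohibited:
        if counts[name] > 0:
            problems.append(f"{name}: prohibited but called {counts[name]} times")
    if respect_order:
        # Assumed rule: first occurrences of required tools follow the listed order.
        positions = [observed.index(name) for name, _ in required if name in observed]
        if positions != sorted(positions):
            problems.append("required tools were called out of order")
    return problems


print(check_tool_calls(["search", "get_info"], required=[("get_info", 1)]))
# prints []
```

With `respect_order: false`, as in the rubric above, only the call counts matter.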

Judge Panel

judge_panel:
  judges:
    - name: ChatGPT-4o
      model: gpt-4o
    - name: Gemini-2.5-Flash
      model: vertex_ai/gemini-2.5-flash
    - name: Claude-4.5-Sonnet
      model: anthropic/claude-sonnet-4-5

  execution:
    mode: parallel  # sequential, parallel, batched

  consensus:
    mode: majority  # unanimous, majority, quorum
    on_no_consensus: fail  # fail, median, most_common
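
The consensus modes can be read as different agreement thresholds over the judges' verdicts. The sketch below shows one plausible interpretation. The `consensus` function, the `quorum` threshold parameter, and returning `None` for `fail` are assumptions for illustration; Rubric Kit's own behavior may differ.

```python
from collections import Counter
from statistics import median


def consensus(votes, mode="majority", quorum=2, on_no_consensus="fail"):
    """Combine per-judge scores into one value.

    unanimous: every judge agrees
    majority:  more than half of the judges agree
    quorum:    at least `quorum` judges agree (threshold parameter is an assumption)
    """
    value, count = Counter(votes).most_common(1)[0]
    needed = {
        "unanimous": len(votes),
        "majority": len(votes) // 2 + 1,
        "quorum": quorum,
    }[mode]
    if count >= needed:
        return value
    if on_no_consensus == "median":
        return median(votes)
    if on_no_consensus == "most_common":
        return value
    return None  # "fail": stand-in for marking the criterion as failed


print(consensus([3, 3, 2]))                              # prints 3
print(consensus([1, 2, 3], on_no_consensus="median"))    # prints 2
print(consensus([3, 3, 2], mode="unanimous"))            # prints None
```

With three judges, `majority` needs two matching verdicts, while `unanimous` needs all three.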

Output

Evaluation always produces a self-contained YAML file:

results:
  - criterion_name: accuracy_check
    result: pass
    score: 3
    reason: The response correctly stated X.

summary:
  total_score: 15
  max_score: 18
  percentage: 83.3

rubric: { ... }       # Full rubric for reference
judge_panel: { ... }  # Judge configuration used
input: { ... }        # Input content (Q&A or chat session)
metadata:
  timestamp: 2025-01-20T10:30:00
  metrics:            # LLM usage metrics (unless --no-metrics)
    summary:
      total_calls: 5
      prompt_tokens: 2500
      completion_tokens: 800
      estimated_cost_usd: 0.0425
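
The `summary` fields are related by simple arithmetic: `percentage` is `total_score` divided by `max_score`. The sketch below reproduces the numbers above from per-criterion pairs. The `summarize` function and the per-criterion values are hypothetical; only the totals come from the example output.

```python
def summarize(results):
    """results: list of (score, max_score) pairs, one per criterion."""
    total = sum(score for score, _ in results)
    maximum = sum(max_score for _, max_score in results)
    percentage = round(100 * total / maximum, 1) if maximum else 0.0
    return {"total_score": total, "max_score": maximum, "percentage": percentage}


# Hypothetical per-criterion results that add up to the totals shown above.
print(summarize([(3, 3), (4, 6), (8, 9)]))
# prints {'total_score': 15, 'max_score': 18, 'percentage': 83.3}
```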

Development

# Run tests
pytest

# Run with coverage
pytest --cov=rubric_kit --cov-report=html

# Format code
black rubric_kit tests

License

See LICENSE file.

