A Python SDK for error analysis via LLM-as-a-Judge

CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

CLEAR (Comprehensive LLM Error Analysis and Reporting) is an interactive, open-source package for LLM-based error analysis. It helps surface meaningful, recurring issues in model outputs by combining automated evaluation with powerful visualization tools.

The workflow consists of two main phases:

  1. Analysis
    Generates textual feedback for each instance, then identifies system-level error categories from these critiques and quantifies their frequencies.

  2. Interactive Dashboard
    An intuitive dashboard provides a comprehensive view of model behavior. Users can:

    • Explore aggregate visualizations of identified issues
    • Apply dynamic filters to focus on specific error types or score ranges
    • Drill down into individual examples that illustrate specific failure patterns

CLEAR makes it easier to diagnose model shortcomings and prioritize targeted improvements.
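
To make the "quantifies their frequencies" step of the Analysis phase concrete, here is a toy sketch of tallying per-instance issue labels into frequencies. The `issue_frequencies` helper and the example labels are hypothetical and only illustrate the idea; this is not CLEAR's internal implementation.

```python
from collections import Counter


def issue_frequencies(labels_per_instance):
    """Return the fraction of instances in which each issue appears."""
    total = len(labels_per_instance)
    if total == 0:
        return {}
    counts = Counter()
    for labels in labels_per_instance:
        counts.update(set(labels))  # count each issue at most once per instance
    return {issue: n / total for issue, n in counts.items()}


# Hypothetical issue labels, one list per evaluated instance.
example = [
    ["arithmetic error"],
    ["arithmetic error", "missing final answer"],
    [],
    ["missing final answer"],
]
freqs = issue_frequencies(example)
print(freqs)
```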

You can run CLEAR as a full pipeline or reuse individual stages (generation, evaluation, or just the UI).

🚀 Quickstart

Requires Python 3.10+ and the necessary credentials for a supported provider.

1. Installation

Option 1 (Recommended for development): Clone the repo and set up a virtual environment:

git clone https://github.com/IBM/CLEAR.git
cd CLEAR
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .

📦 Option 2: Install via pip (Latest Release)

pip install clear-eval

2. Set provider type and credentials

CLEAR requires a supported LLM provider and credentials to run analysis. See "Supported providers and credentials" below.

⚠️ Using a private proxy or OpenAI deployment? You must configure your model names explicitly (see below). Otherwise, default model names will be used automatically for supported providers.

3. Run on sample data

The sample dataset is a small subset of the GSM8K math problems. To run on the sample data with the default configuration, simply set your provider and run:

run-clear-eval-analysis --provider=openai # or rits, watsonx

This will:

  • Run the full CLEAR pipeline
  • Save results under: results/gsm8k/sample_output/

4. View results in the interactive dashboard

run-clear-eval-dashboard

Or set the port with

run-clear-eval-dashboard --port <port>

Then:

  • Upload the generated ZIP file from results/gsm8k/sample_output/
  • Explore issues, scores, filters, and drill into examples

5. Explore the dashboard without running any analysis

Run the dashboard:

run-clear-eval-dashboard

Then load the pre-generated sample output. You can manually upload the sample .zip file located at:

<your-env>/site-packages/clear_eval/sample_data/gsm8k/analysis_results_gsm8k_default.zip

📁 Or just download it directly from the GitHub repo.
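
If you are unsure where `<your-env>/site-packages` is, the following sketch resolves the sample ZIP path from the installed package. It assumes the package is importable as `clear_eval` and that the sample file sits at the relative path shown above; `find_sample_zip` is a hypothetical helper, not part of CLEAR.

```python
import importlib.util
from pathlib import Path


def find_sample_zip():
    """Return the expected path of the bundled sample ZIP, or None if clear_eval is not installed."""
    spec = importlib.util.find_spec("clear_eval")
    if spec is None or spec.origin is None:
        return None
    package_dir = Path(spec.origin).parent  # directory containing clear_eval/__init__.py
    return package_dir / "sample_data" / "gsm8k" / "analysis_results_gsm8k_default.zip"


sample_zip = find_sample_zip()
if sample_zip is None:
    print("clear_eval is not installed in this environment")
else:
    print(sample_zip, "(exists)" if sample_zip.exists() else "(not found)")
```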


📂 Analyzing your own data

📄 Input Data Format

CLEAR takes a CSV file as input, with each row representing a single instance to be evaluated.

Required Columns

| Column | Used When | Description |
|---|---|---|
| id | Always | Unique identifier for the instance |
| model_input | Always | Prompt provided to the generation model |
| response | Using pre-generated responses | Pre-generated model response (ignored if generation is enabled) |
| ground_truth | Performing reference-based analysis | Ground-truth answer for evaluation (optional) |
| others | --input_columns is used | Additional input columns to show in dashboard (e.g. question) |
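
As an illustration of this format, the sketch below writes a small input CSV using only the standard library. The two rows are made-up examples, `write_input_csv` is a hypothetical helper, and `question` stands in for an extra column you would expose through `input_columns`.

```python
import csv

FIELDNAMES = ["id", "model_input", "response", "ground_truth", "question"]

# Made-up example rows; replace with your own data.
rows = [
    {
        "id": "1",
        "model_input": "Answer the question: What is 2 + 3?",
        "response": "2 + 3 = 5",
        "ground_truth": "5",
        "question": "What is 2 + 3?",
    },
    {
        "id": "2",
        "model_input": "Answer the question: What is 10 / 4?",
        "response": "10 / 4 = 2",
        "ground_truth": "2.5",
        "question": "What is 10 / 4?",
    },
]


def write_input_csv(path, records):
    """Write records to a CSV file with one row per instance."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(records)


write_input_csv("my_data.csv", rows)
```

The `response` column matters only when generation is disabled, and `ground_truth` only for reference-based analysis, so either can be dropped when not needed.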

🚀 Running the analysis

CLEAR can be run via the CLI or Python API.

Option 1: CLI commands

Each stage has its own entry point:

run-clear-eval-analysis --config_path path/to/config.yaml    # run the full pipeline
run-clear-eval-generation --config_path path/to/config.yaml  # run generation only
run-clear-eval-evaluation --config_path path/to/config.yaml  # run evaluation only, assuming responses are already provided
  • If --config_path is specified, all parameters are taken from the config unless explicitly overridden
  • CLI flags passed directly override corresponding config values
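
The precedence described in these two bullets can be pictured as a simple merge. This is a minimal sketch of the stated behavior, not CLEAR's actual code; `resolve_params` and the sample values are hypothetical.

```python
def resolve_params(config_values, cli_overrides):
    """Start from the config file values, then apply any flags given on the CLI."""
    merged = dict(config_values)
    for name, value in cli_overrides.items():
        if value is not None:  # None means the flag was not passed
            merged[name] = value
    return merged


config_values = {"provider": "watsonx", "perform_generation": True, "run_name": "baseline"}
cli_overrides = {"provider": "openai", "run_name": None}

params = resolve_params(config_values, cli_overrides)
print(params)
```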

Option 2: Python API

from clear_eval.analysis_runner import run_clear_eval_analysis, run_clear_eval_generation, run_clear_eval_evaluation

run_clear_eval_analysis(
    config_path="configs/sample_run_config.yaml"
)

You may also pass overrides instead of using a config file:

from clear_eval.analysis_runner import run_clear_eval_analysis

run_clear_eval_analysis(
    run_name="my_data",
    provider="openai",
    data_path="my_data.csv",
    gen_model_name="gpt-3.5-turbo",
    eval_model_name="gpt-4",
    output_dir="results/gsm8k/",
    perform_generation=False,
    input_columns=["question"]
)
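
As a further sketch, the same call can be set up for reference-based evaluation with custom criteria, using the `is_reference_based` and `evaluation_criteria` parameters listed in the arguments table below. The file names, model names, and criteria text are placeholders, and `run_reference_based` is a hypothetical wrapper that imports CLEAR only when called.

```python
# Keyword arguments for a reference-based run on pre-generated responses.
run_kwargs = {
    "run_name": "my_data_reference_based",
    "provider": "openai",
    "data_path": "my_data.csv",          # must contain a ground_truth column
    "eval_model_name": "gpt-4o",
    "gen_model_name": "my-generator",    # display name only, since generation is off
    "output_dir": "results/my_data/",
    "perform_generation": False,
    "is_reference_based": True,
    "evaluation_criteria": {
        # Placeholder criteria: name -> description
        "correctness": "The final answer matches the ground truth.",
        "reasoning": "The intermediate steps are valid and lead to the answer.",
    },
}


def run_reference_based(kwargs):
    """Run CLEAR with the given arguments (requires clear-eval and provider credentials)."""
    from clear_eval.analysis_runner import run_clear_eval_analysis

    return run_clear_eval_analysis(**kwargs)
```

Call `run_reference_based(run_kwargs)` once the package is installed and your provider credentials are set.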

📊 Launching the Dashboard

run-clear-eval-dashboard

Upload the ZIP file generated in your --output-dir when prompted.

🎛 Supported CLI Arguments

Arguments can be provided via:

  • A YAML config file (--config_path)
  • CLI flags
  • Python function parameters (when using the API)

⚠️ Boolean arguments (perform_generation, is_reference_based, resume_enabled)
These must be set explicitly to true or false in YAML, CLI, or Python.
On the CLI, use --flag True or --flag False (case-insensitive).
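
For intuition, a flag that takes an explicit, case-insensitive True/False value can be parsed with `argparse` roughly as follows. This is an illustrative sketch, not CLEAR's own parser; `str_to_bool` is a hypothetical helper.

```python
import argparse


def str_to_bool(value):
    """Convert 'true'/'false' (any casing) to a bool, rejecting anything else."""
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


parser = argparse.ArgumentParser()
parser.add_argument("--perform-generation", type=str_to_bool, default=True)

args = parser.parse_args(["--perform-generation", "FALSE"])
print(args.perform_generation)
```

With this style of parsing, a bare `--perform-generation` without a value is an error, which is why the value must always be spelled out.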

⚠️ Naming Convention
Parameter names use snake_case in YAML and Python, but use --kebab-case in CLI.
For example:

  • YAML: perform_generation: true
  • Python: perform_generation=True
  • CLI: --perform-generation True
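
The convention above can be applied mechanically when turning Python-style parameters into a command line. The sketch below assumes the kebab-case rule as stated here and the comma-separated form for list values; `to_cli_args` is a hypothetical helper, not part of CLEAR.

```python
def to_cli_args(params):
    """Convert snake_case parameters into a flat list of kebab-case CLI arguments."""
    args = []
    for name, value in params.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)  # lists become comma-separated
        args.extend([flag, str(value)])  # booleans become "True" / "False"
    return args


cli_args = to_cli_args({"perform_generation": True, "input_columns": ["question", "topic"]})
print(" ".join(cli_args))
```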

| Argument | Description | Default |
|---|---|---|
| --config_path | Path to a YAML config file (all values loaded unless overridden by CLI args) | |
| --run_name | Unique run name (used in result file names) | |
| --data_path | Path to input CSV file | |
| --output_dir | Output directory to write results | |
| --provider | Model provider: openai, watsonx, rits | |
| --eval_model_name | Name of the judge model (e.g. gpt-4o) | |
| --gen_model_name | Name of the generator model to evaluate. If generation is not run, this is the generator name to display. | |
| --perform_generation | Whether to generate responses or use the existing response column | True |
| --is_reference_based | Use reference-based evaluation (requires ground_truth column in input) | False |
| --resume_enabled | Whether to reuse intermediate outputs from previous runs stored in output_dir | True |
| --evaluation_criteria | Custom criteria dictionary for scoring individual records: {"criteria_name1": "criteria_desc1", ...}. Supported in YAML config and Python. | None |
| --input_columns | Comma-separated list of additional input fields (other than model_input) to appear in the results and dashboard (e.g. question) | None |

🔑Supported providers and credentials

Depending on your selected --provider:

| Provider | Required Environment Variables |
|---|---|
| openai | OPENAI_API_KEY, [OPENAI_API_BASE if using proxy] |
| watsonx | WATSONX_APIKEY, WATSONX_URL, WATSONX_SPACE_ID or PROJECT_ID |
| rits | RITS_API_KEY |
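
Before launching a run, it can help to check that the variables for your provider are set. The sketch below encodes the list above as written; `missing_credentials` is a hypothetical helper, and the variable names are copied verbatim (including `PROJECT_ID`), so adjust them if your deployment uses different names.

```python
import os

# Each inner tuple lists alternatives; at least one of them must be set.
REQUIRED_ENV = {
    "openai": [("OPENAI_API_KEY",)],
    "watsonx": [("WATSONX_APIKEY",), ("WATSONX_URL",), ("WATSONX_SPACE_ID", "PROJECT_ID")],
    "rits": [("RITS_API_KEY",)],
}


def missing_credentials(provider, env=None):
    """Return a description of each required variable that is unset or empty."""
    env = os.environ if env is None else env
    missing = []
    for alternatives in REQUIRED_ENV[provider]:
        if not any(env.get(name) for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing


print(missing_credentials("openai"))
```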
