A Python API SDK facilitating error analysis via LLM-as-a-Judge
CLEAR: Error Analysis via LLM-as-a-Judge Made Easy
CLEAR (Comprehensive LLM Error Analysis and Reporting) is an interactive, open-source package for LLM-based error analysis. It helps surface meaningful, recurring issues in model outputs by combining automated evaluation with powerful visualization tools.
The workflow consists of two main phases:
- Analysis
  Generates textual feedback for each instance, identifies system-level error categories from these critiques, and quantifies their frequencies.
- Interactive Dashboard
  An intuitive dashboard provides a comprehensive view of model behavior. Users can:
  - Explore aggregate visualizations of identified issues
  - Apply dynamic filters to focus on specific error types or score ranges
  - Drill down into individual examples that illustrate specific failure patterns
CLEAR makes it easier to diagnose model shortcomings and prioritize targeted improvements.
You can run CLEAR as a full pipeline, or reuse specific stages (generation, evaluation, or just UI).
🚀 Quickstart
Requires Python 3.10+ and the necessary credentials for a supported provider.
1. Installation
Option 1 (Recommended for development): Clone the repo and set up a virtual environment:
git clone https://github.com/IBM/CLEAR.git
cd CLEAR
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e .
📦 Option 2: Install via pip (Latest Release)
pip install clear-eval
2. Set provider type and credentials
CLEAR requires a supported LLM provider and credentials to run analysis. See supported providers ↓
⚠️ Using a private proxy or OpenAI deployment? You must configure your model names explicitly (see below). Otherwise, default model names are used automatically for supported providers.
- Run on sample data:
The sample dataset is a small subset of the GSM8K math problems. To run on the sample data with the default configuration, simply set your provider and run:
run-clear-eval-analysis --provider=openai # or rits, watsonx
This will:
- Run the full CLEAR pipeline
- Save results under:
results/gsm8k/sample_output/
- View results in the interactive dashboard:
run-clear-eval-dashboard
Or set the port with
run-clear-eval-dashboard --port <port>
Then:
- Upload the generated ZIP file from results/gsm8k/sample_output/
- Explore issues, scores, filters, and drill into examples
- To explore the dashboard without running any analysis:
Run the dashboard:
run-clear-eval-dashboard
Then load the pre-generated sample output ZIP. You can manually upload the sample .zip file located at:
<your-env>/site-packages/clear_eval/sample_data/gsm8k/analysis_results_gsm8k_default.zip
📁 Or just download it directly from the GitHub repo.
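If you are unsure where `<your-env>/site-packages` lives, a short script can resolve the location for you. This is a minimal sketch, assuming the package is importable as `clear_eval` and that the sample ZIP sits at the relative path given above. `sample_zip_path` is a hypothetical helper name and is not part of CLEAR.

```python
from importlib.util import find_spec
from pathlib import Path
from typing import Optional

# Relative location of the sample ZIP inside the installed package (taken from the path above).
SAMPLE_ZIP = Path("sample_data") / "gsm8k" / "analysis_results_gsm8k_default.zip"


def sample_zip_path(package: str = "clear_eval") -> Optional[Path]:
    """Return the expected path of the bundled sample ZIP, or None if the package is not installed."""
    spec = find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    # The first search location is the directory that holds the package's files.
    package_dir = Path(next(iter(spec.submodule_search_locations)))
    return package_dir / SAMPLE_ZIP


path = sample_zip_path()
if path is None:
    print("clear_eval is not installed in this environment")
else:
    print(f"{path} (exists: {path.exists()})")
```

The script only computes and prints the path; it reports whether the file exists rather than assuming it does.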
📂 Analyzing your own data
📄 Input Data Format
CLEAR takes a CSV file as input, with each row representing a single instance to be evaluated.
Required Columns
| Column | Used When | Description |
|---|---|---|
| id | Always | Unique identifier for the instance |
| model_input | Always | Prompt provided to the generation model |
| response | Using pre-generated responses | Pre-generated model response (ignored if generation is enabled) |
| ground_truth | Performing reference-based analysis | Ground-truth answer for evaluation (optional) |
| others | --input_columns is used | Additional input columns to show in dashboard (e.g. question) |
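As an illustration of this layout, the sketch below builds a two-row input file with Python's standard `csv` module. The rows are made up, `question` stands in for an extra input column, and `build_csv_text` is a hypothetical helper name.

```python
import csv
import io

# Column names from the table above; "question" is an example of an additional input column.
FIELDNAMES = ["id", "model_input", "response", "ground_truth", "question"]

# Made-up rows, for illustration only.
ROWS = [
    {
        "id": "1",
        "model_input": "Answer with a single number. What is 2 + 3?",
        "response": "2 + 3 = 5. The answer is 5.",
        "ground_truth": "5",
        "question": "What is 2 + 3?",
    },
    {
        "id": "2",
        "model_input": "Answer with a single number. What is 4 * 6?",
        "response": "4 * 6 = 26. The answer is 26.",
        "ground_truth": "24",
        "question": "What is 4 * 6?",
    },
]


def build_csv_text(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_csv_text(ROWS)
print(csv_text)
```

Writing `csv_text` to a file such as `my_data.csv` gives a file with the columns described above. Drop `response` if you let CLEAR generate the responses, and drop `ground_truth` if you are not running reference-based analysis.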
🚀 Running the analysis
CLEAR can be run via the CLI or Python API.
Option 1: CLI commands
Each stage has its own entry point:
run-clear-eval-analysis --config_path path/to/config.yaml # run the full pipeline
run-clear-eval-generation --config_path path/to/config.yaml # run generation only
run-clear-eval-evaluation --config_path path/to/config.yaml # run evaluation only, assuming responses are already given
- If --config_path is specified, all parameters are taken from the config unless explicitly overridden
- CLI flags passed directly override corresponding config values
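The precedence rule in the two bullets above can be sketched as a simple dictionary merge. This illustrates the rule only and is not CLEAR's actual implementation; `resolve_params` is a hypothetical name, and treating `None` as "flag not given" is an assumption of the sketch.

```python
from typing import Any, Dict, Optional


def resolve_params(config: Dict[str, Any], overrides: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Start from the config values, then apply every override that was explicitly given."""
    resolved = dict(config)
    for name, value in overrides.items():
        if value is not None:  # None means the flag was not passed
            resolved[name] = value
    return resolved


config = {"provider": "watsonx", "perform_generation": True, "run_name": "baseline"}
cli_flags = {"provider": "openai", "run_name": None}

print(resolve_params(config, cli_flags))
# {'provider': 'openai', 'perform_generation': True, 'run_name': 'baseline'}
```

Only `provider` changes, because it is the only flag that was actually passed; every other value still comes from the config.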
Option 2: Python API
from clear_eval.analysis_runner import run_clear_eval_analysis, run_clear_eval_generation, run_clear_eval_evaluation
run_clear_eval_analysis(
config_path="configs/sample_run_config.yaml"
)
You may also pass overrides instead of using a config file:
from clear_eval.analysis_runner import run_clear_eval_analysis
run_clear_eval_analysis(
run_name="my_data",
provider="openai",
data_path="my_data.csv",
gen_model_name="gpt-3.5-turbo",
eval_model_name="gpt-4",
output_dir="results/gsm8k/",
perform_generation=False,
input_columns=["question"]
)
📊 Launching the Dashboard
run-clear-eval-dashboard
Upload the ZIP file generated in your --output-dir when prompted.
🎛 Supported CLI Arguments
Arguments can be provided via:
- A YAML config file (--config_path)
- CLI flags
- Python function parameters (when using the API)
⚠️ Boolean arguments (perform_generation, is_reference_based, resume_enabled)
These must be set explicitly to true or false in YAML, CLI, or Python.
On the CLI, use --flag True or --flag False (case-insensitive).
⚠️ Naming Convention
Parameter names use snake_case in YAML and Python, but use --kebab-case in CLI.
For example:
- YAML: perform_generation: true
- Python: perform_generation=True
- CLI: --perform-generation True
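To make the convention concrete, the sketch below turns Python-style parameters into a CLI argument list by following the rules stated above: kebab-case flags, booleans spelled out explicitly, and lists joined with commas. `to_cli_args` is a hypothetical helper written for this example and is not part of CLEAR.

```python
from typing import Any, Dict, List


def to_cli_args(params: Dict[str, Any]) -> List[str]:
    """Convert snake_case parameters into a flat list of --kebab-case CLI arguments."""
    args: List[str] = []
    for name, value in params.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, (list, tuple)):
            # List-valued parameters are passed as one comma-separated string.
            value = ",".join(str(item) for item in value)
        # str(True) is "True" and str(False) is "False", matching the explicit boolean form.
        args.extend([flag, str(value)])
    return args


print(to_cli_args({"perform_generation": False, "input_columns": ["question", "context"]}))
# ['--perform-generation', 'False', '--input-columns', 'question,context']
```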
| Argument | Description | Default |
|---|---|---|
| --config_path | Path to a YAML config file (all values loaded unless overridden by CLI args) | |
| --run_name | Unique run name (used in result file names) | |
| --data_path | Path to input CSV file | |
| --output_dir | Output directory to write results | |
| --provider | Model provider: openai, watsonx, rits | |
| --eval_model_name | Name of judge model (e.g. gpt-4o) | |
| --gen_model_name | Name of the generator model to evaluate. If not running generations - the generator name to display. | |
| --perform_generation | Whether to generate responses or use existing response column | True |
| --is_reference_based | Use reference-based evaluation (requires ground_truth column in input) | False |
| --resume_enabled | Whether to reuse intermediate outputs from previous runs stored in output_dir | True |
| --evaluation_criteria | Custom criteria dictionary for scoring individual records: {"criteria_name1":"criteria_desc1", ...} supported for yaml config and python. | None |
| --input_columns | Comma-separated list of additional input fields (other than model_input) to appear in the results and dashboard (e.g. question) | None |
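Since `evaluation_criteria` is a dictionary, the Python API is a natural place to pass it. The sketch below assumes the arguments in the table are accepted as keyword arguments by `run_clear_eval_analysis`, as the override example earlier suggests. The criteria names and descriptions, the file paths, and the wrapper `run_with_custom_criteria` are all made up for illustration.

```python
from typing import Any, Dict

# Hypothetical criteria: keys are criterion names, values describe what the judge should check.
evaluation_criteria: Dict[str, str] = {
    "correctness": "The final answer is numerically correct.",
    "reasoning": "Each step follows logically from the previous one.",
}

# Keyword arguments named after the rows of the table above; the values are placeholders.
run_kwargs: Dict[str, Any] = {
    "run_name": "my_data",
    "provider": "openai",
    "data_path": "my_data.csv",
    "output_dir": "results/my_data/",
    "perform_generation": False,   # use the existing response column
    "is_reference_based": True,    # requires a ground_truth column
    "evaluation_criteria": evaluation_criteria,
    "input_columns": ["question"],
}


def run_with_custom_criteria() -> None:
    """Start the analysis; needs clear-eval installed and provider credentials set."""
    # Imported here so the dictionaries above can be inspected without the package installed.
    from clear_eval.analysis_runner import run_clear_eval_analysis

    run_clear_eval_analysis(**run_kwargs)
```

Call `run_with_custom_criteria()` once your credentials are in place. Keeping the arguments in a plain dictionary also makes it easy to log exactly which settings a run used.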
🔑 Supported providers and credentials
Depending on your selected --provider:
| Provider | Required Environment Variables |
|---|---|
| openai | OPENAI_API_KEY, [OPENAI_API_BASE if using proxy] |
| watsonx | WATSONX_APIKEY, WATSONX_URL, WATSONX_SPACE_ID or WATSONX_PROJECT_ID |
| rits | RITS_API_KEY |
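Before starting a run, it can help to confirm that the variables for your provider are set. The following is a minimal sketch built from the table above. It reads the watsonx row as "API key and URL, plus either a space ID or a project ID", which is an interpretation, and `missing_credentials` is a hypothetical helper, not a CLEAR function.

```python
import os
from typing import Dict, List, Mapping, Sequence

# Each provider maps to a list of groups; at least one variable in every group must be set.
REQUIRED_ENV: Dict[str, List[Sequence[str]]] = {
    "openai": [("OPENAI_API_KEY",)],
    "watsonx": [
        ("WATSONX_APIKEY",),
        ("WATSONX_URL",),
        ("WATSONX_SPACE_ID", "WATSONX_PROJECT_ID"),
    ],
    "rits": [("RITS_API_KEY",)],
}


def missing_credentials(provider: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a description of every required variable group that is unset or empty."""
    if provider not in REQUIRED_ENV:
        raise ValueError(f"Unknown provider: {provider!r}")
    missing: List[str] = []
    for group in REQUIRED_ENV[provider]:
        if not any(env.get(name) for name in group):
            missing.append(" or ".join(group))
    return missing


print(missing_credentials("watsonx", {"WATSONX_APIKEY": "key", "WATSONX_URL": "https://example.invalid"}))
# ['WATSONX_SPACE_ID or WATSONX_PROJECT_ID']
```

Calling `missing_credentials("openai")` with no second argument checks the real environment. The optional `OPENAI_API_BASE` is left out because it is only needed behind a proxy.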