Haerae-Evaluation-Toolkit

A comprehensive, standardized validation toolkit for Korean Large Language Models (LLMs).

Haerae-Evaluation-Toolkit is an emerging open-source Python library designed to streamline and standardize the evaluation of Large Language Models (LLMs), with a focus on Korean.

Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models (paper: https://arxiv.org/abs/2503.22968)

✨ Key Features

  • Multiple Evaluation Methods

    • Logit-Based, String-Match, Partial-Match, LLM-as-a-Judge, and more.
  • Reasoning Chain Analysis

    • Dedicated support for analyzing extended Korean chain-of-thought reasoning.
  • Extensive Korean Datasets

    • Includes HAE-RAE Bench, KMMLU, KUDGE, CLIcK, K2-Eval, HRM8K, BenchHub, KorMedQA, KBL, and more.
  • Scalable Inference-Time Techniques

    • Best-of-N, Majority Voting, Beam Search, and other advanced methods.
  • Integration-Ready

    • Supports OpenAI-compatible endpoints, Hugging Face, and LiteLLM.
  • Flexible and Pluggable Architecture

    • Easily extend with new datasets, evaluation metrics, and inference backends.

🚀 Project Status

We are actively developing core features and interfaces. Current goals include:

  • Unified API

    • Seamless loading and integration of diverse Korean benchmark datasets.
  • Configurable Inference Scaling

    • Generate higher-quality outputs through techniques like best-of-N and beam search.
  • Pluggable Evaluation Methods

    • Enable chain-of-thought assessments, logit-based scoring, and standard evaluation metrics.
  • Modular Architecture

    • Easily extendable for new backends, tasks, or custom evaluation logic.

🛠️ Key Components

  • Dataset Abstraction

    • Load and preprocess your datasets (or subsets) with minimal configuration.
  • Scalable Methods

    • Apply decoding strategies such as sampling, beam search, and best-of-N approaches.
  • Evaluation Library

    • Compare predictions to references, use judge models, or create custom scoring methods.
  • Registry System

    • Add new components (datasets, models, scaling methods) via simple decorator-based registration, as sketched below.
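
As a rough sketch of this registration pattern, a custom dataset could be added along the following lines. The decorator name register_dataset, the module path llm_eval.datasets, and the BaseDataset base class are illustrative assumptions, not the toolkit's confirmed API:

# Hypothetical sketch of decorator-based registration; names below are
# assumptions for illustration, not the toolkit's confirmed API.
from llm_eval.datasets import BaseDataset, register_dataset

@register_dataset("my_korean_benchmark")  # hypothetical decorator name
class MyKoreanBenchmark(BaseDataset):
    def load(self):
        # Return evaluation samples as input/reference pairs.
        return [{"input": "질문 ...", "reference": "정답 ..."}]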

⚙️ Installation

  1. Clone the repository:

    git clone https://github.com/HAE-RAE/haerae-evaluation-toolkit.git
    cd haerae-evaluation-toolkit
    
  2. (Optional) Create and activate a virtual environment:

    • Using venv:
      python -m venv venv
      source venv/bin/activate  # On Windows use `venv\Scripts\activate`
      
    • Using Conda:
      conda create -n hret python=3.11 -y
      conda activate hret
      
  3. Install dependencies: Choose one of the following methods:

    • Using pip:

      pip install -r requirements.txt
      
    • Using uv (Recommended for speed):

      • First, install uv if you haven't already. See uv installation guide.
      • Then, install dependencies using uv:
        uv pip install -r requirements.txt
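
After installation, a quick import check should succeed (the llm_eval package name follows the imports used in the quickstart below):

# Quick sanity check: these imports should succeed after installation.
import llm_eval
from llm_eval.evaluator import Evaluator

print("haerae-evaluation-toolkit imported successfully")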
        


🚀 Quickstart: Using the Evaluator API

Below is a minimal example of how to use the Evaluator interface to load a dataset, apply a model and (optionally) a scaling method, and then evaluate the outputs.

For more detailed instructions on getting up and running, see tutorial/kor(eng)/quick_start.md.

Python Usage

from llm_eval.evaluator import Evaluator

# 1) Initialize an Evaluator with default parameters (optional).
evaluator = Evaluator()

# 2) Run the evaluation pipeline
results = evaluator.run(
    model="huggingface",                        # or "litellm", "openai", etc.
    judge_model=None,                           # specify e.g. "huggingface_judge" if needed
    reward_model=None,                          # specify e.g. "huggingface_reward" if needed
    dataset="haerae_bench",                     # or "kmmlu", "qarv", ...
    subset=["csat_geo", "csat_law"],            # optional subset(s)
    split="test",                               # "train"/"validation"/"test"
    dataset_params={"revision":"main"},         # example HF config
    model_params={"model_name_or_path":"gpt2"}, # example HF Transformers param
    judge_params={},                            # params for judge model (if judge_model is not None)
    reward_params={},                           # params for reward model (if reward_model is not None)
    scaling_method=None,                        # or "beam_search", "best_of_n"
    scaling_params={},                          # e.g., {"beam_size":3, "num_iterations":5}
    evaluator_params={}                         # e.g., custom evaluation settings
)
  • Dataset is loaded from the registry (e.g., haerae_bench is just one of many).
  • Model is likewise loaded via the registry (huggingface, litellm, etc.).
  • judge_model and reward_model can be provided if you want LLM-as-a-Judge or reward-model logic; if both are None, the system uses a single model backend (a sketch follows this list).
  • ScalingMethod is optional if you want specialized decoding.
  • EvaluationMethod (e.g., string_match, log_likelihood, partial_match, or llm_judge) measures performance.
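
For example, enabling LLM-as-a-Judge alongside beam-search decoding only requires filling in the corresponding arguments. The call below reuses parameter names and values already shown in this README; the judge_params backbone is a placeholder:

# Sketch: the same pipeline with a judge model and beam search enabled.
# Parameter names follow evaluator.run() above; judge_params contents
# are illustrative placeholders.
results = evaluator.run(
    model="huggingface",
    model_params={"model_name_or_path": "gpt2"},
    judge_model="huggingface_judge",
    judge_params={"model_name_or_path": "gpt2"},  # placeholder judge backbone
    dataset="haerae_bench",
    subset=["csat_geo"],
    split="test",
    scaling_method="beam_search",
    scaling_params={"beam_size": 3, "num_iterations": 5},
)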

CLI Usage

We also provide a simple command-line interface (CLI) via evaluator.py:

python llm_eval/evaluator.py \
  --model huggingface \
  --judge_model huggingface_judge \
  --reward_model huggingface_reward \
  --dataset haerae_bench \
  --subset csat_geo \
  --split test \
  --scaling_method beam_search \
  --evaluation_method string_match \
  --model_params '{"model_name_or_path": "gpt2"}' \
  --scaling_params '{"beam_size":3, "num_iterations":5}' \
  --output_file results.json

This command will:

  1. Load the haerae_bench (subset=csat_geo) test split.
  2. Create a MultiModel internally with:
     • Generation model: huggingface → gpt2
     • Judge model: huggingface_judge (if you pass relevant judge_params)
     • Reward model: huggingface_reward (if you pass relevant reward_params)
  3. Apply Beam Search (beam_size=3).
  4. Evaluate final outputs via string_match.
  5. Save the resulting JSON file to results.json (a loading sketch follows below).
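
The saved file can be inspected like any JSON document; the exact schema depends on the dataset and evaluation method, so the snippet below only peeks at the top-level structure:

import json

# Load the saved evaluation output and inspect its top-level structure;
# the exact schema depends on the evaluation method used.
with open("results.json", encoding="utf-8") as f:
    results = json.load(f)

if isinstance(results, dict):
    print(list(results))  # top-level keys
else:
    print(type(results), len(results))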

Configuration File

Instead of passing many arguments, the entire pipeline can be described in a single YAML file. Create evaluator_config.yaml:

dataset:
  name: haerae_bench
  split: test
  params: {}
model:
  name: huggingface
  params:
    model_name_or_path: gpt2
evaluation:
  method: string_match
  params: {}
language_penalize: true
target_lang: ko
few_shot:
  num: 0

Run the configuration with:

from llm_eval.evaluator import run_from_config

result = run_from_config("evaluator_config.yaml")

See examples/evaluator_config.yaml for a full template including judge, reward, and scaling options.


🎯 HRET API: MLOps-Friendly Interface

For production environments and MLOps integration, we provide HRET (Haerae Evaluation Toolkit), a decorator-based API inspired by deepeval that makes LLM evaluation seamless and integration-ready.

Quick Start with HRET

import llm_eval.hret as hret

# Simple decorator-based evaluation
@hret.evaluate(dataset="kmmlu", model="huggingface")
def my_model(input_text: str) -> str:
    # `model` here stands for your own model handle or client.
    return model.generate(input_text)

# Run evaluation
result = my_model()
print(f"Accuracy: {result.metrics['accuracy']}")

Key HRET Features

  • 🎨 Decorator-Based API: @hret.evaluate, @hret.benchmark, @hret.track_metrics
  • 🔧 Context Managers: Fine-grained control with hret.evaluation_context()
  • 📊 MLOps Integration: Built-in support for MLflow, Weights & Biases, and custom loggers
  • ⚙️ Configuration Management: YAML/JSON config files and global settings
  • 📈 Metrics Tracking: Cross-run comparison and performance monitoring (sketched after this list)
  • 🚀 Production Ready: Designed for training pipelines, A/B testing, and continuous evaluation
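
As a sketch of metrics tracking: @hret.track_metrics comes from the feature list above, but its exact signature is not documented here, so the bare-decorator usage and the wrapped function's shape are assumptions:

# Sketch: track metrics across repeated runs. The decorator name comes
# from the feature list above; the usage shape is an assumption.
@hret.track_metrics
def nightly_eval():
    # quick_eval appears in the configuration example below.
    return hret.quick_eval(my_model_function)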

Advanced Usage Examples

Model Benchmarking

@hret.benchmark(dataset="kmmlu")
def compare_models():
    return {
        "gpt-4": lambda x: gpt4_model.generate(x),
        "claude-3": lambda x: claude_model.generate(x),
        "custom": lambda x: custom_model.generate(x)
    }

results = compare_models()
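
Assuming the returned mapping pairs each model name with a result object like the one in the quick start (an assumption about @hret.benchmark's return shape), the comparison can be summarized as:

# Summarize benchmark results; .metrics mirrors the accuracy example
# in the quick start, and the return shape is an assumption.
for name, res in results.items():
    print(f"{name}: accuracy={res.metrics.get('accuracy')}")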

MLOps Integration

with hret.evaluation_context(dataset="kmmlu") as ctx:
    # Add MLOps integrations
    ctx.log_to_mlflow(experiment_name="llm_experiments")
    ctx.log_to_wandb(project_name="model_evaluation")

    # Run evaluation
    result = ctx.evaluate(my_model_function)

Training Pipeline Integration

class ModelTrainingPipeline:
    def evaluate_checkpoint(self, epoch):
        with hret.evaluation_context(
            run_name=f"checkpoint_epoch_{epoch}"
        ) as ctx:
            ctx.log_to_mlflow(experiment_name="training")
            result = ctx.evaluate(self.model.generate)

            # detect_degradation and send_alert are user-defined hooks,
            # not part of the toolkit.
            if self.detect_degradation(result):
                self.send_alert(epoch, result)

Configuration Management

Create hret_config.yaml:

default_dataset: "kmmlu"
default_model: "huggingface"
mlflow_tracking: true
wandb_tracking: true
output_dir: "./results"
auto_save_results: true

Load and use:

hret.load_config("hret_config.yaml")
result = hret.quick_eval(my_model_function)
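
As in the quick start, the returned result object exposes its scores via .metrics; which keys appear depends on the configured evaluation method:

print(result.metrics)  # e.g., {"accuracy": ...}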

Backward Compatibility

HRET maintains full backward compatibility with the existing Evaluator API while providing a modern, MLOps-friendly interface for production deployments.


🤝 Contributing & Contact

We welcome collaborators, contributors, and testers interested in advancing LLM evaluation methods, especially for Korean language tasks.

📩 Contact Us

We look forward to hearing your ideas and contributions!



📝 Citation

If you find HRET useful in your research, please consider citing our paper:

@misc{lee2025redefiningevaluationstandardsunified,
      title={Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models},
      author={Hanwool Lee and Dasol Choi and Sooyong Kim and Ilgyun Jung and Sangwon Baek and Guijin Son and Inseon Hwang and Naeun Lee and Seunghyeok Hong},
      year={2025},
      eprint={2503.22968},
      archivePrefix={arXiv},
      primaryClass={cs.CE},
      url={https://arxiv.org/abs/2503.22968},
}

📜 License

Licensed under the Apache License 2.0.

© 2025 The HAE-RAE Team. All rights reserved.
