A lightweight toolkit for evaluating LLM outputs

llm-evaluation-toolkit

A lightweight Python library for evaluating LLM outputs. It provides four metrics (BLEU, ROUGE, semantic similarity, and LLM-as-a-Judge) behind a unified API.

Why llm-evaluation-toolkit?

Existing evaluation libraries tend to be either specialized for research use or overly complex to set up. This toolkit is designed for developers who want to evaluate LLM outputs quickly within a real development workflow.

Unified API: all metrics share the same interface
Multiple providers: OpenAI and Anthropic work out of the box
No reference needed: LLM-as-a-Judge evaluates without ground-truth text
Lightweight: install only the features you need
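
The "unified API" point can be sketched as follows. This is a minimal illustration, not the toolkit's actual implementation: `Result`, `Metric`, `ExactMatch`, `LengthRatio`, and `run_all` are hypothetical names, and only the `compute(predictions, references)` call shape is taken from the examples in this README.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class Result:
    """Hypothetical result container (a stand-in for the toolkit's EvalResult)."""
    metric: str
    score: float


class Metric(Protocol):
    """Hypothetical shared interface: every metric exposes the same compute()."""
    name: str

    def compute(self, predictions: Sequence[str], references: Sequence[str]) -> Result: ...


class ExactMatch:
    """Fraction of predictions that are identical to their reference."""
    name = "exact_match"

    def compute(self, predictions, references):
        hits = sum(p == r for p, r in zip(predictions, references))
        return Result(self.name, hits / len(predictions))


class LengthRatio:
    """Mean ratio of prediction length to reference length, counted in words."""
    name = "length_ratio"

    def compute(self, predictions, references):
        ratios = [len(p.split()) / len(r.split()) for p, r in zip(predictions, references)]
        return Result(self.name, sum(ratios) / len(ratios))


def run_all(metrics, predictions, references):
    """Because the interface is shared, one loop can drive any metric."""
    return {m.name: m.compute(predictions, references) for m in metrics}


scores = run_all([ExactMatch(), LengthRatio()], ["a b", "c d e"], ["a b", "c d"])
for name, result in scores.items():
    print(f"{name}: {result.score:.2f}")
```

The benefit of a shared interface is that callers never branch on the metric type; adding a metric means adding one class with the same `compute` signature.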

Installation

# Base installation
pip install llm-evaluation-toolkit

# With OpenAI support (quotes keep shells such as zsh from expanding the brackets)
pip install "llm-evaluation-toolkit[openai]"

# With Anthropic support
pip install "llm-evaluation-toolkit[anthropic]"

# With semantic similarity support
pip install "llm-evaluation-toolkit[semantic]"

# All features
pip install "llm-evaluation-toolkit[openai,anthropic,semantic]"
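
Because features are installed as optional extras, a script may want to check which backends are importable before using them. The sketch below is an assumption-laden illustration: the README does not list the packages behind each extra, so the module names in `ASSUMED_EXTRA_MODULES` (especially `sentence_transformers`) are guesses, and `available_extras` is a hypothetical helper.

```python
import importlib.util

# Assumed mapping from extra name to importable module; not confirmed by the README.
ASSUMED_EXTRA_MODULES = {
    "openai": "openai",
    "anthropic": "anthropic",
    "semantic": "sentence_transformers",
}


def available_extras(mapping=ASSUMED_EXTRA_MODULES):
    """Return the extras whose backing module can be found by the import system."""
    return sorted(
        extra
        for extra, module in mapping.items()
        if importlib.util.find_spec(module) is not None
    )


print("Importable extras:", available_extras())
```

`importlib.util.find_spec` only locates the module without importing it, so the check stays cheap even for heavy dependencies.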

Quick Start

Basic evaluation

from llm_eval import BLEUMetric, ROUGEMetric, SemanticSimilarityMetric

predictions = [
    "The cat is on the mat",
    "A dog was running in the park",
]
references = [
    "A cat is sitting on a mat",
    "The dog ran through the park",
]

bleu = BLEUMetric()
rouge = ROUGEMetric()
semantic = SemanticSimilarityMetric()

print(bleu.compute(predictions, references))
# EvalResult(metric=bleu, score=0.4231)

print(rouge.compute(predictions, references))
# EvalResult(metric=rouge, score=0.6842)

print(semantic.compute(predictions, references))
# EvalResult(metric=semantic_similarity, score=0.8923)
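
To make overlap-based scores easier to interpret, here is a sketch of the unigram-overlap idea behind ROUGE-1. It is a simplified illustration (lowercasing, whitespace tokenization, no stemming) and not the toolkit's implementation, so it is not expected to reproduce the scores shown above. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """F1 over unigram overlap between one prediction and one reference."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("The cat is on the mat", "A cat is sitting on a mat")
print(f"{score:.4f}")  # 4 shared tokens, 6 predicted, 7 reference -> 8/13
```

Scores of this kind reward shared wording only, which is why a paraphrase with different vocabulary can score low here while scoring high on semantic similarity.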

Evaluate without reference texts (LLM-as-a-Judge)

from llm_eval import LLMJudgeMetric

# Evaluate output quality without any reference text
judge = LLMJudgeMetric(judge_model="gpt-4o-mini")

questions = ["What is the capital of Japan?"]
answers = ["The capital of Japan is Tokyo. It serves as the country's political, economic, and cultural center."]

result = judge.compute(answers, questions)
print(result)
# EvalResult(metric=llm_judge, score=0.9)
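
One common way for a reference-free judge to arrive at a score between 0 and 1 is to ask the judge model for a rating on a fixed scale and then normalize it. The sketch below illustrates that pattern only. The README does not describe how `LLMJudgeMetric` builds its prompt or computes its score, so the prompt wording, the 1-to-10 scale, and the helpers `build_judge_prompt` and `parse_score` are all assumptions. No API call is made; the model reply is a hard-coded string.

```python
import re
from typing import Optional


def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a hypothetical grading prompt for the judge model."""
    return (
        "Rate the answer to the question on a scale from 1 to 10.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply in the form 'Score: <number>'."
    )


def parse_score(reply: str, scale_max: int = 10) -> Optional[float]:
    """Extract the rating from the reply and normalize it to the range (0, 1]."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None  # The judge did not follow the requested format.
    raw = min(max(int(match.group(1)), 1), scale_max)  # Clamp out-of-range ratings.
    return raw / scale_max


prompt = build_judge_prompt("What is the capital of Japan?", "Tokyo.")
fake_reply = "Score: 9"  # Stand-in for a real judge-model response.
print(parse_score(fake_reply))
```

Returning `None` for unparseable replies lets the caller decide whether to retry or skip the sample instead of silently recording a zero.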

Run multiple metrics at once

from llm_eval import BaseEvaluator, BLEUMetric, ROUGEMetric, SemanticSimilarityMetric

evaluator = BaseEvaluator(metrics=[
    BLEUMetric(),
    ROUGEMetric(),
    SemanticSimilarityMetric(),
])

# Reuses `predictions` and `references` from the basic evaluation example
results = evaluator.evaluate(predictions, references)
for metric_name, result in results.items():
    print(f"{metric_name}: {result.score:.4f}")

Use with LLM providers

from llm_eval import OpenAIProvider, GenerationConfig, BLEUMetric

# Generate predictions with an LLM and evaluate them directly
provider = OpenAIProvider(
    model="gpt-4o-mini",
    config=GenerationConfig(temperature=0.0, max_tokens=256),
)

questions = ["What is machine learning?", "Explain neural networks."]
predictions = provider.generate_batch(questions)

references = [
    "Machine learning is a field of AI that learns from data.",
    "A neural network is a computing system modeled on the human brain.",
]

result = BLEUMetric().compute(predictions, references)
print(result)

Load benchmark datasets

from llm_eval import DatasetLoader

# Use built-in benchmarks
squad = DatasetLoader.load_squad(max_samples=50)
print(squad)
# EvalDataset(name=squad, size=50)

# Use your own data
dataset = DatasetLoader.from_dict(
    name="my_dataset",
    questions=["Question 1", "Question 2"],
    references=["Answer 1", "Answer 2"],
)
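
When your own data lives in a file, it first has to be turned into the parallel `questions` and `references` lists that `from_dict` takes in the example above. The sketch below assumes a JSON Lines layout with `question` and `reference` keys; that layout and the `read_qa_jsonl` helper are hypothetical and not part of the toolkit.

```python
import json


def read_qa_jsonl(lines):
    """Split JSON Lines records into parallel question and reference lists."""
    questions, references = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines between records.
        record = json.loads(line)
        questions.append(record["question"])
        references.append(record["reference"])
    return questions, references


sample = [
    '{"question": "Question 1", "reference": "Answer 1"}',
    "",
    '{"question": "Question 2", "reference": "Answer 2"}',
]
questions, references = read_qa_jsonl(sample)
print(questions, references)
```

The function accepts any iterable of strings, so an open file handle can be passed in place of the `sample` list.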

Supported Metrics

Metric	Reference required	Best for
BLEU	Yes	Translation, fixed-format generation
ROUGE	Yes	Summarization
Semantic Similarity	Yes	Paraphrase-heavy tasks
LLM-as-a-Judge	No	Open-ended generation

Supported Providers

Provider	Install	Example models
OpenAI	`pip install "llm-evaluation-toolkit[openai]"`	gpt-4o, gpt-4o-mini
Anthropic	`pip install "llm-evaluation-toolkit[anthropic]"`	claude-opus-4-6, claude-haiku-4-5

Environment Variables

OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
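
A missing key usually surfaces only when the first API request fails, so it can help to check for it up front. The sketch below uses the two variable names listed above; the `missing_api_keys` helper and the provider-to-variable mapping are hypothetical and not part of the toolkit.

```python
import os

# Variable names come from this README; the mapping itself is an assumption.
REQUIRED_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_api_keys(providers, environ=None):
    """Return the environment variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [REQUIRED_KEYS[p] for p in providers if not env.get(REQUIRED_KEYS[p])]


missing = missing_api_keys(["openai", "anthropic"])
if missing:
    print("Set these variables before calling a provider:", ", ".join(missing))
```

Passing a plain dictionary as `environ` makes the check easy to exercise in tests without touching the real environment.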

Development Setup

git clone https://github.com/swoswoyuu1156/llm-evaluation-toolkit.git
cd llm-evaluation-toolkit
python -m venv .venv

# Mac/Linux
source .venv/bin/activate
# Windows
.venv\Scripts\activate

pip install -e ".[dev]"

# Run tests
pytest tests/ -v --cov=src/llm_eval

# Run linter
ruff check src/ tests/

Contributing

Contributions are welcome! Please read CONTRIBUTING.md first.

License

MIT License. See LICENSE for details.

Release history

0.1.0 (Jun 12, 2026)

Download files

Source Distribution

llm_evaluation_toolkit-0.1.0.tar.gz (17.3 kB)

Uploaded Jun 12, 2026 Source

Built Distribution

llm_evaluation_toolkit-0.1.0-py3-none-any.whl (17.0 kB)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file llm_evaluation_toolkit-0.1.0.tar.gz.

File metadata

Download URL: llm_evaluation_toolkit-0.1.0.tar.gz
Upload date: Jun 12, 2026
Size: 17.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for llm_evaluation_toolkit-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`70e98f6abb0a8722ee42278dd5d69f9e8cb12479defa2d43229638c3f4ba4dfe`
MD5	`4f4ac76fe72d7607417607fc1e7487c2`
BLAKE2b-256	`ab934c3c94c460ef3eaf4e40ba2016047de417839dca6e1d297f1c0d5f102e9e`

File details

Details for the file llm_evaluation_toolkit-0.1.0-py3-none-any.whl.

File metadata

Download URL: llm_evaluation_toolkit-0.1.0-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 17.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for llm_evaluation_toolkit-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a8fda0dc3c1e627eb4dd9f823266511d72c123010015cdf8ef42dbb0104c9fe3`
MD5	`1f5767f1f39b90d7c60c838808228a60`
BLAKE2b-256	`e652fb646e26e508b263d2a85b0d70972280e4f017ad5df9a2b522b881f049ef`
