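LLM Scoring System

This is a language model evaluation system built with Streamlit and LiteLLM. It can test different language models and uses Claude as the judge model to assess the quality of their responses.

Using it as a Python package

Quick start

The system provides a general-purpose test runner function that is easy to integrate into your own project: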

import asyncio

import openai  # only needed for the example caller below

from llm_ranking.eval.boolean_eval import run_test_cases
from llm_ranking.models.boolean_test_case import BooleanTestCase
from llm_ranking.models.test_result import TestCaseResult
from llm_ranking.utils.boolean_test_case_reader import get_bool_test_case_eval_prompt

# 1. Create the test cases
test_cases = [
    BooleanTestCase(
        id="test_1",
        messages=[{"role": "user", "content": "What is artificial intelligence?"}],
        judge_model="google/gemini-2.5-flash",
        # evaluation_prompt can be written by hand or generated with get_bool_test_case_eval_prompt
        evaluation_prompt=get_bool_test_case_eval_prompt(
            pass_criteria="Correctly explains the basic concepts and applications of artificial intelligence",
            fail_criteria="The response is inaccurate, too brief, or contains incorrect information"
        )
    )
]

# 2. Define your async API caller
async def my_api_caller(test_case: BooleanTestCase) -> str:
    """
    Custom API caller.

    Args:
        test_case: The test case, including model, messages, and related fields.

    Returns:
        The content of the model's response.
    """
    # Implement your own API call logic here.
    # test_case.model can be used to decide which model to call.
    # test_case.messages holds the conversation messages.

    # Example: calling the OpenAI API asynchronously
    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(
        model=test_case.model or "gpt-3.5-turbo",
        messages=test_case.messages
    )
    # content can be None, so fall back to an empty string
    return response.choices[0].message.content or ""

# 3. Run the tests (async)
async def run_tests():
    async_generator = run_test_cases(test_cases, my_api_caller)
    async for result in async_generator:
        print(f"Test {result.id}: {'passed' if result.is_pass else 'failed'}")
        print(f"Output: {result.output}")
        print(f"Reason: {result.pass_fail_reason}")
        print("-" * 50)

# 4. Run the async function
asyncio.run(run_tests())

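Generating evaluation prompts

The get_bool_test_case_eval_prompt function

from llm_ranking.utils.boolean_test_case_reader import get_bool_test_case_eval_prompt

# Generate a standardized evaluation prompt
eval_prompt = get_bool_test_case_eval_prompt(
    pass_criteria="The response must include a correct technical explanation and practical examples",
    fail_criteria="The response contains incorrect information, is too brief, or is off-topic"
)

print(eval_prompt)
# Prints a formatted Chinese evaluation prompt containing the pass and fail criteria

Parameters:

pass_criteria: the criteria for passing the test
fail_criteria: the criteria for failing the test

Returns: a formatted Chinese evaluation prompt string that can be used directly as the evaluation_prompt field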
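To make the role of the two criteria concrete, the following is a minimal sketch of how a prompt builder of this kind could combine them. It is not the library's implementation: `build_eval_prompt` and the English template wording are assumptions made for illustration, and the real function returns a Chinese prompt whose exact text is not reproduced here.

```python
def build_eval_prompt(pass_criteria: str, fail_criteria: str) -> str:
    """Hypothetical stand-in for a pass/fail prompt builder (not the library's code)."""
    if not pass_criteria.strip() or not fail_criteria.strip():
        raise ValueError("both pass_criteria and fail_criteria are required")
    return (
        "Judge the model response against the criteria below.\n"
        f"Pass criteria: {pass_criteria.strip()}\n"
        f"Fail criteria: {fail_criteria.strip()}\n"
        "Answer with pass or fail, followed by a short reason."
    )


prompt = build_eval_prompt(
    pass_criteria="Explains the concept correctly and gives an example",
    fail_criteria="Contains errors or is off-topic",
)
print(prompt)
```

Keeping both criteria explicit in the prompt gives the judge model a clear basis for the boolean verdict and for the reason it reports.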

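Main components

The TestCaseResult class

The test result object yielded by run_test_cases, containing the full details of a test run:

class TestCaseResult(BaseModel):
    id: str                    # Test case ID
    output: str               # Model response returned by your API caller
    is_pass: bool            # Evaluation result: whether the test passed (True/False)
    pass_fail_reason: str    # Detailed pass/fail reason given by the judge model

Fields:

id: the unique identifier of the corresponding test case
output: the raw model response returned by your get_response function
is_pass: the verdict reached by the judge model (judge_model) based on evaluation_prompt
pass_fail_reason: the judge model's explanation of why the test passed or failed

Usage example:

async for result in run_test_cases(test_cases, my_api_caller):
    print(f"Test ID: {result.id}")
    print(f"Model response: {result.output}")
    print(f"Result: {'✓ passed' if result.is_pass else '✗ failed'}")
    print(f"Reason: {result.pass_fail_reason}")
    print("-" * 50)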
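Once results have been collected, they can be aggregated using only the four fields described above. The sketch below uses `ResultStub`, a hypothetical dataclass that mirrors those fields, so that it runs without the package installed; `summarize` is likewise an illustrative helper, not part of the library.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ResultStub:
    """Stand-in with the same four fields as TestCaseResult (assumption for this sketch)."""
    id: str
    output: str
    is_pass: bool
    pass_fail_reason: str


def summarize(results: Iterable[ResultStub]) -> dict:
    """Return total count, pass count, pass rate, and the IDs of failed cases."""
    results = list(results)
    passed = sum(1 for r in results if r.is_pass)
    return {
        "total": len(results),
        "passed": passed,
        "pass_rate": passed / len(results) if results else 0.0,
        "failed_ids": [r.id for r in results if not r.is_pass],
    }


sample = [
    ResultStub("test_1", "answer A", True, "meets the pass criteria"),
    ResultStub("test_2", "answer B", False, "too brief"),
]
summary = summarize(sample)
print(summary)
```

The same function should work on real result objects, since it only reads `id` and `is_pass`.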

The run_test_cases function

async def run_test_cases(
    test_cases: List[BooleanTestCase], 
    get_response: Callable[[BooleanTestCase], Awaitable[str]]
) -> AsyncGenerator[TestCaseResult, None]:

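Parameters:

test_cases: the list of test cases
get_response: your async API caller, which takes a BooleanTestCase and returns the response string

Returns: an AsyncGenerator that yields TestCaseResult objects one at a time

Important: as of version 5.8.24, both run_test_cases and get_response must be async. Consume run_test_cases with async for, and pass an async function as get_response.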
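If an existing caller is synchronous, one way to meet the async requirement is to wrap it rather than rewrite it. The sketch below shows that idea with the standard library's `asyncio.to_thread` (Python 3.9+). `to_async` and `legacy_sync_caller` are hypothetical names, and a plain string stands in for a BooleanTestCase.

```python
import asyncio
from typing import Any, Awaitable, Callable


def to_async(sync_fn: Callable[[Any], str]) -> Callable[[Any], Awaitable[str]]:
    """Wrap a blocking function so it can be awaited without blocking the event loop."""
    async def wrapper(test_case: Any) -> str:
        # asyncio.to_thread runs the blocking call in a worker thread.
        return await asyncio.to_thread(sync_fn, test_case)
    return wrapper


def legacy_sync_caller(test_case: Any) -> str:
    """Hypothetical blocking caller, e.g. one built on a synchronous HTTP client."""
    return f"echo: {test_case}"


async_caller = to_async(legacy_sync_caller)
print(asyncio.run(async_caller("hello")))
```

The wrapped function has the `Callable[[BooleanTestCase], Awaitable[str]]` shape shown in the signature above, so it could be passed as get_response.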

Advanced usage examples

Batch testing multiple models

models_to_test = ["gpt-3.5-turbo", "gpt-4", "claude-3-sonnet"]

async def test_multiple_models():
    for model in models_to_test:
        print(f"\nTesting model: {model}")

        test_cases = [
            BooleanTestCase(
                id=f"test_{model}_1",
                system_prompt="You are a helpful assistant.",
                messages=[{"role": "user", "content": "Explain quantum computing"}],
                model=model,
                judge_model="google/gemini-2.5-flash",
                evaluation_prompt=get_bool_test_case_eval_prompt(
                    pass_criteria="Correctly explains the basic principles and applications of quantum computing",
                    fail_criteria="The response is inaccurate or missing important concepts"
                )
            )
        ]

        async for result in run_test_cases(test_cases, my_api_caller):
            print(f"  Result: {'✓' if result.is_pass else '✗'} - {result.pass_fail_reason}")

# Run the tests
asyncio.run(test_multiple_models())
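The loop above tests the models one after another. Because the runner is async, the per-model runs could also be started concurrently with `asyncio.gather`. The sketch below illustrates the pattern only: `fake_run` is a stub async generator standing in for run_test_cases, and its pass/fail values are made up. Whether concurrent runs are appropriate depends on your API rate limits.

```python
import asyncio
from typing import AsyncGenerator, Dict, List, Tuple


async def fake_run(model: str) -> AsyncGenerator[Tuple[str, bool], None]:
    """Stub async generator standing in for run_test_cases (assumption)."""
    for i in range(3):
        await asyncio.sleep(0)  # yield control, as a real API call would
        yield f"test_{model}_{i}", (i % 2 == 0)


async def pass_count(model: str) -> Tuple[str, int]:
    """Consume one model's results and count how many cases passed."""
    passed = 0
    async for _case_id, is_pass in fake_run(model):
        passed += int(is_pass)
    return model, passed


async def compare(models: List[str]) -> Dict[str, int]:
    """Run all models concurrently and map each model name to its pass count."""
    pairs = await asyncio.gather(*(pass_count(m) for m in models))
    return dict(pairs)


scores = asyncio.run(compare(["model-a", "model-b"]))
print(scores)
```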

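Integrating custom evaluation logic

# If you need more elaborate error handling
async def robust_api_caller(test_case: BooleanTestCase) -> str:
    try:
        # Your async API call logic; call_your_model_api is a placeholder for your own function
        return await call_your_model_api(test_case)
    except Exception as e:
        # Custom error handling: the error text is returned as the output and sent to the judge
        return f"Call failed: {str(e)}"

# Use your robust async API caller
async def collect_results():
    results = []
    async for result in run_test_cases(test_cases, robust_api_caller):
        results.append(result)
    return results

results = asyncio.run(collect_results())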
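Error handling in a caller can go one step further by retrying transient failures and bounding each attempt with a timeout. The following is a minimal sketch using only the standard library. `call_with_retry` and `flaky_caller` are hypothetical names, the default values are arbitrary, and a plain string stands in for a BooleanTestCase.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def call_with_retry(
    caller: Callable[[Any], Awaitable[str]],
    test_case: Any,
    attempts: int = 3,
    timeout_s: float = 5.0,
    base_delay_s: float = 0.01,
) -> str:
    """Await caller(test_case), retrying on errors or timeouts with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(caller(test_case), timeout=timeout_s)
        except Exception as exc:  # includes asyncio.TimeoutError
            last_error = exc
            if attempt < attempts - 1:
                await asyncio.sleep(base_delay_s * 2 ** attempt)
    # Mirror the example above: report the failure as the output string.
    return f"call failed after {attempts} attempts: {last_error}"


calls = {"count": 0}


async def flaky_caller(test_case: Any) -> str:
    """Hypothetical caller that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient error")
    return f"ok: {test_case}"


print(asyncio.run(call_with_retry(flaky_caller, "case-1")))
```

A wrapper such as `lambda tc: call_with_retry(my_api_caller, tc)` would keep the single-argument shape that get_response expects.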

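Installation

Install the dependencies:

pip install -r requirements.txt

Set the environment variables by creating a .env file with the following content:

OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
# Only needed if you want to use Gemini
GOOGLE_API_KEY=your-google-api-key
# Used for the judge model
OPENROUTER_API_KEY=your-openrouter-api-key
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1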
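A missing key usually only shows up as an authentication error partway through a run, so it can help to check the environment up front. The sketch below assumes the variables have already been loaded into the process environment (how the application loads .env is not shown here). `missing_keys` is a hypothetical helper, and which keys you actually need depends on the models you call.

```python
import os
from typing import List, Mapping


def missing_keys(keys: List[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in keys that are unset or blank in env."""
    return [k for k in keys if not env.get(k, "").strip()]


# A plain dict stands in for the real environment in this example.
example_env = {"OPENAI_API_KEY": "sk-example", "OPENROUTER_API_KEY": ""}
needed = ["OPENAI_API_KEY", "OPENROUTER_API_KEY", "OPENROUTER_BASE_URL"]
print(missing_keys(needed, example_env))
```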

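Streamlit applications

Run the main application:

streamlit run app.py

Run the subsidy test application:

streamlit run subsidy_app.py

Usage

Enter a test question in the text box
Enter the correct answer
Select the language model to test from the drop-down menu
Click the "Start test" button
Wait for the system to generate the model response and the scoring result

Supported models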

GPT-3.5 Turbo
GPT-4
Claude-2
Gemini Pro
Mistral-7B-Instruct

poetry export requirements.txt

poetry export -f requirements.txt --output requirements.txt --without-hashes --without-urls

Release history

5.8.25 (this version): Aug 2, 2025
5.8.24: Aug 2, 2025
5.8.23: Aug 2, 2025
5.8.22: Aug 2, 2025
5.8.21: Aug 2, 2025

botrun-llm-ranking 5.8.25
