
LLM Evaluation System

This is a language model evaluation system built with Streamlit and LiteLLM. It can test different language models and uses Claude as a judge model to score the quality of their responses.

Using as a Python Package

Quick Start

The system provides a generic test-runner function that is easy to integrate into your own project:

from llm_ranking.eval.boolean_eval import run_test_cases
from llm_ranking.models.boolean_test_case import BooleanTestCase
from llm_ranking.models.test_result import TestCaseResult

# 1. Create the test cases
test_cases = [
    BooleanTestCase(
        id="test_1",
        system_prompt="You are a helpful assistant.",
        messages=[{"role": "user", "content": "What is artificial intelligence?"}],
        model="gpt-3.5-turbo",
        judge_model="anthropic/claude-3-sonnet",
        evaluation_prompt="Check whether the response correctly explains the concept of artificial intelligence"
    )
]

# 2. Define your API-calling function
def my_api_caller(test_case: BooleanTestCase) -> str:
    """
    Custom API-calling function.

    Args:
        test_case: the test case, including model, messages, and other fields

    Returns:
        The model's response content.
    """
    # Implement your API-calling logic here.
    # test_case.model selects which model to call;
    # test_case.messages holds the conversation messages.

    # Example: calling the OpenAI API (openai>=1.0 client)
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model=test_case.model,
        messages=test_case.messages,
    )
    return response.choices[0].message.content

# 3. Run the tests
test_generator = run_test_cases(test_cases, my_api_caller)

# 4. Process the results
for result in test_generator:
    print(f"Test {result.id}: {'PASS' if result.is_pass else 'FAIL'}")
    print(f"Output: {result.output}")
    print(f"Evaluation reason: {result.pass_fail_reason}")
    print("-" * 50)
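Because the generator yields results lazily, you can also fold them into a summary as they arrive instead of printing each one. A minimal sketch, using a plain stand-in class in place of TestCaseResult so it runs on its own:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Result:  # stand-in for TestCaseResult
    id: str
    is_pass: bool


def pass_rate(results: Iterable[Result]) -> float:
    """Return the fraction of results that passed (0.0 if empty)."""
    passed = total = 0
    for r in results:
        total += 1
        passed += r.is_pass
    return passed / total if total else 0.0


print(pass_rate([Result("a", True), Result("b", False)]))  # 0.5
```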

Main Components

The TestCaseResult Class

class TestCaseResult(BaseModel):
    id: str                  # test case ID
    output: str              # model response content
    is_pass: bool            # whether the test passed
    pass_fail_reason: str    # detailed reason for the pass/fail decision

The run_test_cases Function

def run_test_cases(
    test_cases: List[BooleanTestCase], 
    get_response: Callable[[BooleanTestCase], str]
) -> Generator[TestCaseResult, None, None]:

Parameters:

  • test_cases: the list of test cases
  • get_response: your API-calling function; it takes a BooleanTestCase and returns the response string

Returns: a Generator that yields TestCaseResult objects one at a time
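A runner with this signature can be pictured as a simple generator loop over the cases; the sketch below is an illustration only, with stand-in classes and a hypothetical stubbed judge in place of the package's actual judge-model call:

```python
from dataclasses import dataclass, field
from typing import Callable, Generator, List, Tuple


@dataclass
class Case:        # stand-in for BooleanTestCase
    id: str
    messages: list = field(default_factory=list)


@dataclass
class CaseResult:  # stand-in for TestCaseResult
    id: str
    output: str
    is_pass: bool
    pass_fail_reason: str


def judge(case: Case, output: str) -> Tuple[bool, str]:
    """Hypothetical judge stub: the real runner asks a judge model instead."""
    ok = bool(output.strip())
    return ok, "non-empty response" if ok else "empty response"


def run_cases(cases: List[Case],
              get_response: Callable[[Case], str]
              ) -> Generator[CaseResult, None, None]:
    # For each case: call the model, judge the output, yield one result.
    for case in cases:
        output = get_response(case)
        ok, reason = judge(case, output)
        yield CaseResult(case.id, output, ok, reason)


results = list(run_cases([Case("t1")], lambda c: "hello"))
print(results[0].is_pass)  # True
```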

Advanced Usage Examples

Batch Testing Multiple Models

models_to_test = ["gpt-3.5-turbo", "gpt-4", "claude-3-sonnet"]

for model in models_to_test:
    print(f"\nTesting model: {model}")

    test_cases = [
        BooleanTestCase(
            id=f"test_{model}_1",
            system_prompt="You are a helpful assistant.",
            messages=[{"role": "user", "content": "Explain quantum computing"}],
            model=model,
            judge_model="anthropic/claude-3-sonnet",
            evaluation_prompt="Check whether the response correctly explains the basic concepts of quantum computing"
        )
    ]

    for result in run_test_cases(test_cases, my_api_caller):
        print(f"  Result: {'✓' if result.is_pass else '✗'} - {result.pass_fail_reason}")
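Per-model results collected this way can also be condensed into a summary table. A self-contained sketch with a stand-in result class (the input dict maps each model name to its list of results):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class R:  # stand-in for TestCaseResult
    id: str
    is_pass: bool


def summarize_by_model(results_by_model: Dict[str, List[R]]) -> Dict[str, Tuple[int, int]]:
    """Map each model name to its (passed, total) counts."""
    summary = {}
    for model, results in results_by_model.items():
        passed = sum(r.is_pass for r in results)
        summary[model] = (passed, len(results))
    return summary


demo = {"gpt-4": [R("a", True), R("b", False)], "claude-3-sonnet": [R("c", True)]}
print(summarize_by_model(demo))  # {'gpt-4': (1, 2), 'claude-3-sonnet': (1, 1)}
```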

Integrating Custom Evaluation Logic

# If you need more sophisticated error handling
def robust_api_caller(test_case: BooleanTestCase) -> str:
    try:
        # Your API-calling logic
        return call_your_model_api(test_case)
    except Exception as e:
        # Custom error handling
        return f"Call failed: {str(e)}"

# Use your robust API-calling function
results = list(run_test_cases(test_cases, robust_api_caller))
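When failures are transient (rate limits, timeouts), retrying with backoff before giving up is often more useful than returning the first error. A stdlib-only sketch; `flaky` below is a hypothetical stand-in for your own API call:

```python
import time


def with_retries(call, attempts=3, base_delay=0.01):
    """Call `call()` up to `attempts` times, doubling the delay between tries."""
    for i in range(attempts):
        try:
            return call()
        except Exception as e:
            if i == attempts - 1:
                return f"Call failed: {e}"
            time.sleep(base_delay * (2 ** i))


# Demo: fails twice, then succeeds on the third attempt.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"


print(with_retries(flaky))  # ok
```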

Installation

  1. Install dependencies:
pip install -r requirements.txt
  2. Set environment variables: create a .env file and add the following:
OPENAI_API_KEY=your OpenAI API key
ANTHROPIC_API_KEY=your Anthropic API key
GOOGLE_API_KEY=your Google API key (if using Gemini)
OPENROUTER_API_KEY=your OpenRouter API key (used by the judge model)
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
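To fail fast when a required key is missing, a small stdlib-only check can run at startup; the variable names are taken from the list above, and which ones you actually require depends on the models you test:

```python
import os

REQUIRED = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "OPENROUTER_API_KEY"]


def check_env(required=REQUIRED) -> list:
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in required if not os.getenv(name)]


missing = check_env()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```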

Streamlit Applications

Run the main application:

streamlit run app.py

Run the subsidy-testing application:

streamlit run subsidy_app.py

Usage

  1. Enter the test question in the text box
  2. Enter the correct answer
  3. Select the language model to test from the dropdown menu
  4. Click the "Start Test" button
  5. Wait for the system to generate the model response and the scoring result

Supported Models

  • GPT-3.5 Turbo
  • GPT-4
  • Claude-2
  • Gemini Pro
  • Mistral-7B-Instruct

Exporting requirements.txt with Poetry

poetry export -f requirements.txt --output requirements.txt --without-hashes --without-urls
