# LLM Scoring System

An LLM evaluation system built with Streamlit and LiteLLM. It can test different language models and uses Claude as the judge model to score answer quality.
## Using as a Python Package

### Quick Start

The system provides a generic test-runner function that is easy to integrate into your own project:
```python
from llm_ranking.eval.boolean_eval import run_test_cases
from llm_ranking.models.boolean_test_case import BooleanTestCase
from llm_ranking.models.test_result import TestCaseResult

# 1. Create test cases
test_cases = [
    BooleanTestCase(
        id="test_1",
        system_prompt="You are a helpful assistant.",
        messages=[{"role": "user", "content": "What is artificial intelligence?"}],
        model="gpt-3.5-turbo",
        judge_model="anthropic/claude-3-sonnet",
        evaluation_prompt="Check whether the response correctly explains the concept of artificial intelligence",
    )
]

# 2. Define your API-calling function
def my_api_caller(test_case: BooleanTestCase) -> str:
    """
    Custom API-calling function.

    Args:
        test_case: The test case, carrying model, messages, and other info.

    Returns:
        The model's response content.
    """
    # Implement your API-calling logic here:
    # - test_case.model decides which model to call
    # - test_case.messages holds the conversation messages

    # Example: calling the OpenAI API (SDK >= 1.0)
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model=test_case.model,
        messages=test_case.messages,
    )
    return response.choices[0].message.content

# 3. Run the tests
test_generator = run_test_cases(test_cases, my_api_caller)

# 4. Process the results
for result in test_generator:
    print(f"Test {result.id}: {'PASS' if result.is_pass else 'FAIL'}")
    print(f"Output: {result.output}")
    print(f"Reason: {result.pass_fail_reason}")
    print("-" * 50)
```
## Main Components

### The `TestCaseResult` class

```python
class TestCaseResult(BaseModel):
    id: str                 # test case ID
    output: str             # model response content
    is_pass: bool           # whether the test passed
    pass_fail_reason: str   # detailed reason for the pass/fail decision
```
### The `run_test_cases` function

```python
def run_test_cases(
    test_cases: List[BooleanTestCase],
    get_response: Callable[[BooleanTestCase], str]
) -> Generator[TestCaseResult, None, None]:
```

Parameters:

- `test_cases`: the list of test cases
- `get_response`: your API-calling function; it receives a `BooleanTestCase` and returns the response string

Returns: a generator that yields `TestCaseResult` objects one at a time.
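Because a generator is returned, results can be streamed and aggregated as they arrive. A minimal, self-contained sketch of that consumption pattern (using a stand-in dataclass and a fake runner in place of the package's `TestCaseResult` and `run_test_cases`, so it runs without the library installed):

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class FakeResult:
    # Stand-in mirroring the TestCaseResult fields documented above;
    # the real class is a pydantic BaseModel, but a dataclass suffices here.
    id: str
    output: str
    is_pass: bool
    pass_fail_reason: str

def fake_run_test_cases() -> Iterator[FakeResult]:
    # Stand-in for run_test_cases: yields one result per test case.
    yield FakeResult("test_1", "AI is ...", True, "correct explanation")
    yield FakeResult("test_2", "", False, "empty response")

# Stream the results and compute an overall pass rate.
results = list(fake_run_test_cases())
pass_rate = sum(r.is_pass for r in results) / len(results)
print(f"pass rate: {pass_rate:.0%}")  # prints "pass rate: 50%"
```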
## Advanced Usage

### Batch-testing multiple models

```python
models_to_test = ["gpt-3.5-turbo", "gpt-4", "claude-3-sonnet"]

for model in models_to_test:
    print(f"\nTesting model: {model}")
    test_cases = [
        BooleanTestCase(
            id=f"test_{model}_1",
            system_prompt="You are a helpful assistant.",
            messages=[{"role": "user", "content": "Explain quantum computing"}],
            model=model,
            judge_model="anthropic/claude-3-sonnet",
            evaluation_prompt="Check whether the basic concepts of quantum computing are explained correctly",
        )
    ]
    for result in run_test_cases(test_cases, my_api_caller):
        print(f"  Result: {'✓' if result.is_pass else '✗'} - {result.pass_fail_reason}")
```
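When sweeping several models, it helps to tally pass rates per model rather than only printing individual results. A small sketch of that aggregation step; the hard-coded booleans below stand in for the `result.is_pass` values collected from the loops above:

```python
from collections import defaultdict

def summarize(results_by_model: dict) -> dict:
    """Turn per-model lists of pass/fail booleans into pass rates."""
    return {m: sum(v) / len(v) for m, v in results_by_model.items() if v}

# Hypothetical raw outcomes, e.g. appended inside the run_test_cases loops
raw = defaultdict(list)
raw["gpt-4"].extend([True, True, False])
raw["gpt-3.5-turbo"].extend([True, False, False])

for model, rate in summarize(raw).items():
    print(f"{model}: {rate:.0%} passed")
```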
### Adding custom error handling

```python
# If you need more robust error handling
def robust_api_caller(test_case: BooleanTestCase) -> str:
    try:
        # Your API-calling logic
        return call_your_model_api(test_case)
    except Exception as e:
        # Custom error handling
        return f"Call failed: {str(e)}"

# Use your robust API-calling function
results = list(run_test_cases(test_cases, robust_api_caller))
```
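For transient failures such as rate limits or timeouts, returning an error string on the first exception wastes the test case; retrying with exponential backoff is often more useful. A sketch, where `call` stands in for whatever API function you use:

```python
import time

def call_with_retries(call, test_case, max_attempts=3, base_delay=1.0):
    """Retry call(test_case) with exponential backoff before giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call(test_case)
        except Exception as e:
            if attempt == max_attempts:
                # Out of attempts: fall back to an error string, matching
                # the robust_api_caller pattern above.
                return f"Call failed after {max_attempts} attempts: {e}"
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

It plugs into `run_test_cases` the same way: `run_test_cases(test_cases, lambda tc: call_with_retries(my_api_caller, tc))`.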
## Installation

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set environment variables: create a `.env` file and add the following:

```
OPENAI_API_KEY=your OpenAI API key
ANTHROPIC_API_KEY=your Anthropic API key
GOOGLE_API_KEY=your Google API key (if you want to use Gemini)
OPENROUTER_API_KEY=your OpenRouter API key (used for the judge model)
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
```
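Missing keys usually surface only as opaque API errors at call time, so a quick startup check can save debugging. A sketch that validates the required variables are set (assuming something like python-dotenv's `load_dotenv()` has already populated `os.environ` from the `.env` file; the exact key list is illustrative):

```python
import os

REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "OPENROUTER_API_KEY"]

def missing_keys(env=None):
    """Return the required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Example: only one key set, so the other two are reported as missing
print(missing_keys({"OPENAI_API_KEY": "sk-..."}))
```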
## Streamlit Apps

Run the main app:

```bash
streamlit run app.py
```

Run the subsidy-testing app:

```bash
streamlit run subsidy_app.py
```
## Usage

- Enter a test question in the text box
- Enter the correct answer
- Choose the language model to test from the dropdown
- Click the "Start Test" button
- Wait for the system to generate the model response and the score
## Supported Models
- GPT-3.5 Turbo
- GPT-4
- Claude-2
- Gemini Pro
- Mistral-7B-Instruct
## Exporting requirements.txt with Poetry

```bash
poetry export -f requirements.txt --output requirements.txt --without-hashes --without-urls
```