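A human-machine evaluation tool for large language models in the telecom operator domain, supporting evaluation of both local and API models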

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

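📡televal: Human-Machine Evaluation Tool for Telecom Operators

televal is a toolkit for evaluating how large language models (LLMs) perform on tasks in the telecom operator domain. It supports inference with both local models and API models, includes built-in evaluation methods for objective and subjective questions, and outputs visualized results broken down by question type.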

📦Installation

pip install televal

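📁Required project directory structure

your_project/
├── datas/                  # Datasets to evaluate (e.g. A.json / B.json)
├── outputs/                # Generated automatically; stores inference and evaluation results
├── configs/
│   ├── model_configs.py    # Names and paths of local models
│   └── api_models.py       # API models and the judge LLM used for subjective evaluation
└── televal/                # The installed main package (no manual changes needed)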
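
Before running an evaluation, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of televal: the helper name `missing_layout_entries` is hypothetical, and treating all three entries as required is an assumption (the local-model config may be unnecessary for API-only runs).

```python
from pathlib import Path

# Entries taken from the layout above. outputs/ is generated automatically,
# so it is not checked. Treating all three as required is an assumption.
REQUIRED_PATHS = ("datas", "configs/model_configs.py", "configs/api_models.py")


def missing_layout_entries(project_root):
    """Return the entries from REQUIRED_PATHS that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]


# Check the current working directory and report anything that is absent.
missing = missing_layout_entries(".")
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("Project layout looks complete.")
```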

📄Data preparation

Place the evaluation dataset (e.g. A.json) in the your_project/datas/ directory. A sample dataset is provided for verifying that the tool works.

🧠Model configuration

1. Local models (optional)

In configs/model_configs.py, add each model's name and path:

model_configs = {
    'qwen2.5-7b-instruct': '/path/to/Qwen2.5-7B-Instruct',
    'qwen2.5-14b-instruct': '/path/to/Qwen2.5-14B-Instruct',
}
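
A mistyped path in this dictionary would likely only surface once model loading starts, so a quick check up front may save time. The sketch below is an illustration, not a televal feature: `find_missing_model_paths` is a hypothetical helper, and it assumes each path points to a checkpoint directory.

```python
import os


def find_missing_model_paths(model_configs):
    """Return {model_name: path} for entries whose path is not an existing directory."""
    return {
        name: path
        for name, path in model_configs.items()
        if not os.path.isdir(path)
    }


# Placeholder paths from the example above; replace them with real directories.
model_configs = {
    'qwen2.5-7b-instruct': '/path/to/Qwen2.5-7B-Instruct',
    'qwen2.5-14b-instruct': '/path/to/Qwen2.5-14B-Instruct',
}

for name, path in find_missing_model_paths(model_configs).items():
    print(f"{name}: directory not found at {path}")
```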

2. API models (optional)

Define two functions in configs/api_models.py:

from openai import OpenAI

# ✅ Required: judge model (used to evaluate subjective questions)
def model_eval(prompt):
    client = OpenAI(api_key="YOUR_KEY", base_url="https://api.example.com")
    chat_completion = client.chat.completions.create(
        messages=[{ "role": "user", "content": prompt }],
        model="DeepSeek-V3",
    )
    return chat_completion.choices[0].message.content

# ✅ Optional: model under evaluation (used to answer questions); required only when running in API mode
def model_call(prompt, model_name):
    client = OpenAI(api_key="YOUR_KEY", base_url="https://api.example.com")
    chat_completion = client.chat.completions.create(
        messages=[{ "role": "user", "content": prompt }],
        model=model_name,
    )
    result = chat_completion.choices[0].message.content
    return result
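
API calls can fail transiently (timeouts, rate limits), and a long evaluation run may be worth protecting against that. Below is a minimal sketch of a retry wrapper with exponential backoff. `with_retries` is a hypothetical helper, not something televal provides, and catching every `Exception` is a simplifying assumption; narrowing it to the client's network errors would be more precise.

```python
import time


def with_retries(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap `call` so that a failing invocation is retried with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def wrapped(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return call(*args, **kwargs)
            except Exception:
                # Re-raise once the last attempt has failed.
                if attempt == max_attempts:
                    raise
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                sleep(base_delay * 2 ** (attempt - 1))

    return wrapped
```

In configs/api_models.py this could be applied after the definitions, for example `model_eval = with_retries(model_eval)`.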

🚀One-step evaluation example

import os
# Restrict this process to GPUs 2 and 3
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

from televal.generation import generator_local, generator_api
from televal.evaluation import compute_metrics, compute_metrics_LLM, compute_scores
from televal.visualization import compute_ranking_average
import pandas as pd
import IPython.display as disp

def run_all_steps(dataset_name, model, gpu_ids=None, is_api=False):
    print("\n Step 1: Generate answers")
    if is_api:
        generator_api.main(dataset_name=dataset_name, model=model)
    else:
        generator_local.main(dataset_name=dataset_name, model=model, gpu_ids=gpu_ids)

    print("\n Step 2: Evaluate objective questions")
    compute_metrics.main(dataset_name=dataset_name, model=model)

    print("\n Step 3: Evaluate subjective questions")
    compute_metrics_LLM.main(dataset_name=dataset_name, model=model)

    print("\n Step 4: Compute overall scores")
    compute_scores.main(dataset_name=dataset_name, model=model)

    print("\n Step 5: Visualize results")
    df_type, df_cat = compute_ranking_average.main(dataset_name=[dataset_name])
    disp.display(df_type)
    disp.display(df_cat)

# Example 1: evaluate a local model
run_all_steps(dataset_name="A", model="qwen2.5-14b-instruct", gpu_ids=[2, 3], is_api=False)

# Example 2: evaluate an API model
run_all_steps(dataset_name="B", model="DeepSeek-V3", is_api=True)
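
When several dataset and model combinations need to be evaluated, the two example calls above generalize to a loop. The sketch below shows one possible way to do that while letting the batch continue if a single run fails. `run_batch` is a hypothetical helper; it takes the runner as an argument, so it makes no assumptions about televal internals.

```python
def run_batch(run_fn, jobs):
    """Call run_fn(**job) for every job and return a list of (job, error) for failures."""
    failures = []
    for job in jobs:
        try:
            run_fn(**job)
        except Exception as exc:
            # Record the failure and continue, so one bad run does not stop the batch.
            failures.append((job, exc))
    return failures


# The same two runs as in the example above, expressed as keyword-argument dicts.
jobs = [
    {"dataset_name": "A", "model": "qwen2.5-14b-instruct", "gpu_ids": [2, 3], "is_api": False},
    {"dataset_name": "B", "model": "DeepSeek-V3", "is_api": True},
]

# With run_all_steps defined as above:
# failures = run_batch(run_all_steps, jobs)
```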

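📊Output files

After a run, the following files are generated automatically under outputs/{dataset}/{model}/:

- record.json: model inference records
- evaluation.json: per-question scoring details for subjective questions
- result.json: combined accuracy for subjective and objective questions
- score.json: scores weighted by question type
- 题型排名.csv (question-type ranking) and 维度排名.csv (dimension ranking): visualized evaluation results, generated under outputs/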
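
To inspect these files programmatically, something along the following lines may be enough. This is a sketch under assumptions: `load_run_outputs` is a hypothetical helper, the four files are assumed to be UTF-8 JSON documents, and their internal structure is not documented here, so the code only loads them without interpreting the contents.

```python
import json
from pathlib import Path

# File names as listed above; their schemas are not assumed.
RESULT_FILES = ("record.json", "evaluation.json", "result.json", "score.json")


def load_run_outputs(dataset, model, outputs_dir="outputs"):
    """Load whichever of the listed JSON files exist for one dataset/model run."""
    run_dir = Path(outputs_dir) / dataset / model
    loaded = {}
    for name in RESULT_FILES:
        path = run_dir / name
        if path.is_file():
            with path.open(encoding="utf-8") as f:
                loaded[name] = json.load(f)
    return loaded
```

For example, `load_run_outputs("A", "qwen2.5-14b-instruct")` returns a dictionary keyed by file name, containing only the files that were actually produced.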

📬Contact us

For requests or suggestions, please contact: wang.yingying@ustcinfo.com

Release history

- 0.1.3: Jul 29, 2025
- 0.1.2: Jul 28, 2025 (this version)
- 0.1.1: Jul 28, 2025
- 0.1.0: Jul 28, 2025

Download files

Source Distribution

televal-0.1.2.tar.gz (23.3 kB)

Uploaded Jul 28, 2025 Source

Built Distribution

televal-0.1.2-py3-none-any.whl (43.8 kB)

Uploaded Jul 28, 2025 Python 3

File details

Details for the file televal-0.1.2.tar.gz.

File metadata

Download URL: televal-0.1.2.tar.gz
Upload date: Jul 28, 2025
Size: 23.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for televal-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`75c9cad34b4b7c4cfba87fb95e0d8f4e487acd7dba35a0f84a664b4412e47d8b`
MD5	`e73579d471e427d26481b8fafb9a0d10`
BLAKE2b-256	`4b386fbf6197b3a6bc1f703c3dfa9afe538019bdb4cf9687bc46f5e11e237f0a`

File details

Details for the file televal-0.1.2-py3-none-any.whl.

File metadata

Download URL: televal-0.1.2-py3-none-any.whl
Upload date: Jul 28, 2025
Size: 43.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for televal-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bac952622b603b15d2d68ba0ed9071f27c4439c757b8247c64570204bc467935`
MD5	`440ffbec7606c18d403efef512f571a8`
BLAKE2b-256	`0fc06899add61d4c41dbc5edc4e921595883731acb7545744cbb3feaf9995d30`
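
The digests above can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib`. The helper names are hypothetical, and the local file path in the usage comment is an assumption about where the wheel was saved.

```python
import hashlib

# SHA256 digest of televal-0.1.2-py3-none-any.whl, copied from the table above.
EXPECTED_WHEEL_SHA256 = "bac952622b603b15d2d68ba0ed9071f27c4439c757b8247c64570204bc467935"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest equals expected_hex (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the wheel was downloaded to the current directory:
# matches_sha256("televal-0.1.2-py3-none-any.whl", EXPECTED_WHEEL_SHA256)
```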


televal 0.1.2
