
A high-throughput and memory-efficient inference and serving engine for LLMs


⭐️ Star the project to stay up to date with the latest TeleLLM releases!


📣 Latest Updates

  • [2024.11.28] Added the ADD_DEFAULT_SYSTEM_ROLE environment variable (default: True) and synced torch_dtype with the model file when they mismatch. Added support for the quantized Telechat model, Telechat2, Moss-Moon-003-SFT, BELLE-7B-2M, and Ziya-LLaMA-13B-v1 on NVIDIA, along with new test cases for verification. 🚩🚩🚩
  • [2024.11.18] Adapted and upgraded for MindIE RC3, adding support for the Llama3.1 (8B, 70B) and Telechat (1B, 7B, 12B) models. 🚩🚩🚩
  • [2024.10.12] Added support for Huawei Ascend. The inference service now accepts custom user parameters, default parameters have been updated, and the context length has been increased to 8k input with 2k output. Feel free to try it! 🚩🚩🚩 (Continuously updated...)

Introduction

TeleLLM, developed by the State Cloud Intelligent Computing Team, is a scaffold for generating large-model inference projects, covering a complete set of lightweight LLM deployment and serving solutions. Its main features include:

  • Supports Nvidia and Ascend LLM inference.
  • Supports automatic generation of interface documentation, functional test documentation, and performance test documentation.
  • Supports inference for MindIEServer.
  • Supports large model dataset evaluation.
  • Aligned with OpenAI interfaces, supports multi-modal model inference for image-to-text, text-to-image, and image-to-image.
  • Supports automatic generation of deployment documentation.
  • Supports large model quantization, LMDeploy vision model inference, and function calls.
  • (Continuously updated...)

Quick Start

🛠️ Installation Guide

If you are installing TeleLLM with CUDA, you can refer to this installation guide: CUDA Installation

If you are installing TeleLLM with NPU, you can refer to this installation guide: NPU Installation

📂 Data Preparation

Offline Download in Advance

TeleLLM supports using local datasets for evaluation.

To be updated on whether offline dataset packages are provided...

Use ModelScope for Automatic Download

You can also use ModelScope to load datasets:

Environment setup:

pip install modelscope
export DATASET_SOURCE=ModelScope
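
To verify the setup, you can load a benchmark directly with the ModelScope SDK. A minimal sketch in Python (the dataset id ceval-exam and the subset name are illustrative assumptions, not TeleLLM requirements; substitute whichever benchmark you need):

from modelscope.msdatasets import MsDataset

# Load the validation split of an evaluation dataset from ModelScope.
# 'ceval-exam' and 'computer_network' are placeholder names used for
# illustration; pick the dataset and subset you actually need.
ds = MsDataset.load('ceval-exam', subset_name='computer_network', split='val')
print(next(iter(ds)))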

💡 Basic Usage

1. Service Invocation

After entering the container where TeleLLM is deployed, you can execute the following command to start the service:

telellm serve --model /model --model_name Qwen2-7B-Instruct -p 8899

Currently, telellm serve supports 15 parameters, including --model and --tensor_parallel_size. For detailed usage, refer to the service parameters documentation: Serve-args.
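
Because the service is aligned with OpenAI interfaces, any OpenAI-compatible client should be able to call it once it is up. A minimal sketch using the openai Python package (the /v1 base path and the placeholder API key follow the usual OpenAI-compatible serving convention and are assumptions, not confirmed TeleLLM specifics):

from openai import OpenAI

# Point an OpenAI-compatible client at the locally served model.
# The /v1 path and "EMPTY" key are conventional assumptions; adjust
# the host and port to match the telellm serve command above.
client = OpenAI(base_url="http://localhost:8899/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen2-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)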

2. Model Functionality & Performance Testing

After the service is deployed, you can run telellm test to test the service. The test reports (functional and performance reports) will be generated in the current directory.

  • Help:
telellm test --help
  • Parameter description:
Parameter      | Abbr | Type  | Default             | Description
---------------|------|-------|---------------------|------------------------------------------
--test_type    | -tt  | str   | both                | Test type: 1. functional 2. performance 3. both
--service_host | -sh  | str   | localhost           | Service host
--service_port | -sp  | int   | 8899                | Service port
--service_name | -sn  | str   | llmservice          | Service name
--model_name   | -mn  | str   | ——                  | Model name
--concurrency  | -c   | [int] | [1]                 | Concurrency (performance testing only)
--seq_len      | -s   | [int] | [25, 100, 400, 800] | Test text length (performance testing only): 25 > 32 tokens, 100 > 128 tokens, 400 > 512 tokens, 800 > 1024 tokens

  • Example:
telellm test -tt both -sh localhost -sp 8899 -sn nvidia-qwen-infer-svc -mn Qwen-7B-Chat -c 1 -c 2 -c 4
  • -tt both generates both functional and performance test reports; -tt functional generates only the functional test; -tt performance generates only the performance test.
  • -c 1 -c 2 -c 4 represents concurrency numbers [1, 2, 4], which can be adjusted.
  • -s 25 -s 100 -s 400 -s 800 represents text lengths [25, 100, 400, 800], which can be adjusted.
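
These flags compose freely; for example, a performance-only run that also overrides the text lengths:

telellm test -tt performance -sh localhost -sp 8899 -sn nvidia-qwen-infer-svc -mn Qwen-7B-Chat -c 1 -c 2 -s 25 -s 100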

3. Model Accuracy Evaluation

Run telellm eval to evaluate the model. It will generate an evaluation result folder named eval_chat_outs and an evaluation report file in the current directory.

  • Help:
telellm eval --help
  • Parameter description:

Parameter            | Abbr | Type  | Default   | Description
---------------------|------|-------|-----------|------------------------------------------
--service_host       | -sh  | str   | localhost | Service host
--service_port       | -sp  | int   | 8899      | Service port
--model_name         | -mn  | str   | ——        | Model name
--dataset            | -ds  | str   | mmlu      | Dataset to evaluate: 1. mmlu 2. ceval 3. humaneval 4. gsm8k
--type               | -t   | str   | val       | Dataset split: 1. val 2. test
--overwrite          | -o   | flag  | False     | Whether to overwrite existing results
--num_threads        | -nt  | int   | 5         | Maximum number of threads to use
--temperature        | -tt  | float | 1.0       | Request parameter temperature
--top_p              | -tp  | float | 0.001     | Request parameter top_p
--top_k              | -tk  | int   | 1         | Request parameter top_k
--repetition_penalty | -rp  | float | 1.0      | Request parameter repetition_penalty
--enable_rp          | -erp | flag  | False     | Whether to use the repetition_penalty parameter (temporary)

  • Examples:
# Use existing results
telellm eval -sh localhost -sp 8899 -mn llama2-7b-chat -ds mmlu -t val -nt 5
# Overwrite existing results
telellm eval -sh localhost -sp 8899 -mn llama2-7b-chat -ds mmlu -t val -nt 5 -o
  • -ds mmlu represents using the MMLU dataset for evaluation. Alternatives include ceval/humaneval/gsm8k.
  • -t val represents using the validation set (val) of the dataset for evaluation (some datasets like humaneval/gsm8k do not have a val set and thus don't need this option). The alternative is the test set.
  • -nt 5 specifies the maximum number of threads to use for evaluation.
  • -o means overwriting existing intermediate results and performing a fresh evaluation. If not specified, it will continue from the last intermediate result.
  • -erp enables the repetition_penalty parameter (temporary support for new and old versions).
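
Datasets without a val split are evaluated the same way, simply without -t; for example, on GSM8K:

telellm eval -sh localhost -sp 8899 -mn llama2-7b-chat -ds gsm8k -nt 5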

Note:

  1. For model evaluation, greedy decoding (do_sample=False) should be enabled. (This is optional here: the evaluation request already constrains top_k/temperature/top_p/repetition_penalty to approximate greedy decoding.)
  2. The results of the c-eval test set need to be submitted to the website for scoring: https://cevalbenchmark.com/static/user_interface.html
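
For reference, the eval defaults listed above already approximate greedy decoding. A sketch of the equivalent sampling block as it might appear in a request body (field names follow common OpenAI-compatible extensions such as top_k and repetition_penalty, and are assumptions here, not confirmed TeleLLM specifics):

# Sampling parameters mirroring the eval defaults above; together they
# approximate greedy decoding (top_k=1 always picks the highest-probability
# token). Field names are assumed OpenAI-compatible extensions.
greedy_like = {
    "temperature": 1.0,
    "top_p": 0.001,
    "top_k": 1,
    "repetition_penalty": 1.0,
}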

Dataset Introduction:

  • mmlu: General-domain English dataset covering 57 tasks, including basic math, American history, computer science, law, etc. Splits: val (1540 questions), test (14079 questions).
  • c-eval: General-domain Chinese dataset spanning 4 major subject areas and 52 subcategories, with four difficulty levels (middle school, high school, university, and professional). Splits: val (1346 questions), test (12342 questions).
  • human-eval: Contains 164 carefully designed programming tasks, each with four key components. Split: test (164 programming tasks).
  • gsm8k: High-quality, linguistically diverse grade-school math word problems, all created by human writers. Split: test (1319 problems).

Category Introduction:

  • STEM (Science, Technology, Engineering, and Mathematics): Includes subjects like computer science, electrical engineering, chemistry, mathematics, physics, etc.
  • Social Science: Includes subjects like political science, geography, education, economics, business management, etc.
  • Humanities: Includes subjects like law, arts, logic, language, history, etc.
  • Other: A collection of other subjects, including environmental science, fire safety, taxation, sports, medicine, etc.

4. Quantization

Before starting the quantization process, we need to provide some initial quantization parameters: Quant-args

The quantization configuration supports both configuration files and command-line input parameters. The recommended approach is to use the configuration file.

If using a configuration file, you can use the following command to automatically generate the configuration file (quant_config.json) and the default calibration dataset (calib.jsonl):

telellm quant_config

Alternatively, you can use command-line input parameters (not recommended):

telellm quant -mp /model_in -sd /model_out -pf true -acc false

After quantization, a quantization report (quant_result.json) will be generated in the current directory.

🏛 License

This framework is licensed under the Apache License (Version 2.0). For models and datasets, please refer to the original resource page and follow the corresponding License.

☁️ Supported Models

TeleLLM supports a variety of large language models and multimodal models. Below is a list of models currently supported by TeleLLM: Supported_models
