A high-throughput and memory-efficient inference and serving engine for LLMs
Project description
⭐️ Star the project to stay up to date with the latest TeleLLM updates~
📣 Latest Updates
- [2024.11.28] Added the ADD_DEFAULT_SYSTEM_ROLE environment variable (default: True) and synced torch_dtype with the model file when mismatched. Added support for the Telechat quantized model, Telechat2, Moss-Moon-003-SFT, BELLE-7B-2M, and Ziya-LLaMA-13B-v1 on NVIDIA, and added new test cases for verification. 🚩🚩🚩
- [2024.11.18] Adapted and upgraded for MindIE RC3, adding support for the Llama3.1-8B, Llama3.1-70B, Telechat-1B, 7B, and 12B models. 🚩🚩🚩
- [2024.10.12] Added support for Huawei Ascend. The inference service now accepts custom user parameters, default parameters have been updated, and the context length has been increased to 8k with 2k output. Feel free to try it! 🚩🚩🚩 (Continuously updated...)
Introduction
TeleLLM, developed by the State Cloud Intelligent Computing Team, is a scaffold for generating large-model inference projects, covering a full set of lightweight solutions for deploying and serving LLM tasks. Its main features include:
- Supports Nvidia and Ascend LLM inference.
- Supports automatic generation of interface documentation, functional test documentation, and performance test documentation.
- Supports inference for MindIEServer.
- Supports large model dataset evaluation.
- Aligned with OpenAI interfaces, supports multi-modal model inference for image-to-text, text-to-image, and image-to-image.
- Supports automatic generation of deployment documentation.
- Supports large model quantization, LMDeploy vision model inference, and function calls.
- (Continuously updated...)
Quick Start
🛠️ Installation Guide
If you are installing TeleLLM with CUDA, you can refer to this installation guide: CUDA Installation
If you are installing TeleLLM with NPU, you can refer to this installation guide: NPU Installation
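Since TeleLLM is also published on PyPI, a plain pip install may work as a quick starting point; the platform-specific guides above remain the authoritative instructions for CUDA/NPU dependencies. A minimal sketch:

```bash
# Install the TeleLLM package from PyPI (CUDA/NPU dependencies are covered
# in the platform-specific installation guides linked above).
pip install telellm

# Check that the CLI is available (assuming the top-level command supports --help).
telellm --help
```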
📂 Data Preparation
Offline Download in Advance
TeleLLM supports using local datasets for evaluation.
To be updated on whether offline dataset packages are provided...
Use ModelScope for Automatic Download
You can also use ModelScope to load datasets:
Environment setup:
pip install modelscope
export DATASET_SOURCE=ModelScope
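For example, a minimal sketch tying this setup to an evaluation run (this assumes the DATASET_SOURCE variable is read by the evaluation command so the dataset is downloaded from ModelScope automatically; host, port, and model name are placeholders):

```bash
# Install ModelScope and tell TeleLLM to source datasets from it.
pip install modelscope
export DATASET_SOURCE=ModelScope

# Run an evaluation; with DATASET_SOURCE set, the mmlu dataset is expected
# to be fetched automatically (assumption based on the description above).
telellm eval -sh localhost -sp 8899 -mn Qwen2-7B-Instruct -ds mmlu -t val
```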
💡 Basic Usage
1. Service Invocation
After entering the container where TeleLLM is deployed, you can execute the following command to start the service:
telellm serve --model /model --model_name Qwen2-7B-Instruct -p 8899
Currently, telellm serve supports 15 parameters, such as --model and --tensor_parallel_size. For detailed usage, refer to the service parameters documentation: Serve-args.
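Because the service is aligned with OpenAI interfaces, a started service can typically be queried with an OpenAI-style chat-completions request. A minimal sketch (the /v1/chat/completions path and request body follow the OpenAI convention and are an assumption here, not taken from the TeleLLM documentation):

```bash
# Query the running service with an OpenAI-style chat completion request.
# Endpoint path and payload follow the OpenAI API convention (assumption).
curl http://localhost:8899/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 128,
        "temperature": 0.7
      }'
```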
2. Model Functionality & Performance Testing
After the service is deployed, you can run telellm test to test the service. The test reports (a functional report and a performance report) will be generated in the current directory.
- Help:
telellm test --help
- Parameter description:
| Parameter | Abbr | Type | Default Value | Description |
|---|---|---|---|---|
| --test_type | -tt | str | both | Test type. 1. functional 2. performance 3. both |
| --service_host | -sh | str | localhost | Service host |
| --service_port | -sp | int | 8899 | Service port |
| --service_name | -sn | str | llmservice | Service name |
| --model_name | -mn | str | —— | Model name |
| --concurrency | -c | [int] | [1] | Concurrency (for performance testing only) |
| --seq_len | -s | [int] | [25, 100, 400, 800] | Test text length (for performance testing only): 25 > 32 tokens; 100 > 128 tokens; 400 > 512 tokens; 800 > 1024 tokens |
- Example:
telellm test -tt both -sh localhost -sp 8899 -sn nvidia-qwen-infer-svc -mn Qwen-7B-Chat -c 1 -c 2 -c 4
- -tt both generates both the functional and performance test reports; -tt functional generates only the functional test; -tt performance generates only the performance test.
- -c 1 -c 2 -c 4 represents the concurrency levels [1, 2, 4], which can be adjusted.
- -s 25 -s 100 -s 400 -s 800 represents the text lengths [25, 100, 400, 800], which can be adjusted.
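For example, a performance-only run with custom concurrency levels and text lengths might look like this (host, port, service name, and model name below are placeholders):

```bash
# Performance test only, at concurrency 1 and 4, with two text lengths.
telellm test -tt performance -sh localhost -sp 8899 \
  -sn nvidia-qwen-infer-svc -mn Qwen-7B-Chat \
  -c 1 -c 4 -s 25 -s 100
```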
3. Model Accuracy Evaluation
Run telellm eval to evaluate the model. It will generate an evaluation result folder named eval_chat_outs and an evaluation report file in the current directory.
- Help:
telellm eval --help
| Parameter | Abbr | Type | Default Value | Description |
|---|---|---|---|---|
| --service_host | -sh | str | localhost | Service host |
| --service_port | -sp | int | 8899 | Service port |
| --model_name | -mn | str | —— | Model name |
| --dataset | -ds | str | mmlu | The dataset to be evaluated 1. mmlu 2. ceval 3. humaneval 4. gsm8k |
| --type | -t | str | val | Dataset type 1. val 2. test |
| --overwrite | -o | bool | False | Whether to overwrite existing results |
| --num_threads | -nt | int | 5 | The maximum number of threads to use |
| --temperature | -tt | float | 1.0 | Request parameter temperature |
| --top_p | -tp | float | 0.001 | Request parameter top_p |
| --top_k | -tk | int | 1 | Request parameter top_k |
| --repetition_penalty | -rp | float | 1.0 | Request parameter repetition_penalty |
| --enable_rp | -erp | bool | False | Whether to use the repetition_penalty parameter (temporary) |
- Examples:
# Use existing results
telellm eval -sh localhost -sp 8899 -mn llama2-7b-chat -ds mmlu -t val -nt 5
# Overwrite existing results
telellm eval -sh localhost -sp 8899 -mn llama2-7b-chat -ds mmlu -t val -nt 5 -o
- -ds mmlu represents using the MMLU dataset for evaluation. Alternatives include ceval/humaneval/gsm8k.
- -t val represents using the validation set (val) of the dataset for evaluation (some datasets like humaneval/gsm8k do not have a val set and thus do not need this option). The alternative is the test set.
- -nt 5 specifies the maximum number of threads to use for evaluation.
- -o means overwriting existing intermediate results and performing a fresh evaluation. If not specified, the evaluation continues from the last intermediate result.
- -erp enables the repetition_penalty parameter (temporary support for new and old versions).
Note:
- (Optional; the evaluation request already constrains top_k/temperature/top_p/repetition_penalty.) For model evaluation, greedy decoding (do_sample=False) should be enabled; see the sketch after these notes.
- The results of the c-eval test set need to be submitted to the website for scoring: https://cevalbenchmark.com/static/user_interface.html
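A minimal sketch of an evaluation run with explicitly greedy-style sampling parameters, matching the defaults in the table above (host, port, and model name are placeholders):

```bash
# Evaluate on MMLU with greedy-style sampling parameters
# (top_k=1, top_p=0.001, temperature=1.0, repetition_penalty=1.0).
telellm eval -sh localhost -sp 8899 -mn llama2-7b-chat \
  -ds mmlu -t val -nt 5 \
  -tt 1.0 -tp 0.001 -tk 1 -rp 1.0
```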
Dataset Introduction:
| | mmlu | c-eval | human-eval | gsm8k |
|---|---|---|---|---|
| Type | General-domain English dataset | General-domain Chinese dataset | Programming tasks | Mathematics |
| Description | Covers 57 tasks including basic math, American history, computer science, law, etc. | Involves 4 major subject areas and 52 subcategories, with four difficulty levels (middle school, high school, university, and professional) | Contains 164 carefully designed programming tasks, each with four key components | A dataset of high-quality, linguistically diverse elementary school math word problems, all created by human writers |
| Classification | 1. val validation set: 1540 questions 2. test test set: 14079 questions | 1. val validation set: 1346 questions 2. test test set: 12342 questions | test test set: 164 programming tasks | test test set: 1319 elementary school math problems |
Category Introduction:
- STEM/Science, Technology, Engineering, and Mathematics: Includes subjects like computer science, electrical engineering, chemistry, mathematics, physics, etc.
- Social Science: Includes subjects like political science, geography, education, economics, business management, etc.
- Humanities: Includes subjects like law, arts, logic, language, history, etc.
- Other: A collection of other subjects, including environmental science, fire safety, taxation, sports, medicine, etc.
4. Quantization
Before starting the quantization process, we need to provide some initial quantization parameters: Quant-args
The quantization configuration supports both configuration files and command-line input parameters. The recommended approach is to use the configuration file.
If using a configuration file, you can use the following command to automatically generate the configuration file (quant_config.json) and the default calibration dataset (calib.jsonl):
telellm quant_config
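A minimal end-to-end sketch of the config-file workflow (this assumes telellm quant picks up quant_config.json and calib.jsonl from the current directory; see Quant-args for the actual parameters, and treat the model paths as placeholders):

```bash
# 1. Generate the default quantization config and calibration dataset.
telellm quant_config

# 2. Edit quant_config.json as needed (model paths, precision options, etc.).

# 3. Run quantization; assumes quant_config.json and calib.jsonl in the
#    current directory are read automatically (assumption, see Quant-args).
telellm quant

# 4. Inspect the generated report.
cat quant_result.json
```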
Alternatively, you can use command-line input parameters (not recommended):
telellm quant -mp /model_in -sd /model_out -pf true -acc false
After quantization, a quantization report (quant_result.json) will be generated in the current directory.
🏛 License
This framework is licensed under the Apache License (Version 2.0). For models and datasets, please refer to their original resource pages and follow the corresponding licenses.
☁️ Supported Models
TeleLLM supports a variety of large language models and multimodal models. Below is a list of models currently supported by TeleLLM: Supported_models