Triton Inference Service of LLaMA
Introduction
Installation
Below are quick steps for installation:
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/llmdeploy.git
cd llmdeploy
pip install -e .
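To confirm the editable install succeeded, a quick check (the importable package name llmdeploy is assumed from the project name):
python3 -m pip show llmdeploy   # should report the installed version and the source location
python3 -c "import llmdeploy"   # the import should succeed without errors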
Quick Start
Build
Pull the docker image openmmlab/llmdeploy:base and build the llmdeploy libraries inside the launched container.
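A minimal sketch of pulling the image and launching the container; the mount point /workspace/llmdeploy and the docker run flags are assumptions:
docker pull openmmlab/llmdeploy:base
# mount the cloned repo and open a shell in the build container (flags are illustrative)
docker run --gpus all --rm -it -v $(pwd):/workspace/llmdeploy -w /workspace/llmdeploy openmmlab/llmdeploy:base
# inside the container, run the build commands below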
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
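After make install, the FasterTransformer backend libraries should land in the install tree; this is the same path that is passed to --lib-dir in the serving commands below:
ls build/install/backends/fastertransformer   # expected to contain the built backend libraries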
Serving LLaMA
Weights for the LLaMA models can be obtained by filling out this form.
Run one of the following commands to serve a LLaMA model on an NVIDIA GPU server (a quick sanity check is sketched after the recipes):
7B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
13B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
33B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
65B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
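The --tp value above is the tensor-parallel degree, so it should not exceed the number of GPUs on the server. Once service_docker_up.sh has brought the server up, a minimal sanity check, assuming Triton's default HTTP port 8000 is reachable from the host (the bundled clients below talk to port 33337 instead):
nvidia-smi --list-gpus | wc -l   # number of visible GPUs; must be at least the chosen --tp
curl -sf http://localhost:8000/v2/health/ready && echo "triton server is ready"   # standard Triton readiness endpoint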
Serving Vicuna
7B
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
--base-model-path /path/to/llama-7b \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1
python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
13B
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
--base-model-path /path/to/llama-13b \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
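Before handing the merged directory to deploy.py in hf format, it can help to confirm that apply_delta produced a complete Hugging Face checkpoint (the exact file list depends on the FastChat version):
ls /path/to/vicuna-13b   # expect config.json, tokenizer files and the model weight shards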
Inference with Command Line Interface
python3 llmdeploy/serve/client.py {server_ip_address}:33337 1
Inference with Web UI
python3 llmdeploy/webui/app.py {server_ip_address}:33337 model_name
User Guide
Quantization
In fp16 mode, kv_cache int8 quantization can be enabled so that a single card can serve more users.
First execute the quantization script; the quantization parameters are stored in the weight directory created by deploy.py.
Then adjust config.ini (a scripted sketch of these edits follows):
- set use_context_fmha to 0, which turns context FMHA off
- set quant_policy to 4; this parameter defaults to 0, meaning kv_cache quantization is not enabled
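A scripted way to apply the two edits above; the location of config.ini inside the workspace produced by deploy.py is an assumption and may differ in your deployment:
sed -i 's/^\(use_context_fmha *=\).*/\1 0/' workspace/triton_models/weights/config.ini
sed -i 's/^\(quant_policy *=\).*/\1 4/' workspace/triton_models/weights/config.ini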
Contributing
We appreciate all contributions to LLMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.
Acknowledgement
License
This project is released under the Apache 2.0 license.