Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
Project description
What is gLLM?
Integrated with features like continuous batching, paged attention, chunked prefill, prefix caching, token throttling, pipeline parallelism, expert parallelism, and tensor parallelism, gLLM provides the basic functionality (offline/online inference and interactive chat) for deploying distributed inference of LLMs supported on Hugging Face. gLLM delivers offline/online inference speed equivalent or superior to mainstream inference engines, with a minimal (~6k LOC) code base. You can also think of gLLM as an LLM inference playground for experiments and academic research.
Latest News :fire:
- [2025/06/21]: Expert parallelism is integrated :heart_eyes:
- [2025/06/14]: Tensor parallelism is now integrated, allowing joint deployment with pipeline parallelism :sunglasses:
- [2025/05/05]: MoE architecture is supported. Try Qwen2/3 MoE models :star_struck:
- [2025/04/29]: Qwen3 day 1 support. Come and try Qwen3 :tada:
- [2025/04/27]: gLLM is open sourced :earth_asia:
- [2025/04/27]: We support multi-node deployments. You can serve your model across different machines :blush:
- [2025/04/21]: We released our paper on arXiv:2504.14775 :partying_face:
- [2025/03/15]: Chunked prefill has been integrated. You can input any length of text you want :hugs:
- [2025/03/01]: Pipeline parallelism has been integrated. You can run any size of model you want :laughing:
- [2025/02/27]: We applied numerous optimizations that significantly lower CPU overhead :clap:
Token Throttling
Prefill Token Throttling
Decode Token Throttling
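Prefill tokens and decode tokens are throttled separately; the exact balancing policy is described in the paper (arXiv:2504.14775). As a rough mental model only, the sketch below caps how many prefill and decode tokens are admitted into each scheduling iteration so that per-iteration work stays balanced across pipeline stages. The class names, budget values, and scheduling order are illustrative assumptions, not gLLM's actual code.

```python
# Conceptual sketch of token throttling (illustrative only; not gLLM's scheduler).
# Assumption: each iteration admits at most `decode_token_budget` decode tokens
# and `prefill_token_budget` prefill tokens, chunking long prompts as needed.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Request:
    remaining_prompt: int        # prompt tokens not yet prefilled

@dataclass
class ThrottledScheduler:
    prefill_token_budget: int = 2048   # hypothetical per-iteration prefill cap
    decode_token_budget: int = 256     # hypothetical per-iteration decode cap
    waiting: List[Request] = field(default_factory=list)   # requests in prefill phase
    running: List[Request] = field(default_factory=list)   # requests in decode phase

    def schedule_iteration(self) -> List[Tuple[Request, int]]:
        """Return (request, num_tokens) pairs to run in one pipeline step."""
        batch: List[Tuple[Request, int]] = []
        # Decode throttling: one token per running request, up to the cap.
        for req in self.running[: self.decode_token_budget]:
            batch.append((req, 1))
        # Prefill throttling: fill the prefill budget with (chunked) prompts.
        prefill_used = 0
        for req in list(self.waiting):
            if prefill_used >= self.prefill_token_budget:
                break
            chunk = min(req.remaining_prompt, self.prefill_token_budget - prefill_used)
            batch.append((req, chunk))
            prefill_used += chunk
            req.remaining_prompt -= chunk
            if req.remaining_prompt == 0:    # prefill done; request starts decoding
                self.waiting.remove(req)
                self.running.append(req)
        return batch
```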
Install gLLM
# install the pinned PyTorch version first
pip install torch==2.5.1
# then install gLLM from the root of the cloned source tree
pip install -v -e .
Quickstart
Interactive Offline Chat
python examples/chat.py --model $MODEL_PATH
Offline Batch Inference
python examples/batch_inference.py --model $MODEL \
--share-gpt-path $SHARE_GPT_PATH --num-prompt $NUM_PROMPT \
--gpu-memory-util $GPU_MEMORY_UTIL
Offline Benchmark
python benchmarks/benchmark_throughput.py --model $MODEL \
--dataset $SHAREGPT_PATH --num-prompt $NUM_PROMPT --backend gllm \
--gpu-memory-util $GPU_MEMORY_UTIL
Launch OpenAI-Compatible Server (Intra-node)
# To see the description of args, run 'python -m gllm.entrypoints.api_server -h'
python -m gllm.entrypoints.api_server --port $PORT --model-path $MODEL_PATH \
--enable-prefix-caching --pp $PP --tp $TP
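Once the server is up, any OpenAI-compatible client can talk to it. Below is a minimal sketch using the official openai Python package; the port, model name, and the /v1 chat completions route are assumptions based on the server being OpenAI-compatible, not part of gLLM's documented examples.

```python
# Minimal sketch: query the gLLM server with the `openai` Python package.
# Assumes the standard OpenAI-compatible /v1 routes; port and model name
# are placeholders matching the launch command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1",  # replace 8000 with $PORT
                api_key="EMPTY")                       # a local server needs no real key

response = client.chat.completions.create(
    model="MODEL_PATH",  # the value passed as --model-path
    messages=[{"role": "user", "content": "Briefly explain pipeline parallelism."}],
)
print(response.choices[0].message.content)
```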
Launch OpenAI-Compatible Server (Multi-node)
Experimental feature
gLLM can be launched in three modes: (1) normal, for a single node with multiple GPUs; (2) master, for multi-node deployment; and (3) slave, for multi-node deployment.
To launch master gLLM instance
python -m gllm.entrypoints.api_server --port $PORT --master-port $MASTER_PORT \
--model-path $MODEL_PATH --pp $PP --launch-mode master --worker-ranks $RANKS
To launch slave gLLM instance
python -m gllm.entrypoints.api_server --host $HOST \
--master-addr $MASTER_ADDR --master-port $MASTER_PORT \
--model-path $MODEL_PATH --pp $PP --launch-mode slave --worker-ranks $RANKS
There are a few things to take care of:
- Make sure `$MASTER_PORT` and `$MASTER_ADDR` in the slave instance match those in the master instance
- Make sure the slave instance can set up a connection with the master instance using `$MASTER_ADDR`
- Make sure the master instance can set up a connection with the slave instance using `$HOST`
- Make sure `$PP` is consistent with `$RANKS` across the master and slave instances
  - For example, to launch two gLLM instances with `$PP` set to 4 and `$RANKS` in the master set to 0,1, `$RANKS` in the slave must be set to 2,3
- Make sure the environment variables `NCCL_SOCKET_IFNAME` and `NCCL_IB_DISABLE` are set properly
Client Completions
# Launch server first
python examples/client.py --port $PORT
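Equivalently, assuming the server also exposes the plain OpenAI-compatible /v1/completions route, a non-chat completion can be requested directly; the port and model name below are placeholders.

```python
# Minimal sketch of a plain (non-chat) completion request.
# Assumes the standard OpenAI-compatible /v1/completions route.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="MODEL_PATH",                # the value passed as --model-path
    prompt="The capital of France is",
    max_tokens=16,
)
print(completion.choices[0].text)
```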
Interactive Online Chat
# Launch server first
python examples/chat_client.py --port $PORT
Online Benchmark
# Launch server first
python benchmarks/benchmark_serving.py --backend $BACKEND --model $MODEL \
--dataset-name $DATASET_NAME --dataset-path $DATASET_PATH \
--num-prompts $NUM_PROMPTS --port $PORT --trust-remote-code \
--request-rate $REQUEST_RATE
Online Prefix Benchmark
# Launch server first
python benchmarks/benchmark_prefix_serving.py \
--trust-remote-code --backend $BACKEND --dataset $SHAREGPT_PATH \
--model $MODEL --num-max-users $NUM_USERS \
--num-min-rounds $NUM_MIN_ROUNDS \
--num-max-rounds $NUM_MAX_ROUNDS \
--port $PORT
Evaluate Output Quality
# Launch server first
python evaluations/evaluate_MMLU_pro.py --model $MODEL --port $PORT
Supported Models
- Qwen Series: Qwen3, Qwen2.5, Qwen2
- Llama Series: Llama3.2, Llama3.1, Llama3, Llama2 and deepseek-coder
- Mixtral Series: Mixtral-8x7B, Mixtral-8x22B
- ChatGLM Series: GLM-4 and ChatGLM3
Roadmap
- Support more models
Cite Our Work
@misc{guo2025gllmglobalbalancedpipeline,
title={gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling},
author={Tianyu Guo and Xianwei Zhang and Jiangsu Du and Zhiguang Chen and Nong Xiao and Yutong Lu},
year={2025},
eprint={2504.14775},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2504.14775},
}
Acknowledgment
We studied the architecture of, and reused code from, the following existing projects: vLLM, SGLang, and TD-Pipe.
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file gllm_rt-0.0.3.tar.gz.
File metadata
- Download URL: gllm_rt-0.0.3.tar.gz
- Upload date:
- Size: 752.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 26515772808d5c63f342c20551567cf7973d7d9e8849936d19fa810d7cc9098b |
| MD5 | 2e69dae73be52cd22069810ce8e13cdc |
| BLAKE2b-256 | 2ed73c4b028711e3382fb14f0e9cda1f3c84d95c54661dcaf1ace91cad245567 |
File details
Details for the file gllm_rt-0.0.3-cp311-cp311-manylinux1_x86_64.whl.
File metadata
- Download URL: gllm_rt-0.0.3-cp311-cp311-manylinux1_x86_64.whl
- Upload date:
- Size: 80.4 MB
- Tags: CPython 3.11
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d7b662815a1d07368a665184355f0df6c939647d3864f008630902781613ce6f |
| MD5 | 4192b5448007845aa4aa8bb7b6fc79f9 |
| BLAKE2b-256 | 69408a333e70411e6646c3332f556578c34443eb726c471a2599ee9a1bd28ba0 |