Skip to main content

A high-throughput and memory-efficient inference and serving engine for LLMs

Project description

Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling


What is gLLM?

Integreted with features like continuous batching, paged attention, chunked prefill, prefix caching, cuda graph, token throttling, pipeline parallelism, expert parallelsim and tensor parallelism, gLLM provides basic functionality (offline/online inference and interactive chat) to deploy distributed LLMs (supported in huggingface) inference. gLLM provides equivalent or superior offline/online inference speed with mainstream inference engine and minimal code base. You can also see gLLM as a LLM inference playground for doing experiment or academic research.

Latest News :fire:

  • [2025/12/04]: Cuda graph is supported :tada:
  • [2025/11/19]: DeepSeek V3/R1 is supported :laughing:
  • [2025/09/19]: DynaPipe is accepted by NeurIPS'25. Congratulations :smiling_face_with_three_hearts:
  • [2025/08/15]: Qwen2.5 VL is supported :hugs:
  • [2025/08/01]: DeepSeek V2 is supported :clap:
  • [2025/07/12]: FP8 quantization for Qwen3/2.5 is supported :tada:
  • [2025/06/27]: gLLM is accepted by SC'25. Congratulations :smiling_face_with_three_hearts:
  • [2025/06/21]: Expert parallelism is integrated :heart_eyes:
  • [2025/06/14]: Tensor parallelism is now integrated, allowing joint deploying with pipeline parallelism :sunglasses:
  • [2025/05/05]: MoE architecture is supported. Try Qwen2/3 MoE models :star_struck:
  • [2025/04/29]: Qwen3 day 1 support. Come and try Qwen3 :tada:
  • [2025/04/27]: gLLM is open sourced :earth_asia:
Previous News
  • [2025/04/27]: We support multi-node deployments. You can serve your model across different machines :blush:
  • [2025/04/21]: We release our paper on arXiv:2504.14775 :partying_face:
  • [2025/03/15]: Chunked prefill has been integrated. You can input any length of text you want :hugs:
  • [2025/03/01]: Pipeline parallelism has been integrated. You can run any size of model you want :laughing:
  • [2025/02/27]: We apply numerous optimizations which lowers CPU overhead a lot :clap:

Token Throttling

Prefill Token Throttling


Decode Token Throttling

Install gLLM

pip install -e .

Quickstart

Interactive Offline Chat

python examples/chat.py --model $MODEL_PATH

Offline Batch Inference

python examples/batch_inference.py --model $MODEL \
    --share-gpt-path $SHARE_GPT_PATH --num-prompt $NUM_PROMPT \
    --gpu-memory-util $GPU_MEMORY_UTIL

Offline Benchmark

python benchmarks/benchmark_throughput.py --model $MODEL \
    --dataset $SHAREGPT_PATH --num-prompt $NUM_PROMPT --backend gllm \
    --gpu-memory-util $GPU_MEMORY_UTIL

Launch OpenAI-Compatible Server (Intra-node)

# To see the description of args, run 'python -m gllm.entrypoints.api_server -h'
python -m gllm.entrypoints.api_server --port $PORT --model-path $MODEL_PATH \
    --enable-prefix-caching --pp $PP --tp $TP

Launch OpenAI-Compatible Server (Multi-node)

gLLM can be launched in three modes: (1) normal, used for single-node multiple GPUs (2) master, used for multi-node deployment (3) slave, used for multi-node deployment.

To launch master gLLM instance

python -m gllm.entrypoints.api_server --port $PORT --master-port $MASTER_PORT \
    --model-path $MODEL_PATH --pp $PP --launch-mode master --worker-ranks $RANKS

To launch slave gLLM instance

python -m gllm.entrypoints.api_server --host $HOST \
    --master-addr $MASTER_ADDR --master-port $MASTER_PORT \
    --model-path $MODEL_PATH --pp $PP --launch-mode slave --worker-ranks $RANKS

There are something you need to care about

  • Make sure $MASTER_PORT and $MASTER_ADDR in slave instance can be matched to that in master instance
  • Make sure slave instance can set up connection with master instance using $MASTER_ADDR
  • Make sure master instance can set up connection with slave instance using $HOST
  • Make sure $PP can be matched to $RANKS in slave or master instance
    • For example, we want to launch two gLLM instances, $PP is set to 4, $RANKS in master is set to 0,1, then $RANKS in slave must set to 2,3
  • Make sure set environment variable NCCL_SOCKET_IFNAME NCCL_IB_DISABLE properly

Client Completions

# Launch server first
python examples/client.py --port $PORT

Interactive Online Chat

# Launch server first
python examples/chat_client.py --port $PORT

Online Benchmark

# Launch server first
python benchmarks/benchmark_serving.py --backend $BACKEND --model $MODEL \
        --dataset-name $DATASET_NAME --dataset-path $DATASET_PATH \
        --num-prompts $NUM_PROMPTS --port $PORT --trust-remote-code \
        --request-rate $REQUEST_RATE

Online Prefix Benchmark

# Launch server first
python benchmarks/benchmark_prefix_serving.py \
        --trust-remote-code --backend $BACKEND --dataset $SHAREGPT_PATH \
        --model $MODEL --num-max-users $NUM_USERS \
        --num-min-rounds $NUM_MIN_ROUNDS \
        --num-max-rounds $NUM_MAX_ROUNDS \
        --port $PORT

Evaluate Output Quality

# Launch server first
python benchmarks/evaluate_MMLU_pro.py --model $MODEL

Supported Models

  • Kimi Series: Moonlight, K2-Base, K2-Instruct
  • DeepSeek Series: DeepSeek R1, DeepSeek V3, DeepSeek V2
  • Qwen Series: Qwen3 VL, Qwen3, Qwen2.5 VL, Qwen2.5, Qwen2
  • Llama Series: Llama3.2, Llama3.1, Llama3, Llama2 and deepseek-coder
  • Mixtral Series: Mixtral-8x7B, Mixtral-8x22B
  • ChatGLM Series: Glm4 and Chatglm3

Supported Quantization Method

  • fp8

Roadmap

  • Support more models
  • Support more quantization methods

Cite Our Work

@inproceedings{10.1145/3712285.3759823,
author = {Guo, Tianyu and Zhang, Xianwei and Du, Jiangsu and Chen, Zhiguang and Xiao, Nong and Lu, Yutong},
title = {gLLM: Global Balanced Pipeline Parallelism Systems for Distributed LLMs Serving with Token Throttling},
year = {2025},
isbn = {9798400714665},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3712285.3759823},
doi = {10.1145/3712285.3759823},
booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
pages = {1725–1741},
numpages = {17},
keywords = {LLM, Inference, Parallelism, Pipeline Bubbles, Throttling, Runtime},
location = {},
series = {SC '25}
}
@inproceedings{
xu2025dynapipe,
title={DynaPipe: Dynamic Layer Redistribution for Efficient Serving of {LLM}s with Pipeline Parallelism},
author={HongXin Xu and Tianyu Guo and Xianwei Zhang},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=D6w7wIN360}
}

Acknowledgment

We studied the architecture and implemented code reuse from these existing projects: vLLM, SGLang and TD-Pipe.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gllm_rt-0.0.6.post1-py3-none-any.whl (228.4 kB view details)

Uploaded Python 3

File details

Details for the file gllm_rt-0.0.6.post1-py3-none-any.whl.

File metadata

  • Download URL: gllm_rt-0.0.6.post1-py3-none-any.whl
  • Upload date:
  • Size: 228.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for gllm_rt-0.0.6.post1-py3-none-any.whl
Algorithm Hash digest
SHA256 87c27a86da6e2f86f186e2bc561bc79771edb2954d34a76cf0c7d09ceff2087f
MD5 e5511f5743288cef56249115dcb76fe5
BLAKE2b-256 759407528ec8f18a27aae27ceebd161ab515d10fe05e5eeca56a61011a082bf9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page