
Project description

TritonLLM: LLM Inference via Triton 🚀

Flexible and modular LLM inference for mini-batch

🔗 tritonllm.top

English | 中文

TritonLLM implements modular, Triton-backed LLM inference with an emphasis on kernel optimization using CUBINs. The initial target is the gpt-oss model, executed via triton_runner and tuned first for the RTX 5090 (sm120). NVIDIA GPUs with compute capability sm120 (RTX 5090, RTX PRO 6000, etc.), sm90 (H100, H200, H20, etc.), sm80 (A800, A100), sm89 (RTX 4090, RTX 6000, L40, etc.), and sm86 (RTX 3090, A10, etc.) are now supported. With at least 24 GB of GPU memory you can run gpt-oss-20b; with at least 80 GB you can run gpt-oss-120b.
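
If you are unsure whether your GPU qualifies, you can check its compute capability and total memory with plain PyTorch; the thresholds below simply mirror the 24 GB / 80 GB guidance above (a minimal sketch, not part of the tritonllm API):

import torch

# Minimal sketch: compare the local GPU against the guidance above.
# Uses plain PyTorch; none of this is tritonllm-specific.
major, minor = torch.cuda.get_device_capability(0)
total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"compute capability: sm{major}{minor}, memory: {total_gib:.1f} GiB")
if total_gib >= 80:
    print("enough memory for gpt-oss-120b")
elif total_gib >= 24:
    print("enough memory for gpt-oss-20b")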

The project is compatible with PyTorch 2.10 and runs correctly in that environment. However, for the best performance, we recommend using a PyTorch/Triton combination equivalent to PyTorch 2.8 with Triton 3.4.0.
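
To confirm which combination you are actually running, print the installed versions (a minimal sketch using the standard version attributes):

import torch
import triton

# Print the installed PyTorch/Triton versions to compare against the
# recommended combination (PyTorch 2.8 with Triton 3.4.0).
print("torch:", torch.__version__)
print("triton:", triton.__version__)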

Quick Installation

Install a compatible PyTorch environment first, then install the latest stable release of tritonllm with pip. tritonllm does not pin torch or triton at install time, so you can match their versions to your CUDA driver and deployment environment:

pip install torch

Then install tritonllm:

pip install tritonllm

To enable the optional triton_runner JIT backend:

pip install "tritonllm[runner]"

🚀 Command Line Interface (CLI)

To quickly launch the gpt-oss-20b model, downloading it automatically from ModelScope:

tritonllm

You can explore all available options with:

tritonllm --help

Usage

usage: tritonllm [-h] [-r REASONING_EFFORT] [-a] [-b] [--show-browser-results] [-p]
                 [--developer-message DEVELOPER_MESSAGE] [-c CONTEXT] [--raw]
                 [FILE]

Positional arguments

FILE: Path to the SafeTensors checkpoint. If not provided, the 20B model is downloaded from ModelScope. You can also run tritonllm 120b to use the 120B model from ModelScope directly.

Options

-h, --help: Show this help message and exit.
-r REASONING_EFFORT, --reasoning-effort REASONING_EFFORT: Set the reasoning effort level (low / medium / high). Default: high.
-a, --apply-patch: Make the internal apply_patch function available to the model. Default: False.
-b, --browser: Enable the browser tool so the model can fetch web content. Default: False.
--show-browser-results: Show fetched browser results in the output. Default: False.
-p, --python: Enable the Python execution tool (run Python snippets). Default: False.
--developer-message DEVELOPER_MESSAGE: Provide a developer/system message that influences the model's behavior.
-c CONTEXT, --context CONTEXT: Maximum context length (tokens). Default: 8192.
--raw: Raw mode. Disable Harmony encoding and render plain output. Default: False.
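
For example, combining several of these flags, tritonllm -r low -b -c 4096 would start a chat session with low reasoning effort, the browser tool enabled, and a 4096-token context.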

Install from source

git clone https://github.com/toyaix/tritonllm
cd tritonllm

pip install -e .

Install the optional runner backend from source:

pip install -e ".[runner]"

JIT backend selection

By default the project uses @triton.jit. To switch the package-managed kernels to @triton_runner.jit, set:

export TRITONLLM_JIT_BACKEND=triton_runner

Supported values are triton and triton_runner. If triton_runner is selected without the optional dependency installed, the import will fail fast with a clear error.
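
If you prefer to select the backend from Python rather than the shell, one option is to set the variable before importing tritonllm (a minimal sketch; it assumes the variable is read when tritonllm is imported, consistent with the fail-fast-on-import behavior described above):

import os

# Assumption: TRITONLLM_JIT_BACKEND is read when tritonllm is imported,
# so set it before the import. Supported values: "triton", "triton_runner".
os.environ["TRITONLLM_JIT_BACKEND"] = "triton_runner"

from tritonllm.gpt_oss.chat import chat, get_parser_args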

Example code

from tritonllm.gpt_oss.chat import chat, get_parser_args


if __name__ == "__main__":
    chat(get_parser_args())

Run

# test
python examples/generate.py

# chat
python examples/chat.py

Benchmark

I am currently optimizing Tokens Per Second (TPS), the number of tokens generated per second during autoregressive decoding.
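
For reference, TPS is just the number of generated tokens divided by the wall-clock decoding time; the sketch below illustrates the measurement (generate_one_token is a hypothetical stand-in for the model's decode step, not a tritonllm function):

import time

# Illustration of the TPS metric: tokens generated / wall-clock decode time.
# `generate_one_token` is a hypothetical stand-in, not part of tritonllm.
def measure_tps(generate_one_token, num_tokens=128):
    start = time.perf_counter()
    for _ in range(num_tokens):
        generate_one_token()
    return num_tokens / (time.perf_counter() - start)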

python examples/bench_chat.py

# show output
python examples/only_output.py

Run Streamlit with the Responses API (has known bugs)

You can also use Streamlit to interact with the Responses API, providing a convenient web interface for chatting with the model.

pip install streamlit

python -m gpt_oss.responses_api.serve

streamlit run streamlit/streamlit_chat.py
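
Once the server is running, you can also query the Responses API directly instead of going through Streamlit. The sketch below assumes an OpenAI-style POST /v1/responses endpoint on localhost port 8000; verify the actual host, port, and route against the serve command for your install:

import requests

# Minimal sketch of a direct Responses API call. URL, port, and payload shape
# are assumptions (OpenAI-style Responses API); check your serve configuration.
resp = requests.post(
    "http://localhost:8000/v1/responses",
    json={"model": "gpt-oss-20b", "input": "Hello, who are you?"},
    timeout=120,
)
print(resp.json())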

triton_kernels

triton_kernels is a set of kernels that enables fast MoE on different architectures. These kernels support multiple precisions (e.g. bf16, mxfp4).

The original code is at https://github.com/triton-lang/triton/tree/main/python/triton_kernels.

The current version corresponds to commit de4376e90a3c2b5ca30ada25a50cccadeadf7f1a and uses BlackwellMXValueLayout from commit 19ca20fda4cfd3ae0d3eabde5e547db581fbb7ee.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distribution


tritonllm-0.1.1-py3-none-any.whl (163.9 kB, Python 3)

File details

Details for the file tritonllm-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: tritonllm-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 163.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for tritonllm-0.1.1-py3-none-any.whl

SHA256: fb7e1ccc77dafb7cc1de537b5629c59d09bd9e63b45a29a4d13577c119500541
MD5: 1b959579dfbb4046faa7c9b3260ad588
BLAKE2b-256: 8fce46a94847e3ab7159d71cf8e828a1e8ce33ee32e3e129e35e042195a886ed
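
To verify a downloaded wheel against the published SHA256 digest, a standard hashlib check is enough (a minimal sketch; adjust the local file path to wherever the wheel was saved):

import hashlib

# Compute the wheel's SHA256 and compare it to the digest published above.
expected = "fb7e1ccc77dafb7cc1de537b5629c59d09bd9e63b45a29a4d13577c119500541"
with open("tritonllm-0.1.1-py3-none-any.whl", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("match" if digest == expected else "MISMATCH")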

