SGL Kernel
Kernel Library for SGLang
Installation
For CUDA 11.8:
pip3 install sgl-kernel -i https://docs.sglang.ai/whl/cu118
For CUDA 12.1 or CUDA 12.4:
pip3 install sgl-kernel
Build from source
Development build:
make build
Note: sgl-kernel is rapidly evolving. If you experience a compilation failure, try using make rebuild.
Build with ccache
apt-get install -y ccache  # or `yum install -y ccache`
# Building with ccache is enabled when ccache is installed and CCACHE_DIR is set.
export CCACHE_DIR=/path/to/your/ccache/dir
export CCACHE_BACKEND=""
export CCACHE_KEEP_LOCAL_STORAGE="TRUE"
unset CCACHE_READONLY
python -m uv build --wheel -Cbuild-dir=build --color=always .
Configuring CMake Build Options
CMake options can be configured by adding -Ccmake.define.<option>=<value> to the uv build flags.
For example, to enable building FP4 kernels, use:
python -m uv build --wheel -Cbuild-dir=build -Ccmake.define.SGL_KERNEL_ENABLE_FP4=1 --color=always .
See CMakeLists.txt for more options.
Parallel Build
We highly recommend building sgl-kernel with Ninja, which parallelizes the build automatically. If you build with Make instead, set CMAKE_BUILD_PARALLEL_LEVEL to get a parallel build:
CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) python -m uv build --wheel -Cbuild-dir=build --color=always .
Developer Guide
Development Environment Setup
Use Docker to set up the development environment. See Docker setup guide.
Create and enter development container:
docker run -itd --shm-size 32g --gpus all -v $HOME/.cache:/root/.cache --ipc=host --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
Project Structure
Dependencies
Third-party libraries:
FlashAttention FYI
FA3 can fail when there is not enough shared memory for some shapes, such as a higher hidden_dim or certain special cases. Right now, FA3 is supported for sm80/sm87 and sm86/sm89.
The main difference between sm80/sm87 and sm86/sm89 is the shared memory size; see https://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared-memory-8-x for more information.
For sgl-kernel, FA3 can currently be built on sm80/sm86/sm89/sm90a. That means you can use FA3 on A100 (tested), A*0, L20 (tested), L40, L40s, and 3090 (tested).
Kernel Development
Steps to add a new kernel:
- Implement the kernel in csrc
- Expose the interface in include/sgl_kernel_ops.h
- Create torch extension in csrc/common_extension.cc
- Update CMakeLists.txt to include new CUDA source
- Expose Python interface in python
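The steps above can be sketched end to end. This is only an orientation sketch; the kernel name my_relu and its signature are hypothetical, chosen to illustrate where each piece lives:

```cpp
// 1. csrc/my_relu.cu — implement the CUDA kernel and its C++ entry point
//    (hypothetical example): void my_relu(at::Tensor& out, const at::Tensor& in);
// 2. include/sgl_kernel_ops.h — declare the same C++ entry point
// 3. csrc/common_extension.cc — register the op inside the TORCH_LIBRARY block:
m.def("my_relu(Tensor! out, Tensor in) -> ()");
m.impl("my_relu", torch::kCUDA, &my_relu);
// 4. CMakeLists.txt — add csrc/my_relu.cu to the CUDA source list
// 5. python/ — wrap torch.ops.sgl_kernel.my_relu in a Python-facing function
```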
Development Tips
- When implementing kernels in csrc, only define pure CUDA files and C++ interfaces. If you need to use torch::Tensor, use <torch/all.h> instead of <torch/extension.h>; using <torch/extension.h> will cause compilation errors when using SABI.
- When creating torch extensions, add the function definition with m.def, and the device binding with m.impl:
- torch.compile requires m.def with a schema, which lets it automatically capture the custom kernel. Reference: How to add FakeTensor
- How to write schema: Schema reference

// We need def with schema here for torch.compile
m.def(
    "bmm_fp8(Tensor A, Tensor B, Tensor! D, Tensor A_scale, Tensor B_scale, Tensor workspace_buffer, int "
    "cublas_handle, int cuda_stream) -> ()");
m.impl("bmm_fp8", torch::kCUDA, &bmm_fp8);
-
When exposing Python interfaces, avoid using kwargs in C++ interface kernels.
Avoid this:
torch.ops.sgl_kernel.apply_rope_pos_ids_cos_sin_cache.default(
    q=query.view(query.shape[0], -1, head_size),
    k=key.view(key.shape[0], -1, head_size),
    q_rope=query.view(query.shape[0], -1, head_size),
    k_rope=key.view(key.shape[0], -1, head_size),
    cos_sin_cache=cos_sin_cache,
    pos_ids=positions.long(),
    interleave=(not is_neox),
    cuda_stream=get_cuda_stream(),
)
Use this instead:
torch.ops.sgl_kernel.apply_rope_pos_ids_cos_sin_cache.default(
    query.view(query.shape[0], -1, head_size),
    key.view(key.shape[0], -1, head_size),
    query.view(query.shape[0], -1, head_size),
    key.view(key.shape[0], -1, head_size),
    cos_sin_cache,
    positions.long(),
    (not is_neox),
    get_cuda_stream(),
)
Integrating Third-Party Libraries with Data Type Conversion
When integrating new third-party libraries like flash-attention, you may encounter data type compatibility issues between the C++ interface and PyTorch bindings. For example, the third-party code might use float or int types, while PyTorch requires double and int64_t.
The reason we need double and int64_t in torch bindings is that TORCH_LIBRARY handles the Python-to-C++ conversion process. Python's float data type corresponds to double in C++, while Python's int corresponds to int64_t in C++.
To address this issue, we provide the make_pytorch_shim
function in sgl_kernel_torch_shim that handles data type conversions automatically.
When you need to support new data type conversions, you can easily add conversion functions like this:
// Map `int` -> `int64_t`
template <>
struct pytorch_library_compatible_type<int> {
using type = int64_t;
static int convert_from_type(int64_t arg) {
TORCH_CHECK(arg <= std::numeric_limits<int>::max(), "int64_t value is too large to be converted to int");
TORCH_CHECK(arg >= std::numeric_limits<int>::min(), "int64_t value is too small to be converted to int");
return static_cast<int>(arg);
}
};
To use this with your library functions, simply wrap them with make_pytorch_shim:
/*
* From flash-attention
*/
m.impl("fwd", torch::kCUDA, make_pytorch_shim(&mha_fwd));
Testing & Benchmarking
- Add pytest tests in tests/. If you need to skip some tests, use @pytest.mark.skipif:

@pytest.mark.skipif(
    skip_condition, reason="Nvfp4 Requires compute capability of 10 or above."
)
- Add benchmarks using triton benchmark in benchmark/
- Run test suite
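A minimal sketch of such a skipped test, assuming a hypothetical helper for the capability check (the real suite computes its own skip_condition from the device):

```python
import pytest


def compute_capability_below(major_required):
    """Hypothetical helper: True when no CUDA device of the required
    compute capability is available (so the test should be skipped)."""
    try:
        import torch
        major, _ = torch.cuda.get_device_capability()
        return major < major_required
    except Exception:
        return True  # no torch / no GPU: skip


@pytest.mark.skipif(
    compute_capability_below(10),
    reason="Nvfp4 Requires compute capability of 10 or above.",
)
def test_nvfp4_kernel():
    # Placeholder body; a real test would call the kernel and compare
    # its output against a reference implementation.
    assert True
```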
FAQ
- When encountering this error while compiling with ccache:

ImportError: /usr/local/lib/python3.10/dist-packages/sgl_kernel/common_ops.abi3.so: undefined symbol: _ZN3c108ListType3getERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4Type24SingletonOrSharedTypePtrIS9_EE

modify the build command as follows to resolve it:

python3 -m uv build --wheel -Cbuild-dir=build . --color=always --no-build-isolation
Release new version
Update version in pyproject.toml and version.py