dlblas
Overall Design
dlBLAS is dedicated to leveraging the latest techniques to achieve the best possible operator performance. For example, its EP_MoE operator uses cutting-edge industry technologies such as DeepEP and DeepGEMM to implement a highly efficient MoE module.
dlBLAS is meant to be an operator library for Triton-based operators. Kernel developers register their kernels with the library, and users request an operator by giving its name and input tensors.
It improves over Triton's autotuner in the following ways:
- Operator selection: for the same operator, e.g. matmul, there may be several kernel implementations; dlBLAS selects the best one based on the input tensors.
- Customized configuration search: instead of enumerating all possible kernel configurations (BLOCK_SIZE, etc.), dlBLAS can use an advanced algorithm, e.g. a Bayesian optimizer, to search for the best configuration. This requires a flexible definition of the search space and search policy; for DSA hardware, the configuration space is large.
- Caching: the best operator implementation and kernel configuration are cached for the given input tensors; the cache is shape-, dtype-, and device-specific (see the sketch after this list).
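To make the caching behavior concrete, here is a minimal sketch of how a shape-, dtype-, and device-specific cache key can be derived from the input tensors. This is an illustration only, not dlBLAS's actual internals; the `cache_key` helper is hypothetical.

```python
import torch


def cache_key(op_name: str, tensors: tuple) -> tuple:
    """Hypothetical illustration: build a hashable cache key from an op name
    and its input tensors, so a previously tuned kernel and configuration can
    be reused whenever the (shape, dtype, device) signature matches."""
    return (op_name, ) + tuple(
        (tuple(t.shape), t.dtype, str(t.device)) for t in tensors)


a = torch.randn(128, 64, dtype=torch.float16)
b = torch.randn(64, 32, dtype=torch.float16)
print(cache_key('matmul', (a, b)))
# ('matmul', ((128, 64), torch.float16, 'cpu'), ((64, 32), torch.float16, 'cpu'))
```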
Install

```bash
cd dlBLAS
python setup.py install
```
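Since wheels for dlblas are published on PyPI, it should also be installable with `pip install dlblas`, assuming a wheel matching your platform is available.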
Getting Started
There are a few ways to use dlblas kernels.
- get op from dlblas

```python
import argparse

import torch

from dlblas.utils import get_op

# parse the matrix dimensions from the command line
# (the default sizes here are illustrative)
parser = argparse.ArgumentParser()
parser.add_argument('--m', type=int, default=1024)
parser.add_argument('--n', type=int, default=1024)
parser.add_argument('--k', type=int, default=1024)
args = parser.parse_args()

dtype = torch.float16
device = 'cuda'
a = torch.randn((args.m, args.k), dtype=dtype, device=device)
b = torch.randn((args.k, args.n), dtype=dtype, device=device)

# select the best-matching matmul kernel for these inputs
matmul = get_op('matmul', (a, b))

# test against the PyTorch reference
out = matmul(a, b)
ref_out = a @ b
tol = {'atol': 1.0}
if torch.allclose(out, ref_out, **tol):
    print('✅ Triton and Torch match')
else:
    print('❌ Triton and Torch differ')
```
- import kernel functions from the kernel file

```python
from dlblas.kernels.rms_norm import rms_norm

rms_norm(...)
```
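For reference, RMS norm scales each vector by the reciprocal of its root-mean-square. A plain PyTorch equivalent of the computation (illustrative only, not dlblas's kernel signature; the eps default is an assumption) looks like:

```python
import torch


def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor,
                 eps: float = 1e-6) -> torch.Tensor:
    # x: (..., hidden); weight: (hidden,); eps is an assumed default
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight
```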
- import dlblas and use the kernels directly

```python
import dlblas

dlblas.topk_gating(...)
```
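As background, top-k gating routes each token to its k highest-scoring experts. A reference PyTorch version (illustrative only; this is not dlblas's topk_gating signature) might look like:

```python
import torch
import torch.nn.functional as F


def topk_gating_ref(logits: torch.Tensor, k: int = 2):
    # logits: (num_tokens, num_experts) router outputs
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)
    # renormalize the selected gates so they sum to 1 per token
    gates = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return gates, topk_idx
```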
Low-level APIs
| Kernel | API |
|---|---|
| silu_and_mul | `from dlblas.kernels.activation import silu_and_mul` |
| add_rms_norm | `from dlblas.kernels.add_rms_norm import call` |
| rotary_pos_emb | `from dlblas.kernels.apply_rotary_pos_emb import apply_rotary_pos_emb` |
| ffn | `from dlblas.kernels.ffn import call` |
| flash_attention_v2 | `from dlblas.kernels.flash_attention_v2 import FlashAttentionV2` |
| fp8_gemm | `from dlblas.kernels.fp8_gemm import fp8_gemm` |
| fused_rotary_and_fa | `from dlblas.kernels.fused_rotary_and_fa import FusedRotaryAndFA` |
| partial_rotary_emb | `from dlblas.kernels.partial_rotary_emb import PartialRotaryEmb` |
| topk_gating | `from dlblas.kernels.topk_gating import TopKGatingFunc` |
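As an example of what these kernels compute, silu_and_mul fuses a SiLU activation with an elementwise product over the two halves of a concatenated gate/up projection. A PyTorch reference (illustrative only, not the Triton kernel's signature) is:

```python
import torch
import torch.nn.functional as F


def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # x: (..., 2 * d) holding the concatenated gate and up projections
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up
```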