dlblas

Overall Design

dlBLAS aims to get the best possible performance out of operators by building on current high-performance technologies. For example, EP_MoE builds on DeepEP and DeepGemm to implement highly efficient MoE modules.

dlBLAS is meant to be an operator library for Triton-based operators. Kernel developers register their kernels with the library, and users request an operator by giving the operator name and the input tensors.
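The register-then-request flow can be pictured with a minimal sketch. Everything in it is hypothetical and for illustration only: `register_kernel`, `lookup`, and `_REGISTRY` are invented names, not dlBLAS's actual API, and nested Python lists stand in for tensors.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical registry: operator name -> list of kernel implementations.
_REGISTRY: Dict[str, List[Callable]] = defaultdict(list)


def register_kernel(op_name: str) -> Callable[[Callable], Callable]:
    """Decorator a kernel developer would use to register an implementation."""
    def decorator(fn: Callable) -> Callable:
        _REGISTRY[op_name].append(fn)
        return fn
    return decorator


def lookup(op_name: str, args: tuple) -> Callable:
    """Return a registered implementation for op_name.

    A real library would inspect args (shapes, dtypes, device) to choose;
    this sketch simply returns the first registered kernel.
    """
    candidates = _REGISTRY.get(op_name)
    if not candidates:
        raise KeyError(f"no kernel registered for operator {op_name!r}")
    return candidates[0]


@register_kernel("matmul")
def naive_matmul(a, b):
    # Plain nested-list matrix multiply, standing in for a Triton kernel.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# The user asks for an operator by name and input tensors, then calls it.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
matmul = lookup("matmul", (a, b))
result = matmul(a, b)
```

Keeping a list of implementations per operator name, rather than a single function, is what leaves room for choosing among several kernels for the same operator.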

It improves on Triton's autotuner in the following ways:

  • operator selection: given the same operator, e.g. matmul, there may be different kernel implementations; we want to find the best one based on the input tensors.

  • customized configuration search: instead of enumerating all possible kernel configurations (BLOCK_SIZE etc.), we want to use a more advanced algorithm, e.g. a Bayesian optimizer, to search for the best configuration. This requires a flexible definition of the search space and the search policy. For DSA hardware, the configuration space is large.

  • caching: the best operator implementation and kernel configuration are cached for the input tensors. The cache is specific to shape, dtype, and device.

Install

cd dlBLAS
python setup.py install

Getting Started

There are three ways to use dlblas kernels:

  1. get op from dlblas
import torch
from dlblas.utils import get_op

# parse_args() is a user-defined CLI parser (not shown) that supplies m, n, k
args = parse_args()
dtype = torch.float16
device = 'cuda'
a = torch.randn(
    (args.m, args.k),
    dtype=dtype,
    device=device,
)
b = torch.randn(
    (args.k, args.n),
    dtype=dtype,
    device=device,
)
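# request the 'matmul' operator by name, passing the input tensors
matmul = get_op('matmul', (a, b))
# compare the result against the PyTorch reference
out = matmul(a, b)
ref_out = a @ b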
tol = {
    'atol': 1.0,
}
if torch.allclose(out, ref_out, **tol):
    print('✅ Triton and Torch match')
else:
    print('❌ Triton and Torch differ')

  2. import kernel functions from the kernel file
from dlblas.kernels.rms_norm import rms_norm
rms_norm(...)

  3. import dlblas and use the kernels directly
import dlblas
dlblas.topk_gating(...)

Low-level APIs

Kernel               API
silu_and_mul         from dlblas.kernels.activation import silu_and_mul
add_rms_norm         from dlblas.kernels.add_rms_norm import call
rotary_pos_emb       from dlblas.kernels.apply_rotary_pos_emb import apply_rotary_pos_emb
ffn                  from dlblas.kernels.ffn import call
flash_attention_v2   from dlblas.kernels.flash_attention_v2 import FlashAttentionV2
fp8_gemm             from dlblas.kernels.fp8_gemm import fp8_gemm
fused_rotary_and_fa  from dlblas.kernels.fused_rotary_and_fa import FusedRotaryAndFA
partial_rotary_emb   from dlblas.kernels.partial_rotary_emb import PartialRotaryEmb
topk_gating          from dlblas.kernels.topk_gating import TopKGatingFunc


Built Distributions

dlblas-0.0.2-cp312-cp312-manylinux1_x86_64.whl (6.1 MB)

Uploaded CPython 3.12

dlblas-0.0.2-cp310-cp310-manylinux1_x86_64.whl (6.0 MB)

Uploaded CPython 3.10
