FlashNN: Triton Kernel Library for LLM Serving
Project description
FLASHNN: A Triton-Powered Kernel Library for LLM Serving
FLASHNN is a pioneering kernel library for Large Language Models (LLMs), providing high-performance GPU kernels optimized for LLM serving and inference, with comprehensive support for attention kernels and versatile quantization methods.
By harnessing the power of Triton, FLASHNN is engineered to integrate seamlessly with multiple hardware platforms, ensuring smooth operability and maximizing the utilization of hardware resources.
Features
- Comprehensive Support for Attention Kernels: FLASHNN offers extensive support for various types of attention mechanisms, enabling it to handle a wide array of LLM architectures with ease.
- Multiple Quantization Methods: FLASHNN incorporates multiple quantization techniques (int8, int4) aimed at reducing both the computational overhead and the memory footprint of LLMs, making it easier to deploy them in resource-constrained environments (see the sketch after this list).
- Low Runtime Overhead: The primary contributor to the performance gap observed with Triton kernels is runtime overhead. To mitigate this, FLASHNN implements an ahead-of-time kernel cache for Triton kernels, which significantly reduces that overhead.
- Production-Ready Performance: FLASHNN is meticulously optimized for production scenarios, delivering state-of-the-art performance that meets the demanding requirements of real-world applications.
- Smooth Portability across Multiple Hardware Platforms: Facilitated by the inherent design of the Triton language, FLASHNN simplifies the process of adapting LLM serving solutions to diverse computing environments.
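The dynamic int8 quantization mentioned above follows the standard per-token absmax scheme. The snippet below sketches that arithmetic in plain PyTorch for clarity; it is an illustration only, not FlashNN's fused Triton kernel:

```python
# Per-token dynamic int8 quantization, sketched in plain PyTorch.
# Illustrates the arithmetic only; FlashNN fuses this into Triton kernels.
import torch

def dynamic_quant_int8(x: torch.Tensor):
    # Per-row (per-token) scale: map each row's absolute max to 127.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(x / scale).clamp(-128, 127).to(torch.int8)
    return q, scale

def dequant_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

x = torch.randn(4, 8)
q, s = dynamic_quant_int8(x)
print((dequant_int8(q, s) - x).abs().max())  # small round-off error
```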
Compatibility
Supported Operators
Type | Operators
---|---
Gemm | A8W8, A16W4, A16W8 |
Attention | PagedAttention V1, PagedAttention V2, FlashAttention V2 |
Norm | LayerNorm, RMSNorm |
Quantization | DynamicQuant, LayerNormDequant, RMSNormDequant |
Embedding | RotaryEmbedding |
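To give a flavor of what the operators above look like in Triton, here is a minimal RMSNorm kernel. This is an illustrative sketch under simplifying assumptions (contiguous 2-D input, one row per program instance), not FlashNN's actual implementation:

```python
# A minimal Triton RMSNorm kernel: one program instance normalizes one row.
# Illustrative sketch only; not FlashNN's actual implementation.
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Load one row and compute in float32 for accuracy.
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: y = x / sqrt(mean(x^2) + eps) * weight
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x / rms * w
    tl.store(out_ptr + row * n_cols + cols,
             y.to(out_ptr.dtype.element_ty), mask=mask)

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Assumes a contiguous 2-D (tokens, hidden) tensor on the GPU.
    assert x.is_cuda and x.ndim == 2 and x.is_contiguous()
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(x.shape[1])
    rmsnorm_kernel[(x.shape[0],)](x, weight, out, x.shape[1], eps,
                                  BLOCK_SIZE=BLOCK_SIZE)
    return out
```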
Supported Platforms
FlashNN is tested to work on NVIDIA and AMD GPUs (e.g., A100, A10, H20, MI210).
Platform | float16 | float32 | bfloat16
---|---|---|---
NVIDIA A100 | ✓ | ✓ | ✓
NVIDIA A10 | ✓ | ✓ | ✓
NVIDIA H20 | ✓ | ✓ | ✓
AMD MI210 | ✓ | ✓ | ✓
Get Started
Requirements
FlashNN requires PyTorch and Triton.
Installation
FlashNN operators can serve as drop-in replacements for their function-equivalent PyTorch operators: simply replace the corresponding torch function with its FlashNN counterpart.
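For example, an RMSNorm call might be swapped as below. The FlashNN entry-point names here are assumptions for illustration, not the confirmed API; check the FlashNN sources for the actual function names:

```python
import torch

x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
w = torch.ones(4096, dtype=torch.float16, device="cuda")

# Plain PyTorch (torch >= 2.4 provides F.rms_norm).
y_ref = torch.nn.functional.rms_norm(x, (4096,), weight=w)

# Hypothetical FlashNN drop-in replacement; the exact module and
# function names are assumptions, not the confirmed FlashNN API.
# import flashnn
# y = flashnn.rms_norm(x, w)
```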
The binary wheel distribution (whl) will be available soon.
Benchmarks
License
The FLASHNN project is licensed under the Apache License 2.0.
File details
Details for the file flashnn-0.1.1-py3-none-any.whl.
File metadata
- Download URL: flashnn-0.1.1-py3-none-any.whl
- Upload date:
- Size: 60.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 926b068a851e0d2585b2b09b0a41d8c83c21a4ba1ea4492c3e5f6f40e3f02096
MD5 | 60e83a2dd4e67403cd1ca11f4d919047
BLAKE2b-256 | ad5703c0601034a6790760e1631a5c928c6c88f451f9d6bdcaff5cdebdadc7ca