Flash: Triton Kernel Library for LLM Serving

Project description

FLASHNN : A Triton-Powered Kernel Library for LLM Serving

FLASHNN is a pioneering kernel library for Large Language Models (LLMs), providing high-performance GPU kernels optimized for LLM serving and inference, with comprehensive support for attention kernels and versatile quantization methods.

By harnessing the power of Triton, FLASHNN is engineered to integrate seamlessly with multiple hardware platforms, ensuring smooth operability and maximizing the utilization of hardware resources.

Features

  • Comprehensive Support for Attention Kernels: FLASHNN offers extensive support for various types of attention mechanisms, enabling it to handle a wide array of LLM architectures with ease.
  • Multiple Quantization Methods: FLASHNN incorporates multiple quantization techniques (int8, int4) aimed at optimizing both the computational overhead and the memory footprint of LLMs, making it easier to deploy LLMs in resource-constrained environments.
  • Low Runtime Overhead: The primary contributor to the performance gap observed with Triton kernels is runtime overhead. To address this, FLASHNN implements an ahead-of-time kernel cache for Triton kernels, which significantly mitigates this overhead.
  • Production-Ready Performance: FLASHNN is meticulously optimized for production scenarios, delivering state-of-the-art performance that meets the demanding requirements of real-world applications.
  • Smooth Portability across Multiple Hardware Platforms: Facilitated by the inherent design of the Triton language, FLASHNN simplifies the process of adapting LLM serving solutions to diverse computing environments.
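To make the int8/int4 quantization bullet concrete, here is a minimal NumPy sketch of per-token dynamic int8 quantization, the scheme a kernel like DynamicQuant typically implements. This is reference semantics only, not FLASHNN's actual code; all function names here are illustrative.

```python
import numpy as np

def dynamic_quant_int8(x: np.ndarray):
    """Per-row (per-token) symmetric int8 quantization.

    Returns the quantized tensor and one float scale per row,
    so that x is approximately q * scale[:, None].
    """
    # Scale each row so its largest magnitude maps to 127.
    amax = np.abs(x).max(axis=-1, keepdims=True)
    scale = amax / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale.squeeze(-1)

def dequant_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original activations."""
    return q.astype(np.float32) * scale[:, None]

x = np.random.randn(4, 64).astype(np.float32)
q, s = dynamic_quant_int8(x)
x_hat = dequant_int8(q, s)
err = np.abs(x - x_hat).max()  # bounded by half a quantization step per row
```

Computing the scale per token at runtime (rather than calibrating offline) is what makes the scheme "dynamic": activations with very different ranges each get their own step size.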

Compatibility

Supported Operators

Type          Operators
Gemm          A8W8, A16W4, A16W8
Attention     PagedAttention V1, PagedAttention V2, FlashAttention V2
Norm          LayerNorm, RMSNorm
Quantization  DynamicQuant, LayerNormDequant, RMSNormDequant
Embedding     RotaryEmbedding
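For reference, the semantics of the RMSNorm operator listed above can be sketched in a few lines of NumPy. This shows what the kernel computes, not how FLASHNN's Triton implementation is written:

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Reference RMSNorm: scale each row by the reciprocal of its RMS.

    Unlike LayerNorm, no mean is subtracted and no bias is added,
    which makes it cheaper and a good fit for a fused GPU kernel.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.random.randn(2, 8).astype(np.float32)
w = np.ones(8, dtype=np.float32)
y = rms_norm(x, w)
# With an all-ones weight, each output row has RMS close to 1.
row_rms = np.sqrt(np.mean(y * y, axis=-1))
```

Fused variants such as RMSNormDequant combine this normalization with dequantization of int8 inputs in a single kernel launch, avoiding an extra round trip through GPU memory.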

Supported Platforms

FlashNN is tested on NVIDIA and AMD GPUs (e.g., A100, A10, H20, MI210).

Platforms     float16  float32  bfloat16
NVIDIA A100   ✓        ✓        ✓
NVIDIA A10    ✓        ✓        ✓
NVIDIA H20    ✓        ✓        ✓
AMD MI210     ✓        ✓        ✓

Get Started

Requirements

FlashNN requires PyTorch and Triton.

Installation

FlashNN operators are function-equivalent to their PyTorch counterparts, so a model can adopt them by simply replacing the corresponding torch function.
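The drop-in replacement pattern can be sketched as follows. Since the source does not document FlashNN's module layout, this example uses NumPy stand-ins for both the baseline and the optimized operator; only the swap mechanism itself is the point.

```python
import numpy as np

def softmax_reference(x: np.ndarray) -> np.ndarray:
    """Baseline operator (stands in for the torch implementation)."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_optimized(x: np.ndarray) -> np.ndarray:
    """Stand-in for a function-equivalent optimized kernel.

    Same math and same signature; a real kernel would differ
    only in how the computation is executed on the GPU.
    """
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class Model:
    # The model calls the operator through a rebindable attribute,
    # so swapping implementations is a one-line assignment.
    softmax = staticmethod(softmax_reference)

    def forward(self, x: np.ndarray) -> np.ndarray:
        return Model.softmax(x)

x = np.random.randn(2, 4)
m = Model()
before = m.forward(x)

# Drop-in replacement: rebind the operator to the optimized kernel.
Model.softmax = staticmethod(softmax_optimized)
after = m.forward(x)  # numerically identical result
```

Because the replacement is function-equivalent, the rest of the model is untouched and outputs match the baseline; only the kernel behind the call changes.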

The binary wheel distribution (whl) will be available soon.

Benchmarks

License

The FLASHNN project is licensed under the Apache License 2.0.
