
FlagGems is a function library written in Triton.

Project description

Chinese version


Introduction

FlagGems is a high-performance general operator library implemented in OpenAI Triton. It builds on a collection of backend-neutral kernels that aim to accelerate LLM training and inference across diverse hardware platforms.

By registering with PyTorch's ATen backend, FlagGems enables a seamless transition: model developers can switch to Triton without changing low-level APIs, continuing to use familiar PyTorch interfaces while benefiting from new hardware acceleration technologies. For kernel developers, the Triton language offers readability, ease of use, and performance comparable to CUDA, so they can contribute to FlagGems with minimal learning investment.
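In practice, activating FlagGems is a one-line change on top of existing PyTorch code. The sketch below assumes the activation entry point described in the project documentation, flag_gems.enable(); the tensor shapes are illustrative.

    import torch
    import flag_gems

    # Globally route supported ATen operators to FlagGems Triton kernels.
    flag_gems.enable()

    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    y = torch.mm(x, x)  # dispatched to the FlagGems mm kernel, same PyTorch API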

We have created a WeChat group for FlagGems. Scan the QR code below to join the group chat, get first-hand news about our updates and new releases, and share any questions or ideas!

[QR code: FlagGems WeChat group]

Features

FlagGems provides the following technical features.

  • A large collection of PyTorch compatible operators
  • Hand-optimized performance for selected operators
  • Eager mode ready, independent of torch.compile
  • Automatic pointwise operator codegen supporting arbitrary input types and layouts
  • Fast per-function runtime kernel dispatching
  • Multi-backend interface enabling support of diverse hardware platforms
  • Over 10 supported backends
  • C++ Triton function dispatcher (work in progress)

More About Features

Multi-Backend Hardware Support

FlagGems supports a wide range of hardware platforms and has been extensively tested across different hardware configurations.

Automatic Codegen

FlagGems provides an automatic code generation mechanism that enables developers to easily generate both pointwise and fused operators. The auto-generation system supports a variety of needs, including standard element-wise computations, non-tensor parameters, and specifying output types. For more details, please refer to pointwise_dynamic.
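As an illustration of this interface, the sketch below is modeled on the pointwise_dynamic pattern in the project sources; the import path and decorator arguments (is_tensor, promotion_methods) are assumptions that may differ between versions.

    import triton
    from flag_gems.utils import pointwise_dynamic  # assumed import path

    # Only the per-element computation is written by hand; the generated
    # wrapper handles broadcasting, input dtypes, and memory layouts.
    # is_tensor marks alpha as a non-tensor (scalar) parameter.
    @pointwise_dynamic(is_tensor=[True, True, False],
                       promotion_methods=[(0, 1, "DEFAULT")])
    @triton.jit
    def add_func(x, y, alpha):
        return x + y * alpha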

LibEntry

FlagGems introduces LibEntry, which manages the kernel cache independently and bypasses the runtime machinery of Triton's Autotuner, Heuristics, and JITFunction. To use it, simply decorate the Triton kernel with LibEntry.

LibEntry can also directly wrap an Autotuner, Heuristics, or JITFunction, preserving full tuning functionality while avoiding nested runtime invocations and redundant parameter processing: no argument binding or type wrapping is needed, which simplifies the cache key format and eliminates unnecessary key computation.
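A minimal sketch of the decoration pattern, assuming libentry is importable from flag_gems.utils as in the project sources:

    import triton
    import triton.language as tl
    from flag_gems.utils import libentry  # assumed import path

    # libentry() manages this kernel's cache itself, so cache hits skip
    # Triton's JITFunction dispatch; it can equally wrap a kernel that
    # also carries @triton.autotune or @triton.heuristics.
    @libentry()
    @triton.jit
    def add_kernel(X, Y, Z, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(X + offsets, mask=mask)
        y = tl.load(Y + offsets, mask=mask)
        tl.store(Z + offsets, x + y, mask=mask)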

C++ Runtime

FlagGems can be installed either as a pure Python package or as a package with C++ extensions. The C++ runtime is designed to address the overhead of the Python runtime and improve end-to-end performance.

Changelog

v3.0

  • support 184 operators in total, including custom operators used in large-model inference
  • support more hardware platforms, with Ascend, AIPU, and others newly added
  • compatible with the vLLM framework; inference verification passed for the DeepSeek model

v2.1

  • support Tensor operators: where, arange, repeat, masked_fill, tile, unique, index_select, masked_select, ones, ones_like, zeros, zeros_like, full, full_like, flip, pad
  • support neural network operator: embedding
  • support basic math operators: allclose, isclose, isfinite, floor_divide, trunc_divide, maximum, minimum
  • support distribution operators: normal, uniform_, exponential_, multinomial, nonzero, topk, rand, randn, rand_like, randn_like
  • support science operators: erf, resolve_conj, resolve_neg

v2.0

  • support BLAS operators: mv, outer
  • support pointwise operators: bitwise_and, bitwise_not, bitwise_or, cos, clamp, eq, ge, gt, isinf, isnan, le, lt, ne, neg, or, sin, tanh, sigmoid
  • support reduction operators: all, any, amax, argmax, max, min, prod, sum, var_mean, vector_norm, cross_entropy_loss, group_norm, log_softmax, rms_norm
  • support fused operators: fused_add_rms_norm, skip_layer_norm, gelu_and_mul, silu_and_mul, apply_rotary_position_embedding

v1.0

  • support BLAS operators: addmm, bmm, mm
  • support pointwise operators: abs, add, div, dropout, exp, gelu, mul, pow, reciprocal, relu, rsqrt, silu, sub, triu
  • support reduction operators: cumsum, layernorm, mean, softmax

Getting Started

For a quick start with installing and using flag_gems, please refer to the GetStart documentation.
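As a minimal quick-start sketch (assuming installation via pip install flag_gems and a CUDA device; the scoped flag_gems.use_gems() context manager limits acceleration to a single code region):

    import torch
    import flag_gems

    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")

    # Inside this scope, supported ATen calls run FlagGems kernels;
    # outside it, PyTorch behaves exactly as before.
    with flag_gems.use_gems():
        c = torch.mm(a, b)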

Supported Operators

Operators will be implemented according to OperatorList.

Example Models

  • Bert-base-uncased
  • Llama-2-7b
  • Llava-1.5-7b

Supported Platforms

vendor       state
aipu         ✅ (partial support)
ascend       ✅ (partial support)
cambricon    ✅
hygon        ✅
iluvatar     ✅
kunlunxin    ✅
metax        ✅
mthreads     ✅
nvidia       ✅
arm(cpu)     🚧
tsingmicro   🚧

Performance

The following chart shows the speedup of FlagGems over the PyTorch ATen library in eager mode. The speedup is calculated by averaging the speedups measured on each shape, representing the overall performance of the operator.

[Chart: Operator Speedup]
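To make the stated methodology concrete, here is a small worked example; the latency numbers are hypothetical:

    # Hypothetical per-shape latencies (ms) for one operator.
    aten_ms = [0.42, 1.10, 3.80]
    gems_ms = [0.30, 0.85, 2.60]

    # Speedup per shape, then averaged across shapes.
    speedups = [a / g for a, g in zip(aten_ms, gems_ms)]
    overall = sum(speedups) / len(speedups)  # ~1.39x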

Contributions

If you are interested in contributing to the FlagGems project, please refer to CONTRIBUTING.md. Any contributions would be highly appreciated.

Citation

If you find our work useful, please consider citing our project:

@misc{flaggems2024,
    title={FlagOpen/FlagGems: FlagGems is an operator library for large language models implemented in the Triton language.},
    url={https://github.com/FlagOpen/FlagGems},
    journal={GitHub},
    author={BAAI FlagOpen team},
    year={2024}
}

Contact us

If you have any questions about our project, please submit an issue, or contact us through flaggems@baai.ac.cn.

License

The FlagGems project is licensed under Apache 2.0.
