
Triton kernel repository

Project description

Conch :shell:

A "standard library" of Triton kernels.

What is Conch?

Conch is a central repository of Triton kernels for accelerating common AI operations. We strive to provide performant, well-written kernels that can be easily integrated into other projects. We also strive to support multiple hardware platforms (currently Nvidia and AMD).

Key Features

We support the following operations. Each ships with a PyTorch-only reference implementation (and sometimes a reference implementation from another library, such as vLLM), a microbenchmark, and a unit test.

  • Activation functions
    • GeLU and mul
    • SiLU and mul
  • Attention
    • Paged Attention (Flash-Decoding with Paged KV Cache)
    • Varlen Attention (Prefill/decode attention with paged KV cache)
  • Embedding
    • Rotary embedding
  • Normalization
    • Gemma-style RMS norm
    • Llama-style RMS norm
  • Quantization
    • bitsandbytes
      • NF4/FP4/8-bit blockwise quantize/dequantize
    • FP8 static quantization
    • Int8 static quantization
    • GEMM
      • Mixed-precision
      • Scaled
  • vLLM
    • KV cache operations
      • Copy blocks
      • Reshape and cache
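To make the semantics of a few of the operations above concrete, here is an illustrative NumPy sketch of what the reference implementations compute. This is not Conch's API (the function names are ours), and real implementations operate on PyTorch tensors on-device; the key detail shown is that Gemma-style RMS norm scales by `(1 + weight)` while Llama-style scales by `weight`:

```python
import numpy as np

def silu_and_mul(x: np.ndarray) -> np.ndarray:
    """SiLU(x[..., :d]) * x[..., d:], where d is half the last dimension."""
    d = x.shape[-1] // 2
    a, b = x[..., :d], x[..., d:]
    return (a / (1.0 + np.exp(-a))) * b  # SiLU(a) = a * sigmoid(a)

def rms_norm_llama(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Llama-style RMS norm: normalized activations scaled by `weight`."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def rms_norm_gemma(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Gemma-style RMS norm: same normalization, but scaled by (1 + weight)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * (1.0 + weight)

def int8_static_quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Static int8 quantization: divide by a precomputed scale, round, clamp."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)
```

Note that a Gemma checkpoint initialized with zero norm weights is equivalent to a Llama-style norm with unit weights, which is why the two variants need separate kernels.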

Performance

The goal of Conch is not to claim that our operations are faster than CUDA implementations, but to write Triton operations that are as fast as state-of-the-art CUDA implementations. This gives developers on any hardware platform (Nvidia, AMD, etc.) access to the same performant kernels.

Below is a table comparing the relative performance of our Triton kernels to CUDA baselines on an Nvidia A10. The listed runtime is the median of 10,000 iterations of our microbenchmarks. Note: it's difficult to express the performance of a kernel with a single number (performance varies with input sizes, data types, etc.), so we tried to choose representative parameters for a fair comparison. Most relevant parameters are exposed as CLI arguments to the microbenchmarks (benchmarks/), so feel free to collect your own results for your use case. CUDA runtimes were collected via vLLM (vllm==0.8.5) and bitsandbytes (bitsandbytes==0.45.5).

| Operation | CUDA Runtime | Triton Runtime | Triton Speedup |
| --- | --- | --- | --- |
| GeLU, Tanh, and Mul | 2.835 ms | 2.851 ms | 0.99 |
| SiLU and Mul | 0.260 ms | 0.209 ms | 1.24 |
| Paged Attention | 0.374 ms | 0.344 ms | 1.09 |
| Rotary Embedding | 0.579 ms | 0.600 ms | 0.96 |
| RMS Norm (Gemma-style) | 1.392 ms | 0.141 ms | 9.87 |
| RMS Norm (Llama-style) | 0.117 ms | 0.072 ms | 1.63 |
| bitsandbytes: Dequantize | 0.175 ms | 10.950 ms | 0.02 |
| bitsandbytes: Quantize | 0.671 ms | 12.667 ms | 0.05 |
| Int8 Static Quantization | 0.167 ms | 0.164 ms | 1.02 |
| Scaled GEMM [Int8 x BF16] | 2.130 ms | 4.441 ms | 0.48 |
| vLLM: Copy Blocks | 8.550 ms | 9.933 ms | 0.86 |
| vLLM: Reshape and Cache | 0.245 ms | 0.024 ms | 10.21 |
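The median-of-10,000-iterations methodology can be sketched as a simple host-side timing harness. This plain-Python version is an illustrative stand-in, not the actual benchmarks/ implementation; a real GPU microbenchmark would also synchronize the device (e.g. torch.cuda.synchronize()) around each timed call so that asynchronous kernel launches are measured correctly:

```python
import statistics
import time

def bench(fn, *args, iters: int = 10_000, warmup: int = 100) -> float:
    """Return the median wall-clock runtime (seconds) of fn(*args)."""
    for _ in range(warmup):  # warm up caches / autotuning before timing
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

The median is preferred over the mean here because it is robust to one-off outliers such as page faults or clock-frequency ramps.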

For additional analysis of kernel performance, check out our performance docs.

Supported platforms

  • Nvidia A10, CUDA 12.2
  • Nvidia H100, CUDA 12.2
  • AMD MI300X, ROCm 6.2.4

Work-in-progress platforms:

Getting Started

Users

Check out the installation instructions to get started!

Developers

Check out the developer instructions to get started!

Open-source credits

We were inspired by and leverage components of the following libraries:

License

Copyright 2025 Stack AV Co. Licensed under the Apache License, Version 2.0.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

conch_triton_kernels-1.0.1.tar.gz (110.4 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

conch_triton_kernels-1.0.1-py3-none-any.whl (96.4 kB)

Uploaded Python 3

File details

Details for the file conch_triton_kernels-1.0.1.tar.gz.

File metadata

  • Download URL: conch_triton_kernels-1.0.1.tar.gz
  • Upload date:
  • Size: 110.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for conch_triton_kernels-1.0.1.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | dea1e7301f3e7b6472d2ee50a22865ccf84b4bd409ca5efe4b4a9d48109407a7 |
| MD5 | 29166beadbbc615ff170cce8efb7ce92 |
| BLAKE2b-256 | 83c3ec80ec74753029c767163e475957d0f50f6bdbad5ff6aeb1acbad5f60385 |

See more details on using hashes here.
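You can verify a downloaded file against the published digests before installing. A minimal sketch using Python's standard hashlib (the filename is the source distribution listed above):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the SHA256 digest published above:
# sha256_of_file("conch_triton_kernels-1.0.1.tar.gz") should equal
# "dea1e7301f3e7b6472d2ee50a22865ccf84b4bd409ca5efe4b4a9d48109407a7"
```

Alternatively, pip's hash-checking mode (`pip install --require-hashes -r requirements.txt`) performs this verification automatically at install time.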

File details

Details for the file conch_triton_kernels-1.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for conch_triton_kernels-1.0.1-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | f7bc27e2431a19f7e9daee853918af6d9c8ad4de6b5ccfe1798ba8d5adc8efc2 |
| MD5 | 04788e3e465b47c90a789349f25ef387 |
| BLAKE2b-256 | 63582a3e0d3ec6a32c5c98ee7a58c50dcceb69a9df61329e457eae4036b23e31 |

See more details on using hashes here.
