Triton kernel repository

Conch :shell:

A "standard library" of Triton kernels.

What is Conch?

Conch is a central repository of Triton kernels for accelerating common AI operations. We strive to provide performant, well-written kernels that can be easily integrated into other projects. We also strive to support multiple hardware platforms (currently Nvidia and AMD).

Key Features

Conch supports the following operations. Each one comes with a PyTorch-only reference implementation (and sometimes a reference implementation from another library, such as vLLM), a microbenchmark, and a unit test.

Activation functions
- GeLU and mul
- SiLU and mul
Attention
- Paged Attention (Flash-Decoding with Paged KV Cache)
Embedding
- Rotary embedding
Normalization
- Gemma-style RMS norm
- Llama-style RMS norm
Quantization
- bitsandbytes
  - NF4/FP4/8-bit blockwise quantize/dequantize
- FP8 static quantization
- Int8 static quantization
- GEMM
  - Mixed-precision
  - Scaled
vLLM
- KV cache operations
  - Copy blocks
  - Reshape and cache
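
To make the operation names above concrete, here is a minimal, dependency-free sketch of what two of them compute. It is not Conch's code: Conch's references are written in PyTorch, and the function names here (`silu_and_mul`, `rms_norm`) are hypothetical. The sketch assumes the common gated-activation layout in which the input row is split in half (gate first, then up-projection), and assumes that "Gemma-style" RMS norm differs from "Llama-style" only by scaling with `1 + weight` instead of `weight`.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_and_mul(row: list[float]) -> list[float]:
    """Gated activation over one row: silu(first half) * second half.

    Assumption: the row holds [gate..., up...], each of length len(row) // 2.
    """
    d = len(row) // 2
    return [silu(gate) * up for gate, up in zip(row[:d], row[d:])]


def rms_norm(
    row: list[float],
    weight: list[float],
    eps: float = 1e-6,
    gemma_style: bool = False,
) -> list[float]:
    """RMS normalization of one row.

    Llama-style scales by `weight`; Gemma-style (as assumed here) scales
    by `1 + weight`.
    """
    mean_sq = sum(v * v for v in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    offset = 1.0 if gemma_style else 0.0
    return [v * inv_rms * (w + offset) for v, w in zip(row, weight)]
```

Under these assumptions, a Gemma-style norm with all-zero weights gives the same result as a Llama-style norm with all-one weights, which is a handy sanity check when comparing a kernel against a reference.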

Performance

The goal of Conch is not to claim that our operations are faster than CUDA implementations. Our goal is to write Triton operations that are as fast as the state-of-the-art CUDA implementations. This gives developers on any hardware platform (Nvidia, AMD, etc.) access to the same performant kernels.

Below is a table comparing the performance of our Triton kernels to CUDA baselines on an H100. The listed runtime is the median over 10,000 iterations of our microbenchmarks. Triton Speedup is the CUDA runtime divided by the Triton runtime, so values above 1 mean the Triton kernel is faster. CUDA runtimes were collected via vLLM and bitsandbytes (vllm==0.6.4 and bitsandbytes==0.45.4).

Note: it's difficult to express the performance of a kernel with a single number, because performance varies with input sizes, data types, and so on. We tried our best to choose representative parameters for a fair comparison. Most relevant parameters can be set via CLI parameters to the microbenchmarks (benchmarks/), so feel free to collect your own results based on your use case.

Operation	CUDA Runtime	Triton Runtime	Triton Speedup
GeLU, Tanh, and Mul	0.493 ms	0.466 ms	1.06
SiLU and Mul	0.063 ms	0.047 ms	1.34
Paged Attention	0.090 ms	0.083 ms	1.08
Rotary Embedding	0.107 ms	0.103 ms	1.04
RMS Norm (Gemma-style)	0.392 ms	0.029 ms	13.52
RMS Norm (Llama-style)	0.044 ms	0.018 ms	2.44
bitsandbytes: Dequantize	0.074 ms	4.487 ms	0.02
bitsandbytes: Quantize	0.377 ms	4.819 ms	0.08
FP8 Static Quantization	0.035 ms	0.090 ms	0.39
Int8 Static Quantization	0.056 ms	0.094 ms	0.60
Mixed-precision GEMM [Int4 x FP16]	0.432 ms	1.437 ms	0.30
Scaled GEMM [Int8 x BF16]	0.204 ms	0.285 ms	0.72
vLLM: Copy Blocks	2.231 ms	1.807 ms	1.23
vLLM: Reshape and Cache	0.057 ms	0.010 ms	5.70
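
The speedup values in the table are consistent with dividing the CUDA runtime by the Triton runtime. As a rough illustration of that arithmetic and of the "median over many iterations" methodology, here is a small sketch. The helper names (`median_runtime_ms`, `speedup`) are hypothetical and are not part of Conch's benchmarks; the timer is CPU wall-clock, so timing a real GPU kernel would additionally need device synchronization, which this sketch omits.

```python
import statistics
import time


def median_runtime_ms(fn, iterations: int = 10_000, warmup: int = 10) -> float:
    """Median wall-clock runtime of fn() in milliseconds.

    Assumption: fn is a synchronous callable. GPU kernels launch
    asynchronously and would need explicit synchronization around the timer.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio above 1 means the candidate is faster than the baseline."""
    return baseline_ms / candidate_ms


if __name__ == "__main__":
    # Runtimes taken from the "SiLU and Mul" row of the table.
    print(round(speedup(0.063, 0.047), 2))
```

The median is used rather than the mean so that occasional slow iterations do not dominate the reported number.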

For additional analysis of kernel performance, check out our performance docs.

Supported platforms

Supported platforms:

Nvidia A10, CUDA 12.2
Nvidia H100, CUDA 12.2
AMD MI300X, ROCm 6.2.2

Work-in-progress platforms:

Getting Started

Users

Check out the installation instructions to get started!

Developers

Check out the developer instructions to get started!

Open-source credits

We were inspired by and leverage components of the following libraries:

License

Release history

- 1.3 (Sep 5, 2025)
- 1.2.1 (Jun 18, 2025)
- 1.2.0 (Jun 13, 2025)
- 1.1.0 (Jun 12, 2025)
- 1.0.1 (Jun 10, 2025)
- 1.0.0 (Jun 6, 2025)
- 0.0.1 (Apr 22, 2025), this version

Download files

Download the file for your platform.

Source Distribution

conch_triton_kernels-0.0.1.tar.gz (95.5 kB)

Uploaded Apr 22, 2025 Source

Built Distribution

conch_triton_kernels-0.0.1-py3-none-any.whl (80.3 kB)

Uploaded Apr 22, 2025 Python 3

File details

Details for the file conch_triton_kernels-0.0.1.tar.gz.

File metadata

Download URL: conch_triton_kernels-0.0.1.tar.gz
Upload date: Apr 22, 2025
Size: 95.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for conch_triton_kernels-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`3006013cc0df0171d8695d2d938a67d49bd9def814b79de78b318ff4c64e050a`
MD5	`a9ea853f9fddadab8037ac1e0e9cff60`
BLAKE2b-256	`4768d131b557e3166d25972292a2fb99e9b9a1ef87fd8e27bb9875b5a2815b1b`

See more details on using hashes here.
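
One way to use the published digests is to hash a downloaded file locally and compare the result to the SHA256 value listed above. The sketch below uses only Python's standard `hashlib`; the helper names and the local file path are hypothetical, and the expected digest is copied from the table for the source distribution.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex: str) -> bool:
    """True if the file's SHA256 equals the published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the file was downloaded.
    sdist = Path("conch_triton_kernels-0.0.1.tar.gz")
    expected = "3006013cc0df0171d8695d2d938a67d49bd9def814b79de78b318ff4c64e050a"
    if sdist.exists():
        print("hash OK" if matches_published_hash(sdist, expected) else "hash MISMATCH")
```

pip can perform the same check at install time through its hash-checking mode, which is usually more convenient than a manual comparison.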

File details

Details for the file conch_triton_kernels-0.0.1-py3-none-any.whl.

File metadata

Download URL: conch_triton_kernels-0.0.1-py3-none-any.whl
Upload date: Apr 22, 2025
Size: 80.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for conch_triton_kernels-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`60d370bac6e77cbbf799bfc2bda986304bd87853322e684ebd78e0c3aae2b004`
MD5	`11b4af83150b227c02bfadebc06f0a5b`
BLAKE2b-256	`510c0d0148080bf477e581988e51173ad5eb73d04641a60277428f8864115048`

See more details on using hashes here.

conch-triton-kernels 0.0.1
