Goals of this repo:
- Catalog openly available Triton kernels, so that
  - (i) practitioners save work, and
  - (ii) learners have world-class examples to learn from.
- Surface which Triton kernels the community still needs, so that
  - (i) the community's work can be better targeted, and
  - (ii) eager newcomers to our community have projects with which to demonstrate their skill.
Triton is easier to understand and get started with than CUDA, especially for Python programmers experienced with PyTorch. It has also seen a recent uptick in usage -- there is more Triton code around the web than ever before!
This repo collects these kernels in one place, for the benefit of practitioners and learners in our community.
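If you have never seen a Triton kernel, here is a minimal sketch of what one looks like: a vector-add kernel in the style of the official Triton tutorials. The names (`add_kernel`, `add`, `BLOCK_SIZE=1024`) are illustrative choices for this example, not taken from any kernel in the catalog.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # Launch a 1D grid with one program per block of elements.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```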
Contributions are very, very welcome!
- Here's the link to a ctrl-f searchable overview
- The kernel entries are in the /kernels directory
- A Practitioner's Guide to Triton is a great gentle intro to Triton (here's the accompanying notebook).
- The Triton Tutorials are a great intermediate resource.
- Flash Attention has a number of useful Triton kernels.
- Unsloth contains many ready-to-use Triton kernels, especially for finetuning applications.
- flash-linear-attention has a massive number of linear attention and subquadratic attention replacement architectures, written using several different parallelization approaches in Triton.
- xformers contains many Triton kernels throughout, including attention kernels such as Flash-Decoding.
- Applied AI contains kernels such as a Triton MoE and fused softmax for training and inference.
- AO contains kernels for GaLore, HQQ and DoRA.
- torch.compile can generate Triton kernels from PyTorch code (see the sketch after this list).
- Liger Kernel is a collection of Triton kernels designed specifically for LLM training.
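As a quick illustration of the torch.compile route mentioned above, the sketch below compiles a small PyTorch function and uses PyTorch's `TORCH_LOGS` logging switch to print the code TorchInductor generates for it. The `fused_gelu` function is just a made-up example; any pointwise PyTorch function on a CUDA tensor would do.

```python
# Run as:  TORCH_LOGS="output_code" python this_script.py
# to have torch.compile / TorchInductor print the Triton kernels it generates.
# Requires a CUDA GPU.
import torch


def fused_gelu(x):
    # Tanh-approximation GELU, written out so Inductor fuses it into one kernel.
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * (x + 0.044715 * x * x * x)))


compiled = torch.compile(fused_gelu)
x = torch.randn(4096, device="cuda")
print(compiled(x).sum())
```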
To add a new entry:
- Copy kernels/0000_template.md and fill it out
- Add your entry to kernel_overview.md
- That's it!
Brought to you by the community, initiated by Hailey and Umer ❤️