HazyResearch / ThunderKittens
Tile primitives for speedy kernels
See what the GitHub community is most excited about today.
GPU accelerated decision optimization
Fast parallel CTC.
PyTorch bindings for CUTLASS grouped GEMM.
CUDA accelerated rasterization of gaussian splatting
RCCL Performance Benchmark Tests
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
NCCL Tests
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
CUDA Kernel Benchmarking Library
Instant neural graphics primitives: lightning fast NeRF and more
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Graphics Processing Units Molecular Dynamics
LLM training in simple, raw C/CUDA
Distributed multigrid linear solver library on GPU
cuGraph - RAPIDS Graph Analytics Library
FlashInfer: Kernel Library for LLM Serving
cuVS - a library for vector search and clustering on the GPU