About:

A blog about things the author is interested in.

This blog post provides an analysis of the mathematical interpretation of scale factors in Blackwell kernels for NVFP4, focusing on the association of 8-bit scale tensors with the correct layout. It discusses the similarities betw...
This blog post discusses various methods to parallelize the reduction over the K-mode in the context of the GEMV (general matrix-vector multiplication) competition. It provides a detailed explanation of the CuTe reference implemen...
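The summary is truncated, but the core idea of parallelizing a K-mode reduction can be sketched independently of CuTe: split the K dimension into slices, let each worker reduce its slice, then combine the partial sums. This is a minimal NumPy illustration of that split-K pattern, not the post's kernel.

```python
import numpy as np

def split_k_gemv(A, x, num_splits=4):
    """Compute y = A @ x by splitting the reduction over K into
    independent partial sums, mimicking how parallel workers each
    reduce their own K-slice before a final combine step."""
    M, K = A.shape
    bounds = np.linspace(0, K, num_splits + 1, dtype=int)
    # Each "worker" reduces its own K-slice independently.
    partials = [A[:, s:e] @ x[s:e] for s, e in zip(bounds[:-1], bounds[1:])]
    # Final combine: sum the partial results.
    return np.sum(partials, axis=0)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 64))
x = rng.standard_normal(64)
assert np.allclose(split_k_gemv(A, x), A @ x)
```

On a GPU the "workers" would be threads, warps, or CTAs, and the final combine would itself be a parallel reduction; the slicing-then-combine structure is the same.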
This post presents a mathematical identity that improves the efficiency of the GDN prefill algorithm, yielding an 18% speedup in the implementation.
The post elucidates the Chunkwise Gated Delta Rule, focusing on its mathematical derivation and application in Gated Delta Attention for improved computational efficiency.
Pipelining techniques in GPU programming using CuTeDSL on the Blackwell architecture enhance performance by overlapping memory transfers and computations through asynchronous operations.
The blog post discusses Chapter 3 of a paper by Colfax on the categorical foundations of CuTe, focusing on a new approach to understanding CuTe Layouts through categories and morphisms. It provides examples and calculations to ill...
This blog post provides an overview of swizzling in the context of performant GEMM kernels, specifically focusing on its implementation in CuTeDSL. It explains how swizzling addresses shared memory bank conflicts and details the c...
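CuTeDSL exposes parameterized swizzle functors, and the post's exact scheme is not visible from this summary; the following is a minimal, generic illustration of the underlying idea, using a classic XOR swizzle and an assumed 32-bank shared memory model to show how a column access that would conflict becomes conflict-free.

```python
NUM_BANKS = 32  # bank count assumed here for illustration

def bank(word_offset):
    # The bank of a 4-byte word is its word offset modulo the bank count.
    return word_offset % NUM_BANKS

def swizzled_col(row, col):
    # Classic XOR swizzle: permute columns within each row so a single
    # column of a 32x32 tile no longer maps to a single bank.
    return col ^ (row % NUM_BANKS)

# Without swizzling, 32 threads reading column 0 of rows 0..31 of a
# row-major 32x32 tile all hit bank 0: a 32-way conflict.
naive = {bank(row * 32 + 0) for row in range(32)}
# With the XOR swizzle, each of those accesses lands in a distinct bank.
swizzled = {bank(row * 32 + swizzled_col(row, 0)) for row in range(32)}

assert len(naive) == 1      # every access in the same bank
assert len(swizzled) == 32  # conflict-free
```

Real swizzle functors (e.g. CuTe's `Swizzle<B, M, S>`) generalize this by choosing which bit ranges are XORed together, but the goal is the same: spread simultaneous accesses across banks.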
The blog post discusses various methods of tiling tensors in the CuTe DSL for GPU programming, specifically focusing on three partitioning strategies: Inner Partition, Outer Partition, and Thread Value Partition. It provides detai...
The post explores the kernel of a grouped blockscaled GEMM on B200 GPUs, highlighting how the CuTeDSL implementation differs from traditional approaches.
Warp specialization in CuTeDSL enhances GEMM performance by separating tile memory access and matrix multiplication tasks across two warps.
This post optimizes Gated Delta Net decoding workloads for the Flashinfer Competition, developing a detailed mathematical framework and a PyTorch implementation.
This blog post discusses the implementation of a simple RMSNorm kernel using CuTeDSL, showcasing how to perform reduction in GPU programming. It explains the mathematical foundation of RMSNorm, provides a step-by-step guide to imp...
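The mathematical foundation the summary mentions is compact enough to state directly: RMSNorm divides each vector by the root mean square of its entries (plus a small epsilon) and applies a learned per-channel scale. This NumPy reference is a sketch of that definition, not the CuTeDSL kernel from the post.

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps) * weight,
    reducing over the last axis as is conventional."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[3.0, 4.0]])
w = np.ones(2)
out = rmsnorm(x, w)
# mean(x^2) = 12.5, so rms ≈ 3.5355 and out ≈ [0.8485, 1.1314]
assert np.allclose(out, x / np.sqrt(12.5 + 1e-6))
```

The GPU kernel's job is essentially to compute the `mean(x * x)` reduction cooperatively across threads; the elementwise scale afterwards is trivially parallel.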
The blog post discusses the implementation of vector primitive conversions in CuTeDSL, specifically focusing on converting packed data types like FP8 and FP4 to Float32 and Float16 formats. The author provides code examples and ex...
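For context on what such a conversion must do: FP4 in the NVFP4 sense is the E2M1 format, whose eight magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, 6 plus a sign bit. This scalar Python sketch decodes one packed byte (two nibbles); the nibble ordering is an assumption, and it deliberately ignores the vectorized CuTeDSL conversion path the post actually covers.

```python
# Magnitudes of E2M1 codes 0..7 (sign bit + 2 exponent bits + 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 value to a Python float."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0b0111]

def unpack_fp4_byte(byte, low_nibble_first=True):
    """Unpack one byte holding two FP4 values. The nibble order here is
    an assumption; the real packing convention depends on the kernel."""
    lo, hi = decode_e2m1(byte & 0xF), decode_e2m1(byte >> 4)
    return (lo, hi) if low_nibble_first else (hi, lo)

assert decode_e2m1(0b0001) == 0.5
assert decode_e2m1(0b1111) == -6.0
assert unpack_fp4_byte(0x21) == (0.5, 1.0)
```

A fast GPU conversion replaces this table lookup with bit manipulation over whole registers of packed values, but the mapping it must realize is exactly this one.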
This blog post provides an alternative explanation of MMA Atoms, which are fundamental components in CuTe kernels. It includes a detailed examination of a code example that prints a LaTeX file for visualizing MMA Atoms, along with...
This blog post provides a detailed analysis of the setup process for a blockscaled GEMM (General Matrix Multiply) kernel using the CuTeDSL framework. It covers the initialization of various attributes, compatibility checks for inp...
The blog post discusses the concept of composition in layouts, specifically focusing on mutual refinement and composition of morphisms as outlined in Chapter 4 of the Colfax Paper. It explains the process of calculating layout com...
The post explains how to implement a 2 CTA GEMM operation on Blackwell GPUs, showcasing performance improvements and necessary code adjustments compared to a 1 CTA setup.
This blog post introduces CuTeDSL, a programming model designed for the Blackwell GPU architecture, focusing on the NVFP4 floating-point format. It explains the GEMV (matrix-vector multiplication) kernel used in a hackathon challe...
This blog post explores the relationship between bit counting and properties of unsigned binary numbers. It presents a mathematical identity for unsigned integers, demonstrating how to count the number of one bits in a binary numb...
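The summary cuts off before stating the identity, so the following is only a related classic: for an unsigned integer, popcount(x) = x − Σ_{i≥1} ⌊x/2^i⌋, since a set bit of weight 2^k contributes 2^k − (2^{k−1} + … + 1) = 1 to the difference. The sketch below checks that identity against Kernighan's clear-lowest-bit loop; whether it matches the post's identity cannot be told from the truncated text.

```python
def popcount_kernighan(x):
    """Count one bits by repeatedly clearing the lowest set bit."""
    n = 0
    while x:
        x &= x - 1  # clears the lowest set bit
        n += 1
    return n

def popcount_identity(x):
    """popcount(x) = x - sum_{i>=1} floor(x / 2^i)."""
    total, shifted = x, x >> 1
    while shifted:
        total -= shifted
        shifted >>= 1
    return total

assert all(
    popcount_kernighan(x) == popcount_identity(x) == bin(x).count("1")
    for x in range(1024)
)
```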
An introduction to setting up Grouped Blockscaled GEMM on Blackwell, focusing on tensor creation and kernel initialization with code examples.
The post details the setup and execution of Grouped Blockscaled GEMM for B200 GPUs, highlighting differences from traditional methods and key parameters involved.
The blog post discusses caching algorithms, specifically LRU (Least Recently Used) and LFU (Least Frequently Used), as presented in Konstantin Vladimirov's lecture. It explains the implementation of these algorithms in C++, detail...
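The lecture's implementations are in C++ (typically an `std::list` of entries plus an `std::unordered_map` into it, though that pairing is an assumption here); the same LRU idea can be sketched compactly in Python, where `collections.OrderedDict` provides the hash map and recency order in one structure.

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache sketch: an OrderedDict stands in for the C++
    list + unordered_map pair, tracking recency by insertion order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now most recently used
cache.put("c", 3)   # capacity exceeded: evicts "b"
assert cache.get("b") is None
assert cache.get("a") == 1 and cache.get("c") == 3
```

LFU needs extra bookkeeping (a frequency count per key and eviction of the least-frequent, ties broken by recency), which is why the C++ versions in the lecture are more involved.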
The blog post serves as a supplementary guide to Chapter 2 of Colfax's paper on CuTe Layouts, providing detailed calculations and step-by-step solutions for various examples. It discusses the redundancy of certain modes, the conce...
The blog post explains the concept of tensor slicing in CuTeDSL, detailing how tensors are defined and manipulated within this framework. It provides a step-by-step guide on how to perform tensor slicing, including examples of cal...