About:

A blog about things the author is interested in.

This blog post provides an analysis of the mathematical interpretation of scale factors in Blackwell kernels for NVFP4, focusing on the association of 8-bit scale tensors with the correct layout. It discusses the similarities betw...
This blog post discusses various methods to parallelize the reduction over the K-mode in the context of the GEMV (general matrix-vector multiplication) competition. It provides a detailed explanation of the CuTe reference implemen...
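The summary is truncated, but the core idea of parallelizing a K-mode reduction can be sketched independently of CuTe: split the K dimension into slices, let each worker reduce its slice, then combine the partial sums. This is a minimal NumPy illustration of that split-K pattern, not the post's kernel.

```python
import numpy as np

def split_k_gemv(A, x, num_splits=4):
    """Compute y = A @ x by splitting the reduction over K into
    independent partial sums, mimicking how parallel workers each
    reduce their own K-slice before a final combine step."""
    M, K = A.shape
    bounds = np.linspace(0, K, num_splits + 1, dtype=int)
    # Each "worker" reduces its own K-slice independently.
    partials = [A[:, s:e] @ x[s:e] for s, e in zip(bounds[:-1], bounds[1:])]
    # Final combine: sum the partial results.
    return np.sum(partials, axis=0)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 64))
x = rng.standard_normal(64)
assert np.allclose(split_k_gemv(A, x), A @ x)
```

On a GPU the "workers" would be threads, warps, or CTAs, and the final combine would itself be a parallel reduction; the slicing-then-combine structure is the same.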
This post presents a mathematical identity that improves the efficiency of the GDN prefill algorithm, yielding an 18% speedup in the implementation.
The post elucidates the Chunkwise Gated Delta Rule, focusing on its mathematical derivation and application in Gated Delta Attention for improved computational efficiency.
Pipelining techniques in GPU programming using CuTeDSL on the Blackwell architecture enhance performance by overlapping memory transfers and computations through asynchronous operations.
The blog post discusses Chapter 3 of a paper by Colfax on the categorical foundations of CuTe, focusing on a new approach to understanding CuTe Layouts through categories and morphisms. It provides examples and calculations to ill...
This blog post provides an overview of swizzling in the context of performant GEMM kernels, specifically focusing on its implementation in CuTeDSL. It explains how swizzling addresses shared memory bank conflicts and details the c...
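CuTeDSL exposes parameterized swizzle functors, and the post's exact scheme is not visible from this summary; the following is a minimal, generic illustration of the underlying idea, using a classic XOR swizzle and an assumed 32-bank shared memory model to show how a column access that would conflict becomes conflict-free.

```python
NUM_BANKS = 32  # bank count assumed here for illustration

def bank(word_offset):
    # The bank of a 4-byte word is its word offset modulo the bank count.
    return word_offset % NUM_BANKS

def swizzled_col(row, col):
    # Classic XOR swizzle: permute columns within each row so a single
    # column of a 32x32 tile no longer maps to a single bank.
    return col ^ (row % NUM_BANKS)

# Without swizzling, 32 threads reading column 0 of rows 0..31 of a
# row-major 32x32 tile all hit bank 0: a 32-way conflict.
naive = {bank(row * 32 + 0) for row in range(32)}
# With the XOR swizzle, each of those accesses lands in a distinct bank.
swizzled = {bank(row * 32 + swizzled_col(row, 0)) for row in range(32)}

assert len(naive) == 1      # every access in the same bank
assert len(swizzled) == 32  # conflict-free
```

Real swizzle functors (e.g. CuTe's `Swizzle<B, M, S>`) generalize this by choosing which bit ranges are XORed together, but the goal is the same: spread simultaneous accesses across banks.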
The blog post discusses various methods of tiling tensors in the CuTe DSL for GPU programming, specifically focusing on three partitioning strategies: Inner Partition, Outer Partition, and Thread Value Partition. It provides detai...
The post explores the kernel of a grouped blockscaled GEMM on B200 GPUs, highlighting how the CuTeDSL implementation differs from traditional approaches.
Warp specialization in CuTeDSL enhances GEMM performance by separating tile memory access and matrix multiplication tasks across two warps.
This post optimizes Gated Delta Net decoding workloads for the Flashinfer Competition, developing a detailed mathematical framework and a PyTorch implementation.
This blog post discusses the implementation of a simple RMSNorm kernel using CuTeDSL, showcasing how to perform reduction in GPU programming. It explains the mathematical foundation of RMSNorm, provides a step-by-step guide to imp...
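The mathematical foundation the summary mentions is compact enough to state directly: RMSNorm divides each vector by the root mean square of its entries (plus a small epsilon) and applies a learned per-channel scale. This NumPy reference is a sketch of that definition, not the CuTeDSL kernel from the post.

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps) * weight,
    reducing over the last axis as is conventional."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[3.0, 4.0]])
w = np.ones(2)
out = rmsnorm(x, w)
# mean(x^2) = 12.5, so rms ≈ 3.5355 and out ≈ [0.8485, 1.1314]
assert np.allclose(out, x / np.sqrt(12.5 + 1e-6))
```

The GPU kernel's job is essentially to compute the `mean(x * x)` reduction cooperatively across threads; the elementwise scale afterwards is trivially parallel.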
The blog post discusses the implementation of vector primitive conversions in CuTeDSL, specifically focusing on converting packed data types like FP8 and FP4 to Float32 and Float16 formats. The author provides code examples and ex...
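For context on what such a conversion must do: FP4 in the NVFP4 sense is the E2M1 format, whose eight magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, 6 plus a sign bit. This scalar Python sketch decodes one packed byte (two nibbles); the nibble ordering is an assumption, and it deliberately ignores the vectorized CuTeDSL conversion path the post actually covers.

```python
# Magnitudes of E2M1 codes 0..7 (sign bit + 2 exponent bits + 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 value to a Python float."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0b0111]

def unpack_fp4_byte(byte, low_nibble_first=True):
    """Unpack one byte holding two FP4 values. The nibble order here is
    an assumption; the real packing convention depends on the kernel."""
    lo, hi = decode_e2m1(byte & 0xF), decode_e2m1(byte >> 4)
    return (lo, hi) if low_nibble_first else (hi, lo)

assert decode_e2m1(0b0001) == 0.5
assert decode_e2m1(0b1111) == -6.0
assert unpack_fp4_byte(0x21) == (0.5, 1.0)
```

A fast GPU conversion replaces this table lookup with bit manipulation over whole registers of packed values, but the mapping it must realize is exactly this one.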
This blog post provides an alternative explanation of MMA Atoms, which are fundamental components in CuTe kernels. It includes a detailed examination of a code example that prints a LaTeX file for visualizing MMA Atoms, along with...
This blog post provides a detailed analysis of the setup process for a blockscaled GEMM (General Matrix Multiply) kernel using the CuTeDSL framework. It covers the initialization of various attributes, compatibility checks for inp...
The blog post discusses the concept of composition in layouts, specifically focusing on mutual refinement and composition of morphisms as outlined in Chapter 4 of the Colfax Paper. It explains the process of calculating layout com...
The post explains how to implement a 2 CTA GEMM operation on Blackwell GPUs, showcasing performance improvements and necessary code adjustments compared to a 1 CTA setup.
This blog post introduces CuTeDSL, a programming model designed for the Blackwell GPU architecture, focusing on the NVFP4 floating-point format. It explains the GEMV (matrix-vector multiplication) kernel used in a hackathon challe...
This blog post explores the relationship between bit counting and properties of unsigned binary numbers. It presents a mathematical identity for unsigned integers, demonstrating how to count the number of one bits in a binary numb...
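The summary cuts off before stating the identity, so the following is only a related classic: for an unsigned integer, popcount(x) = x − Σ_{i≥1} ⌊x/2^i⌋, since a set bit of weight 2^k contributes 2^k − (2^{k−1} + … + 1) = 1 to the difference. The sketch below checks that identity against Kernighan's clear-lowest-bit loop; whether it matches the post's identity cannot be told from the truncated text.

```python
def popcount_kernighan(x):
    """Count one bits by repeatedly clearing the lowest set bit."""
    n = 0
    while x:
        x &= x - 1  # clears the lowest set bit
        n += 1
    return n

def popcount_identity(x):
    """popcount(x) = x - sum_{i>=1} floor(x / 2^i)."""
    total, shifted = x, x >> 1
    while shifted:
        total -= shifted
        shifted >>= 1
    return total

assert all(
    popcount_kernighan(x) == popcount_identity(x) == bin(x).count("1")
    for x in range(1024)
)
```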
An introduction to setting up Grouped Blockscaled GEMM on Blackwell, focusing on tensor creation and kernel initialization with code examples.
The post details the setup and execution of Grouped Blockscaled GEMM for B200 GPUs, highlighting differences from traditional methods and key parameters involved.
The blog post discusses caching algorithms, specifically LRU (Least Recently Used) and LFU (Least Frequently Used), as presented in Konstantin Vladimirov's lecture. It explains the implementation of these algorithms in C++, detail...
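The lecture's implementations are in C++ (typically an `std::list` of entries plus an `std::unordered_map` into it, though that pairing is an assumption here); the same LRU idea can be sketched compactly in Python, where `collections.OrderedDict` provides the hash map and recency order in one structure.

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache sketch: an OrderedDict stands in for the C++
    list + unordered_map pair, tracking recency by insertion order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now most recently used
cache.put("c", 3)   # capacity exceeded: evicts "b"
assert cache.get("b") is None
assert cache.get("a") == 1 and cache.get("c") == 3
```

LFU needs extra bookkeeping (a frequency count per key and eviction of the least-frequent, ties broken by recency), which is why the C++ versions in the lecture are more involved.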
The blog post serves as a supplementary guide to Chapter 2 of Colfax's paper on CuTe Layouts, providing detailed calculations and step-by-step solutions for various examples. It discusses the redundancy of certain modes, the conce...
The blog post explains the concept of tensor slicing in CuTeDSL, detailing how tensors are defined and manipulated within this framework. It provides a step-by-step guide on how to perform tensor slicing, including examples of cal...