Thunder Kittens + CUDA Deep Dive — Key Takeaways

Source: Quinn & Ben YouTube livestream (~5 hrs). Full transcript indexed.

GPU architecture (H100 focus)

SM structure

You hear “132 SMs” thrown around a lot. Here is what is actually going on inside each one.

NVIDIA manufactures 144 SMs per H100 die but only enables 132 — the rest are sacrificed for yield. Each SM is split into 4 sub-partitions (processing blocks), and every sub-partition contains:

32 FP32 CUDA cores (128 total per SM)
1 Tensor Core (4 total per SM, 4th gen on Hopper)
256 KB register file (65,536 x 32-bit, shared across SM)
128 KB configurable L1/shared memory (shared across SM)

The warp scheduler issues 1 instruction per warp (32 threads) per cycle. You want at least 2 warps co-scheduled per quadrant to hide latency — if one warp is waiting on memory, the other can execute.

memory hierarchy

Every performance decision you make in CUDA boils down to where data lives in this stack:

Registers: ~1 cycle access. The fastest thing you have.
Shared memory (SRAM): fast, user-controlled addressing. This is your main lever.
L1 cache: slightly slower than shared mem because of cache tagging overhead.
L2 cache: 50 MB on H100. Be careful — cross-chip L2 access can be nearly as slow as HBM.
HBM3: 3.35 TB/s bandwidth, but ~500-700 cycle latency. Reaching here during a hot loop is a disaster.

key insight: instruction issue overhead

Here is the number that should reshape how you think about GPU compute.

An FP32 FMA burns 20x more energy issuing the instruction than performing the actual math. Twenty to one. The instruction decode, dispatch, and scheduling machinery dwarfs the floating-point unit itself.

This is the entire reason tensor cores exist. A single HMMA instruction performs ~500K FLOPs for the same issue cost as one scalar FMA. You are not paying for compute — you are paying for the right to compute. Tensor cores amortize that tax over 256K+ FLOPs per instruction.

tensor cores

A few details that trip people up:

5th gen (Hopper): m16n8k16 for FP16, m16n8k32 for FP8
The hardware actually computes A * B^T (transposed multiply), not A * B. If your layouts are wrong, you will silently get garbage.
Register layout: 16x16 “core matrices” = one vector register
Weight-stationary design: load weights once, stream activations through

memory consistency model (THE HARD PART)

If CUDA has a dragon, this is it. Not the math, not the indexing, not even the launch configuration. The only truly hard part of CUDA is memory consistency — knowing when data written by one thread is visible to another.

sync_warp

You need this when threads within a warp read each other’s shared memory writes.

A common trap: you load from global into shared memory, then immediately read that shared memory into registers. Without a sync_warp between those two steps, the reading thread may see stale data. The writes are in flight, but nobody promised they have landed.

sync_threads

This is the cross-warp version. When threads in different warps depend on each other’s shared memory, you must synchronize.

Think about it concretely: thread 0 could be halfway through its computation while thread 128 has not even started loading its shared memory tile. Without sync_threads, you are racing — and you will lose.

This is the ONLY hard part of CUDA. Everything else is easy.

async loads (cp.async)

Available on SM80+ (Ampere, Ada, Hopper). The mental model is simple: you tell the hardware “start loading this, and I will tell you when I need it.”

This unlocks pipelining — load iteration N+1 while computing iteration N. On H100, TMA (Tensor Memory Accelerator) takes this further by handling descriptors automatically so you do not have to manage address calculation yourself.

pipeline pattern

for each iteration:
    load_async(next_chunk)        // start loading next
    sync_threads()                // wait for previous load
    compute(current_chunk)        // use current data
    // next iteration: roles swap

bank conflicts

Shared memory has 32 banks, each 32 bits wide. If multiple threads in a warp hit the same bank on the same cycle, they serialize.

Here is where it gets painful: a naive row-major 16x16 tile layout produces an 8-way bank conflict. That means you are getting 1/8th of your shared memory bandwidth. You wrote the kernel, it compiled, it runs — and it is 8x slower than it should be for a reason that does not show up in your code logic at all.

The fix is a swizzle pattern (XOR-based) or padding. Thunder Kittens provides automatic 32/64/128-byte swizzle layouts so you do not have to think about this.

thunder kittens (TK)

design philosophy

Thunder Kittens takes a stance that is the opposite of most GPU libraries: primitives, not abstractions. You still need to understand the hardware. TK just removes the boilerplate that makes you miserable.

Header-only C++ library, compiles in ~5 seconds
Competitive with cuTLASS (within 5-10%) but written in ~10x less code
Flash Attention equivalent: ~75 lines vs thousands in cuTLASS

That last point is worth sitting with. The same algorithm, the same performance, a fraction of the surface area for bugs.

types

TK’s type system maps directly to where data lives in the memory hierarchy:

Global layout: 4D tensor (batch, depth, rows, cols), TMA descriptors auto-generated
Shared tiles: parameterized by type/rows/cols, auto-swizzled addressing
Register tiles: row or column layout (matching tensor core A*B^T requirement)
Register vectors: aligned, ortho, naive layouts

operations

Warp-level: load/store, MMA, elementwise ops
Group-level: collaborative loads across multiple warps, warp group sync
LCSF template: Load-Compute-Store-Finish (producer-consumer pattern)

Producer-Consumer pattern

This is TK’s approach to overlapping memory and compute, and it maps cleanly to how H100 hardware actually wants to work.

You divide warps into two roles: producers handle memory (load from global into shared, manage TMA), and consumers handle compute (load shared into registers, run the math, store results back).

Semaphores (M barriers) synchronize the handoff. When TMA completes a load, it auto-triggers the semaphore — the consumers wake up and go. No polling, no busy-waiting.

deep seek whale kernel (HD=576, VD=512)

This is a case study in what happens when your problem dimensions do not cooperate with your hardware.

576 is 64*9. That is an awkward factorization for tensor cores (16x16 base). You cannot tile it cleanly.

The solution: partition Q into 4 warps, each handling 144 columns. Attention accumulation across warps happens via shared memory atomic adds.

Then a real bug showed up — a race condition where Q shared memory overlapped with K shared memory. The kind of bug that works 99% of the time and silently corrupts the other 1%.

Performance landed roughly on par with PyTorch SDPA, with room for improvement. But the takeaway is more general: your choice of blocking structure is the entire game. Get the tiling wrong and no amount of micro-optimization will save you.

practical lessons

Write the copy kernel first — get data flow right before you touch compute. If your loads are wrong, nothing else matters.
Memory consistency bugs are the only hard bugs — you will never spend two days debugging an incorrect multiply. You will absolutely spend two days debugging a missing sync_threads.
Use NCU (Nsight Compute) — read the SASS, understand the stall reasons. If you are guessing, you are losing.
Async pipelining is mandatory on H100 — Flash Attention 3 vs FA2 is a 2x improvement, almost entirely from pipelining.
Bank conflicts matter — an 8-way conflict means 1/8th of your shared memory bandwidth. It is a silent killer.
Profile, do not guess — NCU shows exactly where time is spent. Trust the profiler, not your intuition.

interesting reads

See also: TrtLLMGen MoE Kernels (register-tiling evolved into TMEM management on SM100), Blackwell Architecture, Inference Stack Synthesis

Alan's PKB

Explorer

Thunder Kittens CUDA

Thunder Kittens + CUDA Deep Dive — Key Takeaways

Graph View

Backlinks