The dirty secret of LLM inference is that your $30,000 GPU spends most of its time waiting for memory, not doing math. During autoregressive decode (the token-by-token generation that dominates real-world serving), every step streams the full set of weights out of HBM to produce just one new token per sequence. At small batch sizes the tensor cores sit at single-digit utilization, and the entire chip acts as an expensive memory-bandwidth pipe. This is the decode bottleneck, and it’s a physics problem, not an engineering one.
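To put rough numbers on this, here is a minimal roofline-style sketch. It assumes a decode step costs about 2 FLOPs per parameter per sequence and streams every weight once, and it ignores KV cache reads, attention FLOPs, and multi-GPU sharding. The `Accelerator` class, the `decode_roofline` helper, and the approximate H100 SXM figures (about 989 TFLOP/s dense BF16, about 3.35 TB/s of HBM bandwidth) are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical hardware description used only for this estimate."""
    peak_flops: float     # dense matmul throughput, FLOP/s
    mem_bandwidth: float  # HBM bandwidth, bytes/s


# Approximate public figures for an H100 SXM (assumption, rounded).
H100_SXM = Accelerator(peak_flops=989e12, mem_bandwidth=3.35e12)


def decode_roofline(n_params, bytes_per_param, batch_size, hw):
    """Estimate one decode step under a simple roofline model.

    Assumes every weight is read once per step and each sequence in the
    batch costs ~2 FLOPs per parameter (one multiply, one add).
    """
    weight_bytes = n_params * bytes_per_param
    flops = 2.0 * n_params * batch_size

    t_mem = weight_bytes / hw.mem_bandwidth  # time to stream the weights
    t_compute = flops / hw.peak_flops        # time to do the math
    step_time = max(t_mem, t_compute)        # the slower side sets the pace

    return {
        "intensity": flops / weight_bytes,            # FLOP per byte moved
        "ridge": hw.peak_flops / hw.mem_bandwidth,    # FLOP/byte needed to saturate compute
        "step_time_s": step_time,
        "compute_utilization": t_compute / step_time,
        "memory_bound": t_mem > t_compute,
    }


for batch in (1, 16, 256):
    r = decode_roofline(n_params=70e9, bytes_per_param=2, batch_size=batch, hw=H100_SXM)
    print(f"batch={batch:4d}  intensity={r['intensity']:.0f} FLOP/B  "
          f"utilization<={r['compute_utilization']:.1%}  memory_bound={r['memory_bound']}")
```

Under these assumptions the utilization ceiling depends only on batch size, weight precision, and the hardware's compute-to-bandwidth ratio, not on model size: roughly 0.3% at batch 1 and roughly 5% at batch 16 in BF16.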

The KV cache makes things worse. Every token in the context window stores a key and a value vector at every layer, so the cache grows linearly with sequence length and batch size. For a 70B model with grouped-query attention (assuming a Llama-style layout of 80 layers and 8 KV heads of dimension 128) at batch 256 with 4K context, the KV cache alone is roughly 170 GB in FP8 and 340 GB in FP16. Even the FP8 figure, added to the weights, is more than the entire HBM capacity of a B200. Compression techniques like SpectralQuant can reduce this by rotating the cache into a spectral basis where most dimensions carry negligible information, but every compression scheme trades accuracy for memory, and the sweet spots are model-specific.
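The linear scaling is easy to check with back-of-envelope arithmetic. The sketch below assumes a Llama-style 70B layout (80 layers, 8 KV heads under grouped-query attention, head dimension 128); those dimensions and the `kv_cache_bytes` helper are assumptions for illustration, not figures taken from a specific deployment.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Total KV cache size in bytes for a dense decoder-only transformer.

    Each token stores one key and one value vector (hence the factor of 2)
    per KV head, per layer.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Assumed Llama-style 70B configuration (illustrative).
config = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for label, width in (("FP16", 2), ("FP8", 1)):
    total = kv_cache_bytes(**config, seq_len=4096, batch_size=256, bytes_per_elem=width)
    print(f"{label}: {total / 1e9:.1f} GB ({total / 2**30:.0f} GiB)")

# Without grouped-query attention (64 KV heads instead of 8) the cache is 8x larger.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128,
                     seq_len=4096, batch_size=256, bytes_per_elem=2)
print(f"FP16, full multi-head KV: {mha / 1e12:.2f} TB")
```

Doubling either the batch size or the context length doubles the result, which is why long-context, high-batch serving runs out of HBM long before it runs out of compute.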

Kernel optimization matters here because the gap between theoretical hardware throughput and what you actually achieve in production is enormous. FlashAttention rewrote the rules for attention computation by fusing operations and respecting the memory hierarchy. ThunderKittens showed that register-tiled CUDA kernels on H100 can get within 15% of peak bandwidth. TrtLLMGen open-sourced NVIDIA’s internal MoE kernels. Each of these is a story about squeezing real throughput out of hardware that’s already been paid for, and the gains compound. The notes below trace how these pieces fit together.