Every AI chip ever shipped — tensor cores, MXUs, matrix cores, whatever the marketing name — is fundamentally a systolic array surrounded by a memory hierarchy. The interesting questions are never about peak FLOPS. They’re about how big you make the array, how you feed it data, and what you’re willing to give up in programmability to keep it busy.

This section covers the architecture of AI accelerators from the transistor level up: GPU die economics, the real cost of the compute-vs-memory wall, and why the claim that 96% of an H100’s transistors “aren’t doing math” is both correct and useless. Understanding hardware at this level matters because the constraints are physical, not engineering choices: moving data from off-chip memory can cost on the order of 1000x more energy than the arithmetic performed on it, HBM is typically the single most expensive component on a datacenter AI chip, and yield curves determine which architectures are even manufacturable. These are the boundaries every system designer works within.
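To make the compute-vs-memory wall concrete, here is a minimal roofline-style sketch. The chip numbers (`PEAK_FLOPS`, `MEM_BW`) are round placeholders chosen for illustration, not the specs of any real part, and the function names are hypothetical. It also assumes the idealized case where each matrix crosses the memory bus exactly once.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Idealized assumption: A and B are each read once and C is written
    once, so on-chip reuse is perfect. Real kernels move more data.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * mem_bandwidth)


# Placeholder chip, NOT a real product: 1 PFLOP/s of math, 2 TB/s of memory.
PEAK_FLOPS = 1.0e15
MEM_BW = 2.0e12

# Intensity needed before the math units, not memory, become the limit.
ridge_point = PEAK_FLOPS / MEM_BW

shapes = {
    "large square matmul": (8192, 8192, 8192),
    "batch-1 matrix-vector": (1, 8192, 8192),
}

for name, (m, n, k) in shapes.items():
    ai = matmul_arithmetic_intensity(m, n, k)
    achieved = attainable_flops(ai, PEAK_FLOPS, MEM_BW)
    bound = "compute-bound" if ai >= ridge_point else "memory-bound"
    print(f"{name}: {ai:.2f} FLOPs/byte, {bound}, "
          f"{100 * achieved / PEAK_FLOPS:.2f}% of peak")
```

Under these assumptions the large square matmul lands well above the ridge point, while the batch-1 case does roughly one FLOP per byte moved and can use only a fraction of a percent of the placeholder chip’s peak. That gap is why peak FLOPS alone says little about real throughput.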

The Blackwell breakdown walks through where all 92 billion transistors live and stress-tests NVIDIA’s bandwidth claims against real workloads. The systolic arrays piece covers the computational primitive that the entire industry converged on. The supply chain analysis connects chip architecture to the packaging and memory sourcing that actually determine whether you can buy the thing.

See also: Cornered Chips — four domain-specific chip architecture proposals (ARIA, VDX-1, PhysDiffuse-1, ATLAS) for workloads that GPUs handle poorly.