Nvidia’s blackwell B200 is a $30-40K, 1000-watt, two-die GPU that nvidia claims is “the world’s most powerful chip.” I went through the specs, ran the math, and read bjarke roune’s book on AI chip design to figure out what’s actually going on inside this thing.
Here’s my breakdown.
the B200 reference card
| parameter | value |
|---|---|
| process | TSMC N4P |
| dies | 2 × ~400 mm² (CoWoS-L) |
| transistors | ~92B total |
| SMs | 180 (of 192) |
| tensor cores | 720 (5th gen, 64×64) |
| FP4 TOPS | 4,500 |
| FP8 TOPS | 2,250 |
| BF16 TFLOPS | 1,500 |
| L2 cache | 96 MB (2× Hopper) |
| HBM3e | 192 GB, 8 TB/s (8 stacks) |
| NVLink 5 | 1.8 TB/s (18 links) |
| die-to-die (NV-HBI) | 10 TB/s |
| TDP | 1000 W |
where blackwell’s transistors actually go
Every few months, someone on AI hardware twitter posts a stat that’s designed to make you feel smart: only 3.3% of an H100’s transistors are dedicated to matrix multiplication. The rest is overhead. Waste. GPU bloat. The implication is always the same — ASICs will eat nvidia’s lunch.
the arithmetic is correct and the conclusion is wrong. here is the math.
The H100 has 528 tensor cores. Each one does 512 FMA operations per cycle. A single FMA costs about 10,000 transistors at the gate level once you include mux logic and pipeline registers:
528 tensor cores × 512 FMAs × 10,000 transistors ≈ 2.7 billion transistors
The H100 has ~80 billion transistors total. 2.7B / 80B = 3.4%. There it is. 96.6% of your $30,000 chip is “not doing math.”
Now do it for blackwell. The B200 has 720 tensor cores (5th gen, 64×64 output per cycle at FP8 = 4096 FMAs per core). At FP8, each FMA needs fewer transistors; call it roughly 6,000 for the reduced-precision datapath (an order-of-magnitude estimate, not a measured count):
720 tensor cores × 4096 FMAs × 6,000 transistors ≈ 17.7 billion transistors
On ~92 billion total, that’s 19.2%. A big jump from H100’s 3.4%, partly because blackwell’s tensor cores are genuinely bigger (64×64 vs 16×16 scheduling blocks), and partly because FP8 datapaths pack more useful work per transistor.
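Here is the same arithmetic as a small script, in case you want to swap in your own per-FMA estimate. The function name is mine, and the transistor-per-FMA counts are the rough estimates from above, not measured values.

```python
def matmul_transistor_share(tensor_cores, fmas_per_core, transistors_per_fma, total_transistors):
    """Fraction of a chip's transistors that sit inside tensor-core FMA units."""
    matmul_transistors = tensor_cores * fmas_per_core * transistors_per_fma
    return matmul_transistors / total_transistors


# Inputs are the figures quoted in the text; per-FMA counts are rough estimates.
h100_share = matmul_transistor_share(528, 512, 10_000, 80e9)
b200_share = matmul_transistor_share(720, 4096, 6_000, 92e9)

print(f"H100: {h100_share:.1%}")  # 3.4%
print(f"B200: {b200_share:.1%}")  # 19.2%
```

Halve or double the per-FMA estimate and the percentages move a lot, which is part of why the headline number is a poor basis for an argument.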
But both of these percentages are correct and useless.
bjarke roune — who was the software lead for TPUv3 at google, meaning he actually designed the compiler for one of these chips — has the right framing. the question isn’t “what percentage of transistors do math?” it’s “what could an ASIC actually eliminate?”
Here is where blackwell’s ~92B transistors live:
L2 cache: ~16-20B transistors. blackwell has 96 MB of L2, doubled from hopper. With tag arrays, ECC, coherency logic, and the crossbar connecting 180 SMs, you’re looking at 55-60 mm² of silicon across both dies. Does an inference ASIC need this? yes. you need to buffer KV cache, store activations between layers, handle attention scores. Google’s TPUs have massive on-chip memory. Groq’s entire thesis is “put 230 MB of SRAM on-chip.” you can simplify the cache logic, but the SRAM bitcells don’t go away.
L1 / shared memory: ~10-12B transistors. 180 SMs × 228 KB = ~40 MB total. An ASIC still needs local storage near the compute. You can strip configurability, but the SRAM stays.
register files: ~8-10B transistors. 180 SMs × 256 KB per register file = ~45 MB of multi-ported SRAM. This is legitimately reclaimable — a systolic array with fixed dataflow doesn’t need 65K general-purpose registers per compute unit.
HBM3e controllers: ~3-4B transistors. 8 stacks of HBM3e, each with PHY, controller logic, ECC. Non-negotiable for any chip that talks to HBM.
NVLink 5 SerDes: ~4-5B transistors. 18 links with analog transceivers, PLLs, CDR circuits. Reclaimable only if you give up multi-chip scaling.
NV-HBI die-to-die: ~2-3B transistors. specific to the chiplet design. A monolithic ASIC doesn’t need it, but monolithic at 800 mm² also has terrible yield.
what’s actually reclaimable:
| component | transistors | ASIC needs it? |
|---|---|---|
| warp schedulers | ~3-4B | no |
| instruction fetch/decode | ~2-3B | no |
| CUDA cores (FP32/INT32) | ~4-5B | no |
| RT cores | ~2-3B | no |
| MIG logic | ~0.5-1B | no |
| register files (partial) | ~4-5B | reduced |
| total reclaimable | ~16-21B | |
That’s 17-23% of the ~92B transistor budget, which I’ll treat as a rough proxy for die area. Subtract 3-5% for the ASIC-specific control logic you’d need to add back (DMA engines, weight routing, configuration registers), and you net ~12-18% of actual die area savings.
So here’s the honest version of the ASIC argument:
reclaim ~15% of die area + achieve ~80-90% utilization instead of GPU’s 30-40% average utilization. combined, that’s a real ~2.5-3.5× efficiency advantage (1/0.85 from area, times 2-3× from utilization). but it’s not 30×, and it’s definitely not the “96.7% waste” story.
the utilization gap is the bigger factor, and it’s not even close. nvidia’s tensor cores spend most of their life waiting — waiting for data from HBM, waiting for the warp scheduler to issue instructions, waiting for a thread block. A fixed-function chip with deterministic dataflow keeps its systolic arrays fed consistently. That’s where the real ASIC advantage lives: not fewer transistors, but less waiting.
the two-die architecture
Here’s a yield calculation that seems like it should kill the chiplet argument.
- monolithic 800 mm²: yield = e^(-0.09 × 8.0) = 49%
- two dies at 400 mm²: yield per die = e^(-0.09 × 4.0) = 70%
- combined good-pair yield: 70% × 70% = 49%… Wait, that’s the same?
Not quite. the chiplet advantage isn’t yield arithmetic — it’s binning. nvidia can independently test each die. A die with 1 defective SM out of 96 becomes a B200A or a lower SKU, rather than scrapping an entire 800 mm² monolithic chip. The effective yield of usable silicon is significantly higher.
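The yield numbers above come from a simple Poisson defect model. A minimal sketch, assuming the 0.09 defects/cm² density implied by the formulas in the list (the function name is mine):

```python
import math


def poisson_yield(area_cm2, defect_density=0.09):
    """Poisson yield model: probability that a die of this area has zero defects."""
    return math.exp(-defect_density * area_cm2)


monolithic = poisson_yield(8.0)  # one 800 mm^2 die
per_die = poisson_yield(4.0)     # one 400 mm^2 die
blind_pair = per_die ** 2        # two untested dies packaged together

print(f"monolithic 800 mm^2: {monolithic:.0%}")
print(f"single 400 mm^2 die: {per_die:.0%}")
print(f"untested pair:       {blind_pair:.0%}")

# Assumption: if dies are tested before pairing, a defect only costs the die it
# landed on, so the share of silicon that ends up usable is the per-die yield.
print(f"usable silicon with pre-test: {per_die:.0%}")
```

The untested pair and the monolithic die come out identical because e^(-0.36) squared is e^(-0.72). The gain only appears once you test dies individually and bin the imperfect ones.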
The dies connect via NV-HBI at ~10 TB/s, using Local Silicon Interconnect bridges at ~10 μm pitch. Die-to-die energy: ~0.5 pJ/bit (vs ~5-7 pJ/bit for off-package NVLink). That’s 10-14× more energy-efficient than chip-to-chip.
Why does this matter? do the math. 10 TB/s is 80 Tb/s, so at NVLink energy levels (5-7 pJ/bit) NV-HBI would cost ~400-560W at full bandwidth. At 0.5 pJ/bit, it’s only ~40W. that’s 360W or more saved for the same bandwidth, over a third of the chip’s entire 1000W TDP.
5th gen tensor cores — FP4 and the energy hierarchy
FP4 (E2M1) is barely a number. 1 sign bit, 2 exponent bits, 1 mantissa bit. Only 16 distinct values. The multiplier is essentially an AND gate plus an exponent add. You can hardly call it a computation.
| precision | energy/FMA (silicon limit) | energy/FMA (B200 whole-chip) | gap |
|---|---|---|---|
| FP16 | ~63 fJ | ~667 fJ | 10.6× |
| FP8 | ~25 fJ | ~444 fJ | 17.8× |
| FP4 | ~10 fJ | ~222 fJ | 22.2× |
Look at those gaps. the whole-chip column is just TDP divided by peak throughput, so the 10-22× difference between it and the silicon limit is everything that isn’t the multiplier itself: reading operands from registers, routing through the memory hierarchy, scheduling on the warp scheduler, plus the HBM and I/O power in the budget. This is roune’s core insight: the systolic array itself is efficient; everything around it is where your power budget actually goes.
A counterintuitive detail: H100 is actually MORE power-efficient per op at FP8 (354 fJ vs 444 fJ) than B200. The two-die design plus 8 HBM3e stacks (vs H100’s 5 HBM3 stacks) adds overhead. B200 wins on absolute throughput and bandwidth, not per-op efficiency — a deliberate trade-off nvidia made with eyes wide open.
Now consider the memory wall. At ~4 pJ/bit for HBM3e: a single 16-bit read = 64,000 fJ. Compared to 63 fJ for an FP16 FMA: 1 HBM3e access costs the same energy as 1000 FMAs. The horowitz relationship — “memory access dominates compute cost” — holds all the way out to the bleeding edge.
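The energy table and the memory-wall comparison can be reproduced in a few lines. This is a sketch that follows the table's convention of charging the full 1000 W TDP to peak throughput and treating each counted op as one FMA. The silicon-limit figures are the estimates from the table, and the names are mine.

```python
TDP_WATTS = 1000.0

# precision -> (peak ops per second, silicon-limit energy per FMA in fJ)
B200 = {
    "FP16": (1500e12, 63.0),
    "FP8": (2250e12, 25.0),
    "FP4": (4500e12, 10.0),
}


def whole_chip_fj(peak_ops_per_s, tdp_watts=TDP_WATTS):
    """Energy per op in femtojoules if the whole TDP is charged to peak throughput."""
    return tdp_watts / peak_ops_per_s * 1e15


for name, (peak, silicon_fj) in B200.items():
    chip_fj = whole_chip_fj(peak)
    print(f"{name}: {chip_fj:.0f} fJ whole-chip, {chip_fj / silicon_fj:.1f}x the silicon limit")

# Memory wall: one 16-bit HBM3e read at ~4 pJ/bit versus one FP16 FMA at ~63 fJ.
hbm_read_fj = 16 * 4.0 * 1000  # 4 pJ/bit is 4000 fJ/bit
fmas_per_hbm_read = hbm_read_fj / 63.0
print(f"one 16-bit HBM read costs about {fmas_per_hbm_read:.0f} FP16 FMAs")
```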
power budget: where the watts go
| component | B200 power | % of TDP |
|---|---|---|
| compute (SMs + tensor cores) | ~450 W | 45% |
| HBM3e (8 stacks) | ~200 W | 20% |
| NV-HBI (die-to-die) | ~50 W | 5% |
| NVLink I/O | ~100 W | 10% |
| L2 + NoC | ~80 W | 8% |
| memory controllers | ~50 W | 5% |
| misc (PLLs, thermal, etc.) | ~70 W | 7% |
compute takes less than half the power budget. stare at that table. HBM alone is a fifth. The I/O subsystem (NV-HBI + NVLink + memory controllers) accounts for 20%, all spent moving data rather than computing on it.
This is the physical basis for the ASIC efficiency argument: a chip that eliminates or reduces the NVLink, NV-HBI, and general-purpose compute overhead could potentially run the same workload at 200-300W. You’re not saving on math — you’re saving on everything else.
the memory system — why 192 GB still isn’t enough
Nvidia more than doubled capacity from H100 (80 GB) to B200 (192 GB, 2.4×) and bandwidth from 3.35 to 8.0 TB/s (also 2.4×). Impressive on paper. Let me show you why it’s still not enough.
| parameter | A100 (HBM2e) | H100 (HBM3) | B200 (HBM3e) |
|---|---|---|---|
| stacks | 5 | 5 | 8 |
| capacity | 80 GB | 80 GB | 192 GB |
| bandwidth | 2.0 TB/s | 3.35 TB/s | 8.0 TB/s |
| energy/bit | ~7 pJ/bit | ~5 pJ/bit | ~4 pJ/bit |
Energy/bit improved ~1.75× over three generations while bandwidth went up 4×. nvidia is scaling bandwidth much faster than efficiency — the right call when every inference workload on the planet is bandwidth-starved.
Using roune’s KV cache formula for llama 70B (GQA with 8 KV heads, 80 layers, head_dim=128):
per-token KV = 2 × 80 × 8 × 128 × 1 byte (FP8) = 160 KB/token
160 kilobytes per token per sequence. Sounds small. Now multiply by batch × context and watch it explode:
| scenario | batch | context | KV cache | + weights (FP8) | total | fits in 192 GB? |
|---|---|---|---|---|---|---|
| aggressive serving | 256 | 4,096 | 168 GB | 70 GB | 238 GB | no |
| conservative | 128 | 4,096 | 84 GB | 70 GB | 154 GB | yes (38 GB free) |
| long context | 64 | 8,192 | 84 GB | 70 GB | 154 GB | yes |
| high throughput | 512 | 2,048 | 168 GB | 70 GB | 238 GB | no |
The B200 can serve llama 70B at batch 128 with 4K context, or batch 64 with 8K. the moment you push past ~batch 186 at 4K (122 GB of headroom after weights ÷ 655 MB per 4K sequence), you’re out of memory. and this is the smallest model anyone considers production-scale.
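If you want to try other batch and context combinations, the KV formula is easy to script. A minimal sketch with the llama 70B shape from above as defaults. The helper names are mine, and because the table rounds 163,840 bytes to 160 KB, the exact results land a couple of percent above the table's figures.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=1):
    """Keys plus values (the factor of 2) for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def serving_footprint_gb(batch, context, weights_gb=70.0):
    """Return (KV cache, KV cache + weights) in decimal GB."""
    kv_gb = batch * context * kv_bytes_per_token() / 1e9
    return kv_gb, kv_gb + weights_gb


HBM_GB = 192.0
scenarios = [
    ("aggressive serving", 256, 4096),
    ("conservative", 128, 4096),
    ("long context", 64, 8192),
    ("high throughput", 512, 2048),
]

for label, batch, context in scenarios:
    kv_gb, total_gb = serving_footprint_gb(batch, context)
    verdict = "fits" if total_gb <= HBM_GB else "does not fit"
    print(f"{label}: KV {kv_gb:.0f} GB, total {total_gb:.0f} GB, {verdict}")
```

Passing `bytes_per_value=2` to `kv_bytes_per_token` models an FP16 KV cache, which doubles every KV figure.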
For llama 405B at FP8? the weights alone are 405 GB. That’s 2.1× the entire HBM capacity. You need the NVL72.
HBM3e costs $15-20/GB. 192 GB = **$2,900-3,800 just for memory** on a ~$30K BoM chip. And it’s still not big enough.
Roune’s observation is blunt: companies keep buying more expensive HBM because they can’t reduce KV cache requirements without AI research breakthroughs. Sparse attention could theoretically reduce effective sequence length from 1,000,000 to 1,000 — a 1000× reduction. But it’s still research-grade, not production-ready. Nvidia’s answer is to just keep stacking more HBM. Blackwell ultra will have 288 GB. The treadmill continues.
stress-testing the bandwidth
Let’s start with the simplest possible case — batch=1 decode for llama 70B — and see what falls out:
FP16 weights (140 GB): 140 / 8,000 GB/s = 17.5 ms = 57 tok/s
FP8 weights (70 GB): 70 / 8,000 = 8.75 ms = 114 tok/s
FP4 weights (35 GB): 35 / 8,000 = 4.4 ms = 228 tok/s
At batch=1 FP8: arithmetic intensity = 140B ops / 70B bytes = 2 ops/byte. The ridge point is 2,250 TOPS / 8 TB/s = 281 ops/byte. You are 140× below the ridge point. The tensor cores are at 0.7% utilization. $30,000 of compute sitting idle, acting as an expensive memory-bandwidth pipe.
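The batch=1 numbers fall out of two one-line formulas. A sketch using the 8 TB/s and 2,250 TOPS figures from the spec table (function names are mine):

```python
HBM_BANDWIDTH_GBPS = 8000.0  # 8 TB/s
PEAK_FP8_TOPS = 2250.0


def decode_tokens_per_s(weights_gb, bandwidth_gbps=HBM_BANDWIDTH_GBPS):
    """Batch-1 decode ceiling: each token streams every weight from HBM once."""
    return bandwidth_gbps / weights_gb


def ridge_point(peak_tops, bandwidth_tbps):
    """Arithmetic intensity (ops/byte) where the compute and bandwidth limits meet."""
    return peak_tops / bandwidth_tbps


for name, weights_gb in [("FP16", 140.0), ("FP8", 70.0), ("FP4", 35.0)]:
    print(f"{name}: {decode_tokens_per_s(weights_gb):.0f} tok/s ceiling")

intensity = 140e9 / 70e9  # ops per byte for llama 70B at FP8, batch 1
ridge = ridge_point(PEAK_FP8_TOPS, 8.0)
utilization = intensity / ridge
print(f"ridge {ridge:.0f} ops/byte, tensor core utilization {utilization:.1%}")
```

These are ceilings that ignore KV cache reads and kernel overheads, so real batch=1 throughput will be lower.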
Here’s the roofline organized by precision:
| precision | peak TOPS | BW (TB/s) | ridge point (ops/byte) |
|---|---|---|---|
| FP4 | 4,500 | 8.0 | 562 |
| FP8 | 2,250 | 8.0 | 281 |
| BF16 | 1,500 | 8.0 | 187 |
Something surprising happens when you compare generations. blackwell’s FP8 ridge point (281) is less than half of H100’s (590). that means blackwell transitions to compute-bound at a lower batch size. On H100, you needed batch ~295 to become compute-bound at FP8. On B200, you need batch ~141.
Compute-bound crossover batch sizes for each precision (ridge point ÷ ops per weight byte per sequence, counting weight traffic only):
- FP16: B ≈ 187 (1 op/byte per sequence)
- FP8: B ≈ 141 (2 ops/byte per sequence)
- FP4: B ≈ 141 (4 ops/byte per sequence)
The crossover barely moves with precision. Halving the bytes per weight doubles the arithmetic intensity per sequence, but it also doubles peak TOPS, so the ridge point moves out by the same factor and the two cancel. Below batch ~140, weight streaming is bandwidth-bound at every precision.
Nvidia pulled the crossover in by scaling bandwidth faster than compute: H100 to B200 compute went 1.14×, bandwidth went 2.39×. The bandwidth scaling outpaced compute by 2.1×. this is the right design choice for inference.
But let’s check a realistic production scenario — batch=128. Are we compute-bound?
AI at batch=128 = 128 × 2 = 256 ops/byte
ridge point = 281 ops/byte
256 < 281. still bandwidth-bound. barely, but we are. And here’s where it gets tight: the batch=141 crossover requires 92 GB of KV cache at 4K context (141 × 655 MB) + 70 GB weights = 162 GB out of 192 GB. About 85% of HBM is already committed at the compute-bandwidth crossover, and the memory wall arrives at batch ~186.
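The FP8 crossover is just the ridge point divided by the arithmetic intensity one sequence contributes. A sketch under that weights-only model; the function name is mine, and the H100 peak of 1,979 FP8 TOPS is an assumed figure chosen to match the ~590 ops/byte ridge point quoted earlier.

```python
def crossover_batch(peak_tops, bandwidth_tbps, ops_per_weight=2, bytes_per_weight=1):
    """Batch size at which weight-streaming decode reaches the ridge point."""
    ridge = peak_tops / bandwidth_tbps
    intensity_per_sequence = ops_per_weight / bytes_per_weight
    return ridge / intensity_per_sequence


b200_fp8 = crossover_batch(2250, 8.0)
h100_fp8 = crossover_batch(1979, 3.35)  # assumed H100 FP8 peak, see note above

print(f"B200 FP8 crossover: batch ~{b200_fp8:.0f}")
print(f"H100 FP8 crossover: batch ~{h100_fp8:.0f}")

batch = 128
intensity = batch * 2  # ops/byte at FP8
bound = "compute" if intensity >= 2250 / 8.0 else "bandwidth"
print(f"batch {batch}: {intensity} ops/byte, {bound}-bound")
```

This ignores KV cache traffic, which only pushes the real crossover further out.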
this is not a coincidence. nvidia’s architects sized HBM capacity, bandwidth, and compute to all hit their walls at roughly the same operating point. No resource is dramatically over-provisioned. That’s good chip design. It’s also why there’s no free lunch.
decode — the bandwidth-bound reality
“decode is memory-bandwidth-bound, therefore buy the chip with the most bandwidth.” nvidia’s marketing loves this framing. the actual story is way more interesting.
During autoregressive decode, generating one token requires reading ALL weights (for FF) and ALL relevant KV cache entries (for attention). The systolic array utilization problem is severe.
Blackwell’s tensor cores are 64×64. With batch=1, the FF matmul is 1×8192 × 8192×8192. Tiled into 64×64 blocks, you need N=64 rows to fill the array. You have N=1. utilization: 1/64 = 1.6%. 98.4% of those 720 tensor cores are idle. (this is the batch=1 worst case — production inference servers batch requests across users, so actual utilization is significantly higher; see the batching analysis below.) roune calls decode “the more troublesome kind of inference.” I think that undersells it.
But there are three tricks that transform the picture:
trick 1: grouped query attention (GQA). llama 70B groups its 64 query heads (8192 / head_dim 128) into G=8 groups, so 8 heads share each KV set. N goes from 1 to 8. Attention utilization: 8/64 = 12.5%. An 8× improvement from a model architecture choice, not hardware.
trick 2: speculative decoding. guess 3-4 tokens ahead with a draft model, verify all at once. N = 4 × 8 = 32. Attention utilization: 32/64 = 50%. Filling all 64 rows would take 8 tokens in flight per sequence.
trick 3: batching for FF. serve 64 users simultaneously, and FF sees a 64×8192 activation matrix. N=64. Utilization: 100%.
Note the asymmetry roune is careful to point out: batching helps FF but NOT attention. different users have different KV caches. You need GQA for attention utilization and batching for FF utilization — different tricks targeting different parts of the same forward pass.
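All three tricks do the same thing: they raise N, the number of rows that share one set of weights or one KV tile. A minimal sketch of the row-utilization arithmetic, assuming a square 64×64 array and counting padding in the last partial tile as waste (the function name is mine):

```python
def row_utilization(rows, array_dim=64):
    """Share of a square systolic array's rows doing useful work."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    full_tiles, remainder = divmod(rows, array_dim)
    tiles = full_tiles + (1 if remainder else 0)
    return rows / (tiles * array_dim)


for n in (1, 8, 16, 32, 64):
    print(f"N={n:>2}: {row_utilization(n):.1%}")
```

Past N=64 the picture gets lumpy again: N=65 needs two tiles and drops back to about 51%, which is why serving stacks like batch sizes that are multiples of the array dimension.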
With the full stack (GQA + spec decode + batch=64), let’s check whether we’re still bandwidth-bound:
FF layers: AI = 64 × 4 × 2 = 512 ops/byte. Ridge = 281. compute-bound! decode, which everyone calls “memory-bandwidth-bound,” is compute-bound on FF once you batch and speculate properly.
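attention layers: KV cache reads for 4K context across 80 layers ≈ 655 MB per batch element (4,096 tokens × 160 KB). With batch=64: 42 GB of reads per decode step. At 8 TB/s: ~5.2 ms. Each KV byte is reused only by the few query rows that share it, so arithmetic intensity stays far below the ridge point. solidly bandwidth-bound.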
FF is compute-bound, attention is bandwidth-bound. A classic imbalance within a single forward pass.
This is exactly where roune’s managed aggregation becomes critical: if you run both prefill (compute-bound) and decode on the same chip, the prefill work can overlap with decode’s bandwidth-bound attention. The memory bus serves KV reads while tensor cores crunch prefill tokens. Neither subsystem sits idle.
so what? blackwell’s decode story is legitimately strong, but only in a world where (a) models use aggressive GQA, (b) speculative decoding works for your workload, and (c) your serving stack implements managed aggregation. That’s a lot of ifs. The chip is capable of near-100% utilization. Whether anyone gets there in production is a different question.
blackwell through roune’s lens
roune’s core insight is almost embarrassingly simple: all AI hardware is systolic arrays with marketing names. tensor cores, MXUs, matrix cores — the entire industry converged on the same circuit. The interesting question is how big you make them.
larger systolic arrays are inherently more efficient. double the vector width (N to 2N) and you get 4× math per cycle, but scalar overhead barely scales. Nvidia’s own history tells the story:
| generation | year | tensor core size | FMAs/core/cycle |
|---|---|---|---|
| volta | 2017 | 4×4 | 16 |
| turing | 2018 | 16×16 | 256 |
| blackwell | 2025 | 64×64 | 4,096 |
But google started at 256×256 in 2016. Even at 128×128, a TPU MXU has 16,384 FMAs/cycle vs blackwell’s 4,096. Using roune’s scaling math:
- 64×64 (blackwell): overhead-to-compute ratio improves 16× vs 4×4 baseline
- 128×128 (TPU v4): improves 32×
- 256×256 (TPU v1): improves 64×
A TPU v4’s MXU has 2× better overhead ratio than a blackwell tensor core. So why doesn’t nvidia build 256×256?
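The scaling list above follows from one assumption: math grows with N² while the overhead of feeding the array grows with N, so overhead per FMA improves linearly in N. A sketch under that assumption (my reading of the numbers in the list, not a formula quoted from roune):

```python
def fmas_per_cycle(n):
    """An n x n systolic array performs n^2 FMAs per cycle."""
    return n * n


def overhead_ratio_gain(n, baseline=4):
    """Improvement in overhead per FMA versus a baseline array, assuming
    overhead scales with n while compute scales with n^2."""
    return n / baseline


for label, n in [("volta 4x4", 4), ("blackwell 64x64", 64),
                 ("TPU v4 128x128", 128), ("TPU v1 256x256", 256)]:
    print(f"{label}: {fmas_per_cycle(n):>6} FMAs/cycle, {overhead_ratio_gain(n):.0f}x vs 4x4")
```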
The answer: the mono-sized problem.
Every blackwell tensor core is 64×64. The same array handles:
- FF layers: K = 8192. Tiles beautifully into 64×64. Utilization ~95%+.
- attention: K = 128. Gets 128/64 = 2 passes. Decent.
- small attention heads: K = 16. Utilization = 16/64 = 25%. Painful.
Now imagine a 256×256 TPU MXU doing attention at K=128: 128/256 = 0.5. You can’t even fill the K dimension once. Pad with zeros and you waste half your compute.
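The mono-sized problem is a padding problem, and it is easy to tabulate. A sketch that treats the K dimension in isolation and counts zero-padding in the last pass as wasted work (the function name is mine, and this idealized model gives 100% where the text says ~95%+ for real FF layers):

```python
import math


def k_utilization(k, array_dim):
    """Return (utilization, passes) for a K dimension tiled onto an array."""
    passes = math.ceil(k / array_dim)
    return k / (passes * array_dim), passes


cases = [
    ("FF, K=8192 on 64x64", 8192, 64),
    ("attention, K=128 on 64x64", 128, 64),
    ("small heads, K=16 on 64x64", 16, 64),
    ("attention, K=128 on 256x256", 128, 256),
]

for label, k, dim in cases:
    util, passes = k_utilization(k, dim)
    print(f"{label}: {util:.0%} over {passes} pass(es)")
```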
Roune’s solution is elegant: build two types of cores. a small number of large arrays (256×256) for FF, a larger number of small arrays (64×64) for attention. Both always doing what they’re best at.
Nobody has shipped this. Google sticks with large mono-sized MXUs and pays the attention tax. Nvidia sticks with medium mono-sized tensor cores and pays the FF efficiency tax.
Nvidia’s choice is defensible. CUDA ecosystem compatibility is the moat. Introducing heterogeneous on-chip compute would break existing software and force a multi-year ecosystem migration. So nvidia ships 720 identical 64×64 tensor cores, lets CUDA and Flash Attention handle the tiling, relies on brute-force parallelism to smooth over utilization gaps, and charges $30-40K per chip.
At 1000W per chip and electricity at $0.05-0.10/kWh, the 72 GPUs in an NVL72 running 24/7 cost ~$32-63K/year to power (72 kW × 8,760 h); call it ~$50-100K/year for the whole rack once CPUs, switches, and cooling are counted. The chips cost $2-3M per rack. Power is 2-5% of TCO annually. nvidia can afford to be 2-3× less efficient than a perfect ASIC and still win because their chips exist, their software works, and their supply chain delivers.
But if you’re designing from scratch — cerebras, groq, etched, or a stealth startup with a blank sheet — roune’s dual-core insight is the highest-leverage architectural idea in the space right now. It would be unsurprising to see this in a future TPU generation — the efficiency gain is too large to leave on the table.
NVL72 — the mega-system as inference platform
72 blackwell GPUs + 36 grace CPUs in a single rack:
- 13.8 TB HBM3e total
- 576 TB/s aggregate memory bandwidth
- 162 PFLOPS FP8 / 324 PFLOPS FP4
- 130 TB/s bisection bandwidth via 18 NVSwitch chips
That last number — 130 TB/s bisection bandwidth — is the most underappreciated spec in the system. Here is why.
capacity math: llama 405B at FP8 (405 GB weights) across 72 GPUs = 5.6 GB/GPU. Remaining: 186 GB/GPU for KV cache = 13.4 TB total. At 160 KB/token (the 70B figure from earlier, reused as a rough stand-in; 405B has more layers, so its real per-token KV is larger): 83 million tokens of KV cache capacity in a single rack. Batch 1000 at 8K context = 1.3 TB. Less than 10% of available memory.
but does the interconnect keep up? for managed aggregation — assigning some GPUs to prefill, others to decode, dynamically — you need to transfer KV cache between them. One sequence at 4K context = 655 MB. For 1000 sequences/second: 655 GB/s of KV transfer.
At 130 TB/s bisection bandwidth: 655 GB/s = 0.5% of available bandwidth. The NVSwitch fabric makes KV cache transfer effectively free.
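The rack-level math chains together a few divisions, so here it is in one place. A sketch using the text's figures, including the 160 KB/token value carried over from the 70B calculation. The function and its parameter names are mine.

```python
def nvl72_budget(gpus=72, hbm_per_gpu_gb=192.0, weights_gb=405.0,
                 kv_kb_per_token=160.0, fabric_tbps=130.0,
                 context=4096, sequences_per_s=1000):
    """Rack-level KV capacity and the fabric share used by KV transfers."""
    weights_per_gpu_gb = weights_gb / gpus
    kv_capacity_tb = gpus * (hbm_per_gpu_gb - weights_per_gpu_gb) / 1000
    kv_tokens_millions = kv_capacity_tb * 1e12 / (kv_kb_per_token * 1e3) / 1e6
    sequence_mb = context * kv_kb_per_token / 1000
    transfer_gbps = sequences_per_s * sequence_mb / 1000
    fabric_share = transfer_gbps / (fabric_tbps * 1000)
    return {
        "weights_per_gpu_gb": weights_per_gpu_gb,
        "kv_capacity_tb": kv_capacity_tb,
        "kv_tokens_millions": kv_tokens_millions,
        "sequence_mb": sequence_mb,
        "transfer_gbps": transfer_gbps,
        "fabric_share": fabric_share,
    }


budget = nvl72_budget()
for key, value in budget.items():
    print(f"{key}: {value:.3f}")
```

Raising `kv_kb_per_token` to model a deeper network shrinks the token capacity and raises the transfer rate proportionally, but the fabric share stays tiny for any plausible value.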
this is the key insight: NVL72 is not primarily a bigger GPU cluster. It’s the infrastructure that makes flexible prefill/decode disaggregation practical. you can shift GPUs between prefill and decode on the fly, maintain roune’s optimal managed aggregation ratio, and keep both compute and bandwidth fully utilized.
Compare to disaggregation across separate machines on InfiniBand at 400 Gb/s per link, which is about 400 GB/s for a node with eight NICs. That same 655 GB/s transfer saturates more than a full node’s worth of links. NVL72’s internal fabric is 325× that bandwidth at ~300ns latency. Qualitative difference.
The cost question: at $3-5M per rack, rough revenue math: 72K tokens/second at $0.50/million tokens = ~$3,100/day if the rack never idles. That’s a 2.6-4.4 year payback before electricity, cooling, and staff, and roughly double that at 50% utilization. you need >50% utilization and >$1/million token pricing to make it work in 3 years.
what I actually think
The B200 is a beautifully over-engineered compromise between silicon efficiency and software compatibility. That’s not a criticism — it’s the most defensible position in the market.
what nvidia got right:
- bandwidth scaling outpacing compute scaling (ridge point halved — the right move for inference)
- chiplet design at exactly the right die size for N4P yield economics
- NVL72 fabric bandwidth enabling managed aggregation at scale
- FP4 support as an escape hatch for bandwidth-bound decode
what nvidia left on the table:
- mono-sized 64×64 tensor cores when roune’s dual-core architecture could gain 2-3× efficiency
- 192 GB HBM3e runs out at modest batch sizes on a 70B model
- 1000W TDP when a purpose-built inference ASIC could potentially achieve the same throughput at 200-300W
competitive positioning:
| vs | B200 advantage | B200 disadvantage |
|---|---|---|
| AMD MI300X | 8.0 vs 5.3 TB/s BW | MI300X has more FP8 TOPS (2615 vs 2250) |
| Google TPU v5p | higher absolute throughput | 2.5× more power, less efficient per op |
| Groq LPU | larger model support (192 GB vs 230 MB SRAM) | groq wins on latency |
| Cerebras WSE-3 | cost, ecosystem | cerebras wins if model fits 44 GB SRAM |
| Apple M4 Ultra | 10× higher throughput | 6.7× more power, different design point |
The history of computing is a history of “good enough” winning over “optimal.” nvidia knows this better than anyone. Blackwell is good enough. The CUDA ecosystem is the moat. And until someone ships roune’s dual-core architecture with a working compiler and production-ready software stack, “good enough” will keep winning.
But that 2-3× efficiency gap is real money at datacenter scale. Someone should build it.
project references
ChipletCostModel: B200 = 2×400 mm² + CoWoS-L + 8× HBM3e. Estimated BoM ~$30K.
RooflineVM: ridge points at FP4/FP8/BF16 = 562/281/187 ops/byte. L2 = 96 MB, HBM = 8 TB/s.
InferBench: use B200 as primary GPU baseline. Decode at B=1 ≈ 0.5% utilization (FP16: 1 op/byte against a 187 ops/byte ridge point).
interesting reads
- Bjarke Roune, “Designing AI Chip Software and Hardware” (2026): the primary reference for this entire article. Roune was the software lead for TPUv3 at google and wrote the definitive guide on systolic array design, managed aggregation, and the mono-sized problem. If you only read one thing on AI chip architecture, read this.
- NVIDIA Blackwell Architecture Technical Brief: nvidia’s official whitepaper covering the B200/B100 specs, NV-HBI die-to-die interconnect, 5th gen tensor cores, and NVL72 system design. https://resources.nvidia.com/en-us-blackwell-architecture
- SemiAnalysis, “NVIDIA Blackwell” series: dylan patel and team’s multi-part deep dives on blackwell silicon, CoWoS-L packaging, HBM3e economics, and NVL72 rack-level analysis. The best external coverage of nvidia’s supply chain and cost structure. https://semianalysis.com
- Simon Boehm, “How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance”: step-by-step walkthrough of tiling, shared memory, and register-level optimizations for matrix multiplication on nvidia GPUs. Essential context for understanding why tensor core utilization matters so much. https://siboehm.com/articles/22/CUDA-MMM
- Thunder Kittens / Hazy Research (Stanford): hardware-aware GPU kernels from the hazy research group that push tensor core utilization toward theoretical limits. Demonstrates what’s possible when you co-design software with the memory hierarchy. https://hazyresearch.stanford.edu/blog/2024-05-12-tk
- Mark Horowitz, “Computing’s Energy Problem (and what we can do about it)” (ISSCC 2014): the original stanford paper establishing that data movement dominates compute energy cost. The “1 HBM access = 1000 FMAs” relationship cited in this article traces directly back to horowitz’s energy table. The foundational reference for why memory walls exist.
- Flash Attention (Tri Dao et al.): the IO-aware exact attention algorithm that restructures attention computation to minimize HBM reads/writes. Directly relevant to the decode utilization discussion and why software co-design with the memory hierarchy matters.
- Groq LPU Architecture: groq’s approach of putting 230 MB of SRAM on-chip to eliminate the HBM bottleneck entirely. A radically different design point from blackwell that trades model size for deterministic latency. https://groq.com/technology/
- Cerebras WSE-3: the wafer-scale engine approach, with 44 GB of on-chip SRAM and no HBM at all. The opposite extreme from nvidia’s chiplet strategy, and a useful reference point for understanding the tradeoffs in the ASIC vs GPU debate.
this analysis draws heavily from bjarke roune’s “Designing AI Chip Software and Hardware” (2026), the best single document on AI chip design I’ve read. if you work on AI hardware, read it.