designing AI chip hardware and software — key ideas from bjarke roune

Bjarke Hammersholt Roune was the technical software lead for Google’s TPUv3 and worked at Nvidia on GPUs. In 2026 he published a freely available document laying out how he would design an AI chip from scratch. It’s the best single document on AI chip design I’ve ever read.

This page indexes the key ideas I reference across my research notes.

the core thesis

The eventual future of AI accelerators is AI CPUs — traditional CPUs with big caches and large systolic arrays. The caches and systolic arrays take up most of the chip area and power. Everything else is overhead.

What “AI CPU” means concretely: a chip where roughly 60% of die area is SRAM cache, 30% is systolic arrays for matrix math, and the remaining 10% is minimal control logic — instruction fetch, decode, a thin scheduler, and a small scalar unit for address calculation and branching. That’s the whole chip. No texture units, no ray tracing cores, no general-purpose execution pipelines.

Contrast this with a modern GPU like Blackwell. Roughly 20% of its die goes to control and scheduling infrastructure: warp schedulers, instruction decoders, dispatch logic, scoreboarding. Another 5-10% goes to general-purpose CUDA cores that handle non-matrix workloads. Add in RT cores, texture samplers, and other fixed-function hardware and you’ve committed a third or more of your transistor budget to things that contribute nothing during a matrix multiply.

Roune calls this the “generality tax.” A GPU must handle rasterization, ray tracing, physics simulation, video encoding, and also matrix math. An AI CPU pays none of that tax. Every transistor either computes a multiply-accumulate or feeds data to something that does. The architectural bet is that the software stack (compilers, schedulers, memory management) can be handled by a small scalar unit with a big enough instruction cache, exactly the way a traditional CPU handles complex logic with a simple fetch-decode-execute loop, just at much lower area cost. The endgame looks more like a DSP with a massive register file than anything we’d recognize as a GPU today.

systolic arrays are the foundation

Every AI chip is really just a way to access a systolic array. Nvidia calls them Tensor Cores. Google calls them MXUs. AMD calls them Matrix Cores. Intel calls them AMX. Amazon calls them NeuronCores. Whatever they call them, it’s all systolic arrays. They were invented in 1978.

key insight: larger systolic arrays are more efficient. The math behind this is clean and worth understanding. For an N x N systolic array:

compute scales as N squared. Each cycle, the array performs N x N fused multiply-accumulates (FMAs). Double N and you get 4x the throughput.
I/O scales as 2N. Each cycle, you feed N values along the top edge and N values along the left edge. Double N and I/O only doubles.
compute-to-I/O ratio = N/2. This is the critical number. It tells you how many FMAs you get per value moved across the array’s edges.

Going from a 4x4 array (Volta’s tensor core, N=4, ratio=2) to a 128x128 array (TPUv3’s MXU, N=128, ratio=64) gives a 32x improvement in compute-to-I/O ratio. That 32x is not a minor optimization — it means the 128x128 array needs 32x less memory bandwidth per FLOP to stay saturated.

array size	FMAs/cycle	compute/IO ratio (N/2)	ratio improvement vs 4x4	example chips
4x4	16	2	1x (baseline)	Volta tensor core
16x16	256	8	4x	Ampere, Hopper
64x64	4,096	32	16x	Blackwell tensor core
128x128	16,384	64	32x	TPUv3/v4 MXU
256x256	65,536	128	64x	TPU v6e (Trillium) MXU
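
The table rows follow directly from the two scaling rules above. Here is a minimal Python sketch that regenerates them for an idealized array. The function name `array_stats` and the dictionary keys are my own, not Roune's.

```python
def array_stats(n: int, baseline_n: int = 4) -> dict:
    """Per-cycle figures for an idealized n x n systolic array."""
    fmas_per_cycle = n * n                    # one FMA per cell per cycle
    io_per_cycle = 2 * n                      # n values on the top edge, n on the left edge
    ratio = fmas_per_cycle / io_per_cycle     # simplifies to n / 2
    baseline_ratio = baseline_n / 2           # ratio of the reference array (4x4 by default)
    return {
        "fmas_per_cycle": fmas_per_cycle,
        "io_per_cycle": io_per_cycle,
        "compute_io_ratio": ratio,
        "improvement_vs_baseline": ratio / baseline_ratio,
    }


for n in (4, 16, 64, 128, 256):
    s = array_stats(n)
    print(f"{n}x{n}: {s['fmas_per_cycle']:>6} FMAs/cycle, "
          f"ratio {s['compute_io_ratio']:g}, "
          f"{s['improvement_vs_baseline']:g}x vs 4x4")
```

This only models the array itself. It says nothing about whether the memory system can actually deliver 2N values per cycle.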

The “surrounding logic” cost matters too. Each systolic array needs input buffers, output accumulators, format converters, and a small controller. These scale roughly as O(N), not O(N squared). So as the array gets bigger, the overhead fraction shrinks. A 4x4 array might spend half its transistors on overhead logic. A 128x128 array spends only a few percent.
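
To see how fast the overhead fraction falls, here is a toy model of my own (not from Roune's document). It assumes the array cells cost N squared units and the surrounding logic costs exactly c times N units, with c = 4 chosen so that a 4x4 array lands on the 50% figure mentioned above.

```python
def overhead_fraction(n: int, c: float = 4.0) -> float:
    """Share of transistors spent on surrounding logic for an n x n array.

    Assumed model: array cells cost n*n units, surrounding logic costs c*n units.
    c = 4 is a calibration choice that puts a 4x4 array at exactly 50% overhead.
    """
    return (c * n) / (c * n + n * n)


for n in (4, 16, 64, 128, 256):
    print(f"{n}x{n}: {overhead_fraction(n):.1%} overhead")
```

Under this strictly linear assumption the fraction simplifies to c / (c + N): about 3% at 128x128 and below 2% at 256x256. Real designs will differ, since buffers and controllers do not scale perfectly linearly.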

This is why Google’s small TPU team (~30 people) could compete with Nvidia: 128x128 systolic arrays are fundamentally more efficient than Nvidia’s 4x4 (Volta) through 64x64 (Blackwell). Google traded generality for efficiency by going big on a single systolic array per core, and the physics of N-squared-over-2N rewarded them for it.

mono-sized systolic arrays are unbalanced

This is Roune’s most important architectural insight, and the one I keep coming back to.

Transformers have two kinds of heavy computation:

FF layers: K dimension is large (8,192+). Large systolic arrays excel here.
attention: K dimension is small (16-128). Large systolic arrays get low utilization.

Let’s walk through the numbers to see why neither current approach works well.

case 1: a 256x256 MXU (TPU-style). Consider a transformer with a feed-forward hidden dim of K=8192 and an attention head dimension of K=128.

FF layer: K=8192 tiles evenly into the 256-wide array (8192/256 = 32 tiles). Every element of the array does useful work on every cycle. Utilization approaches 100%. This is the ideal case.
attention: K=128 on a 256-wide array means K/N = 128/256 = 0.5. The array is only half-occupied — the other half is padded with zeros, burning power and cycles to compute results you throw away. 50% waste, and this is the optimistic case. For smaller head dims (K=64, common in smaller models), waste rises to 75%.

case 2: a 64x64 tensor core (GPU-style).

FF layer: K=8192 splits into 8192/64 = 128 tiles. Each tile fully utilizes the array, which is good, but you now need 128 separate tile dispatches. Each dispatch has scheduling overhead, register file reads/writes, and synchronization cost. The array itself is efficient per tile, but the system-level overhead of managing 128 tiles is non-trivial.
attention: K=128 splits into 128/64 = 2 tiles. Each tile is fully utilized, with none of the padding waste of the 256x256 case. But the small array’s inherently worse compute-to-I/O ratio (N/2 = 32 vs 128) means you need more bandwidth to keep it fed.

Neither architecture is optimal for both workloads. You either waste cycles on attention (large array) or leave efficiency on the table for FF (small array).
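
The tile counts and waste percentages in both cases come from the same fill-ratio arithmetic. Here is a short Python sketch of it, assuming K is zero-padded up to the next multiple of the array width. The helper name `k_fill` is mine.

```python
import math


def k_fill(k: int, n: int) -> tuple[int, float]:
    """Tiles needed along K, and the fraction of the array width doing useful work.

    Assumes K is padded with zeros up to the next multiple of n.
    """
    tiles = math.ceil(k / n)
    utilization = k / (tiles * n)
    return tiles, utilization


for n in (256, 64):
    for name, k in (("FF", 8192), ("attention", 128), ("attention, small head", 64)):
        tiles, util = k_fill(k, n)
        print(f"{n}x{n} {name} (K={k}): {tiles} tile(s), "
              f"{util:.0%} utilized, {1 - util:.0%} wasted")
```

This captures only the padding waste along K. The per-dispatch scheduling cost of the 128-tile case and the bandwidth cost of the smaller array are not modeled.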

Roune’s dual-core proposal: dedicate approximately 70% of chip area to a small number of large (256x256) “FF cores” and 30% to many small (64x64) “attention cores” on the same die. The FF cores handle the big matmuls where their size advantage dominates. The attention cores handle the small-K operations where their better utilization matters. A smart scheduler routes each operation to the right core type.

The estimated efficiency gain is 2-3x over a mono-sized design, with both core types running at >80% utilization instead of one type averaging 50-60% across the workload mix. Utilization alone accounts for roughly 1.3-1.6x of that (80% versus 50-60%); the rest would have to come from the large FF cores’ better compute-to-I/O ratio. The utilization part is not speculative: it falls directly out of the fill-ratio arithmetic above. The engineering challenge is the scheduler and on-die interconnect, not the math.

I think this is the idea from Roune’s work most likely to show up in production silicon within 2-3 years. The pressure to serve larger models at lower cost/watt is relentless, and a 2x efficiency gain from better array sizing is too large to leave on the table.

the KV cache formula

The amount of HBM needed for KV cache:

2 * batch * layers * seq_length * attention_head_groups * vector_width * element_size * idle_magnification * pipeline_factor

Let’s unpack each term:

2: one for keys, one for values. Always a factor of 2.
batch: number of sequences being processed concurrently. Higher batch = more throughput but more memory.
layers: number of transformer layers. Each layer has its own KV cache.
seq_length: tokens in the context window. This is the term that’s been growing exponentially — from 2K (GPT-3) to 4K to 128K to 1M+.
attention_head_groups: with GQA, this is the number of KV head groups, not the total number of attention heads. GQA was invented specifically to shrink this term.
vector_width: dimension per attention head, typically 128.
element_size: bytes per element. FP16 = 2 bytes, FP8 = 1 byte, INT4 = 0.5 bytes.
idle_magnification: this is the subtle one. During decode, generating a single output token requires reading the entire KV cache for the full sequence length to compute attention. The chip is “idle” (from a compute perspective) while it reads all this data: it does O(seq_length) memory reads but only O(1) useful output. The idle_magnification term captures this: the memory system must be sized not just to store the KV cache but to stream it fast enough that the systolic arrays aren’t starved.
pipeline_factor: when using pipeline parallelism across multiple chips, each stage holds a fraction of the layers but needs its own copy of the KV cache for its assigned layers. Pipeline parallelism fragments memory — you can’t share KV cache across pipeline stages because each stage processes different layers at different times.

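With realistic numbers (batch B=64, layers L=32, seq_length N=1M, head groups G=8, vector width W=128, element size S=1 byte) and both idle_magnification and pipeline_factor set to 1: 4,194 GB, or about 4.2 TB. This is why AI chips keep getting more, and more expensive, HBM.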
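
The formula is simple enough to check by hand, but a small Python version makes it easy to vary one term at a time. This is a sketch under two assumptions of mine: `idle_magnification` and `pipeline_factor` default to 1 (which is what reproduces the 4,194 GB figure), and 1M tokens means 1,000,000 with decimal gigabytes.

```python
def kv_cache_bytes(batch, layers, seq_length, attention_head_groups,
                   vector_width, element_size,
                   idle_magnification=1.0, pipeline_factor=1.0):
    """HBM bytes needed for KV cache, following the formula term by term."""
    return (2                        # one copy for keys, one for values
            * batch
            * layers
            * seq_length
            * attention_head_groups  # KV head groups under GQA
            * vector_width           # dimension per head
            * element_size           # bytes per element
            * idle_magnification     # assumed 1 unless stated otherwise
            * pipeline_factor)       # assumed 1 unless stated otherwise


total = kv_cache_bytes(batch=64, layers=32, seq_length=1_000_000,
                       attention_head_groups=8, vector_width=128, element_size=1)
print(f"{total / 1e9:,.0f} GB")  # 4,194 GB
```

Halving `element_size` (FP8 to INT4) or `attention_head_groups` halves the total, which is why quantization and GQA are the two terms people attack first.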

The formula makes the HBM treadmill obvious: seq_length is the dominant growth term, and KV cache size scales linearly with it. When context windows went from 4K to 128K (32x), KV cache memory requirements grew 32x. The jump from 128K to 1M is another 8x. HBM capacity per chip has grown less than 4x in the same period (from 80 GB on A100 to 192 GB on B200 and 288 GB on Blackwell Ultra). The gap is filled by multi-chip systems, which is why NVL72 exists: it’s fundamentally an answer to the KV cache scaling problem.

decode vs prefill

prefill (reading input): compute-bound. Systolic arrays are the bottleneck. High utilization. “nothing speculative about it.”
decode (generating output): memory-bandwidth-bound. Poor systolic array utilization. The troublesome one.

Naive decode utilization on a 256x256 systolic array: less than 1%. But with GQA + speculative decoding: up to 64x improvement.
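
One way to see where both numbers come from is a toy row-occupancy model of my own, not Roune's exact derivation. It assumes a single decode request fills one row of the array per query that can share the same K/V tile. GQA multiplies that by the number of query heads per KV group, and speculative decoding multiplies it by the number of tokens verified at once. The 8 x 8 split below is an illustrative assumption that happens to give 64x.

```python
def decode_utilization(n: int, gqa_group_size: int = 1,
                       speculative_tokens: int = 1) -> float:
    """Fraction of an n x n array's rows doing useful work during decode attention.

    Assumed model: one request contributes gqa_group_size * speculative_tokens
    query rows that share the same K/V tile; all other rows sit idle.
    """
    rows_used = min(n, gqa_group_size * speculative_tokens)
    return rows_used / n


naive = decode_utilization(256)
improved = decode_utilization(256, gqa_group_size=8, speculative_tokens=8)
print(f"naive: {naive:.2%}, with GQA + speculation: {improved:.0%}, "
      f"gain: {improved / naive:g}x")
```

Even after a 64x gain, this model leaves a 256x256 array at 25% utilization, which is consistent with decode being the troublesome phase.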

managed aggregation

Mix prefill and decode on the same chip in a managed ratio. Prefill is compute-bound (waiting on systolic arrays), decode is memory-bound (waiting on bandwidth). Combine them → both resources fully utilized.

This is the key operational insight for running inference at scale. The two phases have complementary resource profiles:

prefill has high arithmetic intensity (many FLOPs per byte of memory traffic). The systolic arrays are the bottleneck — they’re running flat out while the memory system is mostly idle, waiting for results before it can write them back.
decode has low arithmetic intensity (few FLOPs per byte). The memory system is the bottleneck: it’s streaming KV cache at maximum bandwidth while the systolic arrays sit nearly empty, waiting for that data to arrive.

The trick is interleaving them on the same chip. While decode requests are reading KV cache from HBM (bandwidth-bound, systolic arrays ~idle), you route prefill tokens to those idle systolic arrays. While prefill is crunching a big matmul (compute-bound, memory system ~idle), you let decode requests stream KV cache through the idle memory channels.

The optimal prefill-to-decode ratio depends on the workload mix: how many new requests are arriving (need prefill) vs how many ongoing generations are in flight (need decode). A heavily interactive chatbot might be 70% decode. A batch summarization job might be 90% prefill. The scheduler needs to adjust continuously.
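
A roofline-style toy model shows why mixing helps. The sketch below is mine, with a hypothetical chip (`PEAK_FLOPS`, `PEAK_BW`) and made-up workload sizes. It assumes each phase alone is limited by whichever resource it saturates, and that co-scheduled phases overlap perfectly.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    flops: float      # arithmetic work
    mem_bytes: float  # memory traffic


def phase_time(p: Phase, peak_flops: float, peak_bw: float) -> float:
    """Time for one phase alone: bound by compute or by bandwidth."""
    return max(p.flops / peak_flops, p.mem_bytes / peak_bw)


def serial_time(phases, peak_flops, peak_bw) -> float:
    """Run phases one after another; the idle resource is wasted each time."""
    return sum(phase_time(p, peak_flops, peak_bw) for p in phases)


def aggregated_time(phases, peak_flops, peak_bw) -> float:
    """Run phases interleaved, assuming ideal overlap of compute and memory."""
    total_flops = sum(p.flops for p in phases)
    total_bytes = sum(p.mem_bytes for p in phases)
    return max(total_flops / peak_flops, total_bytes / peak_bw)


# Hypothetical chip and workloads, chosen only to make the effect visible.
PEAK_FLOPS = 1e15   # FLOP/s
PEAK_BW = 4e12      # bytes/s
prefill = Phase(flops=1e15, mem_bytes=4e11)   # compute-bound
decode = Phase(flops=1e14, mem_bytes=4e12)    # bandwidth-bound

serial = serial_time([prefill, decode], PEAK_FLOPS, PEAK_BW)
mixed = aggregated_time([prefill, decode], PEAK_FLOPS, PEAK_BW)
print(f"serial: {serial:.2f} s, aggregated: {mixed:.2f} s, "
      f"speedup: {serial / mixed:.2f}x")
```

The gain is largest when the compute time of the mix and its memory time are roughly equal, which is exactly what the managed ratio is trying to achieve. Real schedulers will get less than this ideal overlap.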

This is where hardware interconnect becomes critical. Blackwell’s NVL72 rack provides 130 TB/s of aggregate NVLink bandwidth across 72 GPUs (72 x 1.8 TB/s per GPU). That bandwidth enables dynamic assignment: any GPU in the rack can shift between prefill and decode roles in microseconds, because the KV cache can be accessed from any node fast enough that it doesn’t matter where it physically lives. Without that fabric bandwidth, you’d need to statically partition GPUs into “prefill pool” and “decode pool,” which guarantees suboptimal utilization whenever the workload mix shifts.

Roune argues that managed aggregation — not larger systolic arrays, not more HBM, not better numerics — is the single largest lever for improving inference cost at the system level. I think he’s right for the 2026-2028 timeframe. The hardware exists today to do this well; the software stack (request schedulers, KV cache migration, load balancers) is what needs to catch up.

the “3.3% of transistors” claim

Etched and others claim only 3.3% of H100 transistors are for matrix multiplication. The math is correct but misleading. An ASIC still needs L2 cache (~17B transistors), L1/shared memory (~12B), register files, HBM controllers. The stuff you can actually eliminate (warp schedulers, instruction decode, RT cores) is maybe 10-15% of die. Not 96.7%.

structured sparsity and compression

Roune proposes an Int7+1 format with 1:2 structured sparsity: the 8th bit indicates which of two adjacent entries is non-zero. This could be the end-state numerics format for inference, combined with Huffman encoding for lossless compression on top of the lossy quantization.
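
To make the format concrete, here is a hedged Python sketch of packing one 1:2-sparse pair into a byte. The bit layout is my assumption, not something specified in these notes: bit 7 holds the position of the non-zero entry, and bits 0-6 hold the value as 7-bit two's complement. The function names are hypothetical as well.

```python
def encode_pair(a: int, b: int) -> int:
    """Pack a 1:2-sparse pair (at most one non-zero entry) into one byte.

    Assumed layout: bit 7 = position of the non-zero entry (0 = first, 1 = second),
    bits 0-6 = the value as 7-bit two's complement (-64..63).
    """
    if a != 0 and b != 0:
        raise ValueError("1:2 sparsity requires at least one zero per pair")
    position, value = (1, b) if b != 0 else (0, a)
    if not -64 <= value <= 63:
        raise ValueError("value does not fit in 7 bits")
    return (position << 7) | (value & 0x7F)


def decode_pair(byte: int) -> tuple[int, int]:
    """Inverse of encode_pair: recover the (first, second) entries."""
    position = byte >> 7
    value = byte & 0x7F
    if value >= 64:          # sign-extend the 7-bit field
        value -= 128
    return (0, value) if position else (value, 0)


for pair in [(5, 0), (0, -3), (0, 0), (-64, 0), (0, 63)]:
    packed = encode_pair(*pair)
    print(pair, "->", f"0x{packed:02X}", "->", decode_pair(packed))
```

Under this layout two weights occupy one byte, so the effective cost is 4 bits per original weight before any Huffman coding is applied.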

the TPUv3 turbo mode story

During TPUv3 development, systolic array heat production was underestimated. When chips arrived at Google HQ, they ran too hot. Roune (as SW lead) wrote compiler-based static analysis that prevented heat-generating instruction patterns, enabling a 25% clock rate increase. He wanted to call it “Turbo Mode” but his manager’s manager informed him he sucked at marketing.

interesting reads

“Designing AI Chip Software and Hardware” (Bjarke Hammersholt Roune, 2026): the source material, freely available
Google TPU v1 paper (Jouppi et al., 2017): “In-Datacenter Performance Analysis of a Tensor Processing Unit”
Google TPU v4 paper (Jouppi et al., 2023): “TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings”
Cerebras WSE-3: wafer-scale integration, the extreme end of “more SRAM, less HBM”
Groq LPU architecture: deterministic dataflow, no HBM, pure SRAM inference
SemiAnalysis: AI chip economics and competitive landscape

source: “Designing AI Chip Software and Hardware” by Bjarke Hammersholt Roune (2026). freely available. if you work on AI chips, read it.
