Systolic Arrays
Every modern AI accelerator — Tensor Cores, MXUs, Matrix Cores, AMX — is a systolic array under a different marketing name. The interesting question isn’t what they’re called. it’s why every vendor on earth converged on this exact architecture, and what problems that choice creates.
A systolic array is a grid of processing elements (PEs) that rhythmically pass data between neighbors, performing a multiply-accumulate at each step. Data pulses through the array in a regular wavefront — like a heartbeat. The name is literal.
Every modern AI accelerator is built around them:
| Vendor | Name | Architecture |
|---|---|---|
| NVIDIA | Tensor Cores | Warpgroup MMA on Blackwell |
| MXU (Matrix Multiply Unit) | TPU v1-v6e | |
| AMD | Matrix Cores | CDNA 1/2/3 |
| Intel | AMX (Advanced Matrix Extensions) | Sapphire Rapids+ |
| Amazon | NeuronCore | Trainium / Inferentia |
| Apple | ANE (matrix engine, undisclosed uarch) | M-series / A-series |
historical evolution
H.T. Kung and Charles Leiserson introduced systolic arrays in 1978 at Carnegie Mellon University. The original motivation was signal processing — convolutions, FIR filters, and matrix operations on dedicated VLSI. The idea was elegant: instead of fetching every operand from memory, you tile the computation across a mesh of simple cells that communicate only with their nearest neighbors.
For two decades, systolic arrays were considered a dead end. General-purpose CPUs won because real workloads have irregular control flow, branches, and pointer-chasing — none of which map onto a rigid PE grid. The architecture appeared in a handful of signal processing ASICs and then largely disappeared from mainstream chip design.
Google revived them in 2015 with the TPU v1, a 256×256 INT8 systolic array purpose-built for inference (initially for RankBrain and Street View). The key realization: deep learning workloads are almost entirely matrix multiplication, which is exactly the regular structure systolic arrays were designed for. The TPU v1 delivered 92 TOPS in 40W — unthinkable for a GPU at the time. Every major AI accelerator since has followed the same basic blueprint.
how systolic arrays work
The most common dataflow for AI workloads is weight-stationary: each PE stores one weight from the weight matrix, and activations flow through the array.
Here’s how a single PE operates:
- Receive an activation from the left neighbor
- Receive a partial sum from the top neighbor
- Compute:
partial_sum_out = partial_sum_in + (activation × stored_weight) - Pass the activation right (to the next PE in the row)
- Pass the updated partial sum down (to the next PE in the column)
A 4×4 array computing C = A × B looks like this (weight-stationary dataflow):
↓psum=0 ↓psum=0 ↓psum=0 ↓psum=0
a00 → [w00] → [w01] → [w02] → [w03]
↓ ↓ ↓ ↓
a10 → [w10] → [w11] → [w12] → [w13]
↓ ↓ ↓ ↓
a20 → [w20] → [w21] → [w22] → [w23]
↓ ↓ ↓ ↓
a30 → [w30] → [w31] → [w32] → [w33]
↓ ↓ ↓ ↓
c(col0) c(col1) c(col2) c(col3)
Activations enter from the left, staggered in time (a wavefront). At cycle 0, only a00 enters the top-left PE. At cycle 1, a00 moves right to [w01] while a10 enters [w10]. By cycle 3, all four rows are active and the array is fully utilized. Partial sums accumulate downward and emerge at the bottom as the output row of C.
Why this is efficient — O(N) data reuse:
- Each weight is loaded into its PE once and stays there. It gets multiplied against N different activations (one per row of A that streams through). That’s N uses per load.
- Each activation enters from the left and passes through N PEs in its row, getting multiplied against N different weights. That’s N uses per load.
- Total compute: N² multiply-accumulates. Total data loaded: N² weights + N activations per row = N² + N. The ratio is O(N). More PEs → better reuse → higher arithmetic intensity.
This is fundamentally different from a general-purpose core where every operand is fetched from a register file or cache per operation. The systolic structure turns memory bandwidth into a non-issue for the inner loop.
tiling and data reuse math
Real matrices are larger than the systolic array. A matmul of dimensions M×K × K×N on an S×S array requires tiling.
Number of tiles: ⌈M/S⌉ × ⌈K/S⌉ × ⌈N/S⌉
Each tile performs an S×S outer product accumulated over S steps along the K dimension:
- Weights loaded per tile: S² (one weight per PE)
- Activations loaded per tile: S² (S values streamed per cycle × S cycles)
- Compute per tile: 2S³ FLOPs (S² PEs × S cycles × 2 ops per FMA)
Arithmetic intensity scales as O(S). The key insight is that each weight is reused S times (once per activation row) and each activation is reused S times (once per weight column). Per tile, compute is 2S³ FLOPs against 2S² bytes of data traffic (S² weights + S² activations), giving AI = S ops/byte.
Concrete example for a 128×128 array (S = 128):
Compute per tile: 2 × 128³ = 4,194,304 FLOPs
Data per tile: 2 × 128² = 32,768 bytes
AI = 4,194,304 / 32,768 = 128 ops/byte
A 16×16 array (S = 16) gets only ~16 ops/byte for the same workload. Doubling the array side length doubles arithmetic intensity — this is why every vendor trends toward larger systolic arrays. Bigger arrays don’t just do more compute — they do more useful compute per byte of memory traffic. It’s also why Thunder Kittens focuses on tile-level optimization: the tiling strategy determines how much of this theoretical reuse you actually capture.
why systolic arrays dominate
Key insight from Bjarke Roune: larger systolic arrays are more efficient. Doubling the vector width from N to 2N quadruples compute (N² → 4N²) while only doubling I/O bandwidth (2N → 4N). The compute-to-IO ratio improves as O(N).
This is why the trend is toward ever-larger matrix units — NVIDIA went from 4×4 (Volta) to 16×16 (Turing) to 64×64 (Blackwell warpgroup MMA).
the mono-sized problem
But making arrays bigger creates a new problem: underutilization on small dimensions.
A transformer model has two types of matrix multiplies with very different shapes:
- Feed-forward (FF) layers: K = 8192 or larger. Tiles cleanly onto any array size.
- Attention (QK^T and score×V): K = d_head = 128, or with grouped-query attention as low as 16.
The utilization math is unforgiving:
| Array size | K dimension | Tiles along K | Utilization |
|---|---|---|---|
| 64×64 | 128 | 2 | ~100% (both tiles full) |
| 64×64 | 16 | 1 | 25% (16 of 64 rows used) |
| 256×256 | 128 | 1 | 50% (128 of 256 rows used) |
| 256×256 | 16 | 1 | 6.25% (16 of 256 rows used) |
The TPU’s 128×128 MXU is great for the FF layers that dominate training FLOPs. But during inference decode — where attention is the bottleneck — you’re leaving most of those PEs idle. A 256×256 array would be even worse.
Roune proposes the solution: heterogeneous cores. Put large systolic arrays on the chip for FF layers and smaller arrays (or vector units) for attention. This way each workload gets hardware matched to its dimensions. The idea is straightforward, but nobody has shipped it as of 2026. Every vendor still uses a single array size per chip generation, which means either the FF layers or the attention layers (or both) are running on a suboptimal tile size.
This is also why software matters so much. Thunder Kittens and TrtLLMGen exist precisely because the hardware is not perfectly matched to every kernel shape — clever tiling, pipelining, and data layout can recover some of the utilization that a mono-sized array leaves on the table.
connection to inference optimization
Systolic arrays are compute units. The bottleneck for LLM inference is usually memory bandwidth, not compute — see Inference Stack Synthesis. SpectralQuant’s KV cache compression addresses the bandwidth side; systolic array improvements address the compute side.
At small batch sizes (decode), GPUs achieve only ~1.6% spatial utilization of their systolic arrays (B=1 fills 1 of 64 rows in the tensor core) because they’re waiting on memory. Thunder Kittens and TrtLLMGen optimize how data is fed to these arrays.
interesting reads
- H.T. Kung & Charles Leiserson (1978) — “Systolic Arrays (for VLSI)” — the original CMU paper
- Google TPU v1 paper (Jouppi et al., ISCA 2017) — “In-Datacenter Performance Analysis of a Tensor Processing Unit”
- Eyeriss (MIT, 2016) — spatial architecture for CNNs, excellent dataflow taxonomy (row-stationary, weight-stationary, output-stationary)
- NVIDIA Tensor Core whitepapers — Volta through Blackwell architecture guides
- Bjarke Roune — “Designing AI Chip Software and Hardware” (2026) — the best treatment of systolic array efficiency scaling