DIY TPU v1: Reverse-Engineering Google’s First AI Chip
In december 2025, I set out to reverse-engineer google’s TPU v1 from first principles — the 256×256 INT8 systolic array that delivered 92 TOPS in 40W and kickstarted the entire AI accelerator market. The goal: deconstruct the architecture from papers, implement the full RTL stack in Verilog/SystemVerilog from scratch, and run live inference on an FPGA.
This is what I learned.
why a TPU?
The TPU v1 matters beyond historical interest. It established the template that every modern AI accelerator follows:
- 15-30x faster than contemporary Kepler GPUs and Haswell CPUs at inference (Jouppi et al., 2017)
- set the precedent for inference + training in the CNN era that parallels the Transformer era at far greater scale
- systolic array dataflow lessons echo in today’s chips: Furiosa RNGD, AWS Trainium, SambaNova SN40L
- the memory-bound + networking-bound principles still hold — the problems haven’t changed, only the scale
After chatting with Cliff Young and Richard at HotChips 2025, the base case became clear: understanding TPU v1 deeply lets you form inductive hypotheses about inference architecture that generalize forward.
dennard bottlenecks
Two walls define the design space:
memory wall — the processor-memory performance gap has been widening since 1980. Prefill and decode are both memory-bandwidth-limited at the workload level. MoE models make this worse with their explosion of active parameters.
networking wall — moving information between chips is expensive. Inter-node communication is the second bottleneck after memory.
the question: CPUs are 100x more energy-expensive moving data than computing. Now what?
the answer: specialized silicon. Strip out everything that isn’t matrix multiplication.
CPU vs GPU vs TPU
| CPU (General Purpose) | GPU (SIMD) | TPU v1 | |
|---|---|---|---|
| strength | logic, branching, OS tasks | parallel tasks, graphics | ONE thing: matrix multiply |
| flexibility | high | medium | minimal |
| arithmetic density | low | medium | maximum |
The TPU v1 traded all flexibility for arithmetic density. No branch predictor, no cache hierarchy, no speculative execution. Just a 256×256 array of multiply-accumulate cells fed by a weight FIFO and a unified buffer.
what’s in a TPU v1?
Pure-play inference. By silicon area, two components dominate:
- Matrix Multiply Unit (MXU): 256×256 = 65,536 INT8 MACs. 24% of chip area. does 64K multiply-accumulates per cycle.
- Unified Buffer: 96K×256×8b = 24 MiB of local activation storage. 29% of chip area. this is where input activations live — the SRAM that feeds the systolic array.
The rest: accumulators (4K×256×32b = 4 MiB, 6%), activation pipeline (6%), host interface (2%), control (2%), PCIe interface (3%), DDR3 ports (6%), misc I/O (1%).
the dataflow: how data actually moves
Here’s the TPU floorplan in terms of data movement:
PCIe Gen3 x16 ←→ Host Interface ←→ DDR3-2133 (30 GiB/s)
↓ (14 GiB/s) ↓ (30 GiB/s)
Unified Buffer ←→ Systolic Data Setup → Matrix Multiply Unit
(10 GiB/s) (167 GiB/s, (64K MACs/cycle)
Jouppi et al.)
↓
Accumulators
↓
Activation Pipeline
↓
Normalize / Pool
The key insight: 167 GiB/s (per Jouppi et al.) flows from the systolic data setup into the MXU, but only 30 GiB/s comes from DDR3. The unified buffer and weight FIFO exist to bridge this 5.5x bandwidth gap. Without on-chip SRAM staging, the systolic array would starve.
systolic array dataflow
The MXU is a weight-stationary 256×256 systolic array. Here’s how a single MAC cell works:
Weight (from FIFO)
↓
Input → [MAC] → Next Input
↓
Y_out = W×X + Y_in
- weights flow vertically — loaded from the weight FIFO, one column at a time
- activations flow horizontally — streamed from the unified buffer, left to right
- partial sums accumulate vertically — each column produces one output element
- all PEs compute simultaneously when data is available — N×N array does N² computations at full saturation
At cycle 0: only PE[0,0] is active. By cycle 255: all 65,536 PEs are computing. By cycle 510: the last result drains out. The diagonal wavefront — activations arriving staggered in time — is the fundamental rhythm of every systolic array.
staggered weight-loading: the weight FIFO
This is where timing gets tricky. Weights must arrive at exactly the cycle each PE expects them:
- for a 2×2 example: W00 first, then W01, then W10, then W11 — column 1 is delayed by 1 cycle relative to column 0
- a reset cycle is needed to flush weights already being populated
- synchronization between host control signals and staggered loading timing is the hardest part of the control logic
The weight FIFO absorbs the DDR3→MXU bandwidth mismatch: DDR3 delivers at 30 GiB/s, but the array can consume weights at the systolic clock rate. The FIFO decouples these two domains.
accumulator to unified buffer: double buffering
The accumulators hold 4K×256 entries of 32-bit partial sums. Results flow back to the unified buffer via double buffering:
- phase 1: systolic array writes to Buffer A while Buffer B is being read by the activation pipeline
- phase 2: swap — Buffer A is read, Buffer B is written
- swap happens every clock cycle, fully pipelined
This double-buffering pattern is the same concept that Thunder Kittens implements in software on modern GPUs with cp.async — load the next tile while computing the current one. The TPU v1 did it in hardware, 8 years earlier.
control flow: the instruction set
The TPU v1 has a simple CISC-like ISA with ~12 instructions:
read_host_memory— DMA weights from host DDR3 to weight FIFOread_weights— load weights from FIFO into systolic arraymatmul / convolve— execute matrix multiply (the main event)activate— run activation function (ReLU, sigmoid, etc.)write_host_memory— DMA results back to host
The execution model is coprocessor-style: the host CPU sends instructions over PCIe, the TPU executes them in order. No out-of-order execution, no speculation, no branch prediction. Deterministic dataflow.
the build: learn → build → deploy
The project has three phases:
learn: deconstruct the TPU v1 architecture from first principles and papers (Jouppi et al. 2017, Norrie & Patil et al. 2020). Understand every data path, every timing constraint, every design tradeoff.
build: implement the full RTL stack in Verilog/SystemVerilog from scratch. Systolic array, weight FIFO with staggered loading, unified buffer with double buffering, accumulator pipeline, activation units, host interface.
deploy: run live coprocessor-style inference on an FPGA with a custom compiler and drivers. Prove that the architecture works end-to-end.
why this matters now
The TPU v1 was designed in 2013-2015 for CNN inference. But the architectural patterns it established — systolic arrays, weight-stationary dataflow, on-chip SRAM staging, double-buffered accumulation — are exactly what Blackwell’s tensor cores, Groq’s LPU, and every other modern accelerator still uses.
Understanding the TPU v1 at the RTL level gives you a mental model for how all AI hardware works. The scale changed (256×256 → 64×64 warpgroup MMA). The precision changed (INT8 → FP8/FP4). The complexity changed (single-chip → multi-die CoWoS). But the fundamental dataflow — weights stationary, activations streaming, partial sums accumulating — hasn’t changed in a decade.
interesting reads
- Jouppi et al. (2017) — “In-Datacenter Performance Analysis of a Tensor Processing Unit” (ISCA)
- Norrie, Patil et al. (2020) — “The Design Process for Google’s Training Chips”
- H.T. Kung & Leiserson (1978) — “Systolic Arrays (for VLSI)” — the original CMU paper
- Bjarke Roune — “Designing AI Chip Software and Hardware” (2026)
- Cliff Young — HotChips 2025 TPU retrospective
- open-source TPU implementations — various FPGA implementations on GitHub
See also: Systolic Arrays, Roune: AI Chip Design, Breaking Down Blackwell (modern evolution of the same ideas)