DIY TPU v1: Reverse-Engineering Google’s First AI Chip

In december 2025, I set out to reverse-engineer google’s TPU v1 from first principles — the 256×256 INT8 systolic array that delivered 92 TOPS in 40W and kickstarted the entire AI accelerator market. The goal: deconstruct the architecture from papers, implement the full RTL stack in Verilog/SystemVerilog from scratch, and run live inference on an FPGA.

This is what I learned.

why a TPU?

The TPU v1 matters beyond historical interest. It established the template that every modern AI accelerator follows:

15-30x faster than contemporary Kepler GPUs and Haswell CPUs at inference (Jouppi et al., 2017)
set the precedent for inference + training in the CNN era that parallels the Transformer era at far greater scale
systolic array dataflow lessons echo in today’s chips: Furiosa RNGD, AWS Trainium, SambaNova SN40L
the memory-bound + networking-bound principles still hold — the problems haven’t changed, only the scale

After chatting with Cliff Young and Richard at HotChips 2025, the base case became clear: understanding TPU v1 deeply lets you form inductive hypotheses about inference architecture that generalize forward.

dennard bottlenecks

Two walls define the design space:

memory wall — the processor-memory performance gap has been widening since 1980. Prefill and decode are both memory-bandwidth-limited at the workload level. MoE models make this worse with their explosion of active parameters.

networking wall — moving information between chips is expensive. Inter-node communication is the second bottleneck after memory.

the question: CPUs are 100x more energy-expensive moving data than computing. Now what?

the answer: specialized silicon. Strip out everything that isn’t matrix multiplication.

CPU vs GPU vs TPU

	CPU (General Purpose)	GPU (SIMD)	TPU v1
strength	logic, branching, OS tasks	parallel tasks, graphics	ONE thing: matrix multiply
flexibility	high	medium	minimal
arithmetic density	low	medium	maximum

The TPU v1 traded all flexibility for arithmetic density. No branch predictor, no cache hierarchy, no speculative execution. Just a 256×256 array of multiply-accumulate cells fed by a weight FIFO and a unified buffer.

what’s in a TPU v1?

Pure-play inference. By silicon area, two components dominate:

Matrix Multiply Unit (MXU): 256×256 = 65,536 INT8 MACs. 24% of chip area. does 64K multiply-accumulates per cycle.
Unified Buffer: 96K×256×8b = 24 MiB of local activation storage. 29% of chip area. this is where input activations live — the SRAM that feeds the systolic array.

The rest: accumulators (4K×256×32b = 4 MiB, 6%), activation pipeline (6%), host interface (2%), control (2%), PCIe interface (3%), DDR3 ports (6%), misc I/O (1%).

the dataflow: how data actually moves

Here’s the TPU floorplan in terms of data movement:

PCIe Gen3 x16 ←→ Host Interface ←→ DDR3-2133 (30 GiB/s)
     ↓ (14 GiB/s)              ↓ (30 GiB/s)
   Unified Buffer ←→ Systolic Data Setup → Matrix Multiply Unit
   (10 GiB/s)        (167 GiB/s,           (64K MACs/cycle)
                       Jouppi et al.)
                                                    ↓
                                              Accumulators
                                                    ↓
                                           Activation Pipeline
                                                    ↓
                                            Normalize / Pool

The key insight: 167 GiB/s (per Jouppi et al.) flows from the systolic data setup into the MXU, but only 30 GiB/s comes from DDR3. The unified buffer and weight FIFO exist to bridge this 5.5x bandwidth gap. Without on-chip SRAM staging, the systolic array would starve.

systolic array dataflow

The MXU is a weight-stationary 256×256 systolic array. Here’s how a single MAC cell works:

Weight (from FIFO)
       ↓
Input → [MAC] → Next Input
       ↓
    Y_out = W×X + Y_in

weights flow vertically — loaded from the weight FIFO, one column at a time
activations flow horizontally — streamed from the unified buffer, left to right
partial sums accumulate vertically — each column produces one output element
all PEs compute simultaneously when data is available — N×N array does N² computations at full saturation

At cycle 0: only PE[0,0] is active. By cycle 255: all 65,536 PEs are computing. By cycle 510: the last result drains out. The diagonal wavefront — activations arriving staggered in time — is the fundamental rhythm of every systolic array.

staggered weight-loading: the weight FIFO

This is where timing gets tricky. Weights must arrive at exactly the cycle each PE expects them:

for a 2×2 example: W00 first, then W01, then W10, then W11 — column 1 is delayed by 1 cycle relative to column 0
a reset cycle is needed to flush weights already being populated
synchronization between host control signals and staggered loading timing is the hardest part of the control logic

The weight FIFO absorbs the DDR3→MXU bandwidth mismatch: DDR3 delivers at 30 GiB/s, but the array can consume weights at the systolic clock rate. The FIFO decouples these two domains.

accumulator to unified buffer: double buffering

The accumulators hold 4K×256 entries of 32-bit partial sums. Results flow back to the unified buffer via double buffering:

phase 1: systolic array writes to Buffer A while Buffer B is being read by the activation pipeline
phase 2: swap — Buffer A is read, Buffer B is written
swap happens every clock cycle, fully pipelined

This double-buffering pattern is the same concept that Thunder Kittens implements in software on modern GPUs with cp.async — load the next tile while computing the current one. The TPU v1 did it in hardware, 8 years earlier.

control flow: the instruction set

The TPU v1 has a simple CISC-like ISA with ~12 instructions:

read_host_memory — DMA weights from host DDR3 to weight FIFO
read_weights — load weights from FIFO into systolic array
matmul / convolve — execute matrix multiply (the main event)
activate — run activation function (ReLU, sigmoid, etc.)
write_host_memory — DMA results back to host

The execution model is coprocessor-style: the host CPU sends instructions over PCIe, the TPU executes them in order. No out-of-order execution, no speculation, no branch prediction. Deterministic dataflow.

the build: learn → build → deploy

The project has three phases:

learn: deconstruct the TPU v1 architecture from first principles and papers (Jouppi et al. 2017, Norrie & Patil et al. 2020). Understand every data path, every timing constraint, every design tradeoff.

build: implement the full RTL stack in Verilog/SystemVerilog from scratch. Systolic array, weight FIFO with staggered loading, unified buffer with double buffering, accumulator pipeline, activation units, host interface.

deploy: run live coprocessor-style inference on an FPGA with a custom compiler and drivers. Prove that the architecture works end-to-end.

why this matters now

The TPU v1 was designed in 2013-2015 for CNN inference. But the architectural patterns it established — systolic arrays, weight-stationary dataflow, on-chip SRAM staging, double-buffered accumulation — are exactly what Blackwell’s tensor cores, Groq’s LPU, and every other modern accelerator still uses.

Understanding the TPU v1 at the RTL level gives you a mental model for how all AI hardware works. The scale changed (256×256 → 64×64 warpgroup MMA). The precision changed (INT8 → FP8/FP4). The complexity changed (single-chip → multi-die CoWoS). But the fundamental dataflow — weights stationary, activations streaming, partial sums accumulating — hasn’t changed in a decade.

interesting reads

Jouppi et al. (2017) — “In-Datacenter Performance Analysis of a Tensor Processing Unit” (ISCA)
Norrie, Patil et al. (2020) — “The Design Process for Google’s Training Chips”
H.T. Kung & Leiserson (1978) — “Systolic Arrays (for VLSI)” — the original CMU paper
Bjarke Roune — “Designing AI Chip Software and Hardware” (2026)
Cliff Young — HotChips 2025 TPU retrospective
open-source TPU implementations — various FPGA implementations on GitHub

See also: Systolic Arrays, Roune: AI Chip Design, Breaking Down Blackwell (modern evolution of the same ideas)

Alan's PKB

Explorer

DIY TPU v1: Reverse-Engineering Google's First AI Chip

DIY TPU v1: Reverse-Engineering Google’s First AI Chip

Graph View

Backlinks