DIY TPU v1: Reverse-Engineering Google’s First AI Chip

In december 2025, I set out to reverse-engineer google’s TPU v1 from first principles — the 256×256 INT8 systolic array that delivered 92 TOPS in 40W and kickstarted the entire AI accelerator market. The goal: deconstruct the architecture from papers, implement the full RTL stack in Verilog/SystemVerilog from scratch, and run live inference on an FPGA.

This is what I learned.

why a TPU?

The TPU v1 matters beyond historical interest. It established the template that every modern AI accelerator follows:

  • 15-30x faster than contemporary Kepler GPUs and Haswell CPUs at inference (Jouppi et al., 2017)
  • set the precedent for inference + training in the CNN era that parallels the Transformer era at far greater scale
  • systolic array dataflow lessons echo in today’s chips: Furiosa RNGD, AWS Trainium, SambaNova SN40L
  • the memory-bound + networking-bound principles still hold — the problems haven’t changed, only the scale

After chatting with Cliff Young and Richard at HotChips 2025, the base case became clear: understanding TPU v1 deeply lets you form inductive hypotheses about inference architecture that generalize forward.

dennard bottlenecks

Two walls define the design space:

memory wall — the processor-memory performance gap has been widening since 1980. Prefill and decode are both memory-bandwidth-limited at the workload level. MoE models make this worse with their explosion of active parameters.

networking wall — moving information between chips is expensive. Inter-node communication is the second bottleneck after memory.

the question: CPUs are 100x more energy-expensive moving data than computing. Now what?

the answer: specialized silicon. Strip out everything that isn’t matrix multiplication.

CPU vs GPU vs TPU

CPU (General Purpose)GPU (SIMD)TPU v1
strengthlogic, branching, OS tasksparallel tasks, graphicsONE thing: matrix multiply
flexibilityhighmediumminimal
arithmetic densitylowmediummaximum

The TPU v1 traded all flexibility for arithmetic density. No branch predictor, no cache hierarchy, no speculative execution. Just a 256×256 array of multiply-accumulate cells fed by a weight FIFO and a unified buffer.

what’s in a TPU v1?

Pure-play inference. By silicon area, two components dominate:

  • Matrix Multiply Unit (MXU): 256×256 = 65,536 INT8 MACs. 24% of chip area. does 64K multiply-accumulates per cycle.
  • Unified Buffer: 96K×256×8b = 24 MiB of local activation storage. 29% of chip area. this is where input activations live — the SRAM that feeds the systolic array.

The rest: accumulators (4K×256×32b = 4 MiB, 6%), activation pipeline (6%), host interface (2%), control (2%), PCIe interface (3%), DDR3 ports (6%), misc I/O (1%).

the dataflow: how data actually moves

Here’s the TPU floorplan in terms of data movement:

PCIe Gen3 x16 ←→ Host Interface ←→ DDR3-2133 (30 GiB/s)
     ↓ (14 GiB/s)              ↓ (30 GiB/s)
   Unified Buffer ←→ Systolic Data Setup → Matrix Multiply Unit
   (10 GiB/s)        (167 GiB/s,           (64K MACs/cycle)
                       Jouppi et al.)
                                                    ↓
                                              Accumulators
                                                    ↓
                                           Activation Pipeline
                                                    ↓
                                            Normalize / Pool

The key insight: 167 GiB/s (per Jouppi et al.) flows from the systolic data setup into the MXU, but only 30 GiB/s comes from DDR3. The unified buffer and weight FIFO exist to bridge this 5.5x bandwidth gap. Without on-chip SRAM staging, the systolic array would starve.

systolic array dataflow

The MXU is a weight-stationary 256×256 systolic array. Here’s how a single MAC cell works:

Weight (from FIFO)
       ↓
Input → [MAC] → Next Input
       ↓
    Y_out = W×X + Y_in
  • weights flow vertically — loaded from the weight FIFO, one column at a time
  • activations flow horizontally — streamed from the unified buffer, left to right
  • partial sums accumulate vertically — each column produces one output element
  • all PEs compute simultaneously when data is available — N×N array does N² computations at full saturation

At cycle 0: only PE[0,0] is active. By cycle 255: all 65,536 PEs are computing. By cycle 510: the last result drains out. The diagonal wavefront — activations arriving staggered in time — is the fundamental rhythm of every systolic array.

staggered weight-loading: the weight FIFO

This is where timing gets tricky. Weights must arrive at exactly the cycle each PE expects them:

  • for a 2×2 example: W00 first, then W01, then W10, then W11 — column 1 is delayed by 1 cycle relative to column 0
  • a reset cycle is needed to flush weights already being populated
  • synchronization between host control signals and staggered loading timing is the hardest part of the control logic

The weight FIFO absorbs the DDR3→MXU bandwidth mismatch: DDR3 delivers at 30 GiB/s, but the array can consume weights at the systolic clock rate. The FIFO decouples these two domains.

accumulator to unified buffer: double buffering

The accumulators hold 4K×256 entries of 32-bit partial sums. Results flow back to the unified buffer via double buffering:

  • phase 1: systolic array writes to Buffer A while Buffer B is being read by the activation pipeline
  • phase 2: swap — Buffer A is read, Buffer B is written
  • swap happens every clock cycle, fully pipelined

This double-buffering pattern is the same concept that Thunder Kittens implements in software on modern GPUs with cp.async — load the next tile while computing the current one. The TPU v1 did it in hardware, 8 years earlier.

control flow: the instruction set

The TPU v1 has a simple CISC-like ISA with ~12 instructions:

  1. read_host_memory — DMA weights from host DDR3 to weight FIFO
  2. read_weights — load weights from FIFO into systolic array
  3. matmul / convolve — execute matrix multiply (the main event)
  4. activate — run activation function (ReLU, sigmoid, etc.)
  5. write_host_memory — DMA results back to host

The execution model is coprocessor-style: the host CPU sends instructions over PCIe, the TPU executes them in order. No out-of-order execution, no speculation, no branch prediction. Deterministic dataflow.

the build: learn → build → deploy

The project has three phases:

learn: deconstruct the TPU v1 architecture from first principles and papers (Jouppi et al. 2017, Norrie & Patil et al. 2020). Understand every data path, every timing constraint, every design tradeoff.

build: implement the full RTL stack in Verilog/SystemVerilog from scratch. Systolic array, weight FIFO with staggered loading, unified buffer with double buffering, accumulator pipeline, activation units, host interface.

deploy: run live coprocessor-style inference on an FPGA with a custom compiler and drivers. Prove that the architecture works end-to-end.

why this matters now

The TPU v1 was designed in 2013-2015 for CNN inference. But the architectural patterns it established — systolic arrays, weight-stationary dataflow, on-chip SRAM staging, double-buffered accumulation — are exactly what Blackwell’s tensor cores, Groq’s LPU, and every other modern accelerator still uses.

Understanding the TPU v1 at the RTL level gives you a mental model for how all AI hardware works. The scale changed (256×256 → 64×64 warpgroup MMA). The precision changed (INT8 → FP8/FP4). The complexity changed (single-chip → multi-die CoWoS). But the fundamental dataflow — weights stationary, activations streaming, partial sums accumulating — hasn’t changed in a decade.

interesting reads

  • Jouppi et al. (2017) — “In-Datacenter Performance Analysis of a Tensor Processing Unit” (ISCA)
  • Norrie, Patil et al. (2020) — “The Design Process for Google’s Training Chips”
  • H.T. Kung & Leiserson (1978) — “Systolic Arrays (for VLSI)” — the original CMU paper
  • Bjarke Roune — “Designing AI Chip Software and Hardware” (2026)
  • Cliff Young — HotChips 2025 TPU retrospective
  • open-source TPU implementations — various FPGA implementations on GitHub

See also: Systolic Arrays, Roune: AI Chip Design, Breaking Down Blackwell (modern evolution of the same ideas)