The Physics of Intelligence: AI Hardware from Atoms to Architectures

Every few months someone publishes a new AI chip benchmark, a new TOPS number, a new training record. The numbers go up. The press releases get breathless. And almost everyone misses the thing that actually matters.

Every innovation that matters (systolic arrays, HBM, NVLink, block floating point, sparsity, KV-cache compression, SSMs) is about moving less data or moving it shorter distances.

Not faster math. Not more transistors. Just: move fewer bits, and move them across shorter wires. Once you internalize this, the entire AI hardware landscape snaps into focus.

Total Energy = sum(data movement) + epsilon(compute)

The multiply is essentially free. The wire to the multiply is where your power budget actually goes. This is the single most important idea in AI hardware, and everything in this article follows from it.

part I: energy limits

the energy stack (FP8 FMA at 5nm)

Let’s start at the bottom — a single FP8 fused multiply-add on a 5nm process node. Where does the energy actually go?

Component                Energy      % of Total
Sign XOR                 0.05 fJ     0.1%
Exponent add             0.50 fJ     0.8%
4x4 mantissa multiply    1.70 fJ     2.7%
Normalize + round        0.75 fJ     1.2%
FP32 accumulator         10.10 fJ    16.0%
Clock + control          6.30 fJ     10.0%
Wire energy              13.70 fJ    21.7%
Register file R/W        30.00 fJ    47.5%
TOTAL                    63 fJ

Stare at this table. The actual multiply, the thing you think of as “compute”, costs 1.7 fJ. That’s 2.7% of the total. The register file read/write alone is 30 fJ, nearly half the energy budget. Add in wire energy and you’re at 69% of total cost just moving data to the multiplier and back; add clock and control and it’s 79%.

This is the Horowitz insight made concrete at the gate level. The multiply is a rounding error. Everything else is logistics.
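If you want to check the percentages yourself, here is a minimal sketch that recomputes them from the table. Which rows count as "data movement" is my own grouping (an assumption, not something the table labels), and the names `ENERGY_FJ` and `share` are made up for illustration.

```python
# Energy per FP8 FMA at 5nm, in femtojoules, copied from the table above.
ENERGY_FJ = {
    "sign_xor": 0.05,
    "exponent_add": 0.50,
    "mantissa_multiply": 1.70,
    "normalize_round": 0.75,
    "fp32_accumulator": 10.10,
    "clock_control": 6.30,
    "wire": 13.70,
    "register_file": 30.00,
}

# Assumption: wires and the register file are the "data movement" rows.
DATA_MOVEMENT = ("wire", "register_file")


def share(keys, table=ENERGY_FJ):
    """Fraction of the total energy spent on the given components."""
    total = sum(table.values())
    return sum(table[k] for k in keys) / total


total_fj = sum(ENERGY_FJ.values())
multiply_share = share(["mantissa_multiply"])
movement_share = share(DATA_MOVEMENT)
movement_plus_clock = share(DATA_MOVEMENT + ("clock_control",))

print(f"total:                   {total_fj:.1f} fJ")
print(f"multiply:                {multiply_share:.1%}")
print(f"register file + wires:   {movement_share:.1%}")
print(f"... plus clock/control:  {movement_plus_clock:.1%}")
```

The multiply lands under 3%, the register file plus wires at about 69%, and roughly 79% once clock and control are included.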

industry state

So that’s the physics limit. How close are real chips?

Chip              Node    Chip-level fJ/FMA    Core-only fJ/FMA
H100              4N      710                  210-250
B200              4NP     220                  70-100
GB300 (proj.)     3nm     ~120                 ~40-60
TSMC test chip    N3E     35 (published)

Here’s what jumps out: the gap between the theoretical floor (~63 fJ) and a production tensor core (~70-100 fJ) is at most about 1.6x. That’s shockingly close to physics. NVIDIA’s tensor cores are not wasteful; they’re approaching the limits of what CMOS can do. The remaining gap is almost entirely clock distribution and control overhead, not bad engineering.

The bigger story is the chip-level gap. B200 at 220 fJ vs its core at 70-100 fJ means 50-70% of the chip’s energy goes to things outside the tensor core — memory controllers, NVLink I/O, the NoC, cache coherency. This is where the real optimization opportunities live.
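The "outside the core" share is just (chip - core) / chip. A small sketch using the B200 and H100 rows from the table; the function name is hypothetical.

```python
def outside_core_fraction(chip_fj, core_fj):
    """Share of chip-level energy per FMA that is spent outside the tensor core."""
    return (chip_fj - core_fj) / chip_fj


# B200: 220 fJ chip-level, 70-100 fJ core-only (from the table above).
b200_low = outside_core_fraction(220, 100)
b200_high = outside_core_fraction(220, 70)

# H100: 710 fJ chip-level, 210-250 fJ core-only.
h100_low = outside_core_fraction(710, 250)
h100_high = outside_core_fraction(710, 210)

print(f"B200: {b200_low:.0%} to {b200_high:.0%} of energy is outside the core")
print(f"H100: {h100_low:.0%} to {h100_high:.0%} of energy is outside the core")
```

For B200 this comes out at roughly 55-68%, which is where the 50-70% figure comes from.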

the hierarchy of limits

Here’s the hierarchy that determines what’s actually possible, and why each level matters:

Landauer (thermodynamics):    ~0.000003 fJ (irrelevant)
Wire-limited floor:           30-50 fJ   (the real wall)
Full-custom CMOS:             50-80 fJ
Production tensor core:       70-100 fJ
Full chip (B200):             220 fJ

You can ignore Landauer. The thermodynamic minimum for erasing a bit is kT ln 2, about 2.9e-21 J (roughly 0.000003 fJ) at room temperature. That’s on the order of ten million times below the wire-limited floor. No practical chip design will ever be constrained by Landauer’s principle; it’s like worrying about the speed of light when you’re stuck in traffic.

The wire-limited floor is the real wall. At 30-50 fJ, you’ve hit the fundamental cost of moving a signal across silicon at 5nm feature sizes. You cannot build wires that consume less energy without changing the laws of electromagnetism. This is the hard physics limit for digital CMOS AI compute.

Full-custom CMOS at 50-80 fJ is what you get when you strip away all the programmability of a GPU (no warp schedulers, no instruction decode, no general-purpose register files) and build a pure fixed-function datapath. This is the best a perfect ASIC could achieve. The gap between here and the wire-limited floor is layout overhead, clock trees, and power distribution.

Production tensor cores at 70-100 fJ are within spitting distance of full-custom. Which means: tensor cores are not the bottleneck. Everything around them is. The memory hierarchy, the interconnect, the scheduling logic: that’s where the 2-3x gap between core-level and chip-level efficiency lives.

We are limited by wires, not thermodynamics. This is the sentence you should remember from this entire article.
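To see how far below the wire floor thermodynamics really sits, you can compute the Landauer bound directly from kT ln 2. This is a sketch assuming T = 300 K for "room temperature"; the 30-50 fJ wire floor is taken from the hierarchy above.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K (exact SI value)
ROOM_TEMP_K = 300.0         # assumption: "room temperature"

# Landauer bound: minimum energy to erase one bit.
landauer_j = K_BOLTZMANN * ROOM_TEMP_K * math.log(2)
landauer_fj = landauer_j * 1e15  # 1 fJ = 1e-15 J

# Wire-limited floor from the hierarchy above, in fJ.
WIRE_FLOOR_FJ = (30.0, 50.0)
ratio_low = WIRE_FLOOR_FJ[0] / landauer_fj
ratio_high = WIRE_FLOOR_FJ[1] / landauer_fj

print(f"Landauer bound: {landauer_j:.3e} J = {landauer_fj:.2e} fJ")
print(f"Wire floor is {ratio_low:.1e}x to {ratio_high:.1e}x above it")
```

The bound comes out near 2.9e-21 J, so the wire floor sits about seven orders of magnitude above it.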

part II: number formats

If the wire is the enemy, then the most direct attack is to make the numbers smaller. Fewer bits means fewer wires switching, less energy per transfer, more operations per byte of bandwidth. The AI hardware industry has figured this out, and the result is an arms race in number format design.

MXFP (Microscaling) — the winner

MXFP4 achieves 0.03-0.05 pJ/MAC, roughly 10x cheaper than FP8 with ~1% accuracy loss. (That ratio implies an FP8 baseline of about 0.3-0.5 pJ/MAC, i.e. chip-level energy rather than the 63 fJ datapath floor from Part I.) It’s an OCP standard adopted by NVIDIA, AMD, Intel, Arm, and Qualcomm. The trick is block-shared exponents: a group of values share one exponent, so each individual value only needs a tiny mantissa. The hardware savings are enormous because the exponent storage and logic get amortized across the block, so each value moves far fewer bits.

If you’re designing inference hardware today, MXFP4 support is table stakes.
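Here is a rough sketch of the block-shared-exponent idea in plain Python. It follows the general shape of the OCP MX format (32-element blocks, an 8-bit power-of-two shared scale, 4-bit E2M1 elements), but the rounding and tie-breaking are my own simplification and the function names are made up, so treat it as an illustration rather than a conformant implementation.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is a separate bit).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2    # largest element exponent: 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32  # elements sharing one scale
SCALE_BITS = 8   # shared power-of-two scale


def mx_quantize_block(values):
    """Quantize one block to (shared power-of-two scale, list of FP4 values)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(values)
    # Pick the scale so the largest value lands in the top FP4 binade.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        magnitude = min(abs(v) / scale, FP4_E2M1_GRID[-1])  # saturate at 6.0
        # Simplification: nearest grid point, ties go to the smaller magnitude.
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - magnitude))
        elements.append(math.copysign(nearest, v))
    return scale, elements


def mx_dequantize_block(scale, elements):
    """Reconstruct approximate values from the shared scale and FP4 elements."""
    return [scale * e for e in elements]


scale, elements = mx_quantize_block([0.1, -0.4, 3.0, 0.75])
restored = mx_dequantize_block(scale, elements)
bits_per_value = (BLOCK_SIZE * 4 + SCALE_BITS) / BLOCK_SIZE

print(scale, elements, restored, bits_per_value)
```

The point to notice is the storage cost: 4 bits per element plus one shared 8-bit scale per 32 elements works out to 4.25 bits per value, and small values in a block with one large outlier (the 0.1 here) get flushed to zero.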

BitNet b1.58 — eliminates the multiplier

Ternary weights {-1, 0, +1}. At 3B parameters, BitNet matches FP16 Llama quality. The “MAC” operation degenerates into a conditional negate + add: ~0.02 pJ. That’s 15x cheaper than FP8.

Think about what this means architecturally. You don’t need a multiplier at all. The entire datapath simplifies to an adder with a sign flip. The silicon area savings are dramatic — no multiplier tree, no normalization, no rounding. The open question is whether ternary scaling laws hold past 7B parameters.
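The "conditional negate + add" can be sketched in a few lines. This is a software illustration of the datapath idea only, with a hypothetical function name; it says nothing about how BitNet kernels are actually implemented.

```python
def ternary_dot(weights, activations):
    """Dot product with weights restricted to {-1, 0, +1}.

    No multiply is ever issued: each weight selects add, subtract, or skip.
    """
    acc = 0
    for w, x in zip(weights, activations):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
        elif w != 0:
            raise ValueError(f"weight {w!r} is not ternary")
    return acc


result = ternary_dot([1, 0, -1, 1], [3, 5, 2, 4])  # 3 - 2 + 4
print(result)
```

Zero weights cost nothing at all here, which is the other half of the win: a ternary model gets sparsity for free.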

analog compute — dead for training

Every few years, someone rediscovers that physics does matrix multiplication “for free” in the analog domain. A crossbar array of resistive elements computes a matrix-vector product at the speed of light, using ~200x less energy than digital.

The problem is the ADC tax. Converting the analog result back to digital for the accumulator requires an analog-to-digital converter. And ADCs are power-hungry, area-hungry, and precision-limited. By the time you add the ADC, the system-level advantage is only 1-3x over digital. At 3nm, even that vanishes because digital scales with Moore’s Law and analog doesn’t.

Mythic, Luminous, Rain: all failed or pivoted. IBM is the last credible lab pursuing this, but the digital efficiency curves have already crossed. Analog compute for AI training is dead. For ultra-low-power edge inference with relaxed accuracy requirements, maybe. But that’s a niche, not a revolution.
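One way to see how a ~200x core advantage shrinks to 1-3x is a simple additive model: normalize a digital MAC to 1.0, charge the analog core 1/200 of that, and add the conversion overhead on top. The model and both function names are my own simplification, not something taken from a published analysis; it just back-solves how large the ADC-side cost must be to produce the system-level numbers quoted above.

```python
def system_gain(core_gain, adc_energy_rel):
    """System-level energy advantage of analog over digital.

    Energies are relative to one digital MAC (= 1.0). adc_energy_rel is the
    conversion and periphery energy per MAC in the same units.
    """
    analog_core = 1.0 / core_gain
    return 1.0 / (analog_core + adc_energy_rel)


def adc_budget_for(target_gain, core_gain):
    """Largest ADC-side energy (relative to a digital MAC) that still hits target_gain."""
    return 1.0 / target_gain - 1.0 / core_gain


CORE_GAIN = 200.0  # the ~200x crossbar advantage quoted above

budget_3x = adc_budget_for(3.0, CORE_GAIN)
budget_1x = adc_budget_for(1.0, CORE_GAIN)

print(f"ADC budget for a 3x system win: {budget_3x:.3f} of a digital MAC")
print(f"ADC budget for break-even:      {budget_1x:.3f} of a digital MAC")
```

Under this model, a 1-3x system advantage means conversion is eating somewhere between a third and nearly all of a digital MAC's energy, at which point the crossbar's own efficiency barely matters.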

part III: the memory wall

horowitz table (2025)

This is the most important table in all of AI hardware. Memorize these ratios.

Level                  Energy per Access
FP8 multiply           1.7 fJ
SRAM read (on-chip)    2-7 fJ
HBM3E read             2,500 fJ
DDR5 read              15,000 fJ
NAND read              20,000,000 fJ

One DRAM access costs the same energy as on the order of 1,000 multiplies (about 1,500 for HBM3E, nearly 9,000 for DDR5). One NAND read costs the same as 12 million multiplies. The hierarchy spans 7 orders of magnitude from the FP8 multiply to flash storage. If your data isn’t in SRAM, you’ve already lost the energy game by 1000x before a single multiply fires.

This is why every serious AI chip is, fundamentally, a memory management engine with some math bolted on. The compute is the easy part. Feeding it is the hard part.
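The ratios are easy to recompute from the table. A small sketch, with the SRAM range split into its low and high ends; the dictionary and function names are just for illustration.

```python
import math

# Energy per access in femtojoules, from the table above.
ACCESS_FJ = {
    "FP8 multiply": 1.7,
    "SRAM read (low)": 2.0,
    "SRAM read (high)": 7.0,
    "HBM3E read": 2_500.0,
    "DDR5 read": 15_000.0,
    "NAND read": 20_000_000.0,
}


def multiplies_per_access(level):
    """How many FP8 multiplies cost the same energy as one access at this level."""
    return ACCESS_FJ[level] / ACCESS_FJ["FP8 multiply"]


hbm_ratio = multiplies_per_access("HBM3E read")
ddr_ratio = multiplies_per_access("DDR5 read")
nand_ratio = multiplies_per_access("NAND read")
orders_of_magnitude = math.log10(nand_ratio)

for level in ACCESS_FJ:
    print(f"{level:18s} {multiplies_per_access(level):>14,.1f} multiplies")
print(f"span: {orders_of_magnitude:.1f} orders of magnitude")
```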

HBM evolution

Gen             BW/stack       Capacity    pJ/bit
HBM3E (2024)    1,180 GB/s     36 GB       ~2.5
HBM4 (2025)     ~2,000 GB/s    48 GB       ~2.0

8x HBM3E stacks on the B200 give you 9.4 TB/s. That bandwidth number used to be SRAM-only territory, the kind of thing you’d see inside a custom ASIC, not off a DRAM package. HBM has closed the bandwidth gap with SRAM by almost an order of magnitude over the last decade, while SRAM capacity remains stuck at tens of megabytes. This is why HBM exists: it’s not about capacity (DDR is cheaper per bit), it’s about bandwidth density close to the compute die.

The energy story is less impressive. HBM3E at ~2.5 pJ/bit is better than DDR5 (~15 pJ/bit), but it’s still ~1000x more expensive than an SRAM read. You’re buying bandwidth, not efficiency.
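Two numbers fall straight out of the HBM table: aggregate bandwidth, and what it costs in watts to actually use it. This sketch assumes decimal units (1 GB = 1e9 bytes) and all eight stacks saturated at once, which is a worst case rather than a typical workload; the function names are hypothetical.

```python
def aggregate_bandwidth_gbs(stacks, gbs_per_stack):
    """Total memory bandwidth in GB/s across all stacks."""
    return stacks * gbs_per_stack


def interface_power_watts(bandwidth_gbs, pj_per_bit):
    """Power spent moving data at the given bandwidth and energy per bit."""
    bits_per_second = bandwidth_gbs * 1e9 * 8
    return bits_per_second * pj_per_bit * 1e-12


# HBM3E row from the table: 1,180 GB/s per stack at ~2.5 pJ/bit, 8 stacks.
total_gbs = aggregate_bandwidth_gbs(8, 1180)
hbm_watts = interface_power_watts(total_gbs, 2.5)

print(f"aggregate bandwidth: {total_gbs / 1000:.2f} TB/s")
print(f"power at full bandwidth: {hbm_watts:.1f} W")
```

At full tilt that is close to 190 W spent purely on moving bits between HBM and the die, before any math happens.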

part IV: beyond attention

architecture comparison

The transformer has dominated for 7 years. But its O(n) KV cache during inference is a direct collision with the memory wall. Can we do better?

Architecture       Quality at Scale        Training Efficiency    Inference Memory
Transformers       Best                    Good                   O(n) KV cache
Mamba (SSM)        Slightly worse          Good                   O(1)
Hybrids (Jamba)    Matches transformers    Good                   O(1)-ish

No non-transformer has demonstrated clearly superior scaling laws at frontier scale. That’s the blunt truth, and it’s why transformers aren’t going anywhere soon. But the inference memory column tells a different story: SSMs keep O(1) memory during decode, which means they completely sidestep the KV cache problem. No cache to compress, no cache to page, no cache to spill to host memory. Hybrids keep a few attention layers, so they still carry a KV cache, just a much smaller one (the “O(1)-ish” in the table).

Hybrids like Jamba capture 90% of the efficiency gains while matching transformer quality. If you’re building inference hardware today, you should design for transformers but leave headroom for hybrid architectures. The memory subsystem that makes transformers fast (huge HBM, high bandwidth) becomes partially wasted if the industry shifts to O(1)-memory architectures. The safe bet is flexibility.
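To make O(n) versus O(1) concrete, here is a sketch comparing KV cache growth with a fixed-size recurrent state. Every model dimension below (32 layers, 32 KV heads, head dimension 128, FP16, an 8192 x 16 state per layer) is a hypothetical round number chosen for illustration, not the configuration of any particular model.

```python
def kv_cache_bytes(seq_len, layers=32, kv_heads=32, head_dim=128, bytes_per_value=2):
    """KV cache size for one sequence: keys and values for every layer and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len


def ssm_state_bytes(layers=32, d_inner=8192, d_state=16, bytes_per_value=2):
    """Recurrent state size for one sequence: independent of sequence length."""
    return layers * d_inner * d_state * bytes_per_value


MIB = 1024 ** 2
GIB = 1024 ** 3

for seq_len in (4_096, 32_768, 131_072):
    kv_gib = kv_cache_bytes(seq_len) / GIB
    ssm_mib = ssm_state_bytes() / MIB
    print(f"{seq_len:>7} tokens: KV cache {kv_gib:6.1f} GiB, SSM state {ssm_mib:.0f} MiB")
```

With these made-up dimensions the cache costs half a mebibyte per token, so a 128K context needs 64 GiB per sequence while the recurrent state stays at 8 MiB no matter how long the sequence gets.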

part V: compression limits

how small can a language model get?

Method               Bits/weight    Size (7B model)
FP16                 16             14 GB
INT4 (GPTQ)          4              3.5 GB
QuIP#                2              1.75 GB
BitNet b1.58         1.58           1.4 GB
Theoretical floor    ~1             ~100 MB for English fluency

The gap between theory (~100 MB for English fluency) and the best practical methods (~300 MB at sub-2-bit) is only about 3x. We’re within one order of magnitude of the information-theoretic limit. This tells you two things: first, current compression techniques are surprisingly good. Second, there isn’t much headroom left, maybe another 3x before we hit the wall.

The practical implication: if you need a 7B-class model to fit in roughly 1.5 GB, that’s achievable today with aggressive quantization. If you need it in 100 MB, you’re waiting for a research breakthrough. And if your use case requires quality that only 14 GB FP16 can deliver, no amount of compression will save you; you need bigger memory.
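The size column is just parameters times bits per weight, and you can run it backwards to ask how many bits per weight a memory budget allows. A sketch assuming decimal gigabytes and ignoring per-block scales, embeddings, and activations; the function names are made up.

```python
def model_size_gb(params, bits_per_weight):
    """Weight storage in decimal GB, ignoring scales and other metadata."""
    return params * bits_per_weight / 8 / 1e9


def bits_per_weight_for(params, budget_gb):
    """Bits per weight you can afford for a given storage budget."""
    return budget_gb * 1e9 * 8 / params


PARAMS_7B = 7e9

for method, bits in (("FP16", 16), ("INT4", 4), ("2-bit", 2), ("ternary", 1.58)):
    print(f"{method:8s} {model_size_gb(PARAMS_7B, bits):6.2f} GB")

bits_for_1gb = bits_per_weight_for(PARAMS_7B, 1.0)
bits_for_100mb = bits_per_weight_for(PARAMS_7B, 0.1)
print(f"1 GB budget:   {bits_for_1gb:.2f} bits/weight")
print(f"100 MB budget: {bits_for_100mb:.2f} bits/weight")
```

Run backwards, a 1 GB budget for 7B parameters allows about 1.14 bits per weight and a 100 MB budget about 0.11, which is why the smallest targets effectively mean fewer parameters and not just fewer bits.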

key takeaways for project proposals

  1. ChipletCostModel: use the fJ/FMA numbers to model energy per inference, not just cost. The 63 fJ theoretical floor vs 220 fJ chip-level gives you a 3.5x efficiency ceiling for any architecture
  2. RooflineVM: the memory wall data (Horowitz table) is the analytical backbone. The 1000x gap between SRAM and DRAM is the central constraint
  3. SystolicDiff: weight-stationary systolic arrays reduce register file cost from 30 fJ to 3 fJ (10x). This is the single biggest energy win available in datapath design
  4. InferBench: use MXFP4/BitNet energy numbers for architecture comparison. MXFP4 at 0.03 pJ/MAC is the new baseline
  5. RoboEdge: on-chip SRAM at 2-7 fJ vs DRAM at 2,500 fJ = 400-1000x. This is why weight-stationary with large scratchpad wins for diffusion policies
  6. DiffusionASIC: MXFP4 at 0.03 pJ/MAC for the denoising network could enable <1W diffusion inference at the edge
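As a starting point for the energy-per-inference modeling mentioned in the list, here is a back-of-envelope sketch. It assumes one FMA per parameter per generated token, a common approximation for dense decoders that ignores attention over the context, and it applies the fJ/FMA figures from Part I unchanged; the function name is hypothetical.

```python
def energy_per_token_mj(params, fj_per_fma):
    """Energy per generated token in millijoules.

    Assumption: one FMA per parameter per token (dense model, attention ignored).
    """
    joules = params * fj_per_fma * 1e-15
    return joules * 1e3


PARAMS_7B = 7e9
CHIP_LEVEL_FJ = 220.0  # B200 chip-level figure from Part I
FLOOR_FJ = 63.0        # theoretical FP8 FMA floor from Part I

chip_mj = energy_per_token_mj(PARAMS_7B, CHIP_LEVEL_FJ)
floor_mj = energy_per_token_mj(PARAMS_7B, FLOOR_FJ)
ceiling = chip_mj / floor_mj

print(f"chip-level: {chip_mj:.2f} mJ/token")
print(f"floor:      {floor_mj:.3f} mJ/token")
print(f"efficiency ceiling: {ceiling:.2f}x")
```

For a 7B model that is about 1.5 mJ per token at chip level against about 0.44 mJ at the floor, the same 3.5x ceiling quoted in item 1.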

interesting reads

  • Mark Horowitz, “Computing’s Energy Problem (and what we can do about it)” (ISSCC 2014) — the foundational Stanford paper establishing that data movement dominates compute energy cost. The “1 DRAM access = 1000 FMAs” relationship cited throughout this article traces directly back to Horowitz’s energy table. If you read one paper on why memory walls exist, read this one.
  • Bjarke Roune, “Designing AI Chip Software and Hardware” (2026) — the definitive guide on systolic array design, managed aggregation, and the mono-sized problem. Roune was the software lead for TPUv3 at Google and writes with the rare combination of theoretical depth and production experience. Essential context for the energy hierarchy discussion above.
  • SemiAnalysis, “NVIDIA Blackwell” series: Dylan Patel and team’s multi-part deep dives on Blackwell silicon, CoWoS-L packaging, HBM3e economics, and NVL72 rack-level analysis. The best external coverage of NVIDIA’s supply chain and cost structure. https://semianalysis.com
  • Simon Boehm, “How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance”: step-by-step walkthrough of tiling, shared memory, and register-level optimizations for matrix multiplication. Essential for understanding why the gap between core-level and chip-level efficiency exists. https://siboehm.com/articles/22/CUDA-MMM
  • Microsoft Research, “MXFP: Microscaling Data Formats for Deep Learning” (2023) — the OCP standard for block floating point. If MXFP4 at 0.03 pJ/MAC is the future of inference, this is the spec that defines it.
  • Microsoft Research, “BitNet b1.58” (2024) — ternary weight networks that eliminate the multiplier entirely. The energy implications (~0.02 pJ/MAC, 15x cheaper than FP8) are profound if the scaling laws hold past current model sizes.
  • Sze et al., “Efficient Processing of Deep Neural Networks” (MIT, 2020) — comprehensive survey of energy-efficient DNN accelerator design. Covers the dataflow taxonomy (weight-stationary, output-stationary, row-stationary) and the energy analysis framework used to derive the numbers in Part I.
  • Tri Dao et al., “Flash Attention” — the IO-aware exact attention algorithm that restructures computation to minimize HBM reads/writes. The canonical example of algorithm-level optimization driven by the memory wall. If the Horowitz table is the “why,” Flash Attention is the “what you do about it.”
  • Albert Gu and Tri Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” (2023) — the SSM architecture that achieves O(1) inference memory. Directly relevant to the “beyond attention” discussion and the question of whether future hardware should optimize for KV caches at all.

See also: Breaking Down Blackwell, Systolic Arrays, Roune: AI Chip Design, DIY TPU v1, TSMC N2 Economics, Thunder Kittens CUDA

This article draws from my AI Hardware Deep Dive keynote and companion decks. The core insight, that data movement dominates compute cost by 10-1000x at every level of the hierarchy, is not original to me; it traces back to Horowitz (2014) and has been elaborated by Roune (2026), Sze (2020), and others. What I’ve tried to do here is connect the physics to the architecture to the product decisions, so you can see why the industry builds what it builds.