PhysDiffuse-1: A Custom Chip for Physics Diffusion Models

Physics diffusion models are the most important workload that nobody has built a chip for.

GenCast generates a probabilistic 15-day global weather forecast in 8 minutes on a single TPU v5 — outperforming ECMWF ENS on 97.4% of 1,320 targets while consuming ~24,000x less energy. AlphaFold 3 denoises random atomic coordinates into protein-ligand 3D structures in minutes. RFdiffusion designs entirely new proteins that fold in the wet lab as predicted. Neural network force fields like MACE and Allegro run molecular dynamics at near-quantum accuracy on 100 million atoms.

These systems share the same mathematical backbone — score-based SDEs, Langevin dynamics, iterative denoising — but operate on physical quantities rather than pixel grids. That rewires every hardware assumption. Where VDX-1 optimizes for dense spatiotemporal attention on uniform grids, physics diffusion demands irregular graph operations on unstructured meshes, SE(3)-equivariant tensor products, FFT-based spectral convolutions, FP64 accumulation for energy conservation, and scatter-gather memory access that leaves GPUs at 10-30% utilization. The gap between what current hardware delivers and what the physics demands is the business case for PhysDiffuse-1.

What Makes Physics Diffusion Different from Image Diffusion

Aspect	Image/Video Diffusion	Physics Diffusion
Data space	Flat Euclidean pixel grids	Manifolds: SE(3), tori, spheres, periodic lattices
Quality metric	Perceptual (FID, CLIP)	Physical fidelity (PDE residual, conservation error)
Constraints	Soft (text prompts)	Hard (conservation laws, symmetries, boundary conditions)
Equivariance	Optional	Mandatory (SE(3), E(3), periodic-E(3))
Stochasticity	Diversity of outputs	Uncertainty quantification with calibrated ensembles
Architecture	U-Net, DiT on uniform grids	GNNs on meshes, equivariant transformers, FNO
Precision	FP16/BF16/FP8	Mixed FP64/FP32/FP16 cascade

Standard diffusion asks “does this look real?” Physics diffusion asks “does this obey the laws of nature?”

Five Bottlenecks GPUs Cannot Fix

1. Scatter-Gather at 10-30% GPU Throughput

Physics simulations operate on irregular domains: unstructured meshes for CFD, particle clouds for MD, icosahedral grids for weather. GNNs handle this via message passing over edges encoding physical connectivity. GraphCast uses 40,962 nodes and 327,660 edges with 16 message-passing rounds per forward pass. MeshGraphNets for CFD operate on 10K-600K node meshes.

The scatter-gather operations at the core of message passing achieve only 10-30% of peak GPU throughput. Irregular, pointer-chasing memory access defeats cache hierarchies and prefetchers. A GNN with 10x fewer FLOPs than a transformer can be slower in wall-clock time because the operations map so poorly to hardware.
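To make the access pattern concrete, here is a minimal NumPy sketch of one message-passing round on a toy graph. The array names (`senders`, `receivers`, `node_feats`) and the identity message function are illustrative assumptions, not code from GraphCast or MeshGraphNets.

```python
import numpy as np

# Toy graph: 4 nodes, 5 directed edges (sender -> receiver).
senders = np.array([0, 1, 2, 3, 0])
receivers = np.array([1, 2, 3, 0, 2])
node_feats = np.arange(8, dtype=np.float32).reshape(4, 2)  # 4 nodes x 2 features

# Gather: one indirect read per edge; the addresses depend on graph topology.
messages = node_feats[senders]

# Scatter-add: one indirect read-modify-write per edge into the receiver's row.
aggregated = np.zeros_like(node_feats)
np.add.at(aggregated, receivers, messages)
```

Both steps index memory through `senders` and `receivers`, so consecutive accesses land on unrelated rows. That is the pattern caches and prefetchers handle poorly.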

Dedicated accelerators show the gap is closeable: GNNIE achieved 699x over GPU through degree-aware caching; GCoD achieved 294x over GPU (and 7.8x over prior GNN accelerator HyGCN) through co-designed density-polarized engines.

2. Equivariant Operations 5-20x Slower Than Dense Layers

Physics respects symmetry. Equivariant neural networks (SE(3)/E(3) architectures like MACE, NequIP, Allegro, RFdiffusion’s SE(3)-Transformer) build rotational and translational symmetry into the network so it holds exactly rather than being learned from data augmentation.

The computational cost centers on Clebsch-Gordan tensor products between irreducible representations of SO(3). Naive CG tensor products scale O(L^6) for maximum angular momentum L. The eSCN breakthrough (Passaro & Zitnick, 2023) reduced this to O(L^3) — a ~216x reduction at L=6 (46,656 to 216). NVIDIA’s cuEquivariance delivers up to 7x end-to-end speedup (and up to 17x on individual symmetric contraction operations), and custom Triton kernels have shown further TFLOPS improvements through sparse parity re-indexing.

Even so, equivariant layers remain 5-20x more expensive per FLOP than standard dense layers. Spherical harmonics evaluation (Y_l^m) is a prerequisite for every message: at L=4-6, the (L+1)^2 components per interaction (25 to 49 values) involve nested polynomial recurrences that map poorly to SIMD hardware.
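The counts behind these scaling arguments are easy to check. The sketch below counts every order m for every degree l up to L, which gives (L+1)^2 components, and compares the O(L^6) and O(L^3) scalings with constants ignored. The function names are hypothetical.

```python
def num_sh_components(l_max: int) -> int:
    # Each degree l contributes 2l + 1 orders m; the total is (l_max + 1)^2.
    return sum(2 * l + 1 for l in range(l_max + 1))

def cg_cost_ratio(l_max: int) -> int:
    # Ratio of the O(L^6) naive scaling to the O(L^3) eSCN scaling, constants ignored.
    return l_max**6 // l_max**3

for l_max in (4, 6, 32):
    print(l_max, num_sh_components(l_max))
```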

3. Diffusion Iteration Tax: 20-50x Per Inference

Diffusion models generate samples through iterative denoising. GenCast uses the DPMSolver++2S sampler with 20 denoising steps (39 function evaluations) per 12-hour weather timestep. This iteration tax is the same structural cost that VDX-1 addresses for video generation, but physics diffusion layers additional constraints: equivariance, FP64 precision, and irregular graph topologies. A 15-day forecast requires 30 autoregressive steps times 39 evaluations = 1,170 total neural network forward passes. Compare this to GraphCast, which needs exactly 1 forward pass per 6-hour step — 40 passes for a 10-day forecast.

For a 50-member ensemble:

GraphCast + perturbation: 50 x 40 = 2,000 forward passes, but ensemble spread quality is inferior (no learned stochasticity)
GenCast: 50 x 1,170 = 58,500 forward passes, fully parallelizable, with calibrated physically meaningful spread
ECMWF ENS: 51 full IFS runs on a supercomputer, consuming hours and 271 MJ per forecast cycle
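The forward-pass counts above follow from three numbers in the text. A short sketch of the arithmetic, with hypothetical constant names (note that the two models are compared at different horizons, 15 days and 10 days, as in the text):

```python
STEPS_15_DAY = 30             # 12-hour autoregressive steps in a 15-day forecast
EVALS_PER_STEP = 39           # function evaluations per step (20 denoising steps)
GRAPHCAST_STEPS_10_DAY = 40   # 6-hour steps, one forward pass each
MEMBERS = 50

gencast_per_member = STEPS_15_DAY * EVALS_PER_STEP
gencast_ensemble = MEMBERS * gencast_per_member
graphcast_ensemble = MEMBERS * GRAPHCAST_STEPS_10_DAY
```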

The diffusion tax is real but the net savings over traditional simulation remain enormous (1,000x+). The question is whether specialized hardware can shrink the iteration cost to near-single-pass levels through architectural support for the denoising loop.

4. FP64 Strands 90%+ of Tensor Core Capacity

Molecular dynamics requires energy conservation over millions of timesteps. Floating-point errors in force computation accumulate and cause energy drift that renders simulation results physically meaningless. MACE defaults to FP64 training, with a documented ~2x speedup when switching to FP32 — but FP32 is inadequate for production long-timescale MD. The e3nn ecosystem recommends FP64 for any simulation where cumulative drift matters.

The GPU penalty for FP64 is severe:

GPU	FP64 TFLOPS	TF32 TFLOPS (w/ sparsity)	FP16 TFLOPS (w/ sparsity)	FP64:TF32 Ratio
A100 SXM	19.5	312	624	1:16
H100 SXM	34	989	1,979	1:29

On these peak figures, the H100 delivers about 29x less throughput in FP64 than in sparse TF32. Physics simulation that requires FP64 leaves more than 90% of tensor core capacity idle.
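The ratios can be reproduced directly from the peak figures in the table. This is only the arithmetic on vendor peak numbers, not a measurement, and the dictionary layout is an assumption for illustration.

```python
# Peak TFLOPS from the table above; the TF32 figures include sparsity.
peak = {"A100 SXM": (19.5, 312.0), "H100 SXM": (34.0, 989.0)}

ratios = {}
for name, (fp64, tf32) in peak.items():
    ratios[name] = tf32 / fp64                 # how many times lower the FP64 peak is
    fp64_share = 100.0 * fp64 / tf32           # FP64 peak as a percentage of TF32 peak
    print(f"{name}: 1:{ratios[name]:.0f} ratio, {fp64_share:.1f}% of TF32 peak")
```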

AI weather models sidestep FP64 because they predict independent forecast steps rather than time-integrated trajectories. But molecular dynamics, materials science, and fusion plasma cannot make this dodge — the precision penalty is a hard constraint.

5. Memory Wall at Physical Resolution

Physics demands resolution that dwarfs typical ML workloads. A global weather state at 0.25 degrees with 227 channels occupies ~940 MB in FP32. Scaling to storm-resolving 1 km resolution increases this to ~218 GB per atmospheric state, which requires multi-chip model parallelism for a single inference pass. Turbulence DNS at high Reynolds numbers demands grids of 4096^3 or larger, consuming ~550 GB per field variable in FP64 (4096^3 points x 8 bytes).

Molecular dynamics at the atomic scale is comparatively compact per particle, but the sequential integration loop (on the order of 10^12 femtosecond timesteps for a millisecond of simulated time) creates a latency-dominated memory access pattern that favors on-chip SRAM over off-chip HBM.

The Efficiency Gap: GPU Utilization on Physics Diffusion Workloads

Component	Effective utilization	Cause
Scatter-gather memory access	10-30%	Irregular pointer-chasing defeats cache hierarchies and prefetchers
Equivariant layers (CG tensor products)	5-20%	5-20x slower than equivalent dense layers; spherical harmonics and CG products map poorly to SIMD
FP64 compute throughput	3-7%	A100: 19.5 TF FP64 vs 312 TF TF32; H100: 34 TF FP64 vs 989 TF TF32 (1:29 ratio)
Tensor core active time	~50%	Dense layers hit tensor cores, but graph ops and special functions bypass them entirely
Overall effective utilization	~15-25%	75-85% of GPU silicon sits idle on physics diffusion workloads, which is the business case for PhysDiffuse-1

The Anton Precedent: What Purpose-Built Physics Hardware Can Achieve

D.E. Shaw Research’s Anton machines are the proof-of-concept that domain-specific silicon for physics simulation delivers 100-1000x over general-purpose hardware.

Three Generations of Co-Design

Generation	Year	Key Specs	Performance
Anton 1	2008	32 pipelined HTIS modules at 800 MHz, 3D torus with 607 Gbit/s bisection and 50 ns hop latency, Tensilica flex cores + 8 SIMD geometry cores	>17,000 ns/day for 23,558-atom protein; 50-100x over contemporary HPC
Anton 2	2014	Upgraded ASICs, four 512-node partitions, higher clock rates, larger on-chip memory	3-5x over Anton 1
Anton 3	~2021	New process node, redesigned compute pipelines, enhanced network	~10x over Anton 2 (per HPCA 2022 paper: “an order of magnitude”); routine multi-millisecond simulations

At ~10x over Anton 2, Anton 3 represents a cumulative ~30-50x improvement over the original Anton 1 (given Anton 2’s ~3-5x over Anton 1).

Why Anton Succeeded

Anton’s dominance stems from five co-design principles: (1) hardwiring the bottleneck — non-bonded pairwise forces consume >90% of MD compute, so the HTIS dedicates fixed-function silicon to this kernel at near-peak throughput; (2) minimizing communication latency — the 3D torus at 50 ns per-hop (versus microseconds on commodity networks) collapses the global-barrier-every-timestep penalty; (3) balancing compute and communication — the ASIC’s compute rate matches the network’s injection bandwidth, so neither starves; (4) exploiting domain structure — MD has a fixed algorithmic skeleton (neighbor list, force eval, integration, repeat) and Anton’s pipeline is tuned to exactly this loop; (5) achieving long timescales — ~1 microsecond wall clock per femtosecond timestep enabled millisecond-scale simulations that opened entirely new science.

The Critical Divergence

Classical MD is latency-bound and communication-bound: every timestep requires global synchronization, favoring low-latency interconnect and fixed-function pipelines. Diffusion-based prediction is throughput-bound and compute-bound: each denoising step is a self-contained neural network forward pass, favoring tensor parallelism and batch efficiency.

Neural network force fields for MD (MACE, NequIP, Allegro) sit at the intersection. Each MD timestep requires NN inference to compute forces, then classical integration. This hybrid regime — learned potentials driving long-timescale MD — is the workload most likely to benefit from a new class of domain-specific accelerator that fuses NN inference with physics integration logic.

Figure: Anton Legacy & the PhysDiffuse-1 Divergence (D.E. Shaw Research). Timeline of Anton 1, Anton 2, Anton 3, and the proposed PhysDiffuse-1 (neural + physics hybrid, GNN and diffusion silicon). Bar chart, cumulative speedup vs. contemporary GPU (log scale): Anton 1 ~75x, Anton 2 ~300x, Anton 3 ~750x, PhysDiffuse-1 7-14x (neural MD).

Anton: latency-bound (pairwise forces, global sync every timestep). PhysDiffuse-1: throughput-bound (neural network forward passes, batch-parallel denoising). Same physics, different computational regimes: different beasts entirely.

PhysDiffuse-1: Architecture Proposal

A 4 nm multi-tile ASIC with 256 Physics Processing Elements (PPEs) designed to address the five bottlenecks enumerated above.

Per-PPE Hardware

Subunit	Function	Specifications
Tensor Core Cluster	Dense matmul for denoising network layers	128 FP16 MACs + 16 FP64 MACs, mixed-precision accumulator (FP64 running sums with FP16 inputs)
Graph Engine	Hardware scatter-gather for GNN message passing	32-wide parallel gather unit, programmable aggregation (sum/max/mean), neighbor list buffer with degree-aware scheduling
Special Function Unit (SFU)	FFT butterflies, spherical harmonics, transcendentals	8 pipelined CORDIC-based lanes, hardwired radix-2/4/8 FFT, Y_l^m evaluator for l up to 32
Local SRAM	Graph neighborhoods, activations, partial sums	2 MB per PPE (512 MB total across chip)

Figure: Physics Processing Element, internal architecture (×256 PPEs on die). Blocks: Tensor Core Cluster (dense compute), Graph Engine (graph ops), Special Function Unit (special functions: FFT butterflies, Y_l^m harmonics, exp / erf / sin), and Local SRAM (on-chip memory, 2 MB).

Graph Engine: Closing the Scatter-Gather Gap

On GPUs, graph message passing achieves 10-30% of peak memory bandwidth due to irregular access patterns. PhysDiffuse-1’s Graph Engine attacks this directly:

32-Wide Gather Unit. Thirty-two parallel read ports into the Neighbor List Buffer allow a single PPE to fetch an entire neighborhood in one cycle for any vertex of degree up to 32. For physics graphs (molecular: k ≈ 20-50, mesh: k ≈ 3-20, icosahedral weather: k ≈ 6), this covers the majority of vertices without stalling.

Degree-Aware Scheduling. Real-world physics graphs follow power-law or bounded-degree distributions. The scheduler assigns high-degree nodes to multiple PPEs (vertex splitting) while packing multiple low-degree nodes into single PPE timeslots, achieving 70-80% PE utilization versus the 30-40% typical of naive row-parallel assignment on GPUs.

Programmable Aggregation. Sum, max, and mean reduction modes are configurable per layer, supporting the full range of GNN aggregation functions without software overhead.

Effective Bandwidth. For a graph with average degree 20, the combination of wide gather, on-chip neighbor list caching, and degree-aware scheduling delivers 16x effective bandwidth compared to GPU scatter-gather, translating irregular graph access into streaming on-chip reads.
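The degree-aware scheduling idea described above can be sketched in software as vertex splitting followed by bin packing into 32-lane slots. This is a toy illustration under stated assumptions: the function names, the first-fit-decreasing packing rule, and the example degree list are all hypothetical, and the utilization it prints is not a prediction for the hardware.

```python
LANES = 32  # gather width per PPE, from the text

def split_and_pack(degrees, lanes=LANES):
    """Return a list of slots; each slot is a list of (vertex, edge_count) chunks."""
    chunks = []
    for v, d in enumerate(degrees):
        # Vertex splitting: a high-degree vertex becomes several chunks of <= lanes edges.
        while d > 0:
            take = min(d, lanes)
            chunks.append((v, take))
            d -= take
    # First-fit decreasing: pack chunks into slots without exceeding the lane count.
    slots, free = [], []
    for v, n in sorted(chunks, key=lambda c: -c[1]):
        for i, f in enumerate(free):
            if f >= n:
                slots[i].append((v, n))
                free[i] -= n
                break
        else:
            slots.append([(v, n)])
            free.append(lanes - n)
    return slots

def utilization(slots, lanes=LANES):
    used = sum(n for slot in slots for _, n in slot)
    return used / (len(slots) * lanes)

degrees = [6] * 20 + [40, 70]                      # mostly low-degree, two hubs
packed = split_and_pack(degrees)

# Naive baseline: one vertex at a time, ceil(degree / lanes) slots each.
naive_slots = sum(-(-d // LANES) for d in degrees)
naive_utilization = sum(degrees) / (naive_slots * LANES)
print(f"naive {naive_utilization:.2f} vs packed {utilization(packed):.2f}")
```

The design choice illustrated here is that splitting bounds the work per slot while packing fills lanes that a one-vertex-per-slot schedule would leave empty.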

Flexible Precision Cascade

The physics workload spans four precision regimes simultaneously. PhysDiffuse-1 provides hardware support for all four:

Precision	Use Case	Hardware
FP64	Energy conservation in MD integration, force accumulation	16 FP64 MACs per PPE (4,096 total), FP64 accumulators
FP32	Intermediate calculations, training gradients	Shared with FP64 units (2x throughput in FP32 mode)
FP16/BF16	Bulk neural network operations (attention, MLP, convolution)	128 FP16 MACs per PPE (32,768 total), tensor core equivalent
INT8	Graph topology, index arithmetic, neighbor lists	Graph Engine integer paths

The mixed-precision accumulator maintains running sums in FP64 even when inputs arrive in FP16, enabling the “FP64 accumulation with FP16 compute” pattern that MD and equivariant networks require — avoiding the GPU’s all-or-nothing precision choice.
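A small numerical experiment shows why the accumulator width matters more than the input width. The sketch below adds the same FP16 inputs into an FP16 running sum and into an FP64 running sum. The input value and count are arbitrary choices for illustration.

```python
import numpy as np

n = 10_000
x = np.full(n, 0.001, dtype=np.float16)    # FP16 inputs

acc16 = np.float16(0.0)
acc64 = np.float64(0.0)
for v in x:
    acc16 = np.float16(acc16 + v)          # running sum rounded to FP16 every step
    acc64 += np.float64(v)                 # same FP16 inputs, FP64 running sum

exact = n * float(x[0])                    # reference value for the FP16-rounded input
print(float(acc16), float(acc64), exact)
```

The FP16 running sum stalls once each increment falls below half the spacing between representable values, while the FP64 sum tracks the reference closely. The same mechanism, at a much smaller scale, drives energy drift in long integrations.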

Stochastic rounding hardware (per-MAC LFSR) is included for FP16 training. The posit number format (16/32-bit) is supported for applications where dynamic range matters more than uniform precision.
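Stochastic rounding can be sketched in a few lines: round up with probability equal to the fractional distance to the next grid point, so the rounded value is correct on average. The function name is hypothetical, and NumPy's random generator stands in for the per-MAC LFSR, which is an assumption of this sketch.

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, rounding up with probability equal to the fractional part."""
    scaled = np.asarray(x, dtype=np.float64) / step
    floor = np.floor(scaled)
    round_up = rng.random(scaled.shape) < (scaled - floor)
    return (floor + round_up) * step

rng = np.random.default_rng(0)             # software stand-in for the hardware LFSR
x = np.full(100_000, 0.3)
rounded = stochastic_round(x, step=1.0, rng=rng)
nearest = np.round(x)                      # round-to-nearest maps every 0.3 to 0.0
print(rounded.mean(), nearest.mean())
```

Round-to-nearest loses the value entirely, while the stochastic version preserves it in expectation. That is the property that keeps small gradient updates from vanishing systematically.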

Precision Cascade: How Precision Flows Through the Pipeline (highest precision at the top, highest throughput at the bottom)

Precision	Quantities
FP64	PDE residuals, energy conservation, optimizer accumulation
FP32	Force calculations, edge features, normalization layers
FP16 / BF16	Bulk neural network: attention, MLPs, score function evaluation
INT8	Neighbor lists, graph topology indexing, cell-list lookups

Mixed-Precision Accumulator: FP16 multiply → FP64 accumulate, so running sums stay in FP64 even when inputs arrive in FP16.

FFT Butterfly Units

Fourier Neural Operators (FNO, FourCastNet’s AFNO) use FFT as their core spatial mixing, achieving O(N log N) global communication versus O(N^2) for self-attention.

PhysDiffuse-1 includes hardwired radix-2/4/8 butterfly units in each SFU:

256-point complex FFT in 32 cycles versus hundreds of cycles through general-purpose multiply-add
Batched 1D FFT for multi-channel spectral operations (the dominant pattern in AFNO: FFT per channel, learned spectral weighting, IFFT)
Real-valued optimization: physics fields are real, so the r2c/c2r transform saves half the computation and storage versus general complex FFT

For FourCastNet-class models operating on a 720 x 1,440 weather grid, the FFT butterfly units handle the spatial mixing phase entirely on-chip, eliminating the round-trip to HBM that dominates FFT latency on GPUs.
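The spectral-mixing pattern named above (FFT per channel, spectral weighting, inverse FFT) looks like this in NumPy. It is a simplified sketch: the weighting here is one complex factor per channel and mode, which is an assumption, whereas FNO and AFNO also mix channels in the spectral domain. The function name and array shapes are hypothetical.

```python
import numpy as np

def spectral_mix(field, weights):
    """field: (C, H, W) real array; weights: (C, H, W // 2 + 1) complex array."""
    spec = np.fft.rfft2(field, axes=(-2, -1))        # real-to-complex FFT per channel
    spec = spec * weights                            # per-mode spectral weighting
    return np.fft.irfft2(spec, s=field.shape[-2:], axes=(-2, -1))  # back to real space

rng = np.random.default_rng(0)
field = rng.standard_normal((4, 16, 32))
identity = np.ones((4, 16, 32 // 2 + 1), dtype=np.complex128)
out = spectral_mix(field, identity)
spec_shape = np.fft.rfft2(field).shape               # last axis holds W // 2 + 1 modes
```

The real-to-complex transform stores only `W // 2 + 1` modes along the last axis, which is where the roughly 2x saving for real-valued fields comes from.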

Spherical Harmonics Evaluator

SE(3)-equivariant networks compute spherical harmonics Y_l^m of relative displacement vectors as a prerequisite for every message. For low angular momentum (L=0-2), this is cheap. For the L=4-6 needed in high-accuracy molecular potentials, and the higher orders used in global weather models on the sphere, evaluation involves nested polynomial recurrences (associated Legendre functions, trigonometric scaling) that map poorly to SIMD hardware.

The SFU includes a dedicated Y_l^m evaluator:

Angular momentum up to L=32 (1,089 components), covering all current and foreseeable equivariant architectures
CORDIC-based evaluation of associated Legendre polynomials and trigonometric functions, avoiding large lookup tables, with precision set by the number of CORDIC iterations
Pipelined throughput: one complete set of harmonics (all m for a given l) per cycle at L=6, scaling linearly with L
Clebsch-Gordan coefficient cache: precomputed CG coefficients for common coupling paths stored in dedicated on-chip ROM, eliminating the repeated recomputation that dominates e3nn profiling
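The "nested polynomial recurrences" are the standard three-term recurrences for associated Legendre functions. The following is a plain software sketch of those recurrences (with the Condon-Shortley phase) and not a description of the proposed evaluator's internals. The function name and dictionary layout are hypothetical.

```python
import numpy as np

def assoc_legendre(l_max, x):
    """Return {(l, m): P_l^m(x)} for 0 <= m <= l <= l_max, Condon-Shortley phase included."""
    x = np.asarray(x, dtype=np.float64)
    s = np.sqrt(1.0 - x * x)
    P = {(0, 0): np.ones_like(x)}
    for m in range(1, l_max + 1):
        # Diagonal terms: P_m^m = -(2m - 1) * sqrt(1 - x^2) * P_{m-1}^{m-1}
        P[(m, m)] = -(2 * m - 1) * s * P[(m - 1, m - 1)]
    for m in range(l_max):
        # First off-diagonal: P_{m+1}^m = (2m + 1) * x * P_m^m
        P[(m + 1, m)] = (2 * m + 1) * x * P[(m, m)]
    for m in range(l_max + 1):
        for l in range(m + 2, l_max + 1):
            # Upward recurrence in l at fixed m; each value depends on the two below it.
            P[(l, m)] = ((2 * l - 1) * x * P[(l - 1, m)]
                         - (l + m - 1) * P[(l - 2, m)]) / (l - m)
    return P

x = np.array([0.3])
P = assoc_legendre(6, x)
```

Each entry depends on earlier entries in the same column, so the work forms a dependency chain rather than a wide data-parallel loop. That is one way to see why it maps poorly to SIMD lanes.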

3D Torus Interconnect

Inspired by Anton’s network co-design. Single-chip (256 PPEs): internal mesh, 6 links per PPE at 100 Gbit/s. Multi-chip (up to 512 chips): 8 x 8 x 8 torus, 6 inter-chip links at 200 Gbit/s each, 86 Tbit/s bisection bandwidth, 100 ns hop latency (2x Anton 1, but 10-50x below commodity InfiniBand).
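Addressing in a 3D torus is simple modular arithmetic. The sketch below lists the six neighbors of a node in the 8 x 8 x 8 multi-chip configuration and computes minimum hop counts. The coordinate convention and function names are assumptions for illustration.

```python
DIMS = (8, 8, 8)  # multi-chip torus shape from the text

def torus_neighbors(node, dims=DIMS):
    """Six neighbors of `node` = (x, y, z), with wrap-around in each dimension."""
    out = []
    for axis in range(3):
        for delta in (-1, 1):
            nbr = list(node)
            nbr[axis] = (nbr[axis] + delta) % dims[axis]
            out.append(tuple(nbr))
    return out

def hop_distance(a, b, dims=DIMS):
    """Minimum hop count between two nodes, taking the shorter way around each ring."""
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))

corner_neighbors = torus_neighbors((0, 0, 0))
max_hops = hop_distance((0, 0, 0), (4, 4, 4))
```

With wrap-around links the farthest node is 12 hops away, or about 1.2 microseconds at 100 ns per hop if serialization and contention are ignored.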

Hardware Halo Exchange Engine: dedicated DMA pushes boundary data to neighboring PPEs automatically when local computation completes — no software synchronization overhead. Hardware all-reduce: tree-based aggregation for energy totals, conservation checks, and FFT transposes, completing a 256-PPE all-reduce in ~50 cycles.

Figure: Interconnect Topology. 3D torus network, 4×4 slice (of 256 PPEs). Each node is one PPE with 6 links; wrap-around edges form the torus.

Interconnect Comparison

System	Topology	Characteristics
GPU cluster	PCIe / NVLink, tree / fat-tree	High hop latency
Anton 1	3D torus	50 ns/hop, 607 Gbit/s bisection, classical MD
PhysDiffuse-1 (proposed)	3D torus	100 ns/hop, 86 Tbit/s bisection (141x Anton 1), hop latency 10-50x below InfiniBand, NN + MD

Memory Hierarchy

Level	Capacity	Bandwidth	Latency	Purpose
Register file	32 KB per PPE	N/A	1 cycle	Tensor core operands
Local SRAM	2 MB per PPE (512 MB total)	4 TB/s aggregate	3 cycles	Graph neighborhoods, activations, NN weights for current layer
Shared L2	64 MB per tile (4 tiles, 256 MB total)	16 TB/s	10 cycles	Cross-PPE data sharing, FFT working space
HBM3e	128 GB	8 TB/s	~100 ns	Full model weights, atmospheric states, training data

The 512 MB of total on-chip SRAM is sized to hold the complete graph neighborhood data for a 40,000-node icosahedral weather mesh (GenCast-scale) or the neighbor lists for a 500,000-atom molecular system without spilling to HBM. This is critical because the scatter-gather bottleneck is fundamentally a memory-proximity problem: if the data is on-chip, irregular access patterns are fast; if it is in HBM, they are 30-100x slower.

Dataflow for a GenCast-Class Weather Model

A single 12-hour forecast step through PhysDiffuse-1:

Encode (lat-lon to icosahedral mesh): Bipartite graph message passing. The Graph Engine handles the irregular gather from ~1M lat-lon points to ~41K mesh nodes. SRAM holds the mesh graph; HBM streams the atmospheric state.
Process (16 graph transformer layers): Each layer runs message passing (Graph Engine, scatter-gather on ~330K edges) followed by dense transformation (Tensor Core Cluster, standard GEMM on node features). The Graph Engine and Tensor Cores operate in pipelined alternation, keeping both units saturated.
Decode (mesh back to lat-lon): Inverse of encode. Graph Engine scatters mesh predictions back to the full-resolution grid.
Denoising loop (39 evaluations per step): The full encode-process-decode pass repeats 39 times with different noise levels. Model weights remain pinned in SRAM/L2 across iterations, eliminating the HBM weight reload that dominates GPU denoising latency.
Autoregressive rollout (30 steps for 15-day forecast): The output of each step becomes the input to the next. The torus network handles multi-chip partitioning for ensemble parallelism.
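The loop nest implied by the steps above can be written down as a skeleton. This sketch only counts network evaluations. The `denoise` stub stands in for a full encode-process-decode pass, and the split into two evaluations per noise level except the last is an assumption chosen because it reproduces the 39 evaluations quoted for 20 denoising steps.

```python
NOISE_LEVELS = 20     # denoising steps per 12-hour forecast step (from the text)
ROLLOUT_STEPS = 30    # 12-hour steps in a 15-day forecast

def denoise(state, noise_level, counter):
    """Placeholder for one encode-process-decode pass of the network."""
    counter["calls"] += 1
    return state  # a real model would return a denoised estimate

def sample_step(prev_state, counter):
    x = prev_state  # a real sampler would start from noise conditioned on prev_state
    for i in range(NOISE_LEVELS):
        x = denoise(x, i, counter)               # first evaluation at this noise level
        if i < NOISE_LEVELS - 1:
            x = denoise(x, i + 0.5, counter)     # second-order correction, skipped on the last level
    return x

counter = {"calls": 0}
state = 0.0
for _ in range(ROLLOUT_STEPS):
    state = sample_step(state, counter)          # output feeds the next autoregressive step
```

The weights are identical across all inner-loop calls. That is what makes keeping them resident on-chip attractive.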

Dataflow for MACE-Class Molecular Dynamics

Each MD timestep runs without returning to the host: (1) the Graph Engine builds neighbor lists via cell-list partitioning with hardware periodic boundary conditions; (2) the SFU evaluates spherical harmonics for all neighbor pairs, CG tensor products execute on Tensor Cores (dense contraction) plus Graph Engine (sparse coupling), with the CG coefficient cache eliminating redundant lookups; (3) dense output MLP computes energy, forces derived via autodiff backward pass; (4) FP64 MACs handle Velocity Verlet integration; (5) the Halo Exchange Engine pushes boundary atoms to neighboring PPEs.
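Step (4) is the classical part of the loop. A minimal velocity Verlet integrator in FP64 is sketched below, with a harmonic spring standing in for the learned potential's force call. The toy potential, step size, and function names are assumptions for illustration.

```python
import numpy as np

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate in FP64; `force` is a stand-in for the learned potential's force call."""
    x = np.asarray(x, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass     # half kick
        x = x + dt * v_half                  # drift
        f = force(x)                         # one force evaluation per timestep
        v = v_half + 0.5 * dt * f / mass     # second half kick
    return x, v

def harmonic_force(x, k=1.0):
    return -k * x                            # toy potential, not a neural network

def total_energy(x, v, mass=1.0, k=1.0):
    return float(0.5 * mass * np.sum(v**2) + 0.5 * k * np.sum(x**2))

x0, v0 = np.array([1.0]), np.array([0.0])
e0 = total_energy(x0, v0)
x1, v1 = velocity_verlet(x0, v0, harmonic_force, mass=1.0, dt=0.01, n_steps=10_000)
drift = abs(total_energy(x1, v1) - e0) / e0
```

The integrator needs exactly one force evaluation per step, so in the hybrid regime the neural network call dominates and the FP64 update is a small, latency-sensitive tail.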

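For a 50,000-atom protein system, the target is ~10 microseconds of wall-clock time per femtosecond timestep. At a 1 fs timestep that is about 8.6 microseconds of simulated time per day: roughly half the >17,000 ns/day that Anton 1 reached on a 23,558-atom protein with classical force fields, but with near-DFT accuracy from the learned potential. Microsecond-scale trajectories would complete in hours; a full millisecond would still take months on a single system.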
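Converting a per-timestep wall-clock target into simulated time per day is a short calculation. The sketch below assumes a 1 fs timestep by default and ignores any overhead outside the timestep loop. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

def simulated_ns_per_day(wallclock_us_per_step, timestep_fs=1.0):
    steps_per_day = SECONDS_PER_DAY / (wallclock_us_per_step * 1e-6)
    return steps_per_day * timestep_fs * 1e-6          # femtoseconds -> nanoseconds

def days_for(target_ms, wallclock_us_per_step, timestep_fs=1.0):
    target_ns = target_ms * 1e6                        # milliseconds -> nanoseconds
    return target_ns / simulated_ns_per_day(wallclock_us_per_step, timestep_fs)

print(simulated_ns_per_day(10.0))    # simulated ns/day at 10 us of wall clock per step
print(days_for(1.0, 10.0))           # days of wall clock for 1 ms of simulated time
```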

Performance Estimates

Speedup Over H100

For a typical physics diffusion forward pass with the following compute breakdown (derived from profiling GenCast and MACE on A100/H100):

Phase	% of GPU Time	PhysDiffuse-1 Speedup	Rationale
Message passing (scatter-gather)	40%	10-16x	Hardware Graph Engine vs. GPU indirect memory access
Dense linear layers	30%	1-1.5x	Comparable tensor core throughput
FFT / spectral ops	20%	3-5x	Hardwired butterfly vs. general-purpose cuFFT
Special functions (Y_l^m, exp, erf)	10%	5-10x	CORDIC SFU vs. GPU transcendental library

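Weighted end-to-end target: 4-6x over H100 on physics diffusion inference. A strict Amdahl's-law combination of the rows above (time fraction divided by per-phase speedup) gives roughly 2.3-3.6x, because the 30% dense-layer share is barely accelerated. Reaching 4-6x therefore depends on gains beyond the per-phase figures, such as the pipelined overlap of Graph Engine and Tensor Core work described in the weather dataflow.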

For FP64-dominant workloads (molecular dynamics with neural potentials):

Phase	% of GPU Time	PhysDiffuse-1 Speedup	Rationale
Equivariant tensor products	50%	8-15x	CG cache + Graph Engine + dedicated FP64 MACs
Force accumulation (FP64)	20%	10-20x	FP64 MACs at full throughput vs. 1:29 ratio on H100
Neighbor list / integration	20%	3-5x	Hardware neighbor list builder + FP64 integrator
Communication (halo exchange)	10%	5-20x	Hardware halo exchange vs. software MPI

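Weighted end-to-end target: 7-14x over H100 on neural network molecular dynamics. An Amdahl's-law combination of these rows (time fraction divided by per-phase speedup) gives roughly 6-11x.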
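A minimal sketch of how per-phase speedups combine under Amdahl's law, using the rows of the two tables above. It treats the "% of GPU Time" column as non-overlapping time fractions, which is an assumption. Any overlap between phases on the new hardware would raise the result. The function names are hypothetical.

```python
def amdahl(phases):
    """phases: list of (time_fraction, speedup). Returns the end-to-end speedup."""
    assert abs(sum(f for f, _ in phases) - 1.0) < 1e-9
    return 1.0 / sum(f / s for f, s in phases)

# (fraction of GPU time, low speedup, high speedup), copied from the two tables.
diffusion = [(0.40, 10, 16), (0.30, 1.0, 1.5), (0.20, 3, 5), (0.10, 5, 10)]
neural_md = [(0.50, 8, 15), (0.20, 10, 20), (0.20, 3, 5), (0.10, 5, 20)]

def bounds(rows):
    low = amdahl([(f, lo) for f, lo, _ in rows])
    high = amdahl([(f, hi) for f, _, hi in rows])
    return low, high

diffusion_low, diffusion_high = bounds(diffusion)
md_low, md_high = bounds(neural_md)
print(f"diffusion: {diffusion_low:.1f}-{diffusion_high:.1f}x, MD: {md_low:.1f}-{md_high:.1f}x")
```

Under this rule the least-accelerated phase dominates. In the diffusion table the dense layers alone cap the end-to-end gain at about 3.3x if they see no speedup.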

Absolute Performance Targets

Workload	H100 Performance	PhysDiffuse-1 Target
GenCast 15-day forecast (single member)	~4-5 min (estimated from TPU v5 at 8 min)	<1 min
GenCast 50-member ensemble	~4 hours (sequential on 1 GPU)	~45 min (sequential) or ~1 min (50-chip parallel)
MACE MD timestep (50K atoms, L=6)	~100 microseconds	~10 microseconds
FNO inference (2D Navier-Stokes, 256x256)	~5 ms	~1 ms
eSCN single-point energy (OC-20 catalyst)	~50 ms	~5 ms

Precision Engineering: Mixed FP64/FP16 Cascades

The precision landscape in physics ML is more nuanced than the binary FP64-vs-FP16 framing suggests. The optimal strategy is a cascade: FP64 for quantities where error accumulates (energy sums, force accumulation, symplectic integration), FP32 for intermediate NN computations where model approximation dominates roundoff, and FP16/BF16 for bulk tensor operations where throughput matters most. PhysDiffuse-1’s mixed-precision accumulator implements this cascade in hardware.

Supporting techniques built into the SFU: stochastic rounding (per-MAC LFSR randomness prevents systematic bias when FP16 gradients fall below epsilon), CORDIC-based transcendentals (shifts-and-adds for exp, sin, cos, erf — higher precision and lower power than lookup tables, especially for the associated Legendre recurrence in spherical harmonics), and posit format support (configurable 16/32-bit for applications where dynamic range matters more than uniform precision).

Adaptive mesh refinement (AMR) is handled natively by the Graph Engine’s variable-degree graph support: AMR hierarchies map to graphs where fine-resolution regions have higher connectivity, and the degree-aware scheduler handles the resulting load imbalance automatically.

Multi-Scale Memory Mapping

Physics GNNs (GraphCast, GenCast) use an encoder-processor-decoder pattern analogous to multigrid: encode from a fine grid (~1M points) to a coarse mesh (~40K nodes), process at coarse resolution, decode back. PhysDiffuse-1’s memory hierarchy maps naturally to this:

Scale	Data Size	Storage Tier
Coarse mesh (processor)	~41K nodes x 512 features = ~80 MB	Local SRAM (on-chip)
Fine grid (encoder/decoder)	~1M points x 227 channels = ~940 MB	HBM, streamed through SRAM
Model weights	~150 MB (GraphCast-class)	SRAM (pinned)

The coarse mesh and model weights fit entirely in the 512 MB SRAM, meaning the 16-layer processor runs without any HBM access. Only encode/decode phases stream from HBM via the Graph Engine. For molecular simulation, MACE’s two-iteration message passing operates on local atomic neighborhoods that fit in per-PPE SRAM (2 MB holds ~10,000 atoms with neighbors and features), while the global system state resides in HBM.
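The sizes in the table can be checked with a few multiplications. The sketch assumes FP32 storage (4 bytes per value) and a 721 x 1440 grid for the "~1M points" at 0.25 degrees. Both are assumptions not stated in the table itself.

```python
MB = 1e6
BYTES_FP32 = 4

coarse_mesh = 41_000 * 512 * BYTES_FP32 / MB        # ~41K nodes x 512 features
fine_grid = 721 * 1440 * 227 * BYTES_FP32 / MB      # 0.25-degree grid x 227 channels
weights = 150.0                                     # MB, GraphCast-class figure from the table
sram_total = 256 * 2.0                              # 256 PPEs x 2 MB each

fits_on_chip = coarse_mesh + weights <= sram_total
print(f"mesh {coarse_mesh:.0f} MB, grid {fine_grid:.0f} MB, on-chip fit: {fits_on_chip}")
```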

Market Justification

Development Cost

A custom 4 nm ASIC of this complexity requires $300-500M in development cost (design, verification, tape-out, packaging, testing), consistent with other domain-specific accelerators at this scale. The multi-tile approach with 4 dies per package uses chiplet-based design to reduce per-die area and improve yield.

Total Addressable Market

Domain	Annual Spend	PhysDiffuse-1 Relevance
Operational weather prediction (global NWP agencies)	$2B/year	Direct: GenCast-class inference
AI drug discovery (molecular simulation + structure prediction)	~$4.9B by 2028	Direct: MACE/AF3/RFdiffusion inference and MD
Computational materials science	~$1.5B	Direct: neural potential MD for battery, catalyst, semiconductor design
Fusion energy simulation	$6B+ invested to date	Partial: plasma turbulence, MHD simulation surrogates
Climate modeling	$500M+/year	Direct: high-resolution ensemble climate projection
Automotive/aerospace CFD	$5-50M per OEM annually	Partial: FNO/MeshGraphNet surrogates in design loops

Conservative TAM: $3-5B/year, growing 25-35% annually as AI-native simulation displaces traditional HPC.

Market segments at a glance:

Weather ($2B/yr, global NWP agencies): GenCast uses 24,000x less energy than ECMWF ENS and delivers probabilistic 15-day forecasts in minutes.
Drug Discovery (~$4.9B by 2028, molecular simulation and structure prediction): AlphaFold 3 and RFdiffusion for protein design, ligand binding, and neural potential MD.
Materials Science (~$1.5B; battery, catalyst, semiconductor): DiffCSP for crystal structure prediction; neural potentials for battery and catalyst design.
Fusion Energy ($6B+ invested; plasma and MHD simulation): coupled PDE and particle simulation; plasma turbulence surrogates for tokamak design.
Climate Modeling ($500M+/yr, Earth system models): IPCC-class projections; high-resolution ensemble climate modeling at km scale.

Unit Economics

At $30,000-50,000 per chip (H100-comparable pricing), a single PhysDiffuse-1 replacing 4-14 H100s for physics workloads (the 4-6x and 7-14x speedup targets above) is compelling. A 512-chip system for a national weather service would cost $15-25M, less than the annual opex of an ECMWF-class supercomputer, while delivering ensemble forecasts in minutes. For pharma, a 64-chip rack could replace a floor of GPU servers for neural potential MD, cutting capital and power by 5-10x.

Competitive Landscape and Risks

Anton 3 dominates classical MD but cannot run neural network potentials; as the field moves to learned potentials (MACE, Allegro), Anton’s architectural advantage narrows. Cerebras CS-2 (850K cores, 40 GB SRAM, 220 Pb/s fabric bandwidth) attacks the same memory wall from a general-purpose angle. NVIDIA H100/B200 with cuEquivariance and PhysicsNeMo will continue narrowing the software gap; the B200’s 8 TB/s HBM partially addresses scatter-gather. Google TPU v5/v6 already runs GenCast operationally.

Key risks: (1) a custom ASIC without PyTorch/JAX integration and support for e3nn, PyG, DGL, OpenMM-ML is dead on arrival; (2) workload fragmentation across weather, molecular, materials, fluid, and plasma domains means no single configuration is optimal for all; (3) NVIDIA ships a new GPU generation every two years — if the GB300 adds hardware graph support, the window narrows; (4) a chip arriving in 2028 needs the market to sustain production volumes at scale.

Summary

Physics diffusion models are simultaneously the most compute-intensive and least hardware-efficient workloads in modern AI. They combine iterative denoising (20-50x compute multiplier), irregular graph operations (10-30% GPU utilization), equivariant tensor products (5-20x overhead versus dense layers), mixed FP64/FP16 precision requirements (90%+ tensor core waste on GPUs), and extreme resolution demands (1M+ grid points for weather, atomic-scale for MD).

Anton proved that purpose-built physics hardware delivers 100-1000x over general-purpose systems when the computational kernel is well-defined. The kernels of physics diffusion — scatter-gather message passing, CG tensor products, FFT-based spectral convolution, spherical harmonics evaluation, mixed-precision accumulation — are now well-defined enough to justify silicon.

PhysDiffuse-1 targets 4-6x over H100 on physics diffusion inference and 7-14x on neural network molecular dynamics, through a combination of hardware scatter-gather (16x effective bandwidth for graph operations), dedicated FP64 MACs (avoiding the 1:29 penalty of GPU tensor cores), hardwired FFT butterflies, a spherical harmonics evaluator, and a 3D torus interconnect inspired by Anton. A 256-PPE chip with 512 MB SRAM, 128 GB HBM3e, and multi-chip scaling to 512 nodes addresses the full spectrum of physics diffusion workloads from single-member weather forecasts to millisecond-scale molecular dynamics.

The question is not whether physics diffusion needs better hardware. It does. The question is whether the market reaches critical mass before general-purpose GPUs close the gap. At $3-5B TAM growing 25-35% annually, the window is open. It will not stay open forever. For a radically different approach to the efficiency problem — one that replaces matrix multiplication with continuous dynamics entirely — see Nonlinear Silicon: Oscillator-Based Computing.

Additional Reading

GenCast: Probabilistic Weather Forecasting — Price et al., DeepMind | Nature
AlphaFold 3 — Abramson et al., DeepMind/Isomorphic
RFdiffusion — Watson et al., Baker Lab
GraphCast — Lam et al., DeepMind | Science
FourCastNet — Pathak et al., NVIDIA
Score-Based Generative Modeling through SDEs — Song et al. 2021
Fourier Neural Operator — Li et al. 2020
Anton: Special-Purpose MD Machine — Shaw et al., CACM 2008
Anton 3 Network — Shim et al., HPCA 2022
eSCN: Reducing SO(3) to SO(2) — Passaro & Zitnick, ICML 2023
MACE — Batatia et al.
NequIP — Batzner et al. | Nature Comms
