PhysDiffuse-1: A Custom Chip for Physics Diffusion Models

Physics diffusion models are the most important workload that nobody has built a chip for.

GenCast generates a probabilistic 15-day global weather forecast in 8 minutes on a single TPU v5 — outperforming ECMWF ENS on 97.4% of 1,320 targets while consuming ~24,000x less energy. AlphaFold 3 denoises random atomic coordinates into protein-ligand 3D structures in minutes. RFdiffusion designs entirely new proteins that fold in the wet lab as predicted. Neural network force fields like MACE and Allegro run molecular dynamics at near-quantum accuracy on 100 million atoms.

These systems share the same mathematical backbone — score-based SDEs, Langevin dynamics, iterative denoising — but operate on physical quantities rather than pixel grids. That rewires every hardware assumption. Where VDX-1 optimizes for dense spatiotemporal attention on uniform grids, physics diffusion demands irregular graph operations on unstructured meshes, SE(3)-equivariant tensor products, FFT-based spectral convolutions, FP64 accumulation for energy conservation, and scatter-gather memory access that leaves GPUs at 10-30% utilization. The gap between what current hardware delivers and what the physics demands is the business case for PhysDiffuse-1.

What Makes Physics Diffusion Different from Image Diffusion

Aspect | Image/Video Diffusion | Physics Diffusion
Data space | Flat Euclidean pixel grids | Manifolds: SE(3), tori, spheres, periodic lattices
Quality metric | Perceptual (FID, CLIP) | Physical fidelity (PDE residual, conservation error)
Constraints | Soft (text prompts) | Hard (conservation laws, symmetries, boundary conditions)
Equivariance | Optional | Mandatory (SE(3), E(3), periodic-E(3))
Stochasticity | Diversity of outputs | Uncertainty quantification with calibrated ensembles
Architecture | U-Net, DiT on uniform grids | GNNs on meshes, equivariant transformers, FNO
Precision | FP16/BF16/FP8 | Mixed FP64/FP32/FP16 cascade

Standard diffusion asks “does this look real?” Physics diffusion asks “does this obey the laws of nature?”

Five Bottlenecks GPUs Cannot Fix

1. Scatter-Gather at 10-30% GPU Throughput

Physics simulations operate on irregular domains: unstructured meshes for CFD, particle clouds for MD, icosahedral grids for weather. GNNs handle this via message passing over edges encoding physical connectivity. GraphCast uses 40,962 nodes and 327,660 edges with 16 message-passing rounds per forward pass. MeshGraphNets for CFD operate on 10K-600K node meshes.

The scatter-gather operations at the core of message passing achieve only 10-30% of peak GPU throughput. Irregular, pointer-chasing memory access defeats cache hierarchies and prefetchers. A GNN with 10x fewer FLOPs than a transformer can be slower in wall-clock time because the operations map so poorly to hardware.
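To make the access pattern concrete, here is a minimal Python sketch of one sum-aggregation message-passing round over an edge list. The toy graph, the feature values, and the function name `message_passing_sum` are illustrative assumptions, not taken from GraphCast or MeshGraphNets.

```python
def message_passing_sum(num_nodes, edges, features):
    """One round of sum aggregation: each node receives its in-neighbors' features.

    edges is a list of (src, dst) pairs; features holds one float per node.
    The read features[src] is the gather and the write out[dst] is the scatter.
    Both use data-dependent indices, which is what defeats caches on real graphs.
    """
    out = [0.0] * num_nodes
    for src, dst in edges:
        out[dst] += features[src]  # gather from src, scatter-add into dst
    return out

# Toy 4-node directed ring with one extra chord (hypothetical data).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
features = [1.0, 2.0, 3.0, 4.0]
aggregated = message_passing_sum(4, edges, features)
print(aggregated)
```

Nothing in the loop is arithmetically heavy; the cost on real hardware comes from the indirect reads and writes, which is the part a dedicated gather unit would target.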

Dedicated accelerators show the gap is closeable: GNNIE achieved 699x over GPU through degree-aware caching; GCoD achieved 294x over GPU (and 7.8x over prior GNN accelerator HyGCN) through co-designed density-polarized engines.

2. Equivariant Operations 5-20x Slower Than Dense Layers

Physics respects symmetry. Equivariant neural networks (SE(3)/E(3) architectures like MACE, NequIP, Allegro, RFdiffusion’s SE(3)-Transformer) build rotational and translational symmetry into the network so it holds exactly rather than being learned from data augmentation.

The computational cost centers on Clebsch-Gordan tensor products between irreducible representations of SO(3). Naive CG tensor products scale O(L^6) for maximum angular momentum L. The eSCN breakthrough (Passaro & Zitnick, 2023) reduced this to O(L^3) — a ~216x reduction at L=6 (46,656 to 216). NVIDIA’s cuEquivariance delivers up to 7x end-to-end speedup (and up to 17x on individual symmetric contraction operations), and custom Triton kernels have shown further TFLOPS improvements through sparse parity re-indexing.

Even so, equivariant layers remain 5-20x more expensive per FLOP than standard dense layers. Spherical harmonics evaluation (Y_l^m) is a prerequisite for every message: at L=4-6, the (L+1)^2 components per interaction (25 to 49) involve nested polynomial recurrences that map poorly to SIMD hardware.
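The scaling figures above can be checked with a few lines of Python. This is a sketch using leading-order cost models only (constant factors ignored); the function names are hypothetical. It counts 2l+1 components for each degree l up to L and compares the O(L^6) and O(L^3) models.

```python
def sh_components(l_max):
    """Number of spherical-harmonic components for all degrees l <= l_max."""
    return sum(2 * l + 1 for l in range(l_max + 1))  # equals (l_max + 1) ** 2

def cg_cost_ratio(l_max):
    """Ratio of the O(L^6) naive cost model to the O(L^3) eSCN cost model."""
    return l_max ** 6 // l_max ** 3

for l_max in (2, 4, 6):
    print(l_max, sh_components(l_max), cg_cost_ratio(l_max))
```

At L=6 the ratio is 46,656 / 216 = 216, matching the reduction quoted for eSCN.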

3. Diffusion Iteration Tax: 20-50x Per Inference

Diffusion models generate samples through iterative denoising. GenCast uses the DPMSolver++2S sampler with 20 denoising steps (39 function evaluations) per 12-hour weather timestep. This iteration tax is the same structural cost that VDX-1 addresses for video generation, but physics diffusion layers additional constraints: equivariance, FP64 precision, and irregular graph topologies. A 15-day forecast requires 30 autoregressive steps times 39 evaluations = 1,170 total neural network forward passes. Compare this to GraphCast, which needs exactly 1 forward pass per 6-hour step — 40 passes for a 10-day forecast.

For a 50-member ensemble:

  • GraphCast + perturbation: 50 x 40 = 2,000 forward passes, but ensemble spread quality is inferior (no learned stochasticity)
  • GenCast: 50 x 1,170 = 58,500 forward passes, fully parallelizable, with calibrated physically meaningful spread
  • ECMWF ENS: 51 full IFS runs on a supercomputer, consuming hours and 271 MJ per forecast cycle

The diffusion tax is real but the net savings over traditional simulation remain enormous (1,000x+). The question is whether specialized hardware can shrink the iteration cost to near-single-pass levels through architectural support for the denoising loop.
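The forward-pass counts above follow from simple arithmetic, sketched here in Python. The function names are hypothetical; the step lengths and evaluation counts are the ones quoted in the text, and the two models are compared at the horizons the text uses (15 days for GenCast, 10 days for GraphCast).

```python
def gencast_passes(days, evals_per_step=39, hours_per_step=12):
    """Forward passes for one GenCast ensemble member."""
    steps = days * 24 // hours_per_step
    return steps * evals_per_step

def graphcast_passes(days, hours_per_step=6):
    """Forward passes for one deterministic GraphCast rollout."""
    return days * 24 // hours_per_step

members = 50
single_gencast = gencast_passes(15)
single_graphcast = graphcast_passes(10)
print(single_gencast, single_graphcast)
print(members * single_gencast, members * single_graphcast)
print(single_gencast / single_graphcast)  # per-member cost ratio at these horizons
```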

4. FP64 Strands 90%+ of Tensor Core Capacity

Molecular dynamics requires energy conservation over millions of timesteps. Floating-point errors in force computation accumulate and cause energy drift that renders simulation results physically meaningless. MACE defaults to FP64 training, with a documented ~2x speedup when switching to FP32 — but FP32 is inadequate for production long-timescale MD. The e3nn ecosystem recommends FP64 for any simulation where cumulative drift matters.

The GPU penalty for FP64 is severe:

GPU | FP64 TFLOPS | TF32 TFLOPS (w/ sparsity) | FP16 TFLOPS (w/ sparsity) | FP64:TF32 Ratio
A100 SXM | 19.5 | 312 | 624 | 1:16
H100 SXM | 34 | 989 | 1,979 | 1:29

The H100 delivers 29x less throughput in FP64 than TF32. Physics simulation requiring FP64 leaves >90% of tensor core capacity idle.
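The ratios can be reproduced directly from the peak figures in the table. The short Python sketch below does only that; the dictionary layout is an illustrative choice, and the numbers are the ones listed above.

```python
peak_tflops = {
    "A100 SXM": {"fp64": 19.5, "tf32_sparse": 312.0},
    "H100 SXM": {"fp64": 34.0, "tf32_sparse": 989.0},
}

# TF32-to-FP64 throughput ratio per GPU.
ratios = {gpu: t["tf32_sparse"] / t["fp64"] for gpu, t in peak_tflops.items()}

for gpu, r in ratios.items():
    print(f"{gpu}: FP64:TF32 = 1:{r:.0f}, FP64 reaches {1 / r:.1%} of the TF32 peak")
```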

AI weather models sidestep FP64 because they predict independent forecast steps rather than time-integrated trajectories. But molecular dynamics, materials science, and fusion plasma cannot make this dodge — the precision penalty is a hard constraint.

5. Memory Wall at Physical Resolution

Physics demands resolution that dwarfs typical ML workloads. A global weather state at 0.25 degrees with 227 channels occupies ~940 MB in FP32. Scaling to storm-resolving 1 km resolution (roughly 510 million surface cells at the same channel count) increases this to ~460 GB per atmospheric state, which requires multi-chip model parallelism for a single inference pass. Turbulence DNS at high Reynolds numbers demands grids of 4096^3 or larger, consuming ~550 GB per field variable in FP64 (4096^3 points x 8 bytes).

Molecular dynamics at the atomic scale is comparatively compact per particle, but the sequential integration loop (on the order of 10^12 femtosecond timesteps for a millisecond of simulated time) creates a latency-dominated memory access pattern that favors on-chip SRAM over off-chip HBM.

[Figure: The Efficiency Gap. GPU utilization on physics diffusion workloads, effective utilization versus wasted capacity. Scatter-gather memory access: 10-30%. Equivariant layers (CG tensor products): 5-20%. FP64 compute throughput: 3-7%. Tensor core active time: ~50%. Overall effective utilization: ~15-25%, meaning 75-85% of GPU silicon sits idle on physics diffusion workloads.]

The Anton Precedent: What Purpose-Built Physics Hardware Can Achieve

D.E. Shaw Research’s Anton machines are the proof-of-concept that domain-specific silicon for physics simulation delivers 100-1000x over general-purpose hardware.

Three Generations of Co-Design

Generation | Year | Key Specs | Performance
Anton 1 | 2008 | 32 pipelined HTIS modules at 800 MHz, 3D torus with 607 Gbit/s bisection and 50 ns hop latency, Tensilica flex cores + 8 SIMD geometry cores | >17,000 ns/day for 23,558-atom protein; 50-100x over contemporary HPC
Anton 2 | 2014 | Upgraded ASICs, four 512-node partitions, higher clock rates, larger on-chip memory | 3-5x over Anton 1
Anton 3 | ~2024 | New process node, redesigned compute pipelines, enhanced network | ~10x over Anton 2 (per HPCA 2022 paper: “an order of magnitude”); routine multi-millisecond simulations

At ~10x over Anton 2, Anton 3 represents a cumulative ~30-50x improvement over the original Anton 1 (given Anton 2’s ~3-5x over Anton 1).

Why Anton Succeeded

Anton’s dominance stems from five co-design principles: (1) hardwiring the bottleneck — non-bonded pairwise forces consume >90% of MD compute, so the HTIS dedicates fixed-function silicon to this kernel at near-peak throughput; (2) minimizing communication latency — the 3D torus at 50 ns per-hop (versus microseconds on commodity networks) collapses the global-barrier-every-timestep penalty; (3) balancing compute and communication — the ASIC’s compute rate matches the network’s injection bandwidth, so neither starves; (4) exploiting domain structure — MD has a fixed algorithmic skeleton (neighbor list, force eval, integration, repeat) and Anton’s pipeline is tuned to exactly this loop; (5) achieving long timescales — ~1 microsecond wall clock per femtosecond timestep enabled millisecond-scale simulations that opened entirely new science.

The Critical Divergence

Classical MD is latency-bound and communication-bound: every timestep requires global synchronization, favoring low-latency interconnect and fixed-function pipelines. Diffusion-based prediction is throughput-bound and compute-bound: each denoising step is a self-contained neural network forward pass, favoring tensor parallelism and batch efficiency.

Neural network force fields for MD (MACE, NequIP, Allegro) sit at the intersection. Each MD timestep requires NN inference to compute forces, then classical integration. This hybrid regime — learned potentials driving long-timescale MD — is the workload most likely to benefit from a new class of domain-specific accelerator that fuses NN inference with physics integration logic.

[Figure: Anton Legacy and the PhysDiffuse-1 Divergence (D.E. Shaw Research). Timeline: Anton 1 (2008), Anton 2 (2014), Anton 3 (~2024), PhysDiffuse-1 (proposed; neural + physics hybrid, GNN and diffusion silicon). Cumulative speedup vs. contemporary GPU, log scale: Anton 1 ~75x, Anton 2 ~300x, Anton 3 ~750x, PhysDiffuse-1 7-14x on neural MD.]

Anton is latency-bound (pairwise forces, global sync every timestep). PhysDiffuse-1 is throughput-bound (neural network forward passes, batch-parallel denoising). Same physics, different computational regimes.

PhysDiffuse-1: Architecture Proposal

A 4 nm multi-tile ASIC with 256 Physics Processing Elements (PPEs) designed to accelerate the five bottlenecks enumerated above.

Per-PPE Hardware

Subunit | Function | Specifications
Tensor Core Cluster | Dense matmul for denoising network layers | 128 FP16 MACs + 16 FP64 MACs, mixed-precision accumulator (FP64 running sums with FP16 inputs)
Graph Engine | Hardware scatter-gather for GNN message passing | 32-wide parallel gather unit, programmable aggregation (sum/max/mean), neighbor list buffer with degree-aware scheduling
Special Function Unit (SFU) | FFT butterflies, spherical harmonics, transcendentals | 8 pipelined CORDIC-based lanes, hardwired radix-2/4/8 FFT, Y_l^m evaluator for l up to 32
Local SRAM | Graph neighborhoods, activations, partial sums | 2 MB per PPE (512 MB total across chip)
[Figure: Physics Processing Element, internal architecture (x256 PPEs on die, 4 nm, 4 tiles). Blocks: Tensor Core Cluster (dense compute), Graph Engine (graph ops), Special Function Unit (special functions: FFT butterflies, Y_l^m harmonics, exp/erf/sin), Local SRAM (2 MB on-chip memory).]

Graph Engine: Closing the Scatter-Gather Gap

On GPUs, graph message passing achieves 10-30% of peak memory bandwidth due to irregular access patterns. PhysDiffuse-1’s Graph Engine attacks this directly:

32-Wide Gather Unit. Thirty-two parallel read ports into the Neighbor List Buffer allow a single PPE to fetch an entire neighborhood in one cycle for vertices of degree up to 32. For physics graphs (molecular: k~20-50, mesh: k~3-20, icosahedral weather: k~6), this covers the majority of vertices without stalling; the densest molecular neighborhoods need a second gather cycle.

Degree-Aware Scheduling. Real-world physics graphs follow power-law or bounded-degree distributions. The scheduler assigns high-degree nodes to multiple PPEs (vertex splitting) while packing multiple low-degree nodes into single PPE timeslots, achieving 70-80% PE utilization versus the 30-40% typical of naive row-parallel assignment on GPUs.
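The scheduling idea can be illustrated with a small Python sketch. It is an idealized model, not the proposed hardware scheduler: it assumes 32 gather lanes per timeslot, splits high-degree nodes into full-width chunks, and packs the leftovers with first-fit decreasing. The degree mix and all function names are hypothetical, and real constraints (bank conflicts, dependencies, network traffic) are ignored, so the utilization it reports is an upper bound rather than a prediction.

```python
import math

WIDTH = 32  # gather lanes per PPE timeslot, from the text

def naive_slots(degrees):
    """One node per timeslot; nodes wider than WIDTH take several slots."""
    return sum(max(1, math.ceil(d / WIDTH)) for d in degrees)

def degree_aware_slots(degrees):
    """Split high-degree nodes into WIDTH-sized chunks, then pack the
    remaining pieces into shared slots with first-fit decreasing."""
    full_slots = 0
    pieces = []
    for d in degrees:
        full_slots += d // WIDTH      # vertex splitting: full-width chunks
        if d % WIDTH:
            pieces.append(d % WIDTH)  # leftover piece to be packed
    free = []                         # free lanes in each shared slot
    for p in sorted(pieces, reverse=True):
        for i, f in enumerate(free):
            if f >= p:
                free[i] -= p
                break
        else:
            free.append(WIDTH - p)    # no slot had room: open a new one
    return full_slots + len(free)

def utilization(degrees, slots):
    """Fraction of gather lanes doing useful work."""
    return sum(degrees) / (slots * WIDTH)

degrees = [6] * 40 + [20] * 10 + [50] * 4  # hypothetical mixed-degree graph
naive = naive_slots(degrees)
packed = degree_aware_slots(degrees)
print(naive, round(utilization(degrees, naive), 3))
print(packed, round(utilization(degrees, packed), 3))
```

On this toy mix the naive assignment leaves roughly two thirds of the lanes idle, while packing approaches the lower bound of sum(degrees) / 32 slots.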

Programmable Aggregation. Sum, max, and mean reduction modes are configurable per layer, supporting the full range of GNN aggregation functions without software overhead.

Effective Bandwidth. For a graph with average degree 20, the combination of wide gather, on-chip neighbor list caching, and degree-aware scheduling delivers 16x effective bandwidth compared to GPU scatter-gather, translating irregular graph access into streaming on-chip reads.

Flexible Precision Cascade

The physics workload spans four precision regimes simultaneously. PhysDiffuse-1 provides hardware support for all four:

Precision | Use Case | Hardware
FP64 | Energy conservation in MD integration, force accumulation | 16 FP64 MACs per PPE (4,096 total), FP64 accumulators
FP32 | Intermediate calculations, training gradients | Shared with FP64 units (2x throughput in FP32 mode)
FP16/BF16 | Bulk neural network operations (attention, MLP, convolution) | 128 FP16 MACs per PPE (32,768 total), tensor core equivalent
INT8 | Graph topology, index arithmetic, neighbor lists | Graph Engine integer paths

The mixed-precision accumulator maintains running sums in FP64 even when inputs arrive in FP16, enabling the “FP64 accumulation with FP16 compute” pattern that MD and equivariant networks require — avoiding the GPU’s all-or-nothing precision choice.
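Why the accumulator width matters can be shown with a software emulation. The Python sketch below rounds values to IEEE half precision through the standard `struct` module's `"e"` format and compares an FP16 running sum against a double-precision running sum over the same FP16 inputs. The input value and count are hypothetical, and this illustrates the numerical effect only, not the proposed datapath.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def accumulate(values, fp16_accumulator):
    """Sum values, optionally rounding the running sum to FP16 after each add."""
    total = 0.0
    for v in values:
        total = total + v
        if fp16_accumulator:
            total = to_fp16(total)  # emulate a half-precision accumulator register
    return total

inputs = [to_fp16(0.01)] * 10_000  # FP16 inputs (hypothetical small contributions)
sum_fp16_acc = accumulate(inputs, fp16_accumulator=True)
sum_fp64_acc = accumulate(inputs, fp16_accumulator=False)
print(sum_fp16_acc, sum_fp64_acc)
```

The FP16 running sum stops growing once its spacing between representable values exceeds twice the addend, so later contributions are silently dropped; the wide accumulator keeps all of them.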

Stochastic rounding hardware (per-MAC LFSR) is included for FP16 training. The posit number format (16/32-bit) is supported for applications where dynamic range matters more than uniform precision.
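Stochastic rounding can be sketched in a few lines of Python. Here the standard `random` module stands in for the per-MAC LFSR, and the grid spacing and test value are hypothetical; the point is only that the rounded result is unbiased on average, whereas round-to-nearest flushes small values to zero every time.

```python
import math
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of step, rounding up with probability equal
    to the fractional distance from the lower grid point."""
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * step

rng = random.Random(0)            # software stand-in for the hardware LFSR
step = 2.0 ** -10                 # hypothetical grid spacing
x = 0.3 * step                    # a value well below one grid step
samples = [stochastic_round(x, step, rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)
nearest = round(x / step) * step  # round-to-nearest result for comparison
print(mean / step, nearest)
```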

[Figure: Precision Cascade, how precision flows through the pipeline, ordered from highest precision to highest throughput. FP64: PDE residuals, energy conservation, optimizer accumulation. FP32: force calculations, edge features, normalization layers. FP16/BF16: bulk neural network (attention, MLPs, score function evaluation). INT8: neighbor lists, graph topology indexing, cell-list lookups. Mixed-precision accumulator: FP16 multiply, FP64 accumulate.]

FFT Butterfly Units

Fourier Neural Operators (FNO, FourCastNet’s AFNO) use FFT as their core spatial mixing, achieving O(N log N) global communication versus O(N^2) for self-attention.

PhysDiffuse-1 includes hardwired radix-2/4/8 butterfly units in each SFU:

  • 256-point complex FFT in 32 cycles versus hundreds of cycles through general-purpose multiply-add
  • Batched 1D FFT for multi-channel spectral operations (the dominant pattern in AFNO: FFT per channel, learned spectral weighting, IFFT)
  • Real-valued optimization: physics fields are real, so the r2c/c2r transform saves half the computation and storage versus general complex FFT

For FourCastNet-class models operating on a 720 x 1,440 weather grid, the FFT butterfly units handle the spatial mixing phase entirely on-chip, eliminating the round-trip to HBM that dominates FFT latency on GPUs.
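For reference, the butterfly structure that such units would hardwire looks like this in software. The sketch is a textbook recursive radix-2 decimation-in-time FFT in pure Python, checked against a naive DFT; it is not a model of the proposed hardware, and the test signal is arbitrary.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle           # butterfly: upper output
        out[k + n // 2] = even[k] - twiddle  # butterfly: lower output
    return out

def dft_naive(x):
    """O(N^2) reference transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

signal = [float((3 * t) % 7) for t in range(16)]  # real-valued test field
spectrum = fft_radix2(signal)
reference = dft_naive(signal)
max_err = max(abs(a - b) for a, b in zip(spectrum, reference))
# Real input gives a conjugate-symmetric spectrum, which is what r2c exploits.
sym_err = max(abs(spectrum[k] - spectrum[-k].conjugate()) for k in range(1, 16))
print(max_err, sym_err)
```

A radix-2 transform of N points uses (N/2) log2 N butterflies, which is 1,024 for N=256, so a 32-cycle target implies on the order of 32 radix-2 butterflies retired per cycle (fewer, larger operations with radix-4 or radix-8 stages).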

Spherical Harmonics Evaluator

SE(3)-equivariant networks compute spherical harmonics Y_l^m of relative displacement vectors as a prerequisite for every message. For low angular momentum (L=0-2), this is cheap. For the L=4-6 needed in high-accuracy molecular potentials, and the higher orders used in global weather models on the sphere, evaluation involves nested polynomial recurrences (associated Legendre functions, trigonometric scaling) that map poorly to SIMD hardware.

The SFU includes a dedicated Y_l^m evaluator:

  • Angular momentum up to L=32 (1,089 components), covering all current and foreseeable equivariant architectures
  • CORDIC-based evaluation of associated Legendre polynomials and trigonometric functions, avoiding lookup tables, with precision set by the number of iterations
  • Pipelined throughput: one complete set of harmonics (all m for a given l) per cycle at L=6, scaling linearly with L
  • Clebsch-Gordan coefficient cache: precomputed CG coefficients for common coupling paths stored in dedicated on-chip ROM, eliminating the repeated recomputation that dominates e3nn profiling
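The "nested polynomial recurrences" mentioned above are the standard three-term recurrences for associated Legendre functions. The Python sketch below evaluates them in unnormalized form with the Condon-Shortley phase; it is a software reference under those conventions, not the CORDIC pipeline, and production code would typically use normalized variants for numerical range.

```python
import math

def assoc_legendre(l_max, x):
    """Return {(l, m): P_l^m(x)} for 0 <= m <= l <= l_max."""
    p = {}
    s = math.sqrt(max(0.0, 1.0 - x * x))
    pmm = 1.0
    for m in range(l_max + 1):
        if m > 0:
            pmm *= -(2 * m - 1) * s                # P_m^m from P_{m-1}^{m-1}
        p[(m, m)] = pmm
        if m < l_max:
            p[(m + 1, m)] = x * (2 * m + 1) * pmm  # first off-diagonal term
        for l in range(m + 2, l_max + 1):          # upward recurrence in l
            p[(l, m)] = ((2 * l - 1) * x * p[(l - 1, m)]
                         - (l + m - 1) * p[(l - 2, m)]) / (l - m)
    return p

x = 0.3
table = assoc_legendre(6, x)
print(len(table), table[(2, 0)], table[(2, 1)], table[(2, 2)])
```

Each (l, m) entry depends on the two entries below it in l, which is the serial dependency chain that vector units handle poorly and a pipelined evaluator can hide.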

3D Torus Interconnect

Inspired by Anton’s network co-design. Single-chip (256 PPEs): internal mesh, 6 links per PPE at 100 Gbit/s. Multi-chip (up to 512 chips): 8 x 8 x 8 torus, 6 inter-chip links at 200 Gbit/s each, 86 Tbit/s bisection bandwidth, 100 ns hop latency (2x Anton 1, but 10-50x below commodity InfiniBand).

Hardware Halo Exchange Engine: dedicated DMA pushes boundary data to neighboring PPEs automatically when local computation completes — no software synchronization overhead. Hardware all-reduce: tree-based aggregation for energy totals, conservation checks, and FFT transposes, completing a 256-PPE all-reduce in ~50 cycles.
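As a rough sketch of what the torus topology implies for latency, the Python below computes minimal hop counts on an 8 x 8 x 8 wrap-around grid and the resulting worst-case traversal time at the 100 ns per-hop target. The coordinate scheme and function names are assumptions for illustration; routing, contention, and serialization delays are ignored.

```python
DIMS = (8, 8, 8)  # multi-chip torus dimensions from the text
HOP_NS = 100      # per-hop latency target from the text

def torus_hops(a, b, dims=DIMS):
    """Minimal hop count between chips a and b on a wrap-around torus."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))

def torus_neighbors(node, dims=DIMS):
    """The six directly linked chips (+/-1 along each axis, with wrap-around)."""
    out = []
    for axis, n in enumerate(dims):
        for step in (-1, 1):
            nb = list(node)
            nb[axis] = (nb[axis] + step) % n
            out.append(tuple(nb))
    return out

diameter = sum(n // 2 for n in DIMS)  # farthest pair of chips, in hops
worst_case_ns = diameter * HOP_NS
print(diameter, worst_case_ns)
print(torus_neighbors((0, 0, 0)))
```

The wrap-around links halve the worst-case distance along each axis compared with a plain mesh, which is the main reason a torus suits nearest-neighbor halo exchange.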

[Figure: Interconnect topology. 3D torus network, 4x4 slice of the 256 PPEs; each node is one PPE with 6 links, and wrap-around edges form the torus. Comparison panel: GPU cluster (PCIe/NVLink, tree or fat-tree topology, high hop latency); Anton 1 (3D torus, 50 ns/hop, 607 Gbit/s bisection, classical MD); PhysDiffuse-1, proposed (3D torus, 100 ns/hop, 86 Tbit/s bisection, NN + MD; 141x the bisection bandwidth of Anton 1).]

Memory Hierarchy

Level | Capacity | Bandwidth | Latency | Purpose
Register file | 32 KB per PPE | N/A | 1 cycle | Tensor core operands
Local SRAM | 2 MB per PPE (512 MB total) | 4 TB/s aggregate | 3 cycles | Graph neighborhoods, activations, NN weights for current layer
Shared L2 | 64 MB per tile (4 tiles, 256 MB total) | 16 TB/s | 10 cycles | Cross-PPE data sharing, FFT working space
HBM3e | 128 GB | 8 TB/s | ~100 ns | Full model weights, atmospheric states, training data

The 512 MB of total on-chip SRAM is sized to hold the complete graph neighborhood data for a 40,000-node icosahedral weather mesh (GenCast-scale) or the neighbor lists for a 500,000-atom molecular system without spilling to HBM. This is critical because the scatter-gather bottleneck is fundamentally a memory-proximity problem: if the data is on-chip, irregular access patterns are fast; if it is in HBM, they are 30-100x slower.

Dataflow for a GenCast-Class Weather Model

A single 12-hour forecast step through PhysDiffuse-1:

  1. Encode (lat-lon to icosahedral mesh): Bipartite graph message passing. The Graph Engine handles the irregular gather from ~1M lat-lon points to ~41K mesh nodes. SRAM holds the mesh graph; HBM streams the atmospheric state.

  2. Process (16 graph transformer layers): Each layer runs message passing (Graph Engine, scatter-gather on ~330K edges) followed by dense transformation (Tensor Core Cluster, standard GEMM on node features). The Graph Engine and Tensor Cores operate in pipelined alternation, keeping both units saturated.

  3. Decode (mesh back to lat-lon): Inverse of encode. Graph Engine scatters mesh predictions back to the full-resolution grid.

  4. Denoising loop (39 evaluations per step): The full encode-process-decode pass repeats 39 times with different noise levels. Model weights remain pinned in SRAM/L2 across iterations, eliminating the HBM weight reload that dominates GPU denoising latency.

  5. Autoregressive rollout (30 steps for 15-day forecast): The output of each step becomes the input to the next. The torus network handles multi-chip partitioning for ensemble parallelism.
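The nesting of steps 4 and 5 can be written as a loop skeleton. In the Python sketch below, `denoiser` is a hypothetical stub standing in for one encode-process-decode pass, and the accounting of 39 evaluations for 20 noise levels assumes a two-stage sampler that skips the second stage on the final level; that schedule is an assumption used to reproduce the count, not a statement about GenCast's implementation.

```python
counter = {"calls": 0}

def denoiser(state, noise_level):
    """Stand-in for one encode-process-decode pass (hypothetical stub)."""
    counter["calls"] += 1
    return state

def sample_step(state, num_levels=20):
    """Two-stage sampler skeleton: two evaluations per noise level,
    except the last level, which takes a single evaluation."""
    for i in range(num_levels):
        state = denoiser(state, noise_level=i)            # first stage
        if i < num_levels - 1:
            state = denoiser(state, noise_level=i + 0.5)  # second-stage correction
    return state

def rollout(initial_state, num_steps=30):
    """Autoregressive rollout: the output of each 12-hour step feeds the next."""
    state = initial_state
    for _ in range(num_steps):
        state = sample_step(state)
    return state

rollout(initial_state=0.0)
print(counter["calls"])
```

Only the outer loop is inherently sequential across forecast time; ensemble members are independent copies of `rollout`, which is why they parallelize cleanly across chips.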

Dataflow for MACE-Class Molecular Dynamics

Each MD timestep runs without returning to the host: (1) the Graph Engine builds neighbor lists via cell-list partitioning with hardware periodic boundary conditions; (2) the SFU evaluates spherical harmonics for all neighbor pairs, CG tensor products execute on Tensor Cores (dense contraction) plus Graph Engine (sparse coupling), with the CG coefficient cache eliminating redundant lookups; (3) dense output MLP computes energy, forces derived via autodiff backward pass; (4) FP64 MACs handle Velocity Verlet integration; (5) the Halo Exchange Engine pushes boundary atoms to neighboring PPEs.
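Step (4) refers to the standard velocity Verlet scheme, sketched below in Python for a single particle. The harmonic force is a hypothetical stand-in for the learned potential, and the parameters are arbitrary; the example only shows the update order and the bounded energy error that makes the scheme attractive for long trajectories.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate one particle with velocity Verlet; returns the final (x, v)."""
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass  # half-step velocity update
        x = x + dt * v_half               # full-step position update
        f = force(x)                      # on the chip this would be the NN force call
        v = v_half + 0.5 * dt * f / mass  # second half-step velocity update
    return x, v

k, mass = 1.0, 1.0                        # hypothetical harmonic "potential"

def force(x):
    return -k * x

def energy(x, v):
    return 0.5 * mass * v * v + 0.5 * k * x * x

x0, v0 = 1.0, 0.0
x1, v1 = velocity_verlet(x0, v0, force, mass, dt=0.01, steps=100_000)
drift = abs(energy(x1, v1) - energy(x0, v0)) / energy(x0, v0)
print(drift)
```

The force is evaluated once per step, so in the hybrid regime the neural network inference sits squarely on the critical path of every timestep.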

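For a 50,000-atom protein system, the target is ~10 microseconds of wall-clock time per femtosecond timestep. That works out to roughly 8.6 microseconds of simulated time per day, so microsecond-scale trajectories finish in hours and a full millisecond takes on the order of four months on one system. This is the same order as Anton 1's >17,000 ns/day for classical force fields, but with near-DFT accuracy from the learned potential.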

Performance Estimates

Speedup Over H100

For a typical physics diffusion forward pass with the following compute breakdown (derived from profiling GenCast and MACE on A100/H100):

Phase | % of GPU Time | PhysDiffuse-1 Speedup | Rationale
Message passing (scatter-gather) | 40% | 10-16x | Hardware Graph Engine vs. GPU indirect memory access
Dense linear layers | 30% | 1-1.5x | Comparable tensor core throughput
FFT / spectral ops | 20% | 3-5x | Hardwired butterfly vs. general-purpose cuFFT
Special functions (Y_l^m, exp, erf) | 10% | 5-10x | CORDIC SFU vs. GPU transcendental library

Weighted end-to-end: 4-6x over H100 on physics diffusion inference.

For FP64-dominant workloads (molecular dynamics with neural potentials):

Phase | % of GPU Time | PhysDiffuse-1 Speedup | Rationale
Equivariant tensor products | 50% | 8-15x | CG cache + Graph Engine + dedicated FP64 MACs
Force accumulation (FP64) | 20% | 10-20x | FP64 MACs at full throughput vs. 1:29 ratio on H100
Neighbor list / integration | 20% | 3-5x | Hardware neighbor list builder + FP64 integrator
Communication (halo exchange) | 10% | 5-20x | Hardware halo exchange vs. software MPI

Weighted end-to-end: 7-14x over H100 on neural network molecular dynamics.
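One way to sanity-check the weighted figures is a strictly serial Amdahl's-law combination of the two phase tables, sketched below in Python. It assumes the phases run back to back with no overlap, which is an assumption of this sketch rather than of the proposal. Under that assumption the result is bounded by the least-accelerated phase and comes out below the headline ranges, so the headline numbers should be read as also crediting effects this model leaves out, such as overlapping Graph Engine and Tensor Core work.

```python
def combined_speedup(phases):
    """Amdahl-style combination: phases is a list of (time_fraction, speedup)."""
    assert abs(sum(f for f, _ in phases) - 1.0) < 1e-9
    return 1.0 / sum(f / s for f, s in phases)

# (fraction of GPU time, speedup) rows from the two tables, low and high ends.
diffusion_low = [(0.40, 10), (0.30, 1.0), (0.20, 3), (0.10, 5)]
diffusion_high = [(0.40, 16), (0.30, 1.5), (0.20, 5), (0.10, 10)]
md_low = [(0.50, 8), (0.20, 10), (0.20, 3), (0.10, 5)]
md_high = [(0.50, 15), (0.20, 20), (0.20, 5), (0.10, 20)]

for name, phases in [("diffusion low", diffusion_low), ("diffusion high", diffusion_high),
                     ("MD low", md_low), ("MD high", md_high)]:
    print(f"{name}: {combined_speedup(phases):.2f}x")
```

In the diffusion case the dense-layer phase dominates the residual time, which suggests that any further end-to-end gain has to come from overlapping it with graph work or from shrinking its share.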

Absolute Performance Targets

Workload | H100 Performance | PhysDiffuse-1 Target
GenCast 15-day forecast (single member) | ~4-5 min (estimated from TPU v5 at 8 min) | <1 min
GenCast 50-member ensemble | ~4 hours (sequential on 1 GPU) | ~45 min (sequential) or ~1 min (50-chip parallel)
MACE MD timestep (50K atoms, L=6) | ~100 microseconds | ~10 microseconds
FNO inference (2D Navier-Stokes, 256x256) | ~5 ms | ~1 ms
eSCN single-point energy (OC-20 catalyst) | ~50 ms | ~5 ms

Precision Engineering: Mixed FP64/FP16 Cascades

The precision landscape in physics ML is more nuanced than the binary FP64-vs-FP16 framing suggests. The optimal strategy is a cascade: FP64 for quantities where error accumulates (energy sums, force accumulation, symplectic integration), FP32 for intermediate NN computations where model approximation dominates roundoff, and FP16/BF16 for bulk tensor operations where throughput matters most. PhysDiffuse-1’s mixed-precision accumulator implements this cascade in hardware.

Supporting techniques built into the SFU: stochastic rounding (per-MAC LFSR randomness prevents systematic bias when FP16 gradients fall below epsilon), CORDIC-based transcendentals (shifts-and-adds for exp, sin, cos, erf — higher precision and lower power than lookup tables, especially for the associated Legendre recurrence in spherical harmonics), and posit format support (configurable 16/32-bit for applications where dynamic range matters more than uniform precision).

Adaptive mesh refinement (AMR) is handled natively by the Graph Engine’s variable-degree graph support: AMR hierarchies map to graphs where fine-resolution regions have higher connectivity, and the degree-aware scheduler handles the resulting load imbalance automatically.

Multi-Scale Memory Mapping

Physics GNNs (GraphCast, GenCast) use an encoder-processor-decoder pattern analogous to multigrid: encode from a fine grid (~1M points) to a coarse mesh (~40K nodes), process at coarse resolution, decode back. PhysDiffuse-1’s memory hierarchy maps naturally to this:

Scale | Data Size | Storage Tier
Coarse mesh (processor) | ~41K nodes x 512 features = ~80 MB | Local SRAM (on-chip)
Fine grid (encoder/decoder) | ~1M points x 227 channels = ~940 MB | HBM, streamed through SRAM
Model weights | ~150 MB (GraphCast-class) | SRAM (pinned)

The coarse mesh and model weights fit entirely in the 512 MB SRAM, meaning the 16-layer processor runs without any HBM access. Only encode/decode phases stream from HBM via the Graph Engine. For molecular simulation, MACE’s two-iteration message passing operates on local atomic neighborhoods that fit in per-PPE SRAM (2 MB holds ~10,000 atoms with neighbors and features), while the global system state resides in HBM.
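The sizes in the table can be reproduced with a short calculation, shown here as a Python sketch. It assumes FP32 storage (4 bytes per value) and a 721 x 1440 latitude-longitude grid for the 0.25-degree state; both are assumptions made for the arithmetic, while the node, feature, channel, and weight figures are the ones quoted above.

```python
BYTES_FP32 = 4
MB = 1e6

mesh_nodes, mesh_features = 40_962, 512
grid_points, channels = 721 * 1440, 227  # assumed 0.25-degree lat-lon grid
weights_mb = 150.0                        # GraphCast-class, from the table
sram_mb = 256 * 2.0                       # 256 PPEs x 2 MB each

mesh_mb = mesh_nodes * mesh_features * BYTES_FP32 / MB
grid_mb = grid_points * channels * BYTES_FP32 / MB
fits_on_chip = mesh_mb + weights_mb <= sram_mb

print(round(mesh_mb, 1), round(grid_mb, 1), fits_on_chip)
```

The fine grid alone exceeds total on-chip SRAM, which is why only the encode and decode phases need to stream from HBM.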

Market Justification

Development Cost

A custom 4 nm ASIC of this complexity requires $300-500M in development cost (design, verification, tape-out, packaging, testing), consistent with other domain-specific accelerators at this scale. The multi-tile approach with 4 dies per package uses chiplet-based design to reduce per-die area and improve yield.

Total Addressable Market

Domain | Annual Spend | PhysDiffuse-1 Relevance
Operational weather prediction (global NWP agencies) | $2B/year | Direct: GenCast-class inference
AI drug discovery (molecular simulation + structure prediction) | ~$4.9B by 2028 | Direct: MACE/AF3/RFdiffusion inference and MD
Computational materials science | ~$1.5B | Direct: neural potential MD for battery, catalyst, semiconductor design
Fusion energy simulation | $6B+ invested to date | Partial: plasma turbulence, MHD simulation surrogates
Climate modeling | $500M+/year | Direct: high-resolution ensemble climate projection
Automotive/aerospace CFD | $5-50M per OEM annually | Partial: FNO/MeshGraphNet surrogates in design loops

Conservative TAM: $3-5B/year, growing 25-35% annually as AI-native simulation displaces traditional HPC.

[Figure: Market segments. Weather ($2B/yr, global NWP agencies; GenCast-class probabilistic 15-day forecasts). Drug discovery (~$4.9B by 2028; AlphaFold 3, RFdiffusion, neural potential MD). Materials science (~$1.5B; battery, catalyst, semiconductor; DiffCSP crystal structure prediction). Fusion energy ($6B+ invested; plasma and MHD simulation surrogates for tokamak design). Climate modeling ($500M+/yr; km-scale ensemble earth system models). Total addressable market: $3-5B/year, growing 25-35% annually.]

Unit Economics

At $30,000-50,000 per chip (H100-comparable pricing), a single PhysDiffuse-1 replacing 4-14 H100s for physics workloads (the 4-6x and 7-14x targets above) is compelling. A 512-chip system for a national weather service would cost $15-25M, less than the annual opex of an ECMWF-class supercomputer, while delivering ensemble forecasts in minutes. For pharma, a 64-chip rack could replace a floor of GPU servers for neural potential MD, cutting capital and power by 5-10x.

Competitive Landscape and Risks

Anton 3 dominates classical MD but cannot run neural network potentials; as the field moves to learned potentials (MACE, Allegro), Anton’s architectural advantage narrows. Cerebras CS-2 (850K cores, 40 GB on-chip SRAM, 220 Pb/s fabric bandwidth) attacks the same memory wall from a general-purpose angle. NVIDIA H100/B200 with cuEquivariance and PhysicsNeMo will continue narrowing the software gap; the B200’s 8 TB/s HBM partially addresses scatter-gather. Google TPU v5/v6 already runs GenCast operationally.

Key risks: (1) a custom ASIC without PyTorch/JAX integration and support for e3nn, PyG, DGL, OpenMM-ML is dead on arrival; (2) workload fragmentation across weather, molecular, materials, fluid, and plasma domains means no single configuration is optimal for all; (3) NVIDIA ships a new GPU generation every two years — if the GB300 adds hardware graph support, the window narrows; (4) a chip arriving in 2028 needs the market to sustain production volumes at scale.

Summary

Physics diffusion models are simultaneously the most compute-intensive and least hardware-efficient workloads in modern AI. They combine iterative denoising (20-50x compute multiplier), irregular graph operations (10-30% GPU utilization), equivariant tensor products (5-20x overhead versus dense layers), mixed FP64/FP16 precision requirements (90%+ tensor core waste on GPUs), and extreme resolution demands (1M+ grid points for weather, atomic-scale for MD).

Anton proved that purpose-built physics hardware delivers 100-1000x over general-purpose systems when the computational kernel is well-defined. The kernels of physics diffusion — scatter-gather message passing, CG tensor products, FFT-based spectral convolution, spherical harmonics evaluation, mixed-precision accumulation — are now well-defined enough to justify silicon.

PhysDiffuse-1 targets 4-6x over H100 on physics diffusion inference and 7-14x on neural network molecular dynamics, through a combination of hardware scatter-gather (16x effective bandwidth for graph operations), dedicated FP64 MACs (avoiding the 1:29 penalty of GPU tensor cores), hardwired FFT butterflies, a spherical harmonics evaluator, and a 3D torus interconnect inspired by Anton. A 256-PPE chip with 512 MB SRAM, 128 GB HBM3e, and multi-chip scaling to 512 nodes addresses the full spectrum of physics diffusion workloads from single-member weather forecasts to millisecond-scale molecular dynamics.

The question is not whether physics diffusion needs better hardware. It does. The question is whether the market reaches critical mass before general-purpose GPUs close the gap. At $3-5B TAM growing 25-35% annually, the window is open. It will not stay open forever. For a radically different approach to the efficiency problem — one that replaces matrix multiplication with continuous dynamics entirely — see Nonlinear Silicon: Oscillator-Based Computing.
