As AI workloads fragment from “train a big transformer” into specialized inference regimes, the hardware must fragment too. GPUs remain the default. The margins live in the corners.

Four architecture proposals. Each one: workload characterization, GPU mismatch analysis, a concrete chip with die-level specs, and a market case.


ARIA — 256 MB on-chip SRAM + 96 GB HBM4. Heterogeneous prefill/decode engines for multi-step agents that leave GPUs at 0.17% utilization. N3E, 600 mm², $465-710M NRE.

VDX-1 — 14 GB weight SRAM across 4 chiplets. Weights never leave the die across 30 denoising steps. Closes the 218x real-time gap. 475 W, $5-8K/chip.
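A back-of-envelope sketch of why weight residency matters for VDX-1. It assumes (this is not stated above) that the model fills the full 14 GB of weight SRAM and that a design without on-die weights would re-read every weight from off-chip memory on each of the 30 denoising steps. The function name is hypothetical.

```python
WEIGHT_BYTES = 14 * 10**9   # assumption: the model fills all 14 GB of weight SRAM
DENOISING_STEPS = 30        # from the VDX-1 summary

def weight_traffic_gb(weight_bytes, weight_loads):
    """Off-chip weight traffic in GB for a given number of full weight loads."""
    return weight_bytes * weight_loads / 10**9

# Weights streamed from off-chip memory on every denoising step.
streamed_gb = weight_traffic_gb(WEIGHT_BYTES, DENOISING_STEPS)

# Weights loaded once and kept on-die for all steps.
resident_gb = weight_traffic_gb(WEIGHT_BYTES, 1)

print(f"streamed: {streamed_gb:.0f} GB, resident: {resident_gb:.0f} GB")
```

Under those assumptions, keeping weights on-die cuts off-chip weight traffic per generation by a factor equal to the step count. Activation traffic is ignored here.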

PhysDiffuse-1 — 256 Physics Processing Elements with hardware scatter-gather, FFT butterflies, and spherical harmonics. 16x effective bandwidth over GPU for graph neural networks. 3D torus inspired by Anton.
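To make the scatter-gather workload concrete, here is a minimal sketch of one sum-aggregation message-passing step, the access pattern graph neural networks lean on. It is a plain-Python illustration, not PhysDiffuse-1's actual dataflow; the function name and the tiny graph are made up for the example.

```python
def message_passing_step(node_feats, edges):
    """One sum-aggregation step over a directed edge list of (src, dst) pairs."""
    dim = len(node_feats[0])
    out = [[0.0] * dim for _ in node_feats]
    for src, dst in edges:
        message = node_feats[src]        # gather: read indexed by the edge list
        for k in range(dim):
            out[dst][k] += message[k]    # scatter-add: writes may collide on dst
    return out

# Hypothetical 3-node graph: nodes 0 and 1 both send to node 2, node 2 sends to node 0.
node_feats = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
edges = [(0, 2), (1, 2), (2, 0)]
aggregated = message_passing_step(node_feats, edges)
print(aggregated)
```

Both the reads and the writes are indexed by the edge list rather than laid out contiguously, which is the irregularity that hardware scatter-gather is meant to absorb.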

ATLAS — 32 parallel rollout units, 640,000 imagination steps/sec. Sense-predict-imagine-plan in 13.5 ms. From drones at 25 W to digital twins at 500 W. Same ISA.
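The ATLAS figures can be related with simple arithmetic. The sketch below assumes that 640,000 steps/sec is the aggregate across all 32 rollout units, and it treats the whole 13.5 ms loop as imagination time, so the per-cycle numbers are upper bounds (sensing, prediction, and planning share that budget).

```python
ROLLOUT_UNITS = 32
STEPS_PER_SEC = 640_000   # assumption: aggregate across all rollout units
CYCLE_MS = 13.5           # full sense-predict-imagine-plan loop

# Throughput each rollout unit must sustain.
steps_per_unit_per_sec = STEPS_PER_SEC / ROLLOUT_UNITS

# Upper bound on imagination steps in one loop, if the whole loop were imagination.
max_steps_per_cycle = STEPS_PER_SEC * CYCLE_MS / 1000

# Upper bound on rollout depth per unit per loop.
max_steps_per_unit_per_cycle = max_steps_per_cycle / ROLLOUT_UNITS

print(steps_per_unit_per_sec, max_steps_per_cycle, max_steps_per_unit_per_cycle)
```

Under these assumptions each unit runs 20,000 steps/sec, and a single planning loop can consider at most 8,640 imagined steps, or 270 per rollout unit.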

Nonlinear Silicon (OscNet-1) — 4,096 coupled oscillators replacing matrix multiply with synchronization. Kuramoto physics as compute. <100 mW, microsecond inference.
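For readers unfamiliar with Kuramoto dynamics, a minimal simulation sketch follows. It uses the textbook mean-field form, dθ_i/dt = ω_i + K·r·sin(ψ − θ_i), where r·e^{iψ} is the average of e^{iθ_j} over all oscillators. It only illustrates synchronization; it makes no claim about how OscNet-1 encodes inputs or reads out results. Population size, coupling strength, frequency spread, and step size are arbitrary choices for the example.

```python
import cmath
import math
import random

def order_parameter(phases):
    """Return (r, psi): coherence r in [0, 1] and the mean phase psi."""
    mean = sum(cmath.exp(1j * p) for p in phases) / len(phases)
    return abs(mean), cmath.phase(mean)

def simulate_kuramoto(n=64, coupling=2.0, dt=0.01, steps=3000, seed=0):
    """Euler-integrate n all-to-all coupled oscillators; return the final coherence r."""
    rng = random.Random(seed)
    freqs = [rng.uniform(-0.5, 0.5) for _ in range(n)]           # natural frequencies
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]  # random start
    for _ in range(steps):
        r, psi = order_parameter(phases)
        # Mean-field identity: (K/N) * sum_j sin(theta_j - theta_i) = K*r*sin(psi - theta_i)
        phases = [p + dt * (w + coupling * r * math.sin(psi - p))
                  for p, w in zip(phases, freqs)]
    return order_parameter(phases)[0]

r_coupled = simulate_kuramoto(coupling=2.0)
r_uncoupled = simulate_kuramoto(coupling=0.0)
print(f"r with coupling: {r_coupled:.3f}, without: {r_uncoupled:.3f}")
```

With coupling well above the critical value the phases lock and r approaches 1; with no coupling the oscillators drift independently and r stays low. The coherent state is the physical effect the blurb refers to as compute.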

JEPA-R: Latent Prediction for Robotics — Predicts in latent space at 50 Hz, never renders pixels. 1.1 encoder-equivalents vs 50 for diffusion. 15 W with VDX-1 as the rendering backend.

SpecDecode-1: Speculative Decoding ASIC — Draft accelerator (64 MB SRAM, <50 µs/token) + tree-attention verifier + hardware PagedAttention KV-cache. 5-10x over GPU. 7,100 tok/s on 7B.
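The draft-then-verify loop that SpecDecode-1 splits across two engines can be sketched in a few lines. This is a simplified greedy, linear-chain version (no tree attention, no sampling, no KV-cache), with toy stand-in models; `target_next`, `draft_next`, and the digit-counting "language" are hypothetical and exist only to make the example runnable.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=3):
    """Greedy speculative decoding. Returns (tokens, number of target verify passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        # Draft phase: the cheap model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # Verify phase: counted as one target pass, since real hardware would
        # score all k positions in a single batched forward pass.
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)   # first mismatch: keep the target's token
                break
        else:
            accepted.append(target_next(out + accepted))  # all matched: bonus token
        out.extend(accepted)
    return out[len(prompt):len(prompt) + num_tokens], target_passes

# Toy models: the target counts upward modulo 10; the draft is wrong after a 4.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10

tokens, passes = speculative_decode(target_next, draft_next, [0], num_tokens=12)

# Reference: plain greedy decoding with the target model only.
reference, ctx = [], [0]
for _ in range(12):
    ctx.append(target_next(ctx))
    reference.append(ctx[-1])
print(tokens == reference, passes)
```

The output matches target-only greedy decoding while needing fewer target passes than tokens generated. That gap is where the speedup comes from, and it grows with the draft model's acceptance rate.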