Custom Chips for World Models

World models — systems that learn an internal representation of an environment and predict how it evolves — sit at the convergence of the three most compute-hungry AI workloads: large-scale video processing, autoregressive sequence modeling, and reinforcement learning with planning. As these models move from research papers to production robots and vehicles, they expose a hardware gap that no existing chip was designed to fill. This article surveys the world model landscape, the silicon racing to serve it, and a concrete architecture proposal for what purpose-built world model hardware could look like.

The World Model Landscape

A world model takes in observations (camera frames, lidar sweeps, joint angles, language commands), compresses them into a latent state, predicts how that state evolves given candidate actions, and, critically, runs branching imagination rollouts to evaluate plans before committing to physical motion. The field has crystallized around four architecture families, each with a distinct hardware profile.

Architecture Families

Autoregressive transformers tokenize observations via VQ-VAE or similar codebooks and predict the next token in sequence. GAIA-1 (Wayve, 9.4B total parameters across a 6.5B world model, 2.6B video decoder, and 0.3B tokenizer) demonstrated emergent driving dynamics and counterfactual reasoning from video, text, and action tokens. Genie 2 and Genie 3 (Google DeepMind) generate interactive 3D worlds up to one minute long from a single image prompt. NVIDIA Cosmos scales to 14B parameters as a video-generation world foundation model targeting autonomous driving simulation on Blackwell GPUs. Autoregressive models scale naturally with compute but suffer a sequential generation bottleneck: each token depends on the previous one, making latency proportional to sequence length.

Diffusion models generate frames by iterative denoising. UniSim (Google DeepMind) simulates realistic experience from actions in diverse environments. DriveDreamer integrates HD maps and 3D bounding boxes as conditioning for precise scenario generation. The fidelity advantage is real, but the cost is 20 to 1,000 denoising steps per generated frame, multiplying compute by that factor relative to single-pass prediction — the same iterative cost that drives the VDX-1 chip proposal. For real-time planning, this is often prohibitive.

Recurrent state-space models (RSSMs) compress history into a fixed-size latent state and advance it one step at a time. DreamerV3 (12M to 400M parameters) famously collected a diamond in Minecraft from scratch using a single A100 over nine days, unrolling 16 imagination steps per training batch across thousands of parallel trajectories. MILE applies the same paradigm to end-to-end driving. RSSMs are more parameter-efficient and fundamentally cheaper per imagination step, but historically harder to scale to photorealistic generation.

JEPA (Joint Embedding Predictive Architecture), proposed by LeCun with concrete instantiations in I-JEPA and V-JEPA, predicts in abstract representation space without ever reconstructing pixels. This yields lightweight inference — a property exploited by the JEPA-R edge chip proposal — but requires downstream tasks to operate on frozen representations rather than generated images or video.

Key Systems at a Glance

System	Params	Architecture	Compute	Achievement
DreamerV3	12M–400M	RSSM (GRU + stochastic latent)	1 A100, 9 days	Minecraft diamond from scratch
GAIA-1	9.4B (6.5B world model)	Autoregressive transformer	Multi-GPU cluster	Emergent driving dynamics
Genie 2/3	Undisclosed (large)	Tokenizer + dynamics + decoder	High-end GPUs	1-minute interactive 3D worlds
Cosmos	2B–14B	Video generation platform	Blackwell GPUs	30-second predictive video
UniSim	Large	Diffusion	Multi-GPU	Simulated experience from actions
pi0	3.3B	VLA with flow matching	Single GPU	50 Hz robot control across 7 embodiments

Why World Models Break Existing Hardware

Five properties combine to make world models uniquely demanding:

Prediction-imagination-planning loops. Each training or inference cycle interleaves encoding, multi-step forward rollouts, reward estimation, and action selection — a deeply heterogeneous pipeline.
Video-scale I/O. Cosmos and Genie operate on high-resolution video with hundreds of frames of temporal context.
Multi-step diffusion. Diffusion-based world models multiply per-frame compute by the number of denoising steps.
Model scale. GAIA-1 at 9.4B (total system) and Cosmos at 14B parameters rival the largest language models.
Continuous learning. Reinforcement learning settings interleave data collection with model training and imagination-based policy updates, demanding sustained throughput rather than burst compute.

Figure: World model pipeline (Sense → Imagine → Plan). Input (6 cameras, LiDAR, IMU) → Sensor Front End (ViT encode, 196 TFLOPS) → Prediction Engine (SSM scan, 50 µs per next-state step) → Imagination Engine (32 rollout units, 640K steps/sec) → Planning Unit (MCTS, 300 iterations/ms, best path) → motor commands. Stage labels: Encode, Predict, Imagine (branching), Plan, Act.

Hardware for Planning and Search

Planning — selecting actions by imagining their consequences — is the operation that most sharply distinguishes world model inference from standard neural network inference. Two paradigms dominate, each with a distinctive hardware profile.

MCTS with Learned Models: The MuZero Legacy

The AlphaGo version that defeated Lee Sedol used 1,920 CPUs and 280 GPUs and reached an Elo of 3,739. AlphaGo Master reduced that to 4 TPUs while achieving a far higher Elo of 4,858. MuZero generalized the approach to games without known rules, using 16 TPU v3s for training and 1,000 TPUs for self-play, running 800 simulations per move. Each simulation requires a dynamics-network forward pass at every tree node expansion, making MCTS fundamentally limited by inference latency and throughput.

Gumbel MuZero directly addresses this hardware bottleneck: by replacing PUCT with Gumbel-based sampling without replacement, it significantly improves performance when planning with few simulations — reducing the number of forward passes required per decision by an order of magnitude in some settings. This is a software optimization designed around hardware constraints.
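The sampling-without-replacement idea can be illustrated with the Gumbel-top-k trick: perturb each action logit with independent Gumbel noise and keep the k largest. The following is a minimal NumPy sketch of that single step, not the full Gumbel MuZero procedure; the function name, the example policy, and k are illustrative assumptions.

```python
import numpy as np

def gumbel_top_k(logits, k, rng):
    """Sample k distinct action indices via the Gumbel-top-k trick."""
    # Add independent Gumbel(0, 1) noise to every logit.
    perturbed = logits + rng.gumbel(size=logits.shape)
    # The indices of the k largest perturbed logits are a sample
    # without replacement from the softmax distribution over logits.
    return np.argsort(-perturbed)[:k]

rng = np.random.default_rng(0)
policy_logits = np.log(np.array([0.5, 0.2, 0.15, 0.1, 0.05]))  # hypothetical prior
candidates = gumbel_top_k(policy_logits, k=3, rng=rng)
print(candidates)
```

Only the selected candidates would then receive simulations, which is why the forward-pass count per decision can shrink.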

The structural problem remains: MCTS tree management logic runs on CPUs while neural evaluation runs on accelerators. The serialization between them creates an imagination bottleneck where the accelerator is idle during tree traversal and the CPU is idle during neural evaluation.

MPC-Style Planning: CEM and MPPI

Cross-Entropy Method (CEM) and Model Predictive Path Integral (MPPI) planning sample hundreds of candidate action sequences in parallel and roll them forward through learned dynamics. This is more GPU-friendly than MCTS: no tree structure, just a large batch of independent rollouts. TD-MPC2 scales to 317M parameters across 80 tasks using this approach.

The hardware profile is memory-bandwidth-bound rather than latency-bound: storing and scoring hundreds of parallel trajectories dominates over the per-step compute cost.
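As a rough illustration of the batch-of-rollouts pattern, here is a minimal CEM planner in NumPy. The one-dimensional integrator is a stand-in for a learned dynamics model, and every name and hyperparameter (horizon, sample count, elite count, cost weights) is an assumption chosen for the sketch rather than something taken from TD-MPC2.

```python
import numpy as np

def rollout_costs(actions, x0=0.0, goal=1.0):
    """Roll all candidate action sequences forward in one batch."""
    x = np.full(actions.shape[0], x0)
    cost = np.zeros(actions.shape[0])
    for t in range(actions.shape[1]):
        x = x + 0.1 * actions[:, t]  # toy dynamics standing in for a learned model
        cost += (x - goal) ** 2 + 0.01 * actions[:, t] ** 2
    return cost

def cem_plan(horizon=10, samples=256, elites=32, iters=5, seed=0):
    """Refit a Gaussian over action sequences to the lowest-cost samples."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        actions = rng.normal(mean, std, size=(samples, horizon))
        costs = rollout_costs(actions)
        elite = actions[np.argsort(costs)[:elites]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

plan = cem_plan()
plan_cost = rollout_costs(plan[None, :])[0]
idle_cost = rollout_costs(np.zeros((1, 10)))[0]
print(plan_cost, idle_cost)
```

The whole candidate batch lives in memory at every step, which is where the trajectory-storage pressure described above comes from.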

Planning Hardware Requirements

Method	Compute Pattern	Key Bottleneck
MCTS (MuZero)	Sequential tree expansion, batched leaf eval	Inference latency per simulation
CEM/MPPI	Massively parallel independent rollouts	Memory bandwidth for trajectory storage
Dreamer imagination	Batched latent rollouts with backprop	Latent state memory for long horizons
Ensemble models	K parallel independent forward passes	Memory for K model copies

Real-Time Robotics: The Edge Constraint

World models for physical systems must run on the robot, not in the cloud. Network round-trip latency (20–100 ms) exceeds the sub-10 ms inference budget required for dexterous manipulation.

The Latency Budget

Domain	Control Frequency	Latency Budget
Dexterous manipulation	50–200 Hz	5–20 ms sensor-to-action
Mobile navigation	10–50 Hz	20–100 ms
Humanoid locomotion	200–500 Hz	2–5 ms (inner loop classical)
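The budgets in the table are simply the control periods implied by each frequency range. A short sketch of that arithmetic (variable names are illustrative):

```python
def period_ms(hz):
    """Control period in milliseconds for a given control frequency."""
    return 1000.0 / hz

# (low Hz, high Hz) per domain, copied from the table above.
domains = {
    "Dexterous manipulation": (50, 200),
    "Mobile navigation": (10, 50),
    "Humanoid locomotion": (200, 500),
}

# The fastest rate gives the tightest budget, the slowest the loosest.
budgets = {name: (period_ms(hi), period_ms(lo)) for name, (lo, hi) in domains.items()}
for name, (tight, loose) in budgets.items():
    print(f"{name}: {tight:g} to {loose:g} ms")
```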

pi0 (Physical Intelligence) represents the current state of the art: 3.3B parameters, 10 forward Euler steps per action chunk, 50 Hz output via action chunking on a single GPU. It controls seven different robot embodiments with a unified model.

Current Edge Hardware

Platform	AI Performance	Power	Target
Jetson AGX Orin	275 TOPS	15–60 W	Heavy manipulation, mobile robots
Jetson Thor	2,070 TFLOPS (FP4)	40–130 W	Humanoids, VLA models
Tesla FSD HW3	144 TOPS (dual)	72 W	Autonomous driving
Tesla FSD HW4	~300–500 TOPS	100 W	Next-gen driving
Tesla AI5 (planned)	~3,000–5,000 TOPS	800 W	Optimus humanoid, L4/L5 driving
Google Edge TPU	4 TOPS	2 W	Lightweight perception only

The gap that matters: Tesla considers HW4 insufficient for autonomous Optimus operation. Humanoid-grade world models plausibly require 500+ TOPS at 15 W with deterministic latency. That chip does not exist. Jetson Thor at 2,070 TFLOPS comes closest but draws 40—130 W — too much for a battery-powered humanoid running all day. AI5 at 800 W is a datacenter-in-a-car, not a mobile robotics solution. The industry needs an order-of-magnitude improvement in TOPS-per-watt for world model inference at the edge.

Multi-Modal Fusion Hardware

World models for embodied AI must ingest camera frames (30 Hz), lidar sweeps (10–20 Hz), IMU and joint encoders (100–1,000 Hz), language commands (sporadic), and proprioception simultaneously. Each modality arrives at a different rate, in a different representation, and with a different token count. A vehicle running ViT-Large on 8 cameras at 30 Hz demands approximately 14.8 TFLOPS just for vision encoding, before any cross-modal attention, prediction, or planning.
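The vision-encoding figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a per-image cost of about 61.6 GFLOPs, a commonly quoted number for ViT-L/16 at 224x224 input; that constant is an assumption, not a figure from this article, and higher camera resolutions would raise it considerably.

```python
# Assumed per-image forward cost for ViT-L/16 at 224x224 (not from the article).
GFLOPS_PER_IMAGE = 61.6

def vision_encode_tflops(cameras, frame_rate_hz, gflops_per_image=GFLOPS_PER_IMAGE):
    """Sustained TFLOPS needed to encode every frame from every camera."""
    return cameras * frame_rate_hz * gflops_per_image / 1000.0

tflops = vision_encode_tflops(cameras=8, frame_rate_hz=30)
print(f"{tflops:.1f} TFLOPS")
```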

Production Multi-Modal SoCs

Tesla FSD HW3 (2019) remains the canonical example of purpose-built multi-modal driving silicon. It is fabricated on Samsung 14 nm with approximately 260 mm² of die area. Each SoC has two independent NNAs, each with a 96x96 multiply-accumulate array operating on 8-bit integers at 2 GHz, delivering 36 TOPS per NNA (72 TOPS per SoC). The key architectural insight is 32 MB of SRAM per NNA, large enough to hold entire neural network layer weights on-die, avoiding DRAM round-trips and achieving near-peak MAC utilization. A dedicated ISP handles 8 cameras simultaneously at 2.5 Gpixels/s. Dual-redundant SoCs on one board provide functional safety: if both agree, the action executes; disagreement triggers fallback.
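The per-NNA throughput follows directly from the array size and clock, counting each multiply-accumulate as two operations. A small sketch of the arithmetic (the function name is illustrative):

```python
def mac_array_peak_tops(rows, cols, clock_hz, ops_per_mac=2):
    """Peak throughput of a MAC array in tera-operations per second."""
    return rows * cols * ops_per_mac * clock_hz / 1e12

per_nna = mac_array_peak_tops(96, 96, 2e9)  # one 96x96 array at 2 GHz
per_soc = 2 * per_nna                       # two NNAs per SoC
per_board = 2 * per_soc                     # dual-redundant SoCs on one board
print(f"{per_nna:.1f} / {per_soc:.1f} / {per_board:.1f} TOPS")
```

The exact products come out slightly above the rounded 36, 72, and 144 TOPS figures quoted in the text and tables.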

HW4 (2023) moved to Samsung 7 nm, doubled RAM to 16 GB, and added high-definition radar. Musk claimed 3—8x HW3 compute, enabling end-to-end neural network processing of all cameras simultaneously rather than per-camera feature extraction. AI5 (estimated 2027) is the generational leap, consuming up to 800 W — an acknowledgment that world model inference at L4/L5 is fundamentally more compute-hungry than the convolutional perception stacks of earlier hardware.

NVIDIA DRIVE Thor (2025+) represents a different philosophy: a general-purpose transformer accelerator for automotive. TSMC 4NP Blackwell architecture, 2,560 CUDA cores, 1,000 sparse INT8 TOPS, and critically 128 GB of LPDDR5X — four times Orin’s memory, explicitly sized to hold billion-parameter transformer world models in-vehicle. The Transformer Engine with native FP8 support targets the attention-heavy BEV transformer architectures that Orin struggles to run at full resolution. Two Thor SoCs interconnect via NVLink-C2C for 2,000 TOPS configurations.

Mobileye EyeQ6 to EyeQ Ultra traces a different path: specialized, power-efficient accelerators. EyeQ6 ships at 16 TOPS on 7 nm targeting L2+; EyeQ Ultra at 176 TOPS on 5 nm targets L4 with dedicated accelerators for deep learning, classical computer vision, and general-purpose compute on a single die. Over 27 OEMs use EyeQ silicon.

Apple Vision Pro R1 offers a template outside driving. A dedicated sensor fusion co-processor handles 12 cameras, 5 sensors, and 6 microphones simultaneously at 12 ms photon-to-photon latency. By isolating sensor fusion on a chip with its own hard real-time OS, Apple guarantees that application load on the M2 never causes sensor fusion jitter. This dual-chip isolation pattern — a dedicated real-time co-processor alongside a general-purpose application processor — is the cleanest existing solution to the asynchronous multi-modal input problem.

Recurring Hardware Patterns

Four patterns recur across production multi-modal SoCs:

Dedicated ISP/sensor pipeline + central neural accelerator. Preprocessing (demosaic, distortion correction, point cloud voxelization) runs on fixed-function hardware while the neural network runs on a programmable accelerator. Tesla and NVIDIA DRIVE both use this.
Dual-chip isolation for real-time guarantees. A dedicated sensor fusion chip guarantees hard real-time deadlines independent of the main application processor (Apple R1 pattern).
Large on-chip SRAM to avoid the DRAM bottleneck. Tesla’s 32 MB per NNA and NVIDIA’s DLA SRAM buffers keep weights on-die. Future multi-modal chips will likely push to 64—128 MB.
FP8 transformer engines. NVIDIA Thor and Hopper both feature FP8 arithmetic that halves memory footprint and doubles throughput versus FP16, directly enabling larger cross-modal attention windows within the same power envelope.

The BEV Transformer Bottleneck in Autonomous Driving

BEVFormer established the paradigm for modern autonomous driving perception: use spatial cross-attention to lift multi-camera 2D features into a unified bird’s-eye-view representation, then apply temporal self-attention across history frames. The result is a dense 3D understanding of the scene suitable for occupancy prediction and trajectory planning.

The hardware bottleneck is attention scale. A 200x200 BEV grid with 8 history frames produces 320,000+ tokens per frame — far beyond NLP-scale attention. Solutions include deformable attention (attending to sparse learned reference points), sparse queries (VoxFormer, StreamPETR), and temporal compression into fixed-size latents. Tesla’s occupancy networks push this further: dense 3D voxel grids (200x200x16 at 0.5 m resolution) updated at 10 Hz across 8 cameras require billions of MACs per second.
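A quick calculation shows why dense attention is impractical at this scale and why sparse variants help. The number of sampled points per query below is a hypothetical value for illustration, not a BEVFormer setting.

```python
bev_height, bev_width, history_frames = 200, 200, 8

# Tokens in the spatio-temporal BEV representation.
tokens = bev_height * bev_width * history_frames

# Dense self-attention scores every token against every other token.
dense_pairs = tokens * tokens

# Deformable attention scores each query against a few sampled points.
points_per_query = 32  # hypothetical value
sparse_pairs = tokens * points_per_query

# Dense occupancy grid from the text: 200 x 200 x 16 voxels.
occupancy_voxels = 200 * 200 * 16

print(tokens, dense_pairs, sparse_pairs, occupancy_voxels)
```

Under these assumptions the sparse variant evaluates 10,000 times fewer query-key pairs than dense attention.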

These workloads explain the 10—20x compute escalation from current shipping hardware (Orin at 254 TOPS, HW4 at ~400 TOPS) to next-generation silicon (Thor at 2,000 TOPS, AI5 at an estimated 3,000—5,000 TOPS). The architecture has shifted from CNN-era perception to transformer-era world modeling, and the silicon must follow.

Autonomous Driving Chip Comparison

Chip	TOPS (INT8)	Process	Power	Memory	Target Level	Status
Tesla HW3	144 (dual)	14 nm	72 W	8 GB	L2–L3	Shipping since 2019
Tesla HW4	~300–500	7 nm	100 W	16 GB	L3–L4	Shipping since 2023
Tesla AI5	~3,000–5,000	4–5 nm est.	800 W	TBD	L4–L5	Est. 2027
NVIDIA Orin	254 (sparse)	8 nm	100 W	32 GB	L2–L4	Shipping since 2022
NVIDIA Thor	1,000 (sparse; 2,000 dual)	TSMC 4NP	~300–500 W	128 GB	L3–L5	Sampling 2025
Mobileye EyeQ6	16	7 nm	60 W	LPDDR5X	L2+	Shipping 2024
Mobileye EyeQ Ultra	176	5 nm	~100 W	LPDDR5X	L4	2025–2026
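Efficiency in TOPS per watt can be derived from the table. The sketch below treats each entry as a (low, high) range and pairs the lowest TOPS with the highest power for the pessimistic bound. Pairing Thor's dual-SoC 2,000 TOPS figure with the table's power estimate is an assumption made for this calculation.

```python
# ((TOPS low, TOPS high), (watts low, watts high)), taken from the table above.
chips = {
    "Tesla HW3": ((144, 144), (72, 72)),
    "Tesla HW4": ((300, 500), (100, 100)),
    "Tesla AI5": ((3000, 5000), (800, 800)),
    "NVIDIA Orin": ((254, 254), (100, 100)),
    "NVIDIA Thor (dual)": ((2000, 2000), (300, 500)),  # assumed pairing
    "Mobileye EyeQ6": ((16, 16), (60, 60)),
    "Mobileye EyeQ Ultra": ((176, 176), (100, 100)),
}

def tops_per_watt(tops_range, watts_range):
    """Return (pessimistic, optimistic) TOPS/W for the given ranges."""
    (tops_lo, tops_hi), (watts_lo, watts_hi) = tops_range, watts_range
    return tops_lo / watts_hi, tops_hi / watts_lo

efficiency = {name: tops_per_watt(t, w) for name, (t, w) in chips.items()}
for name, (lo, hi) in efficiency.items():
    print(f"{name}: {lo:.2f} to {hi:.2f} TOPS/W")
```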

Figure: Autonomous driving chip comparison, with bars for TOPS (INT8), power (W), and TOPS/W per chip. TOPS/W shown: Tesla HW3 2.0 (144 TOPS, 72 W); Tesla HW4 5.0 (~500 TOPS, 100 W); Tesla AI5 3.8–6.3 (planned, ~3,000–5,000 TOPS, 800 W); NVIDIA Thor ~4.0–6.7 (2,000 TOPS, ~300–500 W est.); ATLAS 5.8–7.8 (467 TOPS INT8, 60–80 W). Callouts: estimated utilization of 60–80% on world model workloads for ATLAS versus under 10% for a GPU; BEV transformers need 320K+ tokens per frame (200x200 BEV grid x 8 history frames).

SSM and Mamba: Reshaping the Imagination Engine

The transformer’s O(L^2) attention cost over a length-L sequence and its ever-growing KV cache make it increasingly awkward for the operation world models do most: stepping forward in time, step after step, for thousands or millions of imagination steps during planning. State space models offer a fundamentally different trade: O(L) total complexity and constant memory per step, at the cost of a compressed fixed-size state rather than full-context access.

The Core Architectures

S4 (Gu et al., 2021) introduced three computational views of the same linear system: continuous (natural for signals), recurrent (O(1) per step at inference), and convolutional (O(L log L) via FFT for training). The convolutional view enabled fast GPU training, but required time-invariant parameters.

Mamba (S6, Gu and Dao, 2023) made SSM parameters input-dependent, enabling content-based selection analogous to attention — but with a compressed fixed-size state rather than full context storage. The hardware-aware implementation loads SSM parameters from HBM to SRAM, performs the entire selective scan in SRAM, and writes only final outputs back — avoiding materializing the expanded state tensor. This kernel fusion trick mirrors FlashAttention’s strategy for avoiding the O(n^2) attention matrix.

Mamba-2 (Dao and Gu, 2024) revealed the State Space Duality: SSMs and a class of structured attention are mathematically equivalent. The practical breakthrough is a chunk-wise matmul algorithm that hits tensor cores — fixing Mamba-1’s critical hardware weakness. Mamba-1’s selective scan used element-wise operations that could not exploit tensor cores. On H100, the gap between matmul throughput (989 TFLOPS BF16) and element-wise throughput (67 TFLOPS FP32) means Mamba-1 was leaving over 90% of the chip’s peak compute on the table. Mamba-2 splits sequences into chunks of 64—128, computes intra-chunk outputs via batched matmul (tensor-core-friendly), passes states across chunk boundaries via a small scan on the reduced sequence, and finishes with another batched matmul. The result: 2—8x faster than Mamba-1, and the state dimension can grow from N=16 to N=64—256+ without penalty.
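The chunk-wise idea can be shown on a simplified single-channel SSM with a scalar decay per step. The sketch below is not the Mamba-2 kernel: it is a NumPy illustration, under assumed shapes and names, that intra-chunk outputs can be computed with matrix products while only a small state crosses chunk boundaries, and that the result matches the step-by-step recurrence.

```python
import numpy as np

def sequential_scan(a, B, C, x):
    """Reference recurrence: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    h = np.zeros(B.shape[1])
    y = np.empty(len(x))
    for t in range(len(x)):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def chunked_scan(a, B, C, x, chunk=64):
    """Same recurrence, computed one chunk at a time with matrix products."""
    h = np.zeros(B.shape[1])
    y = np.empty(len(x))
    for s in range(0, len(x), chunk):
        e = min(s + chunk, len(x))
        cum = np.cumprod(a[s:e])                      # decay from chunk start through step i
        decay = np.tril(cum[:, None] / cum[None, :])  # decay from step j to step i, for j <= i
        scores = (C[s:e] @ B[s:e].T) * decay          # attention-like intra-chunk matrix
        y[s:e] = scores @ x[s:e] + cum * (C[s:e] @ h) # intra-chunk term plus carried-in state
        h = cum[-1] * h + B[s:e].T @ ((cum[-1] / cum) * x[s:e])  # state for the next chunk
    return y

rng = np.random.default_rng(0)
length, state_dim = 300, 16
a = rng.uniform(0.9, 1.0, size=length)       # per-step scalar decay
B = rng.normal(size=(length, state_dim))
C = rng.normal(size=(length, state_dim))
x = rng.normal(size=length)
match = np.allclose(sequential_scan(a, B, C, x), chunked_scan(a, B, C, x))
print(match)
```

In this sketch the chunk-to-chunk state hand-off is a plain Python loop; the text above describes it as a small scan over the reduced sequence.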

SSMs for World Models: The R2I Breakthrough

R2I (Recall to Imagine) demonstrated the payoff concretely: integrating S4/S5-family SSMs into DreamerV3’s world model achieved up to 9x faster training than the GRU-based original. The constant-state property is what makes this possible:

Property	Transformer	SSM (Mamba-class)
Per-step inference FLOPs	O(L) attention + O(1) FFN, L growing with history	O(1) state update
Per-step memory	O(L) KV cache per layer, growing linearly	O(N) fixed state per layer
10,000-step imagination	O(H^2) total attention FLOPs, ~5 GB KV cache	O(H) total FLOPs, ~1 MB constant
1M-step simulation	Infeasible without windowing	Straightforward

R2I handled 4,000-step episodes in Memory Maze, dramatically beyond GRU-based DreamerV3’s capacity, and achieved superhuman performance on complex memory tasks. A robot running a world model at 50 Hz with 200-step planning horizons needs roughly 10,000 imagination steps per second (50 x 200). SSMs deliver this with constant memory, which is feasible on embedded hardware where a transformer would blow through memory budgets.
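The memory contrast in the table can be reproduced with a rough sizing model. All dimensions below (32 layers, model width 4,096, 1,024 SSM channels, state dimension 16, 2 bytes per value) are illustrative assumptions picked to land near the table's orders of magnitude; they do not describe any specific model.

```python
def kv_cache_bytes(steps, layers=32, d_model=4096, bytes_per_value=2):
    """KV cache size: one key and one value vector per layer per step."""
    return 2 * layers * d_model * bytes_per_value * steps

def ssm_state_bytes(layers=32, channels=1024, state_dim=16, bytes_per_value=2):
    """SSM state size: fixed per layer, independent of sequence length."""
    return layers * channels * state_dim * bytes_per_value

steps_per_second = 50 * 200  # 50 Hz control with 200-step planning horizons

for steps in (100, 10_000, 1_000_000):
    kv_gb = kv_cache_bytes(steps) / 1e9
    ssm_mb = ssm_state_bytes() / 1e6
    print(f"{steps:>9} steps: KV cache {kv_gb:10.3f} GB, SSM state {ssm_mb:.2f} MB")
```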

Figure: Memory footprint comparison, transformer KV cache versus SSM fixed state. The transformer KV cache grows O(L) per layer with sequence length (about 5 GB at 10K steps, overflowing memory by 1M steps), while the SSM (Mamba) state is O(N) and stays at about 1 MB at any sequence length. Callouts: a fixed-size state means the planning horizon is limited only by compute time, not memory; R2I trains up to 9x faster than DreamerV3 using an SSM world model.

Hybrid Architectures

Pure SSMs struggle with precise retrieval from long context (the “copying” problem). This motivates hybrid designs. Jamba (AI21) interleaves transformer and Mamba layers with mixture-of-experts, supporting a 256K context window and fitting up to 140K tokens of context on a single 80 GB GPU. Samba (Microsoft) combines Mamba with sliding window attention, achieving 3.73x higher throughput than transformers at 128K context while extrapolating from 4K training length to 256K with perfect memory recall. Zamba (Zyphra) uses a Mamba backbone with a single shared attention module at 7B parameters. These hybrids create heterogeneous workloads: SSM layers are scan-heavy (Mamba-1) or chunk-matmul (Mamba-2), while attention layers are standard GEMM-heavy. A chip serving these models needs both efficient scan units and high-throughput matmul engines.

Hardware for SSM Inference

SSM inference, like transformer inference, is dominated by memory bandwidth — the bottleneck is loading model weights from HBM, not computing the scan. But SSMs eliminate the KV cache entirely, enabling larger batch sizes in the same memory footprint and constant cost per step regardless of sequence length. The emerging FPGA accelerator literature (SSMA, LightMamba, FastMamba, SpecMamba, MambaOPU, MARCA-v2, HCSAs, eMamba — eight papers in 2024—2025 alone) converges on common themes: INT8/INT4 quantized states, custom parallel scan units replacing general-purpose ALUs, on-chip state buffers that hold the SSM state permanently in SRAM, and reconfigurable dataflow to handle input-dependent control paths.

The case for custom silicon is clear: a chip optimized for SSM-based world model inference would prioritize large on-chip SRAM for model weights (not KV cache — there is none), dedicated hardwired scan units, and high HBM bandwidth for parameter streaming. This is architecturally quite different from a transformer inference chip.

ATLAS: A Purpose-Built World Model Processor

No existing chip is designed for the world model workload. GPUs are compute-overprovisioned and memory-bandwidth-starved for the small, sequential state updates that dominate imagination rollouts — a 10,000-step rollout at batch-1 utilizes less than 5% of an H100’s tensor cores. Driving SoCs (Tesla HW4, Mobileye EyeQ6) are optimized for perception, not recurrent multi-step prediction. LLM inference chips (Groq LPU, Etched Sohu) have no multi-modal input pipelines, no parallel rollout scheduling, and no mechanism for branching tree search.

ATLAS (Autonomous Terrain Learning and Simulation Processor) is a concrete architecture proposal for a chip purpose-built for the three phases of world model operation: sense (multi-modal encoding), imagine (parallel rollout generation), and plan (tree search over imagined futures).

Architecture Overview

A single-die SoC at approximately 450 mm^2 on TSMC N3E, comprising five functional blocks connected by a deterministic on-die mesh network:

Sensor Front End — Four 128x128 BF16 systolic array matrix tiles (aggregate ~196 TFLOPS BF16) with 256 KB SRAM per tile. A dedicated ISP handles 6 cameras at 30 fps. A hardware voxelizer converts lidar point clouds to BEV tokens. A proprioception FIFO handles IMU and joint encoders at 200 Hz. A cross-modal tokenizer projects all modalities into a shared D=1024 embedding space. The full sensor-to-latent pipeline completes in under 10 ms.

Prediction Engine — Two dedicated SSM scan units implementing hardware-wired parallel prefix scan for Mamba-2/S5-class models at state dimension N=256 across D=1024 channels. Two small attention tiles (64x64 BF16 systolic arrays) handle the occasional attention layers in hybrid SSM-attention models. Four MB of state SRAM holds the full SSM hidden state permanently on-die. Eight MB of weight cache streams model layers. Per-step latency: approximately 50 microseconds at N=256, D=1024 — dominated by parameter loads from the weight cache, confirming the memory-bound analysis.

Imagination Engine — 32 parallel rollout units (RUs), each containing a compact scan unit (N=64, D=256 — a 4x-compressed version of the full model), 128 KB state SRAM, and a scalar ALU for reward computation. Four MB of shared SRAM holds the compressed rollout model (approximately 50M parameters at FP8) entirely on-chip, making rollout steps weight-stationary with zero HBM access. At 50 microseconds per step per RU, 32 RUs running 50-step rollouts in parallel deliver 640,000 imagination steps per second — sufficient for real-time MCTS with 64 rollouts of 50 steps every 20 ms control cycle.
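The throughput claim is straightforward arithmetic on the stated unit count and step latency. The sketch below also estimates how long 64 rollouts would take if they were run as two waves across the 32 units; that wave-based scheduling is an assumption, not something the proposal specifies.

```python
import math

rollout_units = 32
step_latency_us = 50       # per step, per rollout unit
steps_per_rollout = 50

# Aggregate throughput with every unit busy.
steps_per_second = rollout_units * 1_000_000 // step_latency_us

# Wall-clock time for one wave of 32 parallel rollouts.
wave_ms = steps_per_rollout * step_latency_us / 1000

# 64 rollouts need two waves on 32 units (assumed scheduling).
waves = math.ceil(64 / rollout_units)
total_ms = waves * wave_ms

print(steps_per_second, wave_ms, total_ms)
```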

Figure: Imagination Engine parallel rollout array. A shared latent world state s(t) (D=256, N=64) fans out to 32 rollout units running branching imagination in parallel, each shown as active (imagining) or idle (available). Headline figures: 640,000 imagination steps per second at 50 µs per SSM step.

Planning Unit — Two RISC-V cores (RV64GCV with vector extensions) handle control-flow-heavy planning logic. A hardware MCTS accelerator with 4K-node tree memory performs UCB selection via a 32-wide comparator tree in a single cycle and value backpropagation in a 4-cycle pipeline. A hardware CEM sampler produces 32 Gaussian candidate action sequences per cycle with top-K fitness selection. One MCTS iteration (select, expand, rollout, backprop) takes approximately 3 microseconds; in 1 ms of planning budget, the unit completes approximately 300 iterations.
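In software, the selection step that the comparator tree accelerates is an argmax over per-child scores. The sketch below uses the classic UCB1 formula as a stand-in, since the proposal does not say which UCB variant is intended; the exploration constant and the example statistics are illustrative.

```python
import math

def ucb_select(visit_counts, value_sums, c=1.4):
    """Return the index of the child with the highest UCB1 score."""
    total = sum(visit_counts)

    def score(i):
        if visit_counts[i] == 0:
            return math.inf  # unvisited children are tried first
        mean = value_sums[i] / visit_counts[i]
        return mean + c * math.sqrt(math.log(total) / visit_counts[i])

    return max(range(len(visit_counts)), key=score)

# One child is still unvisited, so it is selected.
first = ucb_select([10, 5, 0], [6.0, 4.0, 0.0])
# Equal visit counts, so the child with the best mean value wins.
second = ucb_select([10, 10, 10], [6.0, 8.0, 3.0])

# Iterations that fit in a 1 ms planning budget at about 3 microseconds each.
iterations_per_ms = 1000 // 3
print(first, second, iterations_per_ms)
```

The integer division gives 333 iterations, which the text rounds down to approximately 300.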

State Memory — 64 MB unified SRAM partitioned dynamically between model weights, persistent world state, and temporary buffers. Two HBM3E stacks provide 16 GB at 2 TB/s aggregate bandwidth for overflow weights, replay buffers, and map data.

The Two-Mode Control Cycle

ATLAS operates in two modes within each 20 ms control cycle:

Mode 1 — Sense and Predict (approximately 10—12 ms): Sensor Front End and Prediction Engine are active. Imagination Engine is power-gated. Ingest new sensor data, encode to latent representation, advance the world state by one step using the full-fidelity model. Compute character: GEMM-heavy (vision encoding) plus memory-bound sequential (SSM next-state).

Mode 2 — Imagine and Plan (approximately 8 ms): Imagination Engine (all 32 RUs) and Planning Unit are active. Sensor Front End matrix tiles can be repurposed as additional rollout capacity. Dispatch 32—64 candidate action sequences, roll out compressed world model in parallel, evaluate outcomes, run MCTS/CEM to select the best action. Compute character: embarrassingly parallel scan operations across 32+ independent state trajectories.

Performance Summary

Phase	Operation	Latency
Sense	6-camera ViT-Large encoding + lidar + fusion	8 ms
Predict	Full-model single-step state advance	1 ms
Imagine	32 rollouts x 50 steps (compressed model)	3 ms
Plan	300 MCTS iterations over rollout results	1 ms
Act	Action decode + safety check + bus write	0.5 ms
Total		13.5 ms (6.5 ms margin in 20 ms cycle)
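The budget in the table can be checked mechanically, which is useful when any single phase estimate changes. A minimal sketch (names are illustrative):

```python
CYCLE_MS = 20.0  # one control cycle at 50 Hz

# Phase latencies in milliseconds, from the table above.
phases = {"Sense": 8.0, "Predict": 1.0, "Imagine": 3.0, "Plan": 1.0, "Act": 0.5}

total_ms = sum(phases.values())
margin_ms = CYCLE_MS - total_ms
busy_fraction = total_ms / CYCLE_MS

print(f"total {total_ms} ms, margin {margin_ms} ms, busy {busy_fraction:.1%}")
```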

Figure: ATLAS control loop timeline on a 0 to 20 ms axis at 50 Hz. Sense 8 ms (6-camera ViT encoding), Predict 1 ms (SSM next-state), Imagine 3 ms (32 parallel rollouts), Plan 1 ms (MCTS/CEM selection), Act 0.5 ms (motor command), for 13.5 ms total and a 6.5 ms safety margin within the 20 ms budget.

Block	Peak INT8 TOPS	Typical Power
Sensor Front End (4 matrix tiles)	392	15 W
Prediction Engine (2 scan + 2 attn)	50	8 W
Imagination Engine (32 RUs)	24	12 W
Planning Unit (2 RISC-V + accel.)	1	3 W
State Memory + HBM + I/O	—	22 W
Total chip	~467 TOPS	~60 W typical, ~80 W peak

The raw TOPS number is modest compared to Thor’s 1,000 or projected AI5 figures. The design philosophy is fundamentally different: ATLAS maximizes utilization at the actual world model operating point (small batch, sequential prediction, parallel rollouts) rather than peak throughput on large-batch GEMM. Estimated utilization on world model workloads is 60—80% versus less than 10% for a GPU running the same workload.
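The utilization argument amounts to comparing peak throughput multiplied by achieved utilization. The sketch below applies the article's own estimates (60–80% for ATLAS, under 10% for a GPU-class part) to the peak figures quoted above. These are estimates rather than measurements, so the output is only indicative.

```python
def effective_tops(peak_tops, utilization):
    """Throughput actually delivered at a given utilization."""
    return peak_tops * utilization

atlas_low = effective_tops(467, 0.60)
atlas_high = effective_tops(467, 0.80)
gpu_ceiling = effective_tops(1000, 0.10)  # upper bound implied by "less than 10%"

print(f"ATLAS: {atlas_low:.1f} to {atlas_high:.1f} effective TOPS")
print(f"1,000-TOPS part at 10% utilization: {gpu_ceiling:.1f} effective TOPS")
```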

Comparison with Existing Silicon

Dimension	Tesla FSD HW4	NVIDIA Thor	Mobileye EyeQ Ultra	ATLAS
Primary workload	Perception + e2e driving	Perception + world model (general)	Perception + mapping	World model: sense-imagine-plan
Process	Samsung 7 nm	TSMC 4NP	5 nm	TSMC N3E
AI compute	~300–500 TOPS	1,000 TOPS	176 TOPS	467 TOPS
Parallel rollout units	None	None	None	32 dedicated
SSM scan hardware	None	None	None	2 full + 32 compact
Prediction latency (SSM step)	N/A	~1—5 ms (GPU inference)	N/A	~50 microseconds
Imagination throughput	N/A	N/A	N/A	640K steps/s
On-chip state SRAM	32 MB/NNA	~20 MB	~8 MB	64 MB unified + 4 MB/engine
Safety features	ASIL-D (dual redundancy)	ASIL-D (lockstep)	ASIL-B	ASIL-B/D (watchdog + lockstep)
Power	~100 W	~300–500 W est.	~100 W	60–80 W

The key differentiators: ATLAS is the only architecture with dedicated parallel rollout hardware, hardwired SSM scan units achieving 50-microsecond prediction steps, and a hardware MCTS accelerator — the three operations that define world model inference and that existing chips handle only through general-purpose programmable compute.

Scalability: Edge to Cloud

ATLAS is designed as a scalable building block:

Configuration	Dies	Power	TOPS (INT8)	Model Size	Rollouts	Use Case
Edge	1 (reduced, 8 RUs)	25 W	150	200M	8 x 30 steps	Drone, small robot
Standard	1 (full, 32 RUs)	60–80 W	467	500M–1B	32 x 50 steps	Autonomous vehicle, humanoid
Cloud	8 (UCIe chiplet link)	500 W	3,700	5–10B	256 x 200 steps	Simulation, digital twins

The edge configuration delivers 150 TOPS at 25 W (6 TOPS/W), which falls well short of the 500+ TOPS at 15 W target for battery-powered humanoids (roughly 33 TOPS/W). Closing that gap of about 5.5x in efficiency requires a further process shrink (N2 or A14) and aggressive voltage scaling, likely a 2028–2029 prospect.

Figure: ATLAS scalability, edge to cloud on one architecture. Edge (25 W, 8 rollout units, 150 TOPS INT8, 200M parameters, 8 x 30 steps; drones and small robots), Vehicle (80 W, 32 rollout units, 467 TOPS INT8, 500M–1B parameters, 32 x 50 steps; autonomous driving and humanoids), and Cloud (500 W, 8 dies over UCIe, 256 rollout units, 3,700 TOPS INT8, 5–10B parameters, 256 x 200 steps; simulation and digital twins). All three share the same ISA and software stack and differ in power envelope and rollout unit count.

Risks

The primary risk is betting on SSM-based world models when the architecture landscape is still evolving. Mitigation: scan units occupy less than 10% of die area. The matrix tiles and attention tiles handle transformer workloads competently, and the chip degrades gracefully to a conventional neural accelerator. The compressed rollout model may also prove too inaccurate for safety-critical planning; the prediction engine can run rollouts sequentially at full fidelity as a fallback. Software ecosystem bootstrapping without CUDA is the hardest commercial challenge, addressed through an MLIR-based compiler accepting standard PyTorch/JAX model definitions.

The Convergence Thesis

The autonomous driving and robotics industries are converging on a common architecture: multi-camera BEV transformers feeding an occupancy-based world model that jointly predicts the future and plans trajectories. The compute requirements — real-time attention over 320,000+ spatial-temporal tokens, 3D voxel prediction, multi-agent trajectory simulation, and branching imagination rollouts — explain the 10—20x compute escalation from current shipping hardware to next-generation silicon.

The winners will be determined not just by raw TOPS but by TOPS per watt at the right precision (FP8/INT8 for transformers, INT4 for occupancy and compressed rollouts), memory bandwidth (BEV temporal buffers and parameter streaming are bandwidth-hungry), and effective utilization on the actual workload (most chips waste 90%+ of peak throughput on world model inference). Tesla bets on vertical integration and fleet data. NVIDIA bets on ecosystem breadth and CUDA lock-in. Mobileye bets on power efficiency. Purpose-built architectures like ATLAS bet that the world model workload is different enough from both LLM inference and perception that it deserves its own silicon.

The hardware world is at an inflection point. Transformer-optimized accelerators have dominated for five years. SSMs demand a different balance: less emphasis on KV cache, more on memory bandwidth and efficient scan primitives — a shift that connects to the nonlinear silicon thesis, where dynamical systems process continuous signals natively via analog ODE solvers rather than discretized digital scan operations. The eight FPGA accelerator papers published in 2024—2025 alone suggest the research community recognizes this gap. The first SSM-optimized ASICs for world model inference — combining high-bandwidth parameter streaming, on-chip state buffers, hardwired parallel scan units, and dedicated imagination rollout arrays — may define the next generation of embodied AI hardware.

Research compiled 2026-04-30. Specifications for unreleased products (Tesla AI5, NVIDIA Thor, Mobileye EyeQ Ultra) are based on public announcements and may change before production. ATLAS is a research architecture proposal, not a shipping product.

Additional Reading

World Models — Ha & Schmidhuber 2018
DreamerV3 — Hafner et al.
MuZero — Schrittwieser et al.
A Path Towards Autonomous Machine Intelligence (JEPA) — LeCun 2022
Genie: Generative Interactive Environments — Bruce et al., DeepMind
NVIDIA Cosmos — NVIDIA (77 authors)
Mamba — Gu & Dao 2023
pi0: Vision-Language-Action Flow Model — Black et al., Physical Intelligence
NVIDIA Jetson Thor — NVIDIA
GAIA-1: A Generative World Model for Autonomous Driving — Hu et al., Wayve
R2I: Mastering Memory Tasks with World Models — Samsami et al.
Mamba-2: State Space Duality — Dao & Gu 2024
