Custom Chips for World Models

World models — systems that learn an internal representation of an environment and predict how it evolves — sit at the convergence of the three most compute-hungry AI workloads: large-scale video processing, autoregressive sequence modeling, and reinforcement learning with planning. As these models move from research papers to production robots and vehicles, they expose a hardware gap that no existing chip was designed to fill. This article surveys the world model landscape, the silicon racing to serve it, and a concrete architecture proposal for what purpose-built world model hardware could look like.


The World Model Landscape

A world model takes in observations (camera frames, lidar sweeps, joint angles, language commands), compresses them into a latent state, predicts how that state evolves given candidate actions, and, critically, runs branching imagination rollouts to evaluate plans before committing to physical motion. The field has crystallized around four architecture families, each with a distinct hardware profile.

Architecture Families

Autoregressive transformers tokenize observations via VQ-VAE or similar codebooks and predict the next token in sequence. GAIA-1 (Wayve, 9.4B total parameters across a 6.5B world model, 2.6B video decoder, and 0.3B tokenizer) demonstrated emergent driving dynamics and counterfactual reasoning from video, text, and action tokens. Genie 2 and Genie 3 (Google DeepMind) generate interactive 3D worlds up to one minute long from a single image prompt. NVIDIA Cosmos scales to 14B parameters as a video-generation world foundation model targeting autonomous driving simulation on Blackwell GPUs. Autoregressive models scale naturally with compute but suffer a sequential generation bottleneck: each token depends on the previous one, making latency proportional to sequence length.

Diffusion models generate frames by iterative denoising. UniSim (Google DeepMind) simulates realistic experience from actions in diverse environments. DriveDreamer integrates HD maps and 3D bounding boxes as conditioning for precise scenario generation. The fidelity advantage is real, but the cost is 20 to 1,000 denoising steps per generated frame, multiplying compute by that factor relative to single-pass prediction — the same iterative cost that drives the VDX-1 chip proposal. For real-time planning, this is often prohibitive.

Recurrent state-space models (RSSMs) compress history into a fixed-size latent state and advance it one step at a time. DreamerV3 (12M to 400M parameters) famously collected a diamond in Minecraft from scratch using a single A100 over nine days, unrolling 16 imagination steps per training batch across thousands of parallel trajectories. MILE applies the same paradigm to end-to-end driving. RSSMs are more parameter-efficient and fundamentally cheaper per imagination step, but historically harder to scale to photorealistic generation.

JEPA (Joint Embedding Predictive Architecture), proposed by LeCun with concrete instantiations in I-JEPA and V-JEPA, predicts in abstract representation space without ever reconstructing pixels. This yields lightweight inference — a property exploited by the JEPA-R edge chip proposal — but requires downstream tasks to operate on frozen representations rather than generated images or video.

Key Systems at a Glance

| System | Params | Architecture | Compute | Achievement |
| --- | --- | --- | --- | --- |
| DreamerV3 | 12M-400M | RSSM (GRU + stochastic latent) | 1 A100, 9 days | Minecraft diamond from scratch |
| GAIA-1 | 9.4B (6.5B world model) | Autoregressive transformer | Multi-GPU cluster | Emergent driving dynamics |
| Genie 2/3 | Undisclosed (large) | Tokenizer + dynamics + decoder | High-end GPUs | 1-minute interactive 3D worlds |
| Cosmos | 2B-14B | Video generation platform | Blackwell GPUs | 30-second predictive video |
| UniSim | Large | Diffusion | Multi-GPU | Simulated experience from actions |
| pi0 | 3.3B | VLA with flow matching | Single GPU | 50 Hz robot control across 7 embodiments |

Why World Models Break Existing Hardware

Five properties combine to make world models uniquely demanding:

  1. Prediction-imagination-planning loops. Each training or inference cycle interleaves encoding, multi-step forward rollouts, reward estimation, and action selection — a deeply heterogeneous pipeline.
  2. Video-scale I/O. Cosmos and Genie operate on high-resolution video with hundreds of frames of temporal context.
  3. Multi-step diffusion. Diffusion-based world models multiply per-frame compute by the number of denoising steps.
  4. Model scale. GAIA-1 at 9.4B (total system) and Cosmos at 14B parameters rival the largest language models.
  5. Continuous learning. Reinforcement learning settings interleave data collection with model training and imagination-based policy updates, demanding sustained throughput rather than burst compute.

[Figure: World Model Pipeline (Sense → Imagine → Plan). Stages: Encode (sensor front end taking 6 cameras, LiDAR, and IMU; ViT encoding at 196 TFLOPS), Predict (prediction engine; SSM scan, 50 µs per next-state step), Imagine (imagination engine; 32 rollout units, 640K steps/s, branching), Plan (planning unit; MCTS at 300 iterations/ms, best path), Act (motor commands).]

Planning — selecting actions by imagining their consequences — is the operation that most sharply distinguishes world model inference from standard neural network inference. Two paradigms dominate, each with a distinctive hardware profile.

MCTS with Learned Models: The MuZero Legacy

AlphaGo used 1,920 CPUs and 280 GPUs to defeat Lee Sedol (Elo 3,739). AlphaGo Master reduced that to 4 TPUs while achieving a far higher Elo (4,858 vs. 3,739). MuZero generalized the approach to games without known rules, using 16 TPU v3s for training and 1,000 TPUs for self-play, running 800 simulations per move. Each simulation requires a dynamics-network forward pass at every tree node expansion, making MCTS fundamentally limited by inference latency and throughput.

Gumbel MuZero directly addresses this hardware bottleneck: by replacing PUCT with Gumbel-based sampling without replacement, it significantly improves performance when planning with few simulations — reducing the number of forward passes required per decision by an order of magnitude in some settings. This is a software optimization designed around hardware constraints.

The structural problem remains: MCTS tree management logic runs on CPUs while neural evaluation runs on accelerators. The serialization between them creates an imagination bottleneck where the accelerator is idle during tree traversal and the CPU is idle during neural evaluation.
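
The alternation between tree logic and model calls can be sketched in a few lines. This is a minimal illustration, not MuZero's implementation: `dynamics` is a hypothetical toy stand-in for a learned dynamics network, selection uses plain UCB1 rather than PUCT, and the reward of the newly expanded node stands in for a value-network estimate.

```python
import math

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}      # action -> Node
        self.visits = 0
        self.value_sum = 0.0

calls = {"dynamics": 0}         # counts model forward passes

def dynamics(state, action):
    """Hypothetical stand-in for a learned dynamics network."""
    calls["dynamics"] += 1
    next_state = state + action
    reward = -abs(next_state - 3)   # toy objective: reach state 3
    return next_state, reward

def ucb(parent, child, c=1.4):
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(parent.visits + 1) / child.visits)

def run_mcts(root_state, actions=(-1, 0, 1), simulations=50):
    root = Node(root_state)
    for _ in range(simulations):
        node, path = root, [root]
        # Selection: pointer chasing through the tree (CPU-style work).
        while len(node.children) == len(actions):
            parent = node
            node = max(parent.children.values(), key=lambda ch: ucb(parent, ch))
            path.append(node)
        # Expansion: one model forward pass (accelerator-style work).
        action = next(a for a in actions if a not in node.children)
        next_state, reward = dynamics(node.state, action)
        child = Node(next_state)
        node.children[action] = child
        path.append(child)
        # Backup: propagate the leaf estimate along the visited path.
        for visited in path:
            visited.visits += 1
            visited.value_sum += reward
    # Return the most-visited root action.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

Under these assumptions each simulation issues exactly one call to `dynamics`, so a budget of 800 simulations means 800 sequentially dependent model calls, with the tree logic and the model taking turns waiting on each other.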

MPC-Style Planning: CEM and MPPI

Cross-Entropy Method (CEM) and Model Predictive Path Integral (MPPI) planning sample hundreds of candidate action sequences in parallel and roll them forward through learned dynamics. This is more GPU-friendly than MCTS: no tree structure, just a large batch of independent rollouts. TD-MPC2 scales to 317M parameters across 80 tasks using this approach.

The hardware profile is memory-bandwidth-bound rather than latency-bound: storing and scoring hundreds of parallel trajectories dominates over the per-step compute cost.
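
A compact CEM planner shows why the workload is a large batch of independent rollouts. This is a sketch under stated assumptions: `rollout_cost` is a hypothetical one-dimensional toy model standing in for learned dynamics, and the population and elite sizes are illustrative rather than taken from TD-MPC2. MPPI is not shown.

```python
import random
import statistics

def rollout_cost(x0, actions, target=1.0):
    """Toy dynamics x <- x + a; cost is summed squared distance to target."""
    x, cost = x0, 0.0
    for a in actions:
        x = x + a
        cost += (x - target) ** 2
    return cost

def cem_plan(x0, horizon=5, population=200, elites=20, iterations=5, seed=0):
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        # Sample a batch of independent candidate action sequences.
        candidates = [
            [rng.gauss(mean[t], std[t]) for t in range(horizon)]
            for _ in range(population)
        ]
        # Score every candidate; on an accelerator this is one parallel batch.
        candidates.sort(key=lambda seq: rollout_cost(x0, seq))
        top = candidates[:elites]
        # Refit the sampling distribution to the elite set.
        for t in range(horizon):
            column = [seq[t] for seq in top]
            mean[t] = statistics.fmean(column)
            std[t] = statistics.pstdev(column) + 1e-6
    return mean
```

Every candidate must be stored and scored before the elite set can be chosen, which is where the trajectory-storage bandwidth pressure described above comes from.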

Planning Hardware Requirements

| Method | Compute Pattern | Key Bottleneck |
| --- | --- | --- |
| MCTS (MuZero) | Sequential tree expansion, batched leaf eval | Inference latency per simulation |
| CEM/MPPI | Massively parallel independent rollouts | Memory bandwidth for trajectory storage |
| Dreamer imagination | Batched latent rollouts with backprop | Latent state memory for long horizons |
| Ensemble models | K parallel independent forward passes | Memory for K model copies |

Real-Time Robotics: The Edge Constraint

World models for physical systems must run on the robot, not in the cloud. Network round-trip latency (20 to 100 ms) alone exceeds the sub-10 ms inference budget required for dexterous manipulation.

The Latency Budget

| Domain | Control Frequency | Latency Budget |
| --- | --- | --- |
| Dexterous manipulation | 50-200 Hz | 5-20 ms sensor-to-action |
| Mobile navigation | 10-50 Hz | 20-100 ms |
| Humanoid locomotion | 200-500 Hz | 2-5 ms (inner loop classical) |
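
The budgets in the table follow directly from the control period, and the cloud argument is a one-line check. The helper names below are hypothetical; the numbers are the ones quoted in this section.

```python
def control_period_ms(frequency_hz):
    """Time available per control cycle, in milliseconds."""
    return 1000.0 / frequency_hz

def fits_budget(inference_ms, network_rtt_ms, budget_ms):
    """True if inference plus network round trip fits the latency budget."""
    return inference_ms + network_rtt_ms <= budget_ms

# 50 Hz and 200 Hz bracket the dexterous-manipulation row.
print(control_period_ms(50))            # 20.0
print(control_period_ms(200))           # 5.0
# Even a zero-cost model misses a 10 ms budget behind a 20 ms round trip.
print(fits_budget(0.0, 20.0, 10.0))     # False
print(fits_budget(6.0, 0.0, 10.0))      # True
```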

pi0 (Physical Intelligence) represents the current state of the art: 3.3B parameters, 10 forward Euler steps per action chunk, 50 Hz output via action chunking on a single GPU. It controls seven different robot embodiments with a unified model.
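
The "10 forward Euler steps" can be read as integrating a velocity field from noise to an action chunk. The sketch below assumes a hypothetical `toy_velocity` function in place of the learned network (it points straight at a fixed target along a linear path); pi0's actual network, conditioning, and chunk length are not reproduced here.

```python
import random

def toy_velocity(actions, tau, target):
    """Hypothetical stand-in for a learned velocity network v(a, tau)."""
    return [(g - a) / (1.0 - tau) for a, g in zip(actions, target)]

def sample_action_chunk(target, steps=10, seed=0):
    rng = random.Random(seed)
    actions = [rng.gauss(0.0, 1.0) for _ in target]   # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        tau = i * dt                                  # 0.0, 0.1, ..., 0.9
        velocity = toy_velocity(actions, tau, target) # one network call per step
        actions = [a + dt * v for a, v in zip(actions, velocity)]
    return actions

chunk = sample_action_chunk([0.5, -0.2, 0.1])
```

Each Euler step costs one forward pass of the velocity network, so ten steps per chunk means ten sequential passes; emitting a multi-action chunk per sample is what allows a 50 Hz output rate without running the full model at 50 Hz.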

Current Edge Hardware

| Platform | AI Performance | Power | Target |
| --- | --- | --- | --- |
| Jetson AGX Orin | 275 TOPS | 15-60 W | Heavy manipulation, mobile robots |
| Jetson Thor | 2,070 TFLOPS (FP4) | 40-130 W | Humanoids, VLA models |
| Tesla FSD HW3 | 144 TOPS (dual) | 72 W | Autonomous driving |
| Tesla FSD HW4 | ~300-500 TOPS | 100 W | Next-gen driving |
| Tesla AI5 (planned) | ~3,000-5,000 TOPS | 800 W | Optimus humanoid, L4/L5 driving |
| Google Edge TPU | 4 TOPS | 2 W | Lightweight perception only |

The gap that matters: Tesla considers HW4 insufficient for autonomous Optimus operation. Humanoid-grade world models plausibly require 500+ TOPS at 15 W with deterministic latency. That chip does not exist. Jetson Thor at 2,070 TFLOPS comes closest but draws 40—130 W — too much for a battery-powered humanoid running all day. AI5 at 800 W is a datacenter-in-a-car, not a mobile robotics solution. The industry needs an order-of-magnitude improvement in TOPS-per-watt for world model inference at the edge.


Multi-Modal Fusion Hardware

World models for embodied AI must ingest camera frames (30 Hz), lidar sweeps (10—20 Hz), IMU and joint encoders (100—1,000 Hz), language commands (sporadic), and proprioception simultaneously. Each modality arrives at a different rate, in a different representation, and with a different token count. A vehicle running ViT-Large on 8 cameras at 30 Hz demands approximately 14.8 TFLOPS just for vision encoding — before any cross-modal attention, prediction, or planning.
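
The 14.8 TFLOPS figure can be reproduced with simple arithmetic. The per-frame cost is an assumption: roughly 61.6 GFLOPs is a commonly quoted figure for one ViT-L/16 forward pass at 224x224 input, and the text does not state which resolution or patch size it assumes.

```python
def vision_encoding_tflops(cameras, frame_rate_hz, gflops_per_frame):
    """Sustained compute needed to encode every frame from every camera."""
    return cameras * frame_rate_hz * gflops_per_frame / 1000.0

# Assumed: ~61.6 GFLOPs per ViT-L/16 forward pass at 224x224.
print(round(vision_encoding_tflops(8, 30, 61.6), 1))   # 14.8
```

Higher input resolutions raise the token count and push the requirement well beyond this baseline.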

Production Multi-Modal SoCs

Tesla FSD HW3 (2019) remains the canonical example of purpose-built multi-modal driving silicon: Samsung 14 nm, approximately 260 mm² die area. Each SoC carries two independent NNAs, each with a 96x96 multiply-accumulate array operating on 8-bit integers at 2 GHz, delivering 36 TOPS per NNA (72 TOPS per SoC, 144 TOPS across the dual-SoC board). The key architectural insight is 32 MB of SRAM per NNA, large enough to hold entire neural network layer weights on-die, avoiding DRAM round-trips and achieving near-peak MAC utilization. A dedicated ISP handles 8 cameras simultaneously at 2.5 Gpixels/s. Dual-redundant SoCs on one board provide functional safety: if both agree, the action executes; disagreement triggers fallback.
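
The 36 TOPS per NNA figure follows from the array size and clock. The helper below is a hypothetical name for the standard peak-throughput calculation, counting each multiply-accumulate as two operations.

```python
def mac_array_peak_tops(rows, cols, clock_ghz, ops_per_mac=2):
    """Peak tera-operations per second for a rows x cols MAC array."""
    return rows * cols * ops_per_mac * clock_ghz / 1000.0

per_nna = mac_array_peak_tops(96, 96, 2.0)
print(round(per_nna, 1))        # 36.9
print(round(2 * per_nna, 1))    # 73.7
```

The quoted 36 and 72 TOPS are these values rounded down. Peak figures of this kind assume every MAC is busy on every cycle, which is the condition the large on-die SRAM is meant to approach.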

HW4 (2023) moved to Samsung 7 nm, doubled RAM to 16 GB, and added high-definition radar. Musk claimed 3—8x HW3 compute, enabling end-to-end neural network processing of all cameras simultaneously rather than per-camera feature extraction. AI5 (estimated 2027) is the generational leap, consuming up to 800 W — an acknowledgment that world model inference at L4/L5 is fundamentally more compute-hungry than the convolutional perception stacks of earlier hardware.

NVIDIA DRIVE Thor (2025+) represents a different philosophy: a general-purpose transformer accelerator for automotive. TSMC 4NP Blackwell architecture, 2,560 CUDA cores, 1,000 sparse INT8 TOPS, and critically 128 GB of LPDDR5X — four times Orin’s memory, explicitly sized to hold billion-parameter transformer world models in-vehicle. The Transformer Engine with native FP8 support targets the attention-heavy BEV transformer architectures that Orin struggles to run at full resolution. Two Thor SoCs interconnect via NVLink-C2C for 2,000 TOPS configurations.

Mobileye EyeQ6 to EyeQ Ultra traces a different path: specialized, power-efficient accelerators. EyeQ6 ships at 16 TOPS on 7 nm targeting L2+; EyeQ Ultra at 176 TOPS on 5 nm targets L4 with dedicated accelerators for deep learning, classical computer vision, and general-purpose compute on a single die. Over 27 OEMs use EyeQ silicon.

Apple Vision Pro R1 offers a template outside driving. A dedicated sensor fusion co-processor handles 12 cameras, 5 sensors, and 6 microphones simultaneously at 12 ms photon-to-photon latency. By isolating sensor fusion on a chip with its own hard real-time OS, Apple guarantees that application load on the M2 never causes sensor fusion jitter. This dual-chip isolation pattern — a dedicated real-time co-processor alongside a general-purpose application processor — is the cleanest existing solution to the asynchronous multi-modal input problem.

Recurring Hardware Patterns

Four patterns recur across production multi-modal SoCs:

  1. Dedicated ISP/sensor pipeline + central neural accelerator. Preprocessing (demosaic, distortion correction, point cloud voxelization) runs on fixed-function hardware while the neural network runs on a programmable accelerator. Tesla and NVIDIA DRIVE both use this.
  2. Dual-chip isolation for real-time guarantees. A dedicated sensor fusion chip guarantees hard real-time deadlines independent of the main application processor (Apple R1 pattern).
  3. Large on-chip SRAM to avoid the DRAM bottleneck. Tesla’s 32 MB per NNA and NVIDIA’s DLA SRAM buffers keep weights on-die. Future multi-modal chips will likely push to 64—128 MB.
  4. FP8 transformer engines. NVIDIA Thor and Hopper both feature FP8 arithmetic that halves memory footprint and doubles throughput versus FP16, directly enabling larger cross-modal attention windows within the same power envelope.

The BEV Transformer Bottleneck in Autonomous Driving

BEVFormer established the paradigm for modern autonomous driving perception: use spatial cross-attention to lift multi-camera 2D features into a unified bird’s-eye-view representation, then apply temporal self-attention across history frames. The result is a dense 3D understanding of the scene suitable for occupancy prediction and trajectory planning.

The hardware bottleneck is attention scale. A 200x200 BEV grid is 40,000 queries per frame; with 8 history frames in the temporal window, that is 320,000+ tokens per inference, far beyond NLP-scale attention. Solutions include deformable attention (attending to sparse learned reference points), sparse queries (VoxFormer, StreamPETR), and temporal compression into fixed-size latents. Tesla’s occupancy networks push this further: dense 3D voxel grids (200x200x16 at 0.5 m resolution) updated at 10 Hz across 8 cameras require billions of MACs per second.
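
The token count, and the reason dense attention is off the table, can be checked directly. The function names are hypothetical, and the 8 reference points per query used for the sparse case is an illustrative assumption rather than a figure from BEVFormer.

```python
def bev_tokens(grid_h, grid_w, history_frames):
    """BEV queries across the whole temporal window."""
    return grid_h * grid_w * history_frames

def dense_attention_pairs(tokens):
    """Query-key pairs a dense self-attention layer would score."""
    return tokens * tokens

def sparse_attention_pairs(tokens, points_per_query):
    """Pairs scored when each query attends to a fixed number of points."""
    return tokens * points_per_query

tokens = bev_tokens(200, 200, 8)
print(tokens)                               # 320000
print(dense_attention_pairs(tokens))        # 102400000000
print(sparse_attention_pairs(tokens, 8))    # 2560000
```

Under these assumptions, sparse sampling cuts the pair count by a factor of 40,000, which is why deformable attention and sparse queries are the standard remedies.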

These workloads explain the 10—20x compute escalation from current shipping hardware (Orin at 254 TOPS, HW4 at ~400 TOPS) to next-generation silicon (Thor at 2,000 TOPS, AI5 at an estimated 3,000—5,000 TOPS). The architecture has shifted from CNN-era perception to transformer-era world modeling, and the silicon must follow.

Autonomous Driving Chip Comparison

| Chip | TOPS (INT8) | Process | Power | Memory | Target Level | Status |
| --- | --- | --- | --- | --- | --- | --- |
| Tesla HW3 | 144 (dual) | 14 nm | 72 W | 8 GB | L2-L3 | Shipping since 2019 |
| Tesla HW4 | ~300-500 | 7 nm | 100 W | 16 GB | L3-L4 | Shipping since 2023 |
| Tesla AI5 | ~3,000-5,000 | 4-5 nm est. | 800 W | TBD | L4-L5 | Est. 2027 |
| NVIDIA Orin | 254 (sparse) | 8 nm | 100 W | 32 GB | L2-L4 | Shipping since 2022 |
| NVIDIA Thor | 1,000 (sparse; 2,000 dual) | TSMC 4NP | ~300-500 W | 128 GB | L3-L5 | Sampling 2025 |
| Mobileye EyeQ6 | 16 | 7 nm | 60 W | LPDDR5X | L2+ | Shipping 2024 |
| Mobileye EyeQ Ultra | 176 | 5 nm | ~100 W | LPDDR5X | L4 | 2025-2026 |
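
Efficiency ranges can be derived from the table by dividing the low TOPS figure by the high power figure and vice versa. This is a rough sketch: the table mixes sparse and dense TOPS, and pairing the dual-Thor 2,000 TOPS figure with the listed 300 to 500 W is an assumption.

```python
chips = {
    # name: (TOPS low, TOPS high, watts low, watts high), from the table above
    "Tesla HW3": (144, 144, 72, 72),
    "Tesla HW4": (300, 500, 100, 100),
    "Tesla AI5": (3000, 5000, 800, 800),
    "NVIDIA Orin": (254, 254, 100, 100),
    "NVIDIA Thor (dual)": (2000, 2000, 300, 500),
}

def tops_per_watt_range(tops_lo, tops_hi, watts_lo, watts_hi):
    """Worst case pairs low TOPS with high power; best case the reverse."""
    return tops_lo / watts_hi, tops_hi / watts_lo

for name, spec in chips.items():
    lo, hi = tops_per_watt_range(*spec)
    print(f"{name}: {lo:.2f} to {hi:.2f} TOPS/W")
```

None of the listed chips comes close to the roughly 33 TOPS/W implied by a 500 TOPS, 15 W humanoid target.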

[Figure: Autonomous Driving Chip Comparison. Dual metric bars per chip showing TOPS (INT8) and power draw (W), annotated with TOPS/W: Tesla HW3 2.0; Tesla HW4 5.0; Tesla AI5 (planned) 3.8-6.3; NVIDIA Thor (2,000 TOPS dual configuration) ~4.0-6.7; ATLAS (TSMC N3E, 467 TOPS INT8 at 60-80 W) 5.8-7.8. Callouts: ATLAS reaches 60-80% utilization on world model workloads versus under 10% for a GPU; BEV transformers need 320K+ tokens (200x200 BEV grid x 8 history frames).]

SSM and Mamba: Reshaping the Imagination Engine

The transformer’s O(n^2) attention cost and ever-growing KV cache make it increasingly awkward for the operation world models do most: stepping forward in time, step after step, for thousands or millions of imagination steps during planning. State space models offer a fundamentally different trade: O(n) complexity in sequence length and constant memory per step, at the cost of a compressed fixed-size state rather than full-context access.

The Core Architectures

S4 (Gu et al., 2021) introduced three computational views of the same linear system: continuous (natural for signals), recurrent (O(1) per step at inference), and convolutional (O(L log L) via FFT for training). The convolutional view enabled fast GPU training, but required time-invariant parameters.
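
The equivalence of the recurrent and convolutional views can be checked on a scalar, time-invariant system. This sketch assumes an already-discretized system with hypothetical parameters `a`, `b`, `c`, and evaluates the convolution directly in O(L^2) rather than by FFT, so it illustrates the equivalence, not S4's training speed.

```python
def ssm_recurrent(x, a, b, c):
    """h_t = a*h_{t-1} + b*x_t, y_t = c*h_t: O(1) work and memory per step."""
    h, ys = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys

def ssm_convolutional(x, a, b, c):
    """Same outputs via the kernel K_k = c * a^k * b applied to the input."""
    length = len(x)
    kernel = [c * (a ** k) * b for k in range(length)]
    return [
        sum(kernel[k] * x[t - k] for k in range(t + 1))
        for t in range(length)
    ]

x = [1.0, 0.5, -0.25, 2.0]
print(ssm_recurrent(x, a=0.9, b=0.5, c=2.0))
print(ssm_convolutional(x, a=0.9, b=0.5, c=2.0))
```

The kernel can be precomputed only because `a`, `b`, and `c` do not depend on the input, which is the time-invariance restriction noted above.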

Mamba (S6, Gu and Dao, 2023) made SSM parameters input-dependent, enabling content-based selection analogous to attention — but with a compressed fixed-size state rather than full context storage. The hardware-aware implementation loads SSM parameters from HBM to SRAM, performs the entire selective scan in SRAM, and writes only final outputs back — avoiding materializing the expanded state tensor. This kernel fusion trick mirrors FlashAttention’s strategy for avoiding the O(n^2) attention matrix.

Mamba-2 (Dao and Gu, 2024) revealed the State Space Duality: SSMs and a class of structured attention are mathematically equivalent. The practical breakthrough is a chunk-wise matmul algorithm that hits tensor cores — fixing Mamba-1’s critical hardware weakness. Mamba-1’s selective scan used element-wise operations that could not exploit tensor cores. On H100, the gap between matmul throughput (989 TFLOPS BF16) and element-wise throughput (67 TFLOPS FP32) means Mamba-1 was leaving over 90% of the chip’s peak compute on the table. Mamba-2 splits sequences into chunks of 64—128, computes intra-chunk outputs via batched matmul (tensor-core-friendly), passes states across chunk boundaries via a small scan on the reduced sequence, and finishes with another batched matmul. The result: 2—8x faster than Mamba-1, and the state dimension can grow from N=16 to N=64—256+ without penalty.
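
The three-stage chunked structure can be illustrated on a scalar recurrence with input-dependent decay. This is a sketch of the decomposition only: real Mamba-2 kernels operate on matrices so that stages one and three become batched matmuls, and the chunk size of 4 here is purely illustrative.

```python
def sequential_scan(a, u):
    """Reference: h_t = a_t * h_{t-1} + u_t, starting from h = 0."""
    h, out = 0.0, []
    for a_t, u_t in zip(a, u):
        h = a_t * h + u_t
        out.append(h)
    return out

def chunked_scan(a, u, chunk=4):
    chunks = [(a[i:i + chunk], u[i:i + chunk]) for i in range(0, len(a), chunk)]
    # Stage 1: intra-chunk results from a zero initial state (independent work).
    local = [sequential_scan(ca, cu) for ca, cu in chunks]
    # Cumulative decay inside each chunk: product of a_0 .. a_t.
    decay = []
    for ca, _ in chunks:
        prod, row = 1.0, []
        for a_t in ca:
            prod *= a_t
            row.append(prod)
        decay.append(row)
    # Stage 2: short scan over chunk boundaries only.
    carry, carries = 0.0, []
    for loc, dec in zip(local, decay):
        carries.append(carry)
        carry = dec[-1] * carry + loc[-1]
    # Stage 3: add the incoming state's contribution at every position.
    out = []
    for loc, dec, h_in in zip(local, decay, carries):
        out.extend(l + d * h_in for l, d in zip(loc, dec))
    return out
```

Only stage two is sequential, and it runs over the number of chunks rather than the number of tokens.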

SSMs for World Models: The R2I Breakthrough

R2I (Recall to Imagine) demonstrated the payoff concretely: integrating S4/S5-family SSMs into DreamerV3’s world model achieved up to 9x faster training than the GRU-based original. The constant-state property is what makes this possible:

| Property | Transformer | SSM (Mamba-class) |
| --- | --- | --- |
| Per-step inference FLOPs | O(L) attention + O(1) FFN, L growing with history | O(1) state update |
| Per-step memory | O(L) KV cache per layer, growing linearly | O(N) fixed state per layer |
| 10,000-step imagination | O(H^2) total attention FLOPs, ~5 GB KV cache | O(H) total FLOPs, ~1 MB constant |
| 1M-step simulation | Infeasible without windowing | Straightforward |
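
The memory rows can be made concrete with a back-of-envelope calculation. The model shape below (32 layers, width 4096, state dimension 16, 2 bytes per value) is a hypothetical choice for illustration; the absolute numbers depend on it, while the linear-versus-constant behavior does not.

```python
def kv_cache_bytes(steps, layers, d_model, bytes_per_value=2):
    """Keys and values for every past step, in every layer."""
    return 2 * steps * layers * d_model * bytes_per_value

def ssm_state_bytes(layers, channels, state_dim, bytes_per_value=2):
    """Fixed recurrent state; independent of how many steps have run."""
    return layers * channels * state_dim * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
LAYERS, D_MODEL, STATE_DIM = 32, 4096, 16
for steps in (1, 100, 10_000, 1_000_000):
    kv = kv_cache_bytes(steps, LAYERS, D_MODEL)
    ssm = ssm_state_bytes(LAYERS, D_MODEL, STATE_DIM)
    print(f"{steps:>9} steps: KV cache {kv / 1e9:.4f} GB, SSM state {ssm / 1e6:.1f} MB")
```

With this shape the KV cache reaches about 5.2 GB at 10,000 steps, in line with the table, while the SSM state stays at a few megabytes regardless of rollout length.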

R2I handled 4,000-step episodes in Memory Maze, dramatically beyond GRU-based DreamerV3’s capacity, and achieved superhuman performance on complex memory tasks. A robot running a world model at 50 Hz with 200-step planning horizons needs roughly 10,000 imagination steps per second (50 x 200); an SSM sustains that rate with constant memory, which is feasible on embedded hardware where a transformer would blow through memory budgets.

[Figure: Memory Footprint Comparison, Transformer KV Cache vs SSM Fixed State. The transformer KV cache grows O(L) per layer with rollout length (about 5 GB at 10K steps and about 500 GB at 1M steps), while the SSM (Mamba) state stays at a constant ~1 MB at any sequence length. Caption: a fixed-size state means the planning horizon is limited only by compute time, not memory.]

Hybrid Architectures

Pure SSMs struggle with precise retrieval from long context (the “copying” problem). This motivates hybrid designs: Jamba (AI21) interleaves transformer and Mamba layers with mixture-of-experts, fitting on a single 80 GB GPU with 256K context. Samba (Microsoft) combines Mamba with sliding window attention, achieving 3.73x higher throughput than transformers at 128K context while extrapolating from 4K training length to 256K with perfect memory recall. Zamba (Zyphra) uses a Mamba backbone with a single shared attention module at 7B parameters. These hybrids create heterogeneous workloads: SSM layers are scan-heavy (Mamba-1) or chunk-matmul (Mamba-2), while attention layers are standard GEMM-heavy. A chip serving these models needs both efficient scan units and high-throughput matmul engines.

Hardware for SSM Inference

SSM inference, like transformer inference, is dominated by memory bandwidth — the bottleneck is loading model weights from HBM, not computing the scan. But SSMs eliminate the KV cache entirely, enabling larger batch sizes in the same memory footprint and constant cost per step regardless of sequence length. The emerging FPGA accelerator literature (SSMA, LightMamba, FastMamba, SpecMamba, MambaOPU, MARCA-v2, HCSAs, eMamba — eight papers in 2024—2025 alone) converges on common themes: INT8/INT4 quantized states, custom parallel scan units replacing general-purpose ALUs, on-chip state buffers that hold the SSM state permanently in SRAM, and reconfigurable dataflow to handle input-dependent control paths.
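
Of the themes above, quantized state is the easiest to sketch. The example uses symmetric per-tensor INT8 quantization, a generic scheme chosen for illustration; the cited accelerators may quantize differently (per-channel scales, INT4, or other codes), and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Map integer codes back to approximate real values."""
    return [qi * scale for qi in q]

state = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(state)
restored = dequantize_int8(codes, scale)
worst = max(abs(s - r) for s, r in zip(state, restored))
print(codes, worst <= scale / 2 + 1e-9)
```

Storing the state as one byte per element halves the on-chip buffer relative to 16-bit formats, at the cost of a rounding error bounded by half the scale per element.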

The case for custom silicon is clear: a chip optimized for SSM-based world model inference would prioritize large on-chip SRAM for model weights (not KV cache — there is none), dedicated hardwired scan units, and high HBM bandwidth for parameter streaming. This is architecturally quite different from a transformer inference chip.


ATLAS: A Purpose-Built World Model Processor

No existing chip is designed for the world model workload. GPUs are compute-overprovisioned and memory-bandwidth-starved for the small, sequential state updates that dominate imagination rollouts — a 10,000-step rollout at batch-1 utilizes less than 5% of an H100’s tensor cores. Driving SoCs (Tesla HW4, Mobileye EyeQ6) are optimized for perception, not recurrent multi-step prediction. LLM inference chips (Groq LPU, Etched Sohu) have no multi-modal input pipelines, no parallel rollout scheduling, and no mechanism for branching tree search.

ATLAS (Autonomous Terrain Learning and Simulation Processor) is a concrete architecture proposal for a chip purpose-built for the three phases of world model operation: sense (multi-modal encoding), imagine (parallel rollout generation), and plan (tree search over imagined futures).

Architecture Overview

A single-die SoC at approximately 450 mm^2 on TSMC N3E, comprising five functional blocks connected by a deterministic on-die mesh network:

Sensor Front End — Four 128x128 BF16 systolic array matrix tiles (aggregate ~196 TFLOPS BF16) with 256 KB SRAM per tile. A dedicated ISP handles 6 cameras at 30 fps. A hardware voxelizer converts lidar point clouds to BEV tokens. A proprioception FIFO handles IMU and joint encoders at 200 Hz. A cross-modal tokenizer projects all modalities into a shared D=1024 embedding space. The full sensor-to-latent pipeline completes in under 10 ms.

Prediction Engine — Two dedicated SSM scan units implementing hardware-wired parallel prefix scan for Mamba-2/S5-class models at state dimension N=256 across D=1024 channels. Two small attention tiles (64x64 BF16 systolic arrays) handle the occasional attention layers in hybrid SSM-attention models. Four MB of state SRAM holds the full SSM hidden state permanently on-die. Eight MB of weight cache streams model layers. Per-step latency: approximately 50 microseconds at N=256, D=1024 — dominated by parameter loads from the weight cache, confirming the memory-bound analysis.
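
A parallel prefix scan works on a linear recurrence because composing affine updates is associative. The sketch below uses the classic Hillis-Steele formulation on a scalar recurrence; the scan network topology of the proposed hardware is not specified in the text, so this particular choice is an assumption.

```python
def combine(left, right):
    """Compose two affine updates h -> a*h + u; left is applied first."""
    a1, u1 = left
    a2, u2 = right
    return (a1 * a2, a2 * u1 + u2)

def parallel_prefix_scan(a, u):
    """Inclusive scan in ceil(log2(L)) rounds of independent combines."""
    elems = list(zip(a, u))
    offset = 1
    while offset < len(elems):
        # Every position in a round could be computed by a separate lane.
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
    # With h = 0 before the first step, the u component is h_t.
    return [u_t for _, u_t in elems]

print(parallel_prefix_scan([0.9, 0.5, 1.1, 0.7], [1.0, -2.0, 0.5, 0.0]))
```

The result matches the sequential recurrence h_t = a_t * h_{t-1} + u_t while needing only a logarithmic number of dependent rounds. This helps across a sequence or across channels; a single autoregressive step still depends on the previous step's output.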

Imagination Engine: 32 parallel rollout units (RUs), each containing a compact scan unit (N=64, D=256, a 4x-compressed version of the full model), 128 KB state SRAM, and a scalar ALU for reward computation. Four MB of shared SRAM holds the compressed rollout model (approximately 50M parameters at FP8) entirely on-chip, making rollout steps weight-stationary with zero HBM access. At 50 microseconds per step, each RU completes 20,000 steps per second, so 32 RUs running 50-step rollouts in parallel deliver 640,000 imagination steps per second, sufficient for real-time MCTS with 64 rollouts of 50 steps every 20 ms control cycle.
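
The throughput claim can be checked against the per-cycle demand. The helper names are hypothetical; the inputs are the figures quoted for the Imagination Engine.

```python
import math

def steps_per_second(rollout_units, step_latency_us):
    """Aggregate step rate when every rollout unit is busy."""
    return rollout_units * 1_000_000 / step_latency_us

def required_steps_per_second(rollouts, steps_per_rollout, cycle_ms):
    """Step rate needed to finish all rollouts once per control cycle."""
    return rollouts * steps_per_rollout * 1000 / cycle_ms

def rollout_wall_time_ms(rollouts, rollout_units, steps_per_rollout, step_latency_us):
    """Rollouts run in waves of at most rollout_units at a time."""
    waves = math.ceil(rollouts / rollout_units)
    return waves * steps_per_rollout * step_latency_us / 1000

peak = steps_per_second(32, 50)
needed = required_steps_per_second(64, 50, 20)
print(peak, needed, needed / peak)           # 640000.0 160000.0 0.25
print(rollout_wall_time_ms(64, 32, 50, 50))  # 5.0
```

Sixty-four rollouts need a quarter of peak throughput averaged over the cycle, but because steps within a rollout are sequential, they occupy two waves and about 5 ms of wall time.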

[Figure: Imagination Engine, Parallel Rollout Array. 32 rollout units (R0 to R31), each active or idle, branch in parallel from the world state latent s(t) (D=256, N=64). Headline figures: 640,000 imagination steps/s, 50 µs per SSM step, 32 parallel rollout units. Callout: MuZero ran 800 simulations per move on 1,000 TPUs; ATLAS targets comparable search on one chip.]

Planning Unit — Two RISC-V cores (RV64GCV with vector extensions) handle control-flow-heavy planning logic. A hardware MCTS accelerator with 4K-node tree memory performs UCB selection via a 32-wide comparator tree in a single cycle and value backpropagation in a 4-cycle pipeline. A hardware CEM sampler produces 32 Gaussian candidate action sequences per cycle with top-K fitness selection. One MCTS iteration (select, expand, rollout, backprop) takes approximately 3 microseconds; in 1 ms of planning budget, the unit completes approximately 300 iterations.
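
A software analogue of the comparator tree is a pairwise tournament over child scores. This sketch assumes UCB1 as the scoring rule and hypothetical function names; in the proposed hardware the levels would be combinational logic evaluated within one cycle, whereas here they are counted explicitly.

```python
import math

def ucb1(mean_value, visits, parent_visits, c=1.4):
    """UCB1 score; parent_visits must be at least 1."""
    if visits == 0:
        return float("inf")
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)

def comparator_tree_argmax(scores):
    """Pairwise tournament; returns (index of max, number of levels)."""
    contenders = list(enumerate(scores))
    levels = 0
    while len(contenders) > 1:
        winners = []
        for i in range(0, len(contenders) - 1, 2):
            left, right = contenders[i], contenders[i + 1]
            winners.append(left if left[1] >= right[1] else right)
        if len(contenders) % 2:
            winners.append(contenders[-1])   # odd one out advances unchallenged
        contenders = winners
        levels += 1
    return contenders[0][0], levels

scores = [0.1 * (i % 7) for i in range(32)]
scores[17] = 5.0
print(comparator_tree_argmax(scores))   # (17, 5)
```

For 32 children the tournament is five comparators deep, which is what makes a single-cycle selection plausible.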

State Memory — 64 MB unified SRAM partitioned dynamically between model weights, persistent world state, and temporary buffers. Two HBM3E stacks provide 16 GB at 2 TB/s aggregate bandwidth for overflow weights, replay buffers, and map data.

The Two-Mode Control Cycle

ATLAS operates in two modes within each 20 ms control cycle:

Mode 1 — Sense and Predict (approximately 10—12 ms): Sensor Front End and Prediction Engine are active. Imagination Engine is power-gated. Ingest new sensor data, encode to latent representation, advance the world state by one step using the full-fidelity model. Compute character: GEMM-heavy (vision encoding) plus memory-bound sequential (SSM next-state).

Mode 2 — Imagine and Plan (approximately 8 ms): Imagination Engine (all 32 RUs) and Planning Unit are active. Sensor Front End matrix tiles can be repurposed as additional rollout capacity. Dispatch 32—64 candidate action sequences, roll out compressed world model in parallel, evaluate outcomes, run MCTS/CEM to select the best action. Compute character: embarrassingly parallel scan operations across 32+ independent state trajectories.

Performance Summary

| Phase | Operation | Latency |
| --- | --- | --- |
| Sense | 6-camera ViT-Large encoding + lidar + fusion | 8 ms |
| Predict | Full-model single-step state advance | 1 ms |
| Imagine | 32 rollouts x 50 steps (compressed model) | 3 ms |
| Plan | 300 MCTS iterations over rollout results | 1 ms |
| Act | Action decode + safety check + bus write | 0.5 ms |
| Total | | 13.5 ms (6.5 ms margin in 20 ms cycle) |
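
The total and the margin follow from summing the phases against the 50 Hz control period. The dictionary and function names are hypothetical; the latencies are the ones in the table.

```python
phases_ms = {"sense": 8.0, "predict": 1.0, "imagine": 3.0, "plan": 1.0, "act": 0.5}

def cycle_margin_ms(phases, control_hz):
    """Return (total latency, remaining margin) for one control cycle."""
    budget = 1000.0 / control_hz
    total = sum(phases.values())
    return total, budget - total

total, margin = cycle_margin_ms(phases_ms, 50)
print(total, margin)   # 13.5 6.5
```

Sensing accounts for more than half of the total, so the vision encoder, not the imagination rollouts, is the first place to look if the cycle has to shrink.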

[Figure: ATLAS Control Loop timeline. 13.5 ms total within the 20 ms (50 Hz) budget: Sense 8 ms (6-camera ViT encoding), Predict 1 ms (SSM next-state), Imagine 3 ms (32 parallel rollouts), Plan 1 ms (MCTS/CEM selection), Act 0.5 ms (motor command), Margin 6.5 ms (safety buffer).]

| Block | Peak INT8 TOPS | Typical Power |
| --- | --- | --- |
| Sensor Front End (4 matrix tiles) | 392 | 15 W |
| Prediction Engine (2 scan + 2 attn) | 50 | 8 W |
| Imagination Engine (32 RUs) | 24 | 12 W |
| Planning Unit (2 RISC-V + accel.) | 1 | 3 W |
| State Memory + HBM + I/O | | 22 W |
| Total chip | ~467 TOPS | ~60 W typical, ~80 W peak |

The raw TOPS number is modest compared to Thor’s 1,000 or projected AI5 figures. The design philosophy is fundamentally different: ATLAS maximizes utilization at the actual world model operating point (small batch, sequential prediction, parallel rollouts) rather than peak throughput on large-batch GEMM. Estimated utilization on world model workloads is 60—80% versus less than 10% for a GPU running the same workload.

Comparison with Existing Silicon

| Dimension | Tesla FSD HW4 | NVIDIA Thor | Mobileye EyeQ Ultra | ATLAS |
| --- | --- | --- | --- | --- |
| Primary workload | Perception + e2e driving | Perception + world model (general) | Perception + mapping | World model: sense-imagine-plan |
| Process | Samsung 7 nm | TSMC 4NP | 5 nm | TSMC N3E |
| AI compute | ~300-500 TOPS | 1,000 TOPS | 176 TOPS | 467 TOPS |
| Parallel rollout units | None | None | None | 32 dedicated |
| SSM scan hardware | None | None | None | 2 full + 32 compact |
| Prediction latency (SSM step) | N/A | ~1-5 ms (GPU inference) | N/A | ~50 microseconds |
| Imagination throughput | N/A | N/A | N/A | 640K steps/s |
| On-chip state SRAM | 32 MB/NNA | ~20 MB | ~8 MB | 64 MB unified + 4 MB/engine |
| Safety features | ASIL-D (dual redundancy) | ASIL-D (lockstep) | ASIL-B | ASIL-B/D (watchdog + lockstep) |
| Power | ~100 W | ~300-500 W est. | ~100 W | 60-80 W |

The key differentiators: ATLAS is the only architecture with dedicated parallel rollout hardware, hardwired SSM scan units achieving 50-microsecond prediction steps, and a hardware MCTS accelerator — the three operations that define world model inference and that existing chips handle only through general-purpose programmable compute.

Scalability: Edge to Cloud

ATLAS is designed as a scalable building block:

| Configuration | Dies | Power | TOPS (INT8) | Model Size | Rollouts | Use Case |
| --- | --- | --- | --- | --- | --- | --- |
| Edge | 1 (reduced, 8 RUs) | 25 W | 150 | 200M | 8 x 30 steps | Drone, small robot |
| Standard | 1 (full, 32 RUs) | 60-80 W | 467 | 500M-1B | 32 x 50 steps | Autonomous vehicle, humanoid |
| Cloud | 8 (UCIe chiplet link) | 500 W | 3,700 | 5-10B | 256 x 200 steps | Simulation, digital twins |

The edge configuration, at 150 TOPS in 25 W (6 TOPS/W), moves toward but remains well short of the 500+ TOPS at 15 W target for battery-powered humanoids (roughly 33 TOPS/W). Closing that gap requires a further process shrink (N2 or A14) and aggressive voltage scaling, likely a 2028 to 2029 prospect.

[Figure: ATLAS Scalability, Edge to Cloud on One Architecture. Edge (25 W, 8 rollout units, 150 TOPS INT8, 200M params, 8 x 30 steps; drones and small robots), Vehicle (80 W, 32 rollout units, 467 TOPS INT8, 500M-1B params, 32 x 50 steps; autonomous driving and humanoids), and Cloud (500 W, 8 dies over UCIe, 256 rollout units, 3,700 TOPS INT8, 5-10B params, 256 x 200 steps; simulation and digital twins) share the same ISA and software stack. Scaling changes the number of rollout units, not the design.]

Risks

The primary risk is betting on SSM-based world models when the architecture landscape is still evolving. Mitigation: scan units occupy less than 10% of die area. The matrix tiles and attention tiles handle transformer workloads competently, and the chip degrades gracefully to a conventional neural accelerator. The compressed rollout model may also prove too inaccurate for safety-critical planning; the prediction engine can run rollouts sequentially at full fidelity as a fallback. Software ecosystem bootstrapping without CUDA is the hardest commercial challenge, addressed through an MLIR-based compiler accepting standard PyTorch/JAX model definitions.


The Convergence Thesis

The autonomous driving and robotics industries are converging on a common architecture: multi-camera BEV transformers feeding an occupancy-based world model that jointly predicts the future and plans trajectories. The compute requirements — real-time attention over 320,000+ spatial-temporal tokens, 3D voxel prediction, multi-agent trajectory simulation, and branching imagination rollouts — explain the 10—20x compute escalation from current shipping hardware to next-generation silicon.

The winners will be determined not just by raw TOPS but by TOPS per watt at the right precision (FP8/INT8 for transformers, INT4 for occupancy and compressed rollouts), memory bandwidth (BEV temporal buffers and parameter streaming are bandwidth-hungry), and effective utilization on the actual workload (most chips waste 90%+ of peak throughput on world model inference). Tesla bets on vertical integration and fleet data. NVIDIA bets on ecosystem breadth and CUDA lock-in. Mobileye bets on power efficiency. Purpose-built architectures like ATLAS bet that the world model workload is different enough from both LLM inference and perception that it deserves its own silicon.

The hardware world is at an inflection point. Transformer-optimized accelerators have dominated for five years. SSMs demand a different balance: less emphasis on KV cache, more on memory bandwidth and efficient scan primitives — a shift that connects to the nonlinear silicon thesis, where dynamical systems process continuous signals natively via analog ODE solvers rather than discretized digital scan operations. The eight FPGA accelerator papers published in 2024—2025 alone suggest the research community recognizes this gap. The first SSM-optimized ASICs for world model inference — combining high-bandwidth parameter streaming, on-chip state buffers, hardwired parallel scan units, and dedicated imagination rollout arrays — may define the next generation of embodied AI hardware.


Research compiled 2026-04-30. Specifications for unreleased products (Tesla AI5, NVIDIA Thor, Mobileye EyeQ Ultra) are based on public announcements and may change before production. ATLAS is a research architecture proposal, not a shipping product.
