Custom Chips for World Models
World models — systems that learn an internal representation of an environment and predict how it evolves — sit at the convergence of the three most compute-hungry AI workloads: large-scale video processing, autoregressive sequence modeling, and reinforcement learning with planning. As these models move from research papers to production robots and vehicles, they expose a hardware gap that no existing chip was designed to fill. This article surveys the world model landscape, the silicon racing to serve it, and a concrete architecture proposal for what purpose-built world model hardware could look like.
The World Model Landscape
A world model takes in observations (camera frames, lidar sweeps, joint angles, language commands), compresses them into a latent state, predicts how that state evolves given candidate actions, and, critically, runs branching imagination rollouts to evaluate plans before committing to physical motion. The field has crystallized around four architecture families (autoregressive transformers, diffusion models, recurrent state-space models, and JEPA), each with a distinct hardware profile.
Architecture Families
Autoregressive transformers tokenize observations via VQ-VAE or similar codebooks and predict the next token in sequence. GAIA-1 (Wayve, 9.4B total parameters across a 6.5B world model, 2.6B video decoder, and 0.3B tokenizer) demonstrated emergent driving dynamics and counterfactual reasoning from video, text, and action tokens. Genie 2 and Genie 3 (Google DeepMind) generate interactive 3D worlds up to one minute long from a single image prompt. NVIDIA Cosmos scales to 14B parameters as a video-generation world foundation model targeting autonomous driving simulation on Blackwell GPUs. Autoregressive models scale naturally with compute but suffer a sequential generation bottleneck: each token depends on the previous one, making latency proportional to sequence length.
Diffusion models generate frames by iterative denoising. UniSim (Google DeepMind) simulates realistic experience from actions in diverse environments. DriveDreamer integrates HD maps and 3D bounding boxes as conditioning for precise scenario generation. The fidelity advantage is real, but the cost is 20 to 1,000 denoising steps per generated frame, multiplying compute by that factor relative to single-pass prediction — the same iterative cost that drives the VDX-1 chip proposal. For real-time planning, this is often prohibitive.
Recurrent state-space models (RSSMs) compress history into a fixed-size latent state and advance it one step at a time. DreamerV3 (12M to 400M parameters) famously collected a diamond in Minecraft from scratch using a single A100 over nine days, unrolling 16 imagination steps per training batch across thousands of parallel trajectories. MILE applies the same paradigm to end-to-end driving. RSSMs are more parameter-efficient and fundamentally cheaper per imagination step, but historically harder to scale to photorealistic generation.
JEPA (Joint Embedding Predictive Architecture), proposed by LeCun with concrete instantiations in I-JEPA and V-JEPA, predicts in abstract representation space without ever reconstructing pixels. This yields lightweight inference — a property exploited by the JEPA-R edge chip proposal — but requires downstream tasks to operate on frozen representations rather than generated images or video.
Key Systems at a Glance
| System | Params | Architecture | Compute | Achievement |
|---|---|---|---|---|
| DreamerV3 | 12M—400M | RSSM (GRU + stochastic latent) | 1 A100, 9 days | Minecraft diamond from scratch |
| GAIA-1 | 9.4B (6.5B world model) | Autoregressive transformer | Multi-GPU cluster | Emergent driving dynamics |
| Genie 2/3 | Undisclosed (large) | Tokenizer + dynamics + decoder | High-end GPUs | 1-minute interactive 3D worlds |
| Cosmos | 2B—14B | Video generation platform | Blackwell GPUs | 30-second predictive video |
| UniSim | Large | Diffusion | Multi-GPU | Simulated experience from actions |
| pi0 | 3.3B | VLA with flow matching | Single GPU | 50 Hz robot control across 7 embodiments |
Why World Models Break Existing Hardware
Five properties combine to make world models uniquely demanding:
- Prediction-imagination-planning loops. Each training or inference cycle interleaves encoding, multi-step forward rollouts, reward estimation, and action selection — a deeply heterogeneous pipeline.
- Video-scale I/O. Cosmos and Genie operate on high-resolution video with hundreds of frames of temporal context.
- Multi-step diffusion. Diffusion-based world models multiply per-frame compute by the number of denoising steps.
- Model scale. GAIA-1 at 9.4B (total system) and Cosmos at 14B parameters rival the largest language models.
- Continuous learning. Reinforcement learning settings interleave data collection with model training and imagination-based policy updates, demanding sustained throughput rather than burst compute.
Hardware for Planning and Search
Planning — selecting actions by imagining their consequences — is the operation that most sharply distinguishes world model inference from standard neural network inference. Two paradigms dominate, each with a distinctive hardware profile.
MCTS with Learned Models: The MuZero Legacy
AlphaGo used 1,920 CPUs and 280 GPUs to defeat Lee Sedol (Elo 3,739). AlphaGo Master reduced that to 4 TPUs while achieving a far higher Elo (4,858 vs. 3,739). MuZero generalized the approach to games without known rules, using 16 TPU v3s for training and 1,000 TPUs for self-play, running 800 simulations per move. Each simulation requires a dynamics-network forward pass at every tree node expansion, making MCTS fundamentally limited by inference latency and throughput.
Gumbel MuZero directly addresses this hardware bottleneck: by replacing PUCT with Gumbel-based sampling without replacement, it significantly improves performance when planning with few simulations — reducing the number of forward passes required per decision by an order of magnitude in some settings. This is a software optimization designed around hardware constraints.
The structural problem remains: MCTS tree management logic runs on CPUs while neural evaluation runs on accelerators. The serialization between them creates an imagination bottleneck where the accelerator is idle during tree traversal and the CPU is idle during neural evaluation.
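The selection and backup steps that surround each dynamics-network call can be sketched as below. This is a minimal single-agent illustration of a PUCT-style rule in its simplified AlphaZero form, not MuZero's exact formula; `Node`, `puct_select`, `backup`, the `sqrt(total + 1)` smoothing, and the constant `c_puct = 1.25` are assumptions made for the example.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                      # policy prior P(s, a) from the network
    visits: int = 0                   # visit count N(s, a)
    value_sum: float = 0.0            # sum of values backed up through this node
    children: dict = field(default_factory=dict)

    def q(self) -> float:
        # Mean backed-up value; unvisited nodes default to 0.
        return self.value_sum / self.visits if self.visits else 0.0

def puct_select(parent: Node, c_puct: float = 1.25):
    """Return (action, child) maximizing Q + exploration bonus."""
    total = sum(child.visits for child in parent.children.values())

    def score(child: Node) -> float:
        bonus = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.q() + bonus

    return max(parent.children.items(), key=lambda item: score(item[1]))

def backup(path, value: float) -> None:
    """Propagate a leaf value estimate to every node on the search path."""
    for node in path:
        node.visits += 1
        node.value_sum += value
```

The selection loop is branchy, pointer-chasing scalar code, while the leaf evaluation it triggers is a dense network forward pass, which is why the two halves tend to land on different processors.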
MPC-Style Planning: CEM and MPPI
Cross-Entropy Method (CEM) and Model Predictive Path Integral (MPPI) planning sample hundreds of candidate action sequences in parallel and roll them forward through learned dynamics. This is more GPU-friendly than MCTS: no tree structure, just a large batch of independent rollouts. TD-MPC2 scales to 317M parameters across 80 tasks using this approach.
The hardware profile is memory-bandwidth-bound rather than latency-bound: storing and scoring hundreds of parallel trajectories dominates over the per-step compute cost.
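A minimal CEM planning loop makes the batch structure visible. This is a sketch under stated assumptions: `toy_dynamics` and `toy_reward` are hypothetical stand-ins for a learned dynamics and reward model, and the population size, elite count, and iteration count are illustrative values rather than settings from TD-MPC2 or any other system.

```python
import numpy as np

def cem_plan(dynamics, reward, state, horizon=10, pop=256, elites=32,
             iters=5, action_dim=1, seed=0):
    """Cross-Entropy Method: refit a Gaussian over action sequences to the elites."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences: shape (pop, horizon, action_dim).
        actions = mean + std * rng.standard_normal((pop, horizon, action_dim))
        states = np.repeat(state[None, :], pop, axis=0)
        returns = np.zeros(pop)
        # Roll every candidate forward in lockstep: one batched model call per step.
        for t in range(horizon):
            states = dynamics(states, actions[:, t])
            returns += reward(states, actions[:, t])
        # Keep the best candidates and refit the sampling distribution.
        elite = actions[np.argsort(returns)[-elites:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean[0]  # MPC style: execute only the first action, then replan

# Hypothetical toy model: a 1-D point that should move from x = 0 toward x = 1.
def toy_dynamics(s, a):
    return s + 0.1 * np.clip(a, -1.0, 1.0)

def toy_reward(s, a):
    return -np.abs(s[:, 0] - 1.0)

first_action = cem_plan(toy_dynamics, toy_reward, np.zeros(1))
```

Every rollout is independent, so the inner loop is a single batched call per time step. The working set is the `(pop, horizon, action_dim)` action tensor plus one latent state per candidate, which is where the memory traffic comes from.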
Planning Hardware Requirements
| Method | Compute Pattern | Key Bottleneck |
|---|---|---|
| MCTS (MuZero) | Sequential tree expansion, batched leaf eval | Inference latency per simulation |
| CEM/MPPI | Massively parallel independent rollouts | Memory bandwidth for trajectory storage |
| Dreamer imagination | Batched latent rollouts with backprop | Latent state memory for long horizons |
| Ensemble models | K parallel independent forward passes | Memory for K model copies |
Real-Time Robotics: The Edge Constraint
World models for physical systems must run on the robot, not in the cloud. Network round-trip latency (20 to 100 ms) by itself consumes or exceeds the 5 to 20 ms sensor-to-action budget of dexterous manipulation, before any inference runs.
The Latency Budget
| Domain | Control Frequency | Latency Budget |
|---|---|---|
| Dexterous manipulation | 50—200 Hz | 5—20 ms sensor-to-action |
| Mobile navigation | 10—50 Hz | 20—100 ms |
| Humanoid locomotion | 200—500 Hz | 2—5 ms (inner loop classical) |
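The latency column follows directly from the control frequency if the budget is taken to be one control period, which is a simplifying assumption made here. The helper names `cycle_budget_ms` and `fits_offboard` are hypothetical.

```python
def cycle_budget_ms(control_hz: float) -> float:
    """One control period in milliseconds (assumed equal to the latency budget)."""
    return 1000.0 / control_hz

def fits_offboard(control_hz: float, network_rtt_ms: float,
                  inference_ms: float = 0.0) -> bool:
    """Can a cloud round trip plus inference fit inside one control period?"""
    return network_rtt_ms + inference_ms <= cycle_budget_ms(control_hz)

# Frequencies taken from the endpoints of the table rows above.
budgets = {hz: cycle_budget_ms(hz) for hz in (10, 50, 200, 500)}
```

Even the best-case 20 ms round trip leaves no time for inference at 50 Hz and overshoots the period entirely at 200 Hz.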
pi0 (Physical Intelligence) represents the current state of the art: 3.3B parameters, 10 forward Euler steps per action chunk, 50 Hz output via action chunking on a single GPU. It controls seven different robot embodiments with a unified model.
Current Edge Hardware
| Platform | AI Performance | Power | Target |
|---|---|---|---|
| Jetson AGX Orin | 275 TOPS | 15—60 W | Heavy manipulation, mobile robots |
| Jetson Thor | 2,070 TFLOPS (FP4) | 40—130 W | Humanoids, VLA models |
| Tesla FSD HW3 | 144 TOPS (dual) | 72 W | Autonomous driving |
| Tesla FSD HW4 | ~300—500 TOPS | 100 W | Next-gen driving |
| Tesla AI5 (planned) | ~3,000—5,000 TOPS | 800 W | Optimus humanoid, L4/L5 driving |
| Google Edge TPU | 4 TOPS | 2 W | Lightweight perception only |
The gap that matters: Tesla considers HW4 insufficient for autonomous Optimus operation. Humanoid-grade world models plausibly require 500+ TOPS at 15 W with deterministic latency. That chip does not exist. Jetson Thor at 2,070 TFLOPS comes closest but draws 40—130 W — too much for a battery-powered humanoid running all day. AI5 at 800 W is a datacenter-in-a-car, not a mobile robotics solution. The industry needs an order-of-magnitude improvement in TOPS-per-watt for world model inference at the edge.
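The efficiency gap can be made concrete with a rough calculation from the table above. Treat this as a sketch only: the figures mix INT8 TOPS with FP4 TFLOPS, power is taken at the top of each quoted range, and peak ratings say nothing about achieved utilization.

```python
# (peak AI throughput, max power in watts), copied from the table above.
platforms = {
    "Jetson AGX Orin": (275, 60),
    "Jetson Thor (FP4)": (2070, 130),
    "Tesla FSD HW3": (144, 72),
    "Google Edge TPU": (4, 2),
}

def tops_per_watt(tops: float, watts: float) -> float:
    return tops / watts

# Target from the text: 500+ TOPS at 15 W.
target_eff = tops_per_watt(500, 15)

# How many times more efficient each platform would need to become.
gaps = {name: target_eff / tops_per_watt(*spec) for name, spec in platforms.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:20s} needs {gap:4.1f}x better TOPS/W")
```

On these numbers the target works out to roughly 33 TOPS/W, about 7x beyond Orin at full power and about 2x beyond Thor's FP4 rating.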
Multi-Modal Fusion Hardware
World models for embodied AI must ingest camera frames (30 Hz), lidar sweeps (10—20 Hz), IMU and joint encoders (100—1,000 Hz), language commands (sporadic), and proprioception simultaneously. Each modality arrives at a different rate, in a different representation, and with a different token count. A vehicle running ViT-Large on 8 cameras at 30 Hz demands approximately 14.8 TFLOPS just for vision encoding — before any cross-modal attention, prediction, or planning.
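The vision-encoding figure can be reproduced with simple arithmetic. The per-frame cost used here, roughly 61.6 GFLOPs for ViT-L/16 on a 224x224 input, is a commonly quoted figure and an assumption about how the estimate was derived; some tables report the same count as multiply-accumulates, which would double the FLOP total, and production camera resolutions would raise it much further.

```python
CAMERAS = 8
FRAME_RATE_HZ = 30
# Assumed per-frame cost of ViT-L/16 at 224x224 (not stated in the text).
VIT_LARGE_FLOPS_PER_FRAME = 61.6e9

frames_per_second = CAMERAS * FRAME_RATE_HZ
vision_tflops = frames_per_second * VIT_LARGE_FLOPS_PER_FRAME / 1e12

print(f"{frames_per_second} frames/s -> {vision_tflops:.1f} TFLOPS sustained")
```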
Production Multi-Modal SoCs
Tesla FSD HW3 (2019) remains the canonical example of purpose-built multi-modal driving silicon: Samsung 14 nm, approximately 260 mm² die area. Each SoC carries two independent NNAs, each with a 96x96 multiply-accumulate array operating on 8-bit integers at 2 GHz, delivering about 36 TOPS per NNA (roughly 72 TOPS per SoC, or 144 TOPS across the dual-SoC board). The key architectural insight is 32 MB of SRAM per NNA, large enough to hold entire neural network layer weights on-die, which avoids DRAM round-trips and keeps MAC utilization near peak. A dedicated ISP handles 8 cameras simultaneously at 2.5 Gpixels/s. Dual-redundant SoCs on one board provide functional safety: if both agree, the action executes; disagreement triggers fallback.
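The per-NNA throughput follows from the array dimensions and clock. The sketch below assumes the usual convention of counting each multiply-accumulate as two operations; the function name is hypothetical.

```python
def systolic_peak_tops(rows: int, cols: int, clock_hz: float,
                       ops_per_mac: int = 2) -> float:
    """Peak throughput of a MAC array, counting one MAC as two operations."""
    return rows * cols * ops_per_mac * clock_hz / 1e12

nna_tops = systolic_peak_tops(96, 96, 2e9)
two_nna_tops = 2 * nna_tops

print(f"per NNA: {nna_tops:.2f} TOPS, two NNAs: {two_nna_tops:.2f} TOPS")
```

The result is slightly above the rounded 36 and 72 TOPS figures quoted in the text.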
HW4 (2023) moved to Samsung 7 nm, doubled RAM to 16 GB, and added high-definition radar. Musk claimed 3—8x HW3 compute, enabling end-to-end neural network processing of all cameras simultaneously rather than per-camera feature extraction. AI5 (estimated 2027) is the generational leap, consuming up to 800 W — an acknowledgment that world model inference at L4/L5 is fundamentally more compute-hungry than the convolutional perception stacks of earlier hardware.
NVIDIA DRIVE Thor (2025+) represents a different philosophy: a general-purpose transformer accelerator for automotive. TSMC 4NP Blackwell architecture, 2,560 CUDA cores, 1,000 sparse INT8 TOPS, and critically 128 GB of LPDDR5X — four times Orin’s memory, explicitly sized to hold billion-parameter transformer world models in-vehicle. The Transformer Engine with native FP8 support targets the attention-heavy BEV transformer architectures that Orin struggles to run at full resolution. Two Thor SoCs interconnect via NVLink-C2C for 2,000 TOPS configurations.
Mobileye EyeQ6 to EyeQ Ultra traces a different path: specialized, power-efficient accelerators. EyeQ6 ships at 16 TOPS on 7 nm targeting L2+; EyeQ Ultra at 176 TOPS on 5 nm targets L4 with dedicated accelerators for deep learning, classical computer vision, and general-purpose compute on a single die. Over 27 OEMs use EyeQ silicon.
Apple Vision Pro R1 offers a template outside driving. A dedicated sensor fusion co-processor handles 12 cameras, 5 sensors, and 6 microphones simultaneously at 12 ms photon-to-photon latency. By isolating sensor fusion on a chip with its own hard real-time OS, Apple guarantees that application load on the M2 never causes sensor fusion jitter. This dual-chip isolation pattern — a dedicated real-time co-processor alongside a general-purpose application processor — is the cleanest existing solution to the asynchronous multi-modal input problem.
Recurring Hardware Patterns
Four patterns recur across production multi-modal SoCs:
- Dedicated ISP/sensor pipeline + central neural accelerator. Preprocessing (demosaic, distortion correction, point cloud voxelization) runs on fixed-function hardware while the neural network runs on a programmable accelerator. Tesla and NVIDIA DRIVE both use this.
- Dual-chip isolation for real-time guarantees. A dedicated sensor fusion chip guarantees hard real-time deadlines independent of the main application processor (Apple R1 pattern).
- Large on-chip SRAM to avoid the DRAM bottleneck. Tesla’s 32 MB per NNA and NVIDIA’s DLA SRAM buffers keep weights on-die. Future multi-modal chips will likely push to 64—128 MB.
- FP8 transformer engines. NVIDIA Thor and Hopper both feature FP8 arithmetic that halves memory footprint and doubles throughput versus FP16, directly enabling larger cross-modal attention windows within the same power envelope.
The BEV Transformer Bottleneck in Autonomous Driving
BEVFormer established the paradigm for modern autonomous driving perception: use spatial cross-attention to lift multi-camera 2D features into a unified bird’s-eye-view representation, then apply temporal self-attention across history frames. The result is a dense 3D understanding of the scene suitable for occupancy prediction and trajectory planning.
The hardware bottleneck is attention scale. A 200x200 BEV grid holds 40,000 queries per frame, so 8 history frames put 320,000 tokens in the temporal attention window, far beyond typical NLP sequence lengths. Solutions include deformable attention (attending to sparse learned reference points), sparse queries (VoxFormer, StreamPETR), and temporal compression into fixed-size latents. Tesla’s occupancy networks push this further: dense 3D voxel grids (200x200x16 at 0.5 m resolution, or 640,000 voxels) updated at 10 Hz across 8 cameras require billions of MACs per second.
These workloads explain the roughly 8x to 12x compute escalation from current shipping hardware (Orin at 254 TOPS, HW4 at ~400 TOPS) to next-generation silicon (dual Thor at 2,000 TOPS, AI5 at an estimated 3,000 to 5,000 TOPS). The architecture has shifted from CNN-era perception to transformer-era world modeling, and the silicon must follow.
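The token counts behind the BEV bottleneck are easy to reproduce, along with the reason sparse attention is attractive. The `SAMPLED_POINTS = 8` value below is a hypothetical setting used only to show the scaling; it is not taken from BEVFormer or any specific model.

```python
BEV_H, BEV_W = 200, 200
HISTORY_FRAMES = 8
VOXEL_HEIGHT = 16

queries_per_frame = BEV_H * BEV_W                   # one query per BEV cell
window_tokens = queries_per_frame * HISTORY_FRAMES  # tokens across the history
voxels = BEV_H * BEV_W * VOXEL_HEIGHT               # dense occupancy grid cells

# Dense attention scores every token pair; deformable attention scores
# only a few sampled points per token (assumed count, for illustration).
SAMPLED_POINTS = 8
dense_pairs = window_tokens ** 2
sparse_pairs = window_tokens * SAMPLED_POINTS

print(f"dense: {dense_pairs:.2e} pairs, sparse: {sparse_pairs:.2e} pairs")
```

Dense attention over the full window means about 10^11 query-key pairs per layer, while the sparse variant grows linearly in the token count.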
Autonomous Driving Chip Comparison
| Chip | TOPS (INT8) | Process | Power | Memory | Target Level | Status |
|---|---|---|---|---|---|---|
| Tesla HW3 | 144 (dual) | 14 nm | 72 W | 8 GB | L2—L3 | Shipping since 2019 |
| Tesla HW4 | ~300—500 | 7 nm | 100 W | 16 GB | L3—L4 | Shipping since 2023 |
| Tesla AI5 | ~3,000—5,000 | 4—5 nm est. | 800 W | TBD | L4—L5 | Est. 2027 |
| NVIDIA Orin | 254 (sparse) | 8 nm | 100 W | 32 GB | L2—L4 | Shipping since 2022 |
| NVIDIA Thor | 1,000 (sparse; 2,000 dual) | TSMC 4NP | ~300—500 W | 128 GB | L3—L5 | Sampling 2025 |
| Mobileye EyeQ6 | 16 | 7 nm | 60 W | LPDDR5X | L2+ | Shipping 2024 |
| Mobileye EyeQ Ultra | 176 | 5 nm | ~100 W | LPDDR5X | L4 | 2025—2026 |
SSM and Mamba: Reshaping the Imagination Engine
The transformer’s O(n^2) attention cost and ever-growing KV cache make it increasingly awkward for the operation world models do most: stepping forward in time, step after step, for thousands or millions of imagination steps during planning. State space models offer a fundamentally different trade: O(n) total cost in sequence length n and constant memory per step, at the cost of a compressed fixed-size state rather than full-context access.
The Core Architectures
S4 (Gu et al., 2021) introduced three computational views of the same linear system: continuous (natural for signals), recurrent (O(1) per step at inference), and convolutional (O(L log L) via FFT for training). The convolutional view enabled fast GPU training, but required time-invariant parameters.
Mamba (S6, Gu and Dao, 2023) made SSM parameters input-dependent, enabling content-based selection analogous to attention — but with a compressed fixed-size state rather than full context storage. The hardware-aware implementation loads SSM parameters from HBM to SRAM, performs the entire selective scan in SRAM, and writes only final outputs back — avoiding materializing the expanded state tensor. This kernel fusion trick mirrors FlashAttention’s strategy for avoiding the O(n^2) attention matrix.
Mamba-2 (Dao and Gu, 2024) revealed the State Space Duality: SSMs and a class of structured attention are mathematically equivalent. The practical breakthrough is a chunk-wise matmul algorithm that hits tensor cores — fixing Mamba-1’s critical hardware weakness. Mamba-1’s selective scan used element-wise operations that could not exploit tensor cores. On H100, the gap between matmul throughput (989 TFLOPS BF16) and element-wise throughput (67 TFLOPS FP32) means Mamba-1 was leaving over 90% of the chip’s peak compute on the table. Mamba-2 splits sequences into chunks of 64—128, computes intra-chunk outputs via batched matmul (tensor-core-friendly), passes states across chunk boundaries via a small scan on the reduced sequence, and finishes with another batched matmul. The result: 2—8x faster than Mamba-1, and the state dimension can grow from N=16 to N=64—256+ without penalty.
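The chunk-wise idea can be illustrated on a simplified recurrence. The sketch below is not the Mamba-2 kernel: it assumes a single channel with a scalar, strictly positive decay `a[t]`, runs on the CPU in NumPy, and uses the hypothetical names `scan_sequential` and `scan_chunked`. It shows only that the step-by-step recurrence and the chunked matmul form produce the same outputs.

```python
import numpy as np

def scan_sequential(a, B, C, x):
    """Reference recurrence: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def scan_chunked(a, B, C, x, chunk=64):
    """Same outputs via per-chunk matmuls plus one state hand-off per chunk."""
    T, N = B.shape
    h = np.zeros(N)                      # state carried across chunk boundaries
    y = np.empty(T)
    for s in range(0, T, chunk):
        e = min(s + chunk, T)
        la = np.cumsum(np.log(a[s:e]))   # log of cumulative decay inside the chunk
        # decay[i, j] = a_{j+1} * ... * a_i for i >= j, zero above the diagonal.
        decay = np.tril(np.exp(la[:, None] - la[None, :]))
        # Intra-chunk term: a masked (Q x Q) matmul, the tensor-core-friendly part.
        scores = (C[s:e] @ B[s:e].T) * decay
        # Inter-chunk term: contribution of the state entering this chunk.
        y[s:e] = scores @ x[s:e] + np.exp(la) * (C[s:e] @ h)
        # State leaving the chunk: decayed old state plus decayed inputs.
        h = np.exp(la[-1]) * h + B[s:e].T @ (np.exp(la[-1] - la) * x[s:e])
    return y

rng = np.random.default_rng(0)
T, N = 300, 16
a = rng.uniform(0.9, 0.999, T)           # positive decays so the log is defined
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
x = rng.standard_normal(T)
outputs_match = np.allclose(scan_sequential(a, B, C, x), scan_chunked(a, B, C, x))
```

The sequential loop over chunks touches only T / chunk state hand-offs; everything else is dense matrix arithmetic.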
SSMs for World Models: The R2I Breakthrough
R2I (Recall to Imagine) demonstrated the payoff concretely: integrating S4/S5-family SSMs into DreamerV3’s world model achieved up to 9x faster training than the GRU-based original. The constant-state property is what makes this possible:
| Property | Transformer | SSM (Mamba-class) |
|---|---|---|
| Per-step inference FLOPs | O(L) attention + O(1) FFN, L growing with history | O(1) state update |
| Per-step memory | O(L) KV cache per layer, growing linearly | O(N) fixed state per layer |
| 10,000-step imagination (horizon H = 10,000) | O(H^2) total attention FLOPs, ~5 GB KV cache | O(H) total FLOPs, ~1 MB constant |
| 1M-step simulation | Infeasible without windowing | Straightforward |
R2I handled 4,000-step episodes in Memory Maze, dramatically beyond GRU-based DreamerV3’s capacity, and achieved superhuman performance on complex memory tasks. A robot running a world model at 50 Hz with 200-step planning horizons needs roughly 10,000 imagination steps per second for each candidate trajectory (50 x 200). An SSM sustains that rate with constant memory, which is feasible on embedded hardware where a transformer would blow through memory budgets.
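The memory figures in the table depend on model shape, which the text does not specify. The sketch below uses hypothetical configurations chosen only because they land near the quoted values: a 32-layer transformer with d_model = 4096, one token per step, FP16, and no grouped-query attention, against a 32-layer SSM with 1,024 channels and state dimension N = 16.

```python
def kv_cache_bytes(steps: int, layers: int, d_model: int,
                   bytes_per_elem: int = 2) -> int:
    """Keys plus values for every past step in every layer."""
    return 2 * steps * layers * d_model * bytes_per_elem

def ssm_state_bytes(layers: int, channels: int, state_dim: int,
                    bytes_per_elem: int = 2) -> int:
    """Fixed recurrent state per layer, independent of rollout length."""
    return layers * channels * state_dim * bytes_per_elem

# Assumed shapes (not from the text), at a 10,000-step horizon.
kv = kv_cache_bytes(steps=10_000, layers=32, d_model=4096)
ssm = ssm_state_bytes(layers=32, channels=1024, state_dim=16)

print(f"KV cache: {kv / 1e9:.2f} GB, SSM state: {ssm / 2**20:.2f} MiB")
```

The KV cache scales with the `steps` argument, while the SSM state has no dependence on it at all. That independence matters more than the absolute sizes.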
Hybrid Architectures
Pure SSMs struggle with precise retrieval from long context (the “copying” problem). This motivates hybrid designs: Jamba (AI21) interleaves transformer and Mamba layers with mixture-of-experts, fitting on a single 80 GB GPU with 256K context. Samba (Microsoft) combines Mamba with sliding window attention, achieving 3.73x higher throughput than transformers at 128K context while extrapolating from 4K training length to 256K with perfect memory recall. Zamba (Zyphra) uses a Mamba backbone with a single shared attention module at 7B parameters. These hybrids create heterogeneous workloads: SSM layers are scan-heavy (Mamba-1) or chunk-matmul (Mamba-2), while attention layers are standard GEMM-heavy. A chip serving these models needs both efficient scan units and high-throughput matmul engines.
Hardware for SSM Inference
SSM inference, like transformer inference, is dominated by memory bandwidth — the bottleneck is loading model weights from HBM, not computing the scan. But SSMs eliminate the KV cache entirely, enabling larger batch sizes in the same memory footprint and constant cost per step regardless of sequence length. The emerging FPGA accelerator literature (SSMA, LightMamba, FastMamba, SpecMamba, MambaOPU, MARCA-v2, HCSAs, eMamba — eight papers in 2024—2025 alone) converges on common themes: INT8/INT4 quantized states, custom parallel scan units replacing general-purpose ALUs, on-chip state buffers that hold the SSM state permanently in SRAM, and reconfigurable dataflow to handle input-dependent control paths.
The case for custom silicon is clear: a chip optimized for SSM-based world model inference would prioritize large on-chip SRAM for model weights (not KV cache — there is none), dedicated hardwired scan units, and high HBM bandwidth for parameter streaming. This is architecturally quite different from a transformer inference chip.
ATLAS: A Purpose-Built World Model Processor
No existing chip is designed for the world model workload. GPUs are compute-overprovisioned and memory-bandwidth-starved for the small, sequential state updates that dominate imagination rollouts — a 10,000-step rollout at batch-1 utilizes less than 5% of an H100’s tensor cores. Driving SoCs (Tesla HW4, Mobileye EyeQ6) are optimized for perception, not recurrent multi-step prediction. LLM inference chips (Groq LPU, Etched Sohu) have no multi-modal input pipelines, no parallel rollout scheduling, and no mechanism for branching tree search.
ATLAS (Autonomous Terrain Learning and Simulation Processor) is a concrete architecture proposal for a chip purpose-built for the three phases of world model operation: sense (multi-modal encoding), imagine (parallel rollout generation), and plan (tree search over imagined futures).
Architecture Overview
A single-die SoC at approximately 450 mm^2 on TSMC N3E, comprising five functional blocks connected by a deterministic on-die mesh network:
Sensor Front End — Four 128x128 BF16 systolic array matrix tiles (aggregate ~196 TFLOPS BF16) with 256 KB SRAM per tile. A dedicated ISP handles 6 cameras at 30 fps. A hardware voxelizer converts lidar point clouds to BEV tokens. A proprioception FIFO handles IMU and joint encoders at 200 Hz. A cross-modal tokenizer projects all modalities into a shared D=1024 embedding space. The full sensor-to-latent pipeline completes in under 10 ms.
Prediction Engine — Two dedicated SSM scan units implementing hardware-wired parallel prefix scan for Mamba-2/S5-class models at state dimension N=256 across D=1024 channels. Two small attention tiles (64x64 BF16 systolic arrays) handle the occasional attention layers in hybrid SSM-attention models. Four MB of state SRAM holds the full SSM hidden state permanently on-die. Eight MB of weight cache streams model layers. Per-step latency: approximately 50 microseconds at N=256, D=1024 — dominated by parameter loads from the weight cache, confirming the memory-bound analysis.
Imagination Engine: 32 parallel rollout units (RUs), each containing a compact scan unit (N=64, D=256, a 4x-compressed version of the full model), 128 KB state SRAM, and a scalar ALU for reward computation. Four MB of shared SRAM is intended to hold the compressed rollout model on-chip, making rollout steps weight-stationary with zero HBM access. The quoted model size of approximately 50M parameters at FP8 is about 50 MB, however, so the rollout model must either shrink to roughly 4M parameters or spill into the 64 MB unified State Memory. At 50 microseconds per step per RU, 32 RUs running in parallel deliver 640,000 imagination steps per second. A search of 64 rollouts of 50 steps is 3,200 steps, or about 5 ms, which fits inside each 20 ms control cycle for real-time MCTS.
Planning Unit — Two RISC-V cores (RV64GCV with vector extensions) handle control-flow-heavy planning logic. A hardware MCTS accelerator with 4K-node tree memory performs UCB selection via a 32-wide comparator tree in a single cycle and value backpropagation in a 4-cycle pipeline. A hardware CEM sampler produces 32 Gaussian candidate action sequences per cycle with top-K fitness selection. One MCTS iteration (select, expand, rollout, backprop) takes approximately 3 microseconds; in 1 ms of planning budget, the unit completes approximately 300 iterations.
State Memory — 64 MB unified SRAM partitioned dynamically between model weights, persistent world state, and temporary buffers. Two HBM3E stacks provide 16 GB at 2 TB/s aggregate bandwidth for overflow weights, replay buffers, and map data.
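The throughput figures quoted for the Imagination Engine and Planning Unit can be checked with a few lines of arithmetic. The sketch assumes rollouts are scheduled in waves of 32 (one per RU) with no dispatch overhead, which is a simplification; the function name is hypothetical.

```python
import math

STEP_US = 50          # per-step latency per rollout unit, from the text
NUM_RUS = 32          # parallel rollout units
MCTS_ITER_US = 3      # one select/expand/rollout/backprop iteration

steps_per_second = NUM_RUS * 1_000_000 // STEP_US

def rollout_time_ms(rollouts: int, steps: int) -> float:
    """Wall time when rollouts run in waves of NUM_RUS (no overhead assumed)."""
    waves = math.ceil(rollouts / NUM_RUS)
    return waves * steps * STEP_US / 1000

time_32x50 = rollout_time_ms(32, 50)
time_64x50 = rollout_time_ms(64, 50)
mcts_iters_per_ms = 1000 // MCTS_ITER_US

print(steps_per_second, time_32x50, time_64x50, mcts_iters_per_ms)
```

Under these assumptions 32 rollouts of 50 steps take 2.5 ms and 64 rollouts take 5 ms, and a 1 ms planning budget allows about 333 MCTS iterations, consistent with the rounded figure of 300.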
The Two-Mode Control Cycle
ATLAS operates in two modes within each 20 ms control cycle:
Mode 1 — Sense and Predict (approximately 10—12 ms): Sensor Front End and Prediction Engine are active. Imagination Engine is power-gated. Ingest new sensor data, encode to latent representation, advance the world state by one step using the full-fidelity model. Compute character: GEMM-heavy (vision encoding) plus memory-bound sequential (SSM next-state).
Mode 2 — Imagine and Plan (approximately 8 ms): Imagination Engine (all 32 RUs) and Planning Unit are active. Sensor Front End matrix tiles can be repurposed as additional rollout capacity. Dispatch 32—64 candidate action sequences, roll out compressed world model in parallel, evaluate outcomes, run MCTS/CEM to select the best action. Compute character: embarrassingly parallel scan operations across 32+ independent state trajectories.
Performance Summary
| Phase | Operation | Latency |
|---|---|---|
| Sense | 6-camera ViT-Large encoding + lidar + fusion | 8 ms |
| Predict | Full-model single-step state advance | 1 ms |
| Imagine | 32 rollouts x 50 steps (compressed model) | 3 ms |
| Plan | 300 MCTS iterations over rollout results | 1 ms |
| Act | Action decode + safety check + bus write | 0.5 ms |
| Total | Sense through Act | 13.5 ms (6.5 ms margin in 20 ms cycle) |
| Block | Peak INT8 TOPS | Typical Power |
|---|---|---|
| Sensor Front End (4 matrix tiles) | 392 | 15 W |
| Prediction Engine (2 scan + 2 attn) | 50 | 8 W |
| Imagination Engine (32 RUs) | 24 | 12 W |
| Planning Unit (2 RISC-V + accel.) | 1 | 3 W |
| State Memory + HBM + I/O | — | 22 W |
| Total chip | ~467 TOPS | ~60 W typical, ~80 W peak |
The raw TOPS number is modest compared to Thor’s 1,000 or projected AI5 figures. The design philosophy is fundamentally different: ATLAS maximizes utilization at the actual world model operating point (small batch, sequential prediction, parallel rollouts) rather than peak throughput on large-batch GEMM. Estimated utilization on world model workloads is 60—80% versus less than 10% for a GPU running the same workload.
Comparison with Existing Silicon
| Dimension | Tesla FSD HW4 | NVIDIA Thor | Mobileye EyeQ Ultra | ATLAS |
|---|---|---|---|---|
| Primary workload | Perception + e2e driving | Perception + world model (general) | Perception + mapping | World model: sense-imagine-plan |
| Process | Samsung 7 nm | TSMC 4NP | 5 nm | TSMC N3E |
| AI compute | ~300—500 TOPS | 1,000 TOPS | 176 TOPS | 467 TOPS |
| Parallel rollout units | None | None | None | 32 dedicated |
| SSM scan hardware | None | None | None | 2 full + 32 compact |
| Prediction latency (SSM step) | N/A | ~1—5 ms (GPU inference) | N/A | ~50 microseconds |
| Imagination throughput | N/A | N/A | N/A | 640K steps/s |
| On-chip state SRAM | 32 MB/NNA | ~20 MB | ~8 MB | 64 MB unified + 4 MB/engine |
| Safety features | ASIL-D (dual redundancy) | ASIL-D (lockstep) | ASIL-B | ASIL-B/D (watchdog + lockstep) |
| Power | ~100 W | ~300—500 W est. | ~100 W | 60—80 W |
The key differentiators: ATLAS is the only architecture with dedicated parallel rollout hardware, hardwired SSM scan units achieving 50-microsecond prediction steps, and a hardware MCTS accelerator — the three operations that define world model inference and that existing chips handle only through general-purpose programmable compute.
Scalability: Edge to Cloud
ATLAS is designed as a scalable building block:
| Configuration | Dies | Power | TOPS (INT8) | Model Size | Rollouts | Use Case |
|---|---|---|---|---|---|---|
| Edge | 1 (reduced, 8 RUs) | 25 W | 150 | 200M | 8 x 30 steps | Drone, small robot |
| Standard | 1 (full, 32 RUs) | 60—80 W | 467 | 500M—1B | 32 x 50 steps | Autonomous vehicle, humanoid |
| Cloud | 8 (UCIe chiplet link) | 500 W | 3,700 | 5—10B | 256 x 200 steps | Simulation, digital twins |
The edge configuration delivers about 6 TOPS/W (150 TOPS at 25 W), still well short of the roughly 33 TOPS/W implied by the 500+ TOPS at 15 W target for battery-powered humanoids. Closing that gap requires a further process shrink (N2 or A14) and aggressive voltage scaling, likely a 2028 to 2029 prospect.
Risks
The primary risk is betting on SSM-based world models when the architecture landscape is still evolving. Mitigation: scan units occupy less than 10% of die area. The matrix tiles and attention tiles handle transformer workloads competently, and the chip degrades gracefully to a conventional neural accelerator. The compressed rollout model may also prove too inaccurate for safety-critical planning; the prediction engine can run rollouts sequentially at full fidelity as a fallback. Software ecosystem bootstrapping without CUDA is the hardest commercial challenge, addressed through an MLIR-based compiler accepting standard PyTorch/JAX model definitions.
The Convergence Thesis
The autonomous driving and robotics industries are converging on a common architecture: multi-camera BEV transformers feeding an occupancy-based world model that jointly predicts the future and plans trajectories. The compute requirements (real-time attention over 320,000+ spatial-temporal tokens, 3D voxel prediction, multi-agent trajectory simulation, and branching imagination rollouts) explain the roughly 8x to 12x compute escalation from current shipping hardware to next-generation silicon.
The winners will be determined not just by raw TOPS but by TOPS per watt at the right precision (FP8/INT8 for transformers, INT4 for occupancy and compressed rollouts), memory bandwidth (BEV temporal buffers and parameter streaming are bandwidth-hungry), and effective utilization on the actual workload (most chips waste 90%+ of peak throughput on world model inference). Tesla bets on vertical integration and fleet data. NVIDIA bets on ecosystem breadth and CUDA lock-in. Mobileye bets on power efficiency. Purpose-built architectures like ATLAS bet that the world model workload is different enough from both LLM inference and perception that it deserves its own silicon.
The hardware world is at an inflection point. Transformer-optimized accelerators have dominated for five years. SSMs demand a different balance: less emphasis on KV cache, more on memory bandwidth and efficient scan primitives — a shift that connects to the nonlinear silicon thesis, where dynamical systems process continuous signals natively via analog ODE solvers rather than discretized digital scan operations. The eight FPGA accelerator papers published in 2024—2025 alone suggest the research community recognizes this gap. The first SSM-optimized ASICs for world model inference — combining high-bandwidth parameter streaming, on-chip state buffers, hardwired parallel scan units, and dedicated imagination rollout arrays — may define the next generation of embodied AI hardware.
Research compiled 2026-04-30. Specifications for unreleased products (Tesla AI5, NVIDIA Thor, Mobileye EyeQ Ultra) are based on public announcements and may change before production. ATLAS is a research architecture proposal, not a shipping product.
Additional Reading
- World Models — Ha & Schmidhuber 2018
- DreamerV3 — Hafner et al.
- MuZero — Schrittwieser et al.
- A Path Towards Autonomous Machine Intelligence (JEPA) — LeCun 2022
- Genie: Generative Interactive Environments — Bruce et al., DeepMind
- NVIDIA Cosmos — NVIDIA (77 authors)
- Mamba — Gu & Dao 2023
- pi0: Vision-Language-Action Flow Model — Black et al., Physical Intelligence
- NVIDIA Jetson Thor — NVIDIA
- GAIA-1: A Generative World Model for Autonomous Driving — Hu et al., Wayve
- R2I: Mastering Memory Tasks with World Models — Samsami et al.
- Mamba-2: State Space Duality — Dao & Gu 2024