JEPA-R: A Latent Prediction Chip for Robotics
Robots do not need to see the future. They need to predict it. The distinction is architectural, and it has direct consequences for silicon. Seeing the future means generating pixels — photorealistic video frames of what might happen next. Predicting the future means advancing a compressed state representation by one timestep in a learned latent space. The first operation costs 1,250-10,000 GFLOPs per frame. The second costs approximately 0.5 GFLOPs. That is a 2,500x to 20,000x gap, and it determines the shape of every chip in the robotics inference stack.
JEPA (Joint-Embedding Predictive Architecture) is the clearest instantiation of the prediction-only approach. Proposed by Yann LeCun in a 2022 position paper and implemented by Meta in I-JEPA (images) and V-JEPA (video), the architecture predicts in representation space using a transformer predictor. One encoder pass plus one predictor pass costs roughly 1.3 encoder-equivalents per prediction step. Diffusion world models pay roughly 50 encoder-equivalents for the same operation because they must iteratively denoise pixel-level outputs. Autoregressive world models pay proportionally to token count, with KV-cache memory growing linearly. JEPA pays once, in latent space, with no pixels on the critical path.
This article proposes JEPA-R: a purpose-built edge ASIC for JEPA-family latent prediction in robotic control loops. It then argues that JEPA-R and VDX-1 (the video diffusion rendering chip proposed in this series) form a complementary chip pair — not competitors, but co-processors in a complete embodied AI stack. The robot ships with JEPA-R. The fleet management system ships with VDX-1.
Why JEPA for Robotics Hardware
V-JEPA 2, released by Meta in 2025, is the most concrete evidence that JEPA-family models work on physical robots. The action-conditioned variant, V-JEPA 2-AC, was pretrained on VideoMix22M (over 1 million hours of internet video), then fine-tuned on just 62 hours of unlabeled robot video from the DROID dataset, with no task labels, no reward signals, and no task-specific data collection. Deployed zero-shot on Franka Emika Panda arms with RobotiQ grippers across two independent labs, V-JEPA 2-AC achieved:
- 100% success on single-goal reaching (end-effector within 4 cm of target)
- 73% success on zero-shot pick-and-place
- 65% success on grasping (averaged across two labs)
The comparison to diffusion-based world models is stark. On the same manipulation tasks, NVIDIA Cosmos — a diffusion-based video generation world model — achieved 80% on reaching but 0-20% on object interaction tasks (grasping and pick-and-place) while requiring approximately 4 minutes per action. V-JEPA 2-AC required 16 seconds per action. That is a 15x latency advantage with dramatically better success on contact-rich tasks. The latency gap is not a software optimization problem. It is structural: diffusion models must execute 20-50 serial denoising passes through a multi-billion parameter backbone per generated frame. JEPA executes a single forward pass through a 300M-parameter predictor.
Three properties make JEPA uniquely suited to dedicated robotics silicon:
Deterministic, bounded latency. Each prediction is a single forward pass: no iterative refinement, no rejection sampling, no variable-length generation. On a statically scheduled datapath, worst-case execution time is therefore close to best-case execution time. For safety-critical control near humans, this is not a convenience; it is a prerequisite. General-purpose GPUs make a hard worst-case bound difficult to establish, because dynamic thread scheduling and memory contention introduce run-to-run timing variation.
Minimal memory footprint. The latent state produced by V-JEPA 2’s ViT-g encoder is a 16x16 spatial grid of 1408-dimensional features: approximately 360K floats, or 1.4 MB per frame in FP32 (720 KB in FP16). Compare this to autoregressive world models, where a 16-frame rollout at 256 tokens per frame accumulates roughly 256 MB of KV-cache. The 180x memory reduction means the entire working set fits in on-chip SRAM, eliminating HBM bandwidth as the bottleneck.
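The footprint figures above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in the paragraph; the 256 MB KV-cache figure is taken as given, not derived, and megabytes are decimal.

```python
# Latent-state footprint for one frame, using the figures quoted above.
GRID_H, GRID_W = 16, 16   # spatial grid of patch features
FEATURE_DIM = 1408        # ViT-g feature width
KV_CACHE_MB = 256         # quoted figure for a 16-frame autoregressive rollout

def latent_bytes(bytes_per_value: int) -> int:
    """Bytes needed to store one frame's latent grid at a given precision."""
    return GRID_H * GRID_W * FEATURE_DIM * bytes_per_value

values = GRID_H * GRID_W * FEATURE_DIM
fp32_mb = latent_bytes(4) / 1e6
fp16_kb = latent_bytes(2) / 1e3
reduction = KV_CACHE_MB / fp32_mb

print(f"{values:,} values, {fp32_mb:.2f} MB FP32, {fp16_kb:.0f} KB FP16, "
      f"~{reduction:.0f}x smaller than the KV-cache")
```

The exact ratio comes out slightly under 180x, which matches the rounded figure in the text.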
Narrow predictor bottleneck. The V-JEPA 2-AC predictor is a 24-layer transformer with a 1024-dimensional hidden size and 16 attention heads, approximately 300M parameters. (The original V-JEPA used a narrower 12-layer, 384-dim predictor; V-JEPA 2-AC scales this up significantly.) The predictor is where the “world modeling” actually happens: it takes the current encoded state plus action inputs and outputs the predicted next-state representation. The encoder is the expensive part (60-70% of total FLOPs), but it processes raw sensor input only once per timestep. The predictor runs once per imagination rollout step, and imagination rollouts are where planning happens. At 300M parameters, the predictor is roughly 30% the size of the 1B-parameter encoder. At INT8 that is about 300 MB, too large to hold entirely in tens of megabytes of on-chip SRAM, but small enough that its hot layers and activations can stay on-chip while the remaining weights stream from external memory.
The Compute Profile
Understanding exactly where FLOPs go in a JEPA inference pass is essential for chip architecture. The pipeline has three stages with radically different compute characteristics:
Stage 1: ViT encoding (60-70% of total FLOPs). The V-JEPA 2 encoder is a ViT-g/16 with approximately 1B parameters. It processes raw camera frames by patchifying them into 2x16x16 spatiotemporal tubelets, then running self-attention over the resulting patch tokens. For a single 256x256 RGB frame, this produces a 16x16 grid of 1408-dimensional feature vectors. Self-attention over patch tokens dominates: the quadratic attention cost over ~256 tokens per frame is the single largest compute block. With 3D Rotary Position Embeddings (RoPE) for spatial-temporal reasoning, the encoder represents the bulk of the arithmetic workload. For multi-camera setups (6 cameras on a humanoid), this stage scales linearly with camera count.
Stage 2: Predictor (~30% of encoder compute). The V-JEPA 2-AC predictor is a 24-layer transformer with a 1024-dimensional hidden size and 16 attention heads, roughly 300M parameters. (The original V-JEPA used a narrower 12-layer, 384-dim predictor.) It takes encoded visual features, proprioceptive state (7D end-effector pose: 3D position, 3D orientation, 1D gripper state), and a candidate action vector as input. All modalities are projected into the same 1024-dim embedding space and fused via block-causal attention. The predictor outputs a predicted next-state representation in a single forward pass. Per-layer compute is approximately (1024/1408)^2 ≈ 53% of an encoder layer’s cost at equal token count, and the predictor has 24 layers vs. the encoder’s ~40 layers. In aggregate, the predictor costs roughly 0.53 x 24/40 ≈ 0.3 encoder-equivalents, so a full encode-plus-predict step is roughly 1.3 encoder-equivalents, still far cheaper than diffusion’s 50x.
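The relative cost of the predictor follows from the layer widths and counts quoted above. The sketch below assumes per-layer cost scales with the square of the hidden width (true of the weight matrix multiplies, less so of the attention scores) and that both networks see the same number of tokens; both are simplifying assumptions.

```python
# Rough predictor-vs-encoder cost, from the widths and depths quoted above.
ENCODER_DIM, ENCODER_LAYERS = 1408, 40      # ViT-g, ~40 layers
PREDICTOR_DIM, PREDICTOR_LAYERS = 1024, 24  # V-JEPA 2-AC predictor

# Assumption: per-layer cost grows with width squared at equal token count.
per_layer_ratio = (PREDICTOR_DIM / ENCODER_DIM) ** 2
predictor_vs_encoder = per_layer_ratio * PREDICTOR_LAYERS / ENCODER_LAYERS
encode_plus_predict = 1.0 + predictor_vs_encoder

print(f"per-layer ratio:       {per_layer_ratio:.2f}")
print(f"predictor vs. encoder: {predictor_vs_encoder:.2f}")
print(f"encode + predict:      {encode_plus_predict:.2f} encoder-equivalents")
```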
Stage 3: Planning (CEM optimization in latent space). V-JEPA 2-AC uses the Cross-Entropy Method with 800 samples and 10 refinement iterations. Each CEM sample requires rolling the predictor forward through the action sequence. The planning step is embarrassingly parallel across samples (all 800 candidates are independent), but sequential across refinement iterations. Counting one predictor forward pass per sample (a one-step rollout; longer horizons multiply the count by the horizon length), 800 samples x 10 iterations = 8,000 predictor evaluations per action. At 0.5 GFLOPs per evaluation, the total planning cost is approximately 4 TFLOPs per action. That is substantial, but it is amortized over the ~1 Hz action replanning rate when using action chunking.
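To make the planning loop concrete, here is a minimal sketch of the Cross-Entropy Method with the sample and iteration counts quoted above. Everything else is an assumption for illustration: `toy_predictor` is a hypothetical stand-in for the learned predictor (it simply adds the action to the state), the score is the L1 distance to a goal latent, the elite count of 80 is arbitrary, and the planning horizon is a single step.

```python
import random
import statistics

def toy_predictor(state, action):
    """Hypothetical stand-in for the learned predictor: next latent = state + action."""
    return [s + a for s, a in zip(state, action)]

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cem_plan(state, goal, n_samples=800, n_iters=10, n_elite=80, seed=0):
    """Cross-Entropy Method over a single action (planning horizon of 1)."""
    rng = random.Random(seed)
    mean = [0.0] * len(state)
    std = [1.0] * len(state)
    evaluations = 0
    for _ in range(n_iters):
        # Sample candidate actions from the current Gaussian.
        candidates = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(n_samples)]
        # Rank candidates by how close the predicted latent lands to the goal.
        ranked = sorted(candidates,
                        key=lambda a: l1_distance(toy_predictor(state, a), goal))
        evaluations += n_samples  # one predictor call per candidate
        elites = ranked[:n_elite]
        # Refit the sampling distribution to the elite set.
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    return mean, evaluations

action, evaluations = cem_plan(state=[0.0, 0.0, 0.0], goal=[0.5, -0.3, 0.2])
planning_tflop = evaluations * 0.5 / 1000  # 0.5 GFLOPs per evaluation, per the text
print(evaluations, "predictor evaluations,", planning_tflop, "TFLOPs per action")
```

The samples within one iteration are independent, which is what makes the inner list comprehension a natural target for parallel hardware; the outer loop cannot be parallelized because each refit depends on the previous ranking.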
Critically absent from this pipeline: no VAE decoder, no denoising loop, no autoregressive token generation. The EMA target encoder update (exponential moving average of encoder weights) is a training-only operation — at inference, only the online encoder and the predictor run. The entire inference-time architecture is two sequential matrix-multiply chains (encoder, then predictor) followed by a parallel optimization (CEM).
JEPA-R: The Chip Proposal
JEPA-R is a purpose-built edge inference ASIC for JEPA-family latent prediction in robotic control loops. The target is a 15W chip that replaces the GPU in the sense-predict-plan-act pipeline while providing deterministic latency guarantees that no general-purpose GPU can offer.
Specification
| Parameter | JEPA-R Target | Rationale |
|---|---|---|
| Compute | 20-40 TOPS (INT8) | ViT-g encoding at 30 Hz across 6 cameras + predictor + CEM |
| On-chip SRAM | 32-64 MB | Full predictor weights (300M params at INT8 = 300 MB — SRAM holds hot layers) + encoder activation scratch + latent state buffer |
| Weight memory | 8-16 GB LPDDR5X | Full ViT-g encoder (1B params at INT8 = ~1 GB) + predictor + margin |
| Memory bandwidth | 100-200 GB/s | Sufficient for ViT-g weight streaming at 30 Hz (1 GB x 30 = 30 GB/s sustained) |
| Power | 10-15W TDP | Mobile manipulator / humanoid compute budget |
| Latency | <10 ms sensor-to-prediction (deterministic WCET) | 50 Hz control with 10 ms margin for planning + actuation |
| ISP | 6-camera MIPI CSI-2, hardware demosaic + rectification | Humanoid-class multi-camera input |
| Safety | Hardware watchdog + lockstep fallback cores (ISO 13849 PLd target) | Collaborative robot deployment near humans |
| Process | TSMC N5 or N4 (mature, high-yield) | Edge economics demand cost-optimized silicon, not bleeding-edge density |
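The memory-bandwidth row can be sanity-checked with a short calculation. The sketch below uses the table's figures; whether the six cameras share one pass over the encoder weights or each trigger their own pass is an assumption the table does not settle, so both cases are shown.

```python
# Sustained weight-streaming bandwidth for the ViT-g encoder.
ENCODER_WEIGHTS_GB = 1.0  # 1B parameters at INT8
FRAME_RATE_HZ = 30
CAMERAS = 6

def streaming_gb_per_s(weight_gb: float, passes_per_second: float) -> float:
    """Bandwidth needed to re-read the full weight set once per pass."""
    return weight_gb * passes_per_second

# Assumption A: all cameras are batched, so weights stream once per frame period.
shared_pass = streaming_gb_per_s(ENCODER_WEIGHTS_GB, FRAME_RATE_HZ)
# Assumption B: weights are re-streamed for every camera (strict round-robin).
per_camera_pass = streaming_gb_per_s(ENCODER_WEIGHTS_GB, FRAME_RATE_HZ * CAMERAS)

print(f"shared pass:     {shared_pass:.0f} GB/s")
print(f"per-camera pass: {per_camera_pass:.0f} GB/s")
```

Under the per-camera assumption the requirement approaches the top of the 100-200 GB/s range, so how the encoder batches cameras matters for the memory system.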
Key Subsystems
ViT encoder pipeline. Eight matrix tiles, each a 128x128 INT8 systolic array with 256 KB local SRAM. At 1.5 GHz, a 128x128 array performs 16,384 MACs per cycle, so each tile peaks at ~49 TOPS INT8 and the eight tiles at ~393 TOPS in aggregate. That peak is roughly ten times the 20-40 TOPS target in the specification, so the tile dimensions, tile count, or clock would need to be scaled down to match it, unless the 20-40 TOPS figure is meant as sustained throughput. The tiles process patch tokens through the ViT-g encoder in a pipelined, layer-by-layer fashion: weights stream from LPDDR5X into local SRAM, activations remain on-chip between layers. The encoder pipeline processes one camera at a time in round-robin, or two cameras in parallel when compute budget allows. For 6 cameras at 30 Hz, the encoder must complete a full forward pass in approximately 5.5 ms per camera, which is tight but feasible at 20+ TOPS with INT8 quantization and structured pruning.
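The tile throughput and per-camera time budget can be worked out directly. This is a rough sketch under the usual assumption that each cell of a systolic array completes one multiply-accumulate (two operations) per cycle; with the tile size and clock quoted above, the peak lands in the tens of TOPS per tile.

```python
def systolic_peak_tops(rows: int, cols: int, clock_ghz: float) -> float:
    """Peak INT8 throughput of one systolic array, in TOPS."""
    macs_per_cycle = rows * cols
    ops_per_cycle = 2 * macs_per_cycle       # one multiply plus one add per MAC
    return ops_per_cycle * clock_ghz / 1000  # GOPS -> TOPS

tile_tops = systolic_peak_tops(128, 128, 1.5)
aggregate_tops = 8 * tile_tops

# Time available per camera when six 30 Hz cameras share one encoder.
per_camera_ms = 1000 / (6 * 30)

print(f"per tile:   {tile_tops:.1f} TOPS")
print(f"8 tiles:    {aggregate_tops:.1f} TOPS")
print(f"per camera: {per_camera_ms:.2f} ms")
```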
Predictor pipeline. A dedicated 1024-wide datapath optimized for the V-JEPA 2-AC predictor’s hidden dimension. Four matrix tiles (64x64 INT8) with 512 KB shared SRAM that holds the predictor’s hot working set. The predictor forward pass completes in under 0.5 ms at the target compute budget. During CEM planning, all four tiles execute predictor evaluations in parallel across candidate action sequences, achieving up to 4 concurrent rollouts.
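The figures in this paragraph imply a wall-clock time for one full CEM plan. The sketch below assumes the 800-sample, 10-iteration CEM configuration, one predictor pass per sample, and the upper bound of 0.5 ms per pass; `cem_wall_clock_s` is a hypothetical helper, not part of any toolchain.

```python
import math

def cem_wall_clock_s(samples: int, iterations: int,
                     concurrent: int, pass_ms: float) -> float:
    """Time for one CEM plan when `concurrent` rollouts run side by side."""
    waves_per_iteration = math.ceil(samples / concurrent)
    return iterations * waves_per_iteration * pass_ms / 1000

plan_s = cem_wall_clock_s(samples=800, iterations=10, concurrent=4, pass_ms=0.5)
print(f"{plan_s:.1f} s per plan, i.e. about {1 / plan_s:.0f} Hz replanning")
```

With four concurrent rollouts the plan takes about one second, which is consistent with a roughly 1 Hz replanning rate and far from a 50 Hz control rate.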
Multi-camera ISP. Integrated image signal processor handling 6 MIPI CSI-2 camera inputs at 30 Hz. Hardware demosaicing, lens distortion correction, stereo rectification, and auto-exposure feed directly into the ViT encoder’s patch extraction logic via DMA, eliminating CPU involvement in the sensor pipeline.
Hardware safety island. Dual lockstep ARM Cortex-R cores running a certified safety executive. A silicon watchdog timer monitors inference completion: if the ViT+predictor pipeline does not produce an action within the 10 ms deadline, the watchdog triggers an interrupt to the safety cores, which execute a pre-loaded impedance controller (gently stops the arm) or emergency stop. The safety island has its own power domain, clock, and memory — it cannot be starved by the AI accelerator. This is the robotics equivalent of what Mobileye’s EyeQ provides for autonomous driving.
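The watchdog behaviour can be modelled in software to make the control flow explicit. This is an illustrative sketch only: the names `Watchdog` and `Mode` are hypothetical, and the escalation rule (three consecutive missed deadlines latch an emergency stop) is an assumption, since the text does not say when the safety cores choose the impedance controller over the emergency stop.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Mode(Enum):
    NORMAL = "normal"
    IMPEDANCE_STOP = "impedance_stop"
    EMERGENCY_STOP = "emergency_stop"

@dataclass
class Watchdog:
    deadline_ms: float = 10.0        # inference deadline from the spec
    max_consecutive_misses: int = 3  # assumed escalation threshold
    misses: int = 0
    latched: bool = False            # emergency stop stays on until reset

    def check(self, inference_ms: Optional[float]) -> Mode:
        """Mode for this control cycle; None means no result arrived at all."""
        if self.latched:
            return Mode.EMERGENCY_STOP
        if inference_ms is not None and inference_ms <= self.deadline_ms:
            self.misses = 0
            return Mode.NORMAL
        self.misses += 1
        if self.misses >= self.max_consecutive_misses:
            self.latched = True
            return Mode.EMERGENCY_STOP
        return Mode.IMPEDANCE_STOP

watchdog = Watchdog()
modes = [watchdog.check(t) for t in (8.0, 12.0, None, 15.0, 9.0)]
print([m.value for m in modes])
```

Latching the emergency stop means a late recovery of the accelerator cannot silently resume motion; a real safety executive would require an explicit, supervised reset.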
Comparison to Jetson Thor
NVIDIA’s Jetson Thor is the closest existing product. The top SKU (T5000) provides 2,070 FP4 TFLOPS, 128 GB LPDDR5X at 273 GB/s, and a configurable 40-130W power envelope. It can run a 3B VLA model at 50 Hz, and NVIDIA’s Isaac software stack provides mature optimization paths.
But Jetson Thor has five structural gaps for robotics JEPA inference:
- No deterministic latency. GPU thread scheduling is stochastic. A context switch or memory contention spike can push inference from 15 ms to 40 ms. JEPA-R’s VLIW datapath provides statically analyzable worst-case execution time.
- No hardware safety island. Thor lacks ISO 13849 / IEC 61508 functional safety certification. The safety fallback must be built entirely in software.
- 40W power floor. Even the T4000 at 40W minimum exceeds the 10-15W budget for mobile manipulators and drones. JEPA-R targets 15W maximum.
- Overkill general-purpose silicon. Thor’s GPU, with 2,560 CUDA cores, carries graphics pipelines, ray tracing units, and other logic irrelevant to JEPA inference. This costs area and power.
- 273 GB/s memory bandwidth. Adequate for FP4, but tight for FP8/FP16. A 1B parameter model in FP16 at 50 Hz requires streaming ~100 GB/s sustained, leaving little margin for concurrent planning workloads.
JEPA-R trades generality for efficiency: roughly 3-8x lower power than Thor’s 40-130W envelope at the specific workload of latent prediction + CEM planning.
The 2,500-20,000x Rendering Asymmetry
The numbers that justify splitting prediction and rendering into separate chips:
| Operation | FLOPs per step | Steps per output | Total FLOPs |
|---|---|---|---|
| JEPA latent prediction | ~0.5 GFLOPs | 1 | 0.5 GFLOPs |
| Diffusion denoising (1 step, DiT-7B) | 50-200 GFLOPs | 1 | 50-200 GFLOPs |
| Full frame render (25-50 denoise steps) | 50-200 GFLOPs | 25-50 | 1,250-10,000 GFLOPs |
A single latent prediction costs 0.5 GFLOPs. Rendering that prediction into a single photorealistic video frame costs 1,250-10,000 GFLOPs. The ratio is 2,500x to 20,000x. Even aggressive step distillation (4-8 denoising steps, as demonstrated by consistency models) only reduces the rendering cost to 200-1,600 GFLOPs — still 400x to 3,200x more expensive than prediction.
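The ratios in the table and the paragraph above reduce to a few multiplications. A minimal sketch, using only the per-step and step-count ranges already quoted:

```python
PREDICTION_GFLOPS = 0.5
PER_STEP_GFLOPS = (50, 200)  # one denoising step, low and high estimate

def render_cost_range(steps_lo: int, steps_hi: int) -> tuple:
    """Total GFLOPs per frame: cheapest step count at the low estimate,
    most expensive step count at the high estimate."""
    return PER_STEP_GFLOPS[0] * steps_lo, PER_STEP_GFLOPS[1] * steps_hi

def ratio_range(cost_range: tuple) -> tuple:
    """Rendering cost expressed as a multiple of one latent prediction."""
    return tuple(cost / PREDICTION_GFLOPS for cost in cost_range)

full = render_cost_range(25, 50)     # standard sampling
distilled = render_cost_range(4, 8)  # step-distilled sampling

print("full render:     ", full, ratio_range(full))
print("distilled render:", distilled, ratio_range(distilled))
```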
V-JEPA proved this split is architecturally clean. The original V-JEPA paper trained a conditional diffusion decoder after pretraining, as a post-hoc add-on. The decoder received only the masked-region latent predictions with no access to visible context, and still produced spatially and temporally coherent video — object permanence, consistent motion, plausible textures. The decoder is architecturally decoupled from the predictor. They share no weights, no gradients, no runtime dependencies. They are literally different models that communicate through a latent vector interface.
NVIDIA’s Cosmos Transfer1 validates the rendering side at scale. Generating photorealistic 1280x704 video at 24 fps from simulation control signals (edge maps, depth, segmentation) requires 64 B200 GPUs on a GB200 NVL72 rack to achieve real-time throughput: 4.2 seconds to generate a 5-second clip of ~56,320 tokens. This is the VDX-1 workload profile: sustained, high-throughput, iterative denoising with cross-attention to control signals. It needs its own silicon because, by the estimate above, it is 2,500-20,000x heavier per frame than the prediction step running on JEPA-R.
When You Need Pixels
If the robot’s control loop never touches pixels, why build a rendering chip? Because the full embodied AI pipeline has six distinct moments where latent predictions must become visible images. All six are bursty or batch workloads — none are on the 50 Hz control path:
1. Training data generation (sim-to-real). The single largest consumer of rendered pixels. Cosmos Transfer1 demonstrated converting 20 simulated robot manipulation scenarios from NVIDIA Omniverse/Isaac Lab into photorealistic video with physically plausible dynamics, complex shading, and natural illumination. Separate modal weightings for robot foreground (edge + appearance) vs. background (segmentation) preserve kinematic accuracy while adding visual realism. RL policies trained on this rendered data transfer zero-shot to real robots. The rendering throughput requirement scales with fleet size and training cadence.
2. Human monitoring dashboards. A fleet operator cannot interpret latent vectors. They need video feeds, but at 5-10 fps, not 50 Hz. This is a low-duty-cycle rendering workload: the JEPA chip runs autonomously most of the time; the VDX-1 renders only the frames that humans request.
3. Digital twin synchronization. The JEPA chip predicts what will happen next in the digital twin’s latent space. The VDX-1 renders the predicted states into video for visual inspection, regression testing, and stakeholder review. A batch workload — rendering hours of predicted scenarios overnight.
4. Domain randomization. The JEPA chip generates a latent trajectory once. The VDX-1 renders N photorealistic variants under randomized lighting, reflections, colors, and scene assets. Cost multiplier: 1x prediction, Nx rendering. This is how you build robust training sets cheaply.
5. Failure replay and debugging. When a robot fails, engineers need to see what happened. JEPA-R logs compact latent state trajectories (kilobytes per second vs. megabytes per second for raw video). Post-hoc, VDX-1 renders the latent trajectory into photorealistic video for human review. Forensic, offline, quality-critical, not latency-sensitive.
6. VLM training data. V-JEPA 2 achieves 84.0 on PerceptionTest for video QA. Training these VLM heads requires pixel-grounded data. The JEPA chip imagines diverse scenarios in latent space; VDX-1 renders them into captioned video for VLM fine-tuning. This closes the loop: the rendering chip generates training signal for models that run on the prediction chip.
The JEPA-R + VDX-1 System Architecture
The two chips occupy fundamentally different positions in the robotics compute stack. JEPA-R is an edge chip that ships with every robot. VDX-1 is a datacenter/rack chip that serves an entire fleet.
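[Figure: JEPA-R + VDX-1 system architecture. Labels recovered from the diagram. Edge side (JEPA-R): camera input, 6 x 256×256 @ 30 Hz; ViT encoding, 60-70% of total FLOPs; predictor, single forward pass, 0.5 GFLOPs; planning, parallel rollouts in latent space; safety island, fallback impedance controller, ISO 13849 PLd. Link: latent state, 1.4 MB/frame, ~100 KB/s @ 50 Hz compressed. Rack side (VDX-1): cross-attention injection into DiT; 25-50 serial denoising steps per frame; per-modality transformer branches; 4K upscaler for digital twin; domain randomization and failure replay.]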
Data flow. JEPA-R runs continuously at 50 Hz, consuming sensor input and producing latent predictions and motor commands. It logs compact latent state trajectories to a circular buffer. VDX-1 sits mostly idle during normal operation. When the system needs pixels — a human opens the monitoring dashboard, the training pipeline requests synthetic data, or the failure analysis system triggers — VDX-1 reads latent states from the buffer, conditions the diffusion decoder on them, and renders photorealistic video.
Power budget. JEPA-R at 15W runs continuously (always-on prediction). VDX-1 at 475W runs episodically (duty cycle 1-10% in deployment, near-continuous during training data campaigns). At those duty cycles VDX-1 averages roughly 5-48W, and that draw is shared across every robot in the fleet it serves, so the per-robot power during deployment is dominated by JEPA-R, which is the right economics for edge.
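How the blended power works out depends on how many robots share one VDX-1. The sketch below uses the power and duty-cycle figures from the paragraph; the fleet sizes are hypothetical values chosen only to show the trend.

```python
def per_robot_power_w(edge_w: float, rack_w: float,
                      duty_cycle: float, robots_per_rack_chip: int) -> float:
    """Average power attributable to one robot: its own edge chip plus
    its share of the rack chip's time-averaged draw."""
    return edge_w + rack_w * duty_cycle / robots_per_rack_chip

for duty_cycle in (0.01, 0.10):
    for fleet_size in (1, 10, 100):  # hypothetical fleet sizes
        watts = per_robot_power_w(15, 475, duty_cycle, fleet_size)
        print(f"duty {duty_cycle:4.0%}, {fleet_size:3d} robots: {watts:6.2f} W per robot")
```

With a single robot at a 10% duty cycle the rack chip would dominate; the claim that JEPA-R dominates holds once the rack chip is shared across a fleet or the duty cycle is near the low end.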
Bandwidth between chips. Latent state vectors are small. A 1408-dimensional latent at FP16 per spatial position, with 16x16 = 256 positions, is 720 KB per frame, or about 36 MB/s uncompressed at 50 Hz. With delta encoding between frames, the target is approximately 100 KB/s, which implies roughly 360x compression and depends on consecutive latents being highly redundant. Even a slow wireless backhaul handles that rate comfortably. The bottleneck is entirely within VDX-1’s denoising loop, not the chip-to-chip interconnect.
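The link budget can be checked from the per-frame size and the control rate. A minimal sketch using the figures in the paragraph; the compression ratio is simply what the 100 KB/s target implies, not a measured result.

```python
POSITIONS = 16 * 16      # spatial grid
FEATURE_DIM = 1408
BYTES_FP16 = 2
CONTROL_RATE_HZ = 50
TARGET_KB_PER_S = 100    # compressed rate quoted in the text

frame_bytes = POSITIONS * FEATURE_DIM * BYTES_FP16
raw_mb_per_s = frame_bytes * CONTROL_RATE_HZ / 1e6
implied_compression = raw_mb_per_s * 1000 / TARGET_KB_PER_S

print(f"{frame_bytes / 1e3:.0f} KB per frame")
print(f"{raw_mb_per_s:.1f} MB/s uncompressed at {CONTROL_RATE_HZ} Hz")
print(f"~{implied_compression:.0f}x compression needed to reach {TARGET_KB_PER_S} KB/s")
```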
Comparison to ATLAS
The ATLAS chip proposed earlier in this series targets a different problem. ATLAS is a general world-model chip with parallel rollout units, designed for autonomous driving workloads where the world model must process 320K+ BEV (bird’s-eye-view) tokens per frame across a long temporal horizon. ATLAS includes recurrent state-space units (for DreamerV3-class RSSM dynamics), a full decoder pipeline, and targets the 30-130W power envelope of an AV compute module.
JEPA-R is narrower, cheaper, and more edge-friendly:
| Dimension | ATLAS | JEPA-R |
|---|---|---|
| Target workload | General world models (RSSM, SSM, transformer) | JEPA-family latent prediction only |
| Target domain | Autonomous driving (320K+ BEV tokens) | Manipulation robotics (256 visual tokens) |
| Decoder | On-chip (for visualization / training) | None (latent-only; VDX-1 handles rendering) |
| Power | 30-130W | 10-15W |
| Compute | 100-500+ TOPS | 20-40 TOPS |
| Memory | 32-128 GB HBM | 8-16 GB LPDDR5X |
| Safety | ASIL-D automotive | ISO 13849 PLd machinery |
| Planner | MCTS with parallel tree search | CEM with parallel rollouts |
| Key bet | General-purpose world model accelerator | JEPA-specific, minimum viable silicon |
ATLAS makes sense when the world model is large, multi-modal, and must serve a complex planner with many branching futures (driving). JEPA-R makes sense when the world model is a compact JEPA predictor, the domain is manipulation with a few hundred visual tokens, and the power budget is 15W. They are chips for different robots.
The Latency Budget
The 10 ms sensor-to-prediction target breaks down as follows:
The critical observation: ViT encoding dominates at 45% of the latency budget (about 4.5 ms of the 10 ms) and 60-70% of total FLOPs. The predictor, the step that actually performs world modeling, takes only 0.5 ms. This means the chip’s silicon is predominantly a vision encoder accelerator with a modest predictor co-processor attached. The architecture of the workload determines the architecture of the chip.
For multi-camera humanoid configurations (6 cameras at 30 Hz), the encoder must process cameras in a pipelined round-robin. The effective per-camera budget at 30 Hz is 5.5 ms per camera, with frames from different cameras overlapping in the encoder pipeline. The predictor still runs once per control step, regardless of camera count, because all camera features are fused into a single latent state before prediction.
Gaps and Risks
JEPA is not a solved problem. Five limitations constrain the JEPA-R proposal:
JEPA cannot generate video. There is no decoder. When the prediction is wrong and the robot fails, engineers cannot visualize what the model “thought” would happen. This is a debugging problem in development and a liability problem in deployment. The VDX-1 pairing mitigates this (render latent trajectories post-hoc for forensic analysis), but adds latency to the debugging loop. A lightweight onboard CNN decoder for low-fidelity preview (DreamerV3-style) could help, at the cost of additional die area and power.
No uncertainty quantification. JEPA’s predictions are deterministic — the L1 regression target produces a single point estimate of the future state. Unlike DreamerV3’s stochastic latent variables (categorical distributions with KL balancing), JEPA provides no native measure of when it is uncertain. For safety-critical robotics, knowing when the world model does not know is essential for triggering fallback behaviors. The hardware safety island catches timeout failures, but cannot detect semantically wrong-but-confident predictions.
CEM planning is still slow. At 800 samples and 10 refinement steps, V-JEPA 2-AC requires 16 seconds per action on standard GPU hardware. JEPA-R’s dedicated silicon can reduce this substantially (the predictor forward pass becomes sub-millisecond), but full CEM at scale still demands significant compute. Amortized policy learning — distilling the CEM planner into a direct policy network that maps observations to actions in a single forward pass — is the likely path to 50 Hz control, but it has not been demonstrated with JEPA features at manipulation quality.
Limited task demonstration. V-JEPA 2-AC results are on tabletop manipulation only: reaching, grasping, and pick-and-place on Franka arms. Generalization to dynamic tasks (catching, rapid assembly, locomotion), multi-step tasks with branching sub-goals, or novel embodiments beyond Franka is unproven. The 16x16 spatial grid provides approximately 16-24 mm resolution at typical camera distances, potentially insufficient for sub-centimeter assembly tasks.
Encoder quantization risk. JEPA-R assumes the ViT-g encoder can be quantized to INT8 without catastrophic accuracy loss. Vision transformers generally quantize well, but the JEPA training objective (L1 regression against EMA features) may be more sensitive to quantization noise than classification objectives. If INT8 proves insufficient, FP16 weights double the memory requirement to 2 GB, tightening the LPDDR5X bandwidth budget.
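One way to see what INT8 quantization does to a weight tensor is to quantize synthetic weights and measure the L1 reconstruction error. This is a toy sketch under stated assumptions: symmetric per-tensor quantization (one of several common schemes), Gaussian random weights in place of real encoder weights, and no claim about how the error propagates to JEPA features.

```python
import random

def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic stand-in weights

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

errors = [abs(w - r) for w, r in zip(weights, restored)]
mean_l1_error = sum(errors) / len(errors)
max_error = max(errors)

print(f"scale = {scale:.6f}")
print(f"mean L1 error = {mean_l1_error:.6f}, max error = {max_error:.6f}")
```

The per-weight error is bounded by half the quantization step, so a few outlier weights that stretch the scale raise the error for every other weight; that is the usual reason per-channel scales are preferred when per-tensor INT8 loses too much accuracy.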
The Chip That Does Not Exist
No shipping chip simultaneously provides 20-40 TOPS at 15W with deterministic WCET, a hardware safety island, and a ViT-optimized datapath. Jetson Thor provides the compute but at 3-8x the power without safety guarantees. Mobileye’s EyeQ provides safety certification but targets CNN-based driving perception, not transformer-based world models. Google’s Edge TPU provides the power efficiency but lacks the compute for a 1B parameter ViT.
JEPA-R fills this gap: a ViT-encoder-first, predictor-narrow, safety-certified edge ASIC for the specific workload of JEPA latent prediction in robotic manipulation. It does not try to be a general-purpose AI accelerator. It does not try to render pixels. It does one thing — predict the future in latent space at 50 Hz within a hard 10 ms deadline — and delegates everything else to VDX-1 and the fleet management stack.
The V-JEPA 2 results suggest the workload is real. The 2,500-20,000x rendering asymmetry suggests the chip split is justified. The question is whether humanoid robot fleets scale to thousands of units fast enough to justify the NRE. If they do, the JEPA-R + VDX-1 pairing is the minimum silicon architecture for embodied AI: predict on the edge, render in the cloud.
Additional Reading
- V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video — Bardes et al., Meta 2024
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning — Meta AI 2025
- I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture — Assran et al., Meta 2023
- A Path Towards Autonomous Machine Intelligence (JEPA position paper) — LeCun 2022
- pi0: A Vision-Language-Action Flow Model for General Robot Control — Physical Intelligence 2024
- Cosmos World Foundation Models — NVIDIA 2025
- Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control — NVIDIA 2025
- DreamerV3: Mastering Diverse Domains through World Models — Hafner et al. 2023