Custom Chips for Video Diffusion Generation
Video diffusion is the most compute-intensive workload in generative AI. A single 5-second 720p clip from a 14B-parameter model burns roughly 500 PetaFLOPs of inference compute --- 50 denoising steps, each a full forward pass through a transformer operating on ~432,000 spatiotemporal tokens. On a single H100, that takes minutes. The real-time target is 33 milliseconds per frame. The gap between where we are and where we need to be is a factor of 218x, and GPUs are structurally incapable of closing it alone.
No one has shipped a chip designed for this workload. Every published diffusion accelerator --- EdgeDiff, SQ-DM, Ditto, SD-Acc, a silicon photonics proposal --- targets small image models at roughly 1B parameters. None handle temporal attention. None handle video. This article covers why the problem is hard, what video codec ASICs teach us about when specialization works, and what a purpose-built video diffusion chip would actually look like.
The Workload: 10 Models, One Pattern
The current generation of video diffusion transformers spans 1.3B to 30B parameters. All share the same basic structure: a 3D VAE compresses raw video into a latent space, a Diffusion Transformer (DiT) iteratively denoises that latent through 20-50 steps, and the VAE decodes the result back to pixels.
| Model | Params | Layers | Hidden Dim | Architecture |
|---|---|---|---|---|
| Movie Gen (Meta) | 30B | — | — | Full 3D DiT |
| Step-Video (StepFun) | 30B | 48 | 6,144 | Full 3D DiT |
| Wan 2.1-14B (Alibaba) | 14B | 40 | 5,120 | Full 3D DiT |
| HunyuanVideo (Tencent) | 13B+ | dual/single-stream | — | Dual-to-single stream DiT |
| Open-Sora 2.0 | 11B | — | — | Full 3D DiT |
| Mochi 1 (Genmo) | 10B | 48 | 3,072 | Asymmetric DiT |
| Open-Sora Plan v1.5 | 8B | — | — | Skiparse 3D (SUV) DiT |
| CogVideoX-5B (Zhipu) | 5B | — | — | Expert transformer |
| LTX-Video (Lightricks) | 2-13B | — | — | DiT |
| Wan 2.1-1.3B | 1.3B | 30 | 1,536 | Full 3D DiT |
These dwarf image-generation models. The original DiT-XL/2 was 675M parameters. Video models are 10—50x larger, and the token sequences they process are 30—100x longer.
The Token Explosion
A 720p frame, after 8x spatial compression through the VAE, becomes a 90x160 grid of 14,400 latent tokens. A 5-second clip at 24fps with 4x temporal compression produces about 30 temporal positions. The joint token count: 14,400 x 30 = 432,000 tokens per denoising step.
For a single 720p image, attention costs O(14,400^2). For a 5-second video at the same resolution with full 3D attention, the cost is O(432,000^2). That is roughly 900x more attention compute for the video than for one frame. This single number --- 900x --- is why video diffusion is a categorically different hardware problem than image diffusion.
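The token arithmetic above can be reproduced in a few lines. This is a minimal sketch using only the compression factors quoted in the text; the function name `latent_tokens` is illustrative, not part of any model's codebase.

```python
# Token-count arithmetic for the 720p, 5-second example in the text.
def latent_tokens(height, width, frames, spatial_ds=8, temporal_ds=4):
    """Return (tokens_per_frame, temporal_positions, total_tokens)."""
    per_frame = (height // spatial_ds) * (width // spatial_ds)
    positions = frames // temporal_ds
    return per_frame, positions, per_frame * positions

per_frame, positions, total = latent_tokens(720, 1280, frames=5 * 24)

# Full attention cost grows with the square of the sequence length,
# so the video-to-image ratio is (total / per_frame) squared.
attention_ratio = (total / per_frame) ** 2
print(per_frame, positions, total, attention_ratio)
```

The ratio depends only on the number of temporal positions, which is why longer clips at the same resolution get quadratically more expensive under full 3D attention.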
Other models report similar pressure. Meta’s Movie Gen handles 73,000 video tokens for 16-second generation at 16fps. HunyuanVideo processes 129 frames at 720p, requiring ~60 GB VRAM on a single GPU. Step-Video needs ~78 GB for 204 frames.
What the Denoising Loop Actually Costs
Total inference compute follows a simple formula — one that also governs the iteration cost in PhysDiffuse-1’s scientific diffusion workloads, though at different scales:
Total FLOPs = denoising_steps x FLOPs_per_forward_pass
For a Wan-14B-class model at 720p with 50 denoising steps, a single forward pass through the transformer is on the order of 10 PetaFLOPs (dominated by linear projections at ~8.5 PF plus attention). Multiply by 50 steps: ~500 PFLOPs per video clip. On an H100 delivering ~990 TFLOPS BF16, that is ~500 seconds (more than 8 minutes) even at full utilization, and actual utilization is far below 100%.
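As a quick check of that formula, the sketch below plugs in the figures quoted above (50 steps, ~10 PFLOPs per pass, ~990 TFLOPS peak). The helper names are illustrative, and the result is an idealized lower bound that assumes perfect utilization.

```python
PFLOP = 1e15

def total_inference_flops(denoising_steps, flops_per_forward_pass):
    """Total FLOPs = denoising_steps x FLOPs_per_forward_pass."""
    return denoising_steps * flops_per_forward_pass

def ideal_seconds(total_flops, peak_flops_per_s):
    """Lower bound on runtime, assuming 100% utilization of peak throughput."""
    return total_flops / peak_flops_per_s

total = total_inference_flops(50, 10 * PFLOP)  # figures from the text
h100_bf16_peak = 990e12                        # ~990 TFLOPS BF16
seconds_at_peak = ideal_seconds(total, h100_bf16_peak)
print(f"{total / PFLOP:.0f} PFLOPs, {seconds_at_peak:.0f} s at peak")
```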
Why GPUs Waste Half Their Silicon
Profiling video diffusion on H100 reveals a structurally inefficient workload. The GPU is not slow because it lacks peak FLOPS. It is slow because it cannot keep its own hardware fed.
Memory Bandwidth Is the Dominant Bottleneck
The H100’s ops:byte ratio is ~295:1 (990 TFLOPS FP16 / 3.35 TB/s HBM3 bandwidth). Any operation with an arithmetic intensity below 295 FLOPs per byte of memory traffic leaves tensor cores idle waiting for data. Attention at batch size 1, the typical case for single-video generation, achieves roughly 0.25-1.0 FLOPs/byte. The tensor cores are starving.
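A simple roofline model makes the starvation concrete. The sketch below is a textbook roofline calculation, not a profiler result: it uses only the peak throughput and bandwidth figures quoted above, and the function name is illustrative.

```python
def attainable_tflops(intensity_flops_per_byte, peak_tflops=990.0, bandwidth_tb_s=3.35):
    """Roofline model: throughput is capped by compute or by memory traffic.

    TB/s multiplied by FLOPs/byte gives TFLOPs/s, so the units line up directly.
    """
    return min(peak_tflops, intensity_flops_per_byte * bandwidth_tb_s)

# Ridge point: the intensity at which the chip stops being memory-bound.
ridge = 990.0 / 3.35

for intensity in (0.25, 1.0, 64.0, ridge):
    tflops = attainable_tflops(intensity)
    print(f"{intensity:8.2f} FLOPs/byte -> {tflops:7.2f} TFLOPS "
          f"({100 * tflops / 990.0:5.2f}% of peak)")
```

Under this model, an operation at 1 FLOP/byte can reach at most 3.35 TFLOPS, well under 1% of the tensor cores' peak.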
Every denoising step reloads the full model from HBM. For a 14B model at FP8, that is 14 GB per step, times 50 steps = 700 GB of redundant weight traffic per video. At 3.35 TB/s, streaming those weights alone takes ~0.2 seconds even at peak bandwidth, and none of that traffic is useful computation.
Tensor Cores Are Idle Half the Time
H100 has 528 tensor cores across 132 SMs. They engage during matrix multiplies (QKV projections, attention matmul, FFN linear layers) but sit dark during softmax, layer norms, activation functions, and residual additions. These element-wise and reduction operations constitute roughly 30—40% of wall-clock time despite being a small fraction of total FLOPS. Estimated tensor core engagement during a DiT forward pass: ~50% of wall-clock time.
At batch size 1, even the GEMM operations degenerate toward matrix-vector multiplies that cannot fill the tensor core’s matrix-shaped execution units. A 256x256 tile is optimal; a 256x1 vector wastes most of it.
Kernel Launch Overhead Adds Up
Each denoising step launches hundreds of CUDA kernels (QKV projections, attention matmuls, softmax, layer norms, FFN layers, and residual adds across every transformer layer, plus scheduling overhead) at 3.3-9.6 microseconds each. For a 50-step inference through a 40-layer DiT, hundreds of kernels per step yield tens of thousands of total launches; at those per-launch costs, that is on the order of tenths of a second of pure dispatch overhead per video. CUDA graphs reduce per-launch cost only marginally (3.8 us to 3.4 us), and the fundamental problem is architectural: the GPU’s programming model requires explicit kernel dispatch for every operation.
Effective Silicon Utilization: 40—50%
Combining die-area relevance (the H100 devotes ~70-80% of its 814 mm^2 die to compute-relevant blocks) with temporal utilization of active units (tensor cores at ~50%, memory controllers at ~60-70%), effective silicon utilization during video diffusion inference lands at roughly 40-50%. At that utilization, a 700W part is doing roughly 280-350W worth of useful work.
H200 helps on the bandwidth axis (4.8 TB/s, 43% more than H100) but has identical tensor cores. B200 doubles compute to ~9,000 TFLOPS FP8 and pushes bandwidth to ~8 TB/s, but its ops:byte ratio (~312:1 FP16) stays similar. See the Blackwell architecture deep dive for the full B200 specs. The MI300X offers 5.3 TB/s and 192 GB HBM3 --- 58% more bandwidth than H100 --- but ROCm’s software stack lags CUDA by 6—12 months, leaving 20—40% of the theoretical advantage unrealized.
The GPU scaling story is not encouraging either. HunyuanVideo on 8x H100 with NVLink achieves 5.6x speedup (338s vs 1,904s), which is 70% scaling efficiency. The remaining 30% is lost to all-reduce synchronization during sequence-parallel attention.
The 218x Real-Time Gap
The distance from current state-of-the-art to real-time is precisely quantified:
| Model | Resolution | Per-Frame Time | Hardware |
|---|---|---|---|
| SVD-XT (25 steps) | 576x1024 | 7.2 s | A100 80GB |
| HunyuanVideo 13B (50 steps) | 1280x720 | 14.8 s | 1x A100 |
| HunyuanVideo 13B (USP, 8 GPU) | 1280x720 | 2.6 s | 8x A100 |
| AnimateLCM (distilled) | 512x512 | ~63 ms | — |
From SVD’s 7.2 s/frame to the 33 ms target at 30 FPS: a factor of ~218x. To a first approximation, the gap decomposes into separate optimization axes whose speedups multiply:
| Optimization | Factor | Mechanism |
|---|---|---|
| Step reduction (25 to 1—4 steps) | 6—25x | Consistency/adversarial distillation |
| Quantization (BF16 to FP8/INT8) | 1.5—2x | Lower-precision tensor ops, reduced BW |
| Architecture caching | 2—4x | TeaCache, first-block cache, timestep skip |
| Compiler optimization | 1.2—1.5x | torch.compile, TensorRT kernel fusion |
| FlashAttention-3 | 1.5—2x | Tiled SRAM attention, 75% H100 utilization |
| Multi-GPU parallelism | 2—6x | Ring attention, sequence parallelism |
| Hardware generation jump | 2—3x | H100 to B200 FLOPS and bandwidth |
Theoretical combined ceiling: ~590x. But these factors are not independent --- step reduction degrades quality that caching cannot recover, quantization interacts with distillation, multi-GPU scaling has communication overhead. A realistic combined stack achieves 100—300x, placing real-time 1080p video on the edge of feasibility with aggressive optimization on next-generation hardware. Hardware specialization is what pushes it over the edge.
Temporal Attention: The Hardware Design Space
The single most important architectural question for a video diffusion chip is how to handle temporal attention. The field has converged on five distinct strategies, each with different hardware implications.
Full 3D Joint Attention. Every spatiotemporal token attends to every other. Used by Sora, Step-Video, CogVideoX, HunyuanVideo, Wan. Pure dense matrix multiplication that maps well onto systolic arrays, but the attention matrix for 73K tokens at FP16 requires ~10 GB per head per layer --- far exceeding any on-chip SRAM. FlashAttention becomes non-optional.
Factored 2+1D Attention. Decompose 3D attention into separate spatial and temporal passes. Spatial: each frame processed independently (14,400-token attention, 20 times in parallel). Temporal: each spatial position processed across frames (20—30-token attention, 14,400 times in parallel). This converts one enormous problem into many small ones. The Latte model found interleaved spatial-temporal blocks outperform late fusion, at ~5,573 GFLOPs for 673M parameters. Hardware challenge: spatial passes are compute-heavy; temporal passes are overhead-heavy (so small they become latency-bound by kernel launch overhead on GPUs).
Causal Temporal Attention. Frame t attends only to frames 0 through t. Halves effective attention computation. Maps naturally to a pipeline where each frame’s KV is computed and broadcast forward in temporal order --- conceptually similar to how codec hardware pipelines reference frame data through motion estimation.
Sliding Window Temporal Attention. Restricts each frame’s receptive field to W neighboring frames, reducing cost from O(T^2) to O(T x W). Open-Sora 1.3 uses this. Produces banded attention matrices with predictable stride-regular memory access --- excellent for hardware prefetching. On-chip SRAM need only hold W frames of KV data.
Sparse 3D Attention (Skiparse/SUV). Open-Sora Plan v1.5 introduced SUV: alternating single-skip and group-skip patterns that process 1/k of total tokens while maintaining global receptive field. U-shaped design uses low sparsity in shallow layers, high sparsity in deep layers. Achieves 35% end-to-end speedup (45% on attention) over dense 3D DiT while matching HunyuanVideo quality (VBench 83.02% vs 83.24%). Hardware challenge: irregular gather/scatter memory access conflicts with coalesced-access models.
The hardware lesson: The attention pattern --- which tokens attend to which --- changes across models and will keep changing. The math primitives under all five patterns are identical: tiled matrix multiplication, online softmax, ring communication, mixed-precision accumulation. A chip should hardwire the primitives but leave the attention topology programmable.
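To make the cost differences between these patterns concrete, the sketch below counts query-key pairs for each one. It assumes the 14,400 x 30 token layout from the earlier 720p example and a hypothetical 5-frame window; the function names are illustrative, and pair counts are only a proxy for real attention FLOPs.

```python
def temporal_mask(num_frames, mode="full", window=5):
    """mask[i][j] is True when frame i may attend to frame j."""
    half = window // 2

    def allowed(i, j):
        if mode == "causal":
            return j <= i                # frame t sees frames 0..t
        if mode == "sliding":
            return abs(i - j) <= half    # banded: W neighbouring frames
        return True                      # full temporal attention

    return [[allowed(i, j) for j in range(num_frames)] for i in range(num_frames)]

def attended_pairs(mask):
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)

S, T = 14_400, 30  # spatial tokens per frame, temporal positions

full_3d = (S * T) ** 2                   # every token attends to every token
spatial = T * S ** 2                     # one S x S attention per frame
costs = {"full 3D": full_3d}
for mode in ("full", "causal", "sliding"):
    # Factored 2+1D: spatial pass plus one temporal attention per spatial position.
    costs[f"2+1D, {mode} temporal"] = spatial + S * attended_pairs(temporal_mask(T, mode))

for name, pairs in costs.items():
    print(f"{name:24s} {pairs:.3e} pairs ({full_3d / pairs:5.1f}x fewer than full 3D)")
```

Only the mask changes between variants while the matmul and softmax underneath stay the same, which is the sense in which the topology can stay programmable while the primitives are fixed.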
Quantization: Timestep-Dependent Precision
Video diffusion models are migrating down the precision ladder, but the sensitivity landscape is more complex than in LLMs.
The energy argument is overwhelming. An INT8 multiply costs ~0.2 pJ versus 3.7 pJ for FP32 --- 18.5x cheaper in energy per operation (Horowitz 2014, 45nm). DRAM reads cost ~640 pJ versus ~5 pJ for SRAM. Halving tensor size from FP16 to FP8 saves substantial data movement energy on top of the arithmetic savings.
W8A8 is the safe default. ViDiT-Q (ICLR 2025) confirms W8A8 quantization for video diffusion transformers with “negligible degradation in visual quality and metrics,” delivering 1.4—1.7x latency speedup and 2—2.5x memory savings. The MixDQ framework achieves the same without performance loss.
W4A8 is viable with care. Requires timestep-aware calibration. DiT architectures (FLUX, video transformers) are more quantization-resilient than UNets. SVDQuant achieves 4-bit weights on the 12B FLUX.1 model with 3.5x memory reduction and 3.0x speedup by absorbing outliers into a low-rank branch obtained via SVD.
Temporal attention layers need higher precision than spatial layers. ViDiT-Q specifically identifies layers “responsible for retaining essential temporal information” as sensitive to bit-width reduction. Quantization errors in temporal attention weights cause frame-to-frame inconsistency --- flicker, jitter, unnatural motion discontinuities.
Denoising timestep matters. Quantization sensitivity varies by timestep:
- Early timesteps (high noise): Tolerant of aggressive 4-bit quantization --- the signal is dominated by Gaussian noise.
- Late timesteps (low noise, fine detail): Highly sensitive. At high SNR, accumulated quantization error is no longer masked by noise and corrupts fine structure. These steps need 8-bit or higher.
- Mid timesteps: W4A8 is typically safe.
MixDQ implements this via integer-programming-based bit-width allocation across layers and timesteps. A hardware scheduler that dynamically adjusts precision per denoising step can extract efficiency without quality loss --- something a GPU cannot do without software intervention at each step.
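A per-step precision schedule of the kind described above could look like the following sketch. The 30%/30% phase boundaries and the choice of W4A4 for early steps are assumptions made for illustration; this is not MixDQ's integer-programming allocation, which would search for these settings per layer.

```python
def bits_for_step(step, total_steps, early_frac=0.3, late_frac=0.3):
    """Return (weight_bits, activation_bits) for one denoising step.

    step 0 is the noisiest step; step total_steps - 1 is the cleanest.
    The phase boundaries are hypothetical tuning knobs.
    """
    progress = step / max(total_steps - 1, 1)
    if progress < early_frac:
        return (4, 4)   # early, noise-dominated: aggressive 4-bit
    if progress >= 1.0 - late_frac:
        return (8, 8)   # late, fine detail: 8-bit or higher
    return (4, 8)       # middle of the trajectory: W4A8

schedule = [bits_for_step(s, 30) for s in range(30)]
for step in (0, 15, 29):
    w, a = schedule[step]
    print(f"step {step:2d}: W{w}A{a}")
```

Because the step count is fixed before inference starts, a table like `schedule` can be computed once and consumed by a hardware scheduler without per-step software involvement.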
Microscaling is the emerging standard. The OCP Microscaling Specification (backed by AMD, ARM, Intel, Meta, Microsoft, NVIDIA, Qualcomm) defines MXFP4/MXFP6/MXFP8: block floating-point where 32 elements share a common 8-bit scale factor. NVIDIA Blackwell implements MXFP4 natively. B200 FP4 roughly doubles FP8 throughput again.
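The shared-scale idea can be illustrated with a simplified block quantizer. This sketch is not a bit-exact implementation of any OCP MX format: it keeps the 32-element block and the shared power-of-two scale, but stores elements as plain signed integers rather than the FP4/FP6/FP8 element encodings the specification defines. All function names are illustrative.

```python
import math

BLOCK = 32  # elements sharing one scale, as in the MX formats

def mx_quantize_block(values, elem_bits=8):
    """Quantize one block to signed integers plus a shared power-of-two exponent."""
    assert len(values) == BLOCK
    qmax = 2 ** (elem_bits - 1) - 1
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * BLOCK
    # Smallest power-of-two scale for which the largest element still fits.
    exponent = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exponent
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exponent, q

def mx_dequantize_block(exponent, q):
    """Reconstruct approximate values from the shared exponent and integers."""
    scale = 2.0 ** exponent
    return [x * scale for x in q]

values = [math.sin(i) for i in range(BLOCK)]
exponent, q = mx_quantize_block(values)
recon = mx_dequantize_block(exponent, q)
max_err = max(abs(a - b) for a, b in zip(values, recon))
print(f"shared exponent 2^{exponent}, max abs error {max_err:.5f}")
```

Storage in this simplified variant is 32 x 8 element bits plus one 8-bit exponent per block, so the scale adds only a quarter of a bit per element.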
What Codec ASICs Teach Us
Video codec hardware --- from H.264 through AV1 --- represents one of silicon engineering’s greatest success stories, and the parallels to video diffusion are instructive.
The Forty-Year Pattern
Each codec generation brought a wave of specialization. H.264 (2003) introduced integer DCT, quarter-pixel motion compensation, and CABAC entropy coding; by 2019 it was used by 91% of video developers. H.265/HEVC (2013) doubled compression at substantially higher complexity; MIT demonstrated a 4K30 HEVC decoder at under 0.1W. AV1 (2018) was “orders of magnitude slower” than HEVC in software; hardware became essential rather than optional. VVC/H.266 (2020) pushes 10x the encoding complexity of HEVC.
Every major silicon vendor --- NVIDIA (NVENC/NVDEC), Intel (Quick Sync), Apple (Media Engine), AMD (VCN) --- ships dedicated fixed-function codec blocks alongside programmable compute cores. These are non-programmable, non-flexible, and extraordinarily efficient.
Structural Parallels
| Codec Stage | Diffusion Analog | Key Difference |
|---|---|---|
| Motion estimation | Temporal attention | ME is search; attention is learned weighted average |
| DCT transform | VAE encoder | DCT is fixed linear; VAE is nonlinear learned |
| Quantization | Latent space bottleneck | Fixed step sizes vs. learned compression |
| In-loop filtering | Self-attention refinement | Fixed rules vs. context-dependent learned refinement |
| Reference frame buffer | KV cache / latent state buffer | Both demand large on-chip memory |
| Iterative RDO loop | Iterative denoising | RDO optimizes known objective; denoising follows learned score |
Why Codec ASICs Succeeded: Four Conditions
- Standardized, frozen algorithms. Once ratified, the bitstream format was locked. H.264 hardware from 2006 still decodes 2026 streams.
- Massive, predictable volume. Billions of devices per year amortize NRE.
- Clear computational bottleneck. Motion estimation alone consumes 60—80% of encoder compute, with regular data access patterns amenable to massive parallelism.
- Asymmetric encode/decode. Decode is dramatically simpler, so even low-power mobile chips include it.
Video diffusion violates most of these. No frozen algorithm --- DiT replaced UNet; flow matching challenges DDPM; consistency models promise single-step generation. No standardization --- Sora, Runway Gen-3, Kling, HunyuanVideo all use proprietary architectures. Irregular compute --- attention has data-dependent memory access. Rapid precision evolution --- FP32 to FP16 to BF16 to FP8 to MXFP4 in four years.
Five Lessons That Transfer
- Identify durable primitives. In codecs, DCT and block matching survived four decades. In diffusion, matrix multiplication and attention are the candidates. Hardware that accelerates these remains useful even as model architectures change.
- Memory bandwidth is the real bottleneck. Codec ASICs dedicate enormous die area to reference frame SRAM. Diffusion chips need the same discipline for KV caches and activations.
- Generation is harder than decode --- always has been. Codec decode is fixed cost. Encoding requires RD search. Diffusion generation requires iterative stochastic sampling. Hardware must be sized for the generation case.
- Standardization unlocks hardware investment. The codec world’s willingness to commit billions to silicon came after algorithm standardization. Until AI video generation converges on a stable architecture, programmability is non-negotiable.
- Power efficiency will eventually force specialization. MIT’s 0.1W HEVC decoder shows the endgame: 100x efficiency over software. If video generation must run on mobile devices, similar gains will demand specialized silicon.
VDX-1: A Chip Architecture for Video Diffusion
VDX-1 is a weight-stationary, multi-die chiplet design targeting TSMC N4P. The core thesis: keep all model weights resident in on-chip SRAM across all denoising steps, and build specialized datapaths for the operations that dominate video diffusion.
Weight-Stationary Dataflow: The Entire Point
On a GPU, every denoising step reloads the full model from HBM. For a 14B model at FP8: 14 GB x 30 steps = 420 GB of weight traffic per video (or 700 GB at 50 steps). At H100’s 3.35 TB/s peak, weight loading alone takes roughly 0.13-0.21 seconds, and it competes with activation and KV traffic for the same bandwidth.
VDX-1 loads weights once at initialization and pins them in on-chip SRAM. During the denoising loop, only activations and latents move. This eliminates hundreds of GB of redundant memory traffic per video and converts the entire workload from memory-bound to compute-bound — a weight-stationary approach that uses systolic array datapaths throughout.
14 GB of On-Chip SRAM
This is the hard engineering bet. At W4A8 quantization (4-bit weights, 8-bit activations), the 14B model fits in 7 GB. At FP8, it requires the full 14 GB. For context:
| Chip | On-Chip SRAM | Process |
|---|---|---|
| NVIDIA B200 | ~256 MB | N4 |
| Apple M4 Ultra | ~192 MB | N3E |
| Google TPU v6e | ~110 MB | N4 |
| Cerebras WSE-3 | 44 GB | N5 |
| VDX-1 | 14 GB (4 dies) | N4P |
At N4P SRAM density (~30—35 Mbit/mm^2), 14 GB = 112 Gbit requires ~3,200—3,700 mm^2 --- impossible on a single die (reticle limit ~800 mm^2). This drives the multi-chiplet design. At W4A8, each compute die needs ~1.75—2.3 GB, roughly 60% of die area for SRAM, comparable to Cerebras’ ratio.
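The area estimate follows directly from capacity and density. The sketch below reproduces it using the density range and reticle limit quoted above, with decimal units (1 GB = 8,000 Mbit); the function name is illustrative, and array overheads such as periphery and redundancy are ignored.

```python
def sram_area_mm2(capacity_gb, density_mbit_per_mm2):
    """Raw bit-cell area for a given SRAM capacity (decimal units, no overheads)."""
    mbits = capacity_gb * 8 * 1000  # GB -> Gbit -> Mbit
    return mbits / density_mbit_per_mm2

RETICLE_MM2 = 800  # approximate single-die limit quoted in the text

for label, capacity_gb in (("FP8, 14B params", 14), ("W4A8, 14B params", 7)):
    best = sram_area_mm2(capacity_gb, 35)   # optimistic density
    worst = sram_area_mm2(capacity_gb, 30)  # conservative density
    print(f"{label}: {best:.0f}-{worst:.0f} mm^2 "
          f"({best / RETICLE_MM2:.1f}-{worst / RETICLE_MM2:.1f} reticles)")
```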
Specialized Engines
Spatial Attention Engine (SAE). Handles within-frame self-attention over 14,400 tokens per frame. 64 lanes, each processing one attention head, with 128x128 systolic arrays for matmul. FlashAttention-style tiling: loads Q tiles, streams K/V from SRAM, accumulates online softmax. Fused QKV projection + attention + output projection in a single pass to avoid activation spills.
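The online-softmax accumulation that FlashAttention-style tiling relies on can be shown for a single query row. This is a scalar reference sketch in plain Python, not a description of the SAE datapath: the tile size and function names are illustrative, and a real engine would apply the same recurrence to matrix tiles.

```python
import math

def online_softmax_weighted_sum(scores, values, tile=4):
    """Compute sum_i softmax(scores)_i * values_i while seeing one tile at a time."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    acc = 0.0           # weighted-value numerator, relative to running_max
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale earlier partial results so they are relative to the new maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc *= rescale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum

def naive_softmax_weighted_sum(scores, values):
    """Reference version that needs the whole score row in memory at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

scores = [0.5, -1.0, 3.0, 2.0, 0.0, 4.5, -2.0, 1.0, 0.25]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
tiled = online_softmax_weighted_sum(scores, values)
reference = naive_softmax_weighted_sum(scores, values)
print(tiled, reference)
```

The tiled version never holds more than one tile of scores, which is what lets K/V stream from SRAM without materializing the full attention matrix.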
Temporal Attention Engine (TAE). Handles cross-frame attention at each spatial position --- 30 tokens per sequence. The compute is negligible (30x30 attention = 1.8 KB), but the bottleneck is gather: collecting the same spatial position across 30 frames from the latent buffer. 256 lanes running in parallel across spatial positions, optimized for strided memory access rather than raw FLOPS.
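The gather problem is easiest to see as address arithmetic. The sketch below assumes a frame-major buffer layout (frame, row, column) and a hypothetical 5,120-byte token (a 5,120-wide hidden state at one byte per element); neither assumption comes from a published VDX-1 specification.

```python
def temporal_gather_offsets(y, x, frames=30, height=90, width=160, token_bytes=5120):
    """Byte offsets of one spatial position across all frames.

    Assumes a frame-major layout: offset = ((t * height + y) * width + x) * token_bytes.
    """
    return [((t * height + y) * width + x) * token_bytes for t in range(frames)]

offsets = temporal_gather_offsets(y=10, x=20)
stride = offsets[1] - offsets[0]
print(f"{len(offsets)} reads, {stride / 1e6:.1f} MB apart")
```

Each 30-token temporal sequence touches 30 locations separated by a full frame of tokens, so the access pattern is regular but far too sparse for ordinary cache lines to help.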
Why separate engines? Spatial attention is compute-bound (large token count). Temporal attention is memory-bound (tiny token count, scattered access). A unified engine would be poorly utilized in both cases.
Stochastic Generation Unit (SGU). 256 parallel LFSR-based uniform RNG lanes (Xoshiro256++) with paired Box-Muller transform units. Throughput: 96 billion Gaussian samples/second at 1.5 GHz. Generates the full 6.9M-element noise tensor in ~72 microseconds, running concurrently with the previous step’s transformer pass via double buffering. Total area: ~2 mm^2.
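The Box-Muller step is small enough to show in full. In this sketch Python's `random.Random` stands in for the hardware uniform generators (it is a Mersenne Twister, not Xoshiro256++), and the throughput line simply divides the two figures quoted above.

```python
import math
import random

def box_muller_pair(u1, u2):
    """Map two uniforms (u1 in (0, 1], u2 in [0, 1)) to two independent N(0, 1) samples."""
    radius = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return radius * math.cos(theta), radius * math.sin(theta)

def gaussian_noise(n, seed=0):
    """Generate n standard-normal samples; the software RNG is a stand-in for hardware lanes."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is finite
        u2 = rng.random()
        out.extend(box_muller_pair(u1, u2))
    return out[:n]

samples = gaussian_noise(10_001)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)

# Time to fill the noise tensor at the quoted throughput.
noise_elements = 432_000 * 16          # tokens x 16 latent channels, about 6.9M
fill_seconds = noise_elements / 96e9   # 96 billion samples per second
print(f"mean {mean:.3f}, variance {variance:.3f}, fill time {fill_seconds * 1e6:.0f} us")
```

The 16-channel latent width is inferred from the 6.9M-element figure and the 432,000-token count; treat it as an assumption.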
VAE Decoder Engine. Dedicated die with 200 MB weight SRAM for the ~200M-parameter 3D causal VAE decoder. Streaming pipeline that decodes frame-by-frame. First-frame latency: ~8 ms (enables playback before full decode completes). Full 120-frame decode: ~120 ms.
Timestep Conditioning Unit (TCU). Computes sinusoidal positional encoding and the AdaLN-Zero MLP in ~500 ns, broadcasting per-layer scale/shift/gate parameters via a dedicated conditioning bus. Since all 30 timesteps are known before inference begins, the TCU precomputes all conditioning vectors during weight loading.
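Precomputing the conditioning inputs is straightforward because the step schedule is fixed ahead of time. The sketch below builds the sinusoidal embedding for 30 evenly spaced timesteps; the 256-wide embedding, the 0-999 timestep range, and the even spacing are assumptions in the style of the reference DiT code, and the AdaLN-Zero MLP that would consume these vectors is omitted.

```python
import math

def sinusoidal_embedding(t, dim=256, max_period=10000.0):
    """Sinusoidal timestep embedding: cosine half followed by sine half."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

def precompute_conditioning(num_steps=30, max_timestep=999):
    """Embeddings for every step of a fixed schedule, from noisiest to cleanest."""
    timesteps = [round(max_timestep * (num_steps - 1 - k) / (num_steps - 1))
                 for k in range(num_steps)]
    return timesteps, [sinusoidal_embedding(t) for t in timesteps]

timesteps, table = precompute_conditioning()
print(timesteps[0], timesteps[-1], len(table), len(table[0]))
```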
Memory Hierarchy
| Level | Size | Contents | Bandwidth |
|---|---|---|---|
| PE registers | 128 KB/die | Partial sums | -- |
| Weight SRAM | 14 GB (4 dies) | Model weights (pinned) | ~50 TB/s aggregate |
| Activation scratchpad | 512 MB/die (2 GB total) | KV cache, attention intermediates | ~12 TB/s per die |
| HBM3e | 48 GB (2 stacks) | Latents, pixels, overflow | 2 TB/s |
Weight SRAM and activation scratchpad are physically separate. Weights use high-density single-port SRAM (read-only during inference). Activations use dual-port SRAM for read-write bandwidth. This heterogeneous strategy saves ~15% die area over uniform dual-port.
During the denoising loop, HBM bandwidth utilization is < 5%. HBM exists for initial weight load (~7 ms for 14 GB at 2 TB/s), latent tensor storage across frames (13.8 MB/step), and final pixel output (332 MB for 120 frames at 720p). The entire denoising loop runs on-chip.
Multi-Die Design
Four chiplets on a silicon interposer with UCIe D2D links at 800 GB/s:
- Dies 0—2 (Compute): Hold 10/10/20 transformer layers with weights pinned in local SRAM. Pipeline-parallel: Die 0 processes layers 0—9, Die 1 processes 10—19, Die 2 processes 20—39. Activation transfers at die boundaries (~40 MB) take 50 microseconds at 800 GB/s --- negligible versus per-layer compute.
- Die 3 (VAE + Utility): VAE decoder, SGU, TCU, scheduler, I/O.
- Die sizes: ~500—600 mm^2 each. Total package: ~2,000 mm^2 active silicon on a 2,500 mm^2 interposer.
Performance Estimates
At 1,024 TOPS aggregate (INT4 x INT8 MAC across 3 compute dies), 1.5 GHz clock:
| Metric | Value |
|---|---|
| Per-step compute | ~30.5 TFLOPs (linear projections ~28T, spatial attention ~2.4T, temporal ~0.02T) |
| 30-step total | ~915 TFLOPs |
| Denoising latency | ~1.3 s (at ~700 effective TOPS) |
| VAE decode | ~120 ms |
| Total per 5s video | ~1.5 seconds (3.3x real-time) |
| With temporal reuse (Ditto-style, 30—40% skip) | ~1.0 s (5x real-time) |
| TDP | 475W |
| Estimated cost | $5,000—$8,000 per chip |
For comparison: an H100 at $25,000—$30,000 takes minutes for the same video, at 700W.
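The latency rows in the table can be re-derived from the per-step compute and the assumed effective throughput. The sketch below uses only figures from the table; the 35% skip rate is the midpoint of the quoted 30-40% range, and the variable names are illustrative.

```python
# Per-step compute from the table, in tera-operations.
per_step_tflop = 28.0 + 2.4 + 0.02   # linear + spatial attention + temporal attention
steps = 30
effective_tops = 700.0               # assumed sustained throughput
vae_decode_s = 0.120
clip_seconds = 5.0

denoise_s = per_step_tflop * steps / effective_tops
total_s = denoise_s + vae_decode_s

# Temporal reuse skips a fraction of the denoising work (midpoint of 30-40%).
skip_fraction = 0.35
reuse_total_s = denoise_s * (1.0 - skip_fraction) + vae_decode_s

print(f"denoise {denoise_s:.2f} s, total {total_s:.2f} s "
      f"({clip_seconds / total_s:.1f}x real-time)")
print(f"with reuse {reuse_total_s:.2f} s ({clip_seconds / reuse_total_s:.1f}x real-time)")
```

The results land close to the rounded table values, and they are only as good as the ~700 effective TOPS assumption they rest on.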
Risks
Yield. 550 mm^2 dies at N4P will have ~50-60% yield. With 4 dies per package, assembling unscreened dies would compound to a package-level yield of only ~6-13% (0.5^4 to 0.6^4). Known-good-die testing is essential.
Obsolescence. If video generation shifts from DiT to autoregressive transformers or some other paradigm, VDX-1 becomes expensive scrap. Mitigated by making the systolic arrays and attention engines programmable enough to handle any attention-based architecture.
Model size growth. 14 GB of SRAM holds a 14B model at FP8, or up to ~28B parameters at W4A8. If models grow to 50B+ (plausible within 2 years), the chip cannot hold weights on-chip. INT3/INT2 quantization could extend the runway.
The Path to Real-Time
Real-time video diffusion at 30 FPS, 1080p, from a single device requires combining every optimization axis simultaneously:
| Milestone | Per-Frame Budget | Steps | Precision | Hardware | Status |
|---|---|---|---|---|---|
| Current SOTA (SVD-XT) | 7,200 ms | 25 | BF16 | 1x A100 | Deployed |
| AnimateLCM-class | 60—100 ms | 1—4 | FP16 | 1x RTX 4090 | Demonstrated at 512x512 |
| Real-time 720p (pure diffusion) | 33—42 ms | 1 | FP8 | 1x H100/H200 | Feasible with full optimization stack |
| Real-time 1080p (hybrid keyframe+interp) | 133 ms diffusion + 5 ms interp | 1 | FP8 | 1x H100 | Feasible now with engineering |
| Consumer real-time 1080p | 33 ms | 1 | INT8/FP8 | RTX 5090/6090 | Projected 2027—28 |
The hybrid keyframe + interpolation approach is the most immediately viable path: generate every 4th frame with the diffusion model (133 ms budget), fill gaps with a lightweight frame interpolation network (RIFE, FILM). The diffusion model generates 7.5 FPS of keyframes; interpolation fills to 30 FPS.
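The frame budget for the hybrid scheme is simple to derive. The sketch below computes the keyframe rate and per-keyframe budget for a given interval and labels each output frame by how it would be produced; the function name and labels are illustrative.

```python
def keyframe_plan(num_frames, target_fps=30.0, keyframe_interval=4):
    """Return (keyframe_fps, diffusion_budget_ms, per-frame plan)."""
    keyframe_fps = target_fps / keyframe_interval
    budget_ms = 1000.0 / keyframe_fps
    plan = ["diffusion" if i % keyframe_interval == 0 else "interpolate"
            for i in range(num_frames)]
    return keyframe_fps, budget_ms, plan

keyframe_fps, budget_ms, plan = keyframe_plan(num_frames=9)
print(f"{keyframe_fps} keyframes/s, {budget_ms:.0f} ms per keyframe")
print(plan)
```

Raising the interval relaxes the diffusion budget linearly, but it also leaves more frames for the interpolation network to fill in.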
The RTX ray tracing analogy is instructive. Real-time ray tracing went from “technically possible but unusable” (RTX 2080, 2018) to “native 4K” (RTX 5090, 2025) across 3 generational jumps and 7 years. DLSS was as important as RT Core improvements. Video diffusion is on a similar curve: from minutes per clip (2022) to approaching real-time (2026), with full real-time projected for 2027-2028. The combination of distillation, streaming pipelines, and hardware specialization will get there; the open question is whether that hardware looks like a better GPU or a purpose-built chip.
Where This Goes
The argument for dedicated video diffusion silicon rests on a bet about convergence. Codec ASICs succeeded because four conditions were met simultaneously: frozen algorithms, massive volume, clear bottlenecks, and asymmetric encode/decode. Video diffusion currently meets only one of these (clear bottleneck). The other three are open questions.
If the field converges on DiT-class architectures with attention as the durable primitive --- and the evidence points this direction, since attention has survived from 2017 to 2026 across NLP, vision, and video --- then the VDX-1 approach works: hardwire the math, leave the topology programmable, put the weights on-chip, and eliminate the memory wall.
If the field fractures into competing paradigms (autoregressive video, consistency models, GAN hybrids, architectures not yet invented), then programmability wins and GPUs remain the platform.
The day a fixed-function video generation engine appears on a chip spec sheet --- analogous to NVENC appearing on Kepler in 2012 --- will be the day the field has matured enough to standardize. That day is not yet here. But the physics of power efficiency will eventually force specialization. MIT’s 0.1W HEVC decoder consumed 100x less power than software. A 4K60 H.265 decode that costs 10W in software costs 0.1W in silicon. Video generation on mobile devices will demand the same kind of gains.
Until then, the most probable near-term outcome is not a diffusion ASIC but increasingly specialized programmable accelerators --- enhanced tensor/attention engines with massive on-chip memory, hardware support for the iterative denoising loop, and tight multi-chip interconnect for sequence-parallel attention. The JEPA-R robotics chip takes a complementary approach, avoiding the denoising loop entirely by predicting in latent space and offloading pixel rendering to VDX-1. VDX-1 represents the logical endpoint of that trajectory: the chip you build when the architecture stops moving.
Additional Reading
- DiT: Scalable Diffusion Models with Transformers — Peebles & Xie
- Stable Video Diffusion — Blattmann et al.
- Consistency Models — Song et al. 2023
- Open-Sora — Zheng et al.
- Open-Sora Plan — Lin et al.
- HunyuanVideo — Kong et al., Tencent
- Movie Gen — Polyak et al., Meta (30B params, 88 authors)
- TeaCache — Liu et al., CVPR 2025 (4.41x speedup)
- EXION: Diffusion Model Accelerator — Heo et al., HPCA 2025
- Ditto: Temporal Value Similarity for Diffusion — Kim et al., HPCA 2025
- SnapGen-V: 5s Video in 5s on Mobile — Wu et al., CVPR 2025
- Wan 2.1: Open Large-Scale Video Generative Models — Team Wan, Alibaba (14B and 1.3B)
- Step-Video-T2V — StepFun (30B params, 48 layers, 6144 hidden dim)
- AnimateLCM — Wang et al., SIGGRAPH Asia 2024 (25s to ~1s generation)
- ViDiT-Q: Quantization of Diffusion Transformers — ICLR 2025 (W8A8/W4A8 for video DiT)
- SVDQuant: 4-Bit Diffusion Models — Li et al., ICLR 2025 Spotlight (3.5x memory reduction on FLUX.1 12B)
- Mochi 1: Asymmetric DiT for Video — Genmo (10B params, 48 layers)
- CogVideoX — Zhipu AI (5B expert transformer)