Custom Chips for Video Diffusion Generation

Video diffusion is the most compute-intensive workload in generative AI. A single 5-second 720p clip from a 14B-parameter model burns roughly 500 PetaFLOPs of inference compute --- 50 denoising steps, each a full forward pass through a transformer operating on ~432,000 spatiotemporal tokens. On a single H100, that takes minutes. The real-time target is 33 milliseconds per frame. The gap between where we are and where we need to be is a factor of 218x, and GPUs are structurally incapable of closing it alone.

No one has shipped a chip designed for this workload. Every published diffusion accelerator --- EdgeDiff, SQ-DM, Ditto, SD-Acc, a silicon photonics proposal --- targets small image models at roughly 1B parameters. None handle temporal attention. None handle video. This article covers why the problem is hard, what video codec ASICs teach us about when specialization works, and what a purpose-built video diffusion chip would actually look like.

The Workload: 10 Models, One Pattern

The current generation of video diffusion transformers spans 1.3B to 30B parameters. All share the same basic structure: a 3D VAE compresses raw video into a latent space, a Diffusion Transformer (DiT) iteratively denoises that latent through 20-50 steps, and the VAE decodes the result back to pixels.

Model	Params	Layers	Hidden Dim	Architecture
Movie Gen (Meta)	30B	—	—	Full 3D DiT
Step-Video (StepFun)	30B	48	6,144	Full 3D DiT
Wan 2.1-14B (Alibaba)	14B	40	5,120	Full 3D DiT
HunyuanVideo (Tencent)	13B+	dual/single-stream	—	Dual-to-single stream DiT
Open-Sora 2.0	11B	—	—	Full 3D DiT
Mochi 1 (Genmo)	10B	48	3,072	Asymmetric DiT
Open-Sora Plan v1.5	8B	—	—	Skiparse 3D (SUV) DiT
CogVideoX-5B (Zhipu)	5B	—	—	Expert transformer
LTX-Video (Lightricks)	2-13B	—	—	DiT
Wan 2.1-1.3B	1.3B	30	1,536	Full 3D DiT

These dwarf image-generation models. The original DiT-XL/2 was 675M parameters. Video models are 10—50x larger, and the token sequences they process are 30—100x longer.

The Token Explosion

A 720p frame, after 8x spatial compression through the VAE, becomes a 90x160 grid of 14,400 latent tokens. A 5-second clip at 24fps with 4x temporal compression produces about 30 temporal positions. The joint token count: 14,400 x 30 = 432,000 tokens per denoising step.

For a single 720p image, attention costs O(14,400^2). For a 5-second video at the same resolution with full 3D attention, the cost is O(432,000^2). That is roughly 900x more attention compute for the video than for one frame. This single number --- 900x --- is why video diffusion is a categorically different hardware problem than image diffusion.
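The token arithmetic above is easy to reproduce. A minimal sketch, assuming 8x spatial and 4x temporal VAE compression and one token per latent position (the function name and its defaults are illustrative, not taken from any model's code):

```python
def latent_tokens(height, width, frames, spatial_down=8, temporal_down=4):
    """Return (tokens per frame, temporal positions, total tokens)."""
    per_frame = (height // spatial_down) * (width // spatial_down)
    temporal = frames // temporal_down
    return per_frame, temporal, per_frame * temporal


# 5 seconds of 720p video at 24 fps.
per_frame, temporal, total = latent_tokens(720, 1280, frames=5 * 24)
image_pairs = per_frame ** 2   # attention pairs for a single frame
video_pairs = total ** 2       # attention pairs for full 3D attention
print(per_frame, temporal, total, video_pairs // image_pairs)
# prints: 14400 30 432000 900
```

The 900x ratio is simply the square of the 30 temporal positions, which is why longer clips hurt so much more than higher resolution.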

[Figure: The Token Explosion. Attention cost is quadratic in token count. Image: 90 x 160 spatial grid, 14,400 tokens per frame, O(14,400^2), ~207M attention pairs. Video: 90 x 160 x 30 temporal volume, 432,000 tokens per step, O(432,000^2), ~186B attention pairs, about 900x more compute.]

Production models land at different points on this curve. Meta’s Movie Gen works with 73,000 video tokens for 16-second generation at 16fps. HunyuanVideo processes 129 frames at 720p, requiring ~60 GB VRAM on a single GPU. Step-Video needs ~78 GB for 204 frames.

What the Denoising Loop Actually Costs

Total inference compute follows a simple formula — one that also governs the iteration cost in PhysDiffuse-1’s scientific diffusion workloads, though at different scales:

Total FLOPs = denoising_steps x FLOPs_per_forward_pass

For a Wan-14B-class model at 720p with 50 denoising steps, a single forward pass through the transformer is on the order of 10 PetaFLOPs (dominated by linear projections at ~8.5 PF plus attention). Multiply by 50 steps: ~500 PFLOPs per video clip. On an H100 delivering ~990 TFLOPS BF16, that is ~500 seconds of continuous full-utilization compute --- but actual utilization is far below 100%, which is why real inference takes minutes.
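The same estimate as a small calculation. This is a sketch under the figures quoted above (10 PFLOPs per forward pass, 990 TFLOPS peak); the 50% utilization case is an illustrative assumption, not a measurement:

```python
def denoising_flops(steps, flops_per_pass):
    """Total FLOPs = denoising_steps x FLOPs_per_forward_pass."""
    return steps * flops_per_pass


def ideal_seconds(total_flops, peak_flops_per_s, utilization=1.0):
    """Wall-clock time if the device sustained `utilization` of its peak rate."""
    return total_flops / (peak_flops_per_s * utilization)


total = denoising_flops(steps=50, flops_per_pass=10e15)        # 500 PFLOPs
at_peak = ideal_seconds(total, peak_flops_per_s=990e12)        # about 505 s
at_half = ideal_seconds(total, 990e12, utilization=0.5)        # about 1,010 s
print(f"{total / 1e15:.0f} PFLOPs, {at_peak:.0f} s at peak, {at_half:.0f} s at 50%")
```

Halving the step count or the per-pass cost halves the total, which is why step distillation and caching dominate the optimization list later in the article.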

Why GPUs Waste Half Their Silicon

Profiling video diffusion on H100 reveals a structurally inefficient workload. The GPU is not slow because it lacks peak FLOPS. It is slow because it cannot keep its own hardware fed.

Memory Bandwidth Is the Dominant Bottleneck

The H100’s arithmetic intensity ratio is 295:1 (990 TFLOPS FP16 / 3.35 TB/s HBM3 bandwidth). Any operation with fewer than 295 FLOPS per byte of memory traffic leaves tensor cores idle waiting for data. Attention at batch size 1 --- the typical case for single-video generation --- achieves roughly 0.25—1.0 FLOPS/byte. The tensor cores are starving.
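A roofline-style check makes the starvation concrete. The sketch below uses only the two H100 figures quoted above; the function names are illustrative, and the simple min() model ignores caches and overlap:

```python
H100_PEAK_FLOPS = 990e12   # dense FP16/BF16 tensor throughput, FLOPs per second
H100_HBM_BYTES = 3.35e12   # HBM3 bandwidth, bytes per second


def machine_balance(peak_flops_per_s, bytes_per_s):
    """FLOPs per byte needed before compute, not memory, becomes the limit."""
    return peak_flops_per_s / bytes_per_s


def attainable_flops_per_s(intensity, peak_flops_per_s, bytes_per_s):
    """Simple roofline: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops_per_s, intensity * bytes_per_s)


balance = machine_balance(H100_PEAK_FLOPS, H100_HBM_BYTES)
for intensity in (0.25, 1.0, balance):
    rate = attainable_flops_per_s(intensity, H100_PEAK_FLOPS, H100_HBM_BYTES)
    print(f"{intensity:7.2f} FLOPs/byte -> {100 * rate / H100_PEAK_FLOPS:6.2f}% of peak")
```

Under this model an operation at 1 FLOP per byte can use well under 1% of the tensor-core peak, regardless of how many tensor cores the die has.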

Every denoising step reloads the full model from HBM. For a 14B model at FP8, that is 14 GB per step, times 50 steps = 700 GB of redundant weight traffic per video. At 3.35 TB/s, that traffic alone occupies the memory system for ~0.21 seconds at peak bandwidth (700 GB / 3.35 TB/s), and it moves no new information, since the weights are identical at every step.

Tensor Cores Are Idle Half the Time

H100 has 528 tensor cores across 132 SMs. They engage during matrix multiplies (QKV projections, attention matmul, FFN linear layers) but sit dark during softmax, layer norms, activation functions, and residual additions. These element-wise and reduction operations constitute roughly 30—40% of wall-clock time despite being a small fraction of total FLOPS. Estimated tensor core engagement during a DiT forward pass: ~50% of wall-clock time.

At batch size 1, even the GEMM operations degenerate toward matrix-vector multiplies that cannot fill the tensor core’s matrix-shaped execution units. A 256x256 tile is optimal; a 256x1 vector wastes most of it.

Kernel Launch Overhead Adds Up

Each denoising step launches hundreds of CUDA kernels (QKV projections, attention matmuls, softmax, layer norms, FFN layers, residual adds, and scheduling overhead across every transformer layer) at 3.3-9.6 microseconds each. For a 50-step inference through a 40-layer DiT, hundreds of kernels per step yield tens of thousands of total launches; at those per-launch costs, that is on the order of 0.1-0.5 seconds of pure dispatch overhead per video. CUDA graphs reduce per-launch cost marginally (3.8 us to 3.4 us), but the fundamental problem is architectural: the GPU’s programming model requires explicit kernel dispatch for every operation.

Effective Silicon Utilization: 40—50%

Combining die-area relevance (the H100 devotes ~70—80% of its 814 mm^2 die to compute-relevant blocks) with temporal utilization of active units (tensor cores at ~50%, memory controllers at ~60—70%), effective silicon utilization during video diffusion inference lands at roughly 40—50%. You are paying for 700W of cooling to use 400—550W of compute.

H200 helps on the bandwidth axis (4.8 TB/s, 43% more than H100) but has identical tensor cores. B200 doubles compute to ~9,000 TFLOPS FP8 and pushes bandwidth to ~8 TB/s, but its ops:byte ratio (~312:1 FP16) stays similar. See the Blackwell architecture deep dive for the full B200 specs. The MI300X offers 5.3 TB/s and 192 GB HBM3 --- 58% more bandwidth than H100 --- but ROCm’s software stack lags CUDA by 6—12 months, leaving 20—40% of the theoretical advantage unrealized.

The GPU scaling story is not encouraging either. HunyuanVideo on 8x H100 with NVLink achieves 5.6x speedup (338s vs 1,904s), which is 70% scaling efficiency. The remaining 30% is lost to all-reduce synchronization during sequence-parallel attention.

[Figure: H100 Die Area Breakdown. 814 mm^2 total, showing what actually works during video diffusion inference. Tensor cores: 528 across 132 SMs, 990 TFLOPS BF16 peak, ~50% active during diffusion. CUDA cores: FP32/INT32 units running softmax, norms, and activations, 30-40% of wall time. Memory controllers: 80 GB HBM3 interface at 3.35 TB/s, always active and the bottleneck. Other: NVLink and misc I/O, dark silicon for single-GPU diffusion. Callouts: 40-50% effective silicon utilization; 10K+ kernel launches per video.]

The 218x Real-Time Gap

The distance from current state-of-the-art to real-time is precisely quantified:

Model	Resolution	Per-Frame Time	Hardware
SVD-XT (25 steps)	576x1024	7.2 s	A100 80GB
HunyuanVideo 13B (50 steps)	1280x720	14.8 s	1x A100
HunyuanVideo 13B (USP, 8 GPU)	1280x720	2.6 s	8x A100
AnimateLCM (distilled)	512x512	~63 ms	—

From SVD’s 7.2 s/frame to the 33 ms target at 30 FPS: a factor of ~218x. The gap decomposes into separate optimization axes whose gains nominally multiply together:

Optimization	Factor	Mechanism
Step reduction (25 to 1—4 steps)	6—25x	Consistency/adversarial distillation
Quantization (BF16 to FP8/INT8)	1.5—2x	Lower-precision tensor ops, reduced BW
Architecture caching	2—4x	TeaCache, first-block cache, timestep skip
Compiler optimization	1.2—1.5x	torch.compile, TensorRT kernel fusion
FlashAttention-3	1.5—2x	Tiled SRAM attention, 75% H100 utilization
Multi-GPU parallelism	2—6x	Ring attention, sequence parallelism
Hardware generation jump	2—3x	H100 to B200 FLOPS and bandwidth

Theoretical combined ceiling: ~590x. But these factors are not independent --- step reduction degrades quality that caching cannot recover, quantization interacts with distillation, multi-GPU scaling has communication overhead. A realistic combined stack achieves 100—300x, placing real-time 1080p video on the edge of feasibility with aggressive optimization on next-generation hardware. Hardware specialization is what pushes it over the edge.
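To see how wide the naive range is, the sketch below multiplies the low and high ends of each row in the table. It treats the factors as fully independent, which the paragraph above says they are not, so it only brackets the quoted figures; how the ~590x ceiling itself was derived is not specified here.

```python
import math

# (low, high) speedup factors copied from the optimization table.
FACTORS = {
    "step reduction": (6.0, 25.0),
    "quantization": (1.5, 2.0),
    "architecture caching": (2.0, 4.0),
    "compiler optimization": (1.2, 1.5),
    "FlashAttention-3": (1.5, 2.0),
    "multi-GPU parallelism": (2.0, 6.0),
    "hardware generation": (2.0, 3.0),
}


def naive_product(factors, index):
    """Multiply one end (0 = low, 1 = high) of every range, assuming independence."""
    return math.prod(bounds[index] for bounds in factors.values())


low = naive_product(FACTORS, 0)    # about 130x
high = naive_product(FACTORS, 1)   # about 10,800x
gap = 7.2 / 0.033                  # about 218x, the real-time gap
print(f"naive low {low:.0f}x, naive high {high:.0f}x, gap {gap:.0f}x")
```

Even the all-low product lands within a factor of two of the 218x gap, which is consistent with the claim that real-time sits on the edge of feasibility rather than comfortably inside it.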

[Figure: Bridging the 218x Gap. Multiplicative optimizations from 7.2 s per frame (SVD-XT) to the 33 ms target (30 FPS): step distillation (25 to 1-4 steps) 6-25x; TeaCache / architecture caching 2-4x; hardware generation (B200) 2-3x; FP8 quantization 1.5-2x; streaming pipeline 1.5x. Theoretical ceiling ~590x; realistic stack 100-300x; gap to close 218x. The factors are not independent.]

Temporal Attention: The Hardware Design Space

The single most important architectural question for a video diffusion chip is how to handle temporal attention. The field has converged on five distinct strategies, each with different hardware implications.

Full 3D Joint Attention. Every spatiotemporal token attends to every other. Used by Sora, Step-Video, CogVideoX, HunyuanVideo, Wan. Pure dense matrix multiplication that maps well onto systolic arrays, but the attention matrix for 73K tokens at FP16 requires ~10 GB per head per layer --- far exceeding any on-chip SRAM. FlashAttention becomes non-optional.

Factored 2+1D Attention. Decompose 3D attention into separate spatial and temporal passes. Spatial: each frame is processed independently (14,400-token attention, once per temporal position, so 30 times in parallel for the clip above). Temporal: each spatial position is processed across frames (30-token attention, 14,400 times in parallel). This converts one enormous problem into many small ones. The Latte model found interleaved spatial-temporal blocks outperform late fusion, at ~5,573 GFLOPs for 673M parameters. Hardware challenge: spatial passes are compute-heavy; temporal passes are overhead-heavy (so small that on GPUs they become latency-bound by kernel launch overhead).
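A rough count of attention pairs shows how much factoring saves. This sketch assumes the 14,400 tokens per frame and 30 temporal positions from the token count earlier, and counts query-key pairs only (no projections, heads, or layers):

```python
def full_3d_pairs(tokens_per_frame, frames):
    """Every spatiotemporal token attends to every other token."""
    n = tokens_per_frame * frames
    return n * n


def factored_pairs(tokens_per_frame, frames):
    """Spatial pass per frame plus temporal pass per spatial position."""
    spatial = frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal


full = full_3d_pairs(14_400, 30)
factored = factored_pairs(14_400, 30)
print(f"full: {full:.3e}, factored: {factored:.3e}, saving: {full / factored:.1f}x")
```

Almost all of the remaining work is in the spatial pass; the temporal pass is about 0.2% of the factored total, which matches the observation that it is bound by overhead rather than arithmetic.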

Causal Temporal Attention. Frame t attends only to frames 0 through t. Halves effective attention computation. Maps naturally to a pipeline where each frame’s KV is computed and broadcast forward in temporal order --- conceptually similar to how codec hardware pipelines reference frame data through motion estimation.

Sliding Window Temporal Attention. Restricts each frame’s receptive field to W neighboring frames, reducing cost from O(T^2) to O(T x W). Open-Sora 1.3 uses this. Produces banded attention matrices with predictable stride-regular memory access --- excellent for hardware prefetching. On-chip SRAM need only hold W frames of KV data.

Sparse 3D Attention (Skiparse/SUV). Open-Sora Plan v1.5 introduced SUV: alternating single-skip and group-skip patterns that process 1/k of total tokens while maintaining global receptive field. U-shaped design uses low sparsity in shallow layers, high sparsity in deep layers. Achieves 35% end-to-end speedup (45% on attention) over dense 3D DiT while matching HunyuanVideo quality (VBench 83.02% vs 83.24%). Hardware challenge: irregular gather/scatter memory access conflicts with coalesced-access models.

The hardware lesson: The attention pattern --- which tokens attend to which --- changes across models and will keep changing. The math primitives under all five patterns are identical: tiled matrix multiplication, online softmax, ring communication, mixed-precision accumulation. A chip should hardwire the primitives but leave the attention topology programmable.
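One way to picture "hardwire the primitives, leave the topology programmable" is a single attention routine that takes the pattern as a mask. The NumPy sketch below is illustrative only: it uses one token per frame for brevity, and the function names and the window size are assumptions.

```python
import numpy as np


def frame_mask(num_frames, pattern, window=2):
    """Boolean [query, key] mask over frames for a given temporal pattern."""
    t = np.arange(num_frames)
    q, k = t[:, None], t[None, :]
    if pattern == "full":
        return np.ones((num_frames, num_frames), dtype=bool)
    if pattern == "causal":
        return k <= q                      # frame t sees frames 0..t
    if pattern == "sliding":
        return np.abs(q - k) <= window     # frame t sees neighbors within `window`
    raise ValueError(f"unknown pattern: {pattern}")


def masked_attention(q, k, v, mask):
    """The fixed primitive: scaled dot-product attention with a pluggable mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


for pattern in ("full", "causal", "sliding"):
    print(pattern, int(frame_mask(30, pattern).sum()), "active pairs")
```

Only `frame_mask` changes between patterns; the matmul and softmax path is identical, which is the part a chip could fix in silicon.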

[Figure: Temporal Attention Pattern Gallery. Five strategies with the same math primitives but different memory and compute profiles. Full 3D (used by Sora, Wan, Step-Video); Factored 2+1D (Latte, SVD); Causal (CogVideoX, Mochi); Sliding Window (Open-Sora 1.3); Skiparse (Open-Sora Plan v1.5).]

Quantization: Timestep-Dependent Precision

Video diffusion models are migrating down the precision ladder, but the sensitivity landscape is more complex than in LLMs.

The energy argument is overwhelming. An INT8 multiply costs ~0.2 pJ versus 3.7 pJ for FP32 --- 18.5x cheaper in energy per operation (Horowitz 2014, 45nm). DRAM reads cost ~640 pJ versus ~5 pJ for SRAM. Halving tensor size from FP16 to FP8 saves substantial data movement energy on top of the arithmetic savings.

W8A8 is the safe default. ViDiT-Q (ICLR 2025) confirms W8A8 quantization for video diffusion transformers with “negligible degradation in visual quality and metrics,” delivering 1.4—1.7x latency speedup and 2—2.5x memory savings. The MixDQ framework achieves the same without performance loss.

W4A8 is viable with care. Requires timestep-aware calibration. DiT architectures (FLUX, video transformers) are more quantization-resilient than UNets. SVDQuant achieves 4-bit weights on the 12B FLUX.1 model with 3.5x memory reduction and 3.0x speedup by absorbing weight outliers via SVD decomposition.

Temporal attention layers need higher precision than spatial layers. ViDiT-Q specifically identifies layers “responsible for retaining essential temporal information” as sensitive to bit-width reduction. Quantization errors in temporal attention weights cause frame-to-frame inconsistency --- flicker, jitter, unnatural motion discontinuities.

Denoising timestep matters. Quantization sensitivity varies by timestep:

Early timesteps (high noise): Tolerant of aggressive 4-bit quantization --- the signal is dominated by Gaussian noise.
Late timesteps (low noise, fine detail): Highly sensitive. With little noise left to mask it (high SNR), accumulated quantization error corrupts fine structure. These steps need 8-bit or higher.
Mid timesteps: W4A8 is typically safe.

MixDQ implements this via integer-programming-based bit-width allocation across layers and timesteps. A hardware scheduler that dynamically adjusts precision per denoising step can extract efficiency without quality loss --- something a GPU cannot do without software intervention at each step.
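A per-step precision scheduler can be sketched as a lookup from denoising progress to bit-widths. The phase boundaries (30% and 70%) and the per-phase choices below are illustrative assumptions, not MixDQ's integer-programming allocation; they only mirror the early/mid/late split described above.

```python
# (upper bound on progress, (weight_bits, activation_bits)); all values are assumptions.
PHASES = (
    (0.30, (4, 8)),   # early, high-noise steps: 4-bit weights tolerated
    (0.70, (4, 8)),   # mid steps: W4A8
    (1.00, (8, 8)),   # late, low-noise steps: W8A8 to protect fine detail
)


def precision_for_step(step, num_steps, phases=PHASES):
    """Return (weight_bits, activation_bits) for a step; step 0 is the noisiest."""
    progress = step / max(num_steps - 1, 1)
    for upper, bits in phases:
        if progress <= upper:
            return bits
    return phases[-1][1]


schedule = [precision_for_step(s, 30) for s in range(30)]
mean_weight_bits = sum(w for w, _ in schedule) / len(schedule)
print(f"8-bit steps: {sum(w == 8 for w, _ in schedule)}, mean weight bits: {mean_weight_bits:.1f}")
```

With these assumed boundaries, 9 of 30 steps run at 8-bit weights and the average is 5.2 bits, so most of the 4-bit saving survives while the detail-sensitive steps are protected.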

Microscaling is the emerging standard. The OCP Microscaling Specification (backed by AMD, ARM, Intel, Meta, Microsoft, NVIDIA, Qualcomm) defines MXFP4/MXFP6/MXFP8: block floating-point where 32 elements share a common 8-bit scale factor. NVIDIA Blackwell implements MXFP4 natively. B200 FP4 roughly doubles FP8 throughput again.
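The shared-scale idea can be sketched in a few lines of NumPy. This is a simplified illustration in the spirit of the MX formats (32 elements per block, one power-of-two scale per block), using signed integer elements for clarity; it is not a bit-exact implementation of the OCP encodings, and the function names are assumptions.

```python
import numpy as np

BLOCK = 32  # elements that share one scale


def mx_like_quantize(x, elem_bits=8):
    """Quantize a 1-D array in blocks of 32 with a shared power-of-two scale.

    Supports elem_bits up to 8 (elements are stored as int8).
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % BLOCK
    blocks = np.pad(x, (0, pad)).reshape(-1, BLOCK)
    qmax = 2 ** (elem_bits - 1) - 1
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(max_abs > 0, max_abs, 1.0)          # avoid log2(0) for all-zero blocks
    # Smallest power-of-two scale such that max_abs / scale <= qmax.
    exponents = np.ceil(np.log2(safe / qmax)).astype(np.int32)
    scales = np.exp2(exponents.astype(np.float64))
    q = np.clip(np.rint(blocks / scales), -qmax, qmax).astype(np.int8)
    return q, exponents, len(x)


def mx_like_dequantize(q, exponents, length):
    """Rebuild floats from integer elements and per-block exponents."""
    scales = np.exp2(exponents.astype(np.float64))
    return (q.astype(np.float64) * scales).reshape(-1)[:length]


x = np.random.default_rng(0).standard_normal(100)
q, exponents, n = mx_like_quantize(x)
x_hat = mx_like_dequantize(q, exponents, n)
print("blocks:", q.shape[0], "max abs error:", float(np.abs(x - x_hat).max()))
```

Because the scale is a power of two, applying it in hardware is an exponent adjustment rather than a multiply, and each block adapts to its own dynamic range.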

What Codec ASICs Teach Us

Video codec hardware --- from H.264 through AV1 --- represents one of silicon engineering’s greatest success stories, and the parallels to video diffusion are instructive.

The Forty-Year Pattern

Each codec generation brought a wave of specialization. H.264 (2003) introduced integer DCT, quarter-pixel motion compensation, and CABAC entropy coding; by 2019 it was used by 91% of video developers. H.265/HEVC (2013) doubled compression at substantially higher complexity; MIT demonstrated a 4K30 HEVC decoder at under 0.1W. AV1 (2018) was “orders of magnitude slower” than HEVC in software; hardware became essential rather than optional. VVC/H.266 (2020) pushes 10x the encoding complexity of HEVC.

Every major silicon vendor --- NVIDIA (NVENC/NVDEC), Intel (Quick Sync), Apple (Media Engine), AMD (VCN) --- ships dedicated fixed-function codec blocks alongside programmable compute cores. These are non-programmable, non-flexible, and extraordinarily efficient.

Structural Parallels

Codec Stage	Diffusion Analog	Key Difference
Motion estimation	Temporal attention	ME is search; attention is learned weighted average
DCT transform	VAE encoder	DCT is fixed linear; VAE is nonlinear learned
Quantization	Latent space bottleneck	Fixed step sizes vs. learned compression
In-loop filtering	Self-attention refinement	Fixed rules vs. context-dependent learned refinement
Reference frame buffer	KV cache / latent state buffer	Both demand large on-chip memory
Iterative RDO loop	Iterative denoising	RDO optimizes known objective; denoising follows learned score

Why Codec ASICs Succeeded: Four Conditions

Standardized, frozen algorithms. Once ratified, the bitstream format was locked. H.264 hardware from 2006 still decodes 2026 streams.
Massive, predictable volume. Billions of devices per year amortize NRE.
Clear computational bottleneck. Motion estimation alone consumes 60—80% of encoder compute, with regular data access patterns amenable to massive parallelism.
Asymmetric encode/decode. Decode is dramatically simpler, so even low-power mobile chips include it.

Video diffusion violates most of these. No frozen algorithm --- DiT replaced UNet; flow matching challenges DDPM; consistency models promise single-step generation. No standardization --- Sora, Runway Gen-3, Kling, HunyuanVideo all use proprietary architectures. Irregular compute --- attention has data-dependent memory access. Rapid precision evolution --- FP32 to FP16 to BF16 to FP8 to MXFP4 in four years.

Five Lessons That Transfer

Identify durable primitives. In codecs, DCT and block matching survived four decades. In diffusion, matrix multiplication and attention are the candidates. Hardware that accelerates these remains useful even as model architectures change.
Memory bandwidth is the real bottleneck. Codec ASICs dedicate enormous die area to reference frame SRAM. Diffusion chips need the same discipline for KV caches and activations.
Generation is harder than decode --- always has been. Codec decode is fixed cost. Encoding requires RD search. Diffusion generation requires iterative stochastic sampling. Hardware must be sized for the generation case.
Standardization unlocks hardware investment. The codec world’s willingness to commit billions to silicon came after algorithm standardization. Until AI video generation converges on a stable architecture, programmability is non-negotiable.
Power efficiency will eventually force specialization. MIT’s 0.1W HEVC decoder shows the endgame: 100x efficiency over software. If video generation must run on mobile devices, similar gains will demand specialized silicon.

[Figure: Codec ASIC Evolution Timeline. 2003: H.264, first mass ASIC. 2013: H.265/HEVC, MIT 0.1W decoder. 2018: RTX 2080, NVENC/NVDEC standard. 2020: AV1, open codec with Google and Apple hardware. 2025: VDX-1, a video diffusion ASIC? Legend: codec ASICs had a frozen standard, massive volume, and proven ROI; diffusion ASICs face a moving target, no standard, and an unproven market.]

VDX-1: A Chip Architecture for Video Diffusion

VDX-1 is a weight-stationary, multi-die chiplet design targeting TSMC N4P. The core thesis: keep all model weights resident in on-chip SRAM across all denoising steps, and build specialized datapaths for the four operations that dominate video diffusion.

Weight-Stationary Dataflow: The Entire Point

On a GPU, every denoising step reloads the full model from HBM. For a 14B model at FP8: 14 GB x 30 steps = 420 GB of weight traffic per video (or 700 GB at 50 steps). At H100’s 3.35 TB/s, weight loading alone takes ~0.13-0.21 seconds even at peak bandwidth, and every byte of it is a repeat of data the chip has already seen.

VDX-1 loads weights once at initialization and pins them in on-chip SRAM. During the denoising loop, only activations and latents move. This eliminates hundreds of GB of redundant memory traffic per video and converts the entire workload from memory-bound to compute-bound — a weight-stationary approach that uses systolic array datapaths throughout.
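The weight-traffic arithmetic can be checked with a short calculation. A minimal sketch, assuming decimal units (1 TB/s = 1,000 GB/s), one byte per parameter at FP8, and peak HBM bandwidth; the function names are illustrative:

```python
def weight_traffic_gb(params_billion, bytes_per_param, steps):
    """GB of weights streamed from memory when every step reloads the model."""
    return params_billion * bytes_per_param * steps


def transfer_seconds(gigabytes, bandwidth_tb_per_s):
    """Time to move `gigabytes` at a sustained bandwidth given in TB/s."""
    return gigabytes / (bandwidth_tb_per_s * 1000.0)


for steps in (30, 50):
    reload_gb = weight_traffic_gb(14, 1, steps)   # GPU-style: reload every step
    pinned_gb = weight_traffic_gb(14, 1, 1)       # weight-stationary: load once
    print(steps, "steps:", reload_gb, "GB reloaded vs", pinned_gb, "GB pinned,",
          f"{transfer_seconds(reload_gb, 3.35):.3f} s at 3.35 TB/s")
```

Under weight-stationary dataflow the transfer is paid once at initialization instead of once per step, so the per-video traffic no longer grows with the step count.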

14 GB of On-Chip SRAM

This is the hard engineering bet. At W4A8 quantization (4-bit weights, 8-bit activations), the 14B model fits in 7 GB. At FP8, it requires the full 14 GB. For context:

Chip	On-Chip SRAM	Process
NVIDIA B200	~256 MB	N4
Apple M4 Ultra	~192 MB	N3E
Google TPU v6e	~110 MB	N4
Cerebras WSE-3	44 GB	N5
VDX-1	14 GB (4 dies)	N4P

At N4P SRAM density (~30—35 Mbit/mm^2), 14 GB = 112 Gbit requires ~3,200—3,700 mm^2 --- impossible on a single die (reticle limit ~800 mm^2). This drives the multi-chiplet design. At W4A8, each compute die needs ~1.75—2.3 GB, roughly 60% of die area for SRAM, comparable to Cerebras’ ratio.
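The area estimate follows directly from capacity and density. A sketch assuming decimal units (1 GB = 8,000 Mbit) and the 30-35 Mbit/mm^2 density range quoted above; the function name is illustrative:

```python
def sram_area_mm2(gigabytes, mbit_per_mm2):
    """Die area needed for `gigabytes` of SRAM at a given density."""
    megabits = gigabytes * 8 * 1000
    return megabits / mbit_per_mm2


RETICLE_LIMIT_MM2 = 800  # approximate single-die limit quoted in the text

for label, capacity_gb in (("FP8, 14 GB", 14), ("W4A8, 7 GB", 7)):
    best = sram_area_mm2(capacity_gb, 35)
    worst = sram_area_mm2(capacity_gb, 30)
    print(f"{label}: {best:.0f}-{worst:.0f} mm^2, "
          f"at least {best / RETICLE_LIMIT_MM2:.1f} reticle-sized dies of pure SRAM")
```

Even the 4-bit case needs roughly two reticles of SRAM alone, before any compute logic, which is what forces the chiplet split.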

Specialized Engines

Spatial Attention Engine (SAE). Handles within-frame self-attention over 14,400 tokens per frame. 64 lanes, each processing one attention head, with 128x128 systolic arrays for matmul. FlashAttention-style tiling: loads Q tiles, streams K/V from SRAM, accumulates online softmax. Fused QKV projection + attention + output projection in a single pass to avoid activation spills.
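The online-softmax accumulation the SAE relies on can be sketched in NumPy. This is a software illustration of the FlashAttention-style idea, not the hardware datapath: it keeps all query rows resident and streams K/V in tiles, whereas the engine would also tile Q; the tile size and function names are assumptions.

```python
import numpy as np


def tiled_attention(q, k, v, tile=64):
    """Attention computed one K/V tile at a time with a running (online) softmax."""
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    running_max = np.full((n, 1), -np.inf)
    running_sum = np.zeros((n, 1))
    for start in range(0, k.shape[0], tile):
        k_tile, v_tile = k[start:start + tile], v[start:start + tile]
        scores = q @ k_tile.T / np.sqrt(d)
        new_max = np.maximum(running_max, scores.max(axis=1, keepdims=True))
        correction = np.exp(running_max - new_max)   # rescale earlier partial results
        p = np.exp(scores - new_max)
        running_sum = running_sum * correction + p.sum(axis=1, keepdims=True)
        out = out * correction + p @ v_tile
        running_max = new_max
    return out / running_sum


def naive_attention(q, k, v):
    """Reference: materializes the full score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (weights / weights.sum(axis=1, keepdims=True)) @ v


rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((16, 8)), rng.standard_normal((200, 8)), rng.standard_normal((200, 4))
print(np.allclose(tiled_attention(q, k, v), naive_attention(q, k, v)))
```

The tiled version never holds more than one tile of scores at a time, which is the property that lets attention over 14,400 tokens run out of on-chip SRAM.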

Temporal Attention Engine (TAE). Handles cross-frame attention at each spatial position --- 30 tokens per sequence. The compute is negligible (30x30 attention = 1.8 KB), but the bottleneck is gather: collecting the same spatial position across 30 frames from the latent buffer. 256 lanes running in parallel across spatial positions, optimized for strided memory access rather than raw FLOPS.

Why separate engines? Spatial attention is compute-bound (large token count). Temporal attention is memory-bound (tiny token count, scattered access). A unified engine would be poorly utilized in both cases.

Stochastic Generation Unit (SGU). 256 parallel LFSR-based uniform RNG lanes (Xoshiro256++) with paired Box-Muller transform units. Throughput: 96 billion Gaussian samples/second at 1.5 GHz. Generates the full 6.9M-element noise tensor in ~72 microseconds, running concurrently with the previous step’s transformer pass via double buffering. Total area: ~2 mm^2.
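The Box-Muller step and the throughput figure can both be sketched in plain Python. Python's `random.Random` (a Mersenne Twister) stands in for the hardware Xoshiro256++ lanes here, which is an assumption made for portability; the function names are illustrative.

```python
import math
import random


def box_muller(u1, u2):
    """Map two uniforms (u1 in (0, 1], u2 in [0, 1)) to two standard normal samples."""
    radius = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return radius * math.cos(theta), radius * math.sin(theta)


def gaussian_samples(n, seed=0):
    """Generate n Gaussian samples; the software RNG stands in for hardware lanes."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()   # shift [0, 1) to (0, 1] so log(u1) is finite
        u2 = rng.random()
        out.extend(box_muller(u1, u2))
    return out[:n]


def noise_fill_microseconds(elements=6.9e6, samples_per_second=96e9):
    """Time to fill the noise tensor at the quoted SGU throughput."""
    return elements / samples_per_second * 1e6


print(f"{noise_fill_microseconds():.1f} us to fill the noise tensor")
```

Each Box-Muller evaluation yields two samples from two uniforms, so the transform units need one logarithm, one square root, and a sine/cosine pair per sample pair.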

VAE Decoder Engine. Dedicated die with 200 MB weight SRAM for the ~200M-parameter 3D causal VAE decoder. Streaming pipeline that decodes frame-by-frame. First-frame latency: ~8 ms (enables playback before full decode completes). Full 120-frame decode: ~120 ms.

Timestep Conditioning Unit (TCU). Computes sinusoidal positional encoding and the AdaLN-Zero MLP in ~500 ns, broadcasting per-layer scale/shift/gate parameters via a dedicated conditioning bus. Since all 30 timesteps are known before inference begins, the TCU precomputes all conditioning vectors during weight loading.

Memory Hierarchy

Level           Size              Contents                 Bandwidth
-----           ----              --------                 ---------
PE registers    128 KB/die        Partial sums             --
Weight SRAM     14 GB (4 dies)    Model weights (pinned)   ~50 TB/s aggregate
Activation      512 MB/die        KV cache, attention      ~12 TB/s per die
 scratchpad     (2 GB total)      intermediates
HBM3e           48 GB (2 stacks)  Latents, pixels,         2 TB/s
                                  overflow

Weight SRAM and activation scratchpad are physically separate. Weights use high-density single-port SRAM (read-only during inference). Activations use dual-port SRAM for read-write bandwidth. This heterogeneous strategy saves ~15% die area over uniform dual-port.

During the denoising loop, HBM bandwidth utilization is < 5%. HBM exists for initial weight load (~7 ms for 14 GB at 2 TB/s), latent tensor storage across frames (13.8 MB/step), and final pixel output (332 MB for 120 frames at 720p). The entire denoising loop runs on-chip.

Multi-Die Design

Four chiplets on a silicon interposer with UCIe D2D links at 800 GB/s:

Dies 0—2 (Compute): Hold 10/10/20 transformer layers with weights pinned in local SRAM. Pipeline-parallel: Die 0 processes layers 0—9, Die 1 processes 10—19, Die 2 processes 20—39. Activation transfers at die boundaries (~40 MB) take 50 microseconds at 800 GB/s --- negligible versus per-layer compute.
Die 3 (VAE + Utility): VAE decoder, SGU, TCU, scheduler, I/O.
Die sizes: ~500—600 mm^2 each. Total package: ~2,000 mm^2 active silicon on a 2,500 mm^2 interposer.

[Figure: VDX-1 Chiplet Layout. Four dies on a silicon interposer (TSMC N4P, 2,500 mm^2 package), connected by UCIe D2D links at 800 GB/s per link, with HBM3e alongside (48 GB across 4 stacks; one stack labeled 12 GB). Die 0: transformer layers 0-9, SAE (64 lanes), TAE (256 lanes), 3.5 GB weight SRAM. Die 1: transformer layers 10-19, SAE (64 lanes), TAE (256 lanes), 3.5 GB weight SRAM. Die 2: transformer layers 20-39, SAE (64 lanes), TAE (256 lanes), 7 GB weight SRAM. Die 3: VAE + utility, with a streaming 3D causal VAE decoder, SGU, TCU, and 200 MB SRAM. Totals: 14 GB weight SRAM, 475W TDP.]

Performance Estimates

At 1,024 TOPS aggregate (INT4 x INT8 MAC across 3 compute dies), 1.5 GHz clock:

Metric	Value
Per-step compute	~30.5 TFLOPs (linear projections ~28T, spatial attention ~2.4T, temporal ~0.02T)
30-step total	~915 TFLOPs
Denoising latency	~1.3 s (at ~700 effective TOPS)
VAE decode	~120 ms
Total per 5s video	~1.5 seconds (3.3x real-time)
With temporal reuse (Ditto-style, 30—40% skip)	~1.0 s (5x real-time)
TDP	475W
Estimated cost	$5,000—$8,000 per chip

For comparison: an H100 at $25,000—$30,000 takes minutes for the same video, at 700W.
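The headline latency follows from the table's own inputs. The sketch below takes the per-step estimate at face value and treats TFLOPs and TOPS as interchangeable operation counts, both of which are assumptions; the function name is illustrative.

```python
def vdx1_latency(per_step_tflops=30.5, steps=30, effective_tops=700.0, vae_decode_s=0.120):
    """Return (denoising seconds, total seconds) from the table's estimates."""
    denoise_s = per_step_tflops * steps / effective_tops
    return denoise_s, denoise_s + vae_decode_s


denoise_s, total_s = vdx1_latency()
clip_seconds = 5.0
print(f"denoise {denoise_s:.2f} s, total {total_s:.2f} s, "
      f"{clip_seconds / total_s:.1f}x faster than playback")
```

The result is dominated by the per-step compute estimate, so any revision to that number moves the latency proportionally; the VAE decode is under a tenth of the total.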

Risks

Yield. 550 mm^2 dies at N4P will have ~50-60% yield. With 4 dies per package and no pre-assembly screening, die yields compound (0.5^4 to 0.6^4), so package-level yield could fall to roughly 6-13%. Known-good-die testing is essential.

Obsolescence. If video generation shifts from DiT to autoregressive transformers or some other paradigm, VDX-1 becomes expensive scrap. Mitigated by making the systolic arrays and attention engines programmable enough to handle any attention-based architecture.

Model size growth. 14 GB SRAM fits current 14B models at W4A8. If models grow to 50B+ (plausible within 2 years), the chip cannot hold weights on-chip. INT3/INT2 quantization could extend the runway.

The Path to Real-Time

Real-time video diffusion at 30 FPS, 1080p, from a single device requires combining every optimization axis simultaneously:

Milestone	Per-Frame Budget	Steps	Precision	Hardware	Status
Current SOTA (SVD-XT)	7,200 ms	25	BF16	1x A100	Deployed
AnimateLCM-class	60—100 ms	1—4	FP16	1x RTX 4090	Demonstrated at 512x512
Real-time 720p (pure diffusion)	33—42 ms	1	FP8	1x H100/H200	Feasible with full optimization stack
Real-time 1080p (hybrid keyframe+interp)	133 ms diffusion + 5 ms interp	1	FP8	1x H100	Feasible now with engineering
Consumer real-time 1080p	33 ms	1	INT8/FP8	RTX 5090/6090	Projected 2027—28

The hybrid keyframe + interpolation approach is the most immediately viable path: generate every 4th frame with the diffusion model (133 ms budget), fill gaps with a lightweight frame interpolation network (RIFE, FILM). The diffusion model generates 7.5 FPS of keyframes; interpolation fills to 30 FPS.
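The keyframe budget is a one-line calculation. A sketch assuming a fixed keyframe interval and ignoring the interpolation network's own latency; the function name is illustrative:

```python
def keyframe_budget(target_fps=30.0, keyframe_interval=4):
    """Return (keyframes per second, milliseconds available per keyframe)."""
    keyframe_fps = target_fps / keyframe_interval
    return keyframe_fps, 1000.0 / keyframe_fps


for interval in (1, 2, 4, 8):
    fps, budget_ms = keyframe_budget(30.0, interval)
    print(f"every {interval} frame(s): {fps:.2f} keyframes/s, {budget_ms:.0f} ms per keyframe")
```

Moving from every frame to every 4th frame relaxes the diffusion budget from about 33 ms to about 133 ms, at the cost of relying on the interpolator for three of every four output frames.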

The RTX ray tracing analogy is instructive. Real-time ray tracing went from “technically possible but unusable” (RTX 2080, 2018) to “native 4K” (RTX 5090, 2025) across 3 generational steps and 7 years. DLSS was as important as RT Core improvements. Video diffusion is on a similar curve: from minutes per clip (2022) to approaching real-time (2026), with full real-time projected for 2027-2028. The combination of distillation, streaming pipelines, and hardware specialization will get there; the open question is whether that hardware looks like a better GPU or a purpose-built chip.

Where This Goes

The argument for dedicated video diffusion silicon rests on a bet about convergence. Codec ASICs succeeded because four conditions were met simultaneously: frozen algorithms, massive volume, clear bottlenecks, and asymmetric encode/decode. Video diffusion currently meets only one of these (clear bottleneck). The other three are open questions.

If the field converges on DiT-class architectures with attention as the durable primitive --- and the evidence points this direction, since attention has survived from 2017 to 2026 across NLP, vision, and video --- then the VDX-1 approach works: hardwire the math, leave the topology programmable, put the weights on-chip, and eliminate the memory wall.

If the field fractures into competing paradigms (autoregressive video, consistency models, GAN hybrids, architectures not yet invented), then programmability wins and GPUs remain the platform.

The day a fixed-function video generation engine appears on a chip spec sheet, analogous to NVENC appearing on Kepler in 2012, will be the day the field has matured enough to standardize. That day is not yet here. But the physics of power efficiency will eventually force specialization. MIT’s 0.1W HEVC decoder consumed 100x less power than software: a 4K30 H.265 decode that costs 10W in software costs 0.1W in silicon. Video generation on mobile devices will demand the same kind of gains.

Until then, the most probable near-term outcome is not a diffusion ASIC but increasingly specialized programmable accelerators --- enhanced tensor/attention engines with massive on-chip memory, hardware support for the iterative denoising loop, and tight multi-chip interconnect for sequence-parallel attention. The JEPA-R robotics chip takes a complementary approach, avoiding the denoising loop entirely by predicting in latent space and offloading pixel rendering to VDX-1. VDX-1 represents the logical endpoint of that trajectory: the chip you build when the architecture stops moving.

Additional Reading

DiT: Scalable Diffusion Models with Transformers — Peebles & Xie
Stable Video Diffusion — Blattmann et al.
Consistency Models — Song et al. 2023
Open-Sora — Zheng et al. | GitHub
Open-Sora Plan — Lin et al.
HunyuanVideo — Kong et al., Tencent
Movie Gen — Polyak et al., Meta (30B params, 88 authors)
TeaCache — Liu et al., CVPR 2025 (4.41x speedup)
EXION: Diffusion Model Accelerator — Heo et al., HPCA 2025
Ditto: Temporal Value Similarity for Diffusion — Kim et al., HPCA 2025
SnapGen-V: 5s Video in 5s on Mobile — Wu et al., CVPR 2025
Wan 2.1: Open Large-Scale Video Generative Models — Team Wan, Alibaba (14B and 1.3B)
Step-Video-T2V — StepFun (30B params, 48 layers, 6144 hidden dim)
AnimateLCM — Wang et al., SIGGRAPH Asia 2024 (25s to ~1s generation)
ViDiT-Q: Quantization of Diffusion Transformers — ICLR 2025 (W8A8/W4A8 for video DiT)
SVDQuant: 4-Bit Diffusion Models — Li et al., ICLR 2025 Spotlight (3.5x memory reduction on FLUX.1 12B)
Mochi 1: Asymmetric DiT for Video — Genmo (10B params, 48 layers)
CogVideoX — Zhipu AI (5B expert transformer)
