SpecDecode-1: A Speculative Decoding ASIC

Every LLM inference chip today treats speculative decoding as a software optimization running on general-purpose hardware. Draft models run on the same tensor cores designed for training-scale matrix multiplications. Tree-structured attention masks get shoe-horned into attention kernels optimized for dense lower-triangular causal patterns. Acceptance sampling round-trips through the CPU. KV-cache sharing between draft and verifier is a memory management nightmare implemented in Python. SpecDecode-1 treats speculative decoding as the primary design target — dedicated silicon where the draft-verify-accept loop is the fundamental operation, not an afterthought. Every transistor on this chip exists to serve one purpose: produce more accepted tokens per verification cycle, at lower latency, than any general-purpose accelerator can achieve.

Why Speculative Decoding Deserves Its Own Chip

Speculative decoding is the single most effective technique for reducing single-stream autoregressive latency without changing the model or losing output quality. A small draft model generates K candidate tokens; the large target model verifies all K in a single parallel forward pass; rejection sampling guarantees the output distribution is identical to the target model. Zero quality loss, multiple tokens per forward pass.

The software results are already impressive:

| Technique | Speedup | Mechanism |
| --- | --- | --- |
| Classic speculative decode (Leviathan et al., 2022) | 2-3x | Small draft model + parallel verify |
| Medusa (multi-head prediction) | 2.2-3.6x | Multiple prediction heads, no draft model |
| EAGLE-2 (dynamic draft trees) | 3.05-4.26x | Context-aware, runtime-configurable tree topologies |
| Sequoia (hardware-aware trees) | 4.04x (A100), up to 9.96x (offloaded) | DP-optimal tree topology per hardware profile |

But these are software speedups on hardware not designed for the pattern. Three structural mismatches bleed performance on GPUs:

The draft model wastes GPU cycles. An EAGLE-style draft head for a 7B target model has ~240M parameters. At INT4 quantization (half a byte per weight), the entire model fits in ~120 MB. An H100 has 80 GB of HBM and 989 TFLOPS of tensor core capacity. Running a 240M model on hardware designed for trillion-parameter training is like using a cargo ship to deliver a letter: the overhead of launching GPU kernels, scheduling SMs, and managing memory contexts exceeds the actual computation. The draft model needs less than 1% of the GPU’s compute, but it pays 100% of the dispatch and scheduling tax.

Tree verification wastes GPU cycles. Standard causal attention uses a lower-triangular mask — regular, predictable, optimized to within an inch of its life by FlashAttention. Tree-structured speculative attention uses an irregular mask where each node attends only to its ancestors in the tree. This mask changes every verification step. Generating it in software burns cycles. Applying it to attention kernels designed for dense masks wastes compute on masked-out positions. FlashAttention’s tiling strategy assumes contiguous KV sequences; tree branches are non-contiguous.

KV-cache sharing is a memory management nightmare. The draft model and the verifier operate on the same sequence but maintain separate KV-cache contexts. When the draft model forks into a tree of candidates, each branch needs its own KV-cache extension — but the prefix is shared. On a GPU, this means either duplicating the shared prefix (wasting memory) or implementing complex pointer arithmetic through vLLM’s PagedAttention (adding software overhead per verification cycle). When branches get rejected, their KV-cache must be freed. When they’re accepted, they must be committed. This commit/rollback pattern is a transactional memory problem implemented in Python on hardware with no transactional memory support.

The fundamental insight: speculative decoding is not a single workload. It is three different workloads with three different hardware profiles that must execute in a tightly coordinated pipeline:

  1. Drafting — tiny model, extreme latency sensitivity, wants SRAM-only deterministic execution (Groq-like)
  2. Verification — large model, irregular tree-attention, wants high memory bandwidth with flexible batch handling (GPU-like but with tree-mask support)
  3. Orchestration — tree construction, acceptance sampling, KV-cache management, wants transactional memory semantics (database-like)

No general-purpose chip is optimized for all three simultaneously. SpecDecode-1 is.

The Speculative Decode Compute Pattern

Before the architecture, the workload. Understanding what hardware needs to optimize requires tracing one complete draft-verify-accept cycle.

Draft phase. A tiny model (100M-1B parameters) autoregressively generates a tree of K candidate tokens. For EAGLE-2, this means 3 forward passes through a single transformer layer + FC head to produce a 10-token draft tree. For Sequoia, trees can reach 64-768 nodes depending on hardware. The draft model must be fast (sub-50 microseconds per token) because draft latency directly gates end-to-end throughput. Every microsecond of draft jitter wastes verifier capacity. The draft model’s KV-cache is tiny (a few MB) but must be read at on-chip SRAM latency. The compute is trivial (roughly 0.2-2 GFLOP per token at 2 FLOPs per parameter) but must execute with deterministic timing.

Verify phase. The large target model runs a single forward pass over the entire draft tree. This is a batch of K candidates with tree-structured attention — each node attends only to its ancestors. For a 7B model with a 64-node tree, the verification pass requires processing 64 tokens through 32 transformer layers. The compute is moderate (comparable to a 64-token prefill) but the attention pattern is irregular. The verifier must stream weights from HBM for models above ~7B parameters, making memory bandwidth the binding constraint.

Accept/reject. At each position in the tree, compare the draft model’s probability distribution against the verifier’s. Accept with probability min(1, p_verifier(x) / p_draft(x)). Reject and resample from an adjusted distribution otherwise. Walk the tree from root to leaves, accepting tokens along the longest valid path. This is a sampling + comparison operation that must be atomic — it cannot be split across kernel launches. On a GPU, this round-trips through the CPU and Python runtime, adding 1-5 ms per cycle. On dedicated hardware, it completes in clock cycles.
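The accept/reject rule for a single drafted token can be sketched in a few lines. This is a minimal illustration, not the hardware datapath: the function name `accept_or_resample` and the dict-based toy distributions are assumptions made for the example. The "adjusted distribution" is taken to be the normalized residual max(0, p_verifier - p_draft), as in the classic speculative sampling formulation.

```python
import random


def accept_or_resample(token, p_draft, p_verifier, rng=random):
    """Return (accepted, token) for one drafted token.

    p_draft and p_verifier map token -> probability over the same vocabulary
    (a toy representation chosen for this sketch).
    """
    ratio = p_verifier.get(token, 0.0) / p_draft[token]
    if rng.random() < min(1.0, ratio):
        return True, token

    # Rejected: resample from the residual max(0, p_verifier - p_draft).
    residual = {
        t: max(0.0, p_verifier.get(t, 0.0) - p_draft.get(t, 0.0))
        for t in p_verifier
    }
    threshold = rng.random() * sum(residual.values())
    running = 0.0
    chosen = None
    for t, weight in residual.items():
        chosen = t
        running += weight
        if threshold < running:
            break
    return False, chosen


p_d = {"a": 0.5, "b": 0.5}
p_v = {"a": 1.0, "b": 0.0}
print(accept_or_resample("a", p_d, p_v))  # ratio 2.0, always accepted
print(accept_or_resample("b", p_d, p_v))  # ratio 0.0, always rejected
```

In the toy case the verifier puts all its mass on "a", so a drafted "b" is always rejected and the residual resample always returns "a".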

KV-cache management. The accepted path’s KV-cache entries become permanent. All rejected branch entries must be freed. The verifier’s newly computed KV entries for the accepted prefix must be committed. The draft model’s KV-cache must be reset to the new tip position. This is a transactional commit/rollback with zero-copy sharing between draft and verifier — a hardware page table operation, not a software memory management problem.

SpecDecode-1 Architecture

Five purpose-built subsystems on a single die.

Figure: SpecDecode-1 die floorplan (~650-1136 mm², TSMC N4/N5). Blocks shown: Draft Accelerator (4 micro-cores uC0-uC3, 64 MB SRAM, ~15-20 mm²); Verifier Engine (256 tensor cores, tree-attention mask generation, 512 MB SRAM, ~150-180 mm²); Tree Manager (FSM tree builder, HW RNG sampler, mask generator, ~5-8 mm²); Shared KV-Cache Pool (2-4 GB banked SRAM, hardware page tables, zero-copy dual-port, copy-on-write branching, single-cycle commit/rollback, dedup tags, ~400-800 mm², area-dominant); ring interconnect with dual-port SRAM bus; eight 16 GB HBM3e stacks (~50-80 mm² interface).

| Component | Area (mm²) | Function |
| --- | --- | --- |
| Draft Accelerator | 15-20 | 64 MB SRAM + 4 micro-cores, Groq-style deterministic pipeline |
| Verifier Engine | 150-180 | 256 tensor cores + 512 MB SRAM, native tree-attention |
| Shared KV-Cache Pool | 400-800 | 2-4 GB banked SRAM, hardware page tables, zero-copy dual-port |
| Token Tree Manager | 5-8 | FSM tree builder, HW RNG sampler, mask generator |
| HBM3e Interface | 50-80 | 8 stacks, 128 GB, ~8 TB/s |
| I/O, clocking, misc | 30-50 | PCIe, power management |
| Total | 650-1136 | Comparable to H100 (814 mm²) / B200 (~1000 mm²) |

The die is dominated by SRAM for the KV-cache pool. This is the deliberate architectural bet: trade die area for zero-latency KV-cache access. A cost-reduced variant with 1 GB on-chip KV-cache (~200 mm²) brings the total to ~450-550 mm² — well within the reticle limit and very manufacturable.

Component 1: Draft Accelerator

The draft accelerator exists for one reason: run a draft model at absolute minimum latency with absolute zero jitter. Not low latency. Zero jitter. Deterministic, cycle-accurate execution where every memory access is scheduled at compile time.

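4 micro-cores. Each is a small systolic array (32x32 FP16 MACs) optimized for EAGLE-style draft models. At 1 GHz, the four cores deliver ~4 TMAC/s (~8 TOPS) at one MAC per lane per cycle. A 250M-parameter draft model needs ~250M MACs per token, so reaching ~30 microseconds per token assumes INT4 operands packed two per lane; at one MAC per lane the figure is closer to ~60 microseconds.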

64 MB on-chip SRAM. The draft model is sized to fit entirely on-chip. EAGLE’s draft head for a 7B target is ~240M parameters = ~480 MB at FP16 and ~120 MB at INT4, so fitting in 64 MB requires roughly 2 bits per weight or a head of ~120M parameters at INT4. No HBM access. No cache misses. No stalls. This is the Groq playbook applied to a model small enough that a single chip can hold it: statically scheduled, deterministic VLIW execution where the compiler resolves all data movement at compile time.

Multi-token output. The 4 micro-cores run in parallel to produce multiple tree branches simultaneously — one core per branch at a tree fork — or pipeline 4 sequential draft steps with initiation interval of 1. The tree manager configures the execution pattern dynamically.

Target: sub-50 microseconds per draft token. The draft model runs at near-SRAM bandwidth: 64 MB at 40+ TB/s effective on-chip bandwidth, versus 3.35 TB/s HBM on an H100. For a bandwidth-bound draft model, this ~12x bandwidth advantage translates to up to ~12x faster draft generation. The draft phase is no longer the bottleneck: it is fast enough to be hidden behind the verifier’s execution.
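The sizing arithmetic behind the draft accelerator can be checked with a short back-of-envelope script. This is a sketch under stated assumptions: one MAC per parameter per generated token, full array utilization, and no storage for quantization scales. The helper names and the `macs_per_lane` parameter are hypothetical and introduced only for this illustration.

```python
def weight_footprint_mb(params, bits_per_weight):
    """Weight storage in MB (10^6 bytes), ignoring scales and zero-points."""
    return params * bits_per_weight / 8 / 1e6


def compute_time_us(params, cores, macs_per_core, clock_hz, macs_per_lane=1):
    """Time for one token, assuming one MAC per parameter and full utilization."""
    macs_per_second = cores * macs_per_core * macs_per_lane * clock_hz
    return params / macs_per_second * 1e6


def weight_read_time_us(footprint_mb, bandwidth_tb_s):
    """Time to stream every weight once at the given bandwidth."""
    return footprint_mb * 1e6 / (bandwidth_tb_s * 1e12) * 1e6


print(weight_footprint_mb(240e6, 16))              # FP16 footprint in MB
print(weight_footprint_mb(240e6, 4))               # INT4 footprint in MB
print(compute_time_us(250e6, 4, 32 * 32, 1e9))     # one MAC per lane per cycle
print(compute_time_us(250e6, 4, 32 * 32, 1e9, 2))  # two packed MACs per lane
print(weight_read_time_us(60, 40))                 # 60 MB from on-chip SRAM
print(weight_read_time_us(60, 3.35))               # 60 MB from H100-class HBM
```

Under these assumptions a 240M-parameter head occupies 480 MB at 16 bits and 120 MB at 4 bits, and a 250M-parameter model takes about 61 microseconds per token at one MAC per lane per cycle or about 31 microseconds at two. Reading 60 MB of weights takes about 1.5 microseconds at 40 TB/s, so with these numbers the draft step is compute-bound rather than bandwidth-bound.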

Silicon area: ~15-20 mm². SRAM dominates at ~13 mm² (64 MB at ~0.2 mm²/MB on 5nm). Compute is ~2-3 mm². This is a tiny fraction of the die — purpose-built silicon for a purpose-built problem.

Component 2: Verifier Engine

The verifier is the workhorse: 256 tensor cores running the large target model (7B-70B+) with two critical modifications that no GPU provides.

Native tree-attention mask generation. Standard attention uses a lower-triangular causal mask. Tree attention uses a mask where M[i][j]=1 iff node j is an ancestor of node i in the draft tree. This mask is irregular and changes every verification step. On a GPU, generating it in software wastes cycles; applying it through FlashAttention kernels designed for contiguous sequences wastes compute. The verifier engine includes a dedicated tree-mask FSM that reads the tree adjacency list from the token tree manager and produces mask bits at wire speed — the mask is ready before the first attention score is computed.
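The mask rule can be sketched in software as follows. This is an illustrative sketch, not the FSM itself: it assumes the tree is given as a parent-index array in topological order (each parent appears before its children), and it also sets the diagonal so that each node attends to itself, which is an assumption beyond the ancestor-only definition above. The function name `tree_attention_mask` is hypothetical.

```python
def tree_attention_mask(parent):
    """Build the N x N tree-attention mask from a parent-index array.

    parent[i] is the index of node i's parent, or -1 for a root node.
    Assumes topological order: parent[i] < i for every non-root node.
    """
    n = len(parent)
    rows = [0] * n  # each row is stored as an n-bit integer bitmask
    for i in range(n):
        inherited = rows[parent[i]] if parent[i] >= 0 else 0
        # A node sees everything its parent sees, plus itself (assumption).
        rows[i] = inherited | (1 << i)
    return [[(rows[i] >> j) & 1 for j in range(n)] for i in range(n)]


# Node 0 is the root, nodes 1 and 2 are its children, node 3 is a child of 1.
for row in tree_attention_mask([-1, 0, 0, 1]):
    print(row)
```

Each row is derived from its parent's row with a single OR, so the whole mask takes one step per node when a full row can be emitted at once.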

Variable-length batch engine. The number of tokens to verify changes every step. Sequoia trees range from 64 to 768 nodes. The tensor cores handle batches of 4 to 768 tokens with zero setup overhead. No padding. No wasted compute on masked positions. This is critical: on a GPU, a 64-token tree verification occupies the same resources as a 768-token one if the kernel is not dynamically shaped, and dynamic kernel reshaping adds dispatch overhead.

512 MB local SRAM. Holds activations and partial KV-cache for the currently executing layer. At 7B scale with 32 layers, activations per layer for 64 tokens at hidden dim 4096 = ~0.5 MB — trivially fits. For 70B models, weights stream from HBM but activations stay on-chip.

Silicon area: ~150-180 mm². Tensor cores: ~60 mm² (comparable to H100’s SM array scaled down). SRAM: ~100 mm² for 512 MB. This is the second-largest component after the KV-cache pool.

Component 3: Shared KV-Cache Memory Pool

This is the architectural centerpiece — PagedAttention implemented in silicon.

2-4 GB banked SRAM organized as a hardware-managed page pool. This is not software-managed cache. This is a hardware memory management unit with page tables, TLBs, copy-on-write, and commit/rollback semantics, purpose-built for the speculative decoding access pattern.

Hardware block table. A content-addressable memory (CAM) maps (sequence_id, layer, position) to physical SRAM bank addresses. It functions much like an OS page table but with single-cycle lookups. Pages are 4 KB: one token’s K and V for one layer at typical grouped-query dimensions (for example, 8 KV heads x 128 dims x 2 tensors x 2 bytes at FP16). A hardware TLB with 4K entries covers the hot working set.

Zero-copy dual-port access. Both draft accelerator and verifier engine have independent read ports into the KV-cache pool. The draft model reads KV-cache entries for its attention computation without copying them. The verifier reads the same entries. No data movement. No synchronization overhead. This eliminates the 10-30% overhead that GPU implementations pay for KV-cache context switching between draft and verify phases.

Copy-on-write for tree branches. When the token tree forks — multiple candidates at one position — the KV-cache for the shared prefix is not duplicated. A COW bit in the block table marks the page. Only when a branch extends its cache does a new physical page get allocated. For a 64-node tree with depth 6, this means the prefix KV-cache (potentially thousands of tokens of conversation history) is shared across all 64 verification paths with zero memory overhead.

Single-cycle commit/rollback. After verification, accepted branches have their KV-cache pages committed by flipping a status bit. Rejected branches have their pages freed by clearing block table entries. No memory zeroing, no garbage collection, no deallocation overhead. A 64-branch rejection completes in one cycle by clearing a 64-bit mask.
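The sharing, branching, and commit/rollback behavior described above can be modeled in software with reference-counted pages. This is a toy model under assumptions: one token per page, string branch identifiers, and a class name `KVPagePool` invented for this sketch. It illustrates the bookkeeping only and says nothing about the single-cycle timing.

```python
class KVPagePool:
    """Toy paged KV-cache with shared prefixes (one token per page)."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))
        self.refcount = [0] * num_pages
        self.tables = {}  # branch id -> list of physical page ids

    def _release(self, pages):
        for page in pages:
            self.refcount[page] -= 1
            if self.refcount[page] == 0:
                self.free.append(page)

    def append(self, branch):
        """Allocate one fresh page for a new token on this branch."""
        page = self.free.pop()
        self.refcount[page] = 1
        self.tables.setdefault(branch, []).append(page)
        return page

    def fork(self, parent, child):
        """Share every page of the parent with the child; no data is copied."""
        pages = list(self.tables[parent])
        for page in pages:
            self.refcount[page] += 1
        self.tables[child] = pages

    def rollback(self, branch):
        """Drop a rejected branch; pages only it referenced return to the pool."""
        self._release(self.tables.pop(branch))

    def commit(self, branch, into):
        """Make an accepted branch the new state of the committed sequence."""
        old = self.tables[into]
        self.tables[into] = self.tables.pop(branch)
        self._release(old)


pool = KVPagePool(16)
for _ in range(4):
    pool.append("seq")          # 4-token committed prefix
pool.fork("seq", "b0")
pool.fork("seq", "b1")          # two speculative branches, prefix shared
pool.append("b0"); pool.append("b0"); pool.append("b1")
pool.rollback("b1")             # rejected branch
pool.commit("b0", "seq")        # accepted branch
print(len(pool.tables["seq"]), len(pool.free))
```

Forking touches only page-table entries and reference counts, which is the property that lets the prefix be shared across branches without duplication.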

Silicon area: ~400-800 mm². This is the area-dominant component. 2 GB of SRAM at 5nm is ~400 mm². At 4 GB, the pool alone reaches ~800 mm², approaching the ~858 mm² single-exposure reticle limit, so the full 4 GB configuration would need a multi-die package. This is where the chip makes its fundamental tradeoff: trading die area for zero-latency transactional KV-cache access. At 4 GB, the pool supports ~32K context at 70B scale (FP8 KV) or ~128K context at 7B scale.
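The context-length figures follow from the per-token KV footprint. The sketch below shows the arithmetic under assumed model dimensions that are not given in the text: a Llama-2-70B-style layout with 80 layers, 8 KV heads, and head dimension 128. The function names are hypothetical.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of K plus V stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_context_tokens(pool_bytes, layers, kv_heads, head_dim, bytes_per_value):
    """Tokens that fit in the pool, ignoring page-table and ECC overhead."""
    return pool_bytes // kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)


GIB = 2 ** 30
# Assumed 70B-class dimensions with grouped-query attention, FP8 KV.
print(kv_bytes_per_token(80, 8, 128, 1))
print(max_context_tokens(4 * GIB, 80, 8, 128, 1))
print(max_context_tokens(1 * GIB, 80, 8, 128, 1))
# One layer of one token at FP16 with the same KV heads: the page granularity.
print(kv_bytes_per_token(1, 8, 128, 2))
```

With these assumed dimensions a 4 GB pool holds roughly 26K tokens of FP8 KV, the same order of magnitude as the ~32K quoted above, and a single layer of a single token at FP16 comes to exactly 4 KB. Capacity scales linearly with pool size and inversely with KV precision and KV head count.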

Component 4: Hardware Token Tree Manager

The “brain” that orchestrates draft-verify-accept cycles at hardware speed, without CPU intervention.

FSM-based tree builder. Takes draft model logits, applies top-K selection, builds the token tree by expanding high-confidence branches and pruning low-confidence ones. Implements EAGLE-2’s confidence-based dynamic tree construction and Sequoia’s DP-optimal topology algorithm. Configurable tree budget (N=4 to 1024 nodes) and maximum depth.
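Confidence-driven tree growth can be sketched as a best-first expansion over cumulative path probability. This is a simplified illustration in the spirit of the dynamic-tree idea described above, not EAGLE-2's or Sequoia's actual algorithm. The `draft_topk` callback (path tuple in, list of `(token, probability)` pairs out) and the function name are assumptions made for the example.

```python
import heapq


def build_draft_tree(draft_topk, budget, max_depth):
    """Grow a draft tree by always expanding the most probable pending path.

    Returns a list of (token, parent_index, path_probability) in the order
    nodes were added, so every parent precedes its children. A parent index
    of -1 means the node hangs directly off the committed context.
    """
    nodes = []
    heap = []  # entries: (-path_prob, tie_breaker, token, parent, depth, path)
    counter = 0
    for token, prob in draft_topk(()):
        heapq.heappush(heap, (-prob, counter, token, -1, 1, (token,)))
        counter += 1
    while heap and len(nodes) < budget:
        neg_prob, _, token, parent, depth, path = heapq.heappop(heap)
        nodes.append((token, parent, -neg_prob))
        index = len(nodes) - 1
        if depth < max_depth:
            for child, prob in draft_topk(path):
                # neg_prob is negative, so the product is -(path_prob * prob).
                heapq.heappush(
                    heap,
                    (neg_prob * prob, counter, child, index, depth + 1, path + (child,)),
                )
                counter += 1
    return nodes


def toy_draft(path):
    """Stand-in draft model: the same two candidates at every position."""
    return [("a", 0.7), ("b", 0.3)]


for node in build_draft_tree(toy_draft, budget=4, max_depth=3):
    print(node)
```

Because low-probability candidates stay in the heap and are never added once the budget is spent, pruning falls out of the node budget rather than needing a separate pass.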

Hardware attention mask generator. Reads the tree adjacency structure and produces the attention mask matrix for the verifier. For N nodes, it generates the NxN binary matrix in O(N) steps using a dedicated tree-walk unit that emits one N-bit row per step (each node’s row extends its parent’s row by one bit). The mask is ready before the verifier begins its forward pass.

Hardware acceptance sampler. Hardware RNG (an LFSR- or AES-based pseudorandom generator) plus parallel comparator array. For each position, computes acceptance probability min(1, p_verifier(x) / p_draft(x)) and compares against a uniform random sample. All N positions evaluated in parallel in a single cycle. The longest accepted path is identified by a priority encoder cascaded with the tree topology.
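The path-selection step can be sketched as follows, taking the per-node accept bits as given. This is an illustrative software analogue of the priority encoder, assuming a parent-index array in topological order; the function name is hypothetical. It does not model the resampling that sampling-based verification performs at the first rejected position.

```python
def longest_accepted_path(parent, accepted):
    """Return node indices of the deepest root-to-node path that is fully accepted.

    parent[i] is node i's parent index (-1 for a root); accepted[i] is the
    per-node accept bit. Assumes parent[i] < i for every non-root node.
    """
    n = len(parent)
    chain_ok = [False] * n  # node and all of its ancestors accepted
    depth = [0] * n
    best = -1
    for i in range(n):
        p = parent[i]
        chain_ok[i] = accepted[i] and (p < 0 or chain_ok[p])
        depth[i] = 1 if p < 0 else depth[p] + 1
        if chain_ok[i] and (best < 0 or depth[i] > depth[best]):
            best = i
    path = []
    while best >= 0:
        path.append(best)
        best = parent[best]
    return path[::-1]


parent = [-1, 0, 0, 1, 3]
print(longest_accepted_path(parent, [True, True, False, True, False]))
print(longest_accepted_path(parent, [True, False, True, True, True]))
```

In the second call nodes 3 and 4 carry accept bits but sit below a rejected ancestor, so they cannot be part of the output path.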

Dynamic tree topology optimizer. Implements Sequoia’s key insight: the optimal tree shape depends on the hardware’s compute-to-bandwidth ratio. On SpecDecode-1, the ratio is different from any GPU (the draft accelerator is extremely fast relative to the verifier), so the optimal trees are deeper and narrower. The topology optimizer runs a simplified version of Sequoia’s DP algorithm in hardware, updating tree shape at a configurable interval of verification cycles based on observed acceptance rates.

Silicon area: ~5-8 mm². Mostly control logic, small CAMs, and the RNG array. A tiny fraction of die area for a disproportionate impact on system performance.

Component 5: HBM3e Interface

For models larger than ~7B at FP8, the verifier streams weights from HBM.

8 HBM3e stacks. 128 GB capacity, ~8 TB/s aggregate bandwidth. This is the standard memory tier for large-model inference. Nothing exotic — just industry-standard HBM doing what HBM does.

Weight prefetch engine. While the draft accelerator is generating tokens (a window of ~200-500 microseconds), the verifier’s weight prefetch unit preloads the next layer’s weights from HBM into the 512 MB verifier SRAM. This hides HBM latency behind draft latency — the fundamental advantage of having a dedicated draft accelerator. On a GPU, the draft model and verifier share the same memory subsystem, so draft execution and weight prefetch compete for bandwidth. On SpecDecode-1, they are physically independent.

KV-cache overflow. For very long contexts (>32K tokens at 70B scale), KV-cache pages evict to HBM using the page table’s LRU policy with attention-score-weighted priority. Pages whose tokens receive high attention scores are retained on-chip; decaying-score pages spill to HBM with hardware FP8 compression on the spill path.

Performance Projections

Figure: Speculative Decoding Speedup: Software vs. Silicon. Speedup over vanilla autoregressive GPU decoding (higher = faster): Classic spec decode (Leviathan et al.; draft + verify on GPU) 2-3x; Medusa (multi-head, no draft model) 2.2-3.6x; EAGLE-2 (feature-level draft, dynamic context-aware draft trees) 3.05-4.26x; Sequoia on A100 (DP-optimal, HW-aware trees, on-chip) 4.04x; Sequoia offloaded (HW-aware trees, high BW ratio) 9.96x; SpecDecode-1 projected (purpose-built silicon, zero-copy KV, HW tree-attention) 5-10x.

SpecDecode-1 projected 5-10x is over vanilla autoregressive on GPU -- i.e., 2-4x over the best software spec-decode implementations running on general-purpose hardware. Sequoia 9.96x in offloaded setting (CMU/Together AI, 2024). All methods preserve exact target-model distribution via rejection sampling.

The speedup decomposes into six sources:

| GPU Bottleneck | SpecDecode-1 Solution | Estimated Gain |
| --- | --- | --- |
| Draft model contends for GPU SMs | Separate SRAM-only draft accelerator | 3-5x draft speed |
| KV-cache copied between draft/verifier | Zero-copy shared pool, dual read ports | Eliminates 10-30% overhead |
| Tree attention mask computed in software | Hardware mask generator at wire speed | 2-3x verification speed |
| Accept/reject round-trips through CPU | Hardware sampler, single-cycle decision | Eliminates 1-5 ms per step |
| Variable batch size wastes GPU resources | Native variable-length batch, no padding | 1.5-2x compute efficiency |
| Tree construction in Python runtime | Hardware tree builder FSM | Sub-microsecond tree builds |

Absolute throughput for 7B model. Draft at 50 microseconds/token x 10 tree nodes = 500 microseconds draft time. Verification of a 64-node tree in ~200 microseconds. Accept/commit in ~1 microsecond. Total per step: ~700 microseconds producing ~5 accepted tokens = ~140 microseconds per output token = ~7,100 tokens/second single-stream. Compare to H100 single-stream 7B: ~1,000-2,000 tokens/second.

Absolute throughput for 70B model. Draft at 50 microseconds/token overlapped with weight prefetch. Verification limited by HBM bandwidth at ~8 TB/s. Per step: ~3 ms producing ~5 tokens = ~600 microseconds per output token = ~1,600 tokens/second. Compare to H100 single-stream 70B: ~100-200 tokens/second. That is an 8-16x gain, the regime that Sequoia reaches only in the offloaded setting; SpecDecode-1 targets it natively.

Power efficiency. At 700W TDP (comparable to H100) and 7,100 tokens/sec for 7B: ~10 tokens/sec/watt, roughly 3.5-7x better than an H100 delivering ~1,000-2,000 tokens/sec at the same TDP.
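The throughput and efficiency projections reduce to a few lines of arithmetic, reproduced here so the inputs are explicit. The step times, accepted-token count, and H100 baselines are the figures quoted above. The 7B step time sums draft, verify, and commit without any overlap, which is the conservative reading. Function names are illustrative.

```python
def tokens_per_second(step_time_us, accepted_tokens_per_step):
    """Single-stream throughput from the time of one draft-verify-accept step."""
    return accepted_tokens_per_step / (step_time_us * 1e-6)


def tokens_per_second_per_watt(tok_per_s, tdp_watts):
    return tok_per_s / tdp_watts


# 7B: 10 draft tokens at 50 us each, 200 us verify, 1 us accept/commit.
step_7b_us = 50 * 10 + 200 + 1
tps_7b = tokens_per_second(step_7b_us, 5)
# 70B: ~3 ms per step, HBM-bound verification.
tps_70b = tokens_per_second(3000, 5)

efficiency = tokens_per_second_per_watt(tps_7b, 700)
h100_efficiency = [tokens_per_second_per_watt(t, 700) for t in (1000, 2000)]
ratios = [efficiency / h for h in h100_efficiency]

print(round(tps_7b), round(tps_70b))
print(round(efficiency, 1), [round(r, 1) for r in ratios])
```

The accepted-tokens-per-step value dominates every result here, so the projections move in direct proportion to the acceptance rate the workload actually achieves.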

The Draft-Verify Pipeline: Hiding Latency Behind Latency

The deepest architectural insight in SpecDecode-1 is that draft generation and verification can overlap in time. While the verifier checks tokens 1 through K, the draft accelerator is already generating token K+1, K+2, and beyond for the next speculation cycle. While the draft accelerator runs the next draft phase, the HBM prefetch engine preloads the verifier’s next layer weights. Three operations — draft, verify, and weight prefetch — execute concurrently across three physically independent subsystems.

Figure: Draft-Verify Pipeline: Three-Way Overlap. Timing diagram for 70B model, one draft-verify cycle producing ~5 accepted tokens in ~3 ms (time axis 0-3000 μs). Lanes: Draft Accelerator (Draft K=10, idle, Draft K=10); Verifier Engine (prefetch L1, then verify 64-node tree: 32 layers × FP8 matmul + tree-attention); HBM3e Prefetch (Prefetch L1-L4, stream weights L1-L32 @ 8 TB/s, Prefetch L1-L4); Tree Mgr + KV Commit (Build tree, Build tree).

Key insight: Draft generation (500 μs) overlaps with HBM weight prefetch. By the time the draft tree is ready, the first 4 transformer layers are already in verifier SRAM. Verification streams remaining weights from HBM while computing. The next draft cycle begins as soon as accept/commit completes (~1 cycle). Effective pipeline throughput: ~5 tokens / 3 ms = ~1,600 tok/s for 70B single-stream.
On a GPU, all four lanes compete for the same memory subsystem and SM scheduler. SpecDecode-1's physical separation of draft SRAM, verifier compute, HBM interface, and tree management enables true concurrent execution.

On a GPU, the draft model runs first, then the verifier runs, then the CPU performs acceptance sampling, then KV-cache is updated. Four sequential phases. On SpecDecode-1, the draft accelerator, verifier engine, HBM prefetch unit, and tree manager are physically independent subsystems operating concurrently. The draft model for cycle N+1 begins before the verifier for cycle N finishes. Weight prefetch for the next layer begins before the current layer’s compute completes. Tree construction overlaps with draft generation. The only serialization point is the accept/commit step, which completes in a single cycle.

This three-way overlap is the fundamental reason SpecDecode-1 achieves higher effective throughput than software speculative decoding on a GPU — it is not just faster at each step, it executes steps concurrently that a GPU must serialize.

Key Innovations vs. GPU

Hardware PagedAttention. vLLM’s PagedAttention is the single most impactful software optimization for KV-cache management. It virtualizes KV-cache into pages, enabling non-contiguous allocation, deduplication, and copy-on-write. SpecDecode-1 implements this in silicon: a hardware TLB + page table for KV-cache, with zero-copy sharing between draft and verifier, COW branching for tree speculation, and single-cycle commit/rollback. What vLLM does in software, through a Python block manager and custom CUDA kernels, the KV-cache pool does in dedicated hardware logic.

Tree-attention in silicon. The attention mask for tree-structured speculation is generated by a hardware FSM that reads the tree adjacency list and produces mask bits at wire speed. No kernel launch, no CPU intervention, no FlashAttention tiling overhead for irregular masks. The mask generation unit integrates directly with the tensor core array’s attention pipeline.

Draft-verify pipeline overlap. On a GPU, the draft model and verifier compete for the same SMs, the same memory controllers, and the same scheduler. They cannot run concurrently (short of complex multi-stream orchestration that still shares bandwidth). On SpecDecode-1, they are physically independent subsystems with dedicated memory paths. Draft generation and verification overlap in time.

Deterministic draft latency. The draft accelerator uses statically scheduled, deterministic execution — no GPU scheduler jitter, no cache miss variability, no kernel launch overhead. Every draft token takes exactly the same number of cycles. This is critical for pipeline scheduling: the verifier knows exactly when the draft tree will be ready, enabling precise weight prefetch timing.

Connection to ARIA

ARIA is the broader agentic inference chip proposed in the companion article — a general-purpose agentic accelerator with heterogeneous prefill/decode engines, a 256 MB SRAM, RISC-V control cluster, and hardware support for speculative decoding as one feature among many.

SpecDecode-1 is the speculative-decode-maximalist variant. ARIA includes spec decode as one feature; SpecDecode-1 makes it the feature. The differences are architectural priorities:

| Dimension | ARIA | SpecDecode-1 |
| --- | --- | --- |
| Primary target | Multi-agent orchestration (batch 8-16) | Single-stream latency (batch 1) |
| Draft model | Software on decode tiles | Dedicated SRAM-only hardware |
| KV-cache pool | 256 MB SRAM (shared with weights, activations) | 2-4 GB dedicated KV-cache SRAM |
| Tree attention | SFU-assisted software | Hardware mask generation + native tensor core support |
| Accept/reject | SFU + RISC-V | Dedicated hardware sampler |
| Tree construction | RISC-V control cluster | FSM-based hardware tree manager |
| HBM usage | Weights + KV overflow + batch activations | Primarily weight streaming |
| Die area | ~550-600 mm² | ~650-1136 mm² |

For workloads where single-stream latency is everything — coding agents waiting for the next line of code, real-time chat requiring sub-100ms per token, interactive tool-use loops — SpecDecode-1 wins. For workloads requiring concurrent multi-agent batching, model flexibility, and broader architectural support, ARIA wins.

The two designs share a key thesis: that the memory hierarchy is the binding constraint, and that hardware-managed KV-cache with transactional semantics is the right abstraction for inference. SpecDecode-1 just takes the KV-cache investment further, dedicating 2-4 GB of on-chip SRAM to it instead of ARIA’s 256 MB shared pool. For a broader treatment of the inference optimization stack that situates speculative decoding among quantization, MoE, and prefill/decode disaggregation, see the synthesis article.

Competitive Landscape

Groq LPU. The most relevant prior art. Groq’s 230 MB SRAM per chip and deterministic VLIW execution are exactly what the draft accelerator needs. But Groq has no tree-attention support, no hardware acceptance sampling, and no concept of draft-verify pipelining. Groq treats every token as equal; SpecDecode-1 treats draft tokens as cheap, speculative bets and verifier tokens as expensive, authoritative confirmations. Groq’s architecture could theoretically be adapted for draft model execution, but it would require a separate cluster of Groq chips for the verifier — the exact cluster-level disaggregation that SpecDecode-1 integrates on a single die.

Cerebras WSE-3. 44 GB of on-chip SRAM eliminates the memory wall for KV-cache. If Cerebras built a speculative decoding mode, they already have the memory substrate. They would need the tree-attention logic, acceptance sampling hardware, and a way to run a tiny draft model at deterministic latency on a small partition of the wafer while the rest runs the verifier. The WSE’s 900,000 cores and wafer-scale interconnect could theoretically do this, but it is not what the architecture was designed for.

SambaNova SN40L. Three-tier memory (SRAM + HBM + DDR5) maps naturally to draft SRAM / verifier SRAM / KV overflow. SambaNova has already demonstrated batched speculative decoding on the SN40L in production. But it remains a software implementation on reconfigurable hardware — not purpose-built silicon for the draft-verify pattern.

HADES (Yang et al., 2025). The first published end-to-end hardware accelerator specifically designed for speculative decoding. HADES addresses “the design of an LLM accelerator with hardware-level speculative decoding support, a concept not previously explored in existing literature.” It targets energy efficiency through a hardware-native draft/verify pipeline. SpecDecode-1 extends this concept with hardware PagedAttention, tree-attention mask generation, and a multi-gigabyte on-chip KV-cache pool.

Etched Sohu. Hardcodes the entire transformer inference pipeline into fixed-function silicon. Claims 500K+ tokens/sec on Llama 70B (unverified). But Sohu is optimized for standard autoregressive decoding — dense causal attention, regular batch sizes. It has no published speculative decoding support, no tree-attention mask generation, and no draft-verify pipelining. If Sohu’s throughput claims hold, the raw speed might make speculative decoding less necessary for batch throughput — but for single-stream latency, spec decode remains essential regardless of hardware speed.

Risks

Acceptance rates vary by task. Speculative decoding’s speedup depends on how well the draft model predicts the verifier’s output. Acceptance rates range from 50% (creative, high-entropy generation) to 90%+ (code completion, structured output). At 50% acceptance, the effective speedup drops from 5-10x to 2-3x. The chip’s economics are sensitive to workload mix. If the median workload sees 60% acceptance rather than 80%, the ROI case weakens substantially.
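The sensitivity to acceptance rate can be made concrete with the closed-form estimate from Leviathan et al. for a linear (non-tree) draft of k tokens. This is a simplified sketch: it assumes each drafted token is accepted independently with the same probability alpha, and it ignores the extra coverage that tree-structured drafts provide. The function name is illustrative.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verification pass for a chain draft.

    alpha is the per-token acceptance rate and k the draft length. The count
    includes the one token the verifier always contributes, so it ranges
    from 1 (nothing accepted) to k + 1 (everything accepted).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.6, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_step(alpha, 10), 2))
```

At a draft length of 10, moving from 80% to 50% acceptance cuts the expected yield from about 4.6 tokens per step to about 2, a drop of more than half, which is in line with the reduction described above.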

Distillation may reduce the need. If target models get cheaper to run directly — through better distillation, more aggressive quantization, or architectural improvements — the absolute latency of autoregressive decoding drops, reducing the value of speculative speedup. A 70B model that decodes at 500 tok/s on next-generation GPUs makes the 1,600 tok/s SpecDecode-1 advantage less compelling than a 70B model that decodes at 100 tok/s on current GPUs.

Architecture evolution. Speculative decoding is inherently tied to autoregressive generation. If the field moves to non-autoregressive architectures — diffusion-based language models, state-space models that can generate multiple tokens natively (as in ATLAS’s SSM-based imagination engine), or parallel decoding schemes — the entire speculative decode loop becomes irrelevant. The draft accelerator, tree manager, and acceptance sampler become dead silicon. Unlike ARIA, which retains programmable tiles that can run arbitrary architectures, SpecDecode-1 has significant fixed-function hardware that only serves speculative decoding.

Die size dominated by SRAM. 2-4 GB of on-chip SRAM at 5nm costs 400-800 mm², making it the single most expensive component on the die. SRAM bit-cells do not scale as aggressively as logic across process nodes, and SRAM yield is sensitive to defect density. A single uncorrected soft error in the KV-cache pool can corrupt inference output. ECC overhead adds ~12% to SRAM area. The cost-reduced 1 GB variant gives up half of the 2 GB configuration’s KV-cache capacity for ~200 mm² of die area savings, but this limits context length to ~8K at 70B scale (scaling the ~32K at 4 GB figure linearly), which is potentially insufficient for agentic workloads that accumulate long conversation histories.

The spec-decode maximalist bet. SpecDecode-1 bets that single-stream latency matters more than batch throughput for its target market. If inference providers prioritize cost-per-token (which favors high-batch throughput on GPUs) over latency-per-token (which favors spec-decode optimization), the market for a dedicated spec-decode chip may be smaller than projected. The chip’s sweet spot is narrow: coding agents, real-time chat, interactive tool use — workloads where a single user is waiting for a single model’s output. Multi-user serving at scale still favors batch-optimized hardware.

The Unifying Principle

Every component of SpecDecode-1 serves the same purpose: produce more accepted tokens per verification cycle. The draft accelerator generates candidates as fast as silicon allows. The tree manager optimizes tree topology to maximize expected accepted tokens per tree. The verifier engine processes tree-structured attention without wasting compute on masked positions. The KV-cache pool eliminates memory management overhead through hardware transactions. The HBM interface hides weight-streaming latency behind draft generation.

This is not a general-purpose inference chip that happens to support speculative decoding. This is silicon where every transistor is justified by the draft-tree-verify-accept loop. The draft accelerator exists only to feed the verifier. The KV-cache pool exists only to share state at zero cost. The tree manager exists only to orchestrate speculation without CPU intervention. The result is a chip projected to make a 7B model respond at 7,100 tokens/second and a 70B model at 1,600 tokens/second, not through brute-force memory bandwidth, but through the algorithmic leverage of speculating correctly.
