SpecDecode-1: A Speculative Decoding ASIC
Every LLM inference chip today treats speculative decoding as a software optimization running on general-purpose hardware. Draft models run on the same tensor cores designed for training-scale matrix multiplications. Tree-structured attention masks get shoe-horned into attention kernels optimized for dense lower-triangular causal patterns. Acceptance sampling round-trips through the CPU. KV-cache sharing between draft and verifier is a memory management nightmare implemented in Python. SpecDecode-1 treats speculative decoding as the primary design target — dedicated silicon where the draft-verify-accept loop is the fundamental operation, not an afterthought. Every transistor on this chip exists to serve one purpose: produce more accepted tokens per verification cycle, at lower latency, than any general-purpose accelerator can achieve.
Why Speculative Decoding Deserves Its Own Chip
Speculative decoding is one of the most effective techniques for reducing single-stream autoregressive latency without changing the model or losing output quality. A small draft model generates K candidate tokens; the large target model verifies all K in a single parallel forward pass; rejection sampling guarantees the output distribution is identical to the target model’s. Zero quality loss, multiple tokens per forward pass.
The software results are already impressive:
| Technique | Speedup | Mechanism |
|---|---|---|
| Classic speculative decode (Leviathan et al., 2023) | 2-3x | Small draft model + parallel verify |
| Medusa (multi-head prediction) | 2.2-3.6x | Multiple prediction heads, no draft model |
| EAGLE-2 (dynamic draft trees) | 3.05-4.26x | Context-aware, runtime-configurable tree topologies |
| Sequoia (hardware-aware trees) | 4.04x (A100), up to 9.96x (offloaded) | DP-optimal tree topology per hardware profile |
But these are software speedups on hardware not designed for the pattern. Three structural mismatches bleed performance on GPUs:
The draft model wastes GPU cycles. An EAGLE-style draft head for a 7B target model has ~240M parameters. At INT4 quantization, the entire model fits in ~60 MB. An H100 has 80 GB of HBM and 989 TFLOPS of tensor core capacity. Running a 240M model on hardware designed for trillion-parameter training is like using a cargo ship to deliver a letter — the overhead of launching GPU kernels, scheduling SMs, and managing memory contexts exceeds the actual computation. The draft model needs less than 1% of the GPU’s compute, but it pays 100% of the dispatch and scheduling tax.
Tree verification wastes GPU cycles. Standard causal attention uses a lower-triangular mask — regular, predictable, optimized to within an inch of its life by FlashAttention. Tree-structured speculative attention uses an irregular mask where each node attends only to its ancestors in the tree. This mask changes every verification step. Generating it in software burns cycles. Applying it to attention kernels designed for dense masks wastes compute on masked-out positions. FlashAttention’s tiling strategy assumes contiguous KV sequences; tree branches are non-contiguous.
KV-cache sharing is a memory management nightmare. The draft model and the verifier operate on the same sequence but maintain separate KV-cache contexts. When the draft model forks into a tree of candidates, each branch needs its own KV-cache extension — but the prefix is shared. On a GPU, this means either duplicating the shared prefix (wasting memory) or implementing complex pointer arithmetic through vLLM’s PagedAttention (adding software overhead per verification cycle). When branches get rejected, their KV-cache must be freed. When they’re accepted, they must be committed. This commit/rollback pattern is a transactional memory problem implemented in Python on hardware with no transactional memory support.
The fundamental insight: speculative decoding is not a single workload. It is three different workloads with three different hardware profiles that must execute in a tightly coordinated pipeline:
- Drafting — tiny model, extreme latency sensitivity, wants SRAM-only deterministic execution (Groq-like)
- Verification — large model, irregular tree-attention, wants high memory bandwidth with flexible batch handling (GPU-like but with tree-mask support)
- Orchestration — tree construction, acceptance sampling, KV-cache management, wants transactional memory semantics (database-like)
No general-purpose chip is optimized for all three simultaneously. SpecDecode-1 is.
The Speculative Decode Compute Pattern
Before the architecture, the workload. Understanding what hardware needs to optimize requires tracing one complete draft-verify-accept cycle.
Draft phase. A tiny model (100M-1B parameters) autoregressively generates a tree of K candidate tokens. For EAGLE-2, this means 3 forward passes through a single transformer layer + FC head to produce a 10-token draft tree. For Sequoia, trees can reach 64-768 nodes depending on hardware. The draft model must be fast (sub-50 microseconds per token) because draft latency directly gates end-to-end throughput. Every microsecond of draft jitter wastes verifier capacity. The draft model’s KV-cache is tiny (a few MB) but must be read without off-chip stalls. The compute is trivial (a few GFLOPs per token at most) but must execute with deterministic timing.
Verify phase. The large target model runs a single forward pass over the entire draft tree. This is a batch of K candidates with tree-structured attention — each node attends only to its ancestors. For a 7B model with a 64-node tree, the verification pass requires processing 64 tokens through 32 transformer layers. The compute is moderate (comparable to a 64-token prefill) but the attention pattern is irregular. The verifier must stream weights from HBM for models above ~7B parameters, making memory bandwidth the binding constraint.
Accept/reject. At each position in the tree, compare the draft model’s probability distribution against the verifier’s. Accept with probability min(1, p_verifier(x) / p_draft(x)). Reject and resample from an adjusted distribution otherwise. Walk the tree from root to leaves, accepting tokens along the longest valid path. This is a sampling + comparison operation that must be atomic — it cannot be split across kernel launches. On a GPU, this round-trips through the CPU and Python runtime, adding 1-5 ms per cycle. On dedicated hardware, it completes in clock cycles.
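The accept/reject rule for a single drafted token can be sketched in a few lines. This is a minimal software illustration of standard speculative sampling, not the chip's implementation; the function name `accept_or_resample` and the list-based distributions are assumptions made for the example.

```python
import random


def accept_or_resample(token, p_verifier, p_draft, rng=random):
    """Return (accepted, token) for one drafted token.

    p_verifier and p_draft are probability lists over the same vocabulary.
    `token` is assumed to have been sampled from p_draft, so p_draft[token] > 0.
    """
    ratio = p_verifier[token] / p_draft[token]
    if rng.random() < min(1.0, ratio):
        return True, token
    # Rejected: resample from the normalized positive part of
    # (p_verifier - p_draft), the "adjusted distribution".
    residual = [max(0.0, pv - pd) for pv, pd in zip(p_verifier, p_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    new_token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return False, new_token
```

When the two distributions agree on a token, the ratio is 1 and the token is always accepted; the resampling branch is what keeps the overall output distribution equal to the verifier's.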
KV-cache management. The accepted path’s KV-cache entries become permanent. All rejected branch entries must be freed. The verifier’s newly computed KV entries for the accepted prefix must be committed. The draft model’s KV-cache must be reset to the new tip position. This is a transactional commit/rollback with zero-copy sharing between draft and verifier — a hardware page table operation, not a software memory management problem.
SpecDecode-1 Architecture
Five purpose-built subsystems on a single die.
| Component | Area (mm²) | Function |
|---|---|---|
| Draft Accelerator | 15-20 | 64 MB SRAM + 4 micro-cores, Groq-style deterministic pipeline |
| Verifier Engine | 150-180 | 256 tensor cores + 512 MB SRAM, native tree-attention |
| Shared KV-Cache Pool | 400-800 | 2-4 GB banked SRAM, hardware page tables, zero-copy dual-port |
| Token Tree Manager | 5-8 | FSM tree builder, HW RNG sampler, mask generator |
| HBM3e Interface | 50-80 | 8 stacks, 128 GB, ~8 TB/s |
| I/O, clocking, misc | 30-50 | PCIe, power management |
| Total | 650-1138 | Comparable to H100 (814 mm²) / B200 (~1000 mm²) |
The die is dominated by SRAM for the KV-cache pool. This is the deliberate architectural bet: trade die area for zero-latency KV-cache access. A cost-reduced variant with 1 GB on-chip KV-cache (~200 mm²) brings the total to ~450-550 mm² — well within the reticle limit and very manufacturable.
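The area budget is simple enough to check by summing the per-component ranges. The sketch below copies the ranges from the table; the helper names `total_area` and `with_kv_pool` are illustrative, and the 200 mm² figure for the cost-reduced pool is the estimate quoted in the paragraph above.

```python
# Per-component area ranges in mm^2 (low, high), copied from the table.
AREA_MM2 = {
    "Draft Accelerator": (15, 20),
    "Verifier Engine": (150, 180),
    "Shared KV-Cache Pool": (400, 800),
    "Token Tree Manager": (5, 8),
    "HBM3e Interface": (50, 80),
    "I/O, clocking, misc": (30, 50),
}


def total_area(areas):
    """Sum the low and high ends of every component range."""
    low = sum(lo for lo, _ in areas.values())
    high = sum(hi for _, hi in areas.values())
    return low, high


def with_kv_pool(areas, kv_area_mm2):
    """Return a copy of the budget with a fixed-size KV-cache pool."""
    variant = dict(areas)
    variant["Shared KV-Cache Pool"] = (kv_area_mm2, kv_area_mm2)
    return variant


full_die = total_area(AREA_MM2)
cost_reduced = total_area(with_kv_pool(AREA_MM2, 200))
print(full_die, cost_reduced)
```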
Component 1: Draft Accelerator
The draft accelerator exists for one reason: run a draft model at absolute minimum latency with absolute zero jitter. Not low latency. Zero jitter. Deterministic, cycle-accurate execution where every memory access is scheduled at compile time.
4 micro-cores. Each is a small systolic array (32x32 FP16 MACs) optimized for EAGLE-style draft models. At 1 GHz with INT4 quantization, the four cores deliver ~4 TOPS INT4, enough for a 250M-parameter draft model in ~30 microseconds per token.
64 MB on-chip SRAM. The entire draft model fits on-chip. EAGLE’s draft head for a 7B target is ~240M parameters = ~480 MB at FP16, but at INT4 = ~60 MB. No HBM access. No cache misses. No stalls. This is the Groq playbook applied to a model small enough that a single chip can hold it: statically scheduled, deterministic VLIW execution where the compiler resolves all data movement at compile time.
Multi-token output. The 4 micro-cores run in parallel to produce multiple tree branches simultaneously — one core per branch at a tree fork — or pipeline 4 sequential draft steps with initiation interval of 1. The tree manager configures the execution pattern dynamically.
Target: sub-50 microseconds per draft token. The draft model runs at near-SRAM bandwidth: 64 MB at 40+ TB/s effective on-chip bandwidth, versus 3.35 TB/s HBM on an H100. For a bandwidth-bound draft model, this ~12x bandwidth advantage translates into up to ~12x faster draft generation. The draft phase is no longer the bottleneck: it is fast enough to be hidden behind the verifier’s execution.
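The bandwidth comparison can be made concrete with a back-of-the-envelope calculation of how long one full pass over the draft weights takes. This is a rough sketch that ignores compute time and assumes every weight is read once per token; the ~60 MB model size and both bandwidth figures are the ones quoted in this article, and the function name is illustrative.

```python
def weight_read_time_us(model_bytes, bandwidth_bytes_per_s):
    """Time to stream all weights once, in microseconds (bandwidth-bound only)."""
    return model_bytes / bandwidth_bytes_per_s * 1e6


DRAFT_MODEL_BYTES = 60e6   # ~60 MB draft model, as quoted in the text
SRAM_BW = 40e12            # 40 TB/s effective on-chip bandwidth
HBM_BW = 3.35e12           # 3.35 TB/s HBM bandwidth on an H100

sram_us = weight_read_time_us(DRAFT_MODEL_BYTES, SRAM_BW)
hbm_us = weight_read_time_us(DRAFT_MODEL_BYTES, HBM_BW)
print(f"SRAM: {sram_us:.1f} us, HBM: {hbm_us:.1f} us, ratio: {hbm_us / sram_us:.1f}x")
```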
Silicon area: ~15-20 mm². SRAM dominates at ~13 mm² (64 MB at ~0.2 mm²/MB on 5nm). Compute is ~2-3 mm². This is a tiny fraction of the die — purpose-built silicon for a purpose-built problem.
Component 2: Verifier Engine
The verifier is the workhorse: 256 tensor cores running the large target model (7B-70B+) with two critical modifications that no GPU provides.
Native tree-attention mask generation. Standard attention uses a lower-triangular causal mask. Tree attention uses a mask where M[i][j]=1 iff node j is an ancestor of node i in the draft tree. This mask is irregular and changes every verification step. On a GPU, generating it in software wastes cycles; applying it through FlashAttention kernels designed for contiguous sequences wastes compute. The verifier engine includes a dedicated tree-mask FSM that reads the tree adjacency list from the token tree manager and produces mask bits at wire speed — the mask is ready before the first attention score is computed.
Variable-length batch engine. The number of tokens to verify changes every step. Sequoia trees range from 64 to 768 nodes. The tensor cores handle batches of 4 to 768 tokens with zero setup overhead. No padding. No wasted compute on masked positions. This is critical: on a GPU, a 64-token tree verification occupies the same resources as a 768-token one if the kernel is not dynamically shaped, and dynamic kernel reshaping adds dispatch overhead.
512 MB local SRAM. Holds activations and partial KV-cache for the currently executing layer. At 7B scale with 32 layers, activations per layer for 64 tokens at hidden dim 4096 = ~0.5 MB — trivially fits. For 70B models, weights stream from HBM but activations stay on-chip.
Silicon area: ~150-180 mm². Tensor cores: ~60 mm² (comparable to H100’s SM array scaled down). SRAM: ~100 mm² for 512 MB. This is the second-largest component after the KV-cache pool.
Component 3: Shared KV-Cache Memory Pool
This is the architectural centerpiece — PagedAttention implemented in silicon.
2-4 GB banked SRAM organized as a hardware-managed page pool. This is not software-managed cache. This is a hardware memory management unit with page tables, TLBs, copy-on-write, and commit/rollback semantics, purpose-built for the speculative decoding access pattern.
Hardware block table. A content-addressable memory (CAM) maps (sequence_id, layer, position) to physical SRAM bank addresses. Functions exactly like an OS page table but with single-cycle lookups. 4 KB pages — one token’s KV for typical model dimensions. A hardware TLB with 4K entries covers the hot working set.
Zero-copy dual-port access. Both draft accelerator and verifier engine have independent read ports into the KV-cache pool. The draft model reads KV-cache entries for its attention computation without copying them. The verifier reads the same entries. No data movement. No synchronization overhead. This eliminates the 10-30% overhead that GPU implementations pay for KV-cache context switching between draft and verify phases.
Copy-on-write for tree branches. When the token tree forks (multiple candidates at one position), the KV-cache for the shared prefix is not duplicated. A COW bit in the block table marks the page. Only when a branch extends its cache does a new physical page get allocated. For a 64-node tree with depth 6, this means the prefix KV-cache (potentially thousands of tokens of conversation history) is shared across all 64 tree nodes with zero memory overhead.
Single-cycle commit/rollback. After verification, accepted branches have their KV-cache pages committed by flipping a status bit. Rejected branches have their pages freed by clearing block table entries. No memory zeroing, no garbage collection, no deallocation overhead. A 64-branch rejection completes in one cycle by clearing a 64-bit mask.
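The block table, prefix sharing, and commit/rollback semantics described above can be modeled in software to make the bookkeeping explicit. The sketch below is a toy model under simplifying assumptions: one page per token, a key of `(branch_id, position)` instead of the full `(sequence_id, layer, position)` tuple, and a hypothetical class name `KVPagePool`. It says nothing about how the hardware achieves single-cycle behavior.

```python
class KVPagePool:
    """Toy software model of a paged KV-cache with shared prefixes."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.table = {}         # (branch_id, position) -> physical page
        self.refcount = {}      # physical page -> number of table entries
        self.committed = set()  # table keys whose status bit is set

    def append(self, branch_id, position):
        """Allocate a fresh page when a branch extends its cache."""
        page = self.free_pages.pop()
        self.table[(branch_id, position)] = page
        self.refcount[page] = 1
        return page

    def fork(self, src, dst):
        """Map dst onto src's pages; no KV data is copied."""
        for (branch, position), page in list(self.table.items()):
            if branch == src:
                self.table[(dst, position)] = page
                self.refcount[page] += 1

    def commit(self, branch_id):
        """Flip the status bit on every entry of the accepted branch."""
        self.committed.update(k for k in self.table if k[0] == branch_id)

    def rollback(self, branch_id):
        """Drop a rejected branch; pages are freed only when unshared."""
        for key in [k for k in self.table if k[0] == branch_id]:
            page = self.table.pop(key)
            self.committed.discard(key)
            self.refcount[page] -= 1
            if self.refcount[page] == 0:
                del self.refcount[page]
                self.free_pages.append(page)
```

Forking two branches from a three-token prefix and extending each by one token consumes five pages in total; rolling back one branch returns exactly one page, because the prefix pages are still referenced by the surviving branch.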
Silicon area: ~400-800 mm². This is the area-dominant component. 2 GB of SRAM at 5nm is ~400 mm². At 4 GB, it pushes ~800 mm² — approaching the reticle limit. This is where the chip makes its fundamental tradeoff: trading die area for zero-latency transactional KV-cache access. At 4 GB, the pool supports ~32K context at 70B scale (FP8 KV) or ~128K context at 7B scale.
Component 4: Hardware Token Tree Manager
The “brain” that orchestrates draft-verify-accept cycles at hardware speed, without CPU intervention.
FSM-based tree builder. Takes draft model logits, applies top-K selection, builds the token tree by expanding high-confidence branches and pruning low-confidence ones. Implements EAGLE-2’s confidence-based dynamic tree construction and Sequoia’s DP-optimal topology algorithm. Configurable tree budget (N=4 to 1024 nodes) and maximum depth.
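One way to picture confidence-based tree construction is a best-first expansion that always grows the candidate with the highest cumulative draft probability until the node budget is spent. The sketch below is a simplified illustration, not EAGLE-2's or Sequoia's actual algorithm; `draft_topk` is a hypothetical callback standing in for the draft model, returning `(token, probability)` pairs for a given prefix.

```python
import heapq
import itertools


def build_draft_tree(draft_topk, budget, max_depth):
    """Greedy best-first draft tree.

    Returns a list of (token, parent_index, cumulative_prob); parent_index
    is -1 for children of the last committed token. Parents always precede
    their children in the list.
    """
    nodes = []
    frontier = []            # min-heap keyed on negative cumulative probability
    tie = itertools.count()  # unique tie-breaker so tuples never compare prefixes

    def push_children(prefix, parent_index, parent_prob, depth):
        for token, prob in draft_topk(prefix):
            entry = (-(parent_prob * prob), next(tie), token,
                     parent_index, prefix + (token,), depth)
            heapq.heappush(frontier, entry)

    push_children((), -1, 1.0, 1)
    while frontier and len(nodes) < budget:
        neg_prob, _, token, parent_index, prefix, depth = heapq.heappop(frontier)
        nodes.append((token, parent_index, -neg_prob))
        if depth < max_depth:
            push_children(prefix, len(nodes) - 1, -neg_prob, depth + 1)
    return nodes
```

Because low-probability candidates stay on the frontier and are never popped once the budget runs out, pruning falls out of the expansion order rather than needing a separate pass.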
Hardware attention mask generator. Reads the tree adjacency structure and produces the attention mask matrix for the verifier. For N nodes, generates an NxN binary matrix in O(N) time using a dedicated tree-walk unit. The mask is ready before the verifier begins its forward pass.
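A linear-time construction is possible because each node's mask row is its parent's row plus one extra bit. The sketch below shows that recurrence in software, using Python integers as bitsets. It assumes nodes are stored in topological order (`parent[i] < i`) and that each node also attends to itself, which is a common convention but an assumption here; the function names are illustrative.

```python
def tree_attention_mask(parent):
    """Build tree-attention mask rows as bitsets.

    parent[i] is the index of node i's parent, or -1 for a root.
    Bit j of the returned mask[i] is set iff node i may attend to node j
    (its ancestors, plus itself by assumption).
    """
    mask = []
    for i, p in enumerate(parent):
        row = 1 << i           # the node's own position
        if p >= 0:
            row |= mask[p]     # inherit every ancestor from the parent's row
        mask.append(row)
    return mask


def to_matrix(mask):
    """Expand bitset rows into an N x N list of 0/1 values."""
    n = len(mask)
    return [[(row >> j) & 1 for j in range(n)] for row in mask]


# Example tree: node 0 is the root, nodes 1 and 2 are its children,
# nodes 3 and 4 hang off node 1, and node 5 hangs off node 2.
for row in to_matrix(tree_attention_mask([-1, 0, 0, 1, 1, 2])):
    print(row)
```

Each row costs one OR against the parent's row, so N nodes take N steps even though the full matrix has N² bits.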
Hardware acceptance sampler. Hardware RNG (an LFSR-based PRNG or an AES-based DRBG seeded from a TRNG) plus parallel comparator array. For each position, computes acceptance probability min(1, p_verifier(x) / p_draft(x)) and compares against a uniform random sample. All N positions evaluated in parallel in a single cycle. The longest accepted path is identified by a priority encoder cascaded with the tree topology.
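Once every node has an accept/reject flag, selecting the longest accepted path is a single pass over the tree. The sketch below follows the simplified description above, in which each node is tested independently; full tree-based speculative sampling also adjusts the distribution between sibling candidates, which is omitted here. The function name and the parent-array representation are assumptions for the example.

```python
def longest_accepted_path(parent, accepted):
    """Return node indices of the deepest root-to-node path that is fully accepted.

    parent[i] is node i's parent index (-1 for a root) with parent[i] < i.
    accepted[i] is the per-node outcome of the acceptance test.
    """
    n = len(parent)
    depth = [0] * n  # accepted-path length ending at i, or 0 if the path is broken
    best = -1
    for i in range(n):
        p = parent[i]
        if accepted[i] and (p < 0 or depth[p] > 0):
            depth[i] = 1 + (depth[p] if p >= 0 else 0)
            if best < 0 or depth[i] > depth[best]:
                best = i
    path = []
    node = best
    while node >= 0:   # walk back from the deepest accepted node to its root
        path.append(node)
        node = parent[node]
    return path[::-1]
```

A node whose own test passes still contributes nothing if any ancestor was rejected, which is why the depth of the parent gates the depth of the child.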
Dynamic tree topology optimizer. Implements Sequoia’s key insight: the optimal tree shape depends on the hardware’s compute-to-bandwidth ratio. On SpecDecode-1, the ratio is different from any GPU because the draft accelerator is extremely fast relative to the verifier, so the optimal trees are deeper and narrower. The topology optimizer runs a simplified version of Sequoia’s DP algorithm in hardware, updating tree shape at a configurable interval of verification cycles based on observed acceptance rates.
Silicon area: ~5-8 mm². Mostly control logic, small CAMs, and the RNG array. A tiny fraction of die area for a disproportionate impact on system performance.
Component 5: HBM3e Interface
For models larger than ~7B at FP8, the verifier streams weights from HBM.
8 HBM3e stacks. 128 GB capacity, ~8 TB/s aggregate bandwidth. This is the standard memory tier for large-model inference. Nothing exotic — just industry-standard HBM doing what HBM does.
Weight prefetch engine. While the draft accelerator is generating tokens (a window of ~200-500 microseconds), the verifier’s weight prefetch unit preloads the next layer’s weights from HBM into the 512 MB verifier SRAM. This hides HBM latency behind draft latency — the fundamental advantage of having a dedicated draft accelerator. On a GPU, the draft model and verifier share the same memory subsystem, so draft execution and weight prefetch compete for bandwidth. On SpecDecode-1, they are physically independent.
KV-cache overflow. For very long contexts (>32K tokens at 70B scale), KV-cache pages evict to HBM using the page table’s LRU policy with attention-score-weighted priority. Pages whose tokens receive high attention scores are retained on-chip; decaying-score pages spill to HBM with hardware FP8 compression on the spill path.
Performance Projections
[Figure: Speculative Decoding Speedup: Software vs. Silicon. Recoverable labels: Spec Decode, (on A100), (offloaded), (projected).]
The speedup decomposes into six sources:
| GPU Bottleneck | SpecDecode-1 Solution | Estimated Gain |
|---|---|---|
| Draft model contends for GPU SMs | Separate SRAM-only draft accelerator | 3-5x draft speed |
| KV-cache copied between draft/verifier | Zero-copy shared pool, dual read ports | Eliminates 10-30% overhead |
| Tree attention mask computed in software | Hardware mask generator at wire speed | 2-3x verification speed |
| Accept/reject round-trips through CPU | Hardware sampler, single-cycle decision | Eliminates 1-5 ms per step |
| Variable batch size wastes GPU resources | Native variable-length batch, no padding | 1.5-2x compute efficiency |
| Tree construction in Python runtime | Hardware tree builder FSM | Sub-microsecond tree builds |
Absolute throughput for 7B model. Draft at 50 microseconds/token x 10 tree nodes = 500 microseconds draft time. Verification of a 64-node tree in ~200 microseconds. Accept/commit in ~1 microsecond. Total per step: ~700 microseconds producing ~5 accepted tokens = ~140 microseconds per output token = ~7,100 tokens/second single-stream. Compare to H100 single-stream 7B: ~1,000-2,000 tokens/second.
Absolute throughput for 70B model. Draft at 50 microseconds/token overlapped with weight prefetch. Verification limited by HBM bandwidth at ~8 TB/s. Per step: ~3 ms producing ~5 tokens = ~600 microseconds per output token = ~1,600 tokens/second. Compare to H100 single-stream 70B: ~100-200 tokens/second. This is the 8-10x regime that Sequoia achieves in the offloaded setting, but SpecDecode-1 achieves it natively.
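The throughput figures follow directly from step latency and accepted tokens per step. The sketch below reproduces that arithmetic using only the estimates quoted above; the function name is illustrative and the inputs are projections, not measurements.

```python
def tokens_per_second(step_latency_us, accepted_tokens):
    """Single-stream throughput for one draft-verify-accept step."""
    return accepted_tokens / (step_latency_us * 1e-6)


# 7B: 10 draft nodes at 50 us each, ~200 us verification, ~1 us accept/commit.
step_7b_us = 50 * 10 + 200 + 1
# 70B: ~3 ms per step, dominated by weight streaming from HBM.
step_70b_us = 3000

print(f"7B:  {tokens_per_second(step_7b_us, 5):,.0f} tokens/s")
print(f"70B: {tokens_per_second(step_70b_us, 5):,.0f} tokens/s")
```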
Power efficiency. At 700W TDP (comparable to H100) and 7,100 tokens/sec for 7B: ~10 tokens/sec/watt, roughly 3.5-7x better than an H100 delivering 1,000-2,000 tokens/sec at the same power.
The Draft-Verify Pipeline: Hiding Latency Behind Latency
The deepest architectural insight in SpecDecode-1 is that draft generation and verification can overlap in time. While the verifier checks tokens 1 through K, the draft accelerator is already generating token K+1, K+2, and beyond for the next speculation cycle. While the draft accelerator runs the next draft phase, the HBM prefetch engine preloads the verifier’s next layer weights. Three operations — draft, verify, and weight prefetch — execute concurrently across three physically independent subsystems.
[Figure: Draft-Verify Pipeline: Three-Way Overlap. Recoverable lane labels: Accelerator, Engine, Prefetch, + KV Commit.]
On a GPU, the draft model runs first, then the verifier runs, then the CPU performs acceptance sampling, then KV-cache is updated. Four sequential phases. On SpecDecode-1, the draft accelerator, verifier engine, HBM prefetch unit, and tree manager are physically independent subsystems operating concurrently. The draft model for cycle N+1 begins before the verifier for cycle N finishes. Weight prefetch for the next layer begins before the current layer’s compute completes. Tree construction overlaps with draft generation. The only serialization point is the accept/commit step, which completes in a single cycle.
This three-way overlap is the fundamental reason SpecDecode-1 achieves higher effective throughput than software speculative decoding on a GPU — it is not just faster at each step, it executes steps concurrently that a GPU must serialize.
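The benefit of overlap can be estimated with a simple steady-state model: serialized phases add, while overlapped phases cost only as much as the slowest one, plus the accept/commit step. This is a rough sketch under the assumption that drafting for the next cycle fully overlaps with verification of the current one and that no draft work is discarded. The example numbers reuse the 7B estimates from the performance section.

```python
def serialized_step_us(draft_us, verify_us, commit_us):
    """Step time when draft, verify, and commit run back to back."""
    return draft_us + verify_us + commit_us


def overlapped_step_us(draft_us, verify_us, commit_us):
    """Steady-state step time when draft and verify run concurrently.

    Only the accept/commit step is serialized in this simplified model.
    """
    return max(draft_us, verify_us) + commit_us


draft_us, verify_us, commit_us = 500, 200, 1
print(serialized_step_us(draft_us, verify_us, commit_us),
      overlapped_step_us(draft_us, verify_us, commit_us))
```

In this model the gain from overlap is bounded by the shorter of the two phases, so the pipeline pays off most when draft and verify times are well balanced.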
Key Innovations vs. GPU
Hardware PagedAttention. vLLM’s PagedAttention is the single most impactful software optimization for KV-cache management. It virtualizes KV-cache into pages, enabling non-contiguous allocation, deduplication, and copy-on-write. SpecDecode-1 implements this in silicon: a hardware TLB + page table for KV-cache, with zero-copy sharing between draft and verifier, COW branching for tree speculation, and single-cycle commit/rollback. What vLLM does with a software block manager and custom attention kernels, the KV-cache pool does in dedicated hardware logic.
Tree-attention in silicon. The attention mask for tree-structured speculation is generated by a hardware FSM that reads the tree adjacency list and produces mask bits at wire speed. No kernel launch, no CPU intervention, no FlashAttention tiling overhead for irregular masks. The mask generation unit integrates directly with the tensor core array’s attention pipeline.
Draft-verify pipeline overlap. On a GPU, the draft model and verifier compete for the same SMs, the same memory controllers, and the same scheduler. They cannot run concurrently (short of complex multi-stream orchestration that still shares bandwidth). On SpecDecode-1, they are physically independent subsystems with dedicated memory paths. Draft generation and verification overlap in time.
Deterministic draft latency. The draft accelerator uses statically scheduled, deterministic execution — no GPU scheduler jitter, no cache miss variability, no kernel launch overhead. Every draft token takes exactly the same number of cycles. This is critical for pipeline scheduling: the verifier knows exactly when the draft tree will be ready, enabling precise weight prefetch timing.
Connection to ARIA
ARIA is the broader agentic inference chip proposed in the companion article — a general-purpose agentic accelerator with heterogeneous prefill/decode engines, a 256 MB SRAM, RISC-V control cluster, and hardware support for speculative decoding as one feature among many.
SpecDecode-1 is the speculative-decode-maximalist variant. ARIA includes spec decode as one feature; SpecDecode-1 makes it the feature. The differences are architectural priorities:
| Dimension | ARIA | SpecDecode-1 |
|---|---|---|
| Primary target | Multi-agent orchestration (batch 8-16) | Single-stream latency (batch 1) |
| Draft model | Software on decode tiles | Dedicated SRAM-only hardware |
| KV-cache pool | 256 MB SRAM (shared with weights, activations) | 2-4 GB dedicated KV-cache SRAM |
| Tree attention | SFU-assisted software | Hardware mask generation + native tensor core support |
| Accept/reject | SFU + RISC-V | Dedicated hardware sampler |
| Tree construction | RISC-V control cluster | FSM-based hardware tree manager |
| HBM usage | Weights + KV overflow + batch activations | Primarily weight streaming |
| Die area | ~550-600 mm² | ~650-1138 mm² |
For workloads where single-stream latency is everything — coding agents waiting for the next line of code, real-time chat requiring sub-100ms per token, interactive tool-use loops — SpecDecode-1 wins. For workloads requiring concurrent multi-agent batching, model flexibility, and broader architectural support, ARIA wins.
Both designs share a key thesis: that the memory hierarchy is the binding constraint, and that hardware-managed KV-cache with transactional semantics is the right abstraction for inference. SpecDecode-1 just takes the KV-cache investment further, dedicating 2-4 GB of on-chip SRAM to it instead of ARIA’s 256 MB shared pool. For a broader treatment of the inference optimization stack that situates speculative decoding among quantization, MoE, and prefill/decode disaggregation, see the synthesis article.
Competitive Landscape
Groq LPU. The most relevant prior art. Groq’s 230 MB SRAM per chip and deterministic VLIW execution are exactly what the draft accelerator needs. But Groq has no tree-attention support, no hardware acceptance sampling, and no concept of draft-verify pipelining. Groq treats every token as equal; SpecDecode-1 treats draft tokens as cheap, speculative bets and verifier tokens as expensive, authoritative confirmations. Groq’s architecture could theoretically be adapted for draft model execution, but it would require a separate cluster of Groq chips for the verifier — the exact cluster-level disaggregation that SpecDecode-1 integrates on a single die.
Cerebras WSE-3. 44 GB of on-chip SRAM eliminates the memory wall for KV-cache. If Cerebras built a speculative decoding mode, they already have the memory substrate. They would need the tree-attention logic, acceptance sampling hardware, and a way to run a tiny draft model at deterministic latency on a small partition of the wafer while the rest runs the verifier. The WSE’s 900,000 cores and wafer-scale interconnect could theoretically do this, but it is not what the architecture was designed for.
SambaNova SN40L. Three-tier memory (SRAM + HBM + DDR5) maps naturally to draft SRAM / verifier SRAM / KV overflow. SambaNova has already demonstrated batched speculative decoding on the SN40L in production. But it remains a software implementation on reconfigurable hardware — not purpose-built silicon for the draft-verify pattern.
HADES (Yang et al., 2025). The first published end-to-end hardware accelerator specifically designed for speculative decoding. HADES addresses “the design of an LLM accelerator with hardware-level speculative decoding support, a concept not previously explored in existing literature.” It targets energy efficiency through a hardware-native draft/verify pipeline. SpecDecode-1 extends this concept with hardware PagedAttention, tree-attention mask generation, and a multi-gigabyte on-chip KV-cache pool.
Etched Sohu. Hardcodes the entire transformer inference pipeline into fixed-function silicon. Claims 500K+ tokens/sec on Llama 70B (unverified). But Sohu is optimized for standard autoregressive decoding — dense causal attention, regular batch sizes. It has no published speculative decoding support, no tree-attention mask generation, and no draft-verify pipelining. If Sohu’s throughput claims hold, the raw speed might make speculative decoding less necessary for batch throughput — but for single-stream latency, spec decode remains essential regardless of hardware speed.
Risks
Acceptance rates vary by task. Speculative decoding’s speedup depends on how well the draft model predicts the verifier’s output. Acceptance rates range from 50% (creative, high-entropy generation) to 90%+ (code completion, structured output). At 50% acceptance, the effective speedup drops from 5-10x to 2-3x. The chip’s economics are sensitive to workload mix. If the median workload sees 60% acceptance rather than 80%, the ROI case weakens substantially.
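The sensitivity to acceptance rate can be quantified with the closed-form expectation from Leviathan et al.: for a linear draft of length γ with per-token acceptance rate α, the expected number of tokens produced per verification pass is (1 - α^(γ+1)) / (1 - α). The sketch below evaluates it; it assumes independent, identical acceptance at each position and a chain rather than a tree, so it understates what tree drafts can achieve.

```python
def expected_tokens_per_cycle(alpha, gamma):
    """Expected tokens per verification pass for a chain draft of length gamma.

    alpha is the per-token acceptance rate, assumed i.i.d. across positions.
    The count includes the one token the verifier always contributes.
    """
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


for alpha in (0.5, 0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_cycle(alpha, 5):.2f} tokens per pass")
```

With a draft length of 5, moving from 80% to 50% acceptance roughly halves the expected tokens per pass, which is the effect the risk above describes.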
Distillation may reduce the need. If target models get cheaper to run directly — through better distillation, more aggressive quantization, or architectural improvements — the absolute latency of autoregressive decoding drops, reducing the value of speculative speedup. A 70B model that decodes at 500 tok/s on next-generation GPUs makes the 1,600 tok/s SpecDecode-1 advantage less compelling than a 70B model that decodes at 100 tok/s on current GPUs.
Architecture evolution. Speculative decoding is inherently tied to autoregressive generation. If the field moves to non-autoregressive architectures — diffusion-based language models, state-space models that can generate multiple tokens natively (as in ATLAS’s SSM-based imagination engine), or parallel decoding schemes — the entire speculative decode loop becomes irrelevant. The draft accelerator, tree manager, and acceptance sampler become dead silicon. Unlike ARIA, which retains programmable tiles that can run arbitrary architectures, SpecDecode-1 has significant fixed-function hardware that only serves speculative decoding.
Die size dominated by SRAM. 2-4 GB of on-chip SRAM at 5nm costs 400-800 mm² — the single most expensive component on the die. SRAM bit-cells do not scale as aggressively as logic across process nodes, and SRAM yield is sensitive to defect density. A single soft error in the KV-cache pool corrupts inference output. ECC overhead adds ~12% to SRAM area. The cost-reduced 1 GB variant trades 2x KV-cache capacity for ~200 mm² of die area savings, but this limits context length to ~16K at 70B scale — potentially insufficient for agentic workloads that accumulate long conversation histories.
The spec-decode maximalist bet. SpecDecode-1 bets that single-stream latency matters more than batch throughput for its target market. If inference providers prioritize cost-per-token (which favors high-batch throughput on GPUs) over latency-per-token (which favors spec-decode optimization), the market for a dedicated spec-decode chip may be smaller than projected. The chip’s sweet spot is narrow: coding agents, real-time chat, interactive tool use — workloads where a single user is waiting for a single model’s output. Multi-user serving at scale still favors batch-optimized hardware.
The Unifying Principle
Every component of SpecDecode-1 serves the same purpose: produce more accepted tokens per verification cycle. The draft accelerator generates candidates as fast as silicon allows. The tree manager optimizes tree topology to maximize expected accepted tokens per tree. The verifier engine processes tree-structured attention without wasting compute on masked positions. The KV-cache pool eliminates memory management overhead through hardware transactions. The HBM interface hides weight-streaming latency behind draft generation.
This is not a general-purpose inference chip that happens to support speculative decoding. This is silicon where every transistor is justified by the draft-tree-verify-accept loop. The draft accelerator exists only to feed the verifier. The KV-cache pool exists only to share state at zero cost. The tree manager exists only to orchestrate speculation without CPU intervention. The result is a chip projected to make a 7B model respond at 7,100 tokens/second and a 70B model at 1,600 tokens/second, not through brute-force memory bandwidth, but through the algorithmic leverage of speculating correctly.
Additional Reading
- Speculative Decoding — Leviathan et al. 2023. The foundational paper: parallel verification of draft tokens with rejection sampling.
- Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding — Chen et al., CMU/Together AI, 2024. DP-optimal tree topologies co-designed with hardware profiles. Up to 9.96x speedup.
- SpecInfer: Accelerating LLM Serving with Tree-based Speculative Inference — Miao et al., CMU, ASPLOS 2024. Multiple draft models organized as a token tree verified in one pass.
- EAGLE — Li et al. 2024. Feature-level draft model operating on second-to-top-layer representations.
- EAGLE-2 — Li et al. 2024. Context-aware dynamic draft trees with confidence-based topology adaptation.
- Medusa — Cai et al., Together AI, 2024. Multiple prediction heads attached directly to the target model.
- PagedAttention / vLLM — Kwon et al., SOSP 2023. Virtual memory for KV-cache: paging, copy-on-write, deduplication.
- Splitwise: Efficient Generative LLM Inference — Patel et al. 2024. Disaggregated prefill/decode across heterogeneous hardware.
- HADES: Hardware Accelerated Decoding for Efficient Speculation — Yang et al., ICCEA 2025. First published hardware accelerator design targeting speculative decoding natively.