The Case for an Agentic Inference Chip

Agentic AI workloads — multi-step reasoning chains, tool invocation, code execution, planning loops with branching sub-agents — are the fastest-growing segment of LLM inference demand. They are also the segment worst served by existing hardware. This article examines why, surveys the accelerator landscape, analyzes the latency and power budgets in detail, and proposes a concrete chip architecture — ARIA, the Agentic Reasoning Inference Accelerator — that targets the structural gaps.

The Workload: Why Agents Break GPUs

An agent does not make a single inference call. A coding agent solving a SWE-bench task executes 5-30 sequential rounds of prefill-decode-parse-tool_call-resume. A deep research agent may chain 10-50 calls across search-read-synthesize loops. Multi-agent orchestration can exceed 100 calls totaling 100K+ generated tokens. Each round pays the full time-to-first-token (TTFT) penalty, and context grows with every turn as conversation history, tool outputs, and file contents accumulate.

The end-to-end anatomy of a single agent turn:

Stage	Typical Latency	Bound By
Tokenization	1-5 ms	CPU (negligible)
Network RTT	20-150 ms	Geography, TCP/TLS
Queue + scheduling	5-50 ms	Load, batch formation
Prefill	50-2,000 ms	Compute (FLOPS)
Decode	10-40 ms/token	Memory bandwidth
Tool-call parse + dispatch	5-20 ms	CPU, framework
Tool execution	50-5,000 ms	I/O, external service
Resume prefill (tool result)	50-500 ms	Compute

Agent Turn Latency Waterfall

Stacked timeline of a single agent turn generating 500 tokens on a 70B model (total ~17.7 s):

Stage	Latency
Tokenize	3 ms
Network RTT	80 ms
Prefill	400 ms (compute-bound)
Decode, 500 tokens	15,000 ms (memory-bound)
Tool execution	2,000 ms
Resume prefill	200 ms (compute-bound)

Decode dominates at 85% of wall-clock time. An agent chaining 20 turns pays this budget sequentially, for a cumulative latency of 4-8 minutes.

For a 20-turn coding agent on a 70B model served by H100s, the aggregate budget is roughly 4-8 minutes of wall-clock time. Prefill dominates when the agent issues short tool-call commands with long accumulated context. Decode dominates when it generates long code blocks. In both regimes, the hardware is severely underutilized.
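The per-turn and 20-turn figures can be reproduced with a few lines of arithmetic. The sketch below simply sums the stage latencies quoted in the waterfall; the dictionary keys and function names are illustrative, not part of any serving framework.

```python
# Stage latencies in milliseconds, copied from the single-turn waterfall in this section.
STAGES_MS = {
    "tokenize": 3,
    "network_rtt": 80,
    "prefill": 400,
    "decode_500_tokens": 15_000,
    "tool_execution": 2_000,
    "resume_prefill": 200,
}


def turn_latency_ms(stages):
    """Wall-clock time for one turn when the stages run back to back."""
    return sum(stages.values())


def chain_latency_minutes(stages, turns):
    """Sequential agent loop: every turn pays the full per-turn budget."""
    return turns * turn_latency_ms(stages) / 60_000


total_ms = turn_latency_ms(STAGES_MS)
decode_share = STAGES_MS["decode_500_tokens"] / total_ms
print(f"one turn: {total_ms / 1000:.1f} s, decode share: {decode_share:.0%}")
print(f"20 turns: {chain_latency_minutes(STAGES_MS, 20):.1f} min")
```

Twenty such turns land near the middle of the 4-8 minute range; shorter generations or faster tools move a run toward the low end.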

The core mismatch is quantitative. Autoregressive decode streams the entire model’s weights from HBM for every generated token. A 70B FP16 model requires reading ~140 GB per token. On an H100 at 3.35 TB/s HBM bandwidth, that caps single-stream decode at ~24 tokens/second — regardless of the GPU’s 989 TFLOPS of tensor core capacity. The arithmetic intensity during decode is ~1 FLOP/byte; GPUs are designed for 100+ FLOPs/byte. At batch size 1-8 (the agentic operating point), GPU utilization is below 10%.
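A back-of-envelope roofline makes the mismatch explicit. This is a sketch under the simplifying assumptions used above: every decoded token streams all weights once, and FP16 decode performs about 1 FLOP per byte per sequence in the batch. The function names are hypothetical.

```python
def decode_tokens_per_s(params_billions, bytes_per_param, hbm_tb_per_s):
    """Upper bound on single-stream decode rate when each token reads all weights once."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return hbm_tb_per_s * 1e12 / weight_bytes


def ridge_flops_per_byte(peak_tflops, hbm_tb_per_s):
    """Arithmetic intensity at which compute, rather than bandwidth, becomes the limit."""
    return peak_tflops / hbm_tb_per_s


def compute_utilization(batch, ridge, flops_per_byte_at_batch_1=1.0):
    """Best-case tensor-core utilization while the kernel is bandwidth-bound."""
    return min(1.0, batch * flops_per_byte_at_batch_1 / ridge)


# H100 figures quoted in the text: 3.35 TB/s HBM, 989 TFLOPS dense.
h100_ridge = ridge_flops_per_byte(989, 3.35)
print(f"70B FP16 decode ceiling: {decode_tokens_per_s(70, 2, 3.35):.1f} tok/s")
print(f"ridge point: {h100_ridge:.0f} FLOPs/byte")
for batch in (1, 8, 64):
    print(f"batch {batch}: utilization <= {compute_utilization(batch, h100_ridge):.1%}")
```

Under this model, utilization stays in the low single digits of a percent across the whole batch 1-8 range.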

Five characteristics make agentic inference a distinct hardware regime, separate from the video generation workloads addressed by VDX-1 or the robotics edge constraints targeted by JEPA-R:

Low effective batch size per decision step. GPUs need batch >= 32-64 to saturate compute; agents operate at batch 1-8 per reasoning step.
Interleaved prefill and decode within a single request. Tool outputs trigger long prefills mid-generation, breaking the assumption of a clean prefill-then-decode pipeline.
Long, growing context. Agents accumulate 32K-128K+ tokens across turns. KV-cache for Llama 2 70B at 32K context consumes tens of GB, competing with weights for bandwidth.
KV-cache sharing across agents. Multi-agent systems reuse system prompts and shared context. KVCOMM (2025) demonstrates 70%+ reuse rates and up to 7.8x speedup through cross-context cache alignment.
Latency on the critical path. Every token of reasoning is sequential. TTFT after a tool return matters as much as steady-state decode throughput.

GPUs also pay a hidden tax: kernel launch overhead. Each transformer layer invokes 10-20 GPU kernels, each with 3-5 microseconds of CPU-to-GPU dispatch latency. A 70B model with 80 layers runs 800-1,600 kernel launches per token, adding 2.4-8 ms of pure overhead — a substantial fraction of the 20-40 ms decode budget. CUDA Graphs partially mitigate this but cannot eliminate it. Dedicated inference hardware like Groq’s LPU avoids the dispatch loop entirely through deterministic, compiler-scheduled execution.
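The overhead range follows directly from multiplying out the figures in the paragraph. A minimal sketch, with a hypothetical helper name:

```python
def launch_overhead_ms(layers, kernels_per_layer, dispatch_us):
    """Per-token dispatch cost if every kernel launch is paid serially."""
    return layers * kernels_per_layer * dispatch_us / 1000


low = launch_overhead_ms(layers=80, kernels_per_layer=10, dispatch_us=3)
high = launch_overhead_ms(layers=80, kernels_per_layer=20, dispatch_us=5)
print(f"launch overhead per token: {low:.1f}-{high:.1f} ms")
```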

The Accelerator Landscape

Eleven custom chips attack the memory wall through fundamentally different architectural bets.

Groq LPU. The most radical memory-first design. A single-core Tensor Streaming Processor on GlobalFoundries 14nm with 230 MB of on-chip SRAM (no HBM) and ~80 TB/s of on-chip bandwidth. Fully deterministic VLIW execution with no caches, buffers, or runtime arbitration — the compiler resolves all data movement at compile time. Delivers ~250-500 tokens/sec per user on Llama 3 70B class models. The weakness: no useful model fits on one chip. Mixtral 8x7B requires 576 chips (8 racks, 144 CPUs, 144 TB host RAM). Economically punishing at scale, but the latency profile is ideal for agentic loops where deterministic execution eliminates tail latency across 10-20 chained calls.

Cerebras WSE-3. Wafer-scale integration: 4 trillion transistors on a single 46,225 mm^2 TSMC 5nm wafer, 900,000 cores, 44 GB on-chip SRAM, 125 PFLOPS peak. The Cerebras Inference Service reports 969 tokens/sec on Llama 3.1 405B at 240 ms TTFT — 18x faster than GPU-served Claude 3.5 Sonnet and 12x faster than GPT-4o. At 100K context, it still sustains 539 tokens/sec. The 44 GB of on-chip SRAM eliminates HBM bandwidth as the decode bottleneck for models that fit. System cost and single-vendor dependency are the constraints.

Etched Sohu. A transformer-specific ASIC on TSMC 4nm that hardcodes the entire inference pipeline (matmul, attention, LayerNorm, softmax) into fixed-function silicon. Claims an 8-chip server generates 500,000+ tokens/sec on Llama 70B — roughly 22x an equivalent 8-GPU H100 server. If validated, the economics are transformative. The risk is architectural lock-in: if the field moves to state-space hybrids or non-transformer architectures, Sohu becomes worthless silicon. Independent benchmarks are not yet available.

Google TPU v6e (Trillium). 4.7x peak compute over v5e, 2x HBM, 2x ICI bandwidth, 256-chip pods at 99% scaling efficiency. 67% energy efficiency gain. Strong for large-model inference but locked to Google Cloud and JAX/XLA.

NVIDIA B200. 208 billion transistors on TSMC 4NP, 192 GB HBM3e at 8 TB/s. 10,614 tokens/sec on Llama 3.3 70B (FP4, TP=1). The GB200 NVL72 claims 30x inference over equivalent H100 configurations. NVIDIA explicitly targets agentic workloads with Blackwell Ultra. The CUDA ecosystem moat remains the real barrier to alternatives. For a deeper dive into the Blackwell architecture, see the dedicated article.

SambaNova SN40L. Reconfigurable dataflow with three-tier memory (SRAM + HBM + DDR5, 1.5 TB/node). Millisecond-scale model switching suits agentic routing between specialized models.

d-Matrix Corsair. SRAM-based digital in-memory compute that performs matmul where the data lives; promising, but not yet independently benchmarked.

Apple Silicon M4 Max. 546 GB/s unified memory, 128 GB capacity, 38 TOPS ANE at ~1.5-2 TOPS/W with near-zero idle draw. Enables the hybrid pattern: quantized 3B on-device for tool orchestration, cloud fan-out for heavy reasoning.

Others. AWS Inferentia2 (190 TFLOPS, modest latency), Tenstorrent Blackhole (RISC-V, 664 TFLOPS FP8, open SDK, GDDR6-limited), Meta MTIA and Microsoft Maia (internal-only).

Accelerator Comparison

On-chip SRAM, memory bandwidth, and single-user decode throughput across five architectures

Architecture	Process / type	On-chip SRAM	Memory BW	tok/s
Groq LPU	GF 14nm, SRAM-only	230 MB	80 TB/s on-chip	~500
Cerebras WSE-3	TSMC 5nm, wafer-scale	44 GB (wafer)	on-die only	969
Etched Sohu	TSMC 4nm, fixed-function	undisclosed	not shown	500K (claimed, unverified)
NVIDIA B200	TSMC 4NP, GPU	~50 MB	8 TB/s HBM3e, 192 GB	10,614 (batched)
ARIA	TSMC N3E, proposed agentic accelerator	256 MB	12 TB/s HBM4, 96 GB	150/user (target, 16 agents)

Memory Architecture: The Binding Constraint

Every technique in the inference optimization stack — speculative decoding, MoE sparsity, KV-cache paging, prefill/decode disaggregation — is fundamentally a strategy to extract more useful tokens per byte transferred from memory. The memory hierarchy is the single most important design decision.

ARIA Memory Hierarchy Data Movement Waterfall

Tiers from fast/small to slow/large:

Registers (tile-local): 64 KB, 1 cycle, ~80 TB/s. Connects to SRAM over the on-die bus at full bisection bandwidth.
On-chip SRAM (unified L1): 256 MB, 3 cycles, ~40 TB/s on-die. Holds activations (ACT). Connects to HBM4 through the HBM4 PHY (2,048-bit interface × 6 stacks).
HBM4 (6 stacks on-package): 96 GB, ~12 cycles, 12 TB/s aggregate. Holds weights, KV spill, and batch data. Connects to the CXL pool over CXL 3.0 coherent shared memory.
CXL pool (shared across nodes): shared KV-cache pool, ~100+ cycles, 242 GB/s per link, hardware coherency. Supports dedup and COW.

Callouts: a 170× energy gap between a 3.7 pJ FP multiply and a 640 pJ DRAM read; 80-90% of inference energy goes to memory access; 7× dedup with a CXL shared KV-cache.

HBM4. Standardized April 2025, HBM4 doubles the interface width to 2,048 bits per stack, achieving 2 TB/s per stack (67% over HBM3e) with up to 64 GB per 16-high stack. Six HBM4 stacks provide 12 TB/s aggregate bandwidth and 96 GB capacity, roughly 3.6x an H100’s HBM bandwidth. This is the minimum envelope for serving 70B+ models to concurrent agents without starving the decode engine.

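CXL 3.0 shared KV-cache. CXL 3.0’s coherent memory sharing enables KV-cache deduplication across agent instances. For N agents that share a fraction s of their context, deduplication shrinks the KV footprint by N / (s + N(1 - s)): eight coding agents sharing a system prompt and 90% of context save approximately 4.7x, and the savings approach 7x as the shared fraction nears 98%. Shared blocks are also prefilled once rather than once per agent, eliminating redundant prefill computation. Hardware-managed coherency avoids the software overhead of explicit cache coordination.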
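How much deduplication buys depends strongly on the shared fraction. The sketch below uses a simple model, which is an assumption rather than a measured result: one shared copy of the common context plus a private remainder per agent. Under that model a 7x reduction for eight agents corresponds to roughly 98% sharing.

```python
def dedup_reduction(num_agents, shared_fraction):
    """Footprint without dedup divided by footprint with one shared copy
    plus each agent's private remainder."""
    private = num_agents * (1.0 - shared_fraction)
    return num_agents / (shared_fraction + private)


for shared in (0.50, 0.90, 0.98, 1.00):
    print(f"8 agents, {shared:.0%} shared: {dedup_reduction(8, shared):.1f}x")
```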

Processing-in-Memory. Samsung HBM-PIM and SK Hynix AiM embed modest compute directly within DRAM banks. Attention’s arithmetic intensity (~1 FLOP/byte for long sequences) maps perfectly to PIM’s strength: compute co-located with massive bandwidth. Yang et al. (Duke, 2026) demonstrate significant energy efficiency improvements for end-to-end transformer acceleration through PIM, performing attention score computation in-situ on KV-cache banks. MXFormer (2026) pushes further with charge-trap transistor CIM, achieving 3.3-60.5x improvement in compute density using MXFP4 precision.

ReRAM/MRAM for instant cold-start. Non-volatile memory with sub-2ns read access could store model weights persistently. An agent “wakes up” with weights already in place, eliminating the multi-second model load that currently gates the first request after a cold start or model switch. This is particularly relevant for edge agents and for datacenter deployments serving many models with bursty demand.

Speculative Decoding: The Latency Multiplier

Speculative decoding is the highest-impact single optimization for agentic latency, explored in detail in the dedicated SpecDecode-1 architecture proposal. A small draft model generates K candidate tokens; the large target model verifies all K in a single parallel forward pass. Rejection sampling guarantees the output distribution is identical to the target model — zero quality loss.
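The acceptance rule behind the zero-quality-loss guarantee is short enough to sketch. The version below follows the standard speculative sampling rule: accept draft token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized residual max(0, p - q). It operates on explicit probability vectors for a linear chain of K draft tokens; the function names and data layout are illustrative, and tree-structured verification is not covered.

```python
import random


def sample(dist, rng):
    """Draw an index from a (possibly unnormalized) probability vector."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def verify_draft(draft_tokens, draft_dists, target_dists, rng):
    """Speculative sampling acceptance step.

    draft_tokens:    K token ids proposed by the draft model
    draft_dists[i]:  draft distribution q_i that draft_tokens[i] was sampled from
    target_dists[i]: target distribution p_i at the same position; this list has
                     K + 1 entries, the last one being used for the bonus token
    Returns the accepted prefix followed by exactly one token chosen by the target.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_dists[i], draft_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take one extra token from the target.
    out.append(sample(target_dists[len(draft_tokens)], rng))
    return out
```

Each call emits at least one token distributed according to the target model, so the worst case matches ordinary decoding while the best case yields K + 1 tokens per target forward pass.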

Technique	Speedup	Mechanism
Classic speculative decode (Leviathan et al., 2022)	2-3x	Small draft model + parallel verify
DeepMind speculative sampling (Chen et al., 2023)	2-2.5x	Validated at 70B scale (Chinchilla)
Medusa (multi-head prediction)	2.2-3.6x	Multiple prediction heads, no draft model
EAGLE-2 (dynamic draft trees)	3.05-4.26x	Runtime-configurable tree topologies
Sequoia (hardware-aware trees)	up to 4.04x (A100), 9.96x (offloaded)	DP-optimal tree topology per hardware profile

The key insight from Sequoia (CMU/Together AI, 2024) is that speculation strategy must be co-designed with hardware. The optimal tree topology depends on the target platform’s compute-to-bandwidth ratio. A custom chip can implement tree verification natively: accept K draft tokens, compute target logits for all K positions, perform rejection sampling in hardware (comparing draft vs. target distributions with hardware RNG), and identify the last accepted position — all in under 100 clock cycles.

Hardware requirements for native speculative decode: a dedicated draft co-processor (or small model co-located on-die), a tree attention unit with scatter-gather DMA for non-contiguous KV-cache access, a KV-cache MMU (analogous to a CPU TLB) supporting PagedAttention with copy-on-write, and MoE routing circuits for single-cycle expert dispatch in mixture-of-experts models.

Speculative Decoding Speedup Hierarchy

Verified speedup over standard autoregressive decoding (higher = faster):

Classic spec decode (draft + verify, Leviathan et al.): 2-3x
Medusa (multi-head prediction, no draft model needed): 2.2-3.6x
EAGLE-2 (dynamic draft trees, runtime-configurable): 3.05-4.26x
Sequoia (tree attention, HW-aware topology, DP-optimal per hardware profile): up to 9.96x

Sequoia’s 9.96x was achieved in the offloaded setting (CMU/Together AI, 2024); on A100 it reaches up to 4.04x. All methods preserve the exact target-model distribution via rejection sampling.

Disaggregated Inference: The Heterogeneous Thesis

The prefill phase is compute-bound (processing thousands of input tokens in parallel through dense matrix multiplications). The decode phase is memory-bandwidth-bound (streaming weights for single matrix-vector products at batch 1-8). These are fundamentally different hardware problems that GPUs handle with a single, compromised architecture.

Splitwise (Microsoft, 2023) separates prefill and decode onto different hardware within a cluster, transferring KV-cache state over interconnects. Result: 2.35x throughput at iso-cost. DistServe pushes further: 7.4x more requests served within latency SLOs by independently scaling prefill and decode resources. TetriInfer (2024) extends the approach with prompt chunking and predictive scheduling, achieving 97% TTFT reduction and 47% JCT improvement with 38% fewer resources.

SPAD (Princeton, 2025) takes the logical next step: physically distinct chip designs for prefill (larger systolic arrays, cost-effective memory) and decode (prioritized memory bandwidth, reduced compute). This is the most direct academic precedent for a heterogeneous agentic inference chip.

The hybrid CPU-architecture thesis adds another dimension. Apple Silicon’s M4 Max achieves 546 GB/s unified memory bandwidth with a CPU-centric design. Intel AMX delivers 60-65% latency reduction for inference workloads on server CPUs. ARM’s Scalable Matrix Extension (SME) and RISC-V vector extensions provide matrix acceleration without GPU overhead. The argument: for thin-thread agents at batch 1 (a single user’s reasoning chain), a CPU control plane with a matrix co-processor wins on latency because it eliminates kernel launch overhead, CPU-GPU data transfers, and scheduling complexity. The agent’s control flow (branching, tool dispatch, state management) runs natively on the CPU while matrix operations dispatch to co-located accelerators.

Power Efficiency: The Economic Constraint

For always-on agents, tokens per watt determines economic viability. A 70B model at batch 1 on an H100 SXM: 0.14-0.26 tok/J at 700W. B200 pushes to 0.30-0.50 tok/J. But agents are bursty — active 200-500 ms, then idle 500 ms-10 s waiting for tools. GPU idle power stays at 100-200W, wasting energy during the dominant idle phase.

Data movement dominates: a 32-bit DRAM read costs ~640 pJ versus ~3.7 pJ for a 32-bit FP multiply (170x). For 70B FP16 inference, 80-90% of energy goes to memory access. This ratio has only worsened at modern nodes as compute energy shrinks faster than DRAM energy.

Quantization is the strongest lever. FP16-to-INT4 cuts weight traffic 4x, yielding ~4x power savings on memory-bound decode. FP8 on H100/B200 Tensor Cores gives 2x with minimal accuracy loss. For models above 5B parameters, NF4 achieves near-FP16 accuracy with 75% memory savings; below 3B, dequantization overhead can increase energy 25-56%. NVIDIA’s 2:4 structured sparsity adds 30-36% perf/watt. MoE provides intrinsic sparsity: Mixtral activates 2/8 experts per token, so a 140B MoE model consumes power closer to 35B dense.

The power-gating advantage matters most for agents. Apple’s ANE transitions from ~0W idle to full inference in microseconds; an agent active 10% of the time averages ~0.5-1W. KAIROS (Michigan, 2026) achieves 27% power reduction through agent-granularity DVFS. ThunderAgent (2026) delivers 1.5-3.6x throughput via program-aware KV-cache scheduling of “LLM Program” control flows.
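The effect of idle draw on a bursty agent can be sketched with a duty-cycle average. The GPU numbers below come from the text (700 W active, 100-200 W idle, midpoint used); the power-gated accelerator's 10 W active and 0 W idle are illustrative assumptions, as are the function names.

```python
def average_power_w(active_w, idle_w, duty_cycle):
    """Time-weighted power for a device that is active duty_cycle of the time."""
    return duty_cycle * active_w + (1.0 - duty_cycle) * idle_w


def joules_per_token(active_w, idle_w, duty_cycle, tok_per_s_when_active):
    """Energy per generated token when idle time is charged to the agent."""
    tokens_per_wall_second = duty_cycle * tok_per_s_when_active
    return average_power_w(active_w, idle_w, duty_cycle) / tokens_per_wall_second


gpu_avg = average_power_w(active_w=700, idle_w=150, duty_cycle=0.10)
gated_avg = average_power_w(active_w=10, idle_w=0, duty_cycle=0.10)
print(f"GPU at 10% duty: {gpu_avg:.0f} W average")
print(f"power-gated accelerator at 10% duty: {gated_avg:.1f} W average")
```

In this example about two thirds of the GPU's average draw is idle power, which is the term power gating removes.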

Power Efficiency: Tokens per Joule Across Platforms

70B-class model, batch-1 decode (higher = more energy-efficient)

Platform	Generation	Power	tok/J
A100 SXM	Ampere (2020)	400 W TDP	0.08-0.11
H100 SXM	Hopper (2022)	700 W TDP	0.14-0.26
B200	Blackwell (2024)	~1000 W TDP	0.30-0.50
Apple M4 Max	Apple Silicon (2024)	~20 W active	High tok/J (small models)
ARIA	Proposed	150 W TDP	Projected target

80-90% of inference energy goes to memory access. A 32-bit DRAM read costs ~640 pJ vs ~3.7 pJ for a 32-bit FP multiply, a 170x gap. This ratio worsens at each node as compute energy shrinks faster than DRAM energy. The architecture that minimizes data movement wins the power budget.

M4 Max excels at sub-7B quantized models via 546 GB/s unified memory and near-zero idle draw. ARIA targets 70B at 150W with 256 MB SRAM + HBM4, power-gating idle tiles during tool-call waits.

ARIA: A Concrete Architecture Proposal

The Agentic Reasoning Inference Accelerator synthesizes these findings into a single-die chip with multi-chip scaling provisions.

Die Overview

ARIA is a ~600 mm^2 die on TSMC N3E (~200M transistors/mm^2 logic, 0.021 um^2 SRAM bitcell), comprising four major subsystems: a control cluster, heterogeneous compute engines, a large unified SRAM, and a high-bandwidth memory controller.

Block	Area (est.)	Function
SRAM (256 MB)	~300 mm^2	KV-cache, activations, weight cache
Matrix tiles (8x)	~80 mm^2	Prefill engine
Vector tiles (16x)	~60 mm^2	Decode engine
SFUs + control cluster	~30 mm^2	Non-linear ops, RISC-V cores, agent FSM
HBM4 PHYs (6 stacks)	~40 mm^2	Off-chip memory interface
On-die network + misc	~30 mm^2	Ring + mesh interconnect
UCIe D2D	~10 mm^2	Multi-chip scaling
Total	~550-600 mm^2	Within N3E reticle limit (858 mm^2)

Total transistor count: ~50B (logic) + ~100B (SRAM) = ~150B.

ARIA Die Floorplan (top-down view)

~600 mm², TSMC N3E, ~150B transistors. Blocks shown: prefill engine (8 matrix tiles MT0-MT7, 256×256 BF16 systolic, ~1.57 PFLOPS BF16, ~80 mm²); decode engine (16 vector tiles, 128-wide BF16, ~6 TFLOPS BF16, ~60 mm²); RISC-V control (4× RV64GC); SFUs (Softmax, TopK, SpecDec, KV-MMU); 256 MB unified SRAM (KV-cache, activations, weight cache, ~300 mm², with hardware paged KV, 4 KB pages, COW, and dedup tags); ring + mesh interconnect (~30 mm²); six 16 GB HBM4 stacks (96 GB total); UCIe D2D.

Memory System

256 MB unified on-chip SRAM. This is the core differentiator. Groq achieved 230 MB at 14nm on 725 mm^2; N3E’s ~4.6x density improvement allows comparable capacity in less area. The SRAM serves triple duty: KV-cache primary storage for the hot working set (roughly 100K token-layer entries at FP8 for a 70B GQA model with 8 KV heads, at ~2.5 KB per token per layer, with colder pages spilling to HBM4), activation scratch space (eliminating HBM round-trips for intermediate values), and weight cache (significant reuse for sub-30B models across the batch).

The SRAM implements a hardware-managed paged KV-cache, in effect vLLM’s PagedAttention in silicon. It uses 4 KB pages (one token’s KV for typical model dimensions), a hardware TLB with 4K entries covering 16 MB of hot working set, copy-on-write support (forking agents create COW references to shared system-prompt KV-cache), and content-addressable deduplication tags that detect identical KV blocks across agents.
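The fork-then-diverge behavior of a copy-on-write page table can be modeled in software. This toy sketch is not vLLM's implementation or ARIA's hardware design: the class name, the `page_tokens` parameter, and the use of Python lists as page payloads are all assumptions made for illustration.

```python
class PagedKVCache:
    """Toy page table: each sequence maps to a list of physical page ids."""

    def __init__(self, page_tokens=4):
        self.page_tokens = page_tokens  # tokens per page (illustrative value)
        self.pages = {}      # physical id -> list of per-token KV payloads
        self.refcount = {}   # physical id -> number of sequences referencing it
        self.tables = {}     # sequence id -> list of physical ids
        self._next_id = 0

    def _alloc(self, payload):
        pid = self._next_id
        self._next_id += 1
        self.pages[pid] = list(payload)
        self.refcount[pid] = 1
        return pid

    def create(self, seq, payloads):
        """Create a sequence from a list of page payloads."""
        self.tables[seq] = [self._alloc(p) for p in payloads]

    def fork(self, parent, child):
        """The child references every parent page; no KV data is copied."""
        self.tables[child] = list(self.tables[parent])
        for pid in self.tables[child]:
            self.refcount[pid] += 1

    def append(self, seq, entry):
        """Append one token's KV, copying the last page first if it is shared."""
        table = self.tables[seq]
        if not table or len(self.pages[table[-1]]) >= self.page_tokens:
            table.append(self._alloc([entry]))  # last page full: new private page
            return
        pid = table[-1]
        if self.refcount[pid] > 1:              # shared page: copy before writing
            self.refcount[pid] -= 1
            pid = self._alloc(self.pages[pid])
            table[-1] = pid
        self.pages[pid].append(entry)


cache = PagedKVCache(page_tokens=4)
cache.create("root", [["sys0", "sys1"]])
cache.fork("root", "child")
cache.append("child", "tool_result")
print(cache.tables, cache.pages)
```

Only the page being written is duplicated; full pages of shared system-prompt KV stay single-copy no matter how many agents fork from them.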

96 GB HBM4 at 12 TB/s. Six stacks provide overflow: cold KV-cache pages, full model weights for 70B+ models, large-batch activations. The HBM controller includes a prefetch engine (predicts needed KV pages from agent conversation history), bandwidth partitioning (separate virtual channels for weight streaming, KV spill/fill, and activation traffic), and transparent 2:4 structured sparsity decompression on read (~2x effective bandwidth for sparse weights).

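Bandwidth analysis at the operating point (batch=8 agents, each decoding one token per step on a 70B FP8 model with 8K context): each decode step streams ~70 GB of weights once for the whole batch, and each agent reads its own KV-cache, ~1.3 GB at 8K context assuming Llama-70B-class dimensions (80 layers, 8 KV heads of dimension 128, ~160 KB per token at FP8). At the 100 tok/s/user target the batch runs 100 steps per second, so raw demand is ~7 TB/s for weights plus ~1 TB/s for KV. With 2:4-sparse weights decompressed on read (~2x effective bandwidth) and hot KV pages held in the 256 MB SRAM, effective HBM demand drops to roughly 4-5 TB/s; even the dense case (~8 TB/s) fits within the 12 TB/s envelope, with less margin for prefill bursts.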

Compute Engines

Prefill engine: 8 matrix tiles. Each tile contains a 256x256 BF16 systolic array (131K FLOPs/cycle) with 512 KB local SRAM. At 1.5 GHz: ~196 TFLOPS BF16 per tile, ~1.57 PFLOPS BF16 aggregate (~3.14 POPS INT8). Comparable to H100 tensor core throughput. Handles the compute-bound phase: processing tool outputs, ingesting long context, computing initial KV-cache. The 8 tiles can partition flexibly — all 8 for a single large prefill, or split across concurrent smaller prefills from different agents.

Decode engine: 16 vector tiles. Each tile is a 128-wide BF16 vector unit with 1 MB local SRAM. Deliberately modest compute (~384 GFLOPS BF16 per tile, ~6 TFLOPS aggregate) because decode at batch 1-8 degenerates into matrix-vector products where weights stream from memory and activations fit in registers. The tiles connect to SRAM banks at full bisection bandwidth. Vector units include fused multiply-add, FP8/FP4 support, and fast reduction trees for attention score accumulation.
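The peak-throughput figures for both engines follow from counting multiply-accumulates per cycle. A minimal sketch; the 1.5 GHz clock is stated for the matrix tiles and is assumed here for the vector tiles as well, which is what the ~384 GFLOPS per-tile figure implies.

```python
def systolic_peak_tflops(rows, cols, clock_ghz):
    """Each processing element performs one multiply-accumulate (2 FLOPs) per cycle."""
    return rows * cols * 2 * clock_ghz * 1e9 / 1e12


def vector_peak_gflops(lanes, clock_ghz):
    """Each lane performs one fused multiply-add (2 FLOPs) per cycle."""
    return lanes * 2 * clock_ghz


matrix_tile = systolic_peak_tflops(256, 256, 1.5)
vector_tile = vector_peak_gflops(128, 1.5)
print(f"matrix tile: {matrix_tile:.1f} TFLOPS, 8 tiles: {8 * matrix_tile / 1000:.2f} PFLOPS")
print(f"vector tile: {vector_tile:.0f} GFLOPS, 16 tiles: {16 * vector_tile / 1000:.2f} TFLOPS")
```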

Heterogeneous scheduling. The control cluster dynamically assigns work based on phase: pure decode power-gates the matrix tiles (saving ~30W), pure prefill activates all matrix tiles while decode tiles handle low-priority concurrent decode, and the agentic common case runs mixed — matrix tiles prefill returning tool output while vector tiles continue decode for other agents.

Special Function Units

Non-linear operations (softmax, LayerNorm, GeLU/SiLU, TopK) consume 5-15% of end-to-end GPU latency despite comprising less than 1% of FLOPs. ARIA includes dedicated hardware:

Softmax/LayerNorm engine. Streaming pipelined architecture processing a row of attention scores in a single pass with online (numerically stable) softmax. 16 independent lanes, each handling one head or one batch element.
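Online softmax is what makes a streaming implementation possible: the running maximum and the running denominator are updated together, so the scores never need a separate max-finding pass. A software sketch of the recurrence follows (the function name is illustrative, and finite inputs are assumed). The final normalization is written as a second loop for clarity; fused attention kernels typically fold it into the weighted sum over V.

```python
import math


def online_softmax(scores):
    """Numerically stable softmax with max and denominator computed in one pass."""
    running_max = float("-inf")
    denom = 0.0
    for x in scores:
        new_max = max(running_max, x)
        # Rescale the partial sum to the new maximum, then add the current term.
        denom = denom * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / denom for x in scores]


print(online_softmax([1.0, 2.0, 3.0]))
print(online_softmax([1000.0, 1001.0]))  # large scores do not overflow
```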

Speculative decode verification unit. Tree-structured verification logic following Sequoia’s framework. Accepts K draft tokens from an on-die or co-packaged draft model plus the target model’s logits for all K positions. Performs rejection sampling in hardware with hardware RNG. Configurable tree topology loaded from the control cluster. Target: verify a tree of 8-16 speculated tokens in under 100 clock cycles.

TopK/sampling unit. Hardware TopK extraction (top-50, top-100) from vocabulary-sized logit vectors (32K-256K entries). Temperature scaling, top-p nucleus sampling, repetition penalty in a streaming pipeline. Avoids the GPU pattern of sorting the full vocabulary.
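The software equivalent of this pipeline avoids a full-vocabulary sort by extracting only the k largest logits. The sketch below applies temperature, then top-k, then top-p over the renormalized top-k set; that ordering is one common convention rather than the only one, repetition penalty is omitted, and the function name is hypothetical.

```python
import heapq
import math
import random


def sample_top_k_top_p(logits, k, top_p, temperature, rng):
    """Sample a token id using top-k extraction followed by nucleus filtering."""
    # heapq.nlargest returns the k best (index, logit) pairs in descending order
    # without sorting the whole vocabulary.
    top = heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])
    scaled = [logit / temperature for _, logit in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus: keep the shortest prefix whose cumulative mass reaches top_p.
    kept_ids, kept_probs, mass = [], [], 0.0
    for (token_id, _), p in zip(top, probs):
        kept_ids.append(token_id)
        kept_probs.append(p)
        mass += p
        if mass >= top_p:
            break
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]


rng = random.Random(0)
print(sample_top_k_top_p([0.1, 5.0, 0.2, 4.0, -1.0], k=3, top_p=0.9, temperature=0.8, rng=rng))
```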

KV-cache management unit. Page table walker, TLB, COW logic. Eviction policy engine implementing LRU with attention-score-weighted priority (pages whose tokens receive high attention scores are retained; decaying-score pages spill to HBM). Compression/decompression for FP8 KV quantization on spill/fill.
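One way to picture an attention-weighted eviction policy is an exponentially decayed score per page, with recency as the tie-breaker. This is a sketch of the idea only: the decay factor, the scoring rule, and the class and method names are assumptions, not a specification of the hardware policy.

```python
class KVEvictor:
    """Tracks a decayed attention score and last-use step for each KV page."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.score = {}      # page id -> exponentially decayed attention mass
        self.last_used = {}  # page id -> step of the most recent access
        self.step = 0

    def observe(self, attention_by_page):
        """Record one decode step; attention_by_page maps page id -> attention mass."""
        self.step += 1
        for pid in self.score:
            self.score[pid] *= self.decay
        for pid, mass in attention_by_page.items():
            self.score[pid] = self.score.get(pid, 0.0) + mass
            self.last_used[pid] = self.step

    def victim(self):
        """Page to spill next: lowest decayed score, ties broken by least recent use."""
        return min(self.score, key=lambda pid: (self.score[pid], self.last_used[pid]))


evictor = KVEvictor(decay=0.9)
evictor.observe({"sys_prompt": 0.6, "file_a": 0.1, "old_tool_output": 0.3})
for _ in range(5):
    evictor.observe({"sys_prompt": 0.9, "file_a": 0.1})
print(evictor.victim())
```

In the example, the page that stopped receiving attention is chosen for spilling even though it started with a higher score than the page that keeps receiving a little attention every step.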

ISA: Two-Tier Programmability

ARIA is not a fixed-function ASIC. It must handle SSMs (Mamba), mixture-of-experts routing, multi-modal cross-attention, retrieval-augmented generation, and architectures that do not yet exist.

Tile-level ISA (VLIW with transformer extensions): Matrix ops (MMUL.BF16/FP8/FP4, batched GEMV), attention primitives (ATTN.SCORE, ATTN.WEIGHT, ATTN.FLASH for fused tiled attention), reduction ops (REDUCE.SUM/MAX/TOPK), page-based memory ops (LOAD.PAGE, PREFETCH.KV, STREAM.WEIGHT), speculative control (SPEC.DRAFT, SPEC.VERIFY, BATCH.YIELD), non-linear dispatch (SOFTMAX.ROW, LAYERNORM, GELU, SILU, ROPE), and first-class MoE support (MOE.GATE for top-K expert selection, MOE.SCATTER, MOE.GATHER).

Control cluster ISA: Four RISC-V RV64GC cores with custom extensions for agent state machine management (fork/join agents, track conversation turns), batch scheduling (add/remove sequences per iteration — the hardware realization of Orca’s iteration-level scheduling), power management (tile power gating, DVFS per cluster following KAIROS-style agent-granularity power policy), and multi-chip coordination (message passing over UCIe D2D).

Multi-Chip Scaling

For models exceeding single-die capacity (>70B at FP8, or >30B with large batch and long context):

Interconnect: UCIe D2D port at 256 GB/s bidirectional per die. For 4-8 chip servers, a proprietary coherent ring provides 512 GB/s chip-to-chip bandwidth with cache coherency for KV-cache pages across dies.

Scaling targets:

Configuration	Chips	Model Size (FP8)	Concurrent Agents @ 32K	Target tok/s/user
Single die	1	30B	16	150
Quad	4	70B	32	120
Octet	8	200B	64	80
Rack (32)	32	405B+	256	60

Tensor parallelism for 2-8 chips (all-reduce after each layer; at 512 GB/s, a 70B model across 4 chips adds ~50 us per layer, ~4 ms across its 80 layers). Pipeline parallelism for 8+ chips. Expert parallelism for MoE models.

Comparison

Dimension	NVIDIA B200	Groq LPU	Etched Sohu	Google TPU v5p	ARIA
Process	TSMC 4NP	GF 14nm	TSMC 4nm	Undisclosed	TSMC N3E
On-chip SRAM	~50 MB	230 MB	Unknown	~50 MB (est.)	256 MB
HBM	192 GB HBM3e, 8 TB/s	None	144 GB HBM3e	95 GB, 2.7 TB/s	96 GB HBM4, 12 TB/s
Peak BF16	~2.5 PFLOPS	~188 TFLOPS FP16 (~750 TOPS INT8)	Unknown	459 TFLOPS	~1.6 PFLOPS
Low-batch decode	Poor (<10% util at batch 1)	Excellent	Excellent	Moderate	Excellent
Spec decode	Software only	Not supported	Hardware (claimed)	Software only	Hardware SFU
KV-cache mgmt	Software (vLLM)	Compiler	Unknown	Software	Hardware paged + COW + dedup
Programmability	Full CUDA	VLIW, limited	Fixed-function	XLA	VLIW + RISC-V
Sweet spot	Training + high-batch	Ultra-low-latency single-stream	Transformer-only	Training + batch serving	Agentic: low-batch, long-context, multi-turn

ARIA occupies the space between Groq’s SRAM-only radicalism and NVIDIA’s general-purpose flexibility. It matches Groq-class on-chip SRAM (256 MB vs. 230 MB) but adds HBM4 for capacity, retains programmability for model evolution, and adds agentic-specific hardware that no existing chip provides.

Manufacturing and Economics

Process: TSMC N3E (relaxed design rules over N3, better yields, ~5% density penalty). At ~0.1 defects/cm^2 (mature process), a 600 mm^2 die yields ~55-60% by the Poisson model — aggressive but comparable to the A100 at 826 mm^2 on 7nm.
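The yield and good-die estimates can be checked with the Poisson model plus a common dies-per-wafer approximation. The edge-loss formula and the 300 mm wafer diameter are assumptions of this sketch; the defect density, die area, and wafer price range come from the text.

```python
import math


def poisson_yield(defects_per_cm2, die_area_mm2):
    """Probability that a die has zero defects under a Poisson defect model."""
    return math.exp(-defects_per_cm2 * die_area_mm2 / 100.0)


def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Common approximation: wafer area over die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2
    area_term = math.pi * radius * radius / die_area_mm2
    edge_term = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return area_term - edge_term


die_yield = poisson_yield(0.1, 600)
good_dies = gross_dies_per_wafer(600) * die_yield
print(f"yield: {die_yield:.1%}, good dies per wafer: {good_dies:.0f}")
for wafer_cost in (20_000, 25_000):
    print(f"${wafer_cost:,} wafer -> ${wafer_cost / good_dies:,.0f} per good die")
```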

BOM per packaged chip:

Component	Cost
Die (silicon, ~50 good dies/wafer at $20-25K/wafer)	$400-500
HBM4 (6 stacks, 96 GB)	$1,500-2,500
CoWoS packaging	$800-1,200
Substrate + passives + test	$200-400
Total BOM	$2,900-4,600

Compare: NVIDIA B200 BOM is estimated at $3,000-5,000; Groq LPU at ~$1,500-2,000 (no HBM) but requiring 8-576x more chips per model.

NRE: $465-710M total (architecture + RTL $200-300M, verification $50-80M, physical design $40-60M, masks $15-20M, IP licensing $30-50M, EDA $20-30M, prototype silicon $30-50M, software stack $80-120M). Timeline: first tapeout at month 24, production-ready at month 36-42.

Break-even: At $10,000-15,000 ASP with $3,000-4,500 COGS, gross margin is ~$6,500-10,500 per chip. Break-even on $600M NRE requires 57,000-92,000 chips. At 50,000 chips/year (plausible if agentic workloads scale as projected), that takes roughly 1.1-1.9 years of shipments.
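The break-even range is a single division at each end of the margin range. A minimal sketch using the figures above, with hypothetical function names:

```python
def breakeven_units(nre_usd, margin_per_chip_usd):
    """Chips that must ship before cumulative margin covers the NRE."""
    return nre_usd / margin_per_chip_usd


def breakeven_years(nre_usd, margin_per_chip_usd, chips_per_year):
    """Time to break even at a constant shipment rate."""
    return breakeven_units(nre_usd, margin_per_chip_usd) / chips_per_year


for margin in (10_500, 6_500):
    units = breakeven_units(600e6, margin)
    years = breakeven_years(600e6, margin, 50_000)
    print(f"${margin:,} margin: {units:,.0f} chips, {years:.2f} years at 50,000/year")
```

The calculation counts shipment time only; the 36-42 month development period before production comes on top of it.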

Risks

Model architecture shift. If transformers are supplanted by a fundamentally different architecture, the SFUs become dead silicon. Mitigation: SFUs are less than 5% of die area, and the programmable tiles handle any architecture through recompilation.

HBM4 availability. Specs finalized April 2025, but volume production may slip to late 2026-2027. The design must be HBM3e-compatible as fallback (8 TB/s, 64 GB — still viable, just tighter on bandwidth margin).

The CUDA moat. ARIA needs an MLIR-based compiler targeting the tile ISA, a runtime compatible with vLLM/SGLang/TensorRT-LLM APIs, and PyTorch/JAX integration. The $80-120M software investment is not optional — it is table stakes.

NVIDIA’s trajectory. Blackwell Ultra explicitly targets agentic workloads. The window of architectural advantage may be narrow. ARIA’s bet is that purpose-built heterogeneity (separate prefill/decode engines, hardware KV-cache management, native spec-decode) captures structural gains that general-purpose GPU evolution cannot match.

Yield at 600 mm^2. At volume, each die lost to a yield-killing defect wastes ~$400 of silicon. Redundant SRAM banks and spare tiles are essential design-for-yield measures.

The Unifying Principle

Every optimization in this design — speculative decoding, MoE sparsity, KV-cache paging with COW, prefill/decode disaggregation, heterogeneous memory — is a strategy to extract more useful tokens per byte moved from memory. DRAM access energy has improved far less than compute energy across every node transition; the ratio has only worsened. The architecture that makes memory movement the first-class design constraint, rather than peak FLOPS, wins the agentic inference workload.

ARIA is a concrete instantiation: 256 MB SRAM to keep the hot working set on-die, 12 TB/s HBM4 for overflow, heterogeneous engines matched to the prefill/decode split, hardware acceleration for speculative verification, paged KV-cache with deduplication, agent state machines, and programmable tiles that survive architectural evolution. The goal is a chip that lets a 70B-parameter agent respond in 80 ms per token with 32K context and 16 concurrent agents, at 150W.

Additional Reading

Efficiently Scaling Transformer Inference — Pope et al., Google 2022
FlashAttention — Dao et al. 2022
FlashAttention-2 — Dao 2023
PagedAttention / vLLM — Kwon et al., SOSP 2023
Splitwise: Efficient Generative LLM Inference — Patel et al. 2024
Speculative Decoding — Leviathan et al. 2023
Medusa — Cai et al. 2024
EAGLE — Li et al. 2024
EAGLE-2 — Li et al. 2024
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding — Chen et al., 2024
DistServe: Disaggregating Prefill and Decoding — Zhong et al., OSDI 2024
Groq LPU Architecture — Groq
Cerebras WSE-3 — Cerebras
