The Case for an Agentic Inference Chip

Agentic AI workloads — multi-step reasoning chains, tool invocation, code execution, planning loops with branching sub-agents — are the fastest-growing segment of LLM inference demand. They are also the segment worst served by existing hardware. This article examines why, surveys the accelerator landscape, analyzes the latency and power budgets in detail, and proposes a concrete chip architecture — ARIA, the Agentic Reasoning Inference Accelerator — that targets the structural gaps.

The Workload: Why Agents Break GPUs

An agent does not make a single inference call. A coding agent solving a SWE-bench task executes 5-30 sequential rounds of prefill-decode-parse-tool_call-resume. A deep research agent may chain 10-50 calls across search-read-synthesize loops. Multi-agent orchestration can exceed 100 calls totaling 100K+ generated tokens. Each round pays the full time-to-first-token (TTFT) penalty, and context grows with every turn as conversation history, tool outputs, and file contents accumulate.

The end-to-end anatomy of a single agent turn:

| Stage | Typical latency | Bound by |
| --- | --- | --- |
| Tokenization | 1-5 ms | CPU (negligible) |
| Network RTT | 20-150 ms | Geography, TCP/TLS |
| Queue + scheduling | 5-50 ms | Load, batch formation |
| Prefill | 50-2,000 ms | Compute (FLOPS) |
| Decode | 10-40 ms/token | Memory bandwidth |
| Tool-call parse + dispatch | 5-20 ms | CPU, framework |
| Tool execution | 50-5,000 ms | I/O, external service |
| Resume prefill (tool result) | 50-500 ms | Compute |
Figure: Agent Turn Latency Waterfall. Stacked timeline of a single agent turn generating 500 tokens on a 70B model, ~17.7 s in total: tokenize 3 ms, network RTT 80 ms, prefill 400 ms (compute-bound), decode of 500 tokens 15 s (memory-bound), tool execution 2 s, resume prefill 200 ms (compute-bound). Decode dominates at 85% of wall-clock time. An agent chaining 20 turns pays this budget sequentially, for a cumulative latency of 4-8 minutes.

For a 20-turn coding agent on a 70B model served by H100s, the aggregate budget is roughly 4-8 minutes of wall-clock time. Prefill dominates when the agent issues short tool-call commands with long accumulated context. Decode dominates when it generates long code blocks. In both regimes, the hardware is severely underutilized.
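The single-turn budget can be tallied directly. The sketch below uses the per-stage figures from the 500-token example (30 ms/token is the decode rate implied by 15 s for 500 tokens) and assumes the stages run strictly in sequence; the names are illustrative.

```python
# Per-stage latencies (ms) for the single-turn example: 500 tokens on a 70B model.
STAGES_MS = {
    "tokenize": 3,
    "network_rtt": 80,
    "prefill": 400,
    "decode": 500 * 30,  # 500 tokens at 30 ms/token
    "tool_execution": 2_000,
    "resume_prefill": 200,
}


def turn_latency_ms(stages):
    """Total wall-clock time for one turn, assuming stages run back to back."""
    return sum(stages.values())


def stage_share(stages, name):
    """Fraction of the turn spent in one stage."""
    return stages[name] / turn_latency_ms(stages)


def chain_latency_min(stages, turns):
    """Cumulative latency in minutes for `turns` identical sequential turns."""
    return turns * turn_latency_ms(stages) / 60_000


if __name__ == "__main__":
    print(f"one turn: {turn_latency_ms(STAGES_MS) / 1000:.1f} s")
    print(f"decode share: {stage_share(STAGES_MS, 'decode'):.0%}")
    print(f"20 turns: {chain_latency_min(STAGES_MS, 20):.1f} min")
```

Real turns vary in token count and tool latency, so the 20-turn figure is a midpoint of the 4-8 minute range rather than a prediction.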

The core mismatch is quantitative. Autoregressive decode streams the entire model’s weights from HBM for every generated token. A 70B FP16 model requires reading ~140 GB per token. On an H100 at 3.35 TB/s HBM bandwidth, that caps single-stream decode at ~24 tokens/second — regardless of the GPU’s 989 TFLOPS of tensor core capacity. The arithmetic intensity during decode is ~1 FLOP/byte; GPUs are designed for 100+ FLOPs/byte. At batch size 1-8 (the agentic operating point), GPU utilization is below 10%.
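The bandwidth ceiling is simple enough to check by hand. The sketch below is a first-order roofline estimate: it assumes every generated token streams all weights once, ignores KV-cache traffic, and treats the memory system as a single pool. The function names are illustrative.

```python
def decode_tokens_per_s(params_billion, bytes_per_param, hbm_tb_per_s):
    """Upper bound on single-stream decode rate when each token reads all weights once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return hbm_tb_per_s * 1e12 / weight_bytes


def ridge_point(peak_tflops, hbm_tb_per_s):
    """Arithmetic intensity (FLOPs/byte) where a chip turns from memory-bound to compute-bound."""
    return peak_tflops / hbm_tb_per_s


def decode_utilization(batch, params_billion, bytes_per_param, hbm_tb_per_s, peak_tflops):
    """Fraction of peak compute used at the bandwidth-limited decode rate (2 FLOPs per weight per token)."""
    steps_per_s = decode_tokens_per_s(params_billion, bytes_per_param, hbm_tb_per_s)
    flops_per_s = 2 * params_billion * 1e9 * batch * steps_per_s
    return flops_per_s / (peak_tflops * 1e12)


if __name__ == "__main__":
    print(f"decode ceiling: {decode_tokens_per_s(70, 2, 3.35):.1f} tok/s")
    print(f"ridge point: {ridge_point(989, 3.35):.0f} FLOPs/byte")
    print(f"utilization at batch 8: {decode_utilization(8, 70, 2, 3.35, 989):.1%}")
```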

Five characteristics make agentic inference a distinct hardware regime, separate from the video generation workloads addressed by VDX-1 or the robotics edge constraints targeted by JEPA-R:

  1. Low effective batch size per decision step. GPUs need batch >= 32-64 to saturate compute; agents operate at batch 1-8 per reasoning step.
  2. Interleaved prefill and decode within a single request. Tool outputs trigger long prefills mid-generation, breaking the assumption of a clean prefill-then-decode pipeline.
  3. Long, growing context. Agents accumulate 32K-128K+ tokens across turns. The FP16 KV-cache for Llama 2 70B at 32K context is roughly 10 GB per sequence, which reaches tens of GB at batch 4-8 and competes with weights for bandwidth.
  4. KV-cache sharing across agents. Multi-agent systems reuse system prompts and shared context. KVCOMM (2025) demonstrates 70%+ reuse rates and up to 7.8x speedup through cross-context cache alignment.
  5. Latency on the critical path. Every token of reasoning is sequential. TTFT after a tool return matters as much as steady-state decode throughput.
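The KV-cache pressure in item 3 can be sized from model dimensions. The sketch below uses the published Llama 2 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) with FP16 values; the helper names are illustrative.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of K and V stored per token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens, sequences, **model):
    """Total KV-cache footprint in GB for `sequences` contexts of equal length."""
    return context_tokens * sequences * kv_bytes_per_token(**model) / 1e9


LLAMA2_70B_FP16 = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

if __name__ == "__main__":
    print(f"per token: {kv_bytes_per_token(**LLAMA2_70B_FP16) / 1024:.0f} KiB")
    print(f"32K context, 1 sequence: {kv_cache_gb(32_768, 1, **LLAMA2_70B_FP16):.1f} GB")
    print(f"32K context, 8 sequences: {kv_cache_gb(32_768, 8, **LLAMA2_70B_FP16):.1f} GB")
```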

GPUs also pay a hidden tax: kernel launch overhead. Each transformer layer invokes 10-20 GPU kernels, each with 3-5 microseconds of CPU-to-GPU dispatch latency. A 70B model with 80 layers runs 800-1,600 kernel launches per token, adding 2.4-8 ms of pure overhead — a substantial fraction of the 20-40 ms decode budget. CUDA Graphs partially mitigate this but cannot eliminate it. Dedicated inference hardware like Groq’s LPU avoids the dispatch loop entirely through deterministic, compiler-scheduled execution.
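The launch-overhead range is the product of three quoted figures. A minimal check, assuming launches are serialized and not hidden behind other work:

```python
def launch_overhead_ms(layers, kernels_per_layer, dispatch_us):
    """Per-token dispatch overhead in ms if every kernel launch is paid serially."""
    return layers * kernels_per_layer * dispatch_us / 1000


if __name__ == "__main__":
    low = launch_overhead_ms(80, 10, 3)
    high = launch_overhead_ms(80, 20, 5)
    print(f"overhead per token: {low:.1f}-{high:.1f} ms")
    # Best case against the slowest decode budget, worst case against the fastest.
    print(f"share of decode budget: {low / 40:.0%} to {high / 20:.0%}")
```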

The Accelerator Landscape

Eleven custom chips attack the memory wall through fundamentally different architectural bets.

Groq LPU. The most radical memory-first design. A single-core Tensor Streaming Processor on GlobalFoundries 14nm with 230 MB of on-chip SRAM (no HBM) and ~80 TB/s of on-chip bandwidth. Fully deterministic VLIW execution with no caches, buffers, or runtime arbitration — the compiler resolves all data movement at compile time. Delivers ~250-500 tokens/sec per user on Llama 3 70B class models. The weakness: no useful model fits on one chip. Mixtral 8x7B requires 576 chips (8 racks, 144 CPUs, 144 TB host RAM). Economically punishing at scale, but the latency profile is ideal for agentic loops where deterministic execution eliminates tail latency across 10-20 chained calls.

Cerebras WSE-3. Wafer-scale integration: 4 trillion transistors on a single 46,225 mm^2 TSMC 5nm wafer, 900,000 cores, 44 GB on-chip SRAM, 125 PFLOPS peak. The Cerebras Inference Service reports 969 tokens/sec on Llama 3.1 405B at 240 ms TTFT — 18x faster than GPU-served Claude 3.5 Sonnet and 12x faster than GPT-4o. At 100K context, it still sustains 539 tokens/sec. The 44 GB of on-chip SRAM eliminates HBM bandwidth as the decode bottleneck for models that fit. System cost and single-vendor dependency are the constraints.

Etched Sohu. A transformer-specific ASIC on TSMC 4nm that hardcodes the entire inference pipeline (matmul, attention, LayerNorm, softmax) into fixed-function silicon. Claims an 8-chip server generates 500,000+ tokens/sec on Llama 70B — roughly 22x an equivalent 8-GPU H100 server. If validated, the economics are transformative. The risk is architectural lock-in: if the field moves to state-space hybrids or non-transformer architectures, Sohu becomes worthless silicon. Independent benchmarks are not yet available.

Google TPU v6e (Trillium). 4.7x peak compute over v5e, 2x HBM, 2x ICI bandwidth, 256-chip pods at 99% scaling efficiency. 67% energy efficiency gain. Strong for large-model inference but locked to Google Cloud and JAX/XLA.

NVIDIA B200. 208 billion transistors on TSMC 4NP, 192 GB HBM3e at 8 TB/s. 10,614 tokens/sec on Llama 3.3 70B (FP4, TP=1). The GB200 NVL72 claims 30x inference over equivalent H100 configurations. NVIDIA explicitly targets agentic workloads with Blackwell Ultra. The CUDA ecosystem moat remains the real barrier to alternatives. For a deeper dive into the Blackwell architecture, see the dedicated article.

SambaNova SN40L. Reconfigurable dataflow with three-tier memory (SRAM + HBM + DDR5, 1.5 TB/node). Millisecond-scale model switching suits agentic routing between specialized models.

d-Matrix Corsair. SRAM-based digital in-memory compute that performs matmul where the data lives; promising, but no independent benchmarks are available yet.

Apple Silicon M4 Max. 546 GB/s unified memory, 128 GB capacity, 38 TOPS ANE at ~1.5-2 TOPS/W with near-zero idle draw. Enables the hybrid pattern: quantized 3B on-device for tool orchestration, cloud fan-out for heavy reasoning.

Others. AWS Inferentia2 (190 TFLOPS, modest latency), Tenstorrent Blackhole (RISC-V, 664 TFLOPS FP8, open SDK, GDDR6-limited), Meta MTIA and Microsoft Maia (internal-only).

Figure: Accelerator Comparison. On-chip SRAM, memory bandwidth, and single-user decode throughput across five architectures.

| Chip | Process and design | On-chip SRAM | Memory bandwidth | Single-user tok/s |
| --- | --- | --- | --- | --- |
| Groq LPU | GF 14nm, SRAM-only | 230 MB | 80 TB/s on-chip | ~500 |
| Cerebras WSE-3 | TSMC 5nm, wafer-scale | 44 GB (wafer) | On-die only | 969 |
| Etched Sohu | TSMC 4nm, fixed-function | Undisclosed | Undisclosed | 500K (claimed, unverified) |
| NVIDIA B200 | TSMC 4NP, GPU | ~50 MB | 8 TB/s HBM3e, 192 GB | 10,614 (batched) |
| ARIA | TSMC N3E, proposed agentic accelerator | 256 MB | 12 TB/s HBM4, 96 GB | 150/user (target, 16 agents) |

Memory Architecture: The Binding Constraint

Every technique in the inference optimization stack — speculative decoding, MoE sparsity, KV-cache paging, prefill/decode disaggregation — is fundamentally a strategy to extract more useful tokens per byte transferred from memory. The memory hierarchy is the single most important design decision.

Figure: ARIA Memory Hierarchy, data movement waterfall from fast/small to slow/large.

| Level | Capacity | Access | Contents | Link to next level |
| --- | --- | --- | --- | --- |
| Registers (tile-local) | 64 KB | 1 cycle, ~80 TB/s | - | On-die bus, full bisection bandwidth |
| On-chip SRAM (unified L1) | 256 MB | 3 cycles, ~40 TB/s on-die | KV-cache, activations, weight cache | HBM4 PHY, 2,048-bit interface × 6 stacks |
| HBM4 (6 stacks on-package) | 96 GB | ~12 cycles, 12 TB/s aggregate | Weights, KV spill, batch | CXL 3.0, coherent shared memory |
| CXL pool (shared across nodes) | Shared KV-cache pool | ~100+ cycles, 242 GB/s per link, hardware coherency | Dedup, COW | - |

Figure notes: there is a 170× energy gap between an FP multiply (3.7 pJ) and a DRAM read (640 pJ); 80-90% of inference energy is memory access; the CXL shared KV-cache enables deduplication across agents.

HBM4. Standardized April 2025, HBM4 doubles the interface width to 2,048 bits per stack, achieving 2 TB/s per stack (67% over HBM3e) with up to 64 GB per 16-high stack. Six HBM4 stacks provide 12 TB/s aggregate bandwidth and 96 GB capacity — roughly 4x an H100’s HBM bandwidth. This is the minimum envelope for serving 70B+ models to concurrent agents without starving the decode engine.

CXL 3.0 shared KV-cache. CXL 3.0’s coherent memory sharing enables KV-cache deduplication across agent instances. Eight coding agents sharing a system prompt and 90% of context can deduplicate for approximately 4.7x memory reduction (8 / (0.9 + 8 × 0.1)) and eliminate redundant prefill computation; reaching ~7x requires closer to 98% sharing. Hardware-managed coherency avoids the software overhead of explicit cache coordination.
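The reduction from deduplication is sensitive to how much context is actually shared. The sketch below assumes a simple model in which a fraction `shared_fraction` of each agent's KV-cache is byte-identical and stored once, while the remainder is private; both the model and the function name are illustrative.

```python
def dedup_reduction(agents, shared_fraction):
    """Memory reduction factor when the shared part of each context is stored once."""
    if not 0.0 <= shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be in [0, 1]")
    # One copy of the shared part plus every agent's private remainder.
    deduped = shared_fraction + agents * (1.0 - shared_fraction)
    return agents / deduped


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.98, 1.0):
        print(f"{fraction:.0%} shared, 8 agents: {dedup_reduction(8, fraction):.1f}x")
```

The factor can never exceed the number of agents, and it falls off quickly once private context (tool outputs, per-agent scratch work) grows.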

Processing-in-Memory. Samsung HBM-PIM and SK Hynix AiM embed modest compute directly within DRAM banks. Attention’s arithmetic intensity (~1 FLOP/byte for long sequences) maps perfectly to PIM’s strength: compute co-located with massive bandwidth. Yang et al. (Duke, 2026) demonstrate significant energy efficiency improvements for end-to-end transformer acceleration through PIM, performing attention score computation in-situ on KV-cache banks. MXFormer (2026) pushes further with charge-trap transistor CIM, achieving 3.3-60.5x improvement in compute density using MXFP4 precision.

ReRAM/MRAM for instant cold-start. Non-volatile memory with sub-2ns read access could store model weights persistently. An agent “wakes up” with weights already in place, eliminating the multi-second model load that currently gates the first request after a cold start or model switch. This is particularly relevant for edge agents and for datacenter deployments serving many models with bursty demand.

Speculative Decoding: The Latency Multiplier

Speculative decoding is the highest-impact single optimization for agentic latency, explored in detail in the dedicated SpecDecode-1 architecture proposal. A small draft model generates K candidate tokens; the large target model verifies all K in a single parallel forward pass. Rejection sampling guarantees the output distribution is identical to the target model — zero quality loss.
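How much a draft of K tokens buys depends on how often the target model agrees with it. The sketch below uses the closed form from Leviathan et al., assuming each draft token is accepted independently with probability `alpha`. The parameters `alpha`, `k`, and `draft_cost` (the draft model's time per token relative to one target pass) are illustrative inputs, not measurements.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target-model pass with k draft tokens."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha, k, draft_cost):
    """Wall-clock speedup over plain decoding, charging k draft steps plus one target pass."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)


if __name__ == "__main__":
    print(f"tokens per pass: {expected_tokens_per_pass(0.8, 4):.2f}")
    print(f"speedup: {expected_speedup(0.8, 4, 0.05):.2f}x")
```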

| Technique | Speedup | Mechanism |
| --- | --- | --- |
| Classic speculative decode (Leviathan et al., 2022) | 2-3x | Small draft model + parallel verify |
| DeepMind speculative sampling (Chen et al., 2023) | 2-2.5x | Validated at 70B scale (Chinchilla) |
| Medusa (multi-head prediction) | 2.2-3.6x | Multiple prediction heads, no draft model |
| EAGLE-2 (dynamic draft trees) | 3.05-4.26x | Runtime-configurable tree topologies |
| Sequoia (hardware-aware trees) | up to 4.04x (A100), 9.96x (offloaded) | DP-optimal tree topology per hardware profile |

The key insight from Sequoia (CMU/Together AI, 2024) is that speculation strategy must be co-designed with hardware. The optimal tree topology depends on the target platform’s compute-to-bandwidth ratio. A custom chip can implement tree verification natively: accept K draft tokens, compute target logits for all K positions, perform rejection sampling in hardware (comparing draft vs. target distributions with hardware RNG), and identify the last accepted position — all in under 100 clock cycles.
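The verification step described above can be sketched in software. This is the standard rejection-sampling rule for a linear chain of draft tokens, not Sequoia's tree variant and not a model of any hardware unit; distributions are plain Python lists over a toy vocabulary, and all names are illustrative.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Return the tokens emitted by one speculative step.

    draft_tokens: K token ids proposed by the draft model
    draft_probs:  K distributions q_i the draft tokens were sampled from
    target_probs: K + 1 distributions p_i from the target model's parallel pass
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        accepted.append(rng.choices(range(len(p)), weights=residual)[0])
        return accepted
    # Every draft token accepted: take one bonus token from the last target distribution.
    bonus = target_probs[len(draft_tokens)]
    accepted.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return accepted
```

Each step emits between 1 and K + 1 tokens, and the index of the first rejection is the "last accepted position" that a hardware unit would need to report back to the scheduler.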

Hardware requirements for native speculative decode: a dedicated draft co-processor (or small model co-located on-die), a tree attention unit with scatter-gather DMA for non-contiguous KV-cache access, a KV-cache MMU (analogous to a CPU TLB) supporting PagedAttention with copy-on-write, and MoE routing circuits for single-cycle expert dispatch in mixture-of-experts models.

Figure: Speculative Decoding Speedup Hierarchy. Verified speedup over standard autoregressive decoding (higher is faster): classic speculative decode (draft + verify, Leviathan et al.) 2-3x; Medusa (multi-head prediction, no draft model needed) 2.2-3.6x; EAGLE-2 (dynamic, runtime-configurable draft trees) 3.05-4.26x; Sequoia (tree attention with a DP-optimal topology per hardware profile) up to 9.96x. Sequoia's 9.96x was achieved in the offloaded setting (CMU/Together AI, 2024); on A100 it reaches up to 4.04x. All methods preserve exact target-model distribution via rejection sampling.

Disaggregated Inference: The Heterogeneous Thesis

The prefill phase is compute-bound (processing thousands of input tokens in parallel through dense matrix multiplications). The decode phase is memory-bandwidth-bound (streaming weights for single matrix-vector products at batch 1-8). These are fundamentally different hardware problems that GPUs handle with a single, compromised architecture.
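The split between the two phases shows up in the arithmetic intensity of a single weight-matrix multiply. The sketch below counts FLOPs and bytes for one `d_in × d_out` projection applied to `tokens` tokens at once, assuming weights and activations are each moved once at 2 bytes per value. The 8192-wide layer is an illustrative size for a 70B-class model.

```python
def gemm_intensity(tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte for multiplying `tokens` activation rows by a d_in x d_out weight matrix."""
    flops = 2 * tokens * d_in * d_out
    # Weights, input activations, and output activations, each touched once.
    bytes_moved = bytes_per_value * (d_in * d_out + tokens * d_in + tokens * d_out)
    return flops / bytes_moved


if __name__ == "__main__":
    for tokens in (1, 8, 64, 2048):
        print(f"{tokens:>5} tokens: {gemm_intensity(tokens, 8192, 8192):8.1f} FLOPs/byte")
```

At one token the intensity is about 1 FLOP/byte (decode); at a few thousand tokens it is well past the point where an H100-class part becomes compute-bound (prefill).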

Splitwise (Microsoft, 2023) separates prefill and decode onto different hardware within a cluster, transferring KV-cache state over interconnects. Result: 2.35x throughput at iso-cost. DistServe pushes further: 7.4x more requests served within latency SLOs by independently scaling prefill and decode resources. TetriInfer (2024) extends the approach with prompt chunking and predictive scheduling, achieving 97% TTFT reduction and 47% JCT improvement with 38% fewer resources.

SPAD (Princeton, 2025) takes the logical next step: physically distinct chip designs for prefill (larger systolic arrays, cost-effective memory) and decode (prioritized memory bandwidth, reduced compute). This is the most direct academic precedent for a heterogeneous agentic inference chip.

The hybrid CPU-architecture thesis adds another dimension. Apple Silicon’s M4 Max achieves 546 GB/s unified memory bandwidth with a CPU-centric design. Intel AMX delivers 60-65% latency reduction for inference workloads on server CPUs. ARM’s Scalable Matrix Extension (SME) and RISC-V vector extensions provide matrix acceleration without GPU overhead. The argument: for thin-thread agents at batch 1 (a single user’s reasoning chain), a CPU control plane with a matrix co-processor wins on latency because it eliminates kernel launch overhead, CPU-GPU data transfers, and scheduling complexity. The agent’s control flow (branching, tool dispatch, state management) runs natively on the CPU while matrix operations dispatch to co-located accelerators.

Power Efficiency: The Economic Constraint

For always-on agents, tokens per watt determines economic viability. A 70B model at batch 1 on an H100 SXM: 0.14-0.26 tok/J at 700W. B200 pushes to 0.30-0.50 tok/J. But agents are bursty — active 200-500 ms, then idle 500 ms-10 s waiting for tools. GPU idle power stays at 100-200W, wasting energy during the dominant idle phase.
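The cost of idling can be put in numbers with a duty-cycle model. The sketch below assumes one accelerator dedicated to one agent, which is a simplification since serving stacks multiplex many requests. The 150 W idle figure is the midpoint of the 100-200 W range quoted above.

```python
def average_power_w(active_w, idle_w, active_s, idle_s):
    """Time-weighted mean power over one active/idle cycle."""
    return (active_w * active_s + idle_w * idle_s) / (active_s + idle_s)


def idle_energy_fraction(active_w, idle_w, active_s, idle_s):
    """Share of the cycle's energy spent while waiting on tools."""
    idle_j = idle_w * idle_s
    return idle_j / (active_w * active_s + idle_j)


if __name__ == "__main__":
    # 500 ms of inference at 700 W, then a 5 s tool wait at 150 W idle.
    print(f"average power: {average_power_w(700, 150, 0.5, 5.0):.0f} W")
    print(f"energy spent idle: {idle_energy_fraction(700, 150, 0.5, 5.0):.0%}")
    # Same cycle with perfect power gating during the wait.
    print(f"gated average: {average_power_w(700, 0, 0.5, 5.0):.0f} W")
```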

Data movement dominates: a 32-bit DRAM read costs ~640 pJ versus ~3.7 pJ for a 32-bit FP multiply (170x). For 70B FP16 inference, 80-90% of energy goes to memory access. This ratio has only worsened at modern nodes as compute energy shrinks faster than DRAM energy.

Quantization is the strongest lever. FP16-to-INT4 cuts weight traffic 4x, yielding ~4x power savings on memory-bound decode. FP8 on H100/B200 Tensor Cores gives 2x with minimal accuracy loss. For models above 5B parameters, NF4 achieves near-FP16 accuracy with 75% memory savings; below 3B, dequantization overhead can increase energy 25-56%. NVIDIA’s 2:4 structured sparsity adds 30-36% perf/watt. MoE provides intrinsic sparsity: Mixtral activates 2/8 experts per token, so a 140B MoE model consumes power closer to 35B dense.

The power-gating advantage matters most for agents. Apple’s ANE transitions from ~0W idle to full inference in microseconds; an agent active 10% of the time averages ~0.5-1W. KAIROS (Michigan, 2026) achieves 27% power reduction through agent-granularity DVFS. ThunderAgent (2026) delivers 1.5-3.6x throughput via program-aware KV-cache scheduling of “LLM Program” control flows.

Figure: Power Efficiency, tokens per joule across platforms (70B-class model, batch-1 decode; higher is more energy-efficient).

| Platform | Generation | Power | tok/J |
| --- | --- | --- | --- |
| A100 SXM | Ampere (2020) | 400 W TDP | 0.08-0.11 |
| H100 SXM | Hopper (2022) | 700 W TDP | 0.14-0.26 |
| B200 | Blackwell (2024) | ~1000 W TDP | 0.30-0.50 |
| Apple M4 Max | Apple Silicon (2024) | ~20 W active | High (small models) |
| ARIA | Proposed | 150 W TDP | Projected target |

Figure notes: the architecture that minimizes data movement wins the power budget. M4 Max excels at sub-7B quantized models via 546 GB/s unified memory and near-zero idle draw. ARIA targets 70B at 150W with 256 MB SRAM + HBM4, power-gating idle tiles during tool-call waits.

ARIA: A Concrete Architecture Proposal

The Agentic Reasoning Inference Accelerator synthesizes these findings into a single-die chip with multi-chip scaling provisions.

Die Overview

ARIA is a ~600 mm^2 die on TSMC N3E (~200M transistors/mm^2 logic, 0.021 um^2 SRAM bitcell), comprising four major subsystems: a control cluster, heterogeneous compute engines, a large unified SRAM, and a high-bandwidth memory controller.

| Block | Area (est.) | Function |
| --- | --- | --- |
| SRAM (256 MB) | ~300 mm^2 | KV-cache, activations, weight cache |
| Matrix tiles (8x) | ~80 mm^2 | Prefill engine |
| Vector tiles (16x) | ~60 mm^2 | Decode engine |
| SFUs + control cluster | ~30 mm^2 | Non-linear ops, RISC-V cores, agent FSM |
| HBM4 PHYs (6 stacks) | ~40 mm^2 | Off-chip memory interface |
| On-die network + misc | ~30 mm^2 | Ring + mesh interconnect |
| UCIe D2D | ~10 mm^2 | Multi-chip scaling |
| Total | ~550-600 mm^2 | Within N3E reticle limit (858 mm^2) |

Total transistor count: ~50B (logic) + ~100B (SRAM) = ~150B.

Figure: ARIA Die Floorplan, top-down view (~600 mm², TSMC N3E, ~150B transistors). Blocks shown: prefill engine (8 matrix tiles MT0-MT7, 256×256 BF16 systolic arrays, ~1.57 PFLOPS BF16, ~80 mm²); decode engine (16 vector tiles V0-V15, 128-wide BF16, ~6 TFLOPS BF16, ~60 mm²); RISC-V control cluster (4× RV64GC cores, C0-C3); SFUs (softmax, TopK, SpecDec, KV-MMU); 256 MB unified SRAM (KV-cache, activations, weight cache, ~300 mm², with hardware paged KV, 4 KB pages, COW, and dedup tags); ring + mesh interconnect (~30 mm²); six 16 GB HBM4 stacks (96 GB); UCIe D2D port.

Memory System

256 MB unified on-chip SRAM. This is the core differentiator. Groq achieved 230 MB at 14nm on 725 mm^2; N3E’s ~4.6x density improvement allows comparable capacity in less area. The SRAM serves triple duty: KV-cache primary storage for the hot working set (a 70B GQA model with 8 KV heads of dimension 128 needs ~2 KB per token per layer at FP8, about 160 KB per token across 80 layers, so 256 MB holds on the order of 100K token-layer entries and colder pages spill to HBM), activation scratch space (eliminating HBM round-trips for intermediate values), and weight cache (significant reuse for sub-30B models across the batch).

The SRAM implements a hardware-managed paged KV-cache, in effect vLLM’s PagedAttention in silicon. It uses 4 KB pages (one token’s K and V for one layer at FP16 with 8 KV heads of dimension 128, or two tokens at FP8), a hardware TLB with 4K entries covering 16 MB of hot working set, copy-on-write support (forking agents create COW references to shared system-prompt KV-cache), and content-addressable deduplication tags that detect identical KV blocks across agents.
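The fork-then-diverge behavior is easier to see in a toy software model. The sketch below is a hypothetical page table with reference counts, written only to illustrate copy-on-write semantics; it does not model the TLB, dedup tags, or page payloads, and the class and method names are assumptions.

```python
class PagedKVCache:
    """Toy page table: each agent maps logical blocks to refcounted physical pages."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))
        self.refcount = {}  # physical page -> number of agents mapping it
        self.tables = {}    # agent id -> list of physical pages

    def _alloc(self):
        page = self.free.pop()
        self.refcount[page] = 1
        return page

    def create(self, agent, num_blocks):
        """Give a new agent its own freshly allocated pages."""
        self.tables[agent] = [self._alloc() for _ in range(num_blocks)]

    def fork(self, parent, child):
        """Child shares every parent page; nothing is copied yet."""
        self.tables[child] = list(self.tables[parent])
        for page in self.tables[child]:
            self.refcount[page] += 1

    def write(self, agent, block):
        """Return a page the agent may modify, remapping first if it is shared."""
        page = self.tables[agent][block]
        if self.refcount[page] > 1:
            self.refcount[page] -= 1
            page = self._alloc()  # a real implementation would also copy the payload
            self.tables[agent][block] = page
        return page

    def pages_in_use(self):
        return len(self.refcount)
```

Forking an agent with a long shared system prompt costs only reference-count updates; a new physical page is consumed only when one branch first writes to a shared block.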

96 GB HBM4 at 12 TB/s. Six stacks provide overflow: cold KV-cache pages, full model weights for 70B+ models, large-batch activations. The HBM controller includes a prefetch engine (predicts needed KV pages from agent conversation history), bandwidth partitioning (separate virtual channels for weight streaming, KV spill/fill, and activation traffic), and transparent 2:4 structured sparsity decompression on read (~2x effective bandwidth for sparse weights).

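Bandwidth analysis at the operating point (batch = 8 agents, each decoding one token per step on a 70B FP8 model with 8K context): each decode step streams ~70 GB of weights once for the whole batch, and each agent reads its own KV-cache, about 1.3 GB at FP8 assuming an 80-layer GQA model with 8 KV heads of dimension 128 (~160 KB per cached token). At the 100 tok/s/user target that is ~7 TB/s for weights plus ~1 TB/s for KV, which fits inside the 12 TB/s envelope but with limited headroom. The design goal is to bring effective HBM demand down to 2-4 TB/s by keeping hot KV pages in the 256 MB SRAM, decompressing 2:4 sparse weights on read, and amortizing each weight pass over several tokens through speculative verification, which leaves margin for prefill bursts.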
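The raw bandwidth demand at this operating point can be recomputed from first principles. The sketch below assumes nothing is cached on-die, that weights are streamed once per decode step for the whole batch, and that the model is an 80-layer GQA design with 8 KV heads of dimension 128 storing FP8 KV values. These dimensions are assumptions about a generic 70B-class model, and the results move with them.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=1):
    """Bytes of K and V per cached token across all layers (FP8 by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def decode_hbm_demand_tb_s(weight_bytes, context_tokens, agents, steps_per_s):
    """Return (weight, kv) traffic in TB/s with no on-die caching."""
    weights = weight_bytes * steps_per_s
    kv = kv_bytes_per_token() * context_tokens * agents * steps_per_s
    return weights / 1e12, kv / 1e12


if __name__ == "__main__":
    w, kv = decode_hbm_demand_tb_s(70e9, 8_192, agents=8, steps_per_s=100)
    print(f"weights: {w:.1f} TB/s, KV: {kv:.2f} TB/s, total: {w + kv:.1f} TB/s")
```

Because weight traffic scales with decode steps per second rather than with the number of agents, any technique that emits more tokens per weight pass lowers the dominant term directly.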

Compute Engines

Prefill engine: 8 matrix tiles. Each tile contains a 256x256 BF16 systolic array (131K FLOPs/cycle) with 512 KB local SRAM. At 1.5 GHz: ~196 TFLOPS BF16 per tile, ~1.57 PFLOPS BF16 aggregate (~3.14 POPS INT8). Comparable to H100 tensor core throughput. Handles the compute-bound phase: processing tool outputs, ingesting long context, computing initial KV-cache. The 8 tiles can partition flexibly — all 8 for a single large prefill, or split across concurrent smaller prefills from different agents.

Decode engine: 16 vector tiles. Each tile is a 128-wide BF16 vector unit with 1 MB local SRAM. Deliberately modest compute (~384 GFLOPS BF16 per tile, ~6 TFLOPS aggregate) because decode at batch 1-8 degenerates into matrix-vector products where weights stream from memory and activations fit in registers. The tiles connect to SRAM banks at full bisection bandwidth. Vector units include fused multiply-add, FP8/FP4 support, and fast reduction trees for attention score accumulation.
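The peak figures for both engines follow from counting one fused multiply-add as two FLOPs per systolic cell or vector lane per cycle. A minimal check at the stated 1.5 GHz clock:

```python
def systolic_tflops(rows, cols, ghz):
    """Peak TFLOPS of a rows x cols MAC array (2 FLOPs per cell per cycle)."""
    return 2 * rows * cols * ghz * 1e9 / 1e12


def vector_gflops(lanes, ghz):
    """Peak GFLOPS of a SIMD unit with one FMA per lane per cycle."""
    return 2 * lanes * ghz


if __name__ == "__main__":
    tile = systolic_tflops(256, 256, 1.5)
    print(f"matrix tile: {tile:.1f} TFLOPS, 8 tiles: {8 * tile / 1000:.2f} PFLOPS")
    lane = vector_gflops(128, 1.5)
    print(f"vector tile: {lane:.0f} GFLOPS, 16 tiles: {16 * lane / 1000:.2f} TFLOPS")
```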

Heterogeneous scheduling. The control cluster dynamically assigns work based on phase: pure decode power-gates the matrix tiles (saving ~30W), pure prefill activates all matrix tiles while decode tiles handle low-priority concurrent decode, and the agentic common case runs mixed — matrix tiles prefill returning tool output while vector tiles continue decode for other agents.

Special Function Units

Non-linear operations (softmax, LayerNorm, GeLU/SiLU, TopK) consume 5-15% of end-to-end GPU latency despite comprising less than 1% of FLOPs. ARIA includes dedicated hardware:

Softmax/LayerNorm engine. Streaming pipelined architecture processing a row of attention scores in a single pass with online (numerically stable) softmax. 16 independent lanes, each handling one head or one batch element.
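Online softmax keeps a running maximum and a running normalizer so that the statistics are gathered in one streaming pass without overflow. The sketch below is the standard software formulation; it needs a second pass to emit probabilities, whereas fused attention kernels fold that step into the output accumulation. It is an illustration of the algorithm, not of the hardware pipeline.

```python
import math


def online_softmax(scores):
    """Numerically stable softmax using a running max and a rescaled running sum."""
    running_max = float("-inf")
    running_sum = 0.0
    for s in scores:
        new_max = max(running_max, s)
        # Rescale the accumulated sum to the new maximum before adding the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(s - new_max)
        running_max = new_max
    return [math.exp(s - running_max) / running_sum for s in scores]


if __name__ == "__main__":
    # Scores this large would overflow a naive exp(); the running max prevents that.
    print(online_softmax([1000.0, 1001.0, 1002.0]))
```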

Speculative decode verification unit. Tree-structured verification logic following Sequoia’s framework. Accepts K draft tokens from an on-die or co-packaged draft model plus the target model’s logits for all K positions. Performs rejection sampling in hardware with hardware RNG. Configurable tree topology loaded from the control cluster. Target: verify a tree of 8-16 speculated tokens in under 100 clock cycles.

TopK/sampling unit. Hardware TopK extraction (top-50, top-100) from vocabulary-sized logit vectors (32K-256K entries). Temperature scaling, top-p nucleus sampling, repetition penalty in a streaming pipeline. Avoids the GPU pattern of sorting the full vocabulary.
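The same pipeline can be sketched in software with a bounded heap in place of a full sort. In this illustrative version the nucleus cut is applied to the distribution renormalized over the top-k survivors rather than over the full vocabulary. That choice is a simplification, and the function name and defaults are assumptions.

```python
import heapq
import math
import random


def sample_token(logits, k=50, top_p=0.9, temperature=1.0, rng=random):
    """Temperature-scaled top-k then top-p sampling over a list of logits."""
    # Top-k with a bounded heap: O(V log k) instead of sorting all V entries.
    top = heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])
    # Softmax weights over the survivors, stabilized by subtracting the largest logit.
    peak = top[0][1]
    weights = [math.exp((logit - peak) / temperature) for _, logit in top]
    total = sum(weights)
    # Nucleus cut: keep the smallest prefix whose probability mass reaches top_p.
    kept, mass = [], 0.0
    for (token, _), w in zip(top, weights):
        kept.append((token, w))
        mass += w / total
        if mass >= top_p:
            break
    tokens, kept_weights = zip(*kept)
    return rng.choices(tokens, weights=kept_weights)[0]
```

With `k=1` this reduces to greedy decoding; repetition penalties, which the text also mentions, would be applied to the logits before the top-k step.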

KV-cache management unit. Page table walker, TLB, COW logic. Eviction policy engine implementing LRU with attention-score-weighted priority (pages whose tokens receive high attention scores are retained; decaying-score pages spill to HBM). Compression/decompression for FP8 KV quantization on spill/fill.

ISA: Two-Tier Programmability

ARIA is not a fixed-function ASIC. It must handle SSMs (Mamba), mixture-of-experts routing, multi-modal cross-attention, retrieval-augmented generation, and architectures that do not yet exist.

Tile-level ISA (VLIW with transformer extensions): Matrix ops (MMUL.BF16/FP8/FP4, batched GEMV), attention primitives (ATTN.SCORE, ATTN.WEIGHT, ATTN.FLASH for fused tiled attention), reduction ops (REDUCE.SUM/MAX/TOPK), page-based memory ops (LOAD.PAGE, PREFETCH.KV, STREAM.WEIGHT), speculative control (SPEC.DRAFT, SPEC.VERIFY, BATCH.YIELD), non-linear dispatch (SOFTMAX.ROW, LAYERNORM, GELU, SILU, ROPE), and first-class MoE support (MOE.GATE for top-K expert selection, MOE.SCATTER, MOE.GATHER).

Control cluster ISA: Four RISC-V RV64GC cores with custom extensions for agent state machine management (fork/join agents, track conversation turns), batch scheduling (add/remove sequences per iteration — the hardware realization of Orca’s iteration-level scheduling), power management (tile power gating, DVFS per cluster following KAIROS-style agent-granularity power policy), and multi-chip coordination (message passing over UCIe D2D).

Multi-Chip Scaling

For models exceeding single-die capacity (>70B at FP8, or >30B with large batch and long context):

Interconnect: UCIe D2D port at 256 GB/s bidirectional per die. For 4-8 chip servers, a proprietary coherent ring provides 512 GB/s chip-to-chip bandwidth with cache coherency for KV-cache pages across dies.

Scaling targets:

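| Configuration | Chips | Model size (FP8) | Concurrent agents @ 32K | Target tok/s/user |
| --- | --- | --- | --- | --- |
| Single die | 1 | 30B | 16 | 150 |
| Quad | 4 | 70B | 32 | 120 |
| Octet | 8 | 200B | 64 | 80 |
| Rack (32) | 32 | 405B+ | 256 | 60 |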

Parallelism strategy: tensor parallelism for 2-8 chips (all-reduce after each layer; at 512 GB/s, a 70B model across 4 chips adds ~50 us per layer, or ~4 ms across 80 layers), pipeline parallelism for 8+ chips, and expert parallelism for MoE models.

Comparison

| Dimension | NVIDIA B200 | Groq LPU | Etched Sohu | Google TPU v5p | ARIA |
| --- | --- | --- | --- | --- | --- |
| Process | TSMC 4NP | GF 14nm | TSMC 4nm | Undisclosed | TSMC N3E |
| On-chip SRAM | ~50 MB | 230 MB | Unknown | ~50 MB (est.) | 256 MB |
| HBM | 192 GB HBM3e, 8 TB/s | None | 144 GB HBM3e | 95 GB, 2.7 TB/s | 96 GB HBM4, 12 TB/s |
| Peak BF16 | ~2.5 PFLOPS | ~750 TOPS INT8 (~188 TFLOPS FP16) | Unknown | 459 TFLOPS | ~1.6 PFLOPS |
| Low-batch decode | Poor (<10% util at batch 1) | Excellent | Excellent | Moderate | Excellent |
| Spec decode | Software only | Not supported | Hardware (claimed) | Software only | Hardware SFU |
| KV-cache mgmt | Software (vLLM) | Compiler | Unknown | Software | Hardware paged + COW + dedup |
| Programmability | Full CUDA | VLIW, limited | Fixed-function | XLA | VLIW + RISC-V |
| Sweet spot | Training + high-batch | Ultra-low-latency single-stream | Transformer-only | Training + batch serving | Agentic: low-batch, long-context, multi-turn |

ARIA occupies the space between Groq’s SRAM-only radicalism and NVIDIA’s general-purpose flexibility. It matches Groq-class on-chip SRAM (256 MB vs. 230 MB) but adds HBM4 for capacity, retains programmability for model evolution, and adds agentic-specific hardware that no existing chip provides.

Manufacturing and Economics

Process: TSMC N3E (relaxed design rules over N3, better yields, ~5% density penalty). At ~0.1 defects/cm^2 (mature process), a 600 mm^2 die yields ~55-60% by the Poisson model — aggressive but comparable to the A100 at 826 mm^2 on 7nm.
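The yield and die-count figures can be reproduced with the Poisson model, Y = exp(-A × D0), and a common first-order estimate of gross dies per 300 mm wafer. The edge-loss term in `gross_dies_per_wafer` is an approximation, and the function names are illustrative.

```python
import math


def poisson_yield(die_area_mm2, defects_per_cm2):
    """Fraction of dies with zero defects under the Poisson yield model."""
    return math.exp(-(die_area_mm2 / 100.0) * defects_per_cm2)


def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Wafer area over die area, minus an approximate edge-loss term."""
    radius = wafer_diameter_mm / 2
    return int(math.pi * radius**2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))


def cost_per_good_die(wafer_cost_usd, die_area_mm2, defects_per_cm2):
    """Wafer cost spread over the dies that survive."""
    good = gross_dies_per_wafer(die_area_mm2) * poisson_yield(die_area_mm2, defects_per_cm2)
    return wafer_cost_usd / good


if __name__ == "__main__":
    print(f"yield: {poisson_yield(600, 0.1):.1%}")
    print(f"gross dies per wafer: {gross_dies_per_wafer(600)}")
    print(f"cost per good die at $20K/wafer: ${cost_per_good_die(20_000, 600, 0.1):.0f}")
```

Redundant SRAM banks and spare tiles raise effective yield above the Poisson figure, because a die with a repairable defect is no longer a loss.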

BOM per packaged chip:

| Component | Cost |
| --- | --- |
| Die (silicon, ~50 good dies/wafer at $20-25K/wafer) | $400-500 |
| HBM4 (6 stacks, 96 GB) | $1,500-2,500 |
| CoWoS packaging | $800-1,200 |
| Substrate + passives + test | $200-400 |
| Total BOM | $2,900-4,600 |

Compare: NVIDIA B200 BOM is estimated at $3,000-5,000; Groq LPU at ~$1,500-2,000 (no HBM) but requiring 8-576x more chips per model.

NRE: $465-710M total (architecture + RTL $200-300M, verification $50-80M, physical design $40-60M, masks $15-20M, IP licensing $30-50M, EDA $20-30M, prototype silicon $30-50M, software stack $80-120M). Timeline: first tapeout at month 24, production-ready at month 36-42.

Break-even: At $10,000-15,000 ASP with $3,000-4,500 COGS, gross margin is ~$6,500-10,500 per chip. Break-even on $600M NRE requires 57,000-92,000 chips. At 50,000 chips/year (plausible if agentic workloads scale as projected), that is roughly 1.1-1.9 years of volume shipments after production start.
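The break-even range is a single division per scenario. The sketch below uses the NRE midpoint and the per-chip margin range quoted above; it ignores the time value of money, ongoing R&D, and the volume ramp.

```python
import math


def break_even_units(nre_usd, margin_per_chip_usd):
    """Chips that must ship before cumulative gross margin covers NRE."""
    return math.ceil(nre_usd / margin_per_chip_usd)


def break_even_years(nre_usd, margin_per_chip_usd, chips_per_year):
    """Years of steady-state shipments needed to recover NRE."""
    return break_even_units(nre_usd, margin_per_chip_usd) / chips_per_year


if __name__ == "__main__":
    for margin in (10_500, 6_500):
        units = break_even_units(600_000_000, margin)
        years = break_even_years(600_000_000, margin, 50_000)
        print(f"margin ${margin:,}: {units:,} chips, {years:.2f} years")
```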

Risks

Model architecture shift. If transformers are supplanted by a fundamentally different architecture, the SFUs become dead silicon. Mitigation: SFUs are less than 5% of die area, and the programmable tiles handle any architecture through recompilation.

HBM4 availability. Specs finalized April 2025, but volume production may slip to late 2026-2027. The design must be HBM3e-compatible as fallback (8 TB/s, 64 GB — still viable, just tighter on bandwidth margin).

The CUDA moat. ARIA needs an MLIR-based compiler targeting the tile ISA, a runtime compatible with vLLM/SGLang/TensorRT-LLM APIs, and PyTorch/JAX integration. The $80-120M software investment is not optional — it is table stakes.

NVIDIA’s trajectory. Blackwell Ultra explicitly targets agentic workloads. The window of architectural advantage may be narrow. ARIA’s bet is that purpose-built heterogeneity (separate prefill/decode engines, hardware KV-cache management, native spec-decode) captures structural gains that general-purpose GPU evolution cannot match.

Yield at 600 mm^2. A yield-killing defect wastes ~$400 per defective die at volume. Redundant SRAM banks and spare tiles are essential design-for-yield measures.

The Unifying Principle

Every optimization in this design — speculative decoding, MoE sparsity, KV-cache paging with COW, prefill/decode disaggregation, heterogeneous memory — is a strategy to extract more useful tokens per byte moved from memory. DRAM access energy has improved far less than compute energy across every node transition; the ratio has only worsened. The architecture that makes memory movement the first-class design constraint, rather than peak FLOPS, wins the agentic inference workload.

ARIA is a concrete instantiation: 256 MB SRAM to keep the hot working set on-die, 12 TB/s HBM4 for overflow, heterogeneous engines matched to the prefill/decode split, hardware acceleration for speculative verification, paged KV-cache with deduplication, agent state machines, and programmable tiles that survive architectural evolution. The chip that makes a 70B-parameter agent respond in 80 ms per token with 32K context, 16 concurrent agents, at 150W.

Additional Reading