InferBench: Inference ASIC Benchmark Suite — Deep Research

why every existing benchmark is wrong

Benchmark	Problem
MLPerf	Favors GPUs (software maturity dominates), no cost/power normalization, burdensome submission
NVIDIA benchmarks	Vendor-controlled, TDP-based power (inflates efficiency), non-reproducible TensorRT configs
Cerebras	No power reporting (WSE-3 draws ~15kW), cost comparison nonsensical ($2M+ vs $30K H100)
Groq	tokens/sec/dollar uses API pricing (includes margin), LPU is fixed-function
SemiAnalysis InferenceMAX	Proposal only, no tool, no simulator, relies on vendor numbers

The gap: No benchmark separates algorithmic efficiency from hardware capability from software maturity.

workload characterization (Real numbers)

LLM prefill — llama 3 70B, S=2048

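Per token per layer (FLOPs, counting 2 per multiply-accumulate): Q proj 134M + K/V proj 33.6M + O proj 134M + FFN (gate+up+down) 1.41B + attention 32,768 x S. The projections and FFN sum to 1.71B per layer, or 137 GFLOP over 80 layers; attention at S=2048 adds about 5 GFLOP, for roughly 142 GFLOP per token.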

Full prefill S=2048: 2048 x 142B = 291 TFLOP

Weights: 140 GB (loaded once if weight-stationary)
Arithmetic intensity: ~1,300 FLOP/byte — compute-bound
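
The per-layer terms above can be reproduced from the model shape. This is a minimal sketch, assuming the publicly documented Llama 3 70B dimensions (hidden size 8192, 64 query heads, 8 KV heads, head dimension 128, FFN width 28672, 80 layers). None of these are stated in this note, and the names `LlamaShape` and `prefill_flops` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlamaShape:
    # Assumed Llama 3 70B dimensions (not stated in the note itself).
    hidden: int = 8192
    n_heads: int = 64
    n_kv_heads: int = 8
    head_dim: int = 128
    ffn: int = 28672
    n_layers: int = 80


def linear_flops_per_token_layer(s: LlamaShape) -> int:
    """Projection + FFN FLOPs for one token in one layer (2 FLOPs per MAC)."""
    q = 2 * s.hidden * s.n_heads * s.head_dim
    kv = 2 * 2 * s.hidden * s.n_kv_heads * s.head_dim  # K and V together
    o = 2 * s.n_heads * s.head_dim * s.hidden
    ffn = 3 * 2 * s.hidden * s.ffn  # gate, up, down
    return q + kv + o + ffn


def attention_flops_per_token_layer(s: LlamaShape, seq_len: int) -> int:
    """QK^T plus attention-weighted V against seq_len keys (no causal discount)."""
    return 2 * 2 * s.n_heads * s.head_dim * seq_len


def prefill_flops(s: LlamaShape, seq_len: int) -> int:
    """Total FLOPs to prefill seq_len tokens across all layers."""
    per_token = s.n_layers * (
        linear_flops_per_token_layer(s) + attention_flops_per_token_layer(s, seq_len)
    )
    return seq_len * per_token


shape = LlamaShape()
print(f"linear per layer: {linear_flops_per_token_layer(shape) / 1e9:.2f} GFLOP")
print(f"prefill S=2048:   {prefill_flops(shape, 2048) / 1e12:.0f} TFLOP")
```

The attention term is charged at the full sequence length for every token, which matches the 142B-per-token figure used here; a causal-mask discount would roughly halve that term.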

LLM decode — llama 3 70B

Batch Size	FLOPs/step	AI (FLOP/byte)	Bound on A100
1	137 GFLOP	0.97	Memory BW
8	1.1 TFLOP	7.8	Memory BW
32	4.4 TFLOP	31	Memory BW
128	17.5 TFLOP	124	Memory BW
256	35 TFLOP	246	Compute

At B=1: A100 theoretical compute utilization = 0.6% (0.97 FLOP/byte against a ridge point of roughly 156 FLOP/byte). Each step loads 140 GB of weights to perform 137 GFLOPs. Batch size is everything.
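
The batch-size dependence can be sketched directly from the two numbers in the text. This is a simplified model that counts only weight traffic, so its intensities run slightly above the table at large batch. The A100 peak (312 TFLOPS dense FP16) and bandwidth (2 TB/s) are assumptions not stated in this section, and the function names are illustrative.

```python
WEIGHT_BYTES = 140e9       # FP16 weights, from the text
FLOPS_PER_TOKEN = 137e9    # per decode step at B=1, from the text
A100_PEAK_FLOPS = 312e12   # assumption: dense FP16 tensor-core peak
A100_HBM_BW = 2.0e12       # assumption: ~2 TB/s HBM bandwidth


def decode_intensity(batch: int) -> float:
    """FLOP per weight byte; ignores KV-cache and activation traffic."""
    return batch * FLOPS_PER_TOKEN / WEIGHT_BYTES


def bound(batch: int) -> str:
    """Classify a decode step against the roofline ridge point."""
    ridge = A100_PEAK_FLOPS / A100_HBM_BW  # ~156 FLOP/byte
    return "compute" if decode_intensity(batch) > ridge else "memory"


for b in (1, 8, 32, 128, 256):
    ai = decode_intensity(b)
    util_cap = min(1.0, ai / (A100_PEAK_FLOPS / A100_HBM_BW))
    print(f"B={b:<4} AI={ai:7.2f} FLOP/byte  {bound(b)}-bound  util <= {util_cap:.1%}")
```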

KV cache per sequence at S=8192: 2.68 GB at FP16 (320 KiB per token). Batch 256 at S=8192: 687 GB, far beyond the 80 GB of an H100.
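
The KV-cache footprint follows from the attention shape. This is a minimal sketch, assuming grouped-query attention with 8 KV heads of dimension 128 over 80 layers at FP16 (assumed values, chosen to agree with the K/V projection size quoted earlier). The result lands within a few percent of the per-sequence figure above, depending on whether GB means 10^9 or 2^30 bytes.

```python
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """K and V for one token across all layers (shape values are assumptions)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(seq_len, batch=1, **shape):
    """Total KV-cache bytes for `batch` sequences of length `seq_len`."""
    return batch * seq_len * kv_bytes_per_token(**shape)


def max_batch(hbm_bytes, seq_len, weight_bytes=0, **shape):
    """Largest batch whose KV cache fits beside the weights; 0 if nothing fits."""
    free = hbm_bytes - weight_bytes
    return max(0, free // kv_cache_bytes(seq_len, **shape))


print(f"per sequence: {kv_cache_bytes(8192) / 1e9:.2f} GB")
print(f"batch 256:    {kv_cache_bytes(8192, batch=256) / 1e9:.0f} GB")
print(f"max batch in 80 GB, ignoring weights: {max_batch(80_000_000_000, 8192)}")
```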

diffusion — SD3 medium (2B params)

24 MMDiT blocks, 4429 tokens (4096 image + 333 text)
Per denoising step: 8.9 TFLOPs
50 steps: 445 TFLOPs
Flux.1 (12B): 28 steps = 1,456 TFLOPs
Weights reused 28-50x — more compute-bound than LLM decode

MoE — mixtral 8x7B

32.2B FLOPs per token (top-2 routing activates only 12.9B of the 46.7B parameters)
At B=1 per expert: GEMV territory, AI ~0.34 FLOP/byte
Even more batch-sensitive than dense models

vision — ViT-L (307M)

122.7 GFLOPs per image (197 tokens)
AI ~143 FLOP/byte — balanced at A100 ridge point
Real-time at 30fps easily achievable on A100 (about 0.39 ms per image at peak throughput, against a 33 ms frame budget)

architecture models

systolic array (TPU-like)

peak_ops = n_arrays x N^2 x 2 x f_clk
utilization(M) = min(1, M/N)  -- M=1 gives 0.78% util on 128x128 array

Weight-stationary: load weights once, stream activations. 90%+ util for M >= 256.
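
The two formulas above translate directly into code. This is a sketch of the idealized model only: it ignores pipeline fill and drain, which is presumably why the text quotes 90%+ rather than 100% for large M. The example array configuration (4 arrays of 128x128 at 1 GHz) is hypothetical.

```python
def systolic_peak_ops(n_arrays: int, n: int, f_clk_hz: float) -> float:
    """Peak ops/s: each of the N x N cells does one MAC (2 ops) per cycle."""
    return n_arrays * n * n * 2 * f_clk_hz


def systolic_utilization(m: int, n: int) -> float:
    """Fraction of array rows kept busy when streaming M activation rows."""
    return min(1.0, m / n)


# Hypothetical configuration, for illustration only.
peak = systolic_peak_ops(n_arrays=4, n=128, f_clk_hz=1e9)
for m in (1, 32, 128, 256):
    u = systolic_utilization(m, 128)
    print(f"M={m:<4} util={u:7.2%}  effective={peak * u / 1e12:7.2f} TOPS")
```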

SIMT (GPU-like)

Roofline model:

if AI > peak_ops/hbm_bw: t = FLOPs / (peak x util)  -- compute bound
else: t = bytes / hbm_bw                              -- memory bound

Utilization lookup (empirical from cuBLAS): M<32: 15%, M=128-512: 70%, M>4096: 85%
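
A runnable version of the roofline rule and the utilization lookup might look like the following. The lookup in the text leaves gaps (32 to 128, 512 to 4096); this sketch fills them by interpolating linearly in log2(M), which is an assumption rather than something the cuBLAS data is known to support. Function names are illustrative.

```python
import math

# Anchor points taken from the lookup above; values in between are interpolated.
_ANCHORS = [(32, 0.15), (128, 0.70), (512, 0.70), (4096, 0.85)]


def gemm_utilization(m: int) -> float:
    """Empirical GEMM utilization as a function of the M dimension."""
    if m <= _ANCHORS[0][0]:
        return _ANCHORS[0][1]
    if m >= _ANCHORS[-1][0]:
        return _ANCHORS[-1][1]
    for (m0, u0), (m1, u1) in zip(_ANCHORS, _ANCHORS[1:]):
        if m0 <= m <= m1:
            t = (math.log2(m) - math.log2(m0)) / (math.log2(m1) - math.log2(m0))
            return u0 + t * (u1 - u0)
    raise AssertionError("unreachable: anchors cover the whole range")


def roofline_time(flops, bytes_moved, peak_ops, hbm_bw, util):
    """Execution time in seconds under the two-branch roofline rule."""
    intensity = flops / bytes_moved
    if intensity > peak_ops / hbm_bw:
        return flops / (peak_ops * util)  # compute bound
    return bytes_moved / hbm_bw           # memory bound


# B=1 decode with the A100 figures assumed elsewhere in this note.
t = roofline_time(137e9, 140e9, 312e12, 2.0e12, gemm_utilization(1))
print(f"B=1 decode step: {t * 1e3:.1f} ms")
```

Because utilization only enters the compute-bound branch, the predicted time jumps at the ridge point whenever util is below 1. Taking the maximum of the two branch times is a common alternative if a continuous model is preferred.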

Power: P = P_idle + alpha x GFLOPS + beta x GB/s (A100: 60W + 0.2 mW/GFLOP + 50 mW/(GB/s))
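
The linear power model is simple enough to express as one function. The defaults below are the A100 coefficients quoted above and should be treated as calibration starting points; the parameter names are illustrative.

```python
def power_watts(gflops, gb_per_s,
                p_idle_w=60.0, alpha_mw_per_gflops=0.2, beta_mw_per_gbps=50.0):
    """P = P_idle + alpha * GFLOPS + beta * GB/s, with alpha and beta in milliwatts."""
    dynamic_mw = alpha_mw_per_gflops * gflops + beta_mw_per_gbps * gb_per_s
    return p_idle_w + dynamic_mw / 1000.0


# Idle, then an arbitrary operating point (1,000 GFLOPS at 100 GB/s).
print(power_watts(0, 0))
print(power_watts(1000, 100))
```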

In-Memory compute (d-Matrix-like)

ADC is the bottleneck. 256 ADCs at 1 GHz = 131 TOPS per tile (consistent with each conversion reading out a 256-row column: 256 MACs, or 512 ops).
Precision-throughput tradeoff: n_passes = ceil(input_bits/dac_bits) x ceil(effective_bits/adc_bits)
10-50x energy efficiency over digital for INT4 weights. FP16 multi-pass negates the benefit.
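
The pass-count formula and its effect on tile throughput can be sketched as follows. The 256-row column depth is an assumption chosen to reproduce the 131 TOPS figure, and the bit widths in the example are hypothetical.

```python
import math


def n_passes(input_bits, dac_bits, effective_bits, adc_bits):
    """Passes needed when inputs and outputs exceed DAC/ADC resolution."""
    return math.ceil(input_bits / dac_bits) * math.ceil(effective_bits / adc_bits)


def tile_tops(n_adcs, f_hz, rows_per_column, passes=1):
    """Tile throughput in TOPS; each conversion yields rows_per_column MACs."""
    return n_adcs * f_hz * rows_per_column * 2 / passes / 1e12


single = tile_tops(256, 1e9, 256)
# Hypothetical widths: 16-bit inputs through a 4-bit DAC, 16-bit results through an 8-bit ADC.
multi = tile_tops(256, 1e9, 256, passes=n_passes(16, 4, 16, 8))
print(f"single pass: {single:.0f} TOPS, 16-bit multi-pass: {multi:.1f} TOPS")
```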

dataflow (Cerebras-like)

When weights fit on-chip (44 GB SRAM on WSE-3): zero weight loading, decode becomes compute-bound. Llama 3 70B at FP16 (140 GB): does NOT fit. Llama 3 8B (16 GB): FITS. Power: ~15 kW for inference is extreme.

reconfigurable (CGRA/Fractile-like)

Operator fusion advantage: GEMM+bias+ReLU+LayerNorm fused, intermediates stay on-chip. Reconfiguration overhead: ~10us per config, 80 layers x 3 groups = 2.4ms — significant for decode.

worked example: B200 vs Groq LPU vs TPU v5p on Llama 70B decode

The architecture models above are abstract. To show why they matter, we run a concrete comparison: single-user (B=1) autoregressive decode of Llama 3 70B across three fundamentally different hardware architectures. Every number below follows directly from the roofline framework.

NVIDIA B200 (SIMT / GPU)

Specs from Blackwell Architecture: 8 TB/s HBM3e bandwidth, 192 GB HBM capacity, 2,250 TOPS FP8, 1,000W TDP.

At B=1, decode is a sequence of GEMVs. The model is 70B parameters = 70 GB at FP8. Every token requires loading the full weight tensor once.

Latency: 70 GB / 8 TB/s = 8.75 ms/tok = 114 tok/s (theoretical peak — actual throughput is lower due to kernel launch overhead and scheduler latency)
Arithmetic intensity: 137 GFLOP / 70 GB = 1.96 FLOP/byte
Ridge point: 2,250 TOPS / 8 TB/s = 281 FLOP/byte (assumes simultaneous peak compute and peak bandwidth, which is not achievable in practice)
Compute utilization: 1.96 / 281 = 0.7%

The B200 delivers 114 tok/s, which sounds fast, but 99.3% of its compute capacity sits idle. The 2,250 TOPS of FP8 capability is irrelevant here: we are firmly on the memory-bandwidth slope of the roofline. Power draw during memory-bound GEMV workloads is roughly 600-700W (below TDP but still substantial), yielding approximately 0.16-0.19 tok/s/W.
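
The B200 figures above all derive from five inputs, which the following sketch makes explicit. The function name `decode_roofline` is illustrative, and the 650W operating point is the midpoint of the range quoted in the text.

```python
def decode_roofline(weight_bytes, flops_per_token, peak_ops, mem_bw, power_w):
    """B=1 decode estimates, assuming each token streams all weights once."""
    latency_s = weight_bytes / mem_bw
    tok_per_s = 1.0 / latency_s
    intensity = flops_per_token / weight_bytes
    ridge = peak_ops / mem_bw
    return {
        "latency_ms": latency_s * 1e3,
        "tok_per_s": tok_per_s,
        "intensity": intensity,           # FLOP/byte
        "ridge": ridge,                   # FLOP/byte
        "compute_util": intensity / ridge,
        "tok_per_s_per_w": tok_per_s / power_w,
    }


b200 = decode_roofline(
    weight_bytes=70e9,       # 70B parameters at FP8
    flops_per_token=137e9,
    peak_ops=2250e12,        # FP8 peak as quoted in the text
    mem_bw=8e12,
    power_w=650,
)
for key, value in b200.items():
    print(f"{key:>16}: {value:.4g}")
```

The same function applies to any bandwidth-bound device by swapping the inputs, which is what makes the cross-architecture comparison mechanical rather than anecdotal.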

Groq LPU (Dataflow / SRAM-only)

Note: this compares single-chip Groq against single-chip B200; Groq deploys large models across multiple chips, which changes the comparison.

Groq’s Language Processing Unit uses ~230 MB of on-chip SRAM with an estimated ~80 TB/s internal bandwidth (no HBM). The architecture eliminates the memory wall entirely — but only when the model fits.

Llama 3 70B at FP16 (140 GB): Does not fit. Period. Even at FP8 (70 GB), the model is 300x larger than available SRAM. Groq would need to shard across hundreds of chips with inter-chip bandwidth becoming the new bottleneck, or simply cannot serve this model competitively.

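Llama 3 8B at FP8 (8 GB): still about 35x larger than one chip’s ~230 MB, so it fits only in the pooled SRAM of a multi-chip deployment. Treating that deployment as a single device with ~80 TB/s of SRAM bandwidth (a simplification that ignores inter-chip links), the arithmetic changes dramatically: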

Latency: 8 GB / 80 TB/s = 0.1 ms/tok = ~10,000 tok/s
This is ~88x faster than B200 on Llama 70B, but it is a different model.

The lesson: Groq’s architecture delivers extraordinary throughput when the model fits, and nothing when it does not. There is no graceful degradation — the performance cliff is binary.

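Power per LPU chip is estimated at ~300W. Charged against a single chip, energy efficiency looks remarkable: ~33 tok/s/W. Because 8 GB of weights needs the SRAM of roughly 35 chips, the per-deployment figure would be correspondingly lower. The $/tok comparison is also misleading because the LPU cannot serve the same model.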
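
How many SRAM-only chips a model needs is a one-line calculation. This sketch counts weights only (no KV cache or activations) and uses the ~230 MB per-chip figure from the text; `chips_needed` is an illustrative name.

```python
def chips_needed(model_bytes: int, sram_bytes_per_chip: int = 230_000_000) -> int:
    """Minimum chips whose combined SRAM holds the weights (ceiling division)."""
    return -(-model_bytes // sram_bytes_per_chip)


for name, size in (("Llama 3 70B FP16", 140_000_000_000),
                   ("Llama 3 70B FP8", 70_000_000_000),
                   ("Llama 3 8B FP8", 8_000_000_000)):
    print(f"{name:<18} {chips_needed(size):>4} chips")
```

Power and cost for an SRAM-only design scale with this chip count, so it is the natural unit for any per-deployment efficiency figure.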

Google TPU v5p (Systolic array)

Specs: 2.76 TB/s HBM bandwidth per chip, 95 GB HBM capacity, ~918 TOPS INT8 (459 TFLOPS BF16).

At B=1, Llama 70B at FP8 (70 GB) fits within a single chip’s 95 GB HBM:

Latency: 70 GB / 2.76 TB/s = 25.4 ms/tok = 39 tok/s
Ridge point: 918 TOPS / 2.76 TB/s = ~333 FLOP/byte
Compute utilization: 1.96 / 333 = 0.6% (slightly below the B200 in relative terms, and well below it in absolute throughput)

Power per chip is ~400W (pod-level amortized), yielding approximately 0.10 tok/s/W, roughly half of the B200 figure.

comparison table

Metric	B200 (Llama 70B)	Groq LPU (Llama 8B)	TPU v5p (Llama 70B)
tok/s (B=1)	114	~10,000	39
Compute utilization	0.7%	~12%*	0.6%
Power (W)	~650	~300	~400
tok/s/W	~0.18	~33	~0.10
Model fits?	Yes (192 GB)	No (70B); Yes (8B)	Yes (95 GB)
Ridge point (FLOP/byte)	281	N/A (SRAM)	333
Est. $/tok (on-demand)	~$0.0003	~$0.00003**	~$0.0004

* Groq utilization is higher because the 8B model has lower absolute FLOP count relative to available compute, and SRAM bandwidth removes the memory wall.
** Groq cost estimate is for the 8B model only; the comparison is not apples-to-apples.

key insight

The “best” hardware depends entirely on the workload. B200 wins on flexibility (any model up to 192 GB, any batch size, mature software stack). Groq wins on raw latency for small models that fit in SRAM. TPU v5p wins on cost at scale with large batch sizes in cloud deployments. A benchmark that declares a single winner is lying by omission — it has chosen a workload that favors one architecture. InferBench must report across the full workload space.

proposed metrics

Existing benchmarks let vendors cherry-pick the single metric that makes their hardware look best. NVIDIA reports peak TOPS. Groq reports tok/s at B=1 for models that fit in SRAM. Cerebras omits power. MLPerf conflates software maturity with hardware capability. InferBench needs a metric set that is jointly necessary and sufficient to characterize inference hardware honestly.

useful TOPS

Definition: Peak TOPS x measured utilization at a reference workload (e.g., Llama 70B decode at B=1, B=32, B=256).

Peak TOPS is a marketing number. A B200 at 2,250 TOPS FP8 delivers about 15.7 useful TOPS during B=1 decode (0.7% utilization, matching the worked example above). Reporting Useful TOPS forces vendors to acknowledge the gap between theoretical peak and delivered performance. The reference workload must be standardized. InferBench defines three: single-user decode (B=1), throughput decode (B=max fitting in memory), and prefill (S=2048).
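
Useful TOPS can be computed from measured throughput instead of from an assumed utilization, which keeps the metric tied to observation. This sketch uses the B200 B=1 decode inputs from the worked example; the function names are illustrative.

```python
def useful_tops(flops_per_token: float, tok_per_s: float) -> float:
    """Delivered tera-ops per second on the reference workload."""
    return flops_per_token * tok_per_s / 1e12


def measured_utilization(useful: float, peak_tops: float) -> float:
    """Fraction of the vendor's peak TOPS actually delivered."""
    return useful / peak_tops


tok_per_s = 8e12 / 70e9                # bandwidth-bound B=1 decode rate
delivered = useful_tops(137e9, tok_per_s)
print(f"useful TOPS: {delivered:.1f}")
print(f"utilization: {measured_utilization(delivered, 2250):.2%}")
```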

cost efficiency

Definition: Useful TOPS / amortized $/hour (including power, cooling, and 3-year depreciation).

API pricing (Groq’s preferred metric) bakes in margin, demand elasticity, and subsidies. TDP-based cost models (common in analyst reports) undercount real power by 10-30%. InferBench uses wall-measured power x local electricity rate + cooling overhead (PUE) + hardware amortization. For cloud instances, the on-demand hourly rate is an acceptable proxy, but must be reported alongside the component breakdown.
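
The cost denominator can be sketched as hardware amortization plus energy. Every input in the example is hypothetical (a $30,000 accelerator, 0.65 kW at the wall, PUE 1.3, $0.10/kWh), and the model assumes the device is busy for every amortized hour, which flatters the result.

```python
HOURS_PER_YEAR = 8760


def amortized_cost_per_hour(capex_usd, wall_kw, pue, usd_per_kwh, years=3):
    """Straight-line hardware depreciation plus PUE-scaled electricity cost."""
    hardware = capex_usd / (years * HOURS_PER_YEAR)
    energy = wall_kw * pue * usd_per_kwh
    return hardware + energy


def cost_efficiency(useful_tops, cost_per_hour):
    """Useful TOPS delivered per dollar-hour."""
    return useful_tops / cost_per_hour


# Hypothetical inputs, for illustration only.
cost = amortized_cost_per_hour(capex_usd=30_000, wall_kw=0.65, pue=1.3, usd_per_kwh=0.10)
print(f"${cost:.2f}/hour, {cost_efficiency(15.7, cost):.1f} useful TOPS per $/hour")
```

With these inputs hardware amortization dominates energy by more than 10x, which suggests the depreciation schedule matters more to this metric than the electricity rate.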

energy efficiency

Definition: Useful TOPS / Watt, measured at the wall (not TDP).

TDP is a thermal design constraint, not a power measurement. A100 SXM draws 250-400W depending on workload; its TDP is 400W. Dividing by TDP inflates the denominator for memory-bound workloads, where actual draw is lower, so the hardware appears less efficient on those workloads than it really is; only in compute-bound phases does draw approach TDP. Wall measurement captures the actual energy cost, including voltage regulation losses and memory power.

flexibility score

Definition: Fraction of the workload-configuration space where measured utilization exceeds 50%.

The workload space is defined by model size (1B to 200B), batch size (1 to 512), sequence length (128 to 32K), and precision (FP16, FP8, INT4). A hardware platform’s flexibility score is the fraction of this grid where it achieves over 50% of its own peak Useful TOPS. GPUs score high (~0.6-0.7) because they handle diverse shapes. Groq LPU scores low (~0.05-0.1) because it requires models to fit in SRAM. This metric penalizes fixed-function designs that excel on a narrow slice and fail everywhere else.
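
The flexibility score is a grid sweep over the four workload axes. In this sketch the utilization function is supplied by the caller; `toy_utilization` is a hypothetical stand-in that depends only on batch size, and the grid points are a small sample of the ranges named above.

```python
from itertools import product


def flexibility_score(utilization, model_sizes, batch_sizes, seq_lens, precisions,
                      threshold=0.5):
    """Fraction of grid points where utilization exceeds the threshold."""
    grid = list(product(model_sizes, batch_sizes, seq_lens, precisions))
    hits = sum(1 for point in grid if utilization(*point) > threshold)
    return hits / len(grid)


def toy_utilization(model_b, batch, seq_len, precision):
    # Hypothetical stand-in: utilization grows with batch size only.
    return min(1.0, batch / 256)


score = flexibility_score(
    toy_utilization,
    model_sizes=(1, 8, 70, 200),          # billions of parameters
    batch_sizes=(1, 64, 256, 512),
    seq_lens=(128, 2048, 32768),
    precisions=("FP16", "FP8", "INT4"),
)
print(f"flexibility score: {score:.2f}")
```

How the grid is sampled (linear versus logarithmic spacing, and which points are included) changes the score, so the grid itself would need to be fixed as part of the benchmark definition.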

why all four are necessary

Any three of the four can be gamed. High Useful TOPS at terrible cost efficiency (overprovisioned hardware). Good cost efficiency at low flexibility (one workload only). Strong energy efficiency on a narrow workload (fixed-function ASIC). A vendor that scores well on all four has genuinely good hardware — or at minimum, honestly good hardware for its target niche, reported transparently.

validation strategy

Compare InferBench simulator (a100_sxm.yaml + workload graphs) against Alan’s energy-study A100 measurements. Target: <15% MAPE for latency, <20% MAPE for energy.

Calibration knobs: utilization table entries, power model coefficients.
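
The acceptance check reduces to a MAPE computation against the two targets above. The sample values in the usage lines are hypothetical; only the 15% and 20% thresholds come from the text.

```python
def mape(predicted, measured):
    """Mean absolute percentage error, in percent."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


def within_targets(latency_mape, energy_mape, latency_max=15.0, energy_max=20.0):
    """True when both errors are below the validation targets."""
    return latency_mape < latency_max and energy_mape < energy_max


# Hypothetical simulator output versus measurement.
lat = mape([110.0, 95.0], [100.0, 100.0])
energy = mape([6.0, 4.5], [5.0, 5.0])
print(f"latency MAPE {lat:.1f}%, energy MAPE {energy:.1f}%, pass={within_targets(lat, energy)}")
```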

calibration procedure

The power model P = P_idle + alpha x GFLOPS + beta x GB/s has three free parameters. Published A100 SXM measurements provide calibration data across a range of workloads:

Idle: P_idle measured at ~60W (fans + memory refresh + base logic).
Memory-bound (B=1 GEMV): Power scales with HBM bandwidth utilization. At 2 TB/s sustained, draw is ~250W, giving beta ~= 50 mW/(GB/s) after subtracting idle.
Compute-bound (large GEMM, B=4096): Power approaches TDP (~400W). At ~250 TFLOPS sustained, alpha ~= 0.2 mW/GFLOP after subtracting idle and bandwidth component.

These coefficients transfer to Blackwell Architecture B200 as a starting point, with adjustment for the higher HBM3e power envelope and Blackwell’s more aggressive power management. Initial B200 estimates: P_idle ~80W, alpha ~0.15 mW/GFLOP (improved perf/W from architecture), beta ~45 mW/(GB/s) (HBM3e is assumed to be slightly more power-efficient per GB/s than the A100’s HBM2e).

Validation against the B200 numbers serves as the GPU baseline for InferBench. The worked example above (114 tok/s at B=1) should match simulator output within the 15% latency MAPE target. Energy prediction for that workload (650W x 8.75ms = ~5.7 J/tok) should fall within the 20% energy MAPE target.

For non-GPU architectures (TPU, LPU, CGRA), validation requires either published measurements or partner-provided data. The TPU v5p B=1 estimate can be cross-checked against publicly reported Cloud TPU latency numbers. Groq’s claimed Llama 8B latency can be verified against their public API response times.

publication strategy

Blog: “Why Every Inference ASIC Benchmark Is Wrong” — SemiAnalysis style
Email Dylan Patel: “You proposed the right metrics at OCP. We built the tool.”
Twitter thread with comparison table and roofline plots
Workshop paper: MLSys 2027 or ISCA ML+Arch workshop
Make competing chip companies contribute their own architecture specs — creates self-sustaining news cycle

interesting reads

MLPerf Inference — the closest existing benchmark; InferBench addresses its gaps
Samuel Williams et al., “Roofline Model” (Berkeley) — the analytical framework underlying all utilization claims
Groq LPU architecture — the deterministic-dataflow alternative to GPU inference
vLLM (UC Berkeley) — the serving framework whose PagedAttention changed KV cache management
Anyscale “LLM Inference Performance Engineering” — practical serving optimization guide

See also: Inference Stack Synthesis, Blackwell Architecture, SpectralQuant (compression shifts the roofline for decode), TrtLLMGen (MoE kernel benchmarking), Systolic Arrays, AI Hardware Landscape
