InferBench: Inference ASIC Benchmark Suite — Deep Research
why every existing benchmark is wrong
| Benchmark | Problem |
|---|---|
| MLPerf | Favors GPUs (software maturity dominates), no cost/power normalization, burdensome submission |
| NVIDIA benchmarks | Vendor-controlled, TDP-based power (inflates efficiency), non-reproducible TensorRT configs |
| Cerebras | No power reporting (WSE-3 draws ~15kW), cost comparison nonsensical ($2M+ vs $30K H100) |
| Groq | tokens/sec/dollar uses API pricing (includes margin), LPU is fixed-function |
| SemiAnalysis InferenceMAX | Proposal only, no tool, no simulator, relies on vendor numbers |
The gap: No benchmark separates algorithmic efficiency from hardware capability from software maturity.
workload characterization (Real numbers)
LLM prefill: Llama 3 70B, S=2048
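Per token per layer (FLOPs): Q proj 134M + K/V proj 33.6M + O proj 134M + FFN (gate+up+down) 1.41B = 1.71G, plus attention at 32,768 x S
Across 80 layers: 137 GFLOP for the projections and FFN, plus ~5.4 GFLOP of attention at S=2048, for ~142 GFLOP per token
Full prefill S=2048: 2048 x 142 GFLOP = 291 TFLOP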
- Weights: 140 GB (loaded once if weight-stationary)
- Arithmetic intensity: ~1,300 FLOP/byte — compute-bound
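The per-layer tally above can be reproduced with a short script. This is a minimal sketch: the shape constants are the published Llama 3 70B configuration (hidden size 8192, 80 layers, FFN width 28672, 8 KV heads of dimension 128), the function names are illustrative, and every token is charged the full-sequence attention cost, which slightly overestimates causal prefill.

```python
# Llama 3 70B shape constants (assumed from the public model config).
D_MODEL, N_LAYERS, D_FFN = 8192, 80, 28672
N_KV_HEADS, HEAD_DIM = 8, 128


def flops_per_token(seq_len: int) -> int:
    """Forward FLOPs for one token attending over seq_len positions (1 MAC = 2 FLOPs)."""
    q_proj = 2 * D_MODEL * D_MODEL
    kv_proj = 2 * (2 * D_MODEL * N_KV_HEADS * HEAD_DIM)  # K and V under GQA
    o_proj = 2 * D_MODEL * D_MODEL
    ffn = 3 * (2 * D_MODEL * D_FFN)                       # gate, up, down
    attention = 4 * D_MODEL * seq_len                     # QK^T plus AV = 32,768 x S
    return N_LAYERS * (q_proj + kv_proj + o_proj + ffn + attention)


def prefill_flops(seq_len: int) -> int:
    """Upper bound: every token pays attention over the full sequence."""
    return seq_len * flops_per_token(seq_len)


if __name__ == "__main__":
    print(f"per token, no attention: {flops_per_token(0) / 1e9:.1f} GFLOP")
    print(f"per token at S=2048:     {flops_per_token(2048) / 1e9:.1f} GFLOP")
    print(f"prefill at S=2048:       {prefill_flops(2048) / 1e12:.1f} TFLOP")
```

The LM head and normalization layers are left out, as they are in the tally above.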
LLM decode: Llama 3 70B
| Batch Size | FLOPs/step | AI (FLOP/byte) | Bound on A100 |
|---|---|---|---|
| 1 | 137 GFLOP | 0.97 | Memory BW |
| 8 | 1.1 TFLOP | 7.8 | Memory BW |
| 32 | 4.4 TFLOP | 31 | Memory BW |
| 128 | 17.5 TFLOP | 124 | Memory BW |
| 256 | 35 TFLOP | 246 | Balanced |
At B=1: A100 theoretical spatial utilization = 0.6%. Loads 140 GB weights, performs 137 GFLOPs. Batch size is everything.
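KV cache per sequence at S=8192 (FP16, 8 KV heads): 2 x 80 layers x 8 heads x 128 dims x 2 bytes = 320 KiB per token, or 2.68 GB (2.5 GiB) per sequence. Batch 256 at S=8192: 687 GB, far beyond a single H100's 80 GB.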
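A minimal sketch of the two quantities driving the decode table: weights-only arithmetic intensity as a function of batch size, and KV cache footprint. It assumes FP16 weights and an FP16 KV cache with the Llama 3 70B GQA layout (80 layers, 8 KV heads, head dimension 128); the function names are illustrative. Because it ignores KV and activation traffic, the intensities come out slightly above the table values at large batch.

```python
def decode_intensity(batch: int, flops_per_token: float = 137e9,
                     weight_bytes: float = 140e9) -> float:
    """FLOP per byte when the weights are read once per decode step."""
    return batch * flops_per_token / weight_bytes


def kv_cache_bytes(seq_len: int, batch: int = 1, n_layers: int = 80,
                   n_kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """K and V tensors for every layer, token, and sequence in the batch."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch


if __name__ == "__main__":
    for b in (1, 8, 32, 128, 256):
        print(f"B={b:<4d} AI = {decode_intensity(b):7.2f} FLOP/byte")
    print(f"KV per sequence, S=8192: {kv_cache_bytes(8192) / 1e9:.2f} GB")
    print(f"KV at B=256, S=8192:     {kv_cache_bytes(8192, 256) / 1e9:.0f} GB")
```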
diffusion: SD3 Medium (2B params)
- 24 MMDiT blocks, 4429 tokens (4096 image + 333 text)
- Per denoising step: 8.9 TFLOPs
- 50 steps: 445 TFLOPs
- Flux.1 (12B): 28 steps = 1,456 TFLOPs
- Weights reused 28-50x — more compute-bound than LLM decode
MoE: Mixtral 8x7B
- 32.2B FLOPs per token (only 12.9B of the 46.7B parameters are active per token under top-2 routing)
- At B=1 per expert: GEMV territory, AI ~0.34 FLOP/byte
- Even more batch-sensitive than dense models
vision — ViT-L (307M)
- 122.7 GFLOPs per image (197 tokens)
- AI ~143 FLOP/byte — balanced at A100 ridge point
- Real-time at 30fps easily achievable on A100 (0.39ms)
architecture models
systolic array (TPU-like)
peak_ops = n_arrays x N^2 x 2 x f_clk
utilization(M) = min(1, M/N) -- M=1 gives 0.78% util on 128x128 array
Weight-stationary: load weights once, stream activations. 90%+ util for M >= 256.
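The two systolic-array formulas translate directly into code. A minimal sketch follows; the function names and the example configuration (4 arrays of 128x128 at 1 GHz) are hypothetical and not taken from any specific TPU generation. Note that the simple min(1, M/N) model saturates at M = N and does not capture pipeline fill and drain overhead.

```python
def systolic_peak_ops(n_arrays: int, n: int, f_clk_hz: float) -> float:
    """peak_ops = n_arrays x N^2 x 2 x f_clk (each PE does one MAC = 2 ops per cycle)."""
    return n_arrays * n * n * 2 * f_clk_hz


def systolic_utilization(m: int, n: int) -> float:
    """Fraction of the N x N array kept busy by an M-row activation tile."""
    return min(1.0, m / n)


if __name__ == "__main__":
    peak = systolic_peak_ops(n_arrays=4, n=128, f_clk_hz=1e9)  # hypothetical chip
    for m in (1, 32, 128, 256):
        util = systolic_utilization(m, 128)
        print(f"M={m:<4d} util={util:7.2%}  delivered={peak * util / 1e12:7.2f} TOPS")
```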
SIMT (GPU-like)
Roofline model:
if AI > peak_ops/hbm_bw: t = FLOPs / (peak x util) -- compute bound
else: t = bytes / hbm_bw -- memory bound
Utilization lookup (empirical from cuBLAS): M<32: 15%, M=128-512: 70%, M>4096: 85%
Power: P = P_idle + alpha x GFLOPS + beta x GB/s (A100: 60W + 0.2 mW/GFLOP + 50 mW/(GB/s))
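A minimal sketch of the roofline timing rule and the linear power model as written above. The A100 peak figures used in the usage lines (312 TFLOPS dense FP16, 2,039 GB/s HBM) are assumptions taken from NVIDIA's public spec sheet rather than from this document, and the function names are illustrative.

```python
def roofline_time_s(flops: float, bytes_moved: float, peak_flops: float,
                    hbm_bw: float, util: float = 1.0) -> float:
    """Execution-time estimate: compute-bound above the ridge point, memory-bound below."""
    intensity = flops / bytes_moved
    if intensity > peak_flops / hbm_bw:
        return flops / (peak_flops * util)   # compute bound
    return bytes_moved / hbm_bw              # memory bound


def power_w(gflops: float, gb_per_s: float, p_idle: float = 60.0,
            alpha_mw: float = 0.2, beta_mw: float = 50.0) -> float:
    """P = P_idle + alpha x GFLOPS + beta x GB/s, coefficients in mW per unit."""
    return p_idle + (alpha_mw * gflops + beta_mw * gb_per_s) / 1000.0


if __name__ == "__main__":
    A100_PEAK, A100_BW = 312e12, 2.039e12    # assumed spec-sheet values
    t = roofline_time_s(137e9, 140e9, A100_PEAK, A100_BW)
    print(f"B=1 decode step: {t * 1e3:.1f} ms")
    print(f"modelled power at 2 TB/s, no compute: {power_w(0, 2000):.0f} W")
```

The utilization lookup would feed the `util` argument; it is left as a plain parameter here because the table above does not cover every M range.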
In-Memory compute (d-Matrix-like)
ADC is the bottleneck. 256 ADCs at 1 GHz = 131 TOPS per tile.
Precision-throughput tradeoff:
n_passes = ceil(input_bits/dac_bits) x ceil(effective_bits/adc_bits)
10-50x energy efficiency over digital for INT4 weights. FP16 multi-pass negates benefit.
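A minimal sketch of the pass-count formula and its effect on tile throughput. The 256-row column depth is an assumption chosen because it reproduces the 131 TOPS figure (256 ADCs x 1 GHz x 256 rows x 2 ops); the bit widths in the usage lines are hypothetical examples, not d-Matrix specifications.

```python
import math


def n_passes(input_bits: int, dac_bits: int, effective_bits: int, adc_bits: int) -> int:
    """n_passes = ceil(input_bits/dac_bits) x ceil(effective_bits/adc_bits)."""
    return math.ceil(input_bits / dac_bits) * math.ceil(effective_bits / adc_bits)


def tile_tops(passes: int = 1, n_adcs: int = 256, f_hz: float = 1e9,
              rows_per_column: int = 256) -> float:
    """Single-pass tile throughput in TOPS, divided by the number of passes."""
    single_pass = n_adcs * f_hz * rows_per_column * 2 / 1e12
    return single_pass / passes


if __name__ == "__main__":
    low_precision = n_passes(input_bits=4, dac_bits=4, effective_bits=8, adc_bits=8)
    fp16_like = n_passes(input_bits=16, dac_bits=4, effective_bits=16, adc_bits=8)
    print(f"4-bit inputs:  {low_precision} pass(es), {tile_tops(low_precision):.1f} TOPS")
    print(f"16-bit inputs: {fp16_like} pass(es), {tile_tops(fp16_like):.1f} TOPS")
```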
dataflow (Cerebras-like)
When weights fit on-chip (44 GB SRAM on WSE-3): zero weight loading, decode becomes compute-bound. Llama 3 70B at FP16 (140 GB): does NOT fit. Llama 3 8B (16 GB): FITS. Power: ~15 kW for inference is extreme.
reconfigurable (CGRA/Fractile-like)
Operator fusion advantage: GEMM+bias+ReLU+LayerNorm fused, intermediates stay on-chip. Reconfiguration overhead: ~10us per config, 80 layers x 3 groups = 2.4ms — significant for decode.
worked example: B200 vs Groq LPU vs TPU v5p on Llama 70B decode
The architecture models above are abstract. To show why they matter, we run a concrete comparison: single-user (B=1) autoregressive decode of Llama 3 70B across three fundamentally different hardware architectures. Every number below follows directly from the roofline framework.
NVIDIA B200 (SIMT / GPU)
Specs from Blackwell Architecture: 8 TB/s HBM3e bandwidth, 192 GB HBM capacity, 2,250 TOPS FP8, 1,000W TDP.
At B=1, decode is a sequence of GEMVs. The model is 70B parameters = 70 GB at FP8. Every token requires loading the full weight tensor once.
- Latency: 70 GB / 8 TB/s = 8.75 ms/tok = 114 tok/s (theoretical peak — actual throughput is lower due to kernel launch overhead and scheduler latency)
- Arithmetic intensity: 137 GFLOP / 70 GB = 1.96 FLOP/byte
- Ridge point: 2,250 TOPS / 8 TB/s = 281 FLOP/byte (assumes simultaneous peak compute and peak bandwidth, which is not achievable in practice)
- Compute utilization: 1.96 / 281 = 0.7%
The B200 delivers 114 tok/s, which sounds fast, but 99.3% of its compute capability sits idle. The 2,250 TOPS of FP8 capability is irrelevant: we are firmly on the memory-bandwidth slope of the roofline. Power draw during memory-bound GEMV workloads is roughly 600-700W (below TDP but still substantial), yielding approximately 0.16-0.19 tok/s/W.
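The B=1 arithmetic in this section is the same for any chip whose weights stream from a single memory pool, so it can be captured in one helper. A minimal sketch, with an illustrative `Chip` record and the B200 figures quoted above (650W is the midpoint of the stated 600-700W range):

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    mem_bw: float     # bytes/s
    peak_ops: float   # ops/s at the weight precision
    power_w: float    # assumed draw during memory-bound decode


def decode_b1(chip: Chip, weight_bytes: float, flops_per_token: float) -> dict:
    """Bandwidth-bound single-stream decode: every token reads all weights once."""
    latency_s = weight_bytes / chip.mem_bw
    intensity = flops_per_token / weight_bytes
    ridge = chip.peak_ops / chip.mem_bw
    return {
        "ms_per_tok": latency_s * 1e3,
        "tok_per_s": 1.0 / latency_s,
        "intensity": intensity,
        "ridge": ridge,
        "utilization": intensity / ridge,
        "tok_per_s_per_w": 1.0 / latency_s / chip.power_w,
    }


if __name__ == "__main__":
    b200 = Chip("B200", mem_bw=8e12, peak_ops=2250e12, power_w=650)
    for key, value in decode_b1(b200, weight_bytes=70e9, flops_per_token=137e9).items():
        print(f"{key:16s} {value:.4g}")
```

These are theoretical bounds; kernel launch overhead and scheduler latency push real throughput lower.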
Groq LPU (Dataflow / SRAM-only)
Note: this compares single-chip Groq against single-chip B200; Groq deploys large models across multiple chips, which changes the comparison.
Groq’s Language Processing Unit uses ~230 MB of on-chip SRAM with an estimated ~80 TB/s internal bandwidth (no HBM). The architecture eliminates the memory wall entirely — but only when the model fits.
Llama 3 70B at FP16 (140 GB): Does not fit. Period. Even at FP8 (70 GB), the model is 300x larger than available SRAM. Groq would need to shard across hundreds of chips with inter-chip bandwidth becoming the new bottleneck, or simply cannot serve this model competitively.
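Llama 3 8B at FP8 (8 GB): still ~35x larger than one chip’s 230 MB, so it runs entirely from SRAM only when sharded across roughly 35 or more chips. Once it does, the arithmetic changes dramatically:
- Latency: 8 GB / 80 TB/s = 0.1 ms/tok = ~10,000 tok/s (a lower bound that treats the shards as streamed back to back at the per-chip 80 TB/s and ignores chip-to-chip hops)
- This is ~88x faster than B200 on Llama 70B, but it is a different model.
The lesson: Groq’s architecture delivers extraordinary throughput when the model fits in aggregate SRAM, and nothing when it does not. There is no graceful degradation; the performance cliff is binary.
Power per LPU chip is estimated at ~300W. Dividing 10,000 tok/s by one chip’s power gives ~33 tok/s/W, but the ~35 chips needed to hold the 8B model draw roughly 10.5 kW, so a single B=1 stream delivers closer to 1 tok/s/W. The $/tok comparison is also misleading because the LPU cannot serve the same model.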
Google TPU v5p (Systolic array)
Specs: 2.76 TB/s HBM bandwidth per chip, 95 GB HBM capacity, ~459 TFLOPS BF16 (~918 TOPS at INT8).
At B=1, Llama 70B at FP8 (70 GB) fits within a single chip’s 95 GB HBM:
- Latency: 70 GB / 2.76 TB/s = 25.4 ms/tok = ~39 tok/s
- Ridge point: 918 TOPS / 2.76 TB/s = ~333 FLOP/byte
- Compute utilization: 1.96 / 333 = 0.6% (slightly below the B200 in relative terms, and well below it in absolute throughput)
Power per chip is ~400W (pod-level amortized), yielding approximately 0.10 tok/s/W, a little over half the B200 figure.
comparison table
| Metric | B200 (Llama 70B) | Groq LPU (Llama 8B) | TPU v5p (Llama 70B) |
|---|---|---|---|
| tok/s (B=1) | 114 | ~10,000 | ~39 |
| Compute utilization | 0.7% | ~12%* | 0.6% |
| Power (W) | ~650 | ~300 per chip (~35 chips) | ~400 |
| tok/s/W | ~0.18 | ~33 per chip; ~1 across ~35 chips | ~0.10 |
| Model fits? | Yes (192 GB) | No (70B); 8B only when sharded across ~35 chips | Yes (95 GB) |
| Ridge point (FLOP/byte) | 281 | N/A (SRAM) | 333 |
| Est. $/tok (on-demand) | ~$0.0003 | ~$0.00003** | ~$0.0004 |
* Groq utilization is higher because the 8B model has lower absolute FLOP count relative to available compute, and SRAM bandwidth removes the memory wall.
** Groq cost estimate is for the 8B model only; the comparison is not apples-to-apples.
key insight
The “best” hardware depends entirely on the workload. B200 wins on flexibility (any model up to 192 GB, any batch size, mature software stack). Groq wins on raw latency for small models that fit in SRAM. TPU v5p wins on cost at scale with large batch sizes in cloud deployments. A benchmark that declares a single winner is lying by omission — it has chosen a workload that favors one architecture. InferBench must report across the full workload space.
proposed metrics
Existing benchmarks let vendors cherry-pick the single metric that makes their hardware look best. NVIDIA reports peak TOPS. Groq reports tok/s at B=1 for models that fit in SRAM. Cerebras omits power. MLPerf conflates software maturity with hardware capability. InferBench needs a metric set that is jointly necessary and sufficient to characterize inference hardware honestly.
useful TOPS
Definition: Peak TOPS x measured utilization at a reference workload (e.g., Llama 70B decode at B=1, B=32, B=256).
Peak TOPS is a marketing number. A B200 at 2,250 TOPS FP8 delivers ~15.7 useful TOPS during B=1 decode (0.7% utilization, i.e. 137 GFLOP x 114 tok/s). Reporting Useful TOPS forces vendors to acknowledge the gap between theoretical peak and delivered performance. The reference workload must be standardized, and InferBench defines three: single-user decode (B=1), throughput decode (B=max fitting in memory), and prefill (S=2048).
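A minimal sketch of the Useful TOPS calculation: delivered operations per second on the reference workload, reported alongside the implied utilization. The function name is illustrative; the inputs in the usage lines are the B200 B=1 decode figures from the worked example (137 GFLOP per token, 8.75 ms per token, 2,250 TOPS peak).

```python
def useful_tops(peak_tops: float, flops_per_step: float, step_time_s: float):
    """Return (useful TOPS, utilization) for one measured reference workload."""
    achieved = flops_per_step / step_time_s / 1e12
    return achieved, achieved / peak_tops


if __name__ == "__main__":
    tops, util = useful_tops(peak_tops=2250, flops_per_step=137e9, step_time_s=8.75e-3)
    print(f"Useful TOPS: {tops:.1f} ({util:.2%} of peak)")
```

Defining the metric from measured step time rather than from a utilization table keeps it independent of any one vendor's roofline assumptions.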
cost efficiency
Definition: Useful TOPS / amortized $/hour (including power, cooling, and 3-year depreciation).
API pricing (Groq’s preferred metric) bakes in margin, demand elasticity, and subsidies. TDP-based cost models (common in analyst reports) undercount real power by 10-30%. InferBench uses wall-measured power x local electricity rate + cooling overhead (PUE) + hardware amortization. For cloud instances, the on-demand hourly rate is an acceptable proxy, but must be reported alongside the component breakdown.
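A minimal sketch of the component breakdown described above: hardware amortization plus wall power scaled by PUE and the local electricity rate. All input values in the usage lines (capital cost, wall power, electricity rate, PUE) are hypothetical placeholders, and the function names are illustrative.

```python
def hourly_cost(capex_usd: float, wall_power_w: float, usd_per_kwh: float,
                pue: float, amort_years: float = 3.0) -> dict:
    """Amortized $/hour split into hardware and energy (cooling enters via PUE)."""
    hours = amort_years * 365 * 24
    hardware = capex_usd / hours
    energy = wall_power_w / 1000.0 * pue * usd_per_kwh
    return {"hardware": hardware, "energy": energy, "total": hardware + energy}


def cost_efficiency(useful_tops: float, cost: dict) -> float:
    """Useful TOPS per amortized dollar-hour."""
    return useful_tops / cost["total"]


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    cost = hourly_cost(capex_usd=30_000, wall_power_w=700, usd_per_kwh=0.10, pue=1.3)
    print({k: round(v, 3) for k, v in cost.items()})
    print(f"cost efficiency: {cost_efficiency(15.7, cost):.1f} useful TOPS per $/h")
```

Returning the breakdown rather than a single number matches the requirement that cloud hourly rates be reported alongside their components.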
energy efficiency
Definition: Useful TOPS / Watt, measured at the wall (not TDP).
TDP is a thermal design constraint, not a power measurement. A100 SXM draws 250-400W depending on workload; its TDP is 400W. Dividing by TDP inflates the denominator for memory-bound workloads, where actual draw is lower, so the hardware appears less efficient than it really is. Wall measurement captures the actual energy cost, including voltage regulation losses and memory power.
flexibility score
Definition: Fraction of the workload-configuration space where measured utilization exceeds 50%.
The workload space is defined by model size (1B to 200B), batch size (1 to 512), sequence length (128 to 32K), and precision (FP16, FP8, INT4). A hardware platform’s flexibility score is the fraction of this grid where it achieves over 50% of its own peak Useful TOPS. GPUs score high (~0.6-0.7) because they handle diverse shapes. Groq LPU scores low (~0.05-0.1) because it requires models to fit in SRAM. This metric penalizes fixed-function designs that excel on a narrow slice and fail everywhere else.
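A minimal sketch of the flexibility score as a grid sweep. The grid ranges follow the text, but the specific sample points are assumptions, as is the `measure` callback, which stands in for whatever returns Useful TOPS at a given configuration. The toy callback exists only to make the example runnable.

```python
from itertools import product

# Illustrative sample points inside the ranges given in the text.
GRID = {
    "model_b": [1, 8, 70, 200],
    "batch": [1, 8, 64, 512],
    "seq_len": [128, 2048, 32768],
    "precision": ["fp16", "fp8", "int4"],
}


def flexibility_score(measure, grid=GRID, threshold: float = 0.5) -> float:
    """Fraction of grid points exceeding threshold x the platform's own best result."""
    points = [dict(zip(grid, combo)) for combo in product(*grid.values())]
    results = [measure(**point) for point in points]
    best = max(results)
    return sum(r > threshold * best for r in results) / len(points)


def toy_measure(model_b, batch, seq_len, precision):
    """Placeholder: Useful TOPS grows with batch size and saturates at 64."""
    return min(batch, 64)


if __name__ == "__main__":
    print(f"flexibility score (toy platform): {flexibility_score(toy_measure):.2f}")
```

A real sweep would also need a rule for configurations that do not fit in memory; scoring them as zero is one option.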
why all four are necessary
Any three of the four can be gamed: high Useful TOPS at terrible cost efficiency (overprovisioned hardware), good cost efficiency at low flexibility (one workload only), or strong energy efficiency on a narrow workload (fixed-function ASIC). A vendor that scores well on all four has genuinely good hardware or, at minimum, honestly good hardware for its target niche, reported transparently.
validation strategy
Compare InferBench simulator (a100_sxm.yaml + workload graphs) against Alan’s energy-study A100 measurements. Target: <15% MAPE for latency, <20% MAPE for energy.
Calibration knobs: utilization table entries, power model coefficients.
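The MAPE targets can be checked with a few lines. A minimal sketch, assuming paired lists of simulator predictions and measurements; the sample values are made up for illustration.

```python
def mape(predicted, measured) -> float:
    """Mean absolute percentage error, in percent, relative to the measurements."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - m) / abs(m) for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)


def meets_targets(latency_mape: float, energy_mape: float) -> bool:
    """InferBench validation targets: <15% latency MAPE, <20% energy MAPE."""
    return latency_mape < 15.0 and energy_mape < 20.0


if __name__ == "__main__":
    # Made-up sample values in ms and joules.
    latency = mape(predicted=[8.8, 70.0, 1.1], measured=[9.5, 66.0, 1.0])
    energy = mape(predicted=[5.7, 40.0], measured=[6.4, 35.0])
    print(f"latency MAPE {latency:.1f}%, energy MAPE {energy:.1f}%, "
          f"pass={meets_targets(latency, energy)}")
```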
calibration procedure
The power model P = P_idle + alpha x GFLOPS + beta x GB/s has three free parameters. Published A100 SXM measurements provide calibration data across a range of workloads:
- Idle: P_idle measured at ~60W (fans + memory refresh + base logic).
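- Memory-bound (B=1 GEMV): Power scales with HBM bandwidth utilization. At 2 TB/s sustained, draw is ~250W; subtracting the 60W idle leaves ~190W, which works out to beta ~= 95 mW/(GB/s). The 50 mW/(GB/s) used in the SIMT power model corresponds to only ~160W at this bandwidth, so one of the two figures needs re-checking against the measurements.
- Compute-bound (large GEMM, B=4096): Power approaches TDP (~400W) at ~250 TFLOPS sustained. alpha ~= 0.2 mW/GFLOP accounts for only ~50W at that rate, leaving ~290W for the idle and bandwidth terms, so alpha should be re-fit together with beta.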
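A minimal sketch of a two-point fit for the power model, solving for beta from the memory-bound point and then for alpha from the compute-bound point. The function name is illustrative. The HBM traffic at the compute-bound point is not given in the text, so the usage lines set it to zero purely as a placeholder; with more measurement points, a least-squares fit would be the natural replacement.

```python
def fit_power_coefficients(p_idle_w: float, mem_bound: tuple, compute_bound: tuple):
    """Two-point fit of P = P_idle + alpha x GFLOPS + beta x GB/s.

    mem_bound:     (gb_per_s, watts), compute assumed negligible
    compute_bound: (gflops, gb_per_s, watts)
    Returns (alpha, beta) in mW/GFLOP and mW/(GB/s).
    """
    bw, watts = mem_bound
    beta = (watts - p_idle_w) / bw                       # W per GB/s
    gflops, bw_c, watts_c = compute_bound
    alpha = (watts_c - p_idle_w - beta * bw_c) / gflops  # W per GFLOP
    return alpha * 1e3, beta * 1e3


if __name__ == "__main__":
    alpha, beta = fit_power_coefficients(
        p_idle_w=60,
        mem_bound=(2000, 250),            # 2 TB/s at ~250W
        compute_bound=(250_000, 0, 400),  # ~250 TFLOPS at ~400W, HBM traffic unknown
    )
    print(f"alpha = {alpha:.2f} mW/GFLOP, beta = {beta:.1f} mW/(GB/s)")
```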
These coefficients transfer to Blackwell Architecture B200 as a starting point, with adjustment for the higher HBM3e power envelope and Blackwell’s more aggressive power management. Initial B200 estimates: P_idle ~80W, alpha ~0.15 mW/GFLOP (improved perf/W from architecture), beta ~45 mW/(GB/s) (HBM3e is slightly more power-efficient per GB/s than the A100’s HBM2e).
Validation against the B200 numbers serves as the GPU baseline for InferBench. The worked example above (114 tok/s at B=1) should match simulator output within the 15% latency MAPE target. Energy prediction for that workload (650W x 8.75ms = ~5.7 J/tok) should fall within the 20% energy MAPE target.
For non-GPU architectures (TPU, LPU, CGRA), validation requires either published measurements or partner-provided data. The TPU v5p B=1 estimate can be cross-checked against publicly reported Cloud TPU latency numbers. Groq’s claimed Llama 8B latency can be verified against their public API response times.
publication strategy
- Blog: “Why Every Inference ASIC Benchmark Is Wrong” — SemiAnalysis style
- Email Dylan Patel: “You proposed the right metrics at OCP. We built the tool.”
- Twitter thread with comparison table and roofline plots
- Workshop paper: MLSys 2027 or ISCA ML+Arch workshop
- Make competing chip companies contribute their own architecture specs — creates self-sustaining news cycle
interesting reads
- MLPerf Inference — the closest existing benchmark; InferBench addresses its gaps
- Samuel Williams et al., “Roofline Model” (Berkeley) — the analytical framework underlying all utilization claims
- Groq LPU architecture — the deterministic-dataflow alternative to GPU inference
- vLLM (UC Berkeley) — the serving framework whose PagedAttention changed KV cache management
- Anyscale “LLM Inference Performance Engineering” — practical serving optimization guide
See also: Inference Stack Synthesis, Blackwell Architecture, SpectralQuant (compression shifts the roofline for decode), TrtLLMGen (MoE kernel benchmarking), Systolic Arrays, AI Hardware Landscape